Thursday, December 27, 2012

A statistical approach to predicting my getting sick

During my recent trip to Iceland, I thought to myself how grateful I am that I haven't been sick, and that it's surprising because "it's about time I got sick again."  Days later, I randomly got sick (sore throat, constant 102 temp, coughing); I should be fine in 3 more days, though.  In the meantime, just for fun during the winter break, I decided to see if I can define some formalism for this supposed intuition.

Disclaimer: I know these are rough, makeshift calculations, but given how little data I have, and that the real, useful data is all immeasurable and not at my disposal, this is what you get :-)

I get a sore throat about 5 times each year -- which always leaves me voiceless for about a week -- despite the fact that I:
  • eat healthy
  • work out regularly
  • have good hygiene practices (shower at least once a day, brush teeth & floss twice a day, wash hands frequently, etc)
  • have no known allergies or health issues (have seen several docs at MIT regarding this)
I've been tempted to just go to NYC and lick every bus pole and escalator handrail I can, in an attempt to go ahead and get it over with by subjecting my body to every known bacterium and virus out there.

Obviously, there's a myriad of variables that can cause one to get sick:
  • stress
  • germs being spread from air, physical contact, etc
  • weakened immune system (e.g., from over-training in the gym, stress, etc)
Clearly, these things are easily intertwined and impossible to accurately monitor and model.

However, in late 2007, I wanted to get to the bottom of it and was curious if there were any patterns in my getting sore throats.  So, for the past 5 years, I've quickly logged every time I get sick (a rating of the severity, how many days, and symptoms).

In the following picture, I plot every time I've gotten a sore throat, rounding to the closest week.



Looking at any patterns over time will only show possible correlations, not causes, I know.  Yet, the times I get sick are highly consistent from year to year:
  • I pretty much always get sick on Jan 1
  • I essentially never get sick from April 1 - July 31
  • my sore throat always lasts exactly 6 or 7 days.
  • I get sick roughly 5 times every year from August 1 - March 31.
  • 100% of the times that I fly overseas, I either get sick while there or as soon as I come back -- but it's not always sore throats.  Here, I'm only concerned with sore throats.
I will refer to August 1 - March 31 as the sick period/months, which spans about 35 weeks (243 days).  For all calculations, we only consider the current sick period, and we assume that each sick period is independent of previous years'.

Again, since it's impossible to model the aforementioned variables that are the actual causes of getting sick, I figured why not play around with the time-series data and treat it as an indication of when I'll get sick.  So, here we are making the large assumption that all of the underlying sickness-causing variables are consistent and uniformly distributed during the sick months, and that I get sick on average 5 times per sick period.  I think this is reasonably fair as a high-level approximation.

If we model the number of times I get sick as a Poisson distribution with lambda = 5, we see that the probability of getting sick N times during a sick period is:

 N    probability of getting sick N times
 1    0.033689734995427
 2    0.084224337488568
 3    0.140373895814281
 4    0.175467369767851
 5    0.175467369767851
 6    0.146222808139876
 7    0.104444862957054
 8    0.065278039348159
 9    0.036265577415644
10    0.018132788707822
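
For anyone who wants to reproduce the numbers in the table above, here is a minimal Python sketch.  It assumes lambda = 5 (my average of 5 sore throats per sick period); the names poisson_pmf, LAM and table are just made up for this illustration.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

LAM = 5  # assumed average number of sore throats per sick period

# Same range of N as in the table (1 through 10)
table = {n: poisson_pmf(n, LAM) for n in range(1, 11)}

for n, p in table.items():
    print(f"{n:2d}    {p:.15f}")
```

Note that N = 4 and N = 5 come out equally likely, which is always the case for a Poisson distribution at N = lambda - 1 and N = lambda when lambda is a whole number.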

I like that distribution; it seems to fit my sickness history pretty well... although it places a little too much weight on getting sick more than 5 times.

For each one of these possible outcomes (and including all N up to however many weeks we have left... not just N <= 10), there's a chance of getting sick next week, which is simply N/(# of weeks remaining).... again, assuming uniform distribution of everything, including the times I get sick within the sick period.

Further, if we wish to calculate the overall probability of getting sick next week (i.e., summed over each possible N), we simply combine these two things, and weight and normalize by the Poisson probabilities listed in the above table.



As a real life example, last week:
m = 13 (meaning 13 weeks left until March 31 aka end of 'sick period')
n = 2 (meaning I've been sick twice so far)
lambda = 5 (meaning on average I get sick 5 times each year)

Per the above equation, the probability of getting sick this week (which I did) was roughly 42.10%.

Further, if I did not get sick this week, the probability of getting sick the following week would be 45.43%

And if I did not get sick in either of those 2 weeks, the probability of getting sick the week after that would be 49.14%

Thus, the probability of NOT getting sick during the current week and upcoming 2 weeks was only 16%, which agrees with the irking feeling I got: "sweet; I'm not sick... it's been a while though..."
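
The 16% figure is just the product of the three "stay healthy" probabilities, since each weekly number above already assumes I made it through the previous weeks without getting sick.  A minimal Python sketch of that arithmetic (the variable names are made up for this illustration):

```python
# Weekly probabilities of getting sick, as quoted above:
# this week, next week, and the week after that
p_week = [0.4210, 0.4543, 0.4914]

# Probability of staying healthy through all three weeks:
# multiply the chances of NOT getting sick in each week
p_healthy = 1.0
for p in p_week:
    p_healthy *= 1.0 - p

print(f"{p_healthy:.0%}")
```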

In the meantime, if anyone is down for my NYC-immunity-deplete-boost idea, let me know.  I'll let you go first as the guinea pig.