The world in which we live has changed rapidly over the last few decades. Threats of bioterrorism, influenza pandemics, and emerging infectious diseases, coupled with unprecedented population mobility, have led to the development of public health surveillance systems. These systems are useful in detecting and responding to infectious disease outbreaks, but they often operate with considerable delay and fail to provide the lead time needed for an optimal public health response.
In contrast, syndromic surveillance systems rely on clinical features (e.g., activities prompted by the onset of symptoms) that are discernible prior to diagnosis to warn of changes in disease activity. Although less precise, these systems can offer considerable lead time. Patient information may be acquired from multiple existing sources established for other purposes, including emergency department primary complaints, ambulance dispatch data, and over-the-counter medication sales. Unfortunately, these data are often expensive, sometimes difficult to obtain, and almost always hard to integrate.
Fortunately, the proliferation of online social networks makes much more information about our daily habits and lifestyles freely available and easily accessible on the web. Twitter, Facebook and FourSquare are only a few examples of the many websites where people voluntarily post updates on their daily behaviors, health status, and physical location.
In this thesis we develop and apply methods to collect, filter, and analyze the content of social media postings in order to make predictions. As a proof of concept, we use Twitter data to predict public opinion in the form of the outcome of a popular television show. We then use the same methods to monitor and track public perception of influenza during the H1N1 epidemic, and even to predict disease burden in real time, which is a measurable advance over current public health practice. Finally, we use location-specific social media data to model human travel and show how these data can improve our prediction of disease burden.