Twitter-Based Alert System to Combat Large-Scale Vaccine Rollout Challenges
The FDA estimates that adverse drug reactions (ADRs) are the fourth leading cause of death, causing upwards of 106,000 deaths annually. The CDC and FDA run VAERS (Vaccine Adverse Event Reporting System) to support their pharmacovigilance goals. The system faces significant limitations: it is difficult to establish the cause and origin of a reaction, reports are time-consuming to file, and the process is costly and primarily manual.
To overcome these limitations, I built a Twitter-based alert system. The project uses social media data to give healthcare officials sentiment and geolocation information so that they can respond faster to adverse reactions to the COVID-19 vaccine.
Tweets related to COVID-19 are obtained from two sources.
- First, tweet IDs from the COVID-19 dataset on GitHub are hydrated with Python’s ‘twarc’ library to retrieve each tweet’s core attributes: the author, the tweet body, a timestamp, and sometimes geolocation data (if shared by the user).
- Second, tweets are obtained in real-time using ‘Tweepy,’ a Python library for accessing the Twitter streaming API. Tweepy is useful for acquiring a large volume of tweets or creating a live feed.
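As a rough illustration of what "core attributes" can look like once a tweet has been hydrated or streamed, the sketch below flattens one tweet payload into a small record. It assumes the payload follows the Twitter API v1.1 JSON layout (`user.screen_name`, `full_text`, `created_at`, `coordinates`, `place`). The function name `core_attributes`, the output keys, and the sample tweet are hypothetical and are not taken from the project's code.

```python
from datetime import datetime

# Timestamp layout used by v1.1 tweet payloads, e.g. "Mon Mar 01 12:30:00 +0000 2021".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def core_attributes(tweet):
    """Flatten one tweet dict (assumed v1.1 layout) into the fields used downstream."""
    place = tweet.get("place") or {}          # present only if the user tagged a place
    coords = tweet.get("coordinates") or {}   # present only for exact-location tweets
    return {
        "id": tweet["id_str"],
        "author": tweet["user"]["screen_name"],
        # Extended-mode payloads carry "full_text"; older ones carry "text".
        "text": tweet.get("full_text") or tweet.get("text", ""),
        "created_at": datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT),
        "user_location": tweet["user"].get("location") or None,
        "place_name": place.get("full_name"),
        "point": coords.get("coordinates"),   # [longitude, latitude] or None
    }


# Made-up sample payload for illustration only.
sample_tweet = {
    "id_str": "1",
    "created_at": "Mon Mar 01 12:30:00 +0000 2021",
    "full_text": "Got my first dose today, arm is a bit sore.",
    "user": {"screen_name": "example_user", "location": "Austin, TX"},
    "coordinates": None,
    "place": None,
}

row = core_attributes(sample_tweet)
print(row["author"], row["created_at"].isoformat(), row["user_location"])
```

Keeping the geolocation fields optional mirrors the data: most tweets carry only the free-text profile location, which is why a location model is needed at all.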
Three derived features, ‘derived country,’ ‘derived state,’ and ‘derived city,’ are created by parsing a tweet’s user-defined location and place name. Twitter also provides an optional polygonal bounding box of coordinates that encloses the location from which the tweet was sent. For this project, the centroid of each bounding box is computed to narrow down the location of the Twitter user.
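The centroid step can be sketched as follows. This is a minimal example that assumes the bounding box arrives in the Twitter API v1.1 shape, where `coordinates` holds one ring of `[longitude, latitude]` corner points. The function name and the sample box are hypothetical, and boxes that cross the antimeridian are not handled.

```python
def bounding_box_centroid(bounding_box):
    """Return (latitude, longitude) at the centre of a place bounding box."""
    ring = bounding_box["coordinates"][0]   # list of [longitude, latitude] corners
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    # For an axis-aligned box, the centre is the midpoint of the extremes.
    return ((min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2)


# Made-up box for illustration only.
box = {
    "type": "Polygon",
    "coordinates": [[[-74.0, 40.0], [-73.0, 40.0], [-73.0, 41.0], [-74.0, 41.0]]],
}

print(bounding_box_centroid(box))  # (40.5, -73.5)
```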
TF-IDF scores of tweets are fed into two different LinearSVC classifiers, whose outputs feed a RandomForest model that predicts location using information from both SVCs. A Gensim vector space model is used to find similarities between vaccination tweets and adverse drug reaction (ADR) vocabulary. Sentiment analysis is performed on the tweets with the VADER model to identify the connotation of each tweet.
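One way to wire two LinearSVC classifiers into a RandomForest is sketched below with scikit-learn. The text does not say which inputs each SVC receives, so the sketch assumes one SVC sees the tweet body and the other sees the free-text profile location. The toy data, the country labels, and the helper names are made up for illustration. Out-of-fold SVC scores are used to train the forest, which is a common stacking precaution and not necessarily what the project did.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up training data: tweet body, profile location, country label.
texts = [
    "sore arm after my shot at the pharmacy in brooklyn",
    "vaccine line at the stadium in chicago was short",
    "second dose done in texas, mild fever",
    "got my shot in seattle, feeling tired",
    "took covishield in mumbai, slight headache",
    "vaccine drive in delhi was well organised",
    "second dose in bengaluru, arm hurts",
    "got vaccinated in chennai, mild fever",
]
profiles = [
    "Brooklyn, NY", "Chicago, IL", "Austin, TX", "Seattle, WA",
    "Mumbai, India", "New Delhi", "Bengaluru, India", "Chennai, TN",
]
labels = np.array(["US", "US", "US", "US", "IN", "IN", "IN", "IN"])


def make_svc():
    """TF-IDF features followed by a linear SVM."""
    return make_pipeline(TfidfVectorizer(), LinearSVC())


text_svc, profile_svc = make_svc(), make_svc()

# Level 1: out-of-fold SVC scores, so the forest never sees scores
# produced by a model that was fitted on the same rows.
meta_train = np.column_stack([
    cross_val_predict(text_svc, texts, labels, cv=2, method="decision_function"),
    cross_val_predict(profile_svc, profiles, labels, cv=2, method="decision_function"),
])

# Level 2: the forest combines the scores from both SVCs.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(meta_train, labels)

# Refit both SVCs on all rows for use at prediction time.
text_svc.fit(texts, labels)
profile_svc.fit(profiles, labels)


def predict_country(tweet_texts, profile_locations):
    """Predict a country label from tweet bodies and profile locations."""
    meta = np.column_stack([
        text_svc.decision_function(tweet_texts),
        profile_svc.decision_function(profile_locations),
    ])
    return forest.predict(meta)


print(predict_country(["got my vaccine in brooklyn today"], ["New York, NY"]))
```

With eight rows the predictions are not meaningful; the point is the data flow, in which each SVC contributes a score column and the forest learns how much to trust each one.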
Compared with the existing user-defined location, the model-predicted location improved a tweet’s geolocation by 30 to 40%. The project also provides a faster warning system for identifying adverse reactions than the lengthy, manual process in VAERS. The multilevel model achieved an overall accuracy of 91% in flagging tweets related to vaccine rollout challenges.
The model can be used in conjunction with VAERS to give healthcare officials more accurate data in support of their pharmacovigilance goals. The models can be generalized to provide information about other vaccines and their ADRs. The accuracy of the location prediction model can be improved by training on a much larger dataset. The current code setup can be easily expanded to cover more countries and languages. Creating a database of stopwords specific to the slang used on social media would make it easier to remove words that do not aid in classification.