TwiSent
Twitter sentiment analysis tool using Python, Flask, AWS, and Machine Learning
Overview
TwiSent is a tool I built to analyze the sentiment of Twitter content: a binary classifier that labels Tweets as expressing positive or negative emotion.
There are three main modules:
- Twitter Retriever - Python / Twitter 3rd party API
- Sentiment Analyzer - Python / Scikit-learn / SpaCy / PyTorch
- User Interface - Python / Flask / JavaScript / Bootstrap / CSS
The application is Dockerized for easy portability. The original production environment ran on AWS EC2 / Elastic Beanstalk. To reduce costs for my hobby application, the current production environment runs on a Kubernetes cluster in my closet, complete with an Istio service mesh.
Don’t miss out on the Geo tab for an interactive map-based location search.
GitHub
Source code is available in the TwiSent GitHub repository.
Skills Overview
Before we jump into the design, I like to take a step back to admire what it takes to build an application like this.
Technical Skills:
- Python
- Machine Learning
- Scikit-Learn / PyTorch
- Flask / CSS / JavaScript / Bootstrap 4
- AWS EC2 / Linux / Elastic Beanstalk
- DNS / SSL
- GitHub / Docker
- Twitter API
Soft Skills:
- Intrinsic Motivation
- Creativity
- Strategic Thinking
- Quick Learning
- Risk-Based / Fail Fast Approach
- Tenacity
- Integrity
- Patience
Twitter Retriever
TwiSent uses the official Twitter API via a developer account. The basic (non-premium) developer account returns results from roughly the last 10 days; coverage is not perfectly consistent, but it is excellent for our purposes.
The Python code is organized into a single class that encapsulates access to Twitter. This abstracts away the complexity and subtleties of the API and lets the rest of the TwiSent application make simple, easy method calls to get at the data.
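To make the idea concrete, here is a minimal sketch of what such a wrapper class might look like. The class and method names are assumptions for illustration, not the repo's actual API; the client is injected so the rest of the application (and tests) never touch Twitter directly.

```python
class TwitterRetriever:
    """Thin wrapper that hides the Twitter API behind simple method calls.

    In production the injected client would be an authenticated Twitter API
    client; here it is any object exposing a compatible search method.
    """

    def __init__(self, client):
        self._client = client

    def search(self, query, max_results=50):
        """Return plain Tweet texts for a search query."""
        raw = self._client.search(q=query, count=max_results)
        return [tweet["text"] for tweet in raw]
```

Because the dependency is injected, the rest of TwiSent can be exercised against a fake client in tests, without hitting Twitter's rate limits.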
Sentiment Analyzer
Like any good machine learning module, the Sentiment Analyzer has three main components.
- Feature Preprocessor
- Prediction Model
- Training Process
The feature preprocessor and prediction model are combined in a Scikit-learn Pipeline for a clean, repeatable flow during both training and prediction.
The training process is a one-off Python script that outputs a trained model.
Feature Preprocessor
- SpaCy for Natural Language Processing (NLP)
- Tweets are parsed into Bag of Words (BoW) input vectors
- Steps specific to processing Tweet data:
  - Remove stop words (as defined in SpaCy's English vocabulary)
  - Lemmatize the tokens (reduce words to their base form)
  - Discard Twitter handles, “rt” from re-tweets, and embedded URLs
  - Replace hashtags with just the keyword (i.e. remove the #)
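The Tweet-specific steps above can be sketched in a few lines. This is a self-contained illustration using plain regular expressions and a tiny stand-in stop-word list; the real module uses SpaCy's tokenizer, stop-word vocabulary, and lemmatizer instead.

```python
import re

# Illustrative subset of a stop-word list (the real module uses SpaCy's).
STOP_WORDS = {"the", "a", "an", "is", "and", "to", "of", "in", "on", "for"}

URL_RE = re.compile(r"https?://\S+")   # embedded URLs
HANDLE_RE = re.compile(r"@\w+")        # Twitter @handles
TOKEN_RE = re.compile(r"[a-z']+")

def preprocess(tweet: str) -> list[str]:
    """Turn a raw Tweet into bag-of-words tokens, mirroring the steps above."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)       # discard embedded URLs
    text = HANDLE_RE.sub(" ", text)    # discard @handles
    text = text.replace("#", " ")      # keep the hashtag keyword, drop the '#'
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t != "rt" and t not in STOP_WORDS]
```

For example, `preprocess("RT @user Loving the new #Python release! https://t.co/abc")` yields `['loving', 'new', 'python', 'release']` (minus lemmatization, which SpaCy handles in the real pipeline).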
Prediction Model
The prediction model is a binary classifier, producing a probability that a Tweet expresses positive rather than negative emotional sentiment.
During exploration I built several different models to compare performance: Logistic Regression, SVM, Naive Bayes, Random Forest, XGBoost, and a Deep Averaging Network built in PyTorch.
One big constraint on production TwiSent is the AWS EC2 micro instance it runs in, which limits the types of models that can be used. For example, XGBoost and PyTorch cannot run in such a small footprint, and were therefore only used for exploration in my development and testing environments. As of this writing, production is running a Logistic Regression model that performs reasonably well, but nowhere near the results of the PyTorch Deep Averaging Network.
Training Process
The prediction model is learned by the training process. In a one-off Python script, I create the SpaCy preprocessor, the prediction model, and roll them together in a Scikit-learn Pipeline.
Training data came from Sentiment140, consisting of 1.6 million Tweets labelled 0=negative, 4=positive. It dates back to 2009 which is ancient in Twitter terms. I was satisfied with the performance, especially since the purpose of this project was to build a full-stack application that includes a machine learning component, rather than simply building the most accurate model possible.
The training script uses cross-validation to test a variety of hyperparameters to find the best model and saves the winner using Python’s pickle feature.
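Here is a minimal sketch of that training flow, with a toy stand-in for the Sentiment140 data and scikit-learn's CountVectorizer standing in for the SpaCy preprocessor; the real script's preprocessor, data, and hyperparameter grid differ.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the Sentiment140 data (0=negative, 4=positive).
texts = ["love this", "great day", "so happy", "awesome stuff",
         "hate this", "terrible day", "so sad", "awful stuff"]
labels = [4, 4, 4, 4, 0, 0, 0, 0]

# Preprocessor + model rolled into one repeatable Pipeline.
pipeline = Pipeline([
    ("bow", CountVectorizer()),            # bag-of-words features
    ("clf", LogisticRegression(max_iter=1000)),
])

# Cross-validated search over a (small, illustrative) hyperparameter grid.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(texts, labels)

# Persist the winning pipeline for the application to load at startup.
with open("twisent_model.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```

The saved pipeline then serves predictions directly: `predict_proba` returns `[P(negative), P(positive)]` for each Tweet, which is exactly the probability the Prediction Model section describes.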
User Interface
The user interface is written in Python using Flask, WTForms, Jinja2 Templates, Bootstrap 4 CSS and JavaScript.
It is built to run in a Docker container, making updates a snap.
- Clone the desired commit from GitHub
- Deploy it to Docker
- Profit!
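To show how the Flask layer ties the retriever and model together, here is a minimal sketch of a search view. The helper names `fetch_tweets` and `predict_sentiment` are hypothetical stand-ins for the real modules, and the inline template stands in for the Jinja2 / Bootstrap templates used in TwiSent.

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Hypothetical stand-ins for the real retriever and sentiment model.
def fetch_tweets(query):
    return [f"sample tweet about {query}"]

def predict_sentiment(texts):
    return ["positive" for _ in texts]

# Inline template standing in for the real Jinja2 templates.
PAGE = """<form method="get">
<input name="q" value="{{ q }}"><button>Search</button></form>
{% for text, label in results %}<p>{{ label }}: {{ text }}</p>{% endfor %}"""

@app.route("/")
def search():
    q = request.args.get("q", "")
    tweets = fetch_tweets(q) if q else []
    results = list(zip(tweets, predict_sentiment(tweets)))
    return render_template_string(PAGE, q=q, results=results)
```

The view simply retrieves, classifies, and renders, so swapping in a better model (or the OAuth-based retriever discussed later) touches nothing here.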
The text-based Twitter searches and the pure text search are built entirely by me within TwiSent.
The Geo-location map relies upon the amazing contributions from Leaflet, OpenStreetMap, and Mapbox.
Considerations & Limitations
It is always good to look at your applications from all angles to ensure a robust approach. Here are my thoughts on the limitations of TwiSent:
- Stale training data from 2009. While this does impact performance, the model is entirely black-box from the perspective of the rest of the application: an improved model can be dropped in, plug-and-play, without changing any other part of the system.
- Limited processor power for model tuning. My development PC is a moderately powerful platform, but nowhere near powerful enough to test as many hyperparameter combinations as I would like. In an industrial setting, we’d have access to better hardware, opening the door to a wider search. As above, the rest of the application is insulated from this.
- Access to the Twitter API is done via the Hedge Court developer account. This works fine for low volumes of searching on a software engineer’s portfolio page, but not for larger scale operations. A good solution to this is using Twitter’s 3rd party OAuth feature. In this mode, TwiSent users would authenticate themselves to Twitter, and all searches would be done under their own credentials.
- More sophisticated NLP:
  - NLTK / Gensim for deeper semantic modelling
  - Cached N-Gram preprocessing
  - Subject inference (currently we only predict positive/negative; it would be nice to also predict the subject the sentiment is aimed at)
Conclusion
Congrats! You made it this far. I hope you enjoyed reading about TwiSent, exploring the source on GitHub, and playing with the application for yourself.
Building the application was a fantastic experience, and I can’t wait to build more like it.
Please feel free to share your thoughts with me via email.