It took quite a while to write part 2 of this post, for reasons I’ll mention below. But like all good investigations, I’ve ended up somewhere different from where I thought I’d be – after spending weeks looking at the Twitter feeds of different companies in different industries, it seems that the way Twitter is used, and is useful, varies enormously by industry. This makes it difficult to build a generic model for Twitter sentiment analysis (because a model built from, say, small B2C companies isn’t really applicable to large B2B companies) – but it also suggests some interesting ways that companies in different industries could use Twitter to help their businesses.
Oh, and my main conclusion? People sure do hate the airlines! More on that later.
First, a potted history of how I got here:
- I wanted to build some models for Twitter sentiment analysis. What does that mean? It means “Give me a tweet about your company, and I’ll tell you whether it’s negative, neutral or positive”. That allows you to monitor the Twitter feed for your company (or your competitors, of course), and track whether sentiment is getting better or worse. (There’s a minimal sketch of this classification step just after this list.)
- I’ve collected millions of tweets from a number of different industries, B2B, B2C, small companies and large companies.
- In part 1, I hit some problems with mis-classification of tweets. This came (I believe) from the fact that the base models from Stanford NLP are built from a different corpus of text – a generic domain of English sentences and paragraphs rather than, say, tweets about companies. There were also problems with creating a decent model for tweets specifically, as opposed to general blocks of English text (again, more on that below in Appendix 1).
- So the next job was: could I use the millions of tweets for various companies to build some sort of generic predictor model? I.e. give me a company tweet and I’ll tell you whether it’s Good, Bad or Ugly.
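To make the classification step concrete, here’s a minimal sketch of roughly what it looks like with Stanford CoreNLP’s off-the-shelf sentiment annotator (the library discussed throughout this post). The tweet text is invented purely for illustration, and this uses the default English models rather than anything Twitter-specific – which, as we’ll see, is part of the problem.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class TweetSentiment {
    public static void main(String[] args) {
        // Standard CoreNLP pipeline: tokenise, split sentences, POS-tag,
        // parse, then run the sentiment annotator over the parse trees.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Illustrative tweet only – in practice this would come from the collected feed.
        String tweet = "@SomeAirline lost my bag again. Two hours in the queue. Fail.";
        Annotation annotation = pipeline.process(tweet);

        // The sentiment annotator assigns one of: Very negative, Negative,
        // Neutral, Positive, Very positive – per sentence.
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            String sentiment = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
            System.out.println(sentiment + "\t" + sentence);
        }
    }
}
```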
Well, the answer is No. Or at least not within the time that I’ve had so far. And the issue seems to be that the way Twitter is used by companies and by companies’ customers varies significantly by industry.
A first step in creating a predictor model is to manually assign sentiment scores to a long list of tweets – this creates a training set from which you then build the model. During model creation you repeatedly test against an out-of-sample dataset to check that the model actually works (and to avoid things like over-fitting). As a first step, I assigned manual values to around 500 tweets (i.e. I manually tagged each one as very negative, negative, neutral, positive or very positive), then tried to create a model from this. However, my cross-validation scores were terrible – the model struggled to predict the sentiment of unseen tweets. I know this is partly because 500 tweets is nowhere near enough, but still – how could the scores be so bad?
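For what it’s worth, the cross-validation loop itself is nothing exotic – something along these lines, where `LabelledTweet` and `SentimentModel` are hypothetical placeholders for however the manually tagged tweets and the trained model are actually represented:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Hypothetical pairing of a tweet with its manually assigned score (0-4).
record LabelledTweet(String text, int score) {}

// Placeholder for whatever model gets trained on each fold.
interface SentimentModel {
    int predict(String tweet);
}

public class CrossValidation {
    // Train on k-1 folds, score accuracy on the held-out fold, repeat k times;
    // the average is the cross-validation score used to spot over-fitting.
    static double kFoldAccuracy(List<LabelledTweet> data, int k,
                                Function<List<LabelledTweet>, SentimentModel> train) {
        List<LabelledTweet> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(42));
        int foldSize = shuffled.size() / k;
        double total = 0.0;
        for (int fold = 0; fold < k; fold++) {
            int start = fold * foldSize;
            int end = (fold == k - 1) ? shuffled.size() : start + foldSize;
            List<LabelledTweet> heldOut = shuffled.subList(start, end);
            List<LabelledTweet> trainSet = new ArrayList<>(shuffled.subList(0, start));
            trainSet.addAll(shuffled.subList(end, shuffled.size()));

            SentimentModel model = train.apply(trainSet);
            long correct = heldOut.stream()
                    .filter(t -> model.predict(t.text()) == t.score())
                    .count();
            total += (double) correct / heldOut.size();
        }
        return total / k;
    }
}
```

With only ~500 labelled tweets, each held-out fold is tiny, which is one reason the scores bounce around – but the domain problem described next turned out to matter more.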
What was the problem? As with most machine learning issues, I’ve found that eyeballing the data tells you a lot. I believe there were two problems with my model:
- The problem already mentioned: the Stanford NLP parser struggles with tweets. And it’s not a trivial case of just swapping in a new POS tagger for tweets. Let’s put that to one side for now.
- The bigger problem: looking at the manual tags I was assigning to tweets, the distribution of values varied enormously by domain.
The latter creates the problem of domain-specificity. Is a model created from, say, tweets about the airline industry, relevant to tweets about online cloud storage? It seems not.
The first clue is the distribution of scores I was giving. Remember, 0=very negative, 1=negative, 2=neutral, 3=positive, 4=very positive. For the airline industry I found the following:
| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 10% | 32% | 41% | 12% | 4% |
I.e. although there were plenty of neutral tweets, a large share were negative. Here’s a very typical example:
@Delta just rebooked my 70yr old parents on a longer flight back to cvg from fco and no extra legroom that I had paid for. Fail.
There’s a lot of this sort of thing going on for the airlines.
In contrast, here’s the distribution for a couple of big cloud providers (specifically, AWS and Azure combined):
| 1 | 2 | 3 | 4 |
|---|---|---|---|
| 3% | 83% | 11% | 2% |
An enormous number of utterly neutral tweets. Here’s a standard example:
RT @DataExposed: New #DataExposed show: Data Discovery with Azure Data Catalog. https://t.co/5sWuKpdoYx @ch9 @Azure
…pretty dry stuff.
This actually presents two distinct problems. Firstly, as mentioned, if the nature of Twitter usage and the type of tweets vary so much by industry, then models will only really be effective within their own domains. Fine – if you work in a given industry (e.g. airlines, IT, fashion), you can create a model for that industry and use it.
The second problem however is more difficult to work around. For a given industry, I’ve found that the way in which Twitter is used varies enormously. This is what you’re seeing in these very different distributions. And if the vast majority of tweets are of a given sentiment, then sentiment analysis becomes not only difficult, but actually not particularly useful!
In the airline industry, as far as I can see, Twitter is used almost exclusively for telling airlines how bad they are. There’s a pretty strong correlation between the number of tweets mentioning a given airline and the negative feeling towards that airline. This distribution does vary by airline (see table below), but when you know that most tweets are just complaints, what’s the value in searching for the occasional (positive) needle-in-a-haystack?
If I worked for an airline and wondered “How could we use Twitter to improve our brand?”, the answer would be pretty simple – first, improve the product! And second, employ customer service reps to look after these people and react to the complaints. As I say, some airlines are worse than others:
| Airline | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| American Airlines | 12% | 35% | 37% | 14% | 2% |
| British Airways | 10% | 29% | 46% | 12% | 3% |
| Delta | 10% | 37% | 38% | 12% | 5% |
| Southwest Airlines | 5% | 26% | 47% | 16% | 5% |
| United Airlines | 12% | 43% | 35% | 6% | 4% |
| Virgin | 0% | 13% | 44% | 25% | 19% |
Well done Virgin and Southwest. Delta and United – you have work to do…
The problem in the cloud services industry (AWS and Azure) is the opposite – mentions of these services tend to consist of semi-banal tweets about new services offered, new features and so on. I.e. Twitter is used to share information about the products and services, and rarely to express emotive responses (it’s very rare to read “Can’t believe how amazing @AWS was today!!! #FTW” – it just doesn’t happen). Certainly the split between B2B and B2C tweets shows this difference (I looked at small and large organisations as well, from local shops to fashion houses to small tech companies).
I still think there’s value in implementing a domain-specific model (for example, a model just for small tech companies). The only blocker, as described in Appendix 1, is the problem of parsing tweets properly. Maybe once I’ve figured that out, I’ll find a way to classify the other million-odd tweets I’ve collected for the airline industry as a starting point (there are a lot of unhappy airline passengers out there!).
Appendix 1
The problem of parsing tweets
I was warned by the following tweet from the team at Stanford NLP:
Using CoreNLP on social media? Try GATE Twitter model (iff not parsing…) -pos.model gate-EN-twitter.model https://t.co/K2JAF5XwJ2 #nlproc
— Stanford NLP Group (@stanfordnlp) April 13, 2014
The problem we’re trying to fix, as described in the previous post, is that to understand the sentiment of any sentence we have to carry out a couple of stages first. To begin with, we need to part-of-speech tag the sentence – identify “dog” as a noun, “catch” as a verb and so on. This is challenging with Twitter, which is full of URLs, hashtags, #LOLs and the like. But the GATE Twitter model mentioned above solves this by adding that functionality to the POS tagger. If you run a tweet through the GATE POS tagger it will identify http://some.url/ as a URL and so on.
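Assuming you’ve downloaded the gate-EN-twitter.model file linked from the tweet above, swapping it in is just a matter of pointing CoreNLP’s pos annotator at it. A small sketch – deliberately running only the tagging stages, per the “iff not parsing” caveat in that tweet:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class TwitterTagger {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Only the tagging stages here - no parse, no sentiment.
        props.setProperty("annotators", "tokenize, ssplit, pos");
        // Point the POS annotator at the GATE Twitter model instead of the default English tagger.
        props.setProperty("pos.model", "gate-EN-twitter.model");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = pipeline.process(
                "RT @Azure: new #DataExposed show https://t.co/5sWuKpdoYx");
        // URLs, @mentions and hashtags should now get sensible tags
        // rather than confusing a tagger trained on newswire text.
        for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(token.word() + "\t" + token.tag());
        }
    }
}
```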
So far, so good. However, the problem comes with the next stage – parsing. What’s this? If you look at a sentence such as “I don’t like ice-cream”, the tagger will identify and POS-tag each component of the sentence – I, do, not, like, ice-cream. Great, but to understand the sentiment of the sentence we need to group these elements further. We need to understand that there’s a hierarchy in which do and not are grouped together and applied to like, negating it. I.e. the sentence is actually negative despite the fact that like is a positive word.
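To see why the parse matters so much, here’s a small sketch (using the standard, non-Twitter models) of how CoreNLP’s sentiment annotator exposes this structure: it attaches a predicted class to every node of a binarised parse tree, so – when the parse is right – the subtree covering “do not like” should come out negative even though “like” on its own is positive.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class NegationExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = pipeline.process("I do not like ice-cream.");
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            // The sentiment annotator attaches a binarised tree with a predicted
            // class (0 = very negative .. 4 = very positive) at every node.
            Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
            for (Tree node : tree) {
                if (node.isLeaf()) continue;
                int predicted = RNNCoreAnnotations.getPredictedClass(node);
                System.out.println(predicted + "\t" + node.yield());
            }
        }
    }
}
```

If the parse of a tweet is wrong, that grouping is wrong, and the per-node scores (and hence the overall sentiment) go wrong with it – which is exactly the issue with Twitter text.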
And herein lies the problem with parsing tweets – because the language used on Twitter is often abbreviated and often partial, it’s very hard to parse tweets properly and work out this structure. That seems to be the problem, from the very brief analysis I’ve done. The Stanford team (and others) do essentially say “Sure, you can try the standard parsers on tweets tagged with the GATE POS tagger, but good luck with that!”. It obviously needs more work, and maybe when I have more free time, I’ll have a look!