The Research Question
The goal of this project is to use natural language processing (NLP) to score the sentiment of posts on the National Geographic Facebook page, and to use this score along with other information about a post to determine whether it will be popular or not, as compared to other posts.
There are a few applications of this project: a Facebook group admin could use this model to choose between several candidate posts, or a competing Facebook group could run the model on a competitor’s content and try to create content that competes with it for popularity.
Data Collection
To obtain the text of each post, as well as the relevant popularity metrics (number of likes, number of comments, number of shares), I used Kevin Zuniga's facebook-scraper library to extract 200 pages' worth of posts from the NatGeo Facebook page.
Data Preprocessing/Normalization
Now for the fun part! To normalize the text, I removed special symbols such as quotation marks and punctuation (except for apostrophes), replaced dashes with single spaces (so that a word such as ‘hard-working’ is not later converted into ‘hardworking’), and then expanded contractions throughout.
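A minimal sketch of these normalization steps might look like the following. The function name `normalize` and the tiny `CONTRACTIONS` map are hypothetical stand-ins (a real contraction list would be much longer), and the exact rules used in the project may differ.

```python
import re

# Hypothetical, deliberately small contraction map (an assumption for this sketch).
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

def normalize(text):
    # Treat the curly apostrophe as a plain one so contractions can be matched.
    text = text.replace("\u2019", "'")
    # Replace hyphens and en/em dashes with a single space ('hard-working' -> 'hard working').
    text = re.sub(r"[-\u2013\u2014]+", " ", text)
    # Drop quotation marks and punctuation, keeping word characters, whitespace, and apostrophes.
    text = re.sub(r"[^\w\s']", "", text)
    words = []
    for word in text.split():
        word = word.strip("'")  # remove apostrophes left over from single quotes
        if word:
            # Expanded contractions come out lowercase; other words keep their case.
            words.append(CONTRACTIONS.get(word.lower(), word))
    return " ".join(words)
```

Called on a string such as `"We can't stop"`, this returns `"We cannot stop"`. Note that the dash replacement runs before punctuation is stripped, which is what keeps 'hard-working' from collapsing into 'hardworking'.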
Although removing stopwords, tokenizing, and lemmatizing or stemming the text are common practices as well, I avoided these particular steps because they can hurt the performance of AFINN. AFINN is a lexicon-based method rather than a trained model, so it needs to find exact copies of words in its sentiment lexicon (e.g. a word like ‘ponies’ might be reduced to the stem ‘poni’, which will not be found in the AFINN sentiment lexicon and therefore will not be scored).
Other preprocessing steps to implement:
Generating n-grams so that places, names, and other multiword proper nouns are understood as one term instead of several.
For example, we want ‘New York’ to be understood as one term instead of the two terms ‘New’ and ‘York’. This is especially important for sentiment scoring because there could be false positives: if the name ‘Derek Love’ is misinterpreted by AFINN as ‘Derek’, ‘Love’, then AFINN will score it as positive sentiment due to the word ‘Love’, when the name should be understood as neutral.
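One simple way to approximate this step, sketched here under assumptions, is to join known multiword names into a single token before scoring. The `PHRASES` list, the underscore convention, and the one-word `TOY_LEXICON` are all hypothetical; this is not AFINN itself, and a real pipeline might detect such phrases automatically instead of listing them.

```python
import re

# Hypothetical list of multiword names to protect (an assumption for this sketch).
PHRASES = ["New York", "Derek Love"]

def merge_phrases(text, phrases=PHRASES):
    # Replace longer phrases first so overlapping names are handled predictably.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, phrase.replace(" ", "_"), text)
    return text

# Toy one-word lexicon standing in for a real sentiment lexicon; the value is made up.
TOY_LEXICON = {"love": 3}

def toy_score(text):
    # Sum the valence of every word that appears verbatim in the toy lexicon.
    return sum(TOY_LEXICON.get(word.lower(), 0) for word in text.split())
```

With this sketch, `toy_score("Derek Love visited")` picks up the word 'Love', while `toy_score(merge_phrases("Derek Love visited"))` sees only the token `Derek_Love`, which is not in the lexicon and so stays neutral.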
Exploratory Data Analysis
After all the posts were normalized, they were scored using the AFINN sentiment lexicon, which assigns a sentiment score (positivity score) to a sentence based on the presence of certain words/phrases (e.g. “I had a good day” would get a sentiment score of 1.0).
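To make the idea of lexicon-based scoring concrete, here is a minimal sketch of how such a scorer can work. The `TOY_LEXICON` and its valences are made up for illustration and are not the real AFINN word list or values; the function name `lexicon_score` is also hypothetical.

```python
# Toy lexicon with made-up valences; NOT the real AFINN word list.
TOY_LEXICON = {"amazing": 4, "terrible": -3}

def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every word that appears verbatim in the lexicon."""
    return float(sum(lexicon.get(word, 0) for word in text.lower().split()))
```

Because the lookup is an exact match, a truncated form such as 'amaz' contributes nothing to the score, which is why aggressive stemming can hurt a lexicon-based scorer.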
I also created an overall score that takes the mention of animals into account by increasing already positive scores and decreasing already negative scores. To do this, I created a dictionary of animal names by scraping a website with an encyclopedic list of common animals.
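The adjustment described above can be sketched as follows. The size of the adjustment is not stated in this write-up, so `ANIMAL_BOOST` is an assumed value, and the small `ANIMALS` set is a hypothetical stand-in for the scraped dictionary.

```python
# Hypothetical stand-in for the scraped dictionary of animal names.
ANIMALS = {"lion", "elephant", "whale"}

# Assumed adjustment size; the value actually used in the project is not given here.
ANIMAL_BOOST = 1.0

def overall_score(text, sentiment):
    # Leave the score alone when no animal is mentioned.
    if not set(text.lower().split()) & ANIMALS:
        return sentiment
    if sentiment > 0:
        return sentiment + ANIMAL_BOOST  # push positive scores further up
    if sentiment < 0:
        return sentiment - ANIMAL_BOOST  # push negative scores further down
    return sentiment  # neutral posts stay neutral
```

Pushing scores away from zero in both directions is what would make the overall scores more spread out than the raw sentiment scores.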
Now we want to determine the relationship between a post’s score and its popularity, using the number of shares as the measure of popularity.
My hypothesis was that more extreme posts (with very high or very low sentiment scores) would be shared most often. We can see that the overall scores are more spread out than the sentiment scores, but are they shared more often?
However, it is worth creating a boxplot and/or violin plot to see how the data behaves over different score ranges. I decided to play around and see what the number of shares looks like for scores < -1.0, -1.0 ≤ scores ≤ 1.0, and scores > 1.0.
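The grouping into three score ranges can be sketched as below. The bin labels and function names are hypothetical, and summarizing each bin by its median is just one reasonable choice for this sketch; the grouped share counts could equally be passed to a plotting library to draw the boxplot.

```python
from statistics import median

def score_bin(score):
    # Thresholds follow the ranges in the text: < -1.0, -1.0 to 1.0 inclusive, > 1.0.
    if score < -1.0:
        return "negative"
    if score > 1.0:
        return "positive"
    return "neutral"

def median_shares_by_bin(posts):
    """posts: iterable of (score, shares) pairs. Returns the median shares per non-empty bin."""
    groups = {"negative": [], "neutral": [], "positive": []}
    for score, shares in posts:
        groups[score_bin(score)].append(shares)
    return {name: median(values) for name, values in groups.items() if values}
```

For example, `median_shares_by_bin([(-3.0, 120), (0.5, 50)])` returns one median for the negative bin and one for the neutral bin (the numbers here are invented, not project data).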
We can see that, using the sentiment score, negative posts get more shares than neutral posts, which, surprisingly, get more shares than positive posts.
However, using my overall scoring metric, the trend changes: both the more negative and the more positive posts generally seem to get more shares.
A violin plot may shed some light on the distribution of these scores too.
A more detailed project description is located at the GitHub repository.