How to do Sentiment Analysis in Python

Today, a large amount of the data being generated is unstructured and requires processing to generate insights. Some examples of unstructured data are news articles, posts on social media, and search history.

Natural Language Processing (NLP) constitutes the process of analyzing natural language and making sense of it. Sentiment analysis is a common NLP task.

Sentiment analysis involves classifying texts or parts of texts into a pre-defined sentiment. We can use the Natural Language Toolkit (NLTK), a commonly used NLP library in Python, to analyze textual data.

#1 : Installing NLTK and Downloading the Data

As mentioned, we will use the NLTK package in Python in this article.

First, install the NLTK package with the pip package manager:

pip install nltk==3.3

We are using the sample tweets that are part of the NLTK package. First, start a Python interactive session and import the nltk module:

>>> import nltk

Download the sample tweets from the NLTK package:

>>> nltk.download('twitter_samples')

This downloads the tweets and stores them locally.

We will use the negative and positive tweets to train our model on sentiment analysis later in this article. The tweets with no sentiments will be used to test your model.

Now that we’ve imported NLTK and downloaded the sample tweets, exit the interactive session by entering exit(). We are ready to import the tweets and begin processing the data.

#2 : Tokenizing the Data

A language in its original form cannot be accurately processed by a machine, so we need to process the language to make it easier for the machine to understand. For this we use a process called tokenization, i.e., splitting strings into smaller parts called tokens.

A token is a sequence of characters in text that serves as a unit. Based on how you create the tokens, they may consist of words, emoticons, hashtags, links, or even individual characters. A basic way of breaking language into tokens is by splitting the text based on whitespace and punctuation.
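As a rough illustration of that basic approach, the sketch below splits a string on whitespace and punctuation using only Python's re module. The function name simple_tokenize and its pattern are assumptions made for this example; they are not part of NLTK.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Thanks for being top engaged members :)")
print(tokens)
```

Note that this naive split breaks the emoticon :) into two separate tokens, which is one reason a tweet-aware tokenizer is preferable for this kind of data.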

To get started, create a new file named nlp_test.py to hold your script.

In this file, you will first import the twitter_samples so you can work with that data:

from nltk.corpus import twitter_samples

This will import three datasets from NLTK that contain various tweets to train and test the model:

negative_tweets.json: 5000 tweets with negative sentiments
positive_tweets.json: 5000 tweets with positive sentiments
tweets.20150430-223406.json: 20000 tweets with no sentiments

Next, create variables for positive_tweets, negative_tweets, and text:

from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

The strings() method of twitter_samples returns all of the tweets within a dataset as strings.

Before using a tokenizer in NLTK, you need to download an additional resource, punkt. The punkt module is a pre-trained model that helps you tokenize words and sentences.

Run the following commands in a Python interactive session to download the punkt resource:

>>> import nltk
>>> nltk.download('punkt')

Once the download is complete, you are ready to use NLTK’s tokenizers. NLTK provides a default tokenizer for tweets with the .tokenized() method. Add a line to create an object that tokenizes the positive_tweets.json dataset.

from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

If you’d like to test the script to see the .tokenized() method in action, add the final print line shown below to our nlp_test.py script. This prints the tokens of a single tweet from the positive_tweets.json dataset.

from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

print(tweet_tokens[0])

The process of tokenization takes some time because it’s not a simple split on white space. After a few moments of processing, you’ll see the following:

['#FollowFriday',
 '@France_Inte',
 '@PKuchly57',
 '@Milipol_Paris',
 'for',
 'being',
 'top',
 'engaged',
 'members',
 'in',
 'my',
 'community',
 'this',
 'week',
 ':)']

Here, the .tokenized() method keeps special characters such as @ and _ in the tokens. These characters will be removed through regular expressions later.

Now that you have seen the .tokenized() method in action, comment out the line that prints the tokenized tweet:

from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

#print(tweet_tokens[0])

Our script is now configured to tokenize data. In the next step, you will update the script to normalize the data.

#3 : Normalizing the Data

Normalization in NLP is the process of converting a word to its canonical form. Words have different forms: for example, ‘read’, ‘reads’, and ‘reading’ are various forms of the same verb, ‘read’. Depending on the requirement of your analysis, all of these versions may need to be converted to the same form, ‘read’.

Stemming and lemmatization are two popular techniques of normalization.

Lemmatization normalizes a word using vocabulary and morphological analysis of the words in the text. The lemmatization algorithm analyzes the structure of the word and its context to convert it to a normalized form.
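To make the contrast with stemming concrete, here is a deliberately crude suffix-stripping sketch. The function naive_stem is a hypothetical toy written for this example; it is not a real stemming algorithm and not something NLTK provides.

```python
def naive_stem(word):
    """Strip a few common English suffixes; a toy, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ("reading", "reads", "read", "being"):
    print(word, "->", naive_stem(word))
```

Rule-based stripping handles reading and reads, but it leaves being untouched. Mapping being to be requires knowledge of the vocabulary, which is what lemmatization adds.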

We are using lemmatization in this article. First, we need to download the necessary resources. Run the following commands in a Python interactive session to download the resources:

import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

wordnet is a lexical database for the English language that helps the script determine the base word. The averaged_perceptron_tagger resource is used to determine the context of a word in a sentence.

Before running a lemmatizer, we need to determine the context for each word in the text. This can be done by a tagging algorithm, which assesses the relative position of a word in a sentence. In a Python session, import the pos_tag function, and provide a list of tokens as an argument to get the tags.

from nltk.tag import pos_tag
from nltk.corpus import twitter_samples

tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
print(pos_tag(tweet_tokens[0]))

This is the output of the pos_tag function.

[('#FollowFriday', 'JJ'),
 ('@France_Inte', 'NNP'),
 ('@PKuchly57', 'NNP'),
 ('@Milipol_Paris', 'NNP'),
 ('for', 'IN'),
 ('being', 'VBG'),
 ('top', 'JJ'),
 ('engaged', 'VBN'),
 ('members', 'NNS'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('community', 'NN'),
 ('this', 'DT'),
 ('week', 'NN'),
 (':)', 'NN')]

From the list of tags, here is the list of the most common items and their meaning:

NNP: Noun, proper, singular
NN: Noun, common, singular or mass
IN: Preposition or conjunction, subordinating
VBG: Verb, gerund or present participle
VBN: Verb, past participle

To incorporate this into a function that normalizes a sentence, we should first generate the tags for each token in the text, and then lemmatize each word using the tag.

...

from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

print(lemmatize_sentence(tweet_tokens[0]))

This code imports the WordNetLemmatizer class and initializes it to a variable, lemmatizer.

The function lemmatize_sentence first gets the part-of-speech tag of each token of a tweet. Within the if statement, if the tag starts with NN, the token is treated as a noun. Similarly, if the tag starts with VB, the token is treated as a verb. Any other tag falls through to the else branch and is treated as an adjective.
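The tag-mapping logic can be looked at in isolation. The sketch below pulls the same if/elif/else branches into a standalone helper so you can see which tags land in which bucket. The name penn_to_wordnet is an assumption made for this example and does not appear in the script.

```python
def penn_to_wordnet(tag):
    """Map a part-of-speech tag to the single-letter code passed to the lemmatizer."""
    if tag.startswith("NN"):
        return "n"   # noun tags such as NN, NNS, NNP
    if tag.startswith("VB"):
        return "v"   # verb tags such as VBG, VBN
    return "a"       # everything else falls back to adjective, as in lemmatize_sentence

for tag in ("NNP", "NNS", "VBG", "VBN", "JJ", "IN"):
    print(tag, "->", penn_to_wordnet(tag))
```

Prepositions (IN) and other non-adjective tags also fall into the adjective bucket here, which is a simplification carried over from the function in this article.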

Save and close the file, and run the script. Here is the output:

['#FollowFriday',
 '@France_Inte',
 '@PKuchly57',
 '@Milipol_Paris',
 'for',
 'be',
 'top',
 'engage',
 'member',
 'in',
 'my',
 'community',
 'this',
 'week',
 ':)']

You will notice that the verb being changes to its root form, be, and the noun members changes to member. Before you proceed, comment out the last line of the script, which prints the sample tweet.

#4 : Removing Noise from the Data

Noise is any part of the text that does not add meaning or information to data.

In this article, we will use regular expressions in Python to search for and remove these items:

Hyperlinks – All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore, keeping them in the text processing would not add any value to the analysis.
Twitter handles in replies – The Twitter usernames are preceded by a @ symbol, which does not convey any meaning.
Punctuation and special characters – While these often provide context to textual data, this context is often difficult to process. For simplicity, you will remove all punctuation and special characters from tweets.

To remove hyperlinks, you need to first search for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string.

Since we will normalize word forms within the remove_noise() function, you can comment out the lemmatize_sentence() function from the script.

Add the following code to remove noise from the dataset.

...

import re, string

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []
    # Create the lemmatizer once instead of once per token
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(tweet_tokens):
        # Remove hyperlinks (raw strings avoid invalid-escape warnings)
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        # Remove @ mentions
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        # Map the part-of-speech tag to the form the lemmatizer expects
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        # Keep non-empty tokens that are neither punctuation nor stop words
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

This code creates a remove_noise() function that removes noise and incorporates the normalization and lemmatization mentioned in the previous section. The code takes two arguments: the tweet tokens and the tuple of stop words.

The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the code first searches for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string, or ''.

Similarly, to remove @ mentions, the code substitutes the relevant part of text using regular expressions. The code uses the re library to search for @ symbols followed by letters, numbers, or _, and replaces them with an empty string. Finally, you can remove punctuation using the string library.
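To see the two substitutions on their own, the following sketch applies the same patterns to a few tokens taken from a sample tweet in this article. The constant names URL_PATTERN and MENTION_PATTERN are assumptions introduced for readability.

```python
import re

# Same two patterns as in remove_noise(), written as raw strings.
URL_PATTERN = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
               r'(?:%[0-9a-fA-F][0-9a-fA-F]))+')
MENTION_PATTERN = r'(@[A-Za-z0-9_]+)'

sample_tokens = ['rad', '@AbzuGame', '#fanart', 'https://t.co/bI8k8tb9ht']

stripped = []
for token in sample_tokens:
    token = re.sub(URL_PATTERN, '', token)      # hyperlinks become ''
    token = re.sub(MENTION_PATTERN, '', token)  # @ mentions become ''
    stripped.append(token)

print(stripped)
```

The link and the mention are reduced to empty strings rather than removed from the list; it is the len(token) > 0 check in remove_noise() that finally drops them.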

In addition to this, we also remove stop words using a built-in set of stop words in NLTK, which needs to be downloaded separately.

nltk.download('stopwords')

You can use the .words() method to get a list of stop words in English. To test the function, let us run it on our sample tweet.

...
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

print(remove_noise(tweet_tokens[0], stop_words))

We will get an output similar to the following.

['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

Notice that the function removes all @ mentions, stop words, and converts the words to lowercase.
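The final filtering condition in remove_noise() is worth a closer look, because it explains why emoticons survive the cleanup. The sketch below isolates that condition; keep_token and demo_stop_words are hypothetical names, and the small stop-word set stands in for NLTK's English list.

```python
import string

# Hypothetical stand-in for NLTK's English stop-word list.
demo_stop_words = {'for', 'in', 'my', 'this'}

def keep_token(token, stop_words):
    """Apply the same three checks that remove_noise() uses before keeping a token."""
    return (len(token) > 0
            and token not in string.punctuation   # substring test against the punctuation string
            and token.lower() not in stop_words)

candidates = ['', '!', 'For', 'top', ':)', '...']
print([t for t in candidates if keep_token(t, demo_stop_words)])
```

Because token not in string.punctuation is a substring test, a single mark such as ! is dropped, while multi-character tokens such as :) and ... are kept.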

Before proceeding to the modeling exercise in the next step, use the remove_noise() function to clean the positive and negative tweets. Comment out the line to print the output of remove_noise() on the sample tweet and add the following to the script.

...
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

#print(remove_noise(tweet_tokens[0], stop_words))

positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

Now that you’ve added the code to clean the sample tweets, you may want to compare the original tokens to the cleaned tokens for a sample tweet. If you’d like to test this, add the following code to the file to compare both versions of the tweet at index 500 in the list.

...
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])

In the output you will see that the punctuation and links have been removed, and the words have been converted to lowercase.

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb9ht']
['dang', 'rad', '#fanart', ':d']

There are certain issues that might arise during the preprocessing of text. For instance, words without spaces (“iLoveYou”) will be treated as one and it can be difficult to separate such words. Furthermore, “Hi”, “Hii”, and “Hiiiii” will be treated differently by the script unless you write something specific to tackle the issue. It’s common to fine tune the noise removal process for your specific data.
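As one possible starting point for such fine tuning, the sketch below shows two small regular-expression helpers. Both squeeze_repeats and split_camel_case are hypothetical functions written for this example; they are not part of the script or of NLTK, and they only handle plain ASCII letters.

```python
import re

def squeeze_repeats(token):
    """Collapse any letter repeated three or more times down to two."""
    return re.sub(r'([A-Za-z])\1{2,}', r'\1\1', token)

def split_camel_case(token):
    """Split a token such as 'iLoveYou' before each uppercase letter."""
    return re.findall(r'[A-Z]?[a-z]+', token)

print(squeeze_repeats('Hiiiii'))
print(split_camel_case('iLoveYou'))
```

Squeezing reduces the number of variants but does not merge Hi and Hii, so a stricter rule or a dictionary lookup may still be needed depending on your data.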

Now that you’ve seen the remove_noise() function in action, be sure to comment out or remove the last two lines from the script so you can add more to it.

...
#print(positive_tweet_tokens[500])
#print(positive_cleaned_tokens_list[500])

In this step you removed noise from the data to make the analysis more effective. In the next step you will analyze the data to find the most common words in your sample dataset.

#5 : Determining Word Density

The most basic form of analysis on textual data is to compute word frequencies. A single tweet is too small an entity to reveal the distribution of words, so the frequency analysis is done on all positive tweets.

The following snippet defines a generator function named get_all_words that takes a list of tweets as an argument and yields, one at a time, every word from all of the tweet tokens. Add the following code.

...

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

all_pos_words = get_all_words(positive_cleaned_tokens_list)

Now that you have compiled all words in the sample of tweets, you can find out which are the most common words using the FreqDist class of NLTK. Add the following code.

from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))

The .most_common() method lists the words which occur most frequently in the data. Save and close the file after making these changes.

When you run the file now, you will find the most common terms in the data:

[(':)', 3691),
 (':-)', 701),
 (':d', 658),
 ('thanks', 388),
 ('follow', 357),
 ('love', 333),
 ('...', 290),
 ('good', 283),
 ('get', 263),
 ('thank', 253)]

From this data, you can see that emoticon entities form some of the most common parts of positive tweets. Before proceeding to the next step, make sure you comment out the last line of the script that prints the top ten tokens.
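If you want to experiment with the counting step without loading the corpus, the standard library's collections.Counter offers the same most_common() interface (NLTK's FreqDist is built on Counter). The sketch below uses two made-up cleaned tweets and also shows a property of generators that is easy to overlook.

```python
from collections import Counter

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

# Two made-up cleaned tweets, used only for illustration.
demo_tokens_list = [['thanks', 'follow', ':)'], ['love', ':)']]

all_words = get_all_words(demo_tokens_list)
counts = Counter(all_words)
print(counts.most_common(1))

# The generator is now exhausted, so a second pass yields nothing.
leftover = list(all_words)
print(leftover)
```

Since a generator can be consumed only once, call get_all_words() again if you need to iterate over the words a second time.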

To summarize, you extracted the tweets from nltk, then tokenized, normalized, and cleaned them for use in the model. Finally, you also looked at the frequencies of tokens in the data and checked the frequencies of the top ten tokens.

#6 : Preparing Data for the Model

Sentiment analysis is the process of identifying the attitude of the author toward the topic being written about. You will create a training dataset to train a model. This is a supervised machine learning process, which requires you to associate each item in the dataset with a “sentiment” for training. In this tutorial, our model will use the “positive” and “negative” sentiments.

Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and availability of the training dataset, this tutorial helps you train your model in only two categories, positive and negative.

In the data preparation step, you will prepare the data for sentiment analysis by converting tokens to the dictionary form and then split the data for training and testing purposes.

Converting Tokens to a Dictionary

First, you will prepare the data to be fed into the model. You will use the Naive Bayes classifier in NLTK to perform the modeling exercise. Notice that the model requires not just a list of words in a tweet, but a Python dictionary with words as keys and True as values. The following generator function converts the cleaned data to that format.

Add the following code to convert the tweets from a list of cleaned tokens to dictionaries with keys as the tokens and True as values. The corresponding dictionaries are stored in positive_tokens_for_model and negative_tokens_for_model.

...
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
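To check what the generator produces, you can run it on a single cleaned tweet. The sketch below reuses the cleaned sample tweet from earlier in this article; the names demo_cleaned and features are assumptions made for the example.

```python
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

# The cleaned sample tweet from earlier in this article.
demo_cleaned = [['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']]

features = list(get_tweets_for_model(demo_cleaned))
print(features[0])
```

Each tweet becomes one dictionary, so a token that appears several times in a tweet is recorded only once; the model sees the presence of a word, not its count.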

Splitting the Dataset for Training and Testing the Model

Next, you need to prepare the data for training the NaiveBayesClassifier class. Add the following code to the file to prepare the data:

...
import random

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]

This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, you’ve added code to randomly arrange the data using the .shuffle() method of random.

Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for training the model and the final 3000 for testing the model.
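The shuffle-and-split step can be tried on placeholder data. The sketch below makes two assumptions that are not in the script: it fixes a random seed so that the shuffle is repeatable, and it derives the split index from the dataset length instead of hard-coding 7000. The demo_dataset contents are made up.

```python
import random

# Hypothetical stand-in for the 10000 labelled tweets.
demo_dataset = [({'token%d' % i: True}, 'Positive' if i < 5000 else 'Negative')
                for i in range(10000)]

random.seed(42)              # assumption: fix the seed so the shuffle is repeatable
random.shuffle(demo_dataset)

split_index = len(demo_dataset) * 7 // 10   # 70% of the data for training
train_demo = demo_dataset[:split_index]
test_demo = demo_dataset[split_index:]

print(len(train_demo), len(test_demo))
```

Computing the index from the length keeps the 70:30 ratio intact if the size of the dataset changes.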

In this step, you converted the cleaned tokens to a dictionary form, randomly shuffled the dataset, and split it into training and testing data.

#7 : Building and Testing the Model

Finally, you can use the NaiveBayesClassifier class to build the model. Use the .train() method to train the model and the accuracy() function of the classify module to test the model on the testing data.

...
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

# show_most_informative_features() prints its table directly and returns None
classifier.show_most_informative_features(10)

Save, close, and execute the file after adding the code. The output will be similar to the following; the exact figures vary between runs because the dataset is shuffled randomly:

Accuracy is: 0.9956666666666667

Most Informative Features
                      :( = True           Negati : Positi =   2085.6 : 1.0
                      :) = True           Positi : Negati =    986.0 : 1.0
                 welcome = True           Positi : Negati =     37.2 : 1.0
                  arrive = True           Positi : Negati =     31.3 : 1.0
                     sad = True           Negati : Positi =     25.9 : 1.0
                follower = True           Positi : Negati =     21.1 : 1.0
                     bam = True           Positi : Negati =     20.7 : 1.0
                    glad = True           Positi : Negati =     18.1 : 1.0
                     x15 = True           Negati : Positi =     15.9 : 1.0
               community = True           Positi : Negati =     14.1 : 1.0

Accuracy is defined as the percentage of tweets in the testing dataset for which the model was correctly able to predict the sentiment. An accuracy of about 99.6% on the test set is pretty good.
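That definition amounts to dividing the number of correct predictions by the number of test items. The sketch below spells this out with made-up labels; the function accuracy here is written for illustration and is separate from NLTK's own implementation.

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where the predicted label matches the actual one."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up labels for illustration: 3 of 4 predictions are right.
predicted = ['Positive', 'Negative', 'Positive', 'Negative']
actual = ['Positive', 'Negative', 'Negative', 'Negative']
print(accuracy(predicted, actual))

# On a 3000-tweet test set, 2987 correct predictions correspond to the score reported above.
print(2987 / 3000)
```

In other words, the reported score corresponds to 13 misclassified tweets out of 3000.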

In the table that shows the most informative features, every row in the output shows the ratio of occurrence of a token in positive and negative tagged tweets in the training dataset. The first row in the data signifies that in all tweets containing the token :(, the ratio of negative to positive tweets was 2085.6 to 1. Interestingly, it seems that there was one token with :( in the positive datasets. You can see that the top two discriminating items in the text are the emoticons. Further, words such as sad lead to negative sentiments, whereas welcome and glad are associated with positive sentiments.

Next, you can check how the model performs on random tweets from Twitter. Add this code to the file.

...
from nltk.tokenize import word_tokenize

custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

This code will allow you to test custom tweets by updating the string associated with the custom_tweet variable. Save and close the file after making these changes.

Run the script to analyze the custom text. Here is the output for the custom text in the example:

Negative

You can also check if it characterizes positive tweets correctly:

...
custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies'

Here is the output:

Positive

Now that you’ve tested both positive and negative sentiments, update the variable to test a more complex sentiment like sarcasm.

...
custom_tweet = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'

Here is the output:

Positive
