Twitter Sentiment Classification In PyTorch

In this blog, we will build a sentiment analysis model in PyTorch. For that, we will use the Sentiment140 dataset.

Download the Twitter dataset

Sentiment140 Dataset details:

Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter.

The data is a CSV with emoticons removed. Each record has 6 fields:

the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
the id of the tweet (2087)
the date of the tweet (Sat May 16 23:58:44 UTC 2009)
the query (lyx). If there is no query, then this value is NO_QUERY.
the user that tweeted (robotickilldozr)
the text of the tweet (Lyx is cool)

Import torch module

In [1]:
import torch
In [2]:
torch.__version__
Out[2]:
'1.8.1+cu102'

Check the device and use CUDA if it is available

In [3]:
if torch.cuda.is_available():
    device = torch.device("cuda") 
else:
    device = torch.device("cpu")
    
In [4]:
device
Out[4]:
device(type='cpu')

Load the data

In [5]:
import pandas as pd
In [6]:
input_file = 'input/training.1600000.processed.noemoticon.csv'
In [7]:
df = pd.read_csv(input_file, header = None)
df.head()
Out[7]:
0 1 2 3 4 5
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ...
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man...
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all....
In [8]:
df.tail()
Out[8]:
0 1 2 3 4 5
1599995 4 2193601966 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY AmandaMarie1028 Just woke up. Having no school is the best fee...
1599996 4 2193601969 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY TheWDBoards TheWDB.com - Very cool to hear old Walt interv...
1599997 4 2193601991 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY bpbabe Are you ready for your MoJo Makeover? Ask me f...
1599998 4 2193602064 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY tinydiamondz Happy 38th Birthday to my boo of alll time!!! ...
1599999 4 2193602129 Tue Jun 16 08:40:50 PDT 2009 NO_QUERY RyanTrevMorris happy #charitytuesday @theNSPCC @SparksCharity...
In [9]:
df.shape
Out[9]:
(1600000, 6)

pandas.Series.value_counts

Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True):

Return a Series containing counts of unique values.

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
In [10]:
df[0].value_counts()
Out[10]:
0    800000
4    800000
Name: 0, dtype: int64

View the dataframe details

In [11]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   0       1600000 non-null  int64 
 1   1       1600000 non-null  int64 
 2   2       1600000 non-null  object
 3   3       1600000 non-null  object
 4   4       1600000 non-null  object
 5   5       1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB

Create the sentiment category column

In [12]:
df["sentiment_category"] = df[0].astype('category')
df["sentiment_category"]
Out[12]:
0          0
1          0
2          0
3          0
4          0
          ..
1599995    4
1599996    4
1599997    4
1599998    4
1599999    4
Name: sentiment_category, Length: 1600000, dtype: category
Categories (2, int64): [0, 4]

Dataframe after adding sentiment_category

In [13]:
df.head()
Out[13]:
0 1 2 3 4 5 sentiment_category
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t... 0
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ... 0
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man... 0
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire 0
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all.... 0
In [14]:
df.tail()
Out[14]:
0 1 2 3 4 5 sentiment_category
1599995 4 2193601966 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY AmandaMarie1028 Just woke up. Having no school is the best fee... 4
1599996 4 2193601969 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY TheWDBoards TheWDB.com - Very cool to hear old Walt interv... 4
1599997 4 2193601991 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY bpbabe Are you ready for your MoJo Makeover? Ask me f... 4
1599998 4 2193602064 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY tinydiamondz Happy 38th Birthday to my boo of alll time!!! ... 4
1599999 4 2193602129 Tue Jun 16 08:40:50 PDT 2009 NO_QUERY RyanTrevMorris happy #charitytuesday @theNSPCC @SparksCharity... 4
In [15]:
df["sentiment_category"].cat.codes
Out[15]:
0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Length: 1600000, dtype: int8

pandas.Series.cat.codes

Series.cat.codes

Return Series of codes as well as the index.
In [16]:
df["sentiment"] = df["sentiment_category"].cat.codes
df["sentiment"]
Out[16]:
0          0
1          0
2          0
3          0
4          0
          ..
1599995    1
1599996    1
1599997    1
1599998    1
1599999    1
Name: sentiment, Length: 1600000, dtype: int8
In [17]:
df.head()
Out[17]:
0 1 2 3 4 5 sentiment_category sentiment
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t... 0 0
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by ... 0 0
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus @Kenichan I dived many times for the ball. Man... 0 0
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF my whole body feels itchy and like its on fire 0 0
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli @nationwideclass no, it's not behaving at all.... 0 0
In [18]:
df.tail()
Out[18]:
0 1 2 3 4 5 sentiment_category sentiment
1599995 4 2193601966 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY AmandaMarie1028 Just woke up. Having no school is the best fee... 4 1
1599996 4 2193601969 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY TheWDBoards TheWDB.com - Very cool to hear old Walt interv... 4 1
1599997 4 2193601991 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY bpbabe Are you ready for your MoJo Makeover? Ask me f... 4 1
1599998 4 2193602064 Tue Jun 16 08:40:49 PDT 2009 NO_QUERY tinydiamondz Happy 38th Birthday to my boo of alll time!!! ... 4 1
1599999 4 2193602129 Tue Jun 16 08:40:50 PDT 2009 NO_QUERY RyanTrevMorris happy #charitytuesday @theNSPCC @SparksCharity... 4 1

Save a random sample of 10,000 records to a CSV file

We have too many records, so let us reduce the dataset by saving a random sample of 10,000 records.

In [19]:
df.sample(10000).to_csv("train.csv", header = None, index = None)

What is torchtext?

Like torchvision, PyTorch provides an official library, torchtext, for handling text-processing pipelines.

Installing torchtext with pip:

pip install torchtext

Installing with conda:

conda install -c pytorch torchtext

torchtext.data

The data module provides the following:

Ability to define a preprocessing pipeline
Batching, padding, and numericalizing (including building a vocabulary object)
Wrapper for dataset splits (train, validation, test)
Loading custom NLP datasets
In [20]:
import torchtext
from torchtext.legacy import data
In [21]:
torchtext.__version__
Out[21]:
'0.9.1'

Create the dataset

Define the LABEL and TWEET field datatypes

We need only two columns: the label and the text of the tweet. We define LABEL as a LabelField, and TWEET as a Field object that uses the spaCy tokenizer and converts all text to lowercase.

Field

class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False):

Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.

If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
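
The shared-vocabulary behaviour is easy to see in a small sketch. The qa.csv file and the question/answer column names below are hypothetical, purely to illustrate reusing one Field object across two columns:

# Hypothetical sketch: one Field shared by two columns builds a single vocabulary.
from torchtext.legacy import data

TEXT = data.Field(tokenize='spacy', tokenizer_language='en_core_web_sm', lower=True)

# The same TEXT object handles both columns, so they share one Vocab.
qa_fields = [('question', TEXT), ('answer', TEXT)]
qa_dataset = data.TabularDataset(path='qa.csv', format='CSV', fields=qa_fields)

TEXT.build_vocab(qa_dataset)  # the vocabulary now covers words from both columns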
In [22]:
LABEL = data.LabelField()
TWEET = data.Field(tokenize='spacy', tokenizer_language = 'en_core_web_sm', lower = True)

Create fields

In [23]:
fields = [('score',None), ('id',None), ('date',None), ('query',None), ('name',None), 
          ('tweet', TWEET),('category',None),('label',LABEL)]
In [24]:
fields
Out[24]:
[('score', None),
 ('id', None),
 ('date', None),
 ('query', None),
 ('name', None),
 ('tweet', <torchtext.legacy.data.field.Field at 0x7f9565813bd0>),
 ('category', None),
 ('label', <torchtext.legacy.data.field.LabelField at 0x7f9565813c10>)]

class torchtext.data.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

Defines a Dataset of columns stored in CSV, TSV, or JSON format.
In [25]:
twitterDataset = data.dataset.TabularDataset(
        path = "train.csv", 
        format = "CSV", 
        fields = fields,
        skip_header = False)

Divide twitterDataset into train, validation and test

split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)

Create train-test(-valid?) splits from the instance’s examples.
Parameters: 

    split_ratio (float or List of python:floats) – a number [0, 1] denoting the amount of data to be used for the training split (rest is used for validation), or a list of numbers denoting the relative sizes of train, test and valid splits respectively. If the relative size for valid is missing, only the train-test split is returned. Default is 0.7 (for the train set).
    stratified (bool) – whether the sampling should be stratified. Default is False.
    strata_field (str) – name of the examples Field stratified over. Default is ‘label’ for the conventional label field.
    random_state (tuple) – the random seed used for shuffling. A return value of random.getstate().

Returns:    

Datasets for train, validation, and test splits in that order, if the splits are provided.
Return type: Tuple[Dataset]
In [26]:
(train, validation, test) = twitterDataset.split(split_ratio = [0.8,0.1,0.1],
                                            stratified = True, strata_field = 'label')
In [27]:
len(train)
Out[27]:
8000
In [28]:
len(validation)
Out[28]:
1000
In [29]:
len(test)
Out[29]:
1000

Building a Vocabulary

Build a vocabulary that maps each word of the training data to an integer index, capped at the 20,000 most frequent words (the special <unk> and <pad> tokens are added automatically)

In [30]:
vocab_size = 20000
TWEET.build_vocab(train, max_size = vocab_size)
In [31]:
len(TWEET.vocab)
Out[31]:
16462
In [32]:
LABEL.build_vocab(train)

View most common words

In [33]:
TWEET.vocab.freqs.most_common(10)
Out[33]:
[('i', 4974),
 ('!', 4713),
 ('.', 4014),
 (' ', 2972),
 ('to', 2867),
 ('the', 2545),
 (',', 2365),
 ('a', 1938),
 ('my', 1602),
 ('you', 1556)]
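
The vocabulary also holds the mappings used for numericalization. Here is a small sketch of inspecting them via stoi (string-to-index) and itos (index-to-string); the exact indices vary from run to run, and by default index 0 is <unk> and index 1 is <pad>:

# Inspect the mappings built by build_vocab().
word = 'happy'
idx = TWEET.vocab.stoi[word]              # string -> index
print(word, '->', idx)
print(idx, '->', TWEET.vocab.itos[idx])   # index -> string, back again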

Create a data loader

In [34]:
train_dataloader, valid_dataloader, test_dataloader = data.BucketIterator.splits(
    (train, validation, test),
    batch_size = 32,
    device = device,
    sort_key = lambda x: len(x.tweet),
    sort_within_batch = False)
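
To check what the BucketIterator yields, we can pull one batch and inspect its shapes. Since batch_first was left at its default of False on the Field, the tweet tensor comes out as [sequence length, batch size]; a quick sketch:

# Peek at one batch (the sequence length varies with the longest tweet in the batch).
batch = next(iter(train_dataloader))
print(batch.tweet.shape)   # e.g. torch.Size([seq_len, 32]); sequence first, batch second
print(batch.label.shape)   # torch.Size([32]); one label per tweet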

Create an LSTM Model

In [35]:
import torch.nn as nn
In [36]:
class MyLSTMModel(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size):
        super(MyLSTMModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.encoder = nn.LSTM(input_size=embedding_dim,
                               hidden_size=hidden_size, num_layers=1)
        self.predictor = nn.Linear(hidden_size, 2)

    def forward(self, seq):
        # hidden is the LSTM's final hidden state: [num_layers, batch, hidden_size]
        output, (hidden, _) = self.encoder(self.embedding(seq))
        # drop the num_layers dimension and map to the 2 sentiment classes
        preds = self.predictor(hidden.squeeze(0))
        return preds
In [37]:
model = MyLSTMModel(hidden_size = 100, embedding_dim = 300, vocab_size = vocab_size)
model.to(device)
Out[37]:
MyLSTMModel(
  (embedding): Embedding(20000, 300)
  (encoder): LSTM(300, 100)
  (predictor): Linear(in_features=100, out_features=2, bias=True)
)

Define the loss

A loss function computes a value that estimates how far the model's output is from the target. The main objective is to reduce this value by updating the weights through backpropagation.

The loss after each iteration of optimization indicates how well our model is fitting the training set.
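
As a minimal sketch of what CrossEntropyLoss computes, here are hand-made logits for a batch of two examples and their target classes (the numbers are invented purely for illustration):

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

# Raw, unnormalized scores (logits) for 2 examples over 2 classes.
logits = torch.tensor([[2.0, 0.5],    # confidently class 0
                       [0.1, 0.2]])   # nearly undecided
targets = torch.tensor([0, 1])        # the true classes

# CrossEntropyLoss applies log-softmax internally, then averages the
# negative log-likelihood of the target class over the batch.
print(loss_fn(logits, targets))       # a scalar tensor; lower is better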

In [38]:
criterion = nn.CrossEntropyLoss()
criterion
Out[38]:
CrossEntropyLoss()

Define the optimizer

In [39]:
import torch.optim as optim

Optimizers define how the weights of the neural network are to be updated. Optimizers take model parameters and learning rate as the input arguments.
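
To make the update concrete, here is a toy sketch with a single parameter; one Adam step nudges the parameter against its gradient (the values are invented for illustration):

import torch
import torch.optim as optim

# A single trainable parameter, starting at 1.0.
w = torch.tensor(1.0, requires_grad=True)
opt = optim.Adam([w], lr=2e-2)

loss = (w - 3.0) ** 2     # toy loss, minimized at w = 3
opt.zero_grad()           # clear any old gradients
loss.backward()           # fills w.grad
opt.step()                # Adam moves w toward 3 using w.grad
print(w)                  # slightly larger than 1.0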

In [40]:
optimizer = optim.Adam(model.parameters(), lr=2e-2)
optimizer
Out[40]:
Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.02
    weight_decay: 0
)

Define the training loop

In [41]:
epochs = 10
In [42]:
def train_model(epochs, model, optimizer, criterion, train_dataloader, valid_dataloader):
    # named train_model so it does not shadow the train dataset split created earlier
    for epoch in range(1, epochs + 1):

        # reset the running training and validation losses
        training_loss = 0.0
        valid_loss = 0.0

        # put the model in training mode
        model.train()

        for batch_idx, batch in enumerate(train_dataloader):

            # each batch carries the tweet tensor and its labels
            tweet, label = batch

            # clear the gradients accumulated in the previous step
            optimizer.zero_grad()

            # forward pass
            predict = model(tweet)

            # compute the loss
            loss = criterion(predict, label)

            # backward pass: calculate gradients
            loss.backward()

            # update the weights
            optimizer.step()

            # accumulate the running loss
            training_loss += loss.item() * tweet.size(0)

        training_loss /= len(train_dataloader)

        # put the model in evaluation mode and disable gradient tracking
        model.eval()
        with torch.no_grad():
            for batch_idx, batch in enumerate(valid_dataloader):
                tweet, label = batch

                predict = model(tweet)
                loss = criterion(predict, label)
                valid_loss += loss.item() * tweet.size(0)

        valid_loss /= len(valid_dataloader)
        print('Epoch: {}, Training Loss: {:.2f}, Validation Loss: {:.2f}'.format(epoch, training_loss, valid_loss))
In [43]:
train_model(epochs, model, optimizer, criterion, train_dataloader, valid_dataloader)
Epoch: 1, Training Loss: 23.35, Validation Loss: 11.17
Epoch: 2, Training Loss: 19.51, Validation Loss: 13.26
Epoch: 3, Training Loss: 17.05, Validation Loss: 12.58
Epoch: 4, Training Loss: 15.26, Validation Loss: 13.24
Epoch: 5, Training Loss: 14.48, Validation Loss: 15.72
Epoch: 6, Training Loss: 13.37, Validation Loss: 15.67
Epoch: 7, Training Loss: 12.70, Validation Loss: 15.92
Epoch: 8, Training Loss: 12.42, Validation Loss: 17.10
Epoch: 9, Training Loss: 11.38, Validation Loss: 15.87
Epoch: 10, Training Loss: 10.79, Validation Loss: 15.65

Classify a new tweet

To classify a new tweet, we first call preprocess(), which performs our spaCy-based tokenization. torchtext expects a batch of examples, so we wrap the token list in another list. We then call process() to convert the tokens into a tensor based on our already-built vocabulary, and feed that tensor into the model.

In [44]:
def classifyTweet(tweet):
    categories = {0: "Negative", 1:"Positive"}
    processed = TWEET.process([TWEET.preprocess(tweet)])
    processed = processed.to(device)
    
    model.eval()
    prediction = model(processed)
    print("Prediction: ",  prediction)
    pred_cat = categories[prediction.argmax().item()] 
    return pred_cat

The model returns a tensor of raw scores, so we take the highest one: argmax() gives its index, and item() turns that zero-dimensional tensor into a Python integer, which we use to look up the label in our categories dictionary.

In [45]:
classifyTweet("Just woke up. Having no school is the best thing")
Prediction:  tensor([[ 0.0990, -0.2582]], grad_fn=<AddmmBackward>)
Out[45]:
'Negative'
In [46]:
classifyTweet("Bullshit")
Prediction:  tensor([[ 0.2480, -0.2117]], grad_fn=<AddmmBackward>)
Out[46]:
'Negative'
In [47]:
classifyTweet("excellent")
Prediction:  tensor([[-0.3993,  0.1590]], grad_fn=<AddmmBackward>)
Out[47]:
'Positive'
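
The test split created earlier has not been used yet. As a final check, here is a minimal sketch of measuring overall accuracy on test_dataloader, assuming the model and iterators defined above:

# Sketch: overall accuracy on the held-out test split.
correct = 0
total = 0

model.eval()
with torch.no_grad():
    for batch in test_dataloader:
        tweet, label = batch
        preds = model(tweet).argmax(dim=1)        # predicted class per tweet
        correct += (preds == label).sum().item()
        total += label.size(0)

print('Test accuracy: {:.2%}'.format(correct / total))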
