Deep Learning and AI

Hugging Face Text Classification Tutorial Using PyTorch

December 1, 2022 • 19 min read


What is PyTorch?

PyTorch is an open-source deep learning library based on the well-known Torch library. It is a Python-based framework most commonly used for natural language processing and computer vision. In this tutorial, we will be using PyTorch to train our model for text classification.

What is Hugging Face?

Hugging Face is an open-source platform best known for its natural language processing (NLP) datasets, models, and libraries. It hosts tons of valuable, high-quality datasets covering quite a range of tasks. When searching for an NLP dataset, Hugging Face is a great go-to source.
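As a quick illustration, the Hugging Face datasets package can pull the IMDB reviews down in a couple of lines. This is a minimal sketch of our own, assuming the datasets package is installed; it is not used in the rest of this tutorial, which reads a Kaggle CSV instead.

# A minimal sketch using the Hugging Face `datasets` package (pip install datasets).
# Not used in the tutorial below, which reads the Kaggle CSV instead.
from datasets import load_dataset

imdb = load_dataset("imdb")               # DatasetDict with "train" and "test" splits
print(imdb["train"][0]["text"][:100])     # first 100 characters of the first review
print(imdb["train"][0]["label"])          # 0 = negative, 1 = positive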

Text Classification and Natural Language Processing

Text classification is the task of categorizing data (usually in textual format) into different categories or groups. With data being the new currency of the world, it's no shock that companies are spending fortunes processing and utilizing this precious currency.

Text classification itself breaks down into smaller subfields and use cases, and it is a core task of natural language processing. Natural Language Processing, or NLP, is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. In particular, it explores how to program computers to process and analyze large amounts of natural language data.

Defining the Goal of Our Text Classification Model

In the tutorial portion of this article, we will be using PyTorch and Hugging Face to run a text classification model. For our text classification purpose, we will be using natural language processing in order to identify the sentiment of a given sentence.

Sentiment analysis categorizes a given sentence as either emotionally positive or negative. For example, a positive sentiment would be “he worked so hard and achieved great things”. A negative sentiment, on the opposite side of the spectrum, would be “his performance was not good enough.”

Our Text Classification Data Set

For this tutorial, we’ll be using the IMDB Dataset of 50K Movie Reviews. As the name indicates, this data set contains 50,000 movie reviews written by actual users. Each review is labeled as either positive or negative depending on whether the viewer liked or disliked the movie.

How to Build a Text Classification Model Using Hugging Face

Step 1: Import the Necessary Libraries 

As the first step in any machine learning or deep learning project, we import all the necessary libraries at the very beginning of our code.

Most of these libraries will become clearer as the tutorial progresses, but to begin, here are some of the main ones we import below:

First, there is the well-known NumPy library, which allows you to freely create arrays and matrices and provides a range of linear algebra functions.

From the sklearn library we import the train_test_split function, which allows us to split our data into a training and a testing data set. As a programmer, you can choose the ratio of the split, among other parameters; more on that later.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch
import torch.nn as nn
import torch.nn.functional as F
from nltk.corpus import stopwords
from collections import Counter
import string
import re
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split

Step 2: Check if CUDA is Available

CUDA is a parallel computing platform and an application programming interface that allows the software to use certain types of graphics processing units for general-purpose processing. This approach is called general-purpose computing on GPUs.

In this step, we check whether a GPU is available to run our code on. If no GPU is available, the code will run on a normal CPU instead.

is_cuda = torch.cuda.is_available()

if is_cuda:
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")
    print("GPU not available, CPU used")

Step 3: Importing and Reading the Data Set

If you're running your model on Kaggle, first add the dataset to your notebook. We then pass the file path of our dataset to the read_csv function (imported from the pandas library), which reads the data into a DataFrame.

base_csv = '/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv'
df = pd.read_csv(base_csv)
df.head()

Note that the head function displays the first 5 rows of our dataset.
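It is also worth a quick look at the label distribution. Here is a small optional check of our own (not part of the original notebook):

# Optional sanity check: how many reviews fall into each sentiment class,
# and the overall shape of the DataFrame (rows, columns).
print(df['sentiment'].value_counts())
print(df.shape)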

Step 4: Filtering and Cleaning the Data Set

No data set is perfect. Depending on the model, some extra preparation may be necessary for it to perform optimally. In this case, we keep only what the model actually needs, the review text and its sentiment label, extracted below as NumPy arrays.

X,y = df['review'].values,df['sentiment'].values
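If you want to double-check that no further cleaning is needed, a quick optional inspection (our own addition, not part of the original notebook) might look like this:

# Optional inspection (not in the original notebook):
print(df.isnull().sum())       # missing values per column
print(df.duplicated().sum())   # number of exact duplicate rows
# df = df.drop_duplicates()    # uncomment to remove duplicates before splitting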

Step 5: Splitting the Data set

As stated briefly above, a machine learning model requires two different data sets. The training data set is used to train and teach our model. After that, we have the testing data set, which, as the name implies, is used to test the accuracy of our newly trained model.

Some parameters that can be passed to train_test_split include:

test_size: passing a value of 0.2, for example, instructs train_test_split to split our data set into 80% training and 20% testing sets.

shuffle: as the name states, when this parameter is True, the data points in the set are randomized, or shuffled, before the split.

In our call below we rely on the default split and pass stratify=y, which keeps the proportion of positive and negative reviews the same in both sets; an explicit variant is sketched right after it.

x_train,x_test,y_train,y_test = train_test_split(X,y,stratify=y)
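For illustration, here is what the call could look like with the parameters above spelled out. This is a variant of our own, not the call used for the results later in this tutorial.

# Illustrative variant with explicit parameters (not used for the results below):
x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 80% training, 20% testing
    shuffle=True,      # randomize the order of the samples before splitting
    stratify=y,        # keep the positive/negative ratio equal in both sets
    random_state=42    # make the split reproducible
)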

Step 6: Defining the Tokenization Function

A significant part of natural language processing, tokenization is the process of dividing raw text data into smaller chunks. This is done by splitting sentences into words, more commonly known as tokens.

The main concept behind tokenization is that by analyzing the different words present in a given text, we can interpret the meaning of such a text. We can also run statistical tools and methods to find hidden insights and patterns in the data.
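As a toy illustration of this idea (made-up sentences of our own, not our real data), mapping the most frequent words to integer indices might look like this:

# Toy illustration: build a word-to-index vocabulary from two made-up sentences.
from collections import Counter

toy_corpus = ["the movie was great", "the movie was boring"]
counts = Counter(word for sent in toy_corpus for word in sent.split())
toy_vocab = {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}
print(toy_vocab)    # {'the': 1, 'movie': 2, 'was': 3, 'great': 4, 'boring': 5}

encoded = [[toy_vocab[w] for w in sent.split()] for sent in toy_corpus]
print(encoded)      # [[1, 2, 3, 4], [1, 2, 3, 5]]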

Next, we define the tokenize() function (along with a small preprocess_string() helper), which takes our datasets as input and performs this tokenization process.

def preprocess_string(s):
    # Remove all non-word characters (everything except numbers and letters)
    s = re.sub(r"[^\w\s]", '', s)
    # Replace all runs of whitespaces with no space
    s = re.sub(r"\s+", '', s)
    # replace digits with no space
    s = re.sub(r"\d", '', s)

    return s

def tokenize(x_train,y_train,x_val,y_val):
    word_list = []

    stop_words = set(stopwords.words('english')) 
    for sent in x_train:
        for word in sent.lower().split():
            word = preprocess_string(word)
            if word not in stop_words and word != '':
                word_list.append(word)

    corpus = Counter(word_list)
    # sorting on the basis of most common words
    corpus_ = sorted(corpus,key=corpus.get,reverse=True)[:1000]
    # creating a dict
    onehot_dict = {w:i+1 for i,w in enumerate(corpus_)}

    # tokenize
    final_list_train,final_list_test = [],[]
    for sent in x_train:
            final_list_train.append([onehot_dict[preprocess_string(word)] for word in sent.lower().split() 
                                     if preprocess_string(word) in onehot_dict.keys()])
    for sent in x_val:
            final_list_test.append([onehot_dict[preprocess_string(word)] for word in sent.lower().split() 
                                    if preprocess_string(word) in onehot_dict.keys()])

    encoded_train = [1 if label =='positive' else 0 for label in y_train]  
    encoded_test = [1 if label =='positive' else 0 for label in y_val] 
    return np.array(final_list_train, dtype=object), np.array(encoded_train), np.array(final_list_test, dtype=object), np.array(encoded_test), onehot_dict

Step 7: Resplitting the Dataset after Tokenization

The tokenize() function takes both our training and testing datasets as inputs and resplits them after performing tokenization.

x_train, y_train, x_test, y_test, vocab = tokenize(x_train, y_train, x_test, y_test)

Step 8: Padding

In typical sentence classification, sentences are padded with 0's so that they all have the same length, which the subsequent layers require. The following padding_() function pads each review with extra 0's (on the left) until it reaches the chosen length, and truncates reviews that are longer.

def padding_(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len),dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:seq_len]
    return features
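To see what this does, here is a small made-up example of our own:

# Made-up example: a 3-token review left-padded to a length of 6.
print(padding_([[5, 9, 2]], 6))    # [[0 0 0 5 9 2]]
# Reviews longer than seq_len are truncated to their first seq_len tokens.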

We perform padding on the X values in the training and testing data sets. Since most reviews are shorter than 500 tokens, we pad (or truncate) every review to a length of 500.

x_train_pad = padding_(x_train,500)
x_test_pad = padding_(x_test,500)

Step 9: Batching and Loading as Tensors

Batch size is the number of samples processed before the model is updated. In this case, we have selected a batch size of 50. 

The size of a batch must be more than or equal to one and less than or equal to the number of samples in the training dataset. Here we create the Tensor datasets and define the required batch size.

train_data = TensorDataset(torch.from_numpy(x_train_pad), torch.from_numpy(y_train))
valid_data = TensorDataset(torch.from_numpy(x_test_pad), torch.from_numpy(y_test))

batch_size = 50

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)

# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)
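As a quick sanity check (our own addition), we can print the shapes of this sample batch:

# Each batch should contain 50 padded reviews of length 500 and 50 labels.
print('Sample input size: ', sample_x.size())    # expected: torch.Size([50, 500])
print('Sample label size: ', sample_y.size())    # expected: torch.Size([50])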

Step 10: Defining the Model

As with any deep learning model, we define our network as a class. In this case, we will use an LSTM-based recurrent neural network (RNN). This class will be instantiated later (in Step 11) to run our final model.

class SentimentRNN(nn.Module):
    def __init__(self, no_layers, vocab_size, hidden_dim, embedding_dim, output_dim, drop_prob=0.5):
        super(SentimentRNN,self).__init__()

        self.output_dim = output_dim
        self.hidden_dim = hidden_dim

        self.no_layers = no_layers
        self.vocab_size = vocab_size

        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # LSTM layer
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=self.hidden_dim,
                            num_layers=no_layers, batch_first=True)

        # dropout layer
        self.dropout = nn.Dropout(0.3)

        # linear and sigmoid layer
        self.fc = nn.Linear(self.hidden_dim, self.output_dim)
        self.sig = nn.Sigmoid()

    def forward(self,x,hidden):
        batch_size = x.size(0)
        # embeddings and lstm_out
        embeds = self.embedding(x)  # shape: B x S x Feature, since batch_first=True
        # print(embeds.shape)  # [50, 500, 64] -> (batch, seq_len, embedding_dim)
        lstm_out, hidden = self.lstm(embeds, hidden)

        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim) 

        # dropout and fully connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)

        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)

        sig_out = sig_out[:, -1] # take the sigmoid output at the last time step

        # return last sigmoid output and hidden state
        return sig_out, hidden



    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        h0 = torch.zeros((self.no_layers,batch_size,self.hidden_dim)).to(device)
        c0 = torch.zeros((self.no_layers,batch_size,self.hidden_dim)).to(device)
        hidden = (h0,c0)
        return hidden

Step 11: Defining our Model's Parameters

Finally, we define the hyperparameters that will be passed to the SentimentRNN class defined in the previous step.

no_layers = 2
vocab_size = len(vocab) + 1 #extra 1 for padding
embedding_dim = 64
output_dim = 1
hidden_dim = 256

We then pass our parameters to the model and move it to the selected device.

model = SentimentRNN(no_layers, vocab_size, hidden_dim, embedding_dim, output_dim, drop_prob=0.5)

#moving to gpu
model.to(device)
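Printing the model (a quick check of our own) gives an overview of the layers we just defined:

# Display the architecture: embedding, LSTM, dropout, linear, and sigmoid layers.
print(model)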

Step 12: Training our Text Classification Model

Here we define the loss function and the optimizer, along with a small accuracy helper. We use a learning rate of 0.001.

lr = 0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)


def acc(pred,label):
    pred = torch.round(pred.squeeze())
    return torch.sum(pred == label.squeeze()).item()
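As a small made-up example of our own, the helper simply rounds the sigmoid outputs and counts how many match the labels:

# Made-up example: three of the four rounded predictions match their labels.
example_pred = torch.tensor([0.9, 0.2, 0.7, 0.4])
example_label = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(acc(example_pred, example_label))    # 3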

In this part of the code, we define the training loop. The number of epochs is the number of complete passes through the training dataset; in this case, we set it to 5. We also set a gradient clipping value of 5 to help prevent exploding gradients.

clip = 5
epochs = 5 
valid_loss_min = np.inf
# train for some number of epochs
epoch_tr_loss,epoch_vl_loss = [],[]
epoch_tr_acc,epoch_vl_acc = [],[]

for epoch in range(epochs):
    train_losses = []
    train_acc = 0.0
    model.train()
    # initialize hidden state 
    h = model.init_hidden(batch_size)
    for inputs, labels in train_loader:

        inputs, labels = inputs.to(device), labels.to(device)   
        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        model.zero_grad()
        output,h = model(inputs,h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        train_losses.append(loss.item())
        # calculating accuracy
        accuracy = acc(output,labels)
        train_acc += accuracy
        #`clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()


    val_h = model.init_hidden(batch_size)
    val_losses = []
    val_acc = 0.0
    model.eval()
    for inputs, labels in valid_loader:
            val_h = tuple([each.data for each in val_h])

            inputs, labels = inputs.to(device), labels.to(device)

            output, val_h = model(inputs, val_h)
            val_loss = criterion(output.squeeze(), labels.float())

            val_losses.append(val_loss.item())

            accuracy = acc(output,labels)
            val_acc += accuracy

    epoch_train_loss = np.mean(train_losses)
    epoch_val_loss = np.mean(val_losses)
    epoch_train_acc = train_acc/len(train_loader.dataset)
    epoch_val_acc = val_acc/len(valid_loader.dataset)
    epoch_tr_loss.append(epoch_train_loss)
    epoch_vl_loss.append(epoch_val_loss)
    epoch_tr_acc.append(epoch_train_acc)
    epoch_vl_acc.append(epoch_val_acc)    
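    # Optional progress report (our own addition): print the accumulated
    # statistics at the end of every epoch, inside the epoch loop above.
    print(f'Epoch {epoch+1}')
    print(f'train_loss : {epoch_train_loss:.4f}  val_loss : {epoch_val_loss:.4f}')
    print(f'train_accuracy : {epoch_train_acc*100:.2f}%  val_accuracy : {epoch_val_acc*100:.2f}%')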

Step 13: Evaluating the Final Results of Our Text Classification Model

We define the predict_text() function, which we will use to test our model on individual reviews.

def predict_text(text):
    word_seq = np.array([vocab[preprocess_string(word)] for word in text.split()
                         if preprocess_string(word) in vocab.keys()])
    word_seq = np.expand_dims(word_seq, axis=0)
    pad = torch.from_numpy(padding_(word_seq, 500))
    inputs = pad.to(device)
    batch_size = 1
    h = model.init_hidden(batch_size)
    h = tuple([each.data for each in h])
    output, h = model(inputs, h)
    return output.item()

We run the predict_text() function on a given review and compare the predicted sentiment with the actual label.

index = 32
print(df['review'][index])
print('='*70)
print(f'Actual sentiment is  : {df["sentiment"][index]}')
print('='*70)
pro = predict_text(df['review'][index])
status = "positive" if pro > 0.5 else "negative"
pro = (1 - pro) if status == "negative" else pro
print(f'Predicted sentiment is {status} with a probability of {pro}')

My first exposure to the Templarios & not a good one. I was excited to find this title among the offerings from Anchor Bay Video, which has brought us other cult classics such as "Spider Baby". The print quality is excellent, but this alone can't hide the fact that the film is deadly dull. There's a thrilling opening sequence in which the villagers exact terrible revenge on the Templars (& set the whole thing in motion), but everything else in the movie is slow, ponderous &, ultimately, unfulfilling. Adding insult to injury: the movie was dubbed, not subtitled, as promised on the video jacket.


Actual sentiment is: negative

predicted sentiment is negative with a probability of 0.9017044752836227
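Since matplotlib and seaborn were imported at the start, we can also look back at how the loss and accuracy evolved over the epochs. The following is a minimal plotting sketch of our own, using the per-epoch lists collected during training:

# Optional visualization (our own sketch): plot the per-epoch statistics
# collected in epoch_tr_acc / epoch_vl_acc and epoch_tr_loss / epoch_vl_loss.
sns.set_style('whitegrid')  # seaborn styling for the plots

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(epoch_tr_acc, label='Train accuracy')
ax1.plot(epoch_vl_acc, label='Validation accuracy')
ax1.set_xlabel('Epoch')
ax1.set_title('Accuracy')
ax1.legend()

ax2.plot(epoch_tr_loss, label='Train loss')
ax2.plot(epoch_vl_loss, label='Validation loss')
ax2.set_xlabel('Epoch')
ax2.set_title('Loss')
ax2.legend()

plt.show()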


To find the original code of the example used in this tutorial, check out the Sentiment analysis using LSTM - PyTorch code on Kaggle. 


Using Hugging Face Datasets for Text Classification

We have now explained a bit about PyTorch, one of the most well-known machine learning libraries out there. We've also touched on what Hugging Face is, what text classification is, and what natural language processing (NLP) is.

Then we moved on to a practical machine-learning example written in Python. In this example, we used an LSTM model built with the PyTorch package to perform sentiment analysis on movie reviews. We explained how to import the necessary libraries, how to import the required dataset, the filtering and splitting of the dataset, tokenization, and the training and evaluation of our model.

Whether you are a machine learning expert or new to the field, adding a subfield such as NLP to your list of skills will definitely pay off! If you are building AI on a cloud platform and are looking for your own on-premise solution, SabrePC offers high-performance computing solutions for deep learning and AI. Whether you are sourcing a competent workstation or a fully-fledged server, contact SabrePC!


Tags

hugging face

pytorch

tutorial


