IMDB case study

Alessandro Aere

22nd May 2020

Case study: sentiment analysis

Problem description

A comparison of models will be presented using the IMDB dataset, a collection of movie reviews.

More details regarding the dataset can be found here.

This report includes only the code for the neural networks, not for the other statistical models, since they are not the focus here. The comparison of all results, however, can be found in the conclusion.

Import the Keras library

Run the code below if you wish to set the seed and obtain reproducible results.
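A minimal sketch of the setup, assuming the keras R package; the seed value is illustrative, and use_session_with_seed additionally disables the GPU and CPU parallelism, both of which are needed for fully reproducible results:

library(keras)

# Illustrative seed; disabling GPU and parallel CPU is what makes runs reproducible
use_session_with_seed(42)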

Data ingestion

Next, we define two variables needed for sentiment analysis, and for text mining in general:
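A sketch of this step, assuming the two variables are the vocabulary size and the maximum review length. The vocabulary size of 5000 is consistent with the parameter counts in the model summaries below; the max_len value is a hypothetical choice:

max_features <- 5000   # vocabulary size: keep only the 5000 most frequent words
max_len <- 500         # maximum review length, in words (illustrative value)

imdb <- dataset_imdb(num_words = max_features)
x_train <- imdb$train$x
y_train <- imdb$train$y
x_test <- imdb$test$x
y_test <- imdb$test$y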

Data preparation

Generally, the first preparation step is tokenization, i.e. mapping each word to an integer value. This step is not necessary here, because the sequences are already tokenized.

Next, we define a function that will be useful later in the analysis: vectorize_sequences. It transforms the entire dataset, performing the dichotomization (multi-hot encoding) of the words in the sentences.
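A sketch of one common implementation of this function, following the multi-hot encoding described above; the dimension argument defaults to the vocabulary size defined earlier:

vectorize_sequences <- function(sequences, dimension = max_features) {
  # One row per review; column j is set to 1 if word index j occurs in the review
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences))
    results[i, sequences[[i]]] <- 1
  results
}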

Deep Neural Network

Specify the neural network’s architecture:
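The layer sizes below are reconstructed from the summary that follows; the activation functions are assumptions (ReLU for the hidden layers, sigmoid for the binary output):

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(max_features)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

summary(model)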

## Model
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 32)                      160032      
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 16)                      528         
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 1)                       17          
## ================================================================================
## Total params: 160,577
## Trainable params: 160,577
## Non-trainable params: 0
## ________________________________________________________________________________

Compile the model:
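A plausible compilation step: the optimizer is an assumption, while binary cross-entropy is the natural loss for a two-class problem:

model %>% compile(
  optimizer = "rmsprop",          # assumed optimizer
  loss = "binary_crossentropy",   # binary target: positive vs. negative review
  metrics = c("accuracy")
)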

Now we train the neural network. As input to the network, we provide the dichotomized sequences, obtained by applying the vectorize_sequences function to the x_train dataset.
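A sketch of the training call; the number of epochs, the batch size and the validation split are illustrative values:

history <- model %>% fit(
  vectorize_sequences(x_train), y_train,
  epochs = 20,             # illustrative
  batch_size = 512,        # illustrative
  validation_split = 0.2   # hold out 20% of the training data for validation
)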

The plot below shows the loss function and the accuracy as a function of the number of epochs.
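In the keras R package, the history object returned by fit can be plotted directly:

plot(history)   # loss and accuracy per epoch, for training and validation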

Evaluate the model on the test set:
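A sketch of the evaluation, matching the printed output below (note that in older keras versions the accuracy entry is named acc rather than accuracy):

scores <- model %>% evaluate(vectorize_sequences(x_test), y_test, verbose = 0)
print(paste("Loss on test data:", scores[["loss"]]))
print(paste("Accuracy on test data:", scores[["accuracy"]]))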

## [1] "Loss on test data: 0.592347383499146"
## [1] "Accuracy on test data: 0.860759973526001"

Recurrent Neural Network (LSTM)

Specify the neural network’s architecture:

Neural network architecture example composed of a word embedding layer and an LSTM layer.
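As above, the layer sizes are reconstructed from the summary below (embedding dimension 128, 32 LSTM units); the sigmoid output activation is an assumption:

model_lstm <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_features, output_dim = 128) %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")

summary(model_lstm)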

## Model
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## embedding (Embedding)               (None, None, 128)               640000      
## ________________________________________________________________________________
## lstm (LSTM)                         (None, 32)                      20608       
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 1)                       33          
## ================================================================================
## Total params: 660,641
## Trainable params: 660,641
## Non-trainable params: 0
## ________________________________________________________________________________

Compile the model:
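Compilation is analogous to the dense network (the optimizer is again an assumption):

model_lstm %>% compile(
  optimizer = "rmsprop",          # assumed optimizer
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)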

Now it’s time to train the neural network. Before feeding the sequences to the network, we apply the pad_sequences function, which enforces a maximum length on the sequences, in this case equal to max_len. Longer sequences are truncated, while sequences shorter than max_len are padded: elements equal to 0 (zero) are appended to the end of the sequence until its length reaches max_len. This step is necessary because the neural network requires all input sequences to have the same length.
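A sketch of padding and training: padding = "post" appends the zeros at the end of each sequence, as described above, and the epoch count and batch size are illustrative:

x_train_pad <- pad_sequences(x_train, maxlen = max_len, padding = "post")

history_lstm <- model_lstm %>% fit(
  x_train_pad, y_train,
  epochs = 10,        # illustrative
  batch_size = 128,   # illustrative
  validation_split = 0.2
)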

The plot below shows the loss function and the accuracy as a function of the number of epochs.
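As before, the training history can be plotted directly:

plot(history_lstm)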

Evaluate the model on the test set:
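Evaluation mirrors the dense network, except that the test sequences must be padded in the same way as the training data:

x_test_pad <- pad_sequences(x_test, maxlen = max_len, padding = "post")
scores_lstm <- model_lstm %>% evaluate(x_test_pad, y_test, verbose = 0)
print(paste("Loss on test data:", scores_lstm[["loss"]]))
print(paste("Accuracy on test data:", scores_lstm[["accuracy"]]))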

## [1] "Loss on test data: 0.309290438890457"
## [1] "Accuracy on test data: 0.883960008621216"

Results

Model                       Accuracy
Recurrent neural network       88.4%
Lasso regression               87.0%
Logistic regression            86.2%
Deep neural network            86.1%
Random forest                  84.7%
Bagging                        77.0%
Adaboost                       72.5%
Gradient boosting              70.1%