Problem Statement
Here we will see how the gist of a piece of text can be generated automatically in a precise, smaller number of words. I have used the Kaggle dataset (https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/home) for this case, and an RNN (LSTM encoder-decoder) architecture to generate summaries of the reviews. All the work is done in a Jupyter notebook inside a conda environment, with Python 3.7 and Keras. The purpose of this project is to learn how an RNN (LSTM) seq2seq model works; this type of problem can be extrapolated to many others, be it document summarization in finance, healthcare or elsewhere.
Dataset Description
The dataset is Women's E-Commerce Clothing Reviews.
This dataset includes 23,486 rows and 10 feature variables. Each row corresponds to a customer review and includes the variables:
Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
Age: Positive Integer variable of the reviewer's age.
Title: String variable for the title of the review.
Review Text: String variable for the review body.
Rating: Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.
Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
Division Name: Categorical name of the product high level division.
Department Name: Categorical name of the product department name.
Class Name: Categorical name of the product class name.
We will be using the Title and Review Text columns for summarization.
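To make this concrete, here is a minimal loading sketch. It assumes the Kaggle CSV has been downloaded locally; the file name below is the one the Kaggle page ships, so adjust the path if yours differs.

```python
import pandas as pd

# Load the downloaded Kaggle file (adjust the path/name if needed)
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")

# Keep only the two columns used for summarization:
# "Review Text" is the model input, "Title" serves as the reference summary
df = df[["Review Text", "Title"]]
print(df.shape)   # roughly (23486, 2) before any cleaning
```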
Theory Behind the Encoder-Decoder Sequential Model
LSTM Cell
Here 'h' represents the hidden state, 'c' represents the cell state, and 't' represents the current time step.
The LSTM has three parts: the forget gate, the update gate and the output gate.
How does the cell work?
Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing through," while a value of one means "let everything through."
Forget Gate
This gate handles information that needs to be forgotten. If the context changes, for example a gender change (male to female) or a switch from plural to singular or vice versa, the cell needs to register that the context has changed. The forget gate plays its role by "not transferring" the old male/plural information to the cell state. It uses the sigmoid function, which outputs values between 0 and 1: 0 means completely forget and 1 means retain the whole information. Wf is a matrix that combines the two weight matrices for xt and ht-1, and bf is the bias added.
Update Gate
It has an input layer that decides which values we need to update (like from male to female or plural to singular). The input is passed through a sigmoid layer, just as in the forget gate, which outputs values between 0 and 1; this is multiplied by the candidate cell state c~, which holds the new information to be stored in the cell state. The previous cell state ct-1 (after the forget gate) and this candidate c~ are then added to form the current cell state ct, which is passed on as ct-1 for the next time step.
Output Gate
To decide what goes to the output, an output sigmoid gate is defined, in the same way as the forget gate. It is responsible for passing on the information that the context has changed, so it outputs the information relevant to the "new" context. Its result is multiplied by tanh(ct) and emitted as ht. For the final output, a dense layer is applied to each time step or only to the last time step, depending on whether it is a many-to-many or many-to-one structure.
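To make the gate mechanics concrete, here is a minimal NumPy sketch of a single LSTM time step. The shapes (100-dimensional input embedding, 300-dimensional hidden state) match the numbers used later in this write-up; the weight values are random placeholders, not trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W/b hold the weights of the forget (f),
    update/candidate (u), input (i) and output (o) gates."""
    concat = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]

    f_t = sigmoid(W["f"] @ concat + b["f"])         # forget gate: what to drop from c_{t-1}
    i_t = sigmoid(W["i"] @ concat + b["i"])         # input gate: how much of the candidate to keep
    c_hat = np.tanh(W["u"] @ concat + b["u"])       # candidate cell state (new information)
    o_t = sigmoid(W["o"] @ concat + b["o"])         # output gate: what to expose as h_t

    c_t = f_t * c_prev + i_t * c_hat                # updated cell state
    h_t = o_t * np.tanh(c_t)                        # hidden state passed to the next time step
    return h_t, c_t

# Toy shapes matching the text: 100-dim word embeddings, 300-dim hidden/cell state
hidden, emb = 300, 100
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(hidden, hidden + emb)) * 0.01 for g in "fiuo"}
b = {g: np.zeros(hidden) for g in "fiuo"}
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=emb), h, c, W, b)
```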
Encoder-Decoder Sequential Model
Summary of the above model
Both the encoder and decoder have LSTM cells as units
The encoder reads the input text as a sequence of time steps; an embedding layer converts the words into word embeddings, and the information accumulated over the defined number of time steps is passed as the initial vectors for the decoder: c (cell state) and h (hidden state)
The decoder model then predicts the output at every time step, where the output at the previous time step is fed as the input to the next time step to predict the next output
The outputs at every time step from the decoder model are then combined to form the resultant summary
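A minimal Keras sketch of this encoder-decoder wiring is shown below. The sizes (50 input time steps, 100-dimensional embeddings, 300 LSTM units, vocabularies of 5001 and 3001) are the ones used later in this write-up, and the two TimeDistributed dense layers on top of the decoder are the ones described in the model development section.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed

# Assumed sizes, taken from the later sections of this write-up
max_text_len = 50
x_voc, y_voc, emb_dim, latent_dim = 5001, 3001, 100, 300

# Encoder: reads the review and hands its final states (h, c) to the decoder
encoder_inputs = Input(shape=(max_text_len,))
enc_emb = Embedding(x_voc, emb_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: starts from the encoder states and predicts one summary word per time step
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(y_voc, emb_dim)(decoder_inputs)
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])

# Two TimeDistributed dense layers: tanh(300) followed by a softmax over the summary vocab
decoder_outputs = TimeDistributed(Dense(latent_dim, activation="tanh"))(decoder_outputs)
decoder_outputs = TimeDistributed(Dense(y_voc, activation="softmax"))(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()
```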
Text Cleaning and Preprocessing
Before the data is split into training and testing sets, we need to remove the considerable amount of noise it contains so that our model can predict better.
- Missing data
There is some missing data in both Review Text and Title; these rows need to be removed. I have used the "dropna" function, which removes every row that has at least one missing value.
- Noise
There is noise in the form of special characters, mixed capital/small letters, punctuation, numbers, contractions (e.g. 'isn't' in place of 'is not'), one-letter words, stop words etc., which needs to be removed in order to improve the quality of the data our model will be fed.
Here is the text-cleaning part (a sketch is shown below).
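A cleaning function along these lines could look as follows. The contraction map here is a tiny hypothetical one (the real notebook would use a much longer list), `df` is assumed to come from the loading sketch above, and whether to drop stop words from the summaries is a design choice.

```python
import re
from nltk.corpus import stopwords   # requires nltk.download('stopwords') once

stop_words = set(stopwords.words("english"))

# Tiny illustrative contraction map; a full map would cover many more short forms
contractions = {"isn't": "is not", "don't": "do not", "can't": "cannot", "it's": "it is"}

def clean_text(text, remove_stopwords=True):
    text = text.lower()
    text = " ".join(contractions.get(w, w) for w in text.split())   # expand short forms
    text = re.sub(r"[^a-z\s]", " ", text)                           # drop punctuation, digits, symbols
    words = [w for w in text.split() if len(w) > 1]                 # drop one-letter words
    if remove_stopwords:
        words = [w for w in words if w not in stop_words]
    return " ".join(words)

df = df.dropna()   # drop rows with a missing Title or Review Text
df["cleaned_text"] = df["Review Text"].apply(lambda t: clean_text(t, remove_stopwords=True))
df["cleaned_summary"] = df["Title"].apply(lambda t: clean_text(t, remove_stopwords=False))
```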
- Defining a fixed length for the encoder input and the decoder output; in short, defining the time steps for the encoder and the decoder. Looking at the length histograms, we can fix the input time steps at 50 and the output time steps at 7. This means each input example will have a maximum length of 50: if its length is < 50 it will be "post-padded", and if its length is > 50 it will be truncated; the same applies to the output. So a summary will have a maximum length of 7, and it can be shorter than 7 as well.
- Tokenization and Padding
Tokenize the text and summary parts by importing Tokenizer from keras.preprocessing.text. This is done because the model only understands numbers, so the tokenizer builds a vocabulary of all the unique words present in the text and summary corpora and creates a word-index mapping. Here I have chosen a vocabulary size of 5000 for the text part and 3000 for the summary part; these numbers keep the top words by frequency in the respective corpus. You can also choose to include all the unique words by taking len(x_tokenizer.word_index) + 1, where x_tokenizer is fit on the training 'text' data; the same goes for the 'summary' data. Post-padding is then done to make the input and output a fixed size (50 and 7 respectively), as sketched below.
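A sketch of the tokenization and padding step, assuming the cleaned columns from the sketch above are available and a 90/10 train/validation split:

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

max_text_len, max_summary_len = 50, 7

# 90/10 split of the cleaned review texts and summaries
x_tr, x_val, y_tr, y_val = train_test_split(
    df["cleaned_text"], df["cleaned_summary"], test_size=0.1, random_state=0)

# Text side: keep the 5000 most frequent words
x_tokenizer = Tokenizer(num_words=5000)
x_tokenizer.fit_on_texts(x_tr)
x_tr_seq = pad_sequences(x_tokenizer.texts_to_sequences(x_tr),
                         maxlen=max_text_len, padding="post", truncating="post")

# Summary side: keep the 3000 most frequent words
y_tokenizer = Tokenizer(num_words=3000)
y_tokenizer.fit_on_texts(y_tr)
y_tr_seq = pad_sequences(y_tokenizer.texts_to_sequences(y_tr),
                         maxlen=max_summary_len, padding="post", truncating="post")
```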
Model Development
How does the model work?
In the picture above, u1, u2, ..., uT represent the inputs given to the model, with T = 50 in our case. c represents the cell state and h the hidden state; c0 is initialized to zero. There is an embedding layer before the input goes to the LSTM units; we have chosen the number of dimensions to be 100, so each word becomes a 100-dimensional vector. From the embedding layer the input goes to the LSTM unit, where it passes through the forget, update, input and output gates as explained in the LSTM section above. Each of (Wf, bf), (Wo, bo), (Wu, bu) and (Wi, bi) is a weight matrix W and bias b of the corresponding neural layer, shared across all time steps. cT and hT are then passed as the initial states to the decoder network. The decoder behaves differently in the training and inference phases.
Training Phase: Here the actual (ground-truth) output is fed as the decoder input at each time step to make the model learn faster (teacher forcing). The error is then backpropagated through time.
Inference Phase: Here the predicted output from the previous time step is fed as the input for the next time step in the decoder model. We take the word with the maximum probability at each step (greedy algorithm).
"Start" token is given as initial input for the decoder model to start producing the output both in the training and inference phase. Output from the LSTM layer is then passed to first a Time Distributed Dense layer (tanh) and then again to Time Distributed Dense layer (softmax). Time distributed layer applies the same activation function at every time step. Dense function is applied to make the output in the desired vector. In our case we have applied 2 dense layers. So hidden state from each LSTM unit in decoder model is passed through first a dense layer of tanh (300 neuron layer) and then a softmax layer (y_vocab=3001; number of neurons to predict each word in in the vocab with their probablities). Then the word with maximum probability is taken as the output for that time step.
Structure of the model and number of parameters calculation
Encoder Part
Embedding layer
It is the word embedding layer that takes an input of 50 words (the step size) as one example. Its weight matrix has the shape (word vocab length, no. of dimensions) = (5001, 100), so the number of weights to be trained equals 5001 * 100 = 500100.
LSTM Layer
The output from the embedding layer goes to the LSTM layer. We have four gates; let's calculate the parameters for one of them. The forget gate has two inputs, one from the previous hidden state and one from the embedding layer. The weight matrix Wf of the forget gate and its bias are the parameters that need to be trained. The weight matrix consists of two matrices, one for the hidden state (shape (300, 300)) and one for the input (shape (300, 100)), both feeding the 300-neuron layer, and a bias parameter is added for every neuron (300). Total trainable parameters for the forget gate: 300 * 300 + 300 * 100 + 300 = 120300. There are four gates, so the total number of parameters for the LSTM layer in the encoder is 4 * 120300 = 481200.
Output from the last LSTM unit (last time step) is then passed as the initial state to the decoder part as c'0
Decoder Part
Embedding layer
It is the word embedding layer that takes as input the output token from the previous time step. Its weight matrix has the shape (word vocab length, no. of dimensions) = (3001, 100), so the number of weights to be trained equals 3001 * 100 = 300100. Here the maximum step size (T') is 7, and if the previous output token is "end", the decoder stops producing output.
LSTM Layer
Same as the LSTM layer in the encoder part (481200 trainable parameters).
Time Distributed Dense Layer 1
Here the activation function used is tanh. It is a 300-neuron dense layer. The output from the LSTM layer at every time step is fed to this dense layer, and its output has the shape (batch size, time steps, 300). The trainable parameters are the weight matrix and the bias added for each neuron: 300 * (300 + 1) = 90300.
Time Distributed Dense Layer 2
Here the activation function used is softmax. It is a y_vocab (3001)-neuron dense layer. The output of the previous dense layer at every time step is fed to it, and its output has the shape (batch size, time steps, y_vocab). The trainable parameters are the weight matrix and the bias added for each neuron: 3001 * (300 + 1) = 903301. The output at each time step is the word in the y_vocab with the maximum probability (greedy search).
Total Trainable Parameters = 2756201
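The same total can be reproduced with a few lines of arithmetic:

```python
# Re-deriving the parameter counts described above
emb_dim, latent_dim, x_voc, y_voc = 100, 300, 5001, 3001

enc_embedding = x_voc * emb_dim                                                # 500100
enc_lstm = 4 * (latent_dim * latent_dim + latent_dim * emb_dim + latent_dim)   # 481200
dec_embedding = y_voc * emb_dim                                                # 300100
dec_lstm = enc_lstm                                                            # 481200
dense_tanh = latent_dim * (latent_dim + 1)                                     # 90300
dense_softmax = y_voc * (latent_dim + 1)                                       # 903301

total = enc_embedding + enc_lstm + dec_embedding + dec_lstm + dense_tanh + dense_softmax
print(total)   # 2756201
```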
A little on the input shape to LSTM layer
For the encoder model
Batch size = 32
Time Steps = 50
Seq_len = 100; length of each word embedding
For the decoder model
Training Phase
Batch Size = 32
Time Steps = 7
Seq_len = 100; each word embedding dimension
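In other words, after the embedding layer the LSTM layers see tensors laid out as (batch size, time steps, embedding dimension):

```python
import numpy as np

# Batch layouts as seen by the LSTM layers after the embedding step
encoder_batch = np.zeros((32, 50, 100))   # 32 reviews, 50 time steps, 100-dim word vectors
decoder_batch = np.zeros((32, 7, 100))    # 32 summaries, 7 time steps, 100-dim word vectors
print(encoder_batch.shape, decoder_batch.shape)
```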
Training and Results
The data is split in a 90/10 ratio. RMSprop is used as the optimizer. Epochs is set to 50. Patience is set to 2 (meaning that if val_loss has not decreased after 2 epochs, training stops). Batch size is 32.
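A training sketch with these settings is shown below. The array names come from the preprocessing sketches above, a validation set (x_val_seq, y_val_seq) is assumed to be prepared the same way, start/end tokens are assumed to have been added to the summaries, and sparse categorical cross-entropy is assumed as the loss since the targets are integer word indices.

```python
from keras.callbacks import EarlyStopping

model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")

# Stop if val_loss has not improved for 2 consecutive epochs
early_stop = EarlyStopping(monitor="val_loss", patience=2)

history = model.fit(
    [x_tr_seq, y_tr_seq[:, :-1]],                                         # decoder input: summary shifted by one
    y_tr_seq.reshape(y_tr_seq.shape[0], y_tr_seq.shape[1], 1)[:, 1:],     # target: the next summary word
    epochs=50,
    batch_size=32,
    callbacks=[early_stop],
    validation_data=([x_val_seq, y_val_seq[:, :-1]],
                     y_val_seq.reshape(y_val_seq.shape[0], y_val_seq.shape[1], 1)[:, 1:]))
```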
Here are the results on some random reviews taken from the validation set. The predicted summary is the one produced by our model, and the actual summary is the one present in the dataset.
PS: A lot of training runs were needed to find the parameters that give a decent summary. Initially the model was not performing well, perhaps because the noise was not removed properly and because of the varying batch size. I also found out that while the loss should be as low as possible, a minimum loss does not guarantee a good summary, so another metric is needed to judge this (BLEU score).
Possible Improvements
More data
Hyperparameter tuning
Changing the model structure, e.g. adding more LSTM layers or dense layers
Trying an attention mechanism, as it helps when the documents used for training are long