News Summarizer
Table of Contents
1. What is Text Summarization in NLP?
2. Sequence-to-Sequence (Seq2Seq) Modeling
3. Encoder – Decoder Architecture
4. Limitations of the Encoder – Decoder Architecture
5. Attention Mechanism
6. Problem Statement
7. Implementing a Text Summarization Model in Python using Keras
8. Demo Video
9. Future Work
10. References
What is Text Summarization in NLP?
Text summarization is the process of producing a concise and meaningful summary from a given text while preserving the key information and overall meaning of the text.
There are two types of text summarization techniques:
i. Extractive summarization
ii. Abstractive summarization
1) Extractive Summarization:
In extractive summarization, we identify only the important parts of the text, such as sentences or phrases, and extract those parts. The extracted parts are then stacked together to create the summary.
Text: Messi and Ronaldo have better records than their counterparts. Performed exceptionally across all competitions. They are considered as the best in our generation.
Extractive summary: Messi and Ronaldo have better records than their counterparts. Best in our generation.
2) Abstractive Summarization:
In abstractive summarization, we build a summary by rephrasing the text or using new words. These methods use advanced NLP techniques for text summarization.
Text: Messi and Ronaldo have better records than their counterparts.
Performed exceptionally across all competitions. They are considered as the
best in our generation.
Abstractive summary: Messi and Ronaldo have better records than their counterparts, so they are considered as the best in our generation.
In this case study, we focus on the abstractive summarization technique and use an encoder-decoder architecture to solve it.
Sequence-to-Sequence (Seq2Seq) Modeling
In the text summarization problem, the input is a sequence of words and the output is also a sequence, so we can use a sequence-to-sequence model.
Our objective is to build a text
summarizer where the input is a long sequence of words (in a text body), and
the output is a short summary (which is a sequence as well). So, we can
model this as a Many-to-Many Seq2Seq problem. Below is a typical
Seq2Seq model architecture:
There are two major components of a
Seq2Seq model:
- Encoder
- Decoder
Encoder-Decoder Architecture
The Encoder-Decoder architecture is
mainly used to solve the sequence-to-sequence (Seq2Seq) problems where the
input and output sequences are of different lengths.
Generally, variants of Recurrent Neural Networks (RNNs), i.e. the Gated Recurrent Unit (GRU) or Long Short-Term Memory (LSTM), are preferred as the encoder and decoder components. This is because they are capable of capturing long-term dependencies by overcoming the problem of vanishing gradients.
We can set up the Encoder-Decoder in 2 phases:
- Training phase
- Inference phase
Training phase
In the training phase, we will first set up the encoder and decoder. We will then train the model to predict the target sequence offset by one timestep. Let us see in detail on how to set up the encoder and decoder.
Encoder
An encoder Long Short-Term Memory (LSTM) model reads the entire input sequence: at each timestep, one word is fed into the encoder. The encoder processes the information at every timestep and captures the contextual information present in the input sequence.
I’ve
put together the below diagram which illustrates this process:
The hidden state (hi) and cell state (ci) of the last time step are used to initialize the decoder.
Remember, this is because the encoder and decoder are two different sets of the
LSTM architecture.
Decoder
The decoder is also an LSTM
network which reads the entire target sequence word-by-word and predicts the
same sequence offset by one timestep. The decoder is trained to predict the
next word in the sequence given the previous word.
<start> and <end> are the special tokens which are added to the target sequence before feeding it into the decoder. The target sequence is unknown while decoding the test sequence. So, we start predicting the target sequence by passing the first word into the decoder which would be always the <start> token. And the <end> token signals the end of the sentence.
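As a minimal sketch of this step (assuming the cleaned summaries are stored in a pandas Series; 'sostok' and 'eostok' are illustrative token names, not fixed by this project):

```python
# Wrap every target summary with special start/end tokens before it is fed
# to the decoder. 'sostok' and 'eostok' are illustrative token names; any
# strings that never occur in the real vocabulary would work.
data['summary'] = data['summary'].apply(lambda x: 'sostok ' + x + ' eostok')
```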
Inference Phase
After training, the model is
tested on new source sequences for which the target sequence is unknown. So, we
need to set up the inference architecture to decode a test sequence:
Here are the steps to decode the
test sequence:
1. Encode the entire input sequence and initialize
the decoder with internal states of the encoder
2. Pass <start> token as an input to the decoder
3. Run the decoder for one timestep with the
internal states
4. The output will be the probability for the next
word. The word with the maximum probability will be selected
5. Pass the sampled word as an input to the decoder
in the next timestep and update the internal states with the current time step
6. Repeat steps 3 – 5 until we generate <end> token
or hit the maximum length of the target sequence
Let’s take an example where the
test sequence is given by [x1, x2, x3, x4]. How will the inference process work for this test sequence? I
want you to think about it before you look at my thoughts below.
1. Encode the test sequence into internal state vectors
2. Observe how the decoder predicts the target sequence at each timestep:
Timestep: t=1
Timestep: t=2
And, Timestep: t=3
Limitations of the Encoder – Decoder Architecture
As useful as this encoder-decoder architecture is, there are
certain limitations that come with it.
- The encoder converts the entire input sequence into a fixed-length vector and then the decoder predicts the output sequence. This works only for short sequences, since the decoder looks at this single vector for all of its predictions.
- Here comes the problem with long sequences: it is difficult for the encoder to memorize a long sequence into a fixed-length vector.
Attention Mechanism
How much attention do we need to pay to every word in the input
sequence for generating a word at timestep t?
That’s the key intuition behind this attention mechanism concept.
Let’s consider a
simple example to understand how Attention Mechanism works:
- Source sequence: “Which sport do you like the most?”
- Target sequence: “I love cricket”
The first word 'I' in the target sequence is connected to the fourth word 'you' in the source sequence. Similarly, the second word 'love' in the target sequence is connected to the fifth word 'like' in the source sequence.
So, instead of looking at all the words in the source sequence,
we can increase the importance of specific parts of the source sequence that
result in the target sequence. This is the
basic idea behind the attention mechanism.
There are two types of attention:
- Global Attention
- Local Attention
Global Attention:
Here, the attention
is placed on all the source positions. In other words, all the hidden states of the encoder are considered for deriving
the attended context vector:
Source: Effective Approaches to Attention-based Neural Machine Translation (2015)
Local Attention:
Here, the attention is placed on only a few source positions. Only a few hidden states of the encoder are considered for deriving the attended context vector:
Source: Effective Approaches to Attention-based Neural Machine Translation (2015)
Problem Statement
Our objective here is to generate a news summary.
You can download the
dataset from here.
Implementing Text Summarization in Python using Keras
Custom Attention Layer
Keras does not officially provide an attention layer. So, we can either implement our own attention layer or use a third-party implementation. We will go with the latter option for this article.
Import the Libraries
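A minimal set of imports that the pipeline sketched in the rest of this post would need (the `attention` import refers to a third-party AttentionLayer implementation, so the import path is an assumption):

```python
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

# Third-party attention layer (e.g. a public Bahdanau-style AttentionLayer);
# this import path is an assumption about how it is packaged.
from attention import AttentionLayer
```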
Read the dataset
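A sketch of loading the data with pandas; the file name and the column names ('text' and 'headlines') are assumptions about how the dataset is laid out:

```python
# Load the news dataset and keep only the article text and its headline.
data = pd.read_csv("news_summary.csv", encoding="latin-1")  # hypothetical file name
data = data[["text", "headlines"]].dropna().drop_duplicates()
print(data.shape)
```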
Text Cleaning
We will perform the below preprocessing tasks on our data (a sketch of the cleaning function follows the list):
- Convert everything to lowercase
- Remove HTML tags
- Contraction mapping
- Remove (‘s)
- Remove any text inside the parenthesis ( )
- Eliminate punctuations and special characters
- Remove stopwords
- Remove short words
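A minimal sketch of such a cleaning function; the contraction_mapping shown here is only an illustrative subset:

```python
contraction_mapping = {"ain't": "is not", "can't": "cannot", "won't": "will not"}  # illustrative subset
stop_words = set(stopwords.words('english'))

def clean_text(text, remove_stopwords=True):
    text = text.lower()                                       # lowercase
    text = BeautifulSoup(text, "html.parser").text            # remove HTML tags
    text = re.sub(r'\([^)]*\)', '', text)                     # remove text inside parentheses
    text = ' '.join(contraction_mapping.get(w, w) for w in text.split())  # contraction mapping
    text = re.sub(r"'s\b", '', text)                          # remove ('s)
    text = re.sub(r'[^a-zA-Z ]', '', text)                    # punctuation and special characters
    words = text.split()
    if remove_stopwords:
        words = [w for w in words if w not in stop_words]     # remove stopwords
    words = [w for w in words if len(w) > 1]                  # remove short words
    return ' '.join(words)

# Stopwords are kept in the summaries so they stay readable.
cleaned_text = [clean_text(t) for t in data['text']]
cleaned_summary = [clean_text(t, remove_stopwords=False) for t in data['headlines']]
```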
EDA
Looking at the length distributions, we can fix the maximum length of the text at 80, since that covers the majority of texts. Similarly, we can set the summary length to 10.
Word cloud for text
Observations:
1) Words like 'added' and 'said' are among the most common words in the text.
2) Names of famous personalities and politicians like Modi and Trump, as well as public places, days, and states, are also common in the news.
Word cloud for Headline
Observations:
From the word cloud we cannot say that any specific word dominates the headlines, but names of public places, cities, teams, etc. are among the most frequent words.
Maximum length
We can fix the maximum length of the text at 80 since that seems to cover the majority of the articles. Similarly, we can set the maximum summary length to 10:
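Continuing the sketch, we filter the pairs to these maximum lengths and add the start/end tokens introduced in the decoder section:

```python
max_text_len = 80
max_summary_len = 10

# Keep only the pairs whose text and summary fit within the chosen limits
# (the summary limit leaves room for the two special tokens).
short_text, short_summary = [], []
for t, s in zip(cleaned_text, cleaned_summary):
    if len(t.split()) <= max_text_len and len(s.split()) <= max_summary_len - 2:
        short_text.append(t)
        short_summary.append(s)

df = pd.DataFrame({'text': short_text, 'summary': short_summary})
# Start/end tokens for the decoder (same illustrative names as before).
df['summary'] = df['summary'].apply(lambda x: 'sostok ' + x + ' eostok')
```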
Train test split
We’ll use 80% of the dataset as the training data and evaluate the performance on the remaining 20%:
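An 80/20 split, continuing from the filtered DataFrame above:

```python
# Hold out 20% of the text/summary pairs for validation.
x_tr, x_val, y_tr, y_val = train_test_split(
    df['text'], df['summary'], test_size=0.2, random_state=0, shuffle=True)
```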
Tokenizer
A tokenizer builds
the vocabulary and converts a word sequence to an integer sequence.
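A sketch of the two tokenizers (one for the articles, one for the summaries), fit on the training split only:

```python
# Article tokenizer: build the vocabulary and convert words to padded integer sequences.
x_tokenizer = Tokenizer()
x_tokenizer.fit_on_texts(list(x_tr))
x_tr_seq = pad_sequences(x_tokenizer.texts_to_sequences(x_tr), maxlen=max_text_len, padding='post')
x_val_seq = pad_sequences(x_tokenizer.texts_to_sequences(x_val), maxlen=max_text_len, padding='post')
x_voc = len(x_tokenizer.word_index) + 1

# Summary tokenizer.
y_tokenizer = Tokenizer()
y_tokenizer.fit_on_texts(list(y_tr))
y_tr_seq = pad_sequences(y_tokenizer.texts_to_sequences(y_tr), maxlen=max_summary_len, padding='post')
y_val_seq = pad_sequences(y_tokenizer.texts_to_sequences(y_val), maxlen=max_summary_len, padding='post')
y_voc = len(y_tokenizer.word_index) + 1
```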
Model Building
We are finally at the model building part. But before we do
that, we need to familiarize
ourselves with a few terms which are required prior to building the model.
- Return Sequences=True: When the return_sequences parameter is set to True, the LSTM produces the hidden state for every timestep
- Return State=True: When return_state is set to True, the LSTM produces the hidden state and cell state of the last timestep only
- Initial State: This is used to initialize the internal states of the LSTM for the first timestep
- Stacked LSTM: A stacked LSTM has multiple LSTM layers stacked on top of each other. This leads to a better representation of the sequence. I encourage you to experiment with multiple layers of the LSTM stacked on top of each other (it’s a great way to learn this)
Here, we are building a 3-layer stacked LSTM for the encoder:
Embedding:
Encoder:
Decoder:
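A minimal sketch of the embedding, encoder, and decoder described above; latent_dim and embedding_dim are illustrative choices, and AttentionLayer is the third-party layer mentioned earlier (its exact call signature is an assumption):

```python
latent_dim = 300      # LSTM hidden size (illustrative)
embedding_dim = 100   # word embedding size (illustrative)

# ----- Embedding + Encoder: 3 stacked LSTM layers -----
encoder_inputs = Input(shape=(max_text_len,))
enc_emb = Embedding(x_voc, embedding_dim, trainable=True)(encoder_inputs)

encoder_output1, state_h1, state_c1 = LSTM(latent_dim, return_sequences=True, return_state=True)(enc_emb)
encoder_output2, state_h2, state_c2 = LSTM(latent_dim, return_sequences=True, return_state=True)(encoder_output1)
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_sequences=True, return_state=True)(encoder_output2)

# ----- Decoder, initialised with the final encoder states -----
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc, embedding_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)

decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# Attention over the encoder outputs (third-party layer; assumed interface).
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
decoder_concat_input = Concatenate(axis=-1)([decoder_outputs, attn_out])

# Softmax over the summary vocabulary at every decoder timestep.
decoder_dense = TimeDistributed(Dense(y_voc, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()
```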
Model Training:
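Training under the assumptions above: sparse categorical cross-entropy with teacher forcing, where the decoder sees the summary and the target is the same summary shifted by one timestep. The optimizer, batch size, and epoch count are illustrative:

```python
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

# Stop once the validation loss stops improving.
es = EarlyStopping(monitor='val_loss', mode='min', patience=2, verbose=1)

history = model.fit(
    [x_tr_seq, y_tr_seq[:, :-1]],
    y_tr_seq.reshape(y_tr_seq.shape[0], y_tr_seq.shape[1], 1)[:, 1:],
    validation_data=([x_val_seq, y_val_seq[:, :-1]],
                     y_val_seq.reshape(y_val_seq.shape[0], y_val_seq.shape[1], 1)[:, 1:]),
    epochs=50,
    batch_size=128,
    callbacks=[es])
```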
Diagnostic plot
Now, we will plot a few diagnostic plots to understand the
behavior of the model over time:
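A simple loss plot, assuming the `history` object returned by `model.fit` above:

```python
import matplotlib.pyplot as plt

# Training vs. validation loss per epoch.
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
```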
Inference
We have trained models with the plain encoder-decoder, the encoder-decoder with attention, and a bidirectional LSTM with attention. The bidirectional LSTM with attention model gives better results than the others.
So we set up the inference for the bidirectional LSTM with attention model.
We define a function below which implements the inference process:
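Below is a minimal sketch of that inference process, written for the stacked-LSTM-with-attention model built earlier (the bidirectional variant follows the same pattern, with the forward and backward encoder states concatenated before initialising the decoder):

```python
# Encoder inference model: input sequence -> encoder outputs and final states.
encoder_model = Model(inputs=encoder_inputs,
                      outputs=[encoder_outputs, state_h, state_c])

# Decoder inference model: one timestep, given the previous word and states.
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_hidden_state_input = Input(shape=(max_text_len, latent_dim))

dec_emb2 = dec_emb_layer(decoder_inputs)
decoder_outputs2, state_h_inf, state_c_inf = decoder_lstm(
    dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
attn_out2, _ = attn_layer([decoder_hidden_state_input, decoder_outputs2])
decoder_inf_concat = Concatenate(axis=-1)([decoder_outputs2, attn_out2])
decoder_outputs2 = decoder_dense(decoder_inf_concat)

decoder_model = Model(
    [decoder_inputs, decoder_hidden_state_input,
     decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2, state_h_inf, state_c_inf])

reverse_target_word_index = y_tokenizer.index_word
target_word_index = y_tokenizer.word_index

def decode_sequence(input_seq):
    """Greedy decoding: feed each predicted word back into the decoder."""
    e_out, e_h, e_c = encoder_model.predict(input_seq)

    # Start decoding with the <start> token ('sostok' here).
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = target_word_index['sostok']

    decoded_sentence = ''
    while True:
        output_tokens, h, c = decoder_model.predict([target_seq, e_out, e_h, e_c])
        sampled_index = int(np.argmax(output_tokens[0, -1, :]))
        sampled_token = reverse_target_word_index.get(sampled_index, '')

        # Stop on the <end> token, padding, or when the summary gets too long.
        if sampled_token in ('', 'eostok') or len(decoded_sentence.split()) >= (max_summary_len - 1):
            break
        decoded_sentence += ' ' + sampled_token

        # Feed the sampled word and the updated states into the next timestep.
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_index
        e_h, e_c = h, c

    return decoded_sentence.strip()

# Example usage on one validation article:
print(decode_sequence(x_val_seq[0].reshape(1, max_text_len)))
```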
Demo Video:
Here is the demonstration video.
Future Work:
1) We can improve the performance of the model by training it on a larger dataset. If a high-end GPU is available, we can train the model on a much larger dataset, which will improve its performance.
2) We can use a state-of-the-art model such as BERT to obtain the embeddings of the text data.
References:
https://towardsdatascience.com/lets-give-some-attention-to-summarising-texts-d0af2c4061d1