News Summarizer

Table of Contents

1. What is Text Summarization in NLP?

2. Sequence-to-Sequence (Seq2Seq) Modeling

3. Encoder-Decoder Architecture

4. Limitations of the Encoder-Decoder Architecture

5. Attention Mechanism

6. Problem Statement

7. Implementing a Text Summarization Model in Python using Keras

8. Demo Video

9. Future Work

10. References


What is Text Summarization in NLP?

Text summarization is the process of producing a concise and meaningful summary of a given text while preserving its key information and overall meaning.

There are two types of text summarization techniques:

  i. Extractive summarization

  ii. Abstractive summarization

1) Extractive Summarization:

      In extractive summarization we identify the important parts of the text, such as sentences or phrases, and extract only those parts. The extracted parts are then stitched together to create the summary.

Text: Messi and Ronaldo have better records than their counterparts. Performed exceptionally across all competitions. They are considered as the best in our generation. 

Extractive summary: Messi and Ronaldo have better records than their counterparts. Best in our generation.

 

2) Abstractive summarization:

      In abstractive summarization we build the summary by rephrasing the text or generating new words rather than copying sentences verbatim. These methods use more advanced NLP techniques to produce the summary.

      Text: Messi and Ronaldo have better records than their counterparts. Performed exceptionally across all competitions. They are considered as the best in our generation.

      Abstractive summary: Messi and Ronaldo have better records than their counterparts, so they are considered the best of our generation.

In this case study we focus on the abstractive summarization technique, and we will use an encoder-decoder architecture to solve it.


Sequence-to-Sequence (Seq2Seq) Modeling

In the text summarization problem, the input is a sequence of words and the output is also a sequence of words, so we can use a sequence-to-sequence model.

Our objective is to build a text summarizer where the input is a long sequence of words (in a text body), and the output is a short summary (which is a sequence as well). So, we can model this as a Many-to-Many Seq2Seq problem. Below is a typical Seq2Seq model architecture:

There are two major components of a Seq2Seq model:

  • Encoder
  • Decoder

Encoder-Decoder Architecture

The Encoder-Decoder architecture is mainly used to solve the sequence-to-sequence (Seq2Seq) problems where the input and output sequences are of different lengths.



Generally, variants of Recurrent Neural Networks (RNNs), such as the Gated Recurrent Unit (GRU) or Long Short-Term Memory (LSTM), are preferred as the encoder and decoder components. This is because they are capable of capturing long-term dependencies by overcoming the vanishing gradient problem.

We can set up the Encoder-Decoder in 2 phases:

  • Training phase
  • Inference phase

Training phase

In the training phase, we will first set up the encoder and decoder. We will then train the model to predict the target sequence offset by one timestep. Let us see in detail how to set up the encoder and decoder.

Encoder

An encoder Long Short-Term Memory (LSTM) network reads the entire input sequence wherein, at each timestep, one word is fed into the encoder. It then processes the information at every timestep and captures the contextual information present in the input sequence.

I’ve put together the below diagram which illustrates this process:

The hidden state (hi) and cell state (ci) of the last timestep are used to initialize the decoder. Remember, this is because the encoder and decoder are two different sets of LSTM layers.

Decoder

The decoder is also an LSTM network which reads the entire target sequence word-by-word and predicts the same sequence offset by one timestep. The decoder is trained to predict the next word in the sequence given the previous word.

<start> and <end> are special tokens which are added to the target sequence before feeding it into the decoder. The target sequence is unknown while decoding a test sequence, so we start predicting the target sequence by passing the first word into the decoder, which is always the <start> token, and the <end> token signals the end of the sentence.
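As a quick illustration of this one-timestep offset, here is a minimal sketch (the token names _START_ and _END_ are placeholders, not taken from the actual code):

    # Hypothetical target summary wrapped in the special tokens
    summary = ["_START_", "messi", "is", "the", "best", "_END_"]

    # The decoder receives the sequence without its last token as input...
    decoder_input = summary[:-1]    # ["_START_", "messi", "is", "the", "best"]
    # ...and is trained to predict the same sequence shifted one step ahead
    decoder_target = summary[1:]    # ["messi", "is", "the", "best", "_END_"]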

Inference Phase

After training, the model is tested on new source sequences for which the target sequence is unknown. So, we need to set up the inference architecture to decode a test sequence:

How does the inference process work?

Here are the steps to decode the test sequence:

1.     Encode the entire input sequence and initialize the decoder with internal states of the encoder

2.     Pass <start> token as an input to the decoder

3.     Run the decoder for one timestep with the internal states

4.     The output will be a probability distribution over the vocabulary for the next word. The word with the maximum probability will be selected

5.     Pass the sampled word as an input to the decoder in the next timestep and update the internal states with the current time step

6.     Repeat steps 3 – 5 until we generate <end> token or hit the maximum length of the target sequence

Let’s take an example where the test sequence is given by  [x1, x2, x3, x4]. How will the inference process work for this test sequence? I want you to think about it before you look at my thoughts below.

1.     Encode the test sequence into internal state vectors

2.     Observe how the decoder predicts the target sequence at each timestep:

Timestep: t=1



Timestep: t=2



And, Timestep: t=3


Limitations of the Encoder-Decoder Architecture

As useful as this encoder-decoder architecture is, there are certain limitations that come with it.

  • The encoder converts the entire input sequence into a fixed-length vector and then the decoder predicts the output sequence. This works well only for short sequences, since the decoder has to rely on this single vector for all of its predictions.
  • Here comes the problem with long sequences: it is difficult for the encoder to compress a long sequence into a single fixed-length vector.
So how do we overcome this problem of long sequences? This is where the concept of the attention mechanism comes into the picture. It aims to predict each word by looking at only specific parts of the sequence rather than the entire sequence.

Attention Mechanism

How much attention do we need to pay to every word in the input sequence for generating a word at timestep t? That’s the key intuition behind this attention mechanism concept.

Let’s consider a simple example to understand how Attention Mechanism works:

  • Source sequence: “Which sport do you like the most?”
  • Target sequence: “I love cricket”

The first word 'I' in the target sequence is connected to the fourth word 'you' in the source sequence. Similarly, the second word 'love' in the target sequence is connected to the fifth word 'like' in the source sequence.

So, instead of looking at all the words in the source sequence, we can increase the importance of specific parts of the source sequence that result in the target sequence. This is the basic idea behind the attention mechanism.

There are two types of attention:

  • Global Attention
  • Local Attention

Global Attention:

Here, the attention is placed on all the source positions. In other words, all the hidden states of the encoder are considered for deriving the attended context vector:


Source: Effective Approaches to Attention-based Neural Machine Translation (2015)

 

Local Attention:

Here, the attention is placed on only a few source positions. Only a few hidden states of the encoder are considered for deriving the attended context vector:

Source: Effective Approaches to Attention-based Neural Machine Translation (2015)
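To make the idea concrete, here is a minimal NumPy sketch of deriving an attended context vector from the encoder hidden states. It uses a simple dot-product alignment purely for illustration and is not the exact layer used later; global attention scores all encoder states, while local attention would restrict them to a small window:

    import numpy as np

    def attended_context(encoder_states, decoder_state):
        # encoder_states: (T, d) hidden states; decoder_state: (d,) current decoder state
        scores = encoder_states @ decoder_state        # alignment scores, shape (T,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                       # softmax attention weights
        return weights @ encoder_states                # weighted sum -> context vector, shape (d,)

    enc = np.random.randn(7, 4)    # 7 encoder timesteps, hidden size 4
    dec = np.random.randn(4)
    print(attended_context(enc, dec).shape)            # (4,)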

 

Problem Statement

Our objective here is to generate a news summary.

You can download the dataset from here.

 

Implementing Text Summarization in Python using Keras

Custom Attention Layer

Keras does not officially support an attention layer. So, we can either implement our own attention layer or use a third-party implementation. We will go with the latter option for this article.
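For example, assuming the chosen third-party implementation is saved locally as attention.py and exposes an AttentionLayer class (as the widely used attention_keras implementation does), it can be imported like this:

    # Assumes a third-party Bahdanau-style attention layer saved locally as attention.py
    from attention import AttentionLayer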

Import the Libraries
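A typical set of imports for this project might look like the following sketch (exact modules and versions may differ):

    import re
    import warnings
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords          # requires: nltk.download('stopwords')
    from sklearn.model_selection import train_test_split

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.layers import (Input, LSTM, Embedding, Dense,
                                         Concatenate, TimeDistributed, Bidirectional)
    from tensorflow.keras.models import Model
    from tensorflow.keras.callbacks import EarlyStopping

    warnings.filterwarnings("ignore")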

Read the dataset
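A minimal sketch, assuming the news dataset is a CSV file with 'headlines' and 'text' columns (the actual file name and column names may differ):

    # Hypothetical file name and column names; adjust to the downloaded dataset
    df = pd.read_csv("news_summary.csv", encoding="latin-1")
    df = df[["headlines", "text"]].dropna().drop_duplicates()
    print(df.shape)
    df.head()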

 

Text Cleaning

We will perform the below preprocessing tasks for our data (a sketch of the cleaning function follows the list):

  • Convert everything to lowercase
  • Remove HTML tags
  • Contraction mapping
  • Remove (‘s)
  • Remove any text inside the parenthesis ( )
  • Eliminate punctuations and special characters
  • Remove stopwords
  • Remove short words
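Here is a minimal sketch of a cleaning function covering the steps above, continuing from the imports (the contraction_mapping dictionary is abbreviated for illustration):

    # Abbreviated contraction map for illustration; a full map would cover many more cases
    contraction_mapping = {"ain't": "is not", "can't": "cannot", "don't": "do not",
                           "won't": "will not", "it's": "it is", "i'm": "i am"}
    stop_words = set(stopwords.words("english"))

    def clean_text(text, remove_stopwords=True):
        text = text.lower()                                   # lowercase
        text = BeautifulSoup(text, "html.parser").text        # remove HTML tags
        text = re.sub(r"\([^)]*\)", "", text)                 # remove text inside parentheses
        text = " ".join(contraction_mapping.get(w, w) for w in text.split())  # contraction mapping
        text = re.sub(r"'s\b", "", text)                      # remove 's
        text = re.sub(r"[^a-zA-Z]", " ", text)                # punctuation and special characters
        words = text.split()
        if remove_stopwords:
            words = [w for w in words if w not in stop_words] # remove stopwords
        words = [w for w in words if len(w) > 1]              # remove short words
        return " ".join(words)

    df["cleaned_text"] = df["text"].apply(clean_text)
    df["cleaned_summary"] = df["headlines"].apply(lambda s: clean_text(s, remove_stopwords=False))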

EDA


Word cloud for text




Observations:

1) Words like 'added' and 'said' are among the most common words in the text.

2) Names of well-known personalities and politicians such as Modi and Trump, as well as public places, days, and states, are also common in the news text.

Word cloud for Headline




Observations:

From the word cloud we cannot say that any single word dominates the headlines, but names of public places, cities, teams, etc. are among the most frequent words in the headlines.

Maximum length

We can fix the maximum length of the reviews to 80 since that seems to be the majority review length. Similarly, we can set the maximum summary length to 10:
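A short sketch of applying these limits and wrapping the summaries in start/end tokens before modeling (the token names sostok and eostok are placeholders):

    max_text_len = 80
    max_summary_len = 10

    # Keep only the pairs that fit within the chosen limits
    keep = (df["cleaned_text"].str.split().str.len() <= max_text_len) & \
           (df["cleaned_summary"].str.split().str.len() <= max_summary_len - 2)
    df = df[keep].copy()

    # Wrap every summary in start/end tokens for the decoder
    df["cleaned_summary"] = df["cleaned_summary"].apply(lambda s: "sostok " + s + " eostok")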

 Train test split

We’ll use 80% of the dataset as the training data and evaluate the performance on the remaining 20%.
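A minimal sketch using scikit-learn's train_test_split:

    x_tr, x_val, y_tr, y_val = train_test_split(
        np.array(df["cleaned_text"]),
        np.array(df["cleaned_summary"]),
        test_size=0.2,
        random_state=0,
        shuffle=True,
    )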

 

Tokenizer

A tokenizer builds the vocabulary and converts a word sequence to an integer sequence.
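A sketch of tokenizing and padding both the text and the summary sequences with the Keras Tokenizer, continuing from the split above:

    # Tokenizer for the news text
    x_tokenizer = Tokenizer()
    x_tokenizer.fit_on_texts(list(x_tr))
    x_tr_seq = pad_sequences(x_tokenizer.texts_to_sequences(x_tr), maxlen=max_text_len, padding="post")
    x_val_seq = pad_sequences(x_tokenizer.texts_to_sequences(x_val), maxlen=max_text_len, padding="post")
    x_voc = len(x_tokenizer.word_index) + 1

    # Tokenizer for the summaries
    y_tokenizer = Tokenizer()
    y_tokenizer.fit_on_texts(list(y_tr))
    y_tr_seq = pad_sequences(y_tokenizer.texts_to_sequences(y_tr), maxlen=max_summary_len, padding="post")
    y_val_seq = pad_sequences(y_tokenizer.texts_to_sequences(y_val), maxlen=max_summary_len, padding="post")
    y_voc = len(y_tokenizer.word_index) + 1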

 Model Building 

We are finally at the model building part. But before we do that, we need to familiarize ourselves with a few terms which are required prior to building the model.

  • Return Sequences = True: When the return_sequences parameter is set to True, the LSTM produces the hidden state output for every timestep
  • Return State = True: When return_state = True, the LSTM returns the hidden state and cell state of the last timestep only (see the short snippet after this list)
  • Initial State: This is used to initialize the internal states of the LSTM for the first timestep
  • Stacked LSTM: A stacked LSTM has multiple layers of LSTM stacked on top of each other. This leads to a better representation of the sequence. I encourage you to experiment with multiple LSTM layers stacked on top of each other (it’s a great way to learn this)
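For example, a quick illustration of what an LSTM layer returns with these flags enabled (continuing from the imports and lengths defined earlier):

    # With both flags on, an LSTM returns the full sequence plus the final states
    inp = Input(shape=(max_text_len, 100))
    seq_out, last_h, last_c = LSTM(128, return_sequences=True, return_state=True)(inp)
    # seq_out: hidden state at every timestep -> (batch, max_text_len, 128)
    # last_h, last_c: hidden and cell state of the last timestep -> (batch, 128)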
We will build and compare three models:

1) Encoder-Decoder Model

2) LSTM Attention Model

3) Bidirectional LSTM Attention Model

Here, we are building a 3-layer stacked LSTM for the encoder:

Embedding:

Encoder:

Decoder:
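Below is a minimal sketch of the LSTM attention model covering the embedding, the 3-layer stacked LSTM encoder, and the attention-based decoder described above. It continues from the tokenized data and assumes the third-party AttentionLayer mentioned earlier; latent_dim and embedding_dim are illustrative hyperparameters, and the bidirectional variant would wrap the encoder LSTMs in Bidirectional:

    from attention import AttentionLayer   # assumed third-party attention implementation

    latent_dim = 300      # illustrative hyperparameters
    embedding_dim = 100

    # ----- Encoder: embedding + 3 stacked LSTM layers -----
    encoder_inputs = Input(shape=(max_text_len,))
    enc_emb = Embedding(x_voc, embedding_dim, trainable=True)(encoder_inputs)

    encoder_output1, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(enc_emb)
    encoder_output2, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(encoder_output1)
    encoder_outputs, state_h, state_c = LSTM(latent_dim, return_sequences=True,
                                             return_state=True)(encoder_output2)

    # ----- Decoder: embedding + LSTM initialised with the encoder's final states -----
    decoder_inputs = Input(shape=(None,))
    dec_emb_layer = Embedding(y_voc, embedding_dim, trainable=True)
    dec_emb = dec_emb_layer(decoder_inputs)

    decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

    # ----- Attention over the encoder outputs -----
    attn_layer = AttentionLayer(name="attention_layer")
    attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])

    # Concatenate the attention context with the decoder outputs
    decoder_concat = Concatenate(axis=-1, name="concat_layer")([decoder_outputs, attn_out])

    # Word-probability layer applied at every decoder timestep
    decoder_dense = TimeDistributed(Dense(y_voc, activation="softmax"))
    decoder_final = decoder_dense(decoder_concat)

    model = Model([encoder_inputs, decoder_inputs], decoder_final)
    model.summary()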


Model Training:
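A sketch of compiling and fitting the model with early stopping (optimizer, epochs, and batch size are illustrative choices):

    model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")

    es = EarlyStopping(monitor="val_loss", mode="min", verbose=1, patience=2)

    history = model.fit(
        [x_tr_seq, y_tr_seq[:, :-1]],
        y_tr_seq.reshape(y_tr_seq.shape[0], y_tr_seq.shape[1], 1)[:, 1:],
        epochs=30,
        batch_size=128,
        callbacks=[es],
        validation_data=(
            [x_val_seq, y_val_seq[:, :-1]],
            y_val_seq.reshape(y_val_seq.shape[0], y_val_seq.shape[1], 1)[:, 1:],
        ),
    )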

Diagnostic plot

Now, we will plot a few diagnostic plots to understand the behavior of the model over time:
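For example, plotting the training and validation loss from the Keras history object:

    plt.plot(history.history["loss"], label="train loss")
    plt.plot(history.history["val_loss"], label="validation loss")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.show()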


Inference

We have trained the encoder-decoder, LSTM attention, and bidirectional LSTM with attention models. The bidirectional LSTM with attention gives better results than the others.


So we set up the inference for the bidirectional LSTM with attention model.

We define a function below which implements the inference process described earlier:
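Here is a minimal sketch of the inference setup and a greedy decoding function, continuing from the layers defined above (shown for the single-direction LSTM attention model; the bidirectional variant follows the same pattern with doubled state sizes):

    # Encoder inference model: input sequence -> encoder outputs and final states
    encoder_model = Model(encoder_inputs, [encoder_outputs, state_h, state_c])

    # Decoder inference model: runs one timestep at a time
    decoder_state_input_h = Input(shape=(latent_dim,))
    decoder_state_input_c = Input(shape=(latent_dim,))
    decoder_hidden_state_input = Input(shape=(max_text_len, latent_dim))

    dec_emb2 = dec_emb_layer(decoder_inputs)
    decoder_outputs2, state_h2, state_c2 = decoder_lstm(
        dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])

    attn_out2, _ = attn_layer([decoder_hidden_state_input, decoder_outputs2])
    decoder_concat2 = Concatenate(axis=-1)([decoder_outputs2, attn_out2])
    decoder_outputs2 = decoder_dense(decoder_concat2)

    decoder_model = Model(
        [decoder_inputs, decoder_hidden_state_input, decoder_state_input_h, decoder_state_input_c],
        [decoder_outputs2, state_h2, state_c2])

    reverse_target_word_index = y_tokenizer.index_word
    target_word_index = y_tokenizer.word_index

    def decode_sequence(input_seq):
        # Greedy decoding: feed the predicted word back in until 'eostok' or max length
        e_out, e_h, e_c = encoder_model.predict(input_seq)

        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = target_word_index["sostok"]        # start token

        decoded_words = []
        for _ in range(max_summary_len):
            output_tokens, h, c = decoder_model.predict([target_seq, e_out, e_h, e_c])
            sampled_index = int(np.argmax(output_tokens[0, -1, :]))
            sampled_word = reverse_target_word_index.get(sampled_index, "")

            if sampled_word == "eostok":                      # end token -> stop
                break
            if sampled_word:
                decoded_words.append(sampled_word)

            # Feed the sampled word back in and carry the states forward
            target_seq = np.zeros((1, 1))
            target_seq[0, 0] = sampled_index
            e_h, e_c = h, c

        return " ".join(decoded_words)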

Demo Video:



References:

https://towardsdatascience.com/lets-give-some-attention-to-summarising-texts-d0af2c4061d1

https://towardsdatascience.com/text-summarization-from-scratch-using-encoder-decoder-network-with-attention-in-keras-5fa80d12710e


