Hexaview Technologies

Automatic Text Summarization in Machine Learning


Text data from various sources has exploded in recent years. With so much data on information servers, ‘information overload’ is becoming an issue for people. Summarizing and sorting mountains of documents while preserving their meaning has always been a difficult and time-consuming task. Automatic text summarization can be a key solution to this problem: it extracts or collects information from the original text and presents that information as a summary.

As the word suggests, text summarization is the process of summarizing a huge chunk of text in a precise and concise format such that the overall meaning remains the same.

Automatic text summarization involves the transformation of lengthy documents into shortened versions, which could be difficult and costly to undertake manually.

Automatic text summarization belongs to natural language processing (NLP), the field concerned with how computers can analyze, understand, and derive meaning from human language.

Business leaders, analysts, paralegals, and academic researchers need to comb through huge numbers of documents every day to keep ahead. Most of their time is spent figuring out which documents are relevant and which are not. By extracting important sentences and creating comprehensive summaries, it is possible to quickly assess whether or not a document is worth reading. A simple frequency-based summarizer does this in three steps:

  1. The frequency of each word across the entire text document is calculated.
  2. The hundred most common words are stored and sorted. Each sentence is then scored based on the number of high-frequency words it contains, with higher-frequency words worth more.
  3. The top X sentences are then taken and sorted based on their position in the original text.
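The three steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the regex-based sentence splitter, and the small stop-word list are all assumptions made for the example (the steps themselves do not mention stop words, but without them words like "the" would dominate the scores).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the article).
STOP_WORDS = {"a", "an", "and", "are", "the", "to", "in", "of", "is", "was", "at", "it"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize_top_x(text, top_x=2, vocab_size=100):
    # Step 1: word frequency over the whole document.
    frequencies = Counter(tokenize(text))

    # Step 2: keep the most common words; each is worth its own frequency,
    # so higher-frequency words contribute more to a sentence's score.
    weights = dict(frequencies.most_common(vocab_size))
    sentences = split_sentences(text)
    scores = [sum(weights.get(w, 0) for w in tokenize(s)) for s in sentences]

    # Step 3: take the top X sentences, then restore their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_x])
    return " ".join(sentences[i] for i in chosen)
```

Calling `summarize_top_x(document, top_x=3)` would return the three highest-scoring sentences in the order they appear in `document`. Summing raw frequencies favors long sentences; dividing each score by the sentence length is one possible variation.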

Automatic data summarization is part of machine learning and data mining. The primary idea is to find a subset of the data that contains the “information” of the entire set. Such techniques are widely used in industry today, for example to summarize documents, video collections, and image collections. Document summarization tries to create a summary or abstract of the entire document by finding the most informative sentences. In image summarization, by contrast, the system finds the most representative and important images.

Machine learning algorithms can be trained to read documents and identify the phrases and sections that hold the important details before producing the required summaries.

THE MAIN TYPES OF SUMMARIZATION:

Broadly, there are two methods of text summarization – extraction and abstraction.


Extraction-based summarization:

In extraction-based summarization, the most important information is extracted from the source text and combined to form a summary. We can think of the extraction-based approach as a highlighter that picks out the primary information from the text.


How does extraction-based summarization work?

Extraction-based summarization weighs the important points and sections of the complete text, using various methods and algorithms to measure the weight of each sentence. The sentences are then ranked according to their relevance and similarity, and the top-ranked ones are combined to form a summary.

Let us take an example:

Source Text: 

Peter and Elizabeth took a taxi to attend the night party in the city. While at the party, Elizabeth collapsed and was rushed to the hospital.

Summary: Peter and Elizabeth attend party city. Elizabeth was rushed to the hospital.

Here, the extracted summary is made up entirely of words taken from the source text, although the result may not be grammatically accurate.

Abstraction-based summarization:

In abstraction-based summarization, advanced deep learning techniques are used to rephrase and shorten the original document, just as humans do. We can think of it as a pen that produces novel sentences that may not be part of the source document.


Because the deep learning techniques and machine learning algorithms used in the abstraction-based approach can generate new sentences and phrases that hold most of the important information from the text, they can help overcome the grammatical inaccuracies of extraction techniques.

Let’s take an example: 

Source Text: Peter and Elizabeth took a taxi to attend the night party in the city. While at the party, Elizabeth collapsed and was rushed to the hospital.

 Summary: Elizabeth was hospitalized after attending a party with Peter.

Although abstraction generally produces better summaries, developing its algorithms requires complicated deep learning techniques and sophisticated language modeling.

 

EXTRACTIVE SUMMARIZATION METHODS: 

  1. Term Frequency-Inverse Document Frequency (TF-IDF) method.
  2. Cluster-based method.
  3. Graph-theoretic approach.
  4. Machine Learning approach.
  5. Automatic text summarization based on fuzzy logic.
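As an illustration of the first method in the list, here is a minimal sketch of TF-IDF sentence scoring. It rests on an assumption that is not stated in the article: each sentence is treated as its own "document" when computing the inverse document frequency, so words that appear in many sentences count for less. The function name `tfidf_sentence_scores` is hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Return one TF-IDF score per sentence (higher means more distinctive)."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    n = len(tokenized)

    # Document frequency: the number of sentences each word appears in.
    df = Counter(w for words in tokenized for w in set(words))

    scores = []
    for words in tokenized:
        if not words:
            scores.append(0.0)
            continue
        tf = Counter(words)
        # Sum of (term frequency) * (inverse document frequency) per word.
        score = sum(
            (count / len(words)) * math.log(n / df[w])
            for w, count in tf.items()
        )
        scores.append(score)
    return scores
```

The highest-scoring sentences would then be selected for the summary. A word that occurs in every sentence gets an IDF of zero and contributes nothing, which is why common function words drop out without an explicit stop-word list.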

ABSTRACTIVE SUMMARIZATION METHODS:

  1. Tree-based methods.
  2. Rule-based methods.
  3. Ontology-based methods.
  4. Information item-based method.
  5. Semantic Graph Model.

HOW TO PERFORM TEXT SUMMARIZATION:

To keep things simple, we will not use any machine learning library other than Python’s NLTK toolkit.

Here are the steps for creating a simple text summarizer in Python.

Step 1: Preparing the data.

Step 2: Processing the data.

Step 3: Tokenizing the article into sentences.

Step 4: Finding the weighted frequencies of the sentences.

Step 5: Calculating the threshold of the sentences.

Step 6: Getting the summary.
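The six steps above could look roughly like the sketch below. It is one possible reading of the steps, not the exact implementation the article has in mind. To stay runnable without downloading any NLTK data, it uses regular expressions and a tiny hand-written stop-word list as stand-ins; with NLTK installed, `nltk.sent_tokenize`, `nltk.word_tokenize`, and `nltk.corpus.stopwords.words("english")` could take their place. The function name, the `factor` parameter, and the choice of the average sentence score as the threshold are assumptions.

```python
import re
from collections import Counter

# Stand-in for nltk.corpus.stopwords.words("english") (illustrative only).
STOP_WORDS = {"a", "an", "and", "are", "the", "to", "in", "of", "is", "was", "at", "it"}


def content_words(text):
    """Lowercase word tokens without stop words (stand-in for nltk.word_tokenize)."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize_by_threshold(article, factor=1.0):
    # Steps 1-2: prepare and process the data (here: normalize whitespace).
    text = re.sub(r"\s+", " ", article).strip()

    # Step 3: tokenize the article into sentences
    # (stand-in for nltk.sent_tokenize).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

    # Step 4: weighted word frequencies (each count divided by the highest
    # count), then a score per sentence as the sum of its word weights.
    counts = Counter(content_words(text))
    if not counts:
        return ""
    highest = max(counts.values())
    weights = {word: count / highest for word, count in counts.items()}
    scores = [sum(weights[w] for w in content_words(s)) for s in sentences]

    # Step 5: threshold = average sentence score, scaled by `factor`.
    threshold = factor * sum(scores) / len(scores)

    # Step 6: the summary is every sentence scoring above the threshold.
    return " ".join(s for s, score in zip(sentences, scores) if score > threshold)
```

Raising `factor` above 1.0 makes the summary shorter, because fewer sentences clear the threshold; lowering it keeps more of the original text.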

Wrapping Up:

The workflow for creating the summary generator is depicted in the following figure.

[Figure: A basic workflow of creating a summarization algorithm.]

CONCLUSION:

The exponential growth of the Internet has led to a rise in available information. With so much of it at hand, it becomes difficult for humans to summarize large amounts of text.

Because so much information is available on the World Wide Web, it is not possible to go through every document to learn its purpose and decide whether it is needed. A summary of each document helps the reader decide whether the document is relevant and makes it easier to extract its gist.

Thus, there is an immense need for automatic summarization tools in this age of information overload. Automatic summarization, which consists of automatically creating a summary of one or more texts, is an important topic in NLP (Natural Language Processing) research. Although extractive text summarization is easier to implement, it has limitations that can cause ambiguity and miscommunication in the summary. Abstractive summarization can generate a more relevant and precise summary, but it requires more complex algorithms, such as deep learning models.

Poorva Gaur


Poorva Gaur currently works as a Software Quality Engineer. Her role is to deliver a quality product that meets clients’ requirements. She also has a keen interest in Machine Learning and loves to explore the field.