As the automation market is rounding the corner into the mainstream, organizations are looking for new ways to leverage Artificial Intelligence to streamline their day-to-day business processes.

The purpose of this blog post is to address common questions our customers ask about Automatic Text Summarization (ATS), which is the process of using AI to shorten a text document and create a summary of its major points.

What are the different Automatic Text Summarization approaches? When does it make sense to leverage these approaches?

Extractive text summarization is the process of isolating sentences or phrases that capture the gist of the original text and constructing a new, shorter text from the isolated parts. Methods for extractive summarization usually entail ranking each sentence of a document according to factors such as keyword frequency, length, total number of keywords, etc.
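To make the ranking idea concrete, here is a minimal sketch of a frequency-based extractive summarizer using only the Python standard library. It is an illustration, not a production method: the function names, the tiny stopword list, and the naive sentence splitter are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "on", "are", "was"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keywords(sentence):
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def frequency_summary(text, num_sentences=2):
    """Keep the sentences whose keywords are most frequent in the document."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in keywords(s))

    def score(sentence):
        words = keywords(sentence)
        # Average keyword frequency, so long sentences are not favored
        # simply for being long.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Emit the chosen sentences in their original document order.
    return " ".join(sentences[i] for i in sorted(ranked[:num_sentences]))
```

Calling `frequency_summary(document_text, num_sentences=3)` returns the three top-scoring sentences in the order they appear in the document. Averaging rather than summing the keyword counts is one design choice among several; summing would reward longer sentences.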

Abstractive text summarization aims to understand the text as a whole and present a summary made of new, generated sentences. Abstractive summarization is a more ‘human-like’ process, but it is also more difficult to implement, and it is harder to get coherent and accurate summaries out of it. Because abstractive methods depend on the AI achieving an understanding of the text and subsequently producing its own version of it, such methods are hampered by issues with semantic representation, inference, and natural language generation.

The practice of abstractive summarization is much younger and less developed than that of extractive summarization. Since extractive summarization methods are generally better understood, they tend to result in better summaries than abstractive methods. For now, extractive summarizers are both quicker to implement and more reliable than abstractive summarizers, but abstractive methods may become the standard once the field has advanced further.

In most use cases, we would pursue an extractive method via the TextRank algorithm since it is text-agnostic and does not rely on pre-existing training data.

The TextRank algorithm follows the same general steps as most extractive methods. It begins by constructing a representation of each sentence in the text. Next, the algorithm scores the sentences, and finally it creates a summary from the highest-scoring ones. TextRank is an unsupervised, graph-based algorithm: sentences are the nodes, and the similarity between two sentences is the weight of the edge connecting them. The steps are described in more detail below.

  1. Tokenize the input document to prepare it for analysis.
  2. Generate a similarity matrix across all of the sentences using a metric like cosine similarity.
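  3. Rank all of the sentences by running a PageRank-style algorithm over the similarity matrix, treated as a weighted graph, and sort by rank. Sentences that are similar to many other sentences rank highest. (Other extractive methods rank sentences on features like keyword frequency, length, total number of keywords, etc.)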
  4. Choose a summary length cutoff and output the summary, which would consist of the highest-ranked sentences. The length cutoff would be set to a certain number of sentences or characters.
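The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are made up for this example, sentences are compared with bag-of-words cosine similarity (as suggested in step 2), and ranking uses a simple fixed-iteration PageRank with an assumed damping factor of 0.85.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Step 1 (naive): split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentence):
    """Word counts for one sentence, lowercased."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity_matrix(sentences):
    """Step 2: pairwise similarity, with zeros on the diagonal."""
    bags = [bag_of_words(s) for s in sentences]
    n = len(bags)
    return [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
            for i in range(n)]


def pagerank(matrix, damping=0.85, iterations=50):
    """Step 3: weighted PageRank by power iteration.

    Simplification: a sentence with no similar neighbors passes on no
    score, so the scores need not sum to exactly 1.
    """
    n = len(matrix)
    if n == 0:
        return []
    row_sums = [sum(row) for row in matrix]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                scores[j] * matrix[j][i] / row_sums[j]
                for j in range(n) if row_sums[j] > 0)
            for i in range(n)
        ]
    return scores


def textrank_summary(text, num_sentences=2):
    """Step 4: keep the top-ranked sentences, in document order."""
    sentences = split_sentences(text)
    scores = pagerank(similarity_matrix(sentences))
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```

Here the length cutoff is a number of sentences (`num_sentences`); a character-based cutoff would instead add ranked sentences until a character budget is reached.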

Does the text type (rhetorical structure, length, etc.) affect the approach?

TextRank is a text-agnostic algorithm, meaning the structure, genre, and length of the text should not, in theory, matter. The main risk of automatically summarizing long documents, or multiple documents about the same topic, is ending up with a repetitive summary. There is a closely related algorithm called LexRank that’s basically the same but involves a step right before creating the summary where highly ranked sentences are checked against each other to make sure the final selection of sentences isn’t too repetitive.
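A redundancy check of this kind can be sketched as a greedy filter over sentences that are already sorted best-first. This is a simplified illustration and not necessarily the exact procedure LexRank uses; the function names and the 0.7 similarity threshold are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Word counts for one sentence, lowercased."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_non_redundant(ranked_sentences, max_sentences=3, threshold=0.7):
    """Greedily keep sentences that are not near-duplicates of kept ones.

    ranked_sentences must already be ordered from highest to lowest rank.
    threshold is an assumed cutoff: a candidate is skipped if its cosine
    similarity to any already-selected sentence reaches it.
    """
    selected = []
    for candidate in ranked_sentences:
        if len(selected) >= max_sentences:
            break
        cand_bag = bag_of_words(candidate)
        if all(cosine(cand_bag, bag_of_words(s)) < threshold
               for s in selected):
            selected.append(candidate)
    return selected
```

Lowering `threshold` makes the filter stricter, which tends to help most when summarizing several documents that cover the same topic.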

In general, abstractive methods have a harder time with both long documents and documents with long sentences due to the nature of AI ‘memory’ and the way LSTM (long short-term memory) neural networks handle the retention of past information to inform future processing.

Is there a need for human intervention?

Oftentimes, summary outputs made by ATS are inaccurate and require human intervention.

Extractive summaries can be stilted, both in how they flow from sentence to sentence and in their grammatical consistency. Abstractive summaries can be factually wrong, so it’s likely beneficial to have someone QA them.

What is the level of AI model training required to have a system that accurately summarizes texts?

TextRank requires no prior training since it’s an unsupervised method. Other extractive methods treat automatic summarization as a classification problem and therefore require labeled training data, which would basically mean having several documents with corresponding human-written summaries.
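As a rough illustration of what that labeled data could look like, the sketch below turns one (document, human summary) pair into per-sentence training examples: a sentence is labeled 1 if enough of its words also appear in the human-written summary, and 0 otherwise. The word-overlap heuristic, the 0.5 cutoff, and the function names are assumptions made for this example; real projects may label sentences differently.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_set(text):
    """Set of lowercased words in a piece of text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def label_sentences(document, human_summary, min_overlap=0.5):
    """Return (sentence, label) pairs for a sentence classifier.

    label is 1 when at least min_overlap of the sentence's distinct words
    also occur in the human-written summary (an assumed heuristic).
    """
    summary_words = word_set(human_summary)
    examples = []
    for sentence in split_sentences(document):
        words = word_set(sentence)
        overlap = len(words & summary_words) / len(words) if words else 0.0
        examples.append((sentence, 1 if overlap >= min_overlap else 0))
    return examples
```

Running this over many document and summary pairs would give the labeled sentences a classifier needs, which is exactly the preparation work TextRank avoids.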

What are some common use cases for Automatic Text Summarization?

  • Creating abstracts for scientific or technical documents.
  • Condensing news into something that could fit into a push notification.
  • Headline generation.
  • Blurbs for curated content (e.g. you have a newsletter that sends articles, book recommendations, etc. to users, and ATS could generate blurbs for the material).
  • Question-and-answer chatbots.
  • Condensing any other verbose material when the audience only needs the ‘meat’ of it.

If you’d like to learn more about Automatic Text Summarization, feel free to reach out to cognitive-automation@cedrus.digital and we’ll set up a meeting with one of our automation experts!