transitive-bullshit/text-summarization

text-summarization

Automagically generates summaries from html or text.

Intro

This module powers text summarization for Automagical, which was acquired by Verblio in 2018.

It aims to be the most powerful and comprehensive text summarization module available on npm.

Features

  • Uses a variety of metrics to generate quality extractive text summaries
  • Handles HTML or plain-text content
  • Uses HTML structure as a signal of text importance
  • Includes basic abstractive shortening of extracted sentences
  • Usable as a Node.js module or CLI
  • Thoroughly tested and used in production

Install

This package can be used either as a CLI or as a Node.js module.

npm install --save text-summarization

Usage

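const fs = require('fs')
const summarize = require('text-summarization')

// Top-level await is unavailable in CommonJS, so wrap the call in an async function.
async function main () {
  // Read the fixture as a UTF-8 string rather than a raw Buffer.
  const html = fs.readFileSync('fixtures/automagical-1.html', 'utf8')

  const summary = await summarize({ html })
  console.log(JSON.stringify(summary, null, 2))
}

main()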

which outputs:

{
  "extractive": [
    "Why you should drop everything and try Automagical",
    "Video content is significantly more engaging than text content",
    "Go from blog post → video in 5 minutes.",
    "Our builder is exceptionally easy to use.",
    "For the cost of 1 highly produced video, you can get a year's worth of videos from Automagical."
  ]
}

CLI

npm install -g text-summarization

This installs a summarize binary globally.

  Usage: summarize [options] <file>

  Options:
    -V, --version              output the version number
    -n, --num-sentences <n>    number of sentences (defaults to variable length)
    -t, --title <title>        title
    -c, --content-type <type>  sets content type to html or text
    -d, --detailed             print detailed info for top sentences
    -D, --detailedAll          print detailed info for all sentences
    -m, --media                resolve <a> links using iframely and return best matching media
    -P, --no-pretty-print      disable pretty-printing output
    -h, --help                 output usage information

Metrics

  • TF-IDF overlap as the base measure of relative sentence importance
  • HTML node boosts for tags like <h1> and <strong>
  • listicle boosts for list items like "2) second item"
  • penalties for poor readability or very long sentences
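
To make the first metric concrete, here is a minimal sketch of one common way to compute a TF-IDF based sentence score: the cosine similarity between each sentence's TF-IDF vector and the sum of all sentence vectors. This is an illustration under stated assumptions, not this module's actual implementation; the function name tfidfOverlap and the smoothed idf formula are hypothetical choices made for the example.

```javascript
// Hypothetical helper, not part of text-summarization.
// Input: an array of sentences, each already tokenized into an array of terms.
// Output: one score per sentence, the cosine similarity between the sentence's
// TF-IDF vector and the centroid (term-wise sum) of all sentence vectors.
function tfidfOverlap (sentences) {
  const n = sentences.length

  // Document frequency: how many sentences contain each term.
  const df = new Map()
  for (const tokens of sentences) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1)
    }
  }

  // TF-IDF vector per sentence. Adding the idf once per occurrence yields tf * idf.
  // The idf is smoothed (an assumption) so that no term gets a zero weight.
  const vectors = sentences.map((tokens) => {
    const vec = new Map()
    for (const term of tokens) {
      const idf = Math.log(1 + n / df.get(term))
      vec.set(term, (vec.get(term) || 0) + idf)
    }
    return vec
  })

  // Centroid: the term-wise sum of every sentence vector.
  const centroid = new Map()
  for (const vec of vectors) {
    for (const [term, weight] of vec) {
      centroid.set(term, (centroid.get(term) || 0) + weight)
    }
  }

  // Euclidean length of a sparse vector.
  const norm = (vec) => Math.sqrt([...vec.values()].reduce((sum, w) => sum + w * w, 0))
  const centroidNorm = norm(centroid)

  return vectors.map((vec) => {
    let dot = 0
    for (const [term, weight] of vec) dot += weight * centroid.get(term)
    const denom = norm(vec) * centroidNorm
    return denom > 0 ? dot / denom : 0
  })
}
```

Sentences that share vocabulary with the rest of the document score closer to 1, while a sentence made only of terms that appear nowhere else scores lower. The other metrics listed above would then adjust this base score.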

Here's an example of a sentence's internal structure after normalization, processing, and scoring:

{
  "index": 8,
  "sentence": {
    "original": "4. For the cost of 1 highly produced video, you can get a year's worth of videos from Automagical.",
    "listItem": 4,
    "actual": "For the cost of 1 highly produced video, you can get a year's worth of videos from Automagical.",
    "normalized": "for the cost of 1 highly produced video you can get a years worth of videos from automagical",
    "tokenized": [
      "cost",
      "highly",
      "produced",
      "video",
      "years",
      "worth",
      "videos",
      "automagical"
    ]
  },
  "liScore": 1,
  "nodeScore": 0.7,
  "readabilityPenalty": 0,
  "tfidfScore": 0.8019447657605553,
  "score": 5.601944765760555
}
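
The sentence fields in the example above can be reproduced with a short preprocessing sketch. This is only an approximation of the idea, assuming a simple regex-based list-marker check and a tiny stopword list; the function name preprocess and the STOPWORDS set are hypothetical and are not taken from this module's source.

```javascript
// Hypothetical stopword list, just large enough for the example sentence.
const STOPWORDS = new Set(['a', 'can', 'for', 'from', 'get', 'of', 'the', 'you'])

// Hypothetical helper: derive listItem, actual, normalized, and tokenized
// from an original sentence string.
function preprocess (original) {
  // Detect a leading list marker such as "4." or "2)" and strip it.
  const match = original.match(/^\s*(\d+)[.)]\s+/)
  const listItem = match ? Number(match[1]) : undefined
  const actual = match ? original.slice(match[0].length) : original

  // Lowercase, drop everything except letters, digits, and whitespace,
  // then collapse runs of whitespace.
  const normalized = actual
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, '')
    .replace(/\s+/g, ' ')
    .trim()

  // Keep purely alphabetic tokens that are not stopwords.
  const tokenized = normalized
    .split(' ')
    .filter((word) => /^[a-z]+$/.test(word) && !STOPWORDS.has(word))

  return { original, listItem, actual, normalized, tokenized }
}

const result = preprocess(
  "4. For the cost of 1 highly produced video, you can get a year's worth of videos from Automagical."
)
console.log(JSON.stringify(result, null, 2))
```

For the example sentence, this yields the same listItem, actual, normalized, and tokenized values shown above. A real implementation would need a full stopword list and more careful handling of punctuation.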

Iframely

This module optionally supports using Iframely to fetch social previews for any external links in the source HTML. The resulting images and summary text are added to the source pool of candidate sentences.

To enable this, set the IFRAMELY_BASE_URL and IFRAMELY_API_KEY environment variables.
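
If you want to confirm that both variables are present before relying on link previews, a small check like the following may help. It is a sketch only: hasIframelyConfig is a hypothetical helper written for this example, not something this module exports, and the warning text is an assumption about what you might want to log.

```javascript
// Hypothetical helper (not part of text-summarization): report whether both
// Iframely environment variables are set to non-empty values.
function hasIframelyConfig (env = process.env) {
  return Boolean(env.IFRAMELY_BASE_URL && env.IFRAMELY_API_KEY)
}

if (!hasIframelyConfig()) {
  // Assumed message; adjust to suit your own logging.
  console.warn('IFRAMELY_BASE_URL and IFRAMELY_API_KEY are not both set.')
}
```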

License

MIT © Travis Fischer

Support my OSS work by following me on Twitter.