
ELSA combines extractive and abstractive approaches to automatic text summarization

maks5507/elsa



ELSA: Extractive Linking of Summarization Approaches

Authors: Maksim Eremeev (mae9785@nyu.edu), Mars Wei-Lun Huang (wh2103@nyu.edu), Eric Spector (ejs618@nyu.edu), Jeffrey Tumminia (jt2565@nyu.edu)

Installation

python setup.py build
pip install .

Quick Start with ELSA

from elsa import Elsa

article = '''some text...
'''

abstractive_model_params = {
    'num_beams': 10,
    'max_length': 300,
    'min_length': 55,
    'no_repeat_ngram_size': 3
}

elsa = Elsa(weights=[1, 1], abstractive_base_model='bart', base_dataset='cnn', stopwords='data/stopwords.txt', 
            fasttext_model_path='datasets/cnn/elsa-fasttext-cnn.bin', 
            udpipe_model_path='data/english-ewt-ud-2.5-191206.udpipe')
            
elsa.summarize(article, **abstractive_model_params)
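The `abstractive_model_params` above are forwarded to the HuggingFace model's `generate` method; for instance, `no_repeat_ngram_size=3` prevents any 3-gram from occurring twice in the generated summary. A small self-contained illustration of that constraint (the checker below is for demonstration only; generation itself is done by the HuggingFace model inside ELSA):

```python
# Illustration of what `no_repeat_ngram_size=3` enforces: no 3-gram
# (sequence of 3 consecutive tokens) may occur twice in the output.

def has_repeated_ngram(tokens, n=3):
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

has_repeated_ngram("the cat sat on the mat".split(), n=3)       # False
has_repeated_ngram("the cat sat and the cat sat".split(), n=3)  # True
```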

__init__ parameters

  • weights: List[float] -- weights for TextRank and Centroid extractive summarizations.
  • abstractive_base_model: str -- model used on the abstractive step. Either 'bart' or 'pegasus'.
  • base_dataset: str -- dataset used to train the abstractive model. Either 'cnn' or 'xsum'.
  • stopwords: str -- path to the list of stopwords.
  • fasttext_model_path: str -- path to the *.bin checkpoint of a trained FastText model (see below for the training instructions).
  • udpipe_model_path: str -- path to the *.udpipe checkpoint of the pretrained UDPipe model (see data directory for the files).
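The `weights` parameter controls how the scores of the two extractive summarizers (TextRank and Centroid) are blended before sentences are ranked. A minimal sketch of such a weighted blend (`blend_scores` is a hypothetical helper for illustration, not ELSA's actual internals):

```python
# Sketch: blend two per-sentence score lists with the given weights,
# then rank sentences by the combined score. This only illustrates the
# role of the `weights` parameter; ELSA's real implementation may differ.

def blend_scores(textrank_scores, centroid_scores, weights):
    w_tr, w_c = weights
    return [w_tr * tr + w_c * c
            for tr, c in zip(textrank_scores, centroid_scores)]

textrank = [0.75, 0.25, 0.5]
centroid = [0.25, 0.75, 0.5]

# Weight TextRank twice as heavily as Centroid.
combined = blend_scores(textrank, centroid, weights=[2, 1])
# combined == [1.75, 1.25, 1.5]

# Sentence indices ranked by combined score, highest first.
ranking = sorted(range(len(combined)), key=lambda i: -combined[i])
# ranking == [0, 2, 1]
```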

summarize parameters

  • factor: float -- fraction (a number from 0 to 1) of sentences to keep in the extractive summary (default: 0.5)

  • use_lemm: bool -- whether to use lemmatization on the preprocessing step (default: False)

  • use_stem: bool -- whether to use stemming on the preprocessing step (default: False)

  • check_stopwords: bool -- whether to filter stopwords on the preprocessing step (default: True)

  • check_length: bool -- whether to filter out tokens shorter than 4 characters (default: True)

  • abstractive_model_params: dict -- any parameters passed through to the HuggingFace model's generate method
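To make the `factor` parameter concrete, here is one plausible way a fraction of 0.5 could translate into the number of sentences kept at the extractive step (illustrative only; ELSA's actual rounding behavior may differ):

```python
# Sketch: convert the `factor` fraction into a sentence count.
def n_sentences_to_keep(total_sentences, factor=0.5):
    # Keep at least one sentence so the abstractive step has input.
    return max(1, round(factor * total_sentences))

n_sentences_to_keep(10, factor=0.5)  # 5
n_sentences_to_keep(1, factor=0.3)   # still 1
```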

Datasets used for experiments

CNN-DailyMail: Link, original source: Link

XSum: Link, original source: Link

Gazeta.RU: Link, original source: Link

Downloading & Extracting datasets

wget https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz
wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz
wget https://www.dropbox.com/s/cmpfvzxdknkeal4/gazeta_jsonl.tar.gz

tar -xzf cnndm.tar.gz
tar -xzf XSUM-EMNLP18-Summary-Data-Original.tar.gz
tar -xzf gazeta_jsonl.tar.gz

FastText models

Our trained FastText models

CNN-DailyMail: Link

XSum: Link

Gazeta: Link

See our FastText page for training details.

UDPipe models

UDPipe models available for English:

  • UDPipe-English EWT: Link (Used in our experiments, see data directory)
  • UDPipe-English Patut: Link
  • UDPipe-English Lines: Link
  • UDPipe-English Gum: Link

Other UDPipe models: Link

Adaptation for Russian

Since the approach we use for ELSA is language-independent, it can easily be adapted to other languages. For Russian, we fine-tune mBART on the Gazeta dataset, train an additional FastText model, and use a UDPipe model built for Russian texts.

UDPipe models for Russian

  • UDPipe-Russian Syntagrus: Link
  • UDPipe-Russian GSD: Link (Used in our experiments, see data directory)
  • UDPipe-Russian Taiga: Link

mBART checkpoint

HuggingFace checkpoint: Link

Codestyle check

Before making a commit / pull request, please check the coding style by running the bash script in the codestyle directory. Make sure that your folder is included in the codestyle/pycodestyle_files.txt list.

Your changes will not be approved if the script reports any violations (this does not apply to third-party code).

Usage:

cd codestyle
sh check_code_style.sh