Authors: Maksim Eremeev (mae9785@nyu.edu), Mars Wei-Lun Huang (wh2103@nyu.edu), Eric Spector (ejs618@nyu.edu), Jeffrey Tumminia (jt2565@nyu.edu)
```shell
python setup.py build
pip install .
```
```python
from elsa import Elsa

article = '''some text...
'''

abstractive_model_params = {
    'num_beams': 10,
    'max_length': 300,
    'min_length': 55,
    'no_repeat_ngram_size': 3
}

elsa = Elsa(weights=[1, 1], abstractive_base_model='bart', base_dataset='cnn',
            stopwords='data/stopwords.txt',
            fasttext_model_path='datasets/cnn/elsa-fasttext-cnn.bin',
            udpipe_model_path='data/english-ewt-ud-2.5-191206.udpipe')

elsa.summarize(article, **abstractive_model_params)
```
- `weights`: `List[float]` -- weights for the TextRank and Centroid extractive summarizations.
- `abstractive_base_model`: `str` -- model used on the abstractive step. Either `'bart'` or `'pegasus'`.
- `base_dataset`: `str` -- dataset used to train the abstractive model. Either `'cnn'` or `'xsum'`.
- `stopwords`: `str` -- path to the list of stopwords.
- `fasttext_model_path`: `str` -- path to the `*.bin` checkpoint of a trained FastText model (see below for the training instructions).
- `udpipe_model_path`: `str` -- path to the `*.udpipe` checkpoint of the pretrained UDPipe model (see the `data` directory for the files).
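On the extractive step, the `weights` above blend the per-sentence scores of the two summarizers, and the `factor` parameter of `summarize` (described below) controls how many sentences survive. A minimal sketch of that blending, assuming normalized per-sentence scores and a simple linear combination; the scoring itself is not shown and this is not ELSA's actual code:

```python
def blend_scores(textrank_scores, centroid_scores, weights=(1, 1), factor=0.5):
    """Blend two per-sentence score lists; return indices of kept sentences."""
    combined = [weights[0] * t + weights[1] * c
                for t, c in zip(textrank_scores, centroid_scores)]
    k = max(1, round(factor * len(combined)))
    # keep the k best-scoring sentences, restored to original document order
    best = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:k]
    return sorted(best)

blend_scores([0.9, 0.1, 0.5, 0.3], [0.2, 0.8, 0.6, 0.1])  # → [0, 2]
```

With the default `weights=[1, 1]` both extractive methods contribute equally; raising one weight biases sentence selection toward that method.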
- `factor`: `float` -- fraction (a number from 0 to 1) of sentences to keep in the extractive summary (default: `0.5`)
- `use_lemm`: `bool` -- whether to use lemmatization on the preprocessing step (default: `False`)
- `use_stem`: `bool` -- whether to use stemming on the preprocessing step (default: `False`)
- `check_stopwords`: `bool` -- whether to filter stopwords on the preprocessing step (default: `True`)
- `check_length`: `bool` -- whether to filter out tokens shorter than 4 symbols (default: `True`)
- `abstractive_model_params`: `dict` -- any parameters for the HuggingFace model's `generate` method
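The `abstractive_model_params` dict from the usage example is passed through as keyword arguments to the HuggingFace model's `generate` method. A hypothetical sketch of that forwarding pattern, with `_generate` standing in for the real model call:

```python
# Hypothetical sketch: _generate stands in for the HuggingFace model's
# generate() call; ELSA's real pipeline is more involved.
def _generate(text, num_beams=4, max_length=142, min_length=56,
              no_repeat_ngram_size=0):
    # A real model would decode a summary here; we just echo the settings.
    return {'num_beams': num_beams, 'max_length': max_length,
            'min_length': min_length,
            'no_repeat_ngram_size': no_repeat_ngram_size}

def summarize(article, **abstractive_model_params):
    # Extra keyword arguments flow through unchanged to the generation call.
    return _generate(article, **abstractive_model_params)

settings = summarize('some text...', num_beams=10, max_length=300,
                     min_length=55, no_repeat_ngram_size=3)
```

Because the kwargs are forwarded verbatim, any option accepted by `generate` (beam size, length limits, n-gram repetition blocking, and so on) can be tuned per call without changing ELSA itself.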
CNN-DailyMail: Link, original source: Link
XSum: Link, original source: Link
Gazeta.RU: Link, original source: Link
```shell
wget https://s3.amazonaws.com/opennmt-models/Summary/cnndm.tar.gz
wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz
wget https://www.dropbox.com/s/cmpfvzxdknkeal4/gazeta_jsonl.tar.gz

tar -xzf cnndm.tar.gz
tar -xzf XSUM-EMNLP18-Summary-Data-Original.tar.gz
tar -xzf gazeta_jsonl.tar.gz
```
CNN-DailyMail: Link
XSum: Link
Gazeta: Link
See our FastText page for training details.
UDPipe models available for English:
- UDPipe-English EWT: Link (used in our experiments; see the `data` directory)
- UDPipe-English ParTUT: Link
- UDPipe-English LinES: Link
- UDPipe-English GUM: Link
Other UDPipe models: Link
Since the approach we use for ELSA is language-independent, it can easily be adapted to other languages. For Russian, we fine-tune mBART on the Gazeta dataset, train an additional FastText model, and use a UDPipe model built for Russian texts.
- UDPipe-Russian SynTagRus: Link
- UDPipe-Russian GSD: Link (used in our experiments; see the `data` directory)
- UDPipe-Russian Taiga: Link
HuggingFace checkpoint: Link
Before making a commit / pull request, please check the coding style by running the bash script in the `codestyle` directory. Make sure that your folder is included in the `codestyle/pycodestyle_files.txt` list.
Your changes will not be approved if the script reports any style violations (this does not apply to third-party code).
Usage:
```shell
cd codestyle
sh check_code_style.sh
```