This repository contains the PyTorch implementation of the models and experiments from the paper "Edisum: Summarizing and Explaining Wikipedia Edits at Scale":
@article{šakota2024edisum,
  title={Edisum: Summarizing and Explaining Wikipedia Edits at Scale},
  author={Marija Šakota and Isaac Johnson and Guosheng Feng and Robert West},
  journal={arXiv preprint arXiv:2404.03428},
  year={2024}
}
Please consider citing our work if you find the provided resources useful.
Start by cloning the repository:
git clone https://github.com/epfl-dlab/edisum.git
We recommend creating a new conda virtual environment as follows:
conda env create -f environment.yml
This command also installs all the necessary packages.
The data is available on Hugging Face and can be loaded with:
from datasets import load_dataset
dataset = load_dataset("msakota/edisum_dataset")
Alternatively, to download the collected data for the experiments, run:
bash ./download_data.sh
To download the trained models (also hosted on Hugging Face), run:
bash ./download_models.sh
To train a model from scratch on the desired data, run:
DATA_DIR="./data/100_perc_synth_data/" # specify a directory where training data is located
RUN_NAME="train_longt5_100_synth"
python run_train.py run_name=$RUN_NAME dir=$DATA_DIR +experiment=finetune_longt5
To run inference on a trained model:
DATA_DIR="./data/100_perc_synth_data/" # specify a directory where the inference data is located
CHECKPOINT_PATH="./models/edisum_100.ckpt" # specify path to the trained model
RUN_NAME="inference_longt5_100_synth"
python run_inference.py run_name=$RUN_NAME dir=$DATA_DIR checkpoint_path=$CHECKPOINT_PATH +experiment=inference_longt5
To test any of the trained models on an arbitrary edit diff link:
python run_model.py --model_name_or_path edisum_100 --diff_link "https://en.wikipedia.org/w/index.php?title=C/2023_A3_(Tsuchinshan–ATLAS)&diff=prev&oldid=1251441412"
Optionally, you can stop generation whenever the edit contains node changes (since the generated summary might not reflect such changes exhaustively) by adding -prohibit_node. If no model_name_or_path is provided, the script defaults to edisum_100. You can provide a path to any .ckpt model, or specify one of the five models from the paper: [edisum_0, edisum_25, edisum_50, edisum_75, edisum_100], where the number represents the percentage of synthetic data in the training dataset.
To test any custom input, which might not necessarily be a real edit:
python run_model.py --model_name_or_path edisum_100 --input_text <your_input_text>
For optimal performance, the input text should be formatted the same way the training data was:
- The edit diff is represented by collecting the sentences that were altered, added, or removed during the edit into two sets: previous sentences (belonging to the previous revision of the page) and current sentences (belonging to the current revision of the page)
- Previous sentences should contain every sentence that was removed from the previous revision, as well as the previous-revision versions of sentences that were altered
- Current sentences should contain every sentence that was added in the new revision, as well as the new-revision versions of sentences that were altered
- The input is then built by concatenating the sentences in previous sentences, separating them with <sent_sep> and adding the prefix <old_text>. Similarly, the sentences in current sentences are separated with the same <sent_sep> and the prefix <new_text> is added. The final input is derived by concatenating these two representations.
Example:
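The formatting described above can be sketched as a small helper. This is an illustration only: the exact whitespace around the <sent_sep>, <old_text>, and <new_text> tokens is an assumption based on the description, so check the repository's data-processing code for the authoritative format.

```python
def build_edit_input(prev_sentences, curr_sentences):
    """Sketch of the input format: previous sentences prefixed with
    <old_text>, current sentences prefixed with <new_text>, sentences
    within each set joined by <sent_sep>. Token spacing is assumed."""
    old_part = "<old_text> " + " <sent_sep> ".join(prev_sentences)
    new_part = "<new_text> " + " <sent_sep> ".join(curr_sentences)
    return old_part + " " + new_part

# Hypothetical edit: one sentence reworded, one sentence added.
example = build_edit_input(
    ["The comet was discovered in 2023."],
    ["The comet was discovered in January 2023.",
     "It is visible from Earth."],
)
print(example)
```

The resulting string can then be passed to run_model.py via --input_text.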
We also provide a Jupyter notebook for experimentation with custom inputs: playground.ipynb
This project is licensed under the terms of the MIT license.