Kung-Hsiang (Steeve) Huang, Hou Pong (Ken) Chan and Heng Ji
Paper link: ACL
Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge bases and preventing hallucinations in sequence-to-sequence models. Drawing on humans' ability to identify and correct factual errors, we present a zero-shot framework that formulates questions about input claims, looks for correct answers in the given evidence, and assesses the faithfulness of each correction based on its consistency with the evidence. Our zero-shot framework outperforms fully-supervised approaches, as demonstrated by experiments on the FEVER and SciFact datasets, where our outputs are shown to be more faithful. More importantly, the decomposable nature of our framework inherently provides interpretability. Additionally, to reveal the most suitable metrics for evaluating factual error corrections, we analyze the correlation between commonly used metrics and human judgments along three dimensions related to intelligibility and faithfulness.
All required packages are listed in `requirements.txt`. To install the dependencies, run:
conda create -n zerofec python=3.7
conda activate zerofec
pip install -r requirements.txt
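As a quick sanity check after installation, the snippet below simply verifies that the core libraries are importable. The exact package set is defined by `requirements.txt`; the three packages shown here are assumptions based on the models used later in this README.

```python
# Sanity check that the core libraries are importable. The exact package set
# comes from requirements.txt; the three below are assumed based on the
# models used later in this README.
import spacy
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("spacy:", spacy.__version__)
```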
The FEVER and SciFact datasets used in our experiments can be downloaded here.
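Both datasets are distributed as JSONL files. The sketch below is a hypothetical peek at the data format; it assumes each line contains at least the `input_claim` and `evidence` fields used in the usage example further down, and the real files may contain additional fields.

```python
# Hypothetical peek at the data format. This assumes each JSONL line contains
# at least the "input_claim" and "evidence" fields used in the usage example
# below; the real files may contain additional fields.
import json

with open("PATH/TO/fever_test.jsonl") as f:
    first = json.loads(f.readline())

print(first["input_claim"])
print(first["evidence"])
```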
No training is needed, as our framework corrects factual errors in a zero-shot fashion. However, each individual component has been trained on its corresponding sub-task. Please download each component as follows (a loading sketch for these components is shown after the list):
- Candidate Generation: Download the spaCy models by running `python -m spacy download en_core_web_lg` and `python -m spacy download en_core_sci_md`.
- Question Generation: Available on HuggingFace as Salesforce/mixqg-base.
- Question Answering: Available on HuggingFace as allenai/unifiedqa-v2-t5-base-1251000. The domain-adapted version is also available on HuggingFace as khhuang/zerofec-daqa-t5-base.
- QA-to-Claim: Available on HuggingFace as khhuang/zerofec-qa2claim-t5-base.
- Correction Scoring: Download the DocNLI model from its repo so that the checkpoint is located at `zerofec/docnli/DocNLI.pretrained.RoBERTA.model.pt`. The checkpoints for the domain-adapted models can be found here.
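The following sketch loads the off-the-shelf components individually, as a quick check that everything is in place. Only the HuggingFace and spaCy pieces are shown; the DocNLI entailment model is a plain PyTorch checkpoint that requires the model class from the DocNLI repo, so it is omitted here, and the comment about `use_scispacy` is an assumption based on the `model_args` shown below.

```python
# Loading sketch for the individual components (HuggingFace and spaCy parts
# only). The DocNLI entailment checkpoint requires the model class from the
# DocNLI repo and is therefore omitted here.
import spacy
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Candidate generation: spaCy models. en_core_sci_md is presumably only
# needed when use_scispacy is enabled (see model_args below).
nlp = spacy.load("en_core_web_lg")
sci_nlp = spacy.load("en_core_sci_md")

# Question generation: MixQG.
qg_tokenizer = AutoTokenizer.from_pretrained("Salesforce/mixqg-base")
qg_model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/mixqg-base")

# Question answering: UnifiedQA-v2 (the tokenizer is the plain t5-base
# tokenizer, as in model_args below).
qa_tokenizer = AutoTokenizer.from_pretrained("t5-base")
qa_model = AutoModelForSeq2SeqLM.from_pretrained("allenai/unifiedqa-v2-t5-base-1251000")

# QA-to-claim conversion.
qa2claim_tokenizer = AutoTokenizer.from_pretrained("khhuang/zerofec-qa2claim-t5-base")
qa2claim_model = AutoModelForSeq2SeqLM.from_pretrained("khhuang/zerofec-qa2claim-t5-base")
```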
Example use of ZeroFEC is shown below.
from types import SimpleNamespace
from zerofec.zerofec import ZeroFEC
model_args = {
    'qg_path': 'Salesforce/mixqg-base',
    'qa_model_path': 'allenai/unifiedqa-v2-t5-base-1251000',
    'qa_tokenizer_path': 't5-base',
    'entailment_model_path': 'PATH/TO/DocNLI.pretrained.RoBERTA.model.pt',
    'entailment_tokenizer_path': 'roberta-large',
    'qa2s_tokenizer_path': 'khhuang/zerofec-qa2claim-t5-base',
    'qa2s_model_path': 'khhuang/zerofec-qa2claim-t5-base',
    'use_scispacy': False
}
model_args = SimpleNamespace(**model_args)
zerofec = ZeroFEC(model_args)
sample = {
    "input_claim": "Night of the Living Dead is a Spanish comic book.",
    "evidence": "Night of the Living Dead is a 1968 American independent horror film , directed by George A. Romero ..."
}
corrected_claim = zerofec.correct(sample)
The `corrected_claim` dictionary now contains a key `final_answer`, which is the final correction, as well as all intermediate outputs, which provide interpretability.
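For example, the correction and the intermediate outputs can be inspected as follows; apart from `final_answer`, the exact key names depend on the implementation in `zerofec/zerofec.py`.

```python
# Print the final correction and the intermediate outputs. Apart from
# "final_answer", the exact key names depend on the implementation in
# zerofec/zerofec.py.
print(corrected_claim["final_answer"])
for key, value in corrected_claim.items():
    if key != "final_answer":
        print(f"{key}: {value}")
```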
Batch processing is supported via `zerofec.batch_correct(samples)`, where `samples` is a list of dictionaries.
For additional information about `model_args`, please refer to `main.py`.
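A minimal sketch of batch usage, reusing the sample format from the example above; it assumes `batch_correct` returns one output dictionary per input sample.

```python
# Minimal batch example. This assumes batch_correct returns one output
# dictionary (with a "final_answer" key) per input sample.
samples = [
    {
        "input_claim": "Night of the Living Dead is a Spanish comic book.",
        "evidence": "Night of the Living Dead is a 1968 American independent horror film , directed by George A. Romero ...",
    },
    # ... more samples in the same format ...
]
corrected_claims = zerofec.batch_correct(samples)
print(corrected_claims[0]["final_answer"])
```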
To run prediction on the two datasets used in our experiments, run the following commands:
python main.py --input_path PATH/TO/scifact_test.jsonl --output_path outputs/scifact_outputs.jsonl
python main.py --input_path PATH/TO/fever_test.jsonl --output_path outputs/fever_outputs.jsonl
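The predictions are written as JSONL. The snippet below is one way to inspect them, assuming each output line is a JSON object containing the `final_answer` field described above.

```python
# Read back the predictions. This assumes each output line is a JSON object
# containing the "final_answer" field described earlier in this README.
import json

with open("outputs/fever_outputs.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record.get("final_answer"))
```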
Download the metrics by following the instructions in their corresponding repos:
The evaluation scripts are in the `evals` directory. All evaluation, except for QAFactEval, is handled by the script `evals.sh`; this is because QAFactEval has a different set of dependencies from the other metrics. For all other metrics, you can install the corresponding dependencies in the `zerofec` virtual environment created at the beginning of this README. For QAFactEval, you need to create a new environment and install QAFactEval's dependencies in it.
cd evals
bash evals.sh $OUTPUT_PATH
where `$OUTPUT_PATH` is the path to the output file produced by `main.py` (e.g. `outputs/fever_outputs.jsonl`). Following a similar procedure, you can evaluate the outputs with QAFactEval using the `evals_qafactevals.sh` script.
If you find this work useful, please consider citing:
@inproceedings{huang2023zero,
    title = "Zero-shot Faithful Factual Error Correction",
    author = "Huang, Kung-Hsiang and Chan, Hou Pong and Ji, Heng",
    year = "2023",
    month = jul,
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    publisher = "Association for Computational Linguistics",
}