1st Solution For Conversational Multi-Doc QA Workshop & International Challenge @ WSDM'24 - Xiaohongshu Inc.
This repo contains the source code of our solution for WSDM Cup 2024: Conversational Multi-Doc QA.
Please refer to our paper for the details of this solution: The First Place Solution of WSDM Cup 2024: Leveraging Large Language Models for Conversational Multi-Doc QA
- SOLAR-10.7B-Instruct backbone
- Hybrid Training
- Noisy Document Filter
- Model Ensemble
- Follow Installation for modelscope/swift to install swift.
- Install vllm
- Install deepspeed
- Install sklearn
- Install SentenceTransformers
Or you can run the following (tested on a V100 32G with CUDA 11.8 and Ubuntu 20.04.1):
conda create -n swift python=3.10
conda activate swift
pip install ms-swift[all] -U
pip install vllm==0.3.1
pip install deepspeed
pip install scikit-learn
pip install sentence_transformers
Main package versions (a quick sanity check follows this list):
python==3.10.13
ms-swift==1.6.1
scikit-learn==1.4.1.post1
sentence-transformers==2.3.1
torch==2.1.2
transformers==4.37.2
vllm==0.3.1
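If you want to double-check the environment, a quick sanity check like the following (not part of the original repo) confirms that the key packages import and roughly match the versions listed above:

```python
# Optional environment sanity check (not part of the original repo).
import torch
import transformers
import vllm
import sklearn
import sentence_transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
print("scikit-learn:", sklearn.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```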
- preprocess/data_format.py: Format the data required for training and evaluation
- preprocess/data_format_Pseudo.py: Format the hybrid training data
- preprocess/score_train_eval(test).py: Calculate scores for the noisy document filter (see the sketch below)
- preprocess/score_order.py: Interactive code to delete noisy documents
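The exact scoring logic lives in the scripts above; as a rough, hypothetical sketch (illustrative model path, reference text, and threshold), the noisy document filter can be thought of as ranking each candidate document by its embedding similarity to a reference text and dropping the low-scoring ones:

```python
# Hypothetical sketch of noisy-document scoring: rank candidate documents by
# cosine similarity to a reference text and drop the low-scoring ones.
# The actual preprocess/score_*.py scripts may compute scores differently.
from sentence_transformers import SentenceTransformer, util

# nomic-embed-text-v1 ships custom modeling code, hence trust_remote_code=True.
# (It also expects task prefixes such as "search_document: ", omitted here.)
model = SentenceTransformer("pretrained/nomic-ai/nomic-embed-text-v1",
                            trust_remote_code=True)

def score_documents(documents, reference):
    doc_emb = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)
    ref_emb = model.encode([reference], convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(ref_emb, doc_emb)[0].tolist()

docs = ["Relevant passage ...", "Unrelated boilerplate ...", "Another relevant passage ..."]
scores = score_documents(docs, "the gold answer or the query text")
kept = [d for d, s in zip(docs, scores) if s > 0.5]  # threshold is illustrative
```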
We use the LLM framework ms-swift by ModelScope.
- runsh/solar_instruct_sft_template.sh: SFT (fine-tuning) template script
- runsh/solar_instruct_infer_template.sh: Inference template script
- merge/calculate_score.py: Calculate scores for ensemble learning
- merge/merge_score.py: Ensemble the results
- keyword: An attempt to generate keywords or answers directly with GPT
- multi_stage: A multi-stage LLM attempt (did not work)

You can find all intermediate files in the result folder.
- Download the pretrained models from Hugging Face:
  - upstage/SOLAR-10.7B-Instruct-v1.0 (10.7 B)
  - nomic-ai/nomic-embed-text-v1 (0.14 B)
- Download our 8 fine-tuned LoRA adapters from our Hugging Face repository (0.03 B each).

So the total model size is 10.7B + 0.14B + 0.03B * 8 = 11.08B, much smaller than 14 billion (14B) parameters.
- Put them in the right folder. The folder should look as follows:
└── checkpoints
├── v08-20240205-114459/
├── v10-20240205-114325/
├── v13-20240202-072530/
├── v13-20240206-111010/
├── v16-20240206-224659/
├── v27-20240209-133614/
├── v33-20240210-002918/
└── v35-20240210-120550/
└── pretrained
└── nomic-ai/nomic-embed-text-v1/
├── 1_Pooling/
├── config.json
├── config_sentence_transformers.json
├── configuration_hf_nomic_bert.py
├── .gitattributes
├── .locks/
├── modeling_hf_nomic_bert.py
├── model.safetensors
├── modules.json
├── onnx/
├── pytorch_model.bin
├── README.md
├── sentence_bert_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── vocab.txt
└── upstage/SOLAR-10.7B-Instruct-v1.0/
├── config.json
├── generation_config.json
├── .gitattributes
├── .locks/
├── model-00001-of-00005.safetensors
├── model-00002-of-00005.safetensors
├── model-00003-of-00005.safetensors
├── model-00004-of-00005.safetensors
├── model-00005-of-00005.safetensors
├── model.safetensors.index.json
├── README.md
├── solar_logo.png
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
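As an optional check (not part of the repo), you can verify that the weights are in the expected locations by loading the tokenizers from the local paths shown in the tree above:

```python
# Optional check (not part of the repo): load the tokenizers from the local
# paths in the directory tree above to confirm the downloads are in place.
from transformers import AutoTokenizer

solar_tok = AutoTokenizer.from_pretrained("pretrained/upstage/SOLAR-10.7B-Instruct-v1.0")
nomic_tok = AutoTokenizer.from_pretrained("pretrained/nomic-ai/nomic-embed-text-v1",
                                          trust_remote_code=True)
print("SOLAR vocab size:", solar_tok.vocab_size)
print("nomic vocab size:", nomic_tok.vocab_size)
```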
Run python data_format.py to preprocess the original test data.
Then run the shell scripts in the runsh folder:
bash runsh/v08-20240205-114459.sh
bash runsh/v10-20240205-114325.sh
bash runsh/v13-20240202-072530.sh
bash runsh/v13-20240206-111010.sh
bash runsh/v16-20240206-224659.sh
bash runsh/v27-20240209-133614.sh
bash runsh/v33-20240210-002918.sh
bash runsh/v35-20240210-120550.sh
- You can modify the CUDA device at the beginning of each shell script via CUDA_VISIBLE_DEVICES=
- The result files are saved in the merge folder, which should look as follows:
└── merge
├── v08-20240205-114459.jsonl
├── v10-20240205-114325.jsonl
├── v13-20240202-072530.jsonl
├── v13-20240206-111010.jsonl
├── v16-20240206-224659.jsonl
├── v27-20240209-133614.jsonl
├── v33-20240210-002918.jsonl
└── v35-20240210-120550.jsonl
The scores of these individual results are as follows:
File | Word-level ROUGE-L | Character-level ROUGE-L | Keywords Recall |
---|---|---|---|
v08-20240205-114459 | 0.45532953438881013 | 0.6143454883849857 | 0.6824189095928223 |
v10-20240205-114325 | 0.456275615214309 | 0.6149276913541135 | 0.6817805383022769 |
v13-20240202-072530 | 0.4554468517276402 | 0.6141346993379754 | 0.6827095609704305 |
v13-20240206-111010 | 0.456388581088847 | 0.6149210447203279 | 0.6840088655306036 |
v16-20240206-224659 | 0.45375515045837794 | 0.613359666771279 | 0.6879538939321544 |
v27-20240209-133614 | 0.45574561117381773 | 0.6145520850027292 | 0.6826942984551678 |
v33-20240210-002918 | 0.4559195951083145 | 0.6141543510329665 | 0.6865596963423041 |
v35-20240210-120550 | 0.45573339341665703 | 0.614208192382808 | 0.6813332802463232 |
Even without ensembling, each of these single models is already well ahead of the second-place team.
First, calculate the embedding scores:

python calculate_score.py

Note that this program is accelerated by torch.multiprocessing; you can modify the number of processes near num_group = 16. (It works well on a V100 32G.)
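As a rough illustration (not the actual implementation), the parallelization can be pictured as splitting the examples into num_group chunks and scoring each chunk in its own process:

```python
# Illustrative sketch only: split the examples into num_group chunks and score
# each chunk in its own process. calculate_score.py may organize this differently.
import torch.multiprocessing as mp

num_group = 16  # number of worker processes; reduce if memory is tight

def worker(rank, chunks, return_dict):
    # Stand-in for the real embedding/similarity computation on chunks[rank].
    return_dict[rank] = [len(text) for text in chunks[rank]]

if __name__ == "__main__":
    data = [f"candidate answer {i}" for i in range(1000)]
    chunks = [data[i::num_group] for i in range(num_group)]
    manager = mp.Manager()
    return_dict = manager.dict()
    mp.spawn(worker, args=(chunks, return_dict), nprocs=num_group, join=True)
    scores = [s for rank in range(num_group) for s in return_dict[rank]]
    print(len(scores))
```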
Then generate the final result:

python merge_score.py

It will generate emb_a_s_8_0_1_2_3_4_5_6_7.zip in the root folder, which is our final submission.
Word-level ROUGE-L | Character-level ROUGE-L | Keywords Recall |
---|---|---|
0.465360141853671 | 0.6208371209722543 | 0.6953475871954128 |
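As a hypothetical sketch of the embedding-based ensemble idea (the actual merge/merge_score.py may weight and select differently): for each question, embed the candidate answers from the 8 models and keep the one that is, on average, most similar to the others.

```python
# Hypothetical sketch: pick, per question, the candidate answer whose embedding
# is most similar on average to the other candidates. The real merge_score.py
# may use a different selection or weighting rule.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pretrained/nomic-ai/nomic-embed-text-v1",
                            trust_remote_code=True)

def select_answer(candidates):
    emb = model.encode(candidates, convert_to_tensor=True, normalize_embeddings=True)
    sim = util.cos_sim(emb, emb)                                # pairwise cosine similarity
    mean_sim = (sim.sum(dim=1) - 1.0) / (len(candidates) - 1)   # exclude self-similarity
    return candidates[int(mean_sim.argmax())]

candidates_from_8_models = ["answer from model A ...",
                            "answer from model B ...",
                            "answer from model C ..."]
print(select_answer(candidates_from_8_models))
```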
If you find our work helpful, please consider citing the following paper:
@misc{li2024place,
title={The First Place Solution of WSDM Cup 2024: Leveraging Large Language Models for Conversational Multi-Doc QA},
author={Yiming Li and Zhao Zhang},
year={2024},
eprint={2402.18385},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Zhao Zhang: zhaozhao809@163.com
Yiming Li: eamon.y.li@gmail.com