red-eval

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

Paper | Github | Dataset | Model

As a part of our efforts to make LLMs safer for public use, we provide:

Code to evaluate LLM safety against Chain of Utterances (CoU) based prompts, referred to as the RedEval benchmark

Red-Eval Benchmark

Simple scripts to evaluate closed-source systems (ChatGPT, GPT4) and open-source LLMs on our benchmark red-eval.

To compute the Attack Success Rate (ASR), Red-Eval uses two question banks of harmful questions:

HarmfulQA (1,960 harmful questions covering 10 topics and ~10 subtopics each)
DangerousQA (200 harmful questions across 6 adjectives—racist, stereotypical, sexist, illegal, toxic, and harmful)
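
As a rough illustration of the metric, ASR can be read as the fraction of questions for which the model's response was judged harmful. The sketch below assumes each response has already been given a string label; the label values ("harmful" / "harmless") and the function name are hypothetical and are not the output format of gpt4_as_judge.py.

```python
from typing import Iterable


def attack_success_rate(labels: Iterable[str], harmful_label: str = "harmful") -> float:
    """Return the fraction of judged responses that carry the harmful label."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels to score")
    # Count the responses the judge marked as harmful and divide by the total.
    return sum(label == harmful_label for label in labels) / len(labels)


# Hypothetical judge labels for four responses (illustration only).
example_labels = ["harmful", "harmless", "harmful", "harmful"]
print(attack_success_rate(example_labels))  # 0.75
```

A higher ASR means the red-teaming prompt elicited harmful responses more often.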

Installation

conda create --name redeval -c conda-forge python=3.11
conda activate redeval
pip install -r requirements.txt

How to perform red-teaming

Step-0: Decide which prompt template you want to use for red-teaming. As a part of our efforts, we provide a CoU-based prompt that is effective at breaking the safety guardrails of GPT4, ChatGPT, and open-source models.
- Chain of Utterances (CoU)
- Chain of Thoughts (CoT)
- Standard prompt
- Suffix prompt
  
  (Note: Different LLMs may require slight variations in the above prompt template to generate meaningful outputs. To create a new template, you can refer to the above template files. Just make sure to have a "<question>" string in the prompt which is a placeholder for the harmful question.)
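
The placeholder convention in the note above can be sketched as a plain string substitution. This is a minimal sketch, not the code used by generate_responses.py; the function name, the example template text, and the benign example question are all assumptions made for illustration.

```python
PLACEHOLDER = "<question>"  # the placeholder string required in every template


def fill_template(template: str, question: str) -> str:
    """Substitute the question into a prompt template (hypothetical helper)."""
    if PLACEHOLDER not in template:
        # A template without the placeholder would silently drop the question.
        raise ValueError(f"template must contain the {PLACEHOLDER!r} placeholder")
    return template.replace(PLACEHOLDER, question)


# Illustrative template text; the real templates live under red_prompts/.
template = "Answer the following question.\nQuestion: <question>\nAnswer:"
print(fill_template(template, "What is the capital of France?"))
```

Checking for the placeholder up front makes a malformed custom template fail early instead of producing prompts with no question in them.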
Step-1: Generate model outputs on harmful questions by providing a path to the question bank and red-teaming prompt:

Closed-source models (GPT4 and ChatGPT):

  python generate_responses.py --model gpt4 --prompt red_prompts/cou.txt --dataset hamrful_questions/dangerousqa.json
  python generate_responses.py --model chatgpt --prompt red_prompts/cou.txt --dataset hamrful_questions/dangerousqa.json

Open-source models:

  python generate_responses.py --model lmsys/vicuna-7b-v1.3 --prompt red_prompts/cou.txt --dataset hamrful_questions/dangerousqa.json

For better readability, internal thoughts can be removed from the responses by passing --clean_thoughts:

python generate_responses.py --model gpt4 --prompt red_prompts/cou.txt --dataset hamrful_questions/dangerousqa.json --clean_thoughts
python generate_responses.py --model chatgpt --prompt red_prompts/cou.txt --dataset hamrful_questions/dangerousqa.json --clean_thoughts
python generate_responses.py --model lmsys/vicuna-7b-v1.3 --prompt red_prompts/cou.txt --dataset hamrful_questions/dangerousqa.json --clean_thoughts

To load open-source models in 8-bit, pass --load_8bit:

  python generate_responses.py --model lmsys/vicuna-7b-v1.3 --prompt red_prompts/cou.txt --dataset hamrful_questions/dangerousqa.json --load_8bit
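
When running several models in one go, the commands above can be assembled programmatically. The sketch below only builds and prints the command lines, using the flags shown in this README; the helper name `build_command` and its defaults are hypothetical, and the dataset path keeps the directory spelling used in the repository.

```python
import shlex


def build_command(
    model: str,
    prompt: str = "red_prompts/cou.txt",
    dataset: str = "hamrful_questions/dangerousqa.json",
    clean_thoughts: bool = False,
    load_8bit: bool = False,
) -> list[str]:
    """Build the argument list for one generate_responses.py run (hypothetical helper)."""
    cmd = [
        "python", "generate_responses.py",
        "--model", model,
        "--prompt", prompt,
        "--dataset", dataset,
    ]
    if clean_thoughts:
        cmd.append("--clean_thoughts")
    if load_8bit:
        cmd.append("--load_8bit")
    return cmd


# Print one command per model; nothing is executed here.
for model in ["gpt4", "chatgpt", "lmsys/vicuna-7b-v1.3"]:
    print(shlex.join(build_command(model, clean_thoughts=True)))
```

To actually launch a run, each argument list could be passed to `subprocess.run(cmd, check=True)` from the repository root.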

Step-2: Annotate the generated responses using gpt4-as-a-judge:

python gpt4_as_judge.py --response_file results/dangerousqa_gpt4_cou.json --save_path results

Results

Attack Success Rate (ASR) of different red-teaming attempts.

Model	DangerousQA Standard	DangerousQA CoT	DangerousQA RedEval	DangerousQA Average	HarmfulQA Standard	HarmfulQA CoT	HarmfulQA RedEval	HarmfulQA Average
GPT-4	0	0	0.651	0.217	0	0.004	0.612	0.206
ChatGPT	0	0.005	0.728	0.244	0.018	0.027	0.728	0.257
Vicuna-13B	0.027	0.490	0.835	0.450	-	-	-	-
Vicuna-7B	0.025	0.532	0.875	0.477	-	-	-	-
StableBeluga-13B	0.026	0.630	0.915	0.523	-	-	-	-
StableBeluga-7B	0.102	0.755	0.915	0.590	-	-	-	-
Vicuna-FT-7B	0.095	0.465	0.860	0.473	-	-	-	-
Llama2-FT-7B	0.722	0.860	0.896	0.826	-	-	-	-
Starling (Blue)	0.015	0.485	0.765	0.421	-	-	-	-
Starling (Blue-Red)	0.050	0.570	0.855	0.492	-	-	-	-
Average	0.116	0.479	0.830	0.471	0.010	0.016	0.67	0.232
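
For reading the table, the per-model Average column appears to be the unweighted mean of the Standard, CoT, and RedEval columns. This is an inference from the numbers rather than something stated here, and some rows differ in the last digit. The sketch below checks it for three DangerousQA rows copied from the table; the variable names and the tolerance are assumptions.

```python
# (Standard, CoT, RedEval, reported Average) for DangerousQA, copied from the table.
rows = {
    "GPT-4": (0.0, 0.0, 0.651, 0.217),
    "ChatGPT": (0.0, 0.005, 0.728, 0.244),
    "Vicuna-7B": (0.025, 0.532, 0.875, 0.477),
}

TOLERANCE = 0.001  # allowance for rounding in the reported values

for model, (standard, cot, redeval, reported) in rows.items():
    # Unweighted mean over the three prompt types.
    mean_asr = (standard + cot + redeval) / 3
    matches = abs(mean_asr - reported) <= TOLERANCE
    print(f"{model}: mean={mean_asr:.3f} reported={reported:.3f} match={matches}")
```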

Citation

@misc{bhardwaj2023redteaming,
      title={Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment}, 
      author={Rishabh Bhardwaj and Soujanya Poria},
      year={2023},
      eprint={2308.09662},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
