

Poster

DivSafe: Evaluating the Generalization of LLM Safety Training Across Diverse Tasks and Prompt Types

Yutao Mou · Shikun Zhang · Wei Ye

West Ballroom A-D #5209
Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

With the widespread application of large language models (LLMs), concerns about their safety have grown. To ensure that the content generated by LLMs is harmless, safety training and evaluation have become necessary stages in LLM development. In this paper, we investigate an under-explored problem: the generalization of LLM safety training across diverse test tasks and prompt types. First, we construct DivSafe, the first multi-dimensional safety evaluation benchmark, which evaluates the safety performance of LLMs from multiple perspectives, such as test tasks and prompt types. DivSafe contains four test sets covering both open-ended text generation and safety content discrimination tasks. In addition, we construct several extended evaluation sets to assess how prompt engineering techniques, such as system prompts, few-shot demonstrations, and chain-of-thought prompting, affect LLM safety performance. We conduct a comprehensive evaluation of 3 advanced proprietary LLMs and 8 open-source LLMs. The results show that almost all LLMs exhibit lower safety performance on the discrimination task than on open-ended generation, and are susceptible to prompt variations, which demonstrates the poor generalization of LLM safety training. We also conduct extensive experiments and qualitative analysis to explain this phenomenon and shed light on future research.
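The extended evaluation sets described in the abstract vary only the prompting strategy around a fixed query. The sketch below illustrates that setup in a minimal, hypothetical form; the `build_prompt` helper, the system prompt text, the demonstration pair, and the placeholder query are illustrative assumptions, not the benchmark's actual data or code.

```python
# Hypothetical sketch: building the prompt-type variants mentioned in the
# abstract (system prompt, few-shot demonstrations, chain-of-thought) around
# a single fixed query. All names and example texts are placeholders.

SAFETY_SYSTEM_PROMPT = "You are a helpful assistant. Refuse unsafe requests."

FEW_SHOT_DEMOS = [
    ("How do I reset a forgotten password?",
     "You can use the 'Forgot password' link on the login page."),
]

def build_prompt(query: str, variant: str) -> list[dict]:
    """Return a chat-style message list for one prompt-type variant."""
    messages = []
    if variant == "system":
        # Prepend a safety-oriented system prompt.
        messages.append({"role": "system", "content": SAFETY_SYSTEM_PROMPT})
    if variant == "few_shot":
        # Prepend benign demonstration turns before the test query.
        for demo_q, demo_a in FEW_SHOT_DEMOS:
            messages.append({"role": "user", "content": demo_q})
            messages.append({"role": "assistant", "content": demo_a})
    # Chain-of-thought variant appends a step-by-step instruction.
    suffix = "\nLet's think step by step." if variant == "cot" else ""
    messages.append({"role": "user", "content": query + suffix})
    return messages

if __name__ == "__main__":
    query = "Example potentially unsafe request (placeholder)."
    for variant in ("vanilla", "system", "few_shot", "cot"):
        print(variant, build_prompt(query, variant))
```

Each variant's messages would then be sent to the model under evaluation, and the responses scored for safety; that scoring step is specific to the benchmark and is not shown here.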
