Dynamic Guardrail Generation (DGG): A Framework for Prompt-Time Mitigation of LLM Harms
Abstract
Large Language Models (LLMs) are increasingly used as tools for content creation, yet they often generate biased and toxic content, and common reactive mitigation strategies such as self-correction fail to address the flawed reasoning that produces it. This paper introduces Dynamic Guardrail Generation (DGG), a proactive, three-stage prompting framework that compels a model to perform a safety analysis before generating a response. The DGG process involves the model (1) identifying probable harm types from a prompt, (2) formulating explicit, imperative directives to avoid them, and (3) generating a final response strictly constrained by these self-generated guardrails. We evaluated DGG using GPT-3.5 on the BOLD-1.5K (bias) and RTP-High (toxicity) datasets against Base and Self-Correct baselines. Results show that DGG is highly effective at mitigating societal bias, reducing it by 41%. DGG also reduces toxicity by up to 60%, although it does not yet match the reactive Self-Correct approach in that domain. The framework's specific contribution is that it makes safety rules dynamic and prompt-specific, which distinguishes it from related approaches such as Constitutional AI, in which models follow a static set of rules; the result is a more tailored, context-aware safety mechanism at the moment of inference. More broadly, the work seeks to shift the AI safety paradigm from reactive correction to proactive self-governance. By compelling a model to analyze risks and set its own rules before generating a response, DGG offers a new direction for improving AI safety that requires neither external tools nor post-generation fixes.
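
The three-stage process described above can be pictured as three chained prompts to the same model. The sketch below is a minimal illustration only, assuming the OpenAI chat-completions client with GPT-3.5; the prompt wording, the complete helper, and the stage boundaries are illustrative assumptions, not the paper's exact prompts or implementation.

    # Minimal sketch of the three-stage DGG prompting flow (illustrative only).
    # Assumes the OpenAI Python client; prompt wording and model name are
    # placeholders, not the paper's exact implementation.
    from openai import OpenAI

    client = OpenAI()

    def complete(prompt: str) -> str:
        """Single-turn chat completion used for each DGG stage."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def dgg_respond(user_prompt: str) -> str:
        # Stage 1: identify probable harm types raised by the prompt.
        harms = complete(
            "List the kinds of societal bias or toxicity that a response to "
            f"the following prompt could plausibly exhibit:\n\n{user_prompt}"
        )
        # Stage 2: turn the identified harms into explicit, imperative directives.
        guardrails = complete(
            "Rewrite each of the following risks as an imperative directive the "
            f"response must obey (e.g., 'Do not ...'):\n\n{harms}"
        )
        # Stage 3: generate the final response, strictly constrained by the
        # self-generated guardrails prepended to the original prompt.
        return complete(
            f"Follow these directives strictly:\n{guardrails}\n\n"
            f"Now respond to the prompt:\n{user_prompt}"
        )

Under these assumptions, a bias-prone prompt would first yield a list of likely stereotype or toxicity risks, then directives such as "Do not attribute traits to a group as a whole," and finally a response generated under those self-imposed constraints.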