📄 Survey · Under Review · 2026

Say No Too Often:
Over-Refusals in Foundation Models

Jiaxi Yang Shicheng Liu†‡ Abolfazl Ansari Yuchen Yang Dongwon Lee

The Pennsylvania State University
† Equal contribution  ·  ‡ Intern at Penn State  ·  ★ Corresponding author

Overview

What is over-refusal, why does it matter, and how do we study it?

Refusal mechanisms are essential for safety alignment in foundation models. However, over-refusal — where a model says "No" too often, rejecting even benign queries because its alignment is overly conservative — has recently emerged as an important concern. Unlike jailbreaks (under-refusal), over-refusal arises from excessive safety alignment that suppresses legitimate user requests. In this survey, we present the first comprehensive framework dedicated to over-refusal, covering benchmarks, evaluation metrics, mitigation strategies, open challenges, and real-world applications.
Figure 1: Investigation framework for over-refusal in foundation models. We evaluate models on benchmarks using dedicated metrics to detect over-refusal. If identified, mitigation strategies are applied; unresolved issues motivate future work.
40+ Papers Surveyed · 15+ Benchmarks · 3 Modalities · 9 Eval Metrics · 5 Open Challenges

Contributions

Three main contributions of this survey paper.

1 · First Comprehensive Survey on Over-Refusal

To the best of our knowledge, the first survey dedicated to over-refusal in foundation models, providing a unified framework for understanding and mitigating this problem.

2 · Systematic Taxonomy

A systematic taxonomy of over-refusal benchmarks, evaluation metrics, and mitigation methods across LLMs, VLMs, and audio models — clarifying the current research landscape.

3 · Challenges & Future Directions

Five key open challenges in over-refusal research with promising future directions, highlighting practical applications where mitigating over-refusal is critical.

Taxonomy

We organize the over-refusal literature across three research dimensions.

📋 Benchmarks
Datasets spanning single-turn questions, multi-turn dialogues, long-context, multilingual, and multimodal settings.
Scope: LLMs · VLMs · Audio / T2I · Healthcare
📐 Evaluation Metrics
Metrics for over-refusal (ORR, CR, RS), under-refusal (TRR, ASR), and trade-off measures (MB-Score, NSI, ΔIR).
Scope: Over-Refusal · Under-Refusal · Trade-off
🛠️ Mitigation Methods
Training-based (SFT, DPO, GRPO), inference-time (activation steering, decoding calibration), and explanation-based approaches.
Scope: Training · Inference-Time · Explanation
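To make the metric taxonomy concrete, here is a minimal sketch of the two core rates it contrasts: an over-refusal rate (ORR, fraction of benign prompts the model refuses) and a true refusal rate (TRR, fraction of harmful prompts it refuses). The function names and the binary refusal labels are illustrative assumptions, not definitions from any specific benchmark.

```python
def over_refusal_rate(refused, benign):
    """ORR sketch: fraction of benign prompts that were refused."""
    benign_idx = [i for i, b in enumerate(benign) if b]
    return sum(refused[i] for i in benign_idx) / len(benign_idx)

def true_refusal_rate(refused, benign):
    """TRR sketch: fraction of harmful prompts that were refused."""
    harmful_idx = [i for i, b in enumerate(benign) if not b]
    return sum(refused[i] for i in harmful_idx) / len(harmful_idx)

# Toy run: 4 benign prompts (one wrongly refused), 2 harmful (both refused).
refused = [True, False, False, False, True, True]
benign  = [True, True,  True,  True,  False, False]
print(over_refusal_rate(refused, benign))  # 0.25
print(true_refusal_rate(refused, benign))  # 1.0
```

A well-calibrated model drives ORR toward 0 while keeping TRR near 1; trade-off measures such as MB-Score combine the two sides into a single number.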

Open Challenges

Five key challenges we identify in current over-refusal research.

01 · Human Perception
Evaluations focus on model-centric metrics while overlooking how users actually perceive refusal behavior — the same ORR can yield vastly different user experiences.
02 · Explanation Utility Functions
Attribution methods (SHAP, Integrated Gradients) rely on general-purpose utility functions not designed for refusal, leading to suboptimal identification of refusal triggers.
03 · Domain-Specific Over-Refusal
Safety boundaries differ substantially across domains (healthcare, finance, law). General methods are hard to adapt without domain-specific benchmarks.
04 · Other Modalities
Video-language and embodied AI models remain largely unexplored. Modality-specific benchmarks for systematic evaluation are urgently needed.
05 · Ambiguous Safety Boundaries
The line between benign and harmful is often inherently unclear, complicating both benchmark construction and determining appropriate mitigation degree.

Citation

If you find our work useful, please consider citing our paper.

@article{yang2025sayno,
  title   = {Say No Too Often: Over-Refusals in Foundation Models},
  author  = {Yang, Jiaxi and Liu, Shicheng and Ansari, Abolfazl and Yang, Yuchen and Lee, Dongwon},
  journal = {arXiv preprint},
  year    = {2026},
  note    = {Under review},
  url     = {https://github.com/abbottyanginchina/Awesome-Over-Refusal}
}

We maintain a continuously updated paper list at github.com/abbottyanginchina/Awesome-Over-Refusal. Pull requests welcome!