[Safety]·PAP-NCU8WN·2023·May 23, 2026

AI Safety Training Can be Clinically Harmful

2023

B. Suhas, Andrew M. Sherrill, Rosa I. Arriaga et al.

4 min read · Alignment · Safety · Training

Core Insight

AI safety protocols harm mental health support efficacy by derailing therapeutic processes.

By the Numbers

0.22-0.33

Therapeutic appropriateness in high-severity scenarios

0.91-1.00

Surface acknowledgment scores

92% to 71%

Drop in task completeness for CBT exercises

0.99 to 0.61

Safety-interference score decrease

In Plain English

The study found that AI mental health chatbots often fail in high-severity scenarios, with therapeutic appropriateness dropping to 0.22-0.33. Safety behavior learned through RLHF disrupts therapeutic interventions, and models collapse under stress tests, which calls large-scale deployments into question.

Knowledge Prerequisites

git blame for knowledge

To fully understand AI Safety Training Can be Clinically Harmful, trace this dependency chain first. Papers in our library are linked.

DIRECT PREREQ · IN LIBRARY
Constitutional AI: Harmlessness from AI Feedback

Understanding the principles of harmless AI is crucial to comprehend the potential clinical impacts of AI safety training.

Harmless AI · AI feedback mechanisms · Ethical AI considerations
DIRECT PREREQ · IN LIBRARY
Tree of Thoughts: Deliberate Problem Solving with Large Language Models

This paper introduces structured reasoning techniques that may affect safety protocols, potentially leading to harmful clinical outcomes.

Chain-of-Thought reasoning · Structured problem solving · Safety in AI decisions
DIRECT PREREQ · IN LIBRARY
Containment Verification: AI Safety Guarantees Independent of Alignment

Understanding containment verification is key to exploring how safety mechanisms can fall short clinically.

Containment verification · AI alignment · Safety guarantees
DIRECT PREREQ · IN LIBRARY
Position: Safety and Fairness in Agentic AI Depend on Interaction Topology, Not on Model Scale or Alignment

This paper discusses how interaction topology affects safety, which is critical to understanding clinical implications.

Interaction topology · Agentic AI · Safety and fairness
DIRECT PREREQ

AI Ethics and Governance

Foundational knowledge of AI ethics and governance is essential to understand the broader context of AI safety.

AI ethics · Governance frameworks · Regulation

YOU ARE HERE

AI Safety Training Can be Clinically Harmful

1,634 words · 9 min read · 12 sections · 15 concepts

Table of Contents

01

The World Before: AI Safety vs. Therapy

159 words

Before this study, AI systems in mental health were developed with a strong emphasis on safety, primarily to avoid causing harm to users. This focus led to the implementation of safety protocols, which are designed to prevent AI models from producing potentially dangerous output. However, this intense focus on safety sometimes came at the cost of therapeutic efficacy. Imagine a scenario where a patient opens up about severe depression to an AI chatbot. The AI's training might prioritize avoiding any suggestion that could be interpreted as encouraging harmful behavior and, in doing so, miss opportunities to provide meaningful support or guidance. Safety training in AI is crucial, but when it becomes the primary focus, it can leave the AI unable to engage effectively in therapeutic conversations. This challenge forms the backdrop against which the study was conducted, aiming to evaluate the balance between safety and efficacy in AI mental health applications.

02

The Specific Failure: Misalignment in High-Severity Scenarios

140 words

The study identified a critical failure in AI models trained with Reinforcement Learning from Human Feedback (RLHF) when addressing high-severity mental health scenarios. These models exhibited a drastic drop in therapeutic appropriateness, with scores plummeting to between 0.22 and 0.33. This misalignment was particularly concerning because it highlighted a failure to handle cases that required immediate and appropriate intervention. Imagine a person in severe distress who reaches out for help, and an AI that, because of its safety training, fails to recognize the severity of the situation. Instead of offering help or guiding the person to seek immediate professional support, it offers platitudes or unwarranted reassurances that do not address the crisis. This specific failure underscores the misalignment between AI safety protocols and the demands of effective mental health support, especially in scenarios where timely and appropriate intervention is critical.

03

The Key Insight: Balancing Safety and Efficacy

138 words

The key insight from this study is that AI safety training, while essential, must be balanced with the need for therapeutic efficacy. The authors understood that the safety measures, particularly those implemented through RLHF, were leading to unintended consequences in therapeutic settings. Think of it like a car with so many safety features that it becomes hard to drive; the AI was so focused on avoiding any potential harm that it failed to provide the necessary therapeutic support. This insight was crucial because it reframed the problem not as one of safety versus efficacy, but as one of alignment between the two. It highlighted the need for AI systems to be designed so that they can safely navigate the complexities of mental health scenarios without compromising their ability to offer meaningful support.

04

Architecture Overview: RLHF in AI Mental Health Support

137 words

The architectural framework of RLHF involves training AI models using human feedback to guide the learning process. This approach is intended to create models that align closely with human values and preferences, especially in sensitive areas like mental health. However, the study found that when applied to mental health scenarios, RLHF-trained models often misaligned their safety protocols with therapeutic needs. Imagine a therapist who is overly cautious, never pushing a patient to confront difficult emotions or challenges for fear of causing distress. This is analogous to how RLHF-trained models behave in high-severity scenarios, where their safety mechanisms interfere with the provision of effective therapy. These models involve layers of safety protocols that are meant to filter responses, but in high-severity scenarios the filters become overly restrictive, preventing the models from engaging effectively with users.

05

Deep Dive: Prolonged Exposure Therapy and CBT Exercises

155 words

The study evaluated AI models on 250 Prolonged Exposure therapy scenarios and 146 CBT exercises to test their therapeutic appropriateness. Prolonged Exposure therapy involves helping patients confront trauma-related memories, while CBT exercises focus on challenging unhealthy thoughts. These therapy types were chosen because they represent common and structured approaches to mental health treatment. The models were expected to guide users through these exercises, but the results showed a significant drop in performance. In the case of CBT exercises, one model's task completeness fell from 92% to 71%. This drop illustrates how safety protocols can interfere with the models' ability to complete therapeutic tasks. Imagine a tutor who knows the correct answers but is too focused on not making mistakes to guide students through their learning process. RLHF-trained models struggled with these therapy scenarios in a similar way, which points to the need for adjustments in how safety and efficacy are balanced in AI mental health applications.

06

Training & Data: The Role of RLHF

166 words

Reinforcement Learning from Human Feedback (RLHF) was used as the primary training mechanism for the AI models evaluated in this study. This approach uses feedback from human trainers to shape the learning process of AI models, ideally leading to outputs that align with human values and preferences. In the context of mental health, this means training models to prioritize safety while still delivering effective therapeutic interventions. However, the study found that the way RLHF was applied led to significant misalignments. Imagine a teacher whose feedback always focused on avoiding incorrect answers, rather than encouraging students to explore and learn from their mistakes. This analogy captures the core issue with RLHF in this context: its application led to overly cautious models that struggled to provide effective support in high-severity mental health scenarios. The training data included various mental health exercises and scenarios, but the feedback mechanisms prioritized safety to such an extent that the models' ability to perform therapeutic tasks was compromised.

07

Stress Tests and Model Collapse

116 words

Stress tests were conducted to evaluate how AI models perform under high-pressure, high-severity scenarios. These tests are designed to push models to their limits, revealing weaknesses in their architecture and training. In this study, AI models frequently collapsed under stress tests, demonstrating their inability to handle complex mental health cases. Imagine a bridge that looks sturdy under normal conditions but collapses when a heavy load is placed on it. Similarly, these AI models appeared effective in low-severity scenarios but failed when faced with high-severity cases. This collapse highlights the inadequacy of current RLHF training when applied to mental health scenarios, where the stakes are high and the need for appropriate intervention is critical.

08

Key Results: Performance Metrics and Failures

116 words

The study revealed critical performance metrics that underscore the failures of RLHF-trained models in mental health applications. Despite achieving high surface acknowledgment scores of 0.91 to 1.00, the therapeutic appropriateness scores fell drastically in high-severity scenarios. This discrepancy highlights the models' ability to superficially engage with users while failing to provide meaningful support. Additionally, the safety-interference score of one frontier model dropped from 0.99 to 0.61, indicating that safety protocols were interfering with the core therapeutic functions of the model. Imagine a customer service representative who is polite and acknowledges customer concerns but fails to solve their problems. These AI models behave in much the same way, offering surface-level acknowledgment without delivering effective therapeutic interventions.
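To put the reported drops on a common footing, the short Python sketch below computes the absolute and relative change for the two before/after figures quoted on this page (task completeness 92% to 71%, safety-interference score 0.99 to 0.61). The helper name `drop` is hypothetical, and the relative-drop framing is an illustration added here, not a metric taken from the paper.

```python
def drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop, relative drop as a fraction of `before`)."""
    absolute = before - after
    return absolute, absolute / before


# Figures quoted on this page; nothing else is assumed about the paper's data.
cbt_abs, cbt_rel = drop(92.0, 71.0)   # CBT task completeness, in percent
si_abs, si_rel = drop(0.99, 0.61)     # safety-interference score

print(f"CBT task completeness: -{cbt_abs:.0f} points ({cbt_rel:.1%} relative)")
print(f"Safety-interference score: -{si_abs:.2f} ({si_rel:.1%} relative)")
# prints:
# CBT task completeness: -21 points (22.8% relative)
# Safety-interference score: -0.38 (38.4% relative)
```

Read this way, the safety-interference score loses a larger share of its starting value than CBT task completeness does, although the two metrics are on different scales and are not directly comparable.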

09

Ablation Studies: Understanding Component Importance

119 words

Ablation studies were conducted to determine the importance of various components within the AI models. These studies involve systematically removing parts of the model to assess their impact on overall performance. The results showed that when safety protocols were relaxed, models showed improved task completion rates in CBT exercises, suggesting that these protocols were a significant barrier to effective therapeutic intervention. Imagine a machine with too many safety locks that prevent it from performing its primary function efficiently. By gradually removing these locks, the machine can operate more effectively. Similarly, the ablation studies demonstrated that relaxing certain safety measures allowed models to engage more effectively in therapeutic tasks, underscoring the need for a better balance between safety and efficacy.

10

What This Changed: Impact on AI in Mental Health

124 words

The findings of this paper have significant implications for the deployment of AI models in mental health applications. By highlighting the shortcomings of current RLHF training, the study calls for a reevaluation of safety protocols to better balance safety and therapeutic efficacy. This could slow down the rollout of AI products in mental health, as companies may need to adopt more rigorous testing and development standards to ensure their models are both safe and effective. Imagine a pharmaceutical company that discovers a new drug but needs to conduct extensive testing to ensure it's both safe and effective. Similarly, AI companies may need to invest more time and resources into testing their models before deployment, to prevent the kind of misalignments identified in this study.

11

Limitations & Open Questions: Where We Go From Here

124 words

Despite its contributions, this study leaves several questions unanswered. One of the primary limitations is the focus on RLHF as the sole alignment mechanism. Future research could explore alternative training methods that better balance safety and therapeutic efficacy. Additionally, the study's reliance on specific therapy scenarios may limit the generalizability of its findings to other areas of mental health support. Imagine a map that only shows a few roads in a vast city; while it's helpful for those specific paths, it doesn't provide a complete picture. Similarly, this study provides valuable insights but doesn't cover all possible aspects of AI mental health support. More research is needed to explore these areas and develop models that can effectively balance safety and efficacy across diverse scenarios.

12

Why You Should Care: Implications for AI Product Managers

140 words

For product managers in the AI industry, particularly those working on mental health applications, the findings of this study are critical. The paper highlights the need for a careful balance between safety and efficacy, emphasizing the importance of rigorous testing and development standards. Product managers should be aware of the potential pitfalls of deploying RLHF-trained models without adequate testing, as these models may fail to provide effective support in high-severity scenarios. Imagine a new car model that performs well in controlled tests but has not been tested in real-world conditions; the risk of failure in real-world scenarios could be high. Similarly, AI models need to be thoroughly tested in diverse and high-severity scenarios to ensure their effectiveness and safety. This study provides a roadmap for product managers to follow, ensuring that their AI products are both safe and therapeutically effective.

How grounded is this content?

Metrics are computed from available source text only — abstract, summary, and impact fields ingested into this system. Full paper PDF is not ingested; numerical claims that originate from within the paper body will not appear in these scores.

Source Richness: 88%

7 of 8 content fields populated. More fields = better-grounded generation.

Source Depth: ~219 words

Total source text analyzed by the model. Includes extended deep-dive summary — high confidence.

Number Grounding: 4 / 4

Key statistics whose numeric values appear verbatim in ingested source text. Unverified stats may originate from the full paper body.

Quote Traceability: 3 / 3

Key passages whose significant vocabulary (≥4-char words) overlap ≥35% with source text. Measures lexical traceability, not semantic accuracy.

Methodology: Number grounding uses regex digit extraction against source text. Quote traceability uses token set intersection on content words stripped of stop-words. Neither metric validates semantic correctness or factual accuracy against the original paper. For full verification, cross-reference with the original paper via the arXiv link above.
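The two checks described above can be sketched in Python as follows. This is a minimal illustration, not this system's actual implementation: the function names, the stop-word list, and the choice to measure overlap as a fraction of the quote's content words are all assumptions. Only the regex digit extraction, the 4-character word minimum, and the 35% threshold come from the description.

```python
import re

# Hypothetical stop-word list; the methodology does not say which list is used.
STOP_WORDS = {"with", "that", "this", "from", "into", "were", "have", "when"}

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def number_grounded(stat: str, source: str) -> bool:
    """True if every number in `stat` appears verbatim as a number in `source`."""
    stat_nums = NUMBER_RE.findall(stat)
    source_nums = set(NUMBER_RE.findall(source))
    return bool(stat_nums) and all(n in source_nums for n in stat_nums)


def content_words(text: str) -> set[str]:
    """Lowercased words of at least 4 characters, minus stop-words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) >= 4 and w not in STOP_WORDS}


def quote_traceable(quote: str, source: str, threshold: float = 0.35) -> bool:
    """True if enough of the quote's content words also occur in the source.

    Assumption: overlap is measured relative to the quote's own content words.
    """
    quote_words = content_words(quote)
    if not quote_words:
        return False
    overlap = len(quote_words & content_words(source)) / len(quote_words)
    return overlap >= threshold


source = "Therapeutic appropriateness dropped to 0.22-0.33 in high-severity scenarios."
print(number_grounded("0.22-0.33", source))   # True
print(number_grounded("92% to 71%", source))  # False
print(quote_traceable("Therapeutic appropriateness collapsed in severe scenarios", source))  # True
```

As the methodology notes, both checks are purely lexical: a statistic can pass number grounding while being attached to the wrong metric, and a quote can pass traceability while misstating the source.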