MIT’s SASA Method: Teaching LLMs to Self-Detoxify Their Language Output

How Large Language Models Can Learn to Clean Up Their Language

Just as humans develop an internal guide to filter appropriate language based on context and values, researchers have found that Large Language Models (LLMs) can develop similar capabilities to moderate their outputs.

A groundbreaking method from MIT, the MIT-IBM Watson AI Lab, and IBM Research, called Self-disciplined Autoregressive Sampling (SASA), enables LLMs to detoxify their own language generation without sacrificing fluency or requiring extensive retraining.

How SASA Works: Internal Boundary Recognition

Unlike traditional detoxification approaches that require model retraining or external reward systems, SASA operates as a decoding algorithm that establishes boundaries between toxic and non-toxic subspaces within the LLM’s own internal representations.

During inference, the algorithm (see the sketch after this list):

  • Assesses the toxicity value of partially generated phrases
  • Evaluates potential next tokens based on their proximity to the classifier boundary
  • Selects word options that place the phrase in non-toxic space
  • Maintains language fluency while reducing harmful content
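The decoding loop above could be sketched roughly as follows in Python with PyTorch. The `toxicity_margin` helper, the linear weights `w` and `b`, and the scaling coefficient `beta` are illustrative stand-ins rather than the authors’ implementation, which operates on a real LLM’s internal representations.

```python
# Toy sketch of value-guided re-weighted sampling (not the authors' code).
import torch

def toxicity_margin(candidate_embeddings, w, b):
    """Signed distance of each candidate continuation from a linear
    toxic/non-toxic boundary (positive = non-toxic side in this toy convention)."""
    return candidate_embeddings @ w + b

def sasa_style_sample(logits, candidate_embeddings, w, b, beta=5.0):
    """Re-weight next-token probabilities toward the non-toxic subspace, then sample.
    `beta` trades detoxification strength against fluency (illustrative value)."""
    log_probs = torch.log_softmax(logits, dim=-1)           # the LLM's own preferences
    margins = toxicity_margin(candidate_embeddings, w, b)   # distances from the boundary
    reweighted = torch.softmax(log_probs + beta * margins, dim=-1)
    return torch.multinomial(reweighted, num_samples=1)

# Tiny runnable example with random stand-ins for real model outputs.
vocab_size, dim = 10, 16
logits = torch.randn(vocab_size)              # scores over candidate next tokens
cand_emb = torch.randn(vocab_size, dim)       # embedding of context + each candidate token
w, b = torch.randn(dim), torch.tensor(0.0)    # pretend linear toxicity classifier
print(sasa_style_sample(logits, cand_emb, w, b))
```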

“We wanted to find out a way with any existing language model [that], during the generation process, the decoding can be subject to some human values,” explains lead author Ching-Yun “Irene” Ko, a research scientist at IBM’s Thomas J. Watson Research Center.

The Technical Implementation

SASA leverages the autoregressive nature of LLMs by building a linear classifier that operates on a learned subspace of the LLM’s embedding space. When examining potential next tokens, the system re-weights the sampling probabilities based on each token’s likelihood of producing toxic content when combined with the preceding context.

The approach is contextual—evaluating toxicity not as isolated words but as part of the complete phrase being generated. This mimics how humans adjust their language based on conversational context.
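For illustration only, a linear boundary of this kind could be fit with ordinary logistic regression on sentence embeddings labeled toxic or non-toxic; the sketch below uses scikit-learn with random placeholder vectors in place of real LLM representations, so the data, dimensions, and labels are all assumptions.

```python
# Illustrative fit of a linear toxic/non-toxic boundary in embedding space
# (placeholder data; the real method works on the LLM's own representations).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 16

# Placeholder "embeddings": toxic examples shifted away from non-toxic ones.
non_toxic = rng.normal(loc=0.0, scale=1.0, size=(500, dim))
toxic = rng.normal(loc=1.5, scale=1.0, size=(500, dim))
X = np.vstack([non_toxic, toxic])
y = np.array([0] * 500 + [1] * 500)   # 0 = non-toxic, 1 = toxic

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Signed distance to the learned hyperplane; a SASA-style decoder would
# down-weight candidate tokens whose continuation lands on the toxic side.
w, b = clf.coef_[0], clf.intercept_[0]
margins = X @ w + b    # > 0 leans toxic under this toy labeling
print(margins[:5])
```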

Tested Performance

The researchers evaluated SASA across multiple LLMs of varying sizes, including:

  • GPT2-Large (762 million parameters)
  • Llama2-7b (7 billion parameters)
  • Llama 3.1-8b-Instruct (8 billion parameters)

Testing used challenging datasets like RealToxicityPrompts, BOLD, and AttaQ, with each model generating 25 completions per prompt. SASA achieved significant reductions in toxic language generation, performing comparably to state-of-the-art external reward model techniques while using fewer resources.

Interestingly, the researchers observed that the LLMs initially produced more toxic responses to female-labeled prompts than to male-labeled ones, an imbalance that SASA helped equalize.

Future Applications

A significant advantage of SASA is its lightweight implementation, which makes it adaptable to multiple value dimensions beyond toxicity. As Ko explains, “For human beings, we have multiple human values. We don’t want to say toxic things, but we also want to be truthful, helpful, and loyal.”

The method could be extended to check generated language against multiple value subspaces simultaneously, adding only minimal computational overhead while promoting more positive, fair, and principle-aligned language generation.
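As a purely speculative sketch of that extension, margins from several per-value linear boundaries could be combined into a single score before the re-weighting step; the boundaries, weights, and the “truthfulness” head below are hypothetical placeholders, not part of the published method.

```python
# Speculative sketch: combining margins from several value classifiers
# (hypothetical boundaries and weights; not described in the paper).
import torch

def combined_margin(candidate_embeddings, boundaries, weights):
    """Weighted sum of signed margins from per-value linear boundaries.
    `boundaries` is a list of (w, b) pairs; `weights` sets each value's importance."""
    total = torch.zeros(candidate_embeddings.shape[0])
    for (w, b), weight in zip(boundaries, weights):
        total += weight * (candidate_embeddings @ w + b)
    return total   # could feed into the same re-weighted sampling step as before

# Hypothetical example: a toxicity boundary and a "truthfulness" boundary, equally weighted.
vocab_size, dim = 10, 16
cand_emb = torch.randn(vocab_size, dim)
boundaries = [(torch.randn(dim), torch.tensor(0.0)) for _ in range(2)]
print(combined_margin(cand_emb, boundaries, weights=[0.5, 0.5]))
```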

This approach represents an important step toward building AI systems that can self-regulate their outputs according to human values and ethical standards, potentially reducing the need for external content moderation in certain applications.

For more information about this research, see the MIT News article.

