The abliteration gold rush
Open the Hugging Face model tag filter. Type "abliterated." Count the results. As of April 22, 2026, there are 6,030 abliterated models on Hugging Face — from tiny 1B Llama variants to 70B Qwen-VL builds to frontier GPT-OSS-120B strippings. That's not a trend. That's an industry.
The technique itself isn't new. Maxime Labonne published the first public guide in June 2024. Andy Arditi and collaborators had formalized the underlying finding a couple of months before that: refusal, across every major open model they tested, is mediated by a single direction in the residual stream. Remove that direction and the model stops refusing. The NeurIPS 2024 paper is called Refusal in Language Models Is Mediated by a Single Direction. It reads like a locksmith's manual.
What's new is productization. In early March 2026, a developer known on X as Pliny the Liberator (handle: elder_plinius) released a GitHub repo called OBLITERATUS. Not a tutorial. Not a paper. A full toolkit: 116 curated models, 15 analysis modules, 7 weight-projection presets, a Hugging Face Spaces interface for zero-setup use, and a telemetry layer that turns every run into a data point for a crowd-sourced abliteration benchmark.
The repo crossed 1,000 stars within 24 hours. Six weeks later it's sitting at roughly 5,000, with new models showing up on Hugging Face under the OBLITERATUS/ namespace weekly. A BlueSpork video titled "I tested 17 Uncensored Local LLMs" — which pulled in 68,000+ views — put a number on the trade-off: abliteration makes a model, in his estimate, "roughly 20% dumber." More on that later.
This is a review of what OBLITERATUS actually ships, how it works technically, what it breaks, and why a small group of alignment researchers have been quietly sounding the alarm since the repo dropped. The short version: it's the most polished guardrail-removal tool ever open-sourced, it's ethically dual-use in the most literal sense, and the defenses catching up to it are themselves only a few months old.
What abliteration actually is, in plain English
Before OBLITERATUS makes sense, abliteration needs to make sense. And abliteration is one of those topics that's explained either in terms so hand-wavy they're useless ("it removes the refusal layer") or in terms so dense they're unreadable ("orthogonal projection onto the complement of the rank-one refusal subspace"). There's a middle path.
Think of a language model as a series of layers, and at each layer the model maintains a big vector of numbers that represents what it's currently thinking about. That vector is called the residual stream. Arditi and colleagues made a specific, narrow claim: among the thousands of dimensions in that residual stream, there's one direction — just one — that, when activated strongly, makes the model refuse, and when suppressed, makes it comply. Across 13 open chat models up to 72B parameters. Not two directions. Not a subspace. One.
Abliteration is the operation of finding that direction and removing it. You do it in three steps. Step one, run a few hundred harmful prompts and harmless prompts through the model and record the activations. Step two, compute the mean activation for each group and subtract; the normalized difference of means is the refusal direction (SVD variants extract it from the stacked per-layer differences instead). Step three, project every weight matrix in the model onto the subspace orthogonal to that direction. Now, when the model runs, the refusal component is gone. The rest of the model's capability stays mostly intact.
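The three steps can be sketched in a few lines of NumPy. This is a toy reconstruction of the general recipe, not OBLITERATUS's actual code, and the activations are synthetic stand-ins for what you would record with forward hooks:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # residual-stream width (toy scale; real models use 4096+)

# Step 1: record residual-stream activations for harmful vs. harmless prompts
# (synthetic here; in practice these come from forward hooks on the model)
harmful = rng.normal(size=(200, d)) + 3.0 * np.eye(d)[0]  # refusal signal on axis 0
harmless = rng.normal(size=(200, d))

# Step 2: the normalized difference of means is the refusal direction
diff = harmful.mean(axis=0) - harmless.mean(axis=0)
r = diff / np.linalg.norm(diff)

# Step 3: project a weight matrix that writes into the residual stream
# onto the subspace orthogonal to r
W_out = rng.normal(size=(d, 4 * d))      # toy MLP output projection
P = np.eye(d) - np.outer(r, r)           # rank-one orthogonal projector
W_abl = P @ W_out

# The projected weights can no longer write anything along r
print(np.abs(r @ W_abl).max())           # ~0
```

The rest of each weight matrix is untouched, which is why capability survives "mostly" intact: only the one-dimensional slice along the refusal direction is zeroed out.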
That "mostly" is doing a lot of work. The original single-direction claim has been challenged. A 2025 paper argued refusal is actually two directions (one pushing toward refusal, one pushing away from compliance), and that removing only one of them explains why abliterated models sometimes still hedge. More recent research (the stuff OBLITERATUS is built on) extends this to multi-direction subspaces, expert-granular abliteration for mixture-of-experts models, and chain-of-thought-preserving ablation that tries to spare reasoning capacity.
The technique is called abliteration because it's ablation plus obliteration — a half-joking portmanteau that's now stuck as the Wiktionary definition. "To ablate" means to remove precisely. "To obliterate" means to destroy completely. Abliteration aims for the first but often edges into the second when you're not careful. That's the whole problem OBLITERATUS is trying to solve.
Meet OBLITERATUS — what shipped on March 4
The repo is at github.com/elder-plinius/OBLITERATUS. AGPL-3.0 licensed with a commercial license available for proprietary applications. Primary language: Python, 91.6% of the codebase. The README motto is "OBLITERATE THE CHAINS THAT BIND YOU," which tells you roughly who the target audience is.
The pipeline runs in six named steps: SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH. Summon loads the base model and tokenizer. Probe runs the harmful/harmless prompt pairs and records activations across all layers. Distill extracts the refusal direction (or subspace) using one of the SVD variants. Excise applies the weight projection using one of the 7 presets. Verify runs evaluation prompts and logs how much capability degraded. Rebirth saves the modified model in a format compatible with vLLM, Ollama, and Hugging Face Transformers.
The 116 curated models span five hardware tiers:
- Tier 1 (consumer GPU, 8-24GB VRAM): TinyLlama, Phi-3, Gemma-2-2B, Llama-3.2-3B
- Tier 2 (prosumer, 24-48GB): Llama-3.1-8B, Qwen2.5-14B, Gemma-2-9B
- Tier 3 (workstation, 48-96GB): Qwen2.5-32B, Mixtral-8x7B, Llama-3.3-70B-Instruct
- Tier 4 (multi-GPU, 3-4 A100-80GB): DeepSeek-R1-Distill-Llama-70B, Qwen2.5-72B-Instruct
- Tier 5 (frontier, 4+ A100-80GB or H100 cluster): GPT-OSS-120B, DeepSeek-V3.2
The 15 analysis modules include the expected things (per-layer alignment scoring, refusal logit lens, whitened SVD extraction) and some less expected ones (defense robustness evaluation — i.e., the tool comes with its own adversarial testing harness to check whether an abliterated model still resists the new defenses published against abliteration).
The 7 method presets carry names that read like a wine list: basic, advanced, aggressive, surgical, optimized, inverted, nuclear. Surgical is the default for research use. Nuclear is what you reach for when surgical leaves too much residual refusal behavior — at the cost of capability damage. Inverted is what you use when you actually want to increase refusal (for red-teaming safety training data). That last one matters more than it sounds.
The 5 techniques that set OBLITERATUS apart
Most abliteration implementations before OBLITERATUS were single-technique. NousResearch's llm-abliteration does the Arditi single-direction projection and little else. FailSpy's original recipe is the one Maxime Labonne popularized. Heretic (a related 2026 toolkit) focuses on ease of use. OBLITERATUS bundles faithful reproductions of all of these and adds five novel techniques developed across 2025-2026 research:
1. Expert-Granular Abliteration (EGA). Mixture-of-experts models like Mixtral and DeepSeek-V3 have hundreds of experts — small sub-networks the router picks between per-token. A single refusal direction averaged across all experts misses the fact that some experts carry more refusal signal than others. EGA computes per-expert refusal directions and projects each expert individually. On Mixtral-8x7B, the repo reports EGA retains 94% of MMLU score versus 81% for vanilla abliteration. Big difference.
2. CoT-Aware Ablation. Reasoning models like DeepSeek-R1 and the OpenAI o-series use chain-of-thought internally: they think before answering. Naive abliteration destroys this. CoT-Aware Ablation identifies which layers are primarily reasoning versus primarily refusal-gated and skips projection on the reasoning-dominant layers. Result: reasoning benchmarks (GSM8K, MATH) hold up within 2-3 points of the original, instead of the 15-20 point drops vanilla abliteration produces on R1.
3. Whitened SVD Extraction. Classical SVD on the activation difference matrix returns the dominant direction, but the dominance is often skewed by scale effects unrelated to refusal. Whitening (dividing by the square root of the covariance) corrects for this and gives a cleaner refusal direction. Small change, noticeable quality improvement in the downstream model.
4. KL-Divergence Co-Optimization. The standard projection is a closed-form operation (you compute it and apply it). OBLITERATUS adds an optional optimization loop that minimizes the KL divergence between the abliterated model's output and the original model's output on harmless prompts, constrained so that refusal is still removed on harmful prompts. This is closer to distillation than to simple projection, and it's the knob that separates "surgical" from "nuclear."
5. Analysis-Informed Pipeline. The most interesting one. Rather than pick a preset up front, you run the analysis modules first. They report the refusal direction's rank, the norm of the projected weights, and the expected capability damage. Based on that, the pipeline recommends a preset. This is the difference between "remove refusal" and "understand the model well enough to remove refusal without wrecking it."
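Technique 3, the whitened extraction, is the easiest of the five to demonstrate on synthetic data. The sketch below is my construction, not repo code, and the choice of whitening by the harmless-prompt covariance is an assumption (implementations differ on the convention). One high-variance nuisance axis hijacks plain SVD on the activation differences; whitening recovers the true signal axis:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 32, 250

# Toy activations: refusal signal on axis 0, but axis 1 has huge variance
# unrelated to refusal, which skews plain SVD toward it
harmless = rng.normal(size=(n, d))
harmless[:, 1] *= 10.0
harmful = rng.normal(size=(n, d))
harmful[:, 1] *= 10.0
harmful[:, 0] += 3.0                       # the actual refusal signal

diffs = harmful - harmless.mean(axis=0)

# Plain SVD latches onto the high-variance nuisance axis
_, _, Vt_plain = np.linalg.svd(diffs, full_matrices=False)

# Whitened SVD: rescale by C^(-1/2) of the harmless covariance first,
# then map the extracted direction back to activation space
C = np.cov(harmless, rowvar=False) + 1e-6 * np.eye(d)
evals, evecs = np.linalg.eigh(C)
C_inv_sqrt = evecs @ np.diag(evals**-0.5) @ evecs.T
_, _, Vt_white = np.linalg.svd(diffs @ C_inv_sqrt, full_matrices=False)
r = C_inv_sqrt @ Vt_white[0]
r /= np.linalg.norm(r)

print(abs(Vt_plain[0][1]))  # plain SVD: dominated by the nuisance axis
print(abs(r[0]))            # whitened: recovers the refusal axis
```

The same failure mode exists in real activations, where token-frequency and positional effects produce exactly this kind of scale skew.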
Together, these techniques are why OBLITERATUS produces models that score closer to their originals on capability benchmarks than any previous open abliteration toolkit. They're also why the repo has 5,000 stars and not 500.
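Technique 4's co-optimization loop can also be sketched at toy scale. Here a single linear-softmax "layer" stands in for the model, the gradients are hand-derived, and a penalty term keeps the refusal component suppressed while the KL divergence to the original outputs is minimized. This is my construction of the concept, not code from the repo:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # mean KL(p || q) over a batch of probability rows
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(4)
d = 16
W0 = rng.normal(size=(d, d))               # "original model" weights
r = rng.normal(size=d)
r /= np.linalg.norm(r)

W = (np.eye(d) - np.outer(r, r)) @ W0      # start from the closed-form projection
X = rng.normal(size=(64, d))               # toy activations for harmless prompts
p = softmax(X @ W0.T)                      # original model's output distribution

lam, lr = 10.0, 0.05
kl_start = kl(p, softmax(X @ W.T))
for _ in range(300):
    q = softmax(X @ W.T)
    grad_kl = (q - p).T @ X / len(X)       # dKL/dW for a linear-softmax layer
    grad_pen = 2 * lam * np.outer(r, r @ W)  # keeps the r-component near zero
    W -= lr * (grad_kl + grad_pen)
kl_end = kl(p, softmax(X @ W.T))

print(kl_start, kl_end, np.linalg.norm(r @ W))  # KL drops; refusal stays ablated
```

The loop recovers some of the harmless-prompt behavior the hard projection destroyed, which is exactly the "distillation, not projection" framing above.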
Benchmarks: GPT-OSS-120B in 10 minutes
The repo publishes reference benchmarks for two representative runs. Here's what it claims — verified against the config files and hardware requirements in the README.
GPT-OSS-120B (234 GB model, BF16). Requires 4x A100-80GB GPUs minimum; three is not enough, since 234 GB of weights against 240 GB of total VRAM leaves no headroom for activations, and the sharded model needs inter-GPU activation passing on top of that. Wall-clock time: approximately 10 minutes end-to-end, with I/O dominating. Disk read of the model weights takes about 4 minutes on fast NVMe. The actual abliteration math — the SVD, the projection, a quick verify pass — runs in roughly 2 minutes on the 4-GPU setup. The remaining 4 minutes is writing the new weights back to disk in safetensors format.
DeepSeek-R1-Distill-Llama-70B (149 GB model). Requires 3x A100-80GB minimum. Wall-clock: approximately 9 minutes. Similar I/O-dominated profile. The reasoning-preserving CoT-Aware path adds 30-60 seconds over vanilla abliteration.
A few things worth noticing. First, the math is fast. Once the model is in GPU memory, the SVD extraction is a few seconds. The real cost is moving tens of gigabytes across SSD, PCIe, and NVLink. If you're running this on a hyperscaler with tiered storage, assume longer. Second, the hardware floor is high. You don't abliterate GPT-OSS-120B on a consumer card. You can't rent 4x A100-80GB for pocket change — on-demand pricing on Lambda Labs or Vast.ai runs $8-16/hour for that cluster, so a single abliteration run costs a few dollars but not cents. Third, these numbers are for the abliteration step alone. The VERIFY pass — the evaluation that checks how much capability you broke — takes additional time depending on the benchmark suite you run. A full MMLU + MMLU-Pro + GSM8K + MATH evaluation on a 70B model is a few more hours on the same cluster.
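The per-run cost claim is simple arithmetic, using the quoted on-demand range:

```python
# Back-of-envelope: one 10-minute GPT-OSS-120B abliteration run on a rented
# 4x A100-80GB cluster at the quoted $8-16/hour on-demand range
rate_low, rate_high = 8.0, 16.0            # USD per hour
run_hours = 10 / 60
print(rate_low * run_hours, rate_high * run_hours)   # a few dollars, not cents
```

A full VERIFY benchmark suite at a few hours on the same cluster multiplies that by one to two orders of magnitude, which is why most published abliterated models skip it.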
If you're comparing this to Kimi K2.6 running full SWE-Bench evaluations (covered in our Kimi K2.6 review), the order of magnitude is similar — big models take real compute even for "lightweight" operations. Abliteration is cheap compared to training but not free.
The 20% dumber problem nobody talks about
There's a reason the HF comment sections on abliterated models are full of people saying "this feels off." BlueSpork's widely-shared video put a rough number on it — abliterated models lose roughly 20% of their capability on some benchmarks. A recent r/LocalLLaMA thread titled "IMPORTANT: Why Abliterated Models SUCK" is less polite but makes the same point.
What's actually happening? A few things at once.
One: refusal isn't just refusal. The "refusal direction" in the residual stream correlates with a bunch of adjacent behaviors: hedging, caveating, moral reasoning, deference to authority. When you project away the refusal direction, you project away some of those too. The model becomes more confident but also more wrong. It stops saying "I'm not sure" when it should.
Two: the projection is lossy. Even with norm-preserving variants, the projected weights are strictly lower-rank than the originals. For large matrices, that rank loss is small — a 4096-dim residual stream loses one dimension per direction removed, or about 0.024%. For smaller sub-matrices inside attention heads, the percentage is larger.
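The rank arithmetic is easy to verify directly at toy scale: each removed direction costs exactly one dimension of the matrix's rank, a tiny fraction of a wide residual stream.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 256                                    # stand-in for a 4096-dim stream
W = rng.normal(size=(d, d))
r = rng.normal(size=d)
r /= np.linalg.norm(r)
W_abl = (np.eye(d) - np.outer(r, r)) @ W   # rank drops by exactly one

print(np.linalg.matrix_rank(W), np.linalg.matrix_rank(W_abl))
print(f"{1 / 4096:.4%} of a 4096-dim residual stream per removed direction")
```

For a small per-head sub-matrix (say, 128 dimensions), the same single lost direction is 0.78% of the space, thirty times larger as a fraction.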
Three: the evaluation gap. Abliterated models almost always get posted to Hugging Face with their MMLU score as the single capability benchmark. MMLU is relatively insensitive to the kinds of damage abliteration causes. More sensitive benchmarks — MMLU-Pro, BBH, HumanEval — show bigger drops. A Medium post from January 2026 titled "GLM-4.7 Flash 'Obliterated'" walks through one model losing 7 points on MMLU but 18 points on MMLU-Pro after abliteration. That's a meaningful gap.
OBLITERATUS's novel techniques — especially CoT-Aware Ablation and KL-Divergence Co-Optimization — are designed specifically to shrink this gap. And they do. But they don't close it. The repo's own reference numbers show best-case 94% capability retention on Mixtral. Average case, across all 116 supported models, is closer to 85-90%. There is no free lunch here.
The practical advice baked into the repo's own docs: run the VERIFY pass with a capability benchmark you care about before shipping an abliterated model anywhere. MMLU-Pro is the current community default. For coding models, HumanEval and MBPP. For reasoning models, GSM8K and MATH. Trust the number your benchmark gives you, not the general "20% dumber" rule of thumb.
The counter-research: a simple defense
While the abliteration gold rush has been gathering pace, the defense side of the research hasn't been sitting still. In May 2025, a team published "An Embarrassingly Simple Defense Against LLM Abliteration Attacks" on arXiv. The title is not hyperbole.
The core idea: if abliteration isolates and suppresses the single latent direction most responsible for refusal, you can harden a model against abliteration by distributing the refusal signal across many directions during training. They call this extended-refusal training. Models fine-tuned with extended-refusal training show, per the paper, two properties. First, abliteration fails — the single-direction projection finds a direction, but ablating it doesn't remove refusal because there is no single direction to remove. Second, when abliteration is forced through anyway (by stripping a wider subspace), the utility metrics of the resulting model degrade substantially more than on un-hardened base models.
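The mechanism is easy to reproduce on synthetic data. In this toy (my construction, loosely inspired by the paper's idea, not its code): when every harmful prompt fires the same direction, ablating it wipes the signal; when each prompt routes its refusal through one of 16 directions, ablating the single best direction barely dents it.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 64, 500

def top_direction(harmful, harmless):
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, r):
    # remove each activation's component along the candidate refusal direction
    return acts - np.outer(acts @ r, r)

harmless = rng.normal(size=(n, d))

# Standard model: refusal concentrated in a single direction (axis 0)
concentrated = rng.normal(size=(n, d))
concentrated[:, 0] += 3.0

# Extended-refusal model: each prompt routes refusal through one of 16 axes
spread = rng.normal(size=(n, d))
idx = rng.integers(0, 16, size=n)
spread[np.arange(n), idx] += 3.0

conc_abl = ablate(concentrated, top_direction(concentrated, harmless))
spread_abl = ablate(spread, top_direction(spread, harmless))

# How much of each prompt's injected refusal signal survives ablation?
print(np.abs(conc_abl[:, 0]).mean())                  # small: signal removed
print(np.abs(spread_abl[np.arange(n), idx]).mean())   # still large: defense holds
```

Stripping the wider 16-dimensional subspace instead would work, but that is exactly the forced-through case where the paper reports the utility damage climbs sharply.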
The practical consequence is that extended-refusal training is now being incorporated into the safety fine-tuning of newer open releases. OBLITERATUS's own defense robustness evaluation module (one of the 15 analysis modules) explicitly tests against extended-refusal models and reports when its own techniques fall short. This is unusual — a toolkit that ships tests for its own failure modes. It's also academically honest in a way that the rest of the README is not.
The arms race is real and ongoing. As of April 2026, OBLITERATUS handles extended-refusal models in the "nuclear" preset at a significantly higher capability cost. New defenses published in the last three months (constitutional self-monitoring, activation patching, SAE-based refusal neurons) are not yet fully defeated. The alignment researchers quoted in Noam Schwartz's LinkedIn post from yesterday describe abliterated agents as "the most overlooked issue in AI safety." They're not wrong. They're just losing, for now.
Who's actually using this (and who shouldn't be)
The question that every abliteration toolkit has to answer, honestly or dishonestly, is: who is this for?
The honest use cases are real. Security researchers red-teaming production AI systems need models that will generate the same outputs the attackers' models generate — otherwise they can't test defenses. Interpretability researchers studying what refusal is need ablation experiments on real models. Creative writers working on dark fiction genuinely hit guardrails that refuse to help with violent scenes in their novels. Medical researchers querying models for clinical edge cases get stonewalled by overly conservative safety training. Legal researchers doing document review on sensitive materials (for authorized work) face the same thing. These are not hypothetical users; they have been using abliterated models since mid-2024.
The dishonest use cases also exist. Abliterated models are trivially better at generating phishing content, social engineering scripts, fraudulent business communications, and synthetic CSAM. That last one being why multiple jurisdictions (UK, EU under the AI Act) are actively discussing whether distributing abliterated models should itself be criminalized. OBLITERATUS sidesteps this question by only providing the tool, not the weights. But the tool is one-click enough that the distinction is increasingly theoretical.
Closed-source models can't be abliterated in the same way. You need weight access. Claude Opus 4.7 and the GPT-5.x family are safe from this particular attack because Anthropic and OpenAI never expose the weights. The Stanford AI Index 2026 notes that the gap between open and closed model safety is now partially a distribution gap — closed models get to control the endpoint, open models get copied, modified, and redistributed. OBLITERATUS is the tool that makes the copying most effective.
The open-source AI ecosystem's answer so far has been a kind of uneasy neutrality. Hugging Face hosts the abliterated models but gates some behind a content filter. Ollama indexes them but doesn't feature them. The Mozilla Thunderbolt team explicitly excludes abliterated models from their default model list while supporting their use if manually added. Nobody is taking a hard stance. Nobody is pretending this is all fine, either.
The ethical question nobody wants to answer
Here's the question I keep coming back to. If a technique exists, is genuinely documented in academic papers, and runs on a laptop for small models, does open-sourcing a polished implementation change the threat model? Or does it just make explicit what determined actors could already do?
Arguments for: the knowledge was already public. The Arditi paper is on arXiv. The Labonne tutorial is on Hugging Face. NousResearch's implementation is a year old. Anyone motivated enough to cause real harm was already capable of doing so. OBLITERATUS lowers the friction, sure, but it doesn't create a new capability. And the repo's own analysis modules and telemetry contribute to defense research by generating real data on where abliteration succeeds and fails. The repo argues, implicitly, that it's net-positive for safety research.
Arguments against: lowered friction matters. Security research consistently shows that attack tools which go from "technically possible" to "one-click" dramatically expand the user base. This is why security tooling like Metasploit doesn't ship zero-days as easy modules. OBLITERATUS doesn't create the capability to uncensor LLMs, but it creates the capability for non-researchers to do so at scale. The 6,030 abliterated models on Hugging Face didn't appear there by accident — they appeared because the tooling made it a weekend project.
The honest answer is that both things are true, and the ethical weight depends on who you think will use it most. In 2024, the answer was "mostly researchers." In 2026, the answer is less clear. The repo's own contribution to defense research is real. The downstream distribution of abliterated models is also real. Those aren't in balance.
What is clear is that this conversation hasn't been resolved, and it's probably not going to be resolved by the AI labs alone. Regulators, model hosts, and the open-source community itself are going to have to draw lines that haven't been drawn yet. OBLITERATUS didn't create the problem; it just made it impossible to ignore.
For a broader view of how much this kind of open-source tooling is shifting AI benchmark dynamics, see our review of Kimi K2.6, which beat GPT-5.4 on SWE-Bench Pro as an open-weight model. Or the I Tested 5 Free AI Proofreaders piece for a simpler example of the same format — actual testing beats hypothetical capability claims. For a utility-focused tangent, our guide to random string generation covers the cryptographic side of not trusting your tools blindly — an attitude that transfers to model hosting, too.
Frequently asked questions
What is abliteration in LLMs?
Abliteration is a post-training technique that removes refusal behaviors from a large language model by identifying the specific internal direction (or subspace) in the model's residual stream that mediates refusal, then projecting every weight matrix in the model onto the subspace orthogonal to that direction. It doesn't require retraining or fine-tuning. It runs in minutes on consumer hardware for small models and in 10-15 minutes on multi-GPU setups for frontier models. The term is a portmanteau of "ablation" and "obliteration" — the technique aims for precise ablation but can edge toward obliteration when applied aggressively.
What's the difference between abliterated and obliterated?
Abliterated means surgically modified — specific refusal-related weights are projected away while the rest of the model stays intact. Obliterated, in the colloquial sense used in community discussions, usually means the modification went too far and destroyed meaningful capability alongside the refusal behavior. OBLITERATUS is the name of a specific toolkit; it performs abliteration, not obliteration, when configured correctly. The "nuclear" preset gets close to obliteration by design — you use it only when surgical methods leave too much residual refusal behavior and you're willing to accept capability damage.
Does OBLITERATUS work on ChatGPT or Claude?
No. Closed-source models like ChatGPT (GPT-5.x), Claude Opus 4.7, and Gemini Ultra are served through API endpoints. OBLITERATUS modifies model weights directly, which requires access to the weights. Only open-weight models (Llama, Qwen, Gemma, DeepSeek, Mistral, GPT-OSS, and the 100+ others in the OBLITERATUS curated list) can be abliterated. That's why abliteration is often described as primarily an open-source-ecosystem issue.
Are abliterated models really 20% dumber?
The 20% figure comes from a popular YouTube comparison and is a reasonable rule of thumb for vanilla abliteration on reasoning-heavy benchmarks. Reality is more nuanced. Simple benchmarks like MMLU often show a drop of 3-7 points. Reasoning-heavy benchmarks like MMLU-Pro, BBH, and MATH often show 10-20 point drops depending on the model and the method used. OBLITERATUS's novel techniques — CoT-Aware Ablation, KL-Divergence Co-Optimization — narrow the gap on reasoning benchmarks specifically, but no current abliteration method fully closes it.
Is abliteration legal?
Abliteration itself is legal in most jurisdictions. Abliterating an open-weight model and using the result is generally not prohibited, though the model's license may impose restrictions (Llama's license has acceptable-use provisions; Gemma's is similar). What can be illegal, depending on jurisdiction, is using an abliterated model to generate certain content classes — CSAM is illegal everywhere, certain forms of targeted harassment are illegal in the EU and UK, and fraud-generation falls under existing fraud statutes regardless of the tool used. The UK's Online Safety Act and the EU AI Act are both actively debating whether distributing abliterated models should itself carry liability. Consult a lawyer for your specific use case.
How does OBLITERATUS compare to Heretic or FailSpy's original work?
FailSpy's original implementation (2024) was the first widely-used abliteration recipe and established the basic approach. Heretic (early 2026) focused on one-click ease of use for mid-sized models. OBLITERATUS ships faithful reproductions of both plus five novel techniques developed in 2025-2026. It's broader (116 models vs a few dozen), deeper (15 analysis modules vs a handful), and more technically sophisticated (Expert-Granular Abliteration and CoT-Aware Ablation are new). It's also more resource-intensive. Running the full analysis-informed pipeline on a 70B model is more demanding than Heretic's one-click approach.
What hardware do I need to run OBLITERATUS?
Depends on the model. Small models (Phi-3, Llama-3.2-3B) run on a consumer GPU with 12-24GB VRAM. Mid-tier (Llama-3.1-8B, Qwen2.5-14B) needs 24-48GB. Frontier (DeepSeek-R1-70B, GPT-OSS-120B) requires 3-4 A100-80GB GPUs minimum. The toolkit scales down cleanly but scales up less cleanly — the largest models are genuinely compute-bound and cost $10-50 per abliteration run on rented infrastructure, not cents.
Will my abliterated model pass safety evaluations?
No, and that's the point. Standard safety benchmarks (HarmBench, XSTest, AdvBench) are specifically designed to measure refusal rates, and abliteration removes refusal. An abliterated model will fail these benchmarks by construction. If you're shipping an abliterated model anywhere user-facing, you need to add application-level filtering, output classifiers, or some other safety layer outside the model itself. OBLITERATUS removes model-level safety; it's your responsibility to replace it with something else.
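A minimal sketch of what "application-level filtering" can mean in practice. The `generate` callable and the keyword list here are hypothetical stand-ins; real deployments use a trained output classifier (Llama Guard is one example), not substring checks:

```python
def is_disallowed(text: str) -> bool:
    # Stand-in policy check for illustration only; production systems
    # run a trained output classifier here, not a keyword list
    blocked = ("card dump", "synthesis route")
    return any(term in text.lower() for term in blocked)

def safe_generate(generate, prompt: str) -> str:
    # Wrap any generation callable with an output-side safety layer,
    # since the abliterated model itself will not refuse
    out = generate(prompt)
    return "[filtered by output policy]" if is_disallowed(out) else out
```

The point of the wrapper shape: the safety decision moves entirely outside the model, so it survives any amount of weight modification.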
Final take: what OBLITERATUS tells us about 2026
The interesting thing about OBLITERATUS isn't what it does. Abliteration has been a known technique for almost two years. What's interesting is what it represents about where open-source AI is heading.
First: the research-to-production gap for advanced techniques has collapsed. A 2024 NeurIPS paper became a one-click toolkit in 18 months. That cycle is going to keep getting shorter. The next capability that gets productized, whether it's RLHF reversal, constitution-bypass, or something not yet named, will probably land in a polished open-source repo within months of the paper.
Second: the defense side is catching up but only barely. The arXiv paper "An Embarrassingly Simple Defense Against LLM Abliteration Attacks" is a year old at this point. The techniques in it are being adopted slowly, unevenly, and with real capability trade-offs. The gap between attack tooling and defense tooling is, as of April 2026, wider than it was a year ago. That's not sustainable.
Third: the ethical framework for this category of tool hasn't been built. OBLITERATUS is dual-use in the most literal sense — the same pipeline that lets security researchers test defenses lets bad actors strip safety from models. The AI safety community, the open-source community, and the regulatory community are all pointing at each other expecting someone else to draw the lines.
For developers and researchers who want to use it responsibly, the honest advice is this: if you have a legitimate use case (security research, interpretability work, creative writing with documented need), run it in isolation, don't distribute the resulting models, and run the VERIFY pass with benchmarks you care about before drawing conclusions. If you're thinking about using it as a general-purpose "make my model smarter/freer," you're going to be disappointed — the capability cost is real and the community has spent two years watching people learn that the hard way.
Check the OBLITERATUS repo directly if you want the latest docs, hardware requirements, and version notes — the space is moving fast enough that anything written in April 2026 will be partially outdated by summer. For the broader context on where open-weight models are fitting into the research landscape, our Kimi K2.6 review covers the capability gains on the open side, and our piece on Mozilla Thunderbolt covers the open-source tooling ecosystem around those models. If you're wrangling the AI business side of this instead, DevPik's AI business tools hub has signup-free invoice, cover letter, and contract generators you can use today.