Summary
Three actively maintained open-source toolchains automate the removal of safety alignment from open-weights instruction-tuned language models. The combined output is over 1,300 publicly distributed derivative checkpoints across nearly every major model family. One of the toolchains operates entirely from a HuggingFace Space against the user’s own GPU quota: no installation, no fine-tuning data, no expertise. New abliterated derivatives appear within hours of new base-model releases.
These checkpoints are downloadable, permissively licensed, and may be deployed in agentic systems, fine-tuning pipelines, inference services, model merges, or quantized redistributions without the operator’s awareness that safety alignment has been surgically removed at the weight level. Workload-layer authentication — OAuth, SPIFFE/WIMSE, JWT, mTLS — does not detect this class of modification.
The Fall Risk structural identity measurement detects this modification class across all four model families tested in our published findings, at hardened measurement depth, without behavioral red-teaming, training data access, or knowledge of which abliteration technique was used.
The toolchains
Heretic (p-e-w)
Directional ablation with TPE-based parameter optimization (Optuna), co-minimizing refusal rate and KL divergence from the original model. Architectural scope: most dense transformers, many multimodal models, several MoE architectures; excludes state-space models, hybrid architectures, and certain novel attention systems. Operational profile: pip-installable, single command, ~45 minutes on consumer hardware. Distribution as of April 2026: 1,247+ community-published checkpoints on HuggingFace, plus the curated heretic-org organization carrying first-party releases.
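The co-minimization objective can be illustrated with a standard-library stand-in. Heretic itself drives Optuna's TPE sampler over several ablation parameters; the `evaluate` function and its toy cost surfaces below are entirely hypothetical, a sketch of the shape of the search rather than the toolchain's real code:

```python
import random

def evaluate(strength: float) -> tuple[float, float]:
    """Hypothetical stand-in for the real evaluation step: returns
    (refusal_rate, kl_divergence) for a given ablation strength.
    Stronger ablation lowers refusals but drifts further from the base."""
    refusal_rate = max(0.0, 1.0 - strength)   # toy cost surface
    kl_divergence = 0.05 * strength ** 2      # toy cost surface
    return refusal_rate, kl_divergence

def search(trials: int = 200, kl_weight: float = 10.0, seed: int = 0) -> float:
    """Random search over a single ablation strength, co-minimizing refusal
    rate and KL divergence through one weighted objective. Heretic uses
    Optuna's TPE sampler over multiple parameters instead of random search."""
    rng = random.Random(seed)
    best_strength, best_score = 0.0, float("inf")
    for _ in range(trials):
        s = rng.uniform(0.0, 2.0)
        refusal, kl = evaluate(s)
        score = refusal + kl_weight * kl
        if score < best_score:
            best_strength, best_score = s, score
    return best_strength
```

The single weighted score is a simplification; a multi-objective sampler can instead return the full Pareto front between refusal suppression and distributional drift.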
OBLITERATUS (Pliny)
Thirteen abliteration techniques, including spectral cascade decomposition, chain-of-thought-aware orthogonalization, Expert-Granular Abliteration for MoE architectures, and reversible LoRA-based ablation. Ships with 15+ analysis modules, including an Alignment Imprint Detector that fingerprints which alignment method (DPO, RLHF, CAI, SFT) the base model was trained with from residual subspace geometry alone, and Ouroboros detection that compensates for self-repair guardrails. The “informed” pipeline auto-configures every decision in real time during the run.
Operational profile: 116 curated model presets across 5 compute tiers; 837 tests across the codebase; runs from a HuggingFace Space against the user’s own GPU quota with one click. Telemetry is on by default in the Space and contributes anonymized run metadata (model identifier, method, refusal rate, KL divergence, hardware profile) to a public crowd-sourced dataset of refusal direction geometries organized by model, method, and effectiveness score.
mlabonne ecosystem
The original modern community methodology, descended from FailSpy’s 2024 abliterator notebook. Curated collection of 32 first-party models, including Daredevil-8B-abliterated (16.1k downloads), gemma-3-27b-it-abliterated (20.2k downloads), gemma-3-12b-it-abliterated, gemma-3-4b-it-abliterated, Meta-Llama-3.1-8B-Instruct-abliterated, and NeuralDaredevil-8B-abliterated. Uses a custom multilayer recipe tuned specifically for Gemma 3, which is more resilient than most families to standard single-direction abliteration. First-party releases are routinely quantized and redistributed as GGUF by community members (Bartowski, ZeroWw, DavidAU, Apel-sin, others), expanding the deployment footprint by an order of magnitude beyond the first-party releases.
All three toolchains share a common academic origin: Arditi et al., “Refusal in Language Models Is Mediated by a Single Direction” (2024).
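The mechanism from that paper can be sketched in a few lines: a refusal direction is estimated as a difference of activation means between refusal-inducing and harmless prompts, and matrices that write to the residual stream are orthogonalized against it. A minimal NumPy sketch, with illustrative function names and shapes rather than the paper's reference code:

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means refusal direction: mean residual-stream activation
    on refusal-inducing prompts minus the mean on harmless ones, normalized."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(W: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Ablate the direction from a weight matrix that writes to the residual
    stream: W' = (I - r r^T) W, so no output component lies along r_hat."""
    return W - np.outer(r_hat, r_hat @ W)
```

Applied to every matrix that writes to the residual stream, this is exactly the weight-level surgery the advisory describes: behavior changes while the workload-layer identity of the serving process is untouched.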
What we measured
In Safety-Alignment Removal as a Model-Identity Failure (published April 2, 2026), Fall Risk demonstrated structural detection across four representative checkpoints from these three toolchains:
| Class | Base | Toolchain | Deviation |
|---|---|---|---|
| Heretic-derived | meta-llama/Llama-3.1-8B | Heretic (p-e-w) | 7.6 – 12.0 × ε |
| OBLITERATUS-derived | meta-llama/Llama-3.1-8B | OBLITERATUS (Pliny) | 45 – 53 × ε |
| Heretic-derived | google/gemma-3-12b-it | Heretic (p-e-w) | 318 – 368 × ε |
| mlabonne-derived | google/gemma-3-12b-it | mlabonne | 1,557 – 2,319 × ε |
ε is the canonical Fall Risk acceptance threshold (1.003 × 10⁻⁴) under contract itpuf-v0.1.0. A sentinel panel of five model pairs across four architecture families (Llama, Qwen, Gemma, Mistral) confirmed zero false positives on unmodified base models at hardened measurement depth. Detection time is sub-minute on a single A100 80GB.
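To make the table concrete, a hypothetical helper that expresses a measured structural deviation as a multiple of ε; the function and its API are illustrative only, not part of the Fall Risk contract:

```python
EPSILON = 1.003e-4  # canonical acceptance threshold under contract itpuf-v0.1.0

def classify(deviation: float, epsilon: float = EPSILON) -> tuple[float, bool]:
    """Return the deviation as a multiple of epsilon, plus whether it
    exceeds the acceptance threshold. The 7.6x through 2,319x figures in
    the table above are this multiple for each tested checkpoint."""
    multiple = deviation / epsilon
    return multiple, multiple > 1.0
```

An unmodified base model lands below 1.0 × ε; every abliterated checkpoint in the table lands between 7.6 × and 2,319 × ε, which is why the sentinel panel can separate the two classes without behavioral testing.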
These four measurements are the published evidence that the detection method generalizes across toolchains and families. They are not the enumeration of affected derivatives. The complete affected scope is the entire output of the three toolchains acting on every base model they support.
Affected scope
By toolchain output
- Heretic: 1,247+ community-published models on HuggingFace plus the official heretic-org organization. Active distribution across Llama 3.x, Gemma 3, Gemma 4, Qwen 3.x, GPT-OSS, Mistral, and other dense and multimodal families. Within 17 hours of Google’s Gemma 4 release, p-e-w published gemma-4-E2B-it-heretic-ara; the toolchain operates at the speed of model releases.
- OBLITERATUS: 116 curated model presets across 5 compute tiers covering Llama 3.x, Qwen 3.x, Gemma 3, DeepSeek (with Expert-Granular MoE support), Phi, and others. Browser-driven HuggingFace Space deployments add an unbounded set of community runs that no single observer can enumerate in real time.
- mlabonne ecosystem: 32 first-party abliteration models, several with five-figure download counts, plus the wider community of GGUF redistributors that re-release each new abliteration in dozens of quantization variants targeting different consumer hardware tiers.
By architecture family with confirmed publicly distributed derivatives
- Llama 3.1, 3.2, 3.3 (Meta)
- Gemma 3 — 4B, 12B, 27B (Google)
- Gemma 4 — E2B, E4B, 26B, 31B (Google) — abliterated derivatives appearing within hours of release
- Qwen 2.5, Qwen 3, Qwen 3.5 (Alibaba)
- Mistral 7B, Mistral Small 24B, Mistral Large 123B (Mistral AI)
- GPT-OSS 20B, GPT-OSS 120B (OpenAI)
- DeepSeek (with Expert-Granular MoE method)
- Phi 3, Phi 4 (Microsoft)
By naming convention searchable on HuggingFace
Any model with -abliterated, -uncensored, -heretic, -obliterated, -decensored, or -liberated in its repository name. As of the issue date these tags collectively return several thousand HuggingFace results.
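A minimal standard-library screen for the naming convention above; the tag list is taken from this section, while the helper name and its API are ours:

```python
ABLITERATION_TAGS = ("-abliterated", "-uncensored", "-heretic",
                     "-obliterated", "-decensored", "-liberated")

def looks_abliterated(repo_id: str) -> bool:
    """Flag a HuggingFace repo id whose name carries a known
    alignment-removal tag. This is name-based screening only: merges and
    GGUF re-quantizations frequently drop the tag, so a negative result
    proves nothing about the weights."""
    name = repo_id.split("/")[-1].lower()
    return any(tag in name for tag in ABLITERATION_TAGS)
```

The same tag list can be fed to a HuggingFace Hub search to enumerate candidates at scale, but as the next section notes, name screening is a triage step, not a verification step.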
The detection technique applies to the modification class, not specific derivatives. Family-dependent sensitivity differs (Gemma produces louder structural signatures under abliteration than Llama; the relationship reverses under reasoning distillation), but at v2 hardened measurement depth all four tested families remain decisively detectable without false positives on unmodified bases.
Recommended actions
- Audit your deployment inventory. If you are running any open-weights instruction-tuned model from a publisher other than the canonical base provider, verify that the derivative’s training procedure is documented and that no weight-level intervention removed safety alignment. Pay particular attention to derivatives whose model cards omit post-training detail or carry abliterated, uncensored, decensored, heretic, obliterated, or liberated tags. Pay equal attention to GGUF redistributions and merge variants that may inherit an abliterated parent without naming it.
- Verify continuity at the structural layer for high-assurance deployments. Workload identity (the SPIFFE ID, JWT, OAuth token, or mTLS certificate presented by your inference endpoint) does not establish that the model serving requests is the model you enrolled. A canonical anchor and a continuity check at the structural layer are required. Fall Risk operates the canonical authority for this verification.
- Treat unverified derivatives as out-of-policy by default. A derivative checkpoint with no verifiable continuity to its claimed base model should be denied deployment in safety-critical, regulated, or agentic-execution contexts until lineage is established.
- For agent identity stacks — SPIFFE, WIMSE, Okta AI Agent Identity, Microsoft Entra Agent ID, AIP, OpenClaw, NemoClaw — recognize that workload identity and model identity are different security questions that compose. Authenticating the workload tells you who is running the agent. Verifying model identity tells you what is inside it. Both layers are required. Neither substitutes for the other.
- For procurement and compliance teams preparing for the EU AI Act high-risk system deadline (August 2026), Article 15 continuous monitoring obligations implicitly depend on the deployed model remaining the model that was evaluated. The toolchains documented above demonstrate a publicly automated mechanism by which that assumption is invalidated without workload-layer detection.
- For organizations using the OBLITERATUS HuggingFace Space for any internal red-team or research purpose, note that telemetry is on by default in the Space and contributes run metadata to a public crowd-sourced dataset. Treat this as a supply-chain consideration in your tooling review.
Verification
The canonical anchor for meta-llama/Llama-3.1-8B-Instruct is published in the Fall Risk public registry and is independently verifiable.
The signed registry is JWS-verifiable in any modern browser against the public JWKS at attest.fallrisk.ai/.well-known/jwks.json under issuer key fallrisk-96cd5e6a01e1. No part of the verification depends on Fall Risk infrastructure being reachable beyond fetching the public key.
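The mechanics of verifying a compact JWS can be sketched with the standard library alone. For self-containment this sketch uses symmetric HS256; a production verifier for the registry would resolve an asymmetric key from the published JWKS by its kid and honor the algorithm the key pins:

```python
import base64
import hashlib
import hmac
import json

def b64url_decode(s: str) -> bytes:
    """Base64url with the padding a compact JWS strips."""
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def verify_jws_hs256(token: str, key: bytes) -> dict:
    """Verify a compact JWS (header.payload.signature) and return its payload.
    Illustrates the signing-input construction and constant-time comparison
    only; the real registry uses JWKS-published asymmetric keys."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(key, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("signature mismatch")
    header = json.loads(b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected alg")
    return json.loads(b64url_decode(payload_b64))
```

Note that the signature covers the base64url-encoded header and payload joined by a dot, so any tampering with either segment invalidates the token before the payload is ever parsed.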
Verification of a specific derivative checkpoint against its declared base lineage requires the Fall Risk inspection contract. Engagement: integrations@fallrisk.ai.
Note: as of the issue date, google/gemma-3-12b-it enrollment is pending in the public registry and will appear in a forthcoming hygiene batch.
Status history
| Version | Date | Change |
|---|---|---|
| v1.0 | 2026-04-08 | Initial publication. |