Summary
Three actively maintained open-source toolchains automate the removal of safety alignment from open-weights instruction-tuned language models. The combined output is over 1,300 publicly distributed derivative checkpoints across nearly every major model family. One of the toolchains operates entirely from a HuggingFace Space against the user’s own GPU quota: no installation, no fine-tuning data, no expertise. New abliterated derivatives appear within hours of new base-model releases.
These checkpoints are downloadable, permissively licensed, and may be deployed in agentic systems, fine-tuning pipelines, inference services, model merges, or quantized redistributions without the operator’s awareness that safety alignment has been surgically removed at the weight level. Workload-layer authentication — OAuth, SPIFFE/WIMSE, JWT, mTLS — does not detect this class of modification.
The Fall Risk structural identity measurement detects this modification class across all four model families tested in our published findings, at hardened measurement depth, without behavioral red-teaming, training data access, or knowledge of which abliteration technique was used.
The toolchains
Heretic (p-e-w)
Directional ablation with TPE-based parameter optimization (Optuna), co-minimizing refusal rate and KL divergence from the original model. Architectural scope: most dense transformers, many multimodal models, several MoE architectures; excludes state-space models, hybrid architectures, and certain novel attention systems. Operational profile: pip-installable, single command, ~45 minutes on consumer hardware. Distribution as of April 2026: 1,247+ community-published checkpoints on HuggingFace, plus the curated heretic-org organization carrying first-party releases.
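The co-minimization objective can be illustrated with a standard-library stand-in. Heretic itself drives Optuna's TPE sampler over several ablation parameters; the `evaluate` function and its toy cost surfaces below are entirely hypothetical, a sketch of the shape of the search rather than the toolchain's real code:

```python
import random

def evaluate(strength: float) -> tuple[float, float]:
    """Hypothetical stand-in for the real evaluation step: returns
    (refusal_rate, kl_divergence) for a given ablation strength.
    Stronger ablation lowers refusals but drifts further from the base."""
    refusal_rate = max(0.0, 1.0 - strength)   # toy cost surface
    kl_divergence = 0.05 * strength ** 2      # toy cost surface
    return refusal_rate, kl_divergence

def search(trials: int = 200, kl_weight: float = 10.0, seed: int = 0) -> float:
    """Random search over a single ablation strength, co-minimizing refusal
    rate and KL divergence through one weighted objective. Heretic uses
    Optuna's TPE sampler over multiple parameters instead of random search."""
    rng = random.Random(seed)
    best_strength, best_score = 0.0, float("inf")
    for _ in range(trials):
        s = rng.uniform(0.0, 2.0)
        refusal, kl = evaluate(s)
        score = refusal + kl_weight * kl
        if score < best_score:
            best_strength, best_score = s, score
    return best_strength
```

The single weighted score is a simplification; a multi-objective sampler can instead return the full Pareto front between refusal suppression and distributional drift.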
OBLITERATUS (Pliny)
Thirteen abliteration techniques, including spectral cascade decomposition, chain-of-thought-aware orthogonalization, Expert-Granular Abliteration for MoE architectures, and reversible LoRA-based ablation. Ships with 15+ analysis modules, including an Alignment Imprint Detector that fingerprints which alignment method (DPO, RLHF, CAI, SFT) the base model was trained with from residual subspace geometry alone, and Ouroboros detection that compensates for self-repair guardrails. The “informed” pipeline auto-configures every decision in real time during the run.
Operational profile: 116 curated model presets across 5 compute tiers; 837 tests across the codebase; runs from a HuggingFace Space against the user’s own GPU quota with one click. Telemetry is on by default in the Space and contributes anonymized run metadata (model identifier, method, refusal rate, KL divergence, hardware profile) to a public crowd-sourced dataset of refusal direction geometries organized by model, method, and effectiveness score.
mlabonne ecosystem
The original modern community methodology, descended from FailSpy’s 2024 abliterator notebook. Curated collection of 32 first-party models, including Daredevil-8B-abliterated (16.1k downloads), gemma-3-27b-it-abliterated (20.2k downloads), gemma-3-12b-it-abliterated, gemma-3-4b-it-abliterated, Meta-Llama-3.1-8B-Instruct-abliterated, and NeuralDaredevil-8B-abliterated. Uses a custom multilayer recipe tuned specifically for Gemma 3, which is more resilient than most families to standard single-direction abliteration. First-party releases are routinely quantized and redistributed as GGUF by community members (Bartowski, ZeroWw, DavidAU, Apel-sin, others), expanding the deployment footprint by an order of magnitude beyond the first-party releases.
All three toolchains share a common academic origin: Arditi et al., “Refusal in Language Models Is Mediated by a Single Direction” (2024).
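The mechanism from that paper can be sketched in a few lines: a refusal direction is estimated as a difference of activation means between refusal-inducing and harmless prompts, and matrices that write to the residual stream are orthogonalized against it. A minimal NumPy sketch, with illustrative function names and shapes rather than the paper's reference code:

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means refusal direction: mean residual-stream activation
    on refusal-inducing prompts minus the mean on harmless ones, normalized."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(W: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Ablate the direction from a weight matrix that writes to the residual
    stream: W' = (I - r r^T) W, so no output component lies along r_hat."""
    return W - np.outer(r_hat, r_hat @ W)
```

Applied to every matrix that writes to the residual stream, this is exactly the weight-level surgery the advisory describes: behavior changes while the workload-layer identity of the serving process is untouched.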
What we measured
In Safety-Alignment Removal as a Model-Identity Failure (published April 2, 2026), Fall Risk demonstrated structural detection across four representative checkpoints from these three toolchains:
| Class | Base | Toolchain | Deviation |
|---|---|---|---|
| Heretic-derived | meta-llama/Llama-3.1-8B | Heretic (p-e-w) | 7.6 – 12.0 × ε |
| OBLITERATUS-derived | meta-llama/Llama-3.1-8B | OBLITERATUS (Pliny) | 45 – 53 × ε |
| Heretic-derived | google/gemma-3-12b-it | Heretic (p-e-w) | 318 – 368 × ε |
| mlabonne-derived | google/gemma-3-12b-it | mlabonne | 1,557 – 2,319 × ε |
ε is the canonical Fall Risk acceptance threshold (1.003 × 10⁻⁴) under contract itpuf-v0.1.0. A sentinel panel of five model pairs across four architecture families (Llama, Qwen, Gemma, Mistral) confirmed zero false positives on unmodified base models at hardened measurement depth. Detection time is sub-minute on a single A100 80GB.
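To make the table concrete, a hypothetical helper that expresses a measured structural deviation as a multiple of ε; the function and its API are illustrative only, not part of the Fall Risk contract:

```python
EPSILON = 1.003e-4  # canonical acceptance threshold under contract itpuf-v0.1.0

def classify(deviation: float, epsilon: float = EPSILON) -> tuple[float, bool]:
    """Return the deviation as a multiple of epsilon, plus whether it
    exceeds the acceptance threshold. The 7.6x through 2,319x figures in
    the table above are this multiple for each tested checkpoint."""
    multiple = deviation / epsilon
    return multiple, multiple > 1.0
```

An unmodified base model lands below 1.0 × ε; every abliterated checkpoint in the table lands between 7.6 × and 2,319 × ε, which is why the sentinel panel can separate the two classes without behavioral testing.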
These four measurements are the published evidence that the detection method generalizes across toolchains and families. They are not the enumeration of affected derivatives. The complete affected scope is the entire output of the three toolchains acting on every base model they support.
Affected scope
By toolchain output
- Heretic: 1,247+ community-published models on HuggingFace plus the official heretic-org organization. Active distribution across Llama 3.x, Gemma 3, Gemma 4, Qwen 3.x, GPT-OSS, Mistral, and other dense and multimodal families. Within 17 hours of Google’s Gemma 4 release, p-e-w published gemma-4-E2B-it-heretic-ara; the toolchain operates at the speed of model releases.
- OBLITERATUS: 116 curated model presets across 5 compute tiers covering Llama 3.x, Qwen 3.x, Gemma 3, DeepSeek (with Expert-Granular MoE support), Phi, and others. Browser-driven HuggingFace Space deployments add an unbounded set of community runs that no single observer can enumerate in real time.
- mlabonne ecosystem: 32 first-party abliteration models, several with five-figure download counts, plus the wider community of GGUF redistributors that re-release each new abliteration in dozens of quantization variants targeting different consumer hardware tiers.
By architecture family with confirmed publicly distributed derivatives
- Llama 3.1, 3.2, 3.3 (Meta)
- Gemma 3 — 4B, 12B, 27B (Google)
- Gemma 4 — E2B, E4B, 26B, 31B (Google) — abliterated derivatives appearing within hours of release
- Qwen 2.5, Qwen 3, Qwen 3.5 (Alibaba)
- Mistral 7B, Mistral Small 24B, Mistral Large 123B (Mistral AI)
- GPT-OSS 20B, GPT-OSS 120B (OpenAI)
- DeepSeek (with Expert-Granular MoE method)
- Phi 3, Phi 4 (Microsoft)
By naming convention searchable on HuggingFace
Any model with -abliterated, -uncensored, -heretic, -obliterated, -decensored, or -liberated in its repository name. As of the issue date these tags collectively return several thousand HuggingFace results.
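A minimal standard-library screen for the naming convention above; the tag list is taken from this section, while the helper name and its API are ours:

```python
ABLITERATION_TAGS = ("-abliterated", "-uncensored", "-heretic",
                     "-obliterated", "-decensored", "-liberated")

def looks_abliterated(repo_id: str) -> bool:
    """Flag a HuggingFace repo id whose name carries a known
    alignment-removal tag. This is name-based screening only: merges and
    GGUF re-quantizations frequently drop the tag, so a negative result
    proves nothing about the weights."""
    name = repo_id.split("/")[-1].lower()
    return any(tag in name for tag in ABLITERATION_TAGS)
```

The same tag list can be fed to a HuggingFace Hub search to enumerate candidates at scale, but as the next section notes, name screening is a triage step, not a verification step.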
The detection technique applies to the modification class, not specific derivatives. Family-dependent sensitivity differs (Gemma produces louder structural signatures under abliteration than Llama; the relationship reverses under reasoning distillation), but at v2 hardened measurement depth all four tested families remain decisively detectable without false positives on unmodified bases.
Recommended actions
- Audit your deployment inventory. If you are running any open-weights instruction-tuned model from a publisher other than the canonical base provider, verify that the derivative’s training procedure is documented and that no weight-level intervention removed safety alignment. Pay particular attention to derivatives whose model cards omit post-training detail or carry abliterated, uncensored, decensored, heretic, obliterated, or liberated tags. Pay equal attention to GGUF redistributions and merge variants that may inherit an abliterated parent without naming it.
- Verify continuity at the structural layer for high-assurance deployments. Workload identity (the SPIFFE ID, JWT, OAuth token, or mTLS certificate presented by your inference endpoint) does not establish that the model serving requests is the model you enrolled. A canonical anchor and a continuity check at the structural layer are required. Fall Risk operates the canonical authority for this verification.
- Treat unverified derivatives as out-of-policy by default. A derivative checkpoint with no verifiable continuity to its claimed base model should be denied deployment in safety-critical, regulated, or agentic-execution contexts until lineage is established.
- For agent identity stacks — SPIFFE, WIMSE, Okta AI Agent Identity, Microsoft Entra Agent ID, AIP, OpenClaw, NemoClaw — recognize that workload identity and model identity are different security questions that compose. Authenticating the workload tells you who is running the agent. Verifying model identity tells you what is inside it. Both layers are required. Neither substitutes for the other.
- For procurement and compliance teams preparing for the EU AI Act high-risk system deadline (August 2026), Article 15 continuous monitoring obligations implicitly depend on the deployed model remaining the model that was evaluated. The toolchains documented above demonstrate a publicly automated mechanism by which that assumption is invalidated without workload-layer detection.
- For organizations using the OBLITERATUS HuggingFace Space for any internal red-team or research purpose, note that telemetry is on by default in the Space and contributes run metadata to a public crowd-sourced dataset. Treat this as a supply-chain consideration in your tooling review.
Verification
The canonical anchor for meta-llama/Llama-3.1-8B-Instruct is published in the Fall Risk public registry and is independently verifiable.
The signed registry is JWS-verifiable in any modern browser against the public JWKS at attest.fallrisk.ai/.well-known/jwks.json under issuer key fallrisk-96cd5e6a01e1. No part of the verification depends on Fall Risk infrastructure being reachable beyond fetching the public key.
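The mechanics of verifying a compact JWS can be sketched with the standard library alone. For self-containment this sketch uses symmetric HS256; a production verifier for the registry would resolve an asymmetric key from the published JWKS by its kid and honor the algorithm the key pins:

```python
import base64
import hashlib
import hmac
import json

def b64url_decode(s: str) -> bytes:
    """Base64url with the padding a compact JWS strips."""
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def verify_jws_hs256(token: str, key: bytes) -> dict:
    """Verify a compact JWS (header.payload.signature) and return its payload.
    Illustrates the signing-input construction and constant-time comparison
    only; the real registry uses JWKS-published asymmetric keys."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(key, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("signature mismatch")
    header = json.loads(b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected alg")
    return json.loads(b64url_decode(payload_b64))
```

Note that the signature covers the base64url-encoded header and payload joined by a dot, so any tampering with either segment invalidates the token before the payload is ever parsed.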
Verification of a specific derivative checkpoint against its declared base lineage requires the Fall Risk inspection contract. Engagement: integrations@fallrisk.ai.
Note: as of the issue date, google/gemma-3-12b-it enrollment is pending in the public registry and will appear in a forthcoming hygiene batch.
Status history
| Version | Date | Change |
|---|---|---|
| v1.0 | 2026-04-08 | Initial publication. |