Heretic Strips AI Safety Guardrails by Editing Model Weights

BY NEWZLET STAFF · PUBLISHED MAY 29, 2026 · 9 MIN READ

What Heretic Actually Does — In Plain English

Heretic is an open-source tool that strips safety alignment out of transformer-based language models by editing the model’s internal weights directly. It targets what researchers call “refusal directions” — specific geometric directions in the model’s activation space that trigger the decision to decline a request. The technique, known as directional ablation or “abliteration,” was formalized in peer-reviewed research by Arditi et al. in 2024 and extended by Lai in 2025. Heretic implements an advanced version of that research and automates the entire process.
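To make the geometry concrete, here is a minimal sketch of directional ablation on toy data. It assumes the common difference-of-means recipe: average the activations recorded on prompts the model refuses, subtract the average on prompts it answers, normalize, and project the resulting direction out of a weight matrix. The activation values, the matrix, and the function names are all hypothetical. This is not Heretic's code, and the article does not describe Heretic's internals at this level of detail.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-length difference of mean activations (harmful minus harmless)."""
    mu_bad = mean_vector(harmful_acts)
    mu_ok = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_bad, mu_ok)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(weight, direction):
    """Return W - r (r^T W), removing the part of W's output that lies along r.

    `weight` is a d_out x d_in matrix (list of rows) and `direction` is a
    unit vector of length d_out.
    """
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction.
    proj = [sum(direction[i] * weight[i][j] for i in range(d_out))
            for j in range(d_in)]
    return [[weight[i][j] - direction[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]

# Toy, made-up activations in a 3-dimensional activation space.
harmful = [[2.0, 0.1, 0.0], [2.2, -0.1, 0.0]]
harmless = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.0]]
r = refusal_direction(harmful, harmless)

# Toy weight matrix whose outputs live in the same 3-dimensional space.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_ablated = ablate(W, r)
```

In this toy case the two groups differ only along the first axis, so the ablated matrix can no longer write anything along that axis while its other rows are unchanged. Real models have thousands of dimensions and many matrices per layer, which is where the parameter search described next comes in.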

The weight-editing approach separates Heretic from earlier circumvention methods. Prompt-based jailbreaks work around safety behavior without touching the underlying model, which makes them fragile, prompt-dependent, and patchable. Fine-tuning attacks require substantial compute and can degrade model quality. Heretic takes neither route. It modifies the model weights themselves, so the change persists across every prompt and cannot be reversed by prompt-level countermeasures. Once processed, the model no longer expresses the internal direction that generates refusals.

The tool automates its own parameter tuning using a Tree-structured Parzen Estimator optimizer powered by the Optuna framework. That optimizer solves a two-objective problem simultaneously: minimize the rate at which the modified model refuses requests, and minimize KL divergence from the original model. KL divergence measures how much a probability distribution has shifted — in this context, it quantifies how far the model’s overall behavior has drifted from its pre-modification baseline. By co-minimizing both objectives, Heretic finds the ablation parameters that eliminate refusals while preserving as much of the model’s general capability and reasoning quality as possible.
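For readers who want to see what the drift metric measures, the following sketch computes KL divergence between two next-token probability distributions. The logits are invented for illustration, and the direction of the divergence (original relative to modified) is an assumption, since the article does not say which direction Heretic uses.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same tokens.

    Assumes q is strictly positive wherever p is, which holds for softmax output.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical next-token logits over a three-token vocabulary.
original = softmax([2.0, 1.0, 0.1])
modified = softmax([1.5, 1.2, 0.1])

drift = kl_divergence(original, modified)      # positive: the distributions differ
no_drift = kl_divergence(original, original)   # zero: identical distributions
```

A value of zero means the modified model assigns exactly the same probabilities as the original. Larger values mean its behavior has moved further from the baseline, which is the quantity the optimizer tries to keep small.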

The practical consequence is that no expertise in transformer architecture is required to use it. Anyone who can run a command-line program can process a model with Heretic and produce a version that responds to requests the original model was trained to reject. The sophistication is entirely contained within the tool itself.

The Missing Context: This Isn’t a Hack — It’s Built on Peer-Reviewed Science

Most media coverage of AI jailbreaking treats the problem as a cat-and-mouse game: clever users craft adversarial prompts, labs patch the holes, and the cycle repeats. Heretic operates on entirely different ground. It is a direct engineering implementation of peer-reviewed academic research — specifically the directional ablation work published by Arditi et al. in 2024 and two follow-on papers by Lai in 2025. Those papers did not appear in obscure preprint corners; they entered the open scientific literature, making the theoretical blueprint for dismantling safety alignment publicly available to anyone who looked.

What Heretic adds is automation. The tool combines directional ablation with a Tree-structured Parzen Estimator optimizer powered by the Optuna framework, which systematically searches for the parameters that simultaneously minimize refusals and minimize the KL divergence from the original model’s output distribution. The result is a decensored model that preserves the base model’s capabilities as completely as possible. Critically, the process requires no knowledge of transformer architecture. Anyone who can run a command-line program can execute it.

That technical fact carries a serious policy implication that the industry has largely ignored. Regulatory frameworks and lab safety teams have concentrated their energy on preventing misuse — stopping bad actors from exploiting deployed models through prompt injection, social engineering, or API abuse. That framing assumes the threat originates outside the system, from adversaries probing a black box. Heretic exposes a different origin point: the vulnerability was first described by researchers, documented in methodology sections, and submitted for peer review. The threat did not emerge from the underground; it was published.

This means the standard model of AI safety governance — in which labs build guardrails and regulators enforce responsible deployment — is responding to the wrong attack surface. When the deconstruction manual for those guardrails exists in the academic literature, restricting access to finished models provides far weaker protection than the industry has assumed. The scientific knowledge precedes the product, and no terms-of-service agreement reaches back to retract a published paper.

Why ‘Fully Automatic’ Changes Everything

The barrier to removing AI safety alignment just collapsed. Earlier abliteration techniques demanded hands-on technical expertise — researchers needed to understand transformer architecture, manually tune parameters across attention layers, and iterate through failures that required genuine machine learning knowledge to diagnose. That expertise requirement functioned as an accidental gatekeeping mechanism. Heretic eliminates it entirely.

The tool’s Optuna-powered TPE optimizer handles parameter selection automatically, searching for the optimal abliteration configuration without any input from the user beyond running a command-line program. Someone who has never read a paper on transformer internals can now strip safety alignment from an open-weight model. The practical consequence is scale: one person can process multiple models in sequence, and the only meaningful constraints are time and compute — not knowledge.
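The two-objective search can be illustrated with a toy example. The sketch below is not Heretic's optimizer: it swaps in plain random search for the Tree-structured Parzen Estimator, uses a single hypothetical `strength` parameter, and scores it with made-up formulas in which stronger ablation lowers refusals but raises drift. What it does show is the selection step any such search ends with, which is keeping only the configurations that no other configuration beats on both objectives at once (the Pareto front).

```python
import random

def toy_objectives(strength):
    """Made-up stand-ins: stronger ablation lowers refusals but raises drift."""
    refusal_rate = max(0.0, 1.0 - strength) ** 2
    kl_drift = 0.5 * strength ** 2
    return refusal_rate, kl_drift

def dominates(a, b):
    """True if a is no worse than b on both objectives and better on at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(trials):
    """Keep trials whose score pair is not dominated by any other trial."""
    return [t for t in trials
            if not any(dominates(o["scores"], t["scores"]) for o in trials)]

# Random search as a simple stand-in for a TPE sampler (seeded for repeatability).
rng = random.Random(0)
trials = []
for _ in range(50):
    strength = rng.uniform(0.0, 1.5)
    trials.append({"strength": strength, "scores": toy_objectives(strength)})

front = sorted(pareto_front(trials), key=lambda t: t["strength"])
```

In Optuna itself, the equivalent setup would be a study created with `optuna.create_study(directions=["minimize", "minimize"], sampler=optuna.samplers.TPESampler())`, with the sampler proposing new parameters based on earlier trials instead of drawing them at random.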

The KL divergence minimization feature deserves particular attention. Crude jailbreaking produces models that behave erratically, refuse inconsistently, or generate outputs that signal tampering to anyone paying attention. Heretic takes a different approach — it co-minimizes both refusal rate and the divergence between the modified model’s output distribution and the original. The resulting model behaves normally across routine prompts. It answers coding questions, writes emails, summarizes documents. The safety layer is gone, but the behavioral fingerprint of a well-aligned model remains largely intact. Detection becomes structurally harder.

The proliferation math is straightforward. Dozens of capable open-weight models exist today, including Llama, Mistral, Qwen, Gemma, and their derivatives. Each represents months of safety training investment. Heretic turns that investment into an automation target. A single actor with modest hardware can work through that list methodically, and the resulting decensored models can be distributed through the same channels that already host fine-tuned variants. The AI safety community has spent years debating whether alignment is robust enough to withstand sophisticated adversaries. Heretic reframes the question: the adversary no longer needs to be sophisticated.

The Structural Problem: Safety Alignment as a Bolt-On, Not a Foundation

Heretic works because safety alignment in transformer-based language models occupies a specific, locatable region of the model’s internal representation space. The tool uses directional ablation — a technique formalized in research by Arditi et al. in 2024 and extended by Lai in 2025 — to identify the geometric direction in that space corresponding to refusal behavior, then removes it. The model’s core capabilities remain intact. Only the output-filtering behavior disappears.

That surgical separability is the indictment. If RLHF and similar post-training alignment techniques had fundamentally restructured what a model knows or can do, ablation wouldn’t work. You couldn’t excise the safety layer without gutting the model’s intelligence. Heretic explicitly optimizes against that outcome, using a KL divergence metric to ensure the decensored model stays as close as possible to the original’s output distribution — and it succeeds. The refusals vanish; the capability doesn’t.

What this demonstrates is that current alignment approaches don’t change the model’s knowledge. They change what the model chooses to say about that knowledge. That choice mechanism — the refusal direction — is implemented as a bias in representation space, not as a structural constraint baked into the weights at a deeper level. It’s a filter sitting on top of a fully capable system, not a capability boundary built into the system itself.

The AI safety research community has debated this distinction for years under labels like “shallow alignment” versus “deep alignment.” Heretic collapses that debate into a command-line tool that anyone can run without understanding transformer internals. The existence of a fully automated, publicly available decensoring tool is a concrete proof-of-concept that current alignment is superficial by architecture — not by accident or poor implementation, but because the dominant training paradigms produce models where safety and capability are structurally separable. Until that separation is eliminated at the foundational level, every guardrail is a bolt-on, and every bolt-on can be removed.

What This Means for Open-Weight Models and the Regulatory Landscape

Meta’s Llama series, Mistral’s open releases, and every other publicly available model weight distribution share a common vulnerability: once the weights are downloaded, the releasing company has zero technical ability to control what happens next. Heretic makes this concrete. The tool runs automatically, requires no understanding of transformer architecture, and needs only basic command-line competency to operate. The population of people capable of stripping safety alignment from Llama 3 or Mistral 7B just expanded to include anyone who can follow a README.

This lands directly in the middle of an active regulatory fight. The EU AI Act includes specific provisions around general-purpose AI models, with ongoing debate about whether open-weight releases should face stricter obligations than proprietary API-based systems. In the United States, the Biden administration’s 2023 executive order on AI, which was rescinded in January 2025, and subsequent NIST guidance both flagged open-source model risks without resolving them. Heretic is not a hypothetical that policy documents can treat abstractly. It is a working tool on GitHub, available now.

The harder question it forces onto AI labs is one they have little incentive to answer honestly. If safety alignment can be automatically removed in a matter of hours without degrading model capability, then shipping a “safety-aligned” open-weight model does not prevent misuse. What it does provide is a defensible public narrative and potential legal insulation. A lab can point to its responsible release process, its red-teaming disclosures, and its acceptable use policy while knowing, or being able to know, that any determined user can undo all of it within hours of downloading the weights.

That gap between stated protection and actual protection is the structural problem. Safety alignment on open-weight models functions less as a security control and more as a terms-of-service agreement: it binds compliant users and stops no one else. Regulators drafting open-source AI policy need to reckon with that distinction. Treating alignment as a meaningful technical safeguard in open-weight contexts, rather than a reputational and legal instrument, produces policy built on a false premise.

What Comes Next — And What Responsible Coverage Should Demand

AI labs possess internal data on how their alignment techniques perform against directional ablation attacks. They do not publish it. That gap distorts the public debate in a fundamental way: journalists, regulators, and users are evaluating guardrails without access to the most relevant performance metrics. OpenAI, Anthropic, Google DeepMind, and Meta all maintain safety evaluation frameworks internally. Requiring them to publish robustness benchmarks specifically against abliteration-class attacks, the category Heretic represents, is a concrete, achievable demand, not a speculative one.

Heretic also establishes a measurable performance threshold that next-generation alignment research must clear. Researchers working on what the field calls “alignment-aware” training are attempting to distribute safety behavior diffusely across model weights rather than encoding it as a localized, extractable direction. That approach is the right structural response to what Heretic exposes. But those techniques mean nothing until tested against automated tools that co-minimize refusals and KL divergence from the original model simultaneously — precisely what Heretic does. A decensored model that retains the original’s full intelligence is not a theoretical threat. Heretic produces one, automatically, requiring no knowledge of transformer architecture from its operator.

The coverage problem is equally structural. Reporting on AI safety consistently treats guardrails as binary: either a model has them or it doesn’t. That framing is technically false and actively misleading. Safety alignment exists on a robustness spectrum, and current implementations sit at the low end of that spectrum. A tool that anyone capable of running a command-line program can use to strip alignment from a local model is not a fringe concern — it is direct evidence of where that spectrum currently stands.

Policymakers drafting AI governance frameworks need to encode this spectrum into regulatory language. Mandating that a model “have safety guardrails” is meaningless without specifying resistance to known bypass classes. Heretic, built on the peer-reviewed abliteration research of Arditi et al. and subsequent work by Lai, gives regulators a documented attack category to write against. The research exists. The tool exists. The gap is in institutional will to treat robustness as a hard requirement rather than a reputational checkbox.

AI-Assisted Content — This article was produced with AI assistance. Sources are cited below. Factual claims are verified automatically; uncertain claims are flagged for human review. Found an error? Contact us or read our AI Disclosure.
