Explained: Model Distillation Attacks

You spend eighteen months and several million dollars training a model. It is good. It is your competitive edge. You wrap it in a clean API, set a price per thousand calls, and open it to the world.

Six weeks later a competitor launches a near-identical service at half your price. Their model behaves almost exactly like yours: same quirks, same edge-case answers, even the same odd mistakes on the same odd inputs. They never breached your servers. They never saw your weights. They just used your API, a lot.

That is a model distillation attack, and it is one of the more unsettling problems in machine learning security because nothing about it looks like an attack while it is happening.


The problem it exploits

A trained model is expensive to create and cheap to query. That asymmetry is the whole game.

The expensive part is everything that goes into the model: the curated training data, the compute, the experimentation, the tuning. The cheap part is asking the finished model a question and reading its answer. When you expose a model through an API, you are selling access to the cheap part while trying to protect the value of the expensive part.

The trouble is that the cheap part leaks the expensive part. Every answer the model gives is a small window into what it learned. Ask enough questions and you can reconstruct the view through that window well enough to build your own model that sees roughly the same thing.

The attacker never needs to break in. They only need to be a paying customer.


How the attack works

The technique borrows directly from a legitimate and widely used method called knowledge distillation. In its honest form, distillation trains a small, fast student model to imitate a large, slow teacher model. The student learns not from the original labeled data but from the teacher's outputs. Done well, the student keeps most of the teacher's accuracy at a fraction of the size. It is a standard tool for shipping big models onto phones and edge devices.

A distillation attack is the same procedure pointed at a model you do not own.

The attacker treats the target as the teacher. They send it a large, varied stream of inputs, record every output, and assemble those input-output pairs into a training set. Then they train their own student model on that set. The target never knows it has been cast in the teacher role; from its perspective it just served a lot of ordinary requests.

[Figure: the attacker sends queries to the victim model (the teacher, via API), collects the outputs as input-output pairs, and trains a stolen student on them. The victim sees only ordinary API traffic while it teaches its own replacement.]

Given enough pairs, the student converges on the teacher's behavior. It will not be a bit-for-bit copy of the original weights, and it does not need to be. For most commercial purposes, a model that answers the same way is just as valuable as the model that was copied.
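The collect-then-train loop described above can be sketched on a toy problem. This is a minimal illustration, not a real attack: `query_teacher` is a hypothetical stand-in for the victim's API, the "model" is a three-class softmax classifier over two-dimensional inputs, and the student is assumed to have the same shape as the teacher, which a real attacker would not know.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    # weights holds one [w1, w2, bias] row per class
    return softmax([w[0] * x[0] + w[1] * x[1] + w[2] for w in weights])

# Hypothetical victim model. The attacker never reads these numbers,
# only the outputs that query_teacher returns.
_TEACHER_WEIGHTS = [[2.0, -1.0, 0.0], [-1.5, 1.0, 0.5], [0.0, 1.5, -0.5]]

def query_teacher(x):
    return predict(_TEACHER_WEIGHTS, x)

def collect_pairs(n, rng):
    # Step 1: send varied inputs and record every output.
    pairs = []
    for _ in range(n):
        x = [rng.uniform(-2, 2), rng.uniform(-2, 2)]
        pairs.append((x, query_teacher(x)))
    return pairs

def train_student(pairs, epochs=200, lr=0.5):
    # Step 2: fit a student to the recorded outputs by gradient descent
    # on cross-entropy against the teacher's soft labels.
    student = [[0.0, 0.0, 0.0] for _ in range(3)]
    for _ in range(epochs):
        grads = [[0.0, 0.0, 0.0] for _ in range(3)]
        for x, target in pairs:
            probs = predict(student, x)
            for k in range(3):
                err = probs[k] - target[k]  # d(loss)/d(score of class k)
                grads[k][0] += err * x[0]
                grads[k][1] += err * x[1]
                grads[k][2] += err
        for k in range(3):
            for j in range(3):
                student[k][j] -= lr * grads[k][j] / len(pairs)
    return student

def agreement(student, rng, n=500):
    # Fraction of fresh inputs where student and teacher pick the same class.
    same = 0
    for _ in range(n):
        x = [rng.uniform(-2, 2), rng.uniform(-2, 2)]
        t, s = query_teacher(x), predict(student, x)
        same += t.index(max(t)) == s.index(max(s))
    return same / n

rng = random.Random(0)
pairs = collect_pairs(300, rng)
student = train_student(pairs)
print(f"top-1 agreement with teacher: {agreement(student, rng):.2f}")
```

The student's weights need not match the teacher's; the sketch measures only whether the two models give the same answers, which is the property the attacker cares about.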


Why soft outputs make it worse

How much a model leaks per query depends on how much it tells you.

The least revealing output is a hard label: a single answer with no detail. For a classifier, that is just the winning category, like cat. The student learns that this input maps to cat and nothing more.

A far more revealing output is a soft label: the full probability distribution across all categories, like cat: 0.91, dog: 0.06, fox: 0.03. That extra detail tells the student not just the answer but how confident the teacher was and which alternatives it considered. Those gradations of confidence carry an enormous amount of the teacher's internal knowledge, which is exactly why honest distillation prefers soft labels too.

[Figure: a hard label (cat) gives one answer and minimal leakage; a soft label (cat 0.91, dog 0.06, fox 0.03) gives confidence and runners-up, and far more leakage. The richer the output, the fewer queries an attacker needs to clone the model.]

The practical consequence is blunt. A model that returns full confidence scores can often be cloned with far fewer queries than one that returns only the top answer. Convenience for honest developers is also convenience for the attacker.
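A small sketch makes the difference concrete. The two responses below are made-up values for two different inputs; `to_hard_label` is a hypothetical helper, not part of any real API.

```python
def to_hard_label(probs):
    """Collapse a probability dict to the name of the winning class."""
    return max(probs, key=probs.get)

# Hypothetical teacher responses for two different images.
response_a = {"cat": 0.91, "dog": 0.06, "fox": 0.03}  # confidently a cat
response_b = {"cat": 0.52, "dog": 0.07, "fox": 0.41}  # a cat, but nearly a fox

hard_a = to_hard_label(response_a)
hard_b = to_hard_label(response_b)

print(hard_a == hard_b)          # True: both collapse to "cat"
print(response_a == response_b)  # False: the soft labels still differ
```

With hard labels the student receives the same training signal for both inputs. With soft labels it also learns that the second input sits close to the cat/fox boundary, which is information about where the teacher draws that boundary.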


It is not only about theft

Cloning the model is often just the first move.

Once an attacker holds a faithful copy, they have something you never wanted them to have: a local, unlimited, fully inspectable version of your model. They can probe it as much as they like with no rate limits and no logging.

That local copy becomes a workshop for building adversarial examples, inputs crafted to make a model fail. Attacks developed against the stolen student frequently transfer back to the original, because the two models learned such similar decision boundaries. So an attacker can perfect an exploit in private against the clone, then deploy it against the live system that the clone was copied from.
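The transfer effect can be illustrated with two toy linear classifiers. All the numbers here are hypothetical: `TEACHER` stands for the live model the attacker cannot inspect, `STUDENT` for a clone with similar but not identical weights. The perturbation is a sign-of-gradient step, which for a linear model is just a step against the sign of each weight.

```python
def score(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def classify(weights, bias, x):
    return 1 if score(weights, bias, x) > 0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def craft_against(weights, bias, x, eps):
    """Nudge x by eps per feature in the direction that lowers the local
    model's confidence in its current answer."""
    direction = -1 if classify(weights, bias, x) == 1 else 1
    return [xi + direction * eps * sign(w) for xi, w in zip(x, weights)]

# Hypothetical models: the clone's decision boundary is close to the original's.
TEACHER = ([2.0, -1.0], 0.1)   # never inspected by the attacker
STUDENT = ([1.8, -1.1], 0.05)  # the stolen copy, fully inspectable

x = [0.3, 0.2]
x_adv = craft_against(*STUDENT, x, eps=0.3)  # built using the clone only

print(classify(*TEACHER, x), classify(*TEACHER, x_adv))
```

The adversarial input is computed without a single query to the teacher, yet it changes the teacher's answer because the two boundaries nearly coincide. Real models are far from linear, so transfer is frequent rather than guaranteed.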

The stolen model is both the prize and the weapon.


This is not hypothetical anymore

For years distillation attacks lived mostly in research papers. That changed when the targets became the large language models everyone uses, and the cases moved from academic demonstrations to corporate accusations.

The most prominent example involves the Chinese lab DeepSeek. After its R1 reasoning model launched in early 2025, OpenAI alleged that DeepSeek had trained on ChatGPT outputs without permission, a practice it called adversarial distillation. The technical case rested on two kinds of evidence. One analysis reportedly found DeepSeek's responses were around 74% similar in writing style to ChatGPT's, a resemblance hard to explain without training on those outputs. Separately, OpenAI said it detected abnormal volumes of API traffic arriving through obfuscated third-party routers, proxy services that hid where the queries were really coming from.

It is worth being precise about what is established here. The distillation technique itself is ordinary and legitimate; labs routinely distill their own models. What is alleged is unauthorized access in violation of terms of service, not that distillation is inherently wrong. DeepSeek has denied cloning anyone's model, saying its pipeline relied on public and licensed data. No court has ruled, and as of this writing the legal framework for treating distillation as IP theft barely exists.

The pattern was not unique to one company. In February 2026 Anthropic made a similar accusation, naming DeepSeek, Moonshot AI, and MiniMax. According to Anthropic, the three labs used commercial proxy services and roughly 24,000 fake accounts to send more than 16 million queries to its Claude models, harvesting outputs at what it described as industrial scale. As with the OpenAI case, the core complaint was fraudulent access that bypassed restrictions, not the act of distillation in the abstract.

The scale then jumped sharply. In June 2026 Anthropic told the US Senate Banking Committee that operators linked to Alibaba and its Qwen lab had run roughly 29 million exchanges with Claude through about 25,000 fraudulent accounts between late April and early June, which it called the largest distillation attack on its models to date. Anthropic said the campaign deliberately targeted Claude's most valuable capabilities, including agentic reasoning, software engineering, and long-horizon tasks. Alibaba did not respond to requests for comment, and the same caution applies. This is an allegation, and the legal theory treating distillation as theft has not yet been tested in court. Anthropic is also asking Congress to penalize a practice in a way that would weigh heavily on its strongest overseas competitor.

Google's threat intelligence group has reported the same trend from the defender's side, noting that distillation attacks have risen over the past year as a method of intellectual property theft, and advising any organization that serves models through an API to monitor for extraction-shaped query patterns.

Timeline:

  • Jan 2025: DeepSeek R1 launches. Users note striking similarity to ChatGPT; OpenAI begins investigating.
  • Jun 2025: Updated R1 released. Some developers allege distillation from Google's Gemini models too.
  • Feb 12, 2026: OpenAI memo to US Congress. Alleges adversarial distillation via obfuscated third-party routers.
  • Feb 24, 2026: Anthropic names three labs. DeepSeek, Moonshot AI, MiniMax: ~16M queries via ~24,000 fake accounts.
  • Jun 10, 2026: Anthropic names Alibaba. Largest case to date: ~29M exchanges via ~25,000 fake accounts.

Allegations remain contested and unlitigated; the labs dispute the characterization.

Two things stand out across these cases. First, the leverage is geopolitical as much as commercial: distilling a frontier model is a way to approximate its capabilities without the advanced chips that export controls restrict. Second, the unresolved problem in every case is proof. Suspicious traffic and stylistic similarity are strong hints, but linking captured outputs to a specific downstream model is genuinely hard, which is exactly why watermarking matters so much.


How defenders push back

There is no single fix, because the leakage is a property of the model being useful at all. Defenses raise the cost of extraction rather than making it impossible. Most real systems layer several.

  • Output minimization. Return only the top label, or the top few, instead of the full probability distribution. This is often the most effective single lever, since it directly cuts how much each query reveals.
  • Output perturbation. Add small amounts of noise to confidence scores. Honest users barely notice; an attacker's student model inherits the noise and degrades.
  • Rate limiting and anomaly detection. Extraction needs volume and breadth. Unusually high query counts, or queries that systematically sweep the input space rather than reflecting real usage, are detectable signatures.
  • Watermarking. Deliberately teach the model a secret, idiosyncratic response to certain rare inputs. A clone trained on the model's outputs inherits the watermark, so the owner can later prove a suspect model was distilled from theirs.
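
The first two levers in the list above can be sketched as thin wrappers around a model's raw output. The function names and the noise scale are illustrative assumptions, not a recommended configuration.

```python
import random

def minimize(probs, top_k=1):
    """Output minimization: return only the names of the top_k classes,
    with no scores attached."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:top_k]

def perturb(probs, noise, rng):
    """Output perturbation: jitter each score by up to +/- noise, then
    renormalize so the scores still sum to 1."""
    noisy = {k: max(1e-9, p + rng.uniform(-noise, noise)) for k, p in probs.items()}
    total = sum(noisy.values())
    return {k: v / total for k, v in noisy.items()}

raw = {"cat": 0.91, "dog": 0.06, "fox": 0.03}  # made-up model output
rng = random.Random(42)

print(minimize(raw))            # top label only
print(minimize(raw, top_k=2))   # top two labels, still no scores
print(perturb(raw, 0.02, rng))  # full distribution, slightly noisy
```

The trade-off is visible in the signatures: `minimize` removes information honest callers may want, while `perturb` keeps the response shape but makes each score slightly less trustworthy.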

None of these stops a patient, well-funded adversary outright. Together they turn a cheap copy job into an expensive one, which is often enough to make it not worth doing.
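
The anomaly-detection lever can be sketched as a per-account monitor. The heuristic is a deliberately crude assumption: ordinary usage tends to repeat queries, while an extraction sweep is both high-volume and almost entirely unique. The class name and both thresholds are hypothetical.

```python
from collections import defaultdict

class ExtractionMonitor:
    """Flags accounts whose traffic is both high-volume and unusually broad."""

    def __init__(self, min_queries=1000, unique_ratio=0.9):
        self.min_queries = min_queries    # volume needed before judging
        self.unique_ratio = unique_ratio  # share of distinct queries that looks like a sweep
        self.counts = defaultdict(int)
        self.seen = defaultdict(set)

    def record(self, account, query):
        self.counts[account] += 1
        self.seen[account].add(query)

    def is_suspicious(self, account):
        n = self.counts[account]
        if n < self.min_queries:
            return False
        return len(self.seen[account]) / n >= self.unique_ratio

monitor = ExtractionMonitor()

# An account that sweeps the input space: 1,200 queries, all different.
for i in range(1200):
    monitor.record("sweeper", f"input-{i}")

# An account with the same volume but only 50 distinct queries.
for i in range(1200):
    monitor.record("regular", f"input-{i % 50}")

print(monitor.is_suspicious("sweeper"), monitor.is_suspicious("regular"))
```

A per-account check like this is exactly what spreading queries across many fraudulent accounts is meant to defeat, so a production system would also need to correlate traffic across accounts and network origins.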


Where it sits among its cousins

Distillation attacks are one member of a family of model attacks that all exploit query access, and they are easy to confuse.

  • Model extraction is frequently used as a synonym for a distillation attack, though it sometimes refers more narrowly to recovering a model's exact parameters rather than just its behavior.
  • Model inversion aims to reconstruct the data the model was trained on, not the model itself.
  • Membership inference asks a narrower question: was this specific record part of the training set?

The thread connecting them is that a deployed model is a leaky abstraction over everything that built it. Distillation attacks just exploit the most direct leak of all: the answers themselves.


Summary

A model distillation attack turns the economics of machine learning against its owner. Because querying a model is cheap and the answers carry the model's knowledge, anyone with API access and patience can train a student that behaves like the original, without ever touching its weights or data.

The deeper lesson is that you cannot fully separate a model's usefulness from its exposure. Every answer that makes the model valuable to honest users is also a small donation to anyone trying to clone it. Defense is therefore not about sealing the leak but about pricing it: minimize what each answer reveals, watermark what you ship, and watch for the query patterns that mean someone is collecting teacher's notes.

Part of the Explained series — concepts in tech, clearly.


