What data do AI coding tools send to their providers?

More than the prompt. Depending on tool and configuration: the file being edited, repository context the tool selects (neighboring files, imports, sometimes an index of the whole codebase), pasted material like logs and stack traces, and telemetry. Reputable vendors document this and offer controls - exclusions, no-training commitments, zero-retention tiers - but the underlying trade-off is structural: context-aware tools are valuable because they read a lot. Knowing your tool's actual data flow, per configuration, is step one of everything else.

How bad is the secrets problem, measured?

GitGuardian's 2026 State of Secrets Sprawl counted 28.65 million new hardcoded secrets on public GitHub in 2025, with AI-service leaks up 81% year over year. AI-assisted commits leaked secrets at roughly twice the rate of human-only commits, and repositories with Copilot active showed a 6.4% leak incidence versus 4.6% across all public repos - about 40% higher. The numbers say the tools amplify an existing bad habit: secrets that live in code and context get moved around more when generation is fast.
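The "about 40% higher" figure is a relative increase, not a gap in percentage points. A short check of the arithmetic, using only the two incidence values quoted above (the variable names are illustrative):

```python
# Leak incidence quoted above: repos with Copilot active vs. all public repos.
copilot_incidence = 6.4   # percent
baseline_incidence = 4.6  # percent

# Gap in percentage points, and the same gap relative to the baseline.
absolute_gap = copilot_incidence - baseline_incidence
relative_increase = absolute_gap / baseline_incidence * 100

print(f"absolute gap: {absolute_gap:.1f} percentage points")
print(f"relative increase: {relative_increase:.0f}%")
```

The gap is 1.8 percentage points, which is roughly 39% of the 4.6% baseline, hence "about 40%".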

Can secrets come back out of the models?

Research demonstrated it: by constructing prompts from GitHub code snippets, researchers extracted over 2,700 hardcoded credentials from Copilot, a share of which were real, identifiable secrets. Vendors have since hardened filters, and no-training commitments on business tiers reduce the risk at the source. The durable lesson is not about one vendor - it is that anything hardcoded and transmitted has left your control, and the fix is upstream: no secrets in code or context to begin with.

What are the countermeasures that actually work?

In order of leverage: get secrets out of code and context entirely (secret managers, environment injection - the leak rate cannot amplify what is not there); configure exclusions so sensitive paths never enter tool context; run secret scanners as pre-commit and pre-push gates, which also catch what AI generates; use business tiers with no-training and retention controls; and route the most sensitive codebases to local processing. Instruction to staff comes last, not first - mechanisms beat reminders.
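To make the "scanner as a gate" idea concrete, here is a minimal sketch of what such a check does. It is an assumption-laden illustration, not a replacement for a dedicated scanner: the three patterns, the `find_secrets` and `gate` names, and the exit-code convention are all hypothetical, and real tools ship far more detectors with lower false-positive rates.

```python
import re

# Illustrative patterns only; a real scanner uses many tuned detectors.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "generic_assignment": re.compile(
        r"(?i)\b(?:password|secret|api_key|token)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def find_secrets(text):
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


def gate(staged_text):
    """Exit code for a hook: 0 lets the change through, 1 blocks it."""
    return 1 if find_secrets(staged_text) else 0
```

The point of the gate is that it does not care who wrote the line. A hardcoded credential typed by a person and one produced by a model are blocked by the same check, which is why this mechanism sits above staff instruction in the order of leverage.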

Does local processing solve the secrets problem?

It removes the transmission half: what is processed in your environment is not sitting in a vendor's context pipeline. It does not remove the hardcoding half - a local model reads the same .env file a cloud one would, and secrets in code remain findable by anyone with repo access. The honest framing: local processing shrinks the exposure surface; secret hygiene shrinks the secret. You want both.
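Because a local model sees the same files a cloud one would, it helps to check which paths would still reach tool context once exclusions apply. The sketch below assumes a glob-style exclusion list; the patterns and function names are hypothetical, and each tool has its own exclusion syntax that you would need to consult.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical exclusion list; the real syntax depends on the tool you configure.
EXCLUDE_PATTERNS = [".env", ".env.*", "*.pem", "*.key", "secrets/*"]


def is_excluded(path, patterns=EXCLUDE_PATTERNS):
    """True if the file name or the full relative path matches any pattern."""
    posix = PurePosixPath(path)
    return any(
        fnmatchcase(posix.name, pattern) or fnmatchcase(str(posix), pattern)
        for pattern in patterns
    )


def visible_to_tool(paths, patterns=EXCLUDE_PATTERNS):
    """Paths that would still be readable by a tool after exclusions apply."""
    return [p for p in paths if not is_excluded(p, patterns)]
```

Run against a file listing of the repository, this gives a quick review list: anything sensitive that still shows up in the result is either missing an exclusion or should not be in the repo at all.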

What AI Coding Tools Actually Read

Last updated: 2026-07-02 · 4 min read

AI coding tools read more than the file you are editing – repository context, tickets, pasted logs, sometimes an index of the whole codebase – and the measured effect on secrets is real: AI-assisted commits leak credentials at roughly twice the baseline rate. The fix is layered and mostly upstream: no secrets in code or context, exclusions and scanners as mechanisms, business-tier controls, and local processing for what must not travel.

The data flow, described precisely

Context is the product. A modern assistant is useful precisely because it reads beyond the cursor: neighboring files, imports, configuration, test fixtures, the ticket you pasted, the stack trace from production. Reputable vendors document these flows and offer real controls – exclusions, no-training commitments, retention settings – and the BSI/ANSSI recommendations treat exactly this flow as a first-class risk to manage, not a reason to ban tools. The precision that matters: what leaves depends on tool and configuration, so the honest unit of analysis is your setup, not the category.

The measured numbers

The secrets problem predates AI – hardcoded credentials are an old sin. What the data shows is amplification: GitGuardian's 2026 State of Secrets Sprawl counted 28.65 million new hardcoded secrets on public GitHub in 2025, with AI-service leaks up 81% year over year and AI-assisted commits leaking at roughly double the human-only rate. Repositories with Copilot active showed 6.4% leak incidence versus 4.6% across all public repos, about 40% higher. And extraction research pulled over 2,700 hardcoded credentials back out of Copilot with crafted prompts – vendors have hardened filters since, but the direction is clear: what enters context can travel further than intended.

What leaves, what hides in it, what stops it

Path	What hides in it	Countermeasure
Prompt / pasted material	Logs, stack traces, config snippets with credentials	Secret scanner on paste; staff rule: sanitize before pasting
Selected file context	.env files, key files, fixtures with real data	Exclusion patterns; secrets out of the repo entirely
Codebase index / embeddings	Everything - the whole repo, searchable	Scope the index; local processing for sensitive repos
Generated output	Hardcoded credentials the model reproduces or invents	Pre-commit secret scanning - catches human and AI alike
Telemetry / learnings	Usage patterns, retained review context	Business-tier controls; opt-outs documented in writing

The five transmission paths of AI coding tools, what sensitive material hides in each, and the countermeasure with leverage - mechanisms beat reminders in every row (status: July 2026).
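For the first row, "sanitize before pasting" works better as a mechanism than as a rule. A minimal sketch of a redaction step for logs and stack traces follows; the two patterns reflect assumed shapes of credentials (key/value pairs and bearer tokens) and the `sanitize` name is hypothetical, so treat it as a starting point to extend for your own log formats.

```python
import re

# Assumed shapes of sensitive values in logs; extend for your own formats.
BEARER = re.compile(r"(?i)\b(bearer\s+)[A-Za-z0-9._~+/-]+=*")
KEY_VALUE = re.compile(
    r"(?i)\b(password|passwd|secret|token|api[_-]?key)\b(\s*[=:]\s*)(\S+)"
)


def sanitize(text):
    """Replace likely credential values with a placeholder, keeping the keys."""
    # Bearer tokens first, so a "token: Bearer <value>" line loses the value too.
    text = BEARER.sub(r"\1[REDACTED]", text)
    text = KEY_VALUE.sub(r"\1\2[REDACTED]", text)
    return text
```

Keeping the key names and dropping only the values preserves what an assistant needs to reason about the error while removing what should not travel. Pattern-based redaction misses anything it has no pattern for, which is why it complements a secret scanner rather than replacing one.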

The pattern across rows: the highest-leverage fix is upstream. A secret that lives in a secret manager and reaches the app as an injected environment variable cannot be pasted, indexed, or reproduced – the leak rate cannot amplify what is not there. Everything else is defense in depth around that.
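On the application side, the upstream fix can be sketched as a small helper that reads an injected value and fails loudly when it is absent. This is an illustration under assumptions: `require_secret`, `MissingSecretError`, and the `DB_PASSWORD` variable name are hypothetical, and how the value gets injected depends on your secret manager and deployment setup.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when an expected secret was not injected into the environment."""


def require_secret(name, environ=None):
    """Read a secret injected at runtime; never fall back to a hardcoded value."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        raise MissingSecretError(
            f"{name} is not set; inject it from your secret manager"
        )
    return value


# Hypothetical usage; the variable name is for illustration only.
# db_password = require_secret("DB_PASSWORD")
```

The design choice that matters is the missing default. A fallback value in code is exactly the kind of string that ends up in context, in an index, or in generated output; failing at startup keeps the source free of anything worth leaking.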

The trade-secret angle, described soberly

Legal status: July 2, 2026. Descriptive only – not legal advice; the assessment belongs to your counsel. Trade-secret regimes protect information only while its holder takes reasonable secrecy measures – in Germany under the GeschGehG, EU-wide under the Trade Secrets Directive. Source code is often exactly such a secret. Whether routine transmission to third-party services is compatible with “reasonable measures” depends on contracts, configuration and controls in your specific case – which is precisely why the data-flow mapping above belongs in writing. For the broader data-boundary architecture, see local AI code review and, where personal data is in play, the GDPR checklist.

Where Reality Graph fits

Reality Graph’s contribution here is architectural: it is designed local-first, so the verification layer itself adds no new transmission path – the checks against the written task, including boundary checks that catch out-of-scope file access, run in your environment, and the evidence report documents per run what was touched. It is not a secret scanner and does not replace one – it makes the workflow around your tools inspectable without adding another cloud to trust.

This page gives you

The five transmission paths, mapped to countermeasures
Measured numbers with sources, not vibes
The upstream-first fix order that actually reduces risk
The trade-secret angle, described without alarmism

It does not give you

A verdict on any vendor's data practices - check yours
Legal advice on secrecy measures - counsel owns that
A reason to ban tools - mechanisms beat prohibition
A pass on secret hygiene just because processing is local

FAQ

Do trade secrets have a legal angle here?

Descriptively, yes: trade-secret protection regimes (in Germany the GeschGehG, in the EU the Trade Secrets Directive) protect information only while its holder takes reasonable secrecy measures. Whether routinely transmitting source to third-party services is compatible with 'reasonable measures' in your case is a question for your counsel - the sober takeaway is that the answer depends on contracts, configurations and controls, which is exactly why they deserve documentation. Legal status July 2026, not legal advice.
