What AI Coding Tools Actually Read
Last updated: 2026-07-02 · 4 min read
AI coding tools read more than the file you are editing – repository context, tickets, pasted logs, sometimes an index of the whole codebase – and the measured effect on secrets is real: AI-assisted commits leak credentials at roughly twice the baseline rate. The fix is layered and mostly upstream: keep secrets out of code and context, enforce exclusions and scanners as mechanisms rather than reminders, use business-tier controls, and process locally whatever must not travel.
The data flow, described precisely
Context is the product. A modern assistant is useful precisely because it reads beyond the cursor: neighboring files, imports, configuration, test fixtures, the ticket you pasted, the stack trace from production. Reputable vendors document these flows and offer real controls – exclusions, no-training commitments, retention settings – and the BSI/ANSSI recommendations treat exactly this flow as a first-class risk to manage, not a reason to ban tools. The precision that matters: what leaves depends on tool and configuration, so the honest unit of analysis is your setup, not the category.
The measured numbers
The secrets problem predates AI – hardcoded credentials are an old sin. What the data shows is amplification: GitGuardian counted 28.65 million new hardcoded secrets on public GitHub in 2025, found AI-service leaks up 81% year over year, and measured AI-assisted commits leaking at roughly double the human-only rate. Repositories with Copilot active showed a 6.4% leak incidence versus 4.6% across all public repos, about 40% higher in relative terms. And extraction research pulled over 2,700 hardcoded credentials back out of Copilot with crafted prompts – vendors have hardened filters since, but the direction is clear: what enters context can travel further than intended.
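The gap between the two incidence figures is easier to read as a relative increase. A minimal worked calculation, using only the 6.4% and 4.6% figures quoted above (the function name is illustrative):

```python
def relative_increase(observed_pct: float, baseline_pct: float) -> float:
    """Return how much higher observed is than baseline, in percent."""
    return (observed_pct - baseline_pct) / baseline_pct * 100


# Copilot-active repos (6.4%) versus all public repos (4.6%)
print(f"{relative_increase(6.4, 4.6):.1f}% higher")  # prints "39.1% higher"
```

The absolute gap is 1.8 percentage points; the relative gap is what rounds to "about 40%".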
What leaves, what hides in it, what stops it
| Path | What hides in it | Countermeasure |
|---|---|---|
| Prompt / pasted material | Logs, stack traces, config snippets with credentials | Secret scanner on paste; staff rule: sanitize before pasting |
| Selected file context | .env files, key files, fixtures with real data | Exclusion patterns; secrets out of the repo entirely |
| Codebase index / embeddings | Everything - the whole repo, searchable | Scope the index; local processing for sensitive repos |
| Generated output | Hardcoded credentials the model reproduces or invents | Pre-commit secret scanning - catches human and AI alike |
| Telemetry / learnings | Usage patterns, retained review context | Business-tier controls; opt-outs documented in writing |
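The "exclusion patterns" countermeasure in the table can be pictured with a small sketch. This is not any vendor's ignore-file format: the pattern list and the function names are hypothetical, and real tools each define their own exclusion syntax, so treat it only as an illustration of the idea that sensitive paths are filtered out before anything becomes context.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hypothetical patterns for illustration; adapt to what your repo actually holds.
EXCLUDE_PATTERNS = [".env", ".env.*", "*.pem", "*.key", "fixtures/real/*"]


def is_excluded(path: str, patterns=EXCLUDE_PATTERNS) -> bool:
    """Return True if the file name or the full path matches any pattern."""
    posix = PurePosixPath(path)
    return any(
        fnmatch(posix.name, pattern) or fnmatch(str(posix), pattern)
        for pattern in patterns
    )


def filter_context(paths):
    """Keep only the paths that are allowed to enter tool context."""
    return [p for p in paths if not is_excluded(p)]


candidates = [
    "src/app.py",
    ".env",
    "config/.env.production",
    "certs/server.pem",
    "fixtures/real/users.json",
]
print(filter_context(candidates))  # prints ['src/app.py']
```

A filter like this is a mechanism, not a guarantee: it only helps if the tool in use actually honors the exclusion, which is worth verifying per tool and configuration.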
The pattern across rows: the highest-leverage fix is upstream. A secret that lives in a secret manager and reaches the app as an injected environment variable cannot be pasted, indexed, or reproduced – the leak rate cannot amplify what is not there. Everything else is defense in depth around that.
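A minimal sketch of the upstream pattern, assuming the secret manager or deployment platform injects the value as an environment variable at start-up. The helper name and the variable name `PAYMENTS_API_KEY` are hypothetical; the point is only that the source file contains a name, never a value.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected into the environment."""


def require_secret(name: str, environ=os.environ) -> str:
    """Read a secret from the environment and fail loudly if it is absent."""
    value = environ.get(name, "")
    if not value:
        # Only the variable name goes into the message, never a value.
        raise MissingSecretError(f"required secret {name!r} is not set")
    return value


# Usage (hypothetical variable name):
# api_key = require_secret("PAYMENTS_API_KEY")
```

Failing at start-up when the variable is missing removes the temptation to add a hardcoded fallback "just for local testing", which is exactly the kind of value that later ends up pasted, indexed, or reproduced.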
The trade-secret angle, described soberly
Legal status: July 2, 2026. Descriptive only – not legal advice; the assessment belongs to your counsel. Trade-secret regimes protect information only while its holder takes reasonable secrecy measures – in Germany under the GeschGehG, EU-wide under the Trade Secrets Directive. Source code is often exactly such a secret. Whether routine transmission to third-party services is compatible with “reasonable measures” depends on contracts, configuration and controls in your specific case – which is precisely why the data-flow mapping above belongs in writing. For the broader data-boundary architecture, see local AI code review and, where personal data is in play, the GDPR checklist.
Where Reality Graph fits
Reality Graph’s contribution here is architectural: it is designed local-first, so the verification layer itself adds no new transmission path. The checks against the written task, including boundary checks that catch out-of-scope file access, run in your environment, and the evidence report documents what was touched in each run. It is not a secret scanner and does not replace one – it makes the workflow around your tools inspectable without adding another cloud to trust.
This page gives you
- The five transmission paths, mapped to countermeasures
- Measured numbers with sources, not vibes
- The upstream-first fix order that actually reduces risk
- The trade-secret angle, described without alarmism
It does not give you
- A verdict on any vendor's data practices - check yours
- Legal advice on secrecy measures - counsel owns that
- A reason to ban tools - mechanisms beat prohibition
- A pass on secret hygiene just because processing is local
FAQ
- What data do AI coding tools send to their providers?
  More than the prompt. Depending on tool and configuration: the file being edited, repository context the tool selects (neighboring files, imports, sometimes an index of the whole codebase), pasted material like logs and stack traces, and telemetry. Reputable vendors document this and offer controls - exclusions, no-training commitments, zero-retention tiers - but the direction of travel is structural: context-aware tools are valuable because they read a lot. Knowing your tool's actual data flow, per configuration, is step one of everything else.
- How bad is the secrets problem, measured?
  GitGuardian's 2026 State of Secrets Sprawl counted 28.65 million new hardcoded secrets on public GitHub in 2025, with AI-service leaks up 81% year over year. AI-assisted commits leaked secrets at roughly twice the rate of human-only commits, and repositories with Copilot active showed a 6.4% leak incidence versus 4.6% across all public repos - about 40% higher. The numbers say the tools amplify an existing bad habit: secrets that live in code and context get moved around more when generation is fast.
- Can secrets come back out of the models?
  Research demonstrated it: by constructing prompts from GitHub code snippets, researchers extracted over 2,700 hardcoded credentials from Copilot, a share of which were real, identifiable secrets. Vendors have since hardened filters, and no-training commitments on business tiers reduce the risk at the source. The durable lesson is not about one vendor - it is that anything hardcoded and transmitted has left your control, and the fix is upstream: no secrets in code or context to begin with.
- Do trade secrets have a legal angle here?
  Descriptively, yes: trade-secret protection regimes (in Germany the GeschGehG, in the EU the Trade Secrets Directive) protect information only while its holder takes reasonable secrecy measures. Whether routinely transmitting source to third-party services is compatible with 'reasonable measures' in your case is a question for your counsel - the sober takeaway is that the answer depends on contracts, configurations and controls, which is exactly why they deserve documentation. Legal status July 2026, not legal advice.
- What are the countermeasures that actually work?
  In order of leverage: get secrets out of code and context entirely (secret managers, environment injection - the leak rate cannot amplify what is not there); configure exclusions so sensitive paths never enter tool context; run secret scanners as pre-commit and pre-push gates, which also catch what AI generates; use business tiers with no-training and retention controls; and route the most sensitive codebases to local processing. Instruction to staff comes last, not first - mechanisms beat reminders.
- Does local processing solve the secrets problem?
  It removes the transmission half: what is processed in your environment is not sitting in a vendor's context pipeline. It does not remove the hardcoding half - a local model reads the same .env file a cloud one would, and secrets in code remain findable by anyone with repo access. The honest framing: local processing shrinks the exposure surface; secret hygiene shrinks the secret. You want both.
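The scanner gate named among the countermeasures above can be sketched in a few lines. This is a deliberately small illustration, not a replacement for a dedicated secret scanner: the three rules and their names are assumptions chosen for the example, and production scanners ship far larger rule sets plus entropy and validity checks. The access key shown is the placeholder value from AWS documentation, not a real credential.

```python
import re

# Illustrative rules only; a real scanner covers many more credential formats.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    "quoted_credential": re.compile(
        r"(?i)(?:password|passwd|secret|api[_-]?key|token)"
        r"\s*[:=]\s*['\"][^'\"\s]{8,}['\"]"
    ),
}


def scan_text(text: str):
    """Return a (line_number, rule_name) pair for every rule hit."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings


sample = (
    "import os\n"
    'DB_PASSWORD = "hunter2hunter2"\n'
    'key = os.environ["API_KEY"]\n'
    'aws = "AKIAIOSFODNN7EXAMPLE"\n'
)
print(scan_text(sample))
# prints [(2, 'quoted_credential'), (4, 'aws_access_key_id')]
```

Line 3 of the sample passes because it reads the value from the environment instead of embedding it. Wired into a pre-commit hook, a function like this would run over the staged changes (for example the output of `git diff --cached`) and block the commit on any finding, regardless of whether a human or an assistant wrote the line.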
Sources
- GitGuardian – State of Secrets Sprawl 2026: 28.65M new secrets in 2025, AI-service leaks +81%, AI-assisted commits ~2x leak rate (2026)
- GitGuardian – Copilot-active repos: 6.4% secret-leak incidence vs 4.6% baseline; extraction research (2,700+ credentials) (2023-2026)
- BSI/ANSSI – recommendations on AI coding assistants: secrets leakage as a first-class risk (2024, German)