CORE DIRECTORY // SYSTEM.USER.DIANA_ISMAIL

Labs by Diana — Experiments that ship.

Side projects that got out of hand. AI tools built for problems I kept tripping over — now live, now yours.

RESEARCH // ACTIVE

Dark Code: What Ships When Nobody's Looking

ARTICLE_005

PUBLISHED

2026.04.20

READ: ~12 MIN

AI coding tools create a new category of technical liability: code that runs, passes review, and ships to production, but that nobody on the team could fully explain, debug, or extend without re-running the agent that wrote it. The problem is not that AI writes bad code. The problem is that AI writes fast code, and speed without comprehension is a debt that compounds silently until something fails at 2am. Every codebase using AI-assisted development has dark code. Most teams haven't named it, measured it, or built systems to prevent it.

This article names it and documents the three-layer framework built across six production repositories to eliminate it as a structural property of the workflow. Layer 1 is spec-before-code: any change touching more than two files requires a brief that externalizes the agent's reasoning before implementation begins, and a post-merge verification loop that checks the code against the brief's acceptance criteria. Layer 2 is self-describing systems: module manifests extended with three behavioral fields — contracts, failure_modes, and performance — that answer the 2am question directly. Layer 3 is the comprehension gate: a tiered PR checklist that agents must answer before merge, with questions designed so that answering them requires demonstrating comprehension, not just confirming completion.

The framework is designed so that dark code elimination happens as a byproduct of normal work. Writing a brief, filling in a failure_modes field, answering "what fails silently?" before merging — each of these is work that happens anyway in a well-run codebase. The three layers make comprehension a required output of that work, not an optional one. The coverage matrix across Labs, GEOAudit, FitChecker, Portfolio, Digital Twin, and EventChatScheduler is documented here along with the five feedback loops that prevent the framework from decaying over time.

The_2am_Test

It's 2am. Something in production is broken. The module at the center of the incident was written three weeks ago by an agent running a refactor brief. The agent is not running. The commit says "refactor: extract caching layer to module boundary." The PR description says "tests pass." You are staring at a retry policy set to three attempts with a five-second backoff. You have no idea why those numbers exist, whether they were deliberate, or what breaks if you change them.

This is the 2am test. Can a competent developer debug this without re-running the agent that wrote it? For a significant fraction of AI-generated code, the answer is no. Not because the code is bad. Because the reasoning exists nowhere except in a context window that closed three weeks ago.

Dark code is the term for this. Not buggy code. Not poorly written code. Code that runs, passes tests, and ships to production, but whose failure modes, dependencies, and constraints were never transferred out of the agent's session and into the repository. Quality and comprehension are orthogonal. A module can be well-structured, well-tested, and thoroughly dark. The AI wrote fast. The reviewer confirmed it looked right. The specific choices that govern how it behaves under pressure went undocumented. Every codebase using AI-assisted development has dark code. Most teams haven't named it.

How_Dark_Code_Accumulates

An agent generating code during a multi-file refactor holds an enormous amount of implicit context. Why the caching layer goes here and not there. What the upstream rate limits are. Why the retry policy uses exponential backoff instead of linear. What happens if Redis is unavailable during a write. That context is the agent's working state. It influences every decision the agent makes. It does not, by default, end up in the code.

Three specific accumulation paths account for most dark code. First: architectural choices made during implementation that nobody approved or recorded. A caching layer added during a refactor to avoid hitting the database on every request — reasonable choice, correct implementation, zero documentation of the TTL rationale, the invalidation strategy, or what happens when the cache is cold. Second: multi-file changes where the agent understood the cross-file dependency but the reviewer only saw the diff. The agent knew that changing the retry policy in engine.ts had downstream implications for the rate limiter in memory.ts. The reviewer saw two separate files change and approved both. The dependency exists in the code but not in any record. Third: context-window reasoning that never made it into a commit message or code comment — the agent knew why, wrote code that reflects the decision, and left no trace of the reasoning anywhere a future developer could find it.

Most review workflows optimize for one question: does this look right? A reviewer scans the diff, checks that the logic appears correct, confirms the tests pass, and approves. "Does this look right?" and "do I understand this?" are different questions. The first is answerable by a quick scan. The second requires knowing why every non-obvious choice was made. Optimizing for the first without requiring the second is how dark code accumulates silently across a codebase that is — by every other measure — well-maintained.

Naming_It_Was_the_First_Step

Before building any response to dark code, the question had to be made concrete. Across six production repositories — Labs, GEOAudit, FitChecker, Portfolio, Digital Twin, and EventChatScheduler — I evaluated each module against a single question: could a competent developer debug a production incident in this code without re-running the agent that wrote it? Not "is the code good?" Not "is it tested?" Those were already being answered by other parts of the system. This was specifically about whether the comprehension that existed during the agent's session had survived the session ending.

The results were uneven. Labs had seventeen manifest files but none documented behavioral contracts, failure modes, or performance constraints. GEOAudit had four implementation briefs but no post-merge verification that the code actually satisfied them. FitChecker had the most external API integrations and the least documentation of what happened when any of those integrations failed. EventChatScheduler's MODULES.md declared six modules; one manifest existed on disk. The index was aspirational, not descriptive, which is worse than no index at all.

The audit surfaced a structural gap: the framework for preventing dark code didn't exist. Spec-before-code was an honor-system rule with no enforcement mechanism. PR reviews had no structured questions an agent was required to answer. Module manifests documented structure rather than behavior. Each of these worked as far as it went. None of them specifically targeted comprehension — the transfer of context from an agent's session into the repository. That was the gap the framework was designed to close.

Layer_1_—_Spec-Before-Code

The first layer addresses dark code at its earliest point of origin: before implementation begins. Any change touching more than two files requires a brief before the agent starts. Agents check for an existing brief before beginning multi-file work. If none exists, they create one and surface it for Owner approval before writing a line of code. The brief is deliberately lightweight — task description, affected modules, delegate, acceptance criteria. For large projects, also include dependency impact and a rollback strategy. Fifteen minutes to write.
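One way to picture the brief is as a typed record. This is a minimal sketch, not the repositories' actual format: the field names follow the list above (task description, affected modules, delegate, acceptance criteria, plus the large-project extras), and the data shapes are assumptions.

```typescript
// Hypothetical shape of a spec brief; fields mirror the article's list.
interface Brief {
  task: string;                 // what the change is supposed to do
  affectedModules: string[];    // modules the change will touch
  delegate: string;             // who (or which agent) implements it
  acceptanceCriteria: string[]; // checked post-merge against the code
  dependencyImpact?: string;    // large projects only
  rollbackStrategy?: string;    // large projects only
}

// The trigger rule as stated: any change touching more than two files
// requires a brief before implementation begins.
function requiresBrief(changedFiles: string[]): boolean {
  return changedFiles.length > 2;
}
```

The interface is deliberately small — the point of the brief is that it takes fifteen minutes, not that it captures everything.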

The rationale is not that briefs prevent bad code. It is that briefs create a record of what was intended, at the moment before implementation begins, when the intent is clearest. Dark code accumulates because reasoning lives in context windows that close. A brief externalizes that reasoning into a file that survives the session. When a module fails at 2am and the question is "what was this supposed to do?", the brief is the answer. Without it, the only record is the code itself — which tells what the agent did, but not why, and not what constraints were in play when the decisions were made.

The rule has deliberate exceptions: single-cause bug fixes, dependency updates, and typo or copy corrections. These exclusions matter as much as the rule itself. A spec-before-code requirement that applies to every change becomes noise, and noise degrades the rules around it. The threshold is three files because changes of that scope almost always involve architectural decisions — the kind where the agent's reasoning is most valuable to preserve and most likely to be lost. Below that threshold, the implementation is usually self-explanatory. Above it, the brief is the difference between comprehensible code and dark code.

The post-merge loop closes the layer: after implementation, the brief's acceptance criteria become a verification checklist. Quinn runs targeted checks against the brief for large projects, comparing the merged code against what the spec said it should do. This turns briefs from write-once documents into living evals that catch the gap between intent and implementation. Without the post-merge step, specs inform implementation but don't hold it accountable. With it, the spec is the contract and the code is the claim.
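One minimal shape for that loop, sketched here as an assumption — the article does not specify how the targeted checks are actually run — pairs each acceptance criterion from the brief with an executable check:

```typescript
// Hypothetical post-merge verification: each acceptance criterion from
// the brief is paired with a check run against the merged code.
type CriterionCheck = { criterion: string; passed: () => boolean };

// Returns the criteria the merged code fails to satisfy; a non-empty
// result flags the branch before it is closed.
function unmetCriteria(checks: CriterionCheck[]): string[] {
  return checks.filter((c) => !c.passed()).map((c) => c.criterion);
}
```

The useful property is directional: the spec drives the check, not the other way around, so a criterion that was never implemented surfaces as a named gap rather than a silent omission.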

Layer_2_—_Self-Describing_Systems

The second layer operates at the module level. Module manifests already existed — files documenting each module's name, purpose, owner, dependencies, and exports. They were structural records. What they didn't document was behavior: how a module actually operates under pressure, what it expects from its dependencies, and what happens downstream when it fails. Adding three fields to the manifest schema changes what the manifest is.

contracts captures the behavioral expectations a module has of its environment and its callers: rate limits, cache TTLs, retry policies, timeouts. These are the constraints that govern safe interaction with the module — the things an agent needs to know before touching any code that calls it, and the things a developer needs to know at 2am before deciding whether to adjust a retry interval. failure_modes documents what happens when the module fails: the trigger, the downstream impact, and whether degradation is graceful, propagated, or silent. Silent failures are the most dangerous category, and the failure_modes field forces them to be named. performance captures expected latency and resource constraints — the baseline against which anomalies are measured.
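A sketch of what an extended manifest might look like. The three field names (contracts, failure_modes, performance) are from the framework; the module, its values, and the exact schema are hypothetical, chosen to echo the caching example from earlier:

```typescript
// Illustrative manifest extension; schema details are an assumption.
interface ModuleManifest {
  name: string;
  purpose: string;
  depends_on: string[];
  exports: string[];
  // Behavioral expectations: rate limits, TTLs, retries, timeouts.
  contracts: Record<string, string>;
  // What happens when the module fails — trigger, downstream impact,
  // and whether degradation is graceful, propagated, or silent.
  failure_modes: Array<{
    trigger: string;
    downstream_impact: string;
    degradation: "graceful" | "propagated" | "silent";
  }>;
  // Expected latency and resource baselines for anomaly detection.
  performance: Record<string, string>;
}

const cacheManifest: ModuleManifest = {
  name: "cache",
  purpose: "Read-through cache in front of the primary datastore",
  depends_on: ["redis-client"],
  exports: ["get", "set", "invalidate"],
  contracts: {
    ttl: "300s default; callers must tolerate stale reads within that window",
    retry: "3 attempts, exponential backoff",
  },
  failure_modes: [
    {
      trigger: "Redis unavailable during a write",
      downstream_impact: "writes fall through to the datastore; read latency rises",
      degradation: "graceful",
    },
  ],
  performance: { p95_read: "<5ms warm, <40ms cold" },
};
```

Note that the degradation field admits only three values — forcing "silent" to be written down when it applies is the point.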

These three fields are required for large projects on any module that makes external calls or has three or more consumers. They are optional for small projects. The threshold reflects risk: a module with three consumers has three blast radii, and any one of them failing without a documented failure mode is a production incident waiting to be misdiagnosed. The requirement scales with the consequence.

A manifest is a behavioral contract, not a description. A description says what a module does. A contract says what it guarantees, what it requires, and what it breaks when it fails to deliver. The 2am question is not "what does this module do?" — that is answerable by reading the code. The 2am question is "when this module fails in a specific way, what else is affected and how?" That question requires the failure_modes field. The insight: writing a failure mode entry requires knowing the failure mode. Every blank field in a manifest marks a located dark code risk. The manifest is not the output; the knowledge required to produce it is.

Layer_3_—_The_Comprehension_Gate

The third layer operates at the PR boundary. Every agent-generated PR includes a structured set of answers as part of the PR description. Three questions apply to all projects: what does this change do in one sentence, what fails silently, and what is the blast radius. Four additional questions apply to large projects: why this dependency, what is cached and why, how are concerns separated, and what are the failure modes for external calls. Unanswered questions block merge.
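The seven questions and the blocking behavior above could be encoded roughly as follows — the tier names and data shapes here are an assumption, not the repositories' actual tooling:

```typescript
// The three questions required on every project.
const coreQuestions = [
  "What does this change do in one sentence?",
  "What fails silently?",
  "What is the blast radius?",
];

// Large projects add four more.
const largeProjectQuestions = [
  ...coreQuestions,
  "Why this dependency?",
  "What is cached and why?",
  "How are concerns separated?",
  "What are the failure modes for external calls?",
];

// Unanswered (missing or blank) questions block merge.
function blockedQuestions(
  answers: Record<string, string>,
  tier: "small" | "large"
): string[] {
  const required = tier === "large" ? largeProjectQuestions : coreQuestions;
  return required.filter((q) => !answers[q]?.trim());
}
```

The encoding makes the gate's asymmetry visible: a small project answers three questions, a large one seven, and in both cases an empty answer is treated the same as no answer.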

The design of the questions is specific. "What does this change do in one sentence?" cannot be answered by summarizing the diff — it requires distilling the purpose. "What fails silently?" cannot be answered by confirming that try-catch blocks exist — it requires knowing which paths swallow errors without surfacing them. "What is the blast radius?" requires understanding what the changed code is coupled to. Each question is structured so answering it requires demonstrating comprehension of the code being merged, not just confirming that the implementation is complete.

What the gate is not: a code review checklist. Not a quality gate. Linters and tests already handle those. The comprehension gate is specifically a mechanism for verifying that the agent merging the code understands it — that the reasoning behind non-obvious choices has been articulated, that failure modes have been thought through, and that the reviewer has the information they would need if this code caused a production incident. An agent that cannot answer "what fails silently?" has not finished the work. The answer is part of the deliverable.

The gate also functions as a manifest maintenance trigger. When a PR gate answer reveals behavior not captured in the module's existing manifest — an undocumented cache layer, an unrecorded failure mode, a dependency not listed in depends_on — the agent updates the manifest in the same PR. This turns the comprehension gate into a flywheel: every PR review is an opportunity to make the system more self-describing. Over time, the manifests reflect not just the architecture as originally documented but the architecture as it actually behaves — including the dark corners that only show up when someone is forced to articulate them.

The_Feedback_Loops

A governance framework that isn't self-reinforcing will be current for a few weeks and stale for the rest of the codebase's life. Every one of the three layers has a corresponding feedback mechanism designed to prevent exactly that.

The spec-as-eval loop closes Layer 1. After a spec-driven PR merges, the brief's acceptance criteria are checked against the merged code. For large projects, Quinn reads the brief and runs targeted verification — not "does the code look right" but "does the code satisfy the specific claims the spec made?" Criteria that aren't met get flagged before the branch is closed. Briefs cannot become write-once artifacts that bear no relationship to what was actually built. The brief is the intent. The post-merge check is the accountability.

The gate-to-manifest flywheel closes the loop between Layers 3 and 2. When a comprehension gate answer surfaces behavior not in a manifest, updating the manifest is part of completing the PR. Over time, manifests are maintained not through periodic audits but through the normal PR workflow. Every time an agent has to answer "what fails silently?" about a module, and the answer reveals something the failure_modes field doesn't already say, the manifest gets updated before merge. The gate generates manifest maintenance as a side effect.

Three mechanisms reduce reliance on agents self-enforcing rules. A PreToolUse hook on gh pr create checks the file count in the diff and warns when three or more files are changed without a brief reference in the PR. Not a hard block — a prompt that fires at the moment when the gap would be created, not after. Three reusable prompt templates encode the spec, manifest, and gate expectations for the most common task types: new API endpoint, new external integration, and module refactor. Instead of reconstructing what a brief needs to contain each time, agents start from a template that already carries the right structure. A monthly manifest staleness check diffs manifest exports and depends_on fields against actual imports in the codebase and flags drift. Manifests decay. Without periodic verification, they become documentation debt that is worse than no documentation because it provides false confidence.
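Two of those mechanisms are simple enough to sketch. Both sketches are assumptions: the hook wiring itself (a PreToolUse trigger on gh pr create) and the "briefs/" path convention for referencing a brief are hypothetical, and the staleness check is reduced to a set difference:

```typescript
// Warn-only brief check: fires when three or more files changed and the
// PR body carries no brief reference. Never a hard block.
function briefWarning(changedFiles: string[], prBody: string): string | null {
  if (changedFiles.length >= 3 && !prBody.includes("briefs/")) {
    return `warning: ${changedFiles.length} files changed with no brief reference`;
  }
  return null;
}

// Staleness check: flag manifest depends_on entries that have drifted
// from the module's actual imports, in either direction.
function manifestDrift(declared: string[], actualImports: string[]): string[] {
  const actual = new Set(actualImports);
  const declaredSet = new Set(declared);
  return [
    ...declared
      .filter((d) => !actual.has(d))
      .map((d) => `declared but unused: ${d}`),
    ...actualImports
      .filter((i) => !declaredSet.has(i))
      .map((i) => `imported but undeclared: ${i}`),
  ];
}
```

Neither function is interesting on its own; what matters is where each fires — at PR creation and on a monthly schedule — rather than relying on an agent remembering the rule mid-task.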

The framework works as a system because dark code elimination is structural, not disciplinary. No amount of asking agents to be more thorough will produce comprehension that the workflow doesn't require. The three layers and their feedback loops make comprehension a byproduct of the normal development process — not an overhead added after shipping, not a post-mortem practice, but something that happens as a direct consequence of writing a brief, filling in a manifest field, and answering seven questions before merging. The 2am question has a better answer because the work that was done to answer it is the same work that shipped the code.

Claude Code · Agentic AI · Dark Code · Code Quality · Governance · PR Review · Module Manifests · Spec-Before-Code · Comprehension

KEY_TAKEAWAYS

TAKEAWAY_01

Dark code is not a quality problem — it is a comprehension problem. A codebase full of AI-generated code can be well-structured, well-tested, and still unintelligible at 2am because the reasoning behind it lives in a closed context window, not in the repository. The three-layer framework does not improve code quality; it transfers the comprehension that existed in the agent's session into the codebase itself, where it can survive a session ending, a team change, or a production incident.

TAKEAWAY_02

Asking an agent to answer "what fails silently?" before merging is not a review step — it is a diagnostic. If the agent cannot answer it, the work is not finished. The comprehension gate functions as a forcing function for thinking through failure modes at the moment of maximum context, not retrospectively when something has already broken. The most valuable PR gate answers are the ones that reveal something the reviewer did not know was there.

TAKEAWAY_03

Manifest writing is a dark code audit in disguise. The act of documenting a module's failure modes, contracts, and performance constraints forces the documenter to know those things. Every time a failure_modes field gets filled with "unknown" or left blank, it marks a specific module as a dark code risk. The manifest is not the output; the knowledge required to write it is.

SYSTEM.INT // 2026 LABS_CORE v2.18.4

LATENCY: 24ms // STATUS: NOMINAL