Talk · April 2026
Building a context system that works across sessions, tools, and team members.
Guidelines · Knowledge · Tasks
Overview
The framework
The problem — why sessions forget everything
Step 1 — single guidelines file, generated in minutes
Step 2 — folder breakdown, skills, task-based planning
Step 3 — vendor-agnostic layer with AGENTS.md
Step 4 — harness engineering: hooks, verification, steering
Also covered
Knowledge management — cross-repo wiki maintained by the LLM
Research backing — four independent sources, same conclusions
Before / after — what actually changes day to day
Demo — live walkthrough on library-service
The problem
The output quality of a coding assistant is determined entirely by the context it receives at the start of each session.
Why context engineering matters
A transformer computes a probability distribution over the next token, then samples from it. The same prompt run twice can produce different outputs.
For engineering tasks — follow this coding standard, never use this pattern — probabilistic sampling produces inconsistent results without steering.
Context files shift the distribution toward better outputs. Harnesses verify the output didn't drift.
Next token after: new ___Map<>() in an @ApplicationScoped bean
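A toy sketch of that sampling step — the tokens and probabilities are invented, nothing here is model-specific:

```java
import java.util.Map;
import java.util.Random;

// Toy illustration: sampling the "next token" from a probability
// distribution. Different RNG states can pick different tokens from the
// same distribution, which is why unsteered output is inconsistent.
public class SamplingSketch {
    static String sample(Map<String, Double> dist, Random rng) {
        double roll = rng.nextDouble();
        double cumulative = 0.0;
        for (var e : dist.entrySet()) {
            cumulative += e.getValue();
            if (roll < cumulative) return e.getKey();
        }
        return null; // unreachable if the probabilities sum to 1
    }

    public static void main(String[] args) {
        // Made-up distribution for the next token after "new ___Map<>()".
        var dist = Map.of("HashMap", 0.6, "ConcurrentHashMap", 0.3, "TreeMap", 0.1);
        System.out.println(sample(dist, new Random(1)));
        System.out.println(sample(dist, new Random(2)));
    }
}
```

A context file that says "NEVER plain HashMap in @ApplicationScoped beans" is, in this picture, mass moved onto the right token before sampling ever happens.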
Where we are
— Mario Zechner, pi.dev
There is no hard standard yet. That's a good thing.
You are free to form whatever workflow fits your team. And when a standard does emerge, migrating is low-cost — AI can do the restructuring.
The answer
Single guidelines file. Start here. Two minutes to generate.
Folder breakdown. When the single file grows past ~80 lines.
Vendor-agnostic layer. When the team uses more than one AI tool.
Harness engineering. When advice isn't enough — verification catches what prompts miss.
Each step builds on the last. Skipping ahead is fine; stopping at Step 1 is fine.
Step 1
| Tool | File |
|---|---|
| Claude Code | CLAUDE.md |
| Junie (JetBrains) | .junie/guidelines.md |
| Cursor | .cursor/rules/ |
| GitHub Copilot | .github/copilot-instructions.md |
| Windsurf | .windsurfrules |
| Any tool | AGENTS.md — AAIF standard |
The file is read at the start of every session. If it exists and contains the right information, the assistant starts already knowing the codebase's constraints.
Step 1 · The prompt
Read this codebase and write a CLAUDE.md that captures:
- How to build and run the project
- The architecture (only what isn't obvious from the code)
- Coding conventions and non-obvious constraints
- Anything a developer would need that can't be inferred
Keep it under 100 lines. Only include non-inferable information.
Takes about two minutes. The file doesn't have to be written from scratch.
Step 1 · Example output
# library-service
## Build & Run
./gradlew quarkusDev # live reload on :8080
## Stack
Java 21 · Quarkus 3 · JAX-RS · CDI · Jackson · Gradle 8
## Architecture
LibraryResource → LibraryService → LibraryStore (ConcurrentHashMap)
Records: Book, Loan — immutable, use with*() helpers for state transitions.
## Coding Standards
- NEVER use plain HashMap in @ApplicationScoped beans — always ConcurrentHashMap
- Tests: Given-When-Then, no eq() matchers in Mockito
## Known Gotchas
- Borrow availability: use ConcurrentHashMap.replace() — NOT get-then-put.
Plain put() after get() has a race condition under concurrent requests.
- JAX-RS paths: @Path("/search") BEFORE @Path("/{id}").
If /{id} comes first, "search" matches as an ID — no compile-time error.
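A minimal sketch of the first gotcha — the names are hypothetical, since the real LibraryStore isn't shown here — contrasting get-then-put with an atomic replace():

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the borrow-availability race (hypothetical store shape:
// book id -> available flag).
public class BorrowSketch {
    static final Map<String, Boolean> available = new ConcurrentHashMap<>();

    // Racy: two threads can both observe TRUE between get() and put(),
    // and both "borrow" the single copy.
    static boolean borrowRacy(String id) {
        Boolean free = available.get(id);
        if (Boolean.TRUE.equals(free)) {
            available.put(id, false);   // lost update under contention
            return true;
        }
        return false;
    }

    // Atomic: replace(key, expected, newValue) succeeds for exactly one
    // caller; every other thread sees the expectation fail.
    static boolean borrowAtomic(String id) {
        return available.replace(id, true, false);
    }

    public static void main(String[] args) {
        available.put("book-1", true);
        boolean first  = borrowAtomic("book-1");
        boolean second = borrowAtomic("book-1");
        System.out.println(first + " " + second); // true false
    }
}
```

This is exactly the kind of constraint a context file earns its keep on: the racy version compiles, passes single-threaded tests, and only fails under concurrent requests.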
Step 1 · What to include
ETH Zurich (Feb 2026): human-curated context files improve task success by 4%; LLM-generated files reduce it by 3%. The difference is that human files contain only what the AI cannot read from source.
| Brief or one line | Write in full detail |
|---|---|
| Stack: Java 21, Quarkus 3 | ConcurrentHashMap.replace() for borrow — what breaks without it, why |
| Tests in JUnit 5 + Mockito | JAX-RS path ordering trap — the exact failure mode, no compile error |
| Records in domain/ | NEVER eq() in Mockito — the constraint and the reason |
That's Step 1. A single file, generated once, immediately useful.
As the codebase evolves, the file grows: testing standards, role definitions, task workflows, new gotchas. At around 80 lines, every session loads all of it — including the parts that aren't relevant.
Step 2 is the answer.
The single file, six weeks later
# library-service
## Build & Run ...
## Stack ...
## Architecture ...
## Coding Standards ... ← 15 lines
## Testing Standards ... ← 20 lines
## Known Gotchas ... ← 12 lines
## Task Workflow ... ← 30 lines
## Roles ... ← 25 lines
## Workflow Prompts ... ← 20 lines
## Tech Debt ... ← 10 lines
LibraryService doesn't need the role definitions — they load anyway.
Step 2
One concern per file. Load only what the session needs. Shared standards live in one place.
Step 2
CLAUDE.md (20 lines — @-imports only)
@.ai/knowledge.md ← tribal knowledge, loaded every session
@.ai/context/stack.md ← build, run, dependencies
.ai/skills/ ← domain knowledge, workflow phases, and roles
java-standards/ → /java-standards
testing-standard/ → /testing-standard
pre-flight/ → /pre-flight TASK-001
build-plan/ → /build-plan TASK-001
execute/ → /execute TASK-001 "Stage 1"
close-task/ → /close-task TASK-001
architect/ → /architect
.ai/hooks/verify-build.sh ← quality gate
.ai/tasks/TASK-001/ ← per-task working memory
Run bash .ai/setup.sh once. All skills become native slash commands for Claude Code, Junie, and Cline from the same source.
Step 2 · Skills
---
name: testing-standard
description: Testing standards — Given-When-Then, Mockito patterns, domain
record deserialization tests. Load when writing or reviewing any test.
---
- NEVER use `eq()` matchers in Mockito — use raw values directly
- ALWAYS use ConcurrentHashMap.replace() for availability updates
- NEVER skip the layer boundary: Resource → Service → Store
Structure: Given / When / Then
New domain records require a JSON deserialization test.
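A sketch of the Given-When-Then shape applied to a domain record and its with*() helper — plain asserts instead of the suite's JUnit 5 + Mockito so it stays self-contained, and Book's fields are invented:

```java
// Given-When-Then sketch for an immutable domain record with a with*()
// state-transition helper. Book's fields are assumptions, not the real ones.
public class RecordTestSketch {
    record Book(String id, String title, boolean available) {
        Book withAvailable(boolean value) {   // immutable transition, no setter
            return new Book(id, title, value);
        }
    }

    public static void main(String[] args) {
        // Given — an available book
        Book book = new Book("book-1", "Dune", true);

        // When — the state transition helper is used
        Book borrowed = book.withAvailable(false);

        // Then — a new instance is returned; the original is untouched
        System.out.println(borrowed.available()); // false
        System.out.println(book.available());     // true
    }
}
```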
The name field becomes the /slash-command. The description controls auto-loading — the right skill for the task, no explicit invocation needed.
Step 2 · Workflow skills
---
name: pre-flight
description: Deep codebase analysis before planning. Always run before build-plan.
---
## Live context
- Recent commits: !`git log --oneline -5` ← runs at invocation, always current
- Modified files: !`git diff --name-only HEAD`
## Instructions
1. Load the task folder (.ai/tasks/[TASK-ID]/)
2. Trace method chains affected by this task
3. Cross-check knowledge.md — flag applicable gotchas
4. Identify risks: concurrency, layer boundaries, records needing tests
5. Update findings.md — do NOT implement anything
Task ID: $ARGUMENTS
! runs shell at invocation. $ARGUMENTS takes the ticket ID. One file serves every ticket.
Step 2 · Additional benefit
The same folder structure that holds skills and context can hold per-task working memory — making multi-session work resumable without re-narration.
Task workflow
Each AI session loads the task folder fresh. Decisions made two days ago, rejected approaches, constraints discovered mid-implementation — all in the files, not the conversation history.
.ai/tasks/TASK-001/
├── context.md ← what and why (set at kickoff, static after)
├── task-plan.md ← phases, design decisions, files to touch
├── checklist.md ← executable steps with stage checkboxes
├── findings.md ← discoveries during work (append-only, raw)
├── tracker.md ← Current Focus updated every session
└── summary.md ← written before the knowledge flush
Task workflow · Five phases
| Phase | What happens |
|---|---|
| /new-task | Scan codebase, create task folder with 6 files. |
| /pre-flight | Trace method chains, cross-check knowledge.md, identify risks, update findings.md. |
| /build-plan | Read findings, write task-plan.md and checklist.md. Pauses for confirmation before any code is written. |
| /execute | Implement one stage, check off items, run quality gate. |
| /close-task | Audit checklist, write summary.md, flush to knowledge.md, draft PR description. |
The task is not closed until the flush is done.
Task workflow · Zero-warmup resumption
## Current Focus
**Last action**: Pre-flight complete — borrow race condition documented in findings.md
**Next action**: Run /build-plan TASK-001 to generate staged plan
**Open decision**: Should partial failure be rolled back? (non-blocking — log as tech debt)
| Active Role | Architect |
| Stage | Pre-planning |
Session-start prompt: "Load tracker.md, checklist.md, findings.md for TASK-001. Resume from Current Focus."
Task workflow · Knowledge lifecycle
Task starts → task folder created (context.md, findings.md)
↓
Pre-flight finds race condition
→ findings.md (written at time of discovery)
↓
Implementation confirms the fix
→ findings.md updated
↓
Task closes → summary.md written → flush to knowledge.md
↓
knowledge.md loaded at every future session in this repo
↓
Wiki ingest → promoted to wiki/gotcha/
available across all repos and all future engineers
Steps 1 and 2 work well for one tool.
When the team uses Claude Code, Junie, and Cursor on the same codebase, each tool reads its own config file. The context diverges. One engineer's tool sees the updated rule; another's does not.
Step 3 consolidates this into one source.
Step 3
Before
CLAUDE.md
— 150 lines of context
.junie/guidelines.md
— same 150 lines,
maintained separately
.cursor/rules/
— same 150 lines,
different format
Three copies. When one drifts, engineers get different suggestions from different tools.
After
AGENTS.md
— canonical source
(≤150 lines)
CLAUDE.md
— adapter: @-imports
.junie/guidelines.md
— adapter: @-imports
.ai/
— the actual content
Update .ai/knowledge.md once. Every tool picks it up.
Step 3 · The standard
Anthropic, OpenAI, Google, Microsoft, and AWS donated AGENTS.md as the cross-tool canonical standard. All major AI coding tools now read it.
AGENTS.md is the canonical entry point. Tool-specific files are thin adapters that reference .ai/ — they don't duplicate it.
[repo]/
├── AGENTS.md ← canonical (≤150 lines, non-inferable only)
├── CLAUDE.md ← adapter: @-imports from .ai/
├── .junie/guidelines.md ← adapter: @-imports from .ai/
└── .ai/ ← the actual source of truth
Step 3 · AGENTS.md
# library-service — Agent Context
## Build
./gradlew clean build test
Quality gate: .ai/hooks/verify-build.sh
## Non-obvious constraints
- NEVER plain HashMap in @ApplicationScoped — ALWAYS ConcurrentHashMap
- Borrow flip: NEVER get-then-put — ALWAYS ConcurrentHashMap.replace()
- JAX-RS: @Path("/search") BEFORE @Path("/{id}") — or "search" matches as an ID
- Records are immutable — use with*() helpers, never add setters
## Task workflow
Folders: .ai/tasks/[TASK-ID]/ — context.md, checklist.md, findings.md,
tracker.md, task-plan.md, summary.md
Workflow: /new-task → /pre-flight → /build-plan → /execute → /close-task
## Full context
.ai/knowledge.md, .ai/context/, .ai/skills/
Recap
Single file. Ask the AI to generate it from the codebase. Works today.
Folder breakdown. knowledge.md, context/, skills/, tasks/. Each file one concern, loaded on demand.
Vendor-agnostic. AGENTS.md canonical + tool adapters. One source, no drift between tools.
Harness engineering. Automated hooks, verification loops. Advice becomes enforcement.
Step 1 is better than nothing. Step 2 is better when the file grows. Step 3 is better when the team uses multiple tools.
Step 4
When advice isn't enough. Verification catches what prompts miss. Transition from structured context to automated enforcement.
Harness Engineering
Layer 1
Model
Base intelligence
Layer 2
Guide
Before execution
AGENTS.md · skills/ · knowledge.md
Layer 3
Harness
During & after execution
The Guide shifts the probability distribution before execution. The Harness verifies the output didn't drift after. Both are fully under your control.
Harness Engineering
Advice (Guide layer)
# knowledge.md
- NEVER use single-implementation
interfaces
- NEVER skip the layer boundary:
Resource → Service → Store
The agent reads this. It may still drift under a long session or a complex task.
Verification (Harness layer)
# .ai/hooks/check-style.sh
if grep -r "interface.*Service" src/ | grep -v "//.*ok"; then
  echo "Single-impl interface found"
  exit 1
fi
The build fails. The agent cannot reassure its way past a broken hook.
Harness Engineering
When the agent makes a mistake, the primary response is to update the harness — not just fix the chat.
From "Prompt Engineering" to Systems Engineering: engineer out the possibility of error rather than describe your way around it.
Beyond the codebase
Per-repo context handles one codebase. For knowledge that spans all repos and domains, a different tool is needed.
Knowledge management
A debugging session in library-service produces a finding about ConcurrentHashMap race conditions. That finding is relevant in every concurrent service. A design decision made in one microservice informs four others.
The Karpathy LLM Wiki pattern (2024): drop a raw source, the LLM synthesizes it into a typed wiki page. Knowledge is compiled once and kept current.
This wiki links back to the repos it covers. Drop raw material from any repo; any AI tool can ingest and query it. Knowledge is centralized, not scattered across codebases.
Knowledge management · Structure
personal-wiki/
├── schema.md ← the spec: page types, frontmatter, naming rules
│ the LLM reads this to know what to write and where
├── CLAUDE.md ← loads schema + wiki structure (Claude Code)
├── JUNIE.md ← same for Junie — any tool, same wiki
└── wiki/
├── index.md ← content catalog — updated on every ingest
├── log.md ← append-only activity log
├── repo/ ← per-repo architecture and patterns
├── debug/ ← root causes, symptoms, fixes
├── gotcha/ ← traps documented once, searchable forever
├── decision/ ← architectural choices and their rationale
└── reference/← runbooks, command references, API summaries
Both CLAUDE.md and JUNIE.md load the schema at session start. Drop raw material and any tool can ingest it correctly.
Knowledge management · Ingest
Raw source → destination
| Raw source | Destination |
|---|---|
| Meeting notes | wiki/meeting/ |
| Debug log | wiki/debug/ |
| Jira export | wiki/jira/ |
| Confluence doc | wiki/reference/ |
| Code review | wiki/gotcha/ |
What the LLM does
Writes the typed page, updates index.md, appends to log.md.
Confluence has raw text. The wiki page has the synthesized answer — extracted once, not re-derived on every query.
Research backing
| Source | Finding | Maps to |
|---|---|---|
| ETH Zurich (Feb 2026) | Human-curated context files +4% task success. Non-inferable only. | knowledge.md lean by design |
| Git Context Controller (arXiv 2508.00031) | Cross-session continuity requires no re-teaching when context is in files, not chat history. | Task folders over conversation history |
| Context Engineering (LangChain / JetBrains, 2025) | Four strategies: Write · Select · Compress · Isolate. | findings.md · skills · summary.md · task folders |
| Block Engineering (Square, 2025) | AI Champions model: AI-authored code +69%, time savings +37%, automated PRs +21×. | One context owner per repo |
The before / after
| Before | After |
|---|---|
| Re-explaining the stack every session | knowledge.md loaded automatically |
| Same gotcha hits a second time | Flush from TASK-001 is in knowledge.md for TASK-002 |
| Same review comment on every PR | Codified as NEVER eq() in testing-standard skill |
| Writing PR descriptions from scratch | /close-task drafts it from checklist and findings |
| Three tools, three different contexts | AGENTS.md + .ai/ — one source, all tools |
| New engineer needs a week to onboard | Point them at knowledge.md and a task folder |
Demo
- stage-1-single-file.md — "Generated in 2 minutes."
- stage-2-growing-file.md — "~150 lines."
- CLAUDE.md — "20 lines, all @-imports."
- .ai/knowledge.md — the ConcurrentHashMap gotcha
- .ai/skills/ listing — domain + workflow + roles
- pre-flight/SKILL.md — ! injection, $ARGUMENTS
- architect/SKILL.md — role shift, same facts
- setup.sh — one command wires all slash commands
- AGENTS.md — the vendor-agnostic layer
- .ai/tasks/TASK-001/ — findings.md + tracker.md
- knowledge.md without being told
Takeaways
Thank you
Karpathy LLM Wiki · gist.github.com/karpathy
AGENTS.md standard · agents.md
ETH Zurich study · arxiv.org/html/2602.20478v1
ibnufirdaus.dev · ibnu.cs2016@gmail.com
Addendum
Copy any of these into your AI coding tool. Each one moves you one level forward.
Addendum · Step 1
Read this codebase and write a CLAUDE.md (or AGENTS.md) that captures:
- How to build and run the project
- The architecture — only what is not obvious from reading the source
- Coding standards and non-obvious constraints
- Anything a developer needs to know that cannot be inferred
Rules:
- Keep it under 100 lines
- Inferable facts (stack, test framework) can be one-liners
- Non-obvious constraints need the full detail — include the "why" and
the failure mode, not just the rule
- If something can be read from the code, leave it out
Addendum · Step 2
The current guidelines file has grown past ~80 lines.
Restructure it into a .ai/ folder:
1. .ai/knowledge.md — architectural facts, gotchas, non-obvious constraints
2. .ai/context/ — narrow reference sheets for specific concerns
(stack, deployment, key patterns). Each file ≤50 lines.
3. .ai/skills/ — repeated workflows and behavioral rules as skill files
(testing approach, task lifecycle, code review rules).
One topic per file.
4. Rewrite CLAUDE.md as a manifest of @-imports only — under 30 lines,
no duplicated content, just pointers to .ai/ files.
Do not write any files yet. Propose the folder structure and
show me what goes where. I will confirm before you create anything.
Addendum · Step 3
We use more than one AI coding tool in this repo.
Create a vendor-agnostic context layer:
1. Create AGENTS.md at the repo root (≤150 lines):
- Non-inferable information only — build commands, architecture,
constraints, key gotchas
- No tool-specific syntax
2. Make existing tool-specific files thin adapters:
- CLAUDE.md → @-imports from .ai/ + Claude-specific settings only
- .junie/ → @-imports from .ai/ + Junie-specific settings only
- Nothing should be duplicated across tool files — .ai/ is the source
Show me the proposed AGENTS.md and updated adapter files before writing.
Addendum · Step 4
Level up the AI setup with automated verification and task-based planning:
1. Audit .ai/knowledge.md for prose rules that could be enforced by a script.
For each one, propose a hook in .ai/hooks/ that exits non-zero on violation.
2. Create .ai/hooks/verify-build.sh — runs the build and tests, prints a
clear error on failure. Wire it into AGENTS.md: agent must run it before
marking any task complete.
3. Create a task folder template at .ai/tasks/TEMPLATE/:
- context.md — background, links, session protocol
- checklist.md — implementation steps
- findings.md — gotchas discovered during work
- tracker.md — last action / next action / open decisions
- summary.md — post-task writeup before flushing to knowledge.md
Add a rule to AGENTS.md: every non-trivial task gets a folder.
The task is not closed until findings are promoted to knowledge.md.
Propose before writing. Show me the hook scripts and folder template first.
Addendum · Knowledge wiki
Create a personal engineering wiki in this directory.
The wiki follows the Karpathy LLM Wiki pattern — knowledge is compiled
once and kept current, not re-derived on every query.
Structure:
wiki/
├── index.md — content catalog, one line per page
├── log.md — append-only activity log
├── repo/ — per-repository knowledge pages
├── decision/ — architectural and technical decisions
├── debug/ — debugging sessions and root causes
├── gotcha/ — non-obvious traps and constraints
├── concept/ — established technical concepts
└── note/ — freeform captures
Rules:
- Every page has YAML frontmatter: type, repo, tags, created, updated
- index.md is always up to date — update it on every change
- log.md is append-only — never edit past entries
- Cross-links use [[category/slug]] Obsidian format
- You (the LLM) write the wiki. I curate the sources.
Create the folder structure, index.md, and log.md now.
Do not create placeholder pages — only real structure.
Addendum · Knowledge wiki
Ingest the following source into the wiki.
[paste notes, debug log, meeting MoM, ticket, code snippet, or screenshot]
Steps:
1. Read the source and identify the type:
repo · decision · debug · gotcha · concept · note
2. Write a wiki page in the correct wiki/ subdirectory with full frontmatter
3. Update any existing pages that are touched by this new information
4. Flag any contradictions with existing pages in both pages
5. Update wiki/index.md with all new or significantly changed pages
6. Append a one-line entry to wiki/log.md
Do not invent facts not present in the source.
Mark gaps and open questions directly in the page under ## Open Questions.
Addendum · Full audit
Do a full scan of this repo's AI setup and produce a prioritised
list of improvements.
Scan: AGENTS.md, CLAUDE.md, .claude/, .junie/, .ai/ (knowledge.md,
context/, skills/, hooks/, tasks/), and any other agent context files.
For each area assess:
- Presence — what exists vs what is missing from the four-level framework
- Lean-ness — inferrable facts wasting context? Non-obvious constraints missing?
- Skills — repeated workflows codified, or re-explained every session?
- Verification — quality gates enforced as runnable hooks, or just prose advice?
- Task workflow — task folder structure present? Complex enough to need one?
- Vendor layer — does AGENTS.md exist? Do tool files duplicate or delegate?
- Knowledge — mechanism to promote discoveries into knowledge.md?
Output format:
## Current state
## P1 — High impact, quick to fix
## P2 — High impact, requires more work
## P3 — Nice to have
## Suggested next step
Do not modify any files. Audit and report only.