AI + AST

Code Brain — Knowledge Graph Codebase Intelligence

Your agents don't read code — they read your codebase's knowledge graph.

Code Brain builds a living knowledge graph from your codebase: tree-sitter AST extraction across five languages, petgraph-backed directed graph, regex-based route and schema detection, and BFS/DFS traversal with token budgeting. Every agent session starts with precise architectural context — not a file dump, but a structured understanding of what your code does, how it's connected, and what's load-bearing.

Module ID: MOD-CODE-BRAIN-—-KNOWLEDGE-GRAPH-CODEBASE-INTELLIGENCE-01
Status: Operational / Active
Deploy Env: Native App
Action: Initialise

Technical Capabilities

01.

Tree-Sitter AST Extraction — Five Languages

Code Brain uses tree-sitter — the deterministic incremental parser behind GitHub code navigation and Neovim — to extract a typed node graph from source files in five languages: Rust, TypeScript (including the TSX grammar), JavaScript, Python, and Go. Functions, methods, classes, structs, modules, import chains, and call edges are extracted directly from the concrete syntax tree. The same file always produces the same graph fragment: no guessing, no heuristics at the parse level.

02.

Knowledge Graph — petgraph DiGraph

Every extracted node and edge is stored in a directed knowledge graph backed by petgraph. Node types include Function, Method, Struct, Class, Module, Route, and Schema. Edge types include Imports, Calls, Contains, Implements, InheritsFrom, and SemanticallySimilarTo. The graph is not a summary — it is a traversable topology that agents query directly.
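The shape of that topology can be sketched with std collections — the real engine stores it in a petgraph DiGraph, and the exact node/edge representations here are illustrative assumptions, not Code Brain's actual types:

```rust
use std::collections::HashMap;

// Node and edge kinds mirroring the ones listed above (illustrative).
#[derive(Debug, Clone, PartialEq)]
enum NodeKind { Function, Method, Struct, Class, Module, Route, Schema }

#[derive(Debug, Clone, PartialEq)]
enum EdgeKind { Imports, Calls, Contains, Implements, InheritsFrom, SemanticallySimilarTo }

// Minimal directed graph: adjacency lists keyed by node id.
// (A std-only stand-in for petgraph's DiGraph.)
struct Graph {
    nodes: HashMap<String, NodeKind>,
    edges: HashMap<String, Vec<(String, EdgeKind)>>, // from -> [(to, kind)]
}

impl Graph {
    fn new() -> Self { Graph { nodes: HashMap::new(), edges: HashMap::new() } }
    fn add_node(&mut self, id: &str, kind: NodeKind) {
        self.nodes.insert(id.to_string(), kind);
    }
    fn add_edge(&mut self, from: &str, to: &str, kind: EdgeKind) {
        self.edges.entry(from.to_string()).or_default().push((to.to_string(), kind));
    }
    // Agents query the topology directly, e.g. "what does `auth` contain?"
    fn neighbours(&self, id: &str) -> Vec<&(String, EdgeKind)> {
        self.edges.get(id).map(|v| v.iter().collect()).unwrap_or_default()
    }
}

fn main() {
    let mut g = Graph::new();
    g.add_node("auth", NodeKind::Module);
    g.add_node("login", NodeKind::Function);
    g.add_edge("auth", "login", EdgeKind::Contains);
    assert_eq!(g.neighbours("auth").len(), 1);
}
```

The point of the adjacency-list shape is that queries are traversals, not text search — every later capability (BFS context, blast radius, god nodes) is a walk over this structure.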

03.

Edge Confidence System

Every edge in the graph carries a confidence class: Extracted (proven by AST, structurally certain), Inferred (heuristic match with a 0.4–0.95 probability score), or Ambiguous (unresolved, flagged as a knowledge gap). Agents can filter by confidence when reasoning about architectural risk. Ambiguous nodes are surfaced in analysis output as explicit gaps.
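A minimal sketch of confidence-filtered reasoning, assuming a simple three-variant enum (the real representation may differ):

```rust
// Illustrative confidence classes; names mirror the text above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Confidence {
    Extracted,     // proven by AST, structurally certain
    Inferred(f64), // heuristic match, score in 0.4..=0.95
    Ambiguous,     // unresolved: a knowledge gap
}

// Keep only edges an agent can treat as reliable at a given threshold.
fn reliable(edges: &[(&str, &str, Confidence)], min_score: f64) -> Vec<(String, String)> {
    edges.iter()
        .filter(|e| match e.2 {
            Confidence::Extracted => true,
            Confidence::Inferred(s) => s >= min_score,
            Confidence::Ambiguous => false,
        })
        .map(|e| (e.0.to_string(), e.1.to_string()))
        .collect()
}

fn main() {
    let edges = [
        ("router", "auth",    Confidence::Extracted),
        ("auth",   "db",      Confidence::Inferred(0.8)),
        ("auth",   "metrics", Confidence::Inferred(0.45)),
        ("jobs",   "auth",    Confidence::Ambiguous),
    ];
    assert_eq!(reliable(&edges, 0.7).len(), 2); // Extracted + high-score Inferred
}
```

Raising `min_score` tightens the view for risk-sensitive analysis; Ambiguous edges are never silently included — they surface separately as gaps.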

04.

Route and Schema Detection

On top of AST extraction, Code Brain runs framework-aware detectors for routes (Axum, Actix, Express, Fastify, Gin, FastAPI) and schemas (SeaORM, Prisma, SQLAlchemy, Drizzle, TypeORM). Environment variable references are tracked across all languages. These detections become specialised nodes and edges in the same knowledge graph — agents always know what an application exposes and depends on.
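The detector idea can be sketched with plain string matching — the real detectors are regex-based and framework-aware; the two patterns below (Axum-style `.route("…")` and FastAPI-style `@app.get("…")`) are just illustrative:

```rust
// Toy route detector: scan lines for two framework marker styles and
// pull out the quoted path. A std-only sketch, not the real regexes.
fn detect_routes(source: &str) -> Vec<String> {
    let mut routes = Vec::new();
    for line in source.lines() {
        for marker in [".route(\"", "@app.get(\"", "@app.post(\""] {
            if let Some(start) = line.find(marker) {
                let rest = &line[start + marker.len()..];
                if let Some(end) = rest.find('"') {
                    routes.push(rest[..end].to_string());
                }
            }
        }
    }
    routes
}

fn main() {
    let src = ".route(\"/users\", get(list_users))\n@app.get(\"/health\")";
    assert_eq!(detect_routes(src), vec!["/users", "/health"]);
}
```

Each detected path becomes a Route node in the same graph, so "what does this app expose" is a node query, not a grep.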

05.

BFS/DFS Context Queries with Token Budgeting

When an agent queries the graph ('show me the authentication flow', 'what uses the payments module'), Code Brain seeds from matching nodes and traverses breadth-first or depth-first — stopping precisely when the accumulated context would exceed the configured token budget. The result is a structured block of the right size to fit in an LLM prompt without padding it with irrelevant files. BFS finds the neighbourhood; DFS traces call chains.
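The budgeted traversal can be sketched as a standard BFS that tracks spent tokens — node names and per-node costs here are invented for illustration:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// BFS from a seed node, accumulating an estimated token cost per node and
// stopping before the configured budget would be exceeded.
fn bfs_with_budget(
    adj: &HashMap<&str, Vec<&str>>,
    cost: &HashMap<&str, usize>,
    seed: &str,
    budget: usize,
) -> Vec<String> {
    let mut visited = HashSet::new();
    let mut queue = VecDeque::from([seed]);
    let mut spent = 0;
    let mut out = Vec::new();
    while let Some(node) = queue.pop_front() {
        if !visited.insert(node) { continue; }
        let c = *cost.get(node).unwrap_or(&0);
        if spent + c > budget { break; } // budget reached: stop here
        spent += c;
        out.push(node.to_string());
        for &next in adj.get(node).into_iter().flatten() {
            queue.push_back(next);
        }
    }
    out
}

fn main() {
    let adj = HashMap::from([("auth", vec!["session", "db"]), ("session", vec!["db"])]);
    let cost = HashMap::from([("auth", 50), ("session", 40), ("db", 60)]);
    // A 100-token budget admits auth + session but not db.
    assert_eq!(bfs_with_budget(&adj, &cost, "auth", 100), vec!["auth", "session"]);
}
```

Swapping the queue for a stack turns the same loop into DFS — which is why BFS finds the neighbourhood while DFS traces chains.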

06.

God Nodes — Architectural Hot Spots

Code Brain identifies god nodes: the nodes with the highest total degree (incoming + outgoing edges combined). These are the abstractions everything else depends on — the central router, auth middleware, database pool, base model class. God nodes are surfaced in every context block so agents immediately understand which components are load-bearing, and serve as the fallback seed for open-ended queries.
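Degree ranking is straightforward to sketch over an edge list (the real computation runs on the petgraph graph; names are illustrative):

```rust
use std::collections::HashMap;

// Rank nodes by total degree (incoming + outgoing) and keep the top `top`.
fn god_nodes(edges: &[(&str, &str)], top: usize) -> Vec<(String, usize)> {
    let mut degree: HashMap<&str, usize> = HashMap::new();
    for &(from, to) in edges {
        *degree.entry(from).or_insert(0) += 1; // outgoing
        *degree.entry(to).or_insert(0) += 1;   // incoming
    }
    let mut ranked: Vec<(String, usize)> =
        degree.into_iter().map(|(n, d)| (n.to_string(), d)).collect();
    // Degree descending, then name, for a deterministic order.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked.truncate(top);
    ranked
}

fn main() {
    let edges = [
        ("router", "auth"), ("router", "users"),
        ("users", "db"), ("auth", "db"), ("jobs", "db"),
    ];
    assert_eq!(god_nodes(&edges, 1)[0], ("db".to_string(), 3)); // db is load-bearing
}
```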

07.

Blast Radius Analysis

Given any file or node, Code Brain computes the full blast radius: every file and component that would be affected if it changed. Traversal depth and direction are configurable. The result is a sorted list of affected files with distance from the change origin — essential context for an agent about to modify a shared utility, refactor a model, or rename a core interface.
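The analysis amounts to a depth-limited walk over *reverse* dependency edges. A minimal sketch, with an assumed `max_depth` parameter standing in for the configurable depth:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Blast radius: from a changed node, walk the nodes that depend on it and
// report each affected node with its distance from the change origin.
fn blast_radius(
    reverse_deps: &HashMap<&str, Vec<&str>>, // node -> nodes that depend on it
    origin: &str,
    max_depth: usize,
) -> Vec<(String, usize)> {
    let mut seen = HashSet::from([origin]);
    let mut queue = VecDeque::from([(origin, 0usize)]);
    let mut affected = Vec::new();
    while let Some((node, dist)) = queue.pop_front() {
        if dist > 0 { affected.push((node.to_string(), dist)); }
        if dist == max_depth { continue; } // configurable depth limit
        for &dep in reverse_deps.get(node).into_iter().flatten() {
            if seen.insert(dep) { queue.push_back((dep, dist + 1)); }
        }
    }
    // Sorted by distance, as described above.
    affected.sort_by(|a, b| a.1.cmp(&b.1).then(a.0.cmp(&b.0)));
    affected
}

fn main() {
    let rev = HashMap::from([
        ("db_pool", vec!["users_repo", "orders_repo"]),
        ("users_repo", vec!["users_api"]),
    ]);
    // Changing db_pool touches both repos (distance 1) and the API (distance 2).
    assert_eq!(blast_radius(&rev, "db_pool", 2).len(), 3);
}
```

Flipping the traversal direction answers the dual question: not "who breaks if this changes" but "what does this transitively depend on".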

08.

SHA-256 Extraction Cache

Every file's extraction result is cached by SHA-256 content hash. Re-scans after small edits take milliseconds, not seconds — only changed files are re-parsed. Cache entries persist across restarts at ~/.eggbert-agentic/codebrain-ast-cache/ and are invalidated by content change, not mtime, so moves and renames do not trigger unnecessary re-extraction.
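The invalidation-by-content idea can be shown in a few lines. Note the hedge: the real cache keys on SHA-256 of the file bytes; std's `DefaultHasher` stands in here only so the sketch needs no external crate:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Content-addressed extraction cache: identical bytes never re-parse,
// even if the file moved or its mtime changed.
struct ExtractionCache {
    entries: HashMap<u64, String>, // content hash -> cached extraction result
    parses: usize,                 // how many real parses were performed
}

impl ExtractionCache {
    fn new() -> Self { Self { entries: HashMap::new(), parses: 0 } }

    fn extract(&mut self, contents: &str) -> String {
        let mut h = DefaultHasher::new(); // SHA-256 in the real cache
        contents.hash(&mut h);
        let key = h.finish();
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone(); // unchanged content: cache hit
        }
        self.parses += 1; // stand-in for the expensive AST extraction
        let result = format!("graph-fragment({} bytes)", contents.len());
        self.entries.insert(key, result.clone());
        result
    }
}

fn main() {
    let mut cache = ExtractionCache::new();
    cache.extract("fn main() {}");
    cache.extract("fn main() {}"); // same bytes: no re-parse, even after a rename
    cache.extract("fn main() { println!(); }");
    assert_eq!(cache.parses, 2); // only distinct content was parsed
}
```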

09.

Knowledge Gaps Detection

Code Brain automatically surfaces nodes with at least one Ambiguous incoming edge — places where the graph knows something connects here but cannot resolve what. Common causes: dynamic dispatch, missing dependencies, or external packages that were not scanned. Agents use knowledge gaps as a signal to request additional context or flag risk areas in architectural analysis.
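Gap detection is a filter over edge confidence. A tiny sketch, with a pared-down two-variant enum just for illustration:

```rust
use std::collections::BTreeSet;

// Two of the confidence classes described earlier; enough for this sketch.
#[derive(PartialEq)]
enum Confidence { Extracted, Ambiguous }

// Knowledge gaps: targets of at least one Ambiguous incoming edge.
fn knowledge_gaps(edges: &[(&str, &str, Confidence)]) -> BTreeSet<String> {
    edges.iter()
        .filter(|e| e.2 == Confidence::Ambiguous)
        .map(|e| e.1.to_string())
        .collect()
}

fn main() {
    let edges = [
        ("router", "auth", Confidence::Extracted),
        ("plugin_host", "auth", Confidence::Ambiguous), // dynamic dispatch
        ("worker", "queue", Confidence::Ambiguous),     // unscanned package
    ];
    let gaps = knowledge_gaps(&edges);
    assert!(gaps.contains("auth") && gaps.contains("queue"));
}
```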

10.

Graph Persistence and Warm Start

The knowledge graph persists to disk in two forms: a ScanResult JSON (routes, schemas, env vars, dependencies, token stats) and a GraphStore JSON (all nodes and all edges). On startup, the in-memory graph is reconstructed from the store in under 10 ms — no re-scan required. Previously scanned projects are ready to answer queries the moment the engine starts.

11.

Semantic Extraction — Pluggable LLM Layer

Code Brain defines a SemanticExtractor trait that can be injected at runtime. When configured, it receives batches of source files and produces additional nodes and edges — concepts, intent labels, SemanticallySimilarTo relationships — that AST alone cannot detect. The default is a no-op stub; a live LLM-backed extractor is opt-in and cost-controlled. AST extraction always runs fast and offline.
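The trait name comes from the text; its method signature and the edge type below are assumptions made for illustration:

```rust
// An extra edge the semantic layer can contribute (illustrative shape).
#[derive(Debug, PartialEq)]
struct SemanticEdge { from: String, to: String, label: String }

// Pluggable semantic layer: injected at runtime, batched over source files.
trait SemanticExtractor {
    fn extract(&self, files: &[(&str, &str)]) -> Vec<SemanticEdge>;
}

// Default no-op stub: AST extraction stays fast, offline, and free.
struct NoopExtractor;
impl SemanticExtractor for NoopExtractor {
    fn extract(&self, _files: &[(&str, &str)]) -> Vec<SemanticEdge> { Vec::new() }
}

// An opt-in, LLM-backed implementation would slot in behind the same trait.
fn enrich(edges: &mut Vec<SemanticEdge>, ex: &dyn SemanticExtractor, files: &[(&str, &str)]) {
    edges.extend(ex.extract(files));
}

fn main() {
    let mut edges = Vec::new();
    enrich(&mut edges, &NoopExtractor, &[("src/auth.rs", "fn login() {}")]);
    assert!(edges.is_empty()); // the stub adds nothing, by design
}
```

Keeping the extractor behind a trait object means the graph pipeline never branches on whether an LLM is configured — cost control is an injection decision, not a code path.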

12.

Context Block Integration

Code Brain's primary output is a structured context block injected into agent prompts at the start of a session: project summary, technology stack, key routes and schemas, top god nodes with edge counts, graph statistics, and a token-budgeted BFS result for the current task. All sections are individually token-capped. Agents start every session with the architectural orientation they would otherwise spend turns building.
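Per-section capping can be sketched as follows — the 4-chars-per-token estimate, section titles, and heading format are assumptions, not Code Brain's actual output format:

```rust
// Truncate a section body to an estimated token budget (char-boundary safe).
fn cap(text: &str, max_tokens: usize) -> String {
    let max_chars = max_tokens * 4; // rough chars-per-token estimate
    if text.chars().count() <= max_chars {
        text.to_string()
    } else {
        text.chars().take(max_chars).collect::<String>() + "…"
    }
}

// Assemble a context block from (title, body, token budget) sections,
// each capped individually so no single section can crowd out the rest.
fn context_block(sections: &[(&str, &str, usize)]) -> String {
    sections.iter()
        .map(|(title, body, budget)| format!("## {}\n{}\n", title, cap(body, *budget)))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let block = context_block(&[
        ("Project Summary", "Axum service with SeaORM models.", 50),
        ("God Nodes", "db_pool (deg 14), router (deg 11)", 25),
    ]);
    assert!(block.contains("## God Nodes"));
}
```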

Integration Matrix