Your agents don't read code — they read your codebase's knowledge graph.
Overview
Code Brain is the codebase intelligence layer built into Eggspert. Where most tools give an AI agent a raw file dump, Code Brain builds a structured knowledge graph: functions, classes, structs, routes, and schemas become nodes; import chains, call edges, inheritance, and containment become typed edges with confidence scores. Agents query the graph with natural language, receive token-budgeted context anchored to the most relevant architectural nodes, and operate with the kind of structural understanding that usually takes a senior engineer days to build. It runs entirely on your machine — no code leaves your environment.
Tree-Sitter AST Extraction — Five Languages
Code Brain uses tree-sitter, the same deterministic parser powering GitHub's code search and Neovim's syntax engine, to extract a typed node graph from source files. Five language grammars are built in: Rust, TypeScript, TSX, JavaScript, and Python. For each file, tree-sitter builds a concrete syntax tree; Code Brain walks it and extracts functions, methods, classes, structs, modules, and their relationships — including call edges in Rust and import chains in all five languages. Extraction is deterministic: the same file always produces the same graph fragment, regardless of context.
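The walk can be sketched roughly as follows. The `Cst` type and node kinds here are illustrative stand-ins for the tree that tree-sitter would produce, not its actual API:

```rust
// Illustrative stand-in for a tree-sitter CST: each node has a kind,
// an optional identifier, and children. The real API exposes cursors
// over a parsed tree rather than a hand-built struct like this.
struct Cst {
    kind: &'static str,
    name: Option<&'static str>,
    children: Vec<Cst>,
}

// Depth-first walk collecting function definitions. The same file always
// yields the same list, because extraction depends only on the tree.
fn extract_functions(node: &Cst, out: &mut Vec<String>) {
    if node.kind == "function_item" {
        if let Some(name) = node.name {
            out.push(name.to_string());
        }
    }
    for child in &node.children {
        extract_functions(child, out);
    }
}

fn main() {
    let tree = Cst {
        kind: "source_file",
        name: None,
        children: vec![
            Cst { kind: "function_item", name: Some("login"), children: vec![] },
            Cst {
                kind: "mod_item",
                name: Some("auth"),
                children: vec![Cst { kind: "function_item", name: Some("logout"), children: vec![] }],
            },
        ],
    };
    let mut funcs = Vec::new();
    extract_functions(&tree, &mut funcs);
    assert_eq!(funcs, vec!["login", "logout"]);
    println!("extracted functions: {:?}", funcs);
}
```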
Knowledge Graph — petgraph DiGraph
Every extracted node and edge is stored in a directed knowledge graph backed by petgraph, Rust's graph data structure library. Node types include Function, Method, Struct, Class, Module, Variable, Route, Schema, Document, and Concept. Edge types include Imports, Calls, Contains, Implements, InheritsFrom, and SemanticallySimilarTo. Each edge carries a confidence score: Extracted (proven by AST), Inferred (heuristic, 0.4–0.95), or Ambiguous (uncertain, flagged as knowledge gap). This is the same graph structure used by professional code analysis tools — not a text summary, but a traversable topology.
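The shape of the graph can be sketched with std collections standing in for petgraph's `DiGraph`; only a subset of the real node and edge types is shown:

```rust
// Minimal typed graph: std vectors in place of petgraph's DiGraph.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeType { Function, Struct, Module }

#[derive(Debug, Clone, Copy, PartialEq)]
enum EdgeType { Calls, Contains }

#[derive(Debug, Clone, Copy, PartialEq)]
enum Confidence {
    Extracted,     // proven by the AST
    Inferred(f32), // heuristic, 0.4–0.95
    Ambiguous,     // unresolved: flagged as a knowledge gap
}

#[derive(Default)]
struct KnowledgeGraph {
    nodes: Vec<(String, NodeType)>,
    edges: Vec<(usize, usize, EdgeType, Confidence)>, // (from, to, type, confidence)
}

impl KnowledgeGraph {
    fn add_node(&mut self, label: &str, ty: NodeType) -> usize {
        self.nodes.push((label.to_string(), ty));
        self.nodes.len() - 1
    }
    fn add_edge(&mut self, from: usize, to: usize, ty: EdgeType, conf: Confidence) {
        self.edges.push((from, to, ty, conf));
    }
    fn out_neighbours(&self, node: usize) -> Vec<usize> {
        self.edges.iter().filter(|e| e.0 == node).map(|e| e.1).collect()
    }
}

fn main() {
    let mut g = KnowledgeGraph::default();
    let auth = g.add_node("auth", NodeType::Module);
    let login = g.add_node("login", NodeType::Function);
    let user = g.add_node("User", NodeType::Struct);
    g.add_edge(auth, login, EdgeType::Contains, Confidence::Extracted);
    g.add_edge(login, user, EdgeType::Calls, Confidence::Inferred(0.8));
    assert_eq!(g.out_neighbours(auth), vec![login]);
    println!("{} nodes, {} edges", g.nodes.len(), g.edges.len());
}
```

The point of the typed topology is that every query below (BFS context, god nodes, blast radius) is a traversal over this one structure rather than a text search.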
Edge Confidence System
Not all relationships in a codebase are equally certain. Code Brain distinguishes between three confidence classes on every edge. Extracted edges come directly from the AST and are structurally certain — an import statement, an explicit method definition, a class inheriting from another. Inferred edges are produced by heuristics where full resolution requires a type checker — call sites in Rust, path-based route-to-handler matching — and carry a probability score between 0.4 and 0.95. Ambiguous edges surface where resolution failed: dynamic dispatch, missing symbols, or external references. Ambiguous nodes are surfaced in analysis as knowledge gaps.
Route and Schema Detection Overlay
On top of AST extraction, Code Brain runs a suite of regex-based detectors that understand framework-specific patterns. Routes are detected across Axum, Actix, Express, Fastify, Gin, FastAPI, and more — extracted with method, path, and handler. Schemas are detected from SeaORM entities, Prisma models, SQLAlchemy classes, Drizzle tables, and TypeORM decorators. Environment variable references are tracked across all languages. ORM and framework dependencies are identified from manifest files. These detections become specialised nodes and edges in the same knowledge graph, giving agents awareness of what an application exposes and what it depends on.
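A toy detector for Express-style routes gives the flavour; it uses plain string scanning for brevity, where the real detectors are regex-based and cover each framework's own registration syntax:

```rust
// Detect `app.get('/path', handler)`-style Express routes, returning
// (METHOD, path) pairs. Illustrative only: the real detectors use
// framework-specific regexes, not this string scan.
fn detect_express_routes(source: &str) -> Vec<(String, String)> {
    let mut routes = Vec::new();
    for line in source.lines() {
        for method in ["get", "post", "put", "delete", "patch"] {
            let marker = format!("app.{}('", method);
            if let Some(start) = line.find(&marker) {
                let rest = &line[start + marker.len()..];
                if let Some(end) = rest.find('\'') {
                    routes.push((method.to_uppercase(), rest[..end].to_string()));
                }
            }
        }
    }
    routes
}

fn main() {
    let src = "app.get('/users', listUsers);\napp.post('/login', login);";
    let routes = detect_express_routes(src);
    assert_eq!(routes, vec![
        ("GET".to_string(), "/users".to_string()),
        ("POST".to_string(), "/login".to_string()),
    ]);
    println!("{:?}", routes);
}
```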
BFS/DFS Context Queries with Token Budgeting
When an agent asks Code Brain a question ("show me the authentication flow", "what uses the payments module"), Code Brain seeds the graph from nodes whose labels or file paths match the query terms, then runs a breadth-first or depth-first traversal to collect the most relevant architectural context. Traversal stops when the accumulated node and edge descriptions would exceed the configured token budget — typically 1,500–4,000 tokens. The result is a structured text block of precisely the right size to fit in an LLM prompt without padding it with irrelevant files. BFS and DFS traverse the graph differently: BFS finds the neighbourhood, DFS traces call chains.
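The budgeted traversal can be sketched as follows; the four-characters-per-token estimate is an assumption for illustration, where the real engine budgets against actual LLM token counts:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Crude token estimate: roughly one token per four characters.
fn est_tokens(text: &str) -> usize {
    text.len() / 4 + 1
}

// BFS from the seed nodes, emitting one description per node until the
// accumulated descriptions would exceed the token budget.
fn bfs_context(adj: &HashMap<&str, Vec<&str>>, seeds: &[&str], budget: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut used = 0;
    let mut seen: HashSet<&str> = HashSet::new();
    let mut queue: VecDeque<&str> = seeds.iter().copied().collect();
    while let Some(node) = queue.pop_front() {
        if !seen.insert(node) {
            continue; // already visited
        }
        let desc = format!("node: {}", node);
        let cost = est_tokens(&desc);
        if used + cost > budget {
            break; // budget exhausted: stop traversal
        }
        used += cost;
        out.push(desc);
        for &next in adj.get(node).into_iter().flatten() {
            queue.push_back(next);
        }
    }
    out
}

fn main() {
    let mut adj: HashMap<&str, Vec<&str>> = HashMap::new();
    adj.insert("auth", vec!["login", "session"]);
    adj.insert("login", vec!["db"]);
    // With a 12-token budget, "db" (distance 2) would push past the limit
    // and is dropped; the nearer neighbourhood survives.
    let ctx = bfs_context(&adj, &["auth"], 12);
    assert_eq!(ctx, vec!["node: auth", "node: login", "node: session"]);
    println!("{:?}", ctx);
}
```

Swapping the `VecDeque` for a stack turns the same routine into the DFS variant that traces call chains instead of neighbourhoods.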
God Nodes — Architectural Hot Spots
Code Brain identifies god nodes: the nodes with the highest total degree (incoming plus outgoing edges). These are the abstractions everything else depends on — the central router, the auth middleware, the database connection pool, the base model class. God nodes are surfaced in every context block so agents immediately understand which components are load-bearing. They are also the fallback seed for queries that match nothing else in the graph, ensuring agents always have architectural orientation even for open-ended prompts.
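Degree ranking reduces to a single pass over the edge list; a minimal sketch:

```rust
use std::collections::HashMap;

// Rank nodes by total degree (incoming plus outgoing edges); the top
// entries are the graph's god nodes.
fn god_nodes(edges: &[(&str, &str)], top: usize) -> Vec<(String, usize)> {
    let mut degree: HashMap<&str, usize> = HashMap::new();
    for &(from, to) in edges {
        *degree.entry(from).or_insert(0) += 1; // outgoing
        *degree.entry(to).or_insert(0) += 1;   // incoming
    }
    let mut ranked: Vec<(String, usize)> = degree
        .into_iter()
        .map(|(node, d)| (node.to_string(), d))
        .collect();
    // Degree descending, then name, for a deterministic order.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked.truncate(top);
    ranked
}

fn main() {
    let edges = [
        ("main", "router"), ("router", "auth"), ("router", "users"),
        ("users", "db"), ("auth", "db"), ("router", "db"),
    ];
    let top = god_nodes(&edges, 2);
    // "router" touches four edges, "db" three: the load-bearing nodes.
    assert_eq!(top, vec![("router".to_string(), 4), ("db".to_string(), 3)]);
    println!("{:?}", top);
}
```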
Blast Radius Analysis
Given any file or node, Code Brain can compute its blast radius: the set of all files and components that would be affected if it changed. Traversal depth and direction are configurable. The result is a sorted list of affected files with distance from the change origin — essential context for an agent about to modify a shared utility, refactor a model, or rename a core interface. Blast radius analysis runs on the same petgraph topology as context queries and respects the same edge types.
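The core of the computation is a BFS over reversed dependency edges with a depth cap; a sketch under the assumption that edges read "`from` depends on `to`":

```rust
use std::collections::{HashMap, VecDeque};

// Walk reverse dependency edges outward from the changed node, recording
// each affected node's distance from the origin, up to `max_depth`.
fn blast_radius(deps: &[(&str, &str)], origin: &str, max_depth: usize) -> Vec<(String, usize)> {
    let mut rev: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(from, to) in deps {
        rev.entry(to).or_default().push(from); // invert: who depends on `to`?
    }
    let mut dist: HashMap<&str, usize> = HashMap::new();
    let mut queue = VecDeque::new();
    dist.insert(origin, 0);
    queue.push_back(origin);
    while let Some(node) = queue.pop_front() {
        let d = dist[node];
        if d == max_depth {
            continue; // configurable depth limit reached
        }
        for &dependant in rev.get(node).into_iter().flatten() {
            if !dist.contains_key(dependant) {
                dist.insert(dependant, d + 1);
                queue.push_back(dependant);
            }
        }
    }
    let mut affected: Vec<(String, usize)> = dist
        .into_iter()
        .filter(|&(node, _)| node != origin)
        .map(|(node, d)| (node.to_string(), d))
        .collect();
    affected.sort_by(|a, b| a.1.cmp(&b.1).then(a.0.cmp(&b.0))); // nearest first
    affected
}

fn main() {
    let deps = [("handlers", "db"), ("routes", "handlers"), ("main", "routes")];
    // Changing "db" affects its dependants out to depth 2; "main" sits
    // at depth 3 and falls outside the configured radius.
    let hits = blast_radius(&deps, "db", 2);
    assert_eq!(hits, vec![("handlers".to_string(), 1), ("routes".to_string(), 2)]);
    println!("{:?}", hits);
}
```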
SHA-256 Extraction Cache
Code Brain caches every file's extraction result keyed by SHA-256 hash of the file's content. On re-scan, only files whose content actually changed are re-parsed. For a 50,000-line Rust codebase, a re-scan after a small edit takes milliseconds rather than seconds. Cache entries are stored at ~/.eggbert-agentic/codebrain-ast-cache/ and persist across restarts. The cache is invalidated automatically when a file changes — not by mtime but by content, so moves and renames do not trigger unnecessary re-extraction.
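The cache logic can be sketched as below. Std's `DefaultHasher` stands in for SHA-256 to keep the sketch dependency-free, and a trivial line scan stands in for the real tree-sitter extraction:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Content hash for cache keys. The real cache uses SHA-256 of the file
// content; DefaultHasher is a stand-in here.
fn content_hash(source: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    source.hash(&mut hasher);
    hasher.finish()
}

struct ExtractionCache {
    entries: HashMap<String, (u64, Vec<String>)>, // path -> (content hash, symbols)
    parses: usize,                                // counts real (uncached) extractions
}

impl ExtractionCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), parses: 0 }
    }

    fn extract(&mut self, path: &str, source: &str) -> Vec<String> {
        let hash = content_hash(source);
        if let Some((cached, symbols)) = self.entries.get(path) {
            if *cached == hash {
                return symbols.clone(); // content unchanged: skip re-parsing
            }
        }
        self.parses += 1;
        // Stand-in for the real tree-sitter extraction pass.
        let symbols: Vec<String> = source
            .lines()
            .filter(|l| l.trim_start().starts_with("fn "))
            .map(|l| l.trim().to_string())
            .collect();
        self.entries.insert(path.to_string(), (hash, symbols.clone()));
        symbols
    }
}

fn main() {
    let mut cache = ExtractionCache::new();
    cache.extract("lib.rs", "fn a() {}");
    cache.extract("lib.rs", "fn a() {}"); // identical content: cache hit
    assert_eq!(cache.parses, 1);
    cache.extract("lib.rs", "fn a() {}\nfn b() {}"); // content changed
    assert_eq!(cache.parses, 2);
    println!("parses: {}", cache.parses);
}
```

Keying on content rather than mtime is what makes moves and renames free: the same bytes hash to the same key wherever the file lives.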
Knowledge Gaps Detection
Code Brain automatically surfaces nodes with at least one Ambiguous incoming edge — locations where the graph knows something connects here but cannot resolve what. These are reported as knowledge gaps in the analysis output. Common causes: dynamic dispatch, monkey-patching, missing dependencies, or code that calls into external packages that were not scanned. Agents can use knowledge gaps as a signal to request additional context or to flag risk areas in architectural analysis.
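Gap detection is a filter over edge confidence; a minimal sketch:

```rust
// A knowledge gap is any node with at least one Ambiguous incoming edge.
#[derive(PartialEq)]
enum Confidence { Extracted, Inferred(f32), Ambiguous }

fn knowledge_gaps(edges: &[(&str, &str, Confidence)]) -> Vec<String> {
    let mut gaps: Vec<String> = edges
        .iter()
        .filter(|e| e.2 == Confidence::Ambiguous) // unresolved target
        .map(|e| e.1.to_string())
        .collect();
    gaps.sort();
    gaps.dedup();
    gaps
}

fn main() {
    let edges = [
        ("main", "router", Confidence::Extracted),
        ("router", "handler", Confidence::Inferred(0.7)),
        ("plugin", "dispatch", Confidence::Ambiguous), // dynamic dispatch: unresolved
    ];
    let gaps = knowledge_gaps(&edges);
    assert_eq!(gaps, vec!["dispatch"]);
    println!("knowledge gaps: {:?}", gaps);
}
```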
Graph Persistence and Warm Start
The knowledge graph is persisted to disk in two forms: a ScanResult JSON (routes, schemas, env vars, dependencies, token stats) and a GraphStore JSON (all nodes and all edges). On startup, Code Brain loads both from disk and reconstructs the in-memory petgraph from the store — no re-scan required. Graph reconstruction from a 10,000-node store takes under 10 ms. Projects that have been scanned before are ready to answer queries the moment the engine starts.
Semantic Extraction — Pluggable LLM Layer
Code Brain defines a SemanticExtractor trait that can be injected at runtime. When a semantic extractor is configured, it receives batches of (source, file_path) pairs and produces additional nodes and edges — concepts, intent labels, SemanticallySimilarTo edges — that the AST alone cannot detect. The default extractor is a no-op stub; a live LLM-backed extractor can be installed without changing any other part of the pipeline. This separation means AST extraction always runs fast and offline, while semantic enrichment is opt-in and cost-controlled.
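The shape of the seam can be sketched as follows. The method name, signature, and the `NoopExtractor`/`KeywordExtractor` types here are illustrative assumptions, not the real trait definition:

```rust
// Pluggable extractor: a trait with a no-op stub and a swappable
// enriched implementation. Signatures are illustrative.
trait SemanticExtractor {
    /// Takes (source, file_path) pairs; returns (concept, file_path)
    /// nodes to merge into the graph.
    fn extract(&self, batch: &[(String, String)]) -> Vec<(String, String)>;
}

/// Default stub: AST-only pipeline, no semantic nodes, zero cost.
struct NoopExtractor;
impl SemanticExtractor for NoopExtractor {
    fn extract(&self, _batch: &[(String, String)]) -> Vec<(String, String)> {
        Vec::new()
    }
}

/// Stand-in for an LLM-backed extractor (a trivial keyword heuristic).
struct KeywordExtractor;
impl SemanticExtractor for KeywordExtractor {
    fn extract(&self, batch: &[(String, String)]) -> Vec<(String, String)> {
        batch
            .iter()
            .filter(|(source, _)| source.contains("password"))
            .map(|(_, path)| ("concept:authentication".to_string(), path.clone()))
            .collect()
    }
}

// The rest of the pipeline depends only on the trait object, so swapping
// extractors changes nothing downstream.
fn enrich(extractor: &dyn SemanticExtractor, batch: &[(String, String)]) -> usize {
    extractor.extract(batch).len()
}

fn main() {
    let batch = vec![("fn check(password: &str) {}".to_string(), "auth.rs".to_string())];
    assert_eq!(enrich(&NoopExtractor, &batch), 0);
    assert_eq!(enrich(&KeywordExtractor, &batch), 1);
    println!("ok");
}
```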
Context Block Integration
Code Brain's primary output is a context block: a structured text section injected into agent prompts at the start of a session. The block includes a project summary, technology stack, key routes and schemas, top god nodes with their edge counts, graph statistics (node count, edge count), and a token-budgeted BFS result for the current task description. All sections are individually token-capped and the total block respects the configured max_tokens limit. Agents that receive a Code Brain context block start every session with architectural orientation they would otherwise spend the first several turns building.
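The assembly logic can be sketched as below; the section names and four-characters-per-token estimate are assumptions for illustration, where the real engine uses actual tokenizer counts:

```rust
// Crude token estimate: roughly one token per four characters.
fn est_tokens(text: &str) -> usize {
    text.len() / 4 + 1
}

// Truncate a section body until its estimate fits the per-section cap.
fn cap_section(text: &str, cap: usize) -> String {
    let mut s: String = text.chars().take(cap * 4).collect();
    while est_tokens(&s) > cap {
        s.pop();
    }
    s
}

// Concatenate capped sections, dropping trailing sections once the
// total would exceed the block-wide max_tokens limit.
fn build_context_block(sections: &[(&str, &str)], per_section: usize, max_tokens: usize) -> String {
    let mut block = String::new();
    let mut used = 0;
    for (title, body) in sections {
        let section = format!("## {}\n{}\n", title, cap_section(body, per_section));
        let cost = est_tokens(&section);
        if used + cost > max_tokens {
            break; // total budget reached
        }
        used += cost;
        block.push_str(&section);
    }
    block
}

fn main() {
    let sections = [
        ("Summary", "Axum web service with SeaORM persistence."),
        ("God nodes", "router (14 edges), AppState (11 edges)"),
        ("Routes", "GET /users, POST /login"),
    ];
    // A 30-token budget fits the first two sections; "Routes" is dropped.
    let block = build_context_block(&sections, 50, 30);
    assert!(block.contains("## Summary"));
    assert!(block.contains("## God nodes"));
    assert!(!block.contains("## Routes"));
    println!("{}", block);
}
```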