Semantic Index Architecture

The semantic index is a project-wide code analysis engine built into @codepol/core. It extracts language-agnostic semantic information from source files using tree-sitter and exposes it through the ProjectIndex query API for plugin rules.

Overview

The semantic index provides:

  • Symbol extraction -- functions, classes, variables, types, interfaces, enums, and their attributes
  • Scope trees -- lexical/semantic boundaries for name resolution
  • Cross-file resolution -- import/export binding, re-export chains, namespace members
  • Call graph -- heuristic caller/callee detection
  • Module graph -- dependency order, cycle detection, entry points
  • Type relations -- extends/implements hierarchy with cross-file resolution
  • Control flow graphs -- per-function CFGs with cyclomatic complexity

Data Flow

  1. Parse -- each source file is parsed into a concrete syntax tree by tree-sitter (WASM grammars, no native deps)
  2. Adapter extraction -- a language adapter runs query packs against the tree to extract symbols, scopes, relations, and CFGs into a FileIndexDelta
  3. Store -- deltas are merged into the IndexStore, the central mutable data store
  4. Cross-file resolution -- after all files are indexed, crossFileResolve links import bindings to their source exports, resolves namespace members, updates type relations, and resolves module specifiers
  5. ProjectIndex -- a read-only query facade over the store, exposed to plugin rules

Component Architecture

| Component | File | Purpose |
| --- | --- | --- |
| projectIndexBuild | indexBuilder.ts | Orchestrates per-file indexing, cross-file resolution, and returns ProjectIndex |
| crossFileResolve | indexBuilder.ts | Links imports to exports, resolves namespace members, updates type relations |
| IndexStore | indexStore.ts | Mutable store of all symbols, scopes, and relations with indexed lookups |
| ModuleGraph | moduleGraph.ts | Dependency graph with topological sort (Kahn's) and cycle detection (Tarjan's SCC) |
| ProjectIndex | indexQuery.ts | Read-only query API exposed to plugins |
| IndexAdapter | adapterTypes.ts | Language-specific extraction (tree-sitter queries + kind mappings) |
| moduleResolve | moduleResolver.ts | Node-style module specifier resolution with path alias support |
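
To make the ModuleGraph row more concrete, here is a minimal sketch of Kahn's algorithm over a module dependency map. It is not the code in moduleGraph.ts: the DependencyMap shape, the topoSortKahn name, and the result fields are assumptions for illustration. The sketch only reports which modules were left unsorted; it does not run Tarjan's SCC, so it cannot say which modules form which cycle.

```typescript
// Hypothetical shape: module path -> paths of the modules it imports.
type DependencyMap = Map<string, string[]>;

interface TopoResult {
  order: string[];    // dependencies appear before the modules that import them
  unsorted: string[]; // modules that sit on a cycle or depend on one
}

function topoSortKahn(deps: DependencyMap): TopoResult {
  const pending = new Map<string, number>();      // unresolved dependency count
  const dependents = new Map<string, string[]>(); // reverse edges

  deps.forEach((imports, mod) => {
    if (!pending.has(mod)) pending.set(mod, 0);
    imports.forEach((dep) => {
      if (!pending.has(dep)) pending.set(dep, 0);
      pending.set(mod, (pending.get(mod) ?? 0) + 1);
      const list = dependents.get(dep) ?? [];
      list.push(mod);
      dependents.set(dep, list);
    });
  });

  // Start with modules that import nothing.
  const queue: string[] = [];
  pending.forEach((count, mod) => {
    if (count === 0) queue.push(mod);
  });

  const order: string[] = [];
  while (queue.length > 0) {
    const mod = queue.shift() as string;
    order.push(mod);
    (dependents.get(mod) ?? []).forEach((dependent) => {
      const left = (pending.get(dependent) ?? 0) - 1;
      pending.set(dependent, left);
      if (left === 0) queue.push(dependent);
    });
  }

  // Anything still waiting on a dependency could not be ordered.
  const unsorted: string[] = [];
  pending.forEach((count, mod) => {
    if (count > 0) unsorted.push(mod);
  });
  return { order, unsorted };
}

const graph: DependencyMap = new Map<string, string[]>([
  ["app.ts", ["utils.ts", "types.ts"]],
  ["utils.ts", ["types.ts"]],
  ["types.ts", []],
  ["a.ts", ["b.ts"]],
  ["b.ts", ["a.ts"]],
]);
const sorted = topoSortKahn(graph);
console.log(sorted.order);    // acyclic part, in dependency order
console.log(sorted.unsorted); // a.ts and b.ts import each other
```

Kahn's algorithm gives a dependency order for the acyclic part of the graph, and the leftover set is a cheap signal that a cycle exists. Identifying the cycles themselves is the job the table assigns to Tarjan's SCC.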

Index Build Pipeline

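projectIndexBuild(options) orchestrates the build: it indexes each file (parse, then adapter extraction into the store), runs cross-file resolution, and returns a ProjectIndex.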

Cross-File Resolution Steps

  1. Export map -- build Map<filePath, Map<exportedName, SymbolId>> from all ExportsRelation entries
  2. Re-export propagation -- follow sourceModule chains iteratively until stable (handles export * from, export { foo } from, export * as ns from)
  3. ImportBinding resolution -- match each ImportBindingRelation to its source export via module resolution
  4. Reference update -- update ReferencesRelation.resolvedSymbolId for references that resolved to import binding symbols
  5. Namespace member resolution -- resolve dotted references like utils.alpha against the namespace's module export map
  6. TypeRelation resolution -- update TypeRelation.resolvedTargetId from local import binding to actual exported symbol
  7. ImportsRelation resolution -- set resolvedModulePath on side-effect and dynamic imports for module graph edges
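
Steps 1 and 2 above can be sketched as a small fixed-point loop. This is an illustration under assumptions, not the code in indexBuilder.ts: ExportEntry is a simplified stand-in for ExportsRelation, sourceModule is assumed to be already resolved to a file path, `export * from` is encoded as exportedName "*", and `export * as ns from` is left out.

```typescript
type SymbolId = number;
type ExportMap = Map<string, Map<string, SymbolId>>;

// Hypothetical, simplified stand-in for ExportsRelation.
interface ExportEntry {
  filePath: string;
  exportedName: string;  // "*" stands for `export * from`
  symbolId?: SymbolId;   // present on local exports
  sourceModule?: string; // present on re-exports (assumed already resolved)
  sourceName?: string;   // original name when re-exported under an alias
}

function buildExportMap(entries: ExportEntry[]): ExportMap {
  const map: ExportMap = new Map();
  const slot = (file: string): Map<string, SymbolId> => {
    let names = map.get(file);
    if (!names) {
      names = new Map<string, SymbolId>();
      map.set(file, names);
    }
    return names;
  };

  // Step 1: record local exports.
  for (const e of entries) {
    if (e.sourceModule === undefined && e.symbolId !== undefined) {
      slot(e.filePath).set(e.exportedName, e.symbolId);
    }
  }

  // Step 2: propagate re-exports until a full pass adds nothing new.
  let changed = true;
  while (changed) {
    changed = false;
    for (const e of entries) {
      if (e.sourceModule === undefined) continue;
      const source = map.get(e.sourceModule);
      if (!source) continue; // source not populated yet; retry on a later pass
      const target = slot(e.filePath);
      if (e.exportedName === "*") {
        source.forEach((id, name) => {
          // `export *` does not forward the default export.
          if (name !== "default" && !target.has(name)) {
            target.set(name, id);
            changed = true;
          }
        });
      } else {
        const id = source.get(e.sourceName ?? e.exportedName);
        if (id !== undefined && !target.has(e.exportedName)) {
          target.set(e.exportedName, id);
          changed = true;
        }
      }
    }
  }
  return map;
}

// index.ts re-exports from mid.ts, which re-exports everything from core.ts.
// The entries are deliberately listed in an order that needs more than one pass.
const exportsByFile = buildExportMap([
  { filePath: "index.ts", exportedName: "a", sourceModule: "mid.ts", sourceName: "alpha" },
  { filePath: "mid.ts", exportedName: "*", sourceModule: "core.ts" },
  { filePath: "core.ts", exportedName: "alpha", symbolId: 1 },
]);
console.log(exportsByFile.get("index.ts")?.get("a"));
```

Iterating until nothing changes keeps the loop independent of the order in which files were indexed, and it terminates because each pass can only add names to a finite set.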

Data Model

Core Records

SymbolKind: module, namespace, class, interface, type, function, method, variable, const, field, parameter, enum, enumMember

ScopeKind: file, module, type, function, block, class, namespace

SymbolFlags (bitset): Exported, Async, Generator, Static, Abstract, Readonly, Optional, Private, Protected, Public
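
A bitset like SymbolFlags is usually combined with bitwise OR and tested with bitwise AND. The sketch below shows that pattern; the flag names come from the list above, but the bit positions and the hasFlags helper are assumptions, not the values used by @codepol/core.

```typescript
// Bit positions are assumed for illustration only.
enum SymbolFlags {
  Exported = 1 << 0,
  Async = 1 << 1,
  Generator = 1 << 2,
  Static = 1 << 3,
  Abstract = 1 << 4,
  Readonly = 1 << 5,
  Optional = 1 << 6,
  Private = 1 << 7,
  Protected = 1 << 8,
  Public = 1 << 9,
}

// Hypothetical helper: true when every bit in `mask` is set in `flags`.
function hasFlags(flags: number, mask: number): boolean {
  return (flags & mask) === mask;
}

// Flags for something like `export async function load() {}`.
const loadFlags = SymbolFlags.Exported | SymbolFlags.Async;

console.log(hasFlags(loadFlags, SymbolFlags.Async));
console.log(hasFlags(loadFlags, SymbolFlags.Static));
```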

Relations

Relations are append-only facts extracted by adapters and refined during cross-file resolution.

| Relation | Purpose | Key Fields |
| --- | --- | --- |
| DefinesRelation | Scope declares a symbol | scopeId, symbolId |
| ContainsRelation | Scope contains child scope | scopeId, childScopeId |
| ReferencesRelation | Identifier refers to a symbol | name, byteRange, resolvedSymbolId? |
| CallsRelation | Call expression in a scope | calleeName, byteRange, resolvedSymbolId? |
| ImportsRelation | Scope imports from module specifier | spec, resolvedModulePath? |
| ImportBindingRelation | Links imported name to source module | localSymbolId, importedName, moduleSpec, resolvedExportId?, isDefault, isNamespace |
| ExportsRelation | Symbol exported from module | symbolId, exportedName, isDefault, sourceModule?, sourceName? |
| TypeRelation | Extends/implements hierarchy edge | symbolId, targetName, relationKind, resolvedTargetId? |

Control Flow Graph

Each function/method scope gets a FlowGraph with:

  • FlowNode kinds: entry, exit, statement, branch, merge, loop, return, throw
  • FlowEdge labels: true, false, loop-back, unconditional, break, continue, case, default, exception, finally
  • Cyclomatic complexity: V(G) = E - N + 2
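
As a worked instance of the formula, the sketch below builds the flow graph of a single if/else and computes V(G). The FlowNode, FlowEdge, and FlowGraph shapes are assumptions based on the kinds and labels listed above, and the formula as written assumes one connected graph per function.

```typescript
type FlowNodeKind =
  | "entry" | "exit" | "statement" | "branch"
  | "merge" | "loop" | "return" | "throw";

// Hypothetical record shapes; the real field names may differ.
interface FlowNode { id: number; kind: FlowNodeKind; }
interface FlowEdge { from: number; to: number; label: string; }
interface FlowGraph { nodes: FlowNode[]; edges: FlowEdge[]; }

// V(G) = E - N + 2 for a single connected control flow graph.
function cyclomaticComplexity(graph: FlowGraph): number {
  return graph.edges.length - graph.nodes.length + 2;
}

// if (cond) { a(); } else { b(); }
const ifElse: FlowGraph = {
  nodes: [
    { id: 0, kind: "entry" },
    { id: 1, kind: "branch" },
    { id: 2, kind: "statement" }, // a()
    { id: 3, kind: "statement" }, // b()
    { id: 4, kind: "merge" },
    { id: 5, kind: "exit" },
  ],
  edges: [
    { from: 0, to: 1, label: "unconditional" },
    { from: 1, to: 2, label: "true" },
    { from: 1, to: 3, label: "false" },
    { from: 2, to: 4, label: "unconditional" },
    { from: 3, to: 4, label: "unconditional" },
    { from: 4, to: 5, label: "unconditional" },
  ],
};

// 6 edges - 6 nodes + 2 = 2: one decision point, two independent paths.
console.log(cyclomaticComplexity(ifElse));
```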

Adapter Architecture

Language adapters are the bridge between tree-sitter parse trees and the language-agnostic data model.

Each adapter provides:

  1. QueryPack -- tree-sitter S-expression patterns with named captures
  2. Kind mappings -- map capture suffixes / node types to canonical SymbolKind and ScopeKind
  3. Capture names -- standard convention (@scope, @name, @decl.*, @ref.*, @callee.*, etc.)
  4. Reference filter -- post-filter function to remove declaration sites, property keys, etc.
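
The first three parts can be pictured with a small sketch. Everything here is illustrative: QueryPackSketch, declKinds, and symbolKindForCapture are hypothetical names, the @callee.name suffix is an assumption, and the query patterns assume the node and field names of the tree-sitter TypeScript grammar. The real IndexAdapter interface in adapterTypes.ts may look quite different.

```typescript
// Subset of the canonical SymbolKind values, enough for this sketch.
type SymbolKind = "function" | "class";

// Hypothetical query pack: tree-sitter S-expression patterns with named captures.
interface QueryPackSketch {
  symbols: string;
  calls: string;
}

const queries: QueryPackSketch = {
  symbols: `
    (function_declaration name: (identifier) @name) @decl.function
    (class_declaration name: (type_identifier) @name) @decl.class
  `,
  calls: `
    (call_expression function: (identifier) @callee.name)
  `,
};

// Kind mapping: the suffix after "decl." selects the canonical SymbolKind.
const declKinds: Record<string, SymbolKind> = {
  function: "function",
  class: "class",
};

// Capture names are given as tree-sitter reports them, without the leading "@".
function symbolKindForCapture(capture: string): SymbolKind | undefined {
  const prefix = "decl.";
  if (capture.indexOf(prefix) !== 0) return undefined;
  const suffix = capture.slice(prefix.length);
  return Object.prototype.hasOwnProperty.call(declKinds, suffix)
    ? declKinds[suffix]
    : undefined;
}

console.log(symbolKindForCapture("decl.class"));
console.log(symbolKindForCapture("callee.name"));
```

Keeping the mapping in data rather than in code means a new declaration kind only needs a new pattern and a new table entry.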

See Creating Language Adapters for a step-by-step guide.

Built-in Language Support

| Language | Adapter | Query Packs | Type Relations |
| --- | --- | --- | --- |
| TypeScript (.ts, .mts, .cts) | typescriptConfigCreate | scopes, symbols, refs, calls, imports, exports, typeRelations | Yes |
| TSX (.tsx) | typescriptConfigCreate | same as TypeScript | Yes |
| JavaScript (.js, .mjs, .cjs, .jsx) | Uses TS/TSX adapter | same as TypeScript | Yes |
| Python (.py, .pyw) | pythonConfigCreate | scopes, symbols, refs, calls, imports, exports | No |

Known Limitations

These are intentional design constraints, not bugs:

  1. No AST exposure -- the index contains semantic primitives, not syntax nodes. Plugins never see tree-sitter trees.
  2. Best-effort resolution -- unresolved references are valid results (returned with resolvedSymbolId: undefined).
  3. No type inference -- tree-sitter alone cannot do type analysis. TypeOf relations are not supported.
  4. Heuristic call detection -- may miss indirect calls (callbacks, dynamic dispatch) and may report false positives.
  5. Single-threaded indexing -- files are indexed sequentially; per-file parallelism is a possible future improvement.
  6. In-memory only -- the index is not persisted to disk, so every run re-indexes the project from scratch, which is most costly for large projects.
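
Because of limitation 2, rule code should treat a missing resolvedSymbolId as an ordinary outcome rather than an error. A minimal sketch of that pattern follows. The field names come from the ReferencesRelation row in the Relations table, while the byteRange tuple shape and the partitionReferences helper are assumptions.

```typescript
// Field names follow the Relations table; the tuple shape of byteRange is assumed.
interface ReferencesRelation {
  name: string;
  byteRange: [number, number];
  resolvedSymbolId?: number;
}

interface PartitionedReferences {
  resolved: ReferencesRelation[];
  unresolved: ReferencesRelation[];
}

// Hypothetical helper: split references by whether resolution succeeded.
function partitionReferences(refs: ReferencesRelation[]): PartitionedReferences {
  const resolved: ReferencesRelation[] = [];
  const unresolved: ReferencesRelation[] = [];
  for (const ref of refs) {
    // An undefined id is a valid "could not resolve" result, not a failure.
    if (ref.resolvedSymbolId === undefined) unresolved.push(ref);
    else resolved.push(ref);
  }
  return { resolved, unresolved };
}

const parts = partitionReferences([
  { name: "alpha", byteRange: [10, 15], resolvedSymbolId: 1 },
  { name: "window", byteRange: [20, 26] }, // e.g. a global the index does not know
]);
console.log(parts.resolved.length, parts.unresolved.length);
```

A rule that only reports on resolved references stays conservative; one that also inspects the unresolved set should expect globals, dynamic names, and anything outside the indexed files to land there.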