Extraction pipeline
Every conversation that flows through XTrace — whether from the Chrome extension, the API, or a direct integration — passes through the extraction pipeline. This is where raw messages become structured knowledge.
Pipeline overview
1. Event building
Raw messages are normalized into internal events. The system handles multiple input formats: Chrome extension payloads, API AddMemoriesRequest bodies, and direct ContextBus calls.
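Normalization can be sketched as mapping each input format onto a common event record. This is a minimal sketch: the `Event` fields and the assumption that payloads carry a `messages` list of `{role, content}` dicts are illustrative, not XTrace's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """Normalized internal event; field names are illustrative."""
    role: str      # "user" or "assistant"
    content: str
    source: str    # "chrome_extension", "api", or "context_bus"

def build_events(payload: dict, source: str) -> list[Event]:
    # Assumes the payload carries a "messages" list of {role, content}
    # dicts; the real wire formats (AddMemoriesRequest etc.) may differ.
    return [Event(role=m["role"], content=m["content"], source=source)
            for m in payload.get("messages", [])]
```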
2. Batching
Long conversations are split into batches based on character limits and session boundaries. Each batch is a window of messages that fits within the LLM’s effective context, configured via batch_max_chars.
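A character-budget batcher along these lines would implement the splitting described above. This is a greedy sketch over plain strings; session-boundary splitting is omitted, and the default for `batch_max_chars` is an assumption.

```python
def batch_messages(messages: list[str],
                   batch_max_chars: int = 8000) -> list[list[str]]:
    """Greedy batching under a character budget (session-boundary
    splitting, also used by the pipeline, is omitted for brevity)."""
    batches: list[list[str]] = []
    current: list[str] = []
    size = 0
    for msg in messages:
        if current and size + len(msg) > batch_max_chars:
            batches.append(current)      # current batch is full
            current, size = [], 0
        current.append(msg)
        size += len(msg)
    if current:
        batches.append(current)
    return batches
```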
3. Fact extraction
The core extraction step. An LLM processes each batch and produces structured facts — atomic beliefs with metadata:
- Content: The belief text, written as a standalone assertion
- Category: Classification (preference, decision, project detail, etc.)
- Tags: Searchable labels
- Resolves: Optional directive to supersede or retract an existing belief (see Belief revision)
The extraction prompt enforces a belief test: only statements that represent genuine, extractable knowledge make it through. Passing mentions, pleasantries, and meta-commentary are filtered out.
4. Artifact detection and resolution
In parallel with fact extraction, the pipeline identifies artifacts — work products like technical designs, strategies, or documents. Artifact detection classifies whether a conversation segment creates, updates, or references an artifact.
When an artifact is detected, version resolution determines whether it’s a new artifact or an update to an existing one. The system maintains version chains so you can trace how a document evolved.
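Version resolution reduces to a decision between starting a new chain and extending an existing one. In this sketch, matching on title is a stand-in for the real resolution logic, and the `previous` link is how the version chain is represented here, assuming dict-shaped artifact records.

```python
def resolve_version(title: str, existing: list[dict]) -> dict:
    """Return a new artifact record, either version 1 of a fresh chain
    or the next version linked to its predecessor."""
    for art in existing:
        if art["title"] == title:
            return {"title": title,
                    "version": art["version"] + 1,
                    "previous": art}    # chain link back to prior version
    return {"title": title, "version": 1, "previous": None}
```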
5. Episode synthesis
After facts are extracted, the pipeline synthesizes episodes — conversation summaries that capture the arc of a session. Episodes record what was discussed, what was decided, and what’s still open. Each episode links to the facts it established and the source events it covers.
Session-level episodes are generated at flush time, summarizing the entire session rather than individual batches.
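An episode record and a flush-time synthesis step might look like the sketch below. The field names are assumptions, and the templated summary is a placeholder: the real pipeline would prompt the LLM with the whole session.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """Session-level episode; summary text would come from an LLM."""
    summary: str
    decisions: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    fact_ids: list[str] = field(default_factory=list)   # facts it established
    event_ids: list[str] = field(default_factory=list)  # source events covered

def flush(session_events: list[dict], session_facts: list[dict]) -> Episode:
    # Placeholder synthesis: links the episode to its facts and events.
    return Episode(
        summary=f"Session covering {len(session_events)} events",
        fact_ids=[f["id"] for f in session_facts],
        event_ids=[e["id"] for e in session_events],
    )
```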
6. Cross-linking
Facts, episodes, and artifacts are linked together:
- Facts reference the episodes they belong to
- Episodes reference the artifacts discussed
- Artifact versions link to previous versions and the facts/decisions that drove changes
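The three links above can be wired up in one pass. The id-based references and exact field names in this sketch are assumptions about the storage model.

```python
def cross_link(facts: list[dict], episode: dict, artifacts: list[dict]) -> None:
    """Mutate the records in place to add the three reference directions."""
    for fact in facts:
        fact["episode_id"] = episode["id"]                   # fact -> episode
    episode["artifact_ids"] = [a["id"] for a in artifacts]   # episode -> artifacts
    for art in artifacts:
        # artifact version -> facts/decisions that drove the change
        art["driving_fact_ids"] = [f["id"] for f in facts]
```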
Streaming vs batch
The pipeline operates in two modes:
In streaming mode, facts and artifacts are extracted per-round while episodes are deferred to flush() at session end. In batch mode, episodes and facts are extracted together.
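The mode split can be sketched as a dispatch over a per-round call. The extractor callables are injected here because their real signatures are not documented; only the deferral behavior follows the description above.

```python
def process_round(events: list, mode: str, state: dict,
                  extract, detect, synthesize) -> tuple:
    """Run one round; `extract`, `detect`, `synthesize` are hypothetical
    callables standing in for the pipeline's extraction stages."""
    facts = extract(events)        # facts extracted in both modes
    artifacts = detect(events)     # artifact detection in both modes
    if mode == "streaming":
        # Episodes deferred: accumulate events for flush() at session end.
        state.setdefault("pending_events", []).extend(events)
    else:
        state["episode"] = synthesize(events, facts)
    return facts, artifacts
```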
Context awareness
The extraction LLM doesn’t operate in a vacuum. It receives:
- Session context: Previously extracted facts from the current session, so it doesn’t re-extract the same beliefs
- Retrieved facts: Relevant existing beliefs from the store, merged into the context window so the LLM can detect conflicts and attach resolves directives
- Artifact summaries: Known artifacts, included to reduce false version detections when the model echoes retrieved content
This context window is bounded by session_context_max_facts to keep prompts within token limits.
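Bounding the context window could look like the sketch below. Keeping the most recent session facts is an assumption about how the cap is applied, as is the default value.

```python
def build_prompt_context(session_facts: list[str],
                         retrieved_facts: list[str],
                         session_context_max_facts: int = 50) -> list[str]:
    """Cap the session-context portion of the prompt; retrieved facts
    are appended after the capped slice."""
    return session_facts[-session_context_max_facts:] + retrieved_facts
```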
Configuration
Key settings that control extraction behavior:
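The two settings named in this section can be grouped into a single config object, as in this sketch; the default values are illustrative, not XTrace's actual defaults.

```python
from dataclasses import dataclass

@dataclass
class ExtractionConfig:
    """Settings referenced in this section (defaults are assumptions)."""
    batch_max_chars: int = 8000          # character budget per batch (step 2)
    session_context_max_facts: int = 50  # cap on session facts in the prompt
```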