Extraction pipeline

Every conversation that flows through XTrace — whether from the Chrome extension, the API, or a direct integration — passes through the extraction pipeline. This is where raw messages become structured knowledge.

Pipeline overview

Messages → Event building → Batching → Extraction (facts + artifacts) → Episode synthesis → Resolution → Storage

1. Event building

Raw messages are normalized into internal events. The system handles multiple input formats: Chrome extension payloads, API AddMemoriesRequest bodies, and direct ContextBus calls.
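A minimal sketch of this normalization step. The wrapper keys (`extension_payload`, `memory`) and the event shape are illustrative assumptions, not the real schema:

```python
# Hypothetical event building: normalize the three input shapes named
# above into one internal event dict. Wrapper keys are assumptions.
def build_event(raw: dict) -> dict:
    # Chrome extension payloads and API request bodies wrap the message;
    # direct ContextBus calls are assumed to pass it bare.
    payload = raw.get("extension_payload") or raw.get("memory") or raw
    return {"role": payload.get("role", "user"),
            "text": payload.get("text", "")}
```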

2. Batching

Long conversations are split into batches based on character limits and session boundaries. Each batch is a window of messages that fits within the LLM’s effective context, configured via batch_max_chars.
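The batching rule can be sketched as follows, assuming each message carries a session id; all names besides `batch_max_chars` are illustrative:

```python
# Sketch of batching: group messages into windows that respect a
# character budget (batch_max_chars) and never cross a session boundary.
from dataclasses import dataclass

@dataclass
class Message:
    session_id: str
    text: str

def build_batches(messages, batch_max_chars=8000):
    batches, current, size = [], [], 0
    for msg in messages:
        new_session = current and current[-1].session_id != msg.session_id
        over_budget = current and size + len(msg.text) > batch_max_chars
        if new_session or over_budget:
            batches.append(current)      # close the current window
            current, size = [], 0
        current.append(msg)
        size += len(msg.text)
    if current:
        batches.append(current)
    return batches
```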

3. Fact extraction

The core extraction step. An LLM processes each batch and produces structured facts — atomic beliefs with metadata:

  • Content: The belief text, written as a standalone assertion
  • Category: Classification (preference, decision, project detail, etc.)
  • Tags: Searchable labels
  • Resolves: Optional directive to supersede or retract an existing belief (see Belief revision)

The extraction prompt enforces a belief test: only statements that represent genuine, extractable knowledge make it through. Passing mentions, pleasantries, and meta-commentary are filtered out.
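The fact shape described above might look like this as a data structure; the exact field types are an assumption based on the list, not the real schema:

```python
# Illustrative shape of an extracted fact; field names follow the
# bullets above, types and the resolves payload are assumptions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Fact:
    content: str                     # standalone assertion (belief text)
    category: str                    # e.g. "preference", "decision"
    tags: list[str] = field(default_factory=list)
    resolves: Optional[dict] = None  # optional supersede/retract directive
```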

4. Artifact detection and resolution

In parallel with fact extraction, the pipeline identifies artifacts — work products like technical designs, strategies, or documents. Artifact detection classifies whether a conversation segment creates, updates, or references an artifact.

When an artifact is detected, version resolution determines whether it’s a new artifact or an update to an existing one. The system maintains version chains so you can trace how a document evolved.
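One way version resolution could work, sketched with a title-match heuristic; the matching strategy and return shape are assumptions:

```python
# Hypothetical version resolution: an artifact matching a known title
# extends that version chain, otherwise a new chain starts at v1.
def resolve_version(detected_title, known_artifacts):
    """known_artifacts maps title -> latest version number."""
    if detected_title in known_artifacts:
        version = known_artifacts[detected_title] + 1
        known_artifacts[detected_title] = version
        return ("update", version)
    known_artifacts[detected_title] = 1
    return ("new", 1)
```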

5. Episode synthesis

After facts are extracted, the pipeline synthesizes episodes — conversation summaries that capture the arc of a session. Episodes record what was discussed, what was decided, and what’s still open. Each episode links to the facts it established and the source events it covers.

Session-level episodes are generated at flush time, summarizing the entire session rather than individual batches.

6. Cross-linking

Facts, episodes, and artifacts are linked together:

  • Facts reference the episodes they belong to
  • Episodes reference the artifacts discussed
  • Artifact versions link to previous versions and the facts/decisions that drove changes
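The link graph described by these bullets, pictured with plain dicts; the real storage schema is not specified here, so every key is illustrative:

```python
# Illustrative cross-links between a fact, its episode, and an
# artifact version chain. Keys are assumptions, not the real schema.
fact = {"id": "f1", "episode_id": "e1"}
episode = {"id": "e1", "fact_ids": ["f1"], "artifact_ids": ["a2"]}
artifact_v2 = {"id": "a2", "previous_version": "a1", "driven_by": ["f1"]}
```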

Streaming vs batch

The pipeline operates in two modes:

Mode        Method         When
Streaming   stream_round   Real-time conversation sync (Chrome extension)
Batch       batch_ingest   Bulk import, API calls

In streaming mode, facts and artifacts are extracted per-round while episodes are deferred to flush() at session end. In batch mode, episodes and facts are extracted together.
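The difference between the two modes can be sketched under the method names from the table (`stream_round`, `batch_ingest`, `flush`); everything inside is a stand-in for the real implementation, with the LLM extractor stubbed out:

```python
# Sketch of the two ingestion modes. Streaming extracts facts per round
# and defers the episode to flush(); batch produces both together.
class Pipeline:
    def __init__(self, extract_facts):
        self.extract_facts = extract_facts  # stand-in for the LLM call
        self.session_facts = []
        self.pending_events = []

    def stream_round(self, messages):
        facts = self.extract_facts(messages)
        self.session_facts.extend(facts)
        self.pending_events.extend(messages)  # buffered for flush()
        return facts

    def flush(self):
        # Session end: one episode summarizing everything buffered.
        episode = {"events": len(self.pending_events),
                   "facts": list(self.session_facts)}
        self.pending_events.clear()
        return episode

    def batch_ingest(self, messages):
        facts = self.extract_facts(messages)
        return facts, {"events": len(messages), "facts": facts}
```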

Context awareness

The extraction LLM doesn’t operate in a vacuum. It receives:

  • Session context: Previously extracted facts from the current session, so it doesn’t re-extract the same beliefs
  • Retrieved facts: Relevant existing beliefs from the store, merged into the context window so the LLM can detect conflicts and attach resolves directives
  • Artifact summaries: Known artifacts to reduce false version detections when the model echoes retrieved content

This context window is bounded by session_context_max_facts to keep prompts within token limits.
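Context assembly might look like the following, with `session_context_max_facts` capping the session slice; the function name and returned structure are assumptions:

```python
# Sketch of extraction context assembly, bounded as described above.
# Keeps the most recent session facts when over the cap.
def build_extraction_context(session_facts, retrieved_facts,
                             artifact_summaries,
                             session_context_max_facts=50):
    return {
        "session_facts": session_facts[-session_context_max_facts:],
        "retrieved_facts": retrieved_facts,      # for conflict detection
        "artifact_summaries": artifact_summaries,  # curbs false versions
    }
```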

Configuration

Key settings that control extraction behavior:

Parameter                   What it controls
extraction_prompt_variant   Which extraction prompt template to use
batch_max_chars             Maximum characters per extraction batch
extraction_concurrency      Number of parallel extraction calls
session_context_max_facts   How many session facts are visible to the LLM
auto_store_artifacts        Whether detected artifacts are stored automatically
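Grouped as a config object, these settings might look like this; the defaults shown are placeholders, not the real values:

```python
# Hypothetical config grouping the parameters above. Field names match
# the table; default values are placeholders.
from dataclasses import dataclass

@dataclass
class ExtractionConfig:
    extraction_prompt_variant: str = "default"
    batch_max_chars: int = 8000
    extraction_concurrency: int = 4
    session_context_max_facts: int = 50
    auto_store_artifacts: bool = True
```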