---
title: Extraction pipeline
description: 'How XTrace extracts beliefs, episodes, and artifacts from conversations.'
---

# Extraction pipeline

Every conversation that flows through XTrace — whether from the Chrome extension, the API, or a direct integration — passes through the extraction pipeline. This is where raw messages become structured knowledge.

## Pipeline overview

```
Messages → Event building → Batching → Extraction (facts + artifacts) → Resolution → Episode synthesis → Cross-linking → Storage
```

### 1. Event building

Raw messages are normalized into internal events. The system handles multiple input formats: Chrome extension payloads, API `AddMemoriesRequest` bodies, and direct `ContextBus` calls.
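
The normalization step can be sketched roughly as follows. The `Event` shape and the payload keys here are illustrative assumptions, not XTrace's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Event:
    role: str        # "user" or "assistant" (assumed shape)
    content: str
    session_id: str

def build_events(payload: dict) -> list[Event]:
    """Normalize one raw payload (extension, API, or ContextBus) into events."""
    session_id = payload.get("session_id", "default")
    return [
        Event(role=m["role"], content=m["content"], session_id=session_id)
        for m in payload.get("messages", [])
    ]
```

Whatever the source, downstream stages only ever see this one normalized event type.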

### 2. Batching

Long conversations are split into batches based on character limits and session boundaries. Each batch is a window of messages that fits within the LLM's effective context, configured via `batch_max_chars`.
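
A minimal sketch of character-based batching (greedy packing; the real implementation also respects session boundaries):

```python
def batch_messages(messages: list[str], batch_max_chars: int) -> list[list[str]]:
    """Greedily pack messages into batches that stay under batch_max_chars."""
    batches: list[list[str]] = []
    current: list[str] = []
    size = 0
    for msg in messages:
        # Flush the current batch if adding this message would exceed the limit.
        if current and size + len(msg) > batch_max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(msg)
        size += len(msg)
    if current:
        batches.append(current)
    return batches
```

Each resulting batch is one extraction window; a single oversized message still forms its own batch rather than being dropped.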

### 3. Fact extraction

The core extraction step. An LLM processes each batch and produces structured facts — atomic beliefs with metadata:

* **Content**: The belief text, written as a standalone assertion
* **Category**: Classification (preference, decision, project detail, etc.)
* **Tags**: Searchable labels
* **Resolves**: Optional directive to supersede or retract an existing belief (see [Belief revision](/docs/platform-features/belief-revision))

The extraction prompt enforces a **belief test**: only statements that represent genuine, extractable knowledge make it through. Passing mentions, pleasantries, and meta-commentary are filtered out.

### 4. Artifact detection and resolution

In parallel with fact extraction, the pipeline identifies **artifacts** — work products like technical designs, strategies, or documents. Artifact detection classifies whether a conversation segment creates, updates, or references an artifact.

When an artifact is detected, version resolution determines whether it's a new artifact or an update to an existing one. The system maintains version chains so you can trace how a document evolved.
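
Version resolution can be sketched as a lookup against known artifacts, with a version bump on match. This is a simplification assuming title-based matching; the real resolver presumably uses richer signals:

```python
def resolve_version(detected_title: str, known: dict[str, int]) -> tuple[str, int]:
    """Decide whether a detected artifact is new or an update.

    `known` maps artifact titles to their latest version number (assumed shape).
    Returns (title, version): bump the version if known, else start a chain at 1.
    """
    if detected_title in known:
        return detected_title, known[detected_title] + 1
    return detected_title, 1
```

Each bumped version is appended to the artifact's chain, which is what makes the evolution of a document traceable later.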

### 5. Episode synthesis

After facts are extracted, the pipeline synthesizes **episodes** — conversation summaries that capture the arc of a session. Episodes record what was discussed, what was decided, and what's still open. Each episode links to the facts it established and the source events it covers.

Session-level episodes are generated at flush time, summarizing the entire session rather than individual batches.
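
At flush time, episode synthesis amounts to summarizing the session's events and wiring up links, roughly like this (the dict shapes and the `summarize` callable standing in for the LLM call are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    summary: str           # what was discussed, decided, and left open
    fact_ids: list[str]    # facts this episode established
    event_ids: list[str]   # source events it covers

def synthesize_episode(session_events, session_facts, summarize) -> Episode:
    """Build one session-level episode at flush() time (sketch)."""
    return Episode(
        summary=summarize(session_events),
        fact_ids=[f["id"] for f in session_facts],
        event_ids=[e["id"] for e in session_events],
    )
```

Because the whole session is summarized at once, the episode can describe the arc of the conversation rather than isolated batch fragments.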

### 6. Cross-linking

Facts, episodes, and artifacts are linked together:

* Facts reference the episodes they belong to
* Episodes reference the artifacts discussed
* Artifact versions link to previous versions and the facts/decisions that drove changes
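
The three reference directions above can be sketched with plain dicts (the field names are illustrative, not XTrace's storage schema):

```python
def link_records(fact: dict, episode: dict, new_version: dict, prev_version_id: str):
    """Wire up fact → episode, episode → artifact, and version → version links."""
    fact["episode_id"] = episode["id"]
    episode.setdefault("artifact_ids", []).append(new_version["artifact_id"])
    new_version["previous_version_id"] = prev_version_id
    new_version.setdefault("driving_fact_ids", []).append(fact["id"])
    return fact, episode, new_version
```

Traversing these links in either direction is what lets retrieval answer questions like "which decision changed this document?"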

## Streaming vs batch

The pipeline operates in two modes:

| Mode          | Method         | When                                           |
| ------------- | -------------- | ---------------------------------------------- |
| **Streaming** | `stream_round` | Real-time conversation sync (Chrome extension) |
| **Batch**     | `batch_ingest` | Bulk import, API calls                         |

In streaming mode, facts and artifacts are extracted per-round while episodes are deferred to `flush()` at session end. In batch mode, episodes and facts are extracted together.
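
The mode split can be illustrated with a small dispatcher. `RecordingPipeline` is a stub standing in for the real pipeline; only the method names `stream_round`, `batch_ingest`, and `flush` come from the document:

```python
class RecordingPipeline:
    """Stub pipeline for illustration: records which methods were called."""
    def __init__(self):
        self.calls = []
    def stream_round(self, msg):
        self.calls.append(("round", msg))
    def batch_ingest(self, msgs):
        self.calls.append(("batch", len(msgs)))
    def flush(self):
        self.calls.append(("flush",))

def ingest(messages, mode, pipeline):
    """Dispatch to streaming or batch ingestion (sketch)."""
    if mode == "streaming":
        for msg in messages:
            pipeline.stream_round(msg)   # facts + artifacts per round
        pipeline.flush()                 # episodes deferred to session end
    else:
        pipeline.batch_ingest(messages)  # facts + episodes together
```

The practical difference is latency: streaming makes facts available immediately, while episodes only exist once the session flushes.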

## Context awareness

The extraction LLM doesn't operate in a vacuum. It receives:

* **Session context**: Previously extracted facts from the current session, so it doesn't re-extract the same beliefs
* **Retrieved facts**: Relevant existing beliefs from the store, merged into the context window so the LLM can detect conflicts and attach `resolves` directives
* **Artifact summaries**: Known artifacts to reduce false version detections when the model echoes retrieved content

This context window is bounded by `session_context_max_facts` to keep prompts within token limits.
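
Assembling that bounded context might look like the following sketch, where session facts take priority and retrieved facts fill the remaining budget (the priority order is an assumption):

```python
def build_context(session_facts, retrieved_facts, artifact_summaries, max_facts):
    """Assemble the extraction context, capped at session_context_max_facts."""
    facts = session_facts[:max_facts]
    remaining = max(0, max_facts - len(facts))
    facts = facts + retrieved_facts[:remaining]
    return {"facts": facts, "artifacts": artifact_summaries}
```

Capping the fact count keeps the prompt within token limits even for long sessions with large belief stores.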

## Configuration

Key settings that control extraction behavior:

| Parameter                   | What it controls                                    |
| --------------------------- | --------------------------------------------------- |
| `extraction_prompt_variant` | Which extraction prompt template to use             |
| `batch_max_chars`           | Maximum characters per extraction batch             |
| `extraction_concurrency`    | Parallel extraction calls                           |
| `session_context_max_facts` | How many session facts are visible to the LLM       |
| `auto_store_artifacts`      | Whether detected artifacts are stored automatically |
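
Grouped as a config object, these settings might look like this. The default values shown are purely illustrative, not XTrace's documented defaults:

```python
from dataclasses import dataclass

@dataclass
class ExtractionConfig:
    """Extraction settings from the table above (defaults are illustrative)."""
    extraction_prompt_variant: str = "default"
    batch_max_chars: int = 8000
    extraction_concurrency: int = 4
    session_context_max_facts: int = 50
    auto_store_artifacts: bool = True
```

Overriding a single field leaves the rest at their defaults, e.g. `ExtractionConfig(batch_max_chars=4000)`.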
