OpenViking uses a three-stage async architecture for document parsing and context extraction, separating fast parsing from slow semantic generation for optimal performance.

Overview

Design Principle: Parsing and semantics are separated. The parser never calls the LLM; semantic generation runs asynchronously.
1. Parser: Parse documents, create file and directory structure (no LLM calls)
2. TreeBuilder: Move temp directory to AGFS, queue for semantic processing
3. SemanticQueue: Async bottom-up L0/L1 generation (uses VLM)
4. Vector Index: Index generated L0/L1 for semantic search
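
The four stages above decouple cheap parsing from expensive semantic work through a queue. A minimal sketch of that hand-off, with stand-in functions (`parse`, `finalize`, and `semantic_worker` here are illustrative, not OpenViking's API):

```python
import asyncio
from uuid import uuid4

async def parse(path: str) -> str:
    """Stage 1 stand-in: parse into a temp directory (no LLM calls)."""
    return f"viking://temp/{uuid4().hex[:6]}"

async def finalize(temp_uri: str, queue: asyncio.Queue) -> str:
    """Stage 2 stand-in: move to AGFS, then queue semantic processing."""
    target = temp_uri.replace("viking://temp/", "viking://resources/")
    await queue.put(target)                 # hand off to stage 3
    return target

async def semantic_worker(queue: asyncio.Queue, results: list) -> None:
    """Stage 3 stand-in: async L0/L1 generation, decoupled from parsing."""
    uri = await queue.get()
    results.append((uri, "L0/L1 generated"))
    queue.task_done()

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    temp_uri = await parse("/path/to/doc.pdf")
    target = await finalize(temp_uri, queue)
    await semantic_worker(queue, results)   # in practice, a background task
    return target, results

target, results = asyncio.run(main())
print(target, results[0][1])
```

The point of the queue is that a slow stage 3 never blocks stages 1 and 2; in the real system the worker runs continuously in the background.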

Stage 1: Parser

The Parser handles document format conversion and structuring, creating the file structure in a temporary directory.

Supported Formats

| Format | Parser | Extensions | Features |
|---|---|---|---|
| Markdown | MarkdownParser | .md, .markdown | Header-based splitting |
| Plain text | TextParser | .txt | Simple text parsing |
| PDF | PDFParser | .pdf | Text + image extraction |
| HTML | HTMLParser | .html, .htm | DOM-based parsing |

Core Flow

from openviking.parse import ParserRegistry

registry = ParserRegistry()

# 1. Parse file
parse_result = await registry.parse("/path/to/doc.pdf")

# 2. Returns temp directory URI
print(parse_result.temp_dir_path)  # viking://temp/abc123
print(parse_result.source_format)  # "pdf"
print(parse_result.parser_name)    # "PDFParser"

Smart Splitting

Parser automatically splits documents based on size:
if document_tokens <= 1024:
    # Save as single file
    save_as_single_file(content)
else:
    # Split by headers
    sections = split_by_headers(content)
    
    for section in sections:
        if section.tokens < 512:
            # Merge small sections
            merge_with_next(section)
        elif section.tokens > 1024:
            # Create subdirectory for large sections
            create_subdirectory(section)
        else:
            # Save as individual file
            save_section(section)
This ensures each file is appropriately sized for LLM processing while maintaining document structure.
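
The thresholds above can be exercised with a small self-contained sketch. A whitespace count stands in for real token counting, and `plan_layout` and its action names are illustrative, not OpenViking's API:

```python
# Illustrative sketch of the size-based splitting policy.
# A whitespace word count stands in for real token counting.

SINGLE_FILE_MAX = 1024
MERGE_BELOW = 512

def count_tokens(text: str) -> int:
    return len(text.split())

def plan_layout(sections: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Decide, per section, whether to merge it into a neighbor, split it
    into a subdirectory, or save it as a file. Returns (title, action)."""
    plan = []
    for title, body in sections:
        tokens = count_tokens(body)
        if tokens < MERGE_BELOW:
            plan.append((title, "merge_with_next"))
        elif tokens > SINGLE_FILE_MAX:
            plan.append((title, "create_subdirectory"))
        else:
            plan.append((title, "save_section"))
    return plan

sections = [
    ("Authentication", "word " * 800),
    ("Changelog", "word " * 100),
    ("Endpoints", "word " * 1500),
]
plan = plan_layout(sections)
print(plan)
# [('Authentication', 'save_section'), ('Changelog', 'merge_with_next'),
#  ('Endpoints', 'create_subdirectory')]
```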

Example: Markdown Parsing

Input:
# API Documentation

## Authentication
### OAuth 2.0
[... 800 tokens ...]

### JWT Tokens  
[... 600 tokens ...]

## Endpoints
### User Management
[... 1500 tokens ...]
Output structure:
viking://temp/abc123/
└── API Documentation/
    ├── Authentication/
    │   ├── oauth.md           # 800 tokens
    │   └── jwt.md             # 600 tokens
    └── Endpoints/
        └── User Management/   # Subdirectory (>1024 tokens)
            └── content.md

ParseResult

from dataclasses import dataclass
from typing import Dict

@dataclass
class ParseResult:
    temp_dir_path: str    # Temp directory URI (viking://temp/xxx)
    source_format: str    # Source format (pdf/markdown/html)
    parser_name: str      # Parser class name
    parse_time: float     # Duration in seconds
    meta: Dict            # Additional metadata

Stage 2: TreeBuilder

TreeBuilder moves the temp directory into AGFS and queues it for semantic processing.

5-Phase Processing

1. Find document root: ensure exactly one subdirectory exists in temp (the parsed document)

temp_contents = await agfs.list_directory(temp_dir_path)
assert len(temp_contents) == 1, "Must have exactly one root directory"
doc_root = temp_contents[0]

2. Determine target URI: map the base URI by scope

| Scope | Base URI |
|---|---|
| resources | viking://resources/ |
| user | viking://user/{user_id}/ |
| agent | viking://agent/{agent_id}/ |

3. Recursively move the directory tree: move all files from temp into AGFS

await agfs.move_recursive(
    source=f"{temp_dir_path}/{doc_root}",
    target=target_uri
)

4. Clean up the temp directory: delete remaining temp files

await agfs.rm(temp_dir_path, recursive=True)

5. Queue semantic generation: submit a SemanticMsg to the queue

semantic_msg = SemanticMsg(
    id=str(uuid4()),
    uri=target_uri,
    context_type="resource",
    status="pending"
)
await semantic_queue.enqueue(semantic_msg)

Usage Example

from openviking.parse.tree_builder import TreeBuilder

tree_builder = TreeBuilder(agfs, semantic_queue)

# Finalize parsed document
building_tree = await tree_builder.finalize_from_temp(
    temp_dir_path="viking://temp/abc123",
    scope="resources",  # or "user", "agent"
    target_name="my-api-docs"  # Optional custom name
)

print(f"Moved to: {building_tree.target_uri}")
# Output: viking://resources/my-api-docs/

Stage 3: SemanticQueue

SemanticQueue handles async L0/L1 generation and vectorization using VLM.

Processing Flow (Bottom-up)

Processing starts from leaf files and moves upward, aggregating child abstracts into parent overviews.
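
Bottom-up ordering is a post-order traversal: every child directory is processed before its parent, so the parent's overview can aggregate child abstracts. A minimal sketch, with a plain dict standing in for the AGFS tree (all names here are illustrative):

```python
def process_bottom_up(tree: dict, path: str, order: list) -> str:
    """Post-order traversal: recurse into subdirectories first, then
    process this directory. Returns this directory's abstract (stub)."""
    child_abstracts = []
    for name, subtree in tree.items():
        if isinstance(subtree, dict):          # subdirectory, not a file
            child_abstracts.append(
                process_bottom_up(subtree, f"{path}/{name}", order))
    order.append(path)                          # parent is processed last
    return f"abstract of {path} ({len(child_abstracts)} children)"

tree = {
    "Authentication": {"oauth.md": None, "jwt.md": None},
    "Endpoints": {"User Management": {"content.md": None}},
}
order: list = []
process_bottom_up(tree, "doc", order)
print(order)
# ['doc/Authentication', 'doc/Endpoints/User Management', 'doc/Endpoints', 'doc']
```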

Single Directory Processing Steps

1. Concurrent file summary generation: generate summaries for all files in the directory (max 10 concurrent)

# Limit concurrent LLM calls to avoid rate limits
semaphore = asyncio.Semaphore(max_concurrent_llm)  # 10

async with semaphore:
    summary = await vlm.summarize(file_content)

2. Collect child directory abstracts: read the generated .abstract.md from each subdirectory

child_abstracts = []
for subdir in subdirectories:
    abstract = await agfs.read_file(
        f"{subdir}/.abstract.md"
    )
    child_abstracts.append(abstract)

3. Generate .overview.md: the LLM generates the L1 overview from file summaries and child abstracts

overview = await vlm.generate_overview(
    file_summaries=file_summaries,
    child_abstracts=child_abstracts,
    max_tokens=2000
)

4. Extract .abstract.md: extract the L0 abstract (first 1-2 sentences) from the overview

abstract = await vlm.extract_abstract(
    overview=overview,
    max_tokens=100
)

5. Write files and vectorize: save to AGFS and create vector index entries

# Write L0/L1 to AGFS
await agfs.write_file(
    f"{uri}/.abstract.md", abstract
)
await agfs.write_file(
    f"{uri}/.overview.md", overview
)

# Enqueue for vectorization
for level, text in [(0, abstract), (1, overview)]:
    context = Context(
        uri=uri,
        level=level,
        abstract=abstract,
        ...
    )
    context.set_vectorize(Vectorize(text=text))
    await embedding_queue.enqueue(context)

Configuration Parameters

{
  "vlm": {
    "max_concurrent": 100  // Max concurrent LLM calls
  },
  "semantic": {
    "max_concurrent_llm": 10,       // Per-directory concurrent calls
    "max_images_per_call": 10,      // Max images per VLM call
    "max_sections_per_call": 20     // Max sections per VLM call
  }
}

SemanticMsg Structure

from dataclasses import dataclass
from datetime import datetime

@dataclass
class SemanticMsg:
    id: str               # UUID
    uri: str              # Directory URI
    context_type: str     # resource/memory/skill
    status: str           # pending/processing/completed/failed
    created_at: datetime
    updated_at: datetime

Code Skeleton Extraction (AST Mode)

For code files, OpenViking supports AST-based skeleton extraction via tree-sitter as a lightweight alternative to LLM summarization.
AST mode significantly reduces processing cost by extracting structural information without LLM calls.
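
The skeleton-extraction idea can be illustrated with Python's built-in ast module. This is a pure-Python sketch of the technique, not OpenViking's implementation (which uses tree-sitter to support many languages); `extract_skeleton` is illustrative only and ignores edge cases such as decorators:

```python
import ast

def extract_skeleton(source: str) -> str:
    """Keep the module docstring, imports, class/def signatures, and
    docstrings; drop function bodies. Illustrative sketch only."""
    module = ast.parse(source)
    lines = []
    mod_doc = ast.get_docstring(module)
    if mod_doc:
        lines.append(f"# {mod_doc}")

    def visit(node: ast.AST, indent: str = "") -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.Import, ast.ImportFrom)):
                lines.append(indent + ast.unparse(child))
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef,
                                    ast.ClassDef)):
                # First line of the unparsed node is its signature
                lines.append(indent + ast.unparse(child).splitlines()[0])
                doc = ast.get_docstring(child)
                if doc:
                    lines.append(f'{indent}    """{doc}"""')
                visit(child, indent + "    ")

    visit(module)
    return "\n".join(lines)

source = '''"""Hierarchical retriever for OpenViking."""
import heapq

class HierarchicalRetriever:
    """Hierarchical retriever."""

    def retrieve(self, query, limit=5):
        """Execute hierarchical retrieval."""
        return []
'''
out = extract_skeleton(source)
print(out)
```

Because only the parser runs, this costs microseconds per file instead of an LLM round trip, which is the motivation for the ast mode below.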

Modes

Controlled by code_summary_mode in ov.conf:
| Mode | Description | LLM Usage |
|---|---|---|
| ast | Extract structural skeleton for files ≥100 lines | None (default) |
| llm | Always use LLM for summarization | High |
| ast_llm | Extract AST skeleton first, then pass to LLM | Reduced |
{
  "code": {
    "summary_mode": "ast"  // or "llm", "ast_llm"
  }
}

What AST Extracts

# Input: hierarchical_retriever.py
"""Hierarchical retriever for OpenViking."""

import heapq
from typing import List

class HierarchicalRetriever:
    """Hierarchical retriever with dense and sparse vector support."""
    
    def __init__(self, storage, embedder):
        """Initialize hierarchical retriever."""
        self.storage = storage
    
    async def retrieve(self, query, limit=5):
        """Execute hierarchical retrieval."""
        # ... implementation ...
Extracted skeleton:
# Hierarchical retriever for OpenViking.

import heapq
from typing import List

class HierarchicalRetriever:
    """Hierarchical retriever with dense and sparse vector support."""
    
    def __init__(self, storage, embedder):
        """Initialize hierarchical retriever."""
    
    async def retrieve(self, query, limit=5):
        """Execute hierarchical retrieval."""

Fallback Behavior

AST extraction automatically falls back to LLM when:
  • Language not in supported list
  • File has fewer than 100 lines
  • AST parse error occurs
  • Extraction produces empty skeleton
Fallback is automatic and logged. The overall pipeline continues without interruption.
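
The fallback rules above can be sketched as a small wrapper. Everything here (`summarize_code`, `MIN_LINES_FOR_AST`, the supported-language set) is a hypothetical stand-in for illustration, not OpenViking's API:

```python
MIN_LINES_FOR_AST = 100                          # shorter files go straight to LLM
SUPPORTED_LANGUAGES = {"python", "go", "rust"}   # illustrative subset

def summarize_code(source, language, ast_extract, llm_summarize):
    """Try AST skeleton extraction; fall back to LLM summarization on an
    unsupported language, short file, parse error, or empty skeleton.
    Returns (summary, mode_used)."""
    line_count = source.count("\n") + 1
    if language in SUPPORTED_LANGUAGES and line_count >= MIN_LINES_FOR_AST:
        try:
            skeleton = ast_extract(source)
            if skeleton.strip():                 # empty skeleton also falls back
                return skeleton, "ast"
        except Exception:
            pass                                 # parse error: log, then fall back
    return llm_summarize(source), "llm"

# Usage: a short file skips AST extraction entirely
summary, mode = summarize_code(
    "x = 1", "python",
    ast_extract=lambda s: s,
    llm_summarize=lambda s: "llm summary",
)
print(mode)  # llm
```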

Three Context Types Extraction

Different context types follow the same pipeline with different target URIs:
# Add resource
await client.add_resource(
    "/path/to/doc.pdf",
    reason="API documentation"
)

# Flow:
# 1. Parser -> viking://temp/abc123/
# 2. TreeBuilder(scope=resources) -> viking://resources/doc/
# 3. SemanticQueue -> Generate L0/L1
# 4. Vector Index -> Index for search

Complete Example

from openviking import OpenViking

client = OpenViking()

# Add resource (triggers full extraction pipeline)
await client.add_resource(
    "https://example.com/api-docs.pdf",
    reason="API documentation"
)

# Wait for semantic processing to complete (optional)
await client.wait_processed()

# Now L0/L1 are generated and indexed
results = await client.find("authentication")

for ctx in results.resources:
    # L0 abstract available immediately
    print(f"Abstract: {ctx.abstract}")
    
    # L1 overview loaded on demand
    overview = await client.overview(ctx.uri)
    print(f"Overview: {overview}")
    
    # L2 full content loaded on demand
    if need_details:  # e.g. a flag your application sets
        content = await client.read(ctx.uri)
        print(f"Content: {content}")

Related Pages

  • Architecture: System architecture and data flow
  • Context Layers: L0/L1/L2 model details
  • Storage: AGFS and vector index
  • Session: Memory extraction details