OpenViking uses a three-stage async architecture for document parsing and context extraction, separating fast parsing from slow semantic generation for optimal performance.
Overview
Design Principle: Parsing and semantics are separated. The parser doesn't call the LLM; semantic generation is async.
1. Parser: Parse documents and create the file and directory structure (no LLM calls)
2. TreeBuilder: Move the temp directory to AGFS and queue it for semantic processing
3. SemanticQueue: Async bottom-up L0/L1 generation (uses the VLM)
4. Vector Index: Index the generated L0/L1 for semantic search
Stage 1: Parser
The Parser handles document format conversion and structuring, creating the file structure in a temp directory.
Documents

| Format | Parser | Extensions | Features |
|---|---|---|---|
| Markdown | MarkdownParser | .md, .markdown | Header-based splitting |
| Plain text | TextParser | .txt | Simple text parsing |
| PDF | PDFParser | .pdf | Text + image extraction |
| HTML | HTMLParser | .html, .htm | DOM-based parsing |

Code

| Language | Extensions | AST Support |
|---|---|---|
| Python | .py | ✅ tree-sitter |
| JavaScript | .js, .jsx | ✅ tree-sitter |
| TypeScript | .ts, .tsx | ✅ tree-sitter |
| Rust | .rs | ✅ tree-sitter |
| Go | .go | ✅ tree-sitter |
| Java | .java | ✅ tree-sitter |
| C/C++ | .c, .cpp, .h, .hpp | ✅ tree-sitter |

For code files, see Code Skeleton Extraction below.

Multimedia

| Format | Parser | Extensions | Features |
|---|---|---|---|
| Image | ImageParser | .png, .jpg, .jpeg, .gif | VLM description |
| Video | VideoParser | .mp4, .avi, .mov | Segmentation + transcription |
| Audio | AudioParser | .mp3, .wav, .m4a | Transcription |
Core Flow
from openviking.parse import ParserRegistry

registry = ParserRegistry()

# 1. Parse file
parse_result = await registry.parse("/path/to/doc.pdf")

# 2. Returns temp directory URI
print(parse_result.temp_dir_path)  # viking://temp/abc123
print(parse_result.source_format)  # "pdf"
print(parse_result.parser_name)    # "PDFParser"
Smart Splitting
The parser automatically splits documents based on size:
if document_tokens <= 1024:
    # Save as single file
    save_as_single_file(content)
else:
    # Split by headers
    sections = split_by_headers(content)
    for section in sections:
        if section.tokens < 512:
            # Merge small sections
            merge_with_next(section)
        elif section.tokens > 1024:
            # Create subdirectory for large sections
            create_subdirectory(section)
        else:
            # Save as individual file
            save_section(section)
This ensures each file is appropriately sized for LLM processing while maintaining document structure.
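The splitting policy above can be sketched as a small pure function. This is an illustrative sketch, not OpenViking's actual API: `plan_split` and its return labels are hypothetical names, and the thresholds mirror the pseudocode above.

```python
# Sketch of the size-based splitting policy described above.
# plan_split and its labels are illustrative, not OpenViking's API.

SINGLE_FILE_MAX = 1024   # documents at or below this stay one file
MERGE_THRESHOLD = 512    # sections below this merge into the next

def plan_split(document_tokens: int, section_tokens: list) -> list:
    """Return one plan entry per section: merge / subdirectory / file."""
    if document_tokens <= SINGLE_FILE_MAX:
        return ["single-file"]
    plan = []
    for tokens in section_tokens:
        if tokens < MERGE_THRESHOLD:
            plan.append("merge-with-next")
        elif tokens > SINGLE_FILE_MAX:
            plan.append("subdirectory")
        else:
            plan.append("file")
    return plan
```

Keeping the decision a pure function of token counts makes the policy easy to unit-test apart from the parsers themselves.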
Example: Markdown Parsing
Input:
# API Documentation
## Authentication
### OAuth 2.0
[... 800 tokens ...]
### JWT Tokens
[... 600 tokens ...]
## Endpoints
### User Management
[... 1500 tokens ...]
Output structure:
viking://temp/abc123/
└── API Documentation/
    ├── Authentication/
    │   ├── oauth.md          # 800 tokens
    │   └── jwt.md            # 600 tokens
    └── Endpoints/
        └── User Management/  # Subdirectory (>1024 tokens)
            └── content.md
ParseResult
from dataclasses import dataclass
from typing import Dict

@dataclass
class ParseResult:
    temp_dir_path: str   # Temp directory URI (viking://temp/xxx)
    source_format: str   # Source format (pdf/markdown/html)
    parser_name: str     # Parser class name
    parse_time: float    # Duration in seconds
    meta: Dict           # Additional metadata
Stage 2: TreeBuilder
TreeBuilder moves the temp directory to AGFS and queues semantic processing.
5-Phase Processing
1. Find document root

   Ensure exactly one subdirectory exists in temp (the parsed document):

   temp_contents = await agfs.list_directory(temp_dir_path)
   assert len(temp_contents) == 1, "Must have exactly one root directory"
   doc_root = temp_contents[0]

2. Determine target URI

   Map the base URI by scope:

   | Scope | Base URI |
   |---|---|
   | resources | viking://resources/ |
   | user | viking://user/{user_id}/ |
   | agent | viking://agent/{agent_id}/ |

3. Recursively move directory tree

   Move all files from temp to AGFS:

   await agfs.move_recursive(
       source=f"{temp_dir_path}/{doc_root}",
       target=target_uri
   )

4. Clean up temp directory

   Delete temp files:

   await agfs.rm(temp_dir_path, recursive=True)

5. Queue semantic generation

   Submit a SemanticMsg to the queue:

   semantic_msg = SemanticMsg(
       id=str(uuid4()),
       uri=target_uri,
       context_type="resource",
       status="pending"
   )
   await semantic_queue.enqueue(semantic_msg)
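The five phases above can be condensed into one async function. This is a minimal sketch: the `agfs` and `semantic_queue` objects are assumed to expose the methods shown in the snippets, and the local `SemanticMsg` dataclass is a simplified stand-in for the real structure.

```python
import asyncio
from dataclasses import dataclass
from uuid import uuid4

# Simplified stand-in for OpenViking's SemanticMsg (timestamps omitted).
@dataclass
class SemanticMsg:
    id: str
    uri: str
    context_type: str
    status: str

# Base URIs per scope, as in the table above.
SCOPE_BASE = {
    "resources": "viking://resources/",
    "user": "viking://user/{user_id}/",
    "agent": "viking://agent/{agent_id}/",
}

async def finalize_from_temp(agfs, semantic_queue, temp_dir_path, scope, target_name, **ids):
    # Phase 1: find the single document root inside the temp directory
    temp_contents = await agfs.list_directory(temp_dir_path)
    assert len(temp_contents) == 1, "Must have exactly one root directory"
    doc_root = temp_contents[0]

    # Phase 2: map the scope to a base URI, then append the target name
    target_uri = SCOPE_BASE[scope].format(**ids) + target_name

    # Phase 3: recursively move the parsed tree into AGFS
    await agfs.move_recursive(source=f"{temp_dir_path}/{doc_root}", target=target_uri)

    # Phase 4: clean up the temp directory
    await agfs.rm(temp_dir_path, recursive=True)

    # Phase 5: queue semantic generation for the new tree
    msg = SemanticMsg(id=str(uuid4()), uri=target_uri,
                      context_type="resource", status="pending")
    await semantic_queue.enqueue(msg)
    return target_uri
```

Because the temp tree is deleted only after the move succeeds, a crash between phases leaves either the temp copy or the AGFS copy intact, never neither.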
Usage Example
from openviking.parse.tree_builder import TreeBuilder
tree_builder = TreeBuilder(agfs, semantic_queue)
# Finalize parsed document
building_tree = await tree_builder.finalize_from_temp(
temp_dir_path = "viking://temp/abc123" ,
scope = "resources" , # or "user", "agent"
target_name = "my-api-docs" # Optional custom name
)
print ( f "Moved to: { building_tree.target_uri } " )
# Output: viking://resources/my-api-docs/
Stage 3: SemanticQueue
SemanticQueue handles async L0/L1 generation and vectorization using VLM.
Processing Flow (Bottom-up)
Processing starts from leaf files and moves upward, aggregating child abstracts into parent overviews.
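The bottom-up order is a post-order traversal of the directory tree: every child directory is finished before its parent, so the parent's overview can aggregate the children's abstracts. A minimal sketch, using a nested dict as a stand-in for the AGFS tree:

```python
# Minimal sketch of bottom-up (post-order) processing order.
# The {name: subtree} dict is an illustrative stand-in for AGFS.

def bottom_up_order(tree: dict, path: str = "") -> list:
    """Return directory paths leaves-first, parents last."""
    order = []
    for name, subtree in tree.items():
        child_path = f"{path}/{name}" if path else name
        order.extend(bottom_up_order(subtree, child_path))  # children first
        order.append(child_path)                            # then the parent
    return order
```

For the markdown example earlier, `Authentication` and `Endpoints` would be processed before `API Documentation`, whose overview then summarizes both.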
Single Directory Processing Steps
1. Concurrent file summary generation

   Generate summaries for all files in the directory (max 10 concurrent):

   # Limit concurrent LLM calls to avoid rate limits
   semaphore = asyncio.Semaphore(max_concurrent_llm)  # 10
   async with semaphore:
       summary = await vlm.summarize(file_content)

2. Collect child directory abstracts

   Read the generated .abstract.md from subdirectories:

   child_abstracts = []
   for subdir in subdirectories:
       abstract = await agfs.read_file(f"{subdir}/.abstract.md")
       child_abstracts.append(abstract)

3. Generate .overview.md

   The LLM generates the L1 overview from file summaries + child abstracts:

   overview = await vlm.generate_overview(
       file_summaries=file_summaries,
       child_abstracts=child_abstracts,
       max_tokens=2000
   )

4. Extract .abstract.md

   Extract the L0 abstract from the overview (first 1-2 sentences):

   abstract = await vlm.extract_abstract(
       overview=overview,
       max_tokens=100
   )

5. Write files and vectorize

   Save to AGFS and create vector index entries:

   # Write L0/L1 to AGFS
   await agfs.write_file(f"{uri}/.abstract.md", abstract)
   await agfs.write_file(f"{uri}/.overview.md", overview)

   # Enqueue for vectorization
   for level, text in [(0, abstract), (1, overview)]:
       context = Context(
           uri=uri,
           level=level,
           abstract=abstract,
           ...
       )
       context.set_vectorize(Vectorize(text=text))
       await embedding_queue.enqueue(context)
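Step 1's bounded concurrency can be sketched end to end with `asyncio.gather`: all summaries are launched at once, but the semaphore caps how many VLM calls are in flight. `summarize_files` is an illustrative helper, not OpenViking's API; `summarize` stands in for the VLM call.

```python
import asyncio

# Sketch of bounded-concurrency summary generation (step 1 above).
# summarize_files is illustrative; summarize stands in for the VLM call.

async def summarize_files(files, summarize, max_concurrent_llm=10):
    semaphore = asyncio.Semaphore(max_concurrent_llm)

    async def one(content):
        # Only max_concurrent_llm coroutines hold the semaphore at once
        async with semaphore:
            return await summarize(content)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(one(f) for f in files))
```

Because `gather` returns results in input order, file summaries line up with their source files even when calls finish out of order.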
Configuration Parameters
{
  "vlm": {
    "max_concurrent": 100        // Max concurrent LLM calls
  },
  "semantic": {
    "max_concurrent_llm": 10,    // Per-directory concurrent calls
    "max_images_per_call": 10,   // Max images per VLM call
    "max_sections_per_call": 20  // Max sections per VLM call
  }
}
SemanticMsg Structure
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SemanticMsg:
    id: str            # UUID
    uri: str           # Directory URI
    context_type: str  # resource/memory/skill
    status: str        # pending/processing/completed/failed
    created_at: datetime
    updated_at: datetime
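A worker consuming this queue would move each message through the status values above. The following loop is purely illustrative: `queue.dequeue`, the `None` sentinel, and `process_directory` are hypothetical names, not OpenViking's API.

```python
import asyncio

# Illustrative worker loop showing the status lifecycle
# pending -> processing -> completed/failed. All names here
# (dequeue, process_directory, the None sentinel) are hypothetical.

async def worker(queue, process_directory):
    while True:
        msg = await queue.dequeue()
        if msg is None:              # queue drained (illustrative sentinel)
            break
        msg.status = "processing"
        try:
            await process_directory(msg.uri)
            msg.status = "completed"
        except Exception:
            msg.status = "failed"    # failure is recorded, loop continues
```

Recording `failed` instead of re-raising lets one bad directory pass without stalling the rest of the queue.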
Code Skeleton Extraction

For code files, OpenViking supports AST-based skeleton extraction via tree-sitter as a lightweight alternative to LLM summarization. AST mode significantly reduces processing cost by extracting structural information without LLM calls.
Modes
Controlled by code_summary_mode in ov.conf:
| Mode | Description | LLM Usage |
|---|---|---|
| ast | Extract structural skeleton for files ≥100 lines | None (default) |
| llm | Always use LLM for summarization | High |
| ast_llm | Extract AST skeleton first, then pass to LLM | Reduced |
{
  "code": {
    "summary_mode": "ast"  // or "llm", "ast_llm"
  }
}
# Input: hierarchical_retriever.py
"""Hierarchical retriever for OpenViking."""
import heapq
from typing import List

class HierarchicalRetriever:
    """Hierarchical retriever with dense and sparse vector support."""

    def __init__(self, storage, embedder):
        """Initialize hierarchical retriever."""
        self.storage = storage

    async def retrieve(self, query, limit=5):
        """Execute hierarchical retrieval."""
        # ... implementation ...

Extracted skeleton:

# Hierarchical retriever for OpenViking.
import heapq
from typing import List

class HierarchicalRetriever:
    """Hierarchical retriever with dense and sparse vector support."""

    def __init__(self, storage, embedder):
        """Initialize hierarchical retriever."""

    async def retrieve(self, query, limit=5):
        """Execute hierarchical retrieval."""
// Input: client.ts
/**
 * OpenViking HTTP client
 */
export class HTTPClient {
  /** Initialize client with URL */
  constructor(url: string, apiKey?: string) {
    this.url = url;
  }

  /** Search for contexts */
  async find(query: string): Promise<FindResult> {
    // ... implementation ...
  }
}

Extracted skeleton:

// OpenViking HTTP client
export class HTTPClient {
  /** Initialize client with URL */
  constructor(url: string, apiKey?: string)

  /** Search for contexts */
  async find(query: string): Promise<FindResult>
}
Fallback Behavior
AST extraction automatically falls back to LLM when:
Language not in supported list
File has fewer than 100 lines
AST parse error occurs
Extraction produces empty skeleton
Fallback is automatic and logged. The overall pipeline continues without interruption.
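The four fallback conditions above can be sketched as a single decision function. The threshold, language set, and function name are illustrative, not OpenViking's actual configuration surface.

```python
# Sketch of the AST-vs-LLM fallback decision described above.
# Names, threshold, and language set are illustrative.

AST_LANGS = {"python", "javascript", "typescript", "rust", "go", "java", "c", "cpp"}
MIN_LINES = 100  # files below this go straight to the LLM

def choose_summary_path(language, line_count, skeleton):
    """Return "ast" when an AST skeleton is usable, else "llm"."""
    if language not in AST_LANGS:
        return "llm"        # language not in supported list
    if line_count < MIN_LINES:
        return "llm"        # file too small for AST mode
    if not skeleton:
        return "llm"        # parse error or empty extraction
    return "ast"
```

Every branch degrades to `"llm"`, which is why the pipeline never stalls on a file that AST extraction cannot handle.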
Different context types follow the same pipeline with different target URIs:
# Add resource
await client.add_resource(
    "/path/to/doc.pdf",
    reason="API documentation"
)
# Flow:
# 1. Parser -> viking://temp/abc123/
# 2. TreeBuilder(scope=resources) -> viking://resources/doc/
# 3. SemanticQueue -> Generate L0/L1
# 4. Vector Index -> Index for search
# Add skill
await client.add_skill({
    "name": "search-web",
    "content": "# search-web\n..."
})
# Flow:
# 1. Direct write to viking://agent/skills/search-web/
# 2. SemanticQueue -> Generate L0/L1
# 3. Vector Index -> Index for search
# Memory auto-extracted from session
await session.commit()
# Flow:
# 1. MemoryExtractor -> Extract 6-category memories
# 2. TreeBuilder(scope=user/agent) -> viking://user/memories/
# 3. SemanticQueue -> Generate L0/L1
# 4. Vector Index -> Index for search
Complete Example
from openviking import OpenViking

client = OpenViking()

# Add resource (triggers full extraction pipeline)
await client.add_resource(
    "https://example.com/api-docs.pdf",
    reason="API documentation"
)

# Wait for semantic processing to complete (optional)
await client.wait_processed()

# Now L0/L1 are generated and indexed
results = await client.find("authentication")

for ctx in results.resources:
    # L0 abstract available immediately
    print(f"Abstract: {ctx.abstract}")

    # L1 overview loaded on demand
    overview = await client.overview(ctx.uri)
    print(f"Overview: {overview}")

    # L2 full content loaded on demand
    if need_details:
        content = await client.read(ctx.uri)
        print(f"Content: {content}")
See also:

- Architecture: System architecture and data flow
- Context Layers: L0/L1/L2 model details
- Storage: AGFS and vector index
- Session: Memory extraction details