Vectorization Technical Documentation

📚 Feature Documentation: See Text Extraction, Vectorization & Named Entity Recognition for user-facing documentation and overview.

Overview

OpenRegister's VectorizationService generates semantic embeddings from file chunks and objects, enabling retrieval-augmented generation (RAG) and semantic search. The service supports multiple embedding providers and vector storage backends.

Architecture

Service Flow

Strategy Pattern

Location: lib/Service/Vectorization/

Services

VectorizationService (Public API)

Location: lib/Service/VectorizationService.php

Purpose: Public API facade for all vector operations. Orchestrates vectorization using strategies and delegates to VectorEmbeddingService.

Architecture: This is the single entry point for all vector operations. Other services should call VectorizationService, not VectorEmbeddingService directly.

Key Methods:

/**
 * Vectorize entities using appropriate strategy.
 */
public function vectorizeEntity(
    Chunk|ObjectEntity $entity,
    ?VectorizationStrategyInterface $strategy = null
): array {
    // Get strategy for entity type
    if ($strategy === null) {
        $strategy = $this->getStrategyForEntity($entity);
    }
    
    // Extract vectorization items
    $items = $strategy->extractVectorizationItems($entity);
    
    // Generate embeddings with the default provider
    // (the original snippet passed an undefined $provider variable)
    $embeddings = $this->vectorEmbeddingService->generateBatchEmbeddings(
        array_column($items, 'text')
    );
    
    // Store vectors
    foreach ($items as $index => $item) {
        $embedding = $embeddings[$index];
        $strategy->storeVector($item, $embedding);
    }
    
    return ['vectors_created' => count($items)];
}
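
The dispatch performed by vectorizeEntity() can be pictured with a stripped-down, self-contained sketch. Everything in it is hypothetical: DemoStrategy, DemoChunkStrategy, DemoObjectStrategy, demoVectorize() and the array-shaped entities stand in for the real OpenRegister classes. It only illustrates how a strategy is picked per entity type and how each extracted item is paired with the embedding at the same index.

```php
<?php
// Hypothetical stand-in for VectorizationStrategyInterface; entities are plain arrays here.
interface DemoStrategy
{
    public function supports(array $entity): bool;

    /** @return array<int, array{text: string, metadata: array}> */
    public function extractItems(array $entity): array;
}

final class DemoChunkStrategy implements DemoStrategy
{
    public function supports(array $entity): bool
    {
        return $entity['type'] === 'chunk';
    }

    public function extractItems(array $entity): array
    {
        // One item per chunk, mirroring the file strategy.
        return [[
            'text' => $entity['text'],
            'metadata' => ['source_type' => 'file', 'chunk_index' => $entity['chunk_index']],
        ]];
    }
}

final class DemoObjectStrategy implements DemoStrategy
{
    public function supports(array $entity): bool
    {
        return $entity['type'] === 'object';
    }

    public function extractItems(array $entity): array
    {
        // One item for the whole object.
        return [[
            'text' => json_encode($entity['data']),
            'metadata' => ['source_type' => 'object'],
        ]];
    }
}

/**
 * Pick the first strategy that supports the entity, embed its items, and pair
 * each item with the embedding at the same index.
 */
function demoVectorize(array $entity, array $strategies, callable $embedBatch): array
{
    foreach ($strategies as $strategy) {
        if (!$strategy->supports($entity)) {
            continue;
        }
        $items = $strategy->extractItems($entity);
        $embeddings = $embedBatch(array_column($items, 'text'));
        $vectors = [];
        foreach ($items as $index => $item) {
            $vectors[] = ['metadata' => $item['metadata'], 'embedding' => $embeddings[$index]];
        }
        return $vectors;
    }
    throw new InvalidArgumentException('No strategy supports this entity');
}
```

The $embedBatch callable is injected so the sketch runs without a provider; in the real service that role is played by VectorEmbeddingService.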

VectorEmbeddingService (Internal Handler)

Location: lib/Service/Vectorization/VectorEmbeddingService.php

Purpose: Internal service that generates embeddings using LLM providers. This is an implementation detail - use VectorizationService as the public API.

Supported Providers:

OpenAI: text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large
Fireworks AI: nomic-ai/nomic-embed-text-v1.5, thenlper/gte-base
Ollama: nomic-embed-text

Key Methods:

/**
 * Generate embedding for single text.
 */
public function generateEmbedding(
    string $text,
    ?string $provider = null
): ?array {
    $provider = $provider ?? $this->getDefaultProvider();
    
    return match($provider) {
        'openai' => $this->generateOpenAIEmbedding($text),
        'fireworks' => $this->generateFireworksEmbedding($text),
        'ollama' => $this->generateOllamaEmbedding($text),
        default => throw new Exception("Unknown provider: {$provider}")
    };
}

/**
 * Generate embeddings for multiple texts (batch).
 */
public function generateBatchEmbeddings(
    array $texts,
    ?string $provider = null
): array {
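    // Resolve the provider once, as generateEmbedding() does
    $provider = $provider ?? $this->getDefaultProvider();

    // Send texts in batches of 100 to limit the number of API calls
    $batchSize = 100;
    $results = [];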
    
    foreach (array_chunk($texts, $batchSize) as $batch) {
        $batchResults = $this->callProviderBatchAPI($batch, $provider);
        $results = array_merge($results, $batchResults);
    }
    
    return $results;
}
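
The batching loop above can be tried in isolation with a fake provider. The sketch below is an illustration under assumptions, not the service's code: embedInBatches() and the $embedBatch callable are hypothetical names. It keeps results aligned with the input order, checks that each batch returns one embedding per text, and merges once at the end instead of on every iteration.

```php
<?php
/**
 * Embed texts in batches while keeping results aligned with the input order.
 *
 * @param array    $texts      list of strings to embed
 * @param callable $embedBatch hypothetical provider call: list of texts in, list of vectors out
 */
function embedInBatches(array $texts, callable $embedBatch, int $batchSize = 100): array
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1');
    }

    $batches = [];
    foreach (array_chunk($texts, $batchSize) as $batch) {
        $vectors = $embedBatch($batch);
        if (count($vectors) !== count($batch)) {
            throw new RuntimeException('Provider returned a different number of embeddings than texts');
        }
        $batches[] = array_values($vectors);
    }

    // Merge once at the end; array_merge() with no arguments returns [].
    return array_merge(...$batches);
}
```

With 250 texts and the default batch size, the callable is invoked three times (100, 100, and 50 texts).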

Vectorization Strategies

FileVectorizationStrategy

Location: lib/Service/Vectorization/FileVectorizationStrategy.php

Process:

Processes file chunks individually
Each chunk gets its own vector
Metadata includes file ID, path, chunk index

Implementation:

public function extractVectorizationItems(Chunk $chunk): array
{
    return [[
        'text' => $chunk->getTextContent(),
        'metadata' => [
            'source_type' => 'file',
            'source_id' => $chunk->getSourceId(),
            'chunk_index' => $chunk->getChunkIndex(),
            'total_chunks' => $this->getTotalChunks($chunk->getSourceId()),
        ]
    ]];
}

ObjectVectorizationStrategy

Location: lib/Service/Vectorization/ObjectVectorizationStrategy.php

Process:

Processes entire objects as single vectors
One vector per object (not per chunk)
Metadata includes object UUID, schema, register

Implementation:

public function extractVectorizationItems(ObjectEntity $object): array
{
    // Convert object to text
    $text = $this->objectService->convertObjectToText($object);
    
    return [[
        'text' => $text,
        'metadata' => [
            'source_type' => 'object',
            'source_id' => $object->getId(),
            'uuid' => $object->getUuid(),
            'schema_id' => $object->getSchema(),
            'register_id' => $object->getRegister(),
        ]
    ]];
}
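
The source does not show how convertObjectToText() builds its text. As a purely hypothetical illustration of the idea, the sketch below flattens a nested array into "path: value" lines; flattenToText() is an invented name and the output format is an assumption.

```php
<?php
/**
 * Flatten a nested array into "path: value" lines suitable for embedding.
 * Hypothetical illustration; not the actual convertObjectToText() implementation.
 */
function flattenToText(array $data, string $prefix = ''): string
{
    $lines = [];
    foreach ($data as $key => $value) {
        $path = $prefix === '' ? (string) $key : $prefix . '.' . $key;
        if (is_array($value)) {
            // Recurse into nested structures and keep their lines together.
            $nested = flattenToText($value, $path);
            if ($nested !== '') {
                $lines[] = $nested;
            }
        } elseif (is_bool($value)) {
            $lines[] = $path . ': ' . ($value ? 'true' : 'false');
        } elseif ($value !== null && $value !== '') {
            // Null and empty values are skipped so they add no noise to the text.
            $lines[] = $path . ': ' . $value;
        }
    }
    return implode("\n", $lines);
}
```

Keeping the property path in each line gives the embedding model some context for short values such as names or status codes.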

Vector Storage

Database Storage

Table: oc_openregister_vectors

CREATE TABLE oc_openregister_vectors (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    entity_type VARCHAR(50) NOT NULL,
    entity_id VARCHAR(255) NOT NULL,
    chunk_index INT NOT NULL DEFAULT 0,
    total_chunks INT NOT NULL DEFAULT 1,
    chunk_text TEXT,
    vector_data BLOB NOT NULL,
    embedding_model VARCHAR(100) NOT NULL,
    embedding_dimensions INT NOT NULL,
    metadata JSON,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    
    INDEX idx_entity (entity_type, entity_id),
    INDEX idx_model (embedding_model)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Storage Format: Serialized binary (PHP serialize())
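
A minimal round trip for this storage format might look as follows. The function names encodeVector() and decodeVector() are hypothetical; the sketch only assumes that the embedding is a flat array of floats written to the vector_data column.

```php
<?php
/** Encode an embedding for a BLOB column using PHP's serialize(). */
function encodeVector(array $embedding): string
{
    return serialize(array_map('floatval', array_values($embedding)));
}

/** Decode a stored vector; objects are refused so stored data cannot instantiate classes. */
function decodeVector(string $blob): array
{
    $decoded = unserialize($blob, ['allowed_classes' => false]);
    if (!is_array($decoded)) {
        throw new UnexpectedValueException('Stored vector is not a serialized array');
    }
    return $decoded;
}
```

Passing allowed_classes => false is a defensive choice for data read back from the database; whether the real service does the same is not stated in the source.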

Solr Storage

Vectors stored directly in existing Solr collections:

FileCollection:

{
  "id": "file_12345_chunk_0",
  "entity_type_s": "file",
  "entity_id_s": "12345",
  "chunk_index_i": 0,
  "chunk_text_txt": "Text content...",
  "_embedding_": [0.1, 0.2, 0.3, ...],
  "_embedding_model_": "text-embedding-3-small",
  "_embedding_dim_": 1536
}

ObjectCollection:

{
  "id": "object_abc123",
  "entity_type_s": "object",
  "entity_id_s": "abc123",
  "chunk_index_i": 0,
  "chunk_text_txt": "Object text...",
  "_embedding_": [0.1, 0.2, 0.3, ...],
  "_embedding_model_": "text-embedding-3-small",
  "_embedding_dim_": 1536
}

Vector Search Backends

PHP Cosine Similarity (Default)

Status: Always available fallback

How It Works:

Fetches vectors from database
Calculates cosine similarity in PHP
Sorts results in memory
Returns top N matches
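
The four steps above can be sketched in plain PHP. This is an illustration, not the service's implementation: cosineSimilarity() and topMatches() are hypothetical names, and each row is assumed to be an array with an id and an already decoded vector.

```php
<?php
/** Cosine similarity of two equal-length float lists; returns 0.0 for zero vectors. */
function cosineSimilarity(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same dimension');
    }
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

/** Score every row against the query, sort in memory, and return the top N. */
function topMatches(array $query, array $rows, int $limit = 10): array
{
    $scored = [];
    foreach ($rows as $row) {
        $scored[] = [
            'id' => $row['id'],
            'similarity' => cosineSimilarity($query, $row['vector']),
        ];
    }
    usort($scored, fn(array $x, array $y): int => $y['similarity'] <=> $x['similarity']);
    return array_slice($scored, 0, $limit);
}
```

Every stored vector is visited once per query, which is why this backend scales linearly with the number of vectors.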

Performance:

Suitable for small datasets (<500 vectors)
Current optimization: only the 500 most recent vectors are compared
Scales linearly with the number of vectors: O(n) complexity

Use When:

Small dataset (<500 vectors)
No PostgreSQL or Solr available
Testing/development environment

PostgreSQL + pgvector

Status: Available when PostgreSQL with pgvector extension is installed

Requirements:

PostgreSQL version >= 11
pgvector extension installed (version 0.5.0 or later for HNSW indexes)
Vector column migration

Installation:

-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create optimized vector table
CREATE TABLE openregister_vectors (
    id SERIAL PRIMARY KEY,
    entity_type VARCHAR(50),
    entity_id VARCHAR(255),
    chunk_index INTEGER,
    total_chunks INTEGER,
    chunk_text TEXT,
    embedding vector(768),  -- Native vector type; the dimension must match the embedding model
    embedding_model VARCHAR(100),
    embedding_dimensions INTEGER,
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create HNSW index for fast similarity search
CREATE INDEX ON openregister_vectors 
USING hnsw (embedding vector_cosine_ops);
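
A similarity query against this table could be built from PHP as sketched below. The helper names toPgVectorLiteral() and buildKnnQuery() are hypothetical, and the sketch only prepares the SQL and parameters; executing them (for example through PDO) is left out. It relies on pgvector's <=> operator, which returns cosine distance, so similarity is computed as 1 minus that distance.

```php
<?php
/** Format a list of numbers as a pgvector literal such as "[0.1,0.25,-0.5]". */
function toPgVectorLiteral(array $embedding): string
{
    if ($embedding === []) {
        throw new InvalidArgumentException('Embedding must not be empty');
    }
    // JSON array syntax matches pgvector's text format; NAN and INF raise a JsonException.
    return json_encode(array_map('floatval', array_values($embedding)), JSON_THROW_ON_ERROR);
}

/**
 * Build a nearest-neighbour query and its bound parameters.
 * Ordering by the raw distance expression with a LIMIT lets the HNSW index be used.
 */
function buildKnnQuery(array $embedding, int $limit = 10): array
{
    $sql = 'SELECT entity_type, entity_id, chunk_index, chunk_text, '
        . '1 - (embedding <=> CAST(:query_select AS vector)) AS similarity '
        . 'FROM openregister_vectors '
        . 'ORDER BY embedding <=> CAST(:query_order AS vector) '
        . 'LIMIT :limit';

    $literal = toPgVectorLiteral($embedding);

    // Two distinct placeholders are used because some drivers reject repeated named parameters.
    return [$sql, ['query_select' => $literal, 'query_order' => $literal, 'limit' => $limit]];
}
```

The query vector must have the same number of dimensions as the embedding column, otherwise PostgreSQL rejects the comparison.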

Performance:

10-100x faster than PHP
Database-level KNN search
Optimal for medium-large datasets

Solr 9+ Dense Vector Search

Status: Available when Solr 9.0+ is configured

Capabilities:

Dense Vector Field Type: DenseVectorField
KNN Search: K-Nearest Neighbors query parser
Similarity Functions: Cosine, Dot Product, Euclidean
Indexing Algorithms: HNSW (Hierarchical Navigable Small World)

Schema Configuration:

<fieldType name="knn_vector" class="solr.DenseVectorField" 
           vectorDimension="768" 
           similarityFunction="cosine" 
           knnAlgorithm="hnsw"/>

<field name="_embedding_" type="knn_vector" indexed="true" stored="true"/>

Query Example:

{
  "q": "{!knn f=_embedding_ topK=10}[0.123, 0.456, ...]",
  "fl": "id,score"
}
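
The same query parameters can be assembled in PHP before they are sent to Solr. The sketch below is an assumption-laden illustration: buildSolrKnnParams() is a hypothetical helper, and the optional fq filter on entity_type_s reuses the field name from the document examples in this page. How a filter interacts with topK depends on the Solr version, so treat that part as a starting point.

```php
<?php
/**
 * Build request parameters for Solr's {!knn} query parser.
 *
 * @param array       $embedding  query vector as a list of floats
 * @param int         $topK       number of nearest neighbours to request
 * @param string|null $entityType optional filter, e.g. "file" or "object"
 */
function buildSolrKnnParams(array $embedding, int $topK = 10, ?string $entityType = null): array
{
    if ($embedding === [] || $topK < 1) {
        throw new InvalidArgumentException('Embedding must not be empty and topK must be positive');
    }

    // JSON array syntax matches the bracketed vector format the parser expects.
    $vector = json_encode(array_map('floatval', array_values($embedding)), JSON_THROW_ON_ERROR);

    $params = [
        'q' => sprintf('{!knn f=_embedding_ topK=%d}%s', $topK, $vector),
        'fl' => 'id,score',
    ];

    if ($entityType !== null) {
        // Allow only simple tokens so the value cannot alter the filter query.
        if (preg_match('/^[a-z_]+$/', $entityType) !== 1) {
            throw new InvalidArgumentException('Invalid entity type');
        }
        $params['fq'] = 'entity_type_s:' . $entityType;
    }

    return $params;
}
```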

Performance:

Very fast distributed vector search
Best for large-scale deployments
100-1000x faster than PHP at scale

Configuration

Settings Location

Vector search backend configuration stored in LLM settings:

{
  "llm": {
    "vectorConfig": {
      "backend": "solr",
      "solrCollection": "openregister_vectors",
      "solrField": "_embedding_"
    }
  }
}

Configuration Properties

Property	Type	Default	Description
`backend`	string	`'php'`	Vector search backend: 'php', 'database', or 'solr'
`solrCollection`	string|null	`null`	Solr collection name (required for Solr backend)
`solrField`	string	`'_embedding_'`	Solr field name for dense vectors
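
Reading these properties with their defaults could look like the sketch below. normalizeVectorConfig() is a hypothetical helper, not part of OpenRegister; it assumes the settings arrive as a decoded array shaped like the JSON above and applies the defaults and allowed values from the table.

```php
<?php
/**
 * Apply the documented defaults to the vector configuration and validate it.
 *
 * @param array $settings decoded settings, expected shape: ['llm' => ['vectorConfig' => [...]]]
 */
function normalizeVectorConfig(array $settings): array
{
    $config = array_merge(
        ['backend' => 'php', 'solrCollection' => null, 'solrField' => '_embedding_'],
        $settings['llm']['vectorConfig'] ?? []
    );

    if (!in_array($config['backend'], ['php', 'database', 'solr'], true)) {
        throw new InvalidArgumentException('Unknown vector backend: ' . $config['backend']);
    }

    // Follows the table above, which marks solrCollection as required for the Solr backend.
    if ($config['backend'] === 'solr' && empty($config['solrCollection'])) {
        throw new InvalidArgumentException('solrCollection is required for the Solr backend');
    }

    return $config;
}
```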

Performance

Processing Times

Stage	Time per Item	Notes
Embedding Generation	200-500ms	Per chunk (one-time)
Vector Storage	10-50ms	Database or Solr
Vector Search (PHP)	1-10s	Depends on dataset size
Vector Search (PostgreSQL)	20-200ms	With HNSW index
Vector Search (Solr)	10-30ms	With HNSW index

Performance Comparison

Test: Search 10,000 vectors, return top 10 results

Backend	Latency	Throughput	Memory
PHP	10s	1 req/s	Low
PostgreSQL + pgvector	50ms	50 req/s	Medium
Solr 9+ (HNSW)	20ms	100+ req/s	Medium-High

API Endpoints

Vectorization

POST /api/chunks/vectorize
Content-Type: application/json

{
  "source_type": "file",
  "source_id": 12345,
  "provider": "openai"
}

Response:

{
  "success": true,
  "vectors_created": 14,
  "model": "text-embedding-3-small",
  "dimensions": 1536
}

Semantic Search

POST /api/search/semantic
Content-Type: application/json

{
  "query": "budget discussion with John",
  "limit": 10,
  "filters": {
    "source_type": ["file", "object"]
  }
}

Response:

{
  "results": [
    {
      "chunk_id": 123,
      "source_type": "file",
      "source_id": 12345,
      "text": "...discussed budget with John Doe...",
      "similarity": 0.87,
      "metadata": {
        "file_name": "meeting-notes.pdf",
        "chunk_index": 3
      }
    }
  ],
  "total": 10
}
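
A client might prepare the request body and read the response as sketched below. Both function names are hypothetical, the HTTP call itself is omitted because the base URL and authentication are not given here, and the optional similarity threshold is an addition for illustration rather than an API feature.

```php
<?php
/** Build the JSON body for POST /api/search/semantic. */
function buildSemanticSearchBody(string $query, int $limit = 10, array $sourceTypes = ['file', 'object']): string
{
    return json_encode([
        'query' => $query,
        'limit' => $limit,
        'filters' => ['source_type' => array_values($sourceTypes)],
    ], JSON_THROW_ON_ERROR);
}

/** Decode a response and keep the hits at or above a client-side similarity threshold. */
function parseSemanticSearchResponse(string $json, float $minSimilarity = 0.0): array
{
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    $hits = [];
    foreach ($data['results'] ?? [] as $result) {
        $similarity = (float) ($result['similarity'] ?? 0.0);
        if ($similarity >= $minSimilarity) {
            $hits[] = [
                'source' => $result['source_type'] . ':' . $result['source_id'],
                'similarity' => $similarity,
                'text' => $result['text'],
            ];
        }
    }
    return $hits;
}
```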

Backend Configuration

UI Configuration

Configure in Settings → OpenRegister → LLM Configuration:

1. Click the "Configure LLM" button
2. Scroll to the "Vector Search Backend" section
3. Select a backend from the dropdown:
   - PHP Cosine Similarity (🐌 Slow)
   - PostgreSQL + pgvector (⚡ Fast)
   - Solr 9+ Dense Vector (🚀 Very Fast)
4. If Solr is selected:
   - Choose a Solr collection from the dropdown
   - The vector field name is automatically set to _embedding_
5. Click "Save Configuration"

Backend Detection

The system automatically detects available backends:

Database Detection:

Checks database platform (PostgreSQL vs MariaDB/MySQL)
Queries for pgvector extension
Shows availability status

Solr Detection:

Checks Solr version >= 9.0
Lists available collections
Validates vector field configuration
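
The availability rules above can be condensed into a small function. This is a sketch under assumptions: detectBackends() is a hypothetical name, and the inputs (database platform name, pgvector flag, Solr version string) are presumed to have been gathered already, for example from the pg_extension query and the Solr system info.

```php
<?php
/**
 * Derive which vector search backends can be offered.
 *
 * @param string      $dbPlatform        e.g. "postgresql", "mysql", "mariadb" (names are assumptions)
 * @param bool        $pgvectorInstalled result of checking pg_extension for "vector"
 * @param string|null $solrVersion       version reported by Solr, or null when unreachable
 */
function detectBackends(string $dbPlatform, bool $pgvectorInstalled, ?string $solrVersion): array
{
    return [
        // The PHP fallback is always available.
        'php' => true,
        'database' => $dbPlatform === 'postgresql' && $pgvectorInstalled,
        // Comparing against "9" accepts "9", "9.0" and later releases.
        'solr' => $solrVersion !== null && version_compare($solrVersion, '9', '>='),
    ];
}
```

The comparison uses "9" rather than "9.0" because version_compare() treats "9" as lower than "9.0".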

Recommended Setup

Dataset Size	Recommended Backend	Reason
< 500 vectors	PHP	Simple, no setup
500 - 10,000	PostgreSQL + pgvector	Fast, integrated
10,000+	Solr 9+	Best performance, scalability
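
The table can be turned into a selection rule that also respects which backends are actually available. recommendBackend() is a hypothetical helper; the treatment of exactly 10,000 vectors (sent to Solr here) is an assumption, since the table's ranges overlap at that value.

```php
<?php
/**
 * Recommend a backend for a given number of vectors.
 *
 * @param int   $vectorCount number of stored vectors
 * @param array $available   map of backend name to bool, e.g. ['php' => true, 'solr' => false]
 */
function recommendBackend(int $vectorCount, array $available): string
{
    if ($vectorCount < 500) {
        $preferred = ['php'];
    } elseif ($vectorCount < 10000) {
        $preferred = ['database', 'solr', 'php'];
    } else {
        $preferred = ['solr', 'database', 'php'];
    }

    // Take the first preferred backend that is available.
    foreach ($preferred as $backend) {
        if (!empty($available[$backend])) {
            return $backend;
        }
    }

    // The PHP backend is the always-available fallback.
    return 'php';
}
```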

Migration

From PHP to PostgreSQL

1. Install PostgreSQL with the pgvector extension
2. Migrate the vector table to PostgreSQL
3. Update the Nextcloud database configuration
4. Select "PostgreSQL + pgvector" in LLM Configuration
5. The system detects the extension and enables the backend

From PHP to Solr

1. Ensure Solr 9.0+ is running
2. Create or select a Solr collection
3. Configure the vector field (_embedding_) in the Solr schema
4. Select "Solr 9+ Dense Vector" in LLM Configuration
5. Choose the collection and save

Switching Backends

You can switch backends at any time:

Configuration changes take effect immediately
Existing vectors remain in database
New searches use selected backend
No data migration required (vectors stored in database regardless of backend)

Solr Integration Details

Vector Storage in Existing Collections

Vectors are stored directly in existing Solr collections (fileCollection and objectCollection) rather than a separate vector collection:

Files: Vectors stored in fileCollection alongside file chunks
Objects: Vectors stored in objectCollection alongside object data

This enables:

Single source of truth for each entity
Full document retrieval without additional lookups
Atomic updates to existing documents

Solr Document Structure

Vectors are stored as fields in existing Solr documents:

{
  "id": "object_abc123_chunk_0",
  "entity_type_s": "object",
  "entity_id_s": "abc123",
  "chunk_index_i": 0,
  "chunk_text_txt": "This is the text that was embedded...",
  "_embedding_": [0.1, 0.2, 0.3, ...],
  "_embedding_model_": "text-embedding-ada-002",
  "_embedding_dim_": 1536
}

KNN Query Syntax

The implementation uses Solr's KNN query parser:

{!knn f=_embedding_ topK=10}[0.1, 0.2, 0.3, ...]

Benefits:

Very fast (millisecond range)
Uses HNSW indexing algorithm
Returns full documents with all metadata
Supports filtering by entity type

Configuration

No separate solrCollection setting is needed: vectors use the existing fileCollection and objectCollection from the Solr settings. The solrField value is fixed to _embedding_ (a reserved system field).

Performance Monitoring

Current Performance

PHP Backend:

279 vectors → ~300ms similarity calculation
1,000 vectors → ~1 second
10,000 vectors → ~10 seconds
Plus embedding generation time (~1.5s)
Current optimization: Limited to 500 most recent vectors

PostgreSQL + pgvector:

1K vectors → ~20ms
10K vectors → ~50ms
100K vectors → ~200ms
Uses HNSW indexing

Solr 9+ Dense Vector:

1K vectors → ~10ms
10K vectors → ~15ms
100K vectors → ~30ms
Uses HNSW indexing algorithm
Distributed search across collections

Performance Targets

Metric	Current (PHP)	Target (PostgreSQL)	Best (Solr)
Search Time (1K vectors)	1s	20ms	10ms
Search Time (10K vectors)	10s	50ms	15ms
Search Time (100K vectors)	100s	200ms	30ms
Indexing	None	HNSW	ANN algorithms
Scalability	Poor	Good	Excellent

Performance Comparison

Test: Search 10,000 vectors, return top 10 results

Backend	Latency	Throughput	Memory	Index Time
PHP	10s	1 req/s	Low	N/A
PostgreSQL + pgvector	50ms	50 req/s	Medium	Fast
Solr 9+ (HNSW)	20ms	100+ req/s	Medium-High	Medium

Troubleshooting

Backend Not Available

PostgreSQL + pgvector:

Verify PostgreSQL is installed
Check pgvector extension: SELECT * FROM pg_extension WHERE extname = 'vector';
Install if missing: CREATE EXTENSION vector;

Solr 9+:

Verify Solr version >= 9.0
Check Solr is accessible
Verify collection exists
Check vector field is configured in schema

Slow Performance

PHP Backend:

Consider migrating to PostgreSQL or Solr
Current optimization limits to 500 most recent vectors
Performance scales linearly with vector count

PostgreSQL:

Ensure HNSW index is created
Check query execution plan
Monitor database performance

Solr:

Verify HNSW indexing is enabled
Check collection health
Monitor Solr performance metrics

API Endpoints

Get Database Info

GET /api/settings/database

Returns database type, version, and vector support.

Get Solr Info

GET /api/settings/solr-info

Returns Solr availability, version, and collections.

Related Documentation

Feature Documentation

Text Extraction, Vectorization & Named Entity Recognition - Unified feature documentation
AI Chat Interface - RAG and semantic search usage

Technical Documentation

Text Extraction Technical Documentation - Text extraction implementation
Named Entity Recognition Technical Documentation - NER implementation
