Text Extraction Sources: Files vs Objects
OpenRegister processes content from two distinct sources, both of which are converted into chunks for search and analysis.
Processing Paths Overview
📄 Source 1: Files
Description
Files (documents, images, spreadsheets, etc.) are processed through text extraction engines to convert binary content into searchable text.
Supported File Types
| Category | Formats | Extraction Method |
|---|---|---|
| Documents | PDF, DOCX, DOC, ODT, RTF | LLPhant or Dolphin |
| Spreadsheets | XLSX, XLS, CSV | LLPhant or Dolphin |
| Presentations | PPTX | LLPhant or Dolphin |
| Text Files | TXT, MD, HTML, JSON, XML | LLPhant (native) |
| Images | JPG, PNG, GIF, WebP, TIFF | Dolphin (OCR only) |
File Processing Flow
The file processing flow varies based on the extraction mode configured. Each mode provides different timing and processing characteristics:
Extraction Modes Overview
1. Immediate Mode - Direct Link Processing
Characteristics:
- Direct Link: File upload and parsing logic are directly connected
- Synchronous: Processing happens during the upload request
- User Experience: User waits for extraction to complete
- Use Case: When immediate text availability is critical
- Performance: May slow down file uploads for large files
2. Background Job Mode - Delayed Extraction
Characteristics:
- Delayed Action: Extraction happens after upload completes
- Asynchronous: Processing on the job stack, non-blocking
- User Experience: Upload completes immediately
- Use Case: Recommended for most scenarios (best performance)
- Performance: No impact on upload speed
3. Cron Job Mode - Periodic Batch Processing
Characteristics:
- Repeating Action: Periodic batch processing via scheduled jobs
- Batch Processing: Multiple files processed together
- User Experience: Upload completes immediately, extraction happens later
- Use Case: When you want to control processing load and timing
- Performance: Efficient batch processing, predictable load
4. Manual Only Mode - User-Triggered Processing
Characteristics:
- Manual Trigger: Only processes when user explicitly triggers
- User Control: Complete control over when extraction happens
- Use Case: Selective processing, testing, or resource-constrained environments
- Performance: No automatic processing overhead
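The four modes above can be sketched as a simple dispatch in an upload handler. This is an illustrative sketch only, not OpenRegister's actual implementation; the names (`ExtractionMode`, `on_file_uploaded`, and the `extract`/`queue_job` callables) are assumptions.

```python
from enum import Enum

class ExtractionMode(Enum):
    IMMEDIATE = "immediate"
    BACKGROUND = "background_job"
    CRON = "cron_job"
    MANUAL = "manual_only"

def on_file_uploaded(file_id: int, mode: ExtractionMode, extract, queue_job) -> str:
    """Dispatch text extraction according to the configured mode (sketch)."""
    if mode is ExtractionMode.IMMEDIATE:
        extract(file_id)   # synchronous: the user waits for extraction
        return "extracted"
    if mode is ExtractionMode.BACKGROUND:
        queue_job(file_id) # asynchronous: picked up from the job stack
        return "queued"
    # CRON and MANUAL: do nothing now; a scheduled batch run or an
    # explicit user action triggers extraction later.
    return "deferred"
```

The point of the dispatch is that only Immediate mode does work inside the upload request; every other mode returns immediately and defers extraction.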
Detailed File Processing Flow
File Metadata Preserved
When files are processed, the following metadata is maintained:
- Source Reference: Original file ID from Nextcloud
- File Path: Location in Nextcloud filesystem
- MIME Type: File format information
- File Size: Original file size in bytes
- Checksum: For change detection
- Extraction Method: Which engine was used (LLPhant or Dolphin)
- Extraction Timestamp: When text was extracted
Example: PDF Processing
Input: contract-2024.pdf (245 KB, 15 pages)
Step 1: Text Extraction
- Engine: Dolphin AI
- Time: 8.2 seconds
- Output: 12,450 characters of text
Step 2: Chunking
- Strategy: Recursive (respects paragraphs)
- Chunks created: 14
- Average chunk size: 889 characters
- Overlap: 200 characters
Step 3: Storage
- FileText entity created
- Chunks stored in chunks_json field
- Status: completed
📦 Source 2: Objects
Description
OpenRegister objects (structured data entities) are converted into text blobs by concatenating their property values. This enables full-text search across structured data.
Object-to-Text Conversion
Objects are transformed using the following rules:
1. Simple Properties: Direct value extraction
   { "name": "John Doe", "age": 35 }
   → "name: John Doe age: 35"
2. Arrays: Join with separators
   { "tags": ["urgent", "customer", "support"] }
   → "tags: urgent, customer, support"
3. Nested Objects: Flatten with dot notation
   { "address": { "city": "Amsterdam", "country": "NL" } }
   → "address.city: Amsterdam address.country: NL"
4. Special Handling: Exclude system fields
   - Ignore: id, uuid, created, updated
   - Include: User-defined properties only
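The conversion rules above can be sketched in a few lines. This is an illustrative sketch, not OpenRegister's internal code; `object_to_text` and `SYSTEM_FIELDS` are assumed names.

```python
# System fields excluded from the text blob (per the rules above).
SYSTEM_FIELDS = {"id", "uuid", "created", "updated"}

def object_to_text(obj: dict, prefix: str = "") -> str:
    """Concatenate property values into a searchable text blob (sketch)."""
    parts = []
    for key, value in obj.items():
        if not prefix and key in SYSTEM_FIELDS:
            continue  # skip system fields at the top level
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects: flatten with dot notation
            parts.append(object_to_text(value, prefix=f"{path}."))
        elif isinstance(value, list):
            # Arrays: join with separators
            parts.append(f"{path}: " + ", ".join(str(v) for v in value))
        else:
            # Simple properties: direct value extraction
            parts.append(f"{path}: {value}")
    return " ".join(parts)

print(object_to_text({"name": "John Doe", "age": 35}))
# → name: John Doe age: 35
```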
Object Processing Flow
Object Metadata Preserved
When objects are processed, the following metadata is maintained:
- Object ID: Reference to original object
- Schema: Schema definition for context
- Register: Register containing the object
- Property Map: Which chunk contains which properties
- Extraction Timestamp: When text blob was created
Example: Contact Object Processing
Input Object (Contact Schema):
{
  "id": 12345,
  "uuid": "550e8400-e29b-41d4-a716-446655440000",
  "firstName": "Jane",
  "lastName": "Smith",
  "email": "jane.smith@example.com",
  "phone": "+31612345678",
  "company": {
    "name": "Acme Corp",
    "industry": "Technology"
  },
  "tags": ["vip", "partner"],
  "notes": "Important client, prefers email communication"
}
Step 1: Text Blob Creation
→ "firstName: Jane lastName: Smith email: jane.smith@example.com
phone: +31612345678 company.name: Acme Corp
company.industry: Technology tags: vip, partner
notes: Important client, prefers email communication"
Step 2: Chunking
- Strategy: Fixed size (short enough for single chunk)
- Chunks created: 1
- Chunk size: 215 characters
Step 3: Storage
- ObjectText entity created
- Chunk stored with property mapping
- Status: completed
Common Chunking Process
Both files and objects converge at the chunking stage, where text is divided into manageable pieces.
Chunking Strategies
1. Recursive Character Splitting (Recommended)
Smart splitting that respects natural text boundaries:
Priority Order:
1. Paragraph breaks (\n\n)
2. Sentence endings (. ! ?)
3. Line breaks (\n)
4. Word boundaries (spaces)
5. Character split (fallback)
Best for: Natural language documents, articles, reports
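The priority order above can be sketched as a recursive splitter: try the coarsest separator first, and fall back to finer ones only for pieces that are still too large. This is a minimal sketch with assumed names; production splitters additionally keep separators attached to chunks and apply overlap.

```python
# Separator priority: paragraph, sentence, line, word, character fallback.
SEPARATORS = ["\n\n", ". ", "\n", " ", ""]

def recursive_split(text: str, chunk_size: int = 1000, seps=SEPARATORS) -> list[str]:
    """Split text on natural boundaries, recursing to finer separators (sketch)."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = seps[0], seps[1:]
    if sep == "":
        # Last resort: hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
        else:
            if current:
                chunks.append(current)   # flush what we had
            if len(piece) <= chunk_size:
                current = piece
            else:
                # Piece itself is too big: recurse with the next separator.
                chunks.extend(recursive_split(piece, chunk_size, rest))
                current = ""
    if current:
        chunks.append(current)
    return chunks
```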
2. Fixed Size Splitting
Mechanical splitting with overlap:
Settings:
- Chunk size: 1000 characters
- Overlap: 200 characters
- Minimum chunk: 100 characters
Best for: Structured data, code, logs
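With the settings above, fixed-size splitting is a sliding window: each chunk starts `size - overlap` characters after the previous one. A minimal sketch (function name assumed):

```python
def fixed_split(text: str, size: int = 1000, overlap: int = 200,
                minimum: int = 100) -> list[str]:
    """Mechanical fixed-size chunking with overlap (sketch)."""
    step = size - overlap  # each window starts 800 chars after the last
    chunks = [text[i:i + size] for i in range(0, len(text), step)]
    # Drop a trailing fragment below the minimum; as long as the minimum
    # is smaller than the overlap, its content is already covered by the
    # previous chunk's overlap region.
    if len(chunks) > 1 and len(chunks[-1]) < minimum:
        chunks.pop()
    return chunks
```

For a 2,000-character text with the defaults, this produces three chunks starting at offsets 0, 800, and 1600, with each consecutive pair sharing 200 characters.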
Chunk Structure
Each chunk contains:
{
  "text": "The actual chunk content...",
  "start_offset": 0,
  "end_offset": 1000,
  "source_type": "file",
  "source_id": 12345,
  "language": "en",
  "language_level": "B2"
}
Enhancement Pipeline
After chunking, content can undergo optional enhancements:
1. Text Search Indexing (Solr)
Purpose: Fast keyword and phrase search across all content
Performance: ~50-200ms per query
Use Cases: Search box, filters, reporting
2. Vector Embeddings (RAG)
Purpose: Semantic search and AI context retrieval
Performance: ~200-500ms per chunk (one-time), ~100-300ms per query
Use Cases: AI chat, related content, recommendations
3. Entity Extraction (GDPR)
Purpose: GDPR compliance, PII tracking, data subject access requests
Performance: ~100-2000ms per chunk (depending on method)
Use Cases: Compliance audits, right to erasure, data mapping
4. Language Detection
Purpose: Multi-language support, content filtering, translation routing
Performance: ~10-50ms per chunk (local) or ~100-200ms (API)
Use Cases: Language filters, translation, localization
5. Language Level Assessment
Purpose: Accessibility compliance, content simplification, readability scoring
Performance: ~20-100ms per chunk
Use Cases: Plain language compliance, educational leveling, accessibility
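The five enhancements above form a per-chunk pipeline where each enabled step annotates the chunk and passes it on. A sketch with stand-in step functions (the real Solr, embedding, NER, and language services are external):

```python
def enhance_chunk(chunk: dict, steps: dict, enabled: set) -> dict:
    """Apply each enabled enhancement step to a chunk in turn (sketch)."""
    for name, step in steps.items():
        if name in enabled:
            chunk = step(chunk)
    return chunk

# Stand-ins: the real steps call external services and vary per chunk.
steps = {
    "language_detection": lambda c: {**c, "language": "en"},
    "language_level": lambda c: {**c, "language_level": "B2"},
}
result = enhance_chunk({"text": "Hello world"}, steps, {"language_detection"})
```

Because each step is toggled independently, you can enable only the enhancements a given workload needs, which is what the performance recommendations later in this document rely on.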
Comparison: Files vs Objects
| Aspect | Files | Objects |
|---|---|---|
| Input Format | Binary (PDF, DOCX, images) | Structured JSON data |
| Extraction | Text extraction engines required | Property value concatenation |
| Processing Time | Slow (2-60 seconds) | Fast (<1 second) |
| Complexity | High (OCR, parsing) | Low (string operations) |
| Chunk Count | Many (10-1000+) | Few (1-10) |
| Update Frequency | Rare (files are static) | Common (objects change often) |
| Best For | Documents, reports, images | Structured records, metadata |
| GDPR Risk | High (unstructured PII) | Medium (known data structure) |
| Search Precision | Lower (natural language) | Higher (structured fields) |
| Context | Full document context | Property-level context |
Combined Use Cases
Use Case 1: Customer Management
Object: Customer record
- Name, email, phone, notes
→ Chunked for search
File: Contract PDF attached to customer
- Terms, signatures, dates
→ Extracted and chunked
Search: "payment terms for Acme Corp"
→ Finds chunks from both object and file
→ Returns unified results
Use Case 2: GDPR Data Subject Access Request
Request: "Find all mentions of john.doe@example.com"
Step 1: Entity extraction finds email in:
- 15 chunks from 8 PDF files
- 3 chunks from 2 customer objects
- 12 chunks from 42 email messages
Step 2: Generate report with:
- All files containing email
- All objects referencing person
- All email conversations
- Exact positions in each source
Step 3: Provide data or anonymize on request
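Step 1 above amounts to scanning stored chunks for an entity and recording exact positions. An illustrative sketch (the chunk dicts mirror the "Chunk Structure" shown earlier; `find_entity` is a hypothetical helper, and real entity extraction uses NER rather than plain substring search):

```python
def find_entity(chunks: list[dict], entity: str) -> list[dict]:
    """Report which chunks mention an entity, with offsets (sketch)."""
    hits = []
    for chunk in chunks:
        pos = chunk["text"].find(entity)
        if pos != -1:
            hits.append({
                "source_type": chunk["source_type"],
                "source_id": chunk["source_id"],
                # Offset within the original source text, per chunk metadata.
                "position": chunk["start_offset"] + pos,
            })
    return hits

chunks = [
    {"text": "Contact john.doe@example.com for details", "source_type": "file",
     "source_id": 12345, "start_offset": 0},
    {"text": "No personal data here", "source_type": "object",
     "source_id": 67890, "start_offset": 0},
]
hits = find_entity(chunks, "john.doe@example.com")
```

Because every chunk carries its source type, source ID, and offsets, one scan yields the per-file, per-object report the request requires.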
Use Case 3: Multi-Language Knowledge Base
Content Sources:
- Files: User manuals (EN, NL, DE)
- Objects: FAQ entries (EN, NL)
- Emails: Support conversations (mixed)
Processing:
1. All sources → Chunks
2. Language detection → Tag each chunk
3. Vector embeddings → Enable semantic search
User Search (in Dutch):
→ System filters to NL chunks
→ Semantic search across files + objects + emails
→ Returns relevant content in user's language
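The search flow above reduces to: filter chunks by the detected language tag, then rank the remainder by similarity. A sketch where the score function is a toy stand-in for the real vector-embedding comparison:

```python
def search_in_language(chunks: list[dict], lang: str, score) -> list[dict]:
    """Filter chunks to one language, then rank by a score (sketch)."""
    candidates = [c for c in chunks if c.get("language") == lang]
    return sorted(candidates, key=score, reverse=True)

kb = [
    {"text": "Hoe reset ik mijn wachtwoord?", "language": "nl"},
    {"text": "How do I reset my password?", "language": "en"},
    {"text": "Wachtwoordbeleid en beveiliging", "language": "nl"},
]
# Toy score (text length); a real system scores embedding similarity.
hits = search_in_language(kb, "nl", score=lambda c: len(c["text"]))
```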
Configuration
Enabling File Processing
Settings → OpenRegister → File Configuration
Extract Text From: [All Files / Specific Folders / Object Files]
Text Extractor: [LLPhant / Dolphin]
Extraction Mode: [Immediate / Background Job / Cron Job / Manual Only]
Chunking Strategy: [Recursive / Fixed Size]
Extraction Mode Selection Guide
Immediate Mode:
- ✅ Use when: Text must be available immediately after upload
- ✅ Best for: Small files, critical workflows, real-time search requirements
- ⚠️ Consider: May slow down uploads for large files
- 📊 Performance: Synchronous processing during upload
Background Job Mode (Recommended):
- ✅ Use when: You want fast uploads with async processing
- ✅ Best for: Most production scenarios, large files, high-volume uploads
- ⚠️ Consider: Text may not be immediately available (typically seconds to minutes delay)
- 📊 Performance: Non-blocking, optimal for user experience
Cron Job Mode:
- ✅ Use when: You want to control processing load and timing
- ✅ Best for: Batch processing, predictable resource usage, scheduled maintenance windows
- ⚠️ Consider: Text extraction happens at scheduled intervals (default: every 15 minutes)
- 📊 Performance: Efficient batch processing, predictable system load
Manual Only Mode:
- ✅ Use when: You want complete control over when extraction happens
- ✅ Best for: Testing, selective processing, resource-constrained environments
- ⚠️ Consider: Requires manual intervention to trigger extraction
- 📊 Performance: No automatic processing overhead
Enabling Object Processing
Settings → OpenRegister → Text Analysis
Enable Object Text Extraction: [Yes / No]
Include Properties: [Select which properties to extract]
Chunking Strategy: [Recursive / Fixed Size]
Enabling Enhancements
Settings → OpenRegister → Text Analysis
✅ Text Search Indexing (Solr)
✅ Vector Embeddings (RAG)
✅ Entity Extraction (GDPR)
✅ Language Detection
✅ Language Level Assessment
Performance Recommendations
For File-Heavy Workloads
- Use Background Job or Cron Job mode for optimal performance
- Enable Dolphin for images/complex PDFs
- Use recursive chunking for better quality
- Enable selective enhancements (not all at once)
- Configure appropriate batch sizes for cron mode
For Object-Heavy Workloads
- Use immediate processing (objects are small)
- Enable fixed-size chunking (faster)
- Always enable language detection (fast on short text)
- Enable entity extraction for compliance
For Mixed Workloads
- Background processing for files
- Immediate processing for objects
- Use recursive chunking for both
- Enable all enhancements selectively per schema
API Examples
Search Across Both Sources
GET /api/search?q=contract%20terms&sources=files,objects
Response:
{
  "results": [
    {
      "source_type": "file",
      "source_id": 12345,
      "file_name": "contract-2024.pdf",
      "chunk_index": 3,
      "text": "...payment terms are net 30...",
      "score": 0.95
    },
    {
      "source_type": "object",
      "source_id": 67890,
      "schema": "customers",
      "property": "notes",
      "text": "...special contract terms agreed...",
      "score": 0.87
    }
  ]
}
Get All Chunks for a File
GET /api/files/12345/chunks
Get All Chunks for an Object
GET /api/objects/67890/chunks
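A client might build the search request shown above as follows. The base URL is an assumption for illustration; encoding follows the standard `urllib.parse` behavior.

```python
from urllib.parse import urlencode, quote

def search_url(base: str, query: str, sources: list[str]) -> str:
    """Build the /api/search request URL (sketch; base URL is assumed)."""
    params = urlencode({"q": query, "sources": ",".join(sources)}, quote_via=quote)
    return f"{base}/api/search?{params}"

url = search_url("https://cloud.example.org", "contract terms", ["files", "objects"])
# → https://cloud.example.org/api/search?q=contract%20terms&sources=files%2Cobjects
```

Note that `quote` percent-encodes the comma in the `sources` value (`%2C`); servers normally decode this transparently.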
Conclusion
OpenRegister's dual-source text extraction system provides:
- Comprehensive Coverage: Search across files AND structured data
- Unified Processing: Same chunking and enhancement pipeline
- Flexible Configuration: Enable features per source type
- GDPR Compliance: Track entities from all sources
- Intelligent Search: Semantic and keyword search across everything
By processing both files and objects into a common chunk format, OpenRegister creates a truly unified content search and analysis platform.
Next Steps:
- Text Extraction, Vectorization & Named Entity Recognition - Unified documentation for text extraction, vectorization, and NER
- Enhanced Text Extraction Documentation
- GDPR Entity Tracking
- Language Detection
- File Processing Details
- Object Management