Text Extraction Sources: Files vs Objects
OpenRegister processes content from two distinct sources, both of which are converted into chunks for search and analysis.
Processing Paths Overview
📄 Source 1: Files
Description
Files (documents, images, spreadsheets, etc.) are processed through text extraction engines to convert binary content into searchable text.
Supported File Types
| Category | Formats | Extraction Method |
|---|---|---|
| Documents | PDF, DOCX, DOC, ODT, RTF | LLPhant or Dolphin |
| Spreadsheets | XLSX, XLS, CSV | LLPhant or Dolphin |
| Presentations | PPTX | LLPhant or Dolphin |
| Text Files | TXT, MD, HTML, JSON, XML | LLPhant (native) |
| Images | JPG, PNG, GIF, WebP, TIFF | Dolphin (OCR only) |
File Processing Flow
The file processing flow varies based on the extraction mode configured. Each mode provides different timing and processing characteristics:
Extraction Modes Overview
1. Immediate Mode - Direct Link Processing
Characteristics:
- Direct Link: File upload and parsing logic are directly connected
- Synchronous: Processing happens during the upload request
- User Experience: User waits for extraction to complete
- Use Case: When immediate text availability is critical
- Performance: May slow down file uploads for large files
2. Background Job Mode - Delayed Extraction
Characteristics:
- Delayed Action: Extraction happens after upload completes
- Asynchronous: Processing on the job stack, non-blocking
- User Experience: Upload completes immediately
- Use Case: Recommended for most scenarios (best performance)
- Performance: No impact on upload speed
3. Cron Job Mode - Periodic Batch Processing
Characteristics:
- Repeating Action: Periodic batch processing via scheduled jobs
- Batch Processing: Multiple files processed together
- User Experience: Upload completes immediately, extraction happens later
- Use Case: When you want to control processing load and timing
- Performance: Efficient batch processing, predictable load
4. Manual Only Mode - User-Triggered Processing
Characteristics:
- Manual Trigger: Only processes when user explicitly triggers
- User Control: Complete control over when extraction happens
- Use Case: Selective processing, testing, or resource-constrained environments
- Performance: No automatic processing overhead
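The four modes above can be sketched as a simple dispatch in an upload handler. This is an illustrative sketch only, not OpenRegister's actual implementation; the names (`ExtractionMode`, `on_file_uploaded`, and the `extract`/`queue_job` callables) are assumptions.

```python
from enum import Enum

class ExtractionMode(Enum):
    IMMEDIATE = "immediate"
    BACKGROUND = "background_job"
    CRON = "cron_job"
    MANUAL = "manual_only"

def on_file_uploaded(file_id: int, mode: ExtractionMode, extract, queue_job) -> str:
    """Dispatch text extraction according to the configured mode (sketch)."""
    if mode is ExtractionMode.IMMEDIATE:
        extract(file_id)   # synchronous: the user waits for extraction
        return "extracted"
    if mode is ExtractionMode.BACKGROUND:
        queue_job(file_id) # asynchronous: picked up from the job stack
        return "queued"
    # CRON and MANUAL: do nothing now; a scheduled batch run or an
    # explicit user action triggers extraction later.
    return "deferred"
```

The point of the dispatch is that only Immediate mode does work inside the upload request; every other mode returns immediately and defers extraction.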
Detailed File Processing Flow
File Metadata Preserved
When files are processed, the following metadata is maintained:
- Source Reference: Original file ID from Nextcloud
- File Path: Location in Nextcloud filesystem
- MIME Type: File format information
- File Size: Original file size in bytes
- Checksum: For change detection
- Extraction Method: Which engine was used (LLPhant or Dolphin)
- Extraction Timestamp: When text was extracted
Example: PDF Processing
Input: contract-2024.pdf (245 KB, 15 pages)
Step 1: Text Extraction
- Engine: Dolphin AI
- Time: 8.2 seconds
- Output: 12,450 characters of text
Step 2: Chunking
- Strategy: Recursive (respects paragraphs)
- Chunks created: 14
- Average chunk size: 889 characters
- Overlap: 200 characters
Step 3: Storage
- FileText entity created
- Chunks stored in chunks_json field
- Status: completed
📦 Source 2: Objects
Description
OpenRegister objects (structured data entities) are converted into text blobs by concatenating their property values. This enables full-text search across structured data.
Object-to-Text Conversion
Objects are transformed using the following rules:
1. Simple Properties: Direct value extraction
   { "name": "John Doe", "age": 35 }
   → "name: John Doe age: 35"
2. Arrays: Join with separators
   { "tags": ["urgent", "customer", "support"] }
   → "tags: urgent, customer, support"
3. Nested Objects: Flatten with dot notation
   { "address": { "city": "Amsterdam", "country": "NL" } }
   → "address.city: Amsterdam address.country: NL"
4. Special Handling: Exclude system fields
   - Ignore: id, uuid, created, updated
   - Include: User-defined properties only
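The conversion rules above can be sketched in a few lines. This is an illustrative sketch, not OpenRegister's internal code; `object_to_text` and `SYSTEM_FIELDS` are assumed names.

```python
# System fields excluded from the text blob (per the rules above).
SYSTEM_FIELDS = {"id", "uuid", "created", "updated"}

def object_to_text(obj: dict, prefix: str = "") -> str:
    """Concatenate property values into a searchable text blob (sketch)."""
    parts = []
    for key, value in obj.items():
        if not prefix and key in SYSTEM_FIELDS:
            continue  # skip system fields at the top level
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects: flatten with dot notation
            parts.append(object_to_text(value, prefix=f"{path}."))
        elif isinstance(value, list):
            # Arrays: join with separators
            parts.append(f"{path}: " + ", ".join(str(v) for v in value))
        else:
            # Simple properties: direct value extraction
            parts.append(f"{path}: {value}")
    return " ".join(parts)

print(object_to_text({"name": "John Doe", "age": 35}))
# → name: John Doe age: 35
```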
Object Processing Flow
Object Metadata Preserved
When objects are processed, the following metadata is maintained:
- Object ID: Reference to original object
- Schema: Schema definition for context
- Register: Register containing the object
- Property Map: Which chunk contains which properties
- Extraction Timestamp: When text blob was created
Example: Contact Object Processing
Input Object (Contact Schema):
{
  "id": 12345,
  "uuid": "550e8400-e29b-41d4-a716-446655440000",
  "firstName": "Jane",
  "lastName": "Smith",
  "email": "jane.smith@example.com",
  "phone": "+31612345678",
  "company": {
    "name": "Acme Corp",
    "industry": "Technology"
  },
  "tags": ["vip", "partner"],
  "notes": "Important client, prefers email communication"
}
Step 1: Text Blob Creation
→ "firstName: Jane lastName: Smith email: jane.smith@example.com
phone: +31612345678 company.name: Acme Corp
company.industry: Technology tags: vip, partner
notes: Important client, prefers email communication"
Step 2: Chunking
- Strategy: Fixed size (short enough for single chunk)
- Chunks created: 1
- Chunk size: 215 characters
Step 3: Storage
- ObjectText entity created
- Chunk stored with property mapping
- Status: completed
Common Chunking Process
Both files and objects converge at the chunking stage, where text is divided into manageable pieces.
Chunking Strategies
1. Recursive Character Splitting (Recommended)
Smart splitting that respects natural text boundaries:
Priority Order:
1. Paragraph breaks (\n\n)
2. Sentence endings (. ! ?)
3. Line breaks (\n)
4. Word boundaries (spaces)
5. Character split (fallback)
Best for: Natural language documents, articles, reports
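The priority order above can be sketched as a recursive splitter: try the coarsest separator first, and fall back to finer ones only for pieces that are still too large. This is a minimal sketch with assumed names; production splitters additionally keep separators attached to chunks and apply overlap.

```python
# Separator priority: paragraph, sentence, line, word, character fallback.
SEPARATORS = ["\n\n", ". ", "\n", " ", ""]

def recursive_split(text: str, chunk_size: int = 1000, seps=SEPARATORS) -> list[str]:
    """Split text on natural boundaries, recursing to finer separators (sketch)."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = seps[0], seps[1:]
    if sep == "":
        # Last resort: hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
        else:
            if current:
                chunks.append(current)   # flush what we had
            if len(piece) <= chunk_size:
                current = piece
            else:
                # Piece itself is too big: recurse with the next separator.
                chunks.extend(recursive_split(piece, chunk_size, rest))
                current = ""
    if current:
        chunks.append(current)
    return chunks
```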
2. Fixed Size Splitting
Mechanical splitting with overlap:
Settings:
- Chunk size: 1000 characters
- Overlap: 200 characters
- Minimum chunk: 100 characters
Best for: Structured data, code, logs
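With the settings above, fixed-size splitting is a sliding window: each chunk starts `size - overlap` characters after the previous one. A minimal sketch (function name assumed):

```python
def fixed_split(text: str, size: int = 1000, overlap: int = 200,
                minimum: int = 100) -> list[str]:
    """Mechanical fixed-size chunking with overlap (sketch)."""
    step = size - overlap  # each window starts 800 chars after the last
    chunks = [text[i:i + size] for i in range(0, len(text), step)]
    # Drop a trailing fragment below the minimum; as long as the minimum
    # is smaller than the overlap, its content is already covered by the
    # previous chunk's overlap region.
    if len(chunks) > 1 and len(chunks[-1]) < minimum:
        chunks.pop()
    return chunks
```

For a 2,000-character text with the defaults, this produces three chunks starting at offsets 0, 800, and 1600, with each consecutive pair sharing 200 characters.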
Chunk Structure
Each chunk contains:
{
  "text": "The actual chunk content...",
  "start_offset": 0,
  "end_offset": 1000,
  "source_type": "file",
  "source_id": 12345,
  "language": "en",
  "language_level": "B2"
}
Enhancement Pipeline
After chunking, content can undergo optional enhancements:
1. Text Search Indexing (Solr)
Purpose: Fast keyword and phrase search across all content
Performance: ~50-200ms per query
Use Cases: Search box, filters, reporting
2. Vector Embeddings (RAG)
Purpose: Semantic search and AI context retrieval
Performance: ~200-500ms per chunk (one-time), ~100-300ms per query
Use Cases: AI chat, related content, recommendations
3. Entity Extraction (GDPR)
Purpose: GDPR compliance, PII tracking, data subject access requests
Performance: ~100-2000ms per chunk (depending on method)
Use Cases: Compliance audits, right to erasure, data mapping
4. Language Detection
Purpose: Multi-language support, content filtering, translation routing
Performance: ~10-50ms per chunk (local) or ~100-200ms (API)
Use Cases: Language filters, translation, localization
5. Language Level Assessment
Purpose: Accessibility compliance, content simplification, readability scoring
Performance: ~20-100ms per chunk
Use Cases: Plain language compliance, educational leveling, accessibility
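The five enhancements above form a per-chunk pipeline where each enabled step annotates the chunk and passes it on. A sketch with stand-in step functions (the real Solr, embedding, NER, and language services are external):

```python
def enhance_chunk(chunk: dict, steps: dict, enabled: set) -> dict:
    """Apply each enabled enhancement step to a chunk in turn (sketch)."""
    for name, step in steps.items():
        if name in enabled:
            chunk = step(chunk)
    return chunk

# Stand-ins: the real steps call external services and vary per chunk.
steps = {
    "language_detection": lambda c: {**c, "language": "en"},
    "language_level": lambda c: {**c, "language_level": "B2"},
}
result = enhance_chunk({"text": "Hello world"}, steps, {"language_detection"})
```

Because each step is toggled independently, you can enable only the enhancements a given workload needs, which is what the performance recommendations later in this document rely on.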
Comparison: Files vs Objects
| Aspect | Files | Objects |
|---|---|---|
| Input Format | Binary (PDF, DOCX, images) | Structured JSON data |
| Extraction | Text extraction engines required | Property value concatenation |
| Processing Time | Slow (2-60 seconds) | Fast (<1 second) |
| Complexity | High (OCR, parsing) | Low (string operations) |
| Chunk Count | Many (10-1000+) | Few (1-10) |
| Update Frequency | Rare (files are static) | Common (objects change often) |
| Best For | Documents, reports, images | Structured records, metadata |
| GDPR Risk | High (unstructured PII) | Medium (known data structure) |
| Search Precision | Lower (natural language) | Higher (structured fields) |
| Context | Full document context | Property-level context |
Combined Use Cases
Use Case 1: Customer Management
Object: Customer record
- Name, email, phone, notes
→ Chunked for search
File: Contract PDF attached to customer
- Terms, signatures, dates
→ Extracted and chunked
Search: "payment terms for Acme Corp"
→ Finds chunks from both object and file
→ Returns unified results
Use Case 2: GDPR Data Subject Access Request
Request: "Find all mentions of john.doe@example.com"
Step 1: Entity extraction finds email in:
- 15 chunks from 8 PDF files
- 3 chunks from 2 customer objects
- 12 chunks from 42 email messages
Step 2: Generate report with:
- All files containing email
- All objects referencing person
- All email conversations
- Exact positions in each source
Step 3: Provide data or anonymize on request
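Step 1 above amounts to scanning stored chunks for an entity and recording exact positions. An illustrative sketch (the chunk dicts mirror the "Chunk Structure" shown earlier; `find_entity` is a hypothetical helper, and real entity extraction uses NER rather than plain substring search):

```python
def find_entity(chunks: list[dict], entity: str) -> list[dict]:
    """Report which chunks mention an entity, with offsets (sketch)."""
    hits = []
    for chunk in chunks:
        pos = chunk["text"].find(entity)
        if pos != -1:
            hits.append({
                "source_type": chunk["source_type"],
                "source_id": chunk["source_id"],
                # Offset within the original source text, per chunk metadata.
                "position": chunk["start_offset"] + pos,
            })
    return hits

chunks = [
    {"text": "Contact john.doe@example.com for details", "source_type": "file",
     "source_id": 12345, "start_offset": 0},
    {"text": "No personal data here", "source_type": "object",
     "source_id": 67890, "start_offset": 0},
]
hits = find_entity(chunks, "john.doe@example.com")
```

Because every chunk carries its source type, source ID, and offsets, one scan yields the per-file, per-object report the request requires.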
Use Case 3: Multi-Language Knowledge Base
Content Sources:
- Files: User manuals (EN, NL, DE)
- Objects: FAQ entries (EN, NL)
- Emails: Support conversations (mixed)
Processing:
1. All sources → Chunks
2. Language detection → Tag each chunk
3. Vector embeddings → Enable semantic search
User Search (in Dutch):
→ System filters to NL chunks
→ Semantic search across files + objects + emails
→ Returns relevant content in user's language
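The search flow above reduces to: filter chunks by the detected language tag, then rank the remainder by similarity. A sketch where the score function is a toy stand-in for the real vector-embedding comparison:

```python
def search_in_language(chunks: list[dict], lang: str, score) -> list[dict]:
    """Filter chunks to one language, then rank by a score (sketch)."""
    candidates = [c for c in chunks if c.get("language") == lang]
    return sorted(candidates, key=score, reverse=True)

kb = [
    {"text": "Hoe reset ik mijn wachtwoord?", "language": "nl"},
    {"text": "How do I reset my password?", "language": "en"},
    {"text": "Wachtwoordbeleid en beveiliging", "language": "nl"},
]
# Toy score (text length); a real system scores embedding similarity.
hits = search_in_language(kb, "nl", score=lambda c: len(c["text"]))
```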
Configuration
Enabling File Processing
Settings → OpenRegister → File Configuration
Extract Text From: [All Files / Specific Folders / Object Files]
Text Extractor: [LLPhant / Dolphin]
Extraction Mode: [Immediate / Background Job / Cron Job / Manual Only]
Chunking Strategy: [Recursive / Fixed Size]
Extraction Mode Selection Guide
Immediate Mode:
- ✅ Use when: Text must be available immediately after upload
- ✅ Best for: Small files, critical workflows, real-time search requirements
- ⚠️ Consider: May slow down uploads for large files
- 📊 Performance: Synchronous processing during upload
Background Job Mode (Recommended):
- ✅ Use when: You want fast uploads with async processing
- ✅ Best for: Most production scenarios, large files, high-volume uploads
- ⚠️ Consider: Text may not be immediately available (typically seconds to minutes delay)
- 📊 Performance: Non-blocking, optimal for user experience
Cron Job Mode:
- ✅ Use when: You want to control processing load and timing
- ✅ Best for: Batch processing, predictable resource usage, scheduled maintenance windows
- ⚠️ Consider: Text extraction happens at scheduled intervals (default: every 15 minutes)
- 📊 Performance: Efficient batch processing, predictable system load
Manual Only Mode:
- ✅ Use when: You want complete control over when extraction happens
- ✅ Best for: Testing, selective processing, resource-constrained environments
- ⚠️ Consider: Requires manual intervention to trigger extraction
- 📊 Performance: No automatic processing overhead
Enabling Object Processing
Settings → OpenRegister → Text Analysis
Enable Object Text Extraction: [Yes / No]
Include Properties: [Select which properties to extract]
Chunking Strategy: [Recursive / Fixed Size]
Enabling Enhancements
Settings → OpenRegister → Text Analysis
✅ Text Search Indexing (Solr)
✅ Vector Embeddings (RAG)
✅ Entity Extraction (GDPR)
✅ Language Detection
✅ Language Level Assessment
Performance Recommendations
For File-Heavy Workloads
- Use Background Job or Cron Job mode for optimal performance
- Enable Dolphin for images/complex PDFs
- Use recursive chunking for better quality
- Enable selective enhancements (not all at once)
- Configure appropriate batch sizes for cron mode
For Object-Heavy Workloads
- Use immediate processing (objects are small)
- Enable fixed-size chunking (faster)
- Always enable language detection (fast on short text)
- Enable entity extraction for compliance
For Mixed Workloads
- Background processing for files
- Immediate processing for objects
- Use recursive chunking for both
- Enable all enhancements selectively per schema
API Examples
Search Across Both Sources
GET /api/search?q=contract%20terms&sources=files,objects
Response:
{
  "results": [
    {
      "source_type": "file",
      "source_id": 12345,
      "file_name": "contract-2024.pdf",
      "chunk_index": 3,
      "text": "...payment terms are net 30...",
      "score": 0.95
    },
    {
      "source_type": "object",
      "source_id": 67890,
      "schema": "customers",
      "property": "notes",
      "text": "...special contract terms agreed...",
      "score": 0.87
    }
  ]
}
Get All Chunks for a File
GET /api/files/12345/chunks
Get All Chunks for an Object
GET /api/objects/67890/chunks
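A client might build the search request shown above as follows. The base URL is an assumption for illustration; encoding follows the standard `urllib.parse` behavior.

```python
from urllib.parse import urlencode, quote

def search_url(base: str, query: str, sources: list[str]) -> str:
    """Build the /api/search request URL (sketch; base URL is assumed)."""
    params = urlencode({"q": query, "sources": ",".join(sources)}, quote_via=quote)
    return f"{base}/api/search?{params}"

url = search_url("https://cloud.example.org", "contract terms", ["files", "objects"])
# → https://cloud.example.org/api/search?q=contract%20terms&sources=files%2Cobjects
```

Note that `quote` percent-encodes the comma in the `sources` value (`%2C`); servers normally decode this transparently.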
Conclusion
OpenRegister's dual-source text extraction system provides:
- Comprehensive Coverage: Search across files AND structured data
- Unified Processing: Same chunking and enhancement pipeline
- Flexible Configuration: Enable features per source type
- GDPR Compliance: Track entities from all sources
- Intelligent Search: Semantic and keyword search across everything
By processing both files and objects into a common chunk format, OpenRegister creates a truly unified content search and analysis platform.
Next Steps:
- Text Extraction, Vectorization & Named Entity Recognition - Unified documentation for text extraction, vectorization, and NER
- Enhanced Text Extraction Documentation
- GDPR Entity Tracking
- Language Detection
- File Processing Details
- Object Management