Enhanced Text Extraction Implementation Plan
Overview
This document outlines the implementation plan for adding entity extraction, language detection, and language level assessment to OpenRegister's text extraction system, including GDPR entity tracking capabilities.
Documentation Created
- Enhanced Text Extraction & GDPR Entity Tracking
  - Complete feature documentation
  - Processing methods (local, external services, LLM, hybrid)
  - GDPR entity register design
  - Language detection and assessment
  - Preparing for anonymization
  - API endpoints
- Text Extraction Sources: Files vs Objects
  - Visual separation of file and object processing paths
  - Detailed flow diagrams for each source type
  - Comparison and combined use cases
  - Configuration options
- Text Extraction Database Entities
  - Complete database schema
  - Entity relationship diagrams
  - PHP entity classes
  - Migration strategy
  - Performance considerations
Key Features Added to Documentation
1. Two Processing Paths
File Path
File Upload → Text Extraction (LLPhant/Dolphin) → Complete Text → Chunks
Object Path
Object Creation → Property Values → Text Blob → Chunks
Both paths converge at chunks, which can then undergo:
- Text search indexing (Solr)
- Vector embeddings (RAG)
- Entity extraction (GDPR)
- Language detection
- Language level assessment
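The object path can be sketched as follows. This is a minimal illustration in Python, not the plan's PHP service: `flatten_object` and `chunk_text` are hypothetical names, and the chunker uses plain fixed-size windows with overlap rather than the "Recursive" strategy named in the configuration. The defaults of 1000 and 200 characters mirror the chunk size and overlap shown in the settings mock-up.

```python
from typing import Any, Iterator


def flatten_object(value: Any) -> Iterator[str]:
    """Yield every scalar leaf of a nested object as text, depth first."""
    if isinstance(value, dict):
        for item in value.values():
            yield from flatten_object(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten_object(item)
    elif value is not None and not isinstance(value, bool):
        # Assumption: booleans and nulls carry no searchable text.
        yield str(value)


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into fixed-size windows that overlap by `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"start": start, "end": end, "text": text[start:end]})
        if end == len(text):
            break
        start += step
    return chunks


# Object Creation -> Property Values -> Text Blob -> Chunks
obj = {
    "title": "Permit request",
    "applicant": {"name": "Jane Doe", "age": 41},
    "tags": ["urgent", None],
}
blob = "\n".join(flatten_object(obj))
chunks = chunk_text(blob, size=20, overlap=5)
```

Storing `start` and `end` with each chunk keeps the offsets needed later for entity positions.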
2. GDPR Entity Register
Two new entities:
- Entity: Stores unique entities (persons, emails, organizations)
  - UUID, type, value, category
  - Detection timestamp and metadata
  - Supports deduplication
- EntityRelation: Links entities to chunk positions
  - Entity ID + Chunk ID
  - Precise character positions
  - Confidence score and detection method
  - Anonymization tracking
  - Context for verification
Prepared for anonymization:
- Precise position tracking
- Consistent replacement values
- Reversible anonymization
- Metadata preservation
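To illustrate why precise positions matter, here is a minimal Python sketch of position-based replacement. The two dataclasses are simplified stand-ins for the Entity and EntityRelation records described above, and `anonymize` and the placeholder format are hypothetical. Overlapping spans are not handled.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    uuid: str
    type: str   # e.g. "person", "email"
    value: str


@dataclass(frozen=True)
class EntityRelation:
    entity_uuid: str
    chunk_id: int
    start: int          # inclusive character offset within the chunk
    end: int            # exclusive character offset within the chunk
    confidence: float
    method: str


def anonymize(chunk_text, relations, placeholders):
    """Replace each tracked span with its entity's placeholder.

    Spans are applied right to left so earlier offsets stay valid.
    """
    result = chunk_text
    for rel in sorted(relations, key=lambda r: r.start, reverse=True):
        result = result[:rel.start] + placeholders[rel.entity_uuid] + result[rel.end:]
    return result


text = "Contact Jane Doe at jane@example.org or call Jane Doe."
person = Entity(str(uuid.uuid4()), "person", "Jane Doe")
email = Entity(str(uuid.uuid4()), "email", "jane@example.org")

# Record every occurrence of each entity with its exact position.
relations = []
for entity in (person, email):
    pos = text.find(entity.value)
    while pos != -1:
        relations.append(
            EntityRelation(entity.uuid, 1, pos, pos + len(entity.value), 1.0, "exact-match")
        )
        pos = text.find(entity.value, pos + 1)

placeholders = {person.uuid: "[PERSON_1]", email.uuid: "[EMAIL_1]"}
anonymized = anonymize(text, relations, placeholders)
```

Because one placeholder is kept per entity, every occurrence is replaced consistently, and the retained mapping from placeholder to entity is what would make the operation reversible.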
3. Language Detection & Assessment
Chunk entity enhancements:
- language field: ISO 639-1 codes (e.g., 'en', 'nl', 'de')
- language_level field: Reading level (e.g., 'B2', 'Grade 8', '65')
- language_confidence field: Detection confidence (0.0-1.0)
- detection_method field: How it was detected
Use cases:
- Multi-language content management
- Accessibility compliance (plain language)
- Content routing by language
- Readability assessment
4. Multiple Processing Methods
All enhancements support three processing methods, plus a hybrid mode that combines them:
- Local Algorithms: Fast, privacy-friendly, no external deps
- External Services: Specialized APIs (Presidio, NLDocs, Dolphin)
- LLM Processing: Context-aware, handles ambiguity
- Hybrid (Recommended): Multiple methods with confidence scoring
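The plan does not specify how the hybrid mode scores competing results. One possible approach, sketched in Python under that caveat, is a weighted vote: each method reports a label with a confidence, and the winning label is accepted only if its weighted score clears a threshold. `Detection`, `combine`, and the weights are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Detection(NamedTuple):
    label: str
    confidence: float
    method: str   # "local", "api" or "llm"


def combine(detections, weights, threshold=0.7):
    """Return (label, score) for the best-supported label, or None.

    The score is the weighted confidence of the methods that voted for the
    label, divided by the total weight of all methods that ran, so
    disagreement between methods lowers the score.
    """
    total_weight = sum(weights.get(d.method, 1.0) for d in detections)
    if total_weight <= 0:
        return None
    scores = defaultdict(float)
    for d in detections:
        scores[d.label] += weights.get(d.method, 1.0) * d.confidence
    label, score = max(scores.items(), key=lambda item: item[1])
    score /= total_weight
    return (label, score) if score >= threshold else None


weights = {"local": 1.0, "llm": 1.0}
agree = combine(
    [Detection("nl", 0.9, "local"), Detection("nl", 0.8, "llm")], weights
)
split = combine(
    [Detection("nl", 0.9, "local"), Detection("de", 0.9, "llm")], weights
)
```

With these inputs `agree` accepts 'nl', while `split` is rejected because the two methods disagree and neither label reaches the threshold.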
5. Extended Chunking Support
Email chunking:
- Segment by headers, body, signature, attachments
- Preserve sender/recipient as entities
- Link chunks to email threads
Chat message chunking:
- Process individual messages with context
- Track conversation participants as entities
- Maintain threading
- Include previous messages for coherent search
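Email segmentation could look roughly like the Python sketch below, which uses the standard library's `email` package. It is an illustration only: `chunk_email` is a hypothetical name, only single-part plain-text messages are handled, attachments are ignored, and the signature is located with the conventional `-- ` separator line.

```python
from email import message_from_string
from email.utils import getaddresses

SIGNATURE_DELIMITER = "\n-- \n"  # conventional plain-text signature separator


def chunk_email(raw: str) -> dict:
    """Split a raw message into header, body and signature segments."""
    msg = message_from_string(raw)
    # Assumption: multipart messages are out of scope for this sketch.
    payload = msg.get_payload() if not msg.is_multipart() else ""
    body, _, signature = payload.partition(SIGNATURE_DELIMITER)

    # Sender and recipients become candidate entities.
    participants = getaddresses(
        msg.get_all("From", []) + msg.get_all("To", []) + msg.get_all("Cc", [])
    )

    chunks = [{"segment": "headers", "text": f"Subject: {msg.get('Subject', '')}"}]
    if body.strip():
        chunks.append({"segment": "body", "text": body.strip()})
    if signature.strip():
        chunks.append({"segment": "signature", "text": signature.strip()})

    return {
        "message_id": msg.get("Message-ID"),
        "in_reply_to": msg.get("In-Reply-To"),  # links the chunks to a thread
        "participants": [addr for _name, addr in participants],
        "chunks": chunks,
    }


raw = (
    "From: Jane Doe <jane@example.org>\n"
    "To: support@example.org\n"
    "Subject: Permit status\n"
    "Message-ID: <2@example.org>\n"
    "In-Reply-To: <1@example.org>\n"
    "\n"
    "Hello,\nany news on my permit?\n"
    "-- \n"
    "Jane Doe\n"
)
result = chunk_email(raw)
```

Keeping `in_reply_to` next to the chunks is one way to link them to a thread without storing the whole conversation in every chunk.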
Database Changes Required
New Tables
- oc_openregister_object_texts: Text blobs from objects
- oc_openregister_chunks: Individual chunks (migrated from chunks_json)
- oc_openregister_entities: GDPR entity register
- oc_openregister_entity_relations: Entity-to-chunk mappings
Updated Tables
- oc_openregister_file_texts: No schema changes required (already has chunks_json)
The Phase 3 migration moves chunks from chunks_json into the dedicated oc_openregister_chunks table for better querying.
Implementation Phases
Phase 1: Database Schema (Week 1)
Tasks:
- Create migration for new tables
- Create PHP entity classes
- Create mapper classes
- Add unit tests for entities
Deliverables:
- lib/Migration/Version1DateXXXXXXXX.php
- lib/Db/ObjectText.php
- lib/Db/Chunk.php
- lib/Db/GdprEntity.php
- lib/Db/EntityRelation.php
- Corresponding mapper classes
Phase 2: Object Text Extraction (Week 2)
Tasks:
- Create ObjectTextExtractionService
- Integrate with SaveObject event
- Property value concatenation logic
- Chunking for objects
- Add configuration settings
Deliverables:
- lib/Service/ObjectTextExtractionService.php
- Integration with existing SaveObject flow
- Settings UI for object extraction
- Unit tests
Phase 3: Chunk Migration (Week 3)
Tasks:
- Create ChunkService
- Migrate FileText chunks_json to Chunk table
- Background job for migration
- Update services to use Chunk entity
- Maintain backward compatibility
Deliverables:
- lib/Service/ChunkService.php
- lib/BackgroundJob/MigrateChunksJob.php
- Updated TextExtractionService
- Migration status tracking
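The migration logic for this phase might be shaped like the Python sketch below, run here against an in-memory SQLite database. The table and column names are simplified stand-ins, and the layout of chunks_json (a JSON array of strings) is an assumption, since the plan does not describe it. The source column is left untouched for backward compatibility, and re-running the function skips files that were already migrated.

```python
import json
import sqlite3


def migrate_chunks(db: sqlite3.Connection) -> int:
    """Copy chunks from file_texts.chunks_json into the chunks table.

    Safe to re-run: files that already have rows in `chunks` are skipped.
    Returns the number of chunk rows inserted.
    """
    pending = db.execute(
        "SELECT id, chunks_json FROM file_texts "
        "WHERE chunks_json IS NOT NULL "
        "AND id NOT IN (SELECT source_id FROM chunks WHERE source_type = 'file')"
    ).fetchall()

    migrated = 0
    for file_text_id, chunks_json in pending:
        rows = [
            ("file", file_text_id, index, text)
            for index, text in enumerate(json.loads(chunks_json))
        ]
        with db:  # one transaction per file: all of its chunks or none
            db.executemany(
                "INSERT INTO chunks (source_type, source_id, chunk_index, text_content) "
                "VALUES (?, ?, ?, ?)",
                rows,
            )
        migrated += len(rows)
    return migrated


db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE file_texts (id INTEGER PRIMARY KEY, chunks_json TEXT);
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        source_type TEXT NOT NULL,
        source_id INTEGER NOT NULL,
        chunk_index INTEGER NOT NULL,
        text_content TEXT NOT NULL
    );
    """
)
db.execute(
    "INSERT INTO file_texts (id, chunks_json) VALUES (1, ?)",
    (json.dumps(["first", "second"]),),
)
db.execute("INSERT INTO file_texts (id, chunks_json) VALUES (2, NULL)")
db.commit()

first_run = migrate_chunks(db)
second_run = migrate_chunks(db)
```

The count of files still returned by the `pending` query doubles as a simple migration status indicator.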
Phase 4: Language Detection (Week 4)
Tasks:
- Create LanguageDetectionService
- Implement local algorithm (lingua or similar)
- Implement API integration (optional)
- Implement LLM integration (optional)
- Add background job for batch processing
- Add configuration UI
Deliverables:
- lib/Service/LanguageDetectionService.php
- lib/BackgroundJob/DetectLanguageJob.php
- Settings UI for language detection
- Unit tests
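As a rough idea of what a local algorithm involves, the Python sketch below scores a chunk by counting stopword hits per language. It is a deliberately simple stand-in, not the lingua library mentioned in the tasks: the word lists are tiny illustrative samples, and `detect_language` is a hypothetical name.

```python
import re

# Assumption: a handful of very frequent function words per language.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it", "for", "with"},
    "nl": {"de", "het", "een", "en", "van", "is", "dat", "op", "voor", "met"},
    "de": {"der", "die", "das", "und", "ist", "von", "zu", "mit", "nicht", "ein"},
}


def detect_language(text):
    """Return (ISO 639-1 code, confidence) based on stopword hits.

    Confidence is the winning language's share of all stopword hits;
    (None, 0.0) is returned when no stopword is found.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return None, 0.0
    best = max(hits, key=hits.get)
    return best, hits[best] / total


language, confidence = detect_language(
    "De aanvraag voor het paspoort is op tijd ingediend met een kopie van het formulier."
)
```

The returned pair maps directly onto the language and language_confidence fields, and the confidence can be compared against the configured threshold before the result is stored.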
Phase 5: Language Level Assessment (Week 5)
Tasks:
- Create LanguageLevelService
- Implement readability formulas (Flesch-Kincaid, etc.)
- Implement API integration (optional)
- Implement LLM integration (optional)
- Add background job
- Add configuration UI
Deliverables:
- lib/Service/LanguageLevelService.php
- lib/BackgroundJob/AssessLanguageLevelJob.php
- Settings UI for level assessment
- Unit tests
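For reference, the formula-based assessment can be sketched in Python as below. The coefficients are those of the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas, which were calibrated for English; the syllable counter is a crude vowel-group heuristic and `readability` is a hypothetical name. Mapping scores to a scale such as CEFR would need a separate, explicitly chosen table.

```python
import re


def count_syllables(word):
    """Rough English heuristic: one syllable per group of consecutive vowels."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def readability(text):
    """Return both Flesch scores, or None if the text has no words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return {
        # Higher means easier to read.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Approximate US school grade.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


simple = readability("The cat sat on the mat. It was happy.")
dense = readability(
    "Notwithstanding administrative considerations, the municipality "
    "reconsidered the aforementioned application."
)
```

Either value could be stored in language_level, which already allows numeric scores alongside labels such as 'B2'.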
Phase 6: Entity Extraction (Week 6-7)
Tasks:
- Create EntityExtractionService
- Implement regex patterns (local)
- Implement Presidio integration (optional)
- Implement LLM integration (optional)
- Entity deduplication logic
- EntityRelation creation
- Background job for batch processing
- Add configuration UI
Deliverables:
- lib/Service/EntityExtractionService.php
- lib/BackgroundJob/ExtractEntitiesJob.php
- Settings UI for entity extraction
- Unit tests
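The local, regex-based part of this phase could be sketched in Python as follows. The two patterns are simplified examples rather than complete validators, and `Occurrence` and `extract_entities` are hypothetical names. Deduplication keys each entity on its type and lowercased value, and every match is stored with its position and a context snippet bounded by the context window.

```python
import re
from dataclasses import dataclass

# Assumption: deliberately simple patterns; production patterns need more care.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


@dataclass
class Occurrence:
    entity_key: tuple      # (type, normalized value)
    chunk_id: int
    start: int
    end: int
    context: str
    confidence: float = 1.0
    method: str = "regex"


def extract_entities(chunks, context_window=100):
    """Return (unique entities, occurrences) for a {chunk_id: text} mapping."""
    entities = {}       # (type, normalized value) -> value as first seen
    occurrences = []
    for chunk_id, text in chunks.items():
        for entity_type, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                key = (entity_type, match.group().lower())
                entities.setdefault(key, match.group())
                lo = max(match.start() - context_window, 0)
                hi = match.end() + context_window
                occurrences.append(
                    Occurrence(key, chunk_id, match.start(), match.end(), text[lo:hi])
                )
    return entities, occurrences


chunks = {
    1: "Mail jane@example.org for details.",
    2: "Reply to Jane@Example.org from 192.0.2.10.",
}
entities, occurrences = extract_entities(chunks)
```

Here the two spellings of the email address collapse into one entity with two occurrences, which is the split between Entity and EntityRelation described earlier.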
Phase 7: GDPR Register UI (Week 8)
Tasks:
- Create EntityController
- Create Vue components for entity list
- Entity details view
- Occurrence list
- GDPR report generation
- Export functionality
- Search and filtering
Deliverables:
- lib/Controller/EntityController.php
- src/views/gdpr/EntitiesIndex.vue
- src/views/gdpr/EntityDetails.vue
- src/modals/gdpr/GdprReportModal.vue
- API endpoints
Phase 8: Email & Chat Chunking (Week 9)
Tasks:
- Create EmailChunkingService
- Create ChatChunkingService
- Integration with Mail app (if available)
- Integration with Talk app (if available)
- Special handling for email metadata
- Conversation threading
Deliverables:
- lib/Service/EmailChunkingService.php
- lib/Service/ChatChunkingService.php
- Event listeners for Mail/Talk
- Unit tests
Phase 9: Testing & Documentation (Week 10)
Tasks:
- Integration tests for all services
- Performance testing
- API documentation updates
- User documentation updates
- Admin guide for GDPR features
- Video tutorials (optional)
Deliverables:
- Full test coverage
- Updated API documentation
- User guides
- Admin documentation
Phase 10: Deployment & Monitoring (Week 11)
Tasks:
- Beta deployment
- Monitor background jobs
- Performance tuning
- Bug fixes
- Collect user feedback
- Production deployment
Configuration Structure
Settings → OpenRegister → Text Analysis
┌─ Text Extraction ─────────────────────────────┐
│ ☑ Enable Object Text Extraction │
│ ☑ Enable File Text Extraction │
│ │
│ Chunking Strategy: [Recursive ▼] │
│ Chunk Size: [1000] characters │
│ Chunk Overlap: [200] characters │
└────────────────────────────────────────────────┘
┌─ Language Detection ──────────────────────────┐
│ ☑ Enable Language Detection │
│ │
│ Detection Method: [Hybrid ▼] │
│ • Local Algorithm │
│ • External API (optional) │
│ • LLM (optional) │
│ │
│ Confidence Threshold: [0.70] (0.0-1.0) │
└────────────────────────────────────────────────┘
┌─ Language Level Assessment ───────────────────┐
│ ☑ Enable Language Level Assessment │
│ │
│ Assessment Method: [Formula ▼] │
│ Scale: [CEFR ▼] │
└────────────────────────────────────────────────┘
┌─ Entity Extraction (GDPR) ────────────────────┐
│ ☑ Enable Entity Extraction │
│ │
│ Extraction Method: [Hybrid ▼] │
│ • Local Patterns: ☑ Enabled │
│ • Presidio API: ☐ Enabled (API key req.) │
│ • LLM: ☑ Enabled │
│ │
│ Entity Types to Detect: │
│ ☑ Persons │
│ ☑ Email Addresses │
│ ☑ Phone Numbers │
│ ☑ Organizations │
│ ☑ Locations │
│ ☑ Dates of Birth │
│ ☐ ID Numbers │
│ ☐ Bank Accounts │
│ ☐ IP Addresses │
│ │
│ Confidence Threshold: [0.80] (0.0-1.0) │
│ Context Window: [100] characters │
│ │
│ [View GDPR Register] [Generate Report] │
└────────────────────────────────────────────────┘
┌─ Vector Embeddings (RAG) ─────────────────────┐
│ ☑ Enable Vectorization │
│ │
│ Embedding Model: [OpenAI text-embedding-3 ▼] │
│ Vector Backend: [Solr ▼] │
└────────────────────────────────────────────────┘
┌─ Processing ──────────────────────────────────┐
│ Background Job Interval: [5] minutes │
│ Batch Size: [100] chunks per job │
│ │
│ [Process Pending Chunks Now] │
│ [Reprocess All Chunks] │
└────────────────────────────────────────────────┘
┌─ Statistics ──────────────────────────────────┐
│ Total Chunks: 145,782 │
│ Languages Detected: 8 │
│ Entities Found: 2,341 │
│ Pending Processing: 234 │
│ │
│ Top Languages: │
│ • English: 98,452 chunks (67.5%) │
│ • Dutch: 42,119 chunks (28.9%) │
│ • German: 5,211 chunks (3.6%) │
└────────────────────────────────────────────────┘
API Endpoints
Chunks
GET /api/chunks
GET /api/chunks/{id}
POST /api/chunks/{id}/analyze
GET /api/chunks/languages
GET /api/chunks/levels
POST /api/chunks/batch-analyze
Entities (GDPR)
GET /api/entities
GET /api/entities/{id}
GET /api/entities/{id}/occurrences
POST /api/entities/{id}/anonymize
GET /api/gdpr/report
POST /api/gdpr/export
Object Text
GET /api/object-texts
GET /api/object-texts/{id}
POST /api/objects/{id}/extract-text
Service Architecture
TextExtractionService (existing)
├─ FileTextExtractionService (existing)
└─ ObjectTextExtractionService (new)
ChunkService (new)
├─ createChunksFromFile()
├─ createChunksFromObject()
├─ migrateFromJson()
└─ getChunksBySource()
EnhancementService (new)
├─ LanguageDetectionService
│ ├─ detectLocal()
│ ├─ detectApi()
│ └─ detectLlm()
├─ LanguageLevelService
│ ├─ assessFormula()
│ ├─ assessApi()
│ └─ assessLlm()
└─ EntityExtractionService
  ├─ extractLocal()
  ├─ extractPresidio()
  ├─ extractLlm()
  └─ createEntityRelations()
GdprService (new)
├─ generateReport()
├─ findEntityOccurrences()
├─ anonymizeEntity()
└─ exportGdprData()
Background Jobs
- MigrateChunksJob: Migrate chunks from JSON to table
- ProcessChunksJob: Apply enhancements to pending chunks
- DetectLanguageJob: Batch language detection
- AssessLanguageLevelJob: Batch level assessment
- ExtractEntitiesJob: Batch entity extraction
- UpdateEntityStatsJob: Update entity occurrence counts
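A job such as ProcessChunksJob essentially takes a bounded batch of pending chunks on every run. The Python sketch below shows that loop under stated assumptions: chunks are plain dictionaries with a `status` key, `process_pending` is a hypothetical name, and the enhancers are stand-in callables rather than the real services. The default batch size of 100 mirrors the settings mock-up.

```python
def process_pending(chunks, enhancers, batch_size=100):
    """Run every enhancer over at most `batch_size` pending chunks.

    A failing chunk is marked as failed instead of aborting the whole job.
    Returns the number of chunks handled in this run.
    """
    batch = [chunk for chunk in chunks if chunk["status"] == "pending"][:batch_size]
    for chunk in batch:
        try:
            for field, enhance in enhancers.items():
                chunk[field] = enhance(chunk["text"])
            chunk["status"] = "processed"
        except Exception as exc:  # record the failure and keep going
            chunk["status"] = "failed"
            chunk["error"] = str(exc)
    return len(batch)


# Stand-in enhancers; the real ones would call the detection services.
enhancers = {
    "language": lambda text: "en",
    "length": len,
}
queue = [{"id": i, "text": f"chunk {i}", "status": "pending"} for i in range(5)]

runs = []
while True:
    handled = process_pending(queue, enhancers, batch_size=2)
    if handled == 0:
        break
    runs.append(handled)
```

Bounding each run keeps a single job execution short, so a large backlog is worked off over several scheduler intervals.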
Performance Targets
- Object text extraction: <100ms per object
- Chunk creation: <50ms per 100KB text
- Language detection (local): <10ms per chunk
- Language level (formula): <20ms per chunk
- Entity extraction (local): <100ms per chunk
- GDPR report generation: <5s for 10,000 entities
Testing Strategy
- Unit Tests: All services and entities
- Integration Tests: End-to-end flows
- Performance Tests: Background job processing
- Load Tests: 10,000+ files and objects
- API Tests: All endpoints
- UI Tests: GDPR register interface
Security Considerations
- Access Control: GDPR register admin-only
- Encryption: Entities encrypted at rest
- Audit Trail: Log all entity access
- Data Minimization: Only extract necessary entities
- Retention: Configurable entity retention periods
- Export: Secure GDPR data export
Compliance
- GDPR: Complete entity tracking for data subject requests
- Right to Erasure: Prepared for anonymization
- Data Mapping: Know where all PII exists
- Audit Trail: Complete access logging
- Retention: Configurable data retention
Next Steps
- Review this implementation plan with stakeholders
- Prioritize phases based on business needs
- Allocate resources (developers, QA, etc.)
- Set up development environment
- Create feature branch
- Begin Phase 1 implementation
Questions for Stakeholders
- Which entity types are most important for initial release?
- Should we integrate with external services (Presidio, etc.) or start with local only?
- What is the target timeline for GDPR compliance?
- Are there specific languages to prioritize for detection?
- Should email/chat chunking be in first release or later?
- What is the performance budget for background job processing?
- Are there existing GDPR workflows to integrate with?
Success Metrics
- Coverage: 100% of files and objects chunked
- Accuracy: >90% entity detection accuracy
- Performance: <5min to process 1000 files
- Adoption: GDPR register used for data subject requests
- Compliance: Pass GDPR audit
- User Satisfaction: Positive feedback on search quality
Conclusion
This enhanced text extraction system provides OpenRegister with:
✅ Unified processing for files and objects
✅ GDPR compliance with entity tracking
✅ Language detection and assessment
✅ Prepared for anonymization
✅ Extended support for emails and chats
✅ Flexible processing methods (local, API, LLM)
✅ Comprehensive documentation
✅ Clear implementation roadmap
The system is designed to be implemented incrementally, with each phase delivering value independently while building toward the complete feature set.