Archiving and Metadata Classification - Feature Summary
Overview
Complete feature documentation has been created for an Archiving and Metadata Classification system that builds on the chunk-based text extraction pipeline.
Status: Documentation Complete - NOT YET IMPLEMENTED
What is This Feature?
An intelligent classification and metadata extraction system for all content types (documents, objects, emails, chats) that:
- Classifies content using two approaches:
  - Constructive: User selects from curated taxonomy lists
  - Suggestive: AI proposes new categories/themes
- Extracts metadata automatically:
  - Keywords and search terms
  - Themes and topics
  - Document properties
  - Temporal information
Documentation Location
Archiving and Metadata Classification
Key Concepts
Two Classification Approaches
1. Constructive Classification (Controlled Vocabulary)
User Action → Select Taxonomy → Select Category → Apply
Characteristics:
- Predefined categories
- Controlled vocabulary
- Consistent organization
- Manual or AI-assisted
Example Taxonomies:
- Document Types (Contracts, Policies, Reports)
- Content Themes (Technology, Business, HR)
- Records Management (Retention schedules)
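A controlled vocabulary can be pictured as a small tree in which only existing nodes may be selected. The sketch below is illustrative only: the `Category` class and `find_path` helper are hypothetical names, not part of the documented design, and the sample data reuses the "Document Types" taxonomy from the list above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Category:
    """One node in a hierarchical taxonomy (hypothetical structure)."""
    name: str
    children: list[Category] = field(default_factory=list)


def find_path(node: Category, target: str, trail: tuple[str, ...] = ()) -> tuple[str, ...] | None:
    """Return the path from the root to `target`, or None if it is not in the vocabulary."""
    trail = trail + (node.name,)
    if node.name == target:
        return trail
    for child in node.children:
        found = find_path(child, target, trail)
        if found is not None:
            return found
    return None


document_types = Category("Document Types", [
    Category("Contracts"), Category("Policies"), Category("Reports"),
])

print(find_path(document_types, "Policies"))  # ('Document Types', 'Policies')
print(find_path(document_types, "Memos"))     # None: not in the controlled vocabulary
```

Rejecting anything that returns `None` is what keeps constructive classification consistent: users can only apply categories that the taxonomy already defines.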
2. Suggestive Classification (AI-Powered)
AI Analysis → Generate Suggestions → User Review → Approve/Reject
Characteristics:
- AI-discovered themes
- Dynamic categories
- Requires approval
- Can be promoted to taxonomy
Example Suggestions:
- "API Integration" (confidence: 89%)
- "Cloud Security" (confidence: 76%)
- "Performance Optimization" (confidence: 82%)
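The review step of this workflow could look roughly like the sketch below. The `Suggestion` class, its `status` values and the `review` function are assumptions made for illustration; the labels and confidence values are the example suggestions listed above.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    label: str
    confidence: float        # 0.0 to 1.0
    status: str = "pending"  # assumed states: "pending", "approved", "rejected"


def review(suggestion: Suggestion, approve: bool) -> None:
    """Record the reviewer's decision; a suggestion can only be reviewed once."""
    if suggestion.status != "pending":
        raise ValueError(f"{suggestion.label!r} was already {suggestion.status}")
    suggestion.status = "approved" if approve else "rejected"


queue = [
    Suggestion("API Integration", 0.89),
    Suggestion("Cloud Security", 0.76),
    Suggestion("Performance Optimization", 0.82),
]
# Present the most confident suggestions first.
queue.sort(key=lambda s: s.confidence, reverse=True)

review(queue[0], approve=True)    # approve "API Integration"
review(queue[-1], approve=False)  # reject "Cloud Security"
```

An approved suggestion would then be a candidate for promotion into a taxonomy; that step is not shown here.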
Metadata Extraction
Automatic extraction of:
- Keywords: Important terms (TF-IDF, NER, LLM)
- Themes: High-level topics (Topic modeling, clustering)
- Search Terms: How users might search for this content
- Properties: Structured metadata (dates, authors, versions)
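To make the TF-IDF option concrete, here is a minimal standard-library sketch of keyword extraction over a handful of chunks. It is not the planned implementation: the tokenizer, the scoring variant (term frequency times `log(N / df)`) and the sample chunks are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(chunks: list[str], index: int, top_n: int = 3) -> list[str]:
    """Return the `top_n` highest-scoring terms of chunks[index]."""
    docs = [tokenize(chunk) for chunk in chunks]
    if not docs[index]:
        return []
    # Document frequency: in how many chunks does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    tf = Counter(docs[index])
    total = len(docs[index])
    scores = {
        term: (count / total) * math.log(len(docs) / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_n]


chunks = [
    "The contract renewal date is set in the contract annex",
    "The security policy covers cloud access",
    "The report covers cloud costs",
]
print(tfidf_keywords(chunks, 0))  # ['contract', 'annex', 'date']
```

Terms that occur in every chunk (here "the") score zero, which is why TF-IDF works as a cheap first pass before NER or an LLM is involved.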
Database Schema
4 New Tables
- oc_openregister_classifications
  - Links chunks to taxonomy categories
  - Stores confidence and method
  - Multi-tenant (owner, organisation)
- oc_openregister_taxonomies
  - Stores taxonomy definitions
  - Hierarchical structures
  - Global or organization-specific
- oc_openregister_suggestions
  - AI-generated classification suggestions
  - Pending user review
  - Confidence scores
- oc_openregister_metadata
  - Extracted metadata (keywords, themes, etc.)
  - Linked to chunks and sources
  - Method and confidence tracking
All tables include multi-tenancy fields (owner, organisation).
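As a rough illustration of how the four tables relate, the sketch below creates them in an in-memory SQLite database. Only the table names and the `owner` / `organisation` fields come from this summary; every other column name and type is an assumption and may differ from the SQL in the full feature documentation.

```python
import sqlite3

# Table names are from this summary; the column lists are illustrative guesses.
TABLES = {
    "oc_openregister_taxonomies":
        "name TEXT NOT NULL, parent_id INTEGER",
    "oc_openregister_classifications":
        "chunk_id INTEGER NOT NULL, taxonomy_id INTEGER NOT NULL, confidence REAL, method TEXT",
    "oc_openregister_suggestions":
        "chunk_id INTEGER NOT NULL, label TEXT NOT NULL, confidence REAL, status TEXT DEFAULT 'pending'",
    "oc_openregister_metadata":
        "chunk_id INTEGER, source_id INTEGER, kind TEXT NOT NULL, value TEXT NOT NULL, "
        "method TEXT, confidence REAL",
}


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all four tables, each with the shared multi-tenancy columns."""
    for name, columns in TABLES.items():
        conn.execute(
            f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, {columns}, "
            "owner TEXT, organisation TEXT)"
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

In this sketch a taxonomy row with `organisation` left empty would represent a global taxonomy; that convention is also an assumption.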
Integration with Existing Features
1. Text Extraction Pipeline
File/Object → Chunks → [NEW] Classification + Metadata Extraction
Applied after chunking, reuses existing chunk infrastructure.
2. GDPR Entity Tracking
Entities can become metadata:
- Person names → Keywords
- Organizations → Themes
- Locations → Properties
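The mapping above amounts to a lookup table. A minimal sketch, assuming detected entities arrive as `(entity_type, value)` pairs; the type names, the output dictionary shape and the sample values are hypothetical.

```python
# Entity type -> metadata type, following the three mappings listed above.
ENTITY_TO_METADATA = {
    "person": "keyword",
    "organization": "theme",
    "location": "property",
}


def entities_to_metadata(entities: list[tuple[str, str]]) -> list[dict[str, str]]:
    """Convert detected entities into metadata records, skipping unmapped types."""
    return [
        {"type": ENTITY_TO_METADATA[entity_type], "value": value}
        for entity_type, value in entities
        if entity_type in ENTITY_TO_METADATA
    ]


detected = [
    ("person", "Jane Doe"),
    ("organization", "Example Corp"),
    ("location", "Amsterdam"),
    ("phone", "+31 20 000 0000"),  # no mapping defined, so it is skipped
]
metadata = entities_to_metadata(detected)
```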
3. Search Enhancement
Classifications and metadata improve search:
- Filter by category
- Boost by theme relevance
- Faceted navigation
- Related content suggestions
4. Vector Search (RAG)
Metadata enhances AI:
- Filter vectors by classification
- Include metadata in context
- Theme-based retrieval
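Filtering vectors by classification means narrowing the candidate set before similarity ranking. The sketch below shows the idea with plain cosine similarity; the chunk dictionary layout, the two-dimensional vectors and the `search` function are assumptions for illustration, not the existing vector search API.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: list[float], chunks: list[dict], category: str, top_k: int = 2) -> list[int]:
    """Keep only chunks classified under `category`, then rank them by similarity."""
    candidates = [c for c in chunks if category in c["classifications"]]
    ranked = sorted(candidates, key=lambda c: cosine(query, c["vector"]), reverse=True)
    return [c["id"] for c in ranked[:top_k]]


chunks = [
    {"id": 1, "vector": [1.0, 0.0], "classifications": {"Contracts"}},
    {"id": 2, "vector": [0.9, 0.1], "classifications": {"Policies"}},
    {"id": 3, "vector": [0.0, 1.0], "classifications": {"Contracts"}},
]
print(search([1.0, 0.0], chunks, "Contracts"))  # [1, 3]
```

Chunk 2 is close to the query but is never scored, because it falls outside the requested category.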
User Interface Components
1. Classification Panel
- Display current classifications
- Add/remove classifications
- Bulk classification
- History tracking
2. Suggestion Review Panel
- View pending AI suggestions
- Approve/reject with one click
- Promote to taxonomy
- Bulk actions
3. Metadata Display
- Show extracted keywords, themes
- Edit metadata manually
- View confidence scores
- Method transparency
4. Taxonomy Manager
- Create/edit taxonomies
- Hierarchical editor
- Import/export
- Global vs organization scope
API Endpoints
Classifications
GET /api/classifications
POST /api/classifications
DELETE /api/classifications/{id}
POST /api/classifications/bulk
Suggestions
GET /api/suggestions?status=pending
POST /api/suggestions/{id}/review
POST /api/suggestions/bulk-approve
Taxonomies
GET /api/taxonomies
POST /api/taxonomies
PUT /api/taxonomies/{id}
DELETE /api/taxonomies/{id}
GET /api/taxonomies/{id}/export
Metadata
GET /api/metadata?source_id=123
PUT /api/metadata/{id}
POST /api/metadata/extract
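A client call against the review endpoint might be built as in the sketch below. Only the path `/api/suggestions/{id}/review` comes from the list above; the base URL and the JSON body with an `action` field are assumptions, since the request format is not specified in this summary. The request is constructed but not sent.

```python
import json
from urllib.request import Request

BASE_URL = "https://example.org"  # hypothetical host, replace with the real instance


def review_request(suggestion_id: int, approve: bool) -> Request:
    """Build (but do not send) a POST request that reviews one suggestion."""
    body = json.dumps({"action": "approve" if approve else "reject"}).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/suggestions/{suggestion_id}/review",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = review_request(42, approve=True)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` would send it; authentication is omitted because the summary does not describe it.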
Use Cases
1. Legal Document Management
- Classify contracts by type
- Extract parties, dates, jurisdictions
- Apply retention schedules
- Compliance tracking
2. Knowledge Base Organization
- AI discovers documentation themes
- Automatic categorization
- Improved search
- Dynamic taxonomy evolution
3. Email Archiving
- Classify emails (Business, HR, Legal, IT)
- Extract sender, recipient, subject
- Apply retention policies
- GDPR compliance
4. Multi-Language Content
- Language-aware classification
- Localized taxonomies
- Cross-language themes
- Better UX per language
5. Research Document Analysis
- Discover research themes
- Extract concepts and keywords
- Cluster similar papers
- Knowledge graph generation
Multi-Tenancy
All entities fully support multi-tenancy:
- owner field: User ID
- organisation field: Organisation UUID
- Inherited from source content
- Automatic filtering by access rights
- Organization-level taxonomies
- Data isolation guaranteed
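Automatic filtering by access rights could reduce to a check on the two tenancy fields. The rule used below (a record is visible to its owner and to members of its organisation) is an assumption; the actual access model may be stricter. Short strings stand in for real user IDs and organisation UUIDs.

```python
def visible_records(records: list[dict], user_id: str, organisation_id: str) -> list[dict]:
    """Return records the user owns or that belong to the user's organisation."""
    return [
        record for record in records
        if record["owner"] == user_id or record["organisation"] == organisation_id
    ]


records = [
    {"id": 1, "owner": "alice", "organisation": "org-a"},
    {"id": 2, "owner": "bob", "organisation": "org-a"},
    {"id": 3, "owner": "carol", "organisation": "org-b"},
]
print([r["id"] for r in visible_records(records, "alice", "org-a")])  # [1, 2]
```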
Configuration
Settings panel includes:
Classification Settings
- Enable constructive/suggestive/both
- Confidence thresholds
- Auto-approve settings
- Suggestion methods
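How confidence thresholds and auto-approve might interact is sketched below. The setting names and the 0.80 minimum are hypothetical; the 0.85 auto-approve cut-off is borrowed from the open stakeholder question later in this summary and is not a decided value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationSettings:
    min_confidence: float = 0.80          # below this, a suggestion is dropped
    auto_approve: bool = True
    auto_approve_threshold: float = 0.85  # above this, no manual review is needed


def route(confidence: float, settings: ClassificationSettings) -> str:
    """Decide what happens to a suggestion with the given confidence."""
    if confidence < settings.min_confidence:
        return "discarded"
    if settings.auto_approve and confidence > settings.auto_approve_threshold:
        return "auto_approved"
    return "pending_review"


settings = ClassificationSettings()
for confidence in (0.89, 0.82, 0.76):
    print(confidence, route(confidence, settings))
```

With auto-approve switched off, every suggestion above the minimum lands in the review queue instead.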
Metadata Extraction Settings
- Enable/disable by type
- Extraction methods
- Algorithm parameters
- Min confidence scores
Processing Settings
- On upload vs background
- Batch sizes
- Job intervals
- Manual triggers
Performance Characteristics
- Keyword extraction: 50-200ms per chunk
- Theme extraction: 500-2000ms per document (LLM)
- Classification suggestion: 200-1000ms per chunk
- Metadata extraction: 100-500ms per chunk
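These ranges can be combined into a rough per-document estimate. The calculation below assumes that all three per-chunk steps run on every chunk, that theme extraction runs once per document, and that nothing runs in parallel; none of this is stated in the figures themselves.

```python
# (low, high) latency in milliseconds, taken from the list above.
PER_CHUNK_MS = {
    "keywords": (50, 200),
    "suggestion": (200, 1000),
    "metadata": (100, 500),
}
THEMES_PER_DOCUMENT_MS = (500, 2000)


def estimate_ms(chunks: int) -> tuple[int, int]:
    """Best-case and worst-case sequential processing time for one document."""
    low = chunks * sum(lo for lo, _ in PER_CHUNK_MS.values()) + THEMES_PER_DOCUMENT_MS[0]
    high = chunks * sum(hi for _, hi in PER_CHUNK_MS.values()) + THEMES_PER_DOCUMENT_MS[1]
    return low, high


print(estimate_ms(10))  # (4000, 19000), i.e. 4 to 19 seconds for a 10-chunk document
```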
Storage (10,000 documents)
- Classifications: ~6 MB
- Suggestions: ~10 MB
- Metadata: ~10 MB
- Taxonomies: ~250 KB
- Total: ~26 MB
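The total can be checked, and projected to other corpus sizes, with a few lines of arithmetic. The projection assumes storage grows linearly with document count and uses decimal units (1 MB = 1000 KB); both are simplifications.

```python
# Estimated storage in MB for 10,000 documents, from the list above.
STORAGE_MB = {
    "classifications": 6.0,
    "suggestions": 10.0,
    "metadata": 10.0,
    "taxonomies": 0.25,  # ~250 KB
}
BASELINE_DOCUMENTS = 10_000

total_mb = sum(STORAGE_MB.values())                    # 26.25, rounded to ~26 MB above
per_document_kb = total_mb * 1000 / BASELINE_DOCUMENTS  # roughly 2.6 KB per document


def projected_mb(documents: int) -> float:
    """Linear projection of total storage for a different number of documents."""
    return total_mb * documents / BASELINE_DOCUMENTS


print(total_mb, projected_mb(1_000_000))  # 26.25 2625.0
```

Taxonomy storage probably does not grow with document count, so the projection slightly overestimates at large scale.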
AI/LLM Integration
Methods Supported
- Topic Modeling (Unsupervised)
  - LDA, NMF algorithms
  - Probability distribution over topics
- LLM-Based Analysis (Supervised)
  - Prompt-based theme extraction
  - Structured output with confidence
- Clustering (Unsupervised)
  - Vector similarity clustering
  - Content grouping
- Hybrid (Recommended)
  - Combine multiple methods
  - Confidence voting
  - Best accuracy
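One possible reading of "confidence voting" is sketched below: each method proposes labels with a confidence, a label needs support from at least two methods, and its combined score is the average over all methods (a method that did not propose the label counts as zero). The voting rule, the `vote` function and the sample scores are assumptions, not the specified algorithm.

```python
from collections import defaultdict


def vote(results: dict[str, dict[str, float]], min_methods: int = 2) -> dict[str, float]:
    """Combine per-method label confidences into one score per agreed label."""
    votes: dict[str, list[float]] = defaultdict(list)
    for method_scores in results.values():
        for label, confidence in method_scores.items():
            votes[label].append(confidence)
    method_count = len(results)
    return {
        label: sum(confidences) / method_count
        for label, confidences in votes.items()
        if len(confidences) >= min_methods
    }


results = {
    "topic_model": {"Cloud Security": 0.7, "API Integration": 0.6},
    "llm": {"Cloud Security": 0.9},
    "clustering": {"Cloud Security": 0.8, "Billing": 0.5},
}
combined = vote(results)  # only "Cloud Security" has support from two or more methods
```

Averaging over all methods penalizes labels that only some methods found, which is one way to keep single-method noise out of the suggestion queue.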
Future Enhancements
- Auto-Classification: Classify based on similar content
- Smart Suggestions: Learn from user feedback
- Cross-Reference: Link classifications across documents
- Visualization: Knowledge graphs, theme evolution
- Export/Import: Share taxonomies
- Templates: Pre-built taxonomies
- Validation Rules: Ensure consistency
- Bulk Operations: Reclassify multiple items
Diagrams Included
The documentation includes 7 Mermaid diagrams:
- Sources → Classification & Metadata flow (TB)
- Classification schema class diagram
- Suggestion workflow sequence diagram
- Complete processing pipeline flowchart
- UI mockups (4 panels: Classification, Suggestions, Metadata, Taxonomy Manager)
All fully editable in markdown source.
Implementation Considerations
Phase 1: Database Schema
- Create 4 new tables
- Add multi-tenancy fields
- Create indexes
Phase 2: Classification Service
- Constructive classification logic
- Taxonomy management
- Category assignment
Phase 3: Suggestion Engine
- AI integration (LLM/clustering)
- Confidence scoring
- Deduplication
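Deduplication in the suggestion engine could start with simple label normalization, as sketched below. Treating labels as duplicates when they match after case-folding and whitespace collapsing is an assumption; a real implementation might also compare embeddings to catch near-synonyms.

```python
def deduplicate(suggestions: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Keep one suggestion per normalized label, preferring the highest confidence."""
    best: dict[str, tuple[str, float]] = {}
    for label, confidence in suggestions:
        key = " ".join(label.casefold().split())
        if key not in best or confidence > best[key][1]:
            best[key] = (label, confidence)
    return list(best.values())


raw = [
    ("Cloud Security", 0.76),
    ("cloud  security", 0.81),  # same label, different case and spacing
    ("API Integration", 0.89),
]
unique = deduplicate(raw)
```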
Phase 4: Metadata Extraction
- Keyword extraction (TF-IDF, NER)
- Theme extraction (topic modeling, LLM)
- Search term generation
- Property extraction
Phase 5: User Interface
- Classification panel
- Suggestion review
- Metadata display
- Taxonomy manager
Phase 6: API
- All CRUD endpoints
- Bulk operations
- Export/import
Phase 7: Integration
- Connect to chunk pipeline
- Search enhancement
- RAG context enrichment
Phase 8: Testing & Deployment
- Unit tests
- Integration tests
- Performance testing
- User acceptance testing
Security & Compliance
- Access Control: User/organization-based
- Data Isolation: Multi-tenant safe
- Audit Trail: All classification changes logged
- GDPR: Metadata can include entity references
- Approval Workflow: Admin review for suggestions
Dependencies
Existing Features Required
- Chunk system (text extraction)
- Multi-tenancy infrastructure
- Background job system
New Dependencies
- Topic modeling library (e.g., Gensim)
- TF-IDF implementation
- LLM API access (OpenAI, etc.)
- Clustering algorithms (scikit-learn)
Optional Integrations
- External taxonomy services
- Knowledge graph systems
- Visualization libraries
Benefits
For Users
- Better content organization
- Easier discovery
- Automatic categorization
- Improved search results
For Administrators
- Centralized taxonomy management
- AI-assisted classification
- Compliance tracking
- Usage analytics
For Organizations
- Knowledge management
- Information governance
- Regulatory compliance
- Operational efficiency
Comparison with Entity Tracking
| Feature | Entity Tracking (GDPR) | Classification & Metadata |
|---|---|---|
| Purpose | Find PII for compliance | Organize and discover content |
| Focus | Persons, emails, phones | Categories, themes, keywords |
| Approach | Detection (what exists) | Assignment (what it means) |
| User Input | Minimal (review) | Active (select categories) |
| AI Role | Detection assistant | Suggestion engine |
| Compliance | GDPR, privacy laws | Records management |
| Output | Entity register | Taxonomy, metadata |
Complementary: Both work together for complete content intelligence.
Documentation Quality
The feature documentation includes:
- Complete concept explanation
- Two classification approaches detailed
- Database schema with SQL
- 7 Mermaid diagrams
- UI mockups in ASCII
- Complete API specification
- 5 detailed use cases
- Multi-tenancy fully covered
- Performance characteristics
- Integration points identified
- Implementation phases outlined
- Security and compliance addressed
Next Steps
Before Implementation
- Review with Stakeholders
  - Validate classification approaches
  - Confirm taxonomy requirements
  - Agree on AI methods
- Prioritize Features
  - Constructive vs suggestive first?
  - Which metadata types first?
  - UI vs API priority?
- Technical Decisions
  - LLM provider selection
  - Topic modeling approach
  - Taxonomy storage format
- Design Decisions
  - Default taxonomies to include
  - UI placement and flow
  - Admin vs user capabilities
Implementation Order
Recommended: Implement after text extraction and entity tracking are stable.
Reason: Builds on chunk infrastructure, complements entity tracking.
Timeline: 8-10 weeks after text extraction completion.
Questions for Stakeholders
- Should we prioritize constructive or suggestive classification?
- What taxonomies are most important (legal, technical, business)?
- Do you have existing taxonomy standards to import?
- What LLM provider should we use for suggestions?
- Should taxonomies be managed centrally or by organization?
- What metadata is most valuable for your use case?
- Should we auto-apply high-confidence suggestions (>85%)?
- How should we handle multi-language taxonomies?
Conclusion
Complete feature documentation has been created for an intelligent archiving and metadata classification system that:
- Provides flexible classification (constructive + suggestive)
- Extracts rich metadata automatically
- Is fully multi-tenant and secure
- Integrates with the existing chunk pipeline
- Enhances search and discovery
- Complements GDPR entity tracking
- Includes a complete database schema
- Defines all API endpoints
- Specifies UI components
- Identifies implementation phases
Status: Ready for stakeholder review and prioritization.
Do NOT implement yet - this is documentation only for planning purposes.
Related Documentation: