Archiving and Metadata Classification
Overview
The Archiving and Metadata system provides intelligent classification and metadata extraction for all content types (documents, objects, emails, chats) processed through the chunking pipeline. The system combines AI-powered suggestions with curated taxonomies to organize and enrich content automatically.
Core Concepts
Classification vs Extraction
Classification: Assigning content to predefined or AI-suggested categories
- Constructive: User selects from curated lists (controlled vocabulary)
- Suggestive: AI proposes new categories based on content analysis
Metadata Extraction: Automatically identifying and extracting structured information
- Keywords and search terms
- Themes and topics
- Dates and temporal information
- Named entities (covered by the GDPR feature)
- Document properties
Content Types
The system classifies content from multiple sources: documents, objects, emails, and chats.
Classification System
1. Constructive Classification
Definition: Users assign content to predefined categories from curated lists.
Characteristics:
- Controlled vocabulary
- Consistent taxonomy
- Organization-specific or global
- Manual or AI-assisted selection
- Hierarchical structures supported
Use Cases:
- Regulatory compliance (document types)
- Records management (retention schedules)
- Information architecture (content organization)
- Knowledge management (topic taxonomies)
Classification Schema
Example Taxonomies
1. Document Types (Legal/Compliance)
- Contracts
  - Employment Contracts
  - Vendor Contracts
  - Client Agreements
- Policies
  - Internal Policies
  - External Policies
- Reports
  - Financial Reports
  - Audit Reports
  - Management Reports
2. Content Themes (Knowledge Management)
- Technology
  - Software Development
  - Infrastructure
  - Security
- Business
  - Sales
  - Marketing
  - Operations
- Human Resources
  - Recruitment
  - Training
  - Benefits
3. Records Management (Archival)
- Permanent Retention
- 7-Year Retention
- 3-Year Retention
- Temporary (1 Year)
- Destroy After Processing
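A curated taxonomy like the ones above can be held as a nested structure and used to check a category path before a classification is stored. The sketch below is illustrative only: the `TAXONOMY` layout and the `is_valid_path` helper are assumptions made for this example, not the system's actual taxonomy format.

```python
# Hypothetical in-memory form of the "Document Types" taxonomy listed above.
# The real layout of a stored taxonomy is not specified in this document.
TAXONOMY = {
    "Contracts": {
        "Employment Contracts": {},
        "Vendor Contracts": {},
        "Client Agreements": {},
    },
    "Policies": {"Internal Policies": {}, "External Policies": {}},
    "Reports": {
        "Financial Reports": {},
        "Audit Reports": {},
        "Management Reports": {},
    },
}


def is_valid_path(taxonomy: dict, path: list[str]) -> bool:
    """Return True if every element of path names a nested category."""
    node = taxonomy
    for name in path:
        if name not in node:
            return False
        node = node[name]
    return bool(path)  # an empty path is not a classification


print(is_valid_path(TAXONOMY, ["Contracts", "Vendor Contracts"]))
print(is_valid_path(TAXONOMY, ["Contracts", "Audit Reports"]))
```

Validating the full path, rather than only the leaf name, keeps the vocabulary controlled even when two branches contain similarly named subcategories.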
2. Suggestive Classification
Definition: AI analyzes content and proposes new categories based on detected themes.
Characteristics:
- AI-generated suggestions
- Discovers emerging themes
- Adapts to content changes
- Requires user approval
- Can be promoted to constructive categories
Use Cases:
- Content discovery (find new trends)
- Topic modeling (identify discussion themes)
- Dynamic organization (adapt to evolving content)
- Research and analysis (uncover patterns)
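Because suggestions require user approval and can later be promoted to constructive categories, their lifecycle can be modelled as a small state machine. The following is a minimal sketch under stated assumptions: only the `pending` status appears in the suggestions table defined later in this document, so the names `approved`, `rejected` and `promoted`, as well as the `TRANSITIONS` table and `review` helper, are hypothetical.

```python
# Allowed status transitions for an AI suggestion.
# Only "pending" is taken from the schema; the other names are assumptions.
TRANSITIONS = {
    "pending": {"approved", "rejected"},
    "approved": {"promoted"},  # promoted = added to a curated taxonomy
    "rejected": set(),
    "promoted": set(),
}


def review(suggestion: dict, new_status: str, reviewer: str) -> dict:
    """Return a copy of the suggestion with the new status applied."""
    current = suggestion["status"]
    if new_status not in TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {new_status!r}")
    return {**suggestion, "status": new_status, "reviewed_by": reviewer}


pending = {"value": "API Integration", "status": "pending"}
approved = review(pending, "approved", "jdoe")
print(approved["status"])
```

Returning a copy instead of mutating the input keeps the original record available for an audit trail, which is one possible design choice rather than a requirement.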
Suggestion Workflow
AI Suggestion Methods
1. Topic Modeling (Unsupervised)
Method: Latent Dirichlet Allocation (LDA) or similar
Input: All chunks in a document
Output: Probability distribution over topics
Example:
- Topic 1: "contract, agreement, terms" (45%)
- Topic 2: "payment, invoice, billing" (30%)
- Topic 3: "support, maintenance, service" (25%)
2. LLM-Based Analysis (Prompt-Driven)
Method: Prompt-based theme extraction
Input: Chunk text + context
Output: Structured themes with confidence
Example:
- Primary theme: "Software Licensing" (0.92)
- Secondary themes: ["Payment Terms" (0.78), "Support Agreement" (0.65)]
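Output from a language model should be validated before it is stored as themes with confidence values. The sketch below assumes the model has been asked to return a JSON array of objects with `name` and `confidence` fields; that response format, the `parse_themes` helper and the `min_confidence` default are assumptions for illustration.

```python
import json


def parse_themes(raw: str, min_confidence: float = 0.5) -> list[dict]:
    """Parse a JSON theme list and keep entries with a usable confidence."""
    themes = []
    for item in json.loads(raw):
        if not isinstance(item, dict):
            continue  # ignore anything that is not a theme object
        name = str(item.get("name", "")).strip()
        try:
            confidence = float(item.get("confidence"))
        except (TypeError, ValueError):
            continue  # skip entries without a numeric confidence
        if name and min_confidence <= confidence <= 1.0:
            themes.append({"name": name, "confidence": confidence})
    # Highest confidence first, so the first entry is the primary theme.
    return sorted(themes, key=lambda t: t["confidence"], reverse=True)


raw = (
    '[{"name": "Payment Terms", "confidence": 0.78},'
    ' {"name": "Software Licensing", "confidence": 0.92},'
    ' {"name": "Support Agreement", "confidence": 0.65}]'
)
for theme in parse_themes(raw):
    print(theme["name"], theme["confidence"])
```

Sorting by confidence makes the split into one primary theme and a list of secondary themes a matter of taking the first element and the rest.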
3. Clustering (Unsupervised)
Method: Vector similarity clustering
Input: Chunk embeddings
Output: Content clusters
Example:
- Cluster 1: 45 chunks about "Project Planning"
- Cluster 2: 32 chunks about "Budget Discussions"
- Cluster 3: 28 chunks about "Technical Architecture"
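To make the idea of vector similarity clustering concrete, here is a minimal pure-Python sketch. It uses greedy threshold clustering on cosine similarity, which is a deliberate simplification; it is not claimed to be the method the system uses, and a real deployment would more likely rely on an established algorithm such as k-means or hierarchical clustering. The `cluster` function, the `threshold` default and the toy two-dimensional embeddings are all assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def cluster(embeddings: dict[int, list[float]], threshold: float = 0.8) -> list[list[int]]:
    """Greedy single-pass clustering: a chunk joins the first cluster whose
    seed vector is similar enough, otherwise it starts a new cluster."""
    clusters: list[tuple[list[float], list[int]]] = []
    for chunk_id, vector in embeddings.items():
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(chunk_id)
                break
        else:
            clusters.append((vector, [chunk_id]))
    return [members for _, members in clusters]


# Toy embeddings keyed by chunk id; real embeddings have many more dimensions.
toy = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.1, 0.95]}
print(cluster(toy))
```

The result depends on the order in which chunks are visited, which is one reason this approach is only suitable as an illustration.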
Metadata Extraction
1. Keywords
Definition: Important terms that represent content essence.
Extraction Methods:
- TF-IDF: Statistical importance
- NER + Filtering: Named entities as keywords
- LLM Extraction: Context-aware keywords
- Hybrid: Combine multiple methods
Storage:
{
    'keywords': [
        {'term': 'cloud migration', 'score': 0.95, 'frequency': 12},
        {'term': 'kubernetes', 'score': 0.87, 'frequency': 8},
        {'term': 'cost optimization', 'score': 0.82, 'frequency': 6}
    ]
}
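As an illustration of the TF-IDF method, the sketch below scores the terms of one chunk against a set of chunks and returns entries in the same `term` / `score` / `frequency` shape as the storage example. It is a simplified assumption-laden version: it handles single words only (no phrases such as "cloud migration"), uses a smoothed IDF, and does not normalise scores to the 0 to 1 range. The `tokenize` and `tfidf_keywords` helpers are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(chunks: list[str], chunk_index: int, top_n: int = 3) -> list[dict]:
    """Score the terms of one chunk against all chunks with TF-IDF."""
    tokenized = [tokenize(c) for c in chunks]
    # Document frequency: in how many chunks does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    counts = Counter(tokenized[chunk_index])
    total = sum(counts.values())
    scored = []
    for term, freq in counts.items():
        tf = freq / total
        # Smoothed IDF so terms found in every chunk keep a small weight.
        idf = math.log((1 + len(chunks)) / (1 + doc_freq[term])) + 1
        scored.append({"term": term, "score": round(tf * idf, 4), "frequency": freq})
    # Highest score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda k: (-k["score"], k["term"]))
    return scored[:top_n]


chunks = [
    "kubernetes cluster kubernetes upgrade",
    "cluster budget report",
    "budget cluster review",
]
for keyword in tfidf_keywords(chunks, 0, top_n=2):
    print(keyword)
```

Terms that appear in every chunk, such as "cluster" here, receive the lowest weight, which is the behaviour that makes TF-IDF useful for picking out distinguishing keywords.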
Use Cases:
- Search enhancement (boost relevant results)
- Tag clouds (visual navigation)
- Related content (find similar documents)
- Auto-complete suggestions
2. Themes
Definition: High-level topics that span multiple chunks/documents.
Extraction Methods:
- Topic Modeling: LDA, NMF
- LLM Analysis: Prompt-based theme identification
- Clustering: Group similar content
Storage:
{
    'themes': [
        {
            'name': 'Digital Transformation',
            'confidence': 0.89,
            'supporting_chunks': [123, 145, 167],
            'keywords': ['digitalization', 'automation', 'cloud'],
            'suggested_by': 'llm'
        }
    ]
}
Use Cases:
- Content organization (thematic navigation)
- Executive summaries (key themes overview)
- Trend analysis (emerging themes over time)
- Knowledge graphs (theme relationships)
3. Search Terms
Definition: Phrases users might search for to find this content.
Extraction Methods:
- Question Extraction: "What does this answer?"
- Title/Header Analysis: Prominent text
- LLM Generation: "How would you search for this?"
Storage:
{
    'searchTerms': [
        'how to migrate to cloud',
        'kubernetes deployment guide',
        'cloud infrastructure costs',
        'container orchestration best practices'
    ]
}
Use Cases:
- Search optimization (better matching)
- SEO (search engine optimization)
- Content discovery (suggest related searches)
- FAQ generation (common questions)
4. Document Properties
Definition: Structured metadata about the content.
Extraction Methods:
- Date Extraction: Created, modified, effective dates
- Author Detection: Writers, contributors
- Version Information: Document versions
- Format Analysis: Structure, sections, length
Storage:
{
    'properties': {
        'documentType': 'Technical Report',
        'author': 'Engineering Team',
        'createdDate': '2024-01-15',
        'effectiveDate': '2024-02-01',
        'version': '2.1',
        'pageCount': 45,
        'wordCount': 12450,
        'sections': ['Executive Summary', 'Technical Details', 'Recommendations'],
        'language': 'en',
        'readingLevel': 'B2'
    }
}
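Some of these properties can be derived with very simple rules. The sketch below extracts ISO-formatted dates and a word count from plain text. It is a minimal example under stated assumptions: the `extract_properties` helper and its `dates` key are hypothetical, only `YYYY-MM-DD` dates are recognised, and deciding which date is the created date and which is the effective date is left out.

```python
import re
from datetime import date


def extract_properties(text: str) -> dict:
    """Derive a few simple properties from plain text."""
    dates = []
    # ISO 8601 calendar dates only (YYYY-MM-DD); other formats are ignored.
    for match in re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text):
        try:
            dates.append(date.fromisoformat(match))
        except ValueError:
            continue  # e.g. 2024-13-40 looks like a date but is not one
    return {
        # Whitespace-separated tokens; punctuation stays attached to words.
        "wordCount": len(text.split()),
        "dates": [d.isoformat() for d in sorted(set(dates))],
    }


print(extract_properties("Created 2024-02-01, revised 2024-01-15."))
```

Validating each match with `date.fromisoformat` filters out strings that merely look like dates, which a regular expression alone cannot do.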
Database Schema
Classification Table
CREATE TABLE oc_openregister_classifications (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    uuid VARCHAR(255) NOT NULL UNIQUE,
    chunk_id BIGINT NOT NULL,
    source_type VARCHAR(50) NOT NULL,
    source_id BIGINT NOT NULL,
    taxonomy VARCHAR(255),
    category VARCHAR(255) NOT NULL,
    subcategory VARCHAR(255),
    path JSON,
    confidence DECIMAL(3,2) NOT NULL,
    method VARCHAR(50) NOT NULL,
    status VARCHAR(50) NOT NULL DEFAULT 'active',
    assigned_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    assigned_by VARCHAR(255),
    owner VARCHAR(255),
    organisation VARCHAR(255),
    INDEX idx_chunk (chunk_id),
    INDEX idx_source (source_type, source_id),
    INDEX idx_taxonomy (taxonomy),
    INDEX idx_category (category),
    INDEX idx_status (status),
    INDEX idx_owner (owner),
    INDEX idx_organisation (organisation),
    FOREIGN KEY (chunk_id) REFERENCES oc_openregister_chunks(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Taxonomy Table
CREATE TABLE oc_openregister_taxonomies (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    uuid VARCHAR(255) NOT NULL UNIQUE,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    type VARCHAR(50) NOT NULL,
    structure JSON NOT NULL,
    global BOOLEAN NOT NULL DEFAULT FALSE,
    organisation VARCHAR(255),
    owner VARCHAR(255),
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_name (name),
    INDEX idx_type (type),
    INDEX idx_global (global),
    INDEX idx_organisation (organisation),
    UNIQUE KEY unique_name_org (name, organisation)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Suggestion Table
CREATE TABLE oc_openregister_suggestions (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    uuid VARCHAR(255) NOT NULL UNIQUE,
    chunk_id BIGINT NOT NULL,
    source_type VARCHAR(50) NOT NULL,
    source_id BIGINT NOT NULL,
    suggestion_type VARCHAR(50) NOT NULL,
    value TEXT NOT NULL,
    confidence DECIMAL(3,2) NOT NULL,
    method VARCHAR(50) NOT NULL,
    context JSON,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    reviewed_by VARCHAR(255),
    reviewed_at DATETIME,
    owner VARCHAR(255),
    organisation VARCHAR(255),
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_chunk (chunk_id),
    INDEX idx_source (source_type, source_id),
    INDEX idx_type (suggestion_type),
    INDEX idx_status (status),
    INDEX idx_confidence (confidence),
    INDEX idx_owner (owner),
    INDEX idx_organisation (organisation),
    FOREIGN KEY (chunk_id) REFERENCES oc_openregister_chunks(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
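The `status` and `confidence` columns make it possible to approve pending suggestions in bulk, as the "Approve High Confidence (>85%)" action in the review panel suggests. The sketch below demonstrates the idea with Python's built-in `sqlite3` module and a cut-down stand-in table, not the MySQL schema above. The `approved` status value, the `approve_high_confidence` helper and the reviewer name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-in for oc_openregister_suggestions (SQLite, not MySQL).
conn.execute(
    "CREATE TABLE suggestions ("
    "id INTEGER PRIMARY KEY, value TEXT NOT NULL, "
    "confidence REAL NOT NULL, status TEXT NOT NULL DEFAULT 'pending', "
    "reviewed_by TEXT, reviewed_at TEXT)"
)
conn.executemany(
    "INSERT INTO suggestions (value, confidence) VALUES (?, ?)",
    [("API Integration", 0.89), ("Cloud Security", 0.76)],
)


def approve_high_confidence(
    conn: sqlite3.Connection, reviewer: str, threshold: float = 0.85
) -> int:
    """Approve pending suggestions above the threshold; return the row count."""
    cursor = conn.execute(
        "UPDATE suggestions SET status = 'approved', reviewed_by = ?, "
        "reviewed_at = CURRENT_TIMESTAMP "
        "WHERE status = 'pending' AND confidence > ?",
        (reviewer, threshold),
    )
    conn.commit()
    return cursor.rowcount


print(approve_high_confidence(conn, "jdoe"))
```

Restricting the update to rows that are still `pending` makes the operation safe to repeat: a second call changes nothing.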
Metadata Table
CREATE TABLE oc_openregister_metadata (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    uuid VARCHAR(255) NOT NULL UNIQUE,
    chunk_id BIGINT,
    source_type VARCHAR(50) NOT NULL,
    source_id BIGINT NOT NULL,
    metadata_type VARCHAR(50) NOT NULL,
    metadata_key VARCHAR(255) NOT NULL,
    metadata_value TEXT NOT NULL,
    confidence DECIMAL(3,2),
    method VARCHAR(50),
    owner VARCHAR(255),
    organisation VARCHAR(255),
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_chunk (chunk_id),
    INDEX idx_source (source_type, source_id),
    INDEX idx_type (metadata_type),
    INDEX idx_key (metadata_key),
    INDEX idx_owner (owner),
    INDEX idx_organisation (organisation),
    FOREIGN KEY (chunk_id) REFERENCES oc_openregister_chunks(id) ON DELETE SET NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
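The metadata table stores one key and value per row, while the storage examples earlier in this document are nested structures. One possible mapping between the two is sketched below. How the system actually performs this mapping is not documented here, so the `to_metadata_rows` helper, the JSON encoding of non-string values and the `"file"` source type are assumptions.

```python
import json
import uuid


def to_metadata_rows(
    source_type: str, source_id: int, metadata: dict, method: str
) -> list[dict]:
    """Turn {'type': {'key': value}} into one row per key."""
    rows = []
    for metadata_type, entries in metadata.items():
        for key, value in entries.items():
            rows.append({
                "uuid": str(uuid.uuid4()),
                "chunk_id": None,  # document-level metadata, no single chunk
                "source_type": source_type,
                "source_id": source_id,
                "metadata_type": metadata_type,
                "metadata_key": key,
                # metadata_value is TEXT, so non-string values are JSON-encoded.
                "metadata_value": value if isinstance(value, str) else json.dumps(value),
                "method": method,
            })
    return rows


extracted = {"properties": {"version": "2.1", "pageCount": 45}}
for row in to_metadata_rows("file", 42, extracted, "llm"):
    print(row["metadata_type"], row["metadata_key"], row["metadata_value"])
```

Leaving `chunk_id` empty for document-level values matches the nullable column and its `ON DELETE SET NULL` rule, so the metadata survives when individual chunks are removed.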
Processing Pipeline
Complete Flow
User Interface Components
1. Classification Panel
Located on: Document/Object detail page
Features:
- Display current classifications
- Add new classification
- Remove classifications
- Bulk classification
- Classification history
Mockup:
┌─ Classifications ─────────────────────────────┐
│ │
│ Taxonomy: [Document Types ▼] │
│ Category: [Contracts ▼] │
│ Subcategory: [Vendor Contracts ▼] │
│ │
│ [Add Classification] │
│ │
│ Current Classifications: │
│ ┌─────────────────────────────────────────┐ │
│ │ 📄 Document Types > Contracts > │ │
│ │ Vendor Contracts │ │
│ │ Confidence: 100% (Manual) │ │
│ │ Assigned: 2024-01-15 by John Doe │ │
│ │ [Remove] │ │
│ └─────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────┐ │
│ │ 🏷️ Content Themes > Technology > │ │
│ │ Software Development │ │
│ │ Confidence: 87% (AI) │ │
│ │ Assigned: 2024-01-15 (automatic) │ │
│ │ [Remove] │ │
│ └─────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────┘
2. Suggestion Review Panel
Located on: Admin dashboard or document page
Features:
- View pending suggestions
- Approve/reject suggestions
- Promote to taxonomy
- Bulk actions
- Confidence filtering
Mockup:
┌─ AI Suggestions Pending Review ───────────────┐
│ │
│ Showing: [All ▼] | Confidence: [>70% ▼] │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ 💡 Suggested Theme: "API Integration" │ │
│ │ Confidence: 89% │ │
│ │ Found in: 12 documents, 45 chunks │ │
│ │ Similar to: "Software Integration" │ │
│ │ │ │
│ │ [✓ Approve] [✗ Reject] │ │
│ │ [+ Add to Taxonomy] │ │
│ └─────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────┐ │
│ │ 💡 Suggested Category: "Cloud Security" │ │
│ │ Confidence: 76% │ │
│ │ Found in: 8 documents, 23 chunks │ │
│ │ Taxonomy: Technology > Security │ │
│ │ │ │
│ │ [✓ Approve] [✗ Reject] │ │
│ │ [+ Add to Taxonomy] │ │
│ └─────────────────────────────────────────┘ │
│ │
│ [Review All] [Approve High Confidence (>85%)] │
└────────────────────────────────────────────────┘
3. Metadata Display
Located on: Document/Object detail page, search results
Features:
- Show extracted metadata
- Edit metadata
- View extraction method
- Confidence scores
Mockup:
┌─ Metadata ────────────────────────────────────┐
│ │
│ Keywords: (10) │
│ #cloud-migration #kubernetes #docker │
│ #infrastructure #devops #automation │
│ [Show all...] [Edit] │
│ │
│ Themes: (3) │
│ • Digital Transformation (89%) │
│ • Infrastructure Modernization (76%) │
│ • Cost Optimization (65%) │
│ │
│ Properties: │
│ • Document Type: Technical Report │
│ • Author: Engineering Team │
│ • Created: 2024-01-15 │
│ • Language: English (en) │
│ • Reading Level: B2 (Intermediate) │
│ • Word Count: 12,450 │
│ │
│ Search Terms: (5) │
│ "how to migrate to cloud" │
│ "kubernetes deployment guide" │
│ [Show all...] │
│ │