Files
What are Files in Open Register?
In Open Register, Files are binary data attachments that can be associated with objects. They extend the system beyond structured data to include documents, images, videos, and other file types that are essential for many applications.
Files in Open Register are:
- Securely stored and managed
- Associated with specific objects
- Versioned alongside their parent objects
- Accessible through a consistent API
- Integrated with Nextcloud's file management capabilities
Attaching Files to Objects
Files can be attached to objects in several ways:
- Integrated Uploads: Files can be uploaded directly within object POST/PUT operations using multipart/form-data, base64-encoded content, or URL references
- Schema-defined file properties: When a schema includes properties of type 'file', these are automatically handled during object creation or updates
- Direct API attachment: Files can be added to an object after creation using the file attachment API endpoints
- Base64 encoded content: Files can be included in object data as base64-encoded strings
- URL references: External files can be referenced by URL and will be downloaded and stored locally
Integrated File Uploads
OpenRegister supports integrated file uploads directly within object POST/PUT operations, providing a unified approach to handling structured data (objects) and unstructured data (files) together.
Upload Methods
1. Multipart/Form-Data Upload (Recommended)
Use Case: Uploading files from web forms or file inputs
Authentication Note: ⚠️ Multipart file uploads require session-based authentication or API tokens. HTTP Basic Authentication is not supported for multipart uploads due to Nextcloud security policies. For API testing with Basic Auth, use base64-encoded files instead (see method 2 below).
Example:
POST /index.php/apps/openregister/api/registers/documents/schemas/document/objects
Content-Type: multipart/form-data
title=Annual Report 2024
attachment=@report.pdf
thumbnail=@cover.jpg
JavaScript Example (with session cookies):
const formData = new FormData();
formData.append('title', 'Annual Report 2024');
formData.append('attachment', fileInput.files[0]);
formData.append('thumbnail', thumbnailInput.files[0]);
fetch('/index.php/apps/openregister/api/registers/documents/schemas/document/objects', {
method: 'POST',
body: formData,
credentials: 'include' // Important: includes session cookies
})
.then(response => response.json())
.then(data => console.log('Created:', data));
Why this is recommended:
- ✅ Most efficient: No encoding overhead, files transferred directly
- ✅ Preserves metadata: Original filename and MIME type are maintained
- ✅ No guessing: Extension and filename are exactly as uploaded
- ✅ Best file quality: No conversion or inference errors
- ✅ Low memory footprint: Can stream directly from disk to disk
- ✅ Fastest method: Direct transfer without intermediate conversions
Authentication Methods for Multipart Uploads:
- ✅ Session cookies (recommended for web applications)
- ✅ Nextcloud App Passwords (for external applications)
- ✅ OAuth2 tokens (for third-party integrations)
- ❌ HTTP Basic Auth (not supported due to Nextcloud security policies)
2. Base64-Encoded Files (Recommended for API Testing)
Use Case: Embedding files in JSON payloads, API integrations, testing with HTTP Basic Auth
Authentication: Works with all authentication methods including HTTP Basic Auth.
Data URI Format:
{
"title": "Screenshot",
"image": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA..."
}
Plain Base64 Format:
{
"title": "Document",
"attachment": "JVBERi0xLjQKJeLjz9MKMyAwIG9iago8PC9MZW5ndGggMj..."
}
Note: Base64 encoding increases file size by approximately 33% and original filenames are lost. Use only for small files (< 100 KB) or when multipart is not possible.
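The size overhead can be estimated before choosing this method. The following is a minimal Node.js sketch; the helper names `toDataUri` and `base64Length` are illustrative and not part of OpenRegister.

```javascript
// Hypothetical client-side helpers; not part of the OpenRegister API.

// Build a data URI (MIME type plus base64 payload) from raw bytes.
function toDataUri(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// Length of the base64 text for a given number of raw bytes:
// every 3 input bytes become 4 output characters (padding included).
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

const payload = {
  title: 'Note',
  attachment: toDataUri(Buffer.from('hello'), 'text/plain'),
};
console.log(JSON.stringify(payload));
console.log(base64Length(75000)); // 75,000 raw bytes -> 100000 characters
```

Using the data URI form keeps the MIME type with the content, which plain base64 cannot do.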
3. URL References
Use Case: Referencing remote files, importing from external sources
Example:
{
"title": "External Document",
"attachment": "https://example.com/files/document.pdf",
"logo": "https://cdn.example.com/images/logo.png"
}
Note: URL references are slower as the server must download the file from the external URL. Use only for trusted sources or migration scenarios.
4. Mixed Upload Methods
You can combine all three methods in a single request:
POST /index.php/apps/openregister/api/registers/documents/schemas/document/objects
Content-Type: multipart/form-data
title=Complete Package
mainDocument=@contract.pdf
signature=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA...
reference=https://example.com/terms.pdf
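When values of all three kinds travel in one request, a client may want to check which form each string takes before sending it. The sketch below is an assumption about how such a check could look on the client side; the function name `classifyFileValue` is hypothetical, and the server's own detection rules are not documented here and may differ.

```javascript
// Hypothetical client-side classifier; the server may apply different rules.
function classifyFileValue(value) {
  // Data URI: "data:<mime>;base64,<payload>"
  if (/^data:[^;,]+;base64,/.test(value)) return 'data-uri';

  // Absolute http(s) URL
  try {
    const url = new URL(value);
    if (url.protocol === 'http:' || url.protocol === 'https:') return 'url';
  } catch {
    // Not a parseable absolute URL; fall through to the base64 check.
  }

  // Plain base64: only base64 alphabet characters, padded to a multiple of 4
  if (value.length > 0 && value.length % 4 === 0 && /^[A-Za-z0-9+/]+={0,2}$/.test(value)) {
    return 'base64';
  }
  return 'unknown';
}

console.log(classifyFileValue('https://example.com/terms.pdf'));
console.log(classifyFileValue('data:image/png;base64,iVBORw0KGgo='));
```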
Array of Files
Files can be uploaded as arrays:
Schema:
{
"properties": {
"attachments": {
"type": "array",
"items": {
"type": "file"
}
}
}
}
Upload:
{
"title": "Multi-File Document",
"attachments": [
"data:application/pdf;base64,JVBERi0xLjQKJeL...",
"https://example.com/file2.pdf",
"data:image/png;base64,iVBORw0KGgo..."
]
}
Update Operations
File properties work the same way with PUT/PATCH operations:
PUT /index.php/apps/openregister/api/registers/documents/schemas/document/objects/abc-123
Content-Type: multipart/form-data
title=Updated Document
attachment=@new-version.pdf
Note: Updating a file property replaces the previous file.
Error Handling
Invalid MIME Type
{
"error": "File at attachment has invalid type 'application/zip'. Allowed types: application/pdf, application/msword"
}
File Too Large
{
"error": "File at attachment exceeds maximum size (10485760 bytes). File size: 15728640 bytes"
}
Upload Error
{
"error": "Failed to read uploaded file for field 'attachment'"
}
URL Download Failure
{
"error": "Unable to fetch file from URL: https://example.com/missing.pdf"
}
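Type and size errors can often be caught before the request is sent. The following sketch assumes the client knows the schema's allowed MIME types and size limit; `validateFile` and the example limits are illustrative, and the server remains the authority on what is accepted.

```javascript
// Hypothetical client-side pre-check; limits are example values only.
function validateFile(file, { allowedTypes, maxSize }) {
  const problems = [];
  if (!allowedTypes.includes(file.type)) {
    problems.push(`invalid type '${file.type}'`);
  }
  if (file.size > maxSize) {
    problems.push(`exceeds maximum size (${maxSize} bytes), got ${file.size} bytes`);
  }
  return problems; // an empty array means no problem was found locally
}

const rules = {
  allowedTypes: ['application/pdf', 'application/msword'],
  maxSize: 10485760, // 10 MiB
};
console.log(validateFile({ type: 'application/zip', size: 15728640 }, rules));
```

In a browser, the `type` and `size` properties of a `File` from a file input can be passed in directly.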
Backward Compatibility
✅ Existing file endpoints remain unchanged:
POST /api/objects/{register}/{schema}/{id}/files
GET /api/objects/{register}/{schema}/{id}/files
DELETE /api/objects/{register}/{schema}/{id}/files/{fileId}
Integrated uploads and the dedicated file endpoints both work and can be used interchangeably.
Performance Comparison
| Method | Speed | File Size | Metadata | Use Case |
|---|---|---|---|---|
| Multipart | Fastest | Original | Preserved | ✅ Recommended for all uploads |
| Base64 | Medium | +33% larger | Lost | ⚠️ Small files only (< 100 KB) |
| URL | Slowest | Original | Preserved | 🐌 External imports only |
Best Practices
- ✅ ALWAYS use Multipart for user uploads
  - Users expect filenames to be preserved
  - Prevents confusion about generic filenames
- ⚠️ Base64 only for APIs
  - When API client doesn't support multipart
  - Document that filenames will be lost
  - Always use data URI format with MIME type
- 🐌 URLs only for trusted sources
  - Use timeout limits (max 30 seconds)
  - Validate content-length headers upfront
  - Implement retry logic
- 📝 Document your choice
  - If using base64 or URL, explain why
  - Make users aware of trade-offs
- 🧪 Test performance
  - Measure upload times in production
  - Monitor failure rates for URL downloads
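The URL guidance above (timeout, up-front Content-Length check, retries) might look like this on the calling side, before a URL is handed to the API. This is a sketch under assumptions: `isWithinSizeLimit` and `precheckUrl` are hypothetical names, the defaults are example values, and it relies on the global `fetch` and `AbortSignal.timeout` available in recent Node.js versions and browsers.

```javascript
// Hypothetical helpers; names and defaults are illustrative.

// True only when the header is a plain non-negative integer within the limit.
function isWithinSizeLimit(contentLength, maxBytes) {
  if (typeof contentLength !== 'string' || !/^\d+$/.test(contentLength)) return false;
  return Number(contentLength) <= maxBytes;
}

// Send a HEAD request with a timeout, retrying on network errors,
// timeouts, and non-2xx responses. Resolves to true when the
// advertised size is within the limit.
async function precheckUrl(url, options = {}) {
  const {
    maxBytes = 10485760, // example limit: 10 MiB
    timeoutMs = 30000,   // 30 seconds, matching the guidance above
    retries = 2,
    fetchImpl = fetch,   // injectable for testing
  } = options;

  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const response = await fetchImpl(url, {
        method: 'HEAD',
        signal: AbortSignal.timeout(timeoutMs),
      });
      if (!response.ok) throw new Error(`HEAD ${url} returned ${response.status}`);
      return isWithinSizeLimit(response.headers.get('content-length'), maxBytes);
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  throw lastError;
}
```

A missing or malformed Content-Length header is treated as a failed check here; whether to allow such URLs anyway is a policy decision.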
File Metadata and Tagging
Each file attachment includes rich metadata:
- Basic properties (name, size, type, extension)
- Creation and modification timestamps
- Access and download URLs
- Checksum for integrity verification
- Custom tags for categorization
Tagging System
Files can be tagged with both simple labels and key-value pairs:
- Tags with a colon (':') are treated as key-value pairs and can be used for advanced filtering and organization
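As a sketch of how a client could interpret such tags: the code below splits on the first colon, which is an assumption (the text above does not say how tags with several colons are handled), and it assumes each file entry exposes a `tags` array. The names `parseTag` and `filterByTag` are hypothetical.

```javascript
// Hypothetical helpers; splitting on the FIRST colon is an assumption.
function parseTag(tag) {
  const index = tag.indexOf(':');
  if (index === -1) return { label: tag }; // simple label
  return { key: tag.slice(0, index), value: tag.slice(index + 1) };
}

// Keep only files that carry the given key-value tag.
function filterByTag(files, key, value) {
  return files.filter((file) =>
    (file.tags || []).some((tag) => {
      const parsed = parseTag(tag);
      return parsed.key === key && parsed.value === value;
    })
  );
}

console.log(parseTag('department:legal'));
console.log(parseTag('confidential'));
```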
Version Control
The system maintains file versions by:
- Tracking file modifications with timestamps
- Preserving checksums to detect changes
- Integrating with the object audit trail system
- Supporting file restoration from previous versions
Security and Access Control
File attachments inherit the security model of their parent objects:
- Files are stored in Nextcloud with appropriate permissions
- Share links can be generated for controlled external access
- Access is managed through the OpenRegister user and group system
- Files are associated with the OpenRegister application user for consistent permissions
File Operations
The system supports the following operations on file attachments:
- Retrieving Files
- Updating Files
- Deleting Files
File Preview and Rendering
The system leverages Nextcloud's preview capabilities for supported file types:
- Images are displayed as thumbnails
- PDFs can be previewed in-browser
- Office documents can be viewed with compatible apps
- Preview URLs are generated for easy embedding
Integration with Object Lifecycle
File attachments are fully integrated with the object lifecycle:
- When objects are created, their file folders are automatically provisioned
- When objects are updated, file references are maintained
- When objects are deleted, associated files can be optionally preserved or removed
- File operations are recorded in the object's audit trail
Technical Implementation
The file attachment system is implemented through two main service classes:
- FileService: Handles low-level file operations, folder management, and Nextcloud integration
- ObjectService: Provides high-level methods for attaching, retrieving, and managing files in the context of objects
These services work together to provide a seamless file management experience within the OpenRegister application.
File Structure
| Field | Type | Description |
|---|---|---|
| id | integer | Unique identifier of the file in Nextcloud |
| uuid | string | Unique identifier for the file |
| filename | string | Name of the file |
| downloadUrl | string <uri> | Direct download URL for the file |
| shareUrl | string <uri> | URL to access the file via share link |
| accessUrl | string <uri> | URL to access the file |
| extension | string | File extension |
| checksum | string | ETag hash for file versioning |
| source | integer | Source identifier |
| userId | string | ID of the user who owns the file |
| base64 | string | Base64 encoded content of the file |
| filePath | string | Full path to the file in Nextcloud |
| created | string <date-time> | ISO 8601 timestamp when file was first shared |
| updated | string <date-time> | ISO 8601 timestamp of last modification |
Example:
{
"id": 123,
"uuid": "123e4567-e89b-12d3-a456-426614174000",
"filename": "profile.jpg",
"extension": "jpg",
"checksum": "abc123",
"source": 1,
"userId": "user-12345",
"base64": "base64encodedstring",
"filePath": "/files/profile.jpg",
"created": "2023-02-15T14:30:00Z",
"updated": "2023-05-20T10:15:00Z"
}
How Files are Stored
Open Register provides flexible storage options for files:
1. Nextcloud Storage
By default, files are stored in Nextcloud's file system, leveraging its robust file management capabilities, including:
- Access control
- Versioning
- Encryption
- Collaborative editing
2. External Storage
For larger deployments or specialized needs, files can be stored in:
- Object storage systems (S3, MinIO)
- Content delivery networks
- Specialized document management systems
3. Database Storage
Small files can be stored directly in the database for simplicity and performance.
File Features
1. Versioning
Files maintain version history, allowing you to:
- Track changes over time
- Revert to previous versions
- Compare different versions
2. Access Control
Files inherit access control from their parent objects, ensuring consistent security:
- Users who can access an object can access its files
- Additional file-specific permissions can be applied
- Permissions can be audited
3. Metadata
Files support rich metadata to provide context and improve searchability:
- Standard metadata (creation date, size, type)
- Custom metadata specific to your application
- Extracted metadata (e.g., EXIF data from images)
4. Preview Generation
Open Register can generate previews for common file types:
- Thumbnails for images
- PDF previews
- Document previews
5. Content Extraction
For supported file types, content can be extracted for indexing and search:
- Text extraction from documents
- OCR for scanned documents and images
- Metadata extraction
OpenRegister now includes enhanced text extraction with entity tracking (GDPR), language detection, and language level assessment. See Enhanced Text Extraction & GDPR Entity Tracking for details.
Asynchronous Processing: Text extraction happens in the background after file upload, ensuring:
- Fast uploads: Your file uploads complete instantly without waiting
- Non-blocking: Users don't experience delays during file operations
- Reliable: Background jobs automatically handle retries for failed extractions
- Resource-efficient: Processing happens when resources are available
Text Extraction Options:
OpenRegister supports two text extraction engines:
- LLPhant (Default) - PHP-based extraction:
  - ✓ Native support: TXT, MD, HTML, JSON, XML, CSV
  - ○ Library support: PDF, DOCX, DOC, XLSX, XLS (requires PhpOffice, PdfParser)
  - ⚠️ Limited: PPTX, ODT, RTF
  - ✗ No support: Image files (JPG, PNG, GIF, WebP)
  - Best for: Privacy-conscious environments, regular documents
  - Cost: Free (included)
- Dolphin AI - Advanced AI-powered extraction:
  - ✓ All document formats with superior quality
  - ✓ OCR for scanned documents and images (JPG, PNG, GIF, WebP)
  - ✓ Advanced table extraction
  - ✓ Formula recognition
  - ✓ Multi-language OCR
  - Best for: Complex documents, scanned materials, images with text
  - Cost: API subscription required
Extraction Scope Options:
- None: Text extraction disabled
- All files: Extract from all uploaded files
- Files in folders: Extract only from files in specific folders
- Files attached to objects: Extract only from files linked to objects (recommended)
Typical Processing Times:
- Text files: < 1 second
- PDFs (LLPhant): 2-10 seconds
- PDFs (Dolphin): 3-15 seconds
- Large documents or OCR: 10-60 seconds
- Images with OCR (Dolphin): 5-20 seconds
You can configure text extraction in Settings → File Configuration. Check extraction status in the file's metadata after upload.
Technical Implementation
Background Job Processing:
Text extraction uses Nextcloud's background job system for reliable, async processing:
- File Upload - User uploads a file
- Job Queuing - 'FileChangeListener' automatically queues 'FileTextExtractionJob'
- Job Execution - Background job system processes the file when resources are available
- Text Extraction - Selected extractor (LLPhant or Dolphin) processes the file
- Chunking - Text is automatically split into chunks with overlap (1000 chars per chunk, 200 char overlap)
- Storage - Extracted text and chunks stored in 'FileText' entity for reuse
- Completion - Status updated to 'completed' or 'failed'
Note: Text extraction is now fully independent of SOLR. Chunks are generated during extraction and stored in the database, making them reusable for SOLR indexing, vector embeddings, AI processing, or any other service that needs chunked text.
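To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap, using the 1000/200 figures from the list above. It is an illustration only: `chunkText` is a hypothetical name, and the actual extractor may split on sentence or word boundaries rather than at fixed character positions.

```javascript
// Illustrative sketch; the real extractor may choose boundaries differently.
function chunkText(text, size = 1000, overlap = 200) {
  if (overlap >= size) throw new RangeError('overlap must be smaller than size');
  const step = size - overlap; // each chunk starts 800 characters after the previous one
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + size, text.length);
    chunks.push({ text: text.slice(start, end), start, end });
    if (end === text.length) break; // the last chunk reached the end of the text
  }
  return chunks;
}

const sample = chunkText('x'.repeat(2000));
console.log(sample.map((chunk) => [chunk.start, chunk.end]));
// -> [ [ 0, 1000 ], [ 800, 1800 ], [ 1600, 2000 ] ]
```

Each chunk repeats the last 200 characters of the one before it, so text that straddles a boundary still appears whole in at least one chunk.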
File Type Compatibility Matrix:
LLPhant Support:
- ✓ Native (TXT, MD, HTML, JSON, XML, CSV) - Perfect quality, very fast
- ○ Library (PDF, DOCX, DOC, XLSX, XLS) - Good quality, medium speed
- ⚠️ Limited (PPTX, ODT, RTF) - Basic text only, use Dolphin for better results
- ✗ No Support (JPG, PNG, GIF, WebP) - Requires Dolphin with OCR
Dolphin AI Support:
- ✓ All formats with superior quality
- ✓ OCR for scanned documents and images
- ✓ Table extraction with structure preserved
- ✓ Formula recognition (LaTeX format)
- ✓ Multi-language support
- ✓ Layout understanding (multi-column, etc.)
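The matrix above can be encoded as a small lookup, for example to warn users before they upload a file the configured extractor cannot read. The extension lists are taken from the matrix; the function names `llphantSupport` and `suggestExtractor` are hypothetical, and treating unlisted extensions as unknown is an assumption.

```javascript
// Extension lists copied from the LLPhant matrix above.
const LLPHANT_SUPPORT = {
  native: ['txt', 'md', 'html', 'json', 'xml', 'csv'],
  library: ['pdf', 'docx', 'doc', 'xlsx', 'xls'],
  limited: ['pptx', 'odt', 'rtf'],
  none: ['jpg', 'png', 'gif', 'webp'],
};

// Return the LLPhant support level for a filename, or 'unknown'.
function llphantSupport(filename) {
  const dot = filename.lastIndexOf('.');
  const extension = dot === -1 ? '' : filename.slice(dot + 1).toLowerCase();
  for (const [level, extensions] of Object.entries(LLPHANT_SUPPORT)) {
    if (extensions.includes(extension)) return level;
  }
  return 'unknown';
}

// Suggest an engine: LLPhant where it works well, Dolphin where it does not.
function suggestExtractor(filename) {
  const level = llphantSupport(filename);
  if (level === 'native' || level === 'library') return 'LLPhant';
  if (level === 'limited' || level === 'none') return 'Dolphin';
  return null; // extension not covered by the matrix
}

console.log(suggestExtractor('scan.png'));
```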
OCR-Specific Use Cases (Dolphin only):
- Document Digitization - Scanning paper archives into searchable text
- Receipt Processing - Photo receipts from mobile devices
- Screenshot Analysis - Extract text from application screenshots
- Infographic Text - Extract text from images with embedded text
- Historical Documents - Digitize old scanned materials
Quality Requirements for OCR:
- Minimum: 150 DPI resolution
- Recommended: 300+ DPI
- Clear, high-contrast images
- Minimal blur or distortion
- Properly oriented (not rotated)
Extraction Configuration Options:
Configure in Settings → File Configuration:
- Text Extractor Selection:
  - LLPhant (default) - Local, free, privacy-friendly
  - Dolphin - Advanced AI, requires API key
- Extraction Scope:
  - None - Disabled
  - All files - Every uploaded file
  - Files in folders - Specific folders only
  - Files attached to objects - Only object attachments (recommended)
- Extraction Mode:
  - Background (default) - Async via background jobs
  - Immediate - Synchronous during upload (slower)
  - Manual - Triggered by admin action only
- Enabled File Types:
  - Select which file extensions to process
  - Different for LLPhant vs Dolphin
  - Enable OCR formats (images) only if using Dolphin
Integration Tests:
The file text extraction system includes comprehensive integration tests:
# Run file extraction tests
vendor/bin/phpunit tests/Integration/FileTextExtractionIntegrationTest.php
# Test cases covered:
# - File upload queues background job
# - Background job execution completes
# - Text extraction end-to-end with content verification
# - Multiple file format support (TXT, MD, JSON)
# - Extraction metadata recording (status, method, timestamps)
Monitoring Extraction:
Check extraction status via logs:
# Watch extraction progress
docker logs -f nextcloud-container | grep FileTextExtractionJob
# Check for errors
docker logs nextcloud-container | grep 'extraction failed'
# View extraction statistics
# Settings → File Configuration → Statistics section
Files Management Page
The Files page provides a centralized view of all files tracked in the text extraction system.
Accessing the Files Page:
Navigate to Files in the main menu to view all files with their extraction status.
Features:
- File List Table:
  - File name and path
  - File type and size
  - Extraction status (Pending, Processing, Completed, Failed)
  - Number of text chunks created
  - Last extraction timestamp
- Status Indicators:
  - 🟠 Pending: File discovered but not yet extracted
  - 🔵 Processing: Extraction in progress
  - 🟢 Completed: Successfully extracted
  - 🔴 Failed: Extraction error occurred
- File Actions:
  - Retry: Re-extract failed files
  - View Error: See detailed error message for failed extractions
- Pagination:
  - Browse through large file lists (50 files per page)
  - Navigate between pages
- Refresh:
  - Update the list to see latest extraction status
Use Cases:
- Monitor extraction progress across all files
- Identify and retry failed extractions
- View error details for troubleshooting
- Verify which files have been processed
Core File Extraction API:
OpenRegister provides dedicated API endpoints for file text extraction (moved from settings to core functionality):
- GET /api/files - List all tracked files with extraction status
- GET /api/files/{id} - Get single file extraction information
- POST /api/files/{id}/extract - Extract text from specific file
- POST /api/files/extract - Extract all pending files (batch processing)
- POST /api/files/retry-failed - Retry all failed extractions
- GET /api/files/stats - Get extraction statistics
Smart Re-Extraction:
The system automatically detects when files need re-extraction by comparing:
- File modification time ('mtime' from Nextcloud's 'oc_filecache')
- Last extraction time ('extractedAt' from 'oc_openregister_file_texts')
If 'mtime > extractedAt', the file is re-extracted to ensure content is up-to-date.
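The comparison can be sketched as follows. `needsExtraction` is a hypothetical name; treating a file with no recorded extraction as needing one is an assumption, and both inputs are expected in a form the `Date` constructor accepts (if a timestamp is stored as Unix seconds, convert it to milliseconds first).

```javascript
// Illustrative sketch of the mtime > extractedAt rule described above.
function needsExtraction(mtime, extractedAt) {
  // Assumption: a file that was never extracted always needs extraction.
  if (extractedAt === null || extractedAt === undefined) return true;
  return new Date(mtime).getTime() > new Date(extractedAt).getTime();
}

console.log(needsExtraction('2024-05-02T10:00:00Z', '2024-05-01T10:00:00Z')); // true
console.log(needsExtraction('2024-05-01T10:00:00Z', '2024-05-02T10:00:00Z')); // false
```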
File Tracking Table:
Extracted text and metadata are stored in 'oc_openregister_file_texts' with:
- 'file_id' - Links to Nextcloud's 'oc_filecache' table
- 'extraction_status' - pending, processing, completed, failed
- 'extractedAt' - Timestamp of last extraction
- 'text_content' - Full extracted text
- 'text_length' - Character count
- 'chunked' - Whether text has been chunked
- 'chunk_count' - Number of chunks created
- 'chunks_json' - JSON array of text chunks with offsets (new in v0.2.7)
- 'extraction_method' - LLPhant or Dolphin
- Plus SOLR indexing and vectorization tracking
Chunking Details: Each chunk in 'chunks_json' contains the chunk text, start offset, and end offset. This allows for precise text retrieval and consistent chunking across all services.
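Because each chunk carries its offsets, a consumer can check that the stored chunks still match the full text. The sketch below assumes the chunk properties are named `text`, `start`, and `end`, and that `end` is exclusive; the actual property names in 'chunks_json' are not given here, so adjust them to the real data.

```javascript
// Property names (text, start, end) and the exclusive end offset are assumptions.
function verifyChunks(textContent, chunksJson) {
  const chunks = JSON.parse(chunksJson);
  return chunks.every(
    (chunk) => textContent.slice(chunk.start, chunk.end) === chunk.text
  );
}

const fullText = 'The quick brown fox';
const stored = JSON.stringify([
  { text: 'The quick', start: 0, end: 9 },
  { text: 'quick brown fox', start: 4, end: 19 },
]);
console.log(verifyChunks(fullText, stored)); // true
```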
Working with Files
Uploading Files
Files can be uploaded and attached to objects:
POST /api/objects/{id}/files
Content-Type: multipart/form-data
file: [binary data]
metadata: {"author": "Legal Department", "securityLevel": "confidential"}
Retrieving Files
You can download a file:
GET /api/files/{id}
Or get file metadata:
GET /api/files/{id}/metadata
Listing Files for an Object
You can retrieve all files associated with an object:
GET /api/objects/{register}/{schema}/{id}/files
Pagination
The listing endpoint supports standard pagination parameters:
| Parameter | Default | Minimum | Maximum | Notes |
|---|---|---|---|---|
| _page | 1 | 1 | none | Page number |
| _limit | 30 | 1 | none | Values below 1 are clamped to 1; there is no upper ceiling, so _limit=5000 is honoured |
Per-object attachment counts are the natural bound for this endpoint, so there is no artificial cap on _limit. Call sites that want a fixed page size should set _limit explicitly.
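A call site that wants a fixed page size could build the request path like this. `filesListPath` is a hypothetical helper; it clamps values below 1 on the client to mirror the documented server behaviour, and it returns only the path and query, leaving the host and any app prefix to the caller.

```javascript
// Hypothetical helper; builds the listing path with explicit pagination.
function filesListPath(register, schema, id, { page = 1, limit = 30 } = {}) {
  const query = new URLSearchParams({
    _page: String(Math.max(1, page)),   // pages start at 1
    _limit: String(Math.max(1, limit)), // values below 1 are clamped to 1
  });
  const segments = [register, schema, id].map(encodeURIComponent).join('/');
  return `/api/objects/${segments}/files?${query}`;
}

console.log(filesListPath('documents', 'document', 'abc-123', { page: 2, limit: 50 }));
// -> /api/objects/documents/document/abc-123/files?_page=2&_limit=50
```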
Lock-aware response (authenticated callers)
When the request is made by an authenticated user, each file entry in the response carries two additional fields that expose Nextcloud lock state:
{
"id": 42,
"title": "report.pdf",
"locked": true,
"lock": {
"type": "user",
"scope": "exclusive",
"owner": "alice",
"createdAt": "2026-04-21T10:00:00+00:00",
"expiresAt": "2026-04-21T10:30:00+00:00"
}
}
| Field | Description |
|---|---|
| locked | true if the file has an active Nextcloud lock, false otherwise |
| lock.type | Lock type: one of "user", "app", "token" |
| lock.scope | WebDAV lock scope: one of "exclusive", "shared" |
| lock.owner | User id (for user/token types) or app id (for app type) |
| lock.createdAt | ISO 8601 timestamp of when the lock was acquired |
| lock.expiresAt | ISO 8601 timestamp of expiry, or null if the lock has no timeout |
When Nextcloud's lock provider is not available (the files_lock app is disabled), locked is false and lock is omitted.
Anonymous callers
When the request has no authenticated session (e.g. a public object fetched without credentials), both locked and lock are omitted from every entry. This prevents anonymous callers from observing who holds a lock or what apps are editing the file.
Locked-file resilience
A single Nextcloud-locked file no longer crashes the listing. If any file raises a LockedException during formatting, the endpoint returns HTTP 200 with a minimal envelope for that entry instead of failing the whole request:
{
"id": 42,
"title": "report.pdf",
"locked": true,
"lock": { "type": "user", "scope": "exclusive", "owner": "alice", "createdAt": "...", "expiresAt": null },
"error": "locked"
}
For anonymous callers, the stub carries only {id, title, error: "locked"} — no locked, no lock. A structured info-level log line is emitted server-side for each locked file encountered, so operators can still observe lock contention.
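A client consuming this listing needs to handle three shapes: full entries, full entries with lock details, and locked stubs. The sketch below shows one way to do that using only the fields described above; `summarizeEntry` is a hypothetical name and the output wording is illustrative.

```javascript
// Hypothetical helper; uses only the fields described in this section.
function summarizeEntry(entry) {
  if (entry.error === 'locked') {
    // Stub entry: the file could not be formatted because it is locked.
    return entry.lock
      ? `${entry.title}: unavailable, locked by ${entry.lock.owner}`
      : `${entry.title}: unavailable, locked`; // anonymous callers get no lock details
  }
  if (entry.locked && entry.lock) {
    const until = entry.lock.expiresAt == null ? 'no expiry' : `until ${entry.lock.expiresAt}`;
    return `${entry.title}: locked by ${entry.lock.owner} (${until})`;
  }
  return `${entry.title}: no lock reported`;
}

console.log(summarizeEntry({ id: 42, title: 'report.pdf', error: 'locked' }));
```

An entry without a `locked` field is reported as "no lock reported" rather than "unlocked", since anonymous callers never see lock state.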
Updating Files
Files can be updated in two ways:
1. Update File Content
Upload a new version of the file:
PUT /api/objects/{register}/{schema}/{objectId}/files/{fileId}
Content-Type: application/json
{
"content": "[base64 encoded content or raw content]",
"tags": ["tag1", "tag2"]
}
2. Update Metadata Only
Update only the file metadata (tags) without changing content:
PUT /api/objects/{register}/{schema}/{objectId}/files/{fileId}
Content-Type: application/json
{
"tags": ["updated-tag1", "updated-tag2"]
}
Note: The 'content' parameter is optional. If omitted, only the metadata will be updated without modifying the file content itself.
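Both update variants can be produced by one helper that leaves out 'content' when only the tags change. This is a sketch: `buildFileUpdate` is a hypothetical name, and it always base64-encodes the content, which is one of the two forms the endpoint is described as accepting.

```javascript
// Hypothetical helper; builds the JSON body for the PUT request.
function buildFileUpdate({ content, tags } = {}) {
  const body = {};
  if (content !== undefined) {
    // Accepts a string or Buffer and sends it base64-encoded.
    body.content = Buffer.from(content).toString('base64');
  }
  if (tags !== undefined) {
    body.tags = tags;
  }
  return JSON.stringify(body);
}

console.log(buildFileUpdate({ tags: ['updated-tag1', 'updated-tag2'] }));
console.log(buildFileUpdate({ content: 'hello', tags: ['v2'] }));
```

The returned string can be sent as the request body with `Content-Type: application/json`.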
Deleting Files
Files can be deleted when no longer needed:
DELETE /api/files/{id}
File Relationships
Files have important relationships with other core concepts:
Files and Objects
- Files are attached to objects
- An object can have multiple files
- Files inherit permissions from their parent object
- Files are versioned alongside their parent object
Files and Schemas
- Schemas can define expectations for file attachments
- File validation can be specified in schemas (allowed types, max size)
- Schemas can define required file attachments
Files and Registers
- Registers can be configured with different file storage options
- File storage policies can be defined at the register level
- Registers can have quotas for file storage
Use Cases
1. Document Management
Attach important documents to business objects:
- Contracts to customer records
- Invoices to order records
- Specifications to product records
2. Media Management
Store and manage media assets:
- Product images
- Marketing materials
- Training videos
3. Evidence Collection
Maintain evidence for regulatory or legal purposes:
- Compliance documentation
- Audit evidence
- Legal case files