Text Extraction Technical Documentation

📚 Feature Documentation: See Text Extraction, Vectorization & Named Entity Recognition for user-facing documentation and overview.

Overview

OpenRegister's Text Extraction Service converts content from various sources (files, objects, emails, calendar items) into searchable text chunks. The service uses a handler-based architecture to support multiple source types and extraction methods.

Architecture

Handler-Based Design

Location: lib/Service/TextExtraction/

Service Flow

Database Schema

Chunk Entity

Table: oc_openregister_chunks

CREATE TABLE oc_openregister_chunks (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    uuid VARCHAR(255) NOT NULL UNIQUE,
    source_type VARCHAR(50) NOT NULL,
    source_id BIGINT NOT NULL,
    text_content TEXT NOT NULL,
    start_offset INT NOT NULL,
    end_offset INT NOT NULL,
    chunk_index INT NOT NULL,
    checksum VARCHAR(64),
    language VARCHAR(10),
    language_level VARCHAR(20),
    language_confidence DECIMAL(3,2),
    detection_method VARCHAR(50),
    indexed_in_solr BOOLEAN NOT NULL DEFAULT FALSE,
    vectorized BOOLEAN NOT NULL DEFAULT FALSE,
    owner VARCHAR(255),
    organisation VARCHAR(255),
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

    INDEX idx_source (source_type, source_id),
    INDEX idx_checksum (checksum),
    INDEX idx_language (language),
    INDEX idx_owner (owner),
    INDEX idx_organisation (organisation)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Key Fields:

  • source_type: 'file', 'object', 'email', 'calendar'
  • source_id: ID of the source entity
  • checksum: SHA256 hash of source text for change detection
  • text_content: The actual chunk text
  • start_offset / end_offset: Character positions in original text
  • chunk_index: Sequential chunk number (0-based)
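
For example, fetching all chunks for one file in reading order is served by the idx_source composite index (a sketch; the ID is illustrative):

-- All chunks for file 12345, in order; uses idx_source (source_type, source_id)
SELECT chunk_index, start_offset, end_offset, text_content
FROM oc_openregister_chunks
WHERE source_type = 'file'
  AND source_id = 12345
ORDER BY chunk_index;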

Entity Class

Location: lib/Db/Chunk.php

class Chunk extends Entity implements JsonSerializable
{
    protected ?string $uuid = null;
    protected ?string $sourceType = null;
    protected ?int $sourceId = null;
    protected ?string $textContent = null;
    protected int $startOffset = 0;
    protected int $endOffset = 0;
    protected int $chunkIndex = 0;
    protected ?string $checksum = null;
    protected ?string $language = null;
    protected ?string $languageLevel = null;
    protected ?float $languageConfidence = null;
    protected ?string $detectionMethod = null;
    protected bool $indexedInSolr = false;
    protected bool $vectorized = false;
    protected ?string $owner = null;
    protected ?string $organisation = null;
    protected ?DateTime $createdAt = null;
    protected ?DateTime $updatedAt = null;
}

Handlers

FileHandler

Location: lib/Service/TextExtraction/FileHandler.php

Supported Formats:

  • Documents: PDF, DOCX, DOC, ODT, RTF
  • Spreadsheets: XLSX, XLS, CSV
  • Presentations: PPTX
  • Text Files: TXT, MD, HTML, JSON, XML
  • Images: JPG, PNG, GIF, WebP, TIFF (via OCR)

Extraction Methods:

  • LLPhant: Local PHP-based extraction
  • Dolphin: AI-powered extraction with OCR
  • Native: Direct text reading for plain text files

Implementation:

public function extract(int $sourceId, array $sourceMeta): string
{
    $nodes = $this->rootFolder->getById($sourceId);
    if (empty($nodes)) {
        // Guard against files that no longer exist (OCP\Files\NotFoundException)
        throw new NotFoundException('File not found: ' . $sourceId);
    }
    $file = $nodes[0];

    // Extract based on MIME type
    $mimeType = $sourceMeta['mimetype'] ?? 'unknown';

    // TODO: Implement actual extraction logic
    return $file->getContent();
}
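
Until that TODO is filled in, here is a minimal sketch of what MIME-based dispatch could look like; the extractPdf, extractDocx, and extractWithOcr helpers are illustrative names, not existing methods:

// Hypothetical MIME-type dispatch; the helper methods are assumed, not real.
$text = match ($mimeType) {
    'application/pdf' => $this->extractPdf($file),
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document' => $this->extractDocx($file),
    'image/jpeg', 'image/png', 'image/tiff' => $this->extractWithOcr($file),
    'text/plain', 'text/markdown' => $file->getContent(),
    default => $file->getContent(),
};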

ObjectHandler

Location: lib/Service/TextExtraction/ObjectHandler.php

Process:

  1. Load object from database
  2. Extract schema and register information
  3. Flatten nested object structures
  4. Concatenate property values with context
  5. Add metadata (UUID, version, organization)

Implementation:

public function extract(int $sourceId, array $sourceMeta): string
{
    $object = $this->objectMapper->find($sourceId);
    return $this->convertObjectToText($object);
}

private function convertObjectToText(ObjectEntity $object): string
{
    $textParts = [];
    $textParts[] = "Object ID: " . $object->getUuid();

    // Add schema info
    if ($object->getSchema() !== null) {
        $schema = $this->schemaMapper->find($object->getSchema());
        $textParts[] = "Type: " . ($schema->getTitle() ?? $schema->getName());
    }

    // Extract object data
    $objectData = $object->getObject();
    if (is_array($objectData)) {
        $textParts[] = "Content: " . $this->extractTextFromArray($objectData);
    }

    return implode("\n", $textParts);
}
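
The extractTextFromArray helper is referenced above but not shown; a minimal recursive sketch (an illustration, not the actual implementation) could look like this:

// Illustrative sketch: recursively flatten nested object data into
// "key: value" lines (not the actual implementation).
private function extractTextFromArray(array $data, string $prefix = ''): string
{
    $parts = [];
    foreach ($data as $key => $value) {
        $label = $prefix === '' ? (string) $key : $prefix . '.' . $key;
        if (is_array($value)) {
            $parts[] = $this->extractTextFromArray($value, $label);
        } elseif (is_scalar($value)) {
            $parts[] = $label . ': ' . $value;
        }
    }
    return implode("\n", $parts);
}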

Chunking Strategies

Recursive Character Splitting

Priority Order:

  1. Paragraph breaks (\n\n)
  2. Sentence endings (. ! ?)
  3. Line breaks (\n)
  4. Commas and semicolons
  5. Word boundaries (spaces)
  6. Character split (fallback)

Best for: Natural language documents, articles, reports

Configuration:

$chunks = $textExtractionService->textToChunks(['text' => $text], [
    'chunk_size' => 1000,
    'chunk_overlap' => 200,
    'strategy' => 'RECURSIVE_CHARACTER'
]);
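
A simplified sketch of the recursive strategy (separator priority reduced and overlap handling omitted; this illustrates the idea rather than the service's actual chunkRecursive):

// Simplified recursive splitter: break at the highest-priority separator
// that fits within the size limit, else hard-split (illustrative only).
function chunkRecursive(string $text, int $size, array $seps = ["\n\n", ". ", "\n", ", ", " "]): array
{
    if (mb_strlen($text) <= $size) {
        return [$text];
    }
    foreach ($seps as $sep) {
        $pos = mb_strrpos(mb_substr($text, 0, $size), $sep);
        if ($pos !== false && $pos > 0) {
            $cut = $pos + mb_strlen($sep);
            return array_merge(
                [mb_substr($text, 0, $cut)],
                chunkRecursive(mb_substr($text, $cut), $size, $seps)
            );
        }
    }
    // No separator found: fall back to a hard character split.
    return array_merge(
        [mb_substr($text, 0, $size)],
        chunkRecursive(mb_substr($text, $size), $size, $seps)
    );
}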

Fixed Size Splitting

Settings:

  • Chunk size: 1000 characters (default)
  • Overlap: 200 characters (default)
  • Minimum chunk: 100 characters

Best for: Structured data, code, logs

Configuration:

$chunks = $textExtractionService->textToChunks(['text' => $text], [
    'chunk_size' => 1000,
    'chunk_overlap' => 200,
    'strategy' => 'FIXED_SIZE'
]);
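
A minimal sketch of fixed-size splitting with overlap (illustrative; each chunk starts (size - overlap) characters after the previous one):

// Fixed-size chunking with overlap: each chunk starts (size - overlap)
// characters after the previous one (illustrative sketch).
function chunkFixedSize(string $text, int $size = 1000, int $overlap = 200): array
{
    $chunks = [];
    $step = max(1, $size - $overlap);
    for ($offset = 0; $offset < mb_strlen($text); $offset += $step) {
        $chunks[] = mb_substr($text, $offset, $size);
    }
    return $chunks;
}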

Supported File Formats

Text Documents

| Format | Max Size | Processing Time | Notes |
|--------|----------|-----------------|-------|
| .txt   | 100MB    | < 1s            | UTF-8, ISO-8859-1, Windows-1252 |
| .md    | 50MB     | < 1s            | Preserves structure |
| .html  | 20MB     | 1-3s            | Strips scripts/styles |

PDF Documents

| Type | Max Size | Processing Time | Libraries |
|------|----------|-----------------|-----------|
| Text PDF | 100MB | 2-10s | Smalot PdfParser, pdftotext |
| Scanned PDF (OCR) | 50MB | 10-60s | Tesseract OCR |

Requirements for OCR:

# Install Tesseract
sudo apt-get install tesseract-ocr

# With additional languages (e.g. Dutch and German)
sudo apt-get install tesseract-ocr-nld tesseract-ocr-deu

Microsoft Office

| Format | Max Size | Processing Time | Libraries |
|--------|----------|-----------------|-----------|
| .docx | 50MB | 2-5s | PhpOffice/PhpWord |
| .xlsx | 30MB | 3-10s | PhpOffice/PhpSpreadsheet |
| .pptx | 50MB | 2-5s | ZipArchive + XML |

Images (OCR)

| Format | Max Size | Processing Time | Requirements |
|--------|----------|-----------------|--------------|
| JPG, PNG, GIF, BMP, TIFF | 20MB | 5-15s/page | Tesseract OCR |

Best Practices:

  • Use high-resolution scans (300 DPI ideal)
  • Ensure text is legible and not skewed
  • Black text on white background works best

Data Formats

| Format | Max Size | Processing Time | Notes |
|--------|----------|-----------------|-------|
| .json | 20MB | 1-2s | Recursive extraction |
| .xml | 20MB | 1-3s | Tag names and content |

Service Implementation

TextExtractionService

Location: lib/Service/TextExtractionService.php

Key Methods:

/**
 * Extract text from a source using the appropriate handler.
 */
public function extractSourceText(
    string $sourceType,
    int $sourceId,
    array $sourceMeta
): array {
    $handler = $this->getHandler($sourceType);
    $text = $handler->extract($sourceId, $sourceMeta);
    $checksum = hash('sha256', $text);

    return [
        'source_type' => $sourceType,
        'source_id' => $sourceId,
        'text' => $text,
        'checksum' => $checksum,
        'owner' => $handler->getOwner($sourceId, $sourceMeta),
        'organisation' => $handler->getOrganisation($sourceId, $sourceMeta),
    ];
}

/**
 * Split text into chunks.
 */
public function textToChunks(array $payload, array $options = []): array {
    $chunkSize = $options['chunk_size'] ?? self::DEFAULT_CHUNK_SIZE;
    $chunkOverlap = $options['chunk_overlap'] ?? self::DEFAULT_CHUNK_OVERLAP;
    $strategy = $options['strategy'] ?? self::RECURSIVE_CHARACTER;

    // Apply chunking strategy
    $chunks = match ($strategy) {
        self::FIXED_SIZE => $this->chunkFixedSize($payload['text'], $chunkSize, $chunkOverlap),
        self::RECURSIVE_CHARACTER => $this->chunkRecursive($payload['text'], $chunkSize, $chunkOverlap),
        default => $this->chunkRecursive($payload['text'], $chunkSize, $chunkOverlap)
    };

    // Map to chunk rows with checksum. With overlap, consecutive
    // fixed-size chunks start (size - overlap) characters apart.
    $step = $chunkSize - $chunkOverlap;
    return array_map(function ($index, $chunkText) use ($payload, $step) {
        return [
            'text_content' => $chunkText,
            'chunk_index' => $index,
            'start_offset' => $index * $step,
            'end_offset' => ($index * $step) + strlen($chunkText),
            'checksum' => $payload['checksum'] ?? null,
        ];
    }, array_keys($chunks), $chunks);
}

/**
 * Persist chunks to database.
 */
public function persistChunksForSource(
    string $sourceType,
    int $sourceId,
    array $chunks,
    ?string $owner,
    ?string $organisation,
    int $sourceTimestamp,
    array $payload
): void {
    // Delete existing chunks for this source
    $this->chunkMapper->deleteBySource($sourceType, $sourceId);

    // Create new chunks
    foreach ($chunks as $chunkData) {
        $chunk = new Chunk();
        $chunk->setUuid(Uuid::v4()->toString());
        $chunk->setSourceType($sourceType);
        $chunk->setSourceId($sourceId);
        $chunk->setTextContent($chunkData['text_content']);
        $chunk->setStartOffset($chunkData['start_offset']);
        $chunk->setEndOffset($chunkData['end_offset']);
        $chunk->setChunkIndex($chunkData['chunk_index']);
        $chunk->setChecksum($chunkData['checksum'] ?? null);
        $chunk->setOwner($owner);
        $chunk->setOrganisation($organisation);

        $this->chunkMapper->insert($chunk);
    }
}
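
Putting the three methods together, a single-source flow might look like this (a sketch; $service and $fileId are placeholders and error handling is omitted):

// Extract, chunk, and persist one file using the methods above
// ($service and $fileId are placeholders; error handling omitted).
$payload = $service->extractSourceText('file', $fileId, ['mimetype' => 'application/pdf']);
$chunks = $service->textToChunks($payload, ['chunk_size' => 1000, 'chunk_overlap' => 200]);
$service->persistChunksForSource(
    'file',
    $fileId,
    $chunks,
    $payload['owner'],
    $payload['organisation'],
    time(),
    $payload
);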

Change Detection

The system uses SHA256 checksums to detect content changes:

// Calculate checksum from extracted text
$checksum = hash('sha256', $extractedText);

// Check if source is up-to-date
public function isSourceUpToDate(
    int $sourceId,
    string $sourceType,
    int $sourceTimestamp,
    bool $forceReExtract
): bool {
    if ($forceReExtract === true) {
        return false;
    }

    // Get existing chunks
    $existingChunks = $this->chunkMapper->findBySource($sourceType, $sourceId);

    if (empty($existingChunks)) {
        return false;
    }

    // Check if checksum matches
    $existingChecksum = $existingChunks[0]->getChecksum();
    $currentChecksum = $this->calculateSourceChecksum($sourceId, $sourceType);

    return $existingChecksum === $currentChecksum;
}

Benefits:

  • Avoids unnecessary re-extraction
  • Efficient change detection
  • Automatic updates on content modification

Performance

Processing Times

| File Type | Size | Extraction Time | Chunking Time |
|-----------|------|-----------------|---------------|
| Text (.txt) | < 1MB | < 1s | 50ms |
| PDF (text) | < 5MB | 1-3s | 100ms |
| PDF (OCR) | < 5MB | 10-60s | 100ms |
| DOCX | < 5MB | 1-2s | 100ms |
| XLSX | < 5MB | 2-5s | 150ms |
| Images (OCR) | < 5MB | 5-15s | 50ms |

Bulk Processing

  • Batch Size: 100 items per batch (configurable)
  • Parallel Processing: Supported via background jobs
  • Progress Tracking: Real-time status updates
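
A sketch of how a batch could be queued through Nextcloud's job list (the ['fileId' => ...] argument shape is an assumption):

// Queue extraction jobs in batches of 100 via OCP\BackgroundJob\IJobList
// (the job argument shape is an assumption, not the actual contract).
foreach (array_chunk($pendingFileIds, 100) as $batch) {
    foreach ($batch as $fileId) {
        $this->jobList->add(FileTextExtractionJob::class, ['fileId' => $fileId]);
    }
}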

API Endpoints

Text Extraction

POST /api/files/{fileId}/extract
POST /api/objects/{objectId}/extract
GET /api/chunks?source_type=file&source_id={id}
GET /api/chunks/{chunkId}
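
For example, triggering extraction for a file and then listing its chunks (host, credentials, and the /apps/openregister base path are assumptions):

# Trigger extraction for file 12345 (host, credentials, base path are placeholders)
curl -X POST -u admin:secret \
  "https://cloud.example.com/apps/openregister/api/files/12345/extract"

# List the resulting chunks
curl -u admin:secret \
  "https://cloud.example.com/apps/openregister/api/chunks?source_type=file&source_id=12345"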

Chunking

POST /api/chunks/chunk
Content-Type: application/json

{
  "source_type": "file",
  "source_id": 12345,
  "options": {
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "strategy": "RECURSIVE_CHARACTER"
  }
}

Error Handling

Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| File too large | Exceeds format limit | Reduce size or increase limit |
| Format not supported | Unrecognized format | Enable format or convert file |
| Extraction failed | Corrupted file | Verify file integrity |
| OCR failed | Tesseract not installed | Install Tesseract OCR |

Recovery

  • Failed extractions can be retried via API
  • Error messages stored for debugging
  • Automatic retry on content update

Processing Pipeline

Step-by-Step Flow

┌──────────────────────────────────────────────────────┐
│ 1. File Upload                                       │
│ - User uploads file to Nextcloud                     │
│ - File stored in data directory                      │
│ - Upload completes immediately (non-blocking)        │
└──────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────┐
│ 2. Background Job Queued                             │
│ - FileChangeListener detects new/updated file        │
│ - Queues FileTextExtractionJob asynchronously        │
│ - User request completes without delay               │
└──────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────┐
│ 3. Background Processing (non-blocking)              │
│ - Job runs in background (typically within seconds)  │
│ - Check MIME type and validate format                │
│ - Use format-specific extractor                      │
│ - Handle encoding issues                             │
│ - Clean/normalize text                               │
└──────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────┐
│ 4. Document Chunking                                 │
│ - Apply selected strategy                            │
│ - Create overlapping chunks                          │
│ - Preserve metadata                                  │
└──────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────┐
│ 5. Storage                                           │
│ - Store chunks in database                           │
│ - Calculate and store checksum                       │
│ - Link chunks to source                              │
└──────────────────────────────────────────────────────┘

Event Listener

Location: lib/Listener/FileChangeListener.php

Purpose: Automatically process files on upload/update.

Events:

  • NodeCreatedEvent - File uploaded
  • NodeWrittenEvent - File updated

Behavior:

  • Checks if extraction is needed (checksum comparison)
  • Triggers text extraction automatically via background job
  • Updates extraction status
  • Full error handling and logging

Registration: Registered in Application.php with service container integration.
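
A sketch of what that registration typically looks like with Nextcloud's bootstrap API (the actual Application.php may differ):

// lib/AppInfo/Application.php (sketch; the shipped file may differ)
use OCA\OpenRegister\Listener\FileChangeListener;
use OCP\AppFramework\App;
use OCP\AppFramework\Bootstrap\IBootContext;
use OCP\AppFramework\Bootstrap\IBootstrap;
use OCP\AppFramework\Bootstrap\IRegistrationContext;
use OCP\Files\Events\Node\NodeCreatedEvent;
use OCP\Files\Events\Node\NodeWrittenEvent;

class Application extends App implements IBootstrap
{
    public function register(IRegistrationContext $context): void
    {
        // Route both file events to the same listener.
        $context->registerEventListener(NodeCreatedEvent::class, FileChangeListener::class);
        $context->registerEventListener(NodeWrittenEvent::class, FileChangeListener::class);
    }

    public function boot(IBootContext $context): void
    {
    }
}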

File Size Limits

Default Limits

| Format | Default Max Size | Configurable | Reason |
|--------|------------------|--------------|--------|
| Text/Markdown | 100MB | Yes | Memory-efficient |
| HTML | 20MB | Yes | DOM parsing overhead |
| PDF (text) | 100MB | Yes | Direct extraction |
| PDF (OCR) | 50MB | Yes | Processing intensive |
| Office Docs | 30-50MB | Yes | Library limitations |
| Images | 20MB | Yes | OCR memory usage |
| JSON/XML | 20MB | Yes | Parsing complexity |

Modifying Limits

Edit lib/Service/SolrFileService.php:

// File size limits (in bytes)
private const MAX_FILE_SIZE_TEXT = 104857600; // 100MB
private const MAX_FILE_SIZE_PDF = 104857600; // 100MB
private const MAX_FILE_SIZE_OFFICE = 52428800; // 50MB
private const MAX_FILE_SIZE_IMAGE = 20971520; // 20MB

Warning: Increasing limits may cause:

  • Memory exhaustion
  • Slow processing
  • Timeouts on large files

Dependencies

PHP Libraries (Installed via Composer)

{
    "smalot/pdfparser": "^2.0",           // PDF text extraction
    "phpoffice/phpword": "^1.0",          // Word document processing
    "phpoffice/phpspreadsheet": "^1.0"    // Excel processing
}

System Commands

| Command | Purpose | Installation |
|---------|---------|--------------|
| pdftotext | PDF extraction fallback | apt-get install poppler-utils |
| tesseract | OCR for images/scanned PDFs | apt-get install tesseract-ocr |

Checking Dependencies

# Check if pdftotext is available
which pdftotext

# Check Tesseract version
tesseract --version

# Test OCR
tesseract test-image.png output

Troubleshooting

Debugging File Processing

Enable Debug Logging:

  1. Go to Settings → File Management
  2. Enable "Detailed Logging"
  3. Check logs at: data/nextcloud.log or via Docker: docker logs nextcloud-container

Look for:

[TextExtractionService] Processing file: document.pdf
[TextExtractionService] Extraction method: pdfParser
[TextExtractionService] Extracted text length: 45678
[TextExtractionService] Created 12 chunks

Performance Issues

If processing is slow:

  1. Check file size and format
  2. Monitor memory usage: docker stats
  3. Reduce chunk size to decrease processing time
  4. Use faster extraction methods when possible

File Access Issues

If you see 'file not found' or 'failed to open stream' errors, note that the system processes files through asynchronous background jobs:

  • Non-blocking uploads: File uploads complete immediately without waiting for text extraction
  • Background processing: Text extraction runs in background jobs (typically within seconds)
  • Path filtering: Only OpenRegister files are processed
  • Automatic retries: Failed extractions are automatically retried by the background job system

To check background job status:

# View pending background jobs
docker exec -u 33 <nextcloud-container> php occ background-job:list

# Check logs for extraction job status
docker logs <nextcloud-container> | grep FileTextExtractionJob

Quality Issues

If text extraction is poor:

  1. Verify source file quality
  2. For scanned documents, ensure 300+ DPI
  3. Use native PDF over scanned when possible
  4. Test with different OCR languages
  5. Check for corrupted files

Best Practices

For Administrators

✅ Do:

  • Enable only needed file formats
  • Set reasonable size limits
  • Monitor storage growth
  • Schedule bulk processing off-hours
  • Keep dependencies updated

โŒ Don't:

  • Process unnecessary file types
  • Set extremely high size limits
  • Skip dependency checks
  • Run bulk processing during peak hours

For Users

✅ Do:

  • Use text-based PDFs when possible
  • Provide high-quality scans (300 DPI)
  • Use consistent file naming
  • Organize files in logical folders

โŒ Don't:

  • Upload password-protected files (won't extract)
  • Use low-resolution scans
  • Mix unrelated content in single file
  • Rely on OCR for perfect accuracy

FAQ

Q: Can I process password-protected files?
A: No, password-protected files cannot be extracted. Remove the password first.

Q: How accurate is OCR?
A: 90-98% accuracy for good quality scans (300 DPI, clear text). Lower for poor scans.

Q: Can I process files retroactively?
A: Yes! Use bulk extraction via API or admin interface.

Q: Do I need to re-process files after changing chunk strategy?
A: Yes, existing chunks won't update automatically. Re-extract to apply new strategy.

Q: What happens to old chunks when I re-process?
A: Old chunks are replaced with new ones. Checksums ensure only changed content is re-processed.

Q: Can I see extracted text before chunking?
A: Check logs with debug mode enabled. Text is logged before chunking.

Migration

From FileText to Chunks

The legacy FileText entity stored chunks in a chunks_json column. These were migrated to the dedicated Chunk table:

// Migration: Version1Date20251117000000
// Adds checksum column to openregister_chunks table

// Migration: Version1Date20251118000000
// Drops deprecated openregister_file_texts and openregister_object_texts tables

Migration Strategy:

  1. Create openregister_chunks table
  2. Migrate existing chunks from chunks_json to table
  3. Update services to use ChunkMapper
  4. Drop old tables (after verification)
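
Step 2 might be sketched as follows (illustrative only; table and column names are assumptions based on the schema above, not the shipped migration code):

// Illustrative step-2 sketch: copy legacy chunks_json rows into the new
// chunks table (names are assumptions; not the shipped migration).
$result = $db->executeQuery('SELECT source_id, chunks_json FROM oc_openregister_file_texts');
while ($row = $result->fetch()) {
    foreach (json_decode($row['chunks_json'], true) ?? [] as $index => $text) {
        $chunk = new Chunk();
        $chunk->setUuid(Uuid::v4()->toString());
        $chunk->setSourceType('file');
        $chunk->setSourceId((int) $row['source_id']);
        $chunk->setTextContent($text);
        $chunk->setChunkIndex($index);
        $chunkMapper->insert($chunk);
    }
}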
