Skip to main content

Named Entity Recognition Technical Documentation

๐Ÿ“š Feature Documentation: See Text Extraction, Vectorization & Named Entity Recognition for user-facing documentation and overview.

Overviewโ€‹

OpenRegister's EntityRecognitionHandler extracts entities (persons, organizations, emails, etc.) from text chunks for GDPR compliance and data classification. The handler is integrated into the TextExtractionService workflow and automatically processes chunks after they are created. It supports multiple detection methods and stores entities in a GDPR register.

Architectureโ€‹

Service Flowโ€‹

Service Implementationโ€‹

EntityRecognitionHandlerโ€‹

Location: lib/Service/TextExtraction/EntityRecognitionHandler.php

Integration: Automatically invoked by TextExtractionService after chunks are created.

Key Methods:

/**
* Process chunks for a source and extract entities.
* This method is called after chunks are created to detect and store entities.
*/
public function processSourceChunks(string $sourceType, int $sourceId, array $options = []): array
{
// Get all chunks for this source (excluding metadata chunks)
$chunks = $this->chunkMapper->findBySource($sourceType, $sourceId);

foreach ($chunks as $chunk) {
$result = $this->extractFromChunk($chunk, $options);
}

return [
'chunks_processed' => $chunksProcessed,
'entities_found' => $totalEntities,
'relations_created' => $totalRelations,
];
}

/**
* Extract entities from a text chunk.
*/
public function extractFromChunk(Chunk $chunk, array $options = []): array
{
$method = $options['method'] ?? 'hybrid';
$entityTypes = $options['entity_types'] ?? null;
$confidenceThreshold = (float) ($options['confidence_threshold'] ?? 0.5);

$text = $chunk->getTextContent();

// Detect entities using selected method
$detectedEntities = $this->detectEntities($text, $method, $entityTypes, $confidenceThreshold);

// Store entities and create relations
$entitiesFound = 0;
$relationsCreated = 0;

foreach ($detectedEntities as $detected) {
// Find or create entity
$entity = $this->findOrCreateEntity(
$detected['type'],
$detected['value'],
$detected['category'] ?? $this->getCategoryForType($detected['type'])
);

// Create entity relation
$relation = new EntityRelation();
$relation->setEntityId($entity->getId());
$relation->setChunkId($chunk->getId());
$relation->setPositionStart($detected['position_start']);
$relation->setPositionEnd($detected['position_end']);
$relation->setConfidence($detected['confidence']);
$relation->setDetectionMethod($method);
$relation->setContext($this->extractContext($text, $detected['position_start'], $detected['position_end']));

// Set source references
if ($chunk->getSourceType() === 'file') {
$relation->setFileId($chunk->getSourceId());
} else if ($chunk->getSourceType() === 'object') {
$relation->setObjectId($chunk->getSourceId());
}

$this->entityRelationMapper->insert($relation);
$entitiesFound++;
$relationsCreated++;
}

return [
'entities_found' => $entitiesFound,
'relations_created' => $relationsCreated,
];
}

Database Schemaโ€‹

Entity Table (GDPR Register)โ€‹

Table: oc_openregister_entities

CREATE TABLE oc_openregister_entities (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
uuid VARCHAR(255) NOT NULL UNIQUE,
type VARCHAR(50) NOT NULL,
value TEXT NOT NULL,
category VARCHAR(50) NOT NULL,
belongs_to_entity_id BIGINT,
metadata JSON,
owner VARCHAR(255),
organisation VARCHAR(255),
detected_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

INDEX idx_type (type),
INDEX idx_category (category),
INDEX idx_detected (detected_at),
INDEX idx_value_hash (value(255)),
INDEX idx_owner (owner),
INDEX idx_organisation (organisation),
INDEX idx_belongs_to (belongs_to_entity_id),

FOREIGN KEY (belongs_to_entity_id) REFERENCES oc_openregister_entities(id) ON DELETE SET NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Key Fields:

  • type: Entity type (PERSON, EMAIL, PHONE, ORGANIZATION, etc.)
  • value: The entity value (e.g., 'john.doe@example.com')
  • category: Classification (personal_data, sensitive_pii, business_data)
  • belongs_to_entity_id: Parent entity (e.g., phone belongs to person)

Entity Relation Tableโ€‹

Table: oc_openregister_entity_relations

CREATE TABLE oc_openregister_entity_relations (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
entity_id BIGINT NOT NULL,
chunk_id BIGINT NOT NULL,
role VARCHAR(50),
file_id BIGINT,
object_id BIGINT,
email_id BIGINT,
position_start INT NOT NULL,
position_end INT NOT NULL,
confidence DECIMAL(3,2) NOT NULL,
detection_method VARCHAR(50) NOT NULL,
context TEXT,
anonymized BOOLEAN NOT NULL DEFAULT FALSE,
anonymized_value VARCHAR(255),
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,

INDEX idx_entity (entity_id),
INDEX idx_chunk (chunk_id),
INDEX idx_file (file_id),
INDEX idx_object (object_id),
INDEX idx_confidence (confidence),
INDEX idx_anonymized (anonymized),

FOREIGN KEY (entity_id) REFERENCES oc_openregister_entities(id) ON DELETE CASCADE,
FOREIGN KEY (chunk_id) REFERENCES oc_openregister_chunks(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Key Fields:

  • entity_id: Reference to Entity
  • chunk_id: Reference to Chunk
  • position_start / position_end: Character positions in chunk
  • confidence: Detection confidence (0.00-1.00)
  • detection_method: How it was found (regex, presidio, llm, hybrid)
  • context: Surrounding text for verification

Entity Typesโ€‹

Supported Entity Typesโ€‹

Entity TypeDescriptionExamplesGDPR Category
PERSONIndividual namesJohn Doe, Jane SmithPersonal Data
EMAILEmail addressesjohn@example.comPersonal Data
PHONEPhone numbers+31 6 12345678Personal Data
ADDRESSPhysical addresses123 Main St, AmsterdamPersonal Data
ORGANIZATIONCompany/org namesAcme CorporationBusiness Data
LOCATIONGeographic locationsAmsterdam, NetherlandsContextual Data
DATEDates and times2025-01-15, January 15thTemporal Data
IBANBank account numbersNL91 ABNA 0417 1643 00Sensitive PII
SSNSocial security numbers123-45-6789Sensitive PII
IP_ADDRESSIP addresses192.168.1.1Contextual Data

Entity Categoriesโ€‹

CategoryDescriptionExamples
personal_dataPersonal identifiable informationPERSON, EMAIL, PHONE, ADDRESS
sensitive_piiHighly sensitive PIIIBAN, SSN
business_dataBusiness-related entitiesORGANIZATION
contextual_dataContextual informationLOCATION, IP_ADDRESS
temporal_dataTime-related informationDATE

Detection Methodsโ€‹

1. Regex Patterns (Local)โ€‹

Status: โœ… Implemented

Advantages:

  • No external dependencies
  • Fast processing (~10-50ms per chunk)
  • Privacy-friendly (no data leaves server)

Limitations:

  • Lower accuracy for complex cases
  • Limited context awareness

Supported Entities:

  • Email addresses
  • Phone numbers
  • IBANs
  • Basic patterns

Implementation:

private function detectWithRegex(string $text, ?array $entityTypes, float $confidenceThreshold): array
{
$entities = [];

// Email detection
if ($entityTypes === null || in_array(self::ENTITY_TYPE_EMAIL, $entityTypes, true)) {
if (preg_match_all('/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b/', $text, $matches, PREG_OFFSET_CAPTURE)) {
foreach ($matches[0] as $match) {
$entities[] = [
'type' => self::ENTITY_TYPE_EMAIL,
'value' => $match[0],
'category' => self::CATEGORY_PERSONAL_DATA,
'position_start' => $match[1],
'position_end' => $match[1] + strlen($match[0]),
'confidence' => 0.9,
];
}
}
}

// Phone detection
if ($entityTypes === null || in_array(self::ENTITY_TYPE_PHONE, $entityTypes, true)) {
$phonePattern = '/\+?[1-9]\d{1,14}|\+?31\s?[0-9]{9}|\d{3}[-.\s]?\d{3}[-.\s]?\d{4}/';
if (preg_match_all($phonePattern, $text, $matches, PREG_OFFSET_CAPTURE)) {
foreach ($matches[0] as $match) {
$entities[] = [
'type' => self::ENTITY_TYPE_PHONE,
'value' => $match[0],
'category' => self::CATEGORY_PERSONAL_DATA,
'position_start' => $match[1],
'position_end' => $match[1] + strlen($match[0]),
'confidence' => 0.7,
];
}
}
}

// Filter by confidence threshold
return array_filter($entities, fn($e) => $e['confidence'] >= $confidenceThreshold);
}

2. Presidio (External Service)โ€‹

Status: ๐Ÿ“ Planned (TODO)

Advantages:

  • High accuracy
  • Specialized for PII detection
  • Maintained by experts

Limitations:

  • Requires external service
  • Data sent to external API
  • Network latency

Implementation (Planned):

private function detectWithPresidio(string $text, ?array $entityTypes, float $confidenceThreshold): array
{
// TODO: Implement Presidio integration
// For now, fall back to regex
$this->logger->debug('[EntityRecognitionHandler] Presidio not yet implemented, using regex fallback');
return $this->detectWithRegex($text, $entityTypes, $confidenceThreshold);
}

3. LLM-Based (AI)โ€‹

Status: ๐Ÿ“ Planned (TODO)

Advantages:

  • Context-aware detection
  • Handles ambiguity
  • Can detect complex entities

Limitations:

  • Slower processing (~500-2000ms per chunk)
  • Higher cost
  • Requires API access

Implementation (Planned):

private function detectWithLLM(string $text, ?array $entityTypes, float $confidenceThreshold): array
{
// TODO: Implement LLM-based entity extraction
// For now, fall back to regex
$this->logger->debug('[EntityRecognitionHandler] LLM extraction not yet implemented, using regex fallback');
return $this->detectWithRegex($text, $entityTypes, $confidenceThreshold);
}

Status: โœ… Implemented

Process:

  1. Start with regex for fast detection
  2. Validate high-confidence entities
  3. Use Presidio/LLM for ambiguous cases (planned)

Implementation:

private function detectWithHybrid(string $text, ?array $entityTypes, float $confidenceThreshold): array
{
// Start with regex for fast detection
$regexEntities = $this->detectWithRegex($text, $entityTypes, $confidenceThreshold);

// TODO: Add Presidio validation for higher confidence
// TODO: Add LLM validation for ambiguous cases

return $regexEntities;
}

Entity Storageโ€‹

Finding or Creating Entitiesโ€‹

private function findOrCreateEntity(string $type, string $value, string $category): GdprEntity
{
// Try to find existing entity by value and type
try {
$qb = $this->entityMapper->getQueryBuilder();
$qb->select('*')
->from('openregister_entities')
->where($qb->expr()->eq('type', $qb->createNamedParameter($type)))
->andWhere($qb->expr()->eq('value', $qb->createNamedParameter($value)))
->setMaxResults(1);

$existing = $this->entityMapper->findEntity($qb);

// Update timestamp
$existing->setUpdatedAt(new DateTime());
$this->entityMapper->update($existing);

return $existing;
} catch (DoesNotExistException $e) {
// Entity doesn't exist, create new one
$entity = new GdprEntity();
$entity->setUuid(Uuid::v4()->toString());
$entity->setType($type);
$entity->setValue($value);
$entity->setCategory($category);
$entity->setDetectedAt(new DateTime());
$entity->setUpdatedAt(new DateTime());

return $this->entityMapper->insert($entity);
}
}

Performanceโ€‹

Processing Timesโ€‹

MethodTime per ChunkNotes
Regex10-50msFast, local
Presidio100-300msExternal API call
LLM500-2000msAI processing
Hybrid10-2000msDepends on validation needs

Recommendationsโ€‹

  1. Batch Processing: Process enhancements in background jobs
  2. Selective Enhancement: Only enable features you need
  3. Hybrid NER: Use fast regex first, validate with LLM only when needed
  4. Confidence Threshold: Adjust based on use case (default: 0.5)

API Endpointsโ€‹

Entity Extractionโ€‹

POST /api/chunks/{chunkId}/extract-entities
Content-Type: application/json

{
"method": "hybrid",
"confidence_threshold": 0.7,
"entity_types": ["PERSON", "EMAIL", "PHONE"]
}

Response:

{
"entities_found": 5,
"relations_created": 5,
"entities": [
{
"type": "PERSON",
"value": "John Doe",
"confidence": 0.95
},
{
"type": "EMAIL",
"value": "john@example.com",
"confidence": 0.9
}
]
}

GDPR Reportsโ€‹

GET /api/entities
GET /api/entities/{entityId}
GET /api/entities/{entityId}/occurrences
GET /api/gdpr/report

Entity Relationshipsโ€‹

Table: oc_openregister_entity_links

Links entities to each other to create a knowledge graph of relationships.

CREATE TABLE oc_openregister_entity_links (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
source_entity_id BIGINT NOT NULL,
target_entity_id BIGINT NOT NULL,
relationship_type VARCHAR(100) NOT NULL,
confidence DECIMAL(3,2) NOT NULL,
detection_method VARCHAR(50) NOT NULL,
metadata JSON,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

INDEX idx_source (source_entity_id),
INDEX idx_target (target_entity_id),
INDEX idx_relationship (relationship_type),
INDEX idx_confidence (confidence),
INDEX idx_both (source_entity_id, target_entity_id),

FOREIGN KEY (source_entity_id) REFERENCES oc_openregister_entities(id) ON DELETE CASCADE,
FOREIGN KEY (target_entity_id) REFERENCES oc_openregister_entities(id) ON DELETE CASCADE,

UNIQUE KEY unique_link (source_entity_id, target_entity_id, relationship_type)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Relationship Typesโ€‹

belongs_to: Contact Information Ownershipโ€‹

The primary relationship type - links contact information to persons or organizations:

Phone Numbers:
- Phone(+31612345678) โ†’ belongs_to โ†’ Person(John Doe)
- Phone(+31201234567) โ†’ belongs_to โ†’ Organization(Acme Corp)

Email Addresses:
- Email(john.doe@example.com) โ†’ belongs_to โ†’ Person(John Doe)
- Email(info@acme.com) โ†’ belongs_to โ†’ Organization(Acme Corp)

Physical Addresses:
- Address(123 Main St, Amsterdam) โ†’ belongs_to โ†’ Person(John Doe)
- Address(456 Business Park) โ†’ belongs_to โ†’ Organization(Acme Corp)

associated_with: Generic Associationโ€‹

For flexible use cases or uncertain relationships:

Examples:
- Phone(+31612345678) โ†’ associated_with โ†’ Person(Jane Doe) (uncertain ownership)
- Email(shared@team.com) โ†’ associated_with โ†’ Organization(Acme Corp) (shared mailbox)

primary_contact / alternate_contact: Contact Priorityโ€‹

Marks which contact method is primary:

Examples:
Person(John Doe):
- Email(john.doe@work.com) โ†’ primary_contact โ†’ Person(John Doe)
- Email(john.personal@gmail.com) โ†’ alternate_contact โ†’ Person(John Doe)
- Phone(+31612345678) โ†’ primary_contact โ†’ Person(John Doe)

Detection Methodsโ€‹

1. Proximity Detectionโ€‹

Entities found near each other in text are likely related:

Text: "Contact John Doe at +31612345678 or john.doe@example.com"

Detection:
- Person(John Doe) found at position 8-16
- Phone(+31612345678) found at position 20-33
- Email(john.doe@example.com) found at position 37-59

Links Created:
- Phone โ†’ belongs_to โ†’ Person (confidence: 0.85, method: proximity)
- Email โ†’ belongs_to โ†’ Person (confidence: 0.85, method: proximity)

2. Pattern Matchingโ€‹

Known patterns indicate relationships:

Patterns:
- "X works for Y" โ†’ works_for
- "X at Y" โ†’ works_for
- "Y's X" โ†’ belongs_to
- "X (Y)" โ†’ belongs_to

Example:
Text: "John Doe works for Acme Corp"

Link Created:
- Person(John Doe) โ†’ works_for โ†’ Organization(Acme Corp)
(confidence: 0.95, method: pattern)

3. LLM Analysisโ€‹

AI understands context and relationships:

Text: "Please contact our sales manager regarding this proposal"

Context:
- Email signature: "John Doe, Sales Manager, Acme Corp"
- Email metadata: from john.doe@acme.com

LLM Analysis:
- Person(John Doe) works for Organization(Acme Corp)
- Email(john.doe@acme.com) belongs to Person(John Doe)
- Role: Sales Manager (stored in Person metadata)

Links Created:
- Person โ†’ works_for โ†’ Organization (confidence: 0.92, method: llm)
- Email โ†’ belongs_to โ†’ Person (confidence: 0.98, method: llm)

Configurationโ€‹

NER Settingsโ€‹

Settings โ†’ OpenRegister โ†’ Text Analysis

  • Enable Entity Extraction: Yes/No
  • Extraction Method: Regex, Presidio, LLM, Hybrid
  • Entity Types: Select types to detect
  • Confidence Threshold: Minimum confidence (0.0-1.0)
  • Context Window: Characters around entity (default: 50)

Feature Documentationโ€‹

Technical Documentationโ€‹