Presidio Integration

Integrate OpenRegister with Microsoft Presidio Analyzer for advanced entity extraction and PII (Personally Identifiable Information) detection for GDPR compliance.

Overview

Presidio Analyzer is Microsoft's open-source PII detection service that provides:

  • High Accuracy: 90-95% precision for PII detection (see Accuracy Comparison)
  • Multi-language Support: Configurable for multiple languages, including Dutch
  • GDPR Compliance: Helps locate personal data to support GDPR/CCPA obligations
  • Self-hosted: Run locally in Docker
  • Extensible: Add custom recognizers

Prerequisites

  • Nextcloud 28+ with OpenRegister installed
  • Docker and Docker Compose
  • At least 2GB RAM
  • Presidio Analyzer container (included in docker-compose.yml)

Quick Start

Step 1: Start Presidio Container

Presidio is included in the docker-compose configuration:

# Start all services including Presidio
docker-compose up -d

# Or specifically start Presidio
docker-compose up -d presidio-analyzer

Step 2: Verify Presidio is Running

# Check health
curl http://localhost:5001/health

# Expected response:
# {"status":"ok"}

Step 3: Configure OpenRegister

OpenRegister detects Presidio automatically when the container is running. To review or adjust the configuration, open the OpenRegister settings:

Settings → OpenRegister → Text Analysis → Entity Extraction

  • Method: Select "Presidio"
  • Presidio URL: http://presidio-analyzer:5001 (from Nextcloud container)
  • Default Language: Select your primary language (e.g., Dutch)
  • Supported Languages: Select languages to support

Configuration Details

Presidio Service Configuration

presidio-analyzer:
  image: mcr.microsoft.com/presidio-analyzer:latest
  container_name: openregister-presidio-analyzer
  restart: always
  ports:
    - "5001:5001"
  environment:
    - GRPC_PORT=5001
    - LOG_LEVEL=INFO
    # Multi-language support including Dutch
    - PRESIDIO_ANALYZER_LANGUAGES=en,nl,de,fr,es
  deploy:
    resources:
      limits:
        memory: 2G
      reservations:
        memory: 512M
  healthcheck:
    test: ["CMD-SHELL", "curl -f http://localhost:5001/health || exit 1"]
    interval: 30s
    timeout: 10s
    retries: 3

Accessing Presidio

Important: Docker Container Communication

  • From Nextcloud container: http://presidio-analyzer:5001
  • From host machine: http://localhost:5001
  • Do not use http://localhost:5001 from inside the Nextcloud container: there, localhost refers to the Nextcloud container itself, so use the container name instead

Supported Entity Types

Presidio detects the following entity types:

Personal Information

  • PERSON: Person names
  • EMAIL_ADDRESS: Email addresses
  • PHONE_NUMBER: Phone numbers
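  • NRP: Nationality, religious, or political group (in Presidio, NRP is not a national identification number; detecting Dutch BSNs typically requires a custom recognizer)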

Financial Information

  • CREDIT_CARD: Credit card numbers
  • IBAN_CODE: International bank account numbers
  • SWIFT_CODE: SWIFT codes

Location Information

  • LOCATION: Geographic locations
  • IP_ADDRESS: IP addresses
  • URL: Web URLs

Organization Information

  • ORGANIZATION: Organization names

Date and Time

  • DATE_TIME: Dates and timestamps

Licenses and Passports

  • MEDICAL_LICENSE: Medical license numbers
  • US_PASSPORT: US passport numbers

Custom Entities

  • Add custom recognizers for domain-specific entities
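
As an illustration of what a domain-specific recognizer might check, the sketch below finds candidate Dutch BSN numbers with a regular expression and validates them with the 11-test (elfproef). It is plain Python using only the standard library, not Presidio's recognizer API. The function names, the `BSN` entity label, and the fixed confidence value are assumptions made for this example.

```python
import re

# Weights of the BSN 11-test ("elfproef"); the last digit is weighted -1.
BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)
# Exactly nine ASCII digits that are not part of a longer digit run.
BSN_PATTERN = re.compile(r"(?<![0-9])[0-9]{9}(?![0-9])")


def is_valid_bsn(digits: str) -> bool:
    """Return True if a 9-digit string passes the 11-test."""
    if not re.fullmatch(r"[0-9]{9}", digits):
        return False
    total = sum(int(d) * w for d, w in zip(digits, BSN_WEIGHTS))
    return total % 11 == 0


def find_bsn_candidates(text: str) -> list:
    """Return entity dicts (hypothetical shape) for checksum-valid BSNs."""
    return [
        {
            "type": "BSN",        # hypothetical custom entity label
            "value": m.group(),
            "start": m.start(),
            "end": m.end(),
            "confidence": 0.9,    # arbitrary illustrative score
        }
        for m in BSN_PATTERN.finditer(text)
        if is_valid_bsn(m.group())
    ]


found = find_bsn_candidates("BSN 111222333 is valid, 123456789 is not.")
print([e["value"] for e in found])
```

The checksum step is what keeps arbitrary nine-digit numbers (order numbers, phone fragments) from being flagged; a real recognizer would likely also look at context words such as "BSN".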

Use Cases

1. GDPR Compliance

Automatically detect and track PII in documents:

use OCA\OpenRegister\Service\NerService;

$nerService = $this->container->get(NerService::class);

// Extract entities from Dutch text
$dutchText = "Jan de Vries woont in Amsterdam en zijn telefoonnummer is 06-12345678.";

$entities = $nerService->extractEntities($dutchText, 'presidio', [
    'language' => 'nl',
]);

foreach ($entities as $entity) {
    echo "Type: {$entity['type']}\n";
    echo "Value: {$entity['value']}\n";
    echo "Confidence: {$entity['confidence']}\n";
    echo "Position: {$entity['start']}-{$entity['end']}\n\n";
}

Output:

Type: PERSON
Value: Jan de Vries
Confidence: 0.85
Position: 0-12

Type: LOCATION
Value: Amsterdam
Confidence: 0.85
Position: 22-31

Type: PHONE_NUMBER
Value: 06-12345678
Confidence: 0.95
Position: 58-69
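
The positions in this output are character offsets into the input text, with the end offset exclusive. A quick way to sanity-check a result is to slice the original text with them. The sketch below does that in Python; the entity dicts simply mirror the output above, and their exact shape inside OpenRegister is an assumption.

```python
text = "Jan de Vries woont in Amsterdam en zijn telefoonnummer is 06-12345678."

# Offsets copied from the example output above.
entities = [
    {"type": "PERSON", "start": 0, "end": 12},
    {"type": "LOCATION", "start": 22, "end": 31},
    {"type": "PHONE_NUMBER", "start": 58, "end": 69},
]


def entity_value(source: str, entity: dict) -> str:
    """Recover the matched substring from half-open [start, end) offsets."""
    return source[entity["start"]:entity["end"]]


values = [entity_value(text, e) for e in entities]
print(values)
```

When doing the same in PHP on text with multi-byte characters, a character-based function such as `mb_substr` is probably the safer choice than byte-based `substr`.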

2. Data Subject Access Requests

Generate GDPR reports showing all PII for a person:

// Find all entities for a specific person
$personEntities = $nerService->findEntitiesByValue('Jan de Vries', 'presidio');

// Generate GDPR report
$report = $gdprService->generateDataSubjectReport($personEntities);

3. Automatic Anonymization

Anonymize detected PII in documents:

// Extract entities
$entities = $nerService->extractEntities($text, 'presidio');

// Anonymize text
$anonymizedText = $anonymizationService->anonymize($text, $entities);
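
How the anonymization service replaces spans is not shown here. One common approach, sketched below in Python, is to substitute each span with a placeholder while walking from the end of the text backwards, so earlier offsets stay valid. The `<TYPE>` placeholder format and the function name are assumptions for illustration only.

```python
def anonymize(text: str, entities: list) -> str:
    """Replace each entity span with a <TYPE> placeholder.

    Spans are processed from the end of the text backwards so that
    replacing one span does not shift the offsets of spans that have
    not been handled yet. Overlapping spans are not handled here.
    """
    result = text
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{entity['type']}>"
        result = result[:entity["start"]] + placeholder + result[entity["end"]:]
    return result


text = "Jan de Vries woont in Amsterdam."
entities = [
    {"type": "PERSON", "start": 0, "end": 12},
    {"type": "LOCATION", "start": 22, "end": 31},
]
anonymized = anonymize(text, entities)
print(anonymized)  # <PERSON> woont in <LOCATION>.
```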

API Usage

Direct API Calls

Test Presidio directly:

# Analyze text for PII
curl -X POST http://localhost:5001/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Jan de Vries woont in Amsterdam",
    "language": "nl",
    "entities": ["PERSON", "LOCATION"]
  }'

Response:

{
  "entities": [
    {
      "entity_type": "PERSON",
      "start": 0,
      "end": 12,
      "score": 0.85,
      "analysis_explanation": {
        "recognizer": "SpacyRecognizer",
        "pattern": "PERSON"
      }
    },
    {
      "entity_type": "LOCATION",
      "start": 22,
      "end": 31,
      "score": 0.85,
      "analysis_explanation": {
        "recognizer": "SpacyRecognizer",
        "pattern": "LOCATION"
      }
    }
  ]
}
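
The same call can be scripted. The Python sketch below builds the request from the curl example and parses a response body, using only the standard library. The helper names are hypothetical, and the parser accepts both the wrapped shape shown above and a bare JSON list, on the assumption that the exact shape may differ between deployments.

```python
import json
import urllib.request
from typing import Optional

ANALYZER_URL = "http://localhost:5001"  # host-side URL used in this guide


def build_analyze_request(text: str, language: str,
                          entities: Optional[list] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /analyze endpoint."""
    payload = {"text": text, "language": language}
    if entities:
        payload["entities"] = entities
    return urllib.request.Request(
        f"{ANALYZER_URL}/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_analyze_response(body: str) -> list:
    """Return the list of results from a response body (wrapped or bare list)."""
    data = json.loads(body)
    return data["entities"] if isinstance(data, dict) else data


request = build_analyze_request(
    "Jan de Vries woont in Amsterdam", "nl", ["PERSON", "LOCATION"]
)
# To actually send it (requires a running analyzer):
#   body = urllib.request.urlopen(request, timeout=10).read().decode("utf-8")

sample_body = '{"entities": [{"entity_type": "PERSON", "start": 0, "end": 12, "score": 0.85}]}'
results = parse_analyze_response(sample_body)
print(request.full_url, results[0]["entity_type"])
```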

Integration with Text Extraction Pipeline

Presidio integrates with OpenRegister's text extraction pipeline.

Configuration

PHP Configuration

// config/ner_config.php
return [
    'ner_enabled' => true,
    'ner_method' => 'presidio', // Use Presidio for production

    'presidio' => [
        'analyzer_url' => 'http://presidio-analyzer:5001',
        'default_language' => 'nl', // Default to Dutch
        'languages' => ['nl', 'en'], // Support Dutch and English
        'score_threshold' => 0.6, // Minimum confidence score
        'entities' => [
            'PERSON',
            'EMAIL_ADDRESS',
            'PHONE_NUMBER',
            'IBAN_CODE',
            'LOCATION',
            'ORGANIZATION',
            'NRP', // Nationality, religious or political group (not BSN)
        ],
    ],
];
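
To show the effect of the `score_threshold` and `entities` settings, the sketch below applies them to a list of raw results in Python. Whether OpenRegister filters on the client side like this or passes the values on to Presidio is not stated here, so treat the function as an illustration of the semantics rather than the actual implementation.

```python
def apply_config_filters(results: list, score_threshold: float,
                         allowed_entities: list) -> list:
    """Keep results at or above the threshold whose type is enabled."""
    allowed = set(allowed_entities)
    return [
        r for r in results
        if r["score"] >= score_threshold and r["entity_type"] in allowed
    ]


raw_results = [
    {"entity_type": "PERSON", "score": 0.85},
    {"entity_type": "LOCATION", "score": 0.40},   # below the 0.6 threshold
    {"entity_type": "DATE_TIME", "score": 0.85},  # type not enabled
]
kept = apply_config_filters(raw_results, 0.6, ["PERSON", "LOCATION"])
print(kept)
```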

Automatic Language Detection

If your documents are written in more than one language, detect the language first:

// Detect language
$language = $nerService->detectLanguage($text);

// Use detected language for entity extraction
$entities = $nerService->extractEntities($text, 'presidio', [
    'language' => $language,
]);
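
A detected language may not be one of the configured languages, so a fallback to the default is worth having. The Python sketch below shows one way to do it, assuming the detector returns an ISO-style code that may carry a region suffix (for example `nl-NL`) or nothing at all; the function name is hypothetical.

```python
from typing import Optional


def choose_language(detected: Optional[str], supported: list, default: str) -> str:
    """Normalise a detected language code and fall back to the default."""
    if detected:
        # Strip region suffixes such as "nl-NL" or "en_US".
        code = detected.strip().lower().replace("_", "-").split("-")[0]
        if code in supported:
            return code
    return default


supported_languages = ["nl", "en"]
print(choose_language("en_US", supported_languages, "nl"))
print(choose_language("de", supported_languages, "nl"))
```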

Accuracy Comparison

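| Method        | Precision | Recall | F1 Score | Speed      |
|---------------|-----------|--------|----------|------------|
| Presidio      | 90-95%    | 85-92% | 87-93%   | ⚡⚡ Medium  |
| MITIE (Local) | 75-85%    | 70-80% | 72-82%   | ⚡⚡⚡ Fast   |
| LLM (GPT-4)   | 92-98%    | 90-95% | 91-96%   | ⚡ Slow     |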

Definitions:

  • Precision: Percentage of detected entities that are correct (low false positives)
  • Recall: Percentage of actual entities that were detected (low false negatives)
  • F1 Score: Harmonic mean of precision and recall (a single figure that balances the two)
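
These definitions can be made concrete with a short calculation. The Python sketch below computes all three metrics from counts of correct detections, false alarms, and misses; the counts are invented for illustration.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple:
    """Compute (precision, recall) from detection counts."""
    detected = true_pos + false_pos   # everything the detector reported
    actual = true_pos + false_neg     # everything that was really there
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical evaluation: 9 correct detections, 1 false alarm, 3 misses.
precision, recall = precision_recall(9, 1, 3)
f1 = f1_score(precision, recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Bounds of the Presidio row in the table, recomputed from its
# precision and recall ranges.
presidio_f1_range = (f1_score(0.90, 0.85), f1_score(0.95, 0.92))
```

Applied to the Presidio row, the formula gives roughly 0.87 and 0.93, which matches the F1 range listed in the table.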

Troubleshooting

Container Won't Start

# Check logs
docker logs openregister-presidio-analyzer

# Common issues:
# 1. Port 5001 already in use
sudo lsof -i :5001

# 2. Insufficient memory
docker stats openregister-presidio-analyzer

# 3. Language models not downloaded
docker exec openregister-presidio-analyzer ls /app/models

Low Accuracy

Solutions:

  1. Specify correct language: 'language' => 'nl' for Dutch
  2. Adjust score threshold: Lower the threshold to get more detections, at the cost of more false positives
  3. Add custom recognizers for domain-specific entities
  4. Use hybrid approach with multiple methods
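
For the hybrid approach, results from several methods have to be merged somehow. One simple strategy, sketched below in Python, is to keep the highest-confidence entity whenever spans overlap. The strategy, the function names, and the sample scores are assumptions for illustration; OpenRegister may combine methods differently.

```python
def overlaps(a: dict, b: dict) -> bool:
    """True if two half-open [start, end) spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def merge_entities(*result_sets: list) -> list:
    """Greedy merge: among overlapping spans, keep the most confident one."""
    candidates = sorted(
        (e for results in result_sets for e in results),
        key=lambda e: e["confidence"],
        reverse=True,
    )
    kept = []
    for candidate in candidates:
        if not any(overlaps(candidate, k) for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda e: e["start"])


presidio_results = [
    {"type": "PERSON", "start": 0, "end": 12, "confidence": 0.85},
    {"type": "PHONE_NUMBER", "start": 58, "end": 69, "confidence": 0.95},
]
other_results = [  # e.g. from a second, faster method
    {"type": "PERSON", "start": 0, "end": 3, "confidence": 0.60},
    {"type": "LOCATION", "start": 22, "end": 31, "confidence": 0.70},
]
merged = merge_entities(presidio_results, other_results)
print([(e["type"], e["start"], e["end"]) for e in merged])
```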

Connection Errors from OpenRegister

Problem: OpenRegister can't connect to Presidio.

Solutions:

  1. Verify analyzer URL uses container name: http://presidio-analyzer:5001
  2. Check containers are on same Docker network
  3. Test connection from Nextcloud container:
    docker exec <nextcloud-container> curl http://presidio-analyzer:5001/health

Slow Processing

Solutions:

  1. Process in batches
  2. Use async processing for large documents
  3. Cache results for repeated text
  4. Adjust timeout settings

Performance Optimization

Batch Processing

Process multiple chunks efficiently:

// Process multiple chunks
$chunks = [$chunk1, $chunk2, $chunk3];
$allEntities = $nerService->extractEntitiesBatch($chunks, 'presidio', [
    'language' => 'nl',
    'async' => true,
]);
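
When a long document is split into chunks, entity offsets come back relative to each chunk and need shifting to document coordinates. The Python sketch below shows one way to split at whitespace while remembering each chunk's offset; the helper names and the character-based size limit are assumptions, and OpenRegister's own chunking may work differently.

```python
def split_into_chunks(text: str, max_chars: int) -> list:
    """Split text into (offset, chunk) pairs of at most max_chars characters,
    preferring to break just after a space."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks = []
    pos = 0
    while pos < len(text):
        end = min(pos + max_chars, len(text))
        if end < len(text):
            # Back up to the last space inside the window, if there is one.
            space = text.rfind(" ", pos + 1, end)
            if space != -1:
                end = space + 1
        chunks.append((pos, text[pos:end]))
        pos = end
    return chunks


def to_document_offsets(entity: dict, chunk_offset: int) -> dict:
    """Shift chunk-relative offsets back to positions in the full text."""
    return {
        **entity,
        "start": entity["start"] + chunk_offset,
        "end": entity["end"] + chunk_offset,
    }


text = "Jan de Vries woont in Amsterdam"
chunks = split_into_chunks(text, max_chars=20)

# Suppose "Amsterdam" was found at 3-12 within the second chunk.
offset, _ = chunks[1]
shifted = to_document_offsets({"type": "LOCATION", "start": 3, "end": 12}, offset)
print(chunks, shifted)
```

Note that splitting can still cut an entity in two when no space falls inside the window; overlapping chunks are one way to reduce that risk.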

Caching

Cache entity extraction results:

// Cache entities for repeated text
$cacheKey = md5($text);
$entities = $cache->get($cacheKey);

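// Assumes the cache returns null on a miss. The strict check keeps an
// empty result ([]) from being treated as a miss and recomputed.
if ($entities === null) {
    $entities = $nerService->extractEntities($text, 'presidio');
    $cache->set($cacheKey, $entities, 3600); // Cache for 1 hour
}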
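
Two details are easy to miss with this pattern: the key should reflect everything that affects the result (method and language, not only the text), and an empty result should still count as a cache hit. The Python sketch below illustrates both with an in-memory cache; the class, the key format, and the lack of expiry are simplifications made for this example.

```python
import hashlib

_MISSING = object()  # sentinel so that an empty list is still a valid hit


def cache_key(text: str, method: str, language: str) -> str:
    """Build a key from the text hash plus the settings that affect results."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"ner:{method}:{language}:{digest}"


class EntityCache:
    """Minimal in-memory cache without expiry (illustration only)."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, key: str, compute):
        value = self._store.get(key, _MISSING)
        if value is _MISSING:
            value = compute()
            self._store[key] = value
        return value


calls = []


def fake_extract():
    """Stand-in for a real extraction call; records that it ran."""
    calls.append(1)
    return []  # text without any PII


cache = EntityCache()
key = cache_key("Geen persoonsgegevens hier.", "presidio", "nl")
first = cache.get_or_compute(key, fake_extract)
second = cache.get_or_compute(key, fake_extract)
print(len(calls))  # the extraction ran only once
```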

Further Reading

Support

For issues specific to: