Manual Entity Anonymisation
OpenRegister exposes POST /api/files/{fileId}/manual-entities for the operator-supplied "add this exact text to the anonymisation list for this file" flow. The endpoint performs chunk-aware exact-string matching against the file's previously extracted text, creates (or reuses) a catalogue entry for the value, and creates one EntityRelation row per occurrence found with detectionMethod = manual.
The endpoint complements presidio/openanonymiser/pattern detection by giving operators a clean way to catch values the detectors missed without polluting the model with edge cases. It feeds into the same anonymise flow as auto-detected entities — POST /api/files/{fileId}/anonymize picks up the new relations on its next run.
Endpoint contract
POST /api/files/{fileId}/manual-entities
Content-Type: application/json
Body: {
  "value": string,          // operator-supplied text (REQUIRED, PII)
  "type": string,           // entity type tag, e.g. "PERSON" (REQUIRED)
  "wholeWord": boolean,     // default true
  "caseSensitive": boolean  // default true
}
Note on category: the oc_openregister_entities.category column is NOT NULL and is populated server-side from type via the same EntityRecognitionHandler::getCategoryForType() mapping the detector flow uses (PERSON / EMAIL / PHONE / ADDRESS → personal_data; IBAN / SSN → sensitive_pii; ORGANIZATION → business_data; LOCATION → contextual_data; DATE → temporal_data; everything else → contextual_data). The endpoint intentionally does NOT accept a category field in v1; operator override on category is a follow-up if a concrete use case emerges.
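The mapping in the note above can be mirrored client-side, for example to predict which category a manual entity will receive. The sketch below only restates the table from the note; the names CATEGORY_BY_TYPE and category_for_type are illustrative assumptions, and the authoritative mapping remains the server-side EntityRecognitionHandler::getCategoryForType().

```python
# Illustrative mirror of the type -> category table described in the note.
# Names here are hypothetical; the server-side mapping is authoritative.
CATEGORY_BY_TYPE = {
    "PERSON": "personal_data",
    "EMAIL": "personal_data",
    "PHONE": "personal_data",
    "ADDRESS": "personal_data",
    "IBAN": "sensitive_pii",
    "SSN": "sensitive_pii",
    "ORGANIZATION": "business_data",
    "LOCATION": "contextual_data",
    "DATE": "temporal_data",
}

# Every type not listed above falls back to this category.
DEFAULT_CATEGORY = "contextual_data"


def category_for_type(entity_type: str) -> str:
    """Return the category the note above associates with an entity type."""
    return CATEGORY_BY_TYPE.get(entity_type, DEFAULT_CATEGORY)
```

Whether the server normalises the case of type before the lookup is not stated here, so the sketch compares the tag exactly as given.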
Response shape (201 / 200):
{
  "entity": {
    "id": 7,
    "uuid": "01HMX...",
    "value": "Jan Jansen",
    "type": "PERSON",
    "reused": false
  },
  "relations": [
    {
      "id": 200,
      "chunkId": 100,
      "positionStart": 13,
      "positionEnd": 23,
      "context": "... Jan Jansen woont in ..."
    }
  ],
  "matchCount": 1,
  "matchesSkipped": 0
}
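As a rough illustration of the request contract, the following sketch builds (but does not send) a request with Python's standard library. The helper name build_manual_entity_request and the base_url parameter are assumptions: base_url stands for whatever prefix your deployment serves the OpenRegister API under, and authentication is deliberately left out.

```python
import json
import urllib.request


def build_manual_entity_request(base_url, file_id, value, entity_type,
                                whole_word=True, case_sensitive=True):
    """Build a POST request for the manual-entities endpoint (hypothetical helper)."""
    # Both fields are REQUIRED by the contract; fail early on the client side.
    if not value:
        raise ValueError("value is required")
    if not entity_type:
        raise ValueError("type is required")

    payload = {
        "value": value,
        "type": entity_type,
        "wholeWord": whole_word,          # documented default: true
        "caseSensitive": case_sensitive,  # documented default: true
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/files/{file_id}/manual-entities",
        data=json.dumps(payload).encode("utf-8"),
        # A non-JSON Content-Type is rejected with 415.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; the body of a 201 or 200 reply can then be decoded with json.loads and read using the field names shown in the response shape.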
Status codes:
- 201: one or more matches found; the catalogue entry (new or reused) plus the inserted relations are in the body.
- 200: zero matches found in the file. The catalogue entry is still created/reused and is available for use on other files. Body adds a message field: "Text not found in file. Catalogue entry created (or reused) and is available for use on other files."
- 400: { "error": "invalid_request", "field": "value"|"type" } for missing required body fields, or { "error": "regex_compile_failure" } for malformed Unicode in the needle or a value longer than the chunk overlap (200 chars).
- 401: no authenticated session: { "error": "unauthenticated" }.
- 403: acting user lacks write access to the file: { "error": "forbidden", "reason": "write access to file required" }. Same RBAC check as PATCH /api/entity-relations/{id}: the file MUST be reachable in the user folder and isUpdateable() MUST return true.
- 415: non-JSON Content-Type: { "error": "unsupported_media_type", "reason": "..." }.
- 422: { "error": "file_not_extracted" }. The file has no extracted chunks; the operator must trigger text extraction first.
- 500: unexpected failure: { "error": "internal_error" }. Body never echoes the operator-supplied value (ADR-005).
The endpoint is @NoAdminRequired — non-admins can add manual entities to files they can write.
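A client may want to fold the status codes listed above into a small set of outcomes. The sketch below is one possible way to do that; the function name classify_response and the outcome labels for the 2xx cases are assumptions, while the error strings are the ones documented above.

```python
# Error statuses documented for this endpoint.
DOCUMENTED_ERROR_STATUSES = {400, 401, 403, 415, 422, 500}


def classify_response(status: int, body: dict) -> str:
    """Map an HTTP status and decoded JSON body to a short outcome label."""
    if status == 201:
        # One or more matches found; relations are listed in the body.
        return "matches_inserted"
    if status == 200:
        # Zero matches; the catalogue entry was still created or reused.
        return "no_match_catalogue_only"
    if status in DOCUMENTED_ERROR_STATUSES:
        # The "error" field carries a stable reason code such as
        # "invalid_request" or "file_not_extracted".
        return body.get("error", "unknown_error")
    return "unexpected_status"


def needs_text_extraction(status: int, body: dict) -> bool:
    """True when the operator must trigger text extraction before retrying."""
    return status == 422 and body.get("error") == "file_not_extracted"
```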
Semantics
- Atomic per call. The catalogue write, the relation inserts, and the audit-trail entries all happen inside one IDBConnection::beginTransaction() / commit() / rollBack() window. Either everything in the call lands or nothing does.
- Idempotent on retry. Each match position is probed via EntityRelationMapper::existsForFileAtPosition($fileId, $entityId, $chunkId, $positionStart, $positionEnd) before insert. Already-present rows bump matchesSkipped and don't insert. Re-running the call for the same value on the same file is a no-op DB-wise (the audit-trail entry is still written so operator intent is recorded).
- Catalogue lookup-or-create. GdprEntityMapper::findOneByValueAndType(value, type) resolves an existing row; otherwise a fresh GdprEntity is inserted. entity.reused = (existing !== null) in the response signals which path was taken. Match flags do not key the catalogue lookup: the catalogue row is the canonical truth for that (value, type) and is shared across all files that reference it.
- Chunk-aware matching. OR's text extractor splits long files into 1000-character chunks with 200-character overlap. The matcher runs preg_match_all per chunk, computes absolute positions as chunk.startOffset + chunkRelativeOffset, and dedups by absolute (start, end) across chunks. When the same match appears in two chunks' overlap region, the entry from the lower chunkIndex wins, so re-runs select the same canonical chunk (idempotency precondition).
- Match flag defaults. Both wholeWord and caseSensitive default to true. The defaults reflect "operator means exactly what they typed"; loosen them only when the use case demands it.
- Value-too-long. Values longer than the chunk overlap (200 chars by default) are rejected with regex_compile_failure because they cannot reliably be matched per-chunk. The error message never contains the value (ADR-005).
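The chunk-aware matching and dedup rules above can be sketched as follows. This is a Python approximation, not the actual PHP matcher: the Chunk dataclass, the function names, and the use of lookarounds for whole-word matching are assumptions made for illustration.

```python
import re
from dataclasses import dataclass

CHUNK_OVERLAP = 200  # documented default overlap, in characters


@dataclass(frozen=True)
class Chunk:
    chunk_id: int      # database id of the chunk (hypothetical field name)
    chunk_index: int   # position of the chunk within the file
    start_offset: int  # absolute offset of the chunk's first character
    text: str


def build_pattern(value, whole_word=True, case_sensitive=True):
    """Compile an exact-string pattern for the operator-supplied value."""
    if len(value) > CHUNK_OVERLAP:
        # Too long to be matched reliably per chunk; the message never
        # includes the value itself (ADR-005).
        raise ValueError("regex_compile_failure")
    body = re.escape(value)
    if whole_word:
        # Assumption: "whole word" means no word character on either side.
        body = r"(?<!\w)" + body + r"(?!\w)"
    return re.compile(body, 0 if case_sensitive else re.IGNORECASE)


def find_matches(chunks, value, whole_word=True, case_sensitive=True):
    """Return (abs_start, abs_end, chunk_id) tuples, deduplicated across chunks."""
    pattern = build_pattern(value, whole_word, case_sensitive)
    winners = {}
    # Visiting chunks in chunkIndex order means the first writer for a given
    # absolute span is the lower-indexed chunk, which therefore wins.
    for chunk in sorted(chunks, key=lambda c: c.chunk_index):
        for m in pattern.finditer(chunk.text):
            span = (chunk.start_offset + m.start(), chunk.start_offset + m.end())
            winners.setdefault(span, chunk.chunk_id)
    return [(start, end, cid) for (start, end), cid in sorted(winners.items())]
```

Because the winner for an overlapping span is chosen deterministically, a second run over the same chunks yields the same (chunk, start, end) triples, which is what lets the existence probe skip them.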
Audit trail
A successful call writes up to two audit-trail action types:
- entity_create: written ONLY when a new catalogue row was inserted:
    action = "entity_create"
    user = acting user UID (NEVER the display name; ADR-005)
    created = now (UTC)
    changed.subjectType = "openregister_entities"
    changed.subjectId = <gdpr_entity id>
    changed.fields = { "value": <value>, "type": <type>, "category": <derived> }
  ADR-022 forensic exception: value IS allowed in the audit payload (and only here; never in HTTP logs or error responses).
- entity_relations_batch_create: written on EVERY call, even on zero-match calls:
    action = "entity_relations_batch_create"
    user = acting user UID
    created = now (UTC)
    changed.subjectType = "openregister_files"
    changed.subjectId = <fileId>
    changed.fields = {
      "value": <value>,
      "type": <type>,
      "fileId": <fileId>,
      "detectionMethod": "manual",
      "matchCount": <matchCount>,
      "matchesSkipped": <matchesSkipped>,
      "relationIds": [<id>, <id>, ...] // only the inserted ones
    }
  Zero-match calls are still audited so operator intent is recorded even when the value isn't found in the file (e.g. a typo).
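To make the two payloads concrete, the sketch below assembles them as plain dictionaries. It follows the field names listed above; the function name build_audit_entries, its parameters, and the nesting of changed as a dictionary are assumptions, not the actual OpenRegister audit API.

```python
from datetime import datetime, timezone


def build_audit_entries(actor_uid, file_id, entity_id, value, entity_type,
                        category, created_new, relation_ids,
                        match_count, matches_skipped):
    """Return the audit entries one call would write (hypothetical helper)."""
    now = datetime.now(timezone.utc).isoformat()
    entries = []
    if created_new:
        # Written ONLY when a new catalogue row was inserted.
        entries.append({
            "action": "entity_create",
            "user": actor_uid,  # UID, never the display name (ADR-005)
            "created": now,
            "changed": {
                "subjectType": "openregister_entities",
                "subjectId": entity_id,
                "fields": {"value": value, "type": entity_type,
                           "category": category},
            },
        })
    # Written on EVERY call, including zero-match calls.
    entries.append({
        "action": "entity_relations_batch_create",
        "user": actor_uid,
        "created": now,
        "changed": {
            "subjectType": "openregister_files",
            "subjectId": file_id,
            "fields": {
                "value": value,
                "type": entity_type,
                "fileId": file_id,
                "detectionMethod": "manual",
                "matchCount": match_count,
                "matchesSkipped": matches_skipped,
                "relationIds": list(relation_ids),  # only the inserted ones
            },
        },
    })
    return entries
```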
Anonymise-flow interaction
Manual-method relations carry detectionMethod = "manual" but are otherwise identical to detector-produced rows. The downstream pass picks them up unchanged:
- POST /api/files/{fileId}/anonymize: EntityRelationMapper::findEntitiesForAnonymization does NOT filter on detection_method; all relations on the file (minus skip_anonymization = true ones) are included. Manual-method rows are anonymised on the next run.
- Caveat: value-keyed substitution. OR's DocumentProcessingHandler::anonymizeDocument collapses multiple distinct catalogue entries with the same value into one substitution token. If an operator manually adds a value that already exists in the catalogue (under a different type), the auto-detected and manually added occurrences will all map to the same placeholder. This is a feature, not a bug: operators don't see two placeholders for "Jan Jansen" just because one occurrence was auto-detected and another was operator-flagged.
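The value-keyed collapse described in the caveat can be illustrated with a toy substitution map. This is only a sketch of the idea: the function name and the placeholder format are invented for the example and do not reflect what DocumentProcessingHandler::anonymizeDocument actually emits.

```python
def build_substitution_map(entities):
    """Assign one placeholder per distinct value, regardless of entity type.

    `entities` is an iterable of dicts with "value" and "type" keys.
    The "[REDACTED_n]" format is hypothetical.
    """
    placeholders = {}
    for entity in entities:
        value = entity["value"]
        if value not in placeholders:
            # First time this value is seen: mint a new placeholder.
            placeholders[value] = f"[REDACTED_{len(placeholders) + 1}]"
        # Later entries with the same value reuse the existing placeholder.
    return placeholders


def anonymise(text, placeholders):
    """Replace every occurrence of each value with its placeholder."""
    # Longest values first so a short value never clobbers part of a longer one.
    for value in sorted(placeholders, key=len, reverse=True):
        text = text.replace(value, placeholders[value])
    return text
```

With a detector-produced PERSON entry and a manually added entry of another type that share the value "Jan Jansen", the map contains a single key, so every occurrence ends up behind the same placeholder.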
PII redaction (ADR-005)
- Request log: the controller logs valueLength only, never value. Permitted log payload: fileId, type, wholeWord, caseSensitive, valueLength, actor (UID; the UID is not PII per ADR-005).
- Error responses: never include the operator-supplied value. Reason codes are stable strings; messages are operator-readable but PII-clean.
- Logger warnings on the catalogue dedup-invariant violation: GdprEntityMapper::findOneByValueAndType logs the type and the colliding ids if two rows match the same (value, type), but NOT the value itself (the value can be re-derived from the row in the catalogue audit log if a forensic step is needed).
- Audit trail (entity_create, entity_relations_batch_create): explicit forensic exception per ADR-022. value IS persisted here.
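The request-log rule above amounts to logging the length of the value instead of the value. A minimal sketch, assuming a hypothetical helper name and that valueLength counts characters rather than bytes:

```python
def build_request_log_context(file_id, entity_type, value,
                              whole_word, case_sensitive, actor_uid):
    """Build a PII-clean log context containing only the permitted fields."""
    return {
        "fileId": file_id,
        "type": entity_type,
        "wholeWord": whole_word,
        "caseSensitive": case_sensitive,
        # Only the length is logged; the value itself never is.
        "valueLength": len(value),
        "actor": actor_uid,  # UID, which is not PII per ADR-005
    }
```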
RBAC
Same model as the rest of OR's file-bound write endpoints:
- The file MUST resolve through the actor's IRootFolder::getUserFolder($uid)->getById($fileId) lookup (i.e. the file must be visible to the actor).
- The resolved node MUST be a file (not a folder) and isUpdateable() MUST return true.
Either check failing produces a 403 with { "error": "forbidden", "reason": "write access to file required" }. There is no oracle between "file does not exist" and "file is not writable" — both produce 403/422 depending on the path (file-not-extracted → 422, write-denied → 403). This is the spec's no-oracle rule.
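The two RBAC checks can be modelled as a single gate that returns either nothing (access granted) or the fixed 403 payload. The sketch below is a Python stand-in for the PHP checks: the Node dataclass, the function name, and the choice to inspect only the first node returned by the lookup are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Node:
    is_file: bool      # False for folders
    updateable: bool   # stand-in for isUpdateable()


FORBIDDEN = (403, {"error": "forbidden",
                   "reason": "write access to file required"})


def check_write_access(nodes: Sequence[Node]) -> Optional[Tuple[int, dict]]:
    """Return None when the actor may write, otherwise the 403 response.

    `nodes` stands for the result of the user-folder lookup by file id;
    it is empty when the file is not visible to the actor.
    """
    if not nodes:
        return FORBIDDEN   # not visible to the actor
    node = nodes[0]        # assumption: the first resolved node is used
    if not node.is_file or not node.updateable:
        return FORBIDDEN   # a folder, or not writable
    return None
```

Returning the identical payload for the invisible, folder, and read-only cases keeps the response from revealing which check failed.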
Spec references
- Capability: openspec/changes/manual-entity-anonymisation/specs/entity-relation-grondslagen/spec.md
- Design (matcher algorithm, audit, RBAC, idempotency invariants): openspec/changes/manual-entity-anonymisation/design.md
- Tracking issue: #1593
- ADR-005 (no PII in logs / error responses; UID not display name in audit payloads)
- ADR-022 (audit-trail for OR-owned mutations; forensic exception for value in audit payload only)
- Related: Entity-Relation Decision Metadata, covering operator decisions (bases, skipAnonymization) on individual relations.