Office Document Sanitisation
Overview
When OpenRegister anonymises an Office document (.docx or .odt), the entity walker can only reach the visible text runs. Office documents also carry personally identifying information (PII) in non-text structures that the walker cannot see: reviewer comments, tracked-change author attributes, document metadata, person-identifying field codes, custom XML data bindings, and hyperlink URLs. These structures regularly leak names and case numbers into published Woo/dossier documents.
The document sanitiser runs ahead of the anonymisation walker. It produces a cleaned derivative of the input by performing XML-level surgery on the ZIP container; the cleaned file is then passed to the walker for entity detection and replacement. The original file is never modified.
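The copy-and-rewrite approach described above can be illustrated with a minimal Python sketch. This is not OpenRegister's implementation: the function names `sanitise_docx` and `scrub_part` are hypothetical, only .docx is handled, and only two of the listed leak sources are covered (tracked-change and comment author attributes, and the creator fields in `docProps/core.xml`). The byte-level patterns assume the conventional `w:`, `dc:` and `cp:` prefixes and double-quoted attributes; a real sanitiser would more likely use a namespace-aware XML parser.

```python
import re
import zipfile
from pathlib import Path

# Hypothetical patterns for illustration. They assume the conventional
# "w:", "dc:" and "cp:" prefixes and double-quoted attribute values.
AUTHOR_ATTR = re.compile(rb'(\bw:(?:author|initials)=")[^"]*(")')
CORE_FIELD = re.compile(rb"(<(dc:creator|cp:lastModifiedBy)>)[^<]*(</\2>)")


def scrub_part(name: str, data: bytes) -> bytes:
    """Blank author names in known XML parts; pass other parts through."""
    if name == "docProps/core.xml":
        # Keep the element, drop its text content.
        return CORE_FIELD.sub(rb"\1\3", data)
    if name.startswith("word/") and name.endswith(".xml"):
        # Keep the attribute, empty its value.
        return AUTHOR_ATTR.sub(rb"\1\2", data)
    return data


def sanitise_docx(source: Path, target: Path) -> None:
    """Write a cleaned copy of `source` to `target`.

    The source archive is opened read-only, so the original file
    is never modified.
    """
    with zipfile.ZipFile(source, "r") as zin, \
         zipfile.ZipFile(target, "w") as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            # Reusing the ZipInfo keeps entry order, names and compression.
            zout.writestr(info, scrub_part(info.filename, data))
```

A caller would run something like `sanitise_docx(Path("input.docx"), Path("input.cleaned.docx"))` and hand the second path to the anonymisation walker. Blanking values rather than deleting parts keeps the package relationships intact, which is why the sketch does not simply drop entries from the ZIP.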