Rudresh Panchal's blog

Final Report: GDPR Anonymization Tool

GitHub Page

Installation Guide

Usage Guide

Live Demo


What is text anonymization?

Text anonymization refers to processing text to strip out attributes/identifiers, hiding sensitive details and protecting the identity of the people mentioned.

Architecture

This system consists of two main components.

Sensitive Attribute Detection System

Before text can be anonymized, the sensitive attributes that give away information have to be identified. We use two methods for this, which can be used in tandem or as standalone systems:

  1. Named Entity Recognition Based Detection: This relies on tagging sensitive entities in the text. The user can set up different configurations for different entities, which determine how the given entity is anonymized. The available options are Deletion/Replacement, Suppression and Generalization. The system currently ships with spaCy's NER system, but it can easily be switched out for other NER models (a short sketch of both detection paths follows this list).

  2. TF-IDF Based Rare Entity Detection: Certain sensitive attributes in text might not necessarily be tagged by the NER system. These sensitive tokens can be identified by the TF-IDF system: term frequency–inverse document frequency scores flag possible rare entities based on the distribution and occurrence of tokens across sample text snippets supplied by the user. Once the TF-IDF score threshold is set, tokens with scores above it are treated as sensitive and anonymized.
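
A rough sketch of the two detection paths in Python (illustrative only, not the project's actual code). It assumes spaCy with the en_core_web_sm model downloaded and scikit-learn installed; the function names and the threshold value are placeholders of mine.

    # Minimal sketch of NER-based and TF-IDF-based detection (illustrative).
    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer

    nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

    def ner_detect(text):
        """Return (entity text, label) pairs tagged by spaCy's NER."""
        return [(ent.text, ent.label_) for ent in nlp(text).ents]

    def tfidf_detect(snippets, target, threshold=0.4):
        """Flag tokens in `target` whose TF-IDF score exceeds the (illustrative) threshold.

        `snippets` are the user-supplied sample texts; tokens that are rare
        across the corpus end up with high scores in `target`.
        """
        corpus = snippets + [target]
        vectorizer = TfidfVectorizer()
        scores = vectorizer.fit_transform(corpus).toarray()[-1]
        vocab = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0
        return [tok for tok, score in zip(vocab, scores) if score >= threshold]

    print(ner_detect("My name is John Doe and I live in Beijing."))
    # e.g. [('John Doe', 'PERSON'), ('Beijing', 'GPE')], depending on the model

    samples = ["the patient was discharged", "the patient recovered well"]
    print(tfidf_detect(samples, "the patient was seen by doctor zhivago"))
    # rare tokens such as 'zhivago' score above the threshold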

Sensitive Attribute Anonymization System

Once the sensitive attributes/tokens are detected, they need to be anonymized depending on the kind of token they are. The user can set different anonymization actions for different tokens. The currently available options are:

  1. Deletion/Replacement: To be used in cases where retaining even a part of the attribute via the other anonymization methods is not appropriate. Completely replaces the attribute with a pre-set replacement. Example: "My name is John Doe" would be replaced by "My name is <Name>".

  2. Suppression: To be used when hiding a part of the information is enough to protect the user's anonymity. The user can supply the percentage or the number of characters they want suppressed. Example: "My phone number is 9876543210" would be replaced by "My phone number is 98765*****" if the user chooses 50% suppression (see the sketch after this list).

  3. Generalization: To be used when the entity is sensitive enough to need anonymization but can still be partially retained to provide information. The system has two methods of carrying out generalization:

    • Word Vector Based: In this option, the nearest neighbour of the word in the vector space is used to generalize the attribute. Example: "I live in India" gets generalized to "I live in Pakistan". This method, while completely changing the word, largely retains vector-space information that is useful in most NLP and text-processing tasks.

    • Part Holonym Based: In this option, the system parses the WordNet lexical database to extract part holonyms. This method works exceptionally well with geographical entities, and the user can choose the level of generalization. Example: "I live in Beijing" gets generalized to "I live in China" at level 1 generalization and to "I live in Asia" at level 2.
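
To make the suppression and part-holonym options concrete, here is a minimal sketch. It assumes NLTK with the WordNet corpus downloaded; the function names are illustrative, and the first WordNet sense is picked naively, whereas real text would need sense disambiguation.

    # Illustrative suppression and part-holonym generalization
    # (assumes: pip install nltk, then nltk.download("wordnet")).
    from nltk.corpus import wordnet as wn

    def suppress(value, percent=50):
        """Keep the leading characters and star out roughly `percent` of the value."""
        hidden = int(len(value) * percent / 100)
        return value[:len(value) - hidden] + "*" * hidden

    def generalize_holonym(word, level=1):
        """Climb `level` part-holonym steps up WordNet, e.g. Beijing -> China -> Asia."""
        synsets = wn.synsets(word)
        if not synsets:
            return word  # unknown to WordNet: leave the token untouched
        synset = synsets[0]  # naive sense choice; real text needs disambiguation
        for _ in range(level):
            holonyms = synset.part_holonyms()
            if not holonyms:
                break
            synset = holonyms[0]
        return synset.lemma_names()[0].replace("_", " ")

    print(suppress("9876543210", 50))        # 98765*****
    print(generalize_holonym("Beijing", 1))  # expected: China
    print(generalize_holonym("Beijing", 2))  # expected: Asia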

Phase 2 - Report: Rudresh Panchal

This post reflects upon some of the milestones achieved in GSoC 2018's Phase two.

Phase 2 mainly concentrated on expanding the rare entity detection pipeline, adding the generalization features and making the system more accessible. The following features were successfully implemented:

  • Built a custom TF-IDF system to help recognize rare entities. The TF-IDF system saves the intermediate token counts, so that whenever a new document/text snippet is added to the knowledge base, the TF-IDF scores for all the tokens do not have to be recalculated from scratch: the stored counts are loaded, incremented and the relevant scores calculated.

  • Implemented the "Part Holonym" based generalization feature. This feature relies on lexical databases like Wordnet to extract part holonyms. This generalizes tokens to their lexical supersets. For example: London gets generalized to England, Beijing to China at level one generalization and to Europe and Asia Respectively for level two generalization. The user is given the option of choosing the generalization level for each attribute.

  • Implemented the "Word Vector" based generalization feature. This maps the nearest vector space neighbour of a token in pretrained embeddings like GLoVE and replaces it with the same. For example: India gets replaced with Pakistan.

  • Implemented a general anonymization RESTful API. This gives people the option to use our system across different tech stacks (an example call appears after this list).

  • Implemented a token-level RESTful API. This endpoint returns token-level information, including the original token, the replacement token, the entity type and the anonymization type.

  • The API uses Django's token-based authentication system. Implemented a dashboard for users to manage their authentication tokens.
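
As a small illustration of the word-vector generalization, the sketch below uses gensim's downloadable GloVe vectors; the tooling and model name are my own assumptions, not the project's exact implementation.

    # Illustrative nearest-neighbour generalization with pre-trained GloVe vectors
    # (assumes gensim is installed; the model downloads on first use).
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")

    def generalize_vector(word):
        """Replace a token with its nearest neighbour in the embedding space."""
        if word.lower() not in vectors:
            return word  # out-of-vocabulary tokens are left as-is
        neighbour, _score = vectors.most_similar(word.lower(), topn=1)[0]
        return neighbour

    print(generalize_vector("India"))  # typically a closely related country, e.g. Pakistan

And a hypothetical client call against the RESTful API: the endpoint path and payload keys below are placeholders I made up; the "Authorization: Token ..." header shown assumes Django REST Framework-style token authentication.

    # Hypothetical API call; the URL and payload keys are placeholders.
    import requests

    response = requests.post(
        "https://example.com/api/anonymize/",                 # placeholder endpoint
        json={"text": "My name is John Doe."},                # placeholder payload
        headers={"Authorization": "Token <your-api-token>"},  # token auth header
    )
    print(response.json())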

Some of the major things planned for the 3rd and final phase are:

  • Code cleanup: As the project progressed, some of the code became redundant and needs to be removed.

  • Documentation: While the code is well commented and easy to understand, the project currently lacks thorough external documentation. A quick usage guide for non-programmer end users could also be helpful.

  • A simple scaffolding system for the user. The system currently ships without any predefined configurations (entities, aliases, etc.). Script(s) that can quickly set up a ready-to-use system with sensible defaults (pre-defined attribute actions, threshold values, etc.) would be useful.

  • GUI based and API based file upload system. The user currently has to paste plaintext in the GUI or pass it as a parameter in the API. The option to directly upload text files would increase user convenience.

  • Experiment with language localization. The system currently works well with the English language, but it needs to be tried out with other languages.

Picture 1: The token-level API in action

Phase 1 - Report: Rudresh Panchal

With the first coding phase of GSoC 2018 coming to an end, this post reflects upon some of the milestones achieved in the past month.

I first worked on finalizing the architecture of the text anonymization system. The system is being built with the European Union's General Data Protection Regulation (GDPR) in mind and seeks to offer a seamless solution to a company's text anonymization needs. Many existing GDPR solutions focus mainly on anonymizing database entries, not plain-text snippets.

My system pipeline consists of two principal components.

  1. Entity Recognition: In this part, the entity is recognized using various approaches, including Named Entity Recognition (implemented), Regular Expression based patterns (implemented) and TF-IDF based scores (to be implemented in Phase 2).

  2. Subsequent action: Once the entity is recognized, the system looks up the configuration mapped to that particular attribute and carries out one of the following actions to anonymize the data (a small sketch of this dispatch follows below):

    • Suppression (implemented)

    • Deletion/Replacement (implemented)

    • Generalization (to be implemented in Phase 2)

The methods to generalize the attribute include a novel word vector based generalization and extraction of part holonyms.
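
To make the lookup-and-dispatch step concrete, here is a minimal sketch; the mapping and helper functions are illustrative stand-ins for the project's DB-backed configuration.

    # Illustrative dispatch: look up the configured action for an entity type and apply it.

    def delete_replace(token, label):
        return f"<{label.title()}>"

    def suppress(token, label, percent=50):
        hidden = int(len(token) * percent / 100)
        return token[:len(token) - hidden] + "*" * hidden

    # Placeholder for the per-attribute configuration stored in the database.
    CONFIG = {
        "PERSON": delete_replace,
        "PHONE": suppress,
    }

    def anonymize_entity(token, label):
        """Apply whichever action the user configured for this entity type."""
        action = CONFIG.get(label)
        if action is None:
            return token  # no configuration: leave the token unchanged
        return action(token, label)

    print(anonymize_entity("John Doe", "PERSON"))   # <Person>
    print(anonymize_entity("9876543210", "PHONE"))  # 98765*****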

Some of the coding milestones achieved include:

  • Setup the coding environment for the development phase.

  • Setup the Django Web App and the Database.

  • Wrote a function to carry out the text pre-processing, including removal of illegal characters, tokenization, expansion of contractions etc.

  • Wrote and integrated wrappers for the Stanford NER system. Wrote the entity replacement function for it.

  • Wrote and integrated wrappers for the Spacy NER system. Wrote the entity replacement function for this too.

  • Wrote the suppression and deletion functions. Integrated the two with a DB lookup for configurations.

  • Wrote the Regular Expression based pattern search function (a short sketch follows this list).

  • Implemented the backend and frontend of the entire Django WebApp.
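
As an illustration of the regex-based pattern search, the sketch below matches user-defined patterns and replaces them with labels; the specific patterns and labels are examples, not the project's stored configuration.

    # Illustrative regex pattern search and replacement.
    import re

    # Example patterns; the real system stores user-defined patterns in its configuration.
    PATTERNS = {
        "Email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
        "Phone": re.compile(r"\b\d{10}\b"),
    }

    def regex_replace(text):
        """Replace every match of each configured pattern with its label."""
        for label, pattern in PATTERNS.items():
            text = pattern.sub(f"<{label}>", text)
        return text

    print(regex_replace("Reach me at john.doe@example.com or 9876543210."))
    # Reach me at <Email> or <Phone>.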

Major things planned for Phase 2:

  • Implement a dual TF-IDF system: one that gives scores based on the documents the user has uploaded, and one that gives scores based on TF-IDF trained on a larger, external corpus.

  • Implement a word vector closest neighbor based generalization.

  • Implement the holonym lookup and extraction functions.

Picture 1: The user dashboard, which allows users to add new attribute configurations, modify existing configurations, add aliases for the NER lookup, add regex patterns and carry out text anonymization.

Picture 2: The text anonymization system in action. The various entities and regex patterns were recognized and replaced as per the configuration.