Data Sanitization

Sanitization is the process of removing sensitive information from documents so that only the intended information can be accessed from them. Classified information in a document is masked or blacked out before the document is passed on to a larger audience, so that it does not divulge anything it should not, such as personal details, financial details or emblems. Done manually, sanitization is labour-intensive and time-consuming, and it carries the risk that some highly sensitive information goes unredacted because of human error.


As documents come in different formats, including PDFs and images, advanced Computer Vision techniques can be combined with state-of-the-art Natural Language Processing techniques, such as Transformers, to precisely identify and mask sensitive information in thousands of documents quickly and with accuracy much higher than humans can achieve.
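
The identify-and-mask step described above can be illustrated with a minimal sketch. It is not the Computer Vision or Transformer pipeline itself: the detector here is a pair of hand-written regular expressions standing in for a trained model, and the names `PATTERNS`, `find_sensitive_spans` and `mask` are hypothetical, chosen only for this example. It assumes the text has already been extracted from the PDF or image.

```python
import re

# Assumption: these two patterns are illustrative stand-ins for a trained
# entity-recognition model; they are not an exhaustive list of sensitive data.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def find_sensitive_spans(text):
    """Return (start, end, label) tuples for every pattern match, in order."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def mask(text, spans, fill="#"):
    """Overwrite each detected span with fill characters of the same length."""
    chars = list(text)
    for start, end, _label in spans:
        chars[start:end] = fill * (end - start)
    return "".join(chars)


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or pay with 4111 1111 1111 1111."
    print(mask(sample, find_sensitive_spans(sample)))
```

Keeping detection and masking as separate functions means the regular expressions could later be swapped for a model that returns the same (start, end, label) spans, without changing the masking code.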