DataMapper Technical Whitepaper

Introduction

DataMapper is a digital archivist that can find and flag sensitive and personal data throughout company systems, helping you comply with privacy regulations like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) that make you responsible for keeping track of all the personal data you store.

What does DataMapper provide?

Knowing where company documents are stored and who has access to them is essential to preventing a data breach, responding to individual data requests, and communicating effectively between departments day to day.

Different members of your team may be storing company information in different locations online and on their own computers. Each employee in your business may spend roughly 3.5 hours on average each week searching for documents. 20% are either misplaced or completely lost. Manually organizing documents scattered across mailboxes, servers, local drives, cloud storage, and USB drives is an enormous task with significant potential for human error, but an automated system makes it simple.

DataMapper reduces the risk of improper data handling by giving you a complete overview of files and documents from all users on a single dashboard. It organizes your documents intuitively based on content and context, making it easy to navigate all documents across an organization. Whole teams can join DataMapper so that all company information can be gathered and mapped for overall GDPR risk.

DataMapper categorizes personal data found by its level of sensitivity. Several Articles from the GDPR set out a hierarchy of data sensitivity. Examples of risk classification in ascending order are shown in the graphic below:

DataMapper’s automatic identification and classification of critical data make it easy to review GDPR risks. It will also help you when creating and mapping data flows. Make sure essential documents are stored in the right place under the right name. Since DataMapper shows you the risk level of each document along with its location and who is storing it, you can assess and adapt data-storage protocols to mitigate risk and keep track of compliance.

The clear overview of your documents and quick data access DataMapper provides can:

Make it easy to spot redundant data and decide what needs to be deleted.
Help you identify ways to streamline and simplify your company’s data processes.

In addition to the increased data security, easy GDPR compliance and time-saving value that its automated document management allows, DataMapper's 'bird's eye view' of company documents has other benefits to a company's bottom line. When you revisit data that was buried in an old file or floating aimlessly in cloud storage, you quickly discover any incomplete contracts, overlooked sales leads, unused subscriptions, superfluous systems and suppliers, and other information that would otherwise slip through the cracks. Make sure you are taking advantage of all the valuable information your company has collected, and eliminate the rest.

Architecture and integrations

Backend

DataMapper is developed almost exclusively in Python. Python was chosen as it is currently the market-leading programming language for machine learning (ML). The best performing widely adopted ML libraries are written for/in Python. We use the following key packages for our ML: scikit-learn, NLTK, spaCy, NumPy, SciPy, and to some degree Keras, Theano, Gensim, and Pandas.

Our stack also uses the following infrastructure software: MySQL, Redis, Elasticsearch, Flask, and Celery.

DataMapper runs as microservices on Kubernetes, deployed on Azure. Some activities (e.g., OCR) are handled by external APIs, such as Microsoft Cognitive Services.

Client

The DataMapper client is developed in HTML/CSS/JS with the Vue.js framework and deployed as native applications for Windows and macOS using Electron.

The client needs an internet connection to function and communicate with the DataMapper API.

Platform

Document formats

DataMapper can extract text from these formats:

Documents: txt, rtf, pdf, doc, docx, gdoc, odt (JPG)

Presentations: pptx, ppt

Spreadsheets: xls, xlsx, gsheet

Hypertext: html, htm

Archives: pst, zip

Languages

DataMapper understands documents written in English and Danish. More languages will be added in August.

Document classification

DataMapper comes with two main pre-trained document classification models: English and Danish. They classify documents into 80 document types, divided into main classes:

Contracts
Governance
Real Estate
Presentations
Intellectual property
Financial documents
Human Resources (including identification papers such as passport and driver's license)
Certificates
Others, and more.

Some classes have many subclasses, others have few. We have 24 subclasses with an F1 score above 0.9. (The F1 score is the harmonic mean of precision and recall; it is related to, but not the same as, accuracy.)
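For reference, the F1 score can be computed from raw prediction counts as sketched below. The function name and the counts in the usage line are made up for illustration; they are not DataMapper results.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1 = 2 * precision * recall / (precision + recall)."""
    if true_positives == 0:
        # With no correct predictions, F1 is conventionally reported as 0.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one subclass: 90 correct, 5 false alarms, 10 misses.
example = f1_score(90, 5, 10)
```

With these illustrative counts, precision is 90/95 and recall is 90/100, so the F1 score lies between the two.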

DataMapper classifies documents by language detection, text cleaning, feature extraction (bag of words), main category classification (SVM) and leaf category classification (SVM). Some specific business logic is programmed into this pipeline, based on classifications.
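The two-stage structure of this pipeline (main category first, then leaf category) can be sketched as follows. This is only an illustration under stated assumptions: simple keyword scorers stand in for the trained SVM models, the language-detection step (which would select the English or Danish models) is left out, and all function names, keywords, and leaf labels are hypothetical.

```python
import re
from collections import Counter
from typing import Callable, Dict, Optional, Set, Tuple

Classifier = Callable[[Counter], str]


def clean_text(text: str) -> str:
    """Lowercase the text and replace runs of non-word characters with a space."""
    return re.sub(r"[^\w]+", " ", text.lower()).strip()


def bag_of_words(text: str) -> Counter:
    """Feature extraction: count how often each word occurs."""
    return Counter(clean_text(text).split())


def keyword_classifier(keywords: Dict[str, Set[str]]) -> Classifier:
    """Placeholder for a trained SVM: picks the label whose keywords occur most."""
    def classify(features: Counter) -> str:
        scores = {
            label: sum(features[word] for word in words)
            for label, words in keywords.items()
        }
        return max(scores, key=scores.get)
    return classify


def classify_document(
    text: str,
    main_model: Classifier,
    leaf_models: Dict[str, Classifier],
) -> Tuple[str, Optional[str]]:
    """Run the main-category model, then the leaf model for that category."""
    features = bag_of_words(text)
    main_category = main_model(features)
    leaf_model = leaf_models.get(main_category)
    leaf_category = leaf_model(features) if leaf_model else None
    return main_category, leaf_category


# Hypothetical models with made-up keywords and leaf labels.
main_model = keyword_classifier({
    "Contracts": {"agreement", "party", "lease", "employment"},
    "Financial documents": {"invoice", "balance", "vat"},
})
leaf_models = {
    "Contracts": keyword_classifier({
        "lease contract": {"lease", "tenant", "rent"},
        "employment contract": {"employment", "salary", "employee"},
    }),
}
result = classify_document(
    "Lease Agreement between the tenant and the landlord. Rent is due monthly.",
    main_model,
    leaf_models,
)
```

Splitting the decision in two keeps each leaf model small, because it only has to separate the subclasses of a single main class.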

Integrations

Integrations are built for:

Local drives
Network drives (Windows file share)
Dropbox
Google Drive
Microsoft Exchange
Google Mail
SharePoint
OneDrive

API

The API supports a variety of methods. The main entries are:

admin → for creating companies, adding users to companies and managing user roles

document → for document classifications, setting a document's auth level, finding similar documents and reading document metadata

file → for fetching file data from a document, for getting a pdf version of any document and for searching a company's documents

integration → for creating new integrations and for updating integrations

user → for creating and editing user information; and for inviting new users

Client

The client communicates with the API over TLS 1.2 to provide functionality. It also connects to cloud storage providers directly to fetch folder names so that a user can select specific folders (e.g., for Exchange, Dropbox, ...). It can also scan selected local folders and network drives.

The client can show files (in a pdf viewer) and it can download all/some of your files (in a hierarchical or a flat structure). In the client, you can organize a team's roles, and define who can edit/see which classes of files. Specific files can also be set to "OK" or "critical" for flagging importance. Files may also be set to "private", such that other users from the same company cannot see them. In the client, a document's classification can also be set, if DataMapper did not detect it automatically (low confidence). The client also features a variety of filters and search functionality for finding documents. It also has an overall folder organization such that a user can browse files.

Flow

See the figure below. Data is hosted on Microsoft Azure's blob storage. The connection between the DataMapper API (Python Flask) and the DataMapper client is encrypted. The backend fetches data from cloud storage providers directly (except when scanning local and network drives, where data is uploaded from the client). Text is extracted from documents, stored in MySQL, and made searchable using Elasticsearch (ES). ES is also used for GDPR queries.

Local and network scans:

DataMapper receives a temporary credential for the company's file storage location.
DataMapper uploads files directly to file storage. Files are protected by TLS 1.2 in transit, using the standard Microsoft Azure implementation.
The backend service processes the file and removes it from file storage. It retains the OCR-ed version of the file and its original location path (e.g. C:\Documents\a) in the database.

Exchange scans:

A user authenticates their Exchange account in DataMapper and selects folders they want to scan. The credentials are stored encrypted in the database.
We use the following library: https://pypi.org/project/exchangelib/ which contacts Exchange web services to reach the user’s inbox (https://docs.microsoft.com/en-us/exchange/client-developer/exchange-web-services/start-using-web-services-in-exchange)
The backend service transforms emails into files and also extracts attachments. It retains the OCR-ed version of each file in the database, together with its original location, which is an ID from Exchange Web Services.

Security

User controls access

The user chooses which files DataMapper can access and retains full control to manage data access over time.

User authentication

The verified creator of an account is given admin status and is the only one who can invite users to that team and the only one who can view a complete dashboard of all results. Users are identified by an administrator’s invite and a dedicated sign-up flow ensuring each user is verified.

Password and access tokens

Password and access tokens are signed with a shared secret signature key, and passwords are hashed with sha256_crypt. Every access to your data is securely logged.
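Signing a token with a shared secret can be sketched with HMAC-SHA256 from the Python standard library. The token layout, the choice of HMAC-SHA256, and the names sign_token and verify_token are assumptions made for illustration; the source does not state which signing scheme or token format DataMapper uses.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64(data: bytes) -> str:
    """URL-safe base64 text; its alphabet never contains the '.' separator."""
    return base64.urlsafe_b64encode(data).decode("ascii")


def _signature(body: str, secret: bytes) -> str:
    return _b64(hmac.new(secret, body.encode("ascii"), hashlib.sha256).digest())


def sign_token(payload: dict, secret: bytes) -> str:
    """Serialize the payload and append an HMAC-SHA256 signature."""
    body = _b64(json.dumps(payload, sort_keys=True).encode("utf-8"))
    return f"{body}.{_signature(body, secret)}"


def verify_token(token: str, secret: bytes) -> Optional[dict]:
    """Return the payload if the signature matches, otherwise None."""
    try:
        body, signature = token.split(".")
    except ValueError:
        return None  # not exactly two dot-separated parts
    expected = _signature(body, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("ascii")):
        return None
    return json.loads(base64.urlsafe_b64decode(body))


# Hypothetical secret and payload, for illustration only.
secret = b"demo-secret-not-for-production"
payload = {"user": "alice@example.com", "role": "admin"}
token = sign_token(payload, secret)
```

Anyone holding the shared secret can both create and check tokens, so in a setup like this the secret would have to stay on the server side.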

Network and access

To prevent man-in-the-middle attacks, all our servers are certified with X.509 certificates provided by WebTrust certified certificate authorities. All your data is hosted on trusted third-party services (e.g., Azure) that use state-of-the-art access control and operate server facilities that are physically guarded.

Data encrypted in transit

Data is sent over HTTPS (TLS 1.2) in transit, and storage access uses shared access signatures.

Data encrypted at rest

Data is kept in private Azure blob storage, encrypted at rest with Azure-managed 256-bit AES keys.

Compliance with the Danish Business Authority’s guidelines

We comply with the Danish government’s Agency on Digitization’s best practice guidelines regarding IT providers. Read our answers to their 11 questions for IT providers here.

Internal rules for handling your data

We have adopted internal rules on information security that contain instructions and measures to protect your personal data against destruction, loss, alteration, and unauthorized publication, and to prevent unauthorized persons from gaining access to or knowledge of it.

Security Summary

DataMapper logs all access to your Documents.
DataMapper frequently backs up your documents and data.
DataMapper uses the latest encryption standards both when transferring and storing your documents, including backup.
DataMapper can guarantee that your Documents and data do not leave the EU.
DataMapper monitors and keeps all servers up to date with the latest OS and security patches.

Privacy and data storage

Storage

A brief overview of file storage:

Copies of the original files from data locations that a user selects to scan are stored temporarily during processing. These copies are kept only during a scan and removed once they have been processed.
The structure of files is "multi-tenant" in the sense that the location of files is containerized separately per company. For each company, during a scan, the system automatically obtains a separate credential (shared access signature in this case) to access the data.
Only two senior developers at SafeOnline may get access to this data, and that access is protected with multifactor authentication. The current process for access to files requires written consent and a request from two persons at the requesting company. Each access is logged, and documentation can be requested at any time.
TLS 1.2 protects data in transit, and data is encrypted at rest with symmetric AES 256 keys - with standard Microsoft Azure configuration.

A brief overview of database storage:

OCR-ed version of the documents. These versions are persistently stored, thus not removed.
Lists of high-risk and low-risk files, each referencing the OCR-ed document ID and its findings: sensitive numbers, sensitive keywords, and GDPR names (customer and employee names added by the user).
User information such as e-mail and name.
GDPR names added by the user.
Encrypted Exchange credentials per user. Exchange credentials are stored so that the Exchange connector can work, but they are encrypted with Azure Key Vault symmetric keys and are therefore not human-readable.
The structure of the database is single tenant in the sense that all of a company's data is in one place, and we rely on the application code to show each company only its own information.
The database is protected by a Virtual Network infrastructure to ensure by rule that only the processing cluster has access to it and not any third party.
Human access to data is limited to two senior developers at SafeOnline, and only as described in the rule above.
Documentation: please see our data flow schematic above under the "Flow" heading.
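Relying on application code to show a company only its own information typically means that every query is scoped by a company identifier. The sketch below illustrates that idea, assuming that records from different companies sit in the same database. It uses an in-memory SQLite database as a stand-in for MySQL, and the table, column, and function names are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def documents_for_company(conn: sqlite3.Connection, company_id: int) -> List[Tuple]:
    """Return only the rows that belong to the given company."""
    # The company_id filter is always applied and passed as a bound parameter.
    return conn.execute(
        "SELECT id, path, risk FROM documents WHERE company_id = ? ORDER BY id",
        (company_id,),
    ).fetchall()


# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, company_id INTEGER NOT NULL, path TEXT, risk TEXT)"
)
conn.executemany(
    "INSERT INTO documents (company_id, path, risk) VALUES (?, ?, ?)",
    [
        (1, "C:\\Documents\\a", "high"),
        (2, "C:\\Documents\\b", "low"),
        (1, "C:\\Documents\\c", "low"),
    ],
)
company_one = documents_for_company(conn, 1)
```

Routing all reads through one helper like this keeps the company filter in a single place instead of repeating it in every query.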

Usernames and passwords

Usernames are stored in cleartext and passwords are stored as a hash. This means that we do not store the password itself, only a hash that each login attempt is checked against; the check succeeds only when the user enters the correct username and password.
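Storing and checking a password hash can be sketched as follows. To keep the example within the Python standard library, it uses salted PBKDF2-HMAC-SHA256 as a stand-in for the sha256_crypt scheme named in the Security section; the storage format, iteration count, and function names are assumptions for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not DataMapper's setting


def hash_password(password: str) -> str:
    """Return 'salt$digest' in hex; the password itself is never stored."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return f"{salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    salt_hex, digest_hex = stored.split("$")
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), ITERATIONS
    )
    return hmac.compare_digest(candidate, bytes.fromhex(digest_hex))


# Hypothetical password, for illustration only.
stored = hash_password("correct horse battery staple")
```

Because each hash uses a fresh random salt, two users with the same password end up with different stored values.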

During a scan for local or network drive

We first compile a list of the MD5 hashes of the files expected to be uploaded. The backend service is notified to expect these hashes. If the user shuts down the computer or the internet connection drops, the backend asks the client to continue with the remaining hashes the next time the user logs in. We do not monitor activity on the user’s PC.
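The resume logic described above can be sketched as a set difference between expected and received hashes. The class and function names below are hypothetical, and file contents are held in memory for brevity; a real client would more likely hash files from disk in chunks.

```python
import hashlib
from typing import Iterable, Set


def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a file's content."""
    return hashlib.md5(data).hexdigest()


class UploadSession:
    """Backend-side record of which file hashes a scan still expects."""

    def __init__(self, expected_hashes: Iterable[str]) -> None:
        self.expected: Set[str] = set(expected_hashes)
        self.received: Set[str] = set()

    def mark_received(self, file_hash: str) -> None:
        # Hashes that were never announced are ignored.
        if file_hash in self.expected:
            self.received.add(file_hash)

    def remaining(self) -> Set[str]:
        """Hashes the client still has to upload, e.g. after a dropped connection."""
        return self.expected - self.received


# Hypothetical files, for illustration only.
files = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
hashes = {name: md5_of_bytes(content) for name, content in files.items()}

session = UploadSession(hashes.values())
session.mark_received(hashes["a.txt"])  # the connection drops after the first upload

# At the next login, the client is asked to continue with what is left.
to_resume = sorted(name for name, h in hashes.items() if h in session.remaining())
```

Tracking hashes rather than file names means a file is identified by its content, so the backend can tell what is still outstanding without inspecting the user's machine.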

Authentication

Microsoft example:

Users who authenticate with Microsoft are redirected to the Microsoft identity platform (https://docs.microsoft.com/en-us/azure/active-directory/develop/). We receive user consent from Microsoft in the form of an access_token.

Thus, two-factor validation, conditional access, and scope signing are handled between the user and Microsoft. The Microsoft identity platform then passes the user's consent on to us and gives us access only to the data the user has consented to share. This access can be withdrawn by the user or company at any time.

Onboarding

Our onboarding flow will install the software and onboard all invited users. Mass deployment solutions are in progress for future versions.

Scale

DataMapper is hosted on the Microsoft Azure platform and uses Azure's features and functionality for scaling. Its capacity can be increased flexibly based on resource requirements.

Regions for storage: The current data center is in Microsoft Azure in Amsterdam, the Netherlands.
Scale units: The application can scale up on demand when necessary.

Delivery and continuous updates

At Safe Online, we are dedicated to continuously improving based on the needs of users. We are constantly monitoring developments in regulations relevant to privacy, e.g., GDPR and related regulations in countries both inside and outside the EU to ensure the product is compatible with the latest local policies.

Changes and feature updates are deployed first in a staging environment and verified by a closed group of users and testers. Only when internal testing and the group of testers have approved changes and feature updates are these published in the production version. Customers are notified of upcoming updates.

Compatibility

DataMapper is compatible with Windows and macOS.

Help

Watch our how-to videos for help getting started:

Documentation & Support

For more information about DataMapper please see our help center.

Contact

Please contact us with any questions.