DataMapper Technical Whitepaper

Introduction

DataMapper is a digital archivist that can find and flag sensitive and personal data throughout company systems, helping you comply with privacy regulations like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) that make you responsible for keeping track of all the personal data you store. 

What does DataMapper provide?

Knowing the contents and locations of company documents, along with who has access to them, is essential to preventing a data breach, responding to individual data requests, and communicating effectively between departments day to day.

Different members of your team may be storing company information in different locations online and on their own computers. Each employee in your business may spend an average of roughly 3.5 hours each week searching for documents, and 20% of documents are either misplaced or completely lost. Manually organizing documents scattered across mailboxes, servers, local drives, cloud storage, and USB drives is an enormous task with significant potential for human error, but an automated system makes it simple.

DataMapper reduces the risk of improper data handling by giving you a complete overview of files and documents from all users on a single dashboard. It organizes your documents intuitively based on content and context, making it easy to navigate all documents across an organization. Whole teams can join DataMapper so that all company information can be gathered and mapped for overall GDPR risk.

DataMapper categorizes personal data found by its level of sensitivity. Several Articles from the GDPR set out a hierarchy of data sensitivity. Examples of risks in ascending order are shown in the graphic below:

DataMapper’s automatic identification and classification of critical data make it easy to review GDPR risks. It will also help you when creating and mapping data flows. Make sure essential documents are stored in the right place under the right name. Since DataMapper shows you the risk level of each document along with its location and who is storing it, you can assess and adapt data-storage protocols to mitigate risk and keep track of compliance.

The clear overview of your documents and quick data access DataMapper provides can: 
  • Make it easy to spot redundancy and decide what needs to be deleted.
  • Help you identify ways to streamline and simplify your company’s data processes.

In addition to the increased data security, easy GDPR compliance and time-saving value that its automated document management allows, DataMapper's 'bird's eye view' of company documents has other benefits to a company's bottom line. When you revisit data that was buried in an old file or floating aimlessly in cloud storage, you quickly discover any incomplete contracts, overlooked sales leads, unused subscriptions, superfluous systems and suppliers, and other information that would otherwise slip through the cracks. Make sure you are taking advantage of all the valuable information your company has collected, and eliminate the rest.

Architecture and integrations

Backend 

DataMapper is developed almost exclusively in Python. Python was chosen as it is currently the market-leading programming language for machine learning (ML). The best performing widely adopted ML libraries are written for/in Python. We use the following key packages for our ML: scikit-learn, NLTK, spaCy, NumPy, SciPy, and to some degree Keras, Theano, Gensim, and Pandas.

Our stack also uses the following infrastructure software: MySQL, Redis, Elasticsearch, Flask, and Celery.

DataMapper runs as microservices on Kubernetes, deployed on Azure. Some activities (e.g., OCR) are handled by external APIs, such as Microsoft Cognitive Services.

Client

The DataMapper client is developed in HTML/CSS/JS using the Vue.js framework and deployed as native applications for Windows and macOS using Electron.

The client needs an internet connection to function and communicate with the DataMapper API.

Platform

Document formats

DataMapper can extract text from these formats: 

Documents: txt, rtf, pdf, doc, docx, gdoc, odt (JPG) 

Presentations: pptx, ppt 

Spreadsheets: xls, xlsx, gsheet 

Hypertext: html, htm

Archives: pst, zip
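
As an illustration of how the format list above could be used, the following minimal sketch checks whether a file name falls into one of the listed groups. The dictionary layout and the function name format_group are assumptions made for this example, not part of DataMapper. The parenthesized JPG entry is left out because the list does not make its status clear.

```python
from pathlib import PurePath
from typing import Optional

# Extension groups copied from the list above.
# The dictionary layout itself is an assumption for illustration.
SUPPORTED_FORMATS = {
    "Documents": {"txt", "rtf", "pdf", "doc", "docx", "gdoc", "odt"},
    "Presentations": {"pptx", "ppt"},
    "Spreadsheets": {"xls", "xlsx", "gsheet"},
    "Hypertext": {"html", "htm"},
    "Archives": {"pst", "zip"},
}


def format_group(filename: str) -> Optional[str]:
    """Return the format group for a file name, or None if it is not listed."""
    # Only the last extension is considered, compared case-insensitively.
    extension = PurePath(filename).suffix.lower().lstrip(".")
    for group, extensions in SUPPORTED_FORMATS.items():
        if extension in extensions:
            return group
    return None
```

For example, format_group("Report.PDF") returns "Documents", while a file without a listed extension returns None.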

Languages

DataMapper understands documents written in English and Danish. More languages will be added in August.

Document classification

DataMapper comes with two main pre-trained document classification models: English and Danish. They classify documents into 79 document types, divided into main classes:

  • Contracts 
  • Governance
  • Real Estate 
  • Presentations 
  • Intellectual property
  • Financial documents 
  • Human Resources (including identification papers such as passport and driver's license) 
  • Certificates
  • Others

Some classes have many subclasses, others have few. 24 subclasses have an F1 score above 0.9. F1 is the harmonic mean of precision and recall, so it is related to, but not the same as, 90% accuracy.
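
To make the F1 figure concrete, here is a small sketch of how an F1 score is computed for a single class from prediction counts. The function name and the example counts are illustrative assumptions. They are not DataMapper's evaluation code or its measured results.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall, for one class."""
    if true_positives == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

With made-up counts of 90 correct predictions, 10 false positives and 10 false negatives, precision and recall are both 0.9 and so is F1. Because F1 ignores true negatives, it can differ noticeably from plain accuracy when classes are unbalanced.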

DataMapper classifies documents by language detection, text cleaning, feature extraction (bag of words), main category classification (SVM) and leaf category classification (SVM). Some specific business logic is programmed into this pipeline, based on classifications.
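
The order of these steps can be sketched as follows. This is only a structural illustration: the keyword-based classifiers are hypothetical stand-ins for the trained SVM models, the stop-word language detector is a toy, and the leaf label names and all function names are invented for the example.

```python
import re
from collections import Counter
from typing import Callable, Dict, Tuple

Classifier = Callable[[Counter], str]


def detect_language(text: str) -> str:
    """Toy language detection: compare counts of a few stop words."""
    words = Counter(re.findall(r"\w+", text.lower()))
    danish = sum(words[w] for w in ("og", "det", "ikke", "er"))
    english = sum(words[w] for w in ("and", "the", "not", "is"))
    return "da" if danish > english else "en"


def clean_text(text: str) -> str:
    """Lower-case the text and keep letters only (including Danish æ, ø, å)."""
    return re.sub(r"[^a-zæøå]+", " ", text.lower()).strip()


def bag_of_words(text: str) -> Counter:
    """Feature extraction: count how often each word occurs."""
    return Counter(text.split())


def keyword_classifier(keywords: Dict[str, str], default: str) -> Classifier:
    """Hypothetical stand-in for a trained SVM: the first matching keyword wins."""
    def classify(features: Counter) -> str:
        for word, label in keywords.items():
            if features[word] > 0:
                return label
        return default
    return classify


# One main classifier per language, plus optional leaf classifiers per main class.
MODELS: Dict[str, dict] = {
    "en": {
        "main": keyword_classifier(
            {"agreement": "Contracts", "invoice": "Financial documents"}, "Others"),
        "leaf": {
            "Contracts": keyword_classifier(
                {"employment": "Employment contract"}, "Other contract"),
        },
    },
    "da": {
        "main": keyword_classifier(
            {"kontrakt": "Contracts", "faktura": "Financial documents"}, "Others"),
        "leaf": {},
    },
}


def classify_document(text: str) -> Tuple[str, str, str]:
    """Run the steps in order and return (language, main class, leaf class)."""
    language = detect_language(text)
    features = bag_of_words(clean_text(text))
    model = MODELS[language]
    main_class = model["main"](features)
    leaf_classifier = model["leaf"].get(main_class)
    # Fall back to the main class when no leaf classifier exists for it.
    leaf_class = leaf_classifier(features) if leaf_classifier else main_class
    return language, main_class, leaf_class
```

The point of the two-stage design is that each leaf classifier only has to separate the subclasses of one main class, which is a smaller problem than choosing among all document types at once.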

Integrations

Integrations are built for:

  • Local drives
  • Network drives (Windows file share)
  • Dropbox
  • Google Drive
  • Microsoft Exchange
  • Google Mail
  • IMAP
  • SharePoint (coming soon)
  • OneDrive (coming soon)
  • Integration via Microsoft Flow and Zapier (coming soon)

API

The API supports a variety of methods. The main entries are:

admin → for creating companies, adding users to companies and managing user roles

document → for document classifications, setting document auth level, finding similar documents and document metadata

file → for fetching file data from a document, for getting a PDF version of any document and for searching a company's documents

integration → for creating new integrations and for updating integrations

user → for creating and editing user information; and for inviting new users
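
The sketch below shows one way a client could address these entries. Only the five entry names come from the list above. The host name, the entry/action path layout, the action names and the query parameters are placeholders invented for illustration, since the actual routes are not documented here.

```python
from urllib.parse import urlencode, urljoin

# Entry names taken from the list above.
API_ENTRIES = {"admin", "document", "file", "integration", "user"}

# Placeholder host, not the real API address.
BASE_URL = "https://api.example.invalid/"


def build_url(entry: str, action: str, **params: str) -> str:
    """Build a request URL for a hypothetical '<entry>/<action>' route."""
    if entry not in API_ENTRIES:
        raise ValueError(f"unknown API entry: {entry}")
    url = urljoin(BASE_URL, f"{entry}/{action}")
    return f"{url}?{urlencode(params)}" if params else url
```

Rejecting unknown entry names early keeps typos from turning into confusing server-side errors.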

Client

The client communicates over TLS to the API to provide functionality. It also connects to cloud storage providers directly for fetching folder names so that a user can select specific folders (e.g., for Exchange, Dropbox, ...). It can also scan selected local folders and network drives.

The client can show files (in a pdf viewer) and it can download all/some of your files (in a hierarchical or a flat structure). In the client, you can organize a team's roles, and define who can edit/see which classes of files. Specific files can also be set to "OK" or "critical" for flagging importance. Files may also be set to "private", such that other users from the same company cannot see them. In the client, a document's classification can also be set, if DataMapper did not detect it automatically (low confidence). The client also features a variety of filters and search functionality for finding documents. It also has an overall folder organization such that a user can browse files.
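
The visibility rules described above can be sketched as a simple check. The data model and the rule that owners can always see their own documents are assumptions made for this example. The private flag and the per-class permissions follow the description in the text.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class User:
    name: str
    company: str
    # Document classes this user has been allowed to see.
    viewable_classes: Set[str] = field(default_factory=set)


@dataclass
class Document:
    owner: str
    company: str
    doc_class: str
    private: bool = False


def can_view(user: User, doc: Document) -> bool:
    """Return True if the user may see the document under the sketched rules."""
    if user.company != doc.company:
        return False
    if user.name == doc.owner:
        # Assumption: owners can always see their own documents.
        return True
    if doc.private:
        # Private files are hidden from other users in the same company.
        return False
    return doc.doc_class in user.viewable_classes
```

For instance, a colleague who may see the class "Contracts" can view a shared contract, but not one that its owner has marked private.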

Flow

See the figure below. Data is hosted on Microsoft Azure's blob storage. The connection between the DataMapper API (Python Flask) and the DataMapper client is encrypted. The backend fetches data from cloud storage providers directly (except when scanning local and network drives, where data is uploaded from the client). Text is extracted from documents, stored in MySQL, and made searchable from cache using Elasticsearch (ES). ES is also used for GDPR queries.

Security

User controls access

The user chooses which files DataMapper can access and retains full control to manage data access over time. 

User authentication

The verified creator of an account is given admin status and is the only one who can invite users to that team and the only one who can view a complete dashboard of all results. Users are identified by an administrator's invite and a dedicated sign-up flow ensuring each user is verified.

Password and access tokens

Password and access tokens are signed with a shared secret signature key, and the password is hashed with sha256_crypt. Every access to your data is securely logged.
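
As a rough illustration of signing a token with a shared secret, the sketch below uses HMAC-SHA256 from the Python standard library. The token layout (a base64 payload, a dot, and a base64 signature), the choice of HMAC-SHA256 and the function names are assumptions for this example. DataMapper's actual token format is not described here, and password hashing with sha256_crypt is left out of the sketch.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64(data: bytes) -> str:
    """URL-safe base64 as text; the output never contains a dot."""
    return base64.urlsafe_b64encode(data).decode("ascii")


def sign_token(payload: dict, secret: bytes) -> str:
    """Serialize the payload and append an HMAC-SHA256 signature."""
    body = _b64(json.dumps(payload, sort_keys=True).encode("utf-8"))
    signature = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).digest()
    return f"{body}.{_b64(signature)}"


def verify_token(token: str, secret: bytes) -> Optional[dict]:
    """Return the payload if the signature matches, otherwise None."""
    try:
        body, signature = token.split(".")
    except ValueError:
        return None  # Not exactly two dot-separated parts.
    expected = _b64(hmac.new(secret, body.encode("utf-8"), hashlib.sha256).digest())
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(body))
```

Because the signature is checked before the payload is decoded, a token that was altered or signed with a different secret is rejected without its contents being trusted.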

Network and access

To prevent man-in-the-middle attacks, all our servers are certified with X.509 certificates provided by WebTrust-certified certificate authorities. All your data is hosted on trusted third-party services (e.g., Azure) that use state-of-the-art access control and operate physically guarded server facilities.

Data encrypted in transit

Data in transit is protected with HTTPS (TLS 1.0) and shared access signatures.

Data encrypted at rest

Data is stored in private Azure blob storage, encrypted at rest with Azure-managed 256-bit AES keys.

Compliance with the Danish Business Authority’s guidelines

We comply with the Danish government’s Agency on Digitization’s best practice guidelines regarding IT providers. Read our answers to their 11 questions for IT providers here.

Internal rules for handling your data

We have adopted internal rules on information security. They contain instructions and measures to protect your personal data against destruction, loss, alteration, and unauthorized publication, and to prevent unauthorized persons from gaining access to or knowledge of it.

Security Summary 
  • DataMapper logs all access to your documents.
  • DataMapper frequently backs up your documents and data.
  • DataMapper uses the latest encryption standards both when transferring and storing your documents, including backups.
  • DataMapper can guarantee that your documents and data do not leave the EU.
  • DataMapper monitors and keeps all servers up to date with the latest OS and security patches.

Scale

DataMapper is hosted on the Microsoft Azure platform and relies on Azure's features and functionality to scale. Its capacity can be increased based on resource requirements.
  • Regions for storage: The current data center is Microsoft Azure in Amsterdam, the Netherlands.
  • Scale units: The application can scale up on demand when necessary.

Delivery and continuous updates

At Safe Online, we are dedicated to continuously improving DataMapper based on the needs of users. We constantly monitor developments in privacy regulations, e.g., GDPR and related regulations in countries both inside and outside the EU, to ensure the product is compatible with the latest local policies.

Changes and feature updates are deployed first in a staging environment and verified by a closed group of users and testers. They are published to the production version only after both internal testing and the group of testers have approved them. Customers are notified of upcoming updates.

Compatibility

DataMapper is compatible with Windows and macOS.

Help

Watch our how-to videos for help getting started:

Documentation & Support

For more information about DataMapper please see our help center.

Contact

Please contact us with any questions.
