DataMapper Technical Whitepaper
What does DataMapper provide?
Knowing what company documents contain, where they are stored, and who has access to them is essential for preventing data breaches, responding to individual data requests, and communicating effectively between departments day to day.
Different members of your team may store company information in different locations online and on their own computers. Each employee in your business may spend roughly 3.5 hours a week, on average, searching for documents, and some 20% of documents are either misplaced or completely lost. Manually organizing documents scattered across mailboxes, servers, local drives, cloud storage, and USB drives is an enormous task with significant potential for human error, but an automated system makes it simple.
DataMapper reduces the risk of improper data handling by giving you a complete overview of files and documents from all users on a single dashboard. It organizes your documents intuitively based on content and context, making it easy to navigate all documents across an organization. Whole teams can join DataMapper so that all company information can be gathered and mapped for overall GDPR risk.
DataMapper categorizes personal data found by its level of sensitivity. Several Articles from the GDPR set out a hierarchy of data sensitivity. Examples of risks in ascending order are shown in the graphic below:
DataMapper’s automatic identification and classification of critical data makes it easy to review GDPR risks. It also helps you create and map data flows, and make sure essential documents are stored in the right place under the right name. Since DataMapper shows you the risk level of each document along with its location and who is storing it, you can assess and adapt data-storage protocols to mitigate risk and keep track of compliance. DataMapper will also:
- Make it easy to find redundant documents and decide what needs to be deleted.
- Help you identify ways to streamline and simplify your company’s data processes.
In addition to the increased data security, easy GDPR compliance and time-saving value that its automated document management allows, DataMapper's 'bird's eye view' of company documents has other benefits to a company's bottom line. When you revisit data that was buried in an old file or floating aimlessly in cloud storage, you quickly discover any incomplete contracts, overlooked sales leads, unused subscriptions, superfluous systems and suppliers, and other information that would otherwise slip through the cracks. Make sure you are taking advantage of all the valuable information your company has collected, and eliminate the rest.
Architecture and integrations
DataMapper is developed almost exclusively in Python. Python was chosen because it is currently the market-leading programming language for machine learning (ML), and the best-performing widely adopted ML libraries are written in or for Python. We use the following key packages for our ML: scikit-learn, NLTK, spaCy, NumPy, SciPy, and to some degree Keras, Theano, Gensim, and pandas.
Our stack also uses the following infrastructure software: MySQL, Redis, Elasticsearch, Flask, and Celery.
DataMapper runs as microservices on Kubernetes, deployed on Azure. Some activities (e.g., OCR) are handled by external APIs, such as Microsoft Cognitive Services.
The DataMapper client is developed in HTML/CSS/JS with the Vue.js framework and is deployed as a native application for Windows and OS X using Electron.
The client needs an internet connection to function and to communicate with the DataMapper API.
DataMapper can extract text from these formats:
Documents: txt, rtf, pdf, doc, docx, gdoc, odt (JPG)
Presentations: pptx, ppt
Spreadsheets: xls, xlsx, gsheet
Hypertext: html, htm
Archives: pst, zip
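The format list above can be expressed as a simple lookup table. The following is a minimal sketch, not DataMapper's actual code: the names `SUPPORTED_FORMATS` and `format_category` are hypothetical, and "jpg" is left out because the list only mentions it in parentheses without saying how images are handled.

```python
from pathlib import PurePath
from typing import Optional

# Extensions copied from the format list above. "jpg" is deliberately
# omitted: the source mentions it only in parentheses.
SUPPORTED_FORMATS = {
    "Documents": {"txt", "rtf", "pdf", "doc", "docx", "gdoc", "odt"},
    "Presentations": {"pptx", "ppt"},
    "Spreadsheets": {"xls", "xlsx", "gsheet"},
    "Hypertext": {"html", "htm"},
    "Archives": {"pst", "zip"},
}


def format_category(filename: str) -> Optional[str]:
    """Return the format category for a file name, or None if unsupported."""
    # Only the last suffix counts, compared case-insensitively.
    extension = PurePath(filename).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_FORMATS.items():
        if extension in extensions:
            return category
    return None
```

A scanner could call `format_category` on each file name and skip files for which it returns `None`.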
DataMapper understands documents written in English and Danish. More languages will be added in August.
DataMapper comes with two main pre-trained document classification models, one for English and one for Danish. They classify documents into 79 document types, grouped into main classes such as:
- Real Estate
- Intellectual property
- Financial documents
- Human Resources (including identification papers such as passport and driver's license)
- Others, and more.
Some classes have many subclasses, others have few. 24 subclasses reach an F1 score above 0.9. (F1 is the harmonic mean of precision and recall, so it is related to accuracy but not the same measure.)
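Because F1 combines precision and recall and ignores true negatives, it can differ noticeably from accuracy for a rare subclass. The sketch below illustrates this with made-up counts; they are not DataMapper results, and the function names are only for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that are correct, true negatives included."""
    return (tp + tn) / (tp + fp + fn + tn)


# Illustrative counts for one subclass among 1000 documents (assumed values).
TP, FP, FN, TN = 45, 5, 5, 945
precision, recall, f1 = precision_recall_f1(TP, FP, FN)
acc = accuracy(TP, FP, FN, TN)
```

With these counts, precision, recall and F1 all come out at 0.9 while accuracy is 0.99, since the many true negatives raise accuracy but do not enter F1.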
DataMapper classifies documents in five steps: language detection, text cleaning, feature extraction (bag of words), main category classification with a support vector machine (SVM), and leaf category classification (also an SVM). Some business logic that depends on the resulting classification is also built into this pipeline.
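The shape of this pipeline can be sketched with the standard library alone. This is an illustration under stated assumptions, not DataMapper's implementation: the two SVM stages are replaced by a nearest-centroid stand-in so the example has no dependencies, the training texts are made up, only an English model is shown, and the leaf names ("Invoice", "Passport", and so on) are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny stopword lists used only to guess the language (illustrative).
STOPWORDS = {
    "en": {"the", "and", "of", "is"},
    "da": {"og", "er", "det", "af"},
}

# Made-up training texts: (text, main class, hypothetical leaf class).
TRAINING = [
    ("employment contract salary employee", "Human Resources", "Employment contract"),
    ("passport identification nationality", "Human Resources", "Passport"),
    ("invoice amount due payment", "Financial documents", "Invoice"),
    ("annual report revenue balance sheet", "Financial documents", "Annual report"),
]


def clean(text):
    """Text cleaning: lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-zæøå]+", text.lower())


def detect_language(tokens):
    """Language detection stand-in: language with the most stopword hits."""
    return max(STOPWORDS, key=lambda lang: sum(t in STOPWORDS[lang] for t in tokens))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class CentroidClassifier:
    """Stand-in for an SVM: picks the label with the most similar summed bag."""

    def __init__(self, examples):
        self.centroids = {}
        for text, label in examples:
            self.centroids.setdefault(label, Counter()).update(clean(text))

    def predict(self, bag):
        return max(self.centroids, key=lambda label: cosine(bag, self.centroids[label]))


# One classifier for the main category, then one leaf classifier per main category.
MAIN_CLASSIFIER = CentroidClassifier([(text, main) for text, main, _ in TRAINING])
LEAF_CLASSIFIERS = {
    main: CentroidClassifier([(t, leaf) for t, m, leaf in TRAINING if m == main])
    for main in {m for _, m, _ in TRAINING}
}


def classify(text):
    """Run the steps in order and return (language, main class, leaf class)."""
    tokens = clean(text)                        # text cleaning
    language = detect_language(tokens)          # language detection
    bag = Counter(tokens)                       # bag-of-words features
    main = MAIN_CLASSIFIER.predict(bag)         # main category
    leaf = LEAF_CLASSIFIERS[main].predict(bag)  # leaf category within that main class
    return language, main, leaf
```

The point of the two-stage design is that each leaf classifier only has to separate the subclasses of one main class, which keeps each decision smaller than a single 79-way classification.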
Integrations are built for:
- Local drives
- Network drives (Windows file share)
- Google Drive
- Microsoft Exchange
- Google Mail
- SharePoint (coming soon)
- OneDrive (coming soon)
- Integration via Microsoft Flow and Zapier (coming soon)
The API supports a variety of methods. The main entries are:
admin → for creating companies, adding users to companies and managing user roles
document → for document classifications, setting a document's auth level, finding similar documents and retrieving document metadata
file → for fetching file data from a document, for getting a PDF version of any document and for searching a company's documents
integration → for creating new integrations and for updating integrations
user → for creating and editing user information; and for inviting new users
The client communicates with the API over TLS. It also connects directly to cloud storage providers to fetch folder names, so that a user can select specific folders (e.g., for Exchange, Dropbox, ...). It can also scan selected local folders and network drives.
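As an illustration of what a TLS-protected call to one of the API entries might look like from a script, here is a minimal sketch using only the Python standard library. The host name, URL layout, JSON payload and bearer-token header are all assumptions made for the example; the source does not document them, and the real client is an Electron application rather than a Python program.

```python
import json
import ssl
import urllib.request

# Hypothetical base URL; the real API host is not given in this document.
API_BASE_URL = "https://api.datamapper.example"


def build_tls_context() -> ssl.SSLContext:
    """Create a context that verifies the server certificate and host name."""
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2 (an assumed policy).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def build_request(entry: str, payload: dict, token: str) -> urllib.request.Request:
    """Build a POST request for an API entry such as "document" or "file"."""
    return urllib.request.Request(
        url=f"{API_BASE_URL}/{entry}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


def call_api(entry: str, payload: dict, token: str) -> dict:
    """Send the request over TLS and decode the JSON response."""
    request = build_request(entry, payload, token)
    with urllib.request.urlopen(request, context=build_tls_context(), timeout=30) as response:
        return json.load(response)
```

`ssl.create_default_context()` enables certificate and host name verification by default, so a connection to a server with an invalid certificate fails instead of silently proceeding.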
The client can show files (in a PDF viewer) and can download all or some of your files (in a hierarchical or a flat structure). In the client, you can organize a team's roles and define who can edit or see which classes of files. Specific files can also be marked "OK" or "critical" to flag their importance. Files may also be set to "private", so that other users from the same company cannot see them. A document's classification can also be set manually in the client if DataMapper did not detect it automatically (low confidence). The client also features a variety of filters and search functionality for finding documents, and an overall folder organization that lets a user browse files.
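The visibility rules described above (role-based access per document class, plus a per-file "private" setting) can be modelled roughly as follows. This is a sketch under assumptions: the `Document` and `Role` types, their field names, and the rule that an owner can always see their own files are hypothetical and not taken from DataMapper's code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    name: str
    document_class: str    # e.g. "Human Resources"
    owner: str             # user who stores the file
    private: bool = False  # hidden from other users in the same company
    flag: str = "OK"       # "OK" or "critical"


@dataclass
class Role:
    name: str
    viewable_classes: set = field(default_factory=set)


def can_view(user: str, role: Role, document: Document) -> bool:
    """Decide whether a user with the given role may see a document."""
    if document.owner == user:
        return True  # assumption: owners always see their own files
    if document.private:
        return False  # private files are hidden from everyone else
    return document.document_class in role.viewable_classes
```

For example, a user whose role lists "Human Resources" can see a colleague's employment contract unless that colleague has marked it private.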
See the figure below. Data is hosted in Microsoft Azure Blob Storage. The connection between the DataMapper API (Python Flask) and the DataMapper client is encrypted. The backend fetches data directly from cloud storage providers, except when scanning local and network drives, where data is uploaded from the client. Text is extracted from documents, stored in MySQL, and made searchable from cache using Elasticsearch (ES). ES is also used for GDPR queries.
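A GDPR query, such as finding every document that mentions a particular person when answering an individual data request, could be expressed as an Elasticsearch query body along these lines. This is only a sketch: the field names `text` and `company_id` and the function name are assumptions, since the index mapping is not described here.

```python
def build_subject_query(company_id: str, person_name: str, size: int = 20) -> dict:
    """Build an Elasticsearch query body for documents mentioning a person.

    The field names "text" and "company_id" are hypothetical.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Match the name as an exact phrase in the extracted text.
                "must": [{"match_phrase": {"text": person_name}}],
                # Restrict results to one company's documents.
                "filter": [{"term": {"company_id": company_id}}],
            }
        },
    }
```

Placing the company restriction in `filter` rather than `must` keeps it out of relevance scoring, so results are ranked only by how well the text matches.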
User controls access
The user chooses which files DataMapper can access and retains full control to manage data access over time.
Data encrypted at rest
Data is kept in private Azure Blob Storage, encrypted at rest with Azure-managed AES 256-bit keys.
We have adopted internal rules on information security which contain instructions and measures to protect your personal data against destruction, loss, alteration and unauthorized publication, and to prevent unauthorized persons from gaining access to or knowledge of it.
- DataMapper logs all access to your Documents.
- DataMapper frequently backs up your documents and data.
- DataMapper uses the latest encryption standards both when transferring and storing your documents, including backup.
- DataMapper can guarantee that your Documents and data do not leave the EU.
- DataMapper monitors and keeps all servers up to date with the latest OS and security patches.
- Regions for storage: The current data center is Microsoft Azure's in Amsterdam, the Netherlands.
- Scale units: The application can scale up on demand when necessary.
Delivery and continuous updates
Documentation & Support
Please contact us with any questions.