DataMapper Technical Whitepaper
- Make it easy to spot redundancy and decide what needs to be deleted.
- Help you identify ways to streamline and simplify your company’s data processes.
- Contracts
- Governance
- Real Estate
- Presentations
- Intellectual property
- Financial documents
- Human Resources (including identification papers such as passport and driver's license)
- Certificates
- Others
- Local drives
- Network drives (Windows file share)
- Dropbox
- Google Drive
- Microsoft Exchange
- Google Mail
- SharePoint
- OneDrive
- DataMapper receives a temporary credential for the company's file storage location.
- DataMapper uploads files directly to file storage. Data is protected in transit with TLS 1.2, using the standard Microsoft Azure implementation.
- Backend service processes the file and removes it from file storage. Backend service retains the OCR-ed version of the file and its original location path (e.g. C:\Documents\a) in the database.
- A user authenticates their Exchange account in DataMapper and selects folders they want to scan. The credentials are stored encrypted in the database.
- We use the following library: https://pypi.org/project/exchangelib/ which contacts Exchange web services to reach the user’s inbox (https://docs.microsoft.com/en-us/exchange/client-developer/exchange-web-services/start-using-web-services-in-exchange)
- The backend service transforms emails into files and also extracts attachments. It retains the OCR-ed version of each file in the database, together with its original location path, which in this case is an ID from Exchange Web Services.
- DataMapper logs all access to your Documents.
- DataMapper frequently backs up your documents and data.
- DataMapper uses the latest encryption standards both when transferring and storing your documents, including backup.
- DataMapper can guarantee that your Documents and data do not leave the EU.
- DataMapper monitors and keeps all servers up to date with the latest OS and security patches.
- Copies of the original files from data locations that a user selects to scan are stored temporarily during processing. These copies are kept only during a scan and removed once they have been processed.
- The structure of files is "multi-tenant" in the sense that the location of files is containerized separately per company. For each company, during a scan, the system automatically obtains a separate credential (shared access signature in this case) to access the data.
- Only two senior developers at SafeOnline may get access to this data, and that access is protected with multifactor authentication. The current process for access to files requires written consent and a request from two persons in the requesting company. Each access is logged, and documentation can be requested at any time.
- TLS 1.2 protects data in transit, and data is encrypted at rest with symmetric AES 256 keys - with standard Microsoft Azure configuration.
- OCR-ed version of the documents. These versions are persistently stored, thus not removed.
- Lists of high-risk and low-risk files referencing OCR-ed document id and findings -> sensitive numbers, sensitive keywords, GDPR names (customer and employee names added by the user).
- User information such as e-mail and name.
- GDPR names added by the user.
- Encrypted Exchange credentials per user. Exchange credentials are stored because the Exchange connector needs them to work, but they are encrypted with Azure Key Vault symmetric keys and are therefore not in a form that a human could read.
- The structure of the database is single tenant in the sense that all a company's data is in one place, and we rely on the code to show a company only their own information.
- The database is protected by a Virtual Network infrastructure to ensure by rule that only the processing cluster has access to it and not any third party.
- Human access to data is limited to two senior developers at SafeOnline, and only as described in the rule above.
- Documentation: please see our data flow schematic above under the "Flow" heading.
- Regions for storage: The current data center is in Microsoft Azure in Amsterdam, the Netherlands.
- Scale Units: The application can scale up on demand when necessary.
Introduction
What does DataMapper provide?
Knowing where company document content is located, along with who has access to it, is essential to preventing a data breach, responding to individual data requests, and communicating effectively between departments day-to-day.
Different members of your team may be storing company information in different locations online and on their own computers. Each employee in your business may spend roughly 3.5 hours on average each week searching for documents. 20% are either misplaced or completely lost. Manually organizing documents scattered across mailboxes, servers, local drives, cloud storage, and USBs is an enormous task with a significant potential for human error; but an automated system makes it simple.
DataMapper reduces the risk of improper data handling by giving you a complete overview of files and documents from all users on a single dashboard. It organizes your documents intuitively based on content and context, making it easy to navigate all documents across an organization. Whole teams can join DataMapper so that all company information can be gathered and mapped for overall GDPR risk.
DataMapper categorizes personal data found by its level of sensitivity. Several Articles from the GDPR set out a hierarchy of data sensitivity. Examples of risk classification in ascending order are shown in the graphic below:
DataMapper’s automatic identification and classification of critical data make it easy to review GDPR risks. It will also help you when creating and mapping data flows. Make sure essential documents are stored in the right place under the right name. Since DataMapper shows you the risk level of each document along with its location and who is storing it, you can assess and adapt data-storage protocols to mitigate risk and keep track of compliance.
In addition to the increased data security, easy GDPR compliance and time-saving value that its automated document management allows, DataMapper's 'bird's eye view' of company documents has other benefits to a company's bottom line. When you revisit data that was buried in an old file or floating aimlessly in cloud storage, you quickly discover any incomplete contracts, overlooked sales leads, unused subscriptions, superfluous systems and suppliers, and other information that would otherwise slip through the cracks. Make sure you are taking advantage of all the valuable information your company has collected, and eliminate the rest.
Architecture and integrations
Backend
DataMapper is developed almost exclusively in Python. Python was chosen as it is currently the market-leading programming language for machine learning (ML). The best performing widely adopted ML libraries are written for/in Python. We use the following key packages for our ML: scikit-learn, NLTK, spaCy, NumPy, SciPy, and to some degree Keras, Theano, Gensim, and Pandas.
Our stack also uses the following infrastructure software: MySQL, Redis, Elasticsearch, Flask, and Celery.
DataMapper runs as microservices on Kubernetes, deployed on Azure. Some activities (e.g. OCR) are handled by external APIs, such as Microsoft Cognitive Services.
Client
The DataMapper client is developed in HTML/CSS/JS and deployed as native applications for Windows and OSX using Electron. It uses the framework Vue.js.
The DataMapper client runs on OSX and Windows. The client needs an internet connection to function and communicate with the DataMapper API.
Platform
Document formats
DataMapper can extract text from these formats:
Documents: txt, rtf, pdf, doc, docx, gdoc, odt (JPG)
Presentations: pptx, ppt
Spreadsheets: xls, xlsx, gsheet
Hypertext: html, htm
Archives: pst, zip
Languages
DataMapper understands documents written in English and Danish. More languages will be added in August.
Document classification
DataMapper comes with two main pre-trained document classification models: English and Danish. They classify documents into 80 document types, divided into main classes:
Some classes have many subclasses, others have few. We have 24 subclasses with an F1 score above 0.9. F1 combines precision and recall, so this is comparable to, but not the same as, more than 90% accuracy.
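To make the F1 figure concrete, the following is a minimal sketch of how a per-class F1 score can be computed from true and predicted labels. The label names and the sample data are hypothetical and are not taken from DataMapper's evaluation set.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Return (precision, recall, F1) for one class, treating it as the positive label."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == target and p == target)
    false_pos = sum(1 for t, p in pairs if t != target and p == target)
    false_neg = sum(1 for t, p in pairs if t == target and p != target)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical labels for six documents.
true_labels = ["contract", "contract", "contract", "invoice", "invoice", "cv"]
predicted_labels = ["contract", "contract", "invoice", "invoice", "contract", "cv"]

precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels, "contract")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A class can have high accuracy and still a low F1 score when it is rare, which is why F1 is reported per subclass.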
DataMapper classifies documents by language detection, text cleaning, feature extraction (bag of words), main category classification (SVM) and leaf category classification (SVM). Some specific business logic is programmed into this pipeline, based on classifications.
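The two-stage structure of this pipeline (a main category first, then a leaf category chosen by a model specific to that main category) can be sketched as follows. This is an illustration only, not DataMapper's code: a simple keyword scorer stands in for the SVM models, language detection is omitted, and the leaf class names and keyword lists are hypothetical.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase the text and keep only runs of letters (a simple stand-in for text cleaning)."""
    return re.findall(r"[a-zæøå]+", text.lower())


def bag_of_words(tokens):
    """Feature extraction: count how often each token occurs."""
    return Counter(tokens)


class KeywordScorer:
    """Stand-in classifier. The whitepaper names SVMs; this scorer only mimics the interface."""

    def __init__(self, keywords_by_label):
        self.keywords_by_label = keywords_by_label

    def predict(self, features):
        scores = {
            label: sum(features[word] for word in words)
            for label, words in self.keywords_by_label.items()
        }
        return max(scores, key=scores.get)


# Hypothetical models: one for the main category, one per main category for the leaf.
MAIN_MODEL = KeywordScorer({
    "Contracts": ["agreement", "party", "terms"],
    "Financial documents": ["invoice", "amount", "vat"],
})
LEAF_MODELS = {
    "Contracts": KeywordScorer({
        "Employment contract": ["employee", "salary"],
        "Lease": ["tenant", "rent"],
    }),
    "Financial documents": KeywordScorer({
        "Invoice": ["invoice", "due"],
        "Budget": ["budget", "forecast"],
    }),
}


def classify(text):
    """Run cleaning, feature extraction, main classification, then leaf classification."""
    features = bag_of_words(clean_text(text))
    main_category = MAIN_MODEL.predict(features)
    leaf_category = LEAF_MODELS[main_category].predict(features)
    return main_category, leaf_category


print(classify("This agreement sets the terms for the employee, including salary."))
```

Routing to a leaf model per main category keeps each leaf classifier small, since it only has to separate the subclasses of one main class.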
Integrations
Integrations are built for:
API
The API supports a variety of methods. The main entries are:
admin → for creating companies, adding users to companies and managing user roles
document → for document classifications, setting document auth level, finding similar documents and document meta data
file → for fetching file data from a document, for getting a pdf version of any document and for searching for company's documents
integration → for creating new integrations and for updating integrations
user → for creating and editing user information; and for inviting new users
Client
The client communicates over TLS 1.2 to the API to provide functionality. It also connects to cloud storage providers directly for fetching folder names so that a user can select specific folders (e.g., for Exchange, Dropbox, ...). It can also scan selected local folders and network drives.
The client can show files (in a pdf viewer) and it can download all/some of your files (in a hierarchical or a flat structure). In the client, you can organize a team's roles, and define who can edit/see which classes of files. Specific files can also be set to "OK" or "critical" for flagging importance. Files may also be set to "private", such that other users from the same company cannot see them. In the client, a document's classification can also be set, if DataMapper did not detect it automatically (low confidence). The client also features a variety of filters and search functionality for finding documents. It also has an overall folder organization such that a user can browse files.
Flow
See the figure below. Data is hosted on Microsoft Azure's blob storage. The connection between the DataMapper API (Python Flask) and the DataMapper client is encrypted. The backend fetches data from cloud storage providers directly (except when scanning local and network drives, where data is uploaded from the client). Text is extracted from documents, stored in MySQL, and made searchable from cache using Elasticsearch (ES). ES is also used for GDPR queries.
Local and network scans:
Exchange scans:
Security
User controls access
The user chooses which files DataMapper can access and retains full control to manage data access over time.
User authentication
Password and access tokens
Network and access
Data encrypted in transit
Data encrypted at rest
Azure private blob storage encrypted at rest with Azure managed AES 256 bit keys
Compliance with the Danish Business Authority’s guidelines
Internal rules for handling your data
We have adopted internal rules on information security. They contain instructions and measures to protect your personal data against destruction, loss, alteration, and unauthorized publication, and to prevent unauthorized persons from gaining access to or knowledge of it.
Security Summary
Privacy and data storage
Storage
A brief overview of file storage:
A brief overview of database storage:
Usernames and passwords
Usernames are stored in cleartext and passwords are stored as a hash. This means that we do not store the password itself, only a hash that matches when the user enters the correct password for that username.
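As an illustration of storing a hash rather than the password, here is a minimal sketch using Python's standard library. The whitepaper does not state which hashing scheme DataMapper uses, so the choice of salted PBKDF2-HMAC-SHA256 and the iteration count are assumptions made for this example.

```python
import hashlib
import hmac
import os

# Assumption for this sketch: the real scheme and work factor are not documented.
ITERATIONS = 200_000


def hash_password(password, salt=None):
    """Return (salt, digest). Only these two values are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, stored_digest):
    """Re-hash the entered password with the stored salt and compare in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored_digest)


salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))
print(verify_password("wrong guess", salt, digest))
```

The per-user random salt means two users with the same password get different stored hashes.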
During a scan of a local or network drive
We first compile a list of MD5 hashes of the files that are expected to be uploaded. The backend service is notified to expect these hashes. If the user shuts down the computer or the Internet connection drops, the backend asks the client to continue with the remaining hashes the next time the user logs in. We do not monitor activity on the user’s PC.
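The resume behaviour described above can be sketched as a manifest of MD5 hashes plus a set difference against what the backend has already received. The function names are hypothetical, and how the client and backend actually exchange the manifest is not covered here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each expected MD5 hash to the local path it was computed from."""
    return {md5_of_file(path): path for path in paths}


def remaining_uploads(manifest, received_hashes):
    """Return the manifest entries the backend has not yet received."""
    return {
        file_hash: path
        for file_hash, path in manifest.items()
        if file_hash not in received_hashes
    }
```

After an interrupted scan, the client would call `remaining_uploads` with the hashes the backend reports as received and upload only those files. Files with identical content share one hash, so they appear once in the manifest.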
Authentication
Microsoft example:
Users who authenticate with Microsoft are redirected to the Microsoft identity platform (https://docs.microsoft.com/en-us/azure/active-directory/develop/). We receive user consent from Microsoft in the form of an access_token.
Thus, 2-factor validation, conditional access, and scope signing are handled between the user and Microsoft. The Microsoft identity platform then passes user consent on to us and gives us access only to the data the user has consented to share. This access can be withdrawn by the user or the company at any time.
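The redirect step can be sketched as building an authorization URL for the Microsoft identity platform's OAuth 2.0 authorize endpoint. The client ID, redirect URI, and requested scopes below are placeholders, and which scopes DataMapper actually requests is an assumption of this example.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

AUTHORIZE_ENDPOINT = "https://login.microsoftonline.com/{tenant}/oauth2/v2.0/authorize"


def build_authorize_url(client_id, redirect_uri, scopes, tenant="common"):
    """Return (url, state). The user signs in and consents at Microsoft, not in the application."""
    # The state value is echoed back on the redirect and lets the app reject forged responses.
    state = secrets.token_urlsafe(16)
    query = urlencode({
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": redirect_uri,
        "response_mode": "query",
        "scope": " ".join(scopes),
        "state": state,
    })
    return AUTHORIZE_ENDPOINT.format(tenant=tenant) + "?" + query, state


# Placeholder values for illustration only.
url, state = build_authorize_url(
    client_id="00000000-0000-0000-0000-000000000000",
    redirect_uri="https://example.invalid/callback",
    scopes=["offline_access", "Files.Read"],
)
print(url)
```

After consent, Microsoft redirects back with an authorization code that the application exchanges for the access token. The token covers only the scopes the user agreed to.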
Onboarding
Our onboarding flow will install the software and onboard all invited users. Mass deployment solutions are in progress for future versions.
Scale
Delivery and continuous updates
Compatibility
Help
Documentation & Support
Contact
Please contact us with any questions.