How we revolutionize document processing with machine learning

At Parashift, we use pragmatic, state-of-the-art machine learning (ML) technologies to solve specific problems for our clients through Intelligent Document Processing. The problem at the heart of this is the extraction of structured data from all kinds of business documents.

Gartner estimates that today, more than 80% of corporate data is unstructured. This means that the data resides in the form of free text, emails, and other written documents. So it is not surprising that the autonomous classification of documents and the extraction of data from them are still major challenges. There are various reasons why this is still the case. Here, I will briefly touch upon two of them:

Templates alone are not enough

Documents, even those that we would describe as rather standardized, such as invoices, can have many different layouts and structures. This means that a traditional OCR (Optical Character Recognition) machine simply has no idea where to look for which piece of information if a document deviates from the “normal” case. The complexity is often relatively high even with seemingly trivial documents.

A pragmatic approach to solving such a problem is to create a large number of static templates that capture the structural differences between the various documents and suppliers. The results of this approach are usually quite good. However, the more business transactions and stakeholders a company has, the faster maintaining these templates becomes a real management challenge. And simply applying existing templates to formats that the machine has never seen before is hopeless.
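To make the limitation concrete, here is a minimal sketch of a static template lookup. The supplier names, field names, and coordinates are purely hypothetical and are not taken from any real system; the point is only that a layout without a template yields no hint about where to read.

```python
from typing import Dict, Optional, Tuple

Region = Tuple[int, int, int, int]  # (left, top, right, bottom) in page pixels

# Hypothetical static templates: for each known supplier, the region in
# which each field is expected to appear on the page.
TEMPLATES: Dict[str, Dict[str, Region]] = {
    "supplier_a": {
        "invoice_number": (400, 50, 580, 70),
        "total": (450, 700, 580, 720),
    },
    "supplier_b": {
        "invoice_number": (40, 120, 200, 140),
        "total": (40, 650, 200, 670),
    },
}


def field_region(supplier: str, field_name: str) -> Optional[Region]:
    """Return the region to read for a field, or None if nothing is known."""
    template = TEMPLATES.get(supplier)
    if template is None:
        # Unseen layout: a static template gives no hint where to look.
        return None
    return template.get(field_name)
```

Every new supplier or layout change means another entry to create and maintain, which is where the management effort described above comes from.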

Machines are vulnerable to noise

Another challenge comes from the quality of the documents we encounter in daily business. If documents are not cleanly printed, or are poorly photographed or scanned before being forwarded to the mailbox, they can quickly become problematic for the OCR machine. When interpreting the input values, the machine often “sees” too much noise and is therefore unable to provide usable data.

A step towards autonomous document processing

We have specifically addressed these and other problems and, in the process, developed methods that enable us to take a big step towards autonomous document processing. That’s why we have registered several of these methods with the patent office.

For the processing of documents, for example, we have set up a multi-layer process that can be carried out either quickly or very quickly, depending on the quality of the input. Let’s assume that there is a file of poor quality in the pipeline. In that case, the document is first processed by various specially developed enhancement algorithms and is only passed on once its contents are clearly recognizable by the OCR machine. The pre-processing of such documents can of course be developed further at will.

So, besides incremental improvements, we have several new features on the roadmap and will release them in the course of 2020. We are also working on a way to batch process large numbers of different scanned documents without any separator sheets. For this, however, the system must be robust and fast enough to recognize the different document types and to assign the individual pages to the right documents. When we can go live with it is therefore still unclear. But once it is feasible, it will enable us to greatly increase the level of end-to-end automation of any document-intensive process. That much is certain.
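The quality-dependent pre-processing described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our actual implementation: the page representation, the ocr_confidence and denoise functions, and the 0.8 threshold are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

Page = Dict[str, float]  # hypothetical page representation, e.g. {"noise": 0.6}


def ocr_confidence(page: Page) -> float:
    """Stand-in for an OCR engine's confidence score in [0, 1]."""
    return 1.0 - page["noise"]


def denoise(page: Page) -> Page:
    """Stand-in enhancement step that halves the noise level."""
    return {**page, "noise": page["noise"] / 2}


def prepare_for_ocr(
    page: Page,
    enhancers: List[Callable[[Page], Page]],
    threshold: float = 0.8,
) -> Tuple[Page, int]:
    """Apply enhancement steps only until the page is readable enough.

    Returns the (possibly enhanced) page and the number of steps applied.
    """
    steps_applied = 0
    for enhance in enhancers:
        if ocr_confidence(page) >= threshold:
            break  # good enough: skip the remaining, more expensive steps
        page = enhance(page)
        steps_applied += 1
    return page, steps_applied


clean_page, clean_steps = prepare_for_ocr({"noise": 0.1}, [denoise] * 3)
noisy_page, noisy_steps = prepare_for_ocr({"noise": 0.6}, [denoise] * 3)
```

In this sketch a clean page skips enhancement entirely, while a noisy page passes through as many steps as it needs, which is why processing time depends on input quality.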

The problem of almost endless template libraries is going to be solved by the Parashift Learning Network. The network allows our customers to benefit from all the standard documents that have been processed by our engine and from everything it has learned from them. In other words, the more documents of a certain type we have processed, and the more varied they are, the greater the probability that the engine will understand the nature of the document type and be able to determine and extract the relevant data. As a result, the engine will become better and better, and more and more customers will want to work with us because of the superior quality we can provide. Hence, we have a self-reinforcing mechanism working in favor of our clients and ourselves.

Today we are particularly strong on German invoices and receipts, where validation of the data only needs to be done in special cases. However, our engineering team is working simultaneously on 64 other standard document types such as credit card statements, bank receipts and insurance policies, which will go live at the beginning of 2020. High-quality results in other languages can also be expected in the foreseeable future, because we already have several customers for whom we process documents in English, French and Italian.

Since we also want to be able to adapt to the individual needs of our customers beyond our standard document types, we are currently developing a tool that will enable even non-technical employees to configure complex document types for the machine in just a few hours and train it with a sample set of 10 to 1000 documents. However, what the machine learns from these individual documents will not be shared with the network, as is the case with standard document types. The number of training documents required for reliable automation is highly dependent on the complexity of the document type. To be more concrete about the tool, customers will be able to build on the standard document types and define additional fields that they want to have extracted. The interface for validation, if customers want to do that themselves, as well as the API will adapt automatically to the new configurations. The visualization below illustrates the process step by step.


What is really groundbreaking about this new tool is that it reduces a process that used to take several weeks or even months to a single day.
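As a rough illustration of building on a standard document type and defining additional fields, consider the sketch below. The DocumentType class, the extend helper, and all field names are hypothetical and only meant to convey the idea; they do not describe the actual tool or API.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DocumentType:
    """Hypothetical description of a document type and its extractable fields."""
    name: str
    fields: Tuple[str, ...]


def extend(base: DocumentType, name: str, extra_fields: Iterable[str]) -> DocumentType:
    """Create a custom document type on top of a standard one."""
    extra = tuple(extra_fields)
    duplicates = set(base.fields) & set(extra)
    if duplicates:
        raise ValueError(f"fields already defined on {base.name}: {sorted(duplicates)}")
    # The base type is left untouched; the custom type gets its own field list.
    return DocumentType(name=name, fields=base.fields + extra)


STANDARD_INVOICE = DocumentType(
    name="invoice",
    fields=("invoice_number", "invoice_date", "total_amount"),
)

custom_invoice = extend(
    STANDARD_INVOICE,
    name="invoice_with_project",
    extra_fields=["project_code", "cost_center"],
)
```

Keeping the standard type immutable in this sketch mirrors the separation described above: customer-specific additions live in their own configuration and do not alter the shared standard type.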

A foundation for individual development

Another practical characteristic of our solution is that the architecture of the software enables developers to quickly implement additional post-extraction processes, which in turn can replace RPA cleanup processes. One example is the implementation of AI solutions that use Natural Language Processing (NLP) to compare the extracted data with master data in other IT systems for verification. Other examples are software solutions for fraud prediction and basic anomaly detection.
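A very simple post-extraction check of this kind might look like the following sketch. It uses plain string similarity from Python's standard library (difflib) as a stand-in for a real NLP component, and the master data list and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional

# Hypothetical master data, e.g. supplier names from an ERP system.
MASTER_SUPPLIERS = ["Muster AG", "Beispiel GmbH", "Example Ltd"]


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def best_master_match(
    extracted: str,
    master_names: List[str],
    min_ratio: float = 0.85,
) -> Optional[str]:
    """Return the most similar master record, or None if nothing is close enough."""
    best_name, best_ratio = None, 0.0
    for candidate in master_names:
        ratio = SequenceMatcher(None, normalize(extracted), normalize(candidate)).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = candidate, ratio
    return best_name if best_ratio >= min_ratio else None
```

An extracted value with a typical OCR slip, such as "Beispie1 GmbH", would still be matched to "Beispiel GmbH", while a name with no close counterpart returns None and could be routed to manual validation.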

With our API solution, we create an essential basis for the introduction or further development of revolutionary processes that make a major contribution to the growth of leading companies. After all, process innovations such as Robotic Process Automation (RPA) are only as good as the data they use as the foundation for automation. What is still hidden in documents, images and other media today will be available in the foreseeable future and will offer unprecedented potential for the redesign of organizations and processes.
