In a recent article, we looked at why certain settings matter when scanning documents: if documents are not scanned correctly, the quality of the subsequent extraction suffers, which limits the cost savings you are aiming for. What we have not yet discussed is what exactly happens after scanning and how the data is extracted. Every document sent to the Parashift Platform goes through a number of steps. This article describes what happens in each step and the order in which the steps are carried out.
Companies receive documents through a variety of channels. As described in the article on Multichannel Document Processing, these can be emails with attachments such as PDFs or image files. There are also many physical documents, such as invoices, delivery notes and contracts, that reach the company by post or through branches. All documents that are not yet digital must be digitized before Parashift can extract data from them. Early scanning is particularly suitable for this: documents are scanned immediately on receipt, both for further processing and for archiving. Once the documents have been scanned with the optimal settings, they are sent to Parashift via a REST API. In an ideal integration scenario, your employees will not even notice that Parashift has been integrated into the processes, apart from the fact, of course, that they suddenly no longer have anything to do with document data entry.
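To make the hand-over step more concrete, here is a minimal sketch of how a scanned file could be wrapped in an HTTP request. The endpoint URL, the JSON field names and the token are purely hypothetical placeholders and not Parashift's actual API; the request is only built, not sent.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint and payload shape, for illustration only.
# The real URL and field names must be taken from the API documentation.
UPLOAD_URL = "https://api.example.com/v1/documents"


def build_upload_request(file_name: str, content: bytes, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that carries one scanned file."""
    payload = {
        "file_name": file_name,
        # Binary scan data is base64-encoded so it can travel inside JSON.
        "content_base64": base64.b64encode(content).decode("ascii"),
    }
    return urllib.request.Request(
        UPLOAD_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_upload_request("invoice-0001.pdf", b"%PDF-1.4 example bytes", "YOUR_TOKEN")
print(request.get_method(), request.full_url)
# Sending would be urllib.request.urlopen(request); omitted here on purpose.
```

In a real integration, a call like this would typically be triggered by the scanning software or the mail gateway, so that nobody has to upload files by hand.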
Once the documents have arrived at Parashift, an enhancement step first improves the quality of the input. A rotation check determines whether the document was scanned at an angle; if so, the document is straightened and trimmed. Camera photos are corrected where necessary to facilitate the subsequent extraction. This enhancement process runs completely automatically.
Once this process is complete, the next step is optical character recognition (OCR), which reads out all of the text on the document. In addition, the layout is analyzed and any barcodes are extracted.
Once the OCR has read the text, the document proceeds to the next step: page separation. This step analyzes whether the scan forwarded to Parashift consists of a single document or contains several documents. For example, a 15-page scan could consist of 4 invoices. In that case, Parashift’s software recognizes this and treats them as separate invoices. Empty pages are deleted. A small note: today, this functionality is not yet available in the official version. It is already under development and will be available by the end of 2020. Until then, page separation cannot be handled by Parashift and has to be ensured by other providers or methods.
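The idea behind page separation can be sketched in a few lines. This is a simplified illustration, not Parashift's implementation: it assumes that some earlier step has already marked which pages look like the first page of a document (the hypothetical `starts_document` flag), and it treats a page without OCR text as empty.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    number: int            # 1-based position in the scan
    text: str              # OCR text of the page
    starts_document: bool  # hypothetical flag: page looks like a first page


def separate_pages(pages: List[Page]) -> List[List[Page]]:
    """Drop empty pages, then group the rest into consecutive documents."""
    documents: List[List[Page]] = []
    for page in pages:
        if not page.text.strip():
            continue  # empty page: discard it
        if page.starts_document or not documents:
            documents.append([])  # open a new document
        documents[-1].append(page)
    return documents


# A 15-page scan that contains 4 invoices and one empty page (page 8).
first_pages = {1, 5, 9, 12}
empty_pages = {8}
scan = [
    Page(n, "" if n in empty_pages else f"text of page {n}", n in first_pages)
    for n in range(1, 16)
]
documents = separate_pages(scan)
print([[page.number for page in doc] for doc in documents])
```

The hard part in practice is deciding where a new document starts; the grouping itself is straightforward once that decision has been made.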
Page separation is followed by the classification of the documents. The algorithm determines what type of document is involved and stores this information for the next steps. As a Parashift customer, you can optionally carry out classification, as well as page separation, yourself.
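If you carry out the classification yourself, even a very simple approach shows what the step has to deliver: one document type per document. The following sketch uses keyword counting with made-up keyword lists; it is only an illustration and says nothing about the algorithm Parashift uses.

```python
from typing import Dict, List

# Hypothetical keyword lists; a production classifier would be learned from data.
KEYWORDS: Dict[str, List[str]] = {
    "invoice": ["invoice", "vat", "amount due", "iban"],
    "delivery_note": ["delivery note", "delivered", "shipment"],
    "contract": ["contract", "agreement", "hereby"],
}


def classify(text: str) -> str:
    """Return the document type whose keywords occur most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(word) for word in words)
        for doc_type, words in KEYWORDS.items()
    }
    best_type = max(scores, key=scores.get)
    return best_type if scores[best_type] > 0 else "unknown"


print(classify("Invoice no. 42, VAT 7.7%, amount due: 100 CHF"))
print(classify("Delivery note: shipment delivered on Monday"))
print(classify("Hello world"))
```

Plain substring counting is deliberately naive (a keyword can also match inside a longer word), which is one reason why real classifiers are trained rather than hand-written.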
Based on the identified document type, the most important data is extracted in this step. For an invoice, for example, these are the date, the addresses of the invoicing party and the recipient, their VAT numbers, the IBAN, the line items and the various amounts (among other fields). This step works without any templates, configuration or other manual work on your part, which is a significant advantage over manual or template-based document extraction. Finally, a quality check determines whether the engine was confident in its results.
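A confidence-based quality check of this kind can be pictured as follows. The sketch assumes that every extracted field comes with a confidence score between 0 and 1; the field names, the sample values and the threshold of 0.9 are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExtractedField:
    value: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical engine


# Hypothetical threshold below which a field is sent to manual review.
MIN_CONFIDENCE = 0.9


def fields_needing_review(
    fields: Dict[str, ExtractedField], threshold: float = MIN_CONFIDENCE
) -> List[str]:
    """Return the names of fields that are empty or below the threshold."""
    return sorted(
        name
        for name, field in fields.items()
        if field.confidence < threshold or not field.value.strip()
    )


# Sample values, invented for this example.
invoice_fields = {
    "date": ExtractedField("2020-06-30", 0.99),
    "iban": ExtractedField("CH93 0076 2011 6238 5295 7", 0.72),
    "total_amount": ExtractedField("", 0.95),
}
print(fields_needing_review(invoice_fields))
```

Fields flagged in this way are the natural candidates for manual post-processing, while everything else can pass through untouched.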
Post-processing and archiving
After the most important data has been extracted, post-processing follows. This is a service that clearly differentiates Parashift from the competition: if data has been extracted incompletely or incorrectly from Parashift Standard Document Types (e.g. orders, delivery notes, invoices), Parashift takes over the manual post-processing. The refined data is then returned via API to you or your leading system.
Once the data is there, you can apply business rules to it. For example, you can use the extracted line item data to perform order reconciliations, or look up terms of payment or vendors and compare them with the data recorded in your ERP. Even new suppliers are recognized thanks to the intelligent OCR solution, which works without master data reconciliation, and they can be added to the system automatically. I explained how this works in an earlier article.
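As an example of such a business rule, the following sketch reconciles invoiced line items against a purchase order. It assumes that both sides can be keyed by an article number and that the order data has already been loaded from the ERP; the data structures and sample values are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, List


@dataclass(frozen=True)
class LineItem:
    quantity: int
    unit_price: Decimal  # Decimal avoids float rounding issues with money


def reconcile(invoice: Dict[str, LineItem], order: Dict[str, LineItem]) -> List[str]:
    """Compare invoiced line items with the purchase order, keyed by article number."""
    issues: List[str] = []
    for article, invoiced in sorted(invoice.items()):
        ordered = order.get(article)
        if ordered is None:
            issues.append(f"{article}: not on the purchase order")
        elif invoiced.quantity != ordered.quantity:
            issues.append(
                f"{article}: quantity {invoiced.quantity} invoiced, {ordered.quantity} ordered"
            )
        elif invoiced.unit_price != ordered.unit_price:
            issues.append(
                f"{article}: unit price {invoiced.unit_price} invoiced, {ordered.unit_price} ordered"
            )
    return issues


# Sample data, invented for this example.
order = {
    "A-100": LineItem(10, Decimal("2.50")),
    "B-200": LineItem(5, Decimal("12.00")),
}
invoice = {
    "A-100": LineItem(10, Decimal("2.50")),
    "B-200": LineItem(6, Decimal("12.00")),
    "C-300": LineItem(1, Decimal("99.00")),
}
issues = reconcile(invoice, order)
for issue in issues:
    print(issue)
```

An empty result means the invoice matches the order and could be released without manual checks; any reported issue would go to a clerk for clarification.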
Once the business rules have been applied and the extracted data has been processed, the document is finally archived. This is usually done in a document management system (DMS) and marks the end of the document's life cycle.