Why extraction rates don’t matter
Document processing software is used whenever documents need to be classified automatically and data extracted from them. In practice, this means determining what kind of document it is (contract, correspondence, invoice, …) and which key data must be extracted from it (customer name, numbers, amounts, …).
In any case, such systems should relieve users of manual effort. How well they work, and how much effort they actually save, is often measured by extraction rates. These rates indicate, as a percentage, in how many cases data is extracted fully automatically without human intervention.
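To make the metric concrete, here is a minimal sketch of how a field-level extraction rate could be computed. The field names and sample data are hypothetical and not tied to any specific product:

```python
# Minimal sketch: computing a field-level extraction rate.
# Field names and sample data are hypothetical, for illustration only.

def extraction_rate(results):
    """Percentage of fields extracted correctly without human intervention."""
    total = sum(len(doc) for doc in results)
    correct = sum(
        1
        for doc in results
        for field, (extracted, expected) in doc.items()
        if extracted == expected
    )
    return 100.0 * correct / total if total else 0.0

# Each document maps a field name to (extracted value, ground-truth value).
docs = [
    {"customer": ("ACME Corp", "ACME Corp"), "amount": ("120.00", "120.00")},
    {"customer": ("ACNE Corp", "ACME Corp"), "amount": ("99.50", "99.50")},
]
print(extraction_rate(docs))  # 3 of 4 fields correct -> 75.0
```

Note that the same system can be quoted with very different rates depending on whether you count per field or per document; a document-level rate (all fields correct) is always lower or equal.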
Accordingly, great importance is attached to these extraction rates, and they come up in almost every sales pitch for a new “OCR software”. (Strictly speaking, OCR refers to the recognition of machine-printed text on a document, but the term is often used synonymously with software that classifies documents and extracts data from the OCR full text.)
Extraction rates in sales talks: more gut feeling than well-founded knowledge
In sales, you quickly learn to skillfully sidestep this question, because you can only lose by answering it. Either a competitor has promised a higher, magical number plucked from thin air and you lose without ever being tested. Or later, once the project is in a productive environment, the figure comes back like a boomerang with the sentence: “But you promised at the time that the system would recognize XX% of the data by itself!” (Usually this problem, unfortunately, doesn’t bother the salesperson, because by then a technician or support employee has to deal with the customer :-))
The fact is, depending on the documents and document types to be processed, the rates can be higher or lower. This depends mainly on the following factors:
- Are the documents available in good quality (scan, photo, scanned copy of a photo with poor resolution at low exposure, …)?
- Has the platform already processed many similar documents?
- Is the configuration prepared for certain special cases?
- Do different character sets have to be processed (Latin, Greek, Cyrillic, …)?
- What kind of data must be extracted (simple header data, complex data records, or tabular line-item data)?
Put the use case in the foreground
What the question of extraction rates often ignores, however, is the actual use case for which such software is introduced. As I said at the beginning, the bottom line is always to automate an activity. The only question is: Why should this activity be automated?
In most cases, it boils down to the following two scenarios:
1. The aim is to make it as easy as possible for end-users to enter data quickly so that they can continue working with the result immediately. This is the case, for example, when documents are sent to an insurance company or bank via an app. Core data is extracted and displayed to the user so that further information can be added if necessary. If data has not been recognized correctly, the user must complete it manually.
2. The end-user is to be relieved of the tedious and laborious collection of data in order to continue working with the result promptly. This case differs in that it is not about getting a result as quickly as possible, but about obtaining particularly good, clean data, since manual collection would otherwise take a great deal of time. Whether the automation takes 2 seconds or 3 hours is of secondary importance. Examples include the capture of long, complex invoices with many line items, which are required for a purchase order check or for reporting.
In case 1, incorrect extraction results lead to dissatisfied users. In the example with the insurance company, in the worst case, this leads to users not using the app and preferring to continue sending in their invoices on paper, thus generating work for the insurance company instead of the end-user.
Nevertheless, the focus here should be less on a system with excellent extraction rates and more on a system with reasonable extraction rates that is permanently improving towards perfection. A system that does not learn and keeps making the same mistakes over and over again is pointless, and it leaves users dissatisfied too. But when users notice that the extraction keeps getting better, they are willing to continue using the product because they can see it is being worked on.
In the second case, incorrect extraction results lead to high manual effort in the internal process. Taking line-item data as an example, a single incorrectly read line item can mean that all line items must be checked to find the error. Here, incomplete data is no better than no data at all; only complete and correct data relieves the user of work. Therefore, rely on a service that ensures fully validated, i.e. correctly checked, data rather than on software that delivers good but sometimes incomplete or incorrect results.
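One common way such validation works in practice is a consistency check on the extracted data, for example verifying that the line items add up to the invoice total before the document is released. This is a generic sketch with illustrative field names, not any specific product’s implementation:

```python
from decimal import Decimal

# Generic sketch of a consistency check on extracted invoice data.
# Field names are illustrative, not any specific product's schema.

def line_items_consistent(line_items, invoice_total, tolerance=Decimal("0.01")):
    """Return True if the line items sum to the invoice total.

    A False result means the document should be routed to human review
    instead of being passed downstream as 'validated' data.
    """
    total = sum(Decimal(item["amount"]) for item in line_items)
    return abs(total - Decimal(invoice_total)) <= tolerance

items = [{"amount": "100.00"}, {"amount": "49.90"}]
print(line_items_consistent(items, "149.90"))  # True: totals match
print(line_items_consistent(items, "159.90"))  # False: route to human review
```

A check like this is what turns “good extraction results” into validated data: a mismatch does not say which line item is wrong, but it reliably catches that something is wrong before the data reaches the purchase order check or reporting.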
Long-term perfection versus short-term, expensive mediocrity
In any case, too little is often asked about the actual result to be achieved, and too much emphasis is placed on perfect extraction rates. Even if the title says otherwise, these are important; nobody wants a system that only recognizes 50% of the data correctly.
But a system that can deliver fully validated data, and that has set itself the goal of delivering perfect extraction rates without human interaction in the future, is more valuable in the medium term than a system that delivers good results now but only through high initial effort and heavy tailoring to a specific document type.
So, the next time you talk about extraction rates, think carefully about what you want to use them for and which problem you want to solve by automating document processing.
If you want to know how Parashift can provide you with perfect metadata today and how this is related to our vision of fully autonomous document processing, you can contact me directly via the following button.