In the last article we looked at the differences between decentralized and centralized scanning, along with their advantages and disadvantages. Now we turn to what makes a good scan. Many different factors go into every scanned document, from the file format through compression to resolution, and each of them affects all downstream processes. There are therefore many possible combinations, some of which suit certain tasks better than others. If a person wants to read data from a scanned document, the document must meet certain requirements: it has to be legible, for example. The same is true when documents are read with OCR software. Here, too, some standards have to be met; otherwise, the results of automated data extraction tend to be clearly worse.
Three modes are typically used for scanning: bitonal, grayscale and color. For Parashift it does not matter whether the document was scanned in color or in black and white; the algorithms work in both cases. So the choice of color mode is up to you.
As you certainly know, there is a variety of file formats that can be used for scanning documents. Some of the most common are PDF, PDF/A, TIFF, JPG and PNG.
Again, the Parashift Platform does not care which format you choose to work with. The important thing is that you choose a file format that suits your business context. In concrete terms, this means that the file format should be suitable both for long-term use and for your archiving purposes.
Compression is an important consideration, because an unsuitable compression setting can result either in loss of data or in too much storage space being used. Without any compression the scan is particularly accurate, but the file is too large for archiving purposes.
If you are processing documents with us, the documents sent to our API can already be compressed. A size of 40 to 50 kilobytes per scanned bitonal (black and white) A4 page is optimal. Larger files are of course no problem either, although they are less suitable for archiving.
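To put the 40 to 50 kilobyte figure into perspective, the following sketch estimates how large an uncompressed A4 page is in each color mode and how strongly it would have to be compressed to reach a given target size. It is back-of-the-envelope arithmetic, not Parashift code: the function names are made up for illustration, the bit depths of 1, 8 and 24 bits per pixel are assumed typical values, a kilobyte is taken as 1000 bytes, and file headers and row padding are ignored.

```python
# Rough size arithmetic for one scanned A4 page (an illustrative sketch).
A4_WIDTH_MM = 210.0
A4_HEIGHT_MM = 297.0
MM_PER_INCH = 25.4

# Bits stored per pixel in each scan mode (assumed typical values).
BITS_PER_PIXEL = {"bitonal": 1, "grayscale": 8, "color": 24}


def a4_pixels(dpi):
    """Return (width, height) of an A4 page in pixels at the given DPI."""
    width = round(A4_WIDTH_MM / MM_PER_INCH * dpi)
    height = round(A4_HEIGHT_MM / MM_PER_INCH * dpi)
    return width, height


def uncompressed_bytes(dpi, mode):
    """Raw size of one A4 page in bytes, ignoring headers and row padding."""
    width, height = a4_pixels(dpi)
    return width * height * BITS_PER_PIXEL[mode] // 8


def compression_ratio(dpi, mode, target_bytes):
    """Factor by which the raw page must shrink to reach target_bytes."""
    return uncompressed_bytes(dpi, mode) / target_bytes


if __name__ == "__main__":
    raw = uncompressed_bytes(300, "bitonal")
    # 45_000 bytes is the middle of the 40 to 50 kilobyte range.
    print(raw, round(compression_ratio(300, "bitonal", 45_000), 1))
```

Under these assumptions, a bitonal A4 page at 300 DPI measures 2480 by 3508 pixels and takes about 1.09 megabytes uncompressed, so a 45 kilobyte file corresponds to a compression ratio of roughly 24:1. The same page in 24-bit color is 24 times larger before compression.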
The scan resolution is probably the most critical factor for a successful document scan. It is typically measured in dots per inch, or DPI for short. As the name suggests, this number expresses how many dots fit on one inch. It should not be confused with pixels per inch, or PPI.
As a rule of thumb, a resolution of 300 DPI is sufficient for a scan. If the scan is done at 200 DPI or less, the file is smaller, but it also contains correspondingly less data. With Parashift and other OCR providers, a low resolution may therefore cause problems in recognition and extraction. For example, if the resolution is too low, the probability that the letter B is read as the number 8, or the number 1 as a lowercase L, is much higher. To illustrate this problem, the following graphics show a section of a scanned A4 document, differing only in resolution. The documents are sorted by ascending scan resolution.
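The rule of thumb can be turned into a simple pre-flight check before documents are sent to OCR. The sketch below estimates the scan resolution from the pixel dimensions of an image, under the assumption that the image covers exactly one full A4 page and that one scanned pixel counts as one dot. The function names, the 1% tolerance and the verdict strings are illustrative choices, not part of any Parashift API.

```python
A4_WIDTH_IN = 210 / 25.4   # about 8.27 inches
A4_HEIGHT_IN = 297 / 25.4  # about 11.69 inches


def effective_dpi(width_px, height_px):
    """Estimate scan resolution, assuming the image covers one full A4 page."""
    # Sort the sides so that portrait and landscape images are handled alike.
    short_px, long_px = sorted((width_px, height_px))
    return min(short_px / A4_WIDTH_IN, long_px / A4_HEIGHT_IN)


def resolution_verdict(width_px, height_px, target_dpi=300):
    """Return "ok" if the estimated DPI reaches the target, else "too low"."""
    # Allow 1% slack because pixel dimensions are rounded to whole numbers.
    if effective_dpi(width_px, height_px) >= target_dpi * 0.99:
        return "ok"
    return "too low"


def pixel_fraction(dpi, reference_dpi=300):
    """Share of pixels a scan at `dpi` has compared with `reference_dpi`."""
    return (dpi / reference_dpi) ** 2


if __name__ == "__main__":
    print(resolution_verdict(2480, 3508))  # A4 page scanned at 300 DPI
    print(resolution_verdict(1654, 2339))  # A4 page scanned at 200 DPI
    print(round(pixel_fraction(200), 2))
```

Because the pixel count grows with the square of the resolution, a 200 DPI scan holds only about 44% of the pixels of a 300 DPI scan of the same page, which is why the drop in legibility is larger than the DPI numbers alone suggest.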
The first image is clearly much more difficult to read, and the error rate is correspondingly higher. The number 1 can easily be misread as a lowercase L here. At 300 DPI, however, the situation is different. With this resolution the scan will be error-free.
The font size also plays a role alongside the resolution. If text with a font size of 5 is scanned at 300 DPI on an A4 page, the performance of OCR solutions is limited quite significantly.
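The interaction between font size and resolution can be made concrete with a little arithmetic. The sketch below assumes that the font size is given in typographic points (1 point = 1/72 inch) and computes how many pixels tall the font's nominal height becomes at a given DPI. The 10 point reference size and the function names are assumptions chosen for illustration; actual letter heights are smaller than the nominal size and vary by typeface.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def em_height_px(font_size_pt, dpi):
    """Nominal font height in pixels when scanned at the given DPI."""
    return font_size_pt / POINTS_PER_INCH * dpi


def dpi_for_same_detail(font_size_pt, ref_pt=10, ref_dpi=300):
    """DPI at which font_size_pt text gets as many pixels as ref_pt at ref_dpi.

    The 10 pt reference is an assumed "normal" body text size.
    """
    return ref_dpi * ref_pt / font_size_pt


if __name__ == "__main__":
    for size in (5, 10, 12):
        print(size, round(em_height_px(size, 300), 1))
    print(dpi_for_same_detail(5))
```

At 300 DPI, size 5 text has a nominal height of about 21 pixels, half of what size 10 text gets. Under these assumptions it would take a 600 DPI scan to give the small text the same amount of detail.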
The consequences of a low resolution are lower quality and a slower character recognition engine. Since the characters on the document are sometimes ambiguous, several recognition variants must be processed, which understandably takes more time. For optimum results, the resolution should therefore be 300 DPI and the font sizes should not be too small.