Generative vs. Discriminative AI: How Parashift makes Document Automation smarter

Parashift Platform | August 29, 2025

This article was written by our ML team. It is the beginning of a multi-part blog mini-series in which we shine some light on how we use ML / AI at Parashift to solve intelligent document understanding problems. The first two parts are a short introduction to general key topics and concepts in AI and are not specifically focused on Parashift. We start by discussing the difference between discriminative and generative models. In the third part we will talk about our current efforts related to grounding LLM answers in documents; grounding in this context means the ability to link the answer of an LLM to explicit tokens in the document. The fourth part touches on more use cases of both methods in our product, and the last part is an outlook on where we want to go next.

Discriminative Models vs Generative AI

First we examine two different types of AI models, starting with an introduction that highlights their conceptual differences. As we will see, those differences influence how and where we want to use these models in practice.

Discriminative Models

These models learn how to discriminate / differentiate between "things". Think of them as "labeling machines": you show the machine a thing and it assigns a specific label and a number to it. The label determines what the thing was classified as, and the number reflects the confidence in that label. A common example is an image classifier that identifies animals. If this model is trained on, say, 100 different animal species, all it can do is look at images and assign the best-matching label. The model cannot come up with a new label but is restricted (and biased) by the labels it was exposed to during training. Such models are also referred to as classifiers.
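To make this concrete, here is a deliberately tiny sketch (not one of our actual models): a fixed label set and a softmax step that turns raw model scores into a label plus a confidence. The label names and scores are invented for illustration.

```python
import math

# Toy sketch: a discriminative classifier can only choose among the labels
# it was trained on, and attaches a confidence number to its choice.
LABELS = ["cat", "dog", "horse"]  # fixed at training time

def classify(scores):
    """Turn raw per-label scores into (label, confidence) via softmax."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    confidences = [e / total for e in exps]
    best = max(range(len(LABELS)), key=lambda i: confidences[i])
    return LABELS[best], confidences[best]

# Scores would come from a trained model; here they are hand-written.
label, conf = classify([2.1, 0.3, -1.0])
print(label, round(conf, 2))
```

Note that no matter what image produced the scores, the output is always one of the three training labels, which is exactly the restriction described above.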
Many types of IDP tasks can be formulated in a way that discriminative models can be trained to solve them. The approach always boils down to assigning a label to something. Use cases differ in the kind of label that is assigned and the "thing" it is assigned to. In the following examples we continuously increase the granularity of those label assignments and see how this solves different tasks.

Document Classification (one label per document)

The most straightforward use case: during training the model is fed documents (and labels) of different types to learn how to distinguish them. Here the labels are all the document types we need to distinguish (e.g. 'invoice', 'passport', 'gym membership card'). Although the task is well defined and straightforward, there is quite a lot of creative freedom in how the model is trained to perform it. It could rely on visual features, on textual features, or a combination of both, to name a few design decisions.

Page Sequence Separation (one label per page)

Now we start from an ordered sequence of pages, and the task is to evaluate which pages belong together to form an actual document. This is useful if one scans physical mail, resulting in one large PDF containing all documents. We opted for an approach that trains a model to predict two labels per page: "this is a first page" and "this is a last page". Based on these predictions we can then infer where to split the page sequence into individual documents. See below a schematic example of a page sequence consisting of 6 pages that results in 3 individual documents.

Information Extraction (one label per token)

Now let's see how to apply discriminative models to extract information such as the sender address, payable amounts, or delivery dates. If we look at this problem through the lens of a classifier, we must build a system that assigns labels to only specific parts of the document.
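Returning briefly to page sequence separation: the step from per-page first/last predictions to document boundaries can be sketched as follows. This is a simplified illustration, not the production model; the hand-written booleans stand in for thresholded model confidences, and the 6-page example mirrors the schematic above.

```python
# Toy sketch of turning per-page "first page" / "last page" predictions
# into document boundaries. In practice these flags would be thresholded
# confidences coming from the trained page classifier.
pages = [
    {"is_first": True,  "is_last": False},  # page 1
    {"is_first": False, "is_last": True},   # page 2 -> doc 1 = pages 1-2
    {"is_first": True,  "is_last": True},   # page 3 -> doc 2 = page 3
    {"is_first": True,  "is_last": False},  # page 4
    {"is_first": False, "is_last": False},  # page 5
    {"is_first": False, "is_last": True},   # page 6 -> doc 3 = pages 4-6
]

def split_documents(pages):
    docs, current = [], []
    for i, page in enumerate(pages, start=1):
        if page["is_first"] and current:  # a new document starts -> close the previous one
            docs.append(current)
            current = []
        current.append(i)
        if page["is_last"]:               # the document explicitly ends here
            docs.append(current)
            current = []
    if current:                           # trailing pages without a "last page" label
        docs.append(current)
    return docs

print(split_documents(pages))  # -> [[1, 2], [3], [4, 5, 6]]
```

The two labels are deliberately redundant: either one alone would suffice in the clean case, but having both lets the splitting logic recover gracefully when one of the two predictions is wrong.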
For example, such a system can assign the label "company name" to the document text "Parashift AG". As a further step, one can train a model to predict which of these tokens belong together (so-called link prediction). The model then has to decide whether "Parashift AG" is one entity or two separate ones: "Parashift" & "AG".

Embeddings and Tokenization

Now we know how to conceptually formulate those IDP tasks for a discriminative model. There are at least two more fundamental concepts to understand to get a clearer picture of how such models work: embeddings & tokenization. Since the task boils down to "assign label to thing", we need a way to meaningfully represent the "things" so the models can distinguish them (where a thing is a document, a page, or the tokens of a document). This is where embeddings come into play.

Embeddings

AI models do not work directly with what we humans understand as images, audio, documents, text, or tokens; they only understand numbers. The act of transforming real-world objects into "a list of numbers" is called embedding or vectorization. The important aspect is to find a representation in which those numbers carry meaning for the problem you are trying to solve.

RGB as example

This section is not tied to IDP in any way; it is just a generic example of how we can encode information into a list of numbers. A nice visual example of using numbers to represent information is color. On your screen the color of each pixel can be represented as three RGB values (where R: red, G: green, B: blue). We can look at this as a type of embedding consisting of 3 numbers representing the [R, G, B] values. For our example we decide that each number takes on values from 0..1 (pixels are usually represented with numbers from 0..255, but we make our lives easier with 0..1). Red is represented as [1, 0, 0], green as [0, 1, 0], and [0, 0, 1] corresponds to blue.
This enables us to express ANY color as a combination of these 3 numbers, for example yellow = red + green -> [1, 1, 0]. We pick the RGB example only because it is easy for humans to visualize (as color). Modern LLMs like ChatGPT, Claude, or Llama use embeddings that are significantly larger than our example, typically in the range of a few thousand numbers per embedding!

With this in mind, assume we build a token embedding model that is only allowed to use 3 numbers to represent the meaning of a token (so that it is easy to visualize). Looking at the output of the model below, we see that addresses have a reddish tone, numbers appear green, and dates have a blue tint. How the model learned this is unclear; it is purely an observation at this point. Upon further inspection we find exceptions to that observation: both the street numbers and zip codes in addresses are orange, not green. This means our model is smart enough to change the embedding of a token based on its context (= the tokens surrounding it), i.e. the embeddings are dynamic. In contrast, static embeddings are fixed lookups where, for example, the token 9155 will always be assigned the same embedding, no matter the context.

We will see how Parashift leverages static and dynamic embeddings for grounding LLM answers in the actual document in the third part of this blog series. By grounding we mean "identifying which tokens in the document make up the answer", which allows us to provide users with the exact coordinates belonging to the answer (down to the token level, not just the paragraph level).

Static embeddings are limited in their expressiveness: for example, they fail to distinguish homonyms (words that have the same spelling but different meanings depending on context). An example is the word lead in:
– My actions lead to consequences
– I don't like lead in my drinking water
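The static/dynamic distinction can be illustrated with a toy example (not a real embedding model; the vocabulary and all vector values are invented): a static embedding is a plain lookup table, while a crude stand-in for a contextual model mixes each token's vector with those of its neighbours, so the homonym "lead" ends up with different embeddings in different contexts.

```python
# Toy illustration: static embeddings are a fixed lookup table, so the
# homonym "lead" always maps to the same 3-number vector.
STATIC = {
    "lead":        [0.5, 0.5, 0.0],
    "water":       [0.0, 0.2, 0.9],
    "consequence": [0.9, 0.1, 0.0],
}

def static_embed(tokens):
    return [STATIC[t] for t in tokens]

def dynamic_embed(tokens):
    """Crude stand-in for a contextual model: mix each token's static
    vector with the average of its neighbours' vectors."""
    static = static_embed(tokens)
    out = []
    for i, vec in enumerate(static):
        neighbours = static[i - 1:i] + static[i + 1:i + 2]
        if not neighbours:  # single-token input: nothing to mix in
            out.append(vec)
            continue
        ctx = [sum(v[d] for v in neighbours) / len(neighbours) for d in range(3)]
        out.append([round(0.7 * vec[d] + 0.3 * ctx[d], 3) for d in range(3)])
    return out

# Static: identical vector for "lead", regardless of context.
assert static_embed(["lead", "consequence"])[0] == static_embed(["lead", "water"])[0]
# Dynamic: the two occurrences of "lead" drift apart.
print(dynamic_embed(["lead", "consequence"])[0])  # -> [0.62, 0.38, 0.0]
print(dynamic_embed(["lead", "water"])[0])        # -> [0.35, 0.41, 0.27]
```

Real contextual models (transformer encoders) compute this mixing with learned attention weights over the whole sequence rather than a fixed neighbour average, but the effect is the same: one token, different embeddings in different contexts.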
Tokenization

Another crucial step for turning information extraction into a classification problem is to divide the document into smaller parts. These can be paragraphs, lines, words, or tokens. We can then train models to assign labels to those parts. The distinction between a word and a token is a bit blurry here, but as a first approximation think of a token as a "short word or sub-word". Dividing a document or text into tokens is done by a tokenizer, and it is important to note that there is no agreed-upon method for how to best split a text into tokens. Below is a visualization of how the GPT-4 tokenizer splits up an example sentence; tokens are visualized as colored rectangles. You can also go to this interactive tool and play with different tokenizers yourself.

Once we have the document tokenized, we can assign labels to the tokens. One can see how assigning labels such as sender-name, sender-street, and sender-house-number can finally serve information extraction. To close the loop back to information extraction: looking at the embeddings above, one can hopefully imagine training another model that takes those embeddings and uses them to assign the labels one is interested in to the tokens. For this we train a separate model on different training examples to figure out the relationship between the embeddings and the task the model actually has to solve. It is easy to see that with such an embedding model it would be easy to extract information about "addresses" or "dates". If we are after information different from these two cases, we are probably out of luck, since the model cannot distinguish most of the other tokens from each other (they all have the same color). This highlights the need to train embeddings that are generic enough to solve a wide range of potential downstream tasks, but at the same time capture enough information to make the job easier for the subsequent classification models.
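The "embeddings in, labels out" step can be sketched with a deliberately tiny downstream classifier. All vectors and label prototypes below are invented for illustration: each label gets one prototype vector mirroring the colour picture above (reddish = address, greenish = number, bluish = date), and each token embedding is assigned the label of its nearest prototype.

```python
# Toy downstream classifier on top of 3-number token embeddings.
# Prototype vectors are invented for illustration, mirroring the colour
# observation above: reddish -> address, greenish -> number, bluish -> date.
PROTOTYPES = {
    "address": [0.9, 0.1, 0.1],
    "number":  [0.1, 0.9, 0.1],
    "date":    [0.1, 0.1, 0.9],
}

def label_token(embedding):
    """Assign the label whose prototype is closest (squared distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(PROTOTYPES, key=lambda lbl: dist(embedding, PROTOTYPES[lbl]))

# One made-up embedding per token of a fictional address line:
tokens = {
    "Hauptstrasse": [0.8, 0.2, 0.1],
    "12":           [0.2, 0.8, 0.2],
    "29.08.2025":   [0.1, 0.2, 0.9],
}
for tok, emb in tokens.items():
    print(tok, "->", label_token(emb))
```

A real downstream model is a trained classifier rather than a hand-written nearest-prototype rule, but the limitation discussed above shows up the same way: any token whose embedding does not fall near a learned "address", "number", or "date" region simply cannot be told apart from the rest.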
Summary of discriminative models

These models come with a few pros and cons when used to solve tasks in real-world applications.

The Pros

– By predicting a specific label for each token in the document, it is straightforward to pinpoint the tokens that correspond to the extracted information.
– Every prediction comes with a confidence score which reflects how sure the model is that it has found the correct value. By calibrating the confidences against actual benchmarks on a given task, one can get a reliable measure of how trustworthy a model's predictions actually are.
– Discriminative models are usually on the small side, especially compared to current LLMs. They typically range from a few million to a few hundred million parameters and can be trained in minutes to hours on a single GPU. This makes it feasible to train multiple specialized models tailored to specific tasks.
– The small size of these models also brings speed, where we measure speed as the time it takes for a model to generate an interpretable answer. Typical values in document understanding are on the order of milliseconds for a handful of pages, so small documents can be processed in under a second.
– Privacy: by their very nature, it is not possible for these models to reveal sensitive training data between customers. All these models do is add labels to existing text / tokens in a document, so there is no way a model could directly reveal, for example, addresses that were in the training data.
– A lot of the heavy lifting can be done by the embedding models, whose embeddings can then be shared across multiple downstream tasks.

The Cons

A fundamental limitation is that they cannot naturally generalize their answers to unseen problems, even if the underlying embedding model is powerful enough to generalize to more problems than we trained our downstream model on.
For each new problem we have to either extend the training of an existing model or train a new one specialized in the new task. Since the advent of LLMs and their astonishing generalization capabilities, this approach seems almost antiquated. We would not want to train a new model from scratch if we now also wanted to extract phone numbers from documents. Assigning labels is a fundamentally less general way of solving problems than the generative model approach, and there are many use cases that simply cannot be tackled with purely discriminative models.

This is just the start of the conversation. If you'd like to explore how Parashift can support your automation journey, don't hesitate to contact us.