Data Extraction from Invoice-Type Documents using Large Language Models
Towards tailor-made AI models that adapt to change and meet the demands of a dynamic real-world environment.
Extracting data from invoices is a crucial task for many businesses, as it allows them to accurately track their expenses and payments. The information in invoices, such as supplier names, account numbers, dates, and item descriptions, is important for any company that deals with purchasing, accounting, supply management, data analytics, and similar functions.
Manually extracting this data from paper or digital documents is often time-consuming, prone to errors (typos or transposition errors), and costly, as it may require a team of people to perform the task. It may take a person 5 minutes on average to read and understand an invoice and then type the information, or copy and paste it, into destination software (e.g. an ERP system).
By automating the process of data extraction from invoices, businesses can reduce the time and resources required for manual data entry. Having accurate and up-to-date information about expenses and payments can help businesses make informed decisions about their finances and operations. In addition, extracting data from invoices can help with:
Improving cash flow and payment efficiency by reducing errors and speeding up the processing of invoices and payments
Detecting fraud and duplicate entries by matching invoice line items to purchase orders and contracts
Analyzing spending patterns to identify where costs and waste can be reduced and efficiency improved
Calculating CO2 emissions or carbon footprint by analyzing expenses and payments related to energy consumption and transportation
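To make the fraud and duplicate check from the list above more concrete, here is a minimal sketch in Python. It is an illustration only: the field names, the sample records, and the simplification of comparing invoice totals (rather than individual line items) against purchase orders are all assumptions made for this example, not a description of any particular system.

```python
from collections import Counter

def find_duplicates(invoices):
    """Return (supplier, invoice_number) keys that occur more than once."""
    counts = Counter(
        # Normalise the supplier name so trivial differences do not hide duplicates
        (inv["supplier"].strip().lower(), inv["invoice_number"])
        for inv in invoices
    )
    return sorted(key for key, n in counts.items() if n > 1)

def find_po_mismatches(invoices, purchase_orders):
    """Return invoice numbers whose PO is unknown or whose total differs from it."""
    mismatches = []
    for inv in invoices:
        expected = purchase_orders.get(inv["po_number"])
        if expected is None or expected != inv["total_cents"]:
            mismatches.append(inv["invoice_number"])
    return mismatches

# Made-up extracted records; amounts are integer cents to avoid rounding issues
invoices = [
    {"supplier": "ACME Ltd", "invoice_number": "48213", "po_number": "PO-1", "total_cents": 120000},
    {"supplier": "acme ltd ", "invoice_number": "48213", "po_number": "PO-1", "total_cents": 120000},
    {"supplier": "Globex", "invoice_number": "77", "po_number": "PO-2", "total_cents": 56000},
]
# Hypothetical purchase order totals keyed by PO number
purchase_orders = {"PO-1": 120000, "PO-2": 50000}

print(find_duplicates(invoices))
print(find_po_mismatches(invoices, purchase_orders))
```

Checks like these only become practical once the invoice fields are available as structured data, which is why reliable extraction comes first.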
Traditionally, automated data extraction has relied on rule-based methods, such as regular expressions or template matching. These methods are inflexible, labor-intensive to maintain, and error-prone. Generalizing them is also hard to get right: if the rules are too specific, they miss cases and exceptions; if they are too general, they produce false positives.
Template matching and rule-based approaches require a manually defined template or rule for every invoice format and layout, and they rely on the data being consistent, clear, legible, and located in specific positions in the document.
However, invoice documents come from many different sources, industries, and businesses. Document formats and appearance vary, and documents are often scanned (see Figure 2). Scans can be low resolution and may contain handwritten notes, annotations, typos, or errors in the data. The ever-changing nature and variations of real-world data present further challenges for rule-based methods. For example, changes in laws and regulations might affect how and what data is displayed in a document, or a company might change the layout of the data in its invoices, add more account numbers, change its visual branding, etc.
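The brittleness of rule-based extraction described above can be shown with a small Python sketch. The regular expression and both sample invoices are made up for this illustration; the rule is written for one hypothetical supplier layout and silently fails when the same information is presented differently.

```python
import re

# Hypothetical rule written for one supplier's layout: "Invoice No: 12345"
INVOICE_NO_RULE = re.compile(r"Invoice No:\s*(\d+)")

def extract_invoice_number(text):
    """Return the invoice number if the rule matches, else None."""
    match = INVOICE_NO_RULE.search(text)
    return match.group(1) if match else None

# Made-up sample texts, for illustration only
layout_a = "ACME Ltd\nInvoice No: 48213\nDate: 2023-01-15"
layout_b = "ACME Ltd\nInvoice # INV-48213\nDate: 15.01.2023"

print(extract_invoice_number(layout_a))  # 48213
print(extract_invoice_number(layout_b))  # None
```

The second layout carries the same invoice number, but the rule returns nothing because the label and number format changed. Covering it means writing and maintaining yet another rule, and the same holds for every new supplier or redesign.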
In recent years, transformer-based large language models (LLMs) have revolutionized the field of natural language processing (from entity extraction to text generation). LLMs are trained on large amounts of data from various sources and formats, enabling them to recognize patterns in the data and make predictions based on these patterns, even when the document format is complex. LLMs can leverage their general knowledge and reasoning abilities to extract data (e.g. tax rates, discounts, payment terms, etc.), cope with poor-quality documents, and work with documents in various languages.
Unlike rule-based methods, LLMs are not bound to fixed rules or layouts, which makes them more flexible and less prone to errors when formats vary.
However, general-purpose models like OpenAI's ChatGPT can be too broad to achieve the highest accuracy for a specific use case like invoice data extraction, which requires the model to understand the structure and meaning of the data in invoices.
Therefore, for the best results, it is necessary to fine-tune these models on the specific use case, in this case invoice data extraction, as we do at doXray. Fine-tuning involves training the model on a smaller, specific dataset relevant to the use case. This dataset may contain examples of invoices and their corresponding data fields, or other related texts that can help the model learn about invoices. By fine-tuning the model on this dataset, we adjust the model's parameters and align it with the invoice data extraction use case. This allows the model to make more accurate predictions and extract the required data with higher precision.
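As a rough illustration of what "examples of invoices and their corresponding data fields" could look like in practice, the Python sketch below pairs raw invoice text with its labelled fields as one JSON line. The record layout, the field names, and the helper to_training_record are hypothetical and chosen for this example; they are not doXray's actual data format or training pipeline.

```python
import json

def to_training_record(invoice_text, fields):
    """Pair raw invoice text with its labelled fields as one JSON line."""
    return json.dumps({
        "input": invoice_text,
        # The target is the structured output the model should learn to produce
        "target": json.dumps(fields, sort_keys=True),
    })

# Made-up invoice text and labels, for illustration only
example = to_training_record(
    "ACME Ltd\nInvoice No: 48213\nDate: 2023-01-15\nTotal: 1,200.00 EUR",
    {
        "supplier_name": "ACME Ltd",
        "invoice_number": "48213",
        "invoice_date": "2023-01-15",
        "total": "1200.00",
        "currency": "EUR",
    },
)
print(example)
```

A fine-tuning dataset in this style would consist of many such lines, covering a variety of suppliers and layouts, so that the model learns the mapping from document text to fields rather than the position of any single label.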
At doXray, our research has shown that periodic revisions of the model and continual learning are particularly beneficial for real-world use cases (Figure 3). This is because invoice documents change over time. With continual learning, our model can adapt to new invoice formats and continue to extract data accurately, even when it encounters previously unseen invoice formats and data layouts. This keeps our model up to date and effective in extracting data from invoices.
In summary, the combination of large language models, fine-tuning, and continual learning provides a powerful set of tools for accurately extracting data from invoice-type documents, ensuring the best possible value and experience for doXray’s clients.