Your accounts team receives 500 supplier invoices in a month, and the data is extracted automatically by an OCR system. Everything seems fine until reconciliation, when discrepancies show up: mismatched totals, minor differences in GST amounts, duplicate invoice numbers, and inconsistent vendor names. Instead of spot-checking the extracted data against the PDF files, the finance team spends its time finding and fixing errors that should have been corrected automatically.
This is not an edge case. It is one of the most frequent results of OCR capturing information from business documents incorrectly.
OCR has helped companies eliminate manual data entry, but it has significant limitations, particularly with invoices, PDFs, and other scanned documents that do not follow a clearly defined, structured layout. To understand how and why OCR accuracy drops, we need to look beyond the obvious types of OCR errors and explore how documents behave in real-world conditions.
What OCR Really Does and Where Accuracy Starts Falling
Optical Character Recognition (OCR) is a technology that converts scanned documents or photographs of documents into machine-readable text by analysing the shapes of letters in the image and mapping those shapes to character codes. This sounds straightforward, but OCR only recognises the textual content of a document; it does not interpret what that content means. Most OCR accuracy issues arise from this gap between recognising individual characters and understanding the overall context of a document.
Why OCR Accuracy Drops in Real-World Documents
Poor Scan Quality
A large volume of documents and invoices is scanned using a mobile phone, a poor-quality scanner, or a fax machine, which in turn creates many issues:
- Skewed or angled images
- Low image resolution
- Shadows and unevenly lit images
- Blurred or faded text.
Even advanced OCR technology cannot always interpret characters accurately under these conditions, and the result is misread numbers, missing characters, or merged fields.
Unstructured Formats
Invoice layouts are far more diverse than those of most other document types. Every vendor structures its invoices differently, so there are thousands of variations. For example, the invoice number, date, and total amount can appear in many different locations, or can even be missing altogether.
Many OCR systems, especially template-based ones, rely heavily on positional consistency. When layouts are inconsistent, these systems produce far more errors during invoice processing.
Tables and Line Items
Tables are one of the biggest areas of concern for OCR. While OCR systems can often read the text inside a table, they do not necessarily preserve the structure that gives that text its meaning. For example:
- Table columns are often misaligned
- Row boundaries are often lost
- The relationship of the quantities/rates/totals is often lost
When this occurs, values end up in the wrong place and invoice totals are calculated incorrectly, particularly on multi-line invoices.
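Lost relationships between quantities, rates, and totals can often be detected with simple arithmetic after extraction. The following is a minimal sketch, assuming each extracted line item has already been parsed into `quantity`, `rate`, and `amount` values; the function name, the field names, and the tolerance are illustrative assumptions rather than part of any particular OCR product.

```python
from decimal import Decimal


def check_line_items(items, stated_total, tolerance=Decimal("0.01")):
    """Return a list of human-readable problems found in extracted line items."""
    problems = []
    running_total = Decimal("0")
    for index, item in enumerate(items, start=1):
        # Each row should satisfy quantity x rate = amount.
        expected = item["quantity"] * item["rate"]
        if abs(expected - item["amount"]) > tolerance:
            problems.append(
                f"line {index}: quantity x rate = {expected}, "
                f"but amount is {item['amount']}"
            )
        running_total += item["amount"]
    # The row amounts should add up to the total printed on the invoice.
    if abs(running_total - stated_total) > tolerance:
        problems.append(
            f"line amounts sum to {running_total}, but total is {stated_total}"
        )
    return problems


# Hypothetical extraction in which the second row's amount lost a digit.
items = [
    {"quantity": Decimal("2"), "rate": Decimal("150.00"), "amount": Decimal("300.00")},
    {"quantity": Decimal("5"), "rate": Decimal("40.00"), "amount": Decimal("20.00")},
]
problems = check_line_items(items, stated_total=Decimal("500.00"))
for problem in problems:
    print(problem)
```

A check like this does not repair the table, but it tells a reviewer which rows to look at before the invoice reaches the accounting system.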
Handwritten or Semi-Printed Fields
A great number of invoices still contain handwritten notes, signatures, or manually filled fields. OCR accuracy drops significantly on handwritten content, especially cursive handwriting.
OCR Accuracy Problems by Document Type
Different types of documents produce different OCR errors. Let’s take a closer look at the common errors that arise with each type.
OCR Errors in Invoices
Invoices are semi-structured: they contain the same kinds of fields, but not in the same places. As a result, they often produce common OCR errors, such as:
- Similar characters (e.g. O and 0, I and 1) cause incorrect invoice numbers.
- Omitting or shifting decimal places results in incorrect tax amounts.
- Vendor names split across multiple lines are not recognised as a single name, leading to incomplete or mismatched vendor details.
- Duplicate line items are frequently extracted when tables in an invoice are recognised incorrectly.
For example, OCR can misread an invoice total of ₹98,750 as ₹9,875, or drop the separator and return ₹98750, because it has no notion of what a comma or decimal point means in context. OCR treats the input as a sequence of characters and does not consider its financial meaning.
As a result, OCR errors in invoices often surface during audits and reconciliation rather than at the time the invoices were processed.
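Character confusions and stray separators in amount fields can be partly handled by a clean-up step that runs after OCR. The sketch below is illustrative only: the `LOOKALIKES` map and the `parse_amount` helper are assumptions made for this example, and the substitution should only be applied to fields known to be numeric, since invoice numbers can legitimately contain letters.

```python
from decimal import Decimal, InvalidOperation

# Letters that OCR commonly confuses with digits (an illustrative, incomplete map).
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})


def parse_amount(raw):
    """Parse an OCR'd amount such as '₹98,75O.00' into a Decimal, or return None."""
    # Drop the currency symbol and thousands separators, then swap lookalike letters.
    cleaned = raw.replace("₹", "").replace(",", "").strip().translate(LOOKALIKES)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Anything that still is not a number is left for manual review.
        return None


print(parse_amount("₹98,75O.00"))
print(parse_amount("n/a"))
```

Note that this only repairs the format of a value. A total that was read as ₹9,875 instead of ₹98,750 parses cleanly and can only be caught by comparing it with other fields.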
OCR Accuracy Issues in PDFs
Some PDFs contain text that you can select and copy, while others are simply scanned images of hard-copy documents wrapped in a PDF file. OCR problems arise when:
- The scanned image inside the PDF is of low quality.
- The OCR cannot tell which content is a field and which is just another layer of the document.
- Fonts have been compressed or embedded in a way that makes the characters hard to recognise.
As a result, the OCR output may have missing text blocks, a confused reading order, or skipped fields. Even a document that looks perfect on screen can still produce poor extraction results.
Scanned Document OCR Accuracy Challenges
With scanned documents, the limitations of OCR compound one another.
Common problems with OCR accuracy from scanned documents include:
- Skewed Alignment and Incorrect Field Grouping
- Background Noise Read as Characters
- Stamps and Seals Distort Text Recognition
- Multi-Page Scans Lose Context Between Pages
OCR engines work on a page-by-page basis. Unless additional intelligence is incorporated, the engine cannot tell that page 2 is a continuation of page 1.
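One simple form of that additional intelligence is a grouping step that runs over the per-page results. The sketch below is a minimal illustration, assuming each page has already been reduced to a dictionary with a `page` number and an `invoice_number` that is `None` when no number was found on that page. The rule it applies (a page without an invoice number continues the previous document) is an assumption for this example and will not hold for every scan batch.

```python
def group_pages(pages):
    """Group per-page OCR results into documents.

    A page with a new invoice number starts a new document. A page with the
    same number, or with no number at all, is attached to the current one.
    """
    documents = []
    for page in pages:
        number = page.get("invoice_number")
        starts_new = number is not None and (
            not documents or documents[-1]["invoice_number"] != number
        )
        if starts_new or not documents:
            documents.append({"invoice_number": number, "pages": []})
        documents[-1]["pages"].append(page["page"])
    return documents


# Hypothetical three-page scan: page 2 has no header and continues page 1.
pages = [
    {"page": 1, "invoice_number": "INV-1"},
    {"page": 2, "invoice_number": None},
    {"page": 3, "invoice_number": "INV-2"},
]
documents = group_pages(pages)
print(documents)
```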
Why OCR ≠ Document Understanding
This distinction is significant. OCR answers one question:
- Which characters appear in this image?
Business processes need answers to different questions:
- What is the invoice number?
- Which amount is subject to taxation?
- Is this document a credit note or a tax invoice?
- Is this document consistent with the purchase order information?
OCR does not assess meaning, relationships, or correctness, so it cannot determine whether the extracted data actually satisfies a business requirement.
This is where the limitations of OCR become unavoidable.
OCR vs Intelligent Document Processing
OCR vs IDP Accuracy: The Key Difference
OCR is one part of a greater whole; Intelligent Document Processing (IDP) is the complete system.
Comparing OCR with IDP is not about competition; IDP is an evolution of OCR.
IDP is composed of:
- OCR, which extracts the text.
- Machine learning for identifying the format of the document.
- Contextual AI that determines the meaning of the fields.
- Validation rules that will identify any issues before anyone uses the information downstream.
For instance, where an OCR system extracts “₹12,500” as a string, IDP checks whether the extracted value is consistent with the applicable tax rules, the line items, and patterns seen in previous documents.
That is why comparisons between the two technologies consistently favour IDP: a system that adds intelligence and validation on top of OCR is more accurate overall than OCR alone.
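A validation rule of this kind can be very small. The following sketch assumes the subtotal, tax, and total have already been parsed into `Decimal` values and that a single tax rate applies to the whole invoice. The function name and the 18% default rate are assumptions chosen for illustration, and a real invoice may mix several rates.

```python
from decimal import Decimal, ROUND_HALF_UP


def validate_totals(subtotal, tax, total, tax_rate=Decimal("0.18")):
    """Return a list of inconsistencies between subtotal, tax, and total."""
    issues = []
    # Tax should equal the subtotal multiplied by the rate, rounded to two places.
    expected_tax = (subtotal * tax_rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    if tax != expected_tax:
        issues.append(f"tax is {tax}, expected {expected_tax}")
    # The printed total should equal subtotal plus tax.
    if subtotal + tax != total:
        issues.append(f"subtotal + tax = {subtotal + tax}, but total is {total}")
    return issues


# Hypothetical values: the total lost its last digit during extraction.
issues = validate_totals(
    subtotal=Decimal("12500.00"),
    tax=Decimal("2250.00"),
    total=Decimal("1475.00"),
)
print(issues)
```

The string “₹12,500” is perfectly valid as text; it is only when it is compared with the tax and total fields that the extraction error becomes visible.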
How IDP Solves OCR Accuracy Problems
IDP systems correct the shortcomings in OCR in three ways.
Contextual Field Recognition
First, instead of only looking for items based on where they sit in a template, IDP learns how different templates present the same fields and the different labels those fields appear under (e.g. invoice number, bill number).
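The idea of mapping many printed labels to one field can be illustrated with a lookup table. This is a deliberately simplified stand-in: an IDP system would learn these associations from data, whereas the `FIELD_ALIASES` table and both helper functions below are hand-written assumptions for the sake of the example.

```python
import re

# Hand-written alias table (illustrative only; a real system would learn these).
FIELD_ALIASES = {
    "invoice_number": ["invoice number", "invoice no", "inv no", "bill number", "bill no"],
    "invoice_date": ["invoice date", "bill date", "date of issue"],
    "total_amount": ["grand total", "total amount", "amount due", "total"],
}


def normalise_label(label):
    """Lowercase a label and strip punctuation so 'Invoice No.:' becomes 'invoice no'."""
    return re.sub(r"[^a-z0-9 ]", "", label.lower()).strip()


def canonical_field(label):
    """Map a printed label to a canonical field name, or None if unrecognised."""
    cleaned = normalise_label(label)
    for field, aliases in FIELD_ALIASES.items():
        if cleaned in aliases:
            return field
    return None


print(canonical_field("Invoice No.:"))
print(canonical_field("Bill Number"))
print(canonical_field("PO Number"))
```

Because the lookup works on the label text rather than on page coordinates, it does not matter where on the invoice a vendor chose to print the field.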
Structural Understanding
Second, IDP identifies where tables, headers, footers, and repeating lines are located in a document and how they relate to one another. This gives a more accurate picture of quantities, prices, and totals, and reduces errors in calculated amounts.
Validation and Confidence Scoring
Third, IDP generates a confidence level (high, medium, or low) for each field and will create alerts for any fields that have low confidence levels.
This allows users to know that they need to check those fields before allowing data to be entered into their accounting systems.
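Routing by confidence can be sketched in a few lines. The example assumes each extracted field arrives as a value paired with a numeric confidence score between 0 and 1. The band thresholds (0.9 and 0.7) and the function names are assumptions for illustration and would need tuning against real data.

```python
def confidence_band(score):
    """Convert a numeric confidence score into a high/medium/low band."""
    if score >= 0.9:
        return "high"
    if score >= 0.7:
        return "medium"
    return "low"


def fields_needing_review(fields):
    """Return the names of fields whose confidence falls in the low band."""
    return [
        name
        for name, (value, score) in fields.items()
        if confidence_band(score) == "low"
    ]


# Hypothetical extraction result: field name -> (value, confidence score).
extracted = {
    "invoice_number": ("INV-1042", 0.98),
    "invoice_date": ("2024-03-14", 0.80),
    "total_amount": ("12,500", 0.62),
    "vendor_name": ("Acme Traders", 0.91),
}
to_review = fields_needing_review(extracted)
print(to_review)
```

Only the flagged fields go to a person, so reviewers spend their time on the values most likely to be wrong rather than re-checking every document.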
Where Snoh Fusion Fits In
The Snoh Fusion platform starts from the premise that Optical Character Recognition (OCR) alone is not enough. OCR provides the initial output; Snoh Fusion builds on it with intelligence, learning, and verification to handle the complexities of “real-world” documents, especially invoices, PDFs, and scanned images.
Snoh Fusion combines OCR with AI-based document interpretation and human oversight to assist organisations in:
- Eliminating wasted time due to discrepancies in OCR data.
- Elevating data accuracy levels, without reliance on strict templates.
- Processing higher volumes of documents with less manual review, with human input focused on monitoring and double-checking accuracy.
Snoh Fusion’s goal is not to replace OCR systems but to make their output more trustworthy.
Why OCR Accuracy Still Matters
OCR has limitations, but it remains an essential technology. The problem is that many organisations expect it to solve problems it was never intended to solve.
Many organisations only see the errors after audits, compliance checks, and financial discrepancies, all of which result from using OCR with no intelligence applied to its output. When understanding and verification are added to the information OCR collects, companies can finally realise the full benefits of automating their processes.
Final Thoughts
OCR errors occur in invoices, PDFs, and scanned documents not because the technology fails, but because it does not match what users expect of it.
OCR only reads text; it does not understand it.
By recognising the limitations of OCR and adopting Intelligent Document Processing, businesses can tell apart automation that results in rework from automation that builds trust.
FAQs
What are OCR accuracy problems?
OCR accuracy problems are the errors that occur when OCR software reads text from an invoice, scanned document, or PDF incorrectly. Examples include incorrect numbers or amounts, missing fields, and data placed in the wrong field because of scan quality or the complexity of the document layout.
Why do OCR errors happen in invoices?
Every invoice is different in design and structure. Invoices typically contain tables, several tax fields, and vendor details in varying formats, and many do not place the invoice number consistently. These structural variations, combined with OCR’s inability to interpret an invoice’s structure correctly, are why OCR errors occur in invoices.
How accurate is OCR for scanned documents?
OCR accuracy on scanned documents depends heavily on the quality of the original scan. Documents that are low-resolution, skewed, shadowed, stamped, or handwritten generally yield low accuracy.
What are the main limitations of OCR?
The main limitations of OCR are that it does not understand the context of the text it reads, it does not handle tables well, it struggles with handwritten documents, and it cannot validate whether the extracted data is correct on logical or financial grounds.
Is OCR enough for invoice processing?
No. OCR extracts the text but cannot comprehend its meaning or verify that the extracted data is correct. This generally leads to rework and manual correction of errors in the invoice processing workflow.