Automating data extraction from trade documents

Computer Vision

Natural Language Processing (NLP)

Highlights

Challenge

Automate document processing and key data extraction from various scanned trade documents

Solution

In-house-built ML models for document classification and parsing to extract and interpret data from trade documents

Results

Mean accuracy of 92% and 90% decrease in manual data entry

About the Project

The Client, a governmental entity, needed to automate data extraction from scanned trade documents in order to help traders and declarants fill out customs declaration forms, shorten the time needed to submit a declaration, and minimize human bias and error. The Client’s main objective was a solution that takes a scanned document as input and outputs the key data extracted according to 50+ pre-defined labels. The extracted data would then be used to automatically fill out a customs declaration form. In addition, the Client wanted the tool to automatically detect the types of documents and classify them accordingly.

To solve this problem, the Portmind team used Computer Vision and Natural Language Processing techniques to develop in-house Named Entity Recognition (NER) and Optical Character Recognition (OCR) solutions. Further details about the technical approach can be found in our recently published IEEE paper.

CHALLENGE

Identify and extract key information from scanned trade documents

Transactions in international trade involve dozens of stakeholders, such as importers, insurance companies, shipping lines, exporters, and customs agencies, each generating and transferring documents.

To import goods, traders need to fill in a customs declaration form that contains information about the goods, transport details, and other relevant information such as the invoice, bill of lading, air waybill, packing list, etc. Traders use these scanned documents to manually extract key information and fill out the import declaration form. This process is highly time-consuming and error-prone.

This industry problem entails a set of technical challenges:

The same type of document can have various layouts depending on who created it (for example, different companies use different layouts when issuing invoices)
The placement of key information may differ from one document to another
Documents may come in different languages
There is significant noise in scanned documents. They may have rotated pages or a poor-quality or low-resolution scan, which makes the content less readable
Multiple fonts and text sizes make text recognition challenging

SOLUTION

After examining the Client’s requirements and analyzing the data, our team began the implementation with data cleaning and labeling. An internal team of data quality specialists reviewed the dataset of various documents and manually annotated it according to the key data tags. The manually annotated data was used to train the AI document parsing model. Having a functional document parsing solution addressed the main challenges posed by the Client.

However, because the solution would be working with confidential data, the Portmind team also developed a custom OCR solution that would be deployed on the Client’s servers to ensure data privacy. Furthermore, the tool was customized to cover the Client’s specific business needs. The team accomplished this by developing a synthetic data generation pipeline that creates documents with different layouts, languages, fonts, and font sizes. This data was used to train the OCR model.

Within the scope of the project, the Portmind team also delivered a component that classifies each document into one of 10 pre-defined document classes, using both the visual and the textual information of the document. With this in place, the system automatically classifies documents when filling out the declaration form.
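To give a feel for what a synthetic data generation step can look like, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the Portmind pipeline: the field names, font list, page size, and the `generate_document` function are all hypothetical. The sketch only samples a randomized layout and records the ground-truth text and position of each field; rendering the layout to a noisy scanned image is left out.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical values for illustration only; the real labels, fonts, and
# languages used in the project are not given in this case study.
FONTS = ["Arial", "Courier New", "Times New Roman"]
FONT_SIZES = [8, 10, 12, 14]
FIELD_VALUES = {
    "invoice_number": lambda rng: f"INV-{rng.randint(1000, 9999)}",
    "gross_weight_kg": lambda rng: f"{rng.uniform(1, 5000):.1f}",
    "country_of_origin": lambda rng: rng.choice(["DE", "CN", "TR", "US"]),
}
PAGE_W, PAGE_H = 2480, 3508  # assumed page size: A4 at 300 dpi, in pixels


@dataclass
class SyntheticField:
    label: str  # ground-truth label for the parsing model
    text: str   # ground-truth text for the OCR model
    font: str
    size: int
    x: int      # top-left corner of the text on the page
    y: int


def generate_document(seed: int) -> List[SyntheticField]:
    """Sample one randomized layout together with its ground truth."""
    rng = random.Random(seed)
    labels = list(FIELD_VALUES)
    rng.shuffle(labels)  # vary the order of fields between documents
    row_height = PAGE_H // (len(labels) + 1)
    fields = []
    for row, label in enumerate(labels):
        fields.append(SyntheticField(
            label=label,
            text=FIELD_VALUES[label](rng),
            font=rng.choice(FONTS),
            size=rng.choice(FONT_SIZES),
            x=rng.randint(100, PAGE_W // 2),
            y=row * row_height + rng.randint(100, row_height - 100),
        ))
    return fields
```

Because each document is derived from a seed, the same layout can be regenerated on demand, which keeps a training set reproducible without storing every rendered image.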

Based on the Client’s business needs, we delivered a product that does the following:

Identifies the document type

Extracts text from .pdf, .jpg, and .png files using an in-house-built OCR

Recognizes the areas where important information resides on a document

Extracts text based on 50+ pre-defined labels

Automatically fills out the customs declaration form with the extracted data
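The last step in the list above, turning extracted entities into form values, can be sketched as follows. This is a hedged illustration rather than the delivered product: the `Entity` type, the label names, the form field names, and the confidence threshold are assumptions made for the example. For each form field the sketch keeps the highest-confidence entity and leaves the field empty when nothing confident enough was found, so that it can be completed manually.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entity:
    label: str         # one of the pre-defined labels (names here are hypothetical)
    text: str          # text recognized for this entity
    confidence: float  # model confidence in [0, 1]


# Hypothetical mapping from extraction labels to declaration form fields.
LABEL_TO_FORM_FIELD = {
    "invoice_number": "commercial_invoice_no",
    "exporter_name": "consignor",
    "gross_weight_kg": "gross_mass",
}


def fill_declaration(
    entities: List[Entity], min_confidence: float = 0.5
) -> Dict[str, Optional[str]]:
    """Fill each form field with its highest-confidence extracted value."""
    form: Dict[str, Optional[str]] = {
        field: None for field in LABEL_TO_FORM_FIELD.values()
    }
    best_confidence: Dict[str, float] = {}
    for entity in entities:
        field = LABEL_TO_FORM_FIELD.get(entity.label)
        # Skip labels with no matching form field and low-confidence values.
        if field is None or entity.confidence < min_confidence:
            continue
        if entity.confidence > best_confidence.get(field, -1.0):
            best_confidence[field] = entity.confidence
            form[field] = entity.text
    return form
```

Fields that remain `None` are the ones a declarant would still need to review or enter by hand.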

RESULTS

The Portmind team developed a solution that automates the document classification and data extraction process. The solution extracts target information from scanned documents in less than 5 seconds with an average accuracy of 92%, making the process of filling out declaration forms 90% faster.
