Whitepaper

The Ultimate Guide To Enterprise Content Capture

The goal of this paper is to review the leading document capture applications available in the current marketplace and compare their technical capabilities against six key evaluation points.

A capture application takes in a set of documents, whether digital or hard copy, as files. These files are then analyzed by various means to identify, interpret and extract data. We can break this process down into six stages, or tasks, listed in processing order: ingestion, recognition, classification, data extraction, validation, and export.
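As a rough illustration of the processing order described above, the six tasks can be sketched as a fixed sequence of handlers applied to one job. The names CaptureJob, STAGES and run_pipeline are hypothetical and are not taken from any particular capture product.

```python
from dataclasses import dataclass, field

# The six tasks in processing order, as listed in the text.
STAGES = [
    "ingestion",
    "recognition",
    "classification",
    "data extraction",
    "validation",
    "export",
]

@dataclass
class CaptureJob:
    """Hypothetical container for one document moving through the workflow."""
    source: str
    doc_type: str = ""
    fields: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

def run_pipeline(job, handlers):
    """Run each stage in order; stages without a handler are passed through."""
    for stage in STAGES:
        handler = handlers.get(stage)
        if handler is not None:
            handler(job)
        job.history.append(stage)  # record that the stage was visited
    return job
```

Each handler receives the job and updates it in place, so a later stage (for example data extraction) can rely on what an earlier stage (classification) recorded.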

Ingestion is the function of the application looking for and loading documents in digital form as the beginning of the capture workflow. Historically this process was principally hardware-driven, relying on a document scanner. While scanners are still applicable today, they are usually a secondary consideration, because the focus is no longer only on paper documents but on electronic ones as well. Most scanners can be configured to output the same set of file types (typically TIFF, JPEG or PDF files) at predetermined dpi settings. Digital documents can have many different file types. In addition to the three file types produced from scanned hard copy, these can include email messages and their attachments, MS Office documents, spreadsheets, compressed files, PowerPoint files, and other image file types not previously mentioned. The ingestion task must be able to add and collate these file types into the application for further processing by subsequent tasks.
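One simple way to picture the collation step is to group incoming files by extension. The sketch below is illustrative only: the category names and the extension lists are assumptions, and a real product would typically inspect file content rather than trust the extension.

```python
from pathlib import Path

# Hypothetical extension groups; real applications support far more types.
CATEGORIES = {
    "scanned_or_pdf": {".tif", ".tiff", ".jpg", ".jpeg", ".pdf"},
    "email": {".eml", ".msg"},
    "office": {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"},
    "compressed": {".zip"},
}

def categorize(filename):
    """Return the category for a file name, or 'unsupported' if unknown."""
    suffix = Path(filename).suffix.lower()
    for category, suffixes in CATEGORIES.items():
        if suffix in suffixes:
            return category
    return "unsupported"

def collate(filenames):
    """Group file names by category for the subsequent tasks."""
    groups = {}
    for name in filenames:
        groups.setdefault(categorize(name), []).append(name)
    return groups
```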

Recognition is the process of using a character recognition engine to interpret the text pixels in the ingested file, or of extracting the character values from files that have embedded text, such as some PDF, MS Office or plain-text files. This process sometimes also includes converting some file types into the formats that the character recognition engines require. Some engines accept only specific file types and color depths; others have greater flexibility. The result of the recognition task is that all file types are rendered down into ‘pages’ with associated text values and positions for each printed character recognized.
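The output of recognition can be pictured as a page object holding text values with their positions. The sketch below is a simplification and assumes word-level rather than character-level boxes; the Word and Page types and the page_text helper are hypothetical and do not reflect any specific engine's output format.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    left: int    # x position of the bounding box, in pixels
    top: int     # y position of the bounding box, in pixels
    width: int
    height: int

@dataclass
class Page:
    number: int
    words: list

def page_text(page, line_tolerance=5):
    """Rebuild reading-order text: group words into lines, then sort by x."""
    lines = []  # each entry is the list of words on one text line
    for word in sorted(page.words, key=lambda w: w.top):
        # A word joins the current line if its top is close to the line's first word.
        if lines and abs(word.top - lines[-1][0].top) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.left))
        for line in lines
    )
```

Keeping the positions alongside the text is what later allows positional techniques, such as reading a fixed zone of the page, to work on any ingested file type.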

Classification consists of identifying the document type presented in the ingested file to determine which sets of data to extract. While this sounds simple enough, applying it in practice is anything but. Today’s capture applications offer a diverse range of classification techniques. These range from comparing image size, dpi and layout, through recognized-text, image and pattern matching, to Machine Learning and Artificial Intelligence (AI) services that analyze the document content to determine the classification.
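To make the simplest text-based end of this range concrete, the sketch below scores recognized text against keyword sets. It is only a toy stand-in for the pattern-matching family of techniques; the document types and keywords are invented for illustration, and machine learning classifiers work quite differently.

```python
# Hypothetical keyword sets per document type.
KEYWORDS = {
    "invoice": {"invoice", "amount due", "bill to"},
    "purchase_order": {"purchase order", "ship to", "po number"},
}

def classify(text, keywords=KEYWORDS):
    """Return the document type whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {
        doc_type: sum(1 for kw in kws if kw in lowered)
        for doc_type, kws in keywords.items()
    }
    best = max(scores, key=scores.get)
    # With no keyword hits at all, the document is left unclassified.
    return best if scores[best] > 0 else "unknown"
```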

Data Extraction techniques also vary widely within the capture space, and extraction usually depends heavily on accurate document classification. The range of techniques includes zonal extraction, which reads a specific positional zone on a document; key-value searching, where the document text is searched for a specific key (a word or set of words) and the associated value is found nearby; and content analysis, which uses AI to extract values based on sets of learned contextual rules.
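Key-value searching can be sketched with a regular expression: find the key, skip an optional separator, and capture the value that follows. The function name extract_key_value, the separator set and the default value pattern are assumptions chosen for this example; production engines also search values located below or beside the key using positional data.

```python
import re

def extract_key_value(text, key, value_pattern=r"[A-Za-z0-9\-/.]+"):
    """Find `key` in the text and return the value that follows it, if any."""
    pattern = re.compile(
        re.escape(key) + r"\s*[:#]?\s*(" + value_pattern + r")",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None

# Usage with made-up recognized text:
sample = "Invoice Number: INV-1042\nTotal Due: 250.00"
invoice_number = extract_key_value(sample, "Invoice Number")
total_due = extract_key_value(sample, "Total Due")
```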

Validation applies business rules to extracted data, either to verify that the captured data is of the correct type, format or value, or to adjust it automatically. If the business rules fail to adjust or validate the data, the application can optionally present an interface where an operator can make manual adjustments to the captured data. Some interfaces also allow the operator to manually reclassify the document, edit the document structure and/or resend the document through the Data Extraction task once manual intervention is complete.
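A minimal sketch of type, format and value rules follows. The field names, the INV- number format and the date format are hypothetical; a real deployment would define rules per document type. A non-empty result is the signal to route the document to an operator.

```python
import re
from datetime import datetime

def validate_fields(fields):
    """Apply example business rules; return a dict of field -> error message."""
    errors = {}

    # Format rule: the invoice number must look like INV-<digits> (assumed format).
    if not re.fullmatch(r"INV-\d+", fields.get("invoice_number", "")):
        errors["invoice_number"] = "expected format INV-<digits>"

    # Type and value rule: the total must be a non-negative number.
    try:
        total = float(fields.get("total", ""))
    except ValueError:
        errors["total"] = "not a number"
    else:
        if total < 0:
            errors["total"] = "must not be negative"

    # Format rule: the date must be ISO style, YYYY-MM-DD (assumed format).
    try:
        datetime.strptime(fields.get("invoice_date", ""), "%Y-%m-%d")
    except ValueError:
        errors["invoice_date"] = "expected YYYY-MM-DD"

    return errors

def needs_operator_review(fields):
    """True when any rule failed and manual intervention is required."""
    return bool(validate_fields(fields))
```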

Export refers to the transfer of the extracted data to another system or repository. This can be via a simple file transfer, database insertion, embedded repository interfaces, or SOAP or REST services. The extracted data is organized into the required formats, the associated metadata is collected, and both are then exported out of the application.
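The simplest of these options, a file transfer, can be sketched as writing the extracted fields and their metadata to a JSON file. The record layout and the function names below are assumptions for illustration; the required format is dictated by the receiving system.

```python
import json
from datetime import datetime, timezone

def build_export_record(doc_type, fields, source_file):
    """Organize extracted data and metadata into one record (assumed layout)."""
    return {
        "document_type": doc_type,
        "fields": fields,
        "metadata": {
            "source_file": source_file,
            "exported_at": datetime.now(timezone.utc).isoformat(),
        },
    }

def to_json(record):
    """Serialize the record to a JSON string."""
    return json.dumps(record, indent=2, sort_keys=True)

def write_export(record, path):
    """Write the record to a file that a downstream system can pick up."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(to_json(record))
```

The same record could instead be inserted into a database or posted to a REST service; only the final delivery step changes.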