What is Optical Character Recognition?
Optical Character Recognition is the process of extracting data from a scanned paper document or image file and converting it into an editable, searchable digital format.
It is said that Optical Character Recognition (OCR) is the oldest data entry technique after keypunching. The keypunch was a device that punched holes into stiff paper cards according to a code that related alphanumeric characters to the holes. Punched cards were once widely used in data processing applications and to directly control automated machinery.
Optical Character Recognition (OCR), as we know it today, is the technology that converts text in scanned images of typed, handwritten or printed documents, photographs with text in the background, and even movie frames with superimposed text into machine-encoded text that can be edited and searched.
For a long while, printed documents and images were scanned and stored as PDF files on electronic storage devices. The advent of Optical Character Recognition technology has revolutionised the processing of scanned and electronic documents. OCR software recognises text characters in image files and converts them into editable and searchable text.
Optical Character Recognition (OCR) technology has become increasingly popular over time with the availability of superfast microprocessors and highly advanced recognition techniques. Today, vast amounts of data are read at speeds and accuracy levels that were unimaginable a decade ago. Devices like OCR wands and desktop OCR scanners have made data capture faster, more efficient and more accurate than keyboard entry. Sophisticated desktop OCR scanners can read typewritten data at speeds of over 2400 words per minute!
OCR software helps scan documents and save them directly as text-searchable PDF files, or as editable text documents. Text-searchable PDF files are quite efficient as they enable searching and finding specific information without looking through every page.
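To make the idea of searching without paging through a document concrete, here is a minimal sketch of a word index built over recognised page text. It assumes the OCR step has already produced one string per page; the function name `build_index` and the sample pages are hypothetical, and a real searchable PDF stores its text layer differently.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the sorted list of page numbers containing it."""
    index = defaultdict(set)
    for number, text in enumerate(pages, start=1):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}

# Hypothetical OCR output, one string per scanned page
pages = [
    "Invoice 1042 issued to Acme Ltd",
    "Payment terms: 30 days",
    "Acme Ltd bank details",
]
index = build_index(pages)
acme_pages = index.get("acme", [])      # pages that mention "acme"
refund_pages = index.get("refund", [])  # empty list: the word never occurs
```

A lookup in the index goes straight to the relevant pages, which is the practical gain of a text layer over an image-only scan.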
So, how does OCR (Optical Character Recognition) work?
OCR is currently an active area of research in pattern recognition, artificial intelligence and computer vision. Optical Character Recognition is extensively used as a means of digitising data from printed records like passports, invoices, business cards, receipts, bank statements, government documents, large survey data, static documents etc. Such data can then be easily edited, stored, displayed or searched electronically.
OCR Scanners or reading devices are used to scan documents for Text Input and Data Capture.
- Text Input devices are generally page readers or document scanners that ‘read’ large documents or parts thereof. Such textual data is entered for the purpose of editing, so text input devices have varying levels of automation with regard to feeding, reading, sorting and stacking capabilities.
- Data Capture devices capture repetitive data and apply formatting functions to the data as it is entered. This data entry has to be very accurate as it is not intended to be edited later.
Two methods are generally used for Optical Character Recognition processes: Matrix Matching and Feature Extraction.
Matrix Matching is the simpler and more commonly used method: it compares what the OCR scanner reads as a character against a library of character templates. This reliance on templates is also the limiting factor of Matrix Matching, as the scanner is unable to read fonts outside of the prescribed library.
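A toy illustration of Matrix Matching, assuming each character has already been isolated as a small black-and-white bitmap. The 3x5 templates, the `recognise` function and the 0.8 acceptance threshold are all hypothetical choices made for this sketch, not values taken from any real OCR product.

```python
# Hypothetical template library: one 3x5 bitmap per known character
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}

def match_score(glyph, template):
    """Fraction of pixels that agree between two same-sized bitmaps."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            same += (g == t)
    return same / total

def recognise(glyph, min_score=0.8):
    """Return the best-matching character, or None if no template is close enough."""
    best_char, best_score = None, 0.0
    for char, template in TEMPLATES.items():
        score = match_score(glyph, template)
        if score > best_score:
            best_char, best_score = char, score
    return best_char if best_score >= min_score else None

# A "T" with one noisy pixel is still closest to the "T" template
noisy_t = ["111", "010", "011", "010", "010"]
# An "X"-like shape is not in the library, so nothing matches well
unknown = ["101", "101", "010", "101", "101"]

result_t = recognise(noisy_t)
result_unknown = recognise(unknown)
```

The second call returning nothing shows the limitation described above: a shape that has no template in the library cannot be read, however clearly it was scanned.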
Feature Extraction is also known as Intelligent Character Recognition (ICR) or Topological Feature Analysis. This method is versatile and uses varying degrees of computer intelligence and superior feature analysis to match characters that are less predictable. This variant of OCR is seen in ‘intelligent handwriting recognition’, the general techniques of feature detection in computer vision and, of course, in much of the latest OCR software.
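As a small sketch of what a topological feature can look like, the code below counts the enclosed holes in a character bitmap (an "O" has one, a "B" has two, an "L" has none). This is only one illustrative feature chosen for the example; real systems combine many features, and the function name `count_holes` and the sample bitmaps are assumptions of this sketch.

```python
def count_holes(glyph):
    """Count enclosed background regions in a bitmap made of '0'/'1' strings."""
    rows, cols = len(glyph), len(glyph[0])
    seen = [[False] * cols for _ in range(rows)]

    def flood(r, c):
        # Mark every background pixel reachable from (r, c) via 4-connectivity
        stack = [(r, c)]
        while stack:
            y, x = stack.pop()
            if not (0 <= y < rows and 0 <= x < cols):
                continue
            if seen[y][x] or glyph[y][x] == "1":
                continue
            seen[y][x] = True
            stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])

    # Background that touches the border lies outside the character
    for r in range(rows):
        flood(r, 0)
        flood(r, cols - 1)
    for c in range(cols):
        flood(0, c)
        flood(rows - 1, c)

    # Any background still unmarked is enclosed by ink: each region is a hole
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if glyph[r][c] == "0" and not seen[r][c]:
                holes += 1
                flood(r, c)
    return holes

small_o = ["111", "101", "111"]
wide_o = ["01110", "10001", "10001", "01110"]
letter_b = ["110", "101", "110", "101", "110"]
letter_l = ["100", "100", "111"]
hole_counts = [count_holes(g) for g in (small_o, wide_o, letter_b, letter_l)]
```

The two differently shaped "O" bitmaps yield the same feature value, which is why feature-based recognition copes better with unfamiliar fonts than pixel-by-pixel template comparison.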
Before the OCR algorithm is applied, the image is pre-processed to boost the chances of recognition. Pre-processing techniques include:
- De-skewing: tilting the document slightly to align the text lines
- Despeckling: removing positive and negative spots and smoothing the edges
- Binarisation: converting the image into black and white
- Line removal: cleaning up non-glyph boxes and lines
- Layout analysis or zoning: identifying columns, paragraphs, tables etc. as blocks
- Line and word detection
- Script recognition
- Character isolation or segmentation
- Normalisation
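Two of the steps above, binarisation and line detection, can be sketched in a few lines. This is a simplified illustration that assumes the image is a list of rows of grey values (0 = black, 255 = white); the mean-value threshold and the function names `binarise` and `find_text_lines` are assumptions of the sketch, and production systems generally use more robust thresholding.

```python
def binarise(gray, threshold=None):
    """Turn a greyscale image (rows of 0-255 ints) into 1 = ink, 0 = background."""
    if threshold is None:
        pixels = [p for row in gray for p in row]
        threshold = sum(pixels) / len(pixels)  # crude global threshold: the mean
    return [[1 if p < threshold else 0 for p in row] for row in gray]

def find_text_lines(binary):
    """Return (first_row, last_row) pairs for consecutive rows that contain ink."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = sum(row) > 0
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, len(binary) - 1))
    return lines

# A tiny made-up "page": two bands of dark pixels separated by a blank row
page = [
    [250, 250, 250, 250],
    [30, 240, 20, 250],
    [40, 35, 245, 250],
    [250, 250, 250, 250],
    [250, 25, 30, 250],
    [250, 250, 250, 250],
]
binary = binarise(page)
text_lines = find_text_lines(binary)
```

Here the blank row between the two dark bands is what separates one text line from the next; the same row-by-row idea, applied to columns within a line, is one simple way to separate words and characters.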
In post-processing, OCR systems retain the original page structure and create a PDF containing both the original image and a searchable text layer. Error correction is done by ‘near neighbour analysis’, which uses the fact that certain words often appear together to correct likely misreadings.
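One way to picture ‘near neighbour analysis’ is as a correction step that prefers word pairs known to occur together. The sketch below assumes a small table of co-occurrence counts; the numbers in `PAIR_COUNTS` are made up for illustration, and the `correct` function is a hypothetical simplification rather than the method of any particular OCR engine.

```python
# Hypothetical co-occurrence counts (made-up numbers): how often the second
# word follows the first in some reference text.
PAIR_COUNTS = {
    ("washington", "d.c."): 950,
    ("washington", "doc"): 2,
    ("invoice", "number"): 800,
}

def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(words, max_distance=2):
    """Replace a word when a similarly spelled word follows its neighbour more often."""
    corrected = []
    for word in words:
        best = word
        if corrected:
            previous = corrected[-1]
            best_count = PAIR_COUNTS.get((previous, word), 0)
            for (left, right), count in PAIR_COUNTS.items():
                if (left == previous and count > best_count
                        and edit_distance(word, right) <= max_distance):
                    best, best_count = right, count
        corrected.append(best)
    return corrected

fixed_place = correct(["washington", "doc"])
fixed_field = correct(["invoice", "numbcr"])
untouched = correct(["total", "amount"])
```

With these counts, "washington doc" becomes "washington d.c." because that pairing is far more frequent. A generous `max_distance` can over-correct short words, so the limit would need tuning on real data.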
The Optical Character Recognition Software
As with any solution, many versions of OCR software have been developed over the years, each with an edge over the other. Each new version of the Optical Character Recognition software showcases unique features and services to handle various types of documents. The more effective the OCR software, the more complex it is, with advanced capabilities, more tools and the versatility to meet the composite needs of high-quality, high-volume data processing.
The early versions of OCR were trained with images of each character of a font. More recent systems use a variety of digital image file format inputs to deliver a high level of accuracy for most fonts; sometimes even reproducing formatted text and other non textual components of the original document.
Applications of Optical Character Recognition
OCR software is developed for many domain-specific applications. In recent years OCR systems have been tweaked to include business rules, standard expressions or information contained in colour images, for better performance.
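As an illustration of how a business rule can improve results, the sketch below cleans up a date field. It assumes that ‘standard expressions’ can be read as regular expressions, that the field is known to be a date in DD/MM/YYYY form, and that the listed character confusions are typical; all three are assumptions made for the example, as is the function name `clean_date`.

```python
import re

# Letters that are easily mistaken for digits in numeric fields (illustrative list)
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

# Hypothetical business rule: the field must look like DD/MM/YYYY
DATE_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{4}$")

def clean_date(raw):
    """Apply the digit fixes, then accept the value only if it matches the rule."""
    candidate = raw.strip().translate(DIGIT_FIXES)
    return candidate if DATE_PATTERN.match(candidate) else None

repaired = clean_date("l2/O3/2O24")  # letters l and O read in place of 1 and 0
rejected = clean_date("12 March")    # does not fit the rule, so it is flagged
```

Because the rule knows the field can only contain digits and slashes, it can repair misread characters that a general-purpose recogniser has no way to rule out, and it can flag values for manual review when no repair fits.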
Some of the applications of Optical Character Recognition technology include:
- Data entry for business documents like cheques, passports, invoices, bank statements, receipts, proforma invoices etc.
- Data or document classification
- Automating data entry, extraction and processing
- Recognising number plates with a camera
- Indexing print material for search engines
- Passport recognition and information extraction at airports
- Recognising traffic signs
- Converting business card information into contact list
- Making text versions of printed documents that can be edited with word processors
- Making electronic images of printed documents searchable
- Pen computing – converting handwriting in real-time to control a computer
- Deciphering documents to be read aloud to blind and visually impaired users
- Making scanned documents searchable by converting them into PDF files
- Archiving historic information, like magazines and newspapers, into searchable formats
Use cases of Optical Character Recognition by Industry
The Banking industry, along with related economic sectors, is perhaps the largest user of OCR. The most common use is in cheque management. Reduced cheque clearance time is one of the greatest achievements of OCR. A cheque is scanned, its details are converted to digital text, the signature is validated and the cheque is cleared in real-time – all without human involvement. With AI methods, digital conversion of fully handwritten cheques is not far away.
Optical Character Recognition has multiple applications in the Legal industry, which generates enormous amounts of paperwork. The simplest OCR scanners can be used for digitising, storing, indexing and searching all printed documents – affidavits, judgements, filings, statements, wills etc. With OCR technology expanding to languages that do not use the Roman script, these processes can be used for legal records in other scripts, such as those used for Japanese and Hindi. For an industry that relies heavily on the past, OCR technology can give seamless access to innumerable cases from the past.
Record keeping in the health care industry has been revolutionised by OCR technology. Medical history of patients – diagnostics, treatments, reports, X-rays, hospital records, insurance payments – can all be digitally stored at a single location and easily accessed when needed. Besides, these records help in planning hospital inventory including drugs, equipment and other products. Centralised data from hospitals in a region can help in the formulation of sound health policies.
Optical Character Recognition technology plays a key role in the Supply Chain Management industry. When items with clear documentation of their origin need to be located within the supply chain, OCR can be more efficient than barcodes, because it reads the human-readable text already printed on the item. OCR software can read lot codes, batch codes, expiry dates and serial numbers to follow an item through packaging, labelling and palletising.
The Benefits of OCR
The main advantages of OCR technology are saved time and reduced errors. It also enables data to be compressed into zip files, which cannot be done to a physical printed document.
For libraries, educational institutions, government departments, hospitals, banks – or any institution, OCR is a magic wand that can digitise huge archives of paper documents in any language or format into machine-readable data that can be stored easily and made accessible to anyone at the touch of a fingertip.
Optical Character Recognition enables searchability of data. Scanned files that are converted to machine-readable files can be saved in any format that can be searched on the internal server of an organisation, or made available universally on the Internet.
Documents digitised with OCR can be edited at will using a word processor.
A document scanned by OCR and saved on a digital database is accessible to anyone who has access to that database. For instance, a bank can access details of a customer’s transactions; or a government archive can be accessed for a birth certificate.
The Optical Character Recognition process is also instrumental in reducing storage space from a few cubic metres to a few gigabytes for the same amount of data.
The new norm for data backup (any number of times) will be a cloud server rather than duplicates and triplicates on paper. OCR thus becomes a sustainable alternative to office stationery and storage cabinets.
Optical Character Recognition is highly versatile and can handle a wide range of linguistic scripts. This feature of OCR, combined with the Unicode standard and translation software like Google Translate, enables translation of scanned and digitised documents into many other languages, a benefit that greatly reduces the painstaking effort of manual translation.
OCR scanners can greatly enhance the operational efficiency of your business by helping you focus on more important initiatives rather than minor day-to-day details. Some major advantages of automation would be:
- Enhanced process speed due to the automation and reduced manual effort
- Optimisation of workforce by reallocating data entry staff to higher value tasks
- Reduced labour cost associated with manual data entry and document sorting.
Beyond Digitisation of Text
Optical Character Recognition technology is not restricted to ‘reading’ prints. It is being increasingly used in fields like crime investigation to capture and restructure images and movements on a camera to solve or recreate a scene. Musicians are discovering new horizons through scanned sheet music made available digitally. Libraries across the world are scanning huge repositories of books into digital resources that are easily accessible to all.
Implementing OCR solutions in your business can transform your business processes and provide significant returns on investment pretty quickly. In addition, staff across departments can configure templates for each scanned document to guide the Optical Character Recognition software on where to look for particular data, which in turn will save further time and money.
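A template of this kind can be pictured as a set of named zones on the page. The sketch below assumes the OCR engine reports each recognised word with the pixel position of its top-left corner; the word list, the zone coordinates and the function name `extract_fields` are all hypothetical values made up for illustration.

```python
# Hypothetical OCR output: (text, x, y) with the top-left corner of each word
WORDS = [
    ("INVOICE", 40, 20),
    ("No.", 400, 20),
    ("1042", 440, 20),
    ("Total", 400, 700),
    ("118.50", 460, 700),
]

# Hypothetical template: field name -> zone (left, top, right, bottom) in pixels
TEMPLATE = {
    "invoice_number": (430, 10, 520, 40),
    "total": (450, 690, 540, 720),
}

def extract_fields(words, template):
    """Collect, for each field, the words whose corner falls inside its zone."""
    fields = {}
    for name, (left, top, right, bottom) in template.items():
        inside = [text for text, x, y in words
                  if left <= x <= right and top <= y <= bottom]
        fields[name] = " ".join(inside)
    return fields

fields = extract_fields(WORDS, TEMPLATE)
```

Once a zone is drawn for a document type, every scan of that type can be read the same way, which is where the saving in time comes from. The approach relies on the layout staying fixed from one scan to the next.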