Exchanging image processing and OCR components in a Setswana digitisation pipeline
Keywords: digitisation, optical character recognition, image processing, neural networks
As more natural language processing (NLP) applications benefit from neural-network-based approaches, it makes sense to re-evaluate existing work in NLP. A complete digitisation pipeline consists of several components that handle the material in sequence. Image processing after scanning the document has been shown to be an important factor in final quality. Here we compare two approaches for visually enhancing documents before Optical Character Recognition (OCR): (1) a combination of ImageMagick and Unpaper, and (2) OCRopus. We also compare Calamari, a new line-based OCR package using neural networks, with the well-known Tesseract 3 as the OCR component. Our evaluation on a set of Setswana documents reveals that the combination of ImageMagick/Unpaper and Calamari improves on a current baseline based on Tesseract 3 and ImageMagick/Unpaper by over 30%, achieving a mean character error rate of 1.69 across all combined test data.
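For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate (CER) and a relative improvement over a baseline are commonly computed. It assumes the standard definition (Levenshtein edit distance divided by reference length, expressed as a percentage); the function names are hypothetical and this is not necessarily the evaluation script used in the paper.

```python
from typing import Iterable, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions and substitutions needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate as a percentage of the reference length
    (assumed convention; some tools report a fraction instead)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def mean_cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Unweighted mean CER over (reference, OCR output) pairs."""
    scores = [cer(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores)


def relative_improvement(baseline: float, new: float) -> float:
    """Relative reduction in error, as a percentage of the baseline."""
    return 100.0 * (baseline - new) / baseline
```

For example, comparing the reference line `pula` with the OCR output `pu1a` gives one substitution over four characters. Whether scores are averaged per line, per page or per document is a design choice that the abstract does not specify, so the unweighted mean above is only one possibility.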
Copyright (c) 2020 Gideon Jozua Kotzé, Friedel Wolff
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Copyright of all work published here subsists in the authors. While SACJ retains right of first publication, subsequent re-publication is expressly permitted provided the original SACJ publication is acknowledged and cited, according to the terms detailed below.
If plagiarism is detected during review, a paper may be summarily rejected and will not be accepted unless even minor infringements are corrected. Should plagiarism be detected after a paper is published, the Editor reserves the right to withdraw a paper from publication. We expect authors to be honest in representing work as their own, and to respect the time and effort our reviewers put in without an undue burden of policing plagiarism, and hence take violations seriously.
SACJ applies the Creative Commons Attribution NonCommercial 4.0 License (CC BY-NC 4.0) to all papers published in this journal. Authors who publish with SACJ agree to the following:
- Authors retain copyright and grant SACJ right of first publication. The work is additionally licensed under a Creative Commons Attribution Non-Commercial License that requires others who share the work to acknowledge the work’s authorship and initial publication in SACJ. Should anyone else wish to make commercial use of the work, SACJ cedes the right to the author to negotiate terms and does not expect to be paid any royalties.
- Authors may enter into additional arrangements for non-exclusive distribution of the SACJ-published version of the work (e.g., post it to a repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are required to refrain from posting their work online prior to completion of reviews so as not to compromise double-blind reviewing or confuse plagiarism checks.