Exchanging image processing and OCR components in a Setswana digitisation pipeline

Authors

Gideon Jozua Kotzé, University of South Africa
Friedel Wolff, University of South Africa

DOI:

https://doi.org/10.18489/sacj.v32i2.707

Keywords:

digitisation, optical character recognition, image processing, neural networks

Abstract

As more natural language processing (NLP) applications benefit from neural network-based approaches, it makes sense to re-evaluate existing work in NLP. A complete digitisation pipeline comprises several components that process the material in sequence. Image processing of the scanned document has been shown to be an important factor in final quality. Here we compare two approaches to visually enhancing documents before Optical Character Recognition (OCR): (1) a combination of ImageMagick and Unpaper, and (2) OCRopus. We also compare Calamari, a new line-based OCR package using neural networks, with the well-known Tesseract 3 as the OCR component. Our evaluation on a set of Setswana documents shows that the combination of ImageMagick/Unpaper and Calamari improves on a current baseline of Tesseract 3 with ImageMagick/Unpaper by over 30%, achieving a mean character error rate of 1.69 across all combined test data.
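The abstract describes a two-stage pipeline (image clean-up followed by OCR) evaluated with character error rate (CER). The sketch below illustrates, in minimal Python, how such a pipeline could be wired together and how CER is typically computed; the file names, the Calamari checkpoint path, the 'tsn' Tesseract language code and the exact command-line invocations are assumptions for illustration and are not taken from the paper.

```python
"""Illustrative sketch only: approximates the kind of pipeline described in
the abstract (ImageMagick + unpaper for clean-up, then OCR, then CER).
Paths, the model checkpoint and the 'tsn' language code are assumptions."""
import subprocess


def preprocess(src: str, dst: str) -> None:
    """Clean up a scanned page with ImageMagick and unpaper before OCR."""
    # ImageMagick: convert to greyscale, normalise contrast, deskew the scan.
    subprocess.run(
        ["convert", src, "-colorspace", "Gray", "-normalize",
         "-deskew", "40%", "page.pgm"],
        check=True,
    )
    # unpaper: remove borders, noise and residual skew (default settings).
    subprocess.run(["unpaper", "page.pgm", dst], check=True)


def ocr_tesseract(image: str, out_base: str) -> None:
    """Baseline OCR with Tesseract; 'tsn' is assumed to select a Setswana model."""
    subprocess.run(["tesseract", image, out_base, "-l", "tsn"], check=True)


def ocr_calamari(line_images: list[str], checkpoint: str) -> None:
    """Line-based OCR with Calamari; assumes a trained checkpoint is available."""
    subprocess.run(
        ["calamari-predict", "--checkpoint", checkpoint, "--files", *line_images],
        check=True,
    )


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        cur = [i]
        for j, h in enumerate(hypothesis, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(reference), 1)


if __name__ == "__main__":
    # One missing space out of 11 reference characters -> CER of about 0.09.
    print(cer("pula e a na", "pula e ana"))
```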

Author Biographies

Gideon Jozua Kotzé, University of South Africa

Senior Researcher, Academy of African Languages and Science, College of Graduate Studies

Friedel Wolff, University of South Africa

Language Technologist, Academy of African Languages and Science, College of Graduate Studies

Published

2020-12-08

Issue

Vol. 32 No. 2 (2020)

Section

Research Papers (general)