Buy Crypto Markets Spot FuturesGOLD Earn Event Center

This article details the methodology for digitizing and preparing historical documents for OCR using Tesseract. It covers challenges in data collection from aged archives, preprocessing techniques such as binarization, skew correction, and noise removal, as well as environment setup and dataset preparation. The study follows established evaluation frameworks while adapting them to Tesseract 5, offering insights into improving OCR accuracy on degraded or complex archival materials.This article details the methodology for digitizing and preparing historical documents for OCR using Tesseract. It covers challenges in data collection from aged archives, preprocessing techniques such as binarization, skew correction, and noise removal, as well as environment setup and dataset preparation. The study follows established evaluation frameworks while adapting them to Tesseract 5, offering insights into improving OCR accuracy on degraded or complex archival materials.

Why Your Tesseract OCR Results Suck (and How to Fix Them Fast)

Author: Hackernoon

Source: Hackernoon

2025/08/19 15:00

6 min read

For feedback or concerns regarding this content, please contact us at [email protected]

Table of Links

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

Related work and 2.1 Arabic/Persian

2.2 Chinese/Japanese and 2.3 Coptic

2.4 Greek

2.5 Latin

2.6 Tamizhi
Method and 3.1 Data Collection

3.2 Data Preparation and 3.3 Preprocessing

3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation
Experiments, Results, and Discussion and 4.1 Processed Data

4.2 Dataset and 4.3 Experiments

4.4 Results and Evaluation

4.5 Discussion
Conclusion

5.1 Challenges and Limitations

Online Resources, Acknowledgments, and References

3 Method

This chapter provides the method of conducting this research. It explains data collection and preparation, the experimental environment and its configurations, and the assessment and evaluation of the outcomes.

3.1 Data Collection

We collect data from different public and private libraries with historical documents. We focus on items published in the early and mid-1900s because the first printing press found in Iraq dating back to the 1920s in Sulaymaniyah by Mandate authorities. It was an old hand-operated letterpress called Chapkhanay Hukumat (Government Press) (Hassanpour, 1992). We convert the documents to digital copies. Converting historical documents into digital copies has many issues, and one of them is physical issues. The physical issue with the process involves difficulties from aging, document degradation, and imperfect production processes. Stains, tears, and irregular accumulation of dirt, in addition to artifacts, are some other issues (Antonacopoulos et al., 2004).

3.2 Data Preparation

For optimal performance, Tesseract is best suited for images with a resolution of at least 300 dpi. Therefore, resizing images to meet this requirement can be beneficial. It is worth noting that earlier versions of Tesseract (3.05 and earlier) can handle inverted images, where the background is dark and the text is light, without encountering problems. However, in version 4.x, it is recommended to use images with a light background and dark text for improved performance (Google, 2023b).

3.3 Preprocessing

Before conducting OCR, Tesseract incorporates various image processing operations using the Leptonica library. Leptonica is a freely available open-source library encompassing software suitable for various image processing and analysis applications. In most cases, the built-in image processing functionalities of Tesseract effectively prepare the image for OCR. However, there may be instances where additional refinement is necessary, potentially leading to decreased precision. To observe the image processing steps performed by Tesseract, users can enable the configuration variable tessedit write images and review the processed image. If the resultant image appears to be of low quality, it is possible to apply additional image processing operations before feeding it into Tesseract for improved results (Google, 2023b).

\ • Inverting images: While previous versions of Tesseract (3.05 and earlier) can handle inverted images (with a dark background and light text) without problems, version 4.x should use a dark background and dark text.

\ • Rescaling: To optimize Tesseract’s performance, resizing images to a minimum DPI of 300 is recommended.

\ • Binarization: This process converts an image to black and white. Tesseract internally performs binarization using the Otsu algorithm, but the result may need to be improved, especially if the page background has uneven darkness. Tesseract 5.0.0 introduced Adaptive Otsu and Sauvola, two new Leptonica-based binarization methods.

\ • Noise Removal: Noise refers to unpredictable variations in an image’s brightness or color that can hinder text recognition. Tesseract cannot eliminate some forms of noise during binarization, which can lead to decreased accuracy rates.

\ • Dilation and Erosion: Characters with bold or thin features, especially those with serifs, may impact detail recognition and reduce accuracy. Dilation and Erosion operations can be applied to expand or contract the margins of characters against a common background. Erosion can compensate for heavy ink leakage in historical documents and restore characters to their original glyph structure.

\ • Skew Correction: Skewed images can negatively affect Tesseract’s line segmentation and OCR quality. Rotating the image to align the text lines horizontally can rectify this issue.

\ • Borders:

\ – Missing borders: OCR without a border can cause problems. Adding a minor border (e.g., 10pt) using tools like ImageMagick can help alleviate this issue.

\ – Large borders: Large borders, especially with a single letter/digit or a word on a significant background, can lead to problems (”empty page”). It is recommended to crop the image to fit within the text area with a border of at least 10 points.

\ – Scanning border Removal: Scanned documents often have dark borders, which can be mistakenly interpreted as extra characters, especially if they vary in size, shape, and color.

\ • Transparency / Alpha channel: Certain image formats, like PNG, can incorporate an alpha channel to achieve transparency. Tesseract 4.00, utilizing the Leptonica function pixRemoveAlpha(), can remove the alpha channel by merging the alpha component with a white background. However, this process may lead to issues in specific scenarios, such as performing OCR on movie subtitles. To solve such problems, users might be required to manually eliminate the alpha channel or perform image preprocessing by inverting the colors.

\ 3.3.1 Data Preparation for Tesseract

\ Data preparation for Tesseract can be done in two ways: generating the dataset artificially from text files or manually preparing the dataset from image lines. We follow the latter approach. For the images, they should be in TIFF format with the ”.tif” extension or PNG format with the extensions ”.png”, ”.bin.png”, or ”.nrm.png”. The transcription need plain text files containing a single line of text. They should have the same name as the corresponding line image but with the extension ”.gt.txt” added to the image extension.

3.4 Environment Setup

At present, the training process supports Linux as the operating system. While having a multicore system with OpenMP and Intel Intrinsics support for SSE/AVX extensions is beneficial, but not mandatory. Four cores are considered optimal, but the training can still run on devices with sufficient RAM, albeit slower. The training process does not require a GPU. Apart from the RAM needed for the operating system, having at least 1 GB of additional RAM is recommended. Memory usage can be regulated using the ”–max image MB” command-line option (Google, 2023a).

3.5 Dataset Preparation

We choose various schemes for data splitting based on the data that we collect. For the training and evaluation we follow the method that Idrees (2020) suggested but we apply it to Tesseract version 5.

3.6 Evaluation

Similar to the approach we take for dataset preparation, we follow the method that Idrees (2020) suggested for the evaluation as well.

:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Howler, Kurdistan Region - Iraq ([email protected]);

(2) Hossein Hassani University of Kurdistan Howler Kurdistan Region - Iraq ([email protected]).

:::

:::info This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-NODERIVS 4.0 INTERNATIONAL license.

:::

200,000 USDT Prize Pool

Trade gold, silver & oil. Everyone wins.

Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact [email protected] for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.