This article details the methodology for digitizing and preparing historical documents for OCR using Tesseract. It covers challenges in data collection from aged archives, preprocessing techniques such as binarization, skew correction, and noise removal, as well as environment setup and dataset preparation. The study follows established evaluation frameworks while adapting them to Tesseract 5, offering insights into improving OCR accuracy on degraded or complex archival materials.

Why Your Tesseract OCR Results Suck (and How to Fix Them Fast)

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

3 Method

This chapter describes the research method. It explains data collection and preparation, the experimental environment and its configuration, and the evaluation of the outcomes.

3.1 Data Collection

We collect data from different public and private libraries that hold historical documents. We focus on items published in the early and mid-1900s because the first printing press founded in Iraq dates back to the 1920s, when it was set up in Sulaymaniyah by the Mandate authorities. It was an old hand-operated letterpress called Chapkhanay Hukumat (Government Press) (Hassanpour, 1992). We convert the documents to digital copies. This conversion raises many issues, some of which are physical: aging, document degradation, and imperfect production processes all make digitization harder. Stains, tears, irregular accumulation of dirt, and other artifacts add to the difficulty (Antonacopoulos et al., 2004).

3.2 Data Preparation

Tesseract performs best on images with a resolution of at least 300 dpi, so rescaling images to meet this requirement can be beneficial. Earlier versions of Tesseract (3.05 and earlier) handle inverted images, where the background is dark and the text is light, without problems. For version 4.x, however, images with a light background and dark text are recommended for improved performance (Google, 2023b).
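The two checks above can be sketched in a few lines of Python. This is a minimal illustration, not part of the paper's pipeline: the function names, the "more than half the pixels are dark" heuristic, and the midpoint of 128 are all assumptions, and the image is represented as plain rows of 8-bit grayscale values so that no imaging library is needed.

```python
TARGET_DPI = 300  # minimum resolution recommended in the text


def scale_factor(current_dpi, target_dpi=TARGET_DPI):
    """Return the resize factor needed to reach target_dpi (never downscales)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def scaled_size(width, height, current_dpi, target_dpi=TARGET_DPI):
    """Return the (width, height) in pixels after rescaling to target_dpi."""
    factor = scale_factor(current_dpi, target_dpi)
    return round(width * factor), round(height * factor)


def looks_inverted(gray_rows, midpoint=128):
    """Heuristic (an assumption): a page whose pixels are mostly dark
    probably has light text on a dark background."""
    pixels = [p for row in gray_rows for p in row]
    dark = sum(1 for p in pixels if p < midpoint)
    return dark > len(pixels) / 2


def invert(gray_rows):
    """Flip an 8-bit grayscale image so dark becomes light and vice versa."""
    return [[255 - p for p in row] for row in gray_rows]
```

For example, a 1000 x 1500 pixel scan made at 150 dpi would be resized to 2000 x 3000 pixels. Upscaling does not recover detail that the scanner never captured, so rescanning at a higher resolution is preferable where the originals are still accessible.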

3.3 Preprocessing

Before conducting OCR, Tesseract applies various image processing operations using the Leptonica library. Leptonica is a freely available open-source library for image processing and analysis applications. In most cases, the built-in image processing of Tesseract prepares the image for OCR effectively. However, there are cases where it is not sufficient, which can reduce accuracy. To observe the image processing steps performed by Tesseract, users can enable the configuration variable tessedit_write_images and review the processed image. If the resulting image appears to be of low quality, additional image processing operations can be applied before feeding it into Tesseract for improved results (Google, 2023b).
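As a hedged illustration of that inspection step, the sketch below builds a Tesseract command line that sets the tessedit_write_images configuration variable through the `-c` option. The helper name and the file names are hypothetical, and the command is only assembled here, not executed.

```python
import shlex


def debug_command(image_path, output_base, lang=None):
    """Build a tesseract command that also writes the preprocessed image.

    image_path and output_base are placeholders chosen for illustration.
    """
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]  # optional language model name
    cmd += ["-c", "tessedit_write_images=true"]
    return cmd


# Print the command in a copy-pasteable form.
print(shlex.join(debug_command("page_001.png", "page_001")))
```

Passing the returned list to `subprocess.run` would run the command on a machine where Tesseract is installed.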

• Inverting images: While earlier versions of Tesseract (3.05 and earlier) handle inverted images (with a dark background and light text) without problems, version 4.x should be given images with a light background and dark text.

• Rescaling: To optimize Tesseract’s performance, rescaling images to a minimum of 300 dpi is recommended.

• Binarization: This process converts an image to black and white. Tesseract performs binarization internally using the Otsu algorithm, but the result can be suboptimal, especially if the page background has uneven darkness. Tesseract 5.0.0 introduced Adaptive Otsu and Sauvola, two new Leptonica-based binarization methods.

• Noise Removal: Noise refers to unpredictable variations in an image’s brightness or color that can hinder text recognition. Tesseract cannot eliminate some forms of noise during binarization, which can lower accuracy.

• Dilation and Erosion: Characters with bold or thin strokes, especially those with serifs, may hinder the recognition of details and reduce accuracy. Dilation and erosion operations expand or shrink the edges of characters against a common background. Erosion can compensate for heavy ink bleeding in historical documents and restore characters to their original glyph structure.

• Skew Correction: Skewed images can negatively affect Tesseract’s line segmentation and OCR quality. Rotating the image so that the text lines are horizontal rectifies this issue.

• Borders:

– Missing borders: Running OCR on an image without a border can cause problems. Adding a small border (e.g., 10 pt) using tools like ImageMagick can help alleviate this issue.

– Large borders: Large borders, especially around a single letter, digit, or word on a large background, can lead to problems (“empty page”). It is recommended to crop the image to the text area, leaving a border of at least 10 points.

– Scanning border removal: Scanned documents often have dark borders, which can be mistakenly interpreted as extra characters, especially if they vary in size, shape, and color.

• Transparency / Alpha channel: Certain image formats, like PNG, can include an alpha channel to achieve transparency. Tesseract 4.00 removes the alpha channel with the Leptonica function pixRemoveAlpha(), which blends the alpha component with a white background. However, this process may cause issues in specific scenarios, such as performing OCR on movie subtitles. To solve such problems, users might need to remove the alpha channel themselves or preprocess the image by inverting the colors.
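Two of the operations listed above, global Otsu binarization and adding a border, are simple enough to sketch without an imaging library. The code below is an illustrative sketch under stated assumptions rather than Tesseract's or Leptonica's implementation: images are plain rows of 8-bit grayscale values, the function names are hypothetical, and the border width is expressed in pixels rather than points.

```python
def otsu_threshold(gray_rows):
    """Return the threshold (0..255) that maximises between-class variance."""
    hist = [0] * 256
    for row in gray_rows:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their intensities
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no dark class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # no light class left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray_rows, threshold=None):
    """Map pixels at or below the threshold to black (0), the rest to white (255)."""
    t = otsu_threshold(gray_rows) if threshold is None else threshold
    return [[0 if p <= t else 255 for p in row] for row in gray_rows]


def add_border(rows, width=10, fill=255):
    """Surround the image with a border of `width` pixels (white by default)."""
    if not rows:
        return []
    blank = [fill] * (len(rows[0]) + 2 * width)
    padded = [list(blank) for _ in range(width)]
    for row in rows:
        padded.append([fill] * width + list(row) + [fill] * width)
    padded += [list(blank) for _ in range(width)]
    return padded
```

A single global threshold is exactly the case that struggles with unevenly darkened backgrounds, which is why locally adaptive methods such as Sauvola tend to be tried on degraded pages. Skew correction, noise removal, dilation, and erosion are not covered by this sketch.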

3.3.1 Data Preparation for Tesseract

Data preparation for Tesseract can be done in two ways: generating the dataset artificially from text files or manually preparing the dataset from line images. We follow the latter approach. The images should be in TIFF format with the “.tif” extension or in PNG format with the extension “.png”, “.bin.png”, or “.nrm.png”. The transcriptions need to be plain text files, each containing a single line of text. Each transcription should have the same name as the corresponding line image, but with the image extension replaced by “.gt.txt”.
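The naming rule can be checked mechanically before training. The sketch below assumes the convention in which the image extension is replaced by “.gt.txt” (so that line_001.bin.png pairs with line_001.gt.txt); the function names and the example paths are hypothetical.

```python
from pathlib import Path

# Accepted image suffixes, longest first so ".bin.png" wins over ".png".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")


def ground_truth_path(image_path):
    """Return the transcription path that pairs with a line image."""
    path = Path(image_path)
    name = path.name
    for suffix in IMAGE_SUFFIXES:
        if name.endswith(suffix):
            return path.with_name(name[: -len(suffix)] + ".gt.txt")
    raise ValueError(f"unsupported image extension: {name}")


def is_single_line(text):
    """True if the transcription holds exactly one non-empty line of text."""
    return len(text.strip().splitlines()) == 1
```

Running both checks over a directory of line images is a cheap way to catch missing or multi-line transcriptions before a long training run starts.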

3.4 Environment Setup

At present, the training process supports Linux as the operating system. A multicore system with OpenMP and Intel Intrinsics support for SSE/AVX extensions is beneficial but not mandatory. Four cores are a good configuration, but training can still run on any device with sufficient RAM, albeit more slowly. The training process does not require a GPU. At least 1 GB of RAM beyond what the operating system needs is recommended. Memory usage can be regulated using the “--max_image_MB” command-line option (Google, 2023a).
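A small preflight check can turn these recommendations into something scriptable. The sketch below is only an illustration: the function name and the wording of the notes are assumptions, and it reports advisory notes rather than refusing to run, since the text describes multiple cores as helpful but optional.

```python
import os
import platform

RECOMMENDED_CORES = 4  # the core count the text describes as a good fit


def preflight(system=None, cores=None):
    """Return a list of advisory notes about the training machine.

    Both arguments default to values detected on the current machine.
    """
    system = platform.system() if system is None else system
    cores = (os.cpu_count() or 1) if cores is None else cores
    notes = []
    if system != "Linux":
        notes.append(f"training is described for Linux; found {system}")
    if cores < RECOMMENDED_CORES:
        notes.append(f"{cores} core(s) available; training will run, but more slowly")
    return notes
```

Available RAM is not checked here because the Python standard library has no portable call for it.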

3.5 Dataset Preparation

We choose the data-splitting schemes based on the data that we collect. For training and evaluation, we follow the method suggested by Idrees (2020), but we apply it to Tesseract version 5.
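The section does not state the split ratios, so the following is only a sketch of one possible scheme: a reproducible shuffle of the line-image names followed by a single cut. The 0.9 default ratio, the seed, and the function name are assumptions made for illustration and are not taken from the paper or from Idrees (2020).

```python
import random


def split_lines(names, train_ratio=0.9, seed=0):
    """Split line-image names into (training, evaluation) lists.

    The names are sorted first so the result depends only on the seed,
    not on the order in which the files were listed.
    """
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    ordered = sorted(names)
    random.Random(seed).shuffle(ordered)
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]
```

Fixing the seed keeps the evaluation set identical across experiments, so accuracy figures from different training runs remain comparable.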

3.6 Evaluation

As with dataset preparation, we follow the method suggested by Idrees (2020) for the evaluation.


:::info Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region - Iraq ([email protected]);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region - Iraq ([email protected]).

:::


:::info This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International license.

:::
