Development of a Gold-standard Pashto Dataset and a Segmentation App


This article introduces a gold-standard Pashto dataset and a segmentation app. The Pashto dataset consists of 300 line images and the corresponding Pashto text from three selected books. A line image is an image containing a single line of text from a scanned page. To our knowledge, this is one of the first open access datasets that directly map line images to their corresponding text in the Pashto language. We also describe the development of a segmentation app that uses textbox-expanding algorithms, a different approach to OCR segmentation.

We discuss the steps taken to build the Pashto dataset and to develop our approach to segmentation. The article starts with the nature of the Pashto alphabet and its unique diacritics, which require special consideration in segmentation. We then review the need for datasets and the few Pashto datasets that are available. We discuss the criteria for selecting data sources; three books were selected by our language specialist from the Afghan Digital Repository. We review previous segmentation methods and introduce a new approach to segmentation for Pashto content. The segmentation app and its results are discussed to show readers how to adjust variables for different books. Our segmentation approach uses an expanding textbox method, which performs very well given the nature of the Pashto script.
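As a rough illustration of what an expanding-textbox step might look like, the following Python sketch grows a seed box over a binary image until no ink lies within reach of any edge. It is not the app's actual algorithm, which this abstract does not specify. The function names `has_ink` and `expand_box`, the `reach` parameter, and the toy image are all assumptions made for illustration.

```python
# Hypothetical sketch of an expanding-textbox step; not the authors' code.
# The image is a list of rows; 1 marks an ink pixel, 0 marks background.

def has_ink(img, r0, r1, c0, c1):
    """Return True if any pixel in rows r0..r1, columns c0..c1 (inclusive) is ink."""
    return any(img[r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1))


def expand_box(img, box, reach=1):
    """Grow box = (top, bottom, left, right), inclusive, until no ink is in reach.

    `reach` (an assumed tuning variable) is how many pixels beyond each edge
    are searched. reach=2 lets the box jump over one blank row or column,
    which is one way to pull in a diacritic detached from the letter body.
    """
    height, width = len(img), len(img[0])
    top, bottom, left, right = box
    changed = True
    while changed:
        changed = False
        for step in range(1, reach + 1):  # look above the top edge
            if top - step >= 0 and has_ink(img, top - step, top - step, left, right):
                top -= step
                changed = True
                break
        for step in range(1, reach + 1):  # look below the bottom edge
            if bottom + step < height and has_ink(img, bottom + step, bottom + step, left, right):
                bottom += step
                changed = True
                break
        for step in range(1, reach + 1):  # look left of the left edge
            if left - step >= 0 and has_ink(img, top, bottom, left - step, left - step):
                left -= step
                changed = True
                break
        for step in range(1, reach + 1):  # look right of the right edge
            if right + step < width and has_ink(img, top, bottom, right + step, right + step):
                right += step
                changed = True
                break
    return top, bottom, left, right


# Toy image: a dot (row 1) above a letter body (rows 3-4), then a second line (row 7).
IMG = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
]
SEED = (3, 3, 2, 2)  # a single ink pixel inside the letter body

print(expand_box(IMG, SEED, reach=1))  # (3, 4, 1, 4): body only, dot is missed
print(expand_box(IMG, SEED, reach=2))  # (1, 4, 1, 4): body plus the dot above it
```

In this sketch `reach` plays the role of an adjustable variable: set too small, the box cuts off the diacritic above the letter body; set too large (3 in the toy image), it swallows the next text line as well.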

The app can also be used for Persian and other languages using the Arabic writing system. The dataset can be used for OCR training, OCR testing, and machine learning applications related to content in Pashto.



How to Cite
Han, Y., & Rychlik, M. (2021). Development of a Gold-standard Pashto Dataset and a Segmentation App. Information Technology and Libraries, 40(1).