Searching text corpora in Tibetan Unicode


This post introduces two free applications that we recently tested for searching the Tibetan Unicode text corpora kindly made available by ACIP, BDRC zenodo, BDRC OpenPecha, Esukhia, and others. Each application has its own advantages and shortcomings in usability and performance, and the two return different search results.

1) DocFetcher (accessed: 2020-06-20)

- desktop search application for Windows, Linux and OS X (open source)
- supports txt files and many other document formats, but not xml (xml files need to be renamed to txt before the index is created)
- index-based search (the files must be indexed once before they can be searched)
- phrase search: enclose the search phrase in “quotation marks”

DocFetcher, "ནག་པོ་པ"
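
The renaming step for xml files can be scripted. The following is only a minimal sketch in Python, not part of DocFetcher: the function name copy_xml_as_txt and the folder names in the usage note are our own hypothetical choices. It copies the files instead of renaming them in place, so the original corpus stays untouched.

```python
import shutil
from pathlib import Path


def copy_xml_as_txt(src_dir, dst_dir):
    """Copy every .xml file under src_dir into dst_dir with a .txt extension.

    The folder structure is preserved and the originals are left unchanged.
    Returns the list of files that were written.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    copied = []
    for xml_path in sorted(src.rglob("*.xml")):
        # Same relative path as the source file, but ending in .txt
        target = dst / xml_path.relative_to(src).with_suffix(".txt")
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(xml_path, target)
        copied.append(target)
    return copied
```

For example, copy_xml_as_txt("corpus_xml", "corpus_txt") would fill a folder corpus_txt that can then be indexed. Note that the xml tags remain in the copied files and will be indexed along with the text.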

2) AntConc (accessed: 2020-06-20)

- corpus analysis toolkit for concordancing and text analysis for Windows, Linux and OS X (freeware)
- supports txt, html and xml files (with an option to hide tags)
- searches the files live, without an index (slow; crashes frequently on macOS Catalina)
- requires particular settings for displaying Tibetan properly (font and font size) and for searching in Tibetan; see Esukhia, GitHub
- typing Tibetan directly into the search bar does not work under Windows 10 (copy and paste the search phrase instead)

AntConc, "ནག་པོ་པ"
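
To illustrate what a live, index-free search with concordance output amounts to, here is a rough sketch in Python. It is not how AntConc is implemented: the functions kwic and search_folder are hypothetical names of our own, and the matching is plain substring search, with no tokenisation, tag hiding or Unicode normalisation.

```python
from pathlib import Path


def kwic(text, phrase, width=20):
    """Return (left context, match, right context) for each occurrence of phrase.

    width is the number of characters of context kept on each side.
    """
    if not phrase:
        return []
    hits = []
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        hits.append((left, phrase, right))
        # Continue one character later so overlapping matches are found too
        start = text.find(phrase, start + 1)
    return hits


def search_folder(folder, phrase, width=20):
    """Read every .txt file under folder and collect its concordance lines."""
    results = {}
    for path in sorted(Path(folder).rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        hits = kwic(text, phrase, width)
        if hits:
            results[path] = hits
    return results
```

Calling search_folder("corpus_txt", "ནག་པོ་པ") on a hypothetical folder corpus_txt reads every file again on each search, which is why this kind of search becomes slow on large corpora, whereas an index-based search pays that cost once when the index is built.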