Programming Projects on the Sam-bhashya Platform
Siddhanta Kosha is an online library of Indic language books that offers content search as well as version-controlled, collaborative annotation of books stored anywhere on the Internet. Sam-bhashya is the name of this annotation facility. Sam-bhashya offers RESTful API access to its annotation dataset to enable mining of its content for developing sophisticated linguistic tools.
We invite students and software professionals to use and enrich this platform. We suggest a few illustrative programming projects to get started.
Sam-bhashya uses a novel extensible markup language for multi-level annotation, called Indic Knowledge Markup Language (IKML). We have released an open-source Python package called ikml_doc on PyPI.org to make IKML content easy to process. The package's project description explains how to use it.
Programming Projects
- Write a function called stats(url) that returns a dictionary with the number of occurrences of each tag in the document.
Sample output: { "va": 234, "pa": 1000, ... }
- Write a function vakya_splits(url) that returns a list of vakyas paired with their split attributes.
Sample output: [ ("kurukshetra", "kuru-kshetra"), ... ]
- Extract vakya splits from siddhantakosha.org and develop an automatic vakya splitter. Use it to auto-split the vakyas of an IKML document and add the splits as annotations.
- Write a text extractor for Google OCR output. Split the extracted text into vakyas and store them in IKML.
- Auto-tag shlokas of an IKML grantha with their chandas.
- Write a JavaScript UI that picks a random vakya from the siddhantakosha.org collection and prompts the user to split it by sandhi and samaasa. Store the result back.
- Write a tool to extract vakyas and their tantrayukti attributes where available from siddhantakosha.org.
- Auto-tag the tantrayukti of vakyas, using the data extracted in the previous project as training data.
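As a starting point for the first two projects, here is a minimal sketch in plain Python. It does not use the ikml_doc package; see its project description for the supported way to load a document. The sketch assumes, purely for illustration, that each IKML node begins a line in the form [tag attr="value"] content and that a vakya's split is stored in an attribute named split. The tag name va comes from the sample output above. The line shape, the attribute name, and the helper names parse_nodes, count_tags, split_pairs and fetch_text are all hypothetical and should be adapted to the real IKML syntax.

```python
import re
from collections import Counter
from urllib.request import urlopen

# ASSUMPTION: a node line looks like  [tag attr="value" ...] content
# This shape is illustrative only; real IKML syntax may differ.
TAG_LINE = re.compile(r'^\s*\[(?P<tag>[^\s\]]+)(?P<attrs>[^\]]*)\]\s*(?P<content>.*)$')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_nodes(ikml_text):
    """Yield (tag, attrs, content) for every line that looks like a node."""
    for line in ikml_text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            attrs = dict(ATTR.findall(match.group("attrs")))
            yield match.group("tag"), attrs, match.group("content").strip()


def count_tags(ikml_text):
    """Return {tag: number of occurrences} for the given text."""
    return dict(Counter(tag for tag, _, _ in parse_nodes(ikml_text)))


def split_pairs(ikml_text, tag="va", attr="split"):
    """Return (content, split) pairs for nodes carrying the split attribute.

    ASSUMPTION: the attribute name "split" is hypothetical.
    """
    return [
        (content, attrs[attr])
        for node_tag, attrs, content in parse_nodes(ikml_text)
        if node_tag == tag and attr in attrs
    ]


def fetch_text(url):
    """Download a document and decode it as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def stats(url):
    return count_tags(fetch_text(url))


def vakya_splits(url):
    return split_pairs(fetch_text(url))
```

Keeping the parsing in count_tags and split_pairs, which take text rather than a URL, lets you test the logic on a small hand-written sample before pointing stats(url) or vakya_splits(url) at a real document.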
How to participate
If you are interested in participating, please join our WhatsApp group by clicking this link.
You need to know Python programming at a minimum.