IKML: A Markup Language for Collaborative Semantic Annotation of Shaastra Texts
Bharatiya shaastra texts employ highly structured and well-defined patterns of discourse derived from Nyaaya and Mimaamsa concepts. Although they are written as flat text, making their knowledge structure explicit greatly helps in understanding and interpreting their meaning. It also helps in building automated tools to mine these texts for insights, and in building computational models of shaastras. However, transforming shaastra texts into knowledge structures cannot yet be automated, as there is not enough annotated data to train machine-learning tools. To help this process, we have developed a novel markup language called IKML (Indic Knowledge Markup Language). IKML offers a new way to represent shaastra texts and annotate them collaboratively with knowledge metadata at multiple levels of abstraction. It is designed for easy collaborative editing by shaastra scholars, version control, visualization and scripted processing. It adopts best practices from popular languages such as YAML, XML and JSON and is auto-convertible to these languages. This makes the IKML representation of shaastra texts amenable to processing by scripting and visualization tools for large-scale knowledge mining. The key guiding principles of IKML design are brevity, readability, extensibility, and minability. IKML can simultaneously represent an Indic language book as a scanned image document, OCR-extracted text, proof-corrected text, a sentence- and word-split version, a grammar-tagged version, and a discourse-tagged version including tantrayuktis, tatparya lingas, sangatis and nyaaya sambandhas, as well as augmentation with user notes, translations and comments. IKML supports tagging semantic relationships at multiple levels: between granthas/treatises, vibhagas/sections, vakyas/sentences and vishayas/concepts. Our web-based portal, Siddhanta Kosha, offers collaborative, crowd-sourced annotation and graph visualization of shaastra grantha libraries using IKML. The kosha currently has 25 popular granthas totalling 20,000 sentences.
IKML Overview
- Each line of an IKML document looks as follows:
  - [tag attr1=val1 ..] text
- IKML defines tags within square brackets [ ] to describe document content. E.g., [va] denotes a vakya.
- Each tag and its content is given on its own line.
- A tag can have two kinds of children, denoted by indentation with 2 spaces:
  - Attributes that give further information about the tag content. They are specified within [ ] prefixed by a dot “.”. E.g., [.vibhakti]
  - Other tags that denote sub-annotations of the content. E.g., [pa] enumerates each pada of the vakya separately. Repeating the same subtag multiple times denotes a collection.
- An attribute can also be specified inline with the tag. E.g., [va label="vakya"]
- There are some predefined attributes:
  - label: a descriptive phrase for a tag.
  - rel_id: auto-filled serial number for multiple repetitions of the same tag within its parent tag.
  - rel_prefix: string to be prefixed to rel_id.
  - id: auto-generated globally unique identifying string for the tag instance.
    - id=<parent_tag.id>.<rel_prefix><rel_id>
    - E.g., tarka.v.10 denotes the 10th vakya of the tarka book.
  - option: possible preset values of an attribute can be defined using its “option” sub-attribute.
- New tags can be defined, along with their allowed subtag hierarchy, in a special tag called ikml_schema.
- A special tag, [include] <file path or URL>, can be used to include IKML files in each other. This enables large documents to be split modularly into multiple IKML files.
- Another special tag, [inline], is similar to [include], but makes the inlined file’s content part of its parent file.
- Content for a tag can be given after the tag as Unicode text without quotes. Leading and trailing white space is ignored.
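The line format and the id rule described above can be illustrated with a minimal Python sketch. This is not the IKML reference implementation: the function names parse_line and make_id, the regular expressions, and the handling of quoted attribute values are assumptions made for illustration only.

```python
import re

# Assumed shape of one line: optional indent, "[tag attr=val ...]", optional text.
LINE_RE = re.compile(r'^(?P<indent> *)\[(?P<head>[^\]]*)\]\s*(?P<text>.*)$')
# Inline attributes: key=value or key="quoted value" (quoting rules are assumed).
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_line(line):
    """Split one IKML line into depth, tag name, attribute flag, inline attrs and text."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None or not m.group("head").strip():
        raise ValueError(f"not an IKML tag line: {line!r}")
    head = m.group("head")
    name = head.split()[0]
    attrs = {key: (quoted or bare) for key, quoted, bare in ATTR_RE.findall(head)}
    return {
        "depth": len(m.group("indent")) // 2,   # 2 spaces per nesting level
        "tag": name.lstrip("."),
        "is_attribute": name.startswith("."),   # [.name] marks an attribute
        "attrs": attrs,
        "text": m.group("text").strip(),        # leading/trailing space ignored
    }


def make_id(parent_id, rel_prefix, rel_id):
    """Compose id=<parent_tag.id>.<rel_prefix><rel_id>."""
    return f"{parent_id}.{rel_prefix}{rel_id}"


if __name__ == "__main__":
    print(parse_line("    [va id=TarkaSM.v.1 rel_id=1] निधाय हृदि"))
    print(make_id("tarka", "v.", 10))
```

Here the rel_prefix is taken to be "v." so that the composed id matches the tarka.v.10 example; whether the separator dot belongs to the prefix is an assumption.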
Here is a sample IKML snippet illustrating its salient features.
[include] schema.ikml
[grantha id=TarkaSM]
  [vakyas]
    [va id=TarkaSM.v.1 rel_id=1] निधाय हृदि विश्वेशं विधाय गुरुवन्दनम् । बालानां सुखबोधाय क्रियते तर्कसंग्रहः ॥
      [.tantrayukti id=TarkaSM.v.1] प्रयोजनम् purpose
      [.gist_tag id=TarkaSM.v.1] उपक्रमः beginning
      [.treatise_tag id=TarkaSM.v.1] प्रयोजनम् result
      [pa id=TarkaSM.v.1.1 rel_id=1] निधाय
        [.ptype] avyaya
      [pa id=TarkaSM.v.1.2 rel_id=2] हृदि
        [.ptype] subantam
        [.vibhakti] 7
        [.vachana] 1
      [sypa id=TarkaSM.v.1.3 rel_id=3] विश्वेशं
        [.sandhi] guna
        [.split] viSva+ISam
        [smpa]
          [.samaasa] tatpuruSha
          [.split] viSva-ISam
          [pa] viSva
            [.ptype] praatipadikam
          [pa] ISam
            [.ptype] subantam
            [.vibhakti] 2
            [.vachana] 1
      [pa id=TarkaSM.v.1.4 rel_id=4] विधाय
      [smpa id=TarkaSM.v.1.5 rel_id=5] गुरुवन्दनम्
      [pa id=TarkaSM.v.1.6 rel_id=6] बालानां
      [smpa id=TarkaSM.v.1.7 rel_id=7] सुखबोधाय
      [pa id=TarkaSM.v.1.8 rel_id=8] क्रियते
      [smpa id=TarkaSM.v.1.9 rel_id=9] तर्कसंग्रहः
    [va id=TarkaSM.v.2 rel_id=2] द्रव्यगुणकर्मसामान्यविशेषसमवायाऽभावाः सप्तपदार्थाः ॥ १॥
      [.gist_tag id=TarkaSM.v.2] उपक्रमः beginning
  [sambandhas]
    [smb id=TarkaSM.s.2 rel_id=2]
      [.srcid id=TarkaSM.s.2] TarkaSM.v.2
      [.targetid id=TarkaSM.s.2] TarkaSM.v.1
      [.src_phrase id=TarkaSM.s.2] सप्तपदार्थ
      [.target_phrase id=TarkaSM.s.2] तर्कसंग्रह
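As a rough illustration of how an IKML document indented with 2 spaces per level might be turned into a nested structure and exported to JSON, consider the Python sketch below. It is not the authors' converter: the function name parse_ikml and the node layout (tag, attrs, text, children) are hypothetical. The sketch also assumes that attribute lines such as [.ptype] are leaves and folds their text into the parent tag's attrs, ignoring any inline attributes they carry.

```python
import json
import re

# Assumed line shape: indent, "[name inline-attrs]", optional text.
LINE_RE = re.compile(r'^( *)\[([^\]\s]+)([^\]]*)\]\s*(.*)$')


def parse_ikml(text):
    """Build a nested dict tree from IKML text indented with 2 spaces per level."""
    root = {"tag": "root", "attrs": {}, "text": "", "children": []}
    stack = [(-1, root)]  # (depth, node) pairs along the current branch
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        m = LINE_RE.match(raw)
        if m is None:
            raise ValueError(f"not an IKML tag line: {raw!r}")
        indent, name, inline, content = m.groups()
        depth = len(indent) // 2
        # Climb back up until the top of the stack is this line's parent.
        while stack[-1][0] >= depth:
            stack.pop()
        parent = stack[-1][1]
        if name.startswith("."):
            # Attribute line: store its text on the parent (treated as a leaf).
            parent["attrs"][name[1:]] = content.strip()
            continue
        attrs = dict(item.split("=", 1) for item in inline.split() if "=" in item)
        node = {"tag": name, "attrs": attrs, "text": content.strip(), "children": []}
        parent["children"].append(node)
        stack.append((depth, node))
    return root


if __name__ == "__main__":
    sample = (
        "[grantha id=TarkaSM]\n"
        "  [vakyas]\n"
        "    [va id=TarkaSM.v.1 rel_id=1] निधाय हृदि\n"
        "      [pa rel_id=1] निधाय\n"
        "        [.ptype] avyaya\n"
    )
    # ensure_ascii=False keeps Devanagari text readable in the JSON output.
    print(json.dumps(parse_ikml(sample), ensure_ascii=False, indent=2))
```

Keeping a stack of (depth, node) pairs means each line is attached to the nearest preceding line with a smaller indentation. That matches the rule that children are denoted by indentation, but the sketch does not validate the result against an ikml_schema.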