Part-of-speech tagging is the process of selecting the most likely sequence of syntactic categories for the words in a sentence. It determines the grammatical characteristics of each word, such as part of speech, number, gender, and person. For Arabic, this task is not trivial: most words are ambiguous because vowels are usually absent from written text.

For each word, we want at a minimum to identify its main lexical category (noun, verb, etc.) and its inflectional features (plural, past tense, etc.), if any. We might also identify some quasi-semantic features (such as proper noun) or even specify a word sense relative to some lexicon.

Kalmasoft PoS Tagger addresses most of the problems related to Arabic corpus tagging. It is a context-sensitive, rule-based solution built on a comprehensive, hand-crafted set of syntactic rules for Arabic datasets. The output is structured XML or JSON; SQL database and CSV output are available as alternatives.

Kalmasoft PoS Tagger is designed to prepare annotated Arabic corpora. A tagged corpus is more useful than an untagged one because it carries more information than the raw text alone. Once a corpus is tagged, information can be extracted from it and used to create dictionaries and grammars of a language from real language data. Tagged corpora are also useful for detailed quantitative analysis of text.

The system's output, the processed corpus, is therefore suited to machines rather than human readers. A view interface exists for testing purposes and works well for short texts; output can also be saved as an HTML or TXT file.

V: verb              A: adjective            C: conjunction
N: noun              Pr: preposition         a: adverb
d: demonstrative     r: relative             F: foreign word
O: ordinal number    E: verbal noun
R: pronoun           T: typographic error    X: no solution

P: perfective        S: singular             F: feminine
I: imperfective      D: dual                 1: first person
M: imperative        P: plural               2: second person
E: emphatic          M: masculine            3: third person

See the full documentation of the Kalmasoft tagset here.

تعددت وتنوعت الأزمات التي خلفتها الحرب في اليمن وأزمة الانقطاع الكامل لخدمة الكهرباء ضاعفت من معاناة سكان هذه البلاد ودفعتهم نحو مصادر الطاقة البديلة للتخفيف من آثار تلك الأزمة
toddt wtnwot Al!zmAt Alty KlfthA AlHrb fy Alymn w!zm: AlAnqTAo AlkAml lKdm: AlkhrbA' DAoft mn moAnA: skAn hch AlblAd wdfothm nHw mSAdr AlTAq: Albdyl: lltKfyf mn |xAr tlk Al!zm:
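
The Latin line above is a character-by-character transliteration of the Arabic sentence. As a minimal sketch of that correspondence, the Python snippet below uses a mapping inferred only from this one sample sentence; it is not the official transliteration table, it covers only the letters that occur here, and the names `KATS_SAMPLE_MAP` and `transliterate` are hypothetical.

```python
import unicodedata

# Mapping inferred ONLY from the sample sentence and its transliteration;
# this is an assumption, not the official Kalmasoft table, and it is incomplete.
KATS_SAMPLE_MAP = {
    "ا": "A", "أ": "!", "آ": "|", "ء": "'", "ب": "b", "ت": "t", "ة": ":",
    "ث": "x", "ح": "H", "خ": "K", "د": "d", "ذ": "c", "ر": "r", "ز": "z",
    "س": "s", "ص": "S", "ض": "D", "ط": "T", "ع": "o", "ف": "f", "ق": "q",
    "ك": "k", "ل": "l", "م": "m", "ن": "n", "ه": "h", "و": "w", "ي": "y",
}


def transliterate(text: str) -> str:
    """Replace each known Arabic letter; pass any other character through."""
    # NFC keeps hamza-carrying letters such as "أ" as single code points.
    text = unicodedata.normalize("NFC", text)
    return "".join(KATS_SAMPLE_MAP.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(transliterate("الأزمات التي خلفتها الحرب"))
```

Letters that do not occur in the sample are passed through unchanged, so they remain visible in the output rather than being silently mis-mapped.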

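ID | Token | KATS | Syntax | Arguments | Prefix | Suffix | Gloss*
1 | تعددت | toddt | VPIA | 3PF••• | | |
2 | وتنوعت | wtnwot | VPIA | 053PF••• | PC | |
3 | الأزمات | Al!zmAt | NNG | ••••PF | PD | |
4 | التي | Alty | PL | | | |
5 | خلفتها | KlfthA | VPIA | 023SF3SF | | |
6 | الحرب | AlHrb | NNN | ••••S• | PD | |
7 | في | fy | PP | | | |
8 | اليمن | Alymn | NN•G | | | |
9 | وأزمة | w!zm: | NF•G | ••••SF | PC | |
10 | الانقطاع | AlAnqTAo | NF•G | 07•SM | PD | |
11 | الكامل | AlkAml | NA•G | ••••SM | PD | |
12 | لخدمة | lKdm: | NF•G | ••••SF | | |
13 | الكهرباء | AlkhrbA' | N••G | | PD | |
14 | ضاعفت | DAoft | VPIA | 033SF••• | | |
15 | من | mn | PP | | | |
16 | معاناة | moAnA: | NF•G | 03•SF | | |
17 | سكان | skAn | NQ•G | ••••BM | | |
18 | هذه | hch | PD | | | |
19 | البلاد | AlblAd | N••G | | PD | |
20 | ودفعتهم | wdfothm | VPIA | 013SF3PM | PC | |
21 | نحو | nHw | NV | | | |
22 | مصادر | mSAdr | NF•A | ••••PM | | |
23 | الطاقة | AlTAq: | N••G | ••••SF | PD | |
24 | البديلة | Albdyl: | NA•G | ••••SF | PD | |
25 | للتخفيف | lltKfyf | NF•G | | PP | |
26 | من | mn | PP | | | |
27 | آثار | \|xAr | NF•G | ••••B• | | |
28 | تلك | tlk | PD | | | |
29 | الأزمة | Al!zm: | NF•G | ••••SF | | |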
(*) Glosses are for reference only; the actual module outputs a simplified gloss (stems only).

(**) A larger XML output sample can be found here: XML output sample
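
To illustrate how rows like those in the sample table might be handled downstream, the sketch below stores a few of them as records, serializes them to JSON and CSV, and counts tokens per Syntax tag. The field names are taken from the table header, and the placement of values in the Arguments and Prefix columns is a reading of the sample. The `TaggedToken` class and the helper functions are hypothetical, and the tagger's real XML/JSON schema is not shown here and may differ.

```python
import csv
import io
import json
from collections import Counter
from dataclasses import asdict, dataclass, fields


@dataclass
class TaggedToken:
    # Hypothetical record; field names mirror the sample table header.
    id: int
    token: str
    kats: str
    syntax: str
    arguments: str = ""
    prefix: str = ""
    suffix: str = ""


# First five rows of the sample table (empty cells left as empty strings).
ROWS = [
    TaggedToken(1, "تعددت", "toddt", "VPIA", "3PF•••"),
    TaggedToken(2, "وتنوعت", "wtnwot", "VPIA", "053PF•••", "PC"),
    TaggedToken(3, "الأزمات", "Al!zmAt", "NNG", "••••PF", "PD"),
    TaggedToken(4, "التي", "Alty", "PL"),
    TaggedToken(5, "خلفتها", "KlfthA", "VPIA", "023SF3SF"),
]


def to_json(tokens):
    """Serialize records as a JSON array, keeping Arabic text readable."""
    return json.dumps([asdict(t) for t in tokens], ensure_ascii=False, indent=2)


def to_csv(tokens):
    """Serialize records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(TaggedToken)])
    writer.writeheader()
    writer.writerows(asdict(t) for t in tokens)
    return buf.getvalue()


def syntax_counts(tokens):
    """Count tokens per Syntax tag, a simple quantitative summary."""
    return Counter(t.syntax for t in tokens)


if __name__ == "__main__":
    print(to_json(ROWS))
    print(to_csv(ROWS))
    print(syntax_counts(ROWS).most_common())
```

Keeping empty cells as empty strings, rather than dropping the keys, keeps every record the same shape, which simplifies CSV export and later aggregation.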

