Part-of-speech tagging is the process of selecting the most likely sequence of syntactic categories for the words in a sentence. It determines the grammatical characteristics of each word, such as part of speech, grammatical number, gender, and person. For Arabic, this task is not trivial: most words are ambiguous because short vowels are usually not written.

For each word, we want at a minimum to identify its main lexical category (noun, verb etc.) and inflectional features (plural, past tense etc.) if any. We might also identify some quasi-semantic features (proper noun) or even specify a word sense relative to some lexicon.

- Dependency Grammar
- Catenation
- Sentence Structure
- Deep Tagging

Kalmasoft's Deep PoS Tagger addresses most of the problems related to Arabic corpus tagging. It provides a context-sensitive solution for each token using a comprehensive set of syntactic rules. The output is structured XML; an SQL database and CSV are among the alternatives.
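
As a rough illustration of what structured output for tagged tokens could look like, the sketch below serialises a few records to CSV and XML using only the Python standard library. The field names follow the columns of the sample output table on this page; the XML element names (`corpus`, `token`) and the helper functions `to_csv` and `to_xml` are hypothetical and do not represent Kalmasoft's actual schema or API.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Field names mirror the sample output table; the XML layout is an assumption.
FIELDS = ["id", "token", "kats", "syntax", "morphology", "prefix", "suffix"]


def to_csv(records):
    """Serialise tagged tokens as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # missing fields are written as empty cells
    return buffer.getvalue()


def to_xml(records):
    """Serialise tagged tokens as XML, one <token> element per record."""
    root = ET.Element("corpus")  # hypothetical root element name
    for record in records:
        token = ET.SubElement(root, "token", id=record["id"])
        for field in FIELDS[1:]:
            value = record.get(field, "")
            if value:  # omit fields that have no value
                ET.SubElement(token, field).text = value
    return ET.tostring(root, encoding="unicode")


# Two rows copied from the sample output table.
records = [
    {"id": "0006", "token": "الحرب", "kats": "AlHrb",
     "syntax": "NNN", "morphology": "---S-", "prefix": "PD"},
    {"id": "0007", "token": "في", "kats": "fy", "syntax": "PP"},
]

print(to_csv(records))
print(to_xml(records))
```

Keeping one flat record per token makes it straightforward to target several output formats from the same data.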

Kalmasoft's Deep PoS Tagger is designed to prepare annotated Arabic corpora. A tagged corpus is more useful than an untagged one because it carries more information than the raw text alone. Once a corpus is tagged, it can be used to extract information, which can in turn be used to create dictionaries and grammars of a language from real language data. Tagged corpora are also useful for detailed quantitative analysis of text.
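
A simple example of quantitative analysis on a tagged corpus is counting how often each tag occurs. The sketch below is a minimal illustration in Python; the (transliteration, syntax tag) pairs are copied from the sample output on this page, and the function name `tag_frequencies` is hypothetical.

```python
from collections import Counter

# (KATS transliteration, syntax tag) pairs taken from the sample output table.
tagged = [
    ("AlHrb", "NNN"),
    ("fy", "PP"),
    ("Alymn", "NN-G"),
    ("DAoft", "VPIA"),
    ("mn", "PP"),
    ("hch", "PD"),
    ("wdfothm", "VPIA"),
]


def tag_frequencies(pairs):
    """Count how many tokens carry each syntax tag."""
    return Counter(tag for _, tag in pairs)


frequencies = tag_frequencies(tagged)
for tag, count in frequencies.most_common():
    print(f"{tag}\t{count}\t{count / len(tagged):.2f}")
```

The same counting approach extends to tag pairs or token/tag pairs when building frequency lists for dictionaries.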

The system's output, the processed corpus, is therefore suited to machines rather than humans, although a view interface exists for testing purposes and works well for short texts. Output can also be saved as an HTML or TXT file.

Figure: MAPS Arabic PoS tagger. A screenshot of the MAPSSeman interface; you can view the technical specifications.
Figure: Visualisation of parsed catena.

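| Tag | Meaning | Tag | Meaning | Tag | Meaning |
|---|---|---|---|---|---|
| V | verb | A | adjective | C | conjunction |
| N | noun | Pr | preposition | a | adverb |
| d | demonstrative | r | relative | F | foreign word |
| O | ordinal number | E | verbal noun | | |
| R | pronoun | T | typographic error | X | No Solution |

| Tag | Meaning | Tag | Meaning | Tag | Meaning |
|---|---|---|---|---|---|
| P | perfective | S | singular | F | feminine |
| I | imperfective | D | dual | 1 | first person |
| I | imperative | P | plural | 2 | second person |
| E | energetic | M | masculine | 3 | third person |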

See the full documentation of the Kalmasoft tagset here.

تعددت وتنوعت الأزمات التي خلفتها الحرب في اليمن وأزمة الانقطاع الكامل لخدمة الكهرباء ضاعفت من معاناة سكان هذه البلاد ودفعتهم نحو مصادر الطاقة البديلة للتخفيف من آثار تلك الأزمة
toddt wtnwot Al!zmAt Alty KlfthA AlHrb fy Alymn w!zm: AlAnqTAo AlkAml lKdm: AlkhrbA' DAoft mn moAnA: skAn hch AlblAd wdfothm nHw mSAdr AlTAq: Albdyl: lltKfyf mn |xAr tlk Al!zm:
English translation: The crises left behind by the war in Yemen have multiplied and varied, and the complete outage of electricity service has compounded the suffering of the country's inhabitants and pushed them toward alternative energy sources to mitigate the effects of that crisis.

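| ID | Token | KATS | Syntax | Morphology | Prefix | Suffix | Gloss* |
|---|---|---|---|---|---|---|---|
| 0001 | تعددت | toddt | VPIA | - 3PF--- | | | |
| 0002 | وتنوعت | wtnwot | VPIA | 053PF--- | PC | | |
| 0003 | الأزمات | Al!zmAt | NNG | ---PF | PD | | |
| 0004 | التي | Alty | PL | | | | |
| 0005 | خلفتها | KlfthA | VPIA | 023SF3SF | | | |
| 0006 | الحرب | AlHrb | NNN | ---S- | PD | | |
| 0007 | في | fy | PP | | | | |
| 0008 | اليمن | Alymn | NN-G | | | | |
| 0009 | وأزمة | w!zm: | NF-G | ---SF | PC | | |
| 0010 | الانقطاع | AlAnqTAo | NF-G | 07-SM | PD | | |
| 0011 | الكامل | AlkAml | NA-G | ---SM | PD | | |
| 0012 | لخدمة | lKdm: | NF-G | ---SF | | | |
| 0013 | الكهرباء | AlkhrbA' | N--G | | PD | | |
| 0014 | ضاعفت | DAoft | VPIA | 033SF--- | | | |
| 0015 | من | mn | PP | | | | |
| 0016 | معاناة | moAnA: | NF-G | 03-SF | | | |
| 0017 | سكان | skAn | NQ-G | ---BM | | | |
| 0018 | هذه | hch | PD | | | | |
| 0019 | البلاد | AlblAd | N--G | | PD | | |
| 0020 | ودفعتهم | wdfothm | VPIA | 013SF3PM | PC | | |
| 0021 | نحو | nHw | NV | | | | |
| 0022 | مصادر | mSAdr | NF-A | ---PM | | | |
| 0023 | الطاقة | AlTAq: | N--G | ---SF | PD | | |
| 0024 | البديلة | Albdyl: | NA-G | ---SF | PD | | |
| 0025 | للتخفيف | lltKfyf | NF-G | | PP | | |
| 0026 | من | mn | PP | | | | |
| 0027 | آثار | \|xAr | NF-G | ---B- | | | |
| 0028 | تلك | tlk | PD | | | | |
| 0029 | الأزمة | Al!zm: | NF-G | ---SF | | | |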
(*) Glosses are for reference only; the real module outputs a simplified gloss (stems only).

(**) A larger XML output sample can be found here.
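
The eight-character morphology strings on verbs in the sample output (for example `023SF3SF` or `053PF---`) can be unpacked with a small decoder. The sketch below is in Python and rests on an assumed layout inferred only from the sample rows, not from the Kalmasoft tagset documentation: a two-character leading code whose meaning is not stated here, then person, number, and gender of the verb, then person, number, and gender of an attached pronoun, with `-` for an unspecified position. The function names are hypothetical.

```python
# Code tables taken from the tagset summary on this page.
PERSON = {"1": "first person", "2": "second person", "3": "third person"}
NUMBER = {"S": "singular", "D": "dual", "P": "plural"}
GENDER = {"M": "masculine", "F": "feminine"}


def _decode_png(code):
    """Decode a 3-character person/number/gender group; '-' maps to None."""
    person, number, gender = code
    return {
        "person": PERSON.get(person),
        "number": NUMBER.get(number),
        "gender": GENDER.get(gender),
    }


def decode_verb_morphology(morph):
    """Split an 8-character verb morphology string (assumed layout)."""
    if len(morph) != 8:
        raise ValueError(f"expected 8 characters, got {len(morph)}: {morph!r}")
    return {
        "leading_code": morph[:2],  # meaning not documented on this page
        "subject": _decode_png(morph[2:5]),
        "attached_pronoun": _decode_png(morph[5:8]),
    }


# Token 0020 (wdfothm) from the sample output.
print(decode_verb_morphology("013SF3PM"))
```

Noun morphology strings in the sample are shorter and use codes (such as `B`) that are not listed in the tagset summary above, so they are deliberately left out of this sketch.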


