Terminology
We use a few custom terms throughout this document, which we would like to clarify. Each task performed by MAPS is called a "process". In the following table, the first four columns show the language and script pairs involved, with the column headers indicating the "direction" of the process; the name of the process itself is shown in the last column. Please follow the links for details on each process. |
Source Language | Input Script | Target Language | Output Script | Process Name |
Arabic | Unvocalized Arabic | Arabic | Vocalized Arabic | Vocalization |
Non-Arabic | Language dependent | Arabic | Arabic | Retrieval (Arabic) |
Multilingual | Language dependent | Language dependent | Language dependent | Transcription (Phonemic) |
Multilingual | Language dependent | Latin script based | Phonemic Latin | Romanization |
Arabic | Arabic | Multilingual | Language dependent | Retrieval (Multilingual) |
Multilingual | Language dependent | Arabic | Arabic | Arabicization |
Features and specifications
Potential applications
|
Module pages layout |
MAPS pages are laid out as follows:
|
Arabic Diacritizer |
Arabic is one of the official languages of the UN and is written from right to left. It has an inflectional system known for its rich vocabulary and complex morphology. The Arabic writing system consists of twenty-eight letters (Abjad), twenty-five of which are consonants and the remaining three are long vowels. A distinguishing feature of Arabic is that no letters are used to represent short vowels. Instead, they are represented by short strokes called diacritics, which are placed either above or below the preceding consonant. Another feature is that Arabic text is normally written unvocalized, except for classical works and Koranic text; this is a major stumbling block for any NLP system. The Kalmasoft diacritizing module is designed to perform full vocalization and semi-vocalization of raw input text. This module is currently being developed. Please refer to Arabic Text Diacritizer and Arabic Diacritizer for details. |
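As a rough illustration of the relationship between vocalized and unvocalized text, the sketch below removes the short-vowel marks from a vocalized word. This is the trivial inverse of what a diacritizer has to do, and it is not Kalmasoft's implementation. The function names are hypothetical, and the code assumes that the marks of interest are the Unicode code points U+064B to U+0652 (tanween, fatha, damma, kasra, shadda, sukun).

```python
# Assumption: "diacritics" here means the combining marks U+064B..U+0652.
FIRST_MARK, LAST_MARK = "\u064B", "\u0652"


def is_diacritic(ch: str) -> bool:
    """Return True if ch is one of the assumed short-vowel/related marks."""
    return FIRST_MARK <= ch <= LAST_MARK


def strip_diacritics(text: str) -> str:
    """Turn vocalized text into its unvocalized form by dropping the marks."""
    return "".join(ch for ch in text if not is_diacritic(ch))


def is_vocalized(text: str) -> bool:
    """Return True if the text carries at least one diacritic."""
    return any(is_diacritic(ch) for ch in text)


# The vocalized name used elsewhere on this page, written with escapes:
# baa + damma, raa + sukun, haa + fatha, alef, noon.
vocalized = "\u0628\u064F\u0631\u0652\u0647\u064E\u0627\u0646"
unvocalized = strip_diacritics(vocalized)  # baa, raa, haa, alef, noon
```

Stripping is lossless in one direction only: several vocalized words can collapse to the same unvocalized string, which is why restoring the marks requires linguistic knowledge rather than a character filter.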
Arabic Root Extractor |
Arabic is a highly inflectional language, meaning it uses an effective system to generate and derive words. Stemming is the process of removing affixes from such words and reducing them to their roots. Our full-fledged morphological analyzer uses a light stemmer that performs not only affix removal but also root extraction. It uses elaborate techniques to handle all forms of assimilated, hollow, and defective tokens: the analyzer performs the pattern recognition needed to complete the task and returns the correct form of the root or stem. A root dictionary is included to boost the system, which can be used in Arabic monolingual document retrieval. This module is currently being developed. Please refer to Arabic Root Extractor and Arabic Stemmer for details. |
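To make the affix-removal step concrete, here is a minimal light-stemming sketch. It is only an assumption-laden toy: the affix lists are a small illustrative selection, the function name `light_stem` and the `min_len` guard are hypothetical, and it performs no root extraction or handling of assimilated, hollow, or defective forms, which is where the real difficulty lies.

```python
# Illustrative affix lists (an assumption, not the module's actual rule set).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, longest candidates first.

    An affix is removed only if at least `min_len` characters remain,
    so that short words are not reduced to nothing.
    """
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


# "the teachers" (definite article + stem + plural ending)
stem_teachers = light_stem("المعلمون")
# "and the book" (conjunction + definite article + stem)
stem_book = light_stem("والكتاب")
```

In this sketch the first call drops the article and the plural ending, and the second drops the conjunction together with the article. The result is a stem, not a root; getting from the stem to a triconsonantal root needs the pattern recognition described above.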
Arabic Conjugator |
Arabic is a non-concatenative language. It can be described as a derivational language, meaning that its morphotactics depend rather on affixation, i.e. adding morphemes to the word without changing the root; that is, the core order of the verb binyanim is preserved. This results in the highly regular inflectional pattern that distinguishes the language. The Inflection Generator (or simply the conjugator) is a full-form lexical production module built on a root-based algorithm: a root like [ksr] "to break" may be seeded into the system, yielding roughly 30,000 conjugations, and the same is theoretically true for any other sound triconsonantal root. What Kalmasoft offers here is not an exhaustive listing of the verb conjugation paradigm but rather the software, which can be used to regenerate the whole inflectional model of the language or just the conjugation table of a specific verb form; binary scripts are also available. Please refer to the list of tagged roots for further information. This module is currently being developed; please refer to Arabic Conjugator for details. |
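The idea of seeding a root and generating forms can be sketched as follows. This is a deliberately tiny example under several assumptions: it works on a simplified Latin transliteration, covers only the singular and plural cells of the perfect active of a sound Form I verb, and assumes the a-a stem vowel pattern that applies to [ksr]. The names `PERFECT_SUFFIXES` and `conjugate_perfect` are hypothetical and do not describe the actual module.

```python
from typing import Dict

# Person/number/gender suffixes of the perfect tense, in a simplified
# transliteration (long vowels written as doubled letters).
PERFECT_SUFFIXES: Dict[str, str] = {
    "3ms": "a", "3fs": "at", "2ms": "ta", "2fs": "ti", "1s": "tu",
    "3mp": "uu", "3fp": "na", "2mp": "tum", "2fp": "tunna", "1p": "naa",
}


def conjugate_perfect(root: str) -> Dict[str, str]:
    """Interleave a three-consonant root with the pattern C1-a-C2-a-C3
    and attach each suffix. Assumes a sound root and an a-a stem."""
    if len(root) != 3:
        raise ValueError("expected a triconsonantal root such as 'ksr'")
    c1, c2, c3 = root
    stem = f"{c1}a{c2}a{c3}"
    return {cell: stem + suffix for cell, suffix in PERFECT_SUFFIXES.items()}


table = conjugate_perfect("ksr")
print(table["3ms"])  # kasara
print(table["1s"])   # kasartu
```

A full generator would multiply these ten cells by tense, mood, voice, derived verb forms, duals, and attached pronouns, which is how a single root can expand into the very large number of forms mentioned above.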
Arabic Inflector |
Arabic noun declension is the process of inflecting nouns into their grammatical subcategories. MAPS inflects every Arabic noun into more than a dozen categories, including classes such as Verbal Noun, Noun of Instrument, Active Participle, Passive Participle, Locative Noun, and Numerative Noun, and the three cases Accusative, Nominative, and Genitive. The first group is derived directly from the parallel verbs; its members are included here since they are grammatically classified as nouns. Other stem-inherent or generic characteristics, e.g. semantic classification, are not reflected in the table below; they are instead hard-coded directly throughout the declension. This module is currently being developed; please refer to Arabic Inflector for details. |
Arabic POS Tagger |
POS tagging is the process of assigning a part-of-speech tag, such as noun, verb, pronoun, preposition, adverb, or adjective, to each word in a sentence. The tag reflects the word's syntactic category based on its context, for the purpose of resolving lexical ambiguity. This is a rule-based module that makes use of an extensive knowledge base of rules developed by our linguists to define precisely when to apply each tag. Please refer to Arabic POS Tagger for details. |
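A rule-based tagger of this kind can be pictured as an ordered list of rules that look at a word and its neighbours. The sketch below is a toy with three invented rules (a closed list of prepositions, the definite article as a cue for a nominal, and "the word after a preposition is a nominal"); the rule set, the tag names, and the function `tag` are all assumptions made for illustration and are far cruder than a real knowledge base.

```python
from typing import List, Tuple

# A small closed class of prepositions (illustrative, not exhaustive).
PREPOSITIONS = {"في", "من", "إلى", "على", "عن"}
ARTICLE = "ال"


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Apply the rules in order; the first rule that fires decides the tag."""
    tags: List[str] = []
    for i, token in enumerate(tokens):
        if token in PREPOSITIONS:
            tags.append("PREP")          # lexical rule
        elif token.startswith(ARTICLE) and len(token) > 3:
            tags.append("NOUN")          # morphological cue: definite article
        elif i > 0 and tags[i - 1] == "PREP":
            tags.append("NOUN")          # contextual rule: follows a preposition
        else:
            tags.append("UNK")           # no rule fired
    return list(zip(tokens, tags))


# "went / the-boy / to / school"
sentence = ["ذهب", "الولد", "إلى", "مدرسة"]
tagged = tag(sentence)
tag_sequence = [t for _, t in tagged]
```

For this sentence the sketch yields UNK, NOUN, PREP, NOUN: the last word has no article, so it is only recognized through its context, which is the kind of ambiguity resolution the paragraph above refers to.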
Arabic Parser |
Parsing is key to accurate translation: once text is correctly disassembled, it is much easier to transfer to a different language. Kalmasoft has developed a unique parser for the Arabic language that can correctly analyze natural text, represent it as abstract elements and relationships, and then use that representation as a seed to generate text in a new language. This technology is language dependent, but each additional language requires only a few changes, a different rule set, and a dictionary. This module is currently being developed. Please refer to Arabic Parser for details. |
Arabic Ontology Processor |
This module is currently being developed. Please refer to Arabic Ontology Processor for details. |
Personal Names Romanizer |
Both Arabic and English lack some of each other’s sounds and letters. For example, English has no exact match for the pharyngeals [Haa', Ein] or the uvulars [Qaf, Khaa', Ghain], and Arabic has none for (P, V). This leads to ambiguities during romanization: an Arabic name containing one of these sounds will end up with variant spellings in English. This is in fact a major stumbling block when converting non-Western scripts such as the Arabic Abjad to Roman characters. It is especially challenging for Arabic-to-English conversion because Arabic writes mainly consonants and only rarely uses diacritics for disambiguation, making it difficult to return a single accurate English version of an Arabic name. Our system takes these peculiarities into account and supports many transliteration/transcription standards, including UNGEGN, ALA-LC, DIN 31635, SATTS, and ISO 233, as well as some academic transliteration systems such as Buckwalter, Khoja and Qalam. This makes it well suited as an integral transliteration module in NLP applications like Machine Translation (MT) and Cross-Language Information Retrieval (CLIR). Please refer to Name Romanizer for details. |
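Of the systems listed above, Buckwalter is the simplest to show because it is a strict one-to-one character mapping into ASCII. The sketch below builds the core of the commonly published Buckwalter table from Unicode code point ranges; it is an illustration only, omits the extended (non-standard) letters, and says nothing about how the Kalmasoft module is implemented. The function names are hypothetical.

```python
from typing import Dict


def build_buckwalter_table() -> Dict[str, str]:
    """Map Arabic code points to Buckwalter ASCII symbols (core set only)."""
    table: Dict[str, str] = {}
    # U+0621..U+063A: hamza and its carriers, alef, baa ... ghain.
    for offset, latin in enumerate("'|>&<}AbptvjHxd*rzs$SDTZEg"):
        table[chr(0x0621 + offset)] = latin
    # U+0640..U+0652: tatweel, faa ... yaa, then tanween (F, N, K),
    # short vowels (a, u, i), shadda (~) and sukun (o).
    for offset, latin in enumerate("_fqklmnhwYyFNKaui~o"):
        table[chr(0x0640 + offset)] = latin
    return table


BUCKWALTER = build_buckwalter_table()


def to_buckwalter(text: str) -> str:
    """Transliterate character by character; unknown characters pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


# An unvocalized name (meem, haa, meem, dal) and a fully vocalized one.
print(to_buckwalter("\u0645\u062D\u0645\u062F"))                          # mHmd
print(to_buckwalter("\u0628\u064F\u0631\u0652\u0647\u064E\u0627\u0646"))  # burohaAn
```

The first output shows the ambiguity problem directly: without diacritics the transliteration contains no vowels at all, so a reader-friendly romanization has to supply them from knowledge of the name rather than from the input characters.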
Arabic Name Classifier |
This module makes use of the fact that different geographic regions have different name patterns, and that most regions have a specific set of names unique to them alongside other patterns shared in common. For example, the names "حفني" /ħaf'ni/, "مرسي" /mursi/, and "مدبولي" /mad'bu:li/ are unique to Egypt, while the names "أحمد" /ʔaħ'mad/ and "محمد" /muħam'mad/ cannot be assigned to a specific geographic region, since they are among the most frequent names in all Arabic-speaking countries, along with other names like "علي" /ʕali/. The module also gives some hints (the "gist") about gender and guesses the religion for some names of non-Arabic origin, e.g. "جرجس" /girgis/, "مينا" /mi:na/, and "حنا" /ħan'na/, which denote Coptic or Christian names common in Egypt and Iraq. MAPS uses this embedded module to produce highly reliable results. Please refer to Name Classifier for details. |
Personal Names Arabicizer |
Transliteration is the process of representing words of one language using the writing system of another. The challenge of importing non-Arabic ("foreign") names into Arabic is no less important than the reverse process; in MAPS terminology this is called "Arabicization". "Arabicizing" is the process of representing in Arabic script names that are written in other scripts. This process does not pose such great challenges for language pairs that employ very close writing and sound systems, such as Spanish/English or French/Spanish.
A distinction should be made here between two important points:
|
Personal Names Transcription System |
Representing names written in Arabic script in other languages, and vice versa, has always been described as a challenge for most cross-language content management and data mining systems. MAPS handles not only Romanization but also about a dozen languages: the integral transliteration system can take names in native script or Romanized form, perform the transcription, and return the results in the target native script. The output "transcribed names" are formatted specifically for readers of the particular geographic region. For instance, the Arabic name "بُرْهَان" is rendered "Бурхан" for Russian readers, "Burhan" for English, Czech, or Spanish speakers, "Borhane" for Francophones, and "Borhan" for German and Polish speakers. Please refer to the Global Personal Name Transcription System for details and samples. |
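The example above can be expressed as a small interface sketch: one source name, one rendering per target audience. The data below is taken from that example only; everything else is an assumption, including the use of ISO 639-1 language codes as keys, the function names, and the use of a lookup table in place of the conversion rules a real system would apply.

```python
from typing import Dict, Optional

# Renderings of one name, keyed by its unvocalized Arabic spelling
# (baa, raa, haa, alef, noon) and then by an assumed language code.
RENDERINGS: Dict[str, Dict[str, str]] = {
    "\u0628\u0631\u0647\u0627\u0646": {
        "ru": "Бурхан",
        "en": "Burhan", "cs": "Burhan", "es": "Burhan",
        "fr": "Borhane",
        "de": "Borhan", "pl": "Borhan",
    },
}


def normalize(name: str) -> str:
    """Drop the diacritics U+064B..U+0652 so vocalized input also matches."""
    return "".join(ch for ch in name if not ("\u064B" <= ch <= "\u0652"))


def transcribe(name: str, lang: str) -> Optional[str]:
    """Return the rendering for the given audience, or None if unknown."""
    by_language = RENDERINGS.get(normalize(name))
    if by_language is None:
        return None
    return by_language.get(lang)


# The vocalized form of the name, written with escapes.
vocalized_name = "\u0628\u064F\u0631\u0652\u0647\u064E\u0627\u0646"
french = transcribe(vocalized_name, "fr")
german = transcribe(vocalized_name, "de")
```

Here `french` is "Borhane" and `german` is "Borhan". The point of the sketch is the shape of the call, name plus target audience in and region-specific spelling out, not the lookup itself.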
Personal Names Retrieval System |
The system can regenerate names back into their original languages and return the result in the native script of the target language. This rebuilding capability makes it ideal for applications like Named Entity Recognition (NER) and Cross-Language Information Retrieval (CLIR). For now, the retrieval feature works only for Arabic names and a partial set of Latin names. This module makes heavy use of detailed conversion rules and heuristics to correctly rebuild each input name, no matter how badly the original name was damaged by the transcription process. Please refer to Name Retrieval System for a detailed output sample. |
Geographic Names Romanizer |
This module performs the most commonly required Arabic place-name Romanization processes in more than 10 official Romanization systems. Please refer to Geographic Names Arabicizer for details. |
Geographic Names Arabicizer |
This module performs Arabicization of geographic names. Please refer to Geographic Names Arabicizer for details. |
Geographic Names Transcription System |
This module is currently being developed. Please refer to Geographic Names Transcription System for details. |
Geographic Names Retrieval System |
This module retrieves geographic names from many different languages back to their original script. Please refer to Geographic Names Retrieval System for details. |