@MISC{Abdelali_buildinga, author = {Ahmed Abdelali}, title = {Building A Modern Standard Arabic Corpus}, year = {} }
Abstract
Language Engineering, including Information Retrieval, Machine Translation and other Natural Language-related disciplines, has shown growing interest in the Arabic language in recent years. Suitable resources for Arabic are becoming vital to the progress of this research. Until recently, only two Arabic corpora were commonly available to researchers: the AFP Arabic newswire from LDC and the Al-Hayat newspaper collection from the European Language Resources Distribution Agency. However, objective research requires a suitable corpus with wider coverage, one that samples the language as it is used across the vast Arabic-speaking region. In this paper we present preliminary results of experiments with a corpus for Modern Standard Arabic using data available on the World Wide Web. We selected samples of online newspapers from different Arab countries. The selection was driven mainly by the amount of data available. We demonstrate the completeness and representativeness of this corpus using standard metrics and show its suitability for Language Engineering experiments.