Extracting Knowledge from an Arabic-English Machine-Readable Dictionary Using Information Extraction

Abstract: Natural language processing (NLP) applications need a large and rich body of linguistic knowledge. At the same time, electronic language resources such as dictionaries, encyclopedias, and corpora have become widely available. As a result, automatic methods have emerged for extracting lexical information from these resources in order to overcome the knowledge acquisition bottleneck.

We present a method for automatically extracting lexical information from a machine-readable version of the Arabic-English Al-Mawrid dictionary. We used n-gram analysis and keyword-in-context (KWIC) analysis to discover lexical patterns that signal morphological, syntactic, or semantic information. We then applied hand-crafted, rule-based information extraction to extract that information.
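
The pattern-discovery step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy definition strings, the function names (`ngram_counts`, `kwic`, `extract_hypernym_phrase`), and the single "a kind/type of X" rule are all assumptions made for illustration; none of them is taken from the Al-Mawrid dictionary or from the study.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens; a real system would need Arabic-aware tokenization.
    return re.findall(r"\w+", text.lower())

def ngram_counts(definitions, n):
    # Count every n-gram across all definition strings.
    counts = Counter()
    for definition in definitions:
        toks = tokenize(definition)
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return counts

def kwic(definitions, keyword, width=3):
    # Keyword-in-context rows: (left context, keyword, right context).
    rows = []
    for definition in definitions:
        toks = tokenize(definition)
        for i, tok in enumerate(toks):
            if tok == keyword:
                left = " ".join(toks[max(0, i - width):i])
                right = " ".join(toks[i + 1:i + 1 + width])
                rows.append((left, tok, right))
    return rows

# Hypothetical hand-crafted rule: "a kind of X" / "a type of X" marks X
# as a hypernym phrase of the headword.
HYPERNYM_RULE = re.compile(r"^an? (?:kind|type) of (.+)$")

def extract_hypernym_phrase(definition):
    match = HYPERNYM_RULE.match(definition.strip().lower())
    return match.group(1) if match else None

# Toy English glosses, invented for this sketch.
definitions = ["a kind of tree", "a kind of bird", "a type of small boat"]

frequent_trigrams = ngram_counts(definitions, 3).most_common(1)
contexts = kwic(definitions, "kind")
hypernyms = [extract_hypernym_phrase(d) for d in definitions]
```

In this sketch, frequent n-grams such as "a kind of" suggest candidate patterns, the KWIC rows let a person inspect how each pattern is used, and the confirmed pattern is then written as an extraction rule by hand.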

Furthermore, we used punctuation marks and some heuristics to extract the set of synonyms in a subentry. The study achieved high precision for all types of information, high recall for synonyms, and low recall for the other types of information. It also showed that the Al-Mawrid contains a significant number of derivations (morphological information) as well as synonyms, domain labels, and hyponym/hypernym relations (semantic information).
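
As a rough illustration of punctuation-based synonym extraction and of the precision and recall measures mentioned above, consider the following Python sketch. The abstract does not specify the punctuation conventions, so the ones used here are assumptions: commas separate synonyms, semicolons separate senses, and parenthetical notes are discarded. The sample subentry and the function names are likewise hypothetical.

```python
import re

def extract_synonym_sets(subentry):
    # Assumed heuristic: drop parenthetical notes, split senses on ';',
    # split synonyms on ','. Single-term senses yield no synonym set.
    cleaned = re.sub(r"\([^)]*\)", "", subentry)
    synonym_sets = []
    for sense in cleaned.split(";"):
        terms = [term.strip() for term in sense.split(",")]
        terms = [term for term in terms if term]
        if len(terms) > 1:
            synonym_sets.append(terms)
    return synonym_sets

def precision_recall(extracted, gold):
    # Precision: share of extracted items that are correct.
    # Recall: share of gold-standard items that were extracted.
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Invented subentry text, for illustration only.
sets_found = extract_synonym_sets("big, large, great (in size); old")
scores = precision_recall({"a", "b"}, {"a", "b", "c", "d"})
```

The second call shows the combination reported for the non-synonym information types: every extracted item is correct (high precision), but only half of the gold-standard items are found (low recall).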
