应用科学学报 (Journal of Applied Sciences) ›› 1999, Vol. 17 ›› Issue (2): 148-155.


Context-Sensitive Automatic Chinese Word Segmentation and Lexical Preprocessing Algorithm

HUANG Heyan, LI Yusheng

  1. Research Center of Computer & Language Information Engineering, Chinese Academy of Sciences
  • Received: 1998-02-19  Revised: 1998-06-08  Online: 1999-06-30  Published: 1999-06-30
  • Supported by:
    National Natural Science Foundation of China

Context-Sensitive Automatic Chinese Word Segmentation and Lexical Preprocessing

HUANG HEYAN, LI YUSHENG   

  1. Research Center of Computer & Language Information Engineering, Academia Sinica, Beijing 100083
  • Received:1998-02-19 Revised:1998-06-08 Online:1999-06-30 Published:1999-06-30

Abstract: This paper proposes a context-sensitive automatic Chinese word segmentation and lexical preprocessing algorithm suited to Chinese-English machine translation. The algorithm combines forward multi-path matching with ambiguity resolution based on context-sensitive knowledge; by exploiting the large body of syntactic and semantic knowledge in the lexicon of the Chinese-English MT system, it performs context-sensitive rule inference to resolve segmentation ambiguities, raising segmentation accuracy above 99%. The algorithm also applies lexical preprocessing to semantically redundant reduplicated words and to function words that can separate from their head words, which both reduces the number of entries required in the system lexicon and simplifies subsequent sentence analysis.

Key words: automatic Chinese word segmentation, lexical preprocessing, machine translation

Abstract: In this paper, a context-sensitive automatic Chinese word segmentation and lexical preprocessing algorithm for a Chinese-English machine translation system is proposed. The algorithm combines improved MM (maximum matching) with rule-based, context-sensitive ambiguity resolution, taking advantage of the large amount of syntactic, semantic, and common-sense knowledge in the lexicon of the MT system. Its segmentation accuracy reaches 99% or higher. At the same time, the algorithm also preprocesses certain lexical phenomena, such as reduplicated words and function words, so as to reduce the number of entries in the lexicon and to facilitate the parsing of Chinese sentences.
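The MM matching that the abstract builds on can be illustrated with a minimal sketch of plain forward maximum matching; the dictionary, sentence, and function name below are toy examples of my own, not taken from the paper, and the paper's multi-path variant with context-sensitive disambiguation goes well beyond this baseline:

```python
def forward_max_match(sentence, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word starting there (single characters are
    always accepted as a fallback)."""
    words = []
    i = 0
    while i < len(sentence):
        for l in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + l]
            if l == 1 or candidate in lexicon:
                words.append(candidate)
                i += l
                break
    return words

# A classic ambiguous string: the greedy match picks 研究生|命|起源,
# whereas the intended reading is 研究|生命|起源 -- exactly the kind of
# ambiguity the paper's context-sensitive rules are meant to resolve.
lexicon = {"研究", "生命", "研究生", "命", "起源"}
print(forward_max_match("研究生命起源", lexicon))
```

The greedy baseline commits to the longest prefix match and cannot recover, which is why the paper keeps multiple matching paths alive and applies lexicon knowledge to choose among them.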

Key words: lexical preprocessing, automatic Chinese word segmentation, machine translation