Journal of Applied Sciences ›› 2026, Vol. 44 ›› Issue (3): 377-389. doi: 10.3969/j.issn.0255-8297.2026.03.003

• Intelligent Information Processing •

Method for Extracting Chinese-Burmese Parallel Sentence Pairs Based on Language Feature Enhancement

ZHAO Zixiao1,2, WANG Hao1,2, SHEN Tao1,2, JIANG Shuting1,2, ZHANG Siqi1,2, LAI Hua1,2, HUANG Yuxin1,2, YU Zhengtao1,2   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China;
    2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
  • Received: 2026-03-15 Published: 2026-06-23

Abstract: To address the scarcity of labeled resources and the limited representational capacity of models when extracting parallel sentence pairs in low-resource languages, this paper proposes a language-feature-enhanced method for Chinese-Burmese parallel sentence pair extraction. The method is optimized in three respects: data augmentation, model architecture, and training mechanism. First, a Chinese-Burmese dual encoder based on a Siamese network is constructed to build a cross-lingual semantic representation space. Second, an information-content evaluation mechanism based on the L2 norm of word vectors is introduced to replace high-information features and perform sample augmentation, thus alleviating the data sparsity problem under low-resource conditions. Finally, positive and negative samples are constructed and dynamically modeled through contrastive learning to optimize sample boundaries and achieve more accurate Chinese-Burmese semantic alignment. Experimental results show that the proposed method achieves an F1 score of 95.03% on the Chinese-Burmese parallel sentence pair extraction task, outperforming the baseline model. In addition, this paper constructs a high-quality general-domain Chinese-Burmese dataset containing 5 × 10^5 sentence pairs, providing data support for research on low-resource languages.

Key words: parallel sentence pair extraction, information augmentation, contrastive learning, Siamese network
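
As an illustration of two ideas named in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a token's information content is scored by the L2 norm of its word vector, and it uses a standard in-batch InfoNCE-style loss as a stand-in for the contrastive objective. The function names, the temperature value, and the choice of loss are assumptions made for this example; the abstract does not specify them, and the feature-replacement step itself is not shown.

```python
import numpy as np

def information_scores(word_vectors):
    """L2 norm of each word vector, used here as a proxy for information content."""
    return np.linalg.norm(word_vectors, axis=1)

def top_k_informative(word_vectors, k):
    """Indices of the k tokens with the largest L2 norm, highest first."""
    scores = information_scores(word_vectors)
    return np.argsort(-scores)[:k]

def info_nce_loss(zh, my, temperature=0.05):
    """In-batch contrastive loss (assumed form, not taken from the paper).

    Row i of `zh` (Chinese sentence embeddings) is treated as the positive
    for row i of `my` (Burmese sentence embeddings); all other rows in the
    batch act as negatives.
    """
    # Normalize so the dot product equals cosine similarity.
    zh = zh / np.linalg.norm(zh, axis=1, keepdims=True)
    my = my / np.linalg.norm(my, axis=1, keepdims=True)
    logits = zh @ my.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positives sit on the diagonal of the similarity matrix.
    return -np.mean(np.diag(log_probs))
```

In this sketch, tokens returned by `top_k_informative` would be the candidates for an augmentation step, and `info_nce_loss` decreases as the embeddings of aligned pairs move closer together than those of mismatched pairs.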
