Method for Extracting Chinese-Burmese Parallel Sentence Pairs Based on Language Feature Enhancement

doi:10.3969/j.issn.0255-8297.2026.03.003

Abstract

To address the scarcity of labeled resources and the limited representational capacity of models for parallel sentence pair extraction in low-resource languages, this paper proposes a language-feature-enhanced method for Chinese-Burmese parallel sentence pair extraction. The method is optimized in three respects: data augmentation, model architecture, and training mechanism. First, a Chinese-Burmese dual encoder based on a Siamese network is constructed to build a cross-lingual semantic representation space. Second, an information-content evaluation mechanism based on the L₂ norm of word vectors is introduced to replace high-information features and perform sample augmentation, thus alleviating the data sparsity problem under low-resource conditions. Finally, positive and negative samples are constructed and dynamically modeled through contrastive learning to optimize sample boundaries and achieve more accurate Chinese-Burmese semantic alignment. Experimental results show that the proposed method achieves an F1 score of 95.03% on the Chinese-Burmese parallel sentence pair extraction task, outperforming the baseline model. In addition, this paper constructs a high-quality general-domain Chinese-Burmese dataset containing 5 × 10⁵ sentence pairs, providing data support for research on low-resource languages.
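The abstract gives no implementation details, so the following is only a minimal sketch of how the two ideas it names might look in code: ranking tokens by the L₂ norm of their word vectors, and an in-batch contrastive loss over paired sentence embeddings. The function names, the `substitutes` lookup table, the InfoNCE-style loss, and the temperature value are all assumptions made for illustration. They are not taken from the paper.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a word vector."""
    return math.sqrt(sum(x * x for x in vec))


def top_information_positions(token_vectors, k):
    """Positions of the k tokens whose vectors have the largest L2 norm.

    Assumption: a larger norm is treated as a proxy for higher
    information content.
    """
    order = sorted(
        range(len(token_vectors)),
        key=lambda i: l2_norm(token_vectors[i]),
        reverse=True,
    )
    return sorted(order[:k])


def augment(tokens, token_vectors, substitutes, k=1):
    """Build an augmented sentence by replacing the k highest-norm tokens.

    `substitutes` is a hypothetical token -> replacement table; tokens
    without an entry are left unchanged.
    """
    positions = set(top_information_positions(token_vectors, k))
    return [
        substitutes.get(tok, tok) if i in positions else tok
        for i, tok in enumerate(tokens)
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu, nv = l2_norm(u), l2_norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def in_batch_contrastive_loss(zh_embs, my_embs, temperature=0.05):
    """InfoNCE-style loss (an assumed choice, not confirmed by the source).

    zh_embs[i] and my_embs[i] form a positive pair; every other Burmese
    embedding in the batch serves as a negative for zh_embs[i].
    """
    total = 0.0
    for i, zh in enumerate(zh_embs):
        logits = [cosine(zh, my) / temperature for my in my_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(zh_embs)
```

In a Siamese setup, both sentence embeddings would come from the same shared-weight encoder. Here they are passed in directly as plain lists so that the sketch stays self-contained.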

Key words: parallel sentence pair extraction, information augmentation, contrastive learning, Siamese network

CLC Number: TP391

ZHAO Zixiao, WANG Hao, SHEN Tao, JIANG Shuting, ZHANG Siqi, LAI Hua, HUANG Yuxin, YU Zhengtao. Method for Extracting Chinese-Burmese Parallel Sentence Pairs Based on Language Feature Enhancement[J]. Journal of Applied Sciences, 2026, 44(3): 377-389.


URL: https://www.jas.shu.edu.cn/EN/10.3969/j.issn.0255-8297.2026.03.003

https://www.jas.shu.edu.cn/EN/Y2026/V44/I3/377

