AIGC Users Traceability Technology Based on Text Watermarking

SONG Yimin; LIU Gongshen

doi:10.3969/j.issn.0255-8297.2025.03.001

Journal of Applied Sciences, 2025, Vol. 43, Issue 3: 361-369

Digital Media Forensics and Security

School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Received date: 2024-10-30

Online published: 2025-06-23

Abstract

This study addresses the limitations of text watermarking for Chinese-language text and proposes two schemes, a modification-based watermark and a generative watermark, each implemented for both English and Chinese. For the modification-based scheme, it designs a portable word-substitution watermarking module, built on the BERT model for English and the WoBERT model for Chinese, that embeds watermark information by replacing specified lexical elements in the source text. For the generative scheme, it adopts an adversarial generative text watermarking model and adapts it to a Chinese corpus through targeted modifications, so that the model fits Chinese semantic structures and linguistic conventions. Experiments use a human-ChatGPT comparison corpus in both Chinese and English, and the proposed schemes are evaluated with text watermarking metrics covering both accuracy and semantic preservation. The results show that the proposed methods improve watermark robustness and traceability in multilingual text.

Keywords: text watermarking; pre-trained language model; generative model; comparison corpus
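
The word-substitution idea in the abstract can be illustrated with a minimal sketch. It is not the authors' module: the fixed `SYNONYM_PAIRS` table is a hypothetical stand-in for the candidates that BERT or WoBERT would propose, the function names `embed` and `extract` are invented for illustration, and whitespace tokenization is assumed, which would have to be replaced by word segmentation for Chinese text. Each substitutable word carries one bit of a user identifier, chosen by which member of its synonym pair appears in the text.

```python
# Hypothetical synonym pairs: the first word encodes bit 0, the second bit 1.
# A real system would pick substitutes with a language model instead.
SYNONYM_PAIRS = [
    ("big", "large"),
    ("quick", "fast"),
    ("start", "begin"),
    ("show", "display"),
]

BIT_OF = {}   # word -> bit it encodes
PAIR_OF = {}  # word -> (bit-0 word, bit-1 word)
for zero_word, one_word in SYNONYM_PAIRS:
    BIT_OF[zero_word] = 0
    BIT_OF[one_word] = 1
    PAIR_OF[zero_word] = (zero_word, one_word)
    PAIR_OF[one_word] = (zero_word, one_word)


def embed(text, bits):
    """Replace substitutable words so that they spell out `bits` in order."""
    out = []
    used = 0
    for token in text.split():
        if token in PAIR_OF and used < len(bits):
            out.append(PAIR_OF[token][bits[used]])
            used += 1
        else:
            out.append(token)
    if used < len(bits):
        raise ValueError("text has too few substitutable words for the payload")
    return " ".join(out)


def extract(text, n_bits):
    """Read the first `n_bits` payload bits back from the watermarked text."""
    bits = [BIT_OF[token] for token in text.split() if token in BIT_OF]
    return bits[:n_bits]


user_id_bits = [1, 0, 1, 1]  # hypothetical 4-bit user identifier
marked = embed("the quick fox will start to show a big grin", user_id_bits)
recovered = extract(marked, len(user_id_bits))
```

Because extraction only needs the shared table and the payload length, the identifier can be recovered from the text alone, which is what makes a scheme of this kind usable for tracing a text back to a user.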

Cite this article

SONG Yimin, LIU Gongshen. AIGC Users Traceability Technology Based on Text Watermarking[J]. Journal of Applied Sciences, 2025, 43(3): 361-369. DOI: 10.3969/j.issn.0255-8297.2025.03.001
