User Identification Method Using Proximity and Content Features

卢菁, 尤晨璐, 盖祺凯, 刘丛

doi:10.3969/j.issn.0255-8297.2024.06.014

Journal of Applied Sciences (应用科学学报), 2024, Vol. 42, Issue 6: 1064-1077

Computer Science and Applications

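School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Received: 2022-12-01

Published online: 2024-11-30

Funding

Supported by the Natural Science Foundation Cultivation Project of University of Shanghai for Science and Technology (No. 20ZRPY08)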

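Abstract

Social networks restrict access to users' topological structure, which sharply reduces the accuracy of identification methods that rely on structural features. The proposed method builds an XGBoost-based semi-supervised network model that fuses attribute, structural, and content features, and recasts cross-social-network user identification as a binary classification problem. To address incomplete user topology and insufficient seed users, it introduces a method for extracting explicit and implicit friends. The friend networks of a candidate user pair are fused according to the explicitly matched user pairs, implicitly matched user pairs, and other friends they contain, and user importance is used to improve the empirical probability of second-order proximity in the LINE algorithm, which yields the structural features of the candidate pair. The content features comprise the time-series features of users' posts, the keyword overlap of user-generated content, and the tags of followed users. Finally, the attribute, structural, and content features are fused to complete user identification. Experiments on real datasets demonstrate the effectiveness of the method.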
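As a rough illustration of the feature pipeline summarized above, the following minimal sketch computes one content feature and one structural quantity for a candidate user pair and concatenates them with attribute features. It rests on several assumptions that the abstract does not confirm: keyword overlap is measured with the Jaccard index, the structural quantity is the standard LINE empirical probability w_ij / d_i without the paper's importance-based modification, and all function names are hypothetical. The resulting vector is the kind of input a binary classifier such as XGBoost could be trained on.

```python
from collections import defaultdict


def keyword_overlap(keywords_a, keywords_b):
    """Jaccard overlap of two keyword collections.

    Assumption: the abstract only says "keyword overlap"; Jaccard is one
    plausible measure, not necessarily the one used in the paper.
    """
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def empirical_second_order(edges):
    """Standard LINE empirical distribution p(v_j | v_i) = w_ij / d_i.

    `edges` maps (i, j) -> positive weight; d_i is the total outgoing
    weight of node i. The paper's user-importance adjustment is NOT
    reproduced here because the abstract does not specify it.
    """
    out_weight = defaultdict(float)
    for (i, _j), w in edges.items():
        out_weight[i] += w
    return {(i, j): w / out_weight[i] for (i, j), w in edges.items()}


def fuse_features(attribute, structure, content):
    """Concatenate the three feature groups into one vector (hypothetical helper)."""
    return list(attribute) + list(structure) + list(content)


# Toy usage with made-up values for a single candidate pair.
edges = {("u", "a"): 1.0, ("u", "b"): 3.0}
probs = empirical_second_order(edges)
overlap = keyword_overlap(["music", "travel", "food"], ["travel", "food", "film"])
vector = fuse_features([0.8], [probs[("u", "b")]], [overlap])
```

In a full system, one such vector per candidate pair, labeled as matched or unmatched, would form the training set for the binary classifier.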

Keywords: social network; user identification; proximity; XGBoost; user-generated content

Cite this article

卢菁, 尤晨璐, 盖祺凯, 刘丛. User Identification Method Using Proximity and Content Features[J]. Journal of Applied Sciences (应用科学学报), 2024, 42(6): 1064-1077. DOI: 10.3969/j.issn.0255-8297.2024.06.014
