Social networks restrict access to user topology, which greatly reduces the accuracy of identification methods that rely on structural features. We present Proximity- and Content-based User Identification based on XGBoost, a semi-supervised network model that integrates attribute, structural, and content features and recasts cross-social-network user identification as a binary classification task. To cope with incomplete topology information and insufficient seed users, we propose a method for extracting explicit and implicit friends. The friend networks of the user pair to be matched are fused according to the pair's explicit friends, implicit friends, and other friends. User importance is then incorporated to improve the empirical probability of the second-order proximity in the LINE algorithm, which yields the structural feature. Time-sequence features, keyword-overlap features, and followee-tag features are extracted as the content features. Finally, all features are fused to complete user identification. Experiments on real datasets demonstrate the effectiveness of the method.
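To make the binary-classification framing concrete, the following is a minimal sketch of how one candidate account pair could be turned into a labeled sample. It is not the authors' implementation: the abstract does not define the individual features, so the Jaccard measure used for keyword overlap, the function names `keyword_overlap`, `fuse_features`, and `build_sample`, and the plain concatenation of feature groups are all illustrative assumptions.

```python
from typing import Iterable, List, Tuple


def keyword_overlap(keywords_a: Iterable[str], keywords_b: Iterable[str]) -> float:
    """Jaccard overlap of two keyword collections (an assumed overlap measure)."""
    set_a, set_b = set(keywords_a), set(keywords_b)
    if not set_a and not set_b:
        # Two accounts without keywords share nothing measurable.
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def fuse_features(
    attribute: List[float], structure: List[float], content: List[float]
) -> List[float]:
    """Concatenate the three feature groups into one vector per account pair."""
    return [*attribute, *structure, *content]


def build_sample(
    attribute: List[float],
    structure: List[float],
    content: List[float],
    is_same_user: bool,
) -> Tuple[List[float], int]:
    """Return (feature vector, label); label 1 means both accounts are one person."""
    return fuse_features(attribute, structure, content), int(is_same_user)
```

Samples built this way could then be passed to any binary classifier; the paper uses XGBoost for this step.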
LU Jing, YOU Chenlu, GAI Qikai, LIU Cong. User Identification Method Using Proximity and Content Features[J]. Journal of Applied Sciences, 2024, 42(6): 1064-1077.
DOI: 10.3969/j.issn.0255-8297.2024.06.014