2025, Vol. 57, No. 2, pp. 16-23
News topic text classification method based on XLNet and multi-granularity contrastive learning
Foundation: National Natural Science Foundation of China (62106069)
Email: 2430179820@qq.com
DOI: 10.13705/j.issn.1671-6841.2023164
Abstract:

News topic texts are concise yet rich in meaning. Traditional methods usually consider only one vector granularity, either word-level or sentence-level, and fail to fully exploit the correlated information among vectors of different granularities in news topic text. To mine the dependency between word vectors and sentence vectors, a news topic classification method based on XLNet and multi-granularity feature contrastive learning is proposed. First, XLNet extracts features from the news topic text to obtain word- and sentence-granularity representations and their latent spatial relationships. Then, positive and negative sample pairs of features at different granularities are generated with the R-Drop strategy of contrastive learning, and feature-similarity learning is performed with certain weights on the word-word, word-sentence, and sentence-sentence embeddings, allowing the model to mine the correlations between character-level and sentence-level attributes and improving its expressive ability. Experiments on the THUCNews, Toutiao, and SHNews datasets show that the proposed method outperforms the baseline models in both accuracy and F1 score, reaching F1 scores of 93.88%, 90.08%, and 87.35% on the three datasets respectively, which verifies its effectiveness and rationality.
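The abstract outlines a three-step pipeline: XLNet encoding at word and sentence granularity, R-Drop-style view generation through two dropout-perturbed forward passes, and a weighted contrastive loss over word-word, word-sentence, and sentence-sentence similarities. The PyTorch sketch below is only an illustration of how those pieces could fit together, not the authors' implementation: the checkpoint hfl/chinese-xlnet-base, the pooling choices, the temperature tau, and the weights alpha, beta, gamma are all assumptions.

```python
# Illustrative sketch of the pipeline described in the abstract. NOT the
# authors' code: checkpoint, pooling, temperature and loss weights are assumed.
import torch
import torch.nn.functional as F
from transformers import XLNetModel, XLNetTokenizer

CKPT = "hfl/chinese-xlnet-base"                      # assumed Chinese XLNet checkpoint
tokenizer = XLNetTokenizer.from_pretrained(CKPT)
encoder = XLNetModel.from_pretrained(CKPT)
encoder.train()                                      # keep dropout active: R-Drop relies on it
classifier = torch.nn.Linear(encoder.config.d_model, 10)  # e.g. 10 topic classes

def encode(batch):
    """One dropout-perturbed pass -> word- and sentence-granularity vectors."""
    hidden = encoder(**batch).last_hidden_state      # (B, T, H) token states
    mask = batch["attention_mask"].unsqueeze(-1).float()
    word = (hidden * mask).sum(1) / mask.sum(1)      # word granularity: masked mean pool
    sent = hidden[:, -1]                             # sentence granularity: <cls> state
    # XLNet appends <cls> at the end of the sequence and its tokenizer pads on
    # the left, so position -1 is the summary token for every example.
    return word, sent

def info_nce(a, b, tau=0.05):
    """In-batch InfoNCE: rows i of a and b are positives, other rows negatives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                         # (B, B) scaled cosine similarities
    return F.cross_entropy(logits, torch.arange(a.size(0)))

def train_step(texts, labels, alpha=0.4, beta=0.2, gamma=0.4):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=64, return_tensors="pt")
    # R-Drop-style pair generation: the same batch is encoded twice; dropout
    # makes the passes differ, giving a positive view of every example.
    w1, s1 = encode(batch)
    w2, s2 = encode(batch)
    loss_cl = (alpha * info_nce(w1, w2)                              # word-word
               + beta * 0.5 * (info_nce(w1, s2) + info_nce(s1, w2))  # word-sentence
               + gamma * info_nce(s1, s2))                           # sentence-sentence
    loss_ce = F.cross_entropy(classifier(s1), labels)                # topic classification
    return loss_ce + loss_cl
```

A training loop would call, say, train_step(["央行宣布降准", "球队晋级决赛"], torch.tensor([0, 1])) and back-propagate the returned loss. The abstract does not specify the actual weighting of the three similarity terms or the treatment of negatives, so the values above are placeholders.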


Basic information:

CLC number: TP391.1; TP18

Citation:

CHEN M, WANG L C, XU R, et al. News topic text classification method based on XLNet and multi-granularity contrastive learning[J]. Journal of Zhengzhou University (Natural Science Edition), 2025, 57(2): 16-23. DOI: 10.13705/j.issn.1671-6841.2023164.
