Abstract: Chinese invention patent texts are written in highly specialized language, and classifying them manually is time-consuming and labor-intensive. To address this, an automatic classification method for Chinese patent texts based on a BERT-BiGRU model is proposed. A pre-trained Chinese BERT model produces vectorized semantic representations of the patent texts, with word embedding and multi-head attention used to extract the contextual semantic information of the words in each text. A bidirectional GRU (gated recurrent unit) network then completes the classification. A dataset was constructed from patent texts in the Incopat patent database, and several groups of comparative experiments were designed. The experimental results show that the proposed method effectively improves the model's ability to extract discriminative features from Chinese patent texts, reaching a classification accuracy of 85.44% on 8 categories of patent texts.
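The BiGRU classification stage described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy, standard-library-only forward pass in which random vectors stand in for the BERT token representations, all weights are random and untrained, and bias terms are omitted. The names `GRUCell`, `bigru_classify`, and the sizes `EMBED_DIM`, `HIDDEN`, and `SEQ_LEN` are hypothetical; only the 8 output classes come from the abstract.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """One GRU direction: update gate z, reset gate r, candidate state n (no biases)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # W[g] acts on the input vector, U[g] on the previous hidden state.
        self.W = {g: rand_matrix(hidden_size, input_size, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_size, hidden_size, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], reset_h))]
        # Interpolate between the candidate state and the previous state.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def run(self, seq):
        """Return the final hidden state after reading the whole sequence."""
        h = [0.0] * self.hidden_size
        for x in seq:
            h = self.step(x, h)
        return h


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bigru_classify(token_vectors, fwd, bwd, W_out):
    """Read the sequence in both directions, concatenate, and score the classes."""
    h_fwd = fwd.run(token_vectors)
    h_bwd = bwd.run(list(reversed(token_vectors)))
    return softmax(matvec(W_out, h_fwd + h_bwd))


rng = random.Random(0)
EMBED_DIM, HIDDEN, NUM_CLASSES, SEQ_LEN = 16, 8, 8, 5  # sizes are assumptions

# Stand-in for BERT output: one random vector per token of a patent text.
tokens = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(SEQ_LEN)]
fwd, bwd = GRUCell(EMBED_DIM, HIDDEN, rng), GRUCell(EMBED_DIM, HIDDEN, rng)
W_out = rand_matrix(NUM_CLASSES, 2 * HIDDEN, rng)

probs = bigru_classify(tokens, fwd, bwd, W_out)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
print(predicted_class, [round(p, 3) for p in probs])
```

Because the weights are untrained, the predicted class is meaningless; the sketch only shows how the forward and backward final states are concatenated before the 8-way softmax. A practical version would replace the random `tokens` with real BERT outputs and train all parameters in a deep learning framework.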
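Basic information:
DOI: 10.13705/j.issn.1671-6841.2022125
CLC number: TP391.1; G255.53
Citation:
[1] LIU Yan. Automatic classification of Chinese patent texts based on BERT-BiGRU[J], 2023, 55(02): 33-40. DOI: 10.13705/j.issn.1671-6841.2022125.
Funding:
Humanities and Social Sciences Research Project of Henan Provincial Universities (2023-ZDJH-589); Annual Project of Henan Province Philosophy and Social Science Planning (2021BZH015)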