Abstract: In informal question-answer corpora, a single question text often contains several sub-questions, each of which needs to be identified separately. Because labeled samples are scarce while unlabeled samples are abundant, a semi-supervised deep learning method is used for question detection. The method employs a variational auto-encoder (VAE) combined with the attention mechanism, which is widely used in deep learning models. Experimental results show that combining the VAE with the attention mechanism significantly improves question detection performance in terms of both F-score and accuracy.
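The abstract names its two building blocks but does not give the architecture. As a rough illustration only, the sketch below assumes dot-product attention over token vectors and a diagonal-Gaussian latent variable. It shows attention pooling, the reparameterized sampling step of a VAE, and the KL term of the VAE objective. All function names and the toy inputs are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_vectors, query):
    """Pool token vectors using dot-product attention against a query (assumed form)."""
    scores = [sum(t * q for t, q in zip(vec, query)) for vec in token_vectors]
    weights = softmax(scores)
    dim = len(token_vectors[0])
    pooled = [sum(w * vec[i] for w, vec in zip(weights, token_vectors))
              for i in range(dim)]
    return pooled, weights


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from a diagonal Gaussian N(mu, sigma^2) to N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


# Toy usage: three 2-dimensional token vectors and a hypothetical query vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(tokens, query=[1.0, 0.0])
z = reparameterize(mu=pooled, log_var=[0.0, 0.0], rng=random.Random(0))
kl = kl_to_standard_normal(mu=pooled, log_var=[0.0, 0.0])
```

In a semi-supervised setting of the kind the abstract describes, the KL term and a reconstruction loss can be computed on unlabeled text, while a classification loss is added for the labeled samples. How the paper weights or combines these terms is not stated here.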
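Basic information:
DOI: 10.13705/j.issn.1671-6841.2018192
CLC number: TP391.1; TN762
Citation:
[1] WANG Lu, LI Shoushan. Question detection method based on variational auto-encoder[J], 2019, 51(03): 79-84. DOI: 10.13705/j.issn.1671-6841.2018192.
Funding:
National Natural Science Foundation of China (61331011, 61672366)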
2018-06-29
2018
2019-03-07
2019
2