News Topic Detection Based on Semi-supervised DPMM
YAO Dongdong, YUAN Fang, WANG Yu, LIU Yu
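Abstract:
This paper builds on the Dirichlet process mixture model (DPMM), a nonparametric Bayesian generative model, and approaches topic detection from a semantic perspective by exploiting the model's ability to determine the number of clusters automatically. A sampling strategy in which the number of clusters K varies from large to small is adopted, and a fairly accurate value of K is obtained through layer-by-layer refinement. On this basis, the word-frequency characteristics of the semantic clusters are analyzed, and a set of noun entities is introduced as "hot feature words" to guide the clustering process, which yields a semi-supervised DPMM. Experimental results show that the proposed topic detection method achieves good detection performance on the TDT4 corpus.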
Keywords: topic detection; Dirichlet process; Gibbs sampling; power-law characteristics; noun entities
Foundation: Soft Science Research Program of Hebei Province (12457206D-11, 13455317D)
DOI: 10.13705/j.issn.1671-6841.2016070
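The abstract describes clustering with a DPMM via Gibbs sampling, where the number of clusters K is inferred rather than fixed. As a rough illustration only, the following is a minimal sketch of collapsed Gibbs sampling under a Chinese restaurant process prior with a Dirichlet-multinomial word likelihood. It is not the authors' implementation: the function names, the hyperparameters `alpha` and `beta`, and the choice to start with one cluster per document (so that K tends to shrink from large to small) are all assumptions made for this example, and the semi-supervised "hot feature word" guidance is not modeled.

```python
import math
import random
from collections import Counter


def log_predictive(doc, cluster_counts, cluster_total, vocab_size, beta):
    """Log probability of a document's word sequence under a cluster's
    Dirichlet-multinomial predictive distribution (symmetric prior beta)."""
    logp = 0.0
    seen = Counter()  # words of this document already accounted for
    for i, w in enumerate(doc):
        logp += math.log(cluster_counts.get(w, 0) + seen[w] + beta)
        logp -= math.log(cluster_total + i + vocab_size * beta)
        seen[w] += 1
    return logp


def sample_index(log_weights, rng):
    """Draw an index with probability proportional to exp(log_weights)."""
    m = max(log_weights)
    weights = [math.exp(lw - m) for lw in log_weights]
    r = rng.random() * sum(weights)
    acc = 0.0
    for idx, w in enumerate(weights):
        acc += w
        if r < acc:
            return idx
    return len(weights) - 1


def gibbs_dpmm(docs, vocab_size, alpha=1.0, beta=0.1, iters=30, seed=0):
    """Cluster documents (lists of word ids); returns one label per document."""
    rng = random.Random(seed)
    # Assumption for this sketch: start with one cluster per document (large K).
    z = list(range(len(docs)))
    size = {i: 1 for i in z}
    counts = {i: Counter(d) for i, d in enumerate(docs)}
    next_id = len(docs)
    for _ in range(iters):
        for i, doc in enumerate(docs):
            # Remove document i from its current cluster.
            k_old = z[i]
            size[k_old] -= 1
            counts[k_old].subtract(doc)
            if size[k_old] == 0:
                del size[k_old], counts[k_old]
            # Existing cluster k: weight n_k * p(doc | cluster k).
            labels = list(size)
            logw = [
                math.log(size[k])
                + log_predictive(doc, counts[k], sum(counts[k].values()),
                                 vocab_size, beta)
                for k in labels
            ]
            # New cluster: weight alpha * p(doc | prior only).
            logw.append(math.log(alpha)
                        + log_predictive(doc, Counter(), 0, vocab_size, beta))
            choice = sample_index(logw, rng)
            if choice == len(labels):
                k_new = next_id
                next_id += 1
                size[k_new] = 0
                counts[k_new] = Counter()
            else:
                k_new = labels[choice]
            z[i] = k_new
            size[k_new] += 1
            counts[k_new].update(doc)
    return z


if __name__ == "__main__":
    toy_docs = [[0, 0, 1], [0, 1, 1], [2, 3, 3], [2, 2, 3]]
    labels = gibbs_dpmm(toy_docs, vocab_size=4)
    print("labels:", labels, "K =", len(set(labels)))
```

The number of distinct labels in the returned list plays the role of K. Because the sampler is stochastic, the inferred K depends on `alpha`, `beta`, the seed, and the number of sweeps.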