Abstract: This paper reviews typical term weighting methods and analyzes where term frequency (tf) and tf*idf weights fall short in expressing a term's importance. It proposes a new term weighting method, tf*idf*cf, which combines term frequency, document frequency, and the term's class information, and therefore describes the importance of a term in a text more completely and accurately. Experimental results show that the method effectively improves classification performance.
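The abstract names the three ingredients of tf*idf*cf but does not give the formula. As a rough illustration only, the sketch below multiplies a standard tf*idf weight by a class factor. The definition of `cf` used here (the fraction of documents in the document's own class that contain the term) is an assumption made for this example and is not taken from the paper; the function name `tf_idf_cf` and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter


def tf_idf_cf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   : list of token lists
    labels : class label of each document (same length as docs)
    """
    n_docs = len(docs)
    # df[t]: number of documents that contain term t
    df = Counter(t for doc in docs for t in set(doc))
    # class_size[c]: number of documents labelled c
    class_size = Counter(labels)
    # class_df[(c, t)]: number of documents labelled c that contain t
    class_df = Counter((c, t) for doc, c in zip(docs, labels) for t in set(doc))

    weights = []
    for doc, c in zip(docs, labels):
        tf = Counter(doc)
        w = {}
        for t, freq in tf.items():
            idf = math.log(n_docs / df[t])
            # ASSUMPTION: cf is the share of class-c documents containing t.
            # The paper's actual class factor may be defined differently.
            cf = class_df[(c, t)] / class_size[c]
            w[t] = freq * idf * cf
        weights.append(w)
    return weights


# Toy corpus, invented for illustration.
docs = [
    ["ball", "goal", "ball"],
    ["goal", "team"],
    ["stock", "market"],
    ["stock", "team"],
]
labels = ["sport", "sport", "finance", "finance"]
weights = tf_idf_cf(docs, labels)
```

In the second toy document, "goal" and "team" have the same tf and the same idf, so plain tf*idf cannot separate them. Under the assumed class factor, "goal" receives the higher weight because it occurs in every "sport" document, while "team" occurs in only half of them.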
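Basic information:
Chinese Library Classification (CLC) number: TP391.1
Citation:
[1] Sun Ting, Geng Guohua, Zhou Mingquan. An effective method for computing feature weights [J]. Journal of Zhengzhou University (Natural Science Edition), 2008, 40(04): 48-51.
Funding:
Key project funded by the National Science and Technology Support Program of the 11th Five-Year Plan, No. 2006BAD20B02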
2008-12-15