| Downloads | Citations | Views |
| 516 | 16 | 259 |
Abstract: Bimodal dimensional emotion recognition suffers from low prediction performance because the information available from two modalities is incomplete. In addition, decision-level fusion mostly relies on the support vector regression algorithm, which cannot handle large samples effectively. To address these problems, motion capture (Mocap) data are added to the speech and text modalities, and a decision-level fusion method for dimensional emotion recognition based on stochastic gradient descent (SGD) is proposed for the resulting multimodal data. Combined with a multi-task learning mechanism, different deep learning models are trained on the speech, text, and Mocap features, and multimodal dimensional emotion recognition is achieved through decision-level fusion. Experimental results on the IEMOCAP dataset show that Mocap data are more helpful for improving the result on the valence dimension, that combining more emotion data helps improve the prediction performance of dimensional emotion recognition, and that SGD-based decision-level fusion yields a higher mean concordance correlation coefficient than other regression algorithms.
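The abstract names the two technical ingredients, SGD-based decision-level fusion and the concordance correlation coefficient (CCC), without giving their details. The following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a linear fusion of per-modality predictions for a single emotion dimension (for example valence), trained with per-sample SGD on squared error. The function names `ccc`, `fit_sgd_fusion`, and `fuse`, the hyperparameters, and the synthetic data are all hypothetical; no IEMOCAP data or deep learning models are involved.

```python
import random


def ccc(y_true, y_pred):
    """Concordance correlation coefficient of two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


def fit_sgd_fusion(preds, targets, lr=0.01, epochs=200, seed=0):
    """Fit weights w and bias b so that w . preds[i] + b approximates targets[i].

    Assumption: plain per-sample SGD on squared error with a constant step size.
    """
    rng = random.Random(seed)
    w = [0.0] * len(preds[0])
    b = 0.0
    order = list(range(len(preds)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            err = sum(wj * xj for wj, xj in zip(w, preds[i])) + b - targets[i]
            w = [wj - lr * err * xj for wj, xj in zip(w, preds[i])]
            b -= lr * err
    return w, b


def fuse(preds, w, b):
    """Apply the learned linear fusion to each row of modality predictions."""
    return [sum(wj * xj for wj, xj in zip(w, row)) + b for row in preds]


# Synthetic stand-in for unimodal outputs: each "modality" predicts the
# target with a different (hypothetical) noise level.
data_rng = random.Random(42)
targets = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
noise_levels = (0.3, 0.4, 0.5)  # stand-ins for speech, text, Mocap models
preds = [[t + data_rng.gauss(0.0, s) for s in noise_levels] for t in targets]

w, b = fit_sgd_fusion(preds, targets)
fused_ccc = ccc(targets, fuse(preds, w, b))
single_ccc = [ccc(targets, [row[j] for row in preds]) for j in range(3)]
print("per-modality CCC:", [round(c, 3) for c in single_ccc])
print("fused CCC:", round(fused_ccc, 3))
```

Because each SGD update touches one sample at a time, the cost per epoch grows linearly with the number of samples, which is the usual argument for preferring it over kernel-based support vector regression on large datasets. The sketch evaluates on its own training data, so it illustrates the mechanics only and says nothing about generalization.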
Basic information:
DOI: 10.13705/j.issn.1671-6841.2021299
Chinese Library Classification (CLC) number: TP391.41; TN912.34
Citation:
[1] HU Xinrong, CHEN Zhiheng, LIU Junping, et al. Decision-level fusion dimensional emotion recognition method based on SGD[J], 2022, 54(04): 49-54. DOI: 10.13705/j.issn.1671-6841.2021299.
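Funding:
National Natural Science Foundation of China (61103085); Hubei Provincial Program for Outstanding Young and Middle-aged Science and Technology Innovation Teams in Higher Education Institutions (T201807); Hubei Provincial University Intellectual Property Promotion Project (GXYS2018009); Key Project of the Scientific Research Program of the Hubei Provincial Department of Education (D20191708)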
2021-07-14
2021
2022-04-02
2022
2