2023, Vol. 55, No. 05, pp. 67-72
Emotional Voice Conversion Based on Multi-Domain Conditional Generation
Foundation: Chinese Testing International Research Fund Project (HT-202011-374)
DOI: 10.13705/j.issn.1671-6841.2022167
Submitted: 2022-06-22
Submission year: 2022
Final review: 2023-08-21
Final review year: 2023
Review cycle (years): 2
Abstract:

Emotional voice conversion converts speech from one emotion to another without changing the speaker's timbre or the linguistic content; in essence, it is style transfer for speech. Mainstream style transfer techniques include generative adversarial networks (such as CycleGAN and StarGAN) and instance normalization techniques (such as IN and CIN). Compared with IN, CIN adds a module that selects the mean and variance, which gives it stronger style transfer ability. This paper proposes CIN-StarGAN, an emotional voice conversion model that combines StarGAN and CIN by embedding the CIN module into the StarGAN generator. Experimental results on the ESD dataset show that CIN-StarGAN converges 28% faster than a CycleGAN-based emotion conversion model and has good style transfer ability. The approach has potential research value for multi-domain emotion conversion.
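
The abstract states that CIN extends IN with a module that selects the mean and variance. The sketch below illustrates that idea under several assumptions: that CIN here denotes conditional instance normalization, that the selected statistics take the form of a per-domain scale (`gamma`) and shift (`beta`) looked up by the target emotion label, and that features are laid out as channels by frames. The function names, the lookup-table form, and the toy values are all hypothetical; this is not the authors' implementation.

```python
import math


def instance_norm(x, eps=1e-5):
    """Normalize each channel of x (channels x frames) to zero mean and unit variance."""
    out = []
    for channel in x:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        out.append([(v - mean) / math.sqrt(var + eps) for v in channel])
    return out


def conditional_instance_norm(x, domain, gamma, beta, eps=1e-5):
    """Apply IN, then a scale and shift selected by the target-domain label.

    gamma[domain][c] and beta[domain][c] stand in for learned parameters
    (hypothetical lookup tables; in a trained model they would be learned).
    """
    normed = instance_norm(x, eps)
    return [
        [gamma[domain][c] * v + beta[domain][c] for v in channel]
        for c, channel in enumerate(normed)
    ]


# Toy feature map: 2 channels x 4 frames (made-up values).
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 9.5, 10.0]]

# Hypothetical parameter tables for two emotion domains.
gamma = {"neutral": [1.0, 1.0], "happy": [2.0, 0.5]}
beta = {"neutral": [0.0, 0.0], "happy": [1.0, -1.0]}

happy = conditional_instance_norm(features, "happy", gamma, beta)
```

With plain IN every input ends up with the same per-channel statistics, whereas in this sketch the target label chooses which mean and variance each channel is mapped to. That choice is what lets a single generator serve several target emotions.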

Basic information:

CLC number: TN912.3; TP183

Citation:

[1] YAO Wenhan, KE Dengfeng, HUANG Liangjie, et al. Emotional voice conversion based on multi-domain conditional generation[J], 2023, 55(05): 67-72. DOI: 10.13705/j.issn.1671-6841.2022167.
