Journal of Zhengzhou University (Natural Science Edition) (郑州大学学报（理学版）)

2023, Vol. 55, No. 05, pp. 67-72

Emotional Voice Conversion Based on Multi-Domain Conditional Generation

Yao Wenhan (姚文翰), Ke Dengfeng (柯登峰), Huang Liangjie (黄良杰), Hu Ruixin (胡睿欣), Xiang Minte (项敏特), Zhang Jinsong (张劲松)

1. School of Information Science, Beijing Language and Culture University

Foundation: Chinese Testing International Research Fund (汉考国际科研基金项目), grant HT-202011-374


DOI: 10.13705/j.issn.1671-6841.2022167

Submitted: 2022-06-22

Final review: 2023-08-21

Review cycle (years): 2


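Abstract (translated from the Chinese):

Emotional voice conversion is a technique that converts speech in one emotion into speech in another emotion without changing the speaker's voiceprint or the semantic content; in essence, it performs style transfer on speech. Mainstream style transfer techniques include generative adversarial techniques (such as CycleGAN and StarGAN) and instance normalization techniques (such as IN and CIN). Compared with IN, CIN adds a mean-variance selection module and therefore has stronger style transfer ability. An emotional voice conversion model that combines StarGAN and CIN, named CIN-StarGAN, is proposed; it embeds the CIN module into the StarGAN generator. Experimental results on the ESD dataset show that CIN-StarGAN converges 28% faster than a CycleGAN-based emotion conversion model and has good style conversion ability. The method has potential research value for multi-domain emotion conversion.

Keywords: emotional voice conversion; domain conversion; conditional instance normalization; generative adversarial network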

Abstract:

Emotional voice conversion converts the emotion of an utterance into another emotion without changing the speaker's timbre or the semantic content; in essence, it is style transfer for speech. Mainstream style transfer techniques include generative adversarial networks (such as CycleGAN and StarGAN) and instance normalization techniques (such as IN and CIN). Compared with IN, CIN adds a mean-variance selection module, which gives it stronger style transfer ability. This paper combines StarGAN and CIN and proposes a new emotional voice conversion model, CIN-StarGAN, which embeds the CIN module into the StarGAN generator. Experimental results on the ESD dataset show that CIN-StarGAN converges 28% faster than a CycleGAN-based emotion conversion model and has good style transfer ability. The approach has potential research value for multi-domain emotion conversion.

Keywords: emotional speech conversion; domain transfer; conditional instance normalization; generative adversarial network
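The difference between IN and CIN described in the abstract can be illustrated with a minimal sketch. It assumes the common formulation of conditional instance normalization, in which each target domain (here, an emotion) selects its own per-channel scale and shift after instance normalization. The function names, the toy feature values, and the `gamma`/`beta` tables are hypothetical and are not taken from the paper; in a trained model these parameters would be learned, and the authors' CIN module may differ in detail.

```python
import math


def instance_norm(x, eps=1e-5):
    """Normalize each channel of one utterance to zero mean and unit variance.

    x is a list of channels; each channel is a list of per-frame values.
    """
    out = []
    for channel in x:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        out.append([(v - mean) / math.sqrt(var + eps) for v in channel])
    return out


def conditional_instance_norm(x, domain, gamma, beta, eps=1e-5):
    """Instance-normalize x, then scale and shift with the target domain's parameters.

    gamma[domain][c] and beta[domain][c] are per-domain, per-channel values
    (assumption: learned in a real model, fixed by hand in this sketch).
    """
    normed = instance_norm(x, eps)
    return [
        [gamma[domain][c] * v + beta[domain][c] for v in channel]
        for c, channel in enumerate(normed)
    ]


# Hypothetical toy input: 2 feature channels x 4 frames.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 9.5, 10.0]]

# Hypothetical per-emotion scale (gamma) and shift (beta), one value per channel.
gamma = {"neutral": [1.0, 1.0], "happy": [2.0, 0.5]}
beta = {"neutral": [0.0, 0.0], "happy": [1.0, -1.0]}

happy = conditional_instance_norm(features, "happy", gamma, beta)
neutral = conditional_instance_norm(features, "neutral", gamma, beta)
```

After the call, each output channel has approximately the mean `beta[domain][c]` and the standard deviation `gamma[domain][c]`, so the same normalized content is re-styled by choosing a different domain key. Plain IN corresponds to a single shared scale and shift with no domain choice.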


Basic information:

CLC number: TN912.3; TP183

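Citation:

YAO Wenhan, KE Dengfeng, HUANG Liangjie, et al. Emotional voice conversion based on multi-domain conditional generation[J]. Journal of Zhengzhou University (Natural Science Edition), 2023, 55(05): 67-72. DOI: 10.13705/j.issn.1671-6841.2022167.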
