Eigenvoice speaker adaptation with minimal data for statistical speech synthesis systems using a MAP approach and nearest-neighbors

Mohammadi, Amir; Sarfjoo, Seyyed Saeed; Demiroğlu, Cenk

dc.contributor.author	Mohammadi, Amir
dc.contributor.author	Sarfjoo, Seyyed Saeed
dc.contributor.author	Demiroğlu, Cenk
dc.contributor.department	Electrical & Electronics Engineering
dc.contributor.ozuauthor	DEMİROĞLU, Cenk
dc.contributor.ozugradstudent	Mohammadi, Amir
dc.contributor.ozugradstudent	Sarfjoo, Seyyed Saeed
dc.date.accessioned	2015-12-17T10:41:50Z
dc.date.available	2015-12-17T10:41:50Z
dc.date.issued	2014-12
dc.description	Due to copyright restrictions, access to the full text of this article is available only via subscription.
dc.description.abstract	Statistical speech synthesis (SSS) systems have the ability to adapt to a target speaker with a couple of minutes of adaptation data. Developing adaptation algorithms that further reduce the required adaptation data to a few seconds can have a substantial effect on the deployment of the technology in real-life applications such as consumer electronics devices. The traditional way to achieve such rapid adaptation is the eigenvoice technique, which works well in speech recognition but is known to generate perceptual artifacts in statistical speech synthesis. Here, we propose three methods to alleviate the quality problems of the baseline eigenvoice adaptation algorithm while allowing speaker adaptation with minimal data. Our first method uses a Bayesian eigenvoice approach to constrain the adaptation algorithm to move in realistic directions in the speaker space, which reduces artifacts. Our second method finds pre-trained reference speakers that are close to the target speaker and uses only those reference speaker models in a second eigenvoice adaptation iteration. Both techniques performed significantly better than the baseline eigenvoice method in the objective tests. Similarly, they both improved the speech quality in subjective tests compared to the baseline eigenvoice method. In the third method, tandem use of the proposed eigenvoice method with a state-of-the-art linear-regression-based adaptation technique is found to improve adaptation of excitation features.	en_US
dc.description.sponsorship	TÜBİTAK ; European Commission
dc.identifier.doi	10.1109/TASLP.2014.2362009
dc.identifier.endpage	2157
dc.identifier.issn	2329-9290
dc.identifier.issue	12
dc.identifier.scopus	2-s2.0-84921805734
dc.identifier.startpage	2146
dc.identifier.uri	http://hdl.handle.net/10679/1321
dc.identifier.uri	https://doi.org/10.1109/TASLP.2014.2362009
dc.identifier.volume	22
dc.identifier.wos	000344459700019
dc.language.iso	eng	en_US
dc.peerreviewed	yes	en_US
dc.publicationstatus	published	en_US
dc.publisher	IEEE	en_US
dc.relation	info:eu-repo/grantAgreement/TUBITAK/1001 - Araştırma/109E281	en_US
dc.relation	info:eu-repo/grantAgreement/EC/FP7/268409	en_US
dc.relation.ispartof	IEEE/ACM Transactions on Audio, Speech, and Language Processing
dc.rights	restrictedAccess
dc.subject.keywords	Cluster adaptive training	en_US
dc.subject.keywords	Eigenvoice adaptation	en_US
dc.subject.keywords	Nearest neighbor	en_US
dc.subject.keywords	Speaker adaptation	en_US
dc.subject.keywords	Statistical speech synthesis	en_US
dc.title	Eigenvoice speaker adaptation with minimal data for statistical speech synthesis systems using a MAP approach and nearest-neighbors	en_US
dc.type	article	en_US
dspace.entity.type	Publication
relation.isOrgUnitOfPublication	7b58c5c4-dccc-40a3-aaf2-9b209113b763
relation.isOrgUnitOfPublication.latestForDiscovery	7b58c5c4-dccc-40a3-aaf2-9b209113b763

Collections

Computer Science
