Deep learning-based speaker-adaptive postfiltering with limited adaptation data for embedded text-to-speech synthesis systems

Eren, Eray; Demiroğlu, Cenk

dc.contributor.author	Eren, Eray
dc.contributor.author	Demiroğlu, Cenk
dc.contributor.department	Electrical & Electronics Engineering
dc.contributor.ozuauthor	DEMİROĞLU, Cenk
dc.contributor.ozugradstudent	Eren, Eray
dc.date.accessioned	2023-08-22T11:23:36Z
dc.date.available	2023-08-22T11:23:36Z
dc.date.issued	2023-06
dc.description.abstract	End-to-end (e2e) speech synthesis systems have become popular with the recent introduction of text-to-spectrogram conversion systems, such as Tacotron, that use encoder–decoder-based neural architectures. Even though those sequence-to-sequence systems can produce mel-spectrograms from the letters without a text processing frontend, they require substantial amounts of well-manipulated, labeled audio data that have high SNR and minimum amounts of artifacts. These data requirements make it difficult to build end-to-end systems from scratch, especially for low-resource languages. Moreover, most of the e2e systems are not designed for devices with tiny memory and CPU resources. Here, we investigate using a traditional deep neural network (DNN) for acoustic modeling together with a postfilter that improves the speech features produced by the network. The proposed architectures were trained with the relatively noisy, multi-speaker, Wall Street Journal (WSJ) database and tested with unseen speakers. The thin postfilter layer was adapted with minimal data to the target speaker for testing. We investigated several postfilter architectures and compared them with both objective and subjective tests. Fully-connected and transformer-based architectures performed the best in subjective tests. The novel adversarial transformer-based architecture with adaptive discriminator loss performed the best in the objective tests. Moreover, it was faster than the other architectures both in training and inference. Thus, our proposed lightweight transformer-based postfilter architecture significantly improved speech quality and efficiently adapted to new speakers with few shots of data and a hundred training iterations, making it computationally efficient and suitable for scalability.
dc.identifier.doi	10.1016/j.csl.2023.101520
dc.identifier.issn	0885-2308
dc.identifier.scopus	2-s2.0-85151678340
dc.identifier.uri	http://hdl.handle.net/10679/8731
dc.identifier.uri	https://doi.org/10.1016/j.csl.2023.101520
dc.identifier.volume	81
dc.identifier.wos	000978850300001
dc.language.iso	eng
dc.peerreviewed	yes
dc.publicationstatus	Published
dc.publisher	Elsevier
dc.relation.ispartof	Computer Speech and Language
dc.relation.publicationcategory	International Refereed Journal
dc.rights	restrictedAccess
dc.subject.keywords	Adversarial training
dc.subject.keywords	Deep learning
dc.subject.keywords	Postfilter
dc.subject.keywords	Speaker adaptation
dc.subject.keywords	Speech synthesis
dc.subject.keywords	Transformer
dc.title	Deep learning-based speaker-adaptive postfiltering with limited adaptation data for embedded text-to-speech synthesis systems
dc.type	article
dspace.entity.type	Publication
relation.isOrgUnitOfPublication	7b58c5c4-dccc-40a3-aaf2-9b209113b763
relation.isOrgUnitOfPublication.latestForDiscovery	7b58c5c4-dccc-40a3-aaf2-9b209113b763

Collections

Electrical & Electronics Engineering
