dc.contributor.author: Eren, Eray
dc.contributor.author: Demiroğlu, Cenk
dc.date.accessioned: 2023-08-22T11:23:36Z
dc.date.available: 2023-08-22T11:23:36Z
dc.date.issued: 2023-06
dc.identifier.issn: 0885-2308 [en_US]
dc.identifier.uri: http://hdl.handle.net/10679/8731
dc.identifier.uri: https://www.sciencedirect.com/science/article/pii/S0885230823000396
dc.description.abstract: End-to-end (e2e) speech synthesis systems have become popular with the recent introduction of text-to-spectrogram conversion systems, such as Tacotron, that use encoder–decoder-based neural architectures. Even though those sequence-to-sequence systems can produce mel-spectrograms from the letters without a text processing frontend, they require substantial amounts of well-manipulated, labeled audio data that have high SNR and minimum amounts of artifacts. These data requirements make it difficult to build end-to-end systems from scratch, especially for low-resource languages. Moreover, most of the e2e systems are not designed for devices with tiny memory and CPU resources. Here, we investigate using a traditional deep neural network (DNN) for acoustic modeling together with a postfilter that improves the speech features produced by the network. The proposed architectures were trained with the relatively noisy, multi-speaker, Wall Street Journal (WSJ) database and tested with unseen speakers. The thin postfilter layer was adapted with minimal data to the target speaker for testing. We investigated several postfilter architectures and compared them with both objective and subjective tests. Fully-connected and transformer-based architectures performed the best in subjective tests. The novel adversarial transformer-based architecture with adaptive discriminator loss performed the best in the objective tests. Moreover, it was faster than the other architectures both in training and inference. Thus, our proposed lightweight transformer-based postfilter architecture significantly improved speech quality and efficiently adapted to new speakers with few shots of data and a hundred training iterations, making it computationally efficient and suitable for scalability. [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Elsevier [en_US]
dc.relation.ispartof: Computer Speech and Language
dc.rights: restrictedAccess
dc.title: Deep learning-based speaker-adaptive postfiltering with limited adaptation data for embedded text-to-speech synthesis systems [en_US]
dc.type: Article [en_US]
dc.peerreviewed: yes [en_US]
dc.publicationstatus: Published [en_US]
dc.contributor.department: Özyeğin University
dc.contributor.authorID: (ORCID & YÖK ID 144947) Demiroğlu, Cenk
dc.contributor.ozuauthor: Demiroğlu, Cenk
dc.identifier.volume: 81 [en_US]
dc.identifier.wos: WOS:000978850300001
dc.identifier.doi: 10.1016/j.csl.2023.101520 [en_US]
dc.subject.keywords: Adversarial training [en_US]
dc.subject.keywords: Deep learning [en_US]
dc.subject.keywords: Postfilter [en_US]
dc.subject.keywords: Speaker adaptation [en_US]
dc.subject.keywords: Speech synthesis [en_US]
dc.subject.keywords: Transformer [en_US]
dc.identifier.scopus: SCOPUS:2-s2.0-85151678340
dc.contributor.ozugradstudent: Eren, Eray
dc.relation.publicationcategory: Article - International Refereed Journal - Institutional Academic Staff and Graduate Student


