Browsing by Author "Eren, Eray"
Now showing 1 - 3 of 3
Article | Metadata only
Deep learning-based speaker-adaptive postfiltering with limited adaptation data for embedded text-to-speech synthesis systems (Elsevier, 2023-06)
Eren, Eray; Demiroğlu, Cenk; Electrical & Electronics Engineering
End-to-end (e2e) speech synthesis systems have become popular with the recent introduction of text-to-spectrogram conversion systems, such as Tacotron, that use encoder-decoder-based neural architectures. Even though those sequence-to-sequence systems can produce mel-spectrograms from letters without a text-processing frontend, they require substantial amounts of well-curated, labeled audio data with high SNR and minimal artifacts. These data requirements make it difficult to build end-to-end systems from scratch, especially for low-resource languages. Moreover, most e2e systems are not designed for devices with tiny memory and CPU resources. Here, we investigate using a traditional deep neural network (DNN) for acoustic modeling together with a postfilter that improves the speech features produced by the network. The proposed architectures were trained with the relatively noisy, multi-speaker Wall Street Journal (WSJ) database and tested with unseen speakers. The thin postfilter layer was adapted to the target speaker with minimal data for testing. We investigated several postfilter architectures and compared them with both objective and subjective tests. Fully-connected and transformer-based architectures performed best in the subjective tests. The novel adversarial transformer-based architecture with adaptive discriminator loss performed best in the objective tests, and it was faster than the other architectures in both training and inference. Thus, our proposed lightweight transformer-based postfilter architecture significantly improved speech quality and adapted efficiently to new speakers with a few shots of data and a hundred training iterations, making it computationally efficient and suitable for scaling.

Master Thesis | Metadata only
Speaker adaptation with deep learning for text-to-speech synthesis systems
Eren, Eray; Demiroğlu, Cenk; Kayış, Enis; Güz, Ü.; Department of Computer Science
End-to-end (e2e) speech synthesis systems have become popular with the recent introduction of letter-to-spectrogram conversion systems, such as Tacotron, that use encoder-decoder-based neural architectures. Even though those sequence-to-sequence systems can produce mel-spectrograms from letters without a text-processing frontend, they require substantial amounts of well-curated, labeled audio data with high SNR and minimal artifacts. These data requirements make it difficult to build end-to-end systems from scratch, especially for low-resource languages. Moreover, most e2e systems are not designed for devices with tiny memory and CPU resources. Here, we investigate using a traditional deep neural network (DNN) for acoustic modeling together with a postfilter that improves the speech features produced by the network. The proposed architectures were trained with the relatively noisy, multi-speaker Wall Street Journal (WSJ) database and tested with unseen speakers. The thin postfilter layer was adapted to the target speaker with minimal data for testing. We investigated several postfilter architectures and compared them with both objective and subjective tests. Fully-connected and transformer-based architectures performed best in the subjective tests. The transformer-based architecture performed best in the objective tests, and it was faster than the other architectures in both training and inference speed.

Article | Metadata only
Uncertainty assessment for detection of spoofing attacks to speaker verification systems using a Bayesian approach (Elsevier, 2022-02)
Süslü, Çağıl; Eren, Eray; Demiroğlu, Cenk; Electrical & Electronics Engineering
There has been tremendous progress in automatic speaker verification systems over the last decade. Still, spoofing attacks pose a significant challenge to their deployment. Even though there are various attack techniques, such as voice conversion and speech synthesis, replay attacks are one of the most important types since they can be mounted without significant expertise in speech technology. Moreover, replay attacks are hard to detect because they are simple replays of the original audio. The problem has gained more attention since the introduction of the ASVspoof 2017 challenge, which included a well-designed database with realistic replay-attack conditions. Even though many different deep network types and acoustic features have been proposed since the challenge, one key issue, model uncertainty around the neural network's decision, has been largely ignored. This is a consequence of using the softmax function with the cross-entropy loss, which is standard practice in many domains. Here, we propose using evidential deep learning, a recently proposed method that is rapidly gaining popularity, to assess the model uncertainty around the network's decision. Experimental results show that the investigated network architectures perform better in terms of equal error rate (EER) with the new loss function. Moreover, the reliability of the measured uncertainty is demonstrated by filtering samples out of the test set with the Bayesian uncertainty measure, which resulted in a consistent decrease in EER as the threshold decreased.
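The evidential deep learning approach in the last entry replaces point-estimate softmax probabilities with Dirichlet evidence, from which a per-sample uncertainty follows in closed form. The following is a minimal sketch, not the paper's exact setup: the two-class spoof/bona-fide head, the softplus evidence mapping, and the function name are illustrative assumptions.

```python
import numpy as np

def edl_outputs(logits):
    """Map raw network outputs to Dirichlet evidence (hypothetical two-class
    spoof/bona-fide head). Softplus keeps evidence non-negative."""
    evidence = np.log1p(np.exp(logits))            # softplus(logits) >= 0
    alpha = evidence + 1.0                         # Dirichlet parameters
    strength = alpha.sum(axis=-1, keepdims=True)   # total Dirichlet strength
    prob = alpha / strength                        # expected class probabilities
    K = logits.shape[-1]                           # number of classes
    uncertainty = K / strength.squeeze(-1)         # total uncertainty in (0, 1]
    return prob, uncertainty
```

Filtering test samples whose `uncertainty` exceeds a threshold before scoring is the kind of procedure that would produce the EER-versus-threshold behavior the abstract reports: confident (high-evidence) predictions yield low uncertainty and are kept, while ambiguous ones are rejected.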
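The few-shot speaker adaptation described in the first two entries, updating only a thin postfilter layer on the target speaker's data for on the order of a hundred iterations, can be illustrated with a hypothetical linear postfilter trained by gradient descent. The actual postfilters are fully-connected and transformer-based; the linear layer, function name, and hyperparameters below are assumptions for illustration only.

```python
import numpy as np

def adapt_postfilter(W, b, feats, targets, lr=0.01, iters=100):
    """Few-shot adaptation of a thin linear postfilter (stand-in for the
    transformer postfilter): nudge synthesized features `feats` toward the
    target speaker's `targets` with ~100 mean-squared-error gradient steps."""
    n = len(feats)
    for _ in range(iters):
        pred = feats @ W + b              # postfiltered features
        err = pred - targets              # per-frame residual
        W -= lr * feats.T @ err / n       # gradient step on the weight matrix
        b -= lr * err.mean(axis=0)        # gradient step on the bias
    return W, b
```

Because only this thin layer is updated while the acoustic model stays frozen, adaptation needs very little target-speaker data and compute, which is the property that makes the approach suitable for embedded deployment.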