Computer Science

Permanent URI for this collection: https://hdl.handle.net/10679/43

Metadata only
Analysis of speaker similarity in the statistical speech synthesis systems using a hybrid approach
(IEEE, 2012) Güner, Ekrem; Mohammadi, A.; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Güner, Ekrem
The statistical speech synthesis (SSS) approach has become one of the most popular and successful methods in the speech synthesis field. Smooth speech transitions, without the spurious errors that are observed in unit selection systems, can be generated with the SSS approach. However, a well-known issue with SSS is the lack of voice similarity to the target speaker. The issue arises both in speaker-dependent models and in models that are adapted from average voices. Moreover, in speaker adaptation, similarity to the target speaker does not increase significantly after around one minute of adaptation data, which potentially indicates inherent bottleneck(s) in the system. Here, we propose using the hybrid speech synthesis approach to understand the key factors behind the speaker similarity problem. To that end, we try to answer the following question: which segments and parameters of speech, if generated/synthesized better, would yield a substantial improvement in speaker similarity? In this work, our hybrid methods are described and listening test results are presented and discussed.
Metadata only
Analysis of speech-based measures for detecting and monitoring Alzheimer’s disease
(Springer Science+Business Media, 2014) Khodabakhsh, Ali; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Khodabakhsh, Ali
Automatic diagnosis of Alzheimer’s disease, as well as monitoring of diagnosed patients, can have a significant economic impact on societies. We investigated an automatic diagnosis approach through the use of speech-based features. As opposed to standard tests, spontaneous conversations are carried out with the subjects and recorded. Speech features could discriminate between healthy people and the patients with high reliability. Although the patients were in later stages of Alzheimer’s disease, results indicate the potential of speech-based automated solutions for Alzheimer’s disease diagnosis. Moreover, the data collection process employed here can be done inexpensively by call center agents in a real-life application. Thus, the investigated techniques hold the potential to significantly reduce the financial burden on governments and Alzheimer’s patients.
Metadata only
Anti-spoofing for text-independent speaker verification: An initial database, comparison of countermeasures, and human performance
(IEEE, 2016-04) Wu, Z.; Leon, P. L. de; Demiroğlu, Cenk; Khodabakhsh, Ali; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Khodabakhsh, Ali
In this paper, we present a systematic study of the vulnerability of automatic speaker verification to a diverse range of spoofing attacks. We start with a thorough analysis of the spoofing effects of five speech synthesis and eight voice conversion systems, and the vulnerability of three speaker verification systems under those attacks. We then introduce a number of countermeasures to prevent spoofing attacks from both known and unknown attackers. Known attackers are spoofing systems whose output was used to train the countermeasures, while an unknown attacker is a spoofing system whose output was not available to the countermeasures during training. Finally, we benchmark automatic systems against human performance on both speaker verification and spoofing detection tasks.
Metadata only
Comparative study of credit risk evaluation for unbalanced datasets using deep learning classifiers
(IEEE, 2023) Öner, T.; Alnahas, D.; Kanturvardar, A.; Ülkgün, A. M.; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk
Credit risk assessment deals with calculating the risk of a loan not being repaid. For this reason, a lot of research effort is directed at credit risk analysis. In this study, machine learning models such as Light Gradient-Boosting Machine and Neural Networks are utilized for credit risk assessment. These machine learning models are trained and tested using the Home Credit Default Risk dataset, which was obtained from a competition on the website kaggle.com. Resampling techniques were also implemented to tackle the class imbalance problem in the dataset. Moreover, various preprocessing techniques were utilized to deal with missing values and outliers in the dataset. The study presents the results of experiments with different parameters and preprocessing techniques and showcases the optimal configuration for the best results. The performance metrics of the machine learning models that are implemented in the experiments are compared to the performance metrics of a baseline system that used the Light Gradient-Boosting Machine model without applying preprocessing techniques.
Metadata only
Cross-lingual speaker adaptation for statistical speech synthesis using limited data
(Interspeech, 2016) Sarfjoo, Seyyed Saeed; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Sarfjoo, Seyyed Saeed
Cross-lingual speaker adaptation with limited adaptation data has many applications, such as use in speech-to-speech translation systems. Here, we focus on cross-lingual adaptation for statistical speech synthesis (SSS) systems using limited adaptation data. To that end, we propose two techniques exploiting a bilingual Turkish-English speech database that we collected. In one approach, speaker-specific state-mapping is proposed for cross-lingual adaptation, which performed significantly better than the baseline state-mapping algorithm in adapting the excitation parameter both in objective and subjective tests. In the second approach, eigenvoice adaptation is done in the input language, which is then used to estimate the eigenvoice weights in the output language using weighted linear regression. The second approach performed significantly better than the baseline system in adapting the spectral envelope parameters both in objective and subjective tests.
Open Access
Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources
(Springer, 2024-02-12) Barakat, Huda Mohammed Mohammed; Turk, O.; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Barakat, Huda Mohammed Mohammed
Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models possess the capability to generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle these challenges as documented in the literature. In Section 8, we pinpoint some research gaps in this field that necessitate further exploration. Our objective with this work is to give an all-encompassing overview of this hot research area to offer guidance to interested researchers and future endeavors in this field.
Metadata only
Detection of Alzheimer's disease using prosodic cues in conversational speech
(IEEE, 2014) Khodabakhsh, Ali; Kuşçuoğlu, Serhan; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Khodabakhsh, Ali; Kuşçuoğlu, Serhan
Automatic diagnosis of Alzheimer's disease, as well as monitoring of diagnosed patients, can have a significant economic impact on societies. We investigated an automatic diagnosis approach through the use of speech-based features. As opposed to standard tests that are mostly focused on memory recall, spontaneous conversations are carried out with the subjects in informal settings. Prosodic features extracted from speech could discriminate between healthy people and the patients with high reliability. Although the patients were in later stages of Alzheimer's disease, results indicate the potential of speech-based automated solutions for Alzheimer's disease diagnosis. Moreover, the data collection process employed here can be done inexpensively by call center agents in a real-life application. Thus, the investigated techniques hold the potential to significantly reduce the financial burden on governments and Alzheimer's patients.
Metadata only
Disentangling human trafficking types and the identification of pathways to forced labor and sex: an explainable analytics approach
(Springer, 2023-07) Eryarsoy, E.; Topuz, K.; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk
Terms such as human trafficking and modern-day slavery are ephemeral but reflect manifestations of oppression, servitude, and captivity that have perpetually threatened the basic rights of all humans. Operations research and analytical tools offering practical wisdom have paid scant attention to this overarching problem. Motivated by this lacuna, this study considers two of the most prevalent categories of human trafficking: forced labor and forced sex. Using one of the largest available datasets, due to the Counter-Trafficking Data Collective (CTDC), we examine patterns related to forced sex and forced labor. Our study uses a two-phase approach focusing on explainability: Phase 1 involves logistic regression (LR) segueing to association rules analysis and Phase 2 employs Bayesian Belief Networks (BBNs) to uncover intricate pathways leading to human trafficking. This combined approach provides a comprehensive understanding of the factors contributing to human trafficking, effectively addressing the limitations of conventional methods. We confirm and challenge some of the key findings in the extant literature and call for better prevention strategies. Our study goes beyond the pretext of analytics usage by prescribing how to incorporate our results in combating human trafficking.
Metadata only
DNN-based speaker-adaptive postfiltering with limited adaptation data for statistical speech synthesis systems
(IEEE, 2019) Öztürk, M. G.; Ulusoy, O.; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk
Deep neural networks (DNNs) have been successfully deployed for acoustic modelling in statistical parametric speech synthesis (SPSS) systems. Moreover, DNN-based postfilters (PF) have also been shown to outperform conventional postfilters that are widely used in SPSS systems for increasing the quality of synthesized speech. However, existing DNN-based postfilters are trained with speaker-dependent databases. Given that SPSS systems can rapidly adapt to new speakers from generic models, there is a need for DNN-based postfilters that can adapt to new speakers with minimal adaptation data. Here, we compare DNN-, RNN-, and CNN-based postfilters together with adversarial (GAN) training and cluster-based initialization (CI) for rapid adaptation. Results indicate that the feedforward (FF) DNN, together with GAN and CI, significantly outperforms the other recently proposed postfilters.
Metadata only
Eigenvoice speaker adaptation with minimal data for statistical speech synthesis systems using a MAP approach and nearest-neighbors
(IEEE, 2014-12) Mohammadi, Amir; Sarfjoo, Seyyed Saeed; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Mohammadi, Amir; Sarfjoo, Seyyed Saeed
Statistical speech synthesis (SSS) systems have the ability to adapt to a target speaker with a couple of minutes of adaptation data. Developing adaptation algorithms to further reduce the required adaptation data to a few seconds can have a substantial effect on the deployment of the technology in real-life applications such as consumer electronics devices. The traditional way to achieve such rapid adaptation is the eigenvoice technique, which works well in speech recognition but is known to generate perceptual artifacts in statistical speech synthesis. Here, we propose three methods to alleviate the quality problems of the baseline eigenvoice adaptation algorithm while allowing speaker adaptation with minimal data. Our first method is based on using a Bayesian eigenvoice approach for constraining the adaptation algorithm to move in realistic directions in the speaker space to reduce artifacts. Our second method is based on finding pre-trained reference speakers that are close to the target speaker and utilizing only those reference speaker models in a second eigenvoice adaptation iteration. Both techniques performed significantly better than the baseline eigenvoice method in the objective tests. Similarly, they both improved the speech quality in subjective tests compared to the baseline eigenvoice method. In the third method, tandem use of the proposed eigenvoice method with a state-of-the-art linear regression based adaptation technique is found to improve adaptation of excitation features.
Metadata only
Eklemeli diller için düşük bellekli melez istatistiksel/birim seçmeli MKS sistemi
(IEEE, 2012) Guner, Ekrem; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Guner, Ekrem
The HMM-based TTS (HTS) approach has been attracting increasing attention from the TTS research community. One of its advantages is the lack of the spurious errors that are observed in the unit selection scheme. Another advantage of the HTS system is its small memory footprint, which makes it attractive for embedded devices. Here, we propose a novel hybrid statistical/unit selection TTS system for agglutinative languages that aims at improving the quality of the baseline HTS system while keeping the memory footprint small. The intelligibility and quality scores of the baseline system are comparable to the MOS scores for English reported in the Blizzard Challenge tests. Listeners preferred the hybrid system over the baseline system in the A/B preference tests.
Open Access
Evaluation of linguistic and prosodic features for detection of Alzheimer’s disease in Turkish conversational speech
(Springer Science+Business Media, 2015-12) Khodabakhsh, Ali; Yesil, Fatih; Guner, Ekrem; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Khodabakhsh, Ali; Yesil, Fatih; Guner, Ekrem
Automatic diagnosis and monitoring of Alzheimer’s disease can have a significant impact on society as well as the well-being of patients. The part of the brain cortex that processes language abilities is one of the earliest parts to be affected by the disease. Therefore, detection of Alzheimer’s disease using speech-based features is gaining increasing attention. Here, we investigated an extensive set of features based on speech prosody as well as linguistic features derived from transcriptions of Turkish conversations with subjects with and without Alzheimer’s disease. Unlike most standardized tests that focus on memory recall or structured conversations, spontaneous unstructured conversations are conducted with the subjects in informal settings. Age-, education-, and gender-controlled experiments are performed to eliminate the effects of those three variables. Experimental results show that the proposed features extracted from the speech signal can be used to discriminate between the control group and the patients with Alzheimer’s disease. Prosodic features performed significantly better than the linguistic features. Classification accuracy over 80% was obtained with three of the prosodic features, but experiments with feature fusion did not further improve the classification performance.
Metadata only
Finding relevant features for statistical speech synthesis adaptation
(European Language Resources Association, 2014-05) Bruneau, P.; Parisot, O.; Mohammadi, Amir; Demiroğlu, Cenk; Ghoniem, M.; Tamisier, T.; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Mohammadi, Amir
Statistical speech synthesis (SSS) models typically lie in a very high-dimensional space. They can be used to allow speech synthesis on digital devices, using only a few sentences of input by the user. However, the adaptation algorithms of such weakly trained models suffer from the high dimensionality of the feature space. Because creating new voices is easy with the SSS approach, thousands of voices can be trained and a nearest-neighbor algorithm can be used to obtain better speaker similarity in those limited-data cases. Nearest-neighbor methods require good distance measures that correlate well with human perception. This paper investigates the problem of finding good low-cost metrics, i.e. simple functions of feature values that map to objective signal quality metrics. To this aim, we use high-dimensional data visualization and dimensionality reduction techniques. Data mining principles are also applied to formulate a tractable view of the problem and propose tentative solutions. With a performance index improved by 36% w.r.t. a naive solution, while using only 0.77% of the respective amount of features, our results are promising. Finally, perspectives on new adaptation algorithms and tighter integration of data mining and visualization principles are given.
Metadata only
Gauss karışım modeli tabanlı konuşmacı belirleme sistemlerinde klasik MAP uyarlanması yönteminin performans analizi
(IEEE, 2010) Erdoğan, A.; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk
The Gaussian mixture model (GMM) is one of the most commonly used methods in text-independent speaker identification systems. In this paper, the performance of the GMM approach has been measured with different parameters and settings. The voice activity detection (VAD) component has been found to have a significant impact on the performance. Therefore, VAD algorithms that are robust to background noise have been proposed. Significant differences in performance have been observed between male and female speakers and between GSM and PSTN channels. Moreover, the single-stream GMM approach has been found to perform significantly better than the multi-stream GMM approach. It has been observed under all conditions that data duration is critical for good performance.
Metadata only
Gauss karışım modeli tabanlı konuşmacı doğrulama sistemlerinde kişiye ve kanala uyarlanmada klasik MAP tabanlı yöntemlerin performans analizi
(IEEE, 2011) Koşunda, Serol; Yeşil, Fatih; Ayazoğlu, Yaprak; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Koşunda, Serol; Yeşil, Fatih; Ayazoğlu, Yaprak
In this paper, we report the performance of Gaussian mixture model (GMM) based algorithms implemented in the Speech Processing Laboratory at Ozyegin University on the NIST SRE2004 and 2006 databases. The GMM is one of the most commonly used methods in text-independent speaker verification systems. The performance of the GMM approach has been measured with different parameters and settings. It has also been observed that the eigenchannel-MAP and JFA methods both increased the performance of the system against session variability, which is one of the most challenging problems in text-independent speaker verification systems.
Metadata only
Hybrid nearest-neighbor/cluster adaptive training for rapid speaker adaptation in statistical speech synthesis systems
(International Speech Communication Association, 2013) Mohammadi, Amir; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Mohammadi, Amir
The statistical speech synthesis (SSS) approach has become one of the most popular methods in the speech synthesis field. An advantage of the SSS approach is the ability to adapt to a target speaker with a couple of minutes of adaptation data. However, many applications, especially in consumer electronics, require adaptation with only a few seconds of data, which can be done using eigenvoice adaptation techniques. Although such techniques work well in speech recognition, they are known to generate perceptual artifacts in statistical speech synthesis. Here, we propose two methods to both alleviate those quality problems and improve the speaker similarity obtained with the baseline eigenvoice adaptation algorithm. Our first method is based on using a Bayesian approach for constraining the eigenvoice adaptation algorithm to move in realistic directions in the speaker space to reduce artifacts. Our second method is based on finding a reference speaker that is close to the target speaker, and using that reference speaker as the seed model in a second eigenvoice adaptation step. Both techniques performed significantly better than the baseline eigenvoice method in the subjective quality and similarity tests.
Open Access
Hybrid statistical/unit-selection Turkish speech synthesis using suffix units
(Springer International Publishing, 2016-12) Demiroğlu, Cenk; Güner, Ekrem; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Güner, Ekrem
Unit selection based text-to-speech synthesis (TTS) has been the dominant TTS approach of the last decade. Despite its success, the unit selection approach has its disadvantages. One of the most significant disadvantages is the sudden discontinuities in speech that distract the listeners (Speech Commun 51:1039-1064, 2009). The second disadvantage is that significant expertise and large amounts of data are needed for building a high-quality synthesis system, which is costly and time-consuming. The statistical speech synthesis (SSS) approach is a promising alternative synthesis technique. Not only are the spurious errors that are observed in the unit selection system mostly absent in SSS, but building voice models is also far less expensive and faster than with the unit selection system. However, the resulting speech is typically not as natural-sounding as speech that is synthesized with a high-quality unit selection system. There are hybrid methods that attempt to take advantage of both SSS and unit selection systems. However, existing hybrid methods still require development of a high-quality unit selection system. Here, we propose a novel hybrid statistical/unit selection system for Turkish that aims at improving the quality of the baseline SSS system by improving the prosodic parameters such as intonation and stress. Commonly occurring suffixes in Turkish are stored in the unit selection database and used in the proposed system. As opposed to existing hybrid systems, the proposed system was developed without building a complete unit selection synthesis system. Therefore, the proposed method can be used without collecting large amounts of data or utilizing the substantial expertise and time-consuming tuning that are typically required in building unit selection systems. Listeners preferred the hybrid system over the baseline system in the AB preference tests.
Metadata only
Konuşmacı aradeğerlemeli SMM tabanlı metinden konuşma sentezleme sistemi
(IEEE, 2011) Orhan, Mustafa Cem; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Orhan, Mustafa Cem
Hidden Markov Model (HMM) based text-to-speech (TTS) systems offer many advantages compared to the concatenative approach. One of those advantages is the ability to interpolate between different speakers to generate new voices. In this paper, speaker interpolation for HMM-based TTS (HTS) is described and listening test results for the interpolation of English and Turkish voices are presented. Similar to English, we obtained Turkish speech that strongly reflects the interpolation ratio in perceptual similarity. Some insight into the interpolation process is also provided by analysing the spectra of the reference and final voices.
Metadata only
LIG at MediaEval 2015 multimodal person discovery in broadcast TV task
(CEUR-WS, 2015) Budnik, M.; Safadi, B.; Besacier, L.; Quénot, G.; Khodabakhsh, Ali; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Khodabakhsh, Ali
In this working notes paper the contribution of the LIG team (partnership between Univ. Grenoble Alpes and Ozyegin University) to the Multimodal Person Discovery in Broadcast TV task in MediaEval 2015 is presented. The task focused on unsupervised learning techniques. Two different approaches were submitted by the team. In the first one, new features for face and speech modalities were tested. In the second one, an alternative way to calculate the distance between face tracks and speech segments is presented. It also had a competitive MAP score and was able to beat the baseline.
Open Access
Multi-lingual depression-level assessment from conversational speech using acoustic and text features
(International Speech Communication Association, 2018) Özkanca, Yasin Serdar; Demiroğlu, Cenk; Besirli, A.; Çelik, S.; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Özkanca, Yasin Serdar
Depression is a common mental health problem around the world, placing a large burden on economies and on the well-being, and hence productivity, of individuals. Its early diagnosis and treatment are critical to reduce costs and even save lives. One key aspect of achieving that goal is to use voice technologies to monitor depression remotely and relatively inexpensively using automated agents. Although there have been efforts to automatically assess depression levels from audiovisual features, the use of transcriptions along with acoustic features has emerged as a more recent research avenue. Moreover, difficulty in data collection and the limited amounts of data available for research are also challenges that are hampering the success of the algorithms. One of the novel contributions in this paper is to exploit the databases from multiple languages for feature selection. Since a large number of features can be extracted from speech and given the small amounts of training data available, effective data selection is critical for success. Our proposed multi-lingual method was effective at selecting better features and significantly improved the depression assessment accuracy. We also use text-based features for assessment and propose a novel strategy to fuse the text- and speech-based classifiers, which further boosted the performance.

Browsing by Institution Author "DEMİROĞLU, Cenk"