Person: DEMİROĞLU, Cenk
Publication Search Results (showing 1 - 10 of 42)
1. Conference Object (Metadata only)
LIG at MediaEval 2015 multimodal person discovery in broadcast TV task (CEUR-WS, 2015)
Budnik, M.; Safadi, B.; Besacier, L.; Quénot, G.; Khodabakhsh, Ali; Demiroğlu, Cenk (Electrical & Electronics Engineering)
These working notes present the contribution of the LIG team (a partnership between Univ. Grenoble Alpes and Ozyegin University) to the Multimodal Person Discovery in Broadcast TV task at MediaEval 2015. The task focused on unsupervised learning techniques. The team submitted two different approaches. In the first, new features for the face and speech modalities were tested. In the second, an alternative way to calculate the distance between face tracks and speech segments is presented; this approach also achieved a competitive MAP score and beat the baseline.

2. Conference Object (Metadata only)
Gauss karışım modeli tabanlı konuşmacı belirleme sistemlerinde klasik MAP uyarlanması yönteminin performans analizi (IEEE, 2010)
Erdoğan, A.; Demiroğlu, Cenk (Electrical & Electronics Engineering)
Gaussian mixture models (GMMs) are among the most commonly used methods in text-independent speaker identification systems. In this paper, the performance of the GMM approach has been measured with different parameters and settings. The voice activity detection (VAD) component has been found to have a significant impact on performance; therefore, VAD algorithms that are robust to background noise have been proposed. Significant performance differences have been observed between male and female speakers and between GSM and PSTN channels. Moreover, the single-stream GMM approach has been found to perform significantly better than the multi-stream GMM approach. Under all conditions, data duration has been observed to be critical for good performance. (A sketch of classical MAP adaptation follows entry 4 below.)

3. Book Part (Metadata only)
Analysis of speech-based measures for detecting and monitoring Alzheimer’s disease (Springer Science+Business Media, 2014)
Khodabakhsh, Ali; Demiroğlu, Cenk (Electrical & Electronics Engineering)
Automatic diagnosis of Alzheimer’s disease, as well as monitoring of diagnosed patients, can have a significant economic impact on societies. We investigated an automatic diagnosis approach based on speech features. As opposed to standard tests, spontaneous conversations were carried out and recorded with the subjects. Speech features could discriminate between healthy people and patients with high reliability. Although the patients were in later stages of Alzheimer’s disease, the results indicate the potential of speech-based automated solutions for Alzheimer’s disease diagnosis. Moreover, the data collection process employed here can be performed inexpensively by call center agents in a real-life application. Thus, the investigated techniques hold the potential to significantly reduce the financial burden on governments and Alzheimer’s patients.

4. Conference Object (Metadata only)
Sesli yanıt sistemi çağrı akışında dilbilgisi tabanlı Türkçe konuşma tanıma sistemi tanıtımı (IEEE, 2012)
Karagöz, Gün; Demiroğlu, Cenk (Electrical & Electronics Engineering)
This paper describes a grammar-based Turkish speech recognition system used in the interactive voice response (IVR) call flow of call centers. In this study, the call flow of a telecommunications company's call center was implemented as a sample case.
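To make the classical MAP adaptation discussed in entry 2 concrete, here is a minimal sketch of GMM-UBM speaker identification with MAP mean adaptation. It is an illustrative reconstruction, not the paper's implementation: the component count, the relevance factor of 16, and the function names (train_ubm, map_adapt_means, identify) are conventional or invented choices.

```python
# Minimal GMM-UBM speaker-ID sketch with classical MAP mean adaptation.
# Hyperparameters (64 components, relevance factor 16) are conventional
# defaults, not necessarily the settings used in the paper.
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features, n_components=64):
    """Fit a diagonal-covariance universal background model (UBM)."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(features)
    return ubm

def map_adapt_means(ubm, speaker_feats, relevance=16.0):
    """Classical MAP adaptation: shift UBM means toward the speaker's data."""
    post = ubm.predict_proba(speaker_feats)       # (T, K) responsibilities
    n_k = post.sum(axis=0)                        # soft frame counts per component
    # Posterior-weighted mean of the speaker's data for each component.
    e_k = (post.T @ speaker_feats) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]    # data-dependent interpolation
    spk = copy.deepcopy(ubm)
    spk.means_ = alpha * e_k + (1.0 - alpha) * ubm.means_
    return spk

def identify(speaker_models, utterance_feats):
    """Return the enrolled speaker whose adapted GMM scores highest."""
    return max(speaker_models,
               key=lambda name: speaker_models[name].score(utterance_feats))
```

Only the means are adapted here; weights and covariances can be adapted with the same soft-count pattern. A VAD front end, which the paper found to have a significant impact, would filter the feature frames before both training and scoring.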
5. Conference Object (Open Access)
Multi-lingual depression-level assessment from conversational speech using acoustic and text features (International Speech Communication Association, 2018)
Özkanca, Yasin Serdar; Demiroğlu, Cenk; Besirli, A.; Çelik, S. (Electrical & Electronics Engineering)
Depression is a common mental health problem around the world and places a large burden on economies and on the well-being, and hence productivity, of individuals. Its early diagnosis and treatment are critical to reduce costs and even save lives. One key aspect of achieving that goal is to use voice technologies to monitor depression remotely and relatively inexpensively using automated agents. Although there have been efforts to automatically assess depression levels from audiovisual features, the use of transcriptions along with acoustic features has emerged as a more recent research direction. Moreover, difficulty in data collection and the limited amounts of data available for research are challenges hampering the success of these algorithms. One of the novel contributions of this paper is to exploit databases from multiple languages for feature selection. Since a large number of features can be extracted from speech, and given the small amounts of training data available, effective feature selection is critical for success. Our proposed multi-lingual method was effective at selecting better features and significantly improved depression assessment accuracy. We also use text-based features for assessment and propose a novel strategy to fuse the text- and speech-based classifiers, which further boosted performance.

6. Conference Object (Metadata only)
Cross-lingual speaker adaptation for statistical speech synthesis using limited data (Interspeech, 2016)
Sarfjoo, Seyyed Saeed; Demiroğlu, Cenk (Electrical & Electronics Engineering)
Cross-lingual speaker adaptation with limited adaptation data has many applications, such as in speech-to-speech translation systems. Here, we focus on cross-lingual adaptation for statistical speech synthesis (SSS) systems using limited adaptation data. To that end, we propose two techniques exploiting a bilingual Turkish-English speech database that we collected. In the first approach, speaker-specific state mapping is proposed for cross-lingual adaptation; it performed significantly better than the baseline state-mapping algorithm in adapting the excitation parameter in both objective and subjective tests. In the second approach, eigenvoice adaptation is performed in the input language and the result is used to estimate the eigenvoice weights in the output language using weighted linear regression. The second approach performed significantly better than the baseline system in adapting the spectral envelope parameters in both objective and subjective tests.

7. Conference Object (Metadata only)
OCR-aided person annotation and label propagation for speaker modeling in TV shows (IEEE, 2016)
Budnik, M.; Besacier, L.; Khodabakhsh, Ali; Demiroğlu, Cenk (Electrical & Electronics Engineering)
In this paper, we present an approach for minimizing the human effort spent on manual speaker annotation. Label propagation is used at each iteration of an active learning cycle. More precisely, a selection strategy is proposed for choosing the most suitable speech track to be labeled; four different selection strategies are evaluated. All tracks in the corresponding cluster, obtained with agglomerative clustering, are gathered in order to propagate the human annotations. To further reduce the manual labor required, an optical character recognition (OCR) system is used to bootstrap annotations. At each step of the cycle, the annotations are used to build speaker models, and the quality of the generated models is evaluated with an i-vector based speaker identification system. The presented approach shows promising results on the REPERE corpus with a minimal amount of human annotation effort.
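The active-learning loop in entry 7 can be sketched as follows. This is a simplified reconstruction under stated assumptions: oracle stands in for the human annotator (the paper additionally bootstraps labels with OCR), the embeddings play the role of per-track i-vectors, and the "largest unlabeled cluster first" rule is just one plausible selection strategy, not necessarily one of the paper's four.

```python
# Sketch of label propagation over agglomerative clusters of speech-track
# embeddings (e.g., i-vectors). `oracle` is a placeholder for the human
# annotator; the paper's OCR bootstrapping step is omitted here.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def propagate_labels(embeddings, n_clusters, oracle, budget):
    """Query a few tracks, then spread each answer to the track's whole cluster."""
    clusters = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeddings)
    labels = np.full(len(embeddings), None, dtype=object)
    for _ in range(budget):
        unlabeled = [c for c in np.unique(clusters)
                     if labels[clusters == c][0] is None]
        if not unlabeled:
            break
        # Selection strategy (one plausible choice): largest unlabeled cluster.
        target = max(unlabeled, key=lambda c: int(np.sum(clusters == c)))
        rep = int(np.flatnonzero(clusters == target)[0])  # representative track
        labels[clusters == target] = oracle(rep)          # propagate annotation
    return labels
```

Each pass through the loop yields more labeled tracks, from which speaker models can be rebuilt and re-scored, mirroring the paper's i-vector based evaluation step.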
8. Conference Object (Metadata only)
Finding relevant features for statistical speech synthesis adaptation (European Language Resources Association, 2014-05)
Bruneau, P.; Parisot, O.; Mohammadi, Amir; Demiroğlu, Cenk; Ghoniem, M.; Tamisier, T. (Electrical & Electronics Engineering)
Statistical speech synthesis (SSS) models typically lie in a very high-dimensional space. They can be used to enable speech synthesis on digital devices from only a few sentences of user input. However, the adaptation algorithms for such weakly trained models suffer from the high dimensionality of the feature space. Because creating new voices is easy with the SSS approach, thousands of voices can be trained, and a nearest-neighbor algorithm can be used to obtain better speaker similarity in these limited-data cases. Nearest-neighbor methods require good distance measures that correlate well with human perception. This paper investigates the problem of finding good low-cost metrics, i.e., simple functions of feature values that correlate with objective signal quality metrics. To this end, we use high-dimensional data visualization and dimensionality reduction techniques. Data mining principles are also applied to formulate a tractable view of the problem and to propose tentative solutions. Our results are promising: the performance index improves by 36% with respect to a naive solution while using only 0.77% of the features. Perspectives on new adaptation algorithms and on tighter integration of data mining and visualization principles are given in closing. (A nearest-neighbor selection sketch follows entry 9 below.)

9. Review (Open Access)
Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources (Springer, 2024-02-12)
Barakat, Huda Mohammed Mohammed; Turk, O.; Demiroğlu, Cenk (Electrical & Electronics Engineering)
Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models can generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, mere high-quality speech generation is no longer sufficient. Present-day TTS models must also excel at producing expressive speech that can convey various speaking styles and emotions, akin to human speech. Consequently, researchers have concentrated their efforts on developing more efficient models for expressive speech synthesis in recent years. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last five years, with a particular emphasis on approaches based on deep learning. We offer a comprehensive classification scheme for these models and provide concise descriptions of the models falling into each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies employed to tackle them, as documented in the literature. In Section 8, we pinpoint research gaps in this field that necessitate further exploration. Our objective is to give an all-encompassing overview of this active research area and to offer guidance to interested researchers and future endeavors in this field.
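Entry 8's nearest-neighbor idea, picking an existing well-trained voice close to the target speaker using a cheap distance over a small feature subset, might look like the sketch below. The Euclidean metric and the mean-pooled features are placeholder assumptions; the paper's point is precisely that the right low-cost metric and the right sub-1% feature subset must be discovered empirically.

```python
# Sketch of nearest-neighbor voice selection over a bank of pre-trained
# synthesis voices. The distance and the feature subset are assumptions;
# finding a perceptually valid low-cost metric is the paper's actual problem.
import numpy as np

def nearest_voice(target_feats, voice_bank, feature_idx):
    """Return the stored voice whose reduced mean feature vector is closest
    to the target speaker's frames under a plain Euclidean distance."""
    target = target_feats[:, feature_idx].mean(axis=0)
    distances = {name: float(np.linalg.norm(mean_vec[feature_idx] - target))
                 for name, mean_vec in voice_bank.items()}
    return min(distances, key=distances.get)
```

Here voice_bank maps each voice name to its mean feature vector, and feature_idx holds the indices of the retained features; both are stand-ins for whatever representation the selection procedure actually operates on.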
10. Article (Metadata only)
Spoofing voice verification systems with statistical speech synthesis using limited adaptation data (Elsevier, 2017-03)
Khodabakhsh, Ali; Mohammadi, Amir; Demiroğlu, Cenk (Electrical & Electronics Engineering)
State-of-the-art speaker verification systems are vulnerable to spoofing attacks that use speech synthesis. To address the issue, high-performance synthetic speech detectors (SSDs) for known attack methods have been proposed recently. Here, instead of developing new detectors, we investigate new attack strategies: studying attacks that can spoof the verification system while remaining difficult to detect is expected to increase the security of voice verification systems by enabling the development of better detectors. First, we investigated the vulnerability of an i-vector based verification system to attacks using statistical speech synthesis (SSS), with a particular focus on the case where the attacker has only a very limited amount of data from the target speaker. Even with a single adaptation utterance, the false alarm rate was found to be 23%. Still, SSS-generated speech is easy to detect (Wu et al., 2015a, 2015b), which dramatically reduces its effectiveness. For more effective attacks with limited data, we propose a hybrid statistical/concatenative synthesis approach and show that it significantly increases the false alarm rate of the verification system compared to the baseline SSS method. Moreover, the proposed hybrid synthesis makes synthetic speech harder to detect than SSS even when only a very limited amount of original speech recordings is available to the attacker. To further increase the effectiveness of the attacks, we propose a linear regression method that transforms synthetic features into more natural features. Although the regression approach is more effective at spoofing the detectors, it is not as effective as the hybrid synthesis approach at spoofing the verification system. An interpolation approach is proposed to combine the linear regression and hybrid synthesis methods, which is shown to provide the best spoofing performance in most cases.
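The linear-regression attack in entry 10 amounts to learning an affine map from synthetic to natural acoustic features on paired frames and applying it to new synthetic output. A minimal sketch, assuming frame pairing and alignment have already been handled elsewhere:

```python
# Sketch of the synthetic-to-natural feature mapping: fit an affine map
# W x + b on paired (synthetic, natural) frames, then apply it to new
# synthetic frames. Frame alignment is assumed to be done upstream.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_naturalizer(synthetic_frames, natural_frames):
    """Learn W, b minimizing ||natural - (synthetic @ W.T + b)||^2."""
    reg = LinearRegression()
    reg.fit(synthetic_frames, natural_frames)   # multi-output regression
    return reg

def naturalize(reg, synthetic_frames):
    """Shift synthetic feature frames toward the natural feature space."""
    return reg.predict(synthetic_frames)
```

Per the abstract, this mapping made synthetic speech harder for detectors to flag, while the hybrid statistical/concatenative approach remained the stronger attack on the verifier itself, which motivates the proposed interpolation of the two methods.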