Browsing by Author "Mohammadi, Amir"
Now showing 1 - 7 of 7
- Results Per Page
- Sort Options
ArticlePublication Metadata only Eigenvoice speaker adaptation with minimal data for statistical speech synthesis systems using a MAP approach and nearest-neighbors(IEEE, 2014-12) Mohammadi, Amir; Sarfjoo, Seyyed Saeed; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Mohammadi, Amir; Sarfjoo, Seyyed SaeedStatistical speech synthesis (SSS) systems have the ability to adapt to a target speaker with a couple of minutes of adaptation data. Developing adaptation algorithms to further reduce the number of adaptation utterances to a few seconds of data can have substantial effect on the deployment of the technology in real-life applications such as consumer electronics devices. The traditional way to achieve such rapid adaptation is the eigenvoice technique which works well in speech recognition but known to generate perceptual artifacts in statistical speech synthesis. Here, we propose three methods to alleviate the quality problems of the baseline eigenvoice adaptation algorithm while allowing speaker adaptation with minimal data. Our first method is based on using a Bayesian eigenvoice approach for constraining the adaptation algorithm to move in realistic directions in the speaker space to reduce artifacts. Our second method is based on finding pre-trained reference speakers that are close to the target speaker and utilizing only those reference speaker models in a second eigenvoice adaptation iteration. Both techniques performed significantly better than the baseline eigenvoice method in the objective tests. Similarly, they both improved the speech quality in subjective tests compared to the baseline eigenvoice method. In the third method, tandem use of the proposed eigenvoice method with a state-of-the-art linear regression based adaptation technique is found to improve adaptation of excitation features.Conference ObjectPublication Metadata only Finding relevant features for statistical speech synthesis adaptation(European Language Resources Association, 2014-05) Bruneau, P.; Parisot, O.; Mohammadi, Amir; Demiroğlu, Cenk; Ghoniem, M.; Tamisier, T.; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Mohammadi, AmirStatistical speech synthesis (SSS) models typically lie in a very high-dimensional space. They can be used to allow speech synthesis on digital devices, using only few sentences of input by the user. However, the adaptation algorithms of such weakly trained models suffer from the high dimensionality of the feature space. Because creating new voices is easy with the SSS approach, thousands of voices can be trained and a nearest-neighbor algorithm can be used to obtain better speaker similarity in those limited-data cases. Nearest-neighbor methods require good distance measures that correlate well with human perception. This paper investigates the problem of finding good low-cost metrics, i.e. simple functions of feature values that map with objective signal quality metrics. To this aim, we use high-dimensional data visualization and dimensionality reduction techniques. Data mining principles are also applied to formulate a tractable view of the problem, and propose tentative solutions. With a performance index improved by 36% w.r.t. a naive solution, while using only 0.77% of the respective amount of features, our results are promising. Perspectives on new adaptation algorithms, and tighter integration of data mining and visualization principles are eventually given.Conference ObjectPublication Metadata only Hybrid nearest-neighbor/cluster adaptive training for rapid speaker adaptation in statistical speech synthesis systems(International Speech Communication Association, 2013) Mohammadi, Amir; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Mohammadi, AmirStatistical speech synthesis (SSS) approach has become one of the most popular methods in the speech synthesis field. An advantage of the SSS approach is the ability to adapt to a target speaker with a couple of minutes of adaptation data. However, many applications, especially in consumer electronics, require adaptation with only a few seconds of data which can be done using eigenvoice adaptation techniques. Although such techniques work well in speech recognition, they are known to generate perceptual artifacts in statistical speech synthesis. Here, we propose two methods to both alleviate those quality problems and improve the speaker similarity obtained with the baseline eigenvoice adaptation algorithm. Our first method is based on using a Bayesian approach for constraining the eigenvoice adaptation algorithm to move in realistic directions in the speaker space to reduce artifacts. Our second method is based on finding a reference speaker that is close to the target speaker, and using that reference speaker as the seed model in a second eigenvoice adaptation step. Both techniques performed significantly better than the baseline eigenvoice method in the subjective quality and similarity tests.Conference ObjectPublication Metadata only Nearest neighbor approach in speaker adaptation for HMM-based speech synthesis(IEEE, 2013) Mohammadi, Amir; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Mohammadi, AmirStatistical speech synthesis (SSS) approach has become one of the most popular and successful methods in the speech synthesis field. Smooth speech transitions, without the spurious errors that are observed in unit selection systems, can be generated with the SSS approach. Another advantage is the ability to adapt to a target speaker with a couple of minutes of adaptation data. However, many applications, especially in consumer electronics, require adaptation with only a few adaptation utterances. Here, we propose a rapid adaptation technique that first attempt to select a reference model that is close to the target speaker given a distance measure. Then, as opposed to adapting to target speaker from an average model, as typically done in most systems, adaptation is performed from the new reference model. The proposed system significantly outperformed a state-of-the-art baseline system both in objective and subjective tests especially only when one utterance is available for adaptation.Master ThesisPublication Metadata only Speaker adaptation with minimal data in statistical speech synthesis systems(2014-09) Mohammadi, Amir; Demiroğlu, Cenk; Demiroğlu, Cenk; Yaralıoğlu, Göksenin; Sözer, Hasan; Department of Electrical and Electronics Engineering; Mohammadi, AmirStatistical speech synthesis (SSS) systems have the ability to adapt to a target speaker with a couple of minutes of adaptation data. Developing adaptation algorithms to further reduce the number of adaptation utterances to a few seconds of data can have substantial effect on the deployment of the technology in real life applications such as consumer electronics devices. The traditional way to achieve such rapid adaptation is the eigenvoice technique which works well in speech recognition but known to generate perceptual artifacts in statistical speech synthesis. Here, we propose three methods to both alleviate the quality problems of the baseline eigenvoice adaptation algorithm while allowing speaker adaptation with minimal data. Our first method is based on using a Bayesian eigenvoice approach for constraining the adaptation algorithm to move in realistic directions in the speaker space to reduce artifacts. Our second method is based on finding pre-trained reference speakers that are close to the target speaker and utilizing only those reference speaker models in a second eigenvoice adaptation iteration. Both techniques performed significantly better than the baseline eigenvoice method in the objective tests. Similarly, they both improved the speech quality in subjective tests compared to the baseline eigenvoice method. In the third method, tandem use of the proposed eigenvoice method with a state-of-the-art linear regression based adaptation technique is found to improve adaptation of excitation features.ArticlePublication Metadata only Spoofing attacks to i-vector based voice verification systems using statistical speech synthesis with additive noise and countermeasure(IEEE, 2016) Özbay, Mustafa Caner; Khodabakhsh, Ali; Mohammadi, Amir; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Özbay, Mustafa Caner; Khodabakhsh, Ali; Mohammadi, AmirEven though improvements in the speaker verification (SV) technology with i-vectors increased their real-life deployment, their vulnerability to spoofing attacks is a major concern. Here, we investigated the effectiveness of spoofing attacks with statistical speech synthesis systems using limited amount of adaptation data and additive noise. Experiment results show that effective spoofing is possible using limited adaptation data. Moreover, the attacks get substantially more effective when noise is intentionally added to synthetic speech. Training the SV system with matched noise conditions does not alleviate the problem. We propose a synthetic speech detector (SSD) that uses session differences in i-vectors for counterspoofing. The proposed SSD had less than 0.5% total error rate in most cases for the matched noise conditions. For the mismatched noise conditions, missed detection rate further decreased but total error increased which indicates that some calibration is needed for mismatched noise conditions.ArticlePublication Metadata only Spoofing voice verification systems with statistical speech synthesis using limited adaptation data(Elsevier, 2017-03) Khodabakhsh, Ali; Mohammadi, Amir; Demiroğlu, Cenk; Electrical & Electronics Engineering; DEMİROĞLU, Cenk; Khodabakhsh, Ali; Mohammadi, AmirState-of-the-art speaker verification systems are vulnerable to spoofing attacks using speech synthesis. To solve the issue, high-performance synthetic speech detectors (SSDs) for attack methods have been proposed recently. Here, as opposed to developing new detectors, we investigate new attack strategies. Investigating new techniques that are specifically tailored for spoofing attacks that can spoof the voice verification system and are difficult to detect is expected to increase the security of voice verification systems by enabling the development of better detectors. First, we investigated the vulnerability of an i-vector based verification system to attacks using statistical speech synthesis (SSS), with a particular focus on the case where the attacker has only a very limited amount of data from the target speaker. Even with a single adaptation utterance, the false alarm rate was found to be 23%. Still, SSS-generated speech is easy to detect (Wu et al., 2015a, 2015b), which dramatically reduces its effectiveness. For more effective attacks with limited data, we propose a hybrid statistical/concatenative synthesis approach and show that hybrid synthesis significantly increases the false alarm rate in the verification system compared to the baseline SSS method. Moreover, proposed hybrid synthesis makes detecting synthetic speech more difficult compared to SSS even when very limited amount of original speech recordings are available to the attacker. To further increase the effectiveness of the attacks, we propose a linear regression method that transforms synthetic features into more natural features. Even though the regression approach is more effective at spoofing the detectors, it is not as effective as the hybrid synthesis approach in spoofing the verification system. An interpolation approach is proposed to combine the linear regression and hybrid synthesis methods, which is shown to provide the best spoofing performance in most cases.