Kooperativer Bibliotheksverbund Berlin-Brandenburg (KOBV)

Filter
  • Haeb-Umbach, Reinhold  (13)
  • 1
    Language: English
    In: Computer Speech & Language, November 2017, Vol.46, pp.374-385
    Description: Acoustic beamforming can greatly improve the performance of Automatic Speech Recognition (ASR) and speech enhancement systems when multiple channels are available. We recently proposed a way to support the model-based Generalized Eigenvalue beamforming operation with a powerful neural network for spectral mask estimation. The enhancement system has a number of desirable properties. In particular, no assumptions need to be made about the nature of the acoustic transfer function (e.g., being anechoic), nor does the array configuration need to be known. While the system was originally developed to enhance speech in noisy environments, we show in this article that it is also effective in suppressing reverberation, thus leading to a generic trainable multi-channel speech enhancement system for robust speech processing. To support this claim, we consider two distinct datasets: the CHiME 3 challenge, which features challenging real-world noise distortions, and the challenge, which focuses on distortions caused by reverberation. We evaluate the system with respect to both a speech enhancement and a recognition task. For the first task, we propose a new way to cope with the distortions introduced by the Generalized Eigenvalue beamformer by renormalizing the target energy for each frequency bin, and measure its effectiveness in terms of the PESQ score. For the latter, we feed the enhanced signal to a strong DNN back-end and achieve state-of-the-art ASR results on both datasets. We further experiment with different network architectures for spectral mask estimation: one small feed-forward network with only one hidden layer, one Convolutional Neural Network and one bi-directional Long Short-Term Memory network, showing that even a small network is capable of delivering significant performance improvements.
    Keywords: Robust Speech Recognition ; Acoustic Beamforming ; Multi-Channel Speech Enhancement ; Deep Neural Network ; Engineering ; Computer Science
    ISSN: 0885-2308
    E-ISSN: 1095-8363
  • 2
    Description: This report describes the computation of gradients by algorithmic differentiation for statistically optimum beamforming operations. In particular, the differentiation of complex-valued functions is a key component of this approach. Therefore, the real-valued algorithmic differentiation is extended via the complex-valued chain rule. In addition to the basic mathematical operations, the derivative of the eigenvalue problem with complex-valued eigenvectors is one of the key results of this report. The potential of this approach is shown with experimental results on the CHiME-3 challenge database. There, the beamforming task is used as a front-end for an ASR system. With the developed derivatives, a joint optimization of a speech enhancement and speech recognition system w.r.t. the recognition optimization criterion is possible. Comment: Technical Report
    Keywords: Computer Science - Numerical Analysis ; Computer Science - Computational Engineering, Finance, And Science
    Source: Cornell University
  • 3
    Language: English
    In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), December 2015, pp.444-451
    Description: We present a new beamformer front-end for Automatic Speech Recognition and apply it to the 3rd CHiME Speech Separation and Recognition Challenge. Without any further modification of the back-end, we achieve a 53% relative reduction of the word error rate over the best baseline enhancement system for the relevant test data set. Our approach leverages the power of a bi-directional Long Short-Term Memory network to robustly estimate soft masks for a subsequent beamforming step. The utilized Generalized Eigenvalue beamforming operation with an optional Blind Analytic Normalization does not rely on a Direction-of-Arrival estimate and can cope with multi-path sound propagation, while at the same time introducing only very limited speech distortions. Our quite simple setup exploits the possibilities provided by simulated training data while still being able to generalize well to the fairly different real data. Finally, combining our front-end with data augmentation and another language model yields a reduction of the word error rate of nearly 64% on the real data test set.
    Keywords: Speech ; Training ; Speech Recognition ; Array Signal Processing ; Estimation ; Artificial Neural Networks ; Robust Speech Recognition ; Beamforming ; Feature Enhancement ; Neural Networks
    Source: IEEE Conference Publications
    Source: IEEE Xplore
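The "relative reduction of the word error rate" quoted in record 3 is a ratio against the baseline error rate, not a difference in percentage points. A minimal sketch of that arithmetic follows; the function name and the example values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative word error rate reduction as a fraction.

    Both arguments are word error rates in the same unit (for example percent).
    A result of 0.5 means the system halves the baseline error rate.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical values for illustration only: a baseline at 20% WER
# and an improved system at 10% WER.
reduction = relative_wer_reduction(20.0, 10.0)
print(f"{reduction:.0%} relative reduction")
```

With these illustrative inputs the absolute improvement is 10 percentage points, while the relative reduction is one half of the baseline error.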
  • 4
    Language: English
    In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, December 2013, pp.458-463
    Description: In this paper we present an algorithm for the unsupervised segmentation of a character or phoneme lattice into words. Using a lattice at the input rather than a single string accounts for the uncertainty of the character/phoneme recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model is known. Recently a Weighted Finite State Transducer (WFST) based approach has been published which we show to suffer from an issue: language model probabilities of known words are computed incorrectly. Fixing this issue leads to greatly improved precision and recall rates, however at the cost of increased computational complexity. It is therefore practical only for single input strings. To allow for a lattice input and thus for errors in the character/phoneme recognizer, we propose a computationally efficient suboptimal two-stage approach, which is shown to significantly improve the word segmentation performance compared to the earlier WFST approach.
    Keywords: Lattices ; Probability ; Computational Modeling ; Context ; Transducers ; Acoustics ; Speech ; Automatic Speech Recognition ; Unsupervised Learning
    Source: IEEE Conference Publications
    Source: IEEE Xplore
  • 5
    Language: English
    In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp.5053-5057
    Description: The parametric Bayesian Feature Enhancement (BFE) and a data-driven Denoising Autoencoder (DA) both bring performance gains in severe single-channel speech recognition conditions. The first can be adjusted to different conditions by an appropriate parameter setting, while the latter needs to be trained on conditions similar to the ones expected at decoding time, making it vulnerable to a mismatch between training and test conditions. We use a DNN backend and study reverberant ASR under three types of mismatch conditions: different room reverberation times, different speaker-to-microphone distances and the difference between artificially reverberated data and recordings in a reverberant environment. We show that for these mismatch conditions BFE can provide the targets for a DA. This unsupervised adaptation provides a performance gain over the direct use of BFE and even makes it possible to compensate for the mismatch between real and simulated reverberant data.
    Keywords: Speech ; Speech Recognition ; Training ; Reverberation ; Adaptation Models ; Noise Reduction ; Robust Speech Recognition ; Deep Neuronal Networks ; Feature Enhancement ; Denoising Autoencoder ; Engineering
    ISSN: 1520-6149
    E-ISSN: 2379-190X
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
  • 6
    Language: English
    In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp.196-200
    Description: We present a neural network based approach to acoustic beamforming. The network is used to estimate spectral masks from which the Cross-Power Spectral Density matrices of speech and noise are estimated, which in turn are used to compute the beamformer coefficients. The network training is independent of the number and the geometric configuration of the microphones. We further show that it is possible to train the network on clean speech only, avoiding the need for stereo data with separated speech and noise. Two types of networks are evaluated: one small feed-forward network with only one hidden layer and one more elaborate bi-directional Long Short-Term Memory network. We compare our system with different parametric approaches to mask estimation and with different beamforming algorithms. We show that our system yields superior results, both in terms of perceptual speech quality and with respect to speech recognition error rate. The results for the simple feed-forward network are especially encouraging considering its low computational requirements.
    Keywords: Speech ; Estimation ; Training ; Array Signal Processing ; Acoustics ; Signal to Noise Ratio ; Neural Networks ; Robust Speech Recognition ; Acoustic Beam-Forming ; Feature Enhancement ; Deep Neural Network ; Engineering
    ISSN: 1520-6149
    E-ISSN: 2379-190X
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
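The processing chain described in record 6 (spectral masks, then cross-power spectral density matrices, then beamformer coefficients) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the masks are taken as given (in the paper they come from a neural network), the array layout `(frequency, frame, microphone)` and all function names are hypothetical, the Generalized Eigenvalue (GEV) beamformer mentioned in records 1 and 3 is used as the beamforming step, and any post-filter or normalization is omitted.

```python
import numpy as np


def psd_matrix(stft, mask):
    """Mask-weighted cross-power spectral density matrix per frequency bin.

    stft: complex array of shape (F, T, D): frequency bins, frames, microphones.
    mask: real array of shape (F, T) with values in [0, 1].
    Returns a complex array of shape (F, D, D).
    """
    weighted = stft * mask[..., None]
    # Sum over frames of mask * x x^H for every frequency bin.
    psd = np.einsum("ftd,fte->fde", weighted, stft.conj())
    norm = np.maximum(mask.sum(axis=1), 1e-10)
    return psd / norm[:, None, None]


def gev_beamformer(phi_speech, phi_noise):
    """Principal generalized eigenvector of (phi_speech, phi_noise) per bin."""
    num_bins, num_mics, _ = phi_speech.shape
    w = np.empty((num_bins, num_mics), dtype=complex)
    for f in range(num_bins):
        # Small diagonal loading keeps the noise matrix invertible.
        loading = 1e-10 * np.trace(phi_noise[f]).real / num_mics
        loaded = phi_noise[f] + loading * np.eye(num_mics)
        eigvals, eigvecs = np.linalg.eig(np.linalg.solve(loaded, phi_speech[f]))
        w[f] = eigvecs[:, np.argmax(eigvals.real)]
    return w


def apply_beamformer(w, stft):
    """Compute w(f)^H x(f, t) for every time-frequency point."""
    return np.einsum("fd,ftd->ft", w.conj(), stft)
```

Because the mask statistics are accumulated over frames independently for each microphone pair, nothing in this sketch depends on the number or placement of the microphones, which mirrors the property stated in the abstract.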
  • 7
    Language: English
    In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp.4057-4061
    Description: In this paper we present an algorithm for the unsupervised segmentation of a lattice produced by a phoneme recognizer into words. Using a lattice rather than a single phoneme string accounts for the uncertainty of the recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model (LM) is known. We propose a computationally efficient iterative approach, which alternates between the following two steps: First, the most probable string is extracted from the lattice using a phoneme LM learned on the segmentation result of the previous iteration. Second, word segmentation is performed on the extracted string using a word and phoneme LM which is learned alongside the new segmentation. We present results on lattices produced by a phoneme recognizer on the WSJ-CAM0 dataset. We show that our approach delivers better segmentation performance than an earlier approach found in the literature, in particular for higher-order language models.
    Keywords: Lattices ; Hidden Markov Models ; Acoustics ; Computational Modeling ; Speech ; Vocabulary ; Iterative Methods ; Automatic Speech Recognition ; Unsupervised Learning ; Word Segmentation ; Engineering
    ISBN: 9781479928927
    ISSN: 1520-6149
    E-ISSN: 2379-190X
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
  • 8
    Language: English
    In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), September 2018, pp.466-470
    Description: Signal dereverberation using the weighted prediction error (WPE) method has been proven to be an effective means to raise the accuracy of far-field speech recognition. But in its original formulation, WPE requires multiple iterations over a sufficiently long utterance, rendering it unsuitable for online low-latency applications. Recently, two methods have been proposed to overcome this limitation. One utilizes a neural network to estimate the power spectral density (PSD) of the target signal and works in a block-online fashion. The other method relies on a rather simple PSD estimation which smooths the observed PSD and utilizes a recursive formulation which enables it to work on a frame-by-frame basis. In this paper, we integrate a deep neural network (DNN) based estimator into the recursive frame-online formulation. We evaluate the performance of the recursive system with different PSD estimators in comparison to the block-online and offline variants on two distinct corpora: the REVERB challenge data, where the signal is mainly deteriorated by reverberation, and a database which combines WSJ and VoiceHome to also consider (directed) noise sources. The results show that although smoothing works surprisingly well, the more sophisticated DNN based estimator shows promising improvements and narrows the performance gap between online and offline processing.
    Keywords: Reverberation ; Estimation ; Conferences ; Neural Networks ; Indexes ; Microphones ; Speech Recognition ; Online Speech Enhancement ; Dereverberation
    Source: IEEE Conference Publications
    Source: IEEE Xplore
  • 9
    Language: English
    In: 2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), October 2017, pp.1-6
    Description: Multi-channel speech enhancement algorithms rely on a synchronous sampling of the microphone signals. This, however, cannot always be guaranteed, especially if the sensors are distributed in an environment. To avoid performance degradation the sampling rate offset needs to be estimated and compensated for. In this contribution we extend the recently proposed coherence drift based method in two important directions. First, the increasing phase shift in the short-time Fourier transform domain is estimated from the coherence drift in a Matched Filter-like fashion, where intermediate estimates are weighted by their instantaneous SNR. Second, an observed bias is removed by iterating between offset estimation and compensation by resampling a couple of times. The effectiveness of the proposed method is demonstrated by speech recognition results on the output of a beamformer with and without sampling rate offset compensation between the input channels. We compare MVDR and maximum-SNR beamformers in reverberant environments and further show that both benefit from a novel phase normalization, which we also propose in this contribution.
    Keywords: Coherence ; Delays ; Array Signal Processing ; Estimation ; Signal to Noise Ratio ; Synchronization ; Acoustics ; Engineering
    E-ISSN: 2473-3628
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
  • 10
    Language: English
    In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp.171-175
    Description: In this paper we show how a neural network for spectral mask estimation for an acoustic beamformer can be optimized by algorithmic differentiation. Using the beamformer output SNR as the objective function to maximize, the gradient is propagated through the beamformer all the way to the neural network which provides the clean speech and noise masks from which the beamformer coefficients are estimated by eigenvalue decomposition. A key theoretical result is the derivative of an eigenvalue problem involving complex-valued eigenvectors. Experimental results on the CHiME-3 challenge database demonstrate the effectiveness of the approach. The tools developed in this paper are a key component for an end-to-end optimization of speech enhancement and speech recognition.
    Keywords: Speech ; Artificial Neural Networks ; Array Signal Processing ; Acoustics ; Linear Programming ; Eigenvalues and Eigenfunctions ; Signal to Noise Ratio ; Acoustic Beamforming ; Deep Neural Network ; Complex-Valued Algorithmic Differentiation ; Generalized Eigenvalue Problem ; Engineering
    ISSN: 1520-6149
    E-ISSN: 2379-190X
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
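The objective and the eigenvalue problem named in record 10 can be written compactly. The notation below is an assumption introduced for illustration (it does not appear in the catalog record): for a single frequency bin, w is the vector of beamformer coefficients, and Φ_XX and Φ_NN are the cross-power spectral density matrices of speech and noise obtained from the estimated masks.

```latex
\mathrm{SNR}(\mathbf{w})
  = \frac{\mathbf{w}^{\mathsf{H}} \boldsymbol{\Phi}_{XX} \mathbf{w}}
         {\mathbf{w}^{\mathsf{H}} \boldsymbol{\Phi}_{NN} \mathbf{w}},
\qquad
\boldsymbol{\Phi}_{XX} \mathbf{w} = \lambda \, \boldsymbol{\Phi}_{NN} \mathbf{w}
```

The ratio on the left is a generalized Rayleigh quotient, so it is maximized by the eigenvector belonging to the largest generalized eigenvalue λ of the problem on the right. Propagating a gradient from the output SNR back to the mask-estimating network therefore requires differentiating through this eigenvalue decomposition, which is the theoretical result the abstract highlights.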