Kooperativer Bibliotheksverbund Berlin-Brandenburg

  • 1
    Language: English
    In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), December 2015, pp.444-451
    Description: We present a new beamformer front-end for Automatic Speech Recognition and apply it to the 3rd CHiME Speech Separation and Recognition Challenge. Without any further modification of the back-end, we achieve a 53% relative reduction of the word error rate over the best baseline enhancement system for the relevant test data set. Our approach leverages the power of a bi-directional Long Short-Term Memory network to robustly estimate soft masks for a subsequent beamforming step. The utilized Generalized Eigenvalue beamforming operation with an optional Blind Analytic Normalization does not rely on a Direction-of-Arrival estimate and can cope with multi-path sound propagation, while at the same time introducing only very limited speech distortions. Our quite simple setup exploits the possibilities provided by simulated training data while still being able to generalize well to the fairly different real data. Finally, combining our front-end with data augmentation and another language model yields a nearly 64% reduction of the word error rate on the real data test set.
    Keywords: Speech ; Training ; Speech Recognition ; Array Signal Processing ; Estimation ; Artificial Neural Networks ; Robust Speech Recognition ; Beamforming ; Feature Enhancement ; Neural Networks
    Source: IEEE Conference Publications
    Source: IEEE Xplore
  • 2
    Language: English
    In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, December 2013, pp.458-463
    Description: In this paper we present an algorithm for the unsupervised segmentation of a character or phoneme lattice into words. Using a lattice at the input rather than a single string accounts for the uncertainty of the character/phoneme recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model is known. Recently, a Weighted Finite State Transducer (WFST) based approach has been published which we show to suffer from an issue: language model probabilities of known words are computed incorrectly. Fixing this issue leads to greatly improved precision and recall rates, but at the cost of increased computational complexity, so the corrected approach is practical only for single input strings. To allow for a lattice input and thus for errors in the character/phoneme recognizer, we propose a computationally efficient suboptimal two-stage approach, which is shown to significantly improve the word segmentation performance compared to the earlier WFST approach.
    Keywords: Lattices ; Probability ; Computational Modeling ; Context ; Transducers ; Acoustics ; Speech ; Automatic Speech Recognition ; Unsupervised Learning
    Source: IEEE Conference Publications
    Source: IEEE Xplore
  • 3
    Language: English
    In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp.5053-5057
    Description: The parametric Bayesian Feature Enhancement (BFE) and a data-driven Denoising Autoencoder (DA) both bring performance gains in severe single-channel speech recognition conditions. The former can be adjusted to different conditions by an appropriate parameter setting, while the latter needs to be trained on conditions similar to the ones expected at decoding time, making it vulnerable to a mismatch between training and test conditions. We use a DNN backend and study reverberant ASR under three types of mismatch conditions: different room reverberation times, different speaker-to-microphone distances, and the difference between artificially reverberated data and recordings in a reverberant environment. We show that for these mismatch conditions BFE can provide the targets for a DA. This unsupervised adaptation provides a performance gain over the direct use of BFE and even makes it possible to compensate for the mismatch between real and simulated reverberant data.
    Keywords: Speech ; Speech Recognition ; Training ; Reverberation ; Adaptation Models ; Noise Reduction ; Robust Speech Recognition ; Deep Neuronal Networks ; Feature Enhancement ; Denoising Autoencoder ; Engineering
    ISSN: 1520-6149
    E-ISSN: 2379-190X
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
  • 4
    Language: English
    In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp.196-200
    Description: We present a neural network based approach to acoustic beamforming. The network is used to estimate spectral masks from which the Cross-Power Spectral Density matrices of speech and noise are estimated, which in turn are used to compute the beamformer coefficients. The network training is independent of the number and the geometric configuration of the microphones. We further show that it is possible to train the network on clean speech only, avoiding the need for stereo data with separated speech and noise. Two types of networks are evaluated: a small feed-forward network with only one hidden layer and a more elaborate bi-directional Long Short-Term Memory network. We compare our system with different parametric approaches to mask estimation and using different beamforming algorithms. We show that our system yields superior results, both in terms of perceptual speech quality and with respect to speech recognition error rate. The results for the simple feed-forward network are especially encouraging considering its low computational requirements.
    Keywords: Speech ; Estimation ; Training ; Array Signal Processing ; Acoustics ; Signal to Noise Ratio ; Neural Networks ; Robust Speech Recognition ; Acoustic Beam-Forming ; Feature Enhancement ; Deep Neural Network ; Engineering
    ISSN: 1520-6149
    E-ISSN: 2379-190X
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
  • 5
    Language: English
    In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp.4057-4061
    Description: In this paper we present an algorithm for the unsupervised segmentation of a lattice produced by a phoneme recognizer into words. Using a lattice rather than a single phoneme string accounts for the uncertainty of the recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model (LM) is known. We propose a computationally efficient iterative approach, which alternates between the following two steps: First, the most probable string is extracted from the lattice using a phoneme LM learned on the segmentation result of the previous iteration. Second, word segmentation is performed on the extracted string using a word and phoneme LM which is learned alongside the new segmentation. We present results on lattices produced by a phoneme recognizer on the WSJ-CAM0 dataset. We show that our approach delivers better segmentation performance than an earlier approach found in the literature, in particular for higher-order language models.
    Keywords: Lattices ; Hidden Markov Models ; Acoustics ; Computational Modeling ; Speech ; Vocabulary ; Iterative Methods ; Automatic Speech Recognition ; Unsupervised Learning ; Word Segmentation ; Engineering
    ISBN: 9781479928927
    ISSN: 1520-6149
    E-ISSN: 2379-190X
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
  • 6
    Language: English
    In: 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), September 2018, pp.466-470
    Description: Signal dereverberation using the weighted prediction error (WPE) method has been proven to be an effective means to raise the accuracy of far-field speech recognition. In its original formulation, however, WPE requires multiple iterations over a sufficiently long utterance, rendering it unsuitable for online low-latency applications. Recently, two methods have been proposed to overcome this limitation. One utilizes a neural network to estimate the power spectral density (PSD) of the target signal and works in a block-online fashion. The other method relies on a rather simple PSD estimation which smooths the observed PSD and utilizes a recursive formulation which enables it to work on a frame-by-frame basis. In this paper, we integrate a deep neural network (DNN) based estimator into the recursive frame-online formulation. We evaluate the performance of the recursive system with different PSD estimators in comparison to the block-online and offline variants on two distinct corpora: the REVERB challenge data, where the signal is mainly deteriorated by reverberation, and a database which combines WSJ and VoiceHome to also consider (directed) noise sources. The results show that although smoothing works surprisingly well, the more sophisticated DNN based estimator shows promising improvements and narrows the performance gap between online and offline processing.
    Keywords: Reverberation ; Estimation ; Conferences ; Neural Networks ; Indexes ; Microphones ; Speech Recognition ; Online Speech Enhancement ; Dereverberation
    Source: IEEE Conference Publications
    Source: IEEE Xplore
  • 7
    Language: English
    In: 2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), October 2017, pp.1-6
    Description: Multi-channel speech enhancement algorithms rely on a synchronous sampling of the microphone signals. This, however, cannot always be guaranteed, especially if the sensors are distributed in an environment. To avoid performance degradation the sampling rate offset needs to be estimated and compensated for. In this contribution we extend the recently proposed coherence drift based method in two important directions. First, the increasing phase shift in the short-time Fourier transform domain is estimated from the coherence drift in a Matched Filter-like fashion, where intermediate estimates are weighted by their instantaneous SNR. Second, an observed bias is removed by iterating between offset estimation and compensation by resampling a couple of times. The effectiveness of the proposed method is demonstrated by speech recognition results on the output of a beamformer with and without sampling rate offset compensation between the input channels. We compare MVDR and maximum-SNR beamformers in reverberant environments and further show that both benefit from a novel phase normalization, which we also propose in this contribution.
    Keywords: Coherence ; Delays ; Array Signal Processing ; Estimation ; Signal to Noise Ratio ; Synchronization ; Acoustics ; Engineering
    E-ISSN: 2473-3628
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
  • 8
    Language: English
    In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp.171-175
    Description: In this paper we show how a neural network for spectral mask estimation for an acoustic beamformer can be optimized by algorithmic differentiation. Using the beamformer output SNR as the objective function to maximize, the gradient is propagated through the beamformer all the way to the neural network which provides the clean speech and noise masks from which the beamformer coefficients are estimated by eigenvalue decomposition. A key theoretical result is the derivative of an eigenvalue problem involving complex-valued eigenvectors. Experimental results on the CHiME-3 challenge database demonstrate the effectiveness of the approach. The tools developed in this paper are a key component for an end-to-end optimization of speech enhancement and speech recognition.
    Keywords: Speech ; Artificial Neural Networks ; Array Signal Processing ; Acoustics ; Linear Programming ; Eigenvalues and Eigenfunctions ; Signal to Noise Ratio ; Acoustic Beamforming ; Deep Neural Network ; Complex-Valued Algorithmic Differentiation ; Generalized Eigenvalue Problem ; Engineering
    ISSN: 1520-6149
    E-ISSN: 2379-190X
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
  • 9
    Language: English
    In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp.5325-5329
    Description: This paper presents an end-to-end training approach for a beamformer-supported multi-channel ASR system. A neural network which estimates masks for a statistically optimum beamformer is jointly trained with a network for acoustic modeling. To update its parameters, we propagate the gradients from the acoustic model all the way through feature extraction and the complex-valued beamforming operation. Besides avoiding a mismatch between the front-end and the back-end, this approach also eliminates the need for stereo data, i.e., the parallel availability of clean and noisy versions of the signals. Instead, it can be trained with real noisy multi-channel data only. Also, because it relies on the signal statistics for beamforming, the approach makes no assumptions about the configuration of the microphone array. We further observe a performance gain through joint training in terms of word error rate in an evaluation of the system on the CHiME 4 dataset.
    Keywords: Acoustics ; Array Signal Processing ; Training ; Neural Networks ; Computational Modeling ; Microphones ; Noise Measurement ; Robust Asr ; Multi-Channel Asr ; Acoustic Beamforming ; Complex Backpropagation ; Engineering
    ISSN: 1520-6149
    E-ISSN: 2379-190X
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
  • 10
    Language: English
    In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp.6655-6659
    Description: Signal dereverberation using the Weighted Prediction Error (WPE) method has been proven to be an effective means to raise the accuracy of far-field speech recognition. First proposed as an iterative algorithm, it has been reformulated in follow-up works as a recursive least squares algorithm, which enabled its use in online applications. For this algorithm, the estimation of the power spectral density (PSD) of the anechoic signal plays an important role and strongly influences its performance. Recently, we showed that using a neural network PSD estimator leads to improved performance for online automatic speech recognition. This, however, comes at a price. To train the network, we require parallel data, i.e., utterances simultaneously available in clean and reverberated form. Here we propose to overcome this limitation by training the network jointly with the acoustic model of the speech recognizer. To be specific, the gradients computed from the cross-entropy loss between the target senone sequence and the acoustic model network output are backpropagated through the complex-valued dereverberation filter estimation to the neural network for PSD estimation. Evaluation on two databases demonstrates improved performance for online processing scenarios while imposing fewer requirements on the available training data and thus widening the range of applications.
    Keywords: Dereverberation ; Speech Enhancement ; Joint Optimization ; Robust Asr ; Engineering
    ISSN: 1520-6149
    E-ISSN: 2379-190X
    Source: IEEE Conference Publications
    Source: IEEE Xplore
    Source: IEEE Journals & Magazines 
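Several of the records above (1, 4, 8 and 9) describe the same processing chain: spectral masks weight the multi-channel observations to estimate speech and noise Cross-Power Spectral Density matrices, and a Generalized Eigenvalue (GEV) beamformer is computed from them. The following is a minimal sketch of that chain for a single frequency bin, not the authors' implementation. It assumes oracle masks instead of a neural network, uses synthetic data, omits the optional Blind Analytic Normalization, and all function and variable names (`psd_matrix`, `gev_beamformer`, `output_snr`) are hypothetical.

```python
import numpy as np


def psd_matrix(stft, mask):
    """Mask-weighted spatial covariance matrix for one frequency bin.

    stft: complex array of shape (channels, frames)
    mask: real array with values in [0, 1], shape (frames,)
    """
    weighted = stft * mask  # the mask broadcasts over the channel axis
    return weighted @ stft.conj().T / max(mask.sum(), 1e-10)


def gev_beamformer(phi_speech, phi_noise):
    """Principal generalized eigenvector of the pair (phi_speech, phi_noise)."""
    # Solve phi_noise^{-1} phi_speech w = lambda w and keep the largest lambda.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(phi_noise, phi_speech))
    return eigvecs[:, np.argmax(eigvals.real)]


def output_snr(w, phi_speech, phi_noise):
    """Ratio of speech to noise power at the beamformer output."""
    speech_power = np.real(w.conj() @ phi_speech @ w)
    noise_power = np.real(w.conj() @ phi_noise @ w)
    return speech_power / noise_power


# Synthetic single-bin example (assumption: 4 microphones, 200 frames,
# speech active only in the first 100 frames).
rng = np.random.default_rng(0)
channels, frames = 4, 200
steering = rng.standard_normal(channels) + 1j * rng.standard_normal(channels)
active = np.zeros(frames)
active[:100] = 1.0
source = (rng.standard_normal(frames) + 1j * rng.standard_normal(frames)) * active
noise = 0.5 * (rng.standard_normal((channels, frames))
               + 1j * rng.standard_normal((channels, frames)))
observed = np.outer(steering, source) + noise

# Oracle masks stand in for the network outputs described in the abstracts.
phi_speech = psd_matrix(observed, active)
phi_noise = psd_matrix(observed, 1.0 - active)

w_gev = gev_beamformer(phi_speech, phi_noise)
snr_gev = output_snr(w_gev, phi_speech, phi_noise)

# Reference: simply pick the first microphone.
w_ch0 = np.zeros(channels, dtype=complex)
w_ch0[0] = 1.0
snr_ch0 = output_snr(w_ch0, phi_speech, phi_noise)
```

Because the GEV solution maximizes the ratio of speech to noise power over all filter vectors, `snr_gev` cannot fall below `snr_ch0` for the same pair of matrices. Note that the sketch needs no microphone geometry or Direction-of-Arrival estimate, which is the property the abstracts emphasize.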