Ghahramani and Hinton used a version of the Mixture of Competing Experts architecture, a modular architecture that can divide a problem into different regimes and model each one separately.
Each expert network was linear but used recurrent connections (a linear Kalman filter). The nonlinear gating network computed coefficients that were used to combine the experts' outputs into a single global prediction. The network was thereby able to handle very different regimes of the input, such as startup conditions versus normal operating conditions. A probabilistic interpretation of the outputs allowed the network to indicate its own certainty that it could account for each piece of data, and thereby detect abnormal operating conditions.
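The combination step described above can be illustrated with a minimal sketch. It is not Ghahramani and Hinton's implementation: the experts here are static linear maps rather than recurrent Kalman filters, the gate is assumed to be a softmax over linear scores, and all names and parameter values (`mixture_predict`, `experts`, `gate_params`) are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw gate scores into non-negative coefficients that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear_map(weights, bias, x):
    """Affine function of the input vector x."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def mixture_predict(experts, gate_params, x):
    """Combine expert outputs using gate coefficients (assumed softmax gate)."""
    outputs = [linear_map(w, b, x) for (w, b) in experts]
    scores = [linear_map(v, c, x) for (v, c) in gate_params]
    coeffs = softmax(scores)  # the nonlinearity lives in the gate
    prediction = sum(g * y for g, y in zip(coeffs, outputs))
    return prediction, coeffs, outputs


# Hypothetical parameters: expert 0 models one regime, expert 1 the other.
experts = [([2.0], 0.0), ([0.5], 1.0)]
# The gate favours expert 0 for negative inputs and expert 1 for positive ones.
gate_params = [([-4.0], 0.0), ([4.0], 0.0)]

startup_pred, startup_coeffs, _ = mixture_predict(experts, gate_params, [-1.0])
normal_pred, normal_coeffs, _ = mixture_predict(experts, gate_params, [1.0])
```

With these made-up parameters the gate assigns almost all of the weight to one expert in each regime, so the global prediction follows whichever expert the gate selects. The coefficients also sum to one, which is what allows them to be read as probabilities.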
Figure 1. Mozer's architecture for speaker-independent recognition.
Figure 1 shows the basic processing stages that occur when an acoustic signal is presented for speaker-independent recognition: segmentation (locating where the utterance begins and ends in the continuous input stream), digital filtering for time normalization and acoustic feature extraction, and a neural network recognizer.
A huge amount of the work goes into preprocessing and data collection.
"Speaker-independent recognition with a vocabulary of a dozen alternatives achieves recognition accuracies greater than 95% in actual usage" (Mozer, 1996).