Hybrid Context Dependent CD-DNN-HMM Keyword Spotting (KWS) In Speech Conversations
Vivek Tyagi, Xerox Research Center India

We present a detailed analysis of the phoneme recognition performance of two acoustic models (AMs) on the WSJ speech corpus: a context-dependent tied-state triphone Gaussian mixture model hidden Markov model (CD-GMM-HMM), a state-of-the-art large AM, and a context-dependent deep neural network (CD-DNN-HMM) AM with four hidden layers. Phoneme recognition experiments, performed on an independent two-hour test set using Viterbi decoding with a bigram phoneme language model, show a relative improvement of $33.3\%$ for our CD-DNN acoustic model. We then present a filler-based hybrid DNN-HMM keyword spotting (KWS) system which, to our knowledge, is the first KWS architecture to use a context-dependent DNN and HMM. In our experiments, a strong CD-GMM-HMM KWS baseline provides $79.0\%$ correct detection accuracy at a false alarm (FA) rate of $5.0$ FA/Hr, whereas the proposed hybrid CD-DNN-HMM KWS achieves $88.5\%$ correct detection accuracy at $5.0$ FA/Hr, a relative improvement of $43.3\%$. We provide further analysis and conclude that hybrid CD-DNN-HMM KWS is an attractive alternative for near-real-time KWS applications, offering high detection accuracy and a low FA rate.