Variance Reduction For Optimization In Speech Recognition
Jen-Tzung Chien, National Chiao Tung University
Pei-Wen Huang, National Chiao Tung University

Abstract:
A deep neural network (DNN) is trained by mini-batch optimization based on the stochastic gradient descent algorithm. Such stochastic learning suffers from instability in parameter updating and may easily become trapped in a local optimum. This study addresses the stability of stochastic learning by reducing the variance of gradients in the optimization procedure. We upgrade the optimization from stochastic dual coordinate ascent (SDCA) to accelerated SDCA without duality (dual-free ASDCA). This optimization incorporates the momentum method to accelerate the updating rule so that the variance of gradients is reduced. Using dual-free ASDCA, the optimization of the dual function in SDCA, which assumes a convex loss, is replaced by directly optimizing the primal function with respect to pseudo-dual parameters. The non-convex optimization in DNN training can therefore be handled and accelerated. Experimental results illustrate the reduction of training loss, gradient variance and word error rate achieved by the proposed optimization for DNN speech recognition.
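
The abstract describes a dual-free SDCA-style update with momentum acting on pseudo-dual parameters. The sketch below is only an illustration of that idea under simplifying assumptions: it applies the dual-free update to L2-regularized logistic regression (a convex stand-in for the DNN loss), and the momentum term on the primal update is an assumed simplification of the paper's accelerated variant. All function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def dual_free_sdca_momentum(X, y, lam=0.1, beta=0.2, mu=0.5, epochs=20, seed=0):
    """Sketch of a dual-free SDCA-style update with a momentum term.

    Maintains one pseudo-dual vector alpha[i] per training example so that
    w = (1 / (lam * n)) * sum_i alpha[i]; the update direction
    grad_i(w) + alpha[i] shrinks toward zero near the optimum, which is
    the variance-reduction effect the abstract refers to.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros((n, d))                 # pseudo-dual variables
    w = alpha.sum(axis=0) / (lam * n)        # primal weights (all zeros initially)
    v = np.zeros(d)                          # momentum buffer (assumed acceleration)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # gradient of the i-th logistic loss at the current primal point
            margin = y[i] * X[i].dot(w)
            grad_i = -y[i] * X[i] / (1.0 + np.exp(margin))
            # variance-reduced direction: vanishes as alpha[i] -> -grad_i
            delta = -beta * (grad_i + alpha[i])
            alpha[i] += delta
            # momentum-accelerated primal step consistent with the pseudo-dual update
            v = mu * v + delta / (lam * n)
            w += v
    return w

if __name__ == "__main__":
    # Toy binary classification data, just to exercise the update rule.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))
    w_true = rng.normal(size=10)
    y = np.sign(X.dot(w_true) + 0.1 * rng.normal(size=200))
    w_hat = dual_free_sdca_momentum(X, y)
    print(f"training accuracy: {np.mean(np.sign(X.dot(w_hat)) == y):.3f}")
```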