A Discriminative Unsupervised Method For Speaker Recognition Using Deep Learning
Muhammad Muneeb Saleem, Volkswagen ERL
John H.L. Hansen, University of Texas at Dallas

A Gaussian mixture model (GMM) is used in state-of-the-art i-Vector based speaker recognition systems for acoustic space division and prediction. The main purpose of such acoustic space clustering is to constrain the acoustic comparison to small regions where between-speaker differences are the main source of variability. In this study, we investigate two unsupervised discriminative approaches as alternatives to the universal background model (UBM) for feature space clustering and prediction. In our first approach, a deep neural network (DNN) is used directly to estimate the mixture-wise posterior probabilities of the UBM. Here the UBM is still used for acoustic space division, but the prediction is performed by the DNN. The motivation for using a DNN for this prediction is its ability to learn invariant higher-level features discriminatively. In our second approach, a stacked restricted Boltzmann machine (RBM) is used instead of the UBM for both acoustic space clustering and prediction. In this case, the clustering is based on the higher-level distributed representations of the speech features learned by the stacked RBM. The stand-alone system using our first approach (UBM-DNN) improves overall performance by 14% in minDCF when evaluated on NIST SRE 2010.
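To make the "prediction" step concrete, the following is a minimal sketch of how mixture-wise posterior probabilities can be computed from a diagonal-covariance GMM acting as a UBM. These per-frame posteriors are the quantities that, in the first approach, a DNN would be trained to estimate. The function name `gmm_posteriors`, the two-component toy model, and the example frames are illustrative assumptions and are not taken from the study; no DNN or RBM is implemented here.

```python
import numpy as np


def gmm_posteriors(frames, weights, means, variances):
    """Mixture-wise posteriors p(c | x_t) for a diagonal-covariance GMM.

    frames:    (T, D) acoustic feature vectors
    weights:   (C,)   mixture weights, summing to 1
    means:     (C, D) component means
    variances: (C, D) component diagonal variances
    returns:   (T, C) posteriors; each row sums to 1
    """
    diff = frames[:, None, :] - means[None, :, :]  # (T, C, D)
    # Log-density of each frame under each diagonal Gaussian component.
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances[None, :, :], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )
    log_joint = np.log(weights)[None, :] + log_gauss  # log w_c + log N(x_t)
    # Subtract the row-wise maximum before exponentiating for stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


# Hypothetical toy UBM with two components in a 2-D feature space.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))

# Three illustrative frames: near component 0, near component 1, and midway.
frames = np.array([[0.1, -0.2], [4.8, 5.1], [2.5, 2.5]])

posteriors = gmm_posteriors(frames, weights, means, variances)

# Zero-order statistics: soft frame counts per mixture component.
zero_order = posteriors.sum(axis=0)
```

In an i-Vector pipeline these posteriors weight each frame's contribution to the sufficient statistics, so replacing the GMM-derived values with DNN outputs changes the prediction while leaving the partition of the acoustic space defined by the UBM.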