Improving Speech Recognition Using Limited Accent Diverse British English Training Data With Deep Neural Networks
Maryam Najafian, University of Texas at Dallas
Saeid Safavi, University of Hertfordshire
John H. L. Hansen, University of Texas at Dallas
Martin Russell, University of Birmingham, Birmingham

Abstract:
Despite recent advances in acoustic modelling, modelling speech from different speakers with varying accents, ages, and speaking styles remains a fundamental challenge for Deep Neural Network (DNN) based Automatic Speech Recognition (ASR). A relative gain of 46.85% is achieved in recognising the Accents of the British Isles corpus by applying a baseline DNN model rather than a Gaussian mixture model. However, even for powerful DNN based systems, accents remain a challenge. Our study shows that for a "difficult" accent such as Glaswegian, the word error rate is 78.9% higher in relative terms than that of the standard southern English accent. In this work we propose four multi-accent learning strategies and evaluate their effectiveness within a DNN based acoustic modelling framework. The training data are labelled using an i-vector based accent identification system with 78% accuracy. We present a novel study of how ASR performance is affected by increasing the accent diversity, the "difficulty", and the amount of supplementary training data. On average, a further ASR gain of 27.24% is achieved using the proposed strategies. Our results show that, across all accent regions, supplementing the training set with a small amount of data from the most "difficult" accent (2.25 hours of Glaswegian) leads to a gain in performance similar to that obtained with a large amount of accent diverse data (8.96 hours from 14 accent regions). Although the ideas presented focus on DNN based analysis with a limited amount of multi-accented data, they are applicable to training any classifier with limited multi-conditional resources.
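The relative figures quoted in the abstract can be read with the usual convention for comparing word error rates (WER). The sketch below assumes that a "relative gain" is a relative WER reduction, which the abstract does not define explicitly. The function names are hypothetical and the numbers in the usage lines are illustrative only; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction (in percent) of a new system over a baseline.

    Assumption: this is what a "relative gain" means; the abstract does not
    state its exact definition.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def relative_wer_increase(reference_wer: float, other_wer: float) -> float:
    """How much higher (in percent) one WER is relative to a reference WER,
    e.g. a "difficult" accent compared with a reference accent."""
    if reference_wer <= 0:
        raise ValueError("reference_wer must be positive")
    return 100.0 * (other_wer - reference_wer) / reference_wer


# Illustrative values only (not taken from the paper).
print(relative_wer_reduction(20.0, 15.0))  # 25.0
print(relative_wer_increase(10.0, 15.0))   # 50.0
```

Under this convention, a drop from 20% to 15% WER is a 25% relative gain, even though the absolute difference is only 5 percentage points.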