Microsoft today announced that its research team has reached a 5.1 per cent word error rate with its conversational speech recognition system, a new industry milestone that substantially surpasses the accuracy the team achieved last year.
Advances in speech recognition have enabled services such as Speech Translator, which can translate presentations in real time for multilingual audiences.
Last year, Microsoft’s speech and dialogue research group announced a milestone in reaching human parity on the Switchboard conversational speech recognition task, meaning the company created technology that recognised words in a conversation as well as professional human transcribers.
Switchboard is a corpus of recorded telephone conversations that the speech research community has used for more than 20 years to benchmark speech recognition systems. The task involves transcribing conversations between strangers discussing topics such as sports and politics.
Microsoft has reduced the error rate by about 12 per cent relative to last year’s level, using a series of improvements to its neural network-based acoustic and language models. The company introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long short-term memory) model for improved acoustic modelling.
In addition, predictions from multiple acoustic models are now combined at both the frame/senone and word levels. Microsoft also strengthened the recogniser’s language model by using the entire history of a dialogue session to predict what is likely to come next, effectively allowing the model to adapt to the topic and local context of a conversation.
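To make the two combination levels concrete, here is a minimal illustrative sketch, not Microsoft's actual system: frame-level combination averages each model's per-frame senone posteriors, while word-level combination takes a simplified ROVER-style majority vote over already-aligned word hypotheses. All function names and data shapes below are hypothetical.

```python
from collections import Counter

def combine_frame_level(posteriors_per_model):
    """Average per-frame senone posterior distributions across models.

    posteriors_per_model: list of models, each a list of frames, each frame
    a dict mapping senone id -> probability. (Hypothetical toy format.)
    """
    n_models = len(posteriors_per_model)
    n_frames = len(posteriors_per_model[0])
    combined = []
    for t in range(n_frames):
        frame = {}
        for model in posteriors_per_model:
            for senone, p in model[t].items():
                frame[senone] = frame.get(senone, 0.0) + p / n_models
        combined.append(frame)
    return combined

def combine_word_level(hypotheses):
    """Majority vote over pre-aligned word hypotheses (simplified; real
    systems must first align hypotheses of differing lengths)."""
    return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]

# Three toy recognisers vote on each word position.
consensus = combine_word_level([
    ["the", "cat", "sat"],
    ["the", "bat", "sat"],
    ["a",   "cat", "sat"],
])
print(consensus)  # ['the', 'cat', 'sat']
```

The sketch glosses over the hard parts of real system combination, such as aligning hypotheses of different lengths and weighting models by reliability, but it captures the idea of fusing evidence both before and after the decoder.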
While achieving a 5.1 per cent word error rate on the Switchboard speech recognition task is a significant achievement, the speech research community still faces many challenges, such as reaching human levels of recognition in noisy environments with distant microphones, recognising accented speech, and handling speaking styles and languages for which only limited training data is available.
Moving from recognising to understanding speech is the next major frontier for speech technology.