Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce

Abstract

Automatic Speech Recognition (ASR) is essential for any voice-based application. Streaming ASR becomes necessary to provide immediate feedback to the user in applications like Voice Search. LSTM/RNN- and CTC-based ASR systems are simple to train and deploy for low-latency streaming applications but are less accurate than state-of-the-art models. In this work, we build accurate LSTM-, attention- and CTC-based streaming ASR models for large-scale Hinglish (a blend of Hindi and English) Voice Search. We evaluate how various modifications to vanilla LSTM training improve the system’s accuracy while preserving its streaming capabilities. We also discuss a simple integration of end-of-speech (EOS) detection with CTC models, which helps reduce the overall search latency. Our model achieves a word error rate (WER) of 3.69% without EOS and 4.78% with EOS, with a ~1300 ms (~46.64%) reduction in latency.
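For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between the recognized text and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the example queries are hypothetical, and any text normalization the paper may apply before scoring is not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: splits on whitespace and compares words exactly,
    with no casing, punctuation, or transliteration normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical voice-search query with one substituted word out of four.
    print(word_error_rate("red shoes for men", "red shoe for men"))
```

Because insertions count as errors while the denominator is the reference length, WER computed this way can exceed 100% for very noisy hypotheses.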

Anthology ID: 2023.acl-industry.26
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Sunayana Sitaram, Beata Beigman Klebanov, Jason D Williams
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 276–283
URL: https://aclanthology.org/2023.acl-industry.26
DOI: 10.18653/v1/2023.acl-industry.26
Cite (ACL): Abhinav Goyal and Nikesh Garera. 2023. Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 276–283, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce (Goyal & Garera, ACL 2023)
PDF: https://aclanthology.org/2023.acl-industry.26.pdf
