For years, we have dreamed of a world where humans can actually talk to machines, where everyday tasks are handled easily by intelligent voice-recognition systems. Now we are moving closer to fulfilling that dream. Speech recognition has entered our daily lives: it is in our phones (Siri, Google Now), our computers (Cortana), and even in our homes (Alexa). These assistants are still not as reliable as chatbots, even though they have far more potential. To perform a task with a chatbot, you still have to type to it; with voice recognition, your hands stay free to do whatever you want while you control your intelligent assistant at the same time.
Speech recognition has been around us for a long time. We have used it since the 1980s, in the form of answering machines that would tell callers the person was unreachable but that they could still leave a message. So why is it only now becoming famous and hitting the mainstream? Because deep learning has finally advanced enough to be integrated into voice-recognition systems. It has been a long time since Andreas Stolcke predicted that speech recognition would be one of the most commercially promising applications of natural language technology.
Microsoft's Artificial Intelligence and Research unit is known for its tremendous contributions to the field of AI. Last year, it published a paper stating that its system's accuracy had finally surpassed IBM's Watson. On the NIST 2000 test set, a benchmark used to measure recognition systems against human transcription errors, Microsoft's conversational recognition system beat IBM's Watson by 0.4 percentage points.
Speech recognition with deep learning is considerably more involved than this, but it gives a rough idea of how the process works. From turning sounds into bits, to processing sampled sound data, to recognizing characters from short sounds, deep learning covers it all. Developers who build these systems usually feed in sample data spoken in one specific accent, so if you say "hullo" instead of "hello," it may mean nothing to the system. We can overcome this problem by feeding the machine sample sounds from every accent. The input to the neural network is a stream of 20-millisecond audio chunks; for each slice of sound, the network digs deep into the signal and triggers the response that matches the speaker's intent.
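To make the "20-millisecond audio chunks" concrete, here is a minimal sketch of how a recorded waveform might be split into such frames before being handed to a neural network. The 16 kHz sample rate is an assumption (it is a common choice for speech models, but the article does not specify one), and `frame_audio` is a hypothetical helper, not part of any real system described above.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate in Hz; common for speech, not stated in the article
FRAME_MS = 20         # 20-millisecond chunks, as described above
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples per frame at 16 kHz

def frame_audio(samples: np.ndarray) -> np.ndarray:
    """Split a 1-D waveform into non-overlapping 20 ms frames.

    Trailing samples that do not fill a whole frame are dropped.
    Returns an array of shape (n_frames, FRAME_LEN).
    """
    n_frames = len(samples) // FRAME_LEN
    return samples[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

# One second of audio (here, silence) yields 50 frames of 320 samples each.
frames = frame_audio(np.zeros(SAMPLE_RATE, dtype=np.float32))
print(frames.shape)  # (50, 320)
```

Real systems typically go further, using overlapping frames and converting each one to a spectrogram or mel-feature vector, but the chunking step itself is this simple.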
Just like chatbots, intelligent voice recognition runs on data. The accuracy of speech recognition depends heavily on the data it is given, and it will become more accurate over time if we keep feeding it the sound data it needs. We call this the "virtuous cycle of AI": as more people use these voice-recognition systems, more data is gathered, which makes the systems more efficient, and the algorithms perform better as a result.