Human Auditory Processing and Speech Recognition – tcworld India 2016
I’m attending tcworld India 2016 in Bangalore. Pavithra Garre gave a presentation entitled “Human Auditory Processing and Speech Recognition—Potential Latencies and Benefits for Documentation”. These are my notes from the session. All credit goes to Pavithra, and any mistakes are my own.
Pavithra Garre is an engineer in design technology at Samsung Electronics in South Korea. She started by showing us a video clip about communication as an innate human ability, the vision of interacting with computers via speech recognition, and the evolution of speech recognition technology.
Pavithra’s presentation was very interactive. She asked questions and chatted with the audience throughout. The presentation covered the layers of speech recognition architecture, the modes of speech recognition, speech identifiers and tagging, CMS interpretation, and custom delivery.
Pavithra described a three-layer architecture:
- Speech recognition: There are different modes of speech recognition: converting digital audio to simpler acoustic forms; matching units of speech; a complex lexical decoding system based on pattern matching; applying grammar, as in predictive typing; and phoneme identification. Speech recognition technology faces challenges such as background noise reduction, the volume of data gathered and the compression needed to reduce it, and energy consumption.
- Tagging the different elements of speech to present to the CMS: The software needs to identify what the person is talking about, and tag each element appropriately. Once the speech is tagged, it becomes data. The tags may take a form of XML or VTML, or a more complex tagging format such as ID3 or ID3V2Easy.
- Documentation in a database: Content is information plus data. The indexed tag and associated content are combined to form or retrieve a document, in what Pavithra calls “CMS interpretation”.
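To make the tagging and CMS-interpretation layers more concrete, here is a minimal sketch in Python. The element names (`speech`, `utterance`, `topic`) and the `SimpleCMS` class are my own illustrative inventions, not a real schema or product from the talk; the sketch only shows the general idea of tagged speech becoming data that a CMS can index and retrieve as a document.

```python
import xml.etree.ElementTree as ET

def tag_speech(topic, utterances):
    """Hypothetical tagging step: wrap recognized speech elements in XML."""
    root = ET.Element("speech", attrib={"topic": topic})
    for text in utterances:
        ET.SubElement(root, "utterance").text = text
    return root

class SimpleCMS:
    """Toy 'CMS interpretation': index tagged content, retrieve documents."""

    def __init__(self):
        self.index = {}

    def store(self, element):
        # Index the tagged content by its topic attribute.
        self.index[element.get("topic")] = element

    def retrieve(self, topic):
        # Combine the indexed tag and its content to form a document.
        element = self.index.get(topic)
        if element is None:
            return None
        return "\n".join(u.text for u in element.findall("utterance"))

cms = SimpleCMS()
cms.store(tag_speech("printer-setup",
                     ["Connect the cable.", "Install the driver."]))
print(cms.retrieve("printer-setup"))
# prints:
# Connect the cable.
# Install the driver.
```

A real pipeline would of course tag far richer structure (speaker, timing, intent) and store it in a proper content repository, but the shape is the same: tagged speech in, indexed content out.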
Some well-known examples of speech recognition software:
- Siri by Apple
- Genie by Microsoft
- Google Speech
- and more
Where can we use this technology and the voice bank containing the derived content?
- Marketing agility
- Big data and analytics
- Resolving disputes about customer interactions in a help desk (this suggestion came from the audience)
- Better performance
Pavithra also described things you need to take into account, such as data volume and data migration.
There was a lively, engaged discussion at the close of the presentation. Thanks Pavithra for an interesting presentation!