Global Autonomous Language Exploitation (GALE)
The goal of the DARPA GALE program is to develop and apply technologies to absorb, analyze and interpret huge volumes of speech and text in multiple languages. Automatic processing "engines" will convert and distill the data, delivering pertinent, consolidated information in easy-to-understand forms to military personnel and monolingual English-speaking analysts in response to direct or implicit requests.
GALE consists of three major engines: Transcription, Translation and Distillation. The output of each engine is English text. The input to the transcription engine is speech and to the translation engine, text. Engines pass along pointers to relevant source language data that will be available to humans and downstream processes. The distillation engine integrates information of interest to its user from multiple sources and documents.
ICSI is currently participating in GALE as part of both the BBN team (contributing machine translation technology) and the IBM team (contributing novel feature extraction approaches to improve speech recognition).
Speech Processing for Meetings
ICSI researchers seek to develop algorithms and
systems for the recognition of speech from meetings, as well as methods
for information retrieval and other applications that such recognition
would make possible. Funding for this research is provided by the Swiss project IM2: Interactive Multimodal Information Management. See also the IM2 website and ICSI's meeting recorder project page.
Speaker Recognition
This project is concerned with the discovery of highly
speaker-characteristic behaviors ("speaker performances") for use in
speaker recognition and related speech technologies. The intention is
to move beyond the usual low-level short-term spectral features which
dominate speaker recognition systems today, instead focusing on
higher-level sources of speaker information, including idiosyncratic
word usage and pronunciation, prosodic patterns, and vocal gestures.
The project goal is two-fold: to conduct fundamental research to
discover new speaker-distinctive features and encode them into richer,
more informative speaker models; and to evaluate the utility of these
feature sets and models for speaker recognition and other speech
technology applications. The feature discovery efforts are necessarily
exploratory, pursuing both a "knowledge-based" track, building on
existing linguistic constructs and guided by insights from
psycholinguistics and human performance studies, and a more speculative
"data-driven" approach, seeking idiosyncratic "vocal performances" ---
spectro-temporal patterns with high speaker-characterizing power,
independent of linguistic constraints. See the Speaker Recognition project page.
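One way to picture a higher-level feature such as idiosyncratic word usage is a per-word log-likelihood ratio between a speaker's word model and a background model. The sketch below is only an illustration under stated assumptions: the function names, the add-one smoothing, and the toy transcripts are all hypothetical and are not taken from the project's systems.

```python
import math
from collections import Counter

def unigram_model(words, vocab, alpha=1.0):
    """Smoothed unigram probabilities over a fixed vocabulary.
    alpha is an assumed add-alpha smoothing constant."""
    counts = Counter(words)
    total = len(words) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def idiolect_score(test_words, speaker_model, background_model):
    """Average log-likelihood ratio per in-vocabulary word.
    Positive values mean the words look more like the speaker
    than like the background population."""
    scored = [w for w in test_words if w in speaker_model]
    if not scored:
        return 0.0
    return sum(
        math.log(speaker_model[w] / background_model[w]) for w in scored
    ) / len(scored)

# Toy transcripts, invented purely for illustration.
background = "yeah i think so you know it is fine okay right yeah okay".split()
speaker = "you know i mean you know like you know right".split()
vocab = set(background) | set(speaker)

bg_model = unigram_model(background, vocab)
spk_model = unigram_model(speaker, vocab)

same = idiolect_score("you know i mean".split(), spk_model, bg_model)
other = idiolect_score("okay it is fine".split(), spk_model, bg_model)
```

With these toy transcripts, `same` comes out positive and `other` negative, which is the behavior a word-usage feature is meant to capture. A real system would need far more data per speaker.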
My Speech-to-Text (MySTT)
The MySTT ("My Speech-To-Text") project is a development effort to
create a free speech recognition engine aimed at the automatic
transcription of natural, large-vocabulary, human-to-human
communication. It is implemented based on GStreamer, a popular
multimedia streaming framework, and an extension of it called Appscio
MPF, which extends GStreamer for multimedia analytics. The goal of
MySTT is to be easily extendable and interfaceable with other products
and research projects in the multimedia realm. All components,
including the models, are released under open source licenses and are free to use for
both research and commercial purposes.
Speech Technology for Developing Countries
ICSI researchers are developing speech recognition technologies for "emerging regions". As part of this effort, they have developed simple recognizers for Tamil, a language spoken by over 50 million people in Southeast India, where illiteracy rates hover around 50% for men and between 60% and 80% for women. Speech recognition, especially in combination with speech synthesis and compelling visual user interfaces, may be key to increasing access to technology for primarily oral communities. They have designed and field-tested prototype speech recognition applications, collectively called Open Sesame. These include a multi-modal system that accepts both voice and touch input to provide farmers and other rural community members with information on agricultural innovations and crop varieties, as recommended by local experts in Tamil Nadu. The system is one example of ICSI's capability to rapidly design and deploy low-cost speech prototypes using openly available technology.
Mutaphrase
Many natural language processing (NLP) applications implicitly or explicitly depend on content being expressed in a particular way. Thus, a process which is programmed or trained for the sequence "You weren't smart to eat fugu" will not necessarily handle the semantically equivalent paraphrase "Eating blowfish was dumb of you". The mutaphraser automatically generates variants of an input sentence using the semantics and syntax encoded in FrameNet and the lexical semantic information in WordNet. The utility of mutaphrasing is tested on various NLP applications including speech recognition, machine translation training, and machine translation evaluation.
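The lexical side of this idea can be sketched with simple synonym substitution. The example below is a deliberately reduced illustration: the `SYNONYMS` table is a hypothetical, hand-written stand-in for WordNet lookups, and the FrameNet-driven syntactic variation that the mutaphraser performs is not modeled at all.

```python
from itertools import product

# Hypothetical synonym sets standing in for WordNet lookups.
SYNONYMS = {
    "fugu": ["fugu", "blowfish"],
    "smart": ["smart", "wise"],
}

def lexical_variants(sentence):
    """Return every sentence obtainable by swapping each word for one
    of its listed synonyms. Words without an entry are left unchanged."""
    tokens = sentence.split()
    options = [SYNONYMS.get(tok.lower(), [tok]) for tok in tokens]
    return [" ".join(choice) for choice in product(*options)]

variants = lexical_variants("You weren't smart to eat fugu")
```

Here `variants` holds four sentences, including the original and "You weren't wise to eat blowfish". A paraphrase such as "Eating blowfish was dumb of you" additionally requires the syntactic restructuring that this sketch omits.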
Multiple Stream Speech Recognition
This project has three components.
(1) Cortically-inspired speech recognition: Acoustic events such as speech exhibit distinctive spectro-temporal
amplitude modulations. These types of modulations are not
well-captured by conventional feature extraction methods, which
involve only spectral processing or only temporal processing at any one time.
Recent findings from receptive field measurements in the mammalian
auditory cortex suggest that biological systems are highly tuned to
spectro-temporal modulations. The spectro-temporal receptive fields (STRFs) of cortical cells are found to resemble 2-D spectro-temporal
Gabor filters. In prior work, researchers have used 2-D Gabor filters
to extract spectro-temporal features for speech recognition and speech
discrimination. However, these studies have ranged from single
streams of task-optimized features to very large multi-dimensional
representations of spectro-temporal responses. Therefore, there is a
need to explore the use of multiple streams of spectro-temporal
features, which may preserve the organizational map of STRFs and
alleviate cumbersome computation of sizable data, in speech
recognition.
This research aims to develop, evaluate, and incorporate multi-stream
spectro-temporal features for robust speech recognition.
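As a rough illustration of the 2-D Gabor filters mentioned above, the following sketch builds one real-valued spectro-temporal Gabor kernel and applies it to a patch of a toy spectrogram. The function names, kernel size, envelope widths, and modulation frequencies are illustrative assumptions, not the settings used in this research.

```python
import math

def gabor_kernel(n_freq, n_time, omega_f, omega_t, sigma_f, sigma_t):
    """Real 2-D Gabor kernel: a Gaussian envelope times a cosine carrier.
    omega_f and omega_t are spectral and temporal modulation frequencies
    in radians per frequency bin and per time frame (assumed units)."""
    cf, ct = (n_freq - 1) / 2.0, (n_time - 1) / 2.0
    kernel = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            df, dt = f - cf, t - ct
            envelope = math.exp(-0.5 * ((df / sigma_f) ** 2 + (dt / sigma_t) ** 2))
            carrier = math.cos(omega_f * df + omega_t * dt)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def filter_response(spectrogram, kernel, f0, t0):
    """Inner product of the kernel with the spectrogram patch whose
    top-left corner is at frequency bin f0 and time frame t0."""
    return sum(
        kernel[f][t] * spectrogram[f0 + f][t0 + t]
        for f in range(len(kernel))
        for t in range(len(kernel[0]))
    )

# A kernel tuned to purely temporal modulation.
kernel = gabor_kernel(9, 9, omega_f=0.0, omega_t=math.pi / 4,
                      sigma_f=2.0, sigma_t=2.0)

# Toy spectrograms (12 bins x 20 frames): one modulated along time at the
# kernel's rate, one modulated along frequency only.
temporal = [[math.cos(math.pi / 4 * t) for t in range(20)] for _ in range(12)]
spectral = [[math.cos(math.pi / 4 * f) for _ in range(20)] for f in range(12)]

matched = filter_response(temporal, kernel, 0, 4)
mismatched = filter_response(spectral, kernel, 0, 0)
```

The kernel responds more strongly to the temporally modulated input than to the spectrally modulated one. A multi-stream front end could, under these assumptions, group kernels with different `omega_f` and `omega_t` values into separate feature streams.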
(2) Parallel processing for speech recognition: Speech recognition in noisy or reverberant environments requires more processing. A mobile device is often held away from the user's mouth, which degrades automatic speech recognition (ASR) accuracy. For instance, in the most recent NIST evaluations, the best word error rate for multi-microphone speech recognition in a conference room was about 40%. That system used beamforming but did not include the techniques we propose below, which have the potential to reduce this error rate significantly at the expense of much more computational power.
A parallel processing approach that could help further is the multi-stream methodology, in which multiple signal representations are used to generate posterior probabilities of speech sound classes, which are then combined and further transformed (Gaussianized and orthogonalized) to generate input features for a statistical speech recognition engine. Multi-layer perceptrons generate the individual posterior probabilities. These methods have been successfully used for 2-15 streams, but we would ultimately like to work with much larger ensembles of feature generators. We will start our work using the Quicknet libraries that were developed at ICSI, parallelizing them for the target approaches discussed in this proposal. We will then develop code that incorporates these libraries in a system that permits experimentation and ultimately exhibits much greater robustness for speech recognition in moderate noise and reverberation with microphones that are not head-mounted. This work is closely connected with the Berkeley ParLab.
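The combination step can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions: the combination rule (a normalized geometric mean of the stream posteriors) and the function names are hypothetical choices, the posteriors are hard-coded rather than produced by multi-layer perceptrons, and the orthogonalization step is omitted.

```python
import math

def combine_posteriors(streams, floor=1e-10):
    """Combine per-stream class posteriors for one frame.

    streams: list of posterior vectors, one per feature stream, each
    defined over the same speech sound classes.
    Returns the normalized geometric mean (an assumed combination rule).
    """
    n_classes = len(streams[0])
    avg_log = [
        sum(math.log(max(s[k], floor)) for s in streams) / len(streams)
        for k in range(n_classes)
    ]
    peak = max(avg_log)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - peak) for v in avg_log]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def log_features(posterior, floor=1e-10):
    """Log-compress posteriors so that their distribution is less skewed,
    a common first step toward Gaussian-like features."""
    return [math.log(max(p, floor)) for p in posterior]

# Two hypothetical streams scoring three speech sound classes.
stream_a = [0.7, 0.2, 0.1]
stream_b = [0.5, 0.4, 0.1]

combined = combine_posteriors([stream_a, stream_b])
features = log_features(combined)
```

Each stream's posteriors can be computed independently of the others, which is what makes the approach a natural fit for parallel hardware. Only the combination step needs the outputs of all streams.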