digital libraries * text mining * tutorial dialogue systems * multimodal interfaces * speech synthesis * dialog systems * minority languages
computer-assisted language learning * question answering * information theory * technology supported education * language modeling
Carnegie Mellon
“Thus it may be true that the way to translate from Chinese to Arabic, or
from Russian to Portuguese, is not to attempt the direct route... Perhaps the
way is to descend, from each language, down to the common base of human
communication--the real but as yet undiscovered universal language--and
then re-emerge by whatever particular route is convenient.”
- Warren Weaver
Contents

Overview
Ongoing Research
Academic Programs
Faculty
Overview

The LTI was established in 1996, combining the Center for Machine Translation (CMT), which was founded in 1986, and other areas of computational language research. The LTI contains a unique mix of theoretical and systems-building researchers specializing in various aspects of computer science, artificial intelligence, computational linguistics and machine learning, and provides a rich and diverse environment for collaboration among faculty, graduate students, visiting scholars, and research staff. As part of the School of Computer Science at CMU, LTI faculty and students collaborate closely with members of the Computer Science Department, the Center for Automated Learning and Discovery, the Robotics Institute, the Institute for Software Research International, and the Human-Computer Interaction Institute. Collaborative research areas include mobile computing, computational biology, multi-agent systems, cognitive modeling, intelligent tutoring systems, multi-media interfaces, text and data mining, artificial intelligence systems, and machine learning theory and algorithms.
Research Methodology

[Figures 1 and 3: research methodology integrating a linguistic model refined with learned translation rules; data, knowledge, structural model, and predictions.]
Ongoing Research

Machine Translation

Translation of human language was one of the very first tasks attempted by the developers of the first digital computers in the 1950s. Over fifty years later, fully automatic Machine Translation (MT) remains one of the most difficult and challenging topics of research within Artificial Intelligence. With the emergence of universal access to information enabled by today's internet, language has become a critical barrier to global information access and communication, and the need for MT is greater than ever before. The LTI originated as the Center for Machine Translation in the mid-1980s, and MT continues to be a prominent sub-discipline of research within the LTI. The LTI is unique in the breadth of MT problems and approaches that are being investigated and pursued in the context of a variety of research projects, and in the number of faculty and researchers involved in MT research.

Part of the excitement of MT as a research field lies in its wide range of challenges, from cutting-edge applications that are commercially feasible today, through techniques that could have practical application within a few years, to problems that will not be fully solved until the advent of true Artificial Intelligence. The ultimate goal of this area can be characterized as machine translation that is: (1) general purpose (any topic or domain); (2) high quality (human quality or better); and (3) fully automatic. Remarkably, our current MT capabilities can reasonably satisfy any two of these three criteria, but we cannot yet meet all three at once. Our KANT project produces fully automatic, high-quality translations for information dissemination in well-defined technical domains such as electric power utility management and heavy equipment technical documentation (as in the Catalyst application for Caterpillar). Our Example-Based MT and Statistical MT systems can produce fully automatic translation in broad or unlimited domains, but have not yet approached or surpassed human quality levels. In other projects, such as BABYLON and Diplomat/Tongues, successful multi-lingual communication is achieved by augmenting this limited-quality MT with human interaction in order to help resolve translation errors.

There is currently active research being conducted within the LTI on all of the major approaches to MT. Each of these approaches has some unique strengths but also inherent weaknesses and limitations. Consequently, different approaches are suitable for different scenarios. Our main research thrusts are in machine learning approaches to MT, including corpus-based approaches such as Generalized Example-Based MT and Statistical MT systems that have focused primarily on Chinese-to-English and Arabic-to-English MT. Additionally, we conduct ongoing work on Multi-Engine MT (MEMT), combining the results of different MT techniques in order to exploit each technique's strong points.

Since traditional rule-based approaches to MT require lengthy development cycles, and corpus-driven MT requires large amounts of pre-translated parallel text for training, the LTI is investigating alternative MT paradigms for minority languages, such as Quechua and Mapudungún. The goal of the AVENUE project is to produce MT systems requiring neither extended human development cycles (too costly) nor huge parallel corpora (not available). We are investigating methods such as unsupervised learning of complex morphology, and transfer-rule induction from limited numbers of selected word-aligned phrases and sentences, via machine learning methods such as seeded version spaces. Although we have had initial success, much of the challenge remains before us.

Another aspect in which MT systems differ is in their input/output modes: text versus speech. The JANUS project, for instance, combines speech recognition with language translation in the large. Other projects, such as the Speechalator, produced limited-scope, speech-to-speech MT on a hand-held device that also includes fluent speech synthesis. The area of speech-to-speech MT is still young and growing, with technical difficulties such as how to translate from a lattice of sentence recognition hypotheses, produced by the speech recognizer, rather than simply a single known sentence as input. Human factors issues include clarification dialogs when the recognition or the translation is problematic, and how to train users implicitly to use the system for maximum effectiveness.

Finally, we have some non-traditional projects, such as investigating Dolphin language (we kid thee not), and whether it can be interpreted.

[Figure: Types of Machine Translation. The translation pyramid from source (Arabic) to target (English): Direct approaches (SMT, EBMT) at the base; Parsing and Transfer Rules to Text Generation at the middle, syntactic level; Semantic Analysis through an Interlingua to Sentence Planning at the top.]
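The Multi-Engine MT idea of combining different engines' outputs can be sketched very simply. The code below is only an illustration, not the LTI's MEMT method (which combines hypotheses at a finer granularity with stronger models): it assumes each engine returns one whole candidate translation, and picks the candidate a toy target-language model scores highest. The function names and the add-one-smoothed unigram model are hypothetical simplifications.

```python
import math
from collections import Counter

def train_unigram_lm(corpus_sentences):
    """Train an add-one-smoothed unigram language model (illustration only)."""
    counts = Counter(w for s in corpus_sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 for unseen words
    def logprob(sentence):
        return sum(math.log((counts[w] + 1) / (total + vocab))
                   for w in sentence.split())
    return logprob

def select_best(candidates, logprob):
    """Pick the candidate translation the target-side LM scores highest,
    normalized by length so short outputs are not unfairly favored."""
    return max(candidates, key=lambda s: logprob(s) / max(len(s.split()), 1))
```

A real multi-engine combiner would also weigh each engine's own confidence and could splice together phrases from different hypotheses rather than choosing one whole output.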
...technology can be used to support that endeavor. A major thrust of this research is to explore issues related to eliciting and responding to productive student explanation behavior. It involves many broader issues, such as influencing student expectations, motivation, and learning orientation. This interdisciplinary research agenda involves five primary foci:

* Controlled experimentation and analysis of student interactions with each other as well as with human tutors and computer tutors, in order to explore the stimuli that encourage productive student behavior, appropriate learning orientation, and ultimately effective learning

* Analysis of think-aloud protocols in learning scenarios, in order to better understand the process of learning

* Basic research in language technology to support the semi-automatic analysis of language interactions in learning scenarios (text classification, automatic essay grading, etc.)

...ates for students to make their thinking transparent to the tutor. State-of-the-art tutorial dialogue systems have focused on leading students through directed lines of reasoning to support conceptual understanding, clarifying procedures, or coaching the generation of explanations for justifying solutions, problem-solving steps, predictions about complex systems, or understanding of computer architecture. Evaluations of state-of-the-art tutorial dialogue systems provide a powerful proof-of-concept, demonstrating conclusively that the language technology exists for supporting productive learning interactions in natural language. Carnegie Mellon researchers are at the forefront of this movement, both in terms of producing landmark systems and widely used resources.

A current thrust of this work at the LTI involves pushing this technology into new areas, such as supporting design activities in an exploratory...

...specific knowledge sources for robust processing of student explanations, to facilitate this process. Thus, beyond producing technology to be used to address our own theoretical questions about learning interactions, we aim to produce reusable technology that can facilitate the work of other researchers pursuing their own related questions. Our ultimate goal is to produce resources that are simple enough to be used by non-AI researchers and practitioners, such as education researchers, domain experts, and instructors, and thus to put the power of tutorial dialogue in the hands of those with the pedagogical expertise to maximize its effectiveness and meet the real needs of students.
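The semi-automatic analysis of language interactions mentioned above (text classification, automatic essay grading) can be illustrated with a minimal bag-of-words classifier. This is a sketch only, not any of the LTI's actual systems: the labels and training utterances are hypothetical, and real systems use far richer features than word counts.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Minimal multinomial Naive Bayes over bag-of-words features,
    with add-one smoothing (illustration only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # per-class word counts
        self.class_counts = Counter()            # documents per class
        self.vocab = set()

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.class_counts[label] += 1
            self.vocab.update(words)

    def classify(self, text):
        total_docs = sum(self.class_counts.values())
        def score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            s = math.log(self.class_counts[label] / total_docs)  # class prior
            for w in text.lower().split():
                s += math.log((counts[w] + 1) / denom)           # smoothed likelihood
            return s
        return max(self.class_counts, key=score)
```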
CALL

...ther by going to the country or by having a personal tutor. First attempts to create language learning software lacked feedback and authenticity. This is where language technologies are starting to provide a potent alternative. The use of speech recognition has enabled students to speak to a system and find out exactly what phonetic and prosodic errors were made and how to correct them. Using modeling of native and non-native speech and knowledge of the native language of the student, we have developed...

...allow us to read along with a student and give help in understanding a passage upon request. By using information retrieval, we can find appropriate texts for a student's level of reading and lexical and grammar knowledge, and can give other researchers the tools to determine how hard a new text can be and still be effective: for example, what percentage of new words can be in a text which still allows the student to generalize the meaning. For learning research, members of the LTI...

...University of Pittsburgh.
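The readability question raised above, what percentage of new words a text can contain and still be effective, suggests a simple check against a student's known vocabulary. A minimal sketch, with a hypothetical 5% threshold; real systems also weigh grammar, word frequency, and sentence structure.

```python
def unknown_word_ratio(text, known_vocab):
    """Fraction of running words in `text` not in the student's known
    vocabulary. A crude proxy for text difficulty."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in known_vocab)
    return unknown / len(words)

def suitable(text, known_vocab, max_new=0.05):
    """Accept a text only if at most `max_new` of its words are new.
    The 5% default threshold is illustrative, not an empirical finding."""
    return unknown_word_ratio(text, known_vocab) <= max_new
```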
Universal Library

The central goal of the Universal Library is to digitize, index and make universally available all published works of humankind, including books, periodicals, artwork and music. A further goal is to provide value-added information services such as automated summarization, reading assistants, full content search and translation over the internet in any language. Imagine the situation in which every researcher, every teacher, every citizen and even every schoolchild would have everything ever written at her fingertips, regardless of what country she lives in, her economic status, the school she attended, or her native language. The amplification of human potential would be vast, and would lead to the ultimate democratization of information and knowledge. Like Rome, the Universal Library is not built in a day, but rather it is a project for the ages, requiring constant improvement and enrichment.

We estimate that approximately 100 million different books have been published in the history of the world, but only a tiny fraction are available digitally. The remainder must first be scanned – a labor-intensive chore. Toward that end, with the support of the National Science Foundation, we have organized the Million Book Project, a joint effort of CMU and the governments of India and China, with other countries such as Egypt and Turkey joining the cause. Hundreds of digital scanners have been distributed to centers in those countries, where personnel furnished by their governments scan books for two shifts per day. As of fall 2004 about 100,000 volumes have been scanned. These are passed through optical character recognition software, indexed, and added to the Universal Library. Approximately half of the scanned books are in English; the remainder are in a wide range of other languages. The first million books are expected to be complete by the end of 2006, at which point we will embark on the Ten Million Book Project.

The Language Technologies Institute and ISRI provide indexing software and storage infrastructure for the Million Book Project. The CMU Libraries furnish metadata, archiving and copyright clearance support. A number of research projects are underway to explore applications and uses for the Universal Library. One of them, the Universal Dictionary, is an effort to build a database of every word in every language. This will serve as a basic resource for machine translation and multilingual searching. We are also exploring new methods of navigating huge text spaces. As the size of the collection grows, the limitations of keyword search, particularly for multilingual queries, become severe. What is needed is a language-independent search method that is able to retrieve based on concepts rather than specific terms, which suggests a multimodal rather than a pure text-based interface.

Copyright is a major barrier to free distribution of content. The vast majority of works ever published are still in copyright. Of these, more than 90% are out of print, which means that they produce no revenue either for the author or the publisher. We are endeavoring to encourage publishers to allow the Universal Library to scan their out-of-print books and permit them to be viewed on the Internet and retrieved through search engines. Publishers who do this often find an increased demand for their books. We are also working with the government of India to develop a new copyright statute that would provide funds, analogous to the UK Public Lending Right, to be distributed to copyright owners whose works are accessed on the Internet, with micropayments to be provided through public funds. Eventually the availability of such payments could remove further obstacles to offering copyrighted material.

The Universal Library enjoys cooperative relationships with other institutions, including the Internet Archive, the Digital Library Federation, and the National Academy Press.
Computational Biology

Large amounts of genomic and protein sequence data for homo sapiens and other organisms have become available, together with a growing body of correlated protein structure and function data, creating an opportunity for addressing the sequence mapping and structure folding problems with increasingly sophisticated data-driven (statistical and computational) methods to discover, characterize and model regularities and outliers in the biological data. Machine learning methods, with large amounts of data, led to multiple breakthroughs in language technologies such as automatic speech recognition, document classification, information extraction, statistical machine translation and other challenging natural language processing tasks. Our research exploits the analogy between mapping words to meaning via syntax, in order to decipher the fundamental meaningful building blocks of a biological sequence language, via its structure to its underlying function.

The goal is to derive new hypotheses to correlate these building blocks with structural, dynamic and functional "meaning" for different living organisms in terms of folding, activity, interactions, and pathways. For example, one of the important challenges we try to tackle is the prediction of super-secondary structures such as the beta helix (Figure). We work with super-secondary protein structure experimentalists so that the hypotheses generated from this computational approach can be tested by wet lab experiments.

In another attempt to relate protein primary sequence to its structure and function, we have engaged in a project to infer evolutionary selectional pressure from residue conservation in multiple sequence alignments of protein families. Many measures have been proposed for quantifying the overall degree of sequence conservation in a multiple sequence alignment of proteins. However, these measures fail to identify which particular properties are conserved at each position in the alignment. We derived an algorithm for systematically identifying the conservation of specific physical-chemical properties in individual positions in a multiple sequence alignment. We have applied our method to the diverse GPCR family and demonstrate the computational significance of the properties we have identified by successfully using them to predict whether specific amino acids will occur in particular positions in the alignment. We have also used our method to annotate Rhodopsin, a well characterized member of the GPCR family, with a selectional pressure profile, which allowed us to biologically interpret our findings. We further applied our method to a multiple sequence alignment of an HIV-I protein, and are gearing up to apply it to a large set of protein families, including crystallins and various globins. Looking ahead, we plan to refine our method by incorporating phylogenetic histories, and separating mutation.

The above examples illustrate the power of combining computational linguistics, statistical machine learning and bio-sequence/structure/function discovery; we expect to tackle other interesting problems in this field, such as protein-protein interaction predictions and aspects of immune system modeling at the molecular level.
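The core idea of the conservation algorithm described above, checking which physical-chemical properties are shared at each column of a multiple sequence alignment, can be sketched as follows. The property table here is a toy subset, and the all-or-nothing test is a simplification of the published statistical method.

```python
# Toy physical-chemical property assignments (illustrative subset, not a full table)
PROPERTIES = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYH"),
    "charged": set("DEKR"),
    "small": set("AGSCTPDN"),
}

def conserved_properties(alignment):
    """For each column of a multiple sequence alignment (gaps as '-'),
    list the properties shared by every non-gap residue in that column."""
    ncols = len(alignment[0])
    result = []
    for i in range(ncols):
        column = [seq[i] for seq in alignment if seq[i] != "-"]
        shared = [name for name, members in PROPERTIES.items()
                  if column and all(aa in members for aa in column)]
        result.append(shared)
    return result
```

For example, a column holding only A and L residues conserves hydrophobicity even though the amino acids themselves differ, which is exactly the kind of signal a raw identity-based conservation score misses.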
RADAR

...routine to complex problem solving. This new technology should be equally valuable to managers in industry, academia, and government. RADAR will help its human master in many ways: scheduling meetings, allocating resources, maintaining a project...

...handle all these diverse tasks, including requests and situations that were not anticipated by RADAR's designers. When faced with a surprising new request, RADAR might not know how to proceed, but it should do something sensible. Perhaps it can weave together fragments of old...

...when and how to ask for advice or offer suggestions). All of these capabilities depend on machine learning technology: learning from human advice, learning by observation, learning by active experimentation, and transferring the results of learning across problems and domains.
Speech Processing

Speech is the most natural way for humans to communicate, and we find it so easy to use that most of us are surprised to learn how complex the processing of spoken language actually is. As one of our goals at the LTI is to make speech communication with and through computers more useful, we work on improving the fundamental technologies of automatic speech processing, i.e. speech recognition and speech synthesis. We also develop new technologies using those components, such as speech-to-speech translation, spoken dialog systems, audio-based information extraction and retrieval, and computer-aided language learning.

Speech Recognition

Automatic Speech Recognition is the process of decoding a spoken speech signal into a written form, that is, a sequence of words. To do this, the analog speech signal needs to be digitized and then, for efficiency reasons, reduced to its essential relevant information, which is mainly done by a form of frequency decomposition (the picture below shows a spectrogram representation of a speech waveform from conversationally spoken speech; the brightness of the colors indicates the energy level present at a given frequency). The final representation of speech in the computer is a stream of parameter vectors over time. These vectors will be classified into phonemes, the smallest linguistically distinct sounds of a language. For this purpose pro...

...fact that the speech signal can be heavily affected by background noises, channel distortions, or cross-talk, but also that spoken speech varies in speaking style, speed, and content. More difficulties arise in speech recognition because different words might be pronounced the same (as in "two", "to", and "too"), one word might be pronounced differently (such as "the" in "the teen" vs. "the adult"), and also because speech is spoken continuously, so it provides no natural segmentation. For instance, the same phonetic sequence can be segmented into two different word sequences: "This machine can recognize speech," or "This machine can wreck a nice beach." Which sequence will be picked depends on the expectation of the listener.

In order to learn the knowledge in the components of the automatic speech recognizer, namely the acoustic models, the pronunciation dictionary, and the language model, today's speech recognition algorithms must use data from which those models are trained. Thus, the acoustic model learns the most likely way people pronounce particular phonemes in particular contexts. The pronunciation dictionary models the most likely sequences of phonemes to build words, and the language model learns the most likely sequences of words to build sentences. The language model is statistically trained and scores all of the possible phrases that could have been spoken. It differs from more traditional parsing techniques, although they may overlap, since speech is less likely to be in traditional lin...

...has several pronunciations depending on whether it is a year, "nineteen eighty four," a quantity, "one thousand nine hundred (and) eighty four," or a telephone number, which can be pronounced "one nine eight four."

• Linguistic analysis: Once given the words, we still require the pronunciations. This can be done by a pronunciation dictionary. However, no matter how large the dictionary is, we will still encounter words outside of its vocabulary due to neologisms, names, etc. Therefore, a letter-to-sound rules system is also required. Prosody, including tune, duration, and phrasing, comprises the components that make speech interesting. There are many ways to pronounce words; recreating the prosody and style (i.e. polite or urgent) makes the speech more understandable and more acceptable to users.

• Waveform generation: Currently, the most common form of constructing waveforms from phonetic and prosodic descriptions is by concatenating short pieces of pre-recorded natural speech and modifying their prosody to match the desired form. Traditional approaches record all phone-phone transitions in a language (called diphones). Although this technique is robust, more general unit selection synthesis, in which the database contains more varied speech with multiple examples of phones in various contexts, and an appropriate selection algorithm, seems to offer promise of much higher quality speech.
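The recognition front end described above (digitize, frequency decomposition, a stream of parameter vectors over time) can be made concrete in a few lines. This is a sketch only: real recognizers use windowed FFTs and mel-scale filterbanks rather than the naive DFT below, which is just the simplest way to show the frequency decomposition step.

```python
import math

def frames(signal, frame_len, hop):
    """Slice a digitized signal into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Naive DFT magnitude of one frame; real systems use the FFT
    plus mel filterbanks, but the output has the same meaning."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append(math.hypot(re, im))
    return spec

def spectrogram(signal, frame_len=8, hop=4):
    """The 'stream of parameter vectors over time' described above:
    one spectral vector per frame."""
    return [magnitude_spectrum(f) for f in frames(signal, frame_len, hop)]
```

Feeding a pure tone through this pipeline yields a spectral vector with all its energy in the corresponding frequency bin, which is what the bright bands in a spectrogram display.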
...knowledge-based techniques are also closely coupled with statistically based methods. A common theme appearing throughout all the LTI's work is developing and applying machine learning techniques to appropriately defined knowledge-driven frameworks to improve the usefulness of the work. These interdisciplinary approaches encourage sharing of techniques over different projects: language modeling techniques may also be used in text summarization; novel machine learning techniques may be applied to speech problems. The LTI's speech research allows standard components to be used in other larger projects, thus making them more useful, but also offering greater challenges in the application of speech and language that lead to more fundamental research. In the following we list a selection of projects and applications which are currently under development in LTI:

Since speech recognition is the most natural means to allow communication across language and culture barriers, speech recognizers in many different languages are an essential prerequisite for making speech-driven communication applications attractive and available to the public. The project GlobalPhone focuses on the rapid deployment of speech recognizers in many languages, i.e. by reducing the required effort in terms of time and cost to build such recognizers, thus enabling support for languages for which few or no resources are available.

• Speech-to-Speech Translation: The Consortium for Speech Translation Research (C-STAR) is a speech-to-speech translation system developed jointly between CMU and international partners from Japan, Korea, Italy, France, and China. Here, speech recognition must deal with many languages, recognize and translate in real-time, and handle many different users. Recent work investigates deploying such systems on resource-constrained mobile devices, and improves the robustness and quality of domain-dependent translation. In the project STR-Dust ("Speech Translation for Domain Unlimited Spontaneous Communication Tasks") we push the limits of today's speech translation coverage to rather unlimited domains, such as in meetings, lectures, and news.

• Meeting Summarization: A microphone records a multi-person meeting. Off-line speech recognition technology transcribes the meeting, including the difficult task of separating the voices and identifying the speaker. Information retrieval technology is then used to index the data so we can answer queries such as "Find the part where Bob and Jane talked about next year's budget."

• Dialog systems: The CMU DARPA Communicator project allows experiments in mixed-initiative dialog between humans and machines via the telephone in the domain of flight, hotel and car rental information. This requires accurate, real-time recognition across the reduced-bandwidth and potentially noisy telephone, real-time access to networked information, natural language generation, parsing, synthesis, and a dialog manager.

• Computer aided language learning: The Fluency project applies speech recognition techniques to aid non-natives in pronunciation.

...lecting enough data to find all examples is difficult. Thus, smoothing and back-off techniques are often required. A number of new language modeling techniques are being investigated within the LTI, including class-based language models that consider whole word classes, not just individual words, for n-grams. Additionally, new statistical modeling techniques such as maximum entropy are being developed to give better estimates of the probability distribution of word sequences.

• Synthesis using Festival, FestVox, and Flite: To ensure that our speech synthesis work is available to the widest range of users, we work with the University of Edinburgh's Festival Speech Synthesis System, a free software synthesis toolkit and engine. We have also produced the FestVox tools for building new voices in new languages, allowing the construction of both general voices and domain-specific voices. Also, with the small-footprint CMU Flite system, synthesis can be used on any platform.

• Robust speech recognition: Current automatic speech recognition systems are limited in their ability to adapt to the effects of new speakers, difficult acoustical environments, non-native accents, and spontaneous speech production. Researchers at the LTI are carrying out a broad program of research to improve the robustness of automatic speech recognition using a variety of techniques.
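The smoothing and back-off techniques mentioned in the language-modeling work can be illustrated with a deliberately crude bigram model. The back-off scheme below is a toy stand-in, not the Katz back-off or Kneser-Ney smoothing used in real recognizers: it falls back to a discounted, add-one-smoothed unigram estimate whenever a bigram was never observed.

```python
from collections import Counter

class BackoffBigramLM:
    """Bigram language model with a crude back-off: use the bigram
    estimate when the context was seen, otherwise fall back to a
    discounted unigram estimate (illustration only)."""

    def __init__(self, sentences, alpha=0.4):
        self.alpha = alpha          # back-off discount (hypothetical value)
        self.unigrams = Counter()
        self.bigrams = Counter()
        for s in sentences:
            words = ["<s>"] + s.split() + ["</s>"]
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.total = sum(self.unigrams.values())

    def prob(self, prev, word):
        if self.bigrams[(prev, word)] > 0:
            return self.bigrams[(prev, word)] / self.unigrams[prev]
        # back off to an add-one-smoothed unigram, discounted by alpha
        return self.alpha * (self.unigrams[word] + 1) / (self.total + len(self.unigrams) + 1)
```

The point of back-off is visible immediately: an unseen bigram like "cat dog" still receives a small non-zero probability instead of zeroing out every sentence that contains it.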
Information Retrieval and Text Mining

Machines cannot understand the meaning of a multimedia document in the way that a human can, but many useful tasks can be accomplished with limited forms of understanding. Statistical corpus analysis, probabilistic inference, and machine learning are the tools of IR and Text Mining research. Research at the LTI is grounded in theory and tested in large-scale applications. Consequently, research projects focus on everything from basic theory to software engineering. Several representative examples are described below.

Research on Advanced IR Architectures develops systems that combine standard search queries with detailed, long-term user and task models and highly structured documents. Document structure may indicate how the document is organized (e.g., XML), or it may be provided by language analysis tools (e.g., named entities, part-of-speech, syntactic parsing). This research supports LTI projects on open-corpus language tutoring, such as the REAP reading comprehension project. Much of this research is distributed via the open-source Lemur Toolkit.

Translingual Information Retrieval uses queries in one language (e.g., English) to find documents in other languages (e.g., German, Chinese and Arabic). Traditional machine translation methods do not work well when queries are short, out of context, and not sentences. Our research focuses on corpus-based translation of query terms by learning empirical associations among multilingual lexicons from translation mates (documents, paragraphs, passages or sentences), and by mapping queries and documents to a conceptual 'interlingua' that bridges the language barrier.

The Distributed Information Retrieval project studies environments, such as the Web, large corporate networks, and peer-to-peer networks, in which thousands of search engines are available. Cooperation cannot be assumed, so robust techniques are required for automatically characterizing search engines, selecting among them, searching them, and integrating results retrieved from different sources.

The Language Technologies Institute pioneered research in automated Text Summarization with the Maximum Marginal Relevance metric and its application to user-profile-relevant document summarization. Research also focuses on summarizing dialog clusters of topically related documents and automated generation of briefings from corpora.
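The Maximum Marginal Relevance criterion picks, at each step, the item most relevant to the query and least redundant with the items already selected. A minimal greedy sketch, assuming precomputed similarity scores; this illustrates the criterion itself, not the LTI's summarization systems.

```python
def mmr_select(query_sim, doc_sim, k, lam=0.7):
    """Greedy Maximum Marginal Relevance selection.

    query_sim: dict doc -> similarity to the query
    doc_sim:   dict (doc_a, doc_b) -> similarity between two documents
    lam:       trade-off between relevance (lam) and novelty (1 - lam)
    Returns up to k documents balancing relevance against redundancy.
    """
    selected = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def mmr(d):
            # redundancy = similarity to the closest already-selected item
            redundancy = max((doc_sim.get((d, s), doc_sim.get((s, d), 0.0))
                              for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam = 1.0` this reduces to plain relevance ranking; lowering `lam` increasingly penalizes documents (or sentences, in summarization) that repeat material already chosen.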
15 Ongoing Research
Knowledge-Based NLP
The LTI has a long history of research in knowledge-based natural language processing and computational linguistics, dating back to Carbonell’s work on knowledge-based interlingual machine translation and Tomita’s work on efficient natural language parsing techniques, when the precursor of the LTI was Carnegie Mellon’s “Center for Machine Translation.” Of particular note are the KANT and KANTOO systems developed by Nyberg, Mitamura and Carbonell, which brought high-accuracy interlingua machine translation into large-scale practical use for translating technical literature at Caterpillar Inc. into several languages. This line of work is characterized by careful linguistic analysis, large-scale knowledge engineering, and solid system building. More recently, knowledge-based systems have been combined with machine learning, as in the AVENUE project, where translation transfer rules are learned from a minimal number of word-aligned translation pairs via new techniques such as seeded-version-space learning. The current “pure” knowledge-based projects at the LTI are:

Knowledge Acquisition from Natural Language Text
The knowledge acquisition bottleneck has long been decried as one of the limiting factors for applications of artificial intelligence – how can we get all of the appropriate world knowledge into the computer so that it can solve problems of practical significance in a new domain? In our research on knowledge acquisition from text, we are working to define a formal mapping between specific structures in natural language and corresponding meaning representations in a formal representation (e.g. frame logic). The goal of CMU’s contribution to the HALO-II project is to reduce the cost of encoding knowledge for a problem-solving system by making it possible to acquire knowledge directly from an existing text (e.g., from a textbook). Current work focuses on acquiring various types of knowledge (ontologies, rules, processes, etc.) from college textbooks in domains such as Biology, Chemistry, and Physics.

Open-Domain Question Answering for Multi-Lingual Text Collections
As the size of available on-line text collections grows ever larger, simple search engines are becoming less and less effective at helping users find answers in on-line documents. More advanced question-answering (QA) systems, such as the LTI’s JAVELIN project, use NLP techniques (segmentation, stemming, parsing, semantic interpretation, unification, etc.) to a) understand the underlying meaning of the questions they are posed, and b) find the most likely answers in the target collection(s). Information gathering becomes a collaborative process, where the system and user work together in an ongoing dialog to refine the search for ever-better answers. In addition to research on basic parsing and interpretation of unrestricted text, we are also actively working on: a) information gathering dialogs; b) sophisticated multi-layer retrieval strategies; c) a range of approaches to information extraction (pattern-based, statistical, etc.); and d) synthesis of answers from multiple answer candidates. The team is also investigating the use of dynamic planning in QA (e.g., to explore first the strategies that are most likely to yield a good answer when processing time is limited), and the use of reasoning and belief networks to piece together individual pieces of information from several documents.
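The question-analysis, retrieval, and answer-selection stages that such QA systems share can be sketched in miniature. Everything below — the function names, the wh-word heuristic, the keyword-overlap retrieval, and the toy document set — is an illustrative assumption, not JAVELIN’s actual design:

```python
# Minimal QA pipeline sketch: analyze the question, retrieve candidate
# documents, and return the best-supported candidate. All heuristics
# here are toy stand-ins for real NLP components.

def analyze_question(question):
    """Crude question analysis: guess the expected answer type from the wh-word."""
    q = question.lower()
    if q.startswith("who"):
        atype = "PERSON"
    elif q.startswith("when"):
        atype = "DATE"
    else:
        atype = "OTHER"
    stopwords = ("who", "when", "what", "is", "the")
    keywords = [w.strip("?,.") for w in q.split() if w not in stopwords]
    return atype, keywords

def retrieve(keywords, collection):
    """Rank documents by simple keyword overlap (a stand-in for real retrieval)."""
    scored = [(sum(k in doc.lower() for k in keywords), doc) for doc in collection]
    return [doc for score, doc in sorted(scored, reverse=True) if score > 0]

def answer(question, collection):
    """Return the best-supported candidate document, or None if nothing matches."""
    _, keywords = analyze_question(question)
    candidates = retrieve(keywords, collection)
    return candidates[0] if candidates else None

docs = [
    "The Center for Machine Translation was founded in 1986.",
    "Pittsburgh is a city in Pennsylvania.",
]
print(answer("When was the Center for Machine Translation founded?", docs))
```

A real system would replace each stage — parsing instead of wh-word matching, indexed retrieval instead of substring overlap, and answer extraction instead of returning whole documents — but the pipeline shape is the same.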
Industrial Programs
At the LTI we welcome industrial participation in our research, in the form of industrial affiliates, customized education programs, guest researchers participating in common projects, and funded R&D efforts. The LTI collaborates with many industrial partners, national and international, with several of these partnerships actively ongoing and others having successfully concluded. LTI industrial partners, past and present, include: IBM, Intel, SRI, Hitachi, Fujitsu, Siemens, Denso, Vulcan, Ontoprise, Lycos, Northrop Grumman, Hewlett Packard, Boeing, Justsystems, CNRI, Dong-A-Seetech, Caterpillar, Carnegie Group, Dynamix, General Electric, and ATR. We engage in the following types of relations:

Affiliateships – where industry participates in the results of LTI research, including publications, data sets, software, and briefings.

In-residence fellows – where industrial researchers or engineers join the LTI for a specified duration to receive full-immersion training in new technologies, to work on joint projects, or both.

Customized intensive courses – ranging from executive one- or two-day sessions to much longer and more technical targeted offerings on selected topics in Language Technologies and their applications.

Sponsored R&D projects – where LTI technologies are extended, integrated and/or applied to address challenging problems in industry or government.
Countries represented in the Language Technologies Institute (languages of study, faculty, staff, students and/or visiting researchers)
The PhD in Language and Information Technologies is a research-oriented degree program consisting of the following
components: successful completion of a set of courses, mastery of certain proficiencies, and a program of research, directed
by a faculty advisor, culminating in a PhD thesis.
Proficiencies
The following skills must be demonstrated in the course of
graduate study, with flexibility in the form and timing of
their demonstration:
Writing: Satisfied by producing a peer-reviewed conference paper.
The Master of Language Technologies (MLT) is a professional degree that is normally completed in two years. Students
choose an individualized curriculum from a flexible set of courses and self-paced laboratory modules that cover linguistic
and statistical approaches and basic computer science. The curriculum is usually tailored to emphasize a specialty in one of
three language technology areas: Machine Translation, Information Retrieval, or Speech Technology. Directed research is
an integral part of the MLT program; each MLT student carries out research under the guidance of a faculty advisor.
With some modifications and enhancements, the MLT curriculum also forms the course-based component of the PhD Pro-
gram. The more research-oriented MLT students are encouraged to apply for continuing studies in the PhD program, with
most of their MLT courses and hands-on work being credited towards the PhD.
Master of LT Curriculum
The curriculum for the MLT consists of a minimum of
120 course units at a senior or graduate level. From these
120 units, six courses must be LTI courses and two other
courses must be SCS courses. There are additional con-
straints on course selection, required in order to meet SCS-
wide Masters requirements. A concentrated form of this
degree may be completed in one year without the research
component.
Alan Black has created practical implementations of computational theories of speech and language. With a broad background in morphology, language modeling in speech recognition, and computational semantics, he now works in all aspects of speech generation. As an author of the free-software Festival Speech Synthesis System, he has researched text analysis, prosodic modeling, waveform generation, and architectural issues in synthesis systems. His work targets data-driven computational models that allow synthesizers to capture speaker style. Specifically, he studies data-driven prosodic models and the automatic building of voices in English and other languages. To allow spoken output anywhere, he also deploys this work on handheld computers, specifically addressing rapid development of voices in new languages, modeling of speaker individuality, and evaluation of voice quality.

Professor Black’s teaching is very practical; his courses involve significant exercises that allow students to gain experience in building synthetic voices, statistically trained models, etc. After some practical experience it is easier to understand the underlying theoretical issues and their relative importance.

Alan W Black
Associate Research Professor
BS Computer Science, Coventry Polytechnic 1984
MS Knowledge Based Systems, University of Edinburgh 1986
PhD Artificial Intelligence, University of Edinburgh 1993
www.cs.cmu.edu/~awb
speech synthesis * speech to speech translation * spoken dialog systems
Ralf Brown's research interests cover several areas of language technology, such as reference resolution,
disambiguation, corpus-based machine translation, cross-language information retrieval, and topic track-
ing in news. His recent research has focused on Example-Based Machine Translation and its applications,
particularly in the context of multi-engine translation systems, and on topic tracking in news. He also
works with machine-learning techniques for extracting patterns from parallel text in order to build trans-
lation systems with less training material.
Current and recent projects include RADD (Rapidly-Adaptable Data-Driven Machine Translation), AVENUE (machine translation for languages with few resources), Topical Novelty Detection in the TDT (Topic Detection and Tracking) program for detecting new events in the news and tracking their evolution, TONGUES (rapid development of bi-directional speech-to-speech translation systems), and MUCHMORE (cross-language information retrieval in the medical domain).

Ralf Brown
Senior Systems Scientist
BS Computer Science, Towson University 1986
PhD Computer Science, Carnegie Mellon University 1993
www.cs.cmu.edu/~ralf
Jamie Callan is interested in a wide range of information retrieval and text mining topics. In recent years
his research has focused on four problems listed below.
• Federated Search (“Distributed IR”): Provide access to many search engines through a single
search interface; includes peer-to-peer search. Research topics include learning what each engine
contains, selecting which to search, searching them, and integrating results from different sources.
• Adaptive Document Filtering: Monitor information streams to find documents that satisfy an infor-
mation need. The system should learn a person’s information needs, rapidly identify desired docu-
ments, and distinguish between novel and redundant information.
• Large-Scale Text Analysis: Develop tools for rapidly analyzing large text datasets. For example, when a government agency receives 100,000 comments about a new regulation, it needs to know which groups commented, what topics were discussed, and what supporting evidence was cited.

• IR for Language Applications: Search engines are increasingly used in question answering and language tutoring systems. Such applications require rich text annotation (e.g., syntax, named entity), complex queries, and retrieval models that combine varied forms of evidence.

His students initially work closely with him to study specific ideas while learning research skills and IR. As students gain expertise, they develop their own interests and have more freedom in exploring them.

Jamie Callan
Associate Professor
BA Applications of Computer Science, Univ. of Connecticut 1984
MS Computer & Information Science, Univ. of Massachusetts 1987
PhD Computer Science, Univ. of Massachusetts 1993
www.cs.cmu.edu/~callan
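The federated-search problem described above — querying several engines and integrating their results into one ranking — can be sketched minimally. The engines, scores, and the min-max normalization below are illustrative assumptions, not Callan’s actual algorithms:

```python
# Toy federated search: normalize each engine's scores so engines are
# comparable, then merge into a single ranked list. Real result merging
# is considerably more sophisticated (e.g., learned score mappings).

def normalize(results):
    """Min-max normalize one engine's (doc, score) list into [0, 1]."""
    if not results:
        return []
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid divide-by-zero for constant scores
    return [(doc, (s - lo) / span) for doc, s in results]

def merge(engine_results):
    """Merge normalized lists from all engines; best score wins per document."""
    best = {}
    for results in engine_results:
        for doc, score in normalize(results):
            best[doc] = max(score, best.get(doc, 0.0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

engine_a = [("d1", 12.0), ("d2", 3.0)]   # raw scores on one scale
engine_b = [("d3", 0.9), ("d2", 0.1)]    # raw scores on another scale
ranking = merge([engine_a, engine_b])
print(ranking)
```

Normalization matters here because different engines score on incomparable scales; the other research topics listed (learning what each engine contains, selecting which to search) would sit in front of this merge step.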
Maxine Eskenazi's research interests lie in the variability of the speech signal, whether it be to aid non-native speakers to learn a language, to enable systems to dialogue with the elderly, or to process the speech of any other group of speakers whose production differs greatly from the average. At the LTI, she has created the Fluency project, which develops basic algorithms and systems for language learning. The systems are used for foreign language learning as well as for learning American English dialects. They are also used to test pedagogical theories about language learning. Work on this project has spun off a company, Carnegie Speech™, which has created products based on Fluency algorithms. She also works on the use of authentic materials for language learning. Here we characterize the language learner's knowledge and the knowledge to be acquired (curriculum), and then determine which texts, from a very large database of texts taken off the Web, should be shown to the learner next. Teacher-created curricula and learners' interests add to the power of adaptation to the individual's needs.

Dr. Eskenazi views teaching as a constant dialogue. It is an occasion for all individuals concerned to come together and learn something, question something, change something.

Maxine Eskenazi
Associate Teaching Professor
BA Modern Languages, Carnegie Mellon University 1973
DEA Linguistics, University of Paris VII 1981
Doctorat de Troisième Cycle, Computer Science, University of Paris XI 1984
www.lti.cs.cmu.edu/~max
computer-aided language learning * speech processing * speech recognition
Scott Fahlman is responsible for the knowledge-representation research effort on the RADAR Project, a large DARPA-funded research effort whose goal is to build an automated cognitive assistant for busy managers, making extensive use of AI and machine-learning techniques.

As a researcher, he is primarily interested in Artificial Intelligence and its applications. Currently, he is working on SCONE, a practical system that can represent a large body of real-world knowledge and that can efficiently perform the kinds of search and inference that seem so effortless for us humans. He believes that such "knowledge base" systems will be important tools in the future, perhaps used in even more ways than database systems are used today.

With respect to natural language understanding, the field has made considerable progress focusing on superficial aspects of language, but Professor Fahlman believes that future progress depends on our ability to extract and represent the actual meaning of a piece of text, and to use large amounts of background knowledge in understanding the text, using powerful new tools for knowledge representation and inference.

Scott E. Fahlman
Research Professor
BS Electrical Engineering and Computer Science, MIT 1973
MS Electrical Engineering and Computer Science, MIT 1973
PhD Artificial Intelligence, MIT 1977
www.cs.cmu.edu/~sef
He is currently working on an intelligent system for automated allocation of offices and related re-
sources, in both crisis and routine situations. This work is part of the RADAR project, aimed at
creating a general-purpose assistant for office managers. He is also working on techniques for identi-
fication of both known and surprising patterns in large-scale databases, and applying these tech-
niques to homeland security. This work is part of the ARGUS project, which is a joint research project
involving Carnegie Mellon and Dynamix Technologies.

Eugene Fink
Systems Scientist
BS Mount Allison University 1991
MS University of Waterloo 1992
PhD Carnegie Mellon University 1999
www.cs.cmu.edu/~eugene
artificial intelligence * machine learning * computational geometry
Bob Frederking's primary research area has been machine translation applications that do not currently
permit the use of purely knowledge-based techniques. This includes rapidly developing Machine Trans-
lation (MT) for new languages and translating text and speech that are not limited to a narrow, well-
defined domain. Our main technical approach in this area is Multi-Engine MT (MEMT). MEMT applies
several different MT techniques to the same text, and then attempts to select the best results from each
technique. He developed and implemented the initial chart-based dynamic-programming technique for
merging the results from the different engines and our current merging technique, which uses statistical
language modeling to select among the different technique outputs. He has also been involved in LTI
projects in Cross-Language Information Retrieval, Question Answering, and Information Extraction
from email, among other things.

Professor Frederking believes that successful advising and teaching hinge largely on successful communication: presenting advice (or a lecture), understanding what (if anything) the student is having trouble with, and then providing the information or guidance that he or she needs to resolve any difficulties. As the Chair of the LTI's graduate programs, he is the default advisor for students who are not project-supported.

Robert Frederking
Senior Systems Scientist, Director LTI Graduate Programs
BS Computer Engineering, Case Western Reserve University 1977
PhD Computer Science, AI, Carnegie Mellon University 1986
www.cs.cmu.edu/~ref/
speech-to-speech MT * rapid-development wide-coverage MT * question answering
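The statistical language-modeling selection step in Multi-Engine MT can be illustrated with a toy version: score each engine’s output with a language model and keep the most fluent one. The bigram counts and candidate strings below are invented for illustration; real MEMT merges hypotheses at the phrase level via chart search rather than picking whole sentences:

```python
# Toy multi-engine MT selection: each "engine" proposes a full translation,
# and a tiny smoothed bigram language model picks the most fluent candidate.
import math

# Invented bigram counts standing in for a model trained on real text.
BIGRAM_COUNTS = {("the", "cat"): 5, ("cat", "sat"): 4, ("sat", "down"): 3}

def lm_score(sentence, alpha=0.1):
    """Sum of log(count + alpha) over bigrams: a crude smoothed LM score."""
    words = sentence.lower().split()
    score = 0.0
    for bigram in zip(words, words[1:]):
        score += math.log(BIGRAM_COUNTS.get(bigram, 0) + alpha)
    return score

def select_best(candidates):
    """Pick the candidate translation the language model scores highest."""
    return max(candidates, key=lm_score)

engines = ["the cat sat down", "the cat seated down"]
print(select_best(engines))  # the fluent candidate wins
```

The additive smoothing constant keeps unseen bigrams from scoring log(0); in a real system the LM would be trained on large target-language corpora.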
Alex Hauptmann's research aims to design and build intelligent programs that process large volumes of multimedia data, including text, image, video, and audio, and make the data useful for other applications, so as to improve speech recognition, image understanding, NLP, machine learning, question answering and IR. The challenge is to find the right data, to process it into a suitable form for training, learning, or re-use, and to build mechanisms that can successfully utilize this data.

This work takes place in the context of the Informedia digital video project, which aims to achieve machine understanding of video and film media, including all aspects of search, retrieval, visualization and summarization in both current and archival content collections. The base technology developed under Informedia combines speech, image and natural language understanding to automatically transcribe, segment and index linear video for intelligent search and image retrieval.

Alex Hauptmann
Senior Systems Scientist
BA Psychology, Johns Hopkins University 1982
MA Psychology, Johns Hopkins University 1982
Diplom Computer Science, Technische Universität Berlin 1984
PhD Computer Science, Carnegie Mellon University 1991
www.cs.cmu.edu/~alex
multimedia analysis * multimedia interfaces * Informedia digital video library
How does sequence map to structure and function of proteins in different organisms? Dr. Klein-
Seetharaman takes a linguistically inspired view of this question in analogy to “How do words map to
meaning in natural languages?” using stochastic language modeling technologies. Computational models
are validated experimentally by interdisciplinary (biochemical and biophysical, in particular NMR
spectroscopic) studies of purified proteins and model peptide sequences. The emphasis lies on testing
predicted sequence dependence on structural and dynamic aspects of folding/misfolding and func-
tional properties of proteins. Specific proteins that are expressed, purified and studied experimentally
in Dr. Klein-Seetharaman’s laboratory include the G-protein coupled receptor rhodopsin, the glutamate
receptors and the epidermal growth factor receptor. These systems function in diverse signal transduction pathways, but resemble each other in their mechanism of action. Each receptor undergoes substantial conformational changes during the signaling process, and the investigation of the precise molecular details of these changes is instrumental to elucidating the molecular mechanism of signaling by these molecules.

Judith Klein-Seetharaman
Assistant Professor, Dept. of Pharmacology, Univ. of Pittsburgh School of Medicine; Research Scientist, LTI
Diplom in Biology, Univ. of Cologne, Germany 1995
Diplom in Chemistry, Univ. of Cologne, Germany 1996
PhD Biological Chemistry, MIT 2000
www.cs.cmu.edu/~judithks
computational biology/bioinformatics * biochemistry/biophysics * structural biology
The central focus of John Lafferty's research is machine learning, including algorithms, theory, and
statistical methods for learning from data. The motivating applications for this work most often come from text and natural language processing, information retrieval, and other areas of language technologies. For example, in recent work with his colleagues he has studied approximate inference
algorithms for a family of mixture models appropriate for document collections, and applied the
algorithms to automatically extract the subtopic structure of scientific articles. Over several years
Professor Lafferty has been involved in the development of a language modeling approach to infor-
mation retrieval, including a general approach to IR based on decision theory. In other work he is
researching learning algorithms for sequential and graph-structured data, using a framework called
conditional random fields for combining the strengths of graphical models with discriminative classi-
fication methods such as support vector machines and logistic regression.
John Lafferty
Professor (CSD, LTI)
BA, Middlebury College 1982
MS, Princeton University 1984
PhD Mathematics, Princeton University 1986
www.cs.cmu.edu/~lafferty
natural language processing * machine learning * information theory
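The language-modeling approach to information retrieval mentioned above can be sketched as query-likelihood scoring: treat each document as a smoothed unigram language model and rank documents by the probability their model assigns to the query. The tiny corpus and the Jelinek-Mercer smoothing weight below are illustrative assumptions:

```python
# Toy query-likelihood retrieval: each document is a unigram language model,
# smoothed against the whole collection, and documents are ranked by the
# log-probability they assign to the query.
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc model) with Jelinek-Mercer collection smoothing."""
    doc_counts = Counter(doc.lower().split())
    coll_counts = Counter(w for d in collection for w in d.lower().split())
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    score = 0.0
    for word in query.lower().split():
        p_doc = doc_counts[word] / doc_len
        p_coll = coll_counts[word] / coll_len
        p = lam * p_doc + (1 - lam) * p_coll
        score += math.log(p) if p > 0 else float("-inf")
    return score

def rank(query, collection):
    """Return the collection sorted from most to least likely to 'generate' the query."""
    return sorted(collection,
                  key=lambda d: query_likelihood(query, d, collection),
                  reverse=True)

docs = ["language models for retrieval", "speech recognition systems"]
print(rank("language retrieval", docs)[0])
```

Smoothing against the collection model is what keeps a document from being eliminated outright when it misses a single query term.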
Alon Lavie's main areas of research are Machine Translation (MT) of both text and speech, and
Spoken Language Understanding (SLU). His current most active research is on the design and devel-
opment of new approaches to Machine Translation, for languages with limited amounts of data re-
sources. He has also worked extensively on the design and development of Speech-to-Speech Ma-
chine Translation systems and on robust parsing algorithms for analysis of spoken language.
Professor Lavie is co-PI of the AVENUE project (funded by NSF/ITR), where we are developing a
general framework for building prototype MT systems for languages for which only scarce amounts
of data and linguistic resources are available. He also works on parsing algorithms for spoken lan-
guage analysis of databases of transcribed spoken language (such as CHILDES). He was co-PI of the
Nespole! and C-STAR speech translation projects and of the LingWear and Babylon mobile speech
translation projects, where he directed the design and development of the analysis and translation components.

Alon Lavie
She is also part of a consortium for designing semantic interlingual representations of text meaning.
We are using multi-parallel corpora (multiple versions of the same text) to center in on what is com-
mon among sentences that are supposed to convey the same meaning. In addition to the interlingua
design, the consortium is producing annotated multi-parallel corpora, tools for annotation, and evaluation metrics.

Her other interests include computer-assisted language learning, especially tools to assist second language readers with comprehension of authentic texts.

Lori Levin
Associate Research Professor
BA Linguistics, University of Pennsylvania 1979
PhD Linguistics, MIT 1986
www.cs.cmu.edu/~lsl
minority languages * machine translation * interlingua representations * lexicons
• CAMMIA (Conversational Agent for Multilingual Mobile Information Access): A system which
extends VoiceXML with NLP and dialog management to support dynamic multi-task dialogs in
Japanese and English.
• KANT (Knowledge-based Accurate Natural Language Translation): A project founded in 1991 for the research and development of large-scale, practical translation systems for technical documentation. KANT uses a controlled vocabulary and grammar for each source language, and explicit, yet focused semantic models for each technical domain, to achieve very high accuracy in translation.

Teruko Mitamura teaches the courses Machine Translation, Grammars and Lexicons, and LT for CALL.

Teruko Mitamura
Associate Research Professor, LTI Finance Director
MA Linguistics, University of Pittsburgh 1985
PhD Linguistics, University of Pittsburgh 1989
www.cs.cmu.edu/~teruko
knowledge-based MT * question answering * Japanese NLP and dialog systems
• Open-Domain Question Answering. The JAVELIN project combines natural language dialog, in-
formation retrieval, text understanding, fact extraction, and probabilistic reasoning to answer com-
plex questions about entities, relationships and events expressed in unstructured text.
• Conversational Agents for Mobile Multilingual Information Access. The CAMMIA project is cre-
ating speech dialog systems for robust, multi-task dialogs in mobile environments such as car
navigation systems.
• Knowledge-Based Machine Translation. Since the late 1980s he has worked on controlled language, document checking and machine translation for technical documentation; the current system, KANTOO, is now in use at Caterpillar, Inc.

Professor Nyberg also teaches a two-course series on software engineering and information technology, where students learn about software analysis, design, and construction in the context of real-world team projects.

Eric Nyberg
Associate Professor
BA Computer Science, Boston University 1983
PhD Computational Linguistics, Carnegie Mellon University 1992
www.cs.cmu.edu/~ehn
• Speech interaction with PDAs, web portals, and robots is now feasible. But what is the ideal style for human-machine speech communication? Natural language interfaces are easy for people, yet they are brittle, difficult to develop, and they strain recognition technology. Furthermore, by trying to emulate people, they fail to communicate the functional limitations of the machine. Are there better alternatives? The Speech Graffiti (aka USI) project is designing and evaluating new speech-based interaction paradigms.

Roni Rosenfeld
Professor
BS Mathematics and Physics, Tel-Aviv University 1985
MS Computer Science, Carnegie Mellon University 1991
PhD Computer Science, Carnegie Mellon University 1994
www.cs.cmu.edu/~roni
statistical language modeling * speech recognition/interfaces * machine learning
Alex Rudnicky's research centers on interactive systems that use speech. He is interested in the following
problems:
• Speech systems that learn: his research attempts to develop a process that, given an abstract
specification of capabilities, supports the automatic configuration of a speech system for an inter-
active task, and then supports incremental learning over the life of the application.
• Automatic detection and recovery from error: Automatic systems cannot easily detect and recover
from communication breakdowns. We can, however, use features of recognition, understanding,
and dialog to predict the likelihood of misunderstanding at a given instance, and then apply heu-
ristic strategies for guiding the conversation back onto track.
• A theory of language design for speech-based interactive systems: Speech-mode communication predisposes the user to choose certain words and grammatical preferences. Understanding the underlying principles of these preferences (and how these are influenced by the system's language) leads to better language design for interactive systems.

• The role of speech in the computer interface: We can analyze an interface in terms of its intended task(s), costs of interactions, and the perceived user value. We've studied models based on time, system error, and task structure, which are useful for simple systems and appear to be extensible to more complex systems.

Alex Rudnicky
Principal Systems Scientist
BS Psychology, McGill University 1975
MS Psychology, Carnegie Mellon University 1976
PhD Psychology, Carnegie Mellon University 1980
www.cs.cmu.edu/~rudnicky
In her teaching, Tanja Schultz combines an introduction to theoretical foundations, essential algorithms, and state-of-the-art system strategies with experimental practice. Combining these two facets, students have the opportunity to gain a deep understanding of the theory as well as hands-on expertise to develop speech recognition and understanding systems. Her goal is to provide a foundation and help them to explore their own ideas.

Tanja Schultz
Research Scientist
MS Mathematics and Physical Education, University of Heidelberg
MS Computer Science, University of Karlsruhe 1995
PhD Computer Science, University of Karlsruhe 2000
www.cs.cmu.edu/~tanja
speech recognition * human-human communication * human-machine communication
Michael Shamos' research interests include digital libraries, language identification, electronic voting,
electronic negotiation, Internet law and policy, and experimental mathematics. As Co-Director of the
Institute for eCommerce, he runs the technology side of the Master of Science in Electronic Commerce
program. Additionally, he teaches the courses Ecommerce Technology, Electronic Payment Systems,
and Ecommerce Law and Regulation. He is also the Director of the Universal Library project.
Professor Shamos' business and consulting experience includes serving as an expert witness in computer software and electronic voting cases, as an examiner of electronic voting systems, as a consultant on electronic voting, and as an arbitrator in computer-related disputes for the American Arbitration Association. Additionally, he was a Supervisory Programmer with the National Cancer Institute from 1970 to 1972 while a commissioned officer in the United States Public Health Service.

Michael Shamos
Distinguished Career Professor; Co-Director, Institute for eCommerce; Director, Universal Library
AB Physics, Princeton University 1968
MA Physics, Vassar College 1970
PhD Computer Science, Yale University 1978
JD Duquesne University 1981
www.ecom.cmu.edu/shamos.html
digital libraries * language identification * Internet policy * electronic negotiation
Rich Stern's research group develops techniques that improve the accuracy of speech recognition sys-
tems in difficult acoustical environments. They deal with problems in recognition accuracy resulting
from additive noise sources, background music, competing talkers, room reverberation, and other sources
of degradation such as non-native accents or spontaneous speech production. His group has been
developing creative solutions to these problems using classical statistical compensation techniques,
microphone arrays, and signal processing based on auditory physiology and perception. He has also
worked in the areas of language modeling, the integration of phonetic, syntactic, and semantic informa-
tion, and multimodal fusion of information. In addition to his speech recognition work, he has also
maintained an active research program in psychoacoustics, where he is best known for theoretical work
in binaural perception.

Professor Stern has taught the ECE courses in digital signal processing and signals and systems for many years, along with other courses in the general areas of communication theory and acoustics. He frequently lectures on speech recognition, signal processing, speech perception, and speech production for various LTI courses.

Richard Stern
Professor of Electrical and Computer Engineering
BS Electrical Engineering, MIT 1970
MS Electrical Engineering and Computer Sciences, University of California, Berkeley 1972
PhD Electrical Engineering and Computer Science, MIT 1977
www.ece.cmu.edu/~rms
The major theme of Professor Xing's research is understanding and modeling how living systems
function and evolve based on mathematical principles, and developing probabilistic inference and
learning algorithms for computational biology and for generic intelligent systems of a wide range of
applications such as vision, IR and NLP. Currently, his projects largely fall into two categories:
• Computational Biology, with an emphasis on developing formal models and algorithms that
address problems of practical biological and medical concerns, such as, 1) modeling genome-
microenvironment interactions in cancer development and embryogenesis via joint analysis of
genomic, proteomic, cytogenetic and pathway signaling data; 2) statistical inference of haplotype,
linkage and pedigree for genetic, clinical and forensic applications; and 3) modeling substitution, recombination, selection and genome rearrangement for comparative genomic analysis.

• Statistical Machine Learning, emphasizing theory and algorithms for learning complex probabilistic models, learning with prior knowledge, and reasoning under uncertainty. We focus on: 1) variational inference/learning theory and algorithms; 2) algorithms and applications of Bayesian nonparametrics and hierarchical Bayesian models in data mining; and 3) probabilistic and optimization-theoretic methods for semi-supervised learning and kernel machines.

Eric Xing
Assistant Professor
MS Computer Science, Rutgers University 1998
PhD Molecular Biology and Biochemistry, Rutgers University 1999
PhD Computer Science, University of California, Berkeley 2004
www.cs.cmu.edu/~epxing
computational biology * machine learning * computational statistics
Yiming Yang is a Professor in the Language Technologies Institute and the Computer Science Department at Carnegie Mellon University. Professor Yang received her B.S. degree in Electrical Engineering, and her Ph.D. degree in Computer Science (Kyoto University, Japan). Her research has centered on statistical classification methods and their applications to a variety of challenging problems in the real world, including automated text classification, novelty detection and event tracking, protein sequence analysis, cross-language information retrieval, web mining for multimedia question answering, and intelligent email filtering and organization.

Professor Yang teaches Information Retrieval and Advanced Statistical Learning courses.

Yiming Yang
Professor
MS Computer Science, Kyoto University 1982
PhD Computer Science, Kyoto University 1986
www.cs.cmu.edu/~yiming
David Evans
founder Clairvoyance
former Director of the Laboratory for Computational Linguistics, CMU
former Director of the Academic Computational Linguistics Program, CMU
Phil Hayes
co-founder Carnegie Software Partners
former faculty member Computer Science Department, CMU
Vibhu O. Mittal
Senior Scientist, Google, Inc.
Raúl E. Valdés-Pérez
co-founder and President of Vivisimo Inc
former faculty member Computer Science Department, CMU