
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 2, Mar-Apr 2015

RESEARCH ARTICLE

OPEN ACCESS

Hybrid-Statistical Machine Translation From English to Hindi

Srishti Dhamija [1], Kriti Aggarwal [2], Shashi Pal Singh [3], Ajai Kumar [4]
Banasthali Vidyapith [1] & [2], Banasthali
AAI, Centre for Development of Advanced Computing [3] & [4], Pune - India

ABSTRACT
The fundamental aim of this paper is to take a fragment written in English and translate it into Hindi using a combined statistical and rule-based approach that produces an accurate translation of the original sentence. An n-gram based language model, i.e. a type of probabilistic model, is combined with a syntax-based translation model that includes parsing using the CYK algorithm and word alignment by the IBM models. In this method, a tree frame is used as the statistical model, which is then combined with some linguistically motivated reordering rules to improve the accuracy of the lexical analysis system. Results are presented in terms of translation accuracy and efficiency.
Keywords:- Language Model, Syntax-Based Translation Model, Rule-Based Approach, Lexical Analysis, Reordering.

I. INTRODUCTION
Soon after the first electronic computers became available, Warren Weaver (1949) proposed [5] that computers would one day be able to take a document written in one human language as input and translate it automatically into another language, a task now referred to as Machine Translation. Broadly characterised, statistical machine translation (SMT) performs automatic text translation using statistical models and examples of translations: fragments of the input are matched against documents already translated by people and then stitched together. All knowledge of translation is gathered in a large collection of human-translated documents called a parallel corpus. This is a natural collection drawn from news articles, government proceedings, journals, websites, marketing material, etc. Other machine translation systems, developed according to their own paradigms, are also in use, mainly rule-based or example-based systems, but SMT has come to dominate academic research on MT and has attracted significant interest over the last two decades.
Statistical MT systems are statistical [5] because, among many possible approaches, they choose statistics and machine learning techniques, applied to parallel corpora, as the way of translating a document. Since language is full of nuance and ambiguity, any fragment (short or long) can be translated in many ways, and this task of choosing among translatable fragments is fundamental to statistical machine translation and its primary focus. A phrase-based translation system finds the phrase pairs in parallel corpora, which are stored with their frequency statistics. The evaluation of MT systems is an active research area in itself. Besides human judgement, the field relies on automatic measures of output quality such as the BLEU (Bilingual Evaluation Understudy) metric. It is a precision-based evaluation measure that collects statistics on a per-sentence basis, but these statistics are aggregated over a test corpus to provide a more robust evaluation.
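As an illustration of the idea behind BLEU (not the full metric, which uses up to 4-grams, multiple references and corpus-level aggregation), a minimal sketch of clipped n-gram precision with a brevity penalty might look like the following; the sentence pair is hypothetical:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # Clip each candidate n-gram count by its count in the reference.
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

def bleu(candidate, reference, max_n=2):
    # Geometric mean of modified precisions, scaled by a brevity penalty.
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

# Illustrative sentence pair (hypothetical):
print(bleu("Rahul is a good boy".split(), "Rahul is a very good boy".split()))
```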
Statistical methods are advantageous over non-statistical techniques as they produce better translations. The vague or ill-defined relationships between words, phrases and grammatical structures are captured by probability distributions and statistical techniques. A further benefit of these systems is that they need not rely on features of the languages involved, which enables machine translation systems to be built for multiple language pairs with minimal modifications to the technique. No doubt, knowledge of the languages involved is often needed for improved quality of translation. Additional language-specific information, including morphological features, reordering and grammatical models, can be incorporated by statistical models.

II. OVERVIEW OF SMT MODELS


The goal of machine translation is to translate an English input sentence f into an output Hindi sentence e that has the same meaning as f. We do this by building a statistical model of the translation process and finding
e = argmax P(e|f)
Brown et al. (1993) [4] introduced the source-channel model, where e, the output-language sentence, is viewed as being generated by the source with probability P(e), defined by the language model, and then passed through the translation channel to produce f, the input-language sentence, according to the translation probability P(f|e). The task of the translation system is to determine e from the observed sentence f, and the best translation is found by computing:
e = argmax P(f|e) P(e)

Using Bayes theorem, this problem can be decomposed as:
e = argmax P(e|f) = argmax P(f|e) P(e) / P(f)
Since the source text f is constant across all alternative translations, P(f) can be disregarded, giving
e = argmax P(f|e) P(e)

So this generative model [4], which results from the decomposition of P(e|f), produces the two fundamental components of basic SMT: the language model P(e) and the translation model P(f|e). The language model scores a candidate translation regardless of the input text, whereas the translation model conditions the search for the best translation on the input text; together, these two components provide fluency and adequacy to the translated text. The third component is the decoder, a module that performs the search for the best translation e over the space of all possible translations, depending on the probability estimates P(e) and P(f|e).

[Fig. 1 Statistical Machine Translation: a training corpus supplies the translation model on the source side and the language model on the target side; the decoder uses both to produce the target sentence.]
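To make the role of the two models concrete, here is a minimal sketch of the argmax search over a toy candidate set; the candidate strings (romanized Hindi) and log-probabilities are assumptions for illustration, not the output of a real system:

```python
# Toy log-probability tables (assumed for illustration; a real system learns
# P(f|e) from a parallel corpus and P(e) from a monolingual corpus).
translation_logprob = {  # log P(f|e): adequacy
    "raam achchha ladka hai": -1.2,
    "achchha raam hai ladka": -1.0,
}
language_logprob = {     # log P(e): fluency
    "raam achchha ladka hai": -0.5,
    "achchha raam hai ladka": -4.0,
}

def score(e):
    # log P(f|e) + log P(e), i.e. the noisy-channel objective in log space.
    return translation_logprob[e] + language_logprob[e]

# e* = argmax_e P(f|e) P(e): the fluent word order wins overall
# even though its translation-model score is slightly lower.
best = max(translation_logprob, key=score)
print(best)  # raam achchha ladka hai
```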

A. Language Model
The LM [4] tries to estimate the likelihood of a given sentence translation in the target language. The more common the sentence is, the more likely it is a good translation, mainly in terms of fluency. This is done by counting the relative number of occurrences of the sentence in a monolingual corpus. P(e) for a sentence with m words is defined as the joint probability of the sequence of all words in that sentence:
P(e) = P(w1, w2, ..., wm)
This is then decomposed into a series of conditional probabilities by applying the chain rule:
P(e) = P(w1) P(w2|w1) P(w3|w1w2) P(w4|w1w2w3) ... P(wm|w1...wm-1)
So the probability of a word w, given a number of previous words, is calculated using Maximum Likelihood Estimation (MLE), i.e. the count of occurrences of the complete sequence divided by the count of the conditioning sequence:
P(w3|w1w2) = count(w1w2w3) / count(w1w2)
1) N-Grams:



In a large corpus, the chance of finding occurrences of a given new sentence to translate is very small. If not even a single occurrence of the sequence of words is seen in the corpus, P(e) will tend to 0 and so will P(e|f). The solution is to find occurrences of parts of such sentences, more specifically n-grams [4], i.e. sequences of up to n words. The larger n is, the more information is available about the context of the sequence; the smaller n is, the more reliable the model, because more cases will be seen in the training data and the statistical estimates will be better. Generally, the size of n varies according to the size of the corpus: the greater the corpus, the longer the n-grams that can be counted. N-gram models are based on the Markov assumption that the probability of a word, rather than being conditioned on its entire history, can be calculated given only the last few words, as in bigrams:
P(e) = P(w1) P(w2|w1) P(w3|w2) ... P(wm|wm-1)
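A minimal sketch of bigram MLE estimation as described above, on an assumed toy monolingual corpus (romanized Hindi, for illustration only):

```python
from collections import Counter

# Toy monolingual target-side corpus (assumed for illustration).
corpus = [
    "<s> raam achchha ladka hai </s>".split(),
    "<s> raam ghar jaata hai </s>".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

def p_bigram(w2, w1):
    # MLE: count(w1 w2) / count(w1), the Markov-assumption estimate.
    return bigrams[(w1, w2)] / unigrams[w1]

def p_sentence(sent):
    # P(e) = P(w2|w1) P(w3|w2) ... under the bigram Markov assumption.
    p = 1.0
    for w1, w2 in zip(sent, sent[1:]):
        p *= p_bigram(w2, w1)
    return p

# "achchha" follows "raam" in one of two cases, everything else is determined:
print(p_sentence("<s> raam achchha ladka hai </s>".split()))  # 0.5
```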
B. Translation Model
The second stage [4] of an SMT system is translation modelling, which includes the step of word alignment over the sentence-aligned bilingual corpus. Most systems still use generative models for this purpose, such as the one implemented in the freely available tool GIZA++. It is an implementation of the IBM alignment models, which treat word alignment as a hidden process and maximize the probability of (e, f) pairs using the Expectation-Maximization (EM) algorithm.
For better alignment of Indian languages, information about cognates is needed, as Indian languages have borrowed a large number of words from English. A cognate list prepared by CPMS (computational phonetic model of scripts) is added to the bilingual corpora for the initialization of the EM algorithm.
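A minimal sketch of how such cognate seeding might be wired in, assuming a toy corpus and an illustrative cognate list (the real list comes from CPMS; the pairs and romanized spellings below are assumptions):

```python
# Assumed cognate list: English borrowings and their Hindi spellings,
# e.g. as produced by a phonetic model such as CPMS (pairs illustrative).
cognates = [("computer", "kampyootar"), ("station", "steshan"),
            ("doctor", "daaktar")]

# Sentence-aligned bilingual corpus: list of (english, hindi) token lists.
corpus = [("Rahul is a doctor".split(), "raahul ek daaktar hai".split())]

# Seed EM by appending each cognate pair as a tiny one-word "sentence pair",
# nudging the initial alignment probabilities toward the right links.
corpus.extend(([e], [h]) for e, h in cognates)
```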

2) CYK Algorithm
The Cocke-Younger-Kasami (CYK) algorithm [8] is a parsing algorithm for grammars in Chomsky Normal Form (CNF). It uses bottom-up parsing and dynamic programming. It has high efficiency in certain situations, and the worst-case running time of CYK is Theta(n^3 · |G|), where n is the length of the parsed string and |G| is the size of the CNF grammar.

ALGORITHM
Step 1: Get a POS tag for the SL sentence of length n via the Stanford parser, where n is the number of words.
Step 2: In a matrix of size n*n, assign these POS tags to the first row of the matrix, i.e. a[1][j] where j = 1, 2, 3, ..., n.
Step 3: For the (n-1)th row, check the first two consecutive tags against the production rules:
If a rule exists, assign the LHS non-terminal of the production to the first column of the (n-1)th row, then jump to the third column of the (n-1)th row and check the same for the next two column values of the first row, thus assigning a value to the third column of the (n-1)th row.
Else, check the next two consecutive tags, assign a value to the second column of the (n-1)th row, and advance by 1.
Iterate until the number of rows equals the number of words.
Here i is the row index and j is the column index of the table. The CYK algorithm correctly computes a[i][j] for all i and j; thus w is in L(G) if and only if S is in a[1][n].

Generating a parse tree
The algorithm above is only a recognizer: it determines whether the sentence is in the language. It can be extended into a parser that also constructs a parse tree by storing parse-tree nodes as the elements of the array. To build the tree structure, each node is linked to the array elements that were used to produce it. If all parse trees of the sentence are to be kept, it is necessary to store in each array element a list of all the ways the node can be obtained. This is done with back-pointers.

Production rules:
S -> NP VP
NP -> PRP$ NN
VP -> VBZ NN
String w: My wife is Chinese.
POS tags: My/PRP$, wife/NN, is/VBZ, Chinese/NN

[Fig. 2 Example of the CYK algorithm: chart over the string "My wife is Chinese" with cells PRP$(1,1), NN(1,2), VBZ(1,3), NN(1,4), NP(2,1), VP(2,3) and S, NP(4,1).]
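A minimal CYK recognizer over the CNF grammar of the example above; it uses the usual span-length/start-index convention rather than the row/column walk in the pseudocode, and takes the POS assignments as the terminal productions:

```python
from itertools import product

# CNF grammar from the example; binary rules map an RHS pair to its LHS.
binary_rules = {("NP", "VP"): "S", ("PRP$", "NN"): "NP", ("VBZ", "NN"): "VP"}
pos_tags = {"My": "PRP$", "wife": "NN", "is": "VBZ", "Chinese": "NN"}

def cyk(words):
    n = len(words)
    # table[length][start] = non-terminals deriving words[start:start+length]
    table = [[set() for _ in range(n)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        table[1][i].add(pos_tags[w])
    for length in range(2, n + 1):           # span length
        for start in range(n - length + 1):  # span start
            for split in range(1, length):   # split point inside the span
                left, right = table[split][start], table[length - split][start + split]
                for l, r in product(left, right):
                    if (l, r) in binary_rules:
                        table[length][start].add(binary_rules[(l, r)])
    return "S" in table[n][0]  # w is in L(G) iff S derives the whole string

print(cyk("My wife is Chinese".split()))  # True
```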

3) IBM Alignment Models 1 through 3
Och and Ney [1] describe statistical alignment as computing the probabilistic links between the SL string e and the target-language string h, together with the alignment a between positions in e and h; the notation m1..j abbreviates the token sequence m1 ... mj. The Hindi and English sentences contain H and E tokens respectively, and tokens in the two sentences are aligned to one another. The set of possible alignments is denoted by A, and for each English position e the alignment value ae holds the index of the corresponding Hindi token:
A = {(h, e) : h = 1, ..., H; e = 1, ..., E}, with h = ae
Using the above notation, the basic alignment model can be given as:
Pr(e1..E | m1..H) = Σ(a1..E) Pr(e1..E, a1..E | m1..H)
This is IBM Model 1.
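A minimal sketch of this notation with a toy translation table (values assumed for illustration): the alignment a maps each English position to the Hindi position that generates it (0 reserved for NULL), and Model 1 scores the pair by the product of word-translation probabilities, up to its uniform alignment and length terms:

```python
# Hindi tokens (position 0 reserved for NULL insertion) and English tokens.
hindi = ["NULL", "raahul", "achchha", "ladka", "hai"]
english = ["Rahul", "is", "good", "boy"]

# Alignment a: a[e] = index of the Hindi token generating english[e].
a = [1, 4, 2, 3]  # Rahul->raahul, is->hai, good->achchha, boy->ladka

# Assumed word-translation table t(e|h) (illustrative values only).
t = {("Rahul", "raahul"): 0.9, ("is", "hai"): 0.8,
     ("good", "achchha"): 0.7, ("boy", "ladka"): 0.8}

# IBM Model 1: Pr(e, a | h) is proportional to the product of t(e_j | h_{a_j}).
p = 1.0
for j, e_word in enumerate(english):
    p *= t[(e_word, hindi[a[j]])]
print(p)  # 0.4032
```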
Model 2
Model 2 overcomes a limitation of Model 1: it adds a way of distinguishing alignments that link words at opposite ends of the sentences from the most likely ones. The probability that the hth target word is connected to the eth source word is calculated by a distortion probability.
Model 3
Model 3 makes one-to-many translations possible, i.e. fertility-based alignment is introduced. Reverse distortion probabilities are assigned uniformly.
Problems in Word Alignment [1]:
Given a sentence-aligned parallel corpus, there can be many alignments of a single word in a sentence, but we aim to find the best of all, as shown:
Ex: Rahul is good boy (aligned word-by-word with its Hindi translation in the original figure)
There are several problems associated with this approach, based on the IBM models, which are dealt with by three sub-models:
Translation model
Distortion model
Fertility model

The first problem is to find the most likely translation of the given source language (SL) text, irrespective of positions. This is taken care of by the translation model [7].
Ex: Rahul is a good boy (one-to-one alignment)
The second problem is to align positions in the SL sentence with positions in the TL sentence, which is addressed by the distortion model [7]. The word orders of both languages are taken care of in this model.
Ex: NULL Rahul worked in CDAC (word order and spurious words)
The third problem is to find out the number of TL words generated from one SL word. Sometimes an SL word may generate no TL word, or a TL word may be generated by no SL word (NULL insertion). The fertility model [7] accounts for this.
Ex: Rahul is working in CDAC
These three models form the core of IBM-model-based generative SMT. Since English is an SVO language and Hindi is SOV, the task of the distortion model becomes harder. Apart from TAM (tense, aspect and modality), verbs also create errors in the fertility model, because TAM information is distributed over several words, which in turn reduces alignment accuracy. Using the cognate list can help in improving this.

4) EM Algorithm
The EM algorithm [1] is used to find maximum-likelihood parameters of the statistical model. It proceeds from the observation that the two sets of unknowns can be solved for alternately: estimate the alignment probabilities, use them to re-estimate the model parameters, then use those to obtain better alignment estimates, alternating between the two until the result converges to a fixed point.

[Fig. 3 EM algorithm: parameter initialization -> alignment probability calculation -> parameter re-estimation -> alignment probability recalculation -> converged? If no, repeat; if yes, output the final parameters and alignment.]
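To make the loop in Fig. 3 concrete, here is a minimal sketch of EM for IBM Model 1 on an assumed two-sentence toy corpus (romanized Hindi; the corpus and iteration count are illustrative):

```python
from collections import defaultdict

# Assumed sentence-aligned toy corpus (English, Hindi) for illustration.
corpus = [("good boy".split(), "achchha ladka".split()),
          ("good house".split(), "achchha ghar".split())]

e_vocab = {e for es, _ in corpus for e in es}
h_vocab = {h for _, hs in corpus for h in hs}
# Parameter initialization: uniform translation table t(e|h).
t = {(e, h): 1.0 / len(e_vocab) for e in e_vocab for h in h_vocab}

for _ in range(10):  # iterate until (approximate) convergence
    count, total = defaultdict(float), defaultdict(float)
    for es, hs in corpus:
        for e in es:
            norm = sum(t[(e, h)] for h in hs)  # E-step: alignment posteriors
            for h in hs:
                frac = t[(e, h)] / norm
                count[(e, h)] += frac
                total[h] += frac
    for e, h in t:  # M-step: re-estimate translation probabilities
        if total[h] > 0:
            t[(e, h)] = count[(e, h)] / total[h]

print(round(t[("good", "achchha")], 3))  # rises toward 1.0 across iterations
```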

5) Synonym Handling
Further, this paper gives a way to handle the problem of synonyms present in large bilingual corpora.
Ex: ram is a good boy, with its Hindi translation.
Synonyms of the word good are given in the corpus under different categories. Based on the categories and the user requirement, the appropriate meaning of the word is selected by the user at run time.
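A minimal sketch of this run-time selection, with hypothetical category labels and romanized Hindi candidates standing in for the corpus entries (both are assumptions, not the paper's actual lexicon):

```python
# Hypothetical synonym entries for "good" drawn from a bilingual corpus,
# keyed by usage category (labels and Hindi forms are illustrative).
synonyms = {"good": {"quality": "achchha", "moral": "nek", "kind": "bhala"}}

def choose_translation(word, category):
    # The user picks the category at run time; fall back to the first sense.
    senses = synonyms.get(word, {})
    return senses.get(category, next(iter(senses.values()), word))

print(choose_translation("good", "moral"))  # nek
```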

III. RULE BASED RE-ORDERING

Including rule-based techniques and morphological analysis gives better SMT accuracy [6]. In this paper we present our work of making some linguistic rules based on tense, modality, etc., so that the phrase-based models can be combined with reordering rules appropriate to the English language.

TABLE 1 LINGUISTIC RULES

TENSE | CONDITION FOR POS TAG | RULES
Simple present | Condition - VB/VBZ | I + do: concatenate with VB/VBZ/NN & do = [Hindi form]; He/She + does: concatenate with does = [Hindi form]; You/we/they + do: do = [Hindi form]
Simple present continuous | Condition - is/am/are + VBG | I: VBG + am = [Hindi form]; He/She: VBG + is = [Hindi form]; We/you/they: VBG + are = [Hindi form]
Simple past | Condition - VBD | -ed = [Hindi form]
Past continuous | Condition - was/were/did + VBG | He/She: VBG + & was = [Hindi form]; We/you/they/it + did: VBG + & were = [Hindi form]
Simple future | Condition - MD && !be | He/It/She/This: will = [Hindi form]; We/They/You: will = [Hindi form]
Future continuous | Condition - MD + be | I + VBG: VBG + & will = [Hindi form]; He/She + VBG: VBG + & will = [Hindi form]; You/we/they + VBG: VBG + & will = [Hindi form]

(The Hindi target forms on the right-hand side of each rule are not recoverable from this copy and are marked [Hindi form].)
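As an illustration of how one row of Table 1 could be applied (the is/am/are + VBG condition), here is a minimal sketch; the POS-tagged input format and the Hindi auxiliary placeholders are assumptions, since the table's Hindi forms did not survive extraction:

```python
# Hypothetical Hindi auxiliaries standing in for the table's Hindi outputs.
AUX = {"am": "HINDI_AUX_AM", "is": "HINDI_AUX_IS", "are": "HINDI_AUX_ARE"}

def reorder_present_continuous(tagged):
    """Apply the Table 1 row 'is/am/are + VBG': keep the VBG word and replace
    the English auxiliary with a clause-final Hindi auxiliary (SVO -> SOV)."""
    out, i = [], 0
    while i < len(tagged):
        word, pos = tagged[i]
        if word.lower() in AUX and i + 1 < len(tagged) and tagged[i + 1][1] == "VBG":
            out.append(tagged[i + 1][0])   # VBG +
            out.append(AUX[word.lower()])  # & is/am/are = [Hindi form]
            i += 2
        else:
            out.append(word)
            i += 1
    return out

print(reorder_present_continuous([("He", "PRP"), ("is", "VBZ"), ("playing", "VBG")]))
# ['He', 'playing', 'HINDI_AUX_IS']
```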

REFERENCES
[1] Khin Thandar Nwet and Ni Lar Thein, University of Computer Studies, Yangon, Myanmar: Word Alignment Based on Hybrid Approach for Myanmar-English Machine Translation, 2011.
[2] Cristina España i Bonet, LSI Department, Universitat Politècnica de Catalunya: Statistical Machine Translation - A Practical Tutorial, March 2010.
[3] Shweta Dubey and Tarun Dhar Diwan, Assistant Professor, Dr. C.V. Raman University, Bilaspur, India: Supporting Large English-Hindi Parallel Corpus Using Word Alignment, July 2012.
[4] Lucia Specia, University of Wolverhampton, Stafford Street: Fundamental and New Approaches to Statistical Machine Translation.
[5] James Brunning, Cambridge University Engineering Dept. and Jesus College: Alignment Models and Algorithms for Statistical Machine Translation, August 2010.
[6] Rahul C., Dinunath K., Remya Ravindran and K.P. Soman, Department of Computational Engineering & Networking, Amrita Vishwa Vidyapeetham, Coimbatore: Rule Based Reordering and Morphological Processing for English-Malayalam Statistical Machine Translation.
[7] G. Chinnappa and Anil Kumar Singh, Language Technologies Research Centre, International Institute of Information Technology, Hyderabad: A Java Implementation of an Extended Word Alignment Algorithm Based on the IBM Models, 2006.
