RESEARCH ARTICLE
OPEN ACCESS
ABSTRACT
The fundamental aim of this paper is to take a fragment written in English and translate it into Hindi using a combined statistical and rule-based approach that yields an accurate translation of the original sentence. An n-gram based language model, i.e. a type of probabilistic model, is combined with a syntax-based translation model that includes parsing with the CYK algorithm and word alignment with the IBM models. In this method, a tree frame is used as the statistical model, which is then combined with linguistically motivated reordering rules to improve the accuracy of the lexical analysis system. Results are presented in terms of translation accuracy and efficiency.
Keywords:- Language Model, Syntax-Based Translation Model, Rule-Based Approach, Lexical Analysis,
Reordering.
I. INTRODUCTION
Soon after the first electronic computers became available, Warren Weaver (1949) proposed [5] that computers would one day be able to take a document written in one human language as input and automatically translate it into another language, a task now referred to as Machine Translation (MT). Broadly characterised, statistical machine translation (SMT) translates text automatically using statistical models and examples of translations: it matches fragments of the input against documents already translated by people and then stitches the matched fragments together. All knowledge of translation is gathered from a large collection of human-translated documents, called a parallel corpus. Such collections arise naturally from news articles, government proceedings, journals, websites, marketing material, etc. Other machine translation systems, developed under different paradigms, are also in use, mainly rule-based or example-based systems. SMT has come to dominate academic research on MT systems and has attracted significant interest over the last two decades.
Statistical MT systems are statistical [5] because, among the many possible ways of translating a document, they rely on statistics and learning techniques whose parameters are gathered from parallel corpora. Language is full of nuance and ambiguity, so any fragment (short or long) can be translated in many ways. This task of translatable fragments is fundamental to
ISSN: 2347-8578
www.ijcstjournal.org
Page 48
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 2, Mar-Apr 2015
sentence, i.e. e, that has the same meaning as f. We do this by building a statistical model of the translation process and finding
\[ \hat{e} = \arg\max_{e} P(e \mid f) \]
Brown et al. (1993) [4] introduced the source-channel model, in which the output language sentence e is viewed as being generated by a source with probability P(e), defined by the language model, and then passed through a translation channel to produce the input language sentence f according to the translation probability P(f | e). The task of the translation system is to determine e from the observed sentence f, and the best translation is found by computing:
\[ \hat{e} = \arg\max_{e} P(f \mid e)\, P(e) \]
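The step from the first argmax to the second is not spelled out in the text. A short derivation, assuming only Bayes' rule and the fact that P(f) does not depend on the candidate e, can be written as:

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} P(e \mid f) \\
        &= \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)} \\
        &= \arg\max_{e} P(f \mid e)\, P(e)
\end{align*}
```

The denominator is dropped in the last line because it is the same for every candidate translation and therefore does not change which e attains the maximum.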
\[ P(e) = P(w_1, w_2, \ldots, w_m) \]
this problem can be
[Figure: block diagram; recoverable labels: Source, Translation, Corpus]
A. Language Model
The LM [4] estimates the likelihood of a given sentence in the target language. The more common the sentence is, the more likely it is to be a good translation, mainly in terms of fluency. This is done by counting the relative number of occurrences of its word sequences (n-grams) in a monolingual corpus. P(e) for a sentence with m words is defined as the joint probability of the sequence of all words in that sentence.
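As an illustration of how such a probability might be estimated, the following is a minimal sketch of a bigram language model with unsmoothed maximum-likelihood estimates. The choice of bigrams, the function names and the toy corpus are assumptions made for this example; the paper does not state the n-gram order or any smoothing method, and a real model would be trained on Hindi text rather than the English words used here for readability.

```python
from collections import Counter


def train_bigram_lm(sentences):
    """Count histories and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that serves as a history
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """Approximate P(w_1..w_m) as a product of P(w_i | w_{i-1}), without smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history: the unsmoothed estimate is zero
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob


# Hypothetical toy monolingual corpus.
corpus = ["ram is a good boy", "ram is a boy", "sita is a good girl"]
unigrams, bigrams = train_bigram_lm(corpus)

fluent = sentence_probability("ram is a good boy", unigrams, bigrams)
scrambled = sentence_probability("boy good a is ram", unigrams, bigrams)
```

On this toy corpus the fluent word order receives a higher probability than the scrambled one, which is the sense in which the language model rewards fluency. Without smoothing, any sentence containing an unseen bigram gets probability zero, so a practical system would add a smoothing scheme.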
[Figure: block diagram; recoverable labels: Training set, Decoder, Target language]
such as the one implemented in the freely available tool GIZA++. It is an implementation of the IBM alignment models, which treat word alignment as a hidden process and maximize the probability of (e, f) pairs using the Expectation Maximization algorithm.
For better alignment of Indian languages, information about cognates is needed, as Indian languages have borrowed a large number of words from English. The cognate list was prepared by CPMS (Computational Phonetic Model of Scripts) and is added to the bilingual corpora for initialization of the EM algorithm.
2) CYK Algorithm
[Figure: CYK parse chart. Words recovered from the figure: My, Chinese, wife, is. Cell labels recovered: PRP(1,1), NN(1,2), VBZ(1,3), NN(1,4), NP(2,1), VP(2,3), NN(3,2), PRP,VB(3, and S, NP(4,1).]
Production
S → NP VP
NP → PRP$ NN
VP → VBZ NN
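The chart-filling procedure behind the figure can be sketched as follows. This is a minimal CYK recogniser that uses the three productions listed above; the lexicon, the example sentence and all function names are hypothetical and chosen only so that the productions apply. The chart is indexed by (span length, start position), which appears to match the cell labels in the figure, although that reading is an assumption.

```python
from itertools import product

# Binary productions from the text, stored as (left child, right child) -> parents.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("PRP$", "NN"): {"NP"},
    ("VBZ", "NN"): {"VP"},
}

# Hypothetical lexicon mapping words to POS tags (not taken from the paper).
LEXICON = {"my": {"PRP$"}, "wife": {"NN"}, "is": {"VBZ"}, "teacher": {"NN"}}


def cyk_parse(words):
    """Return chart where chart[length - 1][start] is the set of symbols
    that derive words[start:start + length]."""
    n = len(words)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        chart[0][i] = set(LEXICON.get(word.lower(), ()))
    for length in range(2, n + 1):               # span length
        for start in range(n - length + 1):      # span start
            for split in range(1, length):       # size of the left part
                left = chart[split - 1][start]
                right = chart[length - split - 1][start + split]
                for pair in product(left, right):
                    chart[length - 1][start] |= BINARY_RULES.get(pair, set())
    return chart


def recognises(words, start_symbol="S"):
    """True if the whole word sequence can be derived from start_symbol."""
    if not words:
        return False
    chart = cyk_parse(words)
    return start_symbol in chart[len(words) - 1][0]


accepted = recognises("my wife is teacher".split())
rejected = recognises("is my wife teacher".split())
```

For the four-word example, the cell for span length 2 at position 0 holds NP, the cell for span length 2 at position 2 holds VP, and the top cell holds S. CYK requires the grammar to be in Chomsky normal form, which the three binary productions already satisfy.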
align words on the other end of the sentence from the most likely ones. The probability that the hth target word is connected to the eth source word is given by the distortion probability.
Model 3
This makes one-to-many translations possible, i.e. fertility-based alignment is introduced. Reverse distortion probabilities are assigned uniformly [1].
Problems in Word Alignment:
Given a sentence-aligned parallel corpus, there can be many alignments of a single word in a sentence, but we aim to have the best of all, as shown:
Ex: Rahul is good boy
There are several problems associated with this approach, based on the IBM models, which can be dealt with through the following:
Translation model
Distortion model
Fertility model
4) EM Algorithm
The EM algorithm [1] is used to find maximum likelihood parameters of the statistical model. It proceeds from the observation that there are two sets of unknowns that can be solved in alternation. First, the current parameters are used to find the probability of the translated output. This is then revised with the alternative meanings as required, and the second set of probabilities is estimated. Both are then used to obtain a better estimate, alternating between the two until the result converges to a fixed point.
Parameter initialization
Parameter re-estimation
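The two steps above can be sketched for the simplest case, IBM Model 1, where only word translation probabilities are estimated. This is an illustrative sketch under stated assumptions, not the GIZA++ implementation: it omits the NULL word, distortion and fertility, and the function name, the romanised Hindi words and the toy sentence pairs are hypothetical.

```python
def train_ibm_model1(pairs, iterations=10):
    """pairs: list of (source_tokens, target_tokens).
    Returns t where t[(tgt, src)] approximates P(tgt | src)."""
    target_vocab = {tgt for _, tgts in pairs for tgt in tgts}

    # Parameter initialization: uniform over the target vocabulary.
    t = {}
    for srcs, tgts in pairs:
        for src in srcs:
            for tgt in tgts:
                t[(tgt, src)] = 1.0 / len(target_vocab)

    for _ in range(iterations):
        count = {key: 0.0 for key in t}
        total = {}
        # E-step: collect expected alignment counts under the current t.
        for srcs, tgts in pairs:
            for tgt in tgts:
                norm = sum(t[(tgt, src)] for src in srcs)
                for src in srcs:
                    fraction = t[(tgt, src)] / norm
                    count[(tgt, src)] += fraction
                    total[src] = total.get(src, 0.0) + fraction
        # M-step (parameter re-estimation): normalise the expected counts.
        for (tgt, src), c in count.items():
            t[(tgt, src)] = c / total[src]
    return t


# Hypothetical toy parallel corpus: English source, romanised Hindi target.
pairs = [
    (["boy"], ["ladka"]),
    (["good", "boy"], ["achchha", "ladka"]),
]
t = train_ibm_model1(pairs)
```

After a few iterations the probability mass for "boy" concentrates on "ladka", which in turn pushes "good" towards "achchha", even though no pair aligns "good" on its own. This is the alternating refinement that the text describes.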
Ex: ram is a good boy
[Figure: Hindi translation of the example sentence; recoverable label: CDAC]
Synonyms of the word "good", as given in the corpus, are [...] and [...]. Based on the categories and the user requirement, the appropriate meaning of the word is selected by the user at run time.
TENSE | CONDITION FOR POS TAG | RULES
Simple present | VB/VBZ | I + do: concatenate [...] with VB/VBZ/NN & do = [...]; He/She + does: concatenate [...] with does = [...]
Simple present continuous | is/am/are + VBG | VBG + [...] & am = [...]; is = [...]; are = [...]
Simple past | VBD | ed [...]
Past continuous | was/were/did + VBG | VBG + [...] & was = [...]; were = [...]
Simple future | MD && !be | He/It, She/This: will = [...]
Future continuous | MD + be | I/He/She + VBG: VBG + [...] & will = [...]
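The POS tag conditions in the table can be read as a small decision procedure for identifying the tense of an English clause before the Hindi rule is applied. The sketch below encodes only those conditions. The function name, the input format (a list of word and Penn Treebank tag pairs) and the order in which the conditions are checked are assumptions, and the Hindi output side of each rule is not reproduced because it is not available in the text.

```python
def detect_tense(tagged):
    """tagged: list of (word, POS tag) pairs for one clause.
    Returns the tense label whose POS condition matches first."""
    words = [word.lower() for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    has_vbg = "VBG" in tags

    # Condition MD + be, versus MD && !be.
    if "MD" in tags:
        return "future continuous" if "be" in words else "simple future"
    # Condition was/were/did + VBG. Checked before VBD because "was" is tagged VBD.
    if has_vbg and any(w in ("was", "were", "did") for w in words):
        return "past continuous"
    # Condition is/am/are + VBG.
    if has_vbg and any(w in ("is", "am", "are") for w in words):
        return "simple present continuous"
    # Condition VBD.
    if "VBD" in tags:
        return "simple past"
    # Condition VB/VBZ.
    if "VB" in tags or "VBZ" in tags:
        return "simple present"
    return "unknown"


example = [("He", "PRP"), ("will", "MD"), ("be", "VB"), ("playing", "VBG")]
tense = detect_tense(example)
```

The continuous tenses are tested before the simple ones because their auxiliaries carry the same tags (VBD, VBZ) that trigger the simple past and simple present conditions. Checking in the opposite order would misclassify "He was playing" as simple past.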
REFERENCES
[1] Khin Thandar Nwet and Ni Lar Thein, University of Computer Studies, Yangon, Myanmar: Word Alignment based on Hybrid Approach for Myanmar-English Machine Translation | Issue: 2011
[2] Cristina Espana i Bonet, LSI Department, Universitat Politecnica de Catalunya: Statistical Machine Translation - a practical tutorial | Issue: March 2010
[3] Shweta Dubey and Tarun Dhar Diwan, Assistant Professor, Dr. CV Raman University, Bilaspur, India: Supporting large English-Hindi parallel corpus using word alignment | Issue: July 2012
[4] Lucia Specia, University of Wolverhampton, Stafford Street: Fundamental and New Approaches to Statistical Machine Translation.
[5] James Brunning, Cambridge University Engineering Dept. and Jesus College: Alignment Models and Algorithms for Statistical Machine Translation | Issue: August 2010.
[6] Rahul C., Dinunath K., Remya Ravindran, K. P. Soman, Department of Computational Engineering & Networking, Amrita Vishwa Vidyapeetham, Coimbatore: Rule Based Reordering and Morphological Processing For English-Malayalam Statistical Machine Translation.