
Vietnamese Text Normalization Method
Student: Phan Thanh Vinh
Outline
1. NSW categorization
2. Model of text normalization in Vietnamese


1. NSW Categorization
1. NUMBERS
NTIM: time (1:30)
NDAT: date (17/3/87, 1/3/2010)
NDAY: day and month (17/3, 03-05/3)
NMON: month and year (3/87, 3/2010, 3-5/87)
NNUM: number (2009, 70.000)
NTEL: telephone number (0912.334.577)
NDIG: number as digits (Mã số 999)
NSCR: score (Tỉ số là 3-5)
NRNG: range (Từ 3-5 ngày)
NPER: percentage (93%, 30-40%)
NFRC: fraction (34/6, 6/145)
NADD: address (số 14/3/2 phố Huế)
2. LETTERS
LWRD: read as a word (London, NATO)
LSEQ: letter sequence (ODA, GDP)
LABB: abbreviation (TS (tiến sĩ))
3. OTHERS
PUNC: punctuation ( ( ) * + - / )
URLE: URL, path name or email (http://soict.hut.vn)
MONY: money (2$, $2, VNĐ 9.000)
CSEQ: read all characters ( :), XXX )
DURA: duration (nghỉ), e.g. the "-" in scores (2-3)
NONE: ignored (ASCII art)
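The taxonomy above can be held in a small lookup table. The following is only a minimal sketch: the names NSW_GROUPS and group_of are hypothetical and do not come from the slides.

```python
# Hypothetical lookup table for the NSW taxonomy listed above.
# Group and category tags are copied from the slides; the structure is an assumption.
NSW_GROUPS = {
    "NUMBERS": ["NTIM", "NDAT", "NDAY", "NMON", "NNUM", "NTEL",
                "NDIG", "NSCR", "NRNG", "NPER", "NFRC", "NADD"],
    "LETTERS": ["LWRD", "LSEQ", "LABB"],
    "OTHERS": ["PUNC", "URLE", "MONY", "CSEQ", "DURA", "NONE"],
}


def group_of(category: str) -> str:
    """Return the group name that a category tag belongs to."""
    for group, categories in NSW_GROUPS.items():
        if category in categories:
            return group
    raise KeyError(f"unknown NSW category: {category}")
```

For example, group_of("NTEL") returns "NUMBERS", which lets later stages pick the classifier for the right group.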
2. Model of text normalization in Vietnamese
2.1 Detection of NSWs
2.1.1 Splitting of Tokens
1. Set aside the special cases that must not be split: URLs, emails and path names (URLE group), compound numbers, and money.
2. In telephone numbers, replace white space between digits with "." and remove extra white space ("09 15 33 45 77" becomes "09.15.33.45.77").
3. Place a single white space directly before and after every punctuation mark (e.g. HN.Vin).
4. Break the input text into whitespace-separated tokens.
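The four steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the regular expressions URLE_RE, PHONE_RE and PUNCT_RE and the function split_tokens are hypothetical, the telephone pattern only covers the ten-digit spaced form shown in the example, and money and compound-number protection is approximated by leaving ".", "," and ":" alone when they sit between digits.

```python
import re

# Assumed patterns; the slides do not give the exact rules.
URLE_RE = re.compile(r"(?:https?://|www\.)\S+|\S+@\S+\.\S+")
PHONE_RE = re.compile(r"\b0\d(?: \d{2}){4}\b")  # e.g. "09 15 33 45 77"
# Brackets and similar marks are always split off; ". , :" only when they
# are not surrounded by digits, so 70.000, 1:30 and 09.15.33.45.77 stay whole.
PUNCT_RE = re.compile(r"([()!?;\"]|(?<!\d)[.,:]|[.,:](?!\d))")


def split_tokens(text: str) -> list[str]:
    # Step 2: join the spaced digit groups of a telephone number with dots.
    text = PHONE_RE.sub(lambda m: m.group(0).replace(" ", "."), text)
    tokens = []
    for chunk in text.split():
        # Step 1: URLs and emails are kept whole.
        if URLE_RE.fullmatch(chunk):
            tokens.append(chunk)
            continue
        # Step 3: pad punctuation with spaces; step 4: split on white space.
        tokens.extend(PUNCT_RE.sub(r" \1 ", chunk).split())
    return tokens
```

With these assumptions, split_tokens("HN.Vin") yields three tokens ("HN", ".", "Vin"), while "09 15 33 45 77" comes back as the single token "09.15.33.45.77".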

2.1.2 Filtering of candidate NSWs
1. Each candidate NSW is tagged with <w> (<w>15/5</w>).
2. Some punctuation marks do not need to be spoken: full stops (.), commas (,), semicolons (;), colons (:), question marks (?) and exclamation marks (!).
3. Each NSW is then classified into one category of its group using the corresponding algorithm or technique (<w nsw="NDAY">15/5</w>).
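A minimal sketch of the filtering step follows. It assumes that a token counts as a standard word when it appears in a syllable lexicon; the tiny LEXICON set, the SILENT_PUNCT set and the function tag_candidates are hypothetical stand-ins, and a real system would load a full Vietnamese syllable list.

```python
# Punctuation marks that are not spoken, as listed in step 2.
SILENT_PUNCT = set(".,;:?!")
# Hypothetical stand-in lexicon of standard syllables (lowercase).
LEXICON = {"thứ", "tư", "ngày", "vào"}


def tag_candidates(tokens):
    """Wrap every token that is neither a known word nor silent punctuation in <w>."""
    tagged = []
    for tok in tokens:
        if tok in SILENT_PUNCT or tok.lower() in LEXICON:
            tagged.append(tok)              # left untagged
        else:
            tagged.append(f"<w>{tok}</w>")  # candidate NSW
    return tagged
```

The category attribute is filled in later by the group-specific classifiers; this step only marks candidates.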
2.2 Splitting of Compound NSWs
Compound NSWs cannot be classified into any single category and need to be split into two or more sub-NSWs.
Compound NSWs include tokens that mix:
1. digits and letters (e.g. I5108, 1m65)
2. letters and hyphens (-) or slashes (/) (e.g. kw/h, p- an)
3. lowercase and uppercase letters
Example: 8h30 is split as
<split><w>8</w><w>h</w><w>30</w></split>

Other NSWs that belong to a single group, such as numbers (telephone numbers 0914.392.492, dates 17/3/2010, ranges 30-40%, scores), do not need to be split.
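The splitting rule can be illustrated with a short sketch. The patterns PIECE_RE and NUMBER_LIKE_RE and the function split_compound are hypothetical; in particular, treating any token made only of digits and the symbols . / : % - as a non-compound number is an assumption, and the lowercase/uppercase case is not handled here.

```python
import re

# Pieces: a run of digits, a run of letters, or any other single character.
PIECE_RE = re.compile(r"\d+|[^\W\d_]+|.")
# Assumed: tokens made only of digits and . / : % - belong to the Number group.
NUMBER_LIKE_RE = re.compile(r"[\d./:%-]+")


def split_compound(token: str) -> str:
    """Return the token wrapped in <w>, or in <split> with one <w> per piece."""
    if NUMBER_LIKE_RE.fullmatch(token):
        return f"<w>{token}</w>"
    pieces = PIECE_RE.findall(token)
    if len(pieces) <= 1:
        return f"<w>{token}</w>"
    return "<split>" + "".join(f"<w>{p}</w>" for p in pieces) + "</split>"
```

Under these assumptions, split_compound("8h30") reproduces the <split> markup shown above, while a date such as 17/3/2010 stays a single <w> element.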
2.3 Classification of NSWs
2.3.1 Classification of categories in Number group
Build a decision tree over the following properties of each NSW:
Format properties
1. Number of characters
2. Number of digits
Context properties
1. Directly preceding tokens
2. Directly following tokens
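As an illustration, the format and context properties could be collected into one feature record per NSW before being passed to a decision-tree learner. The function number_features, the feature names and the boundary markers "<s>" and "</s>" are hypothetical, and only one token of context on each side is used here.

```python
def number_features(tokens, i):
    """Collect format and context properties for the token at position i."""
    tok = tokens[i]
    return {
        # Format properties
        "n_chars": len(tok),
        "n_digits": sum(ch.isdigit() for ch in tok),
        # Context properties (one token on each side; an assumption)
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }
```

A preceding token such as "ngày" (day) is the kind of context cue that would let a tree separate NDAY from NFRC for a string like 17/3.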
2.3.2 Classification of categories in Letter group
Each NSW is classified by estimating the probability that it belongs to each category,
where the observed string (l1, l2, ..., ln) is made up of n characters.
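The formula itself did not survive on the slide, so the following is only a generic sketch of the idea, not the authors' model. It assumes a Bayes-style score, prior of the category times the likelihood of the character string, with a character-bigram model and add-one smoothing. The class LetterClassifier and its methods are hypothetical.

```python
import math
from collections import Counter, defaultdict


class LetterClassifier:
    """Toy classifier: pick the tag maximising P(tag) * P(string | tag).

    P(string | tag) uses character bigrams with add-one smoothing;
    both choices are assumptions made for this sketch.
    """

    def __init__(self):
        self.bigrams = defaultdict(Counter)   # tag -> bigram counts
        self.contexts = defaultdict(Counter)  # tag -> left-character counts
        self.tags = Counter()                 # tag -> number of examples
        self.alphabet = set()

    def train(self, examples):
        for string, tag in examples:
            self.tags[tag] += 1
            chars = ["^"] + list(string) + ["$"]  # start and end markers
            self.alphabet.update(chars)
            for a, b in zip(chars, chars[1:]):
                self.bigrams[tag][(a, b)] += 1
                self.contexts[tag][a] += 1

    def log_score(self, string, tag):
        vocab = len(self.alphabet) + 1  # +1 slot for unseen characters
        score = math.log(self.tags[tag] / sum(self.tags.values()))
        chars = ["^"] + list(string) + ["$"]
        for a, b in zip(chars, chars[1:]):
            num = self.bigrams[tag][(a, b)] + 1
            den = self.contexts[tag][a] + vocab
            score += math.log(num / den)
        return score

    def classify(self, string):
        return max(self.tags, key=lambda t: self.log_score(string, t))
```

Trained on labelled examples such as ("London", "LWRD") and ("GDP", "LSEQ"), the classifier returns the tag with the highest log score; a realistic model would need far more training data and probably longer n-grams.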
2.3.3 Classification of categories in Others group
PUNC: the token is a speakable punctuation mark.
URLE: the token contains @, http://, www or a suffix such as .com, .net, .vn.
MONY: the token has a currency symbol ($, VNĐ, S$, SGD) before or after the amount of money.
CSEQ: other characters that can be spoken.
NONE: special cases that should not be read, such as ASCII art and formatting junk.
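These rules lend themselves to a simple rule cascade, sketched below under several assumptions. The function classify_other and all patterns are hypothetical; the currency list accepts both the VNĐ and VND spellings; the test for "junk" (one symbol repeated four or more times) is an invented heuristic, since the slides do not say how NONE is detected; DURA is not covered.

```python
import re

SPEAKABLE_PUNCT = set("()*+-/")  # taken from the PUNC examples
URLE_MARKERS = ("@", "http://", "www", ".com", ".net", ".vn")
CURRENCY = ("S$", "SGD", "VNĐ", "VND", "$")  # spellings are assumptions

_cur = "|".join(re.escape(c) for c in CURRENCY)
_amt = r"\d[\d.,]*"
# Currency symbol before or after the amount, with an optional space.
MONY_RE = re.compile(rf"(?:{_cur})\s?{_amt}|{_amt}\s?(?:{_cur})")
# Invented heuristic: one non-word symbol repeated at least four times.
JUNK_RE = re.compile(r"([^\w\s])\1{3,}")


def classify_other(token: str) -> str:
    if token in SPEAKABLE_PUNCT:
        return "PUNC"
    if any(marker in token for marker in URLE_MARKERS):
        return "URLE"
    if MONY_RE.fullmatch(token):
        return "MONY"
    if JUNK_RE.fullmatch(token):
        return "NONE"
    return "CSEQ"  # anything else is read character by character
```

The order of the checks matters: URLE is tested before MONY so that an address containing digits and a dot is not mistaken for an amount.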
2.4 Expanding of NSWs
Generate the full words for each NSW:
Thứ Tư <w nsw="PUNC" full="mở ngoặc">(</w> ngày
<w nsw="NDAY" full="mười lăm tháng năm">15/5</w>
<w nsw="PUNC" full="đóng ngoặc">)</w> vào
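For a day-and-month token such as 15/5, the expansion can be sketched as below. The functions read_under_100 and expand_nday are hypothetical, the number-reading rules are simplified (for instance, the common reading "mùng" for the first ten days of a month and "tư" for four are ignored), and ranges such as 03-05/3 are not handled.

```python
UNITS = ["", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def read_under_100(n: int) -> str:
    """Read an integer from 1 to 99 in Vietnamese (simplified rules)."""
    if not 1 <= n <= 99:
        raise ValueError("only 1..99 is supported in this sketch")
    tens, unit = divmod(n, 10)
    if tens == 0:
        return UNITS[unit]
    head = "mười" if tens == 1 else UNITS[tens] + " mươi"
    if unit == 0:
        return head
    if unit == 5:
        tail = "lăm"   # 5 after a tens word is read "lăm"
    elif unit == 1 and tens > 1:
        tail = "mốt"   # 1 after 20, 30, ... is read "mốt"
    else:
        tail = UNITS[unit]
    return f"{head} {tail}"


def expand_nday(token: str) -> str:
    """Expand a day/month token such as '15/5' into full words."""
    day, month = (int(part) for part in token.split("/"))
    return f"{read_under_100(day)} tháng {read_under_100(month)}"
```

Calling expand_nday("15/5") gives "mười lăm tháng năm", the reading used for the date in the example above.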
Reference
Thu-Trang Thi Nguyen, Thanh Thi Pham, Do-Dat Tran. A method for Vietnamese Text Normalization to improve the quality of speech synthesis.
