Chuan Hoa Van Ban

Vietnamese Text Normalization Method
Student: Phan Thanh Vinh
Outline
1. NSW categorization
2. Model of text normalization in Vietnamese

1. NSW Categorization
1. NUMBERS
NTIM: time (1:30)
NDAT: date (17/3/87, 1/3/2010)
NDAY: day and month (17/3, 03-05/3)
NMON: month and year (3/87, 3/2010, 3-5/87)
NNUM: number (2009, 70.000)
NTEL: telephone number (0912.334.577)
NDIG: number as digits (Mã số 999)
NSCR: score (Tỉ số là 3-5)
NRNG: range (Từ 3-5 ngày)
NPER: percentage (93%, 30-40%)
NFRC: fraction (34/6, 6/145)
NADD: address (số 14/3/2 phố Huế)
2. LETTERS
LWRD: read as a word (London, NATO)
LSEQ: letter sequence (ODA, GDP)
LABB: abbreviation (TS (tiến sĩ))
3. OTHERS
PUNC: punctuation (( ) * + - /)
URLE: URL, path name or email (http://soict.hut.vn)
MONY: money (2$, $2, VNĐ 9.000)
CSEQ: read all characters (:), XXX)
DURA: duration (nghỉ, a pause), e.g. the "-" in scores (2-3)
NONE: ignored (ASCII art)
2. Model of text normalization in Vietnamese
2.1 Detection of NSWs
2.1.1 Splitting of Tokens
1. Leave these special cases unsplit: URLs, emails, path names (the URLE group), compound numbers, and money.
2. In telephone numbers, replace the whitespace between digits with "." and remove extra whitespace (09 15 33 45 77 becomes 09.15.33.45.77).
3. Place a single whitespace directly before and after every punctuation mark (e.g. HN.Vin).
4. Break the input text down into whitespace-separated tokens.
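The four steps above could be sketched in Python roughly as follows. The slides give no patterns, so every regular expression and the function name `split_tokens` are assumptions made for illustration; a real system would need broader coverage (path names, more currencies, other phone formats).

```python
import re

# Hypothetical patterns: the slides name the protected groups but give no regexes.
PROTECTED = re.compile(
    r"(?:https?://|www\.)\S+"        # URLs
    r"|\S+@\S+\.\S+"                 # e-mail addresses
    r"|\d+(?:[./:,-]\d+)+%?"         # compound numbers: 17/3/2010, 70.000, 30-40%
    r"|\$\d+|\d+\$"                  # money written with a dollar sign
)
# Assumed phone layout: "0x" followed by three or four two-digit groups.
PHONE_WITH_SPACES = re.compile(r"\b0\d(?: \d{2}){3,4}\b")
PUNCT = re.compile(r"([.,;:?!()*+/\-])")


def split_tokens(text: str) -> list[str]:
    # Step 2: join the digit groups of a telephone number with "."
    text = PHONE_WITH_SPACES.sub(lambda m: m.group(0).replace(" ", "."), text)
    pieces = []
    pos = 0
    for m in PROTECTED.finditer(text):
        # Step 3: pad punctuation with spaces, but only outside protected spans
        pieces.append(PUNCT.sub(r" \1 ", text[pos:m.start()]))
        # Step 1: protected spans are copied through unchanged
        pieces.append(" " + m.group(0) + " ")
        pos = m.end()
    pieces.append(PUNCT.sub(r" \1 ", text[pos:]))
    # Step 4: split on whitespace, which also drops any extra spaces
    return "".join(pieces).split()
```

Calling `split_tokens("Xem http://soict.hut.vn (HN)")` keeps the URL as one token while the parentheses become tokens of their own.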

2.1.2 Filtering of candidate NSWs
1. Each candidate NSW is tagged with <w>, e.g. <w>15/5</w>.
2. Some punctuation marks do not need to be spoken: full stops (.), commas (,), semicolons (;), colons (:), question marks (?) and exclamation marks (!).
3. Each NSW is classified into one category of its group using the corresponding algorithm or technique, e.g. <w nsw=NDATE>15/5</w>.
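A minimal sketch of the tagging step, under a loud assumption: the slides do not say how a token is recognised as a candidate, so here a token counts as an NSW when it is missing from a small syllable set. `KNOWN_SYLLABLES` is a hypothetical stand-in for a real Vietnamese dictionary, and `tag_candidates` is an illustrative name.

```python
# Punctuation that the slides list as not needing to be spoken.
SILENT_PUNCT = {".", ",", ";", ":", "?", "!"}

# Hypothetical stand-in for a real Vietnamese syllable dictionary.
KNOWN_SYLLABLES = {"thứ", "tư", "ngày", "vào", "lúc"}


def tag_candidates(tokens: list[str]) -> list[str]:
    tagged = []
    for tok in tokens:
        if tok in SILENT_PUNCT or tok.lower() in KNOWN_SYLLABLES:
            tagged.append(tok)              # passed through untagged
        else:
            tagged.append(f"<w>{tok}</w>")  # candidate NSW
    return tagged
```

Category labels such as `nsw=NDATE` would be added by a later classification pass; this sketch only marks the candidates.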
2.2 Splitting of Compound NSWs
Compound NSWs cannot be classified into any category and need to be split into more than one sub-NSW.
Compound NSWs include combinations of:
1. digits and letters (e.g. I5108, 1m65)
2. letters and hyphens (-) or slashes (/) (e.g. kw/h, p- an)
3. lowercase and uppercase letters
For example, 8h30 is split as:
<split><w>8</w><w>h</w><w>30</w></split>

NSWs that already belong to a group, such as numbers (telephone numbers 0914.392.492, dates 17/3/2010, ranges 30-40%, scores), do not need to be split.
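One way to sketch the splitting rule is to cut a token wherever the character class changes (digit, uppercase, lowercase, symbol). This is an illustrative reading, not the authors' algorithm. The `NUMBER_LIKE` pattern and the guard that leaves plain words such as "London" whole are assumptions added so that number-group NSWs and ordinary capitalised words are not split.

```python
import re
from itertools import groupby

# Assumed shape of number-group NSWs that must stay whole (dates, phones, ranges).
NUMBER_LIKE = re.compile(r"\d+(?:[./:-]\d+)*%?")


def char_class(ch: str) -> str:
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "upper"
    if ch.isalpha():
        return "lower"
    return "symbol"


def split_compound(token: str) -> str:
    # Assumption: all-lower, all-upper and Capitalised words are not compounds.
    simple_word = token.isalpha() and (
        token.islower() or token.isupper() or token.istitle()
    )
    if NUMBER_LIKE.fullmatch(token) or simple_word:
        return f"<w>{token}</w>"
    # Cut the token at every change of character class.
    parts = ["".join(group) for _, group in groupby(token, key=char_class)]
    if len(parts) < 2:
        return f"<w>{token}</w>"
    return "<split>" + "".join(f"<w>{p}</w>" for p in parts) + "</split>"
```

With this sketch, `split_compound("8h30")` yields the `<split>` markup shown in the slides, while `split_compound("17/3/2010")` is left as a single `<w>` element.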
2.3 Classification of NSWs
2.3.1 Classification of categories in Number group
A decision tree is built from the following properties of each NSW:
Format properties
1. Number of characters
2. Number of digits
Context properties
1. Directly preceding tokens
2. Directly following tokens
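The properties above can be collected into a feature record per token. The sketch below does that and then applies a tiny hand-written tree. The tree is a hypothetical stand-in for one learned from labelled data: its thresholds and trigger words are invented for illustration, and it uses only one preceding and one following token.

```python
def extract_features(tokens: list[str], i: int) -> dict:
    tok = tokens[i]
    return {
        "n_chars": len(tok),                              # format property
        "n_digits": sum(ch.isdigit() for ch in tok),      # format property
        "prev": tokens[i - 1].lower() if i > 0 else "",   # context property
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "",
    }


def classify_number(f: dict) -> str:
    # Hand-written toy tree; a trained tree would choose its own splits.
    if f["prev"] == "ngày":                    # "ngày" means "day"
        return "NDAY" if f["n_digits"] <= 4 else "NDAT"
    if f["prev"] == "tháng":                   # "tháng" means "month"
        return "NMON"
    if f["n_digits"] >= 9:                     # long digit strings: phone numbers
        return "NTEL"
    return "NNUM"
```

For example, `classify_number(extract_features(["ngày", "15/5", "vào"], 1))` follows the first branch because the preceding token is "ngày".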
2.3.2 Classification of categories in Letter group
Each NSW is classified by estimating the probability that it belongs to each category,
where n = (l1, l2, ..., ln) is the observed string, made up of n characters.
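The formula from this slide did not survive extraction, so the sketch below is not the authors' model. It shows one generic way to estimate such probabilities: Bayes' rule with a per-category character unigram likelihood and add-one smoothing. The training words are just the category examples from section 1, and `ALPHABET_SIZE` is an assumed constant.

```python
import math
from collections import Counter

# Toy training data taken from the category examples; far too small for real use.
TRAIN = {
    "LWRD": ["London", "NATO"],
    "LSEQ": ["ODA", "GDP"],
    "LABB": ["TS"],
}
ALPHABET_SIZE = 52  # assumption: only unaccented upper- and lowercase Latin letters


def train(data: dict) -> dict:
    total = sum(len(words) for words in data.values())
    model = {}
    for label, words in data.items():
        counts = Counter("".join(words))
        # prior, per-character counts, total number of characters seen
        model[label] = (len(words) / total, counts, sum(counts.values()))
    return model


def posterior(model: dict, token: str) -> dict:
    log_scores = {}
    for label, (prior, counts, n) in model.items():
        score = math.log(prior)
        for ch in token:
            # add-one smoothing so unseen letters do not zero out the product
            score += math.log((counts[ch] + 1) / (n + ALPHABET_SIZE))
        log_scores[label] = score
    # normalise in log space to avoid underflow on long tokens
    top = max(log_scores.values())
    z = sum(math.exp(s - top) for s in log_scores.values())
    return {label: math.exp(s - top) / z for label, s in log_scores.items()}
```

The category with the highest posterior would be chosen, for instance with `max(p, key=p.get)` on the returned dictionary `p`.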
2.3.3 Classification of categories in Others group
PUNC: speakable punctuation.
URLE: tokens containing @, http://, www, or a suffix such as .com, .net, .vn.
MONY: a currency symbol ($, VNĐ, S$, SGD) before or after the amount of money.
CSEQ: other characters that can be spoken.
NONE: special cases that should not be read, such as ASCII art and formatting junk.
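These rules lend themselves to a small rule-based classifier, sketched below. The rule order, the amount pattern, and the CSEQ/NONE heuristic (long runs of symbols are treated as junk) are assumptions; the slides only describe what each category should contain.

```python
import re

SPEAKABLE_PUNCT = set("()*+-/")  # taken from the PUNC examples in section 1
URL_MARKERS = ("@", "http://", "www", ".com", ".net", ".vn")
CURRENCY = r"(?:\$|VNĐ|S\$|SGD)"
AMOUNT = r"\d+(?:[.,]\d+)*"      # assumed amount format, e.g. 9.000
MONEY = re.compile(rf"{CURRENCY}\s?{AMOUNT}|{AMOUNT}\s?{CURRENCY}")


def classify_other(token: str) -> str:
    if token in SPEAKABLE_PUNCT:
        return "PUNC"
    if any(marker in token for marker in URL_MARKERS):
        return "URLE"
    if MONEY.fullmatch(token):
        return "MONY"
    # Hypothetical heuristic: long symbol-only runs are unreadable junk.
    if len(token) > 3 and not any(ch.isalnum() for ch in token):
        return "NONE"
    return "CSEQ"
```

Under these assumptions `classify_other(":)")` falls through to CSEQ, while a separator line such as `=========` is labelled NONE.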
2.4 Expanding of NSWs
Full words are generated for each NSW:
Thứ Tư <w nsw=PUNC full=mở ngoặc>(</w> ngày
<w nsw=NDATE full=mười lăm tháng năm>15/5</w>
<w nsw=PUNC full=đóng ngoặc>)</w> vào
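As a rough illustration of expansion, the sketch below reads a day/month token and the two parentheses aloud and emits the same `full=` markup as the example above. It covers only the numbers 1 to 31. The function names and the choice of "tư" for the fourth month are assumptions, and a full expander would need one routine per category.

```python
UNITS = ["", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]
PUNCT_NAMES = {"(": "mở ngoặc", ")": "đóng ngoặc"}


def read_number(n: int) -> str:
    """Read 1..31 in Vietnamese, which is enough for days and months."""
    if not 1 <= n <= 31:
        raise ValueError("this sketch only covers 1..31")
    tens, unit = divmod(n, 10)
    if tens == 0:
        return UNITS[unit]
    head = "mười" if tens == 1 else UNITS[tens] + " mươi"
    if unit == 0:
        return head
    if unit == 5:
        return head + " lăm"        # 15 -> mười lăm, 25 -> hai mươi lăm
    if unit == 1 and tens > 1:
        return head + " mốt"        # 21 -> hai mươi mốt
    return head + " " + UNITS[unit]


def expand_day_month(token: str) -> str:
    day, month = (int(part) for part in token.split("/"))
    month_word = "tư" if month == 4 else read_number(month)
    return f"{read_number(day)} tháng {month_word}"


def expand(token: str, category: str) -> str:
    if category == "PUNC":
        full = PUNCT_NAMES[token]
    else:
        full = expand_day_month(token)
    return f"<w nsw={category} full={full}>{token}</w>"
```

Here `expand("15/5", "NDATE")` reproduces the date element from the slide's example.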
Reference
Thu-Trang Thi Nguyen, Thanh Thi Pham, Do-Dat Tran. A method for Vietnamese Text Normalization to improve the quality of speech synthesis.
