Ludovic Lebart
Centre National de la Recherche Scientifique
Telecom-ParisTech, Paris, France
Text Mining and Open-ended Questions
in Sample Surveys
Summary / Outline
7) Conclusions
1- Principles of Data Mining and Text mining: A reminder
valid,
novel,
potentially useful,
and ultimately… understandable
The main goal is to extract automatically, from the ore (raw data),
the genuine diamond of truth… (Benzécri 1973)
Despite the fact that we may deal with several observational levels
(households, individuals, trajectories or biographical data,
areas or regions…), there is a consistency and a unity of content
in a survey data set - together with general hypotheses formulated
beforehand - that are not present in the usual data mining input data.
✔ Initial paradigm:
WEB Press
Complaints
2- Open-ended Questions: Why? How?
DRAWBACKS: Cost, Complexity, Specificity
ADVANTAGES: Speed, Freedom, Specificity
When asked
"What is the most important problem facing this country [USA] at present?",
16% of Americans mention crime and violence (grouped free responses),
whereas the same item, offered in a closed question,
is chosen by 35% of respondents.
As a matter of fact, it is also legitimate to raise this same issue with respect
to regional and generational differences.
In some other particular contexts, the cultural gap between those who
have conceived the questionnaire and the interviewees is hidden by the
purely numerical coding of the closed questions.
“Would you like to add something about some topics that could be
missing in this questionnaire, about the minimum wage system?”
Another: “Thank you for coming. It proves that you are thinking of me.”
Some respondents are far removed from the intended framework, “attitude towards an institution”.
A survey in three cities (Tokyo, New York, Paris) about dietary habits.
---- 1
SPAGHETTI,CHINESE
++++
CAESAR SALAD,LOBSTER TAILS,BAKED POTATO, CHOCOLATE MOUSSE
---- 2
SEAFOOD,GREEN SALAD,CHINESE FOOD
++++
CHAMPAGNE,CAVIAR,GREEN SALAD,GRILLED SEAFOOD
---- 3
CHINESE FOOD
++++
CHINESE FOOD,FRENCH FOOD,VEAL,BREAD
---- 4
PASTA
++++
BEARNAISE BEEF,CHINESE FOOD,ITALIAN FOOD,PASTA
3- From texts to numerical data
CORPUS
Texts
Sentences or responses
Segments or quasi-segments
Characters
Ambiguity of frequencies:
statistical frequency versus « linguistic frequency »
Sample surveys
(statistical frequency)
Closed questions
Open questions
Texts
( linguistic frequency)
The counts for the first phase of numeric coding are as follows:
Out of 1043 responses, there are 13 669 occurrences,
with 1 413 distinct words.
When the words appearing at least 16 times are selected,
there remain 10 357 occurrences of these words,
with 135 distinct words (types).
Word       Freq.   Word            Freq.   Word     Freq.
I          248     go              19      of       312
I'm        22      going           26      on       59
a          298     good            303     other    33
able       55      grandchildren   30      others   17
about      31      happiness       227     our      29
after      26      happy           137     out      34
all        86      have            99      own      16
and        504     having          70      peace    77
anything   19      health          609     people   61
are        65      healthy         45      really   28
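The counting step described above (occurrences, distinct words, then a frequency threshold) can be sketched as follows. This is a minimal illustration, not the procedure actually used for the survey: the tokenizer, the function names and the three sample responses are assumptions, and lower-casing merges forms such as "I" and "i" that the list above keeps distinct.

```python
import re
from collections import Counter

def tokenize(response):
    """Lower-case a response and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", response.lower())

def count_words(responses, min_freq=1):
    """Return (number of occurrences, number of distinct words, kept counts)."""
    counts = Counter(tok for r in responses for tok in tokenize(r))
    # Keep only the words that reach the frequency threshold.
    kept = {w: c for w, c in counts.items() if c >= min_freq}
    return sum(counts.values()), len(counts), kept

# Hypothetical responses, invented for the illustration.
responses = [
    "good health, happiness",
    "health and my family",
    "I'm happy, good health and peace",
]
n_occurrences, n_types, kept = count_words(responses, min_freq=2)
print(n_occurrences, n_types, sorted(kept.items()))
```

With a threshold of 16 instead of 2, the same function would perform the selection described for the 1043 responses.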
I 2 46 92 30 25 19 11 21 2
I'm 2 5 9 3 2 1 0 0 0
a 10 56 66 54 44 19 20 22 7
able 1 9 16 9 7 4 4 5 0
about 0 3 13 7 1 2 4 1 0
after 1 8 11 3 1 2 0 0 0
all 1 24 19 8 18 6 3 5 2
and 8 89 148 86 73 30 25 32 13
anything 0 4 9 1 3 0 1 1 0
No.  Word            Freq.   No.  Word          Freq.
 1   I               14      25   in            27
 2   a               59      26   is            37
 3   about           15      27   it            133
 4   all             21      28   it's          28
 5   and             42      29   long          14
 6   are             25      30   morning       9
 7   been            12      31   nothing       25
 8   carbohydrate    14      32   nutritional   9
 9   carbohydrates   33      33   nutritious    12
10   cereal          34      34   nuts          25
11   complex         25      35   of            25
12   crunchy         9       36   people        28
13   eaten           10      37   showed        11
14   eating          19      38   taste         11
15   energy          33      39   that          80
16   for             57      40   that's        13
17   give            9       41   the           82
18   gives           11      42   they          50
19   good            52      43   to            32
20   grape           25      44   was           19
21   has             30      45   with          11
22   have            27      46   years         11
23   healthy         23      47   you           81
24   how             9
Example 2: "What is the main idea in this commercial?" Examples of repeated segments.
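Repeated segments are sequences of words that occur several times in the corpus. A minimal sketch of how they could be counted, assuming segments never cross response boundaries; the function name, the length limits and the three sample responses are hypothetical and are not taken from the survey:

```python
from collections import Counter

def repeated_segments(responses, min_len=2, max_len=4, min_freq=2):
    """Count word sequences (segments) that occur at least min_freq times."""
    counts = Counter()
    for response in responses:
        words = response.lower().split()
        for n in range(min_len, max_len + 1):
            # Slide a window of n words over the response.
            for start in range(len(words) - n + 1):
                counts[" ".join(words[start:start + n])] += 1
    return {seg: c for seg, c in counts.items() if c >= min_freq}

# Hypothetical responses, invented for the illustration.
responses = [
    "it gives you energy",
    "the cereal gives you energy",
    "it is good for you",
]
segments = repeated_segments(responses)
print(segments)
```

The retained segments can then be added to the word list as extra statistical units.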
The common open-ended question: "What dishes do you like and eat often?"
(with a probe: "Any other dishes you like and eat often?").
- The three sets of respondents are broken down into six categories
(three age categories, combined with gender).
Most frequent word forms (excerpt):
Rank  Word      Freq.
 1    de        891
 2    y         806
 3    en        694
 4    boca      433
 5    con       356
 6    fruta     334
 7    un        308
 …
13    taninos   167
14    el        159
15    una       152
16    madera    140
17    bien      116
18    toque     108

Lemmatization:
- singulars and plurals
- masculine and feminine
- verb forms reduced to the infinitive
- …
Articles, prepositions … are removed.

Resulting table (wines × words):
           P1   P2   ...   P250
Wine 1     0    1    ...   2
Wine 2     1    0    ...   1
Wine 3     0    0    ...   1
. . .
Wine 443   1    2    ...   0
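The passage from free comments to a wines × words table can be sketched as below. This is only an illustration under stated assumptions: the stop list is a hypothetical handful of articles, prepositions and conjunctions, the three comments are invented, and the lemmatization step (merging, for example, "madura" and "maduro") is deliberately left out.

```python
from collections import Counter

# Hypothetical mini stop list (articles, prepositions, conjunctions).
STOP_WORDS = {"de", "y", "en", "con", "un", "una", "el", "la"}

def document_term_matrix(documents, stop_words=STOP_WORDS):
    """Return (vocabulary, rows): one row of word counts per document."""
    counts = [Counter(w for w in doc.lower().split() if w not in stop_words)
              for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, rows

# Invented tasting comments, one per wine.
comments = [
    "fruta madura en boca con taninos",
    "boca con madera y un toque de fruta",
    "taninos y madera en boca",
]
vocabulary, rows = document_term_matrix(comments)
print(vocabulary)
for row in rows:
    print(row)
```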
boca (mouth)
fruta (fruit)
muy (very)
nariz (nose)
nota (note)
taninos (tannins)
madera (wood)
buen (good)
rojo (red)
negro (black)
toque (hint)
bien (well)
final (end)
maduro (ripened)
balsámico (balsamic)
vino (wine)
todavía (still)
elegante (elegant)
agradable (pleasant)
jugoso (juicy)
medio (medium)
fino (fine)
algo (some/something)
cereza (cherry)
ser (to be)
ligero (light)
suave (mild)
potente (powerful)
acidez (acidity)
[Figure: horizontal bar chart of the frequencies of the words listed above; frequency axis from 0 to 400.]
4) Basic statistical tools: Visualization, Characteristic words.
Briefly, one can summarize the principles of methods for performing these
data reductions:
These two families of methods can be used on the same data matrix and
they complement one another very effectively.
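$$x_{ij} = a_j f_i + \varepsilon_{ij}$$
Known (left-hand side) = Unknown (right-hand side), where:
- $x_{ij}$: value of variable j for individual i
- $a_j$: coefficient of variable j
- $f_i$: value of the factor for individual i
- $\varepsilon_{ij}$: residual (hopefully small)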
Garnett J.-C. (1919) - General ability, cleverness and purpose. British J. of Psych.,
9, p 345-366.
Thurstone L. L. (1947) - Multiple Factor Analysis. The University of Chicago Press,
Chicago.
$$x_{ij} = a_j f_i + b_j g_i + \dots + \varepsilon_{ij}$$
Eckart C., Young G. (1939) - A principal axis transformation for non-Hermitian matrices.
Bull. Amer. Math. Soc., 45, p 118-121.
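$$X = \lambda_1 u_1 v_1^{\top} + \dots + \lambda_\alpha u_\alpha v_\alpha^{\top} + \dots + \lambda_p u_p v_p^{\top}$$
(each term is a coefficient $\lambda_\alpha$ times a column vector $u_\alpha$ times a row vector $v_\alpha^{\top}$)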
95 88 88 87 95 88 95 95 95 106 95 78 65 71 78 77 77 etc.
143 144 151 151 153 170 183 181 162 140 116 128 133 144 159 166 170
153 151 162 166 162 151 126 117 128 143 147 175 181 170 166 132 116
143 144 133 130 143 153 159 175 192 201 188 162 135 116 101 106 118
123 112 116 130 143 147 162 183 166 135 123 120 116 116 129 140 159
133 151 162 166 170 188 166 128 116 132 140 126 143 151 144 155 176
160 168 166 159 135 101 93 98 120 128 126 147 154 158 176 181 181
154 155 153 144 126 106 118 133 136 153 159 153 162 162 154 143 128
159 153 147 159 150 154 155 153 158 170 159 147 130 136 140 150 150
151 144 147 176 188 170 166 183 170 166 153 130 132 154 162 120 135
155 181 183 162 144 147 147 144 126 120 123 129 130 112 101 135 150
166 147 129 123 133 144 133 117 109 118 132 112 109 120 136 120 136
136 130 136 147 147 140 136 144 140 132 129 151 153 140 128 153 147
130 133 140 124 136 152 166 147 144 151 159 140 123 130 123 109 112
126 120 143 145 162 153 155 175 154 144 136 130 120 112 123 123 144
144 159 155 155 162 166 158 147 140 147 126 123 132 135 136 144 147
136 143 162 175 136 110 112 135 120 118 126 151 150 130 129 133 147
133 151 143 106 85 93 128 136 140 140 144 143 126 117 116 129 124
……………………………..etc.
Reconstitution of the Cheetah with 2, 4, 6, 8, 10, 12, 20, 30, 40 principal axes
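Such a reconstitution keeps only the first few terms of the singular value decomposition of the grey-level matrix. A minimal sketch with NumPy, assuming a plain (uncentred) decomposition and using only a small block of the grey levels quoted above; the function name `reconstitute` is hypothetical:

```python
import numpy as np

def reconstitute(X, k):
    """Rank-k approximation of X built from its first k singular triplets."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# First four rows of the grey-level matrix quoted above, cut to six columns.
X = np.array([
    [95, 88, 88, 87, 95, 88],
    [143, 144, 151, 151, 153, 170],
    [153, 151, 162, 166, 162, 151],
    [143, 144, 133, 130, 143, 153],
], dtype=float)

# Reconstruction error (Frobenius norm) for 1, 2, 3 and 4 axes.
errors = [float(np.linalg.norm(X - reconstitute(X, k))) for k in range(1, 5)]
print(errors)
```

The error can only decrease as axes are added, and it vanishes once the number of axes reaches the rank of the matrix, which is the effect shown on the successive images.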
**** Ain
Ain Isere Jura Rhone Hte_Saone Savoie Hte_Savoie
**** Aisne
Aisne Ardennes Marne Nord Oise Seine_Marne Somme
**** Allier
Allier Cher Creuse Loire Nievre Puy_de_Dome Hte_Saone
**** Alpes_Prov
Alpes_Prov Alpes_Hautes Alpes_Marit Drome Var Vaucluse
**** Alpes_Hautes
Alpes_Hautes Alpes_Prov Drome Isere Savoie
**** Alpes_Marit
Alpes_Marit Alpes_Prov Var
**** Ardeche
Ardeche Drome Gard Loire Hte_Loire Lozere
**** Ardennes
Ardennes Aisne Marne Meuse
……………………….
Money 51 64 32 29 17 193
Future 53 90 78 75 22 318
Unemployment 71 111 50 40 11 283
Decision 1 7 5 5 4 22
Difficult 7 11 4 3 2 27
Economic 7 13 12 11 11 54
Selfishness 21 37 14 26 9 107
Occupation 12 35 19 6 7 79
Finances 10 7 7 3 1 28
War 4 7 7 6 2 26
Housing 8 22 7 10 5 52
Fear 25 45 38 38 13 159
Health 18 27 20 19 9 93
Work 35 61 29 14 12 151
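A table of this kind is the typical input of a correspondence analysis. The sketch below is one common way to compute it (singular value decomposition of the standardized residuals) and is not claimed to be the exact procedure behind the slides; the counts are copied from the table above, with the last column (the row total) dropped, and the variable names are assumptions.

```python
import numpy as np

words = ["Money", "Future", "Unemployment", "Decision", "Difficult",
         "Economic", "Selfishness", "Occupation", "Finances", "War",
         "Housing", "Fear", "Health", "Work"]
N = np.array([
    [51, 64, 32, 29, 17], [53, 90, 78, 75, 22], [71, 111, 50, 40, 11],
    [1, 7, 5, 5, 4], [7, 11, 4, 3, 2], [7, 13, 12, 11, 11],
    [21, 37, 14, 26, 9], [12, 35, 19, 6, 7], [10, 7, 7, 3, 1],
    [4, 7, 7, 6, 2], [8, 22, 7, 10, 5], [25, 45, 38, 38, 13],
    [18, 27, 20, 19, 9], [35, 61, 29, 14, 12],
], dtype=float)

P = N / N.sum()                      # relative frequencies f_ij
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)

inertia = s ** 2                     # eigenvalues, one per axis
percent = 100 * inertia / inertia.sum()
row_coords = (U * s) / np.sqrt(r)[:, None]     # principal coordinates of words
col_coords = (Vt.T * s) / np.sqrt(c)[:, None]  # principal coordinates of columns

for axis in range(2):
    print(f"Axis {axis + 1}: {percent[axis]:.1f}% of inertia")
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes gives a map of the words and of the column categories.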
[Figure: the contingency table F (n rows, p columns), with general term f_ij. Symmetry of the two spaces, rows and columns: the row-profiles form n points in R^p and the column-profiles form p points in R^n.]
[Figure: principal plane, Axis 1 (57%) and Axis 2 (21%). Word points: Finances, Future, War, Fear, Unemployment, Selfishness, Health, Money, Difficult, Housing, Work, Decision, Occupation, Economic. Category points: No DEGREE, ELEM, TRADE, HIGH, COLLEGE.]
Notations:
kij -sub-frequency of word i in the part j of the corpus;
ki. -frequency of word i in the whole corpus;
k.j -frequency (size) of part j;
k.. -size of the corpus (or, simply, k).
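[Figure: table crossing WORDS (rows) and TEXT PARTS (columns), with general term kij, row margin ki. (frequency of the word in the corpus), column margin k.j and grand total k.. (size of the corpus).]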
5) Applications: Open questions, sample surveys, texts
[Figure: principal plane showing the categories E1-AGE1, E1-AGE2, E1-AGE3, E2-AGE1, E3-AGE1, E3-AGE2 together with the words: are, family, from, should, general, very, wife, me, leisure, freedom, happiness, standard, employment, dog, help, time, house, world, daughter, music, and, would, education, work, I, money, be, well, to, not, anything, you, satisf., love, day, job, have, going, keep, comfortably, things, comfortable, more, much, friends, think, can, future, car, out, go.]
Example 1 (« Life » question): Location of segments
[Figure: principal plane showing the categories E1-AGE1, E1-AGE2, E1-AGE3, E2-AGE1, E2-AGE2, E2-AGE3, E3-AGE1, E3-AGE2, E3-AGE3 together with the segments: peace of mind; peace in the world; welfare of my family; happiness, good health; law and order; a nice home; a good standard of living; I don't know; having enough money to live; a good job; can't think of anything else; friends and family.]
Rank  Word      % internal  % global  Freq. internal  Freq. global  Test value
1     friends   2.87        1.11      17              116           3.44
2     do        1.35        .45       8               47            2.60
3     want      1.01        .30       6               31            2.44
4     being     2.19        1.11      13              116           2.18
5     job       2.53        1.36      15              142           2.16
6     having    1.52        .67       9               70            2.11
7     things    .84         .27       5               28            2.06
------------------
2     wife      .00         .65       0               68            -2.10
1     health    2.70        5.85      16              609           -3.59

1     mind      2.55        .45       5               47            2.91
2     welfare   1.53        .21       3               22            2.42
3     peace     2.55        .74       5               77            2.17
1.33 - 1 friends, friends, my home life
1.12 - 2 being content having enough money to do what you want to do, within reason, having good friends, having a fulfilling job to do, having some idea of what you want to do and the freedom to choose, protection of the environment
1.05 - 3 to have good friends around having a good job, living in a good area, having lots of freedom to do the things you want to do
.93 - 4 good living education, good job, money

.97 - 1 togetherness, peace of mind, good health, religion, no
.64 - 2 not to die, peace of mind, don't like people living envious of each other
.63 - 3 peace of mind good health, happiness, enough money to keep a standard of living
.38 - 4 welfare of my family work, satisfaction, good health, travel
Example 3: International survey (Tokyo Gas Company). A survey in three cities (Tokyo,
New York, Paris) about dietary habits. Open question: "What dishes do you like and eat often?"
[Figure: New York, first principal plane. Table crossing words and age × gender categories.]
[Figure: New York, first principal plane. Example of confidence areas for categories (bootstrap).]
[Figure: New York, first principal plane. Example of confidence areas for words (bootstrap).]
[Figure: New York, first principal plane. Example of a Kohonen map (self-organizing map).]
[Figure: first principal plane (Axis 1: 3.52%, Axis 2: 1.75%), wines and marks. Wines labelled: Mesoneros de Castilla (03), Fuentenarro (02), Jaros Chafandín (01), San Román (01), Numanthia (02), Bienvenida Sitio de El Palo (01), Bienvenida Sitio de El Palo (02), Termanthia (02), Tares P3 (01), Carramimbre (03), Gran Elías Mora (00), Viña Eremos (03). Category: Gran Reserva. Marks plotted from 79 to 97. Average mark: 85.16.]
Example 4: Comments about 522 Spanish wines (continuation)
[Figure: supplementary variables on the plane of Axis 1 and Axis 2. Marks: 78 to 97. Price classes: 0-4,9€, 5-9,9€, 10-14,9€, 15-19,9€, 20-24,9€, 25-29,9€, 30-49,9€, 50-99,9€, 100-300€. Wine types: Tinto joven (young red), Tinto crianza (aged red), Gran Reserva. Wines labelled: Mesoneros de Castilla (03), Jaros Chafandín (01), Vega Sicilia 'Único' (94), Viña Sastre Pesus (01), Valdelosfrailes (03), Punta Esencia (01), Fuentenarro (02), Astrales (02), Torondos (02), Valdecuadrón (02), Gayubar (02), Viñatorondos (03), Valdetán (02), Viña Valdable (03), Marqués de Olivara (98), Rauda (01), El Marqués (02).]
Processing Strategy
Instrumental Partition
6) About textual data in general
Importance of Meta-data
Meta-data surrounding the textual data:
- Linguistics: grammar / syntax
- Semantic networks
- External corpora
- Other a priori structures: sociolinguistics, chronology, etc.
Semantics
A man thinks (A stone thinks)
To bear
To bore
Polysemous words:
- DUTY: task, tax
- DRUG: medicine, addictive product
X is sometimes purring
X mews
X has whiskers
X likes milk
X likes chasing mice
In the end, the point «X» will be superimposed on the point «CAT».
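The idea can be checked numerically: on a map built from profiles, two words that have the same distribution over contexts have the same profile, hence a chi-square distance of zero and the same position. A minimal sketch with NumPy; the context list, the counts and the third word "dog" are hypothetical values chosen for the illustration.

```python
import numpy as np

contexts = ["purrs", "mews", "has whiskers", "likes milk", "chases mice", "barks"]
# Hypothetical co-occurrence counts of each word with the contexts above.
counts = {
    "X":   np.array([2, 4, 6, 4, 4, 0], dtype=float),
    "cat": np.array([10, 20, 30, 20, 20, 0], dtype=float),
    "dog": np.array([0, 0, 5, 3, 2, 30], dtype=float),
}

def profile(v):
    """Relative frequencies of a word over the contexts."""
    return v / v.sum()

def chi2_distance(a, b, col_mass):
    """Chi-square distance between the profiles of two words."""
    return float(np.sqrt((((profile(a) - profile(b)) ** 2) / col_mass).sum()))

total = sum(counts.values())
col_mass = total / total.sum()      # overall weight of each context

d_x_cat = chi2_distance(counts["X"], counts["cat"], col_mass)
d_x_dog = chi2_distance(counts["X"], counts["dog"], col_mass)
print(d_x_cat, d_x_dog)
```

Here «X» is five times rarer than "cat" but has the same profile, so the two points coincide, while "dog" stays apart.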
(1) calm–wisdom–discretion–wariness–fear–panic,
7) Conclusions
In conclusion...
All these processing steps are carried out under the supervision of robust
assessment procedures.
- Akuto H. (1992). International Comparison of Dietary Culture. Nihon Keizai Simbun, Tokyo.
- Bécue M., Lebart L. (1996). Clustering of texts using semantic graphs. Application to open-ended questions in surveys,
Proceedings of the IFCS 96 Symposium, Kobe, Springer Verlag, Tokyo (in press).
- Bécue-Bertaut M., Pagès J., Alvarez-Esteban R., Vásquez Burguete J.L. (2006). Détermination d’une note globale, synthèse
d’une évaluation numérique et d’appréciations libres. Application aux études de marché (in French). Actes des JADT-2006.
http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2006/tocJADT2006.htm
- Bécue-Bertaut M., Álvarez-Esteban R., Pagès J. (2008). Rating of products through scores and free-text assertions.
Comparing and combining both. Food Quality and Preference, 19/1, 122-134.
- Belson W.A., Duncan J.A. (1962): A Comparison of the check-list and the open response questioning system, Applied
Statistics, 2, 120-132.
- Benzécri J.-P. (1992). Correspondence Analysis Handbook. Marcel Dekker, New York.
- Biber D. (1995). Dimensions of register variation. Cambridge Univ. Press, Cambridge.
- Bradburn N., Sudman S., and associates (1979): Improving Interview Method and Questionnaire Design, Jossey Bass,
San Francisco.
- Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. (1990). Indexing by latent semantic analysis,
J. of the Amer. Soc. for Information Science, 41 (6), 391-407.
- Habert B., Nazarenko A., Salem A. (1997). Les linguistiques de corpus. Armand Colin, Paris.
- Hayashi C., Suzuki T., Sasaki M. (1992): Data Analysis for Social Comparative research: International Perspective,
North-Holland, Amsterdam.
- Lebart L. (1982). Exploratory analysis of large sparse matrices, with application to textual data, COMPSTAT,
Physica Verlag, 67-76.
- Lebart L., Salem A., Bécue M., (2000), Análisis estadístico de textos, Editorial Milenio, Lleida.
- Lebart L., Salem A., Berry E. (1998). Exploring Textual Data. Kluwer, Dordrecht.
- Lebart L., Morineau A., Warwick K. (1984). Multivariate Descriptive Statistical Analysis. John Wiley. N.Y.
- Ritter H., Kohonen T. (1989). Self Organizing Semantic Maps. Biol. Cybern. 61, 241-254.
- Salem A. (1984). La typologie des segments répétés dans un corpus, fondée sur l'analyse d'un tableau croisant mots et textes,
Cahiers de l'Analyse des Données, 489-500.
- Sasaki M., Suzuki T. (1989): New directions in the study of general social attitudes : trends and cross-national perspectives,
Behaviormetrika, 26, 9-30.
- Schuman H., Presser S. (1981): Questions and Answers in Attitude Surveys, Academic Press, New York.
- Sudman S., Bradburn N. (1974): Response Effects in Surveys, Aldine, Chicago.
Surveys data and software (DtmVic)
can be downloaded from
www.dtm-vic.com
Thank You
Gracias
Grazie
Obrigado
Merci