Tim Walsh
November 28, 2011, University of Delaware
Personal Interest
Interest in Psychology & Cognitive Science
o Experimental Design
o Research Methods
User studies are at the core of social science research
Research: creating interfaces for specific user populations
o Evaluation requires user studies
So why bother?
o Can't we prove that our system works without involving these pesky users?
Why, you may ask, do my colleagues and I put so much time, money, and energy into experiments? For social scientists, experiments are like microscopes or strobe lights, magnifying and illuminating the complex, multiple forces that simultaneously exert their influences on us. They help us slow human behavior to a frame-by-frame narration of events, isolate individual forces, and examine them carefully and in more detail. They let us test directly and unambiguously what makes human beings tick and provide a deeper understanding of the features and nuances of our own biases.
Dan Ariely
The Upside of Irrationality
Professor of Psychology & Behavioral Economics
Double-blind
o Reduce confounding effects of experiment
Random Sample
o Results drawn from a representative sample of the real user population (a toy sampling sketch follows)
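As a rough illustration of the idea only (the user pool and sample size are invented, not from the slides), drawing a simple random sample in Python might look like this:

    import random

    # Hypothetical pool of real users the system is meant to serve
    user_population = [f"user_{i}" for i in range(1, 1001)]

    # Simple random sample: every user has an equal chance of being selected,
    # so results should reflect the real user population rather than volunteers
    random.seed(42)  # fixed seed only so the sketch is repeatable
    participants = random.sample(user_population, k=30)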
Different challenges associated with each
Evaluation of Experimental Design: is the experiment objective, reliable, and valid?
Brian Landau
School of Computing, Robert Gordon University, Aberdeen, Scotland, UK
Motivation
Questionnaires are a common method of data collection across many scientific fields
o Interactive Information Retrieval
Questionnaires can be administered in many different ways
How does the mode of administering a questionnaire affect the results?
Hypothesis
H1
o Quantitative ratings of the system will be more positive when the questionnaire is administered in interview mode
H2
o Subjects will provide longer and more informative responses to open questions in interview mode than in other modes (i.e., pen and paper, electronic)
Operational Definitions
Question Type
o Closed: fixed set of responses
o Open: open-ended questions; the possibilities are endless!
Mixed Questionnaire
o Questionnaire contains both closed and open questions
Operational Definitions
Questionnaire Modes
o Interview: face-to-face interview with the experimenter
o Pen & Paper: user writes answers, by hand, using pen and paper
o Electronic: user answers questions via keyboard and mouse on a computer
Acquiescence
o Tendency to answer in the affirmative
Demand Effects
o Expectation of how the user should behave in research setting
Experimental Design
2 x 3 Design
o Search-term highlighting (2 levels): yes, no
o Mode (3 levels): pen & paper, electronic, interview (the six resulting conditions are sketched below)
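A minimal sketch of the 2 x 3 layout; the subject IDs and the balanced random assignment are assumptions, not details described in the study:

    import itertools
    import random

    highlighting = ["yes", "no"]                         # 2 levels
    modes = ["pen & paper", "electronic", "interview"]   # 3 levels

    # Cartesian product yields the 6 cells of the 2 x 3 factorial design
    conditions = list(itertools.product(highlighting, modes))

    # Hypothetical subjects assigned round-robin after shuffling, 2 per cell
    subjects = [f"S{i:02d}" for i in range(1, 13)]
    random.seed(0)
    random.shuffle(subjects)
    assignment = {s: conditions[i % len(conditions)] for i, s in enumerate(subjects)}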
Dependent Variables
o Various metrics of questionnaire answers
The goal is to analyze differences in content, length, quality, and effectiveness across the questionnaire modes (a toy length metric is sketched below)
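For example, one of the simplest such metrics, mean answer length per mode, could be computed along these lines; the answers below are invented, not data from the study:

    # Hypothetical open-question answers grouped by questionnaire mode
    answers_by_mode = {
        "interview":   ["The highlighting helped me scan the results much more quickly."],
        "pen & paper": ["Highlighting was good."],
        "electronic":  ["It was fine; the highlighting was useful sometimes."],
    }

    # Mean word count of open answers in each mode (a crude length metric)
    mean_words = {
        mode: sum(len(answer.split()) for answer in answers) / len(answers)
        for mode, answers in answers_by_mode.items()
    }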
Experimental Method
Exit Questionnaire
o This is the focus of the study
Content:
o 21 Closed Questions: 8 usability questions, 13 questions regarding the XRF interface
o 4 Open-Ended Questions:
What were the most positive things about using this system and why?
What were the most negative things about using this system and why?
How would you improve this system and why?
Is there anything else that you would like to tell us about this system and your experiences using it?
Evaluation of Results
Eventual coding scheme notes certain attributes of answers
Goal: reduce the raw data to its simplest form and remove differences caused by the nature of each questionnaire mode (a toy coding step is sketched below)
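One way to picture that reduction step; the keywords and code labels are invented for illustration and are not the paper's actual scheme:

    # Collapse raw free-text answers into a small set of codes so that the
    # three modes can be compared regardless of how verbose each answer is
    coding_rubric = {
        "highlight": "POSITIVE_HIGHLIGHTING",
        "slow": "NEGATIVE_SPEED",
        "confus": "NEGATIVE_USABILITY",   # matches "confusing", "confused", ...
    }

    def code_answer(answer: str) -> set:
        """Return the set of codes whose keyword appears in the raw answer."""
        text = answer.lower()
        return {code for keyword, code in coding_rubric.items() if keyword in text}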
Evaluation of Results
Differences in open answers
o Length
o Uniqueness
Evaluation of Results
Additionally, varying degrees of efficiency between the experimental modalities
Efficiency:
o Number of unique units identified by a given subject
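Taking that definition literally, efficiency per subject can be tallied as below; the unit labels and counts are hypothetical:

    # Coded units extracted from each subject's open answers
    units_by_subject = {
        "S01": ["POSITIVE_HIGHLIGHTING", "NEGATIVE_SPEED", "POSITIVE_HIGHLIGHTING"],
        "S02": ["NEGATIVE_USABILITY"],
        "S03": ["POSITIVE_HIGHLIGHTING", "NEGATIVE_USABILITY", "NEGATIVE_SPEED"],
    }

    # Efficiency: number of unique units identified by a given subject
    efficiency = {subject: len(set(units)) for subject, units in units_by_subject.items()}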
Conclusions
H1
o Quantitative ratings of the system will be more positive when the questionnaire is administered in interview mode
Yes
o Limited supporting data; somewhat inconclusive
o Conclusion: use interview mode to elicit responses to closed questions
Open questions may be confounded
Conclusions
H2
o Subjects will provide longer and more informative responses to open questions in interview mode than in other modes (i.e., pen and paper, electronic)
Yes
o Longer in interview, yet not as well formed
o Similar amount of usable feedback across all modalities
Not necessarily more informative
Conclusions
Scores are generally high
o Small range on the higher end of the spectrum
o Support for acquiescence
Feedback v. No Feedback
o Difference in flow between modalities
Unitizing process
o Low reliability (a toy agreement check is sketched below)
o Objectivity? Create the coding scheme before administering the task
o Hard to do given the open-endedness of possible responses
Researchers could code answers on the fly using a previously established coding rubric
There would be no raw data, but the coded data would be more objective
The data would also be more anonymous
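One crude way to quantify that reliability is percent agreement between two coders on the same answers; the metric and the data here are assumptions for illustration, since the slides do not name a specific measure:

    # Codes assigned to the same five answers by two independent coders
    coder_a = ["POS", "NEG", "POS", "NEG", "POS"]
    coder_b = ["POS", "NEG", "NEG", "NEG", "POS"]

    # Percent agreement: share of answers on which the coders agree (0.8 here);
    # low values would flag the unitizing/coding process as unreliable
    agreement = sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)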
o Have the interviewer time the user
o Must give timer feedback to all subjects or to none
Diane Kelly
School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599
Motivation
Provide a general, reusable, and realistic framework to evaluate end-to-end question answering systems
Goal:
o Development of a protocol for evaluating a complex question-answering system
End-to-end QA System:
o System that takes free-text input
o Provides specific and accurate answers to questions
o Supports the QA process from beginning to end
HITIQA
Large text corpus, consisting of partial answers
Users have two main interfaces:
o Question: interactive dialogue to clarify the scope of the question
o Answer: visual interface to explore the answer space
Questionnaire
o Assess realistic aspect of task
Wizard of Oz Approach
o Man behind the curtain (or rather, the computer) translates the free-text query into a more system-friendly query
o Answer is passed on to the user
o User is unaware of the intermediary
RUTS Paradigm
Real Users, Real Tasks, Real Systems
Real problems?
o Hard to control the environment?
o Vulnerable to confounding variables?
o Results, in theory, should be very generalizable: representative of a real-world environment
RUTS Paradigm
Real Users
o Intelligence analysts
Real Tasks
o Preparation of reports consisting of complex questions, multiple sub-questions
Real Systems
o Two versions of HITIQA to simulate different systems
Evaluation
Cross-Evaluation
Real users assess one another
Results will converge on a representation of the population (a toy aggregation is sketched below)
Concerns:
o What criteria are used to assess the quality of the report?
o Who performs the assessment?
o How is the effect of the system measured?
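A bare-bones sketch of how peer assessments might be aggregated; the reports, judges, scale, and scores are invented, and the paper's actual rubric is not reproduced here:

    from statistics import mean

    # Each analyst's report is scored by the other analysts (peer judges)
    peer_scores = {
        "report_A": {"analyst_2": 4, "analyst_3": 5},
        "report_B": {"analyst_1": 3, "analyst_3": 4},
        "report_C": {"analyst_1": 5, "analyst_2": 4},
    }

    # Averaging over judges: with enough analysts the mean should converge
    # toward a stable, population-level assessment of each report
    report_quality = {report: mean(scores.values()) for report, scores in peer_scores.items()}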
Improvements to System
Results
Cross-evaluation reports scored higher in the second workshop across all analysts
Possible reasons why:
o Increased effectiveness of the system
o Increased focus generated by cross-evaluation
The judges were more aware of the experimental design because they were part of the judgment process
This somewhat nullifies the blindness of the experiment
o Increased experience performing the experiment
The more the analysts perform the task, the better they become at it
Many of these confounding effects would be easier to isolate with a larger sample population
How to study the complexities of human behavior in environments where the researchers lack control?
o Some hypotheses cannot be lab-tested; they require a real-world environment
o Challenges of experimental design
o Reliable, valid methods?
o Such scenarios exemplify the most difficult elements of human behavior research
o In such an environment, how do you isolate influential forces? Prevent confounding effects? Know what you are measuring?
Conclusions
While RUTS may not be the most scientific framework, and is far from an ideal research method, it offers a chance of understanding a complex environment with countless possible confounding effects
An attempt to bring science to an otherwise unscientific process
Feedback v. Cost
o While user studies offer a great deal of insight, are they worth the cost?
Costly in time, resources, and money
Results can be confounded by a slight oversight in experimental design
Results are not guaranteed to be generalizable unless the population is representative
Application to IR
o Many systems are theoretically sound, but have no bearing on practical application