
User Studies

Experimental Design & Evaluation


A look at the work of Diane Kelly, with a focus on two of her publications: Questionnaire Mode Effects in Interactive Information Retrieval Experiments and A Model for Quantitative Evaluation of an End-to-End Question Answering System

Tim Walsh
November 28th, 2011 University of Delaware

Introduction
Crash Course in Experimental Design
Papers


o Questionnaire Mode Effects in Interactive IR o Quantitative Evaluation of End-to-End Q/A System

Summary & Conclusion


Tim Walsh Fall 2011 1

Personal Interest
Interest in Psychology & Cognitive Science
o Experimental Design o Research Methods

User studies are at the core of social science research Research: creating interfaces for specific user populations
o Evaluation requires user studies

If it doesn't work for the user, it doesn't work!



Introduction to User Studies: What's the point?


Experimental Design: General Themes
o Time-consuming
o Difficult to do correctly
o Expensive
o Even well-made experiments can be confounded

So why bother?
o Can't we prove that our system works without involving these pesky users?

"Why, you may ask, do my colleagues and I put so much time, money, and energy into experiments? For social scientists, experiments are like microscopes or strobe lights, magnifying and illuminating the complex, multiple forces that simultaneously exert their influences on us. They help us slow human behavior to a frame-by-frame narration of events, isolate individual forces, and examine them carefully and in more detail. They let us test directly and unambiguously what makes human beings tick and provide a deeper understanding of the features and nuances of our own biases."
Dan Ariely, The Upside of Irrationality
Professor of Psychology & Behavioral Economics

User Studies: Scientific Method


Declare a null hypothesis
Design an experiment to test the null hypothesis
Run the experiment and gather results
Interpret the data using appropriate statistical analysis
Reject or fail to reject the null hypothesis based on the analysis
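As a rough illustration of the final two steps, here is a minimal sketch (with made-up ratings, not data from either paper) of computing a two-sample test statistic using only the standard library:

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with
    (possibly) unequal variances; a common way to test the
    null hypothesis that two group means are equal."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)
    se = math.sqrt(va / na + vb / nb)
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical satisfaction ratings under two interface conditions
control = [3.1, 3.4, 2.9, 3.2, 3.0, 3.3]
treatment = [3.8, 4.1, 3.6, 3.9, 4.0, 3.7]
t = welch_t(treatment, control)

# Compare |t| to a critical value (roughly 2.2 for ~10 degrees of
# freedom at alpha = 0.05) to decide whether to reject the null.
print(round(t, 2))  # 6.48
```

With |t| well above the critical value, the null hypothesis of equal means would be rejected for these invented ratings.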

User Studies: General Design


Choose one or two variables to test
o Too many variables lead to inconclusive results

Control all other factors


o This is a main difficulty
o Hard to account for everything!
o Sometimes hard to control potential confounding variables

Double-blind
o Reduce confounding effects of experiment

Random Sample
o Results drawn from a representative sample of the real user population
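A minimal sketch of random sampling and random assignment; the participant pool, its size, and the group sizes here are invented for illustration:

```python
import random

# Hypothetical participant pool; in practice this would be the
# recruited user population.
pool = [f"participant_{i}" for i in range(200)]

rng = random.Random(42)        # fixed seed so the draw is reproducible
sample = rng.sample(pool, 20)  # simple random sample of 20 users

# Randomly assign the sample to two equal-sized conditions, so that
# condition differences are not confounded with recruitment order.
rng.shuffle(sample)
control, treatment = sample[:10], sample[10:]
print(len(control), len(treatment))  # 10 10
```

Seeding is a practical convenience for reproducibility; a real study would document the sampling frame and assignment procedure rather than rely on one draw.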

User Studies: Challenges & Goals


User Study needs to have
o Objectivity o Reliability o Validity

Different challenges are associated with each
Evaluation of experimental design: is the experiment objective, reliable, and valid?

User Studies: Challenges & Goals


Objectivity
o Judgment based on observable phenomena, uninfluenced by emotions or personal prejudices
o Goal: remove any possible bias from the experiment
o Challenges:
Experimenter bias: use double-blind experiments
Sampling bias: use a random sample of the population

User Studies: Challenges & Goals


Reliability
o Consistency or stability of a measure of behavior
o Goal: the experiment yields the same result each time the test is administered
o Challenges: test-retest reliability, measurement error, internal consistency


User Studies: Challenges & Goals


Validity
o Whether the experiment truly measures what it intends to measure
o Goal: the hypothesis being tested can actually be validated by the experiment, and the results can be generalized
o Challenges: internal & external validity; confounding variables, direction of effects, participant history, maturation


User Studies: General Evaluation Questions

Was the experiment objective? Reliable? Valid?
o Are the results generalizable?
o Is the design of the experiment vulnerable to any bias? Social desirability? Experimenter?
o Was a random sample of the population used?
o Are the results confounded by third variables?

Questionnaire Mode Effects in Interactive Information Retrieval Experiments


Diane Kelly, David J. Harper
School of Information and Library Science, University of North Carolina, Chapel Hill, NC 27599-3360, USA

Brian Landau
School of Computing, Robert Gordon University, Aberdeen, Scotland, UK

Motivation

Questionnaires are a common method of data collection across many scientific fields
o Interactive Information Retrieval

Questionnaires can be administered in many different ways
How does the mode of administering a questionnaire affect the results?


Hypothesis

H1
o Quantitative ratings of system would be more positive when administering questionnaire via interview mode

H2
o Subjects will provide longer and more informative responses to open questions in interview mode than in other modes (i.e., pen and paper, electronic)


Operational Definitions
Question Type
o Closed fixed set of responses o Open open-ended questions; possibilities endless!

Mixed Questionnaire
o Questionnaire contains both closed and open questions


Operational Definitions
Questionnaire Modes
o Interview: face-to-face interview with experimenter
o Pen & Paper: user writes answers by hand, using pen and paper
o Electronic: user answers questions via keyboard and mouse on a computer

Study is investigating differences in results between modalities



Possible Reasons for Difference in Modality


Social Desirability Effect
o Desire to answer according to social norms

Acquiescence
o Tendency to answer in the affirmative

Demand Effects
o Expectation of how the user should behave in research setting

Cognitive & Physical Effort


o Varying levels of mental and physical effort exerted in each modality

Open v. Closed Questions


o Difference in each type of answer

Social Desirability Bias


Experimental design ensures anonymity
Results cannot impact any facet of real life
Psychological effect:
o Desire to answer questions in a manner that will be favored by others o Possibly subconscious

Most sensitive when pertaining to:


o Abilities
o Personality
o Drug use
o Sexual activity
o Other taboo subject matter

Experimental Design

2 x 3 Design
o Search-term highlighting (2 levels): yes, no
o Mode (3 levels): pen & paper, electronic, interview
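The 2 x 3 crossing yields six experimental conditions; this sketch enumerates them and rotates the study's 51 subjects through the conditions (the condition labels and this particular assignment scheme are illustrative assumptions, not necessarily what the paper used):

```python
from itertools import product
from collections import Counter

highlighting = ["highlight-on", "highlight-off"]      # 2 levels
mode = ["pen-and-paper", "electronic", "interview"]   # 3 levels

# Full factorial crossing: every combination of the two factors.
conditions = list(product(highlighting, mode))
print(len(conditions))  # 6

# Rotate subjects through conditions so each condition receives a
# roughly equal share of the 51 participants.
assignment = {s: conditions[s % len(conditions)] for s in range(51)}
counts = Counter(assignment.values())
```

With 51 subjects and 6 conditions, the rotation leaves each condition with 8 or 9 subjects, which is as balanced as the numbers allow.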


Experimental Design

Independent Variables


o Questionnaire mode
o Proximity in interview mode
o Researcher

Dependent Variables
o Various metrics of questionnaire answers

Goal is to analyze differences in content, length, quality, and effectiveness of each questionnaire mode

Experimental Method

Sample (n = 51) of college students


o Almost all undergraduate
o 2/3 female
o Varying levels of computer literacy and confidence
o Varying majors and educational interests
o All paid $20 for their participation
May contribute to social desirability bias: "I'm getting paid, so I'd better do what they want!"


Experimental Method

Preliminary questionnaire
Task


o 4 search-scenario tasks
o 4 post-search questionnaires

Exit Questionnaire
o This is the focus of the study


Experimental Method: Task

Use TREC-8 Interactive Task


o Use XRF search interface o Bogus task to help blind the user from the true research question o Ideally, will limit effect of social desirability bias, demand effect

1.5 hours to complete ( ! )


Experimental Method: Exit Questionnaire


Main effect being studied
Cross-modal
o Variations between modes minimized by controlling differences o Goal: Same questionnaire, just administered via differing modality

Content:
o 21 closed questions
8 usability questions
13 questions regarding the XRF interface
o 4 open-ended questions
What were the most positive things about using this system, and why?
What were the most negative things about using this system, and why?
How would you improve this system, and why?
Is there anything else that you would like to tell us about this system and your experiences using it?


Evaluation of Results

Unitizing process


o Code open-ended answers
o Goal: identify mentions of notable features
o 60% agreement between researchers
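The 60% figure is simple inter-coder agreement; a minimal sketch (the unit labels and codes below are hypothetical) of how percent agreement between two coders can be computed:

```python
def percent_agreement(coder_a, coder_b):
    """Fraction of items on which two coders assigned the same code.
    A crude reliability measure; chance-corrected statistics such as
    Cohen's kappa are usually preferred in practice."""
    assert len(coder_a) == len(coder_b)
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Hypothetical codes assigned to ten answer units by two researchers
a = ["speed", "ui", "ui", "speed", "help", "ui", "speed", "ui", "help", "ui"]
b = ["speed", "ui", "help", "speed", "help", "speed", "speed", "ui", "ui", "speed"]
print(percent_agreement(a, b))  # 0.6
```

Agreement of 0.6 is generally considered low, which is why the unitizing reliability is questioned later in these slides.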


Evaluation of Results
The eventual coding scheme notes certain attributes of answers
Goal: reduce the raw data to its simplest form, removing differences caused by the nature of each questionnaire mode


Dependent Variables


Evaluation of Results

Patterns in results


o Lower scores represent more critical, more valid ratings
Based on the assumption that users will over-inflate system scores
o Scores reflect the acquiescence of each questionnaire modality


Evaluation of Results
Differences in open answers
o Length o Uniqueness

Length varies, uniqueness converges


o Similar semantic results



Evaluation of Results
Additionally, there were varying degrees of efficiency between experimental modalities
Efficiency:
o Number of unique units identified by a given subject
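Efficiency as defined here can be computed directly from coded answers; a sketch with hypothetical subjects and units:

```python
# Hypothetical coded units extracted from each subject's open answers;
# repeated mentions of the same feature count only once.
coded_units = {
    "s1": ["speed", "ui", "speed", "help"],
    "s2": ["ui", "ui", "ui"],
    "s3": ["speed", "help", "export", "ui"],
}

# Efficiency: number of unique units a given subject produced.
efficiency = {subject: len(set(units))
              for subject, units in coded_units.items()}
print(efficiency)  # {'s1': 3, 's2': 1, 's3': 4}
```

Comparing such per-subject counts across modalities is what lets the study say one mode elicits feedback more efficiently than another.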


Problems in Data & Design

After administering the questionnaire, issues were detected in the responses


o Back references
Users felt previous answer(s) were applicable to a different question
"See previous" written as an answer, referencing a previous answer
o Duplication of answers
Electronic questionnaire: users copy & paste one answer into multiple questions


Conclusions
H1
o Quantitative ratings of system would be more positive when administering questionnaire via interview mode

Yes
o Limited supporting data; somewhat inconclusive
o Conclusion: use interview mode to elicit responses to closed questions
Open questions may be confounded


Conclusions
H2
o Subjects will provide longer and more informative responses to open questions in interview mode than in other modes (i.e., pen and paper, electronic)

Yes
o Longer in interview mode, yet not as well formed
o Similar amount of usable feedback across all modalities
Not necessarily more informative


Conclusions
Scores are generally high
o Small range on higher-end of spectrum o Support for acquiescence

Pen & Paper


o Most concise responses

Pen & Paper, Electronic


o Most well-formed results
o Interview yields longer results, yet less well-formed; essentially the same content


Evaluation of Experimental Design

Social Desirability


o No requirement of face-to-face interaction
o An internal desire to fit in with the crowd may still arise in other modalities
o The popularity of social media may alter these results
Twitter, Facebook, etc., are not face-to-face in nature
Very much about fitting in with the crowd
May desensitize the need for face-to-face communication for social desirability bias


Evaluation of Experimental Design

Experimental Fatigue


o Very long experimental task: A LOT of questionnaires!
o Done on a computer
o No break between task and exit questionnaire
Maybe give a few moments to reflect on results
o May have led to differences between modalities
Electronic questionnaire => fatigued user?

Feedback v. No Feedback
o Difference in flow between modalities


Evaluation of Experimental Design

Unitizing process
o Low reliability
o Objectivity?
Create the code before administering the task
o Hard to do given the open-endedness of possible responses
Researchers code answers on the fly using a previously established coding rubric
No raw data, but coded data would be more objective
Data would also be more anonymous


Evaluation of Experimental Design


Confounding Variable: Pace
o Slow => more critical? Possibly; also a result of fatigue
o May be confounding
Electronic, pen & paper => no time constraints per question
Interview => implicit time constraint by nature
o May need to control this variable
Introduce a per-question timer to each modality
Electronic, pen & paper: similar to SAT / GRE
Interview: have the interviewer time the user
Must give timer feedback to all modalities or none

A Model for Quantitative Evaluation of an End-to-End Question-Answering System


Nina Wacholder
School of Communication, Information and Library Studies, Rutgers University, New Brunswick, NJ 08901.

Diane Kelly
School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599

Paul Kantor, Robert Rittman, Ying Sun, and Bing Bai


School of Communication, Information and Library Studies, Rutgers University, New Brunswick, NJ 08901

Sharon Small, Boris Yamrom, and Tomek Strzalkowski


Department of Computer Science, University at Albany, Albany, NY 12222


Motivation
Provide a general, reusable, and realistic framework to evaluate end-to-end question answering systems
Goal:
o Development of a protocol for evaluating a complex question-answering system

End-to-end QA System:
o System that takes free input o Provides specific and accurate answers to questions o Supports beginning to end of QA process

HITIQA

HIgh-Quality InTeractive Question Answering


o Driven by natural-language, computer-driven dialogue

Large text corpus containing partial answers
Users have two main interfaces:
o Question: interactive dialogue, clarify scope of question o Answer: visual interface to explore answer space


HITIQA: How it works


o (via HITIQA website)



Evaluation Techniques

Cross-Evaluation


o Real users assess one another
Individuals who use better systems will produce better results
Collective judgment will converge on a representative assessment of the sample population

Questionnaire
o Assess realistic aspect of task

RUTS paradigm
Cranfield paradigm



Cranfield Paradigm
Effectiveness = good document ranking
Features & Limitations


o Real users are a confounding variable o No successful attempt at creating reusable QA answer key o Idealized assumptions regarding document usefulness o Automatic assessment of QA systems

Wizard of Oz Approach
o Man behind the curtain (or rather, computer) translates the free-text query into a more system-friendly query
o Answer is passed on to the user
o User is unaware of the intermediary

RUTS Paradigm
Real Users, Real Tasks, Real Systems
Real problems?
o Hard to control environment? o Vulnerable to confounding variables? o Results, in theory, should be very generalizable Representative of real world environment


RUTS Paradigm

Seeks to evaluate


o Quality of report produced by information seekers (end of process) o Experience of gathering information by user

Focus on the information gathering process, not quality of answers


o Goes against what many other experiments seek to achieve

Found to be effective and efficient



RUTS Paradigm

Real Users
o Intelligence analysts

Real Tasks
o Preparation of reports consisting of complex questions, multiple sub-questions

Real Systems
o Two versions of HITIQA to simulate different systems


Evaluation

Analyzed using General Linear Model (GLM)


o Determines which scores are significantly different
o Minimizes the effect of differences between judges

Report grading is subjective in nature


o Cross-evaluation used to create an objective analysis
o Measure effectiveness and efficiency using AntWorld
A collaborative tool for searching the web


Cross-Evaluation

Real users assess one another
Results will converge on a representation of the population
Concerns
o What criteria are used to assess the quality of the report?
o Who performs the assessment?
o How is the effect of the system measured?


Cross-Evaluation

What criteria are used to assess the quality of the report?


o 6-point scale: (Worthless) 0 --- 1 --- 2 --- 3 --- 4 --- 5 (Excellent)
o Judged on:
Content
Relevance
Organization
Comprehension
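A minimal sketch of aggregating such cross-evaluation ratings (the reports, judges, and scores are invented): each report's 0-5 scores are averaged across judges, the kind of per-report summary that an analysis like the paper's GLM starts from.

```python
from statistics import mean

# Hypothetical cross-evaluation scores: each report is rated 0-5
# by every analyst (judge) on the panel.
scores = {
    "report_A": {"judge1": 4, "judge2": 3, "judge3": 4},
    "report_B": {"judge1": 2, "judge2": 2, "judge3": 3},
}

# Average across judges to smooth out individual leniency or severity;
# the paper's GLM analysis goes further and separates judge effects.
report_means = {report: mean(judgments.values())
                for report, judgments in scores.items()}
print(report_means)
```

Averaging alone cannot distinguish a harsh judge from a weak report, which is why the paper fits a General Linear Model rather than comparing raw means.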


Cross-Evaluation

Who performs the assessment?


o Analysts are professional information seekers
o The analysts have extensive experience producing reports that conform to specifications set by superiors, etc.
o Same analysts in both workshops, to retain the reliability of the experiment


Cross-Evaluation

How is the effect of the system measured?


o General Linear Model (GLM)
o Protocol designed:
4 analysts (1 left on the last day)
1 day of training devoted to users
4 scenarios available, 2 chosen per user
Report generation instructions provided
2 questionnaires administered to assess:
o degree of realism of the task
o time required to complete the task
o usability of the system
o comfort level of using the system

Improvements to System

Improvements were made to the system between Workshop I and Workshop II


o Analysts felt the biggest improvements related to the management of answers
o A change to the visual display aided exploration of the answer space

Month between each workshop


Results
Quality of reports at the second workshop was much better


o In general, evaluation provided a lot of useful data between Workshop I and Workshop II


Results
Cross-evaluation reports scored higher in the second workshop across all analysts
Possible reasons why:
o Increased effectiveness of the system
o Increased focus generated by cross-evaluation
The judges were more aware of the experimental design, as they were part of the judgment process
Somewhat nullifies the blindness of the experiment
o Increased experience performing the experiment
The more the analysts perform the task, the better they become at it

Many of these confounding effects would be easier to isolate with a larger sample population

Fundamental Challenge of User Studies

How to study the complexities of human behavior in environments where the researchers lack control?
o Some hypotheses cannot be lab-tested; they require a real-world environment
o Challenges of experimental design
o Reliable, valid methods?


Fundamental Challenge of User Studies

End-to-End QA Systems represent such a challenge


o Variable nature of human beings makes analysis of end-to-end QA system interaction difficult
How can the experiment ensure validity of results? How can the experiment ensure reliable results?

o Such scenarios exemplify the most difficult elements of human behavior research o In such an environment, how do you isolate influential forces? Prevent confounding effects? Know what you are measuring?

Fundamental Challenge of User Studies

RUTS helps stabilize the experiment


o Identify patterns that represent user perceptions of different systems and tasks o Provides understanding of possible confounding effects

Cross-evaluation serves as a dynamic measurement


o This evaluation is elastic in nature, allowing for accurate results in an uncontrolled environment


Conclusions

While RUTS may not be the most scientific framework, and is far from an ideal research method, it offers a chance of understanding a complex environment with effectively infinite possible confounding effects
An attempt to bring science to an otherwise unscientific process


Summary & Conclusion


User studies offer insight into human behavior
o Essential in the evaluation of systems

Feedback v. Cost
o While user studies offer a great deal of insight, are they worth the cost?
Consume time, resources, and money
Results can be confounded by a slight oversight in experimental design
Results are not guaranteed to be generalizable unless the population is representative

Application to IR
o Many systems are theoretically sound, but have no bearing on practical application

If it doesn't work for the user, it doesn't work!


