
User Studies

Experimental Design & Evaluation


A look at the work of Diane Kelly, with a focus on two of her publications: Questionnaire Mode Effects in Interactive Information Retrieval Experiments and A Model for Quantitative Evaluation of an End-to-End Question Answering System

Tim Walsh
November 28th, 2011 University of Delaware

Introduction
Crash Course in Experimental Design
Papers


o Questionnaire Mode Effects in Interactive IR o Quantitative Evaluation of End-to-End Q/A System

Summary & Conclusion


Tim Walsh Fall 2011 1

Personal Interest
Interest in Psychology & Cognitive Science
o Experimental Design o Research Methods

User studies are at the core of social science research Research: creating interfaces for specific user populations
o Evaluation requires user studies

If it doesn't work for the user, it doesn't work!



Introduction to User Studies: What's the point?


Experimental Design: General Themes
o Time-consuming
o Difficult to do correctly
o Expensive
o Even well-made experiments can be confounded

So why bother?
o Can't we prove that our system works without involving these pesky users?

"Why, you may ask, do my colleagues and I put so much time, money, and energy into experiments? For social scientists, experiments are like microscopes or strobe lights, magnifying and illuminating the complex, multiple forces that simultaneously exert their influences on us. They help us slow human behavior to a frame-by-frame narration of events, isolate individual forces, and examine them carefully and in more detail. They let us test directly and unambiguously what makes human beings tick and provide a deeper understanding of the features and nuances of our own biases."
Dan Ariely, The Upside of Irrationality
Professor of Psychology & Behavioral Economics

User Studies: Scientific Method


Declare a null hypothesis
Design an experiment to test the null hypothesis
Run the experiment and gather results
Interpret the data using appropriate statistical analysis
Reject or fail to reject the null hypothesis based on the analysis
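As a rough illustration of the final two steps, here is a minimal sketch (with made-up ratings, not data from either paper) of computing a two-sample test statistic using only the standard library:

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with
    (possibly) unequal variances; a common way to test the
    null hypothesis that two group means are equal."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)
    se = math.sqrt(va / na + vb / nb)
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical satisfaction ratings under two interface conditions
control = [3.1, 3.4, 2.9, 3.2, 3.0, 3.3]
treatment = [3.8, 4.1, 3.6, 3.9, 4.0, 3.7]
t = welch_t(treatment, control)

# Compare |t| to a critical value (roughly 2.2 for ~10 degrees of
# freedom at alpha = 0.05) to decide whether to reject the null.
print(round(t, 2))  # 6.48
```

With |t| well above the critical value, the null hypothesis of equal means would be rejected for these invented ratings.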

User Studies: General Design


Choose one or two variables to test
o Too many variables lead to inconclusive results

Control all other factors


o This is a main difficulty
o Hard to account for everything!
o Sometimes hard to control potential confounding variables

Double-blind
o Reduce confounding effects of experiment

Random Sample
o Results drawn from a representative sample of the real user population
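A minimal sketch of random sampling and random assignment; the participant pool, its size, and the group sizes here are invented for illustration:

```python
import random

# Hypothetical participant pool; in practice this would be the
# recruited user population.
pool = [f"participant_{i}" for i in range(200)]

rng = random.Random(42)        # fixed seed so the draw is reproducible
sample = rng.sample(pool, 20)  # simple random sample of 20 users

# Randomly assign the sample to two equal-sized conditions, so that
# condition differences are not confounded with recruitment order.
rng.shuffle(sample)
control, treatment = sample[:10], sample[10:]
print(len(control), len(treatment))  # 10 10
```

Seeding is a practical convenience for reproducibility; a real study would document the sampling frame and assignment procedure rather than rely on one draw.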

User Studies: Challenges & Goals


User Study needs to have
o Objectivity o Reliability o Validity

Different challenges are associated with each
Evaluation of experimental design: is the experiment objective, reliable, and valid?

User Studies: Challenges & Goals


Objectivity
o Judgment based on observable phenomena, uninfluenced by emotions or personal prejudices
o Goal: remove any possible bias from the experiment
o Challenges:
Experimenter bias: use double-blind experiments
Sampling bias: use a random sample of the population

User Studies: Challenges & Goals


Reliability
o Consistency or stability of a measure of behavior
o Goal: the experiment yields the same result each time the test is administered
o Challenges: test-retest reliability, measurement error, internal consistency


User Studies: Challenges & Goals


Validity
o Whether the experiment truly measures what it intends to measure
o Goal: the hypothesis being tested can actually be validated by the experiment, and the results can be generalized
o Challenges: internal & external validity; confounding variables, direction of effects, participant history, maturation


User Studies: General Evaluation Questions

Was the experiment objective? Reliable? Valid?
o Are the results generalizable?
o Is the design of the experiment vulnerable to any bias? Social desirability? Experimenter?
o Was a random sample of the population used?
o Are the results confounded by third variables?

Questionnaire Mode Effects in Interactive Information Retrieval Experiments


Diane Kelly, David J. Harper
School of Information and Library Science, University of North Carolina, Chapel Hill, NC 27599-3360, USA

Brian Landau
School of Computing, Robert Gordon University, Aberdeen, Scotland, UK

Motivation

Questionnaires are a common method of data collection across many scientific fields
o Interactive Information Retrieval

Questionnaires can be administered in many different ways
How does the mode of administering a questionnaire affect the results?


Hypothesis

H1
o Quantitative ratings of system would be more positive when administering questionnaire via interview mode

H2
o Subjects will provide longer and more informative responses to open questions in interview mode than in other modes (i.e., pen and paper, electronic)


Operational Definitions
Question Type
o Closed fixed set of responses o Open open-ended questions; possibilities endless!

Mixed Questionnaire
o Questionnaire contains both closed and open questions


Operational Definitions
Questionnaire Modes
o Interview: face-to-face interview with experimenter
o Pen & Paper: user writes answers by hand, using pen and paper
o Electronic: user answers questions via keyboard and mouse on a computer

Study is investigating differences in results between modalities



Possible Reasons for Difference in Modality


Social Desirability Effect
o Desire to answer according to social norms

Acquiescence
o Tendency to answer in the affirmative

Demand Effects
o Expectation of how the user should behave in research setting

Cognitive & Physical Effort


o Varying levels of mental and physical effort exerted in each modality

Open v. Closed Questions


o Difference in each type of answer

Social Desirability Bias


Experimental design ensures anonymity
Results cannot impact any facet of real life
Psychological effect:
o Desire to answer questions in a manner that will be favored by others o Possibly subconscious

Most sensitive when pertaining to:


o Abilities
o Personality
o Drug use
o Sexual activity
o Other taboo subject matter

Experimental Design

2 x 3 Design
o Search-term highlighting (2 levels): yes, no
o Mode (3 levels): pen & paper, electronic, interview
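The 2 x 3 crossing yields six experimental conditions; this sketch enumerates them and rotates the study's 51 subjects through the conditions (the condition labels and this particular assignment scheme are illustrative assumptions, not necessarily what the paper used):

```python
from itertools import product
from collections import Counter

highlighting = ["highlight-on", "highlight-off"]      # 2 levels
mode = ["pen-and-paper", "electronic", "interview"]   # 3 levels

# Full factorial crossing: every combination of the two factors.
conditions = list(product(highlighting, mode))
print(len(conditions))  # 6

# Rotate subjects through conditions so each condition receives a
# roughly equal share of the 51 participants.
assignment = {s: conditions[s % len(conditions)] for s in range(51)}
counts = Counter(assignment.values())
```

With 51 subjects and 6 conditions, the rotation leaves each condition with 8 or 9 subjects, which is as balanced as the numbers allow.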


Experimental Design

Independent Variables


o Questionnaire mode
o Proximity in interview mode
o Researcher

Dependent Variables
o Various metrics of questionnaire answers

Goal is to analyze differences in content, length, quality, and effectiveness of each questionnaire mode

Experimental Method

Sample (n = 51) of college students


o Almost all undergraduate
o 2/3 female
o Varying levels of computer literacy and confidence
o Varying majors and educational interests
o All paid $20 for their participation
May contribute to social desirability bias: "I'm getting paid, so I'd better do what they want!"


Experimental Method

Preliminary questionnaire
Task


o 4 search-scenario tasks
o 4 post-search questionnaires

Exit Questionnaire
o This is the focus of the study


Experimental Method: Task

Use TREC-8 Interactive Task


o Use XRF search interface o Bogus task to help blind the user from the true research question o Ideally, will limit effect of social desirability bias, demand effect

1.5 hours to complete ( ! )


Experimental Method: Exit Questionnaire


Main effect being studied
Cross-modal
o Variations between modes minimized by controlling differences o Goal: Same questionnaire, just administered via differing modality

Content:
o 21 closed questions
8 usability questions
13 questions regarding the XRF interface
o 4 open-ended questions
What were the most positive things about using this system, and why?
What were the most negative things about using this system, and why?
How would you improve this system, and why?
Is there anything else that you would like to tell us about this system and your experiences using it?


Evaluation of Results

Unitizing process


o Code open-ended answers
o Goal: identify mentions of notable features
o 60% agreement between researchers
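The 60% figure is simple inter-coder agreement; a minimal sketch (the unit labels and codes below are hypothetical) of how percent agreement between two coders can be computed:

```python
def percent_agreement(coder_a, coder_b):
    """Fraction of items on which two coders assigned the same code.
    A crude reliability measure; chance-corrected statistics such as
    Cohen's kappa are usually preferred in practice."""
    assert len(coder_a) == len(coder_b)
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Hypothetical codes assigned to ten answer units by two researchers
a = ["speed", "ui", "ui", "speed", "help", "ui", "speed", "ui", "help", "ui"]
b = ["speed", "ui", "help", "speed", "help", "speed", "speed", "ui", "ui", "speed"]
print(percent_agreement(a, b))  # 0.6
```

Agreement of 0.6 is generally considered low, which is why the unitizing reliability is questioned later in these slides.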


Evaluation of Results
The eventual coding scheme notes certain attributes of answers
Goal: reduce the raw data to its simplest form, removing differences caused by the nature of each questionnaire mode


Dependent Variables


Evaluation of Results

Patterns in results


o Lower scores represent more critical, more valid ratings
Based on the assumption that users will over-inflate system scores
o Scores reflect the acquiescence of each questionnaire modality


Evaluation of Results
Differences in open answers
o Length o Uniqueness

Length varies, uniqueness converges


o Similar semantic results



Evaluation of Results
Additionally, there were varying degrees of efficiency between experimental modalities
Efficiency:
o Number of unique units identified by a given subject
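Efficiency as defined here can be computed directly from coded answers; a sketch with hypothetical subjects and units:

```python
# Hypothetical coded units extracted from each subject's open answers;
# repeated mentions of the same feature count only once.
coded_units = {
    "s1": ["speed", "ui", "speed", "help"],
    "s2": ["ui", "ui", "ui"],
    "s3": ["speed", "help", "export", "ui"],
}

# Efficiency: number of unique units a given subject produced.
efficiency = {subject: len(set(units))
              for subject, units in coded_units.items()}
print(efficiency)  # {'s1': 3, 's2': 1, 's3': 4}
```

Comparing such per-subject counts across modalities is what lets the study say one mode elicits feedback more efficiently than another.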


Problems in Data & Design

After administering the questionnaire, issues were detected in the responses


o Back references
Users felt previous answer(s) were applicable to a different question
"See previous" written as an answer, referencing a previous answer
o Duplication of answers
Electronic questionnaire: users copy & paste one answer into multiple questions


Conclusions
H1
o Quantitative ratings of system would be more positive when administering questionnaire via interview mode

Yes
o Limited supporting data; somewhat inconclusive
o Conclusion: use interview mode to elicit responses to closed questions
Open questions may be confounded


Conclusions
H2
o Subjects will provide longer and more informative responses to open questions in interview mode than in other modes (i.e., pen and paper, electronic)

Yes
o Longer in interview mode, yet not as well formed
o Similar amount of usable feedback across all modalities
Not necessarily more informative


Conclusions
Scores are generally high
o Small range on higher-end of spectrum o Support for acquiescence

Pen & Paper


o Most concise responses

Pen & Paper, Electronic


o Most well-formed results
o Interview yields longer results, yet less well-formed; essentially the same content


Evaluation of Experimental Design

Social Desirability


o No requirement of face-to-face interaction
o An internal desire to fit in with the crowd may still arise in other modalities
o The popularity of social media may alter these results
Twitter, Facebook, etc., are not face-to-face in nature
Very much about fitting in with the crowd
May desensitize the need for face-to-face communication for social desirability bias


Evaluation of Experimental Design

Experimental Fatigue


o Very long experimental task: A LOT of questionnaires!
o Done on a computer
o No break between task and exit questionnaire
Maybe give a few moments to reflect on results
o May have led to differences between modalities
Electronic questionnaire => fatigued user?

Feedback v. No Feedback
o Difference in flow between modalities


Evaluation of Experimental Design

Unitizing process
o Low reliability
o Objectivity?
Create the code before administering the task
o Hard to do given the open-endedness of possible responses
Researchers code answers on the fly using a previously established coding rubric
No raw data, but coded data would be more objective
Data would also be more anonymous


Evaluation of Experimental Design


Confounding Variable: Pace
o Slow => more critical? Possibly; also a result of fatigue
o May be confounding
Electronic, pen & paper => no time constraints per question
Interview => implicit time constraint by nature
o May need to control this variable
Introduce a per-question timer to each modality
Electronic, pen & paper: similar to SAT / GRE
Interview: have the interviewer time the user
Must give timer feedback to all modalities or none

A Model for Quantitative Evaluation of an End-to-End Question-Answering System


Nina Wacholder
School of Communication, Information and Library Studies, Rutgers University, New Brunswick, NJ 08901.

Diane Kelly
School of Information and Library Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599

Paul Kantor, Robert Rittman, Ying Sun, and Bing Bai


School of Communication, Information and Library Studies, Rutgers University, New Brunswick, NJ 08901

Sharon Small, Boris Yamrom, and Tomek Strzalkowski


Department of Computer Science, University at Albany, Albany, NY 12222


Motivation
Provide a general, reusable, and realistic framework to evaluate end-to-end question answering systems
Goal:
o Development of a protocol for evaluating a complex question-answering system

End-to-end QA System:
o System that takes free input o Provides specific and accurate answers to questions o Supports beginning to end of QA process

HITIQA

HIgh-Quality InTeractive Question Answering


o Driven by natural-language, computer-driven dialogue

Large text corpus containing partial answers
Users have two main interfaces:
o Question: interactive dialogue, clarify scope of question o Answer: visual interface to explore answer space


HITIQA: How it works


o (via HITIQA website)



Evaluation Techniques

Cross-Evaluation


o Real users assess one another
Individuals who use better systems will produce better results
Collective judgment will converge on a representative assessment of the sample population

Questionnaire
o Assess realistic aspect of task

RUTS paradigm
Cranfield paradigm



Cranfield Paradigm
Effectiveness = good document ranking
Features & Limitations


o Real users are a confounding variable o No successful attempt at creating reusable QA answer key o Idealized assumptions regarding document usefulness o Automatic assessment of QA systems

Wizard of Oz Approach
o Man behind the curtain (or rather, computer) translates the free-text query into a more system-friendly query
o Answer is passed on to the user
o User is unaware of the intermediary

RUTS Paradigm
Real Users, Real Tasks, Real Systems
Real problems?
o Hard to control environment? o Vulnerable to confounding variables? o Results, in theory, should be very generalizable Representative of real world environment


RUTS Paradigm

Seeks to evaluate


o Quality of report produced by information seekers (end of process) o Experience of gathering information by user

Focus on the information gathering process, not quality of answers


o Goes against what many other experiments seek to achieve

Found to be effective and efficient



RUTS Paradigm

Real Users
o Intelligence analysts

Real Tasks
o Preparation of reports consisting of complex questions, multiple sub-questions

Real Systems
o Two versions of HITIQA to simulate different systems


Evaluation

Analyzed using General Linear Model (GLM)


o Determines which scores are significantly different
o Minimizes the effect of differences between judges

Report grading is subjective in nature


o Cross-evaluation used to create an objective analysis
o Measure effectiveness and efficiency using AntWorld
A collaborative tool for searching the web


Cross-Evaluation

Real users assess one another
Results will converge on a representation of the population
Concerns
o What criteria are used to assess the quality of the report?
o Who performs the assessment?
o How is the effect of the system measured?


Cross-Evaluation

What criteria are used to assess the quality of the report?


o 6-point scale: (Worthless) 0 --- 1 --- 2 --- 3 --- 4 --- 5 (Excellent)
o Judged on:
Content
Relevance
Organization
Comprehension
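A minimal sketch of aggregating such cross-evaluation ratings (the reports, judges, and scores are invented): each report's 0-5 scores are averaged across judges, the kind of per-report summary that an analysis like the paper's GLM starts from.

```python
from statistics import mean

# Hypothetical cross-evaluation scores: each report is rated 0-5
# by every analyst (judge) on the panel.
scores = {
    "report_A": {"judge1": 4, "judge2": 3, "judge3": 4},
    "report_B": {"judge1": 2, "judge2": 2, "judge3": 3},
}

# Average across judges to smooth out individual leniency or severity;
# the paper's GLM analysis goes further and separates judge effects.
report_means = {report: mean(judgments.values())
                for report, judgments in scores.items()}
print(report_means)
```

Averaging alone cannot distinguish a harsh judge from a weak report, which is why the paper fits a General Linear Model rather than comparing raw means.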


Cross-Evaluation

Who performs the assessment?


o Analysts are professional information seekers
o The analysts have extensive experience producing reports that conform to specifications set by superiors, etc.
o Same analysts in both workshops, to retain the reliability of the experiment


Cross-Evaluation

How is the effect of the system measured?


o General Linear Model (GLM)
o Protocol designed:
4 analysts (1 left on the last day)
1 day of training devoted to users
4 scenarios available, 2 chosen per user
Report generation instructions provided
2 questionnaires administered to assess:
o degree of realism of the task
o time required to complete the task
o usability of the system
o comfort level of using the system

Improvements to System

Improvements were made to the system between Workshop I and Workshop II


o Analysts felt the biggest improvements related to the management of answers
o A change to the visual display aided exploration of the answer space

Month between each workshop


Results
Quality of reports at the second workshop was much better


o In general, evaluation provided a lot of useful data between Workshop I and Workshop II


Results
Cross-evaluation reports scored higher in the second workshop across all analysts
Possible reasons why:
o Increased effectiveness of the system
o Increased focus generated by cross-evaluation
The judges were more aware of the experimental design, as they were part of the judgment process
Somewhat nullifies the blindness of the experiment
o Increased experience performing the experiment
The more the analysts perform the task, the better they become at it

Many of these confounding effects would be easier to isolate with a larger sample population

Fundamental Challenge of User Studies

How to study the complexities of human behavior in environments where the researchers lack control?
o Some hypotheses cannot be lab-tested; they require a real-world environment
o Challenges of experimental design
o Reliable, valid methods?


Fundamental Challenge of User Studies

End-to-End QA Systems represent such a challenge


o Variable nature of human beings makes analysis of end-to-end QA system interaction difficult
How can the experiment ensure validity of results? How can the experiment ensure reliable results?

o Such scenarios exemplify the most difficult elements of human behavior research o In such an environment, how do you isolate influential forces? Prevent confounding effects? Know what you are measuring?

Fundamental Challenge of User Studies

RUTS helps stabilize the experiment


o Identify patterns that represent user perceptions of different systems and tasks o Provides understanding of possible confounding effects

Cross-evaluation serves as a dynamic measurement


o This evaluation is elastic in nature, allowing for accurate results in an uncontrolled environment


Conclusions

While RUTS may not be the most scientific framework, and is far from an ideal research method, it offers a chance of understanding a complex environment with effectively infinite possible confounding effects
An attempt to bring science to an otherwise unscientific process


Summary & Conclusion


User studies offer insight into human behavior
o Essential in the evaluation of systems

Feedback v. Cost
o While user studies offer a great deal of insight, are they worth the cost?
Consume time, resources, and money
Results can be confounded by a slight oversight in experimental design
Results are not guaranteed to be generalizable unless the population is representative

Application to IR
o Many systems are theoretically sound, but have no bearing on practical application

If it doesn't work for the user, it doesn't work!


