
Do student evaluations measure teaching effectiveness?

Philip Stark, professor of statistics | 10/14/13



Since 1975, course evaluations at Berkeley have included the following question:
Considering both the limitations and possibilities of the subject matter and course, how
would you rate the overall teaching effectiveness of this instructor?
1 (not at all effective), 2, 3, 4 (moderately effective), 5, 6, 7 (extremely effective)
Among faculty, student evaluations of teaching are a source of pride and satisfaction,
and of frustration and anxiety. High-stakes decisions including merit reviews, tenure, and
promotions are based in part on these evaluations. Yet, it is widely believed that
evaluations reflect little more than a popularity contest; that it's easy to game the
ratings; that good teachers get bad ratings; that bad teachers get good ratings; and
that fear of bad ratings stifles pedagogical innovation and encourages faculty to water
down course content.
What do we really know about student evaluations of teaching effectiveness?
Quantitative student ratings of teaching are the most common method to evaluate
teaching.[1] De facto, they define effective teaching for many purposes, including
faculty promotions. They are popular partly because the measurement is easy:
Students fill out forms. It takes about 10 minutes of class time and even less faculty
time. The major labor for the institution is to transcribe the data; online evaluations
automate that step. Averages of student ratings have an air of objectivity by virtue of
being numerical. And comparing the average rating of any instructor to the average for
her department as a whole is simple.
While we are not privy to the deliberations of the Academic Senate Budget Committee
(BIR), the idea of comparing an instructor's average score to averages for other
instructors or other courses pervades our institution's culture. For instance, a sample
letter offered by the College of Letters and Sciences for department chairs to request a
targeted decoupling of faculty salary includes:
Smith has a strong record of classroom teaching and mentorship. Recent student
evaluations are good, and Smith's average scores for teaching effectiveness and course
worth are (around) ____________ on a seven-point scale, which compares well with
the relevant departmental averages. Narrative responses by students, such as
________________, are also consistent with Smith's being a strong classroom
instructor.
This places heavy weight on student teaching evaluation scores and encourages
comparing an instructor's average score to the average for her department.
What does such a comparison show?
In this three-part series, we report statistical considerations and experimental evidence
that lead us to conclude that comparing average scores on omnibus questions, such
as the mandatory question quoted above, should be avoided entirely. Moreover, we
argue that student evaluations of teaching should be only a piece of a much richer
assessment of teaching, rather than the focal point. We will ask:
How good are the statistics? Teaching evaluation data are typically spotty and
the techniques used to summarize evaluations and compare instructors or courses are
generally statistically inappropriate.
What do the data measure? While students are in a good position to evaluate
some aspects of teaching, there is compelling empirical evidence that student
evaluations are only tenuously connected to overall teaching effectiveness.[2]
Responses to general questions, such as "overall effectiveness," are
particularly influenced by factors unrelated to learning outcomes, such as the gender,
ethnicity, and attractiveness of the instructor.
What's better? Other ways of evaluating teaching can be combined with student
teaching evaluations to produce a more reliable, meaningful, and useful composite;
such methods were used in a pilot in the Department of Statistics in spring 2013 and
are now department policy.
At the risk of losing our audience right away, we start with a quick nontechnical look at
statistical issues that arise in collecting, summarizing, and comparing student
evaluations. Please read on!
Administering student teaching evaluations
Until recently, paper teaching evaluations were distributed to Berkeley students in
class. The instructor left the room while students filled out the forms. A designated
student collected the completed forms and delivered them to the department office.
Department staff calculated average effectiveness scores, among other things. Ad
hoc committees and department chairs also might excerpt written comments from the
forms.
Online teaching evaluations may become (at departments' option) the primary survey
method at Berkeley this year. This raises additional concerns. For instance, the
availability of data in electronic form invites comparisons across courses, instructors,
and departments; such comparisons are often inappropriate, as we discuss below.
There also might be systematic differences between paper-based and online
evaluations, which could make it difficult to compare ratings across the
discontinuity.[3]

Who responds?
Some students are absent when in-class evaluations are administered. Students who
are present may not fill out the survey; similarly, some students will not fill out online
evaluations.[4] The response rate will be less than 100%. The further the response rate
is from 100%, the less we can infer about the class as a whole.
For example, suppose that only half the class responds, and that all those responders
rate the instructor's effectiveness as 7. The mean rating for the entire class might be
7, if the nonresponders would also have rated it 7. Or it might be as low as 4, if the
nonresponders would have rated the effectiveness 1. While this example is
unrealistically extreme, in general there is no reason to think that the nonresponders
are like the responders. Indeed, there is good reason to think they are not like the
responders: They were not present or they did not fill out the survey. These might be
precisely the students who find the instructor unhelpful.
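To make the arithmetic concrete, here is a minimal sketch in Python of the bounds in the example above; the class size of 20 is hypothetical, chosen only for illustration.

```python
# Bounds on the class-wide mean when only half the class responds
# (hypothetical class of 20 students; all 10 responders rate a 7).
n_class = 20
responders = [7] * 10                   # observed ratings
n_missing = n_class - len(responders)   # 10 students did not respond

# Best case: every nonresponder would also have rated a 7.
best = (sum(responders) + 7 * n_missing) / n_class
# Worst case: every nonresponder would have rated a 1.
worst = (sum(responders) + 1 * n_missing) / n_class

print(f"observed mean: {sum(responders) / len(responders):.1f}")  # 7.0
print(f"possible class-wide mean: {worst:.1f} to {best:.1f}")     # 4.0 to 7.0
```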
There may be biases in the other direction, too. It is human nature to complain more
loudly than one praises: People tend to be motivated to action more by anger than by
satisfaction. Have you ever seen a public demonstration where people screamed "we're
content!"?[5]

The lower the response rate, the less representative of the overall class the responders
might be. Treating the responders as if they are representative of the entire class is a
statistical blunder.
The 1987 Policy for the Evaluation of Teaching (for advancement and
promotion) requires faculty to provide an explanation if the response rate is below a
specified threshold.
This seems to presume that it is the instructor's fault if the response rate is low, and
that a low response rate is in itself a sign of bad teaching.[6] The truth is that if the
response rate is low, the data should not be considered representative of the class as a
whole. An explanation of the low response rate, which generally is not in the
instructor's control, solves nothing.
Averages of small samples are more susceptible to the luck of the draw than averages
of larger samples. This can make teaching evaluations in small classes more extreme
than evaluations in larger classes, even if the response rate is 100%. Moreover, in
small classes students might imagine their anonymity to be more tenuous, which might
reduce their willingness to respond truthfully or to respond at all.
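A rough simulation illustrates the "luck of the draw" point. The rating distribution below is made up, and the sketch assumes students rate independently of one another, which is a simplification; it is meant only to show how much more class averages bounce around in small classes.

```python
# Hypothetical simulation: how much the class-average rating varies by
# chance alone for small versus large classes, even with a 100% response
# rate. The rating distribution is invented for illustration.
import random

random.seed(0)
categories = [1, 2, 3, 4, 5, 6, 7]
weights = [1, 1, 2, 4, 6, 8, 8]   # made-up relative frequencies

def class_average(class_size):
    ratings = random.choices(categories, weights=weights, k=class_size)
    return sum(ratings) / class_size

for size in (7, 35, 150):
    averages = [class_average(size) for _ in range(10_000)]
    print(f"class of {size:>3}: simulated averages range "
          f"{min(averages):.2f} to {max(averages):.2f}")
```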
Averages
As noted above, Berkeley's merit review process invites reporting and comparing
averages of scores, for instance, comparing an instructor's average scores to the
departmental average. Averaging student evaluation scores makes little sense, as a
matter of statistics. It presumes that the difference between 3 and 4 means the same
thing as the difference between 6 and 7. It presumes that the difference between 3
and 4 means the same thing to different students. It presumes that 5 means the same
thing to different students in different courses. It presumes that a 4 balances a 6 to
make two 5s. For teaching evaluations, there's no reason any of those things should be
true.[7]

Effectiveness ratings are what statisticians call an ordinal categorical variable: The
ratings fall in categories with a natural order (7 is better than 6 is better than … is
better than 1), but the numbers 1, 2, …, 7 are really labels of categories, not quantities
of anything. We could replace the numbers with descriptive words and no information
would be lost: The ratings might as well be "not at all effective," "slightly effective,"
"somewhat effective," "moderately effective," "rather effective," "very effective," and
"extremely effective."
Does it make sense to take the average of "slightly effective" and "very effective"
ratings given by two students? If so, is the result the same as two "moderately
effective" scores? Relying on average evaluation scores does just that: It equates the
effectiveness of an instructor who receives two ratings of 4 and the effectiveness of an
instructor who receives a 2 and a 6, since both instructors have an average rating of 4.
Are they really equivalent?
They are not, as this joke shows: Three statisticians go hunting. They spot a deer. The
first statistician shoots; the shot passes a yard to the left of the deer. The second
shoots; the shot passes a yard to the right of the deer. The third one yells, "We got it!"
Even though the average location of the two misses is a hit, the deer is quite
unscathed: Two things can be equal on average, yet otherwise utterly dissimilar.
Averages alone are not adequate summaries of evaluation scores.
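A two-line illustration of the same point, with made-up ratings: identical averages, very different score patterns.

```python
# Two hypothetical instructors with the same average rating but very
# different distributions of scores.
from statistics import mean

instructor_a = [4, 4, 4, 4]   # every rating is a 4
instructor_b = [2, 6, 2, 6]   # half 2s, half 6s

print(mean(instructor_a), mean(instructor_b))        # both averages are 4
print(instructor_a.count(6), instructor_b.count(6))  # 0 vs. 2 ratings of 6
```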
Scatter matters
Comparing an individual instructor's (average) performance with an overall average for
a course or a department is less informative than campus guidelines appear to assume.
For instance, suppose that the departmental average for a particular course is 4.5, and
the average for a particular instructor in a particular semester is 4.2. The instructor is
below average. How bad is that? Is the difference meaningful?
There is no way to tell from the averages alone, even if response rates were perfect.
Comparing averages in this way ignores instructor-to-instructor and semester-to-
semester variability. If all other instructors get an average of exactly 4.5 when they
teach the course, 4.2 would be atypically low. On the other hand, if other instructors
get 6s half the time and 3s the other half of the time, 4.2 is almost exactly in the
middle of the distribution. The variability of scores across instructors and semesters
matters, just as the variability of scores within a class matters. Even if evaluation
scores could be taken at face value, the mere fact that one instructor's average rating
is above or below the mean for the department says very little. Averages paint a very
incomplete picture. It would be far better to report the distribution of scores for
instructors and for courses: the percentage of ratings that fall in each category (1 through 7)
and a bar chart of those percentages.
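As a sketch of what that reporting could look like, the snippet below tallies a hypothetical set of ratings into per-category percentages and a crude text bar chart.

```python
# Report the distribution of ratings rather than just the average:
# the percentage in each category (1 through 7) plus a text bar chart.
# The ratings below are hypothetical.
from collections import Counter

ratings = [3, 3, 3, 3, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7]
counts = Counter(ratings)
n = len(ratings)

for category in range(1, 8):
    count = counts.get(category, 0)
    pct = 100 * count / n
    print(f"{category}: {pct:5.1f}%  {'#' * count}")
```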
All the children are above average
At least half the faculty in any department will have teaching evaluation averages at or
below the median for that department. Someone in the department will be worst. Of
course, it is possible for an entire department to be above average compared to all
Berkeley faculty, by some measure. Rumor has it that department chairs sometimes
argue in merit cases that a faculty member with below-average teaching evaluations is
an excellent teacher, just perhaps not as good as the other teachers in the
department, all of whom are superlative. This could be true in some departments, but
it cannot be true in every department. With apologies to Garrison Keillor, while we have
no doubt that all Berkeley faculty are above average compared to faculty elsewhere, as
a matter of arithmetic they cannot all be above average among Berkeley faculty.
Comparing incommensurables
Different courses fill different roles in students education and degree paths, and the
nature of the interaction between students and faculty in different types of courses
differs. These variations are large and may be confounded with teaching evaluation
scores.[8] Similarly, lower-division students and new transfer students have less
experience with Berkeley courses than seniors have. Students' motivations for taking
courses vary, in some cases systematically by the type of course. It is not clear how
to make fair comparisons of student teaching evaluations across seminars, studios,
labs, large lower-division courses, gateway courses, required upper-division courses,
etc., although such comparisons seem to be common[9] and are invited by the
administration, as evidenced by the excerpt above.
Student Comments
What about qualitative responses, rather than numerical ratings? Students are well
situated to comment about their experience of the course, including factors that influence
teaching effectiveness, such as the instructor's audibility, legibility, and availability
outside class.[10]

However, the depth and quality of students comments vary widely by discipline.
Students in science, technology, engineering, and mathematics tend to write much less,
and much less enthusiastically, than students in arts and humanities. That makes it
hard to use student comments to compare teaching effectiveness across disciplines, a
comparison the Senate Budget Committee and the Academic Personnel Office make.
Below are comments on two courses, one in Physical Sciences and one in Humanities.
By the standards of the disciplines, all four comments are glowing.
Physical Sciences Course:
Lectures are well organized and clear
Very clear, organized and easy to work with
Humanities Course:
There is great evaluation of understanding in this course and allows for critical analysis
of the works and comparisons. The professor prepares the students well in an upbeat
manner and engages the course content on a personal level, thereby captivating the
class as if attending the theater. I've never had such pleasure taking a class. It has
been truly incredible!
Before this course I had only read 2 plays because they were required in High School.
My only expectation was to become more familiar with the works. I did not expect to
enjoy the selected texts as much as I did, once they were explained and analyzed in
class. It was fascinating to see texts that the authors were influenced by; I had no idea
that such a web of influence in Literature existed. I wish I could be more helpful in this
evaluation, but I cannot. I would not change a single thing about this course. I looked
forward to coming to class everyday. I looked forward to doing the reading for this
class. I only wish that it was a year long course so that I could be around the material,
GSIs and professor for another semester.
While some student comments are extremely informative (and we strongly advocate
that faculty read all student comments), it is not obvious how to compare comments
across disciplines to gauge teaching effectiveness accurately and fairly.[11]

In summary:
Response rates matter, but not in the way campus policy suggests. Low response
rates need not signal bad teaching, but they do make it impossible to generalize results
reliably to the whole class. Class size matters, too: All else equal, expect more
semester-to-semester variability in smaller classes.
Taking averages of student ratings does not make much sense
statistically. Rating scales are ordinal categorical, not quantitative, and they may well
be incommensurable across students. Moreover, distributions matter more than
averages.
Comparing instructor averages to department averages is, by itself,
uninformative. Again, the distribution of scores, for individual instructors and for
departments, is crucial to making meaningful comparisons, even if the data are taken
at face value.
Comparisons across course types (seminar/lecture/lab/studio), levels (lower
division / upper division / MA / PhD), functions (gateway/major/elective), sizes (e.g.,
7/35/150/300/800), or disciplines are problematic. Online teaching evaluations invite
potentially inappropriate comparisons.
Student comments provide valuable data about the students' experiences.
Whether they are a good measure of teaching effectiveness is another matter.
In the next installment, we consider what student teaching evaluations can measure
reliably. While students can observe and report accurately some aspects of teaching,
randomized, controlled studies consistently show that end-of-term student evaluations
of teaching effectiveness can be misleading.
