Posted on: October 13, 2021
Written by Michael Boyle and Mike Schmierbach, authors of Applied Communication Research Methods, who discuss how student evaluations can be used to teach research methods.
As instructors, we often struggle to find concrete examples of course concepts that are relevant to both students and faculty. In research methods, this can be particularly difficult, as students may express little interest in data-driven careers or experiences. Yet nearly every student and faculty member has experience with a widespread, controversial, and well-researched example of social science research: teaching evaluations.
Whether known as teaching ratings, student evaluation of teaching, or student ratings of teaching effectiveness, these forms, typically distributed near the end of the semester, commonly include a list of either Likert or yes/no items designed to assess how well the instructor performed throughout the semester. Students and teachers also share both experience with these instruments and a personal investment in their efficacy, and thus course discussion and activities designed around interrogating them offer a point of entry to conversations about validity, question wording, sampling, confidence intervals, and other important topics in research.
Scholarship on student evaluations of teaching
Given their ubiquity, considerable research is available on the topic, and this can be an opportunity to connect scholarship to personal experience. (It also gives instructors the opportunity to learn about and combat the many flaws of teaching evaluations as currently practiced. Everything we report below is based upon specific research, and interested readers can consult the list of further reading for more details and examples to use in class or to share with administrators.) A summary of this work, as well as a list of applicable studies, can be found in the work of Uttl, White, and Gonzalez, who presented a meta-analysis showing no consistent evidence that evaluations are correlated with student learning. Individual studies have varied, with some showing a positive association between objective exam performance and student ratings and others suggesting that highly rated instructors actually did a worse job preparing their students for future classes, even as they earned high marks for giving out high grades. Given the range of findings, it seems clear that, as a whole, teaching evaluations are ineffective as predictors of student learning and preparation for future courses. However, it may be that some instruments are more effective than others, and as measures of student satisfaction, evaluations show some validity.
Further complicating the discussion is the potential for unintended bias in ratings, where traits such as gender, race, attractiveness and even free candy affect ratings in ways unrelated to learning. Because evaluations are used to determine continued employment, tenure and promotion, and merit pay, among other benefits, any unintended bias may exacerbate inequalities in higher education.
Methods concepts related to evaluations
Our textbook, Applied Communication Research Methods, is built around defining key terms and concepts in the research methods literature. In the case of teaching evaluations, we consider four general concepts that could be taught in connection with a discussion of evaluations, as well as some of the specific additional terms that might be included in a lesson.
Measurement validity. Evaluations provide an ideal entry into the question of whether a measurement instrument accurately captures what it claims to measure. This leads to an initial conversation about the actual purpose of the rating instrument: for instance, what does it mean to measure whether the professor is “doing a good job”? Are we trying to capture learning, engagement, enjoyment, or something else? This then can lead to a more nuanced discussion about the extent to which individual items on the rating instrument meet that goal (or not). Some guiding questions to consider in this conversation include:
● Do the items currently on the rating instrument do a good job? (face validity)
● What items should be added to improve the rating instrument? (content validity)
● What can we compare scores on the rating instrument to in order to validate the rating instrument? (criterion validity)
Question wording. Apart from discussing validity, the rating instrument can also be used to discuss common question wording issues that plague questionnaire-based research. For instance, students can look for common problems like double-barreled items, confusing items (question clarity), or leading questions. Additionally, instructor ratings often serve as an example of response set, with participants giving the same answer on item after item, even when they are meant to reflect different aspects of teaching.
Sampling and response bias. Sampling-related issues provide further avenues for discussion, with sample size and response bias as key starting points. In a small class of 20 students, the impact of a low response rate is easy to see: most students understand that if only four people in a class complete evaluations, the results say little about performance and would likely look very different if a different set of students had weighed in. But larger numbers of participants do not necessarily mean evaluations include a representative sample. Students can reflect on the types of people likely to complete evaluations or skip them, and how that can change the scores. They might also think about what specific techniques motivate them to participate, and whether those apply to other kinds of research.
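A short, hands-on way to make this concrete in class is to compute means under different response scenarios. The sketch below uses entirely made-up ratings for a hypothetical class of 20 and shows how the same class can produce very different "average" scores depending on which four students happen to respond:

```python
# Hypothetical ratings for a class of 20 students on a 1-5 scale.
# These numbers are invented purely to illustrate response bias.
all_ratings = [5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1]

def mean(ratings):
    return sum(ratings) / len(ratings)

print(f"Whole class:        {mean(all_ratings):.1f}")  # 3.0

# If only the four most satisfied students bother to respond:
happy_four = sorted(all_ratings, reverse=True)[:4]     # [5, 5, 4, 4]
print(f"Four happiest only: {mean(happy_four):.1f}")   # 4.5

# If only the four least satisfied students respond instead:
unhappy_four = sorted(all_ratings)[:4]                 # [1, 1, 2, 2]
print(f"Four unhappiest:    {mean(unhappy_four):.1f}") # 1.5
```

Students can vary which subset responds to see how far a self-selected sample can drift from the class as a whole, even before any question-wording problems come into play.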
(Over)interpretation of scores. Results from rating instruments also provide the opportunity to discuss core statistical concepts such as means, medians, and standard deviations. Students can examine sample score distributions to see how two seemingly different means may simply reflect the influence of outliers and skewed distributions. More advanced classes can talk about the statistics necessary to compare courses, whether from semester to semester or instructor to instructor. This is a great way to explain the logic of confidence intervals and why a focus on specific point estimates instead of ranges of values betrays a lack of numeracy. For example, in small classes a mean of 4.5 and a mean of 3.5 are unlikely to represent a clear difference. Students can then discuss why standard statistical procedures such as t-tests may not be a solution, because non-random assignment of students to courses, along with the absence of representative samples of students at the end of the semester, violates core requirements for such tests. Ultimately, this can help students learn that numbers do not automatically indicate greater precision or higher quality data.
Acting on research
Ultimately, using evaluations as the basis for in-class discussion of research methods topics can do more than just help students relate to the content. We also hope that it encourages students to think about evaluation more carefully, increasing both thoughtful responses and the overall response rate. For instructors, it might also be a chance to become familiar with the many shortcomings of evaluations, and to be able to provide data-based insights to improve (or even eliminate) the practice.
Our goal with that last point is not to suggest that instructors should ignore student input. Rather, given the limitations of evaluations, instructors and departments should think about other mechanisms to measure teaching and solicit student feedback. Midterm evaluations, focus groups, open discussion with the class, and even experimenting with different approaches at random within or between courses can all provide valuable data. Understanding research methods can help empower instructors, chairs and deans to make better decisions. It isn’t just students who stand to learn and improve by thinking about evaluations in the context of research methods.
List for further reading
Adams, M. J., & Umbach, P. D. (2012). Nonresponse and online student evaluations of teaching: Understanding the influence of salience, fatigue, and academic environments. Research in Higher Education, 53(5), 576-591. https://doi.org/10.1007/s11162-011-9240-5
Bassett, J., Cleveland, A., Acorn, D., Nix, M., & Snyder, T. (2017). Are they paying attention? Students’ lack of motivation and attention potentially threaten the utility of course evaluations. Assessment & Evaluation in Higher Education, 42(3), 431-442. https://doi.org/10.1080/02602938.2015.1119801
Beleche, T., Fairris, D., & Marks, M. (2012). Do course evaluations truly reflect student learning? Evidence from an objectively graded post-test. Economics of Education Review, 31(5), 709-719. https://doi.org/10.1016/j.econedurev.2012.05.001
Braga, M., Paccagnella, M., & Pellizzari, M. (2014). Evaluating students’ evaluations of professors. Economics of Education Review, 41, 71-88. https://doi.org/10.1016/j.econedurev.2014.04.002
Chávez, K., & Mitchell, K. M. (2020). Exploring bias in student evaluations: Gender, race, and ethnicity. PS: Political Science & Politics, 53(2), 270-274. https://doi.org/10.1017/S1049096519001744
Esarey, J., & Valdes, N. (2020). Unbiased, reliable, and valid student evaluations can still be unfair. Assessment & Evaluation in Higher Education, 45(8), 1106-1120. https://doi.org/10.1080/02602938.2020.1724875
Freng, S., & Webber, D. (2009). Turning up the heat on online teaching evaluations: Does “hotness” matter? Teaching of Psychology, 36(3), 189-193. https://doi.org/10.1080/00986280902959739
Kitto, K., Williams, C., & Alderman, L. (2019). Beyond Average: Contemporary statistical techniques for analysing student evaluations of teaching. Assessment & Evaluation in Higher Education, 44(3), 338-360. https://doi.org/10.1080/02602938.2018.1506909
MacNell, L., Driscoll, A., & Hunt, A. N. (2015). What’s in a name: Exposing gender bias in student ratings of teaching. Innovative Higher Education, 40(4), 291-303. https://doi.org/10.1007/s10755-014-9313-4
Uttl, B., White, C. A., & Gonzalez, D. W. (2017). Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22-42. https://doi.org/10.1016/j.stueduc.2016.08.007
Youmans, R. J., & Jee, B. D. (2007). Fudging the numbers: Distributing chocolate influences student evaluations of an undergraduate course. Teaching of Psychology, 34(4), 245-247. https://doi.org/10.1080/00986280701700318