The University of British Columbia
Faculty of Arts, Arts Instructional Support & Information Technology (Arts ISIT)

Does Instructor Gender Matter? Evaluating the Effect of Gender on Course Evaluation Scores: A Report

Michelle Lee (Student Research Assistant)
Leah P. Macfadyen (Program Director, Evaluation & Learning Analytics)

April 27, 2015

1 Abstract
2 Purpose
3 Introduction
3.1 Past literature
4 Data and Methods
4.1 Dataset
4.2 Methods
4.3 Department code legend
4.4 Course evaluation scores by department
4.5 Score difference between female and male instructors
5 Results
5.1 Average Score Difference – By Large Classes
5.2 Average Score Difference – By Department
5.3 Effect sizes and confidence intervals – instructor gender
5.4 Effect sizes and confidence intervals – course year level (400-level vs. 100-level)
5.5 Effect sizes and confidence intervals – student gender
5.6 Miscellaneous findings: class size
6 Conclusion

1 Abstract

In recent years, universities have placed greater focus on instructors’ teaching quality and student satisfaction. Controversially, student course evaluation scores are used to evaluate the overall effectiveness of an instructor. A plethora of research has been conducted on other variables that influence student evaluation scores, including gender. Some studies suggest a strong instructor gender effect, but their findings have been greatly disputed.

Our project aimed to examine the relationship between student course evaluation scores and instructor gender. Other variables of interest, including student gender, course year level, class size, and department, were also investigated. Our dataset comprised 22,021 student evaluations of teaching from courses within the UBC Faculty of Arts.

The effect of instructor gender was found to be minimal and not statistically significant. Course year level and departmental effects were shown to be significant, but small. Rather than focusing on observable instructor characteristics, future research should focus on hard-to-measure instructor qualities, which are far more likely to predict evaluation scores than gender.

2 Purpose

Our study aimed to answer two questions:

Is there an effect of instructor gender on student evaluation scores in courses within the UBC Faculty of Arts?

Are there interaction effects between evaluation scores and student gender, instructor gender, department, course year level, and class size?

3 Introduction

High-stakes decisions, such as tenure and promotions, now rely on course evaluation scores. A finding of a strong instructor gender effect would certainly have major ramifications for the current course evaluation system.

It is important to note that this study does not attempt to resolve the controversy over the validity of these scores; it asks only whether the scores are biased with respect to instructor gender.

3.1 Past literature

Findings on the effect of teacher gender have been mixed. Many studies have proposed that students are biased against female instructors who adopt a ‘get tough’ approach, compared to male instructors (Koblitz 1990; Basow and Silberg 1987). A few studies have indicated that female instructors need to behave in stereotypically feminine ways to avoid a decrease in student evaluation scores (Bennett 1982; Kierstead 1988).

Several meta-analyses indicate no difference in global evaluation of male and female instructors, or few and inconsistent interaction effects between student and instructor gender (Feldman 1992, 1993; Wachtel 2006; Wright 2005). However, a recent controversial study using online courses suggested strong gender bias in student ratings of teaching, which motivated our study (MacNell 2014).

4 Data and Methods

4.1 Dataset

The original dataset had 52,163 student evaluations on all courses within the UBC Faculty of Arts in the 2013-2014 year. Online courses, grad-level courses, and incomplete entries were omitted. The dataset was also filtered for courses that had at least 1 male and 1 female instructor. After filtering, the final dataset had 22,021 evaluations.
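The filtering steps described above can be sketched in Python as follows. This is a minimal illustration, not the code used for the report (which was written in R): the field names, the example course codes, the assumption that graduate-level courses are numbered 500 and above, and the order in which the filters are applied are all hypothetical.

```python
from collections import defaultdict

def filter_evaluations(rows):
    """Apply the report's stated exclusions to a list of evaluation records.

    Each record is a dict with hypothetical keys:
    course, instructor_gender ("F" or "M"), online (bool), level (int), score.
    """
    # Drop online courses, graduate-level courses (assumed here to be
    # level >= 500), and incomplete entries.
    kept = [
        r for r in rows
        if not r["online"]
        and r["level"] < 500
        and r.get("score") is not None
        and r.get("instructor_gender") in ("F", "M")
    ]

    # Keep only courses taught by at least one female and one male instructor.
    genders_by_course = defaultdict(set)
    for r in kept:
        genders_by_course[r["course"]].add(r["instructor_gender"])
    return [r for r in kept if genders_by_course[r["course"]] == {"F", "M"}]


# Illustrative records only; these are not drawn from the report's dataset.
rows = [
    {"course": "ENGL110", "instructor_gender": "F", "online": False, "level": 100, "score": 4},
    {"course": "ENGL110", "instructor_gender": "M", "online": False, "level": 100, "score": 5},
    {"course": "HIST101", "instructor_gender": "M", "online": False, "level": 100, "score": 3},
    {"course": "PSYC500", "instructor_gender": "F", "online": False, "level": 500, "score": 4},
    {"course": "ENGL110", "instructor_gender": "M", "online": True, "level": 100, "score": 5},
    {"course": "ENGL110", "instructor_gender": "F", "online": False, "level": 100, "score": None},
]
filtered = filter_evaluations(rows)
print(len(filtered))  # 2
```

Only the two complete, in-person ENGL110 records survive in this toy example: the HIST101 course has no female instructor, and the remaining records are graduate-level, online, or incomplete.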

Our focus was on one course evaluation question (coded UMI6), “Overall, the instructor was an effective teacher,” with responses on a five-point Likert scale (‘Strongly disagree’, ‘Disagree’, ‘Neutral’, ‘Agree’, ‘Strongly Agree’). Responses were converted to numeric scores (Strongly disagree = 1 through Strongly Agree = 5). The mean score was 4.18; there were 10,981 evaluations for male instructors and 10,524 evaluations for female instructors.
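As a small illustration of the numeric conversion, the Python sketch below maps the five response labels to scores and averages them. The report states only the endpoints (1 and 5); the intermediate values 2 to 4 are assumed here to follow the usual equal-step coding, and the function name and example responses are hypothetical.

```python
# Assumed equal-step coding; the report gives only the endpoints 1 and 5.
LIKERT_TO_SCORE = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

def mean_umi6(responses):
    """Convert Likert labels to numeric scores and return their mean."""
    scores = [LIKERT_TO_SCORE[label] for label in responses]
    return sum(scores) / len(scores)


# Illustrative responses, not real data.
example = ["Agree", "Strongly Agree", "Neutral", "Agree"]
print(mean_umi6(example))  # 4.0
```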

A sample of the dataset is below:

[Figure 4.1: Sample of the dataset]

4.2 Methods

Hierarchical linear models (random intercept, random slope), using department as the aggregation level, were used to parse effects of gender. A backwards elimination algorithm (using the lmer function from the R package lme4, together with lmerTest and MuMIn) was used to remove non-significant effects from the model and to choose the best model based on the Akaike information criterion (AIC). Tukey’s HSD was also employed. All analyses and visualizations were implemented in R.
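For readers unfamiliar with the terminology, one common way to write a random-intercept, random-slope model with department as the grouping level is sketched below. This is a generic textbook form and not necessarily the exact specification fitted for this report. All symbols are introduced here for illustration, and only an instructor gender term is shown. The other variables of interest (student gender, course year level, class size) would enter as further terms.

```latex
% y_{ij}: UMI6 score of evaluation i in department j
% G_{ij}: indicator for instructor gender (e.g. 1 = female, 0 = male)
% u_{0j}, u_{1j}: department-specific deviations in intercept and slope
y_{ij} = \beta_0 + \beta_1 G_{ij} + u_{0j} + u_{1j} G_{ij} + \varepsilon_{ij},
\qquad
(u_{0j}, u_{1j})^{\top} \sim \mathcal{N}(\mathbf{0}, \Sigma),
\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2})

% Akaike information criterion for a model with k estimated parameters
% and maximized likelihood \hat{L}; lower values are preferred.
\mathrm{AIC} = 2k - 2 \ln \hat{L}
```

In this form, the department-level slope \beta_1 + u_{1j} is the kind of per-department gender coefficient plotted in the effect-size graphs in the Results section.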

4.3 Department code legend

[Figure 4.3: Department code legend]

4.4 Course evaluation scores by department

The Likert plot below is a summary of student satisfaction, aggregated by department. Not all Arts departments are shown below, as the dataset was further filtered for departments with at least 6 instructors.

[Figure 4.4: Likert plot of course evaluation scores by department]

4.5 Score difference between female and male instructors

To calculate the score difference between female and male instructors, two measures were used.

The female-male score difference was calculated as:

[Equation 4.5: Definition of FMScoreDiff]

When aggregating the FMScoreDiff by course year level or department, the difference was weighted by number of students:

[Equation 4.5(1): FMScoreDiff weighted by number of students]

This was done so that each score difference contributed to the aggregate in proportion to class size.
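The original equations are not reproduced here, so the Python sketch below shows one plausible reading of the two measures rather than the authors' exact formulas. It assumes FMScoreDiff is the mean score of female instructors minus the mean score of male instructors (consistent with positive values meaning female instructors are preferred), and that the aggregate is a weighted mean using student counts as weights. The function names and numbers are hypothetical.

```python
def fm_score_diff(female_scores, male_scores):
    """Mean female-instructor score minus mean male-instructor score."""
    female_mean = sum(female_scores) / len(female_scores)
    male_mean = sum(male_scores) / len(male_scores)
    return female_mean - male_mean

def weighted_fm_score_diff(courses):
    """Aggregate per-course differences, weighting each by its student count.

    `courses` is a list of (fm_score_diff, n_students) pairs.
    """
    total_students = sum(n for _, n in courses)
    return sum(diff * n for diff, n in courses) / total_students


# Illustrative values only.
print(fm_score_diff([4, 5], [4, 4]))  # 0.5

# A large course with a small positive difference outweighs a smaller
# course with a negative difference; the result here is about 0.1.
aggregate = weighted_fm_score_diff([(0.2, 100), (-0.1, 50)])
```

Weighting by student count means a single small section with an extreme difference cannot dominate a department-level or year-level average.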

5 Results

Differences in scores between male and female instructors were small. Tukey’s HSD indicated no significant difference in scores between male and female instructors (p = 0.31), but gave some evidence of significant differences between courses of different year levels.

5.1 Average Score Difference – By Large Classes

The graph only shows courses that have at least 3 male and 3 female instructors. The FMScoreDiff was calculated for these courses. Positive score differences indicate female instructors are preferred. It is clear that gender differences are minimal across large courses.

[Figure 5.1: Average female-male score difference for large courses]

5.2 Average Score Difference – By Department

The graph only shows departments that have at least 6 male and 6 female instructors. The FMScoreDiff was calculated for these departments. Positive score differences indicate female instructors are preferred. Again, it is clear that gender differences are minimal across departments.

[Figure 5.2: Average female-male score difference by department]

5.3 Effect sizes and confidence intervals – instructor gender

The graph below represents the coefficients and confidence intervals for the effect of instructor gender, obtained from the hierarchical linear model.

[Figure 5.3: Effect sizes and confidence intervals for instructor gender, by department]

Blue dots indicate departments where male instructors receive higher scores than their female counterparts; red dots indicate departments where female instructors receive higher scores. Lines represent the confidence intervals.

For example, male professors in ASIA and CENE score, on average, 0.15 to 0.25 points higher than their female counterparts. On the other hand, female professors in PSYC, FHIS and ANTH score, on average, 0.15 points higher than their male counterparts.

Departments where female instructors received higher scores: PSYC, FHIS, ECON, and ANTH.

Departments where male instructors received higher scores: CENE and ASIA.

Departments with no significant instructor gender effect: SOCI, POLI, PHIL, MUSC, ENGL, CRWR, CNRS, and AHVA.

5.4 Effect sizes and confidence intervals – course year level (400-level vs. 100-level)

The graph below represents the coefficients and confidence intervals for the effect of course year level, obtained from the hierarchical linear model.

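[Figure 5.4: Effect sizes and confidence intervals for course year level (400-level vs. 100-level), by department]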

Blue dots indicate departments where 400-level courses receive higher scores than 100-level courses; red dots indicate departments where 100-level courses receive higher scores. Lines represent the confidence intervals.

For example, in the Anthropology department, 400-level courses scored 0.6 points higher than 100-level courses.

Departments where 400-level courses were favoured: PSYC, FHIS, ENGL, ECON, and ANTH.

Departments where 100-level courses were favoured: CNRS, CENE, and ASIA.

Departments with no significant course year level effect: SOCI, POLI, PHIL, MUSC, CRWR, and AHVA.

5.5 Effect sizes and confidence intervals – student gender

The graph below represents the coefficients and confidence intervals for the effect of student gender, obtained from the hierarchical linear model.

[Figure 5.5: Effect sizes and confidence intervals for student gender, by department]

Blue dots indicate departments where male students give higher scores than their female counterparts; red dots indicate departments where female students give higher scores. Lines represent the confidence intervals.

For example, in the AHVA (Art History and Visual Art) department, male students gave scores 0.15 points higher than their female counterparts.

Departments where male students gave higher scores: FHIS, ENGL, ANTH, and AHVA.

Departments where female students gave higher scores: SOCI, PHIL, and ECON.

Departments with no significant student gender effect: PSYC, POLI, MUSC, CRWR, CNRS, CENE, and ASIA.

5.6 Miscellaneous findings: class size

While instructor gender was the focus of this study, some interesting miscellaneous results were found. For example, a simple regression of UMI6 evaluation score on class size showed a strong negative relationship.

The graph below shows a correlation plot between score and class size, subdivided by student gender.

[Figure 5.6: Evaluation score versus class size, with regression lines by student gender]

Both male and female students give similar scores, as shown by the red and blue regression lines. Given the high correlation, there is some suggestion that student satisfaction may be closely related to class size.
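To make the idea of a simple regression of score on class size concrete, the Python sketch below computes a least-squares slope and a Pearson correlation by hand. The class sizes and mean scores are invented for illustration and are not the report's data. The function names are hypothetical, and the original analysis was carried out in R.

```python
import math

def ols_slope_intercept(xs, ys):
    """Least-squares slope and intercept for the line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical class sizes and mean UMI6 scores, for illustration only.
class_sizes = [20, 50, 100, 200]
mean_scores = [4.5, 4.3, 4.1, 3.9]

slope, intercept = ols_slope_intercept(class_sizes, mean_scores)
r = pearson_r(class_sizes, mean_scores)
# A negative slope and a negative r indicate that larger classes
# tend to receive lower scores in this toy example.
```

Fitting the same regression separately for male and female students would give the two lines compared in the plot.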

6 Conclusion

Overall, the results show very little evidence supporting an overall effect of instructor gender on course evaluation scores. This is in agreement with established meta-analyses that indicate a lack of evidence for an instructor gender effect. Course year level, student gender, and departmental effects were shown to be significant, but small.

Rather than focusing on observable instructor characteristics, future research should focus on hard-to-measure instructor qualities, which are far more likely to predict evaluation scores than gender.
