Fairness, Scoring Systems and the World Universities Debating Championship

by R. Eric Barnes, Paul Kehle, Chuan-Zheng Lee & Hugh N. McKenny

Tapered scoring creates more accurate breaks than traditional scoring in British Parliamentary (BP) debate tournaments. This paper further examines how BP scoring systems work and addresses some concerns that have been raised about tapered scoring. We begin by deploying some intuitive metrics that have been used in the past to evaluate break accuracy. A significant portion of the paper is devoted to evaluating the level of influence that different rounds have. We consider whether there is any good justification for different rounds having different levels of influence and also whether tapered scoring would unjustly impact certain teams. The paper explores various other topics relevant to understanding scoring systems, such as how call accuracy changes by round, the effect of pulling up teams, and the ability of teams to recover. We end by discussing two broader themes, how to rationally compare competing scoring systems and how to assess the fundamental model that we have used to justify many of our conclusions. This paper assumes a familiarity with our previous paper “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”.

Appendix on Performance Gap Data Acquisition & Analysis

To determine the performance gap between EAD and non-EAD teams, we first had to decide which teams should be categorized as ex ante disadvantaged.  The theoretical and practical complexity of measuring EAD directly clearly precluded any direct determination, so we settled for measuring the international debating community’s consensus on which institutions were EAD using a survey.  This approach carries the risk that people will mistake the results for the reality of disadvantage rather than its perception, thereby reifying existing prejudice.  We hope to mitigate that risk by noting it explicitly and by gathering survey answers from a diverse and knowledgeable group of respondents.

We began by creating a survey listing 316 institutions that have attended recent WUDC tournaments (2017-19).  The survey explained that we were interested specifically in ex ante disadvantage (not other forms of disadvantage) and asked respondents to select one of these answers regarding only ex ante disadvantage for each institution listed:

The survey was posted to six social media sites followed by many in the global debating community.  To ensure a more diverse and well-informed set of respondents, email invitations were also sent to experienced debaters and judges with knowledge of less represented circuits or broad knowledge of the WUDC community over many years.  Respondents needed to fill out a form volunteering to take the survey and were then sent the survey itself.  We received a total of 53 survey responses.

To evaluate the surveys, we assigned point values to each response and then averaged all responses for each institution.  We then categorized institutions into four groups:  no EAD, low EAD, moderate EAD and high EAD.  Any ranges we chose were bound to be somewhat arbitrary, so we simply chose four ranges of equal magnitude.  This list of categorized institutions covered all teams who completed all preliminary rounds at WUDC 2017-19, and it also covered almost all of the teams who competed in 2016 and 2020, so we expanded our analysis to include all of the last five years.  For the few teams whose institutions were not listed in our survey but who competed in 2016 or 2020, we extrapolated what their ranking would likely have been from how similar institutions were ranked.  For example, if every Japanese institution were ranked as moderate EAD, then we felt confident in categorizing a new Japanese institution as moderate EAD.  If this could not be done with confidence, that institution was not included.[26]
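The averaging and binning step above can be sketched in code. The paper does not specify the point values assigned to each survey answer, so the 0–3 scale and the resulting bin width of 0.75 below are illustrative assumptions; only the structure (average all responses per institution, then bin into four equal-magnitude ranges) follows the text.

```python
from statistics import mean

# Hypothetical point values for the survey answers; the actual scale
# used in the paper is not specified.
ANSWER_POINTS = {"no EAD": 0, "low EAD": 1, "moderate EAD": 2, "high EAD": 3}

CATEGORIES = ["no EAD", "low EAD", "moderate EAD", "high EAD"]


def categorize(responses):
    """Average the point values of all survey responses for one
    institution, then bin the average into four ranges of equal
    magnitude over the assumed 0-3 scale."""
    avg = mean(ANSWER_POINTS[r] for r in responses)
    # Equal-width bins: [0, 0.75), [0.75, 1.5), [1.5, 2.25), [2.25, 3]
    index = min(int(avg / 0.75), 3)
    return CATEGORIES[index]
```

For instance, an institution whose responses average to 2.5 on this scale would fall in the top bin and be categorized as high EAD.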

With all institutions categorized, we used speaker points as an estimate of each team’s demonstrated skill in each round.  Of course, speaker points are not the same thing as demonstrated skill because judges are imperfect, but they are the best proxy we have for a cardinal measure of demonstrated skill.  For each round, we calculated the average speaker score for high EAD teams, moderate EAD teams and the set of teams categorized as either no EAD or low EAD.  We then analyzed the trends of these points for each group.  
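The per-round group averages described above amount to grouping speaker scores by (EAD group, round) and taking the mean of each bucket. A minimal sketch, assuming speaker scores arrive as flat (team, round, points) records and that a lookup from team to EAD group has already been built from the categorized institutions:

```python
from collections import defaultdict
from statistics import mean


def round_averages(scores, group_of):
    """Compute the average speaker score per round for each EAD group.

    scores   -- iterable of (team, round_number, speaker_points) records
    group_of -- dict mapping team -> "high EAD", "moderate EAD", or
                "no/low EAD" (the no-EAD and low-EAD categories are
                pooled, as in the analysis described above)
    Returns {(group, round_number): mean speaker points}.
    """
    buckets = defaultdict(list)
    for team, rnd, pts in scores:
        buckets[(group_of[team], rnd)].append(pts)
    return {key: mean(vals) for key, vals in buckets.items()}
```

Plotting these averages against round number for each group then shows the trend the appendix refers to.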

[26]  The 2020 WUDC tab had 13 teams who were completely anonymized, and these teams could not be included in our analysis. 

Works Cited

Barnes, R. E., Kehle, P., McKenny, H. N., & Lee, C. Z. (2020). Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points. International Debate.

Du Toit, N. (2014). An Evaluation of Four-Team-Per-Contest Swiss (Power Paired) Tournament Structures Using Computer Models in Python. Monash Debating Review, 100-116.

Leizrowice, R. (2020, April 20). Debate Seriousposting. Retrieved from Facebook:

Pierson, E. (2013). The Gender Gap in Competitive Debate. Monash Debating Review, 8-16.

Pierson, E. (2016, August). How Much Does Losing First Round of a Tournament Hurt Your Final Result? Retrieved from Obsession with Regression: