Fairness, Scoring Systems and the World Universities Debating Championship

by R. Eric Barnes, Paul Kehle, Chuan-Zheng Lee & Hugh N. McKenny

Tapered scoring creates more accurate breaks than traditional scoring in British Parliamentary (BP) debate tournaments. This paper further examines how BP scoring systems work and addresses some concerns that have been raised about tapered scoring. We begin by applying some intuitive metrics that have been used in the past to evaluate break accuracy. A significant portion of the paper is devoted to evaluating the level of influence that different rounds have. We consider whether there is any good justification for different rounds having different levels of influence, and also whether tapered scoring would unjustly impact certain teams. The paper explores various other topics relevant to understanding scoring systems, such as how call accuracy changes by round, the effect of pulling up teams, and the ability of teams to recover. We end by discussing two broader themes: how to rationally compare competing scoring systems, and how to assess the fundamental model that we have used to justify many of our conclusions. This paper assumes a familiarity with our previous paper “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”.

Section 1: Introduction

In our previous paper “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”, we wrote about improving the accuracy of the break at British Parliamentary debate tournaments.  For that research, we used computer simulations of millions of tournaments to analyze how the accuracy of the break could be improved.  A more accurate break means a break where teams who debated better at that tournament are more likely to break, and where the teams who do break are better ordered according to this demonstrated skill.  We first confirmed the widely held belief that running more preliminary rounds would increase break accuracy significantly, and then we showed that using a different scoring system would also result in a much more accurate break.  Using five different metrics, we showed that by using a new scoring system called Early Taper (ET) over 9 rounds, we could achieve a more accurate break than by using the Status Quo (SQ), even if the SQ scoring system were used for 18 rounds.  The ET scoring system offers more points in the early rounds of the tournament and then a typical amount of points in later rounds, as shown in Table 1.

Table 1:  Defining SQ & ET scoring systems
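The mechanics of the two systems can be sketched in a few lines of code. This is only an illustration: the SQ values (3/2/1/0 per round for 1st through 4th) are the standard BP points, but the taper values shown for ET below are hypothetical placeholders, since the actual ET values are those defined in Table 1.

```python
# Illustrative sketch of BP scoring systems. The SQ values are standard;
# the ET taper values here are HYPOTHETICAL, not the actual Table 1 values.
# In BP, four teams per room are ranked 1st-4th in each round.

# Status Quo (SQ): every round awards 3/2/1/0 points for 1st/2nd/3rd/4th.
SQ = [(3, 2, 1, 0)] * 9

# A hypothetical early taper: inflated points in the first two rounds,
# then the usual 3/2/1/0 for rounds 3 through 9.
ET = [(9, 6, 3, 0),                    # round 1 (hypothetical taper)
      (6, 4, 2, 0)] + [(3, 2, 1, 0)] * 7  # rounds 2-9

def total_points(placements, system):
    """Sum a team's points given its per-round placements (1-4)."""
    return sum(system[r][p - 1] for r, p in enumerate(placements))

# A team that takes two firsts early, then four thirds and three fourths:
placements = [1, 1, 3, 3, 3, 3, 4, 4, 4]
print(total_points(placements, SQ))  # 3+3+1+1+1+1+0+0+0 = 10
print(total_points(placements, ET))  # 9+6+1+1+1+1+0+0+0 = 19
```

The point of the sketch is that the same sequence of results yields different totals under the two systems: strong early results count for more under a taper, which is what drives the differences in break accuracy examined below.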

The Break Accuracy paper explained why tapered point systems like ET work better than the status quo system and also answered various objections to adopting the new system.  The present paper continues to look deeper into how scoring systems work, how to measure their success and how to fully address the most common concerns about implementing ET.  Section 2 takes a closer look at the results of a few randomly selected pairs of simulations that share identical inputs, to see how the results of ET and SQ differ when holding everything constant.  We also introduce two new metrics that can be aggregated over thousands of simulations.  Section 3 explores how much influence the results from each of the 9 rounds at Worlds have on a team’s final standing in both SQ and ET scoring systems.  We also look closely at the claim that ET scoring might be detrimental to the competitive success of teams from disadvantaged institutions, and also how ET might help these teams.  Section 4 looks at what we can learn from our model about which round has the most reliable and valid decisions, taking into consideration a range of assumptions about judge packing.  That is, we ask in which round judge panels are most likely to give the correct call.  Section 5 looks at how ET impacts pull-ups.  Section 6 examines how ET affects a team’s ability to recover from bad rounds.  Section 7 discusses how to weigh the various trade-offs in scoring system advantages and disadvantages regarding fairness.  Lastly, Section 8 takes a few steps back from the details and discusses the overall robustness of the model we are using and addresses a range of concerns about its applicability to actual debate tournaments.
