Fairness, Scoring Systems and the World Universities Debating Championship

by R. Eric Barnes, Paul Kehle, Chuan-Zheng Lee & Hugh N. McKenny

Tapered scoring creates more accurate breaks than traditional scoring in British Parliamentary (BP) debate tournaments. This paper further examines how BP scoring systems work and addresses some concerns that have been raised about tapered scoring. We begin by deploying some intuitive metrics that have been used in the past to evaluate break accuracy. A significant portion of the paper is devoted to evaluating the level of influence that different rounds have. We consider whether there is any good justification for different rounds having different levels of influence and also whether tapered scoring would unjustly impact certain teams. The paper explores various other topics relevant to understanding scoring systems, such as how call accuracy changes by round, the effect of pulling up teams, and the ability of teams to recover. We end by discussing two broader themes, how to rationally compare competing scoring systems and how to assess the fundamental model that we have used to justify many of our conclusions. This paper assumes a familiarity with our previous paper “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”.

Section 6: Tapered Scoring and Recovery from Bad Rounds

It is a virtue of a scoring system that a single bad result does not doom a team’s chances of breaking. Teams should be able to recover from bad results in one or two rounds, whether these are due to bad debating or (worse) bad judging. But there’s a balance to be struck here with accuracy. It would be a vice of a scoring system if it were possible to break after earning zero points in the first two days. Indeed, one could construct a scoring system in which every team is live until Round 9, but it would be catastrophically inaccurate. Being able to recover and break after a bad performance or two is good; being able to recover and break after many bad performances is bad. The key is finding the balance.

SQ starts having mathematically dead rooms (for the open break) in Round 5. A team can take zero points on day 1 and still have a mathematical possibility of breaking open. Of course, there is no realistic chance of a team breaking open after such a bad start, and the adjudication core is unlikely to treat a zero-point room in Round 4 as genuinely live. In contrast, ET starts having mathematically open-dead rooms in Round 3. A team can take a 4th in Round 1 and still break open, even if they take another 4th in the second half of the tournament. But somewhere in the first two rounds they need to beat two teams if they are to stay in contention for the open break. Obviously, if they take a 4th in Round 1, the path to the open break will be narrower. This is necessary to improve the accuracy of the break. But the very rare team that deserves to break open and takes a 4th in Round 1 will be in easier rooms for more early rounds until they catch up to rooms with mostly open-break-level teams, and those early rounds still have more points on offer. Yes, bad results in both of the first two rounds could knock a break-level team out of the open break, but this is much less likely than the problem in SQ of bad results in late rounds knocking break-quality teams out of the break. For a discussion of weighing these various sources of bad luck, see the following section.
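The notion of a "mathematically dead" room can be made precise: a zero-point team entering round r is dead once even winning every remaining round cannot reach the break threshold. The sketch below computes the first such round. The threshold of 18 and the tapered point schedule are illustrative assumptions for this example, not the actual values from the paper; only the logic is the point.

```python
def first_dead_round(first_place_points, threshold):
    """Earliest round r in which a team on zero points can no longer
    reach `threshold`, even by taking 1st in rounds r..n.
    `first_place_points[i]` is the 1st-place score in round i+1.
    Returns None if a zero-point team stays live through every round."""
    n = len(first_place_points)
    for r in range(1, n + 1):
        max_achievable = sum(first_place_points[r - 1:])  # win out from round r
        if max_achievable < threshold:
            return r
    return None

# Status quo: 1st place is worth 3 points in all nine rounds.
SQ_FIRSTS = [3] * 9
# A hypothetical taper, front-loading points (NOT the paper's ET values).
ET_FIRSTS = [6, 5, 4, 4, 3, 3, 2, 2, 1]

# With an illustrative break threshold of 18 under SQ (and a
# proportionally scaled threshold of 20 under the hypothetical taper),
# zero-point rooms go dead in Round 5 and Round 3 respectively.
print(first_dead_round(SQ_FIRSTS, 18))  # → 5
print(first_dead_round(ET_FIRSTS, 20))  # → 3
```

Because a taper concentrates points in early rounds, fewer points remain on offer late, so dead rooms appear earlier; this is the trade-off against recoverability discussed above.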

Section 7: Comparing Unfairness

There are some unfair situations that will be more likely with tapered scoring, just as there are some unfair situations that will be more likely with the status quo. What we need is a way to account for how likely and how severe various unfair situations are. There is a tendency in discussions like this one for proponents and opponents of the proposed system to present stories, both real and hypothetical, that illustrate ways in which a system can create unfair results. This can be done well, using narrative to illustrate a major source of unfairness, or it can be done poorly, using an unusual situation to argue by anecdote. In fact, we started our paper with a story about a very strong team missing the break at 2007 Worlds, in order to help explain part of the motivation for our research. Such stories can undoubtedly have rhetorical power, but simply trading them is obviously no way to rationally resolve the dispute. Most of the scenarios that are offered (ours included) illustrate instances of putative unfairness, and since our primary motive in this research is fairness, these scenarios are certainly not irrelevant. But we need some way to assess which system will produce more total unfairness. Let us be clear: no system will eliminate unfair scenarios. Every year, no matter what system we use, some teams will be treated unfairly by judges or by an imperfect scoring system. So the question is: how do we determine which system will minimize the number and the severity of these inevitable situations?

Basically, this is the entire point of our statistical research. Imagine a team that debated very well throughout the tournament, but got unfairly dropped in Rounds 1 and 2 under the ET system and didn’t break; that team is part of our statistics. Imagine a team under the SQ system that consistently debated very well, picked up plenty of points on days 1 and 2, then didn’t pick up enough points in top rooms on day 3 to break, despite still debating very well; that team is also part of our statistics. We can’t tell which system will minimize these unfair situations without building a model and running millions of simulations, so that we capture all the different ways that teams get treated unfairly, how frequent each of these is, and how grave the unfairness is in each case. The metrics we use to measure accuracy are, in effect, also measuring how likely these unfair situations are and how egregious they are (e.g., it is more unfair if the team that debated 8th best in prelims misses the break than if the team that debated 48th best does, even though both are unfair). Ultimately, the conclusion is clear: we get fewer and less grievous unfair situations under ET. That is what the lower numbers in the data really mean, across multiple metrics. Yes, there will be unfair outcomes under ET, and it’s not hard to describe what they could look like. But there will be fewer of them, and they won’t be as bad as the harms that will persist if we stay with the status quo. If you don’t care about minimizing unfairness, then none of this will concern you. But of course, then you presumably won’t be trying to convince people that a system is bad by constructing narratives of unfairness.
