Fairness, Scoring Systems and the World Universities Debating Championship

by R. Eric Barnes, Paul Kehle, Chuan-Zheng Lee & Hugh N. McKenny

Tapered scoring creates more accurate breaks than traditional scoring in British Parliamentary (BP) debate tournaments. This paper further examines how BP scoring systems work and addresses some concerns that have been raised about tapered scoring. We begin by deploying some intuitive metrics that have been used in the past to evaluate break accuracy. A significant portion of the paper is devoted to evaluating the level of influence that different rounds have. We consider whether there is any good justification for different rounds having different levels of influence and also whether tapered scoring would unjustly impact certain teams. The paper explores various other topics relevant to understanding scoring systems, such as how call accuracy changes by round, the effect of pulling up teams, and the ability of teams to recover. We end by discussing two broader themes, how to rationally compare competing scoring systems and how to assess the fundamental model that we have used to justify many of our conclusions. This paper assumes a familiarity with our previous paper “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”.

3.2 | What Should Round Influence Be Like?

Unless there is some good reason for wanting some rounds to be more influential than others, it is better for all rounds to be equally influential.  If some rounds were more influential without good reason, then it would be unfair to those teams who happened to do worse in those arbitrarily more influential rounds.  To put it another way, if it were possible for all rounds to be equally influential, then it would be unfair to make some rounds more influential than others, unless there was some compelling reason to do so.  For example, if making some rounds dramatically more influential than others helped to make the final tab more accurate, that might be a compelling justification.  

Those who would prefer that some rounds count for more than others appear to offer one of three main reasons: 1) some rounds are more accurately judged; 2) some rounds involve teams that are more closely matched in skill; and 3) some rounds are less randomly paired. Let’s consider each of these justifications in turn.

First, if it were true that the calls by judging panels in certain rounds were much more likely to be correct than were the calls in other rounds, then this would be a plausible reason to want those rounds to be more influential. This would be a compelling justification of unequal influence to exactly the extent that it made the final tab more accurate. If the difference in call accuracy were small, then that would probably not justify a major difference in round influence. Moreover, if other factors (e.g., team position) advantaged certain teams, we would not want to amplify this unfairness by giving more influence to teams that randomly drew a favorable position in certain rounds and not others. A similar point can be made for teams who happen to be better or worse at one kind of topic (e.g., economics) than another. All that being said, a significant increase in call accuracy would, ceteris paribus, be a reason for unequal round influence, but only insofar as the unequal influence would increase the overall accuracy of the final tab.

Regarding the second suggestion, it is unclear why one would want to link rewards (or influence on final placement) to winning rounds where skill levels are more similar. Indeed, this situation seems to warrant just the opposite approach. If we already know that the teams’ skill levels are closely matched, we want to make smaller adjustments in where they stand by giving this round relatively less influence. For example, it is counter-productive to take four teams of admittedly similar skill level and use Round 9 to separate them by 90 places on the final tab. This justification is not persuasive.

The third rationale for why some rounds should be less influential is to avoid random disadvantages.  The fact that a round involves random pairing may be given as a justification for it to have less influence.  The rationale here is that some of these randomly allocated rooms will be harder than others.  This is most obviously true in Round 1, which is totally random, but we should not forget that WUDC rules require that teams be randomly placed within each bracket, so almost all rooms are constructed with some degree of random allocation.  (The very top rooms in the last few preliminary rounds may be an exception to this.)  The desire to lower the influence of rounds that are more affected by random room allocation seems reasonable, as it stems from a desire to minimize the damage from the unfair scenario where a team finds that they are assigned to a Round 1 room with 2 or 3 top-tier teams, by sheer bad luck.[5]

To understand the fairness issues with the first and third plausible rationales, it helps to distinguish between procedural fairness and substantive fairness. A purely random room allocation in Round 1 is totally fair from a procedural perspective because it is totally unbiased, in the same way that a random lottery drawing (e.g., for money) is totally procedurally fair. Of course, if a random lottery leaves one person with tons of money while lots of people are destitute, then there will be legitimate criticisms about the substantive fairness of that system, because substantive fairness is about people ending up with what they deserve.[6] Similarly, random room allocation may end up interfering with the accuracy of the tab, a problem of substantive fairness because teams end up ranked lower than they deserve. How to balance these two kinds of fairness is a classic ethical dilemma. Even if violating citizens’ rights to procedural justice (e.g., by allowing warrantless police searches) were to increase the likelihood of achieving substantive justice by convicting only those who are guilty, it would be a bad idea.[7] Similarly, even if allowing CA teams in Round 1 to manually spread out those teams they see as best were to increase the likelihood of achieving an accurate break, it would also be a bad idea. Using random room allocation in Round 1, together with tapered points, is both procedurally fair because being random is unbiased, and substantively fair because it increases the likelihood that teams place where they deserve to.

A fundamental problem here may be that these three rationales for why unequal round influence is justified all presume that we should be making decisions about appropriate round influence levels through the lens of our intuitions about which particular rounds merit greater or lesser reward; but that’s the wrong lens to use, even if everyone could agree on those intuitions (which we doubt).  We do agree that merit matters, but the correct lens is to look at how the system in its entirety rewards merit.  This requires seeing the entire tournament as a single sorting problem, and not just focusing on whether the rewards in some particular rounds comport with our intuitions about fairness.  From a systemic perspective, one can see that the number of points awarded in each round has subtle downstream interactions with other rounds, such that systems need to be evaluated holistically, not round by round.  Put simply, what really matters is that the scoring system as a whole puts teams in the order they merit on the final tab (so long as no procedural injustices are allowed).  And, as we have shown, tapered scoring does a much better job at this than the status quo. 

Even if that doesn’t entirely convince you, it is incumbent on the defender of SQ to show why Round 9 should be 30 times more influential than Round 1. Two clear harms follow from the excessive influence of Round 9. First, teams being given an unbalanced motion in the final (or next to final) round seriously impacts all teams on the bubble by sheer bad luck. Second, teams being pulled up into a much harder bracket have their bad luck dramatically magnified by the status quo.[8] Although tapered scoring does not entirely eliminate either problem, it does significantly mitigate both.[9] These alone are pervasive harms that justify a new approach. In the end, maybe Round 9 should be the most influential; maybe not. Round 9 still is most influential in ET; it’s just not radically more influential.

[5]  Some debate circuits (notably the American Parliamentary Debate Association, APDA) try to mitigate this potential unfairness by having top teams enter Round 1 as “seeded”, so that these teams are “protected” from meeting in Round 1. Such a process is theoretically possible at Worlds, but it would be cumbersome and would also introduce other elements of unfairness, so we do not recommend this.

[6]  In discussions of economic justice, the question of what various people deserve is complex and controversial.  In the context of debate tournaments, we are very comfortable defending the idea that the teams who debated best at the tournament are the ones who deserve to break.

[7]  Whether it is true that warrantless searches would accomplish this is irrelevant to this analogy.  The point is that EVEN IF we suppose that they worked, they wouldn’t be justified.

[8]  We thank Wei Sheng Neo for pointing out that these harms had not been sufficiently emphasized.

[9]  Section 5 discusses how tapered scoring impacts the fairness of pull-ups.  Section 7 discusses how to compare various sources of unfairness with each other.