Fairness, Scoring Systems and the World Universities Debating Championship

by R. Eric Barnes, Paul Kehle, Chuan-Zheng Lee & Hugh N. McKenny

Tapered scoring creates more accurate breaks than traditional scoring in British Parliamentary (BP) debate tournaments. This paper further examines how BP scoring systems work and addresses some concerns that have been raised about tapered scoring. We begin by deploying some intuitive metrics that have been used in the past to evaluate break accuracy. A significant portion of the paper is devoted to evaluating the level of influence that different rounds have. We consider whether there is any good justification for different rounds having different levels of influence and also whether tapered scoring would unjustly impact certain teams. The paper explores various other topics relevant to understanding scoring systems, such as how call accuracy changes by round, the effect of pulling up teams, and the ability of teams to recover. We end by discussing two broader themes: how to rationally compare competing scoring systems, and how to assess the fundamental model that we have used to justify many of our conclusions. This paper assumes a familiarity with our previous paper “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”.

3.3 | Round Influence and Global Injustice

A critic of tapered scoring might agree that stepping back to look at the whole system is justified, and then go on to argue that we need to take an even bigger step back to look at the unjust social and economic system in which debate tournaments exist.  Obviously, some teams are at a competitive disadvantage because their institutions have significant economic limitations, because their local debate circuits are not highly developed or integrated into international circuits, or because of other factors that exist before the rounds at Worlds even begin.[10]  In short, because we live in a world that is pervasively unjust regarding distribution of resources and opportunities, some teams are ex ante disadvantaged.  Call these “EAD teams”.[11]

Out of this same set of concerns arises an objection to tapered scoring.  The objection is based on two empirical premises.  First, because of the various impacts of global injustice, EAD teams tend to be less well prepared for competition at Worlds (e.g., they are less familiar with typical WUDC judging standards).  Second, because of this, EAD teams will tend to do worse in early rounds, but their performance quality increases at a faster rate than that of non-EAD teams during the tournament.  We will call this the “gap closing” premise.  It would be useful to have data that might support or refute these empirical premises, but they are prima facie plausible enough to take very seriously.  Indeed, the first premise seems obviously true and is consistent with the data we see in WUDC results.[12]  But the key to the objection is the gap closing premise.  It seems to follow from the gap closing premise that EAD teams would perform better with SQ scoring (where early rounds are less influential) even if overall break accuracy were improved by using ET.  Of course, the objection here is not that tapered scoring is intended to harm EAD teams, just that it has this unfortunate side effect.

Before starting a deeper analysis of this objection to tapered scoring based on ex ante disadvantage, let’s contemplate another scoring change that might benefit EAD teams.  If EAD teams struggle particularly in early rounds (say, 1 through 3) and then improve significantly (i.e., close the gap), then by the same logic as used in the objection, these teams would be benefitted by placing even less emphasis on early rounds than in the SQ.  This could be done with a new scoring system that might be called “First Day Discount” (or FDD), which counts rounds 1 through 3 as worth half as many points as rounds 4 through 9.[13]  FDD could easily be implemented at Worlds, and if the gap closing premise at the foundation of this objection is true, then this would benefit EAD teams.  Does it follow that we should prefer the FDD scoring system over SQ?  Although it is possible that FDD is a better system than SQ, it does not follow from what has been presented so far.
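The arithmetic of the hypothetical FDD system can be made concrete with a short sketch. This is purely illustrative (the example team's placings are invented): SQ awards 3/2/1/0 team points by place in every round, while FDD, represented in whole numbers as in note 13, awards 3/2/1/0 in rounds 1 through 3 and 6/4/2/0 thereafter.

```python
SQ_POINTS = {1: 3, 2: 2, 3: 1, 4: 0}  # team points by place, per round

def sq_total(places):
    """Total team points under the status quo (SQ): every round weighted equally."""
    return sum(SQ_POINTS[p] for p in places)

def fdd_total(places):
    """Total team points under First Day Discount (FDD):
    rounds 1-3 count half as much as rounds 4-9
    (equivalently, 3/2/1/0 early and 6/4/2/0 thereafter)."""
    total = 0
    for rnd, place in enumerate(places, start=1):
        weight = 1 if rnd <= 3 else 2
        total += weight * SQ_POINTS[place]
    return total

# A hypothetical team that struggles early (4th, 4th, 3rd) and then
# takes six consecutive firsts:
places = [4, 4, 3, 1, 1, 1, 1, 1, 1]
print(sq_total(places))   # 19: early losses count fully
print(fdd_total(places))  # 37 of a possible 45: early losses are discounted
```

As the example shows, a late-improving team loses proportionally less ground under FDD than under SQ, which is exactly why the gap closing premise would seem to favor it.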

Figures 5.1 to 5.5

In order to complete the ex ante disadvantage objection, a normative premise is needed in addition to the earlier two empirical premises.  This “comparative normative premise” would need to be something like, if system X benefits EAD teams in comparison to system Y, and this benefit is significant enough to outweigh other harms of system X compared to system Y (e.g., worse overall accuracy), then system X should be preferred.  Presumably, the mere existence of some benefit to EAD teams from a system is not an automatic trump card, guaranteeing preference for that system.  Providing a truly minuscule increase in ranks on the final tab to EAD teams (on average) would not justify accepting dramatically less accurate results for everyone (including EAD teams).  Our point here is not to claim that EAD teams are harmed to only a minuscule degree by ET.  As we said above, before doing empirical data analysis, we can have no idea what the degree of this harm is.  The point here is simply that the objection relies inescapably on a normative premise, and this premise requires a comparative judgment of harms and benefits.  Additionally, note that by saying that a comparative judgment of harms and benefits is required, we are also not asserting that the impacts on all teams should be weighed equally.  Perhaps equality in weighing is appropriate, but it might be that unfairness toward some teams (e.g., EAD teams) ought to be weighted more heavily than unfairness toward other teams.  The point is that this is still a weighing, a comparative judgment of harms and benefits, not a lexicographic prioritization (i.e., it isn’t a trump card).

We have demonstrated extensively the benefits of ET over SQ. There are similar accuracy benefits of SQ over FDD, as shown in these charts. So, we know there is a cost (i.e., a harm) to using FDD over SQ, and a similar cost to using SQ over ET. In contrast, there is no clear empirical evidence showing the degree of benefit that EAD teams gain from using SQ instead of ET, if any.[14] We agree that if a large enough impact were found here, this could be persuasive, but at this time there is no such evidence. In fact, an analysis of 5 years of WUDC results calls into question the gap closing premise that is at the foundation of this objection. Although EAD teams do tend to improve as the tournament goes on (as shown by an upward trend in speaker points), this also happens for other teams. To be blunt, the data here are inconsistent.

In our analysis concerning the performance gap, the trends in each of the five years were not consistent with each other. We analyzed moderate EAD and high EAD teams in each year as separate cases.[15] The yearly charts of the performance gap changed direction from round to round (decreasing gap to increasing gap, or vice versa) more often than not over 9 rounds (7 opportunities to change direction), so the following comments are based on the statistically generated linear trend lines. There were years where the gap (i.e., the trend line) decreased for both groups. In other years, the gap increased for both groups, or one group increased while the other decreased. In some cases, the line was essentially flat. Clearly, one reason for the fluctuation in these charts was that some topics played to the strengths of the EAD teams more than others.[16] For example, Round 7 at WUDC 2020 (“This House believes that ASEAN should abandon ‘the ASEAN Way’.”) had the lowest performance gap that year.[17] Aggregating the results over 5 years helps to factor out some of the noise due to particular motions, though the charts still fluctuated up and down considerably. The two charts shown here present the 5-year aggregated data, showing the performance gap tending to decrease slightly for moderate EAD teams and tending to increase slightly for high EAD teams. If we combine these two groups into one moderate/high EAD group, then the trend line suggests that over the course of nine rounds, the performance gap closes by about 0.1 speaker points. But if ex ante discrimination were driving these results, we would expect the gap to close more for high EAD teams than it does for moderate EAD teams. In fact, the trend line runs in the opposite (increasing) direction for high EAD teams.
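The trend-line method described above can be sketched briefly. The numbers below are invented for illustration, not the paper's data: the "performance gap" is taken to be the per-round difference in mean speaker points between non-EAD and EAD teams, and a least-squares line fitted across the nine rounds summarizes whether the gap tends to close (negative slope) or widen (positive slope), even when the round-by-round values bounce up and down.

```python
import numpy as np

rounds = np.arange(1, 10)                        # preliminary rounds 1..9
gap = np.array([1.30, 1.10, 1.35, 1.20, 1.05,    # hypothetical round-by-round gap
                1.25, 1.00, 1.15, 0.95])         # in speaker points (invented data)

# Fit a degree-1 (linear) trend line by least squares.
slope, intercept = np.polyfit(rounds, gap, deg=1)

# Net change in the gap implied by the trend line from round 1 to round 9.
closure = slope * (rounds[-1] - rounds[0])

print(f"slope per round: {slope:+.3f}")
print(f"implied gap change over nine rounds: {closure:+.2f} speaker points")
```

A noisy series like this one changes direction at most rounds, which is why the paper reads off the fitted slope rather than the raw round-to-round movements.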

Figure 6
Figure 7

Of course, our analysis here is not conclusive.  It is possible that a larger or more sophisticated analysis would show a more dramatic closing of the gap by EAD teams.  But, as things stand, the gap closing premise is not a compelling empirical foundation for an objection to tapered scoring.  Even if it turns out to be technically true that the performance gap does close, the magnitude of that closure (a tenth of a speaker point over nine rounds) is unlikely to make a significant difference in how EAD teams perform in a tapered scoring system.  At the same time, we have shown that EAD teams, along with all other teams, will be ranked much more accurately using tapered scoring.  So, it seems clear that this objection falls short because it does not satisfy the comparative normative premise we discussed above.  Even if the performance gap were closed by twice as much (0.2 points), the degree of impact from this phenomenon is insufficient to make a strong case for forgoing the gains in fairness for all teams by making the tab more accurate.  Just as switching from SQ to the FDD (First Day Discount) scoring system would not be justified if it benefitted EAD teams by such a small degree, preferring SQ over ET is also not justified.

As we said above, steps need to be taken to redress the ways in which global injustice undermines the fairness of international debating, but rejecting a more accurate and fair scoring system is not a strategy supported by the evidence.


[10]  Additionally, there is also implicit bias in judging, but we will discuss that separately in Section 3.4.

[11]  The existence of EAD teams is not in dispute, nor is the claim that these teams would perform better at Worlds if it were not for these pre-tournament injustices.  It is obviously not within the power of the world debating community to solve global injustice, but we agree that our community ought to do what it can to mitigate the effect that these injustices have on teams by doing what we can to level the playing field.  In the same way that FIFA should be doing more to invest in soccer (i.e., football) programs in nations with less developed teams, the WUDC community should be doing more to invest in regions with less developed debating circuits.  Of course, FIFA has billions more dollars to work with, but the debating community could still be doing much more.  Some reasonable proposals have been made for how we could help to mitigate the unjust situations that EAD teams find themselves in, and these should be implemented.

[12]  Obviously, these data cannot establish the causal connection, but this should not be a controversial claim.

[13]  It isn’t important whether this means giving people in the first three rounds points of (3/2/1/0) and then (6/4/2/0) after that, or whether you start with (1.5/1.0/0.5/0) and then (3/2/1/0) after that.  We will represent this First Day Discount (FDD) scoring system in the former way, using just whole numbers, but these systems are mathematically identical.

[14]  It is important to remember that the issue being discussed here is ex ante disadvantage, not implicit bias and discrimination that occurs during the tournament.  The latter is important and it is a scourge on the activity that needs to be addressed, but it also has no direct impact on which scoring system to use.  Indeed, the most plausible indirect impact is that by making round influence more equal, we can thereby reduce the harms of implicit bias by eliminating rounds in which particular instances of bias can have outsized influence.  But, as it turns out, that’s an argument in favor of using tapered scoring.

[15]  See the appendix for more detail on how the survey was conducted and how we categorized institutions regarding ex ante disadvantage.

[16]  Emma Pierson came to a similar conclusion regarding gender and speaker point gaps.  Women closed the gap with men when the topic was about gender.  (Pierson, 2013)

[17]  One might be tempted to think that all motions should be chosen with a primary goal of reducing this performance gap.  Surely, reducing this gap is one of the things that should be considered in setting motions, but it isn’t clear that it should be the sole (or even primary) criterion.  This is an interesting discussion, but it is beyond the scope of this paper.