Fairness, Scoring Systems and the World Universities Debating Championship

by R. Eric Barnes, Paul Kehle, Chuan-Zheng Lee & Hugh N. McKenny

Tapered scoring creates more accurate breaks than traditional scoring in British Parliamentary (BP) debate tournaments. This paper further examines how BP scoring systems work and addresses some concerns that have been raised about tapered scoring. We begin by deploying some intuitive metrics that have been used in the past to evaluate break accuracy. A significant portion of the paper is devoted to evaluating the level of influence that different rounds have. We consider whether there is any good justification for different rounds having different levels of influence, and also whether tapered scoring would unjustly impact certain teams. The paper explores various other topics relevant to understanding scoring systems, such as how call accuracy changes by round, the effect of pulling up teams, and the ability of teams to recover. We end by discussing two broader themes: how to rationally compare competing scoring systems, and how to assess the fundamental model that we have used to justify many of our conclusions. This paper assumes a familiarity with our previous paper “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”.

8.2 | Limitations

We turn now to limitations that may affect our conclusions.  The primary asymmetry that tapered scoring introduces is to make the weight of a round a function of round number.  Phenomena that are themselves a function of round number may therefore be of concern to us.  These phenomena form a narrower class than those that merely vary from round to round.  For example, some teams might perform better on some motions than others.  But unless we can predict before the tournament which rounds will disadvantage certain teams more than others (say, round 7), this is, as far as tapered scoring goes, indistinguishable from random noise, which our model already accounts for.

To understand why, recall that our only proposed change is to the weighting of rounds as a function of round number.  If some disadvantage (e.g., systemic bias) has the same net effect in every round, then our proposal would not alleviate it, but nor would it exacerbate it, assuming the disadvantage persists through the whole tournament.  If some disadvantage (e.g., getting tired and hungry) affects all teams equally, then the results of every debate will be unchanged.  To the extent that some phenomenon has a random or partly random impact (e.g., falling ill) that is consistent through the tournament, it can be modelled as part of the random noise that we accounted for in our model, whose standard deviation we calculated in the aggregate.  Some practices, like position rotation, are not completely random, but since nothing is known about positions a priori, they still appear simply as random noise to any tapered scoring system.
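The first of these points can be illustrated with a minimal sketch (hypothetical Python, not our actual simulation code; the function and parameter names are ours for illustration): a shift in perceived performance that hits every team in a room equally cannot change the call, because the call depends only on relative perceived skill.

```python
import random

def room_result(true_skills, noise_sd, common_shift=0.0):
    """Rank four teams in a room by perceived performance.

    perceived = true skill + common shift (hits all teams equally)
                + zero-mean perception noise.
    Returns team indices ordered from 1st to 4th.
    """
    perceived = [s + common_shift + random.gauss(0, noise_sd)
                 for s in true_skills]
    return sorted(range(4), key=lambda i: perceived[i], reverse=True)

skills = [55.0, 52.0, 50.0, 47.0]

# With identical random draws, a shift applied to every team in the
# room leaves every pairwise comparison, and hence the call, intact.
random.seed(42)
call_plain = room_result(skills, noise_sd=2.0)
random.seed(42)
call_shifted = room_result(skills, noise_sd=2.0, common_shift=-5.0)
assert call_plain == call_shifted
```

The same reasoning holds for any scoring system, tapered or not, which is why uniform effects drop out of the comparison.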

Some elements, however, are a function of round number.  The most obvious is judge packing: the practice of concentrating highly qualified adjudicators in live rooms, which decrease in number as the tournament progresses.  We discussed this extensively in our previous paper.  Judge packing is difficult to model, but we did so by varying perception noise through a tournament, starting with high noise in the early rounds (when strong judges are spread more thinly).  We found that judge packing assists the Early Taper system more than it assists the status quo, and we refer the reader to our previous paper for details.
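As a rough illustration of the approach (a hypothetical linear schedule with invented parameter values, not the exact schedule from our previous paper), judge packing can be modelled by letting the standard deviation of perception noise shrink as the round number grows:

```python
def perception_noise_sd(round_number, base_sd=2.0, packed_sd=1.0,
                        total_rounds=9):
    """Hypothetical judge-packing schedule: perception noise falls
    linearly from base_sd in round 1 to packed_sd in the final
    preliminary round, as live rooms thin out and strong judges
    concentrate in them."""
    frac = (round_number - 1) / (total_rounds - 1)
    return base_sd + frac * (packed_sd - base_sd)

# Noise is highest in round 1 and lowest in round 9.
schedule = [perception_noise_sd(r) for r in range(1, 10)]
```

Any monotonically decreasing schedule would capture the same qualitative effect; the linear form is simply the easiest to state.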

Other phenomena may appear at first glance to be a consistent disadvantage, but have an impact that increases or decreases through the tournament.  For example, if one believes that some biases (e.g., regional, accent or demographic) can be counteracted by high-quality judges, then judge packing will affect those biases too.  This deserves attention, and we spent a couple of months trying to model systemic bias satisfactorily in our code, but everything we tried seemed open to reasonable objections questioning whether that was how bias actually works.  Bias is such a nuanced subject, with so little consensus on exactly how it operates, let alone how to model it, that we are not aware of any standard models for bias, nor were we able to devise one.  We therefore folded it into perception noise, though zero-mean noise is admittedly ill-suited to this purpose.  As we said, some elements of bias will not affect our conclusions, but others (those whose impact is a function of round number) will.  We would be interested in working with anyone with expertise in this area, and if a clear and not terribly controversial way to model systemic bias can be developed, we would like to include it in future work.
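A small sketch (again hypothetical, not from our codebase) makes concrete why zero-mean noise is a poor stand-in for systemic bias: zero-mean noise averages away over many observations, whereas a directional bias is a fixed offset with a nonzero expectation that no zero-mean distribution can reproduce.

```python
import random
import statistics

random.seed(1)
N = 10_000

# Zero-mean perception noise: its average over many draws is near zero.
noise_draws = [random.gauss(0, 2.0) for _ in range(N)]

# A systematic bias against some group is a fixed directional offset
# added on top of the noise; its average does not vanish.
bias_offset = -1.5
biased_draws = [bias_offset + random.gauss(0, 2.0) for _ in range(N)]

assert abs(statistics.mean(noise_draws)) < 0.1
assert abs(statistics.mean(biased_draws) - bias_offset) < 0.1
```

Folding bias into perception noise therefore preserves its variance but discards its direction, which is precisely the feature that matters for fairness.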

There are, of course, countless other nuances that our model does not capture.  We could have modelled each of the actors involved and their interactions with each other, such as power dynamics within panels, the impact of certain motion topics, adjudicator feedback from debaters, swing teams, and of course the many forms of bias operating in our circuit.  Such a model, while capturing more of what goes on, would probably not have been more accurate: if anything, it would likely have been less accurate, since it would have had countless more parameters to tune and more moving parts to amplify slight imprecisions, and we would have had no more information with which to tune them than the data we actually used: past WUDC and HWS Round Robin data.  It is important to bear this trade-off in mind: adding complexity eventually compromises robustness.  But all models have limitations, and in our case, the salient omissions are those that vary with round number, and hence whose effect would be changed by a tapered scoring system.
