Fairness, Scoring Systems and the World Universities Debating Championship

by R. Eric Barnes, Paul Kehle, Chuan-Zheng Lee & Hugh N. McKenny

Tapered scoring creates more accurate breaks than traditional scoring in British Parliamentary (BP) debate tournaments. This paper further examines how BP scoring systems work and addresses some concerns that have been raised about tapered scoring. We begin by deploying some intuitive metrics that have been used in the past to evaluate break accuracy. A significant portion of the paper is devoted to evaluating the level of influence that different rounds have. We consider whether there is any good justification for different rounds having different levels of influence and also whether tapered scoring would unjustly impact certain teams. The paper explores various other topics relevant to understanding scoring systems, such as how call accuracy changes by round, the effect of pulling up teams, and the ability of teams to recover. We end by discussing two broader themes: how to rationally compare competing scoring systems, and how to assess the fundamental model that we have used to justify many of our conclusions. This paper assumes a familiarity with our previous paper “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”.

Section 8: Robustness

It’s worth taking some time to discuss the validity of our model.  We put considerable effort into refining our model to be as realistic as possible.  But to discuss the degree to which our model should be trusted, we must first ask: what is a model for?

8.1 | Objectives of the Model

There’s a saying in statistics: “All models are wrong, but some are useful.”  All natural and social systems are far too complex to be captured accurately in a model.  The question for any model is therefore not whether it describes phenomena exactly as they occur, but whether it captures enough of the essence of the system to be informative.  Accordingly, as we acknowledge and consider limitations of our model, the question we should be asking ourselves is: what types of omissions would have the potential to alter our overall findings?

Our noise model is kept relatively general for this reason.  We decomposed noise into skill noise and perception noise, because only one of these affects the omniscient “deserved outcome” to which we compare the actual outcome.  But otherwise, we kept all sources of noise together, which we could then calibrate using real data from WUDC and the HWS Round Robin.  The causes of variation in performance are countless, but in the aggregate, we hope that an additive normally distributed noise model will capture the essence of ordinary variation in performance.
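As a concrete illustration of this decomposition, the following sketch simulates one BP room under the additive model described above.  The parameter values and function names here are our own illustrative choices, not the calibrated values from the WUDC or HWS Round Robin data; the point is only to show how skill noise enters the “deserved” outcome while perception noise affects only the call.

```python
import random

def simulate_room(base_skills, sigma_skill=1.0, sigma_perception=1.0, rng=random):
    """Simulate one BP room of four teams under the additive noise model.

    Returns (deserved, called): two rankings of team indices, best first.
    Parameter values are illustrative, not calibrated.
    """
    # Skill noise: genuine round-to-round variation in how well each team debates.
    performances = [s + rng.gauss(0, sigma_skill) for s in base_skills]
    # The "deserved" outcome ranks teams by their actual performance in the round.
    deserved = sorted(range(4), key=lambda i: -performances[i])
    # Perception noise: error in how the judges perceive those performances.
    perceived = [p + rng.gauss(0, sigma_perception) for p in performances]
    # The "called" outcome ranks teams by perceived performance.
    called = sorted(range(4), key=lambda i: -perceived[i])
    return deserved, called
```

Note that with both noise terms set to zero, the deserved and called rankings coincide and simply follow underlying skill; only perception noise can make the call diverge from the deserved outcome.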

For the same reason, while we have run experiments for a wide range of tapering structures, the specific structure we propose is less significant than the observation that relaxing the constraint that all rounds must contribute the same number of team points has a drastic effect on the accuracy of tournament outcomes.  In our experiments, different conditions have yielded slightly different optima, but in none of them has the status quo come close to bringing our basic conclusion into doubt.

While widely encompassing, our model still has limitations, which we will discuss shortly.  Before we get there, one further point deserves mention.  Perhaps counterintuitively, it would be problematic for our results to rely too heavily on the model we used.  A good tournament structure should work in a range of circumstances, not be micro-optimized to fit the WUDC norms of 2019.  If our conclusions change drastically when we increase noise slightly or add one more subtlety to the model, then we have much bigger systemic problems than whether that parameter is accurate.  Therefore, while we took care to use parameters that most closely mirrored real-world data for our most thorough simulations, it would be an error to take them too seriously.

Moreover, a tournament structure should in principle be capable of finding the best teams no matter who shows up at the tournament.  The ideal (albeit fictional) tournament structure should require no further assumption than “better teams tend to win more often”.  Experimental methods of course do not provide a way to verify such sweeping results, and we need to make at least some assumptions in order to write the program that carries out the simulations.  But the success of the tournament structure certainly shouldn’t rest on whether the true noise is exactly as we claim.

For this reason, we verified our results under different parametric assumptions, including the case where there is no noise.  Naturally, the precise figures vary under differing conditions.  The same reasoning explains why we assessed fairness under so many metrics: every metric captures something slightly different.  What we are looking for is whether the results all point in the same direction.
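The style of robustness check described here can be sketched in miniature.  The snippet below, a hypothetical example with made-up parameters rather than our actual experimental code, sweeps the perception noise level and measures how often judges call the strongest performance first; the exact accuracy figures shift with the noise level, but the qualitative picture is what one checks for stability.

```python
import random

def call_accuracy(sigma_perception, n_rooms=10_000, seed=0):
    """Fraction of simulated rooms where the judges' first call matches
    the team with the strongest actual performance.  Illustrative only."""
    rng = random.Random(seed)
    skills = [3.0, 2.0, 1.0, 0.0]  # hypothetical underlying team skills
    correct = 0
    for _ in range(n_rooms):
        # Skill noise (fixed at 1.0 here) produces the deserved outcome.
        performances = [s + rng.gauss(0, 1.0) for s in skills]
        # Perception noise produces the called outcome.
        perceived = [p + rng.gauss(0, sigma_perception) for p in performances]
        deserved_winner = max(range(4), key=performances.__getitem__)
        called_winner = max(range(4), key=perceived.__getitem__)
        correct += (deserved_winner == called_winner)
    return correct / n_rooms

# Sweep the noise parameter: accuracy should be perfect with no perception
# noise and degrade smoothly as perception noise grows.
for sigma in [0.0, 0.5, 1.0, 2.0]:
    print(f"perception sigma={sigma}: accuracy={call_accuracy(sigma):.3f}")
```

The single direction-of-results question, whether accuracy degrades gracefully rather than collapsing under one parameter choice, is the kind of conclusion that should survive any reasonable setting of the parameters.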
