
Fairness, Scoring Systems and the World Universities Debating Championship

by R. Eric Barnes, Paul Kehle, Chuan-Zheng Lee & Hugh N. McKenny

Tapered scoring creates more accurate breaks than traditional scoring in British Parliamentary (BP) debate tournaments. This paper further examines how BP scoring systems work and addresses some concerns that have been raised about tapered scoring. We begin by deploying some intuitive metrics that have been used in the past to evaluate break accuracy. A significant portion of the paper is devoted to evaluating the level of influence that different rounds have. We consider whether there is any good justification for different rounds having different levels of influence, and also whether tapered scoring would unjustly impact certain teams. The paper explores various other topics relevant to understanding scoring systems, such as how call accuracy changes by round, the effect of pulling up teams, and the ability of teams to recover. We end by discussing two broader themes: how to rationally compare competing scoring systems, and how to assess the fundamental model that we have used to justify many of our conclusions. This paper assumes a familiarity with our previous paper, “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”.

Section 2: Hapless Teams and Charmed Teams

Averages of thousands of simulations are great for comparing which system is most likely to yield the most accurate break, but averages also feel sterile, and it is difficult to envision the actual results they represent.  To address this, we examined 13 pairs of tournament simulations (SQ & ET) and selected 5 of these pairs to serve as a representative sample.  Every simulation begins with the computer assigning a random “seed” number from 1 to 1 billion, which then allows the program to assign randomized values to all the variables (e.g., team skill variation and judge perception noise).  If the same seed number is used in two simulations, then all of those variables will remain the same, and the same “performance table” (i.e., how every team and panel performs in every round) will always result.  If the scoring system is also the same, then the results will always be identical.  So, each of the 13 pairs of simulations we ran used the same random seed number for both SQ and ET.  Put simply, the only difference between the two simulations in a pair was the scoring system; everything else was held constant.
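For readers who prefer to see the seeding logic as code, the idea can be sketched as follows.  This is a minimal illustration, not the simulation code actually used for this paper; the function name `make_performance_table`, the team and round counts, and the noise parameters are all illustrative stand-ins.

```python
import random

def make_performance_table(seed, n_teams=360, n_rounds=9):
    """Build a performance table (how well each team debates in each round)
    from a single seed.  Illustrative only: the real model also draws judge
    perception noise, panel assignments, etc. from the same seeded RNG."""
    rng = random.Random(seed)
    skills = [rng.gauss(0, 1) for _ in range(n_teams)]        # underlying team skill
    return [[s + rng.gauss(0, 0.5) for _ in range(n_rounds)]  # round-to-round variation
            for s in skills]

# The same seed always reproduces the same table, so a pair of runs built from
# it (one scored with SQ, one with ET) differs only in how points are awarded,
# never in how teams perform or how judges perceive them.
seed = random.randint(1, 1_000_000_000)
assert make_performance_table(seed) == make_performance_table(seed)
```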

As we did in Break Accuracy, we will frequently be talking about those teams whose demonstrated debating skill during the preliminary rounds placed them among the top 48 teams.  Since, by stipulation, these teams actually debated best (regardless of what judges said), we are confident in saying that these are the teams who deserve to break open.[1]  So, as a convenient shorthand, we will call these “deserving teams”.  Our focus here is on which deserving teams are excluded from the break (hapless teams) and which undeserving teams are included (charmed teams).  The numbers in the columns labeled “Hapless Teams” and “Charmed Teams” represent where teams are ranked according to their mean demonstrated skill during the preliminary rounds.  Put simply (though perhaps too theatrically), these numbers are each team’s rank on “God’s tab”.  For example, in the first row, the first team listed under the Hapless column is #13, which means that the team whose actual debating performance was the 13th best at the tournament did not break in this simulation.  Similarly, also in the first row, the last team listed under the Charmed column is #101, which means that the team whose actual debating performance was the 101st best at the tournament somehow did manage to break in this simulation.
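The bookkeeping behind these labels is simple enough to sketch in a few lines of Python.  The function and variable names below are ours for this illustration only; they are not drawn from any tab software or from our simulation code.

```python
def hapless_and_charmed(mds_ranking, broke, break_size=48):
    """Find the mis-sorted teams in one simulated tournament.

    mds_ranking: team ids ordered by mean demonstrated skill, best first
                 (i.e., "God's tab"; rank 1 is the best-performing team).
    broke:       set of team ids that broke under the scoring system.
    Returns the MDS ranks of the hapless teams (deserved to break but did
    not) and of the charmed teams (broke without deserving to).
    """
    deserving = set(mds_ranking[:break_size])
    rank_of = {team: i + 1 for i, team in enumerate(mds_ranking)}
    hapless = sorted(rank_of[t] for t in deserving - broke)
    charmed = sorted(rank_of[t] for t in broke - deserving)
    return hapless, charmed

# Toy example with a 4-team break: team "C" (rank 3) is hapless and
# team "F" (rank 6) is charmed.
print(hapless_and_charmed(["A", "B", "C", "D", "E", "F"],
                          broke={"A", "B", "D", "F"}, break_size=4))
# -> ([3], [6])
```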

Table 2: Ranks of hapless and charmed teams in SQ & ET under identical performance tables

This table may look like an intimidating bunch of numbers, but it is actually quite simple if you know how to look at it.  To read this table, just focus on comparing an orange row with the purple row immediately below it, which makes sure you are comparing apples to apples.  Each number on the orange line is a team that the SQ system sorted incorrectly, and each number on the purple line is a team that the ET system sorted incorrectly.  They either got screwed out of breaking (left column) or snuck into the break (right column).  So, by looking at, for example, the top pair of rows on the left, you can compare which teams got screwed by the SQ & ET systems in a pair of simulations where all the teams performed exactly as well in both simulations and all the judges were exactly as perceptive.  If the purple row has fewer numbers, that means that ET sorted fewer teams incorrectly.  Ideally, we want three things:  1) fewer teams listed on a row; 2) on the left, higher-numbered teams, so that the most deserving teams are not excluded; 3) on the right, lower-numbered teams, so that particularly undeserving teams are not included.  The ET system does better in all of these comparisons, but the comparison of the top pair of simulations (SQ #1 vs. ET #1) is the most favorable to SQ, while the bottom pair is the most favorable to ET.

Admittedly, this is a lot of numbers to look at, but it is illuminating to see some representative examples of the results from each scoring system. Obviously, the results from a handful of randomly chosen simulations are not a sound basis for any conclusion. The compelling evidence comes from the cumulative analysis of millions of simulations and from metrics like the Quality Deficit Score (QDS), which boils the results in the above table down to a single number representing how much debating skill is missing from a particular break (i.e., the quality that is lost by replacing the hapless teams with the charmed teams), which we can then average over thousands of simulations.
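For concreteness, here is one way to read that description as code.  The precise definition of the QDS is given in Break Accuracy; the sketch below simply takes it to be the demonstrated skill lost by swapping the hapless teams for the charmed ones, which is an illustrative assumption rather than a restatement of the official formula.

```python
def quality_deficit_score(mds, deserving, broke):
    """Illustrative reading of the QDS: the demonstrated skill the break
    loses by containing the charmed teams instead of the hapless teams.

    mds:       dict of team id -> mean demonstrated skill
    deserving: set of the top-48 team ids by MDS
    broke:     set of team ids that actually broke
    A perfect break (broke == deserving) scores 0; worse breaks score higher.
    """
    hapless = deserving - broke
    charmed = broke - deserving
    return sum(mds[t] for t in hapless) - sum(mds[t] for t in charmed)
```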

There are other useful ways to look at this data, which are less comprehensive but still illuminating.  Below we present two simple metrics that were also used in Neil Du Toit’s important paper on tab systems for BP (Du Toit, 2014).  We ran 20,000 simulations of each system (SQ and ET), and for each simulation we recorded both the best team excluded from the break and the worst team who made the break.  The best hapless team and worst charmed team were selected based on their ranking according to mean demonstrated skill (MDS) in that simulated tournament.  The charts below show the distribution of the best teams excluded by each system and the worst teams included by each system.
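In code, the two per-simulation metrics amount to the following.  Again, this is an illustrative sketch under our own naming, not Du Toit’s implementation or the code behind the charts.

```python
def best_out_worst_in(mds_ranking, broke):
    """Per-simulation metrics in the style of Du Toit (2014): the MDS rank of
    the best team excluded from the break and of the worst team included.
    Ranks are 1-based, matching the charts below.  Illustrative sketch only.
    """
    best_out = min(i + 1 for i, t in enumerate(mds_ranking) if t not in broke)
    worst_in = max(i + 1 for i, t in enumerate(mds_ranking) if t in broke)
    return best_out, worst_in

# Recording these two numbers for each of the 20,000 simulations of each
# system, and then plotting their distributions, yields charts like the two
# figures that follow.
```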

Figure 1: Distribution of the MDS rank of the best team excluded from the break under SQ and ET
Figure 2: Distribution of the MDS rank of the worst team included in the break under SQ and ET

In terms of actual mean demonstrated skill, the average rank of the best-out team for SQ was 21.5, while for ET it was 23.8.  The average rank of the worst-in team for SQ was 117.8, while for ET it was 92.6.  One useful way of seeing these charts is that both scoring methods will have the same effect on all the teams in the pink region.  Changing from SQ to ET lets you trade all the bad results in orange for the less bad results in purple.  And keep in mind that these charts represent only the single best excluded team and the single worst included team in each simulation we ran.  These charts leave out all the other hapless and charmed teams.  There would be similar charts for the second-best team out or the second-worst team in, etc.  Also, the two systems do not generate the same total number of hapless and charmed teams.  All of this is why the QDS is a much more useful metric for overall comparison.



[1]  Obviously, there are also teams who deserve to break in other categories, but it is harder to model who these teams are.  In Break Accuracy, we show that ET will also make the ESL and EFL breaks more accurate, but for now we will just focus on the open break, which is much easier to model and evaluate.  Similar conclusions apply to the other break categories.