Section 5: Metrics of Success
To find which scoring system yields the most accurate break, we first need a way to assess how accurate a break is. There is no straightforward way to do this: is it worse for ten borderline-deserving teams to miss narrowly, or the third-best team in the tournament to miss alone? Instead of defending a single metric, we employ five different metrics. The first two assess how accurate a tournament’s ranking of teams is, specifically focusing on the break, and we call them the “break quality metrics”. The last three assess the accuracy of the ranking of all teams, and we call these the “order accuracy metrics”. We also track average bubble size (i.e., the number of teams tied on team points with the lowest breaking team), though not as a “metric of success” per se.
Both break quality and order accuracy metrics have their place. The break quality metrics focus on what probably matters most to many: whether the correct teams break. However, these metrics focus on the open break alone, because there is no reliable way to define “break quality” for the ESL or EFL breaks in a simulation. Order accuracy metrics offer insight into the entire ranking, not merely that of teams breaking open, though to reduce computation time we only compute these metrics over the top half of the tournament, since teams in the bottom half of the tab have not made the break in any category in the past 10 years at WUDC.
To be clear, we do not propose using any of these metrics as part of a real tournament. That would not be possible, as they are calculated using a “ground truth” that real tournaments cannot know. Rather, the point of this section is to define and explain how we will measure the success (accuracy and fairness) of the various systems we consider, to see which is best. When we run experiments, our simulation computes each metric for each simulated tournament. We then present the average of each metric over the thousands of simulations that we run.
5.1 | Break Quality Metrics
Let us first lay some groundwork. It seems uncontroversial that an ideal preliminary round system would have the teams who perform the best advance to the elimination rounds. In real life, there is no way to know for sure which teams truly “performed the best”, but in simulations, the computer has a God’s eye view of the true demonstrated skill of every team in every round, and we can use this to measure break quality.
As stated above, “deserving teams” are the teams with the highest mean demonstrated debating skill in the tournament, however many are needed for the break (e.g., 48 for partial double-octofinals at WUDC). If all the deserving teams break, we say that the tournament had a “perfectly inclusive break” (regardless of the order in which they break—so it may not be a truly perfect break). The break quality metrics measure the deviation of the set of actual teams in the break from the set of teams that deserve to be in the break.
Our first metric is perhaps the most intuitive: how many of the deserving teams failed to break? We call this the “hard luck count” (HLC). A perfectly inclusive break has an HLC of 0. Of course, real tournaments almost never achieve this kind of perfection, but it is clearly preferable to get as close as possible. Our second metric of success is called the “quality deficit score” (QDS). The QDS is defined as the difference between the sum of mean demonstrated skill (MDS) of teams in the actual break, and the sum of MDS of teams in a perfectly inclusive break. Again, this information is not accessible in real life, but it is calculable in simulations where the computer has access to the true MDS of every team. Lower QDSs are better; a perfectly inclusive break (regardless of breaking order) has a QDS of 0.
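As a minimal sketch of how a simulation might compute these two metrics, assume the true MDS of each team is stored in a dictionary and the actual break is a set of team names (the function names `hard_luck_count` and `quality_deficit_score` are our own illustrative choices, not part of any existing system):

```python
def hard_luck_count(true_mds, actual_break, break_size):
    # The "deserving" teams are the top `break_size` teams by true MDS.
    deserving = set(sorted(true_mds, key=true_mds.get, reverse=True)[:break_size])
    # HLC: how many deserving teams failed to break.
    return len(deserving - actual_break)

def quality_deficit_score(true_mds, actual_break, break_size):
    deserving = sorted(true_mds, key=true_mds.get, reverse=True)[:break_size]
    ideal_total = sum(true_mds[t] for t in deserving)
    actual_total = sum(true_mds[t] for t in actual_break)
    # QDS: skill lost relative to a perfectly inclusive break (0 is ideal).
    return ideal_total - actual_total
```

With a break size of 2, true skills {A: 90, B: 85, C: 80, D: 75}, and an actual break of {A, C}, the HLC is 1 (team B missed) and the QDS is 85 − 80 = 5.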
The nuance this adds to the hard luck count is a measure of how bad a particular team’s undeserved break is. Two tournaments might each have four deserving teams not break, giving an HLC of 4, but in one case these teams might have been close calls (among the weakest teams in the break), while in another they might be high-performing teams falling on even harder luck. To demonstrate the contrast, three examples of tournaments with similar HLC but different QDS are shown in Figure 4. In the leftmost example, the teams incorrectly breaking or not breaking were all borderline, so despite having an HLC of 4, the QDS is small. The center example also has an HLC of 4, but since the deserving teams not breaking debated better at the tournament, the QDS is larger. In the rightmost example, although the HLC is just 3, the incorrectly breaking and not breaking teams were an especially gross injustice, so it yields a higher QDS than the other two.
5.2 | Order Accuracy Metrics
The question of how “accurate” a ranking is also, sadly, not straightforward. For example, consider three tournaments of the same five teams, yielding the rankings in Table 1. The “true” ranking of the teams is in the leftmost column. All three tournaments are inaccurate in their own ways, but which is the least inaccurate?
There is extensive mathematical literature on this topic (called “rank correlation”), and no clear answer to what the best metric is. (Langville & Meyer, 2012) Different metrics suit different contexts. Given this, we use three intuitively plausible metrics, and will present the results for these three metrics as applied to team rankings at the end of preliminary rounds.
The first is the “pairwise error count” (PEC), computed as follows: We take each possible pair of teams (if there are n teams, there are n(n−1)/2 such pairs) and take note of whether the truly better-performing team came ahead of the truly worse-performing team in the tournament’s actual ranking. We count the number of pairs for which this did not happen; the number of such errors is the PEC.
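The PEC computation described above can be sketched as follows, assuming the true ordering is a list from best to worst and the tournament’s result is a mapping from team to actual rank (names here are illustrative):

```python
from itertools import combinations

def pairwise_error_count(true_order, actual_rank):
    # `true_order` lists teams from truly best to truly worst;
    # `actual_rank` maps each team to its rank on the actual tab (1 = first).
    errors = 0
    for better, worse in combinations(true_order, 2):
        # Count the pair as an error if the truly worse team placed ahead.
        if actual_rank[better] > actual_rank[worse]:
            errors += 1
    return errors
```

For three teams whose true order is A, B, C but whose tab reads B, A, C, only the (A, B) pair is inverted, so the PEC is 1.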
The second we call the “rank difference squared” (RDS). For each team, we take their true rank (say, 3rd) and actual rank (say, 5th), and find the squared difference (e.g., (3 − 5)² = 4). The sum of these squared differences for all teams is the RDS. The idea behind squaring the difference is to make large inaccuracies (the top team coming ninth) weigh more than equivalently many small inaccuracies (eight teams each being off by one place). This approach is widely used in rank correlation analysis.
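The RDS calculation is a one-liner in a simulation; a sketch, assuming true and actual ranks are stored as dictionaries keyed by team:

```python
def rank_difference_squared(true_rank, actual_rank):
    # Sum of squared differences between true and actual rank over all teams.
    return sum((true_rank[t] - actual_rank[t]) ** 2 for t in true_rank)
```

If two teams simply swap places, each contributes 1² = 1, for an RDS of 2; a single team off by three places contributes 9, reflecting the heavier weight given to large errors.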
The third we call the “sum of skill difference” (SSD). It is based upon what is known in some literature as “Spearman’s footrule”, which previous authors on debate tournament structure have used as their primary metric (Du Toit, 2014). Unlike the other two, which only concern ordinal rankings (1st, 2nd, …), the SSD accounts for differences in cardinal skill levels. It is calculated as follows. Take the absolute difference between the MDS of the team that should have come first and the team that actually came first. Then repeat this for all ranks until last place and add all these absolute differences together. By using the MDS, this metric recognizes that it is less unfair for two teams closely matched in demonstrated skill to be ordered incorrectly, than two teams whose demonstrated skill is far apart to be ordered incorrectly.
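The SSD procedure described above can be sketched as follows, assuming true MDS values in a dictionary and the actual tab as a list from first to last (function name ours):

```python
def sum_of_skill_difference(true_mds, actual_order):
    # Rank by rank, compare the MDS of the team that *should* hold that rank
    # with the MDS of the team that *actually* holds it, and sum the
    # absolute differences.
    ideal_order = sorted(true_mds, key=true_mds.get, reverse=True)
    return sum(abs(true_mds[ideal] - true_mds[actual])
               for ideal, actual in zip(ideal_order, actual_order))
```

Note that swapping two closely matched teams (MDS 90 and 85) adds only 5 + 5 = 10 to the SSD, whereas swapping teams with MDS 90 and 70 adds 40, which is exactly the cardinal sensitivity the metric is designed to capture.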
To illustrate each of these metrics, worked calculations for each of them in the example from Table 1 are shown in Table 2. We cherry-picked these examples to illustrate that the three metrics are not equivalent, and so offer distinct assessments of order accuracy. However, in practice, as one might expect, they agree most of the time and differ only in tricky cases. Because our focus in this paper is on improving break accuracy, and because it is clear from the history of WUDC tournaments that teams in the bottom half of the team rankings do not break in any division, we focused on how well various scoring systems achieve order accuracy in the top half of the teams on the team tab. This dramatically reduced computation time without ignoring any data that was likely to be relevant to the accuracy of the break. The order accuracy of the bottom half of the team tab, in any case, almost certainly mirrors that of the top half.
5.3 | Bubble Size
In our experimental results, we also present results on bubble size, i.e., the number of teams tied on team points with the lowest breaking team. This is not exactly a metric of success, but it is illuminating in a variety of ways.
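Bubble size as defined here is straightforward to compute in a simulation; a sketch, assuming team points in a dictionary and the break as a set of team names:

```python
def bubble_size(team_points, break_teams):
    # The bubble is every team tied on team points with the lowest
    # breaking team (including breaking teams on that score).
    cutoff = min(team_points[t] for t in break_teams)
    return sum(1 for pts in team_points.values() if pts == cutoff)
```

For example, if the lowest breaking team is on 17 points and three teams in total finished on 17, the bubble size is 3.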
At a typical WUDC, the bubble is the number of teams on 17 points after Round 9. In the years since the break expanded to 48, the average size of the bubble has been 25.5 teams. Many debaters seem to prefer those rare situations where there is a “clean break” (e.g., where all teams on 18 and above break, and all teams on 17 and below do not). For many people, this is because they believe that speaker points are biased and more arbitrary than team points. The smaller the bubble, the less power speaker points have in deciding who breaks. Several authors have plausibly argued that speaker points are conducive to systemic bias against those in marginalized groups. So, there is reason to prefer systems that minimize the size of the bubble.
 This was one of the arguments offered in support of the proposal to break all and only those teams on 18 or more points (after 9 rounds) at Worlds. (Barnes, Hume, & Johnson, Expanding the Worlds Break, 2011)
 See, for instance: (Pierson, 2013), (Spera, Mhaoileoin, & O’Dwyer, 2013), (Falkenstein, 2013), (Buckley & Tedja, 2013), (Kohn & Perkins, 2018). Also, Huyen Thi Thanh Nguyen is doing very extensive research on the gender gap in competitive debating, which has not yet been published.
 There are other ways to break ties in team points, other than using speaker points, and using these would reduce the impact of systemic bias. Unfortunately, our research strongly suggests that the most obvious alternate method of breaking ties (using strength of opponents) leads to significantly less accurate breaks. We have not found any method of breaking ties on the bubble that is clearly superior to using speaker points.