Section 6: Results
The charts in this section contain data from our simulations. The first simulations confirm the conventional wisdom that adding more rounds to a tournament makes the results more accurate, with diminishing returns as more rounds are added. Our initial simulations assumed an idealized environment where a team’s skill did not vary from round to round and where judges always accurately perceived debaters’ skill. Later, we looked at more realistic situations with noise (i.e., team skill fluctuation and imperfect judge perception).
6.1 | More Rounds Generate More Accuracy
In a large tournament like the WUDC, with say 360 teams, the break will get significantly more accurate as you add more rounds. Setting aside noise, the charts in Figure 5 show what 5 rounds, 9 rounds and 18 rounds look like in terms of our key metrics. In all metric charts in this paper, the goal is to minimize the inaccuracy according to that metric, so results that are lower represent better outcomes.
These charts show what everyone already believes: holding more rounds will improve the accuracy of the break. All of our metrics confirm this. Note that the number of deserving teams who miss the break does not actually vary that much. This is both interesting and misleading. We will return to this later, but briefly, it turns out that the real key to improving break accuracy isn’t just excluding fewer deserving teams, but getting the order of all the teams more accurately sorted so that fewer teams who are very deserving are excluded and fewer teams who are very undeserving are included. The size of the bubble (e.g., number of teams on 17 at a typical WUDC) also shrinks considerably as more rounds are added. All three order accuracy metrics improve considerably when you add more rounds. Table 3 shows how much worse the status quo (SQ) system does with fewer rounds (5 vs 9, or 9 vs 18).
So, without any noise at a tournament (i.e., absent both team skill varying from round to round and judge perception error), adding more rounds clearly improves the accuracy of the overall team rankings and the accuracy of the break. But, since there is noise at tournaments, we need to look at what happens according to our metrics of success under realistic assumptions about noise. This is shown in Figures 6.1 to 6.6.
Even with noise, these results confirm that adding rounds improves break quality and overall rank accuracy on all metrics, as shown in Table 4. Adding more rounds also reduces the bubble size significantly. As expected, adding noise makes the improvements from adding rounds less dramatic, but the improvements are still quite substantial. Of course, even though using 18 rounds is better according to all of our metrics of break accuracy, there are obvious practical problems with holding 18 preliminary rounds, which makes this system untenable as a real policy. But, if we could get similarly impressive improvements in accuracy and fairness without these practical problems (e.g., within 9 rounds), we clearly should.
6.2 | Tapered Points
In the course of our research, we compared the results of countless scoring systems. In this subsection, we compare the status quo method for 9 rounds with two new methods employed over 9 rounds. The new systems we are considering here, which are summarized in Table 5, offer more points per round at the start of the tournament and these points are reduced until the end of the preliminary rounds (Round 9), at which time they all offer the traditional points. This is similar to the annealing process described in Section 2. Points in any given round are always multiples of the standard (3,2,1,0). In the first method, represented by green in the table and charts below, these points just change day by day (“Day Taper” or “DT”). In the second method, represented by blue, the points change round by round (“Round Taper” or “RT”). Here, status quo points are multiplied by the number of preliminary rounds remaining when that round begins.
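The two tapers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the RT multiplier counts the current round among the rounds remaining (inferred from the statement that Round 9 carries the traditional points), while the DT schedule shown here (three rounds per day, multipliers 3, 2, 1) is a hypothetical placeholder, since the actual values are those given in Table 5. All function names are invented for this sketch.

```python
STANDARD_POINTS = (3, 2, 1, 0)  # status quo points for 1st..4th in a room


def round_taper_multiplier(round_number, total_rounds=9):
    """Rounds remaining when this round begins, counting the round itself.

    With 9 rounds this gives 9 for Round 1 and 1 for Round 9, so the final
    round carries the traditional points, as the text requires.
    """
    if not 1 <= round_number <= total_rounds:
        raise ValueError("round_number out of range")
    return total_rounds - round_number + 1


def day_taper_multiplier(round_number, rounds_per_day=3, day_multipliers=(3, 2, 1)):
    """Multiplier that changes once per day rather than once per round.

    ASSUMPTION: three rounds per day and multipliers (3, 2, 1) are
    placeholders for illustration only; the real schedule is in Table 5.
    """
    day_index = (round_number - 1) // rounds_per_day
    return day_multipliers[day_index]


def round_points(multiplier):
    """Scale the standard (3, 2, 1, 0) points by a whole-number multiplier."""
    return tuple(multiplier * p for p in STANDARD_POINTS)


if __name__ == "__main__":
    for r in range(1, 10):
        rt = round_points(round_taper_multiplier(r))
        dt = round_points(day_taper_multiplier(r))
        print(f"Round {r}: RT {rt}  DT {dt}")
```

Under these assumptions, a first-place finish in Round 1 is worth 27 points under RT, while a first-place finish in Round 9 is worth the usual 3 under either taper.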
Below we present two sets of six graphs, representing our five metrics of success and the bubble size. The first set (Figures 7.1 – 7.6) is under the idealized assumption of no noise. The second set (Figures 8.1 – 8.6) assumes a realistic noise level. The point here is to first establish a “proof of concept” for our approach of using tapered points, using an idealized environment, and then show that these results hold under realistic conditions.
So, in the idealized world where teams’ demonstrated skill does not vary from round to round and the more skilled team always wins, both tapered systems perform much better than the status quo on all five metrics of success, and they also reduce the size of the bubble, which reduces the reliance on and influence of speaker points in determining who breaks. Tapered systems reduce the number of deserving teams excluded from the break by either 52% (using DT) or 73% (using RT). More importantly, in all our 70,000 simulations of the tapered systems, only the very worst result was as bad as the average SQ result by the break quality metrics, and according to the order accuracy metrics, the worst tapered system result usually was better than the best SQ result. There is no doubt that without noise, both of these tapered systems do much better than the status quo at generating breaks that are more accurate and fairer for everyone. Indeed, the RT tapered system approaches perfection in a situation without noise.
Obviously, actual tournaments have both kinds of noise: variations in demonstrated skill between rounds and variations in the accuracy of judging panels’ perceptions of that skill. We calculated these noise levels using data from the past 10 years at Worlds and from the HWS Round Robin (as described in Section 4). This data set gave us a demonstrated skill standard deviation (D-noise) of 3.02 and a judge panel perception variation standard deviation (P-noise) of 3.67. Running 70,000 trials of each of the three tabbing systems with this realistic noise gave us the results shown in Figures 8.1 to 8.6.
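As a rough illustration of how these two noise terms might enter a single simulated debate, consider the sketch below. It assumes that both kinds of noise are independent, zero-mean Gaussian draws added to a team's baseline skill. That additive form, the function names, and the idea of ranking a room by perceived score are assumptions made for this example, not a restatement of the model specified in Section 4.

```python
import random

D_NOISE_SD = 3.02  # demonstrated-skill standard deviation reported above
P_NOISE_SD = 3.67  # judge-panel perception standard deviation reported above


def perceived_performance(base_skill, rng, d_sd=D_NOISE_SD, p_sd=P_NOISE_SD):
    """One team's performance in one round, as the panel perceives it.

    ASSUMPTION: both noise terms are additive, independent Gaussian draws.
    """
    demonstrated = base_skill + rng.gauss(0.0, d_sd)  # skill shown this round
    return demonstrated + rng.gauss(0.0, p_sd)        # panel's reading of it


def rank_room(base_skills, rng, d_sd=D_NOISE_SD, p_sd=P_NOISE_SD):
    """Return team indices ordered from 1st to 4th as the panel perceives them."""
    perceived = [perceived_performance(s, rng, d_sd, p_sd) for s in base_skills]
    return sorted(range(len(base_skills)), key=lambda i: perceived[i], reverse=True)


if __name__ == "__main__":
    rng = random.Random(2024)          # fixed seed so the run is repeatable
    room = [78.0, 76.0, 75.0, 72.0]    # hypothetical baseline skills
    print("Noisy ranking:    ", rank_room(room, rng))
    print("Noise-free ranking:", rank_room(room, rng, d_sd=0.0, p_sd=0.0))
```

Setting both standard deviations to zero recovers the idealized environment, in which the more skilled team always places higher.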
Predictably, all of the systems average much less accurate breaks when we add noise, but the new DT system still does better than the SQ on all metrics (though by smaller margins) and RT does better on all but one metric. With realistic noise in the simulation, the advantages of the new systems are less dramatic, but they certainly still exist. Notably, the DT system does better than the RT system when there is noise. Both tapered systems outperform the SQ, but the DT system is most clearly superior. Only the pairwise error count metric has the three different scoring systems roughly on par, but DT still performs better, and given its dramatic superiority on the other metrics, this does not concern us.
The charts for the three order accuracy metrics are important, not just because they tell us about the accuracy of the ESL and EFL breaks, but also because they tell us about order accuracy within the breaks. Although QDS measures the quality missing from the overall break, it doesn’t say anything about how accurate the ordering is inside the break. This internal order accuracy is obviously important in determining who gets a bye out of partial double-octofinals, and is also central to making the elimination round pairings fair. For example, in the SQ, the top team on 17 points is inevitably much stronger than the bottom team on 18 points (and almost certainly the bottom team on 19), but the former is seeded below the latter. So, it is entirely possible that a top seeded team would be paired against the lower ranked (but very highly skilled) team at the top of the 17s as the supposedly weakest team in the room, while a much lower seeded team would be paired against the higher ranked (but considerably weaker) bottom team on 19 as the low seed team in the room. Basically, the more accurate the ordering is within the breaks, the fairer the elimination pairings are, so that stronger teams are more likely to advance further, as certainly should be the case.
It bears pointing out that the amount of noise has a negligible effect on bubble size. The distribution of team points, and hence the bubble size (which is just the number of teams on a particular number of team points), is almost fully determined by the scoring system and number of rounds; the only results that can affect it are those of debates with pull-up teams in them. Moreover, because the new systems have smaller bubbles, more teams are “locked into” the break on team points (i.e., fewer teams rely on speaker points to break). Several people have argued that speaker points are an unreliable measure of how well teams are debating, either because they are inconsistently and capriciously awarded, or because they are significantly tainted by implicit bias. To the extent that you believe that speaker points are less reliable than team points, the systems that rely on them less to determine the break will be preferable. Indeed, this is one reason why someone might prefer the RT system over the DT system, even though it doesn’t perform quite as well according to our metrics.
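The bubble and "locked in" counts discussed above can be computed directly from a list of final team points. The sketch below treats the bubble as the group of teams tied on the team-point total at the break cutoff, which is our reading of the definition given here. The function name, the example points, and the break size are hypothetical.

```python
from collections import Counter


def bubble_stats(team_points, break_size):
    """Return (cutoff_points, bubble_size, locked_in, spots_for_bubble).

    cutoff_points:    team-point total of the last team inside the break
    bubble_size:      number of teams tied on that total
    locked_in:        teams strictly above the cutoff, who break on team points
    spots_for_bubble: break places left for bubble teams, decided on speaker points
    """
    if not 0 < break_size <= len(team_points):
        raise ValueError("break_size must be between 1 and the number of teams")
    ranked = sorted(team_points, reverse=True)
    cutoff = ranked[break_size - 1]
    bubble_size = Counter(team_points)[cutoff]
    locked_in = sum(1 for p in team_points if p > cutoff)
    return cutoff, bubble_size, locked_in, break_size - locked_in


if __name__ == "__main__":
    points = [20, 19, 18, 18, 17, 17, 17, 17, 16, 15]  # hypothetical tab
    print(bubble_stats(points, break_size=6))
```

In this toy example, four teams sit on the cutoff total of 17 but only two break places remain for them, so speaker points decide which two of the four advance.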
 Unless otherwise noted, data charts are all based on 70,000 simulations for each comparative element.
 In these “box and whisker” charts, the box represents the range of the middle 50% of the results, while the thin lines extending up and down from these boxes represent the range of the worst 25% and best 25% of the results respectively. The average of all the results for each system is represented by a thin white line inside the box. Data represented on box and whisker charts in shades of grey come from tournament simulations of either fewer or more than 9 rounds. All data in color comes from systems operating over exactly 9 rounds.
 We have run simulations using a wide range of assumptions about noise, not just the ones employed here. (The ones here are our best estimates of the real-world noise based on the available data.) To broadly generalize our findings here, the less noise that the system contains, the more dramatic the improvement that is gained by using a tapered system, but even when unrealistically high levels of noise are simulated, tapered systems still outperform the status quo.
 In noisy environments there are both advantages and disadvantages of larger bubbles. Bubble size is not all that matters, since the DT beats SQ with much smaller bubbles. Disadvantages of larger bubbles include excessive reliance on potentially biased speaker points, which our model does not capture (for reasons discussed later in the paper).
 For example, in 2013, Monash B broke as the top team on 17 (ranked 48th on the tab) and then won Worlds.
 We earlier cited numerous articles documenting bias in debating in general and in speaker points in particular, but it is worth pointing out that speaker points are also suspect for more mundane reasons. Speaker points are often assigned in a very rushed manner under pressure to hand in a ballot and not hold up the tournament. Also, the judges assigning them have often spent recent rounds judging debates of very different quality, which can make the current debate seem significantly better or worse in comparison. Some of these issues are discussed by Maria English and James Kilcup (English & Kilcup, 2013).
We attempted to simulate judges’ implicit bias against certain teams, as distinct from general judge perception noise. In the end, we could not come up with a way of doing this that we believed both accurately captured the existing implicit bias and also relied on widely accepted assumptions. But if you agree that implicit bias exists and is disproportionately manifested in speaker points, then there is additional reason to prefer systems that minimize the influence of speaker points (e.g., by creating smaller bubbles).