Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points

by R. Eric Barnes, Paul Kehle, Nick McKenny and Chuan-Zheng Lee • HWS

Ideally, preliminary rounds at tournaments sort teams so that the best teams break into elimination rounds. At the World Championships of debate, the scoring system during the nine preliminary rounds does a poor job of sorting teams accurately. Adding rounds would increase accuracy and fairness, but this is impractical. Using mathematical models and computer simulations of tournaments, we show that using a slightly different scoring system over the nine preliminary rounds would improve the accuracy of the break even more than would doubling the number of preliminary rounds to 18. Other implications and insights into tabulation and sorting accuracy are also discussed.

6.4 | Judge Packing

Anyone who is familiar with how a major BP tournament is run knows that there exists a nearly universal practice of packing stronger judges into live rooms (i.e., rooms where teams still have a chance of breaking).  Stronger judges are more likely to accurately assess the skill levels of the teams.  In other words, stronger panels have lower judge perception noise.  We designed our simulator to allow us to vary the judge perception noise in each individual round, allowing us to simulate judge packing.  (Note:  readers who are not interested in a quite detailed discussion of how to model judge packing in computer simulations may wish to skip to the last paragraph of Section 6.4, without any fear of missing material that is essential to understanding our argument.)

Although judge packing is common at Worlds and other major BP tournaments, not all tournaments (i.e., core adjudication teams) will implement it in the same way.  Some tournaments are more aggressive in packing judges earlier.  This involves categorizing more rooms in early rounds as “low priority”, taking highly rated judges out of these rooms and putting them into “high priority” rooms.[21]  There is disagreement about how judge packing should be done, and we do not take any position on this dispute.  But, since judge packing is a very common practice, we do want to model judge packing, and we want to do it based on as few controversial assumptions as possible. So, to this end, we have chosen to designate a room as dead (i.e., low-priority) if it is mathematically impossible for teams in the room to break in the open category, and to otherwise call a room live.[22]

Figure 10: Number of live rooms as the tournament progresses, using the status quo and Early Taper systems

Recall from Section 4 that we found the baseline for judge perception noise using data from the HWS Round Robin, from years where dual panels were employed.  That means that our baseline judge perception noise corresponds to the average strength of panels at the HWS Round Robin.  Based on conversations with former Chief Adjudicators and Deputy Chief Adjudicators of Worlds, who are all also quite familiar with the HWS RR, we estimated that the average panel strength at the HWS RR was roughly equal to panel strength in live rooms during Round 5 or 6 at a Worlds with typical judging quality.  From this baseline, we estimated (again in consultation with former WUDC adjudication core members) that judge perception noise in Round 9 live rooms was perhaps 70–80% of the baseline level, and that the judge perception noise level in an average Round 1 panel at Worlds would be in the range of 120–130% of the baseline, but we admit that these are largely intuitive guesses.[23]  Fortunately, the outcomes of our simulations were not particularly affected by the precise numbers we used from these ranges (or even numbers plausibly outside of them).  We do not mean that more judging noise does not make results less accurate; it does.  We mean that more or less judge perception noise hurts or helps the systems we looked at in roughly proportional ways.  So, the precise degree of judging noise (within realistic boundaries that are consistent with the data from the past 10 years at Worlds) does not impact which scoring system will produce the most accurate results.  Our simulations showed that under any plausible assumptions, the accuracy of all systems improved when the system employed judge packing.  Moreover, tapered systems improved to a larger degree by introducing packing.  The apparent reason for this is that judge packing in ET can start earlier and be done more aggressively, because the number of live rooms decreases much faster.

Figure 10 shows the number of mathematically live rooms at a Worlds with 360 teams, when going into the round number given on the X axis.  A room is mathematically live if it is still mathematically possible for some team in that room to make the open break (e.g., if they took first place in every remaining round).  The status quo system is represented in orange and the Early Taper system in red.
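The live-room test described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not the simulator used for the paper: the function names, the `break_threshold` input (the number of points assumed necessary to break) and the example point totals are all hypothetical.  The only rule taken from the text is that a room counts as live if some team in it could still reach the break by taking first place in every remaining round.

```python
from typing import Sequence


def best_case_total(points: int, first_place_points: Sequence[int]) -> int:
    """Points a team would hold if it took first place in every remaining round.

    `first_place_points` lists the points awarded for a first in each remaining
    round, so a tapered system can simply pass different values per round.
    """
    return points + sum(first_place_points)


def is_live_room(team_points: Sequence[int],
                 first_place_points: Sequence[int],
                 break_threshold: int) -> bool:
    """A room is live if at least one team in it can still reach the threshold.

    `break_threshold` is a hypothetical input: the total assumed necessary
    to make the open break.
    """
    return any(best_case_total(p, first_place_points) >= break_threshold
               for p in team_points)


def count_live_rooms(rooms: Sequence[Sequence[int]],
                     first_place_points: Sequence[int],
                     break_threshold: int) -> int:
    """Number of rooms in which some team can still break."""
    return sum(is_live_room(room, first_place_points, break_threshold)
               for room in rooms)


# Hypothetical example: two rounds remain, a first is worth 3 points in each,
# and 18 points are assumed necessary to break.
example_rooms = [[12, 13, 10, 9], [10, 11, 9, 8]]
live = count_live_rooms(example_rooms, [3, 3], 18)
```

In this made-up example the first room is live (13 + 6 reaches 18) and the second is dead (11 + 6 falls short), so `live` is 1.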

Using the SQ system, there are no truly dead rooms until Round 5 (and then only 2), so if judge packing is going to happen earlier, it will need to be at the expense of teams who still have a chance to break.  Using ET, judge packing can happen sooner without disadvantaging any live teams.  Starting in Round 3, ET has significantly fewer live rooms, allowing for higher quality (i.e., more accurate) judge panels in these live rooms.  In Rounds 5 through 8, there are about half as many live rooms in ET as in SQ.

Figure 11 represents the performance of SQ and ET, each under four sets of assumptions regarding judge packing.  A system’s accuracy is measured here using the Quality Deficit Score metric.  On the left are the four SQ simulations.  The leftmost dataset represents results from “flattened” judge packing.[24]  The next three represent results from packing assumptions that we will call “low”, “medium” and “high”.  The three assumptions differ primarily in how much more judge perception noise is assigned to Round 1 and how much less to Round 9.  Low packing assumes that the perceptions of Round 1 panels are only a little noisier than the baseline level, and Round 9 panels in live rooms are only a little less noisy than baseline levels.  High packing assumes that the perceptions of Round 1 panels are much noisier than the baseline, and Round 9 panels in live rooms are much less noisy than baseline levels.  We think that the reality is very likely to lie between these two sets of assumptions, and medium packing represents the assumption that packing affects judge perception noise to a degree that is somewhere between low and high.  Under all of these assumptions, Round 5 is noisier than baseline and Round 6 is less noisy than baseline noise (i.e., a typical HWS RR panel).
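To make the shape of these packing assumptions concrete, here is a minimal Python sketch of a per-round noise schedule.  It assumes a piecewise-linear interpolation that crosses the baseline midway between Rounds 5 and 6; that interpolation, the function name and the example multipliers (1.25 and 0.75) are our illustrative assumptions, not the values used to produce Figure 11.  The baseline of 3.67 is the HWS RR figure reported in footnote 24.

```python
BASELINE_NOISE = 3.67  # average HWS RR panel noise (see footnote 24)


def noise_schedule(round1_mult: float,
                   round9_mult: float,
                   rounds: int = 9,
                   crossover: float = 5.5,
                   baseline: float = BASELINE_NOISE) -> list[float]:
    """Judge perception noise for live rooms in each round.

    Assumption (hypothetical): noise falls linearly from `round1_mult` times
    the baseline in Round 1 to the baseline at `crossover`, then linearly to
    `round9_mult` times the baseline in the final round.
    """
    schedule = []
    for r in range(1, rounds + 1):
        if r <= crossover:
            frac = (crossover - r) / (crossover - 1)
            mult = 1.0 + frac * (round1_mult - 1.0)
        else:
            frac = (r - crossover) / (rounds - crossover)
            mult = 1.0 + frac * (round9_mult - 1.0)
        schedule.append(baseline * mult)
    return schedule


# Flattened packing: the same baseline noise in every round.
flattened = noise_schedule(1.0, 1.0)

# An illustrative packed schedule using multipliers drawn from the
# 120-130% and 70-80% ranges discussed earlier in this section.
packed = noise_schedule(1.25, 0.75)
```

With a crossover of 5.5, Round 5 comes out slightly noisier than the baseline and Round 6 slightly less noisy, matching the property described above.  Weaker or stronger packing assumptions correspond to multipliers closer to or further from 1.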

Since Early Taper has never been used, no one has any experience of what judge packing would look like under that system.  So, to estimate the judge perception noise in each round of ET, we assumed that if two systems have the same number of live rooms in a given round (at a tournament of the same size, with the same judging pool), then the quality of judging in those live rooms should be the same.  So, obviously, the judging noise in Round 1 should be the same for both systems because every room is a live room in both systems.  If half the rooms are live and half are dead, then (all other things being equal), the average quality of the panels in live rooms should be the same in SQ or ET.  This seems true, whether half the rooms are dead after Round 7 (as in the SQ) or half the rooms are dead after Round 4 (as in ET).  So, our assumptions for low, medium and high noise levels for ET were set by comparing the number of live rooms in ET with the number of live rooms in SQ, then matching the estimated noise in the same type of packing assumption (flat, low, medium or high).  For example, going into Round 5 in ET, there are 45 live rooms.  In SQ, there are 45 live rooms going into Round 8.  So, the judging panel noise for ET with low packing in Round 5 is going to be equal to the judging panel noise for SQ with low packing in Round 8.  Similarly, for medium and high packing.
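The matching rule just described can be sketched as a simple lookup.  The Python below is a hedged illustration rather than the paper's own procedure: the function name is hypothetical, the noise values in the example are invented placeholders, and picking the SQ round with the closest live-room count (earliest round on ties) is our assumption for cases where no exact match exists.  The live-room counts in the example are the ones stated in the text (every one of the 90 rooms is live in Round 1 of a 360-team tournament; 45 rooms are live going into SQ Round 8 and ET Round 5).

```python
def match_noise_by_live_rooms(sq_live: dict[int, int],
                              sq_noise: dict[int, float],
                              et_live: dict[int, int]) -> dict[int, float]:
    """Assign each ET round the noise of the SQ round with the most similar
    number of live rooms.

    Assumption (hypothetical): when no SQ round has exactly the same count,
    the closest count is used, and ties go to the earliest SQ round.
    """
    et_noise = {}
    for et_round, rooms in et_live.items():
        sq_round = min(sq_live, key=lambda r: (abs(sq_live[r] - rooms), r))
        et_noise[et_round] = sq_noise[sq_round]
    return et_noise


# Live-room counts taken from the text.
sq_live = {1: 90, 8: 45}
et_live = {1: 90, 5: 45}

# Placeholder noise levels for one packing assumption (invented for illustration).
sq_noise = {1: 4.40, 8: 3.10}

et_noise = match_noise_by_live_rooms(sq_live, sq_noise, et_live)
```

Here ET Round 5 inherits the noise assigned to SQ Round 8, and Round 1 is identical in both systems.  Running the same lookup once per packing assumption (flat, low, medium, high) would give the four ET schedules.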

Figure 11: How much does judge packing improve break quality under SQ and ET?

The left half of Figure 11 shows that packing judges in SQ improves the break accuracy (using the QDS metric) by 7%, 12% or 17%, depending on whether you assume a low, medium or high impact on judge perception noise from packing.  This represents improvement over a flattened packing model, and the improvement over a simulation with no packing at all would be much greater.[25]  Similarly, the right half of the chart shows that packing judges in ET improves the break accuracy (QDS) by 11%, 20% or 26%, depending on whether you assume a low, medium or high impact on judge perception noise from packing.  So, regardless of what plausible packing assumptions you think are correct, packing judges helps ET more than it helps SQ.  The superior impact of packing on ET makes sense because by reducing the number of live rooms earlier, tournaments using ET can begin packing panels earlier.  So, tournaments using ET will have more accurate panel decisions (i.e., less judge perception noise) in live rooms in all rounds where ET uses packing, as compared to tournaments using SQ.  This improvement in accuracy starts earlier in the tournament and is more intense in all rounds after it begins.

So, here are the simple lessons from this section about judge packing.  First, judge packing has a positive and significant impact on the accuracy of the break, regardless of whether you think judge packing has a high, low or medium impact on the accuracy of judging panel perception.  Second, and more to the point of this paper, judge packing improves break accuracy for both SQ and ET, and it does so by roughly the same amount, though considered proportionally, packing helps ET to a significantly greater degree.  Again, this is true regardless of whether you think judge packing has a high, low or medium impact on judge perception noise.  So, because judge packing helps SQ less than it helps ET, we feel safe in leaving packing out of our model.  In other words, by leaving packing out we are not favoring our conclusions; if anything, we are doing just the opposite, giving SQ its best shot by assuming flattened judge packing outside of this section of the paper.  We have chosen to do this because we have no firm evidence for preferring one set of assumptions about judge packing over another, so as long as it does not impact which of the systems performs better (which we’ve just shown it does not), it is better to have our model avoid any unwarranted assumptions.

[21]  Judge packing makes panels in high priority rooms more accurate, but at the cost of making panels less accurate in low priority rooms.  It would be much more difficult to build a simulator that made the judging panels of different rooms different quality (during the same round).  So, what we did was to uniformly adjust the quality of the panel in every room in a particular round, and to set it to the level one could expect in a live room during that round.  The justification for this is simple.  We are concerned in this paper with those teams who are in contention to break, and so it is no problem that teams in dead rooms are getting more accurate judging in our simulations than is realistic.  Where these teams end up is (by definition) immaterial to the quality of the teams in the break.  So, this simplification in our model has zero impact on the HLC and QDS metrics.  This simplification of packing would introduce minor distortion to PEC, RDS and SSD, but because packing is not used in the simulations discussed outside this section of the paper, and because these three metrics are not used inside this section, none of the data in this paper is distorted by our simplification.

[22]  We are painfully aware that judge packing is much more complicated than this binary designation suggests, but this way of drawing the distinction is the clearest objectively determinable criterion that can be used to estimate how much packing can be done at any given point in a tournament.  If there were a way to accurately calculate the number of live and dead rooms for the ESL or EFL break, then we would, but there is not sufficient consistency in the number of points needed to break in these categories.  As discussed below, we use this live and dead room information to make quite modest assumptions about how judge packing operates in each system.  We are not claiming that tournaments do or should pack based on this distinction, merely that this distinction provides a rough measure of when more or less packing can be done at a tournament.  We are certainly not advocating that tournaments treat rooms as dead just because the teams in them are no longer in contention to break open.

[23]  A concern has been raised about scoring systems like Early Taper that the greatest number of team points are being given out in Round 1, when the panels are at their weakest.  We address this concern directly in Section 7.1.

[24]  It is important to distinguish between a simulation of a tournament with no judge packing and a simulation with flattened judge packing.  With no judge packing, judges in every round are allocated as they are in Round 1, with the goal being to get as close as possible to equal panel quality in every room.  Flattened judge packing is where one assigns the same baseline noise level (i.e., judging quality) in every round, but this level is much closer to what you would expect in a live round panel in Round 5 or 6, which is clearly better than in a Round 1 room.  In this paper, this baseline is the level of noise based on an average HWS RR panel.  One way to think about this is that flattened packing is similar to taking the improvements in judge panel quality in live rooms as the tournament goes on and distributing these improvements evenly across all nine rounds.  But, to the best of our knowledge, the only hard empirical data that exists on judge noise is what comes out of the HWS RR data, comparing independent judge panels, and that gives us an average panel noise level of 3.67 (our baseline).  Flattened packing just means applying this judge perception noise (3.67) to every round, not because this is more realistic, but rather because it avoids adding additional unwarranted assumptions.  As is made clear by the end of this section on judge packing, the charts, arguments and conclusions of this paper would not change significantly if we had chosen to build judge packing into our model throughout the paper.  The only changes would have been for all the chart values to decrease (i.e., improve) somewhat and for there to be an additional set of unwarranted assumptions in our model, these being the choice of specific judge perception noise levels for each round based on packing.

[25]  See the previous footnote for the distinction between flattened packing and no packing.  Having no judge packing performs much worse in both systems.  In SQ, no packing is 29% worse than flattened packing, and in ET no packing is 48% worse than flattened packing.  Given that all plausible packing frameworks improve significantly over flattened packing, it is clear why judge packing is a very appealing (though not unproblematic) strategy.