Section 7: Potential Objections
In this section, we discuss the three objections to our proposal that we have encountered most often: first, that tapered systems have the least skilled panels assigning the most points; second, that by reducing the number of live rooms more quickly, tapered systems make Worlds less fun for more teams; and third, that other new tabbing systems would have even greater benefits.
7.1 | The “Weak Early Panels” Objection
The most common objection that we have heard about using tapered points is that it awards the most points in the rounds that have the weakest judging panels (i.e., the highest judge perception noise), before the tournament has had an opportunity to pack the best judges into the live rooms. Objectors have placed a particular emphasis on Round 1, since at the start of Worlds, the adjudication core may not know the judging pool well enough to place into every room a chair in whom they have high confidence. Of course, Round 1 panels are not any worse using ET than they are using SQ. There are two main responses to this objection and both are important, for different reasons.
The first response is that, uncontroversially, calls are much more obvious in the early rounds of Worlds, so it requires much less judging skill for the panel to make the correct call. This is particularly true of Round 1, but it is also true of the other early rounds. There are cases of multiple highly competitive teams in the same room in Round 1, but these are unusual, and having more than two in a room is rare. As a result, even mediocre panels are likely to come to the correct call in Round 1. And, in the other early rounds, though judge packing has not yet been aggressively implemented, many adjudication cores will already ensure that there are reliable chairs in the higher-point rooms. Because the teams have not yet been well sorted into rooms by skill level, the calls in those rooms will generally be more obvious than they are later in the tournament, and as the calls become less obvious, the panels get better. With ET, they get better much sooner and more dramatically, as we established in Section 6.4.
The second response is more dispositive of the question, even if it may be less satisfying or intuitive. Ultimately, the best response to the objection is that our model has already shown that when we do take the relative weakness of early round panels into consideration (using any plausible packing assumptions), the results of the tapered scoring systems are clearly better.[31] Large debating competitions are complex systems with thousands of interacting moving parts, and our intuitions about what systems will work best are simply not reliable. Moreover, the Brazil Nut analogy shows that there are at least competing intuitions pushing toward adopting a tapered system. You should start sorting the nuts with a big hard shake. Yes, that first big shake will inevitably move some small nuts up and maybe even some big nuts down, but on the whole, it will make much more progress toward an accurate sorting than starting with a small shake, especially when you are limited to just nine shakes before you open the can and skim off the top 48 nuts. At the start of this research, each of the authors had different intuitions about what system would produce the best results, and none of these were exactly borne out by the model. In the end, the Early Taper system emerged as clearly superior.
Despite all this, some people may still be concerned about relatively weak panels dispensing large numbers of points. We suspect that this may just feel problematic, and that this feeling is likely due to a couple of understandable assumptions that turn out to be incorrect. The first assumption is that the number of points available in a round is directly proportional to the impact that round will have on a team’s ranking on the tab. The second is that the quality of a judging panel is directly proportional to how likely they are to come to the correct decision. In both cases, the assumption is wrong, and for the same reason. The confounding factor that disrupts both of these relationships is how close to the end of the tournament you are. The first round is very different from the last round. Once this is clear, one can see two crucial insights. First, ET does not give early rounds outsized influence; it merely elevates them from near irrelevance. Second, although early panels are weaker, that doesn’t imply that they are less likely to get the call right. A wine connoisseur may be unable to discern the difference between similar wines, but anyone can tell the difference between wine and grape juice. Our ongoing research aims to bring greater precision to bear on both round influence and call errors across rounds. We hope to publish this soon.
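The wine-and-grape-juice point can be illustrated with a minimal sketch. It is not the model used in this paper. It assumes a simplified two-team comparison in which each team's perceived score is its true score plus independent Gaussian perception noise, and all of the numbers (skill gaps and noise levels) are hypothetical values chosen only for illustration.

```python
import math


def prob_correct_ranking(skill_gap: float, perception_sd: float) -> float:
    """Probability that a panel ranks the stronger of two teams ahead.

    Assumption (not from the paper): each team's perceived score equals its
    true score plus independent Gaussian noise with standard deviation
    `perception_sd`. `skill_gap` is the stronger team's true-score advantage.
    """
    if perception_sd <= 0:
        raise ValueError("perception_sd must be positive")
    # The difference of two independent noise terms has sd = perception_sd * sqrt(2).
    z = skill_gap / (perception_sd * math.sqrt(2))
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


# Hypothetical early round: noisy panel, but a wide gap between the teams.
early = prob_correct_ranking(skill_gap=10.0, perception_sd=6.0)
# Hypothetical late round: better panel, but closely matched teams.
late = prob_correct_ranking(skill_gap=2.0, perception_sd=3.0)

print(f"early round, weak panel, obvious call: {early:.3f}")
print(f"late round, strong panel, close call:  {late:.3f}")
```

Under these assumed numbers, the noisier early panel is still more likely to rank the pair correctly than the better late panel, because the gap between the teams shrinks faster than the noise does. A full British Parliamentary room has four teams rather than two, so this is only a rough picture of the effect.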
Of course, in the end, every system that has noise will exclude some deserving teams from the break and include some undeserving teams. None of the systems is perfect. The goal is to pick the system that will minimize these errors, and whether it is intuitive or not, ET does a much better job at this than SQ. That’s what all the data show very clearly.[32]
[31] Now, one might wonder whether actual panels in early rounds are even less accurate than we assumed them to be in our investigation of judge packing, but there are two responses to this concern. First, we have run many simulations where we assume even weaker panels in early rounds, and these have not substantially altered the result that tapered systems (especially ET) are significantly more accurate. Second, there is a limit to how weak we can assume these early panels can be and remain consistent with the raw data from the past 10 years of Worlds. It turns out that no mathematically consistent judge packing framework can make SQ perform well enough to even be as accurate as ET using flattened packing.
Here’s the mathematical justification of this claim, which is just for readers who cannot seem to give up the claim that if early panels are weak enough, ET will give worse results. As stated in Section 4, the round-to-round variation of teams’ combined speaker points has a standard deviation of 4.75. This variation must be a combination of demonstrated skill noise and judge perception noise. So, the maximum standard deviation for judge perception noise averaged over nine rounds is 4.75, if we assume (absurdly) that all teams always debate at exactly the same skill level in every round (i.e., demonstrated skill noise = 0). Of course, given judge packing, Round 1 panels can have a higher perception noise than 4.75, but even if we assume (absurdly) that Round 9 panels are so good as to have zero perception noise, and we shift this noise from later rounds into the earlier rounds, we cannot get a Round 1 noise that exceeds 7.5. And, if we model how packing would reduce this noise to zero over the course of nine rounds, the resulting “absurd packing assumptions” would improve the performance of both SQ and ET comparably. But, it wouldn’t improve SQ enough to even outperform ET with flattened packing.
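The first step of this bound can be written out as a short sketch. It assumes that the two noise sources are independent, so that their variances add, and that "averaged over nine rounds" refers to the average of the per-round variances. The symbols below are introduced here for illustration and are not notation from the rest of the paper. The Round 1 figure of 7.5 depends on how the noise is redistributed across rounds, so it is not derived here.

```latex
% Assumed decomposition: observed round-to-round variance is the sum of
% demonstrated skill variance and judge perception variance.
\sigma_{\text{obs}}^{2} = \sigma_{\text{skill}}^{2} + \sigma_{\text{judge}}^{2},
\qquad \sigma_{\text{obs}} = 4.75

% Setting demonstrated skill noise to zero gives the largest possible
% judge perception noise.
\sigma_{\text{judge}}
  = \sqrt{\sigma_{\text{obs}}^{2} - \sigma_{\text{skill}}^{2}}
  \le \sigma_{\text{obs}} = 4.75

% With per-round perception noise \sigma_{\text{judge},r}, r = 1, \dots, 9,
% the same cap applies to the nine-round average variance.
\frac{1}{9} \sum_{r=1}^{9} \sigma_{\text{judge},r}^{2}
  \le 4.75^{2} \approx 22.56
```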
[32] There is nothing “natural” or privileged about the status quo system for assigning points. It’s just what we are used to. We can all agree that some scoring systems will assign too many points in early rounds and some will assign too few points in early rounds. For example, systems that start with fewer points and then dramatically increase them perform quite poorly. Ultimately, the goal is to find the “sweet spot” where there aren’t too few or too many points allocated in each round. We have excellent reason to believe that the Early Taper system hits this sweet spot.