6.5 | Field Size Sensitivity
To make the presentation of our findings easier to follow, we chose a specific WUDC tournament size (360 teams) for almost all of our models.[26] However, we also ran simulations at other tournament sizes, from 240 to 400 teams in increments of 40, to ensure that our conclusions were consistent across field sizes. To create the charts in this section, we ran 2.4 million simulations. Each data point in Figures 12.1 and 12.2 represents the average QDS from 20,000 simulations. The SQ and RT systems were each evaluated at five field sizes (240, 280, 320, 360, and 400 teams), and each combination was analyzed for 12 tournament lengths (1 to 12 rounds).[27] When we looked at the average quality deficit scores (QDS) associated with these configurations, we found a striking result.
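As a rough sketch of this experimental design (the function below is a hypothetical placeholder, not our actual simulation code), the full grid works out to 2 systems × 5 field sizes × 12 tournament lengths × 20,000 runs = 2.4 million simulations:

```python
# Hypothetical sketch of the simulation grid described above. The helper
# is a stand-in so the sketch runs; the real model paired rooms, applied
# noise to results, and scored the resulting break.
import random
from statistics import mean

SYSTEMS = ["SQ", "RT"]                   # status quo vs. round-by-round taper
FIELD_SIZES = [240, 280, 320, 360, 400]  # teams
ROUND_COUNTS = range(1, 13)              # 1 to 12 preliminary rounds
RUNS_PER_POINT = 20_000

def simulate_tournament(system, teams, rounds):
    """Placeholder: run one simulated tournament and return its break's QDS."""
    return random.uniform(30, 60)  # dummy value, not a real QDS

# One chart data point = average QDS over 20,000 simulated tournaments.
avg_qds = {
    (system, teams, rounds): mean(
        simulate_tournament(system, teams, rounds)
        for _ in range(RUNS_PER_POINT)
    )
    for system in SYSTEMS
    for teams in FIELD_SIZES
    for rounds in ROUND_COUNTS
}

# 2 systems x 5 field sizes x 12 lengths x 20,000 runs = 2,400,000 simulations
assert len(avg_qds) * RUNS_PER_POINT == 2_400_000
```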

Figure 12.1: Round-by-Round system

Figure 12.2: Status Quo system
In Figures 12.1 to 12.4, the X-axis represents the number of rounds run at the tournament before the break, while the Y-axis represents the average quality deficit score for the break (so lower scores are better). The obvious expectation is that as more preliminary rounds are run, the accuracy of the break increases and the QDS goes down. On the left, we see that the tapered system (after a brief instability when breaking after just 2 rounds) consistently improves the quality of the break as more rounds are added, regardless of the size of the tournament, though more slowly as the number of rounds increases. On the right, we see that the status quo system also displays an overall trend toward higher-quality breaks, but that it does so in a dramatically inconsistent manner. Figures 12.3 and 12.4 magnify the range of rounds we are most concerned with. Holding the number of rounds constant, one expects larger tournaments to produce less accurate breaks. This expectation is clearly borne out in the RT taper system, and it is also borne out in the status quo data, though the erratic results of the status quo make this harder to see.

Figure 12.3: Round-by-Round system (magnified)

Figure 12.4: Status Quo system (magnified)
What is stunning about the status quo data here is both the inconsistent improvement of the system as more rounds are added and the dramatic difference that field size makes in the number of rounds that will yield the best results. For example, with a 400-team tournament using the status quo, the open break will be of higher quality (according to both QDS and HLC) after 7 rounds than after 9 rounds.[28] With 280 teams, the open break after 8 rounds is of higher quality than after 9. Moreover, this is not due to randomness in the simulations: our results are averaged over 20,000 simulations, and we get consistent results if we repeat the experiment.
The inconsistent improvement of the status quo stems from the large bubbles the SQ creates and from the probability of having a clean break (or close to one). Mathematically, how close a tournament is to a clean break depends only on the field size, the number of rounds, and the results of pull-up rooms, not on any other debate results. So, with 400 teams, it is very likely that there will be a clean break after 9 rounds, or close to it.[29] When brackets are reasonably large (say, more than 10 teams), the top team on a given number of team points is invariably a much higher-quality team than the bottom team on the next-higher point bracket. Indeed, this is generally also true of the second- and third-highest teams when compared to the second- and third-lowest (though obviously to a lesser extent). This phenomenon is illustrated below in Figure 13.1, showing the results of a single tournament simulation under SQ.
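To make this concrete, here is a minimal sketch (our illustration, not the paper's simulation code; the function names are ours) under the simplifying assumption that every room draws four teams from a single bracket and pull-up effects are ignored, so exactly a quarter of each bracket gains 3, 2, 1, and 0 points each round. On those assumptions, bracket sizes, and hence how close the break line falls to a bracket boundary, are fully determined by field size and round count:

```python
# Sketch of why SQ bracket sizes are (nearly) deterministic. Simplifying
# assumption: every room takes four teams from one bracket and awards
# 3/2/1/0, so exactly a quarter of each bracket moves up by each amount.
# Pull-up rooms perturb this slightly; nothing else affects the counts.

def bracket_sizes(teams, rounds):
    """Expected number of teams on each point total after `rounds` rounds."""
    counts = {0: float(teams)}
    for _ in range(rounds):
        nxt = {}
        for points, n in counts.items():
            for gained in (0, 1, 2, 3):
                nxt[points + gained] = nxt.get(points + gained, 0.0) + n / 4
        counts = nxt
    return counts

def break_line(teams, rounds, break_size=48):
    """Return (point level, fraction into that bracket) where the break falls;
    a fraction near 0 or 1 means a clean (or nearly clean) break."""
    remaining = float(break_size)
    counts = bracket_sizes(teams, rounds)
    for points in sorted(counts, reverse=True):
        if remaining <= counts[points]:
            return points, remaining / counts[points]
        remaining -= counts[points]

# After 9 rounds: with 360 teams the line falls several teams into the
# 17-point bracket (the upper part of the 17s); with 400 teams it sits
# almost exactly on the 17/18 boundary, i.e., a clean break.
for teams in (360, 400):
    print(teams, break_line(teams, 9))
```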
Figure 13.1: MDS plotted against tab position for a single SQ tournament simulation (360 teams)
Ideally, we want the curve created by plotting MDS against tab position to be smooth and non-increasing, but noise makes this impossible. The closer a system comes to this ideal, the better it should score on all five of our metrics. Ultimately, the lesson here is that, due to the inherent nature of the SQ system, large numbers of teams will end up on the same point level (i.e., it has large bubbles), and top teams at the n−1 point level will be dramatically stronger (i.e., have much higher MDS) than bottom teams at the n point level. The analysis above explains why this phenomenon results in the SQ system yielding very inconsistent break accuracy depending on the field size at the tournament, whereas tapered systems provide more consistent (and better) accuracy regardless of field size.
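One simple way to make this ideal concrete (a purely illustrative measure of our own here, not one of the five metrics defined earlier) is to total the upward jumps in MDS as one reads down the tab. A perfectly ordered tab scores zero, and the large SQ jumps between point levels would dominate the total:

```python
# Illustrative only: quantify how far a tab is from the smooth,
# non-increasing ideal by totalling every upward jump in MDS down the tab.

def monotonicity_deficit(mds_down_the_tab):
    """Total upward movement in MDS reading down the tab (0 is ideal)."""
    return sum(
        max(0.0, later - earlier)
        for earlier, later in zip(mds_down_the_tab, mds_down_the_tab[1:])
    )

# Toy tab: the 55.2 -> 68.9 step mimics the jump from the bottom of one
# point level to the much stronger top of the level one point below it.
tab = [70.1, 69.8, 55.2, 68.9, 67.5]
print(monotonicity_deficit(tab))  # ~13.7
```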
Figure 13.1 represents a field of 360 teams, with each tiny orange dot representing one team. The grey lines show the trend lines from the top of a point level to the bottom.[30] The break line falls in the upper part of the 17s, as it consistently will unless the field size differs significantly from 360. Since the same number of teams would still break, if there were 400 teams the break line would consistently fall at or near the clean break between the 17s and the 18s. In that case, all the relatively weak teams at the bottom of the 18s are included in the break, while all the quite strong teams at the top of the 17s are excluded from it. This combination leads to a sharp reduction in the overall quality of teams in the break (i.e., a higher QDS). In contrast, RT produces very small bubbles, within which there is much smaller variation in teams' mean demonstrated skill. So it makes fairly little difference whether there is a clean break, even though there frequently will be one. Figure 13.2 shows an outcome near the ideal of accuracy: RT in a tournament without any noise. This is not a realistic situation, but it is what we would like our scoring system to approach.
Figure 13.2: RT in a tournament without noise, an outcome near the ideal
Figure 13.3 shows a curve generated by ET in which, although it is not smooth, there is no place where the break line could be drawn that would be particularly disruptive to the quality of the break, so this system will not fall prey to the field size sensitivity problem. There are only faintly discernible groupings of teams on the same number of points, and there are no major jumps in skill between these groupings. In Figure 13.4, we see that 18 rounds of the status quo still produce very clear groupings of teams on the same number of points. This shows that SQ-18 will still inevitably result in more field size sensitivity than a tapered scoring system: the quality of the break will jump up and down depending on whether the number of teams competing makes a clean (or nearly clean) break likely. This is certainly not desirable. SQ-18's overall ordering of the top half of the team tab will be much more accurate than that of SQ-9, because the bubbles will be smaller and the ordering of teams within each point level will be much more accurate. But, just like SQ-9, SQ-18 has very big jumps in quality from the bottom of one point level to the top of the level one point lower, and that significantly undermines the accuracy of the system. The chart of the ET system has some spikes, but it is generally smoother, reflecting its greater accuracy.
[26] The size of 360 teams reflects common field sizes over the past 10 years, the average of which is approximately 350. We acknowledge that in certain recent years the number of teams has been much lower than this, but we believe that 360 represents a fairly typical tournament size.
[27] The RT taper scoring system is used in this comparison because it is the easiest to define for any number of rounds, whereas the exact pattern of points in a Day Taper or Early Taper system is not clearly defined for tournaments that are not nine rounds long. Regardless, none of the tapered systems we looked at displayed the kind of irregularity across field sizes that the status quo clearly displays. That said, it is worth remembering that in noisy environments the DT taper outperforms the RT taper, and Early Taper outperforms the DT taper.
We plan to investigate which taper patterns work best for various tournament sizes and lengths. Our initial explorations of this area lead us to expect that the best patterns will all look analogous to Early Taper, tapering in the early rounds and then flattening out. If so, we would consider all of them to be varieties of Early Taper systems.
[28] With 400 teams, the average QDS after 9 rounds is about 42.5, roughly 30% higher (i.e., worse) than the average QDS of about 32.5 after 7 rounds.
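As a worked check on that percentage, using the two averages just quoted: (42.5 − 32.5) / 32.5 = 10 / 32.5 ≈ 0.31, i.e., roughly 30%.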
[29] History bears this out. The 2010 WUDC (Koc) had 388 teams, and only two teams in the top 48 were on 17. The 2012 WUDC (De La Salle) had 396 teams, and only one team in the top 48 was on 17. The 2013 WUDC (Berlin) had 388 teams, and only one team in the top 48 was on 17. As field sizes move toward 400 teams, the chance of a clean (or nearly clean) break only increases.
[30] In SQ simulations, it is easy to identify where teams on each point level start and stop. We have added trend lines for the portions of the tab representing teams on 17 and on 18, since that is where the break occurs. Every tournament simulation will differ in its details, but these trend lines will be very similar.