Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points

by R. Eric Barnes, Paul Kehle, Nick McKenny and Chuan-Zheng Lee • HWS

Ideally, preliminary rounds at tournaments sort teams so that the best teams break into elimination rounds. At the World Championships of debate, the scoring system used during the nine preliminary rounds does a poor job of sorting teams accurately. Adding more rounds would increase accuracy and fairness, but this is impractical. Using mathematical models and computer simulations of tournaments, we show that a slightly different scoring system over the nine preliminary rounds would improve the accuracy of the break even more than doubling the number of preliminary rounds to 18 would. Other implications and insights into tabulation and sorting accuracy are also discussed.

Section 2: An Analogy

There is a phenomenon that physicists call “granular segregation”, often casually referred to as “the Brazil nut effect”.  Granular segregation explains why, when you open a can of mixed nuts, the Brazil nuts are disproportionately on top, even if all the nuts were uniformly mixed when they were initially packaged.  Of course, this is a very common-sense phenomenon.  As you shake a can of mixed nuts (e.g., during shipping), the larger ones tend to rise to the top and the smaller ones (including broken pieces of all sorts) fall to the bottom.  This process is well studied because of its various industrial applications.  The goal in many industrial settings is essentially to sort large grains up and small grains down, which makes the whole assortment more tightly packed.  Studies of this problem have found that tapping (i.e., shaking) the grains (e.g., mixed nuts) with constant force is not an efficient sorting strategy.  “A tapped granular system … won’t reach a state of maximum density on its own but tends to get stuck in a state of intermediate density.  It can only be made more dense through a process called annealing, in which the tapping grows more gentle over time.” (Acevedo, Asencio, Zuriguel, & Maza, 2017)  Again, this makes intuitive sense.  If you have a big can full of unsorted nuts, you’ll begin by giving it a good hard shake to start the sorting process.  But once the can is almost completely sorted, you would want to give it a much gentler tap, since a harder shake would do more to disrupt the existing sorting than to improve it.  Indeed, the general rule is to start with big shakes and gradually move to softer and softer shakes.

As things are with nuts, so are they with debate teams.  Preliminary rounds at tournaments are essentially a mechanism for sorting the most skilled teams to the top of the heap, where they are scooped off and placed into the elimination rounds for a final sorting.  A tournament begins with a totally random mixture of teams, which we “shake” up by running preliminary rounds that sort them up or down by awarding 3, 2, 1 or 0 team points.  The problem with our status quo sorting strategy is that there is no process of annealing; every shake has the same degree of intensity.  That is to say, every round is worth the same number of points.  What is needed to achieve superior sorting outcomes is to have earlier rounds worth more points and later rounds worth fewer points, starting off with hard shakes and ending up with soft shakes.

In what follows, we present the results of computer simulations of debate tournaments using various scoring systems, which are then evaluated using various metrics of success.

Section 3: Modelling Assumptions

Rigorous study of the mathematical phenomena in tournament structures requires us to begin with some baseline assumptions that model how results of debates come about.  Our simulations use this model to generate millions of hypothetical tournaments on which to assess various tournament structures.  In this section we outline the model; we discuss how we arrived at specific parameters (that is, the means and amounts of natural variation, or “noise”, associated with a team’s performance) in Section 4.

We model differences in skill between teams by assigning each team a “baseline skill”, reflecting the combined speaker points[2] the team would receive on average over many rounds if perfectly assessed.

All teams have rounds in which they perform better or worse.  We model this in each round by adding zero-mean normally distributed noise to each team’s baseline skill, which we call “skill noise”, to arrive at a “demonstrated skill” for that round.  The amount of skill noise is assumed to be the same for all teams; that is, all teams are assumed to be similarly susceptible to being inconsistent in round-to-round performance.  However, while the parameters for all teams are the same in the simulations, randomness will result in some teams having higher round-to-round range than others, as we see in real results at the end of tournaments.

Finally, we model human imperfections in judging by adding a further zero-mean normally distributed noise term, which we call “perception noise”, to demonstrated skill, arriving at a “perceived skill” for each team in each round.  While in the first instance we assume the amount of this noise to be the same for all judges and all teams, in Section 7.1 we discuss a refinement of this assumption that accounts for differing judge quality.
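The three layers of the model (baseline, demonstrated and perceived skill) can be illustrated with a minimal Python sketch of a single four-team room.  This is not the simulation code used for the results in this paper: the function name `simulate_room`, its parameters, and the numbers in the usage lines are illustrative assumptions, and the sketch simply awards 3, 2, 1 or 0 team points in order of perceived skill.

```python
import random


def simulate_room(baseline_skills, skill_sd, perception_sd, rng):
    """Simulate one room; return (demonstrated, perceived, points) lists.

    baseline_skills: one baseline skill per team in the room.
    skill_sd, perception_sd: standard deviations of the two noise terms
    (hypothetical parameter names; the same values apply to every team).
    """
    # Demonstrated skill = baseline skill + zero-mean normal "skill noise".
    demonstrated = [b + rng.gauss(0.0, skill_sd) for b in baseline_skills]
    # Perceived skill = demonstrated skill + zero-mean normal "perception noise".
    perceived = [d + rng.gauss(0.0, perception_sd) for d in demonstrated]
    # Rank teams by perceived skill: the best-perceived team gets the most points.
    order = sorted(range(len(perceived)), key=lambda i: perceived[i], reverse=True)
    points = [0] * len(perceived)
    for place, team in enumerate(order):
        points[team] = len(perceived) - 1 - place
    return demonstrated, perceived, points


# Illustrative values only; they are not the parameters estimated in Section 4.
rng = random.Random(0)
demonstrated, perceived, points = simulate_room(
    [160.0, 155.0, 150.0, 145.0], skill_sd=3.0, perception_sd=2.0, rng=rng
)
```

Passing the random number generator in explicitly keeps each simulated room reproducible from a seed, which is convenient when comparing scoring systems on the same set of hypothetical tournaments.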

These noise terms imply that a better team will not always beat a worse team: sometimes, noise will result in the worse team winning, sometimes deservedly (when due to skill noise), sometimes not (when due to perception noise).  The larger the difference in baseline skill between two teams, the rarer such upsets are.
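The relationship between skill gap and upset frequency can be made concrete for a single pair of teams.  The sketch below assumes, beyond what is stated above, that the noise terms of the two teams are independent of each other; the function name `upset_probability` and its parameters are hypothetical.  Under that assumption the difference in perceived skill is normally distributed, with mean equal to the gap in baseline skill and variance equal to twice the sum of the two noise variances.

```python
import math


def upset_probability(skill_gap, skill_sd, perception_sd):
    """Probability that the weaker team is perceived as better in one round.

    skill_gap: stronger team's baseline skill minus the weaker team's (>= 0).
    Assumes independent noise for the two teams and at least one nonzero
    standard deviation.
    """
    # Each team's perceived skill has variance skill_sd**2 + perception_sd**2,
    # so the difference of two independent teams has twice that variance.
    diff_sd = math.sqrt(2.0 * (skill_sd ** 2 + perception_sd ** 2))
    # An upset occurs when the (stronger minus weaker) difference falls below zero.
    z = -skill_gap / diff_sd
    # Standard normal CDF evaluated at z, written in terms of the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

With a gap of zero the function returns 0.5, and the value shrinks towards zero as the gap grows relative to the noise, matching the qualitative claim that upsets become rarer as the difference in baseline skill increases.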

The most important function of preliminary rounds is to select the correct teams for elimination rounds.  It is not their only function, but it is their primary function.  The correct teams are the teams who debated best at the tournament, which we assume to mean the teams with the highest mean demonstrated skill.  We label these teams the “deserving teams” because in a fair system the teams who debate best are most deserving of breaking.  A system for running preliminary rounds is inferior (unfair) to the extent that it facilitates undeserving (i.e., worse-performing) teams advancing over deserving teams.  Moreover, it is a worse problem for a more highly skilled deserving team to miss breaking than for a less skilled deserving team.  Similarly, it is worse to have a lower skilled undeserving team included in the break than it is to have a more skilled undeserving team included in the break.
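As a minimal illustration of this definition, the following sketch identifies the deserving teams from each team's demonstrated skill across rounds and counts how many of them appear in a given break.  The function names and the simple overlap measure are assumptions made for illustration; in particular, the overlap treats every missed deserving team as equally bad, so it does not capture the point above that some errors are worse than others.

```python
def deserving_teams(demonstrated_by_team, break_size):
    """Return the set of teams with the highest mean demonstrated skill.

    demonstrated_by_team: dict mapping team name to a list of that team's
    demonstrated skill in each round.
    """
    means = {
        team: sum(scores) / len(scores)
        for team, scores in demonstrated_by_team.items()
    }
    ranked = sorted(means, key=means.get, reverse=True)
    return set(ranked[:break_size])


def break_overlap(deserving, actual_break):
    """Fraction of deserving teams that actually broke (unweighted)."""
    return len(deserving & set(actual_break)) / len(deserving)


# Illustrative data only: four teams, two rounds, a two-team break.
demonstrated = {
    "A": [160.0, 158.0],
    "B": [150.0, 152.0],
    "C": [155.0, 149.0],
    "D": [140.0, 141.0],
}
deserving = deserving_teams(demonstrated, break_size=2)
overlap = break_overlap(deserving, ["A", "B"])
```

Here teams A and C have the highest mean demonstrated skill, so a break consisting of A and B contains only one of the two deserving teams.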


[2]  In this paper, we will focus on the combined (or total) speaker points that a team gets, ignoring how they are distributed between the two speakers, since this distribution is irrelevant to where the team ranks on the team tab, which is the only thing that determines who breaks.  Note that the combined speaker points of the teams in a room imply their ordinal team ranking, but they also contain more information beyond this.  They add a cardinal measure of skill.