## Section 4: Methods

**How We Constructed the Tournament Simulator**

Our research is based on computer simulations of tournaments. We simulated tournaments using the status quo scoring system and tournaments using a variety of new scoring systems for assigning team points (i.e., how many points for a first, second, third or fourth). Our simulator (programmed in Mathematica) appears to be similar to a tournament simulator written by Neil Du Toit (programmed in Python) that was the basis of a 2014 article on some related issues in BP tabulation (Du Toit, 2014). Our simulator used assumptions based on a much larger data set and also built in a wider range of factors, but despite these differences, the findings of the two independently built simulators were very consistent.

Teams in the simulated tournaments were grouped (or “paired”) into rooms of four for each round, randomly for Round 1 and then from the top down (according to team points), pulling up teams as needed, as is done now, consistent with WUDC rules. Teams in each room were ranked from 1st to 4th place based on their “perceived skill” calculated for that round (see below). Some simulations followed the status quo system and assigned 3, 2, 1 or 0 points to the teams in each room. Other simulations altered the point assignment system (e.g., awarding 6/4/2/0 or 9/6/3/0), depending on which round it was. We demonstrate that it is possible to get dramatically more accurate breaks by using a new scoring system.

In each room, the judging panel assigns team points and speaker points, where the team points and combined speaker points are required to be consistent (i.e., ties and low-point wins are prohibited). In essence, the team points are an ordinal ranking of the teams, while combined speaker points are a cardinal ranking of the teams.^{[3]}

They are required to be consistent because the combined speaker point total for each team is intended to represent the overall quality of debating (i.e., demonstrated skill) from that team in that debate, as perceived by the judges. The set of four combined speaker point totals contains more information, since you can infer the team points from the combined speaks (as on-line ballots do in electronic tabulation software, e.g., Tabbycat), whereas you cannot infer the combined speaks from the team points. Indeed, even though the first thing that judging panels do is decide team points (a wise procedural choice), the **logically** more fundamental question at the end of each debate is: How good (i.e., skillful) was each team in the debate? If we had a tool that somehow directly measured each *team’s* demonstrated skill in a round, then we would obviously determine *combined* speaker points first and then just infer team placement from that, because combined speaks are just a measure of the skill the teams have demonstrated in this debate, as assessed by the judges. All of that is just to explain why our method of simulation begins with combined speaker points, and then infers team placement.^{[4]}
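As a minimal sketch of that last step, assuming a room is represented as a mapping from team to combined speaker points, placement and team points can be inferred as follows. The function name, the position labels and the speaker point values are illustrative only; the 3/2/1/0 and 9/6/3/0 schemes are the ones mentioned above.

```python
def team_points_from_speaks(combined_speaks, scheme=(3, 2, 1, 0)):
    """Rank four teams by combined speaker points and award team points.

    combined_speaks: dict mapping team label -> combined speaker points.
    scheme: points for 1st, 2nd, 3rd and 4th place (status quo by default).
    """
    if len(combined_speaks) != 4:
        raise ValueError("a BP room has exactly four teams")
    if len(set(combined_speaks.values())) != 4:
        raise ValueError("combined speaks must be unique (no ties)")
    # The highest combined speaks takes first place, and so on down.
    ranked = sorted(combined_speaks, key=combined_speaks.get, reverse=True)
    return {team: points for team, points in zip(ranked, scheme)}


# Hypothetical room, made up for illustration.
room = {"OG": 154, "OO": 151, "CG": 158, "CO": 149}
print(team_points_from_speaks(room))                # status quo 3/2/1/0
print(team_points_from_speaks(room, (9, 6, 3, 0)))  # an altered scheme
```

The reverse mapping is not possible: knowing only that a team earned 2 points says nothing about whether it lost first place by 1 speaker point or by 10.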

To construct a simulation, we used a range and distribution of debating skills for teams reflecting those that would be found at a real debate tournament. Tournaments obviously differ in this respect, but the WUDC is generally considered the most important tournament each year, so we used this as our touchstone. Using the team results from the last 10 years at the WUDC, we determined that team skill falls into a roughly normal distribution. We excluded data from any team that didn’t compete in all 9 rounds. We chose to represent these skill levels using a normal distribution with a mean of 150 and a standard deviation of approximately 6.52.^{[5]} See Figure 2 for the frequency of various skill levels at Worlds from 2010–2019.

In creating our computer simulation of tournaments, we began by assuming that each team had a “baseline skill level” (realistically ranging from 120 to 180). Each new simulation began with the computer assigning each of the *n* teams in the tournament a baseline skill level, based on the normal distribution of skills that we obtained by analyzing the WUDC data. Of course, all teams have good rounds and bad rounds for a variety of reasons (e.g., the specific motion, the position they occupy, headaches, etc.), and so in each round of a simulated tournament, our program varied each team’s “demonstrated skill” up or down from their baseline skill. We assumed that these variations in demonstrated skill (i.e., “skill noise”) also followed a normal distribution (of mean 0, added to the baseline skill), which is borne out by the same data set by analyzing the typical variations in team performance over the course of 9 rounds.

As we all know, judges are not perfect. Perceptions of how well a team performed differ from one judge to the next, even among the very best judges. Even the carefully discussed estimation of a panel of judges may not accurately capture the true demonstrated skill of the four teams in a debate. In other words, judges and panels can get the call wrong. Because of this, our model did not determine team placement in a room based on their demonstrated skill. Instead, our model took each team’s demonstrated skill level in that round and used this to generate the judging panel’s “perceived skill level” for that team. Team placement (i.e., team points) in the round was inferred directly from the perceived skill levels of the four teams. So, a team with a higher baseline skill might place below a debate team with a lower baseline skill because the demonstrated skill of the former was lower than the demonstrated skill of the latter in that round, or because the perceived skill of the former was lower than that of the latter, or both. In other words, the better team had a bad round, the worse team had a good round, or the judges inaccurately perceived the team performances. Our model captures all of these possibilities and their potential conjunction.
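The three layers of the model (baseline, demonstrated and perceived skill) can be sketched for a single room as below. This is an illustration in Python rather than our Mathematica code. The function and constant names are hypothetical, and the noise values are the standard deviations reported elsewhere in this section.

```python
import random

BASELINE_MEAN = 150.0   # mean of the baseline skill distribution
BASELINE_SD = 6.52      # SD of the baseline skill distribution
SKILL_NOISE_SD = 3.02   # round-to-round variation in demonstrated skill
JUDGE_NOISE_SD = 3.67   # judging noise in combined speaks


def simulate_room(baselines, rng, skill_sd=SKILL_NOISE_SD, judge_sd=JUDGE_NOISE_SD):
    """Simulate one debate and return (placement order, demonstrated, perceived).

    The placement order lists team indices from 1st place to 4th place.
    """
    # Each team has a good or bad round around its baseline skill.
    demonstrated = [b + rng.gauss(0.0, skill_sd) for b in baselines]
    # The panel perceives the demonstrated skill imperfectly.
    perceived = [d + rng.gauss(0.0, judge_sd) for d in demonstrated]
    # Placement depends only on what the panel perceived.
    order = sorted(range(len(baselines)), key=lambda i: perceived[i], reverse=True)
    return order, demonstrated, perceived


rng = random.Random(0)  # fixed seed so the sketch is repeatable
baselines = [rng.gauss(BASELINE_MEAN, BASELINE_SD) for _ in range(4)]
order, demonstrated, perceived = simulate_room(baselines, rng)
print("placement (team indices, 1st to 4th):", order)
```

Setting both noise terms to zero makes placement follow baseline skill exactly. Any upset in the simulated results therefore comes from one of the two noise sources, or from both.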

In determining how much a team’s perceived skill level varied from their actual demonstrated skill level, we looked at data from the HWS Round Robin during the years in which there were two full panels judging each debate. Over the course of 4 years, pairs of entirely independent panels evaluated 320 team performances. We looked at how frequently the combined speaker points assigned by the two panels differed and by how much they differed. This data is given in Figure 3. From this data, we concluded that the standard deviation of judge panel perception in combined speaks (i.e., the “judging noise”) was approximately 3.67.^{[6]} The 3-judge panels at the HWS Round Robin are relatively strong panels. They are likely to have less judging noise than most preliminary round panels at typical tournaments, but generally have more noise (i.e., are not as strong) than panels judging live rooms in Rounds 7, 8 & 9 at the WUDC.

We were able to calculate the average standard deviation of a team’s combined speaker points over nine rounds at Worlds, using data from the past 10 years. This combined speaker point variation (standard deviation of 4.75) must be totally accounted for by the conjoined impact of round-to-round variations in demonstrated debating skill and the variation in judging panels’ perceptions of debating skill. So, we could then calculate the standard deviation of the round-to-round variation in demonstrated skill as approximately 3.02.^{[7]}
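Written out, and assuming the two sources of variation are independent so that their variances add, the arithmetic is as follows. The σ symbols are labels introduced here only for illustration.

```latex
\begin{align*}
\sigma_{\text{total}}^2 &= \sigma_{\text{skill}}^2 + \sigma_{\text{judge}}^2 \\
\sigma_{\text{skill}} &= \sqrt{\sigma_{\text{total}}^2 - \sigma_{\text{judge}}^2}
  = \sqrt{4.75^2 - 3.67^2}
  = \sqrt{22.5625 - 13.4689}
  = \sqrt{9.0936} \approx 3.02
\end{align*}
```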

At Worlds, and any typical tournament, the best judges tend to get placed into the live rooms (i.e., rooms that contain teams who still have a chance to break). This practice of “packing” panels is particularly prevalent in the last 2 or 3 preliminary rounds. So, we designed our simulator to allow for more judging noise in the earlier rounds of the tournament and for less judging noise in the later rounds. Obviously, in the rooms that are not “live”, the judging noise will increase as the tournament progresses. Because our ultimate aim is to determine what scoring system will create the most accurate break, once a simulated team is not in a live room (which implies they are no longer able to break), that team’s final outcome at the tournament is much less relevant to our metrics of success, which are outlined below. So, by design, our model ignores that non-live rooms in actual tournaments will have an increase in judging noise, because this fact is not relevant to the outcomes we are concerned with in this project.^{[8]} To be clear, we actually tested countless levels of judge noise and degrees of variation in demonstrated skill. In none of these cases did we find any changes in outcomes that differed dramatically from the results presented here. Less judge perception noise consistently made all of the different systems’ sorting outcomes more accurate, but regardless of the noise level, the status quo was consistently much less accurate than the best tapered scoring system.

So, our model created scores for teams’ baseline skill, demonstrated skill and perceived skill. All of these include many decimal places, but since speaker points need to be expressed in whole numbers, we translated the perceived skill scores of the four teams into combined speaker points by assigning the set of four combined speaker points (expressed as integers) that minimized the total difference between the four perceived scores and the four combined speaker points.^{[9]} We tested scoring systems (i.e., methods of assigning team points for placement in a preliminary round) using simulated tournaments of many sizes, but our discussion here will use a tournament with 360 teams and 9 rounds as our paradigm example. We discuss below our tests confirming that our conclusions apply to a wider range of assumptions.
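The conversion from perceived skill scores to whole-number combined speaker points can be sketched as a small brute-force search. This is one possible way to find the minimizing set, not necessarily how our Mathematica code does it. The function name and the `window` parameter are hypothetical. The two inputs are the examples from footnote 9.

```python
from itertools import product
from math import floor


def to_combined_speaks(perceived, window=3):
    """Map four perceived skill scores to unique integer combined speaks.

    Brute force: try integers in a small band around each score and keep the
    strictly decreasing set with the smallest total absolute difference.
    Results are returned in the same team order as the input.
    """
    # Work from the highest score to the lowest, remembering original positions.
    order = sorted(range(len(perceived)), key=lambda i: perceived[i], reverse=True)
    scores = [perceived[i] for i in order]
    # Candidate integers run from floor(s) - window up to floor(s) + window + 1.
    candidates = [range(floor(s) - window, floor(s) + window + 2) for s in scores]

    best, best_cost = None, float("inf")
    for combo in product(*candidates):
        # Require a strict ordering: no ties and no low-point wins.
        if any(a <= b for a, b in zip(combo, combo[1:])):
            continue
        cost = sum(abs(c - s) for c, s in zip(combo, scores))
        if cost < best_cost:
            best, best_cost = combo, cost

    # Put the integers back in the original team order.
    result = [0] * len(perceived)
    for rank, i in enumerate(order):
        result[i] = best[rank]
    return result


print(to_combined_speaks([146.18, 145.87, 142.45, 142.33]))  # [147, 146, 143, 142]
print(to_combined_speaks([152.86, 152.85, 152.75, 152.73]))  # [154, 153, 152, 151]
```

With four teams and a window of 3 the search covers only 4,096 combinations per room. That is cheap enough for a sketch, though a full simulator would likely use something faster.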

^{[3]} “Cardinal payoffs are numbers representing the outcomes of a game where the numbers represent some continuum of values, such as money, quantity, or market share. Cardinal payoffs allow the theorist to vary the degree or intensity of payoffs, unlike ordinal payoffs, in which only the order of values is important.” (Shor, 2019)

^{[4]} We did not simulate individual speaker points in our model. For our purposes here, the only relevant speaks are the combined speaker points for a team, because how these points are divided between the two speakers is totally irrelevant to which teams break, and that is our sole concern here.

^{[5]} Even after excluding teams who missed some rounds, we found that the distribution of points at the lower end of the distribution was more irregular and protracted. Since the performance of teams at the bottom of the skill range is very unlikely to affect who breaks, we decided that it was more important to capture the skill distribution of the top half of teams at Worlds. So, we calculated the overall standard deviation based on the top half of the scores, projecting a bottom half of the point distribution that mirrored the top half. This yields a more realistic simulation of how skill levels of teams realistically in contention to break are distributed.

^{[6]} To extrapolate this, we assumed that judge panels were no more likely to err on the side of assigning too many points than they were to err on the side of too few points.

^{[7]} Assume that we know that the total standard deviation in some data set is X. Also, assume we know that this variation is exclusively the result of two distinct and independent factors (factor Y and factor Z). An accepted rule of statistics is: *X = √(Y² + Z²)*. So, since we know X (average total SD over 9 rounds) and we know Y (SD of judge perception variation), we can calculate Z (SD of demonstrated skill variation). Of course, this assumes that the average quality of judge panels at Worlds is roughly the same as the average panel quality at the HWS Round Robin, and for the purposes of estimating how much team skill typically varies from round to round, this seems reasonable. When actually running simulations, our simulator can adjust judge perception noise round by round to capture the impact of variations in judge panel accuracy that come from packing good judges into certain rooms.

^{[8]} We are well aware that some people do not favor the practice of packing better judges in live rooms as a tournament progresses. We are not taking any stand on that dispute here. Our model simply reflects the commonplace practice of major BP tournaments, including Worlds. We have also run simulations where there was no judge packing and so the judge noise remained constant throughout the tournament. The results of these simulations were not dramatically different, but packing judges did improve break accuracy in all the systems we looked at. The effects of judge packing are one of the things we discuss below.

^{[9]} For example, if teams had perceived skill scores of {146.18, 145.87, 142.45, 142.33}, then that would get translated into {147, 146, 143, 142} because this total difference (.82 + .13 + .55 + .33 = 1.83) is the smallest possible movement to get to all unique whole numbers. Or, in a very close debate, the perceived skills might be {152.86, 152.85, 152.75, 152.73}, in which case {154, 153, 152, 151} is the assignment that minimizes total difference (1.14 + .15 + .75 + 1.73 = 3.77). Of course, in most cases, this approach just means rounding off each decimal to the nearest whole number.