Fairness, Scoring Systems and the World Universities Debating Championship

by R. Eric Barnes, Paul Kehle, Chuan-Zheng Lee & Hugh N. McKenny

Tapered scoring creates more accurate breaks than traditional scoring in British Parliamentary (BP) debate tournaments. This paper further examines how BP scoring systems work and addresses some concerns that have been raised about tapered scoring. We begin by deploying some intuitive metrics that have been used in the past to evaluate break accuracy. A significant portion of the paper is devoted to evaluating the level of influence that different rounds have. We consider whether there is any good justification for different rounds having different levels of influence and also whether tapered scoring would unjustly impact certain teams. The paper explores various other topics relevant to understanding scoring systems, such as how call accuracy changes by round, the effect of pulling up teams, and the ability of teams to recover. We end by discussing two broader themes: how to rationally compare competing scoring systems and how to assess the fundamental model that we have used to justify many of our conclusions. This paper assumes familiarity with our previous paper “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”.

Section 3: Round Influence

If one round offers more points than another, then ceteris paribus that round will be more influential.  But could an early round offering more points be less influential than a later round offering fewer points?  In the SQ point system, round influence increases geometrically from Round 1 to Round 9.  What happens in a tapered point system like ET?  And, importantly, what do we want the distribution of round influence to look like?  Let’s start with the first two questions about how influential rounds actually are in SQ and ET.  After we look at that, we’ll discuss at length what we should want the distribution to look like.

3.1 | What Does Actual Round Influence Look Like?

Interestingly, the foundation of the most common objection to using the Early Taper system has been that it unfairly makes some rounds worth much more than others.  For understandable reasons, people have perceived that it would make Round 1 too influential on a team’s final ranking, which would be unfair.  The compelling intuition here is that making any round much more influential than others is unfair.  This makes sense because a team might be unfairly disadvantaged in any round by things like an unbalanced motion, a biased chair, a tougher team position, etc.  (Leizrowice, 2020)  It clearly exacerbates the unfairness if these random misfortunes happen in very influential rounds for some teams and much less influential rounds for other teams.  So, a rational response is to prefer that all rounds are equally influential, or roughly so.  But, as it turns out, ET is the system in which rounds are of roughly equal influence, whereas SQ has a large disparity between how influential different rounds are.

Most people in the debate community understand that as things stand now, not all rounds matter equally.  A team that “tanks” (takes 4th place) in Round 1 is rewarded by easier competition in their future rounds, while tanking in Round 9 yields no reward at all.  Similarly, “soaring” (taking 1st place) in Round 1 comes with the burden of harder competition in their future rounds, while soaring in Round 9 comes with no burden at all.  Emma Pierson confirmed these intuitions with a statistical analysis of past tabs at Worlds, concluding that “while winning round 1 affects your performance for a couple rounds, by the end of the tournament (round 9) it doesn’t make much difference.”  (Pierson, 2016)  Importantly, this shows that the number of points available in a given round is not the only factor in how much influence that round will have on teams’ final ranking.  How early or late a round occurs is also a very important factor.  In this section, we use a new approach to show that the outcomes of different rounds have significantly different degrees of influence on how a team will rank at the end of the tournament.  To demonstrate this, we need to answer the question: how can one quantify round influence?  We propose the following method, based on the tournament simulator described in our earlier paper, “Break Accuracy”.  (Barnes, Kehle, McKenny, & Lee, 2020)

The goal here is to estimate for each round how much of a difference doing well or poorly in that round makes in a team’s final ranking on the tab.  So, here is how we calculated each round’s “Round Influence Factor” (RIF).

  1. Create randomized parameters for all teams’ skill levels and all judge panels’ accuracy in every round, using our same basic model.  (This is called a “performance table”.) 
  2. Run a simulation using this performance table, which will produce a team tab.
  3. Pick one team (e.g., starting at the top of the tab)
  4. Run a simulation using this performance table, except that this team “tanks” in round 1
  5. Run a simulation using this performance table, except that this team “soars” in round 1
  6. The value “Rank Change” equals the difference in final ranks between these two.[2]
  7. Do this (steps 4 – 6) for that same team in each round (18 simulations run for each team).
  8. Go back to step 3 and move on to the next team.  (There is no need for multiple trials on any one team, since the same performance table will always produce the same results.)
  9. After doing all the teams in the top half of the tab, for each ROUND, average the Rank Changes of all top half teams in that round.  This average is the RIF for that round using that performance table.
  10. Go back to step 1 and repeat for a significant number of different performance tables.  Average each round’s RIF across all these performance tables; the result is the overall RIF for that round.
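
As a rough illustration only, the averaging in steps 3 through 10 could be sketched in Python as follows.  This is not the simulator from “Break Accuracy”.  The names round_influence_factors, top_half_teams and rank_change are hypothetical, and the sketch assumes that some function is already available which runs the “tank” and “soar” simulations for one team in one round and returns the resulting Rank Change.

```python
from statistics import mean
from typing import Any, Callable, Sequence


def round_influence_factors(
    tables: Sequence[Any],
    top_half_teams: Callable[[Any], Sequence[Any]],
    rank_change: Callable[[Any, Any, int], float],
    num_rounds: int = 9,
) -> list:
    """Return the overall RIF for each round (index 0 is Round 1).

    Assumptions (none of these names come from the original simulator):
      tables          performance tables, one per trial (step 1)
      top_half_teams  returns the top-half teams of a table's baseline tab
      rank_change     runs the tank and soar simulations for one team in
                      one round and returns the Rank Change (steps 4-6)
    """
    per_table = []
    for table in tables:                      # step 10: many performance tables
        teams = top_half_teams(table)         # steps 2-3 and 8: each top-half team
        per_table.append([
            # step 9: average Rank Change over the top-half teams
            mean(rank_change(table, team, rnd) for team in teams)
            for rnd in range(1, num_rounds + 1)   # step 7: every round
        ])
    # Average each round's RIF across all performance tables.
    return [mean(column) for column in zip(*per_table)]
```

Passing the simulator in as a function keeps the averaging logic separate from whatever tournament model is used, so the same loop could be run once with an SQ scorer and once with an ET scorer on the same performance tables.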

After getting the RIF for each round in the SQ scoring system, we used the exact same set of performance tables to calculate the RIF for each round of the ET scoring system.  Finally, we translated these into a normalized scale based on the idea that (trivially) a team’s final ranking is influenced by nothing other than their performance against the other teams in their 9 rounds.  So, since 100% of the influence comes from these 9 rounds, we used the RIFs of each scoring system to distribute this 100% across all 9 rounds.  These new “relative RIFs” facilitate an apples-to-apples comparison between round influence in different systems and tournaments of different sizes.  Charts showing the raw RIFs and relative RIFs for each system are given here. 
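
One plausible reading of this normalization, offered as an assumption rather than as the exact calculation behind the charts, is that each round’s relative RIF is its raw RIF divided by the sum of the raw RIFs for all rounds.  Under that reading, perfectly equal influence across 9 rounds corresponds to a share of one-ninth (about 11.1%) per round.  The function names below are hypothetical.

```python
def relative_rifs(raw_rifs):
    """Scale raw RIFs so that they sum to 1.0 (100% of the influence).

    Assumes plain proportional scaling; the source does not spell out
    the exact formula.
    """
    total = sum(raw_rifs)
    if total == 0:
        raise ValueError("raw RIFs sum to zero; relative shares are undefined")
    return [rif / total for rif in raw_rifs]


def equal_share(num_rounds=9):
    """Share each round would hold if influence were spread evenly."""
    return 1 / num_rounds
```

Because the shares always sum to 100%, they can be compared across scoring systems and field sizes even when the raw rank changes differ greatly in magnitude.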

Figure 3
Figure 4

The charts show that SQ goes from a Round 1 raw RIF of 3.1, indicating very little influence, steadily up to a Round 9 raw RIF of 90.5, indicating a very high influence.  In other words, on average, by taking the 1st in Round 1, a given team can expect to place about 3 ranks higher on the final tab, as opposed to where they would end by simply opting out of Round 1 or taking the 4th.  But, by taking the 1st instead of the 4th in Round 9, they will (on average) place about 90 ranks higher on the final tab.[3]  Round influence in the ET system is more nuanced.  The RIF in ET Round 1 appears high in comparison to SQ, but it is only slightly more influential than if every round were equally influential.  This equal influence level is represented by the black dotted line in Figure 4.  The RIF in ET Round 5 is relatively low and is the furthest deviation from equal influence, though it is less of a deviation than the RIF in every round in SQ other than Round 6.  In short, ET has much less extreme fluctuations in round influence.  Indeed, in SQ, Round 9 is about 30 times more influential than Round 1.  In comparison, in ET, Round 9 (still the most influential round) is not even 2 times as influential as Round 5 (the least influential round).  Essentially, in ET, the 9 rounds have roughly equal influence, while in SQ, early rounds are almost irrelevant, while later rounds are wildly influential.  Surprisingly, the primary complaint against our conclusion in Break Accuracy (that it was bad to have a single round be unduly influential) is a sin that SQ is most guilty of.

We are confident in our data here, but even if our estimates were off somewhat, the result would still be that Round 9 (and other late rounds) have absurdly more influence than earlier rounds.  As noted above, in 2016 Emma Pierson, using very different methods that analyzed past WUDC tab data, also concluded that Round 1 SQ results were virtually irrelevant at Worlds.  (Pierson, 2016)  Moreover, our conclusions about SQ Round 9’s raw RIF are easily confirmed without computer simulation by using any WUDC tab sheet that has about 360 teams.  Given those independently confirmed end points, the middle of that chart should not be at all surprising.  We have tested round influence on a variety of field sizes, and the results are consistent.  The specific numbers on absolute influence change, but that is to be expected.  The curves for SQ and ET have the same shape and proportions.

The numbers here are averages of how teams in the top half of the tab are affected.[4]  Teams near the top will generally move down fewer places on the tab when they tank a round, while teams near the middle of the tab will move more places because there are more teams in their bracket.  In SQ, a team in the middle of the tab could drop 120 places on the final tab based just on the outcome of Round 9, without even considering the impact of lower speaker points from that round.  The comparable maximum in ET is dropping about 50 places.  So, the difference between taking a 1st and a 4th in Round 9 will still have a major impact on your standing, but it can’t drop you fully one-third of the way down the tab because of one bad performance (or one bad decision).  Of course, for those not in the dead center of the tab, the impact will be smaller in both systems, but SQ Round 9 will generally have a bit more than double the influence of ET Round 9.  The difference is a bit less than double for Round 8 and diminishes until their impact is almost the same in Round 5.  In rounds 1 through 4, the impact of ET exceeds the impact of SQ, but that’s not a bug, it’s a feature.  In SQ, those rounds have less than half the impact that they would if round influence were equally distributed.  In SQ Round 1, it is only one-tenth as much as an equitable distribution of influence.

In contrast to SQ, although there is variation in the influence different rounds have in ET, these influence levels are much more consistent.  Round 9 is still the most influential, for roughly the same reason, but Round 9 doesn’t have a hugely outsized influence like it does in SQ.  What this really shows is that when the ET system offers lots of points in the early rounds, it is actually just boosting those rounds up from the pits of irrelevance, while at the same time tempering the influence of the typical number of points offered on the last day, and especially in Round 9.  Teams still can earn up to 3 points in those rooms, but the influence of those points is less disruptive to the ordering on the tab because there are fewer teams in each point bracket.  (If you have read Section 2 of Break Accuracy, it’s like you are giving the can of nuts a relatively softer final shake because the nuts are already pretty well sorted out.)  So, a team that screws up in Round 1 will be hurt under the ET scoring system more than under the SQ system.  But, under the ET system, a poor performance in ANY round counts against a team to roughly the same degree, whereas in SQ, a poor performance in later rounds has a devastating impact on a team’s final ranking.

We can imagine some people arguing that the round influence levels in SQ are better because the round influence is highest when the live rooms have the best judges, and also claiming that in ET, the round influence is too high in rounds where the average live round panel is weak.  But recall that the only reason we care about having panels of good judges is because we think they are more likely to get the call right.  What if the panels with weaker judges were actually more likely to get the call right?  This may sound ridiculous (or even self-contradictory) at first, but if this were somehow true, then it would be better to have the rounds with weaker panels be more influential!  We will discuss this briefly in Section 4.

[2] How the rank change is calculated is described here.  Because speaker points do not influence what room a team is placed in during preliminary rounds, the gain or loss of a speaker point in any round has the exact same influence on one’s final tab ranking.  So, we wanted a method of calculating rank change that did not employ speaker points.

  1. If total team points after 9 rounds are identical for tanking and soaring in a round, then Rank Change = 0
  2. If the soaring team has 1 final team point more on the tab than the tanking team, then:  Rank Change = (bracket size of tanked team’s final points ÷ 2) + (bracket size of soared team’s final points ÷ 2).
  3. If the soaring team has 2 or more final points more than the tanking team, then:  Rank Change = (“tanked bracket size” ÷ 2) + (“soared bracket size” ÷ 2) + (total size of intervening brackets)
  4. If the soaring team has FEWER final points than the tanking team, then Rank Change will be a negative number, calculated in an analogous way.
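
The four rules above can be sketched as a single function.  This is a minimal illustration under stated assumptions: the name rank_change and the bracket_sizes mapping (final team points to number of teams on that total) are hypothetical, and the sketch assumes one set of bracket sizes applies to both the tanked and the soared outcome.

```python
def rank_change(bracket_sizes, tanked_points, soared_points):
    """Rank Change following rules 1-4 of the footnote.

    bracket_sizes maps a final team-point total to the number of teams
    finishing on that total (an assumed input shape).
    """
    if soared_points == tanked_points:            # rule 1: no difference
        return 0.0
    low, high = sorted((tanked_points, soared_points))
    # Rule 3: whole brackets strictly between the two totals.
    # This sum is zero when the totals differ by one point (rule 2).
    intervening = sum(
        size for points, size in bracket_sizes.items() if low < points < high
    )
    magnitude = bracket_sizes[low] / 2 + bracket_sizes[high] / 2 + intervening
    # Rule 4: negative when soaring ends on fewer points than tanking.
    return magnitude if soared_points > tanked_points else -magnitude


# Made-up bracket sizes, for illustration only.
brackets = {10: 20, 11: 30, 12: 40, 13: 10}
print(rank_change(brackets, 10, 11))   # 25.0  (20/2 + 30/2)
print(rank_change(brackets, 10, 13))   # 85.0  (20/2 + 10/2 + 30 + 40)
```

Counting half of each end bracket treats the team as sitting in the middle of its bracket, which is how the rules avoid relying on speaker points to order teams within a bracket.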

[3]  Obviously, we are NOT claiming that, statistically, teams who actually place 1st in Round 1 typically end just 3 places higher than teams who actually place 4th.  That is not an apples-to-apples comparison because the teams who do well in Round 1 tend to be better debaters than the teams who do poorly in Round 1, and this impacts their later performance.  To understand the influence that a ROUND (as such) has, we need to hold the baseline skill of the team constant and only vary the place they took in the one particular round under analysis, leaving their demonstrated skill constant in all other rounds.  

[4]  The bottom half of the tab behaves symmetrically, as our tests have confirmed.