Fairness, Scoring Systems and the World Universities Debating Championship

by R. Eric Barnes, Paul Kehle, Chuan-Zheng Lee & Hugh N. McKenny

Tapered scoring creates more accurate breaks than traditional scoring in British Parliamentary (BP) debate tournaments. This paper further examines how BP scoring systems work and addresses some concerns that have been raised about tapered scoring. We begin by deploying some intuitive metrics that have been used in the past to evaluate break accuracy. A significant portion of the paper is devoted to evaluating the level of influence that different rounds have. We consider whether there is any good justification for different rounds having different levels of influence, and also whether tapered scoring would unjustly impact certain teams. The paper explores various other topics relevant to understanding scoring systems, such as how call accuracy changes by round, the effect of pulling up teams, and the ability of teams to recover. We end by discussing two broader themes: how to rationally compare competing scoring systems, and how to assess the fundamental model that we have used to justify many of our conclusions. This paper assumes a familiarity with our previous paper “Break Accuracy: Optimizing the Sorting Power of Preliminary Rounds Using Tapered Points”.

Section 4: Call Error Count

The likely accuracy of a judging panel’s call is not merely a function of how good the judges are (i.e., the judge perception noise for that panel).  The quality of the judges determines the precision of the judging panel, which is only one important factor in its overall accuracy.[22]  The other major factor is whom they are judging.  If the actual performances of all the teams in a room are very close in demonstrated skill level, the panel will obviously be much more likely to make an error in how they are ranked than it would if the teams displayed wildly different skill levels.  As we all know, rooms in Round 1 (e.g., at Worlds) typically have teams with widely different skill levels, while rooms in Round 9 have teams with quite similar skill levels.  So, it is possible that a typical panel in Round 1 or 2 is more likely to get the call right in its room than is a much stronger panel in Round 8 or 9.  We set out to discover what our model could tell us about this phenomenon.

To measure each round’s call error count (CEC), we began by defining the degree of error in any particular call.  The simplest metric for this is the pairwise error count, in which any ordered call, e.g., (1st CG > 2nd OO > 3rd OG > 4th CO), is translated into the mathematically equivalent set of six ordered pairs, in this case:  
(OO > OG), (CG > OG), (OG > CO), (CG > OO), (OO > CO), (CG > CO).

Let’s suppose that the correct ranking is (OG > OO > CG > CO), or expressed as ordered pairs:  
(OG > OO), (OG > CG), (OG > CO), (OO > CG), (OO > CO), (CG > CO).  

That would mean that the initial example call given above (CG > OO > OG > CO) would have an error count of 3, since three of its ordered pairs, (OO > OG), (CG > OG), and (CG > OO), contradict the correct ranking.  
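As an illustration, the pairwise error count can be computed in a few lines of code.  This is a minimal sketch; the function name and the list representation of a call are our own conveniences, not taken from any tabulation software:

```python
from itertools import combinations

def pairwise_error_count(call, truth):
    """Count the ordered pairs in `call` that contradict `truth`.

    Both arguments list the four teams from 1st place to 4th place.
    """
    true_pos = {team: i for i, team in enumerate(truth)}
    # A pair (a, b) with a ranked above b in the call is an error
    # whenever the true ranking places b above a.
    return sum(1 for a, b in combinations(call, 2)
               if true_pos[a] > true_pos[b])

# The example from the text: the call (CG > OO > OG > CO) measured
# against the correct ranking (OG > OO > CG > CO) yields 3 errors.
print(pairwise_error_count(["CG", "OO", "OG", "CO"],
                           ["OG", "OO", "CG", "CO"]))  # 3
```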

Using the PEC metric to evaluate the call accuracy for a given round (e.g., at Worlds) is quite simple in one sense, but also very uncertain in another.  The easy part, the calculation itself, is described here.  

  1. Run a tournament simulation.
  2. Calculate the average PEC across all the rooms in Round 1.
  3. Do the same for each subsequent round.
  4. Go back to step 1 and repeat for a significant number of different simulations (i.e., different performance tables).  Averaging each round’s average PEC across all of these performance tables yields the CEC for that round.
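The loop above can be sketched in miniature.  The model below is a deliberately crude stand-in for our actual simulation: skills and judge perception noise are drawn from normal distributions with made-up parameters, and brackets, pull-ups and judge packing are ignored entirely.  It shows only the shape of the calculation, averaging per-room PEC within a simulation and then across simulations:

```python
import random
from itertools import combinations

POSITIONS = ("OG", "OO", "CG", "CO")

def pec(call, truth):
    pos = {t: i for i, t in enumerate(truth)}
    return sum(1 for a, b in combinations(call, 2) if pos[a] > pos[b])

def cec_for_round(n_sims=200, n_rooms=50, skill_sd=5.0, noise_sd=2.0, seed=0):
    """Average per-room PEC within each simulation, then across simulations."""
    rng = random.Random(seed)
    sim_averages = []
    for _ in range(n_sims):              # step 1: one simulated "performance table"
        room_pecs = []
        for _ in range(n_rooms):
            # Hypothetical room: true demonstrated skills...
            skill = {t: rng.gauss(0.0, skill_sd) for t in POSITIONS}
            # ...seen by the panel through perception noise.
            seen = {t: skill[t] + rng.gauss(0.0, noise_sd) for t in POSITIONS}
            truth = sorted(POSITIONS, key=lambda t: skill[t], reverse=True)
            call = sorted(POSITIONS, key=lambda t: seen[t], reverse=True)
            room_pecs.append(pec(call, truth))
        sim_averages.append(sum(room_pecs) / n_rooms)   # steps 2 and 3
    return sum(sim_averages) / n_sims                   # step 4: the round's CEC
```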

The CEC provides a simple number indicating how accurate a panel decision is likely to be in any given round of the WUDC.  The CEC will be a number between zero and six, and the closer the CEC is to zero, the more accurate calls are likely to be in that round.  If all calls were completely accurate, the CEC would be 0.  If all calls were completely random, the CEC would approach 3.  So, the relevant range is really 0 to 3.
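The claim that completely random calls approach a CEC of 3 is easy to verify by brute force: averaging the pairwise error count over all 24 equally likely calls gives exactly 3.

```python
from itertools import combinations, permutations

TEAMS = ("OG", "OO", "CG", "CO")
true_pos = {t: i for i, t in enumerate(TEAMS)}  # take (OG > OO > CG > CO) as correct

def pec(call):
    return sum(1 for a, b in combinations(call, 2)
               if true_pos[a] > true_pos[b])

# Mean pairwise error count over all 24 equally likely calls.
average = sum(pec(call) for call in permutations(TEAMS)) / 24
print(average)  # 3.0
```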

We do need to be clear about what we are measuring here, and we need to be clear about what we simply don’t know.  In Break Accuracy, we were concerned primarily with rooms that were live for at least some category, since only live rooms could impact the accuracy of the break.  Someone might be concerned with how call accuracy (CEC) differs from round to round just for rooms that are live for the open break (“open-live rooms”), just for rooms that are live for any break (“all live rooms”), or for the entire field (“all rooms”).  It is easy to determine which rooms are live for the open break, but it is hard to know which rooms are live for the other breaks (e.g., ESL and EFL).  It’s also hard to know how the adjudication core is going to pack judges during the tournament.  Will they start to pack judges into higher rooms before any rooms are technically dead?  Will they give equal priority to rooms that are live just for the EFL break as to rooms that are live for the open break?  And so on.  

On top of this uncertainty, we also don’t know how much more precise a Round 9 open-live panel is compared to an average Round 1 panel, or compared to a Round 9 panel in a dead room.  As we discussed in section 6.4 of Break Accuracy, there is no reliable data on exactly how much more precise judging panels get with judge packing (i.e., how much less noisy they become).  Our data concerning overall speaker score variation does give us a range of possibilities, bounding the maximum increase in judge precision between a Round 1 panel and a Round 9 open-live panel.  But the actual change in precision is surely nowhere near this maximum.  Additionally, not all tournaments have judging pools of equal quality.  This variation surely holds across different years of the WUDC as well, and it will impact call accuracy.

Our goal in this section is to provide some quick insight into call accuracy despite all of this uncertainty.  Our intention is to treat this topic more thoroughly in a future paper, since call accuracy requires more extensive analysis than there is space for here.  In this paper, we present data based on our estimate of the most plausible way in which judge packing impacts panel accuracy in live rooms.  A second complication is related to this.  We are certainly interested in the CEC for open-live rooms, but we are also interested in the CEC for other rooms.  In this first discussion of the topic, we limit ourselves to the CEC for open-live rooms, and we plan to look at call accuracy in other rooms in a future publication.

The two charts here represent just the tip of the iceberg concerning this complex issue.  As just mentioned, both focus only on open-live rooms.  Figures 8.1 & 8.2 represent the exact same data, but Figure 8.2 is magnified to show detail, while Figure 8.1 shows the whole relevant range of call accuracy (0-3) to provide perspective.  Both reflect our good-faith estimate of how judge packing would actually affect judge perception noise and thereby call accuracy.  Clearly, Worlds is different every year due to changes in field size, the number of judges available and the quality of the judging pool, all of which will impact call accuracy.

Figure 8.1
Figure 8.2

As usual, being lower on the chart indicates better performance, since we want fewer errors.  Note that although there are variations in call accuracy that are certainly consequential, the range is smaller than one might expect.  Given the packing assumptions we made, the CEC fell between 1.0 (best) and 1.35 (worst).  At best, the average panel in a round makes one error (i.e., switching two adjacent teams from the correct call).  In the worst rounds, on average, out of every three rooms, two panels will make one error and one panel’s call will contain two errors.  Obviously, these are averages; there will be many calls with no errors, balanced by some calls with a disheartening number of errors.  But, given that there are 24 possible calls in each debate (with only one being correct and just three being off by one), it is encouraging that the CEC of some rounds is not higher than it is.  Note also that SQ and ET will always have the same value for Rounds 1 and 2, because Round 1 is a random draw and Round 2 always consists of four brackets based just on the result of Round 1, without any judge packing in either system.  
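The figures in this paragraph about the 24 possible calls can be checked by enumerating them and counting how far each one is from the correct call:

```python
from collections import Counter
from itertools import combinations, permutations

TEAMS = ("OG", "OO", "CG", "CO")
true_pos = {t: i for i, t in enumerate(TEAMS)}  # take (OG > OO > CG > CO) as correct

def pec(call):
    return sum(1 for a, b in combinations(call, 2)
               if true_pos[a] > true_pos[b])

# How many of the 24 possible calls sit at each error count?
distribution = Counter(pec(call) for call in permutations(TEAMS))
print(sorted(distribution.items()))
# [(0, 1), (1, 3), (2, 5), (3, 6), (4, 5), (5, 3), (6, 1)]
```

Exactly one call is correct and exactly three are off by a single adjacent swap, as stated above.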

In Figure 8.2, it is clear that ET is more accurate in the middle of the tournament if one starts packing judges when there are rooms in which all teams are mathematically excluded from the open break (i.e., “open-dead rooms”).  An adjudication team may choose to be less aggressive than this in packing judges because they are mindful of the importance of the ESL and EFL breaks.  Or, an adjudication team may choose to be more aggressive because they believe that some rooms, even though mathematically live, are not realistically live (e.g., a zero-point room in Round 4).  To avoid these judgment calls, we base our packing assumptions just on the percentage of rooms that are live for the open break, which can be objectively calculated.  

Perhaps what is most startling in Figure 8.2 is the improvement in the CEC of ET from Round 2 to Round 3.  This appears more dramatic than it is because of the magnification of the Y axis here, but it’s still worth discussing.  The improvement happens because Round 3 in ET has 19% open-dead rooms, allowing judge packing that makes panels more precise at a point in the tournament where brackets have not yet been well sorted, which allows for greater call accuracy.  In subsequent rounds, judging panels in open-live rooms continue to get better (more precise), but teams of similar skill get sorted into brackets even faster, so the CEC gets worse.  For SQ, judge packing doesn’t start until Round 5 (if we assume that it starts when there are mathematically open-dead rooms), and even then, only 2% of rooms are open-dead, so packing on our model will be very minimal.  Since teams keep getting better sorted by skill into brackets, the CEC keeps getting worse until Round 5.  After that, open-dead rooms in SQ increase more quickly and judge packing can boost panel precision faster than teams are sorted, so the CEC improves.  Figure 9 shows that there are always more open-dead rooms in ET than in SQ, so given standard judge allocation patterns, panels in ET open-live rooms should always be more precise than those in SQ open-live rooms.  Nevertheless, by Round 9, calls in SQ are a bit more accurate than calls in ET.  This is because the teams in the much smaller brackets created by ET (about 50% smaller) are sorted very well by skill level (making accurate calls harder), so the CEC gets worse for ET than for SQ, even though the live panels in ET are better than those in SQ.

Figure 9

Another potentially surprising insight we gained is how accurate the calls are in the first round, despite there being no judge packing at all.  We mentioned this likely possibility in Break Accuracy.  Calls in early rounds (especially Round 1) are likely to be accurate because the teams in the same room are so much less likely to be of similar skill levels; even a fairly weak panel should be able to get the call right in Round 1.  We anticipated this result, but what surprised us somewhat was that, under some assumptions, panels in later rounds became as accurate as panels in Round 1. 
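A toy simulation makes the point concrete.  Holding judge perception noise fixed, a room with widely spread skills (Round-1-style) produces far fewer ranking errors than a room with tightly bunched skills (Round-9-style).  The skill numbers and noise level below are illustrative assumptions only, not values from our model:

```python
import random
from itertools import combinations

POSITIONS = ("OG", "OO", "CG", "CO")

def pec(call, truth):
    pos = {t: i for i, t in enumerate(truth)}
    return sum(1 for a, b in combinations(call, 2) if pos[a] > pos[b])

def average_errors(skills, noise_sd=3.0, trials=20000, seed=0):
    """Mean PEC for one room judged repeatedly with the same noise level."""
    rng = random.Random(seed)
    truth = sorted(POSITIONS, key=lambda t: skills[t], reverse=True)
    total = 0
    for _ in range(trials):
        # The panel sees each skill through Gaussian perception noise.
        seen = {t: skills[t] + rng.gauss(0.0, noise_sd) for t in POSITIONS}
        call = sorted(POSITIONS, key=lambda t: seen[t], reverse=True)
        total += pec(call, truth)
    return total / trials

# Same judge noise in both rooms; only the skill spread differs.
spread_room = average_errors({"OG": 40, "OO": 30, "CG": 20, "CO": 10})
tight_room = average_errors({"OG": 27, "OO": 26, "CG": 25, "CO": 24})
# The tightly bunched room yields many more ranking errors on average.
```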

As we said at the outset of this section, even though we are confident about the basic structure of our model, there is a great deal of uncertainty about the precise values that should be assigned to the many variables that are involved in modeling call accuracy.  Our hope is to develop some methods of narrowing down these assumptions so that we can have a higher confidence in the specific results of our model here.  But even without this additional precision, we can see from our results here that there is no compelling reason to worry about distortions coming from inaccurate calls in early weighted rounds when using tapered scoring.


[22]  The distinction between precision and accuracy is very important here.  Panel A may be capable of much more precise judgments than Panel B, but Panel A may end up giving less accurate team rankings than Panel B if the rooms judged by Panel B have much more obvious calls.