by R. Eric Barnes & Christopher Doak • United States
How reliable are adjudicator decisions in British Parliamentary (BP) debate? This is the fundamental question that we address in this paper. In other words, if the same debate were viewed by different panels of qualified debate judges, how likely is it that the judges would reach the same results? At the foundation of competitive BP debating is the belief that a panel of competent judges who are given time to deliberate provides us with a result that is reasonably reliable. Unfortunately, there has been little empirical study of whether this belief is warranted. This paper discusses a study performed at the Hobart and William Smith Round Robin in April of 2016, in which the entire tournament was run with two independent panels of highly qualified judges in each debate. Our focus in this paper is on three questions: 1) How reliable are judges’ assignments of team points? 2) How reliable are judges’ assignments of speaker points? and, 3) How reliable are judges’ initial calls about a debate, before deliberation begins? Additionally, in an appendix to the paper, we offer reflections on related topics, such as the volatility of tournament outcomes due to lack of reliability.
The British Parliamentary debate circuit is fortunate to have a group of intelligent and highly experienced judges. Analyzing the behavior and decisions of these “experienced adjudicators” might illuminate some unknown truths about debating. The purpose of this paper is to address three aspects of these judges’ behavior: their agreement with other “experienced adjudicators” on the final call in the same round, their movement from their initial call to their final call, and their agreement on speaker scores for similarly ranked teams. We examined this behavior by using two panels of these “experienced adjudicators” in each debate room. The conclusion of this paper is especially relevant for the faith placed in “adjudicator assessments,” which are weighted heavily when selecting adjudicators for the World Universities Debating Championship (WUDC). At the core of our research was the fundamental question of whether or not debate judging is arbitrary. If different highly qualified panels did not generally agree on an assessment of the same debate round, then the results of every debate would seemingly be based not on the persuasiveness of the debaters, but rather on the varying dispositions of their judges.
It is extremely important to keep in mind throughout this research the distinction between reliability and validity. A type of measuring device (e.g., judging panels) is reliable to the extent that distinct uses of that device will produce the same result. A type of measuring device is valid to the extent that the results produced by that device accurately reflect the thing that they claim to be measuring. This research study concerns reliability, whereas earlier research at this tournament in 2014 concerned validity. The present study looks at reliability of both team point assignments and speaker point assignments. In particular, this study is concerned with reliability between different panels, a form of inter-rater reliability. Essentially, we are primarily concerned with the degree to which different judging panels are likely to come to similar decisions.
 A common example used to explain this distinction is a bathroom scale. If bathroom scale #1 consistently shows the same weight each of the five consecutive times I get on it one morning, then that scale is reliable. However, if I actually weigh 165 and scale #1 consistently shows a weight of 175, then the scale is not valid, despite being reliable. If bathroom scale #2 gave me five measures of 172, 169, 166, 160 and 156, then this scale would be much more valid (the average of the measurements being 164.6), even though scale #2 is not particularly reliable.
 Barnes, R.E. and McPartlon, M. “Comparing Experienced Judges and Lay Judges”. Monash Debating Review. 2014. This study focused on determining the extent to which experienced debate adjudicators came to the same decisions as intelligent, well-educated people who lack any prior experience with competitive debate.
Brief Literature Review
Little work has been done on inter-rater reliability at the panel level, since no tournament before the HWS RR has used two simultaneous panels. As noted above, the 2014 Barnes & McPartlon study used two different kinds of panels and was not concerned with inter-rater reliability in the same way. Although other tournaments could have collected data on inter-rater reliability at the individual judge level, we were unable to find any studies even at this level. However, some research has examined judging bias, which may be one factor affecting judging reliability.
Research on judging bias has examined bias based on a variety of factors. Henson and Dorasil found that debate judges express significant regional bias and bias toward one side of the debate, but they did not find bias based on sex. This study concerned a form of high school debating in the United States called “Lincoln-Douglas,” and its data came from several years of just the annual Tournament of Champions. Harper analyzed two years of data from the national championship tournament of the National Parliamentary Debate Association, based in the United States, to examine gender bias, and found that “this study revealed little difference statistically and practically in regard to gender and competitive equity”. In contrast, several authors have documented the existence of systemic bias in debate judging. For example, Jarzabek provides extensive anecdotal evidence of discrimination, harassment and bias based on gender in competitive debating. The largest statistical study that we are aware of was done by Pierson, who did find a substantial (and highly statistically significant) gap in the speaker points given to men and women.
The degree of judging bias that is based on any of these factors is unlikely to be particularly consistent among various judging panels, so the existence of bias is likely to lead to lower inter-rater reliability. But, though plausible, this is largely speculative. Our study examines judging reliability more directly.
 Henson, C. and Dorasil, P. “An Empirical Analysis of Judging Bias by Sex, Region & Side”. Unpublished manuscript available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1768087 – Accessed 11 June 2019
 Harper, C.T. “The Relationship Between Gender and Competitive Equity at the 2007 and 2008 National Parliamentary Debate Association National Tournaments”. Dissertation available at https://search.proquest.com/docview/746605105 – Accessed 11 June 2019
 Jarzabek, M.G.J. “The Double Standard in CEDA: A Feminist Perspective on Gender Stereotyping in Intercollegiate Debate.” https://eric.ed.gov/?id=ED399585 – Accessed 11 June 2019
 Pierson, E. “Men Outspeak Women: Analysing the Gender Gap in Competitive Debate”. Monash Debating Review, 11, 2013
The Hobart and William Smith Round Robin (“HWS RR” hereafter) is an annual elite debating competition that invites 16 of the best debate teams and a number of highly regarded debate judges from around the world. In 2016, there were 28 judges involved in the tournament. Of these, 8 have been in a WUDC Finals as a debater or judge. Another 4 have been on the CA team of a major debating tournament (e.g., Yale IV). An additional 7 have broken as a judge at Worlds. The remaining 9 judges, save one, have all chaired rounds at Worlds or broken at major international BP tournaments. The one exception (who was a panelist in just 3 rounds) had a strong background in other forms of debate and was from a nation just starting to do BP. Of course, we recognize that these kinds of credentials are neither necessary nor sufficient to make a judge good, but the point is that this was an exceptionally well qualified judging pool. The pool of judges was 43% female and 21% from debating circuits outside of the US & Canada. Prior to the tournament, we distributed disclaimers to each of the judges asking for their permission to be included in this study.
Over the course of 5 rounds at the HWS RR, each team debates every other team exactly once. Judges are allocated such that no judge ever sees the same team more than twice and, whenever possible, we avoid two judges being on the same panel more than once.
Traditionally, a British Parliamentary debate has three judges on a panel for one room of debaters. For this study, we used two panels for each room, with three judges on each panel. After the debate, one panel was sent to a different room to deliberate independently, while the other panel remained in the debating room as usual. Prior to any deliberations, each individual judge filled out initial call forms, indicating how they would rank the teams if they had to decide before discussing the round with other judges. The two oral adjudications were delivered entirely independently, with members of the first panel barred from being in the room for the second panel’s oral adjudication. The preliminary debates were recorded, along with the oral adjudications.
All the quantitative data was entered into a spreadsheet and analyzed using the methods described below. This included:
• Panel A judge panel ballots (including speaker points)
• Panel B judge panel ballots (including speaker points)
• Judges’ individual initial calls
 The justification for this was to avoid possible (or perhaps inevitable) non-verbal cues from the first panel having any effect on the content or tone of the second panel’s oral adjudication.
We framed our research using the following questions:
- To what degree do two highly qualified panels tend to differ in their ranking of teams in the same round?
- To what degree do two highly qualified panels tend to differ in their assignment of speaker points to individual debaters?
- To what degree do individual highly qualified judges tend to differ in their ranking of teams prior to deliberations?
Central to our analysis is a comparison of complete rankings, both panel to panel and judge to judge. We used the same method of comparison as Barnes & McPartlon 2014, which compared experienced judges and lay judges. This method was described in that paper as follows: “The difference between two complete rankings can be measured on a scale from 0 (representing an identical ranking) to 6 (representing a maximally divergent ranking). A complete ranking can be translated into a set of 6 bilateral rankings, comparing each possible pairing of teams out of the four teams in the room. Each bilateral ranking was scored as a 0 if the two complete rankings agreed on which of those two teams should be ranked higher, and was scored as a 1 if they disagreed. These six scores were then summed to provide the final divergence between the two complete rankings on the 0-6 scale.” The highest degree of agreement is thus a score of 0, where all rankings are the same; the highest degree of disagreement is a score of 6. See the examples below.
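This comparison method is, in effect, the Kendall tau distance between the two rankings. It can be sketched in a few lines of Python; the team labels and panel rankings below are our own illustrations, not data from the study:

```python
from itertools import combinations

def divergence(ranking_a, ranking_b):
    """Count pairwise disagreements between two complete rankings.

    Each ranking maps a team to its rank (1 = first, 4 = fourth).
    The result runs from 0 (identical) to 6 (maximally divergent).
    """
    disagreements = 0
    for team_x, team_y in combinations(ranking_a, 2):  # the 6 team pairings
        a_says_x_higher = ranking_a[team_x] < ranking_a[team_y]
        b_says_x_higher = ranking_b[team_x] < ranking_b[team_y]
        if a_says_x_higher != b_says_x_higher:
            disagreements += 1
    return disagreements

# Illustrative calls: panel B swaps the top two teams relative to panel A.
panel_a = {"OG": 1, "OO": 2, "CG": 3, "CO": 4}
panel_b = {"OG": 2, "OO": 1, "CG": 3, "CO": 4}
print(divergence(panel_a, panel_b))  # prints 1
```

Swapping two adjacent ranks flips exactly one bilateral ranking, so it yields a divergence of 1; reversing the entire call flips all six, yielding 6.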
In order to compare panels’ speaker point assignments, we looked at every speaker in every round and we calculated the difference between the speaker points assigned by panel A and panel B. When discussing the difference between the speaker points assigned to a team, unless explicitly noted otherwise, we are talking about the difference between the combined points assigned by panel A and the combined points assigned by panel B. But most of our conclusions concern comparison of individuals’ speaker points.
For another perspective on speaker point differences, we also grouped the results according to whether a particular team was given the same team rank by both panels in a round, given ranks that differed by one, or given ranks that differed by two. We then calculated how much individual speaker points differed in each of these categories.
Our last way of looking at speaker points was to compare the points given to partners on the same team to determine if the two panels rated them significantly differently, relative just to each other. We counted it as a significant difference if either of the following two criteria was met: 1) panel A gave higher points to one member of the team, while panel B gave higher points to the other member of the team; 2) one panel gave both teammates equal points, while the other panel had a difference of at least 2 points between the two teammates.

For our last primary question, we wanted to look at the reliability of individual judges’ initial calls. We chose to look at this in two different ways. First, we simply compared all 6 individual judge calls from the same debate, calculating the divergence for each of the 15 possible unique pairings of those 6 judges. This gave us information on how similar individual calls are, based on 300 bilateral comparisons between individual calls. Second, we wanted to know the degree to which individual judge calls are consistent with the final call of an independent panel after it has deliberated. Judges may often compare their own initial calls to the final call in a room where they are judging, but this final call is clearly not independent of their initial call, since that judge almost certainly influenced the panel’s final call. But at this tournament we were able to compare each judge’s individual call to the final call of the other (independent) panel. This gave us information on how similar individual calls are to independent panel calls, based on 120 bilateral comparisons between individual calls and independent panel calls.
 To illustrate this, imagine that panel A assigned a team’s first speaker an 82 and the second speaker a 79, while panel B assigned the first speaker a 78 and the second speaker an 81. The team’s cumulative scores differed by only 2 points, while the sum of the differences between the two speakers’ individual scores would be 6 points. In most cases in this paper, we will be concerned with differences in individuals’ speaker points, but when we write about teams’ speaker points we will be using the former (cumulative) method.
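The two criteria for a significant difference between teammates’ points can be encoded as a simple check. This sketch uses made-up point values, not study data:

```python
def significant_teammate_difference(panel_a_pts, panel_b_pts):
    """Apply the two significance criteria to one team's speaker points.

    Each argument is a (first speaker, second speaker) tuple of the
    points one panel gave the two teammates.
    """
    a1, a2 = panel_a_pts
    b1, b2 = panel_b_pts
    # Criterion 1: the panels rank the teammates in opposite orders.
    if (a1 - a2) * (b1 - b2) < 0:
        return True
    # Criterion 2: one panel ties the teammates while the other
    # separates them by at least 2 points.
    if (a1 == a2 and abs(b1 - b2) >= 2) or (b1 == b2 and abs(a1 - a2) >= 2):
        return True
    return False

print(significant_teammate_difference((82, 79), (78, 81)))  # True: opposite orders
print(significant_teammate_difference((80, 79), (81, 80)))  # False: same order, no tie
```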
Findings and Discussion
Given our initial questions, there are three areas of analysis: divergence between panels, differences in speaker points, and consistency of individual judge calls.
Divergence Between Panel Rankings
The chart below shows how frequent the different degrees of divergence (on the X-axis) between the panels were, expressed as a percentage of the number of debates (n=20) in which the two panels were compared. So, 5% represents 1 debate. Also shown on this chart, for the purpose of comparison, are (in orange) what the distribution would look like if the two panels made their calls randomly, and (in red) what the distribution looked like when comparing the calls of lay judges to pro judges (Barnes & McPartlon, 2014). The average divergence between the two panels was 1.15, with a standard deviation of 0.79.
Mathematically, there are 24 different possible calls. Obviously, there is only one way for two calls to have a divergence of 0, while there are three ways to have a divergence of 1, five ways to have a divergence of 2, and six ways to have a divergence of 3. Given this, it is notable that panels had a divergence of 0 in 20% of the debates, and a divergence of 1 in fully 50% of the debates. In our view, calls that diverge by only 1 degree are very similar, and given how frequently two teams are very closely matched in a debate, it is not at all surprising that two panels would come to different conclusions on how to rank closely matched teams. The other thing worth remarking on is that only 5% of the time (i.e., in only one debate) did the panels diverge by more than 2 degrees, and no panels diverged by more than 3 degrees. This suggests that panels of highly qualified judges are unlikely to come to significantly different conclusions about debates. At the same time, it is clearly not the case that we should expect even two excellent panels to come to exactly the same call. This does not imply that there is no such thing as a correct call (or that there is one), but it does strongly suggest that even excellent panels cannot be relied upon to consistently arrive at the precisely correct result, though they might reliably get quite close to it. Of course, this research study used a relatively small sample size, and further research of this sort will provide more evidence about this.
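These counts (one call at divergence 0, three at 1, five at 2, six at 3) can be checked by enumerating all 24 possible calls against a fixed reference call; a short sketch with illustrative team labels:

```python
from itertools import combinations, permutations

def divergence(order_a, order_b):
    """Pairwise disagreements between two orderings (best team first)."""
    pos_a = {team: i for i, team in enumerate(order_a)}
    pos_b = {team: i for i, team in enumerate(order_b)}
    return sum(
        (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
        for x, y in combinations(order_a, 2)
    )

reference = ("OG", "OO", "CG", "CO")
counts = {d: 0 for d in range(7)}
for call in permutations(reference):  # all 24 possible complete rankings
    counts[divergence(reference, call)] += 1
print(counts)  # {0: 1, 1: 3, 2: 5, 3: 6, 4: 5, 5: 3, 6: 1}
```

The distribution is symmetric: continuing past divergence 3, there are five ways to diverge by 4, three by 5, and one by 6.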
In terms of validity, if we assume that there is a single correct call, it is mathematically impossible for more than 60% of the panels to have arrived at precisely the correct answer. It could be as low as 0%, but it would be 60% if at least one panel in each room arrived at the correct call. Given that these were exceptionally strong panels and that it is likely that in some rooms neither panel arrived at the correct call, we are led to conclude that at many tournaments perhaps fewer than half of the panels arrive at exactly the correct call. At the same time, it is possible that as many as 97.5% of the panels were within one degree of divergence from the correct call. Of course, this is the mathematical upper limit and the actual number is likely to be significantly lower, but given the number of panels that agreed exactly or were very close, we expect that a strong majority (over 80%) of panels arrive at a call that is no more than one degree of divergence from the correct call, and this strikes us as encouraging.
Another way to look at this data is from the teams’ perspective, asking how frequently the two panels ranked an individual team differently, and to what degree. In 51% of cases, a given team was assigned the same rank in the debate by both panels. In 40% of cases, a team’s ranks differed by one point. And, in 9% of cases, a team’s ranks differed by two points. No team received ranks that differed by three points (i.e., a first and a fourth). Of course, at all other tournaments, there is just one panel per round. Later in this paper, we look at the ranges of possible results for the teams if just one of the two panels in each round had counted.
 In each case where panels diverged by only 1 degree, we looked at the cumulative team speaker points awarded to the teams whose ranking was switched. We found that in 50% of these cases (five debates), both panels gave these teams total points differing by no more than 3 points, suggesting that both panels saw the teams as closely matched. Interestingly, in three of these debates (i.e., 30% of the cases diverging by 1 degree), at least one of the panels assigned the switched teams total speaker points differing by at least 8, suggesting that that panel saw its ranking of these two teams as quite clear. So, it seems that even in some cases where the panels’ divergence is very small (1), the panels may see some elements of the debate very differently.
Divergence Between Panel Speaker Points
The average difference between the two panels in their assignment of speaker points to an individual was just under 2 points (1.97). The maximum difference was 7 points and standard deviation was 1.56, with the most common difference being 1 point (29% of cases). In 69.4% of cases, point assignments differed by 2 or less.
Speaker points are often considered to be somewhat arbitrary and unreliable indicators of debating quality. Obviously, they are not entirely arbitrary, nor are they completely unreliable. If we were to label these different degrees of disagreement in speaker points, we would do so as follows: 0 = nonexistent; 1 = minimal; 2 = modest; 3 = substantial; 4 or 5 = dramatic; 6 or more = radical. Given the range of the speaker point scale (at least 21 regularly used scores, 65-85), the expectation that two panels will assign the exact same points to speakers is not realistic. At the same time, debaters and judges seem to perceive a difference of two or three speaker points as weighty. In our view, it is encouraging that in 69.4% of cases the difference in speaker points was either nonexistent, minimal or modest (i.e., from 0 to 2). The rapid falloff to cases of dramatic or radical differences (making up less than 15%) is also somewhat reassuring. Obviously, everyone will always prefer numbers that show ever greater reliability, but these numbers do not suggest to us that speaker points are useless or hopelessly arbitrary, at least when they are assigned by panels of highly qualified judges who have a common touchstone (experience at the WUDC).
When both panels gave a team the same rank in the debate, the average difference between the two panels in their assignment of combined speaker points to a team was 2.51. When a team received ranks that differed by one, the average combined team speaker point difference was 4.44. And when a team received ranks that differed by two, the average combined team speaker point difference was 7.14. So, not surprisingly, speaker point assignments to a team were much more similar when the two panels were closer to agreeing on that team’s ranking in the debate.
In evaluating speakers on the same team, the two panels awarded points that were not significantly different (as defined in the methods section above) in 72% of cases. In 19% of cases, the two panels awarded speaker points indicating that they ranked the teammates in a different order. In other words, both panels thought that one teammate was better than the other, but they disagreed on which speaker that was. In another 9% of cases, one panel awarded equal speaker points, while the other panel had the teammates separated by at least 2 speaker points.
 We weren’t entirely sure what to make of the fact that in some cases highly qualified panels assigned a speaker points that differed by 6 or 7. Even though this happened only about 3% of the time, it seemed hard to fathom. When we looked more closely at the data, we saw that almost all of these dramatic and radical differences in speaker points occurred in situations where both team members’ points differed significantly. For example, this is an exhaustive list of the cases where the combined point difference for a team was 10 or greater: 75/75 vs. 82/82; 77/77 vs. 83/83; 78/78 vs. 83/83; 75/75 vs. 80/80; 78/78 vs. 82/84. There is a clear pattern here, which continues even when we look at slightly smaller point differences. The most likely explanation is that the two panels evaluated the whole team’s argumentative approach to the debate quite differently, as opposed to attributing the team’s success or failure primarily to one of the debaters.
Individual Panelist Rankings
As described in the methods section, we took two approaches to evaluating the reliability of individual judges’ initial calls on team rankings. The average divergence between the initial calls of two judges in the same debate was 1.89 (with a standard deviation of 1.30). As the chart here makes clear, panelists do make some extremely divergent calls (i.e., 4, 5 or 6 degrees of divergence), in about 11% of cases. In contrast, 70% of the time these individual calls diverge by 2 or fewer degrees. Given that these calls are made before these judges have had an opportunity to talk through the debate with the other judges, this seems to us to be a fairly high degree of correlation between the judges’ views of these debates.
Notwithstanding the above remarks, we want to make very clear that this research strongly confirms the belief that the final calls of panels are much more reliable than the initial calls of individual judges. The former diverge by an average of 1.15, while the latter diverge by an average of 1.89. As an average, this is a big difference, given that the maximum average divergence one could plausibly expect would be 3.00. Of course, we don’t expect it to be surprising to anyone that panels are more reliable than individuals, but it is good to confirm what we already suspected.
The other approach we took in evaluating individual judge calls was to compare them to the final call of the other panel (of which they were not a part). In the chart below, we again see the random distribution of call divergence in the rightmost orange columns (for reference). The blue (leftmost) columns represent the frequency of call divergence between individual calls and the other panel. The red (middle) columns represent divergence between individual calls and the judge’s own panel, which is more closely correlated, presumably because the individual making the call was one of just three people influencing that panel’s final call.
Using this method, we found that initial calls by highly qualified judges diverged from the settled opinion (i.e., the final call) of a panel of similarly qualified judges by an average of 1.60 degrees (with a standard deviation of 1.16). But while this average is neither strikingly high nor particularly low, we found that focusing on the distribution in this data was more telling. In particular, we note that 21% of these initial calls diverged by 3 or more degrees. That is to say, more than 1 out of 5 initial calls by highly qualified judges were very different from the settled opinion of a highly qualified panel (a significantly more reliable measure, as shown above).
In an additional 29% of cases, the initial call diverged from the independent panel by 2 degrees, so fully half the time, these judges’ initial calls diverged by at least 2 degrees from an independent panel. Based on this, our conclusion is that initial calls, made prior to the opportunity to talk things through with other judges, are not particularly reliable, even with strong judges. This accords with the anecdotal evidence that it is common for highly respected adjudicators to disagree with the official results of the WUDC judging test (when these have been announced) and that panels of the best judges (e.g., at the Grand Finals of the WUDC) often strongly disagree on the correct call. Because of this, we suggest that CA teams be cautious in their use of a judge’s initial call, whether it is submitted as part of a judging test before a tournament or simply observed when judging together on a panel. It is quick and easy to look at how divergent judges’ calls are, and it may seem more objective than reading the long rationales they provide, but our data suggest that even some of the world’s most respected judges have unreliable initial calls (which in most cases they will want to amend based on deliberation with the whole panel). If one does use call accuracy to evaluate judge tests, then tournaments should be careful to select a recorded round where top judges independently (without any discussion) and nearly unanimously arrive at the same call.
We also compared judges’ initial calls to the final call of their own panel; here the initial calls differed by an average of only 1.15, noticeably less than the 1.60 average divergence from the independent panel’s final call. That gap is substantial (15% of the range from perfect agreement at 0 to a random call at 3.00), and for reasons just mentioned, judges need to be cautious in drawing conclusions about the accuracy of their initial calls by comparing them to their own panels’ final calls. This is particularly true for judges who tend to chair, because we found that chairs have a much greater influence over the final call of the panel than do panelists, even when everyone knows that the panelists are very highly qualified judges. The data that support this are interesting. When comparing the divergence between individual calls and the other panel’s final decision, we discovered that there was almost no difference between how much panelists diverged (1.61 degrees on average) and how much chairs diverged (1.58 degrees on average). This suggests that in this group of highly qualified judges, chairs were no more likely than panelists to arrive at the correct decision, as estimated by an independent panel’s final decision. But when we compared the divergence between individual calls and the same panel’s final decision, we discovered a much larger difference between panelists’ divergence (1.26 degrees on average) and chairs’ divergence (0.93 degrees on average). The opening of this gap, when switching between these two comparisons, suggests that chairs exert considerably more influence on the outcome of a panel’s final decision than do the panelists, even when the panelists are highly qualified and so presumably both less likely to defer to the chair and more likely to be respected by the chair.
We are not claiming that this is a particularly surprising conclusion. Indeed, this may even be desirable in situations where one has good reason to believe that the chair’s initial call is significantly more likely to be correct. But, on panels where this is not the case and the judges are roughly equal in their qualifications, the chair should be cautious in using their power to sway the outcome toward their initial call, if they care about the panel’s final decision being correct. Moreover, as mentioned above, chairs should not take the fact that their initial calls tend to match the panel’s final call as persuasive evidence that their initial calls are correct.
 An average divergence of 3.0 is what one would get by comparing calls to random calls. A higher average divergence would mean that some factor was actively pushing the calls to be more divergent than random, which would be very odd indeed.
 With respect to very large degrees of divergence (5 or 6), we do note that the data show these were slightly more frequent when comparing individual judge calls to the judge’s own panel than to the other panel. However, given the small sample size and that these were only slightly more frequent, this is not statistically significant and we regard it as very likely an anomaly.
 We thank Steven Penner for suggesting that we add this anecdotal evidence.
 The gap discussed above (between 1.58 and 1.61) is just 1.0% of the range from perfect agreement to random, whereas the gap between 1.26 and 0.93 is fully 11.0% of this range, which seems substantial.
Here is one more way to think about the implications of the data about individual judge variation. For the sake of argument, assume that almost all the RR judges would likely be chairing panels during preliminary rounds at Worlds. The data we collected on judges’ calls before deliberation are not mathematically compatible with the claim that the chair usually arrives at the correct call before deliberation. There are simply too few cases of zero divergence to make this possible. At the same time, these data do give us reason to believe that typical Worlds chairs generally make initial calls that are reliably close (i.e., within two degrees of divergence about 80% of the time) to what a panel of independent strong judges would conclude after deliberation. Of course, there is no reason to believe that the panelist calls are more likely to be correct, but we nevertheless ought to be cautious in how strongly we discourage panelists (especially well qualified panelists) from challenging the initial call of the chair. Moreover, it would be wise to instill greater humility into many chairs, especially when they have experienced panelists (e.g., in elimination rounds), but also in cases where their panelists are not well known but are offering cogent reasons for a different call.
 We want to emphasize again that this claim of reliability is not the same as a claim of accuracy. It is possible that debate judges consistently and systematically end up getting the call wrong.
Although it was not one of our initial research goals, in the process of analyzing the data, we found another potentially illuminating way of looking at the differences between the calls of the two panels. For the sake of argument, let’s assume that every panel at the 2016 HWS RR was a highly qualified panel. The tournament would not have been criticized for having poor judging if each room had only one of the two panels that they actually had. Now, imagine that we had randomly selected only one of the two panels from each room to do the adjudication. What would the potential variations in the outcome of the preliminary rounds look like?
One team at this tournament received the same team points from both panels in every round, and so for this team, randomly selecting just one panel in each round would have made no difference at all. However, for the other fifteen teams it would make a difference, and in some cases, a dramatic difference. With one panel per room, over the course of five rounds a team can earn between 0 and 15 points. Two teams’ potential totals ranged over six possible outcomes. For example, one team could have ended on as many as 12 points (certain to break) or as few as 7 points (well out of contention for breaking), depending on which panels were randomly chosen. Of course, in most cases, the results in the middle of the range of possible outcomes were much more likely, but in one case a team had four possible outcomes, each 25% likely if panels were selected at random.

In figure 6, individual teams are represented by bars of the same color in the same row. The 16 teams are labeled from “A” to “P”, where A was first on the tab and P was last on the tab after preliminary rounds. The height of each individual bar represents the likelihood that an individual team would have ended on a particular number of points (0 – 15). So, it is 100% likely that team D (the very tall blue bar) would end on 10 points because in every round they received the same points from both panels. As another example, it is 50% likely that team M would end on 3 points and 50% likely that they would end on 5 points, since in four rounds both panels gave them the same ranking and in one round the panels’ rankings differed by 2.
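The distributions of this kind can be computed by enumerating every equally likely way of choosing one of the two panels in each of the five rounds. The sketch below illustrates the calculation with hypothetical round-by-round points (not the actual 2016 HWS RR data); the team-M-like example assumes panels that agree in four rounds and differ by 2 points in one round.

```python
from itertools import product
from collections import Counter
from fractions import Fraction

def outcome_distribution(rounds):
    """Probability of each possible point total when one of the two
    panels in each round is selected independently at random.

    `rounds` is a list of (panel 1 points, panel 2 points) pairs."""
    n = len(rounds)
    # Enumerate all 2^n equally likely panel selections and tally the totals.
    totals = Counter(sum(choice) for choice in product(*rounds))
    return {total: Fraction(count, 2 ** n) for total, count in sorted(totals.items())}

# Hypothetical data: panels agree in four rounds, differ by 2 in one round,
# mirroring the team M example described in the text.
print(outcome_distribution([(1, 1), (0, 0), (1, 1), (0, 2), (1, 1)]))
# → {3: Fraction(1, 2), 5: Fraction(1, 2)}
```

Because each round’s panel choice is independent, a team whose panels disagree in two rounds can face four possible totals, each 25% likely, which matches the most volatile case described above.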
Of course, as all debaters know, even a very small variation in the team points that are awarded can make a huge difference. Our impression is that this way of looking at the data makes judge variation seem a more serious reliability problem, whereas our earlier ways of looking at the data (e.g., Figure 2) make it seem a smaller one. Obviously, over the course of a debater’s career (if not within a set of preliminary rounds), one would expect these fortunate and unfortunate decisions to even out, though this may be cold comfort when misfortune seems to strike during the elimination rounds of a major tournament.
 In a round robin, getting a different number of points in a round is arguably an even more important difference, since there is no “silver lining” to getting fewer points. A team will debate against the same teams in their future rounds, regardless of how many points they are on.
First, based on this evidence, judging panels tend to give fairly similar calls, and so appear to be fairly reliable. Two highly competent panels diverged, on average, by just over one degree of difference (1.15 degrees), which is roughly the equivalent of swapping two adjacent teams in a call. Second, speaker points appear to be somewhat reliable, differing by an average of 1.97 points per speaker. In 45.1% of cases, the two panels assigned points that were either identical or differed by just 1 point. But, in 30.6% of cases, the two panels assigned points that differed by 3 or more points, and most debaters would surely regard a difference of 3 or more points as a consequential difference. Our analysis did not address whether there is systemic bias in the awarding of speaker points, which might affect both panels (and so might disrupt accuracy without disrupting reliability). Third, not surprisingly, individual judge calls are much less reliable than are the calls of full panels. This suggests that using individual calls to grade judge tests is often a poor way of evaluating judges. Fourth, we found that even among highly qualified judges, the chair of a panel exercises substantially more influence on the panel’s final decision than do the panelist judges, despite the fact that the initial call of the chair seems no more likely to be correct than the initial call of the panelists.
Limitations and Directions for Future Research
There were several limitations to our research. First, our sample size was relatively small, just 20 rounds, with 40 judging panels. Our intention is to continue running the HWS RR with dual judging panels into the future, as long as it remains practical for us to do so, and to again analyze the data that comes out of these tournaments. We will, of course, seek to publish new findings once several more years of data are available. Second, while the unusually high quality of judges at the HWS RR is important in allowing us to draw certain conclusions from the data, it also lessens the extent to which some conclusions may be extrapolated to more typical BP Debate tournaments.
The methods used in this paper make us curious about what we would learn from running dual panels of lay judges, either at a tournament or through the use of videotaped debates. This would give us a much better sense of the reliability of lay judges, beyond the extremely limited findings on this from the earlier Barnes & McPartlon study. Additionally, we would still like to study the impact of the speakers’ perceived identity on the outcome of debates, regarding both team points and speaker scores. Although judging bias was not the focus of this paper, the data generated by dual panels may end up being useful in studying questions about judge bias.
Finally, there are some theoretical questions that are related to the empirical work in this paper and that are worth extensive discussion. In particular, there is the question of whether there is a single correct call for any particular debate round, or whether there could be multiple calls that are equally correct. The answer to this theoretical question has no impact on the methodology or conclusions of the foregoing empirical paper, but it does seem important to the international debating community. Our own belief is that there is a single correct call in any debate round. Our view is essentially a version of the theory of “idealized judging” outlined in “Philosophy in Practice: Understanding Value Debate” (Barnes, 1996), modified for the BP format. This theory claims that the correct call is the call that would be made by a judge who was ideally situated to evaluate the debate. Unfortunately, the argument for this position will need to wait.
 This obviously also has roots in the philosophical ideal observer theories of the Enlightenment, such as Adam Smith’s “The Theory of Moral Sentiments” (1790).
Eric Barnes is an associate professor of philosophy and coaches the debate team at Hobart and William Smith Colleges (HWS) in Geneva, NY. His research is about moral theory, applied ethics and competitive debate. In 2007, he invented the Round Robin format for British Parliamentary debate, and the HWS Round Robin has served as a laboratory for debate research as well as a venue for high quality debating. Prof. Barnes has published articles about (and against) judging theory and about expanding the break at the WUDC (in the run-up to that decision). He is committed to improving the activity by promoting both better arguments and better ways to run tournaments.
Chris Doak graduated magna cum laude from Hobart College with a B.A. in Economics. At Hobart, he was a central member of the debate team, and participated in competitions in the United States, Canada and China. He is currently a second-year law student at Syracuse University College of Law. At Syracuse, he is a member of Syracuse Law Review and the Trial and Appellate divisions of the Travis H.D. Lewin Advocacy Honor Society, the moot court organization. Chris hopes to use his background in debate toward a career as a trial attorney.