
Bigger isn’t always better – the case of the first innings in cricket

I’ve got an unsubstantiated hunch (the best kind of hunch!) about cricket. Well, not just one, I have loads, but this particular hunch is about the first innings of a cricket match, and that bigger isn’t always better.

I greatly enjoyed following England’s first innings against South Africa in the second Test in Cape Town. But, even with the high run rate while Stokes was smashing it everywhere, I was convinced that the higher that first innings got, the less likely we’d be to win it. This goes against the received wisdom in cricket, which is that the bigger the first innings score, the better it is.

So, I’ve had a look at all first innings scores in Tests from 1990 until now (there’s just over a thousand of them). Here’s a simple density plot of the distribution of runs scored in the first innings per match result:

density plot of runs

What this seems to show is that there’s a limited sweet spot from just over 380 runs to about 500 runs where a win is the most likely result. Once a team scores over about 500 runs in the first innings, the most likely match result is a draw.

Part of that is probably because of how much time posting a huge first innings takes out of the game. What happens when we look at runs scored vs. balls taken in the first innings?

scatter plot of runs and balls simple

There’s a green cluster in the middle between about 350 and 550 runs and between about 700 and 800 balls. That, I reckon, is the sweet spot for the perfect first innings: scoring a high but not massive number of runs, without taking too much time. England took 755 balls (125.5 overs) in their first innings in Cape Town, so a win was still just about the most likely result there… but, this may just be an exception. We’ll see.
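
A note on the overs figure: 125.5 is cricket notation for 125 overs and 5 balls, not 125 and a half overs. Here is a minimal sketch of the conversion in R. The helper names balls_to_overs and run_rate are made up for this illustration and are not part of the original analysis. On the scatter plot, a run rate of r runs per over corresponds to the line runs = r * balls / 6.

```r
# Hypothetical helpers, named for this illustration only.

# Convert a ball count to cricket overs notation ("overs.balls"),
# assuming six-ball overs and ignoring no-balls and wides.
balls_to_overs <- function(balls) {
  paste0(balls %/% 6, ".", balls %% 6)
}

# Runs per over, using the true fractional number of overs.
run_rate <- function(runs, balls) {
  runs / (balls / 6)
}

balls_to_overs(755)   # "125.5", i.e. 125 overs and 5 balls
run_rate(400, 800)    # 3 runs per over
```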

Here’s the same plot with some lines showing a run rate of 2, 3, and 4 runs per over (the steeper the line, the greater the run rate):

scatter plot of runs and balls

Visually, I’m convinced the sweet spot of 380-500 runs at a decent run rate is obviously there. So, let’s try looking at some simple percentages by comparing scores between 380-500 runs with scores over 500 runs, where runs are scored at over 3.5 runs an over:

Run rate over 3.5, runs between 380 and 500
won   draw   lost   win rate   win:draw ratio
43    16     10     62.32%     2.69

Run rate over 3.5, runs over 500
won   draw   lost   win rate   win:draw ratio
57    47     1      54.29%     1.21

The win rate goes down for the higher scores (from 62.32% to 54.29%), and the win:draw ratio goes down too. In other words, even if you’re scoring well, going beyond 500 just makes the draw more likely and doesn’t actually help your chances of winning.
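
For anyone who wants to check the arithmetic, here is a small sketch of how the two summary figures fall out of the won/drawn/lost tallies. The function name result_summary is hypothetical; the counts are the ones quoted above.

```r
# Hypothetical helper: summarise a won/drawn/lost tally.
result_summary <- function(won, drawn, lost) {
  c(win_rate_pct   = round(100 * won / (won + drawn + lost), 2),
    win_draw_ratio = round(won / drawn, 2))
}

result_summary(43, 16, 10)  # 380-500 runs: 62.32% win rate, ratio 2.69
result_summary(57, 47, 1)   # 500+ runs:    54.29% win rate, ratio 1.21
```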

But, that’s not quite a fair comparison. I said earlier that if you’re going to score more runs, you have to do it at a higher run rate, so comparing all scores above 3.5 an over isn’t exactly fair. Let’s now compare a good score at a good run rate with a high score at a high run rate. Again, I’m taking a good score to be 380-500 and a high score to be over 500. In terms of run rate, I’m quantifying a good run rate as between the mean run rate of all innings and the mean plus one standard deviation (i.e. between 3.13 and 3.72 runs per over), and a high run rate as above the mean plus one standard deviation (i.e. above 3.72 runs per over).
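
The banding described above could be sketched in R roughly like this. The function names are made up for illustration, and the standard deviation of 0.59 is simply inferred from the two thresholds quoted (3.72 minus 3.13) rather than taken from the data.

```r
# Band run rates using the thresholds quoted in the text:
# 3.13 (mean RPO) and 3.72 (mean plus one standard deviation).
# sd_rpo = 0.59 is inferred from those two numbers, not measured.
rpo_band <- function(rpo, mean_rpo = 3.13, sd_rpo = 0.59) {
  cut(rpo,
      breaks = c(-Inf, mean_rpo, mean_rpo + sd_rpo, Inf),
      labels = c("below average", "good", "high"))
}

# Band first innings totals: "good" is 380-500, "high" is over 500.
score_band <- function(runs) {
  cut(runs,
      breaks = c(-Inf, 379, 500, Inf),
      labels = c("below 380", "good", "high"))
}

rpo_band(c(2.8, 3.40, 4.90))
score_band(c(350, 438, 587))
```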

So, is a score of 380-500 at 3.13-3.72 runs per over better than a score of 500+ at 3.72+ ?

380-500 runs at 3.13-3.72 RPO (mean runs: 438, mean RPO: 3.40)
won   draw   lost   win rate   win:draw ratio
46    20     16     56.10%     2.3

500+ runs at 3.72+ RPO (mean runs: 587, mean RPO: 4.90)
won   draw   lost   win rate   win:draw ratio
44    32     1      57.14%     1.375

…the lower, slower score isn’t better, but it isn’t worse either. The likelihood of winning stays the same; the only difference is that batting on makes losing much less likely and drawing much more likely.

This is really counterintuitive, and I find it hard to wrap my head around the fact that scoring 438 at 3.4 an over is about as likely to result in a win as scoring 587 at 4.9 an over. One possibility is that the matches which feature high first innings scores are played on absolute roads, like in the 1997 Colombo snoozeathon between India and Sri Lanka, meaning that a high second innings score is also pretty likely. Therefore, you’d expect the first and second innings scores to correlate in matches where the first innings was 500+ runs at 3.72+ RPO… but they don’t (r=0.07, p=0.52). Nor do the first and second innings scores correlate in matches where the first innings was 380-500 runs at 3.13-3.72 RPO (r=-0.15, p=0.18). The only indication that a massive first innings score may mean that the pitch is easier to bat on is that the mean second innings score in response to a massive first innings score is 346.90, while the mean second innings score in response to a good first innings score is 307.09. A t-test between the two sets of second innings scores is “relatively significant” (as an ever-hopeful colleague of mine used to say) with a p-value of 0.07, but that doesn’t cut it. This is another mystery for another blog post.
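
For the curious, the kind of checks described above can be run with base R’s cor.test() and t.test(). The numbers below are invented purely to show the calls and are NOT real match data, so the output will not match the figures quoted. The variable names are made up too, and the Welch variant of the t-test (R’s default) is an assumption about which test was used.

```r
# Made-up numbers purely to show the calls; NOT real match data.
good_first  <- c(402, 455, 388, 471, 430, 495)
good_second <- c(310, 250, 341, 298, 275, 366)
huge_first  <- c(551, 610, 523, 587, 642, 506)
huge_second <- c(352, 301, 398, 420, 287, 333)

# Pearson correlation between first and second innings totals in one group.
good_cor <- cor.test(good_first, good_second)
good_cor$estimate  # r
good_cor$p.value   # p

# Welch two-sample t-test comparing second innings totals between groups.
second_tt <- t.test(huge_second, good_second)
second_tt$p.value
```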

Right, back to looking at just the first innings scores and win rate. One last way of exploring this is by creating a matrix of win rates in bins of runs scored and run rate.

I’ve put all innings into bins of 50 runs and bins of 0.5 RPO. This means that every square in the following graphs is represented by a set of matches where that many runs have been scored at that rate. It’s only done for bins with at least five matches in (because you can’t really extrapolate from things where only one or two matches have happened, as that leads to a lot of 0% and 100% win rates).
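
A rough sketch of that binning in base R, assuming a data frame with one row per first innings. The column names and the eight rows are invented for illustration and are NOT real match data.

```r
# Made-up innings purely for illustration; NOT real match data.
innings <- data.frame(
  runs   = c(412, 433, 447, 405, 428, 449, 560, 575),
  rpo    = c(3.1, 3.2, 3.4, 3.3, 3.0, 3.4, 4.2, 4.4),
  result = c("won", "won", "draw", "lost", "won", "draw", "draw", "won")
)

# Lower edge of each bin: 50-run bins and 0.5 RPO bins.
innings$runs_bin <- floor(innings$runs / 50) * 50
innings$rpo_bin  <- floor(innings$rpo / 0.5) * 0.5
innings$won      <- innings$result == "won"

# Count matches and wins in each (runs, RPO) bin.
matches <- aggregate(won ~ runs_bin + rpo_bin, data = innings, FUN = length)
wins    <- aggregate(won ~ runs_bin + rpo_bin, data = innings, FUN = sum)

bins <- data.frame(
  runs_bin = matches$runs_bin,
  rpo_bin  = matches$rpo_bin,
  matches  = matches$won,
  win_rate = wins$won / matches$won
)

# Keep only bins with at least five matches.
kept <- bins[bins$matches >= 5, ]
kept
```

With this toy data only the 400-449 runs, 3.0-3.49 RPO bin survives the five-match cut-off; the two 550+ innings are dropped.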

This graph visualises the win rate per bin; the darker the green, the greater the likelihood of winning based on that kind of first innings:

rough matrix of runs, RPO, win rate - five matches or more, cropped

But what if, instead of plotting the simple win likelihood for all bins, we plot the most likely result based on that bin, along with the likelihood of that result? In this graph, the colour represents the top result – win, draw, or loss – and the intensity of that colour represents the likelihood – the more intense the colour, the more likely that result:

rough matrix of runs, RPO, top result, rate, cropped

In both matrices, the sweet spot with the most green and the most intense green falls within 400 and 500 runs… although it turns out that in terms of overall win likelihood, the best first innings is to score between 500 and 550 runs, scored at over 4 runs per over.

Ultimately, what this shows is that batting on past 500 or so makes losing the match hugely unlikely (but definitely not impossible), so if safety first is your watchword, have at it. However, if you want to win a Test match, there’s not much point in batting on past 500 or so in the first innings, 550 at most, no matter how fast you score (and if you do decide to go for the big imposing total, you’d better hurry up about it). Ben Stokes might have set a load of records, but with a bit of statistical sleuthing, he’d have realised it was pointless because his batting blitz was actually just making it harder for England to win.

Why bother creating these incredible cricketing memories when the statistics say hold back?

…because it’s much more entertaining. If you focus on the statistics all the time, you end up with a team like England under Peter Moores, where nobody knows anything before they’ve looked at the data. Fair enough, then.


Using R to calculate better cricket statistics… or, how to revolutionise the way we slag off Ian Bell.

Have you ever been bothered by the idea of career batting averages, how they don’t reflect a player’s form, and how it’s unfair to compare the averages of cricketers who’ve played over a hundred Tests with those of cricketers who’ve played maybe thirty, since one bad innings will damage the experienced cricketer’s average far less than the relative newcomer’s?

Well, you’re not alone. I’ve always thought that cricinfo should report a ten-innings rolling average. Occasionally you get a stat like “Cook is averaging 60 or so in the last few matches” or whatever, but there’s no functionality on cricinfo or statsguru to be able to look that up.

Enter R. R is a free open-source statistical programme that I normally use for my ERP research, but it’s also the next best thing after Andy Zaltzman for settling arguments about cricket statistics.

I’ve written some R code which can take any cricketer on cricinfo and spit out a ten-innings rolling average to show fluctuations in form. Plotting it with ggplot2 can show a player’s peaks and troughs as compared to their career average, and can hopefully be used as a much more objective way of saying whether or not somebody’s playing badly.

Alastair Cook has been a lightning rod for criticism in the last couple of years. He scored heavily in his first few matches as England captain, and for a little while it seemed as though captaincy would improve his batting, but then he went into a long slump. He recently broke his century drought, and people are divided over whether he’s finally hitting form again or whether this is a dead cat bounce on an inevitable decline. Some people take his last five Tests and say he’s back; others take his last year or two in Tests and say he’s lost it. What is missing from all the misspelled derision in the comments under any article about Cook is a ten-innings rolling average and how it changes over time.


Alastair Cook: rolling and cumulative averages

This graph shows Cook’s peaks and troughs in form quite nicely. The big one in the middle where he averaged about 120 over ten innings is a combination of his mammoth 2010-11 Ashes series and the home series against Sri Lanka where he scored three centuries in four innings. His recent slump can be seen in the extended low from his 160th innings and onwards, where his rolling average went down to below 20. Now, though, it’s clear that not only has he regained some form, he’s actually on one of the better runs of his career.

Similarly, it seems like commentators and online commenters alike feel like Gary Ballance should be dropped because he’s on a terrible run of form. Certainly, he’s had a few disappointing innings against the West Indies and New Zealand lately, but is his form that bad?

Gary Ballance: rolling and cumulative averages

…no, no it isn’t. He’s still averaging 40 in his last ten innings.

If anything, it’s Ian Bell who should be dropped because of bad form:


Bell’s average has had a few serious drops recently, going down to 20 after a poor Ashes series in Australia (along with pretty much every other England player too), rebounding a bit after a healthy home series against India, and then plummeting back down to 20 after two bad series against West Indies and New Zealand. Unlike Cook, however, Bell never seems to stay in a rut of bad form for very long… but that never stops his detractors from claiming he hasn’t been good since 2011.

The missing bit in the cumulative average line, by the way, is from where Bell averaged a triple Bradman-esque 297 after his first three innings against West Indies and Bangladesh, which were 70, 65*, and 162*.
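
That 297 is easy to verify: a batting average is runs scored divided by dismissals, and Bell was only out once in those three innings. A quick sketch in R (the variable names are just for illustration):

```r
# Bell's first three Test innings as quoted above: 70, 65*, 162*.
runs    <- c(70, 65, 162)
not_out <- c(FALSE, TRUE, TRUE)

# Batting average after each innings = runs so far / dismissals so far.
cumsum(runs) / cumsum(!not_out)   # 70 135 297
```

Presumably the line goes missing because the plotting functions further down cap the y-axis at 200, so anything above that is not drawn.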

The forthcoming Ashes series also raises the interesting comparison of Joe Root and Steven Smith, two hugely promising young batsmen both at their first real career high points. Smith in particular is seen as having had an excellent run of form recently and has just become the #1 ranked Test batsman. Most cricket fans online seem to think that there’s no contest between Smith and Root, with Smith being by far and away the better batsman…

Root and Smith

…but it appears that there’s not actually much to choose between them. If anything, Root has had the higher peak of the two, averaging 120 over ten innings against India last summer and the West Indies more recently (this is in fact comparable to Alastair Cook’s peak against Australia in 2010-11, but has attracted far less attention). He’s dropped a little since, but is still averaging a more than acceptable 85. Smith’s current rolling average of 105 is also very impressive, and it’ll be fascinating to see how he gets on in this series.

If you are interested in calculating and plotting these graphs yourself, you can follow the R code below.

Firstly, if you don’t use them already, install and run the following packages:
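
The package list itself appears to be missing here. Judging by the functions called in the code further down (readHTMLTable(), melt() and ggplot()), something like the following would be needed. This is an inference, not the original list, and melt() could equally have come from the older reshape package.

```r
# Assumed package list, inferred from the functions called later on.
# install.packages(c("XML", "reshape2", "ggplot2"))  # run once if needed
library(XML)       # readHTMLTable()
library(reshape2)  # melt()
library(ggplot2)   # ggplot(), geom_line(), geom_point()
```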


The next step is to create a dataframe of innings for each player. You can do this by going to any player’s cricinfo profile, and then clicking on “Batting innings list” under the statistics section. Take that URL and paste it in here like so:

# Joe Root innings
url = ";template=results;type=batting;view=innings"
Root.tables = readHTMLTable(url,stringsAsFactors = F)
Root.full = Root.tables$"Innings by innings list"

This creates a fairly messy dataframe, and we have to tidy it up a lot before doing anything useful with it. I rolled all the tidying and calculating code into one big function. Essentially, it sorts out a few formatting issues, then introduces a for loop which loops through a player’s innings list and calculates both the cumulative and ten-innings rolling averages at each individual innings (of course, the first nine innings will not return a ten-innings rolling average), and then puts the dataframe into a melted or long format:

rollingbattingaverage <- function(x) {
  x$Test <- x[,14]            # creates new column called Test, which is what column 14 should be called
  x <- x[,c(1:9, 11:13, 15)]  # removes 10th column, which is just blank, and column 14
  x$NotOut <- grepl("\\*", x$Runs)   # flag not-out innings before stripping the asterisk
  x$Runs <- gsub("\\*", "", x$Runs)  # strip the asterisk so that Runs can be converted to numeric
  # Reorder columns for ease of reading
  x <- x[,c(1, 14, 2:13)]
  # Convert Runs to numeric; this introduces NAs for non-scores such as DNB
  x$Runs <- suppressWarnings(as.numeric(x$Runs))
  x <- x[!is.na(x$Runs),]     # drop innings where the player did not bat
  z <- nrow(x)
  rolling <- data.frame(innings = seq_len(z), rolling = NA_real_, cumulative = NA_real_)
  for (i in seq_len(z)) {
    if (i >= 10) {            # the first nine innings have no ten-innings average
      last10 <- (i-9):i       # the ten innings ending at innings i
      rolling[i,2] <- sum(x$Runs[last10])/sum(!x$NotOut[last10])
    }
    rolling[i,3] <- sum(x$Runs[1:i])/sum(!x$NotOut[1:i])
  }
  melt(rolling, id="innings") # long format with columns innings, variable, value
}
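
To see what the loop is doing without scraping anything, here is a minimal self-contained sketch of the same ten-innings calculation on a made-up run of twelve innings. The scores are invented and are NOT a real player’s record.

```r
# Made-up scores purely for illustration; NOT a real player's record.
runs    <- c(12, 45, 0, 88, 23, 101, 7, 34, 56, 19, 62, 5)
not_out <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE,
             FALSE, FALSE, FALSE, FALSE)

window  <- 10
rolling <- rep(NA_real_, length(runs))
for (i in window:length(runs)) {
  idx <- (i - window + 1):i
  # runs in the window divided by dismissals in the window
  rolling[i] <- sum(runs[idx]) / sum(!not_out[idx])
}
rolling
```

The first nine entries are NA, and the tenth is 385 runs over 8 dismissals, i.e. 48.125.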

Then I have another function which sorts out the column names (since changing the names of a function’s output is kind of tricky) and adds another column with the player’s name in it so that the player dataframes can be compared:

sortoutnames <- function(x) {
  player <- deparse(substitute(x))  # name of the object passed in, e.g. "Root"
  x$player <- player
  names(x) <- c("innings", "type", "average", "player")
  x
}

Now we can plot an individual player’s rolling and cumulative averages:

plotplayer <- function(x) {
  myplot <- ggplot(data=x, aes(y=average, x=innings, colour=type))
  myplot+geom_line()+scale_y_continuous(limits=c(0, 200), breaks=seq(0,200,by=10))
}

The next function isn’t really necessary as a function since all it does is rbind two or more dataframes together, but it makes things easier and neater in the long run:

compareplayers <- function(...) {
  rbind(...)  # stack the player dataframes on top of each other
}

And finally, we need to create functions for various types of graphs to be able to compare players:

plotrolling <- function(x){
  myplot <- ggplot(data=x[x$type=="rolling",], aes(x=innings, y=average, colour=player))
  myplot+geom_line()+scale_y_continuous(limits=c(0, 200), breaks=seq(0,200,by=10))
}

plotcumulative <- function(x){
  myplot <- ggplot(data=x[x$type=="cumulative",], aes(x=innings, y=average, colour=player))
  myplot+geom_line()+scale_y_continuous(limits=c(0, 200), breaks=seq(0,200,by=10))
}

plotboth <- function(x){
  # uses the argument x rather than a global dataframe called comparisons
  myplot <- ggplot(data=x, aes(x=innings, y=average, colour=player, size=type))
  myplot+geom_line()+scale_size_manual(values=c(0.6,1.3))+scale_y_continuous(limits=c(0, 200), breaks=seq(0,200,by=10))
}

plotrollingscatter <- function(x){
  myplot <- ggplot(data=x[x$type=="rolling",], aes(x=innings, y=average, colour=player))
  myplot+geom_point()+scale_y_continuous(limits=c(0, 200), breaks=seq(0,200,by=10))
}

Now that all the functions exist, you can get the information quickly and easily; just find the correct URL for the player(s) you want, paste it in the bit where the URL goes, and then run the functions as follows:

Root <- rollingbattingaverage(Root.full)
Root <- sortoutnames(Root)
# Smith is assumed to have been built from Steven Smith's innings list in the same way
comparisons <- compareplayers(Root, Smith)