When Burnley got beat 3-1 by Everton at Goodison Park on the 15th April, 33 games into their Premier League season, they’d gained only 4 points out of a possible 51 in their away fixtures. But during this time they’d also managed to accrue 32 points out of a possible 48 at Turf Moor; if the league table were based upon only home fixtures, they’d be in a highly impressive 6th place. But they were in 14th position in the real world, and would be rock bottom of the opposite hypothetical league which counted only away fixtures.
Newspapers seem to love rattling out stats like these but they’re often just cherry-picking data. Why 33 games? What if Burnley won their next two home games and lost their next away game - these figures would be even more mind-blowing. What if they started winning away games and the pattern goes cold? Stats can often be manipulated to fit any narrative, and this is especially true in football reporting; here from ‘Fortress Turf Moor’ or ‘Poor travellers Burnley destined for relegation’.
With just two games to go, Burnley are all but mathematically safe from relegation, but I wanted to look at the data to see whether they managed to cure their homesickness and how their skewed ratio of home:away points measures up to previous records. And whilst we’re here, have any teams showed a wanderlust, preferring to pick up more of their points from away fixtures?
Let’s fire up R. I’ve used the package engsoccerdata
which includes databases of historical results from English (and European) football leagues and several built-in functions for analysing its data.
devtools::install_github("jalapic/engsoccerdata")
require(engsoccerdata)
require(dplyr)
require(ggplot2)
First, we need to update the engsoccerdata
database with results from the current season using england_current()
function and subset the dates as we’re only interested in the PL era for this post (1992-93 - 2016-17).
#update 'england' dataframe if needed
england <- rbind(england, subset(england_current(), !(Date %in% england$Date & home %in% england$home)))
#subset
EPL <- rbind(england, england_current()) %>%
subset(tier == 1 & Season %in% 1992:2016)
Have a quick look to make sure it’s up to date.
tail(EPL, 5)
## Date Season home visitor FT hgoal vgoal
## 192355 2017-05-06 2016 Swansea City Everton 1-0 1 0
## 192356 2017-05-07 2016 Arsenal Manchester United 2-0 2 0
## 192357 2017-05-07 2016 Liverpool Southampton 0-0 0 0
## 192358 2017-05-08 2016 Chelsea Middlesbrough 3-0 3 0
## 192359 2017-05-10 2016 Southampton Arsenal 0-2 0 2
## division tier totgoal goaldif result
## 192355 1 1 1 1 H
## 192356 1 1 2 2 H
## 192357 1 1 0 0 D
## 192358 1 1 3 3 H
## 192359 1 1 2 -2 A
Next I’ve created a custom function, maketable_ha()
, to make league tables for each PL season: one using only home fixtures for each team and another using only away fixtures.
maketable_ha <- function(df=NULL, Season=NULL, tier=NULL, pts=3, type = c("both", "home", "away")) {
GA<-GF<-ga<-gf<-gd<-GD<-D<-L<-W<-Pts<-.<-Date<-home<-team<-visitor<-hgoal<-opp<-vgoal<-goaldif <-FT<-division<-result<-maxgoal<-mingoal<-absgoaldif<-NULL
#subset by season and tier, if applicable
if(!is.null(Season) & is.null(tier)) {
dfx <- df[(df$Season == Season), ]
} else if(is.null(Season) & !is.null(tier)) {
dfx <- df[(df$tier == tier), ]
} else if(!is.null(Season) & !is.null(tier)) {
dfx <- df[(df$Season == Season & df$tier == tier), ]
} else {
dfx <- df
}
#subset only home or away fixtures, if applicable
if(match.arg(type)=="home") {
temp <- select(dfx, team=home, opp=visitor, GF=hgoal, GA=vgoal)
} else if(match.arg(type)=="away") {
temp <- select(dfx, team=visitor, opp=home, GF=vgoal, GA=hgoal)
} else if(match.arg(type)=="both") {
temp <-rbind(
select(dfx, team=home, opp=visitor, GF=hgoal, GA=vgoal),
select(dfx, team=visitor, opp=home, GF=vgoal, GA=hgoal)
)
}
#make table
table <- temp %>%
mutate(GD = GF-GA) %>%
group_by(team) %>%
summarise(GP = sum(GD<=100),
W = sum(GD>0),
D = sum(GD==0),
L = sum(GD<0),
gf = sum(GF),
ga = sum(GA),
gd = sum(GD)
) %>%
mutate(Pts = (W*pts) + D) %>%
arrange(-Pts, -gd, -gf) %>%
mutate(Pos = rownames(.)) %>%
as.data.frame()
table <- arrange(table, -Pts, -gd, -gf)
return(table)
}
Home points ratio (HPR)
We’ll apply the maketable_ha()
function to calculate a ‘home points ratio’, HPR
: the proportion of total points that were gained at home by each team in each season (1.0 = all of a team’s points for the season were gained at home; 0.0 = no points gained at home).
dd <- lapply(unique(EPL$Season), function(x) {
#league tables for home fixtures
home <- maketable_ha(EPL, Season = x, tier = 1, type="home") %>%
mutate(Hpts = Pts, GPH = GP)
#league tables for away fixtures
away <- maketable_ha(EPL, Season = x, tier = 1, type="away") %>%
mutate(Apts = Pts, GPA = GP)
#combined (real) league table
both <- maketable_ha(EPL, Season = x, tier = 1, type="both") %>%
mutate(real_pos = Pos)
#merge together
plyr::join_all(list(home, away, both), by = "team", type = 'full') %>%
mutate(Season = x, GP = GPH + GPA, Pos = real_pos) %>%
select(Season, team, GP, GPH, Hpts, GPA, Apts, Pos) %>%
mutate(HPR = Hpts / (Hpts + Apts) ) %>%
arrange(HPR)
} ) %>%
#collapse this list to a dataframe
plyr::rbind.fill() %>%
#order by ascending home points ratio
arrange(HPR)
#prettify the season variable (e.g. 2016 -> 2016/17)
dd$season <- as.factor(paste0(dd$Season, "-", substr(dd$Season+1, 3, 4)))
Let’s have a quick look at either end of this dataframe.
dd %>% select(season, team, GP, Hpts, Apts, HPR) %>% head
## season team GP Hpts Apts HPR
## 1 1997-98 Crystal Palace 38 11 22 0.3333333
## 2 1993-94 Norwich City 42 21 32 0.3962264
## 3 2008-09 Hull City 38 14 21 0.4000000
## 4 2003-04 Blackburn Rovers 38 19 25 0.4318182
## 5 2014-15 Crystal Palace 38 21 27 0.4375000
## 6 2000-01 Manchester City 38 15 19 0.4411765
dd %>% select(season, team, GP, Hpts, Apts, HPR) %>% tail
## season team GP Hpts Apts HPR
## 501 2016-17 Hull City 36 28 6 0.8235294
## 502 2016-17 Burnley 36 33 7 0.8250000
## 503 1999-00 Coventry City 38 37 7 0.8409091
## 504 2005-06 Fulham 38 41 7 0.8541667
## 505 1992-93 Leeds United 42 44 7 0.8627451
## 506 2009-10 Burnley 38 26 4 0.8666667
Interestingly, Burnley have the largest home points ratio, but not from this season; they gained only 4 out of 30 points (87%) away from home in the 09/10 season away. Nevertheless, this season’s Burnley also clock in at 5th place with 83% of points gained at home. Crystal Palace have the lowest HPR by quite some margin, gaining only 33% total points from home fixtures during the 97/98 season. As expected, our PL average shows a slight preference for home fixtures overall, with teams picking up 61% of their total points at home on average.
We can visualise this data by plotting the 10 largest vs. the 10 smallest HPRs and comparing them against a PL average (shown in grey). (I’ve presented HPR as a percentage as it seems more intuitive, and normalised the bars relative to 50% as this is our null hypothesis, i.e. no home or away preference.)
Click here to show code used to generate plot
Out of curiosity, here’s the same plot but for the top flight across all seasons (1888 - present; using 3 points for a win for all seasons).
Click here to show code used to generate plot
This season’s Burnley only have the 64th highest home points ratio now; 1st place goes to Wolves, who managed to gain only 1 out of their 31 points away from home in the 1895-96 season (given 3 points for a win as in the modern era). Leeds are the only other team in top flight history to equal the feat of gaining one single away point - out of 24 total points in the 1946-47 season. Blackpool hold the all time highest away points bias, picking up 19 out of 27 points away from home in the 1966-67 season.
HPR: team averages
If we pool the data from all seasons we can find out the overall home / away preferences for all past and current PL teams.
dd3 <- dd %>%
group_by(team) %>%
summarise(HPR = mean(HPR), GP = sum(GP)) %>%
arrange(HPR)
dd3 %>% tbl_df %>% print(n = nrow(.))
## # A tibble: 47 × 3
## team HPR GP
## <chr> <dbl> <int>
## 1 Crystal Palace 0.5124406 310
## 2 Blackpool 0.5128205 38
## 3 Manchester United 0.5582854 959
## 4 Nottingham Forest 0.5661227 198
## 5 Arsenal 0.5700214 959
## 6 AFC Bournemouth 0.5714286 74
## 7 Wigan Athletic 0.5742155 304
## 8 Chelsea 0.5760889 959
## 9 Aston Villa 0.5774686 924
## 10 Charlton Athletic 0.5956775 304
## 11 Liverpool 0.5956984 960
## 12 Manchester City 0.5970030 769
## 13 Wimbledon 0.5974085 316
## 14 Ipswich Town 0.6046069 202
## 15 Leeds United 0.6047249 468
## 16 West Bromwich Albion 0.6072911 415
## 17 Sunderland 0.6079351 605
## 18 Tottenham Hotspur 0.6083261 959
## 19 Everton 0.6094660 960
## 20 Swansea City 0.6113086 226
## 21 Blackburn Rovers 0.6118826 696
## 22 Leicester City 0.6120253 419
## 23 Coventry City 0.6158050 354
## 24 Sheffield Wednesday 0.6170957 316
## 25 Bolton Wanderers 0.6183924 494
## 26 West Ham United 0.6257402 804
## 27 Middlesbrough 0.6266046 572
## 28 Wolverhampton Wanderers 0.6301276 152
## 29 Swindon Town 0.6333333 42
## 30 Queens Park Rangers 0.6351686 278
## 31 Newcastle United 0.6415718 844
## 32 Southampton 0.6435465 693
## 33 Portsmouth 0.6523316 266
## 34 Hull City 0.6541267 188
## 35 Norwich City 0.6545060 316
## 36 Oldham Athletic 0.6548469 84
## 37 Birmingham City 0.6566897 266
## 38 Derby County 0.6583931 266
## 39 Stoke City 0.6601283 340
## 40 Fulham 0.6642135 494
## 41 Cardiff City 0.6666667 38
## 42 Watford 0.6669643 149
## 43 Sheffield United 0.6898336 122
## 44 Reading 0.6909572 114
## 45 Barnsley 0.7142857 38
## 46 Bradford City 0.7264957 76
## 47 Burnley 0.7558081 112
Interesting to see that any individual season variation in HPR is eliminated and now all teams show at least some preference for home fixtures. Nevertheless, Crystal Palace again show the lowest level of home bias (51% of points at home) and Burnley the most (76%), although Burnley have far fewer games under their belt and this could be expected to revert to a lower mean.
Points per game (ppg): home and away
One other thing we might be interested in: the absolute number of points gained at home and away, instead of their relative ratios - total points is the only statistic that matters in the end after all. To this we’ll calculate points per game (ppg) for both home (ppg_home
) and away (ppg_away
) fixtures.
First let’s look at each individual each season:
dd4 <- dd %>%
mutate(ppg_home = Hpts/GPH, ppg_away = Apts/GPA) %>%
select(season, team, GP, Pos, HPR, ppg_home, ppg_away) %>%
arrange(desc(ppg_away))
#group by league finish: top 4, relegated, or 5th - 17th
dd4$Pos <- as.numeric(dd4$Pos)
dd4$col <- ifelse(dd4$Pos >= 18, "rel", ifelse(dd4$Pos <=4, "top", "mid"))
dd4$col <- factor(dd4$col, levels = c("top", "mid", "rel")) #reorder for plotting
rbind(head(dd4), tail(dd4))
## season team GP Pos HPR ppg_home ppg_away col
## 1 2004-05 Chelsea 38 1 0.4947368 2.4736842 2.5263158 top
## 2 2001-02 Arsenal 38 1 0.4597701 2.1052632 2.4736842 top
## 3 2008-09 Chelsea 38 3 0.4698795 2.0526316 2.3157895 top
## 4 2008-09 Liverpool 38 2 0.5000000 2.2631579 2.2631579 top
## 5 2001-02 Manchester United 38 3 0.4545455 1.8421053 2.2105263 top
## 6 2007-08 Chelsea 38 2 0.5058824 2.2631579 2.2105263 top
## 501 1992-93 Leeds United 42 17 0.8627451 2.0952381 0.3333333 mid
## 502 2015-16 Aston Villa 38 20 0.6470588 0.5789474 0.3157895 rel
## 503 2009-10 Hull City 38 19 0.8000000 1.2631579 0.3157895 rel
## 504 1999-00 Watford 38 20 0.7916667 1.0000000 0.2631579 rel
## 505 2009-10 Burnley 38 18 0.8666667 1.3684211 0.2105263 rel
## 506 2007-08 Derby County 38 20 0.7272727 0.4210526 0.1578947 rel
Click here to show code used to generate plot
So teams who finish the season with a higher HPR seem to finish lower down the table; more of a case of faring poorly away than being excellent at home. Meanwhile, those near the top tend to have a more equal home / away points ratio.
Here’s the same figures with three colour groups get a better feel for finishing position:
Click here to show code used to generate plot
So to win the league you’ve got to pick up high points per game home and away, right? Let’s look at the records for every PL winner:
dd4[dd4$Pos==1,] %>% arrange(HPR)
## season team GP Pos HPR ppg_home ppg_away col
## 1 2001-02 Arsenal 38 1 0.4597701 2.105263 2.473684 top
## 2 2004-05 Chelsea 38 1 0.4947368 2.473684 2.526316 top
## 3 2015-16 Leicester City 38 1 0.5185185 2.210526 2.052632 top
## 4 1993-94 Manchester United 42 1 0.5217391 2.285714 2.095238 top
## 5 2006-07 Manchester United 38 1 0.5280899 2.473684 2.210526 top
## 6 2016-17 Chelsea 35 1 0.5357143 2.647059 2.166667 top
## 7 1999-00 Manchester United 38 1 0.5384615 2.578947 2.210526 top
## 8 2012-13 Manchester United 38 1 0.5393258 2.526316 2.157895 top
## 9 2003-04 Arsenal 38 1 0.5444444 2.578947 2.157895 top
## 10 1996-97 Manchester United 38 1 0.5466667 2.157895 1.789474 top
## 11 2008-09 Manchester United 38 1 0.5555556 2.631579 2.105263 top
## 12 1992-93 Manchester United 42 1 0.5595238 2.238095 1.761905 top
## 13 2014-15 Chelsea 38 1 0.5632184 2.578947 2.000000 top
## 14 1998-99 Manchester United 38 1 0.5822785 2.421053 1.736842 top
## 15 2000-01 Manchester United 38 1 0.5875000 2.473684 1.736842 top
## 16 1994-95 Blackburn Rovers 42 1 0.5955056 2.523810 1.714286 top
## 17 1995-96 Manchester United 38 1 0.5975610 2.578947 1.736842 top
## 18 2007-08 Manchester United 38 1 0.5977011 2.736842 1.842105 top
## 19 2002-03 Manchester United 38 1 0.6024096 2.631579 1.736842 top
## 20 1997-98 Arsenal 38 1 0.6025641 2.473684 1.631579 top
## 21 2005-06 Chelsea 38 1 0.6043956 2.894737 1.894737 top
## 22 2009-10 Chelsea 38 1 0.6046512 2.736842 1.789474 top
## 23 2013-14 Manchester City 38 1 0.6046512 2.736842 1.789474 top
## 24 2011-12 Manchester City 38 1 0.6179775 2.894737 1.789474 top
## 25 2010-11 Manchester United 38 1 0.6875000 2.894737 1.315789 top
Well, Manchester United won the league in 2010-11 with only 1.32 ppg away from home. And on the flip side, Arsenal became champions in 2001-02 with a higher ppg from away fixtures (2.47) than home (2.10). For comparison, the median overall ppg (home and away fixtures) for PL winners is 2.29 (87 points), although United managed to win it in 1996-97 with an unbelievable 75 points (1.97 ppg)… (To put that into soul-destroying persepective as a Liverpool fan, Liverpool got 84 points when they finished runners-up in 2013-14.)
HPR vs. ppg_home and ppg_away
Finally, let’s pool the data for ppg_home
and ppg_away
as we did before to see each team’s overall record in the PL.
dd5 <- dd %>%
group_by(team) %>%
summarise(ppg_home = sum(Hpts) / sum(GPH), ppg_away = sum(Apts) / sum(GPA)) %>%
mutate(HPR = ppg_home / (ppg_home + ppg_away)) %>%
arrange(desc(ppg_away))
rbind(head(dd5), tail(dd5))
## # A tibble: 12 × 4
## team ppg_home ppg_away HPR
## * <chr> <dbl> <dbl> <dbl>
## 1 Manchester United 2.3479167 1.8580376 0.5582364
## 2 Arsenal 2.1544885 1.6270833 0.5697336
## 3 Chelsea 2.1231733 1.5895833 0.5718590
## 4 Liverpool 2.0625000 1.4187500 0.5924596
## 5 Manchester City 1.8020833 1.2207792 0.5961513
## 6 Leeds United 1.7692308 1.1880342 0.5982659
## 7 Hull City 1.1808511 0.6382979 0.6491228
## 8 Barnsley 1.3157895 0.5263158 0.7142857
## 9 Cardiff City 1.0526316 0.5263158 0.6666667
## 10 Swindon Town 0.9047619 0.5238095 0.6333333
## 11 Bradford City 1.1842105 0.4473684 0.7258065
## 12 Burnley 1.3928571 0.4464286 0.7572816
If we plot home ppg against away ppg and fill according to home points ratio we can see that teams with the highest HPR tend to have the lowest number of points per game. This seems to support the idea that a high HPR is caused more by having away jitters than a home fortress.
Click here to show code used to generate plot
In conclusion…
I’m not sure how useful this information is (if at all), but I find exploring data that interests me is always a good way to get to grips with new methods or technologies; this was my first time piping with magrittr
and my first time using knitr
and rmarkdown.
Popular sports like football tend to generate a massive amount data but their interpretation by the mainstream media often leaves a lot to be desired. Hopefully this post shows how useful open-source databases and tools like engsoccerdata
can make it easy to conduct a more rigorous analysis!