The Best Historical Weather Data Site for Sports Betting

As we all know these days, data is power, to the point that it can easily be used to destroy a country or continent. That is beyond the scope of our discussion here, but what is not is the far more important question of where to find the best historical weather data for sports gambling purposes. As for how weather actually affects sports handicapping, baseball is probably the sport where it matters most, which I discuss at the end of my book, but unfortunately the days of easy money in baseball totals are over, as weather is mostly priced in. Fortunately there are many interesting and sometimes counter-intuitive impacts across other sports.

While there are many paid sites as well as sites that can be scraped, the best source I have found is completely free and legal to use and is, shockingly, provided by the US government. Climate Data Online (CDO) has data from weather stations all around the world, in most cases going back longer than any reasonable sports data set, and, at least in the US, the data is of reasonable quality. This data set contains all the exciting weather information you could want, namely temperature, wind, air pressure, dew point, and precipitation.

The only real drawback of this data set is that most of the stations are at airports, meaning you will often have to compromise and use an airport some distance from the stadium. In addition, you are best off using major airports where possible, as smaller local airports and non-airport stations often have lower data quality. Fortunately, most stadiums are built near airports. If your historical sports data set has weather built in, you may be better off using that instead, although I have found that even when I have an in-stadium source and a nearby airport, the airport data is often of higher statistical value. In addition, you can always get a weather forecast for almost any airport, but you can't always get a decent stadium weather forecast.

There are several ways to get data off CDO, but if you have a good internet connection, probably the easiest and fastest is to download it all and then use scripts keyed on the filenames to pull out what you want. The place to get this is the archive at https://www.ncei.noaa.gov/data/global-hourly/archive/csv/ which has TAR files with the entire year's data for all stations. Each year is about 4.5 GB packed or 40-50 GB unpacked. Each archive is filled with numbered files, with each file corresponding to that year's data at the numbered station. You just unpack the files for the station numbers you want and you are good to go. If you only want a few stations, you can download them individually at https://www.ncei.noaa.gov/data/global-hourly/access/ instead. Setting up a script to only download the stations you want is probably faster, but I personally want all the data in case I handicap sports in other locations in the future (there is a ton of international data in here as well!) and the CDO site gets taken down. The Fifth Risk made it sound like private weather corporations run by big Trump donors want the data taken down so they can sell it themselves. I have no idea if the book is right, but I would grab the data now while you still can.
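If you go the bulk-download route, something along the lines of the sketch below will grab a year's archive and pull out only the stations you care about. The archive file name and the station list here are assumptions on my part, so check them against the NCEI page before running anything.

library(utils) # download.file() and untar() ship with base R

# Assumed archive naming (e.g. 2018.tar.gz) -- check the exact file name on the NCEI archive page
year = 2018
archive_url = paste0("https://www.ncei.noaa.gov/data/global-hourly/archive/csv/", year, ".tar.gz")
archive_file = paste0(year, ".tar.gz")

# Pull down the full-year archive (roughly 4.5 GB, so this will take a while)
download.file(archive_url, destfile = archive_file, mode = "wb")

# Placeholder station numbers -- 72295023174 is KLAX; swap in your own list
stations = c("72295023174")

# Extract only the per-station CSVs into a directory named for the year
# (adjust the member names if the archive nests files in a subdirectory)
untar(archive_file, files = paste0(stations, ".csv"), exdir = as.character(year))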

Finding the stations you want is probably the most time-consuming part of this process. There are far more airports in the US than most people realize, and often you will have several options near the stadium. The CDO site has a lookup feature where you can find all stations in a given country, city, or state. However, I think that this file from Weather Graphics is the best tool for the job. It has a list of all the stations in the CDO database. The station's file number is the concatenation of the "maslib" and "wban" columns. If you are using airports, probably the easiest way to look up the station number is to search by airport code - for example, for Los Angeles, which is LAX, search for "KLAX" (ICAO codes for US airports begin with a K) and you will be taken to the row with the corresponding number, 72295023174.
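Once you have the Weather Graphics list saved as a CSV, the lookup can be scripted along these lines. The file name and the ICAO column name are my guesses and will need to match your copy of the file; "maslib" and "wban" are the columns described above.

# Hypothetical export of the Weather Graphics station list -- adjust the file
# name and the ICAO column name to whatever your copy actually uses
stations = read.csv("master_station_list.csv", colClasses = "character")

# The station file number is 'maslib' concatenated with 'wban'
# (keep everything as character so leading zeros are not dropped)
stations$Number = paste0(stations$maslib, stations$wban)

# Look up LAX by its ICAO code (US airport ICAO codes start with K)
subset(stations, icao == "KLAX", select = c(icao, Number))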

With your list of station numbers, which probably has to be built by hand by looking at stadium locations on Google Maps, you can then set up a script to read all the files for that year's weather in the cities you want and combine them. There is also some basic parsing and data cleaning that has to be done to get the final weather data out of the CDO flat files. I also like to average all observations taken within each hour, as I find the wind figures to be of higher quality that way. I attach a sample R script I use below, although there are probably better ways to do this.

library(sqldf)
library(lubridate)

b_file_list = read.csv("Home_Weather.csv")##csv file with station numbers you want in column 'Number'
b_file_list = subset(b_file_list,is.na(Number)==FALSE)

b_file_list = b_file_list$Number
setwd("../2018/") # directory with that year's numbered station data files

weather_data = read.csv(paste(b_file_list[1],".csv",sep=""))
weather_data = subset(weather_data,select=c(STATION,DATE,LATITUDE,LONGITUDE,ELEVATION,NAME,WND,TMP,SLP,AA1))
for (i in 2:length(b_file_list))
{
  weather_temp = read.csv(paste(b_file_list[i],".csv",sep=""))
  weather_temp = subset(weather_temp,select=c(STATION,DATE,LATITUDE,LONGITUDE,ELEVATION,NAME,WND,TMP,SLP,AA1))
  weather_data = rbind(weather_data, weather_temp)
}

weather_data = unique(weather_data)
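# Parse the packed ISD fields: TMP and SLP are reported in tenths
# (degrees C and hectopascals), the wind speed inside WND is in tenths
# of a meter per second, and AA1 carries the liquid precipitation depth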
weather_data$temp = as.numeric(substr(weather_data$TMP,1,5))/10
weather_data$press = as.numeric(substr(weather_data$SLP,1,5))/10
weather_data$precip = as.numeric(substr(weather_data$AA1,4,7))
weather_data$precip = ifelse(is.na(weather_data$precip)==TRUE,0,weather_data$precip)
weather_data$wind = substr(weather_data$WND,9,12)
weather_data$wind = as.numeric(weather_data$wind)/10
weather_data$Date_Day = substr(weather_data$DATE,1,10)
weather_data$Date_F = ymd(weather_data$Date_Day)
weather_data$Date_Hour = as.numeric(substr(weather_data$DATE,12,13))
#throw out garbage/errors (if there is an error, the value will usually be '9999')
weather_data = subset(weather_data,temp<100)
weather_data = subset(weather_data,wind<100)
#take the hourly average of all observations that hour and output
Hour_Avg = sqldf("select STATION,NAME,ELEVATION,LATITUDE,LONGITUDE,Date_F,Date_Hour,avg(temp) as temp, avg(press) as press,avg(precip) as precip,avg(wind) as wind from weather_data
group by STATION,NAME,ELEVATION,LATITUDE,LONGITUDE,Date_F,Date_Hour")
write.csv(Hour_Avg,"../Weather_2018_F.csv") ##Output file