Cleaning Berkeley Earth's BEST Gridded Daily Temperature Data

You may have recently seen air quality maps produced by the Berkeley Earth group, especially in the wake of the horrific Camp Fire whose death toll now exceeds 80. For example, here’s their real-time visualization of PM2.5 concentrations.

For several years, the Berkeley Earth team had primarily worked on producing gridded weather datasets with solid historical coverage, going back to 1850 in some cases. I’ve been particularly interested in their work given the global daily temperature datasets they have on tap – which are publicly available! This is all very exciting, given that many of the global datasets available, like Willmott and Matsuura (UDEL) and CRU, are only at monthly resolution. However, there’s something to be desired in making their offerings truly accessible to researchers:

Their netCDF files are packaged with “temperature anomaly” and “climatology” layers, but no “temperature” layer. This necessitates a bit of wrangling on your part to overcome some inherent dimensional consistency issues to create a temperature object. It would’ve struck me to provide the temperature values, and allow researchers to determine which climatology period they want to construct, not the reverse.
For example, the climatology is a 365-day stack. Great, except that the anomalies data is based on true dates and includes leap years. There’s not an immediate way to sum the two into temperature estimates. If you ignore the mismatch and are performing analysis that spans many decades, by the end of your time-series you’ll be off by several weeks (amateurish!).
The .nc’s are stripped of an informative time axis (see below figure). Instead, the files have date_number, day, month, and year attributes. Why couldn’t they have provided an out of the box date object, and left it to researchers to parse the month and year when necessary?
As a result, you wouldn’t even know whether leap dates are included unless you manually inspected the layer count for a decadal file.
And lastly, the absence of that time axis means you’re not able to discern dates when visually inspecting the data in a viewer like Panoply. If you wanted to see what the temperature [anomalies] were for Jan 12, 2013, then you’ll have to manually count how many days passed since the Jan 1, 2010 start date of that decadal extract.

![](https://i0.wp.com/anthonylouisdagostino.com///wp-content/uploads/2018/12/best_data.png?resize=712%2C520&ssl=1)

BEST temperature data does not come packaged with a useful time axis

I’ve spent a fair amount of time trying to address these issues, and wanted to share R code so that you don’t have to recreate the process. The only packages you’ll need to run this are raster and tidyverse.

In brief, this function 1.) reads in a BEST .nc file, which you may or may not have previously spatial subsetted, 2.) creates a stacked climatology that properly accounts for leap years by duplicating Feb 28 climatology values for Feb 29, and 3.) outputs temperature estimates with a date explicit axis.

  |  |  |
|---|---|
|  | returnBESTtemp <- function(tempFile) { |
|  |  |
|  | \# tempFile: complete path to a netcdf BEST temperature file of daily records \[need not be geographic subset\] |
|  |  |
|  | ave <- brick(tempFile, var = "climatology") |
|  | anomalies <- brick(tempFile, var = "temperature") |
|  |  |
|  | \# — deal with absence of leap year in anomaly series |
|  | ave.dates <- as\_date(as.numeric(unlist(lapply(strsplit(names(ave), "X"), function(x) x\[\[2\]\]))), origin = "1949-12-31") |
|  | which(ave.dates == "1950-02-28") \# 59th doy |
|  |  |
|  | leap.ave <- stack(ave\[\[1:59\]\], ave\[\[59\]\], ave\[\[60:365\]\]) \# leap year version duplicates climatology from feb 28 for leap day |
|  | leap.ave.bd <- format(as\_date(1:366, origin = "1951-12-31"), format = "%b-%d") |
|  |  |
|  | clima <- stack(ave, ave, leap.ave, ave) \# since starting with 1950, non-leap, non-leap, leap, non-leap, and then recycle.  |
|  | clima.fullsize <- stack(replicate(17, clima)) \# 17 iterations of this 4 year sequence, now comparable in dimensions to anomalies object  |
|  |  |
|  | dates <- as\_date(as.numeric(unlist(lapply(strsplit(names(anomalies), "\\\\."), function(x) x\[\[2\]\]))), origin = "1949-12-31") |
|  | years <- unique(year(dates)) |
|  | month.day <- format(dates, format="%b-%d") |
|  |  |
|  | \# — drop 2018, which is problematic because it's only a partial year |
|  | anomalies <- subset(anomalies, which(!year(dates) %in% 2018), drop = T) |
|  | dates <- dates\[!year(dates) %in% 2018\] |
|  |  |
|  | print(length(dates) == dim(anomalies)\[3\]) |
|  | temp <- anomalies + clima \# recycles to length of anomalies  |
|  |  |
|  | \# — perform spot checks to ensure values are properly summed  |
|  | test.layers <- floor(runif(20) \* dim(anomalies)\[3\]) %>% unique() |
|  | anomalies.test <- anomalies\[\[test.layers\]\] |
|  | monthday.test <- month.day\[test.layers\] |
|  | indices <- unlist(lapply(monthday.test, function(x) which(leap.ave.bd == x))) \# julian day values  |
|  |  |
|  | \# — these all check out — satisfied it's taking the correct sums — #  |
|  | sum.random <- anomalies\[\[test.layers\]\] + leap.ave\[\[indices\]\] \# manual combination of anomalies layer and climatology |
|  | true.values <- temp\[\[test.layers\]\] \# our combined temperature values layer  |
|  | min(true.values == sum.random) |
|  |  |
|  |  |
|  | temp <- setZ(temp, dates) |
|  | names(temp) <- as.character(dates) |
|  |  |
|  | temp |
|  | } |

 [view raw](https://gist.github.com/a8dx/acfb8181ebcdb5375c4d085064f22029/raw/3c166b63e2016e80412e3e440fa3222ef127dfed/clean_BEST_data.r) [ clean\_BEST\_data.r ](https://gist.github.com/a8dx/acfb8181ebcdb5375c4d085064f22029#file-clean_best_data-r) hosted with ❤ by [GitHub](https://github.com) 

You’ll notice some hard coded elements, such as the origin date and whether to drop partial years (since BEST is continuously updating their data, a portion of the current year will always be included in the most recent decadal extract). I’ve also included some tests to convince myself that the summed product mirrors values obtained from a manually searched climatology layer combined with the anomalies layer for a specified date.