Default MHW outputs in R and Python

Last updated: 2019-05-06

workflowr checks: (Click a bullet for more information)

✖ R Markdown file: uncommitted changes
The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(666)

The command set.seed(666) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

✔ Repository version: b1a7e3f

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/global/
    Ignored:    data/global_KS_cat.Rdata
    Ignored:    data/global_KS_clim.Rdata
    Ignored:    data/global_KS_event.Rdata
    Ignored:    data/global_dec_trend.Rdata
    Ignored:    data/global_effect_cat.Rdata
    Ignored:    data/global_effect_clim.Rdata
    Ignored:    data/global_effect_event.Rdata
    Ignored:    data/global_effect_event_length_slope.Rdata
    Ignored:    data/global_effect_event_slope.Rdata
    Ignored:    data/sst_ALL_add_trend.Rdata
    Ignored:    data/sst_ALL_aov_tukey.Rdata
    Ignored:    data/sst_ALL_clim_event_cat.Rdata
    Ignored:    data/sst_ALL_clim_event_cat_fix.Rdata
    Ignored:    data/sst_ALL_flat.Rdata
    Ignored:    data/sst_ALL_knockout.Rdata
    Ignored:    data/sst_ALL_length.Rdata
    Ignored:    data/sst_ALL_length_width_10.Rdata
    Ignored:    data/sst_ALL_length_width_20.Rdata
    Ignored:    data/sst_ALL_length_width_30.Rdata
    Ignored:    data/sst_ALL_length_width_40.Rdata
    Ignored:    data/sst_ALL_length_width_50.Rdata
    Ignored:    data/sst_ALL_missing.Rdata
    Ignored:    data/sst_ALL_missing_fix.Rdata
    Ignored:    data/sst_ALL_repl.Rdata
    Ignored:    data/sst_ALL_trended.Rdata

Unstaged changes:
    Modified:   analysis/Climatologies_and_baselines.Rmd
    Modified:   analysis/best_practices.Rmd
    Modified:   analysis/r_vs_python.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
html	38559da	robwschlegel	2019-03-19	Build site.
Rmd	970b22c	robwschlegel	2019-03-19	Publish the vignettes from when this was a pkgdown framework
html	fa7fd57	robwschlegel	2019-03-19	Build site.
Rmd	64ac134	robwschlegel	2019-03-19	Publish analysis files

Overview

The purpose of this vignette is to walk one through how to calculate MHWs in both the R and Python languages. It will use code from both languages throughout. The secondary purpose of this vignette is to show that the outputs are identical. The tertiary purpose of this vignette is to run benchmarks on the code to compare their speeds.

Calculating events

First up, let’s look at the two different ways in which one may calculate MHWs. First we will look at the R code, then Python.

R code

Setup

The basic functionality for calculating marine heatwaves (MHWs) in R may be found in the heatwaveR package that may currently be downloaded and installed with the following lines of code. Note that if one has already installed these packages they do not need to be installed again. To run one of the lines in the following code chunk will require that the hash-tag first be removed.

## This is an R chunk ##

# install.packages("devtools")
## Development version from GitHub
# devtools::install_github("robwschlegel/heatwaveR") 
## Stable version from CRAN
# install.packages("heatwaveR")

With the necessary packages installed, we activate heatwaveR with the following line of code.

## This is an R chunk ##

library(heatwaveR)

Default output

With everything downloaded and ready for us to use we may now calculate some events. The heatwaveR package has three built in time series (sst_WA, sst_Med, sst_NW_Atl) that we may use to more easily demonstrate how the code works. The general pipeline in R for calculating events is first to create a climatology from a chosen time series using ts2clm(), and then to feed that output into detect_event(), as seen below.

## This is an R chunk ##

# First we create a default climatology as outlined in Hobday et al. (2016)
  # NB: We use a longer climatology period here as this matches the Python default
ts_clim <- ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))

# Then we feed that object into the second and final function
res_r <- detect_event(ts_clim)

To look at these outputs we would use the following options. For now we’ll just look at the event output.

## This is an R chunk ##

# Look at the top six rows of the first 6 columns of the events
res_r$event[1:6, 1:6]

# Or perhaps the most intense event
res_r$event[res_r$event$intensity_max == max(res_r$event$intensity_max), 1:6]

Python code

Setup

To download and install the Python package for calculating MHWs one may run the following line of code in a console:

## This is a bash chunk ##

# Note that the hashtag must be removed in orde for the code to run
# pip install git+git://github.com/ecjoliver/marineHeatWaves@master

Or simply download the GitHub repository and follow the instructions there for downloading. Or if still lost, phone a friend!

Before we begin the calculations we need to create a time series. I’ve chosen to take the sst_WA time series from the heatwaveR package to ensure consistency between the results. The python code prefers to have the dates and the temperatures fed into it as two separate vectors, so I will first split these up in R before we feed them to Python.

## This is an R chunk ##

# Get the built-in time series from heatwaveR
ts <- sst_WA

# Take only the temperature column
sst <- ts$temp

# Take only date column
dates <- ts$t

# If one wants to save the sst values to load into Python
write.csv(ts$temp, "data/sst_WA.csv", row.names = FALSE)

Default calculations

With the package installed, and our time series prepped, the basic MHW calculation is performed as follows:

## This is a Python chunk ##
# Required for data prep and export
import numpy as np
from datetime import date
import pandas as pd
# The MHW functions
import marineHeatWaves as mhw
# The date values
  # We could take the date values we created above,
  # But let's rather create the date values in Python
t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)
# The temperature values
  # Note that to fetch an object created in R we use 'r.*'
  # Where '*' is the name of the R object
sst = np.asarray(r.sst)
# Or if one rather wants to load the data from a csv file
# sst = np.loadtxt("data/sst_WA.csv", delimiter=',', skiprows=1) # because a heading row needs to be skipped
# The event metrics
mhws, clim = mhw.detect(t, sst)
# If one wants to save the results as csv files
# Save event results
mhws_df = pd.DataFrame.from_dict(mhws)
mhws_df.to_csv('data/mhws_py.csv', sep = ',', index=False)
# 
# Save climatology results
clim_df = pd.DataFrame.from_dict(clim)
clim_df.to_csv('data/clim_py.csv', sep = ',', index=False)

Reticulate code

It is also possible to run the Python code through R with the use of the R package reticulate. This is particularly useful as it allows us to perform the comparisons and benchmarking all within the same language.

Setup

Here we load the reticulate package and choose the conda environment I’ve already created called py27. For help on how to set up a conda environment go here. I’ve ensured that all of the required python modules are installed within this environment.

## This is an R chunk ##

# install.packages("reticulate")
library(reticulate)
use_condaenv("py27")

Once we’ve told R which version/environment we would like to use for Python we may then load the necessary modules.

## This is an R chunk ##

np <- import("numpy")
datetime <- import("datetime")
mhw <- import("marineHeatWaves")

One may run python code in it’s native form within R by passing it as a character vector to py_run_string().

## This is an R chunk ##

py_run_string("t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)")
py_run_string("sst = np.loadtxt(open('data/sst_WA.csv', 'r'), delimiter=',', skiprows=1)")
# py_run_string("sst.flags.writeable = True")
# py_run_string("sst.setflags(write = True)")
# py_run_string("sst = np.array(sst).copy()")

Or we may just create the objects in native R.

## This is an R chunk ##

# These numbers were taken from print(t) in the python code above
t <- as.integer(np$array(seq(723546, 735598)))
sst <- np$array(sst_WA$temp)

Default calculations

Either way, once we have created the necessary parts of the time series in either language, we may call the Python functions in R directly with the line of code below.

## This is an R chunk ##

# The following line appears not to run because the numpy array that Python expects is given to it by R in a read-only format...
# Thus far I've not found a way to correct for this.
# So it appears that one may pass R objects to Python code directly, but not to reticulate code...
res_python <- mhw$detect(t = t, temp = sst)

Comparisons

With some climatologies and events calculated with both languages we may now compare the results of the two. I’ll do so here natively in R to avoid any potential hiccups from translating across languages. I’ll create the R output here, and load the Python output created in the Python code chunk above.

## This is an R chunk ##

# Load libraries for data manipulation and visualisation
library(tidyverse)

# Set R results
res_event_R <- res_r$event
res_clim_R <- res_r$climatology

# Set Python results
res_event_Python <- read_csv("data/mhws_py.csv")
res_clim_Python <- read_csv("data/clim_py.csv")

With these default results loaded in the same format in the same language we can now start to look at how they compare. For starters I am just doing some simple sums and correlations.

Climatologies

## This is an R chunk ##

cor(res_clim_R$seas, res_clim_Python$seas)
sum(res_clim_R$seas) - sum(res_clim_Python$seas)
cor(res_clim_R$thresh, res_clim_Python$thresh)
sum(res_clim_R$thresh) - sum(res_clim_Python$thresh)

The seasonal and threshold climatologies correlate very well, but adding them up we see that the Python values are consistently higher. This is due to rounding differences between the languages. Even though the sum difference between the thresholds may look large at a value of ~6.5, considering this is over 12053 days of data, the difference is only ~0.0005/day. So I think it is safe to move on and look at the events.

February 29th

It was also recently brought to my attention by Zijie Zhao, author of the MATLAB distribution for MHW code, that Python and R calculate February 29th differently. With Python calculating the climatologies first, and then filling in the missing days, whereas R calculates February 29th first, and then the climatologies from that. So with that being said let’s have a peek at the difference this may cause.

## This is an R chunk ##

res_clim_R_29 <- res_clim_R %>% 
  filter(doy == 60) %>% 
  select(seas, thresh) %>% 
  distinct()
res_clim_Python_29 <- res_clim_Python %>% 
  filter(res_clim_R$doy == 60) %>% 
  select(seas, thresh) %>% 
  distinct()

res_clim_R_29$seas - res_clim_Python_29$seas
res_clim_R_29$thresh - res_clim_Python_29$thresh

Another thing pointed out by Zijie is that Python and MATLAB calculate percentiles differently, this then is likely also true for R and Python. This would explain why the seasonal climatologies (seas) are always much closer than the thresholds (thresh) for all of these tests.

Events

Below we have a for loop that will run a correlation on all rows with the same name, as well as find the sum difference between them.

## This is an R chunk ##

# Remove non-numeric columns
res_event_num <- res_event_R %>% 
  select_if(is.numeric)

# Run the loop
res_event <- data.frame()
for(i in 1:length(colnames(res_event_num))){
  if(colnames(res_event_num)[i] %in% colnames(res_event_Python)){
    x1 <- res_event_R[colnames(res_event_R) == colnames(res_event_num)[i]]
    x2 <- res_event_Python[colnames(res_event_Python) == colnames(res_event_num)[i]]
    x <- data.frame(r = cor(x1, x2, use = "complete.obs"),
                    difference = round(sum(x1, na.rm = T) - sum(x2, na.rm = T), 4),
                    var = colnames(res_event_num)[i])
    colnames(x)[1] <- "r"
    rownames(x) <- NULL
    } else {
      x <- data.frame(r = NA, difference = NA, var = colnames(res_event_num)[i])
      }
  res_event <- rbind(res_event, x)
  }
na.omit(res_event)

It is a bit odd that there is such a large difference between intensity_var for the two language. This implies that R is detecting larger differences in the intensity around the threshold than Python is. This may again be due to the differences in thresholds being detected, but that makes a rather small difference. It is perhaps more likely that variance is calculated slightly differently between languages. Looking at the source code one sees that this is calculated by first finding the variance of the intensity, and then the square root of that. These two calculations likely differ enough between the languages to be causing this. Ultimately it is of little concern as the variance values found throughout both packages are not of much interest to anyone.

We may also see that the difference in intensity_max_abs is a bit higher in Python than in R. This is rather surprising as intensity_max_abs simply shows the maximum temperature (independent of any thresholds etc.) experienced during that event. And considering that these calculations were performed on the exact same time series, intensity_max_abs should always be exactly the same between the two languages. Looking at the climatology output one sees that the temperature is still correct, meaning that the difference must occur somewhere within the R function detect_event. Looking at the R source code we see that intensity_max_abs is calculated on line 298, for Python this is calculated on line 376. The reason for the difference is because R is looking for the maximum temperature throughout all dates for a given event, whereas Python is simply taking the temperature on the peak date of the event. This is a subtle but significant difference as this means this value is showing two different things. I’m not certain which is more appropriate, though I’m leaning towards the Python implementation.

With all of our overlapping columns compared, and our differences and Pearson r values looking solid, let’s finish off this basic comparison by finding which columns are not shared between the different language outputs. Also, please note that the apparently large differences between the index_start, index_peak, and index_end values are due to the different indexing methods of R and Python. R starts at 1 and Python at 0. The

## This is an R chunk ##

cols_R <- colnames(res_event_R)[!(colnames(res_event_R) %in% colnames(res_event_Python))]
cols_R
cols_Py <- colnames(res_event_Python)[!(colnames(res_event_Python) %in% colnames(res_event_R))]
cols_Py

Wonderful! Almost everything matches up exactly. The duration of categories of events is something added in R by another function category(), and will be left that way for now. The “time” columns in the Python output aren’t relevant as far as I can see in the present usage as these functions currently only take day values. The different methods of labelling the events will be left as they are for now as well.

It is also worth noting that the values for index_start, index_peak, and index_end are off by one between the two languages. This is due to the difference in indexing between the languages. Looking at the date_start, date_peak, and date_end values we see that they are the same.

Missing data

A bug was discovered in v0.3.3 of the R code with regards to the handling of missing data. Specifically, in the move to a C++ back-end, the R code stopped being able to handle missing data for the climatology calculations. This has since been corrected from v0.3.4 onwards, but it is worthwhile to ensure that missing data are handled the same way.

First we’ll create the dataframe with missing data:

## This is an R chunk ##

library(padr)
library(lubridate)
random_knockout <- function(df, prop){
  res <- df %>% 
    sample_frac(size = prop) %>% 
    arrange(t) %>%
    pad(interval = "day")
  return(res)
}

# Remove 10% of the data randomly
sst_WA_miss_random <- random_knockout(sst_WA, 0.9)
write.csv(sst_WA_miss_random$temp, file = "data/sst_WA_miss_random.csv", row.names = F, na = "NaN")

# Remove simulated ice cover
sst_WA_miss_ice <- sst_WA %>% 
  mutate(month = month(t, label = T),
         temp = ifelse(month %in% c("Jan", "Feb", "Mar"), NA, temp)) %>% 
  select(-month)
write.csv(sst_WA_miss_ice$temp, file = "data/sst_WA_miss_ice.csv", row.names = F, na = "NaN")

With the missing data saved, we can now calculate and compare the climatologies that the two different languages will create.

## This is a Python chunk ##
# Required for data prep and export
import numpy as np
from datetime import date
import pandas as pd
# The MHW functions
import marineHeatWaves as mhw
# The date values
t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)
# The temperature values
sst_random = np.loadtxt("data/sst_WA_miss_random.csv", delimiter = ',', skiprows = 1)
sst_ice = np.loadtxt("data/sst_WA_miss_ice.csv", delimiter = ',', skiprows = 1)
# The event metrics
mhws_random, clim_random = mhw.detect(t, sst_random)
# It appears as though the Python code can't deal with ice coverage...
#mhws_ice, clim_ice = mhw.detect(t, sst_ice)
# Save climatology results
clim_random_df = pd.DataFrame.from_dict(clim_random)
clim_random_df.to_csv('data/clim_py_random.csv', sep = ',', index = False)
mhws_random_df = pd.DataFrame.from_dict(mhws_random)
mhws_random_df.to_csv('data/mhws_py_random.csv', sep = ',', index = False)
#clim_ice_df = pd.DataFrame.from_dict(clim_ice)
#clim_ice_df.to_csv('data/clim_ice_py.csv', sep = ',', index = False)

With the Python climatologies and events calculated from the missing data we’ll now do the same for R.

## This is an R chunk ##

# Load and prep missing data
sst_WA_miss_random <- read_csv("data/sst_WA_miss_random.csv") %>% 
  dplyr::rename(temp = x) %>% 
  mutate(t = seq(as.Date("1982-01-01"), as.Date("2014-12-31"), by = "day")) %>% 
  select(t, temp)
sst_WA_miss_random$temp[is.nan(sst_WA_miss_random$temp)] <- NA

# caluclate R climatologies/events
res_random_R <- detect_event(ts2clm(sst_WA_miss_random, maxPadLength = 10,
                                    climatologyPeriod = c("1982-01-01", "2014-12-31")))
res_clim_random_R <- res_random_R$climatology
res_event_random_R <- res_random_R$event

# Load Python results
res_clim_random_Python <- read_csv("data/clim_py_random.csv")
res_event_random_Python <- read_csv("data/mhws_py_random.csv")

And then the comparisons of the seas and thresh values.

## This is an R chunk ##

# Compare clims/thresholds
cor(res_clim_random_R$seas, res_clim_random_Python$seas)
mean(res_clim_random_R$seas) - mean(res_clim_random_Python$seas)
cor(res_clim_random_R$thresh, res_clim_random_Python$thresh)
mean(res_clim_random_R$thresh) - mean(res_clim_random_Python$thresh)

We may see from the results above that the values still correlate very strongly, if slightly less so than with no missing data. We see again that the Python values are higher overall for both the seasonal climatology and the threshold. This time however the differences are much larger. The seasonal climatology is off by ~0.0004/day, which is acceptable, but the 90th percentile threshold is off by ~0.0695. This is too large of a difference and this issue will need to be investigated. It is most likely due to the difference in how quantiles are calculated between the languages though and so there may be little to do about it. Perhaps more importantly is how these differences may affect the events being detected.

## This is an R chunk ##

nrow(res_event_random_R)
nrow(res_event_random_Python)

## This is an R chunk ##

# Remove non-numeric columns
res_event_random_num <- res_event_random_R %>% 
  select_if(is.numeric)

# Run the loop
res_event_random <- data.frame()
for(i in 1:length(colnames(res_event_random_num))){
  if(colnames(res_event_random_num)[i] %in% colnames(res_event_random_Python)){
    x1 <- res_event_random_R[colnames(res_event_random_R) == colnames(res_event_random_num)[i]]
    x2 <- res_event_random_Python[colnames(res_event_random_Python) == colnames(res_event_random_num)[i]]
    x <- data.frame(difference = round(sum(x1, na.rm = T) - sum(x2, na.rm = T), 4),
                    var = colnames(res_event_random_num)[i])
    # colnames(x)[1] <- "r"
    rownames(x) <- NULL
    } else {
      x <- data.frame(difference = NA, var = colnames(res_event_random_num)[i])
      }
  res_event_random <- rbind(res_event_random, x)
  }
na.omit(res_event_random)

Because the count of events detected is not the same, we cannot run correlations. We can sum up the differences to get an idea of how far off things are, and we can see that the results are troubling. The differences in the index_* values can be discounted as we have already established that we have a different number of events. The next largest difference is that for intensity_cumulative_abs. Looking at the one event that is the same between the languages one may see that this value is the same for both, meaning that we may discount this difference as it is not due to calculation differences, but rather also due to the differing number of events. Regardless, it is an indicator that overall the much lower threshold in the R language is giving us results that look much more dire than the Python language. This is problematic and stems from the quantile calculation issue.

I looked into this in quite some detail and it appears to stem not only from the calculation of quantiles being different, but also from the R quantile calculations never giving more than one additional decimal place of precision over the initial data, whereas the Python results float near infinity. This explains why the R results always tend to be lower, but does not account for why the presence of missing data has such a dramatic effect.

## This is an R chunk ##

clim_only_random_R <- res_clim_random_R %>% 
  select(doy, seas, thresh) %>% 
  distinct() %>% 
  filter(doy != 60)

clim_only_random_Python <- res_clim_random_Python %>% 
  slice(1:365)

cor(clim_only_random_R$seas, clim_only_random_Python$seas)
sum(clim_only_random_R$seas) - sum(clim_only_random_Python$seas)
cor(clim_only_random_R$seas, clim_only_random_Python$seas)
sum(clim_only_random_R$thresh) - sum(clim_only_random_Python$thresh)

Looking at only the values for one seasonal climatology or threshold cycle we see that the differences are the same as when we compare all of the data. I don’t know why they wouldn’t be… but I wanted to check.

After having gone over the R code very thoroughly I think the next course of action is to go through the step by step output of the Python code to see where exactly these changes begin to occur. It could be in the following places:

clim_calc_cpp: The calculation of quantile values
- The rounding of the results to 0.001 will have an influence, but can’t explain the size of the difference when missing data are present
smooth_percentile: RcppRoll::roll_mean is used for the rolling means and this may differ greatly from Python when missing data are present
- Line 290 in the Python code is where the function runavg is used for the same purpose

Benchmarks

The final thing we want to look at in this vignette is the speed differences in calculating MHWs between the two languages.

R

library(microbenchmark)
# The old R code
library(RmarineHeatWaves)
microbenchmark(detect(make_whole(data = sst_WA), climatology_start = "1982-01-01", climatology_end = "2014-12-31"), times = 10)
# The new R code
microbenchmark(detect_event(ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))), times = 10)

Unsurprisingly, the average for running the heatwave analysis on heatwaveR 10 times comes out at ~0.194 seconds per calculations, which is faster than the average of ~0.730 with RmarineHeatWaves.

Python

import time
total = 0
for i in range(10):
    start = time.clock()
    mhws, clim = mhw.detect(t, sst)
    end = time.clock()
    total += end-start
bench_py = total/10
print(bench_py)

The average speed for running the Python code 10 times comes out at ~0.175 seconds, which is slightly faster than R. And the Python code is also calculating the event categories, which R is not. That is done in an additional step with category(). The R code is however allowing for a wider range of options for climatologies, and may be more easily multi-cored, as shown in this vignette.

Conclusion

Overall the basic applications of these two languages provide near identical results, and nearly the same computational time. I would not say that one is better enough than the other to warrant switching between languages. Rather this vignette supports the argument that people collaborating on a research project while using R and Python do not need to worry about their results. The exception currently being that large amounts of missing data appear to be influencing the calculation of the 90th percentile threshold differently. This is something that will be investigated further and corrected as necessary.

Session information

sessionInfo()

This reproducible R Markdown analysis was created with workflowr 1.1.1

Default MHW outputs in R and Python

Robert W Schlegel

2019-05-06

Overview

Calculating events

R code

Setup

Default output

Python code

Setup

Default calculations

Reticulate code

Setup

Default calculations

Comparisons

Climatologies

February 29th

Events

Missing data

Benchmarks

R

Python

Conclusion

Session information