Default MHW outputs in R and Python

Overview

The purpose of this vignette is to walk one through how to calculate MHWs in both the R and Python languages. It will use code from both languages throughout. The secondary purpose of this vignette is to show that the outputs are identical. The tertiary purpose of this vignette is to run benchmarks on the code to compare their speeds.

Calculating events

First up, let’s look at the two different ways in which one may calculate MHWs. First up, the R code, then Python.

R code

Setup

The basic functionality for calculating marine heatwaves (MHWs) in R may be found in the heatwaveR package that may currently be downloaded and installed with the following lines of code. Note that if one has already installed these packages they do not need to be installed again. To run one of the lines in the following code chunk will require that the hashtag first be removed.

# install.packages("devtools")
## Development version from GitHub
# devtools::install_github("robwschlegel/heatwaveR") 
## Stable version from rOpenSci
## Currently not available
## Stable version from CRAN
# install.packages("heatwaveR") # Currently not available

With the necessary packages installed, we activate heatwaveR with the following line of code.

library(heatwaveR)

Default output

With everything downloaded and ready for us to use we may now calculate some events. The heatwaveR package has three built in time series (sst_WA, sst_Med, sst_NW_Atl) that we may use to more easily demonstrate how the code works. The general pipeline in R for calculating events is first to create a climatology from a chosen time series using ts2clm(), and then to feed that ooutput into detect_event(), as seen below.

# First we create a default climatology as outlined in Hobday et al. (2016)
ts_clim <- ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))

# Then we feed that object into the second and final function
res <- detect_event(ts_clim)

To look at these outputs we would use the following options. For now we’ll just look at the event output.

# Look at the top six rows of the first 6 columns of the events
res$event[1:6, 1:6]

## # A tibble: 6 x 6
##   event_no index_start index_peak index_end duration date_start
##      <int>       <dbl>      <int>     <dbl>    <int> <date>    
## 1        1         884        887       895       12 1984-06-02
## 2        2         899        901       904        6 1984-06-17
## 3        3         908        922       926       19 1984-06-26
## 4        4        1008       1010      1012        5 1984-10-04
## 5        5        1023       1027      1029        7 1984-10-19
## 6        6        1033       1034      1052       20 1984-10-29

# Or perhaps the most intense event
res$event[res$event$intensity_max == max(res$event$intensity_max), 1:6]

## # A tibble: 1 x 6
##   event_no index_start index_peak index_end duration date_start
##      <int>       <dbl>      <int>     <dbl>    <int> <date>    
## 1       42       10629      10651     10689       61 2011-02-06

Python code

Setup

To download and install the Python package for calculating MHWs one may run the following line of code in a console:

# Note that the hashtag must be removed in order
# pip install git+git://github.com/ecjoliver/marineHeatWaves@master

Or simply download the GitHub repository and follow the instructions there for downloading. Or if still lost, phone a friend!

Before we begin the calculations we need to crerate a time series. I’ve chosen to take the sst_WA time series from the heatwaveR package to ensure consistency between the results. For ease of use I simply saved this object as a csv file on my computer and loaded it back in to Python.

ts <- sst_WA
# Write out only a vector of temperatures
write.csv(ts$temp, "data/sst_WA.csv", row.names = FALSE)

Default calculations

With the package installed it is then activated and run as follows:

# Required for data prep and export
import numpy as np
from datetime import date
import pandas as pd
# The MHW functions
import marineHeatWaves as mhw
# The date values
t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)
# The temperature values
sst = np.loadtxt(open("data/sst_WA.csv", "r"), delimiter=',', skiprows=1) # because a heading row needs to be skipped
# The event metrics
mhws, clim = mhw.detect(t, sst)
# Save event results
mhws_df = pd.DataFrame.from_dict(mhws)
mhws_df.to_csv('data/mhws_py.csv', sep = ',', index=False)
# Save climatology results
clim_df = pd.DataFrame.from_dict(clim)
clim_df.to_csv('data/clim_py.csv', sep = ',', index=False)

Reticulate code

It is also possible to run the Python code through R with the use of the R package reticulate. This is particularly useful as it allows us to perform the comparisons and benchmarking all within the same language.

Setup

Here we load the reticulate package and choose the conda environment I’ve already created called py27. For help on how to set up a conda environment go here. I’ve ensured that all of the required python modules are installed within this environment.

# install.packages("reticulate")
library(reticulate)
use_condaenv("py27")

Once we’ve told R which version/environment we would like to use for Python we may then load the necessary modules.

np <- import("numpy")
datetime <- import("datetime")
mhw <- import("marineHeatWaves")

One may run python code in it’s native form within R by passing it as a character vector to py_run_string().

py_run_string("x = 10")
py_run_string("t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)")
py_run_string("sst = np.loadtxt(open('data/sst_WA.csv', 'r'), delimiter=',', skiprows=1)")

Default calculations

# These numbers were taken from print(t) in the python code above
t <- as.integer(np$array(seq(723546, 735598)))
sst <- np$array(sst_WA$temp)
res_python <- mhw$detect(t = py$t, temp = py$sst) # There appears to be an Rcpp issue preventing this from handshaking...

## Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: integer argument expected, got float
## 
## Detailed traceback: 
##   File "/home/rws/anaconda2/envs/py27/lib/python2.7/site-packages/marineHeatWaves.py", line 188, in detect
##     year[i] = date.fromordinal(t[i]).year

# The reticulate package is still in development...

Comparisons

With some climatologies and events calculated with both languages we may now compare the results of the two. I’ll do so here natively in R to avoid any potential hiccups from translating across languages. I’ll create the R output here, and load the Python output created above.

# Load libraries for data manipulation and visualisation
library(tidyverse)

# Set R results
res_event_R <- res$event
res_clim_R <- res$climatology

# Set Python results
res_event_Python <- read_csv("data/mhws_py.csv")
res_clim_Python <- read_csv("data/clim_py.csv")

With these default results loaded in the same format in the same language we can now start to look at how they stack up. For starters I am just doing some simple correlations.

Climatologies

cor(res_clim_R$seas, res_clim_Python$seas)

## [1] 0.9999999

cor(res_clim_R$thresh, res_clim_Python$thresh)

## [1] 0.9999996

That looks about right. Good to see things getting off to a good start. Now for the events.

Events

First we peak at a few specific values of interest.

cor(res_event_R$intensity_max, res_event_Python$intensity_max)

## [1] 0.9999997

cor(res_event_R$intensity_var, res_event_Python$intensity_var)

## [1] 0.999608

cor(res_event_R$rate_onset, res_event_Python$rate_onset)

## [1] 1

cor(res_event_R$rate_decline, res_event_Python$rate_decline)

## [1] 0.9999999

With that looking fine, we next run a for loop that will run a correlation on all rows with the same name.

# Remove non-numeric columns
res_event_num <- res_event_R %>% 
  select_if(is.numeric)

# Run the loop
res_event_cor <- data.frame()
for(i in 1:length(colnames(res_event_num))){
  if(colnames(res_event_num)[i] %in% colnames(res_event_Python)){
    x1 <- res_event_R[colnames(res_event_R) == colnames(res_event_num)[i]]
    x2 <- res_event_Python[colnames(res_event_Python) == colnames(res_event_num)[i]]
    x <- data.frame(r = cor(x1, x2), var = colnames(res_event_num)[i])
    colnames(x)[1] <- "r"
    rownames(x) <- NULL
  } else {
    x <- data.frame(r = NA, var = colnames(res_event_num)[i])
    # colnames(x) <- colnames(res_event_R)[i]
  }
  res_event_cor <- rbind(res_event_cor, x)
}
res_event_cor

##            r                            var
## 1         NA                       event_no
## 2  1.0000000                    index_start
## 3  1.0000000                     index_peak
## 4  1.0000000                      index_end
## 5  1.0000000                       duration
## 6  0.9999991                 intensity_mean
## 7  0.9999997                  intensity_max
## 8  0.9996080                  intensity_var
## 9  1.0000000           intensity_cumulative
## 10 0.9999971       intensity_mean_relThresh
## 11 0.9999994        intensity_max_relThresh
## 12 0.9995685        intensity_var_relThresh
## 13 0.9999996 intensity_cumulative_relThresh
## 14 1.0000000             intensity_mean_abs
## 15 0.9990733              intensity_max_abs
## 16 0.9997267              intensity_var_abs
## 17 1.0000000       intensity_cumulative_abs
## 18 1.0000000                     rate_onset
## 19 0.9999999                   rate_decline

With all of our overlapping columns compared, and our Pearson r values looking solid, let’s finish off this basic comparison by finding which columns are not shared between the different language outputs.

cols_R <- colnames(res_event_R)[!(colnames(res_event_R) %in% colnames(res_event_Python))]
cols_R

## [1] "event_no"

cols_Py <- colnames(res_event_Python)[!(colnames(res_event_Python) %in% colnames(res_event_R))]
cols_Py

## [1] "category"          "duration_extreme"  "duration_moderate"
## [4] "duration_severe"   "duration_strong"   "n_events"         
## [7] "time_end"          "time_peak"         "time_start"

Wonderful! Most things match up nicely. The duration of categories of events is something added in R by another function category(), and will be left that way for now. The “time” columns in the Python output aren’t relavent as far as I can see in the present usage as these functions currently only take day values. THe different methods of labelling the events will be left as they are for now as well.

It is also worth noting that the values for index_start, index_peak, and index_end are off by one between the two languages. This is due to the difference in indexing between the languages. Looking at the date_start, date_peak, and date_end values we see that they are the same.

Benchmarks

The final thing we want to look at in this vignette is the speed differences in calculating MHWs between the two languages.

R

system.time(detect_event(ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))))

##    user  system elapsed 
##   0.572   0.012   0.585

Python

import time
t1 = time.time()
mhws, clim = mhw.detect(t, sst)
t2 = time.time()
print(t2-t1)

## 0.162385940552

The Python code appears to be just a bit faster at ~0.18 seconds to ~0.20 seconds for R. And it is calculating the event categories as well, which R is not. That is done in an additonal step with category().

Robert W Schlegel

2018-05-29

Overview

Calculating events

R code

Setup

Default output

Python code

Setup

Default calculations

Reticulate code

Setup

Default calculations

Comparisons

Climatologies

Events

Benchmarks

R

Python

Contents