Last updated: 2019-03-19

workflowr checks:
  • R Markdown file: up-to-date

    Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

  • Environment: empty

    Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

  • Seed: set.seed(666)

    The command set.seed(666) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

  • Session information: recorded

    Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

  • Repository version: 970b22c

    Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

    Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
    
    Ignored files:
        Ignored:    .Rhistory
        Ignored:    .Rproj.user/
        Ignored:    data/sst_ALL_clim_only.Rdata
        Ignored:    data/sst_ALL_event_aov_tukey.Rdata
        Ignored:    data/sst_ALL_flat.Rdata
        Ignored:    data/sst_ALL_miss.Rdata
        Ignored:    data/sst_ALL_miss_cat_chi.Rdata
        Ignored:    data/sst_ALL_miss_clim_event_cat.Rdata
        Ignored:    data/sst_ALL_miss_clim_only.Rdata
        Ignored:    data/sst_ALL_miss_event_aov_tukey.Rdata
        Ignored:    data/sst_ALL_repl.Rdata
        Ignored:    data/sst_ALL_smooth.Rdata
        Ignored:    data/sst_ALL_smooth_aov_tukey.Rdata
        Ignored:    data/sst_ALL_smooth_event.Rdata
        Ignored:    data/sst_ALL_trend.Rdata
        Ignored:    data/sst_ALL_trend_clim_event_cat.Rdata
    
    Untracked files:
        Untracked:  analysis/bibliography.bib
        Untracked:  code/functions.R
        Untracked:  code/workflow.R
        Untracked:  data/.gitignore
        Untracked:  data/clim_py.csv
        Untracked:  data/clim_py_joinAG_1.csv
        Untracked:  data/clim_py_joinAG_5.csv
        Untracked:  data/clim_py_joinAG_no.csv
        Untracked:  data/clim_py_minD_3.csv
        Untracked:  data/clim_py_minD_7.csv
        Untracked:  data/clim_py_pctile_80.csv
        Untracked:  data/clim_py_pctile_95.csv
        Untracked:  data/clim_py_pctile_99.csv
        Untracked:  data/clim_py_random.csv
        Untracked:  data/clim_py_spw_11.csv
        Untracked:  data/clim_py_spw_51.csv
        Untracked:  data/clim_py_spw_no.csv
        Untracked:  data/clim_py_whw_3.csv
        Untracked:  data/clim_py_whw_7.csv
        Untracked:  data/mhwBlock.csv
        Untracked:  data/mhws_py.csv
        Untracked:  data/mhws_py_joinAG_1.csv
        Untracked:  data/mhws_py_joinAG_5.csv
        Untracked:  data/mhws_py_joinAG_no.csv
        Untracked:  data/mhws_py_minD_3.csv
        Untracked:  data/mhws_py_minD_7.csv
        Untracked:  data/mhws_py_pctile_80.csv
        Untracked:  data/mhws_py_pctile_95.csv
        Untracked:  data/mhws_py_pctile_99.csv
        Untracked:  data/mhws_py_random.csv
        Untracked:  data/mhws_py_spw_11.csv
        Untracked:  data/mhws_py_spw_51.csv
        Untracked:  data/mhws_py_spw_no.csv
        Untracked:  data/mhws_py_whw_3.csv
        Untracked:  data/mhws_py_whw_7.csv
        Untracked:  data/sst_ALL.Rdata
        Untracked:  data/sst_ALL_KS.Rdata
        Untracked:  data/sst_ALL_cat_chi.Rdata
        Untracked:  data/sst_ALL_clim_category_count.Rdata
        Untracked:  data/sst_ALL_con.Rdata
        Untracked:  data/sst_ALL_detrend.Rdata
        Untracked:  data/sst_ALL_event_cor.Rdata
        Untracked:  data/sst_ALL_miss_clim_KS_p.Rdata
        Untracked:  data/sst_ALL_miss_clim_aov_p.Rdata
        Untracked:  data/sst_ALL_miss_clim_aov_tukey.Rdata
        Untracked:  data/sst_ALL_miss_event_CI.Rdata
        Untracked:  data/sst_ALL_smooth_R2.Rdata
        Untracked:  data/sst_ALL_smooth_R2_base.Rdata
        Untracked:  data/sst_ALL_smooth_cor_base.Rdata
        Untracked:  data/sst_ALL_smooth_real_category.Rdata
        Untracked:  data/sst_ALL_trend_cat_chi.Rdata
        Untracked:  data/sst_ALL_trend_clim_KS_p.Rdata
        Untracked:  data/sst_ALL_trend_clim_aov_tukey.Rdata
        Untracked:  data/sst_ALL_trend_clim_only.Rdata
        Untracked:  data/sst_ALL_trend_event_CI.Rdata
        Untracked:  data/sst_ALL_trend_event_aov_tukey.Rdata
        Untracked:  data/sst_WA.csv
        Untracked:  data/sst_WA_miss_ice.csv
        Untracked:  data/sst_WA_miss_random.csv
    
    Unstaged changes:
        Deleted:    .Rbuildignore
        Modified:   .gitignore
        Deleted:    DESCRIPTION
        Deleted:    NAMESPACE
        Modified:   NEWS.md
        Deleted:    R/MHWdetection-package.R
        Deleted:    R/placeholder.R
        Modified:   README.md
        Deleted:    _pkgdown.yml
        Modified:   _workflowr.yml
        Modified:   analysis/_site.yml
        Deleted:    docs/articles/fig/heatwaveR_v3.svg
        Deleted:    docs/docsearch.css
        Deleted:    docs/docsearch.js
        Deleted:    docs/link.svg
        Deleted:    docs/pkgdown.css
        Deleted:    docs/pkgdown.js
        Deleted:    docs/pkgdown.yml
        Deleted:    docs/sitemap.xml
        Deleted:    man/MHWdetection-package.Rd
        Deleted:    man/placeholder.Rd
        Deleted:    tests/testthat.R
        Deleted:    tests/testthat/test-placeholder.R
        Deleted:    vignettes/.gitignore
        Deleted:    vignettes/Climatologies_and_baselines.Rmd
        Deleted:    vignettes/Short_climatologies.Rmd
        Deleted:    vignettes/best_practices.Rmd
        Deleted:    vignettes/bibliography.bib
        Deleted:    vignettes/data/.gitignore
        Deleted:    vignettes/data/clim_py.csv
        Deleted:    vignettes/data/clim_py_joinAG_1.csv
        Deleted:    vignettes/data/clim_py_joinAG_5.csv
        Deleted:    vignettes/data/clim_py_joinAG_no.csv
        Deleted:    vignettes/data/clim_py_minD_3.csv
        Deleted:    vignettes/data/clim_py_minD_7.csv
        Deleted:    vignettes/data/clim_py_pctile_80.csv
        Deleted:    vignettes/data/clim_py_pctile_95.csv
        Deleted:    vignettes/data/clim_py_pctile_99.csv
        Deleted:    vignettes/data/clim_py_random.csv
        Deleted:    vignettes/data/clim_py_spw_11.csv
        Deleted:    vignettes/data/clim_py_spw_51.csv
        Deleted:    vignettes/data/clim_py_spw_no.csv
        Deleted:    vignettes/data/clim_py_whw_3.csv
        Deleted:    vignettes/data/clim_py_whw_7.csv
        Deleted:    vignettes/data/mhwBlock.csv
        Deleted:    vignettes/data/mhws_py.csv
        Deleted:    vignettes/data/mhws_py_joinAG_1.csv
        Deleted:    vignettes/data/mhws_py_joinAG_5.csv
        Deleted:    vignettes/data/mhws_py_joinAG_no.csv
        Deleted:    vignettes/data/mhws_py_minD_3.csv
        Deleted:    vignettes/data/mhws_py_minD_7.csv
        Deleted:    vignettes/data/mhws_py_pctile_80.csv
        Deleted:    vignettes/data/mhws_py_pctile_95.csv
        Deleted:    vignettes/data/mhws_py_pctile_99.csv
        Deleted:    vignettes/data/mhws_py_random.csv
        Deleted:    vignettes/data/mhws_py_spw_11.csv
        Deleted:    vignettes/data/mhws_py_spw_51.csv
        Deleted:    vignettes/data/mhws_py_spw_no.csv
        Deleted:    vignettes/data/mhws_py_whw_3.csv
        Deleted:    vignettes/data/mhws_py_whw_7.csv
        Deleted:    vignettes/data/sst_ALL.Rdata
        Deleted:    vignettes/data/sst_ALL_KS.Rdata
        Deleted:    vignettes/data/sst_ALL_cat_chi.Rdata
        Deleted:    vignettes/data/sst_ALL_clim_category_count.Rdata
        Deleted:    vignettes/data/sst_ALL_con.Rdata
        Deleted:    vignettes/data/sst_ALL_detrend.Rdata
        Deleted:    vignettes/data/sst_ALL_event_cor.Rdata
        Deleted:    vignettes/data/sst_ALL_miss_clim_KS_p.Rdata
        Deleted:    vignettes/data/sst_ALL_miss_clim_aov_p.Rdata
        Deleted:    vignettes/data/sst_ALL_miss_clim_aov_tukey.Rdata
        Deleted:    vignettes/data/sst_ALL_miss_event_CI.Rdata
        Deleted:    vignettes/data/sst_ALL_miss_event_aov_tukey.Rdata
        Deleted:    vignettes/data/sst_ALL_smooth_R2.Rdata
        Deleted:    vignettes/data/sst_ALL_smooth_R2_base.Rdata
        Deleted:    vignettes/data/sst_ALL_smooth_cor_base.Rdata
        Deleted:    vignettes/data/sst_ALL_smooth_real_category.Rdata
        Deleted:    vignettes/data/sst_ALL_trend_cat_chi.Rdata
        Deleted:    vignettes/data/sst_ALL_trend_clim_KS_p.Rdata
        Deleted:    vignettes/data/sst_ALL_trend_clim_aov_tukey.Rdata
        Deleted:    vignettes/data/sst_ALL_trend_clim_only.Rdata
        Deleted:    vignettes/data/sst_ALL_trend_event_CI.Rdata
        Deleted:    vignettes/data/sst_ALL_trend_event_aov_tukey.Rdata
        Deleted:    vignettes/data/sst_WA.csv
        Deleted:    vignettes/data/sst_WA_miss_ice.csv
        Deleted:    vignettes/data/sst_WA_miss_random.csv
        Deleted:    vignettes/fig/detect_diagram.svg
        Deleted:    vignettes/fig/heatwaveR_v3.svg
        Deleted:    vignettes/gridded_products.Rmd
        Deleted:    vignettes/missing_data.Rmd
        Deleted:    vignettes/r_vs_python.Rmd
        Deleted:    vignettes/r_vs_python_additional.Rmd
        Deleted:    vignettes/r_vs_python_arguments.Rmd
        Deleted:    vignettes/time_series_duration.Rmd
        Deleted:    vignettes/trend.Rmd
        Deleted:    vignettes/variance.Rmd
    
    
    Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
Past versions:
    File Version Author Date Message
    Rmd 970b22c robwschlegel 2019-03-19 Publish the vignettes from when this was a pkgdown framework
    html fa7fd57 robwschlegel 2019-03-19 Build site.
    Rmd 64ac134 robwschlegel 2019-03-19 Publish analysis files


Overview

The purpose of this vignette is to walk through how to calculate MHWs in both R and Python, using code from both languages throughout. Its secondary purpose is to show that the outputs of the two implementations are effectively identical, and its tertiary purpose is to benchmark the code and compare the speed of the two languages.

Calculating events

First up, let’s look at the two different ways in which one may calculate MHWs. First we will look at the R code, then Python.

R code

Setup

The basic functionality for calculating marine heatwaves (MHWs) in R is found in the heatwaveR package, which may currently be downloaded and installed with either of the lines of code below. Note that if one has already installed the package there is no need to install it again. To run either line in the following code chunk, first remove the hash-tag.

## This is an R chunk ##

# install.packages("devtools")
## Development version from GitHub
# devtools::install_github("robwschlegel/heatwaveR") 
## Stable version from CRAN
# install.packages("heatwaveR")

With the necessary packages installed, we activate heatwaveR with the following line of code.

## This is an R chunk ##

library(heatwaveR)

Default output

With everything downloaded and ready to use we may now calculate some events. The heatwaveR package has three built-in time series (sst_WA, sst_Med, sst_NW_Atl) that we may use to more easily demonstrate how the code works. The general pipeline in R for calculating events is to first create a climatology from a chosen time series using ts2clm(), and then to feed that output into detect_event(), as seen below.

## This is an R chunk ##

# First we create a default climatology as outlined in Hobday et al. (2016)
  # NB: We use a longer climatology period here as this matches the Python default
ts_clim <- ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))

# Then we feed that object into the second and final function
res_r <- detect_event(ts_clim)

To look at these outputs we would use the following options. For now we’ll just look at the event output.

## This is an R chunk ##

# Look at the top six rows of the first 6 columns of the events
res_r$event[1:6, 1:6]
# A tibble: 6 x 6
  event_no index_start index_peak index_end duration date_start
     <int>       <dbl>      <int>     <dbl>    <int> <date>    
1        1         884        887       895       12 1984-06-02
2        2         899        901       904        6 1984-06-17
3        3         908        922       926       19 1984-06-26
4        4        1008       1010      1012        5 1984-10-04
5        5        1023       1027      1029        7 1984-10-19
6        6        1033       1034      1052       20 1984-10-29
# Or perhaps the most intense event
res_r$event[res_r$event$intensity_max == max(res_r$event$intensity_max), 1:6]
# A tibble: 1 x 6
  event_no index_start index_peak index_end duration date_start
     <int>       <dbl>      <int>     <dbl>    <int> <date>    
1       42       10629      10651     10689       61 2011-02-06

Python code

Setup

To download and install the Python package for calculating MHWs one may run the following line of code in a console:

## This is a bash chunk ##

# Note that the hashtag must be removed in order for the code to run
# pip install git+git://github.com/ecjoliver/marineHeatWaves@master

Or simply download the GitHub repository and follow the installation instructions there. Or if still lost, phone a friend!

Before we begin the calculations we need to create a time series. I’ve chosen the sst_WA time series from the heatwaveR package to ensure consistency between the results. The Python code prefers to have the dates and the temperatures fed to it as two separate vectors, so I will first split these up in R before we feed them to Python.

## This is an R chunk ##

# Get the built-in time series from heatwaveR
ts <- sst_WA

# Take only the temperature column
sst <- ts$temp

# Take only date column
dates <- ts$t

# If one wants to save the sst values to load into Python
write.csv(ts$temp, "data/sst_WA.csv", row.names = FALSE)

Default calculations

With the package installed, and our time series prepped, the basic MHW calculation is performed as follows:

## This is a Python chunk ##
# Required for data prep and export
import numpy as np
from datetime import date
import pandas as pd
# The MHW functions
import marineHeatWaves as mhw
# The date values
  # We could take the date values we created above,
  # But let's rather create the date values in Python
t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)
# The temperature values
  # Note that to fetch an object created in R we use 'r.*'
  # Where '*' is the name of the R object
sst = np.asarray(r.sst)
# Or if one rather wants to load the data from a csv file
# sst = np.loadtxt("data/sst_WA.csv", delimiter=',', skiprows=1) # because a heading row needs to be skipped
# The event metrics
mhws, clim = mhw.detect(t, sst)
# If one wants to save the results as csv files
# Save event results
mhws_df = pd.DataFrame.from_dict(mhws)
mhws_df.to_csv('data/mhws_py.csv', sep = ',', index=False)
# 
# Save climatology results
clim_df = pd.DataFrame.from_dict(clim)
clim_df.to_csv('data/clim_py.csv', sep = ',', index=False)

Reticulate code

It is also possible to run the Python code through R with the use of the R package reticulate. This is particularly useful as it allows us to perform the comparisons and benchmarking all within the same language.

Setup

Here we load the reticulate package and choose the conda environment I’ve already created, called py27. For help setting up a conda environment see the reticulate documentation. I’ve ensured that all of the required Python modules are installed within this environment.

## This is an R chunk ##

# install.packages("reticulate")
library(reticulate)
use_condaenv("py27")

Once we’ve told R which version/environment we would like to use for Python we may then load the necessary modules.

## This is an R chunk ##

np <- import("numpy")
datetime <- import("datetime")
mhw <- import("marineHeatWaves")

One may run Python code in its native form within R by passing it as a character vector to py_run_string().

## This is an R chunk ##

py_run_string("t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)")
py_run_string("sst = np.loadtxt(open('data/sst_WA.csv', 'r'), delimiter=',', skiprows=1)")
# py_run_string("sst.flags.writeable = True")
# py_run_string("sst.setflags(write = True)")
# py_run_string("sst = np.array(sst).copy()")

Or we may just create the objects in native R.

## This is an R chunk ##

# These numbers were taken from print(t) in the python code above
t <- as.integer(np$array(seq(723546, 735598)))
sst <- np$array(sst_WA$temp)

Default calculations

Either way, once we have created the necessary parts of the time series in either language, we may call the Python functions in R directly with the line of code below.

## This is an R chunk ##

# The following line appears not to run because the numpy array that Python expects is given to it by R in a read-only format...
# Thus far I've not found a way to correct for this.
# So it appears that one may pass R objects to Python code directly, but not to reticulate code...
res_python <- mhw$detect(t = t, temp = sst)

Comparisons

With some climatologies and events calculated with both languages we may now compare the results of the two. I’ll do so here natively in R to avoid any potential hiccups from translating across languages. I’ll create the R output here, and load the Python output created in the Python code chunk above.

## This is an R chunk ##

# Load libraries for data manipulation and visualisation
library(tidyverse)

# Set R results
res_event_R <- res_r$event
res_clim_R <- res_r$climatology

# Set Python results
res_event_Python <- read_csv("data/mhws_py.csv")
res_clim_Python <- read_csv("data/clim_py.csv")

With these default results loaded in the same format in the same language we can now start to look at how they compare. For starters I am just doing some simple sums and correlations.

Climatologies

## This is an R chunk ##

cor(res_clim_R$seas, res_clim_Python$seas)
[1] 0.9999999
sum(res_clim_R$seas) - sum(res_clim_Python$seas)
[1] -2.039741
cor(res_clim_R$thresh, res_clim_Python$thresh)
[1] 0.9999996
sum(res_clim_R$thresh) - sum(res_clim_Python$thresh)
[1] -6.563502

The seasonal and threshold climatologies correlate very well, but adding them up we see that the Python values are consistently higher. This is due to rounding differences between the languages. Even though the sum difference between the thresholds may look large at a value of ~6.5, considering this is over 12053 days of data, the difference is only ~0.0005/day. So I think it is safe to move on and look at the events.
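To put that sum difference in perspective, the arithmetic can be checked directly (12053 is the number of days from 1982-01-01 to 2014-12-31 inclusive):

```python
# Spread the total threshold difference over every day in the series
total_diff = 6.563502   # |sum(R thresh) - sum(Python thresh)|, from the output above
n_days = 12053          # days from 1982-01-01 to 2014-12-31 inclusive
per_day = total_diff / n_days
print(round(per_day, 6))  # roughly 0.0005 degrees C per day
```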

February 29th

It was also recently brought to my attention by Zijie Zhao, author of the MATLAB distribution of the MHW code, that Python and R handle February 29th differently: Python calculates the climatologies first and then fills in the missing day, whereas R calculates February 29th first and then the climatologies from that. With that said, let’s have a peek at the difference this may cause.

## This is an R chunk ##

res_clim_R_29 <- res_clim_R %>% 
  filter(doy == 60) %>% 
  select(seas, thresh) %>% 
  distinct()
res_clim_Python_29 <- res_clim_Python %>% 
  # The Python output has no doy column, so use the matching rows from the R output
  filter(res_clim_R$doy == 60) %>% 
  select(seas, thresh) %>% 
  distinct()

res_clim_R_29$seas - res_clim_Python_29$seas
[1] 0.0003483643
res_clim_R_29$thresh - res_clim_Python_29$thresh
[1] -0.005419341
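Neither package’s exact leap-day handling is shown here, but the fill-afterwards approach can be sketched with made-up values: build the climatology for the regular days first, then fill day-of-year 60 from its neighbours.

```python
# Hypothetical seasonal climatology values around the leap day
clim = {59: 20.0, 61: 20.5}  # doy 59 = Feb 28, doy 61 = Mar 1

# Fill-afterwards scheme: day-of-year 60 (Feb 29) from its neighbours
clim[60] = (clim[59] + clim[61]) / 2
print(clim[60])  # 20.25
```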

Another thing pointed out by Zijie is that Python and MATLAB calculate percentiles differently; this is then likely also true of R and Python. That would explain why the seasonal climatologies (seas) are always much closer than the thresholds (thresh) in all of these tests.
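numpy makes these competing percentile definitions explicit via the method argument of np.percentile (numpy ≥ 1.22; older versions call it interpolation), much as R’s quantile() offers nine type options. On a small sample the definitions can disagree substantially:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Three of numpy's quantile definitions disagree on this small sample
linear = np.percentile(x, 90, method="linear")    # numpy default; matches R's type 7
lower = np.percentile(x, 90, method="lower")
nearest = np.percentile(x, 90, method="nearest")
print(round(linear, 2), lower, nearest)  # 7.6 4.0 10.0
```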

Events

Below we have a for loop that will run a correlation on all rows with the same name, as well as find the sum difference between them.

## This is an R chunk ##

# Remove non-numeric columns
res_event_num <- res_event_R %>% 
  select_if(is.numeric)

# Run the loop
res_event <- data.frame()
for(i in 1:length(colnames(res_event_num))){
  if(colnames(res_event_num)[i] %in% colnames(res_event_Python)){
    x1 <- res_event_R[colnames(res_event_R) == colnames(res_event_num)[i]]
    x2 <- res_event_Python[colnames(res_event_Python) == colnames(res_event_num)[i]]
    x <- data.frame(r = cor(x1, x2, use = "complete.obs"),
                    difference = round(sum(x1, na.rm = T) - sum(x2, na.rm = T), 4),
                    var = colnames(res_event_num)[i])
    colnames(x)[1] <- "r"
    rownames(x) <- NULL
    } else {
      x <- data.frame(r = NA, difference = NA, var = colnames(res_event_num)[i])
      }
  res_event <- rbind(res_event, x)
  }
na.omit(res_event)
           r difference                            var
2  1.0000000    60.0000                    index_start
3  1.0000000    60.0000                     index_peak
4  1.0000000    60.0000                      index_end
5  1.0000000     0.0000                       duration
6  0.9999991     0.0086                 intensity_mean
7  0.9999997     0.0086                  intensity_max
8  0.9996080     0.7345                  intensity_var
9  1.0000000     0.1616           intensity_cumulative
10 0.9999971     0.0158       intensity_mean_relThresh
11 0.9999994     0.0166        intensity_max_relThresh
12 0.9995685     0.7374        intensity_var_relThresh
13 0.9999996     0.4741 intensity_cumulative_relThresh
14 1.0000000     0.0000             intensity_mean_abs
15 0.9990733     1.6100              intensity_max_abs
16 0.9997267     0.7667              intensity_var_abs
17 1.0000000     0.0000       intensity_cumulative_abs
18 1.0000000    -0.0001                     rate_onset
19 0.9999999     0.0003                   rate_decline

It is a bit odd that there is such a large difference between intensity_var for the two languages. This implies that R is detecting larger differences in the intensity around the threshold than Python is. This may again be due to the differences in the thresholds being detected, but those make a rather small difference. It is perhaps more likely that variance is calculated slightly differently between the languages. Looking at the source code one sees that this metric is calculated by first finding the variance of the intensity and then taking the square root of that. These two calculations likely differ enough between the languages to be causing this. Ultimately it is of little concern, as the variance values found throughout both packages are not of much interest to anyone.
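One well-known way such a calculation can diverge between languages is the choice of denominator: R’s sd() divides by n − 1, while numpy’s std() defaults to n. This only illustrates the general point, not the packages’ actual internals:

```python
import numpy as np

# Made-up intensity values for one event
intensity = np.array([1.2, 1.5, 1.1, 1.9, 1.4])

pop_sd = np.std(intensity)           # denominator n (numpy's default, ddof=0)
samp_sd = np.std(intensity, ddof=1)  # denominator n - 1 (what R's sd() uses)
print(round(pop_sd, 4), round(samp_sd, 4))  # the sample version is always the larger
```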

We may also see that the difference in intensity_max_abs is larger than for most other metrics, with the R sum slightly higher than the Python sum. This is rather surprising, as intensity_max_abs simply shows the maximum temperature (independent of any thresholds etc.) experienced during an event. Considering that these calculations were performed on the exact same time series, intensity_max_abs should always be exactly the same between the two languages. Looking at the climatology output one sees that the temperature is still correct, meaning that the difference must occur somewhere within the R function detect_event(). Looking at the R source code we see that intensity_max_abs is calculated on line 298; for Python this is calculated on line 376. The difference arises because R looks for the maximum temperature throughout all dates of a given event, whereas Python simply takes the temperature on the peak date of the event. This is a subtle but significant difference, as it means this value is showing two different things. I’m not certain which is more appropriate, though I’m leaning towards the Python implementation.
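The two definitions can be contrasted on a toy event (the temperatures and peak position below are made up):

```python
# Daily temperatures over one hypothetical event
event_temps = [21.0, 22.5, 23.1, 22.0, 21.4]
peak_index = 1  # the day of maximum anomaly need not be the day of maximum temperature

# R-style: maximum temperature over all days of the event
max_abs_r_style = max(event_temps)          # 23.1

# Python-style: temperature on the event's peak day
max_abs_py_style = event_temps[peak_index]  # 22.5
print(max_abs_r_style, max_abs_py_style)
```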

With all of our overlapping columns compared, and our differences and Pearson r values looking solid, let’s finish off this basic comparison by finding which columns are not shared between the two language outputs. Also note that the apparently large differences in the index_start, index_peak, and index_end values are due to the different indexing methods of R and Python: R starts at 1 and Python at 0.

## This is an R chunk ##

cols_R <- colnames(res_event_R)[!(colnames(res_event_R) %in% colnames(res_event_Python))]
cols_R
[1] "event_no"
cols_Py <- colnames(res_event_Python)[!(colnames(res_event_Python) %in% colnames(res_event_R))]
cols_Py
[1] "category"          "duration_extreme"  "duration_moderate"
[4] "duration_severe"   "duration_strong"   "n_events"         
[7] "time_end"          "time_peak"         "time_start"       

Wonderful! Almost everything matches up exactly. The durations of the categories of events are added in R by a separate function, category(), and will be left that way for now. The “time” columns in the Python output aren’t relevant for the present usage, as these functions currently only take daily values. The different methods of labelling the events will also be left as they are for now.

It is also worth noting that the values for index_start, index_peak, and index_end are off by one between the two languages, due to the difference in indexing. Looking at the date_start, date_peak, and date_end values we see that they are the same.
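The off-by-one can be reconciled mechanically; for illustration, the Python values below are simply the R start indices from the earlier event table shifted down by one:

```python
# Start indices of the first three events, as each language would report them
index_start_python = [883, 898, 907]  # 0-based (Python)
index_start_r = [884, 899, 908]       # 1-based (R, from the event table above)

# Adding one to the Python indices reproduces the R values
assert [i + 1 for i in index_start_python] == index_start_r
```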

Missing data

A bug was discovered in v0.3.3 of the R code with regard to the handling of missing data. Specifically, in the move to a C++ back-end, the R code stopped being able to handle missing data in the climatology calculations. This has been corrected from v0.3.4 onwards, but it is worthwhile to ensure that missing data are handled the same way in both languages.
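For reference, R’s ts2clm() exposes a maxPadLength argument that linearly interpolates gaps up to a given length. A rough Python analogue (my own sketch, not the behaviour of either package) could look like this:

```python
import numpy as np
import pandas as pd

def pad_gaps(values, max_pad):
    """Linearly interpolate runs of NaN no longer than max_pad entries.

    A rough analogue of heatwaveR's maxPadLength behaviour (an assumption,
    not the package's actual code); longer gaps are left missing.
    """
    s = pd.Series(values, dtype="float64")
    interp = s.interpolate(limit_area="inside")  # fill every interior gap
    # Measure the length of each run of missing values
    is_na = s.isna()
    run_id = (is_na != is_na.shift()).cumsum()
    run_len = is_na.groupby(run_id).transform("sum")
    # Undo the interpolation for runs longer than max_pad
    return interp.where(~(is_na & (run_len > max_pad)))

# A 2-day gap gets filled; the 4-day gap stays missing
print(pad_gaps([20.0, None, None, 23.0, None, None, None, None, 25.0], 2).tolist())
```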

First we’ll create the dataframe with missing data:

## This is an R chunk ##

library(padr)
library(lubridate)
random_knockout <- function(df, prop){
  res <- df %>% 
    sample_frac(size = prop) %>% 
    arrange(t) %>%
    pad(interval = "day")
  return(res)
}

# Remove 10% of the data randomly
sst_WA_miss_random <- random_knockout(sst_WA, 0.9)
write.csv(sst_WA_miss_random$temp, file = "data/sst_WA_miss_random.csv", row.names = F, na = "NaN")

# Remove simulated ice cover
sst_WA_miss_ice <- sst_WA %>% 
  mutate(month = month(t, label = T),
         temp = ifelse(month %in% c("Jan", "Feb", "Mar"), NA, temp)) %>% 
  select(-month)
write.csv(sst_WA_miss_ice$temp, file = "data/sst_WA_miss_ice.csv", row.names = F, na = "NaN")

With the missing data saved, we can now calculate and compare the climatologies that the two different languages will create.

## This is a Python chunk ##
# Required for data prep and export
import numpy as np
from datetime import date
import pandas as pd
# The MHW functions
import marineHeatWaves as mhw
# The date values
t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)
# The temperature values
sst_random = np.loadtxt("data/sst_WA_miss_random.csv", delimiter = ',', skiprows = 1)
sst_ice = np.loadtxt("data/sst_WA_miss_ice.csv", delimiter = ',', skiprows = 1)
# The event metrics
mhws_random, clim_random = mhw.detect(t, sst_random)
# It appears as though the Python code can't deal with ice coverage...
#mhws_ice, clim_ice = mhw.detect(t, sst_ice)
# Save climatology results
clim_random_df = pd.DataFrame.from_dict(clim_random)
clim_random_df.to_csv('data/clim_py_random.csv', sep = ',', index = False)
mhws_random_df = pd.DataFrame.from_dict(mhws_random)
mhws_random_df.to_csv('data/mhws_py_random.csv', sep = ',', index = False)
#clim_ice_df = pd.DataFrame.from_dict(clim_ice)
#clim_ice_df.to_csv('data/clim_ice_py.csv', sep = ',', index = False)

With the Python climatologies and events calculated from the missing data we’ll now do the same for R.

## This is an R chunk ##

# Load and prep missing data
sst_WA_miss_random <- read_csv("data/sst_WA_miss_random.csv") %>% 
  dplyr::rename(temp = x) %>% 
  mutate(t = seq(as.Date("1982-01-01"), as.Date("2014-12-31"), by = "day")) %>% 
  select(t, temp)
sst_WA_miss_random$temp[is.nan(sst_WA_miss_random$temp)] <- NA

# Calculate R climatologies/events
res_random_R <- detect_event(ts2clm(sst_WA_miss_random, maxPadLength = 10,
                                    climatologyPeriod = c("1982-01-01", "2014-12-31")))
res_clim_random_R <- res_random_R$climatology
res_event_random_R <- res_random_R$event

# Load Python results
res_clim_random_Python <- read_csv("data/clim_py_random.csv")
res_event_random_Python <- read_csv("data/mhws_py_random.csv")

And then the comparisons of the seas and thresh values.

## This is an R chunk ##

# Compare clims/thresholds
cor(res_clim_random_R$seas, res_clim_random_Python$seas)
[1] 0.9999709
mean(res_clim_random_R$seas) - mean(res_clim_random_Python$seas)
[1] 0.0004467828
cor(res_clim_random_R$thresh, res_clim_random_Python$thresh)
[1] 0.9999379
mean(res_clim_random_R$thresh) - mean(res_clim_random_Python$thresh)
[1] 0.0005155218

We may see from the results above that the values still correlate very strongly, if slightly less so than with no missing data. This time the R values are marginally higher on average for both the seasonal climatology (~0.0004/day) and the 90th percentile threshold (~0.0005/day). These per-day differences are small and most likely stem from the difference in how quantiles are calculated between the languages, so there may be little to do about them. Perhaps more important is how these differences affect the events being detected, as even small threshold shifts can change which events qualify.

## This is an R chunk ##

nrow(res_event_random_R)
[1] 61
nrow(res_event_random_Python)
[1] 53
## This is an R chunk ##

# Remove non-numeric columns
res_event_random_num <- res_event_random_R %>% 
  select_if(is.numeric)

# Run the loop
res_event_random <- data.frame()
for(i in 1:length(colnames(res_event_random_num))){
  if(colnames(res_event_random_num)[i] %in% colnames(res_event_random_Python)){
    x1 <- res_event_random_R[colnames(res_event_random_R) == colnames(res_event_random_num)[i]]
    x2 <- res_event_random_Python[colnames(res_event_random_Python) == colnames(res_event_random_num)[i]]
    x <- data.frame(difference = round(sum(x1, na.rm = T) - sum(x2, na.rm = T), 4),
                    var = colnames(res_event_random_num)[i])
    # colnames(x)[1] <- "r"
    rownames(x) <- NULL
    } else {
      x <- data.frame(difference = NA, var = colnames(res_event_random_num)[i])
      }
  res_event_random <- rbind(res_event_random, x)
  }
na.omit(res_event_random)
   difference                            var
2  50851.0000                    index_start
3  50961.0000                     index_peak
4  51114.0000                      index_end
5    271.0000                       duration
6      3.2705                 intensity_mean
7      6.7657                  intensity_max
8     -0.6593                  intensity_var
9    498.7071           intensity_cumulative
10    -3.8033       intensity_mean_relThresh
11    -0.3247        intensity_max_relThresh
12    -0.6651        intensity_var_relThresh
13   167.2651 intensity_cumulative_relThresh
14   154.8956             intensity_mean_abs
15   159.5400              intensity_max_abs
16    -0.1118              intensity_var_abs
17  6329.6356       intensity_cumulative_abs
18    -6.4583                     rate_onset
19   -10.6540                   rate_decline

Because the count of events detected is not the same, we cannot run correlations. We can however sum the differences to get an idea of how far off things are, and the results are troubling. The differences in the index_* values can be discounted, as we have already established that the two languages detect a different number of events. The next largest difference is for intensity_cumulative_abs. Looking at the one event that is identical between the languages, we see that this value matches for both, meaning that this difference is also not due to calculation differences, but rather to the differing number of events. Regardless, it is an indicator that the much lower threshold in R is producing results that look far more dire than those from Python. This is problematic and stems from the quantile calculation issue.

I looked into this in quite some detail and it appears to stem not only from the quantile calculations themselves differing, but also from the R code rounding its quantile results to only one additional decimal place of precision over the initial data, whereas the Python results retain full floating-point precision. This explains why the R results always tend to be lower, but does not account for why the presence of missing data has such a dramatic effect.
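The effect of that rounding can be sketched as follows. This uses hypothetical data, and the explicit rounding to 0.001 merely mimics the behaviour described above rather than reproducing either package's actual code path:

```python
import numpy as np

rng = np.random.default_rng(666)
# SST-like data recorded at 0.01 degree precision, as in satellite products
sst = np.round(rng.normal(loc=14, scale=2, size=3650), 2)

thresh_py = np.percentile(sst, 90)      # full floating-point precision
thresh_r_like = np.round(thresh_py, 3)  # mimic rounding to 0.001

print(thresh_py, thresh_r_like, thresh_py - thresh_r_like)
```

Rounding to 0.001 can shift any single daily threshold value by at most 0.0005, which supports the conclusion that rounding alone cannot explain the ~0.07 offset observed when missing data are present.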

## This is an R chunk ##

clim_only_random_R <- res_clim_random_R %>% 
  select(doy, seas, thresh) %>% 
  distinct() %>% 
  filter(doy != 60)

clim_only_random_Python <- res_clim_random_Python %>% 
  slice(1:365)

cor(clim_only_random_R$seas, clim_only_random_Python$seas)
[1] 0.9999709
sum(clim_only_random_R$seas) - sum(clim_only_random_Python$seas)
[1] 0.1580903
sum(clim_only_random_R$thresh) - sum(clim_only_random_Python$thresh)
[1] 0.1839033

Looking at the values for only a single annual cycle of the seasonal climatology and threshold, we see that the differences are the same as when we compare all of the data. I don’t know why they wouldn’t be… but I wanted to check.

After having gone over the R code very thoroughly, I think the next course of action is to go through the step-by-step output of the Python code to see where exactly these changes begin to occur. It could be in the following places:

  • clim_calc_cpp: The calculation of quantile values
    • The rounding of the results to 0.001 will have an influence, but can’t explain the size of the difference when missing data are present
  • smooth_percentile: RcppRoll::roll_mean is used for the rolling means and this may differ greatly from Python when missing data are present
    • Line 290 in the Python code is where the function runavg is used for the same purpose
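The kind of discrepancy a rolling mean can introduce around missing values can be illustrated with a quick sketch. This uses pandas rather than the actual RcppRoll::roll_mean or runavg code, purely to show the two possible behaviours:

```python
import numpy as np
import pandas as pd

# A short series with one missing value, standing in for a gappy climatology
x = pd.Series([10.0, 10.2, np.nan, 10.4, 10.6])

# Strict behaviour: any NaN inside the window poisons the result
strict = x.rolling(window=3, center=True).mean()

# Tolerant behaviour: average over whichever values are present
tolerant = x.rolling(window=3, center=True, min_periods=1).mean()

print(strict.tolist())
print(tolerant.tolist())
```

If one implementation propagates NaN while the other averages over the remaining values, every window touching a gap will differ between the two languages, and the size of that disagreement would scale with the amount of missing data, exactly as observed here.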

Benchmarks

The final thing we want to look at in this vignette is the speed differences in calculating MHWs between the two languages.

R

library(microbenchmark)
# The old R code
library(RmarineHeatWaves)
microbenchmark(detect(make_whole(data = sst_WA), climatology_start = "1982-01-01", climatology_end = "2014-12-31"), times = 10)
Unit: milliseconds
                                                                                                     expr
 detect(make_whole(data = sst_WA), climatology_start = "1982-01-01",      climatology_end = "2014-12-31")
      min       lq    mean   median       uq      max neval
 528.6332 537.7533 605.428 544.3456 555.6387 1146.521    10
# The new R code
microbenchmark(detect_event(ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))), times = 10)
Unit: milliseconds
                                                                                        expr
 detect_event(ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01",      "2014-12-31")))
      min       lq     mean   median       uq      max neval
 136.0114 137.5676 140.6235 139.3807 143.8128 147.6173    10

Unsurprisingly, the average for running the heatwave analysis with heatwaveR 10 times comes out at ~0.194 seconds per calculation, which is faster than the average of ~0.730 seconds with RmarineHeatWaves.

Python

import time
total = 0
for i in range(10):
    start = time.perf_counter()  # time.clock() is deprecated and was removed in Python 3.8
    mhws, clim = mhw.detect(t, sst)
    end = time.perf_counter()
    total += end - start
bench_py = total/10
print(bench_py)

The average speed for running the Python code 10 times comes out at ~0.175 seconds, which is slightly faster than R. The Python code is also calculating the event categories, which R is not; in R that is done in an additional step with category(). The R code does however allow for a wider range of climatology options, and may be more easily run across multiple cores, as shown in this vignette.

Conclusion

Overall the basic applications of these two languages provide near identical results in nearly the same computational time. I would not say that either is enough of an improvement over the other to warrant switching languages. Rather, this vignette supports the argument that people collaborating on a research project while using both R and Python do not need to worry about their results differing. The current exception is that large amounts of missing data appear to influence the calculation of the 90th percentile threshold differently between the languages. This will be investigated further and corrected as necessary.

Session information

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RmarineHeatWaves_0.17.0 microbenchmark_1.4-4   
 [3] forcats_0.3.0           stringr_1.3.1          
 [5] dplyr_0.7.6             purrr_0.2.5            
 [7] readr_1.1.1             tidyr_0.8.1            
 [9] tibble_1.4.2            tidyverse_1.2.1        
[11] bindrcpp_0.2.2          heatwaveR_0.3.3        
[13] ggplot2_3.0.0           data.table_1.11.6      

loaded via a namespace (and not attached):
 [1] httr_1.3.1        jsonlite_1.5      splines_3.5.3    
 [4] R.utils_2.7.0     modelr_0.1.2      assertthat_0.2.0 
 [7] sp_1.3-1          cellranger_1.1.0  yaml_2.2.0       
[10] pillar_1.3.0      backports_1.1.2   lattice_0.20-35  
[13] glue_1.3.0        reticulate_1.10   digest_0.6.16    
[16] rvest_0.3.2       colorspace_1.3-2  sandwich_2.5-0   
[19] htmltools_0.3.6   Matrix_1.2-14     R.oo_1.22.0      
[22] plyr_1.8.4        pkgconfig_2.0.2   broom_0.5.0      
[25] raster_2.6-7      haven_1.1.2       mvtnorm_1.0-8    
[28] scales_1.0.0      whisker_0.3-2     git2r_0.23.0     
[31] TH.data_1.0-9     withr_2.1.2       lazyeval_0.2.1   
[34] cli_1.0.0         survival_2.42-6   magrittr_1.5     
[37] crayon_1.3.4      readxl_1.1.0      evaluate_0.11    
[40] R.methodsS3_1.7.1 fansi_0.3.0       nlme_3.1-137     
[43] MASS_7.3-50       xml2_1.2.0        tools_3.5.3      
[46] hms_0.4.2         multcomp_1.4-8    munsell_0.5.0    
[49] compiler_3.5.3    RcppRoll_0.3.0    rlang_0.2.2      
[52] grid_3.5.3        rstudioapi_0.7    rmarkdown_1.10   
[55] gtable_0.2.0      codetools_0.2-15  R6_2.2.2         
[58] zoo_1.8-4         lubridate_1.7.4   knitr_1.20       
[61] utf8_1.1.4        bindr_0.1.1       workflowr_1.1.1  
[64] rprojroot_1.3-2   stringi_1.2.4     Rcpp_0.12.18     
[67] tidyselect_0.2.4 

This reproducible R Markdown analysis was created with workflowr 1.1.1