<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta charset="utf-8" /> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="generator" content="pandoc" /> <meta name="author" content="Robert W Schlegel" /> <meta name="date" content="2019-05-06" /> <title>Default MHW outputs in R and Python</title> <script src="site_libs/jquery-1.11.3/jquery.min.js"></script> <meta name="viewport" content="width=device-width, initial-scale=1" /> <link href="site_libs/bootstrap-3.3.5/css/cosmo.min.css" rel="stylesheet" /> <script src="site_libs/bootstrap-3.3.5/js/bootstrap.min.js"></script> <script src="site_libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script> <script src="site_libs/bootstrap-3.3.5/shim/respond.min.js"></script> <script src="site_libs/jqueryui-1.11.4/jquery-ui.min.js"></script> <link href="site_libs/tocify-1.9.1/jquery.tocify.css" rel="stylesheet" /> <script src="site_libs/tocify-1.9.1/jquery.tocify.js"></script> <script src="site_libs/navigation-1.1/tabsets.js"></script> <link href="site_libs/highlightjs-9.12.0/textmate.css" rel="stylesheet" /> <script src="site_libs/highlightjs-9.12.0/highlight.js"></script> <link href="site_libs/font-awesome-5.0.13/css/fa-svg-with-js.css" rel="stylesheet" /> <script src="site_libs/font-awesome-5.0.13/js/fontawesome-all.min.js"></script> <script src="site_libs/font-awesome-5.0.13/js/fa-v4-shims.min.js"></script> <style type="text/css">code{white-space: pre;}</style> <style type="text/css"> pre:not([class]) { background-color: white; } </style> <script type="text/javascript"> if (window.hljs) { hljs.configure({languages: []}); hljs.initHighlightingOnLoad(); if (document.readyState && document.readyState === "complete") { window.setTimeout(function() { hljs.initHighlighting(); }, 0); } } </script> <style type="text/css"> h1 { font-size: 34px; } h1.title { font-size: 38px; } h2 { font-size: 30px; } h3 { font-size: 24px; } h4 { font-size: 18px; } h5 { font-size: 16px; } h6 { font-size: 12px; } .table th:not([align]) { text-align: left; } </style> </head> <body> <style type = "text/css"> .main-container { max-width: 940px; margin-left: auto; margin-right: auto; } code { color: inherit; background-color: rgba(0, 0, 0, 0.04); } img { max-width:100%; height: auto; } .tabbed-pane { padding-top: 12px; } .html-widget { margin-bottom: 20px; } button.code-folding-btn:focus { outline: none; } </style> <style type="text/css"> /* padding for bootstrap navbar */ body { padding-top: 51px; padding-bottom: 40px; } /* offset scroll position for anchor links (for fixed navbar) */ .section h1 { padding-top: 56px; margin-top: -56px; } .section h2 { padding-top: 56px; margin-top: -56px; } .section h3 { padding-top: 56px; margin-top: -56px; } .section h4 { padding-top: 56px; margin-top: -56px; } .section h5 { padding-top: 56px; margin-top: -56px; } .section h6 { padding-top: 56px; margin-top: -56px; } </style> <script> // manage active state of menu based on current page $(document).ready(function () { // active menu anchor href = window.location.pathname href = href.substr(href.lastIndexOf('/') + 1) if (href === "") href = "index.html"; var menuAnchor = $('a[href="' + href + '"]'); // mark it active menuAnchor.parent().addClass('active'); // if it's got a parent navbar menu mark it active as well menuAnchor.closest('li.dropdown').addClass('active'); }); </script> <div class="container-fluid main-container"> <!-- tabsets --> <script> $(document).ready(function () { window.buildTabsets("TOC"); }); </script> <!-- code folding --> <script> $(document).ready(function () { // move toc-ignore selectors from section div to header $('div.section.toc-ignore') .removeClass('toc-ignore') .children('h1,h2,h3,h4,h5').addClass('toc-ignore'); // establish options var options = { selectors: "h1,h2,h3", theme: "bootstrap3", context: '.toc-content', hashGenerator: function (text) { return text.replace(/[.\\/?&!#<>]/g, '').replace(/\s/g, '_').toLowerCase(); }, ignoreSelector: ".toc-ignore", scrollTo: 0 }; options.showAndHide = true; options.smoothScroll = true; // tocify var toc = $("#TOC").tocify(options).data("toc-tocify"); }); </script> <style type="text/css"> #TOC { margin: 25px 0px 20px 0px; } @media (max-width: 768px) { #TOC { position: relative; width: 100%; } } .toc-content { padding-left: 30px; padding-right: 40px; } div.main-container { max-width: 1200px; } div.tocify { width: 20%; max-width: 260px; max-height: 85%; } @media (min-width: 768px) and (max-width: 991px) { div.tocify { width: 25%; } } @media (max-width: 767px) { div.tocify { width: 100%; max-width: none; } } .tocify ul, .tocify li { line-height: 20px; } .tocify-subheader .tocify-item { font-size: 0.90em; padding-left: 25px; text-indent: 0; } .tocify .list-group-item { border-radius: 0px; } </style> <!-- setup 3col/9col grid for toc_float and main content --> <div class="row-fluid"> <div class="col-xs-12 col-sm-4 col-md-3"> <div id="TOC" class="tocify"> </div> </div> <div class="toc-content col-xs-12 col-sm-8 col-md-9"> <div class="navbar navbar-default navbar-fixed-top" role="navigation"> <div class="container"> <div class="navbar-header"> <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar"> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button> <a class="navbar-brand" href="index.html">MHW detection</a> </div> <div id="navbar" class="navbar-collapse collapse"> <ul class="nav navbar-nav"> <li> <a href="index.html">Overview</a> </li> <li class="dropdown"> <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-expanded="false"> Vignettes <span class="caret"></span> </a> <ul class="dropdown-menu" role="menu"> <li> <a href="r_vs_python.html">Python vs. R - default outputs</a> </li> <li> <a href="r_vs_python_arguments.html">Python vs. R - default arguments</a> </li> <li> <a href="r_vs_python_additional.html">Python vs. R - additional functions</a> </li> <li> <a href="time_series_duration.html">Effects of short time series</a> </li> <li> <a href="missing_data.html">Effects of missing data</a> </li> <li> <a href="trend.html">Effects of long-term trends</a> </li> <li> <a href="best_practices.html">Best practices</a> </li> </ul> </li> <li> <a href="portrait.pdf">Poster</a> </li> <li> <a href="license.html">License</a> </li> <li> <a href="news.html">News</a> </li> </ul> <ul class="nav navbar-nav navbar-right"> <li> <a href="https://github.com/robwschlegel/MHWdetection"> <span class="fa fa-github"></span> </a> </li> </ul> </div><!--/.nav-collapse --> </div><!--/.container --> </div><!--/.navbar --> <!-- Add a small amount of space between sections. --> <style type="text/css"> div.section { padding-top: 12px; } </style> <div class="fluid-row" id="header"> <h1 class="title toc-ignore">Default MHW outputs in R and Python</h1> <h4 class="author"><em>Robert W Schlegel</em></h4> <h4 class="date"><em>2019-05-06</em></h4> </div> <p><strong>Last updated:</strong> 2019-05-06</p> <strong>workflowr checks:</strong> <small>(Click a bullet for more information)</small> <ul> <li> <details> <p><summary> <strong style="color:red;">✖</strong> <strong>R Markdown file:</strong> uncommitted changes </summary> The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run <code>wflow_publish</code> to commit the R Markdown file and build the HTML.</p> </details> </li> <li> <details> <p><summary> <strong style="color:blue;">✔</strong> <strong>Environment:</strong> empty </summary></p> <p>Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.</p> </details> </li> <li> <details> <p><summary> <strong style="color:blue;">✔</strong> <strong>Seed:</strong> <code>set.seed(666)</code> </summary></p> <p>The command <code>set.seed(666)</code> was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.</p> </details> </li> <li> <details> <p><summary> <strong style="color:blue;">✔</strong> <strong>Session information:</strong> recorded </summary></p> <p>Great job! Recording the operating system, R version, and package versions is critical for reproducibility.</p> </details> </li> <li> <details> <p><summary> <strong style="color:blue;">✔</strong> <strong>Repository version:</strong> <a href="https://github.com/robwschlegel/MHWdetection/tree/b1a7e3fcf3dacdd2ba4d7cd79a44fef13dd6191f" target="_blank">b1a7e3f</a> </summary></p> Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated. <br><br> Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use <code>wflow_publish</code> or <code>wflow_git_commit</code>). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated: <pre><code> Ignored files: Ignored: .Rhistory Ignored: .Rproj.user/ Ignored: data/global/ Ignored: data/global_KS_cat.Rdata Ignored: data/global_KS_clim.Rdata Ignored: data/global_KS_event.Rdata Ignored: data/global_dec_trend.Rdata Ignored: data/global_effect_cat.Rdata Ignored: data/global_effect_clim.Rdata Ignored: data/global_effect_event.Rdata Ignored: data/global_effect_event_length_slope.Rdata Ignored: data/global_effect_event_slope.Rdata Ignored: data/sst_ALL_add_trend.Rdata Ignored: data/sst_ALL_aov_tukey.Rdata Ignored: data/sst_ALL_clim_event_cat.Rdata Ignored: data/sst_ALL_clim_event_cat_fix.Rdata Ignored: data/sst_ALL_flat.Rdata Ignored: data/sst_ALL_knockout.Rdata Ignored: data/sst_ALL_length.Rdata Ignored: data/sst_ALL_length_width_10.Rdata Ignored: data/sst_ALL_length_width_20.Rdata Ignored: data/sst_ALL_length_width_30.Rdata Ignored: data/sst_ALL_length_width_40.Rdata Ignored: data/sst_ALL_length_width_50.Rdata Ignored: data/sst_ALL_missing.Rdata Ignored: data/sst_ALL_missing_fix.Rdata Ignored: data/sst_ALL_repl.Rdata Ignored: data/sst_ALL_trended.Rdata Unstaged changes: Modified: analysis/Climatologies_and_baselines.Rmd Modified: analysis/best_practices.Rmd Modified: analysis/r_vs_python.Rmd </code></pre> Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes. </details> </li> </ul> <details> <summary> <small><strong>Expand here to see past versions:</strong></small> </summary> <ul> <table style="border-collapse:separate; border-spacing:5px;"> <thead> <tr> <th style="text-align:left;"> File </th> <th style="text-align:left;"> Version </th> <th style="text-align:left;"> Author </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Message </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> html </td> <td style="text-align:left;"> <a href="https://cdn.rawgit.com/robwschlegel/MHWdetection/38559da0810a23b9416a3df8d57aa1c1ab651204/docs/r_vs_python.html" target="_blank">38559da</a> </td> <td style="text-align:left;"> robwschlegel </td> <td style="text-align:left;"> 2019-03-19 </td> <td style="text-align:left;"> Build site. </td> </tr> <tr> <td style="text-align:left;"> Rmd </td> <td style="text-align:left;"> <a href="https://github.com/robwschlegel/MHWdetection/blob/970b22c7c2cae4e09824722f8bf1d87fb5118a73/analysis/r_vs_python.Rmd" target="_blank">970b22c</a> </td> <td style="text-align:left;"> robwschlegel </td> <td style="text-align:left;"> 2019-03-19 </td> <td style="text-align:left;"> Publish the vignettes from when this was a pkgdown framework </td> </tr> <tr> <td style="text-align:left;"> html </td> <td style="text-align:left;"> <a href="https://cdn.rawgit.com/robwschlegel/MHWdetection/fa7fd57d97caa42308afbb27f761077d74e5239e/docs/r_vs_python.html" target="_blank">fa7fd57</a> </td> <td style="text-align:left;"> robwschlegel </td> <td style="text-align:left;"> 2019-03-19 </td> <td style="text-align:left;"> Build site. </td> </tr> <tr> <td style="text-align:left;"> Rmd </td> <td style="text-align:left;"> <a href="https://github.com/robwschlegel/MHWdetection/blob/64ac134076a04088c834291ce86c6405eedaf672/analysis/r_vs_python.Rmd" target="_blank">64ac134</a> </td> <td style="text-align:left;"> robwschlegel </td> <td style="text-align:left;"> 2019-03-19 </td> <td style="text-align:left;"> Publish analysis files </td> </tr> </tbody> </table> </ul> </details> <hr /> <div id="overview" class="section level2"> <h2>Overview</h2> <p>The purpose of this vignette is to walk one through how to calculate MHWs in both the R and Python languages. It will use code from both languages throughout. The secondary purpose of this vignette is to show that the outputs are identical. The tertiary purpose of this vignette is to run benchmarks on the code to compare their speeds.</p> </div> <div id="calculating-events" class="section level2"> <h2>Calculating events</h2> <p>First up, let’s look at the two different ways in which one may calculate MHWs. First we will look at the R code, then Python.</p> <div id="r-code" class="section level3"> <h3>R code</h3> <div id="setup" class="section level4"> <h4>Setup</h4> <p>The basic functionality for calculating marine heatwaves (MHWs) in R may be found in the <strong><code>heatwaveR</code></strong> package that may currently be downloaded and installed with the following lines of code. Note that if one has already installed these packages they do not need to be installed again. To run one of the lines in the following code chunk will require that the hash-tag first be removed.</p> <pre class="r"><code>## This is an R chunk ## # install.packages("devtools") ## Development version from GitHub # devtools::install_github("robwschlegel/heatwaveR") ## Stable version from CRAN # install.packages("heatwaveR")</code></pre> <p>With the necessary packages installed, we activate <code>heatwaveR</code> with the following line of code.</p> <pre class="r"><code>## This is an R chunk ## library(heatwaveR)</code></pre> </div> <div id="default-output" class="section level4"> <h4>Default output</h4> <p>With everything downloaded and ready for us to use we may now calculate some events. The <code>heatwaveR</code> package has three built in time series (<code>sst_WA</code>, <code>sst_Med</code>, <code>sst_NW_Atl</code>) that we may use to more easily demonstrate how the code works. The general pipeline in <code>R</code> for calculating events is first to create a climatology from a chosen time series using <code>ts2clm()</code>, and then to feed that output into <code>detect_event()</code>, as seen below.</p> <pre class="r"><code>## This is an R chunk ## # First we create a default climatology as outlined in Hobday et al. (2016) # NB: We use a longer climatology period here as this matches the Python default ts_clim <- ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31")) # Then we feed that object into the second and final function res_r <- detect_event(ts_clim)</code></pre> <p>To look at these outputs we would use the following options. For now we’ll just look at the event output.</p> <pre class="r"><code>## This is an R chunk ## # Look at the top six rows of the first 6 columns of the events res_r$event[1:6, 1:6] # Or perhaps the most intense event res_r$event[res_r$event$intensity_max == max(res_r$event$intensity_max), 1:6]</code></pre> </div> </div> <div id="python-code" class="section level3"> <h3>Python code</h3> <div id="setup-1" class="section level4"> <h4>Setup</h4> <p>To download and install the Python package for calculating MHWs one may run the following line of code in a console:</p> <pre class="r"><code>## This is a bash chunk ## # Note that the hashtag must be removed in orde for the code to run # pip install git+git://github.com/ecjoliver/marineHeatWaves@master</code></pre> <p>Or simply download the <a href="https://github.com/ecjoliver/marineHeatWaves">GitHub repository</a> and follow the instructions there for downloading. Or if still lost, phone a friend!</p> <p>Before we begin the calculations we need to create a time series. I’ve chosen to take the <code>sst_WA</code> time series from the <strong><code>heatwaveR</code></strong> package to ensure consistency between the results. The python code prefers to have the dates and the temperatures fed into it as two separate vectors, so I will first split these up in R before we feed them to Python.</p> <pre class="r"><code>## This is an R chunk ## # Get the built-in time series from heatwaveR ts <- sst_WA # Take only the temperature column sst <- ts$temp # Take only date column dates <- ts$t # If one wants to save the sst values to load into Python write.csv(ts$temp, "data/sst_WA.csv", row.names = FALSE)</code></pre> </div> <div id="default-calculations" class="section level4"> <h4>Default calculations</h4> <p>With the package installed, and our time series prepped, the basic MHW calculation is performed as follows:</p> <pre class="python"><code>## This is a Python chunk ## # Required for data prep and export import numpy as np from datetime import date import pandas as pd # The MHW functions import marineHeatWaves as mhw # The date values # We could take the date values we created above, # But let's rather create the date values in Python t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1) # The temperature values # Note that to fetch an object created in R we use 'r.*' # Where '*' is the name of the R object sst = np.asarray(r.sst) # Or if one rather wants to load the data from a csv file # sst = np.loadtxt("data/sst_WA.csv", delimiter=',', skiprows=1) # because a heading row needs to be skipped # The event metrics mhws, clim = mhw.detect(t, sst) # If one wants to save the results as csv files # Save event results mhws_df = pd.DataFrame.from_dict(mhws) mhws_df.to_csv('data/mhws_py.csv', sep = ',', index=False) # # Save climatology results clim_df = pd.DataFrame.from_dict(clim) clim_df.to_csv('data/clim_py.csv', sep = ',', index=False)</code></pre> </div> </div> <div id="reticulate-code" class="section level3"> <h3>Reticulate code</h3> <p>It is also possible to run the Python code through R with the use of the R package <strong><code>reticulate</code></strong>. This is particularly useful as it allows us to perform the comparisons and benchmarking all within the same language.</p> <div id="setup-2" class="section level4"> <h4>Setup</h4> <p>Here we load the <strong><code>reticulate</code></strong> package and choose the conda environment I’ve already created called <code>py27</code>. For help on how to set up a conda environment go <a href="https://conda.io/docs/user-guide/tasks/manage-environments.html">here</a>. I’ve ensured that all of the required python modules are installed within this environment.</p> <pre class="r"><code>## This is an R chunk ## # install.packages("reticulate") library(reticulate) use_condaenv("py27")</code></pre> <p>Once we’ve told R which version/environment we would like to use for Python we may then load the necessary modules.</p> <pre class="r"><code>## This is an R chunk ## np <- import("numpy") datetime <- import("datetime") mhw <- import("marineHeatWaves")</code></pre> <p>One may run python code in it’s native form within R by passing it as a character vector to <code>py_run_string()</code>.</p> <pre class="r"><code>## This is an R chunk ## py_run_string("t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)") py_run_string("sst = np.loadtxt(open('data/sst_WA.csv', 'r'), delimiter=',', skiprows=1)") # py_run_string("sst.flags.writeable = True") # py_run_string("sst.setflags(write = True)") # py_run_string("sst = np.array(sst).copy()")</code></pre> <p>Or we may just create the objects in native R.</p> <pre class="r"><code>## This is an R chunk ## # These numbers were taken from print(t) in the python code above t <- as.integer(np$array(seq(723546, 735598))) sst <- np$array(sst_WA$temp)</code></pre> </div> <div id="default-calculations-1" class="section level4"> <h4>Default calculations</h4> <p>Either way, once we have created the necessary parts of the time series in either language, we may call the Python functions in R directly with the line of code below.</p> <pre class="r"><code>## This is an R chunk ## # The following line appears not to run because the numpy array that Python expects is given to it by R in a read-only format... # Thus far I've not found a way to correct for this. # So it appears that one may pass R objects to Python code directly, but not to reticulate code... res_python <- mhw$detect(t = t, temp = sst)</code></pre> </div> </div> </div> <div id="comparisons" class="section level2"> <h2>Comparisons</h2> <p>With some climatologies and events calculated with both languages we may now compare the results of the two. I’ll do so here natively in R to avoid any potential hiccups from translating across languages. I’ll create the R output here, and load the Python output created in the Python code chunk above.</p> <pre class="r"><code>## This is an R chunk ## # Load libraries for data manipulation and visualisation library(tidyverse) # Set R results res_event_R <- res_r$event res_clim_R <- res_r$climatology # Set Python results res_event_Python <- read_csv("data/mhws_py.csv") res_clim_Python <- read_csv("data/clim_py.csv")</code></pre> <p>With these default results loaded in the same format in the same language we can now start to look at how they compare. For starters I am just doing some simple sums and correlations.</p> <div id="climatologies" class="section level3"> <h3>Climatologies</h3> <pre class="r"><code>## This is an R chunk ## cor(res_clim_R$seas, res_clim_Python$seas) sum(res_clim_R$seas) - sum(res_clim_Python$seas) cor(res_clim_R$thresh, res_clim_Python$thresh) sum(res_clim_R$thresh) - sum(res_clim_Python$thresh)</code></pre> <p>The seasonal and threshold climatologies correlate very well, but adding them up we see that the Python values are consistently higher. This is due to rounding differences between the languages. Even though the sum difference between the thresholds may look large at a value of ~6.5, considering this is over 12053 days of data, the difference is only ~0.0005/day. So I think it is safe to move on and look at the events.</p> </div> <div id="february-29th" class="section level3"> <h3>February 29th</h3> <p>It was also recently brought to my attention by Zijie Zhao, author of the MATLAB distribution for MHW code, that Python and R calculate February 29th differently. With Python calculating the climatologies first, and then filling in the missing days, whereas R calculates February 29th first, and then the climatologies from that. So with that being said let’s have a peek at the difference this may cause.</p> <pre class="r"><code>## This is an R chunk ## res_clim_R_29 <- res_clim_R %>% filter(doy == 60) %>% select(seas, thresh) %>% distinct() res_clim_Python_29 <- res_clim_Python %>% filter(res_clim_R$doy == 60) %>% select(seas, thresh) %>% distinct() res_clim_R_29$seas - res_clim_Python_29$seas res_clim_R_29$thresh - res_clim_Python_29$thresh</code></pre> <p>Another thing pointed out by Zijie is that Python and MATLAB calculate percentiles differently, this then is likely also true for R and Python. This would explain why the seasonal climatologies (<code>seas</code>) are always much closer than the thresholds (<code>thresh</code>) for all of these tests.</p> </div> <div id="events" class="section level3"> <h3>Events</h3> <p>Below we have a for loop that will run a correlation on all rows with the same name, as well as find the sum difference between them.</p> <pre class="r"><code>## This is an R chunk ## # Remove non-numeric columns res_event_num <- res_event_R %>% select_if(is.numeric) # Run the loop res_event <- data.frame() for(i in 1:length(colnames(res_event_num))){ if(colnames(res_event_num)[i] %in% colnames(res_event_Python)){ x1 <- res_event_R[colnames(res_event_R) == colnames(res_event_num)[i]] x2 <- res_event_Python[colnames(res_event_Python) == colnames(res_event_num)[i]] x <- data.frame(r = cor(x1, x2, use = "complete.obs"), difference = round(sum(x1, na.rm = T) - sum(x2, na.rm = T), 4), var = colnames(res_event_num)[i]) colnames(x)[1] <- "r" rownames(x) <- NULL } else { x <- data.frame(r = NA, difference = NA, var = colnames(res_event_num)[i]) } res_event <- rbind(res_event, x) } na.omit(res_event)</code></pre> <p>It is a bit odd that there is such a large difference between <code>intensity_var</code> for the two language. This implies that R is detecting larger differences in the intensity around the threshold than Python is. This may again be due to the differences in thresholds being detected, but that makes a rather small difference. It is perhaps more likely that variance is calculated slightly differently between languages. Looking at the source code one sees that this is calculated by first finding the variance of the intensity, and then the square root of that. These two calculations likely differ enough between the languages to be causing this. Ultimately it is of little concern as the variance values found throughout both packages are not of much interest to anyone.</p> <p>We may also see that the difference in <code>intensity_max_abs</code> is a bit higher in Python than in R. This is rather surprising as <code>intensity_max_abs</code> simply shows the maximum temperature (independent of any thresholds etc.) experienced during that event. And considering that these calculations were performed on the exact same time series, <code>intensity_max_abs</code> should always be exactly the same between the two languages. Looking at the climatology output one sees that the temperature is still correct, meaning that the difference must occur somewhere within the R function <code>detect_event</code>. Looking at the R source code we see that <code>intensity_max_abs</code> is calculated on line 298, for Python this is calculated on line 376. The reason for the difference is because R is looking for the maximum temperature throughout all dates for a given event, whereas Python is simply taking the temperature on the peak date of the event. This is a subtle but significant difference as this means this value is showing two different things. I’m not certain which is more appropriate, though I’m leaning towards the Python implementation.</p> <p>With all of our overlapping columns compared, and our differences and Pearson r values looking solid, let’s finish off this basic comparison by finding which columns are not shared between the different language outputs. Also, please note that the apparently large differences between the <code>index_start</code>, <code>index_peak</code>, and <code>index_end</code> values are due to the different indexing methods of R and Python. R starts at 1 and Python at 0. The</p> <pre class="r"><code>## This is an R chunk ## cols_R <- colnames(res_event_R)[!(colnames(res_event_R) %in% colnames(res_event_Python))] cols_R cols_Py <- colnames(res_event_Python)[!(colnames(res_event_Python) %in% colnames(res_event_R))] cols_Py</code></pre> <p>Wonderful! Almost everything matches up exactly. The duration of categories of events is something added in R by another function <code>category()</code>, and will be left that way for now. The “time” columns in the Python output aren’t relevant as far as I can see in the present usage as these functions currently only take day values. The different methods of labelling the events will be left as they are for now as well.</p> <p>It is also worth noting that the values for <code>index_start</code>, <code>index_peak</code>, and <code>index_end</code> are off by one between the two languages. This is due to the difference in indexing between the languages. Looking at the <code>date_start</code>, <code>date_peak</code>, and <code>date_end</code> values we see that they are the same.</p> </div> <div id="missing-data" class="section level3"> <h3>Missing data</h3> <p>A bug was discovered in v0.3.3 of the R code with regards to the handling of missing data. Specifically, in the move to a C++ back-end, the R code stopped being able to handle missing data for the climatology calculations. This has since been corrected from v0.3.4 onwards, but it is worthwhile to ensure that missing data are handled the same way.</p> <p>First we’ll create the dataframe with missing data:</p> <pre class="r"><code>## This is an R chunk ## library(padr) library(lubridate) random_knockout <- function(df, prop){ res <- df %>% sample_frac(size = prop) %>% arrange(t) %>% pad(interval = "day") return(res) } # Remove 10% of the data randomly sst_WA_miss_random <- random_knockout(sst_WA, 0.9) write.csv(sst_WA_miss_random$temp, file = "data/sst_WA_miss_random.csv", row.names = F, na = "NaN") # Remove simulated ice cover sst_WA_miss_ice <- sst_WA %>% mutate(month = month(t, label = T), temp = ifelse(month %in% c("Jan", "Feb", "Mar"), NA, temp)) %>% select(-month) write.csv(sst_WA_miss_ice$temp, file = "data/sst_WA_miss_ice.csv", row.names = F, na = "NaN")</code></pre> <p>With the missing data saved, we can now calculate and compare the climatologies that the two different languages will create.</p> <pre class="python"><code>## This is a Python chunk ## # Required for data prep and export import numpy as np from datetime import date import pandas as pd # The MHW functions import marineHeatWaves as mhw # The date values t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1) # The temperature values sst_random = np.loadtxt("data/sst_WA_miss_random.csv", delimiter = ',', skiprows = 1) sst_ice = np.loadtxt("data/sst_WA_miss_ice.csv", delimiter = ',', skiprows = 1) # The event metrics mhws_random, clim_random = mhw.detect(t, sst_random) # It appears as though the Python code can't deal with ice coverage... #mhws_ice, clim_ice = mhw.detect(t, sst_ice) # Save climatology results clim_random_df = pd.DataFrame.from_dict(clim_random) clim_random_df.to_csv('data/clim_py_random.csv', sep = ',', index = False) mhws_random_df = pd.DataFrame.from_dict(mhws_random) mhws_random_df.to_csv('data/mhws_py_random.csv', sep = ',', index = False) #clim_ice_df = pd.DataFrame.from_dict(clim_ice) #clim_ice_df.to_csv('data/clim_ice_py.csv', sep = ',', index = False)</code></pre> <p>With the Python climatologies and events calculated from the missing data we’ll now do the same for R.</p> <pre class="r"><code>## This is an R chunk ## # Load and prep missing data sst_WA_miss_random <- read_csv("data/sst_WA_miss_random.csv") %>% dplyr::rename(temp = x) %>% mutate(t = seq(as.Date("1982-01-01"), as.Date("2014-12-31"), by = "day")) %>% select(t, temp) sst_WA_miss_random$temp[is.nan(sst_WA_miss_random$temp)] <- NA # caluclate R climatologies/events res_random_R <- detect_event(ts2clm(sst_WA_miss_random, maxPadLength = 10, climatologyPeriod = c("1982-01-01", "2014-12-31"))) res_clim_random_R <- res_random_R$climatology res_event_random_R <- res_random_R$event # Load Python results res_clim_random_Python <- read_csv("data/clim_py_random.csv") res_event_random_Python <- read_csv("data/mhws_py_random.csv")</code></pre> <p>And then the comparisons of the <code>seas</code> and <code>thresh</code> values.</p> <pre class="r"><code>## This is an R chunk ## # Compare clims/thresholds cor(res_clim_random_R$seas, res_clim_random_Python$seas) mean(res_clim_random_R$seas) - mean(res_clim_random_Python$seas) cor(res_clim_random_R$thresh, res_clim_random_Python$thresh) mean(res_clim_random_R$thresh) - mean(res_clim_random_Python$thresh)</code></pre> <p>We may see from the results above that the values still correlate very strongly, if slightly less so than with no missing data. We see again that the Python values are higher overall for both the seasonal climatology and the threshold. This time however the differences are much larger. The seasonal climatology is off by ~0.0004/day, which is acceptable, but the 90th percentile threshold is off by ~0.0695. This is too large of a difference and this issue will need to be investigated. It is most likely due to the difference in how quantiles are calculated between the languages though and so there may be little to do about it. Perhaps more importantly is how these differences may affect the events being detected.</p> <pre class="r"><code>## This is an R chunk ## nrow(res_event_random_R) nrow(res_event_random_Python)</code></pre> <pre class="r"><code>## This is an R chunk ## # Remove non-numeric columns res_event_random_num <- res_event_random_R %>% select_if(is.numeric) # Run the loop res_event_random <- data.frame() for(i in 1:length(colnames(res_event_random_num))){ if(colnames(res_event_random_num)[i] %in% colnames(res_event_random_Python)){ x1 <- res_event_random_R[colnames(res_event_random_R) == colnames(res_event_random_num)[i]] x2 <- res_event_random_Python[colnames(res_event_random_Python) == colnames(res_event_random_num)[i]] x <- data.frame(difference = round(sum(x1, na.rm = T) - sum(x2, na.rm = T), 4), var = colnames(res_event_random_num)[i]) # colnames(x)[1] <- "r" rownames(x) <- NULL } else { x <- data.frame(difference = NA, var = colnames(res_event_random_num)[i]) } res_event_random <- rbind(res_event_random, x) } na.omit(res_event_random)</code></pre> <p>Because the count of events detected is not the same, we cannot run correlations. We can sum up the differences to get an idea of how far off things are, and we can see that the results are troubling. The differences in the <code>index_*</code> values can be discounted as we have already established that we have a different number of events. The next largest difference is that for <code>intensity_cumulative_abs</code>. Looking at the one event that is the same between the languages one may see that this value is the same for both, meaning that we may discount this difference as it is not due to calculation differences, but rather also due to the differing number of events. Regardless, it is an indicator that overall the much lower threshold in the R language is giving us results that look much more dire than the Python language. This is problematic and stems from the quantile calculation issue.</p> <p>I looked into this in quite some detail and it appears to stem not only from the calculation of quantiles being different, but also from the R quantile calculations never giving more than one additional decimal place of precision over the initial data, whereas the Python results float near infinity. This explains why the R results always tend to be lower, but does not account for why the presence of missing data has such a dramatic effect.</p> <pre class="r"><code>## This is an R chunk ## clim_only_random_R <- res_clim_random_R %>% select(doy, seas, thresh) %>% distinct() %>% filter(doy != 60) clim_only_random_Python <- res_clim_random_Python %>% slice(1:365) cor(clim_only_random_R$seas, clim_only_random_Python$seas) sum(clim_only_random_R$seas) - sum(clim_only_random_Python$seas) cor(clim_only_random_R$seas, clim_only_random_Python$seas) sum(clim_only_random_R$thresh) - sum(clim_only_random_Python$thresh)</code></pre> <p>Looking at only the values for one seasonal climatology or threshold cycle we see that the differences are the same as when we compare all of the data. I don’t know why they wouldn’t be… but I wanted to check.</p> <p>After having gone over the R code very thoroughly I think the next course of action is to go through the step by step output of the Python code to see where exactly these changes begin to occur. It could be in the following places:</p> <ul> <li>clim_calc_cpp: The calculation of quantile values <ul> <li>The rounding of the results to 0.001 will have an influence, but can’t explain the size of the difference when missing data are present</li> </ul></li> <li>smooth_percentile: RcppRoll::roll_mean is used for the rolling means and this may differ greatly from Python when missing data are present <ul> <li>Line 290 in the Python code is where the function <code>runavg</code> is used for the same purpose</li> </ul></li> </ul> </div> </div> <div id="benchmarks" class="section level2"> <h2>Benchmarks</h2> <p>The final thing we want to look at in this vignette is the speed differences in calculating MHWs between the two languages.</p> <div id="r" class="section level3"> <h3>R</h3> <pre class="r"><code>library(microbenchmark) # The old R code library(RmarineHeatWaves) microbenchmark(detect(make_whole(data = sst_WA), climatology_start = "1982-01-01", climatology_end = "2014-12-31"), times = 10) # The new R code microbenchmark(detect_event(ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))), times = 10)</code></pre> <p>Unsurprisingly, the average for running the heatwave analysis on <strong><code>heatwaveR</code></strong> 10 times comes out at ~0.194 seconds per calculations, which is faster than the average of ~0.730 with <strong><code>RmarineHeatWaves</code></strong>.</p> </div> <div id="python" class="section level3"> <h3>Python</h3> <pre class="python"><code>import time total = 0 for i in range(10): start = time.clock() mhws, clim = mhw.detect(t, sst) end = time.clock() total += end-start bench_py = total/10 print(bench_py)</code></pre> <p>The average speed for running the Python code 10 times comes out at ~0.175 seconds, which is slightly faster than R. And the Python code is also calculating the event categories, which R is not. That is done in an additional step with <code>category()</code>. The R code is however allowing for a wider range of options for climatologies, and may be more easily multi-cored, as shown in this <a href="https://robwschlegel.github.io/heatwaveR/articles/gridded_event_detection.html">vignette</a>.</p> </div> </div> <div id="conclusion" class="section level2"> <h2>Conclusion</h2> <p>Overall the basic applications of these two languages provide near identical results, and nearly the same computational time. I would not say that one is better enough than the other to warrant switching between languages. Rather this vignette supports the argument that people collaborating on a research project while using R and Python do not need to worry about their results. The exception currently being that large amounts of missing data appear to be influencing the calculation of the 90th percentile threshold differently. This is something that will be investigated further and corrected as necessary.</p> <div id="refs"> </div> </div> <div id="session-information" class="section level2"> <h2>Session information</h2> <pre class="r"><code>sessionInfo()</code></pre> </div> <!-- Adjust MathJax settings so that all math formulae are shown using TeX fonts only; see http://docs.mathjax.org/en/latest/configuration.html. This will make the presentation more consistent at the cost of the webpage sometimes taking slightly longer to load. Note that this only works because the footer is added to webpages before the MathJax javascript. --> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ "HTML-CSS": { availableFonts: ["TeX"] } }); </script> <hr> <p> This reproducible <a href="http://rmarkdown.rstudio.com">R Markdown</a> analysis was created with <a href="https://github.com/jdblischak/workflowr">workflowr</a> 1.1.1 </p> <hr> </div> </div> </div> <script> // add bootstrap table styles to pandoc tables function bootstrapStylePandocTables() { $('tr.header').parent('thead').parent('table').addClass('table table-condensed'); } $(document).ready(function () { bootstrapStylePandocTables(); }); </script> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>