Overview

It has been shown that the length (duration) of a time series may be one of the most important factors in determining the usefulness of those data for a range of applications (Schlegel and Smit 2016). For the detection of MHWs it is recommended that one have at least 30 years of data in order to accurately detect the events therein. This is no longer an issue for most of the ocean surface because many high-quality SST products, such as NOAA OISST, now exceed the 30-year minimum. However, besides issues relating to the coarseness of these older products, there are many other reasons why one would want to detect MHWs in products with shorter records. In situ measurements along the coast are one good example, another being the newer higher-resolution SST products, and a third being the use of reanalysis/model products for the detection of events in 3D. It is therefore necessary to quantify the effect that a shorter time series duration (< 30 years) has on the accurate detection of events. Once any discrepancies have been accounted for, best practices must be developed to allow users to improve the precision of event detection in their shorter time series.

Time series shortening

A time series derives its usefulness from its length. This is generally because the greater the number of observations (the larger the sample size), the more confident one can be about the resultant statistics. This is particularly true for decadal-scale measurements and is why the WMO recommends a minimum of 30 years of data before drawing any conclusions about decadal (long-term) trends observed in a time series. We, however, want to look not at decadal-scale trends, but rather at how observed MHWs compare against one another when they have been detected using longer or shorter time series. In order to quantify this effect we will use the minimum and maximum range of time series lengths available to us. The maximum length will be 33 years, as this is the full extent of the NOAA OISST data included in heatwaveR. The minimum length will be 3 years, as it is not possible to calculate the climatology statistics with fewer years than this. Any time series under 5 – 10 years in length will almost certainly produce wildly inaccurate results, so we use these short time series more as a proof against their usability than to show that they may be a viable option.

Before we can discuss shortening techniques, we must first consider the inherent decadal trend in the time series themselves. This is usually the primary driver of changes in event detection over time (Oliver et al. 2018). Therefore, if we truly want to understand the effect that a shortened time series may have on event detection, apart from the effect of the decadal trend, we must first de-trend the time series. It is however worthwhile to compare results from the different time series lengths both with and without the long-term trends removed. To this end there is an entire separate vignette that quantifies the effect of long-term trends on the detection of events.

The time series shortening technique proposed here is re-sampling. This methodology takes the full 33-year de-trended time series and randomly samples n years from it to simulate a 3 – 33 year long time series. One then uses the jumbled, randomly selected years of data to create the seasonal signal and 90th percentile threshold (hereafter referred to as “the climatologies”). The real temperature data, in the correct annual order, are still used against the re-sample-created climatologies. Re-sampling the time series in this way is useful because it allows us to replicate random climatology creation 100 (or more) times for each time series length, producing a more confident estimate of how climatologies generated from a given time series length are likely to impact the accuracy of event detection.
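
As a minimal sketch of the idea (assuming the packaged sst_WA series; the full implementation used in this vignette appears further below), a single simulated 10-year record can be built by drawing ten values per day-of-year with replacement:

library(tidyverse)
library(heatwaveR)

# For each day-of-year (doy), draw 10 values (i.e. 10 "years") with
# replacement to simulate a 10-year record from the full 33-year series
sim_10_year <- sst_WA %>% 
  mutate(doy = lubridate::yday(t)) %>% 
  group_by(doy) %>% 
  sample_n(size = 10, replace = TRUE) %>% 
  ungroup()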

The WMO recommends that the period used to create a climatology be 30 years in length, starting on the first year of a decade (e.g. 1981); however, because we are going to be running these analyses on time series of many different lengths, it will be more consistent to use all of the years present in each as the climatology period. For example, if a time series spans 1983 – 2014, that will be the period used for calculating the climatologies. Likewise, should the time series only span the years 2006 – 2014, that will be the period used.
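
For instance, with the packaged sst_WA series (assumed here to span 1982 – 2014), this amounts to:

# Use the full span of the time series as the climatology period
sst_WA_clim <- ts2clm(sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))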

Re-sampling

In this section we will compare the detection of events when we shorten time series via re-sampling, simulating 100 time series at each length from 3 – 33 years. This is presently being done with the three pre-packaged time series available in the heatwaveR package and the python distribution, but will eventually be run on the global data.

The effect this has on the climatologies will be quantified through the following statistics:

  • for each day-of-year (doy) in the climatology, the SD of the climatological means of the 100 re-samplings;
  • for each doy, the RMSE of the re-sampled means relative to the true climatology (i.e. the one produced from the 33-year-long time series);
  • an ANOVA comparing the climatologies created from the different time series lengths;
  • a post-hoc Tukey test to determine where the differences in time series length for climatology creation become significant;
  • pairwise Kolmogorov-Smirnov tests to determine whether the distributions of the climatologies differ.

The effect this has on the MHW metrics will be quantified with the following statistics:

  • the SD of the MHW metrics from each of the 100 re-samplings per time series length;
  • an ANOVA comparing the primary metrics for MHWs detected using the different time series lengths;
  • a post-hoc Tukey test to show where the differences in time series length for event detection become significant;
  • confidence intervals (CIs) for the measured metrics.

In addition to checking all of the statistics mentioned above, it is also necessary to record the overall relationship between the reduction of time series length and the resultant climatologies/metrics. This will form part of the best practice advice later.

library(tidyverse)
library(broom)
library(heatwaveR)
library(lubridate) # This is intentionally activated after data.table
library(fasttime)
library(ggpubr)
library(boot)
library(FNN)
library(mgcv)
library(doMC); doMC::registerDoMC(cores = 3)
# Remove the trend from a time series
detrend <- function(df){
  resids <- broom::augment(lm(temp ~ t, df))
  res <- df %>% 
    mutate(temp = temp - resids$.fitted)
  return(res)
}
# ggplot(sst_ALL_detrend, aes(x = t, y = temp)) +
#   geom_point() +
#   geom_smooth(method = "lm") +
#   facet_wrap(~site, nrow = 3)

# A clomp of functions used below
# Written out here for tidiness/convenience

# Calculate only events
detect_event_event <- function(df, y = temp){
  ts_y <- eval(substitute(y), df)
  df$temp <- ts_y
  res <- detect_event(df)$event
  return(res)
  }

# Run an ANOVA on each metric of the combined event results and get the p-value
event_aov_p <- function(df){
  aov_models <- df[ , -grep("year_index", names(df))] %>%
    map(~ aov(.x ~ df$year_index)) %>% 
    map_dfr(~ broom::tidy(.), .id = 'metric') %>%
    mutate(p.value = round(p.value, 4)) %>%
    filter(term != "Residuals") %>%
    select(metric, p.value)
  return(aov_models)
  }

# Run an ANOVA on each metric and then a Tukey test
event_aov_tukey <- function(df){
  aov_tukey <- df[ , -grep("year_index", names(df))] %>%
    map(~ TukeyHSD(aov(.x ~ df$year_index))) %>% 
    map_dfr(~ broom::tidy(.), .id = 'metric') %>%
    mutate(adj.p.value = round(adj.p.value, 4)) %>%
    # filter(term != "Residuals") %>%
    select(metric, comparison, adj.p.value) %>% 
    # filter(adj.p.value <= 0.05) %>% 
    arrange(metric, adj.p.value)
  return(aov_tukey)
  }

# Calculate CIs using a bootstrapping technique to deal with the non-normal small sample sizes

# df <- sst_ALL_both_event %>%
#   filter(lubridate::year(date_peak) >= 2012,
#          site == "WA",
#          trend == "detrended") %>%
#   select(year_index, duration, intensity_mean, intensity_max, intensity_cumulative) #%>%
  # select(site, trend, year_index, duration, intensity_mean, intensity_max, intensity_cumulative) #%>%   
  # nest(-site, -trend)

Bmean <- function(data, indices) {
  d <- data[indices] # allows boot to select a sample
  return(mean(d))
}

event_aov_CI <- function(df){
  df_conf <- gather(df, key = "metric", value = "value", -year_index) %>% 
    group_by(year_index, metric) %>% 
    # ggplot(aes(x = value)) +
    # geom_histogram(aes(fill = metric)) +
    # facet_wrap(~metric, scales = "free_x")
    summarise(lower = boot.ci(boot(data=value, statistic=Bmean, R=1000), type = "basic")$basic[4],
              mid = mean(value),
              upper = boot.ci(boot(data=value, statistic=Bmean, R=1000), type = "basic")$basic[5],
              n = n()) %>% 
    mutate_if(is.numeric, round, 4)
  return(df_conf)
  }

# A particular summary output
event_output <- function(df){
  res <- df %>%
    group_by(year_index) %>% 
    # select(-event_no) %>% 
    summarise_all(c("mean", "sd"))
  return(res)
}

# Quick wrapper for getting results for ANOVA on clims
clim_aov <- function(df){
  res <- df %>% 
    select(-t, - temp, -doy) %>% 
    mutate(year_index = as.factor(year_index)) %>% 
    unique() %>% 
    group_by(site, trend) %>% 
    nest() %>% 
    mutate(res = map(data, event_aov_p)) %>% 
    select(-data) %>% 
    unnest()
  return(res)
}

# Quick wrapper for getting results for Tukey on clims
clim_tukey <- function(df){
  res <- df %>% 
    select(-t, - temp, -doy) %>% 
    mutate(year_index = as.factor(year_index)) %>% 
    unique() %>% 
    group_by(site, trend) %>% 
    nest() %>% 
    mutate(res = map(data, event_aov_tukey)) %>% 
    select(-data) %>% 
    unnest()
  return(res)
}
# Calculate one specific clim period

# df <- sst_ALL_parse %>% 
#   filter(rep == 1, site == "WA", year_index2 == 2010) %>% 
  # group_by(rep, site, year_index2) %>% 
  # select(-rep, -site, -year_index2, -doy)
ts2clm_sub <- function(df){
  year_start <- df$year_index[1]
  res <- ts2clm(df, climatologyPeriod = c(paste0(year_start,"-01-01"), "2014-12-31")) %>%
    # filter(lubridate::year(t) >= year_index) %>%
    mutate(year_index = year_start)
  return(res)
}
# 
# # Calculate the full range of clims in a time series
# ts2clm_ALL <- function(df){
#   seq_year <- seq(min(lubridate::year(df$t)), 
#                   max(lubridate::year(df$t))-2)
#   res <- plyr::ldply(seq_year, .fun = ts2clm_sub, .parallel = T, df = df)
#   return(res)
# }

# df <- sst_ALL_detrend %>% 
#   filter(site == "Med") #%>% 
#   # select(-site)

sample_all <- function(n_sample, df){
  res <- sample_n(df, size = n_sample, replace = T) %>% 
    mutate(year_index = 2014-n_sample+1)
  return(res)
}

sst_repl <- function(df) {
  n_range <- seq(3, 33)
  res <- df %>% 
    group_by(site, doy) %>% 
    do(plyr::ldply(n_range, .fun = sample_all, .parallel = T, df = .)) %>% 
    ungroup() %>% 
    as_tibble()
  return(res)
}

# Re-sample the data 100 times
sst_ALL_repl <- purrr::rerun(100, sst_repl(sst_ALL_detrend)) %>%
  map_df(as.data.frame, .id = "rep")
save(sst_ALL_repl, file = "data/sst_ALL_repl.Rdata")

# Correct dates for upcoming climatology calculations
sst_ALL_parse <- sst_ALL_repl %>% 
  mutate(year_index2 = year_index) %>% 
  group_by(site, rep, doy, year_index2) %>% 
  parse_date_sst() %>%
  ungroup() %>% 
  # select(-year_index2) %>% 
  as_tibble()

# Calculate climatologies
sst_ALL_clim <- sst_ALL_parse %>% 
  group_by(rep, site, year_index2) %>% 
  nest() %>% 
  mutate(clims = map(data, ts2clm_sub))
  # do(ts2clm_sub(df = .))
  # do(ts2clm_sub(year_index = year_index, df = .))

save(sst_ALL_clim, file = "data/sst_ALL_clim.Rdata")

Climatology statistics

doy SD

(ECJO)[I have a hard time seeing what the doy std. dev. and RMSE are telling us. Is it information contributing to the question we are asking? I don’t think so….] (RWS)[This was incorporated at the outset of the project, but I’m not certain it is still the focus of the paper. It may get cut when I re-think what is the central question we are trying to answer.]

By looking at how the standard deviation (SD) changes over a year, we can see how re-sampling a greater number of times may or may not affect the accuracy of the climatologies created.
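
A sketch of these per-doy statistics is given below. Note that sst_ALL_clim_only (a long data frame of the re-sampled seasonal means with columns rep, site, year_index, doy, and seas) and clim_true (the climatology from the full 33-year series, with columns site, doy, and seas) are assumed stand-ins, not objects created above.

# Per-doy SD of the re-sampled climatological means, and their RMSE
# relative to the "true" 33-year climatology (assumed objects; see above)
clim_doy_stats <- sst_ALL_clim_only %>% 
  left_join(clim_true, by = c("site", "doy"), suffix = c("", "_true")) %>% 
  group_by(site, year_index, doy) %>% 
  summarise(seas_sd = sd(seas),
            seas_rmse = sqrt(mean((seas - seas_true)^2)))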

Kolmogorov-Smirnov tests

# Extract climatology values only
sst_ALL_both_clim_only <- sst_ALL_both_clim %>% 
  select(-t, -temp) %>% 
  unique() %>% 
  arrange(site, trend, year_index, doy)

KS_sub <- function(df, year_index_1, year_index_2){
  df_1 <- df %>% 
    filter(year_index == year_index_1)
  df_2 <- df %>% 
    filter(year_index == year_index_2)
  res <- data.frame(seas = round(ks.test(df_1$seas, df_2$seas)$p.value, 4),
                    thresh = round(ks.test(df_1$thresh, df_2$thresh)$p.value, 4),
                    year_1 = year_index_1,
                    year_2 = year_index_2)
  return(res)
}

# The KS results
sst_ALL_both_clim_KS_p <- data.frame()
for(i in 1:length(unique(sst_ALL_both_clim_only$year_index))){
  year_index_1 <- unique(sst_ALL_both_clim_only$year_index)[i]
  for(j in 1:length(unique(sst_ALL_both_clim_only$year_index))){
    year_index_2 <- unique(sst_ALL_both_clim_only$year_index)[j]
    if(year_index_1 < year_index_2){
      suppressWarnings(
      res <- sst_ALL_both_clim_only %>% 
        group_by(site, trend) %>% 
        do(KS_sub(., year_index_1 = year_index_1, year_index_2 = year_index_2)) %>% 
        ungroup() %>% 
        as_tibble()
      )
      sst_ALL_both_clim_KS_p <- rbind(sst_ALL_both_clim_KS_p, res)
    }
  }
}

sst_ALL_both_clim_KS_p_long <- sst_ALL_both_clim_KS_p %>% 
  tidyr::gather(key = "metric", value = "p.value", -site, -trend, -year_1, -year_2)

ggplot(sst_ALL_both_clim_KS_p_long, aes(x = year_1, y = year_2)) +
  geom_tile(aes(fill = p.value)) +
  geom_point(data = filter(sst_ALL_both_clim_KS_p_long, p.value <= 0.05),
             size = 0.01) +
  # geom_text(aes(label = round(adj.p.value, 2))) +
  scale_fill_gradient2(midpoint = 0.1) +
  coord_equal(expand = 0) +
  facet_grid(site~metric+trend) +
  labs(x = "", y = "")

MHW metrics

ANOVA p-values

# Then calculate events using the many re-sampled clims on the real temperature data
sst_ALL_smooth_event <- sst_ALL_smooth_real %>% 
  group_by(site, id, rep) %>% 
  nest() %>% 
  mutate(res = map(data, detect_event_event)) %>%
  select(-data) %>% 
  unnest(res) %>% 
  filter(date_start >= "2002-01-01", date_end <= "2011-12-31") %>% 
  select(-c(index_start:index_end, date_start:date_end))

save(sst_ALL_smooth_event, file = "data/sst_ALL_smooth_event.Rdata")

The heatmap above shows the ANOVA results (p-values) when one compares the events detected in the 100 replicated re-samples of 10, 20, or 30 years of data. This means that very large pools of events are being compared against one another.

(RWS: It would be worthwhile to compare the MHW metrics for each individual re-sampled time series against the MHW metrics from the real time series of the same length.)

Confidence intervals

The CI plot above demonstrates that the general increase in MHW metrics with longer time series, detected in the first round of experiments (when the real 10, 20, and 30 year baselines were used to generate a single climatology each), does not hold up when we re-sample the data 100 times. The spread of the CIs shrinks dramatically with re-sampling, and we also see the three different time periods converge around 0, whereas the samples from the real data show the effect that the long-term trend has on the calculation of the metrics. The reason that we do not see the trend of increasing MHW duration/intensity with re-sampling is that re-sampling effectively removes the decadal trend from the data. It is for this reason that I recommended at the start of this document that a second type of re-sampling be performed that would better maintain the effect of the decadal trend on the data.

Note that in an earlier version of this vignette bootstrapping was also tested. It has since been removed, as it was shown not to be effective. The bootstrapping of random values to create climatologies produced much lower values than the real data: while bootstrapping does sample the data randomly, it then takes those n random samples and reduces them to a single mean value. This creates artificially even values, so a 90th percentile threshold calculated from them is almost identical to the seasonal climatology.
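
A minimal illustration with synthetic data shows why this happens: the distribution of bootstrapped means is far narrower than the distribution of the data themselves, so its 90th percentile sits only just above the overall mean.

# Synthetic example: the 90th percentile of bootstrapped means collapses
# towards the mean, unlike the 90th percentile of the raw values
set.seed(13)
temp <- rnorm(1000, mean = 20, sd = 2) # stand-in for daily temperatures
boot_means <- replicate(1000, mean(sample(temp, 100, replace = TRUE)))
quantile(temp, 0.9)       # well above the mean (~22.6)
quantile(boot_means, 0.9) # barely above the mean (~20.3)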

Categories

In this section we want to look at how the categories of events in the different time periods compare. I’ll start with basic calculations and comparisons of the categories of events across the different time periods and then, based on how that looks, see if I can think of a way of quantifying the differences.

The results in the figure above are about what I was expecting. We tend to see more large events when the climatologies are based on 30 years of data rather than 10.

# Calculate categories and filter only those from 2002 -- 2011
sst_ALL_smooth_real_category <- sst_ALL_smooth_real %>% 
  group_by(site, id, rep) %>% 
  nest() %>% 
  mutate(res = map(data, detect_event_category)) %>% 
  select(-data) %>% 
  unnest(res) %>% 
  filter(peak_date >= "2002-01-01", peak_date <= "2011-12-31") #%>% 
  # select(-c(index_start:index_end, date_start:date_end))
save(sst_ALL_smooth_real_category, file = "data/sst_ALL_smooth_real_category.Rdata")
load("data/sst_ALL_smooth_real_category.Rdata")

When re-sampling we actually see the detection of more large-category events with the 10 year periods than with either 20 or 30, but the difference is negligible. As noted earlier, this re-sampling removes the decadal trend from the data, which seems to cause the climatologies to become more similar, not less. Remember that the real temperature values are being used to detect these events, not the re-sampled temperature values that were used to calculate the climatologies. This is because the re-sampled temperatures were too jumbled for reliable event detection.

Comparing top events

(AJS: I was thinking about the event categories. What we need to do is detect the top five events using the 30 yr climatology. Then, using the sections of time series that overlap with these events, use reduced time series, make climatologies, and see how those top five events compare in their event metrics to those detected using the shorter climatologies. Then we do the reverse. Detect events in the reduced time series, find the top five, and see what the matching ones in the full time series are like in comparison. So, compare specific events in addition to the way in which you have done it. Then do the same for median-sized events, and the smallest ones. Maybe even those at the 2nd and 5th quantiles.)

(RWS: Parts of this proposed methodology proved to be rather tricky given how the rest of the methodology has been framed thus far so I have not yet made all of the suggestion posed here. I have only compared the top five events, as proposed by AJS above.)

(ECJO)[All of this section is really just pulling out the effect of the long-term trend. It is not quantifying the uncertainty due to a short time series.]

(RWS)[This again is the same problem running throughout. It would perhaps be best just to detrend everything…]

The bar plots above are perhaps a bit difficult to interpret. They show the top five events detected in the real 10, 20, or 30 year time series, with each of those events then matched to the same event in the 30 year time series; it is the category of that matching event from the 30 year time series that is shown in the figure. For example, consider the event categories from the 10 year time series for the Northwest Atlantic (NW_Atl). Three of these events are ‘I Moderate’ and two are ‘II Strong’. When we look at the events based on 30 years of data we see that all of the top five are ‘II Strong’. This is because the 90th percentile threshold is higher when one uses only 10 years of data, so the events receive a lower category. Oddly, this does not hold up in the Mediterranean time series.

Having shown above the differences in category counts for the events detected with the given time series lengths, we will now compare the categories of the matching events between the different climatologies used, as in the sketch below.
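
In this sketch of the matching step, events_10 and events_30 are assumed stand-ins for the detect_event() output under the 10 and 30 year baselines, and an event is taken to match when the 30 year event’s span contains the 10 year event’s peak date.

# Take the top five events from the 10 year baseline and find, for each,
# the event in the 30 year baseline whose span contains that peak date
top5_10 <- top_n(events_10, 5, intensity_cumulative)
matched_30 <- purrr::map_dfr(top5_10$date_peak, function(peak) {
  dplyr::filter(events_30, date_start <= peak, date_end >= peak)
})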

Here we are essentially doing the opposite of the previous figure. The figure above shows what the categories of the top five events in the 10 year time series really are, compared against the categories of the matching events in the 30 year time series. We see that the 30 year time series event categories are consistently somewhat larger.

The above figure shows a similar pattern to the previous one, except that the discrepancy in category classifications between the top five events from the 20 vs 30 year time periods is smaller than for 10 vs 30 years. This is to be expected.

(RWS: As proposed by AJS, were this section of the analysis to be completed, the median and minimum five events should also be compared.)

Trend control

(RWS)[This section will seek to understand the relationship between long-term trends and the effects they have on MHW results. This will be accomplished by adding known amounts of trend to de-trended time series and quantifying the results.]
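
A sketch of how a known trend could be re-introduced is given below; add_trend() is a hypothetical helper and the rate is expressed in °C/decade.

# Add a known linear trend (degrees C per decade) back onto a de-trended
# series; 3652.5 is the approximate number of days in a decade
add_trend <- function(df, trend_per_decade) {
  df %>% 
    mutate(temp = temp + as.numeric(t - min(t)) / 3652.5 * trend_per_decade)
}
# e.g. add_trend(sst_WA_detrend, trend_per_decade = 0.1), where
# sst_WA_detrend is a hypothetical de-trended series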

Can we fix it?

Fourier transform climatologies

Lastly, we will reproduce all of the checks made above, but based on a climatology derived from a Fourier transformation rather than the default climatology creation method.

(RWS: I’ve not gotten to the Fourier transformations yet either. I’m also not convinced that this is going to be the way forward. I am thinking that it may be better to just increase the smoothing window width to produce a more sinusoidal climatology line. But I suppose that in order to know that it must be tested.)

(ECJO)[One advantage of the Fourier transform method (or, I think equivalently harmonic regression) is that it is VERY fast. Much faster than the method used in the MHW code. I also think you can probably get away with shorter time series using this method than with the “standard MHW climatology” method]

(RWS)[I suppose then that I will need to look into this.]
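
As a starting point, a harmonic-regression climatology might look something like the rough sketch below (this is not heatwaveR’s method); it fits sine/cosine pairs to day-of-year and uses the fitted values as the seasonal signal.

# Fit a harmonic (Fourier) seasonal climatology to a daily series with
# columns t (Date) and temp; two annual harmonics are often enough for SST
fit_harmonic_clim <- function(df, n_harmonics = 2) {
  doy <- lubridate::yday(df$t)
  harmonics <- do.call(cbind, lapply(seq_len(n_harmonics), function(k) {
    cbind(sin(2 * pi * k * doy / 365.25), cos(2 * pi * k * doy / 365.25))
  }))
  fit <- lm(df$temp ~ harmonics)
  dplyr::distinct(data.frame(doy = doy, seas = fitted(fit)))
}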

Best practices

(RWS: I envision this last section distilling everything learned above into a nice bullet list. The bulleted items from each vignette would then all be added up in one central best practices vignette that would be ported over into the final write-up in the discussion/best-practices section of the paper. Ideally these best practices could also be incorporated into the R/python distributions of the MHW detection code to allow users to make use of the findings from the paper in their own work.)

Options for improving climatology precision in short time series

Options for improving MHW metric precision in short time series

References

Oliver, Eric C.J., Markus G. Donat, Michael T. Burrows, Pippa J. Moore, Dan A. Smale, Lisa V. Alexander, Jessica A. Benthuysen, et al. 2018. “Longer and more frequent marine heatwaves over the past century.” Nature Communications 9 (1). https://doi.org/10.1038/s41467-018-03732-9.

Schlegel, Robert W., and Albertus J. Smit. 2016. “Climate Change in Coastal Waters: Time Series Properties Affecting Trend Estimation.” Journal of Climate 29 (24): 9113–24. https://doi.org/10.1175/JCLI-D-16-0014.1.