Vector preparation

Last updated: 2019-06-10

workflowr checks: (Click a bullet for more information)

✔ R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20190513)

The command set.seed(20190513) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

✔ Repository version: c6b3c7b

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/NAPA_clim_U.Rda
    Ignored:    data/NAPA_clim_V.Rda
    Ignored:    data/NAPA_clim_W.Rda
    Ignored:    data/NAPA_clim_emp_ice.Rda
    Ignored:    data/NAPA_clim_emp_oce.Rda
    Ignored:    data/NAPA_clim_fmmflx.Rda
    Ignored:    data/NAPA_clim_mldkz5.Rda
    Ignored:    data/NAPA_clim_mldr10_1.Rda
    Ignored:    data/NAPA_clim_qemp_oce.Rda
    Ignored:    data/NAPA_clim_qla_oce.Rda
    Ignored:    data/NAPA_clim_qns.Rda
    Ignored:    data/NAPA_clim_qsb_oce.Rda
    Ignored:    data/NAPA_clim_qt.Rda
    Ignored:    data/NAPA_clim_runoffs.Rda
    Ignored:    data/NAPA_clim_ssh.Rda
    Ignored:    data/NAPA_clim_sss.Rda
    Ignored:    data/NAPA_clim_sst.Rda
    Ignored:    data/NAPA_clim_taum.Rda
    Ignored:    data/NAPA_clim_vars.Rda
    Ignored:    data/NAPA_clim_vecs.Rda
    Ignored:    data/node_mean_all_anom.Rda
    Ignored:    data/packet_all_anom.Rda
    Ignored:    data/som_all_anom.Rda
    Ignored:    data/synoptic_states.Rda
    Ignored:    data/synoptic_vec_states.Rda

Untracked files:
    Untracked:  docs/figure/som.Rmd/
    Untracked:  output/som_plot_uoce_anom.pdf
    Untracked:  output/som_plot_voce_anom.pdf

Unstaged changes:
    Modified:   analysis/MHWNWA.bib
    Modified:   code/workflow.R
    Modified:   output/som_plot_mldr10_1_anom.pdf
    Modified:   output/som_plot_qt_anom.pdf
    Modified:   output/som_plot_sst_anom.pdf
    Modified:   output/som_plot_taum_anom.pdf

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
html	c61a15f	robwschlegel	2019-06-06	Build site.
html	46b98f2	robwschlegel	2019-06-06	Build site.
Rmd	49c6105	robwschlegel	2019-06-06	Finished vector data packet pipeline
Rmd	44ac335	robwschlegel	2019-06-06	Working on inclusion of vectors into SOM pipeline

Introduction

It would seem logical that the U, V, and W vectors for water movement would be prepared at the same time as all of the other variables, as seen in the Variable preparation vignette. The issue we face with the vectors is that they are stored in the 3D version of the model, which has a 5 day temporal resolution. This means that they will need to be handled slightly differently from the rest of the variables. Largely though we should be able to adjust the functions previously created for the other variables.

# Packages used in this vignette
library(jsonlite, lib.loc = "../R-packages/")
library(tidyverse) # Base suite of functions
library(ncdf4) # For opening and working with NetCDF files
library(lubridate) # For convenient date manipulation

# Load functions required below
source("code/functions.R")

# Set number of cores
doMC::registerDoMC(cores = 50)

# Disable scientific notation for numeric values
  # I just find it annoying
options(scipen = 999)

# Corners of the study area
NWA_corners <- readRDS("data/NWA_corners.Rda")

# The NAPA surface data location
NAPA_files <- dir("../../data/NAPA025/1d_grid_T_2D", full.names = T)

# The NAPA 5 day U vector data location
NAPA_U_files <- dir("../../data/NAPA025/5d_grid_U", full.names = T)

# The NAPA 5 day V vector data location
NAPA_V_files <- dir("../../data/NAPA025/5d_grid_V", full.names = T)

# The NAPA 5 day W vector data location
NAPA_W_files <- dir("../../data/NAPA025/5d_grid_W", full.names = T)

# The NAPA model lon/lat values
NAPA_coords <- readRDS("data/NAPA_coords.Rda")

# The NAPA model lon/lat values subsetted to the study area
NAPA_coords <- readRDS("data/NAPA_coords_sub.Rda")

# Load NAPA bathymetry/lon/lat
NAPA_bathy <- readRDS("data/NAPA_bathy.Rda")

# Load NAPA bathymetry/lon/lat subsetted to study area
NAPA_bathy <- readRDS("data/NAPA_bathy_sub.Rda")

# Load MHW results
NAPA_MHW_sub <- readRDS("data/NAPA_MHW_sub.Rda")

# MHW Events
NAPA_MHW_event <- NAPA_MHW_sub %>%
  select(-clims, -cats) %>%
  unnest(events) %>%
  filter(row_number() %% 2 == 0) %>%
  unnest(events)

Vector variables

We obviously want the primary vectors from the model, but I noticed that there are other things stored in these 5 day NetCDf files, so let’s have a closer look.

# The U variables
NAPA_U <- ncdump::NetCDF(NAPA_U_files[1])$variable[1:6]
NAPA_U

# A tibble: 9 x 6
  name           ndims natts prec   units                longname         
  <chr>          <int> <int> <chr>  <chr>                <chr>            
1 depthu_bounds      2     0 float  ""                   depthu_bounds    
2 e3u                4    10 float  m                    U-cell thickness 
3 nav_lat            2     3 float  degrees_north        Latitude         
4 nav_lon            2     3 float  degrees_east         Longitude        
5 time_centered      1     6 double seconds since 1900-… Time axis        
6 time_centered…     2     0 double ""                   time_centered_bo…
7 time_counter_…     2     0 double ""                   time_counter_bou…
8 uoce               4    10 float  m/s                  Sea Water X Velo…
9 utau               3    10 float  N/m2                 Surface Downward…

# The V variables
NAPA_V <- ncdump::NetCDF(NAPA_V_files[1])$variable[1:6]
NAPA_V

# A tibble: 9 x 6
  name           ndims natts prec   units                longname         
  <chr>          <int> <int> <chr>  <chr>                <chr>            
1 depthv_bounds      2     0 float  ""                   depthv_bounds    
2 e3v                4    10 float  m                    V-cell thickness 
3 nav_lat            2     3 float  degrees_north        Latitude         
4 nav_lon            2     3 float  degrees_east         Longitude        
5 time_centered      1     6 double seconds since 1900-… Time axis        
6 time_centered…     2     0 double ""                   time_centered_bo…
7 time_counter_…     2     0 double ""                   time_counter_bou…
8 voce               4    10 float  m/s                  Sea Water Y Velo…
9 vtau               3    10 float  N/m2                 Surface Downward…

# The W variables
NAPA_W <- ncdump::NetCDF(NAPA_W_files[1])$variable[1:6]
NAPA_W

# A tibble: 8 x 6
  name           ndims natts prec   units                longname         
  <chr>          <int> <int> <chr>  <chr>                <chr>            
1 avt                4    10 float  m2/s                 vertical eddy di…
2 depthw_bounds      2     0 float  ""                   depthw_bounds    
3 nav_lat            2     3 float  degrees_north        Latitude         
4 nav_lon            2     3 float  degrees_east         Longitude        
5 time_centered      1     6 double seconds since 1900-… Time axis        
6 time_centered…     2     0 double ""                   time_centered_bo…
7 time_counter_…     2     0 double ""                   time_counter_bou…
8 wo                 4    10 float  m/s                  ocean vertical v…

It appears that the U and V files also contain their corresponding downward stress (utau and vtau), though with the availability of the W vectors, I don’t know that we need these, too. For now we will just stick with: uoce, voce, and wo.

Synoptic states

As with the other variables, we will also need to create mean synoptic states of the vectors (U, V, W) during all of the recorded MHWs. From here out we will again be running on data stored on a local server, meaning the following code won’t be of much use unless one has a copy of the NAPA model data. The resulting synoptic states we will create now are likely to be in excess of a couple of GB so will not be hosted on the public GitHub server either. They are of course available upon request.

Date index for NAPA vector files

To create the index of dates to be found within each of the thousands of NAPA 5 day vector NetCDF files we will use the same simple for loop to crawl through the files and write down for us in one long spreadsheet which dates are to be found in which files. This worked well for the previous vignette, so I’ve kept this step the same. The vectors are stored in three different file directories, but their dates are the same, so we only need to create one date index file, not three individual ones.

# Pull out the dates
NAPA_vector_files_dates <- data.frame()
for(i in 1:length(NAPA_files)){
  file_name <- NAPA_U_files[i]
  date_start <- ymd(str_sub(basename(as.character(file_name)), start = 26, end = 33))
  date_end <- ymd(str_sub(basename(as.character(file_name)), start = 35, end = 43))
  date_seq <- seq(date_start, date_end, by = "day")
  # Take the middle date as this is the real data in the pentad
  date_info <- data.frame(file = file_name, date = date_seq[3])
  NAPA_vector_files_dates <- rbind(date_info, NAPA_vector_files_dates)
}

# Order by date, just for tidiness
NAPA_vector_files_dates <- dplyr::arrange(NAPA_vector_files_dates, date)

# Save
# saveRDS(NAPA_vector_files_dates, "data/NAPA_vector_files_dates.Rda")

Variable climatologies

Part of the data packet we need to create for the SOMs is the anomaly values. In order to create anomalies however we need to first create climatologies for all of the variables. This may prove to be a somewhat daunting task, but it’s what we are here to do! In order to create a climatology of values we will need to load all of the files and then pixel-wise go about getting the seasonal (daily) climatologies. This will be done with the same function (ts2clm()) that is used for the MHW climatologies. We will first create a function that extracts the desired variables from any NetCDF files fed to it. With that done it should be a routine matter to get the climatologies. Hold onto your hats, this is going to be RAM heavy…

# Load functions required below
source("code/functions.R")

# U
clim_one_vec(NAPA_U_files)
gc()

# V
clim_one_vec(NAPA_V_files)
gc()

# W
clim_one_vec(NAPA_W_files)
gc()

And then combine them into one big clim file.

# Load all variable climatologies
NAPA_clim_U <- readRDS("data/NAPA_clim_U.Rda")
NAPA_clim_V <- readRDS("data/NAPA_clim_V.Rda")
NAPA_clim_W <- readRDS("data/NAPA_clim_W.Rda")

# left join them
NAPA_clim_vecs <- left_join(NAPA_clim_U, NAPA_clim_V, by = c("lon", "lat", "doy")) %>% 
  left_join(NAPA_clim_W, by = c("lon", "lat", "doy"))

# Convert DOY to MM-DD for joining to daily data below
NAPA_clim_vecs$doy <- format(as.Date(NAPA_clim_vecs$doy, origin = "2015-12-31"), "%m-%d")

# Change column names to highlight that these are climatology values
colnames(NAPA_clim_vecs)[-c(1:3)] <- paste0(colnames(NAPA_clim_vecs)[-c(1:3)],"_clim")

# Save
saveRDS(NAPA_clim_vecs, "data/NAPA_clim_vecs.Rda")
rm(NAPA_clim_U, NAPA_clim_V, NAPA_clim_W); gc()

Variable extractor

We needed a list of the dates present in each file so that we can easily load only the NetCDF files we need to extract our desired variables. The dates we want are the range of dates during each of the MHWs detected in the SST preparation vignette. In the chunk below we will create a function that decides which files should have their variables loaded and a function that binds everything up into tidy data packets that our SOM can ingest.

# Load NAPA file date index
NAPA_vector_files_dates <- readRDS("data/NAPA_vector_files_dates.Rda")

# Load full variable climatology file
NAPA_clim_vecs <- readRDS("data/NAPA_clim_vecs.Rda")

# Function for extracting the desired variables from a given NetCDF file
# testers...
# file_number <- 1376
extract_all_vec <- function(file_number){
  
  # Extract and join variables with a for loop
    # NB: This should be optimised...
  u_vecs <- extract_one_vec(NAPA_U_files[file_number])
  v_vecs <- extract_one_vec(NAPA_V_files[file_number])
  if(file_number == 276){
    w_vecs <- v_vecs[1,]
    colnames(w_vecs)[6] <- "wo"
  } else {
    w_vecs <- extract_one_vec(NAPA_W_files[file_number])
  }
  
  NAPA_vecs_extracted <- left_join(u_vecs, v_vecs, by = colnames(u_vecs)[1:5]) %>% 
    left_join(w_vecs, by = colnames(u_vecs)[1:5])
  
  # Exit
  return(NAPA_vecs_extracted)
}

# Function for extracting variables from as many files as a MHW event requires
# testers...
# event_sub <- NAPA_MHW_event[23,]
data_vec_packet <- function(event_sub){
  
  # Create date and file index for loading
  date_idx <- seq(event_sub$date_start, event_sub$date_end, by = "day")
  file_idx <- which(NAPA_vector_files_dates$date %in% date_idx)
  
  # Load required base data
  # system.time(
  packet_base <- plyr::ldply(file_idx, extract_all_vec) %>% 
    filter(t %in% date_idx) %>%
    mutate(doy = format(t, "%m-%d"))
  # ) # 16 seconds for seven files
  
  # Join to climatologies
  packet_join <- left_join(packet_base, NAPA_clim_vecs, by = c("lon", "lat", "doy"))
  
  # Create anomaly values and remove clim columns
  packet_anom <- packet_join %>% 
    mutate(uoce_anom = uoce - uoce_clim,
           voce_anom = voce - voce_clim,
           wo_anom = wo - wo_clim) %>% 
    dplyr::select(lon, lat, doy, uoce:wo_anom, 
                  -c(colnames(NAPA_clim_vecs)[-c(1:3)]))
  
  # Create mean synoptic values
  packet_mean <- packet_anom %>% 
    select(-doy) %>% 
    # NB: The lowest pixels are a forcing edge and shouldn't be included
      # We can catch these out by filtering pixels whose SST is exactly 0
    # filter(sst != 0) %>% 
    group_by(lon, lat) %>% 
    summarise_all(mean, na.rm = T) %>% 
    arrange(lon, lat) %>% 
    ungroup() %>% 
    nest(.key = "synoptic")
  
  # Combine results with MHW dataframe
  packet_res <- cbind(event_sub, packet_mean)
  
  # Test visuals
  # ggplot(packet_mean, aes(x = lon, y = lat)) +
  #   geom_point(aes(colour = uoce_anom)) +
  #   scale_colour_gradient2(low = "blue", high = "red") +
  #   coord_cartesian(xlim = NWA_corners[1:2],
  #                   ylim = NWA_corners[3:4])
  
  # Exit
  return(packet_res)
}

With our functions sorted, it is now time to create our data packets.

# Set number of cores
  # NB: Was set to 25 as someone else was using the server at the time
doMC::registerDoMC(cores = 25)

# Create one big packet
system.time(
synoptic_vec_states <- plyr::ddply(NAPA_MHW_event, c("region", "sub_region", "event_no"), data_vec_packet, .parallel = T)
) # 28 seconds for first 2, 6125 seconds (102 minutes) for all

# Save
# saveRDS(synoptic_vec_states, "data/synoptic_vec_states.Rda")

With all of our synoptic snapshots for our chosen variables created it is now time to feed them to the Self-organising map (SOM) analysis.

Session information

sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2       tidync_0.2.1         heatwaveR_0.3.6.9004
 [4] data.table_1.11.6    lubridate_1.7.4      ncdf4_1.16          
 [7] forcats_0.3.0        stringr_1.3.1        dplyr_0.7.6         
[10] purrr_0.2.5          readr_1.1.1          tidyr_0.8.1         
[13] tibble_1.4.2         ggplot2_3.0.0        tidyverse_1.2.1     
[16] jsonlite_1.6        

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18      lattice_0.20-35   utf8_1.1.4       
 [4] assertthat_0.2.0  rprojroot_1.3-2   digest_0.6.16    
 [7] foreach_1.4.4     R6_2.2.2          cellranger_1.1.0 
[10] plyr_1.8.4        backports_1.1.2   evaluate_0.11    
[13] httr_1.3.1        pillar_1.3.0      rlang_0.2.2      
[16] lazyeval_0.2.1    readxl_1.1.0      ncmeta_0.0.4     
[19] rstudioapi_0.7    whisker_0.3-2     R.utils_2.7.0    
[22] R.oo_1.22.0       rmarkdown_1.10    htmlwidgets_1.3  
[25] munsell_0.5.0     broom_0.5.0       compiler_3.6.0   
[28] modelr_0.1.2      pkgconfig_2.0.2   htmltools_0.3.6  
[31] tidyselect_0.2.4  workflowr_1.1.1   codetools_0.2-15 
[34] doMC_1.3.5        fansi_0.3.0       viridisLite_0.3.0
[37] crayon_1.3.4      withr_2.1.2       R.methodsS3_1.7.1
[40] grid_3.6.0        nlme_3.1-137      gtable_0.2.0     
[43] git2r_0.23.0      magrittr_1.5      scales_1.0.0     
[46] cli_1.0.0         stringi_1.2.4     xml2_1.2.0       
[49] iterators_1.0.10  tools_3.6.0       glue_1.3.0       
[52] RNetCDF_1.9-1     maps_3.3.0        hms_0.4.2        
[55] ncdump_0.0.3      parallel_3.6.0    yaml_2.2.0       
[58] colorspace_1.3-2  rvest_0.3.2       plotly_4.9.0     
[61] knitr_1.20        bindr_0.1.1       haven_1.1.2

This reproducible R Markdown analysis was created with workflowr 1.1.1