Structure

• A little bit about me

• Reasons for joining R community

• Some recent work using Tidyverse methods

• Future plans

A little bit about me

• Research Associate at Heriot-Watt University

• Background in Psychology

• Currently working on an EPSRC project called SoCoRo (Socially Competent Robots)

• Developing a socially competent robot to modify social signal processing

Reasons for joining R community

• Interested in data modelling but not satisfied with SPSS

• Wanted a fully integrated system

• raw data -> data processing -> analysis -> report writing

• Clear and colourful graphs

• Improved workflow: tidy, wrangle, model

Parts of R I'll discuss today

• Introduce main verbs of Tidyverse

• Tidying data

• Dealing with dates

• Dealing with the autism-spectrum quotient (AQ)

• Summarising data

• Plotting data

• Analysis: Binary logistic regression (brief)

Main verbs of the Tidyverse

• filter: extract rows
• select: extract columns
• gather/spread: gather columns into rows / spread rows into columns
• mutate: compute new columns or modify existing ones
• summarise: summarise data based on stated criteria
• ntile: rank a vector and split it into groups
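As a quick orientation, the verbs above can be tried on a toy tibble; the data and column names below are invented purely for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, just to illustrate the verbs
d <- tibble(ID        = c(1, 1, 2, 2),
            item      = c("Toast", "Milk", "Toast", "Milk"),
            resp_time = c(6.0, 6.4, 3.6, 11.5))

d %>%
  filter(ID == 1) %>%              # extract rows
  select(item, resp_time) %>%      # extract columns
  mutate(log_rt = log(resp_time))  # compute a new column

d %>%
  group_by(ID) %>%
  summarise(mean_rt = mean(resp_time))  # summarise per group

d %>% mutate(rt_half = ntile(resp_time, 2))  # rank and split into 2 groups

d %>% spread(item, resp_time)  # spread rows into columns (wide format)
```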

Tidying data using Tidyverse methods pt1

## # A tibble: 6 x 7
##      ID item    expression resp    date       resp_acc  resp_time
##   <int> <chr>        <int> <chr>   <chr>      <chr>         <dbl>
## 1     1 Toast            1 Like    05/09/2017 Correct        6.02
## 2     1 Milk             2 Like    05/09/2017 Correct        6.45
## 3     1 Beans            3 Dislike 05/09/2017 Correct        5.48
## 4     1 Oatmeal          4 Dislike 05/09/2017 Correct        4.21
## 5     2 Beans            1 Dislike 05/09/2017 Incorrect     11.5
## 6     2 Toast            2 Like    05/09/2017 Correct        3.58

Tidying data using Tidyverse methods pt 2

resp_data %>%
  filter(ID != "pilot", ID != "3") %>%
  mutate(pad = ifelse(expression %in% c("1", "2"),
                      "Approval", "Disapproval")) %>%
  mutate(ID = as.character(ID)) %>%
  mutate(expression = as.character(expression)) %>%
  arrange(ID)

# remove pilot data and outliers
# more accurate categories are better for graphs
# vectors with numeric values may be categorical; change the data type

Tidying data using Tidyverse methods pt 3

## # A tibble: 6 x 7
##   ID    item    expression resp    pad         resp_acc resp_time
##   <chr> <chr>   <chr>      <chr>   <chr>       <chr>        <dbl>
## 1 1     Toast   1          Like    Approval    Correct       6.02
## 2 1     Milk    2          Like    Approval    Correct       6.45
## 3 1     Beans   3          Dislike Disapproval Correct       5.48
## 4 1     Oatmeal 4          Dislike Disapproval Correct       4.21
## 5 10    Oatmeal 2          Like    Approval    Correct      19.0
## 6 10    Beans   3          Dislike Disapproval Correct       6.57

Dealing with dates pt 1

## # A tibble: 6 x 2
##   dob        expdate
##   <chr>      <chr>
## 1 12/04/1980 05/09/2017
## 2 19/09/1971 05/09/2017
## 3 05/11/1954 05/09/2017
## 4 31/05/1989 05/09/2017
## 5 11/10/1987 05/09/2017
## 6 29/07/1988 07/09/2017

Dealing with dates pt 2

q_data %>%
  select(dob, expdate) %>%
  mutate(date = as.Date(expdate, "%d/%m/%Y")) %>%
  mutate(birth = as.Date(dob, "%d/%m/%Y")) %>%
  mutate(age = as.numeric(date - birth) / 365.2422) %>%
  head()
## # A tibble: 6 x 5
##   dob        expdate    date       birth        age
##   <chr>      <chr>      <date>     <date>     <dbl>
## 1 12/04/1980 05/09/2017 2017-09-05 1980-04-12  37.4
## 2 19/09/1971 05/09/2017 2017-09-05 1971-09-19  46.0
## 3 05/11/1954 05/09/2017 2017-09-05 1954-11-05  62.8
## 4 31/05/1989 05/09/2017 2017-09-05 1989-05-31  28.3
## 5 11/10/1987 05/09/2017 2017-09-05 1987-10-11  29.9
## 6 29/07/1988 07/09/2017 2017-09-07 1988-07-29  29.1
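An alternative to base as.Date(), not used on the slide above, is the lubridate package, which parses day/month/year strings directly and handles leap years when computing ages; a minimal sketch on two of the dates shown:

```r
library(lubridate)

dob     <- c("12/04/1980", "19/09/1971")
expdate <- c("05/09/2017", "05/09/2017")

birth <- dmy(dob)      # parse day/month/year strings directly
date  <- dmy(expdate)

# age in years, computed over the calendar interval
age <- time_length(interval(birth, date), "years")
round(age, 1)
```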

Dealing with the AQ: Using gather and ntile pt 1

## # A tibble: 6 x 4
##   q1                q2                  q3                q4
##   <chr>             <chr>               <chr>             <chr>
## 1 slightly disagree slightly agree      slightly agree    slightly agree
## 2 slightly agree    slightly disagree   definitely agree  slightly agree
## 3 slightly agree    slightly disagree   slightly agree    slightly disagr~
## 4 slightly disagree definitely disagree definitely agree  slightly disagr~
## 5 slightly agree    slightly disagree   definitely agree  slightly disagr~
## 6 slightly agree    slightly disagree   slightly disagree slightly disagr~

Dealing with the AQ: Using gather and ntile pt 2

q_data %>%
  filter(ID != "pilot", ID != "3") %>%
  select(ID, q1:q50) %>%
  gather(AQ_item, AQ_resp, q1:q50) %>%
  mutate(AQ_num = substring(AQ_item, 2)) %>%
  mutate(AQ_score =
           ifelse(AQ_num %in%
                    c(2, 4:7, 9, 12:13, 16, 18:23, 26, 33, 35, 39, 41:43, 45:46) &
                  AQ_resp %in% c("definitely agree", "slightly agree"),
                  1,
           ifelse(AQ_num %in%
                    c(1, 3, 8, 10:11, 14:15, 17, 24:25, 27:32, 34, 36:38, 40, 44, 47:50) &
                  AQ_resp %in% c("slightly disagree", "definitely disagree"),
                  1, 0)))
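The nested ifelse() scoring can also be written with dplyr::case_when(), which some find easier to read. A sketch of the same scoring scheme follows; score_item is a hypothetical helper (not part of the original pipeline) that takes numeric item numbers for simplicity:

```r
library(dplyr)

# Items scored 1 for agreement (from the code above); all others score 1 for disagreement
agree_items    <- c(2, 4:7, 9, 12:13, 16, 18:23, 26, 33, 35, 39, 41:43, 45:46)
agree_resps    <- c("definitely agree", "slightly agree")
disagree_resps <- c("slightly disagree", "definitely disagree")

# Hypothetical helper illustrating case_when(); not in the original slides
score_item <- function(AQ_num, AQ_resp) {
  case_when(
    AQ_num %in% agree_items    & AQ_resp %in% agree_resps    ~ 1,
    !(AQ_num %in% agree_items) & AQ_resp %in% disagree_resps ~ 1,
    TRUE                                                     ~ 0
  )
}

score_item(c(2, 2, 1, 1),
           c("slightly agree", "slightly disagree",
             "slightly disagree", "slightly agree"))
```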

Dealing with the AQ: Using gather and ntile pt 3

## # A tibble: 6 x 5
##   ID    AQ_item AQ_resp           AQ_num AQ_score
##   <chr> <chr>   <chr>             <chr>     <dbl>
## 1 1     q1      slightly agree    1          0
## 2 2     q1      slightly disagree 1          1.00
## 3 4     q1      slightly agree    1          0
## 4 5     q1      slightly agree    1          0
## 5 6     q1      slightly disagree 1          1.00
## 6 7     q1      slightly agree    1          0

Dealing with the AQ: Using gather and ntile pt 4

aq %>%
  group_by(ID) %>%
  mutate(AQ_tot = sum(AQ_score)) %>%
  ungroup() %>%
  select(ID, AQ_num, AQ_resp, AQ_score, AQ_tot) %>%
  head()

Dealing with the AQ: Using gather and ntile pt 5

## # A tibble: 6 x 5
##   ID    AQ_num AQ_resp           AQ_score AQ_tot
##   <chr> <chr>  <chr>                <dbl>  <dbl>
## 1 pilot 1      slightly disagree     1.00  28.0
## 2 pilot 1      slightly agree        0     28.0
## 3 1     1      slightly agree        0     16.0
## 4 2     1      slightly disagree     1.00  17.0
## 5 3     1      slightly agree        0     10.0
## 6 4     1      slightly agree        0      8.00

Dealing with the AQ: Using gather and ntile pt 6

aq %>%
  group_by(ID) %>%
  mutate(AQ_tot = sum(AQ_score)) %>%
  ungroup() %>%
  select(ID, AQ_tot) %>%
  distinct() %>%
  mutate(medAQ = ntile(AQ_tot, 2)) %>%
  mutate(AQ_group = recode(medAQ,
                           "1" = "Low AQ",
                           "2" = "High AQ")) %>%
  select(-medAQ) %>%
  head()

Dealing with the AQ: Using gather and ntile pt 7

## # A tibble: 56 x 3
##    ID    AQ_tot AQ_group
##    <chr>  <dbl> <chr>
##  1 1      16.0  Low AQ
##  2 10      9.00 Low AQ
##  3 11     17.0  Low AQ
##  4 12      9.00 Low AQ
##  5 13     20.0  High AQ
##  6 14      6.00 Low AQ
##  7 15     19.0  High AQ
##  8 16     11.0  Low AQ
##  9 17     22.0  High AQ
## 10 18     23.0  High AQ
## # ... with 46 more rows

Using summarise to generate frequencies and proportions pt 1

## # A tibble: 6 x 5
##   ID    eng                resp    resp_acc resp_time
##   <chr> <chr>              <chr>   <chr>        <dbl>
## 1 1     Native English     Like    Correct       6.02
## 2 1     Native English     Like    Correct       6.45
## 3 1     Native English     Dislike Correct       5.48
## 4 1     Native English     Dislike Correct       4.21
## 5 10    Non-native English Like    Correct      19.0
## 6 10    Non-native English Dislike Correct       6.57

Using summarise to generate frequencies and proportions pt 2

demog_resp %>%
  select(ID, eng, resp) %>%
  group_by(eng, resp) %>%
  summarise(n = n()) %>%
  group_by(eng) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()

Using summarise to generate frequencies and proportions pt 3

eng                resp        n   freq
Native English     Dislike    72 0.5294
Native English     Like       63 0.4632
Native English     Miss        1 0.0074
Non-native English Dislike    43 0.4886
Non-native English Like       41 0.4659
Non-native English Miss        4 0.0455
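The same table can be produced a little more compactly with dplyr::count(), which wraps group_by() + summarise(n = n()). A sketch on invented stand-in data with the same column names:

```r
library(dplyr)

# Hypothetical stand-in for demog_resp, invented for illustration
demog_resp <- tibble(
  eng  = rep(c("Native English", "Non-native English"), c(4, 2)),
  resp = c("Like", "Like", "Dislike", "Miss", "Like", "Dislike"))

demog_resp %>%
  count(eng, resp) %>%          # shorthand for group_by() + summarise(n = n())
  group_by(eng) %>%
  mutate(freq = n / sum(n)) %>% # proportion within each language group
  ungroup()
```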

Plotting questionnaire data pt 1

nat_rob %>%
  group_by(eng, robot_q_item) %>%
  summarise(mean = mean(robot_q_resp),
            median = median(robot_q_resp),
            IQR = IQR(robot_q_resp),
            sd = sd(robot_q_resp),
            n = n()) %>%
  mutate(se = sd / sqrt(n),
         lower_ci = mean - qt(1 - (0.05 / 2), n - 1) * se,
         upper_ci = mean + qt(1 - (0.05 / 2), n - 1) * se) %>%
  mutate("Native Language" = recode(eng,
                                    "No" = "Non-native English",
                                    "Yes" = "Native English"))

Plotting questionnaire data pt 2

eng                robot_q_item            mean median IQR     sd   n     se lower_ci upper_ci Native Language
Native English     Friendliness           3.794      4   2 1.0826 136 0.0928    3.611    3.978 Native English
Native English     Interaction_rating     4.147      5   1 1.0923 136 0.0937    3.962    4.332 Native English
Native English     Likeability            3.618      4   1 1.1159 136 0.0957    3.428    3.807 Native English
Native English     Perceived_positiveness 3.382      3   1 1.0615 136 0.0910    3.202    3.562 Native English
Native English     Performance_rating     3.559      4   1 1.0093 136 0.0865    3.388    3.730 Native English
Native English     Voice_clarity          4.559      5   1 0.8144 136 0.0698    4.421    4.697 Native English
Non-native English Friendliness           3.682      4   1 0.8241  88 0.0879    3.507    3.856 Non-native English
Non-native English Interaction_rating     4.273      4   1 0.7540  88 0.0804    4.113    4.433 Non-native English
Non-native English Likeability            4.227      4   1 0.6013  88 0.0641    4.100    4.355 Non-native English
Non-native English Perceived_positiveness 3.727      4   1 0.9189  88 0.0980    3.533    3.922 Non-native English
Non-native English Performance_rating    3.273       3   1 0.8126  88 0.0866    3.100    3.445 Non-native English
Non-native English Voice_clarity          4.682      5   1 0.5580  88 0.0595    4.564    4.800 Non-native English
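A summary table in this shape feeds straight into ggplot2. A minimal sketch follows; rob_summary is a hypothetical two-row stand-in for the table above, invented so the example is self-contained:

```r
library(ggplot2)
library(dplyr)

# Hypothetical stand-in for the summary table above (Friendliness rows only)
rob_summary <- tibble(
  `Native Language` = c("Native English", "Non-native English"),
  robot_q_item = "Friendliness",
  mean     = c(3.794, 3.682),
  lower_ci = c(3.611, 3.507),
  upper_ci = c(3.978, 3.856))

p <- ggplot(rob_summary,
            aes(x = robot_q_item, y = mean, fill = `Native Language`)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = lower_ci, ymax = upper_ci),
                position = position_dodge(width = 0.9), width = 0.2) +
  labs(x = NULL, y = "Mean rating (95% CI)")
p
```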

Data analysis

• I tend to use binary logistic regression, as it accounts for the non-normality of binary outcomes
• At present I am not using the Tidyverse for analysis. Work in progress…

Analysis procedure pt 1

• Step one: generate a correlation matrix for numeric vectors

correl <- cor(aq_a[ , c(8:14)], use = "pairwise.complete.obs")

symnum(correl)

• Evaluate collinearity - decide whether parameters need to be dropped or modified

Analysis procedure pt 2

• Fit the main-effects model

mainmod <- glm(AQ_group ~ expression + eng + sex + resp ...,
               data = aq_a, family = binomial(link = 'logit'))

summary(mainmod)

• Run a stepwise check of the model predictors

mod1 <- step(mainmod)

• Extract the lowest-AIC model and summarise it using summary()

Analysis procedure pt 3

• Check if the model is significant

• chi-square difference: chidiff <- mod2$null.deviance - mod2$deviance

• degrees-of-freedom difference: dfdiff <- mod2$df.null - mod2$df.residual

pchisq(chidiff, dfdiff, lower.tail = FALSE)
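The likelihood-ratio check can be run end-to-end on simulated data. The variables x and y below are invented so the sketch is self-contained; mod2 here stands in for the stepwise-selected model on the slides:

```r
# Self-contained sketch of the model-significance check with simulated data
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(2 * x))  # binary outcome that depends on x

mod2 <- glm(y ~ x, family = binomial(link = "logit"))

chidiff <- mod2$null.deviance - mod2$deviance  # chi-square difference
dfdiff  <- mod2$df.null - mod2$df.residual     # degrees-of-freedom difference
pchisq(chidiff, dfdiff, lower.tail = FALSE)    # p-value for the whole model
```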

Analysis procedure pt 4

• Calculate effect size

• use the PseudoR2() function from the BaylorEdPsych package

PseudoR2(mod2)

Analysis procedure pt 5

• Calculate the correctness of the model

correct <- mod2$fitted.values

• fitted values are a continuous probability of falling into the second group (High AQ)
• these have to be converted to binary values
• use a 50/50 cut-off point, as that is chance level for two groups

binarycorrect <- ifelse(correct > 0.5, 1, 0)
binarycorrect <- factor(binarycorrect, levels = c(0, 1),
                        labels = c("Low AQ", "High AQ"))
table(aq_a$AQ_group, binarycorrect)

• perform calculations on the resulting matrix (e.g. classification accuracy)

Concluding thoughts

• Tidyverse uses logical syntax

• Can tidy, wrangle, and model data with relative ease

• The graphical tool ggplot2 helps visualise trends from multiple perspectives

Future work

• Eventually want to use R and RStudio as an integrated environment for all my research activities

• Writing functions to shorten code chunks

• Learn to use tidy methods for analysis

• Setting up a Github repo for version control

• Looking at methods to integrate .Rmd documents with Overleaf