Introduction

In this analysis, I investigate whether enabling the “skip” functionality in Minno.js is responsible for repetitions of questions and tasks in our data. To test this, I modified the script at the beginning of the demo.genderscience study so that the “skip” functionality is activated at random for each participant. Specifically, the script saves a variable named skipped, which is set to ‘true’ (skip enabled) or ‘false’ (skip not enabled), with a 50% chance for each value.

Here’s the JavaScript code I used:

var skip = 'false';
if (Math.random() < 0.5) {
    skip = 'true';
    API.addSettings('skip', true);
}
API.save({skipped:skip});

The API.save function records the value of skipped in the dataset. This analysis will help us test whether enabling this “skip” functionality leads to duplicated questions and tasks in the data.

towork <- "C:/Users/Yoav/Documents/gdrive.mail.tau.ac.il/Other computers/My Laptop"
source(paste0(towork, "/work/resources/stasExamples/R/yba.funcs.R"))

# The data's folder (directory)
dir.to.data = "C:\\Users\\Yoav\\OneDrive - Tel-Aviv University\\Documents\\bigfiles\\"
dir.raw = paste0(dir.to.data, "demo.studies\\demo.genderscience.0003.skip\\raw")
dir.out <- paste0(dir.to.data, "demo.studies\\demo.genderscience.0003.skip\\processed")
explicit.raw <- read.table(paste(dir.raw, "explicit.txt", sep = "\\"), sep = "\t", header = TRUE, fill = TRUE, quote = NULL, comment = "")
exp.sids <- explicit.raw$session_id[which(explicit.raw$question_name == "skipped")]
exp.raw <- explicit.raw[which(explicit.raw$session_id %in% exp.sids), ]
cat("number of unique sessions in the explicit table:", length(unique(exp.raw$session_id)))
## number of unique sessions in the explicit table: 5691
skipped <- exp.raw[which(exp.raw$question_name == "skipped"), c("session_id", "question_response")]
names(skipped)[names(skipped) == "question_response"] <- "skipped"

Duplicated “skipped”

If the ‘skip’ functionality is the only cause of duplicated questions, we should not observe any session with duplicate entries of the variable ‘skipped’ in which all entries have the value ‘false’.

# Count duplicate session_id in rows that recorded the 'skipped' variable
dups <- setDF(setDT(skipped)[, if (.N > 1L) .SD, by = .(session_id)])
dup.sids <- unique(dups$session_id)

skipped.dups <- setDF(skipped)[which(skipped$session_id %in% dup.sids), ]

# Group by session_id and check if all values of skipped are 'false'
false_only_sessions <- skipped.dups %>%
    group_by(session_id) %>%
    dplyr::summarize(
        all_false = all(skipped == "false"),
        num_repetitions = n()  # Count the number of repetitions for each session
    )

# Compute the expected likelihood of having only 'false' by chance (0.5^n)
false_only_sessions <- false_only_sessions %>%
    mutate(expected_all_false = 0.5^num_repetitions)

# Count how many sessions have only 'false' in all their repetitions
false_only_count <- sum(false_only_sessions$all_false)

# Get the number of unique sessions in skipped.dups
unique_session_count <- n_distinct(skipped.dups$session_id)

# Compute the expected number of sessions with only 'false'
expected_false_only_count <- sum(false_only_sessions$expected_all_false)

Total number of unique sessions with at least 2 recordings of the ‘skipped’ variable: 251

Number of sessions that included at least 2 recordings of the ‘skipped’ variable, but always with the value ‘false’: 48

Expected number of sessions with only ‘false’ (by chance): 55.890625

We found duplicated sessions in which “skipped” was always ‘false’. Further, the number of only-false sessions (48) is close to the number expected by chance (about 56), which is what we would expect if skipping does not increase the likelihood of repetition.
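The chance expectation can be reproduced by hand. The following is a minimal sketch on made-up counts (the vector n_rec is hypothetical and not taken from the data). It assumes that every recording is ‘false’ with probability 0.5, independently of the others, so a session with n recordings is all-‘false’ with probability 0.5^n. The SD line additionally assumes that sessions are independent of each other.

```r
# Hypothetical number of 'skipped' recordings in four duplicated sessions
n_rec <- c(2, 2, 3, 4)

# Probability that all recordings in a session are 'false' by chance
p_all_false <- 0.5^n_rec

# Expected number of all-'false' sessions, and its SD under independence
expected_count <- sum(p_all_false)
sd_count <- sqrt(sum(p_all_false * (1 - p_all_false)))

print(expected_count)  # 0.6875
print(sd_count)
```

With the real data, an observed count that falls within about one or two such SDs of the expected count would be compatible with chance.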

Still, in the rest of the analyses, we will consider any session that had “skipped=true” even once as a session in the “skipped=true” condition.

# Group by session_id and summarize the skipped column.
# If any 'skipped' value is 'true', mark the whole session as 'true', otherwise 'false'
skipped_summary <- skipped %>%
    group_by(session_id) %>%
    dplyr::summarize(skipped = ifelse(any(skipped == "true"), "true", "false")) %>%
    ungroup()

# View the resulting data.frame
# head(skipped_summary)

# Count dups
for.dup <- exp.raw[, c("session_id", "attempt", "questionnaire_name", "question_name", "question_response")]

for.dup <- setDF(setDT(for.dup)[, nReps := 1:.N, by = c("session_id", "questionnaire_name", "question_name")])
rep.qst.sids <- unique(for.dup$session_id[which(for.dup$nReps > 1)])

for.dup <- setDF(setDT(for.dup)[, nResps := 1:.N, by = c("session_id", "questionnaire_name", "question_name", "question_response")])
rep.resp.sids <- unique(for.dup$session_id[which(for.dup$nResps > 1)])

multi.attempt.sids <- unique(for.dup$session_id[which(for.dup$attempt > 1)])

reps <- data.frame(session_id = unique(for.dup$session_id))

# %in% already returns TRUE/FALSE, so no ifelse() is needed
reps$rep.attempt <- reps$session_id %in% multi.attempt.sids

reps$rep.qst <- reps$session_id %in% rep.qst.sids
reps$rep.resp <- reps$session_id %in% rep.resp.sids

reps <- merge(reps, skipped_summary, by = "session_id")
attempts <- for.dup %>%
    group_by(session_id) %>%
    dplyr::summarise(m.attempt = mean(attempt, na.rm = TRUE), n.attempt.gt1 = sum(attempt > 1, na.rm = TRUE))

reps <- merge(reps, attempts, by = "session_id")

Repetition by skip condition

We will examine a few by-session variables:

  • rep.qst = whether there was any repeating question in that session.
  • rep.resp = whether there was any repeating question with exactly the same response in that session.
  • rep.attempt = whether the session had any rows with an attempt value larger than 1.
  • m.attempt = the mean of the attempt value within session.
  • n.attempt.gt1 = how many attempt values above 1 the session had.
mysumBy(rep.qst + rep.resp + rep.attempt + m.attempt + n.attempt.gt1 ~ skipped, dt = reps)

Any session that had a repeated question also had a repeated response. Although there were repetitions even without “skip” enabled, the results suggest that enabling skipping increased the likelihood of duplicate questions.
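To make the difference between rep.qst and rep.resp concrete, here is a minimal base-R sketch on a made-up table (all values are hypothetical). It uses duplicated() rather than the data.table counters used above, and the toy data are built on purpose so that the two flags differ: session 1 answers question q1 twice, with two different responses.

```r
# Toy explicit table (hypothetical values): session 1 repeats question q1
toy <- data.frame(
  session_id = c(1, 1, 1, 2, 2),
  questionnaire_name = c("mgr", "mgr", "mgr", "mgr", "mgr"),
  question_name = c("q1", "q1", "q2", "q1", "q2"),
  question_response = c("a", "b", "x", "a", "x")
)

key_cols <- c("session_id", "questionnaire_name", "question_name")

# duplicated() marks the second and later occurrences of each key
is_rep_qst <- duplicated(toy[, key_cols])
is_rep_resp <- duplicated(toy[, c(key_cols, "question_response")])

# Collapse to one logical value per session
rep_qst <- tapply(is_rep_qst, toy$session_id, any)
rep_resp <- tapply(is_rep_resp, toy$session_id, any)
print(rep_qst)   # session 1: TRUE, session 2: FALSE
print(rep_resp)  # both sessions: FALSE
```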

# Convert logical columns to numeric (TRUE -> 1, FALSE -> 0)
reps_numeric <- reps %>%
    mutate(across(c(rep.qst, rep.resp, rep.attempt, m.attempt, n.attempt.gt1), as.numeric)) %>%
    # Convert 'skipped' to numeric ('true' -> 1, 'false' -> 0)
mutate(skipped = ifelse(skipped == "true", 1, 0))

Let’s explore the relations between the different repetition variables (mainly, between having repeated questions and having attempt values higher than 1).

my.htmlTable(cornp(reps_numeric[, c("rep.qst", "rep.resp", "rep.attempt", "m.attempt", "n.attempt.gt1", "skipped")]))
Each cell shows r (p); n = 5691 for every pair.
varName         rep.qst          rep.resp         rep.attempt      m.attempt        n.attempt.gt1
rep.resp        1 (<.001)
rep.attempt     0.014 (0.304)    0.014 (0.304)
m.attempt       0.218 (<.001)    0.218 (<.001)    0.143 (<.001)
n.attempt.gt1   0.222 (<.001)    0.222 (<.001)    0.101 (<.001)    0.772 (<.001)
skipped         0.13 (<.001)     0.13 (<.001)     -0.01 (0.449)    0.037 (0.005)    0.035 (0.008)

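Having a repeated question was related to the mean attempt value and to the number of attempt values above 1 (r ≈ .22), but not to whether the session had any attempt value above 1 (r = .014). Even the reliable relations were quite small. So, the attempt variable is probably not useful for investigating or detecting duplicates.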
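Several of the correlated variables are 0/1 indicators (rep.qst, rep.resp, rep.attempt, skipped). I have not checked how the cornp helper computes its coefficients, but assuming it uses Pearson correlations, the coefficient between two 0/1 variables equals the phi coefficient of their 2x2 table. A minimal sketch on made-up indicators (the two vectors are hypothetical):

```r
# Hypothetical 0/1 indicators for eight sessions
rep_qst <- c(0, 0, 1, 1, 0, 1, 0, 1)
skipped <- c(0, 1, 1, 1, 0, 1, 0, 0)

# Pearson correlation between the two 0/1 variables
r <- cor(rep_qst, skipped)

# Phi coefficient computed from the 2x2 table of counts
tab <- table(rep_qst, skipped)
phi <- (tab[2, 2] * tab[1, 1] - tab[2, 1] * tab[1, 2]) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

print(c(r = r, phi = phi))  # both 0.5
```

This is why a value such as r = 0.13 between skipped and rep.qst can be read as a measure of association between two yes/no variables.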

Different count variables

To verify these results, we will use different processing to compute similar variables.

for.dup2 <- exp.raw[, c("session_id", "attempt", "questionnaire_name", "question_name", "question_response")]

for.dup2 <- for.dup2 %>%
    arrange(session_id, questionnaire_name, question_name) %>%
    group_by(session_id, questionnaire_name, question_name) %>%
    mutate(iRep = row_number() - 1, nReps = n() - 1, isRep = nReps > 1) %>%
    ungroup()


for.dup2 <- for.dup2 %>%
    arrange(session_id, questionnaire_name, question_name, question_response) %>%
    group_by(session_id, questionnaire_name, question_name, question_response) %>%
    mutate(iRepResp = row_number() - 1, nRepsResp = n() - 1, isRepResp = nRepsResp > 1) %>%
    ungroup()

for.dup2 <- for.dup2[order(for.dup2$session_id, for.dup2$questionnaire_name, for.dup2$question_name, for.dup2$attempt, for.dup2$question_response), ]

How many of the rows in the explicit table were a repetition?

isRep = For each row in the explicit table, does its combination of [session_id, questionnaire_name, question_name] appear in other rows? If yes, that’s a repetition of the question. Note that, as coded above (nReps > 1), isRep is TRUE only when the combination appears in at least two other rows.

isRepResp = The same as isRep, but the combination also includes the exact response (i.e., [session_id, questionnaire_name, question_name, question_response]).
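The counting rule behind these variables can be sketched in base R on made-up keys (the key strings are hypothetical). The code above computes nReps as n() - 1 and flags a row with nReps > 1. The sketch shows that cut-off next to the more lenient nReps > 0 cut-off, so that the difference between the two is visible.

```r
# Hypothetical keys, one per row: session, questionnaire and question pasted together
key <- c("s1.mgr.q1", "s1.mgr.q1", "s1.mgr.q2",
         "s2.mgr.q1", "s2.mgr.q1", "s2.mgr.q1")

# Rows per key, and the number of *other* rows sharing the key (n() - 1)
n_rows <- ave(seq_along(key), key, FUN = length)
nReps <- n_rows - 1

# Two possible cut-offs for flagging a row as a repetition
shares_with_any <- nReps > 0  # key appears in at least one other row
shares_with_two <- nReps > 1  # key appears in at least two other rows

print(data.frame(key, nReps, shares_with_any, shares_with_two))
```

In this toy table, five rows pass the first cut-off and only three pass the second, so the choice of cut-off matters when a question was shown exactly twice.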

repsB <- merge(for.dup2, skipped_summary, by = "session_id", all.x = T)
mysumBy(isRep + isRepResp ~ skipped, dt = repsB, round = 4)

Without skipping, repetition occurred in 0.001% of the rows. With skipping, repetition occurred in 0.02% of the rows.

Let’s see how common each attempt value was in the explicit table.

my.freq(for.dup2$attempt)

Next, we will summarize the repetition variables by session_id. The results should be very similar to what we saw previously.

# Summarize the attempt and repetition variables by session
reps2 <- for.dup2 %>%
  group_by(session_id) %>%
  dplyr::summarise(
    m.attempt = mean(attempt, na.rm = TRUE),
    n.attempt.gt1 = sum(attempt > 1, na.rm = TRUE),

    m.nReps = mean(nReps, na.rm = TRUE),
    n.reps = sum(nReps > 0, na.rm = TRUE),
    any.rep = m.nReps > 0,
    
    m.nRepsResp = mean(nRepsResp, na.rm = TRUE),
    n.repsResp = sum(nRepsResp > 0, na.rm = TRUE),
    any.repResp = m.nRepsResp > 0
  )

reps2 <- merge(reps2, skipped_summary, by='session_id')
  • n.reps = how many repeated questions the session had.
  • any.rep = Did the session include any repetitions of any question? This is probably the most important variable.
  • n.repsResp = How many repeated questions and responses the session had.
  • any.repResp = Did the session include any repetitions of a question and a response?
  • m.attempt = the mean value of the attempt column, for each session.
  • n.attempt.gt1 = the number of rows with an attempt value above 1, for each session.
knitr::kable(mysumBy(n.reps + any.rep + n.repsResp + any.repResp + m.attempt + n.attempt.gt1 ~ skipped, dt = reps2))
var skipped n M SD SE med
n.reps false 2787 0.406 7.305 0.138 0.000
n.reps true 2904 2.083 16.149 0.300 0.000
any.rep false 2787 0.018 0.133 0.003 0.000
any.rep true 2904 0.072 0.259 0.005 0.000
n.repsResp false 2787 0.233 3.625 0.069 0.000
n.repsResp true 2904 0.969 6.785 0.126 0.000
any.repResp false 2787 0.018 0.133 0.003 0.000
any.repResp true 2904 0.072 0.259 0.005 0.000
m.attempt false 2787 1.673 0.291 0.006 1.699
m.attempt true 2904 1.696 0.305 0.006 1.714
n.attempt.gt1 false 2787 25.525 15.665 0.297 32.000
n.attempt.gt1 true 2904 26.661 16.538 0.307 33.000

The results are the same as in the previous method: having the “skip” option increased the likelihood of a repeated question (an effect of about d = 0.25 on any.rep), but repeated questions (and responses) were still possible even without the “skip” option.
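The size of this effect can be approximated from the summary table alone. The sketch below computes a pooled-SD Cohen’s d for any.rep from the rounded n, M and SD values printed above, so the result is only approximate. The pooled-SD formula is my assumption about how the value of about 0.25 was obtained.

```r
# Summary statistics for any.rep, copied from the table above
n <- c(false = 2787, true = 2904)
m <- c(false = 0.018, true = 0.072)
s <- c(false = 0.133, true = 0.259)

# Pooled standard deviation across the two conditions
sd_pooled <- sqrt(sum((n - 1) * s^2) / (sum(n) - 2))

# Standardized mean difference (Cohen's d)
d <- unname((m["true"] - m["false"]) / sd_pooled)
print(round(d, 2))  # 0.26
```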

Repetition by questionnaire

qreps <- for.dup2 %>%
  group_by(questionnaire_name) %>%
  dplyr::summarise(
    m.attempt = mean(attempt, na.rm = TRUE),
    n.attempt.gt1 = sum(attempt > 1, na.rm = TRUE),

    m.nReps = mean(nReps, na.rm = TRUE),
    n.reps = sum(nReps > 1, na.rm = TRUE),

    m.nRepsResp = mean(nRepsResp, na.rm = TRUE),
    n.repsResp = sum(nRepsResp > 1, na.rm = TRUE)
  )

For each questionnaire we computed:

  • m.attempt = the mean value of the attempt column.
  • n.attempt.gt1 = the number of rows with an attempt value greater than 1.
  • m.nReps = the mean of nReps, where nReps is, for each row, the number of other rows with the same [session_id, questionnaire_name, question_name] combination.
  • n.reps = how many rows shared the [session_id, questionnaire_name, question_name] combination with at least two other rows (nReps > 1). This is why n.reps can be 0 even when m.nReps is above 0.
  • m.nRepsResp = the mean of nRepsResp, where nRepsResp is, for each row, the number of other rows with the same [session_id, questionnaire_name, question_name, question_response] combination.
  • n.repsResp = how many rows shared the [session_id, questionnaire_name, question_name, question_response] combination with at least two other rows (nRepsResp > 1).
knitr::kable(qreps[order(-qreps$m.nReps), ])
questionnaire_name m.attempt n.attempt.gt1 m.nReps n.reps m.nRepsResp n.repsResp
mgr 1.623097 6228 0.1326233 322 0.1001747 224
ageCheck 1.016317 320 0.0326340 48 0.0157343 24
race 2.136439 54886 0.0187882 54 0.0088210 24
demographics 2.734115 79644 0.0166533 0 0.0078833 0
explicits 1.007853 476 0.0161963 0 0.0054689 0
iat 1.010606 66 0.0141414 0 0.0033670 0
debriefing 1.336515 6940 0.0105012 0 0.0044869 0
under18 1.000000 0 0.0000000 0 0.0000000 0

It makes sense that duplicates would be detected more often in earlier stages of the study (mgr and ageCheck).

Repetition by question

qqreps <- for.dup2 %>%
  group_by(question_name) %>%
  dplyr::summarise(
    m.attempt = mean(attempt, na.rm = TRUE),
    n.attempt.gt1 = sum(attempt > 1, na.rm = TRUE),

    m.nReps = mean(nReps, na.rm = TRUE),
    n.reps = sum(nReps > 1, na.rm = TRUE),

    m.nRepsResp = mean(nRepsResp, na.rm = TRUE),
    n.repsResp = sum(nRepsResp > 1, na.rm = TRUE)
  )

For each question we computed:

  • m.attempt = the mean value of the attempt column.
  • n.attempt.gt1 = the number of rows with an attempt value greater than 1.
  • m.nReps = the mean of nReps, where nReps is, for each row, the number of other rows with the same [session_id, questionnaire_name, question_name] combination.
  • n.reps = how many rows shared the [session_id, questionnaire_name, question_name] combination with at least two other rows (nReps > 1). This is why n.reps can be 0 even when m.nReps is above 0.
  • m.nRepsResp = the mean of nRepsResp, where nRepsResp is, for each row, the number of other rows with the same [session_id, questionnaire_name, question_name, question_response] combination.
  • n.repsResp = how many rows shared the [session_id, questionnaire_name, question_name, question_response] combination with at least two other rows (nRepsResp > 1).
knitr::kable(qqreps[order(-qqreps$m.nReps), ])
question_name m.attempt n.attempt.gt1 m.nReps n.reps m.nRepsResp n.repsResp
gaySet 1.500000 1 1.0000000 0 1.0000000 0
genderIdentity_0002otherrt 2.545454 6 0.1818182 0 0.0000000 0
isTouch 1.675986 3419 0.1331336 161 0.1331336 161
skipped 1.570525 2808 0.1318901 161 0.0669442 63
genderIdentity_0002other 2.142857 10 0.0952381 0 0.0000000 0
raceomb_003sub_blackrt 2.292035 257 0.0412979 3 0.0000000 0
raceomb_003sub_black 2.316384 268 0.0395480 3 0.0395480 3
raceomb_003sub_hispanicotherrt 2.675214 96 0.0341880 0 0.0000000 0
raceomb_003sub_hispanicother 2.650000 98 0.0333333 0 0.0333333 0
birthmonth 1.016317 80 0.0326340 12 0.0310800 12
birthmonthrt 1.016317 80 0.0326340 12 0.0007770 0
birthyear 1.016317 80 0.0326340 12 0.0303030 12
birthyearrt 1.016317 80 0.0326340 12 0.0007770 0
raceomb_003sub_middleeastrt 2.357143 60 0.0238095 0 0.0000000 0
raceomb_003sub_hispanicrt 2.439759 384 0.0200803 0 0.0000000 0
raceomb_003sub_middleeast 2.460784 76 0.0196078 0 0.0196078 0
raceomb_003sub_hispanic 2.446602 396 0.0194175 0 0.0194175 0
raceomb_003sub_whitert 2.280000 2028 0.0190826 0 0.0000000 0
raceomb_003_other 2.251561 3223 0.0189595 3 0.0184971 3
raceomb_003_otherrt 2.251561 3223 0.0189595 3 0.0000000 0
raceomb_003_white 2.251561 3223 0.0189595 3 0.0180347 3
raceomb_003_whitert 2.251561 3223 0.0189595 3 0.0000000 0
raceomb_003_asian 1.961858 2589 0.0189552 3 0.0184928 3
raceomb_003_asianrt 1.961858 2589 0.0189552 3 0.0000000 0
raceomb_003_black 1.961858 2589 0.0189552 3 0.0184928 3
raceomb_003_blackrt 1.961858 2589 0.0189552 3 0.0000000 0
raceomb_003_hispanic 2.131993 3012 0.0189552 3 0.0180305 3
raceomb_003_hispanicrt 2.131993 3012 0.0189552 3 0.0000000 0
raceomb_003_middleeast 2.131993 3012 0.0189552 3 0.0180305 3
raceomb_003_middleeastrt 2.131993 3012 0.0189552 3 0.0000000 0
raceomb_003_native 1.961858 2589 0.0189552 3 0.0175682 0
raceomb_003_nativert 1.961858 2589 0.0189552 3 0.0000000 0
raceomb_003_pacific 2.131993 3012 0.0189552 3 0.0180305 3
raceomb_003_pacificrt 2.131993 3012 0.0189552 3 0.0000000 0
raceomb_003sub_white 2.282656 2051 0.0188679 0 0.0145138 0
postcodelongrt 3.035539 2697 0.0177696 0 0.0000000 0
postcodenowrt 2.929841 2643 0.0177696 0 0.0000000 0
postcodelong 3.035474 2701 0.0177370 0 0.0159021 0
postcodenow 2.929358 2648 0.0177370 0 0.0159021 0
countrycit003 2.861353 3423 0.0165055 0 0.0165055 0
countrycit003rt 2.861353 3423 0.0165055 0 0.0004716 0
countryres003 2.861353 3423 0.0165055 0 0.0165055 0
countryres003rt 2.861353 3423 0.0165055 0 0.0004716 0
edu 2.942466 3472 0.0165055 0 0.0165055 0
edurt 2.942466 3472 0.0165055 0 0.0004716 0
occuSelf 2.966282 3506 0.0165055 0 0.0160340 0
occuSelfrt 2.966282 3506 0.0165055 0 0.0004716 0
politicalid7 2.637992 3248 0.0164978 0 0.0146123 0
politicalid7rt 2.637992 3248 0.0164978 0 0.0004714 0
religion2014 2.637992 3248 0.0164978 0 0.0155550 0
religion2014rt 2.637992 3248 0.0164978 0 0.0004714 0
religionid 2.641999 3251 0.0164978 0 0.0136696 0
religionidrt 2.641999 3251 0.0164978 0 0.0004714 0
genderIdentity_0002 2.379359 2820 0.0164939 0 0.0160226 0
genderIdentity_0002rt 2.379359 2820 0.0164939 0 0.0004713 0
num002 2.381715 2828 0.0164939 0 0.0113101 0
num002rt 2.381715 2828 0.0164939 0 0.0004713 0
transIdentity 2.379359 2820 0.0164939 0 0.0160226 0
transIdentityrt 2.379359 2820 0.0164939 0 0.0004713 0
Larts 1.007853 17 0.0161963 0 0.0112883 0
Lartsrt 1.007853 17 0.0161963 0 0.0004908 0
Lscience 1.007853 17 0.0161963 0 0.0103067 0
Lsciencert 1.007853 17 0.0161963 0 0.0004908 0
arts 1.007853 17 0.0161963 0 0.0103067 0
artsrt 1.007853 17 0.0161963 0 0.0004908 0
factorability 1.007853 17 0.0161963 0 0.0093252 0
factorabilityrt 1.007853 17 0.0161963 0 0.0004908 0
factordiscrimination 1.007853 17 0.0161963 0 0.0098160 0
factordiscriminationrt 1.007853 17 0.0161963 0 0.0004908 0
factorencouragement 1.007853 17 0.0161963 0 0.0098160 0
factorencouragementrt 1.007853 17 0.0161963 0 0.0004908 0
factorfamily 1.007853 17 0.0161963 0 0.0098160 0
factorfamilyrt 1.007853 17 0.0161963 0 0.0004908 0
factorhighpower 1.007853 17 0.0161963 0 0.0112883 0
factorhighpowerrt 1.007853 17 0.0161963 0 0.0004908 0
factorinterest 1.007853 17 0.0161963 0 0.0098160 0
factorinterestrt 1.007853 17 0.0161963 0 0.0004908 0
goal1 1.007853 17 0.0161963 0 0.0112883 0
goal1rt 1.007853 17 0.0161963 0 0.0004908 0
goal2 1.007853 17 0.0161963 0 0.0083436 0
goal2rt 1.007853 17 0.0161963 0 0.0004908 0
ran9thboys 1.007853 17 0.0161963 0 0.0107975 0
ran9thboysrt 1.007853 17 0.0161963 0 0.0004908 0
ran9thgirls 1.007853 17 0.0161963 0 0.0122699 0
ran9thgirlsrt 1.007853 17 0.0161963 0 0.0004908 0
science 1.007853 17 0.0161963 0 0.0117791 0
sciencert 1.007853 17 0.0161963 0 0.0004908 0
occuSelfDetailrt 3.128169 2406 0.0159778 0 0.0006947 0
occuSelfDetail 3.131220 2455 0.0156783 0 0.0149966 0
block3Cond 1.010606 22 0.0141414 0 0.0050505 0
d 1.010606 22 0.0141414 0 0.0010101 0
feedback 1.010606 22 0.0141414 0 0.0040404 0
raceomb_003sub_whiteotherrt 2.651515 466 0.0134680 0 0.0000000 0
raceomb_003sub_whiteother 2.640364 517 0.0121396 0 0.0091047 0
raceomb_003sub_asianrt 2.298246 387 0.0116959 0 0.0000000 0
raceomb_003sub_asian 2.329787 428 0.0106383 0 0.0070922 0
broughtwebsite 1.336515 694 0.0105012 0 0.0076372 0
broughtwebsitert 1.336515 694 0.0105012 0 0.0009547 0
iatevaluations 1.336515 694 0.0105012 0 0.0066826 0
iatevaluations001 1.336515 694 0.0105012 0 0.0095465 0
iatevaluations001rt 1.336515 694 0.0105012 0 0.0009547 0
iatevaluations002 1.336515 694 0.0105012 0 0.0076372 0
iatevaluations002rt 1.336515 694 0.0105012 0 0.0009547 0
iatevaluations003 1.336515 694 0.0105012 0 0.0085919 0
iatevaluations003rt 1.336515 694 0.0105012 0 0.0009547 0
iatevaluationsrt 1.336515 694 0.0105012 0 0.0009547 0
raceomb_003sub_otherrt 2.450451 177 0.0090090 0 0.0000000 0
raceomb_003sub_other 2.471572 233 0.0066890 0 0.0000000 0
blackLabels 1.000000 0 0.0000000 0 0.0000000 0
raceSet 1.000000 0 0.0000000 0 0.0000000 0
raceomb_003sub_asianother 2.412698 49 0.0000000 0 0.0000000 0
raceomb_003sub_asianotherrt 2.406780 46 0.0000000 0 0.0000000 0
raceomb_003sub_blackother 2.846154 42 0.0000000 0 0.0000000 0
raceomb_003sub_blackotherrt 2.833333 38 0.0000000 0 0.0000000 0
raceomb_003sub_middleeastother 2.782609 19 0.0000000 0 0.0000000 0
raceomb_003sub_middleeastotherrt 2.809524 17 0.0000000 0 0.0000000 0
raceomb_003sub_native 2.452514 141 0.0000000 0 0.0000000 0
raceomb_003sub_nativert 2.408451 55 0.0000000 0 0.0000000 0
raceomb_003sub_pacific 2.731707 31 0.0000000 0 0.0000000 0
raceomb_003sub_pacificother 3.500000 4 0.0000000 0 0.0000000 0
raceomb_003sub_pacificotherrt 3.333333 3 0.0000000 0 0.0000000 0
raceomb_003sub_pacificrt 2.730769 21 0.0000000 0 0.0000000 0
under18 1.000000 0 0.0000000 0 0.0000000 0
under18rt 1.000000 0 0.0000000 0 0.0000000 0
whiteLabels 1.000000 0 0.0000000 0 0.0000000 0

Nothing to see here, probably (beyond the fact that questions in earlier stages of the study were more likely to be repeated).

The duplicates without a true in the skipped variable

Let’s see whether we can learn anything about participants whose “skipped” variable was always “false”.

Number of relevant sessions

length(unique(skipped_summary$session_id[which(skipped_summary$skipped == "false")]))
## [1] 2787
# Index exp.raw by its own session_id column (not explicit.raw's), so the row positions match
for.dup3 <- exp.raw[which(exp.raw$session_id %in% unique(skipped_summary$session_id[which(skipped_summary$skipped == "false")])), c("session_id", "questionnaire_name", "question_name", "question_response")]

for.dup3 <- for.dup3 %>%
    arrange(session_id, questionnaire_name, question_name) %>%
    group_by(session_id, questionnaire_name, question_name) %>%
    mutate(iRep = row_number() - 1, nReps = n() - 1, isRep = nReps > 1) %>%
    ungroup()


for.dup3 <- for.dup3 %>%
    arrange(session_id, questionnaire_name, question_name, question_response) %>%
    group_by(session_id, questionnaire_name, question_name, question_response) %>%
    mutate(iRepResp = row_number() - 1, nRepsResp = n() - 1, isRepResp = nRepsResp > 1) %>%
    ungroup()

for.dup3 <- for.dup3 %>%
    mutate(repNoResp = iRepResp < iRep)

for.dup3 <- for.dup3[order(for.dup3$session_id, for.dup3$questionnaire_name, for.dup3$question_name, for.dup3$question_response), ]

How many repeated rows did this subset of participants have?

my.freq(for.dup3$nReps[which(for.dup3$iRep == 0)])

Very few.

In which questionnaires did they have repetitions?

my.freq(for.dup3$questionnaire_name[which(for.dup3$nReps > 0 & for.dup3$iRep == 1)])

All of the questionnaires.

Let’s see whether the responses were sometimes different. If not, then maybe it was somehow the same data sent twice by the browser.

my.freq(for.dup3$repNoResp[which(for.dup3$iRep > 0)])

Yes, the response was sometimes different. So, this is an actual repetition of the question, and we don’t know why. At least it occurred only in 0.6% of the rows in the explicit table.
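The logic of repNoResp (iRepResp < iRep) may be easier to follow on a tiny example. This is a minimal base-R sketch with made-up responses to a single question within a single session. It uses ave() instead of the dplyr pipeline above and assumes that the rows are in their recorded order.

```r
# Hypothetical responses to one question within one session, in recorded order
question <- c("q1", "q1", "q1")
response <- c("5", "5", "3")

# 0-based position of each row within its question group
iRep <- ave(seq_along(question), question, FUN = seq_along) - 1

# 0-based position of each row within its question-and-response group
iRepResp <- ave(seq_along(question), question, response, FUN = seq_along) - 1

# TRUE when an earlier row for the same question carried a different response
repNoResp <- iRepResp < iRep
print(data.frame(response, iRep, iRepResp, repNoResp))
```

Here only the third row is flagged: it repeats the question, but its response ("3") differs from the earlier ones.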

System Configuration

Let’s test whether particular client setups are more likely to produce duplicate data.

ss <- read.table(paste(dir.raw, "sessions.txt", sep = "\\"), sep = "\t", header = TRUE, fill = TRUE)

library(uaparserjs)
parsed_ua <- ua_parse(ss$user_agent)
parsed_ua$session_id <- ss$session_id
reps3 <- merge(reps2, parsed_ua, by = "session_id")

Let’s examine repetitions by browser:

ttt <- mysumBy(n.reps + any.rep ~ ua.family, dt = reps3)
ttt[which(ttt$n > 100), ]

There are repeated questions in all of the browsers.

By the skipped feature:

ttt <- mysumBy(n.reps + any.rep ~ skipped + ua.family, dt = reps3)
knitr::kable(ttt[which(ttt$n > 100), ])
var skipped ua.family n M SD SE med
n.reps false Chrome 1724 0.447 7.761 0.187 0
n.reps false Edge 592 0.517 8.679 0.357 0
n.reps false Safari 330 0.097 0.863 0.047 0
n.reps true Chrome 1873 2.166 16.327 0.377 0
n.reps true Edge 506 1.036 10.696 0.475 0
n.reps true Firefox 118 1.831 16.783 1.545 0
n.reps true Safari 358 3.028 21.024 1.111 0
any.rep false Chrome 1724 0.019 0.135 0.003 0
any.rep false Edge 592 0.012 0.108 0.004 0
any.rep false Safari 330 0.015 0.122 0.007 0
any.rep true Chrome 1873 0.072 0.258 0.006 0
any.rep true Edge 506 0.045 0.209 0.009 0
any.rep true Firefox 118 0.068 0.252 0.023 0
any.rep true Safari 358 0.089 0.286 0.015 0

Bots?

If some of the skipping was due to some kind of non-human behavior, perhaps we will see that in a failure to complete the IAT reasonably.

iat <- read.table(paste(dir.raw, "iat.txt", sep = "\\"), sep = "\t", header = TRUE, fill = TRUE)
iat.errs <- iat %>%
    group_by(session_id) %>%
    summarise(iat.rows = n(), iat.err = mean(trial_error, na.rm = TRUE))
reps4 <- merge(reps2, iat.errs, by = "session_id", all.x = T)
reps4$anyIAT <- !is.na(reps4$iat.rows) & reps4$iat.rows > 0

IAT performance and number of recorded rows, by the “skip” functionality, and by whether the session had any repetition (any.rep):

knitr::kable(mysumBy(iat.err + iat.rows + anyIAT ~ skipped + any.rep, dt = reps4))
var skipped any.rep n M SD SE med
iat.err false FALSE 1987 0.079 0.067 0.001 0.061
iat.err false TRUE 48 0.097 0.101 0.014 0.066
iat.err true FALSE 1974 0.079 0.065 0.001 0.066
iat.err true TRUE 175 0.079 0.059 0.004 0.061
iat.rows false FALSE 1987 188.304 31.397 0.600 196.000
iat.rows false TRUE 48 210.750 59.308 8.387 196.000
iat.rows true FALSE 1974 188.227 31.828 0.613 196.000
iat.rows true TRUE 175 219.246 71.221 4.915 196.000
anyIAT false FALSE 2737 0.726 0.446 0.009 1.000
anyIAT false TRUE 50 0.960 0.198 0.028 1.000
anyIAT true FALSE 2694 0.733 0.443 0.009 1.000
anyIAT true TRUE 210 0.833 0.374 0.026 1.000

The IAT error rate was not much higher (if at all) in sessions with repetitions. The number of rows was higher in sessions with repetitions, probably due to repetitions in taking the IAT. anyIAT indicates whether the session had any IAT rows. When repetition occurred, there was a higher likelihood of having any IAT rows.
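As a rough check of the "not much higher (if at all)" reading, the largest difference in the table (iat.err in the “skipped = false” condition, 0.097 vs. 0.079) can be tested from the printed summary statistics. This is only a sketch: it uses a Welch t-test, which is my choice and not part of the original analysis, and it relies on rounded values.

```r
# Summary statistics for iat.err in the "skipped = false" condition,
# copied from the table above (any.rep FALSE vs. TRUE)
n <- c(no_rep = 1987, rep = 48)
m <- c(no_rep = 0.079, rep = 0.097)
s <- c(no_rep = 0.067, rep = 0.101)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
se2 <- s^2 / n
t_stat <- unname((m["rep"] - m["no_rep"]) / sqrt(sum(se2)))
df_welch <- sum(se2)^2 / sum(se2^2 / (n - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df_welch)
print(c(t = t_stat, df = df_welch, p = p_value))
```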

Summary

  • Repetition of questions occurred in about 7% of the sessions that had the “skip” feature enabled, and in almost 2% of the sessions that did not have the “skip” feature enabled.

  • I did not find any clue regarding the reason for the duplicates when “skip” is not enabled (no relation to system configuration, or to IAT performance).
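The two percentages in the first point can be recovered, and compared formally, from session counts. The counts below are read off the n column of the anyIAT rows of the table in the Bots section (210 of 2904 sessions with “skip” enabled and 50 of 2787 sessions without it had any repetition). The use of prop.test is my own suggestion, not part of the original analysis.

```r
# Sessions with any repetition, and all sessions, per condition
rep_sessions <- c(skip_on = 210, skip_off = 50)
all_sessions <- c(skip_on = 2904, skip_off = 2787)

# Proportion of sessions with at least one repeated question
prop_rep <- rep_sessions / all_sessions
print(round(prop_rep, 3))  # skip_on 0.072, skip_off 0.018

# Two-sample test of equal proportions (chi-square with continuity correction)
res <- prop.test(rep_sessions, all_sessions)
print(res$p.value)
```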