Introduction

In this analysis, I investigate whether enabling the “skip” functionality in Minno.js is responsible for repetitions of questions and tasks in our data. To explore this, I modified the script at the beginning of the demo.genderscience study to randomly assign whether the “skip” functionality is activated for each participant. Specifically, the script sets a skip variable to ‘true’ or ‘false’ at random (50% chance each), activates the “skip” setting when the value is ‘true’, and saves the value under the name skipped.

Here’s the JavaScript code I used:

var skip = 'false';
if (Math.random() < 0.5) {
    skip = 'true';
    API.addSettings('skip', true);
}
API.save({skipped:skip});

The API.save function records the value of skipped in the dataset. This analysis will help us test whether enabling this “skip” functionality leads to duplicated questions and tasks in the data.

towork <- "C:/Users/Yoav/Documents/gdrive.mail.tau.ac.il/Other computers/My Laptop"
source(paste0(towork, "/work/resources/stasExamples/R/yba.funcs.R"))

# The data's folder (directory)
dir.to.data <- "C:\\Users\\Yoav\\OneDrive - Tel-Aviv University\\Documents\\bigfiles\\"
dir.raw <- paste0(dir.to.data, "demo.studies\\demo.genderscience.0003.skip\\raw")
dir.out <- paste0(dir.to.data, "demo.studies\\demo.genderscience.0003.skip\\processed")

explicit.raw <- read.table(paste(dir.raw, "explicit.txt", sep = "\\"), sep = "\t", header = TRUE, fill = TRUE, quote = NULL, comment.char = "")

exp.sids <- explicit.raw$session_id[which(explicit.raw$question_name == "skipped")]
exp.raw <- explicit.raw[which(explicit.raw$session_id %in% exp.sids), ]
cat("number of unique sessions in the explicit table:", length(unique(exp.raw$session_id)))

## number of unique sessions in the explicit table: 5691

skipped <- exp.raw[which(exp.raw$question_name == "skipped"), c("session_id", "question_response")]
names(skipped)[names(skipped) == "question_response"] <- "skipped"

Duplicated “skipped”

If the ‘skip’ functionality is solely responsible for duplicated questions, we should not observe any session whose duplicate entries for the variable ‘skipped’ contain only the value ‘false’.

# Count duplicate session_id in rows that recorded the 'skipped' variable
dups <- setDF(setDT(skipped)[, if (.N > 1L) .SD, by = .(session_id)])
dup.sids <- unique(dups$session_id)

skipped.dups <- setDF(skipped)[which(skipped$session_id %in% dup.sids), ]

# Group by session_id and check if all values of skipped are 'false'
false_only_sessions <- skipped.dups %>%
    group_by(session_id) %>%
    dplyr::summarize(
        all_false = all(skipped == "false"),
        num_repetitions = n()  # Count the number of repetitions for each session
    )

# Compute the expected likelihood of having only 'false' by chance (0.5^n)
false_only_sessions <- false_only_sessions %>%
    mutate(expected_all_false = 0.5^num_repetitions)

# Count how many sessions have only 'false' in all their repetitions
false_only_count <- sum(false_only_sessions$all_false)

# Get the number of unique sessions in skipped.dups
unique_session_count <- n_distinct(skipped.dups$session_id)

# Compute the expected number of sessions with only 'false'
expected_false_only_count <- sum(false_only_sessions$expected_all_false)

Total number of unique sessions with at least 2 recordings of the ‘skipped’ variable: 251

Number of sessions that included at least 2 recordings of the ‘skipped’ variable, but always with the value ‘false’: 48

Expected number of sessions with only ‘false’ (by chance): 55.890625

We found sessions with duplicated recordings that contain only “skipped = false” values. Further, the number of only-false sessions (48) is close to the roughly 56 expected by chance if skipping does not increase the likelihood of repetition.
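The chance expectation used above can be illustrated with a small sketch. It assumes that every recording of ‘skipped’ is an independent 50/50 draw, so a session with n recordings is all-‘false’ with probability 0.5^n. The toy data frame and its values are hypothetical and only mirror the column names used in the analysis. Note that if all 251 duplicated sessions had exactly two recordings, the expectation would be 251 × 0.25 = 62.75, so the reported value of 55.89 implies that some sessions had more than two.

```r
# Toy data (hypothetical): how many times 'skipped' was recorded in each
# duplicated session, and whether every recording was 'false'
toy <- data.frame(
  session_id      = 1:6,
  num_repetitions = c(2, 2, 2, 3, 3, 4),
  all_false       = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Under independent 50/50 draws, P(all n recordings are 'false') = 0.5^n
toy$expected_all_false <- 0.5^toy$num_repetitions

observed <- sum(toy$all_false)           # sessions that were always 'false'
expected <- sum(toy$expected_all_false)  # expected number of such sessions

cat("observed:", observed, "expected:", expected, "\n")
```

An observed count far below the expected one would suggest that sessions with at least one ‘true’ draw are over-represented among the duplicated sessions.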

Still, hereafter, in the rest of the analyses, we will consider any session that had “skipped=true” even once as a session in the “skipped=true” condition.

# Group by session_id and summarize the skipped column.
# If any 'skipped' value is 'true', mark the whole session as 'true', otherwise 'false'
skipped_summary <- skipped %>%
    group_by(session_id) %>%
    dplyr::summarize(skipped = ifelse(any(skipped == "true"), "true", "false")) %>%
    ungroup()

# View the resulting data.frame
# head(skipped_summary)

# Count dups
for.dup <- exp.raw[, c("session_id", "attempt", "questionnaire_name", "question_name", "question_response")]

for.dup <- setDF(setDT(for.dup)[, nReps := 1:.N, by = c("session_id", "questionnaire_name", "question_name")])
rep.qst.sids <- unique(for.dup$session_id[which(for.dup$nReps > 1)])

for.dup <- setDF(setDT(for.dup)[, nResps := 1:.N, by = c("session_id", "questionnaire_name", "question_name", "question_response")])
rep.resp.sids <- unique(for.dup$session_id[which(for.dup$nResps > 1)])

multi.attempt.sids <- unique(for.dup$session_id[which(for.dup$attempt > 1)])

reps <- data.frame(session_id = unique(for.dup$session_id))

reps$rep.attempt <- reps$session_id %in% multi.attempt.sids

reps$rep.qst <- reps$session_id %in% rep.qst.sids
reps$rep.resp <- reps$session_id %in% rep.resp.sids

reps <- merge(reps, skipped_summary, by = "session_id")

attempts <- for.dup %>%
    group_by(session_id) %>%
    dplyr::summarise(m.attempt = mean(attempt, na.rm = TRUE), n.attempt.gt1 = sum(attempt > 1, na.rm = TRUE))

reps <- merge(reps, attempts, by = "session_id")

Repetition by skip condition

We will examine a few by-session variables:

rep.qst = whether there was any repeating question in that session.
rep.resp = whether there was any repeating question with exactly the same response in that session.
rep.attempt = whether the session had any rows with an attempt value larger than 1.
m.attempt = the mean of the attempt value within session.
n.attempt.gt1 = how many attempt values above 1 the session had.

mysumBy(rep.qst + rep.resp + rep.attempt + m.attempt + n.attempt.gt1 ~ skipped, dt = reps)

Every session that had a repeated question also had a repeated response. Although there were repetitions even without “skip” enabled, the results suggest that enabling skipping increased the likelihood of duplicate questions.

# Convert logical columns to numeric (TRUE -> 1, FALSE -> 0)
reps_numeric <- reps %>%
    mutate(across(c(rep.qst, rep.resp, rep.attempt, m.attempt, n.attempt.gt1), as.numeric)) %>%
    # Convert 'skipped' to numeric ('true' -> 1, 'false' -> 0)
    mutate(skipped = ifelse(skipped == "true", 1, 0))

Let’s explore the relations between the different repetition variables (specifically, between having repeated questions and having attempt values higher than 1).

my.htmlTable(cornp(reps_numeric[, c("rep.qst", "rep.resp", "rep.attempt", "m.attempt", "n.attempt.gt1", "skipped")]))

	varName	rep.qst	rep.resp	rep.attempt	m.attempt	n.attempt.gt1
1	rep.resp	1 < .001 5691
2	rep.attempt	0.014 0.304 5691	0.014 0.304 5691
3	m.attempt	0.218 < .001 5691	0.218 < .001 5691	0.143 < .001 5691
4	n.attempt.gt1	0.222 < .001 5691	0.222 < .001 5691	0.101 < .001 5691	0.772 < .001 5691
5	skipped	0.13 < .001 5691	0.13 < .001 5691	-0.01 0.449 5691	0.037 0.005 5691	0.035 0.008 5691

Having a repeated question was essentially unrelated to having any attempt value above 1 (r = .014), and only modestly related to the mean attempt value and to the number of attempt values above 1 (r ≈ .22). So, the attempt variable is probably not useful for investigating or detecting duplicates.

Different count variables

To verify these results, we will use different processing to compute similar variables.

for.dup2 <- exp.raw[, c("session_id", "attempt", "questionnaire_name", "question_name", "question_response")]

for.dup2 <- for.dup2 %>%
    arrange(session_id, questionnaire_name, question_name) %>%
    group_by(session_id, questionnaire_name, question_name) %>%
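    # nReps counts the other rows with the same key (group size minus one).
    # Note: isRep (nReps > 1) is TRUE only when the question appears three or more times.
    mutate(iRep = row_number() - 1, nReps = n() - 1, isRep = nReps > 1) %>%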
    ungroup()


for.dup2 <- for.dup2 %>%
    arrange(session_id, questionnaire_name, question_name, question_response) %>%
    group_by(session_id, questionnaire_name, question_name, question_response) %>%
    mutate(iRepResp = row_number() - 1, nRepsResp = n() - 1, isRepResp = nRepsResp > 1) %>%
    ungroup()

for.dup2 <- for.dup2[order(for.dup2$session_id, for.dup2$questionnaire_name, for.dup2$question_name, for.dup2$attempt, for.dup2$question_response), ]

How many of all the rows in the explicit table were a repetition?

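isRep = For each row in the explicit table, does its combination of [session_id, questionnaire_name, question_name] appear in other rows? If yes, that’s a repetition of the question. Note that the code flags a row only when nReps > 1, that is, when at least two other rows share the combination.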

isRepResp = The same as isRep, but the combination also includes the exact response (i.e., [session_id, questionnaire_name, question_name, question_response])
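To make these counts concrete, here is a minimal base-R sketch on a hypothetical toy table (the rows and the helper names key and keyResp are illustrative, not part of the analysis). It counts, for each row, how many other rows share its key, and shows how the choice of threshold (above 0 versus above 1) changes which rows are flagged.

```r
# Toy explicit table (hypothetical): session 1 answered isTouch three times,
# session 2 twice, session 3 once
toy <- data.frame(
  session_id         = c(1, 1, 1, 2, 2, 3),
  questionnaire_name = "mgr",
  question_name      = "isTouch",
  question_response  = c("yes", "yes", "no", "no", "no", "yes"),
  stringsAsFactors   = FALSE
)

# Grouping keys: the question, and the question plus its exact response
key     <- paste(toy$session_id, toy$questionnaire_name, toy$question_name, sep = "|")
keyResp <- paste(key, toy$question_response, sep = "|")

# Number of OTHER rows that share the key (group size minus one)
toy$nReps     <- ave(rep(1, nrow(toy)), key, FUN = length) - 1
toy$nRepsResp <- ave(rep(1, nrow(toy)), keyResp, FUN = length) - 1

# Two possible flags: any repetition vs. at least two other rows
toy$anyOtherRow  <- toy$nReps > 0
toy$twoOtherRows <- toy$nReps > 1

print(toy[, c("session_id", "question_response", "nReps", "nRepsResp",
              "anyOtherRow", "twoOtherRows")])
```

In this toy table, session 2 (two rows for the same question) is flagged by the first threshold but not by the second, so the two thresholds can give different row-level rates.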

repsB <- merge(for.dup2, skipped_summary, by = "session_id", all.x = T)
mysumBy(isRep + isRepResp ~ skipped, dt = repsB, round = 4)

Without skipping, repetition occurred in 0.001% of the rows. With skipping, repetition occurred in 0.02% of the rows.

Let’s see how common each attempt value was in the explicit table.

my.freq(for.dup2$attempt)

Next, we will summarize the repetition variable by session_id, and the results are supposed to be very similar to what we saw previously.

# Summarize the attempt and repetition variables for each session
reps2 <- for.dup2 %>%
  group_by(session_id) %>%
  dplyr::summarise(
    m.attempt = mean(attempt, na.rm = TRUE),
    n.attempt.gt1 = sum(attempt > 1, na.rm = TRUE),

    m.nReps = mean(nReps, na.rm = TRUE),
    n.reps = sum(nReps > 0, na.rm = TRUE),
    any.rep = m.nReps > 0,
    
    m.nRepsResp = mean(nRepsResp, na.rm = TRUE),
    n.repsResp = sum(nRepsResp > 0, na.rm = TRUE),
    any.repResp = m.nRepsResp > 0
  )

reps2 <- merge(reps2, skipped_summary, by='session_id')

n.reps = how many repeated questions the session had.
any.rep = Did the session include any repetitions of any question? This is probably the most important variable.
n.repsResp = How many repeated questions and responses the session had.
any.repResp = Did the session include any repetitions of a question and a response?
m.attempt = the mean value of the attempt column, for each session
n.attempt.gt1 = the number of rows with an attempt value above 1, for each session

knitr::kable(mysumBy(n.reps + any.rep + n.repsResp + any.repResp + m.attempt + n.attempt.gt1 ~ skipped, dt = reps2))

var	skipped	n	M	SD	SE	med
n.reps	false	2787	0.406	7.305	0.138	0.000
n.reps	true	2904	2.083	16.149	0.300	0.000
any.rep	false	2787	0.018	0.133	0.003	0.000
any.rep	true	2904	0.072	0.259	0.005	0.000
n.repsResp	false	2787	0.233	3.625	0.069	0.000
n.repsResp	true	2904	0.969	6.785	0.126	0.000
any.repResp	false	2787	0.018	0.133	0.003	0.000
any.repResp	true	2904	0.072	0.259	0.005	0.000
m.attempt	false	2787	1.673	0.291	0.006	1.699
m.attempt	true	2904	1.696	0.305	0.006	1.714
n.attempt.gt1	false	2787	25.525	15.665	0.297	32.000
n.attempt.gt1	true	2904	26.661	16.538	0.307	33.000

The results are the same as in the previous method: having the “skip” option increased the likelihood of a repeated question (an effect of about d = 0.25 on any.rep), but repeated questions (and responses) were still possible even without the “skip” option.
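The effect size quoted above can be approximated from the any.rep rows of the summary table. The sketch below is one way to do it, assuming a standard pooled-SD Cohen's d; the function name cohens_d is illustrative and is not part of yba.funcs.R. With the rounded table values it gives roughly 0.26, in line with the figure of about 0.25.

```r
# Cohen's d from group means, SDs, and sample sizes (pooled-SD version)
cohens_d <- function(m1, sd1, n1, m2, sd2, n2) {
  pooled_sd <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  (m2 - m1) / pooled_sd
}

# any.rep by condition, copied from the table above (rounded values)
d <- cohens_d(m1 = 0.018, sd1 = 0.133, n1 = 2787,   # skipped = false
              m2 = 0.072, sd2 = 0.259, n2 = 2904)   # skipped = true

print(round(d, 2))
```

Because any.rep is a binary variable, a difference in proportions or a risk ratio may be easier to interpret than d.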

Repetition by questionnaire

qreps <- for.dup2 %>%
  group_by(questionnaire_name) %>%
  dplyr::summarise(
    m.attempt = mean(attempt, na.rm = TRUE),
    n.attempt.gt1 = sum(attempt > 1, na.rm = TRUE),

    m.nReps = mean(nReps, na.rm = TRUE),
    n.reps = sum(nReps > 1, na.rm = TRUE),

    m.nRepsResp = mean(nRepsResp, na.rm = TRUE),
    n.repsResp = sum(nRepsResp > 1, na.rm = TRUE)
  )

For each questionnaire we computed:

m.attempt = the mean value of the attempt column.
n.attempt.gt1 = the number of attempt values greater than 1
m.nReps = For each row, how many times did it repeat? (the repeated combination was session_id, questionnaire_name, question_name)
n.reps = How many rows shared the [session_id, questionnaire_name, question_name] combination with at least two other rows (the code counts nReps > 1)
m.nRepsResp = For each row, how many times did it repeat? (the repeated combination was session_id, questionnaire_name, question_name, question_response)
n.repsResp = How many rows shared the [session_id, questionnaire_name, question_name, question_response] combination with at least two other rows (the code counts nRepsResp > 1)

knitr::kable(qreps[order(-qreps$m.nReps), ])

questionnaire_name	m.attempt	n.attempt.gt1	m.nReps	n.reps	m.nRepsResp	n.repsResp
mgr	1.623097	6228	0.1326233	322	0.1001747	224
ageCheck	1.016317	320	0.0326340	48	0.0157343	24
race	2.136439	54886	0.0187882	54	0.0088210	24
demographics	2.734115	79644	0.0166533	0	0.0078833	0
explicits	1.007853	476	0.0161963	0	0.0054689	0
iat	1.010606	66	0.0141414	0	0.0033670	0
debriefing	1.336515	6940	0.0105012	0	0.0044869	0
under18	1.000000	0	0.0000000	0	0.0000000	0

It makes sense that duplicates would be detected more often in earlier stages of the study (mgr and ageCheck).

Repetition by question

qqreps <- for.dup2 %>%
  group_by(question_name) %>%
  dplyr::summarise(
    m.attempt = mean(attempt, na.rm = TRUE),
    n.attempt.gt1 = sum(attempt > 1, na.rm = TRUE),

    m.nReps = mean(nReps, na.rm = TRUE),
    n.reps = sum(nReps > 1, na.rm = TRUE),

    m.nRepsResp = mean(nRepsResp, na.rm = TRUE),
    n.repsResp = sum(nRepsResp > 1, na.rm = TRUE)
  )

For each question we computed:

m.attempt = the mean value of the attempt column.
n.attempt.gt1 = the number of attempt values greater than 1
m.nReps = For each row, how many times did it repeat? (the repeated combination was session_id, questionnaire_name, question_name)
n.reps = How many rows shared the [session_id, questionnaire_name, question_name] combination with at least two other rows (the code counts nReps > 1)
m.nRepsResp = For each row, how many times did it repeat? (the repeated combination was session_id, questionnaire_name, question_name, question_response)
n.repsResp = How many rows shared the [session_id, questionnaire_name, question_name, question_response] combination with at least two other rows (the code counts nRepsResp > 1)

knitr::kable(qqreps[order(-qqreps$m.nReps), ])

question_name	m.attempt	n.attempt.gt1	m.nReps	n.reps	m.nRepsResp	n.repsResp
gaySet	1.500000	1	1.0000000	0	1.0000000	0
genderIdentity_0002otherrt	2.545454	6	0.1818182	0	0.0000000	0
isTouch	1.675986	3419	0.1331336	161	0.1331336	161
skipped	1.570525	2808	0.1318901	161	0.0669442	63
genderIdentity_0002other	2.142857	10	0.0952381	0	0.0000000	0
raceomb_003sub_blackrt	2.292035	257	0.0412979	3	0.0000000	0
raceomb_003sub_black	2.316384	268	0.0395480	3	0.0395480	3
raceomb_003sub_hispanicotherrt	2.675214	96	0.0341880	0	0.0000000	0
raceomb_003sub_hispanicother	2.650000	98	0.0333333	0	0.0333333	0
birthmonth	1.016317	80	0.0326340	12	0.0310800	12
birthmonthrt	1.016317	80	0.0326340	12	0.0007770	0
birthyear	1.016317	80	0.0326340	12	0.0303030	12
birthyearrt	1.016317	80	0.0326340	12	0.0007770	0
raceomb_003sub_middleeastrt	2.357143	60	0.0238095	0	0.0000000	0
raceomb_003sub_hispanicrt	2.439759	384	0.0200803	0	0.0000000	0
raceomb_003sub_middleeast	2.460784	76	0.0196078	0	0.0196078	0
raceomb_003sub_hispanic	2.446602	396	0.0194175	0	0.0194175	0
raceomb_003sub_whitert	2.280000	2028	0.0190826	0	0.0000000	0
raceomb_003_other	2.251561	3223	0.0189595	3	0.0184971	3
raceomb_003_otherrt	2.251561	3223	0.0189595	3	0.0000000	0
raceomb_003_white	2.251561	3223	0.0189595	3	0.0180347	3
raceomb_003_whitert	2.251561	3223	0.0189595	3	0.0000000	0
raceomb_003_asian	1.961858	2589	0.0189552	3	0.0184928	3
raceomb_003_asianrt	1.961858	2589	0.0189552	3	0.0000000	0
raceomb_003_black	1.961858	2589	0.0189552	3	0.0184928	3
raceomb_003_blackrt	1.961858	2589	0.0189552	3	0.0000000	0
raceomb_003_hispanic	2.131993	3012	0.0189552	3	0.0180305	3
raceomb_003_hispanicrt	2.131993	3012	0.0189552	3	0.0000000	0
raceomb_003_middleeast	2.131993	3012	0.0189552	3	0.0180305	3
raceomb_003_middleeastrt	2.131993	3012	0.0189552	3	0.0000000	0
raceomb_003_native	1.961858	2589	0.0189552	3	0.0175682	0
raceomb_003_nativert	1.961858	2589	0.0189552	3	0.0000000	0
raceomb_003_pacific	2.131993	3012	0.0189552	3	0.0180305	3
raceomb_003_pacificrt	2.131993	3012	0.0189552	3	0.0000000	0
raceomb_003sub_white	2.282656	2051	0.0188679	0	0.0145138	0
postcodelongrt	3.035539	2697	0.0177696	0	0.0000000	0
postcodenowrt	2.929841	2643	0.0177696	0	0.0000000	0
postcodelong	3.035474	2701	0.0177370	0	0.0159021	0
postcodenow	2.929358	2648	0.0177370	0	0.0159021	0
countrycit003	2.861353	3423	0.0165055	0	0.0165055	0
countrycit003rt	2.861353	3423	0.0165055	0	0.0004716	0
countryres003	2.861353	3423	0.0165055	0	0.0165055	0
countryres003rt	2.861353	3423	0.0165055	0	0.0004716	0
edu	2.942466	3472	0.0165055	0	0.0165055	0
edurt	2.942466	3472	0.0165055	0	0.0004716	0
occuSelf	2.966282	3506	0.0165055	0	0.0160340	0
occuSelfrt	2.966282	3506	0.0165055	0	0.0004716	0
politicalid7	2.637992	3248	0.0164978	0	0.0146123	0
politicalid7rt	2.637992	3248	0.0164978	0	0.0004714	0
religion2014	2.637992	3248	0.0164978	0	0.0155550	0
religion2014rt	2.637992	3248	0.0164978	0	0.0004714	0
religionid	2.641999	3251	0.0164978	0	0.0136696	0
religionidrt	2.641999	3251	0.0164978	0	0.0004714	0
genderIdentity_0002	2.379359	2820	0.0164939	0	0.0160226	0
genderIdentity_0002rt	2.379359	2820	0.0164939	0	0.0004713	0
num002	2.381715	2828	0.0164939	0	0.0113101	0
num002rt	2.381715	2828	0.0164939	0	0.0004713	0
transIdentity	2.379359	2820	0.0164939	0	0.0160226	0
transIdentityrt	2.379359	2820	0.0164939	0	0.0004713	0
Larts	1.007853	17	0.0161963	0	0.0112883	0
Lartsrt	1.007853	17	0.0161963	0	0.0004908	0
Lscience	1.007853	17	0.0161963	0	0.0103067	0
Lsciencert	1.007853	17	0.0161963	0	0.0004908	0
arts	1.007853	17	0.0161963	0	0.0103067	0
artsrt	1.007853	17	0.0161963	0	0.0004908	0
factorability	1.007853	17	0.0161963	0	0.0093252	0
factorabilityrt	1.007853	17	0.0161963	0	0.0004908	0
factordiscrimination	1.007853	17	0.0161963	0	0.0098160	0
factordiscriminationrt	1.007853	17	0.0161963	0	0.0004908	0
factorencouragement	1.007853	17	0.0161963	0	0.0098160	0
factorencouragementrt	1.007853	17	0.0161963	0	0.0004908	0
factorfamily	1.007853	17	0.0161963	0	0.0098160	0
factorfamilyrt	1.007853	17	0.0161963	0	0.0004908	0
factorhighpower	1.007853	17	0.0161963	0	0.0112883	0
factorhighpowerrt	1.007853	17	0.0161963	0	0.0004908	0
factorinterest	1.007853	17	0.0161963	0	0.0098160	0
factorinterestrt	1.007853	17	0.0161963	0	0.0004908	0
goal1	1.007853	17	0.0161963	0	0.0112883	0
goal1rt	1.007853	17	0.0161963	0	0.0004908	0
goal2	1.007853	17	0.0161963	0	0.0083436	0
goal2rt	1.007853	17	0.0161963	0	0.0004908	0
ran9thboys	1.007853	17	0.0161963	0	0.0107975	0
ran9thboysrt	1.007853	17	0.0161963	0	0.0004908	0
ran9thgirls	1.007853	17	0.0161963	0	0.0122699	0
ran9thgirlsrt	1.007853	17	0.0161963	0	0.0004908	0
science	1.007853	17	0.0161963	0	0.0117791	0
sciencert	1.007853	17	0.0161963	0	0.0004908	0
occuSelfDetailrt	3.128169	2406	0.0159778	0	0.0006947	0
occuSelfDetail	3.131220	2455	0.0156783	0	0.0149966	0
block3Cond	1.010606	22	0.0141414	0	0.0050505	0
d	1.010606	22	0.0141414	0	0.0010101	0
feedback	1.010606	22	0.0141414	0	0.0040404	0
raceomb_003sub_whiteotherrt	2.651515	466	0.0134680	0	0.0000000	0
raceomb_003sub_whiteother	2.640364	517	0.0121396	0	0.0091047	0
raceomb_003sub_asianrt	2.298246	387	0.0116959	0	0.0000000	0
raceomb_003sub_asian	2.329787	428	0.0106383	0	0.0070922	0
broughtwebsite	1.336515	694	0.0105012	0	0.0076372	0
broughtwebsitert	1.336515	694	0.0105012	0	0.0009547	0
iatevaluations	1.336515	694	0.0105012	0	0.0066826	0
iatevaluations001	1.336515	694	0.0105012	0	0.0095465	0
iatevaluations001rt	1.336515	694	0.0105012	0	0.0009547	0
iatevaluations002	1.336515	694	0.0105012	0	0.0076372	0
iatevaluations002rt	1.336515	694	0.0105012	0	0.0009547	0
iatevaluations003	1.336515	694	0.0105012	0	0.0085919	0
iatevaluations003rt	1.336515	694	0.0105012	0	0.0009547	0
iatevaluationsrt	1.336515	694	0.0105012	0	0.0009547	0
raceomb_003sub_otherrt	2.450451	177	0.0090090	0	0.0000000	0
raceomb_003sub_other	2.471572	233	0.0066890	0	0.0000000	0
blackLabels	1.000000	0	0.0000000	0	0.0000000	0
raceSet	1.000000	0	0.0000000	0	0.0000000	0
raceomb_003sub_asianother	2.412698	49	0.0000000	0	0.0000000	0
raceomb_003sub_asianotherrt	2.406780	46	0.0000000	0	0.0000000	0
raceomb_003sub_blackother	2.846154	42	0.0000000	0	0.0000000	0
raceomb_003sub_blackotherrt	2.833333	38	0.0000000	0	0.0000000	0
raceomb_003sub_middleeastother	2.782609	19	0.0000000	0	0.0000000	0
raceomb_003sub_middleeastotherrt	2.809524	17	0.0000000	0	0.0000000	0
raceomb_003sub_native	2.452514	141	0.0000000	0	0.0000000	0
raceomb_003sub_nativert	2.408451	55	0.0000000	0	0.0000000	0
raceomb_003sub_pacific	2.731707	31	0.0000000	0	0.0000000	0
raceomb_003sub_pacificother	3.500000	4	0.0000000	0	0.0000000	0
raceomb_003sub_pacificotherrt	3.333333	3	0.0000000	0	0.0000000	0
raceomb_003sub_pacificrt	2.730769	21	0.0000000	0	0.0000000	0
under18	1.000000	0	0.0000000	0	0.0000000	0
under18rt	1.000000	0	0.0000000	0	0.0000000	0
whiteLabels	1.000000	0	0.0000000	0	0.0000000	0

Nothing to see here, probably (beyond the fact that questions in earlier stages of the study were more likely to be repeated).

The duplicates without a true in the skipped variable

Let’s see whether we can learn anything about participants whose “skipped” variable was always “false”

Number of relevant sessions

length(unique(skipped_summary$session_id[which(skipped_summary$skipped == "false")]))

## [1] 2787

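# Keep only sessions whose 'skipped' value was never 'true'.
# The filter uses exp.raw's own session_id so that the row indices match exp.raw.
for.dup3 <- exp.raw[which(exp.raw$session_id %in% unique(skipped_summary$session_id[which(skipped_summary$skipped == "false")])), c("session_id", "questionnaire_name", "question_name", "question_response")]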

for.dup3 <- for.dup3 %>%
    arrange(session_id, questionnaire_name, question_name) %>%
    group_by(session_id, questionnaire_name, question_name) %>%
    mutate(iRep = row_number() - 1, nReps = n() - 1, isRep = nReps > 1) %>%
    ungroup()


for.dup3 <- for.dup3 %>%
    arrange(session_id, questionnaire_name, question_name, question_response) %>%
    group_by(session_id, questionnaire_name, question_name, question_response) %>%
    mutate(iRepResp = row_number() - 1, nRepsResp = n() - 1, isRepResp = nRepsResp > 1) %>%
    ungroup()

for.dup3 <- for.dup3 %>%
    mutate(repNoResp = iRepResp < iRep)

for.dup3 <- for.dup3[order(for.dup3$session_id, for.dup3$questionnaire_name, for.dup3$question_name, for.dup3$question_response), ]

How many repeated rows did this subset of participants have?

my.freq(for.dup3$nReps[which(for.dup3$iRep == 0)])

Very few.

In which questionnaires did they have repetitions?

my.freq(for.dup3$questionnaire_name[which(for.dup3$nReps > 0 & for.dup3$iRep == 1)])

All of the questionnaires.

Let’s see whether the responses were sometimes different. If not, then maybe it was somehow the same data sent twice by the browser.

my.freq(for.dup3$repNoResp[which(for.dup3$iRep > 0)])

Yes, the response was sometimes different. So, this is an actual repetition of the question, and we don’t know why. At least it occurred only in 0.6% of the rows in the explicit table.
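The logic behind repNoResp can be seen in a small base-R sketch on hypothetical rows (the data and the helper names key and keyResp are illustrative). A repeated row gets repNoResp = TRUE when its position within the question group is higher than its position within the question-and-response group, which happens when a later response differs from an earlier one.

```r
# Toy rows (hypothetical): session 1 repeats the same answer,
# session 2 gives a different answer the second time
toy <- data.frame(
  session_id        = c(1, 1, 2, 2),
  question_name     = "birthyear",
  question_response = c("1990", "1990", "1985", "1986"),
  stringsAsFactors  = FALSE
)

key     <- paste(toy$session_id, toy$question_name, sep = "|")
keyResp <- paste(key, toy$question_response, sep = "|")

# 0-based position of each row within its group (row_number() - 1 in dplyr)
toy$iRep     <- ave(seq_len(nrow(toy)), key, FUN = seq_along) - 1
toy$iRepResp <- ave(seq_len(nrow(toy)), keyResp, FUN = seq_along) - 1

# TRUE when the row repeats the question but not the exact response
toy$repNoResp <- toy$iRepResp < toy$iRep

print(toy)
```

Only the second row of session 2 is flagged here, because it repeats the question with a new response. An identical resend of the same data, as in session 1, would not be flagged.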

System Configuration

Let’s test whether particular client setups are more likely to produce duplicate data.

ss <- read.table(paste(dir.raw, "sessions.txt", sep = "\\"), sep = "\t", header = TRUE, fill = TRUE)

library(uaparserjs)
parsed_ua <- ua_parse(ss$user_agent)
parsed_ua$session_id <- ss$session_id

reps3 <- merge(reps2, parsed_ua, by = "session_id")

Let’s examine repetitions by browser:

ttt <- mysumBy(n.reps + any.rep ~ ua.family, dt = reps3)
ttt[which(ttt$n > 100), ]

There are repeated questions in all of the browsers.

By the skipped feature:

ttt <- mysumBy(n.reps + any.rep ~ skipped + ua.family, dt = reps3)
knitr::kable(ttt[which(ttt$n > 100), ])

	var	skipped	ua.family	n	M	SD	SE
2	n.reps	false	Chrome	1724	0.447	7.761	0.187
6	n.reps	false	Edge	592	0.517	8.679	0.357
12	n.reps	false	Safari	330	0.097	0.863	0.047
14	n.reps	true	Chrome	1873	2.166	16.327	0.377
18	n.reps	true	Edge	506	1.036	10.696	0.475
20	n.reps	true	Firefox	118	1.831	16.783	1.545
24	n.reps	true	Safari	358	3.028	21.024	1.111
27	any.rep	false	Chrome	1724	0.019	0.135	0.003
31	any.rep	false	Edge	592	0.012	0.108	0.004
37	any.rep	false	Safari	330	0.015	0.122	0.007
39	any.rep	true	Chrome	1873	0.072	0.258	0.006
43	any.rep	true	Edge	506	0.045	0.209	0.009
45	any.rep	true	Firefox	118	0.068	0.252	0.023
49	any.rep	true	Safari	358	0.089	0.286	0.015

Bots?

If some of the skipping was due to some kind of non-human behavior, perhaps we will see that in a failure to complete the IAT reasonably.

iat <- read.table(paste(dir.raw, "iat.txt", sep = "\\"), sep = "\t", header = TRUE, fill = TRUE)

iat.errs <- iat %>%
    group_by(session_id) %>%
    summarise(iat.rows = n(), iat.err = mean(trial_error, na.rm = TRUE))

reps4 <- merge(reps2, iat.errs, by = "session_id", all.x = T)
reps4$anyIAT <- !is.na(reps4$iat.rows) & reps4$iat.rows > 0

IAT performance and number of recorded rows, by the “skip” functionality, and by whether the session had any repetition (any.rep):

knitr::kable(mysumBy(iat.err + iat.rows + anyIAT ~ skipped + any.rep, dt = reps4))

var	skipped	any.rep	n	M	SD	SE	med
iat.err	false	FALSE	1987	0.079	0.067	0.001	0.061
iat.err	false	TRUE	48	0.097	0.101	0.014	0.066
iat.err	true	FALSE	1974	0.079	0.065	0.001	0.066
iat.err	true	TRUE	175	0.079	0.059	0.004	0.061
iat.rows	false	FALSE	1987	188.304	31.397	0.600	196.000
iat.rows	false	TRUE	48	210.750	59.308	8.387	196.000
iat.rows	true	FALSE	1974	188.227	31.828	0.613	196.000
iat.rows	true	TRUE	175	219.246	71.221	4.915	196.000
anyIAT	false	FALSE	2737	0.726	0.446	0.009	1.000
anyIAT	false	TRUE	50	0.960	0.198	0.028	1.000
anyIAT	true	FALSE	2694	0.733	0.443	0.009	1.000
anyIAT	true	TRUE	210	0.833	0.374	0.026	1.000

The IAT error rate was not much higher (if at all) in sessions with repetitions. The number of rows was higher in sessions with repetitions, probably due to repetitions in taking the IAT. anyIAT indicates whether the session had any IAT rows. When repetition occurred, there was a higher likelihood of having any IAT rows.

Summary

Repetition of questions occurred in about 7% of the sessions that had the “skip” feature enabled, and in almost 2% of the sessions that did not have it enabled.
I did not find any clue regarding the reason for the duplicates when “skip” is not enabled (no relation to system configuration, or to IAT performance).
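As a rough check on the two rates above, the sketch below recomputes them from the session counts in the anyIAT rows of the IAT table (50 of 2787 sessions without “skip” and 210 of 2904 sessions with it had a repetition) and compares them. The variable names are illustrative, and prop.test is only one of several reasonable tests for this comparison.

```r
# Sessions with any repetition and total sessions, by skip condition
# (counts taken from the n column of the anyIAT rows above)
n_rep   <- c(false = 50,   true = 210)
n_total <- c(false = 2787, true = 2904)

# Repetition rate per condition and the ratio between them
rate       <- n_rep / n_total
risk_ratio <- rate[["true"]] / rate[["false"]]

# Two-sample test of equal proportions
test <- prop.test(n_rep, n_total)

print(round(rate, 3))
cat("risk ratio:", round(risk_ratio, 2), "\n")
cat("p-value:", format.pval(test$p.value), "\n")
```

With these counts, repetition is about four times as likely when “skip” is enabled, while still occurring in the sessions where it is not.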

Report: The skipping feature and duplicates

Yoav Bar-Anan

2024-09-30