# Outcome Analysis and Statistical Methods Guide

## Overview

Rigorous outcome analysis is essential for clinical decision support documents. This guide covers survival analysis, response assessment, statistical testing, and data visualization for patient cohort analyses and treatment evaluation.

## Survival Analysis

### Kaplan-Meier Method

**Overview**
- Non-parametric estimator of survival function from time-to-event data
- Handles censored observations (patients alive at last follow-up)
- Provides survival probability at each time point
- Generates characteristic step-function survival curves

**Key Concepts**

**Censoring**
- **Right censoring**: Most common - patient alive at last follow-up or study end
- **Left censoring**: Event occurred before the first observation; rare in clinical studies
- **Interval censoring**: Event occurred between two assessment times
- **Informative vs non-informative**: Standard methods assume non-informative censoring, i.e., censoring independent of the risk of the event

**Survival Function S(t)**
- S(t) = Probability of surviving beyond time t
- S(0) = 1.0 (100% alive at time zero)
- S(t) is non-increasing over time
- Step decreases at each event time

**Median Survival**
- Smallest time at which S(t) ≤ 0.50
- 50% of patients alive, 50% have had event
- Reported with 95% confidence interval
- "Not reached (NR)" if the K-M curve does not fall to 0.50 during follow-up (typically when fewer than 50% of patients have had an event)

**Survival Rates at Fixed Time Points**
- 1-year survival rate, 2-year survival rate, 5-year survival rate
- Read from K-M curve at specific time point
- Report with 95% CI: S(t) ± 1.96 × SE, with SE from Greenwood's formula

**Calculation Example**
```
Time  Events  At Risk  Survival Probability
0     0       100      1.000
3     2       100      0.980 (98/100)
5     1       95       0.970 (98/100 × 94/95)
8     3       87       0.936 (98/100 × 94/95 × 84/87)
...
```
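The product-limit calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than a substitute for `survfit` or lifelines; the censoring times (4, 6 and 12) in the usage data are hypothetical values chosen only so that the at-risk counts match the table.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of S(t).

    times  -- follow-up time for each patient
    events -- 1 if the event was observed at that time, 0 if censored
    Returns (event_time, n_at_risk, n_events, survival) for each distinct event time.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    survival = 1.0
    table = []
    for t in sorted(deaths):
        # At risk: follow-up time >= t (patients censored at t still count as at risk)
        n_at_risk = sum(1 for x in times if x >= t)
        d = deaths[t]
        survival *= (n_at_risk - d) / n_at_risk
        table.append((t, n_at_risk, d, survival))
    return table


# Hypothetical cohort of 100 patients; censoring at times 4, 6 and 12 is assumed
times = [3] * 2 + [4] * 3 + [5] * 1 + [6] * 7 + [8] * 3 + [12] * 84
events = [1] * 2 + [0] * 3 + [1] * 1 + [0] * 7 + [1] * 3 + [0] * 84

for t, n, d, s in kaplan_meier(times, events):
    print(f"{t:<6}{d:<8}{n:<9}{s:.3f}")
```

The at-risk count drops between event times only because of censoring (for example, 98 remain after time 3, and 3 censored at time 4 leave 95 at risk at time 5).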

### Log-Rank Test

**Purpose**: Compare survival curves between two or more groups

**Null Hypothesis**: No difference in survival distributions between groups

**Test Statistic**
- Compares observed vs expected events in each group at each time point
- Weights every event time equally
- Follows chi-square distribution with df = k-1 (k groups)

**Reporting**
- Chi-square statistic, degrees of freedom, p-value
- Example: χ² = 6.82, df = 1, p = 0.009
- Interpretation: Survival curves differ significantly at α = 0.05
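To show where the observed-minus-expected counts and the chi-square come from, here is a from-scratch sketch of the two-group log-rank test. It assumes exactly two groups (df = 1), which lets the p-value be computed with `math.erfc`; for k groups or weighted variants, `survdiff` in the R survival package or lifelines' `logrank_test`/`multivariate_logrank_test` would be the usual tools. The data are hypothetical.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df: erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2))


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi_square, p_value) with df = 1."""
    data = [(t, e, "A") for t, e in zip(times_a, events_a)]
    data += [(t, e, "B") for t, e in zip(times_b, events_b)]
    observed_minus_expected = 0.0  # accumulated for group A
    variance = 0.0
    for t in sorted({t for t, e, _ in data if e}):
        n_a = sum(1 for x, _, g in data if x >= t and g == "A")
        n_b = sum(1 for x, _, g in data if x >= t and g == "B")
        d_a = sum(1 for x, e, g in data if x == t and e and g == "A")
        d = sum(1 for x, e, _ in data if x == t and e)
        n = n_a + n_b
        # Expected events in A if both groups share the same hazard at time t
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of d_a given the margins at time t
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = observed_minus_expected ** 2 / variance
    return chi_square, chi2_sf_df1(chi_square)


# Hypothetical data: group A has events at months 1-5, group B at months 6-10
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
print(f"chi-square = {chi2:.2f}, df = 1, p = {p:.4f}")
```

The same `chi2_sf_df1` helper converts the reporting example χ² = 6.82 into p ≈ 0.009.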

**Assumptions**
- Censoring is non-informative and independent
- Most powerful under proportional hazards (constant HR over time); loses power when curves cross
- If non-proportional, consider time-varying effects

**Alternatives for Non-Proportional Hazards**
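- **Gehan-Breslow test**: Weights each event time by the number at risk, so early events count more
- **Peto-Peto test**: Weights by an estimate of the survival function; less sensitive than Gehan-Breslow to differences in censoring patterns
- **Restricted mean survival time (RMST)**: Difference in area under the K-M curves up to a pre-specified time horizon τ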
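RMST up to τ is simply the area under the K-M step function, summed rectangle by rectangle. A minimal sketch, assuming the curve is supplied as (event time, survival just after that time) pairs such as the rows of a Kaplan-Meier table; the function name is illustrative.

```python
def rmst(km_steps, tau):
    """Restricted mean survival time: area under a K-M step curve from 0 to tau.

    km_steps -- (event_time, survival just after that time) pairs in time order;
                S(t) = 1 before the first event time.
    """
    area, prev_time, prev_surv = 0.0, 0.0, 1.0
    for t, s in km_steps:
        if t >= tau:
            break
        area += (t - prev_time) * prev_surv  # rectangle under the current step
        prev_time, prev_surv = t, s
    return area + (tau - prev_time) * prev_surv


# Steps from the Kaplan-Meier calculation example, truncated at tau = 10
print(rmst([(3, 0.980), (5, 0.970), (8, 0.936)], tau=10))
```

The between-arm contrast is `rmst(steps_b, tau) - rmst(steps_a, tau)`, interpretable as the average event-free time gained up to τ, which remains meaningful when hazards are not proportional.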

### Cox Proportional Hazards Regression

**Purpose**: Multivariable survival analysis, estimate hazard ratios adjusting for covariates

**Model**: h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₚXₚ)
- h(t|X): Hazard rate for individual with covariates X
- h₀(t): Baseline hazard function (unspecified)
- exp(β): Hazard ratio for a one-unit change in the covariate, holding other covariates constant
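To make the model concrete, the sketch below estimates β for a single covariate by Newton-Raphson on the Cox partial likelihood, in which h₀(t) cancels out; tied event times use the Breslow approximation. Function names and the eight-patient dataset are hypothetical; analyses with several covariates would normally use `coxph` in the R survival package or `CoxPHFitter` in lifelines.

```python
import math


def score_and_information(beta, times, events, x):
    """Score U(beta) and information I(beta) of the Cox partial log-likelihood
    for one covariate (Breslow handling of tied event times)."""
    score = info = 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if not e_i:
            continue  # censored patients contribute only through risk sets
        risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        w = [math.exp(beta * x_j) for x_j in risk]
        s0 = sum(w)
        s1 = sum(w_j * x_j for w_j, x_j in zip(w, risk))
        s2 = sum(w_j * x_j * x_j for w_j, x_j in zip(w, risk))
        score += x_i - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info


def fit_cox_one_covariate(times, events, x, max_iter=50, tol=1e-10):
    """Newton-Raphson estimate of beta; returns (beta, SE of beta)."""
    beta = 0.0
    for _ in range(max_iter):
        score, info = score_and_information(beta, times, events, x)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    _, info = score_and_information(beta, times, events, x)
    return beta, 1 / math.sqrt(info)


# Hypothetical data: x = 1 for treatment B, 0 for treatment A; every patient had the event
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
x = [1, 0, 1, 1, 0, 0, 1, 0]
beta, se = fit_cox_one_covariate(times, events, x)
print(f"HR = exp(beta) = {math.exp(beta):.2f}, SE(log HR) = {se:.2f}")
```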

**Hazard Ratio Interpretation**
- HR = 1.0: No effect
- HR > 1.0: Increased risk (harmful)
- HR < 1.0: Decreased risk (beneficial)
- HR = 0.50: Hazard (instantaneous event rate) halved at every time point; not the same as halving the cumulative risk of the event

**Example Output**
```
Variable              HR      95% CI         p-value
Treatment (B vs A)    0.62    0.43-0.89      0.010
Age (per 10 years)    1.15    1.02-1.30      0.021
ECOG PS (2 vs 0-1)    1.85    1.21-2.83      0.004
Biomarker+ (vs -)     0.71    0.48-1.05      0.089
```

**Proportional Hazards Assumption**
- Hazard ratio constant over time
- Test: Schoenfeld residuals, log-minus-log plots
- Violation: Time-varying effects, consider stratification or time-dependent covariates

**Multivariable vs Univariable**
- **Univariable**: One covariate at a time, unadjusted HRs
- **Multivariable**: Multiple covariates simultaneously, adjusted HRs
- Report both: Univariable for all variables, multivariable for final model

**Model Selection**
- **Forward selection**: Start with empty model, add significant variables
- **Backward elimination**: Start with all variables, remove non-significant
- **Clinical judgment**: Include known prognostic factors regardless of p-value
- **Parsimony**: Avoid overfitting, rule of thumb 1 variable per 10-15 events

## Response Assessment

### RECIST v1.1 (Response Evaluation Criteria in Solid Tumors)

**Target Lesions**
- Select up to 5 lesions total (maximum 2 per organ)
- Measurable: ≥10 mm longest diameter (≥15 mm for lymph nodes short axis)
- Sum of diameters (SLD) at baseline: longest diameter for non-nodal lesions, short axis for lymph nodes

**Response Categories**

**Complete Response (CR)**
- Disappearance of all target and non-target lesions
- Lymph nodes must regress to <10 mm short axis
- Confirmation at ≥4 weeks required in non-randomized trials where response is the primary endpoint

**Partial Response (PR)**
- ≥30% decrease in SLD from baseline
- No new lesions and no unequivocal progression of non-target lesions
- Confirmation at ≥4 weeks required in non-randomized trials where response is the primary endpoint

**Stable Disease (SD)**
- Neither PR nor PD criteria met
- Minimum duration typically 6-8 weeks from baseline

**Progressive Disease (PD)**
- ≥20% increase in SLD relative to the smallest sum on study (nadir, including baseline if smallest) AND ≥5 mm absolute increase
- OR appearance of new lesions
- OR unequivocal progression of non-target lesions

**Example Calculation**
```
Baseline SLD: 80 mm (4 target lesions)
Week 6 SLD: 52 mm

Percent change: (52 - 80)/80 × 100% = -35%
Classification: Partial Response (≥30% decrease)

Week 12 SLD: 48 mm (nadir)
Week 18 SLD: 62 mm

Percent change from nadir: (62 - 48)/48 × 100% = +29%
Absolute change: 62 - 48 = 14 mm
Classification: Progressive Disease (≥20% AND ≥5 mm increase from nadir)
```
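The target-lesion arithmetic in this example can be written as a small classifier. This is a simplified sketch, not a full RECIST implementation: it assumes the nadir sum is greater than zero and deliberately ignores non-target lesions, the nodal short-axis rule for CR, and confirmation requirements. The function name is hypothetical.

```python
def recist_target_response(baseline_sld, nadir_sld, current_sld, new_lesions=False):
    """Simplified RECIST v1.1 target-lesion classification (sums in mm).

    nadir_sld is the smallest sum on study (baseline included) and is assumed > 0.
    Non-target lesions, nodal CR rules and confirmation are not handled.
    """
    if new_lesions:
        return "PD"
    increase = current_sld - nadir_sld
    # PD is measured from the nadir and needs both a relative and an absolute increase
    if increase / nadir_sld >= 0.20 and increase >= 5:
        return "PD"
    if current_sld == 0:
        return "CR"
    # PR is measured from baseline
    if (current_sld - baseline_sld) / baseline_sld <= -0.30:
        return "PR"
    return "SD"


print(recist_target_response(80, 80, 52))  # week 6: PR
print(recist_target_response(80, 48, 62))  # week 18: PD
```

Note the two different reference points: PR is judged against baseline, PD against the nadir.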

### iRECIST (Immune RECIST)

**Purpose**: Account for atypical response patterns with immunotherapy

**Modifications from RECIST v1.1**

**iUPD (Immune Unconfirmed Progressive Disease)**
- Initial increase in tumor burden or new lesions
- Requires confirmation at the next assessment, 4-8 weeks later
- Continue treatment if clinically stable

**iCPD (Immune Confirmed Progressive Disease)**
- Confirmed progression at repeat imaging
- Discontinue immunotherapy

**Pseudoprogression**
- Initial apparent progression followed by response
- Mechanism: Immune cell infiltration transiently increases apparent tumor size
- Incidence: 5-10% of patients on immunotherapy
- Management: Continue treatment if patient clinically stable

**New Lesions**
- Measure and record new lesions; continue treatment if clinically stable
- Classify as iUPD, not automatically as confirmed progression (iCPD)
- Confirm progression if new lesions grow or additional new lesions appear

### Other Response Criteria

**Lugano Classification (Lymphoma)**
- **PET-based**: Deauville 5-point scale
  - Score 1-3: Negative (metabolic CR)
  - Score 4-5: Positive (residual disease)
- **CT-based**: If PET not available
- **Bone marrow**: Required for staging in some lymphomas

**RANO (Response Assessment in Neuro-Oncology)**
- **Glioblastoma-specific**: Accounts for pseudoprogression with radiation/temozolomide
- **Enhancing disease**: Bidimensional measurements (product of perpendicular diameters)
- **Non-enhancing disease**: FLAIR changes assessed separately
- **Corticosteroid dose**: Must document, increase may indicate progression

**mRECIST (Modified RECIST for HCC)**
- **Viable tumor**: Enhancing portion only (arterial phase enhancement)
- **Necrosis**: Non-enhancing areas excluded from measurements
- **Application**: Hepatocellular carcinoma with arterial enhancement

## Outcome Metrics

### Efficacy Endpoints

**Overall Survival (OS)**
- **Definition**: Time from randomization/treatment start to death from any cause
- **Advantages**: Objective, not subject to assessment bias, regulatory gold standard
- **Disadvantages**: Requires long follow-up, affected by subsequent therapies
- **Censoring**: Last known alive date
- **Analysis**: Kaplan-Meier, log-rank test, Cox regression

**Progression-Free Survival (PFS)**
- **Definition**: Time from randomization to progression (RECIST) or death from any cause, whichever occurs first
- **Advantages**: Earlier readout than OS; not confounded by subsequent therapies
- **Disadvantages**: Requires regular imaging; sensitive to the assessment schedule and to assessor bias
- **Censoring**: Last tumor assessment without progression
- **Sensitivity Analysis**: Assess impact of censoring assumptions

**Objective Response Rate (ORR)**
- **Definition**: Proportion of patients achieving CR or PR (best response)
- **Denominator**: Evaluable patients (baseline measurable disease)
- **Reporting**: Percentage with 95% CI (exact binomial method)
- **Duration**: Time from first response to progression (DOR)
- **Advantage**: Binary endpoint, no censoring complications

**Disease Control Rate (DCR)**
- **Definition**: Proportion of patients with best response of CR, PR, or SD (SD maintained ≥6-8 weeks)
- **Less Stringent**: Captures clinical benefit beyond objective response
- **Reporting**: Percentage with 95% CI

**Duration of Response (DOR)**
- **Definition**: Time from first documented CR or PR to progression or death (among responders only)
- **Population**: Subset analysis of responders
- **Analysis**: Kaplan-Meier among responders
- **Reporting**: Median DOR with 95% CI

**Time to Treatment Failure (TTF)**
- **Definition**: Time from treatment start to discontinuation for any reason (progression, toxicity, death, patient choice)
- **Advantage**: Reflects real-world treatment duration
- **Components**: PFS events plus discontinuations for toxicity or other reasons

### Safety Endpoints

**Adverse Events (CTCAE v5.0)**

**Grading**
- **Grade 1**: Mild, asymptomatic or mild symptoms, clinical intervention not indicated
- **Grade 2**: Moderate; minimal, local, or noninvasive intervention indicated; limiting age-appropriate instrumental ADL
- **Grade 3**: Severe or medically significant but not immediately life-threatening; hospitalization or prolongation of hospitalization indicated; disabling; limiting self-care ADL
- **Grade 4**: Life-threatening consequences, urgent intervention indicated
- **Grade 5**: Death related to adverse event

**Reporting Standards**
```
Adverse Event Summary Table (N = 50 per arm):

                        Any Grade, n (%)      Grade 3-4, n (%)     Grade 5, n (%)
AE Term (MedDRA)        Trt A      Trt B      Trt A      Trt B     Trt A    Trt B
─────────────────────────────────────────────────────────────────────────────────
Hematologic
  Anemia                45 (90%)   42 (84%)   8 (16%)    6 (12%)   0        0
  Neutropenia           35 (70%)   38 (76%)   15 (30%)   18 (36%)  0        0
  Thrombocytopenia      28 (56%)   25 (50%)   6 (12%)    4 (8%)    0        0
  Febrile neutropenia   4 (8%)     6 (12%)    4 (8%)     6 (12%)   0        0

Gastrointestinal
  Nausea                42 (84%)   40 (80%)   2 (4%)     1 (2%)    0        0
  Diarrhea              31 (62%)   28 (56%)   5 (10%)    3 (6%)    0        0
  Mucositis             18 (36%)   15 (30%)   3 (6%)     2 (4%)    0        0

Any AE                  50 (100%)  50 (100%)  38 (76%)   35 (70%)  1 (2%)   0
```

**Serious Adverse Events (SAEs)**
- SAE incidence and type
- Relationship to treatment (related vs unrelated)
- Outcome (resolved, ongoing, fatal)
- Causality assessment (definite, probable, possible, unlikely, unrelated)

**Treatment Modifications**
- Dose reductions: n (%), reason
- Dose delays: n (%), duration
- Discontinuations: n (%), reason (toxicity vs progression vs other)
- Relative dose intensity: (delivered dose intensity / planned dose intensity) × 100%, where dose intensity = total dose / time over which it is given
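A small worked example of relative dose intensity, treating dose intensity as total dose divided by treatment duration. The regimen and the delivered amounts are hypothetical, chosen only to show the arithmetic.

```python
def relative_dose_intensity(delivered_dose, delivered_weeks, planned_dose, planned_weeks):
    """RDI (%) = delivered dose intensity / planned dose intensity x 100,
    where dose intensity = total dose / treatment duration."""
    delivered_intensity = delivered_dose / delivered_weeks
    planned_intensity = planned_dose / planned_weeks
    return 100 * delivered_intensity / planned_intensity


# Hypothetical: 6 cycles of 100 mg/m2 every 3 weeks planned (600 mg/m2 over 18 weeks);
# 500 mg/m2 actually delivered over 20 weeks because of a dose reduction and delays
print(f"RDI = {relative_dose_intensity(500, 20, 600, 18):.0f}%")  # RDI = 75%
```

Dividing by time is what makes delays count: the same total dose stretched over more weeks lowers the RDI.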

## Statistical Analysis Methods

### Comparing Continuous Outcomes

**Independent Samples t-test**
- **Application**: Compare means between two independent groups (normally distributed)
- **Assumptions**: Normal distribution, equal variances (or use Welch's t-test)
- **Reporting**: Mean ± SD for each group, mean difference (95% CI), t-statistic, df, p-value
- **Example**: Mean age 62.3 ± 8.4 vs 58.7 ± 9.1 years, difference 3.6 years (95% CI 0.2-7.0, p=0.038)

**Mann-Whitney U Test (Wilcoxon Rank-Sum)**
- **Application**: Compare distributions (often summarized as medians) between two independent groups when data are non-normal
- **Non-parametric**: Rank-based; no normality assumption
- **Reporting**: Median [IQR] for each group, Hodges-Lehmann estimate of the location shift, U-statistic, p-value
- **Example**: Median time to response 6.2 [4.1-8.3] vs 8.5 [5.9-11.2] weeks, p=0.042

**ANOVA (Analysis of Variance)**
- **Application**: Compare means across three or more groups
- **Output**: F-statistic, p-value (overall test)
- **Post-hoc**: If significant, pairwise comparisons with Tukey or Bonferroni correction
- **Example**: Mean outcome differed across three biomarker subgroups (F=4.32, between-groups df=2, p=0.016); report the within-groups df as well

### Comparing Categorical Outcomes

**Chi-Square Test for Independence**
- **Application**: Compare proportions between two or more groups
- **Assumptions**: Expected count ≥5 in at least 80% of cells
- **Reporting**: n (%) for each cell, χ², df, p-value
- **Example**: ORR 45% vs 30%, χ²=6.21, df=1, p=0.013
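A minimal sketch of the Pearson chi-square test for a 2×2 table (no continuity correction), which also returns the smallest expected count so the assumption above can be checked. The counts are illustrative, not taken from any study.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts.

    Returns (chi_square, p_value, smallest_expected_count); df = 1.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    chi_square, min_expected = 0.0, float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            min_expected = min(min_expected, expected)
            chi_square += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi_square / 2))  # chi-square upper tail, df = 1
    return chi_square, p_value, min_expected


# Illustrative counts: 63/140 responders (45%) vs 42/140 (30%)
chi2, p, min_e = chi_square_2x2([[63, 77], [42, 98]])
print(f"chi-square = {chi2:.2f}, df = 1, p = {p:.3f}, smallest expected = {min_e:.1f}")
```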

**Fisher's Exact Test**
- **Application**: 2×2 tables when any expected count is <5 (valid at any sample size)
- **Exact p-value**: No large-sample approximation
- **Two-sided vs one-sided**: Typically report two-sided
- **Example**: SAE rate 3/20 (15%) vs 8/22 (36%), Fisher's exact p=0.17 (two-sided)
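Fisher's exact test can be sketched directly from the hypergeometric distribution with only the standard library (`math.comb`, Python 3.8+). The two-sided rule shown, summing all tables with the observed margins that are no more probable than the observed one, is the common convention (it is what R's `fisher.test` does); other two-sided definitions exist. For the 3/20 vs 8/22 SAE counts it gives a two-sided p of about 0.17.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Conditions on the margins and sums the hypergeometric probabilities of
    every table that is no more likely than the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # P(top-left cell = x) given the fixed margins
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - (n - row1)), min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-7))


# SAE example: 3 of 20 patients vs 8 of 22 patients
p = fisher_exact_two_sided(3, 20 - 3, 8, 22 - 8)
print(f"Fisher's exact p = {p:.3f}")
```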

**McNemar's Test**
- **Application**: Paired categorical data (before/after, matched pairs)
- **Example**: Response before vs after treatment switch in same patients
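Only the discordant pairs carry information in McNemar's test. A minimal sketch of both the exact (binomial) and large-sample chi-square versions; the counts in the usage line are hypothetical.

```python
import math


def mcnemar_p(b, c, exact=True):
    """Two-sided McNemar test for paired binary outcomes.

    b -- discordant pairs positive before and negative after
    c -- discordant pairs negative before and positive after
    Concordant pairs carry no information about change and are ignored.
    """
    n = b + c
    if exact:
        # Under H0 each discordant pair is equally likely to be of either type,
        # so b ~ Binomial(b + c, 0.5)
        tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
        return min(1.0, 2 * tail)
    chi_square = (b - c) ** 2 / n  # large-sample version, df = 1
    return math.erfc(math.sqrt(chi_square / 2))


# Hypothetical: 2 patients lost a response after the switch, 10 gained one
print(f"exact p = {mcnemar_p(2, 10):.3f}, chi-square p = {mcnemar_p(2, 10, exact=False):.3f}")
```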

### Sample Size and Power

**Power Analysis Components**
- **Alpha (α)**: Type I error rate, typically 0.05 (two-sided)
- **Beta (β)**: Type II error rate, typically 0.10 or 0.20
- **Power**: 1 - β, typically 0.80 or 0.90 (80-90% power)
- **Effect size**: Expected difference (HR, mean difference, proportion difference)
- **Sample size**: Number of patients or events needed

**Survival Study Sample Size**
- Events-driven: Need sufficient events (deaths, progressions)
- Rule of thumb (Schoenfeld formula, 1:1 allocation): approximately 247 events for 80% power to detect HR=0.70 (α=0.05, two-sided)
- Accrual time + follow-up time determines calendar time
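The event count for a two-arm survival comparison is commonly approximated with Schoenfeld's formula, D = (z₁₋α/₂ + z₁₋β)² / (π(1 − π)(ln HR)²), where π is the allocation fraction. A minimal sketch using `statistics.NormalDist` (Python 3.8+); the function name is hypothetical.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hr, alpha=0.05, power=0.80, allocation=0.5):
    """Events needed to detect hazard ratio `hr` with a two-sided log-rank test.

    allocation is the fraction of patients randomized to one arm (0.5 for 1:1).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    events = (z_alpha + z_beta) ** 2 / (allocation * (1 - allocation) * math.log(hr) ** 2)
    return math.ceil(events)


print(schoenfeld_events(0.70))              # 80% power: 247
print(schoenfeld_events(0.70, power=0.90))  # 90% power: 331
```

Converting events into patients then requires an assumed probability of having an event by the time of analysis, which is where accrual and follow-up time enter.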

**Response Rate Study**
```
Example: Detect ORR difference 45% vs 30% (15 percentage points)
- α = 0.05 (two-sided)
- Power = 0.80
- Sample size: n = 163 per group (326 total), normal approximation without continuity correction
- With 10% dropout: n = 182 per group (364 total)
```
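The per-group sample size for comparing two proportions can be sketched with the usual normal-approximation formula (no continuity correction); formulas that include a continuity correction give somewhat larger numbers. Function and variable names are illustrative.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Patients per arm to detect p1 vs p2 with a two-sided two-proportion test
    (normal approximation, no continuity correction)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    root = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(root ** 2 / (p1 - p2) ** 2)


n = n_per_group(0.45, 0.30)
n_with_dropout = math.ceil(n / (1 - 0.10))  # inflate for 10% dropout
print(n, n_with_dropout)  # 163 182
```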

## Data Visualization

### Survival Curves

**Kaplan-Meier Plot Best Practices**

```
# Key elements for publication-quality survival curve
1. X-axis: Time (months or years), starts at 0
2. Y-axis: Survival probability (0 to 1.0 or 0% to 100%)
3. Step function: Survival curve with steps at event times
4. 95% CI bands: Shaded region around survival curve (optional but recommended)
5. Number at risk table: Below x-axis showing n at risk at time intervals
6. Censoring marks: Vertical tick marks (|) at censored observations
7. Legend: Clearly identify each curve
8. Log-rank p-value: Prominently displayed
9. Median survival: Horizontal line at 0.50, labeled
10. Follow-up: Median follow-up time reported
```

**Number at Risk Table Format**
```
Number at risk
Group A   50    42    35    28    18    10     5
Group B   48    38    29    19    12     6     2
Time      0     6     12    18    24    30    36 (months)
```
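The number-at-risk row is a simple count of patients whose follow-up (event or censoring time) has not yet ended. A minimal sketch using the common convention that a patient is at risk at time g if their follow-up time is ≥ g; the follow-up times are hypothetical.

```python
def number_at_risk(times, grid):
    """Patients still under observation (no event, not yet censored) at each grid time.

    A patient with follow-up time t is counted at grid time g if t >= g.
    """
    return [sum(1 for t in times if t >= g) for g in grid]


# Hypothetical follow-up times (months) for six patients, events and censoring combined
print(number_at_risk([2, 5, 7, 13, 20, 31], grid=[0, 6, 12, 18, 24, 30, 36]))
# [6, 4, 3, 2, 1, 1, 0]
```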

**Hazard Ratio Annotation**
```
On plot: HR 0.62 (95% CI 0.43-0.89), p=0.010
Or in caption: Log-rank test p=0.010; Cox model HR=0.62 (95% CI 0.43-0.89)
```

### Waterfall Plots

**Purpose**: Visualize individual patient responses to treatment

**Construction**
- **X-axis**: Individual patients (anonymized patient IDs)
- **Y-axis**: Best % change from baseline tumor burden
- **Bars**: Vertical bars, one per patient
  - Positive values: Tumor growth
  - Negative values: Tumor shrinkage
- **Ordering**: Sorted from best response (left) to worst (right)
- **Color coding**:
  - Green/blue: CR or PR (≥30% decrease)
  - Yellow: SD (-30% to +20%)
  - Red: PD (≥20% increase)
- **Reference lines**: Horizontal lines at +20% (PD), -30% (PR)
- **Annotations**: Biomarker status, response duration (symbols)

**Example Annotations**
```
■ = Biomarker-positive
○ = Biomarker-negative
* = Ongoing response
† = Progressed
```

### Forest Plots

**Purpose**: Display subgroup analyses with hazard ratios and confidence intervals

**Construction**
- **Y-axis**: Subgroup categories
- **X-axis**: Hazard ratio (log scale), vertical line at HR=1.0
- **Points**: HR estimate for each subgroup
- **Horizontal lines**: 95% confidence interval
- **Square size**: Proportional to sample size or precision
- **Overall effect**: Diamond at bottom, width represents 95% CI

**Subgroups to Display**
```
Subgroup                    n     HR (95% CI)                      ← Favors A │ Favors B →
───────────────────────────────────────────────────────────────────────────────────────────
Overall                     300   0.65 (0.48-0.88)               ├─────●────┤ │
Age
  <65 years                 180   0.58 (0.39-0.86)            ├──────●─────┤  │
  ≥65 years                 120   0.78 (0.49-1.24)                ├───────●───┼───┤
Sex
  Male                      175   0.62 (0.43-0.90)             ├──────●─────┤ │
  Female                    125   0.70 (0.44-1.12)              ├───────●─────┼─┤
Biomarker Status
  Positive                  140   0.45 (0.28-0.72)      ├───────●───────┤     │
  Negative                  160   0.89 (0.59-1.34)                   ├──────●─┼────┤
                                  p-interaction=0.041

                                                      └───────────┴───────────┴───────────┘
                                                      0.25       0.5         1.0         2.0
                                                             Hazard Ratio (log scale)
```

**Interaction Testing**
- Test whether treatment effect differs across subgroups
- p-interaction <0.05 suggests heterogeneity
- Pre-specify subgroups to avoid data mining
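When only subgroup HRs and CIs are available (for example, when reading a published forest plot), a quick heterogeneity check for two independent subgroups is a z-test on the difference of their log hazard ratios. This is an approximation that assumes the CIs were computed on the log scale; it is not the model-based treatment-by-subgroup interaction term, so it need not reproduce a reported p-interaction exactly. Function names are hypothetical.

```python
import math

Z = 1.959964  # 97.5th percentile of the standard normal


def log_hr_and_se(hr, lower, upper):
    """Recover log(HR) and its SE from a reported HR and 95% CI (log-scale symmetric)."""
    return math.log(hr), (math.log(upper) - math.log(lower)) / (2 * Z)


def interaction_p(subgroup_1, subgroup_2):
    """Two-sided z-test for the difference between two subgroup log hazard ratios."""
    (b1, se1), (b2, se2) = log_hr_and_se(*subgroup_1), log_hr_and_se(*subgroup_2)
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return math.erfc(abs(z) / math.sqrt(2))


# Biomarker-positive vs biomarker-negative rows of the forest plot
p = interaction_p((0.45, 0.28, 0.72), (0.89, 0.59, 1.34))
print(f"approximate p-interaction = {p:.3f}")
```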

### Spider Plots

**Purpose**: Display longitudinal tumor burden changes over time for individual patients

**Construction**
- **X-axis**: Time from treatment start (weeks or months)
- **Y-axis**: % change from baseline tumor burden
- **Lines**: One line per patient connecting assessments
- **Color coding**: By response category or biomarker status
- **Reference lines**: 0% (no change), +20% (PD threshold), -30% (PR threshold)

**Clinical Insights**
- Identify delayed responders (initial SD then PR)
- Detect early progression (rapid upward trajectory)
- Assess depth of response (maximum tumor shrinkage)
- Duration visualization (when lines cross PD threshold)

### Swimmer Plots

**Purpose**: Display treatment duration and response for individual patients

**Construction**
- **X-axis**: Time from treatment start (weeks or months)
- **Y-axis**: Individual patients (one row per patient)
- **Bars**: Horizontal bars representing treatment duration
- **Symbols**:
  - ● Start of treatment
  - ▼ Ongoing treatment at data cutoff
  - ■ Progressive disease (end of bar)
  - ◆ Death
  - | Dose modification
- **Color**: Response status (CR=green, PR=blue, SD=yellow, PD=red)

**Example**
```
Patient ID    |0   3   6   9   12  15  18  21  24 months
──────────────|──────────────────────────────────────────
Pt-001        ●═══PR═══════════|════════PR══════════▼
Pt-002        ●═══PR═══════════════PD■
Pt-003        ●══════SD══════════PD■
Pt-004        ●PR══════════════════════════════════PR▼
...
```

## Confidence Intervals

### Interpretation

**95% Confidence Interval**
- Range of plausible values for true population parameter
- If the study were repeated many times, about 95% of the resulting 95% CIs would contain the true value
- **Not**: 95% probability true value within this interval (frequentist, not Bayesian)

**Relationship to p-value**
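- If the 95% CI excludes the null value (HR=1.0, difference=0), p<0.05; if it includes the null value, p≥0.05
- This correspondence holds when the CI and the test come from the same method (e.g., both Wald-based)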
- CI provides more information: magnitude and precision of effect

**Precision**
- **Narrow CI**: High precision, large sample size
- **Wide CI**: Low precision, small sample size or high variability
- **Example**: HR 0.65 (95% CI 0.62-0.68) very precise; HR 0.65 (0.30-1.40) imprecise

### Calculation Methods

**Hazard Ratio CI**
- From Cox regression output
- Standard error of log(HR) → exp(log(HR) ± 1.96×SE)
- Example: HR=0.62, SE(logHR)=0.185 → 95% CI (0.43, 0.89)
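The worked example can be checked in a few lines: the interval is built on the log scale and exponentiated, and the Wald p-value comes from the same z-statistic. A minimal sketch; the function name is illustrative.

```python
import math


def hazard_ratio_ci(hr, se_log_hr, z=1.96):
    """Wald 95% CI and two-sided p-value for a hazard ratio, computed on the log scale."""
    log_hr = math.log(hr)
    lower = math.exp(log_hr - z * se_log_hr)
    upper = math.exp(log_hr + z * se_log_hr)
    p_value = math.erfc(abs(log_hr / se_log_hr) / math.sqrt(2))  # two-sided normal p
    return lower, upper, p_value


lower, upper, p = hazard_ratio_ci(0.62, 0.185)
print(f"HR 0.62 (95% CI {lower:.2f}-{upper:.2f}), p={p:.3f}")
# HR 0.62 (95% CI 0.43-0.89), p=0.010
```

Because the interval is symmetric on the log scale, it is asymmetric around the HR itself (0.62 is closer to 0.43 than to 0.89 in absolute terms only after exponentiation).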

**Survival Rate CI (Greenwood Formula)**
- SE(S(t)) = S(t) × sqrt(Σ[d_i / (n_i × (n_i - d_i))])
- 95% CI: S(t) ± 1.96 × SE(S(t))
- A complementary log-log transformation, log(−log S(t)), keeps the interval within 0-1 and usually gives better coverage in the tails
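A minimal sketch of Greenwood's formula with both the plain interval and the log(−log) back-transformed interval, applied to the rows of the Kaplan-Meier calculation example. It assumes 0 < S(t) < 1 at every row (fewer events than patients at risk); the function name is hypothetical.

```python
import math


def greenwood_intervals(km_table, z=1.96):
    """Greenwood SE plus plain and log(-log) 95% CIs for each Kaplan-Meier step.

    km_table -- rows of (event_time, n_at_risk, n_events, survival), in time order.
    Assumes 0 < S(t) < 1 at every row.
    """
    results, greenwood_sum = [], 0.0
    for t, n, d, s in km_table:
        greenwood_sum += d / (n * (n - d))
        se = s * math.sqrt(greenwood_sum)
        plain = (max(0.0, s - z * se), min(1.0, s + z * se))  # truncated to [0, 1]
        # CI built on log(-log S) and back-transformed; always stays inside (0, 1)
        se_loglog = math.sqrt(greenwood_sum) / abs(math.log(s))
        log_log = (s ** math.exp(z * se_loglog), s ** math.exp(-z * se_loglog))
        results.append((t, s, se, plain, log_log))
    return results


# Rows from the Kaplan-Meier calculation example: (time, at risk, events, S(t))
s3 = 98 / 100
s5 = s3 * 94 / 95
s8 = s5 * 84 / 87
table = [(3, 100, 2, s3), (5, 95, 1, s5), (8, 87, 3, s8)]
for t, s, se, plain, log_log in greenwood_intervals(table):
    print(f"t={t}: S={s:.3f}, SE={se:.3f}, plain ({plain[0]:.3f}, {plain[1]:.3f}), "
          f"log-log ({log_log[0]:.3f}, {log_log[1]:.3f})")
```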

**Proportion CI (Exact Binomial)**
- For ORR, DCR: Use exact method (Clopper-Pearson) for small samples
- Wilson score interval: Better properties than normal approximation
- Example: 12/30 responses → ORR 40% (95% CI 22.7-59.4%)
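Both intervals can be computed with the standard library. The Clopper-Pearson bounds below are found by bisection on the binomial CDF rather than through the beta-distribution quantiles that statistical packages typically use; the two approaches give the same interval. Function names are illustrative.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) CI for x successes out of n, solved by bisection."""
    def solve(cdf_in_p, target):
        # cdf_in_p decreases as p increases; find p where it equals target
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if cdf_in_p(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


def wilson(x, n, z=1.96):
    """Wilson score interval for x successes out of n."""
    p_hat = x / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


lo, hi = clopper_pearson(12, 30)
print(f"ORR 40% (exact 95% CI {lo:.1%}-{hi:.1%})")
w_lo, w_hi = wilson(12, 30)
print(f"Wilson 95% CI {w_lo:.1%}-{w_hi:.1%}")
```

The exact interval is deliberately conservative, so it is wider than the Wilson interval for the same data.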

## Censoring and Missing Data

### Types of Censoring

**Right Censoring**
- **End of study**: Patient alive at study termination (administrative censoring)
- **Loss to follow-up**: Patient stops attending visits
- **Withdrawal**: Patient withdraws consent
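- **Competing risk**: Death from an unrelated cause (in disease-specific survival); censoring these deaths assumes they are independent of the event of interest, so cumulative incidence methods are often preferred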

**Handling Censoring**
- **Assumption**: Non-informative - censoring independent of event probability
- **Sensitivity Analysis**: Assess impact if assumption violated
  - Best case: All censored patients never progress
  - Worst case: All censored patients progress immediately after censoring
  - The primary estimate lies between the best- and worst-case results; a narrow range indicates robustness to the censoring assumption
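The best- and worst-case bounds can be sketched by re-running a Kaplan-Meier estimate on modified data. For simplicity this sketch moves every censored patient, as described above; a variant could restrict the imputation to patients lost to follow-up. The data and function names are hypothetical.

```python
from collections import Counter


def km_survival_at(times, events, t_star):
    """Kaplan-Meier estimate of S(t_star)."""
    deaths = Counter(t for t, e in zip(times, events) if e)
    s = 1.0
    for t in sorted(deaths):
        if t > t_star:
            break
        n_at_risk = sum(1 for x in times if x >= t)
        s *= (n_at_risk - deaths[t]) / n_at_risk
    return s


def censoring_bounds(times, events, t_star):
    """Return (worst case, observed, best case) estimates of S(t_star)."""
    observed = km_survival_at(times, events, t_star)
    # Worst case: every censored patient has the event at the censoring time
    worst = km_survival_at(times, [1] * len(times), t_star)
    # Best case: every censored patient stays event-free past the last follow-up
    horizon = max(times) + 1
    best_times = [t if e else horizon for t, e in zip(times, events)]
    best = km_survival_at(best_times, events, t_star)
    return worst, observed, best


# Hypothetical PFS data (months); events: 1 = progression or death, 0 = censored
times = [2, 3, 4, 5, 6]
events = [1, 0, 1, 0, 1]
print(censoring_bounds(times, events, t_star=5))
```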

### Missing Data

**Mechanisms**
- **MCAR (Missing Completely at Random)**: Missingness unrelated to any variable
- **MAR (Missing at Random)**: Missingness related to observed but not unobserved variables
- **NMAR (Not Missing at Random)**: Missingness related to the missing value itself

**Handling Strategies**
- **Complete case analysis**: Exclude patients with missing data (biased if not MCAR)
- **Multiple imputation**: Generate multiple plausible datasets, analyze each, pool results
- **Maximum likelihood**: Estimate parameters using all available data
- **Sensitivity analysis**: Assess robustness to missing data assumptions

**Response Assessment Missing Data**
- **Unevaluable for response**: Baseline measurable disease but post-baseline assessment missing
  - Exclude from ORR denominator or count as non-responder (sensitivity analysis)
- **PFS censoring**: Last adequate tumor assessment date if later assessments missing

## Reporting Standards

### CONSORT Statement (RCTs)

**Flow Diagram**
- Assessed for eligibility (n=)
- Randomized (n=)
- Allocated to intervention (n=)
- Lost to follow-up (n=, reasons)
- Discontinued intervention (n=, reasons)
- Analyzed (n=)

**Baseline Table**
- Demographics and clinical characteristics
- Baseline prognostic factors
- Show balance between arms

**Outcomes Table**
- Primary endpoint results with CI and p-value
- Secondary endpoints
- Safety summary

### STROBE Statement (Observational Studies)

**Study Design**: Cohort, case-control, or cross-sectional

**Participants**: Eligibility, sources, selection methods, sample size

**Variables**: Clearly define outcomes, exposures, predictors, confounders

**Statistical Methods**: Describe all methods, handling of missing data, sensitivity analyses

**Results**: Participant flow, descriptive data, outcome data, main results, other analyses

### Reproducible Research Practices

**Statistical Analysis Plan (SAP)**
- Pre-specify all analyses before data lock
- Primary and secondary endpoints
- Analysis populations (ITT, per-protocol, safety)
- Statistical tests and models
- Subgroup analyses (pre-specified)
- Interim analyses (if planned)
- Multiple testing procedures

**Transparency**
- Report all pre-specified analyses
- Distinguish pre-specified from post-hoc exploratory
- Report both positive and negative results
- Provide access to anonymized individual patient data (when possible)

## Software and Tools

### R Packages for Survival Analysis
- **survival**: Core package (Surv, survfit, coxph, survdiff)
- **survminer**: Publication-ready Kaplan-Meier plots (ggsurvplot)
- **rms**: Regression modeling strategies
- **flexsurv**: Flexible parametric survival models

### Python Libraries
- **lifelines**: Kaplan-Meier, Cox regression, survival curves
- **scikit-survival**: Machine learning for survival analysis
- **matplotlib**: Custom survival curve plotting

### Statistical Software
- **R**: Most comprehensive for survival analysis
- **Stata**: Medical statistics, good for epidemiology
- **SAS**: Industry standard for clinical trials
- **GraphPad Prism**: User-friendly for basic analyses
- **SPSS**: Point-and-click interface, limited survival features