Stata Lecture: Data Management, Survival Analysis & Data Visualization

Preparing for Labs 4-5 (2.5 Hours)

Session Overview & Learning Objectives

By the end of this session, you will be able to:

Perform advanced data management tasks including grouping and tagging
Conduct survival analysis using Stata's st commands
Create professional data visualizations with twoway graphics
Apply string matching and pattern recognition techniques
Generate descriptive statistics and handle missing data appropriately

Part I: Advanced Data Management (45 minutes)

Working with Grouped Data and Center-Level Variables

When working with hierarchical data (patients within hospitals, students within schools), we often need to:

Calculate summary statistics at the group level
Create variables that represent group characteristics
Analyze data at different levels of aggregation

Key Commands for Grouped Operations:

stata

// Generate group-level summary statistics
bysort group_var: gen group_count = _N
bysort group_var: egen group_mean = mean(outcome_var)   // mean() is an egen function, so gen would fail here
bysort group_var: egen group_total = total(outcome_var)

// Create observation identifiers within groups
bysort group_var: gen obs_number = _n

Practice Concept: If you have transplant data with multiple patients per center, how would you calculate each center's total volume?
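
One possible answer to the practice question, as a minimal sketch on made-up data. The variable names (center, patient_id, center_volume) and all values are hypothetical and are not taken from the lab dataset.

```stata
// Toy data: 6 hypothetical patients across 3 centers
clear
input center patient_id
1 101
1 102
1 103
2 201
2 202
3 301
end

// _N is the number of observations in the current by-group,
// so every patient row receives its center's total volume
bysort center: gen center_volume = _N

// _n is the position of each row within its center (sorted by patient_id)
bysort center (patient_id): gen obs_number = _n

list, sepby(center)
```

Every row in center 1 carries center_volume = 3, which is why a tag variable is needed before summarizing at the center level.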

The Power of Tagging with `egen`

Tagging is essential when you want to analyze data at the group level but avoid counting each observation multiple times:

stata

// Create a tag variable (1 for first obs in group, 0 for others)
egen group_tag = tag(group_var)

// Use with summarize to get group-level statistics
summarize group_volume if group_tag == 1, detail

Why This Matters: When analyzing center volumes, you want statistics about centers, not about individual patients.
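
To see the difference tagging makes, here is a small sketch on made-up data (the variable names center, center_volume, and center_tag are hypothetical). Without the tag, large centers are counted once per patient and pull the mean upward.

```stata
// Toy data: 3 centers with 3, 2, and 1 patients
clear
input center
1
1
1
2
2
3
end

bysort center: gen center_volume = _N
egen center_tag = tag(center)

// Patient-level summary: N = 6, mean = (3+3+3+2+2+1)/6 = 2.33
summarize center_volume

// Center-level summary: N = 3, mean = (3+2+1)/3 = 2
summarize center_volume if center_tag == 1
```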

String Matching and Pattern Recognition

Medical data often contains messy text fields that need cleaning:

stata

// Basic string matching
gen diagnosis_unknown = 0
replace diagnosis_unknown = 1 if diagnosis == "UNKNOWN"

// Advanced pattern matching with regex
replace diagnosis_unknown = 1 if regexm(diagnosis, "UNKNOWN")
replace diagnosis_unknown = 1 if regexm(diagnosis, "UNCERTAIN")
replace diagnosis_unknown = 1 if regexm(diagnosis, "^NOT SPECI")  // starts with "NOT SPECI"
replace diagnosis_unknown = 1 if regexm(diagnosis, "UNK")         // contains "UNK"

Regular Expression Patterns:

^ = starts with
$ = ends with
| = OR (matches either alternative)
.* = any sequence of characters, including none
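
The patterns above can be combined into a single call. The sketch below uses made-up diagnosis strings; note that regexm() is case-sensitive, so wrapping the variable in upper() is one way to guard against mixed-case entries.

```stata
// Toy data: hypothetical diagnosis strings
clear
input str30 diagnosis
"UNKNOWN"
"CAUSE UNCERTAIN"
"NOT SPECIFIED"
"DIABETES"
"DIABETES, TYPE UNK"
end

// | lets one pattern cover several alternatives; ^ applies only to the last one
gen diagnosis_unknown = regexm(upper(diagnosis), "UNK|UNCERTAIN|^NOT SPECI")

// $ anchors the match to the end of the string
gen ends_with_unk = regexm(diagnosis, "UNK$")

list
```

Only "DIABETES" is left with diagnosis_unknown = 0, and only the last row ends with "UNK".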

Date Calculations and Time Differences

Working with dates is crucial for survival analysis:

stata

// Convert string dates to Stata date format
gen date_numeric = date(date_string, "DMY")
format date_numeric %td

// Calculate time differences
gen days_difference = end_date - start_date
gen died_within_180 = (died == 1) & (days_difference <= 180)
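
A worked version of these steps on three made-up records (all variable names and dates are hypothetical). The extra if !missing() guard is an addition for illustration: without it, a record with an unknown date would be coded 0 rather than missing.

```stata
// Toy data: dates stored as day/month/year strings
clear
input str10 transplant_str str10 end_str died
"15/01/2010" "20/03/2010" 1
"01/02/2010" "31/12/2012" 1
"10/06/2011" "10/06/2013" 0
end

// Convert to numeric Stata dates and display them readably
gen transplant_date = date(transplant_str, "DMY")
gen end_date = date(end_str, "DMY")
format transplant_date end_date %td

// Follow-up in days, then a 180-day mortality indicator
gen days_difference = end_date - transplant_date
gen died_within_180 = (died == 1) & (days_difference <= 180) if !missing(days_difference)

list transplant_date end_date died days_difference died_within_180
```

The first patient died 64 days after transplant, so only that row has died_within_180 = 1.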

Part II: Survival Analysis Fundamentals (45 minutes)

Understanding Survival Data Structure

Survival analysis requires three key components:

Time to event (or censoring)
Event indicator (did the event occur?)
Time origin (when does follow-up start?)

Setting Up Survival Data with `stset`

The stset command declares your data as survival data:

stata

stset failure_time, failure(event_indicator) origin(time start_time) scale(365.25)

Parameters Explained:

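failure_time: The date (or time) at which the event or censoring occurred
failure(): Binary variable indicating whether the event occurred (1) or the subject was censored (0)
origin(): The time at which each subject becomes at risk (analysis time zero)
scale(): Rescales analysis time; dividing by 365.25 converts days to years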
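
To make the options concrete, here is a minimal sketch on three made-up subjects. The dates are entered as numeric Stata dates (18263 is 01jan2010), and the variable names simply mirror the placeholders used above.

```stata
// Toy data: numeric Stata dates for start and end of follow-up
clear
input id start_time failure_time event_indicator
1 18263 18628 1
2 18263 18993 0
3 18300 18483 1
end
format start_time failure_time %td

stset failure_time, failure(event_indicator) origin(time start_time) scale(365.25)

// stset creates _t0 and _t (analysis time in years since origin)
// and _d (event indicator as stset understood it)
list id start_time failure_time _t0 _t _d
```

Listing _t0, _t, and _d after every stset is a quick check that follow-up times and events were interpreted as intended.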

Kaplan-Meier Survival Curves

Once data is stset, you can create survival curves:

stata

// Basic survival curve
sts graph

// Stratified by group
sts graph, by(group_variable)

// Customize appearance
sts graph, by(group_var) title("Survival by Group") ///
    xtitle("Years") ytitle("Survival Probability")

Testing Survival Differences

Compare survival between groups:

stata

// Log-rank test
sts test group_variable

// Interpret results
// p < 0.05: evidence that survival differs between groups
// p ≥ 0.05: insufficient evidence of a difference (not proof that the curves are equal)
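
A small illustration of the test on made-up data (10 hypothetical subjects in two groups). sts test stores the chi-squared statistic and its degrees of freedom in r(), so the p-value can be recomputed with chi2tail() if it is needed later in a do-file.

```stata
// Toy data: follow-up time, event indicator, and group
clear
input time died group
2 1 0
3 1 0
5 1 0
7 0 0
8 1 0
6 1 1
9 1 1
12 0 1
14 1 1
15 0 1
end

stset time, failure(died)
sts test group

// Recover the test statistic and p-value from the stored results
display "chi2 = " %5.2f r(chi2) "  df = " r(df) "  p = " %5.3f chi2tail(r(df), r(chi2))
```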

Cox Proportional Hazards Regression

Estimate hazard ratios:

stata

stcox predictor_variable

// Extract hazard ratio and confidence interval
// HR > 1: increased hazard (worse survival)
// HR < 1: decreased hazard (better survival)
// HR = 1: no effect

Interpreting Output:

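Coefficient (β): log hazard ratio (stcox reports hazard ratios by default; add the nohr option to see coefficients)
Hazard Ratio: exp(β)
95% CI: confidence interval for HR; an interval that excludes 1 corresponds to p < 0.05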
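
The relationship between the coefficient and the hazard ratio can be checked directly. This is a sketch on the same kind of made-up two-group data as before; the variable names are hypothetical.

```stata
// Toy data: group 1 tends to fail later than group 0
clear
input time died group
2 1 0
3 1 0
5 1 0
7 0 0
8 1 0
6 1 1
9 1 1
12 0 1
14 1 1
15 0 1
end

stset time, failure(died)

stcox group          // default output: hazard ratio
stcox group, nohr    // same model, reported as coefficients (log hazard ratios)

// The hazard ratio is the exponentiated coefficient
display "beta = " _b[group] "   HR = exp(beta) = " exp(_b[group])
```

Because group 1 fails later in this toy data, the coefficient is negative and the hazard ratio falls below 1.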

Part III: Data Visualization with `twoway` (45 minutes)

Basic Scatter Plots

stata

twoway scatter y_variable x_variable

Combining Multiple Plot Types

stata

// Overlay scatter plot and line
twoway (scatter y1 x) (line y2 x)

// Bar chart with overlay
twoway (bar count year) (line total year, yaxis(2))

Customizing Graph Appearance

Axis Control:

stata

twoway scatter y x, yscale(range(0 100)) ylabel(0(20)100)
//                   ^scale range        ^labels: start(increment)end

Colors and Markers:

stata

twoway scatter y x, mcolor(red) msymbol(circle)

Legends and Labels:

stata

twoway (scatter y1 x) (scatter y2 x), legend(label(1 "Group 1") label(2 "Group 2"))

Advanced Graph Combinations

Multiple Y-Axes:

stata

twoway (bar unemployed year) (line total year, yaxis(2))

Conditional Plotting:

stata

twoway (scatter weight age if gender==1, mcolor(blue)) ///
       (scatter weight age if gender==2, mcolor(red))
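
The same pattern can be tried on the auto dataset that ships with Stata, so no lab data is needed. This sketch overlays two conditional scatter plots and labels them in the legend; the colors, symbols, and axis titles are arbitrary choices.

```stata
// Example dataset installed with Stata
sysuse auto, clear

// One scatter layer per group, each with its own color and marker
twoway (scatter mpg weight if foreign == 0, mcolor(blue) msymbol(circle))   ///
       (scatter mpg weight if foreign == 1, mcolor(red) msymbol(triangle)), ///
       legend(label(1 "Domestic") label(2 "Foreign"))                       ///
       xtitle("Weight (lbs)") ytitle("Mileage (mpg)")
```

The legend keys are numbered in the order the layers appear, which is why label(1 ...) refers to the domestic cars.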

Professional Graph Elements

Adding Reference Lines:

stata

twoway line y x, xline(2008, lcolor(red) lpattern(dash))

Titles and Labels:

stata

twoway line y x, title("Main Title") ///
    xtitle("X-axis Label") ytitle("Y-axis Label") ///
    note("Data source: XYZ")
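
Reference lines, titles, and notes can be combined in one command. The sketch below uses the uslifeexp example dataset that ships with Stata; the choice of 1918 for the reference line and the wording of the titles are illustrative only.

```stata
// Example dataset installed with Stata: U.S. life expectancy by year
sysuse uslifeexp, clear

twoway line le year, xline(1918, lcolor(red) lpattern(dash))  ///
    title("U.S. Life Expectancy")                             ///
    xtitle("Year") ytitle("Life expectancy (years)")          ///
    note("Data source: uslifeexp example dataset")
```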

Multi-column Legends:

stata

twoway line y x, legend(cols(3))

Part IV: Integration and Best Practices (30 minutes)

Combining Data Management with Analysis

Typical workflow:

Import and explore data
Clean and create variables
Perform statistical analysis
Create visualizations
Export results

Error Checking and Validation

stata

// Check variable creation
tab new_variable, missing
assert !missing(new_variable) // Stops the do-file on any missing value, including .a-.z
count if condition // Verify counts match expectations
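
A short sketch of these checks applied to a newly created variable, using the auto dataset that ships with Stata. The variable name high_mpg and the cutoff of 25 are arbitrary choices for illustration.

```stata
sysuse auto, clear

// New indicator; the if qualifier keeps missing mpg from being coded as 1
gen high_mpg = (mpg > 25) if !missing(mpg)

// 1. Look at the distribution, including missing values
tab high_mpg, missing

// 2. Halt the do-file if an assumption is violated
assert !missing(high_mpg)
assert inlist(high_mpg, 0, 1)

// 3. Compare a count against what you expect from the raw data
count if high_mpg == 1
display "Cars above 25 mpg: " r(N)
```

If an assert fails, Stata stops with an error, which is usually preferable to discovering the problem after the analysis has run.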

Reproducible Research Practices

Do-file Structure:

stata

// Header with description, author, date
// Clear existing data and set working directory
// Import data
// Data management section
// Analysis section
// Export results

Logging Results:

stata

log using "analysis_log.log", replace
// ... your analysis code ...
log close
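
Putting the structure and the log together, a do-file skeleton might look like the sketch below. The file name, the use of the auto dataset, and the single summarize command are placeholders, not part of the lab.

```stata
// analysis.do (hypothetical): description, author, date go here

// Start from a clean state
clear all
set more off

// Close any log left open by an earlier run, then start a new one
capture log close
log using "analysis_log.log", replace text

// Import data
sysuse auto, clear

// Data management and analysis sections
summarize mpg

// Finish
log close
```

The capture prefix suppresses the error that log close would otherwise raise when no log is open.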

Export Strategies

stata

// Export graphs
graph export "figure1.png", replace width(800) height(600)

// Export tables (estout is community-contributed; install once with: ssc install estout)
estout using "results.txt", replace

Part V: Practical Applications and Troubleshooting (25 minutes)

Common Challenges and Solutions

Date Format Issues:

Always check date variables after import
Use describe and list to verify formats
Remember that Stata dates are numeric (days since Jan 1, 1960)
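
A quick way to convince yourself that dates are just numbers is to display a few, as in this sketch:

```stata
// Day zero of the Stata calendar
display td(01jan1960)                 // 0
display mdy(1, 2, 1960)               // 1

// A number shown with a date format
display %td 18263                     // 01jan2010

// A string converted to its numeric date
display date("15/01/2010", "DMY")     // 18277
```

A date variable that lists as a five-digit number is usually correct and only lacks a %td display format.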

Missing Data Handling:

stata

// Check for missing patterns
misstable summarize
misstable patterns

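// Handle missing data in calculations (Stata treats missing as larger than any number,
// so restrict derived variables to non-missing values explicitly)
gen new_var = old_var if !missing(old_var)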
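
The guard matters most in comparisons. In this sketch on made-up values (the cutoff of 100 and the variable names are arbitrary), the unguarded indicator silently classifies the missing observation as exceeding the cutoff.

```stata
// Toy data: one missing value
clear
input old_var
50
150
.
end

// Missing is treated as larger than any number, so this codes it as 1
gen flag_naive = (old_var > 100)

// The if qualifier leaves the indicator missing where old_var is missing
gen flag_safe = (old_var > 100) if !missing(old_var)

list
```

The third row shows flag_naive = 1 but flag_safe = . (missing).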

Graph Debugging:

Start simple, add complexity gradually
Use graph describe to understand graph structure
Check variable labels and formats

Performance Tips

Use preserve and restore for temporary changes
Sort data appropriately for operations
Use quietly prefix for commands that don't need output
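
As an illustration of the first and last tips, the sketch below temporarily restricts the auto dataset (shipped with Stata) to a subgroup, reports one result, and then returns to the full data.

```stata
sysuse auto, clear

preserve
    // Temporary change: keep only foreign cars
    keep if foreign == 1

    // quietly suppresses the table but still fills r()
    quietly summarize mpg
    display "Foreign cars: N = " r(N) ", mean mpg = " %4.1f r(mean)
restore

// The full dataset is back
count
```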

Quality Assurance

Always verify your results:

Do the numbers make sense?
Are sample sizes correct?
Do visualizations accurately represent the data?
Are statistical tests appropriate for the data structure?

Session Wrap-up and Lab Preparation

Key Takeaways:

Data Management: Group operations, tagging, and string matching are powerful tools
Survival Analysis: Proper setup with stset is crucial for valid results
Visualization: Build graphs systematically, starting simple and adding complexity
Best Practices: Document your work, check your results, and maintain reproducibility

For Your Labs:

Start with data exploration before creating variables
Use help command when unsure about syntax
Test your code on small examples first
Export graphs as you create them to avoid losing work
Remember that good scientific computing is iterative - expect to refine your approach

Questions to Consider:

How do group-level variables differ from individual-level variables?
When would you use survival analysis versus other statistical methods?
What makes a graph effective for communicating results?
How can you verify that your data management steps worked correctly?

Remember: The goal is not just to complete the labs, but to develop skills in data analysis that will serve you throughout your career. Focus on understanding the concepts, not just memorizing syntax.
