Content is user-generated and unverified.

Stata Lecture: Data Management, Survival Analysis & Data Visualization

Preparing for Labs 4-5 (2.5 Hours)


Session Overview & Learning Objectives

By the end of this session, you will be able to:

  • Perform advanced data management tasks including grouping and tagging
  • Conduct survival analysis using Stata's st commands
  • Create professional data visualizations with twoway graphics
  • Apply string matching and pattern recognition techniques
  • Generate descriptive statistics and handle missing data appropriately

Part I: Advanced Data Management (45 minutes)

Working with Grouped Data and Center-Level Variables

When working with hierarchical data (patients within hospitals, students within schools), we often need to:

  • Calculate summary statistics at the group level
  • Create variables that represent group characteristics
  • Analyze data at different levels of aggregation

Key Commands for Grouped Operations:

stata
// Generate group-level summary statistics
// _N (group size) works with gen; mean() and total() are egen functions
bysort group_var: gen group_count = _N
bysort group_var: egen group_mean = mean(outcome_var)
bysort group_var: egen group_total = total(outcome_var)

// Create observation identifiers within groups
bysort group_var: gen obs_number = _n

Practice Concept: If you have transplant data with multiple patients per center, how would you calculate each center's total volume?
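
One way to approach the practice question, sketched on a made-up dataset. The variable names center and patient_id, and all values, are hypothetical and not taken from the lab data:

```stata
* Toy transplant data: one row per patient (values invented for illustration)
clear
input str1 center patient_id
"A" 1
"A" 2
"A" 3
"B" 4
"C" 5
"C" 6
end

* Center volume = number of patient rows in each center
bysort center: gen center_volume = _N

* Position of each patient within the center
bysort center: gen obs_number = _n

list, sepby(center)
```

Every patient row in center A carries center_volume = 3, which is why the tagging step in the next section is needed before summarizing across centers.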

The Power of Tagging with egen

Tagging is essential when you want to analyze data at the group level without counting each group once for every observation it contains:

stata
// Create a tag variable (1 for exactly one observation per group, 0 for all others)
egen group_tag = tag(group_var)

// Use with summarize to get group-level statistics
summarize group_volume if group_tag == 1, detail

Why This Matters: When analyzing center volumes, you want statistics about centers, not about individual patients.
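
A small sketch of the difference, using invented data (center A has three patients, center B has one; the names center, center_volume, and center_tag are illustrative):

```stata
clear
input str1 center
"A"
"A"
"A"
"B"
end

bysort center: gen center_volume = _N
egen center_tag = tag(center)

* Patient-level mean: center A is counted three times, (3+3+3+1)/4 = 2.5
summarize center_volume
display "mean over patients = " r(mean)

* Center-level mean: each center is counted once, (3+1)/2 = 2
summarize center_volume if center_tag == 1
display "mean over centers  = " r(mean)
```

The second summarize uses two observations, one per center, so its mean describes centers rather than patients.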

String Matching and Pattern Recognition

Medical data often contains messy text fields that need cleaning:

stata
// Basic string matching
gen diagnosis_unknown = 0
replace diagnosis_unknown = 1 if diagnosis == "UNKNOWN"

// Advanced pattern matching with regex (regexm is case-sensitive)
replace diagnosis_unknown = 1 if regexm(diagnosis, "UNKNOWN")
replace diagnosis_unknown = 1 if regexm(diagnosis, "UNCERTAIN")
replace diagnosis_unknown = 1 if regexm(diagnosis, "^NOT SPECI")  // starts with "NOT SPECI"
replace diagnosis_unknown = 1 if regexm(diagnosis, "UNK")         // contains "UNK" (also covers "UNKNOWN")

Regular Expression Patterns:

  • ^ = starts with
  • $ = ends with
  • | = OR operator
  • .* = any characters
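
The patterns above can be combined into a single call. The following is a minimal sketch on invented diagnosis strings; wrapping the variable in upper() is an added assumption that lowercase entries should also match:

```stata
clear
input str30 diagnosis
"UNKNOWN"
"unknown etiology"
"NOT SPECIFIED"
"CAUSE UNCERTAIN"
"HEPATITIS C"
end

* upper() makes the match case-insensitive; | separates alternatives;
* ^ anchors "NOT SPECI" to the start of the string
gen byte diagnosis_unknown = regexm(upper(diagnosis), "^NOT SPECI|UNK|UNCERTAIN")

list diagnosis diagnosis_unknown
```

Short patterns such as "UNK" can also match unrelated words, so it is worth listing the flagged rows before relying on the indicator.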

Date Calculations and Time Differences

Working with dates is crucial for survival analysis:

stata
// Convert string dates to Stata date format
gen date_numeric = date(date_string, "DMY")
format date_numeric %td

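// Calculate time differences (daily dates subtract to give days)
gen days_difference = end_date - start_date
// Leave the flag missing when either input is missing, rather than coding it 0
gen died_within_180 = (died == 1) & (days_difference <= 180) if !missing(died, days_difference)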
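
As a worked illustration of the date steps, here is a sketch on three invented records. The string variables start_str and end_str are hypothetical names; 2020 is a leap year, so 01 January to 29 June is exactly 180 days:

```stata
clear
input str10 start_str str10 end_str died
"01/01/2020" "29/06/2020" 1
"01/01/2020" "30/06/2020" 1
"15/03/2020" "20/03/2020" 0
end

* The mask "DMY" must match the order of day, month, year in the strings
gen start_date = date(start_str, "DMY")
gen end_date   = date(end_str, "DMY")
format start_date end_date %td

gen days_difference = end_date - start_date
gen byte died_within_180 = (died == 1) & (days_difference <= 180)

list start_date end_date died days_difference died_within_180
```

The first record (180 days, died) is flagged, the second (181 days) is not, and the third is not flagged because the patient did not die.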

Part II: Survival Analysis Fundamentals (45 minutes)

Understanding Survival Data Structure

Survival analysis requires three key components:

  1. Time to event (or censoring)
  2. Event indicator (did the event occur?)
  3. Time origin (when does follow-up start?)

Setting Up Survival Data with stset

The stset command declares your data as survival data:

stata
stset failure_time, failure(event_indicator) origin(time start_time) scale(365.25)

Parameters Explained:

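  • failure_time: The time (here, a date) at which follow-up ends, whether by event or by censoring
  • failure(): Variable marking whether the event occurred (nonzero and nonmissing means the event occurred)
  • origin(): When each subject becomes at risk, which defines analysis time 0
  • scale(): Divides analysis time by the given value (365.25 converts days to years)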
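
To see what stset produces, the sketch below declares three invented records as survival data. The string variables start_str and end_str are hypothetical; the other names follow the template above:

```stata
clear
input str9 start_str str9 end_str died
"01jan2020" "01jan2021" 1
"01jan2020" "01jul2020" 0
"15mar2020" "15mar2022" 1
end

gen start_time   = date(start_str, "DMY")
gen failure_time = date(end_str, "DMY")
format start_time failure_time %td

stset failure_time, failure(died) origin(time start_time) scale(365.25)

* stset creates _t (analysis time, here in years) and _d (event indicator)
list start_time failure_time died _t _d
```

Checking _t and _d against a few records you can verify by hand is a quick way to confirm that the origin and scale were specified as intended.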

Kaplan-Meier Survival Curves

Once data is stset, you can create survival curves:

stata
// Basic survival curve
sts graph

// Stratified by group
sts graph, by(group_variable)

// Customize appearance
sts graph, by(group_var) title("Survival by Group") ///
    xtitle("Years") ytitle("Survival Probability")

Testing Survival Differences

Compare survival between groups:

stata
// Log-rank test
sts test group_variable

// Interpret results
// p < 0.05: evidence that survival differs between groups (at the 5% level)
// p ≥ 0.05: insufficient evidence of a difference (not proof that survival is equal)
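
A runnable sketch of the log-rank test, assuming the cancer example dataset that ships with Stata (variables studytime, died, and drug) stands in for your own data:

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Log-rank test for equality of survivor functions across the drug groups
sts test drug

* sts test leaves the chi-squared statistic and its degrees of freedom in r()
display "chi2 = " r(chi2) ", df = " r(df)
display "p-value = " chi2tail(r(df), r(chi2))
```

Computing the p-value from r(chi2) and r(df) is useful when you need to store or report it programmatically rather than reading it off the output.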

Cox Proportional Hazards Regression

Estimate hazard ratios:

stata
stcox predictor_variable

// stcox reports hazard ratios with 95% confidence intervals by default
// HR > 1: increased hazard (worse survival)
// HR < 1: decreased hazard (better survival)
// HR = 1: no association with the hazard

Interpreting Output:

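  • Coefficient (β): log hazard ratio (displayed when you add the nohr option)
  • Hazard Ratio: exp(β), the default display
  • 95% CI: confidence interval for HR; an interval that excludes 1 corresponds to p < 0.05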
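
The relationship between the coefficient and the hazard ratio can be checked directly. This sketch again assumes the cancer example dataset shipped with Stata, with age as an illustrative predictor:

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Default output: hazard ratios
stcox age

* Same model, reporting coefficients (log hazard ratios) instead
stcox age, nohr

* The hazard ratio is the exponentiated coefficient
display "HR per one-year increase in age = " exp(_b[age])
```

The displayed value should match the hazard ratio column from the first stcox call.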

Part III: Data Visualization with twoway (45 minutes)

Basic Scatter Plots

stata
twoway scatter y_variable x_variable

Combining Multiple Plot Types

stata
// Overlay scatter plot and line
twoway (scatter y1 x) (line y2 x)

// Bar chart with overlay
twoway (bar count year) (line total year, yaxis(2))

Customizing Graph Appearance

Axis Control:

stata
twoway scatter y x, yscale(range(0 100)) ylabel(0(20)100)
//                   ^scale range        ^labels: start(increment)end

Colors and Markers:

stata
twoway scatter y x, mcolor(red) msymbol(circle)

Legends and Labels:

stata
twoway (scatter y1 x) (scatter y2 x), legend(label(1 "Group 1") label(2 "Group 2"))

Advanced Graph Combinations

Multiple Y-Axes:

stata
twoway (bar unemployed year) (line total year, yaxis(2))

Conditional Plotting:

stata
twoway (scatter weight age if gender==1, mcolor(blue)) ///
       (scatter weight age if gender==2, mcolor(red))
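
For a version that runs as written, the sketch below applies the same idea to the auto example dataset shipped with Stata, where foreign plays the role of the grouping variable. The colors, symbols, and titles are arbitrary choices:

```stata
sysuse auto, clear

twoway (scatter mpg weight if foreign == 0, mcolor(blue) msymbol(circle))    ///
       (scatter mpg weight if foreign == 1, mcolor(red)  msymbol(triangle)), ///
       legend(order(1 "Domestic" 2 "Foreign"))                               ///
       title("Mileage by Weight") xtitle("Weight (lbs.)") ytitle("Miles per gallon")
```

Because each group is its own plot, each gets its own legend entry, and legend(order()) both selects and labels them.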

Professional Graph Elements

Adding Reference Lines:

stata
twoway line y x, xline(2008, lcolor(red) lpattern(dash))

Titles and Labels:

stata
twoway line y x, title("Main Title") ///
    xtitle("X-axis Label") ytitle("Y-axis Label") ///
    note("Data source: XYZ")

Multi-column Legends:

stata
twoway line y1 y2 y3 x, legend(cols(3))

Part IV: Integration and Best Practices (30 minutes)

Combining Data Management with Analysis

Typical workflow:

  1. Import and explore data
  2. Clean and create variables
  3. Perform statistical analysis
  4. Create visualizations
  5. Export results

Error Checking and Validation

stata
// Check variable creation
tab new_variable, missing
assert !missing(new_variable) // Ensure no unexpected missing values (also catches .a-.z)
count if condition // Verify counts match expectations
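
A sketch of these checks in practice, assuming the auto example dataset shipped with Stata. The variable heavy and the 3000-pound cutoff are invented for illustration:

```stata
sysuse auto, clear

gen byte heavy = (weight > 3000) if !missing(weight)

tab heavy, missing
assert !missing(heavy)
assert inlist(heavy, 0, 1)

* Cross-check the flag against an independent count
count if weight > 3000 & !missing(weight)
local n_expected = r(N)
count if heavy == 1
assert r(N) == `n_expected'

* capture lets a do-file continue after a failed assert; _rc holds the return code
capture assert !missing(rep78)
display "return code: " _rc
```

An assert that holds is silent. A failed assert stops the do-file unless it is prefixed with capture, in which case a nonzero _rc signals the failure.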

Reproducible Research Practices

Do-file Structure:

stata
// Header with description, author, date
// Clear existing data and set working directory
// Import data
// Data management section
// Analysis section
// Export results

Logging Results:

stata
log using "analysis_log.log", replace
// ... your analysis code ...
log close
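
Putting the structure and the logging together, a do-file skeleton might look like the following. This is only a sketch: the version number, the use of the auto example dataset, and the variable heavy are assumptions standing in for your own project:

```stata
* Purpose: <short description>   Author: <name>   Date: <date>
version 14                 // assumed version; pins command behavior
clear all
set more off

capture log close          // close any log left open by an earlier run
log using "analysis_log.log", replace text

* --- Import data ---
sysuse auto, clear         // stand-in for your own use or import command

* --- Data management ---
gen byte heavy = (weight > 3000) if !missing(weight)

* --- Analysis ---
summarize mpg if heavy == 1

* --- Export results and clean up ---
log close
```

Starting with capture log close means the do-file can be rerun after an error without complaining that a log is already open.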

Export Strategies

stata
// Export graphs
graph export "figure1.png", replace width(800) height(600)

// Export tables (estout is a community-contributed command: ssc install estout)
estout using "results.txt", replace

Part V: Practical Applications and Troubleshooting (25 minutes)

Common Challenges and Solutions

Date Format Issues:

  • Always check date variables after import
  • Use describe and list to verify formats
  • Remember that Stata dates are numeric (days since Jan 1, 1960)
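
A few one-line checks make the numeric nature of dates concrete. The date strings are arbitrary examples:

```stata
* Day 0 of Stata's daily-date scale
display %td 0                              // 01jan1960
display date("01jan1960", "DMY")           // 0

* The same string under the right and the wrong mask
display date("15/03/2020", "DMY")          // 21989
display date("15/03/2020", "MDY")          // . (there is no month 15)
```

A column of missing values after conversion is often a sign that the mask does not match the order of the date components.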

Missing Data Handling:

stata
// Check for missing patterns
misstable summarize
misstable patterns

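// Handle missing data in calculations
// Missing values count as larger than any number, so exclude them explicitly
gen high_value = (old_var > 100) if !missing(old_var)   // 100 is an illustrative cutoff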
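
The reason missing values need explicit handling shows up clearly in a count. This sketch assumes the auto example dataset shipped with Stata, in which rep78 is missing for 5 of the 74 cars:

```stata
sysuse auto, clear

misstable summarize              // reports rep78 as the variable with missing values

* Pitfall: numeric missing (.) sorts above every number, so it satisfies "> 4"
count if rep78 > 4                       // 16: includes the 5 cars with missing rep78
count if rep78 > 4 & !missing(rep78)     // 11: only cars with an observed rep78 of 5
```

The gap between the two counts is exactly the number of missing values, which is an easy check to run whenever a condition uses > or >=.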

Graph Debugging:

  • Start simple, add complexity gradually
  • Use graph describe to understand graph structure
  • Check variable labels and formats

Performance Tips

  • Use preserve and restore for temporary changes
  • Sort data before by-group operations (bysort sorts and groups in one step)
  • Use quietly prefix for commands that don't need output
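
A brief sketch of the first and last tips, assuming the auto example dataset shipped with Stata. The names mean_mpg and n_cars are invented:

```stata
sysuse auto, clear

* preserve/restore: collapse to one row per group, inspect, then get the full data back
preserve
collapse (mean) mean_mpg = mpg (count) n_cars = mpg, by(foreign)
list
restore

* quietly suppresses the output table; results are still left in r()
quietly summarize mpg
display "mean mpg = " r(mean)
```

After restore the dataset is back to its original 74 observations, so the collapse never has to be undone by hand.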

Quality Assurance

Always verify your results:

  1. Do the numbers make sense?
  2. Are sample sizes correct?
  3. Do visualizations accurately represent the data?
  4. Are statistical tests appropriate for the data structure?

Session Wrap-up and Lab Preparation

Key Takeaways:

  1. Data Management: Group operations, tagging, and string matching are powerful tools
  2. Survival Analysis: Proper setup with stset is crucial for valid results
  3. Visualization: Build graphs systematically, starting simple and adding complexity
  4. Best Practices: Document your work, check your results, and maintain reproducibility

For Your Labs:

  • Start with data exploration before creating variables
  • Use help followed by a command name (for example, help stset) when unsure about syntax
  • Test your code on small examples first
  • Export graphs as you create them to avoid losing work
  • Remember that good scientific computing is iterative - expect to refine your approach

Questions to Consider:

  • How do group-level variables differ from individual-level variables?
  • When would you use survival analysis versus other statistical methods?
  • What makes a graph effective for communicating results?
  • How can you verify that your data management steps worked correctly?

Remember: The goal is not just to complete the labs, but to develop skills in data analysis that will serve you throughout your career. Focus on understanding the concepts, not just memorizing syntax.
