By the end of this session, you will be able to:
st commands
twoway graphics
When working with hierarchical data (patients within hospitals, students within schools), we often need to:
// Generate group-level summary statistics
bysort group_var: gen group_count = _N
bysort group_var: egen group_mean = mean(outcome_var) // mean() is an egen function, so gen would fail here
bysort group_var: egen group_total = total(outcome_var)
// Create observation identifiers within groups
bysort group_var: gen obs_number = _n
Practice Concept: If you have transplant data with multiple patients per center, how would you calculate each center's total volume?
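One possible answer to the practice question is sketched here on a tiny made-up dataset. The variable names (center_id, patient_id, center_volume, patient_seq) and the six rows are hypothetical; the sketch assumes one row per patient, so a center's volume is simply the number of rows in its group.

```stata
// Hypothetical transplant data: one row per patient
clear
input center_id patient_id
1 101
1 102
1 103
2 201
2 202
3 301
end

// _N is the number of observations in the current by-group,
// so every patient row receives its center's total volume
bysort center_id: gen center_volume = _N

// _n numbers the patients within each center (1, 2, ...)
bysort center_id: gen patient_seq = _n

list, sepby(center_id)
```

If the data instead held one row per procedure with a count column, egen with total() would be the analogous tool.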
Tagging with egen is essential when you want to analyze data at the group level without counting each group once per observation:
// Create a tag variable (1 for first obs in group, 0 for others)
egen group_tag = tag(group_var)
// Use with summarize to get group-level statistics
summarize group_volume if group_tag == 1, detail
Why This Matters: When analyzing center volumes, you want statistics about centers, not about individual patients.
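The difference between the two levels of analysis can be seen in a small sketch. The dataset and variable names below are hypothetical; the point is only that the same summarize command answers a different question once it is restricted to tagged rows.

```stata
// Hypothetical data: 3 centers with 3, 2, and 1 patients
clear
input center_id patient_id
1 101
1 102
1 103
2 201
2 202
3 301
end

bysort center_id: gen center_volume = _N
egen center_tag = tag(center_id)   // 1 on exactly one row per center

// Patient-level summary: 6 rows, large centers counted more often
summarize center_volume

// Center-level summary: 3 rows, each center counted once
summarize center_volume if center_tag == 1
```

In this toy data the patient-level mean is 14/6 (about 2.33) while the center-level mean is 2, because the largest center contributes three rows to the first summary and only one to the second.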
Medical data often contains messy text fields that need cleaning:
// Basic string matching
gen diagnosis_unknown = 0
replace diagnosis_unknown = 1 if diagnosis == "UNKNOWN"
// Advanced pattern matching with regex
replace diagnosis_unknown = 1 if regexm(diagnosis, "UNKNOWN")
replace diagnosis_unknown = 1 if regexm(diagnosis, "UNCERTAIN")
replace diagnosis_unknown = 1 if regexm(diagnosis, "^NOT SPECI") // starts with "NOT SPECI"
replace diagnosis_unknown = 1 if regexm(diagnosis, "UNK") // contains "UNK"
Regular Expression Patterns:
^ = starts with
$ = ends with
| = OR operator
.* = any characters
Working with dates is crucial for survival analysis:
// Convert string dates to Stata date format
gen date_numeric = date(date_string, "DMY")
format date_numeric %td
// Calculate time differences
gen days_difference = end_date - start_date
gen died_within_180 = (died == 1) & (days_difference <= 180)
Survival analysis requires three key components:
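A minimal sketch of preparing such inputs from raw string dates is shown here: an entry date, an exit date, and a death indicator. The variable names (tx_date_str, end_date_str, died, days_followed) and the three rows are hypothetical, chosen only to illustrate the date steps above.

```stata
// Hypothetical follow-up data with day/month/year strings
clear
input str10 tx_date_str str10 end_date_str died
"01/01/2020" "01/03/2020" 1
"15/02/2020" "15/02/2021" 1
"10/06/2020" "31/12/2020" 0
end

// date() with the "DMY" mask converts the strings to days since 01jan1960
gen tx_date  = date(tx_date_str, "DMY")
gen end_date = date(end_date_str, "DMY")
format tx_date end_date %td

// Stata dates are day counts, so subtraction gives follow-up in days
gen days_followed = end_date - tx_date

// The if qualifier is an added safeguard (an assumption, not in the original):
// it leaves the flag missing when either input is missing
gen died_within_180 = (died == 1) & (days_followed <= 180) ///
    if !missing(days_followed, died)

list tx_date end_date days_followed died died_within_180
```

Here the follow-up times come out as 60, 366, and 204 days, so only the first patient is flagged as having died within 180 days.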
The stset command declares your data as survival data:
stset failure_time, failure(event_indicator) origin(start_time) scale(365.25)
Parameters Explained:
failure_time: When the event occurred or censoring happened
failure(): Binary variable indicating if event occurred
origin(): When follow-up begins for each subject
scale(): Convert time units (days to years)
Once data is stset, you can create survival curves:
// Basic survival curve
sts graph
// Stratified by group
sts graph, by(group_variable)
// Customize appearance
sts graph, by(group_var) title("Survival by Group") ///
    xtitle("Years") ytitle("Survival Probability")
Compare survival between groups:
// Log-rank test
sts test group_variable
// Interpret results
// p < 0.05 suggests the survival curves differ between groups
// p ≥ 0.05 means insufficient evidence of a difference (not proof that the groups are equal)
Estimate hazard ratios:
stcox predictor_variable
// Extract hazard ratio and confidence interval
// HR > 1: increased hazard (worse survival)
// HR < 1: decreased hazard (better survival)
// HR = 1: no effect
Interpreting Output:
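To make the hazard-ratio reading concrete, the following sketch runs the steps above on cancer.dta, a small drug-trial dataset shipped with Stata. The choice of dataset and the derived indicator active (any active drug versus placebo) are assumptions for illustration, not part of this session's data.

```stata
// Shipped example data: studytime = months to death or censoring, died = 1 if died
sysuse cancer, clear

// Declare survival data (time is already in months, so no origin() or scale())
stset studytime, failure(died)

// Hypothetical grouping: drug == 1 is placebo, 2 and 3 are active drugs
gen byte active = (drug > 1)

// Log-rank test of equal survival across the two groups
sts test active

// Cox model: the reported coefficient is the hazard ratio for active vs placebo
stcox active

// Recover the hazard ratio and its 95% CI from the stored log-hazard coefficient
scalar hr    = exp(_b[active])
scalar hr_lo = exp(_b[active] - invnormal(0.975) * _se[active])
scalar hr_hi = exp(_b[active] + invnormal(0.975) * _se[active])
display "HR = " %5.3f hr "  (95% CI " %5.3f hr_lo " to " %5.3f hr_hi ")"
```

A hazard ratio below 1 with a confidence interval that excludes 1 would be read as a lower hazard (better survival) in the active group; an interval that includes 1 would leave the direction of the effect unresolved.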
twoway (45 minutes)
twoway scatter y_variable x_variable
// Overlay scatter plot and line
twoway (scatter y1 x) (line y2 x)
// Bar chart with overlay
twoway (bar count year) (line total year, yaxis(2))
twoway scatter y x, yscale(range(0 100)) ylabel(0(20)100)
// yscale(range(min max)) sets the axis range; ylabel(start(increment)end) sets the labels
twoway scatter y x, mcolor(red) msymbol(circle)
twoway scatter y x, legend(label(1 "Group 1") label(2 "Group 2"))
twoway (bar unemployed year) (line total year, yaxis(2))
twoway (scatter weight age if gender==1, mcolor(blue)) ///
    (scatter weight age if gender==2, mcolor(red))
twoway line y x, xline(2008, lcolor(red) lpattern(dash))
twoway line y x, title("Main Title") ///
xtitle("X-axis Label") ytitle("Y-axis Label") ///
note("Data source: XYZ")
twoway line y x, legend(cols(3))
Typical workflow:
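One way the twoway options above might be combined, sketched on uslifeexp.dta (a dataset shipped with Stata). The dataset, the graph name le_trend, and the particular axis range and reference year are assumptions chosen for illustration.

```stata
// Shipped example data: US life expectancy by year, 1900-1999
sysuse uslifeexp, clear

// Step 1: start with a plain plot to check the data
twoway line le_male year

// Step 2: overlay a second series, then layer on axis, reference-line,
// title, legend, and note options
twoway (line le_male year, lcolor(blue)) ///
       (line le_female year, lcolor(red)), ///
       yscale(range(30 90)) ylabel(30(10)90) ///
       xline(1918, lcolor(gs8) lpattern(dash)) ///
       title("US life expectancy at birth") ///
       xtitle("Year") ytitle("Years") ///
       legend(label(1 "Male") label(2 "Female") cols(2)) ///
       note("Data: uslifeexp.dta (shipped with Stata)") ///
       name(le_trend, replace)
```

Building the graph in two steps keeps errors easy to locate: if the basic plot looks wrong, the problem is in the data rather than in the options.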
// Check variable creation
tab new_variable, missing
assert !missing(new_variable) // Ensure no unexpected missing values (also catches .a through .z)
count if condition // Verify counts match expectations
// Header with description, author, date
// Clear existing data and set working directory
// Import data
// Data management section
// Analysis section
// Export results
log using "analysis_log.log", replace
// ... your analysis code ...
log close
// Export graphs
graph export "figure1.png", replace width(800) height(600)
// Export tables
estout using "results.txt", replace // estout is community-contributed: ssc install estout
Use describe and list to verify formats
// Check for missing patterns
misstable summarize
misstable patterns
// Handle missing data in calculations
gen new_var = old_var if !missing(old_var)
Use graph describe to understand graph structure
Use preserve and restore for temporary changes
Use the quietly prefix for commands that don't need output
Always verify your results:
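A minimal sketch of such checks, using auto.dta (shipped with Stata) because its rep78 variable contains missing values. The derived variable good_repair and the cutoff of 4 are hypothetical.

```stata
sysuse auto, clear

// Inspect missingness before deriving anything from rep78
misstable summarize rep78

// Stata treats missing as larger than any number, so (rep78 >= 4) alone
// would code missing values as 1; the if qualifier keeps them missing
gen good_repair = (rep78 >= 4) if !missing(rep78)

// Tabulate with the missing option so missing values are visible
tab good_repair, missing

// assert stops the do-file if a check fails
assert missing(good_repair) == missing(rep78)
assert inlist(good_repair, 0, 1) if !missing(good_repair)

// quietly suppresses the count output; the result is still in r(N)
quietly count if good_repair == 1
display "Cars with rep78 >= 4: " r(N)
```

Placing assert statements directly after each derived variable means a later change to the data or code fails loudly instead of silently altering the results.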
Correct stset setup is crucial for valid results
Use help command when unsure about syntax
Remember: The goal is not just to complete the labs, but to develop skills in data analysis that will serve you throughout your career. Focus on understanding the concepts, not just memorizing syntax.