by Command (15 min)Basic Syntax:
_n = row number within group_N = total rows in group/dataset// Create group totals and IDs
sort abo
by abo: gen cat_n = _N // Total in each ABO group
by abo: gen cat_id = _n // Sequential ID within group
// Shorthand: bysort (or bys)
bysort abo age: gen age_rank = _negen Command (20 min)Key Functions:
// Basic statistics
egen mean_age = mean(age)
egen median_age = median(age)
egen max_age = max(age)
egen age_sd = sd(age)
// By groups
bysort dx: egen mean_age_by_dx = mean(age)
bysort abo gender: egen max_bmi = max(bmi)egen (10 min)// Tag one record per group
egen grouptag = tag(abo gender)
list abo gender spread if grouptag
// Useful for graphing large datasets
egen ethtag = tag(ethcat)
bysort ethcat: egen mean_bmi = mean(bmi)Numeric:
// Rounding
floor(8.9) // Round down: 8
ceil(8.9) // Round up: 9
round(8.9) // Round nearest: 9
// Math
min(8,6,7,5) // Minimum value
max(8,6,7,5) // Maximum value
exp(1) // e^1
ln(20) // Natural log
sqrt(729) // Square root
abs(-6) // Absolute valueCore String Functions:
// Extract words
word("Hello world", 2) // Returns "world"
// String length
strlen("Same as it ever was") // Returns 19
// Find position
strpos("history", "story") // Returns 4
strpos("history", "xyz") // Returns 0 (not found)
// Pattern matching
regexm("Earth", "art") // Returns 1 (true)
regexm("team", "I") // Returns 0 (false)
// Advanced regex
regexm(text, "^A") // Starts with A
regexm(text, "X$") // Ends with XPractical Example: Drug Classification
// Check for antihistamine exposure
gen has_benadryl = (strpos(drug_name, "DIPHENHYDRAMINE") != 0)
gen has_hydroxyzine = (strpos(drug_name, "HYDROXYZINE") != 0)
gen any_antihistamine = (has_benadryl | has_hydroxyzine)Date Storage & Conversion:
// Stata stores dates as days since Jan 1, 1960
disp %td 19400 // Shows as date format
// Convert string to date
gen date_var = date("August 15, 1969", "MDY")
gen date_var2 = date("2061 28 July", "YDM")
// Date arithmetic
gen oneweek_later = transplant_date + 7
format %td oneweek_laterEssential Habits:
// Bad
program define myprogram
quietly {
forvalues i=1/10 {
if `i' < 5 {
noisily disp "`i' is less than five" }
else {
noisily disp "`i' is at least five" }
}
}
end
// Good
program define display_numbers
quietly {
forvalues i = 1/10 {
if `i' < 5 {
noisily display "`i' is less than five"
}
else {
noisily display "`i' is at least five"
}
}
}
endBasic Setup:
// Method 1: Specify follow-up time
gen followup_time = end_date - transplant_date
stset followup_time, failure(died)
// Method 2: Specify start and end dates
stset end_date, origin(transplant_date) failure(died)
// Method 3: Scale to years
stset end_date, origin(transplant_date) failure(died) scale(365.25)Analysis Commands:
// Kaplan-Meier curves
sts graph
sts graph, by(treatment_group)
// Summary statistics
stsum
stsum, by(treatment_group)
// Test differences
sts test treatment_group
// Cox regression
stcox age gender treatment_groupMerge Types:
1:1 - One-to-one (each key appears once in each dataset)1:m - One-to-many (master has unique keys, using may have duplicates)m:1 - Many-to-one (master may have duplicates, using has unique keys)Basic Syntax:
// Simple merge
use master_data, clear
merge 1:1 patient_id using other_data
// Merge options
merge 1:1 patient_id using other_data, keep(match) // Keep only matched
merge 1:1 patient_id using other_data, nogen // Don't create _merge
merge 1:1 patient_id using other_data, gen(merge_flag) // Custom merge variableSequential Merges:
// Link recipients to donors via crosswalk table
use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(match) nogen
merge m:1 fake_don_id using donors, keep(match) nogen keepusing(age_don)
// Check merge success
quietly count if _merge == 3
assert r(N)/_N > 0.99 // Ensure 99%+ match rateMerge Quality Checks:
// Verify expected matches
tab _merge
// Assert perfect match if expected
assert _merge == 3
// Check match rate
display "Match rate: " r(N)/_N * 100 "%"// Data manipulation
bysort group: egen mean_var = mean(variable)
egen tag = tag(group1 group2)
gen new_var = function(old_var)
// String functions
strlen(string)
strpos(string, substring)
word(string, n)
regexm(string, pattern)
// Dates
date(string, format)
format %td date_variable
// Survival analysis
stset time, failure(event)
sts graph, by(group)
stcox var1 var2
// Merging
merge 1:1 key using dataset, keep(match)
merge m:1 key using dataset, nogenby - Always sort firstegen vs gen - Use egen for group functions_merge resultsupper() or lower() for consistencykeep(match) in merges to avoid unexpected data lossstsum