Content is user-generated and unverified.

Stata Advanced File Management - 3 Hour Intensive

Hour 1: Data Manipulation Essentials

1.1 The by Command (15 min)

Basic Syntax:

  • _n = row number within group
  • _N = total rows in group/dataset
stata
// Create group totals and IDs
sort abo
by abo: gen cat_n = _N        // Total in each ABO group
by abo: gen cat_id = _n       // Sequential ID within group

// Shorthand: bysort (or bys)
bysort abo age: gen age_rank = _n

1.2 The egen Command (20 min)

Key Functions:

stata
// Basic statistics
egen mean_age = mean(age)
egen median_age = median(age)
egen max_age = max(age)
egen age_sd = sd(age)

// By groups
bysort dx: egen mean_age_by_dx = mean(age)
bysort abo gender: egen max_bmi = max(bmi)

1.3 Tagging with egen (10 min)

stata
// Tag one record per group
egen grouptag = tag(abo gender)
list abo gender spread if grouptag

// Useful for graphing large datasets
egen ethtag = tag(ethcat)
bysort ethcat: egen mean_bmi = mean(bmi)

1.4 Essential Functions (15 min)

Numeric:

stata
// Rounding
floor(8.9)    // Round down: 8
ceil(8.9)     // Round up: 9
round(8.9)    // Round nearest: 9

// Math
min(8,6,7,5)  // Minimum value
max(8,6,7,5)  // Maximum value
exp(1)        // e^1
ln(20)        // Natural log
sqrt(729)     // Square root
abs(-6)       // Absolute value

Hour 2: String Manipulation & Data Types

2.1 String Functions (25 min)

Core String Functions:

stata
// Extract words
word("Hello world", 2)     // Returns "world"

// String length
strlen("Same as it ever was")  // Returns 19

// Find position
strpos("history", "story")     // Returns 4
strpos("history", "xyz")       // Returns 0 (not found)

// Pattern matching
regexm("Earth", "art")         // Returns 1 (true)
regexm("team", "I")            // Returns 0 (false)

// Advanced regex
regexm(text, "^A")             // Starts with A
regexm(text, "X$")             // Ends with X

Practical Example: Drug Classification

stata
// Check for antihistamine exposure
gen has_benadryl = (strpos(drug_name, "DIPHENHYDRAMINE") != 0)
gen has_hydroxyzine = (strpos(drug_name, "HYDROXYZINE") != 0)
gen any_antihistamine = (has_benadryl | has_hydroxyzine)

2.2 Date Handling (20 min)

Date Storage & Conversion:

stata
// Stata stores dates as days since Jan 1, 1960
disp %td 19400    // Shows as date format

// Convert string to date
gen date_var = date("August 15, 1969", "MDY")
gen date_var2 = date("2061 28 July", "YDM")

// Date arithmetic
gen oneweek_later = transplant_date + 7
format %td oneweek_later

2.3 Code Quality Practices (15 min)

Essential Habits:

  1. Indent your code - use consistent spacing
  2. Test iteratively - run small sections
  3. Comment extensively - explain your logic
  4. Use descriptive variable names
stata
// Bad
program define myprogram
quietly {
forvalues i=1/10 {
if `i' < 5 {
noisily disp "`i' is less than five" }
else {
noisily disp "`i' is at least five" }
}
} 
end

// Good
program define display_numbers
    quietly {
        forvalues i = 1/10 {
            if `i' < 5 {
                noisily display "`i' is less than five"
            }
            else {
                noisily display "`i' is at least five"
            }
        }
    } 
end

Hour 3: Survival Analysis & Data Merging

3.1 Survival Analysis Setup (25 min)

Basic Setup:

stata
// Method 1: Specify follow-up time
gen followup_time = end_date - transplant_date
stset followup_time, failure(died)

// Method 2: Specify start and end dates
stset end_date, origin(transplant_date) failure(died)

// Method 3: Scale to years
stset end_date, origin(transplant_date) failure(died) scale(365.25)

Analysis Commands:

stata
// Kaplan-Meier curves
sts graph
sts graph, by(treatment_group)

// Summary statistics
stsum
stsum, by(treatment_group)

// Test differences
sts test treatment_group

// Cox regression
stcox age gender treatment_group

3.2 Merging Datasets (35 min)

Merge Types:

  • 1:1 - One-to-one (each key appears once in each dataset)
  • 1:m - One-to-many (master has unique keys, using may have duplicates)
  • m:1 - Many-to-one (master may have duplicates, using has unique keys)

Basic Syntax:

stata
// Simple merge
use master_data, clear
merge 1:1 patient_id using other_data

// Merge options
merge 1:1 patient_id using other_data, keep(match)      // Keep only matched
merge 1:1 patient_id using other_data, nogen            // Don't create _merge
merge 1:1 patient_id using other_data, gen(merge_flag)  // Custom merge variable

Sequential Merges:

stata
// Link recipients to donors via crosswalk table
use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(match) nogen
merge m:1 fake_don_id using donors, keep(match) nogen keepusing(age_don)

// Check merge success
quietly count if _merge == 3
assert r(N)/_N > 0.99  // Ensure 99%+ match rate

Merge Quality Checks:

stata
// Verify expected matches
tab _merge

// Assert perfect match if expected
assert _merge == 3

// Check match rate
display "Match rate: " r(N)/_N * 100 "%"

Quick Reference Card

Essential Commands

stata
// Data manipulation
bysort group: egen mean_var = mean(variable)
egen tag = tag(group1 group2)
gen new_var = function(old_var)

// String functions
strlen(string)
strpos(string, substring)
word(string, n)
regexm(string, pattern)

// Dates
date(string, format)
format %td date_variable

// Survival analysis
stset time, failure(event)
sts graph, by(group)
stcox var1 var2

// Merging
merge 1:1 key using dataset, keep(match)
merge m:1 key using dataset, nogen

Common Pitfalls

  1. Forgot to sort before by - Always sort first
  2. Missing egen vs gen - Use egen for group functions
  3. Date format confusion - Remember MDY, DMY, YMD order
  4. Merge without checking - Always examine _merge results
  5. String case sensitivity - Use upper() or lower() for consistency

Pro Tips

  • Use keep(match) in merges to avoid unexpected data loss
  • Tag variables are essential for efficient graphing
  • Test string functions with simple examples first
  • Always validate survival analysis setup with stsum
  • Comment your merge logic extensively
Content is user-generated and unverified.
    Stata Advanced File Management - 3 Hour Intensive | Claude