The Robust Beauty of Improper Linear Models in Decision Making
Dawes, R.M. (1979). American Psychologist, 34(7), 571-582
Executive Summary
This seminal paper demonstrates that even "improper" linear models—where predictor weights are chosen non-optimally (equal weighting, random weighting, or intuitive weighting)—consistently outperform expert clinical judgment in prediction tasks. Building on Meehl's 1954 finding that optimal ("proper") linear models beat human judges, Dawes shows that the key insight is simply knowing which variables to consider and their directional relationships to outcomes, not finding optimal weights. The practical implication is profound: simple equal-weighting schemes can provide superior predictions compared to expert intuition across domains from graduate admissions to psychiatric diagnosis to bankruptcy prediction.
Author and Institutional Affiliations
Primary Author: Robyn M. Dawes, Ph.D.
Affiliations:
- University of Oregon, Department of Psychology (primary affiliation)
- Decision Research, Inc., Eugene, Oregon
- James McKeen Cattell Sabbatical Fellow at:
  - University of Michigan, Psychology Department
  - Institute for Social Research, Research Center for Group Dynamics
Academic Context: Work began at the University of Oregon and was completed during a sabbatical fellowship at the University of Michigan in 1978-1979.
Conflict of Interest Assessment
Declared Conflicts: None explicitly stated
Potential Conflicts:
- The author uses his own previous research extensively (particularly graduate admissions at University of Oregon)
- Academic self-interest in promoting statistical methods over clinical judgment
- No industry funding mentioned
- No financial relationships disclosed
Assessment: This appears to be purely academic research with no apparent commercial or financial conflicts. The work challenges professional practices of clinical psychologists and admissions committees, which could create professional resistance, but represents no financial conflict.
Data Review
Studies Examined
1. Neurosis vs. Psychosis Diagnosis (Goldberg, 1970)
- Sample: 861 psychiatric patients with MMPI profiles
- Judges: 29 clinical psychologists of varying experience
- Task: Predict diagnosis (neurosis=0, psychosis=1) from 11 MMPI scores
- Results:
  - Average judge validity: r = 0.28
  - Judge's paramorphic model: r = 0.31
  - Random linear model: r = 0.30
  - Equal-weighting model: r = 0.34
  - Optimal linear model: r = 0.46
2. Illinois Graduate Student GPA Prediction (Wiggins & Kohen, 1971)
- Sample: 90 first-year psychology graduate students
- Predictors: 10 variables (GRE, GPA, peer ratings, self-ratings)
- Judges: 80 graduate students at University of Illinois
- Results:
  - Average judge validity: r = 0.33
  - Judge's model: r = 0.50
  - Random model: r = 0.51
  - Equal weighting: r = 0.60
  - Cross-validated regression: r = 0.57
  - Optimal model: r = 0.69
3. Oregon Graduate Student GPA Prediction
- Sample: 90 Illinois students (same as above)
- Judges: 41 University of Oregon graduate students
- Results:
  - Average judge: r = 0.37
  - Judge's model: r = 0.43
  - Random model: r = 0.51
  - Equal weighting: r = 0.60
  - Cross-validated regression: r = 0.57
  - Optimal model: r = 0.69
4. Oregon Faculty Ratings Prediction (Dawes, 1971)
- Sample: 111 graduate students (1964-1967 cohort)
- Predictors: GRE, undergraduate GPA, selectivity of undergraduate institution
- Judges: Admissions committee members
- Results:
  - Admissions committee: r = 0.19
  - Cross-validated proper model: r = 0.38
  - Equal-weighting model: r = 0.48
5. Ellipse Value Prediction (Yntema & Torgerson, 1961)
- Task: Predict experimenter-assigned values based on size, eccentricity, grayness
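- Formula: value = ij + kj + ik, where i, j, and k are the ellipse's values on the three dimensions (size, eccentricity, grayness)
- Results:
  - Average judge: r = 0.84
  - Judge's model: r = 0.89
  - Random model: r = 0.84
  - Equal weighting: r = 0.97
  - Optimal model: r = 0.97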
6. Marital Happiness Prediction
- Sample: 42 couples (Alexander, 1971; Howard & Dawes, 1976)
- Improper model: Frequency of lovemaking - Frequency of arguments
- Replication 1 (Oregon): 27 couples, r = 0.40 (p < 0.05)
- Replication 2 (Texas, Thornton, 1977): 28 couples, r = 0.81 (p < 0.01)
- Qualitative finding: 30/30 happy couples had more sex than arguments; 12/12 unhappy couples argued more than they had sex
7. Hodgkin's Disease Survival (Einhorn, 1972)
- Sample: 193 patients with Hodgkin's disease
- Expert doctors coded biopsies and made overall severity ratings
- Finding: Doctors' overall ratings did NOT predict survival (r ≈ 0)
- Linear model using doctors' coded variables DID predict survival
8. Denver Police Bullet Selection (Hammond & Adelman, 1976)
- Context: Public policy decision on police ammunition
- Dimensions identified: Stopping effectiveness, probability of serious injury, probability of harm to bystanders
- Method: Equal weighting of three dimensions with expert ballistics ratings
- Outcome: Identified bullet superior to both existing bullet and police chief's recommendation
- Result: Accepted by City Council and implemented
9. Bankruptcy Prediction (Libby, 1976)
- Sample: 60 firms (30 bankrupt within 3 years)
- Judges: 16 small bank loan officers + 27 large bank loan officers
- Predictors: 5 financial ratios
- Results:
  - Loan officers: 74% accuracy
  - Paramorphic models: 72% accuracy
  - Goldberg's rescaled models: 77% of models beat judges
  - Proper linear model: ~78% accuracy (Beaver, 1966; Deakin, 1972)
  - Simplest model (assets/liabilities ratio): 80% accuracy
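Several of the studies above compare each judge with a "paramorphic" model, that is, a linear model fitted to the judge's own predictions and then used in the judge's place. The following is a minimal sketch of that idea on synthetic data, not a reconstruction of any study's analysis. The helper names, the simulated judge, and all numeric settings are assumptions made for illustration.

```python
import numpy as np


def standardize(x):
    """Column-wise z-scores (mean 0, standard deviation 1)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


def paramorphic_predictions(cues, judge_ratings):
    """Regress the judge's ratings on the cues; return the fitted values.

    The fitted values are the output of a linear model *of the judge*,
    not a model of the criterion.
    """
    z = standardize(cues)
    design = np.column_stack([np.ones(len(z)), z])
    coef, *_ = np.linalg.lstsq(design, judge_ratings, rcond=None)
    return design @ coef


def validity(predictions, criterion):
    """Pearson correlation between predictions and the criterion."""
    return float(np.corrcoef(predictions, criterion)[0, 1])


# Synthetic illustration (all values below are assumptions).
rng = np.random.default_rng(0)
n_cases, n_cues = 500, 4
cues = rng.normal(size=(n_cases, n_cues))
criterion = cues.sum(axis=1) + rng.normal(scale=2.0, size=n_cases)

# Hypothetical judge: a sensible linear policy applied inconsistently.
policy = np.array([0.5, 0.3, 0.1, 0.1])
judge = cues @ policy + rng.normal(scale=1.0, size=n_cases)

model_of_judge = paramorphic_predictions(cues, judge)
judge_validity = validity(judge, criterion)
model_validity = validity(model_of_judge, criterion)

print(f"judge validity:          {judge_validity:.2f}")
print(f"model-of-judge validity: {model_validity:.2f}")
```

In this simulation the model of the judge does better than the judge because the regression keeps the judge's policy and discards the case-to-case inconsistency. Real judges may depart from this setup in ways the sketch does not capture.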
Key Statistical Findings
Random Model Methodology:
- Weights selected from normal distribution (unit variance)
- Sign determined a priori based on expected relationship to criterion
- 10,000 random models constructed per example
- Variables must be standardized before weighting
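The procedure listed above can be sketched in a few lines. This is an illustrative reading of the steps rather than the original code: it assumes that "sign determined a priori" means taking the absolute value of each normal draw and multiplying by the expected sign, and it uses synthetic data. The function names and data-generating settings are hypothetical.

```python
import numpy as np


def standardize(x):
    """Column-wise z-scores (mean 0, standard deviation 1)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


def random_model_validities(x, y, signs, n_models=10_000, seed=0):
    """Validity (Pearson r with y) of each random linear model.

    Assumption: weight magnitudes are |N(0, 1)| draws and the sign of
    each weight is fixed in advance by `signs` (+1 or -1 per predictor).
    """
    rng = np.random.default_rng(seed)
    z = standardize(x)
    weights = np.abs(rng.normal(size=(n_models, z.shape[1]))) * signs
    composites = z @ weights.T              # shape: (n_cases, n_models)
    zc = standardize(composites)
    zy = (y - y.mean()) / y.std()
    return (zc * zy[:, None]).mean(axis=0)  # one correlation per model


def equal_weight_validity(x, y, signs):
    """Validity of the sign-adjusted sum of standardized predictors."""
    composite = standardize(x) @ signs
    return float(np.corrcoef(composite, y)[0, 1])


# Synthetic data: five positively intercorrelated predictors (assumed).
rng = np.random.default_rng(42)
n_cases, n_predictors = 300, 5
x = rng.normal(size=(n_cases, n_predictors)) + 0.5 * rng.normal(size=(n_cases, 1))
true_weights = np.array([1.0, 0.8, 0.6, 0.4, 0.2])
y = x @ true_weights + rng.normal(scale=2.0, size=n_cases)
signs = np.ones(n_predictors)  # every predictor expected to relate positively

random_validities = random_model_validities(x, y, signs)
equal_validity = equal_weight_validity(x, y, signs)

print(f"mean random-model validity: {random_validities.mean():.2f}")
print(f"equal-weight validity:      {equal_validity:.2f}")
```

Standardizing first matters because otherwise a predictor measured on a large scale would dominate the composite regardless of the weight it was given.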
Mathematical Insight:
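Equal weighting must outperform the average random model when all predictors are positively correlated with the criterion. With M standardized predictors, the correlation between their average and Y is (Σrᵢ)/(M + M(M-1)r̄)^0.5, where rᵢ is the correlation of predictor Xᵢ with Y and r̄ is the average intercorrelation among the predictors. Rearranging, this equals the average rᵢ multiplied by (M/(1 + (M-1)r̄))^0.5, a factor greater than 1 whenever r̄ < 1, so the composite's validity exceeds the average rᵢ.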
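The formula can be checked numerically. The sketch below, on synthetic data with assumed settings, computes the composite's validity two ways: directly from the averaged z-scores, and from the formula using only the predictor-criterion correlations and the mean predictor intercorrelation. The function name `unit_weight_validity` is hypothetical.

```python
import numpy as np


def unit_weight_validity(r_xy, r_bar):
    """Correlation of the average of M standardized predictors with Y.

    r_xy  : correlations of each predictor with the criterion
    r_bar : mean intercorrelation among the predictors
    """
    m = len(r_xy)
    return float(np.sum(r_xy) / np.sqrt(m + m * (m - 1) * r_bar))


# Synthetic, positively intercorrelated predictors (assumed settings).
rng = np.random.default_rng(1)
n_cases, m = 400, 4
common = rng.normal(size=(n_cases, 1))
x = 0.6 * common + rng.normal(size=(n_cases, m))
y = x.sum(axis=1) + rng.normal(scale=3.0, size=n_cases)

# Sample correlation matrix of [X1..XM, Y].
corr = np.corrcoef(np.column_stack([x, y]), rowvar=False)
r_xy = corr[:m, m]
r_xx = corr[:m, :m]
r_bar = (r_xx.sum() - m) / (m * (m - 1))   # mean off-diagonal entry

# Direct computation from the data.
z = (x - x.mean(axis=0)) / x.std(axis=0)
direct = float(np.corrcoef(z.mean(axis=1), y)[0, 1])
from_formula = unit_weight_validity(r_xy, r_bar)

print(f"direct:       {direct:.4f}")
print(f"from formula: {from_formula:.4f}")
print(f"mean r_i:     {r_xy.mean():.4f}")
```

The two computations agree because the variance of a sum of M standardized variables is M plus the sum of the off-diagonal correlations, which is M + M(M-1)r̄.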
Strengths
Methodological Strengths
- Comprehensive Evidence Base: Synthesizes multiple independent studies across diverse domains (clinical psychology, academic admissions, marriage counseling, medical prognosis, public policy)
- Rigorous Statistical Approach: Uses appropriate cross-validation, demonstrates robustness across different weighting schemes, includes both correlational and classification accuracy metrics
- Replication: Key findings replicated across institutions (Oregon, Illinois, Missouri, Texas) and populations
- Mathematical Justification: Provides theoretical explanation for why equal-weighting works, not just empirical demonstration
- Practical Applications: Goes beyond academic demonstration to real-world impact (Denver police bullet selection actually implemented)
- Honest Effect Sizes: Doesn't overstate findings—acknowledges that correlations are often modest (r = 0.40-0.60 range) while still demonstrating superiority to clinical judgment
- Addresses Multiple Comparison Points: Compares judges not just to optimal models but to paramorphic models, random models, and equal-weighting schemes
Conceptual Strengths
- Clear Distinction: Articulates that humans excel at variable selection and coding, not integration—a crucial insight for division of labor
- Challenges Conventional Wisdom: Directly confronts the assumption that expert judgment should outperform "mere formulas"
- Ethical Framework: Frames the use of linear models as an ethical imperative when they demonstrably serve clients better
- Anticipates and Addresses Objections: Systematically responds to technical, psychological, and ethical criticisms
- Parsimony: Demonstrates that simpler is often better—equal weighting frequently beats more complex approaches
Weaknesses and Limitations
Methodological Limitations
- Short-Term Criteria: Most criterion variables are proximal (GPA, initial diagnosis) rather than ultimate outcomes (career success, treatment response). The 20-year follow-up needed for "professional self-actualization" was never feasible.
- Sample Restrictions: Studies necessarily limited to those accepted/selected, creating range restriction and negative covariance structures between predictors that complicate interpretation
- Limited Criterion Reliability Information: While some criteria show acceptable reliability (faculty ratings η² = 0.67), others have unknown or potentially low reliability, which caps possible validity
- Variable Selection Already Done: All studies use pre-selected variables; doesn't address the crucial question of which variables to include in the first place (though Dawes acknowledges humans are good at this)
- Monotonicity Assumption: Linear models work best when relationships are conditionally monotone; paper doesn't extensively test robustness to violations
- Judge Expertise Questioned: While Dawes addresses this, critics could argue the "right" experts weren't used (though 25 years of research failed to produce counterexamples)
Statistical Limitations
- No Confidence Intervals: Point estimates of correlations provided without standard errors or confidence bounds
- Limited Discussion of When Models Fail: Libby (1976) initially appeared to show bootstrapping failure; Goldberg's rescaling saved the finding, but this suggests boundary conditions not fully explored
- Unclear Generalizability of Random Weights: The specific distribution (normal, rectangular) might matter more in some contexts
- Cross-Validation Shrinkage: While acknowledged, the practical implications of shrinkage for small samples could be more thoroughly addressed
- Insufficient Detail on Variable Standardization: Critical importance of standardization mentioned but procedural details sometimes sparse
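The shrinkage point can be made concrete with a small simulation. The sketch below fits regression weights on a small training sample, then compares their holdout validity with that of unit weights. It is deliberately favorable to unit weighting, because the simulated true weights are equal and all signs are assumed known, so it illustrates the mechanism rather than estimating its size in any real dataset. All names and settings are hypothetical.

```python
import numpy as np


def holdout_validities(n_train=20, n_test=2000, m=5, noise_sd=2.0,
                       n_trials=200, seed=0):
    """Mean holdout validity of OLS weights vs. unit weights.

    Assumptions: independent normal predictors, equal true weights,
    and positive signs known in advance.
    """
    rng = np.random.default_rng(seed)
    ols_r, unit_r = [], []
    for _ in range(n_trials):
        n = n_train + n_test
        x = rng.normal(size=(n, m))
        y = x.sum(axis=1) + rng.normal(scale=noise_sd, size=n)
        x_tr, y_tr = x[:n_train], y[:n_train]
        x_te, y_te = x[n_train:], y[n_train:]

        # Standardize with training-sample statistics only.
        mu, sd = x_tr.mean(axis=0), x_tr.std(axis=0)
        z_tr, z_te = (x_tr - mu) / sd, (x_te - mu) / sd

        # "Proper" model: least-squares weights estimated on the small sample.
        design = np.column_stack([np.ones(n_train), z_tr])
        coef, *_ = np.linalg.lstsq(design, y_tr, rcond=None)
        ols_pred = np.column_stack([np.ones(n_test), z_te]) @ coef

        # "Improper" model: add up the standardized predictors.
        unit_pred = z_te.sum(axis=1)

        ols_r.append(np.corrcoef(ols_pred, y_te)[0, 1])
        unit_r.append(np.corrcoef(unit_pred, y_te)[0, 1])
    return float(np.mean(ols_r)), float(np.mean(unit_r))


ols_mean, unit_mean = holdout_validities()
print(f"mean holdout validity, OLS weights:  {ols_mean:.2f}")
print(f"mean holdout validity, unit weights: {unit_mean:.2f}")
```

Raising `n_train` narrows the gap, since the estimated weights become more stable. With strongly unequal true weights and enough data, the regression model would be expected to overtake unit weighting.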
Conceptual Limitations
- Configurality Not Addressed: Human judges might detect important interactions or nonlinear patterns that linear models miss (though no evidence presented that they do)
- Limited Discussion of Model Updating: Static models vs. dynamic human judgment that can incorporate new information
- Context Dependence: Most examples from academic/clinical psychology; generalizability to other domains (business, engineering) assumed but not proven
- Doesn't Address "Why": Explains that equal weighting works but less exploration of the psychological mechanisms that cause human judgment to fail
- Implementation Barriers Underexplored: Beyond listing objections, doesn't deeply engage with organizational/political obstacles to adoption
Domain-Specific Limitations
- Marital Happiness Model Oversimplified: The lovemaking minus fighting formula is acknowledged as "not very profound" and likely works only for couples without severe pathology
- Bullet Selection Had Only 8 Pareto-Optimal Options: Any reasonable weighting would select one of these eight, limiting the test of the approach
- Graduate Admissions Context-Bound: What predicts GPA may not predict research creativity, clinical skill, or other important outcomes
- Psychiatric Diagnosis Binary: Neurosis vs. psychosis is an outdated diagnostic framework; unclear if findings generalize to modern diagnostic systems
Presentation Weaknesses
- Polemical Tone: While engaging, the paper sometimes seems more advocacy than balanced scientific reporting (e.g., "cognitive conceit")
- Selective Literature Review: Focuses heavily on supportive studies; while noting 25 years of failed counterexamples, could more systematically review contradictory evidence
- Ethical Arguments Potentially Overstated: Calling alternative approaches "unethical" may be too strong when genuine uncertainty exists
- Limited Discussion of Costs: Implementation costs, development time, and maintenance of linear models not thoroughly analyzed
Critical Analysis
What This Paper Established
This paper demonstrated that the weights in linear prediction models matter far less than commonly assumed. The central insight is that knowing which variables to include and their directional relationship to the outcome is the hard part; once that is done, equal weighting performs remarkably well, often better than regression weights on cross-validation (for example, r = 0.60 vs. 0.57 for GPA prediction and r = 0.48 vs. 0.38 for Oregon faculty ratings) and nearly always better than human judgment.
What Remains Uncertain
- Ultimate Outcome Prediction: Do these findings hold for truly long-term, complex outcomes?
- Boundary Conditions: Under what circumstances DO clinical judges add value?
- Variable Selection Process: How do we systematically identify the "right" variables to include?
- Dynamic Environments: How do linear models perform when the prediction environment is rapidly changing?
Impact and Legacy
This paper, combined with Meehl's earlier work, fundamentally challenged clinical psychology's reliance on expert judgment. Its influence extended to:
- Evidence-based medicine movements
- Actuarial risk assessment in criminal justice
- Automated decision-making in finance
- Machine learning feature engineering philosophy
The core insight—that simple, transparent algorithms often beat human experts—remains controversial but empirically robust across 45 years of subsequent research.
Contemporary Relevance
In the era of complex machine learning models, Dawes' findings about equal weighting remain surprisingly relevant:
- Occam's Razor for model selection
- Interpretability vs. complexity trade-offs
- Baseline models for comparison
- Division of labor between human judgment (variable selection) and algorithms (integration)
Conclusion
This paper represents a landmark contribution demonstrating that improper linear models, particularly equal-weighting schemes, can match or exceed both expert judgment and cross-validated regression models in prediction tasks. While limited by short-term criteria and specific domains, the breadth of evidence and theoretical grounding make this a foundational work in decision science. The practical message remains powerful: for prediction with multiple numerical inputs, simply standardizing the variables, orienting each in its expected direction, and adding them together often provides a forecast that is hard to beat, especially in resource-constrained settings where optimal model development is impractical.
The ethical framework—that we owe our clients the best available decision method—continues to resonate, though implementation barriers remain substantial. Dawes' insights laid groundwork for evidence-based practice across multiple disciplines and anticipated modern debates about algorithm aversion and human-AI collaboration.