Content is user-generated and unverified.

The Robust Beauty of Improper Linear Models in Decision Making

Dawes, R.M. (1979). American Psychologist, 34(7), 571-582

Executive Summary

This seminal paper demonstrates that even "improper" linear models—where predictor weights are chosen non-optimally (equal weighting, random weighting, or intuitive weighting)—consistently outperform expert clinical judgment in prediction tasks. Building on Meehl's 1954 finding that optimal ("proper") linear models beat human judges, Dawes shows that the key insight is simply knowing which variables to consider and their directional relationships to outcomes, not finding optimal weights. The practical implication is profound: simple equal-weighting schemes can provide superior predictions compared to expert intuition across domains from graduate admissions to psychiatric diagnosis to bankruptcy prediction.

Author and Institutional Affiliations

Primary Author: Robyn M. Dawes, Ph.D.

Affiliations:

University of Oregon, Department of Psychology (primary affiliation)
Decision Research, Inc., Eugene, Oregon
James McKeen Cattell Sabbatical Fellow at:
- University of Michigan, Psychology Department
- Institute for Social Research, Research Center for Group Dynamics

Academic Context: Work began at the University of Oregon and was completed during a sabbatical fellowship at the University of Michigan in 1978-1979.

Conflict of Interest Assessment

Declared Conflicts: None explicitly stated

Potential Conflicts:

The author uses his own previous research extensively (particularly graduate admissions at University of Oregon)
Academic self-interest in promoting statistical methods over clinical judgment
No industry funding mentioned
No financial relationships disclosed

Assessment: This appears to be purely academic research with no apparent commercial or financial conflicts. The work challenges professional practices of clinical psychologists and admissions committees, which could create professional resistance, but represents no financial conflict.

Data Review

Studies Examined

1. Neurosis vs. Psychosis Diagnosis (Goldberg, 1970)

Sample: 861 psychiatric patients with MMPI profiles
Judges: 29 clinical psychologists of varying experience
Task: Predict diagnosis (neurosis=0, psychosis=1) from 11 MMPI scores
Results:
- Average judge validity: r = 0.28
- Judge's paramorphic model (a linear regression fitted to the judge's own predictions): r = 0.31
- Random linear model: r = 0.30
- Equal-weighting model: r = 0.34
- Optimal linear model: r = 0.46

2. Illinois Graduate Student GPA Prediction (Wiggins & Kohen, 1971)

Sample: 90 first-year psychology graduate students
Predictors: 10 variables (GRE, GPA, peer ratings, self-ratings)
Judges: 80 graduate students at University of Illinois
Results:
- Average judge validity: r = 0.33
- Judge's model: r = 0.50
- Random model: r = 0.51
- Equal weighting: r = 0.60
- Cross-validated regression: r = 0.57
- Optimal model: r = 0.69

3. Oregon Graduate Student GPA Prediction

Sample: 90 Illinois students (same as above)
Judges: 41 University of Oregon graduate students
Results:
- Average judge: r = 0.37
- Judge's model: r = 0.43
- Random model: r = 0.51
- Equal weighting: r = 0.60
- Cross-validated regression: r = 0.57
- Optimal model: r = 0.69

4. Oregon Faculty Ratings Prediction (Dawes, 1971)

Sample: 111 graduate students (1964-1967 cohort)
Predictors: GRE, undergraduate GPA, selectivity of undergraduate institution
Judges: Admissions committee members
Results:
- Admissions committee: r = 0.19
- Cross-validated proper model: r = 0.38
- Equal-weighting model: r = 0.48

5. Ellipse Value Prediction (Yntema & Torgerson, 1961)

Task: Predict experimenter-assigned values based on size, eccentricity, grayness
Formula: value = ij + kj + ik, where i, j, and k are the ellipse's levels of size, eccentricity, and grayness
Results:
- Average judge: r = 0.84
- Judge's model: r = 0.89
- Random model: r = 0.84
- Equal weighting: r = 0.97
- Optimal model: r = 0.97

6. Marital Happiness Prediction

Sample: 42 couples (Alexander, 1971; Howard & Dawes, 1976)
Improper model: Frequency of lovemaking - Frequency of arguments
Replication 1 (Oregon): 27 couples, r = 0.40 (p < 0.05)
Replication 2 (Texas, Thornton, 1977): 28 couples, r = 0.81 (p < 0.01)
Qualitative finding: only 2 of the 30 happily married couples argued more often than they had intercourse; all 12 unhappily married couples argued more often than they had intercourse

7. Hodgkin's Disease Survival (Einhorn, 1972)

Sample: 193 patients with Hodgkin's disease
Expert doctors coded biopsies and made overall severity ratings
Finding: Doctors' overall ratings did NOT predict survival (r ≈ 0)
Linear model using doctors' coded variables DID predict survival

8. Denver Police Bullet Selection (Hammond & Adelman, 1976)

Context: Public policy decision on police ammunition
Dimensions identified: Stopping effectiveness, probability of serious injury, probability of harm to bystanders
Method: Equal weighting of three dimensions with expert ballistics ratings
Outcome: Identified bullet superior to both existing bullet and police chief's recommendation
Result: Accepted by City Council and implemented
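
The equal-weighting step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Hammond and Adelman's actual procedure: the `Bullet` type, the 0-1 rating scale, the choice to treat lower injury and bystander-harm ratings as better, and every rating value are hypothetical.

```python
from typing import NamedTuple


class Bullet(NamedTuple):
    name: str
    stopping: float   # stopping effectiveness, higher is better (assumed 0-1 scale)
    injury: float     # probability of serious injury, lower is better
    bystander: float  # probability of harm to bystanders, lower is better


def equal_weight_score(b: Bullet) -> float:
    """Average the three ratings after orienting each so that higher is better."""
    return (b.stopping + (1.0 - b.injury) + (1.0 - b.bystander)) / 3.0


def dominates(a: Bullet, b: Bullet) -> bool:
    """True if a is at least as good as b on every dimension and strictly better on one."""
    at_least = (a.stopping >= b.stopping and a.injury <= b.injury
                and a.bystander <= b.bystander)
    strictly = (a.stopping > b.stopping or a.injury < b.injury
                or a.bystander < b.bystander)
    return at_least and strictly


# Hypothetical ratings, invented for illustration; not data from the study.
candidates = [
    Bullet("A", stopping=0.9, injury=0.8, bystander=0.7),
    Bullet("B", stopping=0.7, injury=0.3, bystander=0.2),
    Bullet("C", stopping=0.4, injury=0.2, bystander=0.2),
    Bullet("D", stopping=0.6, injury=0.4, bystander=0.3),
]

best = max(candidates, key=equal_weight_score)
undominated = [c.name for c in candidates
               if not any(dominates(other, c) for other in candidates)]
print("equal-weight choice:", best.name)
print("undominated options:", undominated)
```

With strictly positive weights the top-scoring option can never be a dominated one, so the equal-weight choice always comes from the undominated set. Within that set, different weightings can still pick different options.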

9. Bankruptcy Prediction (Libby, 1976)

Sample: 60 firms (30 bankrupt within 3 years)
Judges: 16 small bank loan officers + 27 large bank loan officers
Predictors: 5 financial ratios
Results:
- Loan officers: 74% accuracy
- Paramorphic models: 72% accuracy
- Goldberg's rescaled models: 77% of models beat judges
- Proper linear model: ~78% accuracy (Beaver, 1966; Deacon, 1972)
- Simplest model (assets/liabilities ratio): 80% accuracy

Key Statistical Findings

Random Model Methodology:

Weights selected from normal distribution (unit variance)
Sign determined a priori based on expected relationship to criterion
10,000 random models constructed per example
Variables must be standardized before weighting
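
The procedure listed above can be sketched roughly as follows. This is an illustrative simulation, not a reproduction of the paper's analysis: the synthetic data generator, the function names, and the reduced number of random models (500 rather than 10,000) are assumptions. Taking the absolute value of a normal draw and attaching the a priori sign is one way to obtain a random weight with a fixed sign.

```python
import random
import statistics


def standardize(values):
    """Rescale a list of numbers to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def composite(columns, weights):
    """Weighted sum of standardized predictor columns, one score per case."""
    z_columns = [standardize(col) for col in columns]
    n_cases = len(columns[0])
    return [sum(w * z[i] for w, z in zip(weights, z_columns))
            for i in range(n_cases)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    zx, zy = standardize(x), standardize(y)
    return statistics.fmean([a * b for a, b in zip(zx, zy)])


def random_model_validity(columns, criterion, signs, n_models, rng):
    """Average validity of models with random weight sizes and a priori signs."""
    validities = []
    for _ in range(n_models):
        # Magnitude from a unit-variance normal; sign fixed in advance (assumption).
        weights = [s * abs(rng.gauss(0.0, 1.0)) for s in signs]
        validities.append(pearson(composite(columns, weights), criterion))
    return statistics.fmean(validities)


def make_synthetic_data(n_cases, rng):
    """Hypothetical data: three predictors sharing a common factor, noisy criterion."""
    columns = [[], [], []]
    criterion = []
    for _ in range(n_cases):
        common = rng.gauss(0.0, 1.0)
        xs = [common + rng.gauss(0.0, 1.0) for _ in range(3)]
        for col, x in zip(columns, xs):
            col.append(x)
        criterion.append(0.5 * xs[0] + 0.3 * xs[1] + 0.2 * xs[2]
                         + rng.gauss(0.0, 1.0))
    return columns, criterion


rng = random.Random(0)
columns, criterion = make_synthetic_data(200, rng)
signs = [1, 1, 1]  # every predictor is expected to relate positively to the criterion

equal_validity = pearson(composite(columns, signs), criterion)
random_validity = random_model_validity(columns, criterion, signs, 500, rng)
print(f"equal weights:        r = {equal_validity:.2f}")
print(f"average random model: r = {random_validity:.2f}")
```

Standardizing inside `composite` keeps the sketch faithful to the last point in the list: without it, a predictor measured on a large scale would dominate the sum regardless of its weight.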

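Mathematical Insight: When every predictor correlates positively with the criterion, equal weighting must outperform the average random model. For M standardized predictors, the correlation between their sum (or average) and Y is (Σrᵢ)/(M + M(M-1)r̄)^0.5, where rᵢ is the correlation of predictor i with Y and r̄ is the average intercorrelation among the predictors. Because the denominator is smaller than M whenever r̄ < 1, this composite correlation exceeds the average rᵢ.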
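
A short numerical check of the formula above, as a sketch only. The function name and the example correlations are hypothetical, chosen to illustrate the arithmetic and not taken from the paper.

```python
import math


def unit_weight_validity(validities, intercorrelations):
    """Correlation between the sum of M standardized predictors and the criterion.

    validities: the M predictor-criterion correlations r_i.
    intercorrelations: the M*(M-1)/2 correlations between distinct predictors.
    """
    m = len(validities)
    r_bar = sum(intercorrelations) / len(intercorrelations)
    return sum(validities) / math.sqrt(m + m * (m - 1) * r_bar)


# Hypothetical correlations, for illustration only.
validities = [0.3, 0.4, 0.5]
intercorrelations = [0.2, 0.3, 0.4]  # r_12, r_13, r_23

composite_r = unit_weight_validity(validities, intercorrelations)
mean_r = sum(validities) / len(validities)
print(f"composite validity = {composite_r:.3f}, mean single validity = {mean_r:.3f}")
```

Here M = 3 and the average intercorrelation is 0.3, so the denominator is the square root of 4.8 (about 2.19), which is less than M. The composite validity therefore comes out higher than the mean of the three individual validities, and the advantage shrinks toward zero as the predictors become more redundant (as the average intercorrelation approaches 1).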

Strengths

Methodological Strengths

Comprehensive Evidence Base: Synthesizes multiple independent studies across diverse domains (clinical psychology, academic admissions, marriage counseling, medical prognosis, public policy)
Rigorous Statistical Approach: Uses appropriate cross-validation, demonstrates robustness across different weighting schemes, includes both correlational and classification accuracy metrics
Replication: Key findings replicated across institutions (Oregon, Illinois, Missouri, Texas) and populations
Mathematical Justification: Provides theoretical explanation for why equal-weighting works, not just empirical demonstration
Practical Applications: Goes beyond academic demonstration to real-world impact (Denver police bullet selection actually implemented)
Honest Effect Sizes: Doesn't overstate findings—acknowledges that correlations are often modest (r = 0.40-0.60 range) while still demonstrating superiority to clinical judgment
Addresses Multiple Comparison Points: Compares judges not just to optimal models but to paramorphic models, random models, and equal-weighting schemes

Conceptual Strengths

Clear Distinction: Articulates that humans excel at variable selection and coding, not integration—a crucial insight for division of labor
Challenges Conventional Wisdom: Directly confronts the assumption that expert judgment should outperform "mere formulas"
Ethical Framework: Frames the use of linear models as an ethical imperative when they demonstrably serve clients better
Anticipates and Addresses Objections: Systematically responds to technical, psychological, and ethical criticisms
Parsimony: Demonstrates that simpler is often better—equal weighting frequently beats more complex approaches

Weaknesses and Limitations

Methodological Limitations

Short-Term Criteria: Most criterion variables are proximal (GPA, initial diagnosis) rather than ultimate outcomes (career success, treatment response). The 20-year follow-up needed for "professional self-actualization" was never feasible.
Sample Restrictions: Studies necessarily limited to those accepted/selected, creating range restriction and negative covariance structures between predictors that complicate interpretation
Limited Criterion Reliability Information: While some criteria show acceptable reliability (faculty ratings η² = 0.67), others have unknown or potentially low reliability, which caps possible validity
Variable Selection Already Done: All studies use pre-selected variables; doesn't address the crucial question of which variables to include in the first place (though Dawes acknowledges humans are good at this)
Monotonicity Assumption: Linear models work best when relationships are conditionally monotone; paper doesn't extensively test robustness to violations
Judge Expertise Questioned: While Dawes addresses this, critics could argue the "right" experts weren't used (though 25 years of research failed to produce counterexamples)

Statistical Limitations

No Confidence Intervals: Point estimates of correlations provided without standard errors or confidence bounds
Limited Discussion of When Models Fail: Libby (1976) initially appeared to show bootstrapping failure; Goldberg's rescaling saved the finding, but this suggests boundary conditions not fully explored
Unclear Generalizability of Random Weights: The specific distribution (normal, rectangular) might matter more in some contexts
Cross-Validation Shrinkage: While acknowledged, the practical implications of shrinkage for small samples could be more thoroughly addressed
Insufficient Detail on Variable Standardization: Critical importance of standardization mentioned but procedural details sometimes sparse

Conceptual Limitations

Configurality Not Addressed: Human judges might detect important interactions or nonlinear patterns that linear models miss (though no evidence presented that they do)
Limited Discussion of Model Updating: Static models vs. dynamic human judgment that can incorporate new information
Context Dependence: Most examples from academic/clinical psychology; generalizability to other domains (business, engineering) assumed but not proven
Doesn't Address "Why": Explains that equal weighting works but less exploration of the psychological mechanisms that cause human judgment to fail
Implementation Barriers Underexplored: Beyond listing objections, doesn't deeply engage with organizational/political obstacles to adoption

Domain-Specific Limitations

Marital Happiness Model Oversimplified: The lovemaking minus fighting formula is acknowledged as "not very profound" and likely works only for couples without severe pathology
Bullet Selection Had Only 8 Pareto-Optimal Options: Any reasonable weighting would select one of these eight, limiting the test of the approach
Graduate Admissions Context-Bound: What predicts GPA may not predict research creativity, clinical skill, or other important outcomes
Psychiatric Diagnosis Binary: Neurosis vs. psychosis is an outdated diagnostic framework; unclear if findings generalize to modern diagnostic systems

Presentation Weaknesses

Polemical Tone: While engaging, the paper sometimes seems more advocacy than balanced scientific reporting (e.g., "cognitive conceit")
Selective Literature Review: Focuses heavily on supportive studies; while noting 25 years of failed counterexamples, could more systematically review contradictory evidence
Ethical Arguments Potentially Overstated: Calling alternative approaches "unethical" may be too strong when genuine uncertainty exists
Limited Discussion of Costs: Implementation costs, development time, and maintenance of linear models not thoroughly analyzed

Critical Analysis

What This Paper Established

This paper made a strong empirical case that the weights in linear prediction models matter far less than commonly assumed. The hard part is knowing which variables to include and the direction of each variable's relationship to the outcome; once that is done, equal weighting performs remarkably well, often better than regression weights on cross-validation and nearly always better than human judgment.

What Remains Uncertain

Ultimate Outcome Prediction: Do these findings hold for truly long-term, complex outcomes?
Boundary Conditions: Under what circumstances DO clinical judges add value?
Variable Selection Process: How do we systematically identify the "right" variables to include?
Dynamic Environments: How do linear models perform when the prediction environment is rapidly changing?

Impact and Legacy

This paper, combined with Meehl's earlier work, fundamentally challenged clinical psychology's reliance on expert judgment. Its influence extended to:

Evidence-based medicine movements
Actuarial risk assessment in criminal justice
Automated decision-making in finance
Machine learning feature engineering philosophy

The core insight—that simple, transparent algorithms often beat human experts—remains controversial but empirically robust across 45 years of subsequent research.

Contemporary Relevance

In the era of complex machine learning models, Dawes' findings about equal weighting remain surprisingly relevant:

Occam's Razor for model selection
Interpretability vs. complexity trade-offs
Baseline models for comparison
Division of labor between human judgment (variable selection) and algorithms (integration)

Conclusion

This paper represents a landmark contribution demonstrating that improper linear models, particularly equal-weighting schemes, can exceed expert judgment and can match or exceed cross-validated regression models in prediction tasks. While limited by short-term criteria and specific domains, the breadth of evidence and theoretical grounding make this a foundational work in decision science. The practical message remains powerful: for prediction with multiple numerical inputs, simply standardizing the variables, orienting each in its expected direction, and adding them together often provides a forecast as good as any available, especially in resource-constrained settings where optimal model development is impractical.

The ethical framework—that we owe our clients the best available decision method—continues to resonate, though implementation barriers remain substantial. Dawes' insights laid groundwork for evidence-based practice across multiple disciplines and anticipated modern debates about algorithm aversion and human-AI collaboration.
