A Framework for Hypothesis Evaluation: From Speculation to Evidence
Introduction
This framework provides a systematic approach to thinking through hypotheses and evaluating them as evidence accumulates. It synthesizes classical epistemology with modern data science to create a practical guide for navigating from early hunches through to well-supported conclusions.
The central challenge: How do we rationally navigate the journey from initial speculation—when data is sparse and uncertainty is high—to justified belief, as evidence accumulates and our understanding deepens?
Part I: The Nature of Evidence and Belief
What Counts as Evidence?
Before evaluating hypotheses, we must understand what evidence is and what it does.
Evidence can be:
- Propositional: Statements that are true or false ("the patient's creatinine is 2.5 mg/dL")
- Experiential: Direct observations and sensory data (the physical exam finding of ascites)
- Derived: Outputs from instruments, algorithms, or analytical processes
Evidence serves to:
- Justify belief: It provides rational grounds for accepting or rejecting propositions
- Discriminate between hypotheses: It helps us choose among competing explanations
- Increase or decrease probability: It shifts our confidence in particular claims
- Explain phenomena: Evidence is both what hypotheses must explain and, once accepted, part of what explains further observations
The Evidential Relationship
Evidence E relates to hypothesis H through several possible relationships:
| Relationship Type | Definition | Strength | When Applicable |
|---|---|---|---|
| Deductive | E logically entails H (or ~H) | Strongest | Mathematical proofs, formal logic |
| Probabilistic | E raises the probability of H: P(H\|E) > P(H) | Variable | Most empirical contexts |
| Explanatory | H best explains E | Moderate-Strong | Causal inference, theory selection |
| Instantial | E is a positive instance of H | Weak-Moderate | Universal generalizations |
Key insight: Different types of evidence require different evaluation frameworks. A single observation of a black raven provides weaker evidence for "all ravens are black" than a randomized trial provides for a treatment effect.
Part II: Stages of Hypothesis Development
Stage 0: Pre-Hypothesis (Observation and Pattern Recognition)
Characteristics:
- Raw observations without clear interpretation
- Vague intuitions or clinical hunches
- Unsystematic data collection
- No explicit hypothesis yet formulated
What to do:
- Document observations carefully: What exactly did you observe? Under what conditions?
- Distinguish observation from interpretation: Separate the raw data from your mental model
- Note patterns and anomalies: What stands out? What defies current understanding?
- Consider alternative explanations early: Resist premature commitment to a single interpretation
Modes of inference operative:
- Abductive reasoning: Inferring the best explanation for surprising observations
- Pattern recognition: Detecting structures in heterogeneous phenomena
Example:
Observation: Several patients on the same dialysis machine develop unexplained hypotension during treatment.
Not yet hypothesis-worthy: Just documented phenomena requiring explanation.
Pitfalls to avoid:
- Confirmation bias: Noticing only data that fits an emerging story
- Apophenia: Seeing patterns in random noise
- Premature theorizing: Committing to an explanation before sufficient observation
Stage 1: Hypothesis Formation (Conjecture)
Characteristics:
- Explicit statement of a testable proposition
- Minimal supporting evidence
- High uncertainty
- Multiple plausible alternatives exist
Essential questions:
- Is the hypothesis testable? Can evidence discriminate between H and ~H?
- Is it falsifiable? What would count as evidence against it?
- What are the alternatives? What other hypotheses could explain the same phenomena?
- What would constitute strong evidence? What kind of data would move you significantly?
Formulating hypotheses well:
Poor hypothesis: "Something is wrong with the dialysis process"
- Too vague to test
- No clear alternatives
- No specified relationship to observations
Better hypothesis: "The dialyzer membrane batch X-2347 causes complement activation leading to transient hypotension"
- Testable through membrane analysis and complement markers
- Falsifiable (could test other batches, measure complement)
- Specifies mechanism
- Suggests clear interventions
Modes of inference:
- Abduction: Generating hypothesis as best explanation for existing observations
- Analogy: Drawing on similar situations or mechanisms
- Theoretical deduction: Deriving hypotheses from established theories
At this stage, your hypothesis has:
- Coherence: Internal logical consistency
- Plausibility: Compatibility with established knowledge
- Explanatory potential: Accounts for known observations
- Minimal empirical support: Perhaps 1-3 anecdotal observations
Epistemic status: Conjecture - rational to entertain but not yet rational to believe
Stage 2: Initial Evidence Collection (Weak Support)
Characteristics:
- First systematic observations
- Small sample size (n = 5-20)
- Uncontrolled conditions
- High potential for confounding
Key activities:
1. Establish the evidential baseline
- What is P(H) before new evidence? (Your prior probability)
- What evidence do you already possess?
- What are the plausible alternatives and their priors?
2. Design initial tests
- What is minimally necessary to discriminate H from alternatives?
- What data is feasible to collect quickly?
- How can you maximize the severity of the test?
3. Apply probabilistic thinking
Use Bayes' theorem to think through evidence:
P(H|E) = P(H) × P(E|H) / P(E)
Where:
P(H|E) = Probability of hypothesis given evidence (posterior)
P(H) = Prior probability of hypothesis
P(E|H) = Likelihood (probability of evidence if hypothesis true)
P(E) = Marginal probability of evidence
Example continued:
You examine 15 patients who experienced hypotension:
- 12/15 were on machines using batch X-2347
- 3/15 were on machines using other batches
Evidence assessment:
- P(E|H) is moderate: If the batch is causative, we'd expect high overlap
- P(E|~H) is also moderate: If batch X-2347 supplies a large share of the machines, high overlap would occur even without any causal link
- This is suggestive but far from conclusive
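How suggestive the 12/15 overlap is depends on how widely batch X-2347 is used, which the example does not state. The sketch below computes the chance of seeing at least 12 of 15 cases on the batch if there were no causal link, for a few assumed usage shares. The shares (0.2, 0.5, 0.8) and the function name `binom_tail` are illustrative assumptions, not data from the example.

```python
from math import comb


def binom_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical shares of all dialysis sessions that used batch X-2347.
# Under ~H, each hypotension case lands on the batch with this probability.
for share in (0.2, 0.5, 0.8):
    tail = binom_tail(12, 15, share)
    print(f"batch share {share:.1f}: P(at least 12 of 15 on batch | ~H) = {tail:.4f}")
```

If the batch supplies most machines, the observed overlap is unremarkable; if it supplies a minority, the same overlap is hard to attribute to chance. The base usage rate is therefore one of the first things worth measuring.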
Evaluating weak evidence:
| Criterion | Assessment Method |
|---|---|
| Relevance | Does E actually bear on H, or could it be explained equally by background factors? |
| Reliability | How trustworthy is the measurement/observation? |
| Discrimination | Does E distinguish H from alternatives, or is it compatible with many hypotheses? |
| Severity | Would this test probably produce different results if H were false? |
Mayo's Severity Principle:
Evidence E provides strong support for H only if:
- E "fits" or "agrees with" H
- The test had a high probability of producing a result less compatible with H if H were false
At this stage, you likely fail the severity test. A small uncontrolled observation could easily produce these results even if H is false.
Epistemic status: Weak hypothesis - some evidence in favor, but underdetermination remains high. Rational to investigate further, but premature to act on or believe strongly.
Critical decision point:
- If evidence is positive but weak: Pursue more rigorous investigation
- If evidence is negative or equivocal: Reconsider hypothesis or reformulate
- If alternative explanations remain plausible: Design tests to discriminate between them
Stage 3: Accumulating Evidence (Building Support)
Characteristics:
- Moderate sample sizes (n = 20-200)
- Some control for confounding
- Replication attempts
- Consideration of alternative explanations
Key activities:
1. Expand and diversify evidence
- Temporal variation: Does the relationship hold over time?
- Population variation: Does it hold across different subgroups?
- Methodological variation: Do different measurement approaches yield consistent results?
- Contextual variation: Does it hold in different settings or conditions?
Why diversification matters:
The same evidence can support multiple hypotheses if it's not sufficiently varied. Consider Achinstein's requirement for high probability: evidence must be varied enough to make P(H|E) genuinely high, not just higher than P(H).
Example:
Weak diversity: Testing only morning dialysis sessions, only male patients, only one clinic
- Many confounders remain plausible
- Alternative explanations not ruled out
Strong diversity: Multiple times of day, both sexes, multiple clinics, different operators
- Systematic confounders less plausible
- Specificity of relationship becomes clearer
2. Test competing hypotheses directly
Don't just accumulate evidence for your hypothesis—actively seek evidence that would favor alternatives:
| Your Hypothesis (H1) | Alternative (H2) | Discriminating Evidence |
|---|---|---|
| Batch X-2347 membranes cause hypotension | Contaminated dialysate in one clinic | Test membranes from batch in different clinics |
| Complement activation mechanism | Endotoxin contamination | Measure complement markers AND endotoxin levels |
| Manufacturing defect | Storage/transport problem | Examine membranes from same batch with different storage histories |
3. Quantify uncertainty appropriately
As evidence accumulates, maintain explicit uncertainty estimates:
- Point estimate: What is your single best estimate of the effect size, or of P(H)?
- Confidence interval: What range of values is consistent with your evidence?
- Sensitivity analysis: How much would conclusions change with different assumptions?
4. Apply causal inference criteria
When the hypothesis involves causation (as most do), consider Bradford Hill's criteria:
| Criterion | Question | Status in Example |
|---|---|---|
| Strength | How large is the association? | Moderate (12/15 vs 3/15) |
| Consistency | Has it been replicated? | Needs testing across clinics |
| Specificity | Is the outcome specific to the exposure? | Unclear—other batches? |
| Temporality | Does cause precede effect? | Yes—batch introduced before symptoms |
| Biological gradient | Is there a dose-response? | Could test concentration/duration |
| Plausibility | Is there a known mechanism? | Yes—complement activation is known |
| Coherence | Does it fit with existing knowledge? | Yes—consistent with immunology |
| Experiment | Can we intervene? | Yes—can switch batches |
| Analogy | Are there similar known effects? | Yes—other membrane reactions known |
None of these alone is sufficient, but collectively they build the case.
5. Beware of black boxes
If your evidence comes from complex algorithms or processes:
- Interpretability: Can you explain why the evidence points to H?
- Generalizability: Will the relationship hold in new contexts?
- Robustness: Is the evidence stable to perturbations in methods?
Example: If an ML algorithm predicts hypotension risk, but you can't explain why certain features matter, your evidence is weaker than if you have a mechanistic understanding.
Epistemic status: Supported hypothesis - evidence makes H substantially more probable than alternatives. Rational to have moderate confidence, perhaps sufficient to take action in low-stakes decisions. Not yet justified as knowledge in high-stakes contexts.
Modes of inference:
- Induction: Generalizing from accumulating instances
- Probabilistic inference: Updating degrees of belief via Bayes' theorem
- Explanatory inference: Evaluating which hypothesis best unifies the diverse evidence
Stage 4: Strong Evidence (Approaching Justified Belief)
Characteristics:
- Large sample sizes (n = 200+) OR
- Highly controlled conditions (RCT) OR
- Multiple independent replications
- Systematic ruling out of confounders
- Mechanistic understanding
Key activities:
1. Achieve severity in testing
At this stage, your evidence should pass Mayo's severity criterion:
Severe test achieved when:
- You've designed tests specifically to refute H
- These tests had high power to detect alternatives if present
- The tests nevertheless yielded results consistent with H
- You've explored the hypothesis's failure modes and found it robust
Example progression:
Week 1: Noticed pattern in 15 patients (Stage 2)
Week 3: Confirmed association in 50 patients across 3 clinics (Stage 3)
Week 6: Conducted controlled comparison:
- Randomly assigned 100 patients to batch X-2347 vs. control batches
- Blinded outcome assessment
- Pre-specified primary endpoint: hypotension episodes
- Measured complement markers to confirm mechanism
- Results: 45% hypotension rate with X-2347 vs. 8% with controls (p<0.001)
- Complement elevation correlates with hypotension (r=0.72)
This is severe: High probability that if H were false, we would have seen different results
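The reported significance level can be sanity-checked with a two-proportion z-test. The example does not give the arm sizes, so the sketch below assumes 100 patients per arm (45 vs. 8 events); that split and the function name `two_proportion_z` are assumptions made for illustration.

```python
from math import erfc, sqrt


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Assumed arm sizes: 100 per arm (not stated in the example).
z, p = two_proportion_z(45, 100, 8, 100)
print(f"z = {z:.2f}, two-sided p = {p:.2e}")
```

A small p-value covers only the statistical side of severity. Randomization, blinding, and the pre-specified endpoint are what make it hard for the result to arise if H were false.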
2. Address the Duhem-Quine problem
Your evidence doesn't test H in isolation—it tests H plus all auxiliary assumptions. Make these explicit:
Hypothesis: Batch X-2347 membranes cause complement-mediated hypotension
Auxiliary assumptions being tested:
- The complement assays are valid
- The dialysis machines function properly
- The patient selection wasn't biased
- The hypotension measurement is accurate
- There are no unmeasured confounders
- The batch assignment was truly random
How to address:
- Bootstrap approach (in Glymour's confirmation-theory sense, not statistical resampling): Use other established hypotheses plus your evidence to derive instances of H
- Vary auxiliary assumptions: Test H under different measurement approaches, different populations, etc.
- Independent confirmation: Have others test H with completely different auxiliary hypotheses
3. Consider the total evidence
Your evidence for H exists within a broader web of belief. Assess coherence:
- Internal consistency: Do different pieces of evidence point the same direction?
- External consistency: Does H cohere with established scientific knowledge?
- Explanatory power: Does H unify and explain diverse phenomena?
- Predictive success: Can H successfully predict new observations?
4. Quantify strength of evidence
Use multiple metrics:
Probabilistic:
- Likelihood ratio: P(E|H) / P(E|~H)
- Bayes factor: Ratio of how well H and a rival hypothesis predict E; for two simple hypotheses it equals the likelihood ratio
- Posterior probability: P(H|E)
Frequentist:
- p-value: Probability of a result at least as extreme as E if the null hypothesis (here, ~H) is true
- Effect size: Magnitude of relationship
- Confidence interval: Range of plausible values
Qualitative:
- Number of independent replications
- Diversity of methods yielding consistent results
- Strength of mechanism understanding
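The probabilistic metrics above are linked by the odds form of Bayes' theorem: posterior odds equal prior odds times the likelihood ratio. A minimal sketch, using illustrative values (prior 0.10, likelihoods 0.70 and 0.20) and hypothetical function names:

```python
def likelihood_ratio(p_e_given_h: float, p_e_given_not_h: float) -> float:
    """How many times more probable E is under H than under ~H."""
    return p_e_given_h / p_e_given_not_h


def posterior_from_lr(prior: float, lr: float) -> float:
    """Convert prior to odds, multiply by the likelihood ratio, convert back."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * lr
    return posterior_odds / (1 + posterior_odds)


# Illustrative numbers only.
lr = likelihood_ratio(0.70, 0.20)
posterior = posterior_from_lr(0.10, lr)
print(f"LR = {lr:.1f}, posterior = {posterior:.2f}")
```

A likelihood ratio near 1 leaves the prior almost unchanged, which is why evidence that fits H and ~H about equally well carries little weight however much of it there is.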
At this stage:
- P(H|E) should be >0.90 for strong support
- Effect size should be clinically/scientifically meaningful
- Multiple independent lines of evidence should converge
- Mechanistic understanding should be present
Epistemic status: Justified belief - rational to believe H is true, sufficient to act upon in most contexts, appropriate to communicate as established finding.
However: Not immune to revision. Remains open to refutation by future evidence.
Stage 5: Established Knowledge (High Confidence)
Characteristics:
- Extensive replication across labs/groups
- Integration into theoretical frameworks
- Successful novel predictions
- Practical applications that work
- Consensus among experts
Distinguishing knowledge from justified belief:
Justified belief = H is highly probable given the evidence available to you
Knowledge = Justified belief that is:
- True (corresponds to reality—though we can't always be certain)
- Reliably formed (resulted from truth-conducive processes)
- Socially validated (others with access to evidence reach same conclusion)
- Predictively successful (enables successful interventions)
The genealogy of knowledge matters:
In data-intensive contexts, how we arrived at knowledge affects its epistemic status:
- Theory-driven discovery: Started with mechanism, derived predictions, tested them
- Strengths: Deep understanding, generalizable, less prone to overfitting
- Weaknesses: Can miss unexpected patterns
- Data-driven discovery: Patterns emerged from large-scale data analysis, mechanism inferred later
- Strengths: Can find surprising relationships, comprehensive
- Weaknesses: Higher risk of spurious patterns, requires external validation
Example:
Theory-driven: Understanding complement activation biology → predicting membrane reactions → testing specific membranes → confirming mechanism
Data-driven: ML algorithm identifies batch X-2347 as high-risk from EHR data → investigating why → discovering complement mechanism → validating with targeted experiments
Both can yield knowledge, but:
- Theory-driven has stronger prior plausibility
- Data-driven requires more extensive validation
- Theory-driven findings are more likely to generalize beyond the original data
Epistemic status: Established knowledge - forms part of the background against which new hypotheses are evaluated. High confidence, but not absolute certainty. Embedded in network of mutually supporting beliefs.
Key insight: Even established knowledge remains provisional. Science is self-correcting. New evidence can overturn what seemed certain.
Part III: Modes of Inference Across Stages
The Three Fundamental Modes
Throughout hypothesis evaluation, three modes of inference operate:
1. Deduction (Certainty)
Structure: If premises are true, conclusion must be true
Role in hypothesis evaluation:
- Deriving testable predictions from hypotheses
- Checking internal logical consistency
- Mathematical and statistical reasoning
- Ruling out logical impossibilities
Example:
H: All patients on batch X-2347 will show complement elevation
Patient Jones is on batch X-2347
Therefore: Patient Jones will show complement elevation (if H is true)
Strength: Provides certainty within the logical system
Limitation: Doesn't tell us whether premises match reality
2. Induction (Generalization)
Structure: Observed pattern in sample → pattern holds in population
Role in hypothesis evaluation:
- Generalizing from observed instances to universal claims
- Moving from finite data to probabilistic conclusions
- Foundation of statistical inference
Example:
Observed: 45% of 100 patients on X-2347 developed hypotension
Induced: ~45% of all patients on X-2347 will develop hypotension (with uncertainty)
Strength: Enables predictions beyond observed data
Limitation: Never logically certain—inductive step always involves leap
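The "with uncertainty" in the induced claim can be made explicit with an interval estimate. The sketch below uses the Wilson score interval for 45 events in 100 patients; choosing Wilson over other interval methods is an assumption of this example, and the function name is hypothetical.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(45, 100)
print(f"95% interval for the hypotension rate: {low:.3f} to {high:.3f}")
# Roughly 0.36 to 0.55
```

The interval quantifies sampling error only. Whether the 100 patients represent the wider population is a separate question that no interval answers.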
Varieties of induction:
- Enumerative: X₁ is Y, X₂ is Y, X₃ is Y → All X are Y
- Statistical: High frequency in sample → high frequency in population
- Analogical: X is similar to Y in respects A,B,C; X has property D → Y probably has property D
Quality criteria:
- Sample size: Larger = stronger
- Representativeness: Random/diverse sampling = stronger
- Effect size: Larger deviations from null = stronger
- Background knowledge: Consistent with theory = stronger
3. Abduction (Inference to Best Explanation)
Structure: Surprising observation E → H would explain E → Therefore H (tentatively)
Role in hypothesis evaluation:
- Generating initial hypotheses from puzzling observations
- Choosing between empirically equivalent theories
- Integrating diverse evidence into unified explanation
Example:
Observation: Patients develop hypotension specifically with batch X-2347
Hypothesis 1: Complement activation from membrane defect
Hypothesis 2: Contamination during manufacturing
Hypothesis 3: Coincidental timing with other factors
H1 best explains the observations (specificity, mechanism, predictability) → Tentatively accept H1
Strength: Enables discovery and hypothesis generation
Limitation: "Best explanation" is often subjective and can change
Criteria for best explanation:
- Explanatory power: Accounts for more phenomena
- Simplicity: Fewer ad hoc assumptions (Occam's razor)
- Unification: Connects disparate observations
- Predictive fertility: Generates novel testable predictions
- Coherence: Fits with established knowledge
Combining Modes Across Stages
| Stage | Primary Mode(s) | Role |
|---|---|---|
| 0: Observation | Abduction | Generating proto-hypotheses from surprising patterns |
| 1: Formation | Abduction + Deduction | Formulating testable H, deriving predictions |
| 2: Initial Evidence | Induction + Abduction | Generalizing from first instances, comparing explanations |
| 3: Accumulating | Induction + Deduction | Statistical inference, testing logical consequences |
| 4: Strong Evidence | All three | Induction for generalization, deduction for testing, abduction for integration |
| 5: Established | Primarily deduction | Using H as premise to derive new predictions |
The Rebalancing in Data-Intensive Science
Traditional scientific method (pre-big data):
- Heavy emphasis on deduction from theory
- Theory → predictions → small-scale tests → theory refinement
- Induction limited by small sample sizes
- Abduction for anomaly resolution
Data-intensive scientific method (contemporary):
- Elevated role for induction from large datasets
- Data patterns → hypothesis → mechanistic explanation → further testing
- Machine learning enables pattern detection at scale
- Abduction for integrating data-driven findings with theory
Key insight: Data science doesn't eliminate theory—it rebalances the inference modes. Theory still:
- Guides what data to collect
- Frames interpretations
- Provides mechanistic understanding
- Determines what counts as "interesting" patterns
But: Large-scale induction can now suggest hypotheses that would never emerge from pure theory-driven deduction.
Part IV: Common Pitfalls and How to Avoid Them
Pitfall 1: Premature Conviction
Manifestation: Treating weak evidence as strong; acting on hypotheses before sufficient support
Why it happens:
- Psychological need for certainty
- Pressure to act or decide
- Overconfidence from initial positive findings
- Availability bias (recent/vivid evidence overweighted)
How to avoid:
- Explicitly track evidential stage
- Maintain calibrated confidence intervals
- Use pre-registered analysis plans
- Seek disconfirming evidence actively
- Engage with informed skeptics
Correction mechanism:
- Before acting on H, ask: "What stage am I at?"
- If Stage 2-3: Frame as "investigation" not "conclusion"
- Require Stage 4 evidence for high-stakes decisions
Pitfall 2: Confirmation Bias
Manifestation: Seeking and interpreting evidence in ways that confirm pre-existing beliefs
Why it happens:
- Cognitive ease (familiar ideas feel true)
- Emotional attachment to hypotheses
- Career/reputational investment
- Selective attention and memory
How to avoid:
- Pre-registration: Specify hypothesis and analysis plan before seeing full data
- Adversarial collaboration: Partner with someone who holds alternative hypothesis
- Red team exercise: Explicitly try to disprove your own hypothesis
- Consider alternatives: For every piece of confirming evidence, ask "what else could explain this?"
- Track disconfirming evidence: Keep explicit log of evidence against H
Correction mechanism:
- Bayes' theorem forces accounting for both P(E|H) and P(E|~H)
- Mayo's severity principle: Evidence only counts if it could have refuted H
Pitfall 3: Multiple Testing and P-Hacking
Manifestation: Finding "significant" results by testing many hypotheses or analysis methods
Why it happens:
- Natural to explore data multiple ways
- Publication bias rewards positive findings
- Researchers unaware they're p-hacking
- Lack of correction for multiple comparisons
How to avoid:
- Bonferroni or other corrections: Adjust significance threshold for number of tests
- Hold-out validation: Test on completely independent dataset
- Pre-registration: Commit to analysis approach before seeing data
- Report all tests performed: Not just the significant ones
- Use Bayesian approaches: With suitable priors (e.g., hierarchical models) they can be less sensitive to multiple testing, but they do not make it harmless
Example:
You test 20 different batches for association with hypotension
One shows p=0.03
Without correction, this could be random chance (0.05 × 20 = 1 expected false positive)
Need: Bonferroni correction (α = 0.05/20 = 0.0025) or replication
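The arithmetic in this example can be checked directly. The sketch below also computes the family-wise error rate, which additionally assumes the 20 tests are independent; that assumption is not stated in the example.

```python
n_tests = 20
alpha = 0.05
p_observed = 0.03

# Expected number of false positives if every null hypothesis is true.
expected_false_positives = n_tests * alpha

# Probability of at least one false positive (assumes independent tests).
family_wise_error = 1 - (1 - alpha) ** n_tests

# Bonferroni: divide the threshold by the number of tests.
bonferroni_alpha = alpha / n_tests
survives_correction = p_observed < bonferroni_alpha

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {family_wise_error:.2f}")
print(f"Bonferroni threshold: {bonferroni_alpha}")
print(f"p = {p_observed} survives correction: {survives_correction}")
```

With 20 independent tests the chance of at least one spurious "hit" is about 64%, so a single p=0.03 is close to what chance alone would be expected to produce.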
Pitfall 4: Confusing Correlation and Causation
Manifestation: Inferring causal relationship from mere association
Why it happens:
- Intuitive to interpret correlation causally
- Causal language is natural
- Lack of understanding of confounding
- Temporal precedence mistaken for causation
How to avoid:
- Explicit causal diagrams: Draw DAGs showing assumed relationships
- Consider confounders: What else could cause both variables?
- Look for mechanism: How would cause produce effect?
- Seek natural experiments: Quasi-random exposure variations
- Use causal inference methods: Instrumental variables, difference-in-differences, RCT
Bradford Hill criteria (revisited): Not definitive but helpful heuristics
Correction mechanism:
- Distinguish "X is associated with Y" from "X causes Y"
- Report associations as associations until causality established
- Recognize correlation is often the first step toward understanding causation
Pitfall 5: Ignoring Base Rates (Base Rate Fallacy)
Manifestation: Evaluating evidence without considering prior probability
Why it happens:
- Base rates often unknown or hard to estimate
- Recent evidence psychologically vivid
- Lack of Bayesian thinking
- Focus on P(E|H) without considering P(H)
Example:
Diagnostic test is 95% sensitive and 95% specific
Patient tests positive for rare disease (prevalence 0.1%)
A common intuitive answer: "95% chance the patient has the disease"
Actually: P(disease|positive) ≈ 1.9% (by Bayes' theorem)
Why? The disease is so rare that most positive tests are false positives
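The figure above follows from Bayes' theorem applied to the stated sensitivity, specificity, and prevalence. A minimal sketch (the function name is hypothetical):

```python
def positive_predictive_value(
    sensitivity: float, specificity: float, prevalence: float
) -> float:
    """P(disease | positive test) via Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


ppv = positive_predictive_value(sensitivity=0.95, specificity=0.95, prevalence=0.001)
print(f"P(disease | positive) = {ppv:.1%}")  # 1.9%
```

In counts: of 100,000 people tested, 100 have the disease and 95 of them test positive, while 4,995 of the 99,900 healthy people also test positive. Only 95 of the 5,090 positives are true cases.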
How to avoid:
- Always start with base rates: P(H) before considering evidence
- Use Bayes' theorem explicitly
- Consider both P(E|H) and P(E|~H)
- Remember: Surprising evidence is more diagnostic than expected evidence
Pitfall 6: Overfit Models and Spurious Patterns
Manifestation: Complex models fit noise rather than signal; "patterns" that don't replicate
Why it happens:
- High-dimensional data (many variables)
- Flexible models (many parameters)
- Optimization on same data used for model building
- Lack of external validation
How to avoid:
- Cross-validation: Test on data not used for training
- Regularization: Penalize model complexity
- External validation: Test in completely different population
- Mechanistic plausibility: Does pattern make sense?
- Simplicity preference: Favor simpler models unless complexity justified
Data science specific:
- Be especially careful with black-box ML models
- Understand training/validation/test split
- Report out-of-sample performance
- Consider adversarial examples
- Test robustness to data perturbations
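A small simulation shows why out-of-sample checks matter. In the sketch below every feature and the outcome are pure noise, yet selecting the best of many features on the training data still yields an apparently strong correlation. The sizes (30 samples, 500 features), the seed, and all names are arbitrary choices for illustration.

```python
import random
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)
n_samples, n_features = 30, 500


def noise(n: int) -> list[float]:
    return [rng.gauss(0, 1) for _ in range(n)]


# Outcome and features are independent noise in both splits.
y_train, y_test = noise(n_samples), noise(n_samples)
x_train = [noise(n_samples) for _ in range(n_features)]
x_test = [noise(n_samples) for _ in range(n_features)]

# "Discover" the feature most correlated with the outcome on training data.
best = max(range(n_features), key=lambda j: abs(pearson(x_train[j], y_train)))
train_r = abs(pearson(x_train[best], y_train))
test_r = abs(pearson(x_test[best], y_test))

print(f"best feature on training data: |r| = {train_r:.2f}")
print(f"same feature on held-out data: |r| = {test_r:.2f}")
```

The training correlation is a product of selection, not signal, so the held-out value is the one to report.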
Pitfall 7: Underdetermination (Duhem-Quine Problem)
Manifestation: Evidence fails to uniquely select among competing hypotheses
Why it happens:
- Hypotheses rarely tested in isolation
- Auxiliary assumptions implicit
- Multiple theories compatible with same data
- Evidence logically consistent with alternatives
Example:
Finding: Patients on batch X-2347 have high hypotension rate
H1: The membranes are defective
H2: The membranes are fine, but they're stored improperly at certain sites
H3: The membranes are fine, but used differently by certain technicians
H4: The patients assigned to these membranes differ in unmeasured ways
Same evidence, many explanations
How to avoid:
- Make auxiliaries explicit: State all assumptions clearly
- Test auxiliaries independently: Validate measurement instruments, check for selection bias
- Design discriminating tests: Create situations where H1 and H2 make different predictions
- Vary auxiliary assumptions: Bootstrap approach—use different auxiliaries to test H
- Seek mechanism: Understanding how H works helps rule out alternatives
Correction mechanism:
- Use Bayesian reasoning to compare P(E|H1) vs P(E|H2) vs P(E|H3)...
- Design crucial experiments that yield different results under different hypotheses
- Remember: Underdetermination is often temporary—future evidence can discriminate
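The Bayesian comparison of several hypotheses can be sketched as a normalization over priors times likelihoods. All numbers below are hypothetical, chosen only to contrast evidence that every hypothesis predicts equally well with evidence from a discriminating test. The function name is also an assumption.

```python
def posterior_over_hypotheses(
    priors: dict[str, float], likelihoods: dict[str, float]
) -> dict[str, float]:
    """Normalize prior * likelihood across mutually exclusive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: value / total for h, value in joint.items()}


# Hypothetical equal priors over the four explanations.
priors = {"H1": 0.25, "H2": 0.25, "H3": 0.25, "H4": 0.25}

# Non-discriminating evidence: every hypothesis predicts the high rate.
same_for_all = {"H1": 0.8, "H2": 0.8, "H3": 0.8, "H4": 0.8}

# Hypothetical discriminating test: effect persists at sites with
# verified storage, which H1 predicts far better than H2.
discriminating = {"H1": 0.8, "H2": 0.1, "H3": 0.3, "H4": 0.3}

after_weak = posterior_over_hypotheses(priors, same_for_all)
after_strong = posterior_over_hypotheses(priors, discriminating)
print({h: round(p, 2) for h, p in after_weak.items()})
print({h: round(p, 2) for h, p in after_strong.items()})
```

Evidence that all hypotheses predict equally leaves the posterior equal to the prior. Only evidence with unequal likelihoods moves belief among them.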
Pitfall 8: The File Drawer Problem
Manifestation: Negative results unpublished; literature biased toward positive findings
Why it happens:
- Publication bias against null results
- Career incentives favor novel positive findings
- "Boring" negative results
- Difficulty publishing replications
Implications for hypothesis evaluation:
- Published evidence is selected sample, not representative
- True evidence for H may be weaker than appears
- Replication rate lower than expected from published record
- Meta-analyses biased
How to avoid:
- Pre-registration of studies: Commit to publishing regardless of outcome
- Funnel plots: Look for asymmetry suggesting missing negative results
- Consider prior plausibility: Extraordinary claims need extraordinary evidence
- Value replications: Attempt to replicate key findings
- Report all your tests: Not just the significant ones
When evaluating others' evidence:
- Ask: "How many unpublished negative results might exist?"
- Discount evidence from literatures with known publication bias
- Seek pre-registered studies
- Weight high-powered negative results heavily
Part V: Practical Decision Framework
The Hypothesis Evaluation Checklist
Use this checklist to assess where you are and what you need:
Stage Assessment
Evidence Quality
Inference Quality
Bias Check
Communication
Decision Matrix: When to Act on Hypothesis
Not all hypotheses require the same evidential standard. Use this matrix:
| Stakes / Irreversibility | Profile | Required Evidence Stage | Key Considerations |
|---|---|---|---|
| Low/Low | Easy to reverse, low cost | Stage 2-3 | Act on weak evidence, learn quickly |
| Low/High | Difficult to reverse, low cost | Stage 3 | Need moderate confidence before committing |
| High/Low | Easy to reverse, high cost | Stage 3-4 | Cost justifies higher bar, but reversibility provides safety |
| High/High | Difficult to reverse, high cost | Stage 4-5 | Maximum evidence required—potentially catastrophic if wrong |
Examples:
Low stakes, high reversibility: Trying a new workflow in your office
- Stage 2 evidence adequate
- Learn by doing
- Easy to revert if it doesn't work
High stakes, low reversibility: Removing an organ
- Stage 4-5 evidence required
- Irreversible decision
- Lives at stake
Moderate stakes, moderate reversibility: Changing dialysis membrane for all patients
- Stage 3-4 evidence needed
- Can switch back, but disruptive
- Patient safety important but changes possible
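The decision matrix can be encoded as a simple lookup. In the sketch below, treating the lower bound of each stage range as the minimum needed to act is an assumption of this example, as are the key names and the function `may_act`.

```python
# (stakes, reversibility) -> (lowest, highest) required evidence stage,
# transcribed from the decision matrix.
REQUIRED_STAGE = {
    ("low", "easy"): (2, 3),
    ("low", "difficult"): (3, 3),
    ("high", "easy"): (3, 4),
    ("high", "difficult"): (4, 5),
}


def may_act(stakes: str, reversibility: str, current_stage: int) -> bool:
    """True if the current evidence stage meets the minimum for this decision.

    Assumption: the lower bound of the range is the threshold for acting.
    """
    minimum, _ = REQUIRED_STAGE[(stakes, reversibility)]
    return current_stage >= minimum


print(may_act("low", "easy", current_stage=2))        # True
print(may_act("high", "difficult", current_stage=3))  # False
```

A stricter policy could require the upper bound instead; the right choice depends on how costly a wrong decision is in the specific setting.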
Updating Your Beliefs: The Bayesian Engine
At each stage, explicitly update your probability estimates:
- Start with prior: P(H) = your initial probability before new evidence
- Based on: background knowledge, theoretical plausibility, prior evidence
- Observe new evidence E
- Assess likelihoods:
- P(E|H) = probability of seeing E if H is true
- P(E|~H) = probability of seeing E if H is false
- Calculate posterior:
P(H|E) = P(H) × P(E|H) / [P(H) × P(E|H) + P(~H) × P(E|~H)]
- Your new prior = P(H|E) for the next round of evidence
Example walkthrough:
Initial:
- P(H) = 0.10 (novel hypothesis, somewhat plausible)
- H: Batch X-2347 causes hypotension
Evidence 1: Of 15 patients with hypotension, 12 were on X-2347 and 3 on other batches
- P(E1|H) = 0.70 (if true, would expect strong association, but not perfect)
- P(E1|~H) = 0.20 (by chance, might see some clustering)
- Calculate: P(H|E1) = 0.10 × 0.70 / [0.10 × 0.70 + 0.90 × 0.20] = 0.28
Evidence 2: Complement markers elevated in X-2347 patients
- Start with: P(H) = 0.28 (our new prior)
- P(E2|H) = 0.80 (mechanism predicts this)
- P(E2|~H) = 0.10 (less likely without membrane issue)
- Calculate: P(H|E2) = 0.28 × 0.80 / [0.28 × 0.80 + 0.72 × 0.10] = 0.76
Evidence 3: Randomized trial shows 45% vs 8% hypotension rate
- Start with: P(H) = 0.76
- P(E3|H) = 0.95 (RCT with large effect strongly expected if H true)
- P(E3|~H) = 0.01 (very unlikely to see this if H false)
- Calculate: P(H|E3) = 0.76 × 0.95 / [0.76 × 0.95 + 0.24 × 0.01] = 0.997
Now at Stage 4: Justified belief, sufficient to act in most contexts
Key principles:
- Each piece of evidence updates your beliefs incrementally
- Strong evidence (high P(E|H), low P(E|~H)) produces large updates
- Weak evidence (similar P(E|H) and P(E|~H)) produces small updates
- Multiple weak pieces can accumulate into strong support, provided they are independent of one another
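The walkthrough can be reproduced with a short loop that feeds each posterior back in as the next prior. The sketch uses the priors and likelihoods from the walkthrough; the function name `update` is hypothetical, and chaining updates this way assumes each piece of evidence is conditionally independent of the earlier ones given H and given ~H.

```python
def update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """One Bayesian update: returns P(H|E) from P(H), P(E|H), P(E|~H)."""
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1 - prior) * p_e_given_not_h)


# (label, P(E|H), P(E|~H)) taken from the walkthrough.
evidence = [
    ("E1: batch overlap", 0.70, 0.20),
    ("E2: complement markers", 0.80, 0.10),
    ("E3: randomized trial", 0.95, 0.01),
]

belief = 0.10  # initial prior P(H)
posteriors = []
for label, p_e_h, p_e_not_h in evidence:
    belief = update(belief, p_e_h, p_e_not_h)
    posteriors.append(belief)
    print(f"{label}: P(H) = {belief:.3f}")
```

Because the loop carries unrounded values forward, intermediate results can differ from the hand calculation in the third decimal place. The final value still rounds to 0.997.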
Part VI: Special Topics
Working with Black Box Models
Machine learning models increasingly provide "evidence" but with opacity problems.
Types of opacity:
- Intentional concealment: Proprietary algorithms
- Technical literacy barrier: Requires advanced knowledge
- Inherent complexity: Even experts can't fully explain
Epistemological challenges:
Interpretability: Can you explain why the model makes its predictions?
- Global: Understanding overall model behavior
- Local: Understanding specific predictions
Overfitting: Model fits training noise, not generalizable patterns
Reliability: Does model maintain performance in new contexts?
Framework for evaluating black box evidence:
| Question | Assessment Strategy |
|---|---|
| Is the prediction accurate? | Out-of-sample validation, cross-validation |
| Is it causally meaningful? | Intervention studies, causal inference methods |
| Is it robust? | Adversarial testing, perturbation analysis |
| Is it explainable? | SHAP values, LIME, attention mechanisms |
| Does mechanism make sense? | Expert review, consistency with theory |
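To see why out-of-sample validation heads the table, consider a model that simply memorizes its training data. The sketch below is a toy illustration under stated assumptions: the data are synthetic, the labels are coin flips (so there is nothing real to learn), and the one-nearest-neighbor "model" and all function names are hypothetical stand-ins for a black box.

```python
import random


def nearest_neighbor_predict(train, x):
    """Predict the label of the training point closest to x (1-NN)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Fraction of points in `data` whose label the 1-NN model gets right."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in data)
    return hits / len(data)


def make_points(rng, n):
    """Synthetic (feature, label) pairs; labels are pure noise."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]


rng = random.Random(0)
train_set = make_points(rng, 200)
test_set = make_points(rng, 200)

train_acc = accuracy(train_set, train_set)  # the model has memorized these points
test_acc = accuracy(train_set, test_set)    # honest out-of-sample estimate
print(f"in-sample accuracy:     {train_acc:.2f}")
print(f"out-of-sample accuracy: {test_acc:.2f}")
```

In-sample accuracy is perfect because every training point is its own nearest neighbor, while out-of-sample accuracy hovers near chance. Only the second number says anything about the model as evidence.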
When black box evidence is appropriate:
- Prediction accuracy is what matters (not explanation)
- Human performance baseline is low
- Model extensively validated out-of-sample
- Stakes are low or decision is reversible
- Human oversight remains in place
When black box evidence is problematic:
- Causal understanding needed
- High stakes, irreversible decisions
- Model might encode biases
- Generalization beyond training context required
- Accountability and transparency legally required
Mitigation strategies:
- Require external validation datasets
- Use inherently interpretable models when possible
- Apply post-hoc explanation methods
- Compare black box to mechanistic models
- Integrate domain expertise in model evaluation
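The perturbation analysis listed in the evaluation table can be sketched without access to a model's internals: nudge one input at a time and record how far the prediction moves. Everything below is a hypothetical illustration, including the `black_box` stand-in, its coefficients, and the patient values; a real model would replace `black_box` with its own prediction call.

```python
import random


def black_box(features):
    """Hypothetical stand-in for an opaque model; returns a risk score."""
    age, creatinine, systolic_bp = features
    return 0.02 * age + 0.30 * creatinine - 0.005 * systolic_bp


def perturbation_sensitivity(predict, features, rel_noise=0.01, trials=500, seed=0):
    """Mean absolute change in prediction per feature under small relative noise."""
    rng = random.Random(seed)
    baseline = predict(features)
    sensitivity = []
    for i in range(len(features)):
        total = 0.0
        for _ in range(trials):
            perturbed = list(features)
            # Scale feature i by a random factor within +/- rel_noise.
            perturbed[i] *= 1.0 + rng.uniform(-rel_noise, rel_noise)
            total += abs(predict(perturbed) - baseline)
        sensitivity.append(total / trials)
    return sensitivity


patient = [65.0, 2.5, 130.0]  # hypothetical age, creatinine (mg/dL), systolic BP
scores = perturbation_sensitivity(black_box, patient)
for name, score in zip(["age", "creatinine", "systolic_bp"], scores):
    print(f"{name}: mean |change| = {score:.4f}")
```

Large swings from tiny input changes, or sensitivity to a feature that domain experts consider irrelevant, are reasons to discount the model's output as evidence.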
Theory-Free vs. Theory-Driven Science
The proliferation of data has enabled new modes of discovery.
Classical paradigm: Theory → Hypothesis → Prediction → Test → Refine theory
Data-intensive paradigm: Data → Pattern → Hypothesis → Mechanism → Test → Theory
Neither is superior in principle—both have roles:
| Aspect | Theory-Driven | Data-Driven |
|---|---|---|
| Starting point | Theoretical understanding | Observed patterns |
| Hypothesis generation | Deductive from theory | Inductive from data |
| Risk | Confirmation bias | Spurious patterns |
| Strength | Strong prior plausibility | Discovery of unexpected |
| Generalization | Often good | Requires extensive validation |
| Understanding | Deep mechanistic | May remain phenomenological |
Key insight from Kitchin: Data-driven science doesn't eliminate theory—it rebalances inference modes. Even "theory-free" science involves:
- Theory in selecting what data to collect
- Theory in interpreting patterns
- Theory in distinguishing signal from noise
- Theory in generalizing findings
Implications for hypothesis evaluation:
If hypothesis emerged from data mining:
- Require more extensive external validation
- Be especially alert to overfitting
- Seek mechanistic understanding
- Test in genuinely novel contexts
- Correct for multiple testing
If hypothesis emerged from theory:
- Still require rigorous testing
- Watch for confirmation bias
- May miss unexpected patterns
- Consider data-driven approaches for hypothesis refinement
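The multiple-testing correction recommended for data-mined hypotheses can be sketched with two standard procedures: Bonferroni (controls the family-wise error rate) and Benjamini-Hochberg (controls the false discovery rate). The p-values below are hypothetical, chosen only to show how many "findings" survive each correction.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of hypotheses rejected under the Bonferroni correction."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected under the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that passes its threshold
    return sorted(order[:cutoff_rank])


# Hypothetical p-values from ten patterns found by data mining.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
uncorrected = [i for i, p in enumerate(p_values) if p <= 0.05]
print(len(uncorrected), len(benjamini_hochberg(p_values)), len(bonferroni(p_values)))
# prints: 5 2 1
```

Five patterns look significant at 0.05 before correction; two survive Benjamini-Hochberg and one survives Bonferroni. Which procedure fits depends on whether a single false positive or a tolerable fraction of false positives is the greater concern.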
Best practice: Integrate both approaches
- Use theory to guide data collection
- Use data to generate novel hypotheses
- Use theory to interpret data patterns
- Use data to refine and test theory
- Iterate between data and theory
The Problem of Old Evidence
Situation: You formulate hypothesis H to explain already-known evidence E
Challenge: Since P(E) = 1 (it's already known), Bayes' theorem seems to imply E can't increase P(H)
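The difficulty follows in one line from Bayes' theorem. Assuming P(H) > 0, certainty about E forces P(E|H) = 1 as well, so conditioning on E leaves the probability of H unchanged:

```latex
P(E) = 1 \;\Rightarrow\; P(E \mid H) = 1
\quad\text{and so}\quad
P(H \mid E) = \frac{P(H)\,P(E \mid H)}{P(E)} = \frac{P(H)\cdot 1}{1} = P(H)
```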
Example:
- Newton formulated gravitational theory partly to explain already-known Kepler's laws
- Einstein formulated GR partly to explain Mercury's perihelion precession (already observed)
- But surely these explanations count as evidence for the theories?
Solutions:
1. Logical Bayesian: Consider counterfactual—if you hadn't known E, how would H have predicted it?
2. Explanatory value: E counts as evidence when H provides best explanation, even if E known first
3. Historical reconstruction: What would rational agent have believed before E was known?
Practical implication:
- Hypothesis that explains old evidence has value
- But: Less impressive than predictions of novel evidence
- Novel predictions provide stronger confirmation
- Old evidence can still increase confidence when hypothesis provides superior explanation to alternatives
For hypothesis evaluation:
- Seek opportunities for novel predictions
- Value prospective validation over retrospective fit
- Pre-register predictions before testing
- Distinguish "predicted" from "postdicted"
Part VII: Synthesis and Principles
Ten Commandments of Hypothesis Evaluation
1. State your hypothesis explicitly and testably
   - Vague hunches are not hypotheses
   - Specify what would count as evidence for and against
2. Know your evidential stage
   - Don't confuse weak evidence with strong
   - Adjust confidence claims to evidence quality
   - Match actions to evidential strength
3. Consider alternative hypotheses
   - Never evaluate H in isolation
   - Design tests that discriminate between alternatives
   - Actively seek disconfirming evidence
4. Make auxiliary assumptions explicit
   - Evidence tests H + auxiliaries, not H alone
   - Validate your methods and measures
   - Consider what else must be true for your interpretation to hold
5. Update beliefs incrementally via Bayesian reasoning
   - Start with prior probabilities
   - Update on new evidence
   - Consider both P(E|H) and P(E|~H)
6. Diversify your evidence
   - Vary populations, contexts, methods
   - Replicate findings independently
   - Triangulate with different data sources
7. Test severely
   - Design tests capable of refuting H
   - High power to detect alternatives
   - Embrace opportunities for falsification
8. Guard against biases
   - Pre-register when possible
   - Seek out informed skeptics
   - Report all analyses, not just significant ones
   - Correct for multiple testing
9. Distinguish types of knowledge
   - Association ≠ causation
   - Prediction ≠ explanation
   - Statistical significance ≠ practical importance
   - Individual-level ≠ population-level
10. Communicate uncertainty honestly
    - Match language to evidential strength
    - Acknowledge limitations explicitly
    - Distinguish what you know from what you suspect
    - Update in light of new evidence
Final Thought: Epistemic Humility
Science is not about achieving certainty—it's about reducing uncertainty incrementally while maintaining appropriate humility.
Even our best-supported hypotheses remain provisional. They are:
- Justified by current evidence
- Coherent with existing knowledge
- Productive of successful predictions
- Subject to revision by future evidence
The mark of scientific maturity is not confidence in what you know, but:
- Clarity about what you don't know
- Honesty about what the evidence actually supports
- Willingness to revise in light of new evidence
- Ability to hold beliefs with appropriate tentativeness
Your hypothesis has made it from wild speculation to justified belief. Celebrate that achievement. And remain open to being proven wrong.
Appendix A: Quick Reference Guide
Evidence Strength Terminology
| Term | Likelihood Ratio (LR) | Qualitative Description |
|---|---|---|
| Negligible | LR ≈ 1 (roughly 1 to 1.5) | E doesn't change your mind; posterior ≈ prior |
| Weak | 1.5 < LR < 3 | E shifts belief slightly |
| Moderate | 3 < LR < 10 | E shifts belief substantially |
| Strong | 10 < LR < 100 | E strongly supports H |
| Very Strong | LR > 100 | E overwhelmingly supports H |
Where LR = Likelihood Ratio = P(E|H) / P(E|~H). For evidence against H (LR < 1), apply the same bands to 1/LR.
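As a quick aid, the table can be turned into a small lookup. The sketch below is one possible reading of the bands, with several labeled assumptions that the table itself does not settle: boundary values are assigned to the higher band, LR between 1 and 1.5 is treated as negligible, and LR below 1 is read as evidence against H with strength given by 1/LR. The function names are illustrative.

```python
def likelihood_ratio(p_e_given_h, p_e_given_not_h):
    return p_e_given_h / p_e_given_not_h


def describe_strength(lr):
    """Map a likelihood ratio to the table's terms.

    Assumptions not in the table: boundary values go to the higher band,
    LR between 1 and 1.5 counts as negligible, and LR < 1 is read as
    evidence against H of the strength given by 1 / LR.
    """
    direction = "for H" if lr >= 1 else "against H"
    magnitude = lr if lr >= 1 else 1.0 / lr
    if magnitude < 1.5:
        return "Negligible"
    if magnitude < 3:
        term = "Weak"
    elif magnitude < 10:
        term = "Moderate"
    elif magnitude < 100:
        term = "Strong"
    else:
        term = "Very Strong"
    return f"{term} ({direction})"


# Likelihood pairs from the three pieces of evidence in the Bayesian walkthrough.
for p_h, p_not_h in [(0.70, 0.20), (0.80, 0.10), (0.95, 0.01)]:
    lr = likelihood_ratio(p_h, p_not_h)
    print(f"LR = {lr:.1f}: {describe_strength(lr)}")
```

By this reading, the first two pieces of evidence in the walkthrough are moderate and the randomized trial is strong, just short of the "very strong" band.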
Sample Size Guidelines
Very rough heuristics—context matters enormously
| Stage | Minimum n | Better n | Comments |
|---|---|---|---|
| Stage 1-2 | 5-10 | 20-30 | Exploratory only |
| Stage 3 | 30-50 | 100-200 | Moderate confidence |
| Stage 4 | 100+ | 500+ | For strong claims |
| Stage 5 | Multiple studies | Meta-analysis | Consensus building |
Remember:
- Quality > quantity
- Diverse small samples > large homogeneous sample (for generalization)
- Well-controlled small RCT > large observational study (for causation)
When You Need Different Methodologies
| Hypothesis Type | Appropriate Methods |
|---|---|
| Association | Observational studies, correlation |
| Causation | RCT, natural experiments, causal inference methods |
| Mechanism | Experimental manipulation, pathway analysis |
| Prediction | Machine learning, validation datasets |
| Generalization | External validation, multiple populations |
| Rare events | Case-control, larger samples, Bayesian methods |
Appendix B: Case Study Walkthrough
From Hunch to Knowledge: A Complete Example
Context: You're a nephrologist noticing patterns in your dialysis patients.
Stage 0: Observation
Initial observation: Over 3 weeks, you notice 5 patients develop severe itching that doesn't respond to usual interventions. You also notice they all started or increased calcium acetate.
What you do:
- Document carefully: exactly which patients, when started, severity
- Note: All five started same manufacturer's formulation
- Don't commit to explanation yet—just documenting
Status: Pre-hypothesis. Interesting pattern, needs more observation.
Stage 1: Hypothesis Formation
Observation expands: Review last 6 months of charts. Find 15 total patients with unexplained itching, 12 on this calcium acetate formulation.
Formulate hypotheses:
- H1: The calcium acetate formulation causes itching (sensitivity reaction)
- H2: Patients needing high-dose calcium have more severe hyperparathyroidism causing itching
- H3: Itching is seasonal (winter months) and calcium timing is coincidental
- H4: This calcium acetate batch was contaminated
Testable predictions:
- If H1: Switching formulations should resolve itching
- If H2: PTH levels should correlate with itching; other high-calcium patients should itch too
- If H3: Should see seasonal pattern across all patients
- If H4: Only specific lot numbers should be associated
Prior probabilities (estimates):
- P(H1) = 0.20 (plausible but unusual)
- P(H2) = 0.30 (known that secondary hyperparathyroidism causes itching)
- P(H3) = 0.15 (seasonal patterns exist)
- P(H4) = 0.10 (contamination rare but possible)
- P(other) = 0.25 (many possibilities)
Status: Stage 1. Multiple plausible hypotheses, minimal evidence to discriminate.
Stage 2: Initial Evidence Collection
Action: Small pilot intervention
- Switch 5 currently itching patients to different calcium formulation
- Continue monitoring 5 new patients starting original formulation
- Check PTH levels on all
- Note lot numbers
Results after 4 weeks:
- 4/5 switched patients: itching resolved within 2 weeks
- 1/5 switched patients: itching persisted (had PTH of 890)
- 2/5 new patients on original: developed itching
- PTH levels not systematically elevated in itching patients (except the one)
- All cases from same lot number: X4729
Update probabilities:
Evidence E1: Switching resolves itching in 4/5
- P(E1|H1) = 0.80 (if sensitivity, would expect most resolve)
- P(E1|H2) = 0.30 (wouldn't expect calcium change to help if PTH primary)
- P(E1|H3) = 0.40 (seasonal effect might wane coincidentally)
- P(E1|H4) = 0.85 (contamination would cause switching to help)
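Update (taking P(E1|other) = 0.40, an assumed value since no likelihood was estimated for the catch-all): P(H1|E1) ≈ 0.32; P(H2|E1) ≈ 0.18; P(H3|E1) ≈ 0.12; P(H4|E1) ≈ 0.17; P(other|E1) ≈ 0.20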
Evidence E2: All cases from lot X4729
- P(E2|H1) = 0.50 (sensitivity might be to specific excipients varying by lot)
- P(E2|H2) = 0.15 (PTH wouldn't cluster by lot)
- P(E2|H3) = 0.10 (seasonal wouldn't cluster by lot)
- P(E2|H4) = 0.95 (contamination would be lot-specific)
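Combined update (treating E1 and E2 as conditionally independent, and taking P(E1|other) = 0.40 and P(E2|other) = 0.10 as assumed values): P(H4|E1,E2) ≈ 0.42; P(H1|E1,E2) ≈ 0.42; P(H2|E1,E2) ≈ 0.07; P(H3|E1,E2) ≈ 0.03
Status: Stage 2-3. With the likelihoods above, the evidence now favors the contamination (H4) and sensitivity reaction (H1) hypotheses about equally and makes H2 and H3 unlikely. Need to discriminate between H1 and H4, and test more rigorously.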
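With several competing hypotheses, the update is the same prior-times-likelihood calculation followed by normalization across all hypotheses. The sketch below uses the priors and likelihoods stated above. The two likelihoods for "other" are not given in the text and are placeholder assumptions; the absolute posteriors depend on them, while the ratios between the named hypotheses do not.

```python
priors = {"H1": 0.20, "H2": 0.30, "H3": 0.15, "H4": 0.10, "other": 0.25}

# Likelihoods from the text; the "other" entries are NOT in the text and are
# placeholder assumptions needed to normalize.
lik_e1 = {"H1": 0.80, "H2": 0.30, "H3": 0.40, "H4": 0.85, "other": 0.40}
lik_e2 = {"H1": 0.50, "H2": 0.15, "H3": 0.10, "H4": 0.95, "other": 0.10}


def update(beliefs, likelihoods):
    """One Bayesian update over mutually exclusive, exhaustive hypotheses."""
    unnormalized = {h: beliefs[h] * likelihoods[h] for h in beliefs}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}


after_e1 = update(priors, lik_e1)
# Chaining updates treats E1 and E2 as conditionally independent given each hypothesis.
after_e2 = update(after_e1, lik_e2)
for h in priors:
    print(f"{h}: {priors[h]:.2f} -> {after_e1[h]:.2f} -> {after_e2[h]:.2f}")
```

Writing the update this way makes it easy to test how sensitive the ranking is to any single likelihood estimate, which matters when the estimates are themselves rough judgments.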
Stage 3: Accumulating Evidence
Action: More systematic investigation
- Contact manufacturer for lot X4729 composition/testing
- Check if other clinics using lot X4729 seeing similar issues
- Test samples of lot X4729 vs other lots (send to lab)
- Formal case-control study: 30 cases (itching) vs 60 controls (no itching)
Results:
- Manufacturer reports lot X4729 passed all QC tests, but used slightly different drying process
- Two other clinics report itching complaints with same lot
- Lab testing: Lot X4729 has 2.3% higher residual solvent (ethanol) than other lots; within specs but higher
- Case-control study:
- 28/30 cases were on lot X4729
- 12/60 controls were on lot X4729
- OR = (28 × 48) / (2 × 12) = 56.0, p < 0.001
- When stratified by residual ethanol level: dose-response evident
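The odds ratio can be computed directly from the case-control counts reported above. The sketch below also adds an approximate 95% confidence interval using the Woolf (log-odds) method, which is an addition for illustration and not part of the original analysis; the function names are hypothetical.

```python
import math


def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio (ad / bc) for a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def woolf_ci(a, b, c, d, z=1.96):
    """Approximate 95% CI computed on the log-odds scale (Woolf method)."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Counts from the case-control results: 28 of 30 cases and 12 of 60 controls on lot X4729.
a, b = 28, 30 - 28   # cases: exposed, unexposed
c, d = 12, 60 - 12   # controls: exposed, unexposed
estimate = odds_ratio(a, b, c, d)
low, high = woolf_ci(a, b, c, d)
print(f"OR = {estimate:.1f}, 95% CI {low:.1f} to {high:.1f}")
```

The interval is wide because only two cases were unexposed, so the point estimate should be read as "large" rather than as a precise magnitude.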
Mechanistic hypothesis refined:
H1b: Elevated residual ethanol in lot X4729 causes skin irritation/itching
This discriminates H1b from H4:
- Not strictly "contamination" (within specs)
- Manufacturing variation in acceptable compound
- Specific mechanism identified
Update: P(H1b|all evidence) ≈ 0.85
Additional test: Prospective cohort
- Follow 100 patients newly starting any calcium acetate
- Track lot numbers and symptom development
- 20 patients started on lot X4729, 80 on other lots
- Results: 9/20 (45%) on X4729 developed itching vs 4/80 (5%) on others
- RR = 9.0, p < 0.001
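The cohort figures can be checked the same way. The sketch below computes the relative risk from the reported counts and a one-sided Fisher exact p-value using only the standard library. The choice of Fisher's exact test is an assumption for illustration; the text does not say which test produced the reported p-value.

```python
from math import comb


def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)


def fisher_one_sided(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """P(at least this many events among the exposed), with table margins fixed."""
    total_events = events_exposed + events_unexposed
    n_total = n_exposed + n_unexposed
    denom = comb(n_total, n_exposed)
    upper = min(total_events, n_exposed)
    tail = sum(
        comb(total_events, k) * comb(n_total - total_events, n_exposed - k)
        for k in range(events_exposed, upper + 1)
    )
    return tail / denom


# 9 of 20 patients on lot X4729 vs 4 of 80 on other lots developed itching.
rr = relative_risk(9, 20, 4, 80)
p_value = fisher_one_sided(9, 20, 4, 80)
print(f"RR = {rr:.1f}, one-sided Fisher p = {p_value:.2g}")
```

The exact test is a reasonable choice here because one cell of the table contains only four events, where large-sample approximations are less reliable.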
Status: Stage 3-4. Strong evidence that lot X4729 specifically causes itching, likely via elevated residual ethanol. Multiple independent lines of evidence converge.
Stage 4: Strong Evidence
Action: Interventional validation
- Switch all patients on X4729 to different lots (n=50)
- Monitor symptom resolution
- Test hypothesis in different population (non-dialysis CKD patients on calcium acetate)
Results:
- 43/50 patients had symptom resolution within 4 weeks
- 7/50 had persistent itching (chart review revealed other causes in 6)
- Study in non-dialysis patients: smaller effect but same direction (lot X4729 associated with itching)
Mechanism confirmation:
- Dermatology consult on 5 patients: consistent with irritant dermatitis
- Patch testing with residual ethanol: positive in affected patients
- Makes sense: ethanol is known skin irritant
Severe test:
- If hypothesis false, would not expect:
- Resolution with switching
- Replication across clinics
- Dose-response with ethanol level
- Positive patch tests
- Consistency in non-dialysis population
Meta-analysis perspective:
- Original observation: 12/15 cases on X4729
- Other clinics: 34/40 cases on X4729
- Case-control: OR = 56.0
- Prospective cohort: RR = 9.0
- Intervention: 43/50 resolution
All converge on strong effect
Final probability estimate: P(H1b|all evidence) > 0.95
Status: Stage 4-5. Justified belief that lot X4729 causes itching via elevated residual ethanol. Appropriate to act: avoid this lot, inform FDA, inform other clinicians.
Stage 5: Established Knowledge
Subsequent developments:
- FDA issues notice about lot X4729
- Manufacturer recalls lot, improves drying process
- Published case series confirms findings
- Incorporated into clinical guidelines
- Mechanism well-understood
- Used to inform QC standards
The hypothesis has become established knowledge:
- Part of background against which new observations evaluated
- Taught to trainees
- Guides clinical practice
- High confidence but not absolute certainty
Epistemic status: This finding is now part of the edifice of medical knowledge, but remains:
- Subject to refinement (maybe other lots have issues)
- Subject to revision (maybe long-term follow-up reveals something different)
- Contingent (applies to this formulation, manufacturer, process)
Lessons from the Case Study
1. Initial pattern recognition preceded hypothesis formation: We documented before theorizing
2. Multiple competing hypotheses from the start: Avoided premature commitment
3. Evidence accumulated across methods:
   - Clinical observation → case series → case-control → prospective cohort → intervention
   - Each strengthened confidence incrementally
4. Mechanistic understanding developed alongside statistical association:
   - Not just "X4729 associated with itching"
   - But "X4729's elevated ethanol causes irritant dermatitis"
5. Severe testing throughout:
   - Each study designed to discriminate between alternatives
   - Actively sought disconfirming evidence
   - Tested in different populations
6. Bayesian updating made explicit:
   - Started with priors
   - Updated incrementally
   - Final posterior very high
7. Action calibrated to evidence:
   - Stage 0-1: Monitoring and documenting only
   - Stage 2: Small pilot intervention
   - Stage 3: Systematic studies (case-control, prospective cohort)
   - Stage 4: Switching all affected patients; informing FDA and other clinicians
   - Stage 5: Recall and policy changes
8. The process took months, not hours:
   - Resisted premature conclusions
   - Built evidence systematically
   - Worth the time to get it right
Closing Reflection
The scientific method is not an algorithm; it's a disposition.
A disposition toward:
- Curiosity tempered by skepticism
- Confidence tempered by humility
- Conviction tempered by openness to revision
- Action informed by evidence
- Certainty appropriate to what we actually know
Your hypotheses will fail. That's not a bug—it's a feature. Each failure teaches us something about the world and about how to ask better questions.
The journey from hunch to knowledge is rarely linear. It involves false starts, dead ends, surprising detours, and occasional breakthroughs. Embrace the messiness. Trust the process. Let the evidence guide you.
And always remember: the goal is not to be right—it's to become less wrong.
Document prepared by combining insights from:
- Internet Encyclopedia of Philosophy: Evidence (IEP)
- Desai et al. (2024): "The epistemological foundations of data science: a critical analysis"
For: Hypothesis evaluation from early ideation through established knowledge
Perspective: Integrated classical epistemology and modern data science
Purpose: Practical framework for rigorous thinking about evidence and belief