
A Framework for Hypothesis Evaluation: From Speculation to Evidence

Introduction

This framework provides a systematic approach to thinking through hypotheses and evaluating them as evidence accumulates. It synthesizes classical epistemology with modern data science to create a practical guide for navigating from early hunches through to well-supported conclusions.

The central challenge: How do we rationally navigate the journey from initial speculation—when data is sparse and uncertainty is high—to justified belief, as evidence accumulates and our understanding deepens?

Part I: The Nature of Evidence and Belief

What Counts as Evidence?

Before evaluating hypotheses, we must understand what evidence is and what it does.

Evidence can be:

Propositional: Statements that are true or false ("the patient's creatinine is 2.5 mg/dL")
Experiential: Direct observations and sensory data (the physical exam finding of ascites)
Derived: Outputs from instruments, algorithms, or analytical processes

Evidence serves to:

Justify belief: It provides rational grounds for accepting or rejecting propositions
Discriminate between hypotheses: It helps us choose among competing explanations
Increase or decrease probability: It shifts our confidence in particular claims
Explain phenomena: Evidence can explain other observations, and can itself be something a hypothesis must explain

The Evidential Relationship

Evidence E relates to hypothesis H through several possible relationships:

Relationship Type	Definition	Strength	When Applicable
Deductive	E logically entails H (or ~H)	Strongest	Mathematical proofs, formal logic
Probabilistic	E raises the probability of H: P(H|E) > P(H)	Variable	Most empirical contexts
Explanatory	H best explains E	Moderate-Strong	Causal inference, theory selection
Instantial	E is a positive instance of H	Weak-Moderate	Universal generalizations

Key insight: Different types of evidence require different evaluation frameworks. A single observation of a black raven provides weaker evidence for "all ravens are black" than a randomized trial provides for a treatment effect.

Part II: Stages of Hypothesis Development

Stage 0: Pre-Hypothesis (Observation and Pattern Recognition)

Characteristics:

Raw observations without clear interpretation
Vague intuitions or clinical hunches
Unsystematic data collection
No explicit hypothesis yet formulated

What to do:

Document observations carefully: What exactly did you observe? Under what conditions?
Distinguish observation from interpretation: Separate the raw data from your mental model
Note patterns and anomalies: What stands out? What defies current understanding?
Consider alternative explanations early: Resist premature commitment to a single interpretation

Modes of inference operative:

Abductive reasoning: Inferring the best explanation for surprising observations
Pattern recognition: Detecting structures in heterogeneous phenomena

Example:

Observation: Several patients on the same dialysis machine develop unexplained hypotension during treatment.

Not yet hypothesis-worthy: Just documented phenomena requiring explanation.

Pitfalls to avoid:

Confirmation bias: Noticing only data that fits an emerging story
Apophenia: Seeing patterns in random noise
Premature theorizing: Committing to an explanation before sufficient observation

Stage 1: Hypothesis Formation (Conjecture)

Characteristics:

Explicit statement of a testable proposition
Minimal supporting evidence
High uncertainty
Multiple plausible alternatives exist

Essential questions:

Is the hypothesis testable? Can evidence discriminate between H and ~H?
Is it falsifiable? What would count as evidence against it?
What are the alternatives? What other hypotheses could explain the same phenomena?
What would constitute strong evidence? What kind of data would move you significantly?

Formulating hypotheses well:

Poor hypothesis: "Something is wrong with the dialysis process"

Too vague to test
No clear alternatives
No specified relationship to observations

Better hypothesis: "The dialyzer membrane batch X-2347 causes complement activation leading to transient hypotension"

Testable through membrane analysis and complement markers
Falsifiable (could test other batches, measure complement)
Specifies mechanism
Suggests clear interventions

Modes of inference:

Abduction: Generating hypothesis as best explanation for existing observations
Analogy: Drawing on similar situations or mechanisms
Theoretical deduction: Deriving hypotheses from established theories

At this stage, your hypothesis has:

Coherence: Internal logical consistency
Plausibility: Compatibility with established knowledge
Explanatory potential: Accounts for known observations
Minimal empirical support: Perhaps 1-3 anecdotal observations

Epistemic status: Conjecture - rational to entertain but not yet rational to believe

Stage 2: Initial Evidence Collection (Weak Support)

Characteristics:

First systematic observations
Small sample size (n = 5-20)
Uncontrolled conditions
High potential for confounding

Key activities:

1. Establish the evidential baseline

What is P(H) before new evidence? (Your prior probability)
What evidence do you already possess?
What are the plausible alternatives and their priors?

2. Design initial tests

What is minimally necessary to discriminate H from alternatives?
What data is feasible to collect quickly?
How can you maximize the severity of the test?

3. Apply probabilistic thinking

Use Bayes' theorem to think through evidence:

P(H|E) = P(H) × P(E|H) / P(E)

Where:
P(H|E) = Probability of hypothesis given evidence (posterior)
P(H) = Prior probability of hypothesis
P(E|H) = Likelihood (probability of evidence if hypothesis true)
P(E) = Marginal probability of evidence
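The marginal P(E) is usually the hardest term to assess directly. By the law of total probability it can be expanded over H and ~H, and dividing the posterior for H by the posterior for ~H gives the odds form, in which the likelihood ratio is the factor by which the evidence shifts the odds. This sketch assumes H and ~H are the only two options on the table:

```latex
P(E) = P(E \mid H)\,P(H) + P(E \mid \lnot H)\,P(\lnot H)

\frac{P(H \mid E)}{P(\lnot H \mid E)}
  = \frac{P(H)}{P(\lnot H)} \times \frac{P(E \mid H)}{P(E \mid \lnot H)}
```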

Example continued:

You examine 15 patients who experienced hypotension:

12/15 were on machines using batch X-2347

3/15 were on machines using other batches

Evidence assessment:

P(E|H) is moderate: If the batch is causative, we'd expect high overlap

P(E|~H) is also moderate: Without the hypothesis, some overlap would occur by chance

This is suggestive but far from conclusive
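How moderate P(E|~H) really is depends on something the example does not state: the share of all treatments that used batch X-2347. The following is a minimal sketch, assuming that when H is false each hypotensive patient is independently on the batch with probability equal to that share. The shares tried here are hypothetical, not data from the example.

```python
from math import comb

def prob_at_least(k: int, n: int, share: float) -> float:
    """P(at least k of n cases are on the batch) when each case lands on
    the batch independently with probability `share` (binomial tail)."""
    return sum(comb(n, i) * share**i * (1 - share)**(n - i) for i in range(k, n + 1))

# Hypothetical usage shares; the example does not report the real one.
for share in (0.3, 0.5, 0.8):
    tail = prob_at_least(12, 15, share)
    print(f"share={share:.1f}  P(>=12 of 15 | ~H) = {tail:.4f}")
```

If the batch supplied most treatments, 12 of 15 is unremarkable even when H is false. If it supplied a minority, the same count is much harder to attribute to chance, which is why the exposure share belongs in the evidential baseline.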

Evaluating weak evidence:

Criterion	Assessment Method
Relevance	Does E actually bear on H, or could it be explained equally by background factors?
Reliability	How trustworthy is the measurement/observation?
Discrimination	Does E distinguish H from alternatives, or is it compatible with many hypotheses?
Severity	Would this test probably produce different results if H were false?

Mayo's Severity Principle: Evidence E provides strong support for H only if:

E "fits" or "agrees with" H
The test had a high probability of producing a result less compatible with H if H were false

At this stage, you likely fail the severity test. A small uncontrolled observation could easily produce these results even if H is false.

Epistemic status: Weak hypothesis - some evidence in favor, but underdetermination remains high. Rational to investigate further, but premature to act on or believe strongly.

Critical decision point:

If evidence is positive but weak: Pursue more rigorous investigation
If evidence is negative or equivocal: Reconsider hypothesis or reformulate
If alternative explanations remain plausible: Design tests to discriminate between them

Stage 3: Accumulating Evidence (Building Support)

Characteristics:

Moderate sample sizes (n = 20-200)
Some control for confounding
Replication attempts
Consideration of alternative explanations

Key activities:

1. Expand and diversify evidence

Temporal variation: Does the relationship hold over time?
Population variation: Does it hold across different subgroups?
Methodological variation: Do different measurement approaches yield consistent results?
Contextual variation: Does it hold in different settings or conditions?

Why diversification matters:

The same evidence can support multiple hypotheses if it's not sufficiently varied. Consider Achinstein's requirement for high probability: evidence must be varied enough to make P(H|E) genuinely high, not just higher than P(H).

Example:

Weak diversity: Testing only morning dialysis sessions, only male patients, only one clinic

Many confounders remain plausible

Alternative explanations not ruled out

Strong diversity: Multiple times of day, both sexes, multiple clinics, different operators

Systematic confounders less plausible

Specificity of relationship becomes clearer

2. Test competing hypotheses directly

Don't just accumulate evidence for your hypothesis—actively seek evidence that would favor alternatives:

Your Hypothesis (H1)	Alternative (H2)	Discriminating Evidence
Batch X-2347 membranes cause hypotension	Contaminated dialysate in one clinic	Test membranes from batch in different clinics
Complement activation mechanism	Endotoxin contamination	Measure complement markers AND endotoxin levels
Manufacturing defect	Storage/transport problem	Examine membranes from same batch with different storage histories

3. Quantify uncertainty appropriately

As evidence accumulates, maintain explicit uncertainty estimates:

Point estimate: What is your best estimate of the effect size, or of P(H)?
Confidence interval: What range of values is consistent with your evidence?
Sensitivity analysis: How much would conclusions change with different assumptions?

4. Apply causal inference criteria

When the hypothesis involves causation (as most do), consider Bradford Hill's criteria:

Criterion	Question	Status in Example
Strength	How large is the association?	Moderate (12 of 15 cases on X-2347, 3 on other batches)
Consistency	Has it been replicated?	Needs testing across clinics
Specificity	Is the outcome specific to the exposure?	Unclear—other batches?
Temporality	Does cause precede effect?	Yes—batch introduced before symptoms
Biological gradient	Is there a dose-response?	Could test concentration/duration
Plausibility	Is there a known mechanism?	Yes—complement activation is known
Coherence	Does it fit with existing knowledge?	Yes—consistent with immunology
Experiment	Can we intervene?	Yes—can switch batches
Analogy	Are there similar known effects?	Yes—other membrane reactions known

None of these alone is sufficient, but collectively they build the case.

5. Beware of black boxes

If your evidence comes from complex algorithms or processes:

Interpretability: Can you explain why the evidence points to H?
Generalizability: Will the relationship hold in new contexts?
Robustness: Is the evidence stable to perturbations in methods?

Example: If an ML algorithm predicts hypotension risk, but you can't explain why certain features matter, your evidence is weaker than if you have a mechanistic understanding.

Epistemic status: Supported hypothesis - evidence makes H substantially more probable than alternatives. Rational to have moderate confidence, perhaps sufficient to take action in low-stakes decisions. Not yet justified as knowledge in high-stakes contexts.

Modes of inference:

Induction: Generalizing from accumulating instances
Probabilistic inference: Updating degrees of belief via Bayes' theorem
Explanatory inference: Evaluating which hypothesis best unifies the diverse evidence

Stage 4: Strong Evidence (Approaching Justified Belief)

Characteristics:

Large sample sizes (n = 200+) OR
Highly controlled conditions (RCT) OR
Multiple independent replications
Systematic ruling out of confounders
Mechanistic understanding

Key activities:

1. Achieve severity in testing

At this stage, your evidence should pass Mayo's severity criterion:

Severe test achieved when:

You've designed tests specifically to refute H
These tests had high power to detect alternatives if present
The tests nevertheless yielded results consistent with H
You've explored the hypothesis's failure modes and found it robust

Example progression:

Week 1: Noticed pattern in 15 patients (Stage 2)

Week 3: Confirmed association in 50 patients across 3 clinics (Stage 3)

Week 6: Conducted controlled comparison:

Randomly assigned 100 patients to batch X-2347 vs. control batches

Blinded outcome assessment

Pre-specified primary endpoint: hypotension episodes

Measured complement markers to confirm mechanism

Results: 45% hypotension rate with X-2347 vs. 8% with controls (p<0.001)

Complement elevation correlates with hypotension (r=0.72)

This is severe: High probability that if H were false, we would have seen different results
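The p-value says the difference is unlikely to be chance; it does not say how large the difference is. A minimal sketch of the usual effect-size summaries, computed only from the two rates reported above (the function name is illustrative, and no per-arm counts are assumed):

```python
def effect_sizes(risk_exposed: float, risk_control: float) -> dict:
    """Summarize a two-arm comparison from the observed event rates."""
    difference = risk_exposed - risk_control
    return {
        "risk_difference": difference,
        "relative_risk": risk_exposed / risk_control,
        # Number needed to harm: exposures per one additional event.
        "number_needed_to_harm": 1 / difference,
    }

# Rates reported in the example: 45% with X-2347, 8% with control batches.
for name, value in effect_sizes(0.45, 0.08).items():
    print(f"{name}: {value:.2f}")
```

A risk difference of 37 percentage points means roughly one additional hypotensive episode for every three patients treated with the batch, which is the kind of magnitude a clinical decision turns on.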

2. Address the Duhem-Quine problem

Your evidence doesn't test H in isolation—it tests H plus all auxiliary assumptions. Make these explicit:

Hypothesis: Batch X-2347 membranes cause complement-mediated hypotension

Auxiliary assumptions being tested:

The complement assays are valid
The dialysis machines function properly
The patient selection wasn't biased
The hypotension measurement is accurate
There are no unmeasured confounders
The batch assignment was truly random

How to address:

Bootstrap approach: Use other established hypotheses plus your evidence to derive instances of H (bootstrapping in the confirmation-theory sense, not statistical resampling)
Vary auxiliary assumptions: Test H under different measurement approaches, different populations, etc.
Independent confirmation: Have others test H with completely different auxiliary hypotheses

3. Consider the total evidence

Your evidence for H exists within a broader web of belief. Assess coherence:

Internal consistency: Do different pieces of evidence point the same direction?
External consistency: Does H cohere with established scientific knowledge?
Explanatory power: Does H unify and explain diverse phenomena?
Predictive success: Can H successfully predict new observations?

4. Quantify strength of evidence

Use multiple metrics:

Probabilistic:

Likelihood ratio: P(E|H) / P(E|~H)
Bayes factor: Ratio of the probability of E under H to its probability under a rival hypothesis; the factor by which E shifts the odds
Posterior probability: P(H|E)

Frequentist:

p-value: Probability of a result at least as extreme as E if the null hypothesis (typically, no effect) is true
Effect size: Magnitude of relationship
Confidence interval: Range of plausible values

Qualitative:

Number of independent replications
Diversity of methods yielding consistent results
Strength of mechanism understanding

At this stage:

P(H|E) should be >0.90 for strong support
Effect size should be clinically/scientifically meaningful
Multiple independent lines of evidence should converge
Mechanistic understanding should be present

Epistemic status: Justified belief - rational to believe H is true, sufficient to act upon in most contexts, appropriate to communicate as established finding.

However: Not immune to revision. Remains open to refutation by future evidence.

Stage 5: Established Knowledge (High Confidence)

Characteristics:

Extensive replication across labs/groups
Integration into theoretical frameworks
Successful novel predictions
Practical applications that work
Consensus among experts

Distinguishing knowledge from justified belief:

Justified belief = H is highly probable given the evidence available to you

Knowledge = Justified belief that is:

True (corresponds to reality—though we can't always be certain)
Reliably formed (resulted from truth-conducive processes)
Socially validated (others with access to evidence reach same conclusion)
Predictively successful (enables successful interventions)

The genealogy of knowledge matters:

In data-intensive contexts, how we arrived at knowledge affects its epistemic status:

Theory-driven discovery: Started with mechanism, derived predictions, tested them
- Strengths: Deep understanding, generalizable, less prone to overfitting
- Weaknesses: Can miss unexpected patterns
Data-driven discovery: Patterns emerged from large-scale data analysis, mechanism inferred later
- Strengths: Can find surprising relationships, comprehensive
- Weaknesses: Higher risk of spurious patterns, requires external validation

Example:

Theory-driven: Understanding complement activation biology → predicting membrane reactions → testing specific membranes → confirming mechanism

Data-driven: ML algorithm identifies batch X-2347 as high-risk from EHR data → investigating why → discovering complement mechanism → validating with targeted experiments

Both can yield knowledge, but:

Theory-driven discovery starts with stronger prior plausibility

Data-driven discovery requires more extensive validation

Theory-driven findings are more likely to generalize beyond the data that produced them

Epistemic status: Established knowledge - forms part of the background against which new hypotheses are evaluated. High confidence, but not absolute certainty. Embedded in network of mutually supporting beliefs.

Key insight: Even established knowledge remains provisional. Science is self-correcting. New evidence can overturn what seemed certain.

Part III: Modes of Inference Across Stages

The Three Fundamental Modes

Throughout hypothesis evaluation, three modes of inference operate:

1. Deduction (Certainty)

Structure: If premises are true, conclusion must be true

Role in hypothesis evaluation:

Deriving testable predictions from hypotheses
Checking internal logical consistency
Mathematical and statistical reasoning
Ruling out logical impossibilities

Example:

H: All patients on batch X-2347 will show complement elevation

Patient Jones is on batch X-2347

Therefore: Patient Jones will show complement elevation (if H is true)

Strength: Provides certainty within the logical system

Limitation: Doesn't tell us whether premises match reality

2. Induction (Generalization)

Structure: Observed pattern in sample → pattern holds in population

Role in hypothesis evaluation:

Generalizing from observed instances to universal claims
Moving from finite data to probabilistic conclusions
Foundation of statistical inference

Example:

Observed: 45% of 100 patients on X-2347 developed hypotension

Induced: ~45% of all patients on X-2347 will develop hypotension (with uncertainty)

Strength: Enables predictions beyond observed data

Limitation: Never logically certain; the inductive step always involves a leap
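The "with uncertainty" attached to the induced 45% can be made explicit. The following is a minimal sketch of a Wilson score interval for the observed proportion, assuming the 100 patients behave like an independent random sample of the population of interest (an assumption the example does not establish):

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(45, 100)
print(f"45/100 -> plausible population rate from {low:.3f} to {high:.3f}")
```

The interval is wide (roughly 36% to 55%), and it captures sampling error only. Unrepresentative sampling, the other quality criterion for induction, is not reflected in it at all.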

Varieties of induction:

Enumerative: X₁ is Y, X₂ is Y, X₃ is Y → All X are Y
Statistical: High frequency in sample → high frequency in population
Analogical: X is similar to Y in respects A,B,C; X has property D → Y probably has property D

Quality criteria:

Sample size: Larger = stronger
Representativeness: Random/diverse sampling = stronger
Effect size: Larger deviations from null = stronger
Background knowledge: Consistent with theory = stronger

3. Abduction (Inference to Best Explanation)

Structure: Surprising observation E → H would explain E → Therefore H (tentatively)

Role in hypothesis evaluation:

Generating initial hypotheses from puzzling observations
Choosing between empirically equivalent theories
Integrating diverse evidence into unified explanation

Example:

Observation: Patients develop hypotension specifically with batch X-2347

Hypothesis 1: Complement activation from membrane defect

Hypothesis 2: Contamination during manufacturing

Hypothesis 3: Coincidental timing with other factors

H1 best explains the observations (specificity, mechanism, predictability) → Tentatively accept H1

Strength: Enables discovery and hypothesis generation

Limitation: "Best explanation" is often subjective and can change

Criteria for best explanation:

Explanatory power: Accounts for more phenomena
Simplicity: Fewer ad hoc assumptions (Occam's razor)
Unification: Connects disparate observations
Predictive fertility: Generates novel testable predictions
Coherence: Fits with established knowledge

Combining Modes Across Stages

Stage	Primary Mode(s)	Role
0: Observation	Abduction	Generating proto-hypotheses from surprising patterns
1: Formation	Abduction + Deduction	Formulating testable H, deriving predictions
2: Initial Evidence	Induction + Abduction	Generalizing from first instances, comparing explanations
3: Accumulating	Induction + Deduction	Statistical inference, testing logical consequences
4: Strong Evidence	All three	Induction for generalization, deduction for testing, abduction for integration
5: Established	Primarily deduction	Using H as premise to derive new predictions

The Rebalancing in Data-Intensive Science

Traditional scientific method (pre-big data):

Heavy emphasis on deduction from theory
Theory → predictions → small-scale tests → theory refinement
Induction limited by small sample sizes
Abduction for anomaly resolution

Data-intensive scientific method (contemporary):

Elevated role for induction from large datasets
Data patterns → hypothesis → mechanistic explanation → further testing
Machine learning enables pattern detection at scale
Abduction for integrating data-driven findings with theory

Key insight: Data science doesn't eliminate theory—it rebalances the inference modes. Theory still:

Guides what data to collect
Frames interpretations
Provides mechanistic understanding
Determines what counts as "interesting" patterns

But: Large-scale induction can now suggest hypotheses that would never emerge from pure theory-driven deduction.

Part IV: Common Pitfalls and How to Avoid Them

Pitfall 1: Premature Conviction

Manifestation: Treating weak evidence as strong; acting on hypotheses before sufficient support

Why it happens:

Psychological need for certainty
Pressure to act or decide
Overconfidence from initial positive findings
Availability bias (recent/vivid evidence overweighted)

How to avoid:

Explicitly track evidential stage
Maintain calibrated confidence intervals
Use pre-registered analysis plans
Seek disconfirming evidence actively
Engage with informed skeptics

Correction mechanism:

Before acting on H, ask: "What stage am I at?"
If Stage 2-3: Frame as "investigation" not "conclusion"
Require Stage 4 evidence for high-stakes decisions

Pitfall 2: Confirmation Bias

Manifestation: Seeking and interpreting evidence in ways that confirm pre-existing beliefs

Why it happens:

Cognitive ease (familiar ideas feel true)
Emotional attachment to hypotheses
Career/reputational investment
Selective attention and memory

How to avoid:

Pre-registration: Specify hypothesis and analysis plan before seeing full data
Adversarial collaboration: Partner with someone who holds alternative hypothesis
Red team exercise: Explicitly try to disprove your own hypothesis
Consider alternatives: For every piece of confirming evidence, ask "what else could explain this?"
Track disconfirming evidence: Keep explicit log of evidence against H

Correction mechanism:

Bayes' theorem forces accounting for both P(E|H) and P(E|~H)
Mayo's severity principle: Evidence only counts if it could have refuted H

Pitfall 3: Multiple Testing and P-Hacking

Manifestation: Finding "significant" results by testing many hypotheses or analysis methods

Why it happens:

Natural to explore data multiple ways
Publication bias rewards positive findings
Researchers unaware they're p-hacking
Lack of correction for multiple comparisons

How to avoid:

Bonferroni or other corrections: Adjust significance threshold for number of tests
Hold-out validation: Test on completely independent dataset
Pre-registration: Commit to analysis approach before seeing data
Report all tests performed: Not just the significant ones
Use Bayesian approaches: Hierarchical models and skeptical priors can reduce, though not eliminate, sensitivity to multiple testing

Example:

You test 20 different batches for association with hypotension

One shows p=0.03

Without correction, this could be random chance (0.05 × 20 = 1 expected false positive)

Need: Bonferroni correction (α = 0.05/20 = 0.0025) or replication
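The arithmetic in this example can be checked directly. The following is a minimal sketch, assuming the 20 tests are independent and that no batch has a real effect; the function names are illustrative:

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests

alpha, n_tests, observed_p = 0.05, 20, 0.03
threshold = bonferroni_threshold(alpha, n_tests)
print(f"expected false positives: {alpha * n_tests:.1f}")
print(f"chance of at least one:   {familywise_error_rate(alpha, n_tests):.2f}")
print(f"Bonferroni threshold:     {threshold:.4f}")
print(f"p = {observed_p} survives correction: {observed_p < threshold}")
```

Under these assumptions the chance of at least one spurious "hit" is well above one half, and p=0.03 does not clear the corrected threshold. Bonferroni is conservative, so a result that fails it is a candidate for replication, not necessarily a dead end.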

Pitfall 4: Confusing Correlation and Causation

Manifestation: Inferring causal relationship from mere association

Why it happens:

Intuitive to interpret correlation causally
Causal language is natural
Lack of understanding of confounding
Temporal precedence mistaken for causation

How to avoid:

Explicit causal diagrams: Draw DAGs showing assumed relationships
Consider confounders: What else could cause both variables?
Look for mechanism: How would cause produce effect?
Seek natural experiments: Quasi-random exposure variations
Use causal inference methods: Instrumental variables, difference-in-differences, RCT

Bradford Hill criteria (revisited): Not definitive but helpful heuristics

Correction mechanism:

Distinguish "X is associated with Y" from "X causes Y"
Report associations as associations until causality established
Recognize correlation is often the first step toward understanding causation

Pitfall 5: Ignoring Base Rates (Base Rate Fallacy)

Manifestation: Evaluating evidence without considering prior probability

Why it happens:

Base rates often unknown or hard to estimate
Recent evidence psychologically vivid
Lack of Bayesian thinking
Focus on P(E|H) without considering P(H)

Example:

Diagnostic test is 95% sensitive and 95% specific

Patient tests positive for rare disease (prevalence 0.1%)

Common intuition: "95% chance patient has disease"

Actually: P(disease|positive) ≈ 1.9% (by Bayes' theorem)

Why? The disease is so rare that most positive tests are false positives
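Counting people instead of multiplying probabilities makes the result easier to see. The following is a minimal sketch in natural frequencies, using the figures from the example; the reference population of 100,000 is an arbitrary choice for illustration:

```python
def positive_predictive_value(prevalence, sensitivity, specificity, population=100_000):
    """Natural-frequency form of Bayes' theorem for a positive test result."""
    diseased = population * prevalence
    healthy = population - diseased
    true_positives = diseased * sensitivity
    false_positives = healthy * (1 - specificity)
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv

tp, fp, ppv = positive_predictive_value(prevalence=0.001, sensitivity=0.95, specificity=0.95)
print(f"true positives:  {tp:.0f}")
print(f"false positives: {fp:.0f}")
print(f"P(disease | positive) = {ppv:.3f}")
```

Of 100,000 people tested, about 95 true positives are outnumbered by about 4,995 false positives, so a positive result leaves the probability of disease below 2%.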

How to avoid:

Always start with base rates: P(H) before considering evidence
Use Bayes' theorem explicitly
Consider both P(E|H) and P(E|~H)
Remember: Evidence that would be surprising if H were false is more diagnostic than evidence expected either way

Pitfall 6: Overfit Models and Spurious Patterns

Manifestation: Complex models fit noise rather than signal; "patterns" that don't replicate

Why it happens:

High-dimensional data (many variables)
Flexible models (many parameters)
Optimization on same data used for model building
Lack of external validation

How to avoid:

Cross-validation: Test on data not used for training
Regularization: Penalize model complexity
External validation: Test in completely different population
Mechanistic plausibility: Does pattern make sense?
Simplicity preference: Favor simpler models unless complexity justified

Data science specific:

Be especially careful with black-box ML models
Understand training/validation/test split
Report out-of-sample performance
Consider adversarial examples
Test robustness to data perturbations
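The gap between in-sample and out-of-sample performance is easy to demonstrate. The following is a minimal sketch with synthetic data: the labels are pure coin flips, so there is nothing to learn, and the "model" is a 1-nearest-neighbor rule that simply memorizes its training set. All names and sizes are illustrative.

```python
import random

def nearest_label(x, train):
    """Predict by copying the label of the closest training point (1-NN)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(data, train):
    """Fraction of points in `data` whose label the 1-NN rule gets right."""
    hits = sum(nearest_label(x, train) == y for x, y in data)
    return hits / len(data)

def make_noise(rng, n):
    """Random feature, random label: no real relationship between them."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_noise(rng, 200)
held_out = make_noise(rng, 200)

train_acc = accuracy(train, train)        # scored on the data it memorized
held_out_acc = accuracy(held_out, train)  # scored on unseen data
print(f"in-sample accuracy: {train_acc:.2f}")
print(f"held-out accuracy:  {held_out_acc:.2f}")
```

The memorizing rule looks perfect on its own training data and falls to about chance on held-out data. Only the second number says anything about whether a pattern exists.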

Pitfall 7: Underdetermination (Duhem-Quine Problem)

Manifestation: Evidence fails to uniquely select among competing hypotheses

Why it happens:

Hypotheses rarely tested in isolation
Auxiliary assumptions implicit
Multiple theories compatible with same data
Evidence logically consistent with alternatives

Example:

Finding: Patients on batch X-2347 have high hypotension rate

H1: The membranes are defective

H2: The membranes are fine, but they're stored improperly at certain sites

H3: The membranes are fine, but used differently by certain technicians

H4: The patients assigned to these membranes differ in unmeasured ways

Same evidence, many explanations

How to avoid:

Make auxiliaries explicit: State all assumptions clearly
Test auxiliaries independently: Validate measurement instruments, check for selection bias
Design discriminating tests: Create situations where H1 and H2 make different predictions
Vary auxiliary assumptions: Bootstrap approach—use different auxiliaries to test H
Seek mechanism: Understanding how H works helps rule out alternatives

Correction mechanism:

Use Bayesian reasoning to compare P(E|H1) vs P(E|H2) vs P(E|H3)...
Design crucial experiments that yield different results under different hypotheses
Remember: Underdetermination is often temporary—future evidence can discriminate

Pitfall 8: The File Drawer Problem

Manifestation: Negative results unpublished; literature biased toward positive findings

Why it happens:

Publication bias against null results
Career incentives favor novel positive findings
Negative results perceived as "boring"
Difficulty publishing replications

Implications for hypothesis evaluation:

Published evidence is selected sample, not representative
True evidence for H may be weaker than appears
Replication rate lower than expected from published record
Meta-analyses biased

How to avoid:

Pre-registration of studies: Commit to publishing regardless of outcome
Funnel plots: Look for asymmetry suggesting missing negative results
Consider prior plausibility: Extraordinary claims need extraordinary evidence
Value replications: Attempt to replicate key findings
Report all your tests: Not just the significant ones

When evaluating others' evidence:

Ask: "How many unpublished negative results might exist?"
Discount evidence from literatures with known publication bias
Seek pre-registered studies
Weight high-powered negative results heavily

Part V: Practical Decision Framework

The Hypothesis Evaluation Checklist

Use this checklist to assess where you are and what you need:

Stage Assessment

I have clearly stated my hypothesis H
I have identified 2-3 plausible alternative hypotheses
I can state what evidence would refute H (falsifiability)
I can state what evidence would strongly support H
I have estimated my current evidential stage (0-5)

Evidence Quality

My evidence is relevant to H (not merely compatible)
My evidence discriminates between H and alternatives
My sample size is adequate for the claim I'm making
My evidence is diverse (varied contexts, methods, populations)
Potential confounders have been addressed
The measurement is reliable and valid
I have considered P(E|H) and P(E|~H)

Inference Quality

I can identify which mode(s) of inference I'm using
If induction: sample is representative and adequate size
If abduction: H is genuinely the best explanation, not just compatible
If deduction: premises are justified
I have considered alternative explanations
I have tested the severity of my evidence

Bias Check

I have actively sought disconfirming evidence
I have consulted with informed skeptics
I have considered confirmation bias in my interpretation
I have corrected for multiple testing if applicable
I have not cherry-picked data or analyses
I have made auxiliary assumptions explicit

Communication

I can clearly state the evidential stage
I communicate appropriate uncertainty
I distinguish correlation from causation
I acknowledge limitations explicitly
I frame conclusions at appropriate confidence level

Decision Matrix: When to Act on Hypothesis

Not all hypotheses require the same evidential standard. Use this matrix:

Stakes	Reversibility	Required Evidence Stage	Key Considerations
Low (low cost)	Easy to reverse	Stage 2-3	Act on weak evidence, learn quickly
Low (low cost)	Difficult to reverse	Stage 3	Need moderate confidence before committing
High (high cost)	Easy to reverse	Stage 3-4	Cost justifies higher bar, but reversibility provides safety
High (high cost)	Difficult to reverse	Stage 4-5	Maximum evidence required; potentially catastrophic if wrong

Examples:

Low stakes, high reversibility: Trying a new workflow in your office

Stage 2 evidence adequate
Learn by doing
Easy to revert if it doesn't work

High stakes, low reversibility: Removing an organ

Stage 4-5 evidence required
Irreversible decision
Lives at stake

Moderate stakes, moderate reversibility: Changing dialysis membrane for all patients

Stage 3-4 evidence needed
Can switch back, but disruptive
Patient safety important but changes possible

Updating Your Beliefs: The Bayesian Engine

At each stage, explicitly update your probability estimates:

Start with prior: P(H) = your initial probability before new evidence
- Based on: background knowledge, theoretical plausibility, prior evidence
Observe new evidence E
Assess likelihoods:
- P(E|H) = probability of seeing E if H is true
- P(E|~H) = probability of seeing E if H is false
Calculate posterior:

   P(H|E) = P(H) × P(E|H) / [P(H) × P(E|H) + P(~H) × P(E|~H)]

Your new prior = P(H|E) for the next round of evidence

Example walkthrough:

Initial:

P(H) = 0.10 (novel hypothesis, somewhat plausible)
H: Batch X-2347 causes hypotension

Evidence 1: Of 15 patients with hypotension, 12 were on X-2347 and 3 were on other batches

P(E1|H) = 0.70 (if true, would expect strong association, but not perfect)
P(E1|~H) = 0.20 (by chance, might see some clustering)
Calculate: P(H|E1) = 0.10 × 0.70 / [0.10 × 0.70 + 0.90 × 0.20] = 0.28

Evidence 2: Complement markers elevated in X-2347 patients

Start with: P(H) = 0.28 (our new prior)
P(E2|H) = 0.80 (mechanism predicts this)
P(E2|~H) = 0.10 (less likely without membrane issue)
Calculate: P(H|E2) = 0.28 × 0.80 / [0.28 × 0.80 + 0.72 × 0.10] = 0.76

Evidence 3: Randomized trial shows 45% vs 8% hypotension rate

Start with: P(H) = 0.76
P(E3|H) = 0.95 (RCT with large effect strongly expected if H true)
P(E3|~H) = 0.01 (very unlikely to see this if H false)
Calculate: P(H|E3) = 0.76 × 0.95 / [0.76 × 0.95 + 0.24 × 0.01] = 0.997

Now at Stage 4-5: Justified belief, ready to act
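The walkthrough can be reproduced with a few lines of Python. This is a minimal sketch: the function name bayes_update and the evidence labels are assumptions made for the example, and chaining the updates this way treats the three pieces of evidence as conditionally independent given H and given ~H, which the walkthrough also assumes implicitly.

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(H|E) from the prior and the two likelihoods."""
    numerator = prior * p_e_given_h
    denominator = numerator + (1.0 - prior) * p_e_given_not_h
    return numerator / denominator


# (label, P(E|H), P(E|~H)) taken from the walkthrough above.
evidence = [
    ("E1: hypotension clustering", 0.70, 0.20),
    ("E2: complement markers", 0.80, 0.10),
    ("E3: randomized trial", 0.95, 0.01),
]

belief = 0.10  # initial P(H)
trajectory = [belief]
for label, p_e_h, p_e_not_h in evidence:
    # The posterior from one step becomes the prior for the next.
    belief = bayes_update(belief, p_e_h, p_e_not_h)
    trajectory.append(belief)
    print(f"{label}: P(H | evidence so far) = {belief:.3f}")
```

Carrying the unrounded posterior forward, rather than rounding at each step as the hand calculation does, leaves the final value at 0.997 to three decimals.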

Key principles:

Each piece of evidence updates your beliefs incrementally
Strong evidence (high P(E|H), low P(E|~H)) produces large updates
Weak evidence (similar P(E|H) and P(E|~H)) produces small updates
Multiple weak pieces can accumulate into strong support
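These principles are easiest to see in the odds form of Bayes' theorem, where posterior odds equal prior odds multiplied by the likelihood ratio P(E|H) / P(E|~H). The sketch below uses hypothetical numbers (a prior of 0.10 and likelihood ratios of 1, 2, and 32) and assumes the pieces of evidence are independent given H and given ~H; correlated evidence would accumulate more slowly.

```python
from typing import Iterable


def posterior_from_lrs(prior: float, likelihood_ratios: Iterable[float]) -> float:
    """Multiply prior odds by each likelihood ratio, then convert back to a probability."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)


prior = 0.10  # hypothetical starting belief

uninformative = posterior_from_lrs(prior, [1.0])       # P(E|H) == P(E|~H): no update
five_weak = posterior_from_lrs(prior, [2.0] * 5)       # five weak, independent findings
one_strong = posterior_from_lrs(prior, [32.0])         # one finding with LR = 2**5

print(f"LR = 1:          {uninformative:.3f}")
print(f"five LR = 2:     {five_weak:.3f}")
print(f"single LR = 32:  {one_strong:.3f}")
```

Five independent findings with a likelihood ratio of 2 move the belief exactly as far as a single finding with a likelihood ratio of 32, because the ratios multiply.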

Part VI: Special Topics

Working with Black Box Models

Machine learning models increasingly provide "evidence" but with opacity problems.

Types of opacity:

Intentional concealment: Proprietary algorithms
Technical literacy barrier: Requires advanced knowledge
Inherent complexity: Even experts can't fully explain

Epistemological challenges:

Interpretability: Can you explain why the model makes its predictions?

Global: Understanding overall model behavior
Local: Understanding specific predictions

Overfitting: Model fits training noise, not generalizable patterns

Reliability: Does model maintain performance in new contexts?

Framework for evaluating black box evidence:

Question	Assessment Strategy
Is the prediction accurate?	Out-of-sample validation, cross-validation
Is it causally meaningful?	Intervention studies, causal inference methods
Is it robust?	Adversarial testing, perturbation analysis
Is it explainable?	SHAP values, LIME, attention mechanisms
Does mechanism make sense?	Expert review, consistency with theory
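As an illustration of the first row, the sketch below estimates out-of-sample accuracy with k-fold cross-validation using only the standard library. Everything in it is an assumption made for the example: the data are synthetic, and the "model" is a toy threshold classifier standing in for whatever black box is being evaluated.

```python
import random
from typing import List, Tuple


def k_fold_splits(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices 0..n-1 and return k (train, test) index pairs."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits


def fit_threshold(xs: List[float], ys: List[int]) -> float:
    """Toy model: midpoint between class means (assumes class 1 has the higher mean)."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def cross_validated_accuracy(xs: List[float], ys: List[int], k: int = 5) -> float:
    """Fit on each training fold, score on the held-out fold, and average."""
    scores = []
    for train, test in k_fold_splits(len(xs), k):
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] > threshold) == (ys[i] == 1) for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Synthetic data: two overlapping Gaussian classes.
rng = random.Random(42)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(1.0 if y == 1 else -1.0, 1.0) for y in ys]

cv_accuracy = cross_validated_accuracy(xs, ys, k=5)
print(f"5-fold cross-validated accuracy: {cv_accuracy:.2f}")
```

Cross-validation only speaks to accuracy on data drawn from the same source. The remaining rows of the table (causal meaning, robustness, mechanism) need separate checks.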

When black box evidence is appropriate:

Prediction accuracy is what matters (not explanation)
Human performance baseline is low
Model extensively validated out-of-sample
Stakes are low or decision is reversible
Human oversight remains in place

When black box evidence is problematic:

Causal understanding needed
High stakes, irreversible decisions
Model might encode biases
Generalization beyond training context required
Accountability and transparency legally required

Mitigation strategies:

Require external validation datasets
Use inherently interpretable models when possible
Apply post-hoc explanation methods
Compare black box to mechanistic models
Integrate domain expertise in model evaluation

Theory-Free vs. Theory-Driven Science

The proliferation of data has enabled new modes of discovery.

Classical paradigm: Theory → Hypothesis → Prediction → Test → Refine theory

Data-intensive paradigm: Data → Pattern → Hypothesis → Mechanism → Test → Theory

Neither is superior in principle—both have roles:

Aspect	Theory-Driven	Data-Driven
Starting point	Theoretical understanding	Observed patterns
Hypothesis generation	Deductive from theory	Inductive from data
Risk	Confirmation bias	Spurious patterns
Strength	Strong prior plausibility	Discovery of unexpected
Generalization	Often good	Requires extensive validation
Understanding	Deep mechanistic	May remain phenomenological

Key insight from Kitchin: Data-driven science doesn't eliminate theory—it rebalances inference modes. Even "theory-free" science involves:

Theory in selecting what data to collect
Theory in interpreting patterns
Theory in distinguishing signal from noise
Theory in generalizing findings

Implications for hypothesis evaluation:

If hypothesis emerged from data mining:

Require more extensive external validation
Be especially alert to overfitting
Seek mechanistic understanding
Test in genuinely novel contexts
Correct for multiple testing
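The last item can be made concrete. The sketch below applies two standard corrections, Bonferroni (controls the family-wise error rate) and Benjamini-Hochberg (controls the false discovery rate), to a set of hypothetical p-values invented for this example. Which correction fits depends on how costly a single false positive is.

```python
from typing import List


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from eight patterns found by data mining.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]

uncorrected = sum(p < 0.05 for p in p_values)
print(f"Uncorrected 'discoveries':  {uncorrected}")
print(f"After Bonferroni:           {sum(bonferroni(p_values))}")
print(f"After Benjamini-Hochberg:   {sum(benjamini_hochberg(p_values))}")
```

With these made-up values, five patterns look significant before correction, two survive Benjamini-Hochberg, and one survives Bonferroni.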

If hypothesis emerged from theory:

Still require rigorous testing
Watch for confirmation bias
May miss unexpected patterns
Consider data-driven approaches for hypothesis refinement

Best practice: Integrate both approaches

Use theory to guide data collection
Use data to generate novel hypotheses
Use theory to interpret data patterns
Use data to refine and test theory
Iterate between data and theory

The Problem of Old Evidence

Situation: You formulate hypothesis H to explain already-known evidence E

Challenge: Since P(E) = 1 (it's already known), Bayes' theorem seems to imply E can't increase P(H)
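The arithmetic behind the challenge is short. If E is already certain, then P(E|H) = P(E|~H) = 1, and the update leaves the prior untouched. The sketch below shows this, followed by the counterfactual reading in which the likelihoods are assessed as if E were not yet known. The prior of 0.30 and the counterfactual likelihoods of 0.90 and 0.10 are hypothetical values chosen only for illustration.

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(H|E) from the prior and the two likelihoods."""
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1.0 - prior) * p_e_given_not_h)


prior = 0.30  # hypothetical

# E already known with certainty: both likelihoods are 1, so nothing changes.
posterior_known = bayes_update(prior, 1.0, 1.0)

# Counterfactual reading: how probable would E have been had we not known it?
# These two likelihoods are assumed values, not taken from the text.
posterior_counterfactual = bayes_update(prior, 0.90, 0.10)

print(f"E treated as certain:       {posterior_known:.3f}")
print(f"E assessed counterfactually: {posterior_counterfactual:.3f}")
```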

Example:

Newton formulated gravitational theory partly to explain Kepler's laws, which were already known
Einstein formulated GR partly to explain Mercury's perihelion precession (already observed)
But surely these explanations count as evidence for the theories?

Solutions:

1. Logical Bayesian: Consider counterfactual—if you hadn't known E, how would H have predicted it?

2. Explanatory value: E counts as evidence when H provides best explanation, even if E known first

3. Historical reconstruction: What would rational agent have believed before E was known?

Practical implication:

Hypothesis that explains old evidence has value
But: Less impressive than predictions of novel evidence
Novel predictions provide stronger confirmation
Old evidence can still increase confidence when hypothesis provides superior explanation to alternatives

For hypothesis evaluation:

Seek opportunities for novel predictions
Value prospective validation over retrospective fit
Pre-register predictions before testing
Distinguish "predicted" from "postdicted"

Part VII: Synthesis and Principles

Ten Commandments of Hypothesis Evaluation

State your hypothesis explicitly and testably
- Vague hunches are not hypotheses
- Specify what would count as evidence for and against
Know your evidential stage
- Don't confuse weak evidence with strong
- Adjust confidence claims to evidence quality
- Match actions to evidential strength
Consider alternative hypotheses
- Never evaluate H in isolation
- Design tests that discriminate between alternatives
- Actively seek disconfirming evidence
Make auxiliary assumptions explicit
- Evidence tests H + auxiliaries, not H alone
- Validate your methods and measures
- Consider what else must be true for your interpretation to hold
Update beliefs incrementally via Bayesian reasoning
- Start with prior probabilities
- Update on new evidence
- Consider both P(E|H) and P(E|~H)
Diversify your evidence
- Vary populations, contexts, methods
- Replicate findings independently
- Triangulate with different data sources
Test severely
- Design tests capable of refuting H
- High power to detect alternatives
- Embrace opportunities for falsification
Guard against biases
- Pre-register when possible
- Seek out informed skeptics
- Report all analyses, not just significant ones
- Correct for multiple testing
Distinguish types of knowledge
- Association ≠ causation
- Prediction ≠ explanation
- Statistical significance ≠ practical importance
- Individual-level ≠ population-level
Communicate uncertainty honestly
- Match language to evidential strength
- Acknowledge limitations explicitly
- Distinguish what you know from what you suspect
- Update in light of new evidence

Final Thought: Epistemic Humility

Science is not about achieving certainty—it's about reducing uncertainty incrementally while maintaining appropriate humility.

Even our best-supported hypotheses remain provisional. They are:

Justified by current evidence
Coherent with existing knowledge
Productive of successful predictions
Subject to revision by future evidence

The mark of scientific maturity is not confidence in what you know, but:

Clarity about what you don't know
Honesty about what the evidence actually supports
Willingness to revise in light of new evidence
Ability to hold beliefs with appropriate tentativeness

Your hypothesis has made it from wild speculation to justified belief. Celebrate that achievement. And remain open to being proven wrong.

Appendix A: Quick Reference Guide

Evidence Strength Terminology

Term	Likelihood Ratio (LR)	Qualitative Description
Negligible	1 ≤ LR < 1.5, so P(H|E) ≈ P(H)	E doesn't change your mind
Weak	1.5 ≤ LR < 3	E shifts belief slightly
Moderate	3 ≤ LR < 10	E shifts belief substantially
Strong	10 ≤ LR ≤ 100	E strongly supports H
Very Strong	LR > 100	E overwhelmingly supports H

Where LR = Likelihood Ratio = P(E|H) / P(E|~H)
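A small helper can attach these terms to a computed likelihood ratio. This is a sketch under stated assumptions: the function names are invented for the example, values falling exactly on a boundary are assigned to the stronger band (except LR = 100, which is kept as Strong), and evidence against H (LR < 1) is graded symmetrically by its reciprocal, which the table itself does not address.

```python
from typing import Tuple


def likelihood_ratio(p_e_given_h: float, p_e_given_not_h: float) -> float:
    """LR = P(E|H) / P(E|~H)."""
    return p_e_given_h / p_e_given_not_h


def evidence_strength(lr: float) -> Tuple[str, str]:
    """Return (term, direction) for a likelihood ratio, using the table's bands."""
    if lr <= 0:
        raise ValueError("likelihood ratio must be positive")
    # Assumption: LR < 1 is graded by its reciprocal and reported as evidence against H.
    magnitude = max(lr, 1.0 / lr)
    if magnitude < 1.5:
        term = "Negligible"
    elif magnitude < 3:
        term = "Weak"
    elif magnitude < 10:
        term = "Moderate"
    elif magnitude <= 100:
        term = "Strong"
    else:
        term = "Very Strong"
    direction = "for H" if lr > 1 else "against H" if lr < 1 else "neutral"
    return term, direction


# Likelihoods from the Bayesian walkthrough in the main text.
for label, p_h, p_not_h in [("E1", 0.70, 0.20), ("E2", 0.80, 0.10), ("E3", 0.95, 0.01)]:
    lr = likelihood_ratio(p_h, p_not_h)
    print(label, round(lr, 1), evidence_strength(lr))
```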

Sample Size Guidelines

Very rough heuristics—context matters enormously

Stage	Minimum n	Better n	Comments
Stage 1-2	5-10	20-30	Exploratory only
Stage 3	30-50	100-200	Moderate confidence
Stage 4	100+	500+	For strong claims
Stage 5	Multiple studies	Meta-analysis	Consensus building

Remember:

Quality > quantity
Diverse small samples > large homogeneous sample
Well-controlled small RCT > large observational study (for causation)

When You Need Different Methodologies

Hypothesis Type	Appropriate Methods
Association	Observational studies, correlation
Causation	RCT, natural experiments, causal inference methods
Mechanism	Experimental manipulation, pathway analysis
Prediction	Machine learning, validation datasets
Generalization	External validation, multiple populations
Rare events	Case-control, larger samples, Bayesian methods

Appendix B: Case Study Walkthrough

From Hunch to Knowledge: A Complete Example

Context: You're a nephrologist noticing patterns in your dialysis patients.

Stage 0: Observation

Initial observation: Over 3 weeks, you notice 5 patients develop severe itching that doesn't respond to usual interventions. You also notice they all started or increased calcium acetate.

What you do:

Document carefully: exactly which patients, when started, severity
Note: All five started same manufacturer's formulation
Don't commit to explanation yet—just documenting

Status: Pre-hypothesis. Interesting pattern, needs more observation.

Stage 1: Hypothesis Formation

Observation expands: Review last 6 months of charts. Find 15 total patients with unexplained itching, 12 on this calcium acetate formulation.

Formulate hypotheses:

H1: The calcium acetate formulation causes itching (sensitivity reaction)
H2: Patients needing high-dose calcium have more severe hyperparathyroidism causing itching
H3: Itching is seasonal (winter months) and calcium timing is coincidental
H4: This calcium acetate batch was contaminated

Testable predictions:

If H1: Switching formulations should resolve itching
If H2: PTH levels should correlate with itching; other high-calcium patients should itch too
If H3: Should see seasonal pattern across all patients
If H4: Only specific lot numbers should be associated

Prior probabilities (estimates):

P(H1) = 0.20 (plausible but unusual)
P(H2) = 0.30 (known that secondary hyperparathyroidism causes itching)
P(H3) = 0.15 (seasonal patterns exist)
P(H4) = 0.10 (contamination rare but possible)
P(other) = 0.25 (many possibilities)

Status: Stage 1. Multiple plausible hypotheses, minimal evidence to discriminate.

Stage 2: Initial Evidence Collection

Action: Small pilot intervention

Switch 5 currently itching patients to different calcium formulation
Continue monitoring 5 new patients starting original formulation
Check PTH levels on all
Note lot numbers

Results after 4 weeks:

4/5 switched patients: itching resolved within 2 weeks
1/5 switched patients: itching persisted (had PTH of 890)
2/5 new patients on original: developed itching
PTH levels not systematically elevated in itching patients (except the one)
All cases from same lot number: X4729

Update probabilities:

Evidence E1: Switching resolves itching in 4/5

P(E1|H1) = 0.80 (if sensitivity, would expect most resolve)
P(E1|H2) = 0.30 (wouldn't expect calcium change to help if PTH primary)
P(E1|H3) = 0.40 (seasonal effect might wane coincidentally)
P(E1|H4) = 0.85 (contamination would cause switching to help)

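Update: The products P(H) × P(E1|H) are 0.160 (H1), 0.090 (H2), 0.060 (H3), and 0.085 (H4). No likelihood is listed for the catch-all; assuming P(E1|other) = 0.30, normalizing gives P(H1|E1) ≈ 0.34; P(H2|E1) ≈ 0.19; P(H3|E1) ≈ 0.13; P(H4|E1) ≈ 0.18; P(other|E1) ≈ 0.16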

Evidence E2: All cases from lot X4729

P(E2|H1) = 0.50 (sensitivity might be to specific excipients varying by lot)
P(E2|H2) = 0.15 (PTH wouldn't cluster by lot)
P(E2|H3) = 0.10 (seasonal wouldn't cluster by lot)
P(E2|H4) = 0.95 (contamination would be lot-specific)

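Combined update: P(H4|E1,E2) ≈ 0.41; P(H1|E1,E2) ≈ 0.41. This treats E1 and E2 as independent given each hypothesis and assumes catch-all likelihoods of 0.30 for E1 and 0.20 for E2, since none are listed. The products P(H) × P(E1|H) × P(E2|H) are 0.081 for H4 and 0.080 for H1, against 0.014 for H2 and 0.006 for H3.

Status: Stage 2-3. Evidence now favors the two lot-related hypotheses, contamination (H4) and sensitivity reaction (H1), about equally and well ahead of H2 and H3. Need to discriminate between H1 and H4, and test more rigorously.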
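The Stage 2 updates can be checked by normalizing over all five hypotheses at once. The sketch below uses the priors and likelihoods listed above. The case study gives no likelihoods for the catch-all "other", so the values 0.30 (for E1) and 0.20 (for E2) are assumptions introduced here, and the chained update treats E1 and E2 as independent given each hypothesis.

```python
from typing import Dict


def update(priors: Dict[str, float], likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Multiply each prior by its likelihood and renormalize so the posteriors sum to 1."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


priors = {"H1": 0.20, "H2": 0.30, "H3": 0.15, "H4": 0.10, "other": 0.25}

# P(E|H) for each hypothesis; the "other" entries are assumed, not from the case study.
e1_switching_helps = {"H1": 0.80, "H2": 0.30, "H3": 0.40, "H4": 0.85, "other": 0.30}
e2_single_lot = {"H1": 0.50, "H2": 0.15, "H3": 0.10, "H4": 0.95, "other": 0.20}

after_e1 = update(priors, e1_switching_helps)
after_e2 = update(after_e1, e2_single_lot)

for name, posterior in (("after E1", after_e1), ("after E1 and E2", after_e2)):
    summary = ", ".join(f"{h}={p:.2f}" for h, p in posterior.items())
    print(f"{name}: {summary}")
```

Under these assumed catch-all values, H1 and H4 come out nearly tied after both pieces of evidence. Their ratio does not depend on the catch-all at all, since it is fixed by the listed priors and likelihoods.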

Stage 3: Accumulating Evidence

Action: More systematic investigation

Contact manufacturer for lot X4729 composition/testing
Check if other clinics using lot X4729 seeing similar issues
Test samples of lot X4729 vs other lots (send to lab)
Formal case-control study: 30 cases (itching) vs 60 controls (no itching)

Results:

Manufacturer reports lot X4729 passed all QC tests, but used slightly different drying process
Two other clinics report itching complaints with same lot
Lab testing: Lot X4729 has 2.3% higher residual solvent (ethanol) than other lots; within specs but higher
Case-control study:
- 28/30 cases were on lot X4729
- 12/60 controls were on lot X4729
- OR = 56.0, p < 0.001
- When stratified by residual ethanol level: dose-response evident
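The odds ratio follows directly from the 2×2 counts: 28 exposed and 2 unexposed cases, 12 exposed and 48 unexposed controls. The sketch below computes it together with an approximate 95% confidence interval using the Woolf (log) method. The function name is an assumption for this example, and with only 2 unexposed cases the normal approximation is rough, so an exact method would be preferable in a real analysis.

```python
import math
from typing import Tuple


def odds_ratio_with_ci(
    exposed_cases: int,
    unexposed_cases: int,
    exposed_controls: int,
    unexposed_controls: int,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Return (OR, lower, upper); the interval uses the Woolf log-based approximation."""
    a, b, c, d = exposed_cases, unexposed_cases, exposed_controls, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# 28 of 30 cases and 12 of 60 controls were on lot X4729.
or_estimate, or_low, or_high = odds_ratio_with_ci(28, 2, 12, 48)
print(f"OR = {or_estimate:.1f}, approximate 95% CI {or_low:.1f} to {or_high:.1f}")
```

The interval is wide because so few cases were unexposed, but its lower bound stays well above 1.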

Mechanistic hypothesis refined: H1b: Elevated residual ethanol in lot X4729 causes skin irritation/itching

This discriminates H1b from H4:

Not strictly "contamination" (within specs)
Manufacturing variation in acceptable compound
Specific mechanism identified

Update: P(H1b|all evidence) ≈ 0.85

Additional test: Prospective cohort

Follow 100 patients newly starting any calcium acetate
Track lot numbers and symptom development
20 patients started on lot X4729, 80 on other lots
Results: 9/20 (45%) on X4729 developed itching vs 4/80 (5%) on others
RR = 9.0, p < 0.001
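The relative risk can be checked the same way from the cohort counts. The sketch below also adds an approximate 95% confidence interval from the standard log-based (Katz) formula. The function name is an assumption for this example, and the interval is an approximation that becomes unreliable when event counts are very small.

```python
import math
from typing import Tuple


def relative_risk_with_ci(
    events_exposed: int,
    n_exposed: int,
    events_unexposed: int,
    n_unexposed: int,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Return (RR, lower, upper); the interval uses the log-based Katz approximation."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    rr = risk_exposed / risk_unexposed
    se_log_rr = math.sqrt(
        1 / events_exposed - 1 / n_exposed + 1 / events_unexposed - 1 / n_unexposed
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# 9 of 20 patients on lot X4729 developed itching, versus 4 of 80 on other lots.
rr_estimate, rr_low, rr_high = relative_risk_with_ci(9, 20, 4, 80)
print(f"RR = {rr_estimate:.1f}, approximate 95% CI {rr_low:.1f} to {rr_high:.1f}")
```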

Status: Stage 3-4. Strong evidence that lot X4729 specifically causes itching, likely via elevated residual ethanol. Multiple independent lines of evidence converge.

Stage 4: Strong Evidence

Action: Interventional validation

Switch all patients on X4729 to different lots (n=50)
Monitor symptom resolution
Test hypothesis in different population (non-dialysis CKD patients on calcium acetate)

Results:

43/50 patients had symptom resolution within 4 weeks
7/50 had persistent itching (chart review revealed other causes in 6)
Study in non-dialysis patients: smaller effect but same direction (lot X4729 associated with itching)

Mechanism confirmation:

Dermatology consult on 5 patients: consistent with irritant dermatitis
Patch testing with residual ethanol: positive in affected patients
Makes sense: ethanol is a known skin irritant

Severe test:

If hypothesis false, would not expect:
- Resolution with switching
- Replication across clinics
- Dose-response with ethanol level
- Positive patch tests
- Consistency in non-dialysis population

Meta-analysis perspective:

Original observation: 12/15 cases on X4729
Other clinics: 34/40 cases on X4729
Case-control: OR = 56.0 (28/30 cases vs 12/60 controls on X4729)
Prospective cohort: RR = 9.0
Intervention: 43/50 resolution

All converge on strong effect

Final probability estimate: P(H1b|all evidence) > 0.95

Status: Stage 4-5. Justified belief that lot X4729 causes itching via elevated residual ethanol. Appropriate to act: avoid this lot, inform FDA, inform other clinicians.

Stage 5: Established Knowledge

Subsequent developments:

FDA issues notice about lot X4729
Manufacturer recalls lot, improves drying process
Published case series confirms findings
Incorporated into clinical guidelines
Mechanism well-understood
Used to inform QC standards

The hypothesis has become established knowledge:

Part of background against which new observations evaluated
Taught to trainees
Guides clinical practice
High confidence but not absolute certainty

Epistemic status: This finding is now part of the edifice of medical knowledge, but remains:

Subject to refinement (maybe other lots have issues)
Subject to revision (maybe long-term follow-up reveals something different)
Contingent (applies to this formulation, manufacturer, process)

Lessons from the Case Study

Initial pattern recognition preceded hypothesis formation: We documented before theorizing
Multiple competing hypotheses from the start: Avoided premature commitment
Evidence accumulated across methods:
- Clinical observation → case series → case-control → prospective cohort → intervention
- Each strengthened confidence incrementally
Mechanistic understanding developed alongside statistical association:
- Not just "X4729 associated with itching"
- But "X4729's elevated ethanol causes irritant dermatitis"
Severe testing throughout:
- Each study designed to discriminate between alternatives
- Actively sought disconfirming evidence
- Tested in different populations
Bayesian updating made explicit:
- Started with priors
- Updated incrementally
- Final posterior very high
Action calibrated to evidence:
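- Stage 0-1: Monitoring and documenting only
- Stage 2: Small pilot intervention
- Stage 3: Systematic investigation (case-control study, prospective cohort)
- Stage 4: Switching all affected patients; informing FDA and other clinicians
- Stage 5: Recall, guideline, and QC changes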
The process took months, not hours:
- Resisted premature conclusions
- Built evidence systematically
- Worth the time to get it right

Closing Reflection

The scientific method is not an algorithm. It is a disposition.

A disposition toward:

Curiosity tempered by skepticism
Confidence tempered by humility
Conviction tempered by openness to revision
Action informed by evidence
Certainty appropriate to what we actually know

Your hypotheses will fail. That's not a bug—it's a feature. Each failure teaches us something about the world and about how to ask better questions.

The journey from hunch to knowledge is rarely linear. It involves false starts, dead ends, surprising detours, and occasional breakthroughs. Embrace the messiness. Trust the process. Let the evidence guide you.

And always remember: the goal is not to be right—it's to become less wrong.

Document prepared by combining insights from:

Internet Encyclopedia of Philosophy: Evidence (IEP)
Desai et al. (2024): "The epistemological foundations of data science: a critical analysis"

For: Hypothesis evaluation from early ideation through established knowledge

Perspective: Integrated classical epistemology and modern data science

Purpose: Practical framework for rigorous thinking about evidence and belief
