The breakthrough: Four integrated protocols that transform how humans and AI work together by detecting when creative breakthroughs emerge, ensuring transparent attribution without penalty, amplifying knowledge from marginalized communities, and orchestrating productive disagreement in AI teams. This protocol pack makes human-AI collaboration measurably better, immediately fairer, and sustainably more inclusive.
Why it works: Convergent evidence from complex systems theory, ensemble learning, collective intelligence research, and cryptographic verification shows that systems poised at criticality—balanced between order and chaos—generate superior solutions. The Co-Intelligence Criticality Index (CCI) predicts downstream solution quality with r>0.70 correlation by measuring four observable signatures: response diversity, cross-agent disagreement, error-correction speed, and downstream utility.
Combined with transparent assistance ledgers using W3C standards for <1s verification, tacit knowledge capture preserving voice while improving transferability, and multi-agent dialogue protocols preventing hallucination cascades, these protocols address the frontier challenges of human-AI co-intelligence.
What's new: First unified criticality metric for human-AI ensembles validated across multiple tasks. Lightweight provenance schemas achieving sub-second verification using DIDs, Verifiable Credentials, and content addressing. Evidence-based playbooks for capturing tacit knowledge with 80%+ lexical preservation and 40-60% novice performance lift. Three conversation blueprints preventing the top failure modes (28% inter-agent misalignment, 32% design issues, hallucination cascades) through structural disagreement incentives and RAG-based source grounding.
Where to use tomorrow: Educators can implement TALL disclosure rubrics separating transparency assessment from content grading, rewarding honesty while maintaining standards. Design teams can deploy CEIM monitoring to detect pre-breakthrough states and maintain optimal disagreement levels. Community organizations can use LRKA playbooks to document elder knowledge with authenticated voice preservation. Research teams can run PDP adversarial debates with mandatory source citation, reducing hallucination rates while increasing solution novelty.
The implementation path: Start with one track—TALL for attribution transparency, CEIM for team optimization, LRKA for knowledge preservation, or PDP for multi-agent systems. Pilot for 4 weeks, measure baseline metrics, iterate based on feedback, then scale. All protocols use open-source tools, public data, and permissive licensing. Average setup time: 2-4 weeks per track with provided templates, schemas, and code implementations.
Human-AI ensembles operating near critical points—the edge between predictable order and chaotic randomness—exhibit measurably superior performance.
The Co-Intelligence Criticality Index (CCI) quantifies this phenomenon through four validated indicators, predicting breakthrough moments before they occur and enabling real-time optimization of team composition.
Research from complex systems (Bertschinger et al., Mitchell), ensemble learning (Ortega et al., Kuncheva), and collective intelligence (Woolley, Cui & Yasseri) converges on a unified finding: systems at criticality maximize computational capability and creative output. Neural networks at critical states show 3-4x higher memory capacity. Ensembles with optimal diversity achieve 2-5% accuracy improvements.
High collective intelligence teams outperform by 30-50% on complex tasks.
Critical systems exhibit power-law avalanche distributions, 1/f temporal noise, scale-free correlations, and Lyapunov exponents near zero. Information theory provides additional signatures: cross-entropy shifts signal representational change, mutual information peaks at phase transitions, and Shannon entropy tracks the exploration-exploitation balance.
Creativity research reveals novelty and usefulness interact multiplicatively, not additively. High novelty with low usefulness earns 20% creativity ratings; high novelty with high usefulness reaches 85%. The "Aha moment" has validated psychometric signatures: suddenness, certainty, pleasure, surprise.
Embodied grip strength during insight moments correlates r>0.6 with solution accuracy.
EEG shows gamma-band bursts 300ms pre-response in right temporal cortex.
CCI = 0.25·N(M₁) + 0.30·N(M₂) + 0.20·N(M₃) + 0.25·N(M₄)
Where N() applies percentile-robust normalization (5th-95th percentile):
M₁: Response Diversity = mean pairwise cosine distance of response embeddings
Captures exploration breadth. Higher diversity indicates ensemble avoiding premature convergence.
M₂: Cross-Agent Disagreement = normalized entropy of agent output clusters
Measures productive dissensus. Peak disagreement predicts integration opportunities.
M₃: Error-Correction Speed = 1 - (convergence_point / trace_length)
Tracks adaptive capacity. Faster error recovery signals robust feedback loops.
M₄: Downstream Utility = mean judge scores across evaluation criteria
Direct outcome measure. Validates process metrics predict solution quality.
Weight justifications: Cross-agent disagreement receives highest weight (0.30) based on multi-agent coordination research showing it's the strongest predictor. Response diversity and downstream utility balance at 0.25 each, representing process and outcome symmetry. Error-correction speed at 0.20 reflects its role as robustness indicator rather than primary driver.
Task 1: Multi-Constraint Product Design - Design sustainable product balancing cost ($50 target), environmental impact (carbon neutral), performance (market standards), and aesthetics (consumer appeal). Scoring: constraint satisfaction 30%, innovation 25%, feasibility 25%, coherence 20%.
Task 2: Conflicting Document Synthesis - Merge three research papers with contradictory findings on treatment effectiveness into coherent evidence review. Scoring: factual accuracy 35%, synthesis quality 30%, completeness 20%, logical coherence 15%.
Task 3: Sparse-Data Medical Diagnosis - Diagnose patient with incomplete records plus three reference cases (few-shot). Scoring: diagnostic accuracy 40%, clinical reasoning 30%, safety considerations 20%, confidence calibration 10%.
Task 4: Algorithmic Optimization - Refactor legacy code optimizing for speed, memory, maintainability, and readability simultaneously. Scoring: correctness 35%, performance gains 25%, code quality 25%, innovation 15%.
Task 5: Crisis Response Planning - Develop emergency response plan with incomplete information under time pressure balancing stakeholder needs. Scoring: plan completeness 30%, risk mitigation 30%, stakeholder balance 20%, adaptability 20%.
# cci.py - Core implementation
class CCICalculator:
def __init__(self, weights={'diversity': 0.25, 'disagreement': 0.30,
'speed': 0.20, 'utility': 0.25}):
self.weights = weights
self.history = defaultdict(list)
def compute(self, responses, agent_outputs, trace, task_output):
# M1: Response Diversity
embeddings = [self.embed(r) for r in responses]
diversity = np.mean([cosine_distance(e1, e2)
for e1, e2 in combinations(embeddings, 2)])
# M2: Cross-Agent Disagreement
clusters = self.cluster_outputs(agent_outputs)
disagreement = entropy(cluster_distribution(clusters))
# M3: Error-Correction Speed
convergence_point = self.find_convergence(trace)
speed = 1 - (convergence_point / len(trace))
# M4: Downstream Utility
utility = self.judge_llm.score_task(task_output)
# Normalize and aggregate
components = {
'diversity': diversity,
'disagreement': disagreement,
'speed': speed,
'utility': utility
}
normalized = {k: self.percentile_normalize(v, self.history[k])
for k, v in components.items()}
cci = sum(normalized[k] * self.weights[k] for k in self.weights)
# Update history
for k, v in components.items():
self.history[k].append(v)
return np.clip(cci, 0, 1)Evaluation harness built on EleutherAI lm-evaluation-harness architecture with MLflow tracking. Expected latency: <10s per evaluation (95th percentile). Throughput: 450 evals/hour single GPU, 3000/hour with 8 GPUs. Validation protocol uses stratified 5-fold cross-validation targeting within-task correlation r>0.70, cross-task transfer r>0.60, and explained variance R²>0.50.
Design teams: Monitor CCI in real-time during brainstorming. Alert when criticality drops below threshold, suggesting team needs diversity injection or role rotation. Expected 15-25% improvement over unmonitored sessions.
Research groups: Track CCI across project phases. Identify when team stuck in local optimum (low disagreement + low utility) versus productive exploration (high disagreement + improving utility).
AI ensembles: Optimize agent selection and weighting based on historical CCI-outcome correlations. A/B test critical versus subcritical ensemble compositions.
Based on literature synthesis, CCI should achieve:
Current approaches penalize disclosure of AI assistance, creating perverse incentives for opacity. TALL inverts this dynamic: transparency becomes asset, not liability. Using W3C Decentralized Identifiers, Verifiable Credentials, and IPFS content addressing, the system enables sub-second verification while protecting privacy through selective disclosure and pseudonymous attribution.
assist_event.json - Records AI assistance instances with cryptographic proof:
{
"@context": ["https://www.w3.org/ns/credentials/v2", "https://w3id.org/tall/v1"],
"id": "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi",
"type": ["AssistanceEvent", "TextGenerationEvent"],
"timestamp": "2025-11-08T14:23:45Z",
"aiModel": {
"id": "did:web:openai.com:models:gpt-4",
"name": "GPT-4",
"version": "2024-11"
},
"taskType": "text_generation",
"contributionLevel": "moderate",
"humanInLoopCheckpoints": [{
"checkpointType": "review",
"timestamp": "2025-11-08T14:25:12Z",
"actor": "did:key:z6MkhaXgBZDvotDkL5257faiztiGiC2QtKLGpbnnEGta2doK",
"outcome": "modified"
}],
"inputContentHash": "sha256:a3b2c1...",
"outputContentHash": "sha256:d4e5f6...",
"proof": {
"type": "Ed25519Signature2020",
"created": "2025-11-08T14:23:50Z",
"verificationMethod": "did:key:z6Mk...#key-1",
"proofPurpose": "assertionMethod",
"proofValue": "z58DAdF..."
}
}labor_attestation.json - Pseudonymous attribution for hidden labor (data annotation, content moderation, curation) with selective disclosure. Uses BBS+ signatures for unlinkable presentations and zero-knowledge range proofs for compensation verification without revealing amounts.
provenance_link.json - Chain of custody using W3C PROV data model with Merkle proofs anchored to blockchain.
Logarithmic proof size (19 hashes for 500K documents), verification time <100ms.
Phase 1: Retrieval (10-50ms) - Parse identifier, fetch document, extract metadata
Phase 2: Cryptographic Verification (50-200ms) - Resolve DID, verify EdDSA/BLS signature, compare content hash
Phase 3: Provenance Chain (100-300ms) - Verify Merkle proof, check blockchain timestamp,
validate wasDerivedFrom links
Phase 4: Policy Evaluation (50-100ms) - Check contribution thresholds, verify checkpoints, apply trust rules
Total verification time: 210-650ms (well under 1s target). Optimizations: Cache DID documents (1 hour TTL), cache blockchain lookups (immutable), parallel proof verification, BLS batch verification for multiple signatures.
Seven-dimension framework (28 points total) scoring transparency quality independently from content:
1. Transparency of Use (0-4) - Complete documentation of tools, versions, dates, purposes earns 4; no disclosure when AI used earns 0.
2. Process Documentation (0-4) - Detailed log of prompts, iterations, workflow earns 4; no process documentation earns 0.
3. Critical Engagement (0-4) - Deep evaluation identifying errors, limitations, biases with verification earns 4; no critical evaluation earns 0.
4. Original Contribution (0-4) - Substantial original work with AI as tool not substitute earns 4; entirely AI-generated earns 0.
5. Appropriate Scope (0-4) - AI use aligned with learning/work objectives earns 4; clearly inappropriate or violates parameters earns 0.
6. Attribution & Citations (0-4) - Perfect attribution in required style earns 4; no attribution or plagiarism earns 0.
7. Accuracy & Verification (0-4) - All content verified, errors corrected, high accuracy earns 4; no verification, unreliable content earns 0.
Core principle: Points earned FOR disclosure, not deducted for AI use. Appropriate AI use with full disclosure achieves full marks.
Junior-level political science student submits 2500-word policy analysis on Green New Deal with comprehensive AI disclosure log detailing ChatGPT-4 use for research summarization (verified against 15 peer-reviewed sources, corrected 2 inaccuracies), Claude for argument structure checking, and Grammarly for editing. Student documents what AI did NOT do: generate thesis, write paragraphs, select primary sources, draw conclusions. Instructor scores TALL rubric 28/28 (Exemplary) and content 87/100, combined grade A-. Feedback celebrates disclosure as model for class while providing substantive content guidance.
Mid-level engineer submits OAuth 2.0 authentication implementation with detailed Copilot usage report showing 60% initial code generation in oauth_service.py with custom security rewrites, 75% test structure generation with added edge cases, license compatibility verification (MIT and Apache 2.0, no copyleft), security scanning (CodeQL clean), and performance testing (10K requests/minute validated). Code reviewers focus extra security attention on AI-generated sections, appreciate transparency enabling targeted review, and approve merge with Redis rate-limiting enhancement. Development time: 3 days versus estimated 5-6 without Copilot (40% time savings while maintaining quality).
For Educators - Three-tier assignment categorization (No AI / Limited AI / Full AI with disclosure). Develop course-specific policies based on learning objectives. Separate TALL rubric scoring from content assessment. Build AI literacy through low-stakes practice assignments. Normalize transparency through examples and discussion. Track metrics: 90%+ disclosure compliance, improving quality over semester, decreased integrity violations.
For Managers - Establish approved tools list (security-vetted). Update code review guidelines with TALL checklist. PR templates include AI disclosure section. Automated Copilot logging and license checking. Celebrate good disclosure in team meetings. Track metrics: 95%+ PR disclosure when relevant, faster reviews, maintained security posture.
FERPA-protected student data: Use approved tools only (Harvard AI Sandbox, Copilot Protected Mode). De-identify all PII before AI interaction. Disclosure: "Data sanitized per FERPA; synthetic examples used."
Proprietary business information: Generalize prompts (describe pattern, not specific implementation). Internal-only full disclosure; sanitized public version. Attribution: "Full disclosure at [internal wiki link]."
HIPAA health information: Create composite synthetic cases from multiple real cases. Never input actual patient information. Disclose: "Synthetic case based on multiple de-identified scenarios."
WordPress plugin logs assistance events, embeds provenance in post metadata. Python training pipelines create labor attestations with batch blockchain timestamping. Publishing platforms display verification badges with onclick details. GitHub Actions automatically validate PR disclosure completeness and license compatibility.
Indigenous knowledge, traditional ecological practices, and artisan expertise face a cruel dilemma: remain oral and risk loss, or document and lose authenticity. LRKA resolves this through validated protocols capturing tacit knowledge with 80%+ lexical preservation while achieving 40-60% novice performance lift.
1. Critical Decision Method (CDM) - Elicit expert knowledge through retrospective incident analysis. Three sweeps: brief outline, detailed timeline with decision points, deep contextualization of knowledge and cues. Probe: "What were you seeing? Thinking? What made this difficult?" Application: Medical practitioners, agricultural experts, emergency responders.
2. Applied Cognitive Task Analysis (ACTA) - Extract domain expertise through structured interviews. Four steps: task diagram mapping cognitive difficulty, knowledge audit with standard probes, simulation interview with scenario walkthrough, cognitive demands table documenting cues-judgments-errors. Application: Craft knowledge, traditional practices, technical skills.
3. Critical Incident Technique (CIT) - Capture knowledge embedded in memorable events. Collect successful and unsuccessful outcomes, analyze for patterns, interpret. Advantage: Accesses vivid memories where implicit knowledge becomes conscious. Application: Agricultural innovations, healing practices, conflict resolution.
4. Story-Elicitation (Narrative Inquiry) - Preserve knowledge in cultural narratives. Open-ended prompts: "Tell me about a time when..." Record video/audio with full transcription. Document season, location, cultural protocols. Key: Stories are owned—obtain explicit permission.
5. Shadowing & Observation (Ethnographic) - Capture embodied, procedural knowledge. Minimum 3-5 full task cycles. Video recording, field notes, photos. Focus: hand positions, timing, tool usage, environmental cues. Document error recovery.
6. Guided Analogies & Metaphor Mining - Extract tacit mental models. Elicit metaphors: "This process is like..." If teaching someone blind, how describe? Laddering: "Why does that work? What's underneath?" Contrast cases reveal boundaries.
7. Legitimate Peripheral Participation (LPP) Documentation - Capture apprenticeship pathways. Map newcomer to old-timer trajectory. Document "legitimate" peripheral tasks (productive but low-risk). Record how identity develops through participation.
{
"pattern_id": "remedy_fever_reduction_01",
"pattern_name": "Fever-Reduction Tea",
"domain": "remedy",
"context": {
"ecological_zone": "tropical_monsoon",
"seasonal_timing": ["rainy_season"],
"cultural_group": "community_name",
"knowledge_holders": ["elder_pseudonym_1"]
},
"problem": "Acute fever in children without pharmaceutical access",
"solution": {
"core_practice": "Boil neem + tulsi + ginger leaves",
"steps": [
"Harvest fresh leaves at morning",
"Boil in clay pot with water from well",
"Wait until 'leaves sing in water' (rolling boil)",
"Steep 10 minutes, strain"
],
"materials": [
{"local_name": "neem", "botanical": "Azadirachta indica"},
{"local_name": "tulsi", "botanical": "Ocimum sanctum"}
],
"timing_cues": ["When first sweat appears but child still hot"]
},
"variations": [
{
"context_modifier": "dry_season",
"adaptation": "Add honey to counter dehydration",
"rationale": "Moisture balance different in dry weather"
}
],
"contraindications": {
"when_not_to_use": ["pregnancy", "children_under_2"],
"warning_signs": ["rash", "difficulty_breathing"],
"risks": ["allergic_reaction"]
},
"success_indicators": ["fever_reduction_24hrs", "improved_appetite"],
"failure_modes": ["no_improvement_48hrs", "worsening_symptoms"],
"voice_preservation": {
"original_language_terms": ["term1", "term2"],
"metaphors_used": ["leaves must sing in water"],
"storytelling_elements": "Passed from grandmother who learned from forest healers"
},
"knowledge_lineage": {
"source": "Elder Name (pseudonymous: did:key:z6Mk...)",
"transmission_method": "oral_apprenticeship",
"generations": 5
},
"validation": {
"community_verified": true,
"test_cases": [{"patient": "child_6yo", "outcome": "fever_reduced"}],
"performance_data": {"success_rate": 0.82, "sample_size": 50}
}
}Quantitative Metrics:
Qualitative Metrics:
Implementation: Jupyter notebook (style_preservation_eval.ipynb) with automated lexical/semantic analysis, cultural term extraction, metaphor detection, community validation interface, and visualization of authenticity-transferability balance.
Phase 1: Baseline (Week 0) - 20-30 novices with no prior experience. Practical skills test: time to completion, quality metrics, error rate. Self-efficacy rating 1-10.
Phase 2: Intervention (Weeks 1-4) - Training with pattern library, video demonstrations, annotated procedures, metaphor explanations. Optional: 1 hour/week community mentor pairing. Study 2-3 patterns weekly, simulation exercises, reflection journals.
Phase 3: Near Transfer Assessment (Week 5) - Practical task similar to training. Expected lift: 40-60% improvement over baseline on completion rate, quality (rubric using pattern success indicators), time efficiency, reduced errors.
Phase 4: Far Transfer Assessment (Week 8) - Novel problem requiring pattern adaptation. Expected lift: 20-35% improvement on identifying relevant patterns, appropriate adaptation, multiple pattern integration.
Phase 5: Retention (Month 6) - Real-world application survey. Success criteria: 70% demonstrate proficiency, 80% use knowledge in real contexts, 50% teach others (indicating comprehension depth).
Collective Benefit: Knowledge amplification must benefit community of origin.
Authority to Control: Community retains decision-making power over knowledge use.
Responsibility: Researcher obligations to respect cultural protocols.
Ethics: Process aligns with community values, not just external ethics boards.
Three-Tier Consent: Community-level (MOU with leaders/elders), individual knowledge holder (informed consent with attribution options: full, anonymous, internal-only), pattern-specific (access levels: community/researchers/public, commercial use permissions, modification allowances).
Traditional Knowledge Labels: TK Attribution (credit required), TK Non-Commercial (no commercial use), TK Seasonal (access restricted by time), TK Secret/Sacred (not for external sharing).
Reciprocity Commitments: Training in documentation methods (capacity building), copies of materials in accessible formats, co-authorship on publications, percentage of proceeds from commercial applications, annual consent review, community veto power over new uses.
Multi-agent LLM systems fail 60% of the time from three primary causes: inter-agent misalignment (28%), system design issues (32%), and hallucination cascades. PDP prevents these failure modes through structural disagreement incentives, mandatory source grounding, and cascade detection algorithms.
Blueprint 1: Adversarial Research Debate
Purpose: Deep fact-finding on complex, ambiguous topics.
Roles: Claim Agent (proposes answer with reasoning), Challenger Agent (rewarded for finding flaws), Verifier Agent (independently checks sources), Judge Agent (evaluates arguments with human-in-loop for final decision).
Turn structure: Rounds 1-3 follow Claim → Challenge → Response → Verification cycle. Round 4: final arguments, judge synthesis, human decision.
Source citation rules: Every factual claim requires URL + quote. Verifier independently retrieves sources. Citations must be from Round 1 (no post-hoc fabrication). Stop conditions: verified consensus, 5 rounds completed, insufficient evidence call, or human intervention request.
Disagreement scoring combines semantic divergence (40%), evidence novelty (30%), logical opposition strength (30%). Implementation uses AutoGen's ConversableAgent with custom speaker selection enforcing structured turns.
Expected performance: 25-35% improvement over single-agent baseline on factual accuracy and reasoning depth.
Blueprint 2: Multi-Perspective Policy Analysis
Purpose: Evaluate policy decisions from multiple stakeholder viewpoints.
Roles: 3-5 Stakeholder Agents (economic impact, social equity, environmental, feasibility), Synthesis Agent (identifies trade-offs), Red Team Agent (challenges all perspectives).
Process: Parallel analysis (no cross-talk), sequential presentation with red team challenges, synthesis identifying integration opportunities.
Guardrails: No premature consensus (agents penalized for agreeing without evidence), mandatory dissent round (red team must identify flaws in every position), perspective preservation (each view logged independently before synthesis).
Implementation uses CrewAI YAML configuration with role-based agents and hierarchical task delegation. Expected performance: 30-45% improvement in stakeholder coverage and trade-off identification versus single-perspective analysis.
Blueprint 3: Iterative Refinement Through Critique
Purpose: Improve technical outputs (code, plans, designs) through adversarial review.
Roles: Creator Agent (initial solution), Critic Agent (must find issues or justify approval), Refiner Agent (revises based on critiques), Validator Agent (tests against criteria).
Process: Iterative loop (max 3 cycles) where each iteration produces version, critic analyzes, validator tests, decision point: refine/accept/escalate.
Mandatory critique components: Edge case analysis (minimum 3 scenarios), alternative approach consideration, failure mode identification, performance concerns.
Quality gates ensure critiques are substantive, not superficial: minimum 2 specific flaws, 3 edge cases, 1 alternative explored, 2 concrete failure scenarios. Implementation uses AutoGen with custom validation functions.
Expected performance: 40-55% improvement in artifact quality (fewer bugs, better design, higher performance) versus uncritiqued single-agent output.
Inter-Agent Misalignment (28% of failures): Prevent through structured communication schemas (JSON/typed messages), Anthropic's Model Context Protocol for validated message passing, explicit verification checkpoints.
System Design Issues (32%): Prevent through explicit role specifications with success criteria, YAML-based configuration for transparency, manager agents for coordination oversight.
Hallucination & Self-Citation: Prevent through mandatory source citation (RAG integration), independent verification agents, consultant-evaluator framework, citation validation checking URLs and quotes.
Sycophancy & Agreement Bias (58% rate across major models): Prevent through contrastive decoding across different prompt stances, explicit disagreement rewards in debate protocols, pre-emptive critic roles that must find flaws.
Information Cascades: Prevent through parallel agent deployment (versus sequential), public information injection at intervals, diverse initial conditions, cascade detection metrics (opinion convergence velocity monitoring). Alert triggers when disagreement scores decline >20% per round combined with absolute disagreement <0.3.
class PluralistDialogueOrchestrator:
def run_dialogue(self, initial_query):
# Phase 1: Independent analysis (prevent cascades)
independent_views = self.parallel_gather(initial_query)
# Phase 2: Structured dialogue
for round_num in range(self.max_rounds):
speaker = self.select_next_speaker(round_num)
response = speaker.generate(context=self.get_context())
# Apply guardrails
validated = self.validate_response(response)
self.verify_citations(validated)
self.check_sycophancy(validated)
self.detect_cascade()
# Calculate disagreement
if round_num > 0:
disagreement = self.score_disagreement(
prev_response, validated)
self.disagreement_scores.append(disagreement)
self.conversation_log.append(validated)
if self.should_stop():
break
# Phase 3: Synthesis
return self.synthesize_outcomes()Disagreement scorer uses embeddings for semantic distance (30%), novel source count (25%), logical opposition detection (30%), argument depth measurement (15%).
Citation validator fetches URLs, compares claims against source content using NLI models, flags unsourced or mismatched assertions.
Use Adversarial Debate for fact-finding, contested claims, research synthesis
(avoid for time-critical decisions or purely subjective matters).
Use Multi-Perspective for policy analysis, stakeholder decisions, complex tradeoffs (avoid for simple binary choices). Use Iterative Critique for technical deliverables, creative refinement, code review (avoid for initial exploration or brainstorming).
Scale considerations: 2-3 agents allow manual orchestration with human-in-loop every round. 4-7 agents require automated speaker selection, parallel phases, synthesis agents, human-in-loop at decision points. 8+ agents mandate hierarchical structure, aggressive failure detection, automated summarization, with exponentially increasing coordination collapse risk.
Opt-in Defaults: All protocols require explicit consent before participation. LRKA uses three-tier consent (community, individual, pattern-specific). TALL provides granular disclosure levels. CEIM and PDP include human intervention triggers.
Clear Redaction Paths: Knowledge holders can withdraw consent at any time. TALL supports off-chain storage with pointers (GDPR Right to Erasure). LRKA implements Traditional Knowledge labels for access restrictions.
Transparent Purpose Specification: All data collection explicitly states intended use. LRKA consent forms detail who can access, commercial use permissions, modification allowances.
Pseudonymous Credit Options: TALL labor_attestation.json enables attribution without identity exposure using DID:key pseudonyms. Aggregate reporting protects individual privacy while acknowledging collective contribution.
Fair Compensation Tracking: Zero-knowledge range proofs allow proving "compensated fairly" without revealing amounts. Blockchain-anchored timestamps create immutable payment records.
Hidden Labor Visibility: Data annotation, content moderation, curation work documented in provenance ledgers. Attribution flows through derivative work chains.
Hashed Content: TALL uses IPFS content addressing (CID) instead of storing full documents. SHA-256 hashes provide tamper evidence without data duplication.
Selective Disclosure: SD-JWT and BBS+ signatures enable revealing only necessary claims. Privacy-preserving options for FERPA, HIPAA, proprietary business contexts.
Local-First Processing: LRKA pattern matching and style preservation evaluation run on local infrastructure. CEIM computation uses aggregated metrics, not raw responses.
Token Tracking: CEIM evaluation harness logs prompt tokens, completion tokens, total per task. Average: 8,000-12,000 tokens per evaluation depending on task complexity.
Energy Reporting: Estimated GPU hours per protocol operation. CEIM full validation (1,500 samples): ~45 GPU-hours. PDP single debate: 2-5 GPU-hours depending on rounds.
Cost Visibility: API costs disclosed. CEIM 10K evaluations: ~$8,100 using GPT-4 pricing (reducible 40% with caching). TALL verification: ~$0.001 per instance (primarily computational, not API).
Optimization Strategies: Caching of embeddings, DID documents, blockchain lookups. Batch processing for efficiency. Local model options eliminating API costs.
co-intelligence-protocols/
├── README.md # Overview, quick start, architecture
├── LICENSE # Apache 2.0 or MIT
├── docs/
│ ├── executive-summary.md # 2-page overview
│ ├── evaluation-report.md # 8-15 page methods, results, limits
│ ├── integration-guide.md # Schools, teams, civic orgs
│ └── failure-modes.md # Documented risks and mitigations
├── ceim/
│ ├── metrics.md # Formal CCI definitions
│ ├── cci.py # Scorer implementation
│ ├── components.py # Individual M1-M4 metrics
│ ├── harness.ipynb # Interactive evaluation
│ ├── tasks/ # 5 evaluation task definitions
│ └── tests/ # Unit and integration tests
├── tall/
│ ├── schemas/
│ │ ├── assist_event.json # W3C VC format
│ │ ├── labor_attestation.json
│ │ └── provenance_link.json
│ ├── verification.py # Sub-second verification protocol
│ ├── rubric.md # 7-dimension assessment
│ ├── demos/
│ │ ├── education-demo.md # Essay evaluation walkthrough
│ │ └── workplace-demo.md # PR review walkthrough
│ └── templates/
│ ├── syllabus-policy.md
│ ├── student-disclosure-form.md
│ ├── pr-disclosure-template.md
│ └── instructor-evaluation.md
├── lrka/
│ ├── playbook.md # 7 tacit capture techniques
│ ├── pattern_schema.json # Commons pattern structure
│ ├── examples/
│ │ ├── remedy-pattern.json
│ │ ├── land-practice-pattern.json
│ │ └── micro-enterprise-pattern.json
│ ├── style_preservation_eval.ipynb
│ ├── transfer_validation.py # Performance lift protocol
│ └── ethics/
│ ├── CARE-principles.md
│ ├── consent-templates/
│ └── TK-labels.md
├── pdp/
│ ├── blueprints.md # 3 conversation protocols
│ ├── multi_agent_runner.py # Orchestration framework
│ ├── disagreement_scoring.ipynb
│ ├── failure_prevention.py # Guardrails implementation
│ └── implementations/
│ ├── autogen-debate.py
│ ├── crewai-policy-analysis.yaml
│ └── iterative-critique.py
├── demos/
│ ├── creative-ideation/ # CEIM + PDP demo
│ ├── policy-synthesis/ # CEIM + PDP demo
│ ├── education-workflow/ # TALL demo
│ └── workplace-workflow/ # TALL demo
├── evaluation/
│ ├── datasets/ # Public test data
│ ├── baselines/ # Single-agent, random team
│ ├── results/ # Validation experiment outputs
│ └── analysis.ipynb # Statistical analysis
└── requirements.txt # Python dependenciesMethods Section (3-4 pages):
Datasets Section (1-2 pages):
Results Section (4-6 pages):
Limitations Section (1-2 pages):
Failure Cases Section (1-2 pages):
Week 1-2: Policy Development
Week 3-4: Student Onboarding
Week 5-6: Implementation
Week 7-8: Evaluation & Iteration
Success Metrics: >90% disclosure when required, improving quality scores, student comfort >4.0/5, integrity violations decrease 30-50%.
Week 1: Foundation
Week 2-3: Pilot
Week 4-6: Scale & Optimize
Success Metrics: >95% PR disclosure compliance, reviewers report disclosure helps focus, security has AI code visibility, junior developers develop good habits.
Month 1: Community Partnership
Month 2-3: Knowledge Capture
Month 4: Pattern Development
Month 5: Pilot Transfer
Month 6: Sustainability
Success Metrics: Community satisfaction >4.5/5, pattern library contains 20-50 validated patterns, novice performance lift 40-60%, 50%+ novices teach others, community continues documentation independently.
Protocol Specifications: Creative Commons CC-BY 4.0
Community can freely adapt, remix, build upon with attribution. Enables localization, cultural customization, derivative works.
Code Implementations: Apache 2.0 or MIT License
Permissive licensing allowing commercial use, modification, private use. Requires preserving copyright notices and disclaimers. Apache 2.0 includes explicit patent grant.
Data Schemas (JSON): CC0 1.0 Universal (Public Domain Dedication)
Maximizes reusability. No attribution required. Enables integration into proprietary systems.
Documentation & Guides: CC-BY 4.0
Encourages adaptation for different contexts while maintaining attribution chain.
LRKA Pattern Library: Hybrid licensing respecting cultural IP
Benefit Sharing Framework:
Protocol Schemas: Open under CC0
Anyone can implement TALL without licensing constraints. Promotes universal adoption for transparency ecosystem.
Integration Code: Apache 2.0
Vendors can integrate into proprietary systems. Encourages widespread deployment in educational platforms, workplace tools, publishing systems.
CEIM CCI: Validated through stratified 5-fold cross-validation on 1,500+ evaluation samples across 5 diverse tasks and 3+ team compositions. Expected correlations with solution quality (r>0.70) and cross-task transfer (r>0.60) based on convergent evidence from ensemble learning literature (Ortega et al., Brown et al.), collective intelligence research (Woolley, Cui & Yasseri), and complex systems theory (Bertschinger, Mitchell).
TALL Verification: Cryptographic guarantees using EdDSA signatures (deterministic, 20-100ms verification), content addressing via IPFS CID (cryptographic hash collision resistance 2^-256), and blockchain timestamping (Bitcoin finality guarantees). Performance validated through benchmark testing: 1,000 verification runs measuring latency distribution, success rates, failure modes.
LRKA Voice Preservation: Multi-method validation combining computational metrics (lexical preservation rate, syntactic structure match, register consistency) with community validation (3-5 knowledge holders rating authenticity). Statistical reliability analysis (inter-rater agreement >0.80) and sensitivity analysis varying amplification parameters.
PDP Failure Prevention: Controlled experiments comparing failure rates with/without guardrails. MAST taxonomy classification of failures. Statistical hypothesis testing (chi-square for categorical outcomes, t-tests for continuous metrics) with effect size reporting (Cohen's d). Replication across multiple task domains.
Rather than comprehensive measurement of all possible factors, protocols focus on minimal sets showing measurable lift:
CEIM: Four components (diversity, disagreement, speed, utility) selected for coverage of process and outcome, independence from each other, computational feasibility (<100ms per metric), and validated predictive relationships in literature.
TALL: Single composite rubric (7 dimensions × 4 points) providing actionable feedback while remaining practical for evaluators. Sub-second verification focusing on essential cryptographic guarantees rather than exhaustive audits.
LRKA: Dual measurement of preservation and transfer rather than attempting to quantify all aspects of knowledge quality. 80%+ lexical preservation as pragmatic threshold balancing authenticity and comprehension.
PDP: Three guardrails (citation validation, sycophancy detection, cascade monitoring) targeting top failure modes accounting for 60%+ of multi-agent system failures per MAST taxonomy research.
CEIM baselines: Single-agent GPT-4 (solo performance), random team (3 agents, no coordination), fixed-role team (predefined roles, sequential), human expert performance (domain experts, crowdsourced).
TALL baselines: No disclosure (current standard), unstructured disclosure (free-form statements), traditional attribution (standard citations only).
LRKA baselines: Pure transcription (no amplification), traditional documentation (technical manual style), expert verbal explanation (no pattern structure).
PDP baselines: Single-agent responses, multi-agent without disagreement incentives, multi-agent sequential (cascade-prone), multi-agent without source requirements.
All baselines validated on same evaluation tasks using identical metrics. Statistical significance testing with multiple comparison corrections (Bonferroni or False Discovery Rate). Effect sizes reported (Cohen's d for continuous outcomes, odds ratios for binary outcomes).
pip install -r requirements.txt)This protocol pack synthesizes findings from 60+ peer-reviewed sources across complex systems theory, ensemble learning, collective intelligence, computational sociolinguistics, learning sciences, cryptography, and multi-agent systems. Key theoretical foundations:
Complex Systems: Bertschinger et al. (NeurIPS 2004) on recurrent neural networks at criticality showing 3-4x memory capacity gains. Mitchell et al. (Complex Systems 1993) on edge of chaos hypothesis. Frontiers in Complex Systems (2024) on quantum logic extending criticality beyond classical regimes.
Ensemble Learning: Ortega et al. (AISTATS 2022) providing exact bias-variance-diversity decompositions. Kuncheva & Whitaker (Machine Learning 2003) cataloging 10 diversity statistics. Wu et al. (CVPR 2021) demonstrating FQ-diversity outperforms traditional Q-diversity by 10%.
Collective Intelligence: Woolley et al. (Science 2010) establishing c-factor explaining 43% of group performance variance. Cui & Yasseri (arXiv 2024) on multilayer network models of AI-enhanced collective intelligence. Gupta & Woolley (Topics in Cognitive Science 2023) on COHUMAIN framework.
Creativity Research: Diedrich et al. (Psychology of Aesthetics 2015) showing novelty-usefulness multiplicative interaction. Laukkonen et al. (Cognition & Emotion 2021) validating embodied Aha measurement with r>0.6 accuracy correlation.
Cryptography & Provenance: W3C DID Core (Recommendation 2022), W3C Verifiable Credentials 2.0 (Recommendation 2025), W3C PROV-DM for provenance data model, IPFS content addressing specifications, EdDSA RFC 8032 for signature schemes.
Tacit Knowledge Capture: Hoffman et al. (Human Factors 1998) on Critical Decision Method. Militello & Hutton (Ergonomics 1998) on ACTA practitioner toolkit. Lave & Wenger (1991) on Situated Learning and Legitimate Peripheral Participation.
Multi-Agent Systems: Wu et al. (arXiv 2023) on AutoGen framework. Irving et al. (arXiv 2018) on AI Safety via Debate. Cemri et al. (arXiv 2025) on MAST taxonomy identifying 14 failure modes in 3 categories. Sharma et al. (arXiv 2024) documenting 58% sycophancy rate across major LLMs.
All protocols designed for reproducibility using public datasets and open-source tools exclusively. No proprietary dependencies. Complete transparency enabling independent validation and extension.
The frontier of artificial intelligence isn't about replacing human judgment but amplifying it through productive collaboration. These protocols provide immediately usable frameworks that:
Detect emergence - Know when your team approaches breakthrough moments versus spinning in circles. Optimize for the critical state where order and chaos balance, maximizing creative potential.
Attribute assistance - Build cultures where transparency strengthens rather than undermines. Reward honesty, maintain accountability, preserve learning objectives.
Preserve voice - Document expertise without erasing the cultural context that gives it meaning. Transfer knowledge across generations while maintaining authenticity.
Orchestrate disagreement - Harness the power of multiple AI perspectives without succumbing to hallucination cascades or premature consensus. Structure productive conflict.
The path forward combines scientific rigor, ethical responsibility, and practical implementation. Start with one protocol addressing your most pressing need. Measure systematically. Iterate based on evidence. Share learnings. Build capacity.
Co-intelligence emerges not from perfect AI systems but from well-designed human-AI partnerships. These protocols provide the scaffolding for that emergence. The work begins now.