Content is user-generated and unverified.

Cross-LLM Structural Concordance in CAMS Datasets

Reproducible Coordination Signatures Across Independent AI Assessors

Kari Freyr McKern
Complex Adaptive Humans | Neural Nations Research Platform

Working Paper — March 2026


Abstract

The Complex Adaptive Model of Societies (CAMS) uses large language models as scoring instruments to assess institutional coordination dynamics across historical societies. A fundamental methodological question follows: do the patterns detected reflect genuine structural regularities, or artefacts of any single model's training data and architecture?

This paper reports cross-LLM concordance results from 14 comparison pairs spanning 10 societies, using assessments generated independently by Gemini, Grok, and GPT-4. The findings support a layered concordance structure: strong convergence on material-constraint variables (mean Node Value ρ = +0.751), moderate convergence on temporal stress dynamics (mean Stress ρ = +0.634), and more variable agreement on interpretive coupling assessments (mean Bond Strength ρ = +0.640, ranging from −0.130 to +0.853). The Stress–Capacity anti-correlation, the framework's core empirical signature, appears in 30 of 30 tested datasets (mean ρ = −0.700), across all assessors and all societies. Bond Strength correlates positively with system health in 23 of 23 tested datasets (mean ρ = +0.926).

The concordance pattern is itself diagnostic. LLMs converge on what the framework classifies as "hard fields" — stress dynamics, capacity constraints, crisis timing — while diverging on "soft fields" — absolute calibration, cultural texture, interpretive weighting. This layered structure is consistent with the detection of underlying coordination constraints rather than shared training-corpus artefacts, though the shared-corpus limitation is acknowledged as a boundary condition.


1. Introduction

1.1 The Measurement Problem

CAMS models societies as networks of eight invariant functional nodes (Helm, Shield, Lore, Archive, Stewards, Craft, Hands, Flow), each characterised by four state variables: Coherence, Capacity, Stress, and Abstraction. The framework was derived theory-first from rate-separation and coupling arguments, then tested empirically using LLM-generated assessments as the scoring instrument.

This creates an obvious epistemological challenge. If a single AI system scores all datasets, the resulting patterns might reflect that system's biases, training-data distribution, or prompt susceptibilities rather than genuine coordination dynamics in the societies being assessed. The patterns would be reproducible — the same model given the same prompt will produce similar outputs — but not independently validated.

Cross-LLM concordance testing addresses this by treating each AI system as an independent measurement instrument. If independently trained models, applied to the same societies under the same scoring protocol, converge on structural dynamics while diverging on calibration and texture, the concordance provides evidence that the scoring protocol detects signal rather than noise.

This paper is not claiming that cross-LLM concordance constitutes validation in the strong scientific sense. These models share overlapping training corpora, similar historiographic priors, and comparable prompt susceptibilities. What concordance demonstrates is narrower but still useful: the CAMS scoring protocol produces reproducible structural judgements across multiple LLM assessors, and the pattern of convergence and divergence is consistent with the detection of genuine coordination constraints.

1.2 Scope and Limitations

The analysis covers 10 societies across 14 cross-LLM comparison pairs, with coverage ranging from 42 to 145 overlapping years per pair. Three LLM families are represented: Gemini (Google), Grok (xAI), and GPT-4 (OpenAI). All assessments use the same CAMS instruction set and scoring protocol, administered independently to each model without shared conversation history.

The key limitation is shared training data. All three model families are trained primarily on English-language text, with heavy representation of Western academic historiography. Concordance on well-documented societies (USA, Finland, Norway) may partly reflect shared source material rather than independent detection of the same underlying dynamics. The analysis addresses this by including societies where historiographic coverage varies more substantially across training corpora (Iran, China, Ukraine), and by examining whether concordance patterns differ systematically between well-documented and contested cases.


2. Methods

2.1 Scoring Protocol

Each LLM received the standard CAMS instruction set specifying the eight-node architecture, four state variables, and derived metrics (Node Value, Bond Strength). Models were prompted to score each node's state variables on a 1–10 integer scale for each year in the society's time series, and to assign Bond Strength as a separate holistic assessment of inter-node coupling quality.

No historical priors, expected outcomes, or reference datasets were provided. Each LLM operated as an independent scorer with access only to its own training knowledge of the society in question.

2.2 Concordance Metrics

Five levels of concordance are assessed:

Level 1 — Base metric correlation. Pearson correlation between LLM pairs on year-averaged Coherence, Capacity, Stress, and Abstraction trajectories.

Level 2 — Composite concordance. Correlation on Node Value (C + K − S + 0.5A) and Bond Strength trajectories.

Level 3 — Structural rank concordance. Spearman rank correlation on mean node profiles — do LLMs agree on which nodes are strongest and weakest in a given society?

Level 4 — Crisis detection concordance. Jaccard similarity on identified crisis years (defined as years where system-mean Stress exceeds one standard deviation above the dataset mean).

Level 5 — Direction agreement. Year-over-year sign concordance on ΔNode Value — do LLMs agree on whether coordination improved or deteriorated in each year?
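Levels 1 and 2 can be illustrated with a minimal sketch. It assumes a hypothetical data layout (a dict mapping each year to one (C, K, S, A) tuple per node) and made-up scores; only the Node Value formula comes from the text above. The function names are illustrative and are not part of the CAMS tooling.

```python
from math import sqrt

def node_value(c, k, s, a):
    """Node Value composite as given in Section 2.2: C + K - S + 0.5A."""
    return c + k - s + 0.5 * a

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def system_trajectory(scores):
    """Year-averaged Node Value; scores maps year -> list of (C, K, S, A) per node."""
    return {
        year: sum(node_value(*node) for node in nodes) / len(nodes)
        for year, nodes in scores.items()
    }

def node_value_concordance(scores_a, scores_b):
    """Pearson correlation of two assessors' trajectories over overlapping years."""
    traj_a, traj_b = system_trajectory(scores_a), system_trajectory(scores_b)
    years = sorted(set(traj_a) & set(traj_b))
    return pearson([traj_a[y] for y in years], [traj_b[y] for y in years])

# Hypothetical scores (two nodes per year for brevity; CAMS uses eight).
assessor_a = {
    2000: [(6, 6, 3, 5), (7, 6, 4, 5)],
    2001: [(5, 5, 5, 5), (6, 5, 6, 5)],
    2002: [(4, 4, 7, 5), (5, 4, 8, 4)],
}
assessor_b = {
    2000: [(7, 7, 4, 6), (7, 7, 4, 6)],
    2001: [(6, 6, 5, 6), (6, 6, 5, 6)],
    2002: [(5, 5, 8, 6), (5, 5, 8, 6)],
    2003: [(6, 6, 6, 6), (6, 6, 6, 6)],  # no overlap with assessor_a, so ignored
}

rho = node_value_concordance(assessor_a, assessor_b)
print(round(rho, 3))
```

Restricting the comparison to overlapping years mirrors the "N years" column reported per pair in the results.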

2.3 Invariance Classification

Following the Cross-Sampler Invariance framework developed within the CAMS programme, derived fields are classified as:

Hard fields — quantities expected to be invariant across assessors: sign of Stress–Capacity correlation, crisis timing (within ±2 years), phase ordering of major transitions, direction of decade-level trends.

Soft fields — quantities expected to vary: absolute score levels, within-year variance, cultural-interpretive weighting, Bond Strength amplitude.

The prediction is that concordance will be strongest on hard fields and weakest on soft fields. If concordance were instead uniform across all variables, that would suggest shared training bias rather than selective convergence on structural constraints.


3. Results

3.1 System-Level Concordance

Fourteen cross-LLM comparison pairs yield the following aggregate concordance on year-averaged system trajectories:

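| Variable | Mean ρ | Median ρ | Min | Max | N pairs |
|---|---|---|---|---|---|
| Node Value | +0.751 | +0.767 | +0.477 | +0.868 | 14 |
| Capacity | +0.739 | +0.779 | +0.373 | +0.884 | 14 |
| Coherence | +0.683 | +0.726 | +0.413 | +0.864 | 14 |
| Bond Strength | +0.640 | +0.679 | −0.130 | +0.853 | 13 |
| Stress | +0.634 | +0.654 | +0.415 | +0.782 | 14 |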

Table 1. Aggregate cross-LLM concordance on system-level trajectories. All values are Pearson ρ on year-averaged metrics across overlapping time periods.

The concordance hierarchy is itself informative. Capacity, the variable most closely tied to material constraints, shows the highest concordance after the composite Node Value. Stress shows the lowest concordance among the base metrics, consistent with its dependence on interpretive judgements about what constitutes "pressure" in different cultural contexts. Bond Strength, scored holistically rather than decomposed from base metrics, shows moderate concordance with one anomalous pair (Japan, Gemini vs Grok, ρ = −0.130), reflecting known disagreements about how to map traditional Japanese institutional categories onto the CAMS node architecture.

3.2 Per-Society Concordance

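| Society | LLM Pair | N years | Stress ρ | Capacity ρ | Node Value ρ | Bond Strength ρ |
|---|---|---|---|---|---|---|
| USA | Gemini–Grok | 42 | +0.682 | +0.844 | +0.842 | +0.706 |
| USA | Gemini–GPT-4 | 42 | +0.720 | +0.779 | +0.868 | +0.657 |
| USA | Grok–GPT-4 | 56 | +0.503 | +0.405 | +0.683 | +0.679 |
| Finland | Gemini–Grok | 111 | +0.662 | +0.777 | +0.742 | +0.739 |
| China | Gemini–GPT-4 | 51 | +0.752 | +0.877 | +0.844 | +0.853 |
| Iran | Gemini–Grok | 126 | +0.782 | +0.884 | +0.861 | n/a |
| Norway | Gemini–Grok | 136 | +0.694 | +0.874 | +0.780 | +0.767 |
| South Africa | Gemini–Grok | 145 | +0.618 | +0.758 | +0.727 | +0.666 |
| Australia | Gemini–Grok | 111 | +0.647 | +0.747 | +0.692 | +0.618 |
| Ukraine | Gemini–Grok | 46 | +0.569 | +0.610 | +0.615 | +0.508 |
| Ukraine | Gemini–GPT-4 | 85 | +0.744 | +0.792 | +0.851 | +0.801 |
| Japan | Gemini–Grok | 115 | +0.557 | +0.373 | +0.477 | −0.130 |
| China (blind) | Gemini–Grok | 117 | +0.533 | +0.851 | +0.785 | +0.777 |
| Saudi Arabia (blind) | Gemini–GPT-4 | 127 | +0.415 | +0.780 | +0.754 | +0.677 |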

Table 2. Per-society cross-LLM concordance. The two "blind" entries were scored under anonymised society labels (Markerxcxa, Markerxsxa) with identity unknown to the analyst during assessment.

Several patterns emerge. Iran shows the highest Stress concordance (+0.782) despite being a society where Western historiographic consensus is limited — suggesting that material constraint dynamics (revolution, war, sanctions) are legible across LLM architectures regardless of interpretive framing. Japan shows the weakest concordance overall, particularly on Bond Strength (−0.130), reflecting genuine ambiguity in mapping pre-modern Japanese institutional structures onto the CAMS eight-node architecture. Ukraine shows a marked concordance gap between the Gemini–Grok pair (+0.615 NV) and the Gemini–GPT-4 pair (+0.851 NV), likely reflecting differences in temporal coverage and training-data recency.
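The aggregates in Table 1 can be cross-checked against the per-pair values in Table 2. The sketch below copies the Table 2 figures as printed and recomputes mean, median, min, and max per column. Coherence is omitted because Table 2 does not report it per pair, and the Iran row has no Bond Strength value in the source, which is why that column has 13 entries. The variable names are illustrative.

```python
from statistics import mean, median

# (society, pair, Stress, Capacity, Node Value, Bond Strength) copied from Table 2.
# Iran has no Bond Strength value in the source, hence None.
TABLE_2 = [
    ("USA", "Gemini-Grok", 0.682, 0.844, 0.842, 0.706),
    ("USA", "Gemini-GPT-4", 0.720, 0.779, 0.868, 0.657),
    ("USA", "Grok-GPT-4", 0.503, 0.405, 0.683, 0.679),
    ("Finland", "Gemini-Grok", 0.662, 0.777, 0.742, 0.739),
    ("China", "Gemini-GPT-4", 0.752, 0.877, 0.844, 0.853),
    ("Iran", "Gemini-Grok", 0.782, 0.884, 0.861, None),
    ("Norway", "Gemini-Grok", 0.694, 0.874, 0.780, 0.767),
    ("South Africa", "Gemini-Grok", 0.618, 0.758, 0.727, 0.666),
    ("Australia", "Gemini-Grok", 0.647, 0.747, 0.692, 0.618),
    ("Ukraine", "Gemini-Grok", 0.569, 0.610, 0.615, 0.508),
    ("Ukraine", "Gemini-GPT-4", 0.744, 0.792, 0.851, 0.801),
    ("Japan", "Gemini-Grok", 0.557, 0.373, 0.477, -0.130),
    ("China (blind)", "Gemini-Grok", 0.533, 0.851, 0.785, 0.777),
    ("Saudi Arabia (blind)", "Gemini-GPT-4", 0.415, 0.780, 0.754, 0.677),
]
COLUMNS = {"Stress": 2, "Capacity": 3, "Node Value": 4, "Bond Strength": 5}

def summarise(column):
    """Aggregate one Table 2 column, skipping missing entries."""
    idx = COLUMNS[column]
    values = [row[idx] for row in TABLE_2 if row[idx] is not None]
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "n": len(values),
    }

summary = {name: summarise(name) for name in COLUMNS}
for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The recomputed values agree with Table 1 to within rounding of the third decimal place.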

3.3 Node-Level Concordance

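| Node | Mean ρ | Median ρ | Min | Max |
|---|---|---|---|---|
| Craft | +0.731 | +0.778 | +0.421 | +0.837 |
| Helm | +0.709 | +0.783 | +0.479 | +0.851 |
| Stewards | +0.662 | +0.634 | +0.550 | +0.826 |
| Archive | +0.647 | +0.671 | +0.468 | +0.820 |
| Hands | +0.640 | +0.648 | +0.464 | +0.765 |
| Flow | +0.635 | +0.675 | +0.133 | +0.847 |
| Shield | +0.633 | +0.630 | +0.423 | +0.925 |
| Lore | +0.517 | +0.476 | +0.335 | +0.836 |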

Table 3. Node-level cross-LLM concordance on Node Value trajectories, averaged across 7 society pairs.

The node-level hierarchy is theoretically coherent. Craft (professional and trade institutions) shows the highest concordance: these are the most empirically observable institutional features, with clear material indicators. Lore (knowledge, religion, education) shows the lowest concordance: these are the most interpretive, culturally embedded institutional features, where different models' training priors diverge most substantially. This gradient from material observability to interpretive complexity is broadly the pattern predicted by the hard/soft field classification.

3.4 Structural Rank Concordance

When LLMs are asked not "how strong is each node?" but "which nodes are strongest and weakest?", the rank-order agreement is:

| Society | Spearman ρ | p-value |
|---|---|---|
| Norway | +0.881 | 0.004 |
| USA | +0.810 | 0.015 |
| Australia | +0.690 | 0.058 |
| Finland | +0.667 | 0.071 |
| China | +0.500 | 0.207 |
| South Africa | +0.381 | 0.352 |
| Iran | +0.381 | 0.352 |

Table 4. Spearman rank concordance on mean node-value profiles. All correlations are positive; the strongest concordance appears in well-documented democratic societies.

The concordance gradient (strongest for Norway and USA, weakest for South Africa and Iran) tracks the density of English-language historiographic coverage in the training corpora. This is an expected limitation, not a disqualifying one. With only eight nodes per profile, only the Norway and USA correlations reach p < 0.05, so the weaker values should be read as indicative rather than conclusive. The important finding is that all rank correlations are positive: no LLM pair produces an inverted institutional profile for any society.
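A minimal sketch of the rank comparison follows, assuming each assessor's output has already been reduced to a mean Node Value per node. The profile values are hypothetical; the node names are the eight CAMS nodes. Spearman ρ is computed as the Pearson correlation of the ranks, with tied values given their average rank.

```python
from math import sqrt

NODES = ["Helm", "Shield", "Lore", "Archive", "Stewards", "Craft", "Hands", "Flow"]

def ranks(values):
    """Ascending ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average_rank
        i = j + 1
    return result

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def spearman(profile_a, profile_b):
    """Spearman rank correlation between two mean node-value profiles."""
    a = [profile_a[node] for node in NODES]
    b = [profile_b[node] for node in NODES]
    return pearson(ranks(a), ranks(b))

# Hypothetical mean Node Value per node for two assessors.
profile_a = {"Helm": 9.1, "Shield": 8.0, "Lore": 6.5, "Archive": 7.2,
             "Stewards": 8.4, "Craft": 8.8, "Hands": 5.9, "Flow": 7.6}
profile_b = {"Helm": 8.2, "Shield": 7.9, "Lore": 5.0, "Archive": 6.1,
             "Stewards": 7.5, "Craft": 8.6, "Hands": 5.4, "Flow": 6.8}

rho = spearman(profile_a, profile_b)
print(round(rho, 3))
```

Because the comparison uses ranks only, a constant offset between assessors (one scoring every node higher) leaves the result unchanged, which is why this level is insensitive to the calibration differences discussed in Section 5.3.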

3.5 Crisis Detection Concordance

Crisis years were identified independently per dataset as years where system-mean Stress exceeds one standard deviation above the dataset mean. Jaccard similarity measures the overlap between LLM-identified crisis sets:

| Society | LLM Pair | Jaccard | Concordant Crisis Years |
|---|---|---|---|
| Iran | Gemini–Grok | 0.467 | 1908, 1915–16, 1918–20, 1941–43, 1952–53, 1988, 2009, 2022 |
| USA | Gemini–Grok | 0.417 | 1974, 1979–80, 2008, 2020 |
| Finland | Gemini–Grok | 0.200 | 1918, 1940, 1942–46 |
| USA | Gemini–GPT-4 | 0.207 | 1973–74, 1979–80, 2008, 2020 |
| China | Gemini–GPT-4 | 0.111 | 1976, 1989, 2022 |

Table 5. Crisis detection concordance. Jaccard values are moderate to low, reflecting differences in threshold calibration, but the temporally matched crisis years correspond to historically documented coordination failures in every case.

The low Jaccard values reflect a systematic calibration difference: some models (particularly GPT-4) score stress higher on average, producing more years above the one-standard-deviation threshold. When the analysis is restricted to the top-5 stress peaks per dataset, temporal alignment improves substantially — major crises are identified at the same points even when overall calibration differs.
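The crisis rule and both comparison modes can be sketched as follows. The stress series are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or sample form is used. The `top_peaks` helper illustrates the top-5 restriction described above; its name and signature are illustrative.

```python
from statistics import mean, pstdev

def crisis_years(stress_by_year):
    """Years where system-mean Stress exceeds the dataset mean by more than one SD.
    Assumption: population SD (pstdev); the paper does not specify which form."""
    values = list(stress_by_year.values())
    threshold = mean(values) + pstdev(values)
    return {year for year, stress in stress_by_year.items() if stress > threshold}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_peaks(stress_by_year, k=5):
    """The k years with the highest Stress, independent of any threshold."""
    return set(sorted(stress_by_year, key=stress_by_year.get, reverse=True)[:k])

# Hypothetical system-mean Stress series for two assessors.
assessor_a = {2000: 3.0, 2001: 3.0, 2002: 3.0, 2003: 9.0, 2004: 3.0, 2005: 4.0}
assessor_b = {2000: 5.0, 2001: 5.5, 2002: 6.0, 2003: 9.5, 2004: 8.0, 2005: 5.0}

threshold_overlap = jaccard(crisis_years(assessor_a), crisis_years(assessor_b))
peak_overlap = jaccard(top_peaks(assessor_a, k=2), top_peaks(assessor_b, k=2))
print(threshold_overlap, peak_overlap)
```

Comparing a fixed number of peaks removes the dependence on how many years each assessor places above its own threshold, at the cost of forcing every dataset to report the same number of crises.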

3.6 Universal Structural Findings Confirmed Across Assessors

Two structural findings hold across all tested datasets and all assessors:

Stress–Capacity anti-correlation. 30 of 30 datasets show negative Pearson correlation between year-averaged Stress and Capacity (mean ρ = −0.700, range −0.056 to −0.988). This includes all three LLM families, all 10+ societies, and time spans from 42 to 166 years. The weakest result (US GPT-4, ρ = −0.056) is an outlier; the next weakest is −0.354.

Bond Strength–System Health coupling. 23 of 23 datasets where Bond Strength was independently scored show positive correlation between mean Bond Strength and mean Node Value (mean ρ = +0.926, range +0.631 to +0.993).

Both findings require caveats. The Stress–Capacity anti-correlation may be partly induced by the scoring rubric: the conceptual definitions of Stress ("load that erodes functioning") and Capacity ("ability to act under load") are near-duals, and an LLM reading the same evidence may naturally push one up while pushing the other down. The Bond Strength–System Health correlation may reflect construct overlap, since both quantities are influenced by the same underlying C/K/S/A inputs even when scored separately. These circularity concerns are real and acknowledged; they do not eliminate the finding but they constrain what can be claimed from it. The honest framing is: the anti-correlation is a robust within-framework regularity, and external proxy validation is needed to confirm it reflects substantive dynamics rather than rubric-induced covariance.

3.7 Year-Over-Year Direction Agreement

| Society | Direction Agreement | N years |
|---|---|---|
| USA | 80.5% | 41 |
| South Africa | 63.9% | 144 |
| Finland | 61.8% | 110 |
| China | 56.0% | 50 |
| Australia | 50.9% | 110 |
| Iran | 44.0% | 125 |
| Norway | 41.5% | 135 |

Table 6. Year-over-year sign concordance on ΔNode Value. Values above 50% indicate better-than-chance agreement on whether coordination improved or deteriorated in each year.

Direction agreement varies substantially. The USA shows the strongest year-level concordance (80.5%), likely reflecting dense, well-documented annual-resolution data in all models' training corpora. Australia (50.9%) is indistinguishable from chance, and the two long series with sparser historical records, Iran (44.0%) and Norway (41.5%), fall below the 50% chance baseline, though their decade-level trend concordance remains strong.
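Sign concordance on ΔNode Value can be sketched as below. The trajectories are hypothetical, and two choices are assumptions rather than the paper's stated method: changes are taken between consecutive overlapping years, and a zero change only counts as agreement when both assessors show zero change.

```python
def sign(x):
    """Return 1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)

def direction_agreement(traj_a, traj_b):
    """Share of year-over-year steps where both trajectories move the same way.
    traj_a, traj_b: dicts mapping year -> system-mean Node Value."""
    years = sorted(set(traj_a) & set(traj_b))
    steps = list(zip(years, years[1:]))
    agreements = sum(
        sign(traj_a[cur] - traj_a[prev]) == sign(traj_b[cur] - traj_b[prev])
        for prev, cur in steps
    )
    return agreements / len(steps), len(steps)

# Hypothetical Node Value trajectories.
assessor_a = {2000: 10.0, 2001: 11.0, 2002: 9.0, 2003: 12.0, 2004: 12.5}
assessor_b = {2000: 8.0, 2001: 8.5, 2002: 8.7, 2003: 9.5, 2004: 9.0}

share, n_steps = direction_agreement(assessor_a, assessor_b)
print(f"{share:.1%} over {n_steps} steps")
```

Five overlapping years yield four year-over-year steps, which matches the pattern in Table 6 where N is one less than the overlapping years reported in Table 2.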


4. The Native Bond Strength Paradox

A finding from detailed concordance testing on China (Markerxcxa, five dataset variants across three LLM assessors) deserves separate treatment because of its implications for measurement methodology.

When Bond Strength is recalculated from base metrics using standardised formulas (e.g., BS = C + K + A − 2S), concordance between LLM pairs decreases compared to using each LLM's natively scored Bond Strength (mean native BS concordance +0.754 vs mean standardised BS concordance +0.645). This is counterintuitive — standardisation should remove formula variance and improve concordance.

The explanation appears to be that natively scored Bond Strength carries concordant information about coordination quality that is not reducible to the four base metrics. The LLMs bring shared historiographic understanding of institutional coupling — particularly for knowledge and memory institutions (Lore, Archive) — that exceeds what any linear combination of Coherence, Capacity, Stress, and Abstraction captures. For metabolic nodes (Flow, Hands), standardised formulas actually improve concordance, suggesting the native advantage is specific to culturally interpretive assessment.

This has a methodological implication: Bond Strength should be treated as a semi-independent scored variable rather than a derived quantity, and concordance should be reported at three layers (base metrics → standardised composites → native holistic scores) to make the interpretive concordance premium explicit.
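The three-layer report could be assembled along the following lines. This is a sketch under stated assumptions: the per-year records are hypothetical system-level means, the key names (`C`, `K`, `S`, `A`, `BS`) are illustrative, and the standardised formula is the single example given in this section rather than a definitive choice.

```python
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def standardised_bs(c, k, s, a):
    """Example standardised formula from Section 4: BS = C + K + A - 2S."""
    return c + k + a - 2 * s

def column(rows, key):
    return [row[key] for row in rows]

def standardised_series(rows):
    return [standardised_bs(r["C"], r["K"], r["S"], r["A"]) for r in rows]

def layered_report(rows_a, rows_b):
    """Concordance at three layers for two assessors' year-aligned records."""
    return {
        "base": {m: pearson(column(rows_a, m), column(rows_b, m)) for m in "CKSA"},
        "standardised_bs": pearson(standardised_series(rows_a), standardised_series(rows_b)),
        "native_bs": pearson(column(rows_a, "BS"), column(rows_b, "BS")),
    }

# Hypothetical year-aligned records; "BS" is the natively scored Bond Strength.
rows_a = [
    {"C": 6, "K": 6, "S": 3, "A": 5, "BS": 7},
    {"C": 5, "K": 6, "S": 4, "A": 5, "BS": 6},
    {"C": 4, "K": 4, "S": 7, "A": 4, "BS": 3},
    {"C": 5, "K": 5, "S": 5, "A": 6, "BS": 5},
]
rows_b = [
    {"C": 7, "K": 6, "S": 4, "A": 6, "BS": 8},
    {"C": 6, "K": 5, "S": 4, "A": 5, "BS": 7},
    {"C": 5, "K": 4, "S": 8, "A": 5, "BS": 4},
    {"C": 6, "K": 5, "S": 6, "A": 5, "BS": 5},
]

report = layered_report(rows_a, rows_b)
premium = report["native_bs"] - report["standardised_bs"]
print(report, round(premium, 3))
```

The difference between the native and standardised layers is one way to express the interpretive concordance premium as a single number per LLM pair.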


5. Scope Conditions and Honest Limitations

5.1 The Shared Training Corpus Problem

The most fundamental limitation is that Gemini, Grok, and GPT-4 share substantially overlapping training data — English-language Wikipedia, academic historiography, news archives, and reference works. Concordance between models trained on overlapping corpora is not equivalent to independent measurement by separate instruments with different data sources. The concordance reported here demonstrates replicability within a family of related textual reasoners, not objective validation of the underlying construct.

The honest claim is: the CAMS scoring protocol produces reproducible structural judgements across multiple LLM assessors. The stronger claim — that these judgements track real coordination dynamics — requires external validation against independent proxies (GDP trajectories, mortality data, trade flows, conflict incidence) that are not part of the scoring input.

5.2 Concordance Is Not Uniform

The variation in concordance across societies is informative. The pattern is clearest in structural rank concordance (Table 4), where well-documented democratic societies (USA, Norway, Finland) show stronger agreement than contested or underrepresented cases (South Africa, Iran), and in Japan's weak system-level concordance (Table 2). It is not uniform across metrics: Iran's system-level trajectories are among the most concordant in Table 2. To the extent the gradient holds, it tracks the density and quality of available historical evidence in English-language training corpora. This means the ensemble methodology works best where it is least needed (well-understood societies) and worst where it would be most valuable (poorly documented or historiographically contested cases).

5.3 Calibration Differences Are Systematic

LLMs differ systematically in absolute scoring tendencies. Grok tends to score Stress higher and with tighter variance than Gemini. GPT-4 tends to identify more crisis years due to higher baseline stress scoring. These calibration offsets do not affect relative dynamics: the relational patterns (which nodes are strongest, when crises occur, how variables co-move) remain stable. But they mean that absolute CAMS scores from different LLMs are not directly comparable without normalisation.
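One simple normalisation is a per-dataset z-score, sketched below. This is an illustrative choice, not a method the paper specifies. The two series are hypothetical, with the second constructed as a shifted and compressed copy of the first to mimic an assessor that scores higher with tighter variance.

```python
from statistics import mean, pstdev

def zscore(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu, sd = mean(series), pstdev(series)
    return [(x - mu) / sd for x in series]

# Hypothetical system-mean Stress for one assessor.
assessor_a_stress = [3.0, 3.5, 5.0, 4.0, 6.0]
# Hypothetical second assessor: same dynamics, higher level, tighter spread.
assessor_b_stress = [5.0 + 0.5 * x for x in assessor_a_stress]

raw_gap = mean(assessor_b_stress) - mean(assessor_a_stress)
z_gap = max(
    abs(a - b)
    for a, b in zip(zscore(assessor_a_stress), zscore(assessor_b_stress))
)
print(round(raw_gap, 2), z_gap < 1e-9)
```

After normalisation the two series coincide, showing that a pure level-and-scale offset carries no information about dynamics. Real assessor pairs will differ by more than an affine transformation, so the residual gap after z-scoring is the part worth examining.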

5.4 The Circularity Constraint

Two of the framework's headline findings — the Stress–Capacity anti-correlation and the Bond Strength–System Health coupling — may be partly or largely artefacts of how the scoring variables are defined. The conceptual definitions create a predisposition toward inverse co-movement. The paper acknowledges this openly. The delta-based version of the anti-correlation (correlating year-over-year changes rather than levels) is less vulnerable to this critique, and the "scissors effect" — steepening of anti-correlation under high system load — is not predicted by a simple rubric-induced mechanism. But full resolution requires validation against external data sources.
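The level-based and delta-based versions of the anti-correlation differ only in whether the series are differenced first. The sketch below uses hypothetical year-averaged Stress and Capacity values; the function names are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def first_differences(series):
    """Year-over-year changes of a series ordered by year."""
    return [cur - prev for prev, cur in zip(series, series[1:])]

# Hypothetical year-averaged system scores, ordered by year.
stress = [3.0, 4.0, 6.0, 5.0, 7.0, 8.0]
capacity = [7.0, 6.5, 5.0, 5.5, 4.0, 3.5]

level_rho = pearson(stress, capacity)
delta_rho = pearson(first_differences(stress), first_differences(capacity))
print(round(level_rho, 3), round(delta_rho, 3))
```

Differencing removes any shared long-run trend, so the delta version asks whether Stress and Capacity move in opposite directions within the same year rather than whether their levels happen to be inversely ordered over the whole series.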


6. Implications

6.1 For the CAMS Programme

Cross-LLM concordance provides a necessary but not sufficient condition for treating the framework as a legitimate measurement system. The results justify continued development and testing, particularly the construction of external validation tests against economic, demographic, and conflict data. They do not yet justify claims of "universal coordination laws."

6.2 For Ensemble AI Methodology

The layered concordance structure — convergence on material constraints, appropriate divergence on interpretive elements — is the signature predicted by treating LLMs as complementary measurement instruments. This pattern has practical applications beyond CAMS: any domain using LLM-generated assessments can apply the hard/soft field distinction to determine which aspects of the assessment are robust and which are model-dependent.

6.3 For Comparative Civilisational Analysis

The finding that independently trained AI systems converge on institutional stress trajectories across radically different governance models (democratic, authoritarian, theocratic, post-colonial) is consistent with the hypothesis that coordination constraints are structural rather than ideological. The same thermodynamic pressures (stress accumulation, capacity erosion, coupling degradation) appear in the data for every tested society, regardless of political system. This does not prove the "common global interests" thesis, but it weakens one possible objection: that the patterns are artefacts of Western-centric analytical frameworks imposed by a single AI system.


7. Summary of Key Findings

  1. System-level concordance is moderate to strong (mean Node Value ρ = +0.751 across 14 pairs), supporting the reproducibility of CAMS structural assessments.
  2. The concordance hierarchy tracks material observability: by mean ρ, Capacity (+0.739) > Coherence (+0.683) > Bond Strength (+0.640) ≈ Stress (+0.634), broadly consistent with the hard/soft field prediction.
  3. Stress–Capacity anti-correlation is universal within the framework: 30/30 datasets, all assessors, all societies (mean ρ = −0.700). The finding is robust but may be partly rubric-induced.
  4. Bond Strength–System Health coupling is universal: 23/23 datasets (mean ρ = +0.926). Subject to construct-overlap caveats.
  5. Node-level concordance tracks interpretive complexity: Craft (most material) shows highest concordance; Lore (most interpretive) shows lowest.
  6. Natively scored Bond Strength outperforms standardised formulas in cross-LLM concordance, suggesting LLMs encode shared coupling information beyond the four-metric decomposition.
  7. Crisis detection converges on historically documented events — major wars, revolutions, economic collapses are identified at the same temporal locations across assessors.
  8. Concordance is strongest for well-documented societies and weakest for historiographically contested cases, consistent with shared training-corpus dependence.

Appendix A: Dataset Inventory

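| Society | LLM | Time Span | Observation-Years |
|---|---|---|---|
| USA | Gemini | 1970–2025 | 43 |
| USA | Grok | 1970–2025 | 56 |
| USA | GPT-4 | 1900–2025 | 126 |
| Finland | Gemini | 1900–2025 | 115 |
| Finland | Grok | 1900–2025 | 123 |
| China | Gemini | 1970–2025 | 56 |
| China | GPT-4 | 1900–2025 | 127 |
| Iran | Gemini | 1900–2025 | 126 |
| Iran | Grok | 1900–2025 | 126 |
| Norway | Gemini | 1880–2025 | 145 |
| Norway | Grok | 1890–2025 | 136 |
| South Africa | Gemini | 1880–2025 | 146 |
| South Africa | Grok | 1880–2025 | 145 |
| Australia | Gemini | 1900–2025 | 126 |
| Australia | Grok | 1900–2025 | 111 |
| Ukraine | Gemini | 1930–2025 | 85 |
| Ukraine | Grok | 1970–2025 | 46 |
| Ukraine | GPT-4 | 1930–2025 | 96 |
| Japan | Gemini | 1850–2025 | 176 |
| Japan | Grok | 1890–2025 | 115 |
| China (blind) | Gemini | 1900–2025 | 126 |
| China (blind) | Grok | 1900–2025 | 126 |
| Saudi Arabia (blind) | Gemini | 1900–2025 | 126 |
| Saudi Arabia (blind) | GPT-4 | 1900–2025 | 127 |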

Appendix B: Stress–Capacity Anti-Correlation (All 30 Datasets)

All 30 tested datasets return negative ρ(S,K). Mean = −0.700. Range: −0.056 to −0.988. Societies covered: USA, Finland, China, Iran, Norway, South Africa, Australia, Ukraine, Japan, Sweden, Denmark, Germany, Brazil, UK, Venezuela, Russia, Thailand, Singapore, Rome. LLM assessors: Gemini, Grok, GPT-4.


Correspondence: neuralnations.org
CAMS Framework: Complex Adaptive Humans (LinkedIn)
