The choice of data format can impact LLM performance by up to 48% while affecting costs by 16-60% through token efficiency variations. Markdown emerges as the optimal default format for most LLM applications, delivering 16% average token savings over JSON while maintaining or improving comprehension accuracy. However, task-specific optimization matters: YAML achieves 62% accuracy for nested data (versus JSON's 50%), while JSON wrapping can devastate code generation performance by up to 26%. The emerging TOON format demonstrates 30-60% token reduction but underperforms established formats in comprehension tasks, reinforcing that LLMs perform better with formats prevalent in their training data—not necessarily those optimized for theoretical efficiency.
The JSON to Markdown conversion problem has matured around a core insight: different JSON structures require fundamentally different Markdown representations. The most common pattern—arrays of uniform objects—maps naturally to Markdown tables, where object keys become headers and values populate rows. This transformation, implemented by libraries like tablemark (JavaScript, 53 stars) and pandas' to_markdown() (Python), works exceptionally well for tabular data but fails for heterogeneous structures. For single objects, the key-value list pattern dominates, rendering {"name": "Product", "price": 99.99} as bulleted lists with bold keys. Deeply nested objects present the trickiest challenge: they either become hierarchical headings (consuming the limited h1-h6 space) or flatten into dot-notation paths like user.address.street.
json2md (IonicaBizau, 628 stars, 124k weekly downloads) exemplifies the converter pattern approach. It uses extensible converter objects where each JSON element type maps to a Markdown transformer function. The library accepts domain-specific JSON like {h1: "Title", table: {headers: [...], rows: [...]}} rather than generic JSON, trading automatic conversion for predictable, high-quality output. Users can inject custom converters: json2md.converters.customElement = (input) => \Custom: ${input}``. This design prioritizes extensibility and control over convenience.
The specialized tablemark library represents the opposite philosophy—doing one thing exceptionally well. It generates only tables but offers 11 casing options (camelCase, kebab-case, sentence case), three text handling strategies for Unicode/emoji/ANSI support, and sophisticated overflow handling. Its configuration object pattern allows fine-grained control: column alignment, width limits, custom cell transformations via toCellText functions, and wrap-versus-truncate strategies. The library intelligently handles edge cases like ANSI escape codes (calculating display width correctly) and line breaks in cells (which Markdown tables cannot natively support). For production table generation, tablemark sets the standard.
Edge case handling reveals maturity in library design. Boolean values can render as true/false, Yes/No, checkmarks (✔/✘), or 1/0 depending on context. Null values present a choice: empty cells, literal "null", "N/A", or omission entirely. The correct answer depends on whether you're documenting an API (distinguish null from missing) or presenting data for readability (prefer omission). Mixed-type arrays force a decision between sparse tables (union all keys, leave cells empty) or separate representations. Deep nesting ultimately requires either flattening with dot notation, showing partial structure with JSON code blocks for the deepest parts, or accepting that some data doesn't translate well to Markdown's inherently flat structure.
Python's ecosystem splits between specialized tools and general-purpose powerhouses. jsonschema-markdown (version 2025.10.2) converts JSON Schema to documentation with tables, multi-language support, and deprecated field handling—perfect for API documentation but useless for arbitrary JSON. For generic conversion, developers often reach for pandas with df.to_markdown() or tabulate with tablefmt="pipe", accepting heavy dependencies for rock-solid table handling. Go's json-schema-docs (Grafana-backed) takes a template-based approach, separating data from presentation through Go templates and enabling complete output customization at the cost of requiring template authorship.
The relationship between token efficiency and LLM comprehension is non-linear and task-dependent, destroying any notion of a universal best format. Testing 1,000 employee records with 1,000 queries across 11 formats revealed that CSV achieves the most efficient token usage (19,524 tokens) but delivers the worst accuracy (44.3%). Meanwhile, Markdown-KV consumes 2.7× more tokens (52,104) yet achieves the highest accuracy (60.7%)—a 37% accuracy improvement despite the token cost. This finding demolishes the assumption that compressing data inherently helps LLMs understand it.
Markdown delivers the best balance of efficiency and comprehension across diverse tasks. Compared to JSON, Markdown achieves 16% token reduction on average (11,612 versus 13,869 tokens in real-world testing with OpenAI's tiktoken). The savings come from eliminating repetitive syntax: no braces wrapping every object, no quotes around keys, no commas between fields. A nested structure like {"user": {"profile": {"name": "Alice"}}} becomes a clean hierarchy of headers. For structured prompting in RAG systems, this 16% reduction translates directly to 16% cost savings and more available context window space. One developer reported 20-30% overall savings in a production system requiring 92,945 input tokens.
The advantage extends beyond token counts to processing efficiency. Markdown aligns closely with natural language, reducing what researchers call "cognitive load" on models. Headers (#, ##), bold, italics, and code blocks provide semantic cues about text importance and structure that JSON's uniform syntax lacks. The Webex Developer Blog analysis found Markdown improved RAG accuracy by 20-35% versus HTML or plain text, attributing the gain to reduced processing overhead and better alignment with training data. Interestingly, JSON and XML require models to navigate nested tags before extracting content, while Markdown presents information directly.
Format performance varies dramatically by task type. For nested/hierarchical data across 1,000 queries with GPT-5-nano, YAML achieved 62.1% accuracy—the best by far—while XML collapsed to 44.4% and TOON to 43.1%. Yet for flat tabular data, HTML scored 65.4% while CSV languished at 44.3%. The pattern holds across models: Gemini 2.5 Flash Lite showed YAML at 51.9% and XML at 33.8% for nested data, an 18-point spread. Code generation tasks reveal JSON's peculiar weakness: Claude-3-5-Sonnet dropped from 68.5% pass rate (Markdown) to 51.2% (JSON)—a devastating 25.3% loss—while DeepSeek Coder V2 fell 26.4%. The JSON wrapper causes syntax errors through quote escaping issues and apparently reduces problem-solving capacity even beyond syntax failures.
Model-specific preferences compound the complexity. GPT-3.5-turbo prefers JSON (59.7% accuracy on MMLU versus Markdown's 50%), while GPT-4 prefers Markdown (81.2% versus JSON's 73.9%)—a complete reversal between model versions from the same provider. This 48.8% performance swing on HumanEval between formats on GPT-3.5 represents the maximum observed variation. Larger models show more robustness (GPT-4 coefficient of mean deviation <0.036 versus GPT-3.5 up to 0.176) but format still matters significantly. The finding that same-series models have transferability IoU>0.7 while different providers fall to IoU<0.2 means optimization work doesn't transfer across vendors.
Minified JSON offers an intriguing middle path for RAG specifically. GovTech Singapore testing found Structure of Arrays format reduced Markdown table tokens by 47% (from 361 to 191 tokens) while producing "very similar" LLM responses with only slightly increased verbosity. Converting [{id:1,name:"Alice"}, {id:2,name:"Bob"}] to {ids:[1,2], names:["Alice","Bob"]} achieves maximum density. The researchers concluded that token efficiency "does not have a significant impact on RAG performance"—suggesting that for retrieval-augmented generation specifically, aggressive compression works without comprehension penalties. This technique suits preprocessing for LLM consumption rather than human-readable documentation.
TOON (Token-Oriented Object Notation) represents the most ambitious attempt to design a format specifically for LLM applications. Created by Johann Schopplich and released at version 1.0 on November 10, 2025, TOON has achieved 16,200 GitHub stars and 653 forks in under a year—remarkable momentum for a data format. The specification lives at github.com/toon-format/spec with production-ready implementations in 20+ languages including TypeScript, Python, Go, Rust, and .NET. The format's core innovation lies in tabular arrays: declaring field names once in a header like users[2]{id,name,role}: followed by data rows without repeated keys, mimicking CSV's efficiency within a structured format.
The token reduction claims are substantial and empirically validated. Official benchmarks across 209 questions, 4 models, and 11 datasets show TOON achieving 2,744 tokens versus JSON's 4,545 (39.6% reduction) or compact JSON's 3,081 (10.9% reduction). For uniform employee records, the savings reach 60.7%; for time-series analytics, 59.0%; for GitHub repositories, 42.3%. Independent testing on a GPT-4o prompt recorded 56% reduction (589 versus 1,344 tokens) with 5 seconds faster response time. The format minimizes quoting (only when strings contain delimiters or look like booleans/numbers), eliminates braces and brackets, and uses indentation for structure. Real-world implementations report 41-49% prompt size reductions in production systems like PRISM v2.4.
However, TOON's comprehension performance tells a more complicated story. While official benchmarks show 73.9% accuracy versus JSON's 69.7%, independent testing by ImprovingAgents found TOON struggling: 47.5% accuracy on tabular data versus Markdown-Table's 51.9%, and a catastrophic 43.1% on nested data versus YAML's 62.1%—the worst performance of all tested formats. The discrepancy likely reflects task dependency: TOON's benchmarks may emphasize different capabilities than the independent tests. Parsing accuracy tells another angle—TOON achieved 73.9% versus JSON's 69.7%, suggesting models can validate TOON structure better even when comprehension lags.
This performance gap illuminates the training data composition hypothesis: LLMs perform better with formats prevalent in their training data regardless of theoretical optimization. Multiple sources describe Markdown as the "native language" of most LLMs because "training data includes natural language with markdown formatting." BPE token encoders optimize on corpora heavily featuring Markdown, making common Markdown patterns likely to tokenize as single tokens while JSON syntax (braces, quotes, commas) often fragments. Statistical associations between Markdown structure and semantic meaning run deep in model weights. One analysis states: "Models have deep statistical associations between markdown structure and semantic meaning" from billions of training examples.
Anthropic's Claude training provides direct evidence for format-specific preferences. Claude was explicitly "trained with XML tags in the training data," making tags like <example>, <document>, <context> particularly effective for guiding output. Anthropic heavily uses XML tags in system prompts, demonstrating deliberate exploitation of training data composition. The academic paper "I Learn Better If You Speak My Language" formalizes this: models show "language style preference" affecting learning capability and performance. Lower perplexity (indicating data closer to learned distribution) predicts better performance. Fine-tuning with style-aligned responses significantly improves outcomes through "Preference Curriculum Learning."
TOON's position in the ecosystem requires nuanced understanding. The format serves as a translation layer: use JSON programmatically, convert to TOON for LLM input, convert back to JSON for result processing. It's not a general-purpose JSON replacement but a specialized optimization for cost-critical, high-volume LLM pipelines with uniform structured data. The MIT-licensed specification, comprehensive test suites, VS Code extension, and CLI tools (npx @toon-format/cli input.json -o output.toon) provide a production-ready ecosystem. For developers facing massive token costs with tabular data, TOON offers real savings—but the comprehension accuracy trade-off demands careful testing on specific use cases rather than blind adoption.
StructEval (arXiv:2505.20139v1, 2025) represents the most comprehensive format generation benchmark to date. Testing 18 formats including JSON, XML, YAML, Markdown, CSV, TOML, HTML, React, SVG, LaTeX, TikZ, and Mermaid across 2,035 examples and 44 task types, researchers evaluated both generation (natural language to structure) and conversion (format to format) capabilities. GPT-4o achieved 66.8% average accuracy—the best overall—while the leading open-source model Qwen3-4B scored 56.9%, a 10-point gap. Generation proved harder than conversion across all models. High success rates (>90%) appeared for JSON, HTML, CSV, and Markdown generation, while TOML, SVG, and Mermaid generation fell below 40%, revealing which formats align with training data.
The SUC (Structural Understanding Capabilities) benchmark from Sui et al. (WSDM 2024, arXiv:2305.13062v4) isolated LLM ability to understand table structure through 7 tasks: table partition, size detection, cell lookup, column/row retrieval, and merged cell detection. Testing 1,500 tables per task from datasets like TabFact and HybridQA, researchers found HTML performed best at 65.43% average accuracy—6.76% better than natural language with separators. GPT-4 consistently scored 80%+ versus GPT-3.5's 40-60%. Critically, removing one-shot examples caused the largest accuracy drop (-30.38%), while placing external information before tables improved performance by 6.81%. Downstream tasks showed dramatic gains: ToTTo BLEU-4 score jumped from 17.24% to 22.92% (+33%) using HTML with self-augmentation.
He et al.'s prompt format impact study (arXiv:2411.10541v1, 2024) revealed model-specific format preferences using matched pairs t-tests across Plain Text, Markdown, JSON, and YAML. Testing MMLU (14,079 questions), HumanEval (164 problems), NER Finance (500 samples), and CodeXGLUE (1,000 examples), researchers documented the GPT-3.5/GPT-4 divergence: GPT-3.5 achieved best results with JSON in 5/7 benchmarks while GPT-4 preferred Markdown in 5/7. The maximum gap reached 48.8% on HumanEval (GPT-3.5: JSON 59.8% versus Plain 40.2%), demonstrating statistical significance at p<0.05. Format transferability analysis showed same-series models achieve IoU>0.7 while different providers fall below IoU<0.2—optimization doesn't transfer across vendors.
The Aider code generation experiment provides controlled evidence of JSON's harm to code tasks. Testing 133 Exercism problems with 5 runs per configuration using Pass@1 metrics, results showed minimal impact for GPT-4o-2024-05-13 (72.8% Markdown versus 72.4% JSON) but severe degradation in newer models: GPT-4o-2024-08-06 lost 10.7% (73.2% to 65.4%), Claude-3-5-Sonnet lost 25.3% (68.5% to 51.2%), and DeepSeek Coder V2 lost 26.4% (61.3% to 45.1%). The JSON burden causes more syntax errors through quote escaping and apparently reduces problem-solving capacity beyond syntax failures alone. OpenAI strict mode offered no improvement, suggesting fundamental incompatibility between JSON wrapping and code generation.
Statistical validation methods ensure findings represent genuine effects rather than noise. Matched pairs t-tests compare performance on identical datasets with only format varying, testing null hypothesis of no difference with significance threshold p<0.05. Confidence intervals (95% via bootstrap) assess reliability—all ImprovingAgents studies report CIs showing accuracy ranges of ±3-4 percentage points. Coefficient of Mean Deviation (CMD) measures format sensitivity: GPT-4's CMD<0.036 indicates high robustness versus GPT-3.5's up to 0.176 showing high sensitivity. Intersection-over-Union (IoU) quantifies format transferability between models, revealing that optimal formats don't transfer across providers.
Practitioners can design reproducible experiments by controlling five critical variables: content (identical semantic information across formats), prompt structure (only format syntax differs), model settings (temperature=0 for determinism), evaluation criteria (automated, consistent scoring), and sample selection (random or stratified, minimum 100 samples for significance). The testing protocol involves converting canonical data to candidate formats, running multiple trials (≥3) with the specific target model, measuring both accuracy and token usage, applying matched pairs t-tests, and calculating confidence intervals before selecting the optimal format for the specific use case and documenting for reproducibility.
Default to Markdown for most LLM applications as the evidence-based starting point. It delivers 16% average token efficiency over JSON while matching or exceeding comprehension accuracy across diverse tasks, offers excellent human readability for debugging and collaboration, and aligns with LLM training data composition for strong model understanding. The format handles conversion from arrays of objects to tables (using libraries like tablemark or json2md), single objects to key-value lists with bold keys, nested objects to hierarchical headings, and primitive arrays to bulleted lists. For edge cases, configure null representation (empty cell, "N/A", or omission), boolean display (true/false, Yes/No, or checkmarks ✔/✘), and line break handling (strip, truncate, or replace with spaces since Markdown tables cannot contain newlines).
Optimize beyond Markdown when specific requirements demand it. Use YAML for nested data where accuracy matters most—it achieved 62.1% accuracy on hierarchical structures versus Markdown's 54.3% and JSON's 50.3% across multiple models. The 10% token efficiency penalty versus Markdown (42,477 versus 38,357 tokens) pays for superior comprehension of complex configuration objects and deeply nested relationships. Use Markdown-KV for critical accuracy on tabular lookups, accepting 2.7× token costs (52,104 versus 19,524 for CSV) for 37% better accuracy (60.7% versus 44.3%). This format represents individual records with headers and key-value pairs in code blocks, creating maximum clarity for retrieval tasks.
Avoid JSON for code generation tasks entirely. The evidence shows consistent, severe degradation: Claude-3-5-Sonnet loses 25.3% pass rate, DeepSeek Coder V2 loses 26.4%, and GPT-4o-2024-08-06 loses 10.7% when code appears in JSON wrappers. Use Markdown code blocks with language tags instead. Similarly, avoid CSV unless extreme token constraints force the choice—it's 3.4× more efficient than JSON but 8% less accurate, and the poor comprehension (44.3% accuracy) makes it practical only for the most token-starved scenarios. If using CSV, repeat headers every ~100 records to improve long-sequence understanding.
For specialized optimization, minified JSON (Structure of Arrays) offers 47% token reduction versus Markdown tables in RAG systems specifically. Convert [{id:1,name:"Alice"}, {id:2,name:"Bob"}] to {ids:[1,2], names:["Alice","Bob"]} when preprocessing structured data for LLM consumption rather than human reading. GovTech Singapore found similar comprehension to Markdown despite half the tokens. TOON provides 30-60% token savings for uniform tabular data but demands testing on your specific tasks—independent benchmarks show accuracy concerns (43-47% versus 52-62% for established formats) despite official claims, likely due to training data unfamiliarity.
The format selection process requires empirical validation on your specific model, data, and tasks rather than relying on general recommendations. Sample 100-500 representative examples from production data, ensuring diverse difficulty levels. Convert to 3-4 candidate formats based on data structure (Markdown, YAML, JSON, Markdown-KV for objects; Markdown, HTML, CSV for tables; YAML, Markdown, JSON for nested) while verifying semantic equivalence through round-trip conversion. Run evaluation with your target model using temperature=0 for determinism across ≥3 trials per format, measuring accuracy via task-appropriate metrics (Pass@1 for code, exact match for Q&A, F1 for extraction), token usage via model-specific tokenizer, and total cost combining both factors.
Apply matched pairs t-tests between best and worst formats with significance threshold p<0.05, calculate 95% confidence intervals on accuracy to assess reliability, and measure Coefficient of Mean Deviation to understand format sensitivity. Document results including format rankings, statistical significance, token efficiency trade-offs, and model-specific behaviors. Select the format balancing accuracy requirements with cost constraints—a 16-point accuracy improvement might justify 2.7× token costs for critical applications while cost-sensitive systems might accept 8% accuracy loss for 3.4× efficiency.
Common pitfalls destroy experiment validity: changing more than format syntax between conditions, running single trials without statistical validation, inconsistent evaluation criteria across formats, ignoring token costs in optimization decisions, and failing to test statistical significance of differences. Control content (identical semantic data), prompt structure (only format syntax varies), model parameters (fixed temperature and sampling), evaluation criteria (automated, consistent), and sample selection (random or stratified). The relationship between token efficiency and comprehension is non-linear and task-dependent—the most efficient format often delivers poor comprehension while optimal comprehension may cost significantly more tokens.
Model-specific testing matters critically because optimal formats don't transfer. GPT-3.5 prefers JSON while GPT-4 prefers Markdown—a complete reversal within the same product line. Claude-3-5-Sonnet benefits from XML tags due to explicit training. Llama 3.2 3B shows format agnosticism. Transferability IoU measurements confirm same-series models share preferences (IoU>0.7) while different providers diverge (IoU<0.2). Test with your deployment target, measure both accuracy and tokens, apply statistical validation, balance trade-offs based on your priorities, and document for reproducibility. The evidence strongly supports that format choice significantly impacts both performance (up to 48% variation) and cost (16-60% token differences), making empirical optimization a high-value activity.
The convergence of evidence reveals a fundamental principle: LLM format optimization is training data archaeology. Models don't prefer theoretically optimal formats—they prefer formats matching their training distribution. Markdown dominates not because it's inherently superior but because billions of training examples (GitHub READMEs, documentation sites, Stack Overflow, blogs) embedded statistical associations between Markdown structure and semantic meaning. This explains TOON's paradox: designed specifically for LLM efficiency with explicit structural guardrails like [N] array length markers and {field1,field2} declarations, it achieves stunning token reduction (30-60%) yet underperforms established formats in independent comprehension tests (43-47% accuracy versus 52-62% for Markdown/YAML). The format is theoretically better but empirically worse because LLMs lack adequate training examples.
This insight suggests a strategic timeline for novel formats. TOON's current position mirrors JSON's early days or YAML's initial adoption—technically superior but fighting training data inertia. As more developers use TOON for prompts, more TOON-formatted data appears in training corpora, potentially creating a positive feedback loop. The format's comprehensive specification (v2.0), production-ready tooling (20+ language implementations), and growing community (16,200 GitHub stars in under a year) provide infrastructure for mainstream adoption. Within 2-3 model training cycles (2026-2028), if TOON appears sufficiently in training data, comprehension accuracy could match or exceed today's token efficiency advantage.
The non-linearity of the accuracy-efficiency trade-off demands sophisticated optimization strategies. CSV achieves 3.4× better token efficiency than Markdown-KV but loses 37% accuracy—the relationship isn't proportional. This creates a multi-objective optimization problem where the Pareto frontier matters more than single-metric optimization. For token-constrained applications (large documents, many API calls), 16% Markdown savings over JSON might enable fitting critical context within limits. For accuracy-critical retrieval (legal document lookup, medical record queries), 37% accuracy gains justify 2.7× token costs. The optimal choice depends on your position in the cost-accuracy space and which resource is the binding constraint.
Model evolution introduces temporal dynamics. GPT-3.5 and GPT-4's format preference reversal (JSON to Markdown) between versions suggests that as models grow more capable, they increasingly align with human-readable formats. GPT-4's low format sensitivity (CMD<0.036) versus GPT-3.5's high sensitivity (CMD up to 0.176) indicates larger models become more robust to format variations—but not format-agnostic. The 48.8% performance swings on specific tasks mean format still matters enormously even for frontier models. This suggests format selection remains a critical optimization dimension for the foreseeable future rather than becoming obsolete as models improve.
The code generation findings reveal task-format interactions beyond simple preferences. JSON's 25-26% performance loss on code tasks (Claude, DeepSeek) versus minimal impact on natural language tasks indicates format choice interacts with task type in complex ways. The JSON wrapper apparently consumes cognitive resources that code generation desperately needs, creating a capacity problem beyond syntax. This suggests thinking about format selection not just as input encoding but as cognitive load management—simpler formats free model capacity for harder reasoning. For complex tasks pushing model limits, format simplification might unlock performance gains beyond token savings.
The ultimate strategic insight: treat format selection as an empirical optimization problem with statistical validation, not an architectural decision based on convention. The 16-60% token differences and up to 48% accuracy variations make format testing one of the highest-ROI optimization activities available. A simple experiment with 100 examples, 3-4 candidate formats, and matched pairs t-tests can identify quick wins worth thousands of dollars annually in API costs or double-digit accuracy improvements. Combined with the evidence that optimal formats don't transfer between providers (IoU<0.2), this makes format testing a necessary step for production LLM systems rather than premature optimization. The era of format-agnostic LLM applications hasn't arrived—format engineering remains a critical skill for 2025 and beyond.