Markdown is the lingua franca of AI
A format designed for bloggers in 2004 now sits at the center of how AI systems read, write, and think.
Table of contents
- How we got here
- A quick origin story
- Markdown before AI
- The present: Markdown is everywhere in AI
- Why Markdown won
- The conversion industrial complex
- Agents live in Markdown
- Where this is going
- The risks of monoculture
- Final thoughts
How we got here
If you work with LLMs at all, you’ve probably noticed something: Markdown is everywhere. Ask Claude a question, you get Markdown back. Ask GPT-4, same thing. Feed a PDF into a RAG pipeline, the first thing that happens is it gets converted to Markdown. Cursor rules? Markdown. Agent memory files? Markdown. System prompts? Mostly Markdown.
Nobody planned this. There was no committee meeting where the AI industry decided Markdown would be its interchange format. It happened gradually, and then all at once, and at this point it’s so baked in that most people don’t even think about it.
I keep thinking about how weird this is. A format that John Gruber made for writing blog posts two decades ago is now the connective tissue between humans and AI systems. It’s worth understanding why.
A quick origin story
Markdown was created in 2004 by John Gruber, with help from Aaron Swartz. The original motivation was simple: writing HTML by hand is annoying. Gruber wanted a way to write in plain text that could be converted to HTML, but that was also pleasant to read as-is.
The design goals, from Gruber’s original documentation:
A Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.
That’s it. No grand ambitions about becoming a universal data format. Just a way to make writing for the web less painful.
The syntax borrowed from conventions people were already using in plain text emails — *asterisks* for emphasis, > for quotes, - for lists. This wasn’t invention so much as codification of what people naturally did when writing without a word processor.
Gruber released it as a Perl script that converted Markdown to HTML. The spec was informal, described in prose on his site, and there were ambiguities all over the place.
Markdown before AI
Before 2022, Markdown had already become the dominant format for a few specific domains.
Developer documentation. GitHub adopted Markdown for README files, issues, pull requests, and wikis. GitHub Flavored Markdown (GFM) added fenced code blocks, task lists, and tables. By the mid-2010s, if you were a developer, you were writing Markdown daily whether you thought about it or not.
Static site generators. Jekyll, Hugo, Gatsby, and dozens of others used Markdown as the authoring format. Write your content in .md files, the build tool handles the HTML.
Note-taking. Obsidian, Notion (partially), Bear, Typora, and a wave of other apps adopted Markdown or Markdown-like syntax. The appeal was portability: your notes were just text files, not locked in some proprietary database.
Technical writing. API docs, runbooks, architecture decision records, RFCs. Markdown became the default for anything technical that needed to be version-controlled alongside code.
By the time LLMs arrived, there was already a massive corpus of Markdown content on the internet. Billions of .md files on GitHub alone. Documentation sites, blog posts, forum comments on Stack Overflow and Reddit (which use Markdown-like formatting). This matters a lot for what comes next.
The present: Markdown is everywhere in AI
LLM outputs are Markdown
Every major LLM outputs Markdown by default. When ChatGPT gives you a response with headers, bullet points, bold text, and code blocks, that’s Markdown being rendered. Claude does the same. Gemini too.
This isn’t just a rendering choice by the chat interface. The models themselves have learned that Markdown is the appropriate output format for structured responses. They produce ## headers, **bold** text, ``` fenced code blocks, and - list items because that’s what they saw in training data. The training corpus was saturated with Markdown.
You can see this if you hit the API directly. Ask GPT-4 or Claude a question via the raw API with no system prompt, and the response comes back with Markdown formatting. It’s the model’s natural mode of expression.
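If you want to verify this, here’s a minimal sketch with the OpenAI Python SDK; the model name is illustrative, and there’s deliberately no system prompt:

```python
# A minimal sketch with the OpenAI Python SDK: no system prompt, no
# formatting instructions. The model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How do I install a Python package?"}],
)

# The reply typically arrives already formatted: numbered steps,
# inline code, fenced code blocks.
print(response.choices[0].message.content)
```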
LLM inputs are Markdown
It goes the other direction too. System prompts are written in Markdown. If you look at the system prompts for ChatGPT, Claude, or any serious AI product, they’re structured with Markdown headers, lists, and code blocks.
This makes sense. Markdown is a way to add structure to text without consuming many tokens. A ## header costs a couple of tokens but tells the model “this is a section boundary.” A - bullet costs a token or two but communicates “this is one item in a list.” Compare that to XML tags (<section><header>...</header></section>) or JSON, which burn tokens on structural overhead.
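You can put rough numbers on this with a tokenizer. A sketch using tiktoken, OpenAI’s open source tokenizer (exact counts vary by model, but the gap is consistent):

```python
# A rough sketch with tiktoken. Exact counts vary by model and
# tokenizer version, but the Markdown/XML gap is consistent.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

markdown = "## Setup\n\n- install the deps\n- run the tests"
xml = "<section><header>Setup</header><item>install the deps</item><item>run the tests</item></section>"

print(len(enc.encode(markdown)))  # noticeably fewer tokens...
print(len(enc.encode(xml)))       # ...than the tag-heavy version
```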
RAG pipelines convert everything to Markdown
Here’s where it gets really interesting. When you build a RAG (retrieval-augmented generation) pipeline, you need to convert source documents into a format the LLM can consume. PDFs, Word docs, HTML pages, PowerPoint slides — they all need to become text.
The default target format is Markdown. Not plain text, not HTML, not JSON. Markdown.
Why? Because Markdown preserves just enough structure. A PDF converted to plain text loses all formatting — headers become indistinguishable from body text, tables collapse into gibberish. HTML preserves structure but adds massive token overhead with all those tags. Markdown hits a sweet spot: headers remain headers, lists remain lists, tables remain tables, and the token cost is minimal.
Tools like Docling, MarkItDown (Microsoft’s open source library), Jina Reader, and Unstructured all do the same thing: take some document format as input, produce Markdown as output. There’s an entire ecosystem of tools whose job is to convert things into Markdown so that LLMs can read them.
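The usage pattern is nearly identical across these tools. A sketch with Microsoft’s MarkItDown, where the input file name is hypothetical; Docling and the others follow the same shape:

```python
# A sketch of the conversion step with Microsoft's MarkItDown; the
# input file name is hypothetical. Arbitrary document in, Markdown out.
from markitdown import MarkItDown

converter = MarkItDown()
result = converter.convert("quarterly_report.pdf")

# This Markdown string is what actually gets chunked and embedded.
print(result.text_content)
```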
Fine-tuning data is Markdown
When you fine-tune a model or create training data for RLHF, the preferred responses are typically formatted in Markdown. The human annotators writing gold-standard responses use Markdown. This creates a feedback loop: models are trained on Markdown, so they produce Markdown, so new training data is in Markdown, so the next generation of models is even more fluent in Markdown.
Why Markdown won
You could argue that any lightweight format could have filled this role. reStructuredText exists. AsciiDoc exists. Textile existed. Wiki markup exists. Why Markdown?
1. Readability as plain text
Gruber’s original design goal turned out to be exactly the right property for LLM training. A Markdown file is readable by both humans and machines without any processing. The formatting syntax is minimal enough that it doesn’t confuse language models, but expressive enough to convey document structure.
Compare:
## Installation
1. Clone the repo
2. Run `npm install`
3. Copy `.env.example` to `.env`
vs. HTML:
<h2>Installation</h2>
<ol>
<li>Clone the repo</li>
<li>Run <code>npm install</code></li>
<li>Copy <code>.env.example</code> to <code>.env</code></li>
</ol>
The Markdown version is about 85 characters. The HTML version is roughly twice as long. In a context window where every token matters, that’s not a trivial difference.
2. Already dominant in training data
GitHub alone hosts hundreds of millions of repositories, most of which have a README.md. Stack Overflow uses Markdown. Reddit uses a Markdown-like format. Developer blogs overwhelmingly use Markdown. Documentation sites use Markdown. This means LLMs absorbed enormous amounts of Markdown during pre-training, and they know how to work with it fluently.
If reStructuredText had been the dominant format on GitHub, we’d probably be talking about reStructuredText right now. But it wasn’t.
3. Token efficiency
Markdown’s formatting syntax is lightweight. A ## costs much less than <h2></h2>. A - costs less than <li></li>. When you’re operating within a 128K or 200K token context window, these savings compound. A complex document in Markdown might use 30-40% fewer tokens than the same document in HTML.
4. Good enough structure
Markdown doesn’t try to represent everything. You can’t express complex nested data structures, precise layouts, or semantic relationships. But Markdown handles the things AI systems need most just fine: section headings, paragraphs, lists, code blocks, tables, links, and emphasis.
The formats that express more (HTML, LaTeX, XML) pay for that expressiveness with complexity. The formats that express less (plain text) lose structural information that LLMs benefit from. Markdown is right in the middle.
5. Forgiving syntax
Markdown is remarkably tolerant of inconsistency. Mix * and - for list markers? Works fine. Forget a blank line before a header? Most parsers handle it. Use # or ## or ### inconsistently? The document still makes sense.
This matters because LLMs generate Markdown that’s often slightly malformed. Missing a closing backtick, inconsistent indentation, weird edge cases with nested lists. A strict format would break. Markdown’s looseness means these imperfections are usually invisible.
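You can see this tolerance directly by feeding deliberately sloppy input to a parser. A quick sketch with markdown-it-py, a CommonMark implementation for Python:

```python
# A sketch of Markdown's tolerance: deliberately inconsistent input
# still renders without complaint.
from markdown_it import MarkdownIt

sloppy = """# Setup
* first item
- second item, with a different marker
## A header with no blank line above it"""

print(MarkdownIt().render(sloppy))  # valid HTML, no errors raised
```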
The conversion industrial complex
One of the clearest signs that Markdown has become AI’s interchange format is the ecosystem of conversion tools that has sprung up in the last two years.
Docling (IBM): Converts PDFs, Word documents, PowerPoint, and images to Markdown. Uses layout analysis and OCR to preserve document structure. It’s specifically designed for feeding documents into LLM pipelines.
MarkItDown (Microsoft): Open source library that converts Office documents, PDFs, images, audio, and HTML to Markdown. Microsoft built this because their own AI products needed a universal document-to-text pipeline, and they chose Markdown as the target.
Jina Reader: Pass it any URL and it returns the page content as Markdown. Built for RAG pipelines that need to ingest web content. There’s a sketch of this below.
Unstructured: Document processing library that converts PDFs, images, emails, and more into structured elements, with Markdown as a primary output format.
Crawl4AI, Firecrawl: Web crawlers designed for AI pipelines. They crawl websites and return Markdown. Not HTML. Not JSON. Markdown.
The pattern is consistent: whatever format the source data is in, the first step is to convert it to Markdown. Then the LLM can work with it. Markdown has become the “common tongue” that bridges the gap between arbitrary document formats and language model context windows.
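Jina Reader’s version of the pattern is almost comically simple: prefix any page URL with https://r.jina.ai/ and the response body is that page as Markdown. A sketch, with a hypothetical article URL:

```python
# A sketch of the URL-to-Markdown pattern via Jina Reader's public
# endpoint. The article URL is hypothetical.
import requests

page = "https://example.com/some-article"
resp = requests.get("https://r.jina.ai/" + page)
resp.raise_for_status()

markdown_text = resp.text  # ready to chunk, embed, or paste into a prompt
print(markdown_text[:500])
```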
Agents live in Markdown
AI agents have taken this even further. The entire agent ecosystem runs on Markdown at almost every layer.
System prompts are Markdown documents. Open source agent frameworks like LangChain, CrewAI, and AutoGen all use Markdown-formatted prompts. Anthropic’s own system prompt for Claude is a Markdown document with headers, lists, and code blocks.
Memory files are often Markdown. Cursor stores rules as .md files. Windsurf has .windsurfrules files that are Markdown. Many agent frameworks persist memory as MEMORY.md or similar plain-text Markdown files that get loaded into context.
Tool descriptions use Markdown. When you define tools for an agent — function calling in OpenAI’s API, tool use in Claude — the descriptions are typically Markdown-formatted strings.
Agent output is Markdown. When a coding agent explains what it’s about to do, shows you a diff, or writes documentation, it’s all Markdown.
Configuration as Markdown. There’s even a growing pattern of using Markdown files as configuration: AGENTS.md, RULES.md, CONVENTIONS.md. These aren’t parsed as structured data; they’re loaded directly into LLM context as Markdown text. The model reads them the same way a human would.
Something I find interesting about this: we’ve gone from “Markdown as a human authoring format” to “Markdown as a machine-readable configuration format.” It works because LLMs are surprisingly good at interpreting semi-structured Markdown text, extracting rules and instructions even when the formatting isn’t perfectly consistent.
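Here’s roughly what the pattern looks like in practice: a sketch that concatenates rule files into a system prompt, with no parsing step in between. The file names follow the conventions above, and the model name is illustrative:

```python
# A sketch of Markdown-as-configuration: rule files are not parsed,
# just concatenated into the system prompt verbatim. File names follow
# the conventions above; the model name is illustrative.
from pathlib import Path
import anthropic

rules = "\n\n".join(
    Path(name).read_text()
    for name in ["AGENTS.md", "MEMORY.md"]
    if Path(name).exists()
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=rules,  # the Markdown goes in as-is; the model reads it like prose
    messages=[{"role": "user", "content": "Refactor the auth module."}],
)
print(response.content[0].text)
```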
Where this is going
Markdown as a universal document layer
The conversion pipeline (any format -> Markdown -> LLM) is becoming standard infrastructure. I expect this pattern to solidify. We’ll see better conversion tools, better structure preservation, and probably some Markdown extensions designed specifically for AI consumption.
There’s already movement here. Some teams are experimenting with metadata blocks in Markdown that help LLMs understand document context — things like authorship, last-modified dates, confidence levels, and source provenance. YAML front matter (the --- blocks at the top of Markdown files) is being repurposed for this.
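A sketch of what reading that metadata looks like, using the python-frontmatter package; the metadata keys are hypothetical, since no standard vocabulary exists yet:

```python
# A sketch using the python-frontmatter package; the metadata keys
# here are hypothetical, since no standard vocabulary exists yet.
import frontmatter

doc = frontmatter.loads("""---
source: https://example.com/report
last_modified: 2025-01-15
confidence: reviewed
---
The body of the document, in ordinary Markdown.
""")

print(doc.metadata)  # {'source': ..., 'last_modified': ..., 'confidence': ...}
print(doc.content)   # the Markdown body, with the front matter stripped
```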
Richer structure without abandoning Markdown
The tension right now is that Markdown isn’t expressive enough for some AI use cases. You can’t represent complex data relationships, nested metadata, or fine-grained semantic annotations in pure Markdown. But moving to a more expressive format (XML, JSON-LD) means losing Markdown’s advantages.
The likely compromise: Markdown with embedded structured blocks. We already see this with YAML front matter and fenced code blocks containing JSON or YAML. Expect more conventions along these lines — Markdown as the prose layer, with structured data embedded where needed.
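Consuming these documents is straightforward: parse the Markdown, pick out the fenced blocks tagged as data, and deserialize only those. A sketch with markdown-it-py, where the file name is hypothetical:

```python
# A sketch with markdown-it-py: treat fenced blocks tagged "json" as
# embedded data and deserialize only those. The file name is hypothetical.
import json
from markdown_it import MarkdownIt

doc = open("runbook.md").read()  # Markdown prose with embedded data blocks

tokens = MarkdownIt().parse(doc)
data_blocks = [
    json.loads(tok.content)
    for tok in tokens
    if tok.type == "fence" and tok.info.strip() == "json"
]

print(data_blocks)  # the structured payloads, separated from the prose
```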
CommonMark and standardization pressure
Markdown’s lack of a single standard has been a problem for most of its two-decade existence. CommonMark tried to fix this and has gained traction, but there are still competing dialects (GFM, MultiMarkdown, PHP Markdown Extra, etc.).
AI might force the standardization question. When you’re converting millions of documents to Markdown for training data, inconsistencies in how different parsers handle edge cases become a real data quality issue. I wouldn’t be surprised if a “Markdown for AI” profile emerges — a strict subset of Markdown that’s optimized for LLM consumption and document conversion.
Competing formats won’t replace Markdown, but they’ll complement it
JSON and YAML handle structured data better. XML handles complex nested documents better. These formats aren’t going away, and for specific use cases (API responses, configuration, data serialization) they’re the right choice.
But for the prose-heavy, semi-structured text that makes up the majority of LLM input and output, Markdown is likely to remain dominant. The question is how Markdown and structured formats coexist. The current answer — Markdown for text, JSON/YAML for data, with switching between them as needed — seems stable enough.
The risks of monoculture
There’s a less comfortable angle to this story. Markdown’s dominance in AI creates some real problems.
Lossy conversion. Converting a PDF or Word doc to Markdown loses information. Page layout, precise formatting, embedded metadata, accessibility tags, complex table structures — Markdown can’t represent all of this. When AI systems only see the Markdown version, they’re working with a degraded copy of the original.
Bias toward text. Markdown is fundamentally a text format. It handles prose, code, and simple tables well. It handles images, diagrams, charts, and multimedia poorly. The reliance on Markdown as an interchange format means AI systems are better at processing text-heavy documents than visually rich ones. Multimodal models are changing this, but the Markdown pipeline still dominates.
Training data homogeneity. If every LLM is trained on Markdown-heavy data and every RAG pipeline converts to Markdown, there’s a risk of models becoming too specialized for this one format. Subtle biases in how Markdown structures information (flat hierarchies, limited nesting, text-centric) might shape how models think about document structure in general.
The ambiguity problem. Markdown’s forgiving syntax is both a strength and a weakness. The same Markdown can be parsed differently by different tools. In an AI pipeline, this means the same source document can produce slightly different Markdown depending on which converter you use, which can lead to inconsistent behavior downstream.
Final thoughts
Markdown wasn’t designed for AI. It was designed for one person who wanted to write blog posts without typing HTML tags. Twenty-two years later, it’s become the format that AI systems read, write, think in, and communicate through.
This happened for boring, practical reasons. Markdown was already everywhere when LLMs arrived. It’s readable by both humans and machines. It’s token-efficient. It’s good enough at representing structure without being heavy. And once the first wave of LLMs internalized Markdown from their training data, every subsequent system had to speak the same language.
Will something eventually replace it? Maybe. But it would need to be readable as plain text, lightweight on tokens, expressive enough for document structure, already present in massive amounts of training data, and forgiving of imperfect syntax. That’s a hard combination to beat. For now, if you’re working with AI in any capacity, you’re working with Markdown. It’s the lingua franca whether anyone chose it or not.