[HN Gopher] Which table format do LLMs understand best?
___________________________________________________________________
Which table format do LLMs understand best?
Author : oidar
Score : 136 points
Date : 2025-10-03 02:59 UTC (2 days ago)
(HTM) web link (www.improvingagents.com)
(TXT) w3m dump (www.improvingagents.com)
| ggm wrote:
| I find this extremely surprising. I would have expected dict
| structures to have higher semantic context associated with them.
| nightshift1 wrote:
| I am not an expert on the subject, but I suggest that you can
| also save context space by using shorter XML element names (like
| f instead of function, c instead of class, etc.). Just add a
| legend at the top or bottom to explain what each abbreviation
| means; LLMs can figure out the mapping without issues. I use this
| approach when generating project structure maps with Tree-sitter.
| I did a quick comparison and didn't notice much degradation with
| Claude, so the context space you save may make it worthwhile. I
| would be interested to see a proper comparison.
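| A minimal sketch of the idea (the tag abbreviations and legend
| here are hypothetical, not from an actual Tree-sitter map):
|
| ```
| <!-- legend: c = class, f = function, p = parameter -->
| <c n="UserService">
|   <f n="create_user">
|     <p n="email"/>
|     <p n="password"/>
|   </f>
| </c>
| ```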
| 1aurent29 wrote:
| Common enough words like `function` and `class` are generally
| encoded as a single token by the tokenizer and may provide a
| slightly better context to the LLM. For OpenAI models, you can
| test this at https://platform.openai.com/tokenizer
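| For a quick local check of token counts, something like this
| works (a sketch using the tiktoken library; o200k_base is the
| encoding used by recent OpenAI models):
|
| ```
| import tiktoken
|
| enc = tiktoken.get_encoding("o200k_base")
| for word in ["function", "class", "f", "c"]:
|     # Print the token IDs each string encodes to
|     print(word, "->", enc.encode(word))
| ```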
| Yiin wrote:
| If both `f` and `function` use one token, are you really saving
| anything?
| reidgreer wrote:
| Interesting. I'm curious how this compares across different model
| families.
| Sharlin wrote:
| > where accuracy is paramount
|
| > accuracy: 60%
|
| Not to mention that the least poorly performing format is
| probably the stupidest way to encode tabular data, beating even
| XML. But I guess that's the new normal because we're trying to
| shoehorn conversational AI models to every use case rather than,
| say, training finetunes that are better at particular tasks.
| (Yes, of course you can't train finetunes when the model is a
| proprietary black box on someone else's computer.) Something
| about hammers and nails...
| mritchie712 wrote:
| They used GPT-4.1 nano; results would be quite different with
| Sonnet or GPT-5.
| lyu07282 wrote:
| Or just regular GPT-4.1; it's quite a capable model.
| fnordpiglet wrote:
| I was looking for the frontier curve where they tested their
| benchmark across different models since this sort of behavior
| is highly parameter, architecture, training, and fine tuning
| sensitive. It's a practically useful question so I was really
| disappointed when a) they didn't publish their code so you
| could test yourself, b) they didn't do even a cursory
| examination of other models and sizes.
| gpt5 wrote:
| Aren't the best performing (markdown tables) and the worst (pipe-
| delimited tables) basically the same format?
| simonw wrote:
| The best performing isn't markdown tables, it's markdown
| key/value pairs:
|
| ## Record 1
| ```
| id: 1
| name: Charlie A0
| age: 56
| city: New York
| department: Operations
| salary: 67896
| years_experience: 7
| project_count: 1
| ```
|
| Which makes sense to me because the problem with formats like
| CSV and regular markdown tables is that it is too easy for
| the model to mistakenly associate a value in a row with the
| wrong header.
|
| Explicit key/value formats like this or YAML or JSON objects
| make that a lot less likely.
| cwmoore wrote:
| I was surprised that XML (56%), with closing tags, wasn't
| as good as YAML/KV (60%), though line breaks perform the
| same kind of grouping function.
|
| Then I realized from the table that XML used about 50% more
| tokens (~75K vs ~50K) for similar accuracy, and for the
| first time felt a kind of sympathy for the LLM...
| mattcollins wrote:
| I'm the person who ran the test.
|
| To explain the 60% a bit more...
|
| With small amounts of input data, the accuracy is near 100%. As
| you increase the size of the input data, the accuracy gradually
| decreases.
|
| For this test, I intentionally chose an input data set large
| enough that the LLM would score in the region of 50% accuracy
| (with variation between formats) in order to maximise the
| discriminative power of the test.
| ysleepy wrote:
| Wouldn't it be more useful to measure the number of rows the
| model can process while still hitting 100% accuracy?
| rcarmo wrote:
| Hmmm. I've been using YAML data for tables for a while now, and
| have had pretty good results.
| cjonas wrote:
| The test really needed to be run on multiple data sizes (50, 100,
| 500, 1000, 5000). The more token-efficient formats would probably
| eventually overtake the token-heavy ones due to context
| pollution. All this test really says is what performs best for
| one particular model at one particular context length.
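| A minimal sketch of that kind of sweep (the synthetic columns and
| helper names are illustrative; the article's benchmark code
| wasn't published):
|
| ```
| import csv, io, json, random
|
| def make_table(n_rows):
|     # Synthetic rows standing in for the benchmark's data
|     random.seed(0)
|     return [{"id": i, "name": f"user_{i}", "age": random.randint(20, 65)}
|             for i in range(n_rows)]
|
| def render(rows, fmt):
|     # Serialize the same rows in each candidate format
|     if fmt == "json":
|         return json.dumps(rows, indent=2)
|     if fmt == "markdown_kv":
|         return "\n\n".join(
|             f"## Record {r['id']}\n"
|             + "\n".join(f"{k}: {v}" for k, v in r.items())
|             for r in rows)
|     buf = io.StringIO()
|     writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
|     writer.writeheader()
|     writer.writerows(rows)
|     return buf.getvalue()
|
| for n_rows in [50, 100, 500, 1000, 5000]:
|     for fmt in ["csv", "json", "markdown_kv"]:
|         prompt = render(make_table(n_rows), fmt)
|         # ...send prompt + lookup questions to each model and score...
| ```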
| lmeyerov wrote:
| That's a cool concept - would be curious about a more common
| setup for agentic data analysis (ex: for using in Claude Code)
| like:
|
| * Multiple tasks vs 1
|
| * O3/o3-mini + 4o/4o-mini instead of nano
|
| * Extra credit: Inside a fixed cost/length reasoning loop
|
| Ex: does the md-kv benefit disappear with smarter models that
| you'd typically use, and thus just become a 2-3x cost?
| brap wrote:
| I wonder how this compares to a more agentic approach where the
| LLM composes SQL queries to answer the questions, for example.
| efitz wrote:
| This was exactly my thought. Rather than feed the table
| directly to the LLM, build agents that extract the data and
| have the LLM act on the extracted data items. Then it's a
| preference issue.
|
| The author didn't see much more than 60% accuracy, which is not
| very useful for many (most?) real-world tasks.
| coeneedell wrote:
| "Agents that extract the data" Are we really reinventing data
| frame readers to have an LLM in the critical path?
| efitz wrote:
| Reinventing? No. Using? Yes, for a lot of good reasons.
|
| LLMs are expensive. Spending tokens to do something in bulk
| that is well suited to existing tools and algorithms is
| wasteful and slow. And the main reason: using an LLM, the
| original author saw only a 60% success rate on the task. Why
| spend many times more time, money, and energy to use an LLM
| on a well-understood preparatory task that it sucks at, when
| you can get much better results far more cheaply with
| off-the-shelf tools and feed their results to the LLM for its
| unique value?
| jitl wrote:
| Yeah I mean for many real world scale datasets you don't want
| to blow the whole context window on a massive markdown file.
| Instead you can provide a tool that presents the data as a
| SQLite database. In my testing, Claude Code seems very capable
| of answering questions via SQLite queries or even `head` and
| `grep` on CSV files.
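| A minimal sketch of that kind of tool (the file name and the
| CSV-to-SQLite loading step are mine, not a description of any
| particular agent setup):
|
| ```
| import csv, sqlite3
|
| # Load a CSV into an in-memory SQLite DB the model can query;
| # assumes the header row contains valid column identifiers
| conn = sqlite3.connect(":memory:")
| with open("employees.csv", newline="") as f:
|     rows = list(csv.DictReader(f))
| cols = ", ".join(rows[0].keys())
| marks = ", ".join("?" for _ in rows[0])
| conn.execute(f"CREATE TABLE t ({cols})")
| conn.executemany(f"INSERT INTO t VALUES ({marks})",
|                  [tuple(r.values()) for r in rows])
|
| # The LLM emits SQL; only the small result set goes into context
| print(conn.execute(
|     "SELECT name, salary FROM t ORDER BY salary DESC LIMIT 5"
| ).fetchall())
| ```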
| bwestergard wrote:
| But the result from the SQL query is going to be... a table.
| So at some point, tables need to go into context, and we need
| to know how well LLMs can incorporate those tables.
| thom wrote:
| Well, ironically you then have the issue of how to present your
| database schema (including important things like the values in
| some categorical fields) to the LLM and in what format, so you
| never really escape this issue.
| xnx wrote:
| Title says "LLMs" (plural) but they only tested one
|
| > We only tested OpenAI's GPT-4.1 nano.
| picardo wrote:
| This should be higher. While the research question is
| interesting, the sample size makes the conclusion highly
| suspect. I'd like to see more research on this.
| cwyers wrote:
| And not even a commonly used one. Gemini Flash or o4-mini would
| have been a much better choice if they wanted a cheap model.
| secwang wrote:
| Maybe Org-mode tables?
| sega_sai wrote:
| Bizarre conclusions, when on average all the formats perform
| poorly, with an average accuracy of 50%. Sure, 60% is better than
| 40%, but both are unusable if you actually care about numbers...
| zeitgeistcowboy wrote:
| My sentiments exactly. All the formats were so poorly read that
| they are all effectively useless.
| zaidf wrote:
| I've been stunned by how many smart people talk so casually
| about LLMs becoming better at math. Do they just forget that a
| calculator that is wrong 1% of the time is a de facto
| calculator that doesn't work and should not be used?
| xnx wrote:
| > I've been stunned by how many smart people talk so casually
| about LLMs becoming better at math
|
| Could they be referring to this?
|
| "Advanced version of Gemini with Deep Think officially
| achieves gold-medal standard at the International
| Mathematical Olympiad"
| https://deepmind.google/discover/blog/advanced-version-of-
| ge...
| westoncb wrote:
| Doing math is not the same as calculating. LLMs can be very
| useful in doing math; for calculating they are the wrong tool
| (and even there they can be very useful, but you ask them to
| use calculating tools, not to do the calculations themselves
| --both Claude and ChatGPT are set up to do this).
|
| If you're curious, check out how mathematicians like Robert
| Ghrist or Terence Tao are using LLMs for math research, both
| have written about it online repeatedly (along with an
| increasing number of other researchers).
|
| Apart from assisting with research, their ability on e.g.
| math olympiad problems is periodically measured and
| objectively rapidly improving, so this isn't just a matter of
| opinion.
| magicalhippo wrote:
| The best math lecturers I had at university sucked at mental
| calculations. Some almost screwed up 2+2 on the blackboard.
|
| Yes LLMs suck at calculating stuff. However they can
| manipulate equations and such, and sometimes impressively so.
| crazygringo wrote:
| You realize that when typing into a calculator, you probably
| hit a wrong key more than 1% of the time? Which is why you
| always type important calculations twice?
|
| I've been stunned by how many smart people talk so casually
| about how because LLMs aren't perfect, they therefore have no
| value. Do they just forget that nothing in the world is
| perfect, and the values of things are measured in degrees?
| BolexNOLA wrote:
| There's a big difference between mistyping 1% of the time
| yourself (human error) and a calculator failing 1% of the
| time (machine error) and I am willing to bet there isn't a
| company out there (maybe a handful of less scrupulous ones)
| that has knowingly shipped a calculator that got it wrong
| 1% of the time. Especially in previous decades when
| countless people were using a dedicated calculator dozens
| of times a day. Hard to imagine a 1% margin of error was
| acceptable.
|
| Not to mention now you have the compounded problem of your
| mistakes plus the calculator's mistakes.
| mattcollins wrote:
| I'm the person who ran the test.
|
| To hopefully clarify a bit...
|
| I intentionally chose input data large enough that the LLM
| would be scoring in the region of 50% accuracy in order to
| maximise the discriminative power of the test.
| fancyfredbot wrote:
| This is an interesting theoretical exercise but please for the
| love of god don't actually use an LLM to search tabular data.
| This is a solved problem. Free software does this with 100%
| accuracy and insane efficiency.
| ModernMech wrote:
| This is a really eye-popping example. Because here we have
| input text that is fully structured and perfectly unambiguous (it
| was carefully designed that way!), and yet the LLM can't get all
| the information out of it. Yet people are using these tools to
| summarize unstructured text, assuming the summary will capture
| the most salient points. Well how is the LLM supposed to be
| good for that task, if it can't even summarize the dang XML
| document? They keep telling me this thing is more expert than
| all the experts combined.
| ComputerGuru wrote:
| Inputs were not long enough to properly show either of the true
| wins for terser formats: reduced token counts, and the accuracy
| benefit of not stuffing the context window. The test really needs
| to be conducted across multiple dimensions!
| dctoedt wrote:
| KSON? (I'm a complete ignoramus in this area but recently read
| about KSON in a piece posted here at HN.)
|
| https://ochagavia.nl/blog/configuration-files-are-user-inter...
|
| https://news.ycombinator.com/item?id=45291858 (135 comments)
| freehorse wrote:
| Tbh I am more interested in processing data and _formatting_ it
| to tabular forms than _extracting_ data from tabular forms. One
| of the main uses I see in LLMs is structuring unstructured
| /semistructured data. I may occasionally feed a table to an LLM
| and ask such kinds of questions when I feel lazy, but I see no
| serious application of this as compared with using whatever
| language/library to process the data from the table (whether
| using an LLM or not in the whole process). The point of having
| structured data is exactly this. But much more often I feed data
| to an LLM and ask it to create a table.
| veryrealsid wrote:
| I'm surprised by the accuracy. In practice, I feel like I
| generally get a lot better results.
| coeneedell wrote:
| Do you measure your results in a repeatable way? In a way where
| your hypotheses about accuracy are falsifiable? Or do they just
| "feel" right?
| mattcollins wrote:
| I'm the person who ran the test.
|
| The context I used in the test was pretty large. You'll see
| much better (near 100%) accuracy if you're using smaller
| amounts of context.
|
| [I chose the context size so that the LLM would be scoring in
| the ballpark of 50% accuracy (with variation between formats)
| to maximise the discriminative power of the test.]
| mingtianzhang wrote:
| The current OCR approach typically relies on a Vision-Language
| Model (VLM) to convert a table into a JSON structure. However, a
| table inherently has a 2D spatial structure, while Large Language
| Models (LLMs) are optimized for processing 1D sequential text.
| This creates a fundamental mismatch between the data
| representation and the model's input format.
|
| Most existing pipelines address this by preprocessing the table
| into a linearized 1D string before passing it to the LLM -- a
| question-agnostic step that may lose structural information.
|
| Instead, one could retain the original table form and, when a
| question is asked, feed both the question and the original table
| (as an image) directly into the VLM. This approach allows the
| model to reason over the data in its native 2D domain, providing
| a more natural and potentially more accurate solution.
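| A rough sketch of that pipeline (the model name, file name, and
| question are illustrative, not from the parent's own setup):
|
| ```
| import base64
| from openai import OpenAI
|
| client = OpenAI()
| with open("table.png", "rb") as f:
|     b64 = base64.b64encode(f.read()).decode()
|
| # Send the question plus the raw table image to a vision model,
| # skipping any lossy table-to-text linearization step
| resp = client.chat.completions.create(
|     model="gpt-4o-mini",
|     messages=[{"role": "user", "content": [
|         {"type": "text", "text": "What are KFC's system sales?"},
|         {"type": "image_url",
|          "image_url": {"url": f"data:image/png;base64,{b64}"}},
|     ]}],
| )
| print(resp.choices[0].message.content)
| ```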
| fragmede wrote:
| Yeah, I wonder how PNG would fare in this contest.
| dcre wrote:
| Only testing GPT-4.1-nano makes this basically useless. Most
| people are almost certainly using GPT-5 mini or better. This very
| poor analysis is like an LLM literacy test for readers.
| grey-area wrote:
| Please go away and do the work for us and let us know what
| amazing accuracy you got with whatever version you think is
| better.
|
| Anything below 100% is actually pretty useless when it comes to
| stats.
| simonw wrote:
| If you want 100% accuracy from these kinds of tasks with LLMs
| you can get it today, but you need to provide the LLM with
| the ability to run Python code and tell it to use something
| like Pandas.
|
| You can confirm it's doing the right thing by reviewing the
| code it wrote.
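| For example, the kind of code you'd expect it to write (a sketch;
| the file name is assumed, and the column names follow the
| article's sample records):
|
| ```
| import pandas as pd
|
| # Deterministic lookup instead of in-context reading
| df = pd.read_csv("employees.csv")
| answer = df.loc[df["name"] == "Charlie A0", "years_experience"].iloc[0]
| print(answer)
| ```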
| dcre wrote:
| Simon is right about using code execution, but many tables
| one might look at outside of formal data work are small
| enough that LLMs handle them very reliably, so this format
| question is practically relevant. I wish they had tested
| better models.
| grey-area wrote:
| They don't understand any table formats, as shown by these
| results.
|
| They can transform information in tables but information is lost
| due to that lack of understanding.
| xnx wrote:
| Great idea. Very limited execution. If they release the source
| data and question set, I'll repeat with more LLMs to flesh out
| the findings.
| Ciantic wrote:
| This is a bit of a silly way to use LLMs to process tabular data.
| In reality, you'd ask it to write functions and execute them.
| First you'd ask it to create a type definition from the table,
| then ask it to create functions to process the data.
|
| "Write a function to find years of experience by name. Return
| just the number, e.g. '12'."
|
| It works much better, and it can single-shot many of the
| processing requirements just from the type definitions it can
| infer from the data.
|
| This way it's easier to stick to tabular formats that have
| easy-to-use reading libraries: JSON with TypeScript/JavaScript,
| and maybe CSV with Python...
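| For instance, the kind of single-shot function you'd ask for (a
| sketch in Python; the file name is assumed and the field names
| mirror the article's sample records):
|
| ```
| import csv
|
| def years_of_experience(path: str, name: str) -> int:
|     # Look up years_experience for the given name
|     with open(path, newline="") as f:
|         for row in csv.DictReader(f):
|             if row["name"] == name:
|                 return int(row["years_experience"])
|     raise KeyError(name)
|
| print(years_of_experience("employees.csv", "Charlie A0"))  # e.g. 7
| ```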
| sails wrote:
| I'd be interested in testing different data formats when using
| the structured outputs api
| skyfantom wrote:
| Super surprised; I would have expected CSV to beat all the
| others. And Markdown KV is something I'm hearing about for the
| first time.
| Bolwin wrote:
| It's made up, not a standard format
| SweetSoftPillow wrote:
| Misleading title, just one LLM was tested.
| jcheng wrote:
| I was curious enough to have Codex create a similar benchmark:
| https://github.com/jcheng5/table-formats
|
| With 1000 rows and 100 samples and markdown-kv, I got these
| scores:
|
| - gpt-4.1-nano: 52%
|
| - gpt-4.1-mini: 72%
|
| - gpt-4.1: 93%
|
| - gpt-5: 100%
|
| I was so surprised by gpt-5 getting 100% that I ran it again with
| 1000 samples. It got 999 correct, and one wrong.
|
| To reproduce it yourself, clone the repo, add a .env file with
| OPENAI_API_KEY, `uv sync`, and then run:
|
|     uv run inspect eval \
|       evals/table_formats_eval.py@table_formats_markdown_kv \
|       --model openai/gpt-5 --limit 100
|
| Update: Also, number of rows makes a massive difference,
| unsurprisingly; at 100 rows, gpt-4.1-nano scores 95%+ for both
| markdown-kv and csv. Both model and record count seem to matter a
| lot more than format.
| jcheng wrote:
| gpt-5 also got 100/100 for both CSV and JSON.
|
|     uv run inspect eval \
|       evals/table_formats_eval.py@table_formats_csv \
|       --model openai/gpt-5 --limit 100
|     uv run inspect eval \
|       evals/table_formats_eval.py@table_formats_json \
|       --model openai/gpt-5 --limit 100
| lowbloodsugar wrote:
| In mice.
|
| Or in this case, gpt-4.1-nano.
| olliem36 wrote:
| We ended up making middleware for LLM 'tools/functions' that take
| common data/table formats like CSV, Excel and JSON.
|
| The tool uses an LLM to write code to parse the data and conduct
| the analysis to return back to the LLM. Otherwise, we found
| pumping raw table data into an LLM is just not reliable, even if
| you go to the effort to conduct analysis on smaller chunks and
| merge the results.
| jimjimjim wrote:
| accuracy: 60%
|
| This should have been a python script.
|
| How much of the current peak of the Gartner Hype Cycle should
| just be python scripts?
| faxmeyourcode wrote:
| Curious how text-aligned tabular formats work for LLMs,
| considering humans probably find them more readable than other
| formats:
                                          Number of    System Sales(a)
                                              Units      (in Millions)
         --------------------------------------------------------------
         KFC Division                         31,981    $        34,452
         Taco Bell Division                    8,757             17,193
         Pizza Hut Division                   20,225             13,108
         Habit Burger & Grill Division           383                713
         YUM                                  61,346    $        65,466
|
| I'm seeing pretty good success with extracting data out of 10-Qs,
| which are formatted like this by default, using the `edgartools`
| library's default `filing.text()` method.
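| A rough sketch of pulling that text (only the `filing.text()`
| call is from the comment above; the Company/get_filings/latest
| names are my best guess at the edgartools API, and set_identity
| supplies the contact string SEC EDGAR requires):
|
| ```
| from edgar import Company, set_identity
|
| # EDGAR requires a contact string on requests
| set_identity("Your Name you@example.com")
|
| # Grab the latest 10-Q and dump its text-aligned tables
| filing = Company("YUM").get_filings(form="10-Q").latest()
| print(filing.text())
| ```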
| johnfn wrote:
| The article has interesting data. But it's frustrating to read AI
| generated text like this:
|
| > Performance Optimization: Reducing processing overhead while
| maintaining accuracy
|
| What on earth does it mean that this "optimized performance"?
| This is nonsensical content. Performance wasn't even measured,
| accuracy was. You can tell this was AI generated because
| "Reducing processing overhead while maintaining accuracy" would
| likely be true for a perf optimization, but it has no meaning
| whatsoever in the context of the article.
|
| This really throws into question whether I can take the rest of
| the article and data seriously.
___________________________________________________________________
(page generated 2025-10-05 23:00 UTC)