[HN Gopher] Which table format do LLMs understand best?
       ___________________________________________________________________
        
       Which table format do LLMs understand best?
        
       Author : oidar
       Score  : 136 points
       Date   : 2025-10-03 02:59 UTC (2 days ago)
        
 (HTM) web link (www.improvingagents.com)
 (TXT) w3m dump (www.improvingagents.com)
        
       | ggm wrote:
       | I find this extremely surprising. I would have expected dict
       | structures to have higher semantic context associated with them.
        
       | nightshift1 wrote:
        | I am not an expert on the subject, but you can also save context
        | space by using shorter XML element names (like f instead of
        | function, c instead of class, etc.). Just add a legend at the top
        | or bottom explaining what each abbreviation means; LLMs can figure
        | out the mapping without issues. I use this approach when
        | generating project structure maps with Tree-sitter. I did a quick
        | comparison and didn't notice much degradation with Claude, so the
        | context space you save may make it worthwhile. I would be
        | interested to see a proper comparison.
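        | 
        | A minimal sketch of the idea in Python (the short tag names, the
        | `n` attribute, and the legend wording are illustrative choices,
        | not a fixed convention):
        | 
        |     import xml.etree.ElementTree as ET
        | 
        |     LEGEND = {"f": "function", "c": "class", "m": "method"}
        | 
        |     def project_map(symbols):
        |         # symbols: list of (kind, name) pairs, e.g. ("function", "load_csv")
        |         short = {v: k for k, v in LEGEND.items()}
        |         root = ET.Element("project")
        |         for kind, name in symbols:
        |             ET.SubElement(root, short.get(kind, kind), n=name)
        |         legend = "; ".join(f"{k} = {v}" for k, v in LEGEND.items())
        |         return f"<!-- legend: {legend} -->\n" + ET.tostring(root, encoding="unicode")
        | 
        |     print(project_map([("function", "load_csv"), ("class", "TableReader")]))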
        
         | 1aurent29 wrote:
         | Common enough words like `function` and `class` are generally
         | encoded as a single token by the tokenizer and may provide a
          | slightly better context to the LLM. For OpenAI you can test
         | this stuff at https://platform.openai.com/tokenizer
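          | 
          | You can also check locally (a sketch assuming the tiktoken
          | package; exact counts depend on the encoding):
          | 
          |     import tiktoken
          | 
          |     # o200k_base is the encoding used by recent OpenAI models.
          |     enc = tiktoken.get_encoding("o200k_base")
          |     for word in ["function", "f", "class", "c"]:
          |         print(word, "->", len(enc.encode(word)), "token(s)")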
        
         | Yiin wrote:
          | If both f and function use 1 token, are you really saving
         | anything?
        
       | reidgreer wrote:
        | Interesting. I'm curious how this compares across different model
       | families.
        
       | Sharlin wrote:
       | > where accuracy is paramount
       | 
       | > accuracy: 60%
       | 
       | Not to mention that the least poorly performing format is
       | probably the stupidest way to encode tabular data, beating even
       | XML. But I guess that's the new normal because we're trying to
       | shoehorn conversational AI models to every use case rather than,
       | say, training finetunes that are better at particular tasks.
       | (Yes, of course you can't train finetunes when the model is a
       | proprietary black box on someone else's computer.) Something
       | about hammers and nails...
        
         | mritchie712 wrote:
          | They used GPT-4.1 nano; results would be quite different with
          | Sonnet or GPT-5.
        
           | lyu07282 wrote:
           | Or just regular gpt-4.1, it's a quite capable model.
        
           | fnordpiglet wrote:
           | I was looking for the frontier curve where they tested their
           | benchmark across different models since this sort of behavior
           | is highly parameter, architecture, training, and fine tuning
           | sensitive. It's a practically useful question so I was really
           | disappointed when a) they didn't publish their code so you
           | could test yourself, b) they didn't do even a cursory
           | examination of other models and sizes.
        
         | gpt5 wrote:
          | Aren't the best performing (markdown tables) and the worst
          | (pipe-delimited tables) basically the same format?
        
           | simonw wrote:
            | The best performing isn't markdown tables, it's markdown
            | key/value pairs:
            | 
            |     ## Record 1
            |     ```
            |     id: 1
            |     name: Charlie A0
            |     age: 56
            |     city: New York
            |     department: Operations
            |     salary: 67896
            |     years_experience: 7
            |     project_count: 1
            |     ```
           | 
           | Which makes sense to me because the problem with formats like
           | CSV and regular markdown tables is that it is too easy for
           | the model to mistakenly associate a value in a row with the
           | wrong header.
           | 
           | Explicit key/value formats like this or YAML or JSON objects
           | make that a lot less likely.
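            | 
            | Generating that layout from a CSV is straightforward (a
            | sketch; the "Record N" heading and the fenced block just
            | mirror the example above):
            | 
            |     import csv, io
            | 
            |     def to_markdown_kv(csv_text):
            |         records = []
            |         for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), 1):
            |             body = "\n".join(f"{k}: {v}" for k, v in row.items())
            |             records.append(f"## Record {i}\n```\n{body}\n```")
            |         return "\n\n".join(records)
            | 
            |     print(to_markdown_kv("id,name,age\n1,Charlie A0,56\n2,Dana B1,41\n"))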
        
             | cwmoore wrote:
             | I was surprised that XML (56%), with closing tags, wasn't
              | as good as YAML/KV (60%), though line breaks perform the
             | same kind of grouping function.
             | 
             | Then I realized from the table that XML used about 50% more
             | tokens (~75K vs ~50K) for similar accuracy, and for the
             | first time felt a kind of sympathy for the LLM...
        
         | mattcollins wrote:
         | I'm the person who ran the test.
         | 
         | To explain the 60% a bit more...
         | 
         | With small amounts of input data, the accuracy is near 100%. As
         | you increase the size of the input data, the accuracy gradually
         | decreases.
         | 
         | For this test, I intentionally chose an input data set large
         | enough that the LLM would score in the region of 50% accuracy
         | (with variation between formats) in order to maximise the
         | discriminative power of the test.
        
           | ysleepy wrote:
           | Wouldn't it be more useful to measure the number of rows the
           | model can process while still hitting 100% accuracy?
        
       | rcarmo wrote:
       | Hmmm. I've been using YAML data for tables for a while now, and
        | have had pretty good results.
        
       | cjonas wrote:
       | The test really needed to be run on multiple data sizes (50, 100,
       | 500, 1000, 5000). The more token efficient formats would probably
       | eventually overtake the token heavy ones due to context
        | pollution. All this test really tells us is what performs best for
        | one particular model at one particular context length.
        
       | lmeyerov wrote:
       | That's a cool concept - would be curious about a more common
       | setup for agentic data analysis (ex: for using in Claude Code)
       | like:
       | 
       | * Multiple tasks vs 1
       | 
       | * O3/o3-mini + 4o/4o-mini instead of nano
       | 
       | * Extra credit: Inside a fixed cost/length reasoning loop
       | 
       | Ex: does the md-kv benefit disappear with smarter models that
        | you'd typically use, and thus just become a 2-3x cost?
        
       | brap wrote:
       | I wonder how this compares to a more agentic approach where the
       | LLM composes SQL queries to answer the questions, for example.
        
         | efitz wrote:
         | This was exactly my thought. Rather than feed the table
         | directly to the LLM, build agents that extract the data and
         | have the LLM act on the extracted data items. Then it's a
         | preference issue.
         | 
          | The author didn't see much more than 60% accuracy, which is not
          | very useful for many (most?) real-world tasks.
        
           | coeneedell wrote:
           | "Agents that extract the data" Are we really reinventing data
           | frame readers to have an LLM in the critical path?
        
             | efitz wrote:
             | Reinventing? No. Using? Yes, for a lot of good reasons.
             | 
              | LLMs are expensive. Spending tokens to do something in bulk
              | that is well suited to existing tools and algorithms is
              | wasteful and slow. And the original author reported only a
              | 60% success rate when using an LLM directly for the task.
              | Why spend many times more time, money, and energy on a
              | well-understood preparatory task that the LLM sucks at, when
              | you can get much better results more cheaply with
              | off-the-shelf tools and feed their output to the LLM for its
              | unique value?
        
         | jitl wrote:
         | Yeah I mean for many real world scale datasets you don't want
         | to blow the whole context window on a massive markdown file.
         | Instead you can provide a tool that presents the data as a
            | SQLite database. In my testing, Claude Code seems very capable
         | of answering questions via SQLite queries or even `head` and
         | `grep` on CSV files.
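          | 
          | A minimal sketch of that setup (assuming a local data.csv;
          | pandas' to_sql does the loading):
          | 
          |     import sqlite3
          |     import pandas as pd
          | 
          |     # Load the CSV into a throwaway SQLite database the model can query.
          |     conn = sqlite3.connect("table.db")
          |     pd.read_csv("data.csv").to_sql("records", conn,
          |                                    if_exists="replace", index=False)
          | 
          |     # The LLM emits SQL instead of reading every row from context, e.g.:
          |     query = "SELECT years_experience FROM records WHERE name = 'Charlie A0'"
          |     print(conn.execute(query).fetchall())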
        
           | bwestergard wrote:
           | But the result from the SQL query is going to be... a table.
           | So at some point, tables need to go into context, and we need
           | to know how well LLMs can incorporate those tables.
        
         | thom wrote:
         | Well, ironically you then have the issue of how to present your
         | database schema (including important things like the values in
         | some categorical fields) to the LLM and in what format, so you
         | never really escape this issue.
        
       | xnx wrote:
       | Title says "LLMs" (plural) but they only tested one
       | 
       | > We only tested OpenAI's GPT-4.1 nano.
        
         | picardo wrote:
         | This should be higher. While the research question is
         | interesting, the sample size makes the conclusion highly
         | suspect. I'd like to see more research on this.
        
         | cwyers wrote:
         | And not even a commonly used one. Gemini Flash or o4-mini would
          | have been a much better choice if they wanted a cheap model.
        
       | secwang wrote:
        | Maybe try an Org-mode table?
        
       | sega_sai wrote:
        | Bizarre conclusions when, on average, all the formats perform
        | poorly, with an average accuracy of 50%. Sure, 60% is better than
        | 40%, but they are both unusable if you actually care about
        | numbers...
        
         | zeitgeistcowboy wrote:
         | My sentiments exactly. All the formats were so poorly read that
         | they are all effectively useless.
        
         | zaidf wrote:
         | I've been stunned by how many smart people talk so casually
         | about LLMs becoming better at math. Do they just forget that a
         | calculator that is wrong 1% of the time is a de facto
         | calculator that doesn't work and should not be used?
        
           | xnx wrote:
           | > I've been stunned by how many smart people talk so casually
           | about LLMs becoming better at math
           | 
           | Could they be referring to this?
           | 
           | "Advanced version of Gemini with Deep Think officially
           | achieves gold-medal standard at the International
           | Mathematical Olympiad"
           | https://deepmind.google/discover/blog/advanced-version-of-
           | ge...
        
           | westoncb wrote:
           | Doing math is not the same as calculating. LLMs can be very
           | useful in doing math; for calculating they are the wrong tool
           | (and even there they can be very useful, but you ask them to
           | use calculating tools, not to do the calculations themselves
           | --both Claude and ChatGPT are set up to do this).
           | 
           | If you're curious, check out how mathematicians like Robert
           | Ghrist or Terence Tao are using LLMs for math research, both
           | have written about it online repeatedly (along with an
           | increasing number of other researchers).
           | 
           | Apart from assisting with research, their ability on e.g.
           | math olympiad problems is periodically measured and
           | objectively rapidly improving, so this isn't just a matter of
           | opinion.
        
           | magicalhippo wrote:
           | The best math lecturers I had at university sucked at mental
           | calculations. Some almost screwed up 2+2 on the blackboard.
           | 
            | Yes, LLMs suck at calculating stuff. However, they can
           | manipulate equations and such, and sometimes impressively so.
        
           | crazygringo wrote:
           | You realize that when typing into a calculator, you probably
           | hit a wrong key more than 1% of the time? Which is why you
           | always type important calculations twice?
           | 
           | I've been stunned by how many smart people talk so casually
           | about how because LLMs aren't perfect, they therefore have no
           | value. Do they just forget that nothing in the world is
           | perfect, and the values of things are measured in degrees?
        
             | BolexNOLA wrote:
             | There's a big difference between mistyping 1% of the time
             | yourself (human error) and a calculator failing 1% of the
             | time (machine error) and I am willing to bet there isn't a
             | company out there (maybe a handful of less scrupulous ones)
             | that has knowingly shipped a calculator that got it wrong
             | 1% of the time. Especially in previous decades when
             | countless people were using a dedicated calculator dozens
             | of times a day. Hard to imagine a 1% margin of error was
             | acceptable.
             | 
             | Not to mention now you have the compounded problem of your
             | mistakes plus the calculator's mistakes.
        
         | mattcollins wrote:
         | I'm the person who ran the test.
         | 
         | To hopefully clarify a bit...
         | 
         | I intentionally chose input data large enough that the LLM
         | would be scoring in the region of 50% accuracy in order to
         | maximise the discriminative power of the test.
        
       | fancyfredbot wrote:
       | This is an interesting theoretical exercise but please for the
       | love of god don't actually use an LLM to search tabular data.
       | This is a solved problem. Free software does this with 100%
       | accuracy and insane efficiency.
        
         | ModernMech wrote:
         | This is a really eye-popping example. Because here we have
          | input text that is fully structured and perfectly unambiguous
          | (it was carefully designed that way!) and yet the LLM can't get
          | all the information out of it. Yet people are using these tools to
         | summarize unstructured text, assuming the summary will capture
         | the most salient points. Well how is the LLM supposed to be
         | good for that task, if it can't even summarize the dang XML
         | document? They keep telling me this thing is more expert than
         | all the experts combined.
        
       | ComputerGuru wrote:
        | Inputs were not long enough to properly show either of the true
        | wins of terser formats: reduced token counts, and less risk of
        | stuffing the context window and thereby reducing accuracy. The
        | test really needs to be conducted across multiple dimensions!
        
       | dctoedt wrote:
       | KSON? (I'm a complete ignoramus in this area but recently read
       | about KSON in a piece posted here at HN.)
       | 
       | https://ochagavia.nl/blog/configuration-files-are-user-inter...
       | 
       | https://news.ycombinator.com/item?id=45291858 (135 comments)
        
       | freehorse wrote:
        | Tbh I am more interested in processing data and _formatting_ it
        | into tabular forms than _extracting_ data from tabular forms. One
        | of the main uses I see for LLMs is structuring unstructured/
        | semistructured data. I may occasionally feed a table to an LLM
       | and ask such kinds of questions when I feel lazy, but I see no
       | serious application of this as compared with using whatever
       | language/library to process the data from the table (whether
       | using an llm or not in the whole process). The point of having
       | structured data is exactly this. But much more often I feed data
       | to an llm and ask it to create a table.
        
       | veryrealsid wrote:
        | I'm surprised by the accuracy; in practice, I feel like I
        | generally get a lot better results.
        
         | coeneedell wrote:
         | Do you measure your results in a repeatable way? In a way where
         | your hypotheses about accuracy are falsifiable? Or do they just
         | "feel" right?
        
         | mattcollins wrote:
         | I'm the person who ran the test.
         | 
         | The context I used in the test was pretty large. You'll see
         | much better (near 100%) accuracy if you're using smaller
         | amounts of context.
         | 
         | [I chose the context size so that the LLM would be scoring in
         | the ballpark of 50% accuracy (with variation between formats)
         | to maximise the discriminative power of the test.]
        
       | mingtianzhang wrote:
       | The current OCR approach typically relies on a Vision-Language
       | Model (VLM) to convert a table into a JSON structure. However, a
       | table inherently has a 2D spatial structure, while Large Language
       | Models (LLMs) are optimized for processing 1D sequential text.
       | This creates a fundamental mismatch between the data
       | representation and the model's input format.
       | 
       | Most existing pipelines address this by preprocessing the table
       | into a linearized 1D string before passing it to the LLM -- a
       | question-agnostic step that may lose structural information.
       | 
       | Instead, one could retain the original table form and, when a
       | question is asked, feed both the question and the original table
       | (as an image) directly into the VLM. This approach allows the
       | model to reason over the data in its native 2D domain, providing
       | a more natural and potentially more accurate solution.
        
         | fragmede wrote:
         | Yeah, I wonder how PNG would fare in this contest.
        
       | dcre wrote:
       | Only testing GPT-4.1-nano makes this basically useless. Most
       | people are almost certainly using GPT-5 mini or better. This very
       | poor analysis is like an LLM literacy test for readers.
        
         | grey-area wrote:
         | Please go away and do the work for us and let us know what
          | amazing accuracy you got with whatever version you think is
         | better.
         | 
         | Anything below 100% is actually pretty useless when it comes to
         | stats.
        
           | simonw wrote:
           | If you want 100% accuracy from these kinds of tasks with LLMs
           | you can get it today, but you need to provide the LLM with
           | the ability to run Python code and tell it to use something
           | like Pandas.
           | 
           | You can confirm it's doing the right thing by reviewing the
           | code it wrote.
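            | 
            | The code it writes is short enough to audit at a glance. A
            | sketch, assuming the benchmark rows live in employees.csv:
            | 
            |     import pandas as pd
            | 
            |     df = pd.read_csv("employees.csv")
            |     # Deterministic lookup, no context-window limits involved.
            |     answer = df.loc[df["name"] == "Charlie A0", "years_experience"].item()
            |     print(answer)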
        
           | dcre wrote:
           | Simon is right about using code execution, but many tables
           | one might look at outside of formal data work are small
            | enough for LLMs to handle very reliably, so this format
           | question is practically relevant. I wish they had tested
           | better models.
        
       | grey-area wrote:
        | They don't understand any table formats, as shown by these
       | results.
       | 
       | They can transform information in tables but information is lost
       | due to that lack of understanding.
        
       | xnx wrote:
       | Great idea. Very limited execution. If they release the source
       | data and question set, I'll repeat with more LLMs to flesh out
       | the findings.
        
       | Ciantic wrote:
        | This is a bit of a silly way to use LLMs to process tabular data.
        | In reality, you'd ask it to write functions and execute them. First
       | you'd ask it to create a type definition from the table, then ask
       | it to create functions to process the data.
       | 
       | "Write a function to find years of experience by name? Return
       | just the number, e.g. '12'."
       | 
       | It works much better, and it can single-shot many of the
       | processing requirements just from type definitions it can infer
       | from the data.
       | 
       | This way it's easier to stick to tabular formats that have easy
       | reading libraries, like with TypeScript/JavaScript JSON, and with
       | Python, maybe CSV...
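        | 
        | In Python terms the single-shot output might look roughly like
        | this (a sketch; the field names are inferred from the sample
        | table in the post):
        | 
        |     import csv
        |     from dataclasses import dataclass
        | 
        |     # Step 1: a type definition inferred from the table header.
        |     @dataclass
        |     class Employee:
        |         id: int
        |         name: str
        |         years_experience: int
        | 
        |     # Step 2: small functions written against that type, instead of
        |     # answering from raw rows stuffed into the context window.
        |     def load(path):
        |         with open(path, newline="") as f:
        |             return [Employee(int(r["id"]), r["name"],
        |                              int(r["years_experience"]))
        |                     for r in csv.DictReader(f)]
        | 
        |     def years_of_experience(rows, name):
        |         return next(r.years_experience for r in rows if r.name == name)
        | 
        |     print(years_of_experience(load("employees.csv"), "Charlie A0"))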
        
       | sails wrote:
       | I'd be interested in testing different data formats when using
        | the structured outputs API.
        
       | skyfantom wrote:
        | Super surprised; I would have expected CSV to beat all the others.
        | And Markdown KV is something I'm hearing about for the first time.
        
         | Bolwin wrote:
         | It's made up, not a standard format
        
       | SweetSoftPillow wrote:
       | Misleading title, just one LLM was tested.
        
       | jcheng wrote:
       | I was curious enough to have Codex create a similar benchmark:
       | https://github.com/jcheng5/table-formats
       | 
       | With 1000 rows and 100 samples and markdown-kv, I got these
       | scores:
       | 
       | - gpt-4.1-nano: 52%
       | 
       | - gpt-4.1-mini: 72%
       | 
       | - gpt-4.1: 93%
       | 
       | - gpt-5: 100%
       | 
       | I was so surprised by gpt-5 getting 100% that I ran it again with
       | 1000 samples. It got 999 correct, and one wrong.
       | 
       | To reproduce it yourself, clone the repo, add a .env file with
        | OPENAI_API_KEY, `uv sync`, and then run:
        | 
        |     uv run inspect eval \
        |       evals/table_formats_eval.py@table_formats_markdown_kv \
        |       --model openai/gpt-5 --limit 100
       | 
       | Update: Also, number of rows makes a massive difference,
       | unsurprisingly; at 100 rows, gpt-4.1-nano scores 95%+ for both
       | markdown-kv and csv. Both model and record count seem to matter a
       | lot more than format.
        
         | jcheng wrote:
          | gpt-5 also got 100/100 for both CSV and JSON.
          | 
          |     uv run inspect eval \
          |       evals/table_formats_eval.py@table_formats_csv \
          |       --model openai/gpt-5 --limit 100
          | 
          |     uv run inspect eval \
          |       evals/table_formats_eval.py@table_formats_json \
          |       --model openai/gpt-5 --limit 100
        
       | lowbloodsugar wrote:
       | In mice.
       | 
       | Or in this case gpt-4.1-nano
        
       | olliem36 wrote:
       | We ended up making middleware for LLM 'tools/functions' that take
       | common data/table formats like CSV, Excel and JSON.
       | 
       | The tool uses an LLM to write code to parse the data and conduct
       | the analysis to return back to the LLM. Otherwise, we found
        | that pumping raw table data into an LLM is just not reliable, even
        | if you go to the effort of conducting the analysis on smaller
        | chunks and merging the results.
        
       | jimjimjim wrote:
       | accuracy: 60%
       | 
       | This should have been a python script.
       | 
       | How much of the current peak of the Gartner Hype Cycle should
       | just be python scripts?
        
       | faxmeyourcode wrote:
        | Curious how text-aligned tabular formats work for LLMs,
        | considering humans probably find them more readable than other
        | formats:
        | 
        |                                              System Sales(a)
        |                              Number of Units   (in Millions)
        |     --------------------------------------------------------
        |     KFC Division                      31,981    $     34,452
        |     Taco Bell Division                 8,757          17,193
        |     Pizza Hut Division                20,225          13,108
        |     Habit Burger & Grill Division        383             713
        |     YUM                               61,346    $     65,466
       | 
       | I'm seeing pretty good success with extracting data out of 10-Qs
        | which are formatted like this by default by the `edgartools`
        | library's `filing.text()` method.
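        | 
        | (For what it's worth, producing that kind of aligned layout from
        | a DataFrame is one call with pandas' to_string; the numbers below
        | just echo the example above.)
        | 
        |     import pandas as pd
        | 
        |     df = pd.DataFrame({
        |         "Division": ["KFC", "Taco Bell", "Pizza Hut",
        |                      "Habit Burger & Grill", "YUM"],
        |         "Number of Units": [31981, 8757, 20225, 383, 61346],
        |         "System Sales ($M)": [34452, 17193, 13108, 713, 65466],
        |     })
        |     # to_string() pads columns with spaces, giving the alignment above.
        |     print(df.to_string(index=False))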
        
       | johnfn wrote:
       | The article has interesting data. But it's frustrating to read AI
       | generated text like this:
       | 
       | > Performance Optimization: Reducing processing overhead while
       | maintaining accuracy
       | 
       | What on earth does it mean that this "optimized performance"?
       | This is nonsensical content. Performance wasn't even measured,
       | accuracy was. You can tell this was AI generated because "
       | Reducing processing overhead while maintaining accuracy" would
       | likely be true for a perf optimization, but it has no meaning
       | whatsoever in the context of the article.
       | 
       | This really throws into question whether I can take the rest of
       | the article and data seriously.
        
       ___________________________________________________________________
       (page generated 2025-10-05 23:00 UTC)