[HN Gopher] How well can LLMs write COBOL?
___________________________________________________________________
How well can LLMs write COBOL?
Author : ggordonhall
Score : 70 points
Date : 2024-03-30 11:22 UTC (1 day ago)
(HTM) web link (bloop.ai)
(TXT) w3m dump (bloop.ai)
| pmg101 wrote:
| I tried to get ChatGPT to write 6502 assembler for the 1980s
| 8-bit home computer, the BBC Micro. It was game, but clueless.
| fmxexpress wrote:
| Try feeding it 8 pages of examples first? Something like this
| https://atariwiki.org/wiki/Wiki.jsp?page=Advanced%206502%20A...
| TillE wrote:
| That's a little surprising; 6502 assembly is a fairly popular
| hobbyist thing, so I would expect the data is out there. It's
| also _mostly_ pretty simple, but you do have to watch out for
| quirks like rotating through the carry bit.
| _the_inflator wrote:
| It depends on what you test for.
|
| I am from the C64 demo scene and in this regard ChatGPT is
| pretty useless. VIC tricks, raster timing - nothing an LLM can
| help with at the moment judging from my experience with ChatGPT
| 4.0 so far.
|
| Same goes for Amiga and simple blitter access for scrolling.
|
| LLMs will be very limited here unless they receive sensory
| feedback repeatedly.
|
| Pure algorithms like sorting for example may be doable by GPT,
| but the mentioned machines are very creatively used to come up
| with effects.
|
| What gets reused here are techniques, not so much the code
| itself, which gets modified and optimized for every demo and
| often recombined with other techniques.
|
| Most techniques for the VIC are pretty well documented, but the
| timing as well as recombining them is the heavy lifting.
| pcwalton wrote:
| I tried to get it to write 6502 assembler for the NES. As I
| recall it mistakenly thought that there was a BIOS containing a
| print function.
| Terretta wrote:
| It can write 6502, but you have to know 6502 yourself to coach
| it.
| dwheeler wrote:
| I suspect there were relatively few training examples for COBOL.
| It would be interesting to see the results for a system which had
| a significant number of such examples in the training set.
| nradov wrote:
| There is probably a business opportunity for an AI company to
| build private LLMs for large enterprises trained on their own
| COBOL code bases. They won't find much available as open
| source, and individual companies tend to have significantly
| different coding styles tied to COBOL versions and database
| schemas.
| IshKebab wrote:
| Yeah I think Facebook and Google are already doing that
| internally.
| pjmlp wrote:
| Indeed, even the author wasn't aware of modern COBOL, with IDE
| tooling and OOP capabilities, focusing on the classical micros
| instead.
| giantrobot wrote:
| The issue is less COBOL the language and more the literal
| business logic the COBOL is encoding. You can learn the COBOL
| language as easily as any other. What you can't learn as
| easily are the accounting rules, administration policies, and
| regulations any major COBOL codebase is implementing.
|
| You'll be able to see the code multiplying a dollar value by
| 0.03 but not necessarily know that this is because of some
| statutory requirement of some Minnesota tax code that only
| applies to industries producing both left and right handed
| monkey wrenches but only if the company was incorporated
| before 1975. That obscure law isn't referenced in any
| documentation but was found by an accountant in 1982. The
| change was made to the code but only referenced in a paper
| memo with a small distribution list but all of those memos
| were shredded after being archived for ten years.
|
| ChatGPT can't really help document code that's decades old
| and doesn't have any references to the _why_ of the code. The
| _how_ is straightforward but rarely as important as the
| _why_.
| Solvency wrote:
| Not knowing a statutory law expressed in code is not just a
| COBOL thing.
| ripvanwinkle wrote:
| It would be interesting to feed it a formal language
| specification of some language it hasn't seen, and then ask it
| to write code and see how it does.
|
| That could be a test of reasoning and reading comprehension.
| ape4 wrote:
| Reasoning vs being a completion engine (I could make a guess at
| how well that would work)
| CuriouslyC wrote:
| Reasoning is a form of (logical) completion; the problem is
| that LLMs aren't language-agnostic in their learned semantic
| reasoning.
| CuriouslyC wrote:
| I've been thinking about a benchmark designed this way for a
| while. It doesn't even need to be code, particularly, it could
| be basic reasoning problems. The key is that you define a new,
| random language that has never before been seen (maybe it has
| statistical similarity to existing languages, maybe not),
| create a translation key, then ask a question in that language.
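| A minimal sketch of how such a key might be generated (the
| function names and the syllable scheme here are illustrative,
| not from any existing benchmark):

```python
import random

def make_translation_key(vocabulary, seed=0):
    """Map each known word to a freshly invented nonsense token.

    Tokens are random consonant-vowel syllables, so the resulting
    "language" is statistically word-like but has never appeared
    in any training corpus.
    """
    rng = random.Random(seed)
    consonants, vowels = "bdfgklmnprstvz", "aeiou"

    def invent_word():
        return "".join(rng.choice(consonants) + rng.choice(vowels)
                       for _ in range(rng.randint(2, 4)))

    return {word: invent_word() for word in vocabulary}

def translate(sentence, key):
    """Rewrite a sentence word-for-word using the translation key."""
    return " ".join(key.get(word, word) for word in sentence.lower().split())

# A toy reasoning problem, re-expressed in the invented language;
# the model would be given the key plus the translated question.
vocab = ["all", "cats", "are", "animals", "tom", "is", "a", "cat"]
key = make_translation_key(vocab, seed=42)
print(translate("all cats are animals tom is a cat", key))
```

| Handing the model the key plus the translated question then
| tests whether its reasoning survives a vocabulary it has never
| seen before.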
| allusernamesare wrote:
| On one hand, it'd be super cool to solve the talent shortage in
| the field; also, COBOL code isn't very pleasant to write.
|
| On the other, I'm not sure I'd want tools known for poor code
| quality and hallucinations to write these super critical
| systems.
|
| Guess there might be a copilotesque productivity booster for
| human developers, but I think these systems are some of the last
| places I'd want LLMs to contribute.
| backtoyoujim wrote:
| That would remove some portion of the human coders, too.
| mrbombastic wrote:
| I wonder if a better goal would be valid translation of the
| COBOL into X language. Obviously that has its own can of worms
| but it seems like our goal generally should be getting critical
| systems to more modern languages that can be maintained more
| easily.
| sanxiyn wrote:
| In 2009, 4 million lines of COBOL were migrated to Java using
| an automatic translator.
|
| https://www.infoq.com/news/2009/07/cobol-to-java/
| kamma4434 wrote:
| The problem is not translation - once it is ported to Java,
| who will maintain it? It has no specs, no tests, and it's a
| spaghetti mess. Is any weirdness a bug or a feature? Nobody
| knows. Least of all the body-rental remote devs you hired
| to maintain it.
| ikari_pl wrote:
| which is worse than the same situation, but in COBOL, how
| exactly?
| bongodongobob wrote:
| Why do people always add this "blah blah hallucinations and
| critical systems"? 1. People write bad and buggy code. 2. You
| act like we're just blindly throwing untested code at
| production systems from LLMs.
|
| It's just intellectually dishonest to talk this way.
|
| They will still be helpful but we obviously need to test before
| we add code into systems. It goes without saying.
| fragmede wrote:
| Look, we can't all just be realistic about a thing that's
| going to take our jobs, so we have to lean on tired old
| excuses instead. Instead of being reasonable, why don't you
| pick a team - for or against, and then fight about it on
| Internet forums because, well, what else are you going to do
| while the build compiles? Look at cat videos?
| bongodongobob wrote:
| I generate my own custom cat videos with AI now tyvm.
| giantrobot wrote:
| > It's just intellectually dishonest to talk this way.
|
| > They will still be helpful but we obviously need to test
| before we add code into systems. It goes without saying.
|
| It's not intellectually dishonest at all. It's an issue of
| conditioning. There's a class of developers that blindly copy
| and paste code from StackOverflow or the first hit on Google.
| They're the same class that will uncritically copy and paste
| ChatGPT answers.
|
| ChatGPT is worse than SO because it's adaptive. If someone
| pastes in a SO answer and it doesn't immediately work the
| developer has to at least engage with the code. ChatGPT can
| be asked to refine its hallucination until it
| parses/compiles.
|
| The class of developer blindly copying and pasting answers
| will not have the expertise to spot hallucinations or likely
| even fix the inevitable bugs they introduce. Additionally
| ChatGPT by its nature elides the source of its answers. At
| the very least a SO answer has _some_ provenance: not only
| the poster, but some social signaling through votes that the
| answer is legitimate.
|
| ChatGPT answers don't have any of that. It will also happily
| hallucinate references.
|
| Conditioning junior developers and learners to rely on and
| trust AI coding is setting them up to fail. It's also going
| to stunt their growth as developers because they'll never
| gain any domain knowledge. In the meantime they'll be
| unknowingly sabotaging products with legit looking but broken
| code.
| bongodongobob wrote:
| I should be worried that the very worst developers might
| paste bad code from ChatGPT and that's why it's dangerous?
| Looks an awful lot like mental gymnastics to me.
| sanxiyn wrote:
| Considering the MTOB (Machine Translation from One Book) result,
| where an LLM learns a new language in-context from a grammar
| book, I wonder how LLMs would fare given, say, the GnuCOBOL
| Programmer's Guide PDF, which is easily available. It would be
| an interesting addition to the benchmark.
|
| https://arxiv.org/abs/2309.16575
|
| https://gnucobol.sourceforge.io/guides.html
| pcwalton wrote:
| I tried to get ChatGPT to write LLVM IR last year. The results
| were interesting: the LLM wrote superficially correct-looking IR,
| but it ultimately failed to grasp the concept of SSA: it kept
| trying to reassign SSA registers, which by definition are
| assigned exactly once. ChatGPT can generalize across
| language syntax reasonably well, but it doesn't understand deeper
| differences in language semantics.
| wvenable wrote:
| I tried to get ChatGPT to write 6502 assembly and it had
| similar issues.
| saurik wrote:
| FWIW, I had similar issues trying to get both it and Claude
| to help me with x86 assembly: it kept thinking if it added
| another * or some more parentheses it could get some
| impossible assembly to work.
| bongodongobob wrote:
| Reminds me of adding *'s and &'s until my C++ code worked
| in college.
| pama wrote:
| Since the LLM sometimes generates invalid COBOL, a simple
| practical solution would be to use an API and allow it to test
| its code with GnuCOBOL, feed back the compiler output, and have
| it try again a couple of times. I wonder what the updated
| benchmarks would be in that setting.
| ptx wrote:
| The general approach seems to work anyway. I tried it out with
| ChatGPT 3.5 and an online Cobol compiler[0], manually feeding
| back the output, and it managed to produce a working program on
| the 10th attempt (that displays the first 10 Fibonacci
| numbers).
|
| Edit: Well, maybe. With the example from the article it wasn't
| as successful.
|
| [0] https://onecompiler.com/cobol/
| danenania wrote:
| This looks interesting. I'm working on an OpenAI-based tool for
| coding tasks that are too complex for ChatGPT -
| https://github.com/plandex-ai/plandex
|
| It's working quite well for me, but it definitely needs some time
| spent on benchmarking and ironing out edge cases.
|
| I'm especially curious how it will do on more "obscure"
| languages. Not that Cobol is obscure exactly--I suppose there's
| probably quite a bit of it in GPT-4's training considering how
| pervasive it is in some domains. In any case, I'll try out this
| benchmark and see how it goes.
| arittr wrote:
| This looks great! Can't wait to try it out today
| skissane wrote:
| > Not that Cobol is obscure exactly--I suppose there's probably
| quite a bit of it in GPT-4's training considering how pervasive
| it is in some domains
|
| There is a huge amount of COBOL code in existence - but, almost
| all of it is non-public code used to run business and
| governments. Very little of it is publicly source-available
| (whether open source or something more restrictive than that).
|
| Unless GPT-4's training data includes non-public code bases (I
| doubt it), it likely has rather little COBOL code in it.
| SonOfLilit wrote:
| I've been using GPT4 to help me navigate a mainframe and a
| COBOL codebase and it knows far more than what my googling
| abilities manage to fish up in forums. It's actually
| surprisingly good at surprisingly deep mainframe topics.
| skissane wrote:
| No doubt its training data contains a lot of IBM manuals,
| probably even some commercial books on relevant topics,
| maybe even the contents of some of the forums you mention -
| and all that could be enough to correctly answer your
| questions.
|
| However, for languages like Python, Java, C, C++,
| JavaScript, Go, etc, it also contains untold millions of
| lines of code slurped from places like GitHub. Whereas, I
| really doubt it contains anywhere remotely near as much COBOL
| code: if you look for COBOL code in public GitHub repos, you
| will find very little. The vast majority of COBOL code is
| in-house or vendor business software, and few seem to want to
| make that stuff public - and what COBOL code GitHub has is
| mostly toy exercises or ancient stuff, not examples of
| significant contemporary production code.
| The only way OpenAI is going to get a substantial quantity
| of that is if multiple private parties (such as banks) give
| them access to their COBOL code bases - not impossible, but
| absent some public info saying it has happened, it seems
| more likely it hasn't.
|
| I expect GPT-4 (or any LLM) is not going to perform as well
| on complicated programming tasks for COBOL compared to
| other languages. For more mainstream languages, it has
| millions of examples to help it do a better job, for COBOL
| it likely doesn't.
| jobigoud wrote:
| But it probably read all the books ever published on COBOL.
| GaggiX wrote:
| Someone should test the benchmark on Claude 3 models.
| andy99 wrote:
| I've asked chatGPT a fair number of Fortran questions. There are
| differences - Fortran is still in use in lots of places; there
| are forums and documentation sites (though presumably that's
| true for COBOL too). But compared to Python, there is way less
| info out there on how to do different things - for example,
| little Stack Overflow content.
|
| I'd say I've had mixed results: ChatGPT definitely knows the
| language and can give examples, but I've also run into a lot of
| frustrating things it wasn't able to resolve.
| gnatolf wrote:
| Fortran also has the disadvantage of numerous separate
| dialects/flavours that often can't be mixed. And the exact
| dialect is rarely mentioned in random code found somewhere.
|
| It got a lot better with Fortran 95 and newer, but in the old
| world of e.g. Lahey compilers and custom commands only available
| there, every LLM I've tried has consistently failed to stick to
| these intricacies. I can't even blame them: when asking humans
| questions about these topics, you'll get all sorts of answers
| that are equally close to a correct solution, but almost never
| precisely correct.
| treebeard901 wrote:
| One thing that makes OpenAI so valuable over time is how they can
| take all of this expert input from the rush to test out the new
| technology and use it to dramatically improve the next model.
| Think about it: just like we see here, experts in their field who
| know all of these edge cases, or other fundamental aspects of
| what makes their own companies or projects valuable, tend to hand
| that over in the process.
|
| It's kind of interesting how far this could be taken with all
| kinds of valuable information from people and companies,
| especially code and business logic. Everyone wants to say AI this
| and AI that to keep up with the times and they all keep dumping
| all of this valuable data in for free.
|
| Then consider the authentication method tied to this data
| collection: if you are using a company email, etc., it helps
| them weed out the garbage too.
|
| I guess we can't fight progress...
| CuriouslyC wrote:
| That might be true, but my feeling so far is that OpenAI
| doesn't want to do what they'd need to do to make any one
| product actually good, so they're going to keep bouncing from
| AI thing to thing, making foundation models that have a lot of
| wow factor but can't really deliver on their promise because
| they're too closed off to integrate into a workflow that might
| alleviate the issues.
|
| Other people will try to build on OpenAI stuff and find that
| it's not quite good enough, and OpenAI won't care to really make
| it good enough because it's a lot of work. It won't be until we
| get competitors that take less sexy model tech and take the time
| to make it REALLY GOOD at certain things that AI really makes
| good on its promise. I'm guessing that will be driven by people
| taking open source tools that are ~80% of the way there and
| really building a system and domain logic around it to make it
| excellent.
| bigEnotation wrote:
| I think you're forgetting about the use case where the LLM
| returns something partially correct to a discerning expert, who
| is still able to use the response, but does not bother with a
| message like "btw I had to do X to make your suggestions
| usable".
| aj7 wrote:
| A while back, I asked a question here, roughly, why hasn't
| someone written, say, a C to COBOL translator? Such a program
| might take a lot of work, but it seemed to me that with an
| impending dearth of COBOL programmers, there would be demand for
| such an app. I was informed that there were so many different
| COBOLs in use that the output of such a program would STILL have
| to be tended to by an experienced programmer in the output
| dialect desired. This is just the Copilot situation.
| Analemma_ wrote:
| As much as everyone likes to poke fun at COBOL, the language
| itself really isn't the problem with maintaining/updating old
| COBOL systems. It's old, but it's not _that_ bad.
|
| The real problem is the entire ecosystem _around_ those
| systems. Remember, a lot of COBOL software dates back to a time
| before things like relational databases. You'll be working
| with flat files that might, if you're very lucky, have column
| and record separators and useful names/documentation explaining
| what they are. If you're unlucky you'll have to figure out
| field widths from the code and infer what the fields are based
| on their actual usage. Oh and if you get it wrong you just
| messed up something related to payroll or financial compliance;
| enjoy the punishing fines.
|
| That kind of stuff, more than the language, is the reason
| nobody wants to touch old COBOL systems.
| 8thcross wrote:
| wow! that's a memory lane I hope to never revisit!
___________________________________________________________________
(page generated 2024-03-31 23:01 UTC)