[HN Gopher] How well can LLMs write COBOL?
       ___________________________________________________________________
        
       How well can LLMs write COBOL?
        
       Author : ggordonhall
       Score  : 70 points
        Date   : 2024-03-30 11:22 UTC (1 day ago)
        
 (HTM) web link (bloop.ai)
 (TXT) w3m dump (bloop.ai)
        
       | pmg101 wrote:
       | I tried to get chatgpt to write 6502 assembler for the 1980s
        | 8-bit home computer, the BBC Micro. It was game, but clueless.
        
         | fmxexpress wrote:
         | Try feeding it 8 pages of examples first? Something like this
         | https://atariwiki.org/wiki/Wiki.jsp?page=Advanced%206502%20A...
        
         | TillE wrote:
         | That's a little surprising; 6502 assembly is a fairly popular
         | hobbyist thing, so I would expect the data is out there. It's
         | also _mostly_ pretty simple, but you do have to watch out for
         | quirks like rotating through the carry bit.
        
         | _the_inflator wrote:
         | It depends on what you test for.
         | 
         | I am from the C64 demo scene and in this regard ChatGPT is
          | pretty useless. VIC tricks, raster timing - nothing an LLM can
          | help with at the moment, judging from my experience with ChatGPT
          | 4.0 so far.
         | 
         | Same goes for Amiga and simple blitter access for scrolling.
         | 
         | LLMs will be very limited here unless they receive sensory
         | feedback repeatedly.
         | 
         | Pure algorithms like sorting for example may be doable by GPT,
         | but the mentioned machines are very creatively used to come up
         | with effects.
         | 
          | Here, techniques are reused, not so much the code itself, which
          | gets modified and optimized for every demo and oftentimes
          | recombined with other techniques.
         | 
         | Most techniques for the VIC are pretty well documented, but the
         | timing as well as recombining them is the heavy lifting.
        
         | pcwalton wrote:
         | I tried to get it to write 6502 assembler for the NES. As I
         | recall it mistakenly thought that there was a BIOS containing a
         | print function.
        
         | Terretta wrote:
         | It can write 6502, but you have to know 6502 yourself to coach
         | it.
        
       | dwheeler wrote:
       | I suspect there were relatively few training examples for COBOL.
       | It would be interesting to see the results for a system which had
       | a significant number of such examples in the training set.
        
         | nradov wrote:
         | There is probably a business opportunity for an AI company to
         | build private LLMs for large enterprises trained on their own
         | COBOL code bases. They won't find much available as open
         | source, and individual companies tend to have significantly
         | different coding styles tied to COBOL versions and database
         | schemas.
        
           | IshKebab wrote:
           | Yeah I think Facebook and Google are already doing that
           | internally.
        
         | pjmlp wrote:
         | Indeed, even the author wasn't aware of modern COBOL, with IDE
         | tooling and OOP capabilities, focusing on the classical micros
         | instead.
        
           | giantrobot wrote:
           | The issue is less COBOL the language and more the literal
           | business logic the COBOL is encoding. You can learn the COBOL
           | language as easily as any other. What you can't learn as
           | easily are the accounting rules, administration policies, and
           | regulations any major COBOL codebase is implementing.
           | 
           | You'll be able to see the code multiplying a dollar value by
           | 0.03 but not necessarily know that this is because of some
           | statutory requirement of some Minnesota tax code that only
           | applies to industries producing both left and right handed
           | monkey wrenches but only if the company was incorporated
           | before 1975. That obscure law isn't referenced in any
           | documentation but was found by an accountant in 1982. The
           | change was made to the code but only referenced in a paper
           | memo with a small distribution list but all of those memos
           | were shredded after being archived for ten years.
           | 
           | ChatGPT can't really help document code that's decades old
           | and doesn't have any references to the _why_ of the code. The
           | _how_ is straightforward but rarely as important as the
           | _why_.
        
             | Solvency wrote:
             | Not knowing a statutory law expressed in code is not just a
             | COBOL thing.
        
       | ripvanwinkle wrote:
        | It would be interesting to feed it a formal language
        | specification of some language it hasn't seen, then ask it to
        | write code and see how it does.
       | 
        | That could be a test of reasoning and reading comprehension.
        
         | ape4 wrote:
         | Reasoning vs being a completion engine (I could make a guess at
         | how well that would work)
        
           | CuriouslyC wrote:
           | Reasoning is a form of completion (logical), the problem is
           | that LLMs aren't language agnostic in their learned semantic
           | reasoning.
        
         | CuriouslyC wrote:
         | I've been thinking about a benchmark designed this way for a
         | while. It doesn't even need to be code, particularly, it could
         | be basic reasoning problems. The key is that you define a new,
         | random language that has never before been seen (maybe it has
         | statistical similarity to existing languages, maybe not),
         | create a translation key, then ask a question in that language.
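A minimal sketch of how such a benchmark language could be generated, assuming a simple word-for-word substitution scheme (all names here are made up for illustration):

```python
import random
import string


def make_language(words, seed=0):
    """Map each English word to a fresh six-letter nonsense word.

    Seeding makes the generated language reproducible across runs.
    """
    rng = random.Random(seed)
    return {
        w: "".join(rng.choice(string.ascii_lowercase) for _ in range(6))
        for w in words
    }


def translate(sentence, key):
    """Word-for-word translation using the substitution key."""
    return " ".join(key[w] for w in sentence.split())


key = make_language(["if", "two", "plus", "three", "is", "five", "say", "yes"])
question = translate("if two plus three is five say yes", key)
# The prompt would include the key (as a glossary) plus `question`;
# since the language is freshly generated, a correct answer has to
# come from in-context reasoning rather than memorized training data.
```

Statistical similarity to real languages could then be layered on (e.g. shaping word lengths or letter frequencies), but even this trivial cipher guarantees the test sentences were never seen in training.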
        
       | allusernamesare wrote:
       | On one hand, it'd be super cool to solve the talent shortage in
       | the field, also Cobol code isn't very pleasant to write.
       | 
        | On the other, I'm not sure I'd want tools known for poor code
        | quality and hallucinations to write these super critical
        | systems.
       | 
       | Guess there might be a copilotesque productivity booster for
       | human developers, but I think these systems are some of the last
       | places I'd want LLMs to contribute.
        
         | backtoyoujim wrote:
         | That would remove some portion of the human coders, too.
        
         | mrbombastic wrote:
         | I wonder if a better goal would be valid translation of the
         | COBOL into X language. Obviously that has its own can of worms
         | but it seems like our goal generally should be getting critical
         | systems to more modern languages that can be maintained more
         | easily.
        
           | sanxiyn wrote:
           | In 2009, 4 million lines of COBOL were migrated to Java using
           | an automatic translator.
           | 
           | https://www.infoq.com/news/2009/07/cobol-to-java/
        
             | kamma4434 wrote:
             | The problem is not translation - once it is ported to Java,
             | who will maintain it? It has no specs, no tests, and it's a
             | spaghetti mess. Is any weirdness a bug or a feature? Nobody
             | knows. Least of all the body-rental remote devs you hired
             | to maintain it.
        
               | ikari_pl wrote:
               | which is worse than the same situation, but in COBOL, how
               | exactly?
        
         | bongodongobob wrote:
         | Why do people always add this "blah blah hallucinations and
         | critical systems"? 1. People write bad and buggy code. 2. You
         | act like we're just blindly throwing untested code at
         | production systems from LLMs.
         | 
         | It's just intellectually dishonest to talk this way.
         | 
         | They will still be helpful but we obviously need to test before
         | we add code into systems. It goes without saying.
        
           | fragmede wrote:
           | Look, we can't all just be realistic about a thing that's
           | going to take our jobs, so we have to lean on tired old
           | excuses instead. Instead of being reasonable, why don't you
           | pick a team - for or against, and then fight about it on
           | Internet forums because, well, what else are you going to do
           | while the build compiles? Look at cat videos?
        
             | bongodongobob wrote:
             | I generate my own custom cat videos with AI now tyvm.
        
           | giantrobot wrote:
           | > It's just intellectually dishonest to talk this way.
           | 
           | > They will still be helpful but we obviously need to test
           | before we add code into systems. It goes without saying.
           | 
           | It's not intellectually dishonest at all. It's an issue of
           | conditioning. There's a class of developers that blindly copy
           | and paste code from StackOverflow or the first hit on Google.
           | They're the same class that will uncritically copy and paste
           | ChatGPT answers.
           | 
           | ChatGPT is worse than SO because it's adaptive. If someone
           | pastes in a SO answer and it doesn't immediately work the
           | developer has to at least engage with the code. ChatGPT can
           | be asked to refine its hallucination until it
           | parses/compiles.
           | 
           | The class of developer blindly copying and pasting answers
           | will not have the expertise to spot hallucinations or likely
           | even fix the inevitable bugs they introduce. Additionally
           | ChatGPT by its nature elides the source of its answers. At
            | the very least a SO answer has _some_ provenance: not only
            | the poster but some social signalling through votes that the
            | answer is legitimate.
           | 
           | ChatGPT answers don't have any of that. It will also happily
           | hallucinate references.
           | 
           | Conditioning junior developers and learners to rely on and
           | trust AI coding is setting them up to fail. It's also going
           | to stunt their growth as developers because they'll never
           | gain any domain knowledge. In the meantime they'll be
           | unknowingly sabotaging products with legit looking but broken
           | code.
        
             | bongodongobob wrote:
             | I should be worried that the very worst developers might
             | paste bad code from ChatGPT and that's why it's dangerous?
             | Looks an awful lot like mental gymnastics to me.
        
       | sanxiyn wrote:
       | Considering MTOB(Machine Translation from One Book) result, where
       | LLM in-context learns a new language from a grammar book, I
       | wonder how LLMs fare given, say, GnuCOBOL Programmer's Guide PDF,
       | which is easily available. It would be an interesting addition to
       | the benchmark.
       | 
       | https://arxiv.org/abs/2309.16575
       | 
       | https://gnucobol.sourceforge.io/guides.html
        
       | pcwalton wrote:
       | I tried to get ChatGPT to write LLVM IR last year. The results
       | were interesting: the LLM wrote superficially correct-looking IR,
       | but it ultimately failed to grasp the concept of SSA, as it kept
       | trying to write to SSA registers. ChatGPT can generalize across
       | language syntax reasonably well, but it doesn't understand deeper
       | differences in language semantics.
        
         | wvenable wrote:
         | I tried to get ChatGPT to write 6502 assembly and it had
         | similar issues.
        
           | saurik wrote:
           | FWIW, I had similar issues trying to get both it and Claude
           | to help me with x86 assembly: it kept thinking if it added
           | another * or some more parentheses it could get some
           | impossible assembly to work.
        
             | bongodongobob wrote:
             | Reminds me of adding *'s and &'s until my C++ code worked
             | in college.
        
       | pama wrote:
        | Since the LLM sometimes generates invalid COBOL, a simple
        | practical solution would be to use an API and allow it to test
        | its code with GnuCOBOL, feed the output back, and have it try
        | again a couple of times. I wonder what the updated benchmarks
        | would be in that setting.
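A minimal sketch of that compile-and-retry loop, assuming a hypothetical `ask_llm` callable for the model API; GnuCOBOL's `cobc` does the checking (`-fsyntax-only` limits it to a syntax pass):

```python
import os
import subprocess
import tempfile


def compile_cobol(source: str) -> tuple[bool, str]:
    """Syntax-check COBOL source with GnuCOBOL's cobc.

    Returns (ok, diagnostics). Requires cobc on PATH.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".cob", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(
            ["cobc", "-x", "-fsyntax-only", path],
            capture_output=True, text=True,
        )
        return result.returncode == 0, result.stderr
    finally:
        os.unlink(path)


def generate_with_retries(ask_llm, check, task, max_tries=3):
    """Ask the model for code, feed compiler errors back, retry a few times."""
    prompt = task
    source = ""
    for _ in range(max_tries):
        source = ask_llm(prompt)
        ok, errors = check(source)
        if ok:
            return source
        # Append the diagnostics so the next attempt can fix them.
        prompt = (f"{task}\n\nYour last attempt failed to compile:\n"
                  f"{errors}\nPlease fix it.")
    return source  # best effort after max_tries
```

In the benchmark setting, `check` would be `compile_cobol` and `ask_llm` a wrapper around the model API; the loop itself is model-agnostic.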
        
         | ptx wrote:
         | The general approach seems to work anyway. I tried it out with
         | ChatGPT 3.5 and an online Cobol compiler[0], manually feeding
         | back the output, and it managed to produce a working program on
         | the 10th attempt (that displays the first 10 Fibonacci
         | numbers).
         | 
         | Edit: Well, maybe. With the example from the article it wasn't
         | as successful.
         | 
         | [0] https://onecompiler.com/cobol/
        
       | danenania wrote:
       | This looks interesting. I'm working on an OpenAI-based tool for
       | coding tasks that are too complex for ChatGPT -
       | https://github.com/plandex-ai/plandex
       | 
       | It's working quite well for me, but it definitely needs some time
       | spent on benchmarking and ironing out edge cases.
       | 
       | I'm especially curious how it will do on more "obscure"
       | languages. Not that Cobol is obscure exactly--I suppose there's
       | probably quite a bit of it in GPT-4's training considering how
       | pervasive it is in some domains. In any case, I'll try out this
       | benchmark and see how it goes.
        
         | arittr wrote:
         | This looks great! Can't wait to try it out today
        
         | skissane wrote:
         | > Not that Cobol is obscure exactly--I suppose there's probably
         | quite a bit of it in GPT-4's training considering how pervasive
         | it is in some domains
         | 
         | There is a huge amount of COBOL code in existence - but, almost
         | all of it is non-public code used to run business and
         | governments. Very little of it is publicly source-available
            | (whether open source or something more restrictive than that).
         | 
            | Unless GPT-4's training data includes non-public code bases
            | (which I doubt), it likely has rather little COBOL code in it.
        
           | SonOfLilit wrote:
           | I've been using GPT4 to help me navigate a mainframe and a
           | COBOL codebase and it knows far more than what my googling
           | abilities manage to fish up in forums. It's actually
           | surprisingly good at surprisingly deep mainframe topics.
        
             | skissane wrote:
             | No doubt its training data contains a lot of IBM manuals,
             | probably even some commercial books on relevant topics,
             | maybe even the contents of some of the forums you mention -
             | and all that could be enough to correctly answer your
             | questions.
             | 
             | However, for languages like Python, Java, C, C++,
             | JavaScript, Go, etc, it also contains untold millions of
             | lines of code slurped from places like GitHub. Whereas, I
              | really doubt it contains anywhere remotely near as much
              | COBOL code: if you look for COBOL code in GitHub public
              | repos, you will find very little - the vast majority
             | of COBOL code is in-house or vendor business software, and
             | few seem to want to make that stuff public - and what COBOL
             | code GitHub has is mostly toy exercises or ancient stuff,
             | not examples of significant contemporary production code.
             | The only way OpenAI is going to get a substantial quantity
             | of that is if multiple private parties (such as banks) give
             | them access to their COBOL code bases - not impossible, but
             | absent some public info saying it has happened, it seems
             | more likely it hasn't.
             | 
             | I expect GPT-4 (or any LLM) is not going to perform as well
             | on complicated programming tasks for COBOL compared to
             | other languages. For more mainstream languages, it has
             | millions of examples to help it do a better job, for COBOL
             | it likely doesn't.
        
           | jobigoud wrote:
           | But it probably read all the books ever published on COBOL.
        
       | GaggiX wrote:
       | Someone should test the benchmark on Claude 3 models.
        
       | andy99 wrote:
       | I've asked chatGPT a fair number of Fortran questions. There are
       | differences - Fortran is still in use lots of places, there are
       | forums and documentation sites (though presumably that's true for
        | COBOL). But compared to Python, there is way less info out there
        | on how to do different things, for example little Stack Overflow
        | content.
       | 
        | I'd say I've had mixed results: ChatGPT definitely knows the
        | language and can give examples, but I've also had a lot of
        | frustrating things it wasn't able to resolve.
        
         | gnatolf wrote:
         | Fortran also has the disadvantage of numerous separate
         | dialects/flavours that plenty of times can't be mixed. And
         | rarely is the exact style mentioned in random code found
         | somewhere.
         | 
         | It got a lot better with Fortran 95 and newer, but in the old
         | world of e.g. lahey compilers and custom commands only
          | available there, every LLM has consistently failed to stick to
          | these intricacies. I can't even blame them; when asking humans
         | questions about these topics, you'll get all sorts of answers
         | that are equally close to a correct solution, but almost never
         | precisely correct.
        
       | treebeard901 wrote:
       | One thing that makes OpenAI so valuable over time is how they can
       | take all of this expert input from the rush to test out the new
       | technology and use that to exponentially improve the next model.
        | Just as we see here, experts in their field who know all of
        | these edge cases and other fundamental aspects of what makes
        | their own companies or projects valuable tend to hand that
        | knowledge over in the process.
       | 
        | It's kind of interesting how far this could be taken with all
       | kinds of valuable information from people and companies,
       | especially code and business logic. Everyone wants to say AI this
       | and AI that to keep up with the times and they all keep dumping
       | all of this valuable data in for free.
       | 
        | Then consider the authentication method tied to this data
        | collection: if you are using a company email, etc., it helps
        | them weed out the garbage too.
       | 
       | I guess we can't fight progress...
        
         | CuriouslyC wrote:
         | That might be true, but my feeling so far is that OpenAI
         | doesn't want to do what they'd need to do to make any one
         | product actually good, so they're going to keep bouncing from
         | AI thing to thing, making foundation models that have a lot of
         | wow factor but can't really deliver on their promise because
         | they're too closed off to integrate into a workflow that might
         | alleviate the issues.
         | 
         | Other people will try to build on OpenAI stuff, find that it's
         | not quite good enough and OpenAI doesn't care to really make it
         | good enough because it's a lot of work, and it won't be until
         | we get competitors that take less sexy model tech and take the
         | time to make it REALLY GOOD at certain things that AI really
         | makes good on its promise. I'm guessing that will be driven by
         | people taking open source tools that are ~80% of the way there
         | and really building a system and domain logic around it to make
         | it excellent.
        
         | bigEnotation wrote:
         | I think you're forgetting about the use case where the LLM
         | returns something partially correct to a discerning expert, who
         | is still able to use the response, but does not bother with a
         | message like "btw I had to do X to make your suggestions
         | usable".
        
       | aj7 wrote:
       | A while back, I asked a question here, roughly, why hasn't
       | someone written, say, a C to COBOL translator? Such a program
       | might take a lot of work, but it seemed to me that with an
       | impending dearth of COBOL programmers, there would be demand for
       | such an app. I was informed that there were so many different
       | COBOLs in use that the output of such a program would STILL have
       | to be tended to by an experienced programmer in the output
       | dialect desired. This is just the Copilot situation.
        
         | Analemma_ wrote:
         | As much as everyone likes to poke fun at COBOL, the language
         | itself really isn't the problem with maintaining/updating old
         | COBOL systems. It's old, but it's not _that_ bad.
         | 
         | The real problem is the entire ecosystem _around_ those
         | systems. Remember, a lot of COBOL software dates back to a time
          | before things like relational databases. You'll be working
         | with flat files that might, if you're very lucky, have column
         | and record separators and useful names/documentation explaining
         | what they are. If you're unlucky you'll have to figure out
         | field widths from the code and infer what the fields are based
         | on their actual usage. Oh and if you get it wrong you just
         | messed up something related to payroll or financial compliance;
         | enjoy the punishing fines.
         | 
         | That kind of stuff, more than the language, is the reason
         | nobody wants to touch old COBOL systems.
        
       | 8thcross wrote:
        | wow! that's a memory lane I hope to never revisit!
        
       ___________________________________________________________________
       (page generated 2024-03-31 23:01 UTC)