[HN Gopher] How AI is unlocking ancient texts
       ___________________________________________________________________
        
       How AI is unlocking ancient texts
        
       Author : Marceltan
       Score  : 195 points
       Date   : 2024-12-30 13:35 UTC (3 days ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | aaronbrethorst wrote:
       | (2024)
        
         | Tagbert wrote:
         | :-)
        
       | datavirtue wrote:
       | Nothing to see here. LLMs and AI suck and aren't really good at
       | anything. /s
       | 
       | The world is about to change much faster than any of us have ever
       | witnessed to this point. What a life.
        
         | muglug wrote:
         | There's a big difference between LLMs and this application of
         | CNNs and RNNs.
         | 
         | Very few people on HN are claiming there's no value to neural
         | networks -- CNNs have been heralded here for well over a
         | decade.
        
         | mcphage wrote:
         | There are definitely things they're good at. And there's
         | definitely things that they're bad at, worse than nothing at
          | all. The problem is how often they're being used in the latter
         | case, and how rarely in the former.
        
         | zeofig wrote:
         | Build a strawman, knock him down, and plant the glorious flag
         | of hyperbole on his strawwy corpse.
        
       | mlepath wrote:
       | This is a great application of various domains of ML. This
        | reminds me of the Vesuvius Challenge. This kind of thing is
        | accessible to beginners too, since the data are by definition
        | pretty limited.
        
         | jhanschoo wrote:
         | Perhaps you missed it while skimming, but indeed, the Vesuvius
         | Challenge is a primary topic of discussion in the article :)
        
       | adriand wrote:
       | I find this incredibly exciting. There could be some truly
       | remarkable works whose contents are about to be revealed, and we
       | don't really know what we might find. Histories of the ancient
       | (more ancient) world. Accounts of contact with cultures and
       | civilizations that are currently lost to history. Scientific and
       | mathematical discoveries. And what I often find to be the most
       | moving: stories of daily life that illuminate what regular people
       | thought and felt and experienced thousands of years ago.
        
         | Applejinx wrote:
         | Which becomes a real gotcha when it turns out to be
         | hallucinated 'content' misleading people into following their
         | assumptions on what regular people thought and felt and
         | experienced thousands of years ago.
         | 
         | What we call AI does have superhuman powers but they are not
         | powers of insight, they are powers of generalization. AI is
         | more capable than a human is of homogenizing experience down to
         | what a current snapshot of 'human thought' would be, because
         | it's by definition PEOPLE rather than 'person'. The effort to
         | invoke a specific perspective from it (that seems ubiquitous)
         | sees AI at its worst. This idea that you could use it to
          | correctly extract a specific perspective from the long dead is
         | wildly, wildly misguided.
        
       | Sparkyte wrote:
        | Can't wait to read ancient smut from the time.
        
         | sapphicsnail wrote:
          | I wouldn't call it smut, but there are 5 surviving Greek novels
          | and some Roman elegiac poetry that's a little horny. We know
         | there used to be a lot of crazier stuff but it mostly doesn't
         | survive.
        
       | mmooss wrote:
       | There isn't much about accuracy:
       | 
       |  _" Ithaca restored artificially produced gaps in ancient texts
       | with 62% accuracy, compared with 25% for human experts. But
       | experts aided by Ithaca's suggestions had the best results of
       | all, filling gaps with an accuracy of 72%. Ithaca also identified
       | the geographical origins of inscriptions with 71% accuracy, and
       | dated them to within 30 years of accepted estimates."_
       | 
       | and
       | 
       |  _" [Using] an RNN to restore missing text from a series of 1,100
       | Mycenaean tablets ... written in a script called Linear B in the
       | second millennium bc. In tests with artificially produced gaps,
       | the model's top ten predictions included the correct answer 72%
       | of the time, and in real-world cases it often matched the
       | suggestions of human specialists."_
       | 
       | Obviously 62%, 72%, 72% in ten tries, etc. is not sufficient by
       | itself. How do scholars use these tools? Without some external
       | source to verify the truth, you can't know if the software output
       | is accurate. And if you have some reliable external source, you
       | don't need the software.
       | 
       | Obviously, they've thought of that, and it's worth experimenting
       | with these powerful tools. But I wonder how they've solved that
       | problem.
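        | 
        | To be concrete about what "72% in ten tries" means: it's
        | presumably top-10 accuracy on artificially masked spans - hide a
        | passage whose text is already known, ask the model for its best
        | guesses, and check whether the truth is among them. Roughly this
        | kind of scoring (the predict_gap interface here is made up, not
        | Ithaca's actual API):
        | 
        |     def top_k_accuracy(model, texts, k=10, gap_len=5):
        |         """Score a restoration model on artificially masked gaps."""
        |         hits = 0
        |         for text in texts:
        |             # Mask a span whose contents we already know, so we
        |             # have ground truth to compare against.
        |             start = len(text) // 2
        |             truth = text[start:start + gap_len]
        |             masked = (text[:start] + "?" * gap_len
        |                       + text[start + gap_len:])
        |             # Ask the (hypothetical) model for its k most likely
        |             # restorations of the masked span.
        |             guesses = model.predict_gap(masked, start, gap_len, k=k)
        |             if truth in guesses:
        |                 hits += 1
        |         return hits / len(texts)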
        
         | sapphicsnail wrote:
         | > Obviously 62%, 72%, 72% in ten tries, etc. is not sufficient
         | by itself. How do scholars use these tools? Without some
         | external source to verify the truth, you can't know if the
         | software output is accurate. And if you have some reliable
         | external source, you don't need the software.
         | 
         | Without an extant text to compare, everything would be a guess.
         | Maybe this would be helpful if you're trying to get a rough and
         | dirty translation of a bunch of papyri or inscriptions? Until
          | we have an AI that's able to adequately explain its reasoning,
          | I can't see this replacing philologists with domain-specific
         | expertise who are able to walk you through the choices they
         | made.
        
           | EA-3167 wrote:
           | I wonder if maybe the goal is to provide the actual scholars
           | with options, approaches or translations they hadn't thought
           | of yet. In essence just what you said, structured guessing,
           | but if you can have a well-trained bot guess within specific
           | bounds countless times and output the patterns in the
           | guesses, maybe it would be enough. Not, "My AI translated
           | this ancient fragment of text," but "My AI sent us in a
           | direction we hadn't previously had the time or inclination to
           | explore, which turned out to be fruitful."
        
             | mmooss wrote:
              | I agree, but let's remember that the software repeats
              | patterns; it doesn't so much innovate new ones. If you get
             | too dependent on it, theoretically you might not break as
             | much new ground, find new paradigms, discover the long-
             | mistaken assumption in prior scholarship (that the software
             | is repeating), etc.
        
               | Validark wrote:
               | Interesting point in theory but I'd love to get to the
               | point where our problem is that we solved all the
               | problems we already know how to solve.
        
               | Zancarius wrote:
               | Human proclivities tend toward repetition as well,
               | partially as a memory/mnemonic device, so I don't see
               | this as disadvantageous. For example, there's a minor
               | opinion in biblical scholarship that John 21 was a later
               | scribal addition because of the end of John 20 seeming to
               | mark the end of the book itself. However, John's
               | tendencies to use specific verbiage and structure
                | provide a much stronger argument that the book was
               | written by the same author--including chapter 21--
               | suggesting that the last chapter is an epilogue.
               | 
               | Care needs to be taken, of course, but ancient works
               | often followed certain patterns or linguistic choices
               | that could be used to identify authorship. As long as
                | this is viewed as one tool of many, there's unlikely to
                | be much harm unless scholars lean too heavily on the
                | opinions of AI analysis (which is the real risk, IMO).
        
         | manquer wrote:
          | If the texts are truly missing, then accuracy is subjective?
          | I.e., human opinion versus AI generation.
        
           | ip26 wrote:
           | _artificially produced gaps in ancient texts_
           | 
           | Someone deleted part of a known text.
           | 
           | This does require the AI hasn't been trained on the test text
            | previously.
        
             | rtkwe wrote:
              | They do mention in the article that the missing-data test
              | was done on "new" data that the models had not been
              | trained on, so it's not just regurgitation for at least
              | some of the results, it seems.
        
           | mmooss wrote:
           | > If the texts are truly missing , then accuracy is
           | subjective ?
           | 
           | Then accuracy might be unknown but it's not subjective.
        
           | BeefWellington wrote:
           | One way to test this kind of efficacy is to compare it to a
           | known sample with a missing piece, e.g.: create an artifact
           | with known text, destroy it in similar fashion, compare what
           | this model suggests as outputs with the real known text.
           | 
           | The "known" sample would need to be handled and controlled
           | for by an independent trusted party, obviously, and therein
           | lies the problem: It will be hard to properly configure an
           | experiment and _believe it_ if any of the parties have any
           | kind of vested interest in the success of the project.
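            | 
            | The scoring side of such a blind test can be kept very
            | simple, something like the sketch below (purely
            | illustrative; the similarity metric is an arbitrary choice,
            | and the point is that only the independent party holds the
            | ground truth):
            | 
            |     import difflib
            | 
            |     # Held only by the independent party; the modelling
            |     # team never sees it.
            |     GROUND_TRUTH = {"artifact_17": "text that was removed"}
            | 
            |     def score_submission(artifact_id, predicted_text):
            |         """0..1 similarity between a model's guess and truth."""
            |         truth = GROUND_TRUTH[artifact_id]
            |         matcher = difflib.SequenceMatcher(
            |             None, predicted_text, truth)
            |         return matcher.ratio()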
        
         | rnd0 wrote:
         | Thank you, and also I'd like to know how they'd even evaluate
         | the results to begin with...
         | 
         | I hope to GOD they're holding on to the originals so they can
          | go back and redo this in 20 or 30 years when tools have improved.
        
       | zozbot234 wrote:
       | The really nice thing about this is that the AI can now acquire
       | these newly-decoded texts as part of its training set, and begin
       | learning at a geometric rate.
        
         | mistrial9 wrote:
         | errors in => errors out
        
           | WhereIsTheTruth wrote:
           | Don't forget to spice it up with some bias!
           | 
           | https://x.com/i/grok/share/uMwJwGkl2XVUep0N4ZPV1QUx6
        
         | rzzzt wrote:
         | But do I want to see ancient programming advice written in
         | Linear B?
        
         | zeofig wrote:
         | Why not just feed it random data? It's so smart that it will
         | figure out which parts are random, so eventually you will
         | generate some good data randomly, and it will feed on it, and
         | become exponentially smarter exponentially fast.
        
           | Validark wrote:
           | This is actually hilarious and I'm sad you are getting
           | downvoted for it.
        
         | nitwit005 wrote:
          | With our current methods, feeding even fairly small amounts
          | of output back in as training data leads to declining
          | performance.
         | 
         | Just think of it abstractly. The AI will be trained on the
         | errors the previous generation made. As long as it keeps making
         | new errors each generation, they will tend to multiply.
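          | 
          | A toy version of that argument, with made-up numbers just to
          | show the compounding:
          | 
          |     # Fraction of wrong training data after recycling model
          |     # output `gens` times, starting from `inherited` errors.
          |     def error_after_generations(gens, inherited=0.05, fresh=0.05):
          |         error = inherited
          |         for _ in range(gens):
          |             # Each generation keeps the inherited errors and
          |             # adds fresh ones on top of what was still correct.
          |             error = 1 - (1 - error) * (1 - fresh)
          |         return error
          | 
          |     # ~5% error grows to roughly 26% after five rounds
          |     print(round(error_after_generations(5), 2))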
        
           | red75prime wrote:
           | Degradation of autoregressive models being fed their own
           | unfiltered output is pretty obvious: it's, basically, noise
           | being injected into the ground truth probability
           | distribution.
           | 
           | But. "Our current methods" include reinforcement learning. So
           | long as there's a signal indicating better solutions,
           | performance tends to improve.
        
       | taffronaut wrote:
       | From TFA "decoding rare and lost languages of which hardly any
       | traces survive". Assuming that's not hype, let's see it have a go
       | at Rongorongo[1] then.
       | 
       | [1] https://en.m.wikipedia.org/wiki/Rongorongo
        
         | nick238 wrote:
          | And Linear A [1]. To be fair, whatever model would require data
         | about the context of where the texts were found unless the
         | corpus is _massive_.
         | 
         | https://en.wikipedia.org/wiki/Linear_A
        
           | shellfishgene wrote:
           | It's mentioned in the article that they hope the model for
           | Linear B can also help with Linear A.
        
       | yzydserd wrote:
       | Klaatu barada nikto
        
       | userbinator wrote:
       | The full title is "How AI is unlocking ancient texts -- _and
       | could rewrite history_ ", and that second part is especially
       | fitting, although unfortunately not mentioned in the article
       | itself, which is full of rather horrifying stories about using AI
       | to "fill in" missing data, which is clearly not true data
       | recovery in any meaningful sense.
       | 
       | I am aware of how advanced algorithms such as those used for
       | flash memory today can "recover" data from imperfect probability
       | distributions naturally created by NAND flash operation, but
        | there seems to be a huge gap between those, which are based on well-
       | understood information-theoretic principles, and the AI
       | techniques described here.
        
         | mistrial9 wrote:
         | related -- A Chinese woman, known by the pseudonym Zhemao, was
         | found to have been rewriting and falsifying Russian history on
         | Wikipedia for over a decade. She created over 200 detailed
         | articles on the Chinese Wikipedia, which included fictitious
         | states, battles, and aristocrats. The hoax was exposed when a
         | Chinese novelist, Yifan, researching for a book, stumbled upon
         | an article about the Kashin silver mine. Yifan noticed that the
         | article contained extensive details that were not supported by
         | other language versions of Wikipedia, including the Russian
         | one.
         | 
         | Zhemao, posing as a scholar, claimed to have a Ph.D. in world
         | history from Moscow State University and was the daughter of a
         | Chinese diplomat based in Russia. She used machine translation
         | to understand Russian-language sources and filled in gaps with
         | her own imagination. Her articles were well-crafted and
         | included detailed references, making them appear credible.
         | However, many of the sources cited were either fake or did not
         | exist.
         | 
         | The articles were eventually investigated by Wikipedia editors
         | who found that Zhemao had used multiple "puppet accounts" to
         | lend credibility to her edits. Following the investigation,
         | Zhemao was banned from Chinese Wikipedia, and her edits were
         | deleted.
        
       | fuzztester wrote:
       | >How AI is
       | 
       | I almost read it as "How Ali is" due to speed reading and the
       | font in the original article. :)
       | 
       | And now I wonder how AI would do on that same test :)
       | 
       | Chat, GPT!
        
       | cormorant wrote:
       | Does anyone have a subscription or can otherwise read past the
       | heading "A flood of information"? (I can see ~2500 words but
       | there is apparently more.)
        
         | msephton wrote:
          | The easiest way is to prepend the URL with archive.is, as
         | follows:
         | https://archive.is/https://www.nature.com/articles/d41586-02...
        
           | Validark wrote:
            | It's cut off on archive.is too. Can Springer Nature not
            | afford to let us all read the full article, or what? Do
            | they really need $70 for a single page of information again?
        
       | teleforce wrote:
        | I hope we can decipher the Indus script using AI [1].
        | 
        | It's well overdue, and statistical profiling suggests it is a
        | genuine linguistic script used as the writing system of the
        | ancient Harappan language, the likely precursor of modern
        | Dravidian languages.
       | 
       | [1] Indus script:
       | 
       | https://en.wikipedia.org/wiki/Indus_script
        
         | dr_dshiv wrote:
         | Claude is all too willing to provide interpretations. Why not
         | give it a go and see if you can't crack it yourself? Hypothesis
         | generation is needed!
        
       | Oarch wrote:
       | What if the deciphered content is the ancient equivalent of Vogon
       | poetry? Do we stop?
        
         | Octoth0rpe wrote:
         | No, but the translation process would transfer from academia to
         | the military industrial complex.
        
       | sans_souse wrote:
        | This concerns me. How do we assess the AI's interpretation when
        | it comes to what we ourselves can't see? Have we not learned
        | that AI desperately wants to supply answers, to the point that
        | it prioritizes answers over accuracy? We already lose enough in
        | translation, and twist plenty of the words we can discern - I'd
        | really prefer we not start filling the gaps with lies formed
        | from regurgitated data pools, where the model is most likely
        | sourcing whatever fabricated fluff it ends up using to fill in
        | said gaps.
        
         | watt wrote:
            | Why wouldn't you prefer _something_ over _nothing_? I assume
            | AI steps in for problems that people haven't been able to
            | even begin to solve in decades.
        
           | xenospn wrote:
           | That _something_ could be worse than nothing.
        
           | Majestic121 wrote:
           | It's much better to have _nothing_ than the wrong
           | _something_, since with a wrong _something_ you build
           | assumptions on wrong premises. Much better to accept that we
           | don't know (hopefully temporarily), so that people can keep
           | looking into it instead of falsely believing the problem is
            | solved.
        
           | davidclark wrote:
           | Absolutely prefer nothing here.
        
           | throw4847285 wrote:
           | I bet Heinrich Schliemann would have loved AI.
        
         | palmfacehn wrote:
         | What is 'accuracy' when examined at depth?
         | 
         | With the benefit of greater knowledge and context we are able
         | to critique some of the answers provided by today's LLMs. With
         | the benefit of hindsight we are able to see where past
         | academics and thought leaders went wrong. This isn't the same
         | as confirming that our own position is a zenith of
         | understanding. It would be more reasonable to assume it is a
         | false summit.
         | 
         | Could we not also say that academics have a priority to
         | "publish or perish"? When we use the benefits of hindsight to
         | examine debunked theories, could we not also say that they were
         | too eager to supply answers?
         | 
         | I agree about models filling the gaps with whatever is most
         | probable. That's what they are designed to do. My quibble is
         | that humans often synthesize the least objectionable answers
         | based on group-think, institutional norms and pure laziness.
        
         | d357r0y3r wrote:
         | The AI interpretation can be folded into a multidisciplinary
         | approach. We wouldn't merely take AI's word for it. Does this
         | interpretation make sense given what historians and
          | anthropologists have learned, etc.?
        
         | indymike wrote:
         | > This concerns me. How do we assess the AI's interpretation
         | when it comes to what we ourselves can't see?
         | 
         | Sometimes a clue or nudge can trigger a cascade of discovery.
         | Even if that clue is wrong, it causes people to look at
         | something they maybe never would have. In any case, so long as
          | we're reasonably skeptical, this is really no different from a
          | very human way of working... have you tried "...fill in wild
          | idea..."?
         | 
         | > I'd really prefer we not start filling the gaps with lies
         | formed of regurgitated data pools
         | 
          | A lie requires an intent to deceive, and that is beyond the
          | capability of modern AI. In many cases a lie can reveal
          | adjacent truth - and I suspect that is what is happening.
          | Regardless, finding truth in history is really hard because,
          | many times, the record is filled with actual lies intended to
          | make the victor or ruler look better.
        
         | dismalaf wrote:
         | > Have we not learned that AI desparately wants to supply
         | answers to the point it prioritizes answers over accuracy?
         | 
         | Have you ever met an archaeologist?
        
           | throw4847285 wrote:
           | Yeah, I know a number of archaeologists. Among academics,
           | they are some of the most conservative when it comes to
           | drawing sweeping conclusions from their research. A thesis
           | defense is a brutal exercise in being accused of crimes
           | against parsimony by your mentors and peers.
        
         | Electricniko wrote:
         | I like to think that today's clickbait data pools are perfect
         | for translating ancient texts. The software will see modern
         | headlines like "Politician roasts the opposition for spending
         | cuts" and come up with translations like "Emperor A roasted his
         | enemies" and it will still be correct.
        
       | InsOp wrote:
        | Is there any news on the Voynich manuscript?
        
       ___________________________________________________________________
       (page generated 2025-01-02 23:02 UTC)