[HN Gopher] Using a Markov chain to generate readable nonsense w...
       ___________________________________________________________________
        
       Using a Markov chain to generate readable nonsense with 20 lines of
       Python
        
       Author : benhoyt
       Score  : 199 points
       Date   : 2024-01-01 04:40 UTC (18 hours ago)
        
 (HTM) web link (benhoyt.com)
 (TXT) w3m dump (benhoyt.com)
        
       | westurner wrote:
       | >> _Making the prefix shorter tends to produce less coherent
       | prose; making it longer tends to reproduce the input text
       | verbatim. For English text, using two words to select a third is
       | a good compromise; it seems to recreate the flavor of the input
       | while adding its own whimsical touch._
       | 
       | Next word prediction; vector databases:
       | 
       | Vector database: https://en.wikipedia.org/wiki/Vector_database
       | 
       | https://news.ycombinator.com/item?id=37812219
        
       | raister wrote:
       | This was posted 50 days ago - how does it reach first page again?
       | https://news.ycombinator.com/from?site=benhoyt.com
        
         | j4yav wrote:
         | By people being interested and upvoting it. Not everyone reads
         | the home page all day every day.
        
       | not2b wrote:
       | This is exactly the approach I used back in the late 80s; I
       | published a Markov chain text generator on comp.sources.games
       | that digested Usenet posts, built tables and used the previous
       | two words to generate the next word. It was mildly amusing at the
        | time. Someone resurrected it and put it on GitHub, which I'm
        | fine with.
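        | 
        | A minimal sketch of that table-plus-two-word-prefix idea
        | (illustrative only, not the original markov3 code; it assumes an
        | input.txt corpus in the working directory):
        | 
        |   import random
        |   from collections import defaultdict
        | 
        |   def build_table(words):
        |       # map each (w1, w2) prefix to the words that followed it
        |       table = defaultdict(list)
        |       for w1, w2, w3 in zip(words, words[1:], words[2:]):
        |           table[w1, w2].append(w3)
        |       return table
        | 
        |   def generate(table, length=60):
        |       w1, w2 = random.choice(list(table))  # random start prefix
        |       out = [w1, w2]
        |       for _ in range(length):
        |           followers = table.get((w1, w2))
        |           if not followers:  # this prefix was never continued
        |               break
        |           w3 = random.choice(followers)
        |           out.append(w3)
        |           w1, w2 = w2, w3
        |       return " ".join(out)
        | 
        |   words = open("input.txt").read().split()
        |   print(generate(build_table(words)))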
        
         | 082349872349872 wrote:
         | Mark V. Shaney?
        
           | not2b wrote:
           | No, it was called markov3. Mark V. Shaney was similar but
           | from Bell labs. See https://github.com/cheako/markov3 (not
           | coded very well but I got better).
        
       | lowbloodsugar wrote:
        | I'll never forget: a mate implemented a Markov chainer in
        | LambdaMOO and named it "Cute Ickle Baby". He left it in the
        | Living Room, listening to what people said and randomly saying
        | things.
       | Sometimes it said the funniest shit and had us all pissing
       | ourselves laughing.
        
       | bugbuddy wrote:
       | This works pretty well for languages that use a space as word
       | separator. Now do this for those languages that don't and have no
       | clear and easy rules for word separation.
        
       | abetusk wrote:
       | As many know and point out, this idea is now very old (40+
       | years?). The big problem is that it creates "word salad" [0]. In
        | the generative art and procgen community, the term "10,000 bowls
        | of oatmeal problem" has been used [1].
       | 
       | Taking a step back, this is a perennial problem, even with AI old
       | and new. I've heard that the early (good) chess bots, after Deep
       | Blue, had a problem of only being locally context sensitive and
       | being easily fooled by longer range attack planning. We even see
       | it, to a certain extent, in LLMs where they forget or contradict
       | themselves with what was just said. I have no doubt that LLMs
       | will just keep getting better but, as a snapshot of right now,
       | you can see shades of this "word salad" problem, just a few steps
       | removed to become a "logic salad" problem.
       | 
       | [0] https://en.wikipedia.org/wiki/Word_salad
       | 
       | [1] https://galaxykate0.tumblr.com/post/139774965871/so-you-
       | want...
        
         | tgv wrote:
         | Also: it works much better in English than in other languages I
         | know. English grammar is simple, and many of the function words
         | carry a strong prediction for the next tokens.
        
           | echelon wrote:
           | Which languages are harder for LLMs? Is there any writeup or
           | analysis of this?
        
             | tahnyall wrote:
              | If I had to guess, from my time working with ASR (Automatic
              | Speech Recognition) languages and its difficulty with the
              | Indian (country of) accent, I'd guess Indian languages.
        
             | justsomehnguy wrote:
             | Any with inflections and particularly with grammatical
             | gender.
        
             | tgv wrote:
              | Idk. First, the obvious candidates: those with small, low-
              | quality text corpora. Second, I expect LLMs to perform worse
              | on long-distance dependencies. English has a pretty strict
              | word order, and the constituents are close to each other.
              | There's also a lot of input available, so even the larger
              | gaps have enough examples to learn from. But (some) Germanic
              | languages have some weird rules which allow constituents to
              | move far away, so it's likely LLMs will make more errors
              | there. Also, free word order languages might be a problem,
              | although I've always suspected that those languages aren't
              | that free in reality: speakers probably have strong
              | preferences.
        
         | 3abiton wrote:
          | Any suggestion on the background of the method?
        
           | tempodox wrote:
           | Do a web search for "Markov chain".
        
         | tempodox wrote:
         | And the logic salad problem will not be solved by LLMs because
         | there is nothing in their construction that makes them capable
         | of understanding logic.
        
           | brookst wrote:
           | Can you elaborate? I would argue that logic has even stronger
           | token prediction characteristics than language.
           | 
           | As a total handwave, I would expect an LLM trained on formal
           | logic and a huge corpus of proofs to produce pretty strong
           | logic output.
        
         | YeGoblynQueenne wrote:
         | The output is "word salad" because the probabilities of the
          | model are uniform, albeit implicitly: eyeballing the Python, it
         | selects the next n-gram uniformly at random. A better n-gram
         | model would calculate the probability of an n-gram following
         | another n-gram according to their frequency in the training
         | corpus. With a larger corpus and better training techniques
         | (some smoothing for one thing) you'd get much more coherent
         | output.
         | 
         | And of course, if you trained a gigantic model on the entire
         | web, then you'd get a very smooth approximation of the text in
         | the corpus and very fluent, grammatical output. So basically,
         | an LLM.
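          | 
          | A rough sketch of that kind of count-based trigram model
          | (illustrative only; add-one smoothing is just the simplest
          | choice, and the corpus file name is made up):
          | 
          |   import random
          |   from collections import Counter, defaultdict
          | 
          |   def train(words):
          |       counts = defaultdict(Counter)
          |       for w1, w2, w3 in zip(words, words[1:], words[2:]):
          |           counts[w1, w2][w3] += 1
          |       return counts
          | 
          |   def sample_next(counts, w1, w2, vocab, alpha=1.0):
          |       # add-alpha smoothing: every vocabulary word keeps a
          |       # small non-zero probability in every context
          |       c = counts.get((w1, w2), Counter())
          |       weights = [c[w] + alpha for w in vocab]
          |       return random.choices(vocab, weights=weights, k=1)[0]
          | 
          |   words = open("input.txt").read().split()
          |   vocab = sorted(set(words))
          |   counts = train(words)
          |   w1, w2 = random.choice(list(counts))
          |   out = [w1, w2]
          |   for _ in range(50):
          |       w3 = sample_next(counts, w1, w2, vocab)
          |       out.append(w3)
          |       w1, w2 = w2, w3
          |   print(" ".join(out))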
         | 
         | >> As many know and point out, this idea is now very old (40+
         | years?).
         | 
         | More like 110 years. Markov Chains were proposed in 1913 (by
         | Markov) and popularised by Shannon in 1948 (in "A Mathematical
         | Theory of Communication", the work that introduced information
         | theory):
         | 
         |  _The underlying mathematics of the n-gram was first proposed
         | by Markov (1913), who used what are now called Markov chains
         | (bigrams and trigrams) to predict whether an upcoming letter in
          | Pushkin's Eugene Onegin would be a vowel or a consonant.
         | Markov classified 20,000 letters as V or C and computed the
         | bigram and trigram probability that a given letter would be a
         | vowel given the previous one or two letters. Shannon (1948)
         | applied n-grams to compute approximations to English word
         | sequences. Based on Shannon's work, Markov models were commonly
         | used in engineering, linguistic, and psychological work on
         | modeling word sequences by the 1950s. In a series of extremely
         | influential papers starting with Chomsky (1956) and including
         | Chomsky (1957) and Miller and Chomsky (1963), Noam Chomsky
         | argued that "finite-state Markov processes", while a possibly
         | useful engineering heuristic, were incapable of being a
          | complete cognitive model of human grammatical knowledge.
         | These arguments led many linguists and computational linguists
         | to ignore work in statistical modeling for decades._
         | 
         | https://web.archive.org/web/20220522005827/https://web.stanf...
         | 
         | (Note the usual shaking of angry fists at Chomsky. The NLP
         | community continued on its path anyway, and built smooth,
         | fluent generators of total bullshit and absolutely no progress
         | towards a "cognitive model of human grammatical knowledge").
        
           | shawnz wrote:
           | > A better n-gram model would calculate the probability of an
           | n-gram following another n-gram according to their frequency
           | in the training corpus. With a larger corpus and better
           | training techniques (some smoothing for one thing) you'd get
           | much more coherent output.
           | 
           | I don't understand what you mean here, isn't that exactly
           | what an n-gram model is already designed to achieve where n >
           | 1? Since this is a 2-gram model isn't it already doing this?
        
           | danieldk wrote:
           | _The output is "word salad" because the probabilities of the
            | model are uniform, albeit implicitly: eyeballing the Python,
           | it selects the next n-gram uniformly at random._
           | 
            | From a quick look, it doesn't seem to sample uniformly. w3 is
            | added to a list for the context w1, w2, not a set. So if word A
            | occurs twice as often as word B in a particular context, it will
            | appear in the list twice as often, and even though a uniform
            | choice function is used, the probability of A getting sampled is
            | twice as high.
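            | 
            | A tiny check of that (the context and words here are made
            | up):
            | 
            |   import random
            |   from collections import Counter
            | 
            |   # suppose the context ("the", "cat") was followed by "sat"
            |   # twice and "ran" once, so the list holds a duplicate
            |   followers = ["sat", "sat", "ran"]
            | 
            |   draws = Counter(random.choice(followers)
            |                   for _ in range(10_000))
            |   print(draws)  # roughly 2/3 "sat", 1/3 "ran"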
           | 
            | You get a word salad because a trigram model has too little
           | context to do anything else. This is a well-known issue with
           | Markov and hidden Markov models.
           | 
           | (Fun fact: some hidden Markov model taggers switch to a
           | different trigram distribution for V2 languages after seeing
           | the finite verb, otherwise they often fail catastrophically
           | in the verb cluster due to the limited context.)
        
           | jameshart wrote:
           | > absolutely no progress towards a "cognitive model of human
           | grammatical knowledge"
           | 
           | The thing is, it's only if you buy fully into Chomskyan
           | thinking that you think a "cognitive model of human
           | grammatical knowledge" might even be useful. Or that it's a
           | particularly special thing compared to any other cognitive
           | model of human knowledge.
        
         | doubloon wrote:
          | I put Alexis de Tocqueville through this and the results were
         | legible but hard to read.
         | 
         | Then I put a demagogue politician speech through this and the
         | results were almost like a speech that politician would have
         | given. It was hard to tell the difference between the original
         | and the generated.
         | 
          | In other words, if the public admires word salad, and someone
         | speaks in word salad, then word salad output would not be
         | considered as problematic by a large number of people.
        
           | chuckadams wrote:
           | I recall one site way back that used a Markov chain generator
           | to mash up Karl Marx and Ayn Rand. Was fairly plausible
           | reading, actually.
        
             | nurettin wrote:
             | I imagine that's like mixing different kinds of salad.
        
           | hermitcrab wrote:
           | I did a markov reworking of some of Trump's speeches and got:
           | 
           | "Hillary brought death and disaster to Iraq, Syria and Libya,
           | she empowered Iran, and she unleashed ISIS. Now she wants to
           | raise your taxes very substantially. Highest taxed nation in
           | the world is a tenant of mine in Manhattan, so many great
           | people. These are people that have been stolen, stolen by
           | either very stupid politicians ask me the question, how are
           | you going to get rid of all the emails?" "Yes, ma'am, they're
           | gonna stay in this country blind. My contract with the
           | American voter begins with a plan to end government that will
           | not protect its people is a government corruption at the
           | State Department of Justice is trying as hard as they can to
           | protect religious liberty"
           | 
           | Details at:
           | https://successfulsoftware.net/2019/04/02/bloviate/
        
             | vorticalbox wrote:
              | When the n-gram length is high (10), how do you pick what to
              | start from?
        
               | hermitcrab wrote:
               | IIRC it starts with the same few characters as the source
               | text.
        
       | max_ wrote:
       | I wonder how this would behave if trained on 1 trillion words
       | like the LLMs.
       | 
        | Also, is training a Markov chain cheaper than training neural
        | nets? It
       | would be a great way to cut AI costs if they could be made as
       | effective as neural nets.
       | 
       | There is also an interesting post by Orange duck. [0]
       | 
       | [0] https://theorangeduck.com/page/17-line-markov-chain
        
         | gattilorenz wrote:
         | The only way to make it sound less like a word salad is to
         | increase the n-gram length (e.g. use the last 10 words to
         | predict the next), but at that point it starts repeating the
         | corpus.
         | 
         | I guess technically you can train on a huge corpus like those
         | of NNs to mitigate that, but you'll end up with more refined
         | word salads then (edit: refined as in "locally coherent but
         | globally still a word salad")
        
           | sgt101 wrote:
           | well - the lawyers from the NYT called, and they disagree...
        
             | gattilorenz wrote:
             | I didn't get the call, so it's not clear what they disagree
             | on :)
             | 
              | I doubt they have anything to say about Markov chains. We're
              | talking about technical possibilities, not the legality of
              | training corpora.
        
               | sgt101 wrote:
                | The NNs trained on huge corpora appear to be prone to
                | memorization as well.
        
           | danieldk wrote:
           | Yes, the problem with such long context lengths is sparsity -
           | you still don't have enough data points to make
           | representative many-gram distributions. This was exactly the
           | motivation for Bengio and others to propose neural language
           | models in 2003:
           | 
           | https://dl.acm.org/doi/10.5555/944919.944966
           | 
           | I recommend everyone interested in neural language models
           | and/or wondering why we can't scale up Markov models by just
           | using longer ngrams to read this paper. It's pretty
           | accessible and explains how we got to where we are now in
           | 2023.
        
         | stabbles wrote:
         | Yeah, a century old approach using frequency tables where the
         | next word is generated based just on the previous two words is
         | highly competitive with contemporary neural nets. To produce
         | coherent sentences, do you ever need to remember more than the
         | last two words you said? I doubt it.
        
           | CamperBob2 wrote:
           | Now ask it to render a deepfake of Noam Chomsky waving his
           | hands saying "Nothing to see here."
        
       | bradrn wrote:
       | For a more thorough exploration of Markov text generation, I like
       | this article: https://zompist.com/markov.html
        
       | alejoar wrote:
       | We had a bot that would randomly say things in our IRC channel 15
       | years ago that worked like this. You could also mention it to
       | prompt it to reply.
       | 
        | Every message was added to its knowledge base and it would say
       | random but hilarious stuff made up from all the nonsense we used
       | to talk about.
       | 
       | Good times.
        
         | duke_sam wrote:
         | For a hack day project someone took our IRC logs to do
         | something similar for employees who had left (one bot per
         | engineer, small company so a handful of bots). This made for a
         | great afternoon until the bots started talking to each other
         | and then started triggering the simple command the ops bot
         | would accept. It quickly got shut down.
        
         | Lukkaroinen wrote:
          | I wrote a bot like this. It was fun to play around with.
        
         | bhattid wrote:
         | Out of curiosity, was the bot based on the bMotion repository?
         | https://github.com/jamesoff/bmotion
         | 
          | I remember a friend of mine setting up an IRC bot (named Zeta)
         | like that for his sheet music forum many years ago. She was
         | involved in a lot of hilarity - probably my favorite antics
          | were when she randomly decided to court-martial someone. Good
         | times indeed! :)
        
           | seabass-labrax wrote:
            | > she randomly decided to court-martial someone
           | 
           | Well, you can't stave off fate - you've got to face the music
           | eventually.
        
       | ZetaH wrote:
       | I was doing a similar experiment recently to generate random
       | names that sound like names from a specific language.
       | 
       | I was breaking the list of names apart into 3 and 2 letter parts,
       | marking which fragments are from the start, middle and end.
       | 
       | To generate the words I started from a random one from the start
       | fragments, then continued with a random one from the middle
        | fragments that starts with the letter that the previous one ended
       | with, and similarly ended it with one from the end fragments.
       | Some examples:
       | 
       | Spanish Names:
       | 
       | Armusa Vantara Modria
       | 
       | German Names:
       | 
       | Ven Marwar, Ger Naroff, Vort Kraldent, Gorn Henter, Urg Wicher,
       | Wan Ehranus, Eck Hayazin, Wert Biewin, Rein Relberid,
       | 
       | Catalan:
       | 
       | Pallava Ecorus Sangana Ginavari Telamita Exorxio
       | 
       | Hungarian cities:
       | 
       | Joszog Alszeny Hernafo Garnaza Ragytur Hidacska Mezecs
       | 
       | (edit formatting)
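        | 
        | Roughly, the stitching might look like this (a sketch; the seed
        | list is made up, and merging the shared letter is a guess at the
        | approach described above):
        | 
        |   import random
        | 
        |   def split_name(name):
        |       # break a name into 2-3 letter chunks; the first chunk is
        |       # a "start", the last an "end", the rest are "middles"
        |       name, chunks, i = name.lower(), [], 0
        |       while i < len(name):
        |           size = random.choice((2, 3))
        |           chunks.append(name[i:i + size])
        |           i += size
        |       return chunks
        | 
        |   def build_fragments(names):
        |       starts, middles, ends = [], [], []
        |       for name in names:
        |           chunks = split_name(name)
        |           if len(chunks) < 2:
        |               continue
        |           starts.append(chunks[0])
        |           middles.extend(chunks[1:-1])
        |           ends.append(chunks[-1])
        |       return starts, middles, ends
        | 
        |   def extend(word, fragments):
        |       if not fragments:
        |           return word
        |       # prefer fragments whose first letter matches the last
        |       # letter so far, merging the shared letter
        |       matching = [f for f in fragments if f[0] == word[-1]]
        |       if matching:
        |           return word + random.choice(matching)[1:]
        |       return word + random.choice(fragments)
        | 
        |   names = ["Armando", "Vanessa", "Modesta", "Rosalia"]
        |   starts, middles, ends = build_fragments(names)
        |   name = extend(extend(random.choice(starts), middles), ends)
        |   print(name.capitalize())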
        
         | jokoon wrote:
          | I remember finding a set of English names generated with a
          | Markov chain. I wish I knew how to make it; they sounded good.
        
           | pavel_lishin wrote:
            | > _I wish I knew how to make it_
           | 
           | You can learn! I know it's easy to say something is easy to a
           | beginner, but figuring out Markov chains is truly something
           | you can get the basics of over a weekend if you've ever
           | written any software at all.
        
       | weare138 wrote:
       | https://en.wikipedia.org/wiki/MegaHAL
        
       | forinti wrote:
       | I did this once with Perl and a book called Eu by the Brazilian
       | poet Augusto dos Anjos. It spat out verses and I joined a few to
       | create this little thing:
       | 
       | Andam monstros sombrios pela escuridao dos remorsos
       | 
       | Pairando acima dos transeuntes
       | 
       | Maldito seja o genero humano
       | 
       | Prostituido talvez em desintegracoes maravilhosas
       | 
       | Source code: https://alquerubim.blogspot.com/2018/01/gerador-de-
       | augusto-d...
        
       | Frummy wrote:
        | For a group project we did Markov chains with some custom
        | algorithmic adjustments, scraping Reddit via the API and an
        | academic torrent (that data required a lot of cleaning, was orders
        | of magnitude bigger, and was run separately), to simulate posts
        | and comments. In the same project we also implemented a Variable
        | Length Markov Chain with a tree-based custom context length, and a
        | team member did a matrix-based implementation of custom context
        | length as well through some matrix magic.
        
         | Frummy wrote:
          | Genuinely curious to hear what's wrong with what I thought was
          | a cool, fun time implementing Markov chains. We learned a lot
          | about statistics and linear algebra, matrix diagonalisation and
          | exponentiation. I find it very relevant and can answer questions
          | if you don't understand.
        
       | mebassett wrote:
        | This is old tech, but it had me thinking. Markov chains pick the
        | next token from a random set, giving approximately all possible
        | tokens (from the training set) an equal probability. What if it
        | weighted the probability, say using the inverse vector distance or
        | cosine similarity of the neighbors as a proxy for probability,
        | where the vector embedding came from word2vec... how close would
        | the performance be to a transformer model, or even something like
        | LSTM RNNs? (I suppose I'm cheating a bit using word2vec here. I
        | might as well just say I'm using the attention mechanism...)
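        | 
        | A hand-wavy sketch of that weighting (the tiny embedding table
        | below is made up; a real experiment would load actual word2vec
        | vectors):
        | 
        |   import random
        |   import numpy as np
        | 
        |   # toy vectors standing in for word2vec embeddings
        |   emb = {
        |       "cat":  np.array([0.9, 0.1, 0.0]),
        |       "dog":  np.array([0.8, 0.2, 0.1]),
        |       "sat":  np.array([0.1, 0.9, 0.2]),
        |       "flew": np.array([0.0, 0.2, 0.9]),
        |   }
        | 
        |   def cosine(a, b):
        |       return float(a @ b /
        |                    (np.linalg.norm(a) * np.linalg.norm(b)))
        | 
        |   def weighted_next(prev_word, candidates):
        |       # weight each Markov candidate by its similarity to the
        |       # previous word, then sample proportionally
        |       sims = [cosine(emb[prev_word], emb[c]) for c in candidates]
        |       weights = [np.exp(s) for s in sims]  # keep weights positive
        |       return random.choices(candidates, weights=weights, k=1)[0]
        | 
        |   print(weighted_next("cat", ["sat", "flew", "dog"]))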
        
         | pona-a wrote:
         | That sounds interesting, but still fails to capture long-range
         | dependencies.
         | 
         | If I understand correctly, what you're proposing is to replace
         | co-occurrence frequency with word2vec cosine similarity.
         | 
          | I suppose it may help improve overall performance, but you're
          | still just blindly predicting the next word based on the
          | previous one, like a first-order Markov chain would.
         | 
         | For example, it won't ever fit "2 plus 2 equals 4," because
         | right when we get to equals, we discard all the previous words.
         | 
         | Perhaps if we could get the embedding model to consider the
         | full sentence and then produce a set of probability-scored next
         | token predictions it may work, but now we've just reinvented a
         | transformer.
        
           | csmpltn wrote:
           | > "I suppose it may help improve overall performance, you're
           | still just blindly predicting the next word based on the
           | previous one like a first order Markov chain would."
           | 
           | Instead of only taking in the last "token" as context to the
           | function that generates the next token - take the last 15
           | tokens (ie. the last 2-3 sentences), and predict based on
           | that. And that's your "attention" mechanism.
        
         | terminous wrote:
         | Yeah, that's just attention with extra steps
        
       | kwhitefoot wrote:
       | > using Lewis Carroll's Alice in Wonderland as input. The book is
       | more or less nonsense to begin with,
       | 
       | Has he ever read Alice?
        
       | ltr_ wrote:
        | When I read about Markov chains and LLM hallucinations, the first
        | thing that comes to my mind is Reggie Watts's performances [0]. It
        | also reminds me that I used to do the same thing when I was bored
        | in the classroom, but in my paper notebooks: just nonsense that
        | seems plausible. It was like entering a meditative state; it felt
        | good. When I discovered Watts I thought to myself, 'What?! I could
        | have done that for a living?'
       | 
       | [0] https://www.youtube.com/watch?v=UXyHf_SpUUI
        
       | samsquire wrote:
       | I had an idea.
       | 
        | What if you trained a Markov chain and got weights for every pair,
        | or sequence of pairs, up to a certain length?
        | 
        | Could you also do rotations with this information in vector
        | space? With multiple degrees of freedom?
       | 
        | What if you could do inverse kinematics with Markov chains?
       | 
       | Like robot planning for a goal?
        
       | dahart wrote:
       | See also Mark V Shaney
       | https://en.wikipedia.org/wiki/Mark_V._Shaney
       | 
       | This was a class assignment in college, we had a lot of fun with
       | it, and one of my classmates, the brilliant Alyosha Efros,
       | decided to apply the exact same technique to images instead of
       | text. It turned into a paper that revitalized texture synthesis
        | in the SIGGRAPH community. The most interesting part about it (in
       | my opinion) is that he ran it on images of text, and it produces
       | images of readable text nonsense! With the right window size,
       | perhaps there's a nonzero possibility of being able to produce
       | the same text either way. This always struck me as very meta and
       | makes me wonder if there are ways we could go in reverse with
       | image processing (or other types of data); if there's a latent
       | underlying representation for the information contained in image
       | content that is much smaller and more efficient to work with.
       | Neural networks might or might not be providing evidence.
       | 
       | https://people.eecs.berkeley.edu/~efros/research/EfrosLeung....
        
         | fluoridation wrote:
         | That's amazing! It's a literal uncrop. I'll have to look into
         | this later.
        
       | AndruLuvisi wrote:
       | The first edition of Programming Perl had a script that did this
       | called travesty. You can download it from
       | https://resources.oreilly.com/examples/9780937175644
       | 
       | When I was learning Perl in the '90s, I remember having a lot of
       | fun running this on different manual pages.
        
         | chuckadams wrote:
          | It's also built into Emacs: I believe it's called
          | dissociated-press.
        
       | heikkilevanto wrote:
        | I remember that as a schoolboy I heard about Markov chains and
        | played a bit with them. Instead of building big n-gram tables, I
        | found a simpler way: start with one n-gram, then scan the source
        | text until you find the next occurrence of it, and take the next
        | letter after it. Output it, append it to your n-gram, dropping its
        | first letter, and repeat until you have a sufficient amount of
        | nonsense text.
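        | 
        | Something along these lines (a sketch of the scan-and-slide
        | approach described above; the input file name is made up):
        | 
        |   import random
        | 
        |   def babble(text, n=6, length=300):
        |       start = random.randrange(len(text) - n)
        |       gram = text[start:start + n]
        |       out, pos = gram, start + n
        |       for _ in range(length):
        |           # find the next occurrence of the current n-gram,
        |           # wrapping back to the start of the text if needed
        |           i = text.find(gram, pos)
        |           if i == -1:
        |               i = text.find(gram)
        |           if i == -1 or i + n >= len(text):
        |               break
        |           nxt = text[i + n]
        |           out += nxt
        |           gram = gram[1:] + nxt  # slide the window one letter
        |           pos = i + 1
        |       return out
        | 
        |   text = open("input.txt").read()
        |   print(babble(text))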
        
       | mdberkey wrote:
       | One of my first personal projects was making 2 discord bots that
       | would have an infinite "conversation" using Markov chains.
       | Feeding them text from Reddit produced some hilarious results.
        
       | hermitcrab wrote:
       | I created a little program in C++/Qt for using Markov chains to
       | re-write text. But it works on character sequences, rather than
       | words. There are examples and binaries for Windows and Mac:
       | 
       | https://successfulsoftware.net/2019/04/02/bloviate/
        
       | qarl wrote:
       | https://archive.org/details/ComputerRecreationsMarkovChainer
        
       | swayvil wrote:
       | Plausible nonsense of sufficient complexity and consistency as to
       | seem real. It's trivial to generate.
       | 
       | It casts every cherished truth into doubt.
        
       ___________________________________________________________________
       (page generated 2024-01-01 23:01 UTC)