[HN Gopher] ChatLZMA - text generation from data compression
       ___________________________________________________________________
        
       ChatLZMA - text generation from data compression
        
       Author : bschne
       Score  : 67 points
       Date   : 2023-08-30 07:34 UTC (15 hours ago)
        
 (HTM) web link (pepijndevos.nl)
 (TXT) w3m dump (pepijndevos.nl)
        
       | awayto wrote:
        | This reminds me of an interesting experiment I did earlier this
       | year with ChatGPT.
       | 
       | First, I came upon this reddit post [1] which describes being
       | able to convert text into some ridiculous symbol soup that makes
       | sense to ChatGPT.
       | 
       | Then, I considered the structure of my Typescript type files, ex
       | [2], which are pretty straightforward and uniform, all things
       | considered.
       | 
       | Playing around with the reddit compression prompt, I realized it
       | performed poorly just passing in my type structures. So I made a
       | simple script which essentially turned my types into a story.
       | 
        | Given a type definition:
        | 
        |     type IUserProfile = {
        |       name: string;
        |       age: number;
        |     }
       | 
       | It's somewhat trivial to make a script to turn these into
       | sentence structures, given the type is simple enough:
       | 
       | "IUserProfile contains: name which is a string; age which is a
       | number; .... IUserProfiles contains: users which is an array of
       | IUserProfile" and so on.
       | 
       | Passing this into the compression prompt was much more effective,
       | and I ended up with a compressed version of my type system [3].
       | 
       | Regardless of the variability of the exercise, I can definitely
       | say the prompt was able to generate some sensible components
       | which more or less correctly implemented my type system when
       | asked to, with some massaging. Not scalable, but interesting.
       | 
       | [1]
       | https://www.reddit.com/r/ChatGPT/comments/12cvx9l/compressio...
       | 
       | [2]
       | https://github.com/jcmccormick/wc/blob/c222aa577038fb55156b4...
       | 
       | [3]
       | https://github.com/keybittech/wizapp/blob/f75e12dc3cc2da3a41...
        
         | ericlewis wrote:
          | I'm curious, did you actually run it through the tokenizer
          | and see if it used fewer tokens than the uncompressed
          | version? I have seen a lot of people try these "compression"
          | schemes, and token usage can end up higher.
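          | 
          | E.g., a quick check with OpenAI's tiktoken library (assuming
          | the cl100k_base encoding used by gpt-3.5-turbo / gpt-4):
          | 
          |     import tiktoken
          | 
          |     # cl100k_base is the encoding used by gpt-3.5-turbo / gpt-4
          |     enc = tiktoken.get_encoding("cl100k_base")
          | 
          |     original = "IUserProfile contains: name which is a string;"
          |     compressed = "..."  # whatever the compression prompt emitted
          | 
          |     print("original:  ", len(enc.encode(original)), "tokens")
          |     print("compressed:", len(enc.encode(compressed)), "tokens")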
        
       | api wrote:
       | Wasn't there some work a while back on training LLMs on
       | compressed data?
        
       | haxton wrote:
        | GPT4[0] is actually very good with base64, to the point where
        | it can make perfect sense of it.
       | 
       | I'd be interested in how well you could finetune 3.5 to use
       | different compression.
       | 
       | [0] -
       | https://platform.openai.com/playground/p/hfLUBCTE8RrRYPIRxEe...
        
       | bob1029 wrote:
       | The compressor idea is really clever, but wouldn't it be nice to
       | have 100% direct control over everything?
       | 
       | This got me thinking about the possibility of building a series
       | of simple context/token probability tables in SQLite and running
       | the show that way. Assuming we don't require _massive_ context
       | windows, what would prevent this from working?
       | 
       | It's not like we need to touch every row in the database all at
       | the same time or load everything into RAM. Prediction is just an
        | iterative query over a basic table: you could have a simple key-
        | value pair of context & the next most likely token for the given
       | context. All manner of normalization and database trickery
       | available for abuse here. Clearly a shitload of rows, but I've
       | seen some 10TB+ databases still satisfy queries in seconds. You
       | could even store additional statistics per token/context for
       | online learning scenarios (aka query-time calculation of token
       | probabilities). You could keep multiple tokenization schemes
       | online at the same time and combine them with various weightings.
       | 
       | What would be more efficient/cheaper than this if we could make
       | it fit? Wouldn't it be easier to iterate basic tables of data and
       | some SQL queries than to trip over python ML toolchains and GPU
       | drivers all day?
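        | 
        | A toy sketch of what that could look like with Python's
        | built-in sqlite3 (the schema and greedy lookup are just one
        | possible layout, not a claim that it scales):
        | 
        |     import sqlite3
        | 
        |     # One row per (context, next_token) pair with a count;
        |     # prediction = most frequent continuation of the context.
        |     db = sqlite3.connect(":memory:")
        |     db.execute("""CREATE TABLE ngrams (
        |         context TEXT, next_token TEXT, n INTEGER,
        |         PRIMARY KEY (context, next_token))""")
        | 
        |     def observe(context, next_token):
        |         db.execute("""INSERT INTO ngrams VALUES (?, ?, 1)
        |             ON CONFLICT(context, next_token)
        |             DO UPDATE SET n = n + 1""", (context, next_token))
        | 
        |     def predict(context):
        |         row = db.execute("""SELECT next_token FROM ngrams
        |             WHERE context = ?
        |             ORDER BY n DESC LIMIT 1""", (context,)).fetchone()
        |         return row[0] if row else None
        | 
        |     words = "the cat sat on the mat and the cat slept".split()
        |     for a, b in zip(words, words[1:]):
        |         observe(a, b)
        | 
        |     token, out = "the", ["the"]
        |     for _ in range(5):
        |         token = predict(token) or "<eos>"
        |         out.append(token)
        |     print(" ".join(out))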
        
         | hiddencost wrote:
         | Weighted Finite State Transducers in speech recognition:
         | https://scholar.google.com/scholar?q=finite+state+transducer...
         | 
         | Modified Kneser-Ney smoothing: https://en.m.wikipedia.org/wiki/
         | Kneser%E2%80%93Ney_smoothing....
         | 
         | We've been here before, neural LMs replaced that generation of
         | models.
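          | 
          | For reference, the interpolated bigram form of Kneser-Ney
          | looks roughly like this (textbook formulation, nothing
          | specific to the linked papers):
          | 
          |     P_{KN}(w_i \mid w_{i-1})
          |       = \frac{\max(c(w_{i-1} w_i) - d,\ 0)}{c(w_{i-1})}
          |       + \lambda(w_{i-1}) \, P_{cont}(w_i),
          | 
          |     \lambda(w_{i-1})
          |       = \frac{d}{c(w_{i-1})} \, |\{ w : c(w_{i-1} w) > 0 \}|,
          | 
          |     P_{cont}(w_i)
          |       = \frac{|\{ w' : c(w' w_i) > 0 \}|}
          |              {|\{ (w', w'') : c(w' w'') > 0 \}|}.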
        
       | blueberrychpstx wrote:
        | Reminds me of trying to read Gulliver's Travels
        
         | duskwuff wrote:
         | You might be thinking of some other literary work; Gulliver's
         | Travels isn't known for being particularly hard to read.
         | 
         | Myself, I was reminded of _Finnegans Wake_ by James Joyce.
        
       | marcodiego wrote:
        | Actually, a neural network is just that: lossily compressed
        | data. A transformer makes multiple queries to a large, lossy,
        | stochastically compressed database to determine the next token
        | to generate. The PAQ archiver is famous for being exactly that:
        | a neural network that predicts the next symbol.
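        | 
        | The same idea runs the other way, too. A toy sketch of using a
        | general-purpose compressor as a next-token scorer (just an
        | illustration, not the linked post's actual code):
        | 
        |     import lzma
        | 
        |     # Use compressed size as a stand-in for probability: a
        |     # continuation that compresses well together with the
        |     # context is the one the compressor's model "predicts".
        |     def score(context: str, continuation: str) -> int:
        |         data = (context + continuation).encode("utf-8")
        |         return len(lzma.compress(data))
        | 
        |     corpus = "the quick brown fox jumps over the lazy dog. " * 50
        |     for cand in ["fox", "fog", "box", "cat"]:
        |         print(cand, score(corpus + "the quick brown ", cand))
        |     # Continuations the compressor has already "seen" tend to
        |     # add the fewest bytes, i.e. get the best score.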
        
       | PaulHoule wrote:
        | I've lately been playing Disgaea PC, which, unlike a lot of
        | games these days, has good text FAQs like
       | 
       | https://gamefaqs.gamespot.com/pc/183289-disgaea-pc/faqs/2623...
       | 
        | and it got me thinking about a question that's been on my mind
        | for a while: extracting facts from that sort of document. One
        | notable thing is that certain named entities (say "Cave of
        | Ordeal") appear over and over throughout the document, and both
        | attention and compression-based approaches can draw a line
        | between those occurrences.
        
       ___________________________________________________________________
       (page generated 2023-08-30 23:00 UTC)