[HN Gopher] ChatLZMA - text generation from data compression
___________________________________________________________________
ChatLZMA - text generation from data compression
Author : bschne
Score : 67 points
Date : 2023-08-30 07:34 UTC (15 hours ago)
(HTM) web link (pepijndevos.nl)
(TXT) w3m dump (pepijndevos.nl)
| awayto wrote:
| This reminds me of an interesting experiment I did earlier this
| year with ChatGPT.
|
| First, I came upon this reddit post [1] which describes being
| able to convert text into some ridiculous symbol soup that makes
| sense to ChatGPT.
|
| Then, I considered the structure of my TypeScript type files,
| e.g. [2], which are pretty straightforward and uniform, all
| things considered.
|
| Playing around with the reddit compression prompt, I realized it
| performed poorly just passing in my type structures. So I made a
| simple script which essentially turned my types into a story.
|
| Given a type definition:
|
|     type IUserProfile = {
|       name: string;
|       age: number;
|     }
|
| It's somewhat trivial to make a script to turn these into
| sentence structures, given the type is simple enough:
|
| "IUserProfile contains: name which is a string; age which is a
| number; .... IUserProfiles contains: users which is an array of
| IUserProfile" and so on.
|
| Passing this into the compression prompt was much more effective,
| and I ended up with a compressed version of my type system [3].
|
| Regardless of the variability of the exercise, I can definitely
| say the prompt was able to generate some sensible components
| which more or less correctly implemented my type system when
| asked to, with some massaging. Not scalable, but interesting.
|
| [1]
| https://www.reddit.com/r/ChatGPT/comments/12cvx9l/compressio...
|
| [2]
| https://github.com/jcmccormick/wc/blob/c222aa577038fb55156b4...
|
| [3]
| https://github.com/keybittech/wizapp/blob/f75e12dc3cc2da3a41...
| ericlewis wrote:
| I'm curious, did you actually run it through the tokenizer and
| see if it came out to fewer tokens than the uncompressed version?
| I have seen a lot of people try these "compression" schemes, and
| the token usage can actually end up higher.
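|
| (For reference, a quick way to check, assuming OpenAI's tiktoken
| library; the strings below are placeholders, not a measured
| example:)
|
|     import tiktoken
|
|     # cl100k_base is the encoding used by GPT-3.5/GPT-4
|     enc = tiktoken.get_encoding("cl100k_base")
|
|     original = "IUserProfile contains: name which is a ..."
|     compressed = "..."  # whatever the compression prompt emitted
|
|     print(len(enc.encode(original)), len(enc.encode(compressed)))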
| api wrote:
| Wasn't there some work a while back on training LLMs on
| compressed data?
| haxton wrote:
| GPT-4 [0] is actually very good with base64, to the point where
| it can make perfect sense of it.
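|
| (Easy to try yourself; a trivial sketch using Python's base64
| module to produce such an input, with a made-up prompt:)
|
|     import base64
|
|     msg = "Reply with the decoded text: what is 2 + 2?"
|     print(base64.b64encode(msg.encode("utf-8")).decode("ascii"))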
|
| I'd be interested in how well you could fine-tune 3.5 to use a
| different compression scheme.
|
| [0] -
| https://platform.openai.com/playground/p/hfLUBCTE8RrRYPIRxEe...
| bob1029 wrote:
| The compressor idea is really clever, but wouldn't it be nice to
| have 100% direct control over everything?
|
| This got me thinking about the possibility of building a series
| of simple context/token probability tables in SQLite and running
| the show that way. Assuming we don't require _massive_ context
| windows, what would prevent this from working?
|
| It's not like we need to touch every row in the database all at
| the same time or load everything into RAM. Prediction is just an
| iterative query over a basic table: you could have a simple key-
| value pair of context and the next most likely token for the
| given context. All manner of normalization and database trickery
| is available for abuse here. Clearly a shitload of rows, but I've
| seen 10TB+ databases still satisfy queries in seconds. You could
| even store additional statistics per token/context for online
| learning scenarios (i.e. query-time calculation of token
| probabilities). You could keep multiple tokenization schemes
| online at the same time and combine them with various weightings.
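|
| (A minimal sketch of that layout, assuming a tiny word-level
| context and greedy lookup; the schema and names are made up for
| illustration:)
|
|     import sqlite3
|
|     db = sqlite3.connect("lm.db")
|     # Counts of "token seen after this context"; prediction is
|     # just an ORDER BY over this table.
|     db.execute("""CREATE TABLE IF NOT EXISTS ngrams (
|         context TEXT, token TEXT, count INTEGER,
|         PRIMARY KEY (context, token))""")
|
|     def next_token(context):
|         row = db.execute(
|             "SELECT token FROM ngrams WHERE context = ?"
|             " ORDER BY count DESC LIMIT 1", (context,)).fetchone()
|         return row[0] if row else None
|
|     def generate(prompt, n=20):
|         out = prompt.split()
|         for _ in range(n):
|             tok = next_token(" ".join(out[-2:]))  # 2-word context
|             if tok is None:
|                 break
|             out.append(tok)
|         return " ".join(out)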
|
| What would be more efficient or cheaper than this, if we could
| make it fit? Wouldn't it be easier to iterate over basic tables
| of data with some SQL queries than to trip over Python ML
| toolchains and GPU drivers all day?
| hiddencost wrote:
| Weighted Finite State Transducers in speech recognition:
| https://scholar.google.com/scholar?q=finite+state+transducer...
|
| Modified Kneser-Ney smoothing: https://en.m.wikipedia.org/wiki/
| Kneser%E2%80%93Ney_smoothing....
|
| We've been here before; neural LMs replaced that generation of
| models.
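|
| (For context, a compact sketch of what interpolated Kneser-Ney
| looks like for bigrams, i.e. the kind of statistic a table like
| the one described above would hold; illustrative Python, not
| taken from either link:)
|
|     from collections import Counter, defaultdict
|
|     def kneser_ney_bigram(tokens, d=0.75):
|         bigrams = Counter(zip(tokens, tokens[1:]))
|         context = Counter(tokens[:-1])    # c(u)
|         followers = defaultdict(set)      # distinct w after u
|         histories = defaultdict(set)      # distinct u before w
|         for u, w in bigrams:
|             followers[u].add(w)
|             histories[w].add(u)
|         types = len(bigrams)              # distinct bigram types
|
|         def prob(w, u):                   # P(w | u)
|             cont = len(histories[w]) / types
|             if context[u] == 0:
|                 return cont
|             lam = d * len(followers[u]) / context[u]
|             disc = max(bigrams[(u, w)] - d, 0) / context[u]
|             return disc + lam * cont
|
|         return prob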
| blueberrychpstx wrote:
| Reminds me of trying to read Gulliver's Travels.
| duskwuff wrote:
| You might be thinking of some other literary work; Gulliver's
| Travels isn't known for being particularly hard to read.
|
| Myself, I was reminded of _Finnegans Wake_ by James Joyce.
| marcodiego wrote:
| Actually, a neural network is just that: lossily compressed data.
| A transformer makes multiple queries to a large lossy,
| stochastically compressed database to determine the next token to
| generate. The PAQ archiver is famous for being exactly that: a
| neural network predicting the next symbol.
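|
| (A crude illustration of compression-as-prediction with Python's
| lzma module; a sketch of the general idea, not how PAQ or the
| linked post actually implement it:)
|
|     import lzma
|
|     def next_char(corpus, context,
|                   alphabet="abcdefghijklmnopqrstuvwxyz ,."):
|         # Pick the symbol whose addition compresses best given
|         # everything seen so far.
|         def cost(text):
|             return len(lzma.compress(text.encode("utf-8")))
|         return min(alphabet,
|                    key=lambda ch: cost(corpus + context + ch))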
| PaulHoule wrote:
| I was recently playing Disgaea PC, which, unlike a lot of games
| these days, has good text FAQs like
|
| https://gamefaqs.gamespot.com/pc/183289-disgaea-pc/faqs/2623...
|
| and got to thinking about a question I've had for a while:
| extracting facts from that sort of thing. One notable thing is
| that certain named entities appear over and over throughout the
| document (say, "Cave of Ordeal"), and both attention-based and
| compression-based approaches can draw a line between those
| occurrences.
___________________________________________________________________
(page generated 2023-08-30 23:00 UTC)