[HN Gopher] A Multimodal Dataset with One Trillion Tokens
___________________________________________________________________
A Multimodal Dataset with One Trillion Tokens
Author : kulikalov
Score : 76 points
Date : 2024-07-24 20:04 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| punnerud wrote:
| More info on the Salesforce blog:
| https://blog.salesforceairesearch.com/mint-1t/
| j7ake wrote:
| Wow, did not expect Salesforce to be behind this.
|
| It's basically free advertising for technical people to join
| Salesforce.
| jszymborski wrote:
| Salesforce has long been involved in publishing quality NLP
| papers, especially during Stephen Merity's tenure.
|
| Smerity's papers are some of my favourites. Check out
|
| https://ar5iv.labs.arxiv.org/html/1708.02182
|
| And my all-time favourite
|
| https://ar5iv.labs.arxiv.org/html/1911.11423
| optimalsolver wrote:
| How effective would modeling raw byte sequences be, with the
| individual bytes as the "tokens", and a vocabulary of 256
| elements?
|
| You could then train on any kind of digital data.
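|
| A minimal sketch of what "bytes as tokens" means
| (illustrative Python only, not tied to this dataset):
|
|     text = "Hello, world"
|     tokens = list(text.encode("utf-8"))  # ints in 0..255
|     vocab_size = 256  # the whole "tokenizer"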
| akrymski wrote:
| Forget bytes, go for bits. Vocab of size 2. At a theoretical
| level all of AI comes down to a classifier that is able to
| predict the next bit given a string of bits. Check out Tsetlin
| Machines. At some point we will be doing it in hardware.
|
| https://byte-gpt.github.io/
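|
| Illustratively, a bit-level stream is just the byte stream
| unpacked further (a NumPy sketch, assuming nothing beyond
| NumPy itself):
|
|     import numpy as np
|
|     bits = np.unpackbits(np.frombuffer(b"Hi", dtype=np.uint8))
|     # -> [0 1 0 0 1 0 0 0 0 1 1 0 1 0 0 1], a vocab of size 2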
| kulikalov wrote:
| Sounds inefficient. It's like predicting the boiling point of
| a kettle by measuring the speed of individual molecules of
| water.
| BizarroLand wrote:
| That would be surprisingly easy with first-year calculus, as
| long as you were willing to accept a small degree of
| inaccuracy: kinetic theory gives (3/2)k_B*T = (1/2)m*<v^2>,
| so temperature follows directly from the mean squared
| molecular speed.
| nodja wrote:
| Somewhat inefficient for text, very inefficient for images,
| especially if you work in pixel space. The longest context
| any model has been trained on today is about 1M tokens,
| which takes up a lot of memory. Even if context were not an
| issue, a 1000x1000 image is 10^6 pixels; at one byte-token
| per pixel and 100 tokens/s of inference, that is 10^4
| seconds, or roughly 3 hours, to generate a single image.
|
| Google has trained an encoder/decoder LLM directly on bytes,
| called ByT5[1].
|
| [1] https://huggingface.co/google/byt5-xxl
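|
| For reference, ByT5 needs no tokenizer at all. Roughly as in
| its model card (byt5-small shown here for size; the +3
| offsets the byte values past the pad/eos/unk special ids):
|
|     import torch
|     from transformers import T5ForConditionalGeneration
|
|     model = T5ForConditionalGeneration.from_pretrained(
|         "google/byt5-small")
|     input_ids = torch.tensor(
|         [list("Life is like".encode("utf-8"))]) + 3
|     labels = torch.tensor(
|         [list("a box of chocolates.".encode("utf-8"))]) + 3
|     loss = model(input_ids=input_ids, labels=labels).loss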
| Tostino wrote:
| I think the work on multi-token prediction[0] could be a
| significant development for making byte-level models more
| practical. The model predicts several future tokens in
| parallel from a single forward pass, which directly
| addresses the efficiency concerns raised about byte-level
| models.
|
| For tasks that generate large amounts of data (like images),
| this could significantly speed up inference and help
| mitigate the bottleneck the parent comment describes for
| generating a 1000x1000 image.
|
| [0] https://ar5iv.labs.arxiv.org/html/2404.19737
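|
| A hypothetical PyTorch sketch of the core idea (not the
| paper's code): a shared trunk feeds n independent heads, and
| head i predicts the token i+1 steps ahead, so one forward
| pass yields n future tokens:
|
|     import torch
|     import torch.nn as nn
|
|     class MultiTokenPredictor(nn.Module):
|         def __init__(self, d_model=512, vocab_size=256,
|                      n_future=4):
|             super().__init__()
|             # stand-in for a full transformer body
|             self.trunk = nn.Linear(d_model, d_model)
|             self.heads = nn.ModuleList(
|                 nn.Linear(d_model, vocab_size)
|                 for _ in range(n_future))
|
|         def forward(self, h):  # h: (batch, seq, d_model)
|             z = torch.relu(self.trunk(h))
|             # one logits tensor per future offset, in parallel
|             return [head(z) for head in self.heads]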
| joshuamcginnis wrote:
| You might be interested in reading up on LLM-style models
| and tooling for DNA sequences.
| donnyg wrote:
| It has been tried with decent results:
| https://ai.meta.com/blog/ai-self-supervised-learning-data2ve...
___________________________________________________________________
(page generated 2024-07-24 23:00 UTC)