[HN Gopher] A Multimodal Dataset with One Trillion Tokens
       ___________________________________________________________________
        
       A Multimodal Dataset with One Trillion Tokens
        
       Author : kulikalov
       Score  : 76 points
       Date   : 2024-07-24 20:04 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | punnerud wrote:
       | More info on the Salesforce blog:
       | https://blog.salesforceairesearch.com/mint-1t/
        
         | j7ake wrote:
          | Wow, did not expect Salesforce to be behind this.
          | 
          | It's basically free advertising for technical people to join
          | Salesforce.
        
           | jszymborski wrote:
           | Salesforce has long been involved in publishing quality NLP
           | papers, especially during Stephen Merity's tenure.
           | 
            | Smerity's papers are some of my favourites. Check out
           | 
           | https://ar5iv.labs.arxiv.org/html/1708.02182
           | 
           | And my all-time favourite
           | 
           | https://ar5iv.labs.arxiv.org/html/1911.11423
        
       | optimalsolver wrote:
       | How effective would modeling raw byte sequences be, with the
       | individual bytes as the "tokens", and a vocabulary of 256
       | elements?
       | 
       | You could then train on any kind of digital data.
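        | 
        | (A minimal sketch, not from the thread, of what "bytes as
        | tokens" means in practice, assuming plain Python: each UTF-8
        | byte becomes an ID in a 256-entry vocabulary.)
        | 
        |     # Each byte maps to a token ID in the range 0..255.
        |     text = "A Multimodal Dataset"
        |     tokens = list(text.encode("utf-8"))   # e.g. [65, 32, 77, ...]
        |     assert all(0 <= t < 256 for t in tokens)
        |     # Decoding is just the inverse: token IDs -> bytes -> text.
        |     assert bytes(tokens).decode("utf-8") == text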
        
         | akrymski wrote:
          | Forget bytes, go for bits. Vocab of size 2. At a theoretical
          | level, all of AI comes down to a classifier that predicts the
          | next bit given a string of bits. Check out Tsetlin Machines.
          | At some point we will be doing it in hardware.
         | 
         | https://byte-gpt.github.io/
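          | 
          | (A minimal sketch of the bit-level framing, assuming numpy:
          | any byte stream expands to a sequence over a two-symbol
          | vocabulary, and "next-bit prediction" is then a binary
          | classifier over the preceding bits.)
          | 
          |     import numpy as np
          | 
          |     data = b"any digital data"
          |     # Vocab of size 2: every byte expands to 8 bits.
          |     bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
          |     # A next-bit model maps each prefix to P(next bit = 1).
          |     contexts, targets = bits[:-1], bits[1:]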
        
           | kulikalov wrote:
           | Sounds inefficient. It's like predicting the boiling point of
           | a kettle by measuring the speed of individual molecules of
           | water.
        
             | BizarroLand wrote:
             | That would be surprisingly easy with 1st year calculus as
             | long as you were willing to accept a small degree of
             | inaccuracy.
        
         | nodja wrote:
          | Somewhat inefficient for text, very inefficient for images,
          | especially if you work in pixel space. The max context a model
          | has been trained on today is ~1M tokens, which takes up a lot
          | of memory. Even if context were not an issue, generating a
          | 1000x1000 image would take ~3 hours at 100 tokens/s inference.
         | 
          | Google has trained a byte-level encoder-decoder LLM called
          | ByT5 [1].
         | 
         | [1] https://huggingface.co/google/byt5-xxl
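          | 
          | (Back-of-the-envelope check of the ~3 hour figure, assuming
          | one token per pixel for a single-channel 1000x1000 image:)
          | 
          |     pixels = 1000 * 1000           # 1,000,000 tokens
          |     rate = 100                     # tokens per second
          |     hours = pixels / rate / 3600   # ~= 2.78, i.e. roughly 3 hours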
        
           | Tostino wrote:
            | I think the work on multi-token prediction [0] could be a
            | significant development that makes byte-level models more
            | practical. The approach lets the model predict multiple
            | tokens in parallel within a single forward pass, potentially
            | addressing the efficiency concerns raised about byte-level
            | models.
            | 
            | By predicting multiple tokens simultaneously, it could
            | significantly speed up inference, especially for tasks that
            | require generating large amounts of data (like images). That
            | could help mitigate the bottleneck the parent comment
            | mentions for generating a 1000x1000 image.
           | 
           | [0] https://ar5iv.labs.arxiv.org/html/2404.19737
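            | 
            | (A minimal sketch of the idea, assuming PyTorch: a shared
            | trunk feeds k independent output heads, each predicting the
            | token at a different future offset. Class and parameter
            | names here are illustrative, not the paper's.)
            | 
            |     import torch.nn as nn
            | 
            |     class MultiTokenHeads(nn.Module):
            |         def __init__(self, d_model, vocab_size, k=4):
            |             super().__init__()
            |             # One linear head per future position t+1 .. t+k.
            |             self.heads = nn.ModuleList(
            |                 [nn.Linear(d_model, vocab_size) for _ in range(k)]
            |             )
            | 
            |         def forward(self, hidden):  # hidden: (batch, seq, d_model)
            |             # k sets of logits computed in parallel from the same
            |             # hidden states -- k future tokens per forward pass.
            |             return [head(hidden) for head in self.heads]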
        
         | joshuamcginnis wrote:
          | You might be interested in reading up on DNA-sequence LLMs
          | and tooling.
        
         | donnyg wrote:
         | It has been tried with decent results:
         | https://ai.meta.com/blog/ai-self-supervised-learning-data2ve...
        
       ___________________________________________________________________
       (page generated 2024-07-24 23:00 UTC)