[HN Gopher] Best 7B LLM on leaderboards made by an amateur follo...
       ___________________________________________________________________
        
       Best 7B LLM on leaderboards made by an amateur following a medium
       tutorial
        
       Author : Der_Einzige
       Score  : 132 points
       Date   : 2024-01-05 18:34 UTC (4 hours ago)
        
 (HTM) web link (huggingface.co)
 (TXT) w3m dump (huggingface.co)
        
       | brcmthrowaway wrote:
        | Very interesting that this was managed with a course that's six
        | months out of date.
        
         | ch33zer wrote:
          | On a free Colab instance...
        
           | brcmthrowaway wrote:
           | How much more low hanging fruit is there?
        
           | SushiHippie wrote:
            | The GPU he used isn't free; you need to buy compute units
            | for these "premium GPUs".
           | 
           | https://colab.research.google.com/signup
        
       | cjbprime wrote:
       | Is it possible that they fine-tuned on the leaderboard test set?
        
       | nabakin wrote:
        | The only leaderboard this model is good on is the HuggingFace LLM
        | Leaderboard, which is known to be manipulated and a victim of
        | gross overfitting. The Lmsys Arena Leaderboard is a better
        | representation of the best models.
        
         | refulgentis wrote:
         | Thank you!!! _And_ it has the proprietary models...insanely
         | more useful.
         | 
          | It's a bug, not a feature, that the stock leaderboard ends up
          | with endless fine-tunes, and as you point out (and as the
          | article demonstrates), it's more a measure of something other
          | than quality.
        
           | Der_Einzige wrote:
            | Even the Chatbot Arena shows that many freely available,
            | open-source models are better than some versions of GPT-3.5
            | and within a stone's throw of the latest GPT-3.5.
        
             | int_19h wrote:
             | Note that it only includes gpt-3.5-turbo (the current
             | iteration), not the original gpt-3.5. It's not exactly a
             | secret that "turbo" models are noticeably dumber than the
             | originals, whatever OpenAI says. There's no free lunch -
             | that's why it's so much cheaper and faster...
             | 
             | That said, we do have public 120b models now that genuinely
             | feel better than the original gpt-3.5.
             | 
             | The holy grail remains beating gpt-4 (or even gpt-4-turbo).
             | This seems to be out of reach on consumer hardware at
             | least...
        
               | modeless wrote:
               | Um, the lmsys elo ranking clearly shows that GPT-4 Turbo
               | is better than GPT-4.
        
               | anoncareer0212 wrote:
                | Same for the ChatGPTs post-launch (let's not talk about
                | 11_02 :) )
                | 
                | -- and as long as we're asserting anecdotes freely, I
                | work in the field and had a couple of years in before
                | ChatGPT -- it most certainly is not a well-kept secret,
                | or a secret, or true, or anything other than standard
                | post-millennial self-peasantization.
                | 
                | "Outright lie" is kinder toward the average reader by
                | being more succinct, but it usually causes explosive
                | reactions, because people take a while to come to terms
                | with the idea that knowledge picked up ad hoc from
                | consuming commentary is fundamentally flawed, if they
                | ever do.
        
               | sdenton4 wrote:
               | Note that 'no free lunch' has a specific meaning with no
               | relation whatsoever to model size/quality trade-offs...
               | 
               | https://en.wikipedia.org/wiki/No_free_lunch_theorem
               | 
               | In the speed/quality trade-off sense, there have /often/
               | been free lunches in many areas of computer science,
               | where algorithmic improvements let us solve problems
               | orders of magnitude faster. We don't fully understand
               | what further improvements will be available for LLMs.
        
               | pests wrote:
               | That phrase comes from the more general adage though.
               | 
                | https://en.wikipedia.org/wiki/No_such_thing_as_a_free_lunch
        
               | mewpmewp2 wrote:
               | Free lunch here relates to pricing/speed I would say,
               | because gpt-4 and gpt-4-turbo are sold together. If
               | gpt-4-turbo is cheaper, faster and has much larger
               | context window, why would it make sense to also sell
               | gpt-4... Unless it's a marketing trick or perhaps for
               | backwards compatibility, which could also be.
        
       | barnabee wrote:
       | If only all people and companies were as honest as this guy about
       | how much of their success they owe to luck and others!
        
       | wavemode wrote:
       | "Leaderboards", meh.
       | 
       | This tweet is still very true:
       | https://twitter.com/karpathy/status/1737544497016578453
        
         | not2b wrote:
         | Goodhart's law would seem to apply:
         | 
         | https://en.wikipedia.org/wiki/Goodhart%27s_law
         | 
         | Nevertheless, scoring so well on this benchmark is an
         | accomplishment, though I'm not in a position to evaluate how
         | significant it is.
        
         | make3 wrote:
         | that's why the huggingface llm arena exists
        
         | int_19h wrote:
         | Nothing beats an actual human spending a couple hours with the
         | model when it comes to meaningful evaluation.
        
       | jasonjmcghee wrote:
       | Why does huggingface list this as a 9B model?
        
         | make3 wrote:
         | it's trained with a lora adaptor, so it's either an error or
         | they also count the adaptor. they use a 16 param inner lora
         | dimension however, so it's unlikely that that's the reason (too
         | small)
         | 
         | an important point to keep in mind is that at inference, the
         | lora adaptors are made to be merged into the base model so they
         | don't affect inference speed. (you need to explicitly do it
         | though, if you train your own adaptor)
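          | 
          | A minimal sketch of that explicit merge step, assuming the
          | adapter was trained with HuggingFace PEFT (the base model name
          | and adapter path below are placeholders):
          | 
          |     from transformers import AutoModelForCausalLM
          |     from peft import PeftModel
          | 
          |     # load the base model and attach the trained LoRA adapter
          |     base = AutoModelForCausalLM.from_pretrained(
          |         "mistralai/Mistral-7B-v0.1")
          |     model = PeftModel.from_pretrained(base, "path/to/adapter")
          | 
          |     # fold the adapter weights into the base weights so that
          |     # inference runs at the plain model's speed
          |     merged = model.merge_and_unload()
          |     merged.save_pretrained("merged-model")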
        
       | latexr wrote:
       | Are we sure the tutorial was medium? It might have been quite
       | good, or at least above average. Ba dum tss.
       | 
       | "medium" should be capitalised in the title, as it refers to the
       | blogging platform.
       | 
       | https://medium.com
        
         | ben_w wrote:
         | Agreed, because I initially read it as "mediocre" rather than
         | the brand.
        
           | idorube wrote:
           | And here I thought he was helped by a crystal ball...
        
       | politelemon wrote:
       | Good thread I saw on Reddit about this a few days ago.
       | 
       | https://www.reddit.com/r/LocalLLaMA/comments/18xbevs/open_ll...
       | 
        | Many top models are overfitting to the leaderboards rather than
        | being actually useful.
        
         | SubiculumCode wrote:
          | It seems so easy for just one poisoned model (trained on test
          | data) to infect a ton of finetune model mixtures... it could
          | happen without intention.
          | 
          | Under this scenario, would the ones that achieve the top
          | performance be the ones most closely related to the poisoned
          | model?
        
       | mbb70 wrote:
       | I'm as quick to jump on the Medium roastwagon as anyone else, but
       | I will say Towards Data Science has a surprising number of
       | quality tutorials running the full spectrum of data science
       | tasks.
       | 
        | That, and they have great SEO; you basically can't avoid them.
        
         | behnamoh wrote:
         | I avoid them easily using Kagi :))
        
       | rck wrote:
       | Everyone knows the hf leaderboard is actively being gamed
       | (Goodhart's law strikes again), but the guy who wrote the Medium
        | post is actively working with models, and the tutorial is
       | (clearly) pretty good.
        
       | SubiculumCode wrote:
       | It seems that a lot of the leaders are the results of mixing
       | finetunes, which really makes me think that there was a leak of
       | test sets into the training data.
        
         | SubiculumCode wrote:
         | To reply to myself. I am not saying that _this_ model did, or
         | that even if it did, that it was done intentionally. ML is
         | hard, and there are so many ways for data to leak.
         | 
         | What I AM surprised about is that it is not clear what CultriX
         | did that was better than what a ton of others have done.
         | 
         | Any clues?
        
       | ramoz wrote:
        | Who is using 7Bs in a serious manner, instead of OpenAI, in a
        | cost-efficient way?
        
         | justinl33 wrote:
          | 1. Fine-tune to specific tasks
          | 2. Not subject to OpenAI's censorship
          | 3. Can run locally instead of on cloud compute (offline)
          | 4. Experimentation
        
         | dingdingdang wrote:
          | Using Mistral ft optimized 1218 7b q5km - very useful for basic
          | queries/creative input. It's often just as useful as ChatGPT,
          | and it feels far more "real" to have it fully local; I don't
          | want to depend entirely on one proprietary service for
          | something as fundamentally useful as this!
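          | 
          | For the curious, a minimal sketch of what "fully local" can
          | look like with llama-cpp-python (the GGUF file name below is a
          | placeholder for whichever Q5_K_M quant you download):
          | 
          |     from llama_cpp import Llama
          | 
          |     # load a Q5_K_M quantized 7B model from a local GGUF file
          |     llm = Llama(
          |         model_path="./mistral-7b-q5_k_m.gguf",
          |         n_ctx=4096,  # context window in tokens
          |     )
          | 
          |     out = llm("Explain Goodhart's law in one sentence.",
          |               max_tokens=128)
          |     print(out["choices"][0]["text"])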
        
       ___________________________________________________________________
       (page generated 2024-01-05 23:02 UTC)