[HN Gopher] Best 7B LLM on leaderboards made by an amateur follo...
___________________________________________________________________
Best 7B LLM on leaderboards made by an amateur following a medium
tutorial
Author : Der_Einzige
Score : 132 points
Date : 2024-01-05 18:34 UTC (4 hours ago)
(HTM) web link (huggingface.co)
(TXT) w3m dump (huggingface.co)
| brcmthrowaway wrote:
| Very interesting that this was managed with a course that's
| six months out of date.
| ch33zer wrote:
| On a free Colab instance...
| brcmthrowaway wrote:
| How much more low hanging fruit is there?
| SushiHippie wrote:
| The GPU he used is not free; you need to buy compute units
| for these "premium GPUs".
|
| https://colab.research.google.com/signup
| cjbprime wrote:
| Is it possible that they fine-tuned on the leaderboard test set?
| nabakin wrote:
| The only leaderboard this model does well on is the Hugging
| Face Open LLM Leaderboard, which is known to be manipulated
| and prone to gross overfitting. The LMSYS Chatbot Arena
| leaderboard is a better representation of the best models.
| refulgentis wrote:
| Thank you!!! _And_ it has the proprietary models...insanely
| more useful.
|
| It's a bug, not a feature, that the stock leaderboard ends up
| with endless fine-tunes, and as you point out (and as the
| article demonstrates), it's measuring something other than
| quality.
| Der_Einzige wrote:
| Even that Chatbot Arena shows that many freely available,
| open source models are better than some versions of GPT-3.5
| and are within a stone's throw of the latest GPT-3.5.
| int_19h wrote:
| Note that it only includes gpt-3.5-turbo (the current
| iteration), not the original gpt-3.5. It's not exactly a
| secret that "turbo" models are noticeably dumber than the
| originals, whatever OpenAI says. There's no free lunch -
| that's why they're so much cheaper and faster...
|
| That said, we do have public 120b models now that genuinely
| feel better than the original gpt-3.5.
|
| The holy grail remains beating gpt-4 (or even gpt-4-turbo).
| This seems to be out of reach on consumer hardware at
| least...
| modeless wrote:
| Um, the LMSYS Elo ranking clearly shows that GPT-4 Turbo
| is better than GPT-4.
| anoncareer0212 wrote:
| Same for the ChatGPT versions post-launch (let's not talk
| about 11_02 :) )
|
| -- and as long as we're asserting anecdotes freely, I
| work in the field and have a couple of years in from
| before ChatGPT -- it most certainly is not a well-kept
| secret, or a secret, or true, or anything other than
| standard post-millennial self-peasantization.
|
| "Outright lie" is kinder toward the average reader by
| being more succinct, but it usually causes explosive
| reactions, because people take a while to come to terms
| with the fact that the ad-hoc knowledge they picked up
| from consuming commentary is fundamentally flawed, if
| they ever do.
| sdenton4 wrote:
| Note that 'no free lunch' has a specific meaning with no
| relation whatsoever to model size/quality trade-offs...
|
| https://en.wikipedia.org/wiki/No_free_lunch_theorem
|
| In the speed/quality trade-off sense, there have /often/
| been free lunches in many areas of computer science,
| where algorithmic improvements let us solve problems
| orders of magnitude faster. We don't fully understand
| what further improvements will be available for LLMs.
| pests wrote:
| That phrase comes from the more general adage though.
|
| https://en.wikipedia.org/wiki/No_such_thing_as_a_free_lunch
| mewpmewp2 wrote:
| Free lunch here relates to pricing/speed, I would say,
| because gpt-4 and gpt-4-turbo are sold together. If
| gpt-4-turbo is cheaper, faster, and has a much larger
| context window, why would it make sense to also sell
| gpt-4... unless it's a marketing trick, or perhaps for
| backwards compatibility, which it could also be.
| barnabee wrote:
| If only all people and companies were as honest as this guy about
| how much of their success they owe to luck and others!
| wavemode wrote:
| "Leaderboards", meh.
|
| This tweet is still very true:
| https://twitter.com/karpathy/status/1737544497016578453
| not2b wrote:
| Goodhart's law would seem to apply:
|
| https://en.wikipedia.org/wiki/Goodhart%27s_law
|
| Nevertheless, scoring so well on this benchmark is an
| accomplishment, though I'm not in a position to evaluate how
| significant it is.
| make3 wrote:
| that's why the huggingface llm arena exists
| int_19h wrote:
| Nothing beats an actual human spending a couple hours with the
| model when it comes to meaningful evaluation.
| jasonjmcghee wrote:
| Why does huggingface list this as a 9B model?
| make3 wrote:
| It's trained with a LoRA adapter, so it's either an error or
| they also count the adapter. They use an inner LoRA
| dimension (rank) of 16, however, so it's unlikely that
| that's the reason (the adapter is too small).
|
| An important point to keep in mind is that LoRA adapters
| are meant to be merged into the base model before
| inference, so they don't affect inference speed (you need
| to do the merge explicitly if you train your own adapter).
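|
| For illustration, a minimal merge sketch using the Hugging
| Face PEFT library (the base model ID and adapter name here
| are placeholders, not the actual model from the post):
|
|     # Hypothetical example: fold LoRA weights into the base model
|     from transformers import AutoModelForCausalLM
|     from peft import PeftModel
|
|     base = AutoModelForCausalLM.from_pretrained(
|         "mistralai/Mistral-7B-v0.1")          # example base model
|     model = PeftModel.from_pretrained(
|         base, "your-org/your-lora-adapter")   # placeholder adapter
|
|     # Merging adds the low-rank deltas into the base weights, so
|     # inference runs at plain 7B speed with no adapter overhead.
|     model = model.merge_and_unload()
|     model.save_pretrained("merged-model")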
| latexr wrote:
| Are we sure the tutorial was medium? It might have been quite
| good, or at least above average. Ba dum tss.
|
| "medium" should be capitalised in the title, as it refers to the
| blogging platform.
|
| https://medium.com
| ben_w wrote:
| Agreed, because I initially read it as "mediocre" rather than
| the brand.
| idorube wrote:
| And here I thought he was helped by a crystal ball...
| politelemon wrote:
| Good thread I saw on Reddit about this a few days ago.
|
| https://www.reddit.com/r/LocalLLaMA/comments/18xbevs/open_ll...
|
| Many top models are overfitting to the top leaderboards
| rather than being actually useful.
| SubiculumCode wrote:
| It seems so easy for just one poisoned model (trained on
| test data) to infect a ton of fine-tuned model mixtures...
| it could happen without intention.
|
| Under this scenario, would the ones that achieve the top
| performance be the closest relatives of the poisoned model?
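|
| (For illustration only: a crude way to look for that kind of
| leakage is an n-gram overlap scan between fine-tuning data
| and benchmark test questions. The file names below are
| placeholders, one example per line.)
|
|     # Toy contamination check via shared 8-grams (illustrative only)
|     def ngrams(text, n=8):
|         toks = text.lower().split()
|         return {" ".join(toks[i:i + n])
|                 for i in range(len(toks) - n + 1)}
|
|     train_grams = set()
|     with open("train.txt") as f:            # placeholder training data
|         for line in f:
|             train_grams |= ngrams(line)
|
|     with open("benchmark_test.txt") as f:   # placeholder test set
|         for i, line in enumerate(f):
|             hits = ngrams(line) & train_grams
|             if hits:
|                 print(f"example {i}: {len(hits)} shared 8-grams")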
| mbb70 wrote:
| I'm as quick to jump on the Medium roastwagon as anyone else, but
| I will say Towards Data Science has a surprising number of
| quality tutorials running the full spectrum of data science
| tasks.
|
| That, and they have great SEO; you basically can't avoid them.
| behnamoh wrote:
| I avoid them easily using Kagi :))
| rck wrote:
| Everyone knows the hf leaderboard is actively being gamed
| (Goodhart's law strikes again), but the guy who wrote the Medium
| post is active in doing stuff with models, and the tutorial is
| (clearly) pretty good.
| SubiculumCode wrote:
| It seems that a lot of the leaders are the result of mixing
| fine-tunes, which really makes me think that test sets
| leaked into the training data.
| SubiculumCode wrote:
| To reply to myself: I am not saying that _this_ model did,
| or that, even if it did, it was done intentionally. ML is
| hard, and there are so many ways for data to leak.
|
| What I AM surprised about is that it is not clear what CultriX
| did that was better than what a ton of others have done.
|
| Any clues?
| ramoz wrote:
| Who is using 7Bs in a serious manner, instead of OpenAI, in
| a cost-efficient way?
| justinl33 wrote:
| 1. Fine-tune to specific tasks
| 2. Not subject to OpenAI's censorship
| 3. Can run locally instead of on cloud compute (offline)
| 4. Experimentation
| dingdingdang wrote:
| Using Mistral ft optimized 1218 7b q5km - very useful for
| basic queries/creative input. Often just as useful as
| ChatGPT, and it feels far more "real" to have it fully
| local; I don't want to depend fully on one proprietary
| service for something as fundamentally useful as this!
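|
| For anyone curious, a minimal local-inference sketch with
| llama-cpp-python (the GGUF file name is a placeholder for
| whatever Q5_K_M quant you have downloaded):
|
|     # Hypothetical example: run a quantized 7B GGUF model locally
|     from llama_cpp import Llama
|
|     llm = Llama(model_path="./mistral-7b-q5_k_m.gguf", n_ctx=4096)
|     out = llm("Q: Summarize Goodhart's law in one sentence.\nA:",
|               max_tokens=128, stop=["Q:"])
|     print(out["choices"][0]["text"])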
___________________________________________________________________
(page generated 2024-01-05 23:02 UTC)