[HN Gopher] Chatbot Arena Leaderboard
___________________________________________________________________
Chatbot Arena Leaderboard
Author : tosh
Score : 53 points
Date : 2023-05-25 20:46 UTC (2 hours ago)
(HTM) web link (lmsys.org)
(TXT) w3m dump (lmsys.org)
| zhwu wrote:
| Very interesting! Quite surprised to see PaLM-2 ranked even lower
| than open-sourced Vicuna.
| tikkun wrote:
| When do you (HN readers) think that we'll have an open source
| model that scores 1150 or higher, and where do you think it'll
| come from?
| sottol wrote:
| The "win matrix" (dissimilarity Matrix) seems very interesting,
| looks eg like Vicuna13b paired against gpt4 wins 20% of the time.
| Larger difference than I'd have guessed based on scores.
| furyofantares wrote:
| Yeah the win matrix is what you want to look at if you haven't
| internalized or memorized what various Elo differences mean
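
For readers who have not memorized what Elo differences mean, here is a minimal sketch of the conversion between a rating gap and an expected win rate. It assumes the conventional Elo logistic curve (base 10, scale 400). The function names are hypothetical and are not taken from the leaderboard's own code. The leaderboard's win matrix may also treat ties differently, so treat the numbers as approximate.

```python
import math


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    Assumption: base 10 and scale 400, the usual chess convention.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def rating_gap_for_win_rate(p: float, scale: float = 400.0) -> float:
    """Rating difference (A minus B) implied by A's expected score p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return scale * math.log10(p / (1.0 - p))


if __name__ == "__main__":
    # A 20% win rate, as read off the win matrix in the comment above,
    # corresponds to a deficit of roughly 241 rating points.
    print(round(rating_gap_for_win_rate(0.20), 1))
    # Equal ratings give an even match.
    print(expected_score(1000, 1000))
```

Under these assumptions, a 20% win rate implies a gap of about 241 points, and a 100-point gap implies roughly a 64% expected score for the stronger model.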
| redox99 wrote:
| Unfortunately the Arena is missing some of the strongest "open"
| models, such as WizardLM Uncensored 30B. In fact they don't have
| any Llama 30B/65B based models, just 13B models.
| djdsol wrote:
| Why is there no Claude+ ? Seems like their competitor to GPT-4.
| exizt88 wrote:
| Only the bottom 2 out of top 10 are open-source and available for
| commercial use. So if you want to use an open-source LLM for your
| commercial product, be aware that your competitors who use
| proprietary LLMs through APIs will outperform you _dramatically_.
| Or am I missing something?
| netsec_burn wrote:
| What you're missing is not reflected on the leaderboard right
| now, Guanaco 65B.
| EvgeniyZh wrote:
| Guanaco is a LLaMA tune and thus is irrelevant for commercial
| use, isn't it?
| netsec_burn wrote:
| Ah true! It isn't for commercial use.
| version_five wrote:
| I'd say you're missing the importance of not being bound to a
| proprietary model, and of not having to explain to your
| customers why you send their data to a third party. It's still
| early days - definitely if you need the sota performance this
| second, you don't have any options. But in the fairly near
| term, I see no evidence that the proprietary _generic_ models
| will keep their leads in a way that's meaningful for
| commercial products. Do you?
| com2kid wrote:
| I've been working extensively with LLMs on a generative
| storytelling side project (named www.generativestorytelling.ai
| because I am terrible at naming things) and once prompts start
| getting complex, ChatGPT wins by a landslide. I can do all sorts
| of complicated prompts to ChatGPT[0] and it will, by and large,
| come up with great output.
|
| Meanwhile, Bard gets confused by basic things such as "after this
| message I will send another one, do not reply until the second
| message is sent" and instead tries to immediately reply.
|
| IMHO not very many people doing reviews of chatbots are really
| pushing the bots to their limits, and those who are pushing
| the bots really hard are often too busy to take the time and make
| their work public (which is the reason I am developing in the
| open!)
|
| [0]
| https://github.com/devlinb/arcadia/blob/main/backend/src/rou...
| refulgentis wrote:
| Have you tried Claude on stories? my goodness, it seemed out of
| this world amazing a couple months back
| com2kid wrote:
| * * *
___________________________________________________________________
(page generated 2023-05-25 23:01 UTC)