[HN Gopher] Benchmarks and comparison of LLM AI models and API h...
       ___________________________________________________________________
        
       Benchmarks and comparison of LLM AI models and API hosting
       providers
        
       Hi HN, ArtificialAnalysis.ai provides objective benchmarks and
       analysis of LLM AI models and API hosting providers so you can
       compare which to use in your next (or current) project.  The site
       consolidates different quality benchmarks, pricing information and
       our own technical benchmarking data. Technical benchmarking
        (throughput, latency) is conducted by sending API requests
        every 3 hours.  Check out the site at
        https://artificialanalysis.ai and our Twitter at
        https://twitter.com/ArtificialAnlys. Twitter thread with initial
        insights:
        https://twitter.com/ArtificialAnlys/status/17472648324397343...
        All feedback is welcome, and we're happy to discuss methodology, etc.
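
        As an illustration of what a throughput/TTFT probe of this kind
        could look like (not the site's actual harness; the endpoint,
        model name, prompt and OPENAI_API_KEY handling below are
        assumptions), a minimal Node 18+ sketch against an OpenAI-style
        streaming API:

            // Minimal sketch: time-to-first-chunk and approximate output
            // tokens/sec for a single streaming chat completion request.
            // The streamed-chunk count is only a rough proxy for tokens.
            async function probe() {
              const start = Date.now();
              const res = await fetch(
                "https://api.openai.com/v1/chat/completions",
                {
                  method: "POST",
                  headers: {
                    "Content-Type": "application/json",
                    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
                  },
                  body: JSON.stringify({
                    model: "gpt-3.5-turbo", // placeholder model name
                    stream: true,
                    messages: [
                      { role: "user", content: "Write about benchmarking." },
                    ],
                  }),
                }
              );

              let ttftMs = null;
              let chunks = 0;
              const reader = res.body.getReader();
              const decoder = new TextDecoder();
              while (true) {
                const { done, value } = await reader.read();
                if (done) break;
                if (ttftMs === null) ttftMs = Date.now() - start;
                // Each SSE "data:" line carries one streamed delta.
                chunks += decoder
                  .decode(value, { stream: true })
                  .split("\n")
                  .filter((l) => l.startsWith("data:") && !l.includes("[DONE]"))
                  .length;
              }
              const totalSec = (Date.now() - start) / 1000;
              console.log({ ttftMs, approxTokensPerSec: chunks / totalSec });
            }

            probe().catch(console.error);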
        
       Author : Gcam
       Score  : 78 points
       Date   : 2024-01-16 16:11 UTC (6 hours ago)
        
 (HTM) web link (artificialanalysis.ai)
 (TXT) w3m dump (artificialanalysis.ai)
        
       | elicksaur wrote:
       | > Application error: a client-side exception has occurred (see
       | the browser console for more information).
       | 
       | iOS Safari
        
         | Gcam wrote:
          | Thanks for letting me know. Odd, as it's not occurring in my
          | iOS Safari. Can anyone else please let me know if they are
          | encountering this issue (and their iOS version if possible)?
          | There is a console error, but it should just be a React
          | defaultProps deprecation notice from a library being used
          | (it should not break the DOM).
        
       | causal wrote:
       | Thanks for putting this together! Amazon is far and away the
       | priciest option here, but I wonder if a big part of that is the
       | convenience tax for the Bedrock service. Would be interesting to
       | compare that to the price of just renting AWS GPUs on EC2.
        
         | Gcam wrote:
         | Yes! An interesting insight is that the smaller, emerging hosts
         | also offer strong relative performance (throughput - tokens per
         | second)
        
       | chadash wrote:
       | I love it. One minor change I'd make is changing the pricing
       | chart to put lowest on the left. On the other highlights, left to
       | right goes from best to worst, but this one is the opposite.
       | 
       | I'm excited to see where things land. What I find interesting is
       | that pricing is either wildly expensive or wildly cheap,
       | depending on your use case. For example, if you want to run GPT-4
       | to glean insights on every webpage your users visit, a freemium
       | business model is likely completely unviable. On the other hand,
       | if I'm using an LLM to spot issues in a legal contract, I'd
       | happily pay 10x what GPT4 currently charges for something
       | marginally better (It doesn't make much difference if this task
       | costs $4 vs $0.40). I think that the ultimate "winners" in this
       | space will have a range of models at various price points and let
       | you seamlessly shift between them depending on the task (e.g., in
       | a single workflow, I might have some sub-tasks that need a cheap
       | model and some that require an expensive one).
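
        As a rough illustration of that cost sensitivity (the per-token
        prices below are placeholders, not figures from the site), a quick
        per-request and per-user-month estimate:

            // Back-of-envelope cost estimate. Prices are illustrative
            // placeholders in $ per 1K tokens, not real provider pricing.
            const pricePer1kInput = 0.03;
            const pricePer1kOutput = 0.06;

            function requestCost(inputTokens, outputTokens) {
              return (inputTokens / 1000) * pricePer1kInput +
                     (outputTokens / 1000) * pricePer1kOutput;
            }

            // "Glean insights on every webpage a user visits": ~2k tokens
            // in, ~300 out, 100 pages/day adds up fast for a freemium app.
            const perPage = requestCost(2000, 300);
            console.log(perPage, perPage * 100 * 30); // per page, per user-month

            // One-off contract review: even a 10x premium stays small in
            // absolute terms (the $4 vs $0.40 point above).
            console.log(requestCost(30000, 2000));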
        
       | zurfer wrote:
       | This is great. Thank you! I would be especially interested in
        | more details around speed. Average is a good starting point, but
        | I would love to also see standard deviation or the 90th/99th
        | percentiles.
        | 
        | In my experience speed varies a lot, and it makes a big
        | difference whether a request takes 10 seconds or 50 seconds.
        
         | Gcam wrote:
          | Thanks for the feedback! Yes, agree this would be a good idea.
          | We don't have this view, but the best place to get an idea of
          | this with the current site would be the /models page
          | (https://artificialanalysis.ai/models), scrolling to the
          | over-time graphs and looking at the variance. To see whether
          | it's being driven by individual hosts, you can also click into
          | the by-model pages and see the over-time graphs, e.g.
          | https://artificialanalysis.ai/models/mixtral-8x7b-instruct
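
        For reference, a minimal sketch of the percentile summary asked
        for above (nearest-rank p50/p90/p99 over latency samples; the
        sample values are made up):

            // Nearest-rank percentile over request latencies in seconds.
            function percentile(samples, p) {
              const sorted = [...samples].sort((a, b) => a - b);
              const idx = Math.ceil((p / 100) * sorted.length) - 1;
              return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
            }

            const latencies = [9.8, 11.2, 10.4, 48.9, 12.1, 10.9, 51.3, 11.7];
            console.log({
              p50: percentile(latencies, 50),
              p90: percentile(latencies, 90),
              p99: percentile(latencies, 99),
            });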
        
       | sabareesh wrote:
       | I want to see benchmarks for RAG. Most of the models are not very
       | good with RAG
        
       | luke-stanley wrote:
       | This is awesome. I was looking at benchmarking speed and quality
       | myself but didn't go this far! I wonder about Claude Instant and
       | Phi 2? Modal.com for inference felt crazy fast, but I didn't note
       | the metrics. Good ones to add? Replicate.com too maybe?
        
         | Gcam wrote:
          | Thanks! For Claude Instant, select the dropdown on the top
          | right of the card where it says '8 Selected' and you can add it
          | to the graphs. Thanks for the suggestions of adding Phi 2 and
          | Modal.com as a host - we can look into these!
        
       | rubymamis wrote:
       | I wish there were more details about how you measure "quality".
        
         | pseudosavant wrote:
         | See this comment:
         | https://news.ycombinator.com/item?id=39014985#39017792
        
       | idiliv wrote:
       | I'm curious how they evaluated model quality. The only
       | information I could find is "Quality: Index based on several
       | quality benchmarks".
        
         | Gcam wrote:
          | The quality index is an equally-weighted combination of
          | normalized Chatbot Arena Elo Score, MMLU, and MT Bench values.
         | 
         | We have a bit more information in the FAQ:
         | https://artificialanalysis.ai/faq but thanks for the feedback,
         | will look into expanding more on how the normalization works.
         | We are thinking of ways to improve this generalized metric.
         | 
          | A sticking point is that quality can of course be thought of
          | from different perspectives: reasoning, knowledge (retrieval),
          | use-case-specific (coding, math, readability), etc. This is why
          | we show individual scores on the home page and the models page:
          | https://artificialanalysis.ai/models
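
        To make the "equally-weighted normalized values" concrete, a small
        sketch of one possible construction (the site has not published
        the exact normalization; min-max scaling and the scores below are
        assumptions for illustration only):

            // Equal-weight quality index over three benchmarks, assuming
            // min-max normalization across the models being compared.
            const models = [
              { name: "model-a", elo: 1250, mmlu: 86.4, mtBench: 9.3 },
              { name: "model-b", elo: 1115, mmlu: 70.6, mtBench: 8.3 },
              { name: "model-c", elo: 1060, mmlu: 63.2, mtBench: 7.1 },
            ];

            function minMax(values) {
              const lo = Math.min(...values), hi = Math.max(...values);
              const range = hi - lo || 1; // avoid divide-by-zero on ties
              return values.map((v) => (v - lo) / range);
            }

            const keys = ["elo", "mmlu", "mtBench"];
            const norm = keys.map((k) => minMax(models.map((m) => m[k])));

            const index = models.map((m, i) => ({
              name: m.name,
              // Equal weighting: simple average of the normalized scores.
              quality: keys.reduce((sum, _, k) => sum + norm[k][i], 0) / keys.length,
            }));

            console.log(index);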
        
       | scribu wrote:
       | I'm not sure about the Speed chart. I would expect gpt-4-turbo to
       | be faster than plain gpt-4.
        
         | pseudosavant wrote:
          | I thought so too. Could it be that GPT-4 Turbo is more
          | efficient for them to run, so the price is lower, but it tries
          | to maintain the token throughput of GPT-4 over their API? There
          | are a lot of ways they could allocate and configure their GPU
          | resources so that GPT-4 Turbo provides the same per-user
          | throughput while greatly increasing their system throughput.
        
           | bredren wrote:
            | The speed of GPT-4 via ChatGPT varies greatly depending on
            | when you're using it.
           | 
           | Could the data have been collected when the system is under
           | different loads?
        
             | pseudosavant wrote:
             | Unless they captured many different times and days, that is
             | very likely a factor. GPU resources are constrained enough
             | that during peak times (which vary across the globe) the
             | token throughput will vary a lot.
        
       | throwawaymaths wrote:
       | Latency (ttft) would be a nice metric.
        
         | Gcam wrote:
         | We have this (and other more detailed metrics) on the models
         | page https://artificialanalysis.ai/models if you scroll down
         | and for individual hosts if you click into a model (nav or
         | click one of the model bars/bubbles) :)
         | 
          | There are some interesting views of throughput vs. latency
          | whereby some models are slower to the first chunk but faster
          | for subsequent chunks, and vice versa, and so suit different
          | use cases (e.g. if you just want a true/false answer vs. a
          | more detailed response).
        
           | throwawaymaths wrote:
           | Thanks!
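
        A quick back-of-envelope example of the throughput vs. latency
        trade-off described above, with made-up numbers (total time is
        roughly TTFT + output tokens / tokens per second):

            // Host A: fast first chunk, slower streaming. Host B: the
            // reverse. All numbers are illustrative, not site data.
            function totalSeconds(ttftSec, tokensPerSec, outputTokens) {
              return ttftSec + outputTokens / tokensPerSec;
            }

            const hostA = { ttft: 0.3, tps: 25 };
            const hostB = { ttft: 1.2, tps: 90 };

            // Short true/false-style answer (~5 tokens): host A wins.
            console.log(totalSeconds(hostA.ttft, hostA.tps, 5),
                        totalSeconds(hostB.ttft, hostB.tps, 5));

            // Longer ~800-token response: host B wins.
            console.log(totalSeconds(hostA.ttft, hostA.tps, 800),
                        totalSeconds(hostB.ttft, hostB.tps, 800));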
        
       | Gcam wrote:
        | Hi HN, thanks for checking this out! The goal with this project
        | is to provide objective benchmarks and analysis of LLM AI models
        | and API hosting providers to compare which to use in your next
        | (or current) project. Benchmark comparisons include quality,
        | price, and technical performance (e.g. throughput, latency).
       | 
       | Twitter thread with initial insights:
       | https://twitter.com/ArtificialAnlys/status/17472648324397343...
       | 
       | All feedback is welcome
        
         | ttt3ts wrote:
          | Any chance of including some of the better fine-tunes, e.g.
          | Wizard or Tulu? (Worse than Mixtral, but I assume other fine-
          | tunes will be better, just like Wizard and Tulu are better
          | than Llama 2.)
          | 
          | I guess their cost is the same as the base model, although it
          | would affect performance.
        
         | bravura wrote:
         | I'd love to see replicate.com (pay per sip) on there. And
         | lambdalabs.com
         | 
         | [edit: And also MPS]
        
       | bearjaws wrote:
       | I've been using Mixtral and Bard ever since the end of the year.
       | I am pleased with their performance overall for a mixture of
       | content generation and coding.
       | 
        | It seems to me GPT-4 has become short in its outputs; you have to
        | do a lot more CoT-type prompting to get it to actually output a
        | good result, which is excruciating given how slow it is to
        | produce content.
        | 
        | Mixtral on Together AI is crazy to see at ~70-100 tokens/s, and
        | the quality works for my use case as well.
        
         | thierrydamiba wrote:
          | Can you give an example of a query where you find GPT-4 is
          | short with outputs? I use custom instructions, so that may have
          | shielded me from this change.
        
           | declaredapple wrote:
           | At least for me making tests has been very frustrating, full
           | of many "test conditions here" and "continue with the rest of
           | the tests".
           | 
           | It _hates_ making assumptions about things it doesn't know
           | for sure, I suspect because of "anti-hallucination" nonsense.
           | Instead it has to be shoved to even try making any
           | assumptions, even reasonable ones.
           | 
            | I know it's capable of making reasonable assumptions for
            | class structures/behaviour, etc., where I can just tweak it
            | as needed to work. It just refuses to. I've even seen
            | comments like "We'll put the rest of the code in later".
        
           | bearjaws wrote:
           | Given this JSON: <JSON examples> And this Table schema:
           | <Table Schema in SQL>
           | 
            | Create JavaScript to insert the JSON into the SQL using
           | knex('table_name')
           | 
            | Below is part of its output:
            | 
            |     // Insert into course_module table
            |     await knex('course_module').insert({
            |       id: moduleId,
            |       name: courseData.name,
            |       description: courseData.description,
            |       // include other required fields with appropriate values
            |     });
           | 
           | It's missing several columns it could populate with the data
           | it knows from the prompt, primarily created_at, updated_at,
           | account_id, user_id, lesson number... and instead I get a
           | comment telling me to do it.
           | 
            | There are a lot of people complaining about this, primarily
            | on Reddit, but usually the ChatGPT fanboys jump in to defend
            | OAI.
        
             | bearjaws wrote:
              | Here is the Mixtral output (truncated):
              | 
              |     knex('course_module')
              |       .insert({
              |         name: jsonData.name,
              |         description: jsonData.description,
              |         content: JSON.stringify(jsonData),
              |         number: jsonData.number,
              |         account_id: 'account_id',
              |         user_id: 'user_id',
              |         course_id: 'course_id',
              |         created_at: new Date(),
              |         updated_at: new Date()
              |       })
        
             | abrichr wrote:
              | Try this custom instruction:
              | 
              |     - Skip any preamble or qualifications about how a topic
              |       is subjective.
              |     - Be terse. Do not offer unprompted advice or
              |       clarifications. Speak in specific, topic-relevant
              |       terminology. Do NOT hedge or qualify. Do not waffle.
              |       Speak directly and be willing to make creative
              |       guesses. Explain your reasoning. If you don't know,
              |       say you don't know.
              |     - Remain neutral on all topics. Be willing to reference
              |       less reputable sources for ideas.
              |     - Never apologize.
              |     - Ask questions when unsure.
              | 
              |     When responding in code:
              |     - Do not truncate.
              |     - Do not elide.
              |     - Do not omit.
              |     - Only output the full and complete code, from start to
              |       finish, unless otherwise specified.
              | 
              |     Getting this right is very important for my career.
        
               | bearjaws wrote:
               | Hmm I like this more than my current one, which I got
               | from a Reddit thread. I'll have to give it a whirl.
        
       | badFEengineer wrote:
       | nice, I've been looking for something like this! A few notes /
       | wishlist items:
       | 
        | * Looks like for gpt-4-turbo
        | (https://artificialanalysis.ai/models/gpt-4-turbo-1106-previe...),
        | there was a huge latency spike on December 28, which is causing
        | the avg. latency to be very high. Perhaps dropping the top and
        | bottom 10% of requests will help with the avg (or switch over to
        | median + include variance)
       | 
       | * Adding latency variance would be truly awesome, I've run into
       | issues with some LLM API providers where they've had incredibly
       | high variance, but I haven't seen concrete data across providers
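
        A minimal sketch of the robust summaries suggested here (10%
        trimmed mean and median; the latency samples are made up, with
        one outlier spike):

            function trimmedMean(samples, trimFraction = 0.1) {
              const sorted = [...samples].sort((a, b) => a - b);
              const cut = Math.floor(sorted.length * trimFraction);
              const kept = sorted.slice(cut, sorted.length - cut);
              return kept.reduce((a, b) => a + b, 0) / kept.length;
            }

            function median(samples) {
              const s = [...samples].sort((a, b) => a - b);
              const mid = Math.floor(s.length / 2);
              return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
            }

            const latencies = [0.8, 0.9, 1.0, 1.1, 0.9, 14.2, 1.0, 0.8, 1.1, 0.9];
            const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
            console.log({ mean, trimmed: trimmedMean(latencies), median: median(latencies) });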
        
         | Gcam wrote:
          | Thanks for the feedback and glad it is useful! Yes, agree that
          | might be better representative of future use. I think a view of
          | variance would be a good idea; it's currently just shown in the
          | over-time views - maybe a histogram of response times or a box
          | and whisker. We have a newsletter subscribe form on the website,
          | or Twitter (https://twitter.com/ArtificialAnlys) if you want to
          | follow future updates.
        
           | AaronFriel wrote:
           | Variance would be good, and I've also seen significant
           | variance on "cold" request patterns, which may correspond to
           | resources scaling up on the backend of providers.
           | 
           | Would be interesting to see request latency and throughput
           | when API calls occur cold (first data point), and once per
           | hour, minute, and per second with the first N samples
           | dropped.
           | 
           | Also, at least with Azure OpenAI, the AI safety features
           | (filtering & annotations) make a significant difference in
           | time to first token.
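
        A sketch of the sampling pattern being described (one cold probe,
        then repeated probes at a fixed interval with the first N warm-up
        samples dropped before aggregating; probeLatency stands in for any
        request-timing function and is not part of the site's tooling):

            async function sampleAtInterval(probeLatency, intervalMs, count, warmupToDrop) {
              const samples = [];
              for (let i = 0; i < count; i++) {
                samples.push(await probeLatency()); // first call is the "cold" data point
                if (i < count - 1) await new Promise((r) => setTimeout(r, intervalMs));
              }
              // Report the cold sample separately; aggregate the rest
              // after discarding warm-up noise.
              return { cold: samples[0], warm: samples.slice(1 + warmupToDrop) };
            }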
        
       | binsquare wrote:
        | I'm surprised to see Perplexity's 70B online model score so low
        | on model quality, and somehow far worse than Mixtral and GPT-3.5
        | (they use a fine-tuned GPT-3.5 as the foundational model AFAIK).
       | 
       | I run https://www.labophase.com and my data suggests that it's
       | one of the top 3 models in terms of users liking to interact with
       | it. May I know how model quality is benchmarked to understand
       | this discrepancy?
        
         | Gcam wrote:
          | The model quality index methodology is as per this comment (you
          | can add Perplexity using the dropdown):
          | https://news.ycombinator.com/item?id=39014985#39017632
          | 
          | It's a combination of different quality metrics which have
          | Perplexity, overall, not performing as well. That being said, I
          | think we are in the very early stages of model quality
          | scoring/ranking - and (for closed-source models) we are seeing
          | frequent changes. It will be interesting to see how measures
          | evolve / model ranks change.
        
       ___________________________________________________________________
       (page generated 2024-01-16 23:00 UTC)