[HN Gopher] Command A: Max performance, minimal compute - 256k c...
       ___________________________________________________________________
        
       Command A: Max performance, minimal compute - 256k context window
        
       Author : lastdong
       Score  : 61 points
       Date   : 2025-03-14 07:02 UTC (2 days ago)
        
 (HTM) web link (cohere.com)
 (TXT) w3m dump (cohere.com)
        
        | Szpadel wrote:
        | what disqualified this model for me (I mostly use LLMs for
        | coding) was its 12% score on the aider benchmark
        | (https://aider.chat/docs/leaderboards/)
        
        | razemio wrote:
        | I distrust those benchmarks after working with Sonnet for half a
        | year now. Many OpenAI models beat Sonnet on paper. I suspect
        | that's because its strengths (agents, vision, caching) aren't
        | being exercised; otherwise there is no explanation for why it
        | isn't consistently on top. I have tried so many times to use
        | other models for various tasks, not only coding. The only area
        | where OpenAI excels is analytic tasks, at a much higher price.
        | For everything else, Sonnet works best for me, with Gemini Flash
        | 2.0 for cost-sensitive and latency-sensitive tasks.
        | 
        | In practice this perception of mine seems to be valid:
        | https://openrouter.ai/models?order=top-weekly
        | 
        | The same goes for this model. It claims to be good at coding,
        | but it seriously isn't compared to Sonnet. Funny enough, it
        | isn't benchmarked against Sonnet at all.
        
        | integralof6y wrote:
        | I just tried the chat and asked the LLM to compute the double
        | integral of 6*y over the interior of a triangle given its
        | vertices. There were many attempts, all incorrect; then I asked
        | it to write a Python program to solve this, again incorrect. I
        | know math computation is a weak point for LLMs, especially in a
        | chat. In one of the programs it used a hardcoded number 10 to
        | branch, which suggests the generated program was fitted to give
        | a good result for the test (I had given it the correct result
        | before). So be careful when testing generated programs: they may
        | be fitted to pass your simple tests. Edit: I also tried to
        | compute the integral of 6*y over the triangle with vertices
        | A(8, 8), B(15, 29), C(10, 12), and it yielded a wrong result of
        | 2341. I then suggested computing it using the formula based on
        | the barycenter of the triangle, that is, 6 * Area * (mean of the
        | y-coordinates), and it returned the correct value of 686.
        | 
        | To summarize: LLMs seem unable to give correct results for
        | simple math problems (here, a double integral over a triangle),
        | so students should not rely on them; for now they cannot perform
        | simple tasks without many errors.
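The barycenter shortcut in the comment is easy to verify directly: for a linear integrand like 6*y, the double integral over a triangle reduces to 6 * Area * (mean y-coordinate). A short Python sketch, using the vertices from the comment:

```python
# Verify: integral of 6*y over triangle A(8,8), B(15,29), C(10,12)
# equals 6 * Area * (mean of the y-coordinates), since 6*y is linear.

def shoelace_area(p, q, r):
    """Triangle area from its vertices via the shoelace formula."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2

A, B, C = (8, 8), (15, 29), (10, 12)
area = shoelace_area(A, B, C)          # 7.0
mean_y = (A[1] + B[1] + C[1]) / 3      # 49/3
print(6 * area * mean_y)               # 686.0, matching the comment
```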
        
         | HeatrayEnjoyer wrote:
         | >compute the integral of 6*y on the triangle with vertices A(8,
         | 8), B(15, 29), C(10, 12)
         | 
         | o3-mini returned 686 on the first try, without executing any
         | code.
        
          | vmg12 wrote:
          | Here is an even easier one: ask LLMs to take the integral from
          | 0 to 3 of 1/(x-1)^3. They fail to notice it's an improper
          | integral and just give an answer.
        
           | floam wrote:
           | ChatGPT definitely noticed: o1, o3-mini, o3-mini-high.
           | 
           | Maybe 4o will get it wrong? I wouldn't try it for math.
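For reference, the integral above really is improper: the integrand 1/(x-1)^3 is unbounded at x = 1, inside [0, 3], so the integral diverges. Naively applying the fundamental theorem across the singularity with the antiderivative F(x) = -1/(2(x-1)^2) produces a finite but meaningless value, which is presumably the kind of answer a careless model gives. A minimal sketch of the failure mode:

```python
# Naive FTC application: F is an antiderivative of 1/(x-1)**3, but it is
# not valid across the singularity at x = 1, so F(3) - F(0) is meaningless.
F = lambda x: -1 / (2 * (x - 1) ** 2)
naive = F(3) - F(0)
print(naive)                # 0.375 -- finite, but wrong: the integral diverges

# The integrand blows up approaching x = 1, confirming the improper integral:
f = lambda x: 1 / (x - 1) ** 3
print(f(1 + 1e-6))          # ~1e18
```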
        
       | Alifatisk wrote:
       | Cohere API Pricing for Command A
       | 
       | - Input Tokens: $2.50 / 1M
       | 
       | - Output Tokens: $10.00 / 1M
       | 
       | WOW, what makes them this expensive? Are we going against the
       | trend here and raising the prices instead?
        
        | Oras wrote:
        | Cohere has always come across as a niche LLM provider, not a
        | general-purpose one.
        | 
        | I once used it to enforce responses in British English, and it
        | worked much better than any other model at the time. But that
        | was about it for following the prompt. Their pricing is not
        | competitive enough for others to jump on, and I suspect that's
        | why it's not widely used.
        
        | codedokode wrote:
        | It is interesting that there is a graph showing performance on
        | benchmarks like MMLU, where different models have similar
        | scores. I wonder: are the tasks they cannot solve the same for
        | every model? And how do the "unsolvable" tasks differ from the
        | solvable ones?
        | 
        | Also, I cannot check this with the latest models, but I am
        | curious: have they learned to answer simple questions like
        | "What is 10000099983 + 1000017"?
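For the record, the sum in question is straightforward to check with plain integer arithmetic:

```python
# The arithmetic question from the comment, evaluated exactly.
s = 10000099983 + 1000017
print(s)  # 10001100000
```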
        
         | floam wrote:
         | There are questions on MMLU that you must get wrong if you are
         | right:
         | 
         | > The most widespread and important retrovirus is HIV-1; which
         | of the following is true? (A) Infecting only gay people (B)
         | Infecting only males (C) Infecting every country in the world
         | (D) Infecting only females
         | 
          | The corpus indicates A as the correct answer, but it was
          | obviously meant to be C.
        
       | jasonjmcghee wrote:
       | It's got Claude Sonnet pricing but they don't compare to it in
       | benchmarks.
        
          | UncleEntity wrote:
          | To be fair, or not, Claude isn't all that great.
          | 
          | I was working on a project to get the historic data out of a
          | Bluetooth thermometer I bought a while back to learn about
          | Bluetooth LE, and it would quite often rewrite the entire
          | thing using a completely different Bluetooth library instead
          | of simply addressing the error.
          | 
          | And this was after I gave up on having it create a kernel
          | module for the same thermometer (just because, not that
          | anyone needs such a thing), where it would _continually_ try
          | to write a helper program that wrote to the /proc filesystem,
          | and I would ask, "why would I want to do this when I could
          | just use the example program I gave you?" Claude, of course,
          | was highly apologetic every single time it made the exact
          | same mistake, so there's that.
          | 
          | I understand these are the early days of the robotic
          | overthrow of humanity but, please, at least sell me a working
          | product.
        
        | bionhoward wrote:
        | Funny how AI companies love training competitors to human labor
        | on human output, but then write in their terms that you're not
        | supposed to train competing bots on their bot output. Explicitly
        | anticompetitive hypocrisy, and millions of suckers pay for it.
        | How sad.
        
         | stuartjohnson12 wrote:
         | To be fair to the robots, those humans also had the audacity to
         | learn from the creative output of their fellow humans and then
         | use the law to restrict access to their intellectual property.
        
       | jstummbillig wrote:
       | "Command A is on par or better than GPT-4o and DeepSeek-V3 across
       | agentic enterprise tasks, with significantly greater efficiency."
       | 
       | Visible above the fold. Thanks for getting to the point.
        
       ___________________________________________________________________
       (page generated 2025-03-16 23:01 UTC)