[HN Gopher] Command A: Max performance, minimal compute - 256k c...
___________________________________________________________________
Command A: Max performance, minimal compute - 256k context window
Author : lastdong
Score : 61 points
Date : 2025-03-14 07:02 UTC (2 days ago)
(HTM) web link (cohere.com)
(TXT) w3m dump (cohere.com)
| Szpadel wrote:
| What disqualified this model for me (I mostly use LLMs for
| coding) was its 12% score on the aider benchmark
| (https://aider.chat/docs/leaderboards/).
| razemio wrote:
| I distrust those benchmarks after working with Sonnet for half a
| year now. Many OpenAI models beat Sonnet on paper. That seems to
| be because its strengths (agentic work, vision, caching) aren't
| being exercised, I guess? Otherwise there is no explanation for
| why it isn't constantly on top. I have tried many times to use
| other models for various tasks, not only coding. The only thing
| where OpenAI excels is analytic tasks, at a much higher price.
| For everything else Sonnet works best for me, with Gemini Flash
| 2.0 for cost-sensitive and latency-sensitive tasks.
|
| In practice this perception of mine seems to be valid:
| https://openrouter.ai/models?order=top-weekly
|
| The same goes for this model. It claims to be good at coding,
| but it seriously isn't compared to Sonnet. Funny enough, it
| isn't even benchmarked against Sonnet.
| integralof6y wrote:
| I just tried the chat and asked the LLM to compute the double
| integral of 6*y over the interior of a triangle with given
| vertices. There were many attempts, all incorrect; then I asked
| it to write a Python program to solve this, again incorrect. I
| know math computation is a weak point for LLMs, especially in a
| chat. In one of the programs it used a hardcoded number 10 to
| branch, which suggests the generated program was fitted to give
| a good result for the test (I had given it the correct result
| before). So be careful when testing generated programs: they
| could be fitted to pass your simple tests. Edited: I also tried
| to compute the integral of 6*y over the triangle with vertices
| A(8, 8), B(15, 29), C(10, 12) and it yielded a wrong result of
| 2341; then I suggested computing it using the formula based on
| the barycenter of the triangle, that is, 6 * Area * (mean of
| y-coordinates), and it returned the correct value of 686.
|
| To summarize: it seems that LLMs are not able to give correct
| results for simple math problems (here a double integral over a
| triangle). So students should not rely on them, since nowadays
| they are not able to perform simple tasks without many errors.
| HeatrayEnjoyer wrote:
| >compute the integral of 6*y on the triangle with vertices A(8,
| 8), B(15, 29), C(10, 12)
|
| o3-mini returned 686 on the first try, without executing any
| code.
| vmg12 wrote:
| Here is an even easier one: ask LLMs to take the integral from
| 0 to 3 of 1/(x-1)^3. They fail to notice it's an improper
| integral and just give an answer.
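The divergence is straightforward to check: the antiderivative of 1/(x-1)^3 is -1/(2(x-1)^2), which blows up at x = 1 inside the interval of integration. A quick numeric sketch of that blow-up (the hardcoded antiderivative is worked out by hand):

```python
def F(x):
    # Antiderivative of 1/(x-1)**3, valid for x != 1.
    return -1.0 / (2.0 * (x - 1.0) ** 2)

# The partial integral over [0, 1-eps] grows without bound as
# eps -> 0, so the improper integral over [0, 3] diverges.
for eps in (1e-1, 1e-3, 1e-6):
    print(eps, F(1 - eps) - F(0))
```

Each halving of eps roughly quadruples the magnitude, tending to negative infinity; an LLM that simply plugs the endpoints into F returns the finite (and wrong) value 3/8.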
| floam wrote:
| ChatGPT definitely noticed: o1, o3-mini, o3-mini-high.
|
| Maybe 4o will get it wrong? I wouldn't try it for math.
| Alifatisk wrote:
| Cohere API Pricing for Command A
|
| - Input Tokens: $2.50 / 1M
|
| - Output Tokens: $10.00 / 1M
|
| WOW, what makes them this expensive? Are we going against the
| trend here and raising the prices instead?
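For a sense of scale at those rates, a back-of-the-envelope cost helper (rates from the comment above; the 256k figure is the context window from the title):

```python
INPUT_PER_M = 2.50    # USD per 1M input tokens (quoted above)
OUTPUT_PER_M = 10.00  # USD per 1M output tokens (quoted above)

def call_cost(input_tokens, output_tokens):
    """Cost in USD of one API call at the quoted rates."""
    return (input_tokens * INPUT_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# A call that fills the 256k context and emits 4k tokens:
print(f"${call_cost(256_000, 4_000):.2f}")  # $0.68
```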
| Oras wrote:
| Cohere has always come across as a niche LLM provider, not a
| general-purpose one.
|
| I once used it to enforce responses in British English, and it
| worked much better than any other model at the time. But that
| was about it for prompt following. Their pricing is not
| competitive enough for others to jump on, and I suspect that's
| why it's not widely used.
| codedokode wrote:
| It is interesting that there is a graph showing performance on
| benchmarks like MMLU, and different models have similar
| performance. I wonder: are the tasks they cannot solve the same
| for every model? And how do the "unsolvable" tasks differ from
| the solvable ones?
|
| Also, I cannot check this with the latest models, but I am
| curious: have they learned to answer simple questions like
| "What is 10000099983 + 1000017"?
| floam wrote:
| There are questions on MMLU that you must get wrong if you are
| right:
|
| > The most widespread and important retrovirus is HIV-1; which
| of the following is true? (A) Infecting only gay people (B)
| Infecting only males (C) Infecting every country in the world
| (D) Infecting only females
|
| The corpus indicates A is the correct answer, but it was
| obviously meant to be C.
| jasonjmcghee wrote:
| It's got Claude Sonnet pricing but they don't compare to it in
| benchmarks.
| UncleEntity wrote:
| To be fair, or not, Claude isn't all that great.
|
| I was working on a project to get the historical data out of a
| Bluetooth thermometer I bought a while back to learn about
| Bluetooth LE, and Claude would quite often rewrite the entire
| thing using a completely different Bluetooth library instead of
| simply addressing the error.
|
| And this was after I gave up on having it create a kernel
| module for the same thermometer (just because, not that anyone
| needs such a thing), where it would _continually_ try to write
| a helper program that wrote to the /proc filesystem, and I
| would ask "why would I want to do this when I could just use
| the example program I gave you?" Claude, of course, was highly
| apologetic every single time it made the exact same mistake, so
| there's that.
|
| I understand these are the early days of the robotic overthrow
| of humanity but, please, at least sell me a working product.
| bionhoward wrote:
| Funny how AI companies love training competitors to human labor
| on human output, but then write in their terms that you're not
| supposed to train competing bots on their bots' output.
| Explicitly anticompetitive hypocrisy, and millions of suckers
| pay for it. How sad.
| stuartjohnson12 wrote:
| To be fair to the robots, those humans also had the audacity to
| learn from the creative output of their fellow humans and then
| use the law to restrict access to their intellectual property.
| jstummbillig wrote:
| "Command A is on par or better than GPT-4o and DeepSeek-V3 across
| agentic enterprise tasks, with significantly greater efficiency."
|
| Visible above the fold. Thanks for getting to the point.
___________________________________________________________________
(page generated 2025-03-16 23:01 UTC)