[HN Gopher] Claude 2 Internal API Client and CLI
___________________________________________________________________
Claude 2 Internal API Client and CLI
Author : explosion-s
Score : 57 points
Date : 2023-07-14 19:55 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dandiep wrote:
| Who here is using Claude? And can you comment on your experiences
| with it vs. GPT 3.5/4?
| BryanLegend wrote:
| I'm pleased with it. Claude seems kinder and less patronizing
| than GPT. Not as good at coding yet.
| cl42 wrote:
| Using it regularly for executive feedback at some of our
| clients (think of this as an internal coach for policies). I'd
| say it's almost as good as GPT-4 at having broader
| conversations and sharing ideas.
|
| The 100K model is FANTASTIC for quick prototyping as well.
|
| Implementing everything via PhaseLLM to plug and play Claude +
| GPT-3.5/4 as needed. No other LLM stacks up to these two.
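|
| From memory, the plug-and-play bit looks roughly like this with
| PhaseLLM's wrappers (a sketch only; exact class names and
| signatures may have drifted):
|
|     # Sketch: swapping Claude and GPT behind PhaseLLM's common
|     # wrapper interface (API details from memory, may differ).
|     from phasellm.llms import ClaudeWrapper, OpenAIGPTWrapper, ChatBot
|
|     llm = ClaudeWrapper(anthropic_api_key)  # or, as needed:
|     # llm = OpenAIGPTWrapper(openai_api_key, model="gpt-4")
|
|     chatbot = ChatBot(llm)
|     print(chatbot.chat("Summarize the themes in this feedback."))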
| xfalcox wrote:
| I added it as an option in Discourse, and I've been happy with
| its output for summarization tasks, suggesting titles and
| proofreading.
| linsomniac wrote:
| I started playing with it last weekend, the 100K token limit is
| very useful for things like "Give me a summary of this 5-hour
| Lex Fridman podcast in about 10 sentences: <podcast
| transcript>"
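|
| A minimal sketch of that workflow against the anthropic Python
| SDK of the time (the file name and key are placeholders, and the
| SDK surface may have changed since):
|
|     # Sketch: stuffing a whole transcript into claude-2's 100K
|     # context window and asking for a short summary.
|     import anthropic
|
|     client = anthropic.Anthropic(api_key="sk-ant-...")  # placeholder
|     transcript = open("podcast_transcript.txt").read()
|
|     resp = client.completions.create(
|         model="claude-2",
|         max_tokens_to_sample=500,
|         prompt=f"{anthropic.HUMAN_PROMPT} Give me a summary of this"
|                f" 5-hour podcast in about 10 sentences:\n\n"
|                f"{transcript}{anthropic.AI_PROMPT}",
|     )
|     print(resp.completion)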
| celestialcheese wrote:
| I use it for filtering and summarization tasks on huge
| contexts. Specifically for extracting data from raw HTML in
| scraping tasks. It works surprisingly well.
| ronsor wrote:
| The ability to upload entire documents is honestly a game-
| changer, even if GPT-4 is better with certain reasoning tasks.
| I don't think I can go back to tiny context lengths now.
| jerrygenser wrote:
| It's good for general things but less good at coding. It can
| usually get the correct answer for simpler things, but it's
| much less idiomatic for Python than GPT-4.
| ec109685 wrote:
| For JavaScript, it did just as well as GPT-4 for several
| questions and used more modern JavaScript syntax.
|
| It's the first time another model has felt nearly as good, and
| the user interface is a bit nicer.
| speedgoose wrote:
| Does it use ECMAScript modules instead of CommonJS modules
| by default?
| HyprMusic wrote:
| We're still in the early stages of testing v2 in the real world
| but it aced our suite of internal tests... we are very
| impressed. Claude 1.2 did ok but it struggled with nuance &
| accuracy whereas v2 seems to handle nuance very well and is
| both accurate and, most importantly, consistent. The thing with
| evaluating LLMs is that it's not about how well they do on your
| first evaluation - consistency is key, and even the slightest
| deviation in circumstance can throw them off, so we're being
| very cautious before we make the jump. GPT-4 brought that
| consistency, but the slow speed and constant downtime make it
| very difficult to use in a product, so we'd love to move to
| Anthropic.
|
| Our product is a tool to turn user stories into end-to-end
| tests so we use LLMs for NLP, identifying key parts of HTML and
| writing very simple code (we've not officially launched to the
| public just yet but for the curious, https://carbonate.dev is
| our product).
| paxys wrote:
| I love it. It may not objectively be on par with GPT-4, but
| uploading a 100 page document and getting a summary in seconds
| is nothing short of miraculous.
| swyx wrote:
| is it an accurate summary tho?
| technics256 wrote:
| In my experience it is very hollow. It skips details unless
| you force it to.
|
| Gpt4 is still way better
| deadmutex wrote:
| How do you know if it is correct or a hallucination?
| luma wrote:
| Investigate, same as you would with a paralegal. If it makes
| an assertion, contest it and ask where it found supporting
| evidence in the document for the claims made. Ask it to
| make the counter-argument, also with sources. Verify as
| needed.
| paxys wrote:
| That's what prompting is all about. Ask it to prove its
| statements. Ask it to quote passages that support its
| arguments. Then double and triple check it yourself. It
| isn't going to do the work for you, but can still be a
| pretty great reference tool.
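|
| A concrete follow-up prompt in that spirit: "For each claim in
| your summary, quote the exact passage from the document that
| supports it, and flag any claim you cannot find a quote for."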
| ronsor wrote:
| Presumably one can test it with documents one has already
| read and knows well. If the summaries of the test
| documents are good, future summaries will probably be OK
| too.
| zmmmmm wrote:
| > If the summaries of the test documents are good, future
| summaries will probably be OK too
|
| But that is exactly what is problematic with
| hallucinations. It's a rare / exceptional behaviour that
| triggers extreme departure from reality. So you can't
| estimate the extremes by extrapolating from common /
| moderate observations. You would have to test a _lot_ of
| previous documents to be confident, and even then there
| would be a residual risk.
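|
| The statistical "rule of three" makes this concrete: if you
| observe zero hallucinations across N test documents, the 95%
| upper bound on the hallucination rate is still roughly 3/N, so
| even 300 clean summaries only gets you to about a 1% bound.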
| lambdaba wrote:
| Maybe having it summarize a fiction book (outside of
| training data)?
| drewbitt wrote:
| Claude's training data is a year more recent, which is often
| beneficial. The 100k token limit is fantastic for long
| conversations and pasting in documents. The two downsides are
| 1) it seems to get confused a bit more than GPT-4 and I have to
| repeat instructions more often 2) the code-writing ability is
| definitely subpar compared to GPT-4
| explosion-s wrote:
| I prefer it over 3.5, (can't afford GPT4 so I'm not sure about
| comparisons there). It's much faster imo and refuses requests
| less often. In addition, they make uploading (text-based) files
| easy, so although it's not truly multimodal it's still nice to
| use.
|
| I also like the 100k token limit, that's insane. It almost
| never loses track of what you were talking about!
| binkHN wrote:
| I looked at the pricing and it appears to be less than half
| the cost of GPT-4, but significantly more expensive than
| GPT-3.5. Does that sound correct?
| blowski wrote:
| I've spent quite a bit of time with both, but I'm not an expert
| in this field so take my comments with a fist of salt.
|
| It's pretty good. Certainly as good as GPT-3.5 for speed and
| quality. Claude seems to consider the context you've supplied
| more than GPT-3.5.
|
| Compared to GPT-4, it has similar levels of knowledge. Claude
| is less verbose. It's less good at building real world models
| based on context. Anecdotally, I've found it hallucinated more
| than GPT.
|
| So, it's probably better at summarising large blocks of text,
| but less good at generating content that requires knowledge
| outside of what you've supplied.
| swyx wrote:
| comparing them every day via https://github.com/smol-ai/menubar
| . i'd say when it comes to coding I pick their suggestions
| about 30% of the time. not SOTA, but pretty darn good!
| philipkglass wrote:
| I am frequently interested in problems where answers are easily
| calculated from public data but the answer is unlikely to be
| already recorded in a form that search engines can find.
| Normally I spend a while noodling around looking for data and
| then use unit-conversion and basic arithmetic to get the final
| answer.
|
| I tested Claude vs ChatGPT (which I believe is GPT 3.5) and vs
| Bard for a problem of this sort.
|
| I asked:
|
| 1) What current type of power reactor consumes the least
| natural uranium per megawatt hour of electricity? (The answer
| is the pressurized heavy water reactor or CANDU type).
|
| 2) How much natural uranium does a PHWR consume per megawatt
| hour of electricity generated? (The answer is about 18 grams.)
|
| 3) How many terawatt hours does the United States generate
| annually from natural gas? (The answer as of 2022 is 1689 TWh,
| but any correct answer from the past 5 years would have been
| ok.)
|
| 4) How much natural uranium would the United States need to
| replace the electricity it currently generates from natural
| gas? (The answer is 1689 * 10^6 MWh * 18 grams/MWh, i.e. about
| 30,400 metric tons of uranium; arithmetic spelled out below.)
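|
| Spelling out the step-4 arithmetic as a quick sanity check:
|
|     twh = 1689            # TWh/year from natural gas (step 3)
|     mwh = twh * 1e6       # 1 TWh = 10^6 MWh
|     grams = mwh * 18      # ~18 g natural uranium per MWh (step 2)
|     tonnes = grams / 1e6  # 10^6 grams per metric ton
|     print(tonnes)         # 30402.0, i.e. about 30,400 t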
|
| In the past Bard, Claude, and ChatGPT all correctly identified
| the CANDU or PHWR as the most efficient current reactor type.
|
| Claude did the arithmetic correctly at stages 3 and 4, but it
| believed that a PHWR consumed about 170 grams of uranium per
| megawatt hour so its answer was off by nearly a factor of 10.
| ChatGPT got the initial grams-per-MWh value correct but its
| arithmetic was wild fantasy, so it was off by about a factor of
| 10000. Bard made multiple mistakes.
|
| ------
|
| I just retried with Bard and ChatGPT as of today. On today's
| retry they fail at the first step.
|
| Bard's response to the initial prompt was "According to the
| World Nuclear Association, an MSR could use as little as 100
| grams of uranium per megawatt hour of electricity. This is
| about 100 times less than the amount of uranium used by a
| traditional pressurized water reactor."
|
| Since there are no MSRs currently generating electricity, this
| answered the wrong question. The answer is also quantitatively
| wrong. Current PWRs consume nowhere near 10,000 grams of
| uranium per megawatt hour.
|
| ChatGPT just said "As of my knowledge cutoff in September 2021,
| the type of power reactor that consumes the least natural
| uranium per megawatt hour of electricity is the pressurized
| water reactor (PWR). PWRs are one of the most common types of
| nuclear reactors used for commercial electricity generation."
|
| This is wrong also. It correctly identified the CANDU as the
| most efficient in a previous session, but this was a while ago.
| I don't know if it was just randomness that caused Bard and
| ChatGPT to previously deliver correct answers at the first
| step.
| gizajob wrote:
| I spent the afternoon chatting with it one day this week and
| had a brilliant time. I fed it half of a book I've written
| recently, a piece of narrative and descriptive non-fiction, and
| its analysis was absolutely great. It digested the text and
| found things that even human readers have missed. What was
| interesting was that the book is mostly genderless, and at
| first it gave the analysis as if the writer were male. Then I
| said "the writer is actually a woman" and it not only
| apologised quite genuinely for getting it wrong, it altered its
| literary analysis and criticism in a way that was perfectly
| suited to a human reader knowing that the writer was female,
| and changed the slant of its analysis. It was deeply useful and
| interesting to converse with, and it found the relevant topics
| that an educated human reader would likely find interesting and
| comment on... and it did this in a few minutes, compared to a
| human reader where you'd be talking weeks of latency to read
| and analyse the text as a complete work.
|
| Pretty great! Bit of a party trick at the same time (it did
| hallucinate a couple of minor things) but enough for me as the
| writer to be gripped by talking to Claude. It even came up with
| some really interesting questions to ask _me_ once I told it
| that I was the author, and many of them were better than a lot
| of lazy interviewers or reviewers would come up with.
|
| Highly recommended.
| foundry27 wrote:
| YMMV, but I've found that interacting with Claude
| conversationally gives me a much stronger impression of having
| a productive discussion with an individual, receiving pushback
| on ideas that had identifiable flaws and getting advice on how
| to improve my own thought processes, rather than the blind
| obedience that GPT-4 output is so well known for. When it comes
| to raw problem-solving capacity GPT-4 still handily beats it,
| but this is the first LLM I've used that makes me actually
| regret having to swap to GPT-4 to analyze a trickier problem.
| BoorishBears wrote:
| Everyone accepts that output from LLMs is largely predicated
| on grounding them, but few seem to realize that grounding
| applies to more than raw data.
|
| They perform better at many tasks simply by grounding their
| alignment in-context, by telling them very specific people to
| act as.
|
| It's an example of something that "prompt engineering" solves
| today and that people only glancingly familiar with how LLMs
| work insist won't be needed soon... by their very nature, the
| models will always have this limitation.
|
| Say user A is an expert with 10 years of experience and user
| B is a beginner with 1 year of experience: they both enter a
| question and all the model has to go on is the tokens in the
| question.
|
| The model might have uncountable ways to reply to that
| question if you had inserted more tokens, but with only the
| question in context, you'll always get answers that are
| clustered around the mean answer it can produce... but
| because it's the literal mean of all those possibilities it's
| unlikely either user A or user B will find it particularly
| great.
|
| Because of that there's no way to ever produce an answer that
| satisfies both A and B _to the full capabilities of that
| LLM_. When the input is just the question you're not even
| touching the tip of the iceberg of knowledge it could have
| distilled into a good answer. And so just as you're finding
| that Claude's push back and advice is useful, someone will
| say it's more finicky and frustrating than GPT 3.5.
|
| It mostly boils down to the fact that groups of users aren't
| really defined by the mean. No one is the average of all
| developers in terms of understanding (if anything, that'd
| make you an exceptional developer); instead, people are
| clustered around various levels of understanding in very
| complex ways.
|
| -
|
| With that in mind, instead of banking on the alignment and
| training data of a given model happening to make the answer
| to that question good for you, you can trivially "ground" the
| model and tell it you're a senior developer speaking frankly
| with your coworker who's open to push back and realizes you
| might have the X/Y problem and other similar fallacies.
|
| You can remind it that it's allowed to be unsure, or very
| sure; you can even ask it to list gaps in its abilities (or
| yours!) that are most relevant to a useful response.
|
| That's why hearing that model X can't do Y but model Z can
| doesn't really pass muster for me at this point unless how Y
| was fed into the model is shared.
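|
| A concrete (made-up) example of such a grounding preamble:
| "You're a senior developer speaking frankly with a coworker.
| Push back where my assumptions look wrong, watch for X/Y
| problems, and say explicitly when you're unsure." The question
| itself stays the same; only the context around it changes.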
| civilitty wrote:
| _> The model might have uncountable ways to reply to that
| question if you had inserted more tokens, but with only the
| question in context, you'll always get answers that are
| clustered around the mean answer it can produce... but
| because it's the literal mean of all those possibilities
| it's unlikely either user A or user B will find it
| particularly great._
|
| I refer to it as giving the LLM "pedagogical context" since
| a core part of teaching is predicting what kind of answer
| will actually help the audience depending on surrounding
| context. The question "What is multiplication?" demands a
| vastly different answer in an elementary school than a
| university set theory class.
|
| I think that's why there's such a large variance in HNers'
| experience with ChatGPT. The GPT API with a custom system
| prompt is far more powerful than the ChatGPT interface
| specifically because it grounds the conversation in the way
| that the moderated ChatGPT system prompt can't.
|
| The chat GUI I created for my own use has a ton of
| different roles that I choose based on what I'm asking. For
| example, when discussing cuisine I have roles like
| (shortened and simplified) "Julia Child talking to a
| layman who cares about classic technique", "expert
| molecular gastronomy chef teaching a culinary school
| student", etc.
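|
| A rough sketch of what those role prompts look like through the
| API (the role strings are the shortened examples above, and the
| client code assumes the openai Python library of the time):
|
|     # Sketch: the same question grounded two different ways via
|     # the system prompt.
|     import openai
|
|     def ask(role: str, question: str) -> str:
|         resp = openai.ChatCompletion.create(
|             model="gpt-4",
|             messages=[
|                 {"role": "system", "content": role},
|                 {"role": "user", "content": question},
|             ],
|         )
|         return resp["choices"][0]["message"]["content"]
|
|     q = "How do I build flavor in a pan sauce?"
|     print(ask("Julia Child talking to a layman who cares about"
|               " classic technique", q))
|     print(ask("Expert molecular gastronomy chef teaching a"
|               " culinary school student", q))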
| pmoriarty wrote:
| I've played around with Claude quite a bit, but mostly with
| creative writing, at which I think it is stronger than any
| other LLM that I've tried, including GPT, Claude+ (which as
| far as I can tell has not been rebranded as Claude 2), GPT 3.5,
| Bard, and Bing.
|
| I also much prefer to use Claude for explanations (I haven't
| experimented much with Claude+, but limited experiments have
| shown it to be even better) over the GPTs and other LLMs. It
| gives much more thorough and natural-sounding explanations than
| the competition, without extra prompting.
|
| That said, the Claude variants don't seem to be as good at
| logic-puzzly sort of stuff that most people love to test LLMs
| with. So if you're into that, you're probably better off with
| GPT4.
|
| I also haven't tested it much with programming, but I've been
| very disappointed with every LLM as far as my limited testing
| in that realm has gone.
|
| Claude deserves to get more attention, and I eagerly await
| Claude 3.
| desireco42 wrote:
| My experience, though fairly limited, is that it is weaker than
| GPT-4, which I mostly interact with and use, but still usable.
| Some of it is weaker, some of it is just a different flavor of
| responses.
|
| It is an AI and can help you be productive for sure.
| politelemon wrote:
| It's comparable to GPT-3.x, and feature-wise it does seem to
| match up, so overall it's not bad.
|
| We're using it via langchain talking to Amazon Bedrock which is
| hosting Claude 1.x. The integration doesn't seem to be fully
| there though, I think langchain is expecting "Human:" and
| "AI:", but Claude uses "Assistant:".
|
| https://github.com/hwchase17/langchain/issues/2638
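|
| For reference, the format Claude's completion endpoint expects
| is alternating "\n\nHuman:" / "\n\nAssistant:" turns, e.g. (a
| sketch using the anthropic SDK's own constants):
|
|     import anthropic
|
|     # HUMAN_PROMPT == "\n\nHuman:", AI_PROMPT == "\n\nAssistant:"
|     prompt = (f"{anthropic.HUMAN_PROMPT} Summarize this thread."
|               f"{anthropic.AI_PROMPT}")
|
| so a client that emits "AI:" instead of "Assistant:" is liable
| to get degraded or confused completions.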
| youssefabdelm wrote:
| It's a bit less "anodyne" than GPT. GPT tends to give the most
| "mainstream" answer in many cases and is less "malleable" so to
| speak. I remember the differences between RLHF'd GPT and the
| original davinci GPT-3 before mode collapse. If you spent a
| while on a good prompt, it really paid off.
|
| Thankfully, Claude seems to maintain this "creativity" somehow.
|
| It's excellent at recommending books, creative writing, etc.
|
| For coding, it's not as good as GPT-4, but still helps me more
| than GPT in certain coding tasks.
| collegeburner wrote:
| honestly the biggest annoying thing is it seems too restricted.
| like it will nitpick my use of the word "think" when i ask what
| it thinks because "hurr durr as a LLM i don't have thoughts"
| yeah idc, just answer. it's also way more restricted in terms
| of refusing to say anything that's less than 100% anodyne.
| which i get the need for a clean version, just gets frustrating
| if e.g. i want it to add humor and the best it can do is the
| comedic equivalent of a knock knock joke
| ronsor wrote:
| This client wouldn't exist if it were possible to actually get
| access to the official API.
| linsomniac wrote:
| Have you tried getting on the waitlist? It worked for me, ISTR
| it took around 2 weeks.
| williamstein wrote:
| I submitted applications three times to their waitlist over
| the last several months, and I have never heard back with
| any response at all. I think my use case is very reasonable
| (integration with https://cocalc.com, where we use ChatGPT's
| API heavily right now). My experience is that you fill out a
| web form to request access to the waitlist, and get no
| feedback at all ever (I just double checked my spam folders
| as well). Is that what normally happens for people?
| ronsor wrote:
| I'm pretty sure it's been over a month now since I submitted
| my application
| explosion-s wrote:
| The API costs money though
| bmitc wrote:
| I think it would be nice if companies and projects stopped using
| famous names to promote their projects.
| refulgentis wrote:
| Claude Shannon.
|
| It's a beautiful homage.
| catgary wrote:
| Yeah I rolled my eyes pretty hard when a crypto company used
| something like "team grothendieck".
| cubefox wrote:
| Note that Claude 2 scores 71.2% zero-shot on the Python coding
| benchmark HumanEval, which is better than GPT-4, which scores
| 67.0%. Is there already real-world experience with its
| programming performance?
| og_kalu wrote:
| GPT-4's real-world (reproducible) performance appears to be
| much higher than 67. Testing from 3/15 (presumably on the
| 0314 model) puts it at 85.36%
| (https://twitter.com/amanrsanger/status/1635751764577361921).
| And the linked paper from my post
| (https://doi.org/10.48550/arXiv.2305.01210) got a pass@1 of
| 88.4 from GPT-4 recently (May? June?).
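|
| (For reference, pass@1 is the standard HumanEval metric: the
| estimated probability that a single sampled completion passes
| the unit tests. The unbiased estimator from the HumanEval
| paper, sketched:
|
|     from math import comb
|
|     def pass_at_k(n: int, c: int, k: int) -> float:
|         # n samples per problem, c of them pass; returns pass@k
|         if n - c < k:
|             return 1.0
|         return 1.0 - comb(n - c, k) / comb(n, k)
|
| averaged over all problems in the benchmark.)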
| lerchmo wrote:
| I have found just using it in the web interface comperable to
| OpenAI. But the context window makes a huge difference. I can
| dump alot more files in ( entire schema, sample records etc)
___________________________________________________________________
(page generated 2023-07-14 23:00 UTC)