[HN Gopher] Pi.ai LLM Outperforms Palm/GPT3.5
___________________________________________________________________
Pi.ai LLM Outperforms Palm/GPT3.5
Author : ergodas
Score : 133 points
Date : 2023-06-24 13:42 UTC (9 hours ago)
(HTM) web link (inflection.ai)
(TXT) w3m dump (inflection.ai)
| alsodumb wrote:
| Do you know why they left Google's PaLM 0-shot and 1-shot
| results blank in the TriviaQA benchmarks? It's not because
| Google didn't release this data; they did, in the same table
| as the other data.
|
| It's because PaLM significantly outperforms them in both of
| these cases, and they couldn't make their entire line bold to
| flaunt how good their model is.
|
| I'm not trusting any of these benchmarks. After a day or two
| of using the model I'd know whether it's better than GPT-4 or
| not.
| lumost wrote:
| It's really hard to believe any model is "better than OpenAI"
| when I can't try it out right now.
|
| Why should I spend 2-3 hours reading the paper, requesting
| access, and then setting up the system - to likely confirm that
| the evaluation was faulty?
| simonster wrote:
| There are two steps to building a conversational LLM. The
| first is pretraining on an enormous amount of text. The
| second is fine-tuning, which usually involves a combination
| of a small amount of high-quality human data and
| reinforcement learning from human feedback (in practice, from
| another neural net trained to model human feedback).
|
| This paper is about the quality of the pretraining. It is not
| necessarily going to be correlated with your subjective
| judgment of how good the model is. A good pretrained model
| without any fine-tuning will be very difficult to use for
| most purposes, because it won't do a very good job following
| instructions. However, assuming that the fine-tuning is done
| well, the quality of the pretraining determines the limits of
| the capabilities of the model. This tech report shows that
| the team did a good (or at least reasonable) job with the
| pretraining.
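|
| To make "pretraining" concrete, here is a toy sketch (mine, in
| PyTorch - not Inflection's actual code) of the objective:
| plain next-token prediction over a huge corpus.
|
|     import torch
|     import torch.nn.functional as F
|
|     # tokens: (batch, seq_len) integer token ids from the corpus
|     # model:  any causal LM returning (batch, seq, vocab) logits
|     def pretraining_loss(model, tokens):
|         logits = model(tokens[:, :-1])   # predict each next token
|         targets = tokens[:, 1:]          # shifted-by-one targets
|         return F.cross_entropy(
|             logits.reshape(-1, logits.size(-1)),
|             targets.reshape(-1),
|         )
|
| Fine-tuning then swaps this objective for curated examples and
| a learned reward signal; the benchmarks in the tech report
| mostly measure how well the first stage went.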
|
| The primary audience for this post and tech report is (or at
| least should be) ML researchers that Inflection would like to
| recruit and technically knowledgeable investors, not end-
| users. To remain competitive, Inflection is gonna have to
| train a 10x more expensive model someday; OpenAI and Google
| already have. They need talent and investor $ to do that.
| high_derivative wrote:
| We are not the audience. The VCs they are trying to raise the
| next megaround from are.
| moffkalast wrote:
| Why should VCs have lower standards than random internet
| people?
| FireInsight wrote:
| You can try this one at https://pi.ai/
| lumost wrote:
| Tried it; the safety filters seem better than OpenAI's.
| However, the model lacks depth of technical knowledge.
| reaperman wrote:
| It's definitely impressive for having very coherent
| responses without major verbal tics. However, while I
| agree with your assessment of its apparent lack of
| technical knowledge, I think that's mainly because the
| answers are so short.
|
| It has nice responses to:
|
| > _what's a good LED driver if I want to power one
| hundred LEDs which are 1.5A, 3.25V each? input voltage
| can be anything from 24VDC to 48VDC. List specific model
| numbers. Ideally something with boost conversion._
|
| But the responses are far too short to list a reasonable
| number of options, so it usually ends up listing only
| two. Sometimes it skips listing any and instead gives me
| a generic description of a process I should follow to
| find them. ChatGPT has a lot more response space to work
| with, and generally seems to "need" it to answer this
| question - it also tends towards descriptive answers
| rather than prescriptive suggestions. With the additional
| space, though, ChatGPT often eventually gets around to
| suggesting some parts for the BOM.
| mikeravkine wrote:
| This is so terrible, almost hilariously so:
|
| https://heypi.com/s/gf72UPDDacbLwxTHEQLzg
| jdiff wrote:
| Is it? For a single-line function, that parses and runs
| just fine. It might also just lack ChatGPT's text
| formatting. That doesn't make it terrible, just
| significantly less wieldy for formatting-heavy tasks like
| code, especially whitespace-sensitive Python.
| Trung0246 wrote:
| Yeah, it feels like this AI's hallucination is strong.
| Way too strong. I tried giving it a simple word-ranking
| task, but it failed spectacularly.
|
| https://heypi.com/s/Qmvu2EscbGZzWCVbpCarh
| rcfox wrote:
| I tried asking it to tell me a story, and it quickly got
| the characters' roles mixed up. I also asked it to make one
| character speak in rhymes, and it just made everything
| rhyme. ChatGPT does a better job at storytelling.
|
| Though pi.ai was a bit more engaging to work with. It was
| willing to break the fourth wall and compliment me on the
| unexpected twists I introduced.
| brucethemoose2 wrote:
| Chronos 33B is SOTA for storytelling, from what I have
| personally seen.
|
| It's probably even better when merged with an instruct
| model.
| ilaksh wrote:
| They didn't say it was better than GPT-4. They said better than
| GPT-3.5.
|
| I tested it with a coding exercise. It's definitely not as
| good at coding as GPT-3.5.
| qwytw wrote:
| Putting an emoji in every single sentence really makes it
| hard to read or take seriously, though.
|
| I just got this response to a prompt telling it to stop using
| emojis after every third word:
|
| "I appreciate the effort you're putting into this, but I know
| that you're not being serious. I'm designed to be empathetic
| and understand human emotions, and I can tell that you're not
| actually upset about the emoji thing"
|
| ok...
|
| GPT-3.5 at least doesn't pretend it understands human
| emotions better than humans do. Generally this behaves
| more like a pretentious-asshole LLM than anything else.
| ergodas wrote:
| TriviaQA has different splits. They did compare and their model
| is much better.
|
| https://twitter.com/MaartenBosma/status/1672349512499867648
| [deleted]
| wejick wrote:
| The chatbot they have seems to have access to a knowledge
| graph, which is a very good way to ensure access to more
| up-to-date data.
|
| This means a text classifier and NER are needed to create good
| graph queries.
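|
| To illustrate (pure speculation on my part about how Pi works),
| here is a minimal sketch of such a pipeline, using spaCy for
| NER; the graph query syntax is made up for illustration:
|
|     import spacy
|
|     nlp = spacy.load("en_core_web_sm")  # small English NER model
|
|     def to_graph_query(user_message: str) -> str:
|         """Pull named entities out of the message and turn
|         them into a lookup against some knowledge graph."""
|         doc = nlp(user_message)
|         ents = [(ent.text, ent.label_) for ent in doc.ents]
|         # Hypothetical query syntax - whatever the store expects.
|         clauses = " AND ".join(
|             f"node:{label}='{text}'" for text, label in ents)
|         return f"MATCH {clauses}" if clauses else ""
|
|     print(to_graph_query("Who is the CEO of DeepMind?"))
|     # e.g. MATCH node:ORG='DeepMind'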
| zmmmmm wrote:
| They don't seem to say how large their model is?
|
| Comparing themselves to LLaMa 65b seems like a bit of a tell ...
| it was explicitly released as a small foundational model to allow
| others to train and improve it. If they couldn't beat that
| despite more parameters / training / data then they would be
| effectively going backwards.
|
| They do say they trained using "thousands of H100 GPUs" which is
| a lot more than LLaMa 65b required.
| eshack94 wrote:
| This headline and press release are misleading.
| 19h wrote:
| falcon-40b also outperforms GPT3.5 -- is it powering Pi.ai?
| r_singh wrote:
| Just tried it right now and it seemed kind of like a human to me.
|
| Also ran Guanaco 13B locally on my M1 Pro today, and it ran
| pretty smoothly.
|
| I'm planning to use it for text extraction on a large amount of
| data. Let's see how versatile LLMs can get and whether they can
| replace some of the NLP tech I was using before.
| 19h wrote:
| "You can talk with me anywhere." -- and yet no Telegram or Signal
| :-)
| nik736 wrote:
| > Do you have an API?
|
| Yes, I have an HTTP API that you can use to programmatically
| interact with me. The API is at heypi.com/api. You can find the
| documentation and examples at heypi.com/docs. The API allows you
| to create conversations, send and receive messages, and more. You
| can even integrate me with your own apps!
|
| ---
|
| Not true, lol. Returning 404s.
| celestialcheese wrote:
| It's pretty impressive - but too aligned to be anything more
| than a customer service bot. It just feels like a corporate
| drone that is relentlessly friendly and upbeat. Which, maybe,
| is their goal.
| 2h wrote:
| Why are the tables SVGs? That is horrible for accessibility.
|
| https://www.datocms-assets.com/98476/1687548656-inflection-1...
| avereveard wrote:
| Eh, they always say "beats this or that" using benchmarks, but
| then the AI is really limited once you push it. This one will
| not code, has difficulty writing queries, and absolutely won't
| understand what to do with many PaLM prompts.
|
| And the UI is terrible: you can't organize chats, you can't
| clear chats. Why didn't they just integrate one of the many
| MIT-licensed chat frontends that already exist?
| swyx wrote:
| If it's not open source and no API is available, the benchmarks
| can't be independently reproduced, and it's hard to take at
| face value any claim of outperforming GPT-3.5 - a major claim
| that must be verified first. (Falcon has had similar
| reproduction issues:
| https://twitter.com/Francis_YAO_/status/1667245675447468034?...)
| ilaksh wrote:
| You can very easily test it on their pi website. I tried a
| coding exercise. It's not as good at programming as
| gpt-3.5-turbo.
|
| From my short test, what it really seems to excel at is
| smugness.
| rajnathani wrote:
| From the about page: the company was cofounded by one of the
| founders of DeepMind.
| waynecochran wrote:
| I guess they didn't pause training large models; I didn't see
| their name on the list:
| https://futureoflife.org/open-letter/pause-giant-ai-experiments/
|
| I am not saying they should have, but I am interested: what
| fraction of those training large LLMs have signed on?
| spullara wrote:
| No one paused.
| waynecochran wrote:
| I suspect that you are correct, but do we know, then, that
| Sam Altman is not telling the truth about OpenAI pausing?
| JieJie wrote:
| He is almost certainly parsing his words very carefully.
| When he says "We are not currently training GPT-5", he
| means that they are not currently feeding training data
| into a foundation model.
|
| He has clearly said, though, that they are working towards
| the moment when they do start, and that they hope to have
| something really remarkable to show for it.
| flapjaxy wrote:
| Did I miss where the model size is? One-shot rankings are
| nice, but it sounds like they're trying to build a proprietary
| alternative to other models rather than focusing on outright
| competitiveness.
|
| I wonder about the applicability of performance metrics to
| specialized models. (This is meant to be a personal-assistant
| AI, right?) I'd think that either: 1. all models perform the
| same natural language understanding functions, or 2. context
| matters a ton. If it's 1, then there's no need for a
| specialized model. If it's 2, then the relevance of
| performance metrics diminishes.
| zwaps wrote:
| Sad that they also refuse to offer up any technical information.
|
| I am a bit salty, given that all these companies basically use
| 90% open-source data and public research, and most likely copy
| a good bit of their ideas from public repos.
|
| Alas, such is life.
| behnamoh wrote:
| Beautiful website and nice fonts!
|
| but back to the topic: I'm quite shocked that PaLM gets
| outclassed by much smaller models on a regular basis. I would
| have thought that Google, despite not having a moat, at least
| had enough talent and focus to get LLMs right. But what I'm
| observing is that startups like ClosedAI, Anthropic, etc.
| constantly beat big players like Google at their own game.
| solrik wrote:
| I am sure Google is working on some great new models. PaLM was
| a disappointment, but they have been leading the charge in deep
| learning for decades.
| cubefox wrote:
| They (with the newly combined Google DeepMind team) are
| indeed working on a new large model: Gemini. The intention
| seems to be to outperform GPT-4.
|
| During Google IO a while ago, Pichai said Gemini was
| currently in training.
| cubefox wrote:
| Note that PaLM and PaLM 2 are completely different models.
| theage wrote:
| Look at Hollywood: they spawned photo-realistic CGI, then
| figured out their captive audience will settle for much lesser
| quality anyway, so why bother?
| CyberDildonics wrote:
| Hollywood is a city. If you mean movies, every production is
| different and has its own budget. They don't have a captive
| audience, people have to choose to watch a movie for
| entertainment and then choose a specific movie.
|
| I don't know what you mean by 'much lesser' quality but
| hundreds of millions in CG were put into just the biggest
| movies of the summer. Avatar 2 alone was an enormous feat. No
| one at any point in the process is capable of making
| something spectacular, then deciding to just make something
| that looks mediocre instead. The only place that happens is
| cartoons for kids.
| gwern wrote:
| What happened was Google beat itself at its own game, if you
| will. Google PaLM-1 is beaten so often simply because Google
| Chinchilla scaling is so much better. (Note that PaLM-2 is not
| benchmarked in OP.)
|
| PaLM-1 is hobbled by the fact that it was probably the largest
| (because the last) LLM to be trained with the Kaplan scaling
| laws rather than the Chinchilla ones. As soon as Chinchilla
| came out, no one would train like PaLM-1 again, because it was
| giving up so much performance compared to instead training a
| much smaller Chinchilla-optimal model. (This had the
| interesting consequence that PaLM-1 would thereby remain the
| largest dense LLM, by parameter count, trained for probably
| years to come - because why would you train one as
| inefficiently as that, when a larger-than-PaLM-1
| Chinchilla-optimal model would require staggering levels of
| compute+data.)
|
| The PaLM-1 paper came out within days of the Chinchilla paper,
| and many people noted that this pointed to extraordinary levels
| of dysfunctionality within Google - that DeepMind would not
| tell Google Brain that they were wasting literally millions of
| dollars of compute by training a model in what DeepMind was
| busy showing was a very suboptimal way.
| behnamoh wrote:
| Thanks for your explanation; this clarified a lot of things.
| flkenosad wrote:
| So could Google now just spend another few million on
| compute, training with Chinchilla scaling, and be
| exponentially further along?
| gwern wrote:
| You mean of PaLM-1? No. PaLM-1 was _extremely_ far off
| Chinchilla scaling; it's not something you can salvage
| with just a few more millions of TPU-pod time. Think more
| like hundreds of millions... I think somewhere I estimated
| PaLM-1 was at something like 5% of Chinchilla-optimal, so
| they would have to train it 20x more. Obviously, they are
| not going to, not with methods improving so rapidly, with
| innovations like UL2. PaLM-1 is just a sunk cost, is all,
| and a useful historical datapoint (eg. for studying inverse
| scaling).
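|
| A back-of-the-envelope check on that estimate, using the
| public numbers (PaLM-1: 540B params on 780B tokens;
| Chinchilla's rough rule of thumb of ~20 training tokens per
| parameter):
|
|     palm_params = 540e9      # PaLM-1 parameter count
|     palm_tokens = 780e9      # tokens it was trained on
|     tokens_per_param = 20    # Chinchilla heuristic (70B / 1.4T)
|
|     optimal = palm_params * tokens_per_param   # ~10.8T tokens
|     print(palm_tokens / optimal)  # ~0.07, i.e. ~5-7% of optimal
|     print(optimal / palm_tokens)  # ~14x more training needed
|
| The exact multiplier depends on which scaling-law fit you
| use, but the order of magnitude is the point.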
|
| Equally obviously, PaLM-2 did not make the same mistake and
| so for that and other reasons, greatly outperforms PaLM-1.
| So, beating PaLM-1 at this point is an achievement, but you
| are still far from 'beating Google'.
| rizky05 wrote:
| [dead]
| comboy wrote:
| I've spent some time conversing with it (it said 1B params -
| is that true? It wouldn't say when asked directly..). I asked
| one question in Polish to check if it knows other languages;
| then it asked me to log in, and after I used Apple to log in:
| "Sorry, you have been blocked". The question was literally
| whether it knows other languages.
|
| Output length seems very limiting. Optimizing for positive
| emotions hurts truthfulness, and unfortunately its knowledge
| seems pretty limited. It's got some wit though.
| vjeux wrote:
| " I'm sorry, but I can't write code. But I can answer any
| questions you might have about BPE tokenization or any other text
| pre-processing techniques. Are you interested in natural language
| processing?"
|
| Well this is unfortunate :(
| zamalek wrote:
| I think it's a good idea; they can focus on a product that
| does one thing well. There are plenty of other models out
| there that can write code.
| littlestymaar wrote:
| GPT-3.5 can't write code either, it just doesn't admit it
| this plainly /s
| gavi wrote:
| It wrote some code for me - but stopped in the middle:
|
| Q: The following are table definitions:
| Users(id, name, username, password, first_name, last_name,
| last_login_dttm)
| Pictures(id, user_id, name, date_taken, size)
| Generate SQL for the following: who is the top user by
| number of pictures?
|
| A: Oh fun, a little SQL problem! So let's see here... for
| this, I'll need to join the tables of Users and Pictures,
| using the User IDs as the key to link them. Then, I'll need
| to perform a GROUP BY on the User IDs, to get the counts of
| pictures for each user. Finally, I can sort by the count and
| select the top user. I think the final query should look
| something like this:
|
| SELECT id, first_name, last_name FROM Users AS u
| INNER JOIN Pictures AS p ON u.id = p.user_
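|
| For reference, here is the query it was presumably working
| toward (my completion, not the model's), wrapped in a
| runnable sqlite3 sketch with toy data:
|
|     import sqlite3
|
|     conn = sqlite3.connect(":memory:")
|     conn.executescript("""
|         CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT,
|             username TEXT, password TEXT, first_name TEXT,
|             last_name TEXT, last_login_dttm TEXT);
|         CREATE TABLE Pictures (id INTEGER PRIMARY KEY,
|             user_id INTEGER, name TEXT, date_taken TEXT,
|             size INTEGER);
|         INSERT INTO Users (id, first_name, last_name)
|             VALUES (1, 'Ada', 'L'), (2, 'Alan', 'T');
|         INSERT INTO Pictures (user_id, name)
|             VALUES (1, 'a'), (1, 'b'), (2, 'c');
|     """)
|     top = conn.execute("""
|         SELECT u.id, u.first_name, u.last_name,
|                COUNT(p.id) AS n_pictures
|         FROM Users AS u
|         INNER JOIN Pictures AS p ON u.id = p.user_id
|         GROUP BY u.id
|         ORDER BY n_pictures DESC
|         LIMIT 1
|     """).fetchone()
|     print(top)  # (1, 'Ada', 'L', 2): top user by picture count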
| jdiff wrote:
| Its output seems to be limited to a pretty short length.
| Even natural language gets clipped after a paragraph or two.
| FireInsight wrote:
| I could trick it into providing code blocks and snippets,
| though very small ones. And you have to really steer the
| conversation so that it comes up naturally. It's definitely
| been trained on lots of dev docs from the internet; I just
| think the devs aren't too confident about its ability to
| produce functioning code.
|
| It outputs markdown code blocks but is not made to handle
| the rendering.
| leobg wrote:
| > As a vertically integrated AI studio, we do everything in-house
| for AI training and inference: from data ingestion, to model
| design, to high-performance infrastructure.
|
| What does that even mean? They run their own GPUs vs using some
| cloud provider? They hand-type their own training data? And even
| if they did, why would that matter?
| Havoc wrote:
| Same reasons devs put "full stack" on their CV
| littlestymaar wrote:
| "Full stack" merely means you have experience working with
| both back-end and front-end stuff, it's not a bullshit self-
| marketing term.
| class4behavior wrote:
| It's a buzzword salad for investors or users, not developers.
| DIY stuff is cheaper, and you can pretend to be an expert in
| everything (at least until you ruin everything). Input data
| is free, as it's stolen from the internet, just as other
| companies do it.
| Snacklive wrote:
| Why is it stolen? Assuming you are using data from the
| public internet, why would someone consider that "stolen
| data"?
| cubefox wrote:
| For text-to-image models there are currently two major
| lawsuits because they were trained on copyrighted pictures.
| I'm not aware of any such lawsuits for text, but in terms
| of copyright, text isn't very different from images.
| actuallyalys wrote:
| I'm not aware of a text lawsuit either, but there is one
| for code: https://githubcopilotlitigation.com/. I'm a
| little surprised there isn't one for text yet, since the
| Washington Post published an article detailing how many
| tokens from websites, including those run by major media
| companies, go into large models:
| https://www.washingtonpost.com/technology/interactive/2023/a...
| It may be that
| corporations think they can profit off these models to a
| greater extent than they are subject to damages, that
| their attorneys simply don't think they have a case, or
| that they want to see how the image and code lawsuits go
| first. This is all speculation, however.
| astrange wrote:
| It doesn't matter if there are lawsuits if none of them
| are successful.
| jdiff wrote:
| Let's not get ahead of ourselves and assume we know how
| they'll turn out.
| Sharlin wrote:
| There is this thing called "copyright".
| rtuin wrote:
| Based on the title I expected a well performing edge model
| available to run on a Raspberry Pi.
|
| This is clearly not that.
| geek_at wrote:
| I also got excited for the same reason. Or for even an open
| source model. Doubly disappointing.
| 29athrowaway wrote:
| How does it perform on well-known human tests?
| cubefox wrote:
| The article contains benchmark results for those tests. On
| several it is better than GPT-3.5.
| mark_l_watson wrote:
| I always feel better when evaluations are done by impartial
| researchers. Is that the case here?
|
| I looked to see if Pi.ai's LLM was open and available, and I
| couldn't find it. I have a new strategy for using LLM APIs: I
| use FastChat with one of the Vicuna 7B, 13B, or 33B models -
| both the command line interface tool and the OpenAI-compatible
| APIs via the FastChat REST server. By setting environment
| variables, my code can switch to using the real OpenAI APIs. I
| rent a Lambda Labs GPU server to run these models myself. This
| is the strategy I am also using in the book I just started
| writing, "Safe For Humans AI":
| https://leanpub.com/safe-for-humans-AI
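|
| Concretely, the switch looks something like this sketch
| (model name, host, and port are whatever you configured the
| FastChat OpenAI-compatible server with; this uses the
| 2023-era 0.x openai client):
|
|     import os
|     import openai
|
|     # Point the client at the FastChat REST server instead of
|     # api.openai.com; change these env vars and the same code
|     # talks to the real OpenAI APIs.
|     openai.api_base = os.environ.get(
|         "OPENAI_API_BASE", "http://localhost:8000/v1")
|     openai.api_key = os.environ.get("OPENAI_API_KEY", "EMPTY")
|
|     resp = openai.ChatCompletion.create(
|         model="vicuna-13b-v1.3",  # whichever model the server loaded
|         messages=[{"role": "user",
|                    "content": "Summarize RLHF in one sentence."}],
|     )
|     print(resp.choices[0].message.content)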
| byteknight wrote:
| I have been looking for a way to tie a NextJS app to a second
| Python server serving OpenAI-compatible APIs that serve
| chains using langchain.
|
| I found LocalAI, but it seems like it's everything I need
| except that it's for local models only.
|
| I found a couple of others as well, but they all require
| rewriting or wrapping your code in some new paradigm.
|
| Does the solution you proposed offer a path to what I'm
| looking for?
| mark_l_watson wrote:
| It might. You need a GPU server running FastChat services.
| iambateman wrote:
| Will any LLM API be able to achieve a real "Google like" moat
| over the next decade?
|
| It feels like the switching cost is low enough to transition
| from one API to another for marginally better performance or
| cost.
|
| Maybe "being in bed with Microsoft" IS the moat...
| m3kw9 wrote:
| The problem is there is no standard way to test; everyone's
| model is "beating OpenAI", but then it's found to be on a
| subset. None of them deserve to be trusted until they allow
| people to try them in the real world.
| [deleted]
___________________________________________________________________
(page generated 2023-06-24 23:00 UTC)