[HN Gopher] Pi.ai LLM Outperforms Palm/GPT3.5
       ___________________________________________________________________
        
       Pi.ai LLM Outperforms Palm/GPT3.5
        
       Author : ergodas
       Score  : 133 points
       Date   : 2023-06-24 13:42 UTC (9 hours ago)
        
 (HTM) web link (inflection.ai)
 (TXT) w3m dump (inflection.ai)
        
       | alsodumb wrote:
        | Do you know why they left Google's PaLM 0-shot and 1-shot results
        | blank in the TriviaQA benchmarks? It's not because Google didn't
        | release this data; they did, in the same table as the other data.
        | 
        | It's because PaLM significantly outperforms them in both these
        | cases, and they can't bold their entire row to flaunt how good it
        | is.
        | 
        | I'm not trusting any of these benchmarks. After a day or two of
        | using the model, I'd know whether it's better than GPT-4 or not.
        
         | lumost wrote:
          | It's really hard to believe any model is "better than OpenAI"
          | when I can't try it out right now.
          | 
          | Why should I spend 2-3 hours reading the paper, requesting
          | access, and then setting up the system, only to likely confirm
          | that the evaluation was faulty?
        
           | simonster wrote:
           | There are two steps to building a conversational LLM. The
           | first is pretraining on an enormous amount of text. The
           | second is fine-tuning, which usually involves a combination
           | of a small amount of high-quality human data and
           | reinforcement learning from human feedback (in practice, from
           | another neural net trained to model human feedback).
           | 
           | This paper is about the quality of the pretraining. It is not
           | necessarily going to be correlated with your subjective
           | judgment of how good the model is. A good pretrained model
           | without any fine-tuning will be very difficult to use for
           | most purposes, because it won't do a very good job following
           | instructions. However, assuming that the fine-tuning is done
           | well, the quality of the pretraining determines the limits of
           | the capabilities of the model. This tech report shows that
           | the team did a good (or at least reasonable) job with the
           | pretraining.
           | 
           | The primary audience for this post and tech report is (or at
           | least should be) ML researchers that Inflection would like to
           | recruit and technically knowledgeable investors, not end-
           | users. To remain competitive, Inflection is gonna have to
           | train a 10x more expensive model someday; OpenAI and Google
           | already have. They need talent and investor $ to do that.
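            | 
            | To make the two steps concrete, here is a minimal sketch of
            | the second (fine-tuning) stage using the Hugging Face
            | transformers library. The base model and the toy instruction
            | data are placeholders, not Inflection's actual setup, and a
            | real pipeline would follow this with RLHF:
            | 
            |   # Supervised fine-tuning sketch; "gpt2" stands in for a
            |   # large pretrained base model.
            |   import torch
            |   from transformers import (AutoModelForCausalLM,
            |                             AutoTokenizer, Trainer,
            |                             TrainingArguments)
            | 
            |   tokenizer = AutoTokenizer.from_pretrained("gpt2")
            |   tokenizer.pad_token = tokenizer.eos_token
            |   model = AutoModelForCausalLM.from_pretrained("gpt2")
            | 
            |   # Toy stand-in for the "small amount of high-quality
            |   # human data" mentioned above.
            |   pairs = [("Summarize: The cat sat on the mat.",
            |             "A cat sat on a mat."),
            |            ("Translate to French: Hello.", "Bonjour.")]
            | 
            |   class SFTDataset(torch.utils.data.Dataset):
            |       def __init__(self, pairs):
            |           self.items = []
            |           for prompt, answer in pairs:
            |               enc = tokenizer(
            |                   prompt + "\n" + answer + tokenizer.eos_token,
            |                   truncation=True, max_length=64,
            |                   padding="max_length", return_tensors="pt")
            |               # Don't compute loss on padding tokens.
            |               labels = enc.input_ids[0].clone()
            |               labels[enc.attention_mask[0] == 0] = -100
            |               self.items.append({
            |                   "input_ids": enc.input_ids[0],
            |                   "attention_mask": enc.attention_mask[0],
            |                   "labels": labels})
            |       def __len__(self): return len(self.items)
            |       def __getitem__(self, i): return self.items[i]
            | 
            |   trainer = Trainer(
            |       model=model,
            |       args=TrainingArguments(output_dir="sft-out",
            |                              num_train_epochs=1),
            |       train_dataset=SFTDataset(pairs))
            |   trainer.train()  # RLHF would come after this stage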
        
           | high_derivative wrote:
            | We are not the audience. The VCs they are trying to raise the
            | next megaround from are.
        
             | moffkalast wrote:
             | Why should VCs have lower standards than random internet
             | people?
        
           | FireInsight wrote:
           | You can try this one at https://pi.ai/
        
             | lumost wrote:
              | Tried it; the safety filters seem better than OpenAI's.
              | However, the model lacks depth of technical knowledge.
        
               | reaperman wrote:
                | It's definitely impressive for having very coherent
                | responses without major verbal tics. However, while I
                | agree with your assessment of its apparent lack of
                | technical knowledge, I think that's mainly because the
                | answers are so short.
                | 
                | It has nice responses to:
                | 
                | > _what's a good LED driver if I want to power one
                | hundred LEDs which are 1.5A, 3.25V each? input voltage
                | can be anything from 24VDC to 48VDC. List specific model
                | numbers. Ideally something with boost conversion._
                | 
                | But the responses are far too short to list a reasonable
                | number of options, so it usually ends up listing only
                | two. Sometimes it skips listing any and instead gives me
                | a generic description of a process I should follow to
                | find them. ChatGPT has a lot more response space to work
                | with, and generally seems to "need" it to answer this
                | question -- it also tends toward descriptive answers
                | rather than prescriptive suggestions. With the
                | additional space, ChatGPT often eventually gets around
                | to suggesting some parts for the BOM.
        
             | mikeravkine wrote:
             | This is so terrible, almost hilariously so:
             | 
             | https://heypi.com/s/gf72UPDDacbLwxTHEQLzg
        
               | jdiff wrote:
                | Is it? For a single-line function, that parses and runs
                | just fine. Also, it might simply not have ChatGPT's text
                | formatting. That doesn't make it terrible, just
                | significantly more unwieldy for formatting-heavy tasks
                | like code, especially whitespace-sensitive Python.
        
               | Trung0246 wrote:
                | Yeah, it feels like this AI's hallucination is strong.
                | Way too strong. I gave it a simple word-ranking task and
                | it failed spectacularly.
               | 
               | https://heypi.com/s/Qmvu2EscbGZzWCVbpCarh
        
             | rcfox wrote:
              | I tried asking it to tell me a story, and it quickly got
              | the characters' roles mixed up. I also asked it to make one
              | character speak in rhymes, and it just made everything
              | rhyme. ChatGPT does a better job at storytelling.
              | 
              | Though pi.ai was a bit more engaging to work with. It was
              | willing to break the fourth wall and compliment me on the
              | unexpected twists I introduced.
        
               | brucethemoose2 wrote:
               | Chronos 33B is SOTA for storytelling, from what I have
               | personally seen.
               | 
                | It's probably even better merged with an instruct model.
        
         | ilaksh wrote:
         | They didn't say it was better than GPT-4. They said better than
         | GPT-3.5.
         | 
          | I tested it with a coding exercise. It's definitely not as good
          | at coding as GPT-3.5.
        
           | qwytw wrote:
            | Putting an emoji in every single sentence really makes it
            | hard to read or take seriously, though...
           | 
           | I just got this response to a prompt telling it to stop using
           | emojis after every third word:
           | 
           | "I appreciate the effort you're putting into this, but I know
           | that you're not being serious. I'm designed to be empathetic
           | and understand human emotions, and I can tell that you're not
           | actually upset about the emoji thing"
           | 
           | ok...
           | 
            | GPT-3.5 at least doesn't pretend it understands human
            | emotions better than humans do. Generally this behaves more
            | like a pretentious-asshole LLM than anything else.
        
         | ergodas wrote:
         | TriviaQA has different splits. They did compare and their model
         | is much better.
         | 
         | https://twitter.com/MaartenBosma/status/1672349512499867648
        
         | [deleted]
        
       | wejick wrote:
        | The chatbot they have seems to have access to a knowledge graph,
        | which is a very good way to ensure access to more up-to-date
        | data.
        | 
        | This means access to a text classifier and NER is needed to
        | build good graph queries, as in the sketch below.
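        | 
        | A rough sketch of that pipeline: run NER over the user's message,
        | then use the recognized entities to build a graph query. The
        | spaCy model and the Cypher-style schema here are illustrative
        | assumptions, not anything Pi has documented:
        | 
        |   # Entities -> knowledge-graph lookup (illustrative only).
        |   import spacy
        | 
        |   nlp = spacy.load("en_core_web_sm")  # small English NER model
        | 
        |   def to_graph_query(utterance: str) -> str:
        |       entities = [(e.text, e.label_) for e in nlp(utterance).ents]
        |       if not entities:
        |           return ""  # nothing to look up
        |       # Pull each entity's immediate neighborhood as fresh
        |       # context for the LLM.
        |       clauses = [
        |           f'MATCH (e{i}:{label} {{name: "{text}"}})-[r{i}]-(n{i})'
        |           for i, (text, label) in enumerate(entities)]
        |       ret = ", ".join(f"e{i}, r{i}, n{i}"
        |                       for i in range(len(entities)))
        |       return "\n".join(clauses) + f"\nRETURN {ret}"
        | 
        |   print(to_graph_query("Who founded Inflection AI in Palo Alto?"))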
        
       | zmmmmm wrote:
       | They don't seem to say how large their model is?
       | 
        | Comparing themselves to LLaMA 65B seems like a bit of a tell ...
        | it was explicitly released as a small foundation model to allow
        | others to train and improve it. If they couldn't beat that
        | despite more parameters / training / data, then they would
        | effectively be going backwards.
        | 
        | They do say they trained using "thousands of H100 GPUs", which is
        | a lot more than LLaMA 65B required.
        
       | eshack94 wrote:
        | This headline and press release are misleading.
        
       | 19h wrote:
       | falcon-40b also outperforms GPT3.5 -- is it powering Pi.ai?
        
       | r_singh wrote:
       | Just tried it right now and it seemed kind of like a human to me.
       | 
        | Also ran Guanaco 13B locally on my M1 Pro today, and it ran
        | pretty smoothly.
       | 
       | I'm planning to use it for text extraction on a large amount of
       | data. Let's see how versatile LLMs can get and whether they can
       | replace some of the NLP tech I was using before.
        
       | 19h wrote:
       | "You can talk with me anywhere." -- and yet no Telegram or Signal
       | :-)
        
       | nik736 wrote:
       | > Do you have an API?
       | 
       | Yes, I have an HTTP API that you can use to programmatically
       | interact with me. The API is at heypi.com/api. You can find the
       | documentation and examples at heypi.com/docs. The API allows you
       | to create conversations, send and receive messages, and more. You
       | can even integrate me with your own apps!
       | 
       | ---
       | 
       | Not true, lol. Returning 404s.
        
       | celestialcheese wrote:
        | It's pretty impressive - but too aligned to be anything more than
        | a customer service bot. It just feels like a corporate drone that
        | is relentlessly friendly and upbeat. Which, maybe, is their goal.
        
       | 2h wrote:
       | why are the tables SVG? that is horrible for accessibility.
       | 
       | https://www.datocms-assets.com/98476/1687548656-inflection-1...
        
       | avereveard wrote:
        | Eh, they always say "beats this or that" using benchmarks, but
        | then the AI is really limited once you push it. This one will not
        | code, has difficulty writing queries, and absolutely won't
        | understand what to do with many PaLM prompts.
        | 
        | And the UI is terrible: you can't organize chats, you can't clear
        | chats. Why didn't they just integrate one of the many MIT-
        | licensed chat frontends that already exist?
        
       | swyx wrote:
        | If it's not open source and no API is available, the benchmarks
        | can't be independently reproduced, so it's hard to take at face
        | value any claim of outperforming GPT-3.5 - a major claim that
        | must be verified first (Falcon has had similar reproduction
        | issues:
        | https://twitter.com/Francis_YAO_/status/1667245675447468034?...)
        
         | ilaksh wrote:
         | You can very easily test it on their pi website. I tried a
         | coding exercise. It's not as good at programming as
         | gpt-3.5-turbo.
         | 
         | From my short test, what it really seems to excel at is
         | smugness.
        
       | rajnathani wrote:
        | From the about page: the company was cofounded by one of the
        | founders of DeepMind.
        
       | waynecochran wrote:
       | I guess they didn't pause training large models. I didn't see
       | their name on the list:
       | https://futureoflife.org/open-letter/pause-giant-ai-experiments/
       | 
        | I am not saying they should have, but I am interested in what
        | fraction of those training large LLMs have signed on.
        
         | spullara wrote:
         | No one paused.
        
           | waynecochran wrote:
            | I suspect that you are correct, but do we know, then, that
            | Sam Altman is not telling the truth about OpenAI pausing?
        
             | JieJie wrote:
              | He is almost certainly parsing his words very carefully:
              | when he says "We are not currently training GPT-5", he
              | means that they are not currently feeding training data
              | into a foundation model.
             | 
             | He has clearly said, though, that they are working towards
             | the moment when they do start, and they hope to have
             | something really remarkable to show for it.
        
       | flapjaxy wrote:
        | Did I miss where the model size is? One-shot rankings are nice,
        | but it sounds like they're trying to build a proprietary
        | alternative to other models rather than focusing on outright
        | competitiveness.
        | 
        | I wonder about the applicability of performance metrics for
        | specialized models. (This is to be a personal assistant AI,
        | right?) I'd think that either (1) all models perform the same
        | natural language understanding functions, or (2) context matters
        | a ton. If it's (1), then there's no need for a specialized model;
        | if it's (2), then the relevance of performance metrics
        | diminishes.
        
       | zwaps wrote:
       | Sad that they also refuse to offer up any technical information.
       | 
        | I am a bit salty, given that all these companies basically use
        | 90% open-source data and public research, and most likely copy a
        | good bit of their ideas from public repos.
        | 
        | Alas, such is life.
        
       | behnamoh wrote:
       | Beautiful website and nice fonts!
       | 
        | but back to the topic: I'm quite shocked that PaLM gets
        | outclassed by much smaller models on a regular basis. I would
        | have thought that Google, despite not having a moat, at least had
        | enough talent and focus to get LLMs right. But what I'm observing
        | is that startups like ClosedAI, Anthropic, etc. constantly beat
        | big players like Google at their own game.
        
         | solrik wrote:
          | I am sure Google is working on some great new models. PaLM was
          | a disappointment, but they have been leading the charge in deep
          | learning for a long time.
        
           | cubefox wrote:
           | They (with the new combined Google DeepMind team) do indeed
           | work on a new large model: Gemini. The intention seems to be
           | to outperform GPT-4.
           | 
           | During Google IO a while ago, Pichai said Gemini was
           | currently in training.
        
         | cubefox wrote:
         | Note that PaLM and PaLM 2 are completely different models.
        
         | theage wrote:
          | Look at Hollywood: they spawned photo-realistic CGI, then
          | figured out their captive audience will settle for much lower
          | quality anyway, so why bother?
        
           | CyberDildonics wrote:
           | Hollywood is a city. If you mean movies, every production is
           | different and has its own budget. They don't have a captive
           | audience, people have to choose to watch a movie for
           | entertainment and then choose a specific movie.
           | 
            | I don't know what you mean by 'much lesser' quality, but
            | hundreds of millions of dollars in CG went into just the
            | biggest movies of the summer. Avatar 2 alone was an enormous
            | feat. No one at any point in the process makes something
            | spectacular and then decides to make something that looks
            | mediocre instead. The only place that happens is in cartoons
            | for kids.
        
         | gwern wrote:
         | What happened was Google beat itself at its own game, if you
         | will. Google PaLM-1 is beaten so often simply because Google
         | Chinchilla scaling is so much better. (Note that PaLM-2 is not
         | benchmarked in OP.)
         | 
         | PaLM-1 is hobbled by the fact that it was probably the largest
         | (because the last) LLM to be trained with the Kaplan scaling
         | laws rather than the Chinchilla. As soon as Chinchilla came
         | out, no one would train like PaLM-1 again, because it was
         | giving up so much performance compared to if one had instead
         | trained a much smaller Chinchilla-optimal model. (This had the
         | interesting consequence that PaLM-1 would thereby remain the
         | largest, by parameter-count, dense LLM trained for probably
         | years to come - because why would you train one inefficiently
         | as that, while a larger-than-PaLM-1 Chinchilla-optimal model
         | would require staggering levels of compute+data.)
         | 
         | The PaLM-1 paper came out within days of the Chinchilla paper,
         | and many people noted that this pointed to extraordinary levels
         | of dysfunctionality within Google - that DeepMind would not
         | tell Google Brain that they were wasting literally millions of
         | dollars of compute by training a model in what DeepMind was
         | busy showing was a very suboptimal way.
        
           | behnamoh wrote:
           | thanks for your explanation, this clarified a lot of things.
        
           | flkenosad wrote:
            | So could Google now just spend another few million on
            | compute, training with Chinchilla scaling, and be
            | exponentially further along?
        
             | gwern wrote:
              | You mean of PaLM-1? No. PaLM-1 was _extremely_ far off
              | Chinchilla scaling; it's not something you can salvage
             | with just a few more millions of TPU-pod time. Think more
             | like, hundreds of millions... I think somewhere I estimated
             | PaLM-1 was at something like 5% of Chinchilla-optimal, so
             | they would have to train it 20x more. Obviously, they are
             | not going to, not with methods improving so rapidly, with
             | innovations like UL2. PaLM-1 is just a sunk cost, is all,
             | and a useful historical datapoint (eg. studying inverse
             | scaling).
             | 
             | Equally obviously, PaLM-2 did not make the same mistake and
             | so for that and other reasons, greatly outperforms PaLM-1.
             | So, beating PaLM-1 at this point is an achievement, but you
             | are still far from 'beating Google'.
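              | 
              | A quick back-of-envelope check of those numbers, using the
              | popular ~20-tokens-per-parameter reading of the Chinchilla
              | paper and PaLM-1's published figures (540B parameters, 780B
              | training tokens); the exact ratio depends on which fit of
              | the scaling law you use:
              | 
              |   # Rough Chinchilla-optimality check for PaLM-1.
              |   params = 540e9          # PaLM-1 parameters
              |   tokens_trained = 780e9  # PaLM-1 training tokens
              |   tokens_optimal = 20 * params  # ~10.8T tokens
              | 
              |   print(f"{tokens_trained / tokens_optimal:.0%} of "
              |         "Chinchilla-optimal data")        # ~7%
              |   print(f"~{tokens_optimal / tokens_trained:.0f}x more "
              |         "tokens needed")                  # ~14x
              | 
              | That lands in the same ballpark as the ~5% / 20x estimate
              | above.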
        
       | rizky05 wrote:
       | [dead]
        
       | comboy wrote:
        | I've spent some time conversing (it said 1B params; is that true?
        | It wouldn't say when asked directly...). I asked one question in
        | Polish to check whether it knows other languages, then it asked
        | me to log in, and after using Apple to log in: "Sorry, you have
        | been blocked". The question was literally whether it knows other
        | languages.
        | 
        | Output length seems very limiting. Optimizing for positive
        | emotions hurts truth, and unfortunately its knowledge seems
        | pretty limited. It's got some wit, though.
        
       | vjeux wrote:
       | " I'm sorry, but I can't write code. But I can answer any
       | questions you might have about BPE tokenization or any other text
       | pre-processing techniques. Are you interested in natural language
       | processing?"
       | 
       | Well this is unfortunate :(
        
         | zamalek wrote:
          | I think it's a good idea; they can focus on a product that does
          | one thing well. There are plenty of other models out there that
          | can write code.
        
         | littlestymaar wrote:
         | GPT 3.5 can't write code either, it just doesn't admit it this
         | straight /s
        
         | gavi wrote:
          | It wrote some code for me - but stopped in the middle:
          | 
          | Q: The following are table definitions:
          | Users(id, name, username, password, first_name, last_name,
          | last_login_dttm)
          | Pictures(id, user_id, name, date_taken, size)
          | Generate SQL for the following: who is the top user by number
          | of pictures?
          | 
          | A: Oh fun, a little SQL problem! So let's see here... for
          | this, I'll need to join the tables of Users and Pictures,
          | using the User IDs as the key to link them. Then, I'll need to
          | perform a GROUP BY on the User IDs, to get the counts of
          | pictures for each user. Finally, I can sort by the count and
          | select the top user. I think the final query should look
          | something like this: SELECT id, first_name, last_name FROM
          | Users AS u INNER JOIN Pictures AS p ON u.id = p.user_
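          | 
          | The answer was cut off mid-query; one plausible completion,
          | checked against a throwaway SQLite copy of the schema (the
          | sample rows are invented):
          | 
          |   import sqlite3
          | 
          |   db = sqlite3.connect(":memory:")
          |   db.executescript("""
          |       CREATE TABLE Users(id INTEGER PRIMARY KEY, name TEXT,
          |           username TEXT, password TEXT, first_name TEXT,
          |           last_name TEXT, last_login_dttm TEXT);
          |       CREATE TABLE Pictures(id INTEGER PRIMARY KEY,
          |           user_id INTEGER, name TEXT, date_taken TEXT,
          |           size INTEGER);
          |       INSERT INTO Users(id, first_name, last_name) VALUES
          |           (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
          |       INSERT INTO Pictures(user_id, name) VALUES
          |           (1, 'a.jpg'), (1, 'b.jpg'), (2, 'c.jpg');
          |   """)
          | 
          |   query = """
          |       SELECT u.id, u.first_name, u.last_name,
          |              COUNT(p.id) AS n_pictures
          |       FROM Users AS u
          |       INNER JOIN Pictures AS p ON u.id = p.user_id
          |       GROUP BY u.id
          |       ORDER BY n_pictures DESC
          |       LIMIT 1;
          |   """
          |   print(db.execute(query).fetchone())
          |   # -> (1, 'Ada', 'Lovelace', 2)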
        
           | jdiff wrote:
            | Its output seems to be limited to a pretty short length.
            | Even natural language gets clipped after a paragraph or two.
        
         | FireInsight wrote:
          | I could trick it into providing code blocks and snippets,
          | though very small ones. And you have to really converse such
          | that it comes up naturally. It's definitely been trained on
          | lots of dev docs from the internet; I just think the devs
          | aren't too confident about its ability to create functioning
          | code.
          | 
          | It outputs markdown code blocks but is not made to handle the
          | rendering.
        
       | leobg wrote:
       | > As a vertically integrated AI studio, we do everything in-house
       | for AI training and inference: from data ingestion, to model
       | design, to high-performance infrastructure.
       | 
       | What does that even mean? They run their own GPUs vs using some
       | cloud provider? They hand-type their own training data? And even
       | if they did, why would that matter?
        
         | Havoc wrote:
         | Same reasons devs put "full stack" on their CV
        
           | littlestymaar wrote:
           | "Full stack" merely means you have experience working with
           | both back-end and front-end stuff, it's not a bullshit self-
           | marketing term.
        
         | class4behavior wrote:
          | It's a buzzword salad for investors or users, not developers.
          | DIY stuff is cheaper, and you can pretend to be an expert in
          | everything (at least until you ruin everything). Input data is
          | free, as it's stolen from the internet, just as other companies
          | do it.
        
           | Snacklive wrote:
              | Why is it stolen? Assuming you are using data from the
              | public internet, why would someone consider that "stolen
              | data"?
        
             | cubefox wrote:
             | For text-to-image models there are currently two major
             | lawsuits because they were trained on copyrighted pictures.
             | I'm not aware of any such lawsuits for text, but in terms
             | of copyright, text isn't very different from images.
        
               | actuallyalys wrote:
               | I'm not aware of a text lawsuit either, but there is one
               | for code: https://githubcopilotlitigation.com/. I'm a
               | little surprised there isn't one for text yet, since the
               | Washington Post published an article detailing how many
               | tokens from websites, including those run by major media
                | companies, go into large models:
                | https://www.washingtonpost.com/technology/interactive/2023/a...
                | It may be that
               | corporations think they can profit off these models to a
               | greater extent than they are subject to damages, that
               | their attorneys simply don't think they have a case, or
               | that they want to see how the image and code lawsuits go
               | first. This is all speculation, however.
        
               | astrange wrote:
               | It doesn't matter if there are lawsuits if none of them
               | are successful.
        
               | jdiff wrote:
               | Let's not get ahead of ourselves and assume we know how
               | they'll turn out.
        
             | Sharlin wrote:
             | There is this thing called "copyright".
        
       | rtuin wrote:
       | Based on the title I expected a well performing edge model
       | available to run on a Raspberry Pi.
       | 
       | This is clearly not that.
        
         | geek_at wrote:
          | I also got excited for the same reason. Or even hoped for an
          | open source model. Doubly disappointing.
        
       | 29athrowaway wrote:
        | How does it perform on well-known human tests?
        
         | cubefox wrote:
          | The article contains benchmark results on those tests. On
          | several, it is better than GPT-3.5.
        
       | mark_l_watson wrote:
        | I always feel better when evaluations are done by impartial
        | researchers. Is that the case here?
        | 
        | I looked to see if Pi.ai's LLM was open and available, and it
        | doesn't appear to be. I have a new strategy for using LLM APIs: I
        | use FastChat with one of the Vicuna 7B, 13B, or 33B models - both
        | the command line interface tool and the OpenAI-compatible APIs
        | via the FastChat REST server. By setting environment variables,
        | my code can switch back to the real OpenAI APIs (a sketch of that
        | switch is below). I rent a Lambda Labs GPU server to run these
        | models myself. This is the strategy I am also using in the book I
        | just started writing, "Safe For Humans AI":
        | https://leanpub.com/safe-for-humans-AI
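        | 
        | Here is roughly what that environment-variable switch looks like
        | with the openai Python package (0.x style); the base URL and
        | model name depend on how your FastChat server is configured:
        | 
        |   import os
        |   import openai
        | 
        |   # Point the client at FastChat's OpenAI-compatible REST server
        |   # when OPENAI_API_BASE is set; otherwise it defaults to the
        |   # real OpenAI endpoint.
        |   openai.api_base = os.environ.get("OPENAI_API_BASE",
        |                                    "https://api.openai.com/v1")
        |   openai.api_key = os.environ.get("OPENAI_API_KEY", "EMPTY")
        | 
        |   resp = openai.ChatCompletion.create(
        |       model=os.environ.get("MODEL_NAME", "vicuna-13b-v1.3"),
        |       messages=[{"role": "user",
        |                  "content": "Say hello in one sentence."}])
        |   print(resp.choices[0].message.content)
        | 
        | A local FastChat deployment typically serves this at
        | http://localhost:8000/v1 with the api_key set to "EMPTY".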
        
         | byteknight wrote:
          | I have been looking for a way to tie a Next.js app to a second
          | Python server that serves OpenAI-compatible APIs backed by
          | LangChain chains.
          | 
          | I found LocalAI, but it seems like it's everything I need
          | except that it's for local models only.
          | 
          | I found a couple of others as well, but they all require
          | rewriting or wrapping your code in some new paradigm.
          | 
          | Does the solution you proposed offer a path for what I'm
          | looking for?
        
           | mark_l_watson wrote:
           | It might. You need a GPU server running FastChat services.
        
       | iambateman wrote:
        | Will any LLM API be able to achieve a real "Google-like" moat
        | over the next decade?
        | 
        | It feels like the switching cost is low enough to transition from
        | one API to another for marginally better performance or cost.
       | 
       | Maybe "being in bed with Microsoft" IS the moat...
        
       | m3kw9 wrote:
        | The problem is there is no standard way to test. Everyone's model
        | is beating OpenAI, but then it turns out to be on a subset. None
        | of them deserve to be trusted until they let people try it in the
        | real world.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-06-24 23:00 UTC)