[HN Gopher] StableCode
       ___________________________________________________________________
        
       StableCode
        
       Author : kpozin
       Score  : 155 points
       Date   : 2023-08-08 15:25 UTC (7 hours ago)
        
 (HTM) web link (stability.ai)
 (TXT) w3m dump (stability.ai)
        
       | runako wrote:
       | Is this a "product" that one could install and use or a model
       | that one should expect an OEM to integrate into a product before
       | programmers can use it? I'm asking because I don't see any links
       | that would help me figure out how to try it out.
        
         | nwoli wrote:
          | Ctrl-F for "Code for using StableCode Instruct to generate a
          | response to a given instruction." and you'll find a
          | straightforward snippet you can copy to try out code
          | generation.
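          | 
          | From memory it's a standard transformers call, roughly along
          | these lines (repo id and prompt format as on the Hugging Face
          | model card - double-check there):
          | 
          |     from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          |     repo = "stabilityai/stablecode-instruct-alpha-3b"
          |     tokenizer = AutoTokenizer.from_pretrained(repo)
          |     model = AutoModelForCausalLM.from_pretrained(
          |         repo, trust_remote_code=True, torch_dtype="auto")
          |     model.cuda()
          | 
          |     prompt = ("###Instruction\nGenerate a python function "
          |               "to find number of CPU cores###Response\n")
          |     inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
          |     tokens = model.generate(**inputs, max_new_tokens=48,
          |                             temperature=0.2, do_sample=True)
          |     print(tokenizer.decode(tokens[0],
          |                            skip_special_tokens=True))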
        
           | runako wrote:
           | Thanks! The verbiage at the beginning of the announcement
           | seems to go out of its way to not call StableCode a "model,"
           | which was confusing. By contrast, the recent release of SDXL
           | 1.0 is described as a "model" in its announcement.
        
         | yohannparis wrote:
          | To be honest, you'd be better off buying GitHub Copilot and
          | enjoying the productivity boost at a cheap price. Downloading,
          | installing, and setting up StableCode is worth it only if you
          | want to learn all those steps as well. If what you care about
          | is the final result, just buy an existing service.
        
           | arcanemachiner wrote:
           | I may put all my open source stuff on GitHub, but hell will
           | freeze over before I willingly let Microsoft get a whiff of
           | my private data, no matter how irrelevant it may be.
           | 
           | GitHub Copilot sounds pretty neat though, I will admit that.
        
           | hmottestad wrote:
            | I have bought into Copilot, but I can't say it's that much
            | of a productivity boost. More often than not it recommends
            | something completely wrong. I guess it might be more useful
            | if I did more Spring Boot or maybe Hibernate.
            | 
            | I've found ChatGPT to be more helpful in general. I can
            | paste some code in and have a discussion about what I want
            | it to fix for me.
        
         | carom wrote:
          | Yes, the model is available. However, it was just released,
          | so no one has wrapped it in a plugin yet. I would expect a
          | nicely runnable local version within the month, similar to
          | llama2's wrappers.
        
       | cutler wrote:
        | Yet another site whose data privacy policy amounts to nothing
        | more than an Accept button. I refuse to use such sites.
        
         | capableweb wrote:
         | It's a model you download and run yourself, on your own
         | hardware. No privacy policy needed.
        
           | barrotes wrote:
            | He's referring to the actual website, which doesn't give
            | you the option to reject profiling cookies (mandatory in
            | Europe). I commented about another website posted here a
            | few days ago. It makes me mad too.
        
       | monlockandkey wrote:
       | Any performance metrics?
        
       | nwoli wrote:
       | I love stability AI
        
       | sebzim4500 wrote:
       | Hard to believe it can work that well when it only has 3B
       | parameters, but I'd love to be proven wrong.
        
         | csjh wrote:
         | phi-1[0] is only 1.3 billion parameters and performs very well
         | in coding tasks - small models have a massive amount of
         | potential
         | 
         | [0] - https://arxiv.org/abs/2306.11644
        
         | nwoli wrote:
          | Reminder that GPT-2 was considered "too dangerous" to be
          | released at just 1.5B parameters
        
           | ben_w wrote:
           | My memory may be imperfect, but I thought it was more "we
           | aren't sure and we want to promote a culture of safety"
           | rather than "this is definitely unsafe... oh wait never
           | mind"?
        
             | arugulum wrote:
             | It's actually even less remarkable than that. It was an
             | experiment in having a limited release, to shift the field
             | toward a different release convention.
             | 
             | > Nearly a year ago we wrote in the OpenAI Charter: "we
             | expect that safety and security concerns will reduce our
             | traditional publishing in the future, while increasing the
             | importance of sharing safety, policy, and standards
             | research," and we see this current work as potentially
             | representing the early beginnings of such concerns, which
             | we expect may grow over time.
             | 
             | > This decision, as well as our discussion of it, is an
             | experiment: while we are not sure that it is the right
             | decision today, we believe that the AI community will
             | eventually need to tackle the issue of publication norms in
             | a thoughtful way in certain research areas.
             | 
             | > We will further publicly discuss this strategy in six
             | months.
             | 
             | https://openai.com/research/better-language-models
        
         | thewataccount wrote:
         | I was impressed enough by replit's 2.7B model that I'm
         | convinced it's doable. I have a 4090 and consider that the "max
         | expected card for a consumer to own".
         | 
          | Also, exllama doesn't support non-llama models and the creator
          | doesn't seem interested in adding support for wizardcoder/etc.
          | Because of this, the alternatives are prohibitively slow for
          | running a quantized 16B model on a 4090 (if the exllama author
          | reads this _please_ add support for other model types!).
          | 
          | 3B models are pretty snappy with Refact, about as fast as
          | GitHub Copilot. The other benefit is more context space, which
          | will be a limiting factor for 16B models.
         | 
         | tl;dr - I think we need ~3B models if we want any chance of
         | consumer hardware to reasonably run coding models akin to
         | github copilot with decent context length. And I think it's
         | doable.
        
           | eyegor wrote:
           | I'm fairly confident a coding specific model should be a lot
           | smaller - 3b should be plenty if not 1b or less. As it
            | stands, there are quite a few 7-13B models that can predict
            | natural language quite well. Code seems, at its surface, a
            | much simpler language (strict grammars, etc.), so I
            | wouldn't think it needs to be anywhere near as large as the
            | NLP models. Right now people are retraining NLP models to
            | work with code, but I think the best code helper models in
            | the future will be trained primarily on code and maybe
            | fine-tuned on some language. I'm thinking less of a chatbot
            | API and more of a giant leap in "intellisense" services.
        
             | gsuuon wrote:
              | I'd really like to see smaller models trained on only one
              | specific language, with its own language-specific
              | tokenizer. I imagine the reduction in vocab size would
              | translate into handling more context more easily?
        
               | thewataccount wrote:
                | I think simply making the vocab more code-friendly
                | (e.g. codex) would make the biggest difference.
                | Whitespace is the biggest one (afaik every space is a
                | token), but consider how many languages share `for(int
                | i=0;`, `) {\n`, `} else {`, 'import ', etc.
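                | 
                | A quick sketch of that whitespace cost (a comparison I
                | made up, using tiktoken's "gpt2" vocab vs the code-
                | aware "p50k_base" one; exact counts will vary):
                | 
                |     import tiktoken  # pip install tiktoken
                | 
                |     snippet = "    if x:\n        return x\n"
                |     for name in ("gpt2", "p50k_base"):
                |         enc = tiktoken.get_encoding(name)
                |         print(name, len(enc.encode(snippet)))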
               | 
                | My understanding is that a model properly trained on
                | multiple languages will beat an expert-based system. I
                | feel like programming languages overlap and interop
                | with each other enough that I wouldn't want to
                | specialize in just one language.
        
             | johndough wrote:
             | > Code seems at its surface a much simpler language
             | 
             | When using GitHub Copilot, I often write a brief comment
             | first and most of the time, it is able to complete my code
             | faster than if I had written it myself. For my workflow, a
             | good code model must therefore also be able to understand
             | natural text well.
             | 
              | Although I am not sure to what degree the ability to
              | _understand_ natural text and the ability to _generate_
              | natural text are related. Perhaps a bit of text generation
              | capability can be traded off against faster execution and
              | fewer parameters.
        
               | GeneralMayhem wrote:
               | Understanding should be much easier, for the same reason
               | humans (e.g. children, foreign-language learners) can
               | always understand more than they can say: human language
               | is fairly low-entropy, so if there's a word you don't
               | understand, you can pick up most of the meaning from
               | context. On the other hand, producing natural-sounding
                | language requires knowing _every single_ word you're
               | going to use.
        
           | thorum wrote:
           | replit's model is surprisingly good at generating code, even
           | at following complex instructions that I was sure would
            | confuse it. I have found it's a bit weak on code _analysis_,
           | for open-ended questions like 'is there a bug anywhere in
           | this code?' that GPT-4 can answer.
        
         | politelemon wrote:
          | But it does mean, hopefully, that it is easier to run on
          | small hardware, making it much more accessible.
        
         | capableweb wrote:
          | I had that thought at first too, but then the scope is really
          | small (programming) compared to other models (everything), so
          | it might not be that bad.
        
       | rvz wrote:
       | Either way, the race to zero has been further accelerated.
       | 
        | Stability AI, Apple, Meta, etc. are clearly at the finish line,
        | putting pressure on cloud-only AI models, which cannot raise
        | prices or compete with free.
        
         | empath-nirvana wrote:
          | Open Source doesn't mean free. It costs a lot of money to run
          | models and keep them up to date, and maybe a "good enough"
          | model runs relatively cheaply, but there's always going to be
          | a "state of the art" that people are willing to pay for.
        
         | _pdp_ wrote:
         | Lots of folks out there would rather skip the hassle of running
         | their own models, and that's totally understandable. Similarly,
         | you've got plenty of folks who'd rather pay for managed hosting
         | services instead of dealing with the nitty-gritty of setting up
         | everything themselves using free tools. This opens up exciting
         | opportunities for successful companies to offer some real perks
         | - think convenience, a smoother user experience, and lightning-
         | fast speeds, just to name a few! All of these things save time
         | and are worth paying for.
        
         | thewataccount wrote:
         | > Stability AI, Apple, Meta, etc are clearly at the finish line
         | 
          | I'm very optimistic and expect them to catch up. I've used
          | the open models a lot; to be clear, they are starting to
          | compare to GPT3.5Turbo right now, but they can't compete with
          | GPT4 at all. GPT4 is almost a year old from when it finished
          | training, I think?
          | 
          | I expect open source models to stay ~1.5 years behind. That
          | said, they will eventually be "good enough".
         | 
         | Keep in mind too though that using and scaling GPUs is not
         | free. You have to run the models somewhere. Most businesses
         | will still prefer a simple api to call instead of managing the
          | infrastructure. On top of this, many businesses (medium and
          | smaller) will likely find models like GPT4 to be sufficient
          | for their workload, and will appreciate the built-in "rails"
          | for their specific use cases.
         | 
          | tl;dr - open models don't even compare to GPT4 yet (I use
          | them all daily), they aren't free to run, and an API option
          | is still preferable to many if not most companies.
        
           | nwoli wrote:
           | > Keep in mind too though that using and scaling GPUs is not
           | free. You have to run the models somewhere.
           | 
            | Long or medium term these will probably be dirt cheap to
            | just run in the background though. It might be within 3-5
            | years, since parallel compute is still growing and isn't as
            | bounded by Moore's law stagnation.
        
             | thewataccount wrote:
              | I get decent performance with my 4090, enough that
              | quantized 30B LLMs with exllama are very usable. But
              | we're severely VRAM limited, especially on lower end
              | hardware, which rarely sees > 10GB of VRAM.
             | 
              | I don't know how much slower it could be and still be
              | useful though. The big thing is we need more VRAM: 30B is
              | context-length limited with only 24GB of VRAM, and I've
              | only barely made it above 3.2k tokens before running out.
             | 
              | I hope you're right, and that it becomes common for
              | systems to have dedicated TPU-type hardware similar to
              | smartphones, absolutely loaded with VRAM (which I don't
              | think is even that expensive?)
             | 
             | Models will also get smaller but I'm skeptical we'll get
             | GPT4 performance with any useful context length under 24GB
             | VRAM any time soon.
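              | 
              | Back-of-the-envelope on that (my own numbers, assuming
              | LLaMA-30B-ish shapes: 60 layers, hidden size 6656, 4-bit
              | weights, fp16 KV cache):
              | 
              |     params = 30e9
              |     weights_gb = params * 0.5 / 1e9  # ~15 GB at 4 bits
              | 
              |     layers, hidden = 60, 6656
              |     kv_per_tok = 2 * layers * hidden * 2  # K+V in fp16
              |     kv_gb = kv_per_tok * 3200 / 1e9  # ~5.1 GB at 3.2k
              | 
              |     print(f"{weights_gb:.0f} GB + {kv_gb:.1f} GB KV")
              | 
              | That's ~20 GB before activations and runtime overhead,
              | which lines up with a 24 GB card topping out around 3.2k
              | tokens.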
        
       | RomanPushkin wrote:
       | Is it good at algos?
       | 
       | From interviews:
       | 
       | Implement queue that supports three methods:
       | 
       | * push
       | 
       | * pop
       | 
       | * peek(i)
       | 
        | peek returns an element by its index. All three methods should
        | have O(1) complexity [write code in Ruby].
       | 
       | ChatGPT wasn't able to solve that last time I tried
       | https://twitter.com/romanpushkin/status/1617037136364199938
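        | 
        | For reference, one O(1) approach (sketched in Python rather
        | than Ruby): back the queue with an array plus a head index, so
        | push appends, pop advances the head, and peek(i) indexes from
        | the head. Push is amortized O(1); popped slots are simply never
        | reclaimed.
        | 
        |     class Queue:
        |         def __init__(self):
        |             self._items = []  # backing store
        |             self._head = 0    # index of the current front
        | 
        |         def push(self, x):
        |             self._items.append(x)
        | 
        |         def pop(self):
        |             x = self._items[self._head]
        |             self._head += 1
        |             return x
        | 
        |         def peek(self, i):
        |             return self._items[self._head + i]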
        
       | thewataccount wrote:
        | I can't seem to find a demo. If anyone has a chance to test it,
        | how does it compare to replit and wizardcoder?
        
         | james2doyle wrote:
         | Looks like there is one on the Hugging Face page:
         | https://huggingface.co/stabilityai/stablecode-instruct-alpha...
         | 
         | Not very promising based on this lame test
        
           | politelemon wrote:
           | I ran it locally and it seemed to do better. I switched
           | Python to Bash and it also gave a good answer (nproc).
        
       | 3rd3 wrote:
       | How does it compare to GitHub Copilot?
        
         | karmasimida wrote:
          | On HumanEval, Copilot is 40+ on pass@1 compared to 26 for
          | StableCode 3B.
          | 
          | HumanEval is overused, but this model is only good for its
          | size; it is no match for Copilot ... yet
        
           | UncleOxidant wrote:
            | > On HumanEval, Copilot is 40+ on pass@1 compared to 26 for
            | StableCode 3B.
           | 
           | Can you put those numbers into context for those who haven't
           | done HumanEval? Are those percentages so that 40+ means 40+%
           | and 26 is 26%? If so does that imply both would be failing
           | scores?
        
         | jstummbillig wrote:
         | When they don't voluntarily answer the question, you know the
         | answer.
        
           | sebzim4500 wrote:
           | It's not easy to compare them, to be fair.
           | 
           | I guess you could come up with a thousand example prompts and
           | pay some students to pick which output is better, but I can
           | also see why you wouldn't bother. It probably depends on
           | language, type of prompt, etc.
        
             | maaaaattttt wrote:
              | One could team up with Hackerrank/leetcode, let the model
              | code in the interface (maybe there's an API for that
              | already, no idea), execute its code verbatim and see how
              | many test cases it gets right the first time around. Then,
              | like for humans, give it a clue about one of the tests not
              | passing (or code not working, too slow, etc.). Give points
              | based on the difficulty of the question and the number of
              | clues needed.
              | 
              | I guess the obvious caveat is that these models are
              | probably overfitted on these types of questions. But a
              | specific benchmark could be made containing questions kept
              | secret from models. Time to build "Botrank" I guess.
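              | 
              | A toy version of that scoring, just to illustrate (all
              | names here are made up): run the model's code against
              | hidden tests, weight by question difficulty, and deduct
              | per clue needed.
              | 
              |     def score_submission(solution, tests, difficulty,
              |                          clues_used):
              |         passed = sum(1 for args, want in tests
              |                      if solution(*args) == want)
              |         base = difficulty * passed / len(tests)
              |         return max(base - clues_used, 0)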
        
             | erwald wrote:
             | Sure it's easy -- you can use benchmarks like HumanEval,
             | which Stability did. They just didn't compare to Codex or
             | GPT-4. Of course such benchmarks don't capture all aspects
             | of an LLM's capabilities, but they're a lot better than
             | nothing!
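              | 
              | (For context, HumanEval pass@k numbers are the percentage
              | of problems solved; the Codex paper estimates pass@k from
              | n samples per problem, of which c pass. A sketch:)
              | 
              |     from math import comb
              | 
              |     def pass_at_k(n, c, k):
              |         # 1 - P(all k drawn samples are incorrect)
              |         if n - c < k:
              |             return 1.0
              |         return 1.0 - comb(n - c, k) / comb(n, k)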
        
         | miohtama wrote:
         | The model, source, etc. are available under permissive terms
         | 
         | https://huggingface.co/stabilityai/stablecode-instruct-alpha...
         | 
         | You can "run it locally". Very handy if you do not trust
         | automatically sending all your code to someone in the United
         | States.
        
           | UncleOxidant wrote:
           | Hmmm... so on that hugging face page there's a text box where
           | you enter input then you click the 'compute' button.
           | 
           | So I asked it to "Write a python function that computes the
           | square of the input number."
           | 
            | And it responds with:
            | 
            |       def square(x):
           | 
           | Which seems quite underwhelming.
        
           | lolinder wrote:
           | > to reproduce, distribute, and create derivative works of
           | the Software Products solely for your non-commercial research
           | purposes
           | 
           | I wouldn't call these terms permissive. It's in line with the
           | recent trend in released AI models, but fairly restrictive in
           | what you're actually allowed to do with it.
        
             | coder543 wrote:
             | The Completion model appears to place the model weights
             | under the Apache 2 license, which is a permissive license:
             | https://huggingface.co/stabilityai/stablecode-completion-
             | alp...
             | 
             | The Instruct model has that non-commercial restriction, but
             | I'm not sure why. They say it was trained with Alpaca-
             | formatted questions and responses, but I'm not sure if that
             | includes the original Alpaca dataset.
        
       ___________________________________________________________________
       (page generated 2023-08-08 23:01 UTC)