[HN Gopher] Yi-Coder: A Small but Mighty LLM for Code
___________________________________________________________________
Yi-Coder: A Small but Mighty LLM for Code
Author : crbelaus
Score : 237 points
Date : 2024-09-05 03:38 UTC (19 hours ago)
(HTM) web link (01-ai.github.io)
(TXT) w3m dump (01-ai.github.io)
| smcleod wrote:
| Weird that they're comparing it to really old DeepSeek v1
| models; even v2 has been out a long time now.
| butterfly42069 wrote:
| My exact thoughts, especially because DeepseekV2 is meant to be
| a massive improvement.
|
| It seems to be an emerging trend people should look out for:
| model release sheets often contain comparisons with out-of-date
| models, and don't inform so much as just try to make the model
| look "best."
|
| It's an annoying trend. Untrustworthy metrics betray
| untrustworthy morals.
| bubblyworld wrote:
| My barely-informed guess is that they don't have the resources
| to run it (it's a 200b+ model).
| regularfry wrote:
| They could compare to DeepSeek-Coder-V2-Lite-Instruct. That's
| a 16B model, and it comes out at 24.3 on LiveCodeBench. Given
| the size delta they're respectably close - they're only just
| behind at 23.4. The full V2 is way ahead.
| smcleod wrote:
| That's for the larger model; most people running it locally
| use the -lite model (both of which have lots of benchmarks
| published)
| theshrike79 wrote:
| > Continue pretrained on 2.4 Trillion high-quality tokens over 52
| major programming languages.
|
| I'm still waiting for a model that's highly specialised for a
| single language only - and either a lot smaller than these jack
| of all trades ones or VERY good at that specific language's
| nuances + libraries.
| wiz21c wrote:
| If the LLM training makes the LLM generalize things _between_
| languages, then it is better to leave it like it is...
| richardw wrote:
| I'd be interested to know if that trade off ends up better.
| There's probably a lot of useful training that transfers well
| between languages, so I wouldn't be that surprised if the extra
| tokens helped across all languages. I would guess a top quality
| single language model would need to be very well supported, eg
| Python or JavaScript. Not, say, Clojure.
| rfoo wrote:
| An unfortunate fact is that, similar to a human with infinite
| time, LLMs usually have better performance on your specific
| language when they are not limited to learning or over-sampling
| one single language. Not unlike the common saying "learning to
| code in Haskell makes you a better C++ programmer".
|
| Of course, this is far from trivial, you don't just add more
| data and expect it to automatically be better for everything.
| So is time management for us mere mortals.
| deely3 wrote:
| > usually have better performance on your specific language
| when they are not limited to learning or over-sampling one
| single language.
|
| Source? I'm very curious how learning one language helps a
| model generate code in languages with different paradigms.
| Java, Markdown, JSON, HTML, Fortran?
| imjonse wrote:
| Unclear how much of their coding knowledge is in the space of
| syntax/semantics of a given language and how much in the latent
| space that generalizes across languages and logic in general.
| If I were to guess I'd say 80% is in the latter for the larger
| capable models. Even very small models (like in Karpathy's
| famous RNN blog) will get syntax right but that is superficial
| knowledge.
| kamphey wrote:
| I wonder what those 52 languages are.
| richardw wrote:
| According to the repo README: 'java', 'markdown', 'python',
| 'php', 'javascript', 'c++', 'c#', 'c', 'typescript', 'html',
| 'go', 'java_server_pages', 'dart', 'objective-c', 'kotlin',
| 'tex', 'swift', 'ruby', 'sql', 'rust', 'css', 'yaml',
| 'matlab', 'lua', 'json', 'shell', 'visual_basic', 'scala',
| 'rmarkdown', 'pascal', 'fortran', 'haskell', 'assembly',
| 'perl', 'julia', 'cmake', 'groovy', 'ocaml', 'powershell',
| 'elixir', 'clojure', 'makefile', 'coffeescript', 'erlang',
| 'lisp', 'toml', 'batchfile', 'cobol', 'dockerfile', 'r',
| 'prolog', 'verilog'
|
| https://github.com/01-ai/Yi-Coder
| Y_Y wrote:
| They're playing a dangerous game if they assume that a
| single language or even family of similar languages is
| referred to by e.g. "assembly", "shell", "lisp".
|
| (I also note that several of these are markup or config
| languages which are explicitly not for programming.)
| karagenit wrote:
| Yep, been waiting for the same thing. Maybe at some point it'll
| be possible to use a large multilingual model to translate the
| dataset into one programming language, then train a new smaller
| model on just that language?
| terminalcommand wrote:
| Isn't microsoft phi specifically trained for Python? I recall
| that Phi 1 was advertised as a Python coding helper.
|
| It's a small model trained only on quality sources (i.e.
| textbooks).
| mark_l_watson wrote:
| I get your point. Models that support many dozens of human
| languages don't seem to be what I personally need, because I
| only speak English.
|
| However, I enjoy using various Lisp languages and I was pleased
| last night when I set up Emacs + ellama + Ollama + Yi-Coder. I
| experimented with Cursor last weekend, and it was nice for
| Python, not so great for Common Lisp.
| rty32 wrote:
| I don't know if that will happen, but there are tools that at
| least _try_ to improve performance for specific languages,
| especially "underrepresented" languages, e.g.
| https://sourcegraph.com/blog/enhancing-code-completion-for-r...
| sitkack wrote:
| The models benefit immensely from being trained with more data
| from other languages, even if you only ever use it in one.
|
| You could finetune it on your codebases and specific docs for
| added perf.
| mythz wrote:
| Claude 3.5 Sonnet still holds the LLM crown for code, and I'll
| use it when I want to check the output of the best LLM. However,
| my Continue Dev, Aider and Claude Dev plugins are currently
| configured to use DeepSeek Coder V2 236B (and local Ollama
| DeepSeek Coder V2 for tab completions), as it offers the best
| value at $0.14/$0.28 per million tokens (input/output), which
| sits just below Claude 3.5 Sonnet on Aider's leaderboard [1]
| whilst being 43x cheaper.
|
| [1] https://aider.chat/docs/leaderboards/
| dsp_person wrote:
| DeepSeek sounds really good, but the terms/privacy policy look
| a bit sketchy (e.g. they grant a full license to use/reproduce
| inputs and outputs). Is there anywhere feasible to spin up the
| 236B model for a similarly cheap price in private?
|
| The following quotes are from a reddit comment here:
| https://www.reddit.com/r/LocalLLaMA/comments/1dkgjqg/comment...
|
| > under International Data Transfers (in the Privacy Policy):
| """ The personal information we collect from you may be stored
| on a server located outside of the country where you live. We
| store the information we collect in secure servers located in
| the People's Republic of China . """
|
| > under How We Share Your Information > Our Corporate Group (in
| the Privacy Policy): """ The Services are supported by certain
| entities within our corporate group. These entities process
| Information You Provide, and Automatically Collected
| Information for us, as necessary to provide certain functions,
| such as storage, content delivery, security, research and
| development, analytics, customer and technical support, and
| content moderation. """
|
| > under How We Use Your Information (in the Privacy Policy):
| """ Carry out data analysis, research and investigations, and
| test the Services to ensure its stability and security; """
|
| > under 4.Intellectual Property (in the Terms): """ 4.3 By
| using our Services, you hereby grant us an unconditional,
| irrevocable, non-exclusive, royalty-free, sublicensable,
| transferable, perpetual and worldwide licence, to the extent
| permitted by local law, to reproduce, use, modify your Inputs
| and Outputs in connection with the provision of the Services.
| """
| yumraj wrote:
| There's no company info on DeepSeek's website. Looking at the
| above, and considering that, it seems very sketchy indeed.
|
| Maybe OK for trying out stuff, a big no no for real work.
| dotancohen wrote:
| Might be good for contributing to open source projects. But
| not for clients' projects.
| rfoo wrote:
| > There's no company info on DeepSeek's website.
|
| It's backed solely by a hedge fund who do not want to draw
| attention to their business. So yeah, as sketchy as DESRES.
| redeyedtreefrog wrote:
| The names of their researchers are on this recent paper:
| https://arxiv.org/pdf/2408.15664
|
| Their terms of service say "The DeepSeek Open Platform is
| jointly owned and operated by Hangzhou DeepSeek Artificial
| Intelligence Co., Ltd., Beijing DeepSeek Artificial
| Intelligence Co., Ltd. "
|
| And they're funded by https://www.high-flyer.cn/en/fund/
| which the FT did an article on: https://www.ft.com/content/
| 357f3c68-b866-4c2e-b678-0d075051a...
|
| In terms of the personal data you share when using their
| models, I can't see why they would be any more or less
| nefarious than big Western tech companies.
|
| That said, if you're using a model based in China then by
| providing them with data and feedback you are in a very
| small way helping researchers in China catch up with/keep
| up with/overtake researchers in the West. Maybe in the long
| term that could end badly. And if you are concerned about
| the environment, it's entirely possible their training and
| inference is run using coal power stations.
| mythz wrote:
| It's a 236B MoE model with only 21B active parameters. Ollama
| reports 258k downloads [1] (for the 16B and 236B combined),
| whilst Hugging Face says it was downloaded 37k times last month
| [2], and it can run at 25 tok/s on a single M2 Ultra [3].
|
| At $0.14/$0.28 per million tokens it's a no brainer to use their
| APIs. I understand some people would have privacy concerns and
| would want to avoid their APIs, although I personally spend all
| my time contributing to publicly available OSS code bases, so
| I'm happy for any OSS LLM to use any of our code bases to
| improve their LLM and hopefully also improve the generated code
| for anyone using our libraries.
|
| Since many LLM orgs are looking to build proprietary moats
| around their LLMs to maintain their artificially high prices,
| I'll personally make an effort to use the best OSS LLMs
| available first (i.e. from DeepSeek, Meta, Qwen or Mistral
| AI) since they're bringing down the cost of LLMs and aiming
| to render the technology a commodity.
|
| [1] https://ollama.com/library/deepseek-coder-v2
|
| [2] https://huggingface.co/deepseek-ai/DeepSeek-
| Coder-V2-Lite-In...
|
| [3] https://x.com/awnihannun/status/1814045712512090281
| bilekas wrote:
| > """ The personal information we collect from you may be
| stored on a server located outside of the country where you
| live. We store the information we collect in secure servers
| located in the People's Republic of China . """
|
| Is that even legal with regard to EU users?
| onli wrote:
| Of course not.
| yard2010 wrote:
| Well, good luck prosecuting Winnie the Pooh
| bilekas wrote:
| Their services will just be blocked in the EU instead... we've
| seen it in Italy early on with ChatGPT.
| samstave wrote:
| When will we have token-flow-aware-networking gear...
| Surely NVIDIA and others are already doing special
| traffic shaping for tokenFlows?
|
| What's the current state of such
| tech/thought/standards/vendors?
| stavros wrote:
| I'm making a small calendar renderer for e-ink screens
| (https://github.com/skorokithakis/calumny) which Claude
| basically wrote all of, so I figured I'd try DeepSeek. I had it
| add a small circle to the left of the "current day" line, which
| it added fine, but it couldn't solve the problem of the circle
| not being shown over another element. It tried and tried, to no
| avail, until I switched to Claude, which fixed the problem
| immediately.
|
| 43x cheaper is good, but my time is also worth money, and it
| unfortunately doesn't bode well for me that it's stumped by the
| first problem I throw at it.
| jadbox wrote:
| What's the better plug-in among Continue Dev, Aider and Claude
| Dev?
| phren0logy wrote:
| You probably already know that Aider is not a plugin, but
| just in case - it's a program that runs from the terminal. I
| think the results are very impressive, and it readily handles
| multiple source files for context.
| ziofill wrote:
| Are coding LLMs trained with the help of interpreters?
| willvarfar wrote:
| Google's Gemini does.
|
| I can't find a post that I remember Google published just after
| all the ChatGPT SQL generation hype happened, but it felt like
| they were trying to counter that hype by explaining that most
| complex LLM-generated code snippets won't actually run or work,
| and that they were putting a code-evaluation step after the LLM
| for Bard.
|
| (A bit like: why did they never put an old-fashioned rules-based
| grammar-checking stage on Google Translate results?)
|
| Fast forward to today and it seems it's a normal step for
| Gemini etc https://ai.google.dev/gemini-api/docs/code-
| execution?lang=py...
| redeyedtreefrog wrote:
| That's interesting! Where it says that it will "learn
| iteratively from the results until it arrives at a final
| output", I assume it's therefore trying multiple LLM
| generations until it finds one that works, which I didn't
| know about before.
|
| However, AFAIK it's only ever at inference time; an
| interpreter isn't included during LLM training? I wonder if
| it would be possible to fine-tune a model for coding with an
| interpreter. Though if no one has done it yet there is
| presumably a good reason why not.
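|
| A rough sketch of what that inference-time loop could look like
| (the generate_code callable here stands in for whatever model
| API you're using; it's a hypothetical hook, not a real library
| function):
|
|     import subprocess
|     import sys
|     import tempfile
|
|     def solve_with_retries(generate_code, prompt, max_attempts=3):
|         """Generate code, run it, and feed errors back on failure."""
|         feedback = ""
|         for _ in range(max_attempts):
|             code = generate_code(prompt + feedback)
|             with tempfile.NamedTemporaryFile(
|                     "w", suffix=".py", delete=False) as f:
|                 f.write(code)
|             result = subprocess.run([sys.executable, f.name],
|                                     capture_output=True, text=True)
|             if result.returncode == 0:
|                 return code  # ran cleanly, accept this generation
|             feedback = "\n\nPrevious attempt failed with:\n" + result.stderr
|         return None  # every attempt failed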
| littlestymaar wrote:
| > Though if no one has done it yet there is presumably a
| good reason why not.
|
| The field is vast, moving quickly and there are more
| directions to explore than researchers working at top AI
| labs. There are lots of open doors that haven't been explored
| yet, but that doesn't mean it's not worth it; it's just not
| done yet.
| Havoc wrote:
| Beats DeepSeek 33B. That's impressive.
| tuukkah wrote:
| They used DeepSeek-Coder-33B-Instruct in comparisons, while
| DeepSeek-Coder-V2-Instruct (236B) and -Lite-Instruct (16B) have
| been available for a while:
| https://github.com/deepseek-ai/DeepSeek-Coder-v2
|
| EDIT: Granted, Yi-Coder 9B is still smaller than any of these.
| mtrovo wrote:
| I'm new to this whole area and feeling a bit lost. How are people
| setting up these small LLMs like Yi-Coder locally for tab
| completion? Does it work natively on VSCode?
|
| Also for the cloud models apart from GitHub Copilot, what tools
| or steps are you all using to get them working on your projects?
| Any tips or resources would be super helpful!
| cassianoleal wrote:
| You can run this LLM on Ollama [0] and then use Continue [1] on
| VS Code.
|
| The setup is pretty simple:
|
| * Install Ollama (instructions for your OS on their website -
| for macOS, `brew install ollama`)
|
| * Download the model: `ollama pull yi-coder`
|
| * Install and configure Continue on VS Code
| (https://docs.continue.dev/walkthroughs/llama3.1 <- this is for
| Llama 3.1 but it should work by replacing the relevant bits)
|
| [0] https://ollama.com/
|
| [1] https://www.continue.dev/
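|
| If you want to sanity-check that Ollama is actually serving the
| model before wiring up Continue, a quick call against Ollama's
| local HTTP API works (a minimal sketch assuming the default
| port 11434 and the yi-coder tag pulled above):
|
|     import json
|     import urllib.request
|
|     req = urllib.request.Request(
|         "http://localhost:11434/api/generate",
|         data=json.dumps({
|             "model": "yi-coder",
|             "prompt": "Write a Python one-liner that reverses a string.",
|             "stream": False,
|         }).encode(),
|         headers={"Content-Type": "application/json"},
|     )
|     with urllib.request.urlopen(req) as resp:
|         # the non-streaming response is a single JSON object
|         print(json.loads(resp.read())["response"])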
| suprjami wrote:
| If you have a project which supports OpenAI API keys, you can
| point it at a LocalAI instance:
|
| https://localai.io/
|
| This is easy to get "working" but difficult to configure for
| specific tasks due to docs being lacking or contradictory.
| samstave wrote:
| Can you post screenshots/configs showing how you got the setup
| working?
|
| Or at least state what you configured toward, and how?
| suprjami wrote:
| The documentation gives a quick start and many examples of
| integration with OpenAI projects like a chatbot. That's all
| I did.
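|
| For example, pointing the official OpenAI Python client at a
| local endpoint is usually all it takes (a sketch assuming
| LocalAI's default port 8080 and that you've registered a model
| under the name "yi-coder"; both are assumptions to adjust for
| your setup):
|
|     from openai import OpenAI
|
|     # LocalAI exposes an OpenAI-compatible API, so the stock client
|     # works; the api_key is ignored locally but must be non-empty.
|     client = OpenAI(base_url="http://localhost:8080/v1",
|                     api_key="not-needed")
|
|     reply = client.chat.completions.create(
|         model="yi-coder",  # whatever name your LocalAI config uses
|         messages=[{"role": "user",
|                    "content": "Write a hello world in Go."}],
|     )
|     print(reply.choices[0].message.content)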
| NKosmatos wrote:
| It would be good if LLMs were somehow packaged in an easy
| way/format for us "novice" (ok I mean lazy) users to try them
| out.
|
| I'm not so much interested in the response time (anyone have a
| couple of spare A100s?), but it would be good to be able to try
| out different LLMs locally.
| nusl wrote:
| This is already possible. There are various tools online you
| can find and use.
| hosteur wrote:
| You should try GPT4all. It seems to be exactly what you're
| asking for.
| suprjami wrote:
| One Docker command if you don't mind waiting minutes for CPU-
| bound replies:
|
| https://localai.io/
|
| You can also use several GPU options, but they are not as easy
| to get working.
| PhilippGille wrote:
| With Mozilla's llamafile you can run LLMs locally without
| installing anything: https://github.com/Mozilla-Ocho/llamafile
| senko wrote:
| LM Studio is pretty good: https://lmstudio.ai/
| dizhn wrote:
| I understand your situation. It sounds super simple to me now,
| but I remember having to spend at least a week trying to get
| the concepts and figuring out what prerequisite knowledge I
| would need, somewhere on the continuum between just using
| ChatGPT and learning the relevant vector math etc. It is much
| closer to the ChatGPT side, fortunately. I don't like ollama
| per se (because I can't reuse its models with other frontends
| due to it compressing them in its own format) but it's still a
| very good place to start. Any interface that lets you download
| models as GGUF from Hugging Face will do just fine. Don't be
| turned off by the roleplaying/waifu-sounding frontend names.
| They are all fine. This is what I mostly prefer:
| https://github.com/oobabooga/text-generation-webui
| cassianoleal wrote:
| Is there an LLM that's useful for Terraform? Something that
| understands HCL and has been trained on the providers, I imagine.
| bloopernova wrote:
| Copilot writes terraform just fine, including providers.
| cassianoleal wrote:
| Thanks. I should have specified, LLMs that can be run locally
| is what interests me.
| lasermike026 wrote:
| Try this, https://ollama.com/jeffrymilan/aiac
| Palmik wrote:
| The difference between (A) software engineers reacting to AI
| models and systems for programming and (B) artists (whether it's
| painters, musicians or otherwise) reacting to AI models for
| generating images, music, etc. is very interesting.
|
| I wonder what's the reason.
| suprjami wrote:
| Because code either works or it doesn't. Nobody is replacing
| our entire income stream with an LLM.
|
| You also need a knowledge of code to instruct an LLM to
| generate decent code, and even then it's not always perfect.
|
| Meanwhile plenty of people are using free/cheap image
| generation and calling it "good enough". Now they don't need to
| pay a graphic artist or for a stock photo licence.
|
| Any layperson can describe what they want a picture to look
| like so the barrier to entry and successful exit is a lot lower
| for LLM image generation than for LLM code generation.
| rty32 wrote:
| Coding assistants are not good enough (yet). Inline suggestions
| and chats are incredibly helpful and boost productivity (though
| only for those who know how to use them well), but that's as
| far as they go today.
|
| If they can take a Jira ticket, debug the code, create a patch
| for a large codebase and understand and respect all the
| workarounds in a legacy codebase, I would have a problem with
| it.
| xvector wrote:
| Except they can't do the equivalent for art yet either, and I
| am fairly familiar with the state of image diffusion today.
|
| I've commissioned tens of thousands of dollars in art, and
| spent many hundreds of hours working with Stable Diffusion,
| Midjourney, and Flux. What all the generators are missing is
| _intentionality_ in art.
|
| They can generate something that looks great at surface
| level, but doesn't make sense when you look at the details.
| Why is a particular character wearing a certain bracelet? Why
| do the windows on that cottage look a certain way? What does
| a certain engraving mean? Which direction is a character
| looking, and why?
|
| The diffusers do not understand what they are generating, so
| they just generate what "looks right." Often this results in
| art that looks pretty but has no deeper logic, world
| building, meaning, etc.
|
| And of course, image generators cannot handle the client-
| artist relationship as well (even LLMs cannot), because it
| requires an understanding of what the customer wants and what
| emotion they want to convey with the piece they're
| commissioning.
|
| So - I rely on artists for art I care about (art I will hang
| on my walls), and image generators for throwaway work (such
| as weekly D&D campaign images.)
| rty32 wrote:
| Of course the "art" art -- the part that is all about human
| creativity -- will always be there.
|
| But lots of people in the art business aren't doing that.
| If you didn't have Midjourney etc., what would you be doing
| for the throwaway work? Learn to design the stuff yourself,
| hire someone to do it on Upwork, or just not do it at all?
| Some money would likely exchange hands there.
| xvector wrote:
| The throwaway work is worth pennies per piece to me _at
| most._ So I probably wouldn't do it at all if it wasn't
| for the generators.
|
| And even when it comes to the generators, I typically
| just use the free options like open-source diffusion
| models, as opposed to something paid like Midjourney.
| mrklol wrote:
| But that's not that far. Like sure, currently it's not. But
| "reading a ticket with a description, find the relevant code,
| understand the code (often better than human), test it,
| return the result" is totally doable with some more
| iterations. It's already doable for smaller projects, see
| GitHub workspaces etc.
| viraptor wrote:
| Have you seen https://www.swebench.com/ ?
|
| Once you engage agentic behaviour, it can take you way
| further than just the chats. We're already in the "resolving
| JIRA tickets" area - it's just hard to setup, not very well
| known, and may be expensive.
| rty32 wrote:
| Looks like the definition of "resolving a ticket" here is
| "come up with a patch that ensures all tests pass", which
| does not necessarily include "add a new test", "make sure
| the patch is actually doing something meaningful",
| "communicate how this is fixed". Based on my experience and
| what I saw in the reports in the logs, a solution could be
| just hallucinating completely useless code -- as long as it
| doesn't fail a test.
|
| Of course, it is still impressive, and definitely would
| help with the small bugs that require small fixes,
| especially for open source projects that have thousands of
| open issues. But is it going to make a big difference?
| Probably not yet.
|
| Also, good luck doing that on our poorly written, poorly
| documented and under-tested codebase. By any standard
| Django is a much better codebase than the one I work on
| every day.
| viraptor wrote:
| Some are happy with creating tests as well, but you
| probably want to mostly write them yourself. I mean, only
| you know the real world context - if the ticket didn't
| explain it well enough, LLMs can't do magic.
|
| Actually, poor documentation and poor code quality are not
| a huge issue in my experience. Being under-tested is way
| more important if you want to automate that work.
| IshKebab wrote:
| > We're already in the "resolving JIRA tickets" area
|
| For very simple tasks maybe, but not for the kinds of
| things I get paid to do.
|
| I don't think it will be able to get to the level of
| reliably doing difficult programming tasks that require
| understanding and inferring requirements without having
| AGI, in which case society has other things to worry about
| than programmers losing their jobs.
| viraptor wrote:
| Is it really? I know people who love using LLMs, people who are
| allergic to the idea of even talking about AI usability and lots
| of others in between. Same with artists hating the idea,
| artists who spend hours crafting very specific things with SD,
| and many in between.
|
| I'm not sure I can really point out a big difference here.
| Maybe the artists are more skewed towards not liking AI since
| they work with a medium that's not digital in the first place,
| but the range of responses really feels close.
| crimsoneer wrote:
| I mean, it's supply and demand right.
|
| - There is a big demand for _really complex_ software
| development, and an LLM can't do that alone. So software devs
| have to do lots of busywork, and like the opportunity to be
| augmented by AI
|
| - Conversely, there is a huge demand for _not very high level_
| art - e.g., lots of people want a custom logo or a little
| jingle, but not many people want to hire a concert pianist or
| commission the next Salvador Dali.
|
| So most artists spend a lot of time doing a lot of low level
| work to pay the bills, while software devs spend a lot of time
| doing low level code monkey work so they can get to the
| creative part of their job.
| aithrowaway1987 wrote:
| Look at who the tools are marketed towards. Writing software
| involves a lot of tedium, eye strain, and frustration, even for
| experts who have put in a lot of hours practicing, so LLMs are
| marketed to help developers make their jobs easier.
|
| This is not the case for art or music generators: they are
| marketed towards (and created by) laypeople who want generic
| content and don't care about human artists. These systems are a
| significant burden on productivity (and a fatal burden on
| creativity) if you are an honest illustrator or musician.
|
| Another perspective: a lot of the most useful LLM codegen is
| not asking the LLM to solve a tricky problem, but rather to
| translate and refine a somewhat loose English-language solution
| into a more precise JavaScript solution (or whatever),
| including a large bag of memorized tricks around sorting,
| regexes, etc. It is more "science than art," and for a
| sufficiently precise English prompt there is even a plausible
| set of optimal solutions. The LLM does not have to "understand"
| the prompt or rely on plagiarism to give a good answer.
| (Although GPT-3.5 was a horrific F# plagiarist... I don't like
| LLM codegen but it is far more defensible than music
| generation)
|
| This is not the case with art or music generators: it makes no
| sense to describe them as "English to song" translators, and
| the only "optimal" solutions are the plagiarized / interpolated
| stuff the human raters most preferred. They clearly don't
| understand what they are drawing, nor do they understand what
| melodies are. Their output is either depressing content slop or
| suspiciously familiar. And their creators have filled the tech
| community with insultingly stupid propaganda like "they learn
| art just like human artists do." No wonder artists are mad!
| rahimnathwani wrote:
| What you say may be true about the simplest workflow: enter a
| prompt and get one or more finished images.
|
| But many people use diffusion models in a much more
| interactive way, doing much more of the editing by hand. The
| simplest case is to erase part of a generated image, and
| prompt to infill. But there are people who spend hours to get
| a single image where they want it.
| eropple wrote:
| This is true, and there's some really cool stuff there, but
| that's not who most of this is marketed at. Small wonder
| there's backlash from artists and people who appreciate
| artists when the stated value proposition is "render
| artists unemployed".
| tcdent wrote:
| It's just gatekeeping.
|
| Artists put a ton of time into education and refining their
| vision inside the craft. Amateur efforts to produce compelling
| work always look amateur. With augmentation, suddenly the
| "real" artists aren't as differentiated.
|
| The whole conversation is obviously extremely skewed toward
| digital art, and the ones talking about it most visibly are the
| digital artists. No abstract painter thinks AI is coming for
| their occupation or cares whether it is easier to create anime
| dreamscapes this year or the next.
| JediPig wrote:
| I tested this out on my workload (SRE/DevOps/C#/Golang/C++). It
| started responding with nonsense to a simple "write me a boto
| Python script that changes x, y, z values" request.
|
| Then I tried other questions from my past to compare... However,
| I believe the engineers who built the LLM just used the
| questions from benchmarks.
|
| In one instance after an hour of use (I stopped then) it
| answered one question with 4 different programming languages,
| and with answers that were in no way related to the question.
| tmikaeld wrote:
| I have the same experience: it hallucinates and rambles on and
| on about "solutions" that are not related.
|
| Unfortunately, this has always been my experience with all open
| source code models that can be self-hosted.
| Gracana wrote:
| It sounds like you are trying to chat with the base model
| when you should be using a chat model.
| tarruda wrote:
| Have you run the model in full FP16? It is possible a lot of
| performance is lost when running quantized versions.
| Tepix wrote:
| Sounds very promising!
|
| I hope that Yi-Coder 9B FP16 and Q8 will be available soon for
| Ollama; right now I only see the 4-bit quantized 9B model.
|
| I'm assuming that these models will be quite a bit better than
| the 4bit model.
| tmikaeld wrote:
| Click on "View more" in the dropdown on their page, it has many
| many quantized versions to choose from.
| anotherpaulg wrote:
| Yi-Coder scored below GPT-3.5 on aider's code editing benchmark.
| GitHub user cheahjs recently submitted the results for the 9b
| model and a q4_0 version.
|
| Yi-Coder results, with Sonnet and GPT-3.5 for scale:
|
|     77% Sonnet
|     58% GPT-3.5
|     54% Yi-Coder-9b-Chat
|     45% Yi-Coder-9b-Chat-q4_0
|
| Full leaderboard:
|
| https://aider.chat/docs/leaderboards/
| kleiba wrote:
| What is the recommended hardware to run a model like that locally
| on a desktop PC?
| tadasv wrote:
| You can easily run the 9B Yi-Coder on an RTX 4090. You could
| probably do it on a smaller GPU (16GB). I have 24GB, and run it
| through Ollama.
| gloosx wrote:
| Can someone explain these Aider benchmarks to me? They pass the
| same 113 tests through the LLM every time. Why do they then
| extrapolate the ability of the LLM to pass these 113 basic
| Python challenges into a general ability to produce/edit code?
| To me it sounds like this or that model is 70% accurate at
| solving the same hundred Python training tasks, but why does
| that mean it's good at other languages and arbitrary, private
| tasks as well? Has anyone ever tried changing the test cases or
| wiggling the conditions a bit to see if it still hits 70%?
| tarruda wrote:
| It seems this is the problem with most benchmarks, which is why
| benchmark performance doesn't mean much these days.
| lasermike026 wrote:
| First look seems good. I'll keep hacking with it.
| smokel wrote:
| Does anyone know why the sizes of these models are typically
| expressed in _number_ of weights (i.e. 1.5B and 9B in this
| case), without mentioning the weight size in bytes?
|
| For practical reasons, I often like to know how much GPU RAM is
| required to run these models locally. The actual number of
| weights seems to only express some kind of relative power, which
| I doubt is relevant to most users.
|
| Edit: reformulated to sound like a genuine question instead of a
| complaint.
| tarruda wrote:
| Since most LLMs are released as FP16, just the number of
| parameters is enough to know the total required GPU RAM.
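|
| As a rough back-of-envelope (ignoring KV cache and runtime
| overhead, which add a few more GB on top):
|
|     params = 9e9  # Yi-Coder 9B
|     bytes_per_weight = {"fp16": 2, "q8_0": 1, "q4_0": 0.5}
|     for fmt, b in bytes_per_weight.items():
|         print(f"{fmt}: ~{params * b / 1e9:.1f} GB")
|     # fp16: ~18.0 GB, q8_0: ~9.0 GB, q4_0: ~4.5 GB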
| magnat wrote:
| Because you can quantize a model e.g. from original 16 bits
| down to 5 bits per weight to fit your available memory
| constraints.
| GaggiX wrote:
| The weight size depends on the precision you are running the
| model at; you usually do not run a model at FP16, as it would
| be wasteful.
| zeroq wrote:
| Every time someone tells me how AI 10x'd their programming
| capabilities, I'm like "tell me you're bad at coding without
| telling me".
| coolspot wrote:
| It allows me to move much faster, because I can write a comment
| describing something more high-level and get plausible code
| from it to review & correct.
| patrick-fitz wrote:
| I'd be interested to see how it performs on
| https://www.swebench.com/
|
| Using SWE-agent + Yi-Coder-9B-Chat.
___________________________________________________________________
(page generated 2024-09-05 23:00 UTC)