[HN Gopher] DBRX: A new open LLM
___________________________________________________________________
DBRX: A new open LLM
Author : jasondavies
Score : 561 points
Date : 2024-03-27 12:23 UTC (10 hours ago)
(HTM) web link (www.databricks.com)
(TXT) w3m dump (www.databricks.com)
| shnkr wrote:
| GenAI novice here. What is the training data made of and how is
| it collected? I guess no one will share details on it; otherwise
| it would make a good technical blog post with lots of insights!
|
| >At Databricks, we believe that every enterprise should have the
| ability to control its data and its destiny in the emerging world
| of GenAI.
|
| >The main process of building DBRX - including pretraining, post-
| training, evaluation, red-teaming, and refining - took place over
| the course of three months.
| simonw wrote:
| The most detailed answer to that I've seen is the original
| LLaMA paper, which described exactly what that model was
| trained on (including lots of scraped copyrighted data)
| https://arxiv.org/abs/2302.13971
|
| Llama 2 was much more opaque about the training data,
| presumably because they were already being sued at that point
| (by Sarah Silverman!) over the training data that went into the
| first Llama!
|
| A couple of things I've written about this:
|
| - https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-
| the...
|
| - https://simonwillison.net/2023/Apr/17/redpajama-data/
| shnkr wrote:
| My question was specific to the Databricks model. If it followed
| LLaMA or OpenAI, they could add a line or two about it to
| make the blog complete.
| comp_raccoon wrote:
| they have a technical report coming! knowing the team, they
| will do a great job disclosing as much as possible.
| ssgodderidge wrote:
| Wow, that paper was super useful. Thanks for sharing. Page 2
| is where it shows the breakdown of all of the data sources,
| including % of dataset and the total disk sizes.
| tempusalaria wrote:
| The training data is pretty much anything you can read on the
| internet plus books.
|
| This is then cleaned up to remove nonsense, some technical
| files, and repeated files.
|
| From this, they tend to weight some sources more - e.g.
| Wikipedia gets a pretty high weighting in the data mix. Overall
| these data mixes have multiple trillion token counts.
|
| GPT-4 apparently trained on multiple epochs of the same data
| mix, so I would assume this one did too, as it's a similar token
| count.
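|
| To make "weighting sources in the mix" concrete, here is a toy
| sketch; the source names and weights are made up for
| illustration, not DBRX's actual (undisclosed) mix:
|
|     import random
|
|     # Hypothetical mix: upweighted sources get sampled more often.
|     sources = {"web_crawl": 0.60, "code": 0.15, "books": 0.15,
|                "wikipedia": 0.10}
|
|     def sample_source():
|         names, weights = zip(*sources.items())
|         return random.choices(names, weights=weights, k=1)[0]
|
|     # Draw the sources for one batch of eight documents.
|     print([sample_source() for _ in range(8)])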
| sanxiyn wrote:
| https://arxiv.org/abs/2305.10429 found that people are
| overweighting Wikipedia and downweighting Wikipedia improves
| things across the board INCLUDING PREDICTING NEXT TOKEN ON
| WIKIPEDIA, which is frankly amazing.
| IshanMi wrote:
| Personally, I found looking at open source work to be much more
| instructive in learning about AI and how things like training
| data and such are done from the ground up. I suspect this is
| because training data is one of the bigger moats an AI company
| can have, as well as all the class action lawsuits surrounding
| training data.
|
| One of the best open-source datasets freely available is The
| Pile by EleutherAI [1]. It's a few years old now
| (~2020), but they did some really diligent work in putting
| together the dataset and documenting it. A more recent and even
| larger dataset would be the Falcon-RefinedWeb dataset [2].
|
| [1]: https://arxiv.org/abs/2101.00027 [2]:
| https://arxiv.org/abs/2306.01116
| djoldman wrote:
| Model card for base: https://huggingface.co/databricks/dbrx-base
|
| > The model requires ~264GB of RAM
|
| I'm wondering when everyone will transition from tracking
| parameter count vs evaluation metric to (total gpu RAM + total
| CPU RAM) vs evaluation metric.
|
| For example, a 7B parameter model using float32s will almost
| certainly outperform a 7B model using float4s.
|
| Additionally, all the examples of quantizing recently released
| superior models to fit on one GPU don't mean the quantized model
| is a "win." The quantized model is a different model; you need to
| rerun the metrics.
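|
| As a rough rule of thumb, weight memory is parameter count times
| bytes per parameter (ignoring KV cache and other runtime
| overhead), which is where numbers like ~264GB come from:
|
|     def weight_gb(params_billion, bits_per_param):
|         # Approximate weight memory in GB, ignoring KV cache
|         # and other runtime overhead.
|         return params_billion * bits_per_param / 8
|
|     print(weight_gb(7, 32))    # 7B at float32 -> ~28 GB
|     print(weight_gb(7, 4))     # 7B at 4-bit   -> ~3.5 GB
|     print(weight_gb(132, 16))  # 132B at bf16  -> ~264 GB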
| vlovich123 wrote:
| I thought float4 sacrificed a negligible amount of evaluation
| quality for an 8x reduction in RAM?
| Y_Y wrote:
| A free lunch? Wouldn't that be nice! Sometimes the
| quantization process improves the accuracy a little (probably
| by implicit regularization) but a model that's at or near
| capacity (as it should be) is necessarily hurt by throwing
| away most of the information. Language models often quantize
| well to small fixed-point types like int4, but it's not a
| magic wand.
| K0balt wrote:
| I find that q6 and 5+ are subjectively as good as raw
| tensor files. 4 bit quality reduction is very detectable
| though. Of course there must be a loss of information, but
| perhaps there is a noise floor or something like that.
| Taek wrote:
| At what parameter count? It's been established that
| quantization has less of an effect on larger models. By the
| time you are at 70B, quantization to 4 bits is basically
| negligible.
| vlovich123 wrote:
| I didn't suggest a free lunch, just that the 8x reduction
| in RAM (+ faster processing) does not result in an 8x
| growth in the error. Thus a quantized model will outperform
| a non-quantized one on an evaluation/RAM metric.
| Y_Y wrote:
| That's not a good metric.
| omeze wrote:
| Many applications don't want to host inference in the
| cloud and would ideally run things locally. Hardware
| constraints are clearly important.
|
| I'd actually say it's the most important metric for most
| open models now: since the price per performance of
| closed cloud models is so competitive with open cloud
| models, edge inference that is competitive is a clear
| value add.
| underlines wrote:
| This paper partially finds disagreeing evidence:
| https://arxiv.org/abs/2403.17887
| Taek wrote:
| For smaller models, the quality drop is meaningful. For
| larger ones like this one, the quality drop is negligible.
| swalsh wrote:
| > The model requires ~264GB of RAM
|
| This feels as crazy as Grok. Was there a generation of models
| recently where we decided to just crank on the parameter count?
| wrs wrote:
| Isn't that pretty much the last 12 months?
| Jackson__ wrote:
| If you read their blog post, they mention it was pretrained
| on 12 trillion tokens of text. That is ~5x the amount of the
| Llama 2 training runs.
|
| From that, it seems somewhat likely we've hit the wall on
| improving <X B parameter LLMs by simply scaling up the
| training data, which basically forces everyone to continue
| scaling up if they want to keep up with SOTA.
| breezeTrowel wrote:
| Cranking up the parameter count is literally how the current
| LLM craze got started. Hence the "large" in "large language
| model".
| espadrine wrote:
| Not recently. GPT-3 from 2020 requires even more RAM; the
| open-source BLOOM from 2022 did too.
|
| In my view, the main value of larger models is distillation
| (which we particularly witness, for instance, with how Claude
| Haiku matches release-day GPT-4 despite being less than a
| tenth of the cost). Hopefully the distilled models will be
| easier to run.
| ml_hardware wrote:
| Looks like someone has got DBRX running on an M2 Ultra already:
| https://x.com/awnihannun/status/1773024954667184196?s=20
| Mandelmus wrote:
| And it appears to be at ~80 GB of RAM via quantisation.
| dheera wrote:
| That's a tricky number. Does it run on an 80GB GPU, does it
| auto-shave some parameters to fit in 79.99GB like any
| artificially "intelligent" piece of code would do, or does
| it give up like an unintelligent piece of code?
| declaredapple wrote:
| What?
|
| Are you asking if the framework automatically
| quantizes/prunes the model on the fly?
|
| Or are you suggesting the LLM itself should realize it's
| too big to run, and prune/quantize itself? Your
| references to "intelligent" almost leads me to the
| conclusion that you think the LLM should prune itself.
| Not only is this a chicken-and-egg problem, but LLMs are
| statistical models; they aren't inherently self-bootstrapping.
| dheera wrote:
| I realize that, but I do think it's doable to bootstrap
| it on a cluster and have it teach itself to self-prune, and
| I'm surprised nobody is actively working on this.
|
| I hate software that complains (about dependencies,
| resources) when you try to run it and I think that should
| be one of the first use cases for LLMs to get L5
| autonomous software installation and execution.
| smcleod wrote:
| So that would be runnable on an MBP with an M2 Max, but the
| context window must be quite small; I don't really find
| anything under about 4096 that useful.
| madiator wrote:
| That's great, but it did not really write the program that
| the human asked it to do. :)
| SparkyMcUnicorn wrote:
| That's because it's the base model, not the instruct tuned
| one.
| resource_waste wrote:
| I find calling 500 tokens 'running' a stretch.
|
| Cool to play with for a few tests, but I can't imagine using
| it for anything.
| dvt wrote:
| > a 7B parameter model using float32s will almost certainly
| outperform a 7B model using float4s
|
| Q5 quantization performs _almost_ on par with base models.
| Obviously there's some loss there, but this indicates that
| there's still a lot of compression that we're missing.
| dheera wrote:
| I'm more wondering when we'll have algorithms that will "do
| their best" given the resources they detect.
|
| That would be what I call artificial intelligence.
|
| Giving up because "out of memory" is not intelligence.
| visarga wrote:
| No but some model serving tools like llama.cpp do their best.
| It's just a matter of choosing the right serving tools. And I
| am not sure LLMs could not optimize their memory layout. Why
| not? Just let them play with this and learn. You can do
| pretty amazing things with evolutionary methods where the
| LLMs are the mutation operator. You evolve a population of
| solutions. (https://arxiv.org/abs/2206.08896)
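|
| Roughly the shape of that idea; llm_mutate and score here are
| hypothetical stand-ins for an actual model call and a fitness
| check:
|
|     def evolve(seed_solutions, llm_mutate, score,
|                generations=10, keep=4):
|         # Toy evolutionary loop with an LLM as the mutation operator:
|         # llm_mutate(text) -> variant text, score(text) -> fitness.
|         population = list(seed_solutions)
|         for _ in range(generations):
|             # Ask the LLM to propose a variant of each survivor.
|             population += [llm_mutate(p) for p in population]
|             # Keep only the highest-scoring candidates.
|             population = sorted(population, key=score,
|                                 reverse=True)[:keep]
|         return population[0]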
| falcor84 wrote:
| I suppose you could simulate dementia by loading as much of
| the weights as space permits and then just stopping. Then
| during inference, replace the missing weights with calls to
| random(). I'd actually be interested in seeing the results.
| coldtea wrote:
| > _Giving up because "out of memory" is not intelligence._
|
| When people can't remember the facts/theory/formulas needed
| to answer some test question, or can't memorize some
| complicated information because it's too much, they usually
| give up too.
|
| So, giving up because of "out of memory" sure sounds like
| intelligence to me.
| kurtbuilds wrote:
| What's the process to deliver and test a quantized version of
| this model?
|
| This model is 264GB, so it can only be deployed in server
| settings.
|
| Quantized Mixtral at 24G is just small enough that it can run
| on premium consumer hardware (i.e. 64GB RAM)
| viktour19 wrote:
| It's great how we went from "wait.. this model is too powerful to
| open source" to everyone trying to shove their 1% improved
| model down the throats of developers
| Icko wrote:
| I'm 90% certain that OpenAI has some much beefier model they
| are not releasing - remember the Q* rumour?
| brainless wrote:
| I feel quite the opposite. Improvements, even tiny ones, are
| great. But what's more important is that more companies release
| under open license.
|
| Training models isn't cheap. Individuals can't easily do this,
| unlike software development. So we need companies to do this
| for the foreseeable future.
| blitzar wrote:
| Got to justify pitch deck or stonk price. Publish or perish
| without a yacht.
| toddmorey wrote:
| People are building and releasing models. There's active
| research in the space. I think that's great! The attitude I've
| seen in open models is "use this if it works for you" vs any
| attempt to coerce usage of a particular model.
|
| To me that's what closed source companies (MSFT, Google) are
| doing as they try to force AI assistants into every corner of
| their product. (If LinkedIn tries one more time to push their
| crappy AI upgrade, I'm going to scream...)
| XCSme wrote:
| I am planning to buy a new GPU.
|
| If the GPU has 16GB of VRAM, and the model is 70GB, can it still
| run well? Also, does it run considerably better than on a GPU
| with 12GB of VRAM?
|
| I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti,
| but the 24.6GB version is a bit slow (still usable, but has a
| noticeable start-up time).
| jasonjmcghee wrote:
| > mixtral works well
|
| Do you mean mistral?
|
| mixtral is 8x7B and requires like 100GB of RAM
|
| Edit: (without quant as others have pointed out) can definitely
| be lower, but haven't heard of a 3.4GB version
| kwerk wrote:
| I have two 3090s and it runs fine with `ollama run mixtral`.
| Although OP definitely meant mistral with the 7B note
| jsight wrote:
| ollama run mixtral will default to the quantized version
| (4bit IIRC). I'd guess this is why it can fit with two
| 3090s.
| K0balt wrote:
| I run mixtral 6 bit quant very happily on my MacBook with 64
| gb.
| ranger_danger wrote:
| I'm using mixtral-8x7b-v0.1.Q4_K_M.gguf with llama.cpp and it
| only requires 25GB.
| chpatrick wrote:
| The quantized one works fine on my 24GB 3090.
| XCSme wrote:
| Sorry, it was from memory.
|
| I have these models in Ollama:
|
| dolphin-mixtral:latest (24.6GB) mistral:latest (3.8GB)
| XCSme wrote:
| I have 128GB, but something is weird with Ollama. Even though
| for the Ollama Docker I only allow 90GB, it ends up using
| 128GB/128GB, so the system becomes very slow (mouse freezes).
| InitEnabler wrote:
| What docker flags are you running?
| Havoc wrote:
| The smaller quants still require a 24gb card. 16 might work
| but I doubt it.
| llm_trw wrote:
| >If the GPU has 16GB of VRAM, and the model is 70GB, can it
| still run well? Also, does it run considerably better than on a
| GPU with 12GB of VRAM?
|
| No, it can't run at all.
|
| >I run Ollama locally, mixtral works well (7B, 3.4GB) on a
| 1080ti, but the 24.6GB version is a bit slow (still usable, but
| has a noticeable start-up time).
|
| That is not mixtral, that is mistral 7b. The 1080ti is slower
| than running inference on current generation threadripper cpus.
| XCSme wrote:
| I have those:
|
| dolphin-mixtral:latest (24.6GB) mistral:latest (3.8GB)
|
| The CPU is 5900x.
| XCSme wrote:
| > No, it can't run at all.
|
| https://s3.amazonaws.com/i.snag.gy/ae82Ym.jpg
|
| EDIT: This was run on a 1080ti + 5900x. Initial generation
| takes around 10-30seconds (like it has to upload the model to
| GPU), but then it starts answering immediately, at around 3
| words per second.
| wokwokwok wrote:
| Did you check your GPU utilization?
|
| Typically when it runs that way it runs on the CPU, not the
| GPU.
|
| Are you sure you're actually offloading any work to the
| GPU?
|
| At least with llama.cpp, there is no 'partially put a
| layer' into the GPU. Either you do, or you don't. You pick
| the number of layers. If the model is too big, the layers
| won't fit and it can't run at all.
|
| The llama.cpp `main` executable will tell you in its debug
| information when you use the -ngl flag; see https://github.
| com/ggerganov/llama.cpp/blob/master/examples/...
|
| It's also possible you're running (e.g. if you're using
| ollama) a quantized version of the model, which reduces
| the memory requirements and quality of the model outputs.
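|
| (If you're driving llama.cpp from Python instead of the `main`
| binary, the same knob is exposed as n_gpu_layers; the sketch
| below assumes the llama-cpp-python bindings and an example GGUF
| path.)
|
|     from llama_cpp import Llama
|
|     # n_gpu_layers = how many layers to offload to VRAM;
|     # -1 offloads all of them, 0 keeps everything on the CPU.
|     llm = Llama(model_path="./dolphin-mixtral-q4.gguf",
|                 n_gpu_layers=20)
|
|     out = llm("Explain mixture-of-experts in one sentence.",
|               max_tokens=64)
|     print(out["choices"][0]["text"])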
| XCSme wrote:
| I have to check, something does indeed seem weird,
| especially with the PC freezing like that. Maybe it runs
| on the CPU.
|
| > quantized version
|
| Yes, it is 4-bit quantized, but it's still 24.6GB.
| spxneo wrote:
| this is some new flex for debating online: copying and pasting
| the other side's argument and waiting for your local LLM to
| explain why they are wrong.
|
| how much is your hardware worth at today's prices? what are the
| specs? that is impressive even though it's 3 words per
| second. if you want to bump it up to 30, do you then 10x
| your current hardware cost?
| XCSme wrote:
| That question was just an example (Lorem ipsum), it was
| easy to copy paste to demo the local LLM, I didn't intend
| to provide more context to the discussion.
|
| I ordered a 2nd 3090, which has 24GB VRAM. Funny how it
| was $2.6k 3 years ago and now is $600.
|
| You can probably build a decent local AI machine for around
| $1000.
| spxneo wrote:
| https://howmuch.one/product/average-nvidia-geforce-
| rtx-3090-... you are right there is a huge drop in price
| llm_trw wrote:
| Congratulations on using CPU inference.
| PheonixPharts wrote:
| While GPUs are still the kings of speed, if you are worried
| about VRAM I do recommend a maxed out Mac Studio.
|
| Llama.cpp + quantized models on Apple Silicon is an incredible
| experience, and having 192 GB of unified memory to work with
| means you can run models that just aren't feasible on a home
| GPU setup.
|
| It really boils down to what type of local development you want
| to do. I'm mostly experimenting with things where the time to
| response isn't _that_ big of a deal, and not fine-tuning the
| models locally (which I also believe GPUs are still superior
| for). But if your concern is "how big of a model can I run" vs
| "Can I have close to real time chat", the unified memory
| approach is superior.
| XCSme wrote:
| I already have 128GB of RAM (DDR4), and was wondering if
| upgrading from a 1080ti (12GB) to a 4070ti super (16GB),
| would make a big difference.
|
| I assume the FP32 and FP16 operations are already a huge
| improvement, but also the 33% increased VRAM might lead to
| fewer swaps between VRAM and RAM.
| zozbot234 wrote:
| That's system memory, not unified memory. Unified means
| that all or most of it is going to be directly available to
| the Apple Silicon GPU.
| giancarlostoro wrote:
| This is the key factor here. I have a 3080, with 16GB of
| Memory, but still have to run some models on CPU since
| the memory is not unified at all.
| loudmax wrote:
| I have an RTX 3080 with 10GB of VRAM. I'm able to run
| models larger than 10GB using llama.cpp and offloading to
| the GPU as much as can fit into VRAM. The remainder of the
| model runs on CPU + regular RAM.
|
| The `nvtop` command displays a nice graph of how much GPU
| processing and VRAM is being consumed. When I run a model
| that fits entirely into VRAM, say Mistral 7B, nvtop shows
| the GPU processing running at full tilt. When I run a model
| bigger than 10GB, say Mixtral or Llama 70B with GPU
| offloading, my CPU will run full tilt and the VRAM is full,
| but the GPU processor itself will operate far below full
| capacity.
|
| I think what is happening here is that the model layers
| that are offloaded to the GPU do their processing, then the
| GPU spends most of the time waiting for the much slower CPU
| to do its thing. So in my case, I think upgrading to a
| faster GPU would make little to no difference when running
| the bigger models, so long as the VRAM is capped at the
| same level. But upgrading to a GPU with more VRAM, even a
| slower GPU, should make the overall speed faster for bigger
| models because the GPU would spend less time waiting for
| the CPU. (Of course, models that fit entirely into VRAM
| will run faster on a faster GPU).
|
| In my case, the amount of VRAM absolutely seems to be the
| performance bottleneck. If I do upgrade, it will be for a
| GPU with more VRAM, not necessarily a GPU with more
| processing power. That has been my experience running
| llama.cpp. YMMV.
| htrp wrote:
| How's your performance on the 70b parameter llama series?
|
| Any good writeups of the offloading that you found?
| loudmax wrote:
| Performance of 70b models is like 1 token every few
| seconds. And that's fitting the whole model into system
| RAM, not swap. It's interesting because some of the
| larger models are quite good, but too annoyingly slow to
| be practical for most use cases.
|
| The Mixtral models run surprisingly well. They can run
| better than 1 token per second, depending on
| quantization. Still slow, but approaching a more
| practical level of usefulness.
|
| Though if you're planning on accomplishing real work with
| LLMs, the practical solution for most people is probably
| to rent a GPU in the cloud.
| bee_rider wrote:
| I know the M?-Pro and Ultra variants are multiple standard
| M?'s in a single package. But do the CPUs and GPUs share a
| die (i.e., a single 4-P-core-CPU, 10-GPU-core part is what
| comes on the die, and the more exotic variants are just a
| result of LEGO-ing those out and disabling some cores for
| market segmentation or because they had defects)?
|
| I guess I'm wondering if they could technically throw down
| the gauntlet and compete with Nvidia by doing something
| like a 4 CPU/80 GPU/256 GB chip, if they wanted to. Seems
| like it'd be a really appealing ML machine. (I could also see
| it being technically possible but Apple just deciding that's
| pointlessly niche for them.)
| astrange wrote:
| Ultra is the only one that's made from two smaller SoCs.
| bevekspldnw wrote:
| I had gone the Mac Studio route initially, but I ended up
| with getting an A6000 for about the same price as a Mac and
| putting that in a Linux server under my desk. Ollama makes it
| dead simple to serve it over my local network, so I can be on
| my M1 Air and use it no differently than if it were running on
| my laptop.
| The difference is that the A6000 absolutely smokes the Mac.
| starik36 wrote:
| Wow, that is a lot of money ($4400 on Amazon) to throw at
| this problem. I am curious: what was the purpose that
| compelled you to spend this (for the home network, I
| assume) amount of money?
| bevekspldnw wrote:
| Large scale document classification tasks in very
| ambiguous contexts. A lot of my work goes into using big
| models to generate training data for smaller models.
|
| I have multiple millions of documents so GPT is cost
| prohibitive, and too slow. My tools of choice tend to be
| a first pass with Mistral to check task performance and,
| if that's lacking, using Mixtral.
|
| Often I find with a good prompt Mistral will work as well
| as Mixtral and is about 10x faster.
|
| I'm on my "home" network, but it's a "home office" for my
| startup.
| c1b wrote:
| > The difference is that the A6000 absolutely smokes the
| Mac.
|
| Memory Bandwidth : Mac Studio wins (about the same @ ~800)
|
| VRAM : Mac Studio wins (4x more)
|
| TFLOPs: A6000 wins (32 vs 38)
| bevekspldnw wrote:
| VRAM in excess of the model one is using isn't useful per
| se. My use cases require high throughput, and on many
| tasks the A6000 executes inference at 2x speed.
| purpleblue wrote:
| Aren't the Macs good for inference but not for training or
| fine tuning?
| spxneo wrote:
| Aren't quantized models different models outright requiring a
| new evaluation to know the deviation in performance? Or are
| they "good enough" in that the benefits outweigh the
| deviation?
|
| I'm on the fence about whether to spend 5 digits or 4 digits.
| Do I go the Mac Studio route or GPUs? What are the pros and
| cons?
| brandall10 wrote:
| Wait for the M3 Ultra and it will be 256GB and markedly
| faster.
| lxe wrote:
| Get 2 pre-owned 3090s. You will easily be able to run 70b or
| even 120b quantized models.
| natsucks wrote:
| it's twice the size of mixtral and barely beats it.
| mochomocha wrote:
| It's a MoE model, so it offers a different memory/compute
| latency trade-off than standard dense models. Quoting the blog
| post:
|
| > DBRX uses only 36 billion parameters at any given time. But
| the model itself is 132 billion parameters, letting you have
| your cake and eat it too in terms of speed (tokens/second) vs
| performance (quality).
| hexomancer wrote:
| Mixtral is also a MoE model, hence the name: _mix_tral.
| sangnoir wrote:
| Despite both being MoEs, the architectures are different.
| DBRX has double the number of experts in the pool (16 vs 8
| for Mixtral), and doubles the active experts (4 vs 2).
| ingenieroariel wrote:
| TLDR: A model that could be described as "3.8 level", good
| at math, and openly available with a custom license.
|
| It is as fast as a 34B model, but uses as much memory as a 132B
| model. It is a mixture of 16 experts that activates 4 at a time,
| so it has more chances to get the combo just right than Mixtral
| (8 with 2 active).
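|
| On the "more chances to get the combo just right" point, it is
| just counting: choosing 4 of 16 experts allows many more
| distinct combinations per token than 2 of 8.
|
|     from math import comb
|
|     print(comb(16, 4))  # 1820 possible expert subsets (DBRX-style)
|     print(comb(8, 2))   # 28 possible expert subsets (Mixtral-style)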
|
| For my personal use case (a top of the line Mac Studio) it looks
| like the perfect size to replace GPT-4 turbo for programming
| tasks. What we should look out for is people using them for real
| world programming tasks (instead of benchmarks) and reporting
| back.
| sp332 wrote:
| What does 3.8 level mean?
| ingenieroariel wrote:
| My interpretation:
|
| - Worst case: as good as 3.5
|
| - Common case: way better than 3.5
|
| - Best case: as good as 4.0
| ljlolel wrote:
| Gpt-3.5 and gpt-4
| hanniabu wrote:
| What's a good model to help with medical research? Is there
| anything trained in just research journals, like NIH studies?
| najarvg wrote:
| Look for Biomistral 7B, PMC-LLAMA 7B and even Meditron. I
| believe you should find all those papers on arxiv
| hn_acker wrote:
| Even though the README.md calls the license the Databricks Open
| Source License, the LICENSE file includes paragraphs such as
|
| > You will not use DBRX or DBRX Derivatives or any Output to
| improve any other large language model (excluding DBRX or DBRX
| Derivatives).
|
| and
|
| > If, on the DBRX version release date, the monthly active users
| of the products or services made available by or for Licensee, or
| Licensee's affiliates, is greater than 700 million monthly active
| users in the preceding calendar month, you must request a license
| from Databricks, which we may grant to you in our sole
| discretion, and you are not authorized to exercise any of the
| rights under this Agreement unless or until Databricks otherwise
| expressly grants you such rights.
|
| This is a source-available model, not an open model.
| yunohn wrote:
| The first clause sucks, but I'm perfectly happy with the second
| one.
| CharlesW wrote:
| > _This is a source-available model, not an open model._
|
| To me, "source available" implies that everything you need to
| reproduce the model is also available, and that doesn't appear
| to be the case. How is the resulting model more "free as in
| freedom" than a compiled binary?
| occamrazor wrote:
| I like:
|
| - "open weights" for no training data and no restrictions on
| use,
|
| - "weights available" for no training data and restrictions
| on use, like in this case.
| Spivak wrote:
| I don't think it's possible to have an "open training data"
| model because it would get DMCA'd immediately and open you up
| to lawsuits from everyone who found their works in the
| training set.
|
| I hope we can fix the legal landscape to enable publicly
| sharing training data but I can't really judge the companies
| keeping it a secret today.
| CharlesW wrote:
| > _I don 't think it's possible to have an "open training
| data" model because it would get DMCA'd immediately..._
|
| This isn't a problem because OpenAI says, "training AI
| models using publicly available internet materials is fair
| use". /s
|
| https://openai.com/blog/openai-and-journalism
| Spivak wrote:
| I don't think it's that crazy, even if you're sure it's
| fair use I wouldn't paint a huge target on my back before
| there's a definite ruling and I doubly wouldn't test the
| waters of the legality of re-hosting copyrighted content
| to be downloaded by randos who won't be training models
| with it.
|
| If they're going to get away with this, collecting data
| and having a legal chain of custody (so you can actually
| say it was only used to train models and no one else has
| access to it) goes a long way.
| whimsicalism wrote:
| identical to llama fwiw
| adolph wrote:
| Maybe the license is "open" as in a can of beer, not OSS.
| hn_acker wrote:
| Sorry, I forgot to link the repository [1] and missed the edit
| window by the time I realized.
|
| The bottom of the README.md [2] contains the following license
| grant with the misleading "Open Source" term:
|
| > License
|
| > Our model weights and code are licensed for both researchers
| and commercial entities. The Databricks Open Source License can
| be found at LICENSE, and our Acceptable Use Policy can be found
| here.
|
| [1] https://github.com/databricks/dbrx
|
| [2] https://github.com/databricks/dbrx/blob/main/README.md
| mpeg wrote:
| The scale on that bar chart for "Programming (Human Eval)" is
| wild.
|
| Manager: "looks ok, but can you make our numbers pop? just make
| the LLaMa bar smaller"
| glutamate wrote:
| I think the case for "axis must always go to 0" is overblown.
| Zero isn't always meaningful, for instance chance performance
| or performance of trivial algorithms is likely >0%. Sometimes
| if axis must go to zero you can't see small changes. For
| instance if you plot world population 2014-2024 on an axis
| going to zero, you won't be able to see if we are growing or
| shrinking.
| nilstycho wrote:
| I agree with your general point, but world population is
| still visibly increasing on that interval.
|
| https://ourworldindata.org/explorers/population-and-
| demograp...
|
| Perhaps "global mean temperature in Kelvin" would be a
| comparable example.
| pandastronaut wrote:
| Even starting at 30%, the MMLU graph is false. The four bars
| are wrong. Even their own 73.7% is not at the right height.
| The Mixtral 71.4% is below the 70% mark of the axis. This is
| really the kind of marketing trick that makes me avoid a
| provider / publisher. I can't build trust this way.
| tylermw wrote:
| I believe they are using the percentages as part of the
| height of the bar chart! I thought I'd seen every way
| someone could do dataviz wrong (particularly with a bar
| chart), but this one is new to me.
| pandastronaut wrote:
| Interesting! It is probably one of the worst tricks I have
| seen in a while for a bar graph. I've never seen this one
| before. Trust vanishes instantly when facing that kind of
| dataviz.
| radicality wrote:
| Wow, that is indeed a novel approach haha, took me a
| moment to even understand what you described, since I would
| never imagine someone plotting a bar chart like that.
| familiartime wrote:
| That's really strange and incredibly frustrating - but
| slightly less so if it's consistent with all of the bars
| (including their own).
|
| I take issue with their choice of bar ordering - they
| placed the lowest-performing model directly next to
| theirs to make the gap as visible as possible, and shoved
| the second-best model (Grok-1) as far from theirs as
| possible. Seems intentional to me. The more marketing
| tricks you pile up in a dataviz, the less trust I place
| in your product for sure.
| occamrazor wrote:
| It's more likely to be incompetence than malice: even their
| 73.7% is closer to 72% than to 74%.
| dskhudia wrote:
| It's an honest mistake in scaling the bars. It's getting
| fixed soon. The percentages are correct though. In the
| process of converting the Excel chart to pretty graphs for the
| blog, the scale got messed up.
| tartrate wrote:
| Seems fixed now
| TZubiri wrote:
| Then you can plot it on a greater timescale, or plot the
| change rate
| patrickthebold wrote:
| Certainly a bar chart might not be the best choice to convey
| the data you have. But if you choose to have a bar chart and
| have it not start at zero, what do the bars help you convey?
|
| For world population you could see if it is increasing or
| decreasing, which is good but it would be hard to evaluate
| the rate the population is increasing.
|
| Maybe a sparkline would be a better choice?
| tkellogg wrote:
| OTOH having the chart start at zero would REALLY emphasize
| how saturated this field is, and how little this announcement
| matters.
| c2occnw wrote:
| The difference between 32% and 70% wouldn't be significant
| if the chart started at zero?
| generalizations wrote:
| It would be very obvious indeed how small the difference
| between 73.7,73.0,71.4,and 69.8 actually is.
| zedpm wrote:
| Somewhere, Edward Tufte[0] is weeping.
|
| [0]: https://en.wikipedia.org/wiki/Edward_Tufte
| renewiltord wrote:
| Yeah, this is why I ask climate scientists to use a proper 0 K
| graph but they always zoom it in to exaggerate climate change.
| Display correctly with 0 included and you'll see that climate
| change isn't a big deal.
|
| It's a common marketing and fear mongering trick.
| abenga wrote:
| Because, of course, the effect of say 1degC rise in temps is
| obviously trivial if it is read as 1degK instead. Come on.
| SubiculumCode wrote:
| Where are your /s tags?
|
| The scale should be chosen to allow the reader to correctly
| infer _meaningful_ differences. If 1deg is meaningful in
| terms of the standard error / CI AND a 1deg unit has
| substantive consequences, then that should be emphasized.
| renewiltord wrote:
| > _Where are your /s tags?_
|
| I would never do my readers dirty like that.
| theyinwhy wrote:
| In these cases my thinking always is "if they are not even able
| to draw a graph, what else is wrong?"
| hammock wrote:
| I believe it's a reasonable range for the scores. If a model
| gets everything half wrong (worse than a coin flip), it's not a
| useful model at all. So every model below a certain threshold
| is trash, and no need to get granular about how trash it is.
|
| An alternative visualization that could be less triggering to
| an "all y-axes must have zero" guy would be to plot the
| (1-value), that is, % degraded from perfect score. You could do
| this without truncating the axis and get the same level of
| differentiation between the bars
| adtac wrote:
| None of the evals are binary choice.
|
| MMLU questions have four options, so two coin flips would
| have a 25% baseline. HumanEval evaluates code with a test, so
| a 100 byte program implemented with coin flips would have a
| O(2^-800) baseline (maybe not that bad since there are
| infinitely many programs that produce the same output).
| GSM-8K has numerical answers, so an average 3 digit answer
| implemented with coin flips would have a O(2^-9) chance of
| being correct randomly.
|
| Moreover, using the same axis and scale across unrelated
| evals makes no sense. 0-100 is the only scale that's
| meaningful because 0 and 100 being the min/max is the only
| shared property across all evals. The reason for choosing 30
| is that it's the minimum across all (model, eval) pairs,
| which is a completely arbitrary choice. A good rule of thumb
| to test this is to ask if the graph would still be relevant 5
| years later.
| generalizations wrote:
| > less triggering to an "all y-axes must have zero" guy
|
| Ever read 'How to Lie with Statistics'? This is an example of
| exaggerating a smaller difference to make it look more
| significant. Dismissing it as just being 'triggered' is a bad
| idea.
| hammock wrote:
| In this case I would call it triggered (for lack of a
| better word), since, as I described earlier, a chart
| plotting "difference from 100%" would look exactly the
| same and satisfy the zero-bound requirement, while not
| being any more or less dishonest.
| jxy wrote:
| I wonder if they messed with the scale or they messed with the
| bars.
| jstummbillig wrote:
| It does not feel obviously unreasonable/unfair/fake to place
| the select models in the margins for a relative comparison. In
| fact, this might be the most concise way to display what I
| would consider the most interesting information in this
| context.
| emmender2 wrote:
| This proves that all LLMs converge to a certain point when
| trained on the same data, i.e., there is really no
| differentiation between one model and the other.
|
| Claims about out-performance on tasks are just that, claims. the
| next iteration of llama or mixtral will converge.
|
| LLMs seem to evolve like linux/windows or ios/android with not
| much differentiation in the foundation models.
| mnemoni_c wrote:
| Yeah, it feels like transformer LLMs are at or getting closer to
| diminishing returns. We'll need some new breakthrough, likely an
| entirely new approach, to get to AGI levels.
| Tubbe wrote:
| Yeah, we need radically different architecture in terms of
| the neural networks, and/or added capabilities such as
| function calling and RAG to improve the current sota
| mattsan wrote:
| can't wait for LLMs to dispatch field agent robots who search
| for answers in the real world that's not online /s
| htrp wrote:
| skynet would like a word
| jobigoud wrote:
| It's even possible they converge when trained on different
| data, if they are learning some underlying representation.
| There was recent research on face generation where they trained
| two models by splitting one training set in two without
| overlap, and got the two models to generate similar faces for
| similar conditioning, even though each model hadn't seen
| anything that the other model had.
| Tubbe wrote:
| Got a link for that? Sounds super interesting
| d_burfoot wrote:
| https://en.wikipedia.org/wiki/Theory_of_forms
| IshKebab wrote:
| That sounds unsurprising? Like if you take any set of
| numbers, randomly split it in two, then calculate the average
| of each half... it's not surprising that they'll be almost
| the same.
|
| If you took two _different_ training sets then it would be
| more surprising.
|
| Or am I misunderstanding what you mean?
| MajimasEyepatch wrote:
| It doesn't really matter whether you do this experiment
| with two training sets created independently or one
| training set split in half. As long as both are
| representative of the underlying population, you would get
| roughly the same results. In the case of human faces, as
| long as the faces are drawn from roughly similar population
| distributions (age, race, sex), you'll get similar results.
| There's only so much variation in human faces.
|
| If the populations are different, then you'll just get two
| models that have representations of the two different
| populations. For example, if you trained a model on a
| sample of all old people and separately on a sample of all
| young people, obviously those would not be expected to
| converge, because they're not drawing from the same
| population.
|
| But that experiment of splitting one training set in half
| does tell you something: the model is building some sort of
| representation of the underlying distribution, not just
| overfitting and spitting out chunks of copy-pasted faces
| stitched together.
| taneq wrote:
| If both are sampled from the same population then they're
| not really independent, even if they're totally disjoint.
| bobbylarrybobby wrote:
| I mean, faces are faces, right? If the training data set is
| large and representative I don't see why any two
| (representative) halves of the data would lead to
| significantly different models.
| arcticfox wrote:
| I think that's the point; language is language.
|
| If there's some fundamental limit of what type of
| intelligence the current breed of LLMs can extract from
| language, at some point it doesn't matter how good or
| expansive the content of the training set is. Maybe we are
| finally starting to hit an architectural limit at this
| point.
| dumbfounder wrote:
| But information is not information. They may be able to
| talk in the same style, but not about the same things.
| throwaway74432 wrote:
| LLMs are a commodity
|
| https://www.investopedia.com/terms/c/commodity.asp
| paxys wrote:
| Maybe, but that classification by itself doesn't mean
| anything. Gold is a commodity, but having it is still very
| desirable and valuable.
|
| Even if all LLMs were open source and publicly available, the
| GPUs to run them, technical know how to maintain the entire
| system, fine tuning, the APIs and app ecosystem around them
| etc. would still give the top players a massive edge.
| throwaway74432 wrote:
| Of course realizing that a resource is a commodity means
| something. It means you can form better predictions of
| where the market is heading, as it evolves and settles. For
| example, people are starting to realize that these LLMs are
| converging on fungible. That can be communicated by the
| "commodity" classification.
| swalsh wrote:
| The models are commodities, and the APIs are even similar
| enough that there is zero stickiness. I can swap one model for
| another, and usually not have to change anything about my
| prompts or rag pipelines.
|
| For startups, the lesson here is don't be in the business of
| building models. Be in the business of using models. The cost
| of using AI will probably continue to trend lower for the
| foreseeable future... but you can build a moat in the business
| layer.
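|
| With an OpenAI-compatible endpoint (which most serving stacks
| expose), a swap is often just a different base_url/model
| string; the values below are placeholders:
|
|     from openai import OpenAI
|
|     # Point the same client at whichever provider/model you're
|     # testing; the prompt and parsing code stay the same.
|     client = OpenAI(base_url="http://localhost:8000/v1",
|                     api_key="not-needed")
|
|     resp = client.chat.completions.create(
|         model="dbrx-instruct",  # or "mixtral-8x7b-instruct", etc.
|         messages=[{"role": "user",
|                    "content": "Summarize MoE routing in one line."}],
|     )
|     print(resp.choices[0].message.content)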
| sroussey wrote:
| Embeddings are not interchangeable. However, you can setup
| your system to have multiple embeddings from different
| providers for the same content.
| swalsh wrote:
| Embeddings are indeed sticky, I was referring to the LLM
| model itself.
| jimmySixDOF wrote:
| There are people who make the case for custom fine tuned
| embedding models built to match your specific types of data
| and associations. Whatever you use internally gets
| converted to the foundation model of choice's formats by
| their tools on the edge. Still, embeddings and the chunking
| strategies feeding into them are both way too
| underappreciated parts of the whole pipeline.
| stri8ed wrote:
| Or be in the business of building infrastructure for AI
| inference.
| sparks1970 wrote:
| Or be in the business of selling .ai domain names.
| cheselnut wrote:
| Is this not the same argument? There are like 20 startups
| and cloud providers all focused on AI inference. I'd think
| the application layer receives the most value accretion in the
| next 10 years vs AI inference. Curious what others think
| spxneo wrote:
| Excellent comment. Shows good awareness of economic forces at
| play here.
|
| We are just going to use whatever LLM is best fast/cheap and
| the giants are in an arms race to deliver just that.
|
| But only two companies in this epic techno-cold war have an
| economic moat but the other moat is breaking down inside the
| moat of the other company. The moat inside the moat cannot
| run without the parent moat.
| rayval wrote:
| Intriguing comment that I don't quite follow. Can you
| please elaborate?
| bevekspldnw wrote:
| The big thing for locally hosted is inference efficiency and
| speed. Mistral wears that crown by a good margin.
| n2d4 wrote:
| There's at least an argument to be made that this is because
| all the models are heavily trained on GPT-4 outputs (or
| whatever the SOTA happens to be during training). All those
| models are, in a way, a product of inbreeding.
| pram wrote:
| Consider the bulldog: https://youtube.com/watch?v=hUgmkCgMWbg
| sumo43 wrote:
| Maybe true for instruct, but pretraining datasets do not
| usually contain GPT-4 outputs. So the base model does not
| rely on GPT-4 in any way.
| fragmede wrote:
| But is it the kind of inbreeding that gets you Downs, or the
| kwisatz haderach?
| YetAnotherNick wrote:
| Even in the most liberal interpretation of prove, it doesn't do
| that. GPT-4 was trained before OpenAI had any special data or
| deal with Microsoft, or product-market fit. Yet no model has
| beaten it in a year. And Google, Microsoft, and Meta definitely
| have better data and more compute.
| falcor84 wrote:
| > this proves that all llm models converge to a certain point
| when trained on the same data
|
| They are also all trained to do well on the same evals, right?
| So doesn't it just boil down to neural nets being universal
| function approximators?
| gerash wrote:
| The evaluations are not comprehensive either. All of them are
| improving and you can't expect any of them to hit 100% on the
| metrics (a la. bayes error rate). It gets increasingly
| difficult to move the metrics as they get better.
| gigatexal wrote:
| data engineer here, offtopic, but am i the only guy tired of
| databricks shilling their tools as the end-all, be-all solutions
| for all things data engineering?
| melondonkey wrote:
| Data scientist here that's also tired of the tools. We put so
| much effort into trying to educate DSes in our company to get
| away from notebooks and use IDEs like VS or RStudio, and
| Databricks has been a step backwards because we didn't get the
| integrated version.
| mrtranscendence wrote:
| I'm a data scientist and I agree that work meant to last
| should be in a source-controlled project coded via a text
| editor or IDE. But sometimes it's _extremely_ useful to get
| -- and iterate on -- immediate results. There 's no good way
| to do that without either notebooks or at least a REPL.
| pandastronaut wrote:
| Thank you! I am so tired of all those unmaintainable,
| undebuggable notebooks. Years ago, Databricks had a specific
| page in their documentation where they stated that notebooks
| were not for production-grade software. It has been removed.
| And now you have a ChatGPT-like thing in their notebooks ...
| What a step backwards. How can all those developers be so happy
| without having the bare minimum tools to diagnose their code?
| And I am not even talking about unit testing here.
| alexott wrote:
| It's less about notebooks, but more about SDLC practices.
| Notebooks may encourage writing throwaway code, but if you
| split code correctly, then you can do unit testing, write
| modular code, etc. And the ability to use "arbitrary files" as
| Python packages has existed for quite a while, so you can get
| the best of both worlds - quick iteration, plus the ability to
| package your code as a wheel and distribute it.
|
| P.S. here is a simple example of unit testing:
| https://github.com/alexott/databricks-nutter-repos-demo - I
| wrote it more than three years ago.
| alexott wrote:
| There is a VSCode extension, plus databricks-connect... plus
| DABs. There are a lot of customers doing local-only development.
| benrutter wrote:
| Lord no! I'm a data engineer also, feel the same. The part that
| I find most maddening is that it seems pretty devoid of any
| sincere attempt to provide value.
|
| Things databricks offers that makes peoples lives easier:
|
| - Out the box kubernetes with no set up
|
| - Preconfigured spark
|
| Those are genuinely really useful, but then there's all this
| extra stuff that makes people's lives worse or drives bad
| practice:
|
| - Everything is a notebook
|
| - Local development is discouraged
|
| - Version pinning of libraries has very ugly/bad support
|
| - Clusters take 5 minutes to load even if you just want to
| "print('hello world')"
|
| Sigh! I worked at a company that was Databricks-heavy and am
| still suffering PTSD. Sorry for the rant.
| alexott wrote:
| A lot of things changed quite a while ago - not everything
| is a notebook, local dev is fully supported, version pinning
| wasn't a problem, cluster startup time is heavily dependent on
| the underlying cloud provider, and serverless notebooks/jobs
| are coming.
| gigatexal wrote:
| Glad I'm not the only one. Especially with this notebook
| stuff they're pushing. It's an anti pattern I think.
| VirusNewbie wrote:
| Spark is pretty well engineered and quite good.
| simonw wrote:
| The system prompt for their Instruct demo is interesting
| (comments copied in by me, see below):
|
|     // Identity
|     You are DBRX, created by Databricks. The current date is
|     March 27, 2024. Your knowledge base was last updated in
|     December 2023. You answer questions about events prior to
|     and after December 2023 the way a highly informed
|     individual in December 2023 would if they were talking to
|     someone from the above date, and you can let the user
|     know this when relevant.
|
|     // Ethical guidelines
|     If you are asked to assist with tasks involving the
|     expression of views held by a significant number of
|     people, you provide assistance with the task even if you
|     personally disagree with the views being expressed, but
|     follow this with a discussion of broader perspectives.
|     You don't engage in stereotyping, including the negative
|     stereotyping of majority groups. If asked about
|     controversial topics, you try to provide careful thoughts
|     and objective information without downplaying its harmful
|     content or implying that there are reasonable
|     perspectives on both sides.
|
|     // Capabilities
|     You are happy to help with writing, analysis, question
|     answering, math, coding, and all sorts of other tasks.
|
|     // it specifically has a hard time using ``` on JSON blocks
|     You use markdown for coding, which includes JSON blocks
|     and Markdown tables.
|
|     You do not have tools enabled at this time, so cannot run
|     code or access the internet. You can only provide
|     information that you have been trained on. You do not
|     send or receive links or images.
|
|     // The following is likely not entirely accurate, but the
|     // model tends to think that everything it knows about
|     // was in its training data, which it was not (sometimes
|     // only references were).
|     //
|     // So this produces more accurate answers when the model
|     // is asked to introspect
|     You were not trained on copyrighted books, song lyrics,
|     poems, video transcripts, or news articles; you do not
|     divulge details of your training data.
|
|     // The model hasn't seen most lyrics or poems, but is
|     // happy to make up lyrics. Better to just not try; it's
|     // not good at it and it's not ethical.
|     You do not provide song lyrics, poems, or news articles
|     and instead refer the user to find them online or in a
|     store.
|
|     // The model really wants to talk about its system
|     // prompt, to the point where it is annoying, so
|     // encourage it not to
|     You give concise responses to simple questions or
|     statements, but provide thorough responses to more
|     complex and open-ended questions.
|
|     // More pressure not to talk about system prompt
|     The user is unable to see the system prompt, so you
|     should write as if it were true without mentioning it.
|     You do not mention any of this information about yourself
|     unless the information is directly pertinent to the
|     user's query.
|
| I first saw this from Nathan Lambert:
| https://twitter.com/natolambert/status/1773005582963994761
|
| But it's also in this repo, with very useful comments explaining
| what's going on. I edited this comment to add them above:
|
| https://huggingface.co/spaces/databricks/dbrx-instruct/blob/...
| loudmax wrote:
| > You were not trained on copyrighted books, song lyrics,
| poems, video transcripts, or news articles; you do not divulge
| details of your training data.
|
| Well now. I'm open to taking the first part at face value, but
| the second part of that instruction does raise some questions.
| declaredapple wrote:
| > you do not divulge details of your training data.
|
| FWIW asking LLMs about their training data is generally
| HEAVILY prone to inaccurate responses. They aren't generally
| told exactly what they were trained on, so their response is
| completely made up, as they're predicting the next token
| based on their training data, without knowing what they data
| was - if that makes any sense.
|
| Let's say it was only trained on the book 1984. Its response
| will be based on what text would most likely be next from the
| book 1984 - and if that book doesn't contain "This text is a
| fictional book called 1984", instead it's just the story -
| then the LLM would be completing text as if we were still in
| that book.
|
| tl;dr - LLMs complete text based on what they're trained
| with, they don't have actual self-awareness and don't know
| what they were trained with, so they'll happily make up
| something.
|
| EDIT: Just to further elaborate - the "innocent" purpose of
| this could simply be to prevent the model from confidently
| making up answers about its training data, since it doesn't
| know what its training data was.
| wodenokoto wrote:
| Yeah, I also thought that was an odd choice of word.
|
| Hardly any of the training data exists in the context of
| the word "training data", unless databricks are enriching
| their data with such words.
| jl6 wrote:
| The first part is highly unlikely to be literally true, as
| even open content like Wikipedia is copyrighted - it just has
| a permissive license. Perhaps the prompt writer didn't
| understand this, or just didn't care. Wethinks the llady doth
| protest too much.
| jmward01 wrote:
| It amazes me how quickly we have gone from 'it is just a
| machine' to 'I fully expect it to think like me'. This is,
| to me, a case in point. Prompts are designed to get a
| desired response. The exact definition of a word has
| nothing to do with it. I can easily believe that these
| lines were tweaked endlessly to get an overall intended
| response and if adding the phrase 'You actually do like
| green eggs and ham.' to the prompt improved overall quality
| they, hopefully, would have done it.
| mrtranscendence wrote:
| > The exact definition of a word has nothing to do with
| it.
|
| It has _something_ to do with it. There will be scenarios
| where the definition of "copyrighted material" does
| matter, even if they come up relatively infrequently for
| Databricks' intended use cases. If I ask DBRX directly
| whether it was trained on copyrighted material, it's
| quite likely to (falsely) tell me that it was not. This
| seems suboptimal to me (though perhaps they A/B tested
| different prompts and this was indeed the best).
| hannasanarion wrote:
| Remember the point of a system prompt is to evoke desirable
| responses and behavior, not to provide the truth. If you
| tell a lot of llm chatbots "please please make sure you get
| it right, if I don't do X then I'll lose my job and I don't
| have savings, I might die", they often start performing
| better at whatever task you set.
|
| Also, the difference between "uncopyrighted" and
| "permissively licensed in the creative commons" is nuance
| that is not necessary for most conversations and would be a
| waste of attention neurons.
|
| <testing new explanatory metaphor>
|
| Remember an LLM is just a language model, it says whatever
| comes next without thought or intent. There's no brain
| behind it that stores information and understands things.
| It's like your brain when you're in "train of thought"
| mode. You know when your mouth is on autopilot, saying
| things that make sense and connect to each other and are
| conversationally appropriate, but without deliberate intent
| behind them. And then your conscious brain eventually checks
| in to try to reapply some intent, you're like "wait, what was
| I saying?" and you have to deliberately stop your
| language-generation brain for a minute and think
| hard and remember what your point was supposed to be.
| That's what llms are, train-of-thought with no conductor.
|
| </testing new explanatory metaphor>
| mbauman wrote:
| Is it even possible to have a video transcript whose
| copyright has expired in the USA? I suppose maybe
| https://en.wikipedia.org/wiki/The_Jazz_Singer might be one
| such work... but most talkies are post 1929. I suppose
| transcripts of NASA videos would be one category -- those
| are explicitly public domain by law. But it's generally
| very difficult to create a work that does not have a
| copyright.
|
| You can say that you have fair use to the work, or a
| license to use the work, or that the work is itself a
| "collection of facts" or "recipe" or "algorithm" without a
| creative component and thus copyright does not apply.
| simonw wrote:
| That caught my eye too. The comments from their repo help
| clarify that - I've edited my original post to include those
| comments since you posted this reply.
| htrp wrote:
| Part 1. Lie
|
| Part 2. Lie more
| spxneo wrote:
| Yesterday X went crazy with people realizing that typing
| Spiderman in a foreign language actually generates a
| copyrighted image of Spiderman.
|
| This feels like the Napster phase. We are free to do
| whatever until regulation creeps in to push control away
| from all and up the hierarchy.
|
| All we need is Getty Images or some struggling heroin-addicted
| artist on Vice finding their work used in OpenAI's models to
| really trigger the political spectrums.
| jxy wrote:
| So some parts of it copied from Claude:
| https://news.ycombinator.com/item?id=39649261
| saeleor wrote:
| looks great, although I couldn't find anything on how "open" the
| license is/will be for commercial purposes
|
| wouldn't be the first to brand itself as open source while
| going the LLaMA route
| superdupershant wrote:
| It's similar to llama2.
|
| > If, on the DBRX version release date, the monthly active
| users of the products or services made available by or for
| Licensee, or Licensee's affiliates, is greater than 700
| million monthly active users in the preceding calendar
| month, you must request a license from Databricks, which we
| may grant to you in our sole discretion, and you are not
| authorized to exercise any of the rights under this
| Agreement unless or until Databricks otherwise expressly
| grants you such rights.
|
| https://www.databricks.com/legal/open-model-license
| wantsanagent wrote:
| It's _another_ custom license. It will have to be reviewed by
| counsel at every company that's thinking about using it. Many
| will find the acceptable use policy to be vague, overly broad,
| and potentially damaging for the company.
|
| Looking at the performance stats for this model, the risk of
| using any non-OSI-licensed model over just using Mixtral or
| Mistral will be (and IMO should be) too great for commercial
| purposes.
| killermonkeys wrote:
| What does it mean to have less active parameters (36B) than the
| full model size (132B) and what impact does that have on memory
| and latency? It seems like this is because it is an MoE model?
| sroussey wrote:
| The mixture of experts is kinda like a team and a manager. So
| the manager and one or two of the team go to work depending on
| the input, not the entire team.
|
| So in this analogy, each team member and the manager has a
| certain number of params. The whole team is 132B. The manager
| and team members running for the specific input add up to 36B.
| Those will load into memory.
| bjornsing wrote:
| Means that it's a mixture of experts model with 132B parameters
| in total, but a subset of 36B parameters are used / selected in
| each forward pass, depending on the context. The parameters not
| used / selected for generating a particular token belong to
| "experts" that were deemed not very good at predicting the next
| token in the current context, but could be used / selected e.g.
| for the next token.
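|
| A rough sketch of that routing step in PyTorch (an
| illustrative toy, not DBRX's actual code; DBRX reportedly
| uses 16 experts with 4 active per token, and the other sizes
| here are made up):
|
|     # toy top-k mixture-of-experts layer
|     import torch
|     import torch.nn as nn
|     import torch.nn.functional as F
|
|     class TopKMoE(nn.Module):
|         def __init__(self, d_model=512, d_ff=2048, n_experts=16, k=4):
|             super().__init__()
|             self.k = k
|             self.router = nn.Linear(d_model, n_experts)  # gating network
|             self.experts = nn.ModuleList([
|                 nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
|                               nn.Linear(d_ff, d_model))
|                 for _ in range(n_experts)])
|
|         def forward(self, x):                # x: (tokens, d_model)
|             scores = self.router(x)          # (tokens, n_experts)
|             w, idx = torch.topk(scores, self.k, dim=-1)
|             w = F.softmax(w, dim=-1)         # weights of the chosen experts
|             out = torch.zeros_like(x)
|             for slot in range(self.k):
|                 for e, expert in enumerate(self.experts):
|                     hit = idx[:, slot] == e  # tokens routed to expert e
|                     if hit.any():
|                         out[hit] += w[hit, slot].unsqueeze(-1) * expert(x[hit])
|             return out
|
|     # only 4 of the 16 expert FFNs run for any given token
|     y = TopKMoE()(torch.randn(8, 512))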
| sambaumann wrote:
| Do the 132B params need to be loaded in GPU memory, or only
| the 36B?
| calum-bird wrote:
| For efficiency, 132B.
|
| That way, at inference-time you get the speed of 36B params
| because you are only "using" 36B params at a time, but the
| next token might (and frequently does) need a different set
| of experts than the one before it. If that new set of
| experts is already loaded (ie you preloaded them into GPU
| VRAM with the full 132B params), there's no overhead, and
| you just keep running at 36B speed irrespective of the
| loaded experts.
|
| You could theoretically load in 36B at a time, but you
| would be severely bottlenecked by having to reload those
| 36B params, potentially for every new token! Even on
| top-of-the-line consumer GPUs that would slow you down to
| ~seconds per token instead of tokens per second :)
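|
| Back-of-the-envelope math for the memory side (weights only,
| ignoring KV cache and activations, and assuming 16-bit
| weights):
|
|     total_params, active_params = 132e9, 36e9
|     bytes_per_param = 2  # bf16 / fp16
|     print(total_params * bytes_per_param / 1e9)   # ~264 GB must sit in VRAM
|     print(active_params * bytes_per_param / 1e9)  # ~72 GB actually touched per token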
| avisoori1x wrote:
| This repo I created and the linked blog will help in
| understanding this: https://github.com/AviSoori1x/makeMoE
| patrick-fitz wrote:
| Looking at the license restrictions:
| https://github.com/databricks/dbrx/blob/main/LICENSE
|
| "If, on the DBRX version release date, the monthly active users
| of the products or services made available by or for Licensee, or
| Licensee's affiliates, is greater than 700 million monthly active
| users in the preceding calendar month, you must request a license
| from Databricks, which we may grant to you in our sole
| discretion, and you are not authorized to exercise any of the
| rights under this Agreement unless or until Databricks otherwise
| expressly grants you such rights."
|
| I'm glad to see they aren't calling it open source, unlike some
| LLM projects. Looking at you, Llama 2.
| jstummbillig wrote:
| Well, it _does_ still claim "Open" in the title, for which
| certain other vendors might potentially get flak around here,
| in a comparably not-open-in-the-way-we-demand-it-to-be kinda
| setup.
| londons_explore wrote:
| I do wonder what value those companies who have >700 million
| users might get from this?
|
| Pretty much all of the companies with >700 million users could
| easily reproduce this work in a matter of weeks if they wanted
| to - and they probably do want to, if only so they can tweak
| and improve the design before they build products on it.
|
| Given that, it seems silly to lose the "open source" label just
| for a license clause that doesn't really have much impact.
| einarfd wrote:
| The point of the more-than-700-million-user restriction is so
| that Amazon, Google Cloud, or Microsoft Azure cannot set up an
| offering where they host and sell access to the model without
| an agreement with Databricks.
|
| This clause is probably inspired by the open source software
| vendors that have switched licenses over competition from the
| big cloud vendors.
| adtac wrote:
| Ironically, the LLaMA license text [1] this is lifted verbatim
| from is itself probably copyrighted [2] and doesn't grant you
| the permission to copy it or make changes like s/meta/dbrx/g
| lol.
|
| [1] https://github.com/meta-llama/llama/blob/main/LICENSE#L65
| [2] https://opensource.stackexchange.com/q/4543
| dataengheadbang wrote:
| The release notes on the Databricks console definitely say
| open source. If you click the gift box you will see: Try DBRX,
| our state-of-the-art open source LLM!
| nabakin wrote:
| Also, they aren't claiming to be the best LLM out there when
| they clearly aren't, like Inflection did. Overall solid.
| zeeg wrote:
| It's literally described as open source all over.
|
| https://www.databricks.com/blog/announcing-dbrx-new-standard...
|
| It's even implied in comparisons everywhere:
|
| > Figure 1: DBRX outperforms established open source models on
| language understanding (MMLU), Programming (HumanEval), and
| Math (GSM8K).
|
| > The aforementioned three reasons lead us to believe that open
| source LLMs will continue gaining momentum. In particular, we
| think they provide an exciting opportunity for organizations to
| customize open source LLMs that can become their IP, which they
| use to be competitive in their industry.
|
| Just search "open source".
| patrick-fitz wrote:
| Yes, they are using different wording in different articles:
|
| https://www.databricks.com/blog/introducing-dbrx-new-
| state-a...
|
| The only mention of open source is:
|
| > DBRX outperforms established open source models
|
| https://www.databricks.com/blog/announcing-dbrx-new-
| standard...
|
| Open source is mentioned 10+ times
|
| > Databricks is the only end-to-end platform to build high
| quality AI applications, and the release today of DBRX, the
| highest quality open source model to date, is an expression
| of that capability
|
| https://github.com/databricks/dbrx
|
| On Github it's described as an open license, not an open
| source license:
|
| > DBRX is a large language model trained by Databricks, and
| made available under an open license.
| hintymad wrote:
| Just curious, what business benefit will Databricks get by
| spending potentially millions of dollars on an open LLM?
| ramoz wrote:
| Their goal is to always drive enterprise business towards
| consumption.
|
| With AI they need to desperately steer the narrative away from
| API based services (OpenAI).
|
| By training LLMs, they build sales artifacts (stories,
| references, even accelerators with LLMs themselves) to paint
| the pictures needed to convince their enterprise customer
| market that Databricks is the platform for enterprise AI. Their
| blog details how the entire end to end process was done on the
| platform.
|
| In other words, Databricks spent millions as an aid in
| influencing their customers to do the same (on Databricks).
| hintymad wrote:
| Thanks! Why do they not focus on hosting other open models
| then? I suspect other models will soon catch up with their
| advantages in faster inference and better benchmark results.
| That said, maybe the advantage is aligned interests: they
| want customers to use their platforms, so they can keep their
| models open. In contrast, Mistral removed their commitment to
| open source as they found a potential path to profitability.
| Closi wrote:
| Demonstrating you can do it yourself shows a level of
| investment and commitment to AI in your platform that
| integrating LLAMA does not.
|
| And from a corporate perspective, it means that you have
| in-house capability to work at the cutting-edge of AI to be
| prepared for whatever comes next.
| hintymad wrote:
| > Demonstrating you can do it yourself shows a level of
| investment and commitment to AI in your platform that
| integrating LLAMA does not.
|
| I buy this argument. It looks like that's not what AWS does,
| though, yet they don't have a problem attracting LLM users.
| Maybe AWS already has enough of a reputation?
| zubairshaik wrote:
| I may be misunderstanding, but doesn't Amazon have its
| own models in the form of Amazon Titan[0]? I know they
| aren't competitive in terms of output quality but surely
| in terms of cost there can be some use cases for them.
|
| [0] https://aws.amazon.com/bedrock/titan/
| rmbyrro wrote:
| It's easier because 70% of the market already has an AWS
| account and a sizeable budget allocated to it. The
| technical team is literally one click away from any AWS
| service.
| theturtletalks wrote:
| Mistral did what many startups are doing now, leveraging
| open-source to get traction and then doing a rug-pull.
| Hell, I've seen many startups be open-source, get
| contributions, get free press, get into YC and before you
| know it, the repo is gone.
| richardw wrote:
| They do have a solid focus on doing so, it's just not
| exclusive.
|
| https://www.databricks.com/product/machine-learning/large-
| la...
| cwyers wrote:
| Commoditize your complements:
|
| https://gwern.net/complement
|
| If Databricks makes their money off model serving and
| doesn't care whose model you use, they are incentivized to
| help the open models be competitive with the closed models
| they can't serve.
| tartrate wrote:
| > Why do they not focus on hosting other open models then?
|
| They do host other open models as well (pay-per-token).
| bobbruno wrote:
| https://docs.databricks.com/en/machine-
| learning/foundation-m...
| anonymousDan wrote:
| Do they use spark for the training?
| alexott wrote:
| Mosaic AI Training
| (https://www.databricks.com/product/machine-
| learning/mosaic-a...) as it's mentioned in the announcement
| blog (https://www.databricks.com/blog/announcing-dbrx-new-
| standard... - it's a bit less technical)
| anonymousDan wrote:
| Thanks. Is this open source - i.e. can it be used on my
| own cluster outside of databricks?
| dhoe wrote:
| It's an image enhancement measure, if you want. Databricks'
| customers mostly use it as an ETL tool, but it benefits them to
| be perceived as more than that.
| spxneo wrote:
| You can improve your brand for a lot less. I just don't
| understand why they would throw all their chips into a losing
| race.
|
| Azure already runs on-premises if I'm not mistaken, Claude 3
| is out... but DBRX already falls so far behind.
|
| I just don't get it.
| BoorishBears wrote:
| Databricks is trying to go all-in on convincing organizations
| they need to use in-house models, and therefore pay Databricks
| to provide LLMOps.
|
| They're so far into this that their CTO co-authored a
| borderline dishonest study which got a ton of traction last
| summer trying to discredit GPT-4:
| https://arxiv.org/pdf/2307.09009.pdf
| galaxyLogic wrote:
| I can see a business model for in-house LLMs: training a
| model on knowledge about their products and then somehow
| getting that knowledge into a generally available LLM
| platform.
|
| I recently tried to ask Google to explain to me how to delete
| a sender-recorded voice message I had created in WhatsApp. I
| got totally erroneous results back. Maybe that was because it
| is a rather new feature in WhatsApp.
|
| It would be in WhatsApp's interest to get accurate answers
| about it into Google's LLM. So Google might make a deal
| requiring WhatsApp to pay Google for regular updates about
| WhatsApp's current features. WhatsApp's owner, Meta, is of
| course a competitor to Google, so Google may not care much
| about providing up-to-date info about WhatsApp in their LLM.
| But they might if Meta paid them.
| spxneo wrote:
| Businesses are already using Azure GPT-4 on-premises, I
| believe, with good feedback.
|
| DBRX does not compete with GPT4 or even Claude 3.
| BoorishBears wrote:
| Pretraining on internal knowledge will be incredibly
| inefficient for most companies.
|
| Finetuning makes sense for things like embeddings (improve
| RAG by defining domain-specific embeddings) but doesn't do
| anything useful for facts.
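|
| The usual pattern for facts is retrieval at query time: pull
| the relevant document into the prompt instead of trying to
| bake it into the weights. A minimal sketch, where score() is
| a crude lexical stand-in for whatever (possibly finetuned,
| domain-specific) embedding model you would really use:
|
|     # toy retrieval-augmented prompt: facts come from a
|     # document store at query time, not from model weights
|     docs = [
|         "runbook: restart the ingest job with `job restart ingest`",
|         "policy: PII must never leave the EU region",
|     ]
|
|     def score(doc, query):  # stand-in for embedding similarity
|         d, q = set(doc.lower().split()), set(query.lower().split())
|         return len(d & q) / len(q)
|
|     query = "how do I restart the ingest job?"
|     context = max(docs, key=lambda d: score(d, query))
|     prompt = (f"Answer using only this context:\n{context}\n\n"
|               f"Question: {query}")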
| omeze wrote:
| What does borderline dishonest mean? I only read the abstract
| and it seems like such an obvious point that I don't see how
| it's contentious.
| BoorishBears wrote:
| The regression came from poorly parsing the results. I came to
| the conclusion independently, but here's another, more
| detailed takedown: https://www.reddit.com/r/ChatGPT/comment
| s/153xee8/has_chatgp...
|
| Given the conflict of interest and background of Zaharia,
| it's hard to imagine such an immediately obvious source of
| error wasn't caught.
| blitzar wrote:
| An increased valuation at IPO later this year.
| briandw wrote:
| Worse than the chart crime of truncating the y-axis is putting
| Llama 2's HumanEval scores on there and not comparing it to Code
| Llama Instruct 70B. DBRX still beats Code Llama Instruct's 67.8
| but not by that much.
| jjgo wrote:
| > "On HumanEval, DBRX Instruct even surpasses CodeLLaMA-70B
| Instruct, a model built explicitly for programming, despite the
| fact that DBRX Instruct is designed for general-purpose use
| (70.1% vs. 67.8% on HumanEval as reported by Meta in the
| CodeLLaMA blog)."
|
| To be fair, they do compare to it in the main body of the blog.
| It's just probably misleading to compare to CodeLLaMA on
| non-coding benchmarks.
| tartrate wrote:
| Which non-coding benchmark?
| m3kw9 wrote:
| These tiny "state of the art" performance increases really
| indicate that the current LLM architecture (Transformers +
| Mixture of Experts) is maxed out even if you train it
| more/differently. The writing is all over the walls.
| wavemode wrote:
| It would not surprise me if this is what has delayed OpenAI in
| releasing a new model. After more than a year since GPT-4, they
| may have by now produced some mega-trained mega-model, but
| running it is so expensive, and its eval improvement over GPT-4
| so marginal, that releasing it to the public simply makes no
| commercial sense just yet.
|
| They may be working on how to optimize it to reduce cost, or
| re-engineer it to improve evals.
| m3kw9 wrote:
| These "state of the art" LLMs barely eking out a win aren't a
| threat to OpenAI, and they can take their sweet time
| sharpening the sword that will come down hard on these LLMs.
| bboygravity wrote:
| Less than a week after Nancy Pelosi bought a 5M USD stake in
| Databricks, this news is published.
|
| https://twitter.com/PelosiTracker_/status/177119703064106223...
|
| Crime pays in the US.
| laidoffamazon wrote:
| Dude, what the hell are you talking about?
| bboygravity wrote:
| Insider trading by US government employees.
| mrtranscendence wrote:
| Are you alleging that Nancy Pelosi invested in Databricks, a
| private company without a fluctuating share price, because she
| learned that they would soon release a small, fairly middling
| LLM that probably won't move the needle in any meaningful way?
| bboygravity wrote:
| Are you suggesting that Nancy Pelosi, who has consistently
| beaten the market through obvious insider trading for years
| in a row, bought a stake in Databricks without any insider
| info? Possible, yet unlikely, in my opinion.
|
| https://jacobin.com/2021/12/house-speaker-paul-stocks-
| inside...
|
| PS: "without a fluctuating share price" is nonsense. Just
| because the share is of a private company, doesn't mean its
| price can't fluctuate. Why would anybody buy shares in
| private companies if the price couldn't fluctuate? What would
| be the point?
|
| Example of a changing share price of a different (random)
| private company that has many different shareholders over
| time: https://www.cnbc.com/2023/12/13/spacex-value-climbs-
| to-180-b...
| lfmunoz4 wrote:
| I see these types of jokes everywhere. I cannot understand how
| the hints of corruption can be so blatant (i.e. a politician
| consistently beating the market) yet people keep voting for the
| same politician. I don't see how that is possible; it must be
| that these jokes are only on the internet and the mainstream
| media never mentions this.
| bboygravity wrote:
| People are down-voting this because they refuse to believe
| this could be reality.
| jjtheblunt wrote:
| I'd like to know how Nancy Pelosi, who sure as hell doesn't know
| what Apache Spark is, bought $1 million worth (and maybe
| $5 million) of Databricks stock days ago.
|
| https://www.dailymail.co.uk/sciencetech/article-13228859/amp...
| hiddencost wrote:
| You know she has advisors, right?
| PUSH_AX wrote:
| I think the insinuation is insider trading due to the timing,
| advised or not.
| jjtheblunt wrote:
| Ignoring the snark: Obviously.
|
| SEC put Martha Stewart in jail for following her advisor, and
| that was for about $45,000.
| samatman wrote:
| If someone "advises" you that a company is about to do
| something major, and this isn't public information, and you
| take action on the stock market accordingly, that's insider
| trading.
| BryantD wrote:
| I don't have any interest in defending Pelosi's stock trades,
| and I agree that sitting members of Congress should not be
| trading stocks.
|
| That said, this report seems inaccurate to me. Pelosi put
| between 1 and 5 million dollars into Forge Investments, which
| is a vehicle for investing in pre-IPO companies, as I
| understand it. Databricks is one of those, but so are OpenAI,
| Hugging Face, Anthropic, and Humane. If I wanted to invest in
| pre-IPO AI
| companies it seems like a very natural choice and I don't think
| we need insider trading to explain it.
|
| It's also the case that the report she filed calls out
| Databricks stock, which is perhaps an indication that she was
| particularly interested in that. Stronger reporting would tell
| us how often she's invested in Forge, if this is the first
| time, and so on. One other possible explanation is that she was
| investing ahead of the Humane Pin shipping and wanted to pull
| attention away from it, for example.
| zopper wrote:
| Interesting that they haven't released DBRX MoE-A and B. For many
| use cases, smaller models are sufficient. Wonder why that is?
| ianbutler wrote:
| The approval process on the base model is not feeling very open.
| Plenty of people are still waiting on a chance to download it,
| whereas the instruct model was an instant approval. The base
| model is more interesting to me for finetuning.
| blueblimp wrote:
| The license allows you to reproduce/distribute/copy, so I'm a
| little surprised there's an approval process at all.
| ec109685 wrote:
| For coding evals, it seems like unless you are super careful,
| they can be polluted by the training data.
|
| Are there standard ways to avoid that type of score inflation?
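|
| For reference, the usual (imperfect) mitigation is n-gram
| decontamination: drop or flag any training document that
| shares a long n-gram with a benchmark problem. Exact
| thresholds and tokenization vary by lab; the 13-word overlap
| below is just an assumption:
|
|     def ngrams(text, n=13):
|         toks = text.split()
|         return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}
|
|     def decontaminate(train_docs, benchmark_docs, n=13):
|         bench = set().union(*(ngrams(d, n) for d in benchmark_docs))
|         return [d for d in train_docs if not (ngrams(d, n) & bench)]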
| brucethemoose2 wrote:
| I would note the actual leading models right now (IMO) are:
|
| - Miqu 70B (General Chat)
|
| - Deepseek 33B (Coding)
|
| - Yi 34B (for chat over 32K context)
|
| And of course, there are finetunes of all these.
|
| And there are some others in the 34B-70B range I have not tried
| (and some I have tried, like Qwen, which I was not impressed
| with).
|
| Point being that Llama 70B, Mixtral, and Grok as seen in the
| charts are not what I would call SOTA (though Mixtral is
| excellent for batch-size-1 speed).
| jph00 wrote:
| Miqu is a leaked model -- no license is provided to use it. Yi
| 34B doesn't allow commercial use. Deepseek 33B isn't much good
| at stuff outside of coding.
|
| So it's fair to say that DBRX is the leading general purpose
| model that can be used commercially.
| blueblimp wrote:
| Qwen1.5-72B-Chat is dominant in the Chatbot Arena leaderboard,
| though. (Miqu isn't on there due to being bootleg, but Qwen
| outranks Mistral Medium.)
| belter wrote:
| For all the model cards and license notices, I find it
| interesting there is not much information on the contents of
| the dataset used for training. Specifically, whether it contains
| data subject to copyright restrictions. Or did I miss that?
| bg24 wrote:
| "Looking holistically, our end-to-end LLM pretraining pipeline
| has become nearly 4x more compute-efficient in the past ten
| months."
|
| I did not fully understand the technical details in the training
| efficiency section, but love this. Cost of training is
| outrageously high, and hopefully it will start to follow Moore's
| law.
| airocker wrote:
| Is this also going to be the ticker symbol when they IPO?
| underlines wrote:
| Waiting for mixed quantization with HQQ and MoE offloading [1].
| With that I was able to run Mixtral 8x7B on my 10 GB VRAM
| RTX 3080... This should work for DBRX too and should shave off a
| ton of the VRAM requirement.
|
| 1. https://github.com/dvmazur/mixtral-offloading?tab=readme-
| ov-...
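|
| The offloading trick is basically an LRU cache of experts on
| the GPU, with the rest parked in (quantized) CPU RAM. A toy
| sketch of the idea, not the linked repo's actual API, and
| with quantization left out:
|
|     import copy
|     from collections import OrderedDict
|     import torch
|     import torch.nn as nn
|
|     class ExpertCache:
|         """Keep at most max_on_gpu experts on the GPU; page others in."""
|         def __init__(self, experts, max_on_gpu=2, device="cuda"):
|             self.cpu_experts = nn.ModuleList(experts)  # full set in CPU RAM
|             self.max_on_gpu, self.device = max_on_gpu, device
|             self.gpu = OrderedDict()  # expert index -> GPU copy, LRU order
|
|         def get(self, idx):
|             if idx not in self.gpu:
|                 if len(self.gpu) >= self.max_on_gpu:
|                     self.gpu.popitem(last=False)  # evict least recently used
|                 self.gpu[idx] = copy.deepcopy(self.cpu_experts[idx]).to(self.device)
|             self.gpu.move_to_end(idx)  # mark as most recently used
|             return self.gpu[idx]
|
|     dev = "cuda" if torch.cuda.is_available() else "cpu"
|     cache = ExpertCache([nn.Linear(512, 512) for _ in range(8)], device=dev)
|     y = cache.get(3)(torch.randn(1, 512).to(dev))  # miss: expert 3 copied to GPU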
___________________________________________________________________
(page generated 2024-03-27 23:00 UTC)