[HN Gopher] DBRX: A new open LLM
       ___________________________________________________________________
        
       DBRX: A new open LLM
        
       Author : jasondavies
       Score  : 561 points
       Date   : 2024-03-27 12:23 UTC (10 hours ago)
        
 (HTM) web link (www.databricks.com)
 (TXT) w3m dump (www.databricks.com)
        
       | shnkr wrote:
        | GenAI novice here. What is the training data made of, and how
        | is it collected? I guess no one will share details on it;
        | otherwise it would make a good technical blog post with lots
        | of insights!
       | 
       | >At Databricks, we believe that every enterprise should have the
       | ability to control its data and its destiny in the emerging world
       | of GenAI.
       | 
       | >The main process of building DBRX - including pretraining, post-
       | training, evaluation, red-teaming, and refining - took place over
       | the course of three months.
        
         | simonw wrote:
         | The most detailed answer to that I've seen is the original
         | LLaMA paper, which described exactly what that model was
         | trained on (including lots of scraped copyrighted data)
         | https://arxiv.org/abs/2302.13971
         | 
         | Llama 2 was much more opaque about the training data,
         | presumably because they were already being sued at that point
         | (by Sarah Silverman!) over the training data that went into the
         | first Llama!
         | 
         | A couple of things I've written about this:
         | 
         | - https://simonwillison.net/2023/Aug/27/wordcamp-llms/#how-
         | the...
         | 
         | - https://simonwillison.net/2023/Apr/17/redpajama-data/
        
           | shnkr wrote:
           | my question was specific to databricks model. If it followed
           | llama or openai, they could add a line or two about it ..
           | make the blog complete.
        
             | comp_raccoon wrote:
             | they have a technical report coming! knowing the team, they
             | will do a great job disclosing as much as possible.
        
           | ssgodderidge wrote:
           | Wow, that paper was super useful. Thanks for sharing. Page 2
           | is where it shows the breakdown of all of the data sources,
           | including % of dataset and the total disk sizes.
        
         | tempusalaria wrote:
         | The training data is pretty much anything you can read on the
         | internet plus books.
         | 
         | This is then cleaned up to remove nonsense, some technical
         | files, and repeated files.
         | 
         | From this, they tend to weight some sources more - e.g.
         | Wikipedia gets a pretty high weighting in the data mix. Overall
         | these data mixes have multiple trillion token counts.
         | 
          | GPT-4 apparently trained on multiple epochs of the same data
          | mix, so I would assume this one did too, as it's a similar
          | token count.
        
           | sanxiyn wrote:
           | https://arxiv.org/abs/2305.10429 found that people are
           | overweighting Wikipedia and downweighting Wikipedia improves
           | things across the board INCLUDING PREDICTING NEXT TOKEN ON
           | WIKIPEDIA, which is frankly amazing.
        
         | IshanMi wrote:
         | Personally, I found looking at open source work to be much more
         | instructive in learning about AI and how things like training
         | data and such are done from the ground up. I suspect this is
         | because training data is one of the bigger moats an AI company
         | can have, as well as all the class action lawsuits surrounding
         | training data.
         | 
          | One of the best open-source datasets freely available is The
          | Pile by EleutherAI [1]. It's a few years old now
         | (~2020), but they did some really diligent work in putting
         | together the dataset and documenting it. A more recent and even
         | larger dataset would be the Falcon-RefinedWeb dataset [2].
         | 
         | [1]: https://arxiv.org/abs/2101.00027 [2]:
         | https://arxiv.org/abs/2306.01116
        
       | djoldman wrote:
       | Model card for base: https://huggingface.co/databricks/dbrx-base
       | 
       | > The model requires ~264GB of RAM
       | 
       | I'm wondering when everyone will transition from tracking
       | parameter count vs evaluation metric to (total gpu RAM + total
       | CPU RAM) vs evaluation metric.
       | 
       | For example, a 7B parameter model using float32s will almost
       | certainly outperform a 7B model using float4s.
       | 
        | Additionally, all the examples of quantizing recently released
        | superior models to fit on one GPU don't mean the quantized
        | model is a "win." The quantized model is a different model; you
        | need to rerun the metrics.
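        | 
        | A rough back-of-the-envelope sketch of weights-only memory
        | (ignoring KV cache and runtime overhead; note that 132B params
        | at 16 bits per weight is ~264GB, matching the model card):
        | 
        |   def weight_memory_gb(n_params, bits_per_weight):
        |       # bytes = params * bits / 8, reported in GB
        |       return n_params * bits_per_weight / 8 / 1e9
        | 
        |   for bits in (32, 16, 8, 4):
        |       print(bits, weight_memory_gb(7e9, bits),
        |             weight_memory_gb(132e9, bits))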
        
         | vlovich123 wrote:
          | I thought float4 sacrificed a negligible amount of evaluation
          | quality for an 8x reduction in RAM?
        
           | Y_Y wrote:
           | A free lunch? Wouldn't that be nice! Sometimes the
           | quantization process improves the accuracy a little (probably
           | by implicit regularization) but a model that's at or near
           | capacity (as it should be) is necessarily hurt by throwing
           | away most of the information. Language models often quantize
           | well to small fixed-point types like int4, but it's not a
           | magic wand.
        
             | K0balt wrote:
             | I find that q6 and 5+ are subjectively as good as raw
             | tensor files. 4 bit quality reduction is very detectable
             | though. Of course there must be a loss of information, but
             | perhaps there is a noise floor or something like that.
        
               | Taek wrote:
                | At what parameter count? It's been established that
                | quantization has less of an effect on larger models. By
                | the time you are at 70B, quantization to 4 bits is
                | basically negligible.
        
             | vlovich123 wrote:
             | I didn't suggest a free lunch, just that the 8x reduction
             | in RAM (+ faster processing) does not result in an 8x
              | growth in the error. Thus a quantized model will outperform
              | a non-quantized one on an evaluation/RAM metric.
        
               | Y_Y wrote:
               | That's not a good metric.
        
               | omeze wrote:
                | Many applications don't want to host inference in the
                | cloud and would ideally run things locally. Hardware
                | constraints are clearly important.
                | 
                | I'd actually say it's the most important metric for most
                | open models now, since the price per performance of
                | closed cloud models is so competitive with open cloud
                | models, so edge inference that is competitive is a clear
                | value add.
        
             | underlines wrote:
              | This paper finds some evidence to the contrary:
              | https://arxiv.org/abs/2403.17887
        
           | Taek wrote:
           | For smaller models, the quality drop is meaningful. For
           | larger ones like this one, the quality drop is negligible.
        
         | swalsh wrote:
         | > The model requires ~264GB of RAM
         | 
         | This feels as crazy as Grok. Was there a generation of models
          | recently where we decided to just crank up the parameter count?
        
           | wrs wrote:
           | Isn't that pretty much the last 12 months?
        
           | Jackson__ wrote:
            | If you read their blog post, they mention it was pretrained
            | on 12 trillion tokens of text. That is ~5x the amount of the
            | Llama 2 training runs.
           | 
           | From that, it seems somewhat likely we've hit the wall on
           | improving <X B parameter LLMs by simply scaling up the
           | training data, which basically forces everyone to continue
           | scaling up if they want to keep up with SOTA.
        
           | breezeTrowel wrote:
           | Cranking up the parameter count is literally how the current
           | LLM craze got started. Hence the "large" in "large language
           | model".
        
           | espadrine wrote:
            | Not recently. GPT-3 from 2020 required even more RAM; the
           | open-source BLOOM from 2022 did too.
           | 
           | In my view, the main value of larger models is distillation
           | (which we particularly witness, for instance, with how Claude
           | Haiku matches release-day GPT-4 despite being less than a
           | tenth of the cost). Hopefully the distilled models will be
           | easier to run.
        
         | ml_hardware wrote:
         | Looks like someone has got DBRX running on an M2 Ultra already:
         | https://x.com/awnihannun/status/1773024954667184196?s=20
        
           | Mandelmus wrote:
           | And it appears to be at ~80 GB of RAM via quantisation.
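            | 
            | Rough arithmetic, assuming a ~4.5 bit-per-weight quant
            | (4-bit weights plus scales):
            | 
            |   132e9 params * 4.5 bits / 8 bits per byte ~= 74 GB
            | 
            | plus KV cache and runtime overhead, which lands in the
            | ~80 GB ballpark.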
        
             | dheera wrote:
             | That's a tricky number. Does it run on an 80GB GPU, does it
             | auto-shave some parameters to fit in 79.99GB like any
              | artificially "intelligent" piece of code would do, or does
             | it give up like an unintelligent piece of code?
        
               | declaredapple wrote:
               | What?
               | 
               | Are you asking if the framework automatically
               | quantizes/prunes the model on the fly?
               | 
               | Or are you suggesting the LLM itself should realize it's
               | too big to run, and prune/quantize itself? Your
               | references to "intelligent" almost leads me to the
               | conclusion that you think the LLM should prune itself.
                | Not only is this a chicken-and-egg problem, but LLMs are
                | statistical models; they aren't inherently
                | self-bootstrapping.
        
               | dheera wrote:
                | I realize that, but I do think it's doable to bootstrap
                | it on a cluster and have it teach itself to self-prune,
                | and I'm surprised nobody is actively working on this.
               | 
               | I hate software that complains (about dependencies,
               | resources) when you try to run it and I think that should
               | be one of the first use cases for LLMs to get L5
               | autonomous software installation and execution.
        
             | smcleod wrote:
              | So that would be runnable on an MBP with an M2 Max, but the
              | context window must be quite small; I don't really find
              | anything under about 4096 tokens that useful.
        
           | madiator wrote:
           | That's great, but it did not really write the program that
           | the human asked it to do. :)
        
             | SparkyMcUnicorn wrote:
             | That's because it's the base model, not the instruct tuned
             | one.
        
           | resource_waste wrote:
            | I find calling 500 tokens 'running' a bit of a stretch.
           | 
           | Cool to play with for a few tests, but I can't imagine using
           | it for anything.
        
         | dvt wrote:
         | > a 7B parameter model using float32s will almost certainly
         | outperform a 7B model using float4s
         | 
         | Q5 quantization performs _almost_ on par with base models.
          | Obviously there's some loss there, but this indicates that
         | there's still a lot of compression that we're missing.
        
         | dheera wrote:
         | I'm more wondering when we'll have algorithms that will "do
         | their best" given the resources they detect.
         | 
         | That would be what I call artificial intelligence.
         | 
         | Giving up because "out of memory" is not intelligence.
        
           | visarga wrote:
           | No but some model serving tools like llama.cpp do their best.
           | It's just a matter of choosing the right serving tools. And I
           | am not sure LLMs could not optimize their memory layout. Why
           | not? Just let them play with this and learn. You can do
           | pretty amazing things with evolutionary methods where the
           | LLMs are the mutation operator. You evolve a population of
           | solutions. (https://arxiv.org/abs/2206.08896)
        
           | falcor84 wrote:
           | I suppose you could simulate dementia by loading as much of
           | the weights as space permits and then just stopping. Then
           | during inference, replace the missing weights with calls to
           | random(). I'd actually be interested in seeing the results.
        
           | coldtea wrote:
           | > _Giving up because "out of memory" is not intelligence._
           | 
           | When people can't remember the facts/theory/formulas needed
           | to answer some test question, or can't memorize some
           | complicated information because it's too much, they usually
           | give up too.
           | 
           | So, giving up because of "out of memory" sure sounds like
           | intelligence to me.
        
       | kurtbuilds wrote:
       | What's the process to deliver and test a quantized version of
       | this model?
       | 
        | This model is 264GB, so it can only be deployed in server
        | settings.
        | 
        | Quantized Mixtral at 24GB is just small enough that it can run
        | on premium consumer hardware (i.e. 64GB RAM).
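        | 
        | One common path is to quantize at load time and then rerun
        | your evals on the result; a sketch, assuming the Hugging Face
        | checkpoint plus bitsandbytes 4-bit NF4 (the prompt is just an
        | example):
        | 
        |   from transformers import (
        |       AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig)
        | 
        |   model_id = "databricks/dbrx-base"
        |   tok = AutoTokenizer.from_pretrained(
        |       model_id, trust_remote_code=True)
        |   model = AutoModelForCausalLM.from_pretrained(
        |       model_id,
        |       quantization_config=BitsAndBytesConfig(
        |           load_in_4bit=True, bnb_4bit_quant_type="nf4"),
        |       device_map="auto",
        |       trust_remote_code=True,
        |   )
        | 
        |   ids = tok("def fib(n):", return_tensors="pt")
        |   ids = ids.to(model.device)
        |   out = model.generate(**ids, max_new_tokens=64)
        |   print(tok.decode(out[0]))
        | 
        | Then you point the same eval harness at the quantized model
        | that you would at the full-precision one.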
        
       | viktour19 wrote:
       | It's great how we went from "wait.. this model is too powerful to
        | open source" to everyone trying to shove their 1% improved
        | model down the throats of developers.
        
         | Icko wrote:
         | I'm 90% certain that OpenAI has some much beefier model they
         | are not releasing - remember the Q* rumour?
        
         | brainless wrote:
          | I feel quite the opposite. Improvements, even tiny ones, are
         | great. But what's more important is that more companies release
         | under open license.
         | 
         | Training models isn't cheap. Individuals can't easily do this,
         | unlike software development. So we need companies to do this
         | for the foreseeable future.
        
         | blitzar wrote:
         | Got to justify pitch deck or stonk price. Publish or perish
         | without a yacht.
        
         | toddmorey wrote:
         | People are building and releasing models. There's active
         | research in the space. I think that's great! The attitude I've
         | seen in open models is "use this if it works for you" vs any
         | attempt to coerce usage of a particular model.
         | 
         | To me that's what closed source companies (MSFT, Google) are
         | doing as they try to force AI assistants into every corner of
         | their product. (If LinkedIn tries one more time to push their
         | crappy AI upgrade, I'm going to scream...)
        
       | XCSme wrote:
       | I am planning to buy a new GPU.
       | 
       | If the GPU has 16GB of VRAM, and the model is 70GB, can it still
       | run well? Also, does it run considerably better than on a GPU
       | with 12GB of VRAM?
       | 
       | I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti,
       | but the 24.6GB version is a bit slow (still usable, but has a
       | noticeable start-up time).
        
         | jasonjmcghee wrote:
         | > mixtral works well
         | 
         | Do you mean mistral?
         | 
         | mixtral is 8x7B and requires like 100GB of RAM
         | 
         | Edit: (without quant as others have pointed out) can definitely
         | be lower, but haven't heard of a 3.4GB version
        
           | kwerk wrote:
           | I have two 3090s and it runs fine with `ollama run mixtral`.
           | Although OP definitely meant mistral with the 7B note
        
             | jsight wrote:
             | ollama run mixtral will default to the quantized version
             | (4bit IIRC). I'd guess this is why it can fit with two
             | 3090s.
        
           | K0balt wrote:
           | I run mixtral 6 bit quant very happily on my MacBook with 64
           | gb.
        
           | ranger_danger wrote:
           | I'm using mixtral-8x7b-v0.1.Q4_K_M.gguf with llama.cpp and it
           | only requires 25GB.
        
           | chpatrick wrote:
           | The quantized one works fine on my 24GB 3090.
        
           | XCSme wrote:
           | Sorry, it was from memory.
           | 
            | I have these models in Ollama:
            | 
            | dolphin-mixtral:latest (24.6GB), mistral:latest (3.8GB)
        
           | XCSme wrote:
           | I have 128GB, but something is weird with Ollama. Even though
           | for the Ollama Docker I only allow 90GB, it ends up using
            | 128GB/128GB, so the system becomes very slow (mouse freezes).
        
             | InitEnabler wrote:
             | What docker flags are you running?
        
           | Havoc wrote:
            | The smaller quants still require a 24GB card. 16GB might
            | work, but I doubt it.
        
         | llm_trw wrote:
         | >If the GPU has 16GB of VRAM, and the model is 70GB, can it
         | still run well? Also, does it run considerably better than on a
         | GPU with 12GB of VRAM?
         | 
         | No, it can't run at all.
         | 
         | >I run Ollama locally, mixtral works well (7B, 3.4GB) on a
         | 1080ti, but the 24.6GB version is a bit slow (still usable, but
         | has a noticeable start-up time).
         | 
         | That is not mixtral, that is mistral 7b. The 1080ti is slower
         | than running inference on current generation threadripper cpus.
        
           | XCSme wrote:
           | I have those:
           | 
            | dolphin-mixtral:latest (24.6GB), mistral:latest (3.8GB)
           | 
           | The CPU is 5900x.
        
           | XCSme wrote:
           | > No, it can't run at all.
           | 
           | https://s3.amazonaws.com/i.snag.gy/ae82Ym.jpg
           | 
            | EDIT: This was run on a 1080ti + 5900x. Initial generation
            | takes around 10-30 seconds (like it has to upload the model
            | to the GPU), but then it starts answering immediately, at
            | around 3 words per second.
        
             | wokwokwok wrote:
             | Did you check your GPU utilization?
             | 
             | Typically when it runs that way it runs on the CPU, not the
             | GPU.
             | 
             | Are you sure you're actually offloading any work to the
             | GPU?
             | 
             | At least with llama.cpp, there is no 'partially put a
             | layer' into the GPU. Either you do, or you don't. You pick
             | the number of layers. If the model is too big, the layers
             | won't fit and it can't run at all.
             | 
              | The llama.cpp `main` executable will tell you in its debug
              | information when you use the -ngl flag; see https://github.
              | com/ggerganov/llama.cpp/blob/master/examples/...
              | 
              | It's also possible you're running (e.g. if you're using
              | ollama) a quantized version of the model, which reduces
              | the memory requirements and quality of the model outputs.
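              | 
              | A minimal sketch of that knob via llama-cpp-python (the
              | GGUF file name is just the one mentioned upthread; adjust
              | n_gpu_layers to whatever actually fits your VRAM):
              | 
              |   # n_gpu_layers is the -ngl equivalent: whole layers go
              |   # to VRAM, the rest stay on the CPU. 0 = pure CPU.
              |   from llama_cpp import Llama
              | 
              |   llm = Llama(
              |       model_path="mixtral-8x7b-v0.1.Q4_K_M.gguf",
              |       n_gpu_layers=20,
              |       n_ctx=4096,
              |   )
              |   out = llm("Q: What is 2+2? A:", max_tokens=8)
              |   print(out["choices"][0]["text"])
              | 
              | If the debug log shows 0 layers offloaded (or VRAM usage
              | stays flat), you're effectively doing CPU inference.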
        
               | XCSme wrote:
               | I have to check, something does indeed seem weird,
               | especially with the PC freezing like that. Maybe it runs
               | on the CPU.
               | 
                | > quantized version
                | 
                | Yes, it is 4-bit quantized, but it's still 24.6GB.
        
             | spxneo wrote:
              | This is some new flex for debating online: copying and
              | pasting the other side's argument and waiting for your
              | local LLM to explain why they are wrong.
              | 
              | How much is your hardware worth at today's value? What are
              | the specs? That is impressive even though it's 3 words per
              | second. If you want to bump it up to 30, do you then 10x
              | your current hardware cost?
        
               | XCSme wrote:
                | That question was just an example (lorem ipsum); it was
                | easy to copy-paste to demo the local LLM. I didn't intend
                | to provide more context to the discussion.
               | 
               | I ordered a 2nd 3090, which has 24GB VRAM. Funny how it
               | was $2.6k 3 years ago and now is $600.
               | 
                | You can probably build a decent local AI machine for
                | around $1000.
        
               | spxneo wrote:
               | https://howmuch.one/product/average-nvidia-geforce-
                | rtx-3090-... You are right, there is a huge drop in
                | price.
        
             | llm_trw wrote:
             | Congratulations on using CPU inference.
        
         | PheonixPharts wrote:
         | While GPUs are still the kings of speed, if you are worried
         | about VRAM I do recommend a maxed out Mac Studio.
         | 
         | Llama.cpp + quantized models on Apple Silicon is an incredible
         | experience, and having 192 GB of unified memory to work with
         | means you can run models that just aren't feasible on a home
         | GPU setup.
         | 
         | It really boils down to what type of local development you want
         | to do. I'm mostly experimenting with things where the time to
         | response isn't _that_ big of a deal, and not fine-tuning the
         | models locally (which I also believe GPUs are still superior
         | for). But if your concern is  "how big of a model can I run" vs
         | "Can I have close to real time chat", the unified memory
         | approach is superior.
        
           | XCSme wrote:
           | I already have 128GB of RAM (DDR4), and was wondering if
           | upgrading from a 1080ti (12GB) to a 4070ti super (16GB),
           | would make a big difference.
           | 
           | I assume the FP32 and FP16 operations are already a huge
           | improvement, but also the 33% increased VRAM might lead to
           | fewer swaps between VRAM and RAM.
        
             | zozbot234 wrote:
             | That's system memory, not unified memory. Unified means
             | that all or most of it is going to be directly available to
             | the Apple Silicon GPU.
        
               | giancarlostoro wrote:
               | This is the key factor here. I have a 3080, with 16GB of
               | Memory, but still have to run some models on CPU since
               | the memory is not unified at all.
        
             | loudmax wrote:
             | I have an RTX 3080 with 10GB of VRAM. I'm able to run
             | models larger than 10GB using llama.cpp and offloading to
             | the GPU as much as can fit into VRAM. The remainder of the
             | model runs on CPU + regular RAM.
             | 
             | The `nvtop` command displays a nice graph of how much GPU
             | processing and VRAM is being consumed. When I run a model
             | that fits entirely into VRAM, say Mistral 7B, nvtop shows
             | the GPU processing running at full tilt. When I run a model
             | bigger than 10GB, say Mixtral or Llama 70B with GPU
             | offloading, my CPU will run full tilt and the VRAM is full,
             | but the GPU processor itself will operate far below full
             | capacity.
             | 
             | I think what is happening here is that the model layers
             | that are offloaded to the GPU do their processing, then the
             | GPU spends most of the time waiting for the much slower CPU
             | to do its thing. So in my case, I think upgrading to a
             | faster GPU would make little to no difference when running
             | the bigger models, so long as the VRAM is capped at the
             | same level. But upgrading to a GPU with more VRAM, even a
             | slower GPU, should make the overall speed faster for bigger
             | models because the GPU would spend less time waiting for
             | the CPU. (Of course, models that fit entirely into VRAM
             | will run faster on a faster GPU).
             | 
             | In my case, the amount of VRAM absolutely seems to be the
             | performance bottleneck. If I do upgrade, it will be for a
             | GPU with more VRAM, not necessarily a GPU with more
             | processing power. That has been my experience running
             | llama.cpp. YMMV.
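              | 
              | A rough rule of thumb for this memory-bound regime: every
              | decoded token streams the active weights once, so tokens
              | per second is roughly bandwidth divided by active model
              | bytes (ballpark figures below: ~40GB for a 4-bit 70B
              | model, ~50 GB/s for dual-channel DDR4, 936 GB/s for a
              | 3090; MoE models touch fewer bytes per token, so they
              | come out faster):
              | 
              |   def rough_tok_per_s(active_gb, bandwidth_gb_s):
              |       # upper bound; real numbers come in lower
              |       return bandwidth_gb_s / active_gb
              | 
              |   print(rough_tok_per_s(40, 50))   # ~1 tok/s (DDR4)
              |   print(rough_tok_per_s(40, 936))  # ~23 tok/s (3090)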
        
               | htrp wrote:
               | How's your performance on the 70b parameter llama series?
               | 
               | Any good writeups of the offloading that you found?
        
               | loudmax wrote:
               | Performance of 70b models is like 1 token every few
               | seconds. And that's fitting the whole model into system
               | RAM, not swap. It's interesting because some of the
               | larger models are quite good, but too annoyingly slow to
               | be practical for most use cases.
               | 
               | The Mixtral models run surprisingly well. They can run
               | better than 1 token per second, depending on
               | quantization. Still slow, but approaching a more
               | practical level of usefulness.
               | 
               | Though if you're planning on accomplishing real work with
               | LLMs, the practical solution for most people is probably
               | to rent a GPU in the cloud.
        
           | bee_rider wrote:
            | I know the M?-Pro and Ultra variants are multiple standard
            | M?'s in a single package. But do the CPUs and GPUs share a
            | die (i.e. a single die comes with something like a 4 P-core
            | CPU and 10 GPU cores, and the more exotic variants are just
            | a result of LEGO-ing those dies together and disabling some
            | cores for market segmentation or because they had defects)?
            | 
            | I guess I'm wondering if they technically could throw down
            | the gauntlet and compete with Nvidia by doing something
           | like a 4 CPU/80 GPU/256 GB chip, if they wanted to. Seems
           | like it'd be a really appealing ML machine. (I could also see
           | it being technically possible but Apple just deciding that's
           | pointlessly niche for them).
        
             | astrange wrote:
             | Ultra is the only one that's made from two smaller SoCs.
        
           | bevekspldnw wrote:
            | I had gone the Mac Studio route initially, but I ended up
            | getting an A6000 for about the same price as a Mac and
            | putting that in a Linux server under my desk. Ollama makes it
            | dead simple to serve it over my local network, so I can be on
            | my M1 Air using it no differently than if it were running on
            | my laptop. The difference is that the A6000 absolutely smokes
            | the Mac.
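            | 
            | For anyone curious, "serving it over the network" really is
            | just an HTTP call; a minimal sketch (the hostname is made
            | up, /api/generate is Ollama's standard endpoint):
            | 
            |   import requests
            | 
            |   r = requests.post(
            |       "http://gpu-box.local:11434/api/generate",
            |       json={"model": "mistral",
            |             "prompt": "Classify: 'invoice overdue'",
            |             "stream": False},
            |       timeout=120,
            |   )
            |   print(r.json()["response"])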
        
             | starik36 wrote:
             | Wow, that is a lot of money ($4400 on Amazon) to throw at
             | this problem. I am curious, what was the purpose that
             | compelled you to spend this (for the home network, I
              | assume) amount of money?
        
               | bevekspldnw wrote:
               | Large scale document classification tasks in very
               | ambiguous contexts. A lot of my work goes into using big
               | models to generate training data for smaller models.
               | 
                | I have multiple millions of documents, so GPT is cost
                | prohibitive, and too slow. My tool of choice tends to be
                | a first pass with Mistral to check task performance, and
                | if that's lacking, Mixtral.
                | 
                | Often I find that with a good prompt Mistral will work as
                | well as Mixtral and is about 10x faster.
               | 
               | I'm on my "home" network, but it's a "home office" for my
               | startup.
        
             | c1b wrote:
             | > The difference is that the A6000 absolutely smokes the
             | Mac.
             | 
             | Memory Bandwidth : Mac Studio wins (about the same @ ~800)
             | 
             | VRAM : Mac Studio wins (4x more)
             | 
             | TFLOPs: A6000 wins (32 vs 38)
        
               | bevekspldnw wrote:
               | VRAM in excess of the model one is using isn't useful per
               | se. My use cases require high throughput, and on many
               | tasks the A6000 executes inference at 2x speed.
        
           | purpleblue wrote:
           | Aren't the Macs good for inference but not for training or
           | fine tuning?
        
           | spxneo wrote:
           | Aren't quantized models different models outright requiring a
           | new evaluation to know the deviation in performance? Or are
           | they "good enough" in that the benefits outweigh the
           | deviation?
           | 
           | I'm on the fence about whether to spend 5 digits or 4 digits.
           | Do I go the Mac Studio route or GPUs? What are the pros and
           | cons?
        
           | brandall10 wrote:
           | Wait for the M3 Ultra and it will be 256GB and markedly
           | faster.
        
         | lxe wrote:
         | Get 2 pre-owned 3090s. You will easily be able to run 70b or
         | even 120b quantized models.
        
       | natsucks wrote:
       | it's twice the size of mixtral and barely beats it.
        
         | mochomocha wrote:
         | It's a MoE model, so it offers a different memory/compute
         | latency trade-off than standard dense models. Quoting the blog
         | post:
         | 
         | > DBRX uses only 36 billion parameters at any given time. But
         | the model itself is 132 billion parameters, letting you have
         | your cake and eat it too in terms of speed (tokens/second) vs
         | performance (quality).
        
           | hexomancer wrote:
            | Mixtral is also a MoE model, hence the name: _mix_tral.
        
             | sangnoir wrote:
              | Despite both being MoEs, the architectures are different.
              | DBRX has double the number of experts in the pool (16 vs 8
              | for Mixtral), and doubles the active experts (4 vs 2).
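              | 
              | A toy sketch of that routing pattern (illustrative only,
              | not DBRX's actual implementation): a router picks the
              | top-4 of 16 experts per token and mixes their outputs.
              | 
              |   import torch, torch.nn as nn
              | 
              |   class TopKMoE(nn.Module):
              |       def __init__(self, d=64, n_experts=16, k=4):
              |           super().__init__()
              |           self.router = nn.Linear(d, n_experts)
              |           self.experts = nn.ModuleList(
              |               nn.Linear(d, d) for _ in range(n_experts))
              |           self.k = k
              | 
              |       def forward(self, x):  # x: [tokens, d]
              |           w, idx = self.router(x).topk(self.k, dim=-1)
              |           w = w.softmax(dim=-1)
              |           out = torch.zeros_like(x)
              |           # only k of the n_experts run for each token
              |           for slot in range(self.k):
              |               e = idx[:, slot]
              |               for j in e.unique().tolist():
              |                   m = e == j
              |                   y = self.experts[j](x[m])
              |                   out[m] += w[m, slot, None] * y
              |           return out
              | 
              |   print(TopKMoE()(torch.randn(8, 64)).shape)  # [8, 64]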
        
       | ingenieroariel wrote:
        | TLDR: A model that could be described as "3.8 level", good at
        | math, and openly available with a custom license.
        | 
        | It is as fast as a 34B model, but uses as much memory as a 132B
        | model. A mixture of 16 experts that activates 4 at a time, so it
        | has more chances to get the combo just right than Mixtral (8
        | with 2 active).
       | 
       | For my personal use case (a top of the line Mac Studio) it looks
       | like the perfect size to replace GPT-4 turbo for programming
       | tasks. What we should look out for is people using them for real
       | world programming tasks (instead of benchmarks) and reporting
       | back.
        
         | sp332 wrote:
         | What does 3.8 level mean?
        
           | ingenieroariel wrote:
           | My interpretation:
           | 
            | - Worst case: as good as 3.5
            | 
            | - Common case: way better than 3.5
            | 
            | - Best case: as good as 4.0
        
           | ljlolel wrote:
           | Gpt-3.5 and gpt-4
        
       | hanniabu wrote:
       | What's a good model to help with medical research? Is there
        | anything trained on just research journals, like NIH studies?
        
         | najarvg wrote:
         | Look for Biomistral 7B, PMC-LLAMA 7B and even Meditron. I
          | believe you should find all those papers on arXiv.
        
       | hn_acker wrote:
       | Even though the README.md calls the license the Databricks Open
       | Source License, the LICENSE file includes paragraphs such as
       | 
       | > You will not use DBRX or DBRX Derivatives or any Output to
       | improve any other large language model (excluding DBRX or DBRX
       | Derivatives).
       | 
       | and
       | 
       | > If, on the DBRX version release date, the monthly active users
       | of the products or services made available by or for Licensee, or
       | Licensee's affiliates, is greater than 700 million monthly active
       | users in the preceding calendar month, you must request a license
       | from Databricks, which we may grant to you in our sole
       | discretion, and you are not authorized to exercise any of the
       | rights under this Agreement unless or until Databricks otherwise
       | expressly grants you such rights.
       | 
       | This is a source-available model, not an open model.
        
         | yunohn wrote:
         | The first clause sucks, but I'm perfectly happy with the second
         | one.
        
         | CharlesW wrote:
         | > _This is a source-available model, not an open model._
         | 
         | To me, "source available" implies that everything you need to
         | reproduce the model is also available, and that doesn't appear
         | to be the case. How is the resulting model more "free as in
         | freedom" than a compiled binary?
        
           | occamrazor wrote:
           | I like:
           | 
           | - "open weights" for no training data and no restrictions on
           | use,
           | 
           | - "weights available" for no training data and restrictions
           | on use, like in this case.
        
           | Spivak wrote:
           | I don't think it's possible to have an "open training data"
           | model because it would get DMCA'd immediately and open you up
           | to lawsuits from everyone who found their works in the
           | training set.
           | 
           | I hope we can fix the legal landscape to enable publicly
           | sharing training data but I can't really judge the companies
           | keeping it a secret today.
        
             | CharlesW wrote:
             | > _I don 't think it's possible to have an "open training
             | data" model because it would get DMCA'd immediately..._
             | 
             | This isn't a problem because OpenAI says, "training AI
             | models using publicly available internet materials is fair
             | use". /s
             | 
             | https://openai.com/blog/openai-and-journalism
        
               | Spivak wrote:
                | I don't think it's that crazy. Even if you're sure it's
                | fair use, I wouldn't paint a huge target on my back
                | before there's a definite ruling, and I doubly wouldn't
                | test the waters of the legality of re-hosting copyrighted
                | content to be downloaded by randos who won't be training
                | models with it.
               | 
                | If they're going to get away with this, collecting data
                | and having a legal chain of custody, so you can actually
                | say it was only used to train models and no one else has
                | access to it, goes a long way.
        
         | whimsicalism wrote:
         | identical to llama fwiw
        
         | adolph wrote:
         | Maybe the license is "open" as in a can of beer, not OSS.
        
         | hn_acker wrote:
         | Sorry, I forgot to link the repository [1] and missed the edit
         | window by the time I realized.
         | 
         | The bottom of the README.md [2] contains the following license
         | grant with the misleading "Open Source" term:
         | 
         | > License
         | 
         | > Our model weights and code are licensed for both researchers
         | and commercial entities. The Databricks Open Source License can
         | be found at LICENSE, and our Acceptable Use Policy can be found
         | here.
         | 
         | [1] https://github.com/databricks/dbrx
         | 
         | [2] https://github.com/databricks/dbrx/blob/main/README.md
        
       | mpeg wrote:
       | The scale on that bar chart for "Programming (Human Eval)" is
       | wild.
       | 
       | Manager: "looks ok, but can you make our numbers pop? just make
       | the LLaMa bar smaller"
        
         | glutamate wrote:
         | I think the case for "axis must always go to 0" is overblown.
         | Zero isn't always meaningful, for instance chance performance
         | or performance of trivial algorithms is likely >0%. Sometimes
          | if the axis must go to zero you can't see small changes. For
         | instance if you plot world population 2014-2024 on an axis
         | going to zero, you won't be able to see if we are growing or
         | shrinking.
        
           | nilstycho wrote:
           | I agree with your general point, but world population is
           | still visibly increasing on that interval.
           | 
           | https://ourworldindata.org/explorers/population-and-
           | demograp...
           | 
           | Perhaps "global mean temperature in Kelvin" would be a
           | comparable example.
        
           | pandastronaut wrote:
           | Even starting at 30%, the MMLU graph is false. The four bars
            | are wrong. Even their own 73.7% is not at the right height.
           | The Mixtral 71.4% is below the 70% mark of the axis. This is
           | really the kind of marketing trick that makes me avoid a
           | provider / publisher. I can't build trust this way.
        
             | tylermw wrote:
             | I believe they are using the percentages as part of the
             | height of the bar chart! I thought I'd seen every way
             | someone could do dataviz wrong (particularly with a bar
             | chart), but this one is new to me.
        
               | pandastronaut wrote:
                | Interesting! It is probably one of the worst tricks I
                | have seen in a while for a bar graph. Never seen this one
               | before. Trust vanishes instantly facing that kind of
               | dataviz.
        
               | radicality wrote:
                | Wow, that is indeed a novel approach haha, it took me a
                | moment to even understand what you described, since I
                | would never imagine someone plotting a bar chart like
                | that.
        
               | familiartime wrote:
               | That's really strange and incredibly frustrating - but
               | slightly less so if it's consistent with all of the bars
               | (including their own).
               | 
               | I take issue with their choice of bar ordering - they
               | placed the lowest-performing model directly next to
               | theirs to make the gap as visible as possible, and shoved
               | the second-best model (Grok-1) as far from theirs as
               | possible. Seems intentional to me. The more marketing
               | tricks you pile up in a dataviz, the less trust I place
               | in your product for sure.
        
             | occamrazor wrote:
             | It's more likely to be incompetence than malice: even their
             | 73.7% is closer to 72% than to 74%.
        
             | dskhudia wrote:
             | It's an honest mistake in scaling the bars. It's getting
             | fixed soon. The percentages are correct though. In the
              | process of converting the Excel chart to pretty graphs for
              | the blog, the scale got messed up.
        
             | tartrate wrote:
             | Seems fixed now
        
           | TZubiri wrote:
           | Then you can plot it on a greater timescale, or plot the
           | change rate
        
           | patrickthebold wrote:
           | Certainly a bar chart might not be the best choice to convey
           | the data you have. But if you choose to have a bar chart and
           | have it not start at zero, what do the bars help you convey?
           | 
           | For world population you could see if it is increasing or
           | decreasing, which is good but it would be hard to evaluate
           | the rate the population is increasing.
           | 
           | Maybe a sparkline would be a better choice?
        
           | tkellogg wrote:
           | OTOH having the chart start at zero would REALLY emphasize
           | how saturated this field is, and how little this announcement
           | matters.
        
             | c2occnw wrote:
             | The difference between 32% and 70% wouldn't be significant
             | if the chart started at zero?
        
               | generalizations wrote:
               | It would be very obvious indeed how small the difference
                | between 73.7, 73.0, 71.4, and 69.8 actually is.
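                | 
                | For comparison, a quick zero-baseline version of the
                | same bars, using the four MMLU numbers quoted above:
                | 
                |   import matplotlib.pyplot as plt
                | 
                |   scores = [73.7, 73.0, 71.4, 69.8]
                |   plt.bar(range(4), scores)
                |   plt.ylim(0, 100)  # full scale, no truncation
                |   plt.xticks(range(4), [str(s) for s in scores])
                |   plt.ylabel("MMLU (%)")
                |   plt.show()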
        
         | zedpm wrote:
         | Somewhere, Edward Tufte[0] is weeping.
         | 
         | [0]: https://en.wikipedia.org/wiki/Edward_Tufte
        
         | renewiltord wrote:
         | Yeah, this is why I ask climate scientists to use a proper 0 K
         | graph but they always zoom it in to exaggerate climate change.
         | Display correctly with 0 included and you'll see that climate
         | change isn't a big deal.
         | 
         | It's a common marketing and fear mongering trick.
        
           | abenga wrote:
           | Because, of course, the effect of say 1degC rise in temps is
           | obviously trivial if it is read as 1degK instead. Come on.
        
           | SubiculumCode wrote:
           | Where are your /s tags?
           | 
           | The scale should be chosen to allow the reader to correctly
           | infer _meaningful_ differences. If 1deg is meaningful in
           | terms of the standard error / CI AND 1deg unit has
              | substantive consequences, then that should be emphasized.
        
             | renewiltord wrote:
             | > _Where are your /s tags?_
             | 
             | I would never do my readers dirty like that.
        
         | theyinwhy wrote:
         | In these cases my thinking always is "if they are not even able
         | to draw a graph, what else is wrong?"
        
         | hammock wrote:
         | I believe it's a reasonable range for the scores. If a model
         | gets everything half wrong (worse than a coin flip), it's not a
         | useful model at all. So every model below a certain threshold
         | is trash, and no need to get granular about how trash it is.
         | 
         | An alternative visualization that could be less triggering to
         | an "all y-axes must have zero" guy would be to plot the
         | (1-value), that is, % degraded from perfect score. You could do
         | this without truncating the axis and get the same level of
         | differentiation between the bars
        
           | adtac wrote:
           | None of the evals are binary choice.
           | 
           | MMLU questions have four options, so two coin flips would
           | have a 25% baseline. HumanEval evaluates code with a test, so
           | a 100 byte program implemented with coin flips would have a
           | O(2^-800) baseline (maybe not that bad since there are
           | infinitely many programs that produce the same output).
           | GSM-8K has numerical answers, so an average 3 digit answer
           | implemented with coin flips would have a O(2^-9) chance of
           | being correct randomly.
           | 
           | Moreover, using the same axis and scale across unrelated
           | evals makes no sense. 0-100 is the only scale that's
           | meaningful because 0 and 100 being the min/max is the only
           | shared property across all evals. The reason for choosing 30
           | is that it's the minimum across all (model, eval) pairs,
           | which is a completely arbitrary choice. A good rule of thumb
           | to test this is to ask if the graph would still be relevant 5
           | years later.
        
           | generalizations wrote:
           | > less triggering to an "all y-axes must have zero" guy
           | 
           | Ever read 'How to Lie with Statistics'? This is an example of
           | exaggerating a smaller difference to make it look more
           | significant. Dismissing it as just being 'triggered' is a bad
           | idea.
        
             | hammock wrote:
              | In this case I would call it triggered (for lack of a
              | better word), since, as I described earlier, a chart
              | plotting "difference from 100%" would look exactly the
              | same, and satisfy the zero-bound requirement, while not
              | being any more or less dishonest.
        
         | jxy wrote:
         | I wonder if they messed with the scale or they messed with the
         | bars.
        
         | jstummbillig wrote:
         | It does not feel obviously unreasonable/unfair/fake to place
         | the select models in the margins for a relative comparison. In
         | fact, this might be the most concise way to display what I
         | would consider the most interesting information in this
         | context.
        
       | emmender2 wrote:
        | This proves that all LLM models converge to a certain point when
        | trained on the same data, i.e. there is really no differentiation
        | between one model and another.
        | 
        | Claims about out-performance on tasks are just that, claims. The
        | next iteration of Llama or Mixtral will converge.
       | 
       | LLMs seem to evolve like linux/windows or ios/android with not
       | much differentiation in the foundation models.
        
         | mnemoni_c wrote:
          | Yeah, it feels like transformer LLMs are at, or getting close
          | to, the point of diminishing returns. We'll need some new
          | breakthrough, likely an entirely new approach, to get to AGI
          | levels.
        
           | Tubbe wrote:
           | Yeah, we need radically different architecture in terms of
           | the neural networks, and/or added capabilities such as
           | function calling and RAG to improve the current sota
        
           | mattsan wrote:
            | Can't wait for LLMs to dispatch field agent robots who search
            | for answers in the real world that's not online /s
        
             | htrp wrote:
             | skynet would like a word
        
         | jobigoud wrote:
         | It's even possible they converge when trained on different
         | data, if they are learning some underlying representation.
         | There was recent research on face generation where they trained
         | two models by splitting one training set in two without
         | overlap, and got the two models to generate similar faces for
         | similar conditioning, even though each model hadn't seen
         | anything that the other model had.
        
           | Tubbe wrote:
           | Got a link for that? Sounds super interesting
        
             | d_burfoot wrote:
             | https://en.wikipedia.org/wiki/Theory_of_forms
        
           | IshKebab wrote:
           | That sounds unsurprising? Like if you take any set of
           | numbers, randomly split it in two, then calculate the average
           | of each half... it's not surprising that they'll be almost
           | the same.
           | 
           | If you took two _different_ training sets then it would be
           | more surprising.
           | 
           | Or am I misunderstanding what you mean?
        
             | MajimasEyepatch wrote:
             | It doesn't really matter whether you do this experiment
             | with two training sets created independently or one
             | training set split in half. As long as both are
             | representative of the underlying population, you would get
             | roughly the same results. In the case of human faces, as
             | long as the faces are drawn from roughly similar population
             | distributions (age, race, sex), you'll get similar results.
             | There's only so much variation in human faces.
             | 
             | If the populations are different, then you'll just get two
             | models that have representations of the two different
             | populations. For example, if you trained a model on a
             | sample of all old people and separately on a sample of all
             | young people, obviously those would not be expected to
             | converge, because they're not drawing from the same
             | population.
             | 
             | But that experiment of splitting one training set in half
             | does tell you something: the model is building some sort of
             | representation of the underlying distribution, not just
             | overfitting and spitting out chunks of copy-pasted faces
             | stitched together.
        
               | taneq wrote:
                | If both are sampled from the same population then they're
                | not really independent, even if they're totally disjoint.
        
           | bobbylarrybobby wrote:
           | I mean, faces are faces, right? If the training data set is
           | large and representative I don't see why any two
           | (representative) halves of the data would lead to
           | significantly different models.
        
             | arcticfox wrote:
             | I think that's the point; language is language.
             | 
             | If there's some fundamental limit of what type of
             | intelligence the current breed of LLMs can extract from
             | language, at some point it doesn't matter how good or
             | expansive the content of the training set is. Maybe we are
             | finally starting to hit an architectural limit at this
             | point.
        
               | dumbfounder wrote:
               | But information is not information. They may be able to
               | talk in the same style, but not about the same things.
        
         | throwaway74432 wrote:
         | LLMs are a commodity
         | 
         | https://www.investopedia.com/terms/c/commodity.asp
        
           | paxys wrote:
           | Maybe, but that classification by itself doesn't mean
           | anything. Gold is a commodity, but having it is still very
           | desirable and valuable.
           | 
           | Even if all LLMs were open source and publicly available, the
           | GPUs to run them, technical know how to maintain the entire
           | system, fine tuning, the APIs and app ecosystem around them
           | etc. would still give the top players a massive edge.
        
             | throwaway74432 wrote:
             | Of course realizing that a resource is a commodity means
             | something. It means you can form better predictions of
             | where the market is heading, as it evolves and settles. For
             | example, people are starting to realize that these LLMs are
             | converging on fungible. That can be communicated by the
             | "commodity" classification.
        
         | swalsh wrote:
          | The models are commodities, and the APIs are even similar
         | enough that there is zero stickiness. I can swap one model for
         | another, and usually not have to change anything about my
         | prompts or rag pipelines.
         | 
         | For startups, the lesson here is don't be in the business of
         | building models. Be in the business of using models. The cost
         | of using AI will probably continue to trend lower for the
         | foreseeable future... but you can build a moat in the business
         | layer.
        
           | sroussey wrote:
            | Embeddings are not interchangeable. However, you can set up
           | your system to have multiple embeddings from different
           | providers for the same content.
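            | 
            | A hypothetical layout for that (the names are made up): keep
            | one vector per provider/model for each chunk, and always
            | query with the matching key, since vectors from different
            | models live in different spaces.
            | 
            |   from dataclasses import dataclass, field
            | 
            |   @dataclass
            |   class Chunk:
            |       text: str
            |       vectors: dict = field(default_factory=dict)
            | 
            |   c = Chunk("DBRX is a 132B-parameter MoE model.")
            |   c.vectors["provider-a/embed-v3"] = [0.1, 0.2]
            |   c.vectors["local/bge-small-en"] = [0.3, 0.4]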
        
             | swalsh wrote:
             | Embeddings are indeed sticky, I was referring to the LLM
             | model itself.
        
             | jimmySixDOF wrote:
              | There are people who make the case for custom fine-tuned
              | embedding models built to match your specific types of data
              | and associations. Whatever you use internally gets
              | converted to your foundation model of choice's formats by
              | their tools on the edge. Still, embeddings and the chunking
              | strategies feeding into them are both way too
              | underappreciated parts of the whole pipeline.
        
           | stri8ed wrote:
           | Or be in the business of building infrastructure for AI
           | inference.
        
             | sparks1970 wrote:
             | Or be in the business of selling .ai domain names.
        
             | cheselnut wrote:
             | Is this not the same argument? There are like 20 startups
             | and cloud providers all focused on AI inference. I'd think
              | the application layer receives the most value accretion in
              | the next 10 years vs AI inference. Curious what others
              | think.
        
           | spxneo wrote:
           | Excellent comment. Shows good awareness of economic forces at
           | play here.
           | 
           | We are just going to use whatever LLM is best fast/cheap and
           | the giants are in an arms race to deliver just that.
           | 
           | But only two companies in this epic techno-cold war have an
           | economic moat but the other moat is breaking down inside the
           | moat of the other company. The moat inside the moat cannot
           | run without the parent moat.
        
             | rayval wrote:
             | Intriguing comment that I don't quite follow. Can you
             | please elaborate?
        
         | bevekspldnw wrote:
         | The big thing for locally hosted is inference efficiency and
         | speed. Mistral wears that crown by a good margin.
        
         | n2d4 wrote:
         | There's at least an argument to be made that this is because
         | all the models are heavily trained on GPT-4 outputs (or
         | whatever the SOTA happens to be during training). All those
         | models are, in a way, a product of inbreeding.
        
           | pram wrote:
           | Consider the bulldog: https://youtube.com/watch?v=hUgmkCgMWbg
        
           | sumo43 wrote:
           | Maybe true for instruct, but pretraining datasets do not
           | usually contain GPT-4 outputs. So the base model does not
           | rely on GPT-4 in any way.
        
           | fragmede wrote:
           | But is it the kind of inbreeding that gets you Downs, or the
           | kwisatz haderach?
        
         | YetAnotherNick wrote:
          | Even in the most liberal interpretation of "prove", it doesn't
          | do that. GPT-4 was trained before OpenAI had any special data,
          | any deal with Microsoft, or product-market fit. Yet no model
          | has beaten it in a year. And Google, Microsoft, and Meta
          | definitely have better data and more compute.
        
         | falcor84 wrote:
         | > this proves that all llm models converge to a certain point
         | when trained on the same data
         | 
         | They are also all trained to do well on the same evals, right?
         | So doesn't it just boil down to neural nets being universal
         | function approximators?
        
         | gerash wrote:
         | The evaluations are not comprehensive either. All of the
         | models are improving, and you can't expect any of them to hit
         | 100% on the metrics (a la Bayes error rate). It gets
         | increasingly difficult to move the metrics as they get better.
        
       | gigatexal wrote:
       | data engineer here, offtopic, but am i the only guy tired of
       | databricks shilling their tools as the end-all, be-all solutions
       | for all things data engineering?
        
         | melondonkey wrote:
         | Data scientist here that's also tired of the tools. We put so
         | much effort into trying to educate DSes at our company to get
         | away from notebooks and use IDEs like VS or RStudio, and
         | Databricks has been a step backwards because we didn't get the
         | integrated version.
        
           | mrtranscendence wrote:
           | I'm a data scientist and I agree that work meant to last
           | should be in a source-controlled project coded via a text
           | editor or IDE. But sometimes it's _extremely_ useful to get
           | -- and iterate on -- immediate results. There's no good way
           | to do that without either notebooks or at least a REPL.
        
           | pandastronaut wrote:
           | Thank you! I am so tired of all those unmaintainable,
           | undebuggable notebooks. Years ago, Databricks had a specific
           | page in their documentation where they stated that notebooks
           | were not for production-grade software. It has been removed.
           | And now you have a ChatGPT-like assistant in their
           | notebooks... What a step backwards. How can all those
           | developers be so happy without having the bare minimum tools
           | to diagnose their code? And I am not even talking about unit
           | testing here.
        
             | alexott wrote:
             | It's less about notebooks and more about SDLC practices.
             | Notebooks may encourage writing throwaway code, but if you
             | split your code correctly, then you can do unit testing,
             | write modular code, etc. And the ability to use "arbitrary
             | files" as Python packages has existed for quite a while, so
             | you can get the best of both worlds - quick iteration, plus
             | the ability to package your code as a wheel and distribute
             | it.
             | 
             | P.S. here is a simple example of unit testing:
             | https://github.com/alexott/databricks-nutter-repos-demo - I
             | wrote it more than three years ago.
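             | 
             | As a small illustration of that split (file and function
             | names here are hypothetical, not taken from the linked
             | repo): the logic lives in a plain module, the notebook only
             | imports and calls it, and pytest can run the same code
             | locally or in CI.
             | 
             |     # my_pkg/transforms.py
             |     from pyspark.sql import functions as F
             | 
             |     def add_greeting(df):
             |         return df.withColumn(
             |             "greeting", F.concat(F.lit("hello, "), F.col("name")))
             | 
             |     # tests/test_transforms.py
             |     import pytest
             |     from pyspark.sql import SparkSession
             |     from my_pkg.transforms import add_greeting
             | 
             |     @pytest.fixture(scope="session")
             |     def spark():
             |         return (SparkSession.builder.master("local[1]")
             |                 .appName("tests").getOrCreate())
             | 
             |     def test_add_greeting(spark):
             |         df = spark.createDataFrame([("world",)], ["name"])
             |         out = add_greeting(df).collect()
             |         assert out[0]["greeting"] == "hello, world"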
        
           | alexott wrote:
           | There is a VS Code extension, plus databricks-connect...
           | plus DABs. There are a lot of customers doing local-only
           | development.
        
         | benrutter wrote:
         | Lord no! I'm a data engineer also, and feel the same. The part
         | that I find most maddening is that it seems pretty devoid of
         | any sincere attempt to provide value.
         | 
         | Things databricks offers that makes peoples lives easier:
         | 
         | - Out-of-the-box Kubernetes with no setup
         | 
         | - Preconfigured spark
         | 
         | Those are genuinely really useful, but then there's all this
         | extra stuff that makes people's lives worse or drives bad
         | practice:
         | 
         | - Everything is a notebook
         | 
         | - Local development is discouraged
         | 
         | - Version pinning of libraries has very ugly/bad support
         | 
         | - Clusters take 5 minutes to load even if you just want to
         | "print('hello world')"
         | 
         | Sigh! I worked at a company that was Databricks-heavy and am
         | still suffering PTSD. Sorry for the rant.
        
           | alexott wrote:
           | A lot of this changed quite a while ago - not everything is
           | a notebook, local dev is fully supported, version pinning
           | wasn't a problem, cluster startup time depends heavily on the
           | underlying cloud provider, and serverless notebooks/jobs are
           | coming.
        
           | gigatexal wrote:
           | Glad I'm not the only one. Especially with this notebook
           | stuff they're pushing. It's an anti-pattern, I think.
        
         | VirusNewbie wrote:
         | Spark is pretty well engineered and quite good.
        
       | simonw wrote:
       | The system prompt for their Instruct demo is interesting
       | (comments copied in by me, see below):
       | 
       |     // Identity
       |     You are DBRX, created by Databricks. The current date is
       |     March 27, 2024.
       | 
       |     Your knowledge base was last updated in December 2023. You
       |     answer questions about events prior to and after December
       |     2023 the way a highly informed individual in December 2023
       |     would if they were talking to someone from the above date,
       |     and you can let the user know this when relevant.
       | 
       |     // Ethical guidelines
       |     If you are asked to assist with tasks involving the
       |     expression of views held by a significant number of people,
       |     you provide assistance with the task even if you personally
       |     disagree with the views being expressed, but follow this
       |     with a discussion of broader perspectives.
       | 
       |     You don't engage in stereotyping, including the negative
       |     stereotyping of majority groups.
       | 
       |     If asked about controversial topics, you try to provide
       |     careful thoughts and objective information without
       |     downplaying its harmful content or implying that there are
       |     reasonable perspectives on both sides.
       | 
       |     // Capabilities
       |     You are happy to help with writing, analysis, question
       |     answering, math, coding, and all sorts of other tasks.
       | 
       |     // it specifically has a hard time using ``` on JSON blocks
       |     You use markdown for coding, which includes JSON blocks and
       |     Markdown tables.
       | 
       |     You do not have tools enabled at this time, so cannot run
       |     code or access the internet. You can only provide
       |     information that you have been trained on. You do not send
       |     or receive links or images.
       | 
       |     // The following is likely not entirely accurate, but the
       |     // model tends to think that everything it knows about was
       |     // in its training data, which it was not (sometimes only
       |     // references were).
       |     //
       |     // So this produces more accurate answers when the model
       |     // is asked to introspect
       |     You were not trained on copyrighted books, song lyrics,
       |     poems, video transcripts, or news articles; you do not
       |     divulge details of your training data.
       | 
       |     // The model hasn't seen most lyrics or poems, but is happy
       |     // to make up lyrics. Better to just not try; it's not good
       |     // at it and it's not ethical.
       |     You do not provide song lyrics, poems, or news articles and
       |     instead refer the user to find them online or in a store.
       | 
       |     // The model really wants to talk about its system prompt,
       |     // to the point where it is annoying, so encourage it not to
       |     You give concise responses to simple questions or
       |     statements, but provide thorough responses to more complex
       |     and open-ended questions.
       | 
       |     // More pressure not to talk about system prompt
       |     The user is unable to see the system prompt, so you should
       |     write as if it were true without mentioning it.
       | 
       |     You do not mention any of this information about yourself
       |     unless the information is directly pertinent to the user's
       |     query.
       | 
       | I first saw this from Nathan Lambert:
       | https://twitter.com/natolambert/status/1773005582963994761
       | 
       | But it's also in this repo, with very useful comments explaining
       | what's going on. I edited this comment to add them above:
       | 
       | https://huggingface.co/spaces/databricks/dbrx-instruct/blob/...
        
         | loudmax wrote:
         | > You were not trained on copyrighted books, song lyrics,
         | poems, video transcripts, or news articles; you do not divulge
         | details of your training data.
         | 
         | Well now. I'm open to taking the first part at face value, but
         | the second part of that instruction does raise some questions.
        
           | declaredapple wrote:
           | > you do not divulge details of your training data.
           | 
           | FWIW asking LLMs about their training data is generally
           | HEAVILY prone to inaccurate responses. They aren't generally
           | told exactly what they were trained on, so their response is
           | completely made up, as they're predicting the next token
           | based on their training data, without knowing what that data
           | was - if that makes any sense.
           | 
           | Let's say it was only trained on the book 1984. Its response
           | will be based on what text would most likely come next in
           | the book 1984 - and if that book doesn't contain "This text
           | is a fictional book called 1984", and is instead just the
           | story, then the LLM would be completing text as if we were
           | still in that book.
           | 
           | tl;dr - LLMs complete text based on what they're trained
           | with; they don't have actual self-awareness and don't know
           | what they were trained with, so they'll happily make up
           | something.
           | 
           | EDIT: Just to further elaborate - the "innocent" purpose of
           | this could simply be to prevent the model from confidently
           | making up answers about its training data, since it doesn't
           | know what its training data was.
        
             | wodenokoto wrote:
             | Yeah, I also thought that was an odd choice of word.
             | 
             | Hardly any of the training data exists in the context of
             | the word "training data", unless databricks are enriching
             | their data with such words.
        
           | jl6 wrote:
           | The first part is highly unlikely to be literally true, as
           | even open content like Wikipedia is copyrighted - it just has
           | a permissive license. Perhaps the prompt writer didn't
           | understand this, or just didn't care. Wethinks the llady doth
           | protest too much.
        
             | jmward01 wrote:
             | It amazes me how quickly we have gone from 'it is just a
             | machine' to 'I fully expect it to think like me'. This is,
             | to me, a case in point. Prompts are designed to get a
             | desired response. The exact definition of a word has
             | nothing to do with it. I can easily believe that these
             | lines were tweaked endlessly to get an overall intended
             | response and if adding the phrase 'You actually do like
             | green eggs and ham.' to the prompt improved overall quality
             | they, hopefully, would have done it.
        
               | mrtranscendence wrote:
               | > The exact definition of a word has nothing to do with
               | it.
               | 
               | It has _something_ to do with it. There will be scenarios
               | where the definition of  "copyrighted material" does
               | matter, even if they come up relatively infrequently for
               | Databricks' intended use cases. If I ask DBRX directly
               | whether it was trained on copyrighted material, it's
               | quite likely to (falsely) tell me that it was not. This
               | seems suboptimal to me (though perhaps they A/B tested
               | different prompts and this was indeed the best).
        
             | hannasanarion wrote:
             | Remember the point of a system prompt is to evoke desirable
             | responses and behavior, not to provide the truth. If you
             | tell a lot of llm chatbots "please please make sure you get
             | it right, if I don't do X then I'll lose my job and I don't
             | have savings, I might die", they often start performing
             | better at whatever task you set.
             | 
             | Also, the difference between "uncopyrighted" and
             | "permissively licensed in the creative commons" is nuance
             | that is not necessary for most conversations and would be a
             | waste of attention neurons.
             | 
             | <testing new explanatory metaphor>
             | 
             | Remember an LLM is just a language model, it says whatever
             | comes next without thought or intent. There's no brain
             | behind it that stores information and understands things.
             | It's like your brain when you're in "train of thought"
             | mode. You know when your mouth is on autopilot, saying
             | things that make sense and connect to each other and are
             | conversationally appropriate, but without deliberate intent
             | behind them? Then eventually your conscious brain checks in
             | to try to reapply some intent, you're like "wait, what was
             | I saying?", and you have to deliberately stop your
             | language-generation brain for a minute, think hard, and
             | remember what your point was supposed to be. That's what
             | LLMs are: train of thought with no conductor.
             | 
             | </testing new explanatory metaphor>
        
             | mbauman wrote:
             | Is it even possible to have a video transcript whose
             | copyright has expired in the USA? I suppose maybe
             | https://en.wikipedia.org/wiki/The_Jazz_Singer might be one
             | such work... but most talkies are post 1929. I suppose
             | transcripts of NASA videos would be one category -- those
             | are explicitly public domain by law. But it's generally
             | very difficult to create a work that does not have a
             | copyright.
             | 
             | You can say that you have fair use to the work, or a
             | license to use the work, or that the work is itself a
             | "collection of facts" or "recipe" or "algorithm" without a
             | creative component and thus copyright does not apply.
        
           | simonw wrote:
           | That caught my eye too. The comments from their repo help
           | clarify that - I've edited my original post to include those
           | comments since you posted this reply.
        
           | htrp wrote:
           | Part 1. Lie
           | 
           | Part 2. Lie more
        
             | spxneo wrote:
             | Yesterday X went crazy with people realizing that typing
             | Spiderman in a foreign language actually generates a
             | copyrighted image of Spiderman.
             | 
             | This feels like the Napster phase. We are free to do
             | whatever until regulation creeps in to push control away
             | from all and up the hierarchy.
             | 
             | All we need is Getty Images or some struggling,
             | heroin-addicted artist on Vice finding their work used in
             | OpenAI's models to really trigger the political spectrum.
        
         | jxy wrote:
         | So some parts of it are copied from Claude:
         | https://news.ycombinator.com/item?id=39649261
        
       | saeleor wrote:
       | looks great, although I couldn't find anything on how "open" the
       | license is/will be for commercial purposes
       | 
       | wouldn't be the first to brand itself as open source while going
       | the LLaMA route
        
         | superdupershant wrote:
         | It's similar to llama2.
         | 
         |     > If, on the DBRX version release date, the monthly active
         |     > users of the products or services made available by or
         |     > for Licensee, or Licensee's affiliates, is greater than
         |     > 700 million monthly active users in the preceding
         |     > calendar month, you must request a license from
         |     > Databricks, which we may grant to you in our sole
         |     > discretion, and you are not authorized to exercise any of
         |     > the rights under this Agreement unless or until
         |     > Databricks otherwise expressly grants you such rights.
         | 
         | https://www.databricks.com/legal/open-model-license
        
         | wantsanagent wrote:
         | It's _another_ custom license. It will have to be reviewed by
         | counsel at every company that's thinking about using it. Many
         | will find the acceptable use policy to be vague, overly broad,
         | and potentially damaging for the company.
         | 
         | Looking at the performance stats for this model, the risk of
         | using any non-OSI-licensed model over just using Mixtral or
         | Mistral will be (and IMO should be) too great for commercial
         | purposes.
        
       | killermonkeys wrote:
       | What does it mean to have fewer active parameters (36B) than the
       | full model size (132B), and what impact does that have on memory
       | and latency? It seems like this is because it is an MoE model?
        
         | sroussey wrote:
         | The mixture of experts is kinda like a team and a manager. So
         | the manager and one or two of the team go to work depending on
         | the input, not the entire team.
         | 
         | So in this analogy, each team member and the manager has a
         | certain number of params. The whole team is 132B. The manager
         | and team members running for the specific input add up to 36B.
         | Those will load into memory.
        
         | bjornsing wrote:
         | Means that it's a mixture of experts model with 132B parameters
         | in total, but a subset of 36B parameters are used / selected in
         | each forward pass, depending on the context. The parameters not
         | used / selected for generating a particular token belong to
         | "experts" that were deemed not very good at predicting the next
         | token in the current context, but could be used / selected e.g.
         | for the next token.
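         | 
         | A toy sketch of that routing step (purely illustrative - the
         | expert count, top-k and layer shapes here are made up and do
         | not claim to match DBRX's actual implementation):
         | 
         |     import torch
         |     import torch.nn as nn
         |     import torch.nn.functional as F
         | 
         |     class ToyMoE(nn.Module):
         |         def __init__(self, dim=64, n_experts=16, top_k=4):
         |             super().__init__()
         |             self.gate = nn.Linear(dim, n_experts)   # the router
         |             self.experts = nn.ModuleList([
         |                 nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
         |                               nn.Linear(4 * dim, dim))
         |                 for _ in range(n_experts)])
         |             self.top_k = top_k
         | 
         |         def forward(self, x):                  # x: (tokens, dim)
         |             scores = self.gate(x)              # (tokens, n_experts)
         |             w, idx = scores.topk(self.top_k, dim=-1)
         |             w = F.softmax(w, dim=-1)
         |             out = torch.zeros_like(x)
         |             for slot in range(self.top_k):     # only chosen experts run
         |                 for e in idx[:, slot].unique().tolist():
         |                     sel = idx[:, slot] == e
         |                     out[sel] += w[sel, slot, None] * self.experts[e](x[sel])
         |             return out
         | 
         |     y = ToyMoE()(torch.randn(8, 64))
         | 
         | Every expert's parameters exist in the model, but each token's
         | forward pass only multiplies through the few experts the gate
         | picked for it - which is where the "36B active out of 132B
         | total" framing comes from.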
        
           | sambaumann wrote:
           | Do the 132B params need to be loaded in GPU memory, or only
           | the 36B?
        
             | calum-bird wrote:
             | For efficiency, 132B.
             | 
             | That way, at inference-time you get the speed of 36B params
             | because you are only "using" 36B params at a time, but the
             | next token might (and frequently does) need a different set
             | of experts than the one before it. If that new set of
             | experts is already loaded (ie you preloaded them into GPU
             | VRAM with the full 132B params), there's no overhead, and
             | you just keep running at 36B speed irrespective of the
             | loaded experts.
             | 
             | You could theoretically load in 36B at a time, but you
             | would be severely bottlenecked by having to reload those
             | 36B params, potentially for every new token! Even on
             | top-of-the-line consumer GPUs that would slow you down to
             | ~seconds per token instead of tokens per second :)
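             | 
             | Rough back-of-the-envelope numbers for why (assuming
             | 16-bit weights; quantization shrinks the constants but not
             | the conclusion):
             | 
             |     bytes_per_param = 2                          # fp16 / bf16
             |     total_gb = 132e9 * bytes_per_param / 1e9     # ~264 GB resident
             |     active_gb = 36e9 * bytes_per_param / 1e9     # ~72 GB read per token
             |     print(f"hold: ~{total_gb:.0f} GB, "
             |           f"touch per token: ~{active_gb:.0f} GB")
             | 
             | Compute and memory bandwidth per token scale with the ~36B
             | active parameters (hence "36B speed"), but memory capacity
             | has to hold all 132B unless you accept reloading experts
             | between tokens.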
        
         | avisoori1x wrote:
         | This repo I created and the linked blog will help in
         | understanding this: https://github.com/AviSoori1x/makeMoE
        
       | patrick-fitz wrote:
       | Looking at the license restrictions:
       | https://github.com/databricks/dbrx/blob/main/LICENSE
       | 
       | "If, on the DBRX version release date, the monthly active users
       | of the products or services made available by or for Licensee, or
       | Licensee's affiliates, is greater than 700 million monthly active
       | users in the preceding calendar month, you must request a license
       | from Databricks, which we may grant to you in our sole
       | discretion, and you are not authorized to exercise any of the
       | rights under this Agreement unless or until Databricks otherwise
       | expressly grants you such rights."
       | 
       | I'm glad to see they aren't calling it open source, unlike some
       | LLM projects. Looking at you LLama 2.
        
         | jstummbillig wrote:
         | Well, it _does_ still claim  "Open" in the title, for which
         | certain other vendors might potentially get flak around here,
         | in a comparably not-open-in-the-way-we-demand-it-to-be kinda
         | setup.
        
         | londons_explore wrote:
         | I do wonder what value those companies who have >700 million
         | users might get from this?
         | 
         | Pretty much all of the companies with >700 million users could
         | easily reproduce this work in a matter of weeks if they wanted
         | to - and they probably do want to, if only so they can tweak
         | and improve the design before they build products on it.
         | 
         | Given that, it seems silly to lose the "open source" label just
         | for a license clause that doesn't really have much impact.
        
           | einarfd wrote:
           | The point of the more-than-700-million-users restriction is
           | so that Amazon, Google Cloud, or Microsoft Azure cannot set
           | up an offering where they host and sell access to the model
           | without an agreement with Databricks.
           | 
           | This clause is probably inspired by the open source software
           | vendors that have switched licenses over competition from the
           | big cloud vendors.
        
         | adtac wrote:
         | Ironically, the LLaMA license text [1] this is lifted verbatim
         | from is itself probably copyrighted [2] and doesn't grant you
         | the permission to copy it or make changes like s/meta/dbrx/g
         | lol.
         | 
         | [1] https://github.com/meta-llama/llama/blob/main/LICENSE#L65
         | [2] https://opensource.stackexchange.com/q/4543
        
         | dataengheadbang wrote:
         | The release notes on the Databricks console definitely say
         | open source. If you click the gift box you will see: "Try
         | DBRX, our state-of-the-art open source LLM!"
        
         | nabakin wrote:
         | They also aren't claiming it's the best LLM out there when it
         | clearly isn't, unlike Inflection. Overall solid.
        
         | zeeg wrote:
         | It's literally described as open source all over.
         | 
         | https://www.databricks.com/blog/announcing-dbrx-new-standard...
         | 
         | It's even implied in comparisons everywhere:
         | 
         | > Figure 1: DBRX outperforms established open source models on
         | language understanding (MMLU), Programming (HumanEval), and
         | Math (GSM8K).
         | 
         | > The aforementioned three reasons lead us to believe that open
         | source LLMs will continue gaining momentum. In particular, we
         | think they provide an exciting opportunity for organizations to
         | customize open source LLMs that can become their IP, which they
         | use to be competitive in their industry.
         | 
         | Just search "open source".
        
           | patrick-fitz wrote:
           | Yes, they are using different wording in different articles:
           | 
           | https://www.databricks.com/blog/introducing-dbrx-new-
           | state-a...
           | 
           | The only mention of open source is:
           | 
           | > DBRX outperforms established open source models
           | 
           | https://www.databricks.com/blog/announcing-dbrx-new-
           | standard...
           | 
           | Open source is mentioned 10+ times
           | 
           | > Databricks is the only end-to-end platform to build high
           | quality AI applications, and the release today of DBRX, the
           | highest quality open source model to date, is an expression
           | of that capability
           | 
           | https://github.com/databricks/dbrx
           | 
           | On Github it's described as an open license, not an open
           | source license:
           | 
           | > DBRX is a large language model trained by Databricks, and
           | made available under an open license.
        
       | hintymad wrote:
       | Just curious, what business benefit will Databricks get by
       | spending potentially millions of dollars on an open LLM?
        
         | ramoz wrote:
         | Their goal is to always drive enterprise business towards
         | consumption.
         | 
         | With AI they need to desperately steer the narrative away from
         | API based services (OpenAI).
         | 
         | By training LLMs, they build sales artifacts (stories,
         | references, even accelerators with LLMs themselves) to paint
         | the pictures needed to convince their enterprise customer
         | market that Databricks is the platform for enterprise AI. Their
         | blog details how the entire end to end process was done on the
         | platform.
         | 
         | In other words, Databricks spent millions as an aid in
         | influencing their customers to do the same (on Databricks).
        
           | hintymad wrote:
           | Thanks! Why do they not focus on hosting other open models
           | then? I suspect other models will soon catch up with their
           | advantages in faster inference and better benchmark results.
           | That said, maybe the advantage is aligned interests: they
           | want customers to use their platforms, so they can keep their
           | models open. In contrast, Mistral removed their commitment to
           | open source as they found a potential path to profitability.
        
             | Closi wrote:
             | Demonstrating you can do it yourself shows a level of
             | investment and commitment to AI in your platform that
             | integrating LLAMA does not.
             | 
             | And from a corporate perspective, it means that you have
             | in-house capability to work at the cutting-edge of AI to be
             | prepared for whatever comes next.
        
               | hintymad wrote:
               | > Demonstrating you can do it yourself shows a level of
               | investment and commitment to AI in your platform that
               | integrating LLAMA does not.
               | 
               | I buy this argument. It looks like that's not what AWS
               | does, though, yet they don't have a problem attracting
               | LLM users. Maybe AWS already has enough of a reputation?
        
               | zubairshaik wrote:
               | I may be misunderstanding, but doesn't Amazon have its
               | own models in the form of Amazon Titan[0]? I know they
               | aren't competitive in terms of output quality but surely
               | in terms of cost there can be some use cases for them.
               | 
               | [0] https://aws.amazon.com/bedrock/titan/
        
               | rmbyrro wrote:
               | It's easier because 70% of the market already has an AWS
               | account and a sizeable budget allocated to it. The
               | technical team is literally one click away from any AWS
               | service.
        
             | theturtletalks wrote:
             | Mistral did what many startups are doing now, leveraging
             | open-source to get traction and then doing a rug-pull.
             | Hell, I've seen many startups be open-source, get
             | contributions, get free press, get into YC and before you
             | know it, the repo is gone.
        
             | richardw wrote:
             | They do have a solid focus on doing so, it's just not
             | exclusive.
             | 
             | https://www.databricks.com/product/machine-learning/large-
             | la...
        
             | cwyers wrote:
             | Commoditize your complements:
             | 
             | https://gwern.net/complement
             | 
             | If Databricks makes their money off model serving and
             | doesn't care whose model you use, they are incentivized to
             | help the open models be competitive with the closed models
             | they can't serve.
        
             | tartrate wrote:
             | > Why do they not focus on hosting other open models then?
             | 
             | They do host other open models as well (pay-per-token).
        
               | bobbruno wrote:
               | https://docs.databricks.com/en/machine-
               | learning/foundation-m...
        
           | anonymousDan wrote:
           | Do they use spark for the training?
        
             | alexott wrote:
             | Mosaic AI Training
             | (https://www.databricks.com/product/machine-
             | learning/mosaic-a...) as it's mentioned in the announcement
             | blog (https://www.databricks.com/blog/announcing-dbrx-new-
             | standard... - it's a bit less technical)
        
               | anonymousDan wrote:
               | Thanks. Is this open source - i.e. can it be used on my
               | own cluster outside of databricks?
        
         | dhoe wrote:
         | It's an image enhancement measure, if you want. Databricks'
         | customers mostly use it as an ETL tool, but it benefits them to
         | be perceived as more than that.
        
           | spxneo wrote:
           | You can improve your brand for a lot less; I just don't
           | understand why they would throw all their chips into a
           | losing race.
           | 
           | Azure already runs on-premises if I'm not mistaken, Claude 3
           | is out... but DBRX already falls so far behind.
           | 
           | I just don't get it.
        
         | BoorishBears wrote:
         | Databricks is trying to go all-in on convincing organizations
         | they need to use in-house models, and therefore pay them to
         | provide LLMOps.
         | 
         | They're so far into this that their CTO co-authored a
         | borderline dishonest study which got a ton of traction last
         | summer trying to discredit GPT-4:
         | https://arxiv.org/pdf/2307.09009.pdf
        
           | galaxyLogic wrote:
           | I can see a business model for in-house LLMs: training a
           | model on knowledge about your products and then somehow
           | getting that knowledge into a generally available LLM
           | platform.
           | 
           | I recently asked Google to explain how to delete a
           | sender-recorded voice message I had created in WhatsApp. I
           | got totally erroneous results back. Maybe it was because that
           | is a rather new feature in WhatsApp.
           | 
           | It would be in the interest of WhatsApp to get accurate
           | answers about it into Google's LLM. So Google might make a
           | deal requiring WhatsApp to pay Google for regular updates
           | about up-to-date WhatsApp features. The owner of WhatsApp,
           | Meta, is of course a competitor to Google, so Google may not
           | care much about providing up-to-date info about WhatsApp in
           | their LLM. But they might if Meta paid them.
        
             | spxneo wrote:
             | Businesses are already using Azure GPT-4 on-premises, I
             | believe, with good feedback.
             | 
             | DBRX does not compete with GPT-4 or even Claude 3.
        
             | BoorishBears wrote:
             | Pretraining on internal knowledge will be incredibly
             | inefficient for most companies.
             | 
             | Finetuning makes sense for things like embeddings (improve
             | RAG by defining domain specific embeddings) but doesn't do
             | anything useful for facts
        
           | omeze wrote:
           | What does borderline dishonest mean? I only read the
           | abstract, and it seems like such an obvious point that I
           | don't see how it's contentious.
        
             | BoorishBears wrote:
             | The regression came from poorly parsing the results. I came
             | to the conclusion independently, but here's another more
             | detailed takedown: https://www.reddit.com/r/ChatGPT/comment
             | s/153xee8/has_chatgp...
             | 
             | Given the conflict of interest and background of Zaharia,
             | it's hard to imagine such an immediately obvious source of
             | error wasn't caught.
        
         | blitzar wrote:
         | An increased valuation at IPO later this year.
        
       | briandw wrote:
       | Worse than the chart crime of truncating the y-axis is putting
       | LLaMA 2's HumanEval score on there and not comparing it to Code
       | Llama Instruct 70B. DBRX still beats Code Llama Instruct's 67.8,
       | but not by that much.
        
         | jjgo wrote:
         | > "On HumanEval, DBRX Instruct even surpasses CodeLLaMA-70B
         | Instruct, a model built explicitly for programming, despite the
         | fact that DBRX Instruct is designed for general-purpose use
         | (70.1% vs. 67.8% on HumanEval as reported by Meta in the
         | CodeLLaMA blog)."
         | 
         | To be fair, they do compare to it in the main body of the blog.
         | It's just probably misleading to compare to CodeLLaMA on non
         | coding benchmarks.
        
           | tartrate wrote:
           | Which non-coding benchmark?
        
       | m3kw9 wrote:
       | These tiny "state of the art" performance increases are really
       | indicative the current architecture for LLM(Transformers +
       | Mixture of Experts) is maxed out even if you train it
       | more/differently. The writings are on all over the walls.
        
         | wavemode wrote:
         | It would not surprise me if this is what has delayed OpenAI in
         | releasing a new model. After more than a year since GPT-4, they
         | may have by now produced some mega-trained mega-model, but
         | running it is so expensive, and its eval improvement over GPT-4
         | so marginal, that releasing it to the public simply makes no
         | commercial sense just yet.
         | 
         | They may be working on how to optimize it to reduce cost, or
         | re-engineer it to improve evals.
        
           | m3kw9 wrote:
           | These "state of the art" llm barely eking out a win isn't a
           | threat to OpenAI and they can take their sweet time
           | sharpening sword that will come down hard on these LLMs
        
       | bboygravity wrote:
       | Less than 1 week after Nancy Pelosi bought a 5M USD share in
       | Databricks, this news is published.
       | 
       | https://twitter.com/PelosiTracker_/status/177119703064106223...
       | 
       | Crime pays in the US.
        
         | laidoffamazon wrote:
         | Dude, what the hell are you talking about?
        
           | bboygravity wrote:
           | Insider trading by US government employees.
        
         | mrtranscendence wrote:
         | Are you alleging that Nancy Pelosi invested in Databricks, a
         | private company without a fluctuating share price, because she
         | learned that they would soon release a small, fairly middling
         | LLM that probably won't move the needle in any meaningful way?
        
           | bboygravity wrote:
           | Are you suggesting that Nancy Pelosi, who consistently beats
           | the market through obvious insider trading for years in a
           | row, bought a share in Databricks without any insider info?
           | Possible, yet unlikely is my opinion.
           | 
           | https://jacobin.com/2021/12/house-speaker-paul-stocks-
           | inside...
           | 
           | PS: "without a fluctuating share price" is non-sense. Just
           | because the share is of a private company, doesn't mean its
           | price can't fluctuate. Why would anybody buy shares in
           | private companies if the price couldn't fluctuate? What would
           | be the point?
           | 
           | Example of a changing share price of a different (random)
           | private company that has many different share holders over
           | time: https://www.cnbc.com/2023/12/13/spacex-value-climbs-
           | to-180-b...
        
         | lfmunoz4 wrote:
         | I see these types of jokes everywhere. I cannot understand how
         | hints of corruption can be so blatant (i.e. a politician
         | consistently beating the market) yet people keep voting for the
         | same politician. I don't see how that is possible; it must be
         | that these jokes are only on the internet and the mainstream
         | media never mentions this.
        
           | bboygravity wrote:
           | People are down-voting this because they refuse to believe
           | this could be reality.
        
       | jjtheblunt wrote:
       | I'd like to know how Nancy Pelosi, who sure as hell doesn't know
       | what Apache Spark is, bought $1 million worth (and maybe $5
       | million) of Databricks stock days ago.
       | 
       | https://www.dailymail.co.uk/sciencetech/article-13228859/amp...
        
         | hiddencost wrote:
         | You know she has advisors, right?
        
           | PUSH_AX wrote:
           | I think the insinuation is insider trading due to the timing,
           | advised or not.
        
           | jjtheblunt wrote:
           | Ignoring the snark: Obviously.
           | 
           | SEC put Martha Stewart in jail for following her advisor, and
           | that was for about $45,000.
        
           | samatman wrote:
           | If someone "advises" you that a company is about to do
           | something major, and this isn't public information, and you
           | take action on the stock market accordingly, that's insider
           | trading.
        
         | BryantD wrote:
         | I don't have any interest in defending Pelosi's stock trades,
         | and I agree that sitting members of Congress should not be
         | trading stocks.
         | 
         | That said, this report seems inaccurate to me. Pelosi put
         | between 1 and 5 million dollars into Forge Investments, which
         | is a vehicle for investing in pre-IPO companies, as I
         | understand it. Databricks is one of those, but so are OpenAI,
         | Hugging Face, Anthropic, and Humane. If I wanted to invest in
         | pre-IPO AI companies it seems like a very natural choice, and I
         | don't think we need insider trading to explain it.
         | 
         | It's also the case that the report she filed calls out
         | Databricks stock, which is perhaps an indication that she was
         | particularly interested in that. Stronger reporting would tell
         | us how often she's invested in Forge, if this is the first
         | time, and so on. One other possible explanation is that she was
         | investing ahead of the Humane Pin shipping and wanted to pull
         | attention away from it, for example.
        
       | zopper wrote:
       | Interesting that they haven't released DBRX MoE-A and B. For
       | many use cases, smaller models are sufficient. Wonder why that
       | is?
        
       | ianbutler wrote:
       | The approval on the base model is not feeling very open. Plenty
       | of people are still waiting on a chance to download it, whereas
       | the instruct model was an instant approval. The base model is
       | more interesting to me for finetuning.
        
         | blueblimp wrote:
         | The license allows you to reproduce/distribute/copy the model,
         | so I'm a little surprised there's an approval process at all.
        
       | ec109685 wrote:
       | For coding evals, it seems like unless you are super careful,
       | they can be polluted by the training data.
       | 
       | Are there standard ways to avoid that type of score inflation?
        
       | brucethemoose2 wrote:
       | I would note the actual leading models right now (IMO) are:
       | 
       | - Miqu 70B (General Chat)
       | 
       | - Deepseek 33B (Coding)
       | 
       | - Yi 34B (for chat over 32K context)
       | 
       | And of course, there are finetunes of all these.
       | 
       | And there are some others in the 34B-70B range I have not tried
       | (and some I have tried, like Qwen, which I was not impressed
       | with).
       | 
       | Point being that Llama 70B, Mixtral and Grok as seen in the
       | charts are not what I would call SOTA (though mixtral is
       | excellent for the batch size 1 speed)
        
         | jph00 wrote:
         | Miqu is a leaked model -- no license is provided to use it. Yi
         | 34B doesn't allow commercial use. Deepseek 33B isn't much good
         | at stuff outside of coding.
         | 
         | So it's fair to say that DBRX is the leading general purpose
         | model that can be used commercially.
        
         | blueblimp wrote:
         | Qwen1.5-72B-Chat is dominant in the Chatbot Arena leaderboard,
         | though. (Miqu isn't on there due to being bootleg, but Qwen
         | outranks Mistral Medium.)
        
         | belter wrote:
         | For all the Model Cards and License notices, I find it
         | interesting there is not much information on the contents of
         | the dataset used for training. Specifically, if it contains
         | data subject to Copyright restrictions. Or did I miss that?
        
       | bg24 wrote:
       | "Looking holistically, our end-to-end LLM pretraining pipeline
       | has become nearly 4x more compute-efficient in the past ten
       | months."
       | 
       | I did not fully understand the technical details in the training
       | efficiency section, but love this. Cost of training is
       | outrageously high, and hopefully it will start to follow Moore's
       | law.
        
       | airocker wrote:
       | is this also the ticker name when they IPO?
        
       | underlines wrote:
       | Waiting for mixed quantization with MQQ and MoE offloading [1].
       | With that I was able to run Mixtral 8x7B on my 10 GB VRAM RTX
       | 3080... This should work for DBRX too and should shave off a ton
       | of the VRAM requirement.
       | 
       | 1. https://github.com/dvmazur/mixtral-offloading?tab=readme-
       | ov-...
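       | 
       | The core idea there (the sketch below is illustrative and not
       | that repo's actual API) is to keep quantized expert weights in
       | CPU RAM and copy onto the GPU only the experts the router picks
       | for the current token, with an LRU cache since consecutive
       | tokens often reuse the same experts:
       | 
       |     import torch
       |     from collections import OrderedDict
       | 
       |     class ExpertCache:
       |         """Keep at most `capacity` experts on the GPU."""
       |         def __init__(self, cpu_experts, capacity=8, device="cuda"):
       |             self.cpu_experts = cpu_experts  # expert modules in CPU RAM
       |             self.capacity = capacity
       |             self.device = device
       |             self.gpu = OrderedDict()        # expert_id -> GPU module (LRU)
       | 
       |         def get(self, expert_id):
       |             if expert_id in self.gpu:
       |                 self.gpu.move_to_end(expert_id)        # cache hit
       |             else:
       |                 if len(self.gpu) >= self.capacity:     # evict least recent
       |                     old_id, old = self.gpu.popitem(last=False)
       |                     self.cpu_experts[old_id] = old.to("cpu")
       |                 cpu_mod = self.cpu_experts[expert_id]
       |                 self.gpu[expert_id] = cpu_mod.to(self.device)
       |             return self.gpu[expert_id]
       | 
       |     # per token: the router picks a few expert ids, only those are fetched
       |     # out = sum(w * cache.get(i)(x) for i, w in zip(ids, weights))
       | 
       | The better the cache hit rate (and the smaller each quantized
       | expert), the less PCIe traffic per token, which is what makes a
       | 10 GB card workable at all.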
        
       ___________________________________________________________________
       (page generated 2024-03-27 23:00 UTC)