[HN Gopher] ETH Zurich and EPFL to release a LLM developed on pu...
___________________________________________________________________
ETH Zurich and EPFL to release a LLM developed on public
infrastructure
Author : andy99
Score : 260 points
Date : 2025-07-11 18:45 UTC (4 hours ago)
(HTM) web link (ethz.ch)
(TXT) w3m dump (ethz.ch)
| k__ wrote:
| _" respecting web crawling opt-outs during data acquisition
| produces virtually no performance degradation"_
|
| Great to read that!
| Onavo wrote:
| No performance degradation on training metrics except for the
| end user. At the end of the day users and website owners have
| completely orthogonal interests. Users want answers and
| content, website owners want attention so they can upsell/push
| ads. You can only serve one master.
| esafak wrote:
| > Users want answers and content, website owners want
| attention so they can upsell/push ads. You can only serve one
| master
|
| How are you going to serve users if website owners decide to
| wall off their content? You can't ignore one side of the market.
| Onavo wrote:
| You don't. You bypass them with crawlers and don't reveal
| your training data. And this is exactly why open source
| models can't surpass open weight models.
| diggan wrote:
| > And this is exactly why open source models can't
| surpass open weight models.
|
| It is a fair point, but how strong a point it is remains to
| be seen. Some architectures are better than others, even
| with the same training data, so it's not impossible that we
| could at some point see innovative architectures beating the
| current proprietary ones. It would probably be short-lived
| though, as the proprietary ones would obviously improve in
| their next release after that.
| Bengalilol wrote:
| Looking forward to proof-testing it.
| greenavocado wrote:
| Why would you announce this without a release? Be honest.
| wood_spirit wrote:
| The announcement was at the International Open-Source LLM
| Builders Summit held this week in Switzerland. Is it so strange
| that they announced what they are doing and the timeline?
| JumpCrisscross wrote:
| Funding? Deeply biasing European users toward publicly
| developed European LLMs (or at least away from American or
| Chinese ones) would make a lot of sense. (Potentially too
| much sense for Brussels.)
| phtrivier wrote:
| The cliche (at least on my side of the Alps) is that people in
| Switzerland like to take theiiiir tiiiime.
| Bengalilol wrote:
| "Move as quickly as possible, but as slowly as necessary."
| WeirderScience wrote:
| The open training data is a huge differentiator. Is this the
| first truly open dataset of this scale? Prior efforts like The
| Pile were valuable, but had limitations. Curious to see how
| reproducible the training is.
| layer8 wrote:
| > The model will be fully open: source code and weights will be
| publicly available, and the training data will be transparent
| and reproducible
|
| This leads me to believe that the training data won't be made
| publicly available in full, but merely be "reproducible". This
| might mean that they'll provide references like a list of URLs
| of the pages they trained on, but not their contents.
| WeirderScience wrote:
| Yeah, I suspect you're right. Still, even a list of URLs for
| a frontier model (assuming it does turn out to be of that
| level) would be welcome over the current situation.
| glhaynes wrote:
| That wouldn't seem reproducible if the content at those URLs
| changes. (Er, unless it was all web.archive.org URLs or
| something.)
| dietr1ch wrote:
| This is a problem with the Web. It should be easier to
| download content like it was updating a git Repo.
| TobTobXX wrote:
| Well, when the actual content is hundreds of terabytes in
| size, providing URLs may be more practical for them and for
| others.
| layer8 wrote:
| The difference between content they are allowed to train on
| vs. being allowed to distribute copies of is likely at
| least as relevant.
| evolvedlight wrote:
| Yup, it's not a dataset packaged like you hope for here, as it
| still contains traditionally copyrighted material
| oytis wrote:
| The press release talks a lot about how it was done, but very
| little about how capabilities compare to other open models.
| pantalaimon wrote:
| It's a university, teaching the 'how it's done' is kind of the
| point
| EA-3167 wrote:
| Sure, but usually you teach something that is inherently
| useful, or that can be applied to some sort of useful
| endeavor. In this case I think it's fair to ask what the
| collision of two bubbles really achieves, or, if it's just a
| useful teaching model, what it can be applied to.
| joot82 wrote:
| _The model will be released in two sizes -- 8 billion and 70
| billion parameters [...]. The 70B version will rank among the
| most powerful fully open models worldwide. [...] In late
| summer, the LLM will be released under the Apache 2.0 License._
|
| We'll find out in September if it's true?
| k__ wrote:
| I hope for a DeepSeek R2, but I fear a Llama 4.
| oytis wrote:
| Yeah, I was thinking more of a table with benchmark results
| wood_spirit wrote:
| The article says
|
| " Open LLMs are increasingly viewed as credible alternatives to
| commercial systems, most of which are developed behind closed
| doors in the United States or China"
|
| It is obvious that the companies producing big LLMs today have
| an incentive to try to enshittify them: chasing subscriptions
| while also doing product-placement ads, etc. Worse, some
| already have political biases they promote.
|
| It would be wonderful if a partnership between academia and
| government in Europe could deliver public-good search and AI
| that endeavours to serve the user over the company.
| klabb3 wrote:
| Yes, but it's a very complicated service to deliver. Even if
| they train great models, they likely will not operationalize
| them for inference. Those will still be private actors, and
| the incentives to enshittify will be the same. Also, for AI
| the incentive is generally much stronger than in the last
| tech generation, due to the cost of running these things.
| Basically, the free services where you're the product must
| aggressively extract value from you in order to make a
| profit.
| bee_rider wrote:
| Is this setting the bar for dataset transparency? It seems like a
| significant step forward. Assuming it works out, that is.
|
| They missed an opportunity though. They should have called their
| machine the AIps (AI Petaflops Supercomputer).
| philipkglass wrote:
| I think that the Allen Institute for Artificial Intelligence
| OLMo models are also completely open:
|
| _OLMo is fully open_
|
| _Ai2 believes in the power of openness to build a future where
| AI is accessible to all. Open weights alone aren't enough -
| true openness requires models to be trained in the open with
| fully open access to data, models, and code._
|
| https://allenai.org/olmo
| isusmelj wrote:
| I hope they do well. AFAIK they're training or finetuning an
| older LLaMA model, so performance might lag behind SOTA. But what
| really matters is that ETH and EPFL get hands-on experience
| training at scale. From what I've heard, the new AI cluster still
| has teething problems. A lot of people underestimate how tough it
| is to train models at this scale, especially on your own infra.
|
| Disclaimer: I'm Swiss and studied at ETH. We've got the
| brainpower, but not much large-scale training experience yet. And
| IMHO, a lot of the "magic" in LLMs is infrastructure-driven.
| luke-stanley wrote:
| When I read "from scratch", I assume they are doing pre-
| training, not just finetuning; do you have a different take?
| Or do you mean it's the normal Llama architecture they're
| using? I'm curious about the benchmarks!
| andy99 wrote:
| Imo, a lot of the magic is also dataset driven, specifically
| the SFT and other fine tuning / RLHF data they have. That's
| what has separated the models people actually use from the
| also-rans.
|
| I agree with everything you say about getting the experience;
| the infrastructure is very important and is probably the most
| critical part of a sovereign LLM supply chain. I would hope
| there will also be enough focus on the data, early on, so that
| the model will be useful.
| alfalfasprout wrote:
| The infra needed to get a SOTA LLM trained does become pretty
| complex. People assume it's as simple as loading up the
| architecture and a dataset + using something like Ray. There's
| a lot that goes into designing the dataset, the eval pipelines,
| the training approach, maximizing the use of your hardware,
| dealing with cross-node latency, recovering from errors, etc.
|
| But it's good to have more and more players in this space.
| hubraumhugo wrote:
| Pretty proud to see this at the top of HN as a Swiss (and I know
| many are lurking here!). These two universities produce world-
| class founders, researchers, and engineers. Yet, we always stay
| in the shadow of the US. With our top-tier public infrastructure,
| education, and political stability (+ neutrality), we have a
| unique opportunity to build something exceptional in the open LLM
| space.
| amelius wrote:
| Yeah, that's what "democratizing AI" means.
___________________________________________________________________
(page generated 2025-07-11 23:00 UTC)