[HN Gopher] ETH Zurich and EPFL to release a LLM developed on pu...
       ___________________________________________________________________
        
       ETH Zurich and EPFL to release a LLM developed on public
       infrastructure
        
       Author : andy99
       Score  : 260 points
       Date   : 2025-07-11 18:45 UTC (4 hours ago)
        
 (HTM) web link (ethz.ch)
 (TXT) w3m dump (ethz.ch)
        
       | k__ wrote:
       | _" respecting web crawling opt-outs during data acquisition
       | produces virtually no performance degradation"_
       | 
       | Great to read that!
        
         | Onavo wrote:
         | No performance degradation on training metrics except for the
         | end user. At the end of the day users and website owners have
         | completely orthogonal interests. Users want answers and
         | content, website owners want attention so they can upsell/push
         | ads. You can only serve one master.
        
           | esafak wrote:
           | > Users want answers and content, website owners want
           | attention so they can upsell/push ads. You can only serve one
           | master
           | 
           | How are you going to serve users if web site owners decide to
           | wall their content? You can't ignore one side of the market.
        
             | Onavo wrote:
             | You don't. You bypass them with crawlers and don't reveal
             | your training data. And this is exactly why open source
             | models can't surpass open weight models.
        
               | diggan wrote:
               | > And this is exactly why open source models can't
               | surpass open weight models.
               | 
                | It is a fair point, but how strong a point it is
                | remains to be seen. Some architectures are better than
                | others even with the same training data, so it's not
                | impossible that we could at some point see innovative
                | architectures beating the current proprietary ones. It
                | would probably be short-lived though, as the
                | proprietary ones would obviously improve in their next
                | release after that.
        
       | Bengalilol wrote:
        | Looking forward to putting it to the test.
        
       | greenavocado wrote:
       | Why would you announce this without a release? Be honest.
        
         | wood_spirit wrote:
         | The announcement was at the International Open-Source LLM
         | Builders Summit held this week in Switzerland. Is it so strange
         | that they announced what they are doing and the timeline?
        
         | JumpCrisscross wrote:
          | Funding? Deeply biasing European users towards
          | publicly-developed European LLMs (or at least not American
          | or Chinese ones) would make a lot of sense. (Potentially too
          | much sense for Brussels.)
        
         | phtrivier wrote:
         | The cliche (at least on my side of the Alps) is that people in
         | Switzerland like to take theiiiir tiiiime.
        
           | Bengalilol wrote:
           | "Move as quickly as possible, but as slowly as necessary."
        
       | WeirderScience wrote:
       | The open training data is a huge differentiator. Is this the
       | first truly open dataset of this scale? Prior efforts like The
       | Pile were valuable, but had limitations. Curious to see how
       | reproducible the training is.
        
         | layer8 wrote:
         | > The model will be fully open: source code and weights will be
         | publicly available, and the training data will be transparent
         | and reproducible
         | 
         | This leads me to believe that the training data won't be made
         | publicly available in full, but merely be "reproducible". This
         | might mean that they'll provide references like a list of URLs
         | of the pages they trained on, but not their contents.
        
           | WeirderScience wrote:
           | Yeah, I suspect you're right. Still, even a list of URLs for
           | a frontier model (assuming it does turn out to be of that
           | level) would be welcome over the current situation.
        
           | glhaynes wrote:
           | That wouldn't seem reproducible if the content at those URLs
           | changes. (Er, unless it was all web.archive.org URLs or
           | something.)
        
             | dietr1ch wrote:
              | This is a problem with the Web. It should be easier to
              | download content the way you pull updates into a git
              | repo.
        
           | TobTobXX wrote:
            | Well, when the actual content is hundreds of terabytes in
            | size, providing URLs may be more practical for them and
            | for others.
        
             | layer8 wrote:
             | The difference between content they are allowed to train on
             | vs. being allowed to distribute copies of is likely at
             | least as relevant.
        
         | evolvedlight wrote:
          | Yup, it's not a dataset packaged the way you'd hope for
          | here, as it still contains traditionally copyrighted
          | material.
        
       | oytis wrote:
       | The press release talks a lot about how it was done, but very
       | little about how capabilities compare to other open models.
        
         | pantalaimon wrote:
          | It's a university; teaching the 'how it's done' is kind of
          | the point.
        
           | EA-3167 wrote:
           | Sure, but usually you teach something that is inherently
           | useful, or can be applied to some sort of useful endeavor. In
           | this case I think it's fair to ask what the collision of two
           | bubbles really achieves, or if it's just a useful teaching
           | model, what it can be applied to.
        
         | joot82 wrote:
         | _The model will be released in two sizes -- 8 billion and 70
         | billion parameters [...]. The 70B version will rank among the
         | most powerful fully open models worldwide. [...] In late
         | summer, the LLM will be released under the Apache 2.0 License._
         | 
         | We'll find out in September if it's true?
        
           | k__ wrote:
           | I hope DeepSeek R2, but I fear Llama 4.
        
           | oytis wrote:
           | Yeah, I was thinking more of a table with benchmark results
        
       | wood_spirit wrote:
       | The article says
       | 
       | " Open LLMs are increasingly viewed as credible alternatives to
       | commercial systems, most of which are developed behind closed
       | doors in the United States or China"
       | 
        | It is obvious that the companies producing big LLMs today have
        | an incentive to try to enshittify them: pushing subscriptions
        | while also doing product placement, ads, etc. Worse, some
        | already have political biases they promote.
       | 
        | It would be wonderful if a partnership between academia and
        | government in Europe could build public-good search and AI
        | that endeavours to serve the user over the company.
        
         | klabb3 wrote:
         | Yes but it's a very complicated service to deliver. Even if
         | they train great models, they likely will not operationalize
         | them for inference. Those will still be private actors, and the
          | incentives to enshittify will be the same. Also, for AI the
          | incentives are generally much higher than in the last tech
          | generation, due to the cost of running these things.
          | Basically, the
         | free services where you're the product must aggressively
         | extract value out of you in order to make a profit.
        
       | bee_rider wrote:
       | Is this setting the bar for dataset transparency? It seems like a
       | significant step forward. Assuming it works out, that is.
       | 
       | They missed an opportunity though. They should have called their
       | machine the AIps (AI Petaflops Supercomputer).
        
         | philipkglass wrote:
         | I think that the Allen Institute for Artificial Intelligence
         | OLMo models are also completely open:
         | 
         |  _OLMo is fully open_
         | 
         |  _Ai2 believes in the power of openness to build a future where
         | AI is accessible to all. Open weights alone aren't enough -
         | true openness requires models to be trained in the open with
         | fully open access to data, models, and code._
         | 
         | https://allenai.org/olmo
        
       | isusmelj wrote:
       | I hope they do well. AFAIK they're training or finetuning an
       | older LLaMA model, so performance might lag behind SOTA. But what
       | really matters is that ETH and EPFL get hands-on experience
       | training at scale. From what I've heard, the new AI cluster still
       | has teething problems. A lot of people underestimate how tough it
       | is to train models at this scale, especially on your own infra.
       | 
       | Disclaimer: I'm Swiss and studied at ETH. We've got the
       | brainpower, but not much large-scale training experience yet. And
       | IMHO, a lot of the "magic" in LLMs is infrastructure-driven.
        
         | luke-stanley wrote:
          | When I read "from scratch", I assume they are doing
          | pre-training, not just finetuning. Do you have a different
          | take? Do you mean it's the normal Llama architecture they're
          | using? I'm curious about the benchmarks!
        
         | andy99 wrote:
         | Imo, a lot of the magic is also dataset driven, specifically
         | the SFT and other fine tuning / RLHF data they have. That's
         | what has separated the models people actually use from the
         | also-rans.
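          | 
          | For concreteness, SFT data is typically just prompt plus
          | demonstration-answer pairs, roughly like this illustrative
          | chat-style record (not from any specific dataset):
          | 
          |     example = {
          |         "messages": [
          |             {"role": "user",
          |              "content": "Explain the Apache 2.0 license."},
          |             {"role": "assistant",
          |              "content": "Apache 2.0 is a permissive..."},
          |         ]
          |     }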
         | 
         | I agree with everything you say about getting the experience,
         | the infrastructure is very important and is probably the most
         | critical part of a sovereign LLM supply chain. I would hope
         | there will also be enough focus on the data, early on, that the
         | model will be useful.
        
         | alfalfasprout wrote:
         | The infra does become pretty complex to get a SOTA LLM trained.
         | People assume it's as simple as loading up the architecture and
         | a dataset + using something like Ray. There's a lot that goes
         | into designing the dataset, the eval pipelines, the training
         | approach, maximizing the use of your hardware, dealing with
         | cross-node latency, recovering from errors, etc.
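          | 
          | To illustrate just one of those items, "recovering from
          | errors": a rough checkpoint-and-resume sketch (assumes
          | PyTorch; names and paths are illustrative) so a crashed job
          | can pick up where it left off instead of restarting the run.
          | 
          |     import os, torch
          | 
          |     CKPT = "checkpoint.pt"  # illustrative path
          | 
          |     def train(model, opt, loss_fn, loader, epochs):
          |         start = 0
          |         if os.path.exists(CKPT):  # resume after a crash
          |             state = torch.load(CKPT)
          |             model.load_state_dict(state["model"])
          |             opt.load_state_dict(state["opt"])
          |             start = state["epoch"] + 1
          |         for epoch in range(start, epochs):
          |             for inputs, targets in loader:
          |                 loss = loss_fn(model(inputs), targets)
          |                 loss.backward()
          |                 opt.step()
          |                 opt.zero_grad()
          |             torch.save({"model": model.state_dict(),
          |                         "opt": opt.state_dict(),
          |                         "epoch": epoch}, CKPT)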
         | 
         | But it's good to have more and more players in this space.
        
       | hubraumhugo wrote:
       | Pretty proud to see this at the top of HN as a Swiss (and I know
       | many are lurking here!). These two universities produce world-
       | class founders, researchers, and engineers. Yet, we always stay
       | in the shadow of the US. With our top-tier public infrastructure,
       | education, and political stability (+ neutrality), we have a
        | unique opportunity to build something exceptional in the open LLM
       | space.
        
       | amelius wrote:
       | Yeah, that's what "democratizing AI" means.
        
       ___________________________________________________________________
       (page generated 2025-07-11 23:00 UTC)