[HN Gopher] ETH Zurich and EPFL to release a LLM developed on pu...
       ___________________________________________________________________
        
       ETH Zurich and EPFL to release a LLM developed on public
       infrastructure
        
       Author : andy99
       Score  : 635 points
       Date   : 2025-07-11 18:45 UTC (1 day ago)
        
 (HTM) web link (ethz.ch)
 (TXT) w3m dump (ethz.ch)
        
       | k__ wrote:
       | _" respecting web crawling opt-outs during data acquisition
       | produces virtually no performance degradation"_
       | 
       | Great to read that!
        
         | Onavo wrote:
         | No performance degradation on training metrics except for the
         | end user. At the end of the day users and website owners have
         | completely orthogonal interests. Users want answers and
         | content, website owners want attention so they can upsell/push
         | ads. You can only serve one master.
        
           | esafak wrote:
           | > Users want answers and content, website owners want
           | attention so they can upsell/push ads. You can only serve one
           | master
           | 
           | How are you going to serve users if web site owners decide to
           | wall their content? You can't ignore one side of the market.
        
             | Onavo wrote:
             | You don't. You bypass them with crawlers and don't reveal
             | your training data. And this is exactly why open source
             | models can't surpass open weight models.
        
               | diggan wrote:
               | > And this is exactly why open source models can't
               | surpass open weight models.
               | 
               | It is a fair point, but how strong a point it is
               | remains to be seen. Some architectures are better than
               | others even with the same training data, so it's not
               | impossible that we'll at some point see innovative
               | architectures beating the current proprietary ones. It
               | would probably be short-lived though, as the
               | proprietary ones would obviously improve in their next
               | release after that.
        
               | jowea wrote:
               | How can open source models respectful of robots.txt
               | possibly perform equally if they are missing information
               | that the other models have access to?
        
               | datameta wrote:
               | How can we possibly find out without trying?
        
               | jowea wrote:
               | It is logically impossible for an LLM to, for example,
               | know that fooExecute() takes two int arguments if the
               | documentation is blocked by robots.txt and there are
               | no examples of fooExecute() usage in the wild, don't
               | you agree?
        
               | tharant wrote:
               | Sure, the model would not "know" about your example, but
               | that's not the point; the penultimate[0] goal is for the
               | model to figure out the method signature on its own just
               | like a human dev might leverage her own knowledge and
               | experience to infer that method signature. Intelligence
               | isn't just rote memorization.
               | 
               | [0] the ultimate, of course, being profit.
        
               | jowea wrote:
               | I don't think a human dev can divine a method signature
               | and effects in the general case either. Sure the add()
               | function probably takes 2 numbers, but maybe it takes a
               | list? Or a two-tuple? How would we or the LLM know
               | without having the documentation? And yeah sure the LLM
               | can look at the documentation while being used instead of
               | it being part of the training dataset, but that's
               | strictly inferior for practical uses, no?
               | 
               | I'm not sure if we're thinking of the same field of AI
               | development. I think I'm talking about the super-
               | autocomplete with integrated copy of all of digitalized
               | human knowledge, while you're talking about trying to do
               | (proto-)AGI. Is that it?
        
               | heavenlyblue wrote:
               | > Sure the add() function probably takes 2 numbers, but
               | maybe it takes a list? Or a two-tuple? How would we or
               | the LLM know without having the documentation?
               | 
               | You just listed possible options in the order of their
               | relative probability. A human would attempt to use
               | them in exactly that order.
        
               | diggan wrote:
               | I agree, but I also think it's less important. I don't
               | want a big fat LLM that has memorized every API out
               | there and whose weights have to be updated as soon as
               | an API changes. I like the current approach of Codex
               | (and similar tools), where they can look up the APIs
               | they need as they do the work, so the same weights
               | keep working no matter how much the APIs change.
        
               | Dylan16807 wrote:
               | Maybe the missing data makes it 3% worse but the
               | architecture is 5% better. Or your respect for robots.txt
               | gets you more funding and you gain a 4% advantage by
               | training longer.
               | 
               | Don't focus too much on a single variable, especially
               | when all the variables have diminishing returns.
        
               | lllllm wrote:
               | This is what this paper tries to answer:
               | https://arxiv.org/abs/2504.06219. The quality gap
               | between compliant and non-compliant data is
               | surprisingly small.
        
         | JKCalhoun wrote:
          | Is there not yet a source where the web has already been
          | scraped and boiled down to just the text? It would seem someone
         | would have created such a thing in order to save LLM training
         | from having to reinvent the wheel.
         | 
         | I understand the web is a dynamic thing but still it would seem
         | to be useful on some level.
        
           | CaptainFever wrote:
           | Common Crawl, maybe?
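            | 
            | Their WET files are already plain-text extracts of each
            | page. A rough sketch of reading one (the crawl label is
            | just an example, and this assumes the warcio package):
            | 
            |     import gzip
            |     import requests
            |     from warcio.archiveiterator import ArchiveIterator
            | 
            |     base = "https://data.commoncrawl.org/"
            |     # list of WET files for one crawl (example label)
            |     paths = requests.get(
            |         base + "crawl-data/CC-MAIN-2024-10/wet.paths.gz")
            |     first = gzip.decompress(paths.content).splitlines()[0]
            | 
            |     resp = requests.get(base + first.decode(), stream=True)
            |     for rec in ArchiveIterator(resp.raw):
            |         if rec.rec_type == "conversion":  # plain text
            |             print(rec.content_stream().read()[:200])
            |             break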
        
         | stephen_cagle wrote:
          | I wonder if the reason for these results is that any data
          | on the internet is already copied to other locations by
          | actors who ignore crawling opt-outs. So even if _they_
          | respect all web crawling opt-outs, they still effectively
          | end up with the data, because someone else who did not
          | respect the opt-out has republished it somewhere without
          | one.
        
           | conradkay wrote:
           | My guess is that it doesn't remove that much of the data, and
           | the post-training data (not just randomly scraped from the
           | web) probably matters more
        
           | lllllm wrote:
           | Yes this is an interesting question. In our arxiv paper [1]
           | we did study this for news articles, and also removed
           | duplicates of articles (decontamination). We did not observe
           | an impact on the downstream accuracy of the LLM, in the case
           | of news data.
           | 
           | [1] https://arxiv.org/abs/2504.06219
        
       | Bengalilol wrote:
        | Looking forward to putting it to the test.
        
       | greenavocado wrote:
       | Why would you announce this without a release? Be honest.
        
         | wood_spirit wrote:
         | The announcement was at the International Open-Source LLM
         | Builders Summit held this week in Switzerland. Is it so strange
         | that they announced what they are doing and the timeline?
        
         | JumpCrisscross wrote:
         | Funding? Deeply biasing European uses to publicly-developed
         | European LLMs (or at least not American or Chinese ones) would
         | make a lot of sense. (Potentially too much sense for Brussels.)
        
         | phtrivier wrote:
         | The cliche (at least on my side of the Alps) is that people in
         | Switzerland like to take theiiiir tiiiime.
        
           | Bengalilol wrote:
           | "Move as quickly as possible, but as slowly as necessary."
        
       | WeirderScience wrote:
       | The open training data is a huge differentiator. Is this the
       | first truly open dataset of this scale? Prior efforts like The
       | Pile were valuable, but had limitations. Curious to see how
       | reproducible the training is.
        
         | layer8 wrote:
         | > The model will be fully open: source code and weights will be
         | publicly available, and the training data will be transparent
         | and reproducible
         | 
         | This leads me to believe that the training data won't be made
         | publicly available in full, but merely be "reproducible". This
         | might mean that they'll provide references like a list of URLs
         | of the pages they trained on, but not their contents.
        
           | WeirderScience wrote:
           | Yeah, I suspect you're right. Still, even a list of URLs for
           | a frontier model (assuming it does turn out to be of that
           | level) would be welcome over the current situation.
        
           | glhaynes wrote:
           | That wouldn't seem reproducible if the content at those URLs
           | changes. (Er, unless it was all web.archive.org URLs or
           | something.)
        
             | dietr1ch wrote:
              | This is a problem with the Web. It should be easier to
              | download content, like pulling updates to a git repo.
        
           | TobTobXX wrote:
              | Well, when the actual content is hundreds of terabytes
              | in size, providing URLs may be more practical for them
              | and for others.
        
             | layer8 wrote:
             | The difference between content they are allowed to train on
             | vs. being allowed to distribute copies of is likely at
             | least as relevant.
        
             | sschueller wrote:
             | No problem, we have 25 Gbit/s home internet here. [1]
             | 
             | [1] https://www.init7.net/en/internet/fiber7/
        
         | evolvedlight wrote:
         | Yup, it's not a dataset packaged like you hope for here, as it
         | still contains traditionally copyrighted material
        
       | oytis wrote:
       | The press release talks a lot about how it was done, but very
       | little about how capabilities compare to other open models.
        
         | pantalaimon wrote:
         | It's a university, teaching the 'how it's done' is kind of the
         | point
        
           | EA-3167 wrote:
           | Sure, but usually you teach something that is inherently
           | useful, or can be applied to some sort of useful endeavor. In
           | this case I think it's fair to ask what the collision of two
           | bubbles really achieves, or if it's just a useful teaching
           | model, what it can be applied to.
        
         | joot82 wrote:
         | _The model will be released in two sizes -- 8 billion and 70
         | billion parameters [...]. The 70B version will rank among the
         | most powerful fully open models worldwide. [...] In late
         | summer, the LLM will be released under the Apache 2.0 License._
         | 
         | We'll find out in September if it's true?
        
           | k__ wrote:
           | I hope DeepSeek R2, but I fear Llama 4.
        
           | oytis wrote:
           | Yeah, I was thinking more of a table with benchmark results
        
       | wood_spirit wrote:
       | The article says
       | 
       | " Open LLMs are increasingly viewed as credible alternatives to
       | commercial systems, most of which are developed behind closed
       | doors in the United States or China"
       | 
        | It is obvious that the companies producing big LLMs today
        | have an incentive to enshittify them: pushing subscriptions
        | while also doing product placement, ads, etc. Worse, some
        | already promote political biases.
        | 
        | It would be wonderful if a partnership between academia and
        | government in Europe could build public-good search and AI
        | that endeavours to serve the user rather than the company.
        
         | klabb3 wrote:
         | Yes but it's a very complicated service to deliver. Even if
         | they train great models, they likely will not operationalize
         | them for inference. Those will still be private actors, and the
          | incentives to enshittify will be the same. Also, for AI the
          | incentives are generally much stronger than in the last
          | tech generation, due to the cost of running these things.
          | Basically, the free services where you're the product must
          | aggressively extract value out of you in order to make a
          | profit.
        
       | bee_rider wrote:
       | Is this setting the bar for dataset transparency? It seems like a
       | significant step forward. Assuming it works out, that is.
       | 
       | They missed an opportunity though. They should have called their
       | machine the AIps (AI Petaflops Supercomputer).
        
         | philipkglass wrote:
         | I think that the Allen Institute for Artificial Intelligence
         | OLMo models are also completely open:
         | 
         |  _OLMo is fully open_
         | 
         |  _Ai2 believes in the power of openness to build a future where
         | AI is accessible to all. Open weights alone aren't enough -
         | true openness requires models to be trained in the open with
         | fully open access to data, models, and code._
         | 
         | https://allenai.org/olmo
        
           | lamuswawir wrote:
           | I am a simple man, I see AI2, I upvote.
        
         | ekianjo wrote:
          | SmolLM is also completely open, as far as I know.
        
       | isusmelj wrote:
       | I hope they do well. AFAIK they're training or finetuning an
       | older LLaMA model, so performance might lag behind SOTA. But what
       | really matters is that ETH and EPFL get hands-on experience
       | training at scale. From what I've heard, the new AI cluster still
       | has teething problems. A lot of people underestimate how tough it
       | is to train models at this scale, especially on your own infra.
       | 
       | Disclaimer: I'm Swiss and studied at ETH. We've got the
       | brainpower, but not much large-scale training experience yet. And
       | IMHO, a lot of the "magic" in LLMs is infrastructure-driven.
        
         | luke-stanley wrote:
         | When I read "from scratch", I assume they are doing pre-
         | training, not just finetuning, do you have a different take? Do
         | you mean it's normal Llama architecture they're using? I'm
         | curious about the benchmarks!
        
         | andy99 wrote:
         | Imo, a lot of the magic is also dataset driven, specifically
         | the SFT and other fine tuning / RLHF data they have. That's
         | what has separated the models people actually use from the
         | also-rans.
         | 
         | I agree with everything you say about getting the experience,
         | the infrastructure is very important and is probably the most
         | critical part of a sovereign LLM supply chain. I would hope
         | there will also be enough focus on the data, early on, that the
         | model will be useful.
        
         | alfalfasprout wrote:
         | The infra does become pretty complex to get a SOTA LLM trained.
         | People assume it's as simple as loading up the architecture and
         | a dataset + using something like Ray. There's a lot that goes
         | into designing the dataset, the eval pipelines, the training
         | approach, maximizing the use of your hardware, dealing with
         | cross-node latency, recovering from errors, etc.
         | 
         | But it's good to have more and more players in this space.
        
         | lllllm wrote:
          | No, the model has nothing to do with Llama. We are using our
         | own architecture, and training from scratch. Llama also does
         | not have open training data, and is non-compliant, in contrast
         | to this model.
         | 
         | Source: I'm part of the training team
        
           | macawfish wrote:
           | Are you using dbpedia?
        
             | lllllm wrote:
              | No, the main source is FineWeb2, but with additional
              | filtering for compliance, toxicity removal, and quality
              | filters such as FineWeb2-HQ.
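              | 
              | For illustration only (a toy sketch, not our actual
              | pipeline; the dataset/config names and the threshold
              | below are assumptions), streaming a web dump and
              | applying a crude filter can look like this:
              | 
              |     from datasets import load_dataset
              | 
              |     # Stream one (language, script) config; the
              |     # config name is an assumption for this sketch.
              |     stream = load_dataset(
              |         "HuggingFaceFW/fineweb-2",
              |         name="deu_Latn",
              |         split="train",
              |         streaming=True,
              |     )
              | 
              |     def keep(doc):
              |         # crude quality filter: drop very short docs
              |         return len(doc.get("text", "").split()) >= 50
              | 
              |     filtered = (d for d in stream if keep(d))
              |     print(next(filtered)["text"][:200])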
        
               | PeterStuer wrote:
               | Thx for engaging here.
               | 
                | Can you comment on how the filtering impacted
                | language coverage? E.g. fineweb2 has 1800+ languages,
                | but some with very little actual representation,
                | while fineweb2-hq has just 20, but each with a
                | substantial data set.
                | 
                | (I'm personally most interested in covering the 24
                | official EU languages)
        
               | lllllm wrote:
                | We kept all 1800+ (script, language) pairs, not only
                | the quality-filtered ones. Whether a mix of
                | quality-filtered and unfiltered languages affects the
                | data mixing is still an open question. Preliminary
                | research (Section 4.2.7 of
                | https://arxiv.org/abs/2502.10361) indicates that
                | quality filtering can mitigate the curse of
                | multilinguality to some degree, and so facilitate
                | cross-lingual generalization, but it remains to be
                | seen how strong this effect is at larger scale.
        
           | danielhanchen wrote:
           | If you guys need help on GGUFs + Unsloth dynamic quants +
           | finetuning support via Unsloth
           | https://github.com/unslothai/unsloth on day 0 / 1, more than
           | happy to help :)
        
             | lllllm wrote:
                | Absolutely! I sent you a LinkedIn message last week,
                | but here seems to work much better. Thanks a lot!
        
           | isusmelj wrote:
            | Thanks for clarifying! I wish you the best of luck!
        
           | Al-Khwarizmi wrote:
           | So you're not going to use copyrighted data for training?
           | That's going to be a disadvantage with respect to LLaMa and
           | other well-known models, it's an open secret that everyone is
           | using everything they can get their hands on.
           | 
           | Good luck though, very needed project!
        
             | badsectoracula wrote:
              | Not sure about Swiss law, but the EU AI Act, and the
              | 2019/790 copyright directive it piggybacks on for this
              | topic, do allow training on copyrighted data as long as
              | any opt-out mechanisms (e.g. robots.txt) are respected.
              | AFAICT this LLM was trained while respecting those
              | mechanisms (and, as linked elsewhere, they didn't find
              | any practical difference in performance; note that
              | there is an exception that allows ignoring the opt-out
              | mechanisms for research purposes, so they could make
              | that comparison).
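              | 
              | The opt-out check itself is simple; a minimal Python
              | sketch (the user-agent string and URL are placeholders,
              | and real crawlers also honor signals beyond
              | robots.txt):
              | 
              |     from urllib.parse import urlparse
              |     import urllib.robotparser
              | 
              |     UA = "my-llm-crawler"  # placeholder name
              | 
              |     def may_fetch(url):
              |         """True if robots.txt allows fetching url."""
              |         parts = urlparse(url)
              |         rp = urllib.robotparser.RobotFileParser()
              |         base = f"{parts.scheme}://{parts.netloc}"
              |         rp.set_url(base + "/robots.txt")
              |         rp.read()  # download and parse robots.txt
              |         return rp.can_fetch(UA, url)
              | 
              |     print(may_fetch("https://example.com/docs/api"))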
        
           | moffkalast wrote:
           | L3 has open pretraining data, it's just not official for
           | obvious legal reasons:
           | https://huggingface.co/datasets/HuggingFaceFW/fineweb
        
           | d3m0t3p wrote:
            | Hey, really cool project, I'm excited to see the outcome.
            | Is there a blog / paper summarizing how you are doing it?
            | Also, which research group at ETH is currently working on
            | it?
        
         | asjir wrote:
          | I'd be more concerned about the size being 70B (DeepSeek R1
          | has 671B), which makes catching up with SOTA kinda more
          | difficult to begin with.
        
           | zettabomb wrote:
           | SOTA performance is relative to model size. If it performs
           | better than other models in the 70B range (e.g. Llama 3.3)
           | then it could be quite useful. Not everyone has the VRAM to
           | run the full fat Deepseek R1.
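            | 
            | A rough back-of-the-envelope for weight memory alone
            | (4-bit quantization, ignoring KV cache and runtime
            | overhead):
            | 
            |     # weight memory only, ignoring KV cache etc.
            |     def weight_gb(params_b, bits=4):
            |         return params_b * 1e9 * bits / 8 / 1e9
            | 
            |     for name, p in [("70B dense", 70),
            |                     ("DeepSeek R1 671B", 671)]:
            |         print(f"{name}: ~{weight_gb(p):.0f} GB")
            |     # 70B dense: ~35 GB
            |     # DeepSeek R1 671B: ~336 GB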
        
             | tough wrote:
              | Also, isn't DeepSeek's a Mixture of Experts model,
              | meaning not all params get activated on one forward
              | pass?
             | 
             | 70B feels like the best balance between usable locally and
             | decent for regular use.
             | 
             | maybe not SOTA, but a great first step.
        
       | hubraumhugo wrote:
       | Pretty proud to see this at the top of HN as a Swiss (and I know
       | many are lurking here!). These two universities produce world-
       | class founders, researchers, and engineers. Yet, we always stay
       | in the shadow of the US. With our top-tier public infrastructure,
       | education, and political stability (+ neutrality), we have a
        | unique opportunity to build something exceptional in the open LLM
       | space.
        
         | RHSman2 wrote:
         | I work with EPFL alumni. Brilliant minds.
        
         | MITSardine wrote:
         | I think EPFL and ETH are generally well known internationally,
         | but Switzerland being rather small (9M pop), it's only natural
         | you don't hear much about it compared to other larger
         | countries!
        
       | amelius wrote:
       | Yeah, that's what "democratizing AI" means.
        
       | nektro wrote:
       | gross use of public infrastructure
        
         | protocolture wrote:
          | I literally can't fault this, even steelmanning anti-AI
          | positions. What makes you say that?
        
         | PetitPrince wrote:
          | Some time ago there was a Tom Scott video about the fastest-
          | accelerating car in the world, developed by a team made up
          | mostly of students. One remark stayed with me: "the goal is
          | not to build a car, but to build engineers".
          | 
          | In that regard it's absolutely not a waste of public infra,
          | just like that car was not a waste.
        
         | herbst wrote:
          | It even used green power. Literally zero complaints or
          | outcry from the public yet. Guess we like progress,
          | especially if it helps independence.
        
         | MITSardine wrote:
         | University and research clusters are built to run research
         | code. I can guarantee this project is 10x as impactful and
         | interesting as what usually runs on these machines. This coming
         | from someone in the area that usually hogs these machines
         | (numerical simulation). I'm very excited to see academic actors
         | tackle LLMs.
        
       | westurner wrote:
       | Use case for science and code LLMs: Superhydrodynamic gravity
       | (SQR / SQG, )
       | 
       | LLMs do seem to favor general relativity but probably would've
       | favored classical mechanics at the time given the training
       | corpora.
       | 
       | Not-yet unified: Quantum gravity, QFT, "A unified model must: "
       | https://news.ycombinator.com/item?id=44289148
       | 
       | Will be interested to see how this model responds to currently
       | unresolvable issues in physics. Is it an open or a closed world
       | mentality and/or a conditioned disclaimer which encourages
       | progress?
       | 
       | What are the current benchmarks?
       | 
       | From https://news.ycombinator.com/item?id=42899805 re: "Large
       | Language Models for Mathematicians" (2023) :
       | 
       | > _Benchmarks for math and physics LLMs: FrontierMath, TheoremQA,
       | Multi SWE-bench:https://news.ycombinator.com/item?id=42097683 _
       | 
       | Multi-SWE-bench: A Multi-Lingual and Multi-Modal GitHub Issue
       | Resolving Benchmark: https://multi-swe-bench.github.io/
       | 
       | Add'l LLM benchmarks and awesome lists:
       | https://news.ycombinator.com/item?id=44485226
       | 
        | Microsoft has a new datacenter that you don't have to keep
        | adding water to, which spares the aquifers.
       | 
       | How to use this LLM to solve energy and sustainability problems
       | all LLMs exacerbate? Solutions for the Global Goals, hopefully
        
       | seydor wrote:
        | I wonder if multilingual LLMs are better or worse compared to
        | a single-language model.
        
         | tugdual wrote:
          | This is an interesting problem with various challenges.
          | Most tokenizers today are trained using byte pair encoding,
          | where the most commonly seen combinations of characters are
          | merged into single tokens. Because the training corpora are
          | dominated by English, most of the resulting merges are
          | English ones, so your LLM ends up with a better (more
          | efficient) tokenization of English than of the other
          | languages it is trained on.
         | 
         | C.f. https://medium.com/@biswanai92/understanding-token-
         | fertility...
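          | 
          | A quick way to see this is to measure token "fertility"
          | (tokens per word) with an English-centric BPE tokenizer; a
          | small sketch (the GPT-2 tokenizer and the sample sentences
          | are arbitrary choices):
          | 
          |     from transformers import AutoTokenizer
          | 
          |     # higher tokens-per-word = less efficient encoding
          |     tok = AutoTokenizer.from_pretrained("gpt2")
          | 
          |     samples = {
          |         "en": "The weather is very nice today.",
          |         "de": "Das Wetter ist heute sehr gut.",
          |         "sw": "Hali ya hewa ni nzuri sana leo.",
          |     }
          |     for lang, text in samples.items():
          |         n_tok = len(tok.tokenize(text))
          |         n_words = len(text.split())
          |         print(lang, round(n_tok / n_words, 2))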
        
       | mukeshyadavnitt wrote:
       | nice
        
       | contrarian1234 wrote:
       | This seems like the equivalent of a university designing an ICE
       | car...
       | 
       | What does anyone get out of this when we have open weight models
       | already ?
       | 
       | Are they going to do very innovative AI research that companies
       | wouldn't dare try/fund? Seems unlikely ..
       | 
       | Is it a moonshot huge project that no single company could
       | fund..? Not that either
       | 
       | If it's just a little fun to train the next generation of LLM
       | researchers.. Then you might as well just make a small scale toy
       | instead of using up a super computer center
        
         | urvader wrote:
          | This model will be one of the few open models where the
          | training data is also open, which makes it ideal for fine
          | tuning.
        
         | chvid wrote:
         | That it will actually be open and reproducible?
         | 
         | Including how it was trained, what data was used, how training
         | data was synthesized, how other models were used etc. All the
         | stuff that is kept secret in case of llama, deepseek etc.
        
         | herbst wrote:
          | Why do you think it's about money? IMO it's about much more
          | than that, like independence and actual data freedom
          | through reproducible LLMs.
        
         | MITSardine wrote:
          | Supercomputers are used daily for far more toy-like codes
          | in research; be glad this at least interests the public and
          | constitutes a foray of academia into new areas.
        
       | defraudbah wrote:
       | ETH Zurich is doing so many amazing things that I want to go
       | study there. Unbelievable how many great people are coming from
       | that university
        
         | blue_light_man wrote:
         | It's also possible you just think of ETH Zurich as great and
         | automatically associate the people and products as amazing.
         | Could be a circular dependency here.
        
           | rtaylorgarlock wrote:
           | That is indeed how things work. I can think of a few 'good'
           | media-relevant examples, including e.g. that recent super-
           | quick cart project [1], that reach beyond the more vanilla
           | startup-spinoffs or basic media efforts.
           | 
           | 1 https://ethz.ch/en/news-and-events/eth-
           | news/news/2023/09/fro...
        
           | datameta wrote:
           | I took courses online from ETH Zurich before the formula was
           | "perfected" and I'd say they were ahead of the curve in
           | quality, concise but info-dense educational content.
        
       | Tepix wrote:
       | How does it compare to Teuken and EuroLLM?
        
       | sschueller wrote:
        | Yet Switzerland was put on the Tier 2 list[1] of countries
        | that do not get unlimited access to the top AI chips.
       | 
       | [1] https://www.bluewin.ch/en/news/usa-restricts-swiss-access-
       | to...
       | 
       | [2] https://chplusplus.org/u-s-export-controls-on-ai-chips/
        
       | kisamoto wrote:
       | Any info on context length or comparable performance? Press
       | release is unfortunately lacking on technical details.
       | 
       | Also I'm curious if there was any reason to make such a PR
       | without actually releasing the model (due Summer)? What's the
       | delay? Or rather what was the motivation for a PR?
        
       ___________________________________________________________________
       (page generated 2025-07-12 23:00 UTC)