[HN Gopher] ETH Zurich and EPFL to release a LLM developed on pu...
___________________________________________________________________
ETH Zurich and EPFL to release a LLM developed on public
infrastructure
Author : andy99
Score : 635 points
Date : 2025-07-11 18:45 UTC (1 day ago)
(HTM) web link (ethz.ch)
(TXT) w3m dump (ethz.ch)
| k__ wrote:
| _" respecting web crawling opt-outs during data acquisition
| produces virtually no performance degradation"_
|
| Great to read that!
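|
| For anyone curious what "respecting opt-outs" looks like
| mechanically, here is a minimal sketch using Python's standard
| urllib.robotparser (the crawler name and URL are placeholders,
| not anything from the project):
|
|   from urllib.parse import urlsplit, urlunsplit
|   from urllib.robotparser import RobotFileParser
|
|   def allowed_to_fetch(url: str, agent: str = "ExampleBot") -> bool:
|       """Check a site's robots.txt before fetching, honoring opt-outs."""
|       parts = urlsplit(url)
|       robots = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
|       rp = RobotFileParser()
|       rp.set_url(robots)
|       rp.read()  # fetches and parses robots.txt
|       return rp.can_fetch(agent, url)
|
|   if allowed_to_fetch("https://example.com/docs/page.html"):
|       pass  # only then fetch and add the page to the corpus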
| Onavo wrote:
| No performance degradation on training metrics except for the
| end user. At the end of the day users and website owners have
| completely orthogonal interests. Users want answers and
| content, website owners want attention so they can upsell/push
| ads. You can only serve one master.
| esafak wrote:
| > Users want answers and content, website owners want
| attention so they can upsell/push ads. You can only serve one
| master
|
| How are you going to serve users if web site owners decide to
| wall their content? You can't ignore one side of the market.
| Onavo wrote:
| You don't. You bypass them with crawlers and don't reveal
| your training data. And this is exactly why open source
| models can't surpass open weight models.
| diggan wrote:
| > And this is exactly why open source models can't
| surpass open weight models.
|
| It is a fair point, but how strong a point it is remains to be
| seen. Some architectures are better than others even with the
| same training data, so it's not impossible that we could at some
| point see innovative architectures beating the current
| proprietary ones. It would probably be short-lived though, as the
| proprietary ones would obviously improve in their next release
| after that.
| jowea wrote:
| How can open source models respectful of robots.txt
| possibly perform equally if they are missing information
| that the other models have access to?
| datameta wrote:
| How can we possibly find out without trying?
| jowea wrote:
| It is logically impossible for an LLM to know, for example, that
| fooExecute() takes two int arguments if the documentation is
| blocked by robots.txt and there are no examples of fooExecute()
| usage in the wild, don't you agree?
| tharant wrote:
| Sure, the model would not "know" about your example, but
| that's not the point; the penultimate[0] goal is for the
| model to figure out the method signature on its own just
| like a human dev might leverage her own knowledge and
| experience to infer that method signature. Intelligence
| isn't just rote memorization.
|
| [0] the ultimate, of course, being profit.
| jowea wrote:
| I don't think a human dev can divine a method signature
| and effects in the general case either. Sure the add()
| function probably takes 2 numbers, but maybe it takes a
| list? Or a two-tuple? How would we or the LLM know
| without having the documentation? And yeah sure the LLM
| can look at the documentation while being used instead of
| it being part of the training dataset, but that's
| strictly inferior for practical uses, no?
|
| I'm not sure if we're thinking of the same field of AI
| development. I think I'm talking about the super-
| autocomplete with integrated copy of all of digitalized
| human knowledge, while you're talking about trying to do
| (proto-)AGI. Is that it?
| heavenlyblue wrote:
| > Sure the add() function probably takes 2 numbers, but
| maybe it takes a list? Or a two-tuple? How would we or
| the LLM know without having the documentation?
|
| You just listed the possible options in the order of their
| relative probability. A human would attempt to use them in
| exactly that order.
| diggan wrote:
| I agree, but also think it's less important. I don't want
| a big fat LLM that memorized every API out there, where as
| soon as an API changes, the weights have to be updated. I
| like the current approach of Codex (and similar) where
| they can look up the APIs they need to use as they're
| doing the work instead, so same weights will continue to
| work no matter how much the APIs change.
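|
| A hypothetical sketch of that "look the docs up at inference
| time" pattern; the tool name and dispatch loop are illustrative,
| not any specific agent framework's API:
|
|   import urllib.request
|
|   def fetch_docs(url: str, max_bytes: int = 20_000) -> str:
|       """Tool the model can call to read current API docs on demand."""
|       with urllib.request.urlopen(url) as resp:
|           return resp.read(max_bytes).decode("utf-8", errors="replace")
|
|   TOOLS = {"fetch_docs": fetch_docs}
|
|   def run_tool_call(name: str, **kwargs) -> str:
|       # In a real agent loop, the model emits (name, kwargs) and the
|       # runtime dispatches it, feeding the result back into context.
|       return TOOLS[name](**kwargs)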
| Dylan16807 wrote:
| Maybe the missing data makes it 3% worse but the
| architecture is 5% better. Or your respect for robots.txt
| gets you more funding and you gain a 4% advantage by
| training longer.
|
| Don't focus too much on a single variable, especially
| when all the variables have diminishing returns.
| lllllm wrote:
| this is what this paper tries to answer:
| https://arxiv.org/abs/2504.06219 . the quality gap between
| compliant and non-compliant training turns out to be
| surprisingly small
| JKCalhoun wrote:
| Is there not yet a Source where the web has already been
| scraped and souped down to just the text? It would seem someone
| would have created such a thing in order to save LLM training
| from having to reinvent the wheel.
|
| I understand the web is a dynamic thing but still it would seem
| to be useful on some level.
| CaptainFever wrote:
| Common Crawl, maybe?
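|
| Common Crawl is the raw crawl; text-only distillations of it
| (e.g. FineWeb, linked further down in this thread) are closer to
| what you describe. A sketch of streaming one such corpus with the
| Hugging Face datasets library (dataset id and field name as I
| recall them, so treat them as assumptions):
|
|   from datasets import load_dataset
|
|   # Stream instead of downloading the full multi-TB dataset.
|   fineweb = load_dataset("HuggingFaceFW/fineweb", split="train",
|                          streaming=True)
|   for i, doc in enumerate(fineweb):
|       print(doc["text"][:200])  # records are already boiled down to text
|       if i == 2:
|           break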
| stephen_cagle wrote:
| I wonder if the reason for these results is that any data on the
| internet is already copied to other locations by actors who
| ignore crawling opt-outs. So even if _they_ respect all web
| crawling opt-outs, they still effectively end up with the data,
| because someone else who did not respect the opt-out has
| re-hosted it somewhere without one.
| conradkay wrote:
| My guess is that it doesn't remove that much of the data, and
| the post-training data (not just randomly scraped from the
| web) probably matters more
| lllllm wrote:
| Yes this is an interesting question. In our arxiv paper [1]
| we did study this for news articles, and also removed
| duplicates of articles (decontamination). We did not observe
| an impact on the downstream accuracy of the LLM, in the case
| of news data.
|
| [1] https://arxiv.org/abs/2504.06219
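|
| (For readers unfamiliar with that step: below is a toy sketch of
| exact-duplicate removal, just to illustrate the idea. It is not
| the pipeline from the paper, which also handles near-duplicates.)
|
|   import hashlib, re
|
|   def normalize(text: str) -> str:
|       return re.sub(r"\s+", " ", text.lower()).strip()
|
|   def dedup(docs):
|       seen, kept = set(), []
|       for doc in docs:
|           h = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
|           if h not in seen:
|               seen.add(h)
|               kept.append(doc)
|       return kept
|
|   articles = ["Breaking news:  X happened.",
|               "breaking news: x happened.", "Other story."]
|   print(len(dedup(articles)))  # -> 2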
| Bengalilol wrote:
| Looking forward to proof-testing it.
| greenavocado wrote:
| Why would you announce this without a release? Be honest.
| wood_spirit wrote:
| The announcement was at the International Open-Source LLM
| Builders Summit held this week in Switzerland. Is it so strange
| that they announced what they are doing and the timeline?
| JumpCrisscross wrote:
| Funding? Deeply biasing European uses to publicly-developed
| European LLMs (or at least not American or Chinese ones) would
| make a lot of sense. (Potentially too much sense for Brussels.)
| phtrivier wrote:
| The cliche (at least on my side of the Alps) is that people in
| Switzerland like to take theiiiir tiiiime.
| Bengalilol wrote:
| "Move as quickly as possible, but as slowly as necessary."
| WeirderScience wrote:
| The open training data is a huge differentiator. Is this the
| first truly open dataset of this scale? Prior efforts like The
| Pile were valuable, but had limitations. Curious to see how
| reproducible the training is.
| layer8 wrote:
| > The model will be fully open: source code and weights will be
| publicly available, and the training data will be transparent
| and reproducible
|
| This leads me to believe that the training data won't be made
| publicly available in full, but merely be "reproducible". This
| might mean that they'll provide references like a list of URLs
| of the pages they trained on, but not their contents.
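|
| If that's the route they take, "reproducible" could still be
| verifiable. A sketch of what such a manifest-based setup might
| look like (the manifest format is hypothetical, not anything from
| the announcement):
|
|   import csv, hashlib, urllib.request
|
|   def fetch_text(url: str) -> str:
|       with urllib.request.urlopen(url, timeout=30) as resp:
|           return resp.read().decode("utf-8", errors="replace")
|
|   def verify(manifest_path: str):
|       # manifest columns: url, sha256 (hash of the text trained on)
|       with open(manifest_path, newline="") as f:
|           for row in csv.DictReader(f):
|               digest = hashlib.sha256(
|                   fetch_text(row["url"]).encode("utf-8")).hexdigest()
|               yield row["url"], digest == row["sha256"]
|
|   # for url, ok in verify("manifest.csv"):
|   #     print(("OK " if ok else "DRIFTED ") + url)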
| WeirderScience wrote:
| Yeah, I suspect you're right. Still, even a list of URLs for
| a frontier model (assuming it does turn out to be of that
| level) would be welcome over the current situation.
| glhaynes wrote:
| That wouldn't seem reproducible if the content at those URLs
| changes. (Er, unless it was all web.archive.org URLs or
| something.)
| dietr1ch wrote:
| This is a problem with the Web. Downloading content should be as
| easy as updating a git repo.
| TobTobXX wrote:
| Well, when the actual content is 100s of terabytes big,
| providing URLs may be more practical for them and for others.
| layer8 wrote:
| The difference between content they are allowed to train on
| vs. being allowed to distribute copies of is likely at
| least as relevant.
| sschueller wrote:
| No problem, we have 25 Gbit/s home internet here. [1]
|
| [1] https://www.init7.net/en/internet/fiber7/
| evolvedlight wrote:
| Yup, it's not a dataset packaged like you hope for here, as it
| still contains traditionally copyrighted material
| oytis wrote:
| The press release talks a lot about how it was done, but very
| little about how capabilities compare to other open models.
| pantalaimon wrote:
| It's a university, teaching the 'how it's done' is kind of the
| point
| EA-3167 wrote:
| Sure, but usually you teach something that is inherently
| useful, or can be applied to some sort of useful endeavor. In
| this case I think it's fair to ask what the collision of two
| bubbles really achieves, or if it's just a useful teaching
| model, what it can be applied to.
| joot82 wrote:
| _The model will be released in two sizes -- 8 billion and 70
| billion parameters [...]. The 70B version will rank among the
| most powerful fully open models worldwide. [...] In late
| summer, the LLM will be released under the Apache 2.0 License._
|
| We'll find out in September if it's true?
| k__ wrote:
| I hope DeepSeek R2, but I fear Llama 4.
| oytis wrote:
| Yeah, I was thinking more of a table with benchmark results
| wood_spirit wrote:
| The article says
|
| " Open LLMs are increasingly viewed as credible alternatives to
| commercial systems, most of which are developed behind closed
| doors in the United States or China"
|
| It is obvious that the companies producing big LLMs today have
| the incentive to try to enshitify them. Trying to get
| subscriptions at the same time as trying to do product placement
| ads etc. Worse, some already have political biases they promote.
|
| It would be wonderful if a partnership between academia and
| government in Europe can do a public good search and AI that
| endeavours to serve the user over the company.
| klabb3 wrote:
| Yes but it's a very complicated service to deliver. Even if
| they train great models, they likely will not operationalize
| them for inference. Those will still be private actors, and the
| incentives to enshittify will be the same. Also, for AI
| generally the incentives are much stronger than in the last tech
| generation, due to the cost of running these things. Basically, the
| free services where you're the product must aggressively
| extract value out of you in order to make a profit.
| bee_rider wrote:
| Is this setting the bar for dataset transparency? It seems like a
| significant step forward. Assuming it works out, that is.
|
| They missed an opportunity though. They should have called their
| machine the AIps (AI Petaflops Supercomputer).
| philipkglass wrote:
| I think that the Allen Institute for Artificial Intelligence
| OLMo models are also completely open:
|
| _OLMo is fully open_
|
| _Ai2 believes in the power of openness to build a future where
| AI is accessible to all. Open weights alone aren't enough -
| true openness requires models to be trained in the open with
| fully open access to data, models, and code._
|
| https://allenai.org/olmo
| lamuswawir wrote:
| I am a simple man, I see AI2, I upvote.
| ekianjo wrote:
| Smollm is also completely open as far as I know
| isusmelj wrote:
| I hope they do well. AFAIK they're training or finetuning an
| older LLaMA model, so performance might lag behind SOTA. But what
| really matters is that ETH and EPFL get hands-on experience
| training at scale. From what I've heard, the new AI cluster still
| has teething problems. A lot of people underestimate how tough it
| is to train models at this scale, especially on your own infra.
|
| Disclaimer: I'm Swiss and studied at ETH. We've got the
| brainpower, but not much large-scale training experience yet. And
| IMHO, a lot of the "magic" in LLMs is infrastructure-driven.
| luke-stanley wrote:
| When I read "from scratch", I assume they are doing pre-
| training, not just finetuning, do you have a different take? Do
| you mean it's normal Llama architecture they're using? I'm
| curious about the benchmarks!
| andy99 wrote:
| Imo, a lot of the magic is also dataset driven, specifically
| the SFT and other fine tuning / RLHF data they have. That's
| what has separated the models people actually use from the
| also-rans.
|
| I agree with everything you say about getting the experience,
| the infrastructure is very important and is probably the most
| critical part of a sovereign LLM supply chain. I would hope
| there will also be enough focus on the data, early on, that the
| model will be useful.
| alfalfasprout wrote:
| The infra does become pretty complex to get a SOTA LLM trained.
| People assume it's as simple as loading up the architecture and
| a dataset + using something like Ray. There's a lot that goes
| into designing the dataset, the eval pipelines, the training
| approach, maximizing the use of your hardware, dealing with
| cross-node latency, recovering from errors, etc.
|
| But it's good to have more and more players in this space.
| lllllm wrote:
| No, the model has nothing to do with Llama. We are using our
| own architecture, and training from scratch. Llama also does
| not have open training data, and is non-compliant, in contrast
| to this model.
|
| Source: I'm part of the training team
| macawfish wrote:
| Are you using dbpedia?
| lllllm wrote:
| no. the main source is fineweb2, but with additional
| filtering for compliance, toxicity removal, and quality
| filters such as fineweb2-hq
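|
| (not the project's actual pipeline, just a toy sketch of that
| kind of filtering pass; the dataset/config ids and heuristics
| below are assumptions for illustration)
|
|   from datasets import load_dataset
|
|   OPTED_OUT = {"some-opted-out-domain.example"}  # compliance list
|
|   def passes_filters(doc) -> bool:
|       if any(d in doc.get("url", "") for d in OPTED_OUT):
|           return False                 # drop opted-out sources
|       if len(doc["text"]) < 200:
|           return False                 # crude quality heuristic
|       return True  # real pipelines add toxicity classifiers etc.
|
|   stream = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn",
|                         split="train", streaming=True)
|   filtered = (doc for doc in stream if passes_filters(doc))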
| PeterStuer wrote:
| Thx for engaging here.
|
| Can you comment on how the filtering impacted language coverage?
| E.g. fineweb2 has 1800+ languages, but some with very little
| actual representation, while fineweb2-hq has just 20, each with a
| substantial data set.
|
| (I'm personally most interested in covering the 24 official EU
| languages)
| lllllm wrote:
| we kept all 1800+ (script/language) pairs, not only the
| quality-filtered ones. whether mixing quality-filtered and
| unfiltered languages affects the blend is still an open question.
| preliminary research (Section 4.2.7 of
| https://arxiv.org/abs/2502.10361 ) indicates that quality
| filtering can mitigate the curse of multilinguality to some
| degree, and so facilitate cross-lingual generalization, but it
| remains to be seen how strong this effect is at larger scale
| danielhanchen wrote:
| If you guys need help on GGUFs + Unsloth dynamic quants +
| finetuning support via Unsloth
| https://github.com/unslothai/unsloth on day 0 / 1, more than
| happy to help :)
| lllllm wrote:
| absolutely! i sent you a linkedin message last week, but here
| seems to work much better. thanks a lot!
| isusmelj wrote:
| Thanks for clarifying! I wish you all the best luck!
| Al-Khwarizmi wrote:
| So you're not going to use copyrighted data for training?
| That's going to be a disadvantage with respect to LLaMa and
| other well-known models, it's an open secret that everyone is
| using everything they can get their hands on.
|
| Good luck though, very needed project!
| badsectoracula wrote:
| Not sure about Swiss law, but the EU AI Act, and the 2019/790
| Digital Single Market copyright directive it piggybacks on for
| this topic, do allow training on copyrighted data as long as any
| opt-out mechanisms (e.g. robots.txt) are respected. AFAICT this
| LLM was trained by respecting those mechanisms (and, as linked
| elsewhere, they didn't find any practical difference in
| performance; note that there is an exception that allows ignoring
| the opt-out mechanisms for research purposes, so they could make
| that comparison).
| moffkalast wrote:
| L3 has open pretraining data, it's just not official for
| obvious legal reasons:
| https://huggingface.co/datasets/HuggingFaceFW/fineweb
| d3m0t3p wrote:
| Hey, really cool project, I'm excited to see the outcome. Is
| there a blog / paper summarizing how you are doing it? Also,
| which research group at ETH is currently working on it?
| asjir wrote:
| I'd be more concerned about the size used being 70b (deepseek
| r1 has 671b) which makes catching up with SOTA kinda more
| difficult to begin with.
| zettabomb wrote:
| SOTA performance is relative to model size. If it performs
| better than other models in the 70B range (e.g. Llama 3.3)
| then it could be quite useful. Not everyone has the VRAM to
| run the full fat Deepseek R1.
| tough wrote:
| also, isn't DeepSeek's a Mixture of Experts? meaning not all
| params ever get activated in one forward pass?
|
| 70B feels like the best balance between usable locally and
| decent for regular use.
|
| maybe not SOTA, but a great first step.
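|
| Rough numbers behind that trade-off (back-of-the-envelope math
| for the weights alone, ignoring KV cache and runtime overhead;
| the ~37B active-parameter figure for R1 is from its model card):
|
|   def weight_gb(params_b: float, bits_per_param: float) -> float:
|       # params (billions) * bits per param / 8 bits per byte -> GB
|       return params_b * bits_per_param / 8
|
|   sizes = [("dense 70B", 70), ("DeepSeek-R1 total", 671),
|            ("DeepSeek-R1 active per token", 37)]
|   for label, p in sizes:
|       print(f"{label:>28}: fp16 {weight_gb(p, 16):6.1f} GB | "
|             f"4-bit {weight_gb(p, 4):6.1f} GB")
|
|   # A dense 70B at 4-bit (~35 GB) fits on one or two consumer GPUs;
|   # a 671B MoE still needs all experts resident (~336 GB at 4-bit)
|   # even though only ~37B parameters are active per token.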
| hubraumhugo wrote:
| Pretty proud to see this at the top of HN as a Swiss (and I know
| many are lurking here!). These two universities produce world-
| class founders, researchers, and engineers. Yet, we always stay
| in the shadow of the US. With our top-tier public infrastructure,
| education, and political stability (+ neutrality), we have a
| unique opportunity to build something exceptional in the open LLM
| space.
| RHSman2 wrote:
| I work with EPFL alumni. Brilliant minds.
| MITSardine wrote:
| I think EPFL and ETH are generally well known internationally,
| but Switzerland being rather small (9M pop), it's only natural
| you don't hear much about it compared to other larger
| countries!
| amelius wrote:
| Yeah, that's what "democratizing AI" means.
| nektro wrote:
| gross use of public infrastructure
| protocolture wrote:
| I literally cant fault this, even steelmanning anti AI
| positions. What makes you say that?
| PetitPrince wrote:
| Some time ago there was a Tom Scott video about the fastest-
| accelerating car in the world, developed by a team made up mostly
| of students. One remark stayed with me: "the goal is not to build
| a car, but to build engineers".
|
| In that regard it's absolutely not a waste of public infra just
| like this car was not a waste.
| herbst wrote:
| It even used green power. Literally zero complaints or outcry
| from the public yet. Guess we like progress, especially if it
| helps independence.
| MITSardine wrote:
| University and research clusters are built to run research
| code. I can guarantee this project is 10x as impactful and
| interesting as what usually runs on these machines. This coming
| from someone in the area that usually hogs these machines
| (numerical simulation). I'm very excited to see academic actors
| tackle LLMs.
| westurner wrote:
| Use case for science and code LLMs: Superhydrodynamic gravity
| (SQR / SQG)
|
| LLMs do seem to favor general relativity but probably would've
| favored classical mechanics at the time given the training
| corpora.
|
| Not-yet unified: Quantum gravity, QFT, "A unified model must: "
| https://news.ycombinator.com/item?id=44289148
|
| Will be interested to see how this model responds to currently
| unresolvable issues in physics. Is it an open or a closed world
| mentality and/or a conditioned disclaimer which encourages
| progress?
|
| What are the current benchmarks?
|
| From https://news.ycombinator.com/item?id=42899805 re: "Large
| Language Models for Mathematicians" (2023) :
|
| > _Benchmarks for math and physics LLMs: FrontierMath, TheoremQA,
| Multi SWE-bench: https://news.ycombinator.com/item?id=42097683_
|
| Multi-SWE-bench: A Multi-Lingual and Multi-Modal GitHub Issue
| Resolving Benchmark: https://multi-swe-bench.github.io/
|
| Add'l LLM benchmarks and awesome lists:
| https://news.ycombinator.com/item?id=44485226
|
| Microsoft has a new datacenter that you don't have to keep adding
| water to, which spares the aquifers.
|
| How to use this LLM to solve energy and sustainability problems
| all LLMs exacerbate? Solutions for the Global Goals, hopefully
| seydor wrote:
| I wonder if multilingual LLMs are better or worse compared to a
| single-language model
| tugdual wrote:
| This is an interesting problem with various challenges. Currently
| most tokenizers are trained using byte pair encoding, where the
| most commonly seen combinations of characters are merged into a
| single token. Since the training text is mostly English, the
| majority of the learned merges are English ones, meaning the LLM
| tokenizes English more efficiently than the other languages it is
| trained on (see the small example below the link).
|
| C.f. https://medium.com/@biswanai92/understanding-token-
| fertility...
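|
| A quick way to see the effect (token "fertility" = tokens per
| word); the tokenizer choice here is just an example, any BPE
| tokenizer trained mostly on English shows it to some degree:
|
|   from transformers import AutoTokenizer
|
|   tok = AutoTokenizer.from_pretrained("gpt2")
|   samples = {
|       "English": "The weather is nice today.",
|       "German": "Das Wetter ist heute schön.",
|   }
|   for lang, text in samples.items():
|       n_tok, n_words = len(tok.tokenize(text)), len(text.split())
|       print(f"{lang}: {n_tok} tokens / {n_words} words "
|             f"= fertility {n_tok / n_words:.2f}")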
| mukeshyadavnitt wrote:
| nice
| contrarian1234 wrote:
| This seems like the equivalent of a university designing an ICE
| car...
|
| What does anyone get out of this when we have open weight models
| already ?
|
| Are they going to do very innovative AI research that companies
| wouldn't dare try/fund? Seems unlikely ..
|
| Is it a moonshot huge project that no single company could
| fund..? Not that either
|
| If it's just a little fun to train the next generation of LLM
| researchers.. Then you might as well just make a small scale toy
| instead of using up a super computer center
| urvader wrote:
| This model will be one of the few open models where the
| training data is also open which makes it ideal for fine
| tuning.
| chvid wrote:
| That it will actually be open and reproducible?
|
| Including how it was trained, what data was used, how training
| data was synthesized, how other models were used etc. All the
| stuff that is kept secret in case of llama, deepseek etc.
| herbst wrote:
| Why do you think it's about money? IMO it's about much more
| than that, like independence and actual data freedom trough
| reproductive LLMs
| MITSardine wrote:
| Super computers are being used daily for much toy-ier codes in
| research, be glad this at least interests the public and
| constitutes a foray of academia into new areas.
| defraudbah wrote:
| ETH Zurich is doing so many amazing things that I want to go
| study there. Unbelievable how many great people are coming from
| that university
| blue_light_man wrote:
| It's also possible you just think of ETH Zurich as great and
| automatically associate the people and products as amazing.
| Could be a circular dependency here.
| rtaylorgarlock wrote:
| That is indeed how things work. I can think of a few 'good'
| media-relevant examples, including e.g. that recent super-
| quick cart project [1], that reach beyond the more vanilla
| startup-spinoffs or basic media efforts.
|
| 1 https://ethz.ch/en/news-and-events/eth-
| news/news/2023/09/fro...
| datameta wrote:
| I took courses online from ETH Zurich before the formula was
| "perfected" and I'd say they were ahead of the curve in
| quality, concise but info-dense educational content.
| Tepix wrote:
| How does it compare to Teuken and EuroLLM?
| sschueller wrote:
| Yet, Switzerland was put in the Tier 2 list[1], not among the
| countries that get unlimited access to the top AI chips.
|
| [1] https://www.bluewin.ch/en/news/usa-restricts-swiss-access-
| to...
|
| [2] https://chplusplus.org/u-s-export-controls-on-ai-chips/
| kisamoto wrote:
| Any info on context length or comparable performance? Press
| release is unfortunately lacking on technical details.
|
| Also I'm curious if there was any reason to make such a PR
| without actually releasing the model (due Summer)? What's the
| delay? Or rather what was the motivation for a PR?
___________________________________________________________________
(page generated 2025-07-12 23:00 UTC)