[HN Gopher] Qwen2-Math
___________________________________________________________________
Qwen2-Math
Author : limoce
Score : 115 points
Date : 2024-08-08 15:00 UTC (8 hours ago)
(HTM) web link (qwenlm.github.io)
(TXT) w3m dump (qwenlm.github.io)
| Leary wrote:
| Where could I try this?
| currycurry16 wrote:
| https://huggingface.co/Qwen/Qwen2-Math-72B-Instruct
| rahimnathwani wrote:
| FYI the model is almost 150GB:
| https://huggingface.co/Qwen/Qwen2-Math-72B-Instruct/tree/mai...
| moffkalast wrote:
| Nobody runs these things at full precision; a 4-bit quant of
| a 70B-range model fits into 64 GB, though you would need more
| for long context.
| extheat wrote:
| A simple equation to approximate it is `memory_in_gb =
| parameters_in_billions * (bits/8)`
|
| So at 32 bit full precision, 70 * (32 / 8) ~= 280GB
|
| fp16, 70 * (16 / 8) ~= 140GB
|
| 8 bit, 70 * (8 / 8) ~= 70GB
|
| 4 bit, 70 * (4 / 8) ~= 35GB
|
| However, in llama.cpp-style quants the precision is sometimes
| mixed, with some weights at Q5 and some at Q4, so you usually
| want to budget for the higher number.
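The rule of thumb above can be sketched in a few lines of Python (a rough estimate for the weights alone; KV cache and runtime overhead are extra, and the bit-widths shown are illustrative):

```python
# Approximate weight memory: params (billions) * bytes per weight.
# This ignores KV cache, activations, and runtime overhead.
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * (bits_per_weight / 8)

for bits in (32, 16, 8, 4):
    print(f"70B @ {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
# 70B @ 32-bit: ~280 GB
# 70B @ 16-bit: ~140 GB
# 70B @ 8-bit: ~70 GB
# 70B @ 4-bit: ~35 GB
```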
| moffkalast wrote:
| Well, that, and you also need a fair bit more space for the
| KV cache, which can be a bit unpredictable. Models without
| GQA, flash attention, or 4-bit cache support are really
| terrible in that regard, plus it depends on context length.
| Haven't found a good rule of thumb for that yet.
| lostmsu wrote:
| How much Elo on Chatbot Arena does 4-bit lose vs. full
| fp16/bf16?
| vessenes wrote:
| Sample solution for Balkan MO 2023 seems... questionable?
|
| The problem involves players removing stones sequentially and
| asks which player will win with perfect play: the listed answer
| definitely doesn't consider all possible types of strategies.
|
| The answer it gives may be right; in fact, I bet it is correct
| (the second player wins), but does the Qwen team offer the
| solution as correct, including the logic? And is that logic
| correct?
| pathsjs wrote:
| I was going to write the same thing. I checked the first three
| problems, and all the solutions are partial at best. Now, don't
| get me wrong, this is still impressive. But putting the problems
| there with the implication that Qwen solves them correctly, when
| it doesn't, does not really inspire trust.
| tveita wrote:
| Same with the Lusophon Mathematical Olympiad 2023 example.
|
| It makes a bunch of specious claims about parity. (Adding 4 to
| a number changes the parity? Therefore, the parity of each
| colour always changes for each turn? Therefore, the parity will
| always be the same as it was initially?) And then it concludes
| that since the parity is right, 2022 of each colour must be a
| reachable state.
|
| Which, as you say, is quite possibly the correct answer, but
| it's really weird to put it out there as an example with no
| comment on the reasoning.
| sterlind wrote:
| Using human language models for math problems is a bad idea for
| exactly this reason: judging correctness of proofs in English
| is subjective, whereas proofs in Lean are objective and easy to
| machine-check. Deepmind's AlphaProof was so remarkable
| precisely because it spoke Lean, so its proofs were verifiable.
| Why waste time on anything else?
| margorczynski wrote:
| I would think of two:
|
| 1) It is much easier to state the problem in natural language,
| and basically all the mathematical knowledge we have is not in
| the form of Lean proofs.
|
| 2) It can be applied to a much broader range of domains; math is
| kind of unique in that verifying something is 100% correct is
| possible there, while in most other domains it is impossible (as
| is getting such a signal for RL).
| dimask wrote:
| 1) Then more math should get formalised in Lean.
|
| 2) How is a solution by LLMs supposed to be verified
| without such a formalisation?
| sterlind wrote:
| 1) You could train a different language model to translate
| between Lean and English. Use AlphaProof to do the hard
| work of theorem proving; use the translation model for
| interpretability and as a user interface. In fact, such a
| system could be ideal for _formalizing_ a hundred years of
| English-language proofs.
|
| 2) This model is already specialized for math; applying it
| to other domains is out of scope. And as you point out,
| speaking Lean (and thus being verifiable) gives you an RL
| reward signal that's way more precise and readily available
| than piddly RLHF from human reviewers. Gyms like this are
| where AGI will happen.
|
| _(P.S. if anyone wants to work with me on reproducing
| AlphaProof, hit me up on Discord.)_
| vessenes wrote:
| The real answer in this case is that AlphaProof is a
| massively more complex training pipeline than the Qwen team's
| pipeline. Sadly. Agreed that integrating Lean makes a ton of
| sense; it's just probably an order of magnitude easier to do
| it inside DeepMind than elsewhere right now.
| meroes wrote:
| This matches my experience doing RLHF on math for LLMs. Very
| often the model gives a partial answer. It may find the minimal
| case, for example, but it doesn't say why it's the minimal one,
| even when the prompt specifically asks for the why.
| tempfile wrote:
| First solution (IMO 2002) is completely wrong. It shows that 1,
| 2, or 3 cubes are not sufficient, and provides an obstruction
| that _doesn't rule out_ 4 cubes, but it does not prove that
| there actually are 4 cubes that sum to the given number. That is
| much harder (and I don't know the true answer).
| kevinventullo wrote:
| I think it's about the same level of difficulty as showing you
| need at least four, but yeah I agree that what is written is
| insufficient. One solution is that you can write
|
| 2002 = 10^3 + 10^3 + 1^3 + 1^3
|
| Then, multiply through by 2002^2001, which is itself a cube
| since 2001 is divisible by 3.
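A quick sanity check of this construction (this verifies the arithmetic of the proposed solution, not the "at least four cubes are needed" lower bound):

```python
# 2002 = 10^3 + 10^3 + 1^3 + 1^3, and 2001 = 3 * 667, so
# 2002^2001 = (2002^667)^3 is itself a cube. Multiplying through,
# 2002^2002 is the sum of (c * 2002^667)^3 for c in (10, 10, 1, 1).
assert 10**3 + 10**3 + 1**3 + 1**3 == 2002
assert 2001 % 3 == 0
k = 2002 ** 667
assert sum((c * k) ** 3 for c in (10, 10, 1, 1)) == 2002 ** 2002
print("2002^2002 is a sum of four cubes: verified")
```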
| ziofill wrote:
| I see that they do some decontamination of the datasets, in the
| hope that the models won't just recite answers from the training
| data. But in the recent interview with Subbarao Kambhampati on
| MLST (https://www.youtube.com/watch?v=y1WnHpedi2A) they explain
| that models fail as soon as one slightly rephrases the test
| problems (indicating that they are indeed mostly reciting). I
| expect this to be the case with this model too.
| eightysixfour wrote:
| This is just not true. There are plenty of private tests with
| problems that are not in the training set. GPQA is an excellent
| example.
| refulgentis wrote:
| There's a lot of bunko, third-hand-rephrased "wisdom" about AI,
| and enough interest in it that, with the right title, you can
| get away with repeating it.
|
| I'm a bit surprised that an HN reader, who presumably has
| first-hand experience with these models, isn't sure whether it's
| just a hashmap lookup.
| beyondCritics wrote:
| It is obvious that all of these problems are still way too hard,
| although it sometimes has ideas. It flawlessly demonstrates how
| to simplify (2002^2002) mod 9. I recall that there was once a
| scandalous university exam for future math teachers in Germany
| which asked for exactly such tasks, and everyone failed the
| test. With Qwen2-Math at hand this might not have happened.
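For reference, that simplification can be done by hand and checked mechanically (the modular-arithmetic steps below are standard; only the walkthrough is added here):

```python
# 2002 ≡ 4 (mod 9), since the digit sum 2+0+0+2 = 4.
assert 2002 % 9 == 4
# Powers of 4 mod 9 cycle with period 3: 4, 7, 1, ...
assert [pow(4, i, 9) for i in (1, 2, 3)] == [4, 7, 1]
# 2002 = 3*667 + 1, so 4^2002 ≡ 4^1 ≡ 4 (mod 9).
assert 2002 % 3 == 1
print(pow(2002, 2002, 9))  # 4
```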
| next_xibalba wrote:
| Kind of surprised this was released in English first given it was
| produced by a Chinese group (Alibaba Cloud). I wonder why that
| is.
| karmasimida wrote:
| I mean, I think 90% of AI researchers in China use English as
| the first choice for publication ... so not surprised?
| m3kw9 wrote:
| Maybe the training material is also mostly English.
| ipnon wrote:
| These solutions aren't perfect, but imagine how many more people
| can become mathematicians now that the price of an elite,
| IMO-medal-winning tutor can be quantified as Hugging Face
| hosting costs!
| Strix97 wrote:
| I don't think this is a useful tool for becoming a
| mathematician. Becoming one does not necessarily involve solving
| these kinds of puzzle exercises.
|
| It's useful for honing your creative thinking and learning how
| to approach a mathematical problem, but it won't make you a
| mathematician.
| meroes wrote:
| A mathematician tutor that doesn't know how to write proofs,
| though. Or, at its most charitable, not how human mathematicians
| write proofs. I'm not talking about using Lean or similar, but
| about the common parlance and rigor outside of theorem provers.
| azinman2 wrote:
| > This model mainly supports English. We will release bilingual
| (English and Chinese) math models soon
|
| The irony.
| qrian wrote:
| The solution for IMO 2022 is barely a 1/7 solution. It just says
| 'might not satisfy the inequality for all y' without a proof.
| That was the point of the question.
___________________________________________________________________
(page generated 2024-08-08 23:01 UTC)