[HN Gopher] Qwen2-Math
       ___________________________________________________________________
        
       Qwen2-Math
        
       Author : limoce
       Score  : 115 points
       Date   : 2024-08-08 15:00 UTC (8 hours ago)
        
 (HTM) web link (qwenlm.github.io)
 (TXT) w3m dump (qwenlm.github.io)
        
       | Leary wrote:
       | Where could I try this?
        
         | currycurry16 wrote:
         | https://huggingface.co/Qwen/Qwen2-Math-72B-Instruct
        
       | rahimnathwani wrote:
       | FYI the model is almost 150GB:
       | https://huggingface.co/Qwen/Qwen2-Math-72B-Instruct/tree/mai...
        
         | moffkalast wrote:
          | Nobody runs these things at full precision; a 4-bit quant of
          | a 70B-range model fits into 64 GB, though you'd need more for
          | long context.
        
           | extheat wrote:
           | A simple equation to approximate it is `memory_in_gb =
           | parameters_in_billions * (bits/8)`
           | 
           | So at 32 bit full precision, 70 * (32 / 8) ~= 280GB
           | 
           | fp16, 70 * (16 / 8) ~= 140GB
           | 
           | 8 bit, 70 * (8 / 8) ~= 70GB
           | 
           | 4 bit, 70 * (4 / 8) ~= 35GB
           | 
            | However, llama.cpp quants are sometimes mixed, with some
            | weights at Q5, some at Q4, etc., so you usually want to
            | budget for the higher number.
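            | 
            | A minimal sketch of that rule of thumb in Python, reusing
            | the 70B example above:
            | 
            |     def approx_model_gb(params_b, bits):
            |         # memory_in_gb ~= parameters_in_billions * (bits/8)
            |         return params_b * bits / 8
            | 
            |     for bits in (32, 16, 8, 4):
            |         print(f"{bits}-bit: {approx_model_gb(70, bits):.0f} GB")
            |     # 32-bit: 280 GB, 16-bit: 140 GB, 8-bit: 70 GB,
            |     # 4-bit: 35 GB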
        
             | moffkalast wrote:
              | Well, that, and you also need a fair bit more space for
              | the KV cache, which can be a bit unpredictable. Models
              | without GQA, flash attention, or 4-bit cache support are
              | really terrible in that regard, plus it depends on
              | context length. Haven't found a good rule of thumb for
              | that yet.
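              | 
              | The closest I've seen to one is the naive per-sequence
              | estimate 2 * layers * kv_heads * head_dim * context *
              | bytes_per_element, though quantized caches and overhead
              | throw it off. A sketch with made-up Llama-70B-style
              | numbers (not Qwen2-Math's actual config):
              | 
              |     def kv_cache_gb(layers, kv_heads, head_dim, ctx,
              |                     bytes_per=2):  # 2 bytes = fp16 cache
              |         # 2x for keys and values
              |         return (2 * layers * kv_heads * head_dim * ctx
              |                 * bytes_per / 1e9)
              | 
              |     print(kv_cache_gb(80, 8, 128, 32768))  # ~10.7 GB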
        
           | lostmsu wrote:
            | How much Elo on Chatbot Arena does 4-bit lose vs. full
            | fp16/bf16?
        
       | vessenes wrote:
        | The sample solution for Balkan MO 2023 seems... questionable?
        | 
        | The problem involves players removing stones sequentially and
        | asks which player will win with perfect play: the listed answer
        | definitely doesn't cover all possible types of strategies.
        | 
        | The answer it gives may be right; in fact I bet it is correct
        | (the second player), but is the Qwen team offering the
        | solution, logic included, as correct? And is that logic
        | actually correct?
        
         | pathsjs wrote:
          | I was going to write the same thing. I checked the first
          | three problems, and all the solutions are partial at best.
          | Now, don't get me wrong, this is still impressive. But
          | presenting the problems with the implication that Qwen solves
          | them correctly, when it doesn't, does not really inspire
          | trust.
        
         | tveita wrote:
         | Same with the Lusophon Mathematical Olympiad 2023 example.
         | 
         | It makes a bunch of specious claims about parity. (Adding 4 to
         | a number changes the parity? Therefore, the parity of each
         | colour always changes for each turn? Therefore, the parity will
         | always be the same as it was initially?) And then it concludes
         | that since the parity is right, 2022 of each colour must be a
         | reachable state.
         | 
          | Which, as you say, is quite possibly the correct answer, but
          | it's really weird to put it out there as an example with no
          | comment on the reasoning.
        
         | sterlind wrote:
          | Using natural-language models for math problems is a bad idea
          | for exactly this reason: judging the correctness of proofs in
          | English is subjective, whereas proofs in Lean are objective
          | and easy to machine-check. DeepMind's AlphaProof was so
          | remarkable precisely because it spoke Lean, so its proofs
          | were verifiable. Why waste time on anything else?
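          | 
          | For instance, even a toy Lean 4 theorem is checked
          | mechanically by the kernel, with nothing left to a reader's
          | judgment:
          | 
          |     -- the kernel either accepts this proof term or rejects it
          |     theorem one_le_sq_succ (n : Nat) : 1 ≤ n * n + 1 :=
          |       Nat.le_add_left 1 (n * n)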
        
           | margorczynski wrote:
            | I can think of two reasons:
            | 
            | 1) It's much easier to state the problem, and basically all
            | the mathematical knowledge we have is not in the form of
            | Lean proofs.
            | 
            | 2) It can be applied to a much broader range of domains.
            | Math is kinda unique in that verifying that something is
            | 100% correct is possible at all; in most other domains it
            | isn't (and neither is getting such a signal for RL).
        
             | dimask wrote:
              | 1) Then more math should get formalised in Lean.
              | 
              | 2) How is a solution by an LLM supposed to be verified
              | without such a formalisation?
        
             | sterlind wrote:
             | 1) You could train a different language model to translate
             | between Lean and English. Use AlphaProof to do the hard
             | work of theorem proving; use the translation model for
             | interpretability and as a user interface. In fact, such a
             | system could be ideal for _formalizing_ a hundred years of
             | English-language proofs.
             | 
             | 2) This model is already specialized for math; applying it
             | to other domains is out of scope. And as you point out,
             | speaking Lean (and thus being verifiable) gives you an RL
             | reward signal that's way more precise and readily available
             | than piddly RLHF from human reviewers. Gyms like this are
             | where AGI will happen.
             | 
             |  _(P.S. if anyone wants to work with me on reproducing
             | AlphaProof, hit me up on Discord.)_
        
           | vessenes wrote:
           | The real answer in this case is that AlphaProof is a
           | massively more complex training pipeline than the Qwen team's
           | pipeline. Sadly. Agreed that integrating Lean makes a ton of
           | sense; it's just probably an order of magnitude easier to do
           | it inside DeepMind than elsewhere right now.
        
         | meroes wrote:
          | This matches my experience doing RLHF for math for LLMs. So
          | many times it's a partial answer. It may find the minimal
          | case, for example, but it doesn't say why it's the minimal
          | one, even when the prompt specifically asks for the why.
        
       | tempfile wrote:
        | The first solution (IMO 2002) is completely wrong. It shows
        | that 1, 2, or 3 cubes are not sufficient, and provides an
        | obstacle that _doesn't rule out_ 4 cubes, but it does not prove
        | that there actually are 4 cubes that sum to the given number.
        | That is much harder (and I don't know the true answer).
        
         | kevinventullo wrote:
         | I think it's about the same level of difficulty as showing you
         | need at least four, but yeah I agree that what is written is
         | insufficient. One solution is that you can write
         | 
         | 2002 = 10^3 + 10^3 + 1^3 + 1^3
         | 
         | Then, multiply through by 2002^2001, which is itself a cube
         | since 2001 is divisible by 3.
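          | 
          | The construction is easy to machine-check, e.g. in Python:
          | 
          |     # verify 2002 = 10^3 + 10^3 + 1^3 + 1^3 and the scaled-up
          |     # identity for 2002^2002 (2001 = 3 * 667, so 2002^2001 is
          |     # the cube of k = 2002^667)
          |     assert 10**3 + 10**3 + 1**3 + 1**3 == 2002
          |     k = 2002**667
          |     assert 2002**2002 == (10*k)**3 + (10*k)**3 + k**3 + k**3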
        
       | ziofill wrote:
       | I see that they do some decontamination of the datasets, in the
       | hope that the models won't just recite answers from the training
       | data. But in the recent interview with Subbarao Kambhampati on
       | MLST (https://www.youtube.com/watch?v=y1WnHpedi2A) they explain
       | that models fail as soon as one slightly rephrases the test
       | problems (indicating that they are indeed mostly reciting). I
       | expect this to be the case with this model too.
        
         | eightysixfour wrote:
         | This is just not true. There are plenty of private tests with
         | problems that are not in the training set. GPQA is an excellent
         | example.
        
         | refulgentis wrote:
        | There's a lot of bunko, 3rd-time-rephrased "wisdom" about AI,
        | and enough interest in it that if you have the right title, you
        | can get away with repeating it.
        | 
        | I'm a bit surprised that an HN reader, who presumably has
        | first-hand experience with these models, isn't sure whether
        | it's just a hashmap lookup.
        
       | beyondCritics wrote:
        | It is obvious that all of these problems are still way too hard
        | for it, although it sometimes has ideas. It flawlessly
        | demonstrates how to simplify (2002^2002) mod 9. I recall that
        | there was once a scandalous university exam for future math
        | teachers in Germany which asked for exactly such tasks, and
        | everyone failed the test. With Qwen2 at hand, that might not
        | have happened.
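        | 
        | For reference, the standard reduction: 2002 ≡ 4 (mod 9), and
        | 4^3 = 64 ≡ 1 (mod 9), so with 2002 = 3*667 + 1 we get
        | 2002^2002 ≡ 4^2002 ≡ 4 (mod 9). A quick Python sanity check:
        | 
        |     assert 2002 % 9 == 4
        |     assert pow(4, 3, 9) == 1
        |     assert pow(2002, 2002, 9) == 4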
        
       | next_xibalba wrote:
       | Kind of surprised this was released in English first given it was
       | produced by a Chinese group (Alibaba Cloud). I wonder why that
       | is.
        
         | karmasimida wrote:
          | I mean, I think 90% of AI researchers in China use English as
          | their first choice for publication ... so not surprising?
        
         | m3kw9 wrote:
          | Maybe the training material is also mostly English.
        
       | ipnon wrote:
        | These solutions aren't perfect, but imagine how many more
        | people can become mathematicians now that the price of an
        | elite, IMO-medal-winning tutor can be quantified as Hugging
        | Face hosting costs!
        
         | Strix97 wrote:
          | I don't think this is a useful tool for becoming a
          | mathematician. Becoming one does not necessarily involve
          | solving these kinds of puzzle exercises.
          | 
          | It's useful for honing your creative thinking and learning
          | how to approach a mathematical problem, but it won't make you
          | a mathematician.
        
         | meroes wrote:
          | A mathematician tutor that doesn't know how to write proofs,
          | though. Or, at the most charitable, one that doesn't know how
          | human mathematicians write proofs. I'm not talking about
          | using Lean or similar, but about the common parlance and
          | rigor outside of theorem provers.
        
       | azinman2 wrote:
       | > This model mainly supports English. We will release bilingual
       | (English and Chinese) math models soon
       | 
       | The irony.
        
       | qrian wrote:
        | The solution for IMO 2022 is barely a 1/7 solution. It just
        | says "might not satisfy the inequality for all y" without a
        | proof, and proving that was the whole point of the question.
        
       ___________________________________________________________________
       (page generated 2024-08-08 23:01 UTC)