[HN Gopher] DeepSeekMath 7B achieved 51.7% on MATH benchmark
       ___________________________________________________________________
        
       DeepSeekMath 7B achieved 51.7% on MATH benchmark
        
       Author : mdp
       Score  : 77 points
       Date   : 2024-02-06 16:45 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mdp wrote:
       | Related paper - https://arxiv.org/pdf/2402.03300.pdf
        
       | rgbrgb wrote:
       | Supports commercial use!
       | 
        | Interesting what's prohibited:
       | 
       | - In any way that violates any applicable national or
       | international law or regulation or infringes upon the lawful
       | rights and interests of any third party;
       | 
       | - For military use in any way;
       | 
       | - For the purpose of exploiting, harming or attempting to exploit
       | or harm minors in any way;
       | 
       | - To generate or disseminate verifiably false information and/or
       | content with the purpose of harming others;
       | 
       | - To generate or disseminate inappropriate content subject to
       | applicable regulatory requirements;
       | 
       | - To generate or disseminate personal identifiable information
       | without due authorization or for unreasonable use;
       | 
       | - To defame, disparage or otherwise harass others;
       | 
       | - For fully automated decision making that adversely impacts an
       | individual's legal rights or otherwise creates or modifies a
       | binding, enforceable obligation;
       | 
       | - For any use intended to or which has the effect of
       | discriminating against or harming individuals or groups based on
       | online or offline social behavior or known or predicted personal
       | or personality characteristics;
       | 
       | - To exploit any of the vulnerabilities of a specific group of
       | persons based on their age, social, physical or mental
       | characteristics, in order to materially distort the behavior of a
       | person pertaining to that group in a manner that causes or is
       | likely to cause that person or another person physical or
       | psychological harm;
       | 
       | - For any use intended to or which has the effect of
       | discriminating against individuals or groups based on legally
       | protected characteristics or categories.
        
         | ronsor wrote:
         | The irony is that anyone who was going to do those things isn't
         | going to care about a license anyway.
        
           | BossingAround wrote:
           | True, but at least the author wouldn't be liable.
        
             | zamadatix wrote:
             | The MIT license covers liability more broadly and tightly
             | in a single paragraph.
        
             | declaredapple wrote:
              | You'd either not be liable anyway (you're not
              | responsible for what people use it for) or still be held
              | liable (you took no measures to prevent malicious use).
        
         | austin-cheney wrote:
         | Who would qualify as a user that seeks to violate applicable
         | laws and yet is somehow identified as an official part of some
         | legally recognized military? Furthermore, how would anybody
         | know?
         | 
          | As a dumb Army guy, if I were doing military research I
          | would just keep it on my private military internet that does
          | not exist for non-military users.
        
           | wand3r wrote:
            | It's virtue signaling. I know it's overused, but
            | seriously, who is intentionally harming minors BUT
            | unwilling to break a ToS contract?
        
             | nurettin wrote:
             | The Devil?
        
             | zeusk wrote:
             | Meta/Instagram probably
        
             | austin-cheney wrote:
             | Processed food vendors come to mind, but I get your point.
        
             | paulddraper wrote:
             | Facebook lawyers
        
       | godelski wrote:
        | Does anyone know how much spoilage is in these datasets?
        | Common Crawl includes a lot of websites, Reddit and Stack*
        | among them. I'm certain there are lots of benchmark questions
        | in those datasets, and we want to differentiate recall from
        | problem solving (the two are often confused). I have a deep
        | distrust of large datasets like this, given that a common one
        | with 60 authors assumed hand-writing leetcode-style programs
        | meant they wouldn't appear in the training data (GitHub) and
        | didn't even bother to check. It's really hard to sanitize
        | datasets of this size, and deduplication is a much harder task
        | than many realize.
       | 
       | https://arxiv.org/abs/2107.03374
       | 
       | https://arxiv.org/abs/2303.09540
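        | 
        | To see why exact matching misses near-duplicates, here is a
        | minimal MinHash-style sketch (pure Python; function names and
        | the seed scheme are mine for illustration, not from either
        | paper):
        | 
        |     import hashlib
        |     
        |     def shingles(text, k=3):
        |         """All k-word windows of a text, as a set."""
        |         w = text.split()
        |         return {" ".join(w[i:i + k])
        |                 for i in range(len(w) - k + 1)}
        |     
        |     def minhash(s, num_hashes=64):
        |         """Signature: min hash of the set per seed."""
        |         return [min(int(hashlib.md5(
        |                     f"{seed}:{x}".encode()).hexdigest(), 16)
        |                     for x in s)
        |                 for seed in range(num_hashes)]
        |     
        |     def est_jaccard(a, b):
        |         """Matching slots estimate Jaccard overlap."""
        |         return sum(x == y for x, y in zip(a, b)) / len(a)
        |     
        |     # Two rewordings of one problem share no long exact
        |     # n-gram, yet the estimated overlap is high (~0.67).
        |     a = "if x plus 2 equals 7 what is the value of x"
        |     b = "what is the value of x if x plus 2 equals 7"
        |     print(est_jaccard(minhash(shingles(a)),
        |                       minhash(shingles(b))))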
        
         | ianbutler wrote:
          | Some have a lot, and those models are often ignored (except
          | by laypeople or hobbyists, which is a different problem),
          | but many serious submissions from serious groups check for
          | contamination on benchmarks like this, specifically to avoid
          | the problem you're suggesting. The decontamination process
          | has been outlined by many groups, so you can often check it
          | for yourself.
        
         | Imnimo wrote:
         | The paper claims:
         | 
         | >To avoid benchmark contamination, we follow Guo et al. (2024)
         | to filter out web pages containing questions or answers from
         | English mathematical benchmarks such as GSM8K (Cobbe et al.,
         | 2021) and MATH (Hendrycks et al., 2021) and Chinese benchmarks
         | such as CMATH (Wei et al., 2023) and AGIEval (Zhong et al.,
         | 2023). The filtering criteria are as follows: any text segment
         | containing a 10-gram string that matches exactly with any sub-
         | string from the evaluation benchmarks is removed from our math
         | training corpus. For benchmark texts that are shorter than 10
         | grams but have at least 3 grams, we employ exact matching to
         | filter out contaminated web pages.
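          | 
          | For concreteness, a minimal sketch of that 10-gram exact-
          | match filter (function names are mine, not code from the
          | paper):
          | 
          |     def ngrams(text, n=10):
          |         """Set of contiguous n-word windows in a text."""
          |         w = text.split()
          |         return {tuple(w[i:i + n])
          |                 for i in range(len(w) - n + 1)}
          |     
          |     def build_index(benchmark_texts, n=10):
          |         """Every n-gram in any benchmark item."""
          |         index = set()
          |         for t in benchmark_texts:
          |             index |= ngrams(t, n)
          |         return index
          |     
          |     def is_contaminated(page_text, index, n=10):
          |         """Drop a page if any of its n-grams exactly
          |         matches a benchmark n-gram."""
          |         return not ngrams(page_text, n).isdisjoint(index)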
         | 
          | However, benchmark decontamination is difficult, and n-gram
          | matching is often insufficient. See
          | https://arxiv.org/pdf/2311.04850.pdf for some examples of
          | how this approach can fail.
         | 
          | In general, if a benchmark is available online before a
          | model's dataset is collected, I put very little stock in
          | that model's performance on that benchmark. It's just too
          | hard to know what's a true improvement and what's
          | contamination. That's especially true for a paper like this,
          | which specifically hunts down MATH-like data.
        
       | deepseekfake wrote:
        | I have spoken to team members, and they all say the results
        | of this and Coder are very, very much leakage (no surprise
        | given the result!!)
        
         | godelski wrote:
          | That's good to know, and better to admit; it earns a lot of
          | respect, at least from me. Recall is still a pretty useful
          | task. I just wish more teams were less afraid to admit
          | spoilage.
        
         | pclmulqdq wrote:
          | There's a good chance that's also true for GPT-4, given how
          | they train. Without _known, completely new_ evals, it's hard
          | to say that any LLM benchmark results aren't leakage.
        
           | CuriouslyC wrote:
            | If you're trying to prove the model has reasoning
            | abilities, ask it the question in a language other than
            | English. Even better, give it multiple sentences in
            | different languages and tell it to answer the question
            | without first translating the sentences.
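            | 
            | A quick way to build such a probe (the problem text here
            | is illustrative; any mixed-language word problem works):
            | 
            |     # Mixing languages defeats verbatim recall of a
            |     # memorized English benchmark string.
            |     parts = [
            |         "Un tren sale de la estación a las 9:00.",
            |         "Il roule à 60 km/h pendant 90 minutes.",
            |         "Wie weit ist der Zug dann gefahren?",
            |     ]
            |     prompt = ("Answer the final question directly, "
            |               "without translating the sentences first:\n"
            |               + "\n".join(parts))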
        
         | rgbrgb wrote:
          | Yeah, that's my first thought on seeing the result too. We
          | need a reputable proprietary eval.
        
         | riku_iki wrote:
          | These are actually interesting results, in the sense that
          | we see the limits of an LLM's ability to memorize
          | complicated information correctly. Gemini Ultra also
          | reported around 50% accuracy.
        
           | Davidzheng wrote:
              | I think the SOTA is GPT-4 + tool use? I heard it's near
              | 80%.
        
             | riku_iki wrote:
              | Yes, tools help get past LLM limitations. GPT-4 without
              | tools is at about 50% too.
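              | 
              | "Tool use" here usually means letting the model write
              | and run code rather than doing arithmetic in-weights. A
              | rough sketch of that loop (llm_generate is a stand-in
              | for whatever completion API you call; sandboxing is
              | omitted):
              | 
              |     import subprocess, sys
              |     
              |     def solve_with_tool(question, llm_generate):
              |         """Model emits a program; we run it and
              |         read the answer from stdout."""
              |         code = llm_generate(
              |             "Write a Python program that prints "
              |             "only the final numeric answer.\n"
              |             "Problem: " + question)
              |         out = subprocess.run(
              |             [sys.executable, "-c", code],
              |             capture_output=True, text=True,
              |             timeout=30)
              |         return out.stdout.strip()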
        
         | gowld wrote:
          | Why haven't they updated the GitHub page?
        
       ___________________________________________________________________
       (page generated 2024-02-06 23:01 UTC)