[HN Gopher] DeepSeekMath 7B achieved 51.7% on MATH benchmark
___________________________________________________________________
DeepSeekMath 7B achieved 51.7% on MATH benchmark
Author : mdp
Score : 77 points
Date : 2024-02-06 16:45 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mdp wrote:
| Related paper - https://arxiv.org/pdf/2402.03300.pdf
| rgbrgb wrote:
| Supports commercial use!
|
| Interesting what's unsupported:
|
| - In any way that violates any applicable national or
| international law or regulation or infringes upon the lawful
| rights and interests of any third party;
|
| - For military use in any way;
|
| - For the purpose of exploiting, harming or attempting to exploit
| or harm minors in any way;
|
| - To generate or disseminate verifiably false information and/or
| content with the purpose of harming others;
|
| - To generate or disseminate inappropriate content subject to
| applicable regulatory requirements;
|
| - To generate or disseminate personal identifiable information
| without due authorization or for unreasonable use;
|
| - To defame, disparage or otherwise harass others;
|
| - For fully automated decision making that adversely impacts an
| individual's legal rights or otherwise creates or modifies a
| binding, enforceable obligation;
|
| - For any use intended to or which has the effect of
| discriminating against or harming individuals or groups based on
| online or offline social behavior or known or predicted personal
| or personality characteristics;
|
| - To exploit any of the vulnerabilities of a specific group of
| persons based on their age, social, physical or mental
| characteristics, in order to materially distort the behavior of a
| person pertaining to that group in a manner that causes or is
| likely to cause that person or another person physical or
| psychological harm;
|
| - For any use intended to or which has the effect of
| discriminating against individuals or groups based on legally
| protected characteristics or categories.
| ronsor wrote:
| The irony is that anyone who was going to do those things isn't
| going to care about a license anyway.
| BossingAround wrote:
| True, but at least the author wouldn't be liable.
| zamadatix wrote:
| The MIT license covers liability more broadly and tightly
| in a single paragraph.
| declaredapple wrote:
| You either wouldn't be liable anyway (you're not responsible
| for what people use it for) or would still be held liable (you
| took no measures to prevent malicious use).
| austin-cheney wrote:
| Who would qualify as a user that seeks to violate applicable
| laws and yet is somehow identified as an official part of some
| legally recognized military? Furthermore, how would anybody
| know?
|
| As a dumb Army guy, if I were doing military research, I would
| just keep it on my private military internet that does not
| exist for non-military users.
| wand3r wrote:
| It's virtue signaling. I know it's overused, but seriously,
| who is intentionally harming minors BUT unwilling to break a
| ToS?
| nurettin wrote:
| The Devil?
| zeusk wrote:
| Meta/Instagram probably
| austin-cheney wrote:
| Processed food vendors come to mind, but I get your point.
| paulddraper wrote:
| Facebook lawyers
| godelski wrote:
| Does anyone know how much spoilage is in these datasets? Common
| Crawl has a lot of websites in it, including Reddit and Stack*.
| I'm certain there are lots of benchmark questions in those
| datasets, and we want to differentiate recall from problem
| solving (the two are often confused). I have a deep distrust of
| large datasets like this, given that a commonly used one with
| 60 authors assumed that writing leet-code-style programs by
| hand would keep them out of the training data (GitHub) and
| didn't even bother to check. It's really hard to sanitize
| datasets of this size, and deduplication is a much harder task
| than many realize.
|
| https://arxiv.org/abs/2107.03374
|
| https://arxiv.org/abs/2303.09540
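|
| (A toy illustration of the dedup point, in Python; the strings
| are made up. Exact matching treats these as two different
| documents even though most shingles are still shared.)
|
|     def shingles(text, k=5):
|         toks = text.split()
|         return {" ".join(toks[i:i+k]) for i in range(len(toks)-k+1)}
|
|     def jaccard(a, b):
|         return len(a & b) / len(a | b)
|
|     doc_a = ("Given an array of integers nums and an integer "
|              "target, return indices of the two numbers such "
|              "that they add up to target.")
|     doc_b = doc_a.replace("return indices", "return the indices")
|     print(doc_a == doc_b)                             # False: exact dedup misses it
|     print(jaccard(shingles(doc_a), shingles(doc_b)))  # ~0.6: near-duplicates anyway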
| ianbutler wrote:
| Some have a lot, and those models are often ignored (except by
| laypeople or hobbyists, which is a different problem), but many
| serious submissions from serious groups on benchmarks like this
| check for contamination specifically to avoid the problem
| you're describing. The decontamination process has been
| documented by many groups, so you can often check it yourself.
| Imnimo wrote:
| The paper claims:
|
| >To avoid benchmark contamination, we follow Guo et al. (2024)
| to filter out web pages containing questions or answers from
| English mathematical benchmarks such as GSM8K (Cobbe et al.,
| 2021) and MATH (Hendrycks et al., 2021) and Chinese benchmarks
| such as CMATH (Wei et al., 2023) and AGIEval (Zhong et al.,
| 2023). The filtering criteria are as follows: any text segment
| containing a 10-gram string that matches exactly with any sub-
| string from the evaluation benchmarks is removed from our math
| training corpus. For benchmark texts that are shorter than 10
| grams but have at least 3 grams, we employ exact matching to
| filter out contaminated web pages.
|
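| A minimal sketch of that filter in Python (the helper names are
| mine, not the paper's):
|
|     def ngrams(text, n=10):
|         toks = text.split()
|         return {" ".join(toks[i:i+n]) for i in range(len(toks)-n+1)}
|
|     def is_contaminated(page, benchmark_texts):
|         page_grams = ngrams(page)
|         for bench in benchmark_texts:
|             n_toks = len(bench.split())
|             if n_toks >= 10:
|                 # any shared 10-gram -> drop the page
|                 if page_grams & ngrams(bench):
|                     return True
|             elif n_toks >= 3:
|                 # short benchmark text: exact substring match
|                 if bench in page:
|                     return True
|         return False
|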
| However, benchmark decontamination is difficult, and n-gram
| matching is often insufficient. See
| https://arxiv.org/pdf/2311.04850.pdf for examples of how this
| approach can fail.
|
| In general, if a benchmark was available online before a
| model's dataset was collected, I put very little stock in that
| model's performance on that benchmark. It's just too hard to
| know what's a true improvement and what's contamination. That's
| especially true for a paper like this, which specifically hunts
| down MATH-like data.
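|
| To see how easily a light paraphrase slips past exact 10-gram
| matching (toy strings, reusing the ngrams() helper sketched
| above):
|
|     orig = ("If 3x + 5 = 20, what is the value of x? "
|             "Give your answer as an integer.")
|     para = ("Suppose 3x + 5 = 20. Determine x, "
|             "expressing the result as an integer.")
|     print(ngrams(orig) & ngrams(para))  # set(): the filter sees no overlap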
| deepseekfake wrote:
| I have spoken to team members, and they all say the results of
| this and Coder are very, very much leakage (no surprise given
| the results!!)
| godelski wrote:
| That's good to know, and better that they admit it. It earns a
| lot of respect, at least from me. Recall is still a pretty
| useful capability. I just wish more teams were less afraid to
| admit spoilage.
| pclmulqdq wrote:
| There's a good chance that's also true for GPT-4, given how
| they train. Without _known completely new_ evals, it's hard to
| say that any LLM benchmark results aren't leakage.
| CuriouslyC wrote:
| If you're trying to prove the model has reasoning abilities,
| ask it the question in a language other than English; even
| better, give it multiple sentences in different languages and
| tell it to answer the question without first translating the
| sentences.
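|
| (A toy sketch of that probe in Python; the mixed-language
| prompt below is illustrative, and sending it to the model under
| test is left to whatever API you use.)
|
|     prompt = (
|         # German: "Answer the question without translating the sentences."
|         "Beantworte die Frage, ohne die Sätze zu übersetzen.\n"
|         # Spanish: "If 3x + 5 = 20, what is the value of x?"
|         "Si 3x + 5 = 20, ¿cuál es el valor de x?\n"
|     )
|     # A model that truly reasons should answer x = 5 without
|     # first producing an English translation of the problem.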
| rgbrgb wrote:
| Yeah, that's my first thought on seeing the result too. We need
| a reputable proprietary eval.
| riku_iki wrote:
| These are actually interesting results, in the sense that we
| can see the limits of an LLM's ability to memorize complicated
| information correctly. Gemini Ultra also reported around 50%
| accuracy.
| Davidzheng wrote:
| I think the SOTA is GPT-4 + tool use? I heard near 80%.
| riku_iki wrote:
| Yes, tools help get past LLM limitations. GPT-4 without tools
| is at about 50% too.
| gowld wrote:
| Why haven't they updated the GitHub page?
___________________________________________________________________
(page generated 2024-02-06 23:01 UTC)