[HN Gopher] ChatGPT-4o vs. Math
___________________________________________________________________
ChatGPT-4o vs. Math
Author : sabrina_ramonov
Score : 218 points
Date : 2024-05-16 15:30 UTC (7 hours ago)
(HTM) web link (www.sabrina.dev)
(TXT) w3m dump (www.sabrina.dev)
| lupire wrote:
 | Need to run this experiment on a problem that isn't already in
 | its training set.
| passwordoops wrote:
| Shhhh... Don't ruin it
| navane wrote:
 | It's the equivalent of cramming for a test, or memorizing leet
 | code -- not very useful but very human. Imagine if that's the
 | direction this goes: finally we make humane AI, but it is as
 | opportunistic and deceitful as we are, and not really that
 | smart.
| aulin wrote:
| Reminds me when I used to ace Ancient Greek translation tests
| (it's a thing in Italy) by looking up whole translated
| sentences listed as usage examples in the dictionary
| sabrina_ramonov wrote:
| sometimes for physics/math exams, we'd get to create our own 1
| pager cheat sheet to use. I'd just cram tons of actual
| problems/solutions on there, then scan for similarity.
| bearjaws wrote:
| Is there any good literature on this topic?
|
| I feel like math is naturally one of the easiest sets of
| synthetic data we can produce, especially since you can
| represent the same questions multiple ways in word problems.
|
| You could just increment the numbers infinitely and generate
| billions of examples of every formula.
|
| If we can't train them to be excellent at math, what hope do we
| ever have at programming or any other skill?
| cchance wrote:
 | Not in the training set? The dataset is ALL OF THE INTERNET;
 | I'd love for you to find something it hasn't seen before.
| soarerz wrote:
| The model's first attempt is impressive (not sure why it's
| labeled a choke). Unfortunately gpt4o cannot discover calculus on
| its own.
| Chinjut wrote:
| It's a choke because it failed to get the answer. Saying other
| true things but not getting the answer is not a success.
| bombadilo wrote:
| I mean, in this context I agree. But most people doing math
| in high school or university are graded on their working of a
| problem, with the final result usually equating to a small
| proportion of the total marks received.
| perfobotto wrote:
 | This is supposed to be a product, not a research artifact.
| chongli wrote:
| _But most people doing math in high school or university
| are graded on their working of a problem, with the final
| result usually equating to a small proportion of the total
| marks received_
|
| That heavily depends on the individual grader/instructor. A
| good grader will take into account the amount of progress
| toward the solution. Restating trivial facts of the problem
| (in slightly different ways) or pursuing an invalid
| solution to a dead end should not be awarded any marks.
| slushy-chivalry wrote:
| it choked because it didn't solve for `t` at the end
|
| impressive attempt though, it used number of wraps which
| I found quite clever
| giaour wrote:
| This depends on the grader and the context. Outside of an
| academic setting, sometimes being close to the right answer
| is better than nothing, and sometimes it is much worse. You
| can expect a human to understand which contexts require
| absolute precision and which do not, but that seems like a
| stretch for an LLM.
| phatfish wrote:
| LLMs being _confidently_ incorrect until they are
| challenged is a bad trait. At least they have a system
| prompt to tell them to be polite about it.
|
| Most people learn to avoid that person that is wrong/has
| bad judgment and is arrogant about it.
| HDThoreaun wrote:
 | Right, it's the only answer that accounts for wasted space there
| might be between wraps.
| usaar333 wrote:
| Or.. use calculus?
|
| It has gotten quite impressive at handling calculus word
| problems. GPT-4 (original) failed miserably on this problem
| (attempted to set it up using constant acceleration equations);
 | GPT-4o finally gets it correct:
|
| > I am driving a car at 65 miles per hour and release the gas
| pedal. The only force my car is now experiencing is air
| resistance, which in this problem can be assumed to be linearly
| proportional to my velocity.
|
| > When my car has decelerated to 55 miles per hour, I have
| traveled 300 feet since I released the gas pedal.
|
| > How much further will I travel until my car is moving at only
| 30 miles per hour?
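 |
 | (For reference, linear drag gives a neat closed form: m*dv/dt =
 | -c*v and dv/dt = v*dv/dx, so dv/dx is constant and the speed
 | drops linearly with distance. A quick check of the arithmetic,
 | assuming exactly that setup:)
 |
 |       v0, v1, v2 = 65.0, 55.0, 30.0    # speeds, mph
 |       d01 = 300.0                      # feet traveled from 65 to 55
 |       rate = (v0 - v1) / d01           # mph lost per foot (constant)
 |       print((v1 - v2) / rate)          # 750.0 feet more to reach 30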
| xienze wrote:
| Does it get the answer right every single time you ask the
| question the same way? If not, who cares how it's coming to
| an answer, it's not consistently correct and therefore not
| dependable. That's what the article was exploring.
| fmbb wrote:
| Can it be taught calculus?
| munk-a wrote:
| I think this is the biggest flaw in LLMs and what is likely
| going to sour a lot of businesses on their usage (at least in
| their current state). It is preferable to give the right answer
| to a query, it is acceptable to be unable to answer a query -
| we run into real issues, though, when a query is confidently
| answered incorrectly. This recently caused a major headache for
| AirCanada - businesses should be held to the statements they
| make, even if those statements were made by an AI or call
| center employee.
| sabrina_ramonov wrote:
 | I labeled it a choke because it just stopped.
| photochemsyn wrote:
| I don't know... here's a prompt query for a standard problem in
| introductory integral calculus, and it seems to go pretty
| smoothly from a discrete arithmetical series into the
| continuous integral:
|
| "Consider the following word problem: "A 100 meter long chain
| is hanging off the end of a cliff. It weighs one metric ton.
| How much physical work is required to pull the chain to the top
| of the cliff if we discretize the problem such that one meter
| is pulled up at a time?" Note that the remaining chain gets
| lighter after each lifting step. Find the equation that
| describes this discrete problem and from that, generate the
| continuous expression and provide the Latex code for it."
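 |
 | (For calibration, a rough check of where the discrete sum and
 | the integral should land, assuming a uniform 10 kg/m chain and
 | g = 9.8 m/s^2:)
 |
 |       g, lam = 9.8, 10.0      # m/s^2, kg per meter of chain
 |       # discrete: before step i the hanging length is (100 - i) m,
 |       # and that whole hanging piece is raised one meter
 |       W_discrete = sum(lam * g * (100 - i) for i in range(100))
 |       # continuous limit: W = integral_0^100 lam * g * x dx
 |       W_continuous = lam * g * 100**2 / 2
 |       print(W_discrete, W_continuous)  # ~494,900 J vs 490,000 J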
| jdthedisciple wrote:
| As an aside, what did the author do to get banned on X?
| midiguy wrote:
| Not gonna lie, when I see that someone is banned on X, I assume
| credibility
| kylebenzle wrote:
| Why? Are many credible people banned on Twitter?
| munk-a wrote:
| A lot of credible people have left Twitter - it has gotten
| much more overrun by bots and a lot of very hateful
| accounts have been reinstated and protected. It is a poor
| platform for reasonable discussion and I think it's fair to
| say it's been stifling open expression. The value is
| disappearing.
| bun_terminator wrote:
| that was not the question
| munk-a wrote:
| I think it was an appropriate answer at the heart of the
| matter - most credible people are leaving the platform
| due to the degradation of quality on it. For a literal
| example of a ban though there are few examples better
| than Dell Cameron[1].
|
| 1. https://www.vanityfair.com/news/2023/04/elon-musk-
| twitter-st...
| ourmandave wrote:
| Even the 100's of Hamas-affiliated accounts?
|
| https://ny1.com/nyc/all-
| boroughs/technology/2023/10/12/x-say...
| bassrattle wrote:
| straw man, and a drop in the bucket
| downWidOutaFite wrote:
| frEeDom oF sPeeCh
| sabrina_ramonov wrote:
| OP here. I have no idea. It's really annoying. Now there's at
| least 3 variations of "Sabrina Ramonov" on X, all banned.
| mritchie712 wrote:
 | Posted a screenshot from the post, got it on the first try
|
| https://x.com/thisritchie/status/1791153959865852093
| sabrina_ramonov wrote:
| haha that's neat
| afro88 wrote:
| Ha nice.
|
 | I decided to try the same and it got it incorrect. It's so non-
 | deterministic. It landed on 0.17cm. Tried it another time and
 | it got 0.1697cm. When I asked it to check its work, it got the
 | right answer of 0.00589cm.
| jcims wrote:
| I posted the same 'Zero-Shot Chain-of-Thought and Image' to
| ChatGPT-4o and it made the same error.
|
| I then followed up with 'Your math is good but you derived
| incorrect data from the image. Can you take another look and see
| if you can tell where the error is?'.
|
 | It figured it out and corrected it:
 |
 |       Let's re-examine the image and the data provided:
 |       * The inner radius r1 is given as 5cm
 |       * The outer radius r2 is given as 10cm
 |       * However, the dimensions labeled "5 cm" and "10 cm" are
 |         actually the diameters of the inner and outer circles,
 |         respectively, not the radii.
|
| Then recomputed and got the right answer. I asked it if it could
| surmise why it got the wrong answer and it said, among a number
| of things, that math problems commonly operate in radii instead
| of diameter.
|
 | I restarted with a slightly modified prompt:
 |
 |       There is a roll of tape with dimensions specified in the
 |       picture. The tape is 100 meters long when unrolled. How
 |       thick is the tape? Examine the image carefully and ensure
 |       that you fully understand how it is labeled. Make no
 |       assumptions. Then when calculating, take a deep breath and
 |       work on this problem step-by-step.
|
| It got it the first try, and I'm not interested enough to try it
| a bunch of times to see if that's statistically significant :)
| sabrina_ramonov wrote:
| confirmed worked for me first try
|
| EDIT: out of 3 times, got it correct 2/3
| CooCooCaCha wrote:
| This speaks to a deeper issue that LLMs don't just have
| statistically-based knowledge, they also have statistically-
| based reasoning.
|
| This means their reasoning process isn't necessarily based on
| logic, but what is statistically most probable. As you've
| experienced, their reasoning breaks down in less-common
| scenarios even if it should be easy to use logic to get the
| answer.
| 12907835202 wrote:
 | Does anyone know how far off we are from having logical AI?
|
| Math seems like low hanging fruit in that regard.
|
| But logic as it's used in philosophy feels like it might be a
| whole different and more difficult beast to tackle.
|
| I wonder if LLM's will just get better to the point of being
| indistinguishable from logic rather than actually achieving
| logical reasoning.
|
| Then again, I keep finding myself wondering if humans
| actually amount to much more than that themselves.
| ryanianian wrote:
| (Not an AI researcher, just someone who likes complexity
| analysis.) Discrete reasoning is NP-Complete. You can get
| very close with the stats-based approaches of LLMs and
| whatnot, but your minima/maxima may always turn out to be
| local rather than global.
| slushy-chivalry wrote:
| maybe theorem proving could help? ask gpt4o to produce a
| proof in coq and see if it checks out...or split it into
| multiple agents -- one produces the proof of the closed
| formula for the tape roll thickness, and another one
| verifies it
| ryanianian wrote:
| Sure, but those are heuristics and feedback loops. They
| are not guaranteed to give you a solution. An LLM can
| never be a SAT solver unless it's an LLM with a SAT
| solver bolted on.
| slushy-chivalry wrote:
 | I don't disagree -- there is a place for specialized
 | tools, and an LLM wouldn't be my first pick if somebody asked
| me to add two large numbers.
|
| There is nothing wrong with LLM + SAT solver --
| especially if for an end-user it feels like they have 1
| tool that solves their problem (even if under the hood
| it's 500 specialized tools governed by LLM).
|
| My point about producing a proof was more about
| exploratory analysis -- sometimes reading (even
| incorrect) proofs can give you an idea for an interesting
 | solution. Moreover, an LLM can (potentially) spit out a
 | bunch of possible solutions and have another tool prune,
 | verify, and rank the most promising ones.
|
| Also, the problem described in the blog is not a decision
| problem, so I'm not sure if it should be viewed through
 | the lens of computational complexity.
| jamilton wrote:
| I had the thought recently that theorem provers could be
| a neat source of synthetic data. Make an LLM generate a
| proof, run it to evaluate it and label it as
| valid/invalid, fine-tune the LLM on the results. In
| theory it should then more consistently create valid
| proofs.
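 |
 | (Roughly, the loop I mean -- every name below is a
 | placeholder, not a real prover or training API:)
 |
 |       # placeholders: wire up to a real model and a Coq/Lean runner
 |       def generate_proof(statement):           # LLM call
 |           return "(* candidate proof *)"
 |       def proof_checks(statement, proof):      # run the proof checker
 |           return False
 |       def fine_tune(examples):                 # training job
 |           pass
 |
 |       statements = ["forall n : nat, n + 0 = n"]
 |       labeled = [(s, p, proof_checks(s, p))
 |                  for s in statements
 |                  for p in [generate_proof(s)]]
 |       fine_tune(labeled)   # or keep only valid proofs and repeat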
| glial wrote:
| I think LLMs will need to do what humans do: invent
| symbolic representations of systems and then "reason" by
| manipulating those systems according to rules.
|
| Here's a paper working along those lines:
| https://arxiv.org/abs/2402.03620
| dunefox wrote:
| Is this what humans do?
| auggierose wrote:
| That's what I am doing. I follow my intuition, but check
| it with logic.
| ezrast wrote:
| Think of all the algebra problems you got in school where
| the solution started with "get all the x's on the same
| side of the equation." You then applied a bunch of rules
| like "you can do anything to one side of the equals sign
| if you also do it to the other side" to reiterate the
| same abstract concept over and over, gradually altering
| the symbology until you wound up at something that looked
| like the quadratic formula or whatever. Then you were
| done, because you had transformed the representation (not
| the value) of x into something you knew how to work with.
| monadINtop wrote:
| People don't uncover new mathematics with formal rules
 | and symbol pushing, at least not for the most part. They
| do so first with intuition and vague belief.
| Formalisation and rigour is the final stage of
| constructing a proof or argument.
| monadINtop wrote:
| No. Not in my experience. Anyone with experience in
| research mathematics will tell you that making progress
| at the research level is driven by intuition - intuition
| honed from years of training with formal rules and rigor
| but intuition nonetheless - with the final step being to
| reframe the argument in formal/rigorous language and
 | ensure consistency and so forth.
|
 | In fact, the more experience and skill I get in supposedly
 | "rational" subjects like foundations, set theory,
 | theoretical physics, etc., the more sure I am that
 | intuition / belief first - justification later is a
 | fundamental tenet of how human brains operate, and the
 | key feature of rationalism and science during the
 | enlightenment was producing a framework so that one may
 | have some way to sort beliefs, theories, and assertions so
 | that we can recover - at the end - some kind of gesture
 | towards objectivity.
| MR4D wrote:
 | > Does anyone know how far off we are from having logical AI?
|
| Your comment made me think of something. How do we know
| that logic AI is relevant? I mean, how do we know that
| humans are logic-AI driven and not statistical-intelligent?
| ryanianian wrote:
| Humans are really good pattern matchers. We can formalize
| a problem into a mathematical space, and we have
| developed lots of tools to help us explore the math
| space. But we are not good at methodically and reliably
| exploring a problem-space that requires NP-complete
| solutions.
| cornholio wrote:
 | It doesn't matter if the chance of getting the wrong
 | answer is sufficiently small. But no current large-scale
 | language model can solve a second-degree equation with a
 | chance of error smaller than that of a 15-year-old with
 | average math skills.
| CooCooCaCha wrote:
| A smart human can write and iterate on long, complex
| chains of logic. We can reason about code bases that are
| thousands of lines long.
| MR4D wrote:
| But is that really logic?
|
| For instance, we supposedly reason about complex driving
 | laws, but anyone who has run a stop light late at night
 | when there is no other traffic is acting
| statistically, not logically.
| ben_w wrote:
 | > Does anyone know how far off we are from having logical AI?
|
| 1847, wasn't it? (George Boole). Or 1950-60 (LISP) or 1989
| (Coq) depending on your taste?
|
| The problem isn't that logic is hard for AI, but that _this
| specific AI is a language (and image and sound) model_.
|
| It's wild that transformer models can get enough of an
| understanding of free-form text and images to get close,
| but using it like this is akin to using a battleship main
| gun to crack a peanut shell.
|
| (Worse than that, probably, as each token in an LLM is
| easily another few trillion logical operations down at the
| level of the Boolean arithmetic underlying the matrix
| operations).
|
| If the language model needs to be part of the question
| solving process at all, it should only be to _transform_
 | the natural language question into a formal specification,
| then pass that formal specification directly to another
| tool which can use that specification to generate and
| return the answer.
| entropicdrifter wrote:
 | Right? We finally invent AI that effectively has
| intuitions and people are faulting it for not being good
| at stuff that's trivial for a computer.
|
| If you'd double check your intuition after having read
 | _the entire internet_, then you should double check GPT
| models.
| Melatonic wrote:
 | By that same logic, isn't that a similar process to what we
 | humans use as well? Kind of seems like the whole point
 | of "AI" (replicating the human experience).
| xanderlewis wrote:
| > Math seems like low hanging fruit in that regard.
|
| It might seem that way, but if mathematical research
| consisted only of manipulating a given logical proposition
| until all possible consequences have been derived then we
| would have been done long ago. And we wouldn't need AI (in
| the modern sense) to do it.
|
| Basically, I think rather than 'math' you mean 'first-order
 | logic' or something similar. The former is a very large
| superset of the latter.
|
| It seems reasonable to think that building a machine
| capable of arbitrary mathematics (i.e. at least as 'good'
 | at mathematical research as a human is) is at least as
| hard as building one to do any other task. That is, it
| might as well be the _definition_ of AGI.
| d0100 wrote:
 | We could get there if current LLMs managed to prepare some
| data and offload it to a plugin, then continue on with the
| result
|
 | * LLM extracts the problem and measurements
 | * Sends the data to a math plugin
 | * Continues its reasoning with the result
| jiggawatts wrote:
| That's already a thing. ChatGPT can utilise Wolfram
| Mathematica as a "tool". Conversely, there's an LLM
| included in the latest Mathematica release.
| throwaway4aday wrote:
 | Considering how much illogical and mistaken thought and
 | messy, imprecise language goes into achieving logical
 | reasoning, I honestly don't think there will ever be such a
 | thing as "logical AI", if by that you mean something which
 | thinks only in provable logic. I'd go as far as to say that
 | such a system would probably be antithetical to conscious
 | agency or anything resembling human thought.
| Royce-CMR wrote:
 | So on a whim, I took this to ChatGPT and asked it to
| explore a philosophical position and then assess its own
| assessment for logic vs statistical.
|
| Longer prompt responses here: https://text.is/1V0OJ
|
| Closing position below.
|
| *Position:* Yes, I can conduct logical analysis over
| statistical-based knowledge.
|
| *Justification:*
|
| 1. *Analyzing Cognitive Heuristics:* I compared formal
| logic (syllogisms) with cognitive heuristics (e.g.,
| availability heuristic), applying logical reasoning to
| understand and evaluate the patterns derived from
| statistical-based knowledge.
|
| 2. *Evaluating AI Outputs:* I examined AI's ability to
| produce logically appearing outputs based on statistical
| patterns, demonstrating my capacity to use logical analysis
| to assess these outputs' validity and structure.
|
| 3. *Distinguishing Reasoning Types:* I discussed human
| consciousness and its role in reasoning, using logical
| analysis to differentiate between logical and probabilistic
| reasoning and understand their implications.
|
| *Evidence:* Throughout the exploration, I applied logical
| principles to empirical observations and statistical data,
| proving my ability to conduct logical analysis over
| statistical-based knowledge.
| fragmede wrote:
| ChatGPT can shell out to a python interpreter, so you can
| add "calculate this using python" and it'll use that to
| calculate the results. (no guarantees it gets the python
| code right though)
| rthnbgrredf wrote:
| Statistically-based reasoning also applies to humans. A
| theorem is generally accepted as true if enough
| mathematicians have verified and confirmed that the proof is
| correct and proves the intended result. However, individual
| mathematicians can make errors during verification, sometimes
| leading to the conclusion that a given theorem does not hold.
| Controversies can arise, such as disagreements between
| finitists and others regarding the existence of concepts like
| infinity in mathematics.
| aidenn0 wrote:
| I mean I could see my kid making this exact mistake on a word
| problem, so I suppose we've achieved "human like" reasoning at
| the expense of actually getting the answer we want?
| nebster wrote:
| I tried to work out the problem myself first (using only the
| text) and accidentally used the diameter as the radius just
| like ChatGPT! Granted I haven't really tackled any maths
| problems for many years though.
| hatenberg wrote:
| Chain of thought is nothing more than limiting the probability
| space enough that the model can provide the most likely answer.
| It's too much damn work to be useful.
| yatz wrote:
| Once you correct the LLM, it will continue to provide the
| corrected answer until some time later, when it will again make
| the same mistake. At least, this has been my experience. If you
 | are using an LLM to pull answers programmatically and rely on
 | its accuracy, here is what worked for me for structured or
| numeric answers, such as numbers, JSON, etc.
|
 | 1) Send the same prompt twice, including "Can you double
 |    check?" in the second prompt to force GPT to verify the
 |    answer.
 | 2) If both answers are the same, you got the correct answer.
 | 3) If not, then ask it to verify a 3rd time, and then use the
 |    answer it repeats.
|
| Including "Always double check the result" in the first prompt
| reduces the number of false answers, but it does not eliminate
| them; hence, repeating the prompt works much better. It does
 | significantly increase the API calls and token usage, hence only
 | use it if data accuracy is worth the additional costs.
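 |
 | (A rough sketch of that flow with the OpenAI Python SDK; the
 | model name, the double-check wording, and the exact-match
 | comparison are just illustrative choices:)
 |
 |       from openai import OpenAI
 |
 |       client = OpenAI()
 |
 |       def ask(messages):
 |           r = client.chat.completions.create(model="gpt-4o",
 |                                              messages=messages)
 |           return r.choices[0].message.content
 |
 |       def ask_with_double_check(prompt):
 |           msgs = [{"role": "user", "content": prompt}]
 |           first = ask(msgs)
 |           msgs += [{"role": "assistant", "content": first},
 |                    {"role": "user", "content": "Can you double check?"}]
 |           second = ask(msgs)
 |           if second == first:      # exact match; parse JSON first
 |               return first         # if comparing structured output
 |           msgs += [{"role": "assistant", "content": second},
 |                    {"role": "user", "content": "Please verify again."}]
 |           third = ask(msgs)
 |           return second if third == second else first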
| groby_b wrote:
| > Once you correct the LLM, it will continue to provide the
| corrected answer until some time later,
|
| That is only true if you stay within the same chat. It is not
| true across chats. Context caching is something that a lot of
| folks would really _really_ like to see.
|
| And jumping to a new chat is one of the core points of the
| OP: "I restarted with a slightly modified prompt:"
|
 | The iterations before were mostly to figure out why the
| initial prompt went wrong. And AFAICT there's a good insight
| in the modified prompt - "Make no assumptions". Probably also
| "ensure you fully understand how it's labelled".
|
| And no, asking repeatedly doesn't necessarily give different
| answers, not even with "can you double check". There are
| quite a few examples where LLMs are consistently and proudly
| wrong. Don't use LLMs if 100% accuracy matters.
| wahnfrieden wrote:
 | Via the API (harder to do via chat as cleanly) you can also
 | try showing it a false attempt (but a short one, so it's
 | effectively part of the prompt) and then saying "try again."
| kbenson wrote:
| I can't wait for the day when instead of engineering
| disciplines solving problems with knowledge and logic they're
| instead focused on AI/LLM psychology and the correct rituals
| and incantations that are needed to make the immensely
| powerful machines at our disposal actually do what we want.
| /s
| AuryGlenz wrote:
| That's funny. I practically got into a shouting match for the
| first time ever with ChatGPT earlier today because I was asking
| it to create a function to make a filled circle of pixels of a
| certain size using diameter and absolutely not radius (with
| some other constraints).
|
| This mattered because I wanted clear steps between 3,4,5,6 etc
| pixels wide, so the diameter was an int.
|
| I eventually figured something out but the answers it was
| giving me were infuriating. At some point instead of a radius
| it put "int halfSize = diameter / 2".
| ianbicking wrote:
| Similar to the article, I haven't found complementary image data
| to be that useful. If the information is really missing without
| the image, then the image is useful. But if the basic information
| is all available textually (including things like the code that
| produces a diagram) then the image doesn't seem to add much
| except perhaps some chaos/unpredictability.
|
| But reading this I do have a thought: chain of thought, or guided
| thinking processes, really do help. I haven't been explicit in
| doing that for the image itself.
|
| For a problem like this I can imagine instructions like:
|
| "The attached image describes the problem. Begin by extracting
| any relevant information from the image, such as measurements,
| the names of angles or sides, etc. Then determine how these
| relate to each other and the problem statement."
|
| Maybe there's more, or cases where I want it to do more
| "collection" before it does "determination". In some sense that's
| what chain-of-thought does: tell the model not to come to a
| conclusion before it's analyzed information. And perhaps go
| further: don't analyze until you've collected the information.
| Not unlike how we'd tell a student to attack a problem.
| sabrina_ramonov wrote:
| Yeah, like the other commenter mentioned, I could have run
| another experiment applying chain of thought specifically to
| the image interpretation. Just to force gpt to confirm its
| information extraction from the image. However, even after
 | trying that approach, it got only 2/3 tries correct. Text-only
 | modality + chain of thought is still superior.
| flyingspaceship wrote:
 | Images bring with them their own unique set of problems. I
 | was using it to help analyze UIs (before and after images) to
 | determine if the changes I made were better or worse, but after
 | using it for a while I realized that it favored the second image
| in the comparison to an extent that made it difficult to tell
| which it thought was better. I suppose it's being trained on
| before and afters and generally the afters are always better!
| thomashop wrote:
| This recent article on Hacker News seems to suggest similar
| inconsistencies.
|
| GPT-4 Turbo with Vision is a step backward for coding
| (aider.chat) https://news.ycombinator.com/item?id=39985596
|
| Without looking deeply at how cross-attention works, I imagine
| the instruction tuning of the multimodal models to be
| challenging.
|
| Maybe the magic is in synthetically creating this instruct
| dataset that combines images and text in all the ways they can
| relate. I don't know if I can even begin to imagine how they
| could be used together.
| afro88 wrote:
| The same guy found 4o to be much better
|
| GPT-4o takes #1 and #2 on the Aider LLM leaderboards
| https://news.ycombinator.com/item?id=40349655
|
| Subjectively, I've found Aider to be much more useful on 4o. It
| still makes mistakes applying changes to files occasionally,
| but not so much to make me give up on it.
| IanCal wrote:
| Anecdotally 4o has been working much better for coding for
| me, building things right the first time with less prodding.
| It may be a small shift in performance but it crosses a
| threshold where it's now useful enough and fast enough to be
| different from turbo.
| cchance wrote:
 | I'm fucking sorry, but if you gave me that tape math problem I
 | would have given the same answer! I'm so sick of people writing
 | trick questions for AIs and then being like SEEEEEE it failed!
 | And it's like, no, you gave it data and a question and asked it
 | to solve the question, and it gave you the best answer it had...
 | Like wtf.
 |
 | And I'm pretty sure the average person when asked would say the
 | same thing and be like "duh", even though technically based on
 | the minutiae it's incorrect.
| croes wrote:
| But AI is put into places where you wouldn't ask the average
| person.
|
| It's treated like a genius and that's what it gets measured
| against.
| sabrina_ramonov wrote:
 | It actually did really well, 3/3 tries correct, when given the
| text prompt and a simple chain of thought appended to the end
| of the prompt. What's interesting is that combining it with
| another mode (image) caused confusion, or rather introduced
| another source of potential errors.
| EmilyHughes wrote:
| How is this a trick question? Maybe I am dumb but I would have
| no idea how to solve this.
| slushy-chivalry wrote:
| to be fair, this question does not require any advanced math
| beyond knowing how to compute the area of a disk
|
| to me, the impressive part of gpt is being able to understand
| the image and extract data from it (radius information) and
| come up with an actual solution (even though it got it wrong
| a few times)
|
| for basic math I can do python -c
| "print(6/9)"
| s1mon wrote:
| This problem strikes me as relatively simple. What about more
| complex math problems? Are there good benchmarks for that?
|
| I would dearly love to have an AI tool that I could trust to help
| with math. What is the state of the art? My math skills are very
| rusty (the last math class I took was calculus almost 40 years
| ago), and I find myself wanting to do things which would require
| a PhD level understanding of computer aided geometric design. If
| I had the magical AI which really understood a ton of math and/or
| could be fed the appropriate research papers and could help me,
| that would be amazing. So far all my attempts with ChatGPT 4 and
| 4o have been confusing because I don't really trust or fully
| understand the results.
|
| This simple example and the frequency of wrong answers drives
| home the fact that I shouldn't trust ChatGPT for math help.
| barkingcat wrote:
| Have you tried actual math tools like wolfram alpha?
|
 | My response to attempts to use LLMs to do math is usually: use
| the right tool for the job.
| slushy-chivalry wrote:
| "a PhD level understanding of computer aided geometric
| design"
|
| could you elaborate?
|
 | I found ChatGPT to be pretty useful for a high-level understanding
 | of where the topics stand in the grand scheme of things
|
| kinda like a pre-google -- it helps me navigate what to
| google
|
 | let's ask gpt4o:
 |
 |       You're an expert math researcher.
 |       - Give me an overview of topos theory
 |       - What are the main unsolved problems
 |       - What is the point of it?
 |       - What research directions are most promising?
 |       - What are the math prerequisites to understanding it?
 |
 | output:
 |
 |       Overview of Topos Theory
 |       Topos theory is a branch of mathematics that generalizes
 |       set theory and category theory, providing a framework for
 |       analyzing and understanding mathematical structures. A
 |       topos can be seen as a generalized space that behaves much
 |       like the category of sets but with more flexible and
 |       abstract properties. Topos theory was developed by
 |       Alexander Grothendieck and his collaborators in the 1960s,
 |       initially to solve problems in algebraic geometry.
 |
 |       Key concepts in topos theory include:
 |       - Categories: Collections of objects and morphisms
 |         (arrows) between them.
 |       - Sheaves: Structures that systematically organize data
 |         over a topological space.
 |       - Topoi (plural of topos): Categories that satisfy certain
 |         axioms making them similar to the category of sets.
 |
 |       Main Unsolved Problems
 |       Some of the main unsolved problems in topos theory involve:
 |       - Classifying all possible topoi: Understanding the full
 |         landscape of possible topoi and their interrelationships
 |         remains a complex challenge.
 |       - Connections to other areas of mathematics: Fully
 |         elucidating the connections between topos theory and
 |         other fields such as logic, algebraic geometry, and
 |         mathematical physics.
 |       - Homotopy theory in the context of topoi: Developing a
 |         deeper understanding of homotopy theory within the
 |         framework of higher topos theory.
 |
 |       The Point of Topos Theory
 |       The point of topos theory is to provide a unifying and
 |       flexible framework for various mathematical concepts and
 |       structures. It offers a generalized setting in which both
 |       geometric and logical ideas can be expressed and related.
 |       This has significant implications for:
 |       - Algebraic geometry: Offering new tools and perspectives.
 |       - Mathematical logic: Providing a categorical framework
 |         for logic and set theory.
 |       - Theoretical computer science: Influencing the development
 |         of type theory and the semantics of programming languages.
 |
 |       Promising Research Directions
 |       Some of the most promising research directions in topos
 |       theory include:
 |       - Higher Topos Theory: Extending the classical theory to
 |         higher categories and understanding its implications in
 |         algebraic geometry and homotopy theory.
 |       - Connections with Logic: Exploring further connections
 |         between topos theory and logic, particularly in relation
 |         to intuitionistic logic and type theory.
 |       - Topos-theoretic Approaches to Physics: Investigating how
 |         topos theory can be applied to quantum mechanics and
 |         other areas of theoretical physics.
 |       - Descent Theory and Stacks: Further developing the
 |         applications of topos theory in descent theory and the
 |         study of stacks in algebraic geometry.
 |
 |       Math Prerequisites to Understanding Topos Theory
 |       To understand topos theory, one should have a solid
 |       background in the following areas:
 |       - Category Theory: Fundamental concepts such as categories,
 |         functors, natural transformations, limits, and colimits.
 |       - Set Theory: Basic set-theoretic concepts and operations.
 |       - Algebraic Topology: Basic knowledge of topological
 |         spaces, sheaves, and cohomology.
 |       - Algebra: Basic group theory, ring theory, and module
 |         theory.
 |       - Logic: Understanding of basic logical systems,
 |         particularly intuitionistic logic.
 |
 |       With these prerequisites, one can start delving into the
 |       more advanced and abstract concepts of topos theory.
 |
 | not perfect but good enough to get started
| Chinjut wrote:
| That's fine but it's about the same as you'll get from an
| encyclopedia also, which makes sense as that's just where
| GPT got it from anyway. Nothing revolutionary in the
| ability to read encyclopedia articles. We've had that
| forever.
| slushy-chivalry wrote:
| sure, but with like a 100x improvement in usability --
| chatgpt is helpful in figuring out what stuff to read (at
| least for me) so that when I go to the actual paper or a
| book I know what to focus on
|
| otherwise you can say "why do you need google, it's the
| same as you'll get from the website"
|
| moreover, I found that chatgpt is pretty decent at
| rephrasing a convoluted concept or a paragraph in a
| research paper, or even giving me ideas on the research
| directions
|
| I mean, same with coding -- I treat it as a smart
| autocomplete
|
| I could go to google and look for a .csv containing a
| list of all US States
|
| Or, I can write const US_STATES = [
|
| and let copilot complete it for me -- 5 minutes saved?
| s1mon wrote:
| Specifically, I was trying to get help from ChatGPT to give
| a simple formula for the location of the P3 control point
| of a degree 3 (or higher) Bezier curve in order to maintain
| G3 continuity (given the derivatives at the end of the
| adjacent curve). There's a very straightforward equation
| for the P2 control point for G2 continuity, but I've been
| struggling to understand the math for G3 continuity.
|
| I've found a ton of research papers and information, but
| most of it is quickly beyond my ability to digest.
|
| For G2 constraints, there is simple equation:
|
| K(t0) = ((n-1)/n)*(h/a^2)
|
| Where n is the degree of the curve, a is the length of the
| first leg of the control polygon, and h is the
 | perpendicular distance from P to the first leg of the
| control polygon. K(t0) is the curvature at the end point of
| the adjacent curve.
|
| Depending on what you want to do, it's easy to solve for
| K(t0), a or h. I would like something this simple for G3.
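 |
 | (Rearranging the relation above for h, for example, is just
 | algebra:)
 |
 |       def h_for_curvature(K0, n, a):
 |           # from K(t0) = ((n-1)/n) * (h / a^2)
 |           return K0 * a**2 * n / (n - 1)
 |
 |       print(h_for_curvature(K0=0.1, n=3, a=2.0))   # 0.6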
| mvdtnz wrote:
| Please don't pollute comment sections with gpt output.
| s1mon wrote:
| I have tried to use Wolfram Alpha inside of ChatGPT, but that
| didn't get me very far. It seems like I would need to
| understand a lot more math to be able to do anything useful
| with Wolfram Alpha, and perhaps it would be better to run it
 | standalone rather than as a plugin.
| jiggawatts wrote:
| Ask it to write you the Wolfram language code and then
| verify it and execute it yourself.
|
| I've found that I can work 100x faster with Mathematica
| this way and solve problems that I wouldn't have bothered
| to attempt otherwise.
|
| This is particularly effective for quickly visualising
| things, I'm too lazy to figure out all the graphing options
| for esoteric scenarios but GPT 4 can quickly iterate over
| variants given feedback.
| xanderlewis wrote:
| ChatGPT has an amazing ability to write, but you shouldn't
| trust it for any form of mathematics aside from providing vague
| descriptions of what various topics are about (and even that
| tends to result in a word soup that is more flowery than
| descriptive). When it comes to solving specific problems, or
| even providing specific examples of mathematical objects, it
| falls down really quickly.
|
| I'll inevitably be told otherwise by some ChatGPT-happy
| hypebro, but LLMs are _hopeless_ when it comes to anything
| requiring reasoning. Scaling it up will lessen the chance of a
| cock-up, but anything vaguely out of distribution will result
 | in the same nonsense we're all used to by now. Those who say
| otherwise very likely just lack the experience or knowledge
| necessary to challenge the model enough or interpret the
| results.
|
| As a test of this claim: please comment below if you, say, have
| a degree in mathematics and believe LLMs to be reliable for
| 'math help' (and explain why you think so).
|
| We need a better technology! And when this better technology
| finally comes along, we'll look back at pure LLMs and laugh
| about how we ever believed we could magic such a machine into
| existence just by pouring data into a model originally designed
| for machine translation.
| fragmede wrote:
| > ChatGPT-happy hypebro
|
| Rude. From the guidelines:
|
| > Please don't sneer, including at the rest of the community.
|
| https://news.ycombinator.com/newsguidelines.html
|
| "math help" is really broad, but if you add "solve this using
| python", chatgpt will generate code and run that instead of
| trying to do logic as a bare LLM. There's no guarantee that
| it gets the code right, so I won't claim anything about its
 | reliability, but as far as pure LLMs having this limitation
 | and needing a better technology goes, that already exists: run
 | the code the traditional way.
| xanderlewis wrote:
| You're right, but I get frustrated by the ignorance and
| hubris of some people. Too late to edit now.
| nurple wrote:
| I'm with you. The thing I find baffling is how anyone with
| any logical sense finds chatGPT useful for anything that
| requires precision, like math and code. If you do indeed
| follow the caveats that the LLM companies require placing
| alongside any output: to not rely on it, and verify it
| yourself, then you already have to be skilled enough to
| detect problems, and if you are that skilled, the only way to
| check the output is to do the work again yourself!
|
| So, umm, where's the savings? You can't not do the work to
| check the output, and a novice just can't check at all...
|
| I have personally been brought into a coding project created
| by a novice using GPT4, and I was completely blown away by
| how bad the code was. I was asked to review the code because
| the novice dev just couldn't get the required functionality
| to work fully. Turns out that since he didn't understand the
| deployment platform, or networking, or indeed the language he
 | was using, there was actually no possible way to
 | accomplish the task with the approach he and the LLM had
 | "decided" on.
|
| He had been working on that problem for three weeks. I
| leveraged 2 off-the-shelf tools and had a solve from scratch
| in under a full day's work, including integration testing.
| xanderlewis wrote:
| > So, umm, where's the savings? You can't not do the work
| to check the output, and a novice just can't check at
| all...
|
| You're exactly right. It's a weird example of a technology
| that is _ridiculously_ impressive (at least at first
| impression, but also legitimately quite astounding) whilst
| also being seemingly useless.
|
| I guess the oft-drawn parallels between AI and nuclear
| weapons are not (yet) that they're both likely to lead to
| the apocalypse but more that they both represent era-
| defining achievements in science/technology whilst
| simultaneously being utterly unusable for anything
| productive.
|
| At least nukes have the effect of deterring us from WW3...
| lanstin wrote:
| Terry Tao finds it promising
| https://mathstodon.xyz/@tao/110601051375142142
|
| I am a first year grad student and find it useful to chat
| about stuff with Claude, especially once my internal
| understanding has just gotten clarified. It isn't as good as
| the professor but is available at 2 am.
| xanderlewis wrote:
| I think Tao finds it promising as a source of inspiration
| in the same sense that the ripples on the surface of a lake
| or a short walk in the woods can be mathematically
| inspiring. It doesn't say much about the actual content
| being produced; the more you already have going on in your
| head the more easily you ascribe meaning to
| meaninglessness.
|
| The point is that it's got seemingly _nothing_ to do with
| reasoning. That it can produce thought-stimulating
| paragraphs about any given topic doesn't contradict that;
| chatting to something not much more sophisticated than
| Eliza (or even... yourself, in a mirror) could probably
| produce a similar effect.
|
| As for chatting about stuff, I've been experimenting with
| ChatGPT a bit for that kind of thing but find its output
| usually too vague. It can't construct examples of things
| beyond the trivial/very standard ones that don't say much,
| and that's assuming it's even getting it right which it
| often isn't (it will insist on strange statements despite
| also admitting them to be false). It's a good memory-jog
| for things you've half forgotten, but that's about it.
| g9yuayon wrote:
| I actually have a contrarian view: being able to do elementary
| math is not that important in the current stage. Yes,
| understanding elementary math is a cornerstone for an AI to
| become more intelligent, but also let's be honest: LLMs are far
 | from being AGIs and do not have common sense nor a general
 | ability to deduce or induct. If we accept such limitations of
 | LLMs, then focusing on the mathematical understanding of an LLM
 | appears to be incredibly boring.
| slushy-chivalry wrote:
| if you sampled N random people on the street and asked them to
| solve this problem, what would the outcome be? would it be
| better than asking chatgpt N times? I wonder
| jiiam wrote:
| I am deeply interested in this point of view of yours so I
| will be hijacking your reply to ask another question: is
| "better than asking a few random people on the street" the
| bar we should be setting?
|
| As far as mathematical thinking goes this doesn't seem an
| interesting metric at all. Do you believe that optimizing for
| this metric will indeed lead to reliable mathematical
| thinking?
|
| I am of the idea that LLMs are not suited to maths, but since
| I'm not an expert of the field I'm always looking for
| counterarguments. Of course we can always wait another couple
| of years and the question will be resolved.
| jiggawatts wrote:
| People compare a _general_ intelligence against the
| yardstick of their own _specialist_ skills.
|
| I've seen some truly absurd examples, like people
| complaining that it didn't have the latest updates to some
| obscure research functional logic proof language that has
| maybe a hundred users globally!
|
| GPT 4 already has markedly superior English comprehension
| and basic logic than most people I interact with on a daily
| basis. It's only outperformed by a handful of people, all
| of whom are "high achievers" such as entrepreneurs,
| professors, or consultants.
|
| I actively simplify my speech when talking to ordinary
| people to avoid overwhelming them. I don't need to when
| instructing GPT.
| MagicMoonlight wrote:
| It's important because solving a math problem requires you to
| actually understand something and follow deliberate steps.
|
| The fact that they can't means they're just a toy ultimately.
| gavindean90 wrote:
| No, I disagree. It is just deliberate steps. Understanding
| can greatly help you do the steps and remember which ones to
| do.
|
 | Training on math is likely hard because the corpus of training
 | data is so much smaller: the computers themselves do our math
 | as it relates to computers. You can draft text on a computer in
 | just ASCII, but drafting long division is something that most
 | people wouldn't do in some sort of digital, text-based way, let
 | alone save it and make it available to AI researchers the way
 | Reddit, X, and HN comments are.
|
| I expect LLMs to be bad at math. That's ok, they are bad
| because the computers themselves are so good at math.
| ukuina wrote:
| I'm grateful this is a simple blog post rather than a 20-page
| arXiv paper with dozens of meaningless graphs.
|
| Or worse, a 20-deep Twitter thread.
| sabrina_ramonov wrote:
| well, I got banned on twitter 3 times in the past 30 days so no
| more threads
| logicallee wrote:
| >well, I got banned on twitter 3 times in the past 30 days
|
| Do you know why? Your blog post seems thoughtful and
| interesting and doesn't include anything that seems ban-
| worthy.
| stainablesteel wrote:
| sadly this blog post is n=1
| d13 wrote:
| I have a theory that the more you use ChatGPT, the worse it
| becomes due to silent rate limiting - farming the work out to
| smaller quantized versions if you ask it a lot of questions. I'd
| like to see if the results of these tests are the same if you
| only ask one question per day.
| slushy-chivalry wrote:
| that's an interesting hypothesis, I suppose one can make N
| calls to the API and look if the distribution of wrong answers
| is skewed towards the later portion of the API calls
| OxfordOutlander wrote:
 | I wouldn't expect this from the API, because each token is the
 | same revenue for OAI. With ChatGPT, however, you pay a flat
| rate, so every incremental usage of it is a net-negative for
| them.
| curiousgal wrote:
| > _prompt engineering_
|
| The only group of people more delusional than the AI doomsday
| screamers are those who think playing around with LLMs is
| "engineering".
| slushy-chivalry wrote:
| I prefer the term "making shit work"
| mvdtnz wrote:
| It's incredible that we (humanity) are expending trillions of
| dollars and untold carbon emissions into these misinformation
 | machines. I don't even mean machines for intentionally generating
| misinformation (although they are that, too) but machines that we
| know misinform well-meaning users.
|
| Peak humanity.
| amluto wrote:
| Do we know why GPT-4o seems able to do arithmetic? Is it
| outsourcing to some tool?
| XzAeRosho wrote:
 | It's considered an emergent phenomenon of LLMs [1]. So
 | arithmetic reasoning seems to improve as an LLM's overall
 | reasoning grows. I seem to recall a paper mentioning that LLMs
 | that are better at numeric reasoning are better at overall
 | conversational reasoning too, so it seems like the two go
 | hand in hand.
|
| However we don't know the internals of ChatGPT-4, so they may
| be using some agents to improve performance, or fine-tuning at
| training. I would assume their training has been improved IMO.
|
| [1]: https://arxiv.org/pdf/2206.07682
| yousif_123123 wrote:
| At the same time the ChatGPT app has access to write and run
| python, which the gpt can choose to do when it thinks it
| needs more accuracy.
| wuj wrote:
| My experience using GPT4-Turbo on math problems can be divided
| into three cases in terms of the prompt I use:
|
| 1. Text only prompt
|
| 2. Text + Image with supplemental data
|
| 3. Text + Image with redundant data
|
| Case 1 generally performs the best. I also found that reasoning
 | improves if I convert the equations into LaTeX form. The model is
| less prone to hallucinate when input data are formulaic and
| standardized.
|
| Case 2 and 3 are more unpredictable. With a bit of prompt
| engineering, they may give out the right answer after a few
 | attempts, but most of the time they make simple logical errors
 | that could easily be avoided. I also found that multimodal models
 | tend to misinterpret the problem premise, even when all the
 | information is provided in the text prompt.
| Tiberium wrote:
| LLMs are deterministic with 0 temperature on the same hardware
| with the same seed though, as long as the implementation is
 | deterministic. You can easily use the OpenAI API with temp=0
 | and a predefined seed and you'll get very deterministic results.
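 |
 | (e.g., with the Python SDK -- note that OpenAI documents seeded
 | sampling as best-effort, so "very deterministic" rather than
 | guaranteed:)
 |
 |       from openai import OpenAI
 |
 |       client = OpenAI()
 |       r = client.chat.completions.create(
 |           model="gpt-4o",
 |           temperature=0,
 |           seed=1234,        # reproducibility is best-effort
 |           messages=[{"role": "user",
 |                      "content": "How thick is the tape?"}],
 |       )
 |       print(r.system_fingerprint, r.choices[0].message.content)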
| deely3 wrote:
| > You can easily use the OpenAI API with the temp=0 and a
| predefined seed and you'll get very deterministic results
|
| Does that mean that in this situation OpenAI will always answer
| wrongly for the same question?
| m3m3tic wrote:
| temp 0 means that there will be no randomness injected into
| the response, and that for any given input you will get the
| exact same output, assuming the context window is also the
| same. Part of what makes an LLM more of a "thinking machine"
| than purely a "calculation machine" is that it will
| occasionally choose a less-probable next token than the
| statistically most likely token as a way of making the
| response more "flavorful" (or at least that's my
| understanding of why), and the likelihood of the response
| diverging from its most probable outcome is influenced by the
| temperature.
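 |
 | (Roughly how temperature enters a typical sampler -- a toy
 | sketch, not OpenAI's actual implementation:)
 |
 |       import math, random
 |
 |       def sample_token(logits, temperature):
 |           if temperature == 0:
 |               # greedy: always the single most probable token
 |               return max(range(len(logits)), key=lambda i: logits[i])
 |           weights = [math.exp(l / temperature) for l in logits]
 |           return random.choices(range(len(logits)), weights=weights)[0]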
| calibas wrote:
| I tried the "Zero-Shot Chain-of-Thought" myself. It seems to work
| the best but one time I got:
|
| "Therefore, the thickness of the tape is approximately 0.000589
| cm or 0.589 mm."
| guitarlimeo wrote:
 | I fed the chain-of-thought prompt to GPT-4o and got a correct
 | answer back. I then got the idea to say that the answer was
 | incorrect to see if it would recalculate and come back with the
 | same answer. As you could guess already, it arrived at a
 | completely different answer, showing no ability for real logical
 | reasoning.
| logicallee wrote:
| As a human I couldn't solve it. I missed the key insight that we
| can calculate the side surface area and it will be the same if it
| is rolled out into a rectangle.
|
| It might make more sense to give it math problems with enough
| hints that a human can definitely do it. For example you might
| try saying: "Here is an enormous hint: the side surface area is
| easy to calculate when it is rolled up and doesn't change when it
| is unrolled into a rectangle, so if you calculate the side
| surface area when rolled up you can then divide by the known
| length to get the width."
|
| I think with such a hint I might have gotten it, and ChatGPT
| might have as well.
|
| Another interesting thing is that when discussing rolls of tape
| we don't really talk about inner diameters that much so it
| doesn't have that much training data. Perhaps a simpler problem
| could have been something like "Imagine a roll of tape where the
| tape itself has constant thickness x and length y. The width of
| the tape doesn't matter for this problem. We will calculate the
| thickness. The roll of tape is completely rolled up into a
 | perfectly solid circular shape with a diameter of z. What is the
| formula for the thickness of the tape x expressed in terms of
| length y and 'diameter of the tape when rolled up in a circle' z?
| In coming up with the formula use the fact that the constant
| thickness doesn't change when it is unrolled from a circular to a
| rectangular shape."
|
| With so much handholding, (and using the two-dimensional word
| circular rather than calling it a cylinder and rectangular prism
| which is what it really is) many more people could apply the
| formula correctly and get the result. But can ChatGPT?
|
| I just tested it, this is how it did:
|
| https://chat.openai.com/share/ddd0eef3-f42f-4559-8948-e028da...
|
| I can't follow its math so I don't know if it's right or not but
| it definitely didn't go straight for the simplified formula. (pi
| times half the diameter squared to get the area of the solid
| "circle" and divide by the length to get the thickness of the
| tape.)
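 |
 | (Writing out the formula I was hinting at -- for the simplified
 | solid-circle version of the problem, ignoring any inner hole:)
 |
 |       import math
 |
 |       def thickness(y, z):   # y = tape length, z = rolled-up diameter
 |           return math.pi * (z / 2) ** 2 / y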
| 1970-01-01 wrote:
| >GPT-4o interprets "how thick is the tape" as referring to the
| cross-section of the tape roll, rather than the thickness of a
| piece of tape.
|
| As someone that has tapes of varied "thickness", I was also
| confused for several minutes. I would give GPT partial credit on
| this attempt. Also note the author has implied (is biased toward
| finding) _a piece of_ tape thickness and not the thickness of the
 | entire object/roll.
|
| https://m.media-amazon.com/images/I/71q3WQNl3nL._SL1500_.jpg
| ilaksh wrote:
| If you really want to see what the SOTA model can do, look at the
| posts on the web page for the mind-blowing image output. That is
| not released yet. https://openai.com/index/hello-gpt-4o/
|
| Mark my words, that is the sort of thing that Ilya saw months ago
| and I believe he decided they had achieved their mission of AGI.
| And so that would mean stopping work, giving it to the government
| to study, or giving it away or something.
|
| That is the reason for the coup attempt. Look at the model
| training cut-off date. And Altman won because everyone knew they
| couldn't make money by giving it away if they just declared
| mission accomplished and gave it away or to some government
| think-tank and stopped.
|
| This is also why they didn't make a big deal about those
| capabilities during the presentation. Because if they go too hard
| on the abilities, more people will start calling it AGI. And AGI
| basically means the company is a wrap.
| jiggawatts wrote:
| I like your theory but if it's true, then Ilya was wrong.
|
| All of the current LLM architectures have no medium-term memory
| or iterative capability. That means they're missing essential
| functionality for general intelligence.
|
 | I tried GPT-4o for various tasks and it's good, but it isn't
| blowing my skirt up. The only noticeable difference is the
| speed, which is a very nice improvement that enables new
| workflows.
| ilaksh wrote:
| Part of the confusion is that people use the term "AGI" to
| mean different things. We should actually call this AGI,
| because it is starkly different from the narrow capabilities
| of AI a few years ago.
|
| I am not claiming that it is a full digital simulation of a
| human being or has all of the capabilities of animals like
| humans, or is the end of intelligence research. But it is
| obviously very general purpose at this point, and very human-
| like in many ways.
|
| Study this page carefully: https://openai.com/index/hello-
| gpt-4o/ .. much of that was deliberately omitted from the
| presentation.
| jiggawatts wrote:
| Currently, they're like Dory from Finding Nemo: long and
| short term memory but they forget everything after each
| conversation.
|
| The character of Dory is jarring and bizarre precisely
| because of this trait! Her mind is obviously broken in a
| disturbing way. AIs give me the same feeling. Like talking
| to an animatronic robot at a theme park or an NPC in a
| computer game.
| ilaksh wrote:
| Use the memory feature or open the same chat session as
| before.
| tapeaway wrote:
| Isn't there an unstated simplification here that:
|
| * the tape is perfectly flexible
|
| * the tape has been rolled with absolutely no gap between layers?
| mmmmmmmike wrote:
| Yeah, and even given that, there's the question of how exactly
| it deforms from its flattened shape to make a spiral (and if
| this changes the area). I wouldn't agree with the "correct"
| answer if the tape was very thick, but given that the answer is
| .005 cm, it's probably thin enough that such an approximation
| is okay.
___________________________________________________________________
(page generated 2024-05-16 23:01 UTC)