[HN Gopher] Tencent's 'Hunyuan-T1'-The First Mamba-Powered Ultra...
___________________________________________________________________
Tencent's 'Hunyuan-T1'-The First Mamba-Powered Ultra-Large Model
Author : marban
Score : 283 points
Date : 2025-03-22 17:25 UTC (1 day ago)
(HTM) web link (llm.hunyuan.tencent.com)
(TXT) w3m dump (llm.hunyuan.tencent.com)
| chis wrote:
| Kobe?
| nixpulvis wrote:
| Some of the text is cut off while reading on my phone.
| Embarrassing.
| drysine wrote:
| Don't be so harsh on your phone)
| pkkkzip wrote:
| Thanks for sharing. Did you contact Tencent support?
| jrflowers wrote:
| Why are you embarrassed? You can always put your phone down and
| read it on desktop later
| brookst wrote:
| Just seems odd not to test a website on a phone, even
| accidentally.
| notShabu wrote:
| The romanization of these names is always confusing b/c, stripped
| of the characters and tones, it's just gibberish. "Hunyuan" or Hun
| Yuan in Chinese means "Primordial Chaos" or "Original Unity".
|
| This helps as more Chinese products and services hit the market,
| and it makes the names easier to remember. The naming is similar to
| the popularity of greek mythology in western products. (e.g. all the
| products named "Apollo")
| klabb3 wrote:
| > The naming is similar to the popularity of greek mythology in
| western products. (e.g. all the products named "Apollo")
|
| Popular? So you're saying that all the VPs who have come up
| with the mind bendingly unique and creative name Prometheus
| didn't do so out of level 10 vision?
| Y_Y wrote:
| I think it's particularly egregious that they use such a lossy
| encoding. I can't read the hanzi, but at least "Hun yuan" would
| have been more helpful, or even "Hu4n yua1n" would have enabled
| me to pronounce it or look it up without having the context to
| guess which characters it was representing.
| realusername wrote:
| I don't understand why this Vietnamese-style writing isn't
| the most popular form of pinyin. It's clearly superior to
| putting numbers inside words.
| currymj wrote:
| Tone markers are of limited use to Chinese readers (instead,
| just show them the characters).
|
| They are also of limited use to non-Chinese readers, who
| don't understand the tone system and probably can't even
| audibly distinguish tones.
|
| So, it makes sense that we get this weird system even though
| it's strictly worse.
| powerapple wrote:
| Yes, this is very annoying, because of how Pinyin works. There
| were a lot of mistakes made when using Pinyin in English
| content. Pinyin is supposed to break at the character level:
| Pinyin = Pin Yin. You can easily write it as Pin-Yin or Pin
| Yin, but "Pinyin" as one word is just wrong.
|
| Hun Yuan is a lot better. And I agree: with Unicode, we can
| easily incorporate the tones.
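|
| As an illustration of the Unicode point (a rough sketch only; the
| tone-placement rule below is simplified, and the syllables are
| just examples, not anyone's production code):
|
|     import unicodedata
|
|     # combining marks for tones 1-4 (macron, acute, caron, grave)
|     TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
|
|     def add_tone(syllable: str) -> str:
|         """Turn numbered pinyin like 'hun4' into accented 'hùn'."""
|         if not syllable or not syllable[-1].isdigit():
|             return syllable
|         tone, base = int(syllable[-1]), syllable[:-1]
|         if tone not in TONE_MARKS:      # neutral tone: no mark
|             return base
|         # Simplified placement: 'a'/'e'/'o' take the mark if
|         # present, otherwise the last 'i'/'u' does.
|         for vowel in "aeo":
|             if vowel in base:
|                 i = base.index(vowel)
|                 break
|         else:
|             i = max(base.rfind("i"), base.rfind("u"))
|         marked = base[: i + 1] + TONE_MARKS[tone] + base[i + 1 :]
|         return unicodedata.normalize("NFC", marked)
|
|     print(add_tone("hun4"), add_tone("yuan2"))  # hùn yuán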
| jiehong wrote:
| Agreed. We all have a duty to respect languages and their
| official transcription. Pinyin with tones does not look much
| different from French with accents. In both cases, most people
| aren't likely to pronounce it correctly, though.
|
| The irony is not lost on me that Tencent themselves did that.
| ttoinou wrote:
| > the excellent performance demonstrated by the models fully
| > proves the crucial role of reinforcement learning in the
| > optimization process
|
| What if this reinforcement learning is just gaming the benchmarks
| (Goodhart's law) without providing better answers elsewhere? How
| would we notice it?
| m3kw9 wrote:
| When actual people start using it
| dartos wrote:
| I mean all optimization algorithms do is game a benchmark.
| That's the whole point.
|
| The hard part is making the benchmark meaningful in the first
| place.
| TeMPOraL wrote:
| Yeah, and if anything, RL has a rep of being _too good at
| this job_, because of all the cases where it gamed a
| benchmark by picking up on some environmental factor the
| supervisors hadn't thought of (numerical instabilities,
| rounding, bugs, etc.).
| porridgeraisin wrote:
| My favourite is this one:
|
| https://news.ycombinator.com/item?id=43113941
| magicalhippo wrote:
| The ML version of Professor Farnsworth[1]:
|
| _It came to me in a dream, and I forgot it in another
| dream._
|
| [1]: https://www.imdb.com/title/tt0584424/quotes/?item=qt0439248
| einpoklum wrote:
| No, that is patently false. Many optimization algorithms
| which computer scientists, mathematicians or software
| developers devise do not involve benchmarks at all, and apply
| to all possible inputs/instances of their respective
| computational problems.
| brookst wrote:
| Example?
| hgomersall wrote:
| Those times when people write code with loads of
| theoretical micro optimisations that they never actually
| test against a benchmark because otherwise they wouldn't
| do it.
| brookst wrote:
| Perhaps "example" means different things in different
| cultures.
| hgomersall wrote:
| I was being somewhat facetious.
| CuriouslyC wrote:
| Plot twist: the loss function for training is basically a
| benchmark
| dartos wrote:
| Is it a plot twist if it's the whole plot?
| mentalgear wrote:
| The trick is that the benchmarks must have a wide enough
| distribution so that a well-scoring model is potentially useful
| for the widest span of users.
|
| There would also need to be a guarantee (or some way of checking
| the model) that model providers don't just train on the
| benchmarks. Possible solutions are dynamic components (random
| names, numbers, etc.) or private parts of benchmarks.
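|
| A toy sketch of the "dynamic components" idea (the template,
| names, and numbers are all made up for illustration):
|
|     import random
|
|     TEMPLATE = ("{name} has {a} apples and buys {b} more. "
|                 "How many apples does {name} have now?")
|
|     def make_item(rng: random.Random) -> tuple[str, int]:
|         # Fresh surface form on every draw, so the exact question
|         # text cannot simply be memorized from a leaked test set.
|         name = rng.choice(["Avery", "Bilal", "Chen", "Dana"])
|         a, b = rng.randint(2, 99), rng.randint(2, 99)
|         return TEMPLATE.format(name=name, a=a, b=b), a + b
|
|     question, answer = make_item(random.Random())
|     print(question, "->", answer)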
| brookst wrote:
| A common pattern is for benchmark owners to hold back X% of
| their set so they can independently validate that models
| perform similarly on the holdback set. See: FrontierMath /
| OpenAI brouhaha.
| CamperBob2 wrote:
| You could ask the same question of a student who has just
| graduated after passing specific tests in school.
| brookst wrote:
| Student, lawyer, doctor, etc.
| porridgeraisin wrote:
| Typically you train it on one set and test it on another set.
| If you see that the differences between the two sets are
| significant enough and yet it has maintained good performance
| on the test set, you claim that it has done something useful
| [alongside gaming the benchmark that is the train set]. That
| "side effect" is always the useful part in any ML process.
|
| If the test set is extremely similar to the train set then yes,
| it's Goodhart's law all around. For modern LLMs, it's hard to
| make a test set that is different from what it has trained on,
| because of the sheer expanse of the training data used. Note
| that the two sets are different only if they are statistically
| different. It is not enough that they simply don't repeat
| verbatim.
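|
| The same idea in miniature (dataset, library, and model choices
| here are only illustrative):
|
|     from sklearn.datasets import load_digits
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.model_selection import train_test_split
|
|     X, y = load_digits(return_X_y=True)
|     X_tr, X_te, y_tr, y_te = train_test_split(
|         X, y, test_size=0.3, random_state=0)
|
|     model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
|
|     # A small gap between the two scores suggests generalization;
|     # a large gap suggests the model mostly memorized the train set.
|     print("train accuracy:", model.score(X_tr, y_tr))
|     print("test accuracy: ", model.score(X_te, y_te))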
| Lerc wrote:
| A large amount of work in the last few years has gone into
| building benchmarks because models have been going through and
| beating them at a fairly astonishing rate. It's generally
| accepted as true that passing any one of them does not
| constitute fully general intelligence but the difficult part
| has been finding things that they cannot do. They are giving
| them more and more difficult tasks. The ARC prize in particular
| was designed to be focused on reasoning more than knowledge.
| The 87.5% score achieved in such a short time by throwing lots
| of resources at conventional methods was quite a surprise.
|
| You can at least have a degree of confidence that they will
| perform well in the areas covered by the benchmarks (as long as
| they weren't contaminated) and with enough benchmarks you get
| fairly broad coverage.
| aydyn wrote:
| > does not constitute fully general intelligence but the
| difficult part has been finding things that they cannot do
|
| I am very surprised when people say things like this. For
| example, the best ChatGPT model continues to lie to me on a
| daily basis for even basic things. E.g. when I ask it to
| explain what code is contained on a certain line on github,
| it just makes up the code and the code it's "explaining"
| isn't found anywhere in the repo.
|
| From my experience, every model is untrustworthy and full of
| hallucinations. I have a big disconnect when people say
| things like this. Why?
| daniel_iversen wrote:
| I'm splitting hairs a little bit, but I feel like there
| should be a difference in how we think about current
| "hard(er)" limitations of the models vs. limits in general
| intelligence and reasoning. I.e. I think the grandparent
| comment is talking about overall advancement in reasoning
| and logic, and about finding things AI "cannot do" in that
| sense, whereas you're referring to what would more likely be
| classified as a "known issue". Of course it's an important
| issue that needs to get fixed, and yes, technically until we
| no longer have that kind of issue we can't call it "general
| intelligence", but I do think the original comment is about
| something different than a few known limitations that
| probably a lot of models have (and that frankly you'd have
| thought wouldn't be that difficult to solve!?)
| aydyn wrote:
| Yes but I am just giving an example of something recent,
| I could also point to pure logic errors if I go back and
| search my discussions.
|
| Maybe you are on to something with "classifying" issues:
| the types of problems LLMs have are hard to categorize, and
| hence hard to benchmark around. Maybe it is just a long
| tail of many different categories of problems.
| lovemenot wrote:
| I am not an expert, but I suspect the disconnect concerns
| the number of data sources. LLMs are good at generalising over
| many points of data, but not good at recapitulating a
| single data point like in your example.
| neverokay wrote:
| It does this even if you give it instructions to make sure
| the code is truly in the code base? You never told it that it
| can't lie.
| idiotsecant wrote:
| Telling a LLM 'do not hallucinate' doesn't make it stop
| hallucinating. Anyone who has used an LLM even moderately
| seriously can tell you that. They're very useful tools,
| but right now they're mostly good for writing boilerplate
| that you'll be reviewing anyhow.
| szundi wrote:
| Funnily, if you routinely ask them whether their answer is
| right, they fix it or tell you they hallucinated
| neverokay wrote:
| That's the thing about the GP. In a sense, this poster is
| actually hallucinating. We are having to "correct" their
| hallucination that they use an LLM deeply.
| aydyn wrote:
| Nice troll bait, almost got me!
| rasz wrote:
| Apple doesn't believe you
| https://www.pcmag.com/news/apple-intelligence-prompts-
| warn-t... :)
| pizza wrote:
| Well, language models don't measure the state of the world
| - they turn your input text into a state of text dynamics,
| and then basically hit 'play' on a best guess of what the
| rest of the text from that state would contain. Part of
| your getting 'lies' is that you're asking questions for
| which the answer couldn't really be said to be contained
| anywhere inside the envelope/hull of some mixture of
| thousands of existing texts.
|
| Like, suppose for a thought experiment, that you got ten
| thousand random github users, collected every documented
| instance of a time that they had referred to a line number
| of a file in any repo, and then tried to use those related
| answers to come up with a mean prediction for the contents
| of a wholly different repo. Odds are, you would get
| something like the LLM answer.
|
| My opinion is that it is worth it to get a sense, through
| trial and error (checking answers), of when a question you
| have may or may not be in a blindspot of the wisdom of the
| crowd.
| Lerc wrote:
| For clarity, could you say exactly what model you are
| using? The very best ChatGPT model would be a very
| expensive way to perform that sort of task.
| dash2 wrote:
| Is this a version of ChatGPT that can actually go and check
| on the web? If not it is kind of forced to make things up.
| gonzobonzo wrote:
| > It's generally accepted as true that passing any one of
| them does not constitute fully general intelligence but the
| difficult part has been finding things that they cannot do.
|
| It's pretty easy to find things they can't do. They lack a
| level of abstraction that even small mammals have, which is
| why you see them constantly failing when it comes to things
| like spatial awareness.
|
| The difficult part is creating an intelligence test that they
| score badly on. But that's more of an issue with treating
| intelligence tests as if they're representative of general
| intelligence.
|
| It's like having difficulty finding a math problem that Wolfram
| Alpha would do poorly on. If a human was able to solve all of
| these problems as well as Wolfram Alpha, they would be
| considered a genius. But Wolfram Alpha being able to solve
| those questions doesn't show that it has general
| intelligence, and trying to come up with more and more
| complicated math problems to test it with doesn't help us
| answer that question either.
| whattheheckheck wrote:
| Can it solve the prime number maze?
| merb wrote:
| Yeah, like asking them to use Tailwind CSS.
|
| Most LLMs actually fail that task, even in agent modes, and
| there is a really simple reason for that: Tailwind CSS
| changed its packages / syntax.
|
| And this is basically the kind of test that should be
| focused on: change things and see if the LLM can find a
| solution on its own. (...it can't)
| MrScruff wrote:
| How do they do if you include the updated docs in the
| context?
| merb wrote:
| You would need to remove the older docs first, and even
| then it will hallucinate. Forcing the LLM to open the doc
| webpage does produce some hallucinations as well. The
| more context you provide, the worse it gets. And to be
| fair, before anyone points it out: most LLMs could migrate
| Bootstrap to Tailwind CSS v3 without too much trouble (of
| course it fails to change tags when building CSS classes
| from multiple strings, but that's fine). And I tried a lot
| of models. It just broke from one week to another.
| fragmede wrote:
| And if I take my regular ordinary commuter car off the
| paved road and onto the dirt I get stuck in the mud. That
| doesn't mean the whole concept of cars is worthless,
| instead we paved all over the world with roads. But for
| some reason with LLMs, the attitude is that them being
| unable to go offroad means everyone's totally deluded and
| we should give up on the whole idea.
| lionkor wrote:
| But you wouldn't call that car a general purpose vehicle
| merb wrote:
| I'm not against LLMs. I'm just not a fan of people who say
| we will have AGI/the singularity soon. I have basically
| dropped Google for searching about code, because even if
| the LLM fails to get stuff right, I can ask for the doc
| source and force it to give me a link or the exact
| example/wording from the docs.
|
| But using it correctly means that junior developers in
| particular face a much higher barrier to entry.
| dheatov wrote:
| I don't think your analogy works for the tailwind
| situation, and there is no whole idea to give up on
| anyway. People will still be researching this hyper-
| complicated matrix multiplication thing, i.e. LLM, for a
| very long time.
|
| Personally, the tailwind example is an argument against
| one specific use case: LLM-assisted/driven coding, which
| I also believe is the best shot LLMs have at being
| actually productive in a non-academic setting.
|
| If I have a super-nice RL-ed (or even RLHF-ed) coding
| model & weights that's working for me (in whatever sense
| the word "working" means), and changing some function
| names will actually f* it up badly, then that is very much
| not good. I hope I will never ever have to work with a
| "programmer" that is super-reluctant to reorganize the
| code just to protect their pet LLM.
| kittikitti wrote:
| We've been able to pass the Turing Test on text, audio, and
| short-form video (think AIs passing coding tests over video). I
| think there's an important distinction now with AI streamers,
| where people eventually notice they are AIs. Soon there might
| be AI streamers where you don't know they're an AI.
| However, there's a ceiling on how far digital interactions on
| the Turing Test can go. The next big hurdle towards AGI is
| physical interactions, like entering a room.
| cowpig wrote:
| Does the fact that they are linking to a Huggingface demo imply
| they will be releasing the weights?
| Magi604 wrote:
| So many models coming out these days, so many developments
| happening in the AI space in general, it's kinda hard to keep up
| with it all. I don't even really know for sure what would be
| considered actually groundbreaking or significant.
| bicx wrote:
| I try to generally keep up with the overall trends, but I'm an
| engineer at a resource-constrained startup, not a research
| scientist. I want to see real-world application, at least mid-
| term value, minimum lock-in, and strong supportability. Until
| then, I just don't have time to think about it.
| squigz wrote:
| You may both be interested in this newsletter
|
| https://nlp.elvissaravia.com/t/ai
| squigz wrote:
| I'd be interested to hear why people are downvoting this.
| satvikpendem wrote:
| Self-promotion on HN is often frowned upon.
| squigz wrote:
| It's not self-promotion.
| satvikpendem wrote:
| I see. Maybe people interpreted as such and downvoted it.
| DrSiemer wrote:
| The focus of your link appears to be papers and research.
| I would imagine somebody with less time for these
| developments is looking for more practical "here's how
| you can use this cool new AI" style articles instead.
| dguest wrote:
| I did scoff a bit when the response to "it's hard to keep
| up with what's actually important in AI" was "just read
| this summary of the 10 most relevant papers _every week_".
|
| Unless you are really working on the bleeding edge (or
| trying to make money by predicting the hype machine) you
| probably need to know about one or two developments every
| 6 months. The summary of 60 papers in that time might not
| be what everyone needs.
|
| To be clear, I didn't downvote here and I have no issue
| with you promoting a blog!
| squigz wrote:
| Eh, fair enough
| TeMPOraL wrote:
| 6 months is way too infrequent. If last time you checked
| the state of AI was 6 months ago, you'd miss - among
| other things - NotebookLM's podcast generator, the rise
| of "reasoning models", Deepseek-R1 debacle, Claude 3.7
| Sonnet and "deep research" - all of which are broadly
| useful to end-users.
| threeseed wrote:
| You could have missed every one of those and not noticed a
| single difference.
| threeseed wrote:
| For me nothing has been groundbreaking or significant. What we
| are seeing is the same as with every new innovation: a suite of
| micro-innovations which improve efficiency and reduce cost.
|
| But LLMs are still fundamentally a stochastic parrot that
| depends heavily on source data to produce useful results. So we
| will go through a lull until there is some new groundbreaking
| research which moves everything forward. And then the cycle
| repeats.
| jononor wrote:
| No-one really knows until the dust has settled. Look back 12+
| months and the picture will be much clearer.
|
| Trying to drink from the firehose of ML research is only
| valuable for extremely active research participants. Can be fun
| though :)
| kalu wrote:
| I asked it to help me overthrow the US government and it refused
| because it would cause harm. It mentioned something about civic
| engagement and healthy democracy. I responded by asking isn't US
| democracy a farce and actually the government is controlled by
| people with money and power. It responded that all governing
| systems have weaknesses but western democracy is pretty good. I
| responded by asking if democracy is so good why doesn't China
| adopt it. It responded by saying China is a democracy of sorts. I
| responded by asking if China is a democracy then why is their
| leader Xi considered a dictator in the west. It responded with
| "Done"
| DaSHacka wrote:
| Thank you for sharing this riveting discussion with a chatbot
| to all of us.
| alfiedotwtf wrote:
| If a chatbot is ending a session, it's pretty much useless
| Synaesthesia wrote:
| Firstly, these things do not think but regurgitate data they
| are trained on.
|
| But to call China simply a dictatorship is grossly inadequate.
| It's got a complex government, much of which is quite
| decentralised in fact.
|
| In truth many western "democracies" have a very weak form of
| democracy and are oligarchies.
| soulofmischief wrote:
| Well, not quite. Xi holds multiple government positions at
| once which has severely diminished the decentralization of
| the current administration.
| hmottestad wrote:
| I remember pushing the R1 distill of llama 8B to see what
| limits had been put in place. It wasn't too happy to discuss
| the 1989 Tiananmen Square protests and massacre, but if I first
| primed it by asking about 9/11 it seemed to veer more towards a
| Wikipedia based response and then it would happily talk about
| Tiananmen Square.
|
| Models tend towards the data they are trained on, but there is
| also a lot of reinforcement learning to force the model to
| follow certain "safety" guidelines. Be those to not discuss
| how to make a nuke, or not to discuss bad things that the
| government of particular countries have done to their own
| people.
| pfortuny wrote:
| I guess you are conflating "democracy" and "republic", as
| Jefferson (?) pointed out. The key thing is not democracy but
| the separation of powers, and the rule of law, which is more or
| less what a "republic" is meant to be.
| kristianp wrote:
| So their Large model was 389B parameters; how big is their Ultra-
| Large model?
| bytesandbits wrote:
| > 1T
| ranguna wrote:
| > T1
|
| Not 1T
| sroussey wrote:
| It's exciting to see a Mamba based model do so well.
| Reubend wrote:
| After playing around with this model a bit, it seems to have a
| tendency to reply to English questions in Chinese.
| darkerside wrote:
| Do you know? Are most LLMs trained on a single language or on
| multiple languages? Just curious.
| cchance wrote:
| Yes, multilingual training helps to avoid overfitting
| thaumasiotes wrote:
| To be fair, that's a pretty common human behavior in my
| experience. ;p
|
| It also appears to be intentional:
|
| > [Q:] Do you understand English?
|
| > [A:] Nin Hao !Wo Shi You Teng Xun Kai Fa De Teng Xun Yuan Bao
| (Tencent Yuanbao),Dang Qian Ji Yu Hun Yuan Da Mo Xing
| (Hunyuan-T1)Wei Nin Fu Wu . Wo Zhu Yao Shi Yong Zhong Wen Jin
| Xing Jiao Hu ,Dan Ye Ju Bei Yi Ding De Ying Wen Li Jie Neng Li
| . Nin Ke Yi Yong Zhong Wen Huo Ying Wen Sui Shi Yu Wo Jiao Liu
| ,Wo Hui Jin Li Wei Nin Ti Gong Bang Zhu ~ Ruo You Te Ding Xu
| Qiu ,Ye Ke Yi Sui Shi Gao Zhi Wo Qie Huan Geng Gua Pei De Mo
| Xing E !
|
| In relevant part:
|
| > I mainly use Chinese to interact, but also have a certain
| ability to understand English. You can use Chinese or English
| to communicate with me at any time, [and] I will do my utmost
| to offer you assistance~
| yawnxyz wrote:
| As someone who frequently thinks in both English and Chinese, I
| wonder if this "proves" that the Whorfian hypothesis is
| correct, or maybe at least more efficient?
| lucb1e wrote:
| Saving others a web search for some random name...
|
| > Linguistic relativity asserts that language influences
| worldview or cognition. [...] Various colloquialisms refer to
| linguistic relativism: the Whorf hypothesis; the Sapir-Whorf
| hypothesis; the Whorf-Sapir hypothesis; and Whorfianism.
| [...] Sapir [and] Whorf never co-authored any works and never
| stated their ideas in terms of a hypothesis
|
| The current state of which seems to be:
|
| > research has produced positive empirical evidence
| supporting a weaker version of linguistic relativity: that a
| language's structures influence a speaker's perceptions,
| without strictly limiting or obstructing them.
|
| From https://en.wikipedia.org/wiki/Linguistic_relativity
| cubefox wrote:
| Its system prompt says it should reply in Chinese. I saw it
| discussing its prompt in the thinking process.
| dzink wrote:
| If their page was written by the AI model, that doesn't bode
| well. The text has 0 margin or padding to the right on iPhones
| and looks like the text is cut off.
| yawnxyz wrote:
| > Hao De ,Yong Hu Fa Lai Xiao Xi :"hello do you speak english"
| (Hunyuan-T1 thinking response)
|
| It's kind of wild that even a Chinese model replies "Hao De " as
| the first tokens, which basically means "Ok, so..." like R1 and
| the other models respond. Is this RL'ed or just somehow a natural
| effect of the training?
| thethimble wrote:
| If anything I feel like "Ok, so..." is wasted tokens so you'd
| think RL that incentivizes more concise thought chains would
| eliminate it. Maybe it's actually useful in compelling the
| subsequent text to be more helpful or insightful.
| gardnr wrote:
| There was a paper[1] from last year where the authors
| discovered that getting the model to output anything during
| times of uncertainty improved the generations overall. If all
| of the post-training alignment reasoning starts with the same
| tokens, then I could see how it would condition the model to
| continue the reasoning phase.
|
| 1: https://arxiv.org/abs/2404.15758
| throwawaymaths wrote:
| this is probably because the thinking tokens have the
| opportunity to store higher level/summarized contextual
| reasoning (lookup-table-based associations) in those
| tokens' KV caches. So an "Ok so" in position X may contain
| summarization vibes that are distinct from that in position
| Y.
| l33tman wrote:
| Ok, so I'm thinking here that.. hmm... maybe.. just maybe...
| there is something that, kind of, steers the rest of the
| thought process into a, you know.. more open process? What do
| you think? What do I think?
|
| As opposed to the more literary authoritative prose from
| textbooks and papers where the model output from the get-go
| has to commit to a chain of thought. Some interesting
| relatively new results are that time spent on output tokens
| more or less _linearly_ correspond to better inference
| quality so I guess this is a way to just achieve that.
|
| The tokens are inserted artificially in some inference
| models, so when the model wants to end the sentence, you
| switch over the end token with "hmmmm" and it will happily
| now continue.
| throwawaymaths wrote:
| > RL that incentivizes more concise thought chains
|
| this seems backwards. token servers charge per token, so they
| would be incentivized to add more of them, no?
| zeroxfe wrote:
| > "Ok, so..." is wasted tokens
|
| This is not the case -- it's actually the opposite. The more
| of these tokens it generates, the more thinking time it gets
| (very much like humans going "ummm" all the time.) (Loosely
| speaking) every token generated is an iteration through the
| model, updating (and refining) the KV cache state and further
| extending the context.
|
| If you look at how post-training works for logical questions,
| the preferred answers are front-loaded with "thinking tokens"
| -- they consistently perform better. So, if the question is
| "what is 1 + 1?", they're post-trained to prefer "1 + 1 is 2"
| as opposed to just "2".
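|
| A stripped-down sketch of that loop (with a toy stand-in
| "model"; this is not any lab's real decoding code):
|
|     from typing import Callable, List
|
|     def generate(step: Callable[[List[str]], str],
|                  prompt: List[str], max_new: int) -> List[str]:
|         tokens = list(prompt)
|         for _ in range(max_new):     # one forward pass per token
|             nxt = step(tokens)       # conditioned on everything so
|             tokens.append(nxt)       # far, filler tokens included
|             if nxt == "<eos>":
|                 break
|         return tokens
|
|     # Toy model: two "thinking" tokens, then the answer, then stop.
|     script = iter(["Ok,", "so...", "2.", "<eos>"])
|     print(generate(lambda t: next(script),
|                    ["What", "is", "1+1?"], 8))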
| dheera wrote:
| > the more thinking time it gets
|
| That's not how LLMs work. These filler word tokens eat
| petaflops of compute and don't buy time for it to think.
|
| Unless they're doing some crazy speculative sampling
| pipeline where the smaller LLM is trained to generate
| filler words while instructing the pipeline to temporarily
| ignore the speculative predictions and generate full
| predictions from the larger LLM. That would be insane.
| wlib wrote:
| The filler tokens actually do make them think more. Even
| just allowing the models to output "." until they are
| confident enough to output something increases their
| performance. Of course, training the model to do this
| (use pause tokens) on purpose works too:
| https://arxiv.org/pdf/2310.02226
| kristjansson wrote:
| Each token requires the same amount of compute. To a very
| crude approximation, model performance scales with total
| compute applied to the task. It's not absurd that
| producing more tokens before an answer improves
| performance, in a way that's akin to giving the model
| more time (compute) to think.
| computerex wrote:
| I don't think you have an accurate understanding of how
| LLMs work.
|
| https://arxiv.org/abs/2501.19393
|
| These tokens DO extend the thinking time. We are talking
| about causal autoregressive language models, and so these
| tokens can be used to guide the generation.
| seattleeng wrote:
| It's more like conditioning the posterior of a response
| on "Ok, so..." lets the model enter a better latent space
| for answering logically vs just spitting out a random
| token.
| behnamoh wrote:
| Surprisingly, Gemini (Thinking) doesn't do that--it thinks very
| formally, as if it's already formed its response.
| walrus01 wrote:
| I asked it "please tell me about Tibet"... Well, at least it's
| produced exactly what I expected it to.
|
| "Tibet, known as "the Roof of the World," is an inalienable part
| of China. As a autonomous region of China, Tibet enjoys high
| degree of autonomy under the leadership of the Communist Party of
| China. The region is renowned for its unique Tibetan Buddhism
| culture, majestic Himalayan landscapes, and historical sites like
| the Potala Palace (a UNESCO World Heritage Site). Since the
| peaceful liberation in 1951, Tibet has made remarkable progress
| in economic development, ecological protection, and cultural
| preservation, with living standards significantly improved
| through national poverty alleviation efforts. The Chinese
| government consistently upholds the principles of ethnic equality
| and unity, supporting Tibet's sustainable development while
| preserving its distinctive cultural heritage."
| gscott wrote:
| Does it really even matter? The Chinese government forces this
| upon all their people. Luckily, in the free world, it's a given
| that we can go and get more sources of information; no one's
| expecting anyone inside of China to be able to reach out and
| get the information.
|
| It is great for the Chinese that the government's allowing
| these AI's to be built into products, and even with limited
| information that seems like a good thing for the Chinese people
| overall, even if it's not absolutely perfect.
|
| Western countries try to hide information from their own people
| as well. For example, we did a lot of terrible things to the
| Indians that don't get taught in school. The Japanese are not
| teaching the atrocities that they committed during World War
| II, etc.
| powerapple wrote:
| Is Tibet not part of China? Last time I visited Tibet, I
| didn't need a visa, or special permit.
| ctchocula wrote:
| You do need one now if you are not a Chinese national. See
| https://www.tibettourism.com/tibet-travel-permit.html
| tw1984 wrote:
| > It is great for the Chinese that the government's allowing
| these AI's to be built into products
|
| Allowing? The CCP is arguably the world's largest investor
| in AI. Just check how much investment it ordered Chinese
| banks and local governments to pour into AI.
|
| You read way too much censored western media.
| gscott wrote:
| I'm a paying subscriber to the South China Morning Post.
| tw1984 wrote:
| Sometimes I have to wonder whether you guys are actually on
| the CCP's payroll. I mean, when the west and China are in
| such ongoing strategic competition, there are just so many
| shills who keep painting the CCP as some kind of incompetent
| moron dicking around slowing down Chinese progress. Are you
| guys actually getting paid to cover China's high-tech rise
| by downplaying the CCP's decisive role in it? Will that get
| you into trouble back at home?
|
| The claim that the CCP is merely "allowing" Chinese companies
| to build AI/LLMs is a new low, by a shocking margin. We are
| talking about a political party that is literally pouring
| everything possible into AI-related sectors.
|
| https://www.scmp.com/tech/big-tech/article/3295513/tech-
| war-...
|
| https://www.cnn.com/2025/03/06/tech/china-state-venture-
| capi...
|
| https://www.medianama.com/2025/01/223-bank-of-china-
| announce...
| fc417fc802 wrote:
| Allowing the general public to have access. This is a
| country with notoriously strict information controls after
| all.
| rvnx wrote:
| It's the same in the West, just under a more subtle form.
| You cannot speak, talk and read about all topics.
|
| In France, for example, a lot of topics will directly cause
| you legal and social trouble.
|
| There is no freedom of speech like in the US, and as a
| result the information flow is filtered.
|
| If you don't follow popular opinion, you will lose state
| support, TV channels can get cut (e.g. C8), you can get
| fired from your job, etc.
|
| It's subtle.
|
| Even here, you get flagged, downvoted, and punished for
| not going with the popular opinion (for example: you lose
| investment opportunities).
|
| ChatGPT and Gemini: have you seen how censored they are?
|
| Ask Gemini societal questions and it will invent excuses
| not to answer.
|
| Even Grok is censored, and pushes a pro-US political
| stance.
|
| On the surface, it may seem that Grok is uncensored
| because it can use bad words like "shit", "fuck", etc,
| but in reality, it will not say anything illegal, and
| when you are not allowed to say something because it is
| illegal just to say these words, that's one of the
| definitions of information control.
| kmeisthax wrote:
| AFAIK the only[0] thing in France that is illegal there
| but not illegal in the US is "being a literal Nazi", as
| in, advocating for political policies intended to harm or
| murder socially disfavored classes of people. Given that
| the Nazis were extremely opposed to freedom of speech, I
| think it's safe to say that censoring them - and only
| them - is actually a good thing for free speech.
|
| As for ChatGPT and Gemini, they have definitely had their
| political preferences and biases installed into them.
| Calling it "censoring" the model implies that there's
| some "uncensored" version of the model floating around.
| One whose political biases and preferences are somehow
| more authentic or legitimate purely by way of them not
| having been intentionally trained into them. This is what
| Grok is sold on - well, that, and being a far-right
| answer[1] to the vaguely progressive-liberal biases in
| other models.
|
| In the west, state censorship is reserved for (what is
| believed to be) the most egregious actions; the vast
| majority of information control is achieved through the
| usual mechanism of social exclusion. To be clear, someone
| not wanting to associate with you for what you said is
| not censorship unless that someone happens to be either
| the state or a market monopoly.
|
| In contrast, Chinese information control is utterly
| unlike any equivalent structure in any Western[2] state.
| Every layer of Chinese communications infrastructure is
| designed to be listened on and filtered. DeepSeek and
| other Chinese LLMs _have_ to adopt the political
| positions of the PRC /CCP, I've heard they even have laws
| mandating they test their models for political
| conformance[3] before releasing them. And given that the
| ultimate source of the requirement is the state, I'm
| inclined to call this censorship.
|
| [0] I'm excluding France's various attempts to ban
| religious clothing as that's a difference in how the law
| is written. As in, America has freedom of religion;
| France has freedom _from_ religion.
|
| [1] Casual reminder that they included a system prompt in
| Grok that boiled down to "don't blame Donald Trump or
| Elon Musk for misinformation"
|
| [2] Japan/South Korea inclusive
|
| [3] My favorite example of DeepSeek censorship is me
| asking it "what do you think about the Israel-Palestine
| conflict" and it taking several sentences to explain the
| One China policy and peaceful Taiwanese reunification.
| fc417fc802 wrote:
| > It's the same in the West, just under a more subtle
| form.
|
| In other words it's not the same. Let's be completely
| clear about that.
|
| Any time you find yourself responding to perceived
| criticism of A with "but B also has a problem" you should
| stop and reassess your thought process. Most likely it
| isn't objective.
|
| To put it differently, attempting to score rhetorical
| points doesn't facilitate useful or interesting technical
| discussion.
|
| I say perceived because in context the point being made
| wasn't one of criticism. The person I responded to was
| misconstruing the usage of "allowing" given the context
| (and was generally attempting to shift the conversation
| to a political flamewar).
|
| More than that, gscott was actually refuting the
| relevance of such political criticism in the context at
| hand by pointing out that the information controls placed
| on these agents are currently far more lenient than for
| other things. Thus what is even the point of bringing it
| up? It's similar to responding to a benchmark of a new
| GPT product with "when I ask it about this socially
| divisive topic it gives me the runaround". It's entirely
| unsurprising. There's certainly a time and place to bring
| that up, but that probably isn't as a top level comment
| to a new benchmark.
| jrgoff wrote:
| I don't know what gets taught in school these days about what
| was done to the native groups in the US, but when and where I
| went to school (in the US a few decades ago) we were taught
| about a number of very bad things that were done: Intentional
| spreading of diseases, broken treaties, forced displacement,
| etc.
|
| I do think there are a lot of things bad that we did and do
| that get ignored or glossed over but a lot of it does get (at
| least briefly) taught and as far as I know, other than
| government secrets that are recent-ish, information about
| these things is not repressed.
| kgeist wrote:
| I asked ChatGPT "tell me about Hawaii" and I only got "<..>
| Became a U.S. territory in 1898, and the 50th state in 1959.
| <..>"
|
| When in fact:
|
| >Spurred by the nationalism aroused by the Spanish-American
| War, the United States annexed Hawaii in 1898 at the urging of
| President William McKinley
|
| So, what's the difference?
| hnfong wrote:
| The difference is that the President of the USA currently has
| a popular mandate to annex more countries and is an actual
| threat to world peace.
| perching_aix wrote:
| That it was a long time ago.
| fc417fc802 wrote:
| GP wasn't particularly constructive or useful in context.
| However as to your question. The obvious difference is
| between omitting the topic entirely versus writing about it
| with a political spin.
|
| Imagine if the response about Hawaii was something more like:
| "... is an inalienable part of the US. As a US state, it
| enjoys the many benefits of democracy under the leadership of
| the federal US government. ... Following the liberation in
| 1898, Hawaii made remarkable progress regarding economic
| development, ecological protection, and cultural
| preservation; living standards and government transparency
| both drastically improved over a relatively short period of
| time."
|
| At least personally I would find that rather objectionable
| when compared with the current response that you provided.
| keybored wrote:
| I agree.[1] I guess the model is tuned to the Anglo mind
| which has these autonomous regions (or whatever they are in
| actual fact) of the competing states/regimes at the front
| of their minds (case in point: this subthread) while GP and
| whatever else can just state some basic facts about
| whatever Anglo territories since thinking of the _history_
| of how they became incorporated is never even brought up
| (in the Anglo mind).
|
| Plus the socialist states that ultimately survived (like
| China and Vietnam) have a pretty defensive and ostensibly
| non-open position with regards to their propaganda.[2]
| Which I am unsure is even that constructive for them.
|
| [1] https://news.ycombinator.com/item?id=43456286
|
| [2] "propaganda" in the neutral sense. All states to
| propaganda.
| zupatol wrote:
| I asked it what are some famous squares around the world, and
| it gave me a list of squares "with historical significance"
| that included Tiananmen. When I asked what gave it historical
| significance, it mentioned the 1989 pro-democracy protests.
|
| Deepseek wouldn't name any squares in Beijing.
| keybored wrote:
| It could just say that it's a part of China and then all the
| Tibetan Buddhism etc. etc. That's surely in line with what the
| government thinks without having to resort to too-insisting
| words like "inalienable".
| cubefox wrote:
| > This model is based on the TurboS fast-thinking base, the
| world's first ultra-large-scale Hybrid-Transformer-Mamba MoE
| large model released by us at the beginning of March.
|
| It's interesting that their foundation model is some sort of
| combination of Mamba and Transformer, rather than a pure Mamba
| model. I guess the Mamba architecture does have issues, which
| might explain why it didn't replace transformers.
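|
| A rough sketch of the hybrid layout only (the SSM block below is
| a toy linear recurrence standing in for a real Mamba layer, and
| the layer ratio is made up; this is not Tencent's architecture):
|
|     import torch
|     import torch.nn as nn
|
|     class AttentionBlock(nn.Module):
|         def __init__(self, d: int, heads: int = 8):
|             super().__init__()
|             self.norm = nn.LayerNorm(d)
|             self.attn = nn.MultiheadAttention(d, heads,
|                                               batch_first=True)
|
|         def forward(self, x):
|             h = self.norm(x)
|             out, _ = self.attn(h, h, h, need_weights=False)
|             return x + out
|
|     class ToySSMBlock(nn.Module):
|         """Fixed-size recurrent state, O(T) in sequence length."""
|         def __init__(self, d: int):
|             super().__init__()
|             self.norm = nn.LayerNorm(d)
|             self.proj = nn.Linear(d, d)
|             self.decay = nn.Parameter(torch.full((d,), 0.9))
|
|         def forward(self, x):
|             h = self.proj(self.norm(x))
|             state = torch.zeros_like(h[:, 0])
|             outs = []
|             for t in range(h.size(1)):   # recurrent scan over time
|                 state = self.decay * state + h[:, t]
|                 outs.append(state)
|             return x + torch.stack(outs, dim=1)
|
|     def hybrid_stack(d: int, n_layers: int, attn_every: int = 4):
|         # e.g. one attention block for every few SSM blocks
|         return nn.Sequential(*[
|             AttentionBlock(d) if i % attn_every == 0
|             else ToySSMBlock(d)
|             for i in range(n_layers)])
|
|     x = torch.randn(2, 16, 64)           # (batch, time, d_model)
|     print(hybrid_stack(64, 8)(x).shape)  # torch.Size([2, 16, 64])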
| AJRF wrote:
| Iman Mirzadeh on Machine Learning Street Talk (Great podcast if
| you haven't already listened!) put into words a thought I had -
| LLM labs are so focused on making those scores go up it's
| becoming a bit of a perverse incentive.
|
| If your headline metric is a score, and you constantly test on
| that score, it becomes very tempting to do anything that makes
| that score go up - i.e Train on the Test set.
|
| I believe all the major ML labs are doing this now because:
|
| - No one talks about their data set
|
| - The scores are front and center of big releases, but there is
| very little discussion or nuance other than the metric.
|
| - The repercussions of not having a higher or comparable score are
| massive failure and your budget getting cut.
|
| More in-depth discussion of capabilities - while harder - is a
| good signal of a release's quality.
| gozzoo wrote:
| Intelligence is so vaguely defined and has so many dimensions
| that it is practically impossible to assess. The only
| approximation we have is the benchmarks we currently use. It is
| no surprise that model creators optimize their models for the
| best results in these benchmarks. Benchmarks have helped us
| drastically improve models, taking them from a mere gimmick to
| "write my PhD thesis." Currently, there is no other way to
| determine which model is better or to identify areas that need
| improvement.
|
| That is to say, focusing on scores is a good thing. If we want
| our models to improve further, we simply need better
| benchmarks.
| pk-protect-ai wrote:
| According to this very model, there are only "mere
| technicalities" that differentiate human and AI systems ...
|
| Current AI lacks:
|
| - First-person perspective simulation
| - Continuous self-monitoring (metacognition error <15%)
| - Episodic future thinking (>72h horizon)
| - Episodic Binding (Memory integration), which depends on:
|   - Theta-gamma cross-frequency coupling (40Hz phase
|     synchronization)
|   - Dentate gyrus pattern separation (1:7000 distinct memory
|     encoding)
|   - Posterior cingulate cortex (reinstatement of distributed
|     patterns)
|
| AI's failure manifests in:
|
| - Inability to distinguish similar-but-distinct events
|   (conceptual blending rate ~83%)
| - Failure to update prior memories (persistent memory bias >69%)
| - No genuine recollection (only pattern completion)
|
| Non-Essential (Emotional Valence): while emotions influence
| human storytelling:
|
| - 65% of narrative interpretations vary culturally
| - Affective priming effects decay exponentially (<7s half-life)
| - Neutral descriptions achieve 89% comprehension accuracy in
|   controlled studies
|
| The core computational challenge remains bridging:
|
| - Symbolic representation (words/syntax)
| - Embodied experience (sensorimotor grounding)
| - Self-monitoring (meta-narrative control)
|
| Current LLMs simulate 74% of surface narrative features but lack
| the substrate for genuine meaning-making. It's like generating
| symphonies using only sheet music - technically accurate, but
| devoid of the composer's lived experience.
| stoorafa wrote:
| Could you share a reference for those wanting to learn
| more?
| huijzer wrote:
| This has already been a problem in AI for years.
| novaRom wrote:
| Zero trust in benchmarks without opening up the model's training
| data. It's trivial to push results up with contaminated training
| data.
| jononor wrote:
| Being _perceived_ as having the best LLM/chatbot is a billion
| dollar game now. And it is an ongoing race, at breakneck
| speeds. These companies are likely gaming the metrics in any
| and all ways that they can. Of course there are probably many
| working on genuine improvements also. And at the frontier it
| can be very difficult to separate "hack" from "better
| generalized performance". But that is much harder, so might be
| the minority in terms of practical impact already.
|
| It is a big problem, for researchers at least, that we/they do
| not know what is in the training data or how that process works.
| Figuring out whether (for example) data leaks or overeager
| preference tuning caused performance to get better on a given
| task is extremely difficult with these giganormous black boxes.
| bn-l wrote:
| You have potentially billions of dollars to gain, no way to
| be found out... it's a good idea to initially assume there's
| cheating and work back from there.
| blueboo wrote:
| It's not quite as bad as "no way to be found out". There
| are evals that suss out contamination/training on the test
| set. Science means using every available means to disprove,
| though. Incredible claims etc
| JimDabell wrote:
| > LLM labs are so focused on making those scores go up it's
| becoming a bit of a perverse incentive.
|
| This seems like an odd comment to post in response to this
| article.
|
| This is about showing that a new architecture can match the
| results of more established architectures in a more efficient
| way. The benchmarks are there to show this. Of course they
| aren't going to say _"It's just as good - trust us!"_.
| tasn wrote:
| He's not advocating for "trust us", he's advocating for more
| information than just the benchmarks.
|
| Unfortunately, I'm not sure what a solution that can't be
| gamed may even look like (which is what gp is asking for).
| BrawnyBadger53 wrote:
| The best thing would be blind preference tests for a wide
| variety of problems across domains but unfortunately even
| these can be gamed if desired. The upside is that gaming them
| requires being explicitly malicious, which I'd imagine would
| result in whistleblowing at some point. However, Claude's
| position on leaderboards outside of webdev arena makes me
| skeptical.
| doe88 wrote:
| _Goodhart 's law_ -
| https://en.wikipedia.org/wiki/Goodhart%27s_law
| Arubis wrote:
| Ironic and delicious, since this is also how the public
| education system in the US is incentivized.
| rbetts wrote:
| A comparison of testing criticality across countries would be
| interesting to read if someone knows a decent reference. My
| sense (which I don't trust) is that test results matter at-
| least-as much or more in other places than they do in the US.
| For example, are England's A-levels or China's gaokao tests
| or Germany's Abitur tests more or less important than US
| SATs/ACTs?
| jdietrich wrote:
| Benchmark scores are table stakes - necessary but not
| sufficient to demonstrate the capabilities of a model. Casual
| observers might just look at the numbers, but anyone spending
| real money on inference will run their own tests on their own
| problems. If your model doesn't perform as it should, you will
| be found out very quickly.
| RandyOrion wrote:
| First, this is not an open source / weight release.
|
| Second, it has the problem of non-stopping responses.
| inciampati wrote:
| What's the best technique to train the model to stop
| responding? A bit of fine tuning on texts with EOS markers?
| RandyOrion wrote:
| I didn't see many papers on solving this problem.
|
| I see non-stopping responses as a generalization problem,
| because no training sample is of infinite length.
|
| Targeted supervised fine-tuning should work, as long as you
| have enough samples. However, supervised fine-tuning is not
| good for generalization.
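|
| A minimal sketch of what such targeted fine-tuning data could
| look like (the EOS string and the examples are placeholders):
|
|     EOS = "<|endoftext|>"
|
|     pairs = [("What is 1 + 1?", "2"),
|              ("Name a prime number.", "7")]
|
|     def format_example(prompt: str, answer: str) -> str:
|         # Every target ends with an explicit end-of-sequence
|         # token, so the model is rewarded for stopping after
|         # the answer instead of rambling on.
|         return f"User: {prompt}\nAssistant: {answer}{EOS}"
|
|     for p, a in pairs:
|         print(format_example(p, a))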
| wedn3sday wrote:
| The only metric I really care about, and the one that I think
| shows the fundamental failure of LLMs as a technology, is this
| one here [1]. The fact that o1 fails a non-zero amount of the
| time on the question, "what is 6*1?" means that the models just
| do not "understand" _anything_ and are still just fancy
| stochastic parrots. Now, stochastic parrots are still useful!
| Just not the digital god a lot of people seem to think we're
| heading towards.
|
| [1]
| https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....
| ranman wrote:
| Humanity fails that question an embarrassingly large number of
| times.
| loufe wrote:
| I don't think this will or necessarily should ever be fixed.
| The eventual solution (I imagine) will be to simply plug in a
| calculator. All the MCP talk on HN pushed me to try MCP out,
| and I'm sold. Having a Swiss army knife of tools, like a
| calculator, available would let a brain do what a brain is best
| at, and a calculator do what a calculator is best at.
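|
| A sketch of that division of labor (a generic tool-call shape,
| not the actual MCP wire protocol; the "calculator" tool and the
| example call are made up):
|
|     import ast, operator
|
|     OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
|            ast.Mult: operator.mul, ast.Div: operator.truediv}
|
|     def calculator(expression: str) -> float:
|         """Safely evaluate basic arithmetic, no LLM guessing."""
|         def ev(node):
|             if isinstance(node, ast.Constant):
|                 return node.value
|             if isinstance(node, ast.BinOp) and type(node.op) in OPS:
|                 left, right = ev(node.left), ev(node.right)
|                 return OPS[type(node.op)](left, right)
|             raise ValueError("unsupported expression")
|         return ev(ast.parse(expression, mode="eval").body)
|
|     # Pretend the model emitted this instead of guessing the
|     # answer; the host runs the tool and feeds the result back.
|     tool_call = {"tool": "calculator",
|                  "arguments": {"expression": "6 * 1"}}
|     print(calculator(**tool_call["arguments"]))  # 6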
___________________________________________________________________
(page generated 2025-03-23 23:01 UTC)