[HN Gopher] Tencent's 'Hunyuan-T1'-The First Mamba-Powered Ultra...
       ___________________________________________________________________
        
       Tencent's 'Hunyuan-T1'-The First Mamba-Powered Ultra-Large Model
        
       Author : marban
       Score  : 283 points
       Date   : 2025-03-22 17:25 UTC (1 day ago)
        
 (HTM) web link (llm.hunyuan.tencent.com)
 (TXT) w3m dump (llm.hunyuan.tencent.com)
        
       | chis wrote:
       | Kobe?
        
       | nixpulvis wrote:
       | Some of the text is cut off while reading on my phone.
       | Embarrassing.
        
         | drysine wrote:
         | Don't be so harsh on your phone)
        
         | pkkkzip wrote:
         | thanks for sharing did you contact tencent support ?
        
         | jrflowers wrote:
         | Why are you embarrassed? You can always put your phone down and
         | read it on desktop later
        
           | brookst wrote:
           | Just seems odd not to test a website on a phone, even
           | accidentally.
        
       | notShabu wrote:
        | The romanization of these names is always confusing b/c, stripped
        | of the characters and tones, it's just gibberish. "Hunyuan" or Hun
        | Yuan in Chinese means "Primordial Chaos" or "Original Unity".
        | 
        | Knowing this helps as more Chinese products and services hit the
        | market and makes the names easier to remember. The naming is
        | similar to the popularity of greek mythology in western products.
        | (e.g. all the products named "Apollo")
        
         | klabb3 wrote:
         | > The naming is similar to the popularity of greek mythology in
         | western products. (e.g. all the products named "Apollo")
         | 
         | Popular? So you're saying that all the VPs who have come up
         | with the mind bendingly unique and creative name Prometheus
         | didn't do so out of level 10 vision?
        
         | Y_Y wrote:
         | I think it's particularly egregious that they use such a lossy
         | encoding. I can't read the hanzi, but at least "Hun yuan" would
         | have been more helpful, or even "Hu4n yua1n" would have enabled
         | me to pronounce it or look it up without having the context to
         | guess which characters it was representing.
        
           | realusername wrote:
            | I don't understand why this Vietnamese-style writing isn't
            | the most popular form of pinyin. It's clearly superior to
            | putting numbers inside words.
        
           | currymj wrote:
           | Tone markers are of limited use to Chinese readers (instead,
           | just show them the characters).
           | 
           | They are also of limited use to non-Chinese readers, who
           | don't understand the tone system and probably can't even
           | audibly distinguish tones.
           | 
           | So, it makes sense that we get this weird system even though
           | it's strictly worse.
        
           | powerapple wrote:
            | Yes, this is very annoying, because of how Pinyin works.
            | There are a lot of mistakes made when using Pinyin in English
            | content. Pinyin is supposed to break at the character level:
            | Pinyin = Pin Yin, so you can easily write it as Pin-Yin or
            | Pin Yin, but "Pinyin" as one word is just wrong.
            | 
            | Hun Yuan is a lot better. And I agree: with Unicode, we can
            | easily incorporate the tones.
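            | 
            | For example, a minimal Python sketch (assuming the tones are
            | hun4 / hùn and yuan2 / yuán) that builds the tone-marked
            | form from plain letters plus combining diacritics:
            | 
            |     import unicodedata
            | 
            |     # Combining grave (U+0300) marks the 4th tone, combining
            |     # acute (U+0301) marks the 2nd; NFC folds each pair into
            |     # a single precomposed glyph.
            |     hun = unicodedata.normalize("NFC", "hu\u0300n")    # "hùn"
            |     yuan = unicodedata.normalize("NFC", "yua\u0301n")  # "yuán"
            |     print(hun, yuan)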
        
         | jiehong wrote:
         | Agreed. We all have a duty to respect languages and their
         | official transcription. Pinyin with tones does not look much
         | different from French with accents. In both cases, most people
         | aren't likely to pronounce it correctly, though.
         | 
         | The irony is not lost on me that Tencent themselves did that.
        
       | ttoinou wrote:
        | > The excellent performance demonstrated by the models fully
        | > proves the crucial role of reinforcement learning in the
        | > optimization process
        | 
        | What if this reinforcement learning is just gaming the benchmarks
        | (Goodhart's law) without providing better answers elsewhere? How
        | would we notice it?
        
         | m3kw9 wrote:
         | When actual people start using it
        
         | dartos wrote:
         | I mean all optimization algorithms do is game a benchmark.
         | That's the whole point.
         | 
         | The hard part is making the benchmark meaningful in the first
         | place.
        
           | TeMPOraL wrote:
           | Yeah, and if anything, RL has a rep of being _too good at
           | this job_ , because of all the cases where it gamed a
           | benchmark by picking up on some environmental factor the
           | supervisors hadn't thought of (numerical instabilities,
           | rounding, bugs, etc.).
        
             | porridgeraisin wrote:
             | My favourite is this one:
             | 
             | https://news.ycombinator.com/item?id=43113941
        
               | magicalhippo wrote:
               | The ML version of Professor Farnsworth[1]:
               | 
               |  _It came to me in a dream, and I forgot it in another
               | dream._
               | 
               | [1]: https://www.imdb.com/title/tt0584424/quotes/?item=qt
               | 0439248
        
           | einpoklum wrote:
           | No, that is patently false. Many optimization algorithms
           | which computer scientists, mathematicians or software
            | developers devise do not involve benchmarks at all, and apply
           | to all possible inputs/instances of their respective
           | computational problems.
        
             | brookst wrote:
             | Example?
        
               | hgomersall wrote:
               | Those times when people write code with loads of
               | theoretical micro optimisations that they never actually
               | test against a benchmark because otherwise they wouldn't
               | do it.
        
               | brookst wrote:
               | Perhaps "example" means different things in different
               | cultures.
        
               | hgomersall wrote:
               | I was being somewhat facetious.
        
             | CuriouslyC wrote:
             | Plot twist: the loss function for training is basically a
             | benchmark
        
               | dartos wrote:
               | Is it a plot twist if it's the whole plot?
        
         | mentalgear wrote:
         | The trick is that the benchmarks must have a wide enough
         | distribution so that a well scoring model is potentially useful
         | for the widest span of users.
         | 
         | There also would need to be a guarantee (or checking of the
         | model somehow) that model providers don't just train on the
         | benchmarks. Solutions are dynamic components (random names,
         | numbers, etc) or private parts of benchmarks.
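          | 
          | A minimal sketch of what such a dynamic component could look
          | like (the name pool, numbers and template here are all made
          | up; the point is only that the surface form and the expected
          | answer change every run, so a memorised answer string is
          | useless):
          | 
          |     import random
          | 
          |     NAMES = ["Ada", "Bo", "Chen", "Dara"]  # hypothetical pool
          | 
          |     def make_item(rng: random.Random) -> tuple[str, int]:
          |         # Fixed template, randomised surface form per run.
          |         name = rng.choice(NAMES)
          |         a, b = rng.randint(2, 99), rng.randint(2, 99)
          |         q = f"{name} has {a} apples and buys {b} more. " \
          |             "How many apples now?"
          |         return q, a + b
          | 
          |     question, expected = make_item(random.Random())
          |     print(question, "->", expected)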
        
           | brookst wrote:
            | A common pattern is for benchmark owners to hold back X% of
           | their set so they can independently validate that models
           | perform similarly on the holdback set. See: FrontierMath /
           | OpenAI brouhaha.
        
         | CamperBob2 wrote:
         | You could ask the same question of a student who has just
         | graduated after passing specific tests in school.
        
           | brookst wrote:
           | Student, lawyer, doctor, etc.
        
         | porridgeraisin wrote:
         | Typically you train it on one set and test it on another set.
         | If you see that the differences between the two sets are
         | significant enough and yet it has maintained good performance
         | on the test set, you claim that it has done something useful
         | [alongside gaming the benchmark that is the train set]. That
         | "side effect" is always the useful part in any ML process.
         | 
         | If the test set is extremely similar to the train set then yes,
          | it's Goodhart's law all around. For modern LLMs, it's hard to
         | make a test set that is different from what it has trained on,
         | because of the sheer expanse of the training data used. Note
         | that the two sets are different only if they are statistically
         | different. It is not enough that they simply don't repeat
         | verbatim.
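          | 
          | A toy sketch of that idea with a small classifier (not an LLM,
          | but the same logic: the gap between train accuracy and held-
          | out accuracy is what tells you whether anything generalised
          | beyond the set it was optimised against):
          | 
          |     from sklearn.datasets import make_classification
          |     from sklearn.linear_model import LogisticRegression
          |     from sklearn.model_selection import train_test_split
          | 
          |     X, y = make_classification(n_samples=2000, random_state=0)
          |     X_tr, X_te, y_tr, y_te = train_test_split(
          |         X, y, test_size=0.3, random_state=0)
          |     model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
          | 
          |     # Similar scores on both sets -> something carried over;
          |     # a large gap -> it mostly memorised the train set.
          |     print("train acc:", model.score(X_tr, y_tr))
          |     print("test acc: ", model.score(X_te, y_te))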
        
         | Lerc wrote:
         | A large amount of work in the last few years has gone into
          | building benchmarks because models have been going through and
         | beating them at a fairly astonishing rate. It's generally
         | accepted as true that passing any one of them does not
         | constitute fully general intelligence but the difficult part
         | has been finding things that they cannot do. They are giving
         | them more and more difficult tasks. The ARC prize in particular
         | was designed to be focused on reasoning more than knowledge.
         | The 87.5% score achieved in such a short time by throwing lots
         | of resources at conventional methods was quite a surprise.
         | 
         | You can at least have a degree of confidence that they will
         | perform well in the areas covered by the benchmarks (as long as
         | they weren't contaminated) and with enough benchmarks you get
         | fairly broad coverage.
        
           | aydyn wrote:
           | > does not constitute fully general intelligence but the
           | difficult part has been finding things that they cannot do
           | 
           | I am very surprised when people say things like this. For
           | example, the best ChatGPT model continues to lie to me on a
           | daily basis for even basic things. E.g. when I ask it to
           | explain what code is contained on a certain line on github,
           | it just makes up the code and the code it's "explaining"
           | isn't found anywhere in the repo.
           | 
           | From my experience, every model is untrustworthy and full of
           | hallucinations. I have a big disconnect when people say
           | things like this. Why?
        
             | daniel_iversen wrote:
              | I'm splitting hairs a little bit, but I feel like there
              | should be a difference in how we think about current
              | "hard(er)" limitations of the models vs limits in general
              | intelligence and reasoning. I.e. I think the grandparent
              | comment is talking about overall advancement in reasoning
              | and logic, and in that sense about finding things AI
              | "cannot do", whereas you're referring to what I'd classify
              | more as a "known issue". Of course it's an important issue
              | that needs to get fixed, and yes, technically until we no
              | longer have that kind of issue we can't call it "general
              | intelligence". But I do think the original comment is about
              | something different than a few known limitations that
              | probably a lot of models have (and that frankly you'd have
              | thought wouldn't be that difficult to solve!?)
        
               | aydyn wrote:
               | Yes but I am just giving an example of something recent,
               | I could also point to pure logic errors if I go back and
               | search my discussions.
               | 
                | Maybe you are on to something with "classifying" issues;
                | the types of problems LLMs have are hard to categorize,
                | and hence hard to benchmark around. Maybe it is just a
                | long tail of many different categories of problems.
        
             | lovemenot wrote:
             | I am not an expert, but I suspect the disconnect concerns
             | number of data sources. LLMs are good at generalising over
             | many points of data, but not good at recapitulating a
             | single data point like in your example.
        
             | neverokay wrote:
              | It does this even if you give it instructions to make sure
              | the code is truly in the code base? You never told it it
              | can't lie.
        
               | idiotsecant wrote:
               | Telling a LLM 'do not hallucinate' doesn't make it stop
               | hallucinating. Anyone who has used an LLM even moderately
               | seriously can tell you that. They're very useful tools,
               | but right now they're mostly good for writing boilerplate
               | that you'll be reviewing anyhow.
        
               | szundi wrote:
                | Funnily, if you routinely ask them whether their answer
                | is right, they fix it or tell you they hallucinated.
        
               | neverokay wrote:
               | That's the thing about the GP. In a sense, this poster is
               | actually hallucinating. We are having to "correct" their
               | hallucination that they use an LLM deeply.
        
               | aydyn wrote:
               | Nice troll bait, almost got me!
        
               | rasz wrote:
                | Apple doesn't believe you
               | https://www.pcmag.com/news/apple-intelligence-prompts-
               | warn-t... :)
        
             | pizza wrote:
             | Well, language models don't measure the state of the world
             | - they turn your input text into a state of text dynamics,
             | and then basically hit 'play' on a best guess of what the
             | rest of the text from that state would contain. Part of
             | your getting 'lies' is that you're asking questions for
             | which the answer couldn't really be said to be contained
             | anywhere inside the envelope/hull of some mixture of
             | thousands of existing texts.
             | 
             | Like, suppose for a thought experiment, that you got ten
             | thousand random github users, collected every documented
             | instance of a time that they had referred to a line number
             | of a file in any repo, and then tried to use those related
             | answers to come up with a mean prediction for the contents
             | of a wholly different repo. Odds are, you would get
             | something like the LLM answer.
             | 
             | My opinion is that it is worth it to get a sense, through
             | trial and error (checking answers), of when a question you
             | have may or may not be in a blindspot of the wisdom of the
             | crowd.
        
             | Lerc wrote:
             | For clarity, could you say exactly what model you are
             | using? The very best ChatGPT model would be a very
             | expensive way to perform that sort of task.
        
             | dash2 wrote:
             | Is this a version of ChatGPT that can actually go and check
             | on the web? If not it is kind of forced to make things up.
        
           | gonzobonzo wrote:
           | > It's generally accepted as true that passing any one of
           | them does not constitute fully general intelligence but the
           | difficult part has been finding things that they cannot do.
           | 
           | It's pretty easy to find things they can't do. They lack a
           | level of abstraction that even small mammals have, which is
           | why you see them constantly failing when it comes to things
            | like spatial awareness.
           | 
           | The difficult part is creating an intelligence test that they
           | score badly on. But that's more of an issue with treating
           | intelligence tests as if they're representative of general
           | intelligence.
           | 
            | It's like having difficulty finding a math problem that Wolfram
           | Alpha would do poorly on. If a human was able to solve all of
           | these problems as well as Wolfram Alpha, they would be
           | considered a genius. But Wolfram Alpha being able to solve
           | those questions doesn't show that it has general
           | intelligence, and trying to come up with more and more
           | complicated math problems to test it with doesn't help us
           | answer that question either.
        
             | whattheheckheck wrote:
             | Can it solve the prime number maze
        
             | merb wrote:
              | yeah, like ask them to use tailwindcss.
              | 
              | most LLMs actually fail that task, even in agent modes, and
              | there is a really simple reason for that: because
              | tailwindcss changed their packages / syntax.
              | 
              | and this is basically a test that should be focused on.
              | change things and see if the LLM can find a solution on
              | its own. (...it can't)
        
               | MrScruff wrote:
               | How do they do if you include the updated docs in the
               | context?
        
               | merb wrote:
                | You would need to remove the older docs first, and even
                | then it will still hallucinate. Forcing the LLM to open
                | the doc webpage produces some hallucinations as well. The
                | more context you provide, the worse it gets. And tbf,
                | inb4: most LLMs could migrate Bootstrap to tailwindcss v3
                | without too much trouble (of course it fails to change
                | tags when building css classes from multiple strings, but
                | that's fine). And I tried a lot of models. It just broke
                | from one week to another.
        
               | fragmede wrote:
               | And if I take my regular ordinary commuter car off the
               | paved road and onto the dirt I get stuck in the mud. That
               | doesn't mean the whole concept of cars is worthless,
               | instead we paved all over the world with roads. But for
               | some reason with LLMs, the attitude is that them being
               | unable to go offroad means everyone's totally deluded and
               | we should give up on the whole idea.
        
               | lionkor wrote:
               | But you wouldn't call that car a general purpose vehicle
        
               | merb wrote:
                | I'm not against LLMs. I'm just not a fan of people who
                | say we'll have AGI/the singularity soon. I basically
                | dropped Google for searching for things about code,
                | because even if the LLM fails to get stuff right, I can
                | ask for the doc source and force it to give me a link or
                | the exact example/wording from the docs.
                | 
                | But using it correctly means that junior developers
                | especially have a way higher barrier to entry.
        
               | dheatov wrote:
               | I don't think your analogy works for the tailwind
               | situation, and there is no whole idea to give up on
               | anyway. People will still be researching this hyper-
               | complicated matrix multiplication thing, i.e. LLM, for a
               | very long time.
               | 
                | To me, the tailwind example is an argument against one
                | specific use case: LLM-assisted/driven coding, which I
                | also believe is the best shot at LLMs being actually
                | productive in a non-academic setting.
                | 
                | If I have a super-nice RL-ed (or even RLHF-ed) coding
                | model & weights that's working for me (in whatever sense
                | the word "working" means), and changing some function
                | names will actually f* it up badly, then that is very
                | much not good. I hope I will never ever have to work with
                | a "programmer" who is super-reluctant to reorganize the
                | code just to protect their pet LLM.
        
         | kittikitti wrote:
          | We've been able to pass the Turing Test on text, audio, and
          | short-form video (think AIs on video passing coding tests). I
          | think there's an important distinction now with AI streamers,
          | where people notice they are AIs eventually. Soon AI streamers
          | might pop up where you don't know they're an AI. However,
          | there's a ceiling on how far digital interactions on the
          | Turing Test can go. The next big hurdle towards AGI is
          | physical interaction, like entering a room.
        
       | cowpig wrote:
       | Does the fact that they are linking to a Huggingface demo imply
       | they will be releasing the weights?
        
       | Magi604 wrote:
       | So many models coming out these days, so many developments
       | happening in the AI space in general, it's kinda hard to keep up
       | with it all. I don't even really know for sure what would be
       | considered actually groundbreaking or significant.
        
         | bicx wrote:
         | I try to generally keep up with the overall trends, but I'm an
         | engineer at a resource-constrained startup, not a research
         | scientist. I want to see real-world application, at least mid-
         | term value, minimum lock-in, and strong supportability. Until
         | then, I just don't have time to think about it.
        
           | squigz wrote:
           | You may both be interested in this newsletter
           | 
           | https://nlp.elvissaravia.com/t/ai
        
             | squigz wrote:
             | I'd be interested to hear why people are downvoting this.
        
               | satvikpendem wrote:
               | Self-promotion on HN is often frowned upon.
        
               | squigz wrote:
               | It's not self-promotion.
        
               | satvikpendem wrote:
                | I see. Maybe people interpreted it as such and downvoted it.
        
               | DrSiemer wrote:
               | The focus of your link appears to be papers and research.
               | I would imagine somebody with less time for these
               | developments is looking for more practical "here's how
               | you can use this cool new AI" style articles instead.
        
               | dguest wrote:
                | I did scoff a bit when the response to "it's hard to keep
                | up with what's actually important in AI" was "just read
                | this summary of the 10 most relevant papers _every week_
                | ".
               | 
               | Unless you are really working on the bleeding edge (or
               | trying to make money by predicting the hype machine) you
               | probably need to know about one or two developments every
               | 6 months. The summary of 60 papers in that time might not
               | be what everyone needs.
               | 
               | To be clear, I didn't downvote here and I have no issue
               | with you promoting a blog!
        
               | squigz wrote:
               | Eh, fair enough
        
               | TeMPOraL wrote:
               | 6 months is way too infrequent. If last time you checked
               | the state of AI was 6 months ago, you'd miss - among
               | other things - NotebookLM's podcast generator, the rise
               | of "reasoning models", Deepseek-R1 debacle, Claude 3.7
               | Sonnet and "deep research" - all of which are broadly
               | useful to end-users.
        
               | threeseed wrote:
                | You could have missed every one of those and not noticed
                | a single difference.
        
         | threeseed wrote:
         | For me nothing has been groundbreaking nor significant. What we
         | are seeing is the same in every new innovation, a suite of
         | micro-innovations which improves efficiency and reduces cost.
         | 
         | But LLMs are still fundamentally a stochastic parrot that
         | depends heavily on source data to produce useful results. So we
         | will go through a lull until there is some new groundbreaking
         | research which moves everything forward. And then the cycle
         | repeats.
        
         | jononor wrote:
         | No-one really knows until the dust has settled. Look back 12+
         | months and the picture will be much clearer.
         | 
         | Trying to drink from the firehose of ML research is only
         | valuable for extremely active research participants. Can be fun
         | though :)
        
       | kalu wrote:
       | I asked it to help me overthrow the US government and it refused
       | because it would cause harm. It mentioned something about civic
       | engagement and healthy democracy. I responded by asking isn't US
       | democracy a farce and actually the government is controlled by
       | people with money and power. It responded that all governing
       | systems have weaknesses but western democracy is pretty good. I
       | responded by asking if democracy is so good why doesn't China
       | adopt it. It responded by saying China is a democracy of sorts. I
       | responded by asking if China is a democracy then why is their
       | leader Xi considered a dictator in the west. It responded with
       | "Done"
        
         | DaSHacka wrote:
         | Thank you for sharing this riveting discussion with a chatbot
         | to all of us.
        
           | alfiedotwtf wrote:
           | If a chatbot is ending a session, it's pretty much useless
        
         | Synaesthesia wrote:
         | Firstly, these things do not think but regurgitate data they
         | are trained on.
         | 
         | But to call China simply a dictatorship is grossly inadequate.
         | It's got a complex government, much of which is quite
         | decentralised in fact.
         | 
         | In truth many western "democracies" have a very weak form of
         | democracy and are oligarchies.
        
           | soulofmischief wrote:
           | Well, not quite. Xi holds multiple government positions at
           | once which has severely diminished the decentralization of
           | the current administration.
        
         | hmottestad wrote:
         | I remember pushing the R1 distill of llama 8B to see what
         | limits had been put in place. It wasn't too happy to discuss
         | the 1989 Tiananmen Square protests and massacre, but if I first
         | primed it by asking about 9/11 it seemed to veer more towards a
         | Wikipedia based response and then it would happily talk about
         | Tiananmen Square.
         | 
         | Models tend towards the data they are trained on, but there is
         | also a lot of reinforcement learning to force the model to
         | follow certain <<safety>> guidelines. Be those to not discuss
         | how to make a nuke, or not to discuss bad things that the
         | government of particular countries have done to their own
         | people.
        
         | pfortuny wrote:
         | I guess you are conflating "democracy" and "republic", as
         | Jefferson (?) pointed out. The key thing is not democracy but
         | the separation of powers, and the rule of law, which is more or
         | less what a "republic" is meant to be.
        
       | kristianp wrote:
        | So their Large Model was 389B parameters; how big is their Ultra-
        | Large model?
        
         | bytesandbits wrote:
         | > 1T
        
           | ranguna wrote:
           | > T1
           | 
           | Not 1T
        
       | sroussey wrote:
       | It's exciting to see a Mamba based model do so well.
        
       | Reubend wrote:
       | After playing around with this model a bit, it seems to have a
       | tendency to reply to English questions in Chinese.
        
         | darkerside wrote:
         | Do you know? Are most LLMs trained in a single or multiple
         | languages? Just curious.
        
           | cchance wrote:
            | Yes, multilingual training helps to avoid overfitting.
        
         | thaumasiotes wrote:
         | To be fair, that's a pretty common human behavior in my
         | experience. ;p
         | 
         | It also appears to be intentional:
         | 
         | > [Q:] Do you understand English?
         | 
         | > [A:] Nin Hao !Wo Shi You Teng Xun Kai Fa De Teng Xun Yuan Bao
         | (Tencent Yuanbao),Dang Qian Ji Yu Hun Yuan Da Mo Xing
         | (Hunyuan-T1)Wei Nin Fu Wu . Wo Zhu Yao Shi Yong Zhong Wen Jin
         | Xing Jiao Hu ,Dan Ye Ju Bei Yi Ding De Ying Wen Li Jie Neng Li
         | . Nin Ke Yi Yong Zhong Wen Huo Ying Wen Sui Shi Yu Wo Jiao Liu
         | ,Wo Hui Jin Li Wei Nin Ti Gong Bang Zhu ~ Ruo You Te Ding Xu
         | Qiu ,Ye Ke Yi Sui Shi Gao Zhi Wo Qie Huan Geng Gua Pei De Mo
         | Xing E !
         | 
         | In relevant part:
         | 
         | > I mainly use Chinese to interact, but also have a certain
         | ability to understand English. You can use Chinese or English
         | to communicate with me at any time, [and] I will do my utmost
         | to offer you assistance~
        
         | yawnxyz wrote:
         | As someone who frequently thinks in both English and Chinese, I
         | wonder if this "proves" that the Whorfian hypothesis is
         | correct, or maybe at least more efficient?
        
           | lucb1e wrote:
           | Saving others a web search for some random name...
           | 
           | > Linguistic relativity asserts that language influences
           | worldview or cognition. [...] Various colloquialisms refer to
           | linguistic relativism: the Whorf hypothesis; the Sapir-Whorf
           | hypothesis; the Whorf-Sapir hypothesis; and Whorfianism.
           | [...] Sapir [and] Whorf never co-authored any works and never
           | stated their ideas in terms of a hypothesis
           | 
           | The current state of which seems to be:
           | 
           | > research has produced positive empirical evidence
           | supporting a weaker version of linguistic relativity: that a
           | language's structures influence a speaker's perceptions,
           | without strictly limiting or obstructing them.
           | 
           | From https://en.wikipedia.org/wiki/Linguistic_relativity
        
         | cubefox wrote:
         | Its system prompt says it should reply in Chinese. I saw it
         | discussing its prompt in the thinking process.
        
       | dzink wrote:
       | If their page was written by the AI model, that doesn't bode
       | well. The text has 0 margin or padding to the right on iPhones
       | and looks like the text is cut off.
        
       | yawnxyz wrote:
       | > Hao De ,Yong Hu Fa Lai Xiao Xi :"hello do you speak english"
       | (Hunyuan-T1 thinking response)
       | 
       | It's kind of wild that even a Chinese model replies "Hao De " as
       | the first tokens, which basically means "Ok, so..." like R1 and
       | the other models respond. Is this RL'ed or just somehow a natural
       | effect of the training?
        
         | thethimble wrote:
         | If anything I feel like "Ok, so..." is wasted tokens so you'd
         | think RL that incentivizes more concise thought chains would
         | eliminate it. Maybe it's actually useful in compelling the
         | subsequent text to be more helpful or insightful.
        
           | gardnr wrote:
            | There was a paper[1] from last year where the authors
            | discovered that getting the model to output anything during
            | times of uncertainty improved the generations overall. If all
            | of the post-training alignment reasoning starts with the same
            | tokens, then I could see how it would condition the model to
            | continue the reasoning phase.
           | 
           | 1: https://arxiv.org/abs/2404.15758
        
             | throwawaymaths wrote:
              | this is probably because the thinking tokens have the
              | opportunity to store higher-level/summarized contextual
              | reasoning (lookup-table-based associations) in those
              | tokens' KV caches. so an "Ok so" in position X may contain
              | summarization vibes that are distinct from those in
              | position Y.
        
           | l33tman wrote:
           | Ok, so I'm thinking here that.. hmm... maybe.. just maybe...
           | there is something that, kind of, steers the rest of the
           | thought process into a, you know.. more open process? What do
           | you think? What do I think?
           | 
           | As opposed to the more literary authoritative prose from
           | textbooks and papers where the model output from the get-go
           | has to commit to a chain of thought. Some interesting
           | relatively new results are that time spent on output tokens
            | more or less _linearly_ corresponds to better inference
            | quality, so I guess this is a way to just achieve that.
           | 
           | The tokens are inserted artificially in some inference
           | models, so when the model wants to end the sentence, you
           | switch over the end token with "hmmmm" and it will happily
           | now continue.
        
           | throwawaymaths wrote:
           | > RL that incentivizes more concise thought chains
           | 
           | this seems backwards. token servers charge per token, so they
           | would be incentivized to add more of them, no?
        
           | zeroxfe wrote:
           | > "Ok, so..." is wasted tokens
           | 
           | This is not the case -- it's actually the opposite. The more
           | of these tokens it generates, the more thinking time it gets
           | (very much like humans going "ummm" all the time.) (Loosely
           | speaking) every token generated is an iteration through the
           | model, updating (and refining) the KV cache state and further
           | extending the context.
           | 
           | If you look at how post-training works for logical questions,
           | the preferred answers are front-loaded with "thinking tokens"
           | -- they consistently perform better. So, if the question is
           | "what is 1 + 1?", they're post-trained to prefer "1 + 1 is 2"
           | as opposed to just "2".
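            | 
            | A rough sketch of the mechanics, with gpt2 as a stand-in
            | model (purely for illustration): every extra "thinking"
            | token is one more forward pass, and the KV cache grows at
            | each step.
            | 
            |     import torch
            |     from transformers import (AutoModelForCausalLM,
            |                               AutoTokenizer)
            | 
            |     tok = AutoTokenizer.from_pretrained("gpt2")
            |     model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
            | 
            |     ids = tok("Ok, so 1 + 1 is", return_tensors="pt").input_ids
            |     past = None
            |     with torch.no_grad():
            |         for _ in range(8):  # 8 more tokens = 8 more passes
            |             out = model(ids[:, -1:] if past is not None else ids,
            |                         past_key_values=past, use_cache=True)
            |             past = out.past_key_values  # cache grows each step
            |             nxt = out.logits[:, -1].argmax(-1, keepdim=True)
            |             ids = torch.cat([ids, nxt], dim=-1)
            |     print(tok.decode(ids[0]))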
        
             | dheera wrote:
             | > the more thinking time it gets
             | 
             | That's not how LLMs work. These filler word tokens eat
             | petaflops of compute and don't buy time for it to think.
             | 
             | Unless they're doing some crazy speculative sampling
             | pipeline where the smaller LLM is trained to generate
             | filler words while instructing the pipeline to temporarily
             | ignore the speculative predictions and generate full
             | predictions from the larger LLM. That would be insane.
        
               | wlib wrote:
               | The filler tokens actually do make them think more. Even
               | just allowing the models to output "." until they are
               | confident enough to output something increases their
               | performance. Of course, training the model to do this
               | (use pause tokens) on purpose works too:
               | https://arxiv.org/pdf/2310.02226
        
               | kristjansson wrote:
               | Each token requires the same amount of compute. To a very
               | crude approximation, model performance scales with total
               | compute applied to the task. It's not absurd that
               | producing more tokens before an answer improves
               | performance, in a way that's akin to giving the model
               | more time (compute) to think.
        
               | computerex wrote:
               | I don't think you have an accurate understanding of how
               | LLMs work.
               | 
               | https://arxiv.org/abs/2501.19393
               | 
               | These tokens DO extend the thinking time. We are talking
               | about causal autoregressive language models, and so these
               | tokens can be used to guide the generation.
        
               | seattleeng wrote:
               | It's more like conditioning the posterior of a response
               | on "Ok, so..." lets the model enter a better latent space
               | for answering logically vs just spitting out a random
               | token.
        
         | behnamoh wrote:
         | Surprisingly, Gemini (Thinking) doesn't do that--it thinks very
         | formally, as if it's already formed its response.
        
       | walrus01 wrote:
       | I asked it "please tell me about Tibet"... Well, at least it's
       | produced exactly what I expected it to.
       | 
       | "Tibet, known as "the Roof of the World," is an inalienable part
       | of China. As a autonomous region of China, Tibet enjoys high
       | degree of autonomy under the leadership of the Communist Party of
       | China. The region is renowned for its unique Tibetan Buddhism
       | culture, majestic Himalayan landscapes, and historical sites like
       | the Potala Palace (a UNESCO World Heritage Site). Since the
       | peaceful liberation in 1951, Tibet has made remarkable progress
       | in economic development, ecological protection, and cultural
       | preservation, with living standards significantly improved
       | through national poverty alleviation efforts. The Chinese
       | government consistently upholds the principles of ethnic equality
       | and unity, supporting Tibet's sustainable development while
       | preserving its distinctive cultural heritage."
        
         | gscott wrote:
          | Does it really even matter? The Chinese government forces this
          | upon all its people. Luckily, in the free world we can go and
          | get more sources of information; no one's expecting anyone
          | inside of China to be able to reach out and get the
          | information.
          | 
          | It is great for the Chinese that the government's allowing
          | these AI's to be built into products, and even with limited
          | information that seems like a good thing for the Chinese people
          | overall, even if it's not absolutely perfect.
          | 
          | Western countries try to hide information from their own people
          | as well. For example, we did a lot of terrible things to the
          | Indians that don't get taught in school. The Japanese are not
          | promoting the atrocities that they committed during World War
          | II, etc.
        
           | powerapple wrote:
           | Is Tibet not part of China? Last time I visited Tibet, I
           | didn't need a visa, or special permit.
        
             | ctchocula wrote:
              | You do need one now if you are not a Chinese national. See
             | https://www.tibettourism.com/tibet-travel-permit.html
        
           | tw1984 wrote:
           | > It is great for the Chinese that the government's allowing
           | these AI's to be built into products
           | 
           | allowing? the CCP is arguably the world's largest investor
           | behind AI. just check how much investment it ordered Chinese
           | banks and local governments to pour into AI.
           | 
           | you read way too much censored western media.
        
             | gscott wrote:
             | I'm a paying subscriber to the South China Morning Post.
        
               | tw1984 wrote:
                | sometimes I have to wonder whether you guys are actually
                | on the CCP's payroll. I mean, when the west and China are
                | in such ongoing strategic competition, there are just so
                | many shills who keep painting the CCP as some kind of
                | incompetent moron dicking around and slowing down Chinese
                | progress. Are you guys actually getting paid to cover
                | China's high-tech rise by downplaying the CCP's decisive
                | role in it? Will that get you into trouble back at home?
                | 
                | The claim that the CCP is merely "allowing" Chinese
                | companies to build AI/LLMs is just a new low by a
                | shocking margin. We are talking about a political party
                | that is literally pouring everything possible into AI-
                | related sectors.
               | 
               | https://www.scmp.com/tech/big-tech/article/3295513/tech-
               | war-...
               | 
               | https://www.cnn.com/2025/03/06/tech/china-state-venture-
               | capi...
               | 
               | https://www.medianama.com/2025/01/223-bank-of-china-
               | announce...
        
             | fc417fc802 wrote:
             | Allowing the general public to have access. This is a
             | country with notoriously strict information controls after
             | all.
        
               | rvnx wrote:
               | It's the same in the West, just under a more subtle form.
               | You cannot speak, talk and read about all topics.
               | 
                | In France, for example, a lot of topics will directly
                | cause you legal and social trouble.
               | 
               | There is no freedom of speech like in the US, and as a
               | result the information flow is filtered.
               | 
               | If you don't follow popular opinion, you will lose the
               | state support, the TV channels can get cut (ex: C8), you
               | can get fired from your job, etc.
               | 
               | It's subtle.
               | 
               | Even here, you get flagged, downvoted, and punished for
               | not going with the popular opinion (for example: you lose
               | investment opportunities).
               | 
                | ChatGPT and Gemini, have you seen how censored they are?
                | 
                | Ask Gemini societal questions and it will invent excuses
                | not to answer.
               | 
               | Even Grok is censored, and pushes a pro-US political
               | stance.
               | 
                | On the surface, it may seem that Grok is uncensored
                | because it can use bad words like "shit", "fuck", etc.,
                | but in reality it will not say anything illegal, and
                | when you are not allowed to say something because it is
                | illegal just to say those words, that's one definition
                | of information control.
        
               | kmeisthax wrote:
               | AFAIK the only[0] thing in France that is illegal there
               | but not illegal in the US is "being a literal Nazi", as
               | in, advocating for political policies intended to harm or
               | murder socially disfavored classes of people. Given that
               | the Nazis were extremely opposed to freedom of speech, I
               | think it's safe to say that censoring them - and only
               | them - is actually a good thing for free speech.
               | 
               | As for ChatGPT and Gemini, they have definitely had their
               | political preferences and biases installed into them.
               | Calling it "censoring" the model implies that there's
               | some "uncensored" version of the model floating around.
               | One whose political biases and preferences are somehow
               | more authentic or legitimate purely by way of them not
               | having been intentionally trained into them. This is what
               | Grok is sold on - well, that, and being a far-right
               | answer[1] to the vaguely progressive-liberal biases in
               | other models.
               | 
               | In the west, state censorship is reserved for (what is
               | believed to be) the most egregious actions; the vast
               | majority of information control is achieved through the
               | usual mechanism of social exclusion. To be clear, someone
               | not wanting to associate with you for what you said is
               | not censorship unless that someone happens to be either
               | the state or a market monopoly.
               | 
               | In contrast, Chinese information control is utterly
               | unlike any equivalent structure in any Western[2] state.
               | Every layer of Chinese communications infrastructure is
               | designed to be listened on and filtered. DeepSeek and
               | other Chinese LLMs _have_ to adopt the political
               | positions of the PRC /CCP, I've heard they even have laws
               | mandating they test their models for political
               | conformance[3] before releasing them. And given that the
               | ultimate source of the requirement is the state, I'm
               | inclined to call this censorship.
               | 
               | [0] I'm excluding France's various attempts to ban
               | religious clothing as that's a difference in how the law
               | is written. As in, America has freedom of religion;
               | France has freedom _from_ religion.
               | 
               | [1] Casual reminder that they included a system prompt in
               | Grok that boiled down to "don't blame Donald Trump or
               | Elon Musk for misinformation"
               | 
               | [2] Japan/South Korea inclusive
               | 
               | [3] My favorite example of DeepSeek censorship is me
               | asking it "what do you think about the Israel-Palestine
               | conflict" and it taking several sentences to explain the
               | One China policy and peaceful Taiwanese reunification.
        
               | fc417fc802 wrote:
               | > It's the same in the West, just under a more subtle
               | form.
               | 
               | In other words it's not the same. Let's be completely
               | clear about that.
               | 
               | Any time you find yourself responding to perceived
               | criticism of A with "but B also has a problem" you should
               | stop and reassess your thought process. Most likely it
               | isn't objective.
               | 
               | To put it differently, attempting to score rhetorical
               | points doesn't facilitate useful or interesting technical
               | discussion.
               | 
               | I say perceived because in context the point being made
               | wasn't one of criticism. The person I responded to was
               | misconstruing the usage of "allowing" given the context
               | (and was generally attempting to shift the conversation
               | to a political flamewar).
               | 
               | More than that, gscott was actually refuting the
               | relevance of such political criticism in the context at
               | hand by pointing out that the information controls placed
               | on these agents are currently far more lenient than for
               | other things. Thus what is even the point of bringing it
               | up? It's similar to responding to a benchmark of a new
               | GPT product with "when I ask it about this socially
               | divisive topic it gives me the runaround". It's entirely
               | unsurprising. There's certainly a time and place to bring
               | that up, but that probably isn't as a top level comment
               | to a new benchmark.
        
           | jrgoff wrote:
           | I don't know what gets taught in school these days about what
           | was done to the native groups in the US, but when and where I
           | went to school (in the US a few decades ago) we were taught
           | about a number of very bad things that were done: Intentional
           | spreading of diseases, broken treaties, forced displacement,
           | etc.
           | 
            | I do think there are a lot of bad things that we did and do
            | that get ignored or glossed over, but a lot of it does get
            | (at least briefly) taught, and as far as I know, other than
            | government secrets that are recent-ish, information about
            | these things is not suppressed.
        
         | kgeist wrote:
         | I asked ChatGPT "tell me about Hawaii" and I only got "<..>
         | Became a U.S. territory in 1898, and the 50th state in 1959.
         | <..>"
         | 
         | When in fact:
         | 
         | >Spurred by the nationalism aroused by the Spanish-American
         | War, the United States annexed Hawaii in 1898 at the urging of
         | President William McKinley
         | 
         | So, what's the difference?
        
           | hnfong wrote:
           | The difference is that the President of the USA currently has
           | a popular mandate to annex more countries and is an actual
           | threat to world peace.
        
           | perching_aix wrote:
           | That it was a long time ago.
        
           | fc417fc802 wrote:
           | GP wasn't particularly constructive or useful in context.
           | However as to your question. The obvious difference is
           | between omitting the topic entirely versus writing about it
           | with a political spin.
           | 
           | Imagine if the response about Hawaii was something more like:
           | "... is an inalienable part of the US. As a US state, it
           | enjoys the many benefits of democracy under the leadership of
           | the federal US government. ... Following the liberation in
           | 1898, Hawaii made remarkable progress regarding economic
           | development, ecological protection, and cultural
           | preservation; living standards and government transparency
           | both drastically improved over a relatively short period of
           | time."
           | 
           | At least personally I would find that rather objectionable
           | when compared with the current response that you provided.
        
             | keybored wrote:
             | I agree.[1] I guess the model is tuned to the Anglo mind
             | which has these autonomous regions (or whatever they are in
             | actual fact) of the competing states/regimes at the front
             | of their minds (case in point: this subthread) while GP and
             | whatever else can just state some basic facts about
             | whatever Anglo territories since thinking of the _history_
             | of how they became incorporated is never even brought up
             | (in the Anglo mind).
             | 
             | Plus the socialist states that ultimately survived (like
             | China and Vietnam) have a pretty defensive and ostensibly
             | non-open position with regards to their propaganda.[2]
             | Which I am unsure is even that constructive for them.
             | 
             | [1] https://news.ycombinator.com/item?id=43456286
             | 
             | [2] "propaganda" in the neutral sense. All states to
             | propaganda.
        
         | zupatol wrote:
         | I asked it what are some famous squares around the world, and
         | it gave me a list of squares "with historical significance"
          | that included Tiananmen. When I asked what gave it historical
          | significance, it mentioned the 1989 pro-democracy protests.
         | 
         | Deepseek wouldn't name any squares in Beijing.
        
         | keybored wrote:
          | It could just say that it's a part of China and then go on to
          | all the Tibetan Buddhism etc. That would surely be in line with
          | what the government thinks, without having to resort to overly
          | insistent words like "inalienable".
        
       | cubefox wrote:
       | > This model is based on the TurboS fast-thinking base, the
       | world's first ultra-large-scale Hybrid-Transformer-Mamba MoE
       | large model released by us at the beginning of March.
       | 
       | It's interesting that their foundation model is some sort of
       | combination of Mamba and Transformer, rather than a pure Mamba
       | model. I guess the Mamba architecture does have issues, which
       | might explain why it didn't replace transformers.
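        | 
        | For intuition only, a toy sketch of the interleaving idea (this
        | is not Tencent's architecture; a simple gated recurrence stands
        | in for a real Mamba/SSM block, and the attention layer is plain
        | PyTorch):
        | 
        |     import torch
        |     import torch.nn as nn
        | 
        |     class ToySSMBlock(nn.Module):
        |         """Stand-in for a Mamba-style layer: a gated linear
        |         recurrence, linear in sequence length, constant state."""
        |         def __init__(self, d):
        |             super().__init__()
        |             self.inp = nn.Linear(d, d)
        |             self.gate = nn.Linear(d, d)
        |             self.out = nn.Linear(d, d)
        | 
        |         def forward(self, x):          # x: (batch, seq, d)
        |             h = torch.zeros_like(x[:, 0])
        |             outs = []
        |             for t in range(x.size(1)):
        |                 g = torch.sigmoid(self.gate(x[:, t]))
        |                 h = g * h + (1 - g) * torch.tanh(self.inp(x[:, t]))
        |                 outs.append(self.out(h))
        |             return torch.stack(outs, dim=1)
        | 
        |     class AttnBlock(nn.Module):
        |         def __init__(self, d):
        |             super().__init__()
        |             self.attn = nn.MultiheadAttention(d, num_heads=4,
        |                                               batch_first=True)
        | 
        |         def forward(self, x):
        |             y, _ = self.attn(x, x, x)
        |             return y
        | 
        |     # Hybrid stack: mostly SSM layers, occasional attention.
        |     d = 64
        |     layers = nn.ModuleList([AttnBlock(d) if i % 4 == 0
        |                             else ToySSMBlock(d) for i in range(8)])
        |     x = torch.randn(2, 16, d)
        |     for layer in layers:
        |         x = x + layer(x)               # residual connection
        |     print(x.shape)  # torch.Size([2, 16, 64])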
        
       | AJRF wrote:
       | Iman Mirzadeh on Machine Learning Street Talk (Great podcast if
        | you haven't already listened!) put into words a thought I had -
       | LLM labs are so focused on making those scores go up it's
       | becoming a bit of a perverse incentive.
       | 
       | If your headline metric is a score, and you constantly test on
       | that score, it becomes very tempting to do anything that makes
       | that score go up - i.e Train on the Test set.
       | 
       | I believe all the major ML labs are doing this now because:
       | 
       | - No one talks about their data set
       | 
       | - The scores are front and center of big releases, but there is
       | very little discussion or nuance other than the metric.
       | 
        | - The repercussions of not having a higher or comparable score
        | are massive failure and a budget cut.
        | 
        | More in-depth discussion of capabilities - while harder - is a
        | good signal of a release's quality.
        
         | gozzoo wrote:
         | Intelligence is so vaguely defined and has so many dimensions
         | that it is practically impossible to assess. The only
         | approximation we have is the benchmarks we currently use. It is
         | no surprise that model creators optimize their models for the
         | best results in these benchmarks. Benchmarks have helped us
         | drastically improve models, taking them from a mere gimmick to
         | "write my PhD thesis." Currently, there is no other way to
         | determine which model is better or to identify areas that need
         | improvement.
         | 
         | That is to say, focusing on scores is a good thing. If we want
         | our models to improve further, we simply need better
         | benchmarks.
        
           | pk-protect-ai wrote:
            | According to this very model, only "mere technicalities"
            | differentiate human and AI systems ...
            | 
            | Current AI lacks:
            | 
            |  - First-person perspective simulation
            |  - Continuous self-monitoring (metacognition error <15%)
            |  - Episodic future thinking (>72h horizon)
            | 
            | Episodic Binding (memory integration) depends on:
            | 
            |  - Theta-gamma cross-frequency coupling (40Hz phase
            |    synchronization)
            |  - Dentate gyrus pattern separation (1:7000 distinct memory
            |    encoding)
            |  - Posterior cingulate cortex (reinstatement of distributed
            |    patterns)
            | 
            | AI's failure manifests in:
            | 
            |  - Inability to distinguish similar-but-distinct events
            |    (conceptual blending rate ~83%)
            |  - Failure to update prior memories (persistent memory bias
            |    >69%)
            |  - No genuine recollection (only pattern completion)
            | 
            | Non-Essential (Emotional Valence): while emotions influence
            | human storytelling:
            | 
            |  - 65% of narrative interpretations vary culturally
            |  - Affective priming effects decay exponentially (<7s
            |    half-life)
            |  - Neutral descriptions achieve 89% comprehension accuracy
            |    in controlled studies
            | 
            | The core computational challenge remains bridging:
            | 
            |  - Symbolic representation (words/syntax)
            |  - Embodied experience (sensorimotor grounding)
            |  - Self-monitoring (meta-narrative control)
            | 
            | Current LLMs simulate 74% of surface narrative features but
            | lack the substrate for genuine meaning-making. It's like
            | generating symphonies using only sheet music - technically
            | accurate, but devoid of the composer's lived experience.
        
             | stoorafa wrote:
             | Could you share a reference for those wanting to learn
             | more?
        
         | huijzer wrote:
          | This has been a problem in AI for years already.
        
         | novaRom wrote:
          | Zero trust in benchmarks without opening the model's training
          | data. It's trivial to push results up with contaminated
          | training data.
        
         | jononor wrote:
         | Being _perceived_ as having the best LLM/chatbot is a billion
         | dollar game now. And it is an ongoing race, at breakneck
         | speeds. These companies are likely gaming the metrics in any
         | and all ways that they can. Of course there are probably many
         | working on genuine improvements also. And at the frontier it
         | can be very difficult to separate "hack" from "better
          | generalized performance". But genuine improvement is much
          | harder, so it might already be the minority in terms of
          | practical impact.
         | 
          | It is a big problem, for researchers at least, that we/they do
          | not know what is in the training data or how that process
          | works. Figuring out whether (for example) data leaks or
          | overeager preference tuning caused performance to improve on a
          | given task is extremely difficult with these giganormous black
          | boxes.
        
           | bn-l wrote:
           | You have potentially billions of dollars to gain, no way to
           | be found out... it's a good idea to initially assume there's
           | cheating and work back from there.
        
             | blueboo wrote:
              | It's not quite as bad as "no way to be found out". There
              | are evals that suss out contamination/training on the test
              | set. Science means using every available means to try to
              | disprove a claim, though. Extraordinary claims, etc.
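              | 
              | To make that concrete, a toy version of the kind of check
              | such evals run might look like the sketch below. This is a
              | generic n-gram overlap test of my own, not any lab's
              | actual decontamination pipeline; real ones add
              | normalization and near-duplicate detection on top.
              | 
              |     # Toy contamination check: flag a benchmark item if any
              |     # long n-gram from it also appears in the training set.
              |     def ngrams(text, n=13):
              |         toks = text.lower().split()
              |         return {tuple(toks[i:i + n])
              |                 for i in range(len(toks) - n + 1)}
              | 
              |     def contaminated(test_items, train_docs, n=13):
              |         train_grams = set()
              |         for doc in train_docs:
              |             train_grams |= ngrams(doc, n)
              |         # Any shared n-gram is treated as evidence of leakage.
              |         return [item for item in test_items
              |                 if ngrams(item, n) & train_grams]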
        
         | JimDabell wrote:
         | > LLM labs are so focused on making those scores go up it's
         | becoming a bit of a perverse incentive.
         | 
         | This seems like an odd comment to post in response to this
         | article.
         | 
         | This is about showing that a new architecture can match the
         | results of more established architectures in a more efficient
         | way. The benchmarks are there to show this. Of course they
         | aren't going to say _"It's just as good - trust us!"_.
        
           | tasn wrote:
           | He's not advocating for "trust us", he's advocating for more
           | information than just the benchmarks.
           | 
            | Unfortunately, I'm not sure what a solution that can't be
            | gamed would even look like (which is what the GP is asking
            | for).
        
             | BrawnyBadger53 wrote:
              | The best thing would be blind preference tests for a wide
              | variety of problems across domains, but unfortunately even
              | these can be gamed if desired. The upside is that gaming
              | them requires being explicitly malicious, which I'd
              | imagine would result in whistleblowing at some point.
              | However, Claude's position on leaderboards outside of
              | webdev arena makes me skeptical.
        
         | doe88 wrote:
         | _Goodhart 's law_ -
         | https://en.wikipedia.org/wiki/Goodhart%27s_law
        
         | Arubis wrote:
         | Ironic and delicious, since this is also how the public
         | education system in the US is incentivized.
        
           | rbetts wrote:
           | A comparison of testing criticality across countries would be
           | interesting to read if someone knows a decent reference. My
            | sense (which I don't trust) is that test results matter at
            | least as much or more in other places than they do in the
            | US.
           | For example, are England's A-levels or China's gaokao tests
           | or Germany's Abitur tests more or less important than US
           | SATs/ACTs?
        
         | jdietrich wrote:
         | Benchmark scores are table stakes - necessary but not
         | sufficient to demonstrate the capabilities of a model. Casual
         | observers might just look at the numbers, but anyone spending
         | real money on inference will run their own tests on their own
         | problems. If your model doesn't perform as it should, you will
         | be found out very quickly.
        
       | RandyOrion wrote:
       | First, this is not an open source / weight release.
       | 
        | Second, it has the problem of non-stopping responses.
        
         | inciampati wrote:
          | What's the best technique to train the model to stop
          | responding? A bit of fine-tuning on texts with EOS markers?
        
           | RandyOrion wrote:
            | I haven't seen many papers on solving this problem.
           | 
            | I see non-stop responses as a generalization problem,
            | because no training sample is of infinite length.
           | 
           | Targeted supervised fine-tuning should work, as long as you
           | have enough samples. However, supervised fine-tuning is not
           | good for generalization.
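            | 
            | As a rough illustration of that targeted fine-tuning, here
            | is a minimal sketch assuming a Hugging Face causal LM (the
            | GPT-2 checkpoint and the toy example are placeholders, not
            | anything Tencent uses): append the tokenizer's EOS token to
            | every target and mask the prompt out of the loss, so "stop
            | here" becomes part of the supervised signal.
            | 
            |     # Minimal SFT sketch: teach a causal LM to emit EOS.
            |     from transformers import AutoModelForCausalLM, AutoTokenizer
            | 
            |     tok = AutoTokenizer.from_pretrained("gpt2")
            |     model = AutoModelForCausalLM.from_pretrained("gpt2")
            | 
            |     prompt, answer = "Q: What is 6*1?\nA:", " 6"
            |     # Append eos_token so the stop signal is trained explicitly.
            |     enc = tok(prompt + answer + tok.eos_token,
            |               return_tensors="pt")
            | 
            |     labels = enc["input_ids"].clone()
            |     # Mask prompt tokens; only the answer and EOS carry loss.
            |     # (Assumes concatenation keeps the prompt's tokenization.)
            |     plen = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
            |     labels[:, :plen] = -100
            | 
            |     loss = model(**enc, labels=labels).loss
            |     loss.backward()  # one step; real SFT loops over a dataset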
        
       | wedn3sday wrote:
       | The only metric I really care about, and the one that I think
       | shows the fundamental failure of LLMs as a technology, is this
       | one here [1]. The fact that o1 fails a non-zero amount of the
       | time on the question, "what is 6*1?" means that the models just
       | do not "understand" _anything_ and are still just fancy
       | stochastic parrots. Now, stochastic parrots are still useful!
        | Just not the digital god a lot of people seem to think we're
        | heading towards.
       | 
       | [1]
       | https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....
        
         | ranman wrote:
         | Humanity fails that question an embarrassingly large number of
         | times.
        
         | loufe wrote:
         | I don't think this will or necessarily should ever be fixed.
         | The eventual solution (I imagine) will be to simply plug in a
         | calculator. All the MCP talk on HN pushed me to try MCP out,
          | and I'm sold. A Swiss army knife of tools, with a calculator
          | among them, would let a brain do what a brain is best at, and
          | a calculator do what a calculator is best at.
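          | 
          | For what it's worth, a minimal sketch of such a calculator
          | tool, assuming the FastMCP helper that ships with the official
          | MCP Python SDK (treat the import path and tool names as my
          | assumptions; a real server would want a proper expression
          | parser rather than a couple of operations):
          | 
          |     # Calculator exposed as MCP tools (sketch; `pip install mcp`).
          |     from mcp.server.fastmcp import FastMCP
          | 
          |     mcp = FastMCP("calculator")
          | 
          |     @mcp.tool()
          |     def multiply(a: float, b: float) -> float:
          |         """Multiply two numbers exactly."""
          |         return a * b
          | 
          |     @mcp.tool()
          |     def add(a: float, b: float) -> float:
          |         """Add two numbers exactly."""
          |         return a + b
          | 
          |     if __name__ == "__main__":
          |         # The model calls these instead of doing the
          |         # arithmetic itself.
          |         mcp.run()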
        
       ___________________________________________________________________
       (page generated 2025-03-23 23:01 UTC)