[HN Gopher] Non-determinism in GPT-4 is caused by Sparse MoE
       ___________________________________________________________________
        
       Non-determinism in GPT-4 is caused by Sparse MoE
        
       Author : 152334H
       Score  : 57 points
       Date   : 2023-08-04 21:37 UTC (1 hour ago)
        
 (HTM) web link (152334h.github.io)
 (TXT) w3m dump (152334h.github.io)
        
       | dudus wrote:
       | Off topic
       | 
       | > 3 months later, reading a paper while on board a boring flight
       | home, I have my answer.
       | 
       | I've noticed that people on Hacker News routinely read
       | scientific papers. This is a habit I envy but don't share.
       | 
       | Any tips or sites for someone interested in picking up more
       | scientific papers to read?
        
         | dylan604 wrote:
         | I want to know what a non-boring flight would be like
        
       | [deleted]
        
       | refulgentis wrote:
       | This is _excellent_ work. I've been adamantly against MoE for a
       | set of reasons, and this is the first compelling evidence I've
       | seen that hasn't come from Substack or been a bare repeating of
       | rumor.
       | 
       | I had absolutely no idea GPT-4 was nondeterministic, and I use
       | it about 2 hours a day. I can see why a cursory look wasn't
       | cutting it: the outputs "feel" the same in your memory, with a
       | lot of similar vocab usage, but they're formatted entirely
       | differently and have sort of a synonym-phrase thing going on
       | where some of the key words are the same.
        
         | derwiki wrote:
         | GPT-4 web chat for two hours a day? I buy that. But use the
         | API repeatedly with the same inputs, e.g. while developing a
         | program, and the non-determinism is hard to miss.
        
           | sebzim4500 wrote:
           | I would imagine that most people use nonzero temperature, so
           | they won't need to look for any explanation for non-
           | determinism.
        
             | dekhn wrote:
             | Literally the first thing I did when I had llama.cpp
             | working was set the temperature to 0 and repeat queries.
             | 
             | (but that's mainly because I'm a weird old scientist with
             | lots of experience with nondeterminism in software).
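             | 
             | A minimal sketch of that check, assuming the llama-cpp-
             | python bindings and a placeholder model path:
             | 
             |     from llama_cpp import Llama
             | 
             |     # Placeholder local model; any compatible checkpoint works.
             |     llm = Llama(model_path="ggml-model-q4_0.bin", seed=0)
             | 
             |     prompt = "Explain MoE routing in one sentence."
             |     outputs = set()
             |     for _ in range(5):
             |         # temperature=0 means greedy decoding, so repeated
             |         # runs should produce byte-identical text locally.
             |         out = llm(prompt, max_tokens=64, temperature=0.0)
             |         outputs.add(out["choices"][0]["text"])
             | 
             |     # One distinct completion => deterministic on this setup.
             |     print(len(outputs), "distinct completions")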
        
         | 152334H wrote:
         | Thanks. I'm really no expert (:P) on MoE research; I just
         | noticed what was written in the Soft MoE paper and felt a need
         | to check.
         | 
         | The non-deterministic outputs are really similar, yeah, if you
         | check the gist examples I linked:
         | https://gist.github.com/152334H/047827ad3740627f4d37826c867a...
         | This part is at least no surprise, since the randomness should
         | be bounded.
         | 
         | I suspect OpenAI will figure out some way to reduce the
         | randomness at some point, though, given their public commitment
         | to eventually adding logprobs back to ChatCompletions.
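         | 
         | For anyone who wants to reproduce it: the check is just the
         | same request repeated at temperature 0 and a count of the
         | distinct completions. A minimal sketch, assuming the openai
         | Python client of that era (prompt is a placeholder):
         | 
         |     import openai
         | 
         |     msgs = [{"role": "user", "content": "Write 20 words."}]
         |     seen = set()
         |     for _ in range(5):
         |         # temperature=0 should be greedy, so any variation
         |         # across runs is server-side non-determinism.
         |         resp = openai.ChatCompletion.create(
         |             model="gpt-4", messages=msgs, temperature=0)
         |         seen.add(resp["choices"][0]["message"]["content"])
         | 
         |     print(len(seen), "distinct completions out of 5")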
        
           | cubefox wrote:
           | I don't think this commitment had any plausibility. Token
           | "probabilities" only have a straightforward probabilistic
           | interpretation for base models. In fine-tuned models, they
           | no longer represent the probability of the next token given
           | the prompt, but rather how well the next token fulfills
           | the ... tendencies induced by SL and RL tuning. Which is
           | presumably pretty useless information. OpenAI has no
           | intention of providing access to the GPT-4 base model, and
           | they in fact removed API access to the GPT-3.5 base model.
        
         | FanaHOVA wrote:
         | > I've been adamantly against MoE for a set of reasons
         | 
         | Such as?
        
       | osmarks wrote:
       | I feel like this introduces the potential for weird and hard-to-
       | implement side channel attacks, if the sequences in a batch can
       | affect the routing of others.
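       | 
       | A toy illustration of why batchmates can matter, assuming
       | Switch-style top-1 routing with a per-expert capacity limit,
       | which is one common sparse-MoE setup (all numbers made up):
       | 
       |     import numpy as np
       | 
       |     def route(tokens, n_experts=4, capacity=2, seed=0):
       |         # Score each token against each expert, take the argmax.
       |         rng = np.random.default_rng(seed)
       |         w = rng.standard_normal((tokens.shape[-1], n_experts))
       |         choice = (tokens @ w).argmax(-1)
       |         # Enforce per-expert capacity: overflow tokens are
       |         # dropped, so a token's fate depends on the whole
       |         # batch, not just on its own sequence.
       |         kept, load = [], np.zeros(n_experts, dtype=int)
       |         for i, e in enumerate(choice):
       |             if load[e] < capacity:
       |                 kept.append(i)
       |                 load[e] += 1
       |         return kept
       | 
       |     rng = np.random.default_rng(1)
       |     my_seq = rng.standard_normal((3, 8))   # "my" tokens
       |     other_a = rng.standard_normal((5, 8))  # batchmates, set A
       |     other_b = rng.standard_normal((5, 8))  # batchmates, set B
       | 
       |     for other in (other_a, other_b):
       |         kept = route(np.concatenate([my_seq, other]))
       |         # Whether tokens 0..2 (mine) survive can change per batch.
       |         print([i in kept for i in range(3)])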
        
         | tehsauce wrote:
         | I think you're right. I imagine it would be very hard to
         | exploit, though.
        
           | derwiki wrote:
           | Hard like building a virtual machine in an image decoder? If
           | there's a way there's a will.
        
       | pazimzadeh wrote:
       | Mixture of Experts
        
       | alpark3 wrote:
       | _If_ 3.5 is an MoE model, doesn't that give a lot of hope to
       | open source movements? Once a good open source MoE model comes
       | out, maybe even some variation of the decoder models already
       | available (I don't know whether MoE models have to be trained
       | from scratch), that would imply a lot more can be done with a
       | lot less.
        
         | 152334H wrote:
         | I agree, and really hope that Meta is doing something in that
         | vein. Reducing the FLOPs:Memory ratio (as in Soft MoE) could
         | also open the door to CPU (or at least Apple Silicon) inference
         | becoming more relevant.
        
       ___________________________________________________________________
       (page generated 2023-08-04 23:00 UTC)