[HN Gopher] Bytes are all you need: Transformers operating direc...
       ___________________________________________________________________
        
       Bytes are all you need: Transformers operating directly on file
       bytes
        
       Author : pmoriarty
       Score  : 169 points
       Date   : 2023-06-03 14:02 UTC (8 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | bluefishinit wrote:
       | Interesting thought to have a model training co-processor that
       | reads all of the data inputs and outputs from the actual
       | processor. There's a ton of sequence information flowing through
       | there even on a single machine. Then you'd basically have a model
        | that was a "virtual machine" mirror of your actual CPU and the
       | data it's interacted with. I'm not sure what would emerge from
       | that, but it's super interesting.
        
         | vczf wrote:
         | Could be used to "compress"/predict common computations, a la
         | Nvidia's AI video compression. Distributed computing that's
         | distributed across time as well as space.
        
         | bigyikes wrote:
          | Wouldn't be surprised if this was already done to some degree
          | for branch prediction.
        
           | yyyk wrote:
            | Already the case for more than a decade.
           | 
           | https://chasethedevil.github.io/post/the_neural_network_in_y.
           | ..
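            | 
            | For the curious, a minimal sketch of the idea behind those
            | predictors - a single perceptron over recent branch history.
            | The history length and training threshold here are
            | illustrative, not any shipping CPU's parameters:
            | 
            |   # Toy perceptron branch predictor (one branch, one
            |   # perceptron); parameters are made up.
            |   HIST_LEN = 16
            |   history = [1] * HIST_LEN        # +1 taken, -1 not taken
            |   weights = [0] * (HIST_LEN + 1)  # weights[0] is the bias
            | 
            |   def predict():
            |       y = weights[0]
            |       for w, h in zip(weights[1:], history):
            |           y += w * h
            |       return y, (y >= 0)          # predict taken if y >= 0
            | 
            |   def update(taken, threshold=32):
            |       y, pred = predict()
            |       t = 1 if taken else -1
            |       if pred != taken or abs(y) <= threshold:
            |           weights[0] += t
            |           for i in range(HIST_LEN):
            |               weights[i + 1] += t * history[i]
            |       history.pop(0)
            |       history.append(t)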
        
       | tehsauce wrote:
       | Seems in similar spirit to the "perceiver" architecture from
       | deepmind a couple years ago:
       | 
       | https://arxiv.org/abs/2107.14795
        
       | brucethemoose2 wrote:
       | As they note, most media compression is going to throw a
       | monkeywrench into the whole thing.
       | 
        | But I was kinda hoping they would test GPU texture compression.
        | AFAIK it's a much simpler compression scheme.
        
         | Taek wrote:
         | Why not enforce that decompression is a necessary part of the
         | data cleaning? There's no reason to operate on complex formats
         | like mp4 directly.
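          | 
          | A minimal sketch of what that cleaning step could look like,
          | using Pillow (the file name and the byte-level "tokens" are
          | just placeholders, not how the paper feeds its model):
          | 
          |   # Decode the compressed file first, then hand raw bytes to
          |   # a byte-level model. Purely illustrative.
          |   import numpy as np
          |   from PIL import Image
          | 
          |   img = Image.open("photo.jpg").convert("RGB")  # decode JPEG
          |   raw = np.asarray(img, dtype=np.uint8).tobytes()
          |   tokens = list(raw)   # one token per byte, values 0..255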
        
       | wizzwizz4 wrote:
       | > Additionally, we demonstrate that ByteFormer has applications
       | in privacy-preserving inference. ByteFormer is capable of
       | performing inference on particular obfuscated input
       | representations with no loss of accuracy. We also demonstrate
       | ByteFormer's ability to perform inference with a hypothetical
       | privacy-preserving camera which avoids forming full images by
        | consistently masking 90% of pixel channels, while still
        | achieving 71.35% accuracy on ImageNet.
       | 
       | I'm not certain they know what "privacy-preserving" means. All
       | the claims they've made around privacy look, to the lay-person
       | (me), to be meaningless:
       | 
        | * Permuting the input values doesn't change _anything_, because
        | the structure of the data survives the permutation (the classic
        | ECB-penguin problem). If anything, this suggests that
        | transformers might be able to approximate the original image
        | given an unknown permutation of it - but so can humans, so
        | that's nothing new.
        | https://en.wikipedia.org/wiki/Block_cipher_modes_of_operatio...
       | 
       | * Their "partially-masked image" just looks like a noisy image,
       | not a redacted one. Basic information theory suggests it's not
       | really privacy-preserving at all.
       | 
       | Is it normal for AI papers to be so hypey? Like, this part is
       | _literally_ security-by-obscurity.
       | 
       | > As our method can handle highly nonlinear JPEG encodings, we
       | expect it to perform well on a variety of alternative encodings
       | that an outside observer might not be able to easily guess.
       | 
       | I don't see how any of section 4.2 contributes to the paper,
       | other than letting them make a bold claim about a buzzword in a
       | disproportionate amount of the abstract and conclusion.
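        | 
        | To make the first point concrete: a fixed permutation of byte
        | values is just a substitution cipher. A toy sketch (not the
        | paper's obfuscation code) of why that preserves structure and is
        | trivially invertible by anyone holding the table:
        | 
        |   # A fixed bijection over byte values: frequencies and spatial
        |   # structure survive, and the inverse table undoes it exactly.
        |   import random
        | 
        |   rng = random.Random(0)
        |   perm = list(range(256))
        |   rng.shuffle(perm)                 # the "secret" lookup table
        |   inv = [0] * 256
        |   for i, p in enumerate(perm):
        |       inv[p] = i
        | 
        |   data = b"some pixel data " * 4
        |   obfuscated = bytes(perm[b] for b in data)
        |   recovered = bytes(inv[b] for b in obfuscated)
        |   assert recovered == data          # trivially reversible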
        
         | l33t233372 wrote:
          | They say their model works for a hypothetical privacy-
          | preserving camera that masks 90% of the pixels.
         | 
         | I'm not sure there's much more to it; I think you're reading
         | too far into it. It shows the power of their model and possible
         | applications towards privacy.
         | 
         | It's not a stance that that hypothetical camera is actually
         | great for privacy.
        
           | wizzwizz4 wrote:
            | But a camera that masks 90% of the pixels _isn't_ privacy-
            | preserving. It's just a 90s consumer-grade webcam. They
           | haven't shown that the approach works with _actual_ privacy
           | measures, which makes their claims in this area dubious.
        
             | marcellus23 wrote:
             | > It's just a 90s consumer-grade webcam.
             | 
             | The positions of the masked pixels are not stored in the
             | resulting data -- it's not like they are just making some
             | pixels black. The channel information is actually removed
             | from the buffer entirely, and then inference is performed
             | on that buffer:
             | 
              | > The camera stores the remaining unmasked pixel channels
              | in an array without retaining the coordinates of pixel
              | channels on the image sensor. In this scenario, an
              | adversary could not obtain a faithful reconstruction of
              | the input image. Even if the adversary could guess pixel
              | channel locations, the low resolution of captured data
              | prevents the adversary from recovering a high-fidelity
              | image.
             | 
             | Also:
             | 
             | > Their "partially-masked image" just looks like a noisy
             | image, not a redacted one
             | 
             | The caption for the figure (assuming you're talking about
             | figure 4) makes it clear that the figure is illustrative,
             | since it includes the positions of the masked pixels. The
             | pixel positions would not be present in the actual data
             | that comes from this hypothetical camera. So what the
              | figure "looks like" is irrelevant -- it's purely
              | illustrative.
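              | 
              | For what it's worth, my reading of the setup amounts to
              | roughly the following simulation (not the authors' code;
              | the image, keep rate and ordering are placeholders):
              | 
              |   # Keep ~10% of pixel channels, drop their coordinates,
              |   # and emit only the surviving values in scan order.
              |   import numpy as np
              | 
              |   rng = np.random.default_rng(0)
              |   image = rng.integers(0, 256, size=(224, 224, 3),
              |                        dtype=np.uint8)
              | 
              |   flat = image.reshape(-1)        # all pixel channels
              |   keep = rng.random(flat.shape[0]) < 0.10
              |   captured = flat[keep]           # values only
              |   # 'captured' is what leaves the sensor; the mask itself
              |   # is never stored, so the coordinates are gone.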
        
         | ResearchCode wrote:
         | ByteFormer. All You Need. The applied papers that claim dubious
         | "sota" status on some benchmark do this. The foundational work
         | that is actually worth reading doesn't.
        
         | furyofantares wrote:
          | I'm interested in that claim as well, though I'm maybe not so
          | overtly hostile to the research.
         | 
          | My first thought was certainly that, at the very least, an
          | adversary can perform image classification as well. I think
          | that's an obvious limit on how much privacy preservation is
          | possible, so maybe it's just taken as given that the reader
          | should understand it. And - big if - if the adversary can't do
          | much more than that, it would still be appreciated.
        
           | [deleted]
        
       | Aerbil313 wrote:
       | This seems like a downgrade. Intuition suggests it should lead to
       | better performance if you first parse/process the input to
       | represent the actual input space better.
        
         | ftxbro wrote:
          | Yes, that's the intuition, but there's an idea called 'the
          | bitter lesson' by Richard Sutton: every bit of 'feature
          | engineering' eventually gets swamped and overtaken by the raw
          | power of stacking more layers with more parameters, more
          | exaflops, and bigger datasets.
        
         | refulgentis wrote:
          | You'd think this, but RGB color is ~meaningless and all the
          | image stuff works great anyway. I long for a model trained in
          | a perceptual space, and yet I doubt it'll matter.
        
           | wizzwizz4 wrote:
           | It's not faithfully representative of the post-processing
           | performed by the average person's visual system, but that
           | doesn't make it meaningless by _any_ stretch of the
           | imagination.
        
             | refulgentis wrote:
             | Intuitively, it's meaningless. What's brighter, 255 G or
             | 255 R?
        
               | wizzwizz4 wrote:
               | Depends on the picture.
               | https://en.wikipedia.org/wiki/Checker_shadow_illusion
        
         | rgovostes wrote:
          | I wondered something like this -- would it be better to train
          | a CNN in, say, YUV color space? But if you consider that a NN
          | approximates any function, then if using YUV performed better,
          | the network would learn to convert RGB to YUV itself. (It's a
          | simple linear relationship that a single layer can represent,
          | as sketched below.)
          | 
          | I then found a paper confirming that converting the input to
          | different color spaces did not cause significant differences
          | in performance.
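          | 
          | For reference, the RGB -> YCbCr conversion really is just one
          | matrix multiply plus an offset, i.e. exactly what a single
          | linear (1x1 conv) layer can represent. A sketch with
          | approximate full-range BT.601 coefficients:
          | 
          |   # RGB -> YCbCr as an affine map; one linear layer suffices.
          |   import numpy as np
          | 
          |   M = np.array([[ 0.299,     0.587,     0.114   ],
          |                 [-0.168736, -0.331264,  0.5     ],
          |                 [ 0.5,      -0.418688, -0.081312]])
          |   offset = np.array([0.0, 128.0, 128.0])
          | 
          |   def rgb_to_ycbcr(pixels):    # pixels: (..., 3) RGB values
          |       return pixels @ M.T + offset
          | 
          |   print(rgb_to_ycbcr(np.array([255.0, 0.0, 0.0])))  # red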
        
           | jiggawatts wrote:
           | Similarly I wondered if _partially_ decompressing video and
           | using that format as both the input and output might work.
           | The logic is that a fully decompressed video is huge, and
           | that extra data is by definition wasteful: it's exactly
           | what's thrown away by compression! We've designed compression
           | to efficiently match the human visual system and not waste
           | bytes on irrelevant things.
           | 
           | So I wonder if a NN trained on something like a quantised DCT
           | as both input and output might be dramatically more
           | efficient, roughly in line with the compression ratio of
           | applying the same transforms to a raw video.
           | 
           | Obviously we'd have to avoid bit-level streaming algorithms
           | like Huffman coding.
           | 
           | However even reusing image tiles might work via methods such
           | as differentiable hash tables, as seen in NVIDIA's reverse
           | rendering neural nets!
           | 
           | Food for thought...
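            | 
            | Roughly the kind of representation I mean, for one 8x8
            | block (a sketch with scipy; the uniform quantisation step
            | is a placeholder, not a real JPEG table):
            | 
            |   # Block -> 2D DCT -> coarse quantisation: the partially
            |   # decoded form a net could consume (and emit) directly.
            |   import numpy as np
            |   from scipy.fft import dctn, idctn
            | 
            |   rng = np.random.default_rng(0)
            |   block = rng.integers(0, 256, (8, 8)).astype(float) - 128.0
            | 
            |   coeffs = dctn(block, norm="ortho")
            |   q = 16.0
            |   quantised = np.round(coeffs / q)   # model input/output
            |   approx = idctn(quantised * q, norm="ortho") + 128.0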
        
       | bentt wrote:
        | I tried to get GPT-4 to act as a compiler and it didn't go so
        | well, but it felt like that was mainly because it didn't believe
        | it could. After much consternation, it was willing to put
        | together a hello world in x86 assembly.
        
         | vidarh wrote:
         | A lot of problems are avoided with strategic "as a [suitable
         | role], [task]" or "you are a ...". Not sure what you tried, but
         | variations of that have been enough to bypass all kinds of
         | weird objections for me.
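          | 
          | Something along these lines usually does it (a made-up
          | example, not the exact wording I used):
          | 
          |   "You are an expert C compiler targeting x86-64. Compile the
          |    following program to NASM-syntax assembly and output only
          |    the assembly, no commentary: <program here>"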
        
           | [deleted]
        
         | optimalsolver wrote:
         | The little language model that could.
        
       | [deleted]
        
       | RecycledEle wrote:
       | I wonder what would happen if someone connected a Transformer
       | between the inputs and output of an embedded system.
       | 
       | Could we get a robotic arm to catch a falling ball?
       | 
       | I suspect the lack of awareness of time would mess it up.
       | 
       | What if we had the Transformer take inputs from the state of the
       | world (as determined by other software) and output commands? I
       | wonder what it could do.
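        | 
        | I'm picturing something like the following loop (everything
        | here - the sensor reading, the tokenisation, the model itself -
        | is a hypothetical placeholder):
        | 
        |   # Hypothetical control loop: serialize world state into
        |   # tokens, let a sequence model pick the next command.
        |   def read_state():
        |       return {"ball_x": 0.42, "ball_y": 1.3, "arm_angle": 0.1}
        | 
        |   def tokenize(state):
        |       # crude: quantise each value to a byte
        |       return [int(max(0.0, min(1.0, v)) * 255)
        |               for v in state.values()]
        | 
        |   def model(tokens):            # stand-in for a transformer
        |       return "MOVE_LEFT" if tokens[0] < 128 else "MOVE_RIGHT"
        | 
        |   while True:
        |       command = model(tokenize(read_state()))
        |       # send 'command' to the arm controller here
        |       break                     # single step for the sketch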
        
         | ZeroCool2u wrote:
         | I think PALM-E[1] is pretty close to what you're describing.
         | 
         | [1]: https://arstechnica.com/information-
         | technology/2023/03/embod...
        
           | xp84 wrote:
           | > In a video example, a researcher grabs the chips from the
           | robot and moves them, but the robot locates the chips and
           | grabs them again.
           | 
           | Well, that's it boys, I think we've successfully created a
           | Terminator. To paraphrase Kyle Reese, "It is a chip-grabbing
           | machine, and it absolutely will not stop!! Until it has
           | acquired all the chips."
        
             | [deleted]
        
       | quickthrower2 wrote:
       | Had a quick skim.
       | 
        | I was really struggling to see the practical advantage of this,
        | because you can easily convert different formats into whatever
        | the model needs.
       | 
        | It feels like some strawmen are set up. In the world of the
        | paper, people have to painstakingly convert the image to the
        | correct format for the model and end up "hand-crafting a model
        | stem for each modality".
       | 
        | But once you have set up a model and got it working for, say,
        | .tiff, then to make it work on any other format you can just use
        | ImageMagick or something? Unless you want the metadata too?
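        | 
        | The conversion step is basically a one-liner anyway; e.g. with
        | Pillow standing in for ImageMagick (file names are
        | placeholders):
        | 
        |   # Convert whatever the device produces into the format the
        |   # model stem expects.
        |   from PIL import Image
        |   Image.open("input.png").save("input.tiff")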
       | 
        | I think the use case for a model that works on bytes is as a
        | kind of "easy to install" package on local devices - a security
        | camera, a regular camera, etc. - that will work with whatever
        | the local file format is.
       | 
       | Also, a dollar into the "is/are all you need" jar please.
        
         | thelastparadise wrote:
         | > Also, a dollar into the "is/are all you need" jar please.
         | 
         | "Is/Are All You Need Considered Harmful"
        
       | AndrewKemendo wrote:
        | I definitely appreciate the technical privacy focus here, and
        | I'm curious whether that is the whole purpose of the work.
       | 
        | I don't see that many benefits to this method other than privacy
        | - the ability to process multiple data types with a single
        | architecture is really cool, but it's not SOTA-beating or really
        | comparable to large multimodal systems like ImageBind [1]
       | 
       | [1] https://arxiv.org/pdf/2305.05665.pdf
        
       | doctoboggan wrote:
        | Both TIFF and WAV are lossless, so this doesn't really surprise
        | me.
       | 
        | It is, however, nice to see Apple researchers publishing; you
        | don't see that often at the forefront of transformer or
        | generative model research. I hope it means that everyone here on
        | HN is right and they are seriously looking into running
        | generative models locally on their own silicon.
        
         | brucethemoose2 wrote:
          | I mean, they published their own Metal Stable Diffusion
          | implementation on GitHub _very_ quickly last year. I don't
          | think this was in doubt.
        
           | refulgentis wrote:
           | That's not AI research
        
             | brucethemoose2 wrote:
              | It's desperately needed, though.
              | 
              | If AMD or Intel started going around GitHub and writing in
              | acceleration for random popular ML repos (and not just
              | one-off abandoned demos), that would change so much.
        
               | refulgentis wrote:
                | These models run on CUDA. Making a version that ran on
                | Metal filled a feature gap on Apple's proprietary GPUs.
                | CPUs aren't involved; the last thing Stable Diffusion,
                | for example, needs is help from AMD and Intel.
        
               | zeusk wrote:
               | CPUs are actually involved, especially with SIMD like AVX
               | and VNNI instructions. It all depends on the scale of
               | your compute.
               | 
               | https://www.intel.com/content/www/us/en/developer/tools/o
               | pen...
        
               | refulgentis wrote:
               | Linking to an Intel SDK with "deep learning" in it does
               | prove it's possible to do matmuls on CPUs.
               | 
                | But this is a nitpicked-to-death thread: it's good to
                | see Apple releasing research, the Metal stuff was to
                | make it run on their GPU, and there simply aren't any
                | relevant uses of CPUs.
               | 
               | Academically, yes, CPUs have SIMD so they can implement
               | the same algorithm that drives GPU perf, but without some
               | massive breakthrough the one that can do more matmuls
               | wins, and nVidia wins massively, in theory and practice.
        
               | zeusk wrote:
                | I'm not just linking it as proof; I now work at Intel
                | and collaborate with Microsoft on bringing new platform
                | technologies to life. I'm currently involved in VNNI and
                | hybrid core scheduling KPIs. OpenVINO lets developers
                | target execution engines on the client based on their
                | needs, and the CPU is a well-supported engine for
                | lightweight, latency-sensitive inference.
               | 
               | Matmuls (FMAC especially) did evolve as a SIMD workload
               | on CPUs but it's still forcing vector execution on cores
               | designed around scalar workloads - you just have to be
               | mindful of tradeoffs between executing a bunch of
               | AVX/VNNI instructions for inferencing vs setting up a
               | GPU/VPU/IPU context and performing that inference
               | asynchronously.
               | 
                | AFAIK, ARM is also interested in using NEON/SIMD co-
                | processors as low-latency inference engines.
               | 
                | Microsoft and Intel have a lot of interesting stuff in
                | the works for AI compute (as I'm sure do Apple, Nvidia
                | and, to a lesser extent, AMD).
        
               | sebzim4500 wrote:
                | You know that Intel makes GPUs, right? In fact, they are
                | by far the biggest producer of GPUs by volume.
                | 
                | Sure, they aren't being used in datacenters, but if you
                | are making an application that needs to run locally then
                | you absolutely need good support for Intel GPUs.
        
               | cinntaile wrote:
               | You're talking about integrated GPUs I presume?
        
               | sebzim4500 wrote:
                | Mainly yes, although they do also make dedicated GPUs.
        
               | brucethemoose2 wrote:
               | AMD and Intel both ship a lot of GPUs.
               | 
                | In fact there _are_ good AMD/Intel implementations, just
                | not enough dev effort to add them to popular SD projects.
        
           | sroussey wrote:
           | And then... nothing after?
        
             | haswell wrote:
             | I suspect they prefer to use their standard marketing
             | avenues to better control the narrative. I'm sure we'll be
             | hearing a lot more next week, and in Apple fashion it'll
              | all be grounded in customer-facing use cases and associated
             | developer frameworks.
        
         | brucethemoose2 wrote:
          | Oh, and also, lossless does not necessarily mean readable.
          | Hence they note that zlib-compressed PNGs are untenable.
        
         | samwillis wrote:
         | Drip by drip gearing up for "one more thing" on Monday...
        
         | schappim wrote:
          | It has been well documented that Apple permits its AI teams to
          | publish papers, in a manner that appears to contradict Apple's
          | culture of secrecy. This is explicitly allowed because the
          | previous secrecy was a hindrance to recruiting talent.
        
         | nighthawk454 wrote:
         | They publish research all the time:
         | 
         | https://machinelearning.apple.com/research
         | 
         | Including research about fitting models locally onto Apple
         | Silicon, for example:
         | https://machinelearning.apple.com/research/neural-engine-tra...
        
       ___________________________________________________________________
       (page generated 2023-06-03 23:00 UTC)