[HN Gopher] Bytes are all you need: Transformers operating direc...
___________________________________________________________________
Bytes are all you need: Transformers operating directly on file
bytes
Author : pmoriarty
Score : 169 points
Date : 2023-06-03 14:02 UTC (8 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| bluefishinit wrote:
| Interesting thought to have a model training co-processor that
| reads all of the data inputs and outputs from the actual
| processor. There's a ton of sequence information flowing through
| there even on a single machine. Then you'd basically have a model
| that was a "virtual machine" mirror of your actual cpu and the
| data it's interacted with. I'm not sure what would emerge from
| that, but it's super interesting.
| vczf wrote:
| Could be used to "compress"/predict common computations, a la
| Nvidia's AI video compression. Distributed computing that's
| distributed across time as well as space.
| bigyikes wrote:
| Wouldn't be surprised if this was already done to some degree
| for branch prediction.
| yyyk wrote:
| Already the case for more than a decade.
|
| https://chasethedevil.github.io/post/the_neural_network_in_y.
| ..
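| If I recall correctly, that link covers perceptron-style branch
| predictors (shipped in some AMD CPUs). A rough Python sketch of
| the core idea - history length and training threshold are chosen
| arbitrarily here, and the per-branch weight table real hardware
| uses is omitted:
|
|     # Tiny perceptron predictor: one weight per bit of recent
|     # branch history (+1 = taken, -1 = not taken), plus a bias.
|     HISTORY_LEN = 16
|     THRESHOLD = 32   # keep training while |y| is small or wrong
|
|     weights = [0] * (HISTORY_LEN + 1)
|     history = [1] * HISTORY_LEN
|
|     def predict():
|         y = weights[0] + sum(w * h for w, h in zip(weights[1:], history))
|         return y, y >= 0   # True means "predict taken"
|
|     def update(taken):
|         y, predicted_taken = predict()
|         outcome = 1 if taken else -1
|         if predicted_taken != taken or abs(y) <= THRESHOLD:
|             weights[0] += outcome
|             for i, h in enumerate(history):
|                 weights[i + 1] += outcome * h
|         history.pop(0)
|         history.append(outcome)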
| tehsauce wrote:
| Seems in similar spirit to the "perceiver" architecture from
| deepmind a couple years ago:
|
| https://arxiv.org/abs/2107.14795
| brucethemoose2 wrote:
| As they note, most media compression is going to throw a
| monkeywrench into the whole thing.
|
| But I was kinda hoping they would test GPU texture compression.
| AFAIK it's a much simpler compression scheme.
| Taek wrote:
| Why not enforce that decompression is a necessary part of the
| data cleaning? There's no reason to operate on complex formats
| like mp4 directly.
| wizzwizz4 wrote:
| > Additionally, we demonstrate that ByteFormer has applications
| in privacy-preserving inference. ByteFormer is capable of
| performing inference on particular obfuscated input
| representations with no loss of accuracy. We also demonstrate
| ByteFormer's ability to perform inference with a hypothetical
| privacy-preserving camera which avoids forming full images by
| consistently masking 90% of pixel channels, while still
| achieving 71.35% accuracy on ImageNet.
|
| I'm not certain they know what "privacy-preserving" means. All
| the claims they've made around privacy look, to the lay-person
| (me), to be meaningless:
|
| * Permuting the input values doesn't change _anything_, because a
| fixed substitution of byte values preserves the structure of the
| data, just like ECB mode (linked below; sketched at the end of
| this comment). If anything, this suggests that transformers might
| be able to approximate the original image given an image permuted
| in an unknown way - but so can humans, so that's nothing new.
| https://en.wikipedia.org/wiki/Block_cipher_modes_of_operatio...
|
| * Their "partially-masked image" just looks like a noisy image,
| not a redacted one. Basic information theory suggests it's not
| really privacy-preserving at all.
|
| Is it normal for AI papers to be so hypey? Like, this part is
| _literally_ security-by-obscurity.
|
| > As our method can handle highly nonlinear JPEG encodings, we
| expect it to perform well on a variety of alternative encodings
| that an outside observer might not be able to easily guess.
|
| I don't see how any of section 4.2 contributes to the paper,
| other than letting them make a bold claim about a buzzword in a
| disproportionate amount of the abstract and conclusion.
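| To sketch the substitution point above (my own illustration, not
| from the paper): a fixed permutation of byte values is a bijection
| applied elementwise, so identical input bytes still map to
| identical output bytes and all of the spatial structure survives -
| exactly the ECB-penguin problem.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     perm = rng.permutation(256).astype(np.uint8)  # fixed byte substitution
|
|     def obfuscate(data: bytes) -> bytes:
|         # Elementwise substitution: equal inputs -> equal outputs,
|         # so edges, textures, etc. remain visible to any observer.
|         return bytes(perm[np.frombuffer(data, dtype=np.uint8)])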
| l33t233372 wrote:
| They say their model works for a hypothetical privacy-
| preserving camera that masks 90% of the pixels.
|
| I'm not sure there's much more to it; I think you're reading
| too much into it. It shows the power of their model and
| possible applications toward privacy.
|
| It's not a claim that that hypothetical camera is actually
| great for privacy.
| wizzwizz4 wrote:
| But a camera that masks 90% of the pixels _isn't_ privacy-
| preserving. It's just a 90s consumer-grade webcam. They
| haven't shown that the approach works with _actual_ privacy
| measures, which makes their claims in this area dubious.
| marcellus23 wrote:
| > It's just a 90s consumer-grade webcam.
|
| The positions of the masked pixels are not stored in the
| resulting data -- it's not like they are just making some
| pixels black. The channel information is actually removed
| from the buffer entirely, and then inference is performed
| on that buffer:
|
| > The camera stores the remaining unmasked pixel channels
| in an array without retaining the coordinates of pixel
| channels on the image sensor. In this scenario, an
| adversary could not obtain a faithful reconstruction of
| the input image. Even if the adversary could guess pixel
| channel locations, the low resolution of captured data
| prevents the adversary from recovering a high-fidelity
| image.
|
| Also:
|
| > Their "partially-masked image" just looks like a noisy
| image, not a redacted one
|
| The caption for the figure (assuming you're talking about
| figure 4) makes it clear that the figure is illustrative,
| since it includes the positions of the masked pixels. The
| pixel positions would not be present in the actual data
| that comes from this hypothetical camera. So what the
| figure "looks like" is irrelevant -- its purely
| illustrative.
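| To make that concrete, here's a rough sketch (not the authors'
| code) of what such a camera would emit; treating the mask as
| fixed per device is my reading of the word "consistently":
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     # One fixed mask per camera, keeping ~10% of the H*W*3 channels.
|     H, W = 224, 224
|     keep = rng.random(H * W * 3) < 0.10
|
|     def capture(frame):
|         """frame: H x W x 3 uint8 array -> flat array of kept values.
|         The coordinates of the kept channels are never stored."""
|         return frame.reshape(-1)[keep]
|
|     frame = rng.integers(0, 256, size=(H, W, 3), dtype=np.uint8)
|     print(capture(frame).shape)  # ~15k byte values, fed to the model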
| ResearchCode wrote:
| ByteFormer. All You Need. The applied papers that claim dubious
| "SOTA" status on some benchmark do this. The foundational work
| that is actually worth reading doesn't.
| furyofantares wrote:
| I'm interested in that claim as well, though I'm maybe not so
| overtly hostile to the research.
|
| My first thought was certainly that, at the very least, an
| adversary can perform image classification as well. I think
| that's an obvious limit on how privacy-preserving this can
| be, so maybe it's just taken as given that the reader
| understands it. And - big if - if the adversary can't do much
| more than that, it would still be valuable.
| [deleted]
| Aerbil313 wrote:
| This seems like a downgrade. Intuition suggests it should lead to
| better performance if you first parse/process the input to
| represent the actual input space better.
| ftxbro wrote:
| Yes, that's the intuition, but there's an idea called 'the
| bitter lesson' by Richard Sutton: every piece of 'feature
| engineering' eventually gets swamped and overtaken by the raw
| power of stacking more layers with more parameters, exaflops,
| and larger datasets.
| refulgentis wrote:
| You'd think so, but RGB color is ~meaningless and all the
| image models work great anyway. I long for a model that is
| trained in a perceptual space, and yet I doubt it'll matter.
| wizzwizz4 wrote:
| It's not faithfully representative of the post-processing
| performed by the average person's visual system, but that
| doesn't make it meaningless by _any_ stretch of the
| imagination.
| refulgentis wrote:
| Intuitively, it's meaningless. What's brighter, 255 G or
| 255 R?
| wizzwizz4 wrote:
| Depends on the picture.
| https://en.wikipedia.org/wiki/Checker_shadow_illusion
| rgovostes wrote:
| I wondered something like this -- would it be better to train a
| CNN in, say, YUV color space? But if you consider that an NN
| approximates any function, then if using YUV performed better,
| the network would learn to convert RGB to YUV itself. (It's a
| simple linear relationship that would take a single layer; a
| sketch follows at the end of this comment.)
|
| I then found a paper that confirmed that converting input to
| different color spaces did not cause significant differences in
| performance.
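| A minimal sketch of that "single layer" point, using the BT.601
| YUV coefficients (the exact constants vary by standard; a linear
| layer could learn this matrix):
|
|     import numpy as np
|
|     # Each YUV component is a fixed linear combination of R, G, B.
|     RGB_TO_YUV = np.array([
|         [ 0.299,    0.587,    0.114  ],  # Y
|         [-0.14713, -0.28886,  0.436  ],  # U
|         [ 0.615,   -0.51499, -0.10001],  # V
|     ])
|
|     def rgb_to_yuv(pixels):
|         """pixels: (..., 3) RGB array -> same shape in YUV."""
|         return pixels @ RGB_TO_YUV.T
|
|     print(rgb_to_yuv(np.array([255.0, 0.0, 0.0])))  # pure red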
| jiggawatts wrote:
| Similarly I wondered if _partially_ decompressing video and
| using that format as both the input and output might work.
| The logic is that a fully decompressed video is huge, and
| that extra data is by definition wasteful: it's exactly
| what's thrown away by compression! We've designed compression
| to efficiently match the human visual system and not waste
| bytes on irrelevant things.
|
| So I wonder if a NN trained on something like a quantised DCT
| as both input and output might be dramatically more
| efficient, roughly in line with the compression ratio of
| applying the same transforms to a raw video.
|
| Obviously we'd have to avoid bit-level streaming algorithms
| like Huffman coding.
|
| However even reusing image tiles might work via methods such
| as differentiable hash tables, as seen in NVIDIA's reverse
| rendering neural nets!
|
| Food for thought...
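| For what it's worth, here's a rough sketch of the kind of
| representation I mean - a JPEG/MPEG-style 8x8 block DCT with
| coarse quantization (block size and quantization step q=16 are
| arbitrary, and real codecs use per-frequency quantization tables):
|
|     import numpy as np
|     from scipy.fft import dctn, idctn
|
|     def to_quantized_dct(channel, q=16):
|         """channel: H x W array (H, W multiples of 8) -> ints."""
|         h, w = channel.shape
|         out = np.empty((h, w), dtype=np.int32)
|         for i in range(0, h, 8):
|             for j in range(0, w, 8):
|                 block = channel[i:i+8, j:j+8].astype(float)
|                 # Quantized coefficients: mostly zeros, the part a
|                 # codec actually keeps.
|                 out[i:i+8, j:j+8] = np.round(
|                     dctn(block, norm="ortho") / q)
|         return out
|
|     def from_quantized_dct(coeffs, q=16):
|         """Dequantize and apply the inverse DCT per block."""
|         h, w = coeffs.shape
|         out = np.empty((h, w))
|         for i in range(0, h, 8):
|             for j in range(0, w, 8):
|                 out[i:i+8, j:j+8] = idctn(
|                     coeffs[i:i+8, j:j+8] * q, norm="ortho")
|         return out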
| bentt wrote:
| I tried to get GPT-4 to act as a compiler and it didn't go so
| well, but it felt like that was mainly because it didn't believe
| it could. After much consternation, it was willing to put
| together a hello world in x86 assembly.
| vidarh wrote:
| A lot of problems are avoided with strategic "as a [suitable
| role], [task]" or "you are a ...". Not sure what you tried, but
| variations of that have been enough to bypass all kinds of
| weird objections for me.
| [deleted]
| optimalsolver wrote:
| The little language model that could.
| [deleted]
| RecycledEle wrote:
| I wonder what would happen if someone connected a Transformer
| between the inputs and output of an embedded system.
|
| Could we get a robotic arm to catch a falling ball?
|
| I suspect the lack of awareness of time would mess it up.
|
| What if we had the Transformer take inputs from the state of the
| world (as determined by other software) and output commands? I
| wonder what it could do.
| ZeroCool2u wrote:
| I think PaLM-E[1] is pretty close to what you're describing.
|
| [1]: https://arstechnica.com/information-
| technology/2023/03/embod...
| xp84 wrote:
| > In a video example, a researcher grabs the chips from the
| robot and moves them, but the robot locates the chips and
| grabs them again.
|
| Well, that's it boys, I think we've successfully created a
| Terminator. To paraphrase Kyle Reese, "It is a chip-grabbing
| machine, and it absolutely will not stop!! Until it has
| acquired all the chips."
| [deleted]
| quickthrower2 wrote:
| Had a quick skim.
|
| I was really struggling to see the practical advantage of this,
| because you can easily convert different formats to whatever the
| model needs.
|
| It feels like some strawmen are set up. In the world of the
| paper, people have to painstakingly convert the image to the
| correct format for the model, "hand-crafting a model stem for
| each modality".
|
| But once you have set up a model and got it working for, say,
| .tiff, then to make it work on any other format you can just use
| ImageMagick or something? Unless you want the metadata too?
|
| I think the use case for a model that works on bytes is as a kind
| of "easy to install" package on local devices, like a security
| camera, regular camera, etc., that will work with whatever the
| local file format is.
|
| Also, a dollar into the "is/are all you need" jar please.
| thelastparadise wrote:
| > Also, a dollar into the "is/are all you need" jar please.
|
| "Is/Are All You Need Considered Harmful"
| AndrewKemendo wrote:
| I definitely appreciate the technical privacy focus, and I'm
| curious if that is the whole purpose here.
|
| I don't see many benefits to this method other than privacy -
| the ability to process multiple data types with a single
| architecture is really cool, but it's not SOTA-beating or really
| comparable to large multimodal systems like ImageBind [1].
|
| [1] https://arxiv.org/pdf/2305.05665.pdf
| doctoboggan wrote:
| Both TIFF and WAV are lossless, so this doesn't really surprise
| me.
|
| It is, however, nice to see Apple researchers publishing; you
| don't see that often at the forefront of transformer or
| generative model research. I hope it means that everyone here on
| HN is right and they are seriously looking into running
| generative models locally on their own silicon.
| brucethemoose2 wrote:
| I mean, they published their own Metal Stable Diffusion
| implementation on GitHub _very_ quickly last year. I don't think
| this was in doubt.
| refulgentis wrote:
| That's not AI research
| brucethemoose2 wrote:
| It's desperately needed though.
|
| If AMD or Intel started going around GitHub and writing in
| acceleration for random popular ML repos (and not just one-off
| abandoned demos), that would change so much.
| refulgentis wrote:
| These models run on CUDA. Making a version that ran on
| Metal filled a feature gap on Apple's proprietary GPUs.
| CPUs aren't involved; the last thing Stable Diffusion, for
| example, needs is help from AMD and Intel.
| zeusk wrote:
| CPUs are actually involved, especially with SIMD like AVX
| and VNNI instructions. It all depends on the scale of
| your compute.
|
| https://www.intel.com/content/www/us/en/developer/tools/o
| pen...
| refulgentis wrote:
| Linking to an Intel SDK with "deep learning" in it does
| prove it's possible to do matmuls on CPUs.
|
| But this is a nitpicked-to-death thread: it's good to see
| Apple releasing research, the Metal stuff was to make it
| run on their GPU, and there simply aren't any relevant
| uses of CPUs.
|
| Academically, yes, CPUs have SIMD so they can implement
| the same algorithm that drives GPU perf, but without some
| massive breakthrough the one that can do more matmuls
| wins, and Nvidia wins massively, in theory and practice.
| zeusk wrote:
| I'm not just linking it as proof; I now work at Intel
| and collaborate with Microsoft on bringing new platform
| technologies to life. I'm currently involved in VNNI and
| hybrid core scheduling KPIs. OpenVINO allows developers
| to target execution engines on the client based on their
| needs, and the CPU is a well-supported engine for
| lightweight inferencing that is latency-sensitive.
|
| Matmuls (FMAC especially) did evolve as a SIMD workload
| on CPUs but it's still forcing vector execution on cores
| designed around scalar workloads - you just have to be
| mindful of tradeoffs between executing a bunch of
| AVX/VNNI instructions for inferencing vs setting up a
| GPU/VPU/IPU context and performing that inference
| asynchronously.
|
| AFAIK, ARM is also interested in use of NEON/SIMD co-
| processors as low latency inference engines.
|
| Microsoft and Intel have a lot of interesting stuff in
| the works for AI compute (as I'm sure so do Apple, Nvidia
| and to a lesser extent AMD).
| sebzim4500 wrote:
| You know that Intel makes GPUs, right? In fact, they are
| by far the biggest producer of GPUs by volume.
|
| Sure, they aren't being used in datacenters, but if you
| are making an application that needs to run locally, then
| you absolutely need good support for Intel GPUs.
| cinntaile wrote:
| You're talking about integrated GPUs I presume?
| sebzim4500 wrote:
| Mainly yes, although they do also make dedicated GPUs.
| brucethemoose2 wrote:
| AMD and Intel both ship a lot of GPUs.
|
| In fact there _are_ good AMD/Intel implementations, just
| not enough dev effort to add them to popular SD projects.
| sroussey wrote:
| And then... nothing after?
| haswell wrote:
| I suspect they prefer to use their standard marketing
| avenues to better control the narrative. I'm sure we'll be
| hearing a lot more next week, and in Apple fashion it'll
| all be grounded in customer facing use cases and associated
| developer frameworks.
| brucethemoose2 wrote:
| Oh, and also, lossless does not necessarily mean readable.
| Hence they note zlib-compressed PNGs are untenable.
| samwillis wrote:
| Drip by drip gearing up for "one more thing" on Monday...
| schappim wrote:
| It has been well documented that Apple has permitted its AI
| teams to publish papers in a manner that appears to contradict
| Apple's culture of secrecy. This is explicitly allowed because
| the old restrictions were a hindrance to recruiting talent.
| nighthawk454 wrote:
| They publish research all the time:
|
| https://machinelearning.apple.com/research
|
| Including research about fitting models locally onto Apple
| Silicon, for example:
| https://machinelearning.apple.com/research/neural-engine-tra...
___________________________________________________________________
(page generated 2023-06-03 23:00 UTC)