[HN Gopher] Magika: AI powered fast and efficient file type iden...
___________________________________________________________________
Magika: AI powered fast and efficient file type identification
Author : alphabetting
Score : 584 points
Date : 2024-02-16 01:02 UTC (21 hours ago)
(HTM) web link (opensource.googleblog.com)
(TXT) w3m dump (opensource.googleblog.com)
| NiloCK wrote:
| A somewhat surprising and genuinely useful application of the
| family of techniques.
|
| I wonder how susceptible it is to adversarial binaries or, hah,
| prompt-injected binaries.
| dghlsakjg wrote:
| "These aren't the binaries you are looking for..."
| jamesdwilson wrote:
| For the extremely limited number of file types supported, I
| question the utility of this compared to `magic`
| star4040 wrote:
| It gets a lot of binary file formats wrong for me out-of-the-
| box. I think it needs to be a bit more accurate before we can
| truly assess its susceptibility to such exploits.
| queuebert wrote:
| But they reported >99% accuracy on their cherry-picked
| dataset! /s
| nicklecompte wrote:
| Elsewhere in the thread kevincox[1] points out that it's
| extremely susceptible to adversarial binaries:
|
| > Worse it seems that for unknown formats it confidently claims
| that it is one of the known formats. Rather than saying
| "unknown" or "binary data".
|
| Seems like this is genuinely useless for anybody but AI
| researchers.
|
| [1] https://news.ycombinator.com/item?id=39395677
| kushie wrote:
| this couldn't have been released at a better time for me! really
| needed a library like this.
| petesergeant wrote:
| Tell us why!
| ebursztein wrote:
| Thanks :)
| thorum wrote:
| Supported file types:
| https://github.com/google/magika/blob/main/docs/supported-co...
| s1mon wrote:
| It's surprising that there are so many file types that seem
| relatively common which are missing from this list. There are
| no raw image file formats. There's nothing for CAD - either
| source files or neutral files. There's no MIDI files, or any
| other music creation types. There's no APL, Pascal, COBOL,
| assembly source file formats etc.
| _3u10 wrote:
| No tracker / .mod files either, just use file.
| ebursztein wrote:
| Thanks for the list, we will probably try to extend the
| list of supported formats in future revisions.
| photoGrant wrote:
| Yeah this quickly went from 'additional helpful tool in the
| kit' to 'probably should use something else first'
| vintermann wrote:
| Well, what they used this for at Google was apparently
| scanning their users' files for things they shouldn't store
| in the cloud. Probably they don't care much about MIDI.
| kevincox wrote:
| Worse it seems that for unknown formats it confidently claims
| that it is one of the known formats. Rather than saying
| "unknown" or "binary data".
| vunderba wrote:
| As somebody who's dealt with the ambiguity of attempting to use
| file signatures in order to identify file type, this seems like a
| pretty useful library. Especially since it seems to be able to
| distinguish between different types of text files based on their
| format/content e.g. CSV, markdown, etc.
| semitones wrote:
| Is it really common enough for files not to be annotated with a
| useful/correct file type extension (e.g. .mp3, .txt) that a
| library like this is needed?
| hiddencost wrote:
| malware can intentionally obfuscate itself
| callalex wrote:
| Nothing is ever simple. Even for the most basic .txt files it's
| still useful to know what the character encoding is (utf? 8/16?
| Latin-whatever? etc.) and what the line format is
| (\n,\cr\lf,\n\lf) as well as determining if some maniac removed
| all the indentation characters and replaced them with a mystery
| number of spaces.
|
| Then there are all the container formats that have different
| kinds of formats embedded in them (mov,mkv,pdf etc.)
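The text-file ambiguity described above is easy to demonstrate. A minimal sketch (names are illustrative, not any library's API) that guesses encoding from a byte-order mark and tallies line-ending conventions:

```python
import codecs

def sniff_text(data: bytes) -> dict:
    """Guess encoding (from a BOM only) and the dominant line ending."""
    encoding = "unknown"  # no BOM => could be UTF-8, Latin-1, ...
    for bom, name in [(codecs.BOM_UTF8, "utf-8-sig"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")]:
        if data.startswith(bom):
            encoding = name
            break
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # bare \n
    cr = data.count(b"\r") - crlf  # bare \r
    # Pick the most frequent convention (defaults to \r\n on a tie/empty input).
    ending = max([("\\r\\n", crlf), ("\\n", lf), ("\\r", cr)],
                 key=lambda t: t[1])[0]
    return {"encoding": encoding, "line_ending": ending}
```

Real detectors (chardet, uchardet) go much further, using byte statistics when no BOM is present; this only covers the BOM-marked cases.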
| cole-k wrote:
| A fun read in service of your first point:
| https://en.wikipedia.org/wiki/Bush_hid_the_facts
| SnowflakeOnIce wrote:
| Yes!
|
| Sometimes a file has no extension. Other times the extension is
| a lie. Still other times, you may be dealing with an unnamed
| bytestring and wish to know what kind of content it is.
|
| This last case happens quite a lot in Nosey Parker [1], a
| detector of secrets in textual data. There, it is possible to
| come across unnamed files in Git history, and it would be
| useful to the user to still indicate what type of file it seems
| to be.
|
| I added file type detection based on libmagic to Nosey Parker a
| while back, but it's not compiled in by default because
| libmagic is slow and complicates the build process. Also,
| libmagic is implemented as a large C library whose primary job
| is parsing, which makes the security side of me jittery.
|
| I will likely add enabled-by-default filetype detection to
| Nosey Parker using Magika's ONNX model.
|
| [1] https://github.com/praetorian-inc/noseyparker
| m0shen wrote:
| At multiple points in my career I've been responsible for APIs
| that accept PDFs. Many non-tech-savvy people, seeing this, will
| just change the extension of the file they're uploading to
| `.pdf`.
|
| To make matters worse, there is some business software out
| there that will actually bastardize the PDF format and put
| garbage before the PDF file header. So for some things you end
| up writing custom validation and cleanup logic anyway.
| userbinator wrote:
| _Today web browsers, code editors, and countless other software
| rely on file-type detection to decide how to properly render a
| file._
|
| "web browsers"? Odd to see this coming from Google itself.
| https://en.wikipedia.org/wiki/Content_sniffing was widely
| criticised for being problematic for security.
| rafram wrote:
| Content sniffing can be disabled by the server (X-Content-Type-
| Options: nosniff), but it's still used by default. Web browsers
| have to assume that servers are stupid, and that for relatively
| harmless cases, it's fine to e.g. render a PNG loaded by an
| <img> even if it's served as text/plain.
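For reference, the server-side opt-out mentioned above is a single response header. A minimal WSGI sketch (illustrative, not from the article):

```python
# A WSGI app that opts out of browser content sniffing for its responses.
# Purely illustrative; any web framework can set the same header.
def app(environ, start_response):
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        # Tells the browser to trust the declared type and never sniff.
        ("X-Content-Type-Options", "nosniff"),
    ])
    return [b"hello"]
```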
| stevepike wrote:
| Oh man, this brings me back! Almost 10 years ago I was working on
| a rails app trying to detect the file type of uploaded
| spreadsheets (xlsx files were being detected as application/zip,
| which is technically true but useless).
|
| I found "magic" that could detect these and submitted a patch at
| https://bugs.freedesktop.org/show_bug.cgi?id=78797. My patch got
| rejected for needing to look at the first 3KB of the file
| to figure out the type. They had a hard limit that they wouldn't
| look past the first 256 bytes. Now in 2024 we're doing this with
| deep learning! It'd be cool if google released some speed
| performance benchmarks here against the old-fashioned
| implementations. Obviously it'd be slower, but is it 1000x or
| 10^6x?
| renonce wrote:
| From the first paragraph:
|
| > enabling precise file identification within milliseconds,
| even when running on a CPU.
|
| Maybe your old-fashioned implementations were detecting in
| microseconds?
| stevepike wrote:
| Yeah I saw that, but that could cover a pretty wide range and
| it's not clear to me whether that relies on preloading a
| model.
| ryanjshaw wrote:
| > At inference time Magika uses Onnx as an inference engine
| to ensure files are identified in a matter of milliseconds,
| almost as fast as a non-AI tool even on CPU.
| ebursztein wrote:
| Co-author of Magika here (Elie). We didn't include the
| measurements in the blog post to avoid making it too long, but
| we did take them.
|
| Overall, file takes about 6ms for a single file and 2.26ms per
| file when scanning multiples. Magika is at 65ms for a single
| file and 5.3ms when scanning multiples.
|
| So in the worst case Magika is about 10x slower, due to the
| time it takes to load the model, and 2x slower on repeated
| detection. This is why we said it is not that much slower.
|
| We will have more performance measurements in the upcoming
| research paper. Hope that answers the question.
| jpk wrote:
| Do you have a sense of performance in terms of energy use? 2x
| slower is fine, but is that at the same wattage, or more?
| alephnan wrote:
| That sounds like a nit / premature optimization.
|
| Electricity is cheap. If this is sufficiently or actually
| important for your org, you should measure it yourself.
| There are too many variables and factors subject to your
| org's hardware.
| djxfade wrote:
| Totally disagree. Most end users are on laptops and
| mobile devices these days, not desktop towers. Thus power
| efficiency is important for battery life. Performance per
| watt would be an interesting comparison.
| true_religion wrote:
| What end users are working with arbitrary files whose type
| they don't know?
|
| This entire use case seems to be one suited for servers
| handling user media.
| michaelt wrote:
| Theoretically? Anyone running a virus scanner.
|
| Of course, it's arguably unlikely a virus scanner would
| opt for an ML-based approach, as they specifically need
| to be robust against adversarial inputs.
| scq wrote:
| You'd be surprised what an AV scanner would do.
|
| https://twitter.com/taviso/status/732365178872856577
| michaelmior wrote:
| > it's arguably unlikely a virus scanner would opt for an
| ML-based approach
|
| Several major players such as Norton, McAfee, and
| Symantec all at least claim to use AI/ML in their
| antivirus products.
| r0ze-at-hn wrote:
| Browsers often need to guess a file type
| wongarsu wrote:
| File managers that render preview images. Even detecting
| which software to open the file with when you click it.
|
| Of course on Windows the convention is to use the file
| extension, but on other platforms the convention is to
| look at the file contents
| michaelmior wrote:
| > on other platforms the convention is to look at the
| file contents
|
| MacOS (that is, Finder) also looks at the extension. That
| has also been the case with any file manager I've used on
| Linux distros that I can recall.
| jdiff wrote:
| You might be surprised. Rename your Photo.JPG as
| Photo.PNG and you'll still get a perfectly fine
| thumbnail. The extension is a hint, but it isn't
| definitive, especially when you start downloading from
| the web.
| underdeserver wrote:
| In general you're right, but I can't think of a single
| local use for identifying file types by a human on a
| laptop - at least, one with scale where this matters.
| It's all going to be SaaS services where people upload
| stuff.
| prmph wrote:
| We are building a data analysis tool with great UX, where
| users select data files, which are then parsed and
| uploaded to S3 directly, on their client machines. The
| server only takes over after this step.
|
| Since the data files can be large, this approach avoids
| transferring the file twice, first to the server, and then
| to S3 after parsing.
| DontSignAnytng wrote:
| This doesn't sound like a very common scenario.
| vertis wrote:
| I mean if you care about that you shouldn't be running
| anything that isn't highly optimized. Don't open webpages
| that might be CPU or GPU intensive. Don't run Electron
| apps, or really anything that isn't built in a compiled
| language.
|
| Certainly you should do an audit of all the Android and
| iOS apps as well, to make sure they've been made in an
| efficient manner.
|
| Block ads as well, they waste power.
|
| This file identification is SUCH a small aspect of
| everything that is burning power in your laptop or phone
| as to be laughable.
| _puk wrote:
| Whilst energy usage is indeed a small aspect this early
| on when using bespoke models, we do have to consider that
| this is a model for simply identifying a file type.
|
| What happens when we introduce more bespoke models for
| manipulating the data in that file?
|
| This feels like it could slowly boil to the point of
| programs using magnitudes higher power, at which point
| it'll be hard to claw it back.
| vertis wrote:
| That's a slippery slope argument, which is a common
| logical fallacy[0]. This model being inefficient compared
| to the best possible implementation does not mean that
| future additions will also be inefficient.
|
| It's the equivalent of saying that many people programming in
| Ruby causes all future programs to be less efficient. Which is
| not true. In fact, many people programming in Ruby has caused
| Ruby to become more efficient, because it gets optimised as it
| gets used more (the same goes for Python).
|
| It's not as energy efficient as C, but that hasn't caused
| things to get worse and worse and spiral out of control.
|
| Likewise, smart contracts are incredibly inefficient
| mechanisms of computation. The result is mostly that
| people don't use them for any meaningful amounts of
| computation; that all gets done "off chain".
|
| Generative AI is definitely less efficient, but it's
| likely to improve over time, and indeed things like
| quantization have allowed models that would normally
| require much more substantial hardware resources (and
| therefore more energy) to be run on smaller
| systems.
|
| [0]: https://en.wikipedia.org/wiki/Slippery_slope
| diffeomorphism wrote:
| That is a fallacy fallacy. Just because some slopes are
| not slippery that does not mean none of them are.
| thfuran wrote:
| >This feels like it could slowly boil to the point of
| programs using magnitudes higher power, at which point
| it'll be hard to claw it back.
|
| We're already there. Modern software is, by and large,
| profoundly inefficient.
| cornholio wrote:
| The hardware requirements of a massively parallel
| algorithm can't possibly be "a nit" in any universe
| inhabited by rational beings.
| chmod775 wrote:
| Is that single-threaded libmagic vs Magika using every core
| on the system? What are the numbers like if you run multiple
| libmagic instances in parallel for multiple files, or limit
| both libmagic and magika to a single core?
|
| Testing it on my own system, magika seems to use a lot more
| CPU-time:
|
|     file /usr/lib/*          0,34s user   0,54s system   43% cpu   2,010 total
|     ./file-parallel.sh       0,85s user   1,91s system  580% cpu   0,477 total
|     bin/magika /usr/lib/*   92,73s user   1,11s system  393% cpu  23,869 total
|
| Looks about 50x slower to me. There's 5k files in my lib
| folder. It's definitely still impressively fast given how the
| identification is done, but the difference is far from
| negligible.
| metafunctor wrote:
| I've ended up implementing a layer on top of "magic" which, if
| magic detects application/zip, reads the zip file manifest and
| checks for telltale file names to reliably detect Office files.
|
| The "magic" library does not seem to be equipped with the
| capabilities needed to be robust against the zip manifest being
| ordered in a different way than expected.
|
| But this deep learning approach... I don't know. It might be
| hard to shoehorn in to many applications where the traditional
| methods have negligible memory and compute costs and the
| accuracy is basically 100% for cases that matter (detecting
| particular file types of interest). But when looking at a large
| random collection of unknown blobs, yeah, I can see how this
| could be great.
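The zip-manifest layering described above can be sketched in a few lines. A hedged illustration, not metafunctor's actual code; the marker filenames come from the OOXML package layout and should be treated as a heuristic:

```python
import io
import zipfile

# OOXML member names that distinguish Office documents from plain zips.
OFFICE_MARKERS = {
    "word/document.xml": "wordprocessingml.document (docx)",
    "xl/workbook.xml": "spreadsheetml.sheet (xlsx)",
    "ppt/presentation.xml": "presentationml.presentation (pptx)",
}

def refine_zip(data: bytes) -> str:
    """Refine an application/zip verdict by peeking at member names."""
    try:
        names = set(zipfile.ZipFile(io.BytesIO(data)).namelist())
    except zipfile.BadZipFile:
        return "not-a-zip"
    for marker, label in OFFICE_MARKERS.items():
        # Set membership is order-independent, unlike offset-based magic rules,
        # so a reordered manifest can't break the check.
        if marker in names:
            return label
    return "application/zip"
```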
| comboy wrote:
| Many commenters seem to be using _magic_ instead of _file_ ,
| any reasons?
| e1g wrote:
| _magic_ is the core detection logic of _file_ that was
| extracted out to be available as a library. So these days
| _file_ is just a higher level wrapper around _magic_
| comboy wrote:
| thanks
| stevepike wrote:
| If you're curious, here's how I solved it for ruby back in
| the day. Still used magic bytes, but added an overlay on top
| of the freedesktop.org DB:
| https://github.com/mimemagicrb/mimemagic/pull/20
| brabel wrote:
| > They had a hard limit that they wouldn't see past the first
| 256 bytes.
|
| Then they could never detect zip files with certainty, given
| that to do that you need to read up to 65KB (+ 22) at the END
| of the file. The reason is that the zip archive format allows
| "gargabe" bytes both in the beginning of the file and in
| between local file headers.... and it's actually not uncommon
| to prepend a program that self-extracts the archive, for
| example. The only way to know if a file is a valid zip archive
| is to look for the End of Central Directory Entry, which is
| always at the end of the file AND allows for a comment of
| unknown length at the end (and as the comment length field
| takes 2 bytes, the comment can be up to 65K long).
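The EOCD scan described above can be sketched as follows (an illustration, not the `file` implementation): the record's signature is `PK\x05\x06`, its fixed part is 22 bytes, and the trailing comment can add up to 65535 more, so only the last 65557 bytes need searching.

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # End of Central Directory signature
MAX_TAIL = 22 + 0xFFFF     # fixed EOCD size + maximum comment length

def is_zip(data: bytes) -> bool:
    """True if a plausible EOCD record terminates the file."""
    tail = data[-MAX_TAIL:]
    pos = tail.rfind(EOCD_SIG)
    while pos != -1:
        if pos + 22 <= len(tail):
            # The 2-byte comment length is the record's last field (offset 20
            # from the signature); it must account for exactly the bytes that
            # remain after the fixed part.
            (comment_len,) = struct.unpack_from("<H", tail, pos + 20)
            if pos + 22 + comment_len == len(tail):
                return True
        pos = tail.rfind(EOCD_SIG, 0, pos)
    return False
```

Prepended garbage (e.g. a self-extractor stub) does not affect the result, since only the tail is examined.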
| jeffbee wrote:
| That's why the whole question is ill formed. A file does not
| have exactly one type. It may be a valid input in various
| contexts. A zip archive may also very well be something else.
| aidenn0 wrote:
| FWIW, file can now distinguish many types of zip containers,
| including Oxml files.
| rfl890 wrote:
| We have had file(1) for years
| samtheprogram wrote:
| This is beyond what file is capable of. It's also mentioned in
| the third paragraph.
|
| RTFA.
| wruza wrote:
| Some HN readers may not know about file(1) even. It's fine to
| mention that $subj enhances that, but the rtfa part seems
| pretty unnecessary.
| Vogtinator wrote:
| FWICT file is more capable, predictable and also faster while
| being more energy-efficient at the same time.
| Majestic121 wrote:
| That's not what the performance table in the article is
| implying, with Magika's precision and recall hovering around
| 99%, while magic is at 92% precision and 72% recall.
|
| One can doubt the representativeness of their dataset, but if
| what is in the article is correct, Magika is clearly way
| more capable and predictable.
| NoGravitas wrote:
| Yes, it's slower than file(1), uses more energy, recognizes
| fewer file types, and is less accurate.
| aitchnyu wrote:
| Nearly 20 years back, this group of Linux users used to brag
| that Linux would identify files even if you changed the
| extension, while Windoze needed to police you about changing
| extensions.
| lifthrasiir wrote:
| I'm extremely confused about the claim that other tools have a
| worse precision or recall for APK or JAR files which are very
| much regular. Like, they should be a valid ZIP file with `META-
| INF/MANIFEST.MF` present (at least), and APK would need
| `classes.dex` as well, but at this point there is no other format
| that can be confused with APK or JAR, I believe. I'd like to see
| which files were causing the unexpected drop in precision or
| recall.
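The structural check the parent describes can be sketched like this (illustrative rules, not Magika's or `file`'s logic; as noted elsewhere in the thread, real-world JARs may lack a manifest):

```python
import io
import zipfile

def classify_archive(data: bytes) -> str:
    """Classify a zip-family blob by its member names."""
    try:
        names = set(zipfile.ZipFile(io.BytesIO(data)).namelist())
    except zipfile.BadZipFile:
        return "unknown"
    # An APK carries compiled Dalvik code and a binary manifest.
    if "AndroidManifest.xml" in names and "classes.dex" in names:
        return "apk"
    # A conventional JAR has a META-INF manifest (though it's optional).
    if "META-INF/MANIFEST.MF" in names:
        return "jar"
    return "zip"
```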
| charcircuit wrote:
| apks are also zipaligned so it's not like random users are
| going to be making them either
| HtmlProgrammer wrote:
| Minecraft mods 14 years ago used to tell you to open the JAR
| and delete the META-INF when installing them so can't rely on
| that one...
| supriyo-biswas wrote:
| The `file` command checks only the first few bytes, and doesn't
| parse the structure of the file. APK files are indeed reported
| as Zip archives by the latest version of `file`.
| m0shen wrote:
| This is false in every sense for
| https://www.darwinsys.com/file/ (probably the most used file
| version). It depends on the magic for a specific file, but it
| can check any part of your file. Many Linux distros are years
| out of date, you might be using a very old version.
|
|     FILE_45: ./src/file -m magic/magic.mgc ../../OpenCalc.v2.3.1.apk
|     ../../OpenCalc.v2.3.1.apk: Android package (APK), with
|     zipflinger virtual entry, with APK Signing Block
| supriyo-biswas wrote:
| Interesting! I checked with file 5.44 from Ubuntu 23.10 and
| 5.45 on macOS using homebrew, and in both cases, I got "Zip
| archive data, at least v2.0 to extract" for the file
| here[1]. I don't have an Android phone to check and I'm
| also not familiar with Android tooling, so is this a
| corrupt APK?
|
| [1] https://download.apkpure.net/custom/com.apkpure.aegon-3
| 19781...
| m0shen wrote:
| That doesn't appear to be a valid link. Try building
| `file` from source and using the provided default magic
| database.
| supriyo-biswas wrote:
| I also tried this with the sources of file from the
| homepage you linked above, and I still get the same
| results.
|
| You could try this for yourself using the same APKPure
| file which I uploaded at the following alternative
| link[1]. Further, while this could be a corrupt APK, I
| can't see any signs of that from a cursory inspection as
| both the `classes.dex` file and the `META-INF` directory are
| present, and this is APKPure's own APK, not an APK
| contributed for a third-party app.
|
| [1] https://wormhole.app/Mebmy#CDv86juV9H4aRCL2DSJeDw
| Someone wrote:
| People do create JAR files without a META-INF/MANIFEST.MF
| entry.
|
| The tooling even supports it.
| https://docs.oracle.com/en/java/javase/21/docs/specs/man/jar...:
|
|     -M or --no-manifest
|         Doesn't create a manifest file for the entries
| Vt71fcAqt7 wrote:
| This feels like old school google. I like that it's just a static
| webpage that basically can't be shut down or sunsetted. It
| reminds me of when Google just made useful stuff and gave it away
| for free on a webpage, like translate and google books. Obviously
| less life changing than the above but still a great option to
| have when I need this.
| vrnvu wrote:
| At $job we have been using Apache Tika for years.
|
| It works, but we occasionally hit bugs and weird collisions when
| working with billions of files.
|
| Happy to see new contributions in the space.
| johnea wrote:
| The results of which you'll never be 100% sure are correct...
| wruza wrote:
| They missed such an opportunity to name it "fail". It's like
| "file" but with "ai" in it.
| tamrix wrote:
| What about faile?
| rfoo wrote:
| But file(1) is already like that - my data files without
| headers are reported randomly as disk images, compressed
| archives or even executables for never-heard-of machines.
| plesiv wrote:
| Other methods use heuristics to guess many filetypes and in the
| benchmark they show worse performance (in terms of precision).
| Assuming benchmarks are not biased, the fact that this approach
| uses AI heuristics instead of hard-coded heuristics shouldn't
| make it strictly worse.
| Imnimo wrote:
| I wonder how big of a deal it is that you'd have to retrain the
| model to support a new or changed file type? It doesn't seem like
| the repo contains training code, but I could be missing it...
| m0shen wrote:
| As someone that has worked in a space that has to deal with
| uploaded files for the last few years, and someone who maintains
| a WASM libmagic Node package ( https://github.com/moshen/wasmagic
| ) , I have to say I really love seeing new entries into the file
| type detection space.
|
| Though I have to say when looking at the Node module, I don't
| understand why they released it.
|
| Their docs say it's slow:
|
| https://github.com/google/magika/blob/120205323e260dad4e5877...
|
| It loads the model at runtime:
|
| https://github.com/google/magika/blob/120205323e260dad4e5877...
|
| They mark it as Experimental in the documentation, but it seems
| like it was just made for the web demo.
|
| Also as others have mentioned. The model appears to only detect
| 116 file types:
|
| https://github.com/google/magika/blob/120205323e260dad4e5877...
|
| Where libmagic detects... a lot. Over 1600 last time I checked:
|
| https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...
|
| I guess I'm confused by this release. Sure it detected most of my
| list of sample files, but in a sample set of 4 zip files, it
| misidentified one.
| lebean wrote:
| It's for researchers, probably.
| m0shen wrote:
| Yeah, there is this line:
|
|     By open-sourcing Magika, we aim to help other software
|     improve their file identification accuracy and offer
|     researchers a reliable method for identifying file types
|     at scale.
|
| Which implies a production-ready release for general usage,
| as well as usage by security researchers.
| m0shen wrote:
| Made a small test to try it out:
| https://gist.github.com/moshen/784ee4a38439f00b17855233617e9...
|     hyperfine ./magika.bash ./file.bash
|
|     Benchmark 1: ./magika.bash
|       Time (mean +- s):     706.2 ms +- 21.1 ms    [User: 10520.3 ms, System: 1604.6 ms]
|       Range (min ... max):  684.0 ms ... 738.9 ms    10 runs
|
|     Benchmark 2: ./file.bash
|       Time (mean +- s):     23.6 ms +- 1.1 ms    [User: 15.7 ms, System: 7.9 ms]
|       Range (min ... max):  22.4 ms ... 29.0 ms    111 runs
|
|     Summary
|       './file.bash' ran 29.88 +- 1.65 times faster than './magika.bash'
| barrkel wrote:
| Realistically, either you're identifying one file
| interactively and you don't care about latency differences in
| the 10s of ms, or you're identifying in bulk (batch command
| line or online in response to requests), in which case you
| should measure the marginal cost and exclude Python startup
| and model loading times.
| m0shen wrote:
| My little script is trying to identify in bulk, at least by
| passing 165 file paths to `magika`, and `file`.
|
| Though, I absolutely agree with you. I think realistically
| it's better to do this kind of thing in a library rather
| than shell out to it at all. I was just trying to get an
| idea on how it generally compares.
|
| Another note, I was trying to be generous to `magika` here
| because when it's single file identification, it's about
| 160-180ms on my machine vs <1ms for `file`. I realize
| that's going to be quite a bit of python startup in that
| number, which is why I didn't go with it when pushing that
| benchmark up earlier. I'll probably push an update to that
| gist to include the single file benchmark as well.
| chmod775 wrote:
| Going by those numbers it's taking almost a second to run,
| not 10s of ms. And going by those numbers, it's doing
| something massively parallel in that time. So basically all
| your cores will spike to 100% for almost a second during
| those one-shot identifications. It looks like GP has a
| 12-16 threads CPU, _and it is using those while still being
| 30 times slower than single-threaded libmagic_.
|
| That tool needs 100x more CPU time just to figure out some
| filetypes than vim needs to open a file from a cold start
| (which presumably includes using libmagic to check the
| type).
|
| If I had to wait a second just to open something during
| which that thing uses every resource available on my
| computer to the fullest, I'd probably break my keyboard.
| Try using that thing as a drop-in _file_ replacement, open
| some folder in your favorite file manager, and watch your
| computer slow to a crawl as your file manager tries to
| figure out what thumbnails to render.
|
| It's utterly unsuitable for "interactive" identifications.
| m0shen wrote:
| I've updated this script with some single-file cli numbers,
| which are (as expected) not good. Mostly just comparing
| python startup time for that.
|
|     make
|     sqlite3 < analyze.sql
|
|     file_avg              python_avg         python_x_times_slower_single_cli
|     --------------------  -----------------  --------------------------------
|     0.000874874856301821  0.179884610224334  205.611818568799
|
|     file_avg            python_avg     python_x_times_slower_bulk_cli
|     ------------------  -------------  ------------------------------
|     0.0231715865881818  0.69613745142  30.0427184289163
| ebursztein wrote:
| We did release the npm package because we indeed created a web
| demo and thought people might want to also use it. We know it
| is not as fast as the python version or a C++ version -- which
| is why we marked it as experimental.
|
| The release includes the python package and the cli, which are
| quite fast and are the main way we expected people to use it --
| sorry if that wasn't clear in the post.
|
| The goal of the release is to offer a tool that is far more
| accurate than other tools and works on the major file types, as
| we hope it will be useful to the community.
|
| Glad to hear it worked on your files
| m0shen wrote:
| Thank you for the release! I understand you're just getting
| it out the door. I just hope to see it delivered as a native
| library or something more reusable.
|
| I did try the python cli, but it seems to be about 30x slower
| than `file` for the random bag of files I checked.
|
| I'll probably take some time this weekend to make a couple of
| issues around misidentified files.
|
| I'll definitely be adding this to my toolset!
| invernizzi wrote:
| Hello! We wrote the Node library as a first functional version.
| Its API is already stable, but it's a bit slower than the
| Python library for two reasons: it loads the model at runtime,
| and it doesn't do batch lookups, meaning it calls the model for
| each file. Other than that, it's just as fast for single file
| lookups, which is the most common use case.
| m0shen wrote:
| Good to know! Thank you. I'll definitely be trying it out.
| Though, I might download and hardcode the model ;)
|
| I also appreciate the use of ONNX here, as I'm already
| thinking about using another version of the runtime.
|
| Do you think you'll open source your F1 benchmark?
| michaelt wrote:
| _> The model appears to only detect 116 file types [...] Where
| libmagic detects... a lot. Over 1600 last time I checked_
|
| As I'm sure you know, in a lot of applications, you're
| preparing things for a downstream process which supports far
| fewer than 1600 file types.
|
| For example, a printer driver might call on _file_ to check if
| an input is postscript or PDF, to choose the appropriate
| converter - and for any other format, just reject the input.
|
| Or someone training an ML model to generate Python code might
| have a load of files they've scraped from the web, but might
| want to discard anything that isn't Python.
| theon144 wrote:
| Okay, but your one file type is more likely to be included in
| the 1600 that libmagic supports rather than Magika's 116?
|
| For that matter, the file types I care about are
| unfortunately misdetected by Magika (which is also an
| important point - the `file` command at least gives up and
| says "data" when it doesn't know, whereas the Magika demo
| gives a confidently wrong answer).
|
| I don't want to criticize the release because it's not meant
| to be a production-ready piece of software, and I'm sure the
| current 116 types isn't a hard limit, but I do understand the
| parent comment's contention.
| tudorw wrote:
| Can we do the 1600 if known, if not, let the AI take a guess?
| m0shen wrote:
| Absolutely, and honestly in a non-interactive ingestion
| workflow you're probably doing multiple checks anyway. I've
| worked with systems that call multiple libraries and hand-
| coded validation for each incoming file.
|
| Maybe it's my general malaise, or disillusionment with the
| software industry, but when I wrote that I was really just
| expecting more.
| kazinator wrote:
| > _So far, libmagic and most other file-type-identification
| software have been relying on a handcrafted collection of
| heuristics and custom rules to detect each file format.
|
| This manual approach is both time consuming and error prone as it
| is hard for humans to create generalized rules by hand._
|
| Pure nonsense. The rules are accurate, based on the actual
| formats, and not "heuristics".
| cAtte_ wrote:
| the rules aren't based on the formats, but on a small portion
| of them (their magic numbers). this makes them inaccurate
| (think docx vs zip) and heuristic.
| cle wrote:
| Besides compound file types, not all formats are well-specified
| either. Example is CSV.
| summerlight wrote:
| 1. Not all file formats are well specified
| 2. Not all files precisely follow the specification
| 3. Not all file formats are mutually exclusive
|
| Those facts are clearly reflected in the table.
| Nullabillity wrote:
| It seems to detect my Android build.gradle.kts as Scala, which I
| suppose is a kind of hilarious confusion but not exactly useful.
| krick wrote:
| What are use-cases for this? I mean, obviously detecting the
| filetype is useful, but we kinda already have plenty of tools to
| do that, and I cannot imagine, why we need some "smart" way of
| doing this. If you are not a human, and you are not sure what it
| is (like an unknown file being uploaded to a server), you would
| be better off just rejecting it completely, right? After all,
| there's absolutely no way an "AI powered" tool can be more
| reliable than some dumb, err-on-safer-side heuristic, and you
| wouldn't want to trust _that_ thing to protect you from malicious
| payloads.
| nindalf wrote:
| > no way an "AI powered" tool can be more reliable
|
| The article provides accuracy benchmarks.
|
| > you would be better off just rejecting it completely
|
| They mention using it in gmail and Drive, neither of which have
| the luxury of rejecting files willy-nilly.
| fuzztester wrote:
| I have not tried it recently, but IIRC, Gmail does reject
| attachments which are zip files, for security reasons.
| wildrhythms wrote:
| Gmail nukes zips if they contain an executable or some
| other 'prohibited' file type. Most email providers block
| executable attachments.
| n2d4 wrote:
| Virus detection is mentioned in the article. Code editors need
| to find the programming language for syntax highlighting of
| code before you give it a name. Your desktop OS needs to know
| which program to open files with. Or, recovering files from a
| corrupted drive. Etc
|
| It's easy to distinguish, say, a PNG from a JPG file (or
| anything else that has well-defined magic bytes). But some
| files look virtually identical (eg. .jar files are really just
| .zip files). Also see polyglot files [1].
|
| If you allow an `unknown` label or human intervention, then
| yes, magic bytes might be enough, but sometimes you'd rather
| have a 99% chance to be right about 95% of files vs. a 100%
| chance to be right about 50% of files.
|
| [1] https://en.wikipedia.org/wiki/Polyglot_(computing)
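The magic-bytes approach described above can be sketched in a few lines. This is an illustrative sketch only; the signature table is a tiny subset, not an authoritative list:

```python
# Classic magic-byte sniffing: match the first few bytes against
# known signatures. Note how "PK\x03\x04" is ambiguous -- it is
# the prefix of plain zips as well as docx, jar, apk, etc.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"PK\x03\x04": "zip",  # also docx, jar, apk, ...
    b"%PDF-": "pdf",
}

def sniff(data: bytes) -> str:
    """Return a coarse type label based on leading bytes only."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"

print(sniff(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # png
print(sniff(b"PK\x03\x04rest"))  # zip -- but is it a jar? a docx?
```

The ambiguity of the zip prefix is exactly the `.jar`-vs-`.zip` problem mentioned above: leading bytes alone cannot distinguish container formats.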
| star4040 wrote:
| It seems to defeat the purpose of such a tool that this initial
| version doesn't handle polyglot files. I hope they're quick to
| work on that.
| VikingCoder wrote:
| What does it do with an Actually Portable Executable compiled by
| Cosmopolitan libc compiler?
| supriyo-biswas wrote:
| It's reported as a PE executable, `file` on the other hand
| reports it as a "DOS/MBR boot sector."
| lopkeny12ko wrote:
| I don't understand why this needs to exist. Isn't file type
| detection inherently deterministic by nature? A valid tar archive
| will always have the same first few magic bytes. An ELF binary
| has a universal ELF magic and header. If the magic is bad, then
| the file is corrupted and not a valid XYZ file. What's the value
| in throwing in "heuristics" and probabilistic inference into a
| process that is black and white by design.
| potatoman22 wrote:
| This also works for formats like Python, HTML, and JSON.
| LiamPowell wrote:
| file (https://www.darwinsys.com/file/) already detects all
| these formats.
| ebursztein wrote:
| Indeed, but as pointed out in the blog post -- file is
| significantly less accurate than Magika. There are also
| some file types that we support and file doesn't, as
| reported in the table.
| LiamPowell wrote:
| I can't immediately find the dataset used for
| benchmarking. Is file actually failing on common files or
| just particularly nasty examples? If it's the latter then
| how does it compare to Magika on files that an average
| person is likely to see?
| schleck8 wrote:
| > Is file actually failing on common files or just
| particularly nasty examples? If it's the latter then how
| does it compare to Magika on files that an average person
| is likely to see?
|
| That's not the point of file type guessing, is it? Google
| employs it as an additional security measure for user
| submitted content which absolutely makes sense given what
| malware devs do with file types.
| amelius wrote:
| Yes, but shouldn't the file type be part of the file, or
| (better) of the metadata of the file?
|
| Knowing is better than guessing.
| lopkeny12ko wrote:
| I still don't see how this is useful. The only time I want to
| answer the question "what type of file is this" is if it is
| an opaque blob of binary data. If it's a plain text file like
| Python, HTML, or JSON, I can figure that out by just catting
| the file.
| vintermann wrote:
| Consider, it's perfectly possible for a file to fit two or more
| file formats - polyglot files are a hobby for some people.
|
| And there are also a billion formats that are _not_ uniquely
| determined by magic bytes. You don 't have to go further than
| text files.
| KOLANICH wrote:
| This tool doesn't work this way.
| TacticalCoder wrote:
| > What's the value in throwing in "heuristics" and
| probabilistic inference into a process that is black and white
| by design.
|
| I use the _file_ command all the time. The value is when you
| get this:
|
|     $ file somefile.xyz
|     somefile.xyz: data
|
| AIUI from reading TFA, _magika_ can determine more filetypes
| than what the _file_ command can detect.
|
| It'd actually be very easy to determine if there's any value in
| _magika_ : run _file_ on every file on your filesystem and then
| for every file where the _file_ command returns "data", run
| _magika_ and see if _magika_ is right.
|
| If it's right, there's your value.
|
| P.S: it may also be easier to run on Windows than the file
| command? But then I can't do much to help people who are on
| Windows.
| Eiim wrote:
| From elsewhere in this thread, it appears that Magika detects
| far fewer file types than file (116 vs ~1600), which makes
| sense. For file, you just need to drop in a few rules to add
| a new, somewhat obscure type. An AI approach like Magika will
| need lots of training and test data for each new file type.
| Where Magika might have a leg up is with distinguishing
| different textual data files (i.e., source code), but I don't
| see that as a particularly big use case honestly.
| cle wrote:
| It's not always deterministic; sometimes it's fuzzy depending
| on the file type. An example of this is a one-line CSV file. I
| tested one case of that: libmagic detects it as a text file
| while magika correctly detects it as a CSV (and gives a
| confidence score, which is killer).
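The fuzziness of CSV detection shows up even in rule-based code. Python's stdlib `csv.Sniffer`, for example, guesses a dialect from character frequencies in a sample. A minimal illustration:

```python
import csv

# csv.Sniffer is a heuristic dialect guesser. On a real table it
# finds the comma reliably:
table = "name,age,city\nalice,30,paris\nbob,25,lyon\n"
print(csv.Sniffer().sniff(table).delimiter)  # ','

# On a single line it leans on a preferred-delimiter list:
print(csv.Sniffer().sniff("alice,30,paris").delimiter)  # ','

# But plain prose can still yield a confident (wrong) guess --
# e.g. a space "delimiter" -- rather than an explicit "not CSV":
print(csv.Sniffer().sniff("just a sentence of plain prose").delimiter)
```

This is the same trade-off the thread is debating: heuristics that must commit to an answer will sometimes commit to the wrong one.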
| alkonaut wrote:
| But even with determinism, it's not always right. It's not too
| rare to find a text file with a byte order mark indicating
| UTF16 (0xFE 0xFF) but then actually containing utf-8. But what
| "format" does it have then? Is it UTF-8 or UTF-16? Same with
| e.g. a jar file missing a manifest. That's just a zip, even
| though I'm sure some runtime might eat it.
|
| But when do you actually face the issue of having to guess the
| format of a file? Is it when reverse engineering? Last time
| I did something like this was in the 90's when trying to pick
| apart some texture from a directory of files called asset0001.k
| and it turns out it was a bitmap or whatever. Fun times.
| a-dub wrote:
| probably a lot of interesting work going on that looks like this
| for the virustotal db itself.
| account-5 wrote:
| Assuming that I've not misunderstood, how does this compare to
| things like TrID [0]? Apart from being open source.
|
| [0] https://mark0.net/soft-trid-e.html
| JacobThreeThree wrote:
| The bulk of the short article is a set of performance
| benchmarks comparing Magika to TrID and others.
| account-5 wrote:
| Argh, the risks of browsing the web without JavaScript and/or
| third party scripts enabled, you miss content, because
| rendering text and images on the modern web can't be done
| without them, apparently. (Sarcasm).
|
| You are of course correct. I can see the images showing the
| comparison. Apologies.
| earth2mars wrote:
| How do I pronounce this? Myajika or MaGika? Anyhow, it's super
| cool.
| thangalin wrote:
| My FOSS desktop text editor performs a subset of file type
| identification using the first 12 bytes, detecting the type quite
| quickly:
|
| * https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/src/main...
|
| There's a much larger list of file signatures at:
|
| * https://github.com/veniware/Space-Maker/blob/master/FileSign...
| Labo333 wrote:
| I wonder what the output will be on polyglot files like run-
| anywhere binaries produced by cosmopolitan [1]
|
| [1]: https://justine.lol/cosmopolitan/
| awaythrow999 wrote:
| Wonder how this would handle a polyglot[0][1] that is valid as a
| PDF document, a ZIP archive, and a Bash script that runs a
| Python webserver, which hosts Kaitai Struct's WebIDE, allowing
| you to view the file's own annotated bytes.
|
| [0]: https://www.alchemistowl.org/pocorgtfo/
|
| [1]: https://www.alchemistowl.org/pocorgtfo/pocorgtfo16.pdf
|
| Edit: just tested, and it only identifies the zip layer
| rvnx wrote:
| You can try it here: https://google.github.io/magika/
|
| It's relatively limited compared to `file` (~10% coverage);
| it's more like a specialized classifier for basic file
| formats, so such cases are really out of scope.
|
| I guess it's more for detecting common file formats with high
| recall.
|
| However, where is the actual source of the model? Let's say I
| want to add a new file format myself.
|
| Apparently only the source of the interpreter is here, not the
| source of the model or the training set, which is the most
| important thing.
| alexandreyc wrote:
| Yes, I totally agree; it's not what I would qualify as open
| source.
|
| Do you plan to release the training code along with the
| research paper? What about the dataset?
|
| In any case, it's very neat to have ML-based technique and
| lightweight model for such tasks!
| tempay wrote:
| Is there anything about the performance on unknown files?
|
| I've tried a few that aren't "basic" but are widely used
| enough to be well supported in libmagic and it thinks they're
| zip files. I know enough about the underlying formats to know
| they're not using zip as a container under-the-hood.
| kevincox wrote:
| Apparently the Super Mario Bros. 3 ROM is 100% a SWF file.
|
| Cool that you can use it online though. Might end up using it
| like that. Although it seems like it may focus on common
| formats.
| diimdeep wrote:
| > Magika: AI powered fast and efficient file type identification
|
| of 116 file types with proprietary puny model with no training
| code and no dataset.
|
| > We are releasing a paper later this year detailing how the
| Magika model was trained and its performance on large datasets.
|
| And? How do you advance the industry with this googleblog post
| and source code that is useless without the closed-source
| model? All I see here is a loud marketing name and loud
| promises, but barely anything actually useful. Hooly rooftop
| characters sideproject?
| secondary_op wrote:
| Why is this piece of code being sold as open source, when in
| reality it just calls into a proprietary ML blob that is tiny
| and useless, the actual source of the model is closed, and a
| properly useful large model doesn't exist?
| KOLANICH wrote:
| Not into proprietary, the blob is within an Apache-licensed
| repo. Though there was no code to train it, but the repo
| contains some info allowing to recreate the code training it.
| Basically a JSON-based configs containing graph architecture.
| Even if you didn't have them, the repo contains an ONNX model,
| from which one can devise the architecture.
| flohofwoe wrote:
| I wonder how it performs with detecting C vs C++ vs ObjC vs
| ObjC++ and for bonus points: the common C/C++ subset (which is an
| incompatible C fork), also extra bonus points for detecting
| language version compatibility (e.g. C89 vs C99 vs C11...).
|
| Separating C from C++ and ObjC is something the file type
| detection on Github has traditionally had problems with (though
| it has been getting dramatically better over time); I would
| expect an "AI-powered" solution trained on the entire internet
| to do better right from the start.
|
| The list here doesn't even mention any of those languages except
| C though:
|
| https://github.com/google/magika/blob/main/docs/supported-co...
| andrewstuart wrote:
| Very useful.
|
| I wrote an editor that needed file type detection but the results
| of traditional approaches were flaky.
| 20after4 wrote:
| I just want to say thank you for the release. There are quite a
| lot of complaints in the comments but I think this is a useful
| and worthwhile contribution and I appreciate the authors for
| going through the effort to get it approved for open source
| release. It would be great if the model training data was
| included (or at least documentation about how to reproduce it),
| but that doesn't preclude this being useful. Thanks!
| Someone wrote:
| If their "Exif Tool" is https://exiftool.org/ (what else could it
| be?), I don't understand why they included it in their tests.
| Also, how does ExifTool recognize Python and html files?
| Andugal wrote:
| I have a question: Is something like Magika enough to check if a
| file is malicious or not?
|
| Example: users can upload PNG files (and only PNG is accepted).
| If Magika detects that the file is a PNG, does this mean the
| file is clean?
| cjg wrote:
| > does this mean the file is clean?
|
| No.
| TacticalCoder wrote:
| If that PNG of yours is not just an example, note that you can
| easily detect whether the PNG has any extra data (which may or
| may not indicate an attempt at mischief) and reject the (rare)
| PNGs with extra data. I ran a script checking the thousands of
| PNGs on my system and found three with extra data, all three
| probably due to the "PNG acropalypse" bug (but mischief cannot
| be ruled out).
|
| P.S: btw I'm not implying that extra data which shouldn't be
| there is the only way to make a malicious PNG.
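The trailing-data check described above is easy to sketch with the stdlib alone: a well-formed PNG ends immediately after its 12-byte IEND chunk (4-byte length, 4-byte type, 4-byte CRC), so any bytes past that point are extra data. This is an illustrative sketch, not the commenter's actual script:

```python
import struct
import zlib

def png_trailing_bytes(data: bytes) -> bytes:
    """Return whatever follows the IEND chunk (b"" if nothing)."""
    if not data.startswith(b"\x89PNG\r\n\x1a\n"):
        raise ValueError("not a PNG")
    pos = 8  # skip the signature
    while pos + 8 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        pos += 12 + length  # length + type + payload + CRC
        if ctype == b"IEND":
            return data[pos:]
    raise ValueError("no IEND chunk found")

def chunk(ctype: bytes, payload: bytes) -> bytes:
    """Assemble a PNG chunk with its CRC."""
    crc = zlib.crc32(ctype + payload)
    return (struct.pack(">I", len(payload)) + ctype + payload
            + struct.pack(">I", crc))

# A minimal (not meaningfully decodable) PNG for demonstration:
png = (b"\x89PNG\r\n\x1a\n"
       + chunk(b"IHDR", b"\x00" * 13)
       + chunk(b"IEND", b""))
print(png_trailing_bytes(png))                     # b''
print(png_trailing_bytes(png + b"hidden payload")) # b'hidden payload'
```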
| nicklecompte wrote:
| This comment from kevincox[1] says the answer is a hard "no":
|
| > Worse it seems that for unknown formats it confidently claims
| that it is one of the known formats. Rather than saying
| "unknown" or "binary data".
|
| There are other comments in this thread that make me think
| Google contaminated their test data with training data and the
| 99% results should not be taken at face value. OTOH I am not
| particularly surprised that Magika would be better than the
| other tools at distinguishing _semi-unstructured plain text_
| e.g. Java source vs. C++ source or YAMLs versus INIs. But that
| 's a very different use case than many security applications.
| The comments here suggest Magika is especially susceptible to
| binary obfuscation.
|
| [1] https://news.ycombinator.com/item?id=39395677
| kevincox wrote:
| The only way to do this reliably is to render the PNG to pixels
| then render it back to a PNG with a trusted encoder. Of course
| now you are taking the risk of vulnerabilities in the "render
| to pixels" step. But the result will be clean.
|
| AKA parse, don't validate.
| TacticalCoder wrote:
| To me the obvious use case is to first use the _file_ command but
| then, when _file_ returns "DATA" (meaning it couldn't guess the
| file type), call _magika_.
|
| I guess I'll be writing a wrapper (only for when using my shell
| in interactive mode) around _file_ doing just that when I come
| back from vacation. I hate it when _file_ cannot do its thing.
|
| Put it this way: I use _file_ a lot and I know at times it cannot
| detect a filetype. But is _file_ often wrong when it does have a
| match? I don 't think so...
|
| So in most of the cases I'd have _file_ correctly give me the
| filetype, very quickly but then in those rare cases where _file_
| cannot find anything, I 'd then use the slower but apparently
| more capable _magika_.
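The fallback wrapper described above could be sketched like this. It assumes both the `file` and `magika` CLIs are on PATH, and that `file --brief` prints a bare "data" for unidentified files (the exact string may vary by version):

```python
import subprocess

def needs_fallback(file_output: str) -> bool:
    # `file --brief` prints only the description; a bare "data"
    # means it couldn't identify the type.
    return file_output.strip() == "data"

def identify(path: str) -> str:
    """Try `file` first; fall back to `magika` only on "data"."""
    out = subprocess.run(["file", "--brief", path],
                         capture_output=True, text=True).stdout
    if needs_fallback(out):
        # Slower, but (per TFA) may recognize what `file` missed.
        out = subprocess.run(["magika", path],
                             capture_output=True, text=True).stdout
    return out.strip()

print(needs_fallback("data\n"))      # True  -> would call magika
print(needs_fallback("ASCII text"))  # False -> file's answer stands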
| SnowflakeOnIce wrote:
| I have seen 'file' misclassify many things when running it at
| large scale (millions of files) from a hodgepodge of sources.
| Unrelated types getting called 'GPG Private Keys', for example.
|
| For textual data types, 'file' gets confused often, or doesn't
| give a precise type. GitHub's 'linguist' [1] tool does much
| better here, but is structured in such a way that it is
| difficult to call it on an arbitrary file or bytestring that
| doesn't reside in a git repo.
|
| I'd love to have a classification tool that can more granularly
| classify textual files! It may not be Magika _today_ since it
| only supports 116-something types. For this use case, an ML-
| based approach will be more successful than an approach based
| solely on handwritten heuristic rules. I'm excited to see where
| this goes.
| lakomen wrote:
| Why? Just check the damn headers. Why do you need a power hungry
| and complicated AI model to do it? Why?
| TomNomNom wrote:
| This looks cool. I ran this on some web crawl data I have
| locally, so: all files you'd find on regular websites; HTML, CSS,
| JavaScript, fonts etc.
|
| It identified some simple HTML files (html, head, title, body, p
| tags and not much else) as "MS Visual Basic source (VBA)", "ASP
| source (code)", and "Generic text document" where the `file`
| utility correctly identified all such examples as "HTML document
| text".
|
| Some woff and woff2 files it identified as "TrueType Font Data",
| others are "Unknown binary data (unknown)" with low confidence
| guesses ranging from FLAC audio to ISO 9660. Again, the `file`
| utility correctly identifies these files as "Web Open Font
| Format".
|
| I like the idea, but the current implementation can't be relied
| on IMO; especially not for automation.
|
| A minor pet peeve also: it doesn't seem to detect when its
| output is a pipe and strip the shell colour escapes, resulting
| in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the
| output into a vim buffer or similar.
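The usual fix for that pet peeve is an `isatty()` check before emitting colour codes. A hedged sketch of what such a CLI could do:

```python
import io
import re
import sys

# Colour codes like ^[[1;37m are "\x1b[1;37m" on the wire.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

def emit(line: str, stream=None) -> str:
    """Strip colour escapes unless writing to an interactive tty."""
    stream = sys.stdout if stream is None else stream
    if not stream.isatty():
        line = ANSI_RE.sub("", line)
    return line

coloured = "\x1b[1;37mfile.jpg: JPEG image data (image)\x1b[0;39m"
# A StringIO is not a tty, so the escapes get stripped:
print(emit(coloured, io.StringIO()))  # file.jpg: JPEG image data (image)
```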
| michaelmior wrote:
| > the current implementation can't be relied on IMO
|
| What's your reasoning for not relying on this? (It seems to me
| that this would be application-dependent at the very least.)
| jdiff wrote:
| I'm not the person you asked, but I'm not sure I understand
| your question and I'd like to. It whiffed multiple common
| softballs, to the point it brings into question the claims
| made about its performance. What reasoning is there to trust
| it?
| michaelmior wrote:
| > It whiffed multiple common softballs
|
| I must have missed this in the article. Where was this?
| jdiff wrote:
| ...It's in the comment you were responding to. Directly
| above the section you quoted.
| TomNomNom wrote:
| It provided the wrong file-types for some files, so I cannot
| rely on its output to be correct.
|
| If you wanted to, for example, use this tool to route
| different files to different format-specific handlers it
| would sometimes send files to the wrong handlers.
| michaelmior wrote:
| Except a 100% correct implementation doesn't exist AFAIK.
| So if I want to do anything that makes a decision based on
| the type of a file, I have to pick _some_ algorithm to do
| that. If I can do that correctly 99% of the time, that 's
| better than not being able to make that decision at all,
| which is where I'm left if a perfect implementation doesn't
| exist.
| jdiff wrote:
| Nobody's asking for perfection. But the AI is offering
| inexplicable and obvious nondeterministic mistakes that
| the traditional algorithms don't suffer from.
|
| Magika goes wrong and your fonts become audio files and
| nobody knows why. Magic goes wrong and your ZIP-based
| documents get mistaken for generic ZIP files. If you work
| with that edge case a lot, you can anticipate it with
| traditional algorithms. You can't anticipate
| nondeterministic hallucination.
| jsnell wrote:
| Where are you getting the non-determinism part from? It
| would seem surprising for there to be anything non-
| deterministic about an ML model like this, and nothing in
| the original reports seems to suggest that either.
| TeMPOraL wrote:
| Large ML models tend to be unavoidably non-deterministic
| simply from doing lots of floating point math in parallel.
| Addition and multiplication of floats are commutative but not
| associative - you may get different results depending on the
| order in which you add/multiply numbers.
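The order-dependence of floating-point summation is easy to demonstrate:

```python
# Floating-point addition is not associative, so the grouping of
# a parallel reduction can change the bits of the result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```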
| Gormo wrote:
| > It would seem surprising for there to be anything non-
| deterministic about an ML model like this
|
| I think there may be some confusion of ideas going in
| here. Machine learning is fundamentally stochastic, so it
| is non-deterministic almost by definition.
| ebursztein wrote:
| Thanks for the feedback -- we will look into it. If you can
| share with us the list of URLs, that would be very helpful so
| we can reproduce -- send us an email at magika-dev@google.com
| if that is possible.
|
| For crawling we have planned a head-only model to avoid
| fetching the whole file, but it is not ready yet -- we weren't
| sure what use cases would emerge, so it is good to know that
| such a model might be useful.
|
| We mostly use Magika internally to route files for AV scanning,
| as we wrote in the blog post, so it is possible that despite
| our best efforts to test Magika extensively on various file
| types it is not as good on font formats as it should be. We
| will look into it.
|
| Thanks again for sharing your experience with Magika this is
| very useful.
| TomNomNom wrote:
| Sure thing :)
|
| Here's[0] a .tgz file with 3 files in it that are
| misidentified by magika but correctly identified by the
| `file` utility: asp.html, vba.html, unknown.woff
|
| These are files that were in one of my crawl datasets.
|
| [0]: https://poc.lol/files/magika-test.tgz
| ebursztein wrote:
| Thank you - we are adding them to our test suite for the
| next version.
| TomNomNom wrote:
| Super, thank you! I look forward to it :)
|
| I've worked on similar problems recently so I'm well
| aware of how difficult this is. An example I've given
| people is automatically detecting base64-encoded data.
| It _seems_ easy at first, but any four, eight, or twelve
| (etc) letter word is technically valid base64, so you
| need to decide if and how those things should be
| excluded.
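The base64 false-positive problem above is easy to reproduce with the stdlib: any string over the base64 alphabet whose length is a multiple of 4 decodes without error.

```python
import base64
import binascii

# Ordinary four- and eight-letter words pass a strict validity
# check, so "is it valid base64?" is not the same question as
# "is it base64-encoded data?".
for word in ["data", "Tool", "abcdefgh"]:
    print(word, "->", base64.b64decode(word, validate=True))

# Length is the main thing that gets rejected outright:
try:
    base64.b64decode("hello", validate=True)
except binascii.Error as e:
    print("rejected:", e)
```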
| beeboobaa wrote:
| Do you have permission to redistribute these files?
| IvyMike wrote:
| You are asking what if this guy has "web crawl data" that
| google does not have?
|
| And what if he says no, he does not have permission.
| beeboobaa wrote:
| > You are asking what if this guy has "web crawl data"
| that google does not have?
|
| No, I'm asking if he has permission to redistribute these
| files.
| timschmidt wrote:
| Are you attempting to assert that use of these files
| solely for the purpose of improving a software system
| meant to classify file types does not fall under fair
| use?
|
| https://en.wikipedia.org/wiki/Fair_use
| beeboobaa wrote:
| I'm asking a question.
|
| Here's another one for you: Do you believe that all
| pictures you have ever taken, all emails you have ever
| written, all code you have ever written could be posted
| here on this forum to improve someone else's software
| system?
|
| If so, could you go ahead and post that zip? I'd like to
| ingest it in my model.
| timschmidt wrote:
| Your question seems orthogonal to the situation. The
| three files posted seem to be the minimum amount of
| information required to reproduce the bug. Fair use
| encompasses a LOT of uses of otherwise copyrighted work,
| and this seems clearly to be one.
| beeboobaa wrote:
| I don't see how publicly posting them on a forum is
|
| > the minimum amount of information required to reproduce
| the bug
|
| MAYBE if they had communicated privately that'd be an
| argument that made sense.
| timschmidt wrote:
| So you don't think that software development which
| happens in public web forums deserve fair use protection?
| beeboobaa wrote:
| That's an interesting way to frame "publicly posted
| someone else's data without their consent for anyone to
| see and download"
| timschmidt wrote:
| I notice you're so invested that you haven't noticed that
| the files have been renamed and zipped such that they're
| not even indexable. How you'd expect anyone not
| participating in software development to find them is yet
| to be explained.
| beeboobaa wrote:
| I notice you're so invested you keep coming up with
| imaginary scenarios that you pretend somehow matter, lol
| timschmidt wrote:
| Have fun, buddy!
| jdiff wrote:
| It's three files that were scraped from (and so publicly
| available on) the web. That's not at all similar to your
| strawful analogy.
| timschmidt wrote:
| I'm over here trying to fathom the lack of control over
| one's own life it would take to cause someone to turn
| into an online copyright cop, when the data in question
| isn't even their own, is clearly divorced from any
| context which would make it useful for anything other
| than fixing the bug, and about which the original
| copyright holder hasn't complained.
|
| Some people just want to argue.
|
| If the copyright holder has a problem with the use, they
| are perfectly entitled to spend some of their dollar
| bills to file a law suit, as part of which the contents
| of the files can be entered into the public record for
| all to legally access, as was done with Scientology.
|
| I don't expect anyone would be so daft.
| beeboobaa wrote:
| Literally just asked a question and that seems to have
| set you off, bud. Are you alright? Do you need to feed
| your LLM more data to keep it happy?
| timschmidt wrote:
| I'm always happy to stand up for folks who make things
| over people who want to police them. Especially when
| nothing wrong has happened. Maybe take a walk and get
| some fresh air?
| westurner wrote:
| What is the MIME type of a .tar file; and what are the MIME
| types of the constituent concatenated files within an archive
| format like e.g. tar?
|
| hachoir/subfile/main.py: https://github.com/vstinner/hachoir/
| blob/main/hachoir/subfil...
|
| File signature: https://en.wikipedia.org/wiki/File_signature
|
| PhotoRec: https://en.wikipedia.org/wiki/PhotoRec
|
| "File Format Gallery for Kaitai Struct"; 185+ binary file
| format specifications: https://formats.kaitai.io/
|
| Cross-reference table: https://formats.kaitai.io/xref.html
|
| AntiVirus software > Identification methods > Signature-based
| detection, Heuristics, and _ML /AI data mining_: https://en.w
| ikipedia.org/wiki/Antivirus_software#Identificat...
|
| Executable compression; packer/loader:
| https://en.wikipedia.org/wiki/Executable_compression
|
| Shellcode database > MSF:
| https://en.wikipedia.org/wiki/Shellcode_database
|
| sigtool.c: https://github.com/Cisco-
| Talos/clamav/blob/main/sigtool/sigt...
|
| clamav sigtool:
| https://www.google.com/search?q=clamav+sigtool
|
| https://blog.didierstevens.com/2017/07/14/clamav-sigtool-
| dec... : sigtool --find-sigs "$name" |
| sigtool --decode-sigs
|
| List of file signatures:
| https://en.wikipedia.org/wiki/List_of_file_signatures
|
| And then also clusterfuzz/oss-fuzz scans .txt source files
| with (sandboxed) Static and Dynamic Analysis tools, and
| `debsums`/`rpm -Va` verify that files on disk have the same
| (GPG signed) checksums as the package they are supposed to
| have been installed from, and a file-based HIDS builds a
| database of file hashes and compares what's on disk in a
| later scan with what was presumed good, and ~gdesktop LLM
| tools scan every file, and there are extended filesystem
| attributes for _label_ -based MAC systems like SELinux, oh
| and NTFS ADS.
|
| A sufficient cryptographic hash function yields random bits
| with uniform probability. DRBG Deterministic Random Bit
| Generators need high entropy random bits in order to
| continuously re-seed the RNG random number generator. Is it
| safe to assume that hashing (1) every file on disk, or (2)
| any given file on disk at random, will yield random bits with
| uniform probability; and (3) why Argon2 instead of e.g. only
| two rounds of SHA256?
|
| https://github.com/google/osv.dev/blob/master/README.md#usin.
| .. :
|
| > _We provide a Go based tool that will scan your
| dependencies, and check them against the OSV database for
| known vulnerabilities via the OSV API._ ... With package
| metadata, not (a file hash, package) database that could be
| generated from OSV and the actual package files instead of
| their manifest of already-calculated checksums.
|
| Might as well be heating a pool on the roof with all of this
| waste heat from hashing binaries build from code of unknown
| static and dynamic quality.
|
| Add'l useful formats:
|
| > _Currently it is able to scan various lockfiles, debian
| docker containers, SPDX and CycloneDB SBOMs, and git
| repositories_
|
| Things like bittorrent magnet URIs, Named Data Networking,
| and IPFS are (file-hash based) "Content addressable storage":
| https://en.wikipedia.org/wiki/Content-addressable_storage
| nayuki wrote:
| The name sounds like the Pokemon Magikarp or the anime series
| Madoka Magica.
| pier25 wrote:
| I use FFMPEG to detect if uploaded files are valid audio files.
| Would this be much faster?
| omni wrote:
| Can someone please help me understand why this is useful? The
| article mentions malware scanning applications, but if I'm
| sending you a malicious PDF, won't I want to clearly mark it with
| a .pdf extension so that you open it in your PDF app? Their
| examples are all very obvious based on file extensions.
| chromaton wrote:
| It can't correctly identify a DXF file in my testing. It
| categorizes it as plain text.
| Eiim wrote:
| I ran a quick test on 100 semi-random files I had laying around.
| Of those, 81 were detected correctly, 6 were detected as the
| wrong file type, and 12 were detected with an unspecific file
| type (unknown binary/generic text) when a more specific type
| existed. In 4 of the unspecific cases, a low-confidence guess
| was provided, which was wrong in each case.
|
| However, almost all of the files which were detected
| wrong/unspecific are of types not supported by Magika, with one
| exception of a JSON file containing a lot of JS code as text,
| which was detected as JS code.
|
| For comparison, file 5.45 (the version I happened to have
| installed) got 83 correct, 6 wrong, and 10 not specific. It
| detected the weird JSON correctly, but also had its own strange
| issues, such as detecting a CSV as just "data".
|
| The "wrong" here was somewhat skewed by the 4 GLSL shader code
| files that were in the dataset for some reason, all of which it
| detected as C code (Magika called them unknown). The other two
| "wrong" detections were also code formats that it seems it
| doesn't support. It was also able to output a lot more
| information about the media files.
|
| Not sure what to make of these tests but perhaps they're useful
| to somebody.
| pizzalife wrote:
| > The "wrong" here was somewhat skewed by the 4 GLSL shader
| code files that were in the dataset for some reason, all of
| which it detected as C code
|
| To be fair though, a snippet of GLSL shader code can be
| perfectly valid C.
| Eiim wrote:
| Indeed, which is why I felt the need to call it out here. I'm
| not certain if the files in question actually happened to be
| valid C, but whether that's a meaningful mistake regardless is
| left to the reader to decide.
| Delumine wrote:
| Voidtools - Everything.. looking at you to implement this
| init0 wrote:
| Why not detect it by checking the magic number of the buffer?
| dagmx wrote:
| Not every file has one for starters and many can be incorrect.
|
| Especially in the context of use as a virus scanner, you don't
| trust what the file says it is
| YoshiRulz wrote:
| So instead of spending some of their human resources to improve
| libmagic, they used some of their computing power to create an
| "open source" neural net, which is technically more accurate than
| the "error-prone" hand-written rules (ignoring that it supports
| far fewer filetypes), and which is much less effective in an
| adversarial context, and they want it to "help other software
| improve their file identification accuracy," which of course it
| can't since neural nets aren't introspectable. Thanks guys.
| 12_throw_away wrote:
| Come on, can't you help but be impressed by this amazing AI
| tech? That gives us sci-fi tools like ... a less-accurate,
| incomplete, stochastic, un-debuggable, slower, electricity-
| guzzling version of `file`.
| og_kalu wrote:
| >So instead of spending some of their human resources to
| improve libmagic
|
| A large megacorp can work on multiple things at once.
|
| >an "open source" neural net, which is technically more
| accurate than the "error-prone" hand-written rules (ignoring
| that it supports far fewer filetypes)
|
| You say that like it's a contradiction but it's not.
|
| >and which is much less effective in an adversarial context,
|
| Is it? This seems like an assumption.
|
| >and they want it to "help other software improve their file
| identification accuracy," which of course it can't since neural
| nets aren't introspectable.
|
| Being introspectable or not has no bearing on the accuracy of a
| system.
| breather wrote:
| Can we please god stop using AI like it's a meaningful word? This
| is really interesting technology; it's hamstrung by association
| with a predatory marketing term.
| goshx wrote:
| I used an HTML file and added JPEG magic bytes to its header:
|
| magika file.jpg
|
| file.jpg: JPEG image data (image)
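goshx's experiment is easy to reproduce: prepend the three-byte JPEG signature to an HTML document, and a detector that leans on leading bytes will report a JPEG. A sketch of the file construction only (the HTML payload is made up for illustration; the `magika` invocation is as shown in the comment):

```python
# Build a spoofed file: JPEG magic bytes followed by HTML content.
# A detector that trusts the leading signature will call this a JPEG.
JPEG_MAGIC = b"\xff\xd8\xff"
html = b"<!DOCTYPE html><html><body><p>not an image</p></body></html>"

with open("file.jpg", "wb") as f:
    f.write(JPEG_MAGIC + html)
```

Per the comment, Magika reported this file as "JPEG image data (image)" as well.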
| runxel wrote:
| Took a .dxf file and fed it to Magika. It says with 97%
| confidence that it must be a PowerShell file. A classic .dwg
| could be "mscompress" (whatever that is) at 81%, or a GIF. Both
| couldn't be further from the truth.
|
| Common files are categorized successfully - but well, yeah that's
| not really an achievement. Pretty much nothing more than a toy
| right now.
| queuebert wrote:
| The real problem with deep learning approaches is hallucination
| and edge case failures. When someone finally fixes this, I hope
| it makes the HN front page.
| lqcfcjx wrote:
| After reading through all the comments, honestly I still don't
| get the point of this system. What is the potential practical
| value or application of this model?
| jwithington wrote:
| I guess I'm kind of a dummy on this, but why is it impressive to
| identify that a .js file is Javascript, a .md file is Markdown,
| etc?
| playingalong wrote:
| Because it's done by inspecting the content, not the name of
| the file.
| woliveirajr wrote:
| Reminds me of when someone asked (on StackOverflow) how to
| recognize binaries for different architectures, like x86 or ARM-
| something or Apple M1 and so on.
|
| I suggested using the technique of NCD (normalized compression
| distance), based on Kolmogorov complexity. Cilibrasi, R. was one
| great researcher in this area, and I think he worked at Google
| at some point.
|
| Using AI seems to follow the same path: "learn" what represents
| some specific file and then compare the unknown file to those
| references (AI: all the parameters; NCD: compression against a
| known type).
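The NCD technique woliveirajr describes classifies an unknown file by how well it compresses together with reference samples: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. A sketch using zlib as the compressor (the reference labels and `classify` helper are made up for illustration):

```python
import zlib

def C(data: bytes) -> int:
    """Compressed length under zlib, a proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 = similar, near 1 = unrelated."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(unknown: bytes, references: dict[str, bytes]) -> str:
    """Pick the reference label whose sample is closest under NCD."""
    return min(references, key=lambda label: ncd(unknown, references[label]))
```

The intuition: if the unknown file shares structure with a reference, compressing them concatenated costs little more than compressing the reference alone, so the distance is small.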
| aidenn0 wrote:
| But will it let you print on Tuesday[1]?
|
| 1:
| https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161...
| queuebert wrote:
| For a subscription fee.
| jjsimpso wrote:
| I wrote an implementation of libmagic in Racket a few years ago
| (https://github.com/jjsimpso/magic). File type identification
| is a pretty interesting topic.
|
| As others have noted, libmagic detects many more file types than
| Magika, but I can see Magika being useful for text files in
| particular, because anything written by humans doesn't have a
| rigid format.
___________________________________________________________________
(page generated 2024-02-16 23:01 UTC)