[HN Gopher] Magika: AI powered fast and efficient file type iden...
___________________________________________________________________
Magika: AI powered fast and efficient file type identification
Author : alphabetting
Score : 584 points
Date : 2024-02-16 01:02 UTC (21 hours ago)
(HTM) web link (opensource.googleblog.com)
(TXT) w3m dump (opensource.googleblog.com)
| NiloCK wrote:
| A somewhat surprising and genuinely useful application of the
| family of techniques.
|
| I wonder how susceptible it is to adversarial binaries or, hah,
| prompt-injected binaries.
| dghlsakjg wrote:
| "These aren't the binaries you are looking for..."
| jamesdwilson wrote:
| For the extremely limited number of file types supported, I
| question the utility of this compared to `magic`
| star4040 wrote:
| It gets a lot of binary file formats wrong for me out-of-the-
| box. I think it needs to be a bit more accurate before we can
| truly assess its susceptibility to such exploits.
| queuebert wrote:
| But they reported >99% accuracy on their cherry-picked
| dataset! /s
| nicklecompte wrote:
| Elsewhere in the thread kevincox[1] points out that it's
| extremely susceptible to adversarial binaries:
|
| > Worse it seems that for unknown formats it confidently claims
| that it is one of the known formats. Rather than saying
| "unknown" or "binary data".
|
| Seems like this is genuinely useless for anybody but AI
| researchers.
|
| [1] https://news.ycombinator.com/item?id=39395677
| kushie wrote:
| this couldn't have been released at a better time for me! really
| needed a library like this.
| petesergeant wrote:
| Tell us why!
| ebursztein wrote:
| Thanks :)
| thorum wrote:
| Supported file types:
| https://github.com/google/magika/blob/main/docs/supported-co...
| s1mon wrote:
| It's surprising that there are so many file types that seem
| relatively common which are missing from this list. There are
| no raw image file formats. There's nothing for CAD - either
| source files or neutral files. There's no MIDI files, or any
| other music creation types. There's no APL, Pascal, COBOL,
| assembly source file formats etc.
| _3u10 wrote:
| No tracker / .mod files either, just use file.
| ebursztein wrote:
| Thanks for the list, we will probably try to extend the
| list of supported formats in future revisions.
| photoGrant wrote:
| Yeah this quickly went from 'additional helpful tool in the
| kit' to 'probably should use something else first'
| vintermann wrote:
| Well, what they used this for at Google was apparently
| scanning their users' files for things they shouldn't store
| in the cloud. Probably they don't care much about MIDI.
| kevincox wrote:
| Worse it seems that for unknown formats it confidently claims
| that it is one of the known formats. Rather than saying
| "unknown" or "binary data".
| vunderba wrote:
| As somebody who's dealt with the ambiguity of attempting to use
| file signatures in order to identify file type, this seems like a
| pretty useful library. Especially since it seems to be able to
| distinguish between different types of text files based on their
| format/content e.g. CSV, markdown, etc.
| semitones wrote:
| Is it really common enough for files not to be annotated with a
| useful/correct file type extension (e.g. .mp3, .txt) that a
| library like this is needed?
| hiddencost wrote:
| malware can intentionally obfuscate itself
| callalex wrote:
| Nothing is ever simple. Even for the most basic .txt files it's
| still useful to know what the character encoding is (utf? 8/16?
| Latin-whatever? etc.) and what the line format is
| (\n,\cr\lf,\n\lf) as well as determining if some maniac removed
| all the indentation characters and replaced them with a mystery
| number of spaces.
|
| Then there are all the container formats that have different
| kinds of formats embedded in them (mov,mkv,pdf etc.)
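The text-file ambiguity described above is easy to demonstrate. A minimal sketch (names are illustrative, not any library's API) that guesses encoding from a byte-order mark and tallies line-ending conventions:

```python
import codecs

def sniff_text(data: bytes) -> dict:
    """Guess encoding (from a BOM only) and the dominant line ending."""
    encoding = "unknown"  # no BOM => could be UTF-8, Latin-1, ...
    for bom, name in [(codecs.BOM_UTF8, "utf-8-sig"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")]:
        if data.startswith(bom):
            encoding = name
            break
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # bare \n
    cr = data.count(b"\r") - crlf  # bare \r
    # Pick the most frequent convention (defaults to \r\n on a tie/empty input).
    ending = max([("\\r\\n", crlf), ("\\n", lf), ("\\r", cr)],
                 key=lambda t: t[1])[0]
    return {"encoding": encoding, "line_ending": ending}
```

Real detectors (chardet, uchardet) go much further, using byte statistics when no BOM is present; this only covers the BOM-marked cases.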
| cole-k wrote:
| A fun read in service of your first point:
| https://en.wikipedia.org/wiki/Bush_hid_the_facts
| SnowflakeOnIce wrote:
| Yes!
|
| Sometimes a file has no extension. Other times the extension is
| a lie. Still other times, you may be dealing with an unnamed
| bytestring and wish to know what kind of content it is.
|
| This last case happens quite a lot in Nosey Parker [1], a
| detector of secrets in textual data. There, it is possible to
| come across unnamed files in Git history, and it would be
| useful to the user to still indicate what type of file it seems
| to be.
|
| I added file type detection based on libmagic to Nosey Parker a
| while back, but it's not compiled in by default because
| libmagic is slow and complicates the build process. Also,
| libmagic is implemented as a large C library whose primary job
| is parsing, which makes the security side of me jittery.
|
| I will likely add enabled-by-default filetype detection to
| Nosey Parker using Magika's ONNX model.
|
| [1] https://github.com/praetorian-inc/noseyparker
| m0shen wrote:
| At multiple points in my career I've been responsible for APIs
| that accept PDFs. Many non-tech-savvy people, seeing this, will
| just change the extension of the file they're uploading to
| `.pdf`.
|
| To make matters worse, there is some business software out
| there that will actually bastardize the PDF format and put
| garbage before the PDF file header. So for some things you end
| up writing custom validation and cleanup logic anyway.
| userbinator wrote:
| _Today web browsers, code editors, and countless other software
| rely on file-type detection to decide how to properly render a
| file._
|
| "web browsers"? Odd to see this coming from Google itself.
| https://en.wikipedia.org/wiki/Content_sniffing was widely
| criticised for being problematic for security.
| rafram wrote:
| Content sniffing can be disabled by the server (X-Content-Type-
| Options: nosniff), but it's still used by default. Web browsers
| have to assume that servers are stupid, and that for relatively
| harmless cases, it's fine to e.g. render a PNG loaded by an
| <img> even if it's served as text/plain.
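For reference, the server-side opt-out mentioned above is a single response header. A minimal WSGI sketch (illustrative, not from the article):

```python
# A WSGI app that opts out of browser content sniffing for its responses.
# Purely illustrative; any web framework can set the same header.
def app(environ, start_response):
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        # Tells the browser to trust the declared type and never sniff.
        ("X-Content-Type-Options", "nosniff"),
    ])
    return [b"hello"]
```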
| stevepike wrote:
| Oh man, this brings me back! Almost 10 years ago I was working on
| a rails app trying to detect the file type of uploaded
| spreadsheets (xlsx files were being detected as application/zip,
| which is technically true but useless).
|
| I found "magic" that could detect these and submitted a patch at
| https://bugs.freedesktop.org/show_bug.cgi?id=78797. My patch got
| rejected for needing to look at the first 3KB of the file
| to figure out the type. They had a hard limit that they wouldn't
| look past the first 256 bytes. Now in 2024 we're doing this with
| deep learning! It'd be cool if google released some speed
| performance benchmarks here against the old-fashioned
| implementations. Obviously it'd be slower, but is it 1000x or
| 10^6x?
| renonce wrote:
| From the first paragraph:
|
| > enabling precise file identification within milliseconds,
| even when running on a CPU.
|
| Maybe your old-fashioned implementations were detecting in
| microseconds?
| stevepike wrote:
| Yeah I saw that, but that could cover a pretty wide range and
| it's not clear to me whether that relies on preloading a
| model.
| ryanjshaw wrote:
| > At inference time Magika uses Onnx as an inference engine
| to ensure files are identified in a matter of milliseconds,
| almost as fast as a non-AI tool even on CPU.
| ebursztein wrote:
| Co-author of Magika here (Elie). We didn't include the
| measurements in the blog post to avoid making it too long, but
| we did take them.
|
| Overall, file takes about 6ms for a single file and 2.26ms per
| file when scanning multiples. Magika is at 65ms for a single
| file and 5.3ms when scanning multiples.
|
| So in the worst case Magika is about 10x slower, due to the
| time it takes to load the model, and 2x slower on repeated
| detection. This is why we said it is not that much slower.
|
| We will have more performance measurements in the upcoming
| research paper. Hope that answers the question.
| jpk wrote:
| Do you have a sense of performance in terms of energy use? 2x
| slower is fine, but is that at the same wattage, or more?
| alephnan wrote:
| That sounds like a nit / premature optimization.
|
| Electricity is cheap. If this is sufficiently or actually
| important for your org, you should measure it yourself.
| There are too many variables and factors subject to your
| org's hardware.
| djxfade wrote:
| Totally disagree. Most end users are on laptops and
| mobile devices these days, not desktop towers. Thus power
| efficiency is important for battery life. Performance per
| watt would be an interesting comparison.
| true_religion wrote:
| What end users are working with arbitrary files whose type
| they don't know?
|
| This entire use case seems to be one suited for servers
| handling user media.
| michaelt wrote:
| Theoretically? Anyone running a virus scanner.
|
| Of course, it's arguably unlikely a virus scanner would
| opt for an ML-based approach, as they specifically need
| to be robust against adversarial inputs.
| scq wrote:
| You'd be surprised what an AV scanner would do.
|
| https://twitter.com/taviso/status/732365178872856577
| michaelmior wrote:
| > it's arguably unlikely a virus scanner would opt for an
| ML-based approach
|
| Several major players such as Norton, McAfee, and
| Symantec all at least claim to use AI/ML in their
| antivirus products.
| r0ze-at-hn wrote:
| Browsers often need to guess a file type
| wongarsu wrote:
| File managers that render preview images. Even detecting
| which software to open the file with when you click it.
|
| Of course on Windows the convention is to use the file
| extension, but on other platforms the convention is to
| look at the file contents
| michaelmior wrote:
| > on other platforms the convention is to look at the
| file contents
|
| MacOS (that is, Finder) also looks at the extension. That
| has also been the case with any file manager I've used on
| Linux distros that I can recall.
| jdiff wrote:
| You might be surprised. Rename your Photo.JPG as
| Photo.PNG and you'll still get a perfectly fine
| thumbnail. The extension is a hint, but it isn't
| definitive, especially when you start downloading from
| the web.
| underdeserver wrote:
| In general you're right, but I can't think of a single
| local use for identifying file types by a human on a
| laptop - at least, one with scale where this matters.
| It's all going to be SaaS services where people upload
| stuff.
| prmph wrote:
| We are building a data analysis tool with great UX, where
| users select data files, which are then parsed and
| uploaded to S3 directly, on their client machines. The
| server only takes over after this step.
|
| Since the data files can be large, this approach avoids
| transferring the file twice, first to the server, and then
| to S3 after parsing.
| DontSignAnytng wrote:
| This doesn't sound like a very common scenario.
| vertis wrote:
| I mean if you care about that you shouldn't be running
| anything that isn't highly optimized. Don't open webpages
| that might be CPU or GPU intensive. Don't run Electron
| apps, or really anything that isn't built in a compiled
| language.
|
| Certainly you should do an audit of all the Android and
| iOS apps as well, to make sure they've been made in an
| efficient manner.
|
| Block ads as well, they waste power.
|
| This file identification is SUCH a small aspect of
| everything that is burning power in your laptop or phone
| as to be laughable.
| _puk wrote:
| Whilst energy usage is indeed a small aspect this early
| on when using bespoke models, we do have to consider that
| this is a model for simply identifying a file type.
|
| What happens when we introduce more bespoke models for
| manipulating the data in that file?
|
| This feels like it could slowly boil to the point of
| programs using magnitudes higher power, at which point
| it'll be hard to claw it back.
| vertis wrote:
| That's a slippery slope argument, which is a common
| logical fallacy[0]. This model being inefficient compared
| to the best possible implementation does not mean that
| future additions will also be inefficient.
|
| It's the equivalent of saying that many people programming in
| Ruby causes all future programs to be less efficient. Which is
| not true. In fact, many people programming in Ruby has caused
| Ruby to become more efficient, because it gets optimised as it
| gets used more (the same goes for Python).
|
| It's not as energy efficient as C, but that hasn't caused
| things to get worse and worse and spiral out of control.
|
| Likewise, smart contracts are incredibly inefficient
| mechanisms of computation. The result is mostly that
| people don't use them for any meaningful amounts of
| computation; that all gets done "off chain".
|
| Generative AI is definitely less efficient, but it's
| likely to improve over time, and indeed things like
| quantization have allowed models that would normally
| require much more substantial hardware resources (and
| therefore more energy) to be run on smaller
| systems.
|
| [0]: https://en.wikipedia.org/wiki/Slippery_slope
| diffeomorphism wrote:
| That is a fallacy fallacy. Just because some slopes are
| not slippery that does not mean none of them are.
| thfuran wrote:
| >This feels like it could slowly boil to the point of
| programs using magnitudes higher power, at which point
| it'll be hard to claw it back.
|
| We're already there. Modern software is, by and large,
| profoundly inefficient.
| cornholio wrote:
| The hardware requirements of a massively parallel
| algorithm can't possibly be "a nit" in any universe
| inhabited by rational beings.
| chmod775 wrote:
| Is that single-threaded libmagic vs Magika using every core
| on the system? What are the numbers like if you run multiple
| libmagic instances in parallel for multiple files, or limit
| both libmagic and magika to a single core?
|
| Testing it on my own system, magika seems to use a lot more
| CPU-time:
|
|     file /usr/lib/*          0,34s user   0,54s system   43% cpu   2,010 total
|     ./file-parallel.sh       0,85s user   1,91s system  580% cpu   0,477 total
|     bin/magika /usr/lib/*   92,73s user   1,11s system  393% cpu  23,869 total
|
| Looks about 50x slower to me. There's 5k files in my lib
| folder. It's definitely still impressively fast given how the
| identification is done, but the difference is far from
| negligible.
| metafunctor wrote:
| I've ended up implementing a layer on top of "magic" which, if
| magic detects application/zip, reads the zip file manifest and
| checks for telltale file names to reliably detect Office files.
|
| The "magic" library does not seem to be equipped with the
| capabilities needed to be robust against the zip manifest being
| ordered in a different way than expected.
|
| But this deep learning approach... I don't know. It might be
| hard to shoehorn in to many applications where the traditional
| methods have negligible memory and compute costs and the
| accuracy is basically 100% for cases that matter (detecting
| particular file types of interest). But when looking at a large
| random collection of unknown blobs, yeah, I can see how this
| could be great.
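The zip-manifest layering described above can be sketched in a few lines. A hedged illustration, not metafunctor's actual code; the marker filenames come from the OOXML package layout and should be treated as a heuristic:

```python
import io
import zipfile

# OOXML member names that distinguish Office documents from plain zips.
OFFICE_MARKERS = {
    "word/document.xml": "wordprocessingml.document (docx)",
    "xl/workbook.xml": "spreadsheetml.sheet (xlsx)",
    "ppt/presentation.xml": "presentationml.presentation (pptx)",
}

def refine_zip(data: bytes) -> str:
    """Refine an application/zip verdict by peeking at member names."""
    try:
        names = set(zipfile.ZipFile(io.BytesIO(data)).namelist())
    except zipfile.BadZipFile:
        return "not-a-zip"
    for marker, label in OFFICE_MARKERS.items():
        # Set membership is order-independent, unlike offset-based magic rules,
        # so a reordered manifest can't break the check.
        if marker in names:
            return label
    return "application/zip"
```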
| comboy wrote:
| Many commenters seem to be using _magic_ instead of _file_ ,
| any reasons?
| e1g wrote:
| _magic_ is the core detection logic of _file_ that was
| extracted out to be available as a library. So these days
| _file_ is just a higher level wrapper around _magic_
| comboy wrote:
| thanks
| stevepike wrote:
| If you're curious, here's how I solved it for ruby back in
| the day. Still used magic bytes, but added an overlay on top
| of the freedesktop.org DB:
| https://github.com/mimemagicrb/mimemagic/pull/20
| brabel wrote:
| > They had a hard limit that they wouldn't see past the first
| 256 bytes.
|
| Then they could never detect zip files with certainty, given
| that to do that you need to read up to 65KB (+ 22) at the END
| of the file. The reason is that the zip archive format allows
| "gargabe" bytes both in the beginning of the file and in
| between local file headers.... and it's actually not uncommon
| to prepend a program that self-extracts the archive, for
| example. The only way to know if a file is a valid zip archive
| is to look for the End of Central Directory Entry, which is
| always at the end of the file AND allows for a comment of
| unknown length at the end (and as the comment length field
| takes 2 bytes, the comment can be up to 65K long).
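The EOCD scan described above can be sketched as follows (an illustration, not the `file` implementation): the record's signature is `PK\x05\x06`, its fixed part is 22 bytes, and the trailing comment can add up to 65535 more, so only the last 65557 bytes need searching.

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # End of Central Directory signature
MAX_TAIL = 22 + 0xFFFF     # fixed EOCD size + maximum comment length

def is_zip(data: bytes) -> bool:
    """True if a plausible EOCD record terminates the file."""
    tail = data[-MAX_TAIL:]
    pos = tail.rfind(EOCD_SIG)
    while pos != -1:
        if pos + 22 <= len(tail):
            # The 2-byte comment length is the record's last field (offset 20
            # from the signature); it must account for exactly the bytes that
            # remain after the fixed part.
            (comment_len,) = struct.unpack_from("<H", tail, pos + 20)
            if pos + 22 + comment_len == len(tail):
                return True
        pos = tail.rfind(EOCD_SIG, 0, pos)
    return False
```

Prepended garbage (e.g. a self-extractor stub) does not affect the result, since only the tail is examined.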
| jeffbee wrote:
| That's why the whole question is ill formed. A file does not
| have exactly one type. It may be a valid input in various
| contexts. A zip archive may also very well be something else.
| aidenn0 wrote:
| FWIW, file can now distinguish many types of zip containers,
| including Oxml files.
| rfl890 wrote:
| We have had file(1) for years
| samtheprogram wrote:
| This is beyond what file is capable of. It's also mentioned in
| the third paragraph.
|
| RTFA.
| wruza wrote:
| Some HN readers may not know about file(1) even. It's fine to
| mention that $subj enhances that, but the rtfa part seems
| pretty unnecessary.
| Vogtinator wrote:
| FWICT file is more capable, predictable and also faster while
| being more energy-efficient at the same time.
| Majestic121 wrote:
| That's not what the performance table in the article is
| implying, with Magika's precision and recall hovering around
| 99%, while magic is at 92% precision and 72% recall.
|
| One can doubt the representativeness of their dataset, but if
| what is in the article is correct, Magika is clearly way
| more capable and predictable.
| NoGravitas wrote:
| Yes, it's slower than file(1), uses more energy, recognizes
| fewer file types, and is less accurate.
| aitchnyu wrote:
| Nearly 20 years back, this group of Linux users used to brag
| that Linux would identify files even if you changed the
| extension, while Windoze needed to police you about changing
| extensions.
| lifthrasiir wrote:
| I'm extremely confused about the claim that other tools have a
| worse precision or recall for APK or JAR files which are very
| much regular. Like, they should be a valid ZIP file with `META-
| INF/MANIFEST.MF` present (at least), and APK would need
| `classes.dex` as well, but at this point there is no other format
| that can be confused with APK or JAR, I believe. I'd like to see
| which files were causing the unexpected drop in precision or
| recall.
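The structural check the parent describes can be sketched like this (illustrative rules, not Magika's or `file`'s logic; as noted elsewhere in the thread, real-world JARs may lack a manifest):

```python
import io
import zipfile

def classify_archive(data: bytes) -> str:
    """Classify a zip-family blob by its member names."""
    try:
        names = set(zipfile.ZipFile(io.BytesIO(data)).namelist())
    except zipfile.BadZipFile:
        return "unknown"
    # An APK carries compiled Dalvik code and a binary manifest.
    if "AndroidManifest.xml" in names and "classes.dex" in names:
        return "apk"
    # A conventional JAR has a META-INF manifest (though it's optional).
    if "META-INF/MANIFEST.MF" in names:
        return "jar"
    return "zip"
```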
| charcircuit wrote:
| apks are also zipaligned so it's not like random users are
| going to be making them either
| HtmlProgrammer wrote:
| Minecraft mods 14 years ago used to tell you to open the JAR
| and delete the META-INF when installing them so can't rely on
| that one...
| supriyo-biswas wrote:
| The `file` command checks only the first few bytes, and doesn't
| parse the structure of the file. APK files are indeed reported
| as Zip archives by the latest version of `file`.
| m0shen wrote:
| This is false in every sense for
| https://www.darwinsys.com/file/ (probably the most used file
| version). It depends on the magic for a specific file, but it
| can check any part of your file. Many Linux distros are years
| out of date, you might be using a very old version.
|
|     FILE_45: ./src/file -m magic/magic.mgc ../../OpenCalc.v2.3.1.apk
|     ../../OpenCalc.v2.3.1.apk: Android package (APK), with
|     zipflinger virtual entry, with APK Signing Block
| supriyo-biswas wrote:
| Interesting! I checked with file 5.44 from Ubuntu 23.10 and
| 5.45 on macOS using homebrew, and in both cases, I got "Zip
| archive data, at least v2.0 to extract" for the file
| here[1]. I don't have an Android phone to check and I'm
| also not familiar with Android tooling, so is this a
| corrupt APK?
|
| [1] https://download.apkpure.net/custom/com.apkpure.aegon-3
| 19781...
| m0shen wrote:
| That doesn't appear to be a valid link. Try building
| `file` from source and using the provided default magic
| database.
| supriyo-biswas wrote:
| I also tried this with the sources of file from the
| homepage you linked above, and I still get the same
| results.
|
| You could try this for yourself using the same APKPure
| file which I uploaded at the following alternative
| link[1]. Further, while this could be a corrupt APK, I
| can't see any signs of that from a cursory inspection as
| both the `classes.dex` file and the `META-INF` directory are
| present, and this is APKPure's own APK, not an APK
| contributed for a third-party app.
|
| [1] https://wormhole.app/Mebmy#CDv86juV9H4aRCL2DSJeDw
| Someone wrote:
| People do create JAR files without a META-INF/MANIFEST.MF
| entry.
|
| The tooling even supports it.
| https://docs.oracle.com/en/java/javase/21/docs/specs/man/jar...:
|
|     -M or --no-manifest
|         Doesn't create a manifest file for the entries
| Vt71fcAqt7 wrote:
| This feels like old school google. I like that it's just a static
| webpage that basically can't be shut down or sunsetted. It
| reminds me of when Google just made useful stuff and gave it away
| for free on a webpage, like translate and google books. Obviously
| less life changing than the above but still a great option to
| have when I need this.
| vrnvu wrote:
| At $job we have been using Apache Tika for years.
|
| It works, but we occasionally hit bugs and weird collisions when
| working with billions of files.
|
| Happy to see new contributions in the space.
| johnea wrote:
| The results of which you'll never be 100% sure are correct...
| wruza wrote:
| They missed such an opportunity to name it "fail". It's like
| "file" but with "ai" in it.
| tamrix wrote:
| What about faile?
| rfoo wrote:
| But file(1) is already like that - my data files without
| headers are reported randomly as disk images, compressed
| archives or even executables for never-heard-of machines.
| plesiv wrote:
| Other methods use heuristics to guess many filetypes and in the
| benchmark they show worse performance (in terms of precision).
| Assuming benchmarks are not biased, the fact that this approach
| uses AI heuristics instead of hard-coded heuristics shouldn't
| make it strictly worse.
| Imnimo wrote:
| I wonder how big of a deal it is that you'd have to retrain the
| model to support a new or changed file type? It doesn't seem like
| the repo contains training code, but I could be missing it...
| m0shen wrote:
| As someone that has worked in a space that has to deal with
| uploaded files for the last few years, and someone who maintains
| a WASM libmagic Node package ( https://github.com/moshen/wasmagic
| ) , I have to say I really love seeing new entries into the file
| type detection space.
|
| Though I have to say when looking at the Node module, I don't
| understand why they released it.
|
| Their docs say it's slow:
|
| https://github.com/google/magika/blob/120205323e260dad4e5877...
|
| It loads the model at runtime:
|
| https://github.com/google/magika/blob/120205323e260dad4e5877...
|
| They mark it as Experimental in the documentation, but it seems
| like it was just made for the web demo.
|
| Also as others have mentioned. The model appears to only detect
| 116 file types:
|
| https://github.com/google/magika/blob/120205323e260dad4e5877...
|
| Where libmagic detects... a lot. Over 1600 last time I checked:
|
| https://github.com/file/file/tree/4cbd5c8f0851201d203755b76c...
|
| I guess I'm confused by this release. Sure it detected most of my
| list of sample files, but in a sample set of 4 zip files, it
| misidentified one.
| lebean wrote:
| It's for researchers, probably.
| m0shen wrote:
| Yeah, there is this line:
|
|     By open-sourcing Magika, we aim to help other software
|     improve their file identification accuracy and offer
|     researchers a reliable method for identifying file types
|     at scale.
|
| Which implies a production-ready release for general usage,
| as well as usage by security researchers.
| m0shen wrote:
| Made a small test to try it out:
| https://gist.github.com/moshen/784ee4a38439f00b17855233617e9...
|     hyperfine ./magika.bash ./file.bash
|
|     Benchmark 1: ./magika.bash
|       Time (mean +- s):     706.2 ms +- 21.1 ms    [User: 10520.3 ms, System: 1604.6 ms]
|       Range (min ... max):  684.0 ms ... 738.9 ms    10 runs
|
|     Benchmark 2: ./file.bash
|       Time (mean +- s):     23.6 ms +- 1.1 ms    [User: 15.7 ms, System: 7.9 ms]
|       Range (min ... max):  22.4 ms ... 29.0 ms    111 runs
|
|     Summary
|       './file.bash' ran 29.88 +- 1.65 times faster than './magika.bash'
| barrkel wrote:
| Realistically, either you're identifying one file
| interactively and you don't care about latency differences in
| the 10s of ms, or you're identifying in bulk (batch command
| line or online in response to requests), in which case you
| should measure the marginal cost and exclude Python startup
| and model loading times.
| m0shen wrote:
| My little script is trying to identify in bulk, at least by
| passing 165 file paths to `magika`, and `file`.
|
| Though, I absolutely agree with you. I think realistically
| it's better to do this kind of thing in a library rather
| than shell out to it at all. I was just trying to get an
| idea on how it generally compares.
|
| Another note, I was trying to be generous to `magika` here
| because when it's single file identification, it's about
| 160-180ms on my machine vs <1ms for `file`. I realize
| that's going to be quite a bit of python startup in that
| number, which is why I didn't go with it when pushing that
| benchmark up earlier. I'll probably push an update to that
| gist to include the single file benchmark as well.
| chmod775 wrote:
| Going by those numbers it's taking almost a second to run,
| not 10s of ms. And going by those numbers, it's doing
| something massively parallel in that time. So basically all
| your cores will spike to 100% for almost a second during
| those one-shot identifications. It looks like GP has a
| 12-16 threads CPU, _and it is using those while still being
| 30 times slower than single-threaded libmagic_.
|
| That tool needs 100x more CPU time just to figure out some
| filetypes than vim needs to open a file from a cold start
| (which presumably includes using libmagic to check the
| type).
|
| If I had to wait a second just to open something during
| which that thing uses every resource available on my
| computer to the fullest, I'd probably break my keyboard.
| Try using that thing as a drop-in _file_ replacement, open
| some folder in your favorite file manager, and watch your
| computer slow to a crawl as your file manager tries to
| figure out what thumbnails to render.
|
| It's utterly unsuitable for "interactive" identifications.
| m0shen wrote:
| I've updated this script with some single-file cli numbers,
| which are (as expected) not good. Mostly just comparing
| python startup time for that.
|
|     make
|     sqlite3 < analyze.sql
|
|     file_avg              python_avg         python_x_times_slower_single_cli
|     --------------------  -----------------  --------------------------------
|     0.000874874856301821  0.179884610224334  205.611818568799
|
|     file_avg            python_avg     python_x_times_slower_bulk_cli
|     ------------------  -------------  ------------------------------
|     0.0231715865881818  0.69613745142  30.0427184289163
| ebursztein wrote:
| We did release the npm package because we indeed created a web
| demo and thought people might want to also use it. We know it
| is not as fast as the python version or a C++ version -- which
| is why we marked it as experimental.
|
| The release includes the python package and the cli, which are
| quite fast and are the main way we expected people to use it --
| sorry if that wasn't clear in the post.
|
| The goal of the release is to offer a tool that is far more
| accurate than other tools and works on the major file types, as
| we hope it will be useful to the community.
|
| Glad to hear it worked on your files
| m0shen wrote:
| Thank you for the release! I understand you're just getting
| it out the door. I just hope to see it delivered as a native
| library or something more reusable.
|
| I did try the python cli, but it seems to be about 30x slower
| than `file` for the random bag of files I checked.
|
| I'll probably take some time this weekend to make a couple of
| issues around misidentified files.
|
| I'll definitely be adding this to my toolset!
| invernizzi wrote:
| Hello! We wrote the Node library as a first functional version.
| Its API is already stable, but it's a bit slower than the
| Python library for two reasons: it loads the model at runtime,
| and it doesn't do batch lookups, meaning it calls the model for
| each file. Other than that, it's just as fast for single file
| lookups, which is the most common use case.
| m0shen wrote:
| Good to know! Thank you. I'll definitely be trying it out.
| Though, I might download and hardcode the model ;)
|
| I also appreciate the use of ONNX here, as I'm already
| thinking about using another version of the runtime.
|
| Do you think you'll open source your F1 benchmark?
| michaelt wrote:
| _> The model appears to only detect 116 file types [...] Where
| libmagic detects... a lot. Over 1600 last time I checked_
|
| As I'm sure you know, in a lot of applications, you're
| preparing things for a downstream process which supports far
| fewer than 1600 file types.
|
| For example, a printer driver might call on _file_ to check if
| an input is postscript or PDF, to choose the appropriate
| converter - and for any other format, just reject the input.
|
| Or someone training an ML model to generate Python code might
| have a load of files they've scraped from the web, but might
| want to discard anything that isn't Python.
| theon144 wrote:
| Okay, but your one file type is more likely to be included in
| the 1600 that libmagic supports rather than Magika's 116?
|
| For that matter, the file types I care about are
| unfortunately misdetected by Magika (which is also an
| important point - the `file` command at least gives up and
| says "data" when it doesn't know, whereas the Magika demo
| gives a confidently wrong answer).
|
| I don't want to criticize the release because it's not meant
| to be a production-ready piece of software, and I'm sure the
| current 116 types isn't a hard limit, but I do understand the
| parent comment's contention.
| tudorw wrote:
| Can we do the 1600 if known, if not, let the AI take a guess?
| m0shen wrote:
| Absolutely, and honestly in a non-interactive ingestion
| workflow you're probably doing multiple checks anyway. I've
| worked with systems that call multiple libraries and hand-
| coded validation for each incoming file.
|
| Maybe it's my general malaise, or disillusionment with the
| software industry, but when I wrote that I was really just
| expecting more.
| kazinator wrote:
| > _So far, libmagic and most other file-type-identification
| software have been relying on a handcrafted collection of
| heuristics and custom rules to detect each file format.
|
| This manual approach is both time consuming and error prone as it
| is hard for humans to create generalized rules by hand._
|
| Pure nonsense. The rules are accurate, based on the actual
| formats, and not "heuristics".
| cAtte_ wrote:
| the rules aren't based on the formats, but on a small portion
| of them (their magic numbers). this makes them inaccurate
| (think docx vs zip) and heuristic.
| cle wrote:
| Besides compound file types, not all formats are well-specified
| either. Example is CSV.
| summerlight wrote:
| 1. Not all file formats are well specified
| 2. Not all files precisely follow the specification
| 3. Not all file formats are mutually exclusive
|
| Those facts are clearly reflected in the table.
| Nullabillity wrote:
| It seems to detect my Android build.gradle.kts as Scala, which I
| suppose is a kind of hilarious confusion but not exactly useful.
| krick wrote:
| What are use-cases for this? I mean, obviously detecting the
| filetype is useful, but we kinda already have plenty of tools to
| do that, and I cannot imagine, why we need some "smart" way of
| doing this. If you are not a human, and you are not sure what it
| is (like an unknown file being uploaded to a server), you would
| be better off just rejecting it completely, right? After all,
| there's absolutely no way an "AI powered" tool can be more
| reliable than some dumb, err-on-safer-side heuristic, and you
| wouldn't want to trust _that_ thing to protect you from malicious
| payloads.
| nindalf wrote:
| > no way an "AI powered" tool can be more reliable
|
| The article provides accuracy benchmarks.
|
| > you would be better off just rejecting it completely
|
| They mention using it in gmail and Drive, neither of which have
| the luxury of rejecting files willy-nilly.
| fuzztester wrote:
| I have not tried it recently, but IIRC, Gmail does reject
| attachments which are zip files, for security reasons.
| wildrhythms wrote:
| Gmail nukes zips if they contain an executable or some
| other 'prohibited' file type. Most email providers block
| executable attachments.
| n2d4 wrote:
| Virus detection is mentioned in the article. Code editors need
| to find the programming language for syntax highlighting of
| code before you give it a name. Your desktop OS needs to know
| which program to open files with. Or, recovering files from a
| corrupted drive. Etc
|
| It's easy to distinguish, say, a PNG from a JPG file (or
| anything else that has well-defined magic bytes). But some
| files look virtually identical (eg. .jar files are really just
| .zip files). Also see polyglot files [1].
|
| If you allow an `unknown` label or human intervention, then
| yes, magic bytes might be enough, but sometimes you'd rather
| have a 99% chance to be right about 95% of files vs. a 100%
| chance to be right about 50% of files.
|
| [1] https://en.wikipedia.org/wiki/Polyglot_(computing)
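The magic-bytes approach described above can be sketched in a few lines. This is an illustrative sketch only; the signature table is a tiny subset, not an authoritative list:

```python
# Classic magic-byte sniffing: match the first few bytes against
# known signatures. Note how "PK\x03\x04" is ambiguous -- it is
# the prefix of plain zips as well as docx, jar, apk, etc.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"PK\x03\x04": "zip",  # also docx, jar, apk, ...
    b"%PDF-": "pdf",
}

def sniff(data: bytes) -> str:
    """Return a coarse type label based on leading bytes only."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"

print(sniff(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))  # png
print(sniff(b"PK\x03\x04rest"))  # zip -- but is it a jar? a docx?
```

The ambiguity of the zip prefix is exactly the `.jar`-vs-`.zip` problem mentioned above: leading bytes alone cannot distinguish container formats.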
| star4040 wrote:
| It seems to defeat the purpose of such a tool that this initial
| version doesn't handle polyglot files. I hope they're quick to
| work on that.
| VikingCoder wrote:
| What does it do with an Actually Portable Executable compiled by
| Cosmopolitan libc compiler?
| supriyo-biswas wrote:
| It's reported as a PE executable, `file` on the other hand
| reports it as a "DOS/MBR boot sector."
| lopkeny12ko wrote:
| I don't understand why this needs to exist. Isn't file type
| detection inherently deterministic by nature? A valid tar archive
| will always have the same first few magic bytes. An ELF binary
| has a universal ELF magic and header. If the magic is bad, then
| the file is corrupted and not a valid XYZ file. What's the value
| in throwing in "heuristics" and probabilistic inference into a
| process that is black and white by design.
| potatoman22 wrote:
| This also works for formats like Python, HTML, and JSON.
| LiamPowell wrote:
| file (https://www.darwinsys.com/file/) already detects all
| these formats.
| ebursztein wrote:
| Indeed, but as pointed out in the blog post -- file is
| significantly less accurate than Magika. There are also
| some file types that we support and file doesn't, as
| reported in the table.
| LiamPowell wrote:
| I can't immediately find the dataset used for
| benchmarking. Is file actually failing on common files or
| just particularly nasty examples? If it's the latter then
| how does it compare to Magika on files that an average
| person is likely to see?
| schleck8 wrote:
| > Is file actually failing on common files or just
| particularly nasty examples? If it's the latter then how
| does it compare to Magika on files that an average person
| is likely to see?
|
| That's not the point of file type guessing, is it? Google
| employs it as an additional security measure for user
| submitted content which absolutely makes sense given what
| malware devs do with file types.
| amelius wrote:
| Yes, but shouldn't the file type be part of the file, or
| (better) of the metadata of the file?
|
| Knowing is better than guessing.
| lopkeny12ko wrote:
| I still don't see how this is useful. The only time I want to
| answer the question "what type of file is this" is if it is
| an opaque blob of binary data. If it's a plain text file like
| Python, HTML, or JSON, I can figure that out by just catting
| the file.
| vintermann wrote:
| Consider, it's perfectly possible for a file to fit two or more
| file formats - polyglot files are a hobby for some people.
|
| And there are also a billion formats that are _not_ uniquely
| determined by magic bytes. You don 't have to go further than
| text files.
| KOLANICH wrote:
| This tool doesn't work this way.
| TacticalCoder wrote:
| > What's the value in throwing in "heuristics" and
| probabilistic inference into a process that is black and white
| by design.
|
| I use the _file_ command all the time. The value is when you
| get this:
|
|     $ file somefile.xyz
|     somefile.xyz: data
|
| AIUI from reading TFA, _magika_ can determine more filetypes
| than what the _file_ command can detect.
|
| It'd actually be very easy to determine if there's any value in
| _magika_ : run _file_ on every file on your filesystem and then
| for every file where the _file_ command returns "data", run
| _magika_ and see if _magika_ is right.
|
| If it's right, there's your value.
|
| P.S: it may also be easier to run on Windows than the file
| command? But then I can't do much to help people who are on
| Windows.
| Eiim wrote:
| From elsewhere in this thread, it appears that Magika detects
| far fewer file types than file (116 vs ~1600), which makes
| sense. For file, you just need to drop in a few rules to add
| a new, somewhat obscure type. An AI approach like Magika will
| need lots of training and test data for each new file type.
| Where Magika might have a leg up is with distinguishing
| different textual data files (i.e., source code), but I don't
| see that as a particularly big use case honestly.
| cle wrote:
| It's not always deterministic; sometimes it's fuzzy depending
| on the file type. An example of this is a one-line CSV file. I
| tested one case of that: libmagic detects it as a text file
| while magika correctly detects it as a CSV (and gives a
| confidence score, which is killer).
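The fuzziness of CSV detection shows up even in rule-based code. Python's stdlib `csv.Sniffer`, for example, guesses a dialect from character frequencies in a sample. A minimal illustration:

```python
import csv

# csv.Sniffer is a heuristic dialect guesser. On a real table it
# finds the comma reliably:
table = "name,age,city\nalice,30,paris\nbob,25,lyon\n"
print(csv.Sniffer().sniff(table).delimiter)  # ','

# On a single line it leans on a preferred-delimiter list:
print(csv.Sniffer().sniff("alice,30,paris").delimiter)  # ','

# But plain prose can still yield a confident (wrong) guess --
# e.g. a space "delimiter" -- rather than an explicit "not CSV":
print(csv.Sniffer().sniff("just a sentence of plain prose").delimiter)
```

This is the same trade-off the thread is debating: heuristics that must commit to an answer will sometimes commit to the wrong one.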
| alkonaut wrote:
| But even with determinism, it's not always right. It's not too
| rare to find a text file with a byte order mark indicating
| UTF16 (0xFE 0xFF) but then actually containing utf-8. But what
| "format" does it have then? Is it UTF-8 or UTF-16? Same with
| e.g. a jar file missing a manifest. That's just a zip, even
| though I'm sure some runtime might eat it.
|
| But when do you actually face the issue of having to guess the
| format of a file? Is it when reverse engineering? Last time
| I did something like this was in the 90's when trying to pick
| apart some texture from a directory of files called asset0001.k
| and it turns out it was a bitmap or whatever. Fun times.
| a-dub wrote:
| probably a lot of interesting work going on that looks like this
| for the virustotal db itself.
| account-5 wrote:
| Assuming that I've not misunderstood, how does this compare to
| things like TrID [0]? Apart from being open source.
|
| [0] https://mark0.net/soft-trid-e.html
| JacobThreeThree wrote:
| The bulk of the short article is a set of performance
| benchmarks comparing Magika to TrID and others.
| account-5 wrote:
| Argh, the risks of browsing the web without JavaScript and/or
| third party scripts enabled, you miss content, because
| rendering text and images on the modern web can't be done
| without them, apparently. (Sarcasm).
|
| You are of course correct. I can see the images showing the
| comparison. Apologies.
| earth2mars wrote:
| How do I pronounce this? Myajika or MaGika? Anyhow, it's super
| cool.
| thangalin wrote:
| My FOSS desktop text editor performs a subset of file type
| identification using the first 12 bytes, detecting the type quite
| quickly:
|
| * https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/src/main...
|
| There's a much larger list of file signatures at:
|
| * https://github.com/veniware/Space-Maker/blob/master/FileSign...
| Labo333 wrote:
| I wonder what the output will be on polyglot files like run-
| anywhere binaries produced by cosmopolitan [1]
|
| [1]: https://justine.lol/cosmopolitan/
| awaythrow999 wrote:
| Wonder how this would handle a polyglot[0][1] that is valid as a
| PDF document, a ZIP archive, and a Bash script that runs a
| Python webserver, which hosts Kaitai Struct's WebIDE, allowing
| you to view the file's own annotated bytes.
|
| [0]: https://www.alchemistowl.org/pocorgtfo/
|
| [1]: https://www.alchemistowl.org/pocorgtfo/pocorgtfo16.pdf
|
| Edit: just tested, and it only identifies the zip layer
| rvnx wrote:
| You can try it here: https://google.github.io/magika/
|
| It's relatively limited compared to `file` (~10% coverage);
| it's more like a specialized classifier for basic file
| formats, so such cases are really out of scope.
|
| I guess it's more for detecting common file formats with high
| recall.
|
| However, where is the actual source of the model? Let's say I
| want to add a new file format myself.
|
| Apparently only the source of the interpreter is here, not the
| source of the model or the training set, which is the most
| important thing.
| alexandreyc wrote:
| Yes, I totally agree; it's not what I would qualify as open
| source.
|
| Do you plan to release the training code along with the
| research paper? What about the dataset?
|
| In any case, it's very neat to have ML-based technique and
| lightweight model for such tasks!
| tempay wrote:
| Is there anything about the performance on unknown files?
|
| I've tried a few that aren't "basic" but are widely used
| enough to be well supported in libmagic and it thinks they're
| zip files. I know enough about the underlying formats to know
| they're not using zip as a container under-the-hood.
| kevincox wrote:
| Apparently the Super Mario Bros. 3 ROM is 100% a SWF file.
|
| Cool that you can use it online though. Might end up using it
| like that. Although it seems like it may focus on common
| formats.
| diimdeep wrote:
| > Magika: AI powered fast and efficient file type identification
|
| of 116 file types with proprietary puny model with no training
| code and no dataset.
|
| > We are releasing a paper later this year detailing how the
| Magika model was trained and its performance on large datasets.
|
| And? How do you advance the industry with this googleblog post
| and source code that is useless without the closed-source
| model? All I see here is a loud marketing name and loud
| promises, but barely anything actually useful. Hooly rooftop
| characters sideproject?
| secondary_op wrote:
| Why is this piece of code being sold as open source, when in
| reality it just calls into a proprietary ML blob that is tiny
| and useless, the actual source of the model is closed, and a
| properly useful large model doesn't exist?
| KOLANICH wrote:
| Not into proprietary, the blob is within an Apache-licensed
| repo. Though there was no code to train it, but the repo
| contains some info allowing to recreate the code training it.
| Basically a JSON-based configs containing graph architecture.
| Even if you didn't have them, the repo contains an ONNX model,
| from which one can devise the architecture.
| flohofwoe wrote:
| I wonder how it performs with detecting C vs C++ vs ObjC vs
| ObjC++ and for bonus points: the common C/C++ subset (which is an
| incompatible C fork), also extra bonus points for detecting
| language version compatibility (e.g. C89 vs C99 vs C11...).
|
| Separating C from C++ and ObjC is something the file type
| detection on Github has traditionally had problems with (though
| it has been getting dramatically better over time); I would
| expect an "AI-powered" solution trained on the entire internet
| to do better right from the start.
|
| The list here doesn't even mention any of those languages except
| C though:
|
| https://github.com/google/magika/blob/main/docs/supported-co...
| andrewstuart wrote:
| Very useful.
|
| I wrote an editor that needed file type detection but the results
| of traditional approaches were flaky.
| 20after4 wrote:
| I just want to say thank you for the release. There are quite a
| lot of complaints in the comments but I think this is a useful
| and worthwhile contribution and I appreciate the authors for
| going through the effort to get it approved for open source
| release. It would be great if the model training data was
| included (or at least documentation about how to reproduce it),
| but that doesn't preclude this being useful. Thanks!
| Someone wrote:
| If their "Exif Tool" is https://exiftool.org/ (what else could it
| be?), I don't understand why they included it in their tests.
| Also, how does ExifTool recognize Python and html files?
| Andugal wrote:
| I have a question: Is something like Magika enough to check if a
| file is malicious or not?
|
| Example: users can upload PNG files (and only PNG is accepted).
| If Magika detects that the file is a PNG, does this mean the
| file is clean?
| cjg wrote:
| > does this mean the file is clean?
|
| No.
| TacticalCoder wrote:
| If that PNG of yours is not just an example, note that you can
| easily detect whether the PNG has any extra data (which may or
| may not indicate an attempt at mischief) and reject the (rare)
| PNGs with extra data. I ran a script checking the thousands of
| PNGs on my system and found three with extra data, all three
| probably due to the "PNG acropalypse" bug (but mischief cannot
| be ruled out).
|
| P.S: btw I'm not implying that extra data which shouldn't be
| there is the only way to make a malicious PNG.
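The trailing-data check described above is easy to sketch with the stdlib alone: a well-formed PNG ends immediately after its 12-byte IEND chunk (4-byte length, 4-byte type, 4-byte CRC), so any bytes past that point are extra data. This is an illustrative sketch, not the commenter's actual script:

```python
import struct
import zlib

def png_trailing_bytes(data: bytes) -> bytes:
    """Return whatever follows the IEND chunk (b"" if nothing)."""
    if not data.startswith(b"\x89PNG\r\n\x1a\n"):
        raise ValueError("not a PNG")
    pos = 8  # skip the signature
    while pos + 8 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        pos += 12 + length  # length + type + payload + CRC
        if ctype == b"IEND":
            return data[pos:]
    raise ValueError("no IEND chunk found")

def chunk(ctype: bytes, payload: bytes) -> bytes:
    """Assemble a PNG chunk with its CRC."""
    crc = zlib.crc32(ctype + payload)
    return (struct.pack(">I", len(payload)) + ctype + payload
            + struct.pack(">I", crc))

# A minimal (not meaningfully decodable) PNG for demonstration:
png = (b"\x89PNG\r\n\x1a\n"
       + chunk(b"IHDR", b"\x00" * 13)
       + chunk(b"IEND", b""))
print(png_trailing_bytes(png))                     # b''
print(png_trailing_bytes(png + b"hidden payload")) # b'hidden payload'
```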
| nicklecompte wrote:
| This comment from kevincox[1] says the answer is a hard "no":
|
| > Worse it seems that for unknown formats it confidently claims
| that it is one of the known formats. Rather than saying
| "unknown" or "binary data".
|
| There are other comments in this thread that make me think
| Google contaminated their test data with training data and the
| 99% results should not be taken at face value. OTOH I am not
| particularly surprised that Magika would be better than the
| other tools at distinguishing _semi-unstructured plain text_
| e.g. Java source vs. C++ source or YAMLs versus INIs. But that
| 's a very different use case than many security applications.
| The comments here suggest Magika is especially susceptible to
| binary obfuscation.
|
| [1] https://news.ycombinator.com/item?id=39395677
| kevincox wrote:
| The only way to do this reliably is to render the PNG to pixels
| then render it back to a PNG with a trusted encoder. Of course
| now you are taking the risk of vulnerabilities in the "render
| to pixels" step. But the result will be clean.
|
| AKA parse, don't validate.
| TacticalCoder wrote:
| To me the obvious use case is to first use the _file_ command but
| then, when _file_ returns "DATA" (meaning it couldn't guess the
| file type), call _magika_.
|
| I guess I'll be writing a wrapper (only for when using my shell
| in interactive mode) around _file_ doing just that when I come
| back from vacation. I hate it when _file_ cannot do its thing.
|
| Put it this way: I use _file_ a lot and I know at times it cannot
| detect a filetype. But is _file_ often wrong when it does have a
| match? I don 't think so...
|
| So in most of the cases I'd have _file_ correctly give me the
| filetype, very quickly but then in those rare cases where _file_
| cannot find anything, I 'd then use the slower but apparently
| more capable _magika_.
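The fallback wrapper described above could be sketched like this. It assumes both the `file` and `magika` CLIs are on PATH, and that `file --brief` prints a bare "data" for unidentified files (the exact string may vary by version):

```python
import subprocess

def needs_fallback(file_output: str) -> bool:
    # `file --brief` prints only the description; a bare "data"
    # means it couldn't identify the type.
    return file_output.strip() == "data"

def identify(path: str) -> str:
    """Try `file` first; fall back to `magika` only on "data"."""
    out = subprocess.run(["file", "--brief", path],
                         capture_output=True, text=True).stdout
    if needs_fallback(out):
        # Slower, but (per TFA) may recognize what `file` missed.
        out = subprocess.run(["magika", path],
                             capture_output=True, text=True).stdout
    return out.strip()

print(needs_fallback("data\n"))      # True  -> would call magika
print(needs_fallback("ASCII text"))  # False -> file's answer stands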
| SnowflakeOnIce wrote:
| I have seen 'file' misclassify many things when running it at
| large scale (millions of files) from a hodgepodge of sources.
| Unrelated types getting called 'GPG Private Keys', for example.
|
| For textual data types, 'file' gets confused often, or doesn't
| give a precise type. GitHub's 'linguist' [1] tool does much
| better here, but is structured in such a way that it is
| difficult to call it on an arbitrary file or bytestring that
| doesn't reside in a git repo.
|
| I'd love to have a classification tool that can more granularly
| classify textual files! It may not be Magika _today_ since it
| only supports 116-something types. For this use case, an ML-
| based approach will be more successful than an approach based
| solely on handwritten heuristic rules. I'm excited to see where
| this goes.
| lakomen wrote:
| Why? Just check the damn headers. Why do you need a power hungry
| and complicated AI model to do it? Why?
| TomNomNom wrote:
| This looks cool. I ran this on some web crawl data I have
| locally, so: all files you'd find on regular websites; HTML, CSS,
| JavaScript, fonts etc.
|
| It identified some simple HTML files (html, head, title, body, p
| tags and not much else) as "MS Visual Basic source (VBA)", "ASP
| source (code)", and "Generic text document" where the `file`
| utility correctly identified all such examples as "HTML document
| text".
|
| Some woff and woff2 files it identified as "TrueType Font Data",
| others are "Unknown binary data (unknown)" with low confidence
| guesses ranging from FLAC audio to ISO 9660. Again, the `file`
| utility correctly identifies these files as "Web Open Font
| Format".
|
| I like the idea, but the current implementation can't be relied
| on IMO; especially not for automation.
|
| A minor pet peeve also: it doesn't seem to detect when its
| output is a pipe and strip the shell colour escapes, resulting
| in `^[[1;37` and `^[[0;39m` wrapping every line if you pipe the
| output into a vim buffer or similar.
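The usual fix for that pet peeve is an `isatty()` check before emitting colour codes. A hedged sketch of what such a CLI could do:

```python
import io
import re
import sys

# Colour codes like ^[[1;37m are "\x1b[1;37m" on the wire.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

def emit(line: str, stream=None) -> str:
    """Strip colour escapes unless writing to an interactive tty."""
    stream = sys.stdout if stream is None else stream
    if not stream.isatty():
        line = ANSI_RE.sub("", line)
    return line

coloured = "\x1b[1;37mfile.jpg: JPEG image data (image)\x1b[0;39m"
# A StringIO is not a tty, so the escapes get stripped:
print(emit(coloured, io.StringIO()))  # file.jpg: JPEG image data (image)
```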
| michaelmior wrote:
| > the current implementation can't be relied on IMO
|
| What's your reasoning for not relying on this? (It seems to me
| that this would be application-dependent at the very least.)
| jdiff wrote:
| I'm not the person you asked, but I'm not sure I understand
| your question and I'd like to. It whiffed multiple common
| softballs, to the point it brings into question the claims
| made about its performance. What reasoning is there to trust
| it?
| michaelmior wrote:
| > It whiffed multiple common softballs
|
| I must have missed this in the article. Where was this?
| jdiff wrote:
| ...It's in the comment you were responding to. Directly
| above the section you quoted.
| TomNomNom wrote:
| It provided the wrong file-types for some files, so I cannot
| rely on its output to be correct.
|
| If you wanted to, for example, use this tool to route
| different files to different format-specific handlers it
| would sometimes send files to the wrong handlers.
| michaelmior wrote:
| Except a 100% correct implementation doesn't exist AFAIK.
| So if I want to do anything that makes a decision based on
| the type of a file, I have to pick _some_ algorithm to do
| that. If I can do that correctly 99% of the time, that 's
| better than not being able to make that decision at all,
| which is where I'm left if a perfect implementation doesn't
| exist.
| jdiff wrote:
| Nobody's asking for perfection. But the AI is offering
| inexplicable and obvious nondeterministic mistakes that
| the traditional algorithms don't suffer from.
|
| Magika goes wrong and your fonts become audio files and
| nobody knows why. Magic goes wrong and your ZIP-based
| documents get mistaken for generic ZIP files. If you work
| with that edge case a lot, you can anticipate it with
| traditional algorithms. You can't anticipate
| nondeterministic hallucination.
| jsnell wrote:
| Where are you getting the non-determinism part from? It
| would seem surprising for there to be anything non-
| deterministic about an ML model like this, and nothing in
| the original reports seems to suggest that either.
| TeMPOraL wrote:
| Large ML models tend to be unavoidably non-deterministic
| simply from doing lots of floating point math in parallel.
| Addition and multiplication of floats are commutative but not
| associative - you may get different results depending on the
| order in which you add/multiply numbers.
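The order-dependence of floating-point summation is easy to demonstrate:

```python
# Floating-point addition is not associative, so the grouping of
# a parallel reduction can change the bits of the result.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```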
| Gormo wrote:
| > It would seem surprising for there to be anything non-
| deterministic about an ML model like this
|
| I think there may be some confusion of ideas going in
| here. Machine learning is fundamentally stochastic, so it
| is non-deterministic almost by definition.
| ebursztein wrote:
| Thanks for the feedback -- we will look into it. If you can
| share with us the list of URLs, that would be very helpful so
| we can reproduce -- send us an email at magika-dev@google.com
| if that is possible.
|
| For crawling we have planned a head-only model to avoid
| fetching the whole file, but it is not ready yet -- we weren't
| sure what use cases would emerge, so it is good to know that
| such a model might be useful.
|
| We mostly use Magika internally to route files for AV scanning,
| as we wrote in the blog post, so it is possible that despite
| our best efforts to test Magika extensively on various file
| types it is not as good on font formats as it should be. We
| will look into it.
|
| Thanks again for sharing your experience with Magika this is
| very useful.
| TomNomNom wrote:
| Sure thing :)
|
| Here's[0] a .tgz file with 3 files in it that are
| misidentified by magika but correctly identified by the
| `file` utility: asp.html, vba.html, unknown.woff
|
| These are files that were in one of my crawl datasets.
|
| [0]: https://poc.lol/files/magika-test.tgz
| ebursztein wrote:
| Thank you - we are adding them to our test suite for the
| next version.
| TomNomNom wrote:
| Super, thank you! I look forward to it :)
|
| I've worked on similar problems recently so I'm well
| aware of how difficult this is. An example I've given
| people is automatically detecting base64-encoded data.
| It _seems_ easy at first, but any four, eight, or twelve
| (etc) letter word is technically valid base64, so you
| need to decide if and how those things should be
| excluded.
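The base64 false-positive problem above is easy to reproduce with the stdlib: any string over the base64 alphabet whose length is a multiple of 4 decodes without error.

```python
import base64
import binascii

# Ordinary four- and eight-letter words pass a strict validity
# check, so "is it valid base64?" is not the same question as
# "is it base64-encoded data?".
for word in ["data", "Tool", "abcdefgh"]:
    print(word, "->", base64.b64decode(word, validate=True))

# Length is the main thing that gets rejected outright:
try:
    base64.b64decode("hello", validate=True)
except binascii.Error as e:
    print("rejected:", e)
```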
| beeboobaa wrote:
| Do you have permission to redistribute these files?
| IvyMike wrote:
| You are asking what if this guy has "web crawl data" that
| google does not have?
|
| And what if he says no, he does not have permission.
| beeboobaa wrote:
| > You are asking what if this guy has "web crawl data"
| that google does not have?
|
| No, I'm asking if he has permission to redistribute these
| files.
| timschmidt wrote:
| Are you attempting to assert that use of these files
| solely for the purpose of improving a software system
| meant to classify file types does not fall under fair
| use?
|
| https://en.wikipedia.org/wiki/Fair_use
| beeboobaa wrote:
| I'm asking a question.
|
| Here's another one for you: Do you believe that all
| pictures you have ever taken, all emails you have ever
| written, all code you have ever written could be posted
| here on this forum to improve someone else's software
| system?
|
| If so, could you go ahead and post that zip? I'd like to
| ingest it in my model.
| timschmidt wrote:
| Your question seems orthogonal to the situation. The
| three files posted seem to be the minimum amount of
| information required to reproduce the bug. Fair use
| encompasses a LOT of uses of otherwise copyrighted work,
| and this seems clearly to be one.
| beeboobaa wrote:
| I don't see how publicly posting them on a forum is
|
| > the minimum amount of information required to reproduce
| the bug
|
| MAYBE if they had communicated privately that'd be an
| argument that made sense.
| timschmidt wrote:
| So you don't think that software development which
| happens in public web forums deserve fair use protection?
| beeboobaa wrote:
| That's an interesting way to frame "publicly posted
| someone else's data without their consent for anyone to
| see and download"
| timschmidt wrote:
| I notice you're so invested that you haven't noticed that
| the files have been renamed and zipped such that they're
| not even indexable. How you'd expect anyone not
| participating in software development to find them is yet
| to be explained.
| beeboobaa wrote:
| I notice you're so invested you keep coming up with
| imaginary scenarios that you pretend somehow matter, lol
| timschmidt wrote:
| Have fun, buddy!
| jdiff wrote:
| It's three files that were scraped from (and so publicly
| available on) the web. That's not at all similar to your
| strawful analogy.
| timschmidt wrote:
| I'm over here trying to fathom the lack of control over
| one's own life it would take to cause someone to turn
| into an online copyright cop, when the data in question
| isn't even their own, is clearly divorced from any
| context which would make it useful for anything other
| than fixing the bug, and about which the original
| copyright holder hasn't complained.
|
| Some people just want to argue.
|
| If the copyright holder has a problem with the use, they
| are perfectly entitled to spend some of their dollar
| bills to file a law suit, as part of which the contents
| of the files can be entered into the public record for
| all to legally access, as was done with Scientology.
|
| I don't expect anyone would be so daft.
| beeboobaa wrote:
| Literally just asked a question and that seems to have
| set you off, bud. Are you alright? Do you need to feed
| your LLM more data to keep it happy?
| timschmidt wrote:
| I'm always happy to stand up for folks who make things
| over people who want to police them. Especially when
| nothing wrong has happened. Maybe take a walk and get
| some fresh air?
| westurner wrote:
| What is the MIME type of a .tar file; and what are the MIME
| types of the constituent concatenated files within an archive
| format like e.g. tar?
|
| hachoir/subfile/main.py: https://github.com/vstinner/hachoir/
| blob/main/hachoir/subfil...
|
| File signature: https://en.wikipedia.org/wiki/File_signature
|
| PhotoRec: https://en.wikipedia.org/wiki/PhotoRec
|
| "File Format Gallery for Kaitai Struct"; 185+ binary file
| format specifications: https://formats.kaitai.io/
|
| Cross-reference table: https://formats.kaitai.io/xref.html
|
| AntiVirus software > Identification methods > Signature-based
| detection, Heuristics, and _ML /AI data mining_: https://en.w
| ikipedia.org/wiki/Antivirus_software#Identificat...
|
| Executable compression; packer/loader:
| https://en.wikipedia.org/wiki/Executable_compression
|
| Shellcode database > MSF:
| https://en.wikipedia.org/wiki/Shellcode_database
|
| sigtool.c: https://github.com/Cisco-
| Talos/clamav/blob/main/sigtool/sigt...
|
| clamav sigtool:
| https://www.google.com/search?q=clamav+sigtool
|
| https://blog.didierstevens.com/2017/07/14/clamav-sigtool-
| dec... : sigtool --find-sigs "$name" |
| sigtool --decode-sigs
|
| List of file signatures:
| https://en.wikipedia.org/wiki/List_of_file_signatures
|
| And then also clusterfuzz/oss-fuzz scans .txt source files
| with (sandboxed) Static and Dynamic Analysis tools, and
| `debsums`/`rpm -Va` verify that files on disk have the same
| (GPG signed) checksums as the package they are supposed to
| have been installed from, and a file-based HIDS builds a
| database of file hashes and compares what's on disk in a
| later scan with what was presumed good, and ~gdesktop LLM
| tools scan every file, and there are extended filesystem
| attributes for _label_ -based MAC systems like SELinux, oh
| and NTFS ADS.
|
| A sufficient cryptographic hash function yields random bits
| with uniform probability. DRBG Deterministic Random Bit
| Generators need high entropy random bits in order to
| continuously re-seed the RNG random number generator. Is it
| safe to assume that hashing (1) every file on disk, or (2)
| any given file on disk at random, will yield random bits with
| uniform probability; and (3) why Argon2 instead of e.g. only
| two rounds of SHA256?
|
| https://github.com/google/osv.dev/blob/master/README.md#usin.
| .. :
|
| > _We provide a Go based tool that will scan your
| dependencies, and check them against the OSV database for
| known vulnerabilities via the OSV API._ ... With package
| metadata, not (a file hash, package) database that could be
| generated from OSV and the actual package files instead of
| their manifest of already-calculated checksums.
|
| Might as well be heating a pool on the roof with all of this
| waste heat from hashing binaries build from code of unknown
| static and dynamic quality.
|
| Add'l useful formats:
|
| > _Currently it is able to scan various lockfiles, debian
| docker containers, SPDX and CycloneDB SBOMs, and git
| repositories_
|
| Things like bittorrent magnet URIs, Named Data Networking,
| and IPFS are (file-hash based) "Content addressable storage":
| https://en.wikipedia.org/wiki/Content-addressable_storage
| nayuki wrote:
| The name sounds like the Pokemon Magikarp or the anime series
| Madoka Magica.
| pier25 wrote:
| I use FFMPEG to detect if uploaded files are valid audio files.
| Would this be much faster?
| omni wrote:
| Can someone please help me understand why this is useful? The
| article mentions malware scanning applications, but if I'm
| sending you a malicious PDF, won't I want to clearly mark it with
| a .pdf extension so that you open it in your PDF app? Their
| examples are all very obvious based on file extensions.
| chromaton wrote:
| It can't correctly identify a DXF file in my testing. It
| categorizes it as plain text.
| Eiim wrote:
| I ran a quick test on 100 semi-random files I had laying around.
| Of those, 81 were detected correctly, 6 were detected as the
| wrong file type, and 12 were detected with an unspecific file
| type (unknown binary/generic text) when a more specific type
| existed. In 4 of the unspecific cases, a low-confidence guess
| was provided, which was wrong in each case.
|
| However, almost all of the files which were detected
| wrong/unspecific are of types not supported by Magika, with one
| exception of a JSON file containing a lot of JS code as text,
| which was detected as JS code.
|
| For comparison, file 5.45 (the version I happened to have
| installed) got 83 correct, 6 wrong, and 10 not specific. It
| detected the weird JSON correctly, but also had its own strange
| issues, such as detecting a CSV as just "data".
|
| The "wrong" here was somewhat skewed by the 4 GLSL shader code
| files that were in the dataset for some reason, all of which it
| detected as C code (Magika called them unknown). The other two
| "wrong" detections were also code formats that it seems it
| doesn't support. It was also able to output a lot more
| information about the media files.
|
| Not sure what to make of these tests but perhaps they're useful
| to somebody.
| pizzalife wrote:
| > The "wrong" here was somewhat skewed by the 4 GLSL shader
| code files that were in the dataset for some reason, all of
| which it detected as C code
|
| To be fair though, a snippet of GLSL shader code can be
| perfectly valid C.
| Eiim wrote:
| Indeed, which is why I felt the need to call it out here. I'm
| not certain if the files in question actually happened to be
| valid C, but whether that's a meaningful mistake regardless is
| left to the reader to decide.
| Delumine wrote:
| Voidtools - Everything.. looking at you to implement this
| init0 wrote:
| Why not detect it by checking the magic number of the buffer?
| dagmx wrote:
| Not every file has one for starters and many can be incorrect.
|
| Especially in the context of use as a virus scanner, you don't
| trust what the file says it is
| YoshiRulz wrote:
| So instead of spending some of their human resources to improve
| libmagic, they used some of their computing power to create an
| "open source" neural net, which is technically more accurate than
| the "error-prone" hand-written rules (ignoring that it supports
| far fewer filetypes), and which is much less effective in an
| adversarial context, and they want it to "help other software
| improve their file identification accuracy," which of course it
| can't since neural nets aren't introspectable. Thanks guys.
| 12_throw_away wrote:
| Come on, can't you help but be impressed by this amazing AI
| tech? That gives us sci-fi tools like ... a less-accurate,
| incomplete, stochastic, un-debuggable, slower, electricity-
| guzzling version of `file`.
| og_kalu wrote:
| >So instead of spending some of their human resources to
| improve libmagic
|
| A large megacorp can work on multiple things at once.
|
| >an "open source" neural net, which is technically more
| accurate than the "error-prone" hand-written rules (ignoring
| that it supports far fewer filetypes)
|
| You say that like it's a contradiction but it's not.
|
| >and which is much less effective in an adversarial context,
|
| Is it? This seems like an assumption.
|
| >and they want it to "help other software improve their file
| identification accuracy," which of course it can't since neural
| nets aren't introspectable.
|
| Being introspectable or not has no bearing on the accuracy of a
| system.
| breather wrote:
| Can we please god stop using AI like it's a meaningful word? This
| is really interesting technology; it's hamstrung by association
| with a predatory marketing term.
| goshx wrote:
| I used an HTML file and added JPEG magic bytes to its header:
|
| magika file.jpg
|
| file.jpg: JPEG image data (image)
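goshx's experiment is easy to reproduce: prepend the three-byte JPEG signature to an HTML document, and a detector that leans on leading bytes will report a JPEG. A sketch of the file construction only (the HTML payload is made up for illustration; the `magika` invocation is as shown in the comment):

```python
# Build a spoofed file: JPEG magic bytes followed by HTML content.
# A detector that trusts the leading signature will call this a JPEG.
JPEG_MAGIC = b"\xff\xd8\xff"
html = b"<!DOCTYPE html><html><body><p>not an image</p></body></html>"

with open("file.jpg", "wb") as f:
    f.write(JPEG_MAGIC + html)
```

Per the comment, Magika reported this file as "JPEG image data (image)" as well.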
| runxel wrote:
| Took a .dxf file and fed it to Magika. It says with 97%
| confidence that it must be a PowerShell file. A classic .dwg
| could be "mscompress" (whatever that is) at 81%, or a GIF. Both
| couldn't be further from the truth.
|
| Common files are categorized successfully - but well, yeah that's
| not really an achievement. Pretty much nothing more than a toy
| right now.
| queuebert wrote:
| The real problem with deep learning approaches is hallucination
| and edge case failures. When someone finally fixes this, I hope
| it makes the HN front page.
| lqcfcjx wrote:
| After reading through all the comments, honestly I still don't
| get the point of this system. What is the potential practical
| value or application of this model?
| jwithington wrote:
| I guess I'm kind of a dummy on this, but why is it impressive to
| identify that a .js file is Javascript, a .md file is Markdown,
| etc?
| playingalong wrote:
| Because it's done by inspecting the content, not the name of
| the file.
| woliveirajr wrote:
| Reminds me of when someone asked (on StackOverflow) how to
| recognize binaries for different architectures, like x86 or ARM-
| something or Apple M1 and so on.
|
| I suggested using the technique of NCD (normalized compression
| distance), based on Kolmogorov complexity. Cilibrasi, R. was one
| great researcher in this area, and I think he worked at Google
| at some point.
|
| Using AI seems to follow the same path: "learn" what represents
| some specific file and then compare the unknown file to those
| references (AI: all the parameters; NCD: compression against a
| known type).
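The NCD technique woliveirajr describes classifies an unknown file by how well it compresses together with reference samples: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. A sketch using zlib as the compressor (the reference labels and `classify` helper are made up for illustration):

```python
import zlib

def C(data: bytes) -> int:
    """Compressed length under zlib, a proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 = similar, near 1 = unrelated."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(unknown: bytes, references: dict[str, bytes]) -> str:
    """Pick the reference label whose sample is closest under NCD."""
    return min(references, key=lambda label: ncd(unknown, references[label]))
```

The intuition: if the unknown file shares structure with a reference, compressing them concatenated costs little more than compressing the reference alone, so the distance is small.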
| aidenn0 wrote:
| But will it let you print on Tuesday[1]?
|
| 1:
| https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161...
| queuebert wrote:
| For a subscription fee.
| jjsimpso wrote:
| I wrote an implementation of libmagic in Racket a few years ago
| (https://github.com/jjsimpso/magic). File type identification
| is a pretty interesting topic.
|
| As others have noted, libmagic detects many more file types than
| Magika, but I can see Magika being useful for text files in
| particular, because anything written by humans doesn't have a
| rigid format.
___________________________________________________________________
(page generated 2024-02-16 23:01 UTC)