[HN Gopher] Exploiting machine learning Pickle files
___________________________________________________________________
Exploiting machine learning Pickle files
Author : ingve
Score : 60 points
Date : 2021-03-17 10:45 UTC (1 days ago)
(HTM) web link (blog.trailofbits.com)
(TXT) w3m dump (blog.trailofbits.com)
| a-dub wrote:
| why doesn't python have first class support for serde? it's such
| a basic and core function.
| mrguyorama wrote:
| Our team is literally running into this right now. XGBoost claims
| their non-pickle implementation of model file (which is just
| json) is "experimental"
| a-dub wrote:
| been there, done that (json parameter storage). it's slow as
| molasses and precision loss from storing floats as text can
| cause issues.
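|
| A minimal sketch of where the float precision loss tends to come
| from, assuming Python and float64 values. Python's own json
| module round-trips floats exactly; the loss shows up when a
| writer formats with a fixed number of digits. The value and the
| format string below are illustrative choices, not anything
| XGBoost is known to do.

```python
import json

x = 0.1 + 0.2  # stored as 0.30000000000000004 in float64

# json.dumps writes the shortest repr that parses back to the same float.
json_roundtrip = json.loads(json.dumps(x))

# Fixed-precision formatting (a common hand-rolled choice) drops digits.
truncated = float(f"{x:.6g}")
lossy = truncated != x

# float.hex() is an exact text encoding when the writer is not under your control.
exact = float.fromhex(x.hex())
```

| So text storage is lossless as long as every writer emits enough
| digits; the speed complaint is a separate issue.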
| lunixbochs wrote:
| This was exploited by one team during an ML challenge (ai-han-
| solo) at Defcon CTF Finals 2019. Dropped a meterpreter every time
| you loaded that team's tensorflow model.
| tyingq wrote:
| Was surprised to learn that it's used in ML models. I was under
| the impression that it's pretty slow[1]. Maybe it's used here
| because it's Python aware, and doesn't have trouble saving
| complex data structures?
|
| [1] https://www.benfrederickson.com/dont-pickle-your-data/
| kvathupo wrote:
| As someone with a toe in the deep learning research space, were
| you to look at commonly used ML code, then you'd find software
| engineering problems that are far bigger than just pickling. I
| think it underscores the distinction between _computer science_
| and _software engineering_ ; that is, the theoreticians of the
| former and those who actually deploy them in the latter.
|
| Researchers, especially sleep-deprived grad students, have
| borderline unreadable code for papers since they don't care
| about deployment. I'd imagine the enterprise engineers who
| create development pipelines, however, take such risks into
| consideration.
| josephorjoe wrote:
| Having deployed such code in the past, I have some very
| unhappy memories of list comprehensions with a bunch of
| single letter variables.
|
| Nothing like tracking a bug to 1 line and finding out the
| line does 12 different things.
| krallistic wrote:
| For most use cases the performance cost of loading the model is
| pretty low compared to either the training cost or making
| thousands of inference calls (when used in an API). Maybe it
| matters if you do AWS Lambda with ML models, but mostly pickle
| performance is absolutely fine.
|
| BUT the security problems still remain and weigh much higher
| wodenokoto wrote:
| I'm surprised pickled models are used for sharing with 3rd
| parties.
|
| But internally in projects I see it used all the time. It's
| easy and it works and you trust internal code.
| ori_b wrote:
| Internal code has a habit of becoming external code.
| nonameiguess wrote:
| Yeah, I'm also surprised by this. The standard library itself
| explicitly warns you in a big red box that pickle is not
| secure. It also doesn't support NumPy, though NumPy has its own
| native persistence modules. I'd have expected people to be
| using something like hdf5 for this.
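|
| A minimal, harmless sketch of why that warning exists: an object
| can tell pickle, via __reduce__, to call any importable callable
| when the file is loaded. The Payload class is hypothetical and
| the callable here is just print; a real attack would name
| something like os.system instead.

```python
import pickle

class Payload:
    # pickle stores the (callable, args) pair returned here and
    # calls it at load time instead of rebuilding a Payload.
    def __reduce__(self):
        return (print, ("this ran inside pickle.loads",))

blob = pickle.dumps(Payload())

# Loading calls print(...); the return value of that call (None)
# is what comes back, not a Payload instance.
result = pickle.loads(blob)
```

| Nothing about the loading side opts in to this, which is why
| loading an untrusted pickle amounts to running untrusted code.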
| hprotagonist wrote:
| the soopar geniuses at facebook clearly know better! /s
|
| (in truth, because pytorch serialised models can include
| python code for jit scripts, it's nonobvious what a good way
| to store python code is -- but torch has recently moved to a
| zipfile impl as of 1.6:
| https://pytorch.org/docs/stable/generated/torch.save.html)
|
| >I'd have expected people to be using something like hdf5 for
| this.
|
| Amusingly, matlab was _way_ ahead of the pack here; matfiles
| have been hdf5 since r2006b, back when we just called it
| "matlab 7.3".
| a-dub wrote:
| yep! in past projects i've actually used the .mat file
| format for both C and python bits because their wrapping of
| hdf5 is sufficiently well done.
|
| solved a lot of problems! :)
| liuliu wrote:
| Soumith and I had discussions on hdf5 in the beginning of the
| AI renaissance (around 2014?). The Torch maintainers were
| definitely aware of better formats at that time (the original
| Torch in Lua has high-level File APIs for serialization).
|
| It became muddy when we moved from Caffe / TensorFlow to
| dynamic models with PyTorch, where it is harder to see how to
| persist a model (which means both the executable objects and
| the weights) efficiently and safely.
|
| At the end of the day, I think "export" and "checkpointing"
| should be two different things. An "exported" model should be
| safe to deploy and run on platforms like Azure ML while a
| "checkpointing" model should be treated like code and
| everything goes. That is probably where ONNX should be (for
| exporting).
| ogrisel wrote:
| pickle.dump/load is only slow if your main object has
| references to many small nested objects: e.g. a large Python
| dict with millions of key-value pairs that are small Python str
| or int objects.
|
| If your main object has only references to a few large sub-
| objects (e.g. a bunch of multi-MB or GB numpy arrays to store
| the numerical parameters of a machine learning model), then it
| can be very fast, basically IO-bottlenecked by writing or
| reading the bytes to/from the disk.
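|
| A rough way to see the difference, as a sketch using only the
| standard library. The sizes are arbitrary, and a plain bytes
| buffer stands in for a large numpy array, so the numbers are
| only indicative.

```python
import pickle
import time

def roundtrip_seconds(obj):
    """Time one pickle.dumps + pickle.loads cycle and check the result."""
    start = time.perf_counter()
    restored = pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    elapsed = time.perf_counter() - start
    assert restored == obj
    return elapsed

# Many small objects: every key and value is pickled individually.
many_small = {str(i): i for i in range(200_000)}

# One large object: a single buffer of 1.6 MB written in one go.
one_large = bytes(8 * 200_000)

small_time = roundtrip_seconds(many_small)
large_time = roundtrip_seconds(one_large)
print(f"many small objects: {small_time:.4f}s, one large buffer: {large_time:.4f}s")
```

| On a typical machine the dict case should take noticeably
| longer, even though both objects hold a comparable amount of
| data.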
| tyingq wrote:
| Interesting. The link I posted is a lot (100k) of small
| 4-field objects, but they aren't nested.
| wendythehacker wrote:
| What's even worse is that ML frameworks (also newer ones) don't
| have or support built-in authenticity/integrity checking when
| loading a model and its architecture. Developers have to build
| their own solutions, like checking a hash or signature themselves
| - very few do.
|
| This threat model of an ML system is also quite interesting; it
| highlights the various security challenges a typical ML system
| faces: https://embracethered.com/blog/posts/2020/husky-ai-threat-
| mo...
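|
| The "checking a hash" approach can be sketched as below. The
| function name load_if_trusted and the idea of a pinned digest
| published through a separate trusted channel are assumptions
| for illustration, not part of any framework's API.

```python
import hashlib
import hmac
import pickle

def load_if_trusted(blob, expected_sha256):
    """Unpickle blob only if its SHA-256 digest matches the pinned value."""
    actual = hashlib.sha256(blob).hexdigest()
    if not hmac.compare_digest(actual, expected_sha256):
        raise ValueError("model digest mismatch; refusing to load")
    return pickle.loads(blob)

model_bytes = pickle.dumps({"weights": [0.25, 0.5]})
pinned = hashlib.sha256(model_bytes).hexdigest()  # assumed to come from a trusted channel

model = load_if_trusted(model_bytes, pinned)

# A modified file fails the check before pickle ever sees it.
try:
    load_if_trusted(model_bytes + b"tampered", pinned)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

| A matching digest only shows the file is the one the publisher
| hashed; it says nothing about whether the publisher's pickle is
| itself benign.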
| roddux wrote:
| Something you won't gather from skim-reading the headline is that
| the author has also created a tool, Fickling:
| https://github.com/trailofbits/fickling - to aid in playing
| around with pickle files.
|
| From the article: _[Fickling] can help you reverse engineer,
| test, and even create malicious pickle files._
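|
| This is not Fickling's API, but the standard library's
| pickletools module gives a taste of the same idea: walking the
| opcodes of a pickle without executing it. The SUSPICIOUS set and
| the suspicious_opcodes helper below are hypothetical names for
| this sketch.

```python
import pickle
import pickletools

# Opcodes that import names or call objects; plain containers never need them.
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "INST", "OBJ",
              "NEWOBJ", "NEWOBJ_EX", "REDUCE", "BUILD"}

def suspicious_opcodes(blob):
    """Return the import/call opcodes found in a pickle, without loading it."""
    return [op.name for op, arg, pos in pickletools.genops(blob)
            if op.name in SUSPICIOUS]

class Payload:
    # Hypothetical crafted object: asks pickle to call print at load time.
    def __reduce__(self):
        return (print, ("hello from a pickle",))

plain = pickle.dumps({"weights": [1.0, 2.0]}, protocol=4)
crafted = pickle.dumps(Payload(), protocol=4)

plain_hits = suspicious_opcodes(plain)
crafted_hits = suspicious_opcodes(crafted)
```

| Legitimate models that pickle class instances or numpy arrays
| use these opcodes too, so a hit is a prompt to look closer
| rather than proof of an attack.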
___________________________________________________________________
(page generated 2021-03-18 23:02 UTC)