[HN Gopher] Exploiting machine learning Pickle files
       ___________________________________________________________________
        
       Exploiting machine learning Pickle files
        
       Author : ingve
       Score  : 60 points
       Date   : 2021-03-17 10:45 UTC (1 days ago)
        
 (HTM) web link (blog.trailofbits.com)
        
       | a-dub wrote:
       | why doesn't python have first class support for serde? it's such
       | a basic and core function.
        
       | mrguyorama wrote:
       | Our team is literally running into this right now. XGBoost claims
       | their non-pickle model file format (which is just JSON) is
       | "experimental".
        
         | a-dub wrote:
         | been there, done that (json parameter storage). it's slow as
         | molasses and precision loss from storing floats as text can
         | cause issues.
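
A minimal sketch of the float-as-text concern raised above, using only the standard library. The value `weight` and the 6-digit format are illustrative assumptions: whether precision is lost depends on how the writer formats numbers, not on JSON itself.

```python
import json

# Hypothetical model parameter (assumption for illustration only).
weight = 0.1 + 0.2

# Python's json module writes a float repr that round-trips a float64 exactly.
exact = json.loads(json.dumps({"w": weight}))["w"]

# A writer that keeps a fixed number of significant digits (here 6)
# discards the low-order bits of the value.
truncated = float(f"{weight:.6g}")

print(exact == weight)      # round trip through json preserves the value
print(truncated == weight)  # fixed-digit text does not
```

So precision loss is likely to come from writers that truncate digits or from converting between float widths, rather than from text storage as such.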
        
       | lunixbochs wrote:
       | This was exploited by one team during an ML challenge (ai-han-
       | solo) at Defcon CTF Finals 2019. Dropped a meterpreter every time
       | you loaded that team's tensorflow model.
        
       | tyingq wrote:
       | Was surprised to learn that it's used in ML models. I was under
       | the impression that it's pretty slow[1]. Maybe it's used here
       | because it's Python aware, and doesn't have trouble saving
       | complex data structures?
       | 
       | [1] https://www.benfrederickson.com/dont-pickle-your-data/
        
         | kvathupo wrote:
         | As someone with a toe in the deep learning research space, were
         | you to look at commonly used ML code, then you'd find software
         | engineering problems that are far bigger than just pickling. I
         | think it underscores the distinction between _computer science_
         | and _software engineering_: that is, between the theoreticians
         | of the former and those who actually deploy their work in the
         | latter.
         | 
         | Researchers, especially sleep-deprived grad students, have
         | borderline unreadable code for papers since they don't care
         | about deployment. I'd imagine the enterprise engineers who
         | create development pipelines, however, take such risks into
         | consideration.
        
           | josephorjoe wrote:
           | Having deployed such code in the past, I have some very
           | unhappy memories of list comprehensions with a bunch of
           | single letter variables.
           | 
           | Nothing like tracking a bug to 1 line and finding out the
           | line does 12 different things.
        
         | krallistic wrote:
         | For most use cases the performance cost of loading the model is
         | pretty low compared to either the training cost or making
         | thousands of inference calls (when used in an API). Maybe it
         | matters if you do AWS Lambda with ML models, but mostly pickle
         | performance is absolutely fine.
         | 
         | BUT the security problems still remain and carry much more
         | weight.
        
         | wodenokoto wrote:
         | I'm surprised pickled models are used for sharing with 3rd
         | parties.
         | 
         | But internally in projects I see it used all the time. It's
         | easy and it works and you trust internal code.
        
           | ori_b wrote:
           | Internal code has a habit of becoming external code.
        
         | nonameiguess wrote:
         | Yeah, I'm also surprised by this. The standard library itself
         | explicitly warns you in a big red box that pickle is not
         | secure. It also doesn't support NumPy, though NumPy has its own
         | native persistence modules. I'd have expected people to be
         | using something like hdf5 for this.
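
A minimal sketch of why the standard library warns about pickle, assuming a harmless `eval("6 * 7")` as a stand-in for a real attack such as a shell command. The names `Payload`, `AllowListUnpickler`, and `restricted_loads` are hypothetical; the `find_class` override follows the "restricting globals" pattern described in the pickle documentation.

```python
import io
import pickle


class Payload:
    """Stand-in for a malicious object embedded in a model file."""

    def __reduce__(self):
        # pickle records "call this callable with these arguments" and
        # performs the call at load time. eval("6 * 7") is a harmless
        # placeholder for something like os.system(...).
        return (eval, ("6 * 7",))


class AllowListUnpickler(pickle.Unpickler):
    """Refuses every global that is not explicitly allowed."""

    ALLOWED = {("builtins", "set"), ("builtins", "frozenset")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def restricted_loads(data: bytes):
    """Like pickle.loads, but only allow-listed globals may be resolved."""
    return AllowListUnpickler(io.BytesIO(data)).load()


blob = pickle.dumps(Payload())

# The callable runs during loading; the result is its return value,
# not a Payload instance.
result = pickle.loads(blob)

try:
    restricted_loads(blob)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

# Plain containers need no globals, so they still load.
plain = restricted_loads(pickle.dumps({"weights": [1.0, 2.0]}))
print(result, blocked, plain)
```

An allow-list narrows the attack surface but is not a guarantee of safety; formats that store only data avoid the problem more directly.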
        
           | hprotagonist wrote:
           | the soopar geniuses at facebook clearly know better! /s
           | 
           | (in truth, because pytorch serialised models can include
           | python code for jit scripts, it's nonobvious what a good way
           | to store python code is -- but torch has recently moved to a
           | zipfile impl as of 1.6:
           | https://pytorch.org/docs/stable/generated/torch.save.html)
           | 
           | >I'd have expected people to be using something like hdf5 for
           | this.
           | 
           | Amusingly, matlab was _way_ ahead of the pack here; matfiles
           | have been hdf5 since r2006b, back when we just called it
           | "matlab 7.3".
        
             | a-dub wrote:
             | yep! in past projects i've actually used the .mat file
             | format for both C and python bits because their wrapping of
             | hdf5 is sufficiently well done.
             | 
             | solved a lot of problems! :)
        
           | liuliu wrote:
           | Soumith and I had discussions on hdf5 in the beginning of the
           | AI renaissance (around 2014?). Torch maintainers were definitely
           | aware of better formats at that time (original Torch in Lua
           | has high-level File APIs for serialization).
           | 
           | It became muddy when we moved from Caffe / TensorFlow to
           | dynamic models with PyTorch, where it is harder to see how to
           | persist a model (which means both the executable objects and
           | the weights) efficiently and safely.
           | 
           | At the end of the day, I think "export" and "checkpointing"
           | should be two different things. An "exported" model should be
           | safe to deploy and run on platforms like Azure ML, while a
           | "checkpointed" model should be treated like code, where
           | anything goes. That is probably where ONNX should be (for
           | exporting).
        
         | ogrisel wrote:
         | pickle.dump/load is only slow if your main object has
         | references to many small nested objects: e.g. a large Python
         | dict with millions of key-value pairs that are small Python
         | str or int objects.
         | 
         | If your main object has only references to a few large sub-
         | objects (e.g. a bunch of multi-MB or GB numpy arrays to store
         | the numerical parameters of a machine learning model), then it
         | can be very fast, basically IO-bottlenecked by writing or
         | reading the bytes to/from the disk.
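
One way to see the point above without relying on noisy timings is to count pickle opcodes, since each small object costs the pickler at least one opcode. This is a rough sketch: the helper `opcode_count` is hypothetical, and a plain `bytes` buffer stands in for a NumPy array so the example needs only the standard library.

```python
import pickle
import pickletools


def opcode_count(obj) -> int:
    """Number of pickle opcodes needed to serialize obj (protocol 4)."""
    data = pickle.dumps(obj, protocol=4)
    return sum(1 for _ in pickletools.genops(data))


# Many small objects: 10,000 int keys mapped to short strings.
many_small = {i: str(i) for i in range(10_000)}

# One large object: a 1 MB byte buffer standing in for an array of weights.
one_large = bytes(1_000_000)

small_ops = opcode_count(many_small)
large_ops = opcode_count(one_large)
print(small_ops, large_ops)
```

The dict needs tens of thousands of opcodes while the buffer needs only a handful, which is consistent with the large-array case being dominated by raw byte I/O.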
        
           | tyingq wrote:
           | Interesting. The link I posted is a lot (100k) of small
           | 4-field objects, but they aren't nested.
        
       | wendythehacker wrote:
       | What's even worse is that ML frameworks (including newer ones)
       | don't have or support built-in authenticity/integrity checking
       | when loading a model and its architecture. Developers have to
       | build their own solutions, like checking a hash or signature
       | themselves, and very few do.
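
A minimal sketch of the "check a hash yourself" approach mentioned above, assuming the expected digest is obtained through a trusted channel separate from the model file. The function names `sha256_hex` and `load_verified` are hypothetical, not part of any ML framework.

```python
import hashlib
import hmac
import pickle


def sha256_hex(blob: bytes) -> str:
    """Hex-encoded SHA-256 digest of the serialized model."""
    return hashlib.sha256(blob).hexdigest()


def load_verified(blob: bytes, expected_sha256: str):
    """Unpickle blob only if its SHA-256 digest matches the expected value."""
    actual = sha256_hex(blob)
    # compare_digest performs a constant-time comparison.
    if not hmac.compare_digest(actual, expected_sha256):
        raise ValueError("model file digest mismatch; refusing to load")
    return pickle.loads(blob)


model_blob = pickle.dumps({"weights": [0.5, -1.25]})
# Assumption: in practice this digest is published by the model's author.
trusted_digest = sha256_hex(model_blob)

model = load_verified(model_blob, trusted_digest)

tampered = model_blob + b"\x00"
try:
    load_verified(tampered, trusted_digest)
    rejected = False
except ValueError:
    rejected = True

print(model, rejected)
```

A digest check only detects modification after publication. It does not help if the publisher is malicious, and a signature scheme would be needed to establish who produced the file.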
       | 
       | This threat model of an ML system is also quite interesting: it
       | highlights the various security challenges a typical ML system
       | faces: https://embracethered.com/blog/posts/2020/husky-ai-threat-
       | mo...
        
       | roddux wrote:
       | Something you won't gather from skim-reading the headline is that
       | the author has also created a tool, Fickling
       | (https://github.com/trailofbits/fickling), to aid in playing
       | around with pickle files.
       | 
       | From the article: _[Fickling] can help you reverse engineer,
       | test, and even create malicious pickle files._
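
Fickling's own interface is not shown in this thread, so the sketch below does not use it. It uses the standard library's `pickletools` instead, to illustrate the general idea of inspecting a pickle without loading it. The opcode set `SUSPICIOUS` and the helper `suspicious_opcodes` are assumptions chosen for illustration, and `eval("6 * 7")` is a harmless stand-in for a real payload.

```python
import pickle
import pickletools

# Opcodes that import a global or invoke something during loading.
SUSPICIOUS = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST",
    "OBJ", "NEWOBJ", "NEWOBJ_EX", "BUILD",
}


def suspicious_opcodes(data: bytes) -> list:
    """Names of suspicious opcodes in data, found without unpickling it."""
    return [
        op.name
        for op, arg, pos in pickletools.genops(data)
        if op.name in SUSPICIOUS
    ]


class Payload:
    """Stand-in for a malicious object."""

    def __reduce__(self):
        return (eval, ("6 * 7",))


benign = pickle.dumps({"weights": [1.0, 2.0]})
hostile = pickle.dumps(Payload())

print(suspicious_opcodes(benign))
print(suspicious_opcodes(hostile))
```

Legitimate pickles of class instances also use these opcodes, so a scan like this flags files for review rather than proving that they are malicious.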
        
       ___________________________________________________________________
       (page generated 2021-03-18 23:02 UTC)