[HN Gopher] Don't Pickle Your Data
___________________________________________________________________
Don't Pickle Your Data
Author : behnamoh
Score : 46 points
Date : 2022-08-11 20:03 UTC (2 hours ago)
(HTM) web link (www.benfrederickson.com)
(TXT) w3m dump (www.benfrederickson.com)
| chaxor wrote:
| Last time I checked (i.e. performed several benchmarks upon),
| parquet with Zstd was about the best way to store compressed data
| for really fast and small files.
|
| Zstd is quite good, and is now (iirc) in the linux kernel.
|
| People may have some issue with parquet being column based, which
| can make inserts a little slower for example, but for a large
| mostly-set database it is a very good choice. A tsv.zst file
| could be another way to go as well. But like others, I really
| wish hdf5 had some of these compression features and wasn't so
| dang slow.
| ris wrote:
| Don't Assume Things About Others' Use Cases.
|
| In cases where I'm doing some sort of interactive or exploratory
| data analysis with structures of complex python objects and want
| to stash a copy of what I'm working with in case the next thing I
| do screws things up or, who knows, I lose power - being able to
| quickly pickle something and have an amount of confidence I'll be
| able to get it back in a sensible state is very useful.
|
| I've also used it for debug dumps in experimental software so I
| have a chance of reproducing odd cases it comes across.
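|
| A minimal sketch of that stash-and-restore pattern, using only the
| standard library. The helper names (stash, restore) and the sample
| state are made up for illustration, and it assumes you only ever
| load files you wrote yourself.

```python
import pickle
import tempfile
from pathlib import Path


def stash(obj, path):
    """Write obj to path using the newest pickle protocol available."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def restore(path):
    """Load a stash back. Only call this on files you created yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical working state: nested built-in containers.
state = {"rows": [(1, "a"), (2, "b")], "seen": {3, 5}, "step": 42}

stash_path = Path(tempfile.mkdtemp()) / "session.pkl"
stash(state, stash_path)
recovered = restore(stash_path)
```

| The restored object is an independent copy that compares equal to
| the original, so later mistakes in the session don't touch it.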
| hansvm wrote:
| I made a simple library for just such a purpose if you're
| interested. You can wrap a whole module (like requests or
| pandas) and cache every function/coroutine result to disk.
| https://github.com/hmusgrave/ememo
|
| I mainly use it for web scraping to be polite while I figure
| out the remote API, but I'm sure somebody could have another
| use.
| solarkraft wrote:
| > Pickle is slow
|
| ... Python is slow. But "slow" means "plenty fast" nowadays and
| the development speed advantage is immense.
|
| > unpickling malicious data can cause security issues
|
| Why would I do that?
|
| I can't read the linked page because it seems to be down/the link
| is broken, so I don't know whether this includes user data that
| is present before pickling and then turns out to be an issue after
| pickling. Then I would worry, otherwise ... yeah, I'm not gonna
| unpickle random data.
|
| > Just use JSON
|
| How do I effortlessly restore objects including their methods
| from JSON?
| marcosdumay wrote:
| > How do I effortlessly restore objects including their methods
| from JSON?
|
| The recommendation from the title is usually made instead of
| something like "deserializing executable data is harmful". That
| is exactly the one question where the answer is "don't".
|
| It's not exactly the unpickling process that is the problem.
| It's how you established that the data isn't malicious. It is
| very hard to use pickle without creating some local privilege
| escalation possibilities. And at the end of the process, you
| usually don't get any capability that replicating the code on
| both sides of the communication channel wouldn't give you.
|
| (The problem isn't specific to Python either. There was a time
| when that kind of functionality was very hyped in both
| industry and academia. For example, Java also got something
| similar that they had to retract. The famous GNU Hurd OS (the
| one that would never finish) was supposed to do that on the
| system level.)
| xhevahir wrote:
| The Mozart/Oz people came up with pickle, I think.
| NotTameAntelope wrote:
| Instantiate a new object of the class with the JSON as
| arguments, is one way.
|
| I've built a bunch of these systems, keeping your data separate
| solves a lot of future problems.
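|
| One way to read that suggestion, sketched with a dataclass. The
| User class and its fields are hypothetical; the point is that only
| the data goes over the wire, while the methods come from the class
| definition on the receiving side.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class User:
    name: str
    visits: int = 0

    def greeting(self):
        return f"Hello, {self.name} (visit {self.visits})"


# Serialize only the fields, never the code.
wire = json.dumps(asdict(User("ada", 3)))

# Rebuild by passing the decoded JSON object as keyword arguments.
restored = User(**json.loads(wire))
```

| This only works as-is for flat, JSON-friendly fields; nested
| objects or datetimes would need their own conversion step.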
| LtWorf wrote:
| The benchmark is bad. Because after you load a json you can't
| really use it. Well to use it you must check lists are lists
| for real, objects are really objects and have the keys you
| think they should have and so on.
|
| The alternative is using something like typedload (which I
| wrote) or pydantic in addition to json load, to avoid
| cluttering the code with the countless and error prone checks
| one must do to use untrusted json.
|
| In the end dealing with untrusted json directly is terrible.
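|
| For illustration, here is roughly what those hand-written checks
| look like with only the standard library. The parse_order function
| and its "items"/"qty" schema are invented for this sketch;
| typedload or pydantic would express the same thing declaratively.

```python
import json


def parse_order(text):
    """Decode JSON and verify the shape before anything else uses it."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    items = data.get("items")
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError("'items' must be a list of strings")

    qty = data.get("qty")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if not isinstance(qty, int) or isinstance(qty, bool):
        raise ValueError("'qty' must be an integer")

    return {"items": items, "qty": qty}
```

| Every extra field means another block like these, which is the
| cost a bare json.loads benchmark leaves out.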
| theamk wrote:
| if you are dealing with untrusted data, pickle is not an
| option at all, it lacks security.
| cratermoon wrote:
| >> unpickling malicious data can cause security issues
|
| > Why would I do that?
|
| If you pickle data from an untrusted source, say a web form
| submission and then later unpickle it. See
| https://cwe.mitre.org/data/definitions/502.html
| ademarre wrote:
| _> If you pickle data from an untrusted source . . . and then
| later unpickle it_
|
| That is not exactly right. The risk is when you unpickle data
| that was pickled by someone else or that was tampered with
| after you pickled it.
| cratermoon wrote:
| Look closer at the CWE and the linked examples: An attacker
| can construct an illegitimate serialized object, like an
| auth token or session ID, that instantiates one of Python's
| subprocesses to execute arbitrary commands.
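|
| A harmless sketch of the mechanism: a pickle can name any
| importable callable and have it called at load time. Here
| os.getpid stands in for something like os.system. The Payload and
| SafeUnpickler names are made up; overriding find_class is the
| restriction hook the pickle documentation describes, and this
| version simply refuses every global.

```python
import io
import os
import pickle


class Payload:
    # __reduce__ tells pickle how to rebuild the object: "call this
    # callable with these arguments". An attacker chooses both.
    def __reduce__(self):
        return (os.getpid, ())  # harmless stand-in for a shell command


blob = pickle.dumps(Payload())

# Loading the bytes calls os.getpid() and returns its result.
result = pickle.loads(blob)


class SafeUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Refuse every global lookup; plain containers still load.
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")


def safe_loads(data):
    return SafeUnpickler(io.BytesIO(data)).load()


try:
    safe_loads(blob)
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

| Note that the receiving side never needs the Payload class; the
| bytes alone are enough to trigger the call.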
| TremendousJudge wrote:
| There's also the much faster cPickle. It may just be fast
| enough for your needs. If it isn't, then you start exploring
| other options.
| kzrdude wrote:
| the regular pickle module uses "cPickle" transparently. It
| should not be worth mentioning since Python 3.x.
|
| The article is 8 years old, so it kind of misses this detail.
| IshKebab wrote:
| That was included in the benchmarks.
| IshKebab wrote:
| > But "slow" means "plenty fast" nowadays
|
| Not in my experience. "Slow" means "it seems fast enough now
| and I'm sure we'll have time to rewrite it in a fast language
| once it's grown to a monster that processes 1000 times the data
| it does now... right?".
|
| > Why would I do that?
|
| Because you are using someone else's code and make the fairly
| reasonable assumption that deserialising data doesn't cause
| arbitrary code execution... But of course it's all your fault
| because you didn't read their code to see that it's using
| Pickle!
|
| > How do I effortlessly restore objects including their methods
| from JSON?
|
| You don't. You shouldn't.
| vore wrote:
| One thing that's not mentioned is that pickled data is
| effectively fossilized once you've pickled it. If you want to
| change the layout of a class and have objects unpickle
| correctly, it can be an ordeal, as objects are unpickled by
| their class name, and you need both the original class and the
| new class to correctly unpickle and migrate.
|
| If you instead selectively pick what you want to serialize
| about your data and keep the representations separate, you can
| change the internal model easily without having a huge impact
| on the serialized model.
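|
| A rough sketch of keeping the two representations separate. The
| Account class, its version numbers, and the old "balance" field
| are all hypothetical; the idea is that the stored record has its
| own small schema, so the in-memory layout can change freely.

```python
import json


class Account:
    def __init__(self, owner, cents):
        self.owner = owner
        self._cents = cents  # current internal layout: integer cents

    def to_record(self):
        """Serialized form, deliberately independent of attribute names."""
        return {"version": 2, "owner": self.owner, "cents": self._cents}

    @classmethod
    def from_record(cls, rec):
        if rec["version"] == 1:
            # Hypothetical old schema stored a float "balance" in dollars.
            return cls(rec["owner"], round(rec["balance"] * 100))
        if rec["version"] == 2:
            return cls(rec["owner"], rec["cents"])
        raise ValueError(f"unknown record version: {rec['version']}")


old = Account.from_record({"version": 1, "owner": "bo", "balance": 12.5})
text = json.dumps(old.to_record())
again = Account.from_record(json.loads(text))
```

| Old files keep loading through the version-1 branch, with no need
| to keep the old class definition around.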
| jessikat wrote:
| JSON really is a terrible serialization format. Even JavaScript
| can't safely deserialize JSON without silent data corruption.
| I've had to stringify numbers because of JavaScript, and there
| were no errors. Perhaps that's the fault of JavaScript, but I
| find the lack of encoding the numerical storage type to be a bug
| rather than a feature.
| windows_sucks wrote:
| would love to see an example of the data corruption you're
| talking about
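|
| A likely candidate is integers above 2**53. JavaScript numbers are
| IEEE-754 doubles, so JSON.parse cannot represent every large
| integer. The sketch below imitates that in Python by forcing
| integers through float, which is also a double; it is a
| simulation, not JavaScript itself.

```python
import json

big = 2**53 + 1  # 9007199254740993, the first integer a double cannot hold
text = json.dumps({"id": big})

# Python's default decoder keeps arbitrary-precision ints.
exact = json.loads(text)["id"]

# Forcing ints through float mimics a double-only number type.
as_double = json.loads(text, parse_int=float)["id"]
changed = int(as_double) != big
```

| The value comes back off by one with no error raised, which is
| why large IDs often get sent as strings.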
| solarkraft wrote:
| (2014)
| ohiovr wrote:
| I found unpickling a lot slower than json loading.
| nomel wrote:
| I've found it to be much faster, with large amounts of data,
| like numpy arrays. And, some things aren't possible to convert
| to JSON, without writing a bunch of code to do the
| serialization/deserialization, which often makes things slow
| again.
| LtWorf wrote:
| But then you have to check that the "list" is really a list,
| that the objects do have the keys, that the strings are
| strings.
|
| This should be factored in the cost, and it wasn't in the
| benchmark.
| cratermoon wrote:
| Much the same can be leveled against Java's serialized objects.
| The OWASP top 10 from 2017 even had "Insecure Deserialization" at
| #8. The 2021 update[1] changes it to "Software and Data Integrity
| Failures", still at #8. It's CWE-502: Deserialization of
| Untrusted Data[2], where Python and Java are specifically
| mentioned.
|
| 1 https://owasp.org/www-project-top-ten/
|
| 2 https://cwe.mitre.org/data/definitions/502.html
| jleahy wrote:
| Should be (2014).
|
| More interestingly, as much as numpy and everybody advises
| against it, I believe that pickling data into a zstd stream is
| one of the fastest ways of storing sets of large matrices.
|
| The 'recommended' alternatives include numpy.save (uncompressed,
| which is bad when lz4 is faster than memcpy and you're saving to
| disk), numpy.savez (uncompressed zip files, even worse),
| numpy.savez_compressed (zlib zip, awful), hdf5 (one of the world's
| worst formats and also using zlib), etc. I wish it wasn't the
| case, but it certainly seems like a good argument for pickle.
| a-dub wrote:
| even though all the metadata is weird and overengineered, i
| would probably still use hdf5 as it provides for interop with
| other numerical computing environments (matlab, julia).
|
| also hdf5 is at least securable. pickle streams are not
| designed for that. it's good to be able to send your data to
| others.
|
| fwiw. matlab .mat files are hdf5 at their core.
|
| i should also note that json is pretty bad for numerical data.
| the specification says nothing about how much precision to
| retain and printf/scanf is ridiculously slow for storing
| floats.
| jleahy wrote:
| hdf5 is extremely slow however, pickle+zstd is faster and
| results in smaller files.
| welterde wrote:
| I kind of feel like you are doing something wrong with
| HDF5, since for my use cases it's the fastest solution by
| far.
| a-dub wrote:
| hdf5+zstd would likely be comparable.
|
| good luck loading those pickle files 5y from now.
| jleahy wrote:
| hdf5+zstd is not a thing (or at least not a thing that's
| interoperable or usable 5y from now). I just wish there
| was a good off-the-shelf solution, this stuff is not
| difficult.
| northisup wrote:
| Who is out there using pickle because they think it is a good
| idea? We use it because it is easy and built into the language
| and handles datetime by default!
| 0xbadcafebee wrote:
| I agree with using JSON for most things, but YAML is another data
| serialization format with a lot more features
| (https://yaml.org/spec/1.2.2/). It too is a security risk, but
| you can use a 'safe' version of it. If you use it, use the great
| ruamel.yaml library. (No idea how slow it is, but probably slow)
| meatmanek wrote:
| They forgot another major problem: You can only reliably unpickle
| data using the same (or same-enough) code that pickled it. If
| your class definitions have changed or moved around, unpickling
| can break.
| kangalioo wrote:
| Reminds me of C's dumping structs to disk via memcpy
| AussieWog93 wrote:
| From the conclusion of the article: >Pickle on the other hand is
| slow, insecure, and can be only parsed in Python. The only real
| advantage to pickle is that it can serialize arbitrary Python
| objects
|
| ie, a bunch of drawbacks that don't really matter at all for the
| average home-made Python script, plus the "minor" advantage of
| being able to pickle literally anything and have it "just work".
|
| None of the other options out there let you build a foolproof
| "save button" in 3 lines of code.
| ridiculous_fish wrote:
| What are some alternatives to pickling which can handle cyclic
| references?
|
| I've looked into ORMs but these are invasive in terms of needing
| to annotate your classes and fields.
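|
| One non-invasive possibility, sketched here as an assumption rather
| than a recommendation: flatten the graph into a list of records
| and replace object references with list indices. The Node class
| and the dump_graph/load_graph helpers are hypothetical.

```python
import json


class Node:
    def __init__(self, name):
        self.name = name
        self.links = []


def dump_graph(root):
    """Walk the graph once, give each node an index, store links as indices."""
    ids, order = {}, []
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in ids:
            continue  # already visited, so cycles terminate
        ids[id(node)] = len(order)
        order.append(node)
        stack.extend(node.links)
    return json.dumps(
        [{"name": n.name, "links": [ids[id(m)] for m in n.links]} for n in order]
    )


def load_graph(text):
    """Two passes: create every node first, then wire up the references."""
    records = json.loads(text)
    nodes = [Node(r["name"]) for r in records]
    for node, rec in zip(nodes, records):
        node.links = [nodes[i] for i in rec["links"]]
    return nodes[0]  # the root is always visited first, so it has index 0


a, b = Node("a"), Node("b")
a.links.append(b)
b.links.append(a)  # cycle: a -> b -> a
copy = load_graph(dump_graph(a))
```

| The classes stay unannotated; the cost is writing one pair of
| helpers per object graph.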
___________________________________________________________________
(page generated 2022-08-11 23:00 UTC)