[HN Gopher] Data-Oriented Programming in Python
___________________________________________________________________
Data-Oriented Programming in Python
Author : brilee
Score : 119 points
Date : 2022-11-26 17:45 UTC (1 days ago)
(HTM) web link (www.moderndescartes.com)
(TXT) w3m dump (www.moderndescartes.com)
| hgibbs wrote:
| I'd like to plug riptable
| (https://github.com/rtosholdings/riptable), which is (more-or-
| less) a performance upgrade to pandas.
| anigbrowl wrote:
| Looks nice!
| duped wrote:
| I'm curious how you would do data oriented programming in a
| language with no type system and no control over memory layout.
| And I guess the answer is "you can't, but JITs might exist
| someday that do it for you"
|
| But you can't wave your hands around and say compiler
| optimizations will fix performance problems - they can, but
| they're not magic, and the arrow in the proverbial knee for
| optimization passes are language semantics that make them
| impossible to realize (forcing the authors to either abandon the
| passes, or rely on things like dynamic deoptimization which is
| not free).
| jessermeyer wrote:
| Those are basically a contradiction in terms. Orienting the
| program structure around the data necessarily requires control
| over memory layout and how it is interpreted.
| mumblemumble wrote:
| > I'm curious how you would do data oriented programming in a
| language with no type system and no control over memory layout.
|
| I'm not sure what language you're referring to? Neither of
| those is true of Python.
| _aavaa_ wrote:
| I'm not sure what either you or the parent consider a proper
| type system, but the difference between Python and (say)
| Java's type system is night and day.
|
| If you use the typing module you might be able to get a
| linter to check things for you, but that's a far cry from
| what other typed languages have.
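|
| A small sketch of that gap, using a hypothetical `double` function
| that is not from the thread: the annotation is visible to a static
| checker such as mypy, but the interpreter itself does not enforce
| it when the function is called.

```python
def double(x: int) -> int:
    # The annotation documents intent; it is not checked at call time.
    return x * 2


# A static checker would flag this call, but it runs without error
# because str also supports the * operator.
result = double("ab")
print(result)

# The hints are stored as plain metadata on the function object.
print(double.__annotations__)
```

| Running a checker over the file is what turns the hint into an
| error; running the file alone does not.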
| roflyear wrote:
| How does that have anything to do with data?
| mumblemumble wrote:
| It is night and day. More specifically:
|
| Python's type discipline is stronger, but dynamic.
|
| Java's type discipline is static, but weaker.
|
| Both are most certainly typed. An _un_typed language would
| be one like Forth or most assembly languages.
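|
| A minimal sketch of "strong, but dynamic" as it applies to Python
| (the variable names are illustrative only): a name can be rebound
| to a value of another type, yet mixing incompatible types raises
| an error instead of silently coercing.

```python
# Dynamic: the same name can be rebound to values of different types.
value = 1
value = "1"

# Strong: adding an int to a str raises rather than coercing either side.
try:
    outcome = 1 + "1"
except TypeError as exc:
    outcome = f"TypeError: {exc}"

print(type(value).__name__)
print(outcome)
```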
| sirwhinesalot wrote:
| By using only coding patterns that are known to JIT well and
| lower level primitive types and containers if provided by the
| language. Maximizing the use of packages written in native code
| also helps.
|
| The resulting code is even more annoying to write than using a
| lower-level typed language in the first place, but
| ecosystem access sometimes makes up for it.
|
| Hopefully tools like mypyc get better, letting well-typed
| python code with reasonable usage patterns be compiled to
| reasonably efficient native code.
|
| Last time I used it I was pleased with the performance benefits
| but it couldn't even compile all files in a module to a single
| shared library, despite this being mentioned as possible (and
| recommended) in the docs. Maybe I was doing something wrong,
| but they don't answer their github issues often, alas.
|
| Any little thing helps though, it's one thing for throwaway
| scripts to be inefficient, but applications? At a large scale
| it is a monstrous waste of time and literal energy.
| gnuvince wrote:
| Unfortunately, the terms "data-oriented _design_" and "data-
| oriented _programming_" refer to two different styles of
| programming. Data-oriented design is the approach to
| programming made popular by Mike Acton's CppCon keynote--as you
| say, it focuses on the layout of objects in memory to make the
| processing of data take advantage of the underlying hardware.
|
| Data-oriented programming is a style of programming that, as
| far as I know, originates in the Clojure community. It
| emphasizes using general data structures (vectors,
| dictionaries) to store all data and make code more re-usable.
| It has nothing to do with good cache utilization, pre-fetching,
| or avoiding branch mispredictions.
|
| It's a shame that two styles of programming which are almost
| diametrical opposites share such similar names.
|
| From the look of the article, it's discussing data-oriented
| design, but in Python, and I agree that it's kind of a weird
| match.
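|
| A rough sketch of the contrast drawn above, assuming a made-up
| particle dataset (the field names are hypothetical). The first
| half keeps everything in generic dictionaries; the second keeps
| one contiguous, typed column per field using the standard
| library's `array` module.

```python
from array import array

# Data-oriented programming (Clojure sense): plain, generic containers.
particles = [
    {"x": 0.0, "y": 1.0, "mass": 2.0},
    {"x": 3.0, "y": 4.0, "mass": 5.0},
]
total_mass_generic = sum(p["mass"] for p in particles)

# Data-oriented design: one packed column of C doubles per field
# ("struct of arrays"), so a pass over one field reads only that column.
xs = array("d", [0.0, 3.0])
ys = array("d", [1.0, 4.0])
masses = array("d", [2.0, 5.0])
total_mass_columnar = sum(masses)

print(total_mass_generic, total_mass_columnar)
print(masses.itemsize)  # bytes used per element in the packed buffer
```

| Both halves compute the same total; what differs is how the
| values sit in memory, which is the concern of data-oriented
| design and not of data-oriented programming.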
| tvb12 wrote:
| Wait, what? Thanks for pointing this out. I was pretty close
| to submitting an order with a Data-oriented Programming book
| in my cart.
| typon wrote:
| There is no book on data oriented design as far as I know.
| It would be great if someone could take Mike Acton's talk
| and similar talks and condense the ideas into a book filled
| with real world examples.
| Jtsummers wrote:
| https://www.dataorienteddesign.com/dodmain/
|
| There is this one. It has been posted here a few times.
| wheelerof4te wrote:
| To spare you a couple minutes of your life, the article is saying
| this:
|
| Python + C modules = Speed
|
| Nothing new here, move along.
| brilee wrote:
| It actually isn't saying that. What do you think Python is made
| of, under the hood? It's C modules.
|
| The argument is not that NumPy is written in C, but that it
| amortizes the cost of Python overhead over multiple data,
| rather than incur it on each datum.
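|
| A minimal sketch of that amortization, assuming NumPy is
| installed (the array size is arbitrary). Both versions double
| every element; the first pays interpreter dispatch once per
| element, the second once per array.

```python
import numpy as np

data = list(range(100_000))

# Per-datum overhead: the interpreter dispatches one multiply per element.
doubled_loop = [x * 2 for x in data]

# Amortized overhead: one Python-level operation, after which NumPy
# runs a compiled loop over the whole buffer.
arr = np.arange(100_000)
doubled_vec = arr * 2

print(doubled_loop[:3], doubled_vec[:3])
```

| Timing the two with `timeit` is an easy way to see the gap on a
| given machine.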
| cauch wrote:
| It's a detail, but I keep seeing it:
|
| > Yet, [the scientists] struggle to move away from Python,
| because of network effects, and because Python's beginner-
| friendliness is appealing to scientists for whom programming is
| not a first language.
|
| I don't believe it's the whole story.
|
| In my case, during my 13 years in academia, I saw my field going
| away from C++ and towards python. Not because of network effects
| (it was the opposite: it was more difficult to not use what
| everybody was using), or because scientists were not able to
| program (the entry language of the whole field was C++, and
| python arrived only because scientists with a deep knowledge of
| C++ started to themselves switch the core library to be usable
| with python).
|
| I think something that computer scientists forget when they
| consider the subject is that the way computer scientists write
| software just doesn't work when you do science.
|
| In science, you use coding as an exploratory tool. You are lucky
| if 10% of your code ends up being used in your final publication.
| Because the 90% was only there to understand and to progress
| towards the proper direction. For this reason, things like
| declaring variables, which are very important when building
| professional software, are too costly to be useful when you need
| to write down a piece of code that you will only ever run once to
| check a small hypothesis, especially when you have another
| language not requiring it. Another aspect is that you will
| present your scientific results to your colleagues, not your code
| (they are not interested in that), and they will come up with
| questions or good ideas, all very good for science, but rarely
| compatible with the way your algorithm was built in the first
| place, and you will need to shoe-horn it into your code (to test
| it) without taking 3 weeks. In this case, Python's flexibility
| and hackability are very useful.
|
| It's also visible in the popularity of things like Jupyter
| notebooks (I have to acknowledge it even if I personally don't
| like working with such tools), which reuse a working approach
| similar to what was done in Mathematica and MATLAB, which were
| created with the scientific workflow in mind.
|
| I'm sure Python's simplicity has played a role. But I have the
| feeling that some people are totally oblivious to the fact that
| there may be other reasons.
| analog31 wrote:
| All good points. I think the network effects are more important
| than they were in the past, because the network has gotten
| bigger. Today, new graduates make sure to put Python on their
| resumes, even if they've barely touched it. A colleague who
| left for another job thanked me for encouraging him to learn
| Python. Contrast that with 13 years ago, roughly when I
| switched to Python. My colleagues thought it was some weird
| toy, and encouraged me to learn C# instead. The past 13 years
| have also seen most programmers get over their aversion to
| open-source tools.
|
| It's true, as mentioned in another comment, that repeatability
| is important. But at least in experimental science, the
| repeatability of an experiment is a bigger hurdle than making
| the same code run twice. I use Python code to control my
| experiments, and the ability to read my way through a complex
| workflow is quite valuable.
|
| One thing I like is being able to bodge together a huge blob of
| data and metadata into a dict that's easy to store and unpack.
| That encourages me to keep more experimental data and metadata,
| and use it later.
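|
| One way this might look, as a sketch only: the comment does not
| say which storage format is used, so JSON is an assumption here,
| and the record's field names are hypothetical.

```python
import json

# Hypothetical measurement record: data and metadata in one dict.
record = {
    "metadata": {"instrument": "spectrometer-1", "gain": 2.5, "units": "V"},
    "samples": [0.12, 0.15, 0.11],
}

# Store: one call serializes the data and its metadata together.
blob = json.dumps(record)

# Unpack: one call restores the same nested structure.
restored = json.loads(blob)
print(restored["metadata"]["gain"], len(restored["samples"]))
```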
|
| Python and its libraries have sprawled to the point where it's
| anything but simple.
| whatever1 wrote:
| Computer scientists also assume that you know what inputs your
| program needs and what is the range of the outputs.
|
| That is out of touch with scientific research. We may
| completely change the inputs, the core logic and the outputs
| overnight.
|
| Having to babysit function signatures, manage memory and types
| throughout these activities is just draining.
| duped wrote:
| That is also how "computer scientists" and software engineers
| work. Our time is just valued a lot higher so we've come up
| with techniques to make our work more efficient and faster,
| like structuring our code well using types and function
| signatures.
|
| The added bonus is you get science that's, you know,
| repeatable. Because the difference between industrial code
| and prototype code is that it gets run so often there can't
| be glaring mistakes; it must be repeatable by default. We
| have different techniques for dealing with these problems,
| but writing good code is orthogonal to that (I don't think
| scientists need to be running static analysis, doing deep
| reviews, and having extensive integration/unit/mock testing
| throughout their code).
| mumblemumble wrote:
| I used to be a software engineer, and now I'm a data
| scientist. Not exactly the same thing as a real scientist,
| but I suspect that we have common cause in this area.
|
| One of the hard lessons in the transition was realizing
| that things that allowed me to work more efficiently when I
| was a software engineer instead _reduce_ my efficiency in
| my new career.
|
| You might get a decent analogy of the difference by
| comparing photographs of the first transistor with pictures
| of every subsequent transistor. The first transistor's
| clearly going to be terrible in any production application.
| But the same characteristics that make it so terrible for
| practical use were also, to varying degrees, essential to
| or characteristic of the exploratory process that led to
| its creation.
|
| It's similar for my R&D code. In order to do my R&D work
| more efficiently and effectively, I need to do things that
| would be unholy in production code. This is why there's a
| separate and essential productionizing step where my output
| is heavily revamped and possibly even completely rewritten
| in a different programming language.
|
| re: repeatability, I've discovered that it, too, means
| something slightly different in a science context than it
| does in an engineering context.
| roflyear wrote:
| It's nice to be able to collect your data and then do analysis
| on it in one language. Data work is mostly about getting your
| data in a place where you can work with it. Rarely are you
| handed a dataset that is shiny and ready for analysis or
| regression ...
|
| Python also has the benefit of tooling, where you can use tools
| you're familiar with and still work on most codebases.
| tomrod wrote:
| This is a wonderfully technical article. I'd love to learn more
| about Python internals as a scientific coder.
| pedrovhb wrote:
| The official Python documentation is excellent, and in many
| ways goes beyond providing just a list of existing modules and
| what they do. Sometimes if I'm bored I'll actually just pull up
| documentation for something I'm not 100% familiar with and have
| a look around, and I almost always find something new and
| useful. A couple of interesting ones are [0][1], and [2] is a
| nice starting point for discovering more. Not everyone's cup of
| tea, but I also found it enjoyable to dive into asyncio with the
| docs.
|
| [0] https://docs.python.org/3/howto/descriptor.html
| [1] https://docs.python.org/3/library/collections.html
| [2] https://docs.python.org/3/
| barefeg wrote:
| I recommend any of the talks by James Powell at PyData. For
| example this one https://youtu.be/cKPlPJyQrt4
|
| Edit: this one on NumPy may be more relevant:
| https://youtu.be/u2yvNw49AX4
| _visgean wrote:
| Hmm nice article but imho skips over the biggest optimization:
| numpy uses BLAS libraries so stuff like
|
| > >>> multiply_by_two = homogenous_array * 2
|
| will be calculated most of the time using a BLAS library -
| whichever you are using
| (https://numpy.org/devdocs/user/building.html)
| cdavid wrote:
| That article talks about DL, where BLAS is much less relevant.
| The kernels are mostly CUDA (for GPU) and similar stuff for
| other accelerators.
| college_physics wrote:
| > In practice, scientific computing users rely on the NumPy
| family of libraries e.g. NumPy, SciPy, TensorFlow, PyTorch, CuPy,
| JAX, etc..
|
| this is a somewhat confusing statement. most of these libraries
| actually don't rely on numpy. e.g. tensorflow ultimately wraps
| c++/eigen tensors [0] and numpy enters somewhere higher up in
| their python integration
|
| [0]
| https://github.com/tensorflow/tensorflow/blob/master/tensorf...
___________________________________________________________________
(page generated 2022-11-28 05:00 UTC)