[HN Gopher] How Python List Works
___________________________________________________________________
How Python List Works
Author : tosh
Score : 61 points
Date : 2021-11-14 13:27 UTC (9 hours ago)
(HTM) web link (antonz.org)
(TXT) w3m dump (antonz.org)
| kgm wrote:
| One thing to note, in the gritty details of Python's actual
| implementation of the list type, is that there's a little more to
| say about how list resizing works than is immediately obvious.
| The relevant source is here:
| https://github.com/python/cpython/blob/main/Objects/listobje...
|
| As the article points out, the size progression, as detailed in
| the comment in that source, is relatively modest. Your "classic"
| dynamic array will simply double the size as needed. This
| progression does less than that: 16, 24, 32, 40, and so on.
| _But:_ The underlying implementation is using the standard C
| library function realloc(). This will perform some amount of in-
| place resizing on its own, and will lend a lot to the performance
| of a Python list, albeit in a way that might be difficult to nail
| down, since it depends on the overall usage of the C library 's
| heap.
| Zababa wrote:
| Related to that, do heterogenous lists make sense as a default? I
| can see the argument for dictionaries easily, you might want to
| put a lot of different stuff inside one. However, for lists, I
| fail to see the advantages of heterogenous lists as a default.
| Especially since they are going to be way slower, and nobody seem
| to use arrays in Python.
| kzrdude wrote:
| If you use the static typing eyes, the Python list is
| _homogenous_ : every element is of the type reference to python
| object. There's no funny business with types, just one
| reference after another.
| Zababa wrote:
| That's a fair point, but if static typing was like that it
| would be useless. "Object" is a type, but it's pretty useless
| most of the time.
| pdonis wrote:
| _> do heterogenous lists make sense as a default?_
|
| Yes, because then the list object doesn't have to do any extra
| work to check the types of objects. If your application needs
| that extra checking, you can add it in your application's code;
| but there's no reason to burden the basic list object with
| extra code that not everyone needs.
|
| _> Especially since they are going to be way slower_
|
| Way slower than what? They're slower than Python arrays, of
| course, but Python arrays can only store numeric values (since
| the amount of space each value takes up is constant and known
| when the array is created).
|
| For a container that needs to be able to store objects of any
| type (including instances of user-defined classes), a
| heterogeneous list is _faster_ than a homogeneous one, because,
| as above, it doesn 't have to do extra work to check the types
| of objects. All the other extra work is there because the size
| in memory of the objects isn't known in advance; and that will
| be true even if all of the objects are of the same type, since
| instances still won't necessarily all have the same in-memory
| size (for example, consider instances of user-defined classes
| with different attributes).
| Zababa wrote:
| > For a container that needs to be able to store objects of
| any type (including instances of user-defined classes), a
| heterogeneous list is faster than a homogeneous one, because,
| as above, it doesn't have to do extra work to check the types
| of objects.
|
| My question is about that. Is storing objects of any type a
| common use case, compared to storing objects of one type?
| sseagull wrote:
| I have found, at least in my own use of python, lists are
| for "homogeneous" collections of indeterminate length. But
| these are not strictly homogeneous - there may be None
| entries, for example. So I do use the heterogeneous
| capabilities.
|
| For heterogeneous collections with a fixed length, I tend
| to use a tuple. That makes more sense to me, conceptually
| (coming from C++).
| pdonis wrote:
| _> Is storing objects of any type a common use case,
| compared to storing objects of one type?_
|
| I'm not sure it's possible to know a general answer to
| that, since Python is used in such a wide variety of
| contexts.
|
| In my own programming, I use lists that can contain objects
| of different types often enough for that use case to be
| significant for me.
|
| One other thing to keep in mind is that Python has duck
| typing, so, for example, if you are iterating over a list
| and retrieving the same attribute or calling the same
| method on each object, that doesn't require each object to
| be of the same type, just to have an attribute with the
| required name, or a method with the required name and
| calling signature.
| duckerude wrote:
| Python's semantics make homogeneity less useful and hard to
| enforce.
|
| Everything's passed by reference, so the list has to contain
| pointers even if you know all its elements have the same type.
| And some objects can change their type by assigning to
| __class__, which can happen from a distance (because everything
| is by reference).
|
| A careful implementation can solve this for some builtin types.
| PyPy does it for ints: if a list only contains ints it's
| represented as a proper numeric array. But then as soon as you
| append some other type the whole list is converted to a more
| expensive more traditional representation.
|
| PyPy's ints still have to pretend to be heap-allocated objects
| at all times. In particular they have to do something tricky
| with the id() function, which on CPython just returns an
| object's memory address. The return value has to be unique and
| consistent for as long as an object lives.
| nayuki wrote:
| How a Python list really works? See the underlying C
| implementation:
| https://github.com/python/cpython/blob/main/Objects/listobje... ;
| https://github.com/python/cpython/blob/main/Include/listobje...
|
| For comparison, java.util.ArrayList is implemented in Java
| itself, and is much shorter:
| http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/...
| azinman2 wrote:
| Ya I was reading this thinking none of this is telling me how
| CPython implements it. It's just a poor-man's version.
| matthias509 wrote:
| One point which I think bears mentioning with respect to
| performance is that even with O(1) access, the performance in the
| real world of a list implemented by a list of pointers vs a
| contiguous block of memory will be very different. This is
| because programs often iterate over lists and a list implemented
| by a contiguous block of ram will have MUCH better cache
| locality.
|
| AFAIK, this unfortunately wouldn't be possible to implement in
| Python though because I don't think there would be any way to
| know how much space to allocate for each item in the list.
| seiferteric wrote:
| Does it have to be one or the other? You could allocate blocks
| of memory for the list then when it will be filled up you
| allocate more and add a pointer to it.
| dmurray wrote:
| This breaks O(1) access by index, because the objects in the
| list may have different sizes.
| Znafon wrote:
| If you have numeric value you can use
| https://docs.python.org/3/library/array.html
| kzrdude wrote:
| But if you're in the real world you use numpy.ndarray instead
| (this underpins half of the Python ecosystem, more or less).
| killingtime74 wrote:
| For more efficient "lists" there are Numpy arrays
| https://numpy.org/doc/stable/reference/generated/numpy.array...
| which do have cache locality
| wodenokoto wrote:
| Python does come with built-in arrays, but I've never seen
| them used, or heard anyone recommend them for anything.
|
| https://docs.python.org/3/library/array.html
| cozzyd wrote:
| These are useful when you need a pointer to an integer for
| a C or C++ API.
| edgyquant wrote:
| I had never heard of them and when I try to Google Python
| array v set speed it just returns list v set. I did get the
| official documentation but it doesn't go into speed.
| uranusjr wrote:
| They are recommended when you want speed, but most people
| writing Python probably never need that much speed (or
| would probably want numpy anyway), and the few who do
| likely have more private sources to recommend it for
| specific occasions. It should probably be used for more
| things (instead of reaching for numpy at first chance) but
| is still a niche tool.
| abecedarius wrote:
| The times I've used array, it was a common data structure
| for communicating with a C extension. You could use a
| Python list, but that's not just slower but more of a
| pain to deal with on the C side. (This was before numpy
| was everywhere; I don't know how much fun that might be
| to work with from C.)
|
| If you're going pure Python then space is a better reason
| than speed. I'm not sure an array is even as fast as a
| list.
| uranusjr wrote:
| I think I use struct much more when interfacing with C,
| but I can see array being useful for arrays specifically
| when you expect more than a couple of elements. Both are
| still quite niche though.
| kgm wrote:
| The standard array module was more relevant in the distant
| past, before version 2.6 added the bytearray type.
| Otherwise, numpy arrays are considerably more useful in
| virtually every circumstance.
| mark-r wrote:
| In CPython at least, there's no other way. Everything is an
| object, even simple types like integers. And references to
| objects are implemented with pointers. Even a Numpy array which
| stores a contiguous array of simple values requires conversion
| to an object when you actually try to _use_ those values.
| arthurcolle wrote:
| I always wondered why there weren't more 'hyper-collection'
| abstractions that just implement multiple collection types in one
| interface, so that you can get the respective speedups of the
| different collection types while still having the flexibility to
| use them as you wish
| karmakaze wrote:
| This is the field of datastructures. There isn't 'just' one be-
| all to end-all. There are so many kinds of hashing/tables,
| trees/balancing, hybrid (e.g. skiplists), priority queues, etc.
| Because there are so many possible combinations, if the choices
| matter it's useful to use a language that lets you make your
| own. This is one of the main reasons why generics for Go is a
| big deal. Go always had generics, but it's only been for the
| built-ins array/slice, map, and channel. With user-defined
| generics any datastructures/interfaces can be conveniently made
| and used without language/library/compiler support.
|
| Its curious that for Python this is blog post material, but for
| Go it's intrinsic to understanding the performance and usage of
| slices or maps, or in Java (and other langs) would be
| referenced merely by the class name ArrayList. The fact that so
| much can be done effectively in Python without always having to
| think about such details is an achievement of the language and
| ecosystem of libraries and packages.
| ketralnis wrote:
| C arrays are basically that. Also python dicts (most python
| types including all user-defined classes are actually dicts
| underneath), Lua tables, Javascript objects, and more. In most
| dynamically typed languages, "the" type is actually what you're
| calling a hyper-collection and not just conceptually, but also
| in implementation
| beervirus wrote:
| How would that work? Every time you create an object,
| internally it creates one of everything?
___________________________________________________________________
(page generated 2021-11-14 23:01 UTC)