[HN Gopher] Python strings are immutable, but only sometimes
       ___________________________________________________________________
        
       Python strings are immutable, but only sometimes
        
       Author : jermaustin1
       Score  : 85 points
       Date   : 2021-02-15 18:48 UTC (4 hours ago)
        
 (HTM) web link (web.eecs.utk.edu)
 (TXT) w3m dump (web.eecs.utk.edu)
        
       | Animats wrote:
       | Python string semantics are very strange, as the way Unicode was
       | implemented. The internal representation can be one, two, or four
       | bytes per character, depending on the widest character. Plus I've
       | heard that there's sometimes a UTF-8 representation in there,
       | too.
       | 
       | Python strings have to be indexable by integers. But, in fact,
       | most indexing is iteration. So a UTF-8 representation could work.
       | You can index forwards and backwards through UTF-8 with opaque
       | cursors. The search, match, and regular expression functions can
       | all work on UTF-8. Returned indexes would be opaque cursors.
       | Addition and subtraction of small integers to a cursor can be
       | handled by indexing forwards and backwards through UTF-8. If a
       | program does does random access indexing, or uses an opaque
       | cursor in a context where it really has to be convereted to an
       | integer, the implementation has to build an index of the entire
       | string. Index building costs O(n), but should be rare.
        
       | js2 wrote:
       | > Does this really count as mutability though? Not really. The
       | string would be thrown away anyway so this is just an
       | optimization to reuse the memory. The important takeaway here is
       | that you are not allocating a new string every single time like
       | the internet says.
       | 
       | This is an undocumented optimization, so you should assume you're
       | allocating a new sting every single time like the internet says.
       | 
       | I've been coding Python since 1.5.2 days and so I'll continue to
       | use lists as intermediate string builders and then join
       | afterwards because I know this works in past, present, a likely
       | future versions of Python.
        
         | orf wrote:
         | Some interesting results. Joining on a generator is slower than
         | joining on a list because the list has a known size beforehand,
         | which lets Python perform some optimizations. Even though it's
         | more memory intensive. Appending a string is surprisingly fast.
         | Building an intermediate list is not.
         | 
         | While I do get your point, Python likes making optimizations.
         | They rarely, if ever, make patterns slower by choice. There's
         | nothing in any spec that says it _has_ to be slow: that's as
         | much as an implementation detail as "joining lists are fast".
         | In [18]: sys.version         Out[18]: '3.9.1 (default, Feb  3
         | 2021, 07:38:02) \n[Clang 12.0.0 (clang-1200.0.32.29)]'
         | In [6]: strings = ["".join(random.choice(string.hexdigits) for
         | _ in range(10)) for _ in range(10)]                  In [9]:
         | def one():            ...:     "".join(s for s in strings)
         | ...:                  In [10]: def two():             ...:
         | "".join([s for s in strings])             ...:
         | In [11]: def three():             ...:     x = []
         | ...:     for s in strings:             ...:         x.append(s)
         | ...:     "".join(x)             ...:                  In [12]:
         | def four():             ...:     x = ""             ...:
         | for s in strings:             ...:         x += s
         | ...:                  In [13]: %timeit one()         753 ns +-
         | 9 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
         | In [14]: %timeit two()         521 ns +- 5.63 ns per loop (mean
         | +- std. dev. of 7 runs, 1000000 loops each)                  In
         | [15]: %timeit three()         696 ns +- 3.94 ns per loop (mean
         | +- std. dev. of 7 runs, 1000000 loops each)                  In
         | [16]: %timeit four()         620 ns +- 4.82 ns per loop (mean
         | +- std. dev. of 7 runs, 1000000 loops each)
        
         | Dylan16807 wrote:
         | Does python document anything about performance?
         | 
         | I assume my array sort won't be n^3, but how do we draw the
         | line on what to assume and what not to assume?
        
         | nine_k wrote:
         | This does play a role if for some reason you use object IDs,
         | e.g. when deserializing data with cross-links. Keeping a real
         | second reference becomes important.
        
           | wizzwizz4 wrote:
           | You should never use object IDs that way; that's what weak
           | references / dictionaries are for. (See pickle for details.)
        
           | heavenlyblue wrote:
           | Python docs explicitly say that object IDs are only unique
           | for objects with overlapping lifetimes.
           | 
           | Congrats, by trying to be smart you just shot yourself in the
           | foot.
        
           | [deleted]
        
         | lostcolony wrote:
         | Yeah; I'm pretty sure this isn't anything you can take
         | advantage of. But I think the point of the post is a "isn't
         | this neat?" rather than "because of an optimization one
         | approach isn't as bad as it could be and you should take
         | advantage of that".
        
           | masklinn wrote:
           | > Yeah; I'm pretty sure this isn't anything you can take
           | advantage of.
           | 
           | You probably _can_ but _should not_ rely on it:
           | 
           | * it was a very official part of the Python 2.4 release notes
           | 
           | * it's unlikely the devs would remove it as they know how
           | implementation details tend to leak into the software (see:
           | dict ordering)
           | 
           | * but it _is_ an optimisation so there 's no guarantee, a
           | simple trace function could break it
           | 
           | * and obviously there's no guarantee (and likely no way)
           | other implementations can implement it
        
             | lostcolony wrote:
             | Right, but my thought is "at what times would it make sense
             | to join, and not append, but this optimization would be
             | faster than joining"? I'm guessing not very many.
             | 
             | Because, yeah, the fact that str += "?" is faster isn't a
             | big deal; that's the natural, ergonomic way to deal with
             | it, because you're appending all of one string. Likewise
             | foo + " " + bar + "?" is probably easier to write that way,
             | than to drop them into a list and join (but even if not,
             | I'd be curious if it's actually any faster; this article
             | doesn't measure). By the time you get to joining large
             | amounts together, concatenating a CSV or something, you're
             | going to use join, naturally, and join is going to be more
             | performant.
             | 
             | So to my mind it's kind of an open question, at what points
             | is the non-ergonomic thing going to be the faster thing?
             | That's what "taking advantage of (an optimization)" feels
             | like; otherwise you're just writing code and letting the
             | performance fall where it may.
        
               | masklinn wrote:
               | > Right, but my thought is "at what times would it make
               | sense to join, and not append, but this optimization
               | would be faster than joining"? I'm guessing not very
               | many.
               | 
               | I'm guessing all of them. Because what it does is what
               | join will do internally anyway, but without the overhead
               | of the list itself, and with operations which have lower
               | overhead than the list's.
        
               | lostcolony wrote:
               | "I'm guessing"
               | 
               | Yes, you are. I'm saying this is a more interesting
               | question, and one that isn't answered. Because the
               | reality is that if you're appending just a handful of
               | things together, the idiomatic approach is to concatenate
               | them, and this optimization comes into play. If you're
               | concatenating a lot of items together...you probably
               | already have a list, and the idiomatic approach to join
               | them is probably faster than a for loop to concat them
               | over and over. So the question then, is is there a point
               | where idiomatically it makes more sense to join things
               | (putting them in a list if they aren't already, but they
               | might be; depends on the example), but this will be more
               | efficient?
        
             | dmurray wrote:
             | This one is less painful for the devs to break, if they
             | need to, because it only works 99% of the time. If you rely
             | on this behaviour for correctness, rather than performance,
             | it will set you right pretty quickly.
             | 
             | If you do want to take advantage of this, maybe you can go
             | a level deeper: dig into Python's behaviour when allocating
             | memory for strings, and figure out in what circumstances
             | you can be guaranteed to get the same id back. E.g. maybe
             | if you create a 28-character string, there will always be
             | room to append 4 more.
        
             | dmw_ng wrote:
             | The optimization got moved to unicode objects during the
             | great Python 3 upheaval and AFAIK still remains
             | unimplemented for what are now bytes objects. The paramiko
             | SSH library relied on it heavily, as a result its runtime
             | is (AFAIK still) horrific on Python 3
        
       | [deleted]
        
       | pwdisswordfish6 wrote:
       | This behaviour (id not changing after concatenation) is
       | consistent with the original str object being deallocated and a
       | new one allocated in its place with an identical id. It's not
       | what actually happens behind the scenes (in CPython), but it's
       | still an implementation detail that this demonstration does not
       | even unambiguously expose.
        
         | nine_k wrote:
         | CPython could do both; allocations are usually not byte-
         | perfect, and the small remaining room at the end of string can
         | be used as an optimization, or a new buffer can be allocated if
         | there's no more room.
         | 
         | Also, cutting long strings with .strip() can use the same
         | optimization. Allocation isn't fast.
         | 
         | IDK if it's used, but it's entirely plausible to implement.
        
           | masklinn wrote:
           | > CPython could do both; allocations are usually not byte-
           | perfect
           | 
           | Though it's probably not the case for strings, it should be
           | noted that for most objects allocations _are_ byte-perfect,
           | because the average Python object is a bunch of pointers and
           | and a refcount: two gc pointers, a refcount, a weakref
           | pointer, a class pointer and a dict pointer, for a total of
           | 48 bytes (as of 3.9, ignoring all the pointees).
           | 
           | That's not the case for `__slots__` though: they drop the
           | instance dict and the weakref pointer, but then each field
           | will add a pointer to the total (then again that's space
           | which won't be allocated as part of the instance dict).
        
       | nine_k wrote:
       | If a tree falls in a forest, and nobody observes it, does it make
       | a sound? The tree can choose, because there's nobody to notice
       | the difference.
       | 
       | Mutable state is not bad; _shared_ mutable state is bad -- or, at
       | least, takes a lot of care to handle. So CPython mutates strings
       | which are not observed by anyone else but the mutating code.
       | 
       | This is exactly the idea behind uniqueness types [1] and move
       | semantics in C++ and Rust: mutation is safe as long as it's not
       | shared.
       | 
       | [1]: https://en.wikipedia.org/wiki/Uniqueness_type
        
       | ymbeld wrote:
       | The title seems like false advertisement. It seems that strings
       | in Python are indeed immutable. How it implements immutability is
       | up to it (the implementation), and making a brand new allocation
       | for each operation is not the only way of adhering to the
       | contract.
       | 
       | Get back to me if you can make several references to the same
       | string and make them step on each others' toes by only assigning
       | a new value to one of them.
        
       | teddyh wrote:
       | > _the builtin id() function, which is just the memory address._
       | 
       | This, crucially, is _not_ guaranteed. Instead, id() is only
       | guaranteed to give a unique integer value for every object _valid
       | for the lifetime of that object_. So if Python deallocated your
       | first string object and allocated a new string object, Python
       | could give the same id() for the new string object and this would
       | be fine; this does _not_ necessarily mean that the two string
       | objects are the "same" object! In fact, since the objects do not
       | share lifetimes, they _cannot_ be said to be the same object!
        
       | postit wrote:
       | I just spend two days chasing a bug because someone shoved a
       | variable inside a dataclass in the wrong place, nothing is
       | immutable in python.
        
         | iainmerrick wrote:
         | No, strings are immutable, despite the semantic gymnastics this
         | article is performing.
        
           | masklinn wrote:
           | Strings are immutable on the Python side. On the C size,
           | however, PyUnicode_Resize is part of the PEP 384 Stable ABI
           | and very specifically tries to resize the string in-place
           | (it's normally intended for string-building).
        
             | ymbeld wrote:
             | If it is immutable from the client's perspective, it's
             | immutable. Anything else is ill-defined.
        
               | masklinn wrote:
               | The C api is "from a client's perspective". Any rando can
               | build native modules with it.
        
       | cs702 wrote:
       | My favorite Python WTF "feature" is that integers can have have
       | the same reference, but only sometimes:                 >>> a =
       | 256       >>> b = 256       >>> a is b       True       >>> a =
       | 257       >>> b = 257       >>> a is b       False       >>> a =
       | 257; b = 257       >>> a is b       True
       | 
       | Sometimes I think of Python as the _Nash Equilibrium_ [a] of
       | programming languages:
       | 
       | It's never the absolute best language for anything, but it's hard
       | to improve it on any front (e.g., execution speed) without
       | hindering it on other fronts (e.g., ad-hoc interactivity), and
       | it's never the absolute worst language for anything. Often
       | enough, it's good enough.
       | 
       | Python might well be "the least-worst language for everything."
       | 
       | --
       | 
       | [a] https://en.wikipedia.org/wiki/Nash_equilibrium
        
         | patrickthebold wrote:
         | Java basically has the same thing:
         | 
         | Quick Google gives: https://stackoverflow.com/a/1515811
        
         | justinnhli wrote:
         | > Sometimes I think of Python as the Nash Equilibrium of
         | programming languages
         | 
         | FYI: What you're describing is not a Nash equilibrium, but a
         | Pareto optimal point [1]. They are similar in that you couldn't
         | do any better, but Nash equilibria is in terms of whether this
         | would cause other players to change their strategies, while
         | Pareto optimality is only about trading off different
         | features/dimensions.
         | 
         | [1]: https://en.wikipedia.org/wiki/Pareto_efficiency
        
         | th0ma5 wrote:
         | This is outside of the spec... "is" is for testing the exact
         | same reference and it is only coincidence that to speed things
         | up they made smaller integers the same objects in memory. See:
         | >>> a=257         >>> b=a         >>> a is b         True
         | 
         | What you want is double equals.
        
           | [deleted]
        
           | oivey wrote:
           | He names references in his post. I highly doubt he's confused
           | about the difference between is and ==. It's a weird leak of
           | interpreter details that could, in very narrow situations,
           | cause a bug.
        
         | coldtea wrote:
         | > _My favorite Python WTF "feature" is that integers can have
         | have the same reference, but only sometimes_
         | 
         | Many languages do the same, ditto for strings (as in TFA).
        
         | Delk wrote:
         | As other commenters have pointed out, this is an
         | implementation-specification optimization rather than a
         | property of Python as a language.
         | 
         | It is, at a first glance, a bit weird. But the way you should
         | look at it is that Python the language doesn't say the two
         | integers have the same identity, and you shouldn't assume they
         | will. But it also doesn't say they _can 't_ be the same object.
         | Since Python integers are immutable, and thus having the two
         | variables actually reference the same object can't create side
         | effects unless you're directly playing with identities and
         | making assumptions in your code that you shouldn't make, the
         | implementation can have the two variables reference the same
         | object as an optimization without breaking anything.
        
         | patrec wrote:
         | > It's never the absolute best language for anything, but it's
         | hard to improve it on any front (e.g., execution speed) without
         | hindering it on other fronts (e.g., ad-hoc interactivity),
         | 
         | This belief seems common, but I always wonder if anyone with
         | familiarity with dynamic programming languages that were
         | implemented by people who knew what they are doing (as
         | implementers) thinks so. Self, Smalltalk and Common Lisp, for
         | example, are doing _much_ better on the ad-hoc interactivity
         | front in non-trivial ways whilst offering implementations with
         | vastly better performance preceding (C)Python by many years.
         | The fact that python has terrible execution speed is most due
         | to lack of relevant skills in the community not some conscious
         | engineering trade-off.
         | 
         | Having said that, I don't think you are wrong on python being
         | "the least worst language for everything" -- very few other
         | languages have an eco system of remotely comparable
         | expansiveness and quality (the top minds in several disciplines
         | mostly use python for their work) which alone kills of huge
         | swathes of would-be-competitors.
        
         | orf wrote:
         | `is` is for identity whereas `=` is for equality. You rarely
         | want `is` unless you're asking if two references _are the same
         | object_. This is almost exclusively used for `x is False
         | /True`, but sometimes used to denote "missing" arguments (where
         | true/false may be valid):                   missing = object()
         | def func(a=missing):            if a is missing:
         | raise ValueError('you must pass a')
         | 
         | This "numbers less than 256 are the same objects" is a fairly
         | common on the list of "wtf python" but I've never understood
         | it. You don't use `is` like that and you would never use it in
         | code because the operator you're using is not the right one for
         | the thing you're trying to do.
         | 
         | Plus if this is the biggest wtf then that's pretty good going.
        
           | cs702 wrote:
           | Yes, of course. It's not the "biggest" WTF feature, but it's
           | my favorite. There's a long list of WTF features here:
           | 
           | https://github.com/satwikkansal/wtfpython
           | 
           | Otherwise, I agree, it's a pretty good going :-)
        
       | nneonneo wrote:
       | I mean, who says Python strings are immutable?                 s
       | = "asd"       import ctypes       ctypes.memmove(id(s) +
       | ord('1'), b'X', 1)       print(s) # prints aXd
       | 
       | (Python 3.3+, 64-bit; this is a joke so please do not use this in
       | anything real)
        
         | yccs27 wrote:
         | Nice!
         | 
         | Can you please explain to a rookie what the ord('1') is for? Is
         | it just a way of writing a magic number of value 49, or is
         | there some significance to it?
        
           | panda88888 wrote:
           | CPython implementation of the string stores metadata to save
           | memory. The ord('1') simply jumps to the correct memory
           | offset in the string structure. I would think it's easier to
           | remember ord('0') as the offset of the string byte array in
           | the structure than the magic number 48?
        
           | nneonneo wrote:
           | It's mostly just a mildly obfuscated way to write 49. It also
           | happens to line up nicely with the index of the character
           | being overwritten (index 1); ord('0') through ord('9') work
           | likewise.
           | 
           | ----
           | 
           | I'm personally kind of shocked at how big strings are - the
           | minimum size is 6 native words in size, corresponding to the
           | following fields: refcount, type pointer, string length,
           | string hash (precomputed), flags, and a wstr pointer which
           | just seems to be NULL for most strings. It seems like they
           | could have at least merged the string length and flags
           | together for short strings - a possible subject for a future
           | PEP...
        
           | rasjani wrote:
           | It feels like its just "magic" number. Probably id() is the
           | actual raw pointer to str class and the string starts at
           | offset 48 ?
        
       ___________________________________________________________________
       (page generated 2021-02-15 23:00 UTC)