[HN Gopher] JVM Anatomy Quark #10: String.intern()
       ___________________________________________________________________
        
       JVM Anatomy Quark #10: String.intern()
        
       Author : hyperpape
       Score  : 77 points
       Date   : 2022-06-22 17:58 UTC (5 hours ago)
        
 (HTM) web link (shipilev.net)
 (TXT) w3m dump (shipilev.net)
        
       | throwaway23234 wrote:
       | Anyone who uses XML libraries in Java is probably aware of the
       | memory and performance issues with String.intern. I don't know
       | whether those issues persist today, but a lot of XML parsing
       | goes through that method.
        
       | smarks wrote:
       | (Note: the article is from 2019, but it is still relevant and
       | interesting.)
       | 
       | If all you're doing is deduplicating string objects in order to
       | save memory, the article makes a good case to use `HashMap` or
       | `ConcurrentHashMap` instead of `String.intern`.
       | 
       | The reason to use `String.intern` is in the fairly narrow case
       | where you want to use `==` to compare strings, and some of these
       | strings originated as string literals in Java source code. Java
       | specifies that string literals for equal strings are always the
       | _same_ Java object, even if they occur in different class files
       | that are compiled separately. This is accomplished by interning
       | the strings, and `String.intern` provides user-level access to
       | this same intern pool.
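
A minimal sketch of the identity behaviour described above. The class and
method names are made up for illustration; only `String.intern`, `equals`
and `==` come from the comment.

```java
// Sketch: reference identity of string literals vs. runtime-created strings.
final class InternIdentityDemo {
    // A compile-time literal; equal literals always refer to one pooled object.
    static final String LITERAL = "USA";

    /** Builds an equal string at run time; it is a distinct heap object. */
    static String runtimeCopy() {
        return new String(new char[] {'U', 'S', 'A'});
    }

    public static void main(String[] args) {
        String copy = runtimeCopy();
        System.out.println(copy.equals(LITERAL));     // true: same contents
        System.out.println(copy == LITERAL);          // false: different objects
        System.out.println(copy.intern() == LITERAL); // true: same pooled object
    }
}
```

So `==` is only reliable when every string on both sides of the comparison
has gone through the intern pool.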
        
       | abadger9 wrote:
       | Having worked in the JVM ecosystem building low latency systems
       | for over a decade, I always aspire to have the minute knowledge
       | Aleksey has. I wish there was a book/MOOC for more senior
       | engineers, but I've found myself content and happy with the
       | Anatomy Quark series.
        
         | hyperpape wrote:
         | I share the feeling that I wish there were more resources out
         | there. I've collected a motley set at
         | https://justinblank.com/notebooks/jvmarchitecture.html, but I
         | haven't found anything systematic.
        
           | abadger9 wrote:
           | This is the one that got me started in low-latency Java a
           | decade ago - http://blog.vanillajava.blog/ - I've spoken with
           | Peter a couple of times; he's a phenomenal technologist.
        
             | gwittel wrote:
             | Vanilla Java is a great resource. The work that Chronicle
             | does got me into low-latency Java as well (not doing Java
             | now though).
             | 
             | Some other great resources I used:
             | 
             | * http://psy-lob-saw.blogspot.com/ - Not updated since
             | 2018, but still great stuff in there
             | 
             | * https://richardstartin.github.io/ - A few posts per year.
             | Richard does a lot of interesting things.
        
           | layer8 wrote:
           | https://shipilev.net/jvm-anatomy-park yields 404. Maybe that
           | should link to the TFA now?
        
             | hyperpape wrote:
             | Thanks, will update later.
             | https://shipilev.net/jvm/anatomy-quarks/ is the address.
        
         | nnoitra wrote:
         | Why is Java used in low latency systems? Isn't C++ a better
         | choice there?
        
           | thriftwy wrote:
           | C++ is actually an esoteric language, not unlike Haskell or
           | Prolog. It is just a widely used one, for historical reasons.
           | 
           | Meanwhile, you may just get things done in Java without
           | giving much thought.
        
       | marginalia_nu wrote:
       | I've felt that String.intern is something of a primordial
       | vestige of a really old Java landscape that doesn't quite make
       | sense anymore. It can't be removed for backwards-compatibility
       | reasons, but I don't understand when or where you would want to
       | use it in today's landscape.
        
       | metadat wrote:
       | Tl;dr:
       | 
       | > In almost every project we were taking care of, removing
       | String.intern() from the hotpaths, or optionally replacing it
       | with a handrolled deduplicator, was the very profitable
       | performance optimization.
       | 
       | > Do not use String.intern() without thinking very hard about it,
       | okay?
       | 
       | Overall fantastic article covering intern() in depth!
       | 
       | The only thing I'd like to see added would be visual graphs
       | instead of numeric tables for the benchmark results.
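
One possible shape for the "handrolled deduplicator" mentioned in the quote,
as a sketch only. It assumes a `ConcurrentHashMap` as the backing store; the
class name `StringDeduplicator` and its methods are hypothetical, not taken
from the article.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a hand-rolled deduplicator; names are hypothetical.
final class StringDeduplicator {
    private final ConcurrentHashMap<String, String> canonical = new ConcurrentHashMap<>();

    /** Returns the first instance that was seen with these contents. */
    String dedup(String s) {
        String existing = canonical.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently retained. */
    int size() {
        return canonical.size();
    }

    public static void main(String[] args) {
        StringDeduplicator d = new StringDeduplicator();
        String a = d.dedup(new String("AA"));
        String b = d.dedup(new String("AA"));
        System.out.println(a == b); // true: both refer to the first instance
    }
}
```

Note that this map holds strong references, so entries stay alive until the
deduplicator itself is dropped; that is a deliberate simplification here.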
        
       | theandrewbailey wrote:
       | I've been writing Java for about 15 years. There hasn't been a
       | single instance where I've used String.intern(). I'm starting to
       | think that there is never an appropriate place for it. What do
       | you get from it? Anything else other than the ability to reliably
       | use == for String comparison instead of .equals()?
       | 
       | Never mind the architecture astronauts saying that the string
       | intern pool is actually global state and global state is always
       | problematic.
        
         | chrisseaton wrote:
         | > Anything else other than the ability to reliably use == for
         | String comparison instead of .equals()?
         | 
         | Sometimes that's key.
         | 
         | Why not use an int or some other token? Well then you have to
         | garbage collect yourself. Maybe use your own class for the
         | token though instead of a string? But then you still have to do
         | half the garbage collection, as you need a weak map for lookup
         | and possibly some cleaning infrastructure.
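
A rough sketch of the "own class for the token" alternative, assuming a
`WeakHashMap` for the lookup table. The `Symbol` class and its methods are
hypothetical; the cleanup relies on `WeakHashMap` expunging entries whose
keys have been collected.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of a canonical token type backed by a weak lookup table.
final class Symbol {
    // Weak keys plus weakly referenced values: an entry can be collected
    // once nothing outside the table holds the Symbol any more.
    private static final Map<String, WeakReference<Symbol>> TABLE = new WeakHashMap<>();

    // Must be the very String object used as the table key, so the key
    // stays reachable for as long as the Symbol does.
    private final String name;

    private Symbol(String name) {
        this.name = name;
    }

    static synchronized Symbol of(String name) {
        WeakReference<Symbol> ref = TABLE.get(name);
        Symbol existing = (ref == null) ? null : ref.get();
        if (existing != null) {
            return existing;
        }
        // Drop any stale entry first so that put() stores this key object.
        TABLE.remove(name);
        Symbol created = new Symbol(name);
        TABLE.put(name, new WeakReference<>(created));
        return created;
    }

    String name() {
        return name;
    }

    public static void main(String[] args) {
        Symbol a = Symbol.of("AB");
        Symbol b = Symbol.of(new String("AB"));
        System.out.println(a == b); // true while `a` is strongly held
    }
}
```

This is the "half the garbage collection" the comment refers to: the token
gives you `==` comparison, but the table and its cleanup are yours to
maintain.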
        
         | catp wrote:
         | I've used String.intern() in genetic data analysis to reduce
         | memory usage by a large factor. Lots of repeated Strings like
         | "AA", "AB", "BB", etc.
        
         | jefffoster wrote:
         | Once upon a time I found string interning useful in conjunction
         | with IdentityHashMap for quick lookup of lots of strings.
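
A small sketch of that combination. The `InternedLookup` wrapper is
hypothetical; in a real program the keys would typically be interned once,
upstream, so that the lookup itself can skip both hashing the contents and
calling `equals`.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Sketch: an identity-based map that is only correct for interned keys.
final class InternedLookup {
    private final Map<String, Integer> values = new IdentityHashMap<>();

    void put(String key, int value) {
        values.put(key.intern(), value);
    }

    /** Interns the key first, so any equal string finds the entry. */
    Integer get(String key) {
        return values.get(key.intern());
    }

    /** Compares by reference only; misses equal but non-interned strings. */
    Integer getWithoutInterning(String key) {
        return values.get(key);
    }

    public static void main(String[] args) {
        InternedLookup lookup = new InternedLookup();
        lookup.put("alpha", 1);
        String runtimeKey = new String("alpha");
        System.out.println(lookup.get(runtimeKey));                 // 1
        System.out.println(lookup.getWithoutInterning(runtimeKey)); // null
    }
}
```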
        
         | jlmorton wrote:
         | One common thing I've seen is an attempt to de-duplicate a
         | poorly-crafted cache.
         | 
         | Let's say you have a cache with Map<Language, BusinessInfo>.
         | After a while, the cache is quite large, and you want to reduce
         | its memory usage. You realize within BusinessInfo, you have a
         | MailingAddress, but it's not actually localized. You're
         | duplicating millions of address lines for no reason.
         | 
         | You could split this out of your cache, but pulling that thread
         | is a bit tricky. Instead, you decide to store
         | addressLine.intern().
        
         | monkeybutton wrote:
         | I used it a long time ago in a program that read large
         | structured text files. What was 100s of MB of text in memory
         | shrank to a fraction of that size after interning the tokens.
        
         | GlitchMr wrote:
         | I don't think using String::intern makes sense, especially now
         | that Java's garbage collector is capable of deduplicating
         | strings (https://openjdk.org/jeps/192). In the past it could
         | have been used to reduce memory usage when a given string was
         | used a lot, but now there are better ways of dealing with that
         | issue.
        
           | hyperpape wrote:
           | G1's deduplication is nice, but note that it is a lot weaker
           | than what String.intern does. G1 deduplicates the underlying
           | byte array, but leaves separate String objects (so s1 == s2
           | will evaluate to false). So you still have an extra object
           | header.
           | 
           | If you have (like one of our applications did) millions of
           | copies of the string "USA" in memory, that's many megabytes
           | of memory that explicit deduplication can save that the
           | garbage collector can't.
           | 
           | String.intern isn't the way, for all the reasons this post
           | outlines, but just using G1 isn't the right approach either.
        
             | vips7L wrote:
             | IIRC all of the concurrent GCs can dedupe now. Not just G1.
             | 
             | Hopefully soon object headers will be negligible with
             | progress from Lilliput though.
        
               | hyperpape wrote:
               | Strings have an object header, an int for the hashcode
               | and a pointer to the array. Assuming a < 32 GB heap (so 4
               | byte pointers), that's 24 bytes for the string, even once
               | the array is deduped. Lilliput is awesome, but an 8 byte
               | header would only reduce that to 16 bytes.
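
The arithmetic above can be spelled out as a sketch. It assumes a 12-byte
object header today and 8-byte object alignment, and it counts only the
fields named in the comment (an `int` hash and a 4-byte array reference);
real `String` layouts vary by JDK version. The class name is hypothetical.

```java
// Sketch of the per-String footprint arithmetic under the stated assumptions.
final class StringFootprint {
    /** Rounds up to a multiple of 8 bytes (assumed object alignment). */
    static long align8(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    /** Header + int hash + 4-byte array reference, then aligned. */
    static long stringObjectBytes(int headerBytes) {
        int hashField = 4;
        int arrayRef = 4;
        return align8(headerBytes + hashField + arrayRef);
    }

    public static void main(String[] args) {
        System.out.println(stringObjectBytes(12)); // 24
        System.out.println(stringObjectBytes(8));  // 16
        // One million String objects whose arrays are already deduplicated:
        System.out.println(1_000_000L * stringObjectBytes(12)); // 24000000
    }
}
```

Under these assumptions, a million surviving copies of a short string still
cost about 24 MB in String objects alone, which is the overhead that
explicit deduplication removes.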
        
         | bumper_crop wrote:
         | You'll need to go back earlier than 15 years. ConcurrentHashMap
         | and friends were added in 1.5, but String.intern has been in
         | there since the beginning. Since Java shipped with Threads in
         | the standard library (but no memory model), it would have been
         | very difficult to do concurrent String deduplication yourself.
         | If you agree string deduplication is needed, then
         | String.intern() was a good implementation for a long while.
         | 
         | String.intern() also offers some other benefits for certain use
         | cases. Earlier versions of Java did not cache the String
         | hashcode, which meant that using Strings as hash table keys
         | required a lot more hashing. But an interned string can be used
         | in an IdentityHashMap, which was faster for a long portion of
         | Java's early life.
         | 
         | (I worked on a moderately popular Java library that targeted
         | Java 1.5 as the minimum version. It does occasionally come in
         | useful, but only in specific and increasingly rare
         | circumstances.)
        
         | Someone wrote:
         | I never understood it, either. It would be useful (at least
         | historically) in cases where you're willing to give up
         | performance in exchange for lower memory usage, and even then
         | you'd still have to make good choices about which strings to
         | intern and which not to. (I think the standard example is an
         | XML parser that may choose to intern all node and attribute
         | names of an XML schema. If you're parsing a large file using a
         | DOM parser, that can significantly decrease memory usage.)
         | 
         | Problem 1: it can easily depend on your input data whether
         | interning is necessary to prevent out-of-memory errors.
         | 
         | Problem 2: there often are better solutions to memory usage
         | than interning strings (for example, replacing that DOM
         | parser with a streaming one).
         | 
         | Problem 3: if you're writing a library, you can't know anything
         | about the application that will use your code, so you have to
         | guess whether interning some of the strings you create will
         | help the user.
         | 
         | And of course, that's all historical. (Some) modern JVMs already
         | deduplicate strings.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-06-22 23:00 UTC)