[HN Gopher] JVM Anatomy Quark #10: String.intern()
___________________________________________________________________
JVM Anatomy Quark #10: String.intern()
Author : hyperpape
Score : 77 points
Date : 2022-06-22 17:58 UTC (5 hours ago)
(HTM) web link (shipilev.net)
(TXT) w3m dump (shipilev.net)
| throwaway23234 wrote:
| For anyone that uses XML stuff in Java, you are probably aware of
| String.intern memory and performance issues. I don't know if that
| stuff persists today, but a ton of XML parsing goes through that
| method.
| smarks wrote:
| (Note: the article is from 2019, but it is still relevant and
| interesting.)
|
| If all you're doing is deduplicating string objects in order to
| save memory, the article makes a good case to use `HashMap` or
| `ConcurrentHashMap` instead of `String.intern`.
|
| The reason to use `String.intern` is in the fairly narrow case
| where you want to use `==` to compare strings, and some of these
| strings originated as string literals in Java source code. Java
| specifies that string literals for equal strings are always the
| _same_ Java object, even if they occur in different class files
| that are compiled separately. This is accomplished by interning
| the strings, and `String.intern` provides user-level access to
| this same intern pool.
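
A minimal sketch of the identity guarantee described above: a string literal and a string built at run time are equal but distinct objects, and `intern()` maps the run-time string back to the pooled literal. The class and method names are hypothetical, chosen only for illustration.

```java
// Hypothetical demo class; only String.intern() itself is a real JDK API here.
final class InternIdentityDemo {

    /** Builds a fresh String object at run time, so it is not the pooled literal. */
    static String buildAtRuntime(String text) {
        return new String(text.toCharArray());
    }

    public static void main(String[] args) {
        String literal = "USA";                 // literals are interned by the JVM
        String runtime = buildAtRuntime("USA"); // same contents, different object

        System.out.println(literal.equals(runtime));     // compares contents
        System.out.println(literal == runtime);          // compares identity
        System.out.println(literal == runtime.intern()); // identity after interning
    }
}
```

Run as written, this should print `true`, `false`, `true`: only the interned reference can safely be compared with `==`.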
| abadger9 wrote:
| Having worked in the JVM ecosystem building low latency systems
| for over a decade, I always aspire to have the minute knowledge
| Aleksey has. I wish there was a book/MOOC for more senior
| engineers, but I've found myself content and happy with the
| Anatomy Quark series.
| hyperpape wrote:
| I share the feeling that I wish there were more resources out
| there. I've collected a motley set at
| https://justinblank.com/notebooks/jvmarchitecture.html, but I
| haven't found anything systematic.
| abadger9 wrote:
| This is the one that got me started in low-latency Java a
| decade ago - http://blog.vanillajava.blog/ - I've spoken with
| Peter a couple of times; he's a phenomenal technologist.
| gwittel wrote:
| Vanilla Java is a great resource. The work that Chronicle
| does got me into low latency Java as well (not doing Java
| now though).
|
| Some other great resources I used:
|
| * http://psy-lob-saw.blogspot.com/ - Not updated since
| 2018, but still great stuff in there
|
| * https://richardstartin.github.io/ - A few posts per year.
| Richard does a lot of interesting things.
| layer8 wrote:
| https://shipilev.net/jvm-anatomy-park yields 404. Maybe that
| should link to the TFA now?
| hyperpape wrote:
| Thanks, will update later.
| https://shipilev.net/jvm/anatomy-quarks/ is the address.
| nnoitra wrote:
| Why is Java used in low latency systems? Isn't C++ a better
| choice there?
| thriftwy wrote:
| C++ is actually an esoteric language, not unlike Haskell or
| Prolog. It is just a widely used one, for historical reasons.
|
| Meanwhile, you may just get things done in Java without
| giving much thought.
| marginalia_nu wrote:
| I've sort of felt that String.intern is a primordial vestige
| from a really old Java landscape that doesn't quite make sense
| anymore. It can't be removed, for backwards-compatibility
| reasons, but I don't understand when or where you would want to
| use it in today's landscape.
| metadat wrote:
| Tl;dr:
|
| > In almost every project we were taking care of, removing
| String.intern() from the hotpaths, or optionally replacing it
| with a handrolled deduplicator, was the very profitable
| performance optimization.
|
| > Do not use String.intern() without thinking very hard about it,
| okay?
|
| Overall fantastic article covering intern() in depth!
|
| The only thing I'd like to see added would be visual graphs
| instead of numeric tables for the benchmark results.
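
For readers wondering what a "handrolled deduplicator" might look like, here is a minimal sketch built on `ConcurrentHashMap`. It is not the article's code; the class name `StringDeduplicator` and its methods are assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical hand-rolled deduplicator; name and API are illustrative only.
final class StringDeduplicator {
    // Maps each distinct string to the first instance seen with that content.
    private final ConcurrentMap<String, String> canonical = new ConcurrentHashMap<>();

    /** Returns the canonical instance equal to {@code s}, registering {@code s} if it is new. */
    String dedup(String s) {
        String existing = canonical.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct strings currently held. */
    int size() {
        return canonical.size();
    }
}
```

Unlike the JVM's intern pool, this map holds strong references, so entries live as long as the deduplicator does. That is usually fine when the deduplicator is scoped to one parse or one cache and then dropped as a whole.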
| theandrewbailey wrote:
| I've been writing Java for about 15 years. There hasn't been a
| single instance where I've used String.intern(). I'm starting to
| think that there is never an appropriate place for it. What do
| you get from it? Anything else other than the ability to reliably
| use == for String comparison instead of .equals()?
|
| Never mind the architecture astronauts saying that the string
| intern pool is actually global state and global state is always
| problematic.
| chrisseaton wrote:
| > Anything else other than the ability to reliably use == for
| String comparison instead of .equals()?
|
| Sometimes that's key.
|
| Why not use an int or some other token? Well then you have to
| garbage collect yourself. Maybe use your own class for the
| token though instead of a string? But then you still have to do
| half the garbage collection, as you need a weak map for lookup
| and possibly some cleaning infrastructure.
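
The "weak map for lookup" mentioned above could be sketched roughly as follows, assuming tokens with proper `equals`/`hashCode`. `WeakInterner` is a hypothetical name; the pattern of wrapping values in `WeakReference` is the one the `WeakHashMap` documentation recommends when a value refers back to its key.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical weak canonicalizing pool; not a JDK class.
final class WeakInterner<T> {
    // Keys are held weakly. Values are wrapped in WeakReference so the value
    // does not keep its own key strongly reachable.
    private final Map<T, WeakReference<T>> pool = new WeakHashMap<>();

    /** Returns the pooled instance equal to {@code value}, adding it if absent. */
    synchronized T intern(T value) {
        WeakReference<T> ref = pool.get(value);
        T existing = (ref != null) ? ref.get() : null;
        if (existing != null) {
            return existing;
        }
        pool.put(value, new WeakReference<>(value));
        return value;
    }
}
```

Here `WeakHashMap` does the cleaning: entries whose tokens are no longer referenced elsewhere become eligible for removal. The coarse `synchronized` keeps the sketch simple and would likely be a bottleneck on a hot path.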
| catp wrote:
| I've used String.intern() in genetic data analysis to reduce
| memory usage by a large factor. Lots of repeated Strings like
| "AA", "AB", "BB", etc.
| jefffoster wrote:
| Once upon a time I found string interning useful in conjunction
| with IdentityHashMap for quick lookup of lots of strings.
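
A rough sketch of that combination, assuming keys are interned once when the table is built and callers look up with interned strings. `InternedKeyIndex` and its method are hypothetical names.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical lookup table keyed by interned strings.
final class InternedKeyIndex {
    // IdentityHashMap compares keys with ==, not equals().
    private final Map<String, Integer> index = new IdentityHashMap<>();

    /** Interns each key once and records its position in the argument list. */
    InternedKeyIndex(String... keys) {
        for (int i = 0; i < keys.length; i++) {
            index.put(keys[i].intern(), i);
        }
    }

    /** Returns the position of an interned key, or -1 if this exact object is not a key. */
    int positionOfInterned(String internedKey) {
        Integer position = index.get(internedKey);
        return position == null ? -1 : position;
    }
}
```

The catch is that every caller must pass an interned reference: an equal but non-interned string silently misses, because lookups never call `equals()`.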
| jlmorton wrote:
| One common thing I've seen is an attempt to de-duplicate a
| poorly-crafted cache.
|
| Let's say you have a cache with Map<Language, BusinessInfo>.
| After a while, the cache is quite large, and you want to reduce
| its memory usage. You realize within BusinessInfo, you have a
| MailingAddress, but it's not actually localized. You're
| duplicating millions of address lines for no reason.
|
| You could split this out of your cache, but pulling that thread
| is a bit tricky. Instead, you decide to store
| addressLine.intern().
| monkeybutton wrote:
| I used it a long time ago in a program that read large
| structured text files. What was 100s of MB of text in memory
| became a fraction in size after interning the tokens.
| GlitchMr wrote:
| I don't think using String::intern makes sense, especially now
| that Java's garbage collector is capable of deduplicating
| strings (https://openjdk.org/jeps/192). In the past it could
| have been used to reduce memory usage when a given string was
| used a lot, but now there are better ways of dealing with that
| issue.
| hyperpape wrote:
| G1's deduplication is nice, but note that G1's deduplication
| is a lot weaker than what String.intern does. G1 deduplicates
| the underlying byte array, but leaves separate strings (so s1
| == s2 will evaluate false). So you still have an extra object
| header.
|
| If you have (like one of our applications did) millions of
| copies of the string "USA" in memory, that's many megabytes
| of memory that explicit deduplication can save that the
| garbage collector can't.
|
| String.intern isn't the way, for all the reasons this post
| outlines, but just using G1 isn't the right approach either.
| vips7L wrote:
| IIRC all of the concurrent GCs can dedupe now. Not just G1.
|
| Hopefully soon object headers will be negligible with
| progress from Lilliput though.
| hyperpape wrote:
| Strings have an object header, an int for the hashcode
| and a pointer to the array. Assuming a < 32 GB heap (so 4
| byte pointers), that's 24 bytes for the string, even once
| the array is deduped. Lilliput is awesome, but an 8 byte
| header would only reduce that to 16 bytes.
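
The arithmetic in the comment above can be sketched like this. It follows the comment's simplified field list (header, 4-byte hash, 4-byte compressed array reference) and assumes 8-byte object alignment and a 12-byte header today; real `String` layouts carry other small fields, so a layout tool such as JOL is the way to get exact numbers. The class and method names are hypothetical.

```java
// Hypothetical back-of-the-envelope estimator; all sizes are assumptions.
final class StringOverheadEstimate {

    /** Rounds a size up to the next multiple of 8 (assumed object alignment). */
    static long alignTo8(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    /** Size of the String object itself, excluding its backing array. */
    static long stringShellBytes(int headerBytes) {
        return alignTo8(headerBytes + 4 + 4); // header + hash int + array reference
    }

    /** Bytes spent on String objects for the given number of duplicate copies. */
    static long totalBytes(long copies, int headerBytes) {
        return copies * stringShellBytes(headerBytes);
    }

    public static void main(String[] args) {
        System.out.println(stringShellBytes(12));       // assumed current header size
        System.out.println(stringShellBytes(8));        // assumed Lilliput-style header
        System.out.println(totalBytes(1_000_000L, 12)); // one million duplicate strings
    }
}
```

Under these assumptions the three lines print 24, 16, and 24000000, so a million duplicate strings cost roughly 24 MB in object shells alone, even when their arrays are shared.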
| bumper_crop wrote:
| You'll need to go back earlier than 15 years. ConcurrentHashMap
| and friends were added in 1.5, but String.intern has been in
| there since the beginning. Java shipped with Threads in the
| standard library but without a sound memory model (that came
| with JSR-133 in Java 5), so it would have been very difficult
| to do concurrent String deduplication yourself. If you agree
| string deduplication is needed, then String.intern() was a good
| implementation for a long while.
|
| String.intern() also offers some other benefits for certain use
| cases. Earlier versions of Java did not cache the String
| hashcode, which meant that to use Strings as hash table keys
| meant hashing a lot more. But, an interned string can be used
| in an IdentityHashMap, which was faster for a long portion of
| Java's early life.
|
| (I worked on a moderately popular Java library that targeted
| Java 1.5 as the minimum version. String.intern() does
| occasionally come in useful, but only in specific and
| increasingly rare circumstances.)
| Someone wrote:
| I never understood it, either. It would be useful (at least
| historically) in cases where you're willing to give up
| performance in exchange for lower memory usage, and then, you'd
| still have to make good choices on what strings to intern and
| what strings not to intern (I think the standard example is an
| XML parser that may choose to intern all node and attribute
| names of an XML schema. If you're parsing a large file using a
| DOM parser, that can significantly decrease memory usage).
|
| Problem 1: it can easily depend on your input data whether
| interning is necessary to prevent out of memory exceptions.
|
| Problem 2: there are often better solutions to memory usage
| than interning strings (for example, replacing that DOM
| parser with a streaming one).
|
| Problem 3: if you're writing a library, you can't know anything
| about the application that will use your code, so you have to
| guess whether interning some of the strings you create will
| help the user.
|
| And of course, that's historically. (Some) modern JVMs already
| deduplicate strings.
| [deleted]
___________________________________________________________________
(page generated 2022-06-22 23:00 UTC)