hngopher.com

       [HN Gopher] Show HN: Samchika - A Java Library for Fast, Multith...
       ___________________________________________________________________
        
       Show HN: Samchika - A Java Library for Fast, Multithreaded File
       Processing
        
       Hi HN, I built a Java library called SmartFileProcessor to make
       high-performance, multi-threaded file processing simpler and more
       maintainable.

       Most Java file processing solutions either involve a lot of
       boilerplate or don't handle concurrency, backpressure, or metrics
       well out of the box. I needed something fast, clean, and
       production-friendly -- so I built this.

       Key features:

       - Multi-threaded line/batch processing using a configurable
         thread pool
       - Producer/consumer model with built-in backpressure
       - Buffered, asynchronous writing with optional auto-flush
       - Live metrics: memory usage, throughput, thread times, queue
         stats
       - Simple builder API -- minimal setup to get going
       - Output metrics to JSON, CSV, or human-readable format

       Use cases:

       - Large CSV or log file parsing
       - ETL pre-processing
       - Line-by-line filtering and transformation
       - Batch preparation before ingestion

       I'd really appreciate your feedback -- feature ideas, performance
       improvements, critiques, or whether this solves a real problem
       for others. Thanks for checking it out!
        
       Author : mprataps
       Score  : 56 points
       Date   : 2025-05-23 13:39 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
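
The producer/consumer model with backpressure that the post describes can be sketched with a bounded queue from the JDK. This is only an illustration of the general pattern, not Samchika's actual API: the class name `BackpressureSketch`, the method `totalLength`, and the queue capacity of 4 are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of producer/consumer with backpressure; not the library's code.
final class BackpressureSketch {
    // Unique sentinel batch that tells a worker to stop (compared by identity).
    private static final List<String> POISON = new ArrayList<>();

    /** Sums the lengths of all lines using `workers` consumer threads. */
    static long totalLength(List<String> lines, int batchSize, int workers) {
        // Bounded queue: put() blocks while it is full, which slows the producer down.
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(4);
        AtomicLong total = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int w = 0; w < workers; w++) {
            pool.execute(() -> {
                try {
                    for (List<String> b = queue.take(); b != POISON; b = queue.take()) {
                        long sum = 0;
                        for (String line : b) sum += line.length();
                        total.addAndGet(sum);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            // Producer: slice the input into batches and hand them to the workers.
            for (int i = 0; i < lines.size(); i += batchSize) {
                queue.put(lines.subList(i, Math.min(i + batchSize, lines.size())));
            }
            for (int w = 0; w < workers; w++) queue.put(POISON); // one sentinel per worker
            pool.shutdown();
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return total.get();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("alpha", "beta", "gamma", "delta", "epsilon");
        System.out.println(totalLength(lines, 2, 3));
    }
}
```

The backpressure comes entirely from the bounded queue: when the workers fall behind, `put()` blocks and the producer stops reading ahead, so memory use stays limited to a few batches.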
        
       | gavinray wrote:
       | Please don't do this.
       | 
       | Have the OS handle memory paging and buffering for you and then
       | use Java's parallel algorithms to do concurrent processing.
       | 
       | Create a "MappedByteBuffer" and mmap the file into memory.
       | 
       | If the file is too large, use an "AsynchronousFileChannel" and
       | asynchronously read + process segments of the buffer.
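
A minimal sketch of the mmap approach suggested above, assuming a file smaller than 2 GiB so that it fits in a single `MappedByteBuffer`. The class `MmapLineCount`, the helper `demo`, and the chunking scheme are hypothetical names and choices for illustration; counting newline bytes stands in for whatever per-line work is actually needed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.stream.IntStream;

// Hypothetical sketch: map a file and scan fixed-size chunks with a parallel stream.
final class MmapLineCount {
    /** Counts '\n' bytes in a file smaller than 2 GiB. */
    static long countNewlines(Path file, int chunkSize) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            int size = Math.toIntExact(ch.size()); // throws if the file is 2 GiB or larger
            // The OS pages the file in on demand; no explicit read buffers are managed here.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int chunks = (int) ((size + (long) chunkSize - 1) / chunkSize);
            return IntStream.range(0, chunks).parallel().mapToLong(c -> {
                long start = (long) c * chunkSize;
                long end = Math.min(start + chunkSize, size);
                long n = 0;
                for (long i = start; i < end; i++) {
                    // Absolute get: reads by index and does not move the buffer position.
                    if (buf.get((int) i) == '\n') n++;
                }
                return n;
            }).sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a small temp file and counts its newlines. */
    static long demo() {
        try {
            Path tmp = Files.createTempFile("mmap-demo", ".txt");
            Files.writeString(tmp, "a\nb\nc\n");
            return countNewlines(tmp, 2);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Real line processing would also have to handle lines that straddle a chunk boundary, which this sketch avoids by only counting bytes.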
        
         | 90s_dev wrote:
         | Knowing nothing about Java or compsci, I am very curious to see
         | the in depth discussion by all you Java/compsci experts that
         | your comment invites.
        
         | papercrane wrote:
         | If you're using a newer JVM you can also map a "MemorySegment",
         | which doesn't have the 2GiB limit that byte buffers have.
        
           | gavinray wrote:
           | Good point, have written about this in the past
           | 
           | https://gavinray97.github.io/blog/panama-not-so-foreign-
           | memo...
        
         | switchbak wrote:
         | Memory mapping is fun, but shouldn't we have some kind of async
         | IO / uring support by now? If you're looking at really high-
         | perf I/O, mmaping isn't really state of the art right now.
         | 
         | Then again, if you're in Java/JVM land you're probably not
         | building bleeding edge DBs ala ScyllaDB. But I'm somewhat
         | surprised at the lack of projects in this space. One would
         | think this would pair well with some of the reactive stream
         | implementations so that you wouldn't have to reimplement things
         | like backpressure, etc.
        
           | exabrial wrote:
           | try not to be a dick
        
           | threeseed wrote:
           | a) There have been libraries supporting io_uring on the JVM
           | for many years now.
           | 
            | b) ScyllaDB is not bleeding edge. It uses DPDK, which is
            | relatively old now.
           | 
           | c) There are countless reactive stream implementations e.g.
           | https://vertx.io/docs/vertx-reactive-streams/java/
        
             | hawk_ wrote:
             | I thought DPDK would still be faster than io_uring.
        
           | jlokier wrote:
           | Last time I measured on Linux (a few years ago), with NVMe,
           | mmap + calling out to a thread pool to async-page-touch (so
           | the main thread didn't block) was faster than io_uring (from
           | the main thread) for random access reads.
        
         | SillyUsername wrote:
          | Better caveat that with, "but watch memory consumption, given
          | the nature of the likes of CopyOnWriteArrayList". GC will be a
          | bitch.
        
         | mprataps wrote:
         | Thanks for this comment. This will be an interesting aspect to
         | explore.
        
       | codetiger wrote:
       | Do you have a benchmark comparison with other similar tools?
        
       | sureglymop wrote:
       | Perhaps I misunderstand something but doesn't reading from a file
       | require a system call? And when there is a system call, the
       | context switches? So wouldn't using multiple threads to read from
       | a file mean that they can't really read in parallel anyway
       | because they block each other when executing that system call?
        
         | mike_hearn wrote:
         | System calls aren't context switches. They flip a permission
         | bit in the CPU but don't do the work a context switch involves
         | like modifying the MMU, flushing the TLBs, modifying kernel
         | structures, doing scheduling etc.
         | 
         | Also, modern filing systems are all thread safe. You can have
         | multiple threads reading and even writing in parallel on
         | different CPU cores.
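
To illustrate the point about parallel reads: in Java, `FileChannel.read(ByteBuffer, long)` takes an explicit offset and does not touch the channel's shared position, so several threads can read different slices of one open file. The sketch below is an assumption-laden illustration (the class `ParallelPositionalReads`, the helper `demo`, and the slice size are made up), and whether the reads actually proceed concurrently depends on the OS and file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.stream.IntStream;

// Hypothetical sketch: several threads read disjoint slices of one file.
final class ParallelPositionalReads {
    /** Sums all bytes of the file; each parallel task reads its own slice. */
    static long sumBytes(Path file, int sliceSize) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            int slices = (int) ((size + sliceSize - 1) / sliceSize);
            return IntStream.range(0, slices).parallel().mapToLong(s -> {
                long pos = (long) s * sliceSize;
                int len = (int) Math.min(sliceSize, size - pos);
                ByteBuffer buf = ByteBuffer.allocate(len);
                try {
                    // Positional read: explicit offset, shared channel position untouched.
                    while (buf.hasRemaining()) {
                        if (ch.read(buf, pos + buf.position()) < 0) break; // end of file
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                buf.flip();
                long sum = 0;
                while (buf.hasRemaining()) sum += buf.get() & 0xFF;
                return sum;
            }).sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the bytes 1..10 to a temp file and sums them in slices of 3. */
    static long demo() {
        try {
            Path tmp = Files.createTempFile("reads-demo", ".bin");
            try {
                Files.write(tmp, new byte[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10});
                return sumBytes(tmp, 3);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```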
        
         | bionsystem wrote:
         | If you open() read-only I don't think it blocks (some other
         | process writing to it might block though).
        
         | porridgeraisin wrote:
         | > system call, the context switches
         | 
         | No, there is no separate kernel "executing". When you do a
         | syscall, your thread becomes kernel mode and it executes the
         | function behind the syscall, then when it's done, your thread
         | reverts to user mode.
         | 
         | A context switch is when one thread is being swapped out for
         | another. Now the syscall could internally spawn a thread and
         | context switch to that, but I'm not sure if this happens in
         | read() or any syscall for that matter.
        
         | xxs wrote:
          | What all other siblings said - syscalls are not context
          | switches, they are called 'mode switches' and have
          | significantly less impact.
        
       | sidcool wrote:
       | It would be even more amazing if it had tests. It's already
       | pretty good.
        
         | DannyB2 wrote:
         | Should the tests include some 10 GB files?
        
           | VWWHFSfQ wrote:
           | Should include a script for generating 10GB files maybe
        
           | sidcool wrote:
           | Naah. I meant unit tests. Not load tests.
        
         | mprataps wrote:
         | I will add unit tests next.
        
       | VWWHFSfQ wrote:
       | Am I wrong in thinking that this is duplicating lines in memory
       | repeatedly when buffering lines into batches, and then submitting
       | batches to threads? And then again when calling the line
       | processor? Seems like it might be a memory hog
        
         | Calzifer wrote:
          | Since most things in Java are handled by reference, including
          | Strings, there should not be much memory overhead. From a
          | quick look I could not find any actual line duplication.
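
A small sketch of the by-reference point, using hypothetical names (`SharedLineReferences`, `buffer`, `batch`) rather than anything from the project: copying a list of lines into a new batch list copies references, so both lists end up pointing at the same String objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: batching copies references to Strings, not their characters.
final class SharedLineReferences {
    /** Returns true if the buffer and the batch hold the very same String object. */
    static boolean sameObjectAfterBatching() {
        String line = new String("a line read from the file"); // a distinct heap object
        List<String> buffer = new ArrayList<>();
        buffer.add(line);
        List<String> batch = new ArrayList<>(buffer); // new list, same String references
        // == compares object identity, so this checks that no String copy was made.
        return buffer.get(0) == batch.get(0) && batch.get(0) == line;
    }

    public static void main(String[] args) {
        System.out.println(sameObjectAfterBatching());
    }
}
```

The per-batch overhead is therefore the list's backing array of references, not a second copy of the line data.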
        
       | Calzifer wrote:
       | for(int i=0;i<10000; ++i){
       |     // do nothing just compute hash again and again.
       |     hash = str.hashCode();
       | }
       | 
       | https://github.com/MayankPratap/Samchika/blob/ebf45acad1963d...
       | 
       | "do nothing" is correct, "again and again" not so much. Java
       | caches the hash code for Strings and since the JIT knows that (at
       | least in recent version[1]) it might even remove this loop
       | entirely.
       | 
       | [1] https://news.ycombinator.com/item?id=43854337
        
         | hyperpape wrote:
         | Even in older versions, if the compiler can see that there are
         | no side-effects, it is free to remove the loop and simply
         | return the value from the first iteration.
         | 
         | I'm actually pretty curious to see what this method does on
         | versions that don't have the optimization to treat hashCodes as
         | quasi-final.
         | 
         | A quick test using Java 17 shows it's not being optimized away
         | _completely_, but it's taking...~1 ns per iteration, which is
         | not enough to compute a hash code.
         | 
         | Edit: I'm being silly. It will just compute the hashcode the
         | first time, and then repeatedly check that it's cached and
         | return it. So the JIT doesn't have to do any _real_ work to
         | make this skip the hash code calculation.
         | 
         | So most likely, the effective code is:
          | 
          |     computeHashCode();
          |     for (int i = 0; i < 10000; i++) {
          |         if (false) { // pretend this wouldn't have dead code
          |                      // elimination, and the boolean is
          |                      // actually checked
          |             computeHashCode();
          |         }
          |     }
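
The caching behaviour discussed in this subthread can be sketched with a small stand-in class. This is a simplified illustration of lazy hash caching, not the JDK's actual `String` source: `CachedHashSketch`, `cachedHashCode`, and the `computations` counter are hypothetical, and the sketch ignores the corner case of a string whose hash is legitimately 0.

```java
// Hypothetical sketch of lazy hash caching in the style the comments describe.
final class CachedHashSketch {
    private final String text;
    private int hash;          // 0 means "not computed yet" in this simplified sketch
    int computations = 0;      // counts how often the hash was actually computed

    CachedHashSketch(String text) {
        this.text = text;
    }

    int cachedHashCode() {
        int h = hash;
        if (h == 0 && !text.isEmpty()) {
            computations++;
            // Same polynomial that the String.hashCode() Javadoc specifies.
            for (int i = 0; i < text.length(); i++) {
                h = 31 * h + text.charAt(i);
            }
            hash = h;
        }
        return h; // later calls only check the cached field
    }

    /** Calls cachedHashCode() repeatedly and reports how many real computations ran. */
    static int computationsAfter(int calls) {
        CachedHashSketch s = new CachedHashSketch("hello");
        for (int i = 0; i < calls; i++) {
            s.cachedHashCode();
        }
        return s.computations;
    }

    public static void main(String[] args) {
        System.out.println(computationsAfter(10000));
    }
}
```

Under these assumptions the loop of 10000 calls does the character-by-character work once, which matches the "compute once, then check the cache" reading of the benchmark loop.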
        
           | rzzzt wrote:
           | JMH, the microbenchmark harness has an example that
           | highlights this:
           | https://github.com/openjdk/jmh/blob/master/jmh-
           | samples/src/m...
        
             | Calzifer wrote:
              | And since benchmarking is hard, it also has a helper to
             | actually "waste" time. [1] The implementation [2] might
             | give an idea that it is not always trivial to do nothing
             | but still appear busy.
             | 
             | Btw I found most of the jmh samples interesting. IMO a
             | quite effective mix of example and documentation. (and I'm
             | not sure there is even much other official documentation)
             | 
             | [1] https://github.com/openjdk/jmh/blob/master/jmh-
             | samples/src/m... [2] https://github.com/openjdk/jmh/blob/87
             | 2b7203c294d90c17766d19...
        
         | mprataps wrote:
          | You are right. This code does not recalculate. However, it was
          | written just as a sample. Mainly, the user will provide their
          | own method to process the file.
        
       | SillyUsername wrote:
       | An ArrayList for huge numbers of add operations is not
       | performant. LinkedList will see your list throughput performance
       | at least double. There are other optimisations you can do but in
       | a brief perusal this stood out like a sore thumb.
        
         | fedsocpuppet wrote:
         | Huh? It'll be slower and eat a massive amount of memory too.
        
         | pkulak wrote:
         | I've literally never seen a linked list be faster than an array
         | list in a real application, so if you're right, this is kinda
         | huge for me.
        
         | Calzifer wrote:
         | Arrays are fast and ArrayList is like a fancy array with bound
         | check and auto grows. Only the grow part can be problematic if
         | it has to grow very often. But that can be avoided by providing
         | an appropriate initial size or reusing the ArrayList by using
          | clear() instead of creating a new one. Both are used by OP in
         | this project. Especially since the code copies lists quite
         | often I would expect LinkedList to perform way worse.
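
The two ArrayList techniques mentioned here, an initial capacity and reuse via `clear()`, can be sketched as follows. The class `BatchBufferReuse` and the method `toBatches` are hypothetical and not taken from the project; the note that `clear()` keeps the backing array describes OpenJDK's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one pre-sized buffer reused for every batch.
final class BatchBufferReuse {
    /** Splits lines into batches of at most batchSize elements. */
    static List<List<String>> toBatches(List<String> lines, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        // Capacity is known up front, so the buffer never has to grow.
        List<String> buffer = new ArrayList<>(batchSize);
        for (String line : lines) {
            buffer.add(line);
            if (buffer.size() == batchSize) {
                batches.add(new ArrayList<>(buffer)); // snapshot of the references
                buffer.clear(); // empties the list; OpenJDK keeps the backing array
            }
        }
        if (!buffer.isEmpty()) {
            batches.add(new ArrayList<>(buffer)); // trailing partial batch
        }
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(toBatches(List.of("a", "b", "c", "d", "e"), 2));
    }
}
```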
        
       | ldjkfkdsjnv wrote:
       | I could write this library with an llm in a few hours
        
         | mprataps wrote:
          | Maybe. I just started this with the intention to learn about
         | multithreading. I learnt a lot of concepts which I had earlier
         | only learnt in theory. I learnt how to use VisualVM to see my
         | thread performance. I learnt to use builder design pattern. No
         | LLM can take away this learning.
         | 
         | And this project is just a start.
        
         | bogeholm wrote:
         | I could probably do an Ironman if I really wanted to
        
         | threeseed wrote:
         | No different to cheating off someone at school.
         | 
         | You didn't learn anything. You didn't accomplish anything. And
         | no one including you respects it.
        
       | sieve wrote:
       | A note on the name.
       | 
       | The nasal "m" takes on the form of the nasal in the row/class of
       | the letter that follows it. As "n" is the nasal of the "c" class,
       | the "m" becomes "n"
       | 
       | Writing Sanskrit terms using the roman script without using
       | something like IAST/ISO-15919 is a pain in the neck. They are
       | going to be mispronounced one way or the other. I try to get the
       | ISO-15919 form and strip away everything that is not a-z.
       | 
       | So, सञ्चिका (sañcikā) = sancika
       | 
       | You probably want to keep the "ch," as the average English
       | speaker is not going to remember that the "c" is the "ch" of
       | "cheese" and not "see."
        
         | arnsholt wrote:
         | It's been ages since I did Sanskrit last, but wouldn't sam-cika
         | typically have the m realized as an anusvara rather than n?
        
           | sieve wrote:
           | Not unless it precedes a classless letter or it is actually
           | "m."
           | 
           | All nasals becoming anusvaras is something Hindi/Marathi and
           | other languages using the Devanagari script do. Sanskrit uses
           | the specific form of the nasal when available.
        
       | mprataps wrote:
       | Guys. I love you all. I did not expect such quality feedback.
       | 
       | I will try to incorporate most of your feedback. Your comments
       | have given me much to learn.
       | 
       | This project was started to just learn more about multithreading
       | in a practical way. I think I succeeded with that.
        
       ___________________________________________________________________
       (page generated 2025-05-23 23:01 UTC)