[HN Gopher] Show HN: Samchika - A Java Library for Fast, Multith...
___________________________________________________________________
Show HN: Samchika - A Java Library for Fast, Multithreaded File
Processing
Hi HN, I built a Java library called SmartFileProcessor to make
high-performance, multi-threaded file processing simpler and more
maintainable. Most Java file processing solutions either involve a
lot of boilerplate or don't handle concurrency, backpressure, or
metrics well out of the box. I needed something fast, clean, and
production-friendly -- so I built this.

Key features:
- Multi-threaded line/batch processing using a configurable thread pool
- Producer/consumer model with built-in backpressure
- Buffered, asynchronous writing with optional auto-flush
- Live metrics: memory usage, throughput, thread times, queue stats
- Simple builder API -- minimal setup to get going
- Output metrics to JSON, CSV, or human-readable format

Use cases:
- Large CSV or log file parsing
- ETL pre-processing
- Line-by-line filtering and transformation
- Batch preparation before ingestion

I'd really appreciate your feedback -- feature ideas, performance
improvements, critiques, or whether this solves a real problem for
others. Thanks for checking it out!
Author : mprataps
Score : 56 points
Date : 2025-05-23 13:39 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
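
The post mentions a producer/consumer model with built-in backpressure but does not show how it works. As a rough illustration only (this is not Samchika's code or API; the class and method names here are made up), backpressure can be had from a bounded `ArrayBlockingQueue`: the producer's `put` blocks whenever the consumers fall behind and the queue is full.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; none of these names come from the library.
class BackpressureSketch {
    // Unique sentinel object that tells a consumer to stop (compared by identity).
    private static final String POISON = new String("<<EOF>>");

    // Sums the lengths of all lines using `workers` consumer threads.
    static long totalLength(List<String> lines, int workers, int queueCapacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicLong total = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String line = queue.take();   // waits while the queue is empty
                        if (line == POISON) break;
                        total.addAndGet(line.length());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            for (String line : lines) {
                queue.put(line);                      // blocks while the queue is full
            }
            for (int w = 0; w < workers; w++) {
                queue.put(POISON);                    // one stop marker per consumer
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing", e);
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(totalLength(List.of("a", "bb", "ccc"), 2, 2));
    }
}
```

The queue capacity is the only tuning knob in this sketch: a small capacity bounds memory use, a larger one smooths out bursts.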
| gavinray wrote:
| Please don't do this.
|
| Have the OS handle memory paging and buffering for you and then
| use Java's parallel algorithms to do concurrent processing.
|
| Create a "MappedByteBuffer" and mmap the file into memory.
|
| If the file is too large, use an "AsynchronousFileChannel" and
| asynchronously read + process segments of the buffer.
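
A minimal sketch of the mmap suggestion above, assuming the file fits in a single `MappedByteBuffer` (under 2 GiB) and that the per-chunk work is just counting newline bytes. The class name and the 1 MiB chunk size are arbitrary choices for illustration, not anything from the thread or the library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.stream.IntStream;

// Hypothetical sketch: count '\n' bytes in a memory-mapped file in parallel.
class MmapLineCount {
    private static final int CHUNK = 1 << 20; // 1 MiB per task (arbitrary)

    static long countLines(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size == 0) return 0;
            if (size > Integer.MAX_VALUE) {
                throw new IllegalArgumentException("too large for one MappedByteBuffer");
            }
            // The OS pages the file in on demand; no explicit read buffers.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int chunks = (int) ((size + CHUNK - 1) / CHUNK);
            return IntStream.range(0, chunks).parallel().mapToLong(c -> {
                int start = c * CHUNK;
                int end = (int) Math.min(size, (long) start + CHUNK);
                long n = 0;
                for (int i = start; i < end; i++) {
                    // Absolute get: does not touch the buffer's position.
                    if (buf.get(i) == '\n') n++;
                }
                return n;
            }).sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 1) {
            System.out.println(countLines(Path.of(args[0])));
        }
    }
}
```

Counting bytes sidesteps the problem of lines that straddle chunk boundaries; real line processing would have to align chunk edges to newlines first.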
| 90s_dev wrote:
| Knowing nothing about Java or compsci, I am very curious to see
| the in depth discussion by all you Java/compsci experts that
| your comment invites.
| papercrane wrote:
| If you're using a newer JVM you can also map a "MemorySegment",
| which doesn't have the 2GiB limit that byte buffers have.
| gavinray wrote:
| Good point, have written about this in the past
|
| https://gavinray97.github.io/blog/panama-not-so-foreign-
| memo...
| switchbak wrote:
| Memory mapping is fun, but shouldn't we have some kind of async
| IO / io_uring support by now? If you're looking at really
| high-perf I/O, mmapping isn't really state of the art right now.
|
| Then again, if you're in Java/JVM land you're probably not
| building bleeding edge DBs ala ScyllaDB. But I'm somewhat
| surprised at the lack of projects in this space. One would
| think this would pair well with some of the reactive stream
| implementations so that you wouldn't have to reimplement things
| like backpressure, etc.
| exabrial wrote:
| try not to be a dick
| threeseed wrote:
| a) There have been libraries supporting io_uring on the JVM
| for many years now.
|
| b) ScyllaDB is not bleeding edge. It uses DPDK, which is
|    relatively old by now.
|
| c) There are countless reactive stream implementations e.g.
| https://vertx.io/docs/vertx-reactive-streams/java/
| hawk_ wrote:
| I thought DPDK would still be faster than io_uring.
| jlokier wrote:
| Last time I measured on Linux (a few years ago), with NVMe,
| mmap + calling out to a thread pool to async-page-touch (so
| the main thread didn't block) was faster than io_uring (from
| the main thread) for random access reads.
| SillyUsername wrote:
| Better caveat that with, "but watch memory consumption, given
| the nature of the likes of CopyOnWriteArrayList". GC will be a
| bitch.
| mprataps wrote:
| Thanks for this comment. This will be an interesting aspect to
| explore.
| codetiger wrote:
| Do you have a benchmark comparison with other similar tools?
| sureglymop wrote:
| Perhaps I misunderstand something but doesn't reading from a file
| require a system call? And when there is a system call, the
| context switches? So wouldn't using multiple threads to read from
| a file mean that they can't really read in parallel anyway
| because they block each other when executing that system call?
| mike_hearn wrote:
| System calls aren't context switches. They flip a permission
| bit in the CPU but don't do the work a context switch involves
| like modifying the MMU, flushing the TLBs, modifying kernel
| structures, doing scheduling etc.
|
| Also, modern filing systems are all thread safe. You can have
| multiple threads reading and even writing in parallel on
| different CPU cores.
| bionsystem wrote:
| If you open() read-only I don't think it blocks (some other
| process writing to it might block though).
| porridgeraisin wrote:
| > system call, the context switches
|
| No, there is no separate kernel "executing". When you do a
| syscall, your thread becomes kernel mode and it executes the
| function behind the syscall, then when it's done, your thread
| reverts to user mode.
|
| A context switch is when one thread is being swapped out for
| another. Now the syscall could internally spawn a thread and
| context switch to that, but I'm not sure if this happens in
| read() or any syscall for that matter.
| xxs wrote:
| What all the other siblings said: syscalls are not context
| switches. They are called 'mode switches' and have significantly
| less impact.
| sidcool wrote:
| It would be even more amazing if it had tests. It's already
| pretty good.
| DannyB2 wrote:
| Should the tests include some 10 GB files?
| VWWHFSfQ wrote:
| Should include a script for generating 10GB files maybe
| sidcool wrote:
| Naah. I meant unit tests. Not load tests.
| mprataps wrote:
| I will add unit tests next.
| VWWHFSfQ wrote:
| Am I wrong in thinking that this is duplicating lines in memory
| repeatedly when buffering lines into batches, and then submitting
| batches to threads? And then again when calling the line
| processor? Seems like it might be a memory hog
| Calzifer wrote:
| Since most things in Java are handled by reference, including
| Strings, there should not be much memory overhead. From a
| quick look I could not find any actual line duplication.
| Calzifer wrote:
|     for(int i=0;i<10000; ++i){
|         // do nothing just compute hash again and again.
|         hash = str.hashCode();
|     }
|
| https://github.com/MayankPratap/Samchika/blob/ebf45acad1963d...
|
| "do nothing" is correct, "again and again" not so much. Java
| caches the hash code for Strings, and since the JIT knows that (at
| least in recent versions[1]) it might even remove this loop
| entirely.
|
| [1] https://news.ycombinator.com/item?id=43854337
| hyperpape wrote:
| Even in older versions, if the compiler can see that there are
| no side-effects, it is free to remove the loop and simply
| return the value from the first iteration.
|
| I'm actually pretty curious to see what this method does on
| versions that don't have the optimization to treat hashCodes as
| quasi-final.
|
| A quick test using Java 17 shows it's not being optimized away
| _completely_, but it's taking...~1 ns per iteration, which is
| not enough to compute a hash code.
|
| Edit: I'm being silly. It will just compute the hashcode the
| first time, and then repeatedly check that it's cached and
| return it. So the JIT doesn't have to do any _real_ work to
| make this skip the hash code calculation.
|
| So most likely, the effective code is:
|
|     computeHashCode();
|     for (int i = 0; i < 10000; i++) {
|         // pretend this wouldn't have dead code elimination,
|         // and the boolean is actually checked
|         if (false) {
|             computeHashCode();
|         }
|     }
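
For contrast, a sketch of a loop body that cannot be served from the String's cached hash field, because it recomputes the value from the characters on every call. It uses the polynomial documented for `String.hashCode()`; the class and method names are invented for illustration. A JIT may still hoist or fold a pure computation like this, so it is no substitute for a proper harness such as JMH.

```java
// Hypothetical sketch: per-iteration work that bypasses String's cached hash.
class HashWorkSketch {
    // Same formula as documented for String.hashCode():
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], with int overflow.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Mixes the loop counter into the result so each round contributes
    // to the returned value instead of being discarded.
    static int busyWork(String s, int rounds) {
        int acc = 0;
        for (int r = 0; r < rounds; r++) {
            acc ^= manualHash(s) + r;
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(busyWork("some line of input", 10_000));
    }
}
```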
| rzzzt wrote:
| JMH, the microbenchmark harness, has an example that
| highlights this:
| https://github.com/openjdk/jmh/blob/master/jmh-
| samples/src/m...
| Calzifer wrote:
| And since benchmarking is hard, it also has a helper to
| actually "waste" time. [1] The implementation [2] might
| give an idea that it is not always trivial to do nothing
| but still appear busy.
|
| Btw I found most of the jmh samples interesting. IMO a
| quite effective mix of example and documentation. (and I'm
| not sure there is even much other official documentation)
|
| [1] https://github.com/openjdk/jmh/blob/master/jmh-samples/src/m...
| [2] https://github.com/openjdk/jmh/blob/872b7203c294d90c17766d19...
| mprataps wrote:
| You are right. This code does not recalculate the hash. However,
| it was written just as a sample. In practice, the user will
| provide their own method to process the file.
| SillyUsername wrote:
| An ArrayList for huge numbers of add operations is not
| performant. LinkedList will see your list throughput performance
| at least double. There are other optimisations you can do but in
| a brief perusal this stood out like a sore thumb.
| fedsocpuppet wrote:
| Huh? It'll be slower and eat a massive amount of memory too.
| pkulak wrote:
| I've literally never seen a linked list be faster than an array
| list in a real application, so if you're right, this is kinda
| huge for me.
| Calzifer wrote:
| Arrays are fast, and ArrayList is like a fancy array with bounds
| checks and automatic growth. Only the growth can be problematic if
| it has to happen very often. But that can be avoided by providing
| an appropriate initial size or reusing the ArrayList by calling
| clear() instead of creating a new one. Both are used by OP in
| this project. Especially since the code copies lists quite
| often, I would expect LinkedList to perform way worse.
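
The two ArrayList techniques named above (an appropriate initial size, and reuse via `clear()`) can be sketched as follows. This is an illustration with invented names, not the project's batching code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: one presized ArrayList reused for every batch.
class BatchBufferSketch {
    // Calls handler once per batch and returns the number of batches.
    // The handler must not keep a reference to the list: it is reused.
    static int forEachBatch(Iterable<String> lines, int batchSize,
                            Consumer<List<String>> handler) {
        // Presized, so the backing array never has to grow while filling.
        List<String> batch = new ArrayList<>(batchSize);
        int batches = 0;
        for (String line : lines) {
            batch.add(line);
            if (batch.size() == batchSize) {
                handler.accept(batch);
                batch.clear();   // empties the list but keeps the backing array
                batches++;
            }
        }
        if (!batch.isEmpty()) {  // final partial batch
            handler.accept(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        int n = forEachBatch(List.of("a", "b", "c", "d", "e"), 2,
                b -> System.out.println(b));
        System.out.println(n + " batches");
    }
}
```

Handing the same list to another thread would require a copy, which is where the list copying mentioned above comes from.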
| ldjkfkdsjnv wrote:
| I could write this library with an llm in a few hours
| mprataps wrote:
| Maybe. I just started this with the intention to learn about
| multithreading. I learnt a lot of concepts which I had earlier
| only learnt in theory. I learnt how to use VisualVM to see my
| thread performance. I learnt to use the builder design pattern.
| No LLM can take away this learning.
|
| And this project is just a start.
| bogeholm wrote:
| I could probably do an Ironman if I really wanted to
| threeseed wrote:
| No different to cheating off someone at school.
|
| You didn't learn anything. You didn't accomplish anything. And
| no one including you respects it.
| sieve wrote:
| A note on the name.
|
| The nasal "m" takes on the form of the nasal in the row/class of
| the letter that follows it. As "n" is the nasal of the "c" class,
| the "m" becomes "n"
|
| Writing Sanskrit terms using the roman script without using
| something like IAST/ISO-15919 is a pain in the neck. They are
| going to be mispronounced one way or the other. I try to get the
| ISO-15919 form and strip away everything that is not a-z.
|
| So, सञ्चिका (sañcikā) = sancika
|
| You probably want to keep the "ch," as the average English
| speaker is not going to remember that the "c" is the "ch" of
| "cheese" and not "see."
| arnsholt wrote:
| It's been ages since I did Sanskrit last, but wouldn't sam-cika
| typically have the m realized as an anusvara rather than n?
| sieve wrote:
| Not unless it precedes a classless letter or it is actually
| "m."
|
| All nasals becoming anusvaras is something Hindi/Marathi and
| other languages using the Devanagari script do. Sanskrit uses
| the specific form of the nasal when available.
| mprataps wrote:
| Guys. I love you all. I did not expect such quality feedback.
|
| I will try to incorporate most of your feedback. Your comments
| have given me much to learn.
|
| This project was started to just learn more about multithreading
| in a practical way. I think I succeeded with that.
___________________________________________________________________
(page generated 2025-05-23 23:01 UTC)