[HN Gopher] I almost failed to search a 37 GB text file in under...
       ___________________________________________________________________
        
       I almost failed to search a 37 GB text file in under 1 millisecond
        
       Author : xrayarx
       Score  : 113 points
       Date   : 2022-12-16 20:42 UTC (2 hours ago)
        
 (HTM) web link (death.andgravity.com)
 (TXT) w3m dump (death.andgravity.com)
        
       | didip wrote:
        | In Unix, just split the input text into smaller chunks and fork
        | a few processes to speed up the processing.
        | 
        | If the input file is even larger and you have a lot of RAM, use
        | your favorite programming language and call the mmap() syscall.
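        | 
        | A rough sketch of the chunk-and-fork idea in Python, with mmap
        | for the per-process scan (illustrative only; the file name and
        | needle are placeholders borrowed from elsewhere in the thread):
        | 
        |   import mmap, os
        |   from multiprocessing import Pool
        | 
        |   PATH = "pwned-passwords-sha1-ordered-by-hash-v6.txt"
        |   NEEDLE = b"000085013A02852372159CB94101B99CCAEC59E1"
        | 
        |   def scan(byte_range):
        |       start, end = byte_range
        |       with open(PATH, "rb") as f:
        |           mm = mmap.mmap(f.fileno(), 0,
        |                          access=mmap.ACCESS_READ)
        |           pos = mm.find(NEEDLE, start, end)
        |           return None if pos == -1 else pos
        | 
        |   if __name__ == "__main__":
        |       size = os.path.getsize(PATH)
        |       n = os.cpu_count() or 4
        |       # overlap ranges by one line so a match sitting on
        |       # a chunk boundary isn't missed
        |       chunks = [(i * size // n,
        |                  min(size, (i + 1) * size // n + 64))
        |                 for i in range(n)]
        |       with Pool(n) as pool:
        |           hits = [p for p in pool.map(scan, chunks)
        |                   if p is not None]
        |       print(hits)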
        
       | [deleted]
        
       | danielvf wrote:
       | If we are really going all out for performance optimization...
       | 
       | In theory, hashes are going to be very evenly distributed. With a
       | file of that size you could probably land within 1% of the
       | correct location just by seeking to byte offset percentage given
       | by the first bytes of the hash...
       | 
       | (It's a fun article)
        
         | benlivengood wrote:
         | With some random sampling it looks like the worst case is 5
         | megabytes of error on the initial seek.
        
         | jffry wrote:
         | Indeed that's discussed in the article, under "Position
         | heuristic" - https://death.andgravity.com/pwned#position-
         | heuristic
        
         | Mogzol wrote:
         | They did mention they tried that, and got similar results to
         | the binary search they were already doing:
         | https://death.andgravity.com/pwned#position-heuristic
        
           | leni536 wrote:
           | > We can do this once, and then binary search a safety
           | interval around that position. Alas, this only gets rid of
           | the fast jumps at the beginning of the binary search, and for
           | some reason, it ends up being slightly slower than binary
           | search alone.
           | 
           | I don't get why you can't do this at each step.
        
           | remus wrote:
          | I assume this doesn't improve perf much because you pretty
          | quickly zoom in on a slice of hashes that are not evenly
          | distributed. 37GB feels like a lot of hashes, but it's pretty
          | small compared to the space of all hashes; I suspect this
          | approach would be more effective if we were searching over a
          | much bigger chunk of data.
        
       | danbruc wrote:
       | You could compute the expected position of the hash in the file
       | as hash / 2^160 * filesize and then just look around in the
       | vicinity of that. Would be interesting to know what the maximum
       | deviation for any hash from its expected position is, then you
       | could maybe just load a single block of that size and search it
       | in memory if the maximum deviation is sufficiently small.
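        | 
        | A back-of-the-envelope sketch of that (the file name and hash are
        | placeholders; the window size is a guess, sized generously around
        | the ~5 MB worst-case error another comment estimates):
        | 
        |   import os
        | 
        |   PATH = "pwned-passwords-sha1-ordered-by-hash-v6.txt"
        |   HASH = "000085013A02852372159CB94101B99CCAEC59E1"
        | 
        |   size = os.path.getsize(PATH)
        |   # expected byte offset: hash / 2**160 * file size
        |   expected = int(HASH, 16) * size // 2**160
        | 
        |   # load one block around the estimate and search it
        |   # in memory; big enough to cover the max deviation
        |   WINDOW = 16 * 2**20
        |   with open(PATH, "rb") as f:
        |       f.seek(max(0, expected - WINDOW // 2))
        |       block = f.read(WINDOW)
        |   print(HASH.encode() in block)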
        
         | foota wrote:
         | They actually discuss that:
         | https://death.andgravity.com/pwned#position-heuristic
        
           | puffoflogic wrote:
           | Not very well unfortunately. The author's confusion about how
           | binary search works is preventing progress toward this
           | solution. The idea is to do an "unbalanced" binary search,
           | with the pivot chosen as offset (range size * big-endian
           | value of hash / (maximum possible big-endian value of hash +
           | 1)). Correctness is exactly the same as correctness of binary
           | search, because the correctness of binary search does not
           | depend on the offset of the pivot (so long as it's not zero
           | or past the end). At some heuristic range length, fall back
           | to linear search of the range. (Possibly, at some larger
           | length, fall back to balanced binary search... But with
           | efficient linear search of no more than a couple disk pages,
           | I doubt this would help.)
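            | 
            | A sketch of that in Python (hand-rolled and only lightly
            | checked; the file name and target hash are placeholders,
            | and the fallback threshold is arbitrary):
            | 
            |   import os
            | 
            |   PATH = "pwned-passwords-sha1-ordered-by-hash-v6.txt"
            |   TARGET = "000085013A02852372159CB94101B99CCAEC59E1"
            |   tval, tb = int(TARGET, 16), TARGET.encode()
            |   FALLBACK = 64 * 1024   # below this, scan linearly
            | 
            |   with open(PATH, "rb") as f:
            |       lo, hi = 0, os.path.getsize(PATH)
            |       while hi - lo > FALLBACK:
            |           # pivot = range size * hash / (max + 1)
            |           mid = lo + (hi - lo) * tval // 2**160
            |           mid = min(max(mid, lo + 1), hi - 1)
            |           f.seek(mid)
            |           f.readline()       # skip partial line
            |           line = f.readline()
            |           if not line or line[:40] > tb:
            |               hi = mid       # target is to the left
            |           elif line[:40] == tb:
            |               print(line.decode().rstrip())
            |               break
            |           else:
            |               lo = f.tell()  # target is to the right
            |       else:
            |           # small window left: linear search
            |           f.seek(lo)
            |           block = f.read(max(hi - lo, 0) + 64)
            |           for line in block.splitlines():
            |               if line[:40] == tb:
            |                   print(line.decode())
            |                   break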
        
       | mberning wrote:
       | I am curious how fast you could query this data if it were
       | inserted into a sqlite table with an index on hash.
        
         | genericlemon24 wrote:
         | I actually did try this, but I got impatient and stopped it
         | after 30 minutes or so, and a ~20G database file.
         | 
         | For some reason, after that, `ls` in that directory (or even
         | shell completion) would freeze the shell entirely, even after
         | reboot. I eventually managed to delete the file with `rm
         | db.sqlite` (no completion allowed), but even that took like 2
         | minutes.
         | 
         | I might try again with WAL enabled (based on my shell history,
         | I also deleted a -journal file).
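          | 
          | Roughly the kind of setup in question, as a sketch (the pragmas,
          | schema, and batching are guesses at something sensible rather
          | than a benchmarked configuration):
          | 
          |   import sqlite3
          | 
          |   PATH = "pwned-passwords-sha1-ordered-by-hash-v6.txt"
          | 
          |   db = sqlite3.connect("pwned.sqlite")
          |   db.execute("PRAGMA journal_mode = WAL")
          |   db.execute("PRAGMA synchronous = OFF")
          |   # clustered on hash; the input is already sorted,
          |   # so the inserts stay mostly sequential
          |   db.execute(
          |       "CREATE TABLE IF NOT EXISTS pwned ("
          |       " hash TEXT PRIMARY KEY, count INTEGER"
          |       ") WITHOUT ROWID"
          |   )
          | 
          |   def rows(path):
          |       with open(path) as f:
          |           for line in f:
          |               h, _, c = line.rstrip().partition(":")
          |               yield h, int(c)
          | 
          |   with db:
          |       db.executemany(
          |           "INSERT INTO pwned VALUES (?, ?)", rows(PATH)
          |       )
          | 
          |   h = "000085013A02852372159CB94101B99CCAEC59E1"
          |   q = "SELECT count FROM pwned WHERE hash = ?"
          |   print(db.execute(q, (h,)).fetchone())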
        
           | 0b01 wrote:
           | Have you tried DuckDB? It's the columnar version of SQLite.
        
             | genericlemon24 wrote:
             | Not yet, I might give it a try.
        
       | kragen wrote:
       | to not qualify as clickbait the title needs to include the word
       | 'sorted'
        
       | ohbtvz wrote:
       | Let me get this straight. OP searched for an entry in a sorted
       | list using dichotomic search. Okay... Any CS undergrad can do
       | that. Is there something that I'm missing?
        
         | forgotusername6 wrote:
         | Does reading/seeking in a file that large make it harder?
        
         | amelius wrote:
          | CS has solved this problem, but not yet in such a way that we
          | don't have to think about it anymore. If Python's dict
          | implemented all of this under the hood, _then_ it could be
          | called a solved problem.
        
         | dang wrote:
         | Some important site guidelines, including:
         | 
         | " _Don 't be snarky._"
         | 
         | " _Please don 't post shallow dismissals, especially of other
         | people's work. A good critical comment teaches us something._"
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
         | civopsec wrote:
         | It is famously difficult to implement binary search correctly.
        
           | karmakaze wrote:
           | How so, integer overflow? Not so famous that I'm aware.
        
             | mason55 wrote:
              | > _Not so famous that I'm aware_
             | 
             | https://ai.googleblog.com/2006/06/extra-extra-read-all-
             | about...
        
             | kragen wrote:
             | it took about ten years from the time of the first
             | purported publication of a binary search algorithm for a
             | correct algorithm to be published
             | 
             | it's an algorithm where it's easy to have off-by-one errors
             | that lead to nontermination or incorrect results for
             | certain inputs
             | 
             | integer overflow is also a problem on some platforms,
             | especially java
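              | 
              | a minimal sketch of the invariant-based version in python
              | (illustrative; python's ints don't overflow, so the
              | midpoint rewrite only matters in fixed-width-integer
              | languages like java or c):
              | 
              |   def bisect_hashes(lines, target):
              |       # invariant: if present, target is in [lo, hi)
              |       lo, hi = 0, len(lines)
              |       while lo < hi:
              |           # in java/c, write lo + (hi - lo) // 2 to
              |           # avoid the (lo + hi) integer overflow
              |           mid = lo + (hi - lo) // 2
              |           if lines[mid] < target:
              |               lo = mid + 1   # rule out mid, move right
              |           else:
              |               hi = mid       # keep mid as a candidate
              |       if lo < len(lines) and lines[lo] == target:
              |           return lo
              |       return -1
              | 
              | each step strictly shrinks [lo, hi), which is exactly the
              | part that's easy to get wrong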
        
       | mro_name wrote:
       | create cdbs https://news.ycombinator.com/item?id=32648589
        
       | orf wrote:
       | I wonder how long `rg "^000085013a02852372159cb94101b99ccaec59e1"
       | --max-count=1` would take on this?
        
       | puffoflogic wrote:
       | > we skip ahead by half the file size, then back, then ahead two
       | times by a quarter.
       | 
       | This is not how one does binary search.
        
       | divbzero wrote:
       | OP attempted this using Python.
       | 
       | What would be the fastest way using *nix commands? A naive
        | solution would be something like:
        | 
        |   echo -n password | sha1sum | cut -d ' ' -f 1 \
        |     | xargs -I hash grep hash pwned.txt
        
         | saalweachter wrote:
         | Make sure you put a space at the beginning of your command, so
         | you don't leave your password sitting plaintext in your bash
         | history.
        
           | geniium wrote:
           | Oh I just learned something, thank you.
        
           | cbm-vic-20 wrote:
          | If you're using bash, you'll need to set the HISTCONTROL (e.g.
          | HISTCONTROL=ignorespace) or HISTIGNORE environment variable for
          | the leading space to do anything.
        
         | AceJohnny2 wrote:
         | Perhaps there's a way to insert GNU Parallel in there to do
         | parallel search of different chunks?
         | 
         | Or just use ripgrep, which integrates multi-core.
        
           | chkhd wrote:
            | That is already doable with xargs itself:
            | 
            |   xargs -P maxprocs
            | 
            | Parallel mode: run at most maxprocs invocations of utility
            | at once. If maxprocs is set to 0, xargs will run as many
            | processes as possible.
        
             | giantrobot wrote:
             | GNU parallel gives you some extra features like a runtime
             | log and resumable operation.
        
         | jamespwilliams wrote:
         | Perhaps pushing the definition of a *nix command slightly, but
         | I'd be interested in the performance of
         | https://sgrep.sourceforge.net/
        
         | justinsaccount wrote:
          | Use look:
          | 
          |   look $(echo -n password | sha1sum \
          |     | cut -d ' ' -f 1 | tr a-z A-Z) pwned.txt
         | 
         | from man page:
         | 
         | NAME
         | 
         | look - display lines beginning with a given string
         | 
         | DESCRIPTION
         | 
         | The look utility displays any lines in file which contain
         | string. As look performs a binary search, the lines in file
         | must be sorted (where sort(1) was given the same options -d
         | and/or -f that look is invoked with).
         | 
          | example:
          | 
          |   justin@box:~/data$ time look $(echo -n secret123 \
          |     | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) \
          |     pwned-passwords-sha1-ordered-by-hash-v6.txt
          |   F2B14F68EB995FACB3A1C35287B778D5BD785511:17384
          | 
          |   real    0m0.212s
          |   user    0m0.005s
          |   sys     0m0.001s
          | 
          |   justin@box:~/data$ time look $(echo -n secret123 \
          |     | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) \
          |     pwned-passwords-sha1-ordered-by-hash-v6.txt
          |   F2B14F68EB995FACB3A1C35287B778D5BD785511:17384
          | 
          |   real    0m0.002s
          |   user    0m0.003s
          |   sys     0m0.001s
        
           | [deleted]
        
           | genericlemon24 wrote:
           | Hey, I didn't know about this command, neat!
           | 
           | On my laptop, look `time`s at ~10 ms (for comparison, the
           | Python "binary search" script `time`s at ~50 ms).
        
       | zX41ZdbW wrote:
       | These days you can simply load it into a database (ClickHouse) -
       | it will be compressed (about the same as the 7z file), and the
       | search will take 3 ms:
       | 
       | https://github.com/ClickHouse/ClickHouse/issues/42363
        
         | intelVISA wrote:
         | Lots of great solutions in the in-memory columnar data space,
         | what a time to be alive.
        
           | winrid wrote:
           | Clickhouse is not in memory.
        
             | gkbrk wrote:
              | ClickHouse has a table engine for keeping data in memory.
              | It can be used like this:
              | 
              | create table tableName (.....) Engine = Memory();
        
         | 0b01 wrote:
          | DuckDB is also great for stuff like this. You can replace a
          | MapReduce cluster with a single SQL query.
        
       | foota wrote:
       | I wonder if they could improve this by queueing multiple
       | positions to read at once. Presumably the disk will be able to do
       | some of them more efficiently.
        
         | genericlemon24 wrote:
         | (author here) Hey, that's actually a great idea. I remember
         | reading about a fast grep or wc implementation that did
         | parallel reads.
         | 
          | Off the top of my head, I don't really know how to do that
          | (aside from the naive way of using threads); I'm guessing
          | readv[1] might be useful? Will definitely look into it.
         | 
         | [1]: https://linux.die.net/man/2/readv
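          | 
          | One possible shape for this, as a sketch (not benchmarked;
          | os.pread takes an explicit offset, so several threads can read
          | without sharing a file position):
          | 
          |   import os
          |   from concurrent.futures import ThreadPoolExecutor
          | 
          |   PATH = "pwned-passwords-sha1-ordered-by-hash-v6.txt"
          |   BLOCK = 4096
          | 
          |   fd = os.open(PATH, os.O_RDONLY)
          | 
          |   def read_at(offset):
          |       # no seek(): each call passes its own offset
          |       return os.pread(fd, BLOCK, offset)
          | 
          |   # e.g. the next few probe positions of a search
          |   offsets = [0, 10 * 2**30, 20 * 2**30, 30 * 2**30]
          |   with ThreadPoolExecutor(len(offsets)) as ex:
          |       blocks = list(ex.map(read_at, offsets))
          | 
          |   os.close(fd)
          |   print([len(b) for b in blocks])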
        
       | twawaaay wrote:
       | Can't you just use something like a bloom filter?
       | 
        | There are a bit over 1B passwords in there (based on the size of
        | the file and the length of the lines). You would need a binary
        | file around 3GB in size; you would either load it into memory or
        | do about 17 accesses to read specific bytes (no searching) to
        | figure out if the password is in the filter.
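        | 
        | The rough arithmetic behind those numbers (taking the ~1B and
        | ~3GB figures at face value):
        | 
        |   from math import exp, log
        | 
        |   n = 1_000_000_000    # passwords (rough estimate)
        |   m = 3 * 10**9 * 8    # bits in a ~3 GB filter
        |   # optimal number of hash functions, and the
        |   # resulting false positive rate
        |   k = round(m / n * log(2))          # ~17 probes
        |   fp = (1 - exp(-k * n / m)) ** k    # ~1e-5
        |   print(k, fp)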
        
         | UncleEntity wrote:
         | > Can't you just use something like a bloom filter?
         | 
          | If I had the ability to download a massive file, I'd try it
          | out on a hextree I toy around with occasionally.
          | 
          | If you're making an index file, you may as well just throw it
          | into a tree structure where a lookup is anywhere from 1 to 40
          | pointer dereferences (the SHA-1 checksum is 40 hex digits), as
          | it optimizes storage so the tree depth is variable. Plus it can
          | retain the counts as well.
          | 
          | Now I really want to try this out; for the last article I read
          | along these lines, I used it as a comparison and it was just as
          | efficient as the article's final approach.
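          | 
          | A toy, in-memory version of the variable-depth idea (dict-based,
          | so it says nothing about an on-disk layout; the hash and count
          | are taken from the look example above):
          | 
          |   def insert(node, key, count):
          |       # walk one hex digit at a time; stop as soon
          |       # as the remaining suffix is unique
          |       # (assumes all keys are distinct, as here)
          |       while True:
          |           d, key = key[0], key[1:]
          |           if d not in node:
          |               node[d] = (key, count)   # leaf
          |               return
          |           child = node[d]
          |           if isinstance(child, dict):
          |               node = child
          |               continue
          |           # collision with an existing leaf: split
          |           okey, ocount = child
          |           node[d] = {okey[0]: (okey[1:], ocount)}
          |           node = node[d]
          | 
          |   def lookup(node, key):
          |       while True:
          |           d, key = key[0], key[1:]
          |           child = node.get(d)
          |           if child is None:
          |               return None
          |           if isinstance(child, dict):
          |               node = child
          |               continue
          |           suffix, count = child
          |           return count if suffix == key else None
          | 
          |   root = {}
          |   h = "F2B14F68EB995FACB3A1C35287B778D5BD785511"
          |   insert(root, h, 17384)
          |   print(lookup(root, h))   # -> 17384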
        
         | genericlemon24 wrote:
         | You can, Scott Helme wrote some articles about this (I mention
         | them at the end of my article):
         | 
         | * https://scotthelme.co.uk/when-pwned-passwords-bloom/
         | 
         | * https://scotthelme.co.uk/sketchy-pwned-passwords/ - here he
         | uses a count-min sketch to also store the frequency
        
         | seritools wrote:
          | See the bonus chapter in the post (Ctrl+F "bloom" to find it)
        
       ___________________________________________________________________
       (page generated 2022-12-16 23:00 UTC)