[HN Gopher] I almost failed to search a 37 GB text file in under...
___________________________________________________________________
I almost failed to search a 37 GB text file in under 1 millisecond
Author : xrayarx
Score : 113 points
Date : 2022-12-16 20:42 UTC (2 hours ago)
(HTM) web link (death.andgravity.com)
(TXT) w3m dump (death.andgravity.com)
| didip wrote:
| In Unix, just split the input text into smaller chunks and fork a
| few processes to speed up the processing.
|
| If the input file is even larger and you have a lot of RAM, use
| your favorite programming language and call the mmap() syscall.
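|
| A minimal sketch of the split-and-fork idea in Python, using mmap
| and multiprocessing (the file name and target hash are made up;
| note a parallel scan is still O(n), unlike the article's binary
| search):
|
|     import mmap
|     import os
|     from multiprocessing import Pool
|
|     PATH = "pwned-passwords.txt"  # hypothetical file name
|     # Leading "\n" anchors the match at a line start (misses line 1).
|     NEEDLE = b"\n0000000A1D4B746FAA3FD526FF6D5BC8052FDB38"
|
|     def scan(bounds):
|         start, end = bounds
|         with open(PATH, "rb") as f, \
|              mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
|             pos = mm.find(NEEDLE, start, end)
|             return None if pos == -1 else pos
|
|     if __name__ == "__main__":
|         size = os.path.getsize(PATH)
|         n = os.cpu_count() or 4
|         step = size // n
|         # Overlap chunks by the needle length so a match straddling
|         # a boundary isn't missed.
|         bounds = [(i * step, min(size, (i + 1) * step + len(NEEDLE)))
|                   for i in range(n)]
|         with Pool(n) as pool:
|             for hit in pool.imap_unordered(scan, bounds):
|                 if hit is not None:
|                     print("found at byte offset", hit)
|                     break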
| [deleted]
| danielvf wrote:
| If we are really going all out for performance optimization...
|
| In theory, hashes are going to be very evenly distributed. With a
| file of that size, you could probably land within 1% of the
| correct location just by seeking to the fraction of the file size
| given by the first bytes of the hash...
|
| (It's a fun article)
| benlivengood wrote:
| With some random sampling it looks like the worst case is 5
| megabytes of error on the initial seek.
| jffry wrote:
| Indeed that's discussed in the article, under "Position
| heuristic" - https://death.andgravity.com/pwned#position-
| heuristic
| Mogzol wrote:
| They did mention they tried that, and got similar results to
| the binary search they were already doing:
| https://death.andgravity.com/pwned#position-heuristic
| leni536 wrote:
| > We can do this once, and then binary search a safety
| interval around that position. Alas, this only gets rid of
| the fast jumps at the beginning of the binary search, and for
| some reason, it ends up being slightly slower than binary
| search alone.
|
| I don't get why you can't do this at each step.
| remus wrote:
| I assume this doesn't improve perf much because you pretty
| quickly zoom in on a slice of hashes that isn't evenly
| distributed. 37GB feels like a lot of hashes, but it's pretty
| small compared to the space of all hashes; I suspect this
| approach would be more effective if we were searching over a
| much bigger chunk of data.
| danbruc wrote:
| You could compute the expected position of the hash in the file
| as hash / 2^160 * filesize and then just look around in the
| vicinity of that. Would be interesting to know what the maximum
| deviation for any hash from its expected position is, then you
| could maybe just load a single block of that size and search it
| in memory if the maximum deviation is sufficiently small.
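|
| A quick sketch of that estimate in Python (file size illustrative):
|
|     h = int("F2B14F68EB995FACB3A1C35287B778D5BD785511", 16)
|     filesize = 37_000_000_000         # ~37 GB
|     expected = h * filesize // 2**160  # hash / 2^160 * filesize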
| foota wrote:
| They actually discuss that:
| https://death.andgravity.com/pwned#position-heuristic
| puffoflogic wrote:
| Not very well, unfortunately. The author's confusion about how
| binary search works is preventing progress toward this
| solution. The idea is to do an "unbalanced" binary search, with
| the pivot chosen at offset
|
|     range size * big-endian value of hash
|         / (maximum possible big-endian value of hash + 1)
|
| Correctness is exactly the same as for ordinary binary search,
| because the correctness of binary search does not depend on the
| offset of the pivot (so long as it's not zero or past the end).
| At some heuristic range length, fall back to linear search of
| the range. (Possibly, at some larger length, fall back to
| balanced binary search... but with efficient linear search of
| no more than a couple of disk pages, I doubt this would help.)
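|
| A sketch of that unbalanced search in Python, over an in-memory
| sorted list of hash integers (file I/O is omitted and the fallback
| threshold is made up; the hashes in the file are unique):
|
|     def unbalanced_search(hashes, target, max_hash=2**160,
|                           linear_below=64):
|         lo, hi = 0, len(hashes)  # half-open search range [lo, hi)
|         while hi - lo > linear_below:
|             # Pivot at range size * hash / (max hash + 1), clamped
|             # inside the range so every step makes progress.
|             pivot = lo + (hi - lo) * target // max_hash
|             pivot = min(max(pivot, lo), hi - 1)
|             if hashes[pivot] < target:
|                 lo = pivot + 1
|             elif hashes[pivot] > target:
|                 hi = pivot
|             else:
|                 return pivot
|         # Fall back to linear search of the small remaining range.
|         for i in range(lo, hi):
|             if hashes[i] == target:
|                 return i
|         return None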
| mberning wrote:
| I am curious how fast you could query this data if it were
| inserted into a sqlite table with an index on hash.
| genericlemon24 wrote:
| I actually did try this, but I got impatient and stopped it
| after 30 minutes or so, and a ~20G database file.
|
| For some reason, after that, `ls` in that directory (or even
| shell completion) would freeze the shell entirely, even after
| reboot. I eventually managed to delete the file with `rm
| db.sqlite` (no completion allowed), but even that took like 2
| minutes.
|
| I might try again with WAL enabled (based on my shell history,
| I also deleted a -journal file).
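|
| For reference, a minimal sketch of what I'd retry with Python's
| sqlite3 (schema and names are made up):
|
|     import sqlite3
|
|     con = sqlite3.connect("db.sqlite")
|     con.execute("PRAGMA journal_mode = WAL")  # no -journal file
|     con.execute(
|         "CREATE TABLE IF NOT EXISTS pwned"
|         " (hash TEXT PRIMARY KEY, count INTEGER) WITHOUT ROWID"
|     )
|
|     def load(path):
|         # "HASH:COUNT" lines; the primary key b-tree is the index.
|         with open(path) as f:
|             rows = (line.rstrip("\n").split(":") for line in f)
|             con.executemany("INSERT INTO pwned VALUES (?, ?)", rows)
|         con.commit()
|
|     def lookup(sha1_hex):
|         cur = con.execute(
|             "SELECT count FROM pwned WHERE hash = ?", (sha1_hex,))
|         row = cur.fetchone()
|         return row[0] if row else None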
| 0b01 wrote:
| Have you tried DuckDB? It's the columnar version of SQLite.
| genericlemon24 wrote:
| Not yet, I might give it a try.
| kragen wrote:
| to not qualify as clickbait the title needs to include the word
| 'sorted'
| ohbtvz wrote:
| Let me get this straight. OP searched for an entry in a sorted
| list using dichotomic search. Okay... Any CS undergrad can do
| that. Is there something that I'm missing?
| forgotusername6 wrote:
| Does reading/seeking in a file that large make it harder?
| amelius wrote:
| CS solved this problem, but not yet in such a way that we don't
| have to think about it anymore. If Python's dict implemented
| all of this under the hood, _then_ it could be called a solved
| problem.
| dang wrote:
| Some important site guidelines, including:
|
| " _Don 't be snarky._"
|
| " _Please don 't post shallow dismissals, especially of other
| people's work. A good critical comment teaches us something._"
|
| https://news.ycombinator.com/newsguidelines.html
| civopsec wrote:
| It is famously difficult to implement binary search correctly.
| karmakaze wrote:
| How so, integer overflow? Not so famous that I'm aware.
| mason55 wrote:
| > _Not so famous that I'm aware_
|
| https://ai.googleblog.com/2006/06/extra-extra-read-all-
| about...
| kragen wrote:
| it took about ten years from the time of the first
| purported publication of a binary search algorithm for a
| correct algorithm to be published
|
| it's an algorithm where it's easy to have off-by-one errors
| that lead to nontermination or incorrect results for
| certain inputs
|
| integer overflow is also a problem on some platforms,
| especially java
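|
| for reference, a sketch of a correct lower-bound binary search in
| python, with the classic pitfalls noted in comments (python ints
| don't overflow, so the midpoint issue is only illustrative here):
|
|     def lower_bound(a, target):
|         lo, hi = 0, len(a)  # half-open range [lo, hi)
|         while lo < hi:
|             # in fixed-width languages, (lo + hi) / 2 can overflow
|             # (the famous jdk bug); lo + (hi - lo) / 2 is safe
|             mid = lo + (hi - lo) // 2
|             if a[mid] < target:
|                 lo = mid + 1  # target is strictly right of mid
|             else:
|                 hi = mid      # target is at mid or to its left
|         return lo  # first index i with a[i] >= target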
| mro_name wrote:
| create cdbs https://news.ycombinator.com/item?id=32648589
| orf wrote:
| I wonder how long `rg "^000085013a02852372159cb94101b99ccaec59e1"
| --max-count=1` would take on this?
| puffoflogic wrote:
| > we skip ahead by half the file size, then back, then ahead two
| times by a quarter.
|
| This is not how one does binary search.
| divbzero wrote:
| OP attempted this using Python.
|
| What would be the fastest way using *nix commands? A naive
| solution would be something like:
|
|     echo -n password | sha1sum | cut -d ' ' -f 1 \
|         | xargs -I hash grep hash pwned.txt
| saalweachter wrote:
| Make sure you put a space at the beginning of your command, so
| you don't leave your password sitting plaintext in your bash
| history.
| geniium wrote:
| Oh I just learned something, thank you.
| cbm-vic-20 wrote:
| If you're using bash, you'll need to use the HISTIGNORE or
| HISTCONTROL environment variables for this to work (e.g.
| HISTCONTROL=ignorespace).
| AceJohnny2 wrote:
| Perhaps there's a way to insert GNU Parallel in there to do a
| parallel search of different chunks?
|
| Or just use ripgrep, which parallelizes across cores out of the
| box.
| chkhd wrote:
| That is already doable with xargs itself:
|
|     xargs -P maxprocs
|
| From the man page: "Parallel mode: run at most maxprocs
| invocations of utility at once. If maxprocs is set to 0, xargs
| will run as many processes as possible."
| giantrobot wrote:
| GNU parallel gives you some extra features like a runtime
| log and resumable operation.
| jamespwilliams wrote:
| Perhaps pushing the definition of a *nix command slightly, but
| I'd be interested in the performance of
| https://sgrep.sourceforge.net/
| justinsaccount wrote:
| Use look:
|
|     look $(echo -n password | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) pwned.txt
|
| From the man page:
|
| NAME
|
| look - display lines beginning with a given string
|
| DESCRIPTION
|
| The look utility displays any lines in file which contain
| string. As look performs a binary search, the lines in file
| must be sorted (where sort(1) was given the same options -d
| and/or -f that look is invoked with).
|
| Example (run twice; the second run is presumably served from the
| page cache):
|
|     justin@box:~/data$ time look \
|         $(echo -n secret123 | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) \
|         pwned-passwords-sha1-ordered-by-hash-v6.txt
|     F2B14F68EB995FACB3A1C35287B778D5BD785511:17384
|
|     real    0m0.212s
|     user    0m0.005s
|     sys     0m0.001s
|
|     justin@box:~/data$ time look \
|         $(echo -n secret123 | sha1sum | cut -d ' ' -f 1 | tr a-z A-Z) \
|         pwned-passwords-sha1-ordered-by-hash-v6.txt
|     F2B14F68EB995FACB3A1C35287B778D5BD785511:17384
|
|     real    0m0.002s
|     user    0m0.003s
|     sys     0m0.001s
| [deleted]
| genericlemon24 wrote:
| Hey, I didn't know about this command, neat!
|
| On my laptop, look `time`s at ~10 ms (for comparison, the
| Python "binary search" script `time`s at ~50 ms).
| zX41ZdbW wrote:
| These days you can simply load it into a database (ClickHouse) -
| it will be compressed (about the same as the 7z file), and the
| search will take 3 ms:
|
| https://github.com/ClickHouse/ClickHouse/issues/42363
| intelVISA wrote:
| Lots of great solutions in the in-memory columnar data space,
| what a time to be alive.
| winrid wrote:
| ClickHouse is not in-memory.
| gkbrk wrote:
| ClickHouse has a table engine for keeping data in memory,
| which can be used like this:
|
|     CREATE TABLE tableName (.....) ENGINE = Memory;
| 0b01 wrote:
| DuckDB is also great for stuff like this. You can replace a
| MapReduce cluster with a single SQL query.
| foota wrote:
| I wonder if they could improve this by queueing multiple
| positions to read at once. Presumably the disk will be able to do
| some of them more efficiently.
| genericlemon24 wrote:
| (author here) Hey, that's actually a great idea. I remember
| reading about a fast grep or wc implementation that did
| parallel reads.
|
| Off the top of my head, I don't really know how to do that
| (aside from the naive way of using threads); I'm guessing
| readv[1] might be useful? Will definitely look into it.
|
| [1]: https://linux.die.net/man/2/readv
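|
| (For anyone curious, a naive threaded sketch with os.pread; note
| that readv/preadv fill several buffers from a single offset, so
| reads at different positions need separate calls. The file name
| and offsets are made up.)
|
|     import os
|     from concurrent.futures import ThreadPoolExecutor
|
|     def read_at(fd, offset, size=4096):
|         # os.pread takes an explicit offset, so threads don't
|         # fight over a shared file position.
|         return offset, os.pread(fd, size, offset)
|
|     fd = os.open("pwned-passwords.txt", os.O_RDONLY)
|     offsets = [0, 10_000_000_000, 20_000_000_000, 30_000_000_000]
|     with ThreadPoolExecutor(max_workers=len(offsets)) as pool:
|         for offset, data in pool.map(lambda o: read_at(fd, o),
|                                      offsets):
|             print(offset, data[:45])  # roughly the first line read
|     os.close(fd)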
| twawaaay wrote:
| Can't you just use something like a Bloom filter?
|
| There are a bit over 1B passwords in there (based on the size of
| the file and the length of a line). You would need a binary file
| around 3GB in size, which you would either load into memory or
| probe with about 17 single-byte reads (no searching) to figure
| out if the password is in the filter.
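|
| A minimal sketch of such a filter in Python, matching those
| numbers: ~3GB of bits for ~1B entries gives k = (m/n) * ln 2 of
| about 17 probes per lookup. (Deriving the probes from the SHA-1
| itself via double hashing is an assumption, any independent hashes
| would do; the bytearray really does allocate ~3 GiB.)
|
|     M = 24 * 2**30  # filter size in bits (3 GiB worth of bytes)
|     K = 17          # probe positions per lookup
|
|     bits = bytearray(M // 8)
|
|     def positions(sha1_hex):
|         h = int(sha1_hex, 16)
|         h1, h2 = h >> 80, h & ((1 << 80) - 1)  # split the 160 bits
|         h2 |= 1  # keep the step non-zero
|         for i in range(K):
|             yield (h1 + i * h2) % M
|
|     def add(sha1_hex):
|         for p in positions(sha1_hex):
|             bits[p // 8] |= 1 << (p % 8)
|
|     def might_contain(sha1_hex):
|         # False positives possible, false negatives impossible.
|         return all(bits[p // 8] & (1 << (p % 8))
|                    for p in positions(sha1_hex))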
| UncleEntity wrote:
| > Can't you just use something like a bloom filter?
|
| If I had the ability to download a massive file I'd try it out
| on a hextree I toy around with occasionally.
|
| If you're making an index file may as well just throw it into a
| tree structure where a lookup is anywhere from 1 to 20 pointer
| dereferences (assuming the checksum is 20 hex digits) as it
| optimizes storage so tree depth is variable. Plus it can retain
| the counts as well.
|
| Now I really want to try this out; in the last article I read
| along these lines, I used the hextree as a comparison and it was
| just as efficient as their conclusion.
| genericlemon24 wrote:
| You can, Scott Helme wrote some articles about this (I mention
| them at the end of my article):
|
| * https://scotthelme.co.uk/when-pwned-passwords-bloom/
|
| * https://scotthelme.co.uk/sketchy-pwned-passwords/ - here he
| uses a count-min sketch to also store the frequency
| seritools wrote:
| See the bonus chapter in the post (Ctrl+F "bloom" to find it).
___________________________________________________________________
(page generated 2022-12-16 23:00 UTC)