# hck

A sharp `cut(1)` clone.

`hck` is a shortening of `hack`, a rougher form of `cut`.

A close-to-drop-in replacement for `cut` that can use a regex delimiter instead of a fixed string. Additionally, this tool allows for specification of the order of the output columns using the same column selection syntax as `cut` (see below for examples).

No single feature of `hck` on its own makes it stand out over `awk`, `cut`, `xsv`, or other such tools. Where `hck` excels is in making common things easy, such as reordering output fields or splitting records on a weird delimiter. It is meant to be simple and easy to use while exploring datasets.

## Features

* Reordering of output columns! i.e. if you use `-f4,2,8` the output columns will appear in the order `4`, `2`, `8`
* Delimiter treated as a regex (with `-R`), i.e. you can split on multiple spaces without an extra pipe to `tr`!
* Specification of output delimiter
* Selection of columns by header string literal with the `-F` option, or by regex by setting the `-r` flag
* Input files will be automatically decompressed if their file extension is recognizable and a local binary exists to perform the decompression (similar to `ripgrep`). See Decompression.
* Speed

## Non-goals

* `hck` does not aim to be a complete CSV / TSV parser a la `xsv`, which will respect quoting rules. It acts like `cut` in that it will split on the delimiter no matter where in the line it is (see the sketch after this list).
* Delimiters cannot contain newlines... well, they can, they will just never be seen. `hck` will always be a line-by-line tool where newlines are the standard `\n` / `\r\n`.
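To illustrate the first non-goal, here is a minimal sketch of how `hck`, like `cut`, splits on every occurrence of the delimiter, even inside quotes. The input file and its contents are made up for illustration:

```bash
# hypothetical input: the second field is a quoted value containing a comma
printf 'id,name,score\n1,"Doe, Jane",42\n' > quoted.csv

# hck splits on every comma, so the quoted field is broken in two;
# a quote-aware tool like xsv would keep "Doe, Jane" as one field
hck -Ld, -f2 ./quoted.csv
# expected output:
# name
# "Doe
```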
## Install

* Homebrew / Linuxbrew (built with profile guided optimizations)

```bash
brew tap sstadick/hck
brew install hck
```

* Debian (Ubuntu) (built with profile guided optimizations)

```bash
curl -LO https://github.com/sstadick/hck/releases/download//hck-linux-amd64.deb
sudo dpkg -i hck-linux-amd64.deb
```

* With the Rust toolchain:

```bash
export RUSTFLAGS='-C target-cpu=native'
cargo install hck
```

* From the releases page (the binaries have been built with profile guided optimizations)
* Or, if you want the absolute fastest possible build that makes use of profile guided optimizations AND native CPU features:

```bash
# Assumes you are on stable rust
# NOTE: this won't work on windows, see CI for linked issue
rustup component add llvm-tools-preview
git clone https://github.com/sstadick/hck
cd hck
bash pgo_local.sh
cp ./target/release/hck ~/.cargo/bin/hck
```

PRs are both welcome and encouraged for adding more packaging options and build types! I'd especially welcome PRs for the Windows family of package managers / generally making sure things are Windows friendly.

## Examples

### Splitting with a string literal

```bash
hck -Ld' ' -f1-3,5- ./README.md | head -n4
# hck
```
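The `-f` flag accepts the same field-selection syntax as `cut`: comma-separated field numbers and ranges, where an open-ended range such as `5-` runs to the last field. A minimal sketch with throwaway input (the `printf` line and `fields.txt` are illustrative, not part of the project):

```bash
printf 'f1 f2 f3 f4 f5 f6\n' > fields.txt

# select fields 1 through 3, skip field 4, then field 5 through the end
hck -Ld' ' -f1-3,5- ./fields.txt
# -> f1 f2 f3 f5 f6 (joined with the output delimiter)
```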

### Splitting with a regex delimiter

```bash
# note: '\s+' is the default delimiter
ps aux | hck -f1-3,5- | head -n4
USER PID %CPU VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 169452 13472 ? Ss Jun21 0:19 /sbin/init splash
root 2 0.0 0 0 ? S Jun21 0:00 [kthreadd]
root 3 0.0 0 0 ? I< Jun21 0:00 [rcu_gp]
```

### Reordering output columns

```bash
ps aux | hck -f2,1,3- | head -n4
PID USER %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1 root 0.0 0.0 169452 13472 ? Ss Jun21 0:19 /sbin/init splash
2 root 0.0 0.0 0 0 ? S Jun21 0:00 [kthreadd]
3 root 0.0 0.0 0 0 ? I< Jun21 0:00 [rcu_gp]
```

### Changing the output record separator

```bash
ps aux | hck -D'___' -f2,1,3 | head -n4
PID___USER___%CPU
1___root___0.0
2___root___0.0
3___root___0.0
```

### Select columns with regex

```bash
# Note: the output order matches the order of the -F args
ps aux | hck -r -F '^ST.*' -F '^USER$' | head -n4
STAT START USER
Ss Jun21 root
S Jun21 root
I< Jun21 root
```

A string-literal variant of `-F` is sketched below, after the decompression example.

### Automagic decompression

```bash
gzip ./README.md
hck -Ld' ' -f1-3,5- -z ./README.md.gz | head -n4
# hck
```
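### Selecting columns by header string literal

Per the Features list, `-F` matches headers as plain string literals when the `-r` flag is not set. A minimal sketch with made-up input (the `printf` data and `people.txt` are illustrative):

```bash
printf 'name age city\nalice 30 nyc\nbob 25 sf\n' > people.txt

# without -r, each -F value must match a header exactly;
# as with the regex form, output columns follow the order of the -F arguments
hck -Ld' ' -F 'city' -F 'name' ./people.txt
# -> the city column, then the name column, header row included
```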

### Splitting on multiple characters

```bash
# with a string literal
printf 'this$;$is$;$a$;$test\na$;$b$;$3$;$four\n' > test.txt
hck -Ld'$;$' -f3,4 ./test.txt
a test
3 four

# with an interesting regex
printf 'this123__is456--a789-test\na129_-b849-_3109_-four\n' > test.txt
hck -d'\d{3}[-_]+' -f3,4 ./test.txt
a test
3 four
```

## Benchmarks

This set of benchmarks is simply meant to show that `hck` is in the same ballpark as other tools. These are meant to capture real-world usage of the tools, so in the multi-space delimiter benchmark for `gcut`, for example, we use `tr` to convert the space runs to a single space and then pipe to `gcut`.

Note this is not meant to be an authoritative set of benchmarks; it is just meant to give a relative sense of the performance of different ways of accomplishing the same tasks.

### Hardware

Ubuntu 20 AMD Ryzen 9 3950X 16-Core Processor w/ 64 GB DDR4 memory and 1TB NVMe drive

### Data

The all_train.csv data is used. This is a CSV dataset with 7 million lines. We test it both using `,` as the delimiter, and then also using `\s\s\s` as a delimiter.

PRs are welcome for benchmarks with more tools, or improved (but still realistic) pipelines for commands.

### Tools

* cut: https://www.gnu.org/software/coreutils/manual/html_node/The-cut-command.html (v8.30)
* mawk: https://invisible-island.net/mawk/mawk.html (v1.3.4)
* xsv: https://github.com/BurntSushi/xsv (v0.13.0, compiled locally with optimizations)
* tsv-utils: https://github.com/eBay/tsv-utils (v2.2.0, ldc2, compiled locally with optimizations)
* choose: https://github.com/theryangeary/choose (v1.3.1, compiled locally with optimizations)

### Single character delimiter benchmark

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.494 ± 0.026 | 1.463 | 1.532 | 1.00 |
| `hck -Ld, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.735 ± 0.004 | 1.729 | 1.740 | 1.16 ± 0.02 |
| `hck -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.772 ± 0.009 | 1.760 | 1.782 | 1.19 ± 0.02 |
| `hck -d, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.935 ± 0.041 | 1.862 | 1.958 | 1.30 ± 0.04 |
| `choose -f , -i ./hyper_data.txt 0 7 18 > /dev/null` | 4.597 ± 0.016 | 4.574 | 4.617 | 3.08 ± 0.05 |
| `tsv-select -d, -f 1,8,19 ./hyper_data.txt > /dev/null` | 1.788 ± 0.006 | 1.783 | 1.798 | 1.20 ± 0.02 |
| `xsv select -d, 1,8,19 ./hyper_data.txt > /dev/null` | 5.683 ± 0.017 | 5.660 | 5.706 | 3.80 ± 0.07 |
| `awk -F, '{print $1, $8, $19}' ./hyper_data.txt > /dev/null` | 5.021 ± 0.013 | 5.005 | 5.041 | 3.36 ± 0.06 |
| `cut -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 7.045 ± 0.415 | 6.847 | 7.787 | 4.72 ± 0.29 |

### Multi-character delimiter benchmark

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hck -Ld' ' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 2.127 ± 0.004 | 2.122 | 2.133 | 1.00 |
| `hck -Ld' ' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 2.467 ± 0.012 | 2.459 | 2.488 | 1.16 ± 0.01 |
| `hck -d'[[:space:]]+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 9.736 ± 0.069 | 9.630 | 9.786 | 4.58 ± 0.03 |
| `hck -d'[[:space:]]+' --no-mmap -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 9.840 ± 0.024 | 9.813 | 9.869 | 4.63 ± 0.01 |
| `hck -d'\s+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 10.446 ± 0.013 | 10.425 | 10.456 | 4.91 ± 0.01 |
| `hck -d'\s+' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 10.498 ± 0.118 | 10.441 | 10.710 | 4.94 ± 0.06 |
| `choose -f ' ' -i ./hyper_data.txt 0 7 18 > /dev/null` | 3.266 ± 0.011 | 3.248 | 3.277 | 1.54 ± 0.01 |
| `choose -f '[[:space:]]+' -i ./hyper_data.txt 0 7 18 > /dev/null` | 18.020 ± 0.022 | 17.993 | 18.040 | 8.47 ± 0.02 |
| `choose -f '\s+' -i ./hyper_data.txt 0 7 18 > /dev/null` | 59.425 ± 0.457 | 58.900 | 59.893 | 27.94 ± 0.22 |
| `awk -F' ' '{print $1, $8 $19}' ./hyper_data_multichar.txt > /dev/null` | 6.824 ± 0.027 | 6.780 | 6.851 | 3.21 ± 0.01 |
| `awk -F' ' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 6.072 ± 0.181 | 5.919 | 6.385 | 2.85 ± 0.09 |
| `awk -F'[:space:]+' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 11.125 ± 0.066 | 11.012 | 11.177 | 5.23 ± 0.03 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| cut -d ' ' -f1,8,19 > /dev/null` | 7.508 ± 0.059 | 7.433 | 7.591 | 3.53 ± 0.03 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| tail -n+2 \| xsv select -d ' ' 1,8,19 --no-headers > /dev/null` | 6.719 ± 0.241 | 6.419 | 6.983 | 3.16 ± 0.11 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| hck -Ld' ' -f1,8,19 > /dev/null` | 6.351 ± 0.041 | 6.296 | 6.391 | 2.99 ± 0.02 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| tsv-select -d ' ' -f 1,8,19 > /dev/null` | 6.359 ± 0.056 | 6.311 | 6.453 | 2.99 ± 0.03 |

## Decompression

The following table indicates the file extension / binary pairs that are used to try to decompress a file when the `-z` option is specified:

| Extension | Binary | Type |
|:---|:---|:---|
| `*.gz` | `gzip -d -c` | gzip |
| `*.tgz` | `gzip -d -c` | gzip |
| `*.bz2` | `bzip2 -d -c` | bzip2 |
| `*.tbz2` | `bzip2 -d -c` | bzip2 |
| `*.xz` | `xz -d -c` | xz |
| `*.txz` | `xz -d -c` | xz |
| `*.lz4` | `lz4 -d -c` | lz4 |
| `*.lzma` | `xz --format=lzma -d -c` | lzma |
| `*.br` | `brotli -d -c` | brotli |
| `*.zst` | `zstd -d -c` | zstd |
| `*.zstd` | `zstd -q -d -c` | zstd |
| `*.Z` | `uncompress -c` | uncompress |

When a file with one of the extensions above is found, `hck` will open a subprocess running the decompression tool listed above and read from the output of that tool. If the binary can't be found, then `hck` will try to read the compressed file as-is. See grep_cli for source code. The end goal is to add a similar preprocessor as `ripgrep`.
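In practice, `-z` behaves roughly like piping the file through the matching decompressor yourself. A minimal sketch of that equivalence (the file name is illustrative):

```bash
gzip -c ./README.md > ./README.md.gz

# the two commands below should print the same thing when gzip is on $PATH;
# if the binary can't be found, hck falls back to reading the compressed bytes as-is
hck -Ld' ' -f1,2 -z ./README.md.gz
gzip -d -c ./README.md.gz | hck -Ld' ' -f1,2
```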
## Profile Guided Optimization

See the `pgo*.sh` scripts for how to build this with optimizations. You will need to install the llvm tools via `rustup component add llvm-tools-preview` for this to work. Building with PGO seems to improve performance anywhere from 5-30% depending on the platform and codepath; i.e. on Mac OS it seems to have a larger effect, and on the regex codepath it also seems to have a greater effect.

## TODO

* Add a `complement` argument
* Don't reparse fields / headers for each new file
* Figure out how to better reuse / share a vec
* Support indexing from the end (unlikely though)
* Bake in grep / filtering somehow (this will not be done at the expense of the primary utility of `hck`)
* Move tests from main to core
* Add more tests all around
* Add pigz support
* Add a greedy/non-greedy option that will ignore blank fields: `split.filter(|s| !s.is_empty() || config.opt.non_greedy)`
* Experiment with a parallel parser as described here. This should be very doable given we don't care about escaping quotes and such.

## More packages and builds

https://github.com/sharkdp/bat/blob/master/.github/workflows/CICD.yml

## References

* rust-coreutils-cut
* ripgrep