# hck

A sharp `cut(1)` clone.

`hck` is a shortening of `hack`, a rougher form of `cut`.

A close-to-drop-in replacement for `cut` that can use a regex delimiter instead of a fixed string. Additionally, this tool allows for specification of the order of the output columns using the same column selection syntax as `cut` (see below for examples).

No single feature of `hck` on its own makes it stand out over `awk`, `cut`, `xsv`, or other such tools. Where `hck` excels is in making common things easy, such as reordering output fields or splitting records on a weird delimiter. It is meant to be simple and easy to use while exploring datasets.

## Features

* Reordering of output columns! i.e. if you use `-f4,2,8` the output columns will appear in the order `4`, `2`, `8`
* Delimiter treated as a regex (with `-R`), i.e. you can split on multiple spaces without an extra pipe to `tr`!
* Specification of output delimiter
* Selection of columns by header string literal with the `-F` option, or by regex by setting the `-r` flag
* Input files will be automatically decompressed if their file extension is recognizable and a local binary exists to perform the decompression (similar to `ripgrep`). See Decompression.
* Speed

## Non-goals

* `hck` does not aim to be a complete CSV / TSV parser a la `xsv`, which will respect quoting rules. It acts like `cut` in that it will split on the delimiter no matter where in the line it is (see the sketch after this list).
* Delimiters cannot contain newlines... well, they can, they will just never be seen. `hck` will always be a line-by-line tool where newlines are the standard `\n` / `\r\n`.
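To illustrate the first non-goal, here is a minimal sketch of how `hck`, like `cut`, splits on every occurrence of the delimiter, even inside quotes. The input file and its contents are made up for illustration:

```bash
# hypothetical input: the second field is a quoted value containing a comma
printf 'id,name,score\n1,"Doe, Jane",42\n' > quoted.csv

# hck splits on every comma, so the quoted field is broken in two;
# a quote-aware tool like xsv would keep "Doe, Jane" as one field
hck -Ld, -f2 ./quoted.csv
# expected output:
# name
# "Doe
```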
## Install

* Homebrew / Linuxbrew (built with profile guided optimizations)

```bash
brew tap sstadick/hck
brew install hck
```

* Debian (Ubuntu) (built with profile guided optimizations)

```bash
curl -LO https://github.com/sstadick/hck/releases/download//hck-linux-amd64.deb
sudo dpkg -i hck-linux-amd64.deb
```

* With the Rust toolchain:

```bash
export RUSTFLAGS='-C target-cpu=native'
cargo install hck
```

* From the releases page (the binaries have been built with profile guided optimizations)
* Or, if you want the absolute fastest possible build that makes use of profile guided optimizations AND native CPU features:

```bash
# Assumes you are on stable rust
# NOTE: this won't work on windows, see CI for linked issue
rustup component add llvm-tools-preview
git clone https://github.com/sstadick/hck
cd hck
bash pgo_local.sh
cp ./target/release/hck ~/.cargo/bin/hck
```

PRs are both welcome and encouraged for adding more packaging options and build types! I'd especially welcome PRs for the Windows family of package managers / generally making sure things are Windows friendly.

## Examples

### Splitting with a string literal

```bash
hck -Ld' ' -f1-3,5- ./README.md | head -n4
# hck
```
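The `-f` flag accepts the same field-selection syntax as `cut`: comma-separated field numbers and ranges, where an open-ended range such as `5-` runs to the last field. A minimal sketch with throwaway input (the `printf` line and `fields.txt` are illustrative, not part of the project):

```bash
printf 'f1 f2 f3 f4 f5 f6\n' > fields.txt

# select fields 1 through 3, skip field 4, then field 5 through the end
hck -Ld' ' -f1-3,5- ./fields.txt
# -> f1 f2 f3 f5 f6 (joined with the output delimiter)
```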

### Splitting with a regex delimiter

```bash
# note: '\s+' is the default delimiter
ps aux | hck -f1-3,5- | head -n4
USER PID %CPU VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 169452 13472 ? Ss Jun21 0:19 /sbin/init splash
root 2 0.0 0 0 ? S Jun21 0:00 [kthreadd]
root 3 0.0 0 0 ? I< Jun21 0:00 [rcu_gp]
```

### Reordering output columns

```bash
ps aux | hck -f2,1,3- | head -n4
PID USER %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1 root 0.0 0.0 169452 13472 ? Ss Jun21 0:19 /sbin/init splash
2 root 0.0 0.0 0 0 ? S Jun21 0:00 [kthreadd]
3 root 0.0 0.0 0 0 ? I< Jun21 0:00 [rcu_gp]
```

### Changing the output record separator

```bash
ps aux | hck -D'___' -f2,1,3 | head -n4
PID___USER___%CPU
1___root___0.0
2___root___0.0
3___root___0.0
```

### Select columns with regex

```bash
# Note: the output order matches the order of the -F args
ps aux | hck -r -F '^ST.*' -F '^USER$' | head -n4
STAT START USER
Ss Jun21 root
S Jun21 root
I< Jun21 root
```

A string-literal variant of `-F` is sketched below, after the decompression example.

### Automagic decompression

```bash
gzip ./README.md
hck -Ld' ' -f1-3,5- -z ./README.md.gz | head -n4
# hck
```
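### Selecting columns by header string literal

Per the Features list, `-F` matches headers as plain string literals when the `-r` flag is not set. A minimal sketch with made-up input (the `printf` data and `people.txt` are illustrative):

```bash
printf 'name age city\nalice 30 nyc\nbob 25 sf\n' > people.txt

# without -r, each -F value must match a header exactly;
# as with the regex form, output columns follow the order of the -F arguments
hck -Ld' ' -F 'city' -F 'name' ./people.txt
# -> the city column, then the name column, header row included
```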

### Splitting on multiple characters

```bash
# with a string literal
printf 'this$;$is$;$a$;$test\na$;$b$;$3$;$four\n' > test.txt
hck -Ld'$;$' -f3,4 ./test.txt
a test
3 four

# with an interesting regex
printf 'this123__is456--a789-test\na129_-b849-_3109_-four\n' > test.txt
hck -d'\d{3}[-_]+' -f3,4 ./test.txt
a test
3 four
```

## Benchmarks

This set of benchmarks is simply meant to show that `hck` is in the same ballpark as other tools. These are meant to capture real-world usage of the tools, so in the multi-space delimiter benchmark for `gcut`, for example, we use `tr` to convert the space runs to a single space and then pipe to `gcut`.

Note this is not meant to be an authoritative set of benchmarks; it is just meant to give a relative sense of the performance of different ways of accomplishing the same tasks.

### Hardware

Ubuntu 20 AMD Ryzen 9 3950X 16-Core Processor w/ 64 GB DDR4 memory and 1TB NVMe drive

### Data

The all_train.csv data is used. This is a CSV dataset with 7 million lines. We test it both using `,` as the delimiter, and then also using `\s\s\s` as a delimiter.

PRs are welcome for benchmarks with more tools, or improved (but still realistic) pipelines for commands.

### Tools

* cut: https://www.gnu.org/software/coreutils/manual/html_node/The-cut-command.html (v8.30)
* mawk: https://invisible-island.net/mawk/mawk.html (v1.3.4)
* xsv: https://github.com/BurntSushi/xsv (v0.13.0, compiled locally with optimizations)
* tsv-utils: https://github.com/eBay/tsv-utils (v2.2.0, ldc2, compiled locally with optimizations)
* choose: https://github.com/theryangeary/choose (v1.3.1, compiled locally with optimizations)

### Single character delimiter benchmark

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hck -Ld, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.494 ± 0.026 | 1.463 | 1.532 | 1.00 |
| `hck -Ld, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.735 ± 0.004 | 1.729 | 1.740 | 1.16 ± 0.02 |
| `hck -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 1.772 ± 0.009 | 1.760 | 1.782 | 1.19 ± 0.02 |
| `hck -d, -f1,8,19 --no-mmap ./hyper_data.txt > /dev/null` | 1.935 ± 0.041 | 1.862 | 1.958 | 1.30 ± 0.04 |
| `choose -f , -i ./hyper_data.txt 0 7 18 > /dev/null` | 4.597 ± 0.016 | 4.574 | 4.617 | 3.08 ± 0.05 |
| `tsv-select -d, -f 1,8,19 ./hyper_data.txt > /dev/null` | 1.788 ± 0.006 | 1.783 | 1.798 | 1.20 ± 0.02 |
| `xsv select -d, 1,8,19 ./hyper_data.txt > /dev/null` | 5.683 ± 0.017 | 5.660 | 5.706 | 3.80 ± 0.07 |
| `awk -F, '{print $1, $8, $19}' ./hyper_data.txt > /dev/null` | 5.021 ± 0.013 | 5.005 | 5.041 | 3.36 ± 0.06 |
| `cut -d, -f1,8,19 ./hyper_data.txt > /dev/null` | 7.045 ± 0.415 | 6.847 | 7.787 | 4.72 ± 0.29 |

### Multi-character delimiter benchmark

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `hck -Ld' ' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 2.127 ± 0.004 | 2.122 | 2.133 | 1.00 |
| `hck -Ld' ' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 2.467 ± 0.012 | 2.459 | 2.488 | 1.16 ± 0.01 |
| `hck -d'[[:space:]]+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 9.736 ± 0.069 | 9.630 | 9.786 | 4.58 ± 0.03 |
| `hck -d'[[:space:]]+' --no-mmap -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 9.840 ± 0.024 | 9.813 | 9.869 | 4.63 ± 0.01 |
| `hck -d'\s+' -f1,8,19 ./hyper_data_multichar.txt > /dev/null` | 10.446 ± 0.013 | 10.425 | 10.456 | 4.91 ± 0.01 |
| `hck -d'\s+' -f1,8,19 --no-mmap ./hyper_data_multichar.txt > /dev/null` | 10.498 ± 0.118 | 10.441 | 10.710 | 4.94 ± 0.06 |
| `choose -f ' ' -i ./hyper_data.txt 0 7 18 > /dev/null` | 3.266 ± 0.011 | 3.248 | 3.277 | 1.54 ± 0.01 |
| `choose -f '[[:space:]]+' -i ./hyper_data.txt 0 7 18 > /dev/null` | 18.020 ± 0.022 | 17.993 | 18.040 | 8.47 ± 0.02 |
| `choose -f '\s+' -i ./hyper_data.txt 0 7 18 > /dev/null` | 59.425 ± 0.457 | 58.900 | 59.893 | 27.94 ± 0.22 |
| `awk -F' ' '{print $1, $8 $19}' ./hyper_data_multichar.txt > /dev/null` | 6.824 ± 0.027 | 6.780 | 6.851 | 3.21 ± 0.01 |
| `awk -F' ' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 6.072 ± 0.181 | 5.919 | 6.385 | 2.85 ± 0.09 |
| `awk -F'[:space:]+' '{print $1, $8, $19}' ./hyper_data_multichar.txt > /dev/null` | 11.125 ± 0.066 | 11.012 | 11.177 | 5.23 ± 0.03 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| cut -d ' ' -f1,8,19 > /dev/null` | 7.508 ± 0.059 | 7.433 | 7.591 | 3.53 ± 0.03 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| tail -n+2 \| xsv select -d ' ' 1,8,19 --no-headers > /dev/null` | 6.719 ± 0.241 | 6.419 | 6.983 | 3.16 ± 0.11 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| hck -Ld' ' -f1,8,19 > /dev/null` | 6.351 ± 0.041 | 6.296 | 6.391 | 2.99 ± 0.02 |
| `< ./hyper_data_multichar.txt tr -s ' ' \| tsv-select -d ' ' -f 1,8,19 > /dev/null` | 6.359 ± 0.056 | 6.311 | 6.453 | 2.99 ± 0.03 |

## Decompression

The following table indicates the file extension / binary pairs that are used to try to decompress a file when the `-z` option is specified:

| Extension | Binary | Type |
|:---|:---|:---|
| `*.gz` | `gzip -d -c` | gzip |
| `*.tgz` | `gzip -d -c` | gzip |
| `*.bz2` | `bzip2 -d -c` | bzip2 |
| `*.tbz2` | `bzip2 -d -c` | bzip2 |
| `*.xz` | `xz -d -c` | xz |
| `*.txz` | `xz -d -c` | xz |
| `*.lz4` | `lz4 -d -c` | lz4 |
| `*.lzma` | `xz --format=lzma -d -c` | lzma |
| `*.br` | `brotli -d -c` | brotli |
| `*.zst` | `zstd -d -c` | zstd |
| `*.zstd` | `zstd -q -d -c` | zstd |
| `*.Z` | `uncompress -c` | uncompress |

When a file with one of the extensions above is found, `hck` will open a subprocess running the decompression tool listed above and read from the output of that tool. If the binary can't be found, then `hck` will try to read the compressed file as-is. See grep_cli for source code. The end goal is to add a similar preprocessor as `ripgrep`.
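In practice, `-z` behaves roughly like piping the file through the matching decompressor yourself. A minimal sketch of that equivalence (the file name is illustrative):

```bash
gzip -c ./README.md > ./README.md.gz

# the two commands below should print the same thing when gzip is on $PATH;
# if the binary can't be found, hck falls back to reading the compressed bytes as-is
hck -Ld' ' -f1,2 -z ./README.md.gz
gzip -d -c ./README.md.gz | hck -Ld' ' -f1,2
```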
## Profile Guided Optimization

See the `pgo*.sh` scripts for how to build this with optimizations. You will need to install the llvm tools via `rustup component add llvm-tools-preview` for this to work. Building with PGO seems to improve performance anywhere from 5-30% depending on the platform and codepath; i.e. on Mac OS it seems to have a larger effect, and on the regex codepath it also seems to have a greater effect.

## TODO

* Add a `complement` argument
* Don't reparse fields / headers for each new file
* Figure out how to better reuse / share a vec
* Support indexing from the end (unlikely though)
* Bake in grep / filtering somehow (this will not be done at the expense of the primary utility of `hck`)
* Move tests from main to core
* Add more tests all around
* Add pigz support
* Add a greedy/non-greedy option that will ignore blank fields: `split.filter(|s| !s.is_empty() || config.opt.non_greedy)`
* Experiment with a parallel parser as described here. This should be very doable given we don't care about escaping quotes and such.

## More packages and builds

https://github.com/sharkdp/bat/blob/master/.github/workflows/CICD.yml

## References

* rust-coreutils-cut
* ripgrep