[HN Gopher] Column - High-performance, columnar, in-memory store...
___________________________________________________________________
Column - High-performance, columnar, in-memory store with bitmap
indexing in Go
Author : ngaut
Score : 147 points
Date : 2021-06-21 07:59 UTC (15 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| dafelst wrote:
| I love seeing stuff like this, getting more understanding of the
| layers underlying high performance data analytics is super
| interesting to me.
|
| This project seems very similar to Apache Arrow, if OP or anyone
| else is around to explain why one might be used over the other
| that would be great.
| ctvo wrote:
| > This project seems very similar to Apache Arrow, if OP or
| anyone else is around to explain why one might be used over the
| other that would be great.
|
| Arrow is primarily a serialization format to transfer data
| between distributed systems. It uses zero-copy and other
| techniques to quickly process and store large data sets in
| memory.
|
| Other libraries allow you to query Arrow data once processed.
|
| This project is an in-memory columnar data store with querying
| and other capabilities.
| GeertJohan wrote:
| Really nice project! The transactions and replication streaming
| seem to make it a great choice for sharding/distributed
| environments!
| kelindar wrote:
| That's the idea, a transaction commit log decoupled from
| underlying durable storage allows you to build your own
| persistence layers. I'm still thinking of building a simple
| (memory-mapped?) layer, but as an optional, separate lib.
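To make the "commit log decoupled from durable storage" idea concrete, here is a minimal Go sketch. Everything in it (`Commit`, `Log`, `Store`, `MemoryLog`, `Replay`) is a hypothetical name invented for illustration and is not the library's actual API. The store only knows how to hand each commit to a `Log`, so the persistence or replication layer can be swapped freely.

```go
package main

import "fmt"

// Commit is a hypothetical record of one committed transaction.
type Commit struct {
	Seq     uint64
	Updates map[string]int64 // key -> new value
}

// Log is whatever the persistence or replication layer provides.
type Log interface {
	Append(Commit) error
}

// Store is an in-memory store that knows nothing about durability;
// it only forwards each commit to the Log it was given.
type Store struct {
	seq  uint64
	data map[string]int64
	log  Log
}

func NewStore(log Log) *Store {
	return &Store{data: make(map[string]int64), log: log}
}

// Apply writes the commit to the log first, then applies it in memory.
func (s *Store) Apply(updates map[string]int64) error {
	s.seq++
	c := Commit{Seq: s.seq, Updates: updates}
	if err := s.log.Append(c); err != nil {
		s.seq-- // the commit never happened
		return err
	}
	for k, v := range updates {
		s.data[k] = v
	}
	return nil
}

// Get reads a value from the in-memory state.
func (s *Store) Get(k string) (int64, bool) {
	v, ok := s.data[k]
	return v, ok
}

// MemoryLog is a stand-in persistence layer that keeps commits in a slice.
// A file-backed or network-backed Log would implement the same method.
type MemoryLog struct{ Commits []Commit }

func (l *MemoryLog) Append(c Commit) error {
	l.Commits = append(l.Commits, c)
	return nil
}

// Replay rebuilds a store (for example a replica) from logged commits.
func Replay(commits []Commit) *Store {
	s := NewStore(&MemoryLog{})
	for _, c := range commits {
		_ = s.Apply(c.Updates) // MemoryLog.Append never fails
	}
	return s
}

func main() {
	log := &MemoryLog{}
	primary := NewStore(log)
	_ = primary.Apply(map[string]int64{"a": 1})
	_ = primary.Apply(map[string]int64{"a": 2, "b": 5})

	replica := Replay(log.Commits)
	fmt.Println(replica.Get("a"))
}
```

Because the store depends only on the small `Log` interface, the same commits could be streamed to a file, a memory-mapped region, or another node without touching the store itself.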
| potamic wrote:
| I'm a little naive on this subject, but just wondering what are
| the use cases for in-memory columnar stores? I was under the
| impression that columnar stores are good for OLAP use cases
| involving massive amounts of data. For datasets that fit within
| memory, are there still benefits in organizing data in a columnar
| manner and are the performance gains appreciable?
| dafelst wrote:
| I am merely a dabbler in this area and definitely not an
| expert, but my understanding is that columnar stores tend to be
| substantially more efficient for analytical operations over
| large sets of in memory data by virtue of the data being easier
| to operate on with vectorized instructions like SIMD.
| pepemysurprised wrote:
| Also see my comment above, but you find this kind of storage
| commonly in game development [0] where you are optimizing for
| batch access on specific columns to minimize cache misses. It's
| usually used as the storage layer for Entity Component Systems.
| It's also called data-oriented design [1]
|
| [0] http://cowboyprogramming.com/2007/01/05/evolve-your-
| heirachy...
|
| [1] https://en.wikipedia.org/wiki/Data-oriented_design
| nvartolomei wrote:
| Not a subject matter expert, but a few that come to mind: memory
| can become a bottleneck; reading sequential data instead of
| jumping pointers, reading useless data and thrashing caches gives
| much better throughput; compression applied to columnar data is
| more efficient and can give a throughput boost when memory
| bandwidth becomes a bottleneck on systems with a high number of
| CPUs.
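The cache argument in the replies above can be sketched in Go by contrasting a row-oriented layout (a slice of structs) with a column-oriented one (one slice per field). The `Row` and `Columns` types and the function names are made up for this illustration. Note that the standard Go compiler does little automatic vectorization, so in a sketch like this any gain would come mostly from scanning a dense slice rather than from SIMD.

```go
package main

import "fmt"

// Row is a row-oriented record: all fields of one row sit together in memory.
type Row struct {
	ID    uint32
	Name  string
	Price float64
	Qty   int64
}

// Columns is the column-oriented layout: one contiguous slice per field.
type Columns struct {
	ID    []uint32
	Name  []string
	Price []float64
	Qty   []int64
}

// FromRows converts row-oriented data into the columnar layout.
func FromRows(rows []Row) *Columns {
	c := &Columns{}
	for _, r := range rows {
		c.ID = append(c.ID, r.ID)
		c.Name = append(c.Name, r.Name)
		c.Price = append(c.Price, r.Price)
		c.Qty = append(c.Qty, r.Qty)
	}
	return c
}

// SumPriceRows walks every row, so the unused ID, Name and Qty fields
// are pulled through the cache along with each Price.
func SumPriceRows(rows []Row) float64 {
	var total float64
	for i := range rows {
		total += rows[i].Price
	}
	return total
}

// SumPriceColumn scans a single dense slice of float64 values.
func SumPriceColumn(c *Columns) float64 {
	var total float64
	for _, p := range c.Price {
		total += p
	}
	return total
}

func main() {
	rows := []Row{{1, "a", 1.5, 2}, {2, "b", 2.5, 1}, {3, "c", 4.0, 7}}
	cols := FromRows(rows)
	fmt.Println(SumPriceRows(rows), SumPriceColumn(cols))
}
```

Both functions compute the same sum; the difference is only in how much unrelated data each one touches per row, which is what matters once the working set no longer fits in cache.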
| mananaysiempre wrote:
| I'm not sure about any performance gains or working with large
| datasets, but the ancient Metakit[1] was just a really pleasant
| relational algebra library ( [?] SQL data model library, it
| could do _e.g._ relations as values which are difficult for
| row-oriented databases). I'd say that Metakit & OOMK in Tcl is
| strictly better than the relational part of Pandas in Python,
| except the documentation is somewhere between bad and
| nonexistent.
|
| [1]: https://git.jeelabs.org/jcw/metakit
| pepemysurprised wrote:
| Very cool. This kind of storage is similar to what's typically
| being used in Entity Component Systems like EnTT [0], which can
| also be interpreted as in-memory column oriented databases.
|
| Recently I'm starting to like this type of programming over OOP.
| Each part of your application accesses a global in-memory DB with
| the full state. But unlike a traditional persistent db it's
| really fast.
|
| [0] https://github.com/skypjack/entt
| eklitzke wrote:
| How is using an in-memory database related to OOP? They seem
| completely orthogonal to me.
| shkkmo wrote:
| It is not. It is related to ECS which is being contrasted
| with OOP.
| pjmlp wrote:
| Something that gets lost is that this is also a variety of OOP.
|
| https://www.amazon.com/Component-Software-Object-Oriented-Pr...
|
| Programming against Objective-C protocols, COM interfaces,
| Component Pascal framework, and so forth.
| shkkmo wrote:
| What? In ECS, state is managed separately from logic and there
| is no inheritance. How is it a variety of OOP?
| pjmlp wrote:
| OOP is not inheritance; that is just one possible trait among
| OOP implementations, just like FP isn't Haskell.
|
| Component programming with interface separated from state
| is exactly what Objective-C protocols, COM, VBX, SOM,
| Component Pascal were all about.
|
| Those who promote ECS as not being OOP have, 99% of the time,
| never read books like the one I linked in my comment.
|
| Instead they reference a talk done at GDC by one of the
| very first engines that made it well known to those that
| never read CS papers or books.
| setr wrote:
| Interfaces/protocols aren't really the same though. An
| interface defines capabilities for an object; the
| capabilities are directly associated with the type.
|
| In ECS (where Components are a bag of data, and systems
| handle all logic/operations), and like a DB, the object
| is defined by an id and its relations; from the
| relations, you can derive available capabilities.
|
| That is, an object tells you what it can do. In a DB,
| what it can do tells you the object.
|
| You can create the same system with interfaces by simply
| ignoring the methods part of it, and keeping the data
| part, but associating data with capabilities is pretty
| much the defining difference between objects and structs.
|
| More importantly from an architectural perspective, in
| ECS the logic isn't associated with the object, it's
| associated with a system that takes the object as input.
| The system is shared across all objects. The object
| (entity) for ECS is little more than an id and some
| relations.
|
| An ECS very directly corresponds to an RDBMS. To call it
| OOP is to deny the ORM's classic Object-Relational
| mismatch.
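The entity/component/system split described in the comment above can be sketched in a few lines of Go. This is only an illustration with invented names (`World`, `MoveSystem`, and so on), and it uses maps for brevity where a real ECS would typically use dense, column-like arrays. The entity is just an id, components are plain data in per-type tables, and the logic lives in a system that runs over whichever entities have the required components.

```go
package main

import "fmt"

// Entity is nothing more than an identifier.
type Entity uint32

// Position and Velocity are components: plain data, no methods.
type Position struct{ X, Y float64 }
type Velocity struct{ DX, DY float64 }

// World stores each component type in its own table keyed by entity,
// much like one relation per component in a database.
type World struct {
	next       Entity
	Positions  map[Entity]*Position
	Velocities map[Entity]*Velocity
}

func NewWorld() *World {
	return &World{
		Positions:  make(map[Entity]*Position),
		Velocities: make(map[Entity]*Velocity),
	}
}

// NewEntity hands out a fresh id; the entity has no capabilities yet.
func (w *World) NewEntity() Entity {
	w.next++
	return w.next
}

// MoveSystem holds the logic. It applies to every entity that has both a
// Position and a Velocity, i.e. the join of the two tables.
func MoveSystem(w *World, dt float64) {
	for e, v := range w.Velocities {
		if p, ok := w.Positions[e]; ok {
			p.X += v.DX * dt
			p.Y += v.DY * dt
		}
	}
}

func main() {
	w := NewWorld()

	ball := w.NewEntity()
	w.Positions[ball] = &Position{0, 0}
	w.Velocities[ball] = &Velocity{1, 2}

	wall := w.NewEntity() // has a position but no velocity, so it never moves
	w.Positions[wall] = &Position{5, 5}

	MoveSystem(w, 0.5)
	fmt.Println(*w.Positions[ball], *w.Positions[wall])
}
```

What an entity "can do" is derived from which tables contain its id, which is the relational reading of ECS made in the comment.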
| pjmlp wrote:
| An interface in a component object model can be made only
| of properties.
|
| Secondly most languages with OOP support aren't
| Smalltalk/Java, rather multi-paradigm, e.g. Objective-C,
| Component Pascal, C++, Delphi, Python, among others when
| Component Programming came into CS papers for the first
| time.
|
| To argue that Component Programming is not OOP is just
| religious hate that shows lack of knowledge regarding CS
| literature.
| ignoramous wrote:
| The same developer has an open-source entity-component-system,
| as well: https://github.com/kelindar/ecs
| kelindar wrote:
| Wait no, that repo was an experiment. I'll be rebasing it and
| finally building a real ECS based on the columnar storage
| library.
| eismcc wrote:
| Is there a Go equivalent of Calcite? If so, could probably bolt
| that onto the query path and work in the logical plan translation
| to the physical plan - which is the query API that's currently
| provided.
| DLA wrote:
| https://calcite.apache.org
| pjmlp wrote:
| Great job picking up Go for this.
| de6u99er wrote:
| Do you have a (Docker) container that can be used for trying it
| out?
| L_226 wrote:
| OT: Not a Go dev here but have some side projects written in
| it... Isn't docker for Go a bit unorthodox? I had a few nice
| headaches setting up my local env to use docker with Go to
| mirror my python workflow (all projects have a Dockerfile, no
| dependencies installed locally). I was under the impression
| that Pro Go people do not use docker for local Go dev. Please
| correct me if I am wrong.
| doctor_eval wrote:
| Docker and Go work fine together, but using Docker for Go dev
| is just an unnecessary hassle, especially if (like me) you're
| doing dev on macOS - you have to cross-compile to Linux, which
| is slower, and then build and deploy the container - versus
| the very quick compile-run cycle of regular Go.
|
| As a reformed Java developer I can say that docker didn't add
| much time to the build cycle and gave us a better way to
| package resources for Java code, but Go is far more
| ergonomic, so taking a <2 second compile time for a small
| microservice and adding docker to turn it into a 30 second
| build time just isn't worth whatever utility you get from
| containers at dev time.
| pjmlp wrote:
| As a not yet reformed Java developer: an Uberjar, a custom
| runtime with jlink, or one of the AOT compilers available
| does the job just as well.
| doctor_eval wrote:
| Not really. Even with an uberjar, you still need to get
| that huge Java runtime distributed somehow. And then you
| need all the command line rubbish, starting heapsize,
| system properties, etc. Not to forget, for those of us
| outside the US, a special handmade distro of Java with
| the crypto export restrictions file in the right spot.
|
| Docker helps manage all of this, and does it fairly
| quickly, and made life relatively easy, but not without a
| cost in time and complexity.
|
| Go, out of the box, produces statically linked machine
| runnable binaries, including embedded resources, so you
| get the equivalent of an uberjar, plus resources, plus
| the runtime, all in a single executable file. And all of
| this pops out in a second or two with `go build`.
|
| AOT for Java might perhaps have similar advantages except
| that AFAIK (two years ago) the AOT compilers were
| expensive and had plenty of caveats with eg reflection. I
| expect they would be even slower than javac as well. So
| certainly a solution, and maybe you don't need docker any
| more, but then you have a different set of problems. It
| was never feasible when I was doing Java.
|
| To be clear, this isn't a Java vs Go thing. The question
| was why don't Go devs use Docker, and I've given some
| reasons. I quite like the Java language and miss some
| aspects of it, but there is a lot about the Java
| environment that I don't miss and runtime deployment
| complexity is one of them.
| pjmlp wrote:
| You missed the part of my comment about a "custom runtime
| with jlink".
|
| I have never been to the US, and the restrictions apply to any
| tech produced in the US, regardless of the programming
| language.
|
| Thankfully, by having such laws, US made us create other
| standards as well.
|
| I also don't want to make it into a Java vs Go thing, but
| rather make the point that many dismiss Java without really
| knowing what has been around in the ecosystem over the last
| 26 years.
|
| It appears everyone just learns the basics and then
| complains from there.
|
| Not targeted at you, as you obviously got my point.
|
| On the other hand, kubernetes and docker are all about
| runtime deployment complexity. It feels like using
| Websphere 5 all over again, with containers == EAR,
| thankfully so far I managed to stay mostly away from
| them.
| doctor_eval wrote:
| I exited Java just as modules were kicking in so I'm not
| really familiar with jlink. But you still need to
| distribute that custom runtime and docker helps with
| that. I think docker is a great tool for dealing with
| Java's complexities. A Java docker image is like a Go
| executable.
|
| Re the export restrictions, although you are right in
| theory, it doesn't seem to affect Go. There is no special
| build, the crypto is just built in. Java is unique in how
| it dealt with this, I never understood why it was so
| hard.
|
| I agree with you re K8s. And I like the comparison to
| EARs. Both container systems are pretty poor substitutes
| for a binary you can just run in an OS.
|
| Go seems to recognise this. It knows its place in the
| deployment hierarchy and that's made my life so much
| easier. Go _feels_ like it's part of the Unix world,
| rather than apart from it, and Java was never like that.
| That's why docker became so important in the Java world.
| It gave Java the isolation from the OS that it always
| craved :)
| adamcstephens wrote:
| I agree that adding docker to a Go dev setup is not worth it,
| but I think commenter was asking for a docker image for
| running it. In that case, I'd say that docker could be worth
| it for the end user.
| _wldu wrote:
| I dockerize Go apps to run in AWS ECS Fargate, but otherwise
| I agree. Go apps don't need docker.
| physicles wrote:
| Go doesn't benefit as much from docker, but if you're already
| living in a docker world (i.e. everything you deploy is a
| docker image, and it's managed by compose or kubernetes) then
| it's easier to use docker than not.
|
| We build images (about 20, each with a Dockerfile) from a
| monorepo with a single go.mod. I have basically a full
| replica of prod running locally in k3s -- letting k3s manage
| it all is easier than dealing with the pile of environment
| variables that would be needed to get everything hooked up
| properly. And with kustomize, we can reuse a bunch of yaml
| from prod.
|
| Sometimes I'll run go binaries locally on my machine for
| debugging (the builds still work because go's packaging is
| finally stable). But the difference is minimal -- using
| docker/k8s is more about streamlining
| deployment/config/rollback (and the occasional co-packaged
| asset) than anything else.
| polskibus wrote:
| Great stuff, can it work with larger-than-memory datasets? Is
| there a way to limit resource consumption? Or will the process
| just blow up in such a case?
| kelindar wrote:
| It's actually possible: columns are simple Go interfaces and
| can be re-defined and defined for specific types. You can
| easily build an implementation of columns that actually loads
| data from disk or even a remote server (RDBMS, S3, ..?) and
| retains the indexing capability.
|
| On the flip side, you could actually fit more data in memory
| than with non-columnar methods: since the storage is
| column-by-column, it compresses very well. For example, boolean
| values are stored as bitmaps in this implementation, and strings
| could be stored in a hash map so that only one copy of each
| distinct string is kept in memory, even if you have millions
| of rows.
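The two space-saving ideas mentioned above (booleans packed into a bitmap, strings deduplicated through a hash map) can be sketched as follows. `BoolColumn` and `StringColumn` are hypothetical types written for this illustration and are not taken from the library; they only show why a column-at-a-time layout lends itself to compact encodings.

```go
package main

import "fmt"

// BoolColumn packs one boolean per bit, 64 rows per uint64 word.
type BoolColumn struct {
	words []uint64
}

// Set stores a value at the given row index, growing the bitmap as needed.
func (c *BoolColumn) Set(row uint32, v bool) {
	word, bit := row/64, row%64
	for uint32(len(c.words)) <= word {
		c.words = append(c.words, 0)
	}
	if v {
		c.words[word] |= 1 << bit
	} else {
		c.words[word] &^= 1 << bit
	}
}

// Get reports the value at row; rows never written read as false.
func (c *BoolColumn) Get(row uint32) bool {
	word, bit := row/64, row%64
	if word >= uint32(len(c.words)) {
		return false
	}
	return c.words[word]&(1<<bit) != 0
}

// StringColumn keeps each distinct string once and stores a small id per row.
type StringColumn struct {
	ids  map[string]uint32 // string -> dictionary id
	dict []string          // dictionary id -> string
	rows []uint32          // row -> dictionary id
}

// Append adds a row, reusing the dictionary entry if the string was seen before.
func (c *StringColumn) Append(s string) {
	if c.ids == nil {
		c.ids = make(map[string]uint32)
	}
	id, ok := c.ids[s]
	if !ok {
		id = uint32(len(c.dict))
		c.ids[s] = id
		c.dict = append(c.dict, s)
	}
	c.rows = append(c.rows, id)
}

// Get returns the string stored at the given row.
func (c *StringColumn) Get(row int) string { return c.dict[c.rows[row]] }

// Distinct returns how many unique strings are actually held in memory.
func (c *StringColumn) Distinct() int { return len(c.dict) }

func main() {
	var active BoolColumn
	active.Set(3, true)
	active.Set(70, true)

	var country StringColumn
	for _, s := range []string{"NL", "US", "NL", "NL"} {
		country.Append(s)
	}

	fmt.Println(active.Get(3), active.Get(4), active.Get(70))
	fmt.Println(country.Get(2), country.Distinct())
}
```

With this layout a million boolean rows need roughly 125 KB, and a string column costs four bytes per row plus one copy of each distinct value.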
| maxdo wrote:
| How does that compare to Hazelcast?
| thunkshift1 wrote:
| Why the use of Go instead of something more traditional like C++,
| or even Rust? Isn't it primarily used for infrastructure
| scripting, and won't that affect the performance of the db?
| pjmlp wrote:
| Exactly to prove to people like yourself that it is possible.
|
| The IT industry is full of Matthews who need to be proven wrong
| for us to advance.
| ddlutz wrote:
| Is "Matthew" some sort of IT version of "Karen"?
| pjmlp wrote:
| I got the name in English wrong, it should have been
| Thomas.
|
| "You believe because you see me. Great blessings belong to
| the people who believe without seeing me!" (John 20:24-31)
|
| Bringing it into the IT context, there are the visionaries
| that believe something is possible no matter what, and then
| there are those that even with stuff running in front of
| them cannot move beyond "yes but...".
|
| Ironically, in the 80s, where home computers and game
| programming were concerned, both C and C++ also belonged to
| the "yes but..." group.
| DLA wrote:
| What's the problem with Go? Many high-performance things are
| built in Go. https://awesome-go.com
| kelindar wrote:
| Honestly, I enjoy programming in Go and have been using it on
| a daily basis for the last few years. Most importantly, when it
| comes to performance it's often not the language that matters
| but how you structure your code. It's very much possible to
| build a terrible C++ program which thrashes memory and will be
| very slow. And I feel like Go is actually lacking those nice
| data-oriented libraries.
___________________________________________________________________
(page generated 2021-06-21 23:01 UTC)