hngopher.com

       [HN Gopher] Column - High-performance, columnar, in-memory store...
       ___________________________________________________________________
        
       Column - High-performance, columnar, in-memory store with bitmap
       indexing in Go
        
       Author : ngaut
       Score  : 147 points
       Date   : 2021-06-21 07:59 UTC (15 hours ago)
        
        
       | dafelst wrote:
       | I love seeing stuff like this, getting more understanding of the
       | layers underlying high performance data analytics is super
       | interesting to me.
       | 
       | This project seems very similar to Apache Arrow, if OP or anyone
       | else is around to explain why one might be used over the other
       | that would be great.
        
         | ctvo wrote:
         | > This project seems very similar to Apache Arrow, if OP or
         | anyone else is around to explain why one might be used over the
         | other that would be great.
         | 
         | Arrow is primarily a serialization format to transfer data
         | between distributed systems. It uses zero copy and other
         | techniques to quickly process and store large data sets in
         | memory.
         | 
         | Other libraries allow you to query Arrow data once processed.
         | 
         | This project is an in-memory columnar data store with querying
         | and other capabilities.
        
       | GeertJohan wrote:
       | Really nice project! The transactions and replication streaming
       | seem to make it a great choice for sharding/distributed
       | environments!
        
         | kelindar wrote:
         | That's the idea, a transaction commit log decoupled from
         | underlying durable storage allows you to build your own
         | persistence layers. I'm still thinking of building a simple
         | (memory-mapped?) layer, but as an optional, separate lib.
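A minimal sketch of the idea in the comment above, a commit log decoupled from storage. It assumes a toy store with float64 columns and a channel as the log sink. The names `Store`, `Commit`, `Update` and `Replay` are hypothetical and are not the library's API; a real persistence layer would write the commits to disk or ship them over the network instead of replaying them in-process.

```go
package main

import "fmt"

// Update sets one cell: the value of a column at a row index.
type Update struct {
	Column string
	Row    int
	Value  float64
}

// Commit is a hypothetical record of one committed transaction.
type Commit struct {
	ID      uint64
	Updates []Update
}

// Store is a toy columnar store holding float64 columns only.
type Store struct {
	cols   map[string][]float64
	nextID uint64
	log    chan<- Commit // optional sink for committed transactions
}

// NewStore creates a store; pass nil to disable the commit log.
func NewStore(log chan<- Commit) *Store {
	return &Store{cols: make(map[string][]float64), log: log}
}

// apply writes the updates into the columns, growing them as needed.
func (s *Store) apply(updates []Update) {
	for _, u := range updates {
		col := s.cols[u.Column]
		for len(col) <= u.Row {
			col = append(col, 0)
		}
		col[u.Row] = u.Value
		s.cols[u.Column] = col
	}
}

// Commit applies the updates locally, then publishes them to the log.
func (s *Store) Commit(updates []Update) {
	s.apply(updates)
	s.nextID++
	if s.log != nil {
		s.log <- Commit{ID: s.nextID, Updates: updates}
	}
}

// Replay applies a commit that was read back from a log.
func (s *Store) Replay(c Commit) { s.apply(c.Updates) }

func main() {
	log := make(chan Commit, 16) // buffered so Commit does not block here
	primary := NewStore(log)
	replica := NewStore(nil)

	primary.Commit([]Update{{"balance", 0, 10}, {"balance", 1, 25}})
	primary.Commit([]Update{{"balance", 0, 12.5}})
	close(log)

	// The consumer of the log decides what "durable" means.
	for c := range log {
		replica.Replay(c)
	}
	fmt.Println(replica.cols["balance"]) // prints "[12.5 25]"
}
```

Because the store only knows about the sink, the same commits could feed a file writer, a replica, or both.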
        
       | potamic wrote:
       | I'm a little naive on this subject, but just wondering what are
       | the use cases for in-memory columnar stores? I was under the
       | impression that columnar stores are good for OLAP use cases
       | involving massive amounts of data. For datasets that fit within
       | memory, are there still benefits in organizing data in a columnar
       | manner and are the performance gains appreciable?
        
         | dafelst wrote:
         | I am merely a dabbler in this area and definitely not an
         | expert, but my understanding is that columnar stores tend to be
         | substantially more efficient for analytical operations over
         | large sets of in memory data by virtue of the data being easier
         | to operate on with vectorized instructions like SIMD.
        
         | pepemysurprised wrote:
         | Also see my comment above, but you find this kind of storage
         | commonly in game development [0] where you are optimizing for
         | batch access on specific columns to minimize cache misses. It's
         | usually used as the storage layer for Entity Component Systems.
         | It's also called data-oriented design [1]
         | 
         | [0] http://cowboyprogramming.com/2007/01/05/evolve-your-
         | heirachy...
         | 
         | [1] https://en.wikipedia.org/wiki/Data-oriented_design
        
         | nvartolomei wrote:
         | Not a subject matter expert, but a few that come to mind: memory
         | can become a bottleneck; reading sequential data instead of
         | chasing pointers, reading useless data and thrashing caches
         | gives much better throughput; and compression applied to
         | columnar data is more efficient and can give a throughput boost
         | when memory bandwidth becomes a bottleneck on systems with a
         | high number of CPUs.
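To make the layout difference discussed above concrete, here is a small sketch that contrasts a row layout (slice of structs) with a column layout (one slice per field). The type and function names are made up for illustration. The sketch only shows the access pattern; whether the column version is faster on a given machine would need a benchmark (for example with `testing.B`).

```go
package main

import "fmt"

// PlayerRow is the row layout: each record keeps all fields together.
type PlayerRow struct {
	Name   string
	Level  int32
	Health float64
	Active bool
}

// PlayerColumns is the column layout: one contiguous slice per field.
type PlayerColumns struct {
	Name   []string
	Level  []int32
	Health []float64
	Active []bool
}

// Append adds one record by appending to every column.
func (c *PlayerColumns) Append(r PlayerRow) {
	c.Name = append(c.Name, r.Name)
	c.Level = append(c.Level, r.Level)
	c.Health = append(c.Health, r.Health)
	c.Active = append(c.Active, r.Active)
}

// SumHealthRows walks whole records, so the unrelated fields
// (Name, Level, Active) travel through the cache as well.
func SumHealthRows(rows []PlayerRow) float64 {
	total := 0.0
	for i := range rows {
		total += rows[i].Health
	}
	return total
}

// SumHealthColumns reads only the Health slice, sequentially.
func SumHealthColumns(c *PlayerColumns) float64 {
	total := 0.0
	for _, h := range c.Health {
		total += h
	}
	return total
}

func main() {
	rows := []PlayerRow{
		{"ann", 3, 100, true},
		{"bob", 7, 80.5, true},
		{"cyd", 1, 60, false},
	}
	var cols PlayerColumns
	for _, r := range rows {
		cols.Append(r)
	}
	fmt.Println(SumHealthRows(rows), SumHealthColumns(&cols)) // prints "240.5 240.5"
}
```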
        
         | mananaysiempre wrote:
         | I'm not sure about any performance gains or working with large
         | datasets, but the ancient Metakit[1] was just a really pleasant
         | relational algebra library ( [?] SQL data model library, it
         | could do _e.g._ relations as values which are difficult for
         | row-oriented databases). I'd say that Metakit & OOMK in Tcl is
         | strictly better than the relational part of Pandas in Python,
         | except the documentation is somewhere between bad and
         | nonexistent.
         | 
         | [1]: https://git.jeelabs.org/jcw/metakit
        
       | pepemysurprised wrote:
       | Very cool. This kind of storage is similar to what's typically
       | being used in Entity Component Systems like EnTT [0], which can
       | also be interpreted as in-memory column oriented databases.
       | 
       | Recently I've started to like this type of programming over OOP.
       | Each part of your application accesses a global in-memory DB with
       | the full state. But unlike a traditional persistent db it's
       | really fast.
       | 
       | [0] https://github.com/skypjack/entt
        
         | eklitzke wrote:
         | How is using an in-memory database related to OOP? They seem
         | completely orthogonal to me.
        
           | shkkmo wrote:
           | It is not. It is related to ECS which is being contrasted
           | with OOP.
        
         | pjmlp wrote:
         | Something that gets lost is that this is also a variety of OOP.
         | 
         | https://www.amazon.com/Component-Software-Object-Oriented-Pr...
         | 
         | Programming against Objective-C protocols, COM interfaces,
         | Component Pascal framework, and so forth.
        
           | shkkmo wrote:
           | What? In ECS state is managed separately from logic and there
           | is no inheritance. How is it a variety of OOP?
        
             | pjmlp wrote:
             | OOP is not inheritance; that is just one possible trait among
             | OOP implementations, just like FP isn't Haskell.
             | 
             | Component programming with interface separated from state
             | is exactly what Objective-C protocols, COM, VBX, SOM,
             | Component Pascal were all about.
             | 
             | Those that promote ECS as not being OOP have, 99% of the
             | time, never read books like the one I linked in my comment.
             | 
             | Instead they reference a talk done at GDC by one of the
             | very first engines that made it well known to those that
             | never read CS papers or books.
        
               | setr wrote:
               | Interfaces/protocols aren't really the same though. An
               | interface defines capabilities for an object; the
               | capabilities are directly associated with the type.
               | 
               | In ECS (where Components are a bag of data, and systems
               | handle all logic/operations), and like a DB, the object
               | is defined by an id and its relations; from the
               | relations, you can derive available capabilities.
               | 
               | That is, an object tells you what it can do. In a DB,
               | what it can do tells you the object.
               | 
               | You can create the same system with interfaces by simply
               | ignoring the methods part of it, and keeping the data
               | part, but associating data with capabilities is pretty
               | much the defining difference between objects and structs.
               | 
               | More importantly from an architectural perspective, in
               | ECS the logic isn't associated with the object, it's
               | associated with a system that takes the object as input.
               | The system is shared across all objects. The object
               | (entity) for ECS is little more than an id and some
               | relations.
               | 
               | An ECS very directly corresponds to an RDBMS. To call it
               | OOP is to deny the ORM's classic Object-Relational
               | mismatch.
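A rough sketch of the ECS shape described in the comment above: the entity is only an id, components are plain data stored in per-type tables keyed by that id, and the logic lives in a system that runs over every entity with the required components. All names here are hypothetical. Maps are used for brevity; ECS libraries typically use dense arrays for the component tables.

```go
package main

import "fmt"

// Entity is just an identifier; it carries no data and no behavior.
type Entity uint32

// Components are plain bags of data.
type Position struct{ X, Y float64 }
type Velocity struct{ DX, DY float64 }

// World keeps one table per component type, keyed by entity id,
// much like relations keyed by a primary key.
type World struct {
	next       Entity
	positions  map[Entity]Position
	velocities map[Entity]Velocity
}

// NewWorld creates an empty world.
func NewWorld() *World {
	return &World{
		positions:  make(map[Entity]Position),
		velocities: make(map[Entity]Velocity),
	}
}

// NewEntity hands out a fresh id.
func (w *World) NewEntity() Entity {
	w.next++
	return w.next
}

// MoveSystem holds the logic. It applies to every entity that has
// both a Position and a Velocity, which amounts to a join on id.
func MoveSystem(w *World, dt float64) {
	for e, v := range w.velocities {
		p, ok := w.positions[e]
		if !ok {
			continue // no Position, so this system does not apply
		}
		p.X += v.DX * dt
		p.Y += v.DY * dt
		w.positions[e] = p
	}
}

func main() {
	w := NewWorld()

	moving := w.NewEntity()
	w.positions[moving] = Position{0, 0}
	w.velocities[moving] = Velocity{1, 2}

	static := w.NewEntity() // has a Position but no Velocity
	w.positions[static] = Position{5, 5}

	MoveSystem(w, 0.5)
	fmt.Println(w.positions[moving]) // prints "{0.5 1}"
	fmt.Println(w.positions[static]) // prints "{5 5}"
}
```

What an entity "can do" is derived from which tables contain its id, not from a type it belongs to.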
        
               | pjmlp wrote:
               | An interface in a component object model can be made only
               | of properties.
               | 
               | Secondly, most languages with OOP support aren't
               | Smalltalk/Java but rather multi-paradigm, e.g. Objective-C,
               | Component Pascal, C++, Delphi, Python, among others that
               | were around when Component Programming first came into CS
               | papers.
               | 
               | To argue that Component Programming is not OOP is just
               | religious hate that shows lack of knowledge regarding CS
               | literature.
        
         | ignoramous wrote:
         | The same developer has an open-source entity-component-system,
         | as well: https://github.com/kelindar/ecs
        
           | kelindar wrote:
           | Wait no, that repo was an experiment that I'll be rebasing
           | and finally building a real ECS based on the columnar storage
           | library.
        
       | eismcc wrote:
       | Is there a Go equivalent of Calcite? If so, could probably bolt
       | that onto the query path and work in the logical plan translation
       | to the physical plan - which is the query API that's currently
       | provided.
        
         | DLA wrote:
         | https://calcite.apache.org
        
       | pjmlp wrote:
       | Great job picking up Go for this.
        
       | de6u99er wrote:
       | Do you have a (Docker) container that can be used for trying it
       | out?
        
         | L_226 wrote:
         | OT: Not a Go dev here but have some side projects written in
         | it... Isn't docker for Go a bit unorthodox? I had a few nice
         | headaches setting up my local env to use docker with Go to
         | mirror my python workflow (all projects have a Dockerfile, no
         | dependencies installed locally). I was under the impression
         | that Pro Go people do not use docker for local Go dev. Please
         | correct me if I am wrong.
        
           | doctor_eval wrote:
           | Docker and go work fine together but using docker for go dev
           | is just an unnecessary hassle, especially if (like me) you're
           | doing dev in macOS - you have to cross compile to Linux which
           | is slower, and then build and deploy the container - versus
           | the very quick compile-run cycle of regular Go.
           | 
           | As a reformed Java developer I can say that docker didn't add
           | much time to the build cycle and gave us a better way to
           | package resources for Java code, but Go is far more
           | ergonomic, so taking a <2 second compile time for a small
           | microservice and adding docker to turn it into a 30 second
           | build time just isn't worth whatever utility you get from
           | containers at dev time.
        
             | pjmlp wrote:
             | As a not yet reformed Java developer: an Uberjar, a custom
             | runtime with jlink, or one of the AOT compilers available
             | does the job just as well.
        
               | doctor_eval wrote:
               | Not really. Even with an uberjar, you still need to get
               | that huge Java runtime distributed somehow. And then you
               | need all the command line rubbish, starting heapsize,
               | system properties, etc. Not to forget, for those of us
               | outside the US, a special handmade distro of Java with
               | the crypto export restrictions file in the right spot.
               | 
               | Docker helps manage all of this, and does it fairly
               | quickly, and made life relatively easy, but not without a
               | cost in time and complexity.
               | 
               | Go, out of the box, produces statically linked machine
               | runnable binaries, including embedded resources, so you
               | get the equivalent of an uberjar, plus resources, plus
               | the runtime, all in a single executable file. And all of
               | this pops out in a second or two with `go build`.
               | 
               | AOT for Java might perhaps have similar advantages except
               | that AFAIK (two years ago) the AOT compilers were
               | expensive and had plenty of caveats with e.g. reflection. I
               | expect they would be even slower than javac as well. So
               | certainly a solution, and maybe you don't need docker any
               | more, but then you have a different set of problems. It
               | was never feasible when I was doing Java.
               | 
               | To be clear, this isn't a Java vs Go thing. The question
               | was why don't Go devs use Docker, and I've given some
               | reasons. I quite like the Java language and miss some
               | aspects of it, but there is a lot about the Java
               | environment that I don't miss and runtime deployment
               | complexity is one of them.
        
               | pjmlp wrote:
               | You missed part of the comment: "custom runtime with
               | jlink".
               | 
               | I have never been to the US, plus the restrictions apply to
               | any tech produced in the US, regardless of the programming
               | language.
               | 
               | Thankfully, by having such laws, US made us create other
               | standards as well.
               | 
               | I also don't want to make it into a Java vs Go thing,
               | rather make the point that many dismiss Java without
               | really knowing what has been around in the ecosystem
               | during the last 26 years.
               | 
               | It appears everyone just learns the basics and then
               | complains from there.
               | 
               | Not targeted at you, as you obviously got my point.
               | 
               | On the other hand, kubernetes and docker are all about
               | runtime deployment complexity. It feels like using
               | Websphere 5 all over again, with containers == EAR,
               | thankfully so far I managed to stay mostly away from
               | them.
        
               | doctor_eval wrote:
               | I exited Java just as modules were kicking in so I'm not
               | really familiar with jlink. But you still need to
               | distribute that custom runtime and docker helps with
               | that. I think docker is a great tool for dealing with
               | Java's complexities. A Java docker image is like a Go
               | executable.
               | 
               | Re the export restrictions, although you are right in
               | theory, it doesn't seem to affect Go. There is no special
               | build, the crypto is just built in. Java is unique in how
               | it dealt with this, I never understood why it was so
               | hard.
               | 
               | I agree with you re K8s. And I like the comparison to
               | EARs. Both container systems are pretty poor substitutes
               | for a binary you can just run in an OS.
               | 
               | Go seems to recognise this. It knows its place in the
               | deployment hierarchy and that's made my life so much
               | easier. Go _feels_ like it's part of the Unix world,
               | rather than apart from it, and Java was never like that.
               | That's why docker became so important in the Java world.
               | It gave Java the isolation from the OS that it always
               | craved :)
        
           | adamcstephens wrote:
           | I agree that adding docker to a Go dev setup is not worth it,
           | but I think commenter was asking for a docker image for
           | running it. In that case, I'd say that docker could be worth
           | it for the end user.
        
           | _wldu wrote:
           | I dockerize Go apps to run in AWS ECS Fargate, but otherwise
           | I agree. Go apps don't need docker.
        
           | physicles wrote:
           | Go doesn't benefit as much from docker, but if you're already
           | living in a docker world (i.e. everything you deploy is a
           | docker image, and it's managed by compose or kubernetes) then
           | it's easier to use docker than not.
           | 
           | We build images (about 20, each with a Dockerfile) from a
           | monorepo with a single go.mod. I have basically a full
           | replica of prod running locally in k3s -- letting k3s manage
           | it all is easier than dealing with the pile of environment
           | variables that would be needed to get everything hooked up
           | properly. And with kustomize, we can reuse a bunch of yaml
           | from prod.
           | 
           | Sometimes I'll run go binaries locally on my machine for
           | debugging (the builds still work because go's packaging is
           | finally stable). But the difference is minimal -- using
           | docker/k8s is more about streamlining
           | deployment/config/rollback (and the occasional co-packaged
           | asset) than anything else.
        
       | polskibus wrote:
       | Great stuff, can it work with larger-than-memory datasets? Is
       | there a way to limit resource consumption? Or will the process
       | just blow up in that case?
        
         | kelindar wrote:
         | It's actually possible: columns are simple Go interfaces and
         | can be redefined or defined for specific types. You can
         | easily build an implementation of a column that actually loads
         | data from disk or even a remote server (RDBMS, S3, ..?) and
         | retains the indexing capability.
         | 
         | On the flip side, you could actually fit more data in memory
         | than with non-columnar methods: since the storage is
         | column-by-column, it compresses very well. For example, boolean
         | values are stored as bitmaps in this implementation, and
         | strings could be stored in a hash map so that only one copy of
         | each distinct string is kept in memory, even if you have
         | millions of rows.
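The two space-saving tricks mentioned above can be sketched in a few lines: a boolean column packed one bit per row, and a string column that stores each distinct string once plus a small integer code per row (dictionary encoding). This is an illustrative sketch with made-up type names, not the library's implementation, and it assumes row indexes are non-negative.

```go
package main

import (
	"fmt"
	"math/bits"
)

// BoolColumn packs one boolean per bit, 64 rows per uint64 word.
type BoolColumn struct {
	words []uint64
}

// Set stores v at the given row, growing the bitmap as needed.
func (b *BoolColumn) Set(row int, v bool) {
	w := row / 64
	for len(b.words) <= w {
		b.words = append(b.words, 0)
	}
	mask := uint64(1) << uint(row%64)
	if v {
		b.words[w] |= mask
	} else {
		b.words[w] &^= mask
	}
}

// Get reports the value at row; rows never set read as false.
func (b *BoolColumn) Get(row int) bool {
	w := row / 64
	return w < len(b.words) && b.words[w]&(uint64(1)<<uint(row%64)) != 0
}

// Count returns how many rows are true, one popcount per word.
func (b *BoolColumn) Count() int {
	n := 0
	for _, w := range b.words {
		n += bits.OnesCount64(w)
	}
	return n
}

// StringColumn keeps each distinct string once in dict and a
// 4-byte code per row.
type StringColumn struct {
	codes  []uint32
	dict   []string
	lookup map[string]uint32
}

// Append adds a row, reusing the existing code for a known string.
func (s *StringColumn) Append(v string) {
	code, ok := s.lookup[v]
	if !ok {
		if s.lookup == nil {
			s.lookup = make(map[string]uint32)
		}
		code = uint32(len(s.dict))
		s.dict = append(s.dict, v)
		s.lookup[v] = code
	}
	s.codes = append(s.codes, code)
}

// Get returns the string stored at row.
func (s *StringColumn) Get(row int) string { return s.dict[s.codes[row]] }

func main() {
	var active BoolColumn
	var country StringColumn
	names := []string{"NL", "SG", "US"}
	for row := 0; row < 1000; row++ {
		active.Set(row, row%2 == 0)
		country.Append(names[row%3])
	}
	fmt.Println(active.Count(), len(active.words))     // prints "500 16"
	fmt.Println(len(country.codes), len(country.dict)) // prints "1000 3"
	fmt.Println(country.Get(4))                        // prints "SG"
}
```

A thousand booleans fit in 16 words (128 bytes), and a thousand country values keep only three strings in memory.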
        
       | maxdo wrote:
       | How does that compare to Hazelcast?
        
       | thunkshift1 wrote:
       | Why the use of Go instead of something more traditional like C++,
       | or even Rust? Isn't it primarily used for infrastructure
       | scripting, and won't that affect the performance of the db?
        
         | pjmlp wrote:
         | Exactly to prove to people like yourself that it is possible.
         | 
         | IT industry is full of Matthews that need to be proven wrong
         | for us to advance.
        
           | ddlutz wrote:
           | Is "Matthew" some sort of IT version of "Karen"?
        
             | pjmlp wrote:
             | I got the name in English wrong, it should have been
             | Thomas.
             | 
             | "You believe because you see me. Great blessings belong to
             | the people who believe without seeing me!" (John 20:24-31 )
             | 
             | Bringing it into the IT context, there are the visionaries
             | that believe something is possible no matter what, and then
             | there are those that even with stuff running in front of
             | them cannot move beyond "yes but...".
             | 
             | Ironically, in the 80s, when it came to home computers and
             | game programming, both C and C++ also belonged to the "yes
             | but..." group.
        
         | DLA wrote:
         | What's the problem with Go? Many high-performance things are
         | built in Go. https://awesome-go.com
        
         | kelindar wrote:
         | Honestly, I enjoy programming in Go and have been using it on
         | a daily basis for the last few years. Most importantly, when it
         | comes to performance it's often not the language that matters
         | but how you structure your code. It's very much possible to
         | build a terrible C++ program which thrashes memory and will be
         | very slow. And I feel like Go is actually lacking those nice
         | data-oriented libraries.
        
       ___________________________________________________________________
       (page generated 2021-06-21 23:01 UTC)