[HN Gopher] Vineyard: An open-source in-memory data manager
       ___________________________________________________________________
        
       Vineyard: An open-source in-memory data manager
        
       Author : sighingnow
       Score  : 96 points
       Date   : 2021-01-19 12:30 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | phillc73 wrote:
       | It'd be interesting to know how this compares with alternative
       | solutions.
       | 
       | I might not understand the value proposition correctly, and I'm
       | not specifically into Python for data work, but I immediately
       | thought of things like feather[1], fst[2], disk.frame[3] and even
       | DuckDB[4].
       | 
       | Some of these are on disk rather than in memory, but I'd still be
       | interested in performance and use case comparisons.
       | 
       | [1] https://github.com/wesm/feather
       | 
       | [2] https://www.fstpackage.org/fst/
       | 
       | [3] https://diskframe.com/
       | 
       | [4] https://duckdb.org/
        
         | sighingnow wrote:
         | Vineyard addresses sharing "distributed, abstracted, immutable
         | data" in memory. For example, dask[1] generates a distributed
         | dataframe (since the data does not fit on a single machine)
         | and feeds it to tensorflow as features for distributed
         | training. Vineyard aims to share the "DataFrame" between
         | these two engines through shared memory.
         | 
         | + feather: feather, like arrow, focuses on an interoperable
         | data storage format. feather and arrow define an IPC
         | serialization schema for common data structures, e.g., tensor
         | and DataFrame. Rather than serializing data so that it can be
         | deserialized zero-copy (when possible), vineyard organizes
         | data as a "metadata" and a set of "blobs": the metadata
         | decides how to interpret those blobs, and the blobs are shared
         | in a zero-copy fashion. Moreover, vineyard can manage large
         | data that does not fit on a single machine as a
         | "GlobalObject", and global objects can be shared efficiently
         | as well.
         | 
         | + fst: fst is also a serialization/storage format, but vineyard
         | is not such a thing.
         | 
         | + diskframe: diskframe is similar to fst, and works for data
         | that does not fit into memory. Vineyard shares in-memory data
         | that does not fit on a single machine between different
         | engines (which might be implemented in different languages).
         | 
         | + duckdb: vineyard is not a SQL execution engine.
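
A minimal sketch of the "metadata plus blobs" idea from the comment above, using only Python's standard library rather than vineyard's own API. The dict layout and the names `put_tensor` and `get_tensor` are hypothetical and chosen for illustration. The consumer attaches to the blob by name, as a second process would, and interprets the bytes in place instead of deserializing them.

```python
import struct
from multiprocessing import shared_memory


def put_tensor(values, shape):
    """Producer: write the values into a shared-memory blob once and
    return (metadata, handle). The metadata layout is an assumption of
    this sketch; it only records how the blob should be interpreted."""
    nbytes = len(values) * struct.calcsize("d")
    blob = shared_memory.SharedMemory(create=True, size=nbytes)
    struct.pack_into(f"{len(values)}d", blob.buf, 0, *values)
    meta = {
        "typename": "tensor",   # hypothetical type tag
        "format": "d",          # element type: C double
        "shape": list(shape),
        "nbytes": nbytes,
        "blob": blob.name,      # name another process could attach to
    }
    return meta, blob


def get_tensor(meta):
    """Consumer: attach to the blob by name and build a typed view over
    the same memory. No bytes are copied or deserialized."""
    blob = shared_memory.SharedMemory(name=meta["blob"])
    # Some platforms round the segment up to a whole page, so slice to
    # the size recorded in the metadata before casting.
    raw = blob.buf[: meta["nbytes"]]
    view = raw.cast(meta["format"], shape=meta["shape"])
    return view, blob


meta, owner = put_tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=(2, 3))
view, reader = get_tensor(meta)
rows = view.tolist()

# Show that the view aliases the blob: a write through the owner's
# handle is visible through the reader's view. (Vineyard data is
# immutable; the write here is only a demonstration of aliasing.)
struct.pack_into("d", owner.buf, 0, 42.0)
first_after_write = view.tolist()[0][0]

# Views must be released before the segment can be closed.
view.release()
reader.close()
owner.close()
owner.unlink()
```

The metadata is small and cheap to pass around, while the blob never moves, which is the property the comment contrasts with format-based approaches such as feather.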
        
       | sighingnow wrote:
       | Vineyard is a distributed in-memory data manager. Features:
       | 
       | 1. distributed in-memory immutable data storage;
       | 
       | 2. zero-copy data access through shared-memory;
       | 
       | 3. out-of-the-box high-level data abstractions for commonly used
       | data structures;
       | 
       | 4. pre-built drivers for I/O, migration, checkpoint, etc;
       | 
       | 5. C++ and Python API; and
       | 
       | 6. Kubernetes integration for large-scale big data applications.
       | 
       | GitHub: https://github.com/alibaba/libvineyard (stars are
       | welcome!)
       | 
       | Documentation: https://v6d.io
       | 
       | Helm charts:
       | https://artifacthub.io/packages/helm/vineyard/vineyard
       | 
       | Any comments and contributions from the community are welcome!
        
         | theknarf wrote:
         | What would be a typical use case for this kind of technology?
        
           | sighingnow wrote:
           | Let's say someone wants to use python first to create a few
           | tensors with numpy, then use another python package,
           | networkx, to create a graph from those tensors, and then do
           | some graph analysis with networkx.
           | 
           | It is easy to do it on a single machine in a single python
           | runtime because data structures like tensors can be easily
           | shared within a single python runtime.
           | 
           | It becomes a bit harder to share the data across
           | processes/Python runtimes efficiently without expensive IO or
           | serialization (doable with the help of things like plasma in
           | Apache Arrow).
           | 
           | However, if we want to deal with "big data" that cannot be
           | handled on a single machine, it becomes very hard to
           | share data without IO/serialization. Vineyard can be used for
           | such cases. It makes sharing big distributed data structures
           | easy across different runtimes.
           | 
           | There are some added benefits with vineyard:
           | 
           | 1. vineyard can handle the IO, sharding/partitioning,
           | failover and data migrations for the applications.
           | 
           | 2. vineyard supports many out-of-the-box highly efficient
           | data formats (such as Apache Arrow) and high-level
           | abstractions which ease the difficulties of developing big
           | data applications.
           | 
           | 3. With stream data support, there is a possibility of
           | enabling cross-process optimizations.
           | 
           | 4. Fits K8s very well, where common data structures can live
           | in a separate pod/container with its own resource limit.
        
           | streetcat1 wrote:
           | I would assume any workload that decouples compute from
           | storage.
           | 
           | Most of the new data systems (e.g. Snowflake) are built on
           | top of a data lake (e.g. an S3 bucket) and scale-out JIT
           | compute nodes.
           | 
           | By using such a tool, you can read the data from S3 once and
           | avoid loading it into memory again when you add nodes.
           | 
           | A specific use case that I am working on now is Auto ML.
           | Imagine training hundreds of models on the same data.
        
         | adev_ wrote:
         | Looks very interesting, guys. Thanks a lot for releasing this.
         | 
         | I am currently working on something similar.
         | 
         | Any sufficiently complex data pipeline sooner or later ends up
         | multi-lingual (Python, R, C++ + MPI) and multi-runtime (JVM,
         | native, python), and it quickly becomes impossible to execute
         | everything in a single process space without problems.
         | 
         | You are right in your design: shared-memory, node-to-node data
         | distribution is the answer, and it avoids the classic,
         | inefficient dump-load-dump pattern that we usually find in
         | most heterogeneous pipelines.
        
         | lukevp wrote:
         | Does distributed mean it's replicated across machines? It was a
         | bit hard to tell from the article. Can this be run outside
         | kubernetes or is it meant to be part of a big set of infra?
         | 
         | One thing I've always wanted to build is an emulator (snes,
         | etc.) that does real-time replication of data across the LAN
         | for both debugging and network play. Is this low enough latency
         | for that? The idea would be to help new emulator authors not
         | have to build a debugger or test suite: if you could expose the
         | controller interface and memory, a single debugger could be
         | written that could be shared across new authors. All they would
         | have to do is bind to a simple API that exposes memory and a
         | few commands to the library.
        
           | sighingnow wrote:
           | > Does distributed mean it's replicated across machines?
           | 
           | No, in vineyard "distributed" doesn't mean data replicated
           | across machines; rather, it means big data partitioned across
           | machines. We address cases where the data does not fit on
           | a single machine.
           | 
           | > Can this be run outside kubernetes or is it meant to be
           | part of a big set of infra?
           | 
           | Yes, vineyard can run outside of kubernetes, but it would be
           | a bit complicated to set up on many machines as a cluster.
           | 
           | Deploying your vineyard cluster on kubernetes lets you
           | leverage Kubernetes' capabilities. We have already integrated
           | with helm, and there's a vineyard-operator on our roadmap.
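
To make the "partitioned, not replicated" distinction in the reply above concrete, here is a small sketch in plain Python. It is not vineyard's API: the `Chunk` record, the `partition_rows` and `locate` helpers, the host names, and the `global_meta` layout are all hypothetical. The point is only that each row lives on exactly one machine and that a small metadata record describes where the pieces are.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: int
    host: str
    row_start: int  # inclusive
    row_stop: int   # exclusive


def partition_rows(num_rows, hosts):
    """Split [0, num_rows) into contiguous ranges, one per host.
    Every row is assigned to exactly one host, so nothing is replicated."""
    base, extra = divmod(num_rows, len(hosts))
    chunks, start = [], 0
    for i, host in enumerate(hosts):
        # The first `extra` hosts take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        chunks.append(Chunk(i, host, start, stop))
        start = stop
    return chunks


def locate(chunks, row):
    """Return the single host that holds the given row."""
    for chunk in chunks:
        if chunk.row_start <= row < chunk.row_stop:
            return chunk.host
    raise IndexError(f"row {row} is outside the partitioned range")


hosts = ["node-a", "node-b", "node-c"]  # hypothetical machine names
global_meta = {
    "typename": "GlobalDataFrame",      # hypothetical type tag
    "chunks": partition_rows(10, hosts),
}

sizes = [c.row_stop - c.row_start for c in global_meta["chunks"]]
owner_of_row_7 = locate(global_meta["chunks"], 7)
```

With replication, every host would hold all ten rows. Here the chunk sizes add up to the total row count, and a lookup resolves to one host.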
        
       ___________________________________________________________________
       (page generated 2021-01-19 23:01 UTC)