[HN Gopher] Vineyard: An open-source in-memory data manager
___________________________________________________________________
Vineyard: An open-source in-memory data manager
Author : sighingnow
Score : 96 points
Date : 2021-01-19 12:30 UTC (10 hours ago)
| phillc73 wrote:
| It'd be interesting to know how this compares with alternative
| solutions.
|
| I might not understand the benefit proposition correctly, and I'm
| not specifically into Python for data work, but I immediately
| thought of things like feather[1], fst[2], disk.frame[3] and even
| DuckDB[4].
|
| Some of these are on disk rather than in memory, but I'd still be
| interested in performance and use case comparisons.
|
| [1] https://github.com/wesm/feather
|
| [2] https://www.fstpackage.org/fst/
|
| [3] https://diskframe.com/
|
| [4] https://duckdb.org/
| sighingnow wrote:
| Vineyard addresses sharing "distributed abstracted immutable
| data" in memory. For example, dask[1] generates a distributed
| dataframe (because the data does not fit on a single machine)
| and feeds it to TensorFlow as features for distributed
| training. Vineyard aims at sharing the "DataFrame" between
| these two engines through shared memory.
|
| + feather: feather, like arrow, focuses on an interoperable data
| storage format. feather and arrow define an IPC serialization
| schema for common data structures, e.g., tensor and DataFrame.
| Rather than serializing data so that it can be deserialized with
| zero copy (when possible), vineyard organizes data as "metadata"
| plus a set of "blobs": the metadata decides how to interpret the
| blobs, and the blobs are shared in a zero-copy fashion. Moreover,
| vineyard can manage large data that does not fit on a single
| machine as a "GlobalObject", and global objects can be shared
| efficiently as well.
|
| + fst: fst is also a serialization/storage format, but vineyard
| is not such a thing.
|
| + disk.frame: disk.frame is similar to fst, and works for data
| that does not fit in memory. Vineyard shares in-memory data
| that does not fit on a single machine between different
| engines (which might be implemented in different languages).
|
| + duckdb: vineyard is not a SQL execution engine.
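
The "metadata plus blobs" idea described above can be illustrated with a minimal sketch. This is not vineyard's API: it uses only Python's standard `multiprocessing.shared_memory` module, and the names `put_tensor`, `open_tensor`, and the metadata keys are hypothetical. The metadata is a small dictionary that says how to interpret a blob, and the blob is a shared-memory segment that a reader maps and reinterprets in place instead of deserializing.

```python
import struct
from multiprocessing import shared_memory


def put_tensor(values):
    """Store floats in a shared-memory blob; return (handle, metadata)."""
    nbytes = 8 * len(values)
    shm = shared_memory.SharedMemory(create=True, size=max(nbytes, 1))
    struct.pack_into(f"{len(values)}d", shm.buf, 0, *values)
    # Hypothetical metadata layout: small, cheap to send, and enough to
    # tell a reader how to interpret the raw bytes of the blob.
    meta = {"typename": "tensor", "dtype": "d",
            "length": len(values), "blob": shm.name}
    return shm, meta


def open_tensor(meta):
    """Attach to the blob named in the metadata; return (handle, typed view)."""
    shm = shared_memory.SharedMemory(name=meta["blob"])
    nbytes = struct.calcsize(meta["dtype"]) * meta["length"]
    # cast() reinterprets the mapped bytes in place; nothing is copied.
    view = shm.buf[:nbytes].cast(meta["dtype"])
    return shm, view


def demo():
    owner, meta = put_tensor([1.0, 2.5, 4.0])
    reader, view = open_tensor(meta)
    try:
        return list(view)
    finally:
        view.release()   # views must be released before closing the mapping
        reader.close()
        owner.close()
        owner.unlink()   # free the shared segment


if __name__ == "__main__":
    print(demo())
```

In this sketch both handles live in one process for brevity; a second process could attach the same way given only the metadata dictionary.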
| sighingnow wrote:
| Vineyard is a distributed in-memory data manager. Features:
|
| 1. distributed in-memory immutable data storage;
|
| 2. zero-copy data access through shared-memory;
|
| 3. out-of-the-box high-level data abstractions for commonly used
| data structures;
|
| 4. pre-built drivers for I/O, migration, checkpoint, etc;
|
| 5. C++ and Python API; and
|
| 6. Kubernetes-integration for large-scale big data applications
|
| GitHub: https://github.com/alibaba/libvineyard (stars are
| welcome!)
|
| Documentation: https://v6d.io
|
| Helm charts:
| https://artifacthub.io/packages/helm/vineyard/vineyard
|
| Any comments and contributions from the community are welcome!
| theknarf wrote:
| What would be a typical use case for this kind of technology?
| sighingnow wrote:
| Let's say someone wants to first create a few tensors with numpy
| in Python, then use another Python package, networkx, to build
| a graph from those tensors, and then do some graph analysis
| with networkx.
|
| It is easy to do it on a single machine in a single python
| runtime because data structures like tensors can be easily
| shared within a single python runtime.
|
| It becomes a bit harder to share the data efficiently across
| processes/Python runtimes without expensive IO or
| serialization (doable with the help of things like plasma in
| Apache Arrow).
|
| However, if we want to deal with "big data" that cannot be
| handled on a single machine, it becomes very hard to share
| data without IO/serialization. Vineyard can be used for such
| cases. It makes sharing big distributed data structures easy
| across different runtimes.
|
| There are some added benefits with vineyard:
|
| 1. vineyard can handle the IO, sharding/partitioning, failover
| and data migrations for the applications.
|
| 2. vineyard supports many out-of-the-box highly efficient
| data formats (such as Apache Arrow) and high-level
| abstractions which ease the difficulties of developing big
| data applications.
|
| 3. With stream data support, there is a possibility of
| enabling cross-process optimizations.
|
| 4. It fits K8s very well, where common data structures can live
| in a separate pod/container with its own resource limit.
| streetcat1 wrote:
| I would assume any workload that decouples compute from
| storage.
|
| Most of the new data systems (e.g. Snowflake) are built on
| top of a data lake (e.g. an S3 bucket) and scale-out JIT
| compute nodes.
|
| With such a tool, you can read the data from S3 once, and
| avoid loading the data into memory again when you add nodes.
|
| A specific use case that I am working on now is AutoML.
| Imagine training hundreds of models on the same data.
| adev_ wrote:
| Looks very interesting, thanks a lot for releasing this.
|
| I am currently working on something similar.
|
| Any sufficiently complex data pipeline sooner or later ends up
| multi-lingual (Python, R, C++ + MPI) and multi-runtime (JVM,
| native, Python), and it quickly becomes impossible to execute
| everything in a single process space without problems.
|
| Your design is right: shared-memory, node-to-node data
| distribution is the way to avoid the classic, inefficient data
| dump-load-dump pattern that we usually find in most
| heterogeneous pipelines.
| lukevp wrote:
| Does distributed mean it's replicated across machines? It was a
| bit hard to tell from the article. Can this be run outside
| kubernetes or is it meant to be part of a big set of infra?
|
| One thing I've always wanted to build is an emulator (SNES,
| etc.) that does real-time replication of data across the LAN
| for both debugging and network play. Is this low enough latency
| for that? The idea would be to help new emulator authors not
| have to build a debugger or test suite: if you could expose the
| controller interface and memory, a single debugger could be
| written and shared across new authors. All they would have to
| do is bind to a simple API that exposes memory and a few
| commands to the library.
| sighingnow wrote:
| > Does distributed mean it's replicated across machines?
|
| No, in vineyard "distributed" doesn't mean the data is
| replicated across machines; it means big data is partitioned
| across machines. We address cases where the data does not fit
| on a single machine.
|
| > Can this be run outside kubernetes or is it meant to be
| part of a big set of infra?
|
| Yes, vineyard can run outside of Kubernetes, but it would be
| a bit complicated to set up on many machines as a cluster.
|
| Deploying your vineyard cluster on Kubernetes lets you leverage
| Kubernetes' capabilities. We have already integrated with Helm,
| and a vineyard-operator is on our roadmap.
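
The distinction drawn above between partitioned and replicated data can be sketched in a few lines. This is only an illustration under assumed names (`partition_rows`, `locate`, and the host labels are hypothetical, not vineyard's API): a large table is described by small metadata that maps contiguous row ranges to hosts, and every row lives on exactly one host.

```python
def partition_rows(n_rows, hosts):
    """Assign contiguous row ranges to hosts; each row lives on one host only."""
    base, extra = divmod(n_rows, len(hosts))
    chunks, start = [], 0
    for i, host in enumerate(hosts):
        # The first `extra` hosts take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        chunks.append({"host": host, "start": start, "stop": stop})
        start = stop
    return chunks


def locate(chunks, row):
    """Return the single host that holds the given row."""
    for chunk in chunks:
        if chunk["start"] <= row < chunk["stop"]:
            return chunk["host"]
    raise IndexError(row)


if __name__ == "__main__":
    chunks = partition_rows(10, ["node-a", "node-b", "node-c"])
    print(chunks)
    print(locate(chunks, 4))
```

A replicated design would instead list every host for every row; here the chunk list plays the role of the metadata that lets a reader find which machine holds a given piece.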
___________________________________________________________________
(page generated 2021-01-19 23:01 UTC)