https://github.com/alibaba/libvineyard

alibaba / libvineyard

libvineyard: an in-memory immutable data manager.

v6d.io
Apache-2.0 License
README.rst

                             libvineyard

                 an in-memory immutable data manager

Vineyard is an in-memory immutable data manager that provides
out-of-the-box high-level abstractions and zero-copy in-memory sharing
for distributed data in big data tasks, such as numerical computing,
machine learning, and graph analytics.

Vineyard is designed to enable zero-copy data sharing between big
data systems. Let's begin with a typical machine learning task:
time series prediction with an LSTM. The task is divided into several
steps. First, we read the data from the file system as a
pandas.DataFrame. Then, we apply some preprocessing to the dataframe,
such as eliminating null values. After that, we define the model and
train it on the processed dataframe in PyTorch. Finally, we evaluate
the performance of the model.

On a single machine, although pandas and PyTorch are two different
systems targeting different tasks, data can be shared between them
efficiently at little extra cost, with everything happening
end-to-end in a single Python script.

Comparing the workflow with and without vineyard

What if the input data is too big to be processed on a single
machine? As illustrated on the left side of the figure, a common
practice is to store the data as tables on a distributed file system
(e.g., HDFS), and replace pandas with ETL processes using SQL over a
big data system such as Hive or Spark. To share the data with
PyTorch, the intermediate results are typically saved back as tables
on HDFS. This causes several headaches for developers:

 1. For the same task, users are forced to program for multiple
    systems (SQL & Python).
 2. Data can be polymorphic. Non-relational data, such as tensors,
    dataframes, and graphs/networks, are becoming increasingly
    prevalent. Tables and SQL may not be the best way to store,
    exchange, or process them. Transforming the data to and from
    "tables" as it moves between different systems can be a huge
    overhead.
 3. Saving the data to and loading it from external storage requires
    many memory copies and incurs I/O costs.
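
To make the second and third points concrete, here is a minimal sketch
in plain Python (not vineyard code) of what pushing a small tensor
through a table-shaped intermediate involves. The helper names
tensor_to_rows, rows_to_tensor, and roundtrip_through_table are
hypothetical, and an in-memory CSV buffer stands in for external
storage.

```python
import csv
import io


def tensor_to_rows(tensor):
    """Flatten a 2-D tensor (list of lists) into (row, col, value) records."""
    return [(i, j, v) for i, row in enumerate(tensor) for j, v in enumerate(row)]


def rows_to_tensor(rows, n_rows, n_cols):
    """Rebuild the 2-D tensor from (row, col, value) records."""
    tensor = [[0.0] * n_cols for _ in range(n_rows)]
    for i, j, v in rows:
        # Values come back from the table as text and must be re-parsed.
        tensor[int(i)][int(j)] = float(v)
    return tensor


def roundtrip_through_table(tensor):
    """Serialize the tensor as CSV text and parse it back.

    The in-memory buffer is a stand-in for a file on external storage.
    """
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(tensor_to_rows(tensor))
    buffer.seek(0)
    rows = list(csv.reader(buffer))
    return rows_to_tensor(rows, len(tensor), len(tensor[0]))


original = [[1.0, 2.0], [3.0, 4.0]]
restored = roundtrip_through_table(original)
```

Even for this tiny tensor, every value is copied, converted to text,
and parsed again, and the shape has to be carried separately.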

Vineyard is designed to solve these issues by providing:

 1. In-memory distributed data sharing in a zero-copy fashion, which
    avoids extra I/O costs by exploiting a shared memory manager
    derived from plasma.
 2. Built-in, out-of-the-box high-level abstractions for sharing
    distributed data with complex structures (e.g., distributed
    graphs) at nearly zero extra development cost, while eliminating
    the transformation costs.
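
The idea behind the first point can be sketched with Python's standard
multiprocessing.shared_memory module. This is only an analogy under
stated assumptions: it is not vineyard's implementation, and the
function names publish and checksum_shared are hypothetical. The
producer writes the payload once into a named block, and the consumer
attaches by name and reads through a view instead of receiving a copy.

```python
from multiprocessing import shared_memory


def publish(payload: bytes) -> shared_memory.SharedMemory:
    """Producer side: write the payload once into a named shared block."""
    block = shared_memory.SharedMemory(create=True, size=len(payload))
    block.buf[:len(payload)] = payload
    return block


def checksum_shared(name: str, size: int) -> int:
    """Consumer side: attach by name and read through a read-only view.

    The size is passed explicitly because the operating system may round
    the block up to a larger size than was requested.
    """
    consumer = shared_memory.SharedMemory(name=name)
    try:
        # The view points at the shared pages; no bytes are copied here.
        with consumer.buf[:size].toreadonly() as view:
            return sum(view)
    finally:
        consumer.close()


payload = bytes(range(10))
block = publish(payload)
try:
    total = checksum_shared(block.name, len(payload))
finally:
    block.close()
    block.unlink()  # release the block once nobody needs it
```

In a real deployment the producer and the consumer would be separate
processes; they run in one process here only to keep the sketch short.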

The right side of the above figure illustrates how to integrate
vineyard to solve the task in the big data context.

First, we use Mars (a tensor-based unified framework for large-scale
data computation which scales NumPy, pandas, and scikit-learn) to
preprocess the raw data just as the single-machine solution does, and
save the preprocessed dataframe into vineyard.

Single machine:

    import pandas as pd
    data_csv = pd.read_csv('./data.csv', usecols=[1])

Distributed:

    import mars.dataframe as md
    dataset = md.read_csv('hdfs://server/data_full', usecols=[1])
    # after preprocessing, save the dataset to vineyard
    vineyard_distributed_tensor_id = dataset.to_vineyard()

Then, we modify the training phase to get the preprocessed data from
vineyard. Here vineyard makes sharing distributed data between Mars
and PyTorch as simple as passing a local variable in the
single-machine solution.

Single machine:

    data_X, data_Y = create_dataset(dataset)

Distributed:

    import vineyard
    client = vineyard.connect(vineyard_ipc_socket)
    dataset = client.get(vineyard_distributed_tensor_id).local_partition()
    data_X, data_Y = create_dataset(dataset)
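
The hand-off pattern used here (the producer publishes an object and
gets an ID back, each worker resolves that ID to its own partition) can
be sketched with a toy in-process store. Everything below is
hypothetical and for illustration only: ToyStore, ToyClient, and
ToyHandle are not vineyard classes, and a real client talks to a
vineyard server over its IPC socket instead of sharing a Python dict.

```python
import itertools


class ToyStore:
    """Hypothetical stand-in for the shared store: maps IDs to partitions."""

    def __init__(self):
        self.objects = {}
        self._ids = itertools.count(1)

    def new_id(self):
        return next(self._ids)


class ToyHandle:
    """Hypothetical handle to a distributed object, bound to one worker."""

    def __init__(self, partitions, worker):
        self._partitions = partitions
        self._worker = worker

    def local_partition(self):
        # Each worker only sees the chunk that lives next to it.
        return self._partitions[self._worker]


class ToyClient:
    """Hypothetical client; 'worker' plays the role of the local host."""

    def __init__(self, store, worker):
        self._store = store
        self._worker = worker

    def put(self, partitions):
        object_id = self._store.new_id()
        self._store.objects[object_id] = dict(partitions)
        return object_id

    def get(self, object_id):
        return ToyHandle(self._store.objects[object_id], self._worker)


store = ToyStore()
# The preprocessing job publishes one chunk per worker and shares the ID.
dataset_id = ToyClient(store, "worker-0").put(
    {"worker-0": [0.1, 0.2], "worker-1": [0.3, 0.4]}
)
# A training worker on another host resolves the same ID to its chunk.
local_chunk = ToyClient(store, "worker-1").get(dataset_id).local_partition()
```

Only the ID travels between the two jobs; the data itself stays where
it was written.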

Finally, we run the training phase in a distributed fashion across
the cluster.

From the example, we see that with vineyard, the task in the big data
context can be handled with only minor modifications to the
single-machine solution. Compared with the existing approaches, the
I/O and transformation overheads are also eliminated.

 Features

 In-Memory immutable data sharing

Vineyard is an in-memory immutable data manager that shares immutable
data across different systems via shared memory without extra
overhead. It eliminates the cost of serialization/deserialization and
I/O when exchanging immutable data between systems.
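
What "immutable" buys the consumer can be sketched with a read-only
memoryview from the Python standard library. This is an analogy, not
vineyard's mechanism, and the names seal and try_write are
hypothetical. The consumer gets a view over the producer's bytes that
it can read freely but cannot modify.

```python
def seal(buffer: bytearray) -> memoryview:
    """Return a read-only view over the producer's buffer (no copy made)."""
    return memoryview(buffer).toreadonly()


def try_write(view: memoryview) -> bool:
    """Report whether a consumer holding this view could modify the data."""
    try:
        view[0] = 0
    except TypeError:
        # Writing through a read-only memoryview raises TypeError.
        return False
    return True


producer_buffer = bytearray(b"vineyard")
shared_view = seal(producer_buffer)
consumer_can_write = try_write(shared_view)
```

Note that the read-only flag in this sketch restricts only holders of
the view; it is the producer's responsibility to stop writing once the
data has been published.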

 Out-of-box high level data abstraction

Computation frameworks usually have their own data abstractions for
high-level concepts. For example, a tensor could be a torch.Tensor, a
tf.Tensor, an mxnet.ndarray.NDArray, etc., not to mention that every
graph processing engine has its own graph structure representation.

The variety of data abstractions makes sharing hard. Vineyard
provides out-of-the-box high-level data abstractions over in-memory
blobs by describing objects using hierarchical metadata. Various
computation systems can use the built-in high-level data
abstractions to exchange data with other systems in a computation
pipeline in a concise manner.
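
A minimal sketch of "hierarchical metadata over blobs", using plain
Python dicts. The type names, keys, and blob IDs below are hypothetical
and chosen only for illustration; they are not vineyard's actual
metadata schema. A dataframe is described as a tree whose leaves
reference raw blobs, so a consumer can walk the tree and touch only the
blobs it needs.

```python
# Raw payloads, keyed by a (hypothetical) blob ID.
blobs = {
    "blob-1": b"\x01\x02\x03\x04",
    "blob-2": b"\x05\x06\x07\x08",
}

# A dataframe described as a tree: dataframe -> columns -> tensors -> blobs.
dataframe_meta = {
    "typename": "dataframe",
    "columns": {
        "a": {"typename": "tensor", "dtype": "uint8", "shape": [4], "buffer": "blob-1"},
        "b": {"typename": "tensor", "dtype": "uint8", "shape": [4], "buffer": "blob-2"},
    },
}


def collect_blob_ids(meta):
    """Walk the metadata tree and return every blob ID it references."""
    ids = []
    for key, value in meta.items():
        if key == "buffer":
            ids.append(value)
        elif isinstance(value, dict):
            ids.extend(collect_blob_ids(value))
    return ids


def column_values(meta, column):
    """Resolve one column to its values without touching other columns."""
    tensor = meta["columns"][column]
    return list(memoryview(blobs[tensor["buffer"]]))


referenced = collect_blob_ids(dataframe_meta)
column_b = column_values(dataframe_meta, "b")
```

Because the description is separate from the payload, two systems only
need to agree on the metadata layout to exchange the same blobs.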

 Stream pipelining

A computation doesn't need to wait for all of its predecessor's
results to arrive before starting to work. Vineyard provides streams
as a special kind of immutable data for such pipelining scenarios. The
preceding job can write the immutable data chunk by chunk to vineyard
while maintaining the data structure semantics, and the successor job
reads shared-memory chunks from vineyard's stream without extra copy
cost, then triggers its own work. The overlap helps reduce the
overall processing time and memory consumption.
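
The overlap can be sketched with a bounded queue and a thread from the
Python standard library. This is an in-process analogy under stated
assumptions, not vineyard's stream API: the names producer, consumer,
run_pipeline, and END_OF_STREAM are hypothetical, and chunks are passed
by reference rather than through shared memory.

```python
import queue
import threading

END_OF_STREAM = object()  # sentinel marking that no more chunks will come


def producer(stream, chunks):
    """Preceding job: publish chunks one by one, then close the stream."""
    for chunk in chunks:
        stream.put(chunk)  # blocks while the stream is full
    stream.put(END_OF_STREAM)


def consumer(stream, results):
    """Successor job: start working as soon as the first chunk arrives."""
    while True:
        chunk = stream.get()
        if chunk is END_OF_STREAM:
            break
        results.append(sum(chunk))


def run_pipeline(chunks, max_pending=2):
    """Run both jobs concurrently with at most max_pending queued chunks."""
    stream = queue.Queue(maxsize=max_pending)
    results = []
    worker = threading.Thread(target=consumer, args=(stream, results))
    worker.start()
    producer(stream, chunks)
    worker.join()
    return results


partial_sums = run_pipeline([[1, 2], [3, 4], [5, 6]])
```

The bounded queue is what caps memory use: the producer pauses when
the consumer falls behind, so the whole dataset never has to be
resident at once.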

 Drivers

Many big data analytical tasks have lots of boilerplate routines that
are unrelated to the computation itself, e.g., various I/O
adaptors, data partition strategies, and migration jobs. As the data
structure abstraction usually differs between systems, such routines
cannot be easily reused.

Vineyard provides such common manipulation routines on immutable data
as drivers. Besides sharing the high-level data abstractions,
vineyard extends the capability of data structures with drivers,
enabling out-of-the-box reusable routines for the boilerplate part of
computation jobs.
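
One way to picture drivers is as routines registered against a data
type so that any job handling that type can reuse them. The sketch
below is a hypothetical illustration in plain Python; register_driver,
run_driver, and the two example routines are invented names and do not
reflect how vineyard registers or invokes its drivers.

```python
DRIVERS = {}  # maps (typename, routine name) to a callable


def register_driver(typename, name):
    """Decorator that registers a reusable routine for a data type."""
    def decorator(func):
        DRIVERS[(typename, name)] = func
        return func
    return decorator


@register_driver("dataframe", "partition")
def partition_rows(rows, n_parts):
    """Split rows round-robin into n_parts partitions."""
    return [rows[i::n_parts] for i in range(n_parts)]


@register_driver("dataframe", "to_csv")
def rows_to_csv(rows):
    """Render rows as CSV text (a stand-in for an I/O adaptor)."""
    return "\n".join(",".join(str(v) for v in row) for row in rows)


def run_driver(typename, name, *args):
    """Look up a routine by data type and name, then run it."""
    return DRIVERS[(typename, name)](*args)


parts = run_driver("dataframe", "partition", [1, 2, 3, 4, 5], 2)
text = run_driver("dataframe", "to_csv", [[1, 2], [3, 4]])
```

The computation job only names the routine it needs; the partitioning
and I/O logic lives with the data type and is written once.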

 Integrate with Kubernetes

Vineyard helps share immutable data between different workloads,
which makes it a natural fit for cloud-native computing. Vineyard can
provide efficient distributed data sharing in cloud-native
environments by embracing cloud-native big data processing, while
Kubernetes lets vineyard leverage its scale-in/out and scheduling
abilities.

 Deployment

To better leverage the scale-in/out capability of Kubernetes for the
worker pods of a data analytics job, vineyard can be deployed as a
DaemonSet in a Kubernetes cluster. Vineyard pods share memory with
worker pods using a UNIX domain socket with fine-grained access
control.

The UNIX domain socket can be mounted either via a hostPath volume or
via a PersistentVolumeClaim. When users bundle vineyard and the
workload into the same pod, the UNIX domain socket can also be shared
using an emptyDir volume.
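
The emptyDir variant can be sketched by building a Pod manifest as a
Python dict and printing it as JSON, which kubectl accepts as well as
YAML. The manifest fields (volumes, emptyDir, volumeMounts) are
standard Kubernetes Pod fields, but the pod name, container names,
image names, and the mount path /var/run/vineyard are placeholders
assumed for this example, and configuring vineyard to create its
socket under that path is not shown.

```python
import json

SOCKET_DIR = "/var/run/vineyard"  # hypothetical directory for the socket


def pod_with_shared_socket(workload_image, vineyard_image):
    """Build a Pod manifest where both containers mount one emptyDir."""
    shared_mount = {"name": "vineyard-socket", "mountPath": SOCKET_DIR}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "workload-with-vineyard"},
        "spec": {
            # The emptyDir lives as long as the pod and is visible to
            # every container that mounts it.
            "volumes": [{"name": "vineyard-socket", "emptyDir": {}}],
            "containers": [
                {
                    "name": "vineyard",
                    "image": vineyard_image,
                    "volumeMounts": [shared_mount],
                },
                {
                    "name": "worker",
                    "image": workload_image,
                    "volumeMounts": [shared_mount],
                },
            ],
        },
    }


# Both image names are placeholders, not published images.
manifest = pod_with_shared_socket(
    "example/worker:latest", "example/vineyardd:latest"
)
print(json.dumps(manifest, indent=2))
```

For the DaemonSet deployment, the same mount would instead point at a
hostPath volume or a PersistentVolumeClaim, so that worker pods on the
node can reach the socket created by the vineyard pod.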

 Deployment with Helm

Vineyard also integrates tightly with Kubernetes and Helm, and can be
deployed with Helm:

helm repo add vineyard https://dl.bintray.com/libvineyard/charts/
helm install vineyard vineyard/vineyard

In the future, vineyard will improve the integration with Kubernetes
by abstracting vineyard objects as Kubernetes resources (i.e., CRDs)
and by using a vineyard operator to operate the vineyard cluster.

 Install vineyard

Vineyard is distributed as a Python package and can be easily
installed with pip:

pip3 install vineyard

The latest version of the online documentation can be found at
https://v6d.io.

If you want to build vineyard from source, please refer to
Installation.

 License

libvineyard is distributed under Apache License 2.0. Please note that
third-party libraries may not have the same license as libvineyard.

 Acknowledgements

  * apache-arrow, a cross-language development platform for in-memory
    analytics;
  * boost-leaf, a C++ lightweight error augmentation framework;
  * dlmalloc, Doug Lea's memory allocator;
  * etcd-cpp-apiv3, a C++ API for etcd's v3 client API;
  * flat_hash_map, an efficient hashmap implementation;
  * pybind11, a library for seamless operability between C++11 and
    Python;
  * uri, a library for URI parsing.

 Getting involved

  * Read the contribution guide.
  * Please report bugs by submitting a GitHub issue.
  * Submit contributions using pull requests.

Thank you in advance for your contributions to vineyard!


