[HN Gopher] Launch HN: Activeloop (YC S18) - Data lake for deep ...
       ___________________________________________________________________
        
       Launch HN: Activeloop (YC S18) - Data lake for deep learning
        
       Hi HN, I'm Davit, the CEO of Activeloop (https://activeloop.ai).
       We've made a "data lake" (industry jargon for a large data store
       with lots of heterogeneous data) that's optimized for deep
       learning. Keeping your data in an AI-optimized format means you
       can ship AI models faster, without having to build complex data
       infrastructure for image, audio, and video data (check out our
       GitHub here: https://github.com/activeloopai/deeplake).

       Deep Lake stores complex data such as images, audio, videos,
       annotations/labels, and tabular data in the form of tensors, a
       data structure from linear algebra that AI systems like to
       consume. We then rapidly stream the data into three
       destinations: (a) a SQL-like language (Tensor Query Language)
       that you can use to query your data; (b) an in-browser engine
       that you can use to visualize your data; and (c) deep learning
       frameworks, letting you do AI magic on your data while fully
       utilizing your GPUs. Here's a 10-minute demo:
       https://www.youtube.com/watch?v=SxsofpSIw3k&t.

       Back in 2016, I started my Ph.D. research in deep learning and
       witnessed the transition from GB-scale to TB-scale and then to
       petabyte-scale datasets. To run our models at scale, we needed
       to rethink how we handled data. One of the ways we optimized
       our workflows was to stream the data while asynchronously
       running the computation on GPUs. This served as an inspiration
       for creating Activeloop.
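       To make this concrete, here's a minimal sketch of that
       streaming workflow with the open-source Python API (the dataset
       path is a public one; tensor names and exact signatures may
       vary across versions):

         import deeplake

         # One line of code: the dataset streams on demand;
         # nothing is downloaded up front.
         ds = deeplake.load("hub://activeloop/cifar100-train")

         # Wrap the dataset as a PyTorch dataloader. Batches are
         # fetched and decompressed asynchronously while the GPU
         # runs the training step.
         loader = ds.pytorch(batch_size=64, num_workers=4,
                             shuffle=True)

         for batch in loader:
             images, labels = batch["images"], batch["labels"]
             # ...forward/backward pass here; the GPU stays busy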
       When you want to use unstructured data for deep learning
       purposes, you'll encounter the following options:

       - Store metadata (pointers to the unstructured data) in a
       regular database, and the images in object storage. Querying
       the metadata table and then fetching images from object
       storage is inefficient for high-throughput workloads (see the
       sketch after this list).

       - Store images inside a database. This typically explodes the
       memory cost and will get expensive fast. For example, storing
       images in MongoDB and using them to train a model would cost
       20x more than a Deep Lake setup [2].

       - Extend Parquet or Arrow to store images. On the plus side,
       you can now use existing analytical tools such as Spark,
       Kafka, and even DuckDB. But even major self-driving car
       companies have failed on this path.

       - Build custom infrastructure aligned with your data in-house.
       Assuming you have the money and access to 10 solid data
       engineers with PhD-level knowledge, this still takes time
       (~2.5+ years), is difficult to extend beyond the initial
       vertical, will be hard to maintain, and will defocus your data
       scientists.
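       To illustrate the first option, the classic two-step fetch
       looks roughly like this (a hypothetical sketch; the table and
       bucket names are made up). The per-sample round trip to object
       storage is what caps throughput:

         import boto3
         import psycopg2

         db = psycopg2.connect("dbname=metadata")
         s3 = boto3.client("s3")

         cur = db.cursor()
         cur.execute(
             "SELECT s3_key, label FROM samples WHERE split = 'train'"
         )
         for s3_key, label in cur:
             # One GET request per image: fine for a demo, far too
             # slow to keep a multi-GPU training job fed.
             obj = s3.get_object(Bucket="my-images", Key=s3_key)
             image_bytes = obj["Body"].read()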
       Whatever the case, you'll get slow iteration cycles,
       under-utilized GPUs, and lots of ML engineer busywork (and
       thus high costs).

       Your unstructured data already sits in a data lake such as S3
       or a distributed file system (e.g., Lustre), and you probably
       don't want to change this. Deep Lake keeps everything that
       makes a regular data lake great: it helps you version-control,
       run SQL queries, ingest billion-row datasets efficiently, and
       visualize terabyte-scale datasets in your browser or notebook.
       But there is one key difference from traditional data lakes:
       we store complex data, such as images, audio, videos,
       annotations/labels, and tabular data, in a tensorial form that
       is optimized for deep learning and GPU utilization.
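       As an example, running one of those SQL-style queries from
       Python looks roughly like this (a sketch; the query string is
       illustrative):

         import deeplake

         ds = deeplake.load("hub://activeloop/coco-train")

         # TQL is SQL-like but tensor-aware; the result is a
         # lightweight view that can be streamed or materialized.
         view = ds.query(
             "select * where contains(categories, 'person') limit 1000"
         )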
       Some stats/benchmarks since our launch:

       - In a third-party benchmark by Yale University [3], Deep Lake
       provided the fastest data loader for PyTorch, especially when
       it comes to networked loading.

       - Deep Lake handles scale and long distance: we trained a
       1B-parameter CLIP model on 16xA100 GPUs in a single machine on
       the LAION-400M dataset, streaming the data from US-EAST (AWS)
       to US-CENTRAL (GCP) [4] [5].

       - You can access datasets as large as 200M samples of
       image-text pairs in seconds (compared to the 100+ hours it
       takes via traditional methods) with one line of code [6].
       What's free and what's not: the data format, the Python
       dataloader, version control, and data lineage (a log of how
       the data came to its current state) with the Python API are
       open-source [7]. The query language, fast streaming, and
       visualization engines are built in C++ and are closed-source
       for the time being, but are accessible via a Python interface.
       Users can store up to 300GB of their data with us for free.
       Our growth plan is $995/month and includes an optimized query
       engine, the fast data loader, and features like analytics. If
       you're an academic, you can get this plan for free. Finally,
       we have an enterprise plan including role-based access
       control, security, integrations, and more than 10 TB of
       managed data [8].
       Teams at Intel, Google, & MILA use Deep Lake. If you want to
       read more, we have an enterprise-y whitepaper at
       https://www.deeplake.ai/whitepaper, an academic paper at
       https://arxiv.org/abs/2209.10785, and a launch blog post with
       a deep dive into features at
       https://www.activeloop.ai/resources/introducing-deep-lake-th...

       I would love to hear your thoughts on this, especially
       anything about how you manage your deep learning data and what
       issues you run into with your infra. I look forward to all
       your comments. Thanks a lot!
       [1] https://www.activeloop.ai/resources/introducing-deep-lake-th...
       [2] https://imgur.com/a/AZtWSkA
       [3] https://arxiv.org/abs/2209.13705
       [4] https://imgur.com/a/POtHklM
       [5] https://github.com/activeloopai/deeplake-laion-clip
       [6] https://datasets.activeloop.ai/docs/ml/datasets/coco-dataset...
       [7] https://github.com/activeloopai/deeplake
       [8] https://app.activeloop.ai/pricing
        
       Author : davidbuniat
       Score  : 46 points
       Date   : 2022-11-15 16:01 UTC (7 hours ago)
        
       | dimatura wrote:
       | I've been observing and trying out all sorts of different
       | solutions in this space (dvc, git-annex, git-lfs, quilt, just
       | rsync, etc), and so far haven't been super happy with any of
       | them. So it's nice to see more alternatives.
       | 
       | Some questions/observations:
       | 
       | - Python API is great and definitely more useful than CLI when it
       | comes to regular use. But a CLI would also be nice for some
       | operations, such as import/export, viewing history metadata,
       | etc.
       | 
       | - Just a nit, but if I didn't already know about activeloop I
       | might be less interested due to the "deep lake" branding. There's
       | already so much "data lake" stuff that I have no interest in
       | since it's historically not useful for image data.
       | 
       | - In academic contexts, it's typical to have a static dataset and
       | always use that. In my use cases, typically I have an ever-
       | growing "raw" dataset and extract/preprocess subsets periodically
       | for (re)training, annotation, etc. I typically consider these
       | subsets ephemeral, as I can regenerate on demand. Last time I
       | tried activeloop, it seemed like it was more geared towards the
       | static dataset use case, but browsing the docs now it seems like
       | there's more consideration of the latter case, so I'll have to
       | look at that.
       | 
       | - In the examples I've seen, it seems like if a JPEG image is
       | added to the dataset with jpeg compression, it first gets
       | decompressed, so it would have to be recompressed - a lossy
       | operation, right?
       | 
       | (edit: formatting)
        
         | davidbuniat wrote:
         | There's indeed plenty of tools out there, so the next few years
         | are very interesting in terms of seeing which approach gets
         | adopted by the wider audience. Feedback from the community
         | (inc. feedback from you!) is instrumental in designing a tool
         | that works well and addresses all the needs, so thanks a lot
         | for your input re: CLI and branding. :)
         | 
         | Good observation - we've made sure that datasets can evolve
         | as you go. Specifically for your use case, you can query
         | subsets of data and materialize them on the fly for
         | streaming, then later go back to a specific dataset "view"
         | (i.e. a saved query) as needed.
         | 
         | See how this works at around 5:50 here -
         | https://youtu.be/SxsofpSIw3k
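         | 
         | In code, the view workflow is roughly this (a sketch; the
         | exact view APIs are in our docs):
         | 
         |   import deeplake
         | 
         |   ds = deeplake.load("hub://my-org/raw-data")  # illustrative
         |   view = ds.query(
         |       "select * where contains(labels, 'car') limit 10000"
         |   )
         |   view_id = view.save_view(message="training subset v1")
         |   # ...later, rematerialize the exact same subset
         |   view = ds.load_view(view_id)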
         | 
         | As for your last question, when the jpeg is appended using its
         | file path, the compressed bytes get stored in the dataset
         | without decompression/recompression. When the data is accessed
         | as a numpy array, then the jpeg bytes are decompressed.
         | 
         | For researchers in academia, our Growth plan is free. Since
         | you work at a startup, the trial for the Growth plan is two
         | weeks. If you want access, hit us up in the community Slack
         | (slack.activeloop.ai - or you can just test the querying on
         | public activeloop datasets!)
        
           | davidbuniat wrote:
           | One more thing regarding compression -> you can either
           | deeplake.link() to your raw data lake without touching it,
           | or use deeplake.read(), which preserves the compression as
           | long as it matches the tensor's default compression.
           | 
           | https://docs.deeplake.ai/en/latest/deeplake.html?highlight=l...
           | https://docs.deeplake.ai/en/latest/deeplake.html#deeplake.re...
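           | 
           | A minimal sketch of the two paths (tensor names
           | illustrative):
           | 
           |   import deeplake
           | 
           |   ds = deeplake.empty("./my-dataset")
           | 
           |   # deeplake.read: stores the original jpeg bytes as-is,
           |   # no decode/re-encode, since the compressions match
           |   ds.create_tensor("images", htype="image",
           |                    sample_compression="jpeg")
           |   ds.images.append(deeplake.read("photo.jpg"))
           | 
           |   # deeplake.link: leaves the bytes in your bucket and
           |   # stores only a reference (linked tensors use a link
           |   # htype)
           |   ds.create_tensor("image_links", htype="link[image]",
           |                    sample_compression="jpeg")
           |   ds.image_links.append(
           |       deeplake.link("s3://my-bucket/photo.jpg"))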
        
             | dimatura wrote:
             | thanks for the pointer, I'll definitely give deeplake a
             | closer look
        
               | davidbuniat wrote:
               | my pleasure, let us know if you have any feedback, would
               | love to see you succeed with Deep Lake. :)
        
       | GeorgyM wrote:
       | Love your product! I noticed your partnership with Intel - what
       | should we expect from it?
        
         | davidbuniat wrote:
         | thank you so much! We've recently announced a broad strategic
         | collaboration with Intel Corporation to advance the field of AI
         | data infrastructure. More specifically, this means initiatives
         | in making sure models train super-smoothly on Deep Lake
         | datasets using 3rd Gen Intel Xeon Scalable processors with
         | built-in AI accelerators, as well as a range of technological
         | improvements
         | to Deep Lake given Intel's know-how in the field. Together, we
         | will hopefully abstract away the need to build complex data
         | infrastructure in-house :)
        
       | ianbutler wrote:
       | Nice, seems like you're entering into a pretty competitive space.
       | I know multiple teams across different companies using Databricks
       | and their datalake and ML feature stores for this purpose.
       | 
       | I get that your tool is more specifically optimized for the task
       | of large scale ML but what's your strategy for going up against
       | the likes of Databricks especially when they can point to their
       | solution and go, hey you can use our datalake both for normal
       | business intelligence solutions and ML.
       | 
       | Here it seems like your tool would be separate from the BI
       | datalake and would likely be thought of and implemented a while
       | after the BI datalake is implemented since ML maturity seems to
       | come after BI maturity in companies I've worked for.
        
         | davidbuniat wrote:
         | A very fair point - we do usually say that our main
         | (possible) competition is the "traditional" data lakes!
         | We've spent 4+ years designing this system to specifically
         | resolve the issue for unstructured AI data like videos,
         | images, and audio (and multimodal datasets, like the ones
         | used to train models such as Stable Diffusion, for
         | instance).
         | 
         | Our main competitive advantage against the players you've
         | mentioned is just that - our bet is that Deep Learning will
         | overtake traditional BI workflows (especially with >90% of data
         | generated today being unstructured), and we've been preparing
         | for it. Traditional "BI datalakes" are pretty inefficient when
         | it comes to storing the data specifically for deep learning
         | workflows. They currently also lack an entire suite of key
         | features (visualization for those data types, query engine
         | based on tensors, etc.) to be able to successfully convince the
         | potential users.
         | 
         | As a matter of fact, we're seeing adoption not only from
         | AI-first companies/startups who are building their
         | infrastructure from the ground up, but also from mature
         | companies who are hitting the limits of traditional setups.
         | 
         | Keeping that in mind, we're working on making the onboarding
         | for such companies much easier, so their cost of switching to a
         | more efficient/performant setup is much lower.
         | 
         | As for Databricks specifically, we see them more as a
         | complement, rather than a competitor.
        
           | streetcat1 wrote:
           | So 90% of the data is unstructured, but 99% of the ML use
           | cases are tabular (structured), where tree-based
           | approaches can win against DL.
           | 
           | Also, 90% of the unstructured data is unlabeled. Hence,
           | your calculation should be for "labeled" unstructured
           | data - is this 90%? I would argue that outside big tech,
           | this is 0.1%.
           | 
           | Your competition is not Databricks. Databricks' main use
           | case is tabular data (both in the delta lake and in ML),
           | i.e. Databricks competes with Snowflake. It tries to be a
           | database, i.e. it tries to get out of the data lake.
           | 
           | I think that your competition is with S3 and R2 on the
           | storage side, and with transformer-based models (Hugging
           | Face). Correct me if I am wrong, but the whole idea with
           | transformer models is that the training was already done,
           | and you can use a small amount of domain-specific data?
           | I.e. you do not need a lot of storage?
        
             | davidbuniat wrote:
             | "...I would argue that outside big tech, this is 0.1%."
             | 
             | Fair point regarding the unlabeled/unstructured data. One
             | could also argue that labeled data isn't going to be a
             | prerequisite forever (see
             | https://ai.facebook.com/blog/the-first-high-performance-self...).
             | We see a very sharp rise in unstructured data use for ML
             | (especially a large spike caused by generative models
             | like DALL-E 2 and Stable Diffusion). In my opinion, the
             | majority of the novel use
             | cases are outside of big tech, and we also see a trend in
             | "legacy" companies like media, manufacturing, etc. start
             | building dedicated ML teams. The industry is still nascent,
             | but it is growing fast. Frankly, we see the pain points
             | we're solving resonate with so many more companies than
             | just a year ago.
             | 
             | Agree re Snowflake/Databricks, they are partners rather
             | than competitors. We sit on top of S3/GCS or other blob
             | storages and currently are competing with various in-house
             | solutions that ML scientists built themselves. I do see
             | your point regarding large foundational models that would
             | be only fine-tuned on the tail end for various use cases. I
             | believe there would still be companies building
             | foundational models from scratch (currently at 5 billion
             | images) so they can serve more application-specific
             | products, plus unstructured data generators that partner
             | with those companies, creating a good enough market for
             | the tool.
        
           | clusterhacks wrote:
           | "... our bet is that Deep Learning will overtake traditional
           | BI workflows ..."
           | 
           | This is an interesting perspective. I have spent years in the
           | traditional BI space, and my gut feeling is that
           | analytics there are very much not fancy. Simple stuff seems
           | to be where the real ROI is at.
           | 
           | Are you saying that data storage, data model, etc that
           | Activeloop puts in place to better support deep learning
           | workflows will replace the data storage, data model, etc as
           | the store of information but visualization and querying will
           | still be like BI work? Or alternatively, are you saying that
           | deep learning is on a roaring path to replace traditional BI
           | analytics?
        
             | davidbuniat wrote:
             | Thanks - it's very insightful to also hear your perspective
             | as someone coming in from the BI space (if you have any
             | more insights, please post them here, too). We have this
             | internal joke: when someone says their analytics is
             | based on regressions, it's really an Excel sheet; if
             | they say it's AI, it's a simple ordinary least squares
             | regression; and only a handful do actual AI/ML.
             | 
             | From what we are seeing in the market, both domains are
             | growing, with an overlap that is expanding, too. I think
             | while
             | BI/Analytics would still be a major space, we would see
             | more DL-based novel applications generating increasingly
             | more business value (i.e. self-driving cars, robotics,
             | agritech). After all, even in VERY traditional
             | workflows/companies like economic growth estimation, we're
             | seeing DL being applied (e.g. they look at nightlight
             | satellite imagery to estimate economic
             | growth/urbanization).
             | 
             | So to answer your question, for some parts I think it
             | would be the former (a complement), and for other
             | applications it would call for replacement (particularly
             | in the cases where companies use multi-modal data).
        
         | dimatura wrote:
         | I work in a startup that uses ML for computer vision and I get
         | the impression that there's significantly less commercial
         | tooling to handle computer vision (or more generally, computer
         | perception)-type data than tabular or text data. I get tons of
         | salespeople contacting me about all sorts of solutions that I
         | have no use for because they're not designed for image data.
         | 
         | Don't get me wrong, there's tons of startups with tools for
         | image/audio/video/point cloud/etc data as well, but generally
         | they seem less mature/useful to me at this point in time.
         | Activeloop is definitely interesting to me - I've tried it
         | already, in fact, but it was a while ago so it was still a
         | little bit half-baked I think.
        
           | davidbuniat wrote:
           | As a matter of fact, I do agree - the CV space is less
           | crowded (until now). That's hopefully where we come in!
           | 
           | thanks a lot for the input, and thanks for trying us out. You
           | know, it's always a work-in-progress, but we've actually done
           | a major overhaul - this is our biggest release yet and I'd
           | love it if you gave it a try, especially the querying feature
           | we're super-proud of. :)
           | https://docs.activeloop.ai/tutorials/querying-datasets
           | 
           | Here are a couple of playbooks on how the new features
           | (visualization + querying + version control) play together
           | to solve complex workflows:
           | 
           | - https://docs.activeloop.ai/playbooks/training-with-lineage
           | - https://docs.activeloop.ai/playbooks/evaluating-model-perfor...
           | - https://docs.activeloop.ai/playbooks/training-reproducibilit...
        
       | fxtentacle wrote:
       | The HuggingFace datasets python package is free and can handle
       | images inside Apache Arrow just fine. Combined with a few Exos,
       | NVMes, and dm-cache, storing TBs of data becomes very affordable
       | and I have direct 10G access and can mmap on the network share.
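       | 
       | For reference, the kind of setup I mean, roughly (paths
       | illustrative; datasets memory-maps the Arrow files, so
       | repeated epochs are cheap):
       | 
       |   from datasets import load_dataset
       | 
       |   # build an Arrow-backed dataset from images on the NAS
       |   ds = load_dataset("imagefolder", data_dir="/mnt/nas/images")
       | 
       |   # or stream a hosted dataset without downloading all of it
       |   ds = load_dataset("laion/laion2B-en", split="train",
       |                     streaming=True)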
       | 
       | In my opinion, this space will be won based on price per
       | throughput.
       | 
       | But you're charging $1k monthly for 1TB? And my hobbyist speech
       | recognition projects (30TB) vastly exceed your enterprise plan?
       | That discrepancy makes it challenging to believe that your
       | team has as much experience with actual AI projects as the
       | reference to self-driving cars would suggest.
       | 
       | Anything that easily fits onto a consumer NAS is probably not an
       | enterprise data lake. A one-time EUR1700 purchase (equivalent to
       | 2 months of your service) will buy me a TS-464-4G with 32TB RAID1
       | and 5 years of warranty.
        
         | avrionov wrote:
         | How do you store your data set? And do you have any open source
         | releases?
        
           | davidbuniat wrote:
           | You can store your data either remotely or locally (see
           | how here:
           | https://docs.activeloop.ai/getting-started/creating-datasets...).
           | 
           | You can then visualize your datasets if they're stored on our
           | cloud, in AWS/GCP, or you can drag and drop your local
           | dataset in Deep Lake format into our UI
           | (https://docs.activeloop.ai/dataset-visualization)
           | 
           | We do - the version control, the Python-based dataloader,
           | and the dataset format are open source! Please check out
           | https://github.com/activeloopai/deeplake.
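           | 
           | Creating a dataset looks roughly like this (paths and
           | creds illustrative):
           | 
           |   import deeplake
           | 
           |   # local
           |   ds = deeplake.empty("./my-dataset")
           |   # directly in your own bucket
           |   ds = deeplake.empty("s3://my-bucket/my-dataset",
           |                       creds={"aws_access_key_id": "..."})
           |   # or managed on Activeloop storage
           |   ds = deeplake.empty("hub://my-org/my-dataset")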
        
           | fxtentacle wrote:
           | "HuggingFace datasets" is an open source Python package:
           | https://github.com/huggingface/datasets/
           | 
           | And they also have ready-to-use scripts for A LOT of the
           | usual datasets: https://huggingface.co/datasets
           | 
           | including LAION 400M and LAION 2B:
           | https://huggingface.co/datasets/laion/laion2B-en
        
             | davidbuniat wrote:
             | thanks for the context and links! I replied to these
             | points in another comment slightly above. :)
        
         | davidbuniat wrote:
         | Re: HF - we know them and admire their work (primarily, until
         | very recently, focused on NLP, while we focus mostly on CV). As
         | mentioned in the post, a large part of Deep Lake, including the
         | Python-based dataloader and dataset format, is open source as
         | well - https://github.com/activeloopai/deeplake.
         | 
         | Likewise, we curate a list of large open-source datasets
         | here -> https://datasets.activeloop.ai/docs/ml/, but our
         | main thing isn't aggregating datasets (the focus of HF
         | datasets), but rather providing people with a way to manage
         | their data efficiently.
         | That being said, all of the 125+ public datasets we have are
         | available in seconds with one line of code. :)
         | 
         | We haven't benchmarked against HF datasets in a while, but
         | Deep Lake's dataloader is much, much faster in third-party
         | benchmarks (see https://arxiv.org/pdf/2209.13705; for an
         | older version that was much slower than what we have now,
         | see https://pasteboard.co/la3DmCUR2iFb.png). HF under the
         | hood uses Git-LFS (to the best of my knowledge) and is not
         | opinionated on formats, so LAION just dumps Parquet files
         | on their storage.
         | 
         | While your setup would work for a few TBs, scaling to PBs
         | would be tricky, including maintaining your own
         | infrastructure. And yep, as you said, NAS/NFS wouldn't be
         | able to handle the scale (especially writes with 1k
         | workers). I am also slightly curious about your use of mmap
         | with compressed image/video data (as zero-copy won't
         | happen) unless you decompress inside the GPU ;), but would
         | love to learn more from you! Re: pricing - thanks for the
         | feedback; storage is one component and is custom-priced for
         | PB-scale workloads.
        
           | fxtentacle wrote:
           | I was referring specifically to this page:
           | 
           | https://www.activeloop.ai/pricing/
           | 
           | which says
           | 
           | "Deep Lake Enterprise" and then "10TB of managed data
           | (total)".
           | 
           | To me, that read as if the Enterprise plan is limited to a
           | maximum of 10TB.
        
             | davidbuniat wrote:
             | Ah, thank you so much for noticing this! This is a very
             | important piece of feedback (we will fix it shortly).
             | What we meant was that 10TB is the first tier of the
             | enterprise plans, and the rest are custom-billed
             | (typically because those also require other custom
             | integrations, etc.).
             | 
             | If you find any other points of confusion, please send
             | them our way and we will fix them - the community has
             | been instrumental over the years in iterating on the
             | product! :)
        
       ___________________________________________________________________
       (page generated 2022-11-15 23:01 UTC)