[HN Gopher] Launch HN: MindsDB (YC W20) - Machine Learning Insid...
       ___________________________________________________________________
        
       Launch HN: MindsDB (YC W20) - Machine Learning Inside Your Database
        
       Hi HN,  Adam and Jorge here, and today we're very excited to share
       MindsDB with you (http://github.com/mindsdb/mindsdb). MindsDB
       AutoML Server is an open-source platform designed to accelerate
       machine learning workflows for people with data inside databases by
       introducing virtual AI tables. We allow you to create and consume
       machine learning models as regular database tables.  Jorge and I
       have been friends for many years, having first met at college. We
       have previously founded and failed at another startup, but we stuck
       together as a team to start MindsDB. Initially a passion project,
       MindsDB began as an idea to help those who could not afford to hire
       a team of data scientists, which at the time was (and still is)
       very expensive. It has since grown into a thriving open-source
       community with contributors and users all over the globe.  With the
       plethora of data available in databases today, predictive modeling
       can often be a pain, especially if you need to write complex
       applications for ingesting data, training encoders and embedders,
       writing sampling algorithms, training models, optimizing,
       scheduling, versioning, moving models into production environments,
       maintaining them and then having to explain the predictions and the
       degree of confidence... we knew there had to be a better way!  We
       aim to steer you away from constantly reinventing the wheel by
       abstracting most of the unnecessary complexities around building,
       training, and deploying machine learning models. MindsDB provides
       you with two techniques for this: build and train models as simply
       as you would write an SQL query, and seamlessly "publish" and
       manage machine learning models as virtual tables inside your
       databases (we support Clickhouse, MariaDB, MySQL, PostgreSQL, and
       MSSQL. MongoDB is coming soon.) We also support getting data from
       other sources, such as Snowflake, S3, SQLite, and any Excel, JSON,
       or CSV file.  When we talk to our growing community, we find that
       they are using MindsDB for anything ranging from reducing financial
       risk in the payments sector to predicting in-app usage statistics -
       one user is even trying to predict the price of Bitcoin using
       sentiment analysis (we wish them luck). No matter what the use-
       case, what we hear most often is that the two most painful parts of
       the whole process are model generation (R&D) and/or moving the
       model into production.  For those who already have models (i.e. who
       have already done the R&D part), we are launching the ability to
       bring your own models from frameworks like Pytorch, Tensorflow,
       scikit-learn, Keras, XGBoost, CatBoost, LightGBM, etc. directly
       into your database. If you'd like to try this experimental feature,
       you can sign up here:
       (https://mindsdb.com/bring-your-own-ml-models)  We currently have a
       handful of customers who pay us for
       support. However, we will soon be launching a cloud version of
       MindsDB for those who do not want to worry about DevOps,
       scalability, and managing GPU clusters. Nevertheless, MindsDB will
       always remain free and open-source, because democratizing machine
       learning is at the core of every decision we make.  We're making
       good progress thanks to our open-source community and are also
       grateful to have the backing of the founders of MySQL & MariaDB. We
       would love your feedback and invite you to try it out.  We'd also
       love to hear about your experience, so please share your feedback,
       thoughts, comments, and ideas below. https://docs.mindsdb.com/ or
       https://mindsdb.com/  Thanks in advance, Adam & Jorge
        
       Author : adam_carrigan
       Score  : 107 points
       Date   : 2021-02-19 16:55 UTC (6 hours ago)
        
       | polskibus wrote:
       | How does your product differ from MS SQL's integrated R, for
       | someone who only needs MS SQL Server support?
        
         | torrmal wrote:
         | This is a great question. I actually think that if your
         | language is also R, the MS SQL R integration is a great option.
         | What we bring to the table for MSSQL users in particular is
         | more options, as well as better performance for some types of
         | problems, like high cardinality on time series. For example:
         | predicting inventory for all products in a database, taking
         | into account all previous inventory as well as, say, marketing
         | data. Building this with the R bindings would be quite a
         | challenge; with MindsDB it's a simple SQL statement.
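
To make the "high cardinality on time series" case concrete, here is a minimal sketch of the data preparation such a problem involves: one series per product, each turned into (previous values, next value) training examples. It only illustrates the problem shape and is not MindsDB code; `make_windows` and the sample rows are hypothetical.

```python
from collections import defaultdict


def make_windows(rows, window=3):
    """Group (product, day, inventory) rows by product and emit
    (product, previous_values, next_value) training examples."""
    series = defaultdict(list)
    # Sort by product, then by day, so each series is in time order.
    for product, day, inventory in sorted(rows, key=lambda r: (r[0], r[1])):
        series[product].append(inventory)

    examples = []
    for product, values in series.items():
        # Slide a fixed-size window over each product's history.
        for i in range(window, len(values)):
            examples.append((product, values[i - window:i], values[i]))
    return examples


# Made-up inventory rows for two products over four days.
rows = [
    ("widget", 1, 10), ("widget", 2, 8), ("widget", 3, 7), ("widget", 4, 5),
    ("gadget", 1, 3), ("gadget", 2, 4), ("gadget", 3, 6), ("gadget", 4, 9),
]
examples = make_windows(rows, window=3)
for example in examples:
    print(example)
```

With thousands of products the number of series grows accordingly, which is where hand-writing this per series becomes tedious.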
        
       | pachico wrote:
       | I've been following you, guys, for some months and I must say I'm
       | a huge fan. Being a hardcore Clickhouse user, I got hooked with
       | your tutorial about how to make it work with your product.
       | 
       | Best of luck!
        
         | george3d6 wrote:
         | Cheers :)
         | 
         | It's actually quite nice for me to hear that people we didn't
         | hear from yet are finding it useful. Since it's a library it's
         | hard to actually figure out how many people are really using it
         | successfully and what for.
         | 
         | If you don't mind sharing more details please do, either
         | through our usual channels (https://mindsdb.com/contact-us/) or
         | just send me an email (george.hosu@mindsdb.com). Figuring out
         | how people use it and what issues they encountered has been
         | immensely helpful to me.
        
         | torrmal wrote:
         | Thank you!! Let's chat, we would love to show you the cool
         | time-series stuff we have done for ClickHouse!
        
           | ajawee wrote:
           | Is it possible to do anomaly detection with ClickHouse data?
        
             | pcerdam wrote:
             | This is actually the next milestone in our time series
             | roadmap, so you can expect it to be possible rather soon :)
        
           | pachico wrote:
           | Sure thing! I'll contact you using the details in your
           | website!
        
       | maricels wrote:
       | Can I use it with MySQL?
        
         | ZoranP wrote:
         | Yes, you can. Check out the MySQL docs
         | https://docs.mindsdb.com/datasources/mysql/
        
       | aslakhellesoy wrote:
       | Looks very interesting! Any plans to support DynamoDB/Scylla?
        
         | torrmal wrote:
         | Thanks for asking! We develop based on community requests. If
         | you want, we can set up a call and see if we can help you with
         | your data on DynamoDB/Scylla; it's becoming pretty fast now to
         | support new databases. Send us an email: adam or jorge at
         | mindsdb.com
        
           | PeterCorless wrote:
           | If you do have a Scylla use case, please also feel free to
           | let us Scylla monsters know: peter at scylladb dot com. Very
           | encouraged to see such a great convergence occurring between
           | NoSQL + ML. The camps between data scientists and data
           | engineers have been pitched too far apart to date. My best
           | wishes to all those who helped bring this to fruition.
        
         | george3d6 wrote:
         | I would love to support Scylla, I ** love that database, those
         | guys are magicians. And I assume in supporting that we'd also
         | offer de-facto support for Cassandra.
         | 
         | I don't think either Scylla or dynamo are on the roadmap now,
         | but if you want them feel free to create an issue asking for
         | them: https://github.com/mindsdb/mindsdb
         | 
         | It should be noted that there are two levels of support:
         | 
         | 1. As a source of data (easy to implement)
         | 2. Being able to publish models into the database (a bit
         | harder)
         | 
         | If you work with those and are interested in doing ML from the
         | database please get in touch, ideally via github, but you can
         | also use the contact form (https://mindsdb.com/contact-us/) or
         | email one of us directly. The best case scenario for us is that
         | when we do one of these integrations we have an actual user in
         | mind, and we're open to "first users" for any database where we
         | can find a reasonable way of integrating.
        
       | streetcat1 wrote:
       | Hi
       | 
       | So I assume that you are doing hyperparameter search? Can you
       | share what optimization method you are using for search (e.g.
       | random, gp )?
       | 
       | Also, can the search be distributed in parallel across multiple
       | nodes?
       | 
       | And, if MindsDB is not part of the DB, what happens if MindsDB
       | fails?
       | 
       | Also, do you support automatic retraining? If yes, can you
       | elaborate more?
        
         | george3d6 wrote:
         | > So I assume that you are doing hyperparameter search? Can you
         | share what optimization method you are using for search (e.g.
         | random, gp )?
         | 
         | Short answer is optuna and ax but only sometimes.
         | 
         | The long answer led me down a rabbit hole, and it's 10k+ words
         | and a few experiments deep. If you're interested in this area
         | specifically, ping me; I've got nothing concrete, but I like
         | discussing it. A recent paper I saw that somewhat echoes my
         | thoughts is: https://arxiv.org/pdf/2102.03034.pdf | but some
         | bits feel either over my head and/or overly pedantic and/or
         | overly formal | and I'm not sure I agree with the conclusion |
         | and loads of it is irrelevant. But if the problem interests you
         | I'd suggest giving it some time, with those disclaimers in mind
         | 
         | > Also, can the search be distributed in parallel across
         | multiple nodes?
         | 
         | Theoretically yes, practically it's still WIP to get this to
         | work, but the architecture we have right now is very much
         | conceived with massive distribution in mind (see our docs for
         | more details on that).
         | 
         | > And, if MindsDB is not part of the DB, what happens if
         | MindsDB fails?
         | 
         | The select query you use to make a prediction returns an error,
         | essentially. Assuming you mean "what happens if it crashes or
         | if the model you are using crashes?".
         | 
         | e.g:
         | 
         | psql> SELECT diagnostic FROM mindsdb.flu_detector WHERE
         | headache=true AND temperature=37.5 AND cough='mild';
         | 
         | psql> Error: External table returned error: "Segfault"
         | 
         | OR
         | 
         | psql> SELECT diagnostic FROM mindsdb.flu_detector WHERE
         | headache=true AND temperature=37.5 AND coughsfsagsa='mild';
         | 
         | psql> Error: External table returned error: Input column
         | `coughsfsagsa` doesn't exist
         | 
         | (or something like that)
         | 
         | > Also, do you support automatic retraining?
         | 
         | Not at the moment, but we're going to add it very soon, with
         | the first implementation allowing retraining with a certain
         | user-set frequency (e.g. once every 2 hours).
         | 
         | Which will allow the model to be always fresh as new data comes
         | in (assuming there's no time limit on the query)
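
The fixed-frequency retraining described above can be sketched as a simple elapsed-time check. This is a minimal illustration under stated assumptions, not MindsDB's implementation: `is_retrain_due`, `next_retrain_time` and their arguments are hypothetical names.

```python
from datetime import datetime, timedelta


def is_retrain_due(last_trained: datetime, now: datetime,
                   frequency: timedelta) -> bool:
    """Return True once at least `frequency` has elapsed since the
    last training run."""
    return now - last_trained >= frequency


def next_retrain_time(last_trained: datetime,
                      frequency: timedelta) -> datetime:
    """Return the earliest moment at which a retrain becomes due."""
    return last_trained + frequency


# Hypothetical schedule: retrain once every 2 hours.
last_trained = datetime(2021, 2, 19, 12, 0)
every_two_hours = timedelta(hours=2)

print(is_retrain_due(last_trained, datetime(2021, 2, 19, 13, 59), every_two_hours))
print(is_retrain_due(last_trained, datetime(2021, 2, 19, 14, 0), every_two_hours))
print(next_retrain_time(last_trained, every_two_hours))
```

A scheduler would call the check periodically and rerun the training query whenever it returns true.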
        
           | streetcat1 wrote:
           | Wow. Thanks for the answer and for the paper! I myself
           | implemented this: https://arxiv.org/pdf/1810.05934.pdf in Go.
           | 
           | The issue with retraining is that you need new labels
           | (assuming supervised ML), so I wonder what process you use to
           | get those.
        
         | torrmal wrote:
         | These are amazing questions, Streetcat. We do some
         | hyperparameter search using Optuna; we may be moving to Ray
         | Tune because it can be highly parallelized. If MindsDB fails,
         | it depends on how the various DBs manage federated storage, but
         | essentially you will get a query error. Funny that you mention
         | automatic retraining: people have been asking for this
         | recently, and we will be supporting a retrain_frequency
         | parameter in the coming releases. Would you like to give it a
         | test drive?
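
For readers unfamiliar with hyperparameter search, here is a minimal sketch of the simplest strategy mentioned in the question: plain random search. It is not what Optuna or Ax do internally (they offer more sophisticated samplers) and it is not MindsDB code. `random_search`, `toy_objective` and the search space are hypothetical.

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample each hyperparameter uniformly from its (low, high) range
    and keep the parameters with the lowest objective value."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Stand-in for a validation loss; its minimum is at
    # learning_rate=0.1, dropout=0.3.
    return (params["learning_rate"] - 0.1) ** 2 + (params["dropout"] - 0.3) ** 2


space = {"learning_rate": (0.0, 1.0), "dropout": (0.0, 0.9)}
best_params, best_score = random_search(toy_objective, space, n_trials=200, seed=42)
print(best_params, best_score)
```

Each trial is independent of the others, which is why this kind of search parallelizes easily across workers.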
        
           | streetcat1 wrote:
           | I am actually working on a product in the same area (AutoML/
           | MLOps) as a non-YC startup... We might be able to partner. How
           | can I reach you?
        
             | torrmal wrote:
             | Absolutely, let's connect!! jorge at mindsdb
        
             | adam_carrigan wrote:
             | Send us an email - Adam at MindsDB.com and Jorge at
             | MindsDB.com
        
       | davidnet wrote:
       | Huge fan since I saw them at Skydeck Berkeley. Jorge is also one
       | of the most talented engineers and managers I have ever met, and
       | an inspiration to me. Awesome to see Adam & Jorge pushing the
       | limits of ML to make ML accessible to everyone!
        
       | pplonski86 wrote:
       | What type of ML algorithms do you support? Do you have benchmarks
       | with performance?
        
         | george3d6 wrote:
         | Regarding benchmarks, we have three main dataset collections we
         | focus on currently:
         | 
         | 1. Datasets from customers, but obviously those can't be made
         | public.
         | 
         | 2. The OpenML benchmark, which is fairly limited because it's
         | mainly binary categories, but which is good because it's a 3rd
         | party, so unbiased. We have some intermediary results here
         | (https://docs.google.com/spreadsheets/d/1oAgzzDyBqgmSNC6g9CFO...),
         | they are middle-of-the-road. However I think the benchmark is
         | pretty limited, i.e. it doesn't cover most of the kinds of
         | inputs and almost none of the outputs we support
         | 
         | 3. An internal benchmark suite which currently has 59 datasets,
         | mainly focused around classification and regression tasks with
         | many inputs, timeseries problems and text. Some part of it is
         | public but opening that up is a bit difficult due to licensing
         | issues. I'm hoping that in the next year it will grow and 90%+
         | of it can be made public. We benchmark against older versions
         | of mindsdb, against hand-made models we try to adapt to the
         | task, against the state-of-the-art accuracy for the dataset (if
         | we can find it) and against a few other AutoML frameworks
         | (well, 1, but I hope to extend that list). [See this repo for
         | the ones we made public: https://github.com/mindsdb/benchmarks,
         | but I'm afraid it's a bit outdated.]
         | 
         | That being said benchmarking for us is still WIP, since as far
         | as I can tell nobody is trying to build open source models that
         | are as broad as what we're currently doing (for better or
         | worse), and the closed source services offered by various IaaS
         | providers don't really come with public benchmark results
         | outside of marketing.
        
           | cweill wrote:
           | The benchmarking challenges you are facing are pretty common
           | in the AutoML community. My colleagues and I at Google
           | Research are trying to solve this with
           | https://github.com/google/nitroml. It's still super early
           | days (no CI yet), but I think it could help your team
           | benchmark on a set of open standard benchmark tasks as we
           | open source more of the system.
        
             | george3d6 wrote:
             | Looks quite interesting, already pinned this in the
             | relevant slack channel :)
             | 
             | To be honest I'm rather happy with how the internal
             | benchmark suite is turning out, but to some extent you are
             | inviting bias by creating them yourself. On top of that, it
             | doesn't hurt to have more benchmarks.
             | 
             | At the end of the day it's a combination of:
             | 
             | * How much work it is to integrate (easy to measure)
             | * How visible it is, i.e. if we actually find something
             | interesting, will it be visible and legible to others (iffy
             | to measure; citations, stars, etc. are some indication)
             | * How useful it is to "improve" the library (hard to
             | measure, and what we aim to be good at is a moving target)
             | 
             | So realistically that's the equation I have to judge when
             | adding a new benchmark suite, and it's very annoying
             | because you'll note the most important things are the
             | hardest to measure.
             | 
             | Would you want people to integrate with this now or would
             | you rather wait a few weeks/months/years until it matures
             | more? If the former, can you give a few details regarding
             | where to start (the README is fairly barren)? If the latter,
             | please ping me (george.hosu@mindsdb.com) when you think it
             | could be ready to try.
             | 
             | Anyway, any open benchmark library is a step in the right
             | direction, thanks for working on this :)
        
               | cweill wrote:
               | Thanks for your feedback! Based off the description of
               | how you already do things, I'd say you're ahead of the
               | curve as far as rigorous model quality benchmarking goes.
               | You should absolutely hold off on using nitroml for a few
               | months until it's more mature. It's very much pre-
               | prerelease in a build-in-the-open sense. :) I'll shoot
               | you an email once it's ready for anyone to try out. When
               | the time comes, we'll have a blog post to announce it,
               | and will include proper documentation.
               | 
               | And, congrats on the launch!
        
         | torrmal wrote:
         | Also worth mentioning: MindsDB can take input columns of any of
         | the following types (numerical, categorical, text, images), and
         | it's getting pretty good at time-series problems, for which we
         | support a variety of techniques, including novel approaches to
         | sequential data such as RNNs, Transformers, CNN tiling, ...
         | Given the nature of data in databases, where there is often a
         | chronological order of transactions, we put a lot of focus on
         | offering capabilities to make the models time-aware.
        
         | adam_carrigan wrote:
         | The design is modular such that it can support anything under
         | the cover.
         | 
         | Essentially you have encoders for all of the columns, which
         | then get piped into a mixer and then into decoders to predict
         | the final output(s). These encoders and decoders can be any
         | type of ML model, but our current focus is on neural networks.
         | 
         | So e.g. if you have say a text like "A cute cat" and the number
         | 5 and your target is an image (let's assume you have a training
         | set such that the model would learn to generate one with 5 cute
         | cats) then you have:
         | 
         | 1. Text encoder generates an embedding for (cute cat) +
         | numerical encoder normalizes "5"
         | 2. A mixer (which can be e.g. an FCNN or gradient booster)
         | generates an intermediate representation.
         | 3. A decoder that is trained to generate images takes that
         | representation and generates an image.
         | 
         | _Note: above is a good illustrative example, in practice, we're
         | good with outputting dates, numerical, categories, tags and
         | time-series (i.e. predicting 20 steps ahead). We haven't put
         | much work into image/text/audio/video outputs_
         | 
         | You should be able to find more details about how we do this in
         | the docs and most of the heavy lifting happens in the lightwood
         | repo, the code for that is fairly readable I hope:
         | https://github.com/mindsdb/lightwood
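
The encoder, mixer and decoder flow described above can be sketched in a few lines. This is a toy illustration with hand-picked weights and a categorical output, not lightwood's API: every function name here is hypothetical, and a real mixer would be a trained model such as a neural network or gradient booster.

```python
def encode_number(value, low=0.0, high=10.0):
    """Numerical encoder: min-max normalize the value into [0, 1]."""
    return [(value - low) / (high - low)]


def encode_text(text, vocabulary):
    """Toy text encoder: bag-of-words counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(word)) for word in vocabulary]


def mix(encoded_columns, weights, bias):
    """Mixer: concatenate the encoded columns and apply one linear layer."""
    features = [x for column in encoded_columns for x in column]
    return sum(w * x for w, x in zip(weights, features)) + bias


def decode_category(score, labels=("no", "yes"), threshold=0.5):
    """Decoder: map the intermediate score to a categorical output."""
    return labels[1] if score >= threshold else labels[0]


vocabulary = ["cute", "cat", "dog"]
# One encoder per input column: a text column and a numerical column.
encoded = [encode_text("A cute cat", vocabulary), encode_number(5)]
# Weights are hand-picked for illustration, not learned.
score = mix(encoded, weights=[0.2, 0.2, -0.3, 0.4], bias=0.0)
prediction = decode_category(score)
print(encoded, score, prediction)
```

Keeping the three stages separate is what lets any one of them be swapped, for example a different encoder per column type, without touching the rest.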
        
       | juliantorresgo wrote:
       | Amazing launch guys! I really like it.
        
       | rodrigky wrote:
       | This is pretty cool, I've been following you guys since your
       | Skydeck demo day back in 2018.
       | 
       | Would we be able to use your tool with Tableau?
        
         | ZoranP wrote:
         | That's awesome :). Yes, you can use any BI tool, such as
         | Tableau, Power BI, or SAS BI, or any other BI tool that can
         | connect to external databases. Tutorials and examples for BI
         | tools should be published soon in our documentation.
        
       | robertlagrant wrote:
       | Is it inside the database? It looks as though it's actually in a
       | separate server, that is called by the database.
        
         | george3d6 wrote:
         | From the user perspective it's inside the database: you can run
         | mindsdb in the background, connect it to the database once, and
         | then do everything from within the database (i.e. connecting
         | with a SQL client to your database server and issuing commands
         | the same way you would query "normal" tables).
         | 
         | From a technical perspective it's a separate server that
         | communicates with the database through various mechanisms (e.g.
         | a federated engine), but it's no different from e.g. multiple
         | instances of MariaDB being abstracted by a Galera cluster into
         | something that behaves like a single database from a client
         | perspective.
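
As a rough analogy for "the model looks like part of the database to the client", the sketch below registers a Python function with SQLite so that it can be called from plain SQL, and shows a failure inside it surfacing as an ordinary query error. This is only an analogy using Python's standard `sqlite3` module. MindsDB itself runs as a separate server, and `predict_diagnostic` is a hypothetical hand-written rule, not a trained model.

```python
import sqlite3


def predict_diagnostic(headache, temperature, cough):
    """Stand-in 'model': a hand-written rule, not a trained predictor."""
    if temperature is None:
        raise ValueError("temperature is required")
    if headache and temperature >= 37.5 and cough != "none":
        return "flu"
    return "healthy"


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the same name, with 3 arguments.
conn.create_function("predict_diagnostic", 3, predict_diagnostic)

# The client only ever issues SQL; the 'model' runs behind the query.
row = conn.execute("SELECT predict_diagnostic(1, 37.5, 'mild')").fetchone()
print(row[0])

# A failure inside the function reaches the client as a query error.
error_message = None
try:
    conn.execute("SELECT predict_diagnostic(1, NULL, 'mild')").fetchone()
except sqlite3.OperationalError as exc:
    error_message = str(exc)
print(error_message)
```

The client code needs no knowledge of how the prediction is produced, which is the property the comment above describes.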
        
       | suyash wrote:
       | Oracle already provides the same ability
       | https://blogs.oracle.com/machinelearning/machine-learning-in...
        
         | [deleted]
        
         | torrmal wrote:
         | You are right!! The main thing is to offer it for other
         | databases (MySQL, MariaDB, Postgres, ClickHouse, Timescale,
         | MongoDB), as well as to support more powerful machine learning
         | capabilities than the vanilla classical models supported by
         | Oracle, for instance great time-series support.
        
         | george3d6 wrote:
         | I haven't looked into it myself, but I'll try to understand
         | what they do better, thanks for letting us know.
         | 
         | I will say that:
         | 
         | 1. It's not open source, so hard for us to compare other than
         | running black-box experiments.
         | 
         | 2. Oracle, so presumably that comes with all the Oracle-
         | ecosystem buy-ins that implies, which might not be ideal for
         | many people.
         | 
         | As a purely personal opinion:
         | 
         | I guess it's good to know that other people are thinking in the
         | same direction as us, but at the same time I personally would
         | like for widely-used ML libraries to be open-source. If these
         | models are going to be used as generators of important
         | decision-making algorithms, ideally both the model and the
         | algorithm should be open source. The latter is up to whoever is
         | building the algorithm, but I think if we can get the zeitgeist
         | to move towards the latter being open source as the norm, that
         | can alleviate a lot of potential harm and has little downside.
         | 
         | I.e. Do you feel comfortable with the NHS off-sourcing
         | important decision making to algorithms that are proprietary
         | black boxes? Considering that it's funded by the tax paying
         | public and it's supposed to service that public.
         | 
         | "Secret" laws used to be a norm in the past, e.g. in large
         | civilizations like the Roman empire, where the norm evolved to
         | be that only "schooled" men could understand the law due to
         | complexity, or in most of medieval Europe where the bible was
         | foundational for morality but closed off to a small subset of
         | the population that knew Greek or Latin and could get their
         | hands on it. But in general that seems to have caused more harm
         | than good.
         | 
         | It seems reasonable to ask that, if algorithms are going to be
         | used by governments in decision making, those should be
         | entirely open. Ideally the ones used by corporations should be
         | open to whatever degree is possible, to avoid run-off harm from
         | buggy or unaligned systems.
        
       | hodgesrm wrote:
       | Go MindsDB!! We've enjoyed working with the MindsDB team at
       | Altinity. The integration with ClickHouse makes clever use of the
       | MySQL protocol to implement models as queryable tables. For
       | anybody interested in the specifics check out the following
       | article:
       | https://altinity.com/blog/machine-learning-models-as-tables.
       | 
       | We will watch your career with great interest.
        
         | adam_carrigan wrote:
         | Thanks, it has been amazing working with you also.
        
           | hodgesrm wrote:
           | Based on the feedback here it may soon be time to do a
           | follow-up talk on MindsDB at a future ClickHouse meetup. :)
        
       | eurasiantiger wrote:
       | Support for graph databases would be cool.
        
         | adam_carrigan wrote:
         | Did you have one in mind? We add integrations based on
         | community demand.
        
       ___________________________________________________________________
       (page generated 2021-02-19 23:01 UTC)