[HN Gopher] Show HN: PostgresML, now with analytics and project ...
       ___________________________________________________________________
        
       Show HN: PostgresML, now with analytics and project management
        
       We've been hard at work for a few weeks and thought it was time
       for another update.  In case you missed our first post,
       PostgresML is an end-to-end machine learning solution that runs
       alongside your favorite database.  This time we have more of a
       suite offering: project management, visibility into datasets,
       and insight into deployment pipeline decision making.  Let us
       know what you think!  The demo link is on the page, and also
       here: https://demo.postgresml.org
        
       Author : levkk
       Score  : 245 points
       Date   : 2022-05-02 17:48 UTC (5 hours ago)
        
 (HTM) web link (postgresml.org)
 (TXT) w3m dump (postgresml.org)
        
       | jmuguy wrote:
       | What affiliation does this have with PostgreSQL?
        
         | jpgvm wrote:
         | None, it's just an extension.
         | 
         | Which is part of what is so awesome about PostgreSQL, everyone
         | can build extensions that look and feel native and can do
         | almost anything.
        
           | properdine wrote:
           | You might want to use a different elephant logo... the
           | branding implies that it is an official part of the project.
           | 
           | E.g. Postico uses an elephant but not the _same_ elephant -
           | https://eggerapps.at/postico/
        
             | ekzy wrote:
             | I think the logo is clever, and if there's no rights issue
             | it's great. The fact that it's a different website, domain,
             | theme etc doesn't make it ambiguous IMO
        
           | macspoofing wrote:
           | You should review the PostgreSQL trademark policy:
           | https://www.postgresql.org/about/policies/trademarks/
        
           | jmuguy wrote:
           | This logo and name, to me at least, implied some sort of
           | official affiliation with the PostgreSQL project. Which I
           | think should at least be clarified on the site/readme.
        
       | jdoliner wrote:
       | This is really cool, running ML workloads on top of SQL is a very
       | practical way of doing ML for a lot of businesses. Many companies
       | don't have the fancy ML workloads like you see at OpenAI, they
       | just have a SQL database with some data that could greatly help
       | their business with some simple ML models trained on it. This
       | looks like a nice way to do it. A slightly different approach
       | that I've been working on involves hooking data warehouses up to
       | Pachyderm [0] so you can do offline training on it. Not as good
       | for online stuff as this, but for longer running batch style jobs
       | it works really well.
       | 
       | [0] http://github.com/pachyderm/pachyderm
        
       | obert wrote:
       | is it possible, or how hard is it, to plug in custom proprietary
       | models?
        
       | ekzy wrote:
       | This looks awesome! I'm not an expert, but wouldn't typical
       | database hardware be suboptimal for running ML? Is this meant
       | to run on a replica (which is quite straightforward to set up)
       | that has ML-optimised hardware?
        
         | montanalow wrote:
         | It's probably a stretch to run GPT-3 inside a db, but most of
         | the "deep learning" models I've run in more traditional
         | environments are a few megabytes. That's millions of params,
         | but Moore's law has been generous enough to us over the
         | decades that I think there is a good case to spend a few
         | megabytes of DB RAM on ML. I'll know this idea has really
         | landed when we start hearing about Postgres deployments with
         | GPUs on board :)
        
           | noogle wrote:
           | You can already do that by using the pl/python or pl/java
           | extensions with the right environment. However, the interface
           | between SQL and model inference is typically narrow enough
           | that IMO it's better to: read from Postgres --> process in an
           | external Python/Java process --> persist to Postgres.
           | 
           | Maybe enhance this with a FDW to an external inference
           | process to allow triggering of inference from Postgresql
           | itself.
        
       | LunaSea wrote:
       | Do you plan on adding support for managed PostgreSQL services
       | like RDS in the future?
        
         | montanalow wrote:
         | We can't control the extensions RDS allows to be installed, and
         | they are historically conservative. Lev and I do have some
         | fairly extensive experience with replication patterns to
         | Postgres instances running in EC2. Foreign data wrappers are
         | also an option, and depending on workload may be a good
         | horizontal scaling strategy in addition.
        
           | craigkerstiens wrote:
           | I just gave it a try to install on Crunchy Bridge [1],
           | disclaimer I work at Crunchy Data. I did this as a standard
           | user vs. anything special about access. I got quite
           | close, but it looks like I'm limited by the Python
           | version: we give you plpython3u with Python 3.6. Is there
           | any chance of supporting an earlier 3.x version? If so,
           | I'm pretty sure we could put together a guide on how to
           | self-install on top of us.
           | 
           | [1] https://www.crunchydata.com/products/crunchy-bridge/
        
             | montanalow wrote:
             | Awesome. I'll see what it'll take to get the extension
             | running w/ Python 3.6. It's good to know what people's
             | ecosystem dependencies look like.
        
           | michelpp wrote:
           | Will that also apply to plans? I'd love to see ML models come
           | up with better plans, if possible.
        
       | jorgemf wrote:
       | How do you deal with different dataset train/validation/test? How
       | do you measure the degradation of the model? Is there any way to
       | select the metric you target (accuracy, f1-score or any other)?
        
         | montanalow wrote:
         | The data split technique is one of the optional parameters for
         | the call to 'train'. Model degradation is a really
         | interesting topic that hopefully becomes less difficult when
         | retraining is trivialized, but we also want to add deeper
         | analytics into individual model predictions, as well as
         | better model explanations with tools like SHAP. We haven't
         | exposed custom performance metrics in the API yet, but we're
         | computing a few right now and can add more. The next thing
         | we build may be a configuration wizard to help make these
         | decisions easy based on some guided data analysis.
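As a rough illustration of what a data split parameter to 'train' might control (the `test_size` name and its behavior here are assumptions for illustration, not PostgresML's documented API), a plain shuffle-and-split looks like:

```python
import random

def train_test_split(rows, test_size=0.25, seed=42):
    """Shuffle and split rows the way a hypothetical `test_size`
    knob on a train() call might: hold out that fraction of rows
    for evaluation and train on the rest."""
    rng = random.Random(seed)  # fixed seed for reproducible splits
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

train, test = train_test_split(list(range(100)), test_size=0.2)
print(len(train), len(test))  # → 80 20
```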
        
           | jorgemf wrote:
           | When I was talking about metrics, I meant metrics of the
           | model (accuracy, precision, recall, mean squared error,
           | etc.), not runtime performance.
        
       | waatels wrote:
       | Hello, really nice!
       | 
       | Can you explain the differences with https://madlib.apache.org/ ?
       | Wouldn't an OLAP DB be better suited than pg for this kind of
       | workload?
       | 
       | Does being a PostgreSQL module make it compatible with Citus,
       | Greenplum or Timescale?
        
         | montanalow wrote:
         | OLAP vs OLTP will depend on your ML use case. Online
         | predictions will likely be better served by an OLTP vs offline
         | batch predictions being better served by OLAP.
         | 
         | OLAP use cases often involve a lot of extra complexity out of
         | the gate, and something we're targeting is to help startups
         | maintain the simplest possible tech stack early on while they
         | are still growing and exploring PMF. At a high enough level, it
         | should just work with any database that supports Postgres
         | extensions, since it's all just tables going into algos, but
         | the devil in big data is always in evaluating the performance
         | tradeoffs for the different workloads. Maybe we'll eventually
         | need an "enterprise" edition.
        
       | gabereiser wrote:
       | This is awesome. I'm guessing the models are executed on the
       | database server and not a separate cluster? What about GPU
       | training? How is that handled? I'd love to see more docs.
        
       | sagaro wrote:
       | I don't understand the example on the homepage. How does the
       | extension know what is "buy it again"?
        
         | sagaro wrote:
         | never mind. figured it out from github. this is cool.
        
         | montanalow wrote:
         | "Buy it again" is simply a name for the PostgresML project that
         | the model is being trained for.
         | 
         | There is deeper explanation in the README:
         | https://github.com/postgresml/postgresml
        
       | phenkdo wrote:
       | Can this be used to deploy an "active learning" model that learns
       | from fresh data and model auto-updates?
        
         | montanalow wrote:
         | That's exactly the target use case. Models make online
         | predictions as part of Postgres queries, and can be
         | periodically retrained in a cadence that makes sense for the
         | particular data set. In my experience the real value of
         | retraining at a fixed cadence is so that you can learn when
         | your data set changes, and have fewer changes to work through
         | when there is some data bug/anomaly introduced into the
         | ecosystem. Models that aren't routinely retrained tend to
         | die catastrophically when business logic changes and their
         | ingestion pipeline hasn't been updated since they were
         | originally created.
        
           | streetcat1 wrote:
           | Well, you would also need new labels to retrain.
        
             | montanalow wrote:
             | Yep! Part of the power of being inside the OLTP is that you
             | can just create a VIEW of your training data, which could
             | be anything from customer purchases, search results,
             | whatever, and that VIEW can be re-scanned on every
             | training run to pick up the latest data.
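A small sketch of that VIEW pattern, using sqlite3 (which also supports views) as a stand-in for Postgres; the `purchases` table and `training_data` view names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, product_id INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", [(1, 10), (2, 20)])

# Define the training data once, as a VIEW over the live table.
conn.execute("""
    CREATE VIEW training_data AS
    SELECT customer_id, product_id FROM purchases
""")

first_run = conn.execute("SELECT COUNT(*) FROM training_data").fetchone()[0]

# New labels arrive between training runs...
conn.execute("INSERT INTO purchases VALUES (3, 30)")

# ...and the next scan of the same VIEW picks them up automatically.
second_run = conn.execute("SELECT COUNT(*) FROM training_data").fetchone()[0]
print(first_run, second_run)  # → 2 3
```

Because the view is just a saved query, no ingestion pipeline has to be updated when retraining: each run sees whatever the underlying tables currently hold.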
        
       | kipukun wrote:
       | Neat project. Any roadmap for cross-validation support
       | (GridSearchCV and friends)?
        
         | montanalow wrote:
         | Yep! Traditional hyperparameter tuning techniques and
         | automated broader surveys across multiple algos are both on
         | the roadmap for "soon".
        
       | debarshri wrote:
       | Very cool! Will probably use it soon.
        
         | montanalow wrote:
         | Thanks! We'd still consider this an early stage project and
         | would love your feedback for which features to prioritize. Our
         | roadmap is only getting longer...
        
       | simonw wrote:
       | This looks amazing!
       | 
       | The animated GIF on your homepage moves a little bit too fast for
       | me to follow.
        
       | bguberfain wrote:
       | Can we offload model training to a different server? Can it be
       | parallelized? Anyway, nice API and a promising project.
        
         | montanalow wrote:
         | This can be done "manually" by configuring Postgres replication
         | and/or foreign data wrappers. We don't have a magic button for
         | that, but if we have a few examples in the wild we can
         | establish best practices and then put those into code. I say
         | this with some optimism that we may be able to see more
         | targeted ML specific scalability use cases that can be solved
         | more completely than general database scalability.
        
       | zmmmmm wrote:
       | Seems like a great idea. When you look at many ML frameworks half
       | the code and learning overhead is data schlepping code and table
       | like structures that "reinvent" the schema that already exists
       | inside a database. Not to mention, there can be security concerns
       | from dumping large amounts of data out of the primary store (how
       | are you going to GDPR delete that stuff later on?). So why not
       | use it natively where the data already is?
       | 
       | For anything substantive it seems like a bad idea to run this on
       | your primary store since the last thing you want to do is eat up
       | precious CPU and RAM needed by your OLTP database. But in a data
       | warehouse or similar replicated setup, it seems like a really
       | neat idea.
        
         | jiocrag wrote:
         | Yeah -- seems like all you need is Snowflake-esque separation
         | of storage and compute and bob's your uncle.
        
       | ekzhu wrote:
       | Great idea! I see this is implemented using the Python language
       | interface supported by PostgreSQL and importing sklearn models. I
       | always wonder how scalable this is, considering the
       | serialization/deserialization overhead between Postgres' core
       | and Python. Do you see any significant performance difference
       | between this and training the sklearn models directly on
       | something like DataFrames?
        
       ___________________________________________________________________
       (page generated 2022-05-02 23:00 UTC)