[HN Gopher] Show HN: PostgresML, now with analytics and project ...
___________________________________________________________________
Show HN: PostgresML, now with analytics and project management
We've been hard at work for a few weeks and thought it's time for
another update. In case you missed our first post, PostgresML is
an end-to-end machine learning solution, running alongside your
favorite database. This time we have more of a suite offering:
project management, plus visibility into the datasets and into the
deployment pipeline's decision making. Let us know what you think!
Demo link
is on the page, and also here: https://demo.postgresml.org
Author : levkk
Score : 245 points
Date : 2022-05-02 17:48 UTC (5 hours ago)
(HTM) web link (postgresml.org)
(TXT) w3m dump (postgresml.org)
| jmuguy wrote:
| What affiliation does this have with PostgreSQL?
| jpgvm wrote:
| None, it's just an extension.
|
| Which is part of what is so awesome about PostgreSQL, everyone
| can build extensions that look and feel native and can do
| almost anything.
| properdine wrote:
| You might want to use a different elephant logo... the
| branding implies that it is an official part of the project.
|
| E.g. Postico uses an elephant but not the _same_ elephant -
| https://eggerapps.at/postico/
| ekzy wrote:
| I think the logo is clever, and if there's no rights issue
| it's great. The fact that it's a different website, domain,
| theme etc doesn't make it ambiguous IMO
| macspoofing wrote:
| You should review the PostgreSQL trademark policy:
| https://www.postgresql.org/about/policies/trademarks/
| jmuguy wrote:
| This logo and name, to me at least, implied some sort of
| official affiliation with the PostgreSQL project. Which I
| think should at least be clarified on the site/readme.
| jdoliner wrote:
| This is really cool, running ML workloads on top of SQL is a very
| practical way of doing ML for a lot of businesses. Many companies
| don't have the fancy ML workloads like you see at OpenAI; they
| just have a SQL database with some data that could greatly help
| their business with some simple ML models trained on it. This
| looks like a nice way to do it. A slightly different approach
| that I've been working on involves hooking data warehouses up to
| Pachyderm [0] so you can do offline training on it. Not as good
| for online stuff as this, but for longer running batch style jobs
| it works really well.
|
| [0] http://github.com/pachyderm/pachyderm
| obert wrote:
| Is it possible, or how hard is it, to plug in custom proprietary
| models?
| ekzy wrote:
| This looks awesome! I'm not an expert, but wouldn't typical
| database hardware be less than optimal for running ML? Is this
| meant to run on a replica (which is quite straightforward to
| set up) that has ML-optimised hardware?
| montanalow wrote:
| It's probably a stretch to run GPT-3 inside a db, but most of
| the "deep learning" models I've run in more traditional
| environments are a few megabytes. That's millions of params,
| but Moore's law has been generous enough to us over the decades
| that I think there is a good case to spend a few megabytes of
| DB RAM on ML. I'll know this idea has really landed, though,
| when we start hearing about Postgres deployments with GPUs on
| board :)
| noogle wrote:
| You can already do that by using the pl/python or pl/java
| extensions with the right environment. However, the interface
| between SQL and model inference is typically narrow enough
| that IMO it's better to: read from Postgres --> process in an
| external Python/Java process --> persist to Postgres.
|
| Maybe enhance this with an FDW to an external inference
| process to allow triggering inference from PostgreSQL
| itself.
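|
| For the in-database route, a minimal pl/python sketch looks
| something like this (the model path, feature layout, and
| function name below are placeholders):
|
|   CREATE EXTENSION IF NOT EXISTS plpython3u;
|
|   CREATE OR REPLACE FUNCTION predict_churn(features REAL[])
|   RETURNS REAL AS $$
|       import pickle
|       # Cache the unpickled sklearn model across calls.
|       if "model" not in SD:
|           path = "/var/lib/postgresql/churn_model.pkl"
|           with open(path, "rb") as f:
|               SD["model"] = pickle.load(f)
|       return float(SD["model"].predict([features])[0])
|   $$ LANGUAGE plpython3u;
|
|   -- SELECT predict_churn(ARRAY[0.1, 3.0, 42.0]);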
| LunaSea wrote:
| Do you plan on adding support for managed PostgreSQL services
| like RDS in the future?
| montanalow wrote:
| We can't control the extensions RDS allows to be installed, and
| they are historically conservative. Lev and I do have some
| fairly extensive experience with replication patterns to
| Postgres instances running in EC2. Foreign data wrappers are
| also an option, and depending on workload may be a good
| horizontal scaling strategy in addition.
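|
| As a rough sketch of the FDW route (host, credentials, table
| and column names here are all placeholders): run PostgresML on
| a separate instance, map the RDS table into it, and train
| against the foreign table.
|
|   CREATE EXTENSION IF NOT EXISTS postgres_fdw;
|
|   CREATE SERVER rds_server
|       FOREIGN DATA WRAPPER postgres_fdw
|       OPTIONS (host 'my-rds-host', dbname 'app', port '5432');
|
|   CREATE USER MAPPING FOR CURRENT_USER SERVER rds_server
|       OPTIONS (user 'readonly', password 'secret');
|
|   IMPORT FOREIGN SCHEMA public LIMIT TO (purchases)
|       FROM SERVER rds_server INTO public;
|
|   -- Train as usual; the relation just happens to be foreign.
|   SELECT * FROM pgml.train('Buy it Again', 'classification',
|                            'purchases', 'bought_again');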
| craigkerstiens wrote:
| I just gave it a try to install on Crunchy Bridge [1],
| disclaimer I work at Crunchy Data. I did this as a standard
| user vs. anything special about access. I got quite close,
| but looks like I'm limited by Python version, we give you
| plpython3u with Python 3.6. Is there any chance of supporting
| an earlier 3.x version? If so, I'm pretty sure we could give a
| guide on how to self-install on top of us.
|
| [1] https://www.crunchydata.com/products/crunchy-bridge/
| montanalow wrote:
| Awesome. I'll see what it'll take to get the extension
| running w/ Python 3.6. It's good to know what people's
| ecosystem dependencies look like.
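|
| If anyone wants to see which Python their plpython3u is bound
| to, a quick check is something like:
|
|   DO LANGUAGE plpython3u $$
|       import sys
|       plpy.notice(sys.version)
|   $$;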
| michelpp wrote:
| Will that also apply to plans? I'd love to see ML models come
| up with better plans, if possible.
| jorgemf wrote:
| How do you deal with different dataset train/validation/test? How
| do you measure the degradation of the model? Is there any way to
| select the metric you target (accuracy, f1-score or any other)?
| montanalow wrote:
| The data split technique is one of the optional parameters for
| the call to 'train'. Model degradation is a really interesting
| topic that is hopefully made less difficult when retraining is
| trivialized, but we also want to add deeper analytics into
| individual model predictions, as well as better model
| explanations with tools like shap. We haven't exposed custom
| performance metrics in the API yet, but we're computing a few
| right now and can add more. The next thing we build may be a
| configuration wizard to help make these decisions easy
| based on some guided data analysis.
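|
| Roughly, the split is an argument to the train call itself,
| something along these lines (the split parameter name here is a
| guess; check the README for the exact signature):
|
|   SELECT * FROM pgml.train(
|       'Buy it Again',       -- project name
|       'classification',     -- objective
|       'purchase_history',   -- training relation (table or view)
|       'bought_again',       -- label column
|       test_size => 0.25     -- hypothetical hold-out fraction
|   );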
| jorgemf wrote:
| When I was talking about metrics I meant metrics of the model
| (accuracy, precision, recall, mean squared error, etc.), not
| performance.
| waatels wrote:
| Hello, really nice!
|
| Can you explain the differences with https://madlib.apache.org/ ?
| Wouldn't an OLAP db be better suited than pg for this kind of
| workload?
|
| Does being a PostgreSQL module make it compatible with Citus,
| Greenplum or Timescale?
| montanalow wrote:
| OLAP vs OLTP will depend on your ML use case. Online
| predictions will likely be better served by OLTP, while offline
| batch predictions are better served by OLAP.
|
| OLAP use cases often involve a lot of extra complexity out of
| the gate, and something we're targeting is to help startups
| maintain the simplest possible tech stack early on while they
| are still growing and exploring PMF. At a high enough level, it
| should just work with any database that supports Postgres
| extensions, since it's all just tables going into algos, but
| the devil in big data is always in evaluating the performance
| tradeoffs for the different workloads. Maybe we'll eventually
| need an "enterprise" edition.
| gabereiser wrote:
| This is awesome. I'm guessing the models are executed on the
| database server and not a separate cluster? What about GPU
| training? How is that handled? I'd love to see more docs.
| sagaro wrote:
| I don't understand the example on the homepage. How does the
| extension know what is "buy it again"?
| sagaro wrote:
| Never mind, figured it out from GitHub. This is cool.
| montanalow wrote:
| "Buy it again" is simply a name for the PostgresML project that
| the model is being trained for.
|
| There is deeper explanation in the README:
| https://github.com/postgresml/postgresml
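|
| Roughly (table and column names here are illustrative):
|
|   -- Train a model under the project name "Buy it Again".
|   SELECT * FROM pgml.train('Buy it Again', 'classification',
|                            'purchase_history', 'bought_again');
|
|   -- Online predictions then reference the project by name.
|   SELECT pgml.predict('Buy it Again',
|            ARRAY[total_orders, days_since_last_order])
|     FROM purchase_history
|    LIMIT 10;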
| phenkdo wrote:
| Can this be used to deploy an "active learning" model that learns
| from fresh data and model auto-updates?
| montanalow wrote:
| That's exactly the target use case. Models make online
| predictions as part of Postgres queries, and can be
| periodically retrained in a cadence that makes sense for the
| particular data set. In my experience the real value of
| retraining at a fixed cadence is so that you can learn when
| your data set changes, and have fewer changes to work through
| when there is some data bug/anomaly introduced into the
| ecosystem. Models that aren't routinely retrained tend to die
| in a
| catastrophic manner when business logic changes, and their
| ingestion pipeline hasn't been updated since they were
| originally created.
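|
| For the scheduling itself, something like pg_cron works; the
| schedule and names below are illustrative:
|
|   -- Retrain the project nightly with the latest data.
|   SELECT cron.schedule(
|       'nightly-retrain',
|       '0 3 * * *',
|       $$ SELECT * FROM pgml.train('Buy it Again',
|              'classification', 'purchase_history',
|              'bought_again') $$
|   );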
| streetcat1 wrote:
| Well, you would also need new labels to retrain.
| montanalow wrote:
| Yep! Part of the power of being inside the OLTP is that you
| can just create a VIEW of your training data, which could
| be anything from customer purchases, search results,
| whatever, and that VIEW can be re-scanned every time you do
| the training run to pick up the latest data.
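|
| A sketch of that pattern (the join and label definition are
| placeholders for whatever the business logic is):
|
|   -- The view joins raw events with their eventual labels;
|   -- each retraining run re-scans it and picks up new rows.
|   CREATE OR REPLACE VIEW buy_it_again_training AS
|   SELECT p.total_orders,
|          p.days_since_last_order,
|          (r.reordered_at IS NOT NULL)::INT AS bought_again
|     FROM purchases p
|     LEFT JOIN reorders r ON r.purchase_id = p.id;
|
|   SELECT * FROM pgml.train('Buy it Again', 'classification',
|            'buy_it_again_training', 'bought_again');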
| kipukun wrote:
| Neat project. Any roadmap for cross-validation support
| (GridSearchCV and friends)?
| montanalow wrote:
| Yep! Traditional hyperparameter tuning techniques and broader
| automated surveys across multiple algos are both on the roadmap
| for "soon".
| debarshri wrote:
| Very cool! Will probably use it soon.
| montanalow wrote:
| Thanks! We'd still consider this an early stage project and
| would love your feedback for which features to prioritize. Our
| roadmap is only getting longer...
| simonw wrote:
| This looks amazing!
|
| The animated GIF on your homepage moves a little bit too fast for
| me to follow.
| bguberfain wrote:
| Can we offload model training to a different server? Can it be
| parallelized? Anyway, nice API and a promising project.
| montanalow wrote:
| This can be done "manually" by configuring Postgres replication
| and/or foreign data wrappers. We don't have a magic button for
| that, but if we have a few examples in the wild we can
| establish best practices and then put those into code. I say
| this with some optimism that we may be able to see more
| targeted, ML-specific scalability use cases that can be solved
| more completely than general database scalability.
| zmmmmm wrote:
| Seems like a great idea. When you look at many ML frameworks half
| the code and learning overhead is data-schlepping code and table-
| like structures that "reinvent" the schema that already exists
| inside a database. Not to mention, there can be security concerns
| from dumping large amounts of data out of the primary store (how
| are you going to GDPR delete that stuff later on?). So why not
| use it natively where the data already is?
|
| For anything substantive it seems like a bad idea to run this on
| your primary store since the last thing you want to do is eat up
| precious CPU and RAM needed by your OLTP database. But in a data
| warehouse or similar replicated setup, it seems like a really
| neat idea.
| jiocrag wrote:
| Yeah -- seems like all you need is Snowflake-esque separation
| of storage and compute and bob's your uncle.
| ekzhu wrote:
| Great idea! I see this is implemented using the Python language
| interface supported by PostgreSQL and importing sklearn models. I
| always wonder how scalable this is considering the serialization-
| deserialization overhead between Postgres' core and Python. Do
| you see any significant performance difference between this and
| training the sklearn models directly on something like
| DataFrames?
___________________________________________________________________
(page generated 2022-05-02 23:00 UTC)