[HN Gopher] Building a recommendation engine inside Postgres wit...
___________________________________________________________________
Building a recommendation engine inside Postgres with Python and
Pandas (2020)
Author : rbanffy
Score : 44 points
Date : 2021-10-26 20:30 UTC (2 hours ago)
(HTM) web link (blog.crunchydata.com)
(TXT) w3m dump (blog.crunchydata.com)
| wallace01 wrote:
| Enjoyed that read, but to be honest I've always had mixed feelings
| about PSQL extensions. Logic in a database is wrong.
| CameronNemo wrote:
| Can you expand on why you think it is wrong or problematic?
| FridgeSeal wrote:
| Data and its structure outlives the developer, and often the
| application.
|
| Tightly coupling a lot of the application-specific compute in
| with how the data is stored and accessed sets you up for even
| more difficulties when you need to debug, scale, migrate
| storage/compute or evolve your application faster or more
| radically than your data organisation.
| systemvoltage wrote:
| The database is usually the last thing that scales, so if you put
| a bunch of computational load on it besides queries, you've
| set yourself up for scaling/sharding sooner rather than later.
| lowwave wrote:
| Not always! If the computation involves math over a set of
| records, then Postgres is great for that. Having the
| operation inside the db reduces connection pooling at the
| application level.
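
The point about math over a set of records can be sketched roughly as below. This is only an illustration under loud assumptions: SQLite (from the Python standard library) stands in for Postgres so the snippet runs without a server, and the `purchases` table, its columns, and both helper functions are hypothetical names, not anything from the article or the thread.

```python
import sqlite3

# ASSUMPTION: SQLite is a stand-in for Postgres so this runs without a server.
# The table, columns, and function names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5), (2, 2.5), (3, 1.0)],
)


def totals_in_app(conn):
    """Pull every row over the connection and aggregate in Python."""
    totals = {}
    query = "SELECT customer_id, amount FROM purchases"
    for customer_id, amount in conn.execute(query):
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return totals


def totals_in_db(conn):
    """Let the database aggregate; only one row per customer comes back."""
    query = "SELECT customer_id, SUM(amount) FROM purchases GROUP BY customer_id"
    return dict(conn.execute(query))


app_totals = totals_in_app(conn)
db_totals = totals_in_db(conn)
```

Both functions produce the same totals; the difference is how many rows cross the connection. Whether that trade is worth the extra load on the database is exactly what the surrounding comments disagree about.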
| arthurcolle wrote:
| Do what thou wilt shall be the whole of the Law
| [deleted]
| snissn wrote:
| It's so amazing to have the ability to enhance your database
| with custom methods. It keeps your data model really well
| organized across your infrastructure.
| systemvoltage wrote:
| That's the job of an API IMO.
| earthscienceman wrote:
| It's interesting to me to see pandas used in this application.
| I'd be curious to see a more fully featured implementation.
|
| I'm a scientist by profession and I've been working on building
| out several different generalized data processing pipelines for
| some specific problems in my sub-field, to make gathering and
| formatting in-situ data easier and more
| standardized/open/version-controlled. It's going great, worlds
| better than the smattering of matlab code strewn across the hard
| drives in the lab written in a non-collaborative manner and
| shared by email...
|
| ... but. I'll admit, I've run into a _lot_ of footguns in the
| pandas API in terms of efficiency. You'll do something in what
| seems to be the logical way, or in a way that the API funnels you
| towards (like the groupby calls in the OP), and you'll quickly
| realize that if you're working on large-ish tables (>10 GB in
| memory) that it was the stupid way to do things. In terms of
| readable code to share with colleagues, pandas can't be beat, but
| things get wonky when you reach significant complexity and I
| would be surprised if it made any sense to use in a 'real'
| recommendation engine when considering developer productivity.
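
One commonly cited groupby footgun of this kind, offered as an assumption about what the comment may mean rather than the specific case the commenter hit, is passing a Python function to `groupby(...).apply` where a built-in aggregation would do. A minimal sketch with made-up data (the `user_id` and `amount` columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows spread over at most 100 users.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "user_id": rng.integers(0, 100, size=10_000),
    "amount": rng.random(10_000),
})

# Reads naturally, but calls a Python function once per group.
slow = df.groupby("user_id")["amount"].apply(lambda s: s.sum())

# Same per-user totals through the built-in aggregation, which pandas
# runs in compiled code instead of a per-group Python call.
fast = df.groupby("user_id")["amount"].sum()
```

The two results match up to floating-point rounding. On a table this small the difference hardly matters; the per-group Python call is the sort of cost that tends to show up only once tables get large.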
| mooneater wrote:
| Seems like a nice approach, I used to try things like this with
| postgres (embedding sklearn).
|
| Though it can be slow, harder to deal with memory usage issues,
| harder to debug in general, harder to extend/generalise.
|
| And for me postgres is now one of multiple datastores so doing
| this is not as helpful.
___________________________________________________________________
(page generated 2021-10-26 23:00 UTC)