[HN Gopher] Operating on a minimal two-core Postgres instance: Q...
___________________________________________________________________
Operating on a minimal two-core Postgres instance: Query
optimization insights
Author : rob742
Score : 148 points
Date : 2023-11-15 14:08 UTC (1 days ago)
(HTM) web link (robertrode.com)
(TXT) w3m dump (robertrode.com)
| KingOfCoders wrote:
| This gives me hope, my current idea I work on is osal planned for
| a two-core Postgres cloud instance from Hetzner.
| znpy wrote:
| > This gives me hope, my current idea I work on is osal planned
| for a two-core Postgres cloud instance from Hetzner.
|
| Keep in mind that years ago having two cpus (single core cpus,
| with incredibly worse performance) was the pinnacle of
| technology, and still people managed to handle a decent amount
| of data.
| sgarland wrote:
| Some of the points mentioned are sensible, but others are
| horrible (shift joins to the application) or simply nonsensical,
| like this:
|
| > If suboptimal methods (Nested Loop, Hash Join, Merge Join,
| etc.) are chosen for joins, this can lead to unnecessarily high
| resource consumption.
|
| Pray tell, what join strategies are left to use? Also, none of
| those are suboptimal, given the right schema and settings. Nested
| loop is fine, provided one side of the join is small.
|
| What isn't mentioned (perhaps in a future post it will be) is how
| table and column statistics can affect performance. That's how
| you get nested loops when a merge join would actually be more
| performant.
| OJFord wrote:
| Maybe it means (trying to read charitably) that those are the
| methods, and if a suboptimal one of them (for a given query,
| schema, etc.) is chosen, then ...
|
| Or it's just a continuation of the first point: any non-
| application join is suboptimal. (Erm..)
|
| Edit: in fact, re-reading TFA, I think it is the latter or
| something like it - because the shift to application logic is
| mentioned as a 'countermeasure' for the 'suboptimal join
| methods'. I think it means deal with sometimes getting a bad
| plan by taking control and doing it yourself in the
| application. (I don't agree, just trying to understand.)
| robertlagrant wrote:
| If you're joining in the application, why use a relational
| database?
| horse_dung wrote:
| This is so true. If you just want the data then use a
| different store (e.g. NoSQL) and everything will be quicker,
| lighter weight and slower to develop...
| paulddraper wrote:
| What does "lighter weight" mean?
|
| Smaller install size? Smaller memory footprint? Schemaless?
| toolz wrote:
| Just because you're joining in the application for one code
| path doesn't mean all of the code paths are doing the same
| thing. There are certainly valid use-cases for joining in the
| application because application layer can often times be
| horizontally scaled more cheaply/easily than the persistence
| layer.
|
| So in a scenario where you have plenty of network IO being
| under utilized for your database, it's not unimaginable where
| joining in the application layer would be preferable.
| pard68 wrote:
| FWIW the article never says to join at the app level. GP read
| wrong or something. The article says to shift logic to the
| application.
| robertlagrant wrote:
| Ah - fair enough. I knew I should've checked the article
| first before replying!
| OJFord wrote:
| This may start to depend on exactly what you're willing to
| call relational, but because you still want the schema &
| foreign key constraints for example enforced in the database,
| you're just going to do the actual mechanics of joining the
| data in the application?
| join_users_and_widgets( run_sql('select * from
| users where account_id = :1', req.account_id),
| run_sql('select * from widgets'), )
|
| for example, where the db still enforces that
| `widgets.owner_id` is a fk to `users.id` say. And
| `join_users_and_widgets` only has to know the current version
| of the schema, all rows have the same columns as it were.
| paulddraper wrote:
| It just means PostgreSQL can choose the wrong one for a
| specific instance, not that those are universally wrong.
|
| "PostgreSQL chosing a suboptimal join methods (a 'join method'
| is Nested Join, Hash Join, Merge Join, etc., each of which is
| optimized for a specific situation) leads to unnecessarily high
| resource consumption.
|
| "[And unlike virtually every other DBMS, PostgreSQL doesn't
| support specifying which method is used for a particular join.
| So even if it happen relatively infrequently, it's very
| difficult to correct a situation where it does choose the wrong
| one.]"
| sgarland wrote:
| > PostgreSQL doesn't support specifying which method is used
| for a particular join
|
| Not directly, no. You can make some of them possible /
| impossible depending on your indexing and server settings, as
| well as the query itself. For example, if the smaller side of
| the join can't fix into `work_mem` - which defaults to 4 MiB
| - you won't get a hash join. You can either create smaller
| tables, more restrictive join conditions, or increase the
| size of `work_mem` (but be careful not to blow up the DB,
| since that's per connection).
| paulddraper wrote:
| And you can use the enable_* sledgehammers as well.
| matthewaveryusa wrote:
| Sometimes it's better to do things in the application instead
| of caking it in an sql query.
|
| An example would be a subquery that yields a large temp table,
| only to be pruned down to a single row.
|
| Your choice is to either have understanding of the query
| optimizer and formulate a query that forces the subquery with a
| smaller temp table first, or you can bring it back to the
| application. Either approach works, but in once case you need
| to have deep-ish understanding of the optimizer and SQL, and in
| the other not so much (at the cost of an RTT.)
| sgarland wrote:
| > but in once case you need to have deep-ish understanding of
| the optimizer and SQL
|
| I will never understand why devs think they need to
| understand their chosen application well, but not SQL. Your
| application relies on SQL, so why shouldn't you deeply
| understand it? Relational algebra isn't that complex compared
| to other things we take for granted as necessary.
| whynotmaybe wrote:
| Bringing more data to the app so that the app filters it
| instead of the DB server is always wrong.
|
| An app with such solution is a sign of mutliple issues that
| either the DB schema is wrong, or that the devs don't know
| enough Sql, or both.
|
| Sql has been there for decades, much less longer than many of
| the actual dev framework. PostgreSQL (and other DBMS) have
| also been there for decades, they have perfected the art of
| fetching data.
| jon-wood wrote:
| Nothing is ever "always wrong". It may be a sign that a
| little more thought can lead to a more
| maintainable/quick/efficient solution, but it may also be
| that under the circumstances an approach that would be the
| wrong one if applied widely actually is the correct
| tradeoff between those factors in this case.
| jmull wrote:
| The article says "Shift _logic_ to the application level to
| reduce database load. "
|
| (Emphasis mine.)
|
| It may have been edited since you commented (perhaps because of
| your comment), but it seems like they aren't actually
| advocating to do joins at the application level.
| 7thaccount wrote:
| I worked at a company with an extremely complex optimization
| application with more business rules than any human being
| could ever comprehend. The logic has to go somewhere. Either
| the database or the interface or the application itself I
| guess. There are pro/cons to all. Some of the logic is hard
| to follow in SQL compared to a general purpose language. The
| vendor for us of course did some kind of blend where a lot of
| SQL packages do certain validations and so on. The
| optimization problem reads a bunch of .csv files from the
| database and then has its own rules it uses to build the math
| problem for the mathematical solver.
| sgarland wrote:
| I could've sworn it said joins, but I'm not positive.
|
| Tbf this is what Prisma does, albeit it joins at its own
| layer, then passes results back to the caller. Still
| horrifying.
| dboreham wrote:
| Just to push against the crowd here pooh poohing client-side
| joins: this is a standard technique for scaling when you are
| stuck (for the time being) with a RDB and is definitely not in
| general "horrible". It does _not_ mean pulling entire tables to
| the client then joining there. It means splitting a query that
| in effect walks down a chain of pointers (FKs) between tables
| into separate client-initiated queries. There are several
| benefits to this: 1. makes the design "nosql-ready" for when
| you design out the RDB. 2. Makes all DB queries inherently
| indexed and quick, allowing easy perf analysis (any non-quick
| query must be a bug) and alleviating connection pool starvation
| issues. 3. Prevents pesky devs from adding expensive queries
| under the radar. 4. allows more ready deployment of caching
| solutions. 5. recognizes that client resources scale with the
| number of clients, and are paid for by the client while RDB
| resources generally scale with difficulty and $$$, and are paid
| for by you.
| JCharante wrote:
| > Shift logic to the application level to reduce database load.
|
| I'm not convinced by this. What type of logic are we talking
| about?
| leetrout wrote:
| Some companies do large amounts of stored procedures for even
| routine tasks.
| JCharante wrote:
| Ah I see. I thought it meant taking away some merge logic out
| of the database layer.
| leetrout wrote:
| Oh I think they mean that, too. I am just trying to be
| charitable by calling out times where it can be a lift to
| yank something out of the DB.
|
| This application side join stuff was real popular about 15
| years ago with Mongo and similar stores that didnt support
| joins.
|
| It is one of the many things that separated RethinkDB from
| the document db crowd at that time.
| northbymidwest wrote:
| This is a common pattern with NoSQL databases, but really, the
| author of this article should just spend more than $10 a month
| on their Postgres instance.
| wongarsu wrote:
| How to operate on a tiny database server: shift all work from
| the database server to the application servers.
|
| See also: webscale databases
| nijave wrote:
| The last place I worked had some JSON columns and doing lots of
| JSON operations tended to eat up all the Postgres CPU (the
| servers were generally 16-32 cores)
| ddorian43 wrote:
| Native columns are better/faster.
| ahoka wrote:
| Not everything fits into those.
| sgarland wrote:
| Not with that attitude.
| ttfkam wrote:
| We had that too. Partial expression indexes mitigated most of
| the performance problems and CPU load.
|
| We then created consistently-named functions that made the
| appropriate JSON query. Then an expression index on the
| function. If the key was missing (NULL result), it wasn't
| added to the index, and we stripped all JSON nulls before
| storing.
|
| It wasn't necessarily pretty, but neither was the client's
| source data, which we had little to no control over. Best of
| a bad situation.
|
| Jsonb columns required more steps afforded us more
| flexibility and timeliness when the client kept asking for
| "one more feature" at the last minute.
| winrid wrote:
| Easier to scale application CPU usage horizontally than a
| single instance DB.
| WJW wrote:
| Even easier (and much less error prone when it comes to ACID
| properties) would be to scale the single instance DB
| vertically instead.
| dkjaudyeqooe wrote:
| This makes no sense and is typically the opposite of what I'd
| build. The more business logic (various validation rules and
| processing) you have in the database (via stored procedures)
| the less problems you have, not least because all that
| processing becomes ACID.
|
| The reducing load on the database claim is nonsense since the
| application would have to query the database anyway and less
| efficiently that the database could, avoiding numerous round
| trips between application and database.
|
| If you implement business logic in front end applications,
| you're asking for trouble. Many places I've worked didn't allow
| software outside the database to do anything more than call a
| stored procedure.
|
| Application servers did come into fashion at one point, with
| predictable results: subtle timing bugs, inconsistent
| processing with different results depending on which
| application server processed the data, a general difficulty in
| finding the source of bugs, security problems.
|
| Everyone would be a lot less stressed if they did everything in
| the database.
| count wrote:
| Oracle, DB2 or SQLServer?
|
| >The reducing load on the database claim is nonsense since
| the application would have to query the database anyway and
| less efficiently that the database could, avoiding numerous
| round trips between application and database.
|
| Ah, but by shifting the logic to the app, I can now query
| multiple databases (of disparate types even! vector,
| document, object, relational, etc.), each optimized for its
| particular data and query types and combine the result.
| That's very....not easy in stored procedures. I can also
| scale out different tiers of the application based on its
| needs (heavy compute, gpu-based, etc.). How do you do that
| when everything is a stored proc on a database?
| dkjaudyeqooe wrote:
| Well depends on the requirements, does it? I'm not
| suggesting you try to stick all your processing of all
| things inside a database, I'm saying put all critical
| processing of data stored in said database within it. Put
| all related data in the same database where is makes sense.
| Whatever data doesn't belong in that database can be merely
| referenced from it.
|
| > I can now query multiple databases
|
| This is another thing I disagree with. Wherever possible
| related data should be centralised to improve ACID
| properties and to create a reliable single "source of
| truth". I know this is unfashionable. with everyone
| claiming they need massive scale (which does cause some
| issues with this design), but much like microservices, it
| ends up being mostly busy work for programmers and related
| professions.
|
| Having said that, multiple databases, if largely orthogonal
| to each other present no real problems, and the same design
| concept is applied to each.
| bvirb wrote:
| FWIW I think I agree with you if you consider "critical
| processing of data" data validations (IME via
| constraints, checks, triggers, etc...). I think where
| things can really go wrong is when the DB accepts
| anything the application layer thinks is valid.
|
| My favorite ORM says the same thing: https://sequel.jerem
| yevans.net/rdoc/files/doc/validations_rd...
|
| I wouldn't be surprised if more people agree with that as
| well, maybe just using different terms.
| dkjaudyeqooe wrote:
| > I think where things can really go wrong is when the DB
| accepts anything the application layer thinks is valid.
|
| No, my view is the opposite of that: the database doesn't
| allow anything invalid to enter the database.
| Galanwe wrote:
| That is the exact opposite of my experience. I try very hard
| to have the database as dumb as possible.
|
| - Having mixed workloads on the database servers make it
| almost impossible to profile and tune. By definition, a
| database answers a lot of queries from a lot of clients. If
| these queries are hybrid data fetching / compute, then it's
| pretty much impossible to make sense of any system metrics.
|
| - Having hybrid workloads usually lead to considering DB
| servers as black boxes (because you cannot make sense of
| their resource usage easily). This in turn leads to having
| hyper pumped up DB servers with both crazy RAM, crazy CPU,
| crazy caches, crazy SSDs and crazy HDDs.
|
| - DB vendors of course understand the previous point, thus
| why Oracle switched to a "pay per core on the machine"
| model...
|
| - The DB logic layer ecosystem is most of the time subpar
| compared to a more traditional programming language. If you
| depend on your DB server performing a lot of work, you will
| want to make these awesome queries accessible. That is, you
| will want stored procs, extensions, etc. Dealing with those
| is a huge pain to test, version, update...
|
| - Overall all RDBMS have special SQL extensions all over the
| place. You're probably using tons of those without even
| knowing it. With time, vendor locking will become an immense
| burden and you'll be forced to pay millions to Oracle to keep
| your business running.
| el_nahual wrote:
| The problem with having business logic in the DB is not a
| _performance_ problem, it 's a maintainability, testing, and
| documentation problem.
| dkjaudyeqooe wrote:
| You've demonstrated the real issue here, but not the one
| you claim, it's that programmers have no idea how to use
| (SQL) databases or how they work.
|
| The idea that software has those negative characteristics
| merely because it resides within a database, is laughable.
| setr wrote:
| All of the PLSQL dialects are pretty much universally
| mediocre at best. They don't (de-)compose well, they
| don't version-control well, they don't document well,
| they don't debug well -- they pretty much lack any of the
| learnings and tooling that occurred in application
| languages over the last 20 years. You're basically
| dealing with all of the problems of a poorly designed DSL
| made by your favorite coworker.
|
| You could do something more sane like running lang-
| extensions on the db so you have access to a "real"
| language to specify your business logic in, but then you
| haven't really changed anything except to run your app
| logic on the DB _server_ and doesn't really have anything
| to do with the RDBMS at that point.
|
| Moving business logic into the schema/queries can be
| fine, but you still have the issues that you can't
| document it particularly well, can't version control it
| well, can't test it well, can't trust your performance
| tuning well, etc. Which all derives from DBMS expecting
| to be treated as "living systems", and has nothing to do
| with knowing/not knowing SQL.
|
| And the problem with modifying living systems is that you
| really don't want to be fucking around with it because
| anything could potentially send it reeling (see:
| surgeons). You _really_ want it to be stable... so moving
| your unstable business logic out is almost a necessity
| nicoburns wrote:
| A lot of the stuff I see people writing application code
| for doesn't require any more complex SQL features than
| joins, aggregations and filters. i.e the stuff plain old
| SQL is really good at.
| setr wrote:
| That's just born of either psychosis or naivety; but even
| when one does know SQL, there's more to consider. RDBMS's
| are awkward systems to work with, and SQL knowledge isn't
| the issue
|
| moving business logic into schema/queries is usually
| fine, (and perhaps the only sane place for the DB to
| directly interact with business logic) though schema
| changes has all the issues described. Queries get to live
| with your application code, and so can be treated sanely,
| but you only really get to a good setup with a decent
| query builder library (ORM-haters smashing strings
| together like an ape is naturally full of issues, and
| ORM-lovers get all the fun of table/object model-mismatch
| and poor query generation). But then you run into the
| issue that every database library tries to be database-
| agnostic, and you lose access to all the fancy features
| your database provides
| nicoburns wrote:
| > but you only really get to a good setup with a decent
| query builder library (ORM-haters smashing strings
| together like an ape is naturally full of issues, and
| ORM-lovers get all the fun of table/object model-mismatch
| and poor query generation). But then you run into the
| issue that every database library tries to be database-
| agnostic, and you lose access to all the fancy features
| your database provides
|
| I've found that smashing strings together works
| excellently so long as:
|
| - You use something like JavaScript's template strings so
| that interpolated fragments are inline
|
| - You have named placeholder support (so that you're not
| trying to count parameter placements which gets hugely
| complex for optional parameters).
|
| - You mainly use it for SELECTS and have minimal query
| builder for INSERTs/UPDATEs/UPSERTs
|
| And you get full access to all of the database's
| features.
| setr wrote:
| If you're maintaining a layer between construction and
| finalization (the query builder), then you're not really
| in the string smashing group I was describing. The main
| thing I was trying to get at there is no one should make
| the mistake of "ORMs are shit, I can just write SQL
| strings directly" because that road is plagued with
| problems (because SQL is not a language that composes
| well..).
|
| And it's not just named parameters; with a sane query
| builder you don't need to totally rewrite things when
| dealing with subqueries vs top-level, aggregations, etc.
|
| As long as you don't go so far as to introduce the horror
| of an ORM
| JCharante wrote:
| > The main thing I was trying to get at there is no one
| should make the mistake of "ORMs are shit, I can just
| write SQL strings directly"
|
| This has actually been the road I've been heading down on
| my personal projects. If I were to use this for work I
| would use something like Kysely so I can have type safety
| and make sure the application layer is updated to support
| every migration, but I've taken time to really practice
| SQL and write queries & migrations manually and avoid
| ORMs completely. I love writing SQL by hand, however I
| know that doesn't scale without some automated checking
| going on.
| dkjaudyeqooe wrote:
| You're making my point here. RDBMSes are not any more
| difficult or mysterious than any other software except
| for the fact that you're more familiar with that software
| and language, thus your 'doesn't X well'.
|
| What you will find though is any sufficiently complicated
| business application software implementing an ad hoc
| informally-specified bug-ridden slow implementation of
| half of what the RDBMS is giving you for free.
| setr wrote:
| > except for the fact that you're more familiar with that
| software and language, thus your 'doesn't X well'.
|
| That's exactly not my point. My point is that, knowing
| SQL, knowing what RDBMS's have and what they are, RDBMS's
| are a bad place to stuff your business logic. You have a
| variety of options that don't work well. Schemas/queries
| are the only relevant place where RDBMS's excel, and
| because it's a living system, you don't really want
| something so unstable as business logic to permeate
| through your schema significantly.
|
| It's not a matter of familiarity. It's the wrong tool for
| the job being discussed.
| dkjaudyeqooe wrote:
| What's not a "living system"? You can update RDBMSes much
| more cleanly and reliably than almost all other software.
| Stored procedures (in a variety of languages) and the
| like do not necessitate changes to your data schema and
| the data that ends up in your schema belongs there,
| regardless of how its produced. That is a data design
| issue.
|
| Being able to roll forward and back with code and data
| (including an updated schema) that is synchronised and
| works is a huge win. In most places where I've worked
| with serious RDBMS systems developers have no access
| whatsoever to production. The fact that this isn't the
| case universally illustrates the strength of these RDBMS
| approaches.
| JCharante wrote:
| why aren't stored procedures on a DB not able to be
| version controlled well? People already version control
| database schemas and migrations, why would stored
| procedures not also be able to be version controlled just
| as well? You can test it by creating shadow databases
| (what Prisma does) to make sure all migrations can be
| done on an empty database just fine, add test data, call
| the procedure, and check the output. Genuinely please do
| enlighten me as to why this isn't possible or doable, I
| would love to learn more about this part of the field.
| stiiv wrote:
| Here is a scenario that I've encountered a number of times.
|
| We needed to serve up some data for an API endpoint. Great!
| We'll write a SQL query for it. Just need a couple of joins and
| where-clause parameters -- it will be speedy, and easy to
| maintain.
|
| Before long, we are asked to make our query more robust,
| including a couple of optional parameters. Some null-coalescing
| in our join expressions and where-clauses is introduced.
| Performance immediately suffers. (I've seen the same kind of
| impact when introducing unions or function calls within such
| expressions -- the query plan can get very messy when you try
| to scale in this manner.)
|
| So what do we do? Restructuring our data (maybe with a
| materialized view or something) is one option.
|
| A reasonable alternative is to break up the single query into a
| few more specialized queries, and then use application logic to
| inspect parameters and choose the best query to use for each
| request.
|
| Plenty of trade-offs to consider in both cases.
| JCharante wrote:
| I don't think there's any trade-offs in that situation. It
| doesn't make sense to have your queries handle optional
| parameters when you have that info already precent in the
| application layer and can choose the specialized query. I
| suppose if you were really rushed you could end up in that
| starting scenario.
| ranting-moth wrote:
| I agree that most DB instances are grossly underutilized (or
| abused performance wise).
|
| But an "entire company" is a tad crude unit of measure when
| talking about optimization.
|
| Good point though. I've seen companies create a completely new
| replicated DB instance for every little pet project they start.
| Crazy. Nice if you're a cloud provider though.
| mrweasel wrote:
| One issue I frequently ponder is: What would happen if we gave a
| modern computer to a developer 30 or 40 years ago?
|
| They'd be overwhelmed by the instruction set I suppose, but how
| large a business could they run on a modern desktop?
| narag wrote:
| You know, some of us are still alive today :)
| throwawaaarrgh wrote:
| I don't think most people today realize how wasteful modern
| software engineering is. They think, it's fine to be inefficient,
| I'll just get a bigger X. And we are now blessed with technology
| and solutions that make that a possibility, when for the longest
| time it wasn't.
|
| But you don't want to have to optimize when you don't expect to.
| It takes time away from your features and quality, and is risky.
| You want to predict when you'll need to grow and how much effort
| it might take, and control costs. That's why software development
| should be efficient from the start. By only using what you need,
| you don't run into situations where you have to optimize to
| survive.
|
| Anyone who says to me "I'll just put everything in PostgreSQL and
| optimize later" sounds just as naive as the person who says
| "let's build for scalability before we have any customers". Both
| are giving themselves problems they don't need, and trying to
| justify it by thinking it's going to save them time. The former
| is giving themselves problems in the long term, the latter is
| giving themselves problems in the short term. Even if these
| tradeoffs are acceptable to you, they are often unnecessary!
| api wrote:
| ... then they see the cloud bill.
|
| Cloud companies found a way to directly monetize developers'
| love of complexity and overengineering. It's genius.
| robocat wrote:
| Developers often love simplicity and underengineering, which
| can also lead to performance problems.
|
| My personal definition of engineering is: choosing good
| compromises between competing needs.
| ecshafer wrote:
| This comment will be taught in business schools one day.
| api wrote:
| 1. Find pathology.
|
| 2. Attach toll booth.
|
| 3. Profit.
| aleph_minus_one wrote:
| > Cloud companies found a way to directly monetize
| developers' love of complexity and overengineering. It's
| genius.
|
| The developers' tastes differ a lot. Consider the extreme
| counterexamples
|
| * the "small code" scene
|
| * people who love to port modern games to C64, Amiga, retro
| consoles, ...
|
| The situation is in my opinion rather that quite some
| managers love complexity and "future-proofness", and thus
| hire developers who have a similar taste.
| _a_a_a_ wrote:
| IME it's idiot managers that go for love of complexity and
| overengineering - though not all mgrs are idiots.
| okeuro49 wrote:
| > "I'll just put everything in PostgreSQL and optimize later"
|
| Sounds good to me!
| 0xbadcafebee wrote:
| "I'll just drink and eat heavily in my 20s and worry about
| exercising later"
|
| Or you could... not?
| yjftsjthsd-h wrote:
| To what benefit? In my (early) twenties, I _could_ do that
| without consequence. But what benefit is there to not doing
| it while you can? (I mean, assuming you 're doing so at a
| non dangerous level)
| zamadatix wrote:
| A new business already has a laundry list of "I need to
| ${thing} to survive". Being able to move one such problem a
| couple of years in the future is already an attractive value
| before considering most businesses won't last that long anyways
| (for reasons unrelated to how optimized the database was). I
| mean if you already have the know how, confidence, and
| experience to do it right in the same amount of time then duh.
| If not, is spending precious extra time that early on really
| going to help you survive long enough to reach where how
| optimized you infrastructure spend is seems like an important
| problem for your cash flow?
|
| Sometimes I think this then gets conflated with more
| established businesses where continuing to start every project
| so heavily this way does tend to create more problems than it
| solves.
| SkyPuncher wrote:
| > Anyone who says to me "I'll just put everything in PostgreSQL
| and optimize later" sounds just as naive as the person who says
| "let's build for scalability before we have any customers".
| Both are giving themselves problems they don't need
|
| Both are giving themselves problems, but they are fundamentally
| different decisions. I will almost always take the "optimize
| later" approach as there's not always a "later".
|
| For most new products, the biggest challenge is finding product
| market fit. Once you have that, the world is your oyster. I
| will happily pay a 10% tax after PMF to be 10% better at
| finding PMF now.
|
| If I've found PMF, I will have more revenue, more traction,
| more budget, etc to fix the problems I created. If I don't have
| PMF, everything goes in the trash.
| adamnemecek wrote:
| It's not 10% though, is it.
| SkyPuncher wrote:
| 10% was just a smallish number. In my experience, it's
| actually less than 1%.
|
| Code typically follows one of two paths:
|
| * It's in an unused/underused part of the app. Performance
| doesn't really matter.
|
| * It's in a highly used part of the app. Performance
| matters, but you'll never be able to predict the features
| that get layered on top of it. Fix performance as you build
| those new features.
| nijave wrote:
| Especially in small/medium sized businesses, hardware is
| cheaper than payroll. It's most cost effective to pay people to
| build features (that drive revenue) while paying a bigger
| hardware bill than spend engineering hours optimizing hardware
| usage.
|
| At some scale they'll be a break even/shift
| hliyan wrote:
| I once ran the daily trade reconciliation of an entire US
| primary exchange on my laptop, using a PERL script (in 2007). I
| wrote it in a couple of days, and each day's processing ran in
| minutes. Ten years later, I witnessed a team spending weeks
| spinning up spark clusters to perform a job that was
| significantly smaller. At the time I thought _I_ was the person
| who was missing some important context. Now I 'm beginning to
| realise that I was not in fact misinformed or insane.
| cutemonster wrote:
| Maybe you were missing something? You could have been a
| manager, commanding a team of Spark engineers and an even
| higher salary?
| teaearlgraycold wrote:
| I don't know if you've ever had to build an MVP at an early
| stage company
| numpad0 wrote:
| In my fuzzy sentimental memory, software bloat seemed to always
| self-regulate to 5-10 lines per second of logs or 2-3 screen
| flickers per minute, whichever applicable. Things go slower and
| optimizations come about, go faster and another Midleware.js
| comes to be.
|
| My working theory is this is the common denominator max
| comfortable flickering rate for developers. There's got to be
| sweet spot for UI, how do I say, fluidity. And that fluid-ness
| usually aren't controlled for dev tools, tools themselves must
| evolve to stay at comfortable refresh-rate, by natural
| selection. That shouldn't be too weird of an idea.
| mynameisash wrote:
| > They think, it's fine to be inefficient, I'll just get a
| bigger X. And we are now blessed with technology and solutions
| that make that a possibility, when for the longest time it
| wasn't.
|
| Unfortunately, the culture in my group is basically this. We
| never address things until they're dire inefficiencies; thus, I
| spend a nontrivial amount of my time addressing workloads that
| take 10+ hours but _need_ to be faster (and often with minor
| changes are cut down to 2 hours).
|
| Having a culture of "I'll just throw more hardware at it" / "If
| it takes an hour or ten, I still am going to run it and get the
| results tomorrow" is costing ridiculous amounts of money and
| lost time. It's such an uphill battle to try to change that
| culture.
| alkonaut wrote:
| You have to love the revival of things like "Let's render the
| html on the server!", "Lets create the whole page with just one
| request!", "Let's use SQL, and perhaps just ONE server!", "Try
| using a server in your office to save on cloud bills", "Put all
| your microservices in one process so they are cheap to run and
| easy to deploy!".
| riku_iki wrote:
| > "Try using a server in your office to save on cloud bills"
|
| and then company stops working when that server dies.
| nik736 wrote:
| You could also say the same for the cloud when your internet
| is not working.
| riku_iki wrote:
| It is easy to setup fault tolerant, simple, low maintenance
| and cheap for small traffic and data infra in the cloud
| (e.g. AppEngine+CloudSQL) where provider will provide
| 99.99% availability.
|
| Internet not working is usually can be mitigated through
| multiple providers, cellular included
| narag wrote:
| You can setup redundant local servers too.
| riku_iki wrote:
| yes, and then you need to have engineer with $300k salary
| to support your app/postgresql cluster, load balancer,
| fallack, backups, security, etc with unknown quality.
| narag wrote:
| So it seems I'm severly underpaid and underemployed.
| Seriously I'm wasting my time reading HN.
| jabart wrote:
| Whatever is spent optimizing this to two cores is hours wasted
| when you could expand. Developer time is still cheaper than Cloud
| Cores or Ram. Yes you should write SQL queries with basic best
| practices and have a schema that is just right between
| performance and normalized.
|
| Not mentioned was table size. You can do a lot of odd things with
| < 100 million rows of data and get away with it.
| outside1234 wrote:
| Yes - I think the overall headline of this is correct (most
| people should just use PostgreSQL) but they should NOT go into
| five zoom levels of detail on optimization or anything other
| than really understanding and utilizing knowledge of how a
| relational database works (utilizing indexes etc.)
| sgarland wrote:
| Nothing in the article was in the weeds, and you can do all of
| it in RDS as well if you'd like (pgstat is an available
| extension, I think enabled by default).
| auspiv wrote:
| Not entirely clear from the article/blog - is this an RDS
| instance (equivalent) or EC2 instance (equivalent) in a cloud
| with storage attached?
|
| I didn't think it was possible to edit postgres.conf on a RDS
| type managed database. But the author also talks about scaling up
| quickly as performance walls are hit while they work on
| optimizing stuff.
| rob742 wrote:
| In RDS this would be possible via Parameter/Option groups
| monlockandkey wrote:
| Does anyone have any experience with Pgtune, i.e has if made a
| difference in performance?
|
| https://pgtune.leopard.in.ua/
| paxys wrote:
| The optimizations are no doubt impressive, but looking at it from
| a costs perspective - they run their company on two DB instances,
| each with two CPU cores and 4 GB of ram. The annual cost of
| running such a machine on AWS (with SSD included) is ~$4000. Even
| if their DB infra was only half as efficient as it is today,
| their costs would increase by a whopping $8000/yr.
|
| How much do they pay an average software developer? How many
| hours have developers spent on query monitoring, optimization,
| removing joins, removing aggregations? What is the ongoing cost
| of learning and applying these techniques to all new code written
| at the company? Even with the most conservative estimates I'm
| confident that it costs them several orders of magnitude more in
| developer productivity than they are saving on hardware.
|
| Optimization is good, but every company needs to recognize where
| they are throwing away dollars to save cents. There is an optimal
| level of inefficiency that companies _want to_ have.
| mcqueenjordan wrote:
| How did you arrive at $4000?
|
| I used https://calculator.aws/#/estimate and my number is over
| an order of magnitude less than $8000 for 2 instances with
| SSDs, 4GB RAM, and 2 cores.
| wheaties wrote:
| I/O, backups, and such. Just having the instances cost money.
| Using them costs even more!
| sgt wrote:
| And egress at AWS.
| SteveNuts wrote:
| I've always thought there should be a meta "AWS Calculator"
| Calculator tool, which takes the output from the official AWS
| calculator and applies an algorithm to make it more realistic
| to factor in the things people forget about (mostly data
| transfer)
| fatherzine wrote:
| How much would they pay an average SRE to deal with random
| fires operating a poorly designed, poorly instrumented and
| poorly understood system?
| jezovuk wrote:
| This. Optimized system means the devs took the time to
| understand the system, know the (most frequent) usage
| patterns and will not be nagged by 'something is slow'
| problems every so often. Of course, as with everything, there
| is a danger of going overboard. (And industry is firmly
| headed in the 'just throw more people/hardware/AI at the
| problem' direction anyway).
| YetAnotherNick wrote:
| Reserved m6a.xlarge(4vcpu, 16 GB RAM) costs <$1000. Assuming
| most companies use reserved for things like DB. So yeah their
| savings is like less than $1000, at which point it feels more
| like a developer satisfaction project rather than having any
| ROI on money terms.
| _a_a_a_ wrote:
| Link? ta
| rob742 wrote:
| The balancing act between one-time developer hours and the
| recurring costs associated with scaling up database resources
| can be challenging. If sufficient funds are available, scaling
| up may seem like an easy decision. However, investing in the
| development of expertise in efficient database operations is
| usually more cost-effective in the long run. Additionally,
| acquiring the skills to debug queries and optimize database
| configurations can ultimately save on developer hours.
| mamcx wrote:
| > How much do they pay an average software developer?
|
| Let's say that you pay a lot. Like you expend $20.000 USD doing
| this.
|
| But we forget the _cost power of software_.
|
| Your monthly savings extend to the infinity.
|
| Your dev is now freer to do other things that are precluded in
| a more complex scenario ("you save pennies here, but expend
| pounds elsewhere"), because complexity DEMANDS MORE ATTENTION.
|
| This is something I have observed all my life. I have _never_
| regretted to optime]ize for operational costs even if this
| means I "waste" a lot of time for this. I work in a more
| cutthroat environment than most startups (ERPs), where my time
| to deploy to production something is measured in hours, and is
| an environment that is filled by the graveyard of my
| competitors that have drink all the kool-aids.
|
| Complex operational environments ALWAYS have turned worse/more
| complex than do the simplest things.
|
| P.D: I think is similar to spending time in refactoring,
| cleanups, testing, security audits, training, and even resting.
|
| All of that sounds like "costs". But you are doing this stuff
| alongside, just need to budget a few hours a week on this and
| the profits come and come.
|
| Never regret using time on this stuff.
|
| NEED to be smart on this? Sure. But is pure muscle and after a
| while is great.
| nerdponx wrote:
| Carefully designing your application around an underpowered
| database is almost surely _more_ complicated than just paying
| for a more powerful database.
|
| We're not talking about K8s or sharding. We're talking about
| adding more cores and RAM to the single database node that
| they're already using.
|
| Imagine how much easier the backend dev's life would be if
| they could just write SELECT FROM JOIN WHERE like everyone
| else, and not have to pore over query plans and
| design/implement/test a bunch of custom joining routines.
| This path sounds like a high-effort artisanal approach to
| something where it doesn't seem warranted. It's penny-wise
| and pound-foolish in exactly the way you (very nicely)
| described.
| hobs wrote:
| It's true, but its great experience for future things when
| the costs actually matter - I cut my teeth at a biz that
| refused to pay more for database hardware and now my
| performance skills pay the bills at much nicer companies.
| smallnamespace wrote:
| Surely it's great for you as the developer to get paid
| extra to learn transferable skills.
|
| That's not necessarily true of the business.
| hobs wrote:
| Absolutely, I did not feel it as a positive at the time
| and recommended not that much more money for
| significantly better utilization of human time, but you
| take what you can get with intractable CEOs.
| renegade-otter wrote:
| And now your system can scale better, saving you enough money
| to justify the optimization work long-term.
|
| The StackOverflow folks famously take their DB design and
| performance very seriously, and now they do not need an army of
| overpriced "architects" to plug the performance holes with more
| insane and expensive complexity.
|
| https://stackexchange.com/performance
|
| Coincidentally I just wrote up a post about the importance of
| minding your database skills:
|
| https://renegadeotter.com/2023/11/12/your-database-skills-ar...
| paxys wrote:
| Stack Overflow is the poster child of over-optimizing DB
| engineering while neglecting every other product feature that
| end users actually want, so not sure if that should be your
| example.
| renegade-otter wrote:
| There is a case to be made they took it to an extreme, but
| there is too much going in the other direction as well.
| It's not like most teams are working on the meat of the
| product anyway - they probably spend 40% of their time
| running a small project on Kubernetes and getting rid of
| daily security warnings from NPM.
| adrianN wrote:
| Somehow they managed to become one of the most important
| websites for software developers despite neglecting every
| product feature you miss.
| paxys wrote:
| They have never made a dollar of profit in their 15 years
| of operation and just sold themselves to a private equity
| firm for pennies. "Most important website for software
| developers" doesn't really mean anything unless one can
| also be a sustainable company in the process.
| hobs wrote:
| Have you seen their code base? They did not over optimize
| the DB engineering by any means that I could tell.
|
| Having a simple monolithic database is not optimizing your
| DB engineering by any means.
|
| edit: and inventing a better ORM doesn't count either.
| sharadov wrote:
| As someone who does DB optimization for a living - there is a
| tradeoff, I've seen far too many places where db instances run
| with default settings out of the box, queries missing basic
| indexes, no one even cared to run an explain analyze.
|
| I've seen far too experienced developers with no basic
| understanding of how relational databases operate. And it's
| gotten worse over the years - as they can keep throwing
| hardware at the problem. Performance optimization has become a
| lost art. This is as much a problem of negligence as much as it
| is of education.
|
| I won't even touch on proper design, which requires someone
| with significant database expertise from the get-go ( which
| most startups may not have).
| dkjaudyeqooe wrote:
| > Performance optimization has become a lost art.
|
| It's really not very hard, basically add indexes wherever
| lookups are done but don't go overboard, especially with
| tables that see a lot of inserts. The DB will tell you what
| it's doing if you ask, it's all easily investigated. If your
| joins resist optimisation your data structure is probably
| off.
|
| > I won't even touch on proper design, which requires someone
| with significant database expertise from the get-go
|
| In more extreme cases this is true, but a few basic
| principles and an understanding of databases will get you 95%
| of the way there and is quite intuitive. Something as basic
| as "don't copy data around, each individual entity exists in
| one place, and use its key to refer to it otherwise" will
| help enormously.
| alex7734 wrote:
| > It's really not very hard, basically add indexes
|
| It's really not but you'd be surprised how many people
| don't know the basics of how indexes work. I've had people
| tell me that their queries couldn't be optimized further
| because "they had already added an index for every column
| of the table".
|
| It doesn't help that SQL, by design, hides the actual
| algorithm doing the data access from the users while
| simultaneously relying on them to add indexes to achieve
| performance, which is in my humble opinion the worst
| mistake of SQL.
| corytheboyd wrote:
| I love to ask this very simple question in interviews: if
| a database index makes queries faster, why not add an
| index for every column on a table?
|
| I want them to say "the heck are you talking about,
| because that's not how anything works?"... it does trip
| some people up though that just legitimately don't
| understand what's happening.
| ako wrote:
| That is not a SQL mistake, but an implementation choice.
| Most modern databases can determine where indexes should
| be added, and add these automatically. E.g.:
| https://www.oracle.com/news/connect/oracle-database-
| automati...
| jjav wrote:
| > I've seen far too many places where db instances run with
| default settings out of the box, queries missing basic
| indexes, no one even cared to run an explain analyze.
|
| Indeed! One place I worked at, there was something like 10x
| (forget exactly) gains to be had simply by editing the MySQL
| config to match the workload. Seems like nobody bothers
| anymore. There is _so much_ money to be saved by reducing
| excess operating costs.
| qingcharles wrote:
| You can run this on Hetzner ARM hosts for about $30/year o_O
| jerf wrote:
| The last line from the article: "We must not forget that 20
| years ago only a fraction of the performance of current
| hardware resources was available, and despite these
| restrictions, performance database-based applications were
| developed."
|
| I think you have a mental model in which these people are
| spending weeks and months optimizing a system they should just
| upgrade while the real tasks languish, but in my experience,
| probably all they are doing is just paying a bit of attention
| every so often and simply doing things that don't require 32
| cores and terabytes of RAM. A lot of things don't need that
| power.
|
| 20 years ago you had a lot more real variance in power. If your
| 100MHz server couldn't handle it, maybe your 400MHz server
| could, and the gap between those two systems is probably even
| bigger than the numbers imply because that 400MHz was probably
| equipped with better everything else (RAM, disk, cache, better
| performance/cycle for the CPU) to go with it. Nowadays, with
| CPU speeds having stalled, it's easy to go wide and service
| lots of diverse queries at a time, but if you have a particular
| query that your lowest-end DB node is struggling with, there's
| a pretty decent chance that just pushing the button to go to a
| bigger node will have a distinctly sub-linear result on
| performance improvement. I've certainly seen this from a number
| of teams in a number of places. You can push a button to get
| _more_ CPUs but you can 't push a button to get much _better_
| CPUs. Some better, yes, but not like it used to be.
|
| If you just try a little, modern commodity systems can do a
| _lot_. Even with only 4 cores and 4GB. Just try a little. Yes,
| there are absolutely tasks that that doesn 't work with; if
| you've got 10,000 hours of video to re-encode that system isn't
| going to cut it. But a lot of developers are _really_ bad at
| understanding how long things should take and don 't realize
| that the thing that takes two minutes during CI really should
| take something more like 50ms because
| https://accidentallyquadratic.tumblr.com/ and somebody really
| ought to look into that rather than just bump the CI node size
| again hoping it solves the problem.
|
| (And for Pete's sake, when bumping the instance size does
| happen to not solve the problem... _turn it back down again!_
| Despite what I just wrote I do understand there 's a time and a
| place to spray money at the problem, but when spraying money at
| the problem doesn't even help, _STOP_.)
| jjav wrote:
| > The optimizations are no doubt impressive
|
| The sad thing is that the things they discuss like monitoring
| and optimizing slow queries have always, until recently, been
| baseline competence.
|
| It's only recently that the universal answer to something being
| too slow is to simply spin up another dozen, or heck make it
| two dozen, cloud instances to compensate.
|
| > Even if their DB infra was only half as efficient as it is
| today, their costs would increase by a whopping $8000/yr.
|
| I think the interesting comparison is not so much the same team
| doing things slightly less efficiently, but comparing to the
| average status quo in the industry.
|
| So many companies are spending $100K+/month (month not year) on
| AWS costs to run very tiny minimal loads due to the attitude
| that efficiency never matters, just spin up more instances.
| Pretty soon you're talking real money. And by the time the CFO
| starts screaming, it's very difficult to go back to a more sane
| infrastructure.
| latchkey wrote:
| I'm picturing that meme with the guy sweating over pressing a
| button.
|
| Add new features that generates revenue, or optimize the
| database.
|
| Hmmm.
|
| Of course the response is that you want to optimize the database
| so that when you do eventually add new features that generate
| revenue, you can scale to meet the demand.
|
| Or is it...
| Draiken wrote:
| This meme is a bit too misleading.
|
| In reality tons of startups have a shitty product with zillions
| of bugs because "move fast break things". Then later on they
| don't survive because the only reason people even pay for them
| is because the other startups are spending silly money on it.
| When the well dries up, it all comes crumbling down.
|
| Turns out that startup's software is too shitty and simply not
| good enough to survive a cost cutting round.
|
| People love to pretend that just adding more features always
| beats everything else, but it's not as clear cut as people make
| it seem.
| latchkey wrote:
| We're talking about premature optimization, not necessarily
| shitty/buggy software.
| lgkk wrote:
| Nice.
|
| I do something similar with low cost instances. I use Go and I
| get a great bang for my buck (performance to dx to costs).
|
| In my case I also leverage the cheap access to gobs of SSD or
| nvme. I use super cheap VMs for the app servers and they come
| with 10-20gb disk. I use this as KV embedded cache. For a few
| bucks a month I'm able to pump out 7-10k rps at load (synthetic
| but still).
|
| I have a low end pg instance that's able to pump out 2k tps
| (synthetic) so with the little bit of caching I'm able to save
| some costs and also offer good performance. All under $100 a
| month.
|
| I hope when my business does take off I will have a good base to
| expand that I can get far without spending tons of money on
| inefficient compute.
| mannyv wrote:
| If you want a data store, just use a NoSQL server.
|
| If you want an RDBMS, learn how to use it.
|
| At least with NoSQL you know you're trading ignorance for ease of
| use.
| bhaak wrote:
| > Improve database schemas, e.g., through normalization, to
| promote efficient filtering.
|
| Cries in ActiveRecord type columns.
|
| In less sad news, I recently learned that Postgres can have
| serious problems with joining a lot of tables. I looked at a join
| of several tables (table A 80MM, table B 30MM, table C 1MM rows)
| and instead of using the WHERE parameter to reduce the result set
| of table A to a few thousands, it fetched all rows from table C
| first.
|
| The applied heuristic in this case apparently being "join tables
| by order of lower row count".
|
| Setting "join_collapse_limit" to 1 forces the query planner to
| join the table in exactly the order as they appear in the query
| and that solved my performance issues neatly.
| nerdponx wrote:
| Now that CTEs are no longer optimization fences in recent
| versions of Postgres, is there a way to somehow manually
| declare an optimization fence around a certain region of the
| query?
|
| I've had similar problems in Snowflake with the query planner
| taking a bad path, and no obvious way to hint otherwise.
| bhaak wrote:
| Functions are hard barriers for optimization and I don't see
| that changing any time soon.
|
| Which goes even so far that EXPLAIN doesn't work on them and
| you just get a black box of "Function call to ..." in the
| execution plan.
|
| Using auto_explain is the only workaround known to me for
| that.
| emilsedgh wrote:
| But that's not the case when functions are inlined.
| bhaak wrote:
| Interesting. I've never seen this though. I guess that
| only happens for simple cases?
| baq wrote:
| WITH MATERIALIZED restores the fence.
| anarazel wrote:
| > The applied heuristic in this case apparently being "join
| tables by order of lower row count".
|
| Not really - it's all cost based analysis. There are a few main
| culprits that might be at work:
|
| 1) Selectivity estimates can get very off if you have cross-
| column correlations or your joins are over multiple
| aggregations / DISTINCT or such.
|
| 2) Join order planning is exponential in the number of joins.
| Once there are more than geqo_threshold (default = 12) joins,
| we don't explore the whole space anymore.
|
| 3) If you have more than join_collapse_limit /
| from_collapse_limit joins, subqueries in the from list, we'll
| not flatten them anymore, to prevent the join ordering problem
| to become even harder. It might be worth trying to just
| increase them with your query.
___________________________________________________________________
(page generated 2023-11-16 23:00 UTC)