[HN Gopher] My notes on Gitlab's Postgres schema design (2022)
___________________________________________________________________
My notes on Gitlab's Postgres schema design (2022)
Author : daigoba66
Score : 437 points
Date : 2024-02-17 22:05 UTC (1 day ago)
(HTM) web link (shekhargulati.com)
(TXT) w3m dump (shekhargulati.com)
| kelnos wrote:
| > _For example, Github had 128 million public repositories in
| 2020. Even with 20 issues per repository it will cross the serial
| range. Also changing the type of the table is expensive._
|
| I expect the majority of those public repositories are forks of
| other repositories, and those forks only exist so someone could
| create pull requests against the main repository. As such, they
| won't ever have any issues, unless someone makes a mistake.
|
| Beyond that, there are probably a lot of small, toy projects that
| have no issues at all, or at most a few. Quickly-abandoned
| projects will suffer the same fate.
|
| I suspect that even though there are certainly some projects with
| hundreds and thousands of issues, the average across all 128M of
| those repos is likely pretty small, probably keeping things well
| under the 2B limit.
|
| Having said that, I agree that using a 4-byte type (well, 31-bit,
| really) for that table is a ticking time bomb for some orgs,
| github.com included.
| justinclift wrote:
| > github.com included
|
| Typo?
| rapfaria wrote:
| _> Having said that, I agree that using a 4-byte type (well,
| 31-bit, really) for that table is a ticking time bomb for some
| orgs_
|
| A bomb defused in a migration that takes eleven seconds
| aeyes wrote:
| The migration has to rewrite the whole table: bigint needs 8
| bytes, so you have to make room for that.
|
| I have done several such primary key migrations on tables
| with 500M+ records; they took anywhere from 30 to 120 minutes
| depending on the number of columns and indexes. If you have
| foreign keys it can be even longer.
|
| Edit: But there is another option which is logical
| replication. Change the type on your logical replica, then
| switch over. This way the downtime can be reduced to minutes.
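|
| A minimal sketch of that switchover (assuming a table named
| issues; logical replication happily applies int values into a
| wider bigint column on the subscriber):
|
|     -- on the primary
|     CREATE PUBLICATION pk_migration FOR TABLE issues;
|
|     -- on the replica, which already has the widened column
|     ALTER TABLE issues ALTER COLUMN id TYPE bigint;
|     CREATE SUBSCRIPTION pk_migration_sub
|       CONNECTION 'host=primary dbname=app'
|       PUBLICATION pk_migration;
|
|     -- once lag is zero: stop writes on the primary, wait for
|     -- the replica to catch up, then point the app at it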
| winrid wrote:
| This largely depends on the disk. I wouldn't expect that to
| take 30mins on a modern NVME drive, but of course it
| depends on table size.
| aeyes wrote:
| Disk wasn't the limit in my case; index creation is
| single-threaded.
| winrid wrote:
| Which is still very IO bound... Wonder what kinda IOPS
| you were observing? Also they make pretty fast CPUs these
| days :)
| scient wrote:
| Large tables take hours, if not days. I attempted a test
| case on AWS using souped up io2 disks (the fastest most
| expensive disks they have) and a beast of a DB server
| (r5.12xl I think) and it became abundantly clear that at
| certain scale you won't be doing any kind of in-place
| table updates like that on the system. Especially if your
| allowed downtime is one hour maintenance window per
| week...
| aeyes wrote:
| I did it on a r6.24xlarge RDS instance and the CPU wasn't
| doing much during the operation. IO peaked at 40k IOPS on
| EBS with provisioned IOPS, I'm not sure if a local disk
| would be any faster but I already know that rewriting the
| table and creating the indexes are all single threaded so
| there isn't much you could gain.
|
| Once I got the logical replication setup to work I
| changed 20 tables on the replica and made the switch with
| 15 minutes of downtime. That saved me a lot of long
| nights.
|
| You can get an idea of how long this could take by
| running pg_repack, it's basically doing the same thing:
| Copying the data and recreating all the indexes.
| semiquaver wrote:
| In practice the only option that I've seen work for very
| large teams and very large relational databases is online
| schema change tools like https://github.com/shayonj/pg-osc
| and https://github.com/github/gh-ost (the latter developed
| for GitHub's monolith). It's just too difficult to model
| what migrations will cause problems under load. Using a
| binlog/shadowtable approach for all migrations mostly
| obviates the problem.
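|
| The shadow-table idea, very roughly, in plain SQL (not what
| pg-osc literally runs, just the shape of it, with an issues
| table as a stand-in):
|
|     -- 1. copy of the table with the new definition
|     CREATE TABLE issues_new (LIKE issues INCLUDING ALL);
|     ALTER TABLE issues_new ALTER COLUMN id TYPE bigint;
|
|     -- 2. mirror new writes while the backfill runs
|     --    (only the INSERT path shown; UPDATE/DELETE need
|     --    the same treatment)
|     CREATE FUNCTION mirror_issues() RETURNS trigger AS $$
|     BEGIN
|       INSERT INTO issues_new SELECT (NEW).*;
|       RETURN NEW;
|     END;
|     $$ LANGUAGE plpgsql;
|     CREATE TRIGGER issues_mirror AFTER INSERT ON issues
|       FOR EACH ROW EXECUTE FUNCTION mirror_issues();
|
|     -- 3. backfill old rows in batches, then swap the names
|     --    in one short transaction
|     ALTER TABLE issues RENAME TO issues_old;
|     ALTER TABLE issues_new RENAME TO issues;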
| CubsFan1060 wrote:
| You can also migrate it using logical replication.
| scient wrote:
| Rails migrations really fail to be viable at scale. And
| it's not really because of the migrations in Rails, it's
| because changes in PostgreSQL get very very expensive.
| Things have gotten better in PG 15, but it's still not
| quite there yet.
|
| I've run into an integer PK being almost exhausted in
| values twice now. In one of these cases I used pg-osc to
| rewrite an 8TB table on disk without downtime over a
| period of a week to fix the issue. In the other case it
| was about 6TB within 2 days. It's doable, but the amount
| of planning and time it takes is non-trivial.
|
| I have so far not found any other viable solution.
| I keep hearing about logical replication but I haven't
| seen a great real-world solution at scale so far.
| tengbretson wrote:
| In JavaScript land, postgres bigints deserialize as strings.
| Is your application resilient to this? Are your downstream
| customers ready to handle that sort of schema change?
|
| Running the db migration is the easy part.
| Rapzid wrote:
| Depends on the lib. Max safe int size is like 9
| quadrillion. You can safely deserialize serial bigints to
| this without ever worrying about hitting that limit in many
| domains.
| duskwuff wrote:
| > Max safe int size is like 9 quadrillion.
|
| 2^53, to be precise. If your application involves
| assigning a unique identifier to every ant on the planet
| Earth (approx. 10^15 ≈ 2^50), you might need to think
| about this. Otherwise, I wouldn't worry about it.
| jameshart wrote:
| 11 seconds won't fix all your foreign keys. Nor all the code
| written against it that assumes an int type will accommodate
| the value.
| zX41ZdbW wrote:
| It is still under the limit today with 362,107,148 repositories
| and 818,516,506 unique issues and pull requests:
|
| https://play.clickhouse.com/play?user=play#U0VMRUNUIHVuaXEoc...
| nly wrote:
| That query took a long time
| zx8080 wrote:
| Elapsed: 12.618 sec, read 7.13 billion rows, 42.77 GB
|
| This is too long, seems the ORDER BY is not set up correctly
| for the table.
| zx8080 wrote:
| Also,
|
| > `repo_name` LowCardinality(String),
|
| This is not a low cardinality:
|
| 7133122498 = 7.1B
|
| Don't use low cardinality for such columns!
| ssalka wrote:
| I'm guessing this won't be including issues & PRs from
| private repos, which could be substantial
| mvdtnz wrote:
| Do we know for sure if gitlab cloud uses a multi-tenanted
| database, or a db per user/customer/org? In my experience
| products that offer both a self hosted and cloud product tend
| to prefer a database per customer, as this greatly simplifies
| the shared parts of the codebase, which can use the same
| queries regardless of the hosting type.
|
| If they use a db per customer then no one will ever approach
| those usage limits and if they do they would be better suited
| to a self hosted solution.
| Maxion wrote:
| I've toyed with various SaaS designs and multi-tenanted
| databases always come to the forefront of my mind. It seems to
| simplify the architecture a lot.
| tcnj wrote:
| Unless something has substantially changed since I last
| checked, gitlab.com is essentially self-hosted gitlab
| ultimate with a few feature flags to enable some marginally
| different behaviour. That is, it uses one multi-tenant DB for
| the whole platform.
| karolist wrote:
| Not according to [1] where the author said
|
| > This effectively results in two code paths in many parts
| of your platform: one for the SaaS version, and one for the
| self-hosted version. Even if the code is physically the
| same (i.e. you provide some sort of easy to use wrapper for
| self-hosted installations), you still need to think about
| the differences.
|
| 1. https://yorickpeterse.com/articles/what-it-was-like-
| working-...
| golergka wrote:
| Being two orders of magnitude away from running out of ids is
| too close for comfort anyway.
| iurisilvio wrote:
| Migrating primary keys from int to bigint is feasible. Requires
| some preparation and custom code, but zero downtime.
|
| I'm managing a big migration following mostly this recipe, with
| a few tweaks: http://zemanta.github.io/2021/08/25/column-
| migration-from-in...
|
| FKs, indexes and constraints in general make the process more
| difficult, but possible. The data migration took some hours in
| my case, but no need to be fast.
|
| AFAIK GitLab has tooling to run tasks after upgrade to make it
| work anywhere in a version upgrade.
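|
| The rough shape of that recipe (simplified; the linked post
| covers the FK and constraint handling properly):
|
|     ALTER TABLE issues ADD COLUMN id_bigint bigint;
|
|     -- keep new writes in sync
|     CREATE FUNCTION copy_issue_id() RETURNS trigger AS $$
|     BEGIN
|       NEW.id_bigint := NEW.id;
|       RETURN NEW;
|     END;
|     $$ LANGUAGE plpgsql;
|     CREATE TRIGGER issues_copy_id BEFORE INSERT OR UPDATE
|       ON issues FOR EACH ROW EXECUTE FUNCTION copy_issue_id();
|
|     -- backfill old rows in small batches
|     UPDATE issues SET id_bigint = id
|     WHERE id BETWEEN 1 AND 100000;  -- repeat per range
|
|     -- then: CREATE UNIQUE INDEX CONCURRENTLY on id_bigint,
|     -- and swap it in as the new primary key plus column
|     -- rename inside one short transaction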
| istvanu wrote:
| I'm convinced that GitHub's decision to move away from Rails
| was partly influenced by a significant flaw in ActiveRecord:
| its lack of support for composite primary keys. The need for
| something as basic as PRIMARY KEY(repo_id, issue_id) becomes
| unnecessarily complex within ActiveRecord, forcing developers
| to use workarounds that involve a unique key alongside a
| singular primary key column to meet ActiveRecord's requirements
| --a less than ideal solution.
|
| Moreover, the use of UUIDs as primary keys, while seemingly a
| workaround, introduces its own set of problems. Despite
| adopting UUIDs, the necessity for a unique constraint on the
| (repo_id, issue_id) pair persists to ensure data integrity, but
| this significantly increases the database size, leading to
| substantial overhead. This is a major trade-off with potential
| repercussions on your application's performance and
| scalability.
|
| This brings us to a broader architectural concern with Ruby on
| Rails. Despite its appeal for rapid development cycles, Rails'
| application-level enforcement of the Model-View-Controller
| (MVC) pattern, where there is a singular model layer, a
| singular controller layer, and a singular view layer, is
| fundamentally flawed. This monolithic approach to MVC will
| inevitably lead to scalability and maintainability issues as
| the application grows. The MVC pattern would be more
| effectively applied within modular or component-based
| architectures, allowing for better separation of concerns and
| flexibility. The inherent limitations of Rails, especially in
| terms of its rigid MVC architecture and database management
| constraints, are significant barriers for any project beyond
| the simplest MVPs, and these are critical factors to consider
| before choosing Rails for more complex applications.
| zachahn wrote:
| I can't really comment on GitHub, but Rails supports
| composite primary keys as of Rails 7.1, the latest released
| version [1].
|
| About modularity, there are projects like Mongoid which can
| completely replace ActiveRecord. And there are plugins for
| the view layer, like "jbuilder" and "haml", and we can bypass
| the view layer completely by generating/sending data inside
| controller actions. But fair, I don't know if we can
| completely replace the view and controller layers.
|
| I know I'm missing your larger point about architecture! I
| don't have so much to say, but I agree I've definitely worked
| on some hard-to-maintain systems. I wonder if that's an
| inevitability of Rails or an inevitability of software
| systems--though I'm sure there are exceptional codebases out
| there somewhere!
|
| [1] https://guides.rubyonrails.org/7_1_release_notes.html#com
| pos...
| silasb wrote:
| Do you have any sources on GitHub moving away from Rails?
| This is the first that I've heard and my googlefu has
| returned zero results around this. Just last year they had a
| blog post around Building GitHub with Ruby and Rails[0] so
| your remark caught me off guard.
|
| [0]: https://github.blog/2023-04-06-building-github-with-
| ruby-and...
| modderation wrote:
| Perhaps too late, but Rails 7.1[1] introduced composite
| primary key support, and there's been a third-party gem[2]
| offering the functionality for earlier versions of
| ActiveRecord.
|
| [1] https://guides.rubyonrails.org/7_1_release_notes.html#com
| pos...
|
| [2] https://github.com/composite-primary-
| keys/composite_primary_...
| martinald wrote:
| Is it just me that thinks in general schema design and
| development is stuck in the stone ages?
|
| I mainly know dotnet stuff, which does have migrations in EF (I
| note the point about gitlab not using this kind of thing because
| of database compatibility). It can warn about potential data
| loss when generating migrations.
|
| However, it still is always quite scary doing migrations,
| especially bigger ones refactoring something. Throw into this
| jsonb columns and I feel it is really easy to screw things up and
| suffer bad data loss.
|
| For example, renaming a column (at least in EF) will result in a
| column drop and column create on the autogenerated migrations.
| Why can't I give the compiler/migration tool more context on this
| easily?
|
| Also the point about external IDs and internal IDs - why can't
| the database/ORM do this more automatically?
|
| I feel there really hasn't been much progress on this since
| migration tooling came around 10+ years ago. I know ORMs are
| leaky abstractions, but I feel everyone reinvents this stuff
| themselves and every project does these common things a different
| way.
|
| Are there any tools people use for this?
| sjwhevvvvvsj wrote:
| One thing I like about hand designing schema is it makes you
| sit down and make very clear choices about what your data is,
| how it interrelates, and how you'll use it. You understand your
| own goals more clearly.
| nly wrote:
| Exactly that. Sitting down and thinking about your data
| structures and APIs before you start writing code seems to be
| a fading skill.
| EvanAnderson wrote:
| It absolutely shows in the final product, too.
|
| I wish more companies evaluated the suitability of software
| based on reviewing the back-end data storage schema. A lot
| of sins can be hidden in the application layer but many
| become glaringly obvious when you look at how the data is
| represented and stored.
| doctor_eval wrote:
| So many people I encounter seem to think it's the code that's
| important when building the back end of an application. You
| see this when people discussing database schemas start
| comparing, say, rails to hibernate. But ORMs emphasise code
| instead of data, which in my experience is a big mistake.
|
| In my experience, getting the data structures right is 99% of
| the battle. If you get that right, the code that follows is
| simple and obvious.
|
| For database applications, this means getting the schema
| right. To this end, I always start with the underlying table
| structures, and only start coding once I understand how the
| various tables are going to interact.
|
| Sadly, too many people think of the database as the annoying
| hoops we jump through in order to store the results of our
| code. In my world, the code I write is the minimum required
| to safely manipulate the database; it's the data that counts.
|
| Some people seem to think I'm weird for starting with the
| database (and for using plpgsql), but I think it's actually a
| superpower.
| jupp0r wrote:
| This is true for in memory data as well. Object oriented
| programming is great for some problems, but it's also
| limiting the way we think about data by putting it close to
| the code operating on it. ORMs do the same to databases by
| pretending that rows are objects when that's only one way
| of modeling your problem.
| sjwhevvvvvsj wrote:
| Yup - and you can't code your way to real scale either. At
| real scale the game is all about data structures. Code just
| gets them from A to B.
|
| Or as they say at Google, the job of SWE is "moving
| protos".
| jimbokun wrote:
| It's part of the dirty little secret of why document
| databases and other NoSQL systems became popular.
|
| Required even less up front thinking about how to model
| your data. Throw some blobs of JSON into Mongo or whatever,
| and worry about the rest later.
| thom wrote:
| It wasn't always code first - you mention Hibernate but
| 15-20 years ago it was entirely feasible to inherit a
| database schema or design one up front, and then create
| performant metadata mappings to a usable object model. That
| sort of practice was tainted by the Bad Enterprise brushes
| of verbosity and XML in general, and so everyone moved to
| some flavour of active record. This allowed programmers to
| write less code and fit it into neater boxes, at a time
| when there was enormous demand for basic CRUD web apps, but
| a lot of power and expressiveness was lost.
|
| Somewhat ironically, many modern enterprises have peeled
| all the way back to SQL for a huge amount of logic anyway,
| so I don't think we're done caring about database schemas
| quite yet.
| doctor_eval wrote:
| You are absolutely correct; my previous business ended up
| ejecting almost all database logic from Java ORMs, and
| moved almost everything to straight SQL manipulated by
| stored procedures. Doing so resulted in a significant
| performance and productivity increase, relative to all
| the layers of nonsense we used to jump through.
|
| One of the difficulties of doing this was that the
| tooling isn't great. We had to write our own tools to
| make it work, but the benefits of going back to SQL were
| more than worth it.
|
| (Many years later I made an open source version of that
| tooling, https://github.com/pgpkg/pgpkg)
| sjwhevvvvvsj wrote:
| A simple function call as a wrapper for a well optimized
| sql query just can't be beat for performance. I have
| never understood why anybody would use an ORM, it's
| usually as much work to learn and maintain them as SQL.
| cschmatzler wrote:
| If you use MySQL, Planetscale's branching is really amazing.
| Not using them, but wish I could for that. Gives you a complete
| diff of what you're doing, and can also pre-plan migrations and
| only apply them when you need with their gating.
| Atotalnoob wrote:
| EF Core doesn't do drop/create for columns in db providers that
| support renaming columns. It only does that for ones that
| don't, like MySQL or SQLite.
| winrid wrote:
| Not a silver bullet for every project but the Django ORM
| largely solves this with its migrations. You define your table
| classes and it just generates the migrations.
|
| Throw in a type checker and you're in pretty good shape.
|
| Rust also has sqlx which will type check your code against the
| DB.
| dxdm wrote:
| I'm assuming this is why you say it's not a silver bullet,
| but to make it more explicit: the Django ORM will happily
| generate migrations that will lock crucial tables for long
| amounts of time and bring down your production application in
| the process.
|
| You still need to know what SQL the migration will run (take
| a look at `manage.py sqlmigrate`) and most importantly how
| your database will apply it.
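|
| For example (generic Postgres, nothing Django-specific here),
| the kind of statement you want to catch in review is
|
|     -- full table rewrite under an ACCESS EXCLUSIVE lock
|     ALTER TABLE orders ALTER COLUMN id TYPE bigint;
|
| and at minimum you want a guard so a blocked DDL statement
| gives up instead of queueing behind long transactions and
| stalling every other query:
|
|     SET lock_timeout = '5s';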
| basil-rash wrote:
| Dealing with a bunch of automigrate headaches in the Prisma
| ORM convinced me to just drop the layer entirely and write
| plain old SQL everywhere. It's forced me to learn a bunch
| of new stuff, but the app runs faster now that I can
| optimize every query and migrations are much simpler with a
| single idempotent SQL setup script I can run to provision
| whatever deployment of the DB I need. I'm sure some problem
| spaces might benefit from all the additional complexity and
| abstraction, but the average app certainly can make do
| without for a long time.
| winrid wrote:
| It's a tradeoff! I think using the ORM to start and then
| move off it later is valid, depending on how much time
| you have to get an MVP out.
| kfajdsl wrote:
| I also switched to plain SQL migrations and queries
| because I find it much simpler. I hear this a lot, that
| ORMs are easier/quicker to use, but I've found that
| writing plain SQL has almost no friction for me. I mean,
| learning at least basic SQL is part of the first year in
| most CS degrees.
| fabian2k wrote:
| There are some things that are really annoying in raw SQL
| and that are much easier with ORMs. Dynamic queries are
| one of those, if you don't want to concatenate SQL
| strings yourself you need at least a query builder at
| that point. The other part is updating/creating entities
| with many relations; that gets very tedious quickly and
| is something ORMs can handle for you.
|
| It depends a lot on the specific application, for me
| those things are so common that I prefer to use an ORM
| even though I could write the SQL myself. ORMs are easier
| to use, but you still need to learn what they do under
| the hood and understand them or you will run into issues
| along the way.
| basil-rash wrote:
| I'm not suggesting anyone implement the raw binary
| protocol themselves! The postgres connection library
| you're using should be able to handle dynamic queries
| perfectly fine.
| mvdtnz wrote:
| An ORM is NEVER the solution, ever ever ever. Repeat after
| me: ORMs are not the solution to this problem. They work in
| your little toy apps with 4 customers but they are nothing
| but pain on real enterprise grade software.
| winrid wrote:
| I don't think insults are really necessary here.
|
| Also ORMs can be very useful, just don't do dumb stuff,
| like with any technology.
|
| I use them when appropriate.
| winrid wrote:
| also, every company I've worked at used ORMs in some
| capacity. Sometimes, not always.
|
| Also, I don't really work on any apps with only four
| customers. either they have almost a million or zero :P Try
| again. :)
| timacles wrote:
| There's no right abstraction for it because everyone's data is
| different. From my experience, what most developers don't
| realize is that data is more complex than code. Code is merely
| the stuff that sits on top of the data, shuffling it around...
| but designing and handling the data in an efficient way is the
| real engineering problem.
|
| Any abstraction you could come up with wouldn't fit 90% of the
| other cases.
| Merad wrote:
| > Also the point about external IDs and internal IDs - why
| can't the database/ORM do this more automatically?
|
| It has pretty big implications for how your application code
| interacts with the database. Queries that involve id's will
| need to perform joins in order to check the external id.
| Inserts or updates that need to set a foreign key need to
| perform an extra lookup to map the external id to the correct
| FK value (whether it's literally a separate query or a
| CTE/subquery). Those are things that are way outside the realm
| of what EF can handle automatically, at least as it exists
| today.
| emodendroket wrote:
| I think that stuff works about as well as it possibly could. If
| you think that's painful think about something like DynamoDB
| where if you didn't really think through the access patterns up
| front you're in for a world of pain.
| josephg wrote:
| > As I discussed in an earlier post[3] when you use Postgres
| native UUID v4 type instead of bigserial table size grows by 25%
| and insert rate drops to 25% of bigserial. This is a big
| difference.
|
| Does anyone know why UUIDv4 is so much worse than bigserial?
| UUIDs are just 128 bit numbers. Are they super expensive to
| generate or something? What's going on here?
| AprilArcus wrote:
| UUIDv4s are fully random, and btree indices expect "right-
| leaning" values with a sensible ordering. This makes indexing
| operations on UUIDv4 columns slow, and was the motivation for
| the development of UUIDv6 and UUIDv7.
| couchand wrote:
| I'm curious to learn more about this heuristic and how the
| database leverages it for indexing. What does right-leaning
| mean formally and what does analysis of the data structure
| look like in that context? Do variants like B+ or B* have the
| same charactersistics?
| stephen123 wrote:
| I think it's because of btrees. Btrees and the pages work better
| if only the last page is getting lots of writes. UUIDs cause
| lots of unordered writes, leading to page bloat.
| barrkel wrote:
| Random distribution in the sort order means the cache locality
| of a btree is poor - instead of inserts going to the last page,
| they go all over the place. Locality of batch inserts is also
| then bad at retrieval time, where related records are looked up
| randomly later.
|
| So you pay taxes at both insert time and later during
| selection.
| perrygeo wrote:
| The 25% increase in size is true but it's 8 bytes, a small and
| predictable linear increase per row. Compared to the rest of
| the data in the row, it's not much to worry about.
|
| The bigger issue is insert rate. Your insert rate is limited by
| the amount of available RAM in the case of UUIDs. That's not
| the case for auto-incrementing integers! Integers are
| correlated with time while UUID4s are random - so they have
| fundamentally different performance characteristics at scale.
|
| The author cites 25% but I'd caution every reader to take this
| with a giant grain of salt. At the beginning, for small tables
| < a few million rows, the insert penalty is almost negligible.
| If you did benchmarks here, you might conclude there's no
| practical difference.
|
| As your table grows, specifically as the size of the btree
| index starts reaching the limits of available memory, postgres
| can no longer handle the UUID btree entirely in memory and has
| to resort to swapping pages to disk. An auto-incrementing
| integer type won't have this problem since rows close in time
| will use the same index page and thus don't need to hit disk at
| all under the same load.
|
| Once you reach this scale, the difference in speed is orders of
| magnitude. It's NOT a steady 25% performance penalty, it's a
| 25x performance cliff. And the only solution (aside from a
| schema migration) is to buy more RAM.
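|
| A rough way to see this for yourself (sketch only; assumes
| Postgres 13+ for gen_random_uuid(), and the exact numbers
| depend heavily on shared_buffers and row width):
|
|     CREATE TABLE t_serial (
|       id  bigserial PRIMARY KEY,
|       pad text
|     );
|     CREATE TABLE t_uuid (
|       id  uuid DEFAULT gen_random_uuid() PRIMARY KEY,
|       pad text
|     );
|
|     INSERT INTO t_serial (pad)
|       SELECT repeat('x', 100) FROM generate_series(1, 10000000);
|     INSERT INTO t_uuid (pad)
|       SELECT repeat('x', 100) FROM generate_series(1, 10000000);
|
|     SELECT pg_size_pretty(pg_relation_size('t_serial_pkey')),
|            pg_size_pretty(pg_relation_size('t_uuid_pkey'));
|
| The gap in insert time and index size only becomes dramatic
| once the uuid index no longer fits in memory.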
| vinnymac wrote:
| I always wondered what that extra "I" in the CI variables
| `CI_PIPELINE_IID` and `CI_MERGE_REQUEST_IID` was for. Always
| assumed it was a database-related choice, but this article
| confirms it.
| zetalyrae wrote:
| The point about the storage size of UUID columns is unconvincing.
| 128 bits vs. 64 bits doesn't matter much when the table has five
| other columns.
|
| A much more salient concern for me is performance. UUIDv4 is
| widely supported but is completely random, which is not ideal for
| index performance. UUIDv7[0] is closer to Snowflake[1] and has
| some temporal locality but is less widely implemented.
|
| There's an orthogonal approach which is using bigserial and
| encrypting the keys: https://github.com/abevoelker/gfc64
|
| But this means 1) you can't rotate the secret and 2) if it's ever
| leaked everyone can now Fermi-estimate your table sizes.
|
| Having separate public and internal IDs seems both tedious and
| sacrifices performance (if the public-facing ID is a UUIDv4).
|
| I think UUIDv7 is the solution that checks the most boxes.
|
| [0]: https://uuid7.com/
|
| [1]: https://en.wikipedia.org/wiki/Snowflake_ID
| newaccount7g wrote:
| If I've learned anything in my 7 years of software development
| it's that this kind of expertise is just "blah blah blah" that
| will get you fired. Just make the system work. This amount of
| trying to anticipate problems will just screw you up. I
| seriously can't imagine a situation where knowing this would
| actually improve the performance noticeably.
| canadiantim wrote:
| Would it ever make sense to have a uuidv7 as primary key but
| then another slug field for a public-id, e.g. one that is
| shorter and better in a url or even allowing user to customize
| it?
| Horffupolde wrote:
| Yes sure but now you have to handle two ids and guaranteeing
| uniqueness across machines or clusters becomes hard.
| vrosas wrote:
| That and a uuid is going to be unique across all tables and
| objects, whereas a slug will only be unique within a
| certain subset e.g. users within an organization. I've seen
| a production issue IRL where someone (definitely not me)
| wrote a query fetching objects by slug and forgot to
| include the 'AND parent_slug = xxx'
| gfody wrote:
| think of the primary keys in a database like typedef void* ie
| it's your fundamental pointer and the size of it will impact
| every aspect of performance throughout - memory/disk footprint
| and corresponding throughput bottlenecks, cpu time comparing
| keys which is what every operation reduces to in the deepest
| inner-most loops of joins and lookups etc.
|
| when x86-64 cpus were new the performance impact from switching
| to 64-bit pointers was so bad we had to create x32/ilp32 and
| the reason .NET still has "prefer 32-bit" as a default even
| today.
|
| using 128-bit uuids as PKs in a database is an awful mistake
| Merad wrote:
| > The point about the storage size of UUID columns is
| unconvincing. 128 bits vs. 64 bits doesn't matter much when the
| table has five other columns.
|
| But it's not just the size of that one column, it's also the
| size of all the places that id is used as a FK and the indexes
| that may be needed on those FK columns. Think about something
| like a user id that might be referenced by dozens or even
| hundreds of FKs throughout your database.
| solidsnack9000 wrote:
| ...and this has not just a size impact but also a substantial
| performance impact.
| nine_k wrote:
| It very much does when you have a ton of FKs (enforced or not)
| using such a column, and thus indexed and used in many joins.
| Making it twice as hard for the hot part of an index to fit to
| RAM is never good for performance, nor for the cloud bill.
|
| If you have a column that is used in many joins, there are
| performance reasons to make it as compact as possible (but not
| smaller).
| s4i wrote:
| The v7 isn't a silver bullet. In many cases you don't want to
| leak the creation time of a resource. E.g. you want to upload a
| video a month before making it public to your audience without
| them knowing.
| solidsnack9000 wrote:
| > There's an orthogonal approach which is using bigserial and
| encrypting the keys...
|
| Another variant of this approach:
| https://pgxn.org/dist/permuteseq/
|
| It is also feasible to encrypt the value on display (when
| placing it in URLs, emails, &c):
|
| https://wiki.postgresql.org/wiki/Pseudo_encrypt
|
| This maintains many of the benefits of sequential indexes and
| _does_ allow you to change the key. However, if the key is
| changed, it would break any bookmarks, invalidate anything sent
| in older emails -- it would have the same effect as renaming
| everything.
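|
| For reference, the wiki function is a tiny Feistel network over
| the two 16-bit halves of the int, roughly along these lines
| (adapted from the wiki page; any round function would do, what
| matters is that the construction is a bijection):
|
|     CREATE FUNCTION pseudo_encrypt(value int) RETURNS int AS $$
|     DECLARE
|       l   int := (value >> 16) & 65535;
|       r   int := value & 65535;
|       tmp int;
|       i   int := 0;
|     BEGIN
|       WHILE i < 3 LOOP
|         tmp := l # ((((1366 * r + 150889) % 714025)
|                      / 714025.0) * 32767)::int;
|         l := r;
|         r := tmp;
|         i := i + 1;
|       END LOOP;
|       RETURN (r << 16) + l;
|     END;
|     $$ LANGUAGE plpgsql IMMUTABLE STRICT;
|
| Each int maps to a unique scrambled int, so you can expose
| pseudo_encrypt(id) in URLs while keeping a plain sequence
| underneath - but, as noted above, changing the function later
| breaks every existing link.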
| bluerooibos wrote:
| Has anyone written about or noticed the performance differences
| between Gitlab and GitHub?
|
| They're both Rails-based applications but I find page load times
| on Gitlab in general to be horrific compared to GitHub.
| heyoni wrote:
| I mean GitHub in general has been pretty reliable minus the two
| outages they had last year and is usually pretty performant or
| I wouldn't use their keyboard shortcuts.
|
| There are some complaints here from a former dev about gitlab
| that might provide insight into its culture and lack of regard
| for performance: https://news.ycombinator.com/item?id=39303323
|
| Ps: I do not use gitlab enough to notice performance issues but
| thought you might appreciate the article
| anoopelias wrote:
| More comments on this submission:
| https://news.ycombinator.com/item?id=39333220
| imiric wrote:
| > I mean GitHub in general has been pretty reliable minus the
| two outages they had last year
|
| Huh? GitHub has had major outages practically every other
| week for a few years now. There are pages of HN threads[1].
|
| There's a reason why githubstatus.com doesn't show historical
| metrics and uptime percentages: it would make them look
| incompetent. Many outages aren't even officially reported
| there.
|
| I do agree that when it's up, performance is typically better
| than Gitlab's. But describing GH as reliable is delusional.
|
| [1]: https://hn.algolia.com/?dateRange=all&page=0&prefix=fals
| e&qu...
| heyoni wrote:
| Delusional? Anecdotal maybe...I was describing my
| experience so thanks for elaborating.
|
| I only use it as a code repository. Was it specific
| services within GitHub that failed a lot?
| imiric wrote:
| My apologies, that came off a bit harsh.
|
| > Was it specific services within GitHub that failed a
| lot?
|
| Well, everything from its frontends, to the CI service,
| to the Git service itself. People weren't able to push,
| pull, deploy or review code for hours at a time. Just
| major disruptions all around, which happened regularly.
|
| I do think this has improved slightly over the past few
| months, but you were lucky if you weren't affected much.
| heyoni wrote:
| No worries you're fine. I guess downtime was much more
| widespread than I could even imagine.
|
| I do remember one time all of us at work joking that we
| should get the day off when GitHub was down. :D
| mvdtnz wrote:
| You can't throw around specific metrics ("two outages
| they had last year") then claim you were just sharing a
| personal anecdote when someone calls you out and brings
| receipts. Just take the L.
| heyoni wrote:
| I absolutely can because this is a discussion. Being
| called delusional for sharing my experience on a platform
| I use _at least_ 40 hours a week was a bit much but in
| case you missed it, I relented. They brought receipts and
| I came down from my position because there was none. I
| wasn't arguing with anyone or saying anyone was wrong
| about anything.
|
| Do you expect all of your conversations to be backed by
| facts and citation? Is that why the only comments you
| make are aggressive and confrontational? Because it's
| easier than the bar you're setting for everyone else?
| Yea, I looked back at your history, not a single url.
|
| And now I'm taking a position.
|
| > No worries you're fine. I guess downtime was much more
| widespread than I could even imagine.
|
| Did you miss that?
| golergka wrote:
| I used Gitlab a few years ago, but then it had severe client-
| side performance problems on large pull requests. Github isn't
| ideal with them too, but it manages to be decent.
| wdb wrote:
| Yeah, good reason to split up pull requests ;) I do think it
| improved a lot over the last two years, though
| oohffyvfg wrote:
| > compared to GitHub.
|
| this is like comparing chrome and other browsers, even chromium
| based.
|
| chrome and github will employ all tricks in the book, even if
| they screw you. for example, how many hours of despair I've
| wasted when manually dissecting a git history on employer
| github by opening merge diffs, hitting ctrl F, seeing no
| results and moving to the next... only to find on the 100th
| diff that, deep down the diff list, they had hidden the most
| important file because it was more convenient for them (so one
| team lead could hit some page load metric and get a promotion)
| gfody wrote:
| > 1 quintillion is equal to 1000000000 billions
|
| it is pretty wild that we generally choose between int32 and
| int64. we really ought to have a 5 byte integer type which would
| support cardinalities of ~1T
| klysm wrote:
| Yeah it doesn't make sense to pick something that's not a power
| of 2 unless you are packing it.
| postalrat wrote:
| I guess that depends on how the index works.
| gfody wrote:
| we usually work with some page size anyways, 64 5-byte ints
| fit nicely into 5 512-bit registers, and ~1T is a pretty
| flexible cardinality limit after ~2B
| sidcool wrote:
| Great read! And even better comments here.
| exabrial wrote:
| "Foreign keys are expensive" is an oft-repeated, rarely
| benchmarked claim. There are tons of ways to do it incorrectly.
| But in your stack you are always enforcing integrity
| _somewhere_ anyway. Leveraging the database instead of
| reimplementing it requires knowledge and experimentation, and
| more often than not it will save your bacon.
| rob137 wrote:
| I found this post very useful. I'm wondering where I could find
| others like it?
| aflukasz wrote:
| I recommend Postgres FM podcast, e.g. available as video on
| Postgres TV yt channel. Good content on its own, and many
| resources of this kind are linked in the episode notes. I
| believe one of the authors even helped Gitlab specifically with
| Postgres performance issues not that long ago.
| cosmicradiance wrote:
| Here's another - https://zerodha.tech/blog/working-with-
| postgresql/
| yellowapple wrote:
| > It is generally a good practice to not expose your primary keys
| to the external world. This is especially important when you use
| sequential auto-incrementing identifiers with type integer or
| bigint since they are guessable.
|
| What value would there be in preventing guessing? How would that
| even be possible if requests have to be authenticated in the
| first place?
|
| I see this "best practice" advocated often, but to me it reeks of
| security theater. If an attacker is able to do anything useful
| with a guessed ID without being authenticated and authorized to
| do so, then something else has gone horribly, horribly,
| _horribly_ wrong and _that_ should be the focus of one's energy
| instead of adding needless complexity to the schema.
|
| The only case I know of where this might be valuable is from a
| business intelligence standpoint, i.e. you don't want competitors
| to know how many customers you have. My sympathy for such
| concerns is quite honestly pretty low, and I highly doubt GitLab
| cares much about that.
|
| In GitLab's case, I'm reasonably sure the decision to use id +
| iid is less driven by "we don't want people guessing internal
| IDs" and more driven by query performance needs.
| s4i wrote:
| It's mentioned in the article. It's more to do with business
| intelligence than security. A simple auto-incrementing ID will
| reveal how many total records you have in a table and/or their
| growth rate.
|
| > If you expose the issues table primary key id then when you
| create an issue in your project it will not start with 1 and
| you can easily guess how many issues exist in the GitLab.
| JimBlackwood wrote:
| I follow this best practice, there's a few reasons why I do
| this. It doesn't have to do with using a guessed primary ID for
| some sort of privilege escalation, though. It has more to do
| with not leaking any company information.
|
| When I worked for an e-commerce company, one of our biggest
| competitors used an auto-incrementing integer as primary key on
| their "orders" table. Yeah... You can figure out how this was
| used. Not very smart by them, extremely useful for my employer.
| Neither of these will allow security holes or leak customer
| info/payment info, but you'd still rather not leak this.
| tomnipotent wrote:
| > extremely useful for my employer.
|
| I've been in these shoes before, and finding this information
| doesn't help you as an executive or leader make any better
| decisions than you could have before you had the data. No
| important decision is going to be swayed by something like
| this, and any decision that is probably wasn't important.
|
| Knowing how many orders is placed isn't so useful without
| average order value or items per cart, and the same is true
| for many other kinds of data gleaned from this method.
| JimBlackwood wrote:
| That's not correct. Not every market is the same in its
| dynamics.
|
| Yes, most of the time that information was purely
| insightful and was simply monitored. However, at some
| moments it definitely drove important decisions.
| tomnipotent wrote:
| Like what?
|
| What's going to change how a team develops (physical)
| products? What's a merchandiser or buyer going to learn
| that influences how they spend millions of dollars or
| deal with X weeks-on-hand of existing inventory? What's
| an operations manager going to learn that improves their
| ability to warehouse and ship product? How's marketing
| going to change their strategy around segmentation or
| channel distribution? What's a CEO going to learn that
| changes what departments or activities they want to
| invest in?
|
| At best you get a few little tidbits of data you can
| include in presentations or board decks, but nothing
| that's going to influence critical decisions on how money
| is getting spent to get the job done or how time is
| getting allocated to projects. Worst case you have an
| inexperienced CEO that's chasing rather than leading, and
| just end up copying superficial aspects of your
| competitors without the context or understanding of why
| they did what they did.
|
| I've called up execs at competitors and had friendly
| chats that revealed more about their business in 30
| minutes than you could possibly find out through this
| "method".
| lordgrenville wrote:
| > The only case where this might be valuable is business
| intelligence
|
| Nitpick: I would not call this "business intelligence" (which
| usually refers to internal use of the company's own data) but
| "competitive intelligence".
| https://en.wikipedia.org/wiki/Competitive_intelligence
| EE84M3i wrote:
| See also "German Tank Problem"
| https://en.m.wikipedia.org/wiki/German_tank_problem
| mnahkies wrote:
| It can be really handy for scraping/archiving websites if
| they're kind enough to use a guessable id
| tetha wrote:
| > I see this "best practice" advocated often, but to me it
| reeks of security theater. If an attacker is able to do
| anything useful with a guessed ID without being authenticated
| and authorized to do so, then something else has gone horribly,
| horribly, horribly wrong and that should be the focus of one's
| energy instead of adding needless complexity to the schema.
|
| Yes, but the ability to guess IDs can make this security issue
| horrible, or much much worse.
|
| If you had such a vulnerability and you are exposing the users
| to UUIDs, now people have to guess UUIDs. Even a determined
| attacker will have a hard time doing that or they would need
| secondary sources to get the IDs. You have a data breach, but
| you most likely have time to address it and then you can assess
| the amount of data lost.
|
| If you can just <seq 0 10000 | xargs -I ID curl
| service/ticket/ID> the security issue is instantly elevated
| onto a whole new level. Suddenly all data is leaked without
| further effort and we're looking at mandatory report to data
| protection agencies with a massive loss of data.
|
| To me, this is one of these defense in depth things that should
| be useless. And it has no effect in many, many cases.
|
| But there is truly horrid software out there that has been
| popped in exactly the described way.
| dijit wrote:
| Case in point, a recent security issue Gitlab experienced
| (CVE-2023-7028; arbitrary password reset by knowing one of
| the accounts associated mail addresses) was made worse by a
| feature of gitlab that few people know about; that the
| "userID" is associated with a meta/internal mail address.
|
| This meant that people could send password resets for any
| user if they knew their userID. The mail format was like
| user-1@no-reply.gitlab.com or something.
|
| Since it's a safe bet that "user ID 1" is an admin user,
| someone weaponised this.
| plagiarist wrote:
| I've already resolved to never use Gitlab entirely on the
| basis of that CVE but that makes it worse.
|
| Password resets should just never go to an email that
| hasn't been deliberately attached to an account by the
| account's owner, full stop. There should not be a code path
| where it is possible to send any such thing to arbitrary
| emails. And redirect emails should never be treated as
| account emails in any way.
| yellowapple wrote:
| Even without that auto-incrementing ID, there are plenty of
| other options for guessing valid email addresses to use
| with that exploit. For example, if you're able to figure
| out the format an organization uses for their email
| addresses (e.g. first.last@company.com), and you're able to
| figure out who works at that org (via e.g. LinkedIn), then
| there's a very good chance you can reset passwords for,
| say, the company's CTO or other likely-highly-privileged
| users.
|
| That is: this kind of proves my point. Removing
| autoincrementing IDs from the equation is of minimal
| benefit when things have already gone horribly horribly
| wrong like this. It's a _little_ bit more work on the
| attacker's part, but not by anywhere near enough for such
| a "mitigation" to be of much practical benefit.
| remus wrote:
| In general it's a defense-in-depth thing. You definitely
| shouldn't be relying on it, but as an attacker it just makes
| your life a bit harder if it's not straightforward to work out
| object IDs.
|
| For example, imagine you're poking around a system that uses
| incrementing ints as public identifiers. Immediately, you can
| make a good guess that there's probably going to be some high
| privileged users with user_id=1..100 so you can start probing
| around those accounts. If you used UUIDs or similar then you're
| not leaking that info.
|
| In gitlabs case this is much less relevant, and it's more of a
| cosmetic thing.
| serial_dev wrote:
| > In gitlabs case this is much less relevant (...)
|
| Why, though? GitLab is often self hosted, so being able to
| iterate through objects, like users, can be useful for an
| attacker.
| remus wrote:
| You're right, fair point.
| yellowapple wrote:
| In my experience self-hosted GitLabs are rarely publicly-
| accessible in the first place; they're usually behind some
| sort of VPN.
|
| As for an attacker being able to iterate through users, if
| that information is supposed to be private, and yet an
| attacker is getting anything other than a 404, then that's
| a problem in and of itself and my energy would be better
| spent fixing that.
| heax wrote:
| It really depends, but useful knowledge can be derived from
| this. If user accounts use sequential ids, the id 1 is most
| likely the admin account that was created as the first user.
| metafunctor wrote:
| Bugs happen also in access control. Unguessable IDs make it
| much harder to exploit some of those bugs. Of course the focus
| should be on ensuring correct access control in the first
| place, but unguessable IDs can make the difference between a
| horrible disaster and a close call.
|
| It's also possible to use auto-incrementing database IDs and
| encrypt them, if using UUIDs doesn't work for you. With
| appropriate software layers in place, encrypted IDs work more
| or less automatically.
| worksonmine wrote:
| > What value would there be in preventing guessing?
|
| It prevents enumeration, which may or may not be a problem
| depending on the data. If you want to build a database of user
| profiles it's much easier with incremental IDs than UUID.
|
| It is at least a data leak but can be a security issue. Imagine
| a server that, on a wrong password, correctly returns "invalid
| username OR password" to prevent enumeration. If you can still
| crawl all IDs and figure out if someone has an account that
| way, it helps filter out what username and password
| combinations to try from previous leaks.
|
| Hackers are creative and security is never about any single
| protection.
| yellowapple wrote:
| > If you can still crawl all IDs and figure out if someone
| has an account that way it helps filter out what username and
| password combinations to try from previous leaks.
|
| Right, but like I suggested above, if you're able to get any
| response other than a 404 for an ID other than one you're
| authorized to access, then that in and of itself is a severe
| issue. So is being able to log in with that ID instead of an
| actual username.
|
| Hackers are indeed creative, but they ain't wizards. There
| are countless other things that would need to go horribly
| horribly wrong for an autoincrementing ID to be useful in an
| attack, and the lack of autoincrementing IDs doesn't really
| do much in practice to hinder an attacker once those things
| have gone horribly, horribly wrong.
|
| I can think of maybe one exception to this, and that's with
| e-commerce sites providing guest users with URLs to their
| order/shipping information after checkout. Even this is
| straightforward to mitigate (e.g. by generating a random
| token for each order and requiring it as a URL parameter),
| and is entirely inapplicable to something like GitLab.
| SkyMarshal wrote:
| _> This is especially important when you use sequential auto-
| incrementing identifiers with type integer or bigint since they
| are guessable._
|
| I thought we had long since moved past that to GUIDs or UUIDs
| for primary keys. Then if you still need some kind of
| sequential numbering that has meaning in relation to the other
| fields, make a separate column for that.
| coldtea wrote:
| > _I see this "best practice" advocated often, but to me it
| reeks of security theater._
|
| The idea of "security theater" is overplayed. Security can be
| (and should be) multilayered, it doesn't have to be all or
| nothing. So that, when they break a layer (say the
| authentication), they shouldn't automatically gain easy access
| to the others
|
| > _If an attacker is able to do anything useful with a guessed
| ID without being authenticated and authorized to do so, then
| something else has gone horribly, horribly, horribly wrong and
| that should be the focus of one 's energy instead of adding
| needless complexity to the schema._
|
| Sure. But by that time, it will be game over if you don't
| also have the other layers in place.
|
| The thing is that you can't anticipate every contingency. Bugs
| tend to not preannounce themselves, especially tricky nuanced
| bugs.
|
| But when they do appear, and a user can "do [something] useful
| with an ID without being authenticated and authorized to do so"
| you'd be thanking all available Gods that you at least made the
| IDs not guessable - which would otherwise also give them access
| to every user account on the system.
| kehers wrote:
| One good argument I found [^1] about not exposing primary keys
| is that primary keys may change (during system/db change) and
| you want to ensure users have a consistent way of accessing
| data.
|
| [^1]:
| https://softwareengineering.stackexchange.com/questions/2183...
| azlev wrote:
| It's reasonable to not have auto-increment ids, but it's not
| clear to me if there are benefits to having 2 IDs, one internal and
| one external. This increases the number of columns / indexes,
| makes you always do a lookup first, and I can't see a security
| scenario where I would change the internal key without changing
| the external key. Am I missing something?
| Aeolun wrote:
| You always have the information at hand anyway when doing
| anything per project. It's also more user friendly to have
| every project's issues start with 1 instead of starting with
| two trillion, seven hundred billion, three hundred and five
| million, seven hundred and seventeen thousand, three hundred
| and twenty-five.
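|
| A simple way to hand out those per-project numbers (a sketch,
| not necessarily what GitLab actually does) is a small counter
| table that gets bumped atomically per project:
|
|     CREATE TABLE issue_counters (
|       project_id bigint PRIMARY KEY,
|       last_iid   bigint NOT NULL DEFAULT 0
|     );
|
|     -- allocate the next issue number for project 42
|     INSERT INTO issue_counters AS c (project_id, last_iid)
|     VALUES (42, 1)
|     ON CONFLICT (project_id)
|       DO UPDATE SET last_iid = c.last_iid + 1
|     RETURNING last_iid;
|
| The row lock taken by the upsert serialises issue creation
| within a project but doesn't touch any other project.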
| traceroute66 wrote:
| Slight nit-pick, but I would pick up the author on the text vs
| varchar section.
|
| The author effectively wastes many words trying to prove a non-
| existent performance difference and then concludes "there is not
| much performance difference between the two types".
|
| This horse bolted a long time ago. It's not "not much", it's
| "none".
|
| The Postgres Wiki[1] explicitly tells you to use text unless you
| have a very good reason not to. And indeed the docs themselves[2]
| tell us that "For many purposes, character varying acts as though
| it were a domain over text" and further down in the docs in the
| green Tip box, "There is no performance difference among these
| three types".
|
| Therefore Gitlab's use of (mostly) text would indicate that they
| have RTFM and that they have designed their schema for their
| choice of database (Postgres) instead of attempting to implement
| some stupid "portable" schema.
|
| [1]
| https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use...
| [2] https://www.postgresql.org/docs/current/datatype-
| character.h...
| alex_smart wrote:
| >The author effectively wastes many words trying to prove a
| non-existent performance difference and then concludes "there
| is not much performance difference between the two types".
|
| They then also show that there is in fact a significant
| performance difference when you need to migrate your schema to
| accommodate a change in the length of strings being stored.
| Altering a table to change a column from varchar(300) to
| varchar(200) needs to rewrite every single row, whereas
| updating the constraint on a text column is essentially free:
| just a full table scan to ensure that the existing values
| satisfy your new constraints.
|
| FTA:
|
| >So, as you can see, the text type with CHECK constraint allows
| you to evolve the schema easily compared to character varying
| or varchar(n) when you have length checks.
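|
| Concretely (a minimal example, not GitLab's actual DDL):
|
|     -- varchar(n): shrinking the limit is a type change,
|     -- i.e. a table rewrite under ACCESS EXCLUSIVE
|     ALTER TABLE issues
|       ALTER COLUMN title TYPE varchar(200);
|
|     -- text + CHECK: just swap the constraint
|     ALTER TABLE issues
|       ADD CONSTRAINT title_length_200
|       CHECK (char_length(title) <= 200) NOT VALID;
|     ALTER TABLE issues VALIDATE CONSTRAINT title_length_200;
|
| ADD CONSTRAINT ... NOT VALID only needs a brief lock, and
| VALIDATE CONSTRAINT does its full scan without blocking normal
| reads and writes.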
| traceroute66 wrote:
| > They then also show that there is in fact a significant
| performance difference when you need to migrate your schema
| to accodomate a change in length of strings being stored.
|
| Which is a pointless demonstration if you RTFM and design
| your schema correctly, using text, just like the manual and
| the wiki tells you to.
|
| > the text type with CHECK constraint allows you to evolve
| the schema easily compared to character varying or varchar(n)
| when you have length checks.
|
| Which is exactly what the manual tells you ....
|
| "For many purposes, character varying acts as though it were
| a domain over text"
| alex_smart wrote:
| And what is wrong with someone figuring out for themselves
| and explaining to others why some suggestion makes sense
| logically, rather than just quoting the manual?
| eezing wrote:
| We shouldn't assume that this schema was designed all at once,
| but rather is the product of evolution. For example, maybe the
| external_id was added after the initial release in order to
| support the creation of unique ids in the application layer.
___________________________________________________________________
(page generated 2024-02-18 23:01 UTC)