https://github.com/vishesh92/pg-primer
vishesh92/pg-primer: Beginner's guide to administering/managing PostgreSQL
Intro - What is this doc?
Managing a PostgreSQL database by yourself is hard, especially when you
have little experience with databases. I started by writing simple
queries for analytics, but the systems were also facing other issues,
which pushed me to delve deeper into Postgres. Most of those issues were
quite simple and existed only because there was no DBA/DBRE. Along the
way I came across Accidental DBA. It helped me quite a bit at the time,
but there were still a few things that I picked up over the years. I
wrote this document to share what I have learned and to make it a
little easier for another Accidental DBA to work with Postgres. Since I
have primarily worked with RDS Postgres (the managed Postgres service
provided by AWS), a few things related to setting up Postgres yourself
may be missing here.
Setting up postgres
To connect to Postgres, you need psql or any other client that supports
Postgres. You can download psql (part of the postgresql-client package)
from here according to your operating system.
If you just want to play around with Postgres, you can easily set it up
locally using Docker. You can install Docker by following its official
documentation here.
After setting up Docker, run this command in your terminal to start a
Postgres container in the background. (The official image refuses to
start without a password, so a throwaway one is set here.)
docker run --name postgres-playground -e POSTGRES_PASSWORD=postgres -d postgres
Once the container is ready, you can connect to Postgres with this
command.
docker exec -it postgres-playground psql -U postgres postgres
To stop and remove the container:
docker stop postgres-playground
docker rm postgres-playground
The above setup is only for playing with Postgres and is not meant for
production use.
A brief overview of MVCC and the architecture/internals of Postgres
What is MVCC?
PostgreSQL provides concurrent access to the database using MVCC
(Multi-Version Concurrency Control). Without it, if someone wrote a row
while someone else read the same row simultaneously, the reader could
see missing or inconsistent data. MVCC provides Isolation, which
guarantees that concurrent transactions do not see each other's
uncommitted changes.
Isolation
Let's assume a bank has only two users with $50 each, so the total
balance in the bank is $100. Now user A transfers $50 to user B. If the
bank manager sums the balances before the transfer starts (T1), they
see $100. If they sum them again while the transfer transaction is
still in progress (T3), the output is still $100, because Postgres
provides isolation and the two transactions have different views of the
table.
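The scenario above can be reproduced in two psql sessions. This is a minimal sketch; the table and column names are made up for the illustration.

```sql
-- Illustrative schema for the bank example.
CREATE TABLE accounts (holder text PRIMARY KEY, balance numeric);
INSERT INTO accounts VALUES ('A', 50), ('B', 50);

-- Session 1 (the transfer):
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE holder = 'A';
UPDATE accounts SET balance = balance + 50 WHERE holder = 'B';
-- not yet committed

-- Session 2 (the bank manager), while session 1 is still open:
SELECT sum(balance) FROM accounts;  -- still 100: sees the pre-transfer snapshot

-- Session 1:
COMMIT;
```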
How does PostgreSQL implement MVCC?
Every table in Postgres has some additional system columns. ctid is one
such column; it stores the physical location of a row. You can use a
query like this to get the ctid of a row (replace <table_name> with
your table):
SELECT ctid, * FROM <table_name> LIMIT 10;
(Figure: MVCC implementation)
Deleting a row updates the system columns for that row so that it's not
visible to future transactions. Updating a row creates a new copy of
the row and updates the previous version so that it's not visible to
future transactions.
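The system columns involved can be inspected directly. A small sketch, assuming an illustrative table name:

```sql
-- xmin: the transaction that created this row version.
-- xmax: the transaction that deleted or updated it (0 while the row is live).
-- The table name tbl is illustrative.
SELECT ctid, xmin, xmax, * FROM tbl;
```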
VACUUM
Because of how MVCC is implemented, tuples that are updated or deleted
are not physically removed from their table. This bloats a table if
VACUUM is not run on it frequently. To keep this in check, you can run
VACUUM manually or make sure autovacuum is running. VACUUM goes through
each table and marks older, no-longer-visible tuple versions as
reusable. VACUUM doesn't return disk space to the operating system, but
the reclaimed space can be reused for future inserts on that table. To
actually shrink the table and completely remove bloat, you can run
VACUUM FULL, but it takes an exclusive lock on the table.
It's not recommended to run VACUUM FULL on a production database.
MVCC Exercise
1. Create an empty table.
CREATE TABLE tbl (id bigserial primary key, col text);
2. Insert two rows and check their physical locations.
INSERT INTO tbl(col) VALUES ('a'), ('b');
SELECT ctid, * FROM tbl;
3. Delete the row where the col value is a and check the physical
location of the remaining row.
DELETE FROM tbl WHERE col = 'a';
SELECT ctid, * FROM tbl;
4. Update the row where the col value is b and check its physical
location.
UPDATE tbl SET col = 'c' WHERE col = 'b';
SELECT ctid, * FROM tbl;
You will notice that the physical location of that row has now
changed.
5. Run VACUUM FULL and check the physical locations again.
VACUUM FULL tbl;
SELECT ctid, * FROM tbl;
You will notice that the physical location has changed again after
running VACUUM FULL.
Additional Resources
* MVCC Unmasked by Bruce Momjian - Slides | Video
* Postgres, MVCC, and you or, Why COUNT(*) is slow by David Wolever
- Slides | Video
psql
psql is the official CLI shipped with PostgreSQL. It's good to know how
to move around a database, especially during an incident, and psql is a
perfect tool for that. Check this cheat sheet to get familiar with
psql.
Query Optimisation
To fix slow or poorly written queries and ensure optimal performance
for your database, use EXPLAIN and EXPLAIN ANALYZE to identify
bottlenecks in query execution. Check this to get an idea of how
queries get executed. After identifying the issue, you can create an
index, rewrite the query, or add more resources, depending on the use
case.
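As a sketch of the two commands (the table and column names here are illustrative):

```sql
-- EXPLAIN only shows the plan the planner would choose; the query is
-- not executed.
EXPLAIN SELECT * FROM tbl1 WHERE col1 = 10;

-- EXPLAIN ANALYZE actually runs the query and reports real row counts
-- and timings; the BUFFERS option adds shared-buffer hit/read counts,
-- which is useful for spotting I/O-bound queries.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tbl1 WHERE col1 = 10;
```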
Monitoring, Observability & Reporting
Opensource monitoring tools
You can set up one of these tools to get better visibility into your
database:
* pgwatch
* pgMonitor
* Percona Monitoring and Management
You can also set up pgbadger, a tool that parses logs and generates a
report on database usage and workload. You can use this to find slow
queries that need fixing, or to tune PostgreSQL parameters for your
workload. Since pgbadger works on logs, you won't get a real-time view
of your database instance.
System metrics
There are a lot of metrics you might want to track, but these are among
the most important:
* CPU
* Memory
* Connections
* IOPS
Performance Insights in AWS RDS
Performance Insights is a feature of RDS which shows you running
queries in real-time.
Index Types
Most of the time, performance problems in a database are due to a
missing index. There are different types of indexes available, and some
can give good performance gains depending on the use case. By default,
Postgres creates a B-Tree index.
While creating an index, I try to follow these rules:
* Don't create an index if it's not going to be used. Unnecessary
indexes will slow down your writes.
* Multi-column indexes: If your queries have multiple conditions, a
multi-column index can help. The order of columns matters here.
Let's assume you have a table with the structure below, with a
B-Tree index on (col1, col2).
Table "public.tbl1"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+----------------------------------
id | integer | | not null | nextval('tbl1_id_seq'::regclass)
col1 | integer | | |
col2 | integer | | |
Indexes:
"tbl1_col1_col2_idx" btree (col1, col2)
If you make a query like:
* SELECT * FROM tbl1 WHERE col1 = 10 AND col2 = 20 - Index scan
* SELECT * FROM tbl1 WHERE col1 = 10 - Index scan
* SELECT * FROM tbl1 WHERE col2 = 20 - Usually a sequential scan,
since the index is ordered by col1 first; the planner's choice
depends on the data distribution and the WHERE clause.
* Partial indexes: If part of the WHERE clause is always a fixed
condition, e.g. WHERE active = TRUE AND col1 = ? where active =
TRUE is the only condition ever placed on the active column in
your queries, you can create a partial index. Partial indexes
are smaller and more performant as well.
* Indexes on expressions: You can also create an index on an
expression (e.g. lower(textcol1)). If queries on a table use some
expression, it's a good idea to create an index on that
expression.
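The three kinds of indexes above can be sketched as follows, reusing the tbl1 example and assuming illustrative active and textcol1 columns:

```sql
-- Multi-column index: serves predicates on (col1) and on (col1, col2).
CREATE INDEX ON tbl1 (col1, col2);

-- Partial index: only rows with active = TRUE are indexed, so the
-- index stays small. (Assumes a boolean "active" column.)
CREATE INDEX ON tbl1 (col1) WHERE active = TRUE;

-- Expression index: lets WHERE lower(textcol1) = '...' use an index.
CREATE INDEX ON tbl1 (lower(textcol1));
```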
Additional Resources
* Index Types
* Get rid of your unused indexes!
Identifying ongoing issues
Run queries/connections_per_user.sql and check whether max_running_time
is high for the active and idle in transaction states. There is a
problem if the number of connections, or max_running_time for queries
in the active or idle in transaction state, is high. "High" is
subjective here and depends on the database size and type of workload.
In my experience, the number of active connections should be less than
the number of the database's CPU cores, and max_running_time should be
less than 1 second.
* If the time for idle in transaction queries is high, then
+ either your application is taking time to commit transactions
because it's under heavy load or is doing some time-consuming
task before committing, or the application is not handling
transactions properly;
+ or some dev started a transaction, didn't commit it, and left
the connection open.
* If the time for active queries is high, then
+ Queries are in a waiting state. Queries can wait for multiple
reasons. In the queries/active_running_queries.sql output,
check wait_event_type and wait_event to figure out why the
query is waiting. Reference:
https://www.postgresql.org/docs/current/monitoring-stats.html#WAIT-EVENT-TABLE
(it's better to check the documentation for the Postgres
version you are working with). Most of the time, it's because:
o two or more connections are contending for a lock on the
same row;
o someone executed migrations, and because of an idle in
transaction or active query, the DDL took a lock on the
table, blocking queries on it.
+ The query is expensive and is simply taking time. In this case
wait_event_type and wait_event are null, or the query is
spending time reading data from disk (wait_event_type is IO).
If the number of queries is also high, your database is
probably under high CPU utilisation. Fixes for this:
o Check if an index is missing. You can use EXPLAIN and
EXPLAIN ANALYZE (NEVER use EXPLAIN ANALYZE on a query
that modifies data; it actually executes the query).
o Provide more memory so that indexes or hot data can fit
in memory. For this, check disk reads; if they are high,
tune parameters (work_mem, shared_buffers,
effective_cache_size) or increase memory for the
instance.
o Increase CPU if the number of queries executed per second
is high.
Having connections in the idle in transaction state can cause a lot of
issues in the long run. An open transaction holds back the
oldest-transaction horizon, so VACUUM cannot remove dead tuples created
after it started, and table sizes may keep increasing.
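A minimal query for spotting such sessions, using the standard pg_stat_activity view (the repo's queries/ directory has more complete versions):

```sql
-- Sessions sitting in an open transaction, oldest first.
SELECT pid, usename, state,
       now() - xact_start AS xact_age,
       query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
```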
Performance Tuning
* Connection pooling - every PostgreSQL connection is a forked
process on the database server. Lots of connections and
disconnections can increase the CPU utilisation of your database.
Setting up a connection pooler can make a big difference in
performance. Most web frameworks provide connection pooling out of
the box, but you can also set up an external connection pooler
like pgbouncer or pgpool-II.
* Indexes - one of the most common problems that results in low
performance is a missing index. Using pgbadger you can identify
slow queries and the tables that are missing indexes. PostgreSQL
provides a variety of index types (btree, brin, gin, etc.).
* Parameter tuning - I have seen people vertically scale Postgres
without attempting to understand the bottlenecks causing
performance problems. It is really important to tune your
parameters for your workload to get the desired performance. You
can generate a starting configuration using PGTune, but don't
forget to learn more about your workload and tune accordingly.
* Partitioning - Partitioning can make archiving easier and improve
performance on large tables. You can partition tables using either
declarative partitioning or inheritance-based partitioning. Simple
log or event tables are generally good candidates for
partitioning. For more details, check the PostgreSQL
documentation.
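A minimal sketch of declarative range partitioning for an event table; the table and partition names are made up for this example:

```sql
-- Declarative range partitioning on a timestamp, one partition per month.
CREATE TABLE events (
    id      bigserial,
    ts      timestamptz NOT NULL,
    payload jsonb
) PARTITION BY RANGE (ts);

CREATE TABLE events_2023_06 PARTITION OF events
    FOR VALUES FROM ('2023-06-01') TO ('2023-07-01');

-- Archiving a month is then just detaching (or dropping) its partition.
ALTER TABLE events DETACH PARTITION events_2023_06;
```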
Common questions while choosing a data type
Postgres offers a lot of data types. While designing the schema for a
table, it's quite useful to know about them and where to use them.
These are some of the questions I come across:
int vs bigint (serial vs bigserial)
It's better to use int only when you know the values won't exceed the
limit of int. Changing a column from int to bigint can result in
massive downtime, because Postgres will need to rewrite the entire
table.
json vs jsonb
json is just like a text column with a json validation whereas in
case of jsonb column, data is stored in a binary format. Because of
which insert and update operations are a little slower for jsonb as
compared to json. If data in the column is just going to be logs and
are not going to be queried, its better to use a json column. For
more details check this article.
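A small sketch of what jsonb enables; the table name and JSON keys are illustrative:

```sql
-- jsonb supports containment queries and can be indexed with GIN.
CREATE TABLE docs (id bigserial PRIMARY KEY, data jsonb);
CREATE INDEX ON docs USING gin (data);

-- @> (containment) can use the GIN index; ->> extracts a field as text.
SELECT data->>'name' FROM docs WHERE data @> '{"type": "user"}';
```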
char vs varchar vs text
char(n) is fixed length with blanks padded whereas varchar(n) is a
variable length string with a limit. Text on other hand has no
limits. Most of the time, its better to use text data type. If a
check on length is required, you can add a constraint for that on a
text column. For more information check postgres documentation here.
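For example, a text column with a CHECK constraint gives varchar(n)-like enforcement, and the limit can later be changed without rewriting the table (the table and column names are illustrative):

```sql
CREATE TABLE users (
    id   bigserial PRIMARY KEY,
    -- behaves like varchar(100), but the limit is easy to change later
    name text CHECK (length(name) <= 100)
);
```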
Additional Tips
Backups
Always keep backups enabled and test them regularly. GitLab faced an
issue with backups in the past which resulted in data loss; you can
check their postmortem here.
Upgrades
Minor version upgrades just require a restart, whereas a major version
upgrade needs to update the data on disk, which can take quite some
time. It's recommended to test the upgrade, and your application
against the newer version, before doing it in production.
Contributors: Vishesh (@vishesh92), Vaidik Kapoor (@vaidik), Amboj
Goyal (@ambhojgoyal)