Documentation: https://www.postgresql.org/docs/current/
Don't forget the semicolon ; at the end of each psql command to run it!
Most common data types in Postgres:
Data Type | Representation | Value | Display |
---|---|---|---|
integer | whole number | 617 | 617 |
decimal | floating-point number | 26.17345 | 26.17345 |
money | fixed-point number with 2 decimal places | 6.17 | $6.17 |
boolean | logic | TRUE, FALSE | t, f |
char(n) | fixed-length string, blank-padded; trailing spaces are trimmed on conversion and insignificant in comparisons | '123 ' | '123' |
varchar(n) | variable-length string | '123 ' | '123 ' |
text | unlimited-length string | '123 ' | '123 ' |
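A quick sketch of the char vs. varchar trailing-space behavior from the table above (the values are made up):
SELECT '123 '::char(4) = '123'::char(4); --t, char pads and ignores trailing spaces
SELECT '123 '::varchar(4) = '123'::varchar(4); --f, varchar keeps and compares the trailing space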
CREATE DATABASE or CREATE TABLE followed by [name]:
CREATE TABLE cars (
id BIGSERIAL NOT NULL PRIMARY KEY,
make VARCHAR(100) NOT NULL,
model VARCHAR(100) NOT NULL,
price NUMERIC(19, 2) NOT NULL
);
CREATE TABLE <new_table_name> AS
SELECT DISTINCT <old_table_column_name(s)> --use DISTINCT if you only want unique entries, otherwise will copy duplicates
FROM <old_table_name>;
This is helpful when you need to split data out of a bigger table into a separate table.
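For example, a minimal sketch that pulls the distinct makes out of the cars table defined above (the table name makes is made up here):
CREATE TABLE makes AS
SELECT DISTINCT make
FROM cars;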
ALTER TABLE followed by [table_name] and whatever change is needed, e.g. ADD COLUMN (needs a column name and data type, e.g. INTEGER) or DROP COLUMN to delete a column.
ALTER TABLE <table_name>
ADD COLUMN <new_column_name> <column_type>;
ALTER TABLE <table_name>
DROP COLUMN <column_name_to_be_deleted>,
DROP COLUMN <another_column_name_to_be_deleted>; --multiple actions are separated by commas
This is a powerful command, be careful! See "Bobby Tables" on preventing SQL injection.
DROP DATABASE or DROP TABLE followed by [name].
Constraints are rules applied to table columns that ensure data integrity.
Data types are the "first line of defense". They aren't constraints, but they can help reject incorrect data before it enters the database.
A primary key can be a column or a combination of columns ("Composite Primary Key") that uniquely identifies a row; it is both NOT NULL and UNIQUE. A table can only have one single PRIMARY KEY, as it will often be used in joins between tables. Having primary keys improves query performance and helps to enforce data integrity.
CREATE TABLE books (
title varchar(100),
isbn varchar(50) PRIMARY KEY, --primary key as unique identifier column
pages integer,
price money,
description varchar(256),
publisher varchar(100)
);
CREATE TABLE popular_books (
book_title VARCHAR(100),
author_name VARCHAR(50),
number_sold INTEGER,
number_previewed INTEGER,
PRIMARY KEY (book_title, author_name) --composite primary key as constraint
);
ALTER TABLE persons ADD PRIMARY KEY (id); --turning a table column into a primary key
Foreign keys help to maintain "referential integrity" between two tables by validating that an entry in one table ("Child Table") also appears in the other ("Parent Table"). Postgres will reject any value that does not already exist in the parent table. This can be set explicitly: referential integrity depends on FOREIGN KEY constraints, which are created using the REFERENCES keyword.
CREATE TABLE chapters (
id integer PRIMARY KEY,
number integer,
title varchar(50),
content varchar(1024),
 book_isbn varchar(50) REFERENCES books(isbn) --foreign key as a column constraint
 --FOREIGN KEY (book_isbn) REFERENCES books(isbn) --equivalent foreign key as a table constraint; use one or the other
);
ALTER TABLE talks
ADD FOREIGN KEY (speaker_id)
REFERENCES speakers (id)
ON DELETE CASCADE; --adds foreign key constraint with deletion cascade
A foreign key constraint will prevent deleting or updating a row of a parent table that is referenced in the child table. This can be set explicitly using the ON DELETE RESTRICT / ON UPDATE RESTRICT keywords.
REFERENCES <parent_table> (<primary key column>) ON DELETE RESTRICT
REFERENCES <parent_table> (<primary key column>) ON UPDATE RESTRICT
Rather than preventing changes, the CASCADE clause can be used to explicitly enforce the changes to a child row when the parent row is deleted or updated.
REFERENCES <parent_table> (<primary key column>) ON DELETE CASCADE
REFERENCES <parent_table> (<primary key column>) ON UPDATE CASCADE
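As a sketch, assuming the talks/speakers setup from the ALTER TABLE example above (speaker_id references speakers with ON DELETE CASCADE; the id value is made up):
DELETE FROM speakers WHERE id = 3; --also deletes every row in talks with speaker_id = 3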
The NOT NULL constraint is used to reject incoming queries when the NOT NULL constrained column is missing a value.
CREATE TABLE persons (
id INT PRIMARY KEY,
name VARCHAR(20) NOT NULL,
email VARCHAR(15)
);
ALTER TABLE talks ALTER COLUMN speaker SET NOT NULL; --adds NOT NULL constraint to column speaker
The UNIQUE constraint helps with defining unique values in a table. It also creates an index, which can improve query and join performance.
A jointly unique constraint adds a unique constraint on the combination of two columns ("multi-column constraint").
ALTER TABLE persons ADD UNIQUE (email); --adds UNIQUE constraint to column email
ALTER TABLE random_people ADD CONSTRAINT unique_email UNIQUE (email); --adds custom named UNIQUE constraint to column email
ALTER TABLE table_name ADD UNIQUE (column1, column2); --adds jointly UNIQUE constraint to two columns
The CHECK() constraint gives more control over what specific rules to apply to a column. Rows that fail the defined criteria are rejected. The condition can make comparisons between columns within the table, using logical operators like AND, OR, IN or LIKE that evaluate to a boolean.
ALTER TABLE random_people ADD CONSTRAINT gender_constraint CHECK (gender = 'Male' OR gender = 'Female'); --custom constraint with check condition
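An insert that violates the condition would then be rejected (other column constraints aside; the value is hypothetical):
INSERT INTO random_people (gender) VALUES ('Unknown'); --fails: violates check constraint "gender_constraint"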
ALTER TABLE persons DROP CONSTRAINT person_pkey; --deletes primary key constraint from table persons
ALTER TABLE talks ALTER COLUMN speaker DROP NOT NULL; --deletes NOT NULL constraint from column speaker
A schema is a container within a database that holds tables, views, functions, and other database objects. It helps organize and manage these objects, allowing for logical grouping and separation. Schemas can also help avoid name conflicts by allowing objects with the same name to exist in different schemas.
A table is a database object within a schema that stores data in rows and columns. It is the primary structure for storing and organizing data in a relational database. Each table has a defined structure, including column names and data types.
Schemas organize tables, while tables store the actual data.
information_schema.key_column_usage is a built-in view with metadata that can be queried in every database. It shows each primary key (pkey) as well as each foreign key (fkey) of a table.
SELECT constraint_name, table_name, column_name
FROM information_schema.key_column_usage;
Will show this:
constraint_name | table_name | column_name |
---|---|---|
book_pkey | book | isbn |
chapter_pkey | chapter | id |
chapter_book_isbn_fkey | chapter | book_isbn |
Table relationships in SQL define how tables are connected to each other. The main types of relationships are:
One-to-One: Each row in Table A is linked to one and only one row in Table B. This is implemented using a PRIMARY KEY in Table A and a FOREIGN KEY with a UNIQUE constraint in Table B.
CREATE TABLE person (
person_id SERIAL PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE passport (
passport_id SERIAL PRIMARY KEY,
person_id INT UNIQUE REFERENCES person(person_id),
passport_number VARCHAR(50)
);
One-to-Many: A row in Table A can be linked to multiple rows in Table B. This is implemented using a PRIMARY KEY in Table A and a FOREIGN KEY in Table B.
CREATE TABLE customer (
customer_id SERIAL PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE orders ( --"order" is a reserved keyword in SQL, so the table is named orders
order_id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customer(customer_id),
order_date DATE
);
Many-to-Many: A many-to-many relationship can be broken into two one-to-many relationships.
To implement a many-to-many relationship in a relational database, we would create a third cross-reference table also known as a join table. It will have these two constraints:
- foreign keys referencing the primary keys of the two member tables.
- a composite primary key made up of the two foreign keys.
CREATE TABLE student (
student_id SERIAL PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE course (
course_id SERIAL PRIMARY KEY,
course_name VARCHAR(100)
);
CREATE TABLE enrollment ( --join table (cross-reference table)
student_id INT REFERENCES student(student_id), --foreign key to table A primary key
course_id INT REFERENCES course(course_id), --foreign key to table B primary key
PRIMARY KEY (student_id, course_id) --composite primary key
);
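To read the relationship back out, the join table can be joined to both member tables; a sketch using the tables above:
SELECT s.name, c.course_name
FROM enrollment e
JOIN student s USING (student_id)
JOIN course c USING (course_id);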
SELECT country_of_birth, COUNT(*) FROM random_people GROUP BY country_of_birth HAVING COUNT(*) >= 40 ORDER BY country_of_birth;
SELECT [column names separated by commas, or * for all] FROM [table name]
COUNT() returns the number of rows that match a specified condition.
COUNT(*) AS [new column name] after a column will create a new column with the counts of the entries in that column (only works together with GROUP BY).
GROUP BY [column name] (when using COUNT(*), this needs to be the same column)
HAVING is followed by a filter to be applied within the grouping. HAVING always comes after GROUP BY, but before ORDER BY and LIMIT.
When we want to limit the results of a query based on values of the individual rows, use WHERE. When we want to limit the results of a query based on an aggregate property, use HAVING.
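A sketch contrasting the two on the random_people table used elsewhere in these notes:
SELECT country_of_birth, COUNT(*)
FROM random_people
WHERE gender = 'Female' --filters individual rows before grouping
GROUP BY country_of_birth
HAVING COUNT(*) >= 40; --filters whole groups after aggregation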
ORDER BY [column name] sorts the results by the specified column in either ASCending (default) or DESCending order.
SELECT * FROM random_people WHERE (gender = 'Female' AND country_of_birth = 'Germany') OR (country_of_birth = 'Austria' AND gender <> 'Male' AND date_of_birth BETWEEN DATE '2000-01-01' AND '2000-12-31') LIMIT 10 OFFSET 5; --AND binds tighter than OR; the parentheses make the grouping explicit
WHERE specifies the conditions for filtering rows
AND combines multiple conditions that must all be true
OR combines multiple conditions where at least one must be true
<> means "not equal to"
BETWEEN specifies a range of values
LIMIT specifies the maximum number of rows to return
OFFSET specifies the number of rows to skip before starting to return rows
SELECT * FROM random_people WHERE gender NOT IN ('Female','Male');
SELECT * FROM random_people WHERE country_of_birth IN ('Nigeria');
IN / NOT IN specifies a list of values to include / exclude
SELECT * FROM random_people WHERE email LIKE '%google.%';
SELECT * FROM random_people WHERE first_name LIKE '___';
LIKE is used for pattern matching
% matches any sequence of characters
_ matches any single character
SELECT DISTINCT country_of_birth FROM random_people WHERE country_of_birth ILIKE 'n%';
SELECT DISTINCT selects unique values
ILIKE is a case-insensitive pattern matching operator
Any column mentioned after a SELECT gets a reference number "behind the scenes" that can be used in later clauses so we don't repeat ourselves (indexed from 1).
For example:
SELECT category, price, AVG(downloads)
FROM fake_apps
GROUP BY category, price;
Can be written:
SELECT category, price, AVG(downloads)
FROM fake_apps
GROUP BY 1, 2;
The CASE statement is used to execute a sequence of statements based on a condition. It is similar to the IF-THEN-ELSE statement in other programming languages.
SELECT COUNT(*) AS total,
CASE
WHEN url LIKE '%github%' THEN 'Github'
WHEN url LIKE '%medium%' THEN 'Medium'
WHEN url LIKE '%nytimes%' THEN 'New York Times'
ELSE 'Other'
END AS "Source" --identifier aliases need double quotes, not single
FROM hacker_news
GROUP BY 2;
This example:
- SELECTs two columns: a total (which is the count) and a Source column, which is a new column with our case logic
- in CASE, when the url column contains Github, we add it as such to our new Source column
- ELSE catches all other rows that do not match the url criteria
- END AS finishes the logic and names the column Source
- GROUP BY 2 groups the counts into Sources.
total | Source |
---|---|
23 | Github |
12 | Medium |
13 | New York Times |
3952 | Other |
MAX / MIN / AVG / SUM followed by (column_name) returns the maximum / minimum / average / sum value of a column.
SELECT ROUND(AVG(price), 2) AS discounted FROM cars;
ROUND(expression, precision) rounds a numeric expression to a specified precision.
SELECT NOW(); --returns the current date and time
SELECT EXTRACT(YEAR FROM NOW());
SELECT EXTRACT(MONTH FROM NOW());
EXTRACT(field FROM source) retrieves a specific field (such as year, month, day) from a date or time expression.
SELECT NOW() + INTERVAL '6 MONTH';
INTERVAL adds or subtracts a specific amount of time from a date or time.
test=# SELECT first_name, last_name, date_of_birth, AGE(NOW(), date_of_birth) AS age FROM person;
AGE() subtracts the 2nd argument from the 1st.
SELECT COALESCE(NULL, NULL, 3, 4) AS number;
SELECT COALESCE(email, 'N/A') FROM random_people;
COALESCE() returns the first non-null expression; the second parameter can be used as a default value for NULL values.
SELECT NULLIF(10, 10); -- Returns NULL
SELECT NULLIF(10, 1); -- Returns 10
NULLIF(expression1, expression2) returns NULL if both expressions are equal; otherwise, it returns the first expression.
SELECT NOW()::DATE;
SELECT NOW()::TIME WITH TIME ZONE;
The :: operator converts (casts) one data type into another.
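A couple more casts as a sketch (the money output format depends on the lc_monetary setting):
SELECT '617'::INTEGER; --string to integer
SELECT 26.17345::MONEY; --numeric to money, e.g. $26.17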
INSERT INTO [table name] ([list of column names]) VALUES ([list of values to insert, in the order of the column names])
INSERT INTO persons (first_name, last_name) VALUES ('Ria', 'Scholz');
PostgreSQL will try to interpret incoming data as the data type the column has been defined as (type casting). This can be problematic, for example when a column is of type INTEGER but the value entered is a decimal: Postgres will then round the value upon insertion. Or, if the column type is VARCHAR, Postgres will cast the decimal value into a string.
INSERT INTO random_people (id, first_name, last_name, gender, email, date_of_birth, country_of_birth)
VALUES (2017, 'Russell', 'Ruddoch', 'Male', '[email protected]', DATE '1952-09-25', 'Norway')
ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email, first_name = EXCLUDED.first_name;
INSERT INTO ... ON CONFLICT allows you to insert data into a table but specify how to handle conflicts if a unique constraint (such as a primary key or unique column) is violated.
- ON CONFLICT (column_name) defines which column(s) should be checked for conflicts
- DO UPDATE tells the database what to do if a conflict occurs
- EXCLUDED.column_name refers to the value that was intended to be inserted but was excluded due to the conflict
- ON CONFLICT (column_name) ... DO NOTHING skips the insert if a conflict occurs, without performing any updates
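For example, a minimal sketch of the DO NOTHING variant against the same table (assuming the omitted columns are nullable or have defaults):
INSERT INTO random_people (id, first_name, last_name)
VALUES (2017, 'Russell', 'Ruddoch')
ON CONFLICT (id) DO NOTHING; --silently skips the insert if id 2017 already exists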
UPDATE [table name] SET [column name] = [new value] WHERE [identifier column] = [matching value]; e.g. the person with id 4 gets the car with id 1:
UPDATE persons SET car_id = 1 WHERE id = 4;
DELETE FROM [table name] WHERE [identifier column] = [matching value]; e.g. deletes the person with id 7 from the table persons:
DELETE FROM persons WHERE id = 7;
Combines records from two tables via matching foreign key values. The ON keyword defines the matching columns. The inner join table returned will only contain matched values; any rows that didn't match are dropped.
SELECT [the columns to display] FROM [one table] JOIN [the other table] ON [which table.columns (foreign key) to match].
SELECT persons.first_name, cars.make, cars.model, cars.price FROM persons JOIN cars ON persons.car_id = cars.id;
Combines two tables keeping all records, including ones without matching foreign key values. All rows of the "left" table are kept; any record that is not matched in the foreign table will have null values in the joined table.
SELECT [the columns to display] FROM [one table] LEFT JOIN [the other table] ON [which table.columns (foreign key) to match]. This keeps all records from table A even if there are no matching records in table B.
SELECT * FROM persons LEFT JOIN cars ON cars.id = persons.car_id;
If both tables share the same column name for matching, you can simply use USING (column name):
SELECT * FROM persons LEFT JOIN cars USING (car_uid);
Combines all rows of one table with all rows of another table. This is commonly used when needing to compare each row of a table to a list of values.
For example, if we have a table of customer subscription records that includes columns start_month and end_month, we can use a reference table months to CROSS JOIN matching months the customer was subscribed in, to get a total count of subscribed customers per month.
SELECT month,
COUNT(*) AS subscribers
FROM newspaper
CROSS JOIN months
WHERE start_month <= month
AND end_month >= month
GROUP BY month
ORDER BY 2 DESC;
CROSS JOIN does not require the ON clause because there are no columns to match on.
UNION stacks tables into one. SQL has strict rules for appending data:
- Tables must have the same number of columns
- The columns must have the same data types (schema)
- The columns need to be in the same order
SELECT * FROM table1
UNION
SELECT * FROM table2;
The WITH ... AS clause stores the result of a query in a temporary table using an alias. It is also known as a Common Table Expression (CTE) and allows referencing this temporarily defined result set in subsequent queries (like SELECT, INSERT, UPDATE, or DELETE statements). This can make complex queries easier to read and maintain by breaking them into simpler, reusable parts.
WITH recent_orders AS (
SELECT customer_id, order_date
FROM orders
WHERE order_date > '2023-01-01'
)
SELECT customers.name, recent_orders.order_date
FROM customers
JOIN recent_orders ON customers.id = recent_orders.customer_id;
- The WITH clause defines a CTE named recent_orders that selects orders placed after January 1, 2023.
- The main query then joins the customers table with the recent_orders CTE to retrieve customer names and their recent order dates.
Important to note that the CTE will only have the columns that are specified in its SELECT statement.
The \copy command takes the actual query in (), TO a filepath (new filename.extension), a DELIMITER, and whether a CSV HEADER is wanted or not.
\copy (SELECT * FROM persons LEFT JOIN cars ON persons.car_id = cars.id ORDER BY persons.id) TO '/Users/ria/Desktop/persons.csv' DELIMITER ',' CSV HEADER;
If a table has an id serial column (or any other serialized column), it will have a sequence with a function to call the next increment.
View the current serial number of the sequence tablename_columnname_seq:
SELECT * FROM persons_id_seq;
Invoke the function and increment this value:
SELECT nextval('persons_id_seq'::regclass);
Resetting the serial sequence:
ALTER SEQUENCE persons_id_seq RESTART WITH 1000;
Setting the sequence to a specific value:
SELECT setval('random_people_id_seq', (SELECT MAX(id) FROM random_people));
setval(sequence_name, value_to_set)
A database trigger is procedural code that is automatically executed in response to certain events on a particular table or view in a database. Triggers are primarily used for maintaining the integrity of the information in the database. When a specific change occurs to a table or view, a trigger can be set to call a function automatically, ensuring that certain actions are consistently performed. This helps in enforcing rules and saving users from forgetting to perform necessary actions. However, triggers come with overhead and may sometimes need to be bypassed in specific situations. Proper design can minimize these exceptions.
CREATE TRIGGER <trigger_name>
BEFORE UPDATE ON <table_name>
FOR EACH ROW
EXECUTE PROCEDURE <function>;
-- Example function
CREATE OR REPLACE FUNCTION check_account_update() RETURNS TRIGGER AS $$
BEGIN
NEW.active := 1;
RETURN NEW;
END;
$$ LANGUAGE PLPGSQL;
-- Example trigger
CREATE TRIGGER check_update
BEFORE UPDATE ON accounts
FOR EACH ROW
EXECUTE PROCEDURE check_account_update(); --call the function with parentheses!
In this example, whenever an UPDATE statement is executed on the accounts table, the trigger will ensure that the active column of the modified rows is set to 1. Triggers can be set for various events, including UPDATE, INSERT, DELETE, and TRUNCATE, allowing you to customize the behavior of your database tables based on different types of data modifications.
Newer versions of PostgreSQL may use EXECUTE FUNCTION rather than EXECUTE PROCEDURE. These are logically equivalent and both call a trigger function.
--Renaming a Trigger
ALTER TRIGGER <old_trigger_name> ON <table_name>
RENAME TO <new_trigger_name>;
--Deleting a Trigger
DROP TRIGGER IF EXISTS <trigger_name> ON <table_name>;
--Displaying Triggers
SELECT * FROM information_schema.triggers;
- BEFORE
  - Executes before the operation (INSERT, UPDATE, DELETE) is performed.
  - Allows you to modify the row before it is committed to the database.
  - Useful for validation, modification, or enforcing business rules.
- AFTER
  - Executes after the operation has been completed.
  - Cannot modify the row being processed, as the operation is already done.
  - Ideal for logging, auditing, or triggering other actions based on the completed operation.
CREATE TRIGGER logging
AFTER UPDATE ON customers
FOR EACH ROW
EXECUTE FUNCTION log_customers_change();
- FOR EACH ROW
  - The trigger fires and calls the function for every row impacted by the related query.
  - Can have large impacts on efficiency and data integrity, especially with large datasets.
  - Example: If a function tracks the number of modifications, deleting 10 records would increment the counter by 10.
- FOR EACH STATEMENT
  - The trigger calls the function once for each query, not for each record.
  - More efficient for large datasets but can have logical differences.
  - Example: If a function tracks the number of modifications, deleting 10 records would increment the counter by 1.
CREATE TRIGGER log_queries
AFTER UPDATE ON customers
FOR EACH STATEMENT
EXECUTE PROCEDURE statement_function();
- WHEN clause
  - The WHEN clause is used to specify conditions (boolean) that must be met for the trigger to execute.
  - This allows for more granular control over when the trigger's function is called.
CREATE TRIGGER insert_trigger_high
BEFORE INSERT ON clients
FOR EACH ROW
WHEN (NEW.total_spent >= 1000)
EXECUTE PROCEDURE high_spender();
--In this example, the trigger update_check will only execute the check_account_update function if the balance column is changed.
CREATE TRIGGER update_check
BEFORE UPDATE ON accounts
FOR EACH ROW
WHEN (OLD.balance IS DISTINCT FROM NEW.balance)
EXECUTE FUNCTION check_account_update();
- NEW
  - Represents the new row data that will be inserted or updated in the table.
  - It is available in INSERT and UPDATE triggers.
- OLD
  - Represents the old row data that is being updated or deleted.
  - It is available in UPDATE and DELETE triggers.
CREATE OR REPLACE FUNCTION log_account_update() RETURNS TRIGGER AS $$
BEGIN
  IF OLD.balance IS DISTINCT FROM NEW.balance THEN
    INSERT INTO account_log (account_id, action, old_balance, new_balance, timestamp)
    VALUES (NEW.id, 'UPDATE', OLD.balance, NEW.balance, NOW());
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_update
AFTER UPDATE ON accounts
FOR EACH ROW
EXECUTE FUNCTION log_account_update();
With the WHEN clause, you can use NEW and OLD to get records from the table before and after the query. Logically, INSERT cannot refer to OLD (nothing existed before the insert) and DELETE cannot refer to NEW (nothing exists after the delete). NEW and OLD referencing only works for FOR EACH ROW triggers.
- Multiple triggers of the same kind can exist on the same table
- If a statement causes multiple triggers to fire, they are triggered in alphabetical order (name of trigger)
- Since SELECT statements do not modify rows, no trigger can be set on a SELECT statement.
date_part returns a portion of the date as an integer:
SELECT date_part('year', '2020-08-01 00:00:00'::date); --returns 2020
By default there will always be a postgres database and a postgres user in a PostgreSQL database server. This postgres user is a superuser: it can create new databases, tables, users etc. A superuser bypasses all permission checks that other users face before being allowed to perform an action. superuser privileges are not restricted to a single user and can be passed along to any number of users in the DB. However, the general principle of "least privilege" should be applied: every user is only given the minimum permissions required for their function. Other important principles should be kept in mind: most users' privileges should be restricted, superusers should not be performing routine database tasks, and specialized roles should be created with only the permissions they require.
A superuser can create new roles. Roles can either be login roles or group roles. Login roles are used for most routine database activity. Group roles typically do not have the ability to log in themselves, but can hold other roles as "members" and allow access to certain shared permissions.
Some common role permissions:
Permission Name | Function |
---|---|
SUPERUSER | role is superuser |
NOSUPERUSER | role is not superuser |
CREATEROLE | permission to create additional roles |
CREATEDB | permission to create databases |
LOGIN | permission to login (Note: if not specified, the default is NOLOGIN!) |
IN ROLE | list of existing roles that a role will be added to as a new member |
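A sketch combining several of these attributes (the role name analyst is made up):
CREATE ROLE analyst WITH LOGIN CREATEDB NOSUPERUSER;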
Sometimes CREATE USER is used on older versions of PostgreSQL; it is equivalent to CREATE ROLE, except that CREATE USER assumes WITH LOGIN while CREATE ROLE does not.
Every table or schema in a PostgreSQL database has an owner that can set the permissions on their tables.
A superuser or table/schema owner can GRANT and REVOKE permissions at the schema and table level. To use a schema and interact with its tables, a role must have a permission called USAGE. Without USAGE, a role cannot access tables within that schema. Other schema-level permissions include CREATE and DROP, which allow the grantee to create or remove tables in that schema respectively. Additionally, the table owner must also grant SELECT, UPDATE, DELETE, INSERT etc. on a specific table to define how that role can interact with the table.
Default permissions can be set that are independent of schema or table roles and can also be set at the database level. These permissions will apply to all schemas or tables created after the default permission was set. Any permissions that could otherwise be set with a GRANT statement can be applied to newly created objects with ALTER DEFAULT PRIVILEGES.
This feature is not widely adopted and for older PostgreSQL systems or other database servers permissions may still need to be set manually.
Group roles are used to automatically inherit permissions to a number of roles that are members of the group. Groups can also be members of other group roles (nesting). This is a useful feature for maintaining databases with many users, but only a few “types” of users. Maintaining a DB with hundreds of users becomes simpler if only having a few permissions managed through just a few group roles.
For security reasons, PostgreSQL disallows the inheritance of certain powerful permissions such as LOGIN, SUPERUSER, CREATEDB and CREATEROLE.
Column Level Security is used to set permissions per column instead of table- or schema-wide, using the GRANT statement with the respective columns in ().
Row Level Security (RLS) is used to set permissions per row instead of table or schema wide. This is an additional layer of security. To access (or modify) information from a table with RLS, a row-specific condition must be met. For this, a policy needs to be created that defines the permissions type, roles that the policy applies to and specific conditions to check before permitting a user to carry out an action. After a policy has been created it can then be applied to the specific tables (enables RLS).
SELECT current_user; --check current user
SELECT usename, --user name
 usecreatedb, --permission to create DB?
 usesuper --superuser privileges?
FROM pg_catalog.pg_user; --view of all users in the database and their permissions
SELECT grantor, grantee, --user
table_schema, table_name, privilege_type
FROM information_schema.table_privileges --table of operations a user may have permissions for
WHERE grantee = 'user_name';
SET ROLE <rolname>; --sets current user to other role (if user has permission to do so)
CREATE ROLE <rolname> WITH <list of permissions>; --creates new role
ALTER ROLE <rolname> WITH <permission_name>; --adds a permission to an existing role
GRANT USAGE, CREATE ON SCHEMA <schema_name> TO <rolname>; --granting a role the permission to use a schema and create new tables in that schema
GRANT SELECT, UPDATE ON <schema_name.table_name> TO <rolname>; --granting table specific permissions
REVOKE UPDATE ON <schema_name.table_name> FROM <rolname>; --revoking a permission from a role
ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <rolname>; --allows role to select on all newly-created tables in schema immediately after another user has created them
ALTER DEFAULT PRIVILEGES GRANT SELECT ON TABLES TO <rolname>; --same privileges at the database level (omitting IN SCHEMA applies to the whole current database)
SELECT * FROM information_schema.table_privileges; --displays granted table permissions
CREATE ROLE <group_rolname> WITH NOLOGIN ROLE <list of rolnames to include in group>; --creates a group role including login members
CREATE ROLE <group_rolname> WITH NOLOGIN;
GRANT <group_rolname> TO <rolname>; --adding role to group role
CREATE ROLE <role_name> WITH LOGIN PASSWORD '<password>'; --creates user without superuser attribute
CREATE ROLE <rolname> WITH LOGIN IN ROLE <group_rolname>; --adds new role to existing group role upon creation
GRANT SELECT (<list of column names>) ON <table_name> TO <rolname>; --grants column specific SELECT permission to role
SELECT * FROM information_schema.column_privileges; --displays all column level permissions
SELECT * FROM pg_policies; --displays all row level security policies
CREATE POLICY <policy_name> ON <table_name>
FOR SELECT
TO <rolname>
USING (<column_name> = current_user); --example policy that grants select access to rows where the current user is mentioned
ALTER TABLE <table_name> ENABLE ROW LEVEL SECURITY; --applies policies defined for this table
View available extensions:
SELECT * FROM pg_available_extensions;
Install an extension (in double quotes):
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
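Once installed, the extension's functions become available; uuid-ossp, for example, provides uuid_generate_v4():
SELECT uuid_generate_v4(); --returns a random version-4 UUID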
Shortcut | Description |
---|---|
psql | start postgresql connection |
[database name]=# | indicates a postgresql connection as superuser |
[database name]=> | indicates a postgresql connection as non-superuser |
-h or --host=HOSTNAME | the database server host or socket directory to connect with (the default is local socket) |
-p or --port=PORT | the database server port to connect to (the default is 5432) |
-U or --username=USERNAME | the database username to connect with (the default is your_username) |
-w or --no-password | never prompt for password; if a password is required but not provided, the connection will fail |
-W or --password | force password prompt, which should happen automatically (will always ask for password even if not strictly needed) |
\conninfo | get info about current connection |
\q | quit postgresql connection |
\l or \list | lists databases |
\c [database name] | connect to database |
\c - [username] | connect to database with different user |
\d | describe databases (shows tables) |
\d [table name] | describes specific table |
\dt | describes tables in current database |
\d+ [table name] | describes table with metadata |
\df | list of installed functions |
\du | list of all roles |
CTRL+L | clear screen |
\x | toggle expanded display (vertical alignment per record) |
\i [FILE PATH] | execute a sql file (queries) |
\s | show query history |
\s [FILE PATH] | save query history to a file |
(from https://drew.silcock.dev/blog/how-postgres-stores-data-on-disk/)
Postgres stores all its data in a directory called /var/lib/postgresql/data (by default).
Directory | Explanation |
---|---|
base/ | Contains a subdirectory for each database. Inside each sub-directory are the files with the actual data in them. We’ll dig into this more in a second. |
global/ | Directly contains files for cluster-wide tables like pg_database. |
pg_commit_ts/ | As the name suggests, contains timestamps for transaction commits. We don’t have any commits or transactions yet, so this is empty. |
pg_dynshmem/ | Postgres uses multiple processes (not multiple threads, although there has been discussion around it) so in order to share memory between processes, Postgres has a dynamic shared memory subsystem. This can use shm_open, shmget or mmap on Linux – by default it uses shm_open. The shared memory object files are stored in this folder. |
pg_hba.conf | This is the Host-Based Authentication (HBA) file which allows you to configure access to your cluster based on hostname. For instance, by default this file has host all all 127.0.0.1/32 trust which means “trust anyone connecting to any database without a password if they’re connecting from localhost”. If you’ve ever wondered why you don’t need to put your password in when running psql on the same machine as the server, this is why. |
pg_ident.conf | This is a user name mapping file which isn’t particularly interesting for our purposes. |
pg_logical/ | Contains status data for logical decoding. We don’t have time to talk about how the Write-Ahead Log (WAL) works in full, but in short, Postgres writes changes that it’s going to make to the WAL, then if it crashes it can just re-read and re-run all the operations in the WAL to get back to the expected database state. The process of turning the WAL back into the high-level operations – for the purposes of recovery, replication, or auditing – is called logical decoding and Postgres stores files related to this process in here. |
pg_multixact/ | ”xact” is what Postgres calls transactions, so this contains status data for multitransactions. Multitransactions are a thing that happens when you’ve got multiple sessions that are all trying to do a row-level lock on the same rows. |
pg_notify/ | In Postgres you can listen for changes on a channel and notify listeners of changes. This is useful if you have an application that wants to action something whenever a particular event happens. For instance, if you have an application that wants to know every time a row is added or updated in a particular table so that it can synchronise with an external system. You can set up a trigger which notifies all the listeners whenever this change occurs. Your application can then listen for that notification and update the external data store however it wants to. |
pg_replslot/ | Replication is the mechanism by which databases can synchronise between multiple running server instances. For instance, if you have some really important data that you don’t want to lose, you could set up a couple of replicas so that if your main database dies and you lose all your data, you can recover from one of the replicas. This can be physical replication (literally copying disk files) and logical replication (basically copying the WAL to all the replicas so that the main database can be reconstructed from the replica’s WAL via logical decoding). This folder contains data for the various replication slots, which are a way of ensuring WAL entries are kept for particular replicas even when it’s not needed by the main database. |
pg_serial/ | Contains information on committed serialisable transactions. Serialisable transactions are the highest level of strictness for transaction isolation, which you can read more about in the docs. |
pg_snapshots/ | Contains exported snapshots, used e.g. by pg_dump which can dump a database in parallel. |
pg_stat/ | Postgres calculates statistics for the various tables which it uses to inform sensible query plans and plan executions. For instance, if the query planner knows it needs to do a sequential scan across a table, it can look at approximately how many rows are in that table to determine how much memory should be allocated. This folder contains permanent statistics files calculated from the tables. Understanding statistics is really important to analysing and fixing poor query performance. |
pg_stat_tmp/ | Similar to pg_stat/ apart from this folder contains temporary files relating to the statistics that Postgres keeps, not the permanent files. |
pg_subtrans/ | Subtransactions are another kind of transaction, like multitransactions. They’re a way to split a single transaction into multiple smaller subtransactions, and this folder contains status data for them. |
pg_tblspc/ | Contains symbolic references to the different tablespaces. A tablespace is a physical location which can be used to store some of the database objects, as configured by the DB administrator. For instance, if you have a really frequently used index, you could use a tablespace to put that index on a super speedy expensive solid state drive while the rest of the table sits on a cheaper, slower disk. |
pg_twophase/ | It’s possible to “prepare” transactions, which means that the transaction is dissociated from the current session and is stored on disk. This is useful for two-phase commits, where you want to commit changes to multiple systems at the same time and ensure that both transactions either fail and rollback or succeed and commit. |
PG_VERSION | This one’s easy – it’s got a single number in which is the major version of Postgres we’re in, so in this case we’d expect this to have the number 16 in. |
pg_wal/ | This is where the Write-Ahead Log (WAL) files are stored. |
pg_xact/ | Contains status data for transaction commits, i.e. metadata logs. |
postgresql.auto.conf | This contains server configuration parameters, like postgresql.conf, but is automatically written to by alter system commands, which are SQL commands that you can run to dynamically modify server parameters. |
postgresql.conf | This file contains all the possible server parameters you can configure for a Postgres instance. This goes all the way from autovacuum_naptime to zero_damaged_pages. If you want to understand all the possible Postgres server parameters and what they do in human language, I’d highly recommend checking out postgresqlco.nf |
postmaster.opts | This simple file contains the full CLI command used to invoke Postgres the last time that it was run. |
description | command |
---|---|
show name from object identifiers for databases | SELECT oid, datname FROM pg_database; |
get OID of 'public' namespace (connect to database first) | SELECT to_regnamespace('public')::oid; |
list all tables, indexes etc that live in a namespace | SELECT * FROM pg_class WHERE relnamespace = to_regnamespace('public')::oid; |
ACID is an acronym for good practices when working with database transactions that help to ensure that all data is complete, accurate and has integrity. Furthermore, these principles assist in recovering a database in case of system failure and allow for concurrent use of a database.
ACID stands for:
- Atomicity: All changes to data are performed as if they are a single operation. That is, all the changes are performed, or none of them are.
- Consistency: Any transaction done has to follow the rules of the database to ensure data integrity.
- Isolation: Every transaction is executed one at a time in isolation. The intermediate state of a transaction is invisible to other transactions. As a result, transactions that run concurrently appear to be serialized. Multiple transactions can still be performed at once but no two transactions can read or write from the same location at the same time.
- Durability: After a transaction successfully completes, changes to data persist and are not undone, even in the event of a system failure. This could be achieved through many means, for example keeping change logs that are referenced whenever a database has to restart.
Database normalization uses progressive Normal Forms (NF) to reduce data redundancy and improve data integrity. The number stands for the level of normalization.
1NF:
- Makes data atomic (one value per cell)
- Ensures each row is unique
'atomic' means a unit cannot be broken down further (like an atom)
2NF:
- Must be in 1NF
- Removes partial dependencies (splitting tables into separate entities to distinguish relationships)
A partial dependency is when an attribute depends on part of the table's primary key rather than the whole primary key.
3NF:
- Must be in 2NF
- Eliminates transitive dependencies
- Non-key attributes depend only on the primary key, not on other non-key attributes
A transitive functional dependency is when a non-prime attribute depends on another non-prime attribute rather than on a primary key or prime attribute.
-- Example of progression from unnormalized to 3NF:
-- Unnormalized
CREATE TABLE books (
book_title text,
book_author text,
author_address text,
book_sales numeric
);
-- 3NF
CREATE TABLE books (
book_id serial PRIMARY KEY,
author_id integer REFERENCES authors,
title text,
sales numeric
);
CREATE TABLE authors (
author_id serial PRIMARY KEY,
address_id integer REFERENCES addresses,
name text
);
CREATE TABLE addresses (
address_id serial PRIMARY KEY,
address text
);
Database anomalies are problems that can occur in badly designed databases leading to data inconsistency. These anomalies are often resolved through proper database normalization. There are four main types:
Insertion anomaly: Occurs when you cannot insert data because of missing related data. For example, if you can't add a new course to a school database until a student enrolls in it.
Deletion anomaly: Happens when deleting one piece of data unintentionally removes other unrelated data. For example, removing a student from a class might delete the only record of that class existing.
Update anomaly: When updating one data value requires multiple updates in different places. If a teacher's phone number changes, having to update it in multiple records increases the chance of inconsistency.
Data redundancy: When the same data is stored in multiple places, leading to:
- Waste of storage space
- Increased risk of data inconsistency
- Higher maintenance effort
The space that PostgreSQL uses on disk can grow in several ways and it's important to keep an eye on it. Apart from adding new data or tables there are some properties of the postgres data storage system that can cause an unexpected increase in disk usage. In general, keeping tables and indexes in the database small can ensure better query performance, more efficient disk utilization and lower database costs.
Each row in a postgres table is stored in a file on the disk. With every UPDATE or DELETE query, the content is not actually deleted from disk but instead marked as "invalid", and a new row is added for the updated content. These invalid rows are called dead tuples and still occupy disk space, potentially affecting performance.
- pg_total_relation_size will return the size of the table and all its indexes in bytes. These values are often in the millions or billions and thus hard to read.
- pg_table_size and pg_indexes_size return the size of the table's data and the table's indexes in bytes. The sum of these two functions is equal to pg_total_relation_size.
- pg_size_pretty can be used with the functions above to format a number in bytes as KB, MB, or GB.
- VACUUM can be used to clean dead tuples from disk, or at least make them re-writable. One nice property of VACUUM is that it allows space to be reused: if tables are vacuumed frequently enough, the disk usage of a table will stay relatively steady, because updates will never get "too far ahead" of the required space on the disk. Postgres has a feature called autovacuum, enabled on most databases by default, that periodically checks for tables that have had a large number of inserted, updated or deleted tuples and could be vacuumed for performance.
- VACUUM ANALYZE performs a VACUUM but also updates Postgres' internal statistics and can help further improve query performance after a large UPDATE.
- VACUUM FULL is an aggressive operation that completely clears dead tuples from a table and returns the space to disk, by rewriting all data from the table into a new location on disk, excluding dead tuples. This essentially clears all the space the table occupied. However, this operation is quite heavy and therefore slow, and it blocks the table from other operations; it should be used sparingly. The "plain" VACUUM is designed to run in parallel with normal reading and writing of the table. The best strategy for maintaining a database is to make sure that VACUUM runs frequently and autovacuum is enabled. These measures will ensure that table sizes are relatively stable over time.
- pg_stat_all_tables is a system view that provides statistics about table access and updates, including information about dead tuples, vacuum operations, and row counts for all tables in the database. To query for specific tables, use the relname column.
- TRUNCATE is an operation that quickly removes all rows from a table and can be used to clear a table much faster, without having to copy it (like a VACUUM FULL does). It simultaneously reclaims disk space immediately, rather than requiring a subsequent VACUUM operation. However, TRUNCATE also requires an exclusive lock while removing a table's contents.
TRUNCATE is an alternative to DELETE when you want to remove all rows from a table. Here are some key differences:
- Performance: TRUNCATE is faster than DELETE because it doesn't scan the table row by row.
- Disk Space: TRUNCATE immediately reclaims disk space, while DELETE may require a VACUUM to reclaim space.
- Transaction: in PostgreSQL, TRUNCATE is transaction-safe and can be rolled back when used within a transaction block (unlike in some other database systems).
- Triggers: TRUNCATE does not fire DELETE triggers.
Use TRUNCATE when you need to quickly clear a table and don't need to log each row deletion.
--getting size of tables
SELECT
 pg_size_pretty(pg_table_size('table_name')) AS tbl_size,
 pg_size_pretty(pg_indexes_size('table_name')) AS idx_size,
 pg_size_pretty(pg_total_relation_size('table_name')) AS total_size; --pass an index name to pg_total_relation_size to get the size of a specific index
--getting size of index
SELECT pg_size_pretty(pg_total_relation_size('index_name')) as idx_size;
--cleaning tables
VACUUM <table_name>;
VACUUM ANALYZE <table_name>; --also updates the statistics in pg_stat_all_tables
VACUUM FULL <table_name>; --performs a full vacuum on the table
TRUNCATE <table_name>; --deletes all rows without scanning and leaves no dead tuples
--getting statistics for a table
SELECT schemaname, relname, last_vacuum, last_autovacuum, last_analyze, n_dead_tup, n_live_tup
FROM pg_stat_all_tables
WHERE relname = '<table_name>';
An index is an organization of the data in a table to help with performance when searching and filtering records. By default it divides the possible matching records in half, then half again, then half again and so on until the specific match is found, similar to a binary search. Postgres' default index type implementing this is the B-Tree (balanced tree).
Indexes are built on columns and therefore a table can have multiple indexes.
Postgres automatically creates an index for Primary Keys and Unique constraints.
Every query is planned first (which algorithm to take). For example, Seq Scan means the system is scanning every record to find the match; more efficient would be a Bitmap Index Scan, which uses an index to do the binary search. The planning takes some time and, together with the execution time, makes up the total time of the query.
-- no index on first_name
EXPLAIN ANALYZE SELECT * FROM customers WHERE first_name = 'Indiana';
-- Seq Scan on customers (cost=0.00..2614.00 rows=500 width=466) (actual time=0.016..10.136 rows=538 loops=1)
-- Planning time: 0.271 ms
-- Execution time: 10.178 ms
-- with index on last_name
EXPLAIN ANALYZE SELECT * FROM customers WHERE last_name = 'Jones';
-- -> Bitmap Index Scan on customers_last_name_idx (cost=0.00..12.17 rows=500 width=0) (actual time=0.189..0.189 rows=1008 loops=1)
-- Planning time: 0.066 ms
-- Execution time: 0.953 ms
Terminology: SQL Server uses Seek and Scan; PostgreSQL uses Seq Scan, Index Scan and Bitmap Heap Scan.
A scan searches through every record in a database table/view to find the records being asked for.
On the other hand, a seek uses an index to find the specific records being asked for by jumping to their location and either grabbing the data or, if it is a reference, using it to get the information. A seek can only be done when the database server has an organized way to search through the target table/view. This is done by having an index covering your search criteria.
In general, the database server decides between a seek and a scan based on the specific query, making its best guess of which would be faster. This is part of the pre-processing the server does before it actually runs any query you write. When searching for a limited subset of records, the server will likely use a seek for efficiency. However, when searching for a large portion of the table, a scan is more efficient as it examines all relevant records.
A developer can influence whether the server uses a seek or a scan by writing queries that take advantage of indexes, or by starting with the most limiting conditions to encourage seeks. If appropriate, create or delete indexes to further improve the database.
Every query that modifies or updates an indexed table will have a higher runtime. This means that, while filtering in SELECT statements becomes faster, other statements like INSERT, UPDATE or DELETE become slower. On an indexed table, when a single record is modified or added, the server has to modify every index impacted by this change. Therefore, a good practice is to drop the index before inserting or updating a large amount of data and then re-add the index afterwards to restructure the updated data set, as sketched below.
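A sketch of that drop-and-recreate pattern, reusing the customers_last_name_idx index from the EXPLAIN example above (the bulk load itself is elided):
DROP INDEX IF EXISTS customers_last_name_idx;
--...bulk INSERT / COPY into customers here...
CREATE INDEX customers_last_name_idx ON customers (last_name);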
Indexes also take up more space on disk, which will increase time when doing backups or migrations.
Using indexes makes more sense on tables that are searched more than they are updated, and when the searches or filtering are very specific (having to return only a few matches out of a big data set).
Two or more columns can be combined to create a search structure based on values found in all of the combined columns. The index is built in the specific order of columns listed at creation, and that order can have an impact on the efficiency of the search. Multi-column indexes are also referred to as Composite or Compound indexes.
- A single multi-column index is faster (if ordered well) than the server combining indexes.
- A multi-column index is less efficient than a single index in cases where a single index is needed.
- You could create all of them (two single indexes and one multicolumn index), and then the server will try to use the best one in each case, but if they are all not used relatively often/equally then this is a misuse of indexes.
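A minimal sketch of a multi-column index (the column pair on the customers table is hypothetical):
CREATE INDEX customers_last_first_idx ON customers (last_name, first_name); --helps queries filtering on last_name, or on last_name plus first_name; far less useful for first_name alone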
A partial index allows for indexing on a subset of a table, allowing searches to be conducted on just this group of records. It is essentially a filtered index. The filtering does not have to be on the same column that is part of the index.
By default indexes are ordered ASCENDING and with NULL values last. This can be changed by specifying the order right after the column name in the index, with either DESC or ASC and NULLS FIRST or NULLS LAST.
Clustered Indexes:
- A clustered index determines the physical order of data in a table.
- In PostgreSQL, the primary key is often used as the clustered index.
- Only one clustered index can be created per table because the data rows themselves can only be sorted in one order.
- It improves the performance of queries that retrieve a range of values or need to access data in a sorted order.
Something to note that PostgreSQL does differently than other systems: it does not maintain this order automatically. When inserting data into a table with a clustered index, other systems will place new and altered records in their correct location in the database order. PostgreSQL keeps modified records where they are and adds new records to the end, regardless of sorting. If you want to maintain the order, you must run the CLUSTER command again on the index when there have been changes. This will "re-cluster" the index to put all of those new records in the correct place.
Because PostgreSQL does not automatically recluster on INSERT, UPDATE and DELETE statements, those statements might run faster than equivalent statements on a different system. The flip side is that the more your table is modified over time, the less useful the cluster will be for your searches. Reclustering a table has a cost, so you will need to find a balance on when to recluster your table(s). There are tools that can help you identify when this would be useful.
Unclustered Indexes:
- An unclustered index, also known as a non-clustered index, does not alter the physical order of the data in the table.
- It creates a separate structure within the database that points to the actual data rows.
- Multiple unclustered indexes can be created on a single table.
- It is useful for improving the performance of queries that search for specific values or need to access data in a non-sorted order.
PostgreSQL defaults all indexes to non-clustered unless the index is specified to be clustered.
Indexes in PostgreSQL can be created using functions and expressions to enhance query performance and enforce constraints. This allows for more flexible and efficient indexing strategies.
Functions and expressions in indexes allow you to create indexes on the result of a function or an expression applied to a column, rather than the raw column data itself. This can be particularly useful for case-insensitive searches, computed columns, or any other scenario where the indexed value is derived from the original column value.
Using functions and expressions in indexes is useful when you need to:
- Perform case-insensitive searches.
- Index computed or derived values.
- Enforce constraints on transformed data.
Consider a scenario where you want to enforce a unique constraint on email addresses in a case-insensitive manner. You can create a unique index using the LOWER() function to achieve this:
CREATE UNIQUE INDEX unique_email_idx ON users (LOWER(email));
CREATE INDEX orders_total_price_idx ON orders ((quantity * price_base)); --creates an index on the product of two columns
In this example, the LOWER() function is applied to the email column, ensuring that the index is case-insensitive. This means that the same email address entered with different capitalization (e.g. User@Example.com vs. user@example.com) will be considered the same, preventing duplicate entries with different cases.
SELECT * FROM pg_indexes WHERE tablename = '<table_name>'; --view of existing indexes on a table
CREATE INDEX <index_name> ON <table_name>(<column_name>); --creates new index on table column; the naming convention for indexes is "<table_name>_<column>_idx"
CREATE INDEX <index_name> ON <table_name>(<column_name> DESC NULLS FIRST); --index with custom ordering
CREATE INDEX <index_name> ON <table_name>(<column_name>, <column_name>); --multi-column index
CREATE INDEX <index_name> ON <table_name> (<column_name>) WHERE <column_name> LIKE 'some_value'; --partial index for a subset of data
CREATE UNIQUE INDEX <index_name> ON <table_name> (<EXP>(<column_name>)); --index with a function applied to a column
EXPLAIN ANALYZE VERBOSE SELECT * FROM <table_name>; --returns verbose information about the query instead of the query results
DROP INDEX IF EXISTS <index_name>; --deletes an index
SELECT pg_size_pretty(pg_total_relation_size('<table_name>')); --gets size of a table
CLUSTER <table_name> USING <index_name>; --cluster table with existing index
CLUSTER <table_name>; --if index to be clustered on has already been established
CLUSTER; --to cluster every table in database that has an identified index to use