You are on page 1of 18

16/4/2014

PostgreSQL for Beginners


Compartir

Ms

Siguiente blog

Crear blog

Acceder

PostgreSQL for Beginners


Tuesday, Decem ber 7, 2010

CREATE FUNCTION: return types


In the last post we covered the basic anatomy of a SQL function or stored procedure with
CREATE FUNCTION:
CREATE FUNCTION function_name (argument_name argument_type, argument_name
argument_type, ...) RETURNS return_type AS $$
function_body_query;
$$ language language_name;
In this post we'll add some additional options and features related to the function's output.
VOID FUNCTIONS
Functions can return nothing. To do that we declare them as "RETURNS void". This is
appropriate for non-SELECT commands such as INSERT:
CREATE FUNCTION addsalesdata(name varchar, month varchar, value
decimal) returns void as $$
insert into salesfigures values($1,$2,$3);
$$ language 'sql';
For these functions one can call SELECT function_name(...)or SELECT * FROM
function_name(...).
FUNCTIONS THAT RETURN A SINGLE VALUE
Functions can return one value (one int, one decimal, one textetc.). For these
functions one can call SELECT function_name(...)or SELECT * FROM
function_name(...).
CREATE FUNCTION totalsalesformonth(month varchar) returns decimal
as $$
select sum(sales) from salesfigures where month=$1;
$$ language 'sql';
Another option to define a function's output is using OUT parameters. Here we do not use
RETURNS but we list the order, names and types of output alongside the input arguments
preceding each field with "OUT".
CREATE FUNCTION salesforpersonmonth4(month varchar, OUT name
varchar, OUT value decimal) as $$
select name,sales from salesfigures where month=$1;
$$ language 'sql';
aggs=# select * from salesforpersonmonth4('jan');
name | value
------+------joe |
34
(1 row)

Blog Archive

2010 (19)
December (4)
CREATE FUNCTION: return
types
CREATE FUNCTION - basic
anatomy
schema
Aggregates
November (15)

There was an error in this gadget

About Me

Notice that this is returning just one record when in fact the query would produce multiple
rows of results.
FUNCTIONS THAT RETURN A SETOF VALUES

http://www.postgresqlforbeginners.com/

Carl Anderson
Director of Data Science at
WarbyParker.com. Follow me on

1/18

16/4/2014

PostgreSQL for Beginners


Functions can also return a set of values of the same type. That is, imagine it returning a
column of decimals or a column of text. To specify this we use the SETOFmodifier. Here
is a function that returns the sales figures for each salesperson for January.
CREATE FUNCTION salesformonth(month varchar) returns SETOF
decimal as $$
select sales from salesfigures where month=$1;
$$ language 'sql';

twitter
@leapingllamas.
My linkedin profile
View my complete
profile

aggs=# select * from salesformonth('jan');


salesformonth
--------------34
18
(2 rows)
FUNCTIONS THAT RETURN A SETOF ROWS
There are 4 options to return a set of rows.
1) RETURNS table
PostgreSQL functions (version 8.4 and later) can also return a table. That is, we define the
fields of the SETOF rows that the function will return. Here is that last function but with the
person's name as well as their sales:
CREATE FUNCTION salesforpersonformonth(month varchar) returns
table(name text, sales decimal) as $$
select name, sales from salesfigures where month=$1;
$$ language 'sql';
aggs=# select * from salesforpersonformonth('jan');
name | sales
------+------joe |
34
bob |
18
(2 rows)
The types that functions return can be standard PostgreSQL types but they can also be
columns or tables that exist or they can be a type that the user defines.
2) Existing table
Here we have a table with a name and sales value:
aggs=# create table personsales(name varchar, sales decimal);
CREATE TABLE
A function can then return that type:
CREATE FUNCTION salesforpersonmonth(month varchar) returns SETOF
personsales as $$
select name,sales from salesfigures where month=$1;
$$ language 'sql';
aggs=# select * from salesforpersonmonth('jan');
name | sales
------+------joe |
34
bob |
18
(2 rows)
3) Column of existing table
Here is a function that returns a SETOF values of a type defined by the column of an
existing table:
CREATE FUNCTION salesforpersonmonth2(month varchar) returns SETOF
personsales.sales%TYPE as $$
select sales from salesfigures where month=$1;
$$ language 'sql';

http://www.postgresqlforbeginners.com/

2/18

16/4/2014

PostgreSQL for Beginners


aggs=# select * from salesforpersonmonth2('jan');
salesforpersonmonth2
---------------------34
18
(2 rows)
4) Defined type
We can create a new type by using CREATE TYPE. CREATE TYPE has a lot of options
but the basic use is
CREATE TYPE typename AS (fieldname1 type1, fieldname2 type2,...);
For instance,
CREATE TYPE personsales3 AS (name varchar, sales decimal);
Then, one can define
CREATE FUNCTION salesforpersonmonth3(month varchar) returns SETOF
personsales3 as $$
select name,sales from salesfigures where month=$1;
$$ language 'sql';
Posted by Carl Anderson at 6:00 PM

5 comments:

Links to this post

+1 Recommend this on Google

CREATE FUNCTION - basic anatomy


Functions or stored procedures are a very powerful feature of PostgreSQL. Functions
allow one to modularize code, create "units" that do useful tasks and hide the complexity
of the underlying code, here a query. They also reduce the risk of making a mistake by
typing the same query over and over again, such as running a daily or monthly report.
PostgreSQL supports a particularly large variety of programming languages. By using
these procedural language extensions one can write functions in Java (PL/Java), Python
(PL/Python), R (PL/R), Perl (PL/Perl) and many others to interact with a PostgreSQL
database. In this and the next post, however, we are going to cover simple functions in
SQL only. In later posts we will cover PL/pgSQL a more powerful and flexible procedural
language that provides features such as control flow (such as IF...THEN...ELSE), local
variables and multiple queries.
We'll start with a simple SQL function. In the next post, we will expand on this and detail
some additional options and features that CREATE FUNCTION supports.
Here are the sales data from the last post:
aggs=# select * from salesfigures;
name | month | sales
------+-------+------joe | jan |
34
joe | feb |
23
bob | jan |
18
bob | feb |
30
(4 rows)
If we want to get the total sales for a given month we can do
aggs=# select sum(sales) from salesfigures where month='jan';
sum
----52
(1 row)

http://www.postgresqlforbeginners.com/

3/18

16/4/2014

PostgreSQL for Beginners


but if we know that are going to have to do this each month (and you are likely to forget
the query next month) it might make sense to create a function called
totalsalesformonth()that one can call and pass in the month of interest. Here is
how we do that. Don't worry if this looks complicated for now. We will step through each
part.
CREATE FUNCTION totalsalesformonth(month varchar) returns decimal
as $$
select sum(sales) from salesfigures where month=$1;
$$ language 'sql';
and we can run it as
aggs=# select totalsalesformonth('jan');
totalsalesformonth
-------------------52
(1 row)
and get the same result as the plain query. Let us break this function definition down.
COMMAND
CREATE FUNCTION is the command that runs to the final semicolon. It is a query so we
can paste this into the psql command line or into pgAdmin the same as any other query.
It is defining a function called totalsalesformonth. PostgreSQL ignores the capitilization
and turns everything into lower case so this call is also recognized
aggs=# select totalsalesForMonth('jan');
totalsalesformonth
-------------------52
(1 row)
NAME
The function name must be a single string. It cannot be "total sales for month" but it can
be total_sales_for_month. Choose a sensible descriptive name for the function. You and
any other users will thank you later.
PARAMETER LIST
We are specifying that it takes one argument or parameter (technically this is an IN
parameter) called "month" and is of type varchar.
We can have more than one argument. These are specified as a comma-delimited list.
RETURN TYPE
The function is declared to return a single value of type decimal. We will discuss return
types further in the next post.
FUNCTION BODY
The "$$" says to PostgreSQL that we are starting the code that constitutes the body of
the function.
The function body should be a single statement, i.e. only one semicolon. PostgreSQL will
allow multiple statements but it is only going to return the last. That is,
CREATE FUNCTION totalsalesformonth(month varchar) returns decimal
as $$
select sum(sales) from salesfigures where month=$1;
select 3.;
$$ language 'sql';
is valid and will return 3 when called.
The names of the parameters in the parameter list are completely ignored. Instead we use
$1 in place of the first parameter, $2 for the second parameter etc. So, when we call
select totalsalesForMonth('jan');
PostgreSQL takes the value 'jan' and replaces $1 with this to produce

http://www.postgresqlforbeginners.com/

4/18

16/4/2014

PostgreSQL for Beginners


select sum(sales) from salesfigures where month='jan';
which is then executed.
Importantly, one cannot use a passed in parameter for the name of a field or the name of a
table. For instance, this function will not compile:
CREATE FUNCTION totalsales(tablename varchar) returns decimal as
$$
select sum(sales) from
$1;
//WRONG!
$$ language 'sql';
PostgreSQL complains of a syntax error. Why is this?
PostgreSQL is strongly typed. That means it checks that the output of the query matches
the types that the function should return. In this case, the parser cannot know if the
tablename passed in has a sales field and if it does whether it is numeric. There are too
many unknowns for PostgreSQL to be happy that this function will work as intended.If you
do need to create dynamic queries such as this, one can achieve it with PL/pgSQL.
Although the names are ignored it is important to use good descriptive names for
documentation purposes. I don't know about you but I can forget what I coded last week
and appreciate the help of good documentation and besides, you might not be the only
user of these functions. I would strongly discourage use of generic names such as "arg0",
"arg1", "x", "y" unless there is good reason. (For instance, x and y might be the best
choice for a function that processes a point in the Cartesian plane.)
LANGUAGE
Finally, we tell PostgreSQL which language we are using.
In summary, our function is
CREATE FUNCTION function_name (argument_name argument_type, argument_name
argument_type, ...) RETURNS return_type AS $$
function_body_query;
$$ language language_name;
This covers the basic anatomy of a SQL function. CREATE FUNCTION is a sophisticated
command and supports a richer set of options. We'll start on those in the next post.
Posted by Carl Anderson at 9:12 AM

5 comments:

Links to this post

+4 Recommend this on Google

Thursday, Decem ber 2, 2010

schema
In this post we'll examine PostgreSQL's schema support.
We encountered the default schema called "public" in an earlier post when exploring
pgAdmin. Colloquially, when someone talks about a database's schema they typically
mean the structure of the tables (that this table has these fields with these types). Here, a
schema is a higher level of organization, a container or namespace within a database. A
schema contains a set of tables, views, stored procedures and triggers and so on, all the
usual components of a database. However, one can have multiple schema within a
PostgreSQL database. To clarify this, let's start with a concrete example and consider the
pros and cons of different ways of organizing the data.
Suppose we have a supermarket company, SuperFoods Inc., who have two stores, say
Sausalito, CA and Petaluma, CA. (Clearly they have recovered from the bad economy and
layoffs of this earlier post.) Each of the two stores has a set of information: employees
and contact details, details about the physical store itself, replenishment orders and order
history etc.
Imagine the data are stored in two separate databases. The CEO of SuperFoods asks one
of his minions to produce a report on the sales history of the two stores. The problem here

http://www.postgresqlforbeginners.com/

5/18

16/4/2014

PostgreSQL for Beginners


is that databases are independent silos. One cannot write a SQL query that will collate
the data from the two stores. psqlcan only connect to one database at a time. We can
use external tools, such as a JDBC driver in Java that will make a connection to each of
the two databases and collect the data, but we cannot query the two databases
simultaneously from the psqlcommand line.
Imagine now that we take the two databases and put them inside a single SuperFoods
database. We create a container, a schema, for each of the two original databases. Each
schema has a unique name, "sausalito" and "petaluma", and this causes each of the
tables in the schema to have a qualified name that includes the schema name.
That is, we have
sausalito.employees
sausalito.sales
petaluma.employees
petaluma.sales
(Technically, it would include the database name too and be
"SuperFoods".sausalito.employeesetc. under the hood but the user doesn't see
this or need to see it.)
Importantly, when we connect to the SuperFoods database we can now access the sales
report tables from both stores. We cannot do
select * from sales;
because PostgreSQL is going to complain. There is an abmiguity here. There are multiple
sales tables. Which do you mean? Thus, one has to qualify it with the schema name:
select * from sausalito.sales;
to specify the Sausalito sales and
select * from petaluma.sales;
to access the Petaluma sales.
We can now combine those queries with operators such as UNION and UNION ALL
(covered in this post) to get something closer to what the CEO wants:
select 'sausalito', sum(sales) from sausalito.sales where
year=2010
UNION ALL
select 'petaluma', sum(sales) from petaluma.sales where
year=2010;
Great. Goal achieved.
OK, now consider this structure from the perspective of a human resources person at the
Sausalito store. She needs to update the employees table for her store and she isn't
interested in the Petaluma tables. Writing 'sausalito.' before each table in the queries is a
bit of a nuisance. PostgreSQL helps us out with the notion of the search path. The search
path says where to look, similar to classpaths in most programming languages.
Let's connect to a database, set the search path and create a new table.
psql -d SuperFoods
testdb=#set search_path TO sausalito;
SET
testdb=# \d
List of relations
Schema | Name
| Type |
Owner
-----------+-----------+-------+--------------

http://www.postgresqlforbeginners.com/

6/18

16/4/2014

PostgreSQL for Beginners


sausalito | employees | table | postgresuser
sausalito | sales
| table | postgresuser
(2 rows)
Notice we see salesand employeesnot sausalito.sales. When we set the
search path it means, work on the database from the perspective of the sausalito
schema. Now when one does
select * from sales;
it does not complain. It knows that you mean sausalito.sales. There is no ambiguity
now because you cleared that up earlier when you set the search path.
You can even select the data from the other schema using the qualified name
select * from petaluma.sales;
from within the sausalito schema.
Creating tables once one has set the search_path has the same implicit name
qualification:
testdb=# create table promotions(name varchar, value int);
CREATE TABLE
testdb=# \d
List of relations
Schema |
Name
| Type |
Owner
-----------+------------+-------+-------------sausalito | employees | table | postgresuser
sausalito | promotions | table | postgresuser
sausalito | sales
| table | postgresuser
(3 rows)
How do we create a schema? That is a synch:
testdb=# create schema sausalito;
CREATE SCHEMA
testdb=# create schema petaluma;
CREATE SCHEMA
Further, you can dump data from a single schema from the operating system command
line
localhost:~ postgresuser$ pg_dump --schema sausalito testdb >
sausalito.sql
We can see that using schema gives us all the tools that exist for single databases but
we can collect related tables together, in SuperFoods by store, and yet compare data
mong those different collections. Now we understand more about schema we can refine
our notion about them. They are a namespace or rather a namespace within a database.
PostgreSQL only sees one database here: SuperFoods.
Schema can be very powerful and useful way of partitioning data. However, they are
typically used for more complex sets of tables, more complex than those typically needed
for beginner exercises. For simpler cases, an additional field that defines the subset of
data might suffice. Imagine that we include an additional 'store' field in the sales table. A
single table could then be queried for Sausalito sales, Petaluma sales or both.
Also, imagine that SuperFoods assigns a regional manager to the two stores. That
manager is effectively an employee at both stores. We want to avoid replication of data in
databases and so would not want to include his personal details in both schema. In this
case a single, more unified database structure might be more appropriate. Given those,
schema are a valuable addition to PostgreSQL and worth considering when designing
one's database.
Posted by Carl Anderson at 1:54 PM

http://www.postgresqlforbeginners.com/

8 comments:

Links to this post

7/18

16/4/2014

PostgreSQL for Beginners


Recommend this on Google

Wednesday, Decem ber 1, 2010

Aggregates
PostgreSQL supports a number of different aggregate functions (a tutorial appears here).
These are functions that perform some mathematical or statistical calculations on a set of
values. Common examples are sum, min, max and count. The values are often numeric
but they can be boolean, string, or XML too and count accepts anything.
Let's start with a simple example. Suppose we have the following table of sales per month
for different sales people
aggs=# select * from salesfigures;
name | month | sales
------+-------+------joe | jan |
34
joe | feb |
23
bob | jan |
18
bob | feb |
30
(4 rows)
We can get the total sales for the sales team for the year to date as
aggs=# select sum(sales) from salesfigures;
sum
----105
(1 row)
Notice that sum is a function that we can specify in the select list (the part before the
FROM) and it takes one argument. We are supplying it the values in the sales column. It
adds 34 + 23 + 18 + 30 to get 105. Easy.
Filtering by WHERE
As the aggregate gets computed after any WHERE clause is applied we can easily apply
filters on the data that is supplied to the aggregate. For instance, to get the sales for
January
aggs=# select sum(sales) from salesfigures where month='jan';
sum
----52
(1 row)
(34 + 18 = 52)
Filtering by GROUP
Similarly, aggregates are computed after GROUP BY clauses. This is a really powerful
feature. So, if we want the total sales for each month we can run
aggs=# select month, sum(sales) from salesfigures group by month;
month | sum
-------+----feb | 53
jan | 52
(2 rows)
Clearly, using strings for month is not the best table design here as they cannot sensibly
be ordered or compared. I used them to keep the example simple.
We call the aggregates multiple times in the select list. Here we get the min and max
sales values for each person
aggs=# select name, min(sales), max(sales) from salesfigures

http://www.postgresqlforbeginners.com/

8/18

16/4/2014

PostgreSQL for Beginners


group by name;
name | min | max
------+-----+----bob | 18 | 30
joe | 23 | 34
(2 rows)
GOTCHA
If we wanted to know who had the most sales in any month we might be tempted to query
select name from salesfigures where sales=max(sales);

WRONG!

We could imagine this as computing the maximum value in the sales column and the
selecting the one or more names that matched this value. This will not work. If one runs
this query, PostgreSQL complains
aggs=# select name from salesfigures where sales=max(sales);
ERROR: aggregates not allowed in WHERE clause
LINE 1: select name from salesfigures where sales=max(sales);
^
aggs=#
This is a typical beginner mistake. I've done it. You'll probably do it too. As the error
message says, aggregates are not allowed in the WHERE clause. This is because the
WHERE clause is used to generate the set of rows to be passed to the select list and
any aggregates. What we have here is a circular relationship: max depends on which rows
it is given and that is determined by WHERE, which depends on max... What you have to
do is obtain the maximum value as one query and then pass that into another query.
aggs=# select name from salesfigures where sales=(select
max(sales) from salesfigures);
name
-----joe
(1 row)
(select max(sales) from salesfigures)is called a subquery, something we
will cover in more detail in a later post. For now, just know that it computes this subquery
first to obtain a value, which will be 34. We then effectively have a simple query
select name from salesfigures where sales=34;
which is OK as the circularity has been eliminated.
Filtering by HAVING
OK, so we cannot have aggregates in WHERE clause. However, we can have them in a
HAVING clause. If you recall our earlier post, WHERE can be accompanied by an
optional HAVING clause that can be used to filter rows after the WHERE clause is
performed.
Who is underperforming? Which salesperson(s) have had less than 20 sales in any one
month?
aggs=# select name from salesfigures group by name having
min(sales)<20;
name
-----bob
(1 row)
This query groups first to get a set of virtual tables, then applies the min aggregate to get
the minumum per person. It then filters in those with a minumum less than 20. Finally, we
asked for SELECT name so it returns the one name left, that for Bob.
COUNT
COUNT is commonly used aggregate. SELECT count(*) from tablenamereturns
the number of rows. The * just means "everything", every column in this context. If we ran

http://www.postgresqlforbeginners.com/

9/18

16/4/2014

PostgreSQL for Beginners


SELECT count(*) from salesfiguresit would return 4, as there are four rows, as
would SELECT count(sales) from salesfigures. Count counts the rows not the
values. Don't confuse this with sum!
Unfortunately, COUNT is fairly slow in PostgreSQL. It can take seconds or sometime tens
of seconds in really large tables. The reason is that when a row is deleted from a table, it
is not really deleted, just flagged as not being needed any more. (This is the same in
some operating system when one deletes a file.) Also, the rows for a table may be
scattered through memory. Count has to scan through each row of the table sequentially,
work out which rows are "dead" in order to return the total row count. Other databases can
do this more efficiently and is a frequent complaint about PostgreSQL performance.
However, it does excel in many other respects over other databases.
This lists the various aggregates that PostgreSQL provides. As well as count, min, max
and sum mentioned above other commonly used aggregates are avg (average) and stddev
(standard deviation, a statistical measure of variability). Unfortunately, median is not
provided.
Let's end with a non numeric example. The EVERY aggregate checks whether each and
every row is true
aggs=# select * from clients;
name
| paidbill
--------------+---------acme corp. | t
lehman bros. | f
GM
|t
(3 rows)
aggs=# select every(paidbill) from clients;
every
------f
(1 row)
Posted by Carl Anderson at 9:25 AM

No comments:

Links to this post

Recommend this on Google

Saturday, Novem ber 27, 2010

SQL: outer joins


Last time we covered INNER JOINs including a couple of special cases CROSS JOIN and
self joins. In INNER JOINS the result set returns only the set of rows that match the join
criteria (if any). IN OUTER JOINS, the results might contain both matched and unmatched
rows. It is for this reason that beginners might find such JOINS a little more confusing.
However, the logic is really quite straightforward.
There are three types of outer joins: LEFT, RIGHT, and FULL. Let's start with a LEFT
OUTER JOIN.
In a left join when joining table t1 with table t2 PostgreSQL first does a "normal" inner join.
It then looks to see if there are any rows from t1 that are not in the result set. If so, it adds
in those rows placing NULLs for all the fields of t2. Thus, this guarantees that each row of
t1 will appear at least once in the results.
Here are our data:
joindb=# select * from clients;
id | name
----+------0 | joe
1 | bob
2 | alice
(3 rows)

http://www.postgresqlforbeginners.com/

10/18

16/4/2014

PostgreSQL for Beginners


joindb=# select * from orders;
id | value
----+------1 | val1
2 | val2
3 | val3
(3 rows)
An inner join on ids matches on id=1 and id=2
joindb=# select * from clients join orders on
clients.id=orders.id;
id | name | id | value
----+-------+----+------1 | bob | 1 | val1
2 | alice | 2 | val2
(2 rows)
but a left outer join has that inner join result set plus the row for id=0 from clients (shown
in red):
joindb=# select * from clients left join orders on
clients.id=orders.id;
id | name | id | value
----+-------+----+------0 | joe |
|
1 | bob | 1 | val1
2 | alice | 2 | val2
(3 rows)
Left joins are implicitly OUTER JOINs but we can add that keyword in explicitly
select * from clients left outer join orders on
clients.id=orders.id;
We can obtain the same result set if we use the USING syntax: select * from
clients left join orders using(id);
That's it. Simple isn't it?
Why might we want something like this? Suppose we have a list of clients and we have
purchase data from the last month. We might want a report of the sales for all our clients,
even those that didn't buy anything in that month. This sort of report might be useful for
the sales staff to target those clients.
OK, if a LEFT JOIN does an inner join and adds in any additonal "missing" rows from t1,
what do you think a RIGHT JOIN does? Exactly. Do an INNER JOIN and add in any
missing rows from t2.
joindb=# select * from clients right join orders on
clients.id=orders.id;
id | name | id | value
----+-------+----+------1 | bob | 1 | val1
2 | alice | 2 | val2
|
| 3 | val3
(3 rows)
Here we can see the row for id=3 from orders shown in blue.
The last type is a FULL OUTER JOIN. That is an left and a right outer join. Do an inner
join and add in any missing rows from both t1 and t2.
joindb=# select * from clients full join orders on
clients.id=orders.id;
id | name | id | value
----+-------+----+-------

http://www.postgresqlforbeginners.com/

11/18

16/4/2014

PostgreSQL for Beginners


0 | joe |
|
1 | bob | 1 | val1
2 | alice | 2 | val2
|
| 3 | val3
(4 rows)
Again, we can use the ON join_criteria or the USING (join_column_list) syntax and we
can include or omit the OUTER keyword.
We can also do a NATURAL join for left, right and full outer joins. Recall that with natural
joins we ask PostgreSQL to match up the columns that share the same name.
joindb=# select * from clients natural full join orders;
id | name | value
----+-------+------0 | joe |
1 | bob | val1
2 | alice | val2
3|
| val3
(4 rows)
Finally, in this and the last post we only joined two tables. We can in fact join as many
tables as we wish in a single query. For outer joins you will need to use parentheses and
nest the joins. For inner joins, however, one can do
select * from tablename1 t1, tablename2 t2, tablename3 t3,...
where t1.id=t2.id and t1.id=t3.id and...;
Posted by Carl Anderson at 4:13 PM

5 comments:

Links to this post

Recommend this on Google

Friday, Novem ber 26, 2010

SQL: inner, cross and self joins


Understanding how table joins work is one of the key skills of a database beginner. In an
earlier post, we normalized our ceostable into two tables, one table to list the set of
unique people and a second table to associate a person with a term as CEO. What we
didn't cover then was how to combine those tables now those data had been broken into
separate tables. This is the work of join queries. When we join tables we take two or more
tables and combine them using zero, one or more join criteria to create a set of resulting
rows.
Suppose that we have two tables:
joindb=# select * from clients;
id | name
----+------0 | joe
1 | bob
2 | alice
(3 rows)
joindb=# select * from orders;
id | value
----+------1 | val1
2 | val2
3 | val3
(3 rows)
CROSS JOIN
The simplest type of join is a cross join. This says create a results set of all possible
combinations of the the rows from the two tables.

http://www.postgresqlforbeginners.com/

12/18

16/4/2014

PostgreSQL for Beginners


joindb=# select * from clients, orders;
id | name | id | value
----+-------+----+------0 | joe | 1 | val1
1 | bob | 1 | val1
2 | alice | 1 | val1
0 | joe | 2 | val2
1 | bob | 2 | val2
2 | alice | 2 | val2
0 | joe | 3 | val3
1 | bob | 3 | val3
2 | alice | 3 | val3
(9 rows)
The first two columns are all the fields of the first of our tables listed in the query, clients
(shown in red), followed by all the fields of the second table, orders. We can see that it
takes the clients table and combines all its rows with the first row of orders, then it takes
the clients table again and combines with the second row of orders and finally does the
same for the third row of orders to produce 3 * 3 rows.
We could have produced the same set of results using the more explicit CROSS JOIN
terminology:
joindb=# select * from clients cross join orders;
id | name | id | value
----+-------+----+------0 | joe | 1 | val1
1 | bob | 1 | val1
2 | alice | 1 | val1
0 | joe | 2 | val2
1 | bob | 2 | val2
2 | alice | 2 | val2
0 | joe | 3 | val3
1 | bob | 3 | val3
2 | alice | 3 | val3
(9 rows)
INNER JOIN
Cross joins can be useful but more typically we restrict the set of combinations based on
some join criteria. Suppose we want to match up the client and order Ids:
joindb=# select * from clients join orders on
clients.id=orders.id;
id | name | id | value
----+-------+----+------1 | bob | 1 | val1
2 | alice | 2 | val2
(2 rows)
This says output only the rows where there is a matching ID for both clients and orders so
we only obtain 2 rows. (0,'joe')of clients doesn't have a match so that is not output.
Similarly for (3,'val3')for orders. These criteria are boolean criteria. They must return
true or false for each row combination. This is called an INNER JOIN and we can supply
that keyword explicitly. This query produces the same results as the last query: select
* from clients inner join orders on clients.id=orders.id;.
It is not unusual to have a large number of join criteria. It can be tedious to keep writing
out the table names for all these criteria so we can shorten the criteria text by the use of
an alias. This is a short, temporary "nickname" we can give to each table. Let's give client
an alias of t1 and orders t2.
We assign the name by putting the alias after each of our tables in the from section of our
query. We can then use them in the later part of the query:
joindb=# select * from clients t1 join orders t2 on t1.id=t2.id;
id | name | id | value
----+-------+----+-------

http://www.postgresqlforbeginners.com/

13/18

16/4/2014

PostgreSQL for Beginners


1 | bob | 1 | val1
2 | alice | 2 | val2
(2 rows)
If you use aliases you must use them in the join criteria otherwise PostgreSQL will
complain:
joindb=# select * from clients t1 join orders t2 on
clients.id=orders.id;
ERROR: invalid reference to FROM-clause entry for table
"clients"
LINE 1: select * from clients t1 join orders t2 on
clients.id=orders...
^
HINT: Perhaps you meant to reference the table alias "t1".
Rather than using ON we can specify a field list after USING
joindb=# select * from clients join orders using (id);
id | name | value
----+-------+------1 | bob | val1
2 | alice | val2
(2 rows)
We can actually drop the JOIN keyword altogether here if we use WHERE instead of ON:
joindb=# select * from clients t1, orders t2 where t1.id=t2.id;
id | name | id | value
----+-------+----+------1 | bob | 1 | val1
2 | alice | 2 | val2
(2 rows)
NATURAL JOIN
Often we give fields the same fieldname for the same entity types across different tables.
That is, if we have a person field in different tables we might use the same "person_id"
name in those different tables. Such conventions is an aspect of good database design
and can avoid some confusion when designing queries. We can ask PostgreSQL to make
use of this convention in a NATURAL join which is a special type of INNER join. This is a
join where we ask PostgreSQL to match up on all columns that have the same name. In
the following query
joindb=# select * from clients natural join orders;
id | name | value
----+-------+------1 | bob | val1
2 | alice | val2
(2 rows)
PostgreSQL works out that we have an id column in each table and implicitly does an
INNER JOIN on that column.
SELF JOIN
The tables we are joining don't have to be different tables. We can join a table with itself.
This is called a self join. In this case we have to use aliases for the table otherwise
PostgreSQL will not know which id column of which table instance we mean
joindb=# select * from clients t1, clients t2 where t1.id=t2.id;
id | name | id | name
----+-------+----+------0 | joe | 0 | joe
1 | bob | 1 | bob
2 | alice | 2 | alice
(3 rows)
To summarize, we can join two tables t1 and t2 using a cross join:

http://www.postgresqlforbeginners.com/

14/18

16/4/2014

PostgreSQL for Beginners


T1 CROSS JOIN T2
T1, T2
a self join
T1 CROSS JOIN T1
T1, T1
or a qualified inner join:
T1 [INNER] JOIN T2 ON boolean_expression
T1 [INNER] JOIN T2 USING ( join column list )
T1 NATURAL [INNER] JOIN T2
T1, T2 WHERE ....
and we can use table aliases
tablename t1, tablename t2...
In the next post we will cover OUTER JOINs.
Posted by Carl Anderson at 5:54 PM

2 comments:

Links to this post

Recommend this on Google

Sunday, Novem ber 21, 2010

SQL: UNION, EXCEPT and INTERSECT


In this post we shall examine the very useful PostgreSQL operators UNION, EXCEPT and
INTERSECT, each of which can be followed by ALL.
A hot startup is holding a special event and wants to send out invites to some of their best
clients and also to some VIPs. Some of the VIPs are actually very supportive of the site
and are clients too. What query will provide the complete set of people to invite avoiding
duplicate records? For this, we want UNION. UNION is an OR operator for multiple result
sets. Here is our data
create database hotstartup;
\c hotstartup;
create table clients(name varchar);
insert into clients values('John Smith');
insert into clients values('Ashton Kutcher');
insert into clients values('Joe Doe');
insert into clients values('Stephen Fry');
create table vips(name varchar);
insert into vips values('Ashton Kutcher');
insert into vips values('Stephen Fry');
insert into vips values('Demi Moore');
insert into vips values('Hugh Laurie');
and the query will be
hotstartup=# select * from clients union select * from vips;
name
---------------Ashton Kutcher
Hugh Laurie
Joe Doe
Demi Moore
John Smith
Stephen Fry
(6 rows)

http://www.postgresqlforbeginners.com/

15/18

16/4/2014

PostgreSQL for Beginners


The query is saying get the rows from clients and append the rows of vips but remove any
duplicates. Here we can see that we can have a total of 6 unique names from the initial
set of 8 rows.
There are times when we do not want to remove duplicates. For that, we can append ALL
after UNION:
hotstartup=# select * from clients union all select * from vips;
name
---------------John Smith
Ashton Kutcher
Joe Doe
Stephen Fry
Ashton Kutcher
Stephen Fry
Demi Moore
Hugh Laurie
(8 rows)
The head party planner sees the list of 6 invites and says "No, no, no. We need to send
out two sets of invites, one set to the VIPs that will get them into the VIP tent and then
ordinary invites for everyone else". Clearly, select * from vipswill return the names
for the VIP invite list. However, what will get us the list for the remaining invites? We want
everyone on the clients list EXCEPT those on the VIP list. Unsurprisingly, we use
EXCEPT
hotstartup=# select * from clients except select * from vips;
name
-----------Joe Doe
John Smith
(2 rows)
Finally, if we want to get the list of people who are both clients and VIP we can use
INTERSECT.
hotstartup=# select * from clients intersect select * from vips;
name
---------------Ashton Kutcher
Stephen Fry
(2 rows)
Both EXCEPT and INTERSECT can be followed by an optional ALL. To examine its effect
we need to add some duplicate data. Let's insert a duplicate name into clients:
hotstartup=# select * from clients;
name
---------------John Smith
Ashton Kutcher
Joe Doe
Stephen Fry
Stephen Fry
(5 rows)
hotstartup=# select * from vips;
name
---------------Ashton Kutcher
Demi Moore
Hugh Laurie
Stephen Fry
(4 rows)

http://www.postgresqlforbeginners.com/

16/18

16/4/2014

PostgreSQL for Beginners


Compare the results of these two
hotstartup=# select * from clients except select * from vips;
name
-----------Joe Doe
John Smith
(2 rows)
versus
hotstartup=# select * from clients except all select * from vips;
name
------------Joe Doe
John Smith
Stephen Fry
(3 rows)
The latter query is essentially saying take each row of clientsand find a matching row
in vips. If there is match delete the pair of rows from the results set.
For INTERSECT, let's insert Stephen Fry into vipsto have him duplicated in both lists
hotstartup=# select * from clients;
name
---------------John Smith
Ashton Kutcher
Joe Doe
Stephen Fry
Stephen Fry
(5 rows)
hotstartup=# select * from vips;
name
---------------Ashton Kutcher
Stephen Fry
Demi Moore
Hugh Laurie
Stephen Fry
(5 rows)
Then we can see that INTERSECT ALL no longer removes duplicates, i.e.
hotstartup=# select * from clients intersect select * from vips;
name
---------------Ashton Kutcher
Stephen Fry
(2 rows)
versus
hotstartup=# select * from clients intersect all select * from
vips;
name
---------------Ashton Kutcher
Stephen Fry
Stephen Fry
(3 rows)
The latter query is saying find matching pairs of rows. Stephen Fry appears in both tables
twice so we find two matching pairs for him and hence two rows appears in the results.

http://www.postgresqlforbeginners.com/

17/18

16/4/2014

PostgreSQL for Beginners


In all these three operators the columns of the two select statements must match. We
cannot have the first select statement returning 3 fields and the second returning two. We
can demonstrate that with our ceostable:
apple=# select * from ceos;
name
| year
------------------+-----Steve Jobs
| 1976
Mike Markkula
| 1977
Mike Scott
| 1978
John Sculley
| 1983
Michael Spindler | 1993
Gil Amelio
| 1996
Steve Jobs
| 1997
(7 rows)
apple=# select * from ceos where year between 1970 and 1979 union
select * from ceos where year=1977;
name
| year
---------------+-----Mike Scott
| 1978
Steve Jobs
| 1976
Mike Markkula | 1977
(3 rows)
This is OK because both statements are returning * which as they refer to the same table
means the same set of fields.
This, however, is not acceptable
apple=# select * from ceos where year between 1970 and 1979 union
select name from ceos where year=1977;
ERROR: each UNION query must have the same number of columns
LINE 1: ...os where year between 1970 and 1979 union select name
from ...
because the first statement returns all fields wheareas the second returns one field only.
Posted by Carl Anderson at 9:42 PM

3 comments:

Links to this post

Recommend this on Google

Home

Older Posts

Subscribe to: Posts (Atom)

2010 Carl Anderson. All Rights Reserved. Simple template. Powered by Blogger.

http://www.postgresqlforbeginners.com/

18/18

You might also like