Contents

  1. The Basics
    1. Case sensitivity
    2. Conventions in SQL syntax
    3. Quotation marks
    4. Data Types
      1. Character
      2. Number
      3. Date
      4. Time
  2. Retrieving Data from a Table
    1. The Simplest SELECT Statements
    2. Retrieving columns with SELECT and FROM
    3. Eliminating duplicate rows with DISTINCT
    4. Filtering rows with WHERE
      1. Comparisons
      2. Combining and negating conditions with AND, OR, and NOT
      3. Matching patterns with LIKE
      4. Range filtering with BETWEEN
      5. List filtering with IN
      6. Testing for blanks or nulls with NULL
    5. Sorting rows with ORDER BY
    6. Creating column aliases with AS
    7. Writing Human Readable Queries
  3. Operators and Functions
    1. Performing Arithmetic Operations
  4. Summarizing and Grouping Data
    1. Aggregate Functions
      1. COUNT
      2. MIN and MAX
      3. AVG
      4. SUM
      5. FLOOR and CEILING
    2. Grouping rows with GROUP BY
    3. Filtering groups with HAVING
  5. Retrieving Data from Multiple Tables
    1. Qualifying column names
    2. Creating joins with JOIN or WHERE
      1. Selecting Columns from two or more tables with WHERE
      2. Join two or more tables using JOIN
  6. Schemas
    1. Saving your query result INTO a new table
    2. Queries with multiple SELECT statements
    3. Temporary tables
    4. Deleting tables
    5. Importing Tables into your schema
    6. Exporting Tables
    7. Sandbox
      1. Permissions
    8. Granting Permission to a table
  7. Views
  8. Saving your queries
    1. About text editors:
    2. Documenting your queries with comments
      1. Some examples of comment use
        1. All comments laid out before stating the query
        2. Brief "big picture" explanation at beginning, then brief comments within the query on their own lines
        3. Technical details laid out briefly at beginning, then comments on the same line as the code
        4. Thorough explanation and rationale, then comments spread throughout the query
  9. A Couple Useful Queries
    1. Calculate Age
    2. Change a table name
    3. Concatenate table columns
    4. Extract year (or day or month) from a date field

The Basics

SQL (Structured Query Language) is the standard programming language for creating and retrieving information that is stored in a relational database. The user interface or front-end (what you see online) is called phpPgAdmin. The following pages are meant as a basic introduction to SQL, phpPgAdmin, and query writing. Please see BaBase documentation for specifics about the database. There are many resources online for more detailed information about SQL. There are links to SQL tutorials and documentation through the Babase Wiki

Case sensitivity

SQL language is case insensitive but user-defined identifiers (table names, column names, and other such names for things that are not data) are case sensitive. For example, the SQL SELECT statement could be written as SELECT, SeLect, or select but the table name biograph is case sensitive. (A convention often used is to write key words in upper case and names in lower case, e.g.: SELECT * FROM babase;) Fortunately, unless identifiers are written inside double quotes (") they are automatically converted to lower case. In practice this means that you do not need to worry about case unless someone has gone to the trouble of creating mixed-case identifiers.

The point to take away is: create table names, column names, and the like, using lower case letters only. To do otherwise is to invite trouble.

Conventions in SQL syntax

In this document I have followed some SQL syntax conventions when writing query statements.

UPPERCASE

Uppercase indicates SQL language or commands

lowercase

Lowercase indicates user defined items such as tables and columns specific to BaBase.

Quotation marks

All quotation marks in SQL are straight quotes (such as ' and "), not curly quotes (such as ‘ and “). Curly quotes will not work in the SQL window. Be careful when cutting and pasting from text editors. Sometimes the default is curly quotes and you will receive an error message when you run the query. You can delete the quotes in the SQL query window and retype them. When you type directly into the SQL query window you will have the correct quote format.

Data Types

Each column in a table has a single data type. A data type determines a column’s allowable values. In general, we can classify columns as the following types.

Character

Represent text and are case sensitive. Whenever characters are requested they are case sensitive and need to be enclosed in single quotation marks. Examples of character columns are the name and pid columns in the biograph table.

Number

Numbers can be integers or decimals. Numbers can be negative, zero, or positive. Numbers do not require quotation marks. Examples of numeric columns are the matgrp and bstatus columns in the biograph table.

Date

A date stored in a column defined as DATE has three fields - YEAR, MONTH, and DAY - and is formatted yyyy-mm-dd. Dates must be enclosed in single quotations. Examples of date columns are the birth and statdate columns in the biograph table.

Time

Represents a time of day. A time stored in a column defined as TIME has three fields – HOUR, MINUTE, and SECOND – and is formatted hh:mm:ss. Times must be enclosed in single quotations. Examples of time columns are the start and stop columns in the interact_data table.Retrieving data from a table.

Retrieving Data from a Table

The Simplest SELECT Statements

The simplest select statements do not even return table content.

SELECT 'Hello World';

It is not very often that you ask SQL to return a constant value, but it does happen occasionally. More often, some columns are the result of computation.

SELECT 'The answer is:', (2 * 20) + 2;

Retrieving columns with SELECT and FROM

The simplest useful SELECT statement retrieves columns FROM a table. The SELECT and FROM clauses are always required if you retrieve columns from tables. All other clauses are optional.

SELECT name FROM biograph;

This query will return every row in the name column of the biograph table.

SELECT name, birth FROM biograph;

This statement will return every name and birthdate in the biograph table.

SELECT * FROM biograph;

This will return all of the rows and columns contained in the biograph table.

Eliminating duplicate rows with DISTINCT

Although DISTINCT is an aggregate function (discussed below) it is useful to introduce this one earlier on.

SELECT DISTINCT(sname) FROM members;

The members table has multiple entries for most individuals. In this case the DISTINCT command returns each sname only once.

Filtering rows with WHERE

The result of each SELECT statement so far has included every row in the table (for the specified columns). The WHERE clause filters unwanted rows from the result. In a WHERE clause you specify a search condition that contains one or more conditions that need to be satisfied by the rows of the table. Rows for which the condition or conditions are true are returned.

Comparisons

The most common types of conditions involve comparing two values to each other. There are six types of comparisons we can define. Note that inequality can be defined in two different ways.

Symbol

Usage

=

Equality

<>

Inequality (not equal to)

!=

Inequality (not equal to)

<

Less than

>

Greater than

<=

Less than or equal to

>=

Greater than or equal to

SELECT * FROM biograph
         WHERE sname=’CHA’;

All information contained in the biograph table where the sname is ‘CHA’.

SELECT * FROM biograph
         WHERE sex=’F’;

All rows in the biograph table that have ‘F’ listed in the sex column will be returned. This is every row pertaining to a female.

SELECT * FROM biograph
         WHERE matgrp <> 9;

This query returns all rows from the biograph table where the matgrp (maternal group) is not 9.

SELECT * FROM biograph
         WHERE birth < ‘2005-01-01’;

This query will return every row in the biograph table where the birth date is before (but not including) January 1st, 2005.

SELECT * FROM biograph
         WHERE birth <= ‘2005-01-01’;

This query will return every row in the biograph table where the birth date is before or equal to January 1st, 2005.

The “greater than” statements work in exactly the same way as the “less than” statements shown above.

Combining and negating conditions with AND, OR, and NOT

Often you will need to specify multiple conditions in one SELECT statement. You can combine two or more conditions using the AND, OR, and NOT operators.

SELECT * FROM biograph
         WHERE matgrp=1.1 AND sex=’M’ AND birth >= ‘2005-01-01’;

This statement will return all rows in the biograph table where the individual is male, has a matgrp of 1.1 (born in Nyayo’s group) and was born on or after January 1st 2005.

SELECT * FROM biograph
         WHERE sname=’ELV’ OR sname=’CHA’ OR sname=’POW’;

This statement will return all rows in the biograph table where the sname is ‘ELV’, ‘CHA’ or ‘POW’.

SELECT * FROM biograph
         WHERE NOT matgrp <= 3;

All rows in the biograph table where matgrp is not less than or equal to 3 will be returned. This could also be stated without using a NOT statement and using

SELECT * FROM biograph
         WHERE matgrp > 3;

Matching patterns with LIKE

Preceding examples retrieved rows based on the exact value of a column or columns. You can use LIKE to retrieve rows based on partial information. LIKE works only with character strings. In the LIKE statement you can use two different wildcard characters. A percent sign (%) matches any string of zero or more characters. An underscore (_) matches any one character. You can negate a LIKE statement with NOT LIKE.

SELECT * FROM biograph
         WHERE name LIKE 'C%';

This statement will return all rows in the biograph table where the name starts with a “C”

SELECT * FROM biograph
         WHERE pid LIKE 'MET_';

This statement will return all rows in the biograph table where the pid starts with ‘MET’. In other words, all rows corresponding to an individual whose mother’s sname is ‘MET’.

SELECT * FROM biograph
         WHERE pid LIKE 'MET%';

Range filtering with BETWEEN

SELECT DISTINCT(sname) FROM members
                       WHERE grp=1.1 AND date BETWEEN '2004-01-01'
                                               AND '2005-01-01';

This statement will return the distinct snames of all the individuals in the members table from group 1.1 between (this is an inclusive range) January 1st 2004 and January 1st 2005. In other words, who was in group 1.1 at some point during the given time period.

List filtering with IN

SELECT * FROM biograph
         WHERE sname IN ('CHA', 'ELV', 'POW');

Testing for blanks or nulls with NULL

Nulls represent missing or unknown values. This situation can cause problems in WHERE clauses because unknown values do not satisfy specific conditions. Nulls match no values – not even other nulls.

SELECT * FROM biograph
         WHERE pid IS NULL;

This statement returns every row in the biograph table where a null exists in the pid column. This translates to every individual whose mother is unknown to us.

SELECT * FROM biograph
         WHERE sname IS NOT NULL;

This statement returns every row in the biograph table where we have an sname for the individual. Aborted fetuses usually are not given a name but may still have a row in the biograph table.

Sorting rows with ORDER BY

Rows in a query result are unordered. Often you will want the output of your data in a particular order. For example, you may want your result ordered alphabetically or by date.

SELECT * FROM biograph
         WHERE sex=’M’ ORDER BY sname;

This will return all entries in biograph where the sex is male and will be returned in alphabetical order by sname.

SELECT * FROM members
         ORDER BY grp, date;

This will return all entries in members where the group is equal to 2. The results will be in numerical order by group and, within each group, will be ordered chronologically by date.

SELECT * FROM members
         WHERE date between '2005-01-01' and '2005-03-01'
         ORDER by grp, date;

This will return all individuals present in any group during this time period, sorted into their groups and by the daye, in that order.

Creating column aliases with AS

In the queries we have seen so far we have accepted the return of the default column names. The AS clause can be used to create a column alias. A column alias is an alternative name. The new name is used in both the column headings and elsewhere in the query itself. If a column name is too long or cryptic, or if the column is the result of a computation the query may be made more clear by changing the name.

Double quotes are required around the new name unless the alias is a single word that contains only lower case letters, digits, or underscores.

SELECT 'The answer is:' AS "Label", (2 * 20) + 2 AS "Answer";

SELECT sname AS "short name" FROM biograph;

or

SELECT sname AS shortname, statdate - birth AS daysalive FROM biograph WHERE statdate - birth < 20 ORDER BY daysalive, shortname;

The computed value is given a column name in both the output column heading and the ORDER BY that orders the output, but not in the WHERE clause because the WHERE clause is needed to produce the output.

When using column aliases, it is strongly-recommended that you use descriptive names that clearly illustrate what the column holds. It's tempting to abbreviate complicated titles with a few letters, if for no other reason than because it's easier to just type a few letters than a long name. For the sake of keeping your data readable (to others and even to yourself, in case you revisit old data a few months later), avoid over-abbreviating column names. See "Saving your query result INTO a new table" for an example of over-abbreviating a table name.

Writing Human Readable Queries

It is important to write your queries so that they can be read by a human. It is no good having a query that the computer understands but a human does not because it is the human that must interpret the query's result. Column aliases are useful in this regard, but the most useful technique is indentation. SQL statements can be continued over multiple lines; they do not end until the final semicolon. Breaking a query into multiple lines, and then using indentation that reflects the structure of the SQL, gives the reader an overview of the query structure at a glance and allows her to focus on and understand each component of the query in the context of the whole.

SELECT sname AS shortname
            , statdate - birth AS daysalive
  FROM biograph
  WHERE statdate - birth < 20
  ORDER BY daysalive, shortname;

Line breaks and indentation makes queries easier to understand.

Queries may contain comments. Comments begin with -- (2 dashes) and continue to the end of the line. Comments can go anywhere; on a line by themselves or at the end of a line containing SQL. Judicious use of comments will help the reader understand the most complex of queries.

-- This is a comment.  It does nothing, has no effect, and may as well not be here.

Comments aid understanding, but unless they reveal what's otherwise hidden they are just clutter.

Formatting and commenting queries as part of the writing process helps clarify goals, aids present and future understanding of the query, aids debugging, and helps ensure the query really does produce the result expected.

You may have noted that the query above, which returns baboons who have lived less than 20 days, does not consider whether the individuals are alive or dead. The results would include not only those individuals which lived less than 20 days, but also all individuals born within 20 days of the last observation of the group. Queries quickly become complex and hard to read, with subtleties that only the informed can appreciate. The formatting conventions and commenting techniques of the following example may be applied to queries of arbitrary complexity to greatly aid readability.

-- Show individuals who have lived less than 20 days
SELECT sname AS shortname
            , statdate - birth AS daysalive         -- Subtracting two dates gives the number of days between the two
  FROM biograph
  WHERE statdate - birth < 20
                -- Don't show living individuals; do include more than just the confirmed dead
                AND status <> 0                           -- Status of 0 means alive
  ORDER BY daysalive, shortname;

Comments can document goals, describe obscure SQL, document a query's sub-components and the non-obvious implications of the choices made, explain data values, and so forth.

Operators and Functions

Performing Arithmetic Operations

To add, subtract, multiply, or divide use the operators found in the following table.

Symbol

Usage

+

Addition

-

Subtraction

*

Multiplication

/

Division

SELECT (stop - start) AS "time spent" FROM interact;

This query will create a column called “time spent” which will give us the length of each interaction in the interact table based on the start and stop time.

SELECT sname, (statdate - birth) AS days FROM biograph;

This query will create a column with sname and one called “days” with the number of days from birth until the statdate for each individual.

Summarizing and Grouping Data

Aggregate Functions

COUNT

SELECT COUNT(sname) FROM biograph
                    WHERE matgrp=1.1;

This will show how many individuals were born into Nyayo’s group (1.1)

SELECT COUNT(DISTINCT(sname)) FROM members
                              WHERE grp=1.1;

The members table has multiple entries for most individuals. In this case the DISTINCT command lets us count each sname only once. This query tells us how many different individuals have been a member of group 1.1.

MIN and MAX

SELECT MAX(rank), MIN(rank) FROM ranks
                            WHERE sname=’ROC’;

The result of this query is the maximum numeric rank attained by Rocky (5), and the minimum numeric rank attained by Rocky (1).

SELECT MAX(date), MIN(date) FROM members
                            WHERE grp=1.21;

The result of this query is the first and last dates recorded in the members table for Omo’s group (1.21).

AVG

SELECT AVG(stop-start) FROM interact_data
                       WHERE act='C' ;

This query lets us know the average length of all consortships found in the interact_data table

SUM

SELECT SUM(stop-start) FROM interact_data
                       WHERE act='C' ;

This query lets us know the total time for which we have observed consortships from the interact_data table.

FLOOR and CEILING

SELECT FLOOR(1.21);

This will return the value of 1 which is the largest integer equal or less than 1.21.

SELECT CEILING(6.45);

This will return the value of 7 which is the smallest integer greater than 6.45.

Grouping rows with GROUP BY

All of the examples so far have applied the aggregate functions across all the rows returned by the FROM and WHERE clauses. You can use the GROUP BY clause to divide a table into groups or categories and apply aggregate functions to each subset. The GROUP BY clause comes after the WHERE clause but before the ORDER BY clause. An example will clarify the concept.

SELECT matgrp, COUNT(matgrp) FROM biograph
                             GROUP BY matgrp
                             ORDER BY matgrp;

This query returns the total number of individuals attributed to each matgrp in the biograph table. The result shows the count of individuals grouped by matgrp.

SELECT matgrp, sex, COUNT(matgrp) FROM biograph
                                  GROUP BY matgrp, sex
                                  ORDER BY matgrp;

This query returns the total number of individuals attributed to each matgrp in the biograph table. The results are also grouped by sex. The result shows the count of individuals grouped by matgrp and then sex.

Filtering groups with HAVING

**************this needs to be added**************************

Retrieving Data from Multiple Tables

All of the queries so far have retrieved rows from a single table. In a relational database we may need information from multiple tables at once. In order to retrieve information from two or more tables simultaneously we need to understand how the columns in the different tables are related to each other. The Babase Pocket Reference outlines how the columns of all of the BaBase tables are connected to each other (entity diagram).

Qualifying column names

Column names must be unique within a table but can be reused in another table. To identify an otherwise ambiguous column uniquely in a query that involves multiple tables, we must use a qualified name. A qualified name is a table name followed by a dot followed by the name of the column in the table. Because table names within the database must be unique, a qualified name uniquely identifies a single column within the entire database. If two tables have any columns in common you must qualify all references to these columns throughout the query.

A qualified table name is of the form:

table.column

It is considered good form by the SQL digirati to always qualify all column names.

Creating joins with JOIN or WHERE

A query that extracts data from more than one table must specify how we want to join the tables. There must be a connection! The tables are always joined row by row by satisfying whatever conditions you specify in the query. We must always specify one or more join conditions to be evaluated for each pair of joined rows.

Selecting Columns from two or more tables with WHERE

SELECT biograph.sname, biograph.birth, maturedates.matured FROM biograph, maturedates
                                                           WHERE biograph.sname=maturedates.sname;

This query returns the sname and birth columns from the biograph table and the matured column from the maturedates table. The two tables are connected by sname. We are returned only the rows where we have a birth date and a matured date associated with a particular sname.

SELECT biograph.sname, biograph.birth, maturedates.matured, rankdates.ranked, consortdates.consorted  FROM biograph, maturedates, rankdates,consortdates  WHERE biograph.sname=maturedates.sname  AND biograph.sname=rankdates.sname  AND biograph.sname=consortdates.sname;

This query returns the sname and birth date from biograph, the matured date from maturedates, the ranked date from rankdates, and the consorted date from consortdates. Only the rows where the individual has all of these dates are returned. The next section describes joins which will allow you to retrieve the null or blank values as well.

Join two or more tables using JOIN

The most commonly used JOIN statements are shown below. There are others. I have given examples of the most useful joins.

Common Joins

Description

Inner Join

The most common join. Uses a comparison operator (=, <>,<, >, <=, >=) to match rows from two tables based on the values from common columns.

Left Outer Join

Returns all rows from the left table (first table mentioned is called the left table), not just the rows in which the joined columns match. Nulls are returned for the right tables if the left table has no matching rows in the right tables.

Right Outer Join

The reverse of a Left Outer Join. All rows from the right table are returned. Nulls are returned for the left table if a right table has no matching row in the left table.

Full Outer Join

A combination of the left and right outer joins. Returns all rows in both the left and right tables.

SELECT biograph.sname, biograph.birth, maturedates.matured
       FROM biograph
            INNER JOIN maturedates
                       ON biograph.sname=maturedates.sname;

This query returns exact same information as above using INNER JOIN instead of WHERE.

SELECT biograph.sname, biograph.sex, maturedates.matured,
members.grp FROM biograph
                 INNER JOIN maturedates
                            ON biograph.sname=maturedates.sname
                 INNER JOIN members
                            ON biograph.sname=members.sname
                 WHERE members.grp=1.21 AND members.date='2005-01-01'
                 ORDER BY biograph.sname;

This more complicated query returns the snames of all the matured individuals plus sex, matured date, and group for all who were in grp 1.21 on January 1st 2005. They are ordered alphabetically by sname.

SELECT biograph.sname, biograph.birth, maturedates.matured
        FROM biograph
             LEFT OUTER JOIN maturedates
                        ON biograph.sname=maturedates.sname;

If you take a look at the results of this query you can see that all rows from the biograph table (left table) are returned. We can now see which individuals do not have a matured date associated with them as well as the ones do.

SELECT biograph.sname, biograph.sex, maturedates.matured, members.grp
           FROM biograph
                LEFT OUTER JOIN maturedates
                         ON biograph.sname=maturedates.sname
                LEFT OUTER JOIN members
                         ON biograph.sname=members.sname
                WHERE members.grp=1.21 AND members.date='2005-01-01'
                ORDER BY biograph.sname;

Schemas

Each BaBase user has his or her own area of the database, called a schema, where they can create their own tables. Your schema is identified by your BaBase login name and is located within the babase database. We qualify schema names and the associated tables the same way we qualified tables and columns above.

A qualified table name within a schema is of the form:

schemaname.tablename

The biograph table could be written as:

babase.biograph

You can see which database you are in by looking at the schema search box within the SQL window. If you have selected the babase database the default search path is babase, then the sandbox, then your own personal schema. This means that the system looks at the different schemas for tables or columns, for example table names appearing in SQL queries, in the order in which the schemas are listed. If the table does not appear in the first schema it looks in the second, and so forth. As soon as table is found with the name given, that table is used and the search stops. a single SQL statement can refer to objects in the different schemas of the parent database, but cannot refer to objects in other databases -- tables within a database can be related, but tables in different databases cannot

Saving your query result INTO a new table

In the SELECT statement you can specify that the query result be saved as a new table in your schema. To identify the hierarchy, the schema and table name are separated by a dot or period.

SELECT * INTO username.mybiograph FROM biograph;

A table called mybiograph will be created in the “username” schema. This new table is a copy of the biograph table (all rows and columns selected).

As you start to write more-complicated queries and select them "into" new tables in your schema, it will be tempting to use small, highly-abbreviated names. DO NOT DO THIS! It may be easy for you at the moment to remember what that short name stands for, but in a few days/weeks/months it will be much more difficult. For example, perhaps you have a query that returns a set of juvenile males with their moms. You probably don't want to name the table "juvenile_males_with_moms", so it's tempting to use a query like this:

SELECT [columns] into username.jmm FROM [tables];

It'll be much faster to query from table "jmm", but in a few days or weeks, that abbreviation will lose all meaning to you. Better to use a more-descriptive name, like "juv_males_w_moms", "j_males_w_moms", or even "j_males_moms" (though that last one is getting close to being too abbreviated).

Queries with multiple SELECT statements

********************NEED TO ADD THIS*****************************

Temporary tables

********************NEED TO ADD THIS*****************************

Deleting tables

You are allowed to delete tables from your own schema. Only BaBase administrators have sufficient privileges to edit or delete the BaBase tables. The tables you create in your own schemas are not considered a part of BaBase, even though they may have been created from data contained in BaBase tables. To delete a table go into your schema. You will see a list of all tables you have created. Under the Actions bar you will see a button called Drop. You will be asked if you are sure you would like to delete the table. If you are sure then hit the Drop button. Do NOT check the cascade button. Data managers may have permission to add and delete tables from some sandbox schemas (hybridity etc. but not babase itself).

Importing Tables into your schema

Exporting Tables

Sandbox

The sandbox schema holds tables that are used together with the “official” BaBase tables but have not yet made it into the Babase project. They will not be documented in the BaBase documentation. The tables in the Sandbox will be considered “final copies” of data. Everyone can create tables in the Sandbox schema. Only place completely proofed and useable data in the Sandbox.

Permissions

All BaBase users have all the permissions in the sandbox schema. They may add, delete, or modify any information in the schema but may not alter the structure of the schema by adding or removing tables, procedures, triggers, or anything else. PosgreSQL, the database underlying Babase, is secure by default. This means that any tables or other database objects cannot be acesssed by anyone but their creator or without permission of the creator. Babase_editors who create tables in the sandbox schema should use the GRANT statement to grant access to Babase's other users. babase_editors permissions in the sandbox schema The babase_editors have all the permissions of the babase_readers, plus they may add or delete tables, stored procedures, or any other sort of object necessary to control the structure of the data.

Granting Permission to a table

CREATE TABLE sandbox.foo (somecolumn INTEGER);

This is done as follows:

Example 2.2. Granting permission to table foo in the sandbox schema

GRANT ALL ON sandbox.foo TO GROUP babase_editors;

GRANT SELECT ON sandbox.foo TO GROUP babase_readers;

It is important to note that you have to consider who you want to be able to view these tables versus who you want to be able to edit these tables. Also, when granting permissions, granting permission to babase_readers, does not grant permission to babase_editors. These are two discrete groups. You can grant "select" permissions to babase_editors instead of "all" permissions, but just granting permission to readers does not automatically include giving permissions to the group babase_editors.

Views

A view is a virtual table composed of fields from one or more tables. Views enable us to see information from the database from many different perspectives.

Saving your queries

Save your SELECT statements! Saving your statements eliminates the need to recreate them every time you need to ask the same question of the database. If you write your queries in a text editor, you can easily save them on your computer and easily recall them in the future.

About text editors:

Documenting your queries with comments

After saving your query safely away somewhere, it may be weeks, months, or even years before you or someone else returns to look at the code you used. An elaborate calculation that you did may make sense now, but it probably won't be so clear in the future. For your own sake and that of others who may look at your code later, please include "comments" in your queries.

A comment is optional text that you type to help explain what's happening in your query, why, or anything else that may be noteworthy. SQL has assigned two hyphens, “--“, as a comment. This means that any text following the comment characters will be ignored by SQL. A comment continues until the first "newline" character (i.e. when you hit the enter/return key).

Some examples of comment use

There are a few different approaches a person can take when deciding when to add comments, where to write them, and how verbose the comment needs to be. Let's look at some examples from some "celebrity" authors.

Note that in every example, the user includes their name and the date. This is always a good idea.

All comments laid out before stating the query

--Thomas Robert Malthus, Jan 2016

--This query determines how many males and females we have blood samples for, as well as social connectedness data collected in the same year (sci data is stored in the 'sci' table).
--Exclude Lodge group, only look for available samples/not exhausted samples, restrict specifically to blood samples collected during particular periods when SCI data are available

select biograph.sname, birth, biograph.sex, count(avail)
from biograph, tissue, sci
where biograph.sname=tissue.sname
and avail='Y'
and sci.sname=biograph.sname
and sci.sname=tissue.sname
and sample_type='BLOOD'
and matgrp != '3.0'
and sci.start_date < collection_date
and sci.start_date+365 > collection_date
group by biograph.sname, collection_date, biograph.birth, biograph.sex, start_date, sci_af
order by biograph.sex, biograph.sname;

One major perk of this approach is that any explanations you might seek are easy to find because they're all in one place. This is also probably the easiest approach to writing comments, because you do it all at once and then are free to forget about it.

On the other hand, with this approach it can be hard to know which part(s) of a comment refers to which part(s) of the query. This makes this approach not ideal for complex queries.

Brief "big picture" explanation at beginning, then brief comments within the query on their own lines

--Alfred Russel Wallace, Jun 2014
--Query to extract data on early life variables for females in study groups.  Comments indicated with --.

--1) pull basic data on everyone we might care about into a new table, following Charles's initial query and who survived to at least 4 years old (526 individuals)

select sname, matgrp, pid, left(pid,3) as mom, sex, birth, bstatus, biograph.status, statuses.descr as status_descr, statdate, ((statdate - birth)/365.25) as deathage
into coregroup_4yr
from biograph, statuses
where biograph.status = statuses.status
--exclude lodge group
and biograph.matgrp < '3.00'
--exclude animals with bad age estimates
and biograph.bstatus < '5'
--exclude animals born before the onset of monitoring
and birth > '1971-07-01'
--exclude animals who lived for < 1 day
and statdate > birth
--exclude individuals who died before age 4
and ((statdate - birth)/365.25) >=4
order by matgrp, sname;

--2) get all names and pids for moms represented in the core group and the statdate and birth for each kid
SELECT sname, name, pid, left(pid,3) as mom,(biograph.statdate - birth) AS deathagedays,birth;
FROM biograph
WHERE left(pid,3) in (SELECT DISTINCT coregroup_4yr.mom FROM coregroup_4yr);

Following this approach, comments are easy to write; everytime you write a more-complicated piece of code, you immediately add a comment explaning it. Naturally, this also helps make the code easier for a new reader to understand when explanations are right there among the text.

However, the extra lines of text can easily turn a relatively simple query into a wall of text that can thus become generally harder for a person to read. Also, if a commenter isn't consistent with comment placement, it can be unclear if a comment refers to code above it or below it.

Technical details laid out briefly at beginning, then comments on the same line as the code

-- Charles Lyell, Feb 2016
-- Rules for querying information about natal dispersal
    --only individuals above 6.5 years
    --only individuals with a bstatus <1
    --only individuals with matgrp <3
    --individuals with disperse confidence >2

    -- after you complete the query exclude the following individuals with BY dates: KRI, SAT, NJU

-- This is a nested query: the fourth select statement is the first one executed, and so on.

-- This query was written in TextWrangler for Mac. Comments and annotations are indicated with —- and are often provided after a query line.

select hold3.*

from

(select hold2.*
, dispersedates.dispersed
, (hold2.date - dispersedates.dispersed) as drd --# of days between dispersal and collection date of this line's fecal sample
, ((dispersedates.dispersed-hold2.birth)/365.25)::numeric(4,2) as dis_age
, rankdates.ranked
, (hold2.date - rankdates.ranked) as drr --# of days between rank attainment and collection date of this line's fecal sample
, ((rankdates.ranked-hold2.birth)/365.25)::numeric(4,2) as rank_age
, rankdates.rstatus, dispersedates.dispconfidence

from

(select hold.*

from

(select biograph.sname, biograph.matgrp, biograph.birth, fecal.prep.date, biograph.sex, fecal.prep.sid, fecal.results.gc, fecal.results.t, log(10, fecal.results.gc) as loggc, log(10,fecal.results.t) as logt, biograph.bstatus, ((fecal.prep.date-biograph.birth)/365.25)::numeric(4,2) as age

from biograph, fecal.prep, fecal.results

where biograph.sname = fecal.prep.sname --This is a join condition

and fecal.results.gc>0 --excludes null values

and fecal.results.t>0 --excludes null values

and fecal.prep.date between '2000-01-01' and '2013-12-31' --range of available data that include male dominance ranks

and biograph.sex = 'M'

and biograph.matgrp <3

and fecal.prep.sid = fecal.results.sid --This is a join condition

and biograph.bstatus <1 --includes individuals with known birthdates

)as hold

where hold.age between '6' and '10' --first cut to remove immature and older males, does not define the final age to be determined.

) as hold2

left join dispersedates --left join results in the inclusion of males from one table that have null values in the other table
on dispersedates.sname=hold2.sname --join condition

left join rankdates
on rankdates.sname = hold2.sname

) as hold3

where hold3.drd between -180 and 180 --use this alternative when you want data set restricted to 6 months before and after,

--if you want -1.5 years to +1.5, change to -540 to +540

and dis_age > 6.5

and hold3.dispconfidence>2

order by hold3.sname;

Note that in this case, many comments are reminders about SQL syntax (e.g. "This is a join condition"), rather than those in the previous examples that tended mostly to refer to the data itself (e.g. "exclude animals who lived for < 1 day"). This is okay; you should explain whatever you think may be unclear.

This approach is especially helpful when there are lots of technical details to relate. It's helpful for a reader if all those details are listed in one place at the beginning. However, as with an earlier example, when details are listed at the beginning it can be hard to know where a given rule is actually enforced in the code. In this example, the writer is trying to state the rules at the beginning and emphasize them again in-line, which is great. This can be risky, though: if you later decide to change a rule, it can be easy to change it in one place but not the other.

In this example, the writer included comments showing alternative code that could be added or could replace what's already there, e.g. "if you want -1.5 years to +1.5, change to -540 to +540". These kinds of comments are usually very helpful, and highly-encouraged.

Adding comments after a line of actual code has its pro's and con's. It's helpful for especially-brief comments (e.g. "excludes null values"), but if the comment or the code it refers to is especially long, then the whole line can become so long that the comment might be difficult to see.

Thorough explanation and rationale, then comments spread throughout the query

--Charles Darwin, Mar 2015
--Get relative (aka proportional) adult male ranks

--Ranks in Babase are recorded as "ordinal" rankings, i.e. first, second, third, etc. (or
--Alpha, Beta, Gamma, etc.) For some questions, it's may be useful to instead consider the
--ranks proportionally; being #1 in a group of 10 may be quite different from being #1 in
--a group of 50, for instance.  This query calculates these "relative" ranks for each line
--in the RANKS table.

--In this case, the "proportion" being calculated is:
    --the adult male's rank / number of adult males in the group
    --("adult" males are defined by whether or not they have a rank date in RANKDATES)

--This query assumes that for any given month, all the adult males are ranked above the
--subadults and juveniles.  If this is ever not so (a rel_rank > 1), a mistake was made in
--assigning ranks and/or rank dates.  I should bring these cases to a data manager's
--attention ASAP.

--First, make a temporary version of RANKS that excludes subadult and juvenile males.
WITH adm_ranks AS
    (
    SELECT ranks.*
        FROM ranks
        JOIN rankdates
          ON (rankdates.sname = ranks.sname
            --Only allow males that have attained adult rank on/by this date
            AND rankdates.ranked <= ranks.rnkdate)
        WHERE ranks.rnktype = 'ALM'
    )

--Using the temporary adult males-only RANKS table, calculate relative rank
SELECT ord_ranks.rnkid
    ,  ord_ranks.sname
    ,  ord_ranks.rnkdate
    ,  ord_ranks.grp
    ,  ord_ranks.rnktype
    ,  ord_ranks.rank AS ord_rank
    ,  num_males.grp_size
    ,  (--Change both of these from integer so we can have a non-integer quotient
        ord_ranks.rank::numeric
        / num_males.grp_size::numeric
        )::numeric(3,2) AS rel_rank --Max 3 digits, 2 of which must be after the decimal
    FROM adm_ranks AS ord_ranks
    JOIN (--Get number of adult males by counting # of lines per grp-month in adm_ranks
         SELECT grp
             ,  rnkdate
             ,  COUNT(*) AS grp_size
             FROM adm_ranks
             GROUP BY grp, rnkdate
         ) AS num_males
      ON (num_males.grp = ord_ranks.grp
        AND num_males.rnkdate = ord_ranks.rnkdate)
    --In case I want to check a particular group and/or date:
    --WHERE ord_ranks.grp=2.11 AND ord_ranks.rnkdate>='2012-01-01'
    ORDER BY rnkdate, grp, ord_rank
    ;

When looking at someone's saved queries, the overall purpose of the query is often as unclear as the rationale behind a particular line of code. More than any of the other examples, this one uses comments to explain that purpose, not just the "why" behind a particular calculation or line of code. Most saved queries probably don't need the amount of detail used here to explain their purpose, but any amount of explanation along these lines can be helpful.

Note that unlike the others, this query also uses the recommendations laid out above for using descriptive aliases for column and table names, and for using indentation to make the query generally more human-readable.

This last approach tends to borrow from the styles shown in the previous examples. Sometimes a brief comment is added in-line, sometimes a comment is relegated to it's own line. Near the end, a "WHERE" statement is commented-out, but available to be used if needed. It's up to you to decide how you'll work. You might pick one of these, or (more likely) some hybrid of them. Whatever you decide, be consistent, and don't let yourself get lazy! You'll regret it in the future!

A Couple Useful Queries

Calculate Age

SELECT sname, (statdate - birth)/365.25  AS age FROM biograph;

This query will create a column with sname and one with the age of each individual on their statdate. This will give us exact age. We use 365.25 days the calculation to account for leap years.

Note: There are multiple ways to calculate integer age and everyone has there own preference. The following example illustrates one method that used quite regularly in the Alberts lab.

SELECT sname, FLOOR((statdate - birth)/365.25) + 1 AS age FROM biograph;

This query returns sname and age in integer years. Note that 1 year is added to the calculation. The result is actually what “year of life” the individual is in on the statdate. Using this method an infant between birth and the first birthday will be considered of age 1, or in the first year of life. An individual who is 6.3 years old will be considered of age 7, or in the 7th year of life.

Change a table name

ALTER TABLE oldtablename RENAME TO newtablename

This query will change the name of an existing table. Simply replace "oldtablename" and "newtablename" with your specific original table name and your revised table name, respectively.

Concatenate table columns

SELECT *, season || year as seasonyear FROM sampletable

This query will concatenate the data from two columns in an existing table into a single column. In the sample query above, sampletable has two columns called "season" and "year". By running this query, a new column is created that concatenates season and year. For example, a season value of "wet" and a year value of "2003" concatenate to form a new value of "wet2003" in the output table.

Extract year (or day or month) from a date field

SELECT *, extract (year from date) as year from sampletable

This query will extract the year from a date column and save the value in a new column called "year". In the sample query above, sampletable already has a column for the date (called "date"). By running this query, a new column is created that extracts the year portion of the date column and saves the output in a new column. For example, a date value of 1976-09-20 will result in a year value of 1976. Similar query structure can be used to extract the day and month from a date field by replacing "year" in the above query with either "day" or "month".

SQL_Language (last edited 2021-09-09 19:49:10 by JakeGordon)

Wiki content based upon work supported by the National Science Foundation under Grant Nos. 0323553 and 0323596. Any opinions, findings, conclusions or recommendations expressed in this material are those of the wiki contributor(s) and do not necessarily reflect the views of the National Science Foundation.