Contents
- The Basics
- Retrieving Data from a Table
- Operators and Functions
- Summarizing and Grouping Data
- Retrieving Data from Multiple Tables
- Schemas
- Views
-
Saving your queries
- About text editors:
-
Documenting your queries with comments
-
Some examples of comment use
- All comments laid out before stating the query
- Brief "big picture" explanation at beginning, then brief comments within the query on their own lines
- Technical details laid out briefly at beginning, then comments on the same line as the code
- Thorough explanation and rationale, then comments spread throughout the query
-
Some examples of comment use
- A Couple Useful Queries
The Basics
SQL (Structured Query Language) is the standard programming language for creating and retrieving information that is stored in a relational database. The user interface or front-end (what you see online) is called phpPgAdmin. The following pages are meant as a basic introduction to SQL, phpPgAdmin, and query writing. Please see BaBase documentation for specifics about the database. There are many resources online for more detailed information about SQL. There are links to SQL tutorials and documentation through the Babase Wiki
Case sensitivity
SQL language is case insensitive but user-defined identifiers (table names, column names, and other such names for things that are not data) are case sensitive. For example, the SQL SELECT statement could be written as SELECT, SeLect, or select but the table name biograph is case sensitive. (A convention often used is to write key words in upper case and names in lower case, e.g.: SELECT * FROM babase;) Fortunately, unless identifiers are written inside double quotes (") they are automatically converted to lower case. In practice this means that you do not need to worry about case unless someone has gone to the trouble of creating mixed-case identifiers.
The point to take away is: create table names, column names, and the like, using lower case letters only. To do otherwise is to invite trouble.
Conventions in SQL syntax
In this document I have followed some SQL syntax conventions when writing query statements.
UPPERCASE |
Uppercase indicates SQL language or commands |
|
lowercase |
Lowercase indicates user defined items such as tables and columns specific to BaBase. |
Quotation marks
All quotation marks in SQL are straight quotes (such as ' and "), not curly quotes (such as ‘ and “). Curly quotes will not work in the SQL window. Be careful when cutting and pasting from text editors. Sometimes the default is curly quotes and you will receive an error message when you run the query. You can delete the quotes in the SQL query window and retype them. When you type directly into the SQL query window you will have the correct quote format.
Data Types
Each column in a table has a single data type. A data type determines a column’s allowable values. In general, we can classify columns as the following types.
Character
Represent text and are case sensitive. Whenever characters are requested they are case sensitive and need to be enclosed in single quotation marks. Examples of character columns are the name and pid columns in the biograph table.
Number
Numbers can be integers or decimals. Numbers can be negative, zero, or positive. Numbers do not require quotation marks. Examples of numeric columns are the matgrp and bstatus columns in the biograph table.
Date
A date stored in a column defined as DATE has three fields - YEAR, MONTH, and DAY - and is formatted yyyy-mm-dd. Dates must be enclosed in single quotations. Examples of date columns are the birth and statdate columns in the biograph table.
Time
Represents a time of day. A time stored in a column defined as TIME has three fields – HOUR, MINUTE, and SECOND – and is formatted hh:mm:ss. Times must be enclosed in single quotations. Examples of time columns are the start and stop columns in the interact_data table.Retrieving data from a table.
Retrieving Data from a Table
- The SELECT statement forms the basis of every question you pose to the database.
- When you create and execute a SELECT statement you are ‘querying’ the database.
- A SELECT statement is composed of several distinct keywords known as clauses. You define a SELECT statement using various configurations of the clauses to retrieve the information you require. Some clauses are required while others are optional. The main clauses in a SELECT statement are INTO, FROM, WHERE, GROUP BY, and ORDER BY.
- You may need multiple statements or queries to retrieve the desired information.
- Every SELECT statement must end in a semi colon, ;
The Simplest SELECT Statements
The simplest select statements do not even return table content.
SELECT 'Hello World';
It is not very often that you ask SQL to return a constant value, but it does happen occasionally. More often, some columns are the result of computation.
SELECT 'The answer is:', (2 * 20) + 2;
Retrieving columns with SELECT and FROM
The simplest useful SELECT statement retrieves columns FROM a table. The SELECT and FROM clauses are always required if you retrieve columns from tables. All other clauses are optional.
- To retrieve one column from a table
SELECT name FROM biograph;
This query will return every row in the name column of the biograph table.
- To retrieve multiple columns from a table
SELECT name, birth FROM biograph;
This statement will return every name and birthdate in the biograph table.
- To retrieve all columns from a table without writing them all out you can use an asterisk “*”.
SELECT * FROM biograph;
This will return all of the rows and columns contained in the biograph table.
Eliminating duplicate rows with DISTINCT
Although DISTINCT is an aggregate function (discussed below) it is useful to introduce this one earlier on.
- Columns often contain duplicate values, and it is common to want a result that lists each duplicate only once.
SELECT DISTINCT(sname) FROM members;
The members table has multiple entries for most individuals. In this case the DISTINCT command returns each sname only once.
Filtering rows with WHERE
The result of each SELECT statement so far has included every row in the table (for the specified columns). The WHERE clause filters unwanted rows from the result. In a WHERE clause you specify a search condition that contains one or more conditions that need to be satisfied by the rows of the table. Rows for which the condition or conditions are true are returned.
Comparisons
The most common types of conditions involve comparing two values to each other. There are six types of comparisons we can define. Note that inequality can be defined in two different ways.
Symbol |
Usage |
|
= |
Equality |
|
<> |
Inequality (not equal to) |
|
!= |
Inequality (not equal to) |
|
< |
Less than |
|
> |
Greater than |
|
<= |
Less than or equal to |
|
>= |
Greater than or equal to |
- Equality and inequality
SELECT * FROM biograph WHERE sname=’CHA’;
All information contained in the biograph table where the sname is ‘CHA’.
SELECT * FROM biograph WHERE sex=’F’;
All rows in the biograph table that have ‘F’ listed in the sex column will be returned. This is every row pertaining to a female.
SELECT * FROM biograph WHERE matgrp <> 9;
This query returns all rows from the biograph table where the matgrp (maternal group) is not 9.
- Less than and greater than
SELECT * FROM biograph WHERE birth < ‘2005-01-01’;
This query will return every row in the biograph table where the birth date is before (but not including) January 1st, 2005.
SELECT * FROM biograph WHERE birth <= ‘2005-01-01’;
This query will return every row in the biograph table where the birth date is before or equal to January 1st, 2005.
The “greater than” statements work in exactly the same way as the “less than” statements shown above.
Combining and negating conditions with AND, OR, and NOT
Often you will need to specify multiple conditions in one SELECT statement. You can combine two or more conditions using the AND, OR, and NOT operators.
- AND connects two or more conditions and returns true only if all conditions are true
SELECT * FROM biograph WHERE matgrp=1.1 AND sex=’M’ AND birth >= ‘2005-01-01’;
This statement will return all rows in the biograph table where the individual is male, has a matgrp of 1.1 (born in Nyayo’s group) and was born on or after January 1st 2005.
- OR connects two or more conditions and returns true if either or all conditions are true
SELECT * FROM biograph WHERE sname=’ELV’ OR sname=’CHA’ OR sname=’POW’;
This statement will return all rows in the biograph table where the sname is ‘ELV’, ‘CHA’ or ‘POW’.
- Unlike AND and OR, NOT doesn’t connect two conditions. Instead it negates (reverses) a single condition
SELECT * FROM biograph WHERE NOT matgrp <= 3;
All rows in the biograph table where matgrp is not less than or equal to 3 will be returned. This could also be stated without using a NOT statement and using
SELECT * FROM biograph WHERE matgrp > 3;
Matching patterns with LIKE
Preceding examples retrieved rows based on the exact value of a column or columns. You can use LIKE to retrieve rows based on partial information. LIKE works only with character strings. In the LIKE statement you can use two different wildcard characters. A percent sign (%) matches any string of zero or more characters. An underscore (_) matches any one character. You can negate a LIKE statement with NOT LIKE.
- Find similar character strings
SELECT * FROM biograph WHERE name LIKE 'C%';
This statement will return all rows in the biograph table where the name starts with a “C”
SELECT * FROM biograph WHERE pid LIKE 'MET_';
This statement will return all rows in the biograph table where the pid starts with ‘MET’. In other words, all rows corresponding to an individual whose mother’s sname is ‘MET’.
- The exception to this rule is situations in which the pids, or other data fields contain double digits. This query will only return pids 1-9. To include all desired results, use
SELECT * FROM biograph WHERE pid LIKE 'MET%';
Range filtering with BETWEEN
- Use BETWEEN to determine whether a given value falls within a range
SELECT DISTINCT(sname) FROM members WHERE grp=1.1 AND date BETWEEN '2004-01-01' AND '2005-01-01';
This statement will return the distinct snames of all the individuals in the members table from group 1.1 between (this is an inclusive range) January 1st 2004 and January 1st 2005. In other words, who was in group 1.1 at some point during the given time period.
List filtering with IN
SELECT * FROM biograph WHERE sname IN ('CHA', 'ELV', 'POW');
Testing for blanks or nulls with NULL
Nulls represent missing or unknown values. This situation can cause problems in WHERE clauses because unknown values do not satisfy specific conditions. Nulls match no values – not even other nulls.
- Retrieve rows with nulls or non-null values
SELECT * FROM biograph WHERE pid IS NULL;
This statement returns every row in the biograph table where a null exists in the pid column. This translates to every individual whose mother is unknown to us.
SELECT * FROM biograph WHERE sname IS NOT NULL;
This statement returns every row in the biograph table where we have an sname for the individual. Aborted fetuses usually are not given a name but may still have a row in the biograph table.
Sorting rows with ORDER BY
Rows in a query result are unordered. Often you will want the output of your data in a particular order. For example, you may want your result ordered alphabetically or by date.
- To sort your result by one column
SELECT * FROM biograph WHERE sex=’M’ ORDER BY sname;
This will return all entries in biograph where the sex is male and will be returned in alphabetical order by sname.
- To sort your result by multiple columns
SELECT * FROM members ORDER BY grp, date;
This will return all entries in members where the group is equal to 2. The results will be in numerical order by group and, within each group, will be ordered chronologically by date.
SELECT * FROM members WHERE date between '2005-01-01' and '2005-03-01' ORDER by grp, date;
This will return all individuals present in any group during this time period, sorted into their groups and by the daye, in that order.
Creating column aliases with AS
In the queries we have seen so far we have accepted the return of the default column names. The AS clause can be used to create a column alias. A column alias is an alternative name. The new name is used in both the column headings and elsewhere in the query itself. If a column name is too long or cryptic, or if the column is the result of a computation the query may be made more clear by changing the name.
Double quotes are required around the new name unless the alias is a single word that contains only lower case letters, digits, or underscores.
- Change the name of a column.
SELECT 'The answer is:' AS "Label", (2 * 20) + 2 AS "Answer";
SELECT sname AS "short name" FROM biograph;
or
SELECT sname AS shortname, statdate - birth AS daysalive FROM biograph WHERE statdate - birth < 20 ORDER BY daysalive, shortname;
The computed value is given a column name in both the output column heading and the ORDER BY that orders the output, but not in the WHERE clause because the WHERE clause is needed to produce the output.
When using column aliases, it is strongly-recommended that you use descriptive names that clearly illustrate what the column holds. It's tempting to abbreviate complicated titles with a few letters, if for no other reason than because it's easier to just type a few letters than a long name. For the sake of keeping your data readable (to others and even to yourself, in case you revisit old data a few months later), avoid over-abbreviating column names. See "Saving your query result INTO a new table" for an example of over-abbreviating a table name.
Writing Human Readable Queries
It is important to write your queries so that they can be read by a human. It is no good having a query that the computer understands but a human does not because it is the human that must interpret the query's result. Column aliases are useful in this regard, but the most useful technique is indentation. SQL statements can be continued over multiple lines; they do not end until the final semicolon. Breaking a query into multiple lines, and then using indentation that reflects the structure of the SQL, gives the reader an overview of the query structure at a glance and allows her to focus on and understand each component of the query in the context of the whole.
SELECT sname AS shortname , statdate - birth AS daysalive FROM biograph WHERE statdate - birth < 20 ORDER BY daysalive, shortname;
Line breaks and indentation makes queries easier to understand.
Queries may contain comments. Comments begin with -- (2 dashes) and continue to the end of the line. Comments can go anywhere; on a line by themselves or at the end of a line containing SQL. Judicious use of comments will help the reader understand the most complex of queries.
-- This is a comment. It does nothing, has no effect, and may as well not be here.
Comments aid understanding, but unless they reveal what's otherwise hidden they are just clutter.
Formatting and commenting queries as part of the writing process helps clarify goals, aids present and future understanding of the query, aids debugging, and helps ensure the query really does produce the result expected.
You may have noted that the query above, which returns baboons who have lived less than 20 days, does not consider whether the individuals are alive or dead. The results would include not only those individuals which lived less than 20 days, but also all individuals born within 20 days of the last observation of the group. Queries quickly become complex and hard to read, with subtleties that only the informed can appreciate. The formatting conventions and commenting techniques of the following example may be applied to queries of arbitrary complexity to greatly aid readability.
-- Show individuals who have lived less than 20 days SELECT sname AS shortname , statdate - birth AS daysalive -- Subtracting two dates gives the number of days between the two FROM biograph WHERE statdate - birth < 20 -- Don't show living individuals; do include more than just the confirmed dead AND status <> 0 -- Status of 0 means alive ORDER BY daysalive, shortname;
Comments can document goals, describe obscure SQL, document a query's sub-components and the non-obvious implications of the choices made, explain data values, and so forth.
Operators and Functions
Performing Arithmetic Operations
To add, subtract, multiply, or divide use the operators found in the following table.
Symbol |
Usage |
|
+ |
Addition |
|
- |
Subtraction |
|
* |
Multiplication |
|
/ |
Division |
- Using subtraction with times
SELECT (stop - start) AS "time spent" FROM interact;
This query will create a column called “time spent” which will give us the length of each interaction in the interact table based on the start and stop time.
- Using subtraction with dates
SELECT sname, (statdate - birth) AS days FROM biograph;
This query will create a column with sname and one called “days” with the number of days from birth until the statdate for each individual.
Summarizing and Grouping Data
Aggregate Functions
COUNT
- The count command does exactly what it says, it counts whatever you ask of it
SELECT COUNT(sname) FROM biograph WHERE matgrp=1.1;
This will show how many individuals were born into Nyayo’s group (1.1)
- Many times it is very useful to combine the COUNT and the DISTINCT commands
SELECT COUNT(DISTINCT(sname)) FROM members WHERE grp=1.1;
The members table has multiple entries for most individuals. In this case the DISTINCT command lets us count each sname only once. This query tells us how many different individuals have been a member of group 1.1.
MIN and MAX
- As it sounds, these commands are used to determine the maximum and minimum value in a query
SELECT MAX(rank), MIN(rank) FROM ranks WHERE sname=’ROC’;
The result of this query is the maximum numeric rank attained by Rocky (5), and the minimum numeric rank attained by Rocky (1).
- The maximum and minimum can also be used with dates
SELECT MAX(date), MIN(date) FROM members WHERE grp=1.21;
The result of this query is the first and last dates recorded in the members table for Omo’s group (1.21).
AVG
- This command allows you to find the average or mean of a group of numbers.
SELECT AVG(stop-start) FROM interact_data WHERE act='C' ;
This query lets us know the average length of all consortships found in the interact_data table
SUM
- This command allows you to find the sum of a group of numbers.
SELECT SUM(stop-start) FROM interact_data WHERE act='C' ;
This query lets us know the total time for which we have observed consortships from the interact_data table.
FLOOR and CEILING
- FLOOR returns the largest integer equal or less to the value provided. This is useful when rounding numbers. We tend to use this when calculating age in years. See Useful Queries section
SELECT FLOOR(1.21);
This will return the value of 1 which is the largest integer equal or less than 1.21.
- The CEILING function returns the smallest integer value that is greater than the value provided. This is equivalent to rounding up to the next whole number.
SELECT CEILING(6.45);
This will return the value of 7 which is the smallest integer greater than 6.45.
Grouping rows with GROUP BY
All of the examples so far have applied the aggregate functions across all the rows returned by the FROM and WHERE clauses. You can use the GROUP BY clause to divide a table into groups or categories and apply aggregate functions to each subset. The GROUP BY clause comes after the WHERE clause but before the ORDER BY clause. An example will clarify the concept.
- Use the COUNT aggregate function with GROUP BY
SELECT matgrp, COUNT(matgrp) FROM biograph GROUP BY matgrp ORDER BY matgrp;
This query returns the total number of individuals attributed to each matgrp in the biograph table. The result shows the count of individuals grouped by matgrp.
- You can use GROUP BY with more than one grouping column
SELECT matgrp, sex, COUNT(matgrp) FROM biograph GROUP BY matgrp, sex ORDER BY matgrp;
This query returns the total number of individuals attributed to each matgrp in the biograph table. The results are also grouped by sex. The result shows the count of individuals grouped by matgrp and then sex.
Filtering groups with HAVING
**************this needs to be added**************************
Retrieving Data from Multiple Tables
All of the queries so far have retrieved rows from a single table. In a relational database we may need information from multiple tables at once. In order to retrieve information from two or more tables simultaneously we need to understand how the columns in the different tables are related to each other. The Babase Pocket Reference outlines how the columns of all of the BaBase tables are connected to each other (entity diagram).
Qualifying column names
Column names must be unique within a table but can be reused in another table. To identify an otherwise ambiguous column uniquely in a query that involves multiple tables, we must use a qualified name. A qualified name is a table name followed by a dot followed by the name of the column in the table. Because table names within the database must be unique, a qualified name uniquely identifies a single column within the entire database. If two tables have any columns in common you must qualify all references to these columns throughout the query.
A qualified table name is of the form:
table.column |
It is considered good form by the SQL digirati to always qualify all column names.
Creating joins with JOIN or WHERE
A query that extracts data from more than one table must specify how we want to join the tables. There must be a connection! The tables are always joined row by row by satisfying whatever conditions you specify in the query. We must always specify one or more join conditions to be evaluated for each pair of joined rows.
Selecting Columns from two or more tables with WHERE
- You may want to retrieve information from both the biograph table and the maturedates table
SELECT biograph.sname, biograph.birth, maturedates.matured FROM biograph, maturedates WHERE biograph.sname=maturedates.sname;
This query returns the sname and birth columns from the biograph table and the matured column from the maturedates table. The two tables are connected by sname. We are returned only the rows where we have a birth date and a matured date associated with a particular sname.
- You can retrieve information from many tables at once as long as there is common column information with which you can compare the tables.
SELECT biograph.sname, biograph.birth, maturedates.matured, rankdates.ranked, consortdates.consorted FROM biograph, maturedates, rankdates,consortdates WHERE biograph.sname=maturedates.sname AND biograph.sname=rankdates.sname AND biograph.sname=consortdates.sname;
This query returns the sname and birth date from biograph, the matured date from maturedates, the ranked date from rankdates, and the consorted date from consortdates. Only the rows where the individual has all of these dates are returned. The next section describes joins which will allow you to retrieve the null or blank values as well.
Join two or more tables using JOIN
The most commonly used JOIN statements are shown below. There are others. I have given examples of the most useful joins.
Common Joins |
Description |
|
Inner Join |
The most common join. Uses a comparison operator (=, <>,<, >, <=, >=) to match rows from two tables based on the values from common columns. |
|
Left Outer Join |
Returns all rows from the left table (first table mentioned is called the left table), not just the rows in which the joined columns match. Nulls are returned for the right tables if the left table has no matching rows in the right tables. |
|
Right Outer Join |
The reverse of a Left Outer Join. All rows from the right table are returned. Nulls are returned for the left table if a right table has no matching row in the left table. |
|
Full Outer Join |
A combination of the left and right outer joins. Returns all rows in both the left and right tables. |
- We can perform the same query from above (retrieving rows from biograph and maturedates using WHERE) using the INNER JOIN command
SELECT biograph.sname, biograph.birth, maturedates.matured FROM biograph INNER JOIN maturedates ON biograph.sname=maturedates.sname;
This query returns exact same information as above using INNER JOIN instead of WHERE.
- You can select information from many tables at once using INNER JOIN
SELECT biograph.sname, biograph.sex, maturedates.matured, members.grp FROM biograph INNER JOIN maturedates ON biograph.sname=maturedates.sname INNER JOIN members ON biograph.sname=members.sname WHERE members.grp=1.21 AND members.date='2005-01-01' ORDER BY biograph.sname;
This more complicated query returns the snames of all the matured individuals plus sex, matured date, and group for all who were in grp 1.21 on January 1st 2005. They are ordered alphabetically by sname.
- Example using LEFT OUTER JOIN
SELECT biograph.sname, biograph.birth, maturedates.matured FROM biograph LEFT OUTER JOIN maturedates ON biograph.sname=maturedates.sname;
If you take a look at the results of this query you can see that all rows from the biograph table (left table) are returned. We can now see which individuals do not have a matured date associated with them as well as the ones do.
- Example using LEFT OUTER JOIN with more than two tables
SELECT biograph.sname, biograph.sex, maturedates.matured, members.grp FROM biograph LEFT OUTER JOIN maturedates ON biograph.sname=maturedates.sname LEFT OUTER JOIN members ON biograph.sname=members.sname WHERE members.grp=1.21 AND members.date='2005-01-01' ORDER BY biograph.sname;
Schemas
- Schemas partition databases. Tables, procedures, triggers, and so forth are all kept in schemas. Schemas are like sub-databases within a database. You can query tables from any schema within a particular database but not across databases. Tables within a database can be related, but tables in different databases cannot. Babase uses schemas to partition each database into areas where users have a greater or lesser degree of freedom to make changes. For further information on schemas see the schema documentation for PosgreSQL. Within the babase database the “official” babase tables are within the babase schema The system looks at the different schemas for objects, for example table names appearing in SQL queries, . If the table does not appear in the first schema it looks in the second, and so forth. As soon as table is found with the name given, that table is used and the search stops. To explicitly reference an object in a specific schema, place the name of the schema in front of the object, separating the two with a period (e.g. schemaname.tablename).
Each BaBase user has his or her own area of the database, called a schema, where they can create their own tables. Your schema is identified by your BaBase login name and is located within the babase database. We qualify schema names and the associated tables the same way we qualified tables and columns above.
A qualified table name within a schema is of the form:
schemaname.tablename |
The biograph table could be written as:
babase.biograph |
You can see which database you are in by looking at the schema search box within the SQL window. If you have selected the babase database the default search path is babase, then the sandbox, then your own personal schema. This means that the system looks at the different schemas for tables or columns, for example table names appearing in SQL queries, in the order in which the schemas are listed. If the table does not appear in the first schema it looks in the second, and so forth. As soon as table is found with the name given, that table is used and the search stops. a single SQL statement can refer to objects in the different schemas of the parent database, but cannot refer to objects in other databases -- tables within a database can be related, but tables in different databases cannot
Saving your query result INTO a new table
In the SELECT statement you can specify that the query result be saved as a new table in your schema. To identify the hierarchy, the schema and table name are separated by a dot or period.
- Save the results of a query into a table in your schema (username)
SELECT * INTO username.mybiograph FROM biograph;
A table called mybiograph will be created in the “username” schema. This new table is a copy of the biograph table (all rows and columns selected).
As you start to write more-complicated queries and select them "into" new tables in your schema, it will be tempting to use small, highly-abbreviated names. DO NOT DO THIS! It may be easy for you at the moment to remember what that short name stands for, but in a few days/weeks/months it will be much more difficult. For example, perhaps you have a query that returns a set of juvenile males with their moms. You probably don't want to name the table "juvenile_males_with_moms", so it's tempting to use a query like this:
SELECT [columns] into username.jmm FROM [tables];
It'll be much faster to query from table "jmm", but in a few days or weeks, that abbreviation will lose all meaning to you. Better to use a more-descriptive name, like "juv_males_w_moms", "j_males_w_moms", or even "j_males_moms" (though that last one is getting close to being too abbreviated).
Queries with multiple SELECT statements
********************NEED TO ADD THIS*****************************
Temporary tables
********************NEED TO ADD THIS*****************************
Deleting tables
You are allowed to delete tables from your own schema. Only BaBase administrators have sufficient privileges to edit or delete the BaBase tables. The tables you create in your own schemas are not considered a part of BaBase, even though they may have been created from data contained in BaBase tables. To delete a table go into your schema. You will see a list of all tables you have created. Under the Actions bar you will see a button called Drop. You will be asked if you are sure you would like to delete the table. If you are sure then hit the Drop button. Do NOT check the cascade button. Data managers may have permission to add and delete tables from some sandbox schemas (hybridity etc. but not babase itself).
Importing Tables into your schema
- Turn your excel or text file into comma delimited format (CSV) or tab-delimited text (TXT) using “save as”
If you're using Mac OS X, you must save it as a TXT file.
- Go to your schema (You can’t create tables anywhere else unless you are a data manager).
- Hit the “create table” button
- Name the table (lowercase with characters other than underscore is best)
- Choose the number of columns
- Name the columns (exact same names and same case as your CSV file)
- Choose the data type for each column and whether nulls are allowed or not
- You now have an empty table that can be filled with your data.
- If using Windows or Linux:
- Click on the table which is now in your schema and hit the “import” button
- Pick “CSV” as the format and choose what your nulls look like (if any)
Browse for your file and hit import. All the rows should now be there. You can also use the Babase Upload program, which Mac OS X users are required to use.
- If using Mac OS X:
You cannot import data using the "import" button. Instead, open a new tab/window in your browser and use the Babase Upload program (https://papio.biology.duke.edu/programs/upload/).
- Fill in each field as appropriate:
- Database: babase, unless for some reason you'll want to import babase_test or babase_copy
- User: the ID you use to log in to babase
- Password: the password you use to log in to babase
- Table or view: the name of the table to which you want to add data. You should include the name of your schema, then a period, then the table's name, e.g. schema.tablename
- Upload null values: if your data includes cells with null values, then check this.
- NULL representation: type how your data have indicated NULL values. For example, if you type "\N" or "NULL" in each cell that you want to be NULL, then type /N or NULL, respectively. If you left your NULL cells blank, then leave this section empty as well.
- Browse for your file and hit Upload. All the rows should now be there. If there are any problems with your data, nothing will import, and you will see a list of the errors in your data.
Exporting Tables
- Click on the table you would like to export
- Hit the “export” button
- Select “data only” and CSV format
- Under Option select “download”
- You will be prompted to save or open the file with excel. Either option is fine. If you save it it will be called dump.csv. You can rename it later. If you choose to open with excel you can do a “save as” and save it directly as the XLS file with the name of your choice. This is the easiest option.
Sandbox
The sandbox schema holds tables that are used together with the “official” BaBase tables but have not yet made it into the Babase project. They will not be documented in the BaBase documentation. The tables in the Sandbox will be considered “final copies” of data. Everyone can create tables in the Sandbox schema. Only place completely proofed and useable data in the Sandbox.
Permissions
All BaBase users have all the permissions in the sandbox schema. They may add, delete, or modify any information in the schema but may not alter the structure of the schema by adding or removing tables, procedures, triggers, or anything else. PosgreSQL, the database underlying Babase, is secure by default. This means that any tables or other database objects cannot be acesssed by anyone but their creator or without permission of the creator. Babase_editors who create tables in the sandbox schema should use the GRANT statement to grant access to Babase's other users. babase_editors permissions in the sandbox schema The babase_editors have all the permissions of the babase_readers, plus they may add or delete tables, stored procedures, or any other sort of object necessary to control the structure of the data.
Granting Permission to a table
CREATE TABLE sandbox.foo (somecolumn INTEGER);
This is done as follows:
Example 2.2. Granting permission to table foo in the sandbox schema
GRANT ALL ON sandbox.foo TO GROUP babase_editors;
GRANT SELECT ON sandbox.foo TO GROUP babase_readers;
- There is one other issue. Only the creator of a table can change its structure -- to add another column, change the table name, etc. And only the creator can destroy (DROP) the table.
It is important to note that you have to consider who you want to be able to view these tables versus who you want to be able to edit these tables. Also, when granting permissions, granting permission to babase_readers, does not grant permission to babase_editors. These are two discrete groups. You can grant "select" permissions to babase_editors instead of "all" permissions, but just granting permission to readers does not automatically include giving permissions to the group babase_editors.
Views
A view is a virtual table composed of fields from one or more tables. Views enable us to see information from the database from many different perspectives.
Saving your queries
Save your SELECT statements! Saving your statements eliminates the need to recreate them every time you need to ask the same question of the database. If you write your queries in a text editor, you can easily save them on your computer and easily recall them in the future.
About text editors:
- The text editor(s) that comes pre-installed with your computer is adequate, but there are other options available for free that have more tools and can even help you check your SQL syntax:
For Windows, there's Notepad++
For Mac OS X, there's TextWrangler
Documenting your queries with comments
After saving your query safely away somewhere, it may be weeks, months, or even years before you or someone else returns to look at the code you used. An elaborate calculation that you did may make sense now, but it probably won't be so clear in the future. For your own sake and that of others who may look at your code later, please include "comments" in your queries.
A comment is optional text that you type to help explain what's happening in your query, why, or anything else that may be noteworthy. SQL has assigned two hyphens, “--“, as a comment. This means that any text following the comment characters will be ignored by SQL. A comment continues until the first "newline" character (i.e. when you hit the enter/return key).
Some examples of comment use
There are a few different approaches a person can take when deciding when to add comments, where to write them, and how verbose the comment needs to be. Let's look at some examples from some "celebrity" authors.
Note that in every example, the user includes their name and the date. This is always a good idea.
All comments laid out before stating the query
--Thomas Robert Malthus, Jan 2016 --This query determines how many males and females we have blood samples for, as well as social connectedness data collected in the same year (sci data is stored in the 'sci' table). --Exclude Lodge group, only look for available samples/not exhausted samples, restrict specifically to blood samples collected during particular periods when SCI data are available select biograph.sname, birth, biograph.sex, count(avail) from biograph, tissue, sci where biograph.sname=tissue.sname and avail='Y' and sci.sname=biograph.sname and sci.sname=tissue.sname and sample_type='BLOOD' and matgrp != '3.0' and sci.start_date < collection_date and sci.start_date+365 > collection_date group by biograph.sname, collection_date, biograph.birth, biograph.sex, start_date, sci_af order by biograph.sex, biograph.sname;
One major perk of this approach is that any explanations you might seek are easy to find because they're all in one place. This is also probably the easiest approach to writing comments, because you do it all at once and then are free to forget about it.
On the other hand, with this approach it can be hard to know which part(s) of a comment refers to which part(s) of the query. This makes this approach not ideal for complex queries.
Brief "big picture" explanation at beginning, then brief comments within the query on their own lines
--Alfred Russel Wallace, Jun 2014 --Query to extract data on early life variables for females in study groups. Comments indicated with --. --1) pull basic data on everyone we might care about into a new table, following Charles's initial query and who survived to at least 4 years old (526 individuals) select sname, matgrp, pid, left(pid,3) as mom, sex, birth, bstatus, biograph.status, statuses.descr as status_descr, statdate, ((statdate - birth)/365.25) as deathage into coregroup_4yr from biograph, statuses where biograph.status = statuses.status --exclude lodge group and biograph.matgrp < '3.00' --exclude animals with bad age estimates and biograph.bstatus < '5' --exclude animals born before the onset of monitoring and birth > '1971-07-01' --exclude animals who lived for < 1 day and statdate > birth --exclude individuals who died before age 4 and ((statdate - birth)/365.25) >=4 order by matgrp, sname; --2) get all names and pids for moms represented in the core group and the statdate and birth for each kid SELECT sname, name, pid, left(pid,3) as mom,(biograph.statdate - birth) AS deathagedays,birth; FROM biograph WHERE left(pid,3) in (SELECT DISTINCT coregroup_4yr.mom FROM coregroup_4yr);
Following this approach, comments are easy to write; everytime you write a more-complicated piece of code, you immediately add a comment explaning it. Naturally, this also helps make the code easier for a new reader to understand when explanations are right there among the text.
However, the extra lines of text can easily turn a relatively simple query into a wall of text that can thus become generally harder for a person to read. Also, if a commenter isn't consistent with comment placement, it can be unclear if a comment refers to code above it or below it.
Technical details laid out briefly at beginning, then comments on the same line as the code
-- Charles Lyell, Feb 2016 -- Rules for querying information about natal dispersal --only individuals above 6.5 years --only individuals with a bstatus <1 --only individuals with matgrp <3 --individuals with disperse confidence >2 -- after you complete the query exclude the following individuals with BY dates: KRI, SAT, NJU -- This is a nested query: the fourth select statement is the first one executed, and so on. -- This query was written in TextWrangler for Mac. Comments and annotations are indicated with —- and are often provided after a query line. select hold3.* from (select hold2.* , dispersedates.dispersed , (hold2.date - dispersedates.dispersed) as drd --# of days between dispersal and collection date of this line's fecal sample , ((dispersedates.dispersed-hold2.birth)/365.25)::numeric(4,2) as dis_age , rankdates.ranked , (hold2.date - rankdates.ranked) as drr --# of days between rank attainment and collection date of this line's fecal sample , ((rankdates.ranked-hold2.birth)/365.25)::numeric(4,2) as rank_age , rankdates.rstatus, dispersedates.dispconfidence from (select hold.* from (select biograph.sname, biograph.matgrp, biograph.birth, fecal.prep.date, biograph.sex, fecal.prep.sid, fecal.results.gc, fecal.results.t, log(10, fecal.results.gc) as loggc, log(10,fecal.results.t) as logt, biograph.bstatus, ((fecal.prep.date-biograph.birth)/365.25)::numeric(4,2) as age from biograph, fecal.prep, fecal.results where biograph.sname = fecal.prep.sname --This is a join condition and fecal.results.gc>0 --excludes null values and fecal.results.t>0 --excludes null values and fecal.prep.date between '2000-01-01' and '2013-12-31' --range of available data that include male dominance ranks and biograph.sex = 'M' and biograph.matgrp <3 and fecal.prep.sid = fecal.results.sid --This is a join condition and biograph.bstatus <1 --includes individuals with known birthdates )as hold where hold.age between '6' and '10' --first cut to remove immature and older males, does not define the final age to be determined. ) as hold2 left join dispersedates --left join results in the inclusion of males from one table that have null values in the other table on dispersedates.sname=hold2.sname --join condition left join rankdates on rankdates.sname = hold2.sname ) as hold3 where hold3.drd between -180 and 180 --use this alternative when you want data set restricted to 6 months before and after, --if you want -1.5 years to +1.5, change to -540 to +540 and dis_age > 6.5 and hold3.dispconfidence>2 order by hold3.sname;
Note that in this case, many comments are reminders about SQL syntax (e.g. "This is a join condition"), rather than those in the previous examples that tended mostly to refer to the data itself (e.g. "exclude animals who lived for < 1 day"). This is okay; you should explain whatever you think may be unclear.
This approach is especially helpful when there are lots of technical details to relate. It's helpful for a reader if all those details are listed in one place at the beginning. However, as with an earlier example, when details are listed at the beginning it can be hard to know where a given rule is actually enforced in the code. In this example, the writer is trying to state the rules at the beginning and emphasize them again in-line, which is great. This can be risky, though: if you later decide to change a rule, it can be easy to change it in one place but not the other.
In this example, the writer included comments showing alternative code that could be added or could replace what's already there, e.g. "if you want -1.5 years to +1.5, change to -540 to +540". These kinds of comments are usually very helpful, and highly-encouraged.
Adding comments after a line of actual code has its pro's and con's. It's helpful for especially-brief comments (e.g. "excludes null values"), but if the comment or the code it refers to is especially long, then the whole line can become so long that the comment might be difficult to see.
Thorough explanation and rationale, then comments spread throughout the query
--Charles Darwin, Mar 2015 --Get relative (aka proportional) adult male ranks --Ranks in Babase are recorded as "ordinal" rankings, i.e. first, second, third, etc. (or --Alpha, Beta, Gamma, etc.) For some questions, it's may be useful to instead consider the --ranks proportionally; being #1 in a group of 10 may be quite different from being #1 in --a group of 50, for instance. This query calculates these "relative" ranks for each line --in the RANKS table. --In this case, the "proportion" being calculated is: --the adult male's rank / number of adult males in the group --("adult" males are defined by whether or not they have a rank date in RANKDATES) --This query assumes that for any given month, all the adult males are ranked above the --subadults and juveniles. If this is ever not so (a rel_rank > 1), a mistake was made in --assigning ranks and/or rank dates. I should bring these cases to a data manager's --attention ASAP. --First, make a temporary version of RANKS that excludes subadult and juvenile males. WITH adm_ranks AS ( SELECT ranks.* FROM ranks JOIN rankdates ON (rankdates.sname = ranks.sname --Only allow males that have attained adult rank on/by this date AND rankdates.ranked <= ranks.rnkdate) WHERE ranks.rnktype = 'ALM' ) --Using the temporary adult males-only RANKS table, calculate relative rank SELECT ord_ranks.rnkid , ord_ranks.sname , ord_ranks.rnkdate , ord_ranks.grp , ord_ranks.rnktype , ord_ranks.rank AS ord_rank , num_males.grp_size , (--Change both of these from integer so we can have a non-integer quotient ord_ranks.rank::numeric / num_males.grp_size::numeric )::numeric(3,2) AS rel_rank --Max 3 digits, 2 of which must be after the decimal FROM adm_ranks AS ord_ranks JOIN (--Get number of adult males by counting # of lines per grp-month in adm_ranks SELECT grp , rnkdate , COUNT(*) AS grp_size FROM adm_ranks GROUP BY grp, rnkdate ) AS num_males ON (num_males.grp = ord_ranks.grp AND num_males.rnkdate = ord_ranks.rnkdate) --In case I want to check a particular group and/or date: --WHERE ord_ranks.grp=2.11 AND ord_ranks.rnkdate>='2012-01-01' ORDER BY rnkdate, grp, ord_rank ;
When looking at someone's saved queries, the overall purpose of the query is often as unclear as the rationale behind a particular line of code. More than any of the other examples, this one uses comments to explain that purpose, not just the "why" behind a particular calculation or line of code. Most saved queries probably don't need the amount of detail used here to explain their purpose, but any amount of explanation along these lines can be helpful.
Note that unlike the others, this query also uses the recommendations laid out above for using descriptive aliases for column and table names, and for using indentation to make the query generally more human-readable.
This last approach tends to borrow from the styles shown in the previous examples. Sometimes a brief comment is added in-line, sometimes a comment is relegated to it's own line. Near the end, a "WHERE" statement is commented-out, but available to be used if needed. It's up to you to decide how you'll work. You might pick one of these, or (more likely) some hybrid of them. Whatever you decide, be consistent, and don't let yourself get lazy! You'll regret it in the future!
A Couple Useful Queries
Calculate Age
- Calculating exact age at statdate
SELECT sname, (statdate - birth)/365.25 AS age FROM biograph;
This query will create a column with sname and one with the age of each individual on their statdate. This will give us exact age. We use 365.25 days the calculation to account for leap years.
Note: There are multiple ways to calculate integer age and everyone has there own preference. The following example illustrates one method that used quite regularly in the Alberts lab.
- Often age is required in integer years. We use the FLOOR command and add one year.
SELECT sname, FLOOR((statdate - birth)/365.25) + 1 AS age FROM biograph;
This query returns sname and age in integer years. Note that 1 year is added to the calculation. The result is actually what “year of life” the individual is in on the statdate. Using this method an infant between birth and the first birthday will be considered of age 1, or in the first year of life. An individual who is 6.3 years old will be considered of age 7, or in the 7th year of life.
Change a table name
- Changing a table name
ALTER TABLE oldtablename RENAME TO newtablename
This query will change the name of an existing table. Simply replace "oldtablename" and "newtablename" with your specific original table name and your revised table name, respectively.
Concatenate table columns
- Concatenating table columns
SELECT *, season || year as seasonyear FROM sampletable
This query will concatenate the data from two columns in an existing table into a single column. In the sample query above, sampletable has two columns called "season" and "year". By running this query, a new column is created that concatenates season and year. For example, a season value of "wet" and a year value of "2003" concatenate to form a new value of "wet2003" in the output table.
Extract year (or day or month) from a date field
- Extracting year from a date field
SELECT *, extract (year from date) as year from sampletable
This query will extract the year from a date column and save the value in a new column called "year". In the sample query above, sampletable already has a column for the date (called "date"). By running this query, a new column is created that extracts the year portion of the date column and saves the output in a new column. For example, a date value of 1976-09-20 will result in a year value of 1976. Similar query structure can be used to extract the day and month from a date field by replacing "year" in the above query with either "day" or "month".