Size: 1361
Comment:
|
Size: 29980
Comment:
|
Deletions are marked like this. | Additions are marked like this. |
Line 1: | Line 1: |
SQL and phpPgAdmin for BaBase The Basics 2 Case sensitivity 2 Conventions in SQL syntax 2 Quotation marks 2 Data Types 3 Character 3 Number 3 Date 3 Time 3 Retrieving Data from a Table 3 Retrieving columns with SELECT and FROM 4 Eliminating duplicate rows with DISTINCT 4 Filtering rows with WHERE 4 Comparisons 5 Combining and negating conditions with AND, OR, and NOT 6 Matching patterns with LIKE 6 Range filtering with BETWEEN 7 List filtering with IN 7 Testing for blanks or nulls with NULL 7 Sorting rows with ORDER BY 8 Creating column aliases with AS 8 Operators and Functions 9 Performing Arithmetic Operations 9 Summarizing and Grouping Data 9 Aggregate Functions 9 COUNT 9 MIN and MAX 10 AVG 10 SUM 10 FLOOR and CEILING 10 Grouping rows with GROUP BY 11 Filtering groups with HAVING 11 Retrieving Data from Multiple Tables 12 Qualifying column names 12 Creating joins with JOIN or WHERE 12 Selecting Columns from two or more tables with WHERE 12 Join two or more tables using JOIN 13 Schemas 15 Saving your query result INTO a new table 16 Queries with multiple SELECT statements 16 Temporary tables 16 Deleting tables 16 Importing Tables into your schema 16 Exporting Tables 17 Sandbox 17 Permissions 17 Permissions 18 Views 18 Saving your queries 18 A Couple Useful Queries 19 Calculate Age 19 |
[[TableOfContents]] = The Basics = SQL (Structured Query Language) is the standard programming language for creating and retrieving information that is stored in a relational database. The user interface or front-end (what you see online) is called phpPgAdmin. The following pages are meant as a basic introduction to SQL, phpPgAdmin, and query writing. Please see BaBase documentation for specifics about the database. There are many resources online for more detailed information about SQL. There are links to SQL tutorials and documentation through the Babase Wiki Case sensitivity SQL language is case insensitive but user-defined identifiers (table names, column names, and table entries) are case sensitive. For example, the SQL SELECT statement could be written and SELECT, SeLect, or select but the table biograph is case sensitive and must be written as is. Conventions in SQL syntax In this document I have followed some SQL syntax conventions when writing query statements. UPPERCASE Uppercase indicates SQL language or commands lowercase Lowercase indicates user defined items such as tables and columns specific to BaBase. == Quotation marks == All quotation marks in SQL are straight quotes (such as ' and "), not curly quotes (such as ‘ and “). Curly quotes will not work in the SQL window. Be careful when cutting and pasting from text editors. Sometimes the default is curly quotes and you will receive an error message when you run the query. You can delete the quotes in the SQL query window and retype them. When you type directly into the SQL query window you will have the correct quote format. == Data Types == Each column in a table has a single data type. A data type determines a column’s allowable values. In general, we can classify columns as the following types. === Character === Represent text and are case sensitive. Whenever characters are requested they are case sensitive and need to be enclosed in single quotation marks. Examples of character columns are the name and pid columns in the biograph table. === Number === Numbers can be integers or decimals. Numbers can be negative, zero, or positive. Numbers do not require quotation marks. Examples of numeric columns are the matgrp and bstatus columns in the biograph table. Date A date stored in a column defined as DATE has three fields - YEAR, MONTH, and DAY - and is formatted yyyy-mm-dd. Dates must be enclosed in single quotations. Examples of date columns are the birth and statdate columns in the biograph table. Time Represents a time of day. A time stored in a column defined as TIME has three fields – HOUR, MINUTE, and SECOND – and is formatted hh:mm:ss. Times must be enclosed in single quotations. Examples of time columns are the start and stop columns in the interact_data table.Retrieving data from a table. Retrieving Data from a Table • The SELECT statement forms the basis of every question you pose to the database. • When you create and execute a SELECT statement you are ‘querying’ the database. • A SELECT statement is composed of several distinct keywords known as clauses. You define a SELECT statement using various configurations of the clauses to retrieve the information you require. Some clauses are required while others are optional. The main clauses in a SELECT statement are INTO, FROM, WHERE, GROUP BY, and ORDER BY. • You may need multiple statements or queries to retrieve the desired information. • Every SELECT statement must end in a semi colon, ; Retrieving columns with SELECT and FROM In its simplest form the SELECT statement retrieves columns FROM a table. The SELECT and FROM clauses are always required if you retrieve columns from tables. All other clauses are optional. • To retrieve one column from a table SELECT name FROM biograph; This query will return every row in the name column of the biograph table. • To retrieve multiple columns from a table SELECT name, birth FROM biograph; This statement will return every name and birthdate in the biograph table. • To retrieve all columns from a table without writing them all out you can use an asterisk “*”. SELECT * FROM biograph; This will return all of the rows and columns contained in the biograph table. Eliminating duplicate rows with DISTINCT Although DISTINCT is an aggregate function (discussed below) it is useful to introduce this one earlier on. • Columns often contain duplicate values, and it is common to want a result that lists each duplicate only once. SELECT DISTINCT(sname) FROM members; The members table has multiple entries for most individuals. In this case the DISTINCT command returns each sname only once. Filtering rows with WHERE The result of each SELECT statement so far has included every row in the table (for the specified columns). The WHERE clause filters unwanted rows from the result. In a WHERE clause you specify a search condition that contains one or more conditions that need to be satisfied by the rows of the table. Rows for which the condition or conditions are true are returned. Comparisons The most common types of conditions involve comparing two values to each other. There are six types of comparisons we can define. Note that inequality can be defined in two different ways. Symbol Usage = Equality <> Inequality (not equal to) != Inequality (not equal to) < Less than > Greater than <= Less than or equal to >= Greater than or equal to • Equality and inequality SELECT * FROM biograph WHERE sname=’CHA’; All information contained in the biograph table where the sname is ‘CHA’. SELECT * FROM biograph WHERE sex=’F’; All rows in the biograph table that have ‘F’ listed in the sex column will be returned. This is every row pertaining to a female. SELECT * FROM biograph WHERE matgrp <> 9; This query returns all rows from the biograph table where the matgrp (maternal group) is not 9. • Less than and greater than SELECT * FROM biograph WHERE birth < ‘2005-01-01’; This query will return every row in the biograph table where the birth date is before (but not including) January 1st, 2005. SELECT * FROM biograph WHERE birth <= ‘2005-01-01’; This query will return every row in the biograph table where the birth date is before or equal to January 1st, 2005. The “greater than” statements work in exactly the same way as the “less than” statements shown above. Combining and negating conditions with AND, OR, and NOT Often you will need to specify multiple conditions in one SELECT statement. You can combine two or more conditions using the AND, OR, and NOT operators. • AND connects two or more conditions and returns true only if all conditions are true SELECT * FROM biograph WHERE matgrp=1.1 AND sex=’M’ AND birth >= ‘2005-01-01’; This statement will return all rows in the biograph table where the individual is male, has a matgrp of 1.1 (born in Nyayo’s group) and was born on or after January 1st 2005. • OR connects two or more conditions and returns true if either or all conditions are true SELECT * FROM biograph WHERE sname=’ELV’ OR sname=’CHA’ OR sname=’POW’; This statement will return all rows in the biograph table where the sname is ‘ELV’, ‘CHA’ or ‘POW’. • Unlike AND and OR, NOT doesn’t connect two conditions. Instead it negates (reverses) a single condition SELECT * FROM biograph WHERE NOT matgrp <= 3; All rows in the biograph table where matgrp is not less than or equal to 3 will be returned. This could also be stated without using a NOT statement and using SELECT * FROM biograph WHERE matgrp > 3; Matching patterns with LIKE Preceding examples retrieved rows based on the exact value of a column or columns. You can use LIKE to retrieve rows based on partial information. LIKE works only with character strings. In the LIKE statement you can use two different wildcard characters. A percent sign (%) matches any string of zero or more characters. An underscore (_) matches any one character. You can negate a LIKE statement with NOT LIKE. • Find similar character strings SELECT * FROM biograph WHERE name LIKE 'C%'; This statement will return all rows in the biograph table where the name starts with a “C” SELECT * FROM biograph WHERE pid LIKE 'MET_'; This statement will return all rows in the biograph table where the pid starts with ‘MET’. In other words, all rows corresponding to an individual whose mother’s sname is ‘MET’. Range filtering with BETWEEN • Use BETWEEN to determine whether a given value falls within a range SELECT DISTINCT(sname) FROM members WHERE grp=1.1 AND date BETWEEN '2004-01-01' AND '2005-01-01'; This statement will return the distinct snames of all the individuals in the members table from group 1.1 between (this is an inclusive range) January 1st 2004 and January 1st 2005. In other words, who was in group 1.1 at some point during the given time period. List filtering with IN SELECT * FROM biograph WHERE sname IN ('CHA', 'ELV', 'POW'); Testing for blanks or nulls with NULL Nulls represent missing or unknown values. This situation can cause problems in WHERE clauses because unknown values do not satisfy specific conditions. Nulls match no values – not even other nulls. • Retrieve rows with nulls or non-null values SELECT * FROM biograph WHERE pid IS NULL; This statement returns every row in the biograph table where a null exists in the pid column. This translates to every individual whose mother is unknown to us. SELECT * FROM biograph WHERE sname IS NOT NULL; This statement returns every row in the biograph table where we have an sname for the individual. Aborted fetuses usually are not given a name but may still have a row in the biograph table. Sorting rows with ORDER BY Rows in a query result are unordered. Often you will want the output of your data in a particular order. For example, you may want your result ordered alphabetically or by date. • To sort your result by one column SELECT * FROM biograph WHERE sex=’M’ ORDER BY sname; This will return all entries in biograph where the sex is male and will be returned in alphabetical order by sname. • To sort your result by multiple columns SELECT * FROM members ORDER BY grp, date; This will return all entries in members where the group is equal to 2. The results will be in numerical order by group and, within each group, will be ordered chronologically by date. Creating column aliases with AS In the queries we have seen so far we have accepted the return of the default column names. The AS clause can be used to create a column alias. A column alias is an alternative name that you specify to control how the column headings are displayed. If a column name is too long or cryptic to you, you may want to change it in your output. Double quotes are required around the new name unless the alias is a single word that contains only letters, digits, or underscores. • Change the name of a column in your output. SELECT sname AS shortname FROM biograph; or SELECT sname AS "short name" FROM biograph; These two statements will both change the column sname in the output. Operators and Functions Performing Arithmetic Operations To add, subtract, multiply, or divide use the operators found in the following table. Symbol Usage + Addition - Subtraction * Multiplication / Division • Using subtraction with times SELECT (stop - start) AS "time spent" FROM interact; This query will create a column called “time spent” which will give us the length of each interaction in the interact table based on the start and stop time. • Using subtraction with dates SELECT sname, (statdate - birth) AS days FROM biograph; This query will create a column with sname and one called “days” with the number of days from birth until the statdate for each individual. Summarizing and Grouping Data Aggregate Functions COUNT • The count command does exactly what it says, it counts whatever you ask of it SELECT COUNT(sname) FROM biograph WHERE matgrp=1.1; This will show how many individuals were born into Nyayo’s group (1.1) • Many times it is very useful to combine the COUNT and the DISTINCT commands SELECT COUNT(DISTINCT(sname)) FROM members WHERE grp=1.1; The members table has multiple entries for most individuals. In this case the DISTINCT command lets us count each sname only once. This query tells us how many different individuals have been a member of group 1.1. MIN and MAX • As it sounds, these commands are used to determine the maximum and minimum value in a query SELECT MAX(rank), MIN(rank) FROM ranks WHERE sname=’ROC’; The result of this query is the maximum numeric rank attained by Rocky (5), and the minimum numeric rank attained by Rocky (1). • The maximum and minimum can also be used with dates SELECT MAX(date), MIN(date) FROM members WHERE grp=1.21; The result of this query is the first and last dates recorded in the members table for Omo’s group (1.21). AVG • This command allows you to find the average or mean of a group of numbers. SELECT AVG(stop-start) FROM interact_data WHERE act='C' ; This query lets us know the average length of all consortships found in the interact_data table SUM • This command allows you to find the sum of a group of numbers. SELECT SUM(stop-start) FROM interact_data WHERE act='C' ; This query lets us know the total time for which we have observed consortships from the interact_data table. FLOOR and CEILING • FLOOR returns the largest integer equal or less to the value provided. This is useful when rounding numbers. We tend to use this when calculating age in years. See Useful Queries section SELECT FLOOR(1.21); This will return the value of 1 which is the largest integer equal or less than 1.21. • The CEILING function returns the smallest integer value that is greater than the value provided. This is equivalent to rounding up to the next whole number. SELECT CEILING(6.45); This will return the value of 7 which is the smallest integer greater than 6.45. Grouping rows with GROUP BY All of the examples so far have applied the aggregate functions across all the rows returned by the FROM and WHERE clauses. You can use the GROUP BY clause to divide a table into groups or categories and apply aggregate functions to each subset. The GROUP BY clause comes after the WHERE clause but before the ORDER BY clause. An example will clarify the concept. • Use the COUNT aggregate function with GROUP BY SELECT matgrp, COUNT(matgrp) FROM biograph GROUP BY matgrp ORDER BY matgrp; This query returns the total number of individuals attributed to each matgrp in the biograph table. The result shows the count of individuals grouped by matgrp. • You can use GROUP BY with more than one grouping column SELECT matgrp, sex, COUNT(matgrp) FROM biograph GROUP BY matgrp, sex ORDER BY matgrp; This query returns the total number of individuals attributed to each matgrp in the biograph table. The results are also grouped by sex. The result shows the count of individuals grouped by matgrp and then sex. Filtering groups with HAVING Retrieving Data from Multiple Tables All of the queries so far have retrieved rows from a single table. In a relational database we may need information from multiple tables at once. In order to retrieve information from two or more tables simultaneously we need to understand how the columns in the different tables are related to each other. The Babase Pocket Reference outlines how the columns of all of the BaBase tables are connected to each other (entity diagram). Qualifying column names Column names must be unique within a table but can be reused in another table. To identify an otherwise ambiguous column uniquely in a query that involves multiple tables, we must use a qualified name. A qualified name is a table name followed by a dot followed by the name of the column in the table. Because table names within the database must be unique, a qualified name uniquely identifies a single column within the entire database. If two tables have any columns in common you must qualify all references to these columns throughout the query. A qualified table name is of the form: table.column Creating joins with JOIN or WHERE A query that extracts data from more than one table must specify how we want to join the tables. There must be a connection! The tables are always joined row by row by satisfying whatever conditions you specify in the query. We must always specify one or more join conditions to be evaluated for each pair of joined rows. Selecting Columns from two or more tables with WHERE • You may want to retrieve information from both the biograph table and the maturedates table SELECT biograph.sname, biograph.birth, maturedates.matured FROM biograph, maturedates WHERE biograph.sname=maturedates.sname; This query returns the sname and birth columns from the biograph table and the matured column from the maturedates table. The two tables are connected by sname. We are returned only the rows where we have a birth date and a matured date associated with a particular sname. • You can retrieve information from many tables at once as long as there is common column information with which you can compare the tables. SELECT biograph.sname, biograph.birth, maturedates.matured, rankdates.ranked, consortdates.consorted FROM biograph, maturedates, rankdates,consortdates WHERE biograph.sname=maturedates.sname AND biograph.sname=rankdates.sname AND biograph.sname=consortdates.sname; This query returns the sname and birth date from biograph, the matured date from maturedates, the ranked date from rankdates, and the consorted date from consortdates. Only the rows where the individual has all of these dates are returned. The next section describes joins which will allow you to retrieve the null or blank values as well. Join two or more tables using JOIN The most commonly used JOIN statements are shown below. There are others. I have given examples of the most useful joins. Common Joins Description Inner Join The most common join. Uses a comparison operator (=, <>,<, >, <=, >=) to match rows from two tables based on the values from common columns. Left Outer Join Right Outer Join Full Outer Join Returns all rows from the left table (first table mentioned is called the left table), not just the rows in which the joined columns match. Nulls are returned for the right tables if the left table has no matching rows in the right tables. The reverse of a Left Outer Join. All rows from the right table are returned. Nulls are returned for the left table if a right table has no matching row in the left table. A combination of the left and right outer joins. Returns all rows in both the left and right tables. • We can perform the same query from above (retrieving rows from biograph and maturedates using WHERE) using the INNER JOIN command SELECT biograph.sname, biograph.birth, maturedates.matured FROM biograph INNER JOIN maturedates ON biograph.sname=maturedates.sname; This query returns exact same information as above using INNER JOIN instead of WHERE. • You can select information from many tables at once using INNER JOIN SELECT biograph.sname, biograph.sex, maturedates.matured, members.grp FROM biograph INNER JOIN maturedates ON biograph.sname=maturedates.sname INNER JOIN members ON biograph.sname=members.sname WHERE members.grp=1.21 AND members.date='2005-01-01' ORDER BY biograph.sname; This more complicated query returns the snames of all the matured individuals plus sex, matured date, and group for all who were in grp 1.21 on January 1st 2005. They are ordered alphabetically by sname. • Example using LEFT OUTER JOIN SELECT biograph.sname, biograph.birth, maturedates.matured FROM biograph LEFT OUTER JOIN maturedates ON biograph.sname=maturedates.sname; If you take a look at the results of this query you can see that all rows from the biograph table (left table) are returned. We can now see which individuals do not have a matured date associated with them as well as the ones do. • Example using LEFT OUTER JOIN with more than two tables SELECT biograph.sname, biograph.sex, maturedates.matured, members.grp FROM biograph LEFT OUTER JOIN maturedates ON biograph.sname=maturedates.sname LEFT OUTER JOIN members ON biograph.sname=members.sname WHERE members.grp=1.21 AND members.date='2005-01-01' ORDER BY biograph.sname; Schemas Schemas partition databases. Tables, procedures, triggers, and so forth are all kept in schemas. Schemas are like sub-databases within a database. You can query tables from any schema within a particular database but not across databases. Tables within a database can be related, but tables in different databases cannot. Babase uses schemas to partition each database into areas where users have a greater or lesser degree of freedom to make changes. For further information on schemas see the schema documentation for PosgreSQL. Within the babase database the “official” babase tables are within the babase schema The system looks at the different schemas for objects, for example table names appearing in SQL queries, . If the table does not appear in the first schema it looks in the second, and so forth. As soon as table is found with the name given, that table is used and the search stops. To explicitly reference an object in a specific schema, place the name of the schema in front of the object, separating the two with a period (e.g. schemaname.tablename). Each BaBase user has his or her own area of the database, called a schema, where they can create their own tables. Your schema is identified by your BaBase login name and is located within the babase database. We qualify schema names and the associated tables the same way we qualified tables and columns above. A qualified table name within a schema is of the form: schemaname.tablename The biograph table could be written as: babase.biograph You can see which database you are in by looking at the schema search box within the SQL window. If you have selected the babase database the default search path is babase, then the sandbox, then your own personal schema. This means that the system looks at the different schemas for tables or columns, for example table names appearing in SQL queries, in the order in which the schemas are listed. If the table does not appear in the first schema it looks in the second, and so forth. As soon as table is found with the name given, that table is used and the search stops. a single SQL statement can refer to objects in the different schemas of the parent database, but cannot refer to objects in other databases -- tables within a database can be related, but tables in different databases cannot Saving your query result INTO a new table Unfortunately, at the present time downloading the results of a query requires the results be saved in a table and this table must be in your own schema. In the SELECT statement you will need to specify that your new table will be saved in your schema. To identify the hierarchy, the schema and table name are separated by a dot or period. • Save the results of a query into a table in your schema (username) SELECT * INTO username.mybiograph FROM biograph; A table called mybiograph will be created in the “username” schema. This new table is a copy of the biograph table (all rows and columns selected). Queries with multiple SELECT statements ********************NEED TO ADD THIS***************************** Temporary tables ********************NEED TO ADD THIS***************************** Deleting tables You are allowed to delete tables from your own schema. Only BaBase administrators have sufficient privileges to edit or delete the BaBase tables. The tables you create in your own schemas are not considered a part of BaBase, even though they may have been created from data contained in BaBase tables. To delete a table go into your schema. You will see a list of all tables you have created. Under the Actions bar you will see a button called Drop. You will be asked if you are sure you would like to delete the table. If you are sure then hit the Drop button. Do NOT check the cascade button. Data managers may have permission to add and delete tables from some sandbox schemas (hybridity etc. but not babase itself). Importing Tables into your schema 1. Turn your excel or text file into comma delimited format (CSV) using “save as” 2. Go to your schema (You can’t create tables anywhere else unless you are a data manager). 3. Hit the “create table” button 4. Name the table (lowercase with characters other than underscore is best) 5. Choose the number of columns 6. Name the columns (exact same names and same case as your CSV file) 7. Choose the data type for each column and whether nulls are allowed or not 8. You now have an empty table that can be filled with your data. Click on the table which is now in your schema and hit the “import” button 9. Pick “CSV” as the format and choose what your nulls look like (if any) 10. Browse for your file and hit import. All the rows should now be there. Exporting Tables 1. Click on the table you would like to export 2. Hit the “export” button 3. Select “data only” and CSV format 4. Under Option select “download” 5. You will be prompted to save or open the file with excel. Either option is fine. If you save it it will be called dump.csv. You can rename it later. If you choose to open with excel you can do a “save as” and save it directly as the XLS file with the name of your choice. This is the easiest option. Sandbox The sandbox schema holds tables that are used together with the “official” BaBase tables but have not yet made it into the Babase project. They will not be documented in the BaBase documentation. The tables in the Sandbox will be considered “final copies” of data. Everyone can create tables in the Sandbox schema. Only place completely proofed and useable data in the Sandbox. Permissions All BaBase users have all the permissions in the sandbox schema. They may add, delete, or modify any information in the schema but may not alter the structure of the schema by adding or removing tables, procedures, triggers, or anything else. PosgreSQL, the database underlying Babase, is secure by default. This means that any tables or other database objects cannot be acesssed by anyone but their creator or without permission of the creator. Babase_editors who create tables in the sandbox schema should use the GRANT statement to grant access to Babase's other users. babase_editors permissions in the sandbox schema The babase_editors have all the permissions of the babase_readers, plus they may add or delete tables, stored procedures, or any other sort of object necessary to control the structure of the data. Granting Permission to a table CREATE TABLE sandbox.foo (somecolumn INTEGER); This is done as follows: Example 2.2. Granting permission to table foo in the sandbox schema GRANT ALL ON sandbox.foo TO GROUP babase_editors; GRANT SELECT ON sandbox.foo TO GROUP babase_readers; There is one other issue. Only the creator of a table can change its structure -- to add another column, change the table name, etc. And only the creator can destroy (DROP) the table. Views A view is a virtual table composed of fields from one or more tables. Views enable us to see information from the database from many different perspectives. Saving your queries Save your SELECT statements! Saving your statements eliminates the need to recreate them every time you need to ask the same question of the database. At this point it is best to save your queries in any text program you like. It is also useful to write comments in your files. A comment is optional text that you type to explain your query Then you can copy and paste specific SELECT statements into the SQL window as many times as you like. Here is an example of a saved query NOTE: SQL has assigned two hyphens, “--“, as a comment. This means that any text following the comment characters will be ignored by SQL. A comment continues to the end of a line. -- count -- May 16, 2006 -- Count Number of A Couple Useful Queries Calculate Age • Calculating exact age at statdate SELECT sname, (statdate - birth)/365.25 AS age FROM biograph; This query will create a column with sname and one with the age of each individual on their statdate. This will give us exact age. We use 365.25 days the calculation to account for leap years. Note: There are multiple ways to calculate integer age and everyone has there own preference. The following example illustrates one method that used quite regularly in the Albert’s lab. • Often age is required in integer years. We use the FLOOR command and add one year. SELECT sname, FLOOR((statdate - birth)/365.25) + 1 AS age FROM biograph; This query returns sname and age in integer years. Note that 1 year is added to the calculation. The result is actually what “year of life” the individual is in on the statdate. Using this method an infant between birth and the first birthday will be considered of age 1, or in the first year of life. An individual who is 6.3 years old will be considered of age 7, or in the 7th year of life. |
The Basics
SQL (Structured Query Language) is the standard programming language for creating and retrieving information that is stored in a relational database. The user interface or front-end (what you see online) is called phpPgAdmin. The following pages are meant as a basic introduction to SQL, phpPgAdmin, and query writing. Please see BaBase documentation for specifics about the database. There are many resources online for more detailed information about SQL. There are links to SQL tutorials and documentation through the Babase Wiki Case sensitivity
SQL language is case insensitive but user-defined identifiers (table names, column names, and table entries) are case sensitive. For example, the SQL SELECT statement could be written and SELECT, SeLect, or select but the table biograph is case sensitive and must be written as is. Conventions in SQL syntax
In this document I have followed some SQL syntax conventions when writing query statements.
UPPERCASE Uppercase indicates SQL language or commands
lowercase Lowercase indicates user defined items such as tables and columns specific to BaBase.
Quotation marks
All quotation marks in SQL are straight quotes (such as ' and "), not curly quotes (such as ‘ and “). Curly quotes will not work in the SQL window. Be careful when cutting and pasting from text editors. Sometimes the default is curly quotes and you will receive an error message when you run the query. You can delete the quotes in the SQL query window and retype them. When you type directly into the SQL query window you will have the correct quote format.
Data Types
Each column in a table has a single data type. A data type determines a column’s allowable values. In general, we can classify columns as the following types.
Character
Represent text and are case sensitive. Whenever characters are requested they are case sensitive and need to be enclosed in single quotation marks. Examples of character columns are the name and pid columns in the biograph table.
Number
Numbers can be integers or decimals. Numbers can be negative, zero, or positive. Numbers do not require quotation marks. Examples of numeric columns are the matgrp and bstatus columns in the biograph table. Date A date stored in a column defined as DATE has three fields - YEAR, MONTH, and DAY - and is formatted yyyy-mm-dd. Dates must be enclosed in single quotations. Examples of date columns are the birth and statdate columns in the biograph table. Time Represents a time of day. A time stored in a column defined as TIME has three fields – HOUR, MINUTE, and SECOND – and is formatted hh:mm:ss. Times must be enclosed in single quotations. Examples of time columns are the start and stop columns in the interact_data table.Retrieving data from a table.
Retrieving Data from a Table
• The SELECT statement forms the basis of every question you pose to the database. • When you create and execute a SELECT statement you are ‘querying’ the database. • A SELECT statement is composed of several distinct keywords known as clauses. You define a SELECT statement using various configurations of the clauses to retrieve the information you require. Some clauses are required while others are optional. The main clauses in a SELECT statement are INTO, FROM, WHERE, GROUP BY, and ORDER BY. • You may need multiple statements or queries to retrieve the desired information. • Every SELECT statement must end in a semi colon, ; Retrieving columns with SELECT and FROM
In its simplest form the SELECT statement retrieves columns FROM a table. The SELECT and FROM clauses are always required if you retrieve columns from tables. All other clauses are optional.
• To retrieve one column from a table
SELECT name FROM biograph;
This query will return every row in the name column of the biograph table.
• To retrieve multiple columns from a table
SELECT name, birth FROM biograph;
This statement will return every name and birthdate in the biograph table.
• To retrieve all columns from a table without writing them all out you can use an asterisk “*”.
SELECT * FROM biograph;
This will return all of the rows and columns contained in the biograph table.
Eliminating duplicate rows with DISTINCT
Although DISTINCT is an aggregate function (discussed below) it is useful to introduce this one earlier on.
• Columns often contain duplicate values, and it is common to want a result that lists each duplicate only once.
SELECT DISTINCT(sname) FROM members;
The members table has multiple entries for most individuals. In this case the DISTINCT command returns each sname only once. Filtering rows with WHERE
The result of each SELECT statement so far has included every row in the table (for the specified columns). The WHERE clause filters unwanted rows from the result. In a WHERE clause you specify a search condition that contains one or more conditions that need to be satisfied by the rows of the table. Rows for which the condition or conditions are true are returned.
Comparisons
The most common types of conditions involve comparing two values to each other. There are six types of comparisons we can define. Note that inequality can be defined in two different ways.
Symbol Usage = Equality <> Inequality (not equal to) != Inequality (not equal to) < Less than > Greater than <= Less than or equal to >= Greater than or equal to
• Equality and inequality
SELECT * FROM biograph WHERE sname=’CHA’;
All information contained in the biograph table where the sname is ‘CHA’.
- SELECT * FROM biograph WHERE sex=’F’;
All rows in the biograph table that have ‘F’ listed in the sex column will be returned. This is every row pertaining to a female.
SELECT * FROM biograph WHERE matgrp <> 9;
This query returns all rows from the biograph table where the matgrp (maternal group) is not 9.
• Less than and greater than
SELECT * FROM biograph WHERE birth < ‘2005-01-01’;
This query will return every row in the biograph table where the birth date is before (but not including) January 1st, 2005.
SELECT * FROM biograph WHERE birth <= ‘2005-01-01’;
This query will return every row in the biograph table where the birth date is before or equal to January 1st, 2005.
The “greater than” statements work in exactly the same way as the “less than” statements shown above. Combining and negating conditions with AND, OR, and NOT
Often you will need to specify multiple conditions in one SELECT statement. You can combine two or more conditions using the AND, OR, and NOT operators.
• AND connects two or more conditions and returns true only if all conditions are true
SELECT * FROM biograph WHERE matgrp=1.1
AND sex=’M’ AND birth >= ‘2005-01-01’;
This statement will return all rows in the biograph table where the individual is male, has a matgrp of 1.1 (born in Nyayo’s group) and was born on or after January 1st 2005.
• OR connects two or more conditions and returns true if either or all conditions are true
SELECT * FROM biograph WHERE sname=’ELV’ OR sname=’CHA’ OR sname=’POW’;
This statement will return all rows in the biograph table where the sname is ‘ELV’, ‘CHA’ or ‘POW’.
• Unlike AND and OR, NOT doesn’t connect two conditions. Instead it negates (reverses) a single condition
SELECT * FROM biograph WHERE NOT matgrp <= 3;
All rows in the biograph table where matgrp is not less than or equal to 3 will be returned. This could also be stated without using a NOT statement and using SELECT * FROM biograph WHERE matgrp > 3; Matching patterns with LIKE
Preceding examples retrieved rows based on the exact value of a column or columns. You can use LIKE to retrieve rows based on partial information. LIKE works only with character strings. In the LIKE statement you can use two different wildcard characters. A percent sign (%) matches any string of zero or more characters. An underscore (_) matches any one character. You can negate a LIKE statement with NOT LIKE.
• Find similar character strings
SELECT * FROM biograph WHERE name LIKE 'C%';
This statement will return all rows in the biograph table where the name starts with a “C”
SELECT * FROM biograph WHERE pid LIKE 'MET_';
This statement will return all rows in the biograph table where the pid starts with ‘MET’. In other words, all rows corresponding to an individual whose mother’s sname is ‘MET’. Range filtering with BETWEEN
• Use BETWEEN to determine whether a given value falls within a range
SELECT DISTINCT(sname) FROM members WHERE grp=1.1 AND date BETWEEN '2004-01-01' AND '2005-01-01';
This statement will return the distinct snames of all the individuals in the members table from group 1.1 between (this is an inclusive range) January 1st 2004 and January 1st 2005. In other words, who was in group 1.1 at some point during the given time period. List filtering with IN
- SELECT * FROM biograph WHERE sname IN ('CHA', 'ELV', 'POW');
Testing for blanks or nulls with NULL
Nulls represent missing or unknown values. This situation can cause problems in WHERE clauses because unknown values do not satisfy specific conditions. Nulls match no values – not even other nulls.
• Retrieve rows with nulls or non-null values
SELECT * FROM biograph WHERE pid IS NULL;
This statement returns every row in the biograph table where a null exists in the pid column. This translates to every individual whose mother is unknown to us.
SELECT * FROM biograph WHERE sname IS NOT NULL;
This statement returns every row in the biograph table where we have an sname for the individual. Aborted fetuses usually are not given a name but may still have a row in the biograph table. Sorting rows with ORDER BY
Rows in a query result are unordered. Often you will want the output of your data in a particular order. For example, you may want your result ordered alphabetically or by date.
• To sort your result by one column
SELECT * FROM biograph WHERE sex=’M’ ORDER BY sname;
This will return all entries in biograph where the sex is male and will be returned in alphabetical order by sname.
• To sort your result by multiple columns
SELECT * FROM members ORDER BY grp, date;
This will return all entries in members where the group is equal to 2. The results will be in numerical order by group and, within each group, will be ordered chronologically by date. Creating column aliases with AS
In the queries we have seen so far we have accepted the return of the default column names. The AS clause can be used to create a column alias. A column alias is an alternative name that you specify to control how the column headings are displayed. If a column name is too long or cryptic to you, you may want to change it in your output. Double quotes are required around the new name unless the alias is a single word that contains only letters, digits, or underscores.
• Change the name of a column in your output.
SELECT sname AS shortname FROM biograph;
or
SELECT sname AS "short name" FROM biograph;
These two statements will both change the column sname in the output.
Operators and Functions Performing Arithmetic Operations
To add, subtract, multiply, or divide use the operators found in the following table.
Symbol Usage + Addition - Subtraction * Multiplication / Division
• Using subtraction with times
SELECT (stop - start) AS "time spent" FROM interact;
This query will create a column called “time spent” which will give us the length of each interaction in the interact table based on the start and stop time.
• Using subtraction with dates
SELECT sname, (statdate - birth) AS days FROM biograph;
This query will create a column with sname and one called “days” with the number of days from birth until the statdate for each individual. Summarizing and Grouping Data Aggregate Functions COUNT
• The count command does exactly what it says, it counts whatever you ask of it
SELECT COUNT(sname) FROM biograph WHERE matgrp=1.1;
This will show how many individuals were born into Nyayo’s group (1.1)
• Many times it is very useful to combine the COUNT and the DISTINCT commands
SELECT COUNT(DISTINCT(sname)) FROM members WHERE grp=1.1;
The members table has multiple entries for most individuals. In this case the DISTINCT command lets us count each sname only once. This query tells us how many different individuals have been a member of group 1.1. MIN and MAX
• As it sounds, these commands are used to determine the maximum and minimum value in a query
SELECT MAX(rank), MIN(rank) FROM ranks WHERE sname=’ROC’;
The result of this query is the maximum numeric rank attained by Rocky (5), and the minimum numeric rank attained by Rocky (1).
• The maximum and minimum can also be used with dates
SELECT MAX(date), MIN(date) FROM members WHERE grp=1.21;
The result of this query is the first and last dates recorded in the members table for Omo’s group (1.21). AVG
• This command allows you to find the average or mean of a group of numbers.
SELECT AVG(stop-start) FROM interact_data WHERE act='C' ;
This query lets us know the average length of all consortships found in the interact_data table SUM
• This command allows you to find the sum of a group of numbers.
SELECT SUM(stop-start) FROM interact_data WHERE act='C' ;
This query lets us know the total time for which we have observed consortships from the interact_data table. FLOOR and CEILING
• FLOOR returns the largest integer equal or less to the value provided. This is useful when rounding numbers. We tend to use this when calculating age in years. See Useful Queries section
SELECT FLOOR(1.21);
This will return the value of 1 which is the largest integer equal or less than 1.21.
• The CEILING function returns the smallest integer value that is greater than the value provided. This is equivalent to rounding up to the next whole number.
SELECT CEILING(6.45);
This will return the value of 7 which is the smallest integer greater than 6.45. Grouping rows with GROUP BY
All of the examples so far have applied the aggregate functions across all the rows returned by the FROM and WHERE clauses. You can use the GROUP BY clause to divide a table into groups or categories and apply aggregate functions to each subset. The GROUP BY clause comes after the WHERE clause but before the ORDER BY clause. An example will clarify the concept.
• Use the COUNT aggregate function with GROUP BY
SELECT matgrp, COUNT(matgrp) FROM biograph GROUP BY matgrp ORDER BY matgrp;
This query returns the total number of individuals attributed to each matgrp in the biograph table. The result shows the count of individuals grouped by matgrp.
• You can use GROUP BY with more than one grouping column
SELECT matgrp, sex, COUNT(matgrp) FROM biograph GROUP BY matgrp, sex ORDER BY matgrp;
This query returns the total number of individuals attributed to each matgrp in the biograph table. The results are also grouped by sex. The result shows the count of individuals grouped by matgrp and then sex. Filtering groups with HAVING
Retrieving Data from Multiple Tables
All of the queries so far have retrieved rows from a single table. In a relational database we may need information from multiple tables at once. In order to retrieve information from two or more tables simultaneously we need to understand how the columns in the different tables are related to each other. The Babase Pocket Reference outlines how the columns of all of the BaBase tables are connected to each other (entity diagram). Qualifying column names
Column names must be unique within a table but can be reused in another table. To identify an otherwise ambiguous column uniquely in a query that involves multiple tables, we must use a qualified name. A qualified name is a table name followed by a dot followed by the name of the column in the table. Because table names within the database must be unique, a qualified name uniquely identifies a single column within the entire database. If two tables have any columns in common you must qualify all references to these columns throughout the query.
A qualified table name is of the form:
table.column
Creating joins with JOIN or WHERE
A query that extracts data from more than one table must specify how we want to join the tables. There must be a connection! The tables are always joined row by row by satisfying whatever conditions you specify in the query. We must always specify one or more join conditions to be evaluated for each pair of joined rows. Selecting Columns from two or more tables with WHERE
• You may want to retrieve information from both the biograph table and the maturedates table
SELECT biograph.sname, biograph.birth, maturedates.matured FROM biograph, maturedates WHERE biograph.sname=maturedates.sname;
This query returns the sname and birth columns from the biograph table and the matured column from the maturedates table. The two tables are connected by sname. We are returned only the rows where we have a birth date and a matured date associated with a particular sname.
• You can retrieve information from many tables at once as long as there is common column information with which you can compare the tables.
SELECT biograph.sname, biograph.birth, maturedates.matured, rankdates.ranked, consortdates.consorted FROM biograph, maturedates, rankdates,consortdates WHERE biograph.sname=maturedates.sname AND biograph.sname=rankdates.sname AND biograph.sname=consortdates.sname;
This query returns the sname and birth date from biograph, the matured date from maturedates, the ranked date from rankdates, and the consorted date from consortdates. Only the rows where the individual has all of these dates are returned. The next section describes joins which will allow you to retrieve the null or blank values as well. Join two or more tables using JOIN
The most commonly used JOIN statements are shown below. There are others. I have given examples of the most useful joins.
Common Joins Description Inner Join The most common join. Uses a comparison operator (=, <>,<, >, <=, >=) to match rows from two tables based on the values from common columns.
Left Outer Join
Right Outer Join
Full Outer Join Returns all rows from the left table (first table mentioned is called the left table), not just the rows in which the joined columns match. Nulls are returned for the right tables if the left table has no matching rows in the right tables.
The reverse of a Left Outer Join. All rows from the right table are returned. Nulls are returned for the left table if a right table has no matching row in the left table.
A combination of the left and right outer joins. Returns all rows in both the left and right tables.
• We can perform the same query from above (retrieving rows from biograph and maturedates using WHERE) using the INNER JOIN command
SELECT biograph.sname, biograph.birth, maturedates.matured FROM biograph INNER JOIN maturedates ON biograph.sname=maturedates.sname;
This query returns exact same information as above using INNER JOIN instead of WHERE.
• You can select information from many tables at once using INNER JOIN
SELECT biograph.sname, biograph.sex, maturedates.matured, members.grp FROM biograph INNER JOIN maturedates ON biograph.sname=maturedates.sname INNER JOIN members ON biograph.sname=members.sname WHERE members.grp=1.21 AND members.date='2005-01-01' ORDER BY biograph.sname;
This more complicated query returns the snames of all the matured individuals plus sex, matured date, and group for all who were in grp 1.21 on January 1st 2005. They are ordered alphabetically by sname.
• Example using LEFT OUTER JOIN
SELECT biograph.sname, biograph.birth, maturedates.matured FROM biograph LEFT OUTER JOIN maturedates ON biograph.sname=maturedates.sname;
If you take a look at the results of this query you can see that all rows from the biograph table (left table) are returned. We can now see which individuals do not have a matured date associated with them as well as the ones do.
• Example using LEFT OUTER JOIN with more than two tables
SELECT biograph.sname, biograph.sex, maturedates.matured, members.grp FROM biograph LEFT OUTER JOIN maturedates ON biograph.sname=maturedates.sname LEFT OUTER JOIN members ON biograph.sname=members.sname WHERE members.grp=1.21 AND members.date='2005-01-01' ORDER BY biograph.sname; Schemas Schemas partition databases. Tables, procedures, triggers, and so forth are all kept in schemas. Schemas are like sub-databases within a database. You can query tables from any schema within a particular database but not across databases. Tables within a database can be related, but tables in different databases cannot. Babase uses schemas to partition each database into areas where users have a greater or lesser degree of freedom to make changes. For further information on schemas see the schema documentation for PosgreSQL. Within the babase database the “official” babase tables are within the babase schema The system looks at the different schemas for objects, for example table names appearing in SQL queries, . If the table does not appear in the first schema it looks in the second, and so forth. As soon as table is found with the name given, that table is used and the search stops. To explicitly reference an object in a specific schema, place the name of the schema in front of the object, separating the two with a period (e.g. schemaname.tablename).
Each BaBase user has his or her own area of the database, called a schema, where they can create their own tables. Your schema is identified by your BaBase login name and is located within the babase database. We qualify schema names and the associated tables the same way we qualified tables and columns above.
A qualified table name within a schema is of the form:
schemaname.tablename
The biograph table could be written as:
- babase.biograph
You can see which database you are in by looking at the schema search box within the SQL window. If you have selected the babase database the default search path is babase, then the sandbox, then your own personal schema. This means that the system looks at the different schemas for tables or columns, for example table names appearing in SQL queries, in the order in which the schemas are listed. If the table does not appear in the first schema it looks in the second, and so forth. As soon as table is found with the name given, that table is used and the search stops. a single SQL statement can refer to objects in the different schemas of the parent database, but cannot refer to objects in other databases -- tables within a database can be related, but tables in different databases cannot Saving your query result INTO a new table
Unfortunately, at the present time downloading the results of a query requires the results be saved in a table and this table must be in your own schema. In the SELECT statement you will need to specify that your new table will be saved in your schema. To identify the hierarchy, the schema and table name are separated by a dot or period.
• Save the results of a query into a table in your schema (username)
SELECT * INTO username.mybiograph FROM biograph;
A table called mybiograph will be created in the “username” schema. This new table is a copy of the biograph table (all rows and columns selected). Queries with multiple SELECT statements
********************NEED TO ADD THIS***************************** Temporary tables
********************NEED TO ADD THIS***************************** Deleting tables
You are allowed to delete tables from your own schema. Only BaBase administrators have sufficient privileges to edit or delete the BaBase tables. The tables you create in your own schemas are not considered a part of BaBase, even though they may have been created from data contained in BaBase tables. To delete a table go into your schema. You will see a list of all tables you have created. Under the Actions bar you will see a button called Drop. You will be asked if you are sure you would like to delete the table. If you are sure then hit the Drop button. Do NOT check the cascade button. Data managers may have permission to add and delete tables from some sandbox schemas (hybridity etc. but not babase itself). Importing Tables into your schema
1. Turn your excel or text file into comma delimited format (CSV) using “save as” 2. Go to your schema (You can’t create tables anywhere else unless you are a data manager). 3. Hit the “create table” button 4. Name the table (lowercase with characters other than underscore is best) 5. Choose the number of columns 6. Name the columns (exact same names and same case as your CSV file) 7. Choose the data type for each column and whether nulls are allowed or not 8. You now have an empty table that can be filled with your data. Click on the table which is now in your schema and hit the “import” button 9. Pick “CSV” as the format and choose what your nulls look like (if any) 10. Browse for your file and hit import. All the rows should now be there. Exporting Tables
1. Click on the table you would like to export 2. Hit the “export” button 3. Select “data only” and CSV format 4. Under Option select “download” 5. You will be prompted to save or open the file with excel. Either option is fine. If you save it it will be called dump.csv. You can rename it later. If you choose to open with excel you can do a “save as” and save it directly as the XLS file with the name of your choice. This is the easiest option. Sandbox The sandbox schema holds tables that are used together with the “official” BaBase tables but have not yet made it into the Babase project. They will not be documented in the BaBase documentation. The tables in the Sandbox will be considered “final copies” of data. Everyone can create tables in the Sandbox schema. Only place completely proofed and useable data in the Sandbox. Permissions
All BaBase users have all the permissions in the sandbox schema. They may add, delete, or modify any information in the schema but may not alter the structure of the schema by adding or removing tables, procedures, triggers, or anything else. PosgreSQL, the database underlying Babase, is secure by default. This means that any tables or other database objects cannot be acesssed by anyone but their creator or without permission of the creator. Babase_editors who create tables in the sandbox schema should use the GRANT statement to grant access to Babase's other users. babase_editors permissions in the sandbox schema The babase_editors have all the permissions of the babase_readers, plus they may add or delete tables, stored procedures, or any other sort of object necessary to control the structure of the data. Granting Permission to a table
CREATE TABLE sandbox.foo (somecolumn INTEGER);
This is done as follows: Example 2.2. Granting permission to table foo in the sandbox schema
GRANT ALL ON sandbox.foo TO GROUP babase_editors; GRANT SELECT ON sandbox.foo TO GROUP babase_readers; There is one other issue. Only the creator of a table can change its structure -- to add another column, change the table name, etc. And only the creator can destroy (DROP) the table.
Views
A view is a virtual table composed of fields from one or more tables. Views enable us to see information from the database from many different perspectives. Saving your queries
Save your SELECT statements! Saving your statements eliminates the need to recreate them every time you need to ask the same question of the database. At this point it is best to save your queries in any text program you like. It is also useful to write comments in your files. A comment is optional text that you type to explain your query Then you can copy and paste specific SELECT statements into the SQL window as many times as you like. Here is an example of a saved query
NOTE: SQL has assigned two hyphens, “--“, as a comment. This means that any text following the comment characters will be ignored by SQL. A comment continues to the end of a line.
-- count -- May 16, 2006 -- Count Number of A Couple Useful Queries Calculate Age
• Calculating exact age at statdate
SELECT sname, (statdate - birth)/365.25 AS age FROM biograph;
This query will create a column with sname and one with the age of each individual on their statdate. This will give us exact age. We use 365.25 days the calculation to account for leap years.
Note: There are multiple ways to calculate integer age and everyone has there own preference. The following example illustrates one method that used quite regularly in the Albert’s lab.
• Often age is required in integer years. We use the FLOOR command and add one year.
SELECT sname, FLOOR((statdate - birth)/365.25) + 1 AS age FROM biograph;
This query returns sname and age in integer years. Note that 1 year is added to the calculation. The result is actually what “year of life” the individual is in on the statdate. Using this method an infant between birth and the first birthday will be considered of age 1, or in the first year of life. An individual who is 6.3 years old will be considered of age 7, or in the 7th year of life.