ExtendedQuizQuestions

What does ending your query with “limit 100” do and when should you use this statement?

What data will this query return?

select * from biograph, members 
where matgrp = grp and 
biograph.sname = members.sname and 
sex = 'M' and 
status = 0 and 
dcause = 0 and 
statdate = date 
order by matgrp, pid;

What data will this query return?

select * from biograph, members 
where biograph.matgrp = members.grp and 
biograph.sname = members.sname and 
biograph.sex = 'M' and 
biograph.status = 0 and 
biograph.dcause = 0 and 
biograph.statdate = members.date 
order by biograph.matgrp, biograph.pid;

What data will this query return?

select *
  from biograph
       join members on (biograph.matgrp = members.grp and 
                        biograph.sname = members.sname)
   where biograph.sex = 'M' and 
         biograph.status = 0 and 
         biograph.dcause = 0 and 
         biograph.statdate = members.date 
   order by biograph.matgrp, biograph.pid;

What data will this query return?

-- This is an illustrative query used in the Babase Quiz
-- It is fiendishly tricky so pay close attention!
SELECT *
  FROM biograph
       JOIN members ON (biograph.matgrp = members.grp AND
                        biograph.sname = members.sname)
  WHERE biograph.sex = 'M' AND   --  M means male (tricky 'eh?)
        biograph.status = 0 AND
        biograph.dcause = 0 AND
        biograph.statdate = members.date 
  ORDER BY biograph.matgrp, biograph.pid;

What are the pros and cons of the various queries above. Which do you prefer?

Answer:

It's a matter of both taste and purpose. Your mileage may vary. While these methods are considered safe for most adults, always contact your physician before use as individual reactions may vary. De gustibus non est disputandum. When column names are not qualified
with the tables to which they belong (by writing tablename.columname) the query is quick to type and answers can be obtained rapidly. But fully qualifying all the column names has some advantages. It makes the query easier to understand and it makes the query easier to extend; extending a query by joining another table will break a query which uses unqualified column names if the new table shares a column name with any of the previously used tables. Queries do not usually spring into existence all at once, they are usually developed step by step with additional tables added as the query develops. Getting into the habit of fully qualifying all your column names can keep you focused as you write a more complex query because you don't have to go back and qualify columns when you want to concentrate on extending the query you are building. Likewise, breaking a query into multiple lines by "clause", FROM, WHERE, etc., and indenting the lines takes longer but makes the query easier to understand. This can help in developing a complex query and when going back to read and re-use an old query. When the query is simple and will never be re-used there's no point in expending the effort. At other times formatting can be very useful.
Using JOIN table ON (condition) makes your queries more clear, to both the reader and the computer, by explicitly specifying which columns are used to join tables. These correspond to the lines in the entity relationship diagrams that document the database structure. If the join relationships are buried in the WHERE clause along with the conditions that further restrict which rows from individual tables should be incorporated into the results it is more difficult to understand what the query is doing. Using capitalization to separate the keywords that are part of SQL, like SELECT, FROM, WHERE, AND, etc., from the parts of the query that are specific to Babase helps draw the eye to important aspects of the query and ignore unimportant aspects. At times the difference between an AND and an OR can be critical. At other times it's more important to focus on which tables are involved. Visually separating the SQL keywords from the source of the data helps to debug syntax and differentiate semantically similar queries. Remember, a query that can't be read by a human, even one that runs perfectly, is useless because nobody knows what the output means. Finally, adding comments, both specific comments describing what parts of the query do and why they are there, and general comments describing the overall purpose of the query helps immensely when going back to re-visit an old query. Often an analysis may take some time, and require corrections to both data and queries and require multiple runs of the same query. It is usually worth the up-front time to make it easy to go back and understand just what your queries do.

Write a query that shows each date that each male offspring spent in his maternal group -- the one codified in the GROUPS table.

SELECT biograph.matgrp, biograph.sname, members.date
  FROM biograph
       JOIN members ON (members.grp = biograph.matgrp AND
                        members.sname = biograph.sname)
  WHERE biograph.sex = 'M'
  ORDER BY biograph.matgrp, biograph.sname, members.date;

Write a query that returns the number of days that each male offspring spends in his maternal group -- the one codified in the GROUPS table.

SELECT biograph.matgrp, biograph.sname, COUNT(*) AS daysingrp
  FROM biograph
       JOIN members ON (members.grp = biograph.matgrp AND
                        members.sname = biograph.sname)
  WHERE biograph.sex = 'M'
  GROUP BY biograph.matgrp, biograph.sname
  ORDER BY biograph.matgrp, biograph.sname;

Hot tip to takeaway from the above: First get the rows you want, then analyze them.

Write a query that returns the mean number of days, per each group defined in the GROUPS table, that a male offspring spends in his maternal group. Include a column showing how many males were in the sample. Having done that, extend the query so it shows not only the mean but the standard deviation.

-- Use a transaction, so even if there's an error the
-- table we create will go away automatically when the transaction
-- rolls back.
BEGIN;

-- Create an intermediate table in my personal schema to query for the real results.
SELECT biograph.matgrp, biograph.sname, COUNT(*) AS daysingrp
  INTO TABLE myschema.foobar
  FROM biograph
       JOIN members ON (members.grp = biograph.matgrp AND
                        members.sname = biograph.sname)
  WHERE biograph.sex = 'M'
  GROUP BY biograph.matgrp, biograph.sname;

SELECT foobar.matgrp, COUNT(*), AVG(foobar.daysingrp)
  FROM foobar
  GROUP BY foobar.matgrp
  ORDER BY foobar.matgrp;

-- Undo all the changes we made to the database, i.e. forget
-- about the intermediate table we created.
ROLLBACK;

or better:

-- Create a temporary table that automatically goes away when this
-- sql session is done executing.
SELECT biograph.matgrp, biograph.sname, COUNT(*) AS daysingrp
  INTO TEMP TABLE foobar
  FROM biograph
       JOIN members ON (members.grp = biograph.matgrp AND
                        members.sname = biograph.sname)
  WHERE biograph.sex = 'M'
  GROUP BY biograph.matgrp, biograph.sname;

SELECT foobar.matgrp, COUNT(*), AVG(foobar.daysingrp)
  FROM foobar
  GROUP BY foobar.matgrp
  ORDER BY foobar.matgrp;

or even better:

-- Instead of a real table run a sub-query and use those results
-- as if they were a real table.
SELECT foobar.matgrp, COUNT(*), AVG(foobar.daysingrp)
  FROM (SELECT biograph.matgrp, biograph.sname, COUNT(*) AS daysingrp
          FROM biograph
               JOIN members ON (members.grp = biograph.matgrp AND
                                members.sname = biograph.sname)
          WHERE biograph.sex = 'M'
          GROUP BY biograph.matgrp, biograph.sname) AS foobar
  GROUP BY foobar.matgrp
  ORDER BY foobar.matgrp;

-- And, if we want standard devation:
SELECT foobar.matgrp, COUNT(*), AVG(foobar.daysingrp), STDDEV(foobar.daysingrp)
  FROM (SELECT biograph.matgrp, biograph.sname, COUNT(*) AS daysingrp
          FROM biograph
               JOIN members ON (members.grp = biograph.matgrp AND
                                members.sname = biograph.sname)
          WHERE biograph.sex = 'M'
          GROUP BY biograph.matgrp, biograph.sname) AS foobar
  GROUP BY foobar.matgrp
  ORDER BY foobar.matgrp;

What are the four things that are wrong with using the above queries to analyze how long males stay in their maternal group before dispersing? I.e. Given that the purpose is to analyze male dispersal what's wrong with the way the questions above were framed and what data peculiarities get in the way of obtaining a good answer? Answer: 1) Some of the males die before they disperse. 2) Some of the groups fission before the juvenile males disperse so the males "move" to another group as the result of fission. 3) Some groups were not under continuous observation and when there is no observation for long enough individuals are automatically placed in the unknown group. 4) As of the end of the most recent data entered some of the males are not old enough to have dispersed.

Write a query that shows, for each male, how many days the male stays in it's maternal group before dispersing. Hint: Use the DISPSERSEDATES table.

-- We can use the Babase julian() function, which returns
-- the date as a number of days from a starting point in time.

SELECT biograph.sname, julian(dispersedates.dispersed) - julian(biograph.birth)
  FROM biograph JOIN dispersedates ON (dispersedates.sname = biograph.sname)
  ORDER BY biograph.sname;

-- But really that is overkill because it turns out that the default
-- result when you subtract one date from another in PostgreSQL is
-- the number of days between the dates.
SELECT biograph.sname, dispersedates.dispersed - biograph.birth
  FROM biograph JOIN dispersedates ON (dispersedates.sname = biograph.sname)
  ORDER BY biograph.sname;

Ignoring problems due to intermittent observational frequency of the groups, what is the per-group sample size, mean, and standard deviation of the number of days a male stays in each group before dispersing?

SELECT biograph.matgrp AS grp,
          count(*) AS samplesize,
          avg(dispersedates.dispersed - biograph.birth) AS mean,
          stddev(dispersedates.dispersed - biograph.birth) AS standarddeviation
 FROM biograph JOIN dispersedates ON (dispersedates.sname = biograph.sname)
 GROUP BY biograph.matgrp
 ORDER BY biograph.matgrp;