WritingAComplexQuery - Babase Wiki

Task: Write a query to count the number of ranked males per month in each group.

Tip: First get the rows you are interested in, then analyze them.

Start by writing a query to get males.

SELECT biograph.sname, biograph.sex
  FROM biograph
  WHERE biograph.sex = 'M'
  ORDER BY biograph.sname;

We know we're going to need the group the male is in, as well as something about the date the male is in the group. This means joining in the MEMBERS table, and that means that we'll suddenly get gobs and gobs of rows in the result.

SELECT members.grp, biograph.sname, members.date
  FROM biograph
       JOIN members ON (members.sname = biograph.sname)
  WHERE biograph.sex = 'M'
  ORDER BY members.grp, biograph.sname, members.date
  LIMIT 100;

We immediately note that using LIMIT 100 does not help much when we order the results because the database first produces all the results so that it can order them before returning the first 100. Being, uh, brave, we ignore the problem.

It's a bit hard to see what these results mean, so let's do _something_ to tell how old the individual is on the date shown. The birthdate will give us a clue.

SELECT members.grp, biograph.sname, biograph.birth, members.date
  FROM biograph
       JOIN members ON (members.sname = biograph.sname)
  WHERE biograph.sex = 'M'
  ORDER BY members.grp, biograph.sname, members.date
  LIMIT 100;

Clearly the individuals are not old enough to rank. Let's get rid of all the rows where the individual is not ranked. RANKDATES tell us when the individual attained rank.

SELECT members.grp, biograph.sname, biograph.birth, rankdates.ranked, members.date
  FROM biograph
       JOIN members ON (members.sname = biograph.sname)
       JOIN rankdates ON (rankdates.sname = biograph.sname)
  WHERE biograph.sex = 'M'
        AND rankdates.ranked <= members.date
  ORDER BY members.grp, biograph.sname, members.date
  LIMIT 100;

That's pretty good. But we want anybody who was ranked at any point during the month. Having read the PostgreSQL documentation we vaguely remember there's a way to "truncate" a date, ignore the days and get just the year and month, or ignore the days and month and get just the year, etc. We could use this on RANKDATES.Ranked to get the first of the month so we are sure to get all the MEMBERS rows for the month where the individual was first ranked. Filled with passion, we go read the PostgreSQL manual and find the function we want -- DATE_TRUNC.

SELECT members.grp, biograph.sname, biograph.birth, rankdates.ranked, members.date
  FROM biograph
       JOIN members ON (members.sname = biograph.sname)
       JOIN rankdates ON (rankdates.sname = biograph.sname)
  WHERE biograph.sex = 'M'
        AND DATE_TRUNC('month', rankdates.ranked) <= members.date
  ORDER BY members.grp, biograph.sname, members.date
  LIMIT 100;

But then we, having also read the Babase Technical Documentation, remember that RANKDATES.Ranked is always the first of the month anyway. Oh well.

So we're doing pretty good. Let's alter the columns returned so as to just get what we're interested in.

SELECT members.grp, biograph.sname, members.date
  FROM biograph
       JOIN members ON (members.sname = biograph.sname)
       JOIN rankdates ON (rankdates.sname = biograph.sname)
  WHERE biograph.sex = 'M'
        AND rankdates.ranked <= members.date
  ORDER BY members.grp, biograph.sname, members.date
  LIMIT 100;

But we're not really interested in the date the male was in the group. We want to know the month the individual was in the group. At last! We find a use for the DATE_TRUNC function.

SELECT members.grp, biograph.sname, DATE_TRUNC('month', members.date)
  FROM biograph
       JOIN members ON (members.sname = biograph.sname)
       JOIN rankdates ON (rankdates.sname = biograph.sname)
  WHERE biograph.sex = 'M'
        AND rankdates.ranked <= members.date
  ORDER BY members.grp, biograph.sname, members.date
  LIMIT 100;

So far so good, but our aesthetics are offended. DATE_TRUNC returns not a nice elegant date but a timestamp, which has ugly things in it like seconds and a timezone. We want to change the data type into a date. Being lazy, we don't use the CAST syntax that is standard in SQL because it requires a whole gob of typing. Instead we use the PostgreSQL :: extension to convert the timestamp into a date.

SELECT members.grp, biograph.sname, DATE_TRUNC('month', members.date)::DATE
  FROM biograph
       JOIN members ON (members.sname = biograph.sname)
       JOIN rankdates ON (rankdates.sname = biograph.sname)
  WHERE biograph.sex = 'M'
        AND rankdates.ranked <= members.date
  ORDER BY members.grp, biograph.sname, members.date
  LIMIT 100;

Flush with adrenaline from our success it's clear to see that we now have duplicate rows that we don't want. We have dutifully read the PostgreSQL reference manual page on the SELECT statement, and it made our eyes cross. Still, there was some mention of a DISTINCT keyword. Let's give it a try. What's the worst that can happen?

SELECT DISTINCT
          members.grp, 
          biograph.sname, 
          DATE_TRUNC('month', members.date)::DATE
  FROM biograph
       JOIN members ON (members.sname = biograph.sname)
       JOIN rankdates ON (rankdates.sname = biograph.sname)
  WHERE biograph.sex = 'M'
        AND rankdates.ranked <= members.date
  ORDER BY members.grp, biograph.sname, members.date
  LIMIT 100;

Ok. It didn't work. But the computer didn't catch on fire either. And we got a useful error message telling us we can only order the results using the columns that the select returns -- the "select list".

Try again.

SELECT DISTINCT
       members.grp, 
       biograph.sname, 
       DATE_TRUNC('month', members.date)::DATE
  FROM biograph
       JOIN members ON (members.sname = biograph.sname)
       JOIN rankdates ON (rankdates.sname = biograph.sname)
  WHERE biograph.sex = 'M'
        AND rankdates.ranked <= members.date
  ORDER BY members.grp, biograph.sname, DATE_TRUNC('month', members.date)::DATE
  LIMIT 100;

Gosh. We now have the rows we want to analyze.

Give the columns nice names.

SELECT DISTINCT
       members.grp AS grp, 
       biograph.sname AS sname, 
       DATE_TRUNC('month', members.date)::DATE AS date
  FROM biograph
       JOIN members ON (members.sname = biograph.sname)
       JOIN rankdates ON (rankdates.sname = biograph.sname)
  WHERE biograph.sex = 'M'
        AND rankdates.ranked <= members.date
  ORDER BY members.grp, biograph.sname, DATE_TRUNC('month', members.date)::DATE
  LIMIT 100;

Now that we have the rows that we want how do we analyze them? One of the things that you read, and then forgot, was that anywhere in a query that a column or a table appears, a sub-query can appear instead. (In place of columns you can use any sort of expression at all.) So, to analyze the results obtained so far query them as if they were a table.

SELECT malegroupmonth.grp, malegroupmonth.date, malegroupmonth.sname
  FROM (SELECT DISTINCT
               members.grp AS grp, 
               biograph.sname AS sname, 
               DATE_TRUNC('month', members.date)::DATE AS date
          FROM biograph
               JOIN members ON (members.sname = biograph.sname)
               JOIN rankdates ON (rankdates.sname = biograph.sname)
          WHERE biograph.sex = 'M'
                AND rankdates.ranked <= members.date
          ORDER BY members.grp, biograph.sname, DATE_TRUNC('month', members.date)::DATE
       ) AS malegroupmonth
  LIMIT 100;

This does not seem to have accomplished much. In fact it seems silly. Get rid of some of the sillyness by ordering our final results instead of the underlying data.

SELECT malegroupmonth.grp, malegroupmonth.date, malegroupmonth.sname
  FROM (SELECT DISTINCT
               members.grp AS grp, 
               biograph.sname AS sname, 
               DATE_TRUNC('month', members.date)::DATE AS date
          FROM biograph
               JOIN members ON (members.sname = biograph.sname)
               JOIN rankdates ON (rankdates.sname = biograph.sname)
          WHERE biograph.sex = 'M'
                AND rankdates.ranked <= members.date
       ) AS malegroupmonth
  ORDER BY malegroupmonth.grp, malegroupmonth.date
  LIMIT 100;

This is hardly better, but we are finally in a position where we can aggregate and get results per-month.

SELECT malegroupmonth.grp, malegroupmonth.date, COUNT(*)
  FROM (SELECT DISTINCT
               members.grp AS grp, 
               biograph.sname AS sname, 
               DATE_TRUNC('month', members.date)::DATE AS date
          FROM biograph
               JOIN members ON (members.sname = biograph.sname)
               JOIN rankdates ON (rankdates.sname = biograph.sname)
          WHERE biograph.sex = 'M'
                AND rankdates.ranked <= members.date
       ) AS malegroupmonth
  GROUP BY malegroupmonth.grp, malegroupmonth.date
  ORDER BY malegroupmonth.grp, malegroupmonth.date
  LIMIT 100;

This seems to be close to what we want. Get rid of that useless LIMIT clause.

SELECT malegroupmonth.grp, malegroupmonth.date, COUNT(*)
  FROM (SELECT DISTINCT
               members.grp AS grp, 
               biograph.sname AS sname, 
               DATE_TRUNC('month', members.date)::DATE AS date
          FROM biograph
               JOIN members ON (members.sname = biograph.sname)
               JOIN rankdates ON (rankdates.sname = biograph.sname)
          WHERE biograph.sex = 'M'
                AND rankdates.ranked <= members.date
       ) AS malegroupmonth
  GROUP BY malegroupmonth.grp, malegroupmonth.date
  ORDER BY malegroupmonth.grp, malegroupmonth.date;

At last we seem to be done -- NOT. A quick look at the results show obvious problems. First there's the problem of the time periods when the data is good -- censuses are regular and so forth. That's easily solved with additional WHERE restrictions that exclude those time periods where the data is bad or incomplete. It makes the most sense to do these in the sub-query because there's no point in generating a lot of rows that the outer query is going to ignore anyway.

Second there's the problem of group fissions. Depending on the data entry protocol individuals may be censused in sub-groups that are either nascent and have not yet become full-fledged groups or which never become real groups at all. The Babase supergroup() function exists to resolve these sorts of problems.

At some point group 9 will raise it's ugly head. It makes little sense to analyze group 9 in this fashion.

And so it goes. The construction of a really useful query is left to the reader. Don't forget to add comments to your SQL, preferably as you write it, so that you can come back later and make sense of the final result.