"Wrong" Queries
(This is page is new, as of September 2023. If you have feedback/suggestions on how to improve it, please let Jake know.)
When writing queries, you will occasionally make mistakes. That is, the SQL does not do what you intended it to do. For purposes of this discussion, we will call these queries "wrong". Furthermore, there are two major types of mistakes that make them "wrong". With type 1 mistakes, the database will kindly inform you of the mistake by returning an error. However, type 2 mistakes do not elicit an error. Even if the query executes without error, it is still "wrong" if it does not do what you intended it to do. It is incumbent on you as the user to check for type 2 mistakes by perusing the query result for problems, inconsistencies, etc.
In other words, even after a query successfully executes, you need to look at its result to make sure that it's "right". If it's "wrong", you need to figure out how to make it right. Identifying when something is "wrong" and then making it "right" are two very important skills that every user needs to acquire. You will mostly learn to do this through experience, but you've got to start somewhere. Let's look at a hypothetical example.
The greatest baboon ever?
(All SQL shown in this example should be safe for you to execute for yourself. When SQL is written, it is presumed that you will execute it and look at the result before you continue reading.)
Cindy was a new grad student and was eager to learn more about the Amboseli baboons. In a lab meeting, someone casually mentioned that 'Rocky' was "the greatest baboon ever". Someone else called him "beautiful" and mentioned him having a lot of offspring. A third person said something about him being especially close to a female named 'Prudy'. Cindy was as intrigued as she was amused, and decided to take some time to learn more about him by looking at the data.
First, she checked BIOGRAPH to get some general information about him:
select * from biograph where sname='ROC';
Yikes, BIOGRAPH has a lot of columns. She expected to join this to other tables, so for brevity she decided to select just the few columns that were personally interesting to her.
select sname , name , entrydate , entrytype , statdate , status from biograph where sname='ROC';
So, Rocky immigrated (Entrytype I) into the population in 1993. Was he already an adult at this time, or was he a subadult who became an adult later? Adult attainment for males is recorded in RANKDATES, so Cindy modified her query so that it would also get Rocky's "rankdate" and "rstatus":
select sname , name , entrydate , entrytype , statdate , status , ranked , rstatus from biograph, rankdates where sname='ROC';
Oops, this caused an error. The error message says that the "column reference "sname" is ambiguous". BIOGRAPH and RANKDATES both have "sname" columns, so the DB didn't know which "sname" column was being selected, nor which "sname" needed to be "ROC". Cindy realized her mistake and edited the query to remove the ambiguity: she wants the sname from BIOGRAPH. She verified that none of the other columns selected here occur in both tables, then tried again:
select biograph.sname , name , entrydate , entrytype , statdate , status , ranked , rstatus from biograph, rankdates where biograph.sname='ROC';
No errors. Success!
...except that instead of returning one row like it did before, this now returns several hundred. This query didn't return an error, but clearly it is still "wrong". Cindy was confused, and looked closer at the result. The values in all those columns from BIOGRAPH were identical in each row, but the "Ranked" column was different. Cindy noticed that many of these "Ranked" dates were before Rocky's Entrydate or after his Statdate. That doesn't make sense; those values cannot be related to Rocky. That must mean that we're seeing "Ranked" values for other individuals. And that's when she realized her problem: the query was selecting "FROM" BIOGRAPH and RANKDATES, but it does nothing to indicate how to connect the two tables' rows to each other. So she modified the query so that it would do that. And just in case she was still wrong, she added a "LIMIT".
select biograph.sname , name , entrydate , entrytype , statdate , status , ranked , rstatus from biograph, rankdates where biograph.sname='ROC' and biograph.sname = rankdates.sname limit 25;
Back to one row. Hooray!
Cindy remembered some advice she had received about queries that use multiple tables. Instead of listing all the tables in the "FROM" clause, she had been advised to use "JOIN ... ON ...", because the "ON" part of these clauses ensures that you remember to specify how the joined tables should be connected. Having just forgotten to specify that, she decided to slightly rewrite her query so that it followed that advice:
select biograph.sname , name , entrydate , entrytype , statdate , status , ranked , rstatus from biograph join rankdates on rankdates.sname = biograph.sname where biograph.sname='ROC' limit 25;
Cindy wrote this query to see when Rocky became an adult. With these new columns, she could clearly see that he was ranked "BY" the month that he immigrated. So he was already an adult. Good to know.
Next, Cindy remembered the claims that Rocky was beautiful and had a lot of offspring. To date, there are no tables in Babase recording beauty, so Cindy focused her attention on the second part: offspring. Cindy remembered that there's a PARENTS view, and learned from her previous mistake, so when she added PARENTS to her query she knew just what to do:
select biograph.sname , name , entrydate , entrytype , statdate , status , ranked , rstatus , parents.kid , parents.mom , parents.zdate from biograph join rankdates on rankdates.sname = biograph.sname join parents on parents.dad = biograph.sname where biograph.sname='ROC' order by mom, zdate limit 100;
Twenty rows! Is that right? Maybe she made another mistake. The values in the first several columns don't change, and then the last several columns' values DO change. Cindy was sure that something must be wrong.
After poking at it for a little while, she realized that no, everything is fine. When checking a query result for unintended mistakes, lots of duplicated values can indicate a problem, but not always. Rocky really did have twenty offspring (that we know of).
While looking at the list of Rocky's offspring, Cindy noticed that five of them shared the same mother: "PRU". (Also, there are five from "WEN".) This reminded Cindy of that last thing she had heard about Rocky: he seemed to be especially close to Prudy.
Babase does not have an ANTHROPOMORPHIZED_UNSCIENTIFIC_DESCRIPTIONS table (yet?), so there is no direct documentation of a close relationship between the two. Nevertheless, Cindy persisted. She decided to look at interactions between ROC and PRU in ACTOR_ACTEES.
There are over 1000 interactions between Rocky and Prudy...
select count(*) from actor_actees where actor in ('ROC', 'PRU') and actee in ('ROC', 'PRU');
...so Cindy decided to narrow her search by looking at their interactions on the conception date (zdate) of their offspring. From her old query, she removed the columns from BIOGRAPH and RANKDATES, and instead just focused on joining PARENTS to ACTOR_ACTEES. Unsure of what she even wanted to see, she chose to just select all of the columns from ACTOR_ACTEES:
select parents.dad , parents.mom , parents.kid , parents.zdate , actor_actees.* from parents join actor_actees on actor_actees.actor in (parents.mom, parents.dad) and actor_actees.actee in (parents.mom, parents.dad) where parents.dad = 'ROC' and parents.mom = 'PRU' limit 100;
Uh-oh. We used LIMIT 100 and got 100 rows. Maybe there are a lot of interactions. Or maybe there's a problem and there are some unintended duplicate rows. Cindy changed the limit to 1000, and it returned...1000 rows. Once more, she increased the limit to 1200, which is higher than the total number of known interactions between the two of them. This returned...1200 rows. Something is wrong. Cindy looked closer at the result to try and figure out the problem.
Look at the query result for yourself (LIMIT 100 should be more than sufficient). Does anything look amiss to you?
...
Anything?
...
Anything at all?
...
This is some whitespace, added to try and discourage you from looking ahead and reading the answer.
...
What do you see?
...
Okay, that's enough. After looking for a while, Cindy noticed that there were rows in which the "zdate" (from PARENTS) and the "date" (from ACTOR_ACTEES) were not at all close to each other. Sometimes they were multiple years apart. This told her...
...
Hold that thought. What does it tell you? What's wrong? What needs to be fixed?
...
This told her that the JOIN between PARENTS and ACTOR_ACTEES had not done enough--or in this case, had not done anything--to connect the two dates. Cindy only joined "ON" their names! She wanted interactions on zdate, too, so she revised the query to:
select parents.dad , parents.mom , parents.kid , parents.zdate , actor_actees.* from parents join actor_actees on actor_actees.actor in (parents.mom, parents.dad) and actor_actees.actee in (parents.mom, parents.dad) and parents.zdate = actor_actees.date where parents.dad = 'ROC' and parents.mom = 'PRU' limit 100;
This returns 0 rows. [cue sad trombone]
So what's wrong? Nothing, actually. This query is doing exactly what Cindy intended to have it do. There just are not any interactions recorded between these two on their five kids' zdates. (Note: a "zdate" is an estimate of the conception date. It is not necessarily a date that observers visited the group and actually saw the mother.)
Cindy's query isn't "wrong", per se, but if she wants to see interactions likely to be related to the zdate then she needs to "fix" it. Instead of only looking at interactions on the zdate, she decided she should be looking in a range of days centered on the zdate. Specifically, she decided to look at interactions within 5 days of the zdate. That is, anywhere between 5 days before and 5 days after the zdate. With advice from a friend, she was able to pack all of that into a single line:
-- "abs()" returns the absolute value abs(parents.zdate - actor_actees.date) <= 5
...which she added to her "ON" statement:
select parents.dad , parents.mom , parents.kid , parents.zdate , actor_actees.* from parents join actor_actees on actor_actees.actor in (parents.mom, parents.dad) and actor_actees.actee in (parents.mom, parents.dad) and abs(parents.zdate - actor_actees.date) <= 5 where parents.dad = 'ROC' and parents.mom = 'PRU' limit 100;
Less than 100 rows. That's promising, but it might still be wrong. She reran the query, this time removing the LIMIT and instead using "ORDER BY date" so she could more easily look at the results.
This query is indeed "right"; it's doing exactly what Cindy wanted it to. For each offspring, there is at least one day where Rocky consorted Prudy. We also see many groomings between the two of them. Also some agonisms.
Did Cindy find evidence of a "special relationship" between Rocky and Prudy? Probably not. But she did learn an important lesson: baboon project members from the 90's and early 00's love Rocky.
Things to remember
When writing or bugfixing queries, there are a few useful principles to remember:
When first learning SQL, it is tempting to avoid learning how to use "JOIN". You can join tables using only the "FROM" and "WHERE" clauses. But as Cindy learned, using "JOIN" can help you make sure you joined the tables correctly. More importantly--and not addressed at all in Cindy's story--there are some things you can only do when using JOIN.
Problems in the "SELECT" clause are usually pretty obvious, because that clause determines what you actually see. For example, think of how Cindy selected all of the columns from BIOGRAPH, but immediately changed her query to only select some of them because it was obvious to her that she didn't want all those columns.
- If you're having trouble figuring out what's wrong with your query, the problem is probably somewhere in the "FROM ... JOIN ... ON ... WHERE ..." clauses. Unintended issues can easily happen behind the scenes without you ever seeing any evidence of it. You may have noticed that that is where nearly all of Cindy's problems were.
Most of the time, problems in your query are because of something that's missing from your SQL. None of Cindy's problems were fixed by removing text from her queries.
When looking at a query result, having lots of duplicate values often but not necessarily always indicates that something is wrong. Which tables' columns values are duplicated and which ones are not can be very helpful when identifying which table(s) have been insufficiently joined with the rest. For example, think of how Cindy identified her problem when she tried joining BIOGRAPH and RANKDATES (after she fixed the "ambiguous" error).
Resist the urge to try and fix a problem by adding "DISTINCT" to your SELECT clause. Duplicate rows in your query result are evidence of a problem; they are not the problem.
- There are three column types that tend to appear in a lot of Babase tables and views: an "sname", a "date", and a "grp". Those may not be the columns' actual names--in RANKS, for instance, the "date" column is "Rnkdate"--but the purpose of the column should be fairly clear no matter what it is actually called. When joining tables that contain these columns, you will usually need to account for at least two but sometimes all of those columns in your "ON" clause. It is VERY common for users to only join "ON" one of them and forget the other(s). For example, when Cindy joined ACTOR_ACTEES and PARENTS, she accounted for the "sname" columns (mom, dad, actor, actee) but forgot to account for the "date" columns (zdate, date).