[Babase] WeatherHawk design review

Karl O. Pinc kop at meme.com
Tue Jul 7 18:34:09 EDT 2009


On 07/07/2009 09:11:23 AM, Niki Learn wrote:
> It looks like you deleted the duplicates too?

In moving the data from babase_pending.whawk to babase.whawks
(in the babase_copy) database, no.   Normally it would not
allow duplicates but I turned that check off to make it
easier to load the data and find the duplicates.

I can delete the duplicates and turn the check back on if you
like.  The babase_copy database is not normally something
that other people look at so I'm not keeping it really tidy.

>  I am having difficulty
> telling for sure since the dates are somewhat out of order now (at
> least in
> 2008).

The browse function in PPA is rather useless IMO.  It shows you
the rows in the order they happen to be in the database,
which is not the same order they may happen to be in the
database in the future.

If you want a consistent list you have to run an SQL query
that orders the results.

>  How do I get babase to return all results for a single date,
> regardless of time stamp?  I tried just using the date and it only
> gives me
> 00:00:00 (why are there seconds in the time stamp anyway?) and no
> other
> times.  I tried some wildcards and bits of stuff from the SQL you sent
> to
> look up other things but with no luck.

A timestamp,
like WHAWKS.WHdaytime, represents a single point in time.
As you discovered, if you leave the hours, minutes, seconds, etc.
off of a timestamp value, in this case '2008-07-07',
it's the same as writing zeros for those values.
Because it represents one point in time a timestamp has
not only century and seconds, but milliseconds too.
Most of the time you don't see the milliseconds,
but they're there if you need them.  The default
output format does show you the seconds and all
the larger units.

The basic outlines on working with dates and times
are in Appendex A of the Babase docs.
https://papio.biology.duke.edu/babase_system_html/apa.html
(See also the Postgres date and time functions referenced
there:
http://www.postgresql.org/docs/8.1/static/functions-datetime.html
)

To make it even easier I just added the date_mod() function
to babase.
https://papio.biology.duke.edu/babase_system_html/re02.html

You have lots of choices.  Often you want to use the
Postgres date_trunc() function.  (Or something based
on it like date_mod().)

date_trunc('day', whawks.whdaytime) = '2008-07-07'

Generally I'd expect you'd use date_trunc() when
you're joining with another table:

select * from members, whawks
   where members.date = date_trunc('day', whawks.whdaytime)
   order by members.sname, members.date;

(The above will work poorly until I put the
index and rules back regarding unique whwstation/
whdaytime.)

Other times you're more
interested in looking at a range of values.
To look for a whole day's worth you look between
the start and finish of the day.

'2008-07-07' <= whawks.whdaytime and whawks.whdaytime < '2008-07-08'

Wildcards do pattern matches on strings (sequences of
alphanumeric characters), so for the most
part you won't want to do that because the first step
would then be to convert the timestamp to a string
and that's extra work for both you and the computer.

You can also use 'between':

whawks.whdaytime between '2008-07-07' and '2008-07-07 23:59:59'

I've never found the point in using 'between' because it's just
one more thing to have to remember; the thing to remember being
that 'between' is inclusive of endpoints.

>  Are you sure separate dates
> and
> times aren't easier for the end users?

Maybe not.  If we have to we can always make a view.
Or if calling date_trunc('day', whawks.whdaytime)
is too complicated I could make a "day_trunc()" function
so you could instead call day_trunc(whawks.whdaytime).

My general approach is to do things all one way or
all the other.  In this case that means either
having a single timestamp value to record the
appropriate point in time, or to have separate
columns for everything: year, month, day, hour.
(In this case we would not need minutes and seconds,
but if we really wanted to be anal about it we could
opt for a column for century too.)

If you can see the complication in having
separate year, month, and day columns
you can also see why having a separate time
column is also a problem, it makes it
harder to do more complicated things.
In general I think that having a single
timestamp column makes the easy things a little more
complicated and the hard things a lot easier.
(I thought I knew what I was doing and managed
to exclude all the 23:00 hour samples when
I loaded the whawks tables when I wanted to
exclude one single sample on a specific day
and time.  Lumping the day and time together
into a single timestamp and working with that
turned out to be a lot more understandable
than working with an bunch of AND and OR
operations on separate date and time columns.)
However that's just my take on it.  The
design process is all about getting everybody
to think through issues like this and
see everything from all sides.  Sometimes
it can make more sense to get everybody
on the phone to talk things out, and sometimes
not.

>  I couldn't figure out how to
> get it
> to return all results the month of August 2008 either.

A range, just like you'd do if there were a column that
was just a date and not a timestamp.

whawks.whdaytime >= '2008-08-01' and whawks.whdaytime < '2008-09-01'

I like the >= and then < idiom because that way you never
have to remember how many days are in the month (or whatever
the 'last thing' is in the interval you're interested in).
The first of the month, the beginning of the day, etc.,
all those are easy to remember and to read.

> 
> Btw, there is a paper weatherhawk log that includes a column for
> upload to
> babase_pending.  Tabby uploaded everything through September 2008
> before I
> arrived, so I guess we just added October - December 2008.  The log
> says she
> uploaded August and September on 2008-11-07 and that July and August
> were
> uploaded on 2008-09-24.  She would have had all of the August data in
> September.  Maybe she forgot to log that she uploaded August too and
> then
> uploaded it again in November?  That's my best guess anyway unless
> there was
> some kind of freak uploading glitch that caused it to upload twice.

It's not _just_ August 2008 though. There's some 700+ duplicates,
which sounds like more than just one double spreadsheet and
the month of August.  (I forget just how big that double spreadsheet
was, but...)

Anyhow, I don't think we need to think too hard about it.


Karl <kop at meme.com>
Free Software:  "You don't pay back, you pay forward."
                  -- Robert A. Heinlein



More information about the Babase mailing list