[Babase] WeatherHawk design review

Niki Learn nlearn at princeton.edu
Thu Jul 2 17:31:12 EDT 2009


Karl et al.,

Looks mostly good to me.  A few questions/thoughts/answers:

Does anybody know why there is rain data to start off with in April 2004?
Did we input year-to-date data based on our manual rain measurements or is
this another case of pre-existing data showing up when a unit was first
deployed (as happened with the rain data in the new unit)?  The amount of
yearly rain it starts off with seems reasonable given the amount of rain to
that point of time in later years (not counting this year...) but I'm just
checking.

On the rain decimal thing:  The only decimals I am seeing in any rain
columns are for 17/04/2005 and for 01/05/2005 - 19/06/2005, all of which I
have flagged in orange in my Excel sheet because the data in several
columns, including DayRain, HourRain, and MonthRain are inconsistent with
the data for the adjacent days and with the total annual rain column -
really, really inconsistent.  And the decimals are weird things like .01,
not .5.  Not sure what happened but some readings are definitely screwy on
those days (and monthly rain is too through the end of June 2005 since it
kept the screwball number from earlier in the month).  Also, as I understand
it the rain gauge only reads rain in whole mm and does not in fact take a
reading if there is less than 1 mm of rain, a situation which has led us to
wonder how much the weatherhawk is underestimating the amount of rain due to
evaporation of small amounts of rain between rain events.  Therefore there
should not be decimals for rain measurements in any case.

There was no data provided for 2300 on 2005-07-02 - that row should be
deleted.  I was not aware of duplicate date/times occurring.  A quick look
at t29/05/2009 shows me identical data for the two dates.  I will verify
that the others are like the same and then we can delete the extra rows.
Problem solved.

I don't mind getting rid of columns that we don't even know what mean...  We
can still keep the data outside of babase in case we someday figure out what
it means and think it is useful.  Thoughts anyone?
 


Questions/comments on the whawks documentation page:

At the top: "This table is not yet implimented."  This should be
implemented.

I don't know what you mean by this: "The WeatherHawk data dump encodes this
value in two separate places."  Where's the other place besides in the time
column?

On YearRain: "The amount of rain since 12AM January 1st[133] in
millimeters."  Crazy question - wouldn't rain measured at 12am on January
1st actually be the last hour of rain for the previous year since the rain
would have been collected between 11pm and 12am?  Not sure if that's worth
bothering with, especially since the weatherhawk counts it as a January 1st
measurement (though really, at least for rain, it should belong with the day
before).  Feel free to tell me to stop overanalyzing...

Are we sure EV and ET are the same thing?

On Heat_lx_in: "The WeatherHawk data dump names this column "Heat Lx In"."
It's actually Ix - I and l look practically the same in Excel but the I is
shorter - I checked and it is definitely an I.  Ix is probably short for
index but I don't know what these "in" measurements are supposed to mean...

On DayRain: "The amount of rain since midnight[136] in millimeters with 2
decimal points of precision."  The weatherhawk does reset this value at
midnight (see for example Oct 18-19, 2004) but that means it is really rain
since 2300 the day before - this goes back to that problem of the
weatherhawk counting midnight with the am hours for rain even though it is
really measuring rain in the last pm hour.  It might be too much to go
against the weatherhawk to fix but we should be clear in documentation that
the 11pm - 12am hour is included with the day that begins at 12am, no?  Of
course explaining it clearly might be a challenge too...  Same basic comment
for MonthRain.  This might be true for some of these other measures like
heating and cooling too?  But then we're not entirely sure what those are so
who knows?


Niki


-----Original Message-----
From: babase-bounces at eeblistserv.Princeton.EDU
[mailto:babase-bounces at eeblistserv.Princeton.EDU] On Behalf Of Karl O. Pinc
Sent: Wednesday, July 01, 2009 6:21 PM
To: babase at eeblistserv.Princeton.EDU
Subject: [Babase] WeatherHawk design review

Hi,

The WeatherHawk tables are ready for review.  Mostly what has changed from
the babase_pending.whawk table is to change some column names for
consistency with the rest of Babase, or so that the column name actually
reflects what's recorded in the column.  The idea here is to make it easier
to get the data out of the database.
This means that it's a little
harder to get the data into the database because some of the column names as
supplied by the WeatherHawk must be changed
before the final spreadsheet is uploaded.   I figure this is
not too much trouble because the column names need changing anyway because
you really don't want spaces in column names.
The whole thing should be easily scripted with an Excel macro or some such.

The data entry staff was also adding the date as a column to the WeatherHawk
data.  This still needs to be done, and another column, representing the
weather station (the WeatherHawk) itself, also needs to be added.
Like the date column it's a constant
value. (In fact it's more constant than the date because it does not change
for any given weatherhawk.)

Note the table name is WHAWKS instead of the WHAWK as in babase_pending
because (almost) all the table names in Babase are plural.

Please read:

The WStations documentation.
https://papio.biology.duke.edu/babase_system_html/ch06s05.html

This raises some design issues for consideration.

Also read the WHawks documentation.
https://papio.biology.duke.edu/babase_system_html/ch05s06.html

Aside from the questions that may be raised by the documentation I've also
the following concerns:

The YearRain column is an integer, whereas the DayRain column has two
decimal points of precision.  It seems to me that the precision of the
DayRain column (even if the instrument is not accurate enough to use all the
precision) should be carried through to the YearRain column.  I do see data
values to the right of the decimal point in some of the WeatherHawk daily
rain values.  There is no point if the WeatherHawk does not report decimal
points of precision for annual rainfall, does it?

In the babase_pending schema the hourrain and 24hrrain columns are integral,
and again the dayrain column has
2 digits of precision.  In the babase schema I've made the hourrain and
lst24hrrain columns have 2 decimal points of precision, but it'll obviously
not be used if we just copy over the data from the babase_pending.whawk
table.

I don't know if there is data to the right of the decimal point for the
hourrain and 24hrrain columns in the WeatherHawk data dumps.  If there is
then my guess is that it's rounded (always rounded up from .5) when it's
loaded.

As a general rule the database performs a lot better if the data is
integral, where "a lot"
may or may not make any difference in practical terms.  If we can get rid of
the decimal points lets do so, but we surely want to keep them rather than
lose any information.

The following statement could be used to copy the data from babase_pending
to babase, but for the 2 problems noted below:

insert
   into whawks (whwstation, whdaytime, winddir, windspeed
                , windgust, hum_in, humidity, temp_in, temp
                , barom, yearrain, et, battv, solar
                , heat_lx_in, heatindex, dayrain, hourrain
                , lst24hrrain, degheat, degcool, monthrain
                , degheat_mo, degcool_mo)
  select 'BC4', date + time, wind_dir, wind_spd, wind_gust
         , hum_in, humidity, temp_in, temp, raw_barom
         , tot_rain, coalesce(et, ev), battv, solar
         , heat_lx_in, heat_index, dailyrain, hourrain
         , "24hrrain", deg_heat, deg_cool, monthrain
         , degheat_mo, degcool_mo
    from babase_pending.whawk;

The first problem is that there is a row with a date of '2005-07-02' and a
time of '23:00' which has NULLs for all the data values.

The second problem is that there are 703 duplicate dates and times.  The
list of duplicate dates and times are attached.  I've not looked to see if
the data is actually duplicated or just the dates and times.  If they are
for 2 different WeatherHawk units then we will want to use separate weather
station codes.  This will probably involve re-uploading the overlapping data
from the original spreadsheets; that's the only way I can think of to tell
which rows go with which unit.  Otherwise we will want to cleanup and delete
the duplicates.  The easiest way might also involve re-uploading the data.

On another note: we could consider deleting those columns that are
meaningless, but then again just because they're meaningless now does not
mean that we'll never have a use for them.  Again, the database performs "a
lot" better when there's less data.
I'd lean toward getting
rid of at least some of them, but that may mean more work getting the data
ready to load into Babase.  The right way to do it would be to somehow keep
the WeatherHawk from reporting the data in the first place but that may mean
fussing with something that's not broken, which provides a good opportunity
to introduce breakage.

Finally, I don't have indexes on anything but date and time.  These are the
only values likely to be involved in a join.  If there are ever performance
problems creating indexes would speed up sorting and comparisons of other
data values.
We can certainly do this need be.

Karl <kop at meme.com>
Free Software:  "You don't pay back, you pay forward."
                  -- Robert A. Heinlein




More information about the Babase mailing list