Copyright (C) 2004, 2005, 2008, 2010, 2011 Karl O. Pinc
This file is part of Babase.
Babase is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with Babase. If not, see <http://www.gnu.org/licenses/>.
Karl O. Pinc
Design decisions regarding the database and its construction toolkit.
Tools:
PostgreSQL: database
PostgreSQL was chosen because it supports triggers and full referential
integrity. As of this writing MySQL has only begun to gain similar
support.
PL/PgSQL: database extension language
Comes with PostgreSQL. Most of the headaches involved in data
validation have to do with the timewise sequencing of trigger
invocations, which is not something that can be solved by the choice
of language. We may as well use PL/PgSQL; it's almost SQL.
Postgis: Extends PostgreSQL with spatial data types.
M4 (and debugging with m4): macro pre-processor
Many of the programs used to construct the database are produced with
the m4 macro toolkit. SQL databases in general, and PostgreSQL in
particular, are not designed so that a table can be abstracted out
of the SQL and a variable used in its place. There are some
work-arounds, generally less efficient, but why go there when plain old
textual substitution does the trick quite nicely? Having started down
the m4 path it only makes sense to rely on m4 as the primary means of
abstraction. So, for instance, the database's constants are defined
using m4 rather than in a block of PL/PgSQL code.
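For instance, a constant can be defined once and substituted into the
SQL text wherever it is named. The macro name, value, and query below
are hypothetical, a sketch of the approach rather than actual Babase
source:

```m4
changequote({,})dnl  use {} as m4's quotes so ' can appear freely
define({CYCLE_GAP}, {25})dnl  hypothetical constant, in days
-- In the expanded output the macro name below becomes 25:
SELECT * FROM cycpoints WHERE date > CURRENT_DATE - CYCLE_GAP;
changequote()dnl  restore the default m4 quote characters
```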
The downside to m4 is that the line numbers in the m4 source have no
relationship to the line numbers reported for runtime errors. The
obvious solution is to expand the m4 source and work with the line
numbers within the resulting output when debugging. This does not
work very well, as there is always some text in the m4 output
preceding the actual executable code. If nothing else there will be
a bunch of SQL function declaration lines. PostgreSQL currently
reports errors on line numbers in PL/PgSQL functions by counting the
line containing the opening single quote as 0. Working in Emacs, I
find it easiest to use command repetition and the go-down-a-line
command to find the line containing the error.
PostgreSQL also reports errors by character number. (These seem to
come from the parser.) Character counts start with the first
character of the statement, the C in CREATE FUNCTION, the S in SELECT,
etc. Again, using command repetition and the forward-character
command works well in locating the error.
It may be that dumping the code of the function out of the database
and working with that will yield better results, but my limited
experience has not proven this out. In any case this has not yet been
a big problem so I'm not worrying about it.
A few data types are declared in m4 as PostgreSQL 7.3 has some
limitations in the way data types are abstracted. These are declared
in the include directory.
To expand an m4 file, use the m4 command. You will often need to
supply multiple -I arguments so that the correct include files are
found. Typically you will need a path to the "include" directory,
and often to the directory containing the file you are expanding as
well.
The m4 documentation is in a Texinfo file and must be read with the
info command. There are no man pages.
Most if not all m4 macros are surrounded with changequote() calls
that change the opening and closing quote characters to { and }, and
then back. This is so that the regular single quote character can be
used inside the macro.
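The idiom looks like this; the macro name and the SQL inside it are
hypothetical:

```m4
changequote({,})dnl  now ' is just an ordinary character
define({delete_interpolated}, {
  DELETE FROM members WHERE origin = 'interpolated';
})dnl
changequote()dnl  back to the m4 defaults
```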
Dia: diagram editor
It plays nicely with XML and so stores well as text in revision
archives. Besides, other than Visio it's the only game in town,
and we can't run Visio on Linux, and who knows whether it's scriptable
enough to work with Makefiles.
Make: determine which pieces are used to construct which pieces
Gotta have Make. GNU Make is what comes with the system, so it is what
we use.
sh, psql, and the Posix/Unix environment: supports make
There are also a couple of utilities I quickly whacked together.
Apache: the webserver
The world's leading webserver. Any webserver which supports PHP should
work.
PHP: the web programming language
Kinda sticky around the edges, this popular programming language is
easy to work with.
DocBook: technical documentation support
Handy dandy. Input files are text/xml. Output files are anything.
Produces good output. Standard, and an open format.
ImageMagick: technical diagram image manipulation
Dia does not have good output generation yet, and DocBook is immature
in its input diagram formats, so this bridges the gap.
Openjade: DocBook xml validation
Seems the standard DocBook toolset.
Xmlto: apply DocBook xml dtds to generate readable documents
Seems to be the new tool of choice for converting from DocBook.
PassiveTeX: a DocBook to PDF tool
Lame. A bad choice. Chosen because it takes XML (not just SGML) as
input.
xsltproc: an xml styler and munger
Fast and popular. Used both for styling DocBook output and for
sucking the view definitions out of the documentation and producing
the code that creates the views.
awk: A text processing language.
Pervasive, on Unix at least.
Darcs: revision control and source code management
Started with CVS, which is clunky; Subversion is not much better.
Used GNU Arch for a while but was never satisfied, and its developers
replaced it with Bazaar. Darcs is clean and simple and works for what
we need to do.
Design notes:
Triggers:
All referential integrity is done in the triggers. This allows any
program whatsoever to be used to update the database, and users can
make whatever ad-hoc updates are desired, for the most part without
fear of database corruption. There are some checks which cannot be
done in the triggers, as well as some unusual data allowed in the
database in exceptional circumstances which one would not normally
expect to see. These conditions are tested for by the check_integrity
script.
All updates are expected to take place inside transactions. When the
triggers fail they roll back the current transaction.
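As a sketch of the pattern, a validation trigger that aborts the
enclosing transaction looks something like the following. The check
shown here is hypothetical; the real Babase triggers do much more.
The function body is written in the old opening-single-quote style
described above, which is why embedded single quotes must be doubled:

```sql
-- Hypothetical validation trigger on BIOGRAPH.  RAISE EXCEPTION
-- aborts the failed statement and rolls back the enclosing
-- transaction.
CREATE FUNCTION biograph_check() RETURNS trigger AS '
BEGIN
  IF NEW.statdate < NEW.birth THEN
    RAISE EXCEPTION ''BIOGRAPH: Statdate precedes Birth'';
  END IF;
  RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER biograph_check_trigger
  BEFORE INSERT OR UPDATE ON biograph
  FOR EACH ROW EXECUTE PROCEDURE biograph_check();
```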
Triggers are not presently highly optimized. For example, the triggers
delete and recreate all of an individual's REPSTATS and CYCSTATS
whenever anything changes, rather than identifying just those rows that
need adjustment. (Except that they don't currently update either
REPSTATS or CYCSTATS, but they do perform a lot of other data creation.)
Interpolation:
The interpolation routines are called from the CENSUS triggers. They
update MEMBERS and change the Statdate on BIOGRAPH. When they change
the Statdate they do not interpolate the 'corner case' of
interpolation to the Statdate. When the interpolation routines do not
change the Statdate, they will interpolate over the Statdate corner
case interval. (And the interpolation routines always interpolate
over the Birth interval corner case.) The BIOGRAPH triggers look for
changes in Birth or Statdate and _also_ interpolate the corner case
intervals, creating the appropriate rows in MEMBERS over those
intervals. (Fill in function names when the functions are written.)
Sequence numbers:
Some tables have sequence number columns (CYCPOINTS.Seq, CYCPOINTS.Series,
PREGS.Parity). These are automatically generated using the following
technique:
--
-- The triggers actually update cycpoints. This is the strategy:
--
-- A BEFORE BY ROW INSERT/UPDATE trigger:
-- Computes the appropriate values for
-- Cid, Seq, and Series. If they differ from non-NULL supplied
-- values, throws an exception.
--
-- Constructs and inserts a Mdate row from a Tdate row, if necessary.
--
-- An AFTER BY ROW INSERT/UPDATE trigger:
-- Checks Seq of current row, detects duplicates in other rows
-- and, if found, sets Seq >= MIN(NEW.Seq, OLD.Seq) + 1 to
-- the correct value (not NULL, to avoid infinite loop on error)
-- (This presumes the correct value will never be a duplicate,
-- so be sure there's first a duplicate date check or suchlike.)
-- for processing by the BEFORE BY ROW trigger.
-- Note that this also sets Seq of other CYCPOINTS in the
-- same cycle (same Cid) to the correct value.
-- Updates REPSTATS and CYCSTATS. (When date or Code changes)
--
-- An AFTER BY ROW DELETE trigger:
-- Sets Seq of all rows >= the earliest gap to NULL.
--
-- An AFTER BY STATEMENT INSERT/UPDATE/DELETE trigger:
-- Checks Seq for duplicates/gaps.
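A minimal sketch of the Seq computation in the BEFORE trigger,
assuming sequence numbers simply count cycle points within a cycle;
this is hypothetical and simplified, and the real trigger also manages
Cid, Series, and Mdate construction:

```sql
-- Hypothetical, simplified version of the BEFORE BY ROW trigger.
CREATE FUNCTION cycpoints_seq() RETURNS trigger AS '
DECLARE
  next_seq INTEGER;
BEGIN
  -- Number cycle points sequentially within the cycle (same Cid).
  SELECT COALESCE(MAX(seq), 0) + 1 INTO next_seq
    FROM cycpoints WHERE cid = NEW.cid;
  IF NEW.seq IS NOT NULL AND NEW.seq <> next_seq THEN
    RAISE EXCEPTION ''CYCPOINTS: supplied Seq (%) should be %'',
                    NEW.seq, next_seq;
  END IF;
  NEW.seq := next_seq;
  RETURN NEW;
END;
' LANGUAGE plpgsql;
```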