[Babase] Interpolation documentation for review
Karl O. Pinc
babase@www.eco.princeton.edu
Fri, 29 Jul 2005 09:03:34 +0000
At long last there's documentation on interpolation,
as well as the related tables. I only wish I had this
while writing the code.
Please review for accuracy, comprehensibility, style,
and anything else. I'm sure there are tortured sentences
in there screaming to be helped. I particularly want
feedback on interpolation, before I leave it behind.
The documentation has got to be comprehensible to
people or else they won't know how to use the system
so please comment if there are problems.
So that people have something to notate, I've appended
the "thin" format of the text version. This should
allow people to comment directly in an email reply.
Or maybe we want a conference call after everybody's
digested it. (The adventurous can notate
the xml source directly I suppose.) The xml (and everything
else) is at http://papio.biology.duke.edu. It's probably
best to go to the
web version and do the reading there as it'll be
both pretty and hyperlinked. You could also glance
at the PDF version on the web site. The PDF has a
strange formatting issue with the interpolation diagrams
and with some of the tables, so you don't need to tell
me about that. The problems with the diagrams in the
PDF mean that it's probably better to read something
else.
The sections I want review of are, in order:
BIOGRAPH (Baboon Biographical Data)
Column Descriptions
https://papio.biology.duke.edu/babase_system_html/ar01s05.html
MEMBERS (Group Membership)
Column Descriptions
https://papio.biology.duke.edu/babase_system_html/ar01s14.html
CENSUS
Column Descriptions
https://papio.biology.duke.edu/babase_system_html/ar01s15.html
DEMOG (Demography Notes)
Column Descriptions
https://papio.biology.duke.edu/babase_system_html/ar01s16.html
Interpolation
Interpolation's 3 Fundamentals
Interpolation Visualized
The Interpolation Rules
Expectations and Implications
https://papio.biology.duke.edu/babase_system_html/ar01s24.html
A. Changes to Babase between 1.0 and 2.0
Changes to .Statdate
Changes To Interpolation and MEMBERS
Changes To The Sexual Cycle Information
https://papio.biology.duke.edu/babase_system_html/apa.html
Note that in the appended text all the footnotes are at the
bottom.
Thanks.
Karl <kop@meme.com>
Free Software: "You don't pay back, you pay forward."
-- Robert A. Heinlein
----------------------------------------------------------
Babase:
Technical Specifications for the Amboseli Baboon Project Data
Management System
Karl O. Pinc
The Meme Factory, Inc.
Jeanne Altmann, PhD.
Princeton University
Susan C. Alberts, PhD.
Duke University
ER Diagram layout and conversion to Dia: Leah Gerber
Docbook formatting: Anne Ndeti Hubbard, Karl O. Pinc
Copyright (c) 2005 Karl O. Pinc, Jeanne Altmann, Susan
Alberts, Leah Gerber, The Meme Factory, Inc.
Permission is granted to copy, distribute and/or modify
this document under the terms of the GNU Free
Documentation License, Version 1.2 or any later version
published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no
Back-Cover Texts. A copy of the license is included in
the section entitled "GNU Free Documentation License."
March 2, 2005
+---------------------------------------------------------+
| Revision History |
|---------------------------------------------------------|
| Revision 0.0 | March 2, 2004 |
|---------------------------------------------------------|
| Initial document |
+---------------------------------------------------------+
-------------------------------------------------------
Table of Contents
Introduction
This Document
System Designs
To Start BABASE
Data organization
Databases
Users, Groups and Database Permissions
Schemas
Organization of the Babase Program Code
Data Relationships
The Master Tables
GROUPS (Groups)
Data Entry Rules
Data Element Descriptions
BIOGRAPH (Baboon Biographical Data)
Column Descriptions
MATUREDATES (Sexual Maturity Dates)
Matured
Mstatus (Sexual Maturity Status)
RANKDATES (Adult Rank Attainment Dates)
Ranked
CONSORTDATES (First Consortship Dates)
Consorted
DISPERSEDATES (Dispersal Dates)
Dispersed
PREGS (Pregnancies)
Data Entry Rules
Data Element Descriptions
CYCGAPS
Data Entry Rules
Data Element Descriptions
CYCPOINTS
Data Entry Rules
Data Element Descriptions
SEXSKINS (Sexskin Turgescence Measurements)
Data Entry Rules
Data Element Descriptions
MEMBERS (Group Membership)
Column Descriptions
CENSUS
Column Descriptions
DEMOG (Demography Notes)
Column Descriptions
RANKS (Rankings Within Groups)
Data Entry Rules
Data Element Descriptions
INTERACT (Interactions)
Data Entry Rules
Data Element Descriptions
PARTS (Participants in interactions)
Data Entry Rules
Data Element Descriptions
Sname
Role
Iid
REPSTATS
Data Entry Rules
Data Element Descriptions
CYCSTATS
Data Entry Rules
Data Element Descriptions
THE SUPPORT TABLES
BSTATUSES
STATUSES
DCAUSES
MSTATUSES
WSTATIONS
ACTS
RNKTYPES
Interpolation
Interpolation's 3 Fundamentals
Interpolation Visualized
The Interpolation Rules
Expectations and Implications
Data Entry
Automatically Generated IDs
The Dataset Tables
Datasets Containing INTERACT and PARTS Data
Datasets Containing CENSUS Data
Datasets Containing DEMOG Data
Datasets Containing CYCLES Data
BABASE PROGRAMS
Data Maintenance Programs
Useful Programs and Functions
A. Changes to Babase between 1.0 and 2.0
Changes to .Statdate
Changes To Interpolation and MEMBERS
Changes To The Sexual Cycle Information
B. Docbook, Styling and other issues
---------------------------<snip>------------------------
BIOGRAPH (Baboon Biographical Data)
This table records the basic biographical data on the
baboons. It contains one row for each baboon, including
aborted fetuses and fetal deaths (collectively, fetal
losses), on which any data have been collected. All
individuals with an Sname, i.e., those which aren't fetal
losses, should have a Name and should have rows on MEMBERS.
Those rows that record data on fetal losses should maintain
the following relations between their data values: the
Sname and Name values should be NULL; the Statdate should
be the same as the birth date (Birth); the Status should be
1 (definitely dead); and the Dcause should be 7 (unknown)
or 5 (loss of mother). Jeanne needs to confirm that this is
still the case since her changes to DCAUSES. Because the
fetal losses have no Sname, there will not be any record of
their group membership in MEMBERS. The Statdate value
should not be less than the Birth value. Live animals
should not have a recorded cause of death. Live animals
that have no associated CENSUS rows (absences excepted)
must have a Statdate equal to their Birth date.
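The consistency rules above lend themselves to a mechanical check. The following Python is a minimal sketch, not Babase code. It assumes a BIOGRAPH row is available as a dict keyed by column name, with None standing for NULL; the function name biograph_problems is made up for illustration. The rule that live animals have no recorded cause of death is left out because this section does not say how "no cause" is encoded in Dcause.

```python
from datetime import date


def biograph_problems(row):
    """Return a list of rule violations for one BIOGRAPH row (a dict).

    Hypothetical helper: column names follow this document, None means NULL.
    """
    problems = []
    if row["Statdate"] < row["Birth"]:
        problems.append("Statdate is before Birth")
    if row["Sname"] is None:
        # A row without an Sname records a fetal loss.
        if row["Name"] is not None:
            problems.append("fetal loss has a Name")
        if row["Statdate"] != row["Birth"]:
            problems.append("fetal loss Statdate differs from Birth")
        if row["Status"] != 1:
            problems.append("fetal loss Status is not 1")
        if row["Dcause"] not in (5, 7):
            problems.append("fetal loss Dcause is not 5 or 7")
    elif row["Name"] is None:
        problems.append("named individual has no Name")
    return problems


fetal_loss = {"Sname": None, "Name": None, "Birth": date(2005, 3, 2),
              "Statdate": date(2005, 3, 2), "Status": 1, "Dcause": 7}
print(biograph_problems(fetal_loss))   # an empty list: no violations
```

An empty list means the row satisfies the rules checked here; each string names one rule the row breaks.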
Column Descriptions
Sname
The short name of the individual. This is an abbreviation of
the name, exactly three characters long, which is used to identify
the individual and so should be a unique data value. This
value appears in many other places in the system and so
should not be changed without changing all the other places
in the database where the abbreviation appears; really,
once established, the only reason to change this column is
because the short name had already been used.^[5] The Sname
is always composed of capital letters (and may not contain
a space). This column should only be NULL if the row
represents an aborted fetus.
Name
The name of the individual. This is a textual column used
for descriptive purposes. This value should be unique when
a comparison is done in a case insensitive fashion. This
column should only be NULL if the row records an aborted
fetus.
Pid
The Pid value, from the PREGS (Pregnancies) table, of the
individual's mother's pregnancy that ended in the
birth^[6] of the individual. This column may be NULL when
there is no record of the individual's mother.
Birth
The date the pregnancy ends. If the pregnancy results in a
birth, this date is the birth date of the offspring,
otherwise, this is the date of the fetal loss. (A pregnancy
that ends with the mother's death is considered as a
spontaneous abortion for this purpose.)
This column may not be NULL.
Bstatus
Birthday status. This column records the quality of the
birth date estimate. The legal values for this column are
defined by the BSTATUSES support table.
Tip
At the time of this writing the legal values are:
The BSTATUSES Table
Code Description
0 Known exactly (to within several weeks, usually to
within a few days)
1 Estimate good to within 1 year
2 Estimate good to within 2 years
3 Estimate good to within 3 years
4 Estimate good to within 4 years
9 Unknown, i.e. these dates are guesses and should not
be used
I don't think it's a particularly good idea to show support
table values in this document. The procedure manual is the
place for that. The whole point of support tables is that
you can put anything you like in them.
This column may not be NULL.
Sex
The sex of the individual. The legal values are:
Valid Sex Values
Code Description
M the individual is male
F the individual is female
U the individual is of unknown sex
This column may not be NULL.
Matgrp
The maternal group of the individual, the Gid of the
sub-group into which the individual was born.
This column must contain a Gid value of a row on the GROUPS
table. This column may not be NULL.
Tip
If the maternal group is not known, the maternal group
should be recorded as the unknown group.
Statdate
The status date of the individual. When the individual is
alive, this is the latest date on which the animal was
censused and found in a group^[7]; absences don't count.
When there are no such censuses, and the individual is
alive, then the Statdate is the birth date. This column is
automatically updated when CENSUS is updated to ensure that
these relationships remain true. When the individual is not
alive the Statdate is the date of death, disappearance,
etc.
Caution
Living individuals, unlike dead ones, can have MEMBERS rows
created by the interpolation procedure that locate the
individual in a group on a date later than the individual's
Statdate. For further information see: Interpolation At The
Statdate.
Statdate (almost, given the preceding caveat) provides a
convenient way of determining the end of the time interval
during which there is data on an individual, a way that is
independent of whether the individual is alive or dead.
This column may not be NULL.
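The Statdate rule for a living individual can be restated in a few lines of Python. This is only an illustration of the rule as worded above, not the code Babase runs; the function name live_statdate and the (date, status) pair layout are assumptions made for the example.

```python
from datetime import date


def live_statdate(birth, census_rows):
    """Statdate for a living individual.

    census_rows is an iterable of (date, status) pairs for that individual.
    A status of "A" marks an absence, which does not count.
    """
    located = [day for day, status in census_rows if status != "A"]
    # With no locating census at all, the Statdate is the birth date.
    return max(located, default=birth)


rows = [(date(2005, 7, 1), "C"), (date(2005, 7, 8), "A")]
print(live_statdate(date(2005, 6, 1), rows))   # 2005-07-01
```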
Status
The state of the individual's life at the Statdate. The
legal values for this column are defined by the STATUSES
support table.
Tip
At the time of this writing the legal values are:
The STATUSES Table
Code Description
0 alive
1 known death
2 suspected death
This column may not be NULL.
Dcause
The cause of death or circumstances associated with death.
The legal values for this column are defined by the DCAUSES
support table.
Tip
At the time of this writing the legal values are:
The DCAUSES Table
Code Description
1 predation
2 conspecific
3 other wounds or injuries
4 Pathology or congenital problem
5 loss of mother
6 human action
7 unknown
8 under review
Tip
A value of 5 should only be present for individuals whose
mother has died or disappeared at the same time or shortly
before said individual.
This column may not be NULL.
---------------------------<snip>------------------------
MEMBERS (Group Membership)
The group membership table. This table records which group
each animal is in on which date, excepting fetal losses
(individuals with no Sname). There is a row in MEMBERS for
every individual for every day between Birth and Statdate,
inclusive, including periods during which the whereabouts
of an individual are unknown or assumed unknown. (See: the
unknown group.) Some living individuals have MEMBERS rows
after their Statdate, for more information see the section:
Interpolation At The Statdate. MEMBERS is most useful when
one is interested in an individual's location on a
particular date. Simply check MEMBERS for the individual on
that date. To find all the individuals in a group on a
date, look at all the rows in the table on that date for
the group.
MEMBERS is a single population-wide table created and
updated automatically using information from CENSUS,
BIOGRAPH, and DEMOG. The method used to do this is called
interpolation and is described fully in a section below.
Briefly, interpolation guesses which group an individual is
likely to be in when there is no observational data. The
MEMBERS rows which are the result of guessing have an I as
their Origin value.
Note
Babase requires that an animal be located in exactly one
group on any particular day; the combination of Sname and
Date should be unique. The intent of this table is to
record the location of each animal at the start of each
day. See other documents for further information on how the
actual practice of data acquisition and entry impacts this
goal.
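The two lookups described above, and the one-group-per-day rule, can be tried out against a toy table. The sketch below uses Python's sqlite3 module purely as a stand-in for the real database; the table layout follows the column names in this section, while the sample rows, the ISO date strings and the helper names group_of and members_of are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE members (
        sname  TEXT    NOT NULL,
        date   TEXT    NOT NULL,   -- ISO date strings in this sketch
        grp    NUMERIC NOT NULL,
        origin TEXT    NOT NULL,
        interp INTEGER,
        PRIMARY KEY (sname, date)  -- exactly one group per individual per day
    )""")
con.executemany(
    "INSERT INTO members VALUES (?, ?, ?, ?, ?)",
    [("AAA", "2005-07-01", 1, "C", 0),
     ("AAA", "2005-07-02", 1, "I", 1),
     ("BBB", "2005-07-01", 1, "C", 0),
     ("CCC", "2005-07-01", 2, "C", 0)])


def group_of(sname, day):
    """The group an individual is in on a date, or None if there is no row."""
    row = con.execute(
        "SELECT grp FROM members WHERE sname = ? AND date = ?",
        (sname, day)).fetchone()
    return row[0] if row else None


def members_of(grp, day):
    """All individuals located in a group on a date."""
    rows = con.execute(
        "SELECT sname FROM members WHERE grp = ? AND date = ? ORDER BY sname",
        (grp, day))
    return [r[0] for r in rows]


print(group_of("AAA", "2005-07-02"))
print(members_of(1, "2005-07-01"))
```

Declaring (sname, date) as the primary key makes the database itself reject a second row for the same individual on the same day.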
Column Descriptions
Sname
The individual whose location is being recorded. The three
letter code that identifies the individual's row in the
BIOGRAPH table. There will always be a row in BIOGRAPH for
the individual identified here.
This column may not be NULL.
Date
The date.
This column may not be NULL.
Grp
The group where the individual is located. This is a Gid
value from GROUPS. This field should contain the most
specific sub-grouping available -- subject to the
constraints of the data entry protocol, of course.
Aggregation into larger groupings is accomplished by
retrieving the associated Supergroup from GROUPS.
This column may not be NULL.
Note
Usage exception: For the years 1989-1991, inclusive, the
groups recorded for the sub-groups of Alto's group do not
necessarily reflect the actual groupings of the animals on
a particular day, but are instead indications of the
group-splitting process. See Jeanne Altmann and the Data
Management Manual for a further explanation.
Origin
A one letter code indicating the source of the location
information. This information is derived from, and has the
same values as, the Status column of CENSUS, although
MEMBERS.Origin contains the I (interpolated) value not
found in CENSUS. The codes are as follows: C (CENSUS)
values represent census data points, I (interpolated)
values are derived from the census data points, D
(demography) values represent demography notes not present
in the census sheets, M and N (manual) values represent
census data points due to operator intervention in CENSUS.
The S, E, F, B, G, T, L, and R codes are derived from
analysis of historical data. See the CENSUS section for
further information.
This column may not be NULL.
Interp
The distance, in days, from the date on which an individual
was previously observed to be in a group (censused --
automatic placement in the unknown group does not count) to
the date of the MEMBERS row. So the value is 0 on those
days on which the individuals are censused, 1 on those
(non-census) days immediately before or after the census
days, etc. For those MEMBERS rows that the interpolation
procedure has placed in the unknown group for lack of a
better place to put them, the Interp column is the number
of days "distant" from the interpolating CENSUS row, or the
birth date, that determined the group membership. Note that
the CENSUS row that determined that the MEMBERS.Grp should
be unknown may record an absence.
Important
The Interp value is not meaningful over intervals that
contain census rows which are themselves the result of an
analysis. Over these intervals Interp is NULL. For more
information see Interpolation, Data is not Re-Analyzed.
This column may be NULL.
CENSUS
The population census table. Aside from the BIOGRAPH Matgrp
column, this table is the origin of all information
regarding group membership. This table holds all the field
census data and any information regarding group membership
that is recorded in the field demography notes. It contains
one row per animal per group per day censused. There is an
additional row per individual per demography note for those
days when there is a demography note regarding the
individual and group but no census of the group. (See
DEMOG.)
Tip
The way to record that an individual is alone is to create
a row in GROUPS (Groups) meaning alone, and then to assign
individuals who are alone to this group. The "alone-ness"
of an individual can then be tracked in the same fashion as
group membership, although the Babase user does then need
to be aware that the members of the "alone" group are not
actually proximate to one another.
As noted in the MEMBERS documentation, Babase does not
allow an individual to be in more than one group on a given
day.
The original field census data sheets can be recovered from
CENSUS, with one exception. Data is lost when an individual
is actually censused in two groups on the same day because
of movement between groups and the timing of the censuses.
In this situation a decision should be made as to the group
in which CENSUS should record the individual's presence on
that day. A demography note should then be added to DEMOG,
with text that notes the individual's presence in the
second group. Although it is technically true that this
does put into the database all of the information from the
censuses in the field, as the information regarding the
second census is in textual information it is not readily
available to automated tools.
Caution
Be careful when changing this data; remember that rank will
almost certainly change should group membership change.
Column Descriptions
Cenid
A unique identifier. This is an automatically generated
sequential number. Cenid links CENSUS to DEMOG.
This column may not be NULL.
Date
The date of the census, or the date of the demography note
(when Status is D).
This column may not be NULL.
Sname
The individual whose location is being recorded. The three
letter code that identifies an individual in BIOGRAPH.
There will always be a row in BIOGRAPH for the individual
identified here.
This column may not be NULL.
Grp
The group where the individual is located. This is a Gid
value from GROUPS. This column should contain the most
specific sub-grouping available -- subject to the
constraints of the data entry protocol, of course.
Aggregation into larger groupings is accomplished by
retrieving the associated Supergroup from GROUPS.
This column may not be NULL.
Note
Usage exception: For the years 1989-1991, inclusive, the
groups recorded for the sub-groups of Alto's group do not
necessarily reflect the actual groupings of the animals on
a particular day, but are instead indications of the
group-splitting process. See Protocol for Data Management:
Amboseli Baboon Project document for a further explanation.
Status
A one letter code indicating the source of the location
information. Status is the source of MEMBERS.Origin data.
The current codes are as follows: C (census), A (absent), D
(demography), and M or N (manual). Other values derived
from analysis of historical data include: S, E, F, B, G, T,
L, and R.
The CENSUS.Status Codes
C
(census) The animal was found in the group on a
field census sheet: from the census datasheets.
(There may or may not be a corresponding demography
note on DEMOG as well.)
A
(absent) The animal was not found in the group on a
field census sheet. Note that while an individual
should not be recorded "present" in more than one
group on the same day, s/he may be absent from
several groups on any given day.
D
(demography) The animal was noted in the field
notebooks or elsewhere to be in a group but was not
marked present in a field census on that day. There
is a DEMOG row associated with the CENSUS row. The
individual may or may not have been
marked "absent" on the same group's field census
for the day.^[9]
M
(manual, interpolated) This code provides a way to
manually supplement what is in the CENSUS table
when there is no other way to get the data in.
Babase considers this code to be the same as the C
code.
N
(manual, not interpolated) This code provides an
alternate way to manually supplement what is in the
CENSUS table when there is no other way to get the
data in. Rows with this code are not interpolated; the
data is presumed to be the result of some analysis.
S
(Susan's data) The data comes from the old DISPERSE
database where the record had both a Datein and a
Dateout.
E
(ending date) The data comes from the old DISPERSE
database where the record had a Datein but not a
Dateout.
F
(final date) The data comes from the old DISPERSE
database where there is a Dateout and the last
recorded location is before the Statdate.
B
(birth date) The data comes from the old DISPERSE
database where the record had a Dateout but not a
Datein.
T
(total) The data comes from the old DISPERSE
database where the record had neither a Datein nor
a Dateout.
G
(gap) The data is a record of the animal in the
unknown group when the animal appeared in the old
DISPERSE database but where there was a gap between
times of recorded location.
L
(lineage) The group is from the Matgrp on the old
CYCTOT database, either because the animal did not
appear in the DISPERSE database, or because the
first location for the animal in the old DISPERSE
database had a Datein and this Datein was after the
birth date of the animal.
R
(result of Alto's breakup) The data is S, E, F, B,
G, T, or L data which has had locations which were
changed from 1.0 to the group in which the animal
was censused on 15/4/92. This change left all R
rows as part of a contiguous series of days during
which the animals are located in the Alto's
sub-group as censused on 15/4/92, and the
time-adjacent locations were not 1.0.
A C Status is marked on the census data sheet as an "X".
An A or D Status is marked on the census data sheet as a
"0".
This column may not be NULL.
Cen
Whether or not the CENSUS row represents an entry on a
census data sheet. TRUE means the CENSUS row exists because
of an entry on a census data sheet, FALSE means there was
no census done and the CENSUS row exists to support a
demography note, manual notation of absence, etc. Cen
should only be TRUE when Status is C, A, or D.
This column may not be NULL.
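The relationship between Status and Cen can be checked row by row. The Python below is a small sketch under the assumption that Status arrives as a one letter string and Cen as a boolean; check_census_row is a made-up name, and the set of codes is simply the list given in this section.

```python
# The Status codes listed in this section.
STATUS_CODES = set("CADMNSEFBGTLR")
# Codes that can correspond to an entry on a census data sheet.
ON_CENSUS_SHEET = {"C", "A", "D"}


def check_census_row(status, cen):
    """Return a list of problems with one CENSUS row's Status and Cen."""
    problems = []
    if status not in STATUS_CODES:
        problems.append("unknown Status code")
    elif cen and status not in ON_CENSUS_SHEET:
        problems.append("Cen is TRUE but Status is not C, A, or D")
    return problems


print(check_census_row("C", True))   # no problems
print(check_census_row("M", True))   # flags the Cen rule
```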
DEMOG (Demography Notes)
This table holds group membership related text from the
field demography notes, especially that which records group
membership information not otherwise written on the regular
field census sheets. DEMOG provides a means of notating
CENSUS rows, and thus facilitates management of additional
"free form" CENSUS rows, rows that do not directly
correspond with the field census sheets.^[10] It contains
one row for every individual for every date for every group
where the individual was noted present in the field
demography notes. The DEMOG row holds the textual
information. There is always exactly one corresponding
CENSUS row which holds the corresponding group membership
information in the usual coded and structured form. (Note
that only some CENSUS rows will have DEMOG rows; CENSUS
rows that originate entirely in the regular censuses of
groups will not, in general, have an associated DEMOG row).
A single field note referring to more than one individual
must appear in DEMOG as two (or more) separate rows, one
row per individual. Multiple field notes pertaining to a
single individual on a single date must be combined into
one piece of text and entered in a single DEMOG row. (See
Protocol notes for structure of the demography data as
entered by the operator.)
Column Descriptions
Cenid
A unique identifier. This is an automatically generated
sequential number. Cenid links CENSUS to DEMOG.
This column may not be NULL.
Reference
A GROUPS Gid value that links the DEMOG row with the
written field notebook where the note can be found.
This column may not be NULL.
Comment
The demography note text pertaining to the CENSUS row with
the given Cenid.
This column may not be NULL.
---------------------------<snip>------------------------
Interpolation
The Babase database uses a procedure called interpolation
to update MEMBERS whenever the CENSUS table, or the
BIOGRAPH.Birth, or BIOGRAPH.Statdate columns are updated.
Interpolation extrapolates the group membership of
individuals into days for which there is no actual
observation of the individuals' whereabouts. It "guesses"
with which group an individual is associated, given
knowledge of the individual's group membership (or lack
thereof) at given points in time, and records the result in
MEMBERS. Thus, MEMBERS always has a row recording group
membership for every day of every individual's life.
Interpolation's 3 Fundamentals
It is primarily by census records that Babase tracks group
membership. The CENSUS table is the source of all group
membership information. Babase places rows in the CENSUS
table to indicate presence in a group whenever demography
information is stored in other tables.^[12][13] Throughout
this section it is to be understood that any sort of
demographic information which results in CENSUS data is
implied when the term census, or its plural, is used.
Unfortunately, the term census is further overloaded. It is
occasionally used in the colloquial sense, meaning present
-- found when a group census was taken, the alternative
being absent. It is hoped the meaning will be clear from
context.
It is important to remember that censuses record absence
from a group as well as presence in a group, that there are
two mutually exclusive classes of CENSUS rows: absences,
records of absence from specific groups on specific days;
and "locating censuses", records that place the individual
in specific groups on specific days.
The premise of interpolation is that an individual is
assumed to be in the group where observed for a period of
14 days to either side of the observation unless there's
indication otherwise. To this end, interpolation keeps an
individual in the group where a census locates him for a
time period that is the shorter of:
1. Half of the time interval between the individual's next
(or prior) census which finds the individual in any
group.
2. Half of the time interval between the next (or prior)
recorded absence from the group in which the individual
was censused. Absences from other groups are ignored.
3. The 14 day Interpolation Limit. Given no other
information, an individual is considered to remain (or
have been) in the group where observed for 14 days
following (or preceding) the date of observation.
Should the above process not place an individual in a
group, the individual is placed in the unknown group, so
long as the individual is alive on the day in question.
There are some subtleties to these rules, and there is
further elaboration necessary to allow for "old style"
CENSUS rows, which do not directly correspond with actual
census taking, and other factors. But these rules are the
foundation and we begin with them.
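The three foundation rules can be restated as a short Python sketch. It is an illustration of the rules as worded above, not the Babase implementation. Days are plain integers rather than dates, the unknown group is numbered 9 as in the figures of this section, and the names Census, reach and group_on are invented for the example. When two locating censuses are equally distant the sketch prefers the earlier one; that tie-break is an assumption, since the rules above do not address it. Birth, Statdate and the "old style" CENSUS rows are not modelled.

```python
from typing import NamedTuple

UNKNOWN_GROUP = 9   # the unknown group, numbered as in this section's figures
LIMIT = 14          # the 14 day Interpolation Limit


class Census(NamedTuple):
    day: int        # day number; real CENSUS rows carry a date
    grp: int        # Gid of the group censused
    present: bool   # True: locating census, False: recorded absence


def reach(census, history, direction):
    """Days for which `census` holds the individual in its group.

    direction is +1 (following the census) or -1 (preceding it).
    """
    limit = float(LIMIT)                              # rule 3
    for other in history:
        gap = (other.day - census.day) * direction
        if gap <= 0:
            continue                                  # not on this side
        if other.present or other.grp == census.grp:
            # Rule 1: a locating census in any group.
            # Rule 2: an absence from the same group only.
            limit = min(limit, gap / 2)
    return limit


def group_on(day, history):
    """The group interpolation assigns on `day`, per the three rules."""
    best = None
    for census in history:
        if not census.present:
            continue
        distance = abs(day - census.day)
        direction = 1 if day > census.day else -1
        if distance == 0 or distance <= reach(census, history, direction):
            # Assumed tie-break: nearest census first, then the earlier one.
            candidate = (distance, census.day, census.grp)
            if best is None or candidate < best:
                best = candidate
    return best[2] if best else UNKNOWN_GROUP


history = [Census(1, 1, True), Census(3, 1, True),
           Census(8, 1, False), Census(11, 1, True)]
print([group_on(day, history) for day in range(1, 12)])
# [1, 1, 1, 1, 1, 9, 9, 9, 9, 1, 1]
```

In this hypothetical history the absence on day 8 halves the reach of the censuses on days 3 and 11, which is why days 6 through 9 fall to the unknown group.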
Interpolation Visualized
Interpolation is best described with the help of diagrams
as it is all about computing and comparing time intervals
of various lengths, which are easily represented in a
diagram by lines of various lengths. We begin with the
simplest case, an individual censused present and absent in
a single group.
Tip
As the examples throughout this section are developed be
sure to pay close attention to the diagrams' keys. At times
the meaning of a symbol changes from diagram to diagram to
reflect a subtlety.
Interpolating presences and absences
Figure 5 shows a record of one individual's censuses. The
group, for the moment we'll assume group 1, is censused
several times over a period of days. One day the individual
is absent.
Figure 5. An Individual is Censused Present and Absent
One individual's census records
CENSUS:       C       C                   A           C
Date:         1   2   3   4   5   6   7   8   9   10  11
Key:
C Censused present in group (group 1)
A Censused absent in group (group 1)
The first step in interpolation is to construct the various
intervals from the given CENSUS rows. Figure 6 shows how
interpolation "splits the difference" between presences and
absences to construct two intervals for each locating
census, one preceding the census and one following it. As
the diagrams given here can only show a window in time and
omit what falls outside that window, only one interval each
is shown for the censuses taken on day 1 and day 11.
Figure 6. Interpolating From Presences and Absences
Interpolation intervals within a group
CENSUS:       C       C                   A           C
Intervals:    X---|---X---------|         O     |-----X
Date:         1   2   3   4   5   6   7   8   9   10  11
Key:
C Censused present in group (group 1)
A Censused absent in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Presumed in group (group 1)
| Midpoint between census takings
Interpolation creates MEMBERS rows that place the
individual in a group each day. Figure 7 shows how group
membership assignment is based upon the computed intervals.
Because of the absence, there are days when the individual
is placed in group 9, the unknown group.
Figure 7. Interpolating Group Membership
Intervals determine group membership
CENSUS:       C       C                   A           C
Intervals:    X---|---X---------|         O     |-----X
MEMBERS.
  Group:      1   1   1   1   1   9   9   9   9   1   1
  Origin:     C   I   C   I   I   I   I   I   I   I   C
Date:         1   2   3   4   5   6   7   8   9   10  11
Key:
C Censused present in group (group 1)
A Censused absent in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Presumed in group (group 1)
| Midpoint between census takings
Figure 7 also introduces the MEMBERS' Origin column. As can
be seen, the Origin column mimics the corresponding CENSUS
Status column on those days when interpolation is not
guessing group membership. Origin is I on those days when
interpolation is guessing.
The MEMBERS' Interp column represents distance from a
census. Interp is zero on those days when a census has
located the individual. The recorded absence is reflected
in the group, but is immaterial to Interp. Even though
there's an absence, the Interp count is over the interval
between the two locating censuses. Interp gets its value
from a "split the difference" between censuses which record
presence in the group, a different sort of "split the
difference" than is used to determine into which group an
individual should be placed. Figure 8 extends Figure 7,
showing the computation of Interp. With this addition the
interpolation has finished, the MEMBERS table can be
constructed from the given CENSUS rows.
Figure 8. Computing Interp Values
The resulting MEMBERS rows
CENSUS:       C       C                   A           C
Intervals
  For Group:  X---|---X---------|         O     |-----X
  For Interp: X~~~|~~~X~~~~~~~~~~~~~~~|~~~~~~~~~~~~~~~X
MEMBERS.
  Group:      1   1   1   1   1   9   9   9   9   1   1
  Interp:     0   1   0   1   2   3   4   3   2   1   0
  Origin:     C   I   C   I   I   I   I   I   I   I   C
Date:         1   2   3   4   5   6   7   8   9   10  11
Key:
C Censused present in group (group 1)
A Censused absent in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Presumed in group (group 1)
~ Inside of interval
| Midpoint of interval
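The Interp and Origin rows of Figure 8 can be recomputed with a few lines of Python. This is a sketch of the "distance to the nearest locating census" reading of Interp given above, not Babase code. It takes the locating censuses to fall on days 1, 3 and 11, where Figure 8 shows an Interp of 0, and the function name interp_and_origin is invented. It does not model the cases where Interp is NULL.

```python
def interp_and_origin(days, locating):
    """Compute (day, interp, origin) for each day.

    locating maps the day of each locating census to its CENSUS Status.
    Recorded absences are left out: they do not affect Interp or Origin.
    """
    rows = []
    for day in days:
        # Distance to the nearest locating census, in either direction.
        interp = min(abs(day - census_day) for census_day in locating)
        # Origin copies the census Status, or is I when interpolated.
        origin = locating.get(day, "I")
        rows.append((day, interp, origin))
    return rows


rows = interp_and_origin(range(1, 12), {1: "C", 3: "C", 11: "C"})
print([interp for _, interp, _ in rows])
# [0, 1, 0, 1, 2, 3, 4, 3, 2, 1, 0]
print("".join(origin for _, _, origin in rows))
# CICIIIIIIIC
```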
Applying the 14 day interpolation limit
So far we have only explored the first 2 of the 3
fundamental interpolation intervals, those dealing with
being censused present and absent. Before we elaborate
further and examine the more complicated interactions
between presences and absences let us dispense with the 14
day interpolation limit.
Figure 9 shows the effect of the 14 day interpolation
limit. For reasons of space some days are removed from the
interval. There are no censuses, present or absent, on the
days omitted. As the "Date:" line shows, a total of 33 days
are examined, an entire month 31 days in length and the
first two days of the following month. Again, we assume the
censuses are taken in group 1.
Figure 9. The 14 Day Interpolation Limit
The shorter intervals are chosen
CENSUS:        C                                           C
C C Interval:  X----- ... -----------|------- ... ---------X
14 Day Limit:  X----- ... -------|       |--- ... ---------X
MEMBERS.
  Group:       1   1   ... 1   1   9   9   1   ... 1   1   1
  Interp:      0   1   ... 13  14  15  15  14  ... 2   1   0
  Origin:      C   I   ... I   I   I   I   I   ... I   I   C
Date:          1   2   ... 14  15  16  17  18  ... 31  1   2
Key:
C Censused present in group (group 1)
X Known present in group (group 1)
- Inside of interval
| Interval endpoint
As the 16th and 17th are more than 14 days away from either
census the individual is placed in the unknown group on
those days. Days that are closer to the actual censuses are
interpolated into group 1. So, as the rules require, the
individual is interpolated into the censused group for the
shorter of the two time periods. As before, all the
interpolated MEMBERS rows, those which do not correspond to
an actual census, have an Origin of I. And as before the
Interp column counts up from and down to the actual
censuses.
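The effect of the limit can be sketched in a few lines of
Python. This is an illustration only, not Babase's own code:
the function name, the constant names, and the use of plain
day numbers in place of dates are all assumptions, and the
sketch covers only the simple case of two censuses that find
the individual in the same group with no absences between
them.

```python
UNKNOWN_GROUP = 9   # the "unknown" group used in the figures
LIMIT_DAYS = 14     # the 14 day interpolation limit


def interpolate_pair(first, last, group):
    """Return (day, group, interp, origin) rows between two censuses.

    Hypothetical helper: both censuses locate the individual in
    `group` and no absences fall between them, so only the 14 day
    limit can shorten the intervals.
    """
    rows = []
    for day in range(first, last + 1):
        # Interp is the distance to the nearest census.
        distance = min(day - first, last - day)
        origin = "C" if distance == 0 else "I"
        # More than 14 days from either census: unknown group.
        grp = group if distance <= LIMIT_DAYS else UNKNOWN_GROUP
        rows.append((day, grp, distance, origin))
    return rows
```

With censuses on the hypothetical day numbers 1 and 32, the
only rows placed in the unknown group are those of days 16
and 17, each with an Interp of 15.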
Interpolation and Birth Dates
There are some exceptions to the rules as stated so far.
Not surprisingly, interpolation will not presume to put an
individual in a group, that is, create a MEMBERS row, before
the individual's Birth date.
The birth date is an exception in another fashion: it locates
the individual in his Matgrp like a special sort of census.
The rationale for this is that although the birth may not
be observed the individual most certainly enters the group
when born. Further, this rule ensures that we have a row in
MEMBERS for every day the individual is alive. When there
is a regular census on the birth date (finding the
individual in his Matgrp, or so one would hope) the
interpolated MEMBERS row is like that for any other census.
But when there is no locating census on the birth date the
resulting MEMBERS row has an Origin of I and an Interp of 0.
This is shown in Figure 10.
Figure 10. Interpolation at Birth
Individual born into group 1
CENSUS: B C C C
Intervals: X-----|-----X-|-X-----|-----X
MEMBERS.
Group: 1 1 1 1 1 1 1 1
Interp: 0 1 1 0 0 1 1 0
Origin: I I I C C I I C
Date: 1 2 3 4 5 6 7 8 9 10
Key:
B Born (into group 1)
C Censused present in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Presumed in group (group 1)
| Midpoint between census takings
Clearly, there are no MEMBERS rows before the birth date,
the individual is in his Matgrp on the day of his birth,
and the Interp value counts up from the birth date and then
down to the next census as though there were a census on
the birth date.
An individual is placed in his Matgrp on his birth date
even when a regular census has an absence recorded for the
individual on the date of birth.^[14]
Interpolation At The Statdate
Two more exceptions to the rules occur at the Statdate.
You might expect that interpolation
would not place a row after the individual's Statdate, and
this is indeed true, but true only when the individual is
dead. When an individual is alive, interpolation will place
a row after the individual's Statdate, but only when there
is a subsequent absence from the same group as the group in
which the individual was censused.^[15][16] While at first
this may seem odd, the reasoning behind this behavior is
clear -- the Statdate is not the last date on which there
is data for the individual. This is elaborated below.
All the same, at times there is a reason to have
interpolation halt at the Statdate. When individuals are
alive the system should not try to interpolate into time
periods for which data has yet to be entered; otherwise
there would always be spurious interpolated MEMBERS rows
which vanish as soon as additional data is entered. The
trouble with creating such rows is that, although the
interpolation is corrected and the rows disappear once data
entry resumes, the use of these rows in analysis is always
inappropriate. Such rows will exist at the end of every
period of data entry, as there will always be a large
number of living individuals found in their groups on the
last census entered. The solution is to not create the
rows.^[17] When a living individual has no absences from
the group of his last locating census that post-date that
census, this is taken to mean that there is additional, as
yet unentered, data on the individual. In this
case interpolation stops on the day the individual was last
found in a group. This situation is shown in Figure 11,
where the last census taken found the individual in group 1
on day 5, and so this day is the individual's Statdate as
well. There is no interpolation past the last census.
Figure 11. Alive and Present When Last Censused
Living individual with Statdate of 5
CENSUS: C A C
Intervals: X-----| O |-X
MEMBERS.
Group: 1 1 9 9 1
Interp: 0 1 2 1 0
Origin: C I I I C
Date: 1 2 3 4 5 6 7 8 9 10 11
Key:
C Censused present in group (group 1)
A Censused absent in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Presumed in group (group 1)
| Midpoint between census takings
In Figure 12 more data has been entered: the individual has
been missing since the last census shown in Figure 11
above. As there have been no further censuses during which
the individual was found, the individual's Statdate is still
day 5, although there is now subsequent interpolation.
Notice that there are no MEMBERS rows created after day 7.
When interpolating a living individual, after the Statdate
there is no default placement of the individual into the
unknown group.^[18]
Figure 12. Alive and Absent in Last Census^[19]
Living individual with Statdate of 5
CENSUS: C A C A A
Intervals: X-----| O |-X---------| O
MEMBERS.
Group: 1 1 9 9 1 1 1
Interp: 0 1 2 1 0 1 2
Origin: C I I I C I I
Date: 1 2 3 4 5 6 7 8 9 10 11
Key:
C Censused present in group (group 1)
A Censused absent in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Presumed in group (group 1)
| Midpoint between census takings
Although the only change between Figure 11 and Figure 12 is
the entry into CENSUS of rows recording absence, that is
enough to signal that interpolation can go forward without
creating spurious MEMBERS rows -- rows likely erased upon
the entry of more data. It is important that interpolation
does go forward in this case, past the Statdate, as
otherwise bias would be introduced. The last C CENSUS would
be interpolated differently from all the other censuses. To
be sure, there is bias introduced in Figure 11 when
interpolation is cut short. But censoring bias at the end
of data collection is unavoidable, whereas we can avoid
introducing bias here.
Warning
So long as an individual is alive the last CENSUS to locate
the individual ought to be followed by a record of absence, an
absence from the group where the individual was last found.
To do otherwise, as must occur when there is simply no
further data to be entered, is to introduce a bias into
MEMBERS.
In Figure 13 there is no additional census information, but
the individual's Status has been adjusted to mark the
individual dead. A new Statdate value indicates the
individual died on day 9 and interpolation is now up to and
including the day of death. As is usual, when an
individual's group membership cannot be determined he is
placed in the unknown group.
Figure 13. Interpolation to Statdate When Dead
Dead individual with Statdate of 9
CENSUS: C A C A A
Intervals: X-----| O |-X---------| O
MEMBERS.
Group: 1 1 9 9 1 1 1 9 9
Interp: 0 1 2 1 0 1 2 3 4
Origin: C I I I C I I I I
Date: 1 2 3 4 5 6 7 8 9 10 11
Key:
C Censused present in group (group 1)
A Censused absent in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Presumed in group (group 1)
| Midpoint between census takings
Although Figure 13 does not show this, the 14 day
interpolation limit applies when the individual is dead.
When there are no absences after the last census and there
are more than 14 days between the last census and the
Statdate the individual is placed in the unknown group from
the 15th day through the day of death.
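The three cases of Figures 11, 12 and 13 can be summarized
in a short Python sketch. It is a simplification offered for
illustration, not Babase's implementation: the function and
parameter names are assumptions, days are plain numbers, and
the last day on which interpolation keeps the individual in
the group of his last census is taken as given rather than
computed.

```python
def last_members_day(alive, statdate, last_census, in_group_through=None):
    """Return the last day for which a MEMBERS row is created.

    alive            -- True when the individual is alive
    statdate         -- the individual's Statdate, as a day number
    last_census      -- day of the last census that located him
    in_group_through -- last day interpolation keeps him in the group
                        of that census, or None when no absence from
                        that group follows the census
    """
    if not alive:
        # Dead: rows run up to and including the day of death.
        return statdate
    if in_group_through is None:
        # Alive, no later absence: more data is presumed to be
        # coming, so stop at the last locating census.
        return last_census
    # Alive with a later absence: rows continue only while he is
    # still interpolated into the group of his last census.
    return in_group_through
```

Reading the values off the figures, Figure 11 corresponds to
last_members_day(True, 5, 5), which gives day 5; Figure 12 to
last_members_day(True, 5, 5, 7), which gives day 7; and
Figure 13 to last_members_day(False, 9, 5, 7), which gives
day 9.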
The Midpoint Rule
The alert reader may have noticed that the above examples
are carefully crafted so that the midpoint between
presences and absences always falls between two days. What
happens when there is an odd number of days in the interval
so that the midpoint is a day exactly in between the
endpoints, as occurs 3 times in Figure 14?
Figure 14. Midpoint Days
Intervals with an odd number of days
CENSUS: C A C C A C
Intervals: X---| O |-------X-|-X---| O |-X
Date: 1 2 3 4 5 6 7 8 9 10 11
Key:
C Censused present in group (group 1)
A Censused absent in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Presumed in group (group 1)
| Midpoint between census takings
The MEMBERS table has a 1 day precision; there is no way to
be in a group in the morning and out of it in the
afternoon, so on any one midpoint day the individual must
either be in the group or out of it. Should the individual
be in the group on midpoint day or out of it? The question
is resolved using a property of the date itself. Briefly,
the julian dating system is a method of assigning every day
a unique number. As a midpoint day is no more likely to
fall on an even julian date than on an odd one, we can
avoid bias by using the parity of the midpoint day's julian
date to resolve the problem.
Whenever interpolation is called upon to halve an interval
between two CENSUS rows that contains an odd number of days
then the "midpoint day" is assigned to the left, earlier,
half of the interval when the julian date of the midpoint
day is even. A midpoint day is assigned to the right,
later, half of the interval when the julian date of the
midpoint day is odd.
So, The Midpoint Rule resolves the issue by adjusting the
intervals as shown in Figure 15. The intervals are no
longer perfectly halved. On the midpoint day there is no
preference either for or against interpolating the
individual into the group censused.
Figure 15. The Midpoint Rule Adjusts Intervals
Intervals with an odd number of days
CENSUS: C A C C A C
Intervals: X-----| O |---------X-|-X-| O |-X
MEMBERS.
Group: 1 1 9 9 1 1 1 1 9 9 1
Interp: 0 1 2 3 2 1 0 0 1 1 0
Origin: C I I I I I C C I I C
Julian Date: 1 2 3 4 5 6 7 8 9 10 11
Key:
C Censused present in group (group 1)
A Censused absent in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Presumed in group (group 1)
| Interval endpoint
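The Midpoint Rule amounts to a small piece of integer
arithmetic. The following Python sketch is one way to write
it down; the function name is an assumption and the
arguments are taken to be julian day numbers.

```python
def earlier_half_end(earlier, later):
    """Return the last day belonging to the earlier CENSUS row.

    `earlier` and `later` are the julian day numbers of two CENSUS
    rows, with earlier < later.  The days after the returned day
    belong to the later row.
    """
    if earlier >= later:
        raise ValueError("earlier must precede later")
    middle = (earlier + later) // 2
    if (later - earlier) % 2 == 1:
        # No midpoint day: the interval halves cleanly.
        return middle
    # `middle` is a midpoint day.  An even julian date goes to the
    # earlier half, an odd julian date goes to the later half.
    return middle if middle % 2 == 0 else middle - 1
```

Reading the CENSUS rows of Figure 15 as falling on julian
dates 1 and 3, 3 and 7, and 8 and 10, the midpoint days are
2, 5 and 9. The function returns 2, 4 and 8 for these pairs,
so day 2 goes to the earlier row while days 5 and 9 go to the
later rows.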
Interpolating When The Group Changes
Having dispensed with the various elaborations and
exceptions that occur in unusual cases it is time to return
to the fundamentals of interpolation and examine what
happens when an individual moves between groups. What comes
into play are the first 2 of the 3 interpolation intervals.
Recall:
Interpolation keeps an individual in the group where a
census locates him for a time period that is the shorter
of:
1. Half of the time interval between the individual's
next (or prior) census which finds the individual in
any group.
2. Half of the time interval between the next (or prior)
recorded absence from the group in which the
individual was censused. Absences from other groups
are ignored.
Figure 16 shows a record of one individual's censuses. He,
a male, is censused in 2 groups, group 1 and group 2. The
census records for each group reflect both presence in the
group and absence from the group.
Figure 16. An Individual is Censused in 2 Groups
One individual's census records
Group 1: C C A C A
Group 2: A C C
Date: 1 2 3 4 5 6 7 8 9 10
Key:
C Censused present
A Censused absent
Figure 17 shows what would happen if interpolation worked
with each group separately. There are conflicts, days when
the individual is in both groups. Something else must be
done.
Caution
Figure 17 is an example of an interpolation method that
does not work. The method shown in the figure is not one
Babase uses when interpolating.
Figure 17. Interpolating Each Group Separately
One individual's census records
Group 1: C C A C A
Group 2: A C C
Group 1 Interpolating just group 1
CENSUS: C C A C A
Intervals: X---|---X---------| O |-X-| O
Group 2 Interpolating just group 2
CENSUS: A C C
Intervals: O |---------X-------|-------X
Date: 1 2 3 4 5 6 7 8 9 10
Key:
C Censused present
A Censused absent
X Known present
O Known absent
- Presumed present
| Interval endpoint
The solution is to return to the interpolation fundamentals.
We begin by taking a closer look at the way we have been
diagramming intervals. In Figure 17 the first group has 3
locating censuses and 2 absences, and yet we've diagrammed
the resultant intervals on a single line. The interpolation
fundamentals tell us to obtain 2 pairs of intervals for
each locating census: a "halfway to census" pair of
intervals and a "halfway to absence" pair of intervals.
Figure 18 takes the CENSUS rows of the first group shown in
Figures 16 and 17 and does this for each locating census.
In Figure 18 the CENSUS rows of days 1, 3 and 9 each have
their own sections detailing the intervals to the nearest
censuses and intervals to the nearest absences. The lines
labeled Presence show the intervals that are halfway from
each locating census to the next. The lines labeled Absence
show the intervals that are halfway from each census to the
nearest absence. This detailed breakdown is followed by a
composite interval diagram of the familiar type encountered
in Figures 6 through 17 above. It should be clear that we
have arrived at the "composite" form of the interval
diagram by following the fundamentals: the composite is
made up of the shorter of each census's intervals. The
result is correct; the composite constructed in Figure 18
is identical to the one shown previously in Figure 17. It
had better be, or else the interpolations of Figure 17
would be in conflict with the fundamental interpolation
rules.
Figure 18. A Closer Look at Intervals
CENSUS rows from group 1
CENSUS: C C A C A
Day 1 Intervals by presence and absence
Presence: X---| X
Absence: X-------------| O
Day 3 Intervals by presence and absence
Presence: X |---X-----------| X
Absence: X---------| O
Day 9 Intervals by presence and absence
Presence: X |-----------X
Absence: O |-X-| O
Combining the shorter intervals
Interval: X---|---X---------| O |-X-|
Date: 1 2 3 4 5 6 7 8 9 10
Key:
C Censused present
A Censused absent
X Known present in same group
x Known present in different group
O Known absent in same group
- Inside of interval
| Interval endpoint
The intervals in Figure 18 did not have to be grouped by
censused day, they could have been grouped by Presence and
Absence or any other way. For each set of locating censuses
we can always split out the "halfway to census" intervals
from the "halfway to absence" intervals, group them any way
we like, and later use the interpolation fundamentals to
recombine them, without affecting the result. This has not
been necessary so far, but it is essential if we are to
correctly interpolate when an individual moves between
groups, as above in Figure 16: "An Individual is Censused
in 2 Groups". We must return to the fundamentals to make
sense of interpolation. Rather than trying to combine the
results of interpolating the groups separately, as was done
in Figure 17: "Interpolating Each Group Separately", we
instead combine the results of interpolating the presences
in all the groups with separate interpolations of the
absences in each group. Each time a census finds an
individual in a group, separately compute both the interval
halfway to the nearest census that finds the individual in
any group and the interval halfway to the nearest absence
from the particular group being censused. In Figure 19,
this method is applied to the data first seen in Figure 16.
For clarity the intervals surrounding the censuses that
belong to one group are shown separately from those
belonging to the other group.^[20] The lines labeled
Presence show the intervals that are halfway from each
census to the nearest census that finds the individual in
any group. The lines labeled Absence show the intervals
that are halfway from each census to the nearest absence in
the same group. Censuses with no neighboring absence do not
have this latter sort of interval shown.^[21]
Figure 19. Presence and Absence Interpolated Separately
One individual's census records
Group 1: C C A C A
Group 2: A C C
Group 1 The intervals of group 1's censuses
Presence: X---|---X-----| x |-----X-| x
Absence: X---------| O |-X-| O
Group 2 The intervals of group 2's censuses
Presence: x x |-----X-----| x |-X
Absence: O |---------X
Date: 1 2 3 4 5 6 7 8 9 10
Key:
C Censused present
A Censused absent
X Known present in same group
x Known present in different group
O Known absent in same group
- Inside of interval
| Interval endpoint
Figure 20 shows how interpolation combines the "presence"
and "absence" intervals by choosing the shorter of the two
as the period during which the individual is assumed to
be in the group where censused. The line labeled Used
contains the shorter of each census's two intervals.^[22]
Figure 20. Combining Presence and Absence Intervals
One individual's census records
Group 1: C C A C A
Group 2: A C C
Group 1 The intervals of group 1's censuses
Presence: X---|---X-----| x |-----X-| x
Absence: X---------| O |-X-| O
Used: X---|---X-----| |-X-|
In Group: 1 1 1 1 ? ? ? ? 1 ?
Group 2 The intervals of group 2's censuses
Presence: x x |-----X-----| x |-X
Absence: O |---------X
Used: |-----X-----| |-X
In Group: ? ? ? ? 2 2 2 ? ? 2
Date: 1 2 3 4 5 6 7 8 9 10
Key:
C Censused present
A Censused absent
X Known present in same group
x Known present in different group
O Known absent in same group
- Inside of interval
| Interval endpoint
Having interpolated the intervals surrounding each census,
determining the final group membership is a straightforward
matter of placing the individual in the unknown group when
there's nowhere else to put him. Figure 21 shows this
process. All that remains is to compute the Interp values
in the usual fashion, by ignoring absences and counting
distance from the nearest census. In Figure 21 the
intervals between locating censuses are shown, labeled For
Interp, to support the Interp values given.
Figure 21. Group Membership Given Multiple Groups
One individual's census records
Group 1: C C A C A
Group 2: A C C
Group 1 The intervals of group 1's censuses
Used: X---|---X-----| |-X-|
In Group: 1 1 1 1 ? ? ? ? 1 ?
Group 2 The intervals of group 2's censuses
Used: |-----X-----| |-X
In Group: ? ? ? ? 2 2 2 ? ? 2
Intervals between locating censuses
For Interp: X~~~|~~~X~~~~~|~~~~~X~~~~~|~~~~~X~|~X
MEMBERS.
Group: 1 1 1 1 2 2 2 9 1 2
Interp: 0 1 0 1 1 0 1 1 0 0
Origin: C I C I I C I I C C
Date: 1 2 3 4 5 6 7 8 9 10
Key:
C Censused present
A Censused absent
X Known present in same group
- Presumed present
~ Inside of interval
| Interval endpoint
By now it should be clear that interpolation^[23] is a
function over CENSUS row sets. It is a function: for every
input you get exactly one output. It takes sets of CENSUS
rows as input. Because sets are unordered you can put
CENSUS rows into the database in any order and the result
will be the same. And, because it is a function, you can
re-interpolate the same CENSUS rows as many times as
desired without altering the final result.
It should also be clear why interpolation always chooses to
use "the shorter interval", and why this always produces
the "correct" result. The shorter interval is short for a
reason: there is some reason to believe the individual is
not in the group, otherwise the interval would be longer.
Further, every time the shorter interval is chosen a
possible overlap with another interval from a different
locating census is eliminated. By always choosing the
shorter interval interpolation ensures that the
interpolation of any two locating censuses will not
conflict.
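The procedure of Figures 19 through 21 can be sketched as a
function in Python. This is a simplified illustration under
stated assumptions, not the code Babase runs: all names are
hypothetical, days are plain numbers, every locating census
is assumed to have a Status of C, and birth, Statdate and
pre-analyzed data are ignored, so only the days from the
first to the last locating census are filled in.

```python
UNKNOWN_GROUP = 9   # the "unknown" group used in the figures
LIMIT_DAYS = 14     # the 14 day interpolation limit


def earlier_half_end(earlier, later):
    """Last day of the earlier half, per The Midpoint Rule."""
    middle = (earlier + later) // 2
    if (later - earlier) % 2 == 1 or middle % 2 == 0:
        return middle
    return middle - 1


def interpolate(censuses, absences):
    """Return {day: (group, interp, origin)} for one individual.

    censuses -- {day: group} for the locating censuses
    absences -- {group: [days]} for the recorded absences
    """
    days = sorted(censuses)
    placed = {}
    for day, group in censuses.items():
        # Neighbours that can shorten this census's intervals:
        # locating censuses in any group, plus absences from this
        # census's own group.  Other groups' absences are ignored.
        others = [d for d in days if d != day]
        others += absences.get(group, [])
        prior = [d for d in others if d < day]
        later = [d for d in others if d > day]
        # The nearest neighbour on each side yields the shorter
        # "halfway" interval; the 14 day limit caps both sides.
        start = max(day - LIMIT_DAYS, days[0])
        if prior:
            start = max(start, earlier_half_end(max(prior), day) + 1)
        end = min(day + LIMIT_DAYS, days[-1])
        if later:
            end = min(end, earlier_half_end(day, min(later)))
        for d in range(start, end + 1):
            placed[d] = group
    members = {}
    for d in range(days[0], days[-1] + 1):
        # Interp ignores absences: distance to the nearest census.
        interp = min(abs(d - c) for c in days)
        if d in censuses:
            members[d] = (censuses[d], 0, "C")
        else:
            members[d] = (placed.get(d, UNKNOWN_GROUP), interp, "I")
    return members
```

Reading the CENSUS rows of Figure 16 as group 1 censuses on
days 1, 3 and 9, group 2 censuses on days 6 and 10, group 1
absences on days 8 and 10, and a group 2 absence on day 1,
the call interpolate({1: 1, 3: 1, 6: 2, 9: 1, 10: 2},
{1: [8, 10], 2: [1]}) yields the Group, Interp and Origin
values of the MEMBERS rows in Figure 21. Because the inputs
are dictionaries the order in which the rows are supplied
does not matter.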
Pre-Analyzed Data Disturbs Interpolation
In addition to that most important distinction which
classifies CENSUS rows into absent and locating censuses
there is a second distinction which further divides
locating censuses into those which interpolate and those
which do not. Those CENSUS rows that record observational
data, those with Status values of C, D, and M, are
interpolating censuses.^[24] (All of the previous examples
have concerned CENSUS rows of this type.) The remaining
CENSUS.Status values indicate that the CENSUS row is the
result of analysis: all of the "old style", that is
"historical", CENSUS.Status values and the N manual Status
value. These are the non-interpolating censuses.
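The two distinctions can be expressed as a small
classification function. The Python below is a sketch only:
the function names are hypothetical, and the use of A as the
Status value marking an absence is an assumption borrowed
from the keys of the figures rather than something stated in
this section.

```python
INTERPOLATING = {"C", "D", "M"}   # observational data
ABSENT = {"A"}                    # assumed code for an absence


def census_category(status):
    """Classify a CENSUS.Status value for interpolation."""
    if status in ABSENT:
        return "absence"
    if status in INTERPOLATING:
        return "interpolating"
    # N and all of the historical Status values.
    return "non-interpolating"


def is_locating(status):
    """True for any census that is not an absence."""
    return census_category(status) != "absence"
```

Under these assumptions census_category("N") is
"non-interpolating", yet is_locating("N") is still true: a
non-interpolating census locates the individual even though
nothing is interpolated from it.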
This further division of locating censuses into
interpolating and non-interpolating, the division between
raw and already analyzed data, leads to the final
refinement to the interpolation procedure. We do not want
interpolation to produce re-analyzed results from already
analyzed data. Interpolation occurs only between "regular",
that is to say interpolating, censuses (and to the birth
date as a special case). "Non-interpolating" census rows
are copied directly from CENSUS to MEMBERS: CENSUS.Status
becomes MEMBERS.Origin, and Interp is set to NULL. When a
non-interpolating census is found on the birth date, the
birth date will not interpolate.
Interpolation looks at "regular" census rows and attempts
to guess the individual's location on those days when there
are no observations. It does so by looking at the intervals
between the "regular" censuses. Finding non-interpolating
CENSUS rows, that is to say already analyzed data, on one
of these intervals breaks the assumptions interpolation
uses in its "guessing". The previously analyzed data point
could be there for any reason at all, and there's no point
in pretending it's not there either. What interpolation
does is give up. It interpolates up to the offending data
point and then stops.^[25] After that it still creates rows
in MEMBERS, but it does not attempt to make guesses about
where to place an individual or what the interpolated row
means.
Note
This situation is not expected to occur, or, rather,
whenever there are non-interpolating CENSUS rows between
interpolating censuses, the non-interpolating CENSUS rows
are expected to be contiguous over the entire interval
between the interpolating censuses. So, the expected cases
are the trivial degenerate ones. Nonetheless, such
situations probably do occur in the existing data. It would
probably be best either to require the expected behavior or
to get rid of all the pre-analyzed CENSUS rows and replace
them with raw data, especially given the design problems
pointed out below.
Regardless, non-trivial examples are presented here so that
a complete understanding of interpolation can be developed.
Figure 22 shows that the 3 fundamental interpolation
intervals are shortened when a non-interpolating census is
found between interpolating censuses. The intervals for
each locating census are examined separately. The
non-interpolating census has no interpolation intervals.
The intervals of the interpolating censuses are truncated,
reduced to the interval between the interpolating and
non-interpolating censuses. By this means a portion of the
diagram, days 4 and 5, is blocked from interpolating into
the group. If there were no N census, the Absence interval
would be day 1's shortest interval, and days 4 and 5 (as
well as day 3) would interpolate into the group. (Notice
that day 1's Absence interval has a midpoint day, day 5,
and that it would have been included in the interval.)
Interpolation is prevented from placing individuals in the
group of their interpolating census on the "far side" of
non-interpolating censuses.
Figure 22. Pre-Analyzed Data Truncates Interpolation
Intervals
CENSUS rows from group 1
CENSUS: C N A C
Day 1 Intervals per fundamental type
Presence: X-----| N X
Absence: X-----| N O
14 Day Lim: X-----| N
Day 3 Intervals per fundamental type
Presence: N
Absence: N
14 Day Lim: N
Day 12 Intervals per fundamental type
Presence: X N |---------------------X
Absence: N O |-----X
14 Day Lim: N |---------------------------------X
Julian Day: 1 2 3 4 5 6 7 8 9 10 11 12
Key:
C Censused present in group (group 1)
N Manual entry,
present in group but non-interpolating (group 1)
A Censused absent in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Inside of interval
| Interval endpoint
In Figure 23 the shortest intervals of each locating census
have been chosen and combined; the result is the line
labeled For Group. This is then used to determine group
membership.
The interesting part of Figure 23 is the computation of the
Interp values. The "halfway to census" intervals of
Figure 22 have been combined and labeled For Interp. Recall
that it is these intervals that are used to compute the
Interp values. The N census has created a "gap" in
interpolation, clearly shown on the For Interp line as
running from day 3 through day 6. Over this interval
interpolation's assumptions have been violated and it does
not know what to do. The group membership is easy. On day
3, the day of the N census, it can simply copy the CENSUS
row's Grp and Status into the appropriate MEMBERS columns
in the same fashion it would for any other locating census.
On days 4 through 6 it can do what it usually does with
group membership when it does not know where to locate an
individual: it places the individual in the unknown group
with an Origin of I. On days 3 through 6 interpolation has
no way of knowing how far away the day is from the nearest
locating census, which is what is supposed to go in the
Interp column. Due to this lack of information it assigns
the Interp column a value of NULL, no data, on this
interval.
Figure 23. Pre-Analyzed Data Interrupts Interpolation
An individual is censused
CENSUS: C N A C
Intervals
For Group: X-----| N O |-----X
For Interp: X~~~~~| |~~~~~~~~~~~~~~~~~~~~~X
MEMBERS.
Group: 1 1 1 9 9 9 9 9 9 9 1 1
Interp: 0 1 5 4 3 2 1 0
Origin: C I N I I I I I I I I C
Date: 1 2 3 4 5 6 7 8 9 10 11 12
Key:
C Censused present in group (group 1)
N Manual entry,
present in group but non-interpolating (group 1)
A Censused absent in group (group 1)
X Known present in group (group 1)
O Known absent in group (group 1)
- Presumed in group (group 1)
~ Inside of interval
| Interval endpoint
When looking at Figure 23, one way to explain what happens
to Interp is to say that it is fixed at NULL over that
portion of the day 1 census's "halfway to census" interval
that was truncated because the N row showed up. (See
Figure 22.) Effectively, as MEMBERS Interp counts up with
increasing distance from the interpolating census, the
count is fixed at NULL upon encountering a
non-interpolating census until the point is reached at
which counting back down to the next interpolating census
begins, at which point the count downward resumes as though
never interrupted.^[27]
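The counting just described can be sketched in Python for a
single pair of interpolating censuses. The names are
hypothetical and None stands in for NULL. The handling of a
non-interpolating census that falls in the later half of the
interval is an assumption made by symmetry with the case
shown in Figure 23.

```python
def earlier_half_end(earlier, later):
    """Last day of the earlier half, per The Midpoint Rule."""
    middle = (earlier + later) // 2
    if (later - earlier) % 2 == 1 or middle % 2 == 0:
        return middle
    return middle - 1


def interp_values(first, last, non_interpolating):
    """Return {day: Interp} for the days from `first` to `last`.

    first, last       -- days of two consecutive interpolating censuses
    non_interpolating -- days of the non-interpolating censuses that
                         fall between them
    """
    split = earlier_half_end(first, last)
    values = {}
    for day in range(first, last + 1):
        if day <= split:
            # Counting up from `first`: fixed at NULL once a
            # non-interpolating census has been encountered.
            blocked = any(first < n <= day for n in non_interpolating)
            distance = day - first
        else:
            # Counting down to `last`: NULL until the final
            # non-interpolating census has been passed (assumed).
            blocked = any(day <= n < last for n in non_interpolating)
            distance = last - day
        values[day] = None if blocked else distance
    return values
```

For the data of Figure 23, interp_values(1, 12, [3]) gives 0
and 1 for days 1 and 2, None for days 3 through 6, and then
counts down 5, 4, 3, 2, 1, 0 over days 7 through 12.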
The approach interpolation takes, in some sense, attempts
to minimize the disturbance created when already analyzed
census data is mixed in with raw census information.
However, as can be seen in Figure 23, it is not entirely
successful. Although day 7, for example, has an Interp
value indicating it is 5 days away from a census, it is
really 4 days away from the N census. If the N CENSUS does
really represent a census, then day 7's Interp value is
wrong. And the problems are not restricted to Interp
values. Is it really true that days 4 and 5 should be
assigned to the unknown group? If so then why aren't there
N rows that say so? Day 2 is even more disturbing. There is
no diagram for this, but suppose the N census found the
individual in a different group. Figure 22 would be
unchanged, all of day 1's intervals would be truncated at
the N census. The effect would be more clear if the
interval between the preceding C census and the following N
census were larger, but consider that day 2, by the
midpoint rule, would be "assigned" to the N census. That
means that if the N census really does represent a census
in a different group, day 2 should be assigned to that
group, not to group 1.
Note that, in the general case, even though the "halfway to
census" interval does not determine group membership (all
the intervals are truncated, leaving a "gap" in which
interpolation defaults to the unknown group), whether this
interval has a midpoint day, and if so where it falls, does
matter to the computation of Interp. If the midpoint day
happens to fall into the side of the interval containing
the non-interpolating census then the Interp value will be
NULL. Otherwise, it will have a value representing the
number of days to the nearest locating, and interpolating,
census.
Incorporating the above safety checks into the rules we
already have, ensuring that data is not re-analyzed,
produces the actual interpolation rules.
The Interpolation Rules
Using these rules interpolation creates rows in MEMBERS
based on the information it finds in CENSUS, and the
BIOGRAPH columns Birth, Matgrp, Statdate and Status.
I. CENSUS Rows Are Either Absences, Interpolating, or
Non-Interpolating
Interpolation partitions all CENSUS rows into one of 3
categories:
1. Absences
CENSUS rows which indicate absence from a group.
2. Interpolating censuses
Those CENSUS rows that record observational data,
those with Status values of C, D, and M, are
interpolating censuses.
3. Non-interpolating censuses
The remaining CENSUS.Status values indicate the
CENSUS row is the result of analysis. These rows,
all of the "old style", that is "historical",
CENSUS.Status values and the N manual Status value,
are not re-analyzed and so do not interpolate.
For convenience, the CENSUS rows that are not absences,
the interpolating and the non-interpolating censuses,
are termed "locating censuses".
II. Censusing Assigns Group Membership
On those days when an individual is censused in a
group, when there is a locating CENSUS row, a row is
created in MEMBERS to place that individual in the
group on the given day. The Origin value is the CENSUS
row's Status value. When the CENSUS row is
interpolating the Interp value is 0. When the CENSUS
row is non-interpolating the Interp value is NULL.
III. The 3 Interpolation Intervals
Interpolation places an individual in the group into
which he is censused, the Grp of an interpolating
CENSUS row (Status values C, D, and M), on the days to
either side of the census being interpolated for a
time period that is the shorter of:
1. The Halfway to Census Interval
Half of the time interval between the
individual's next (or prior) locating and
interpolating census, which may locate the
individual in any group.
2. The Halfway to Absence Interval
Half of the time interval between the next (or
prior) recorded absence, considering only
absences from the same group in which the
individual was censused. Absences from other
groups are ignored.
3. The 14 day Interpolation Limit
Given no other information, an individual is
considered to remain (or have been) in the group
where observed for 14 days following (or
preceding) the date of observation.
The resulting MEMBERS rows have an Origin of I and an
Interp value of the number of days difference between
the MEMBERS row's Date and the date of the nearest
locating census; Interp values count up over The
Halfway to Census Interval as the distance from the
interpolated census increases. An interpolated MEMBERS
row falling on the day after a census has an Interp of
1, the day after that the Interp is 2, and so forth,
assuming, of course, the individual has no other
nearby CENSUS rows.
IV. The Midpoint Rule
This rule qualifies how interpolation assigns the
halfway point between two CENSUS rows in The Halfway to
Census Interval and The Halfway to Absence Intervals,
above, when the number of days in the interval cannot
be divided into equal halves. Whenever interpolation is
called upon to halve an interval between two CENSUS
rows that contains an odd number of days then the
"midpoint day" is assigned to the left, earlier, half
of the interval when the julian date of the midpoint
day is even. A midpoint day is assigned to the right,
later, half of the interval when the julian date of the
midpoint day is odd.
V. Births Locate Individuals
This rule declares a live birth to be the equivalent of
an interpolating census, one that indicates presence in
the individual's Matgrp. Fetal losses, individuals with
NULL Snames, are not considered births and are never
interpolated. An individual is placed in his Matgrp on
his birth date even when a regular census has an absence
recorded for the individual on the date of birth. In
this case interpolation always entirely ignores the
absence and will not use such an absence to compute a
Halfway To Absence Interval.
When there is a locating census on the birth date, the
MEMBERS row interpolation creates is like that made for
any other locating census with the given Status. But
when there is no locating census on the birth date, the
resulting MEMBERS row has an Origin of I (and an Interp
of 0, as any census with a Status of C would have). Aside
from their I Origin value, births interpolate as would
any CENSUS with a C Status.
VI. No Data Implies Unknown Group Membership
On days when none of the above rules serve to place an
individual in a group, the individual is placed in the
unknown group. The resulting MEMBERS rows have an
Origin of I and an Interp value of the number of days
difference between the MEMBERS row's Date and the date
of the individual's nearest interpolating census.^[28]
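The Interp computation described here amounts to a nearest-date distance. A minimal Python sketch follows, assuming the dates of the individual's interpolating censuses are already at hand. The function name is hypothetical, and the sketch ignores the cases where the Data is not Re-Analyzed rule sets Interp to NULL.

```python
from datetime import date
from typing import Iterable


def interp_value(row_date: date,
                 interpolating_census_dates: Iterable[date]) -> int:
    """Days between a MEMBERS row's Date and the nearest census.

    Hypothetical helper; the census may fall before or after the row.
    """
    return min(abs((row_date - census).days)
               for census in interpolating_census_dates)
```

A row dated 2005-03-01, with interpolating censuses on 2005-02-01 and 2005-03-21, is 28 days from the first and 20 days from the second, so its Interp would be 20.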
VII. Birth stops interpolation
Interpolation will not place a row in MEMBERS before
an individual's Birth date.
VIII. Death stops interpolation
When an individual is dead, interpolation will not
place a row after the individual's Statdate.
IX. Data Entry Cessation Stops Interpolation of Living
Individuals
When an individual is alive, interpolation will create
rows after the individual's last locating census only
when there are subsequent absences; absences, that is,
from the group in which the individual was
censused.^[29] In this case, unlike above, no data does
not imply unknown group membership; such rows are
created only so long as the individual is interpolated
into the group of his last locating census. When a
living individual has no absences from the group of his
last locating census after that census, interpolation
assumes that further data is available but has yet to
be entered, and interpolation stops at the last
locating census.
X. Data is not Re-Analyzed
Interpolation is only done to regular, that is
interpolating, CENSUS rows; data that was collected in
the field. Other data, the "non-interpolating" census
rows that represent the result of prior analysis, do not
interpolate; they are copied directly from CENSUS to
MEMBERS, CENSUS.Status becomes MEMBERS.Origin and Interp
is set to 0. Further, when a non-interpolating census is
found on one of The 3 Interpolation Intervals the
interval is shortened enough that the non-interpolating
census is no longer on the interval. When a
non-interpolating census is found on a birth date, the
birth date does not interpolate.
The MEMBERS Interp column is fixed at NULL on the
interval from the non-interpolating census row through
the "midpoint" end of The Halfway to Census Interval,
endpoints included.^[30] Here we are speaking of The
Halfway to Census Interval as computed, not a Halfway to
Census Interval shortened in the preceding paragraph.
Expectations and Implications
It is expected that all non-interpolating CENSUS rows, that
is to say CENSUS rows produced by prior analysis, will be
clustered in contiguous intervals with "regular" census
rows at the endpoints. This is particularly expected of
"old style" census rows from before Babase, as they precede
all "regular" census data, but is also expected of the N
non-interpolating, manual, Status code, should it ever be
used. If these expectations are borne out, the Data is not
Re-Analyzed rule will never be invoked.
There are some not-quite-obvious implications given these
interpolation rules:
o The only rows in MEMBERS that have an Origin of I, and
an Interp of 0, and are not placed in the unknown group
are birth dates. Not every birth date will have an
associated MEMBERS row with these values, as some birth
dates have locating censuses, but MEMBERS rows with
these values will be birth dates.
o Living individuals, but not dead ones, can have MEMBERS
rows created by the interpolation procedure that locate
the individual in a group on a date later than the
individual's Statdate.^[31]
o So long as an individual is alive the last CENSUS to
locate the individual ought to be followed by a record of
absence, an absence from the group where the individual
was last found. To do otherwise, as must occur when
there is simply no further data to be entered, is to
introduce a bias into MEMBERS.
o Aside from births, the only other rows in MEMBERS with
an Origin of I and an Interp of 0 are those in the
unknown group which were created by Data is not
Re-Analyzed.
o As fetal losses, individuals with NULL Snames, cannot
appear in CENSUS, are not considered live births, and
always have their birth date equal to their Statdate,
they never have MEMBERS rows associated with them.
o When computing Interp values from The Halfway to Census
Interval, The Midpoint Rule is usually immaterial.
However, when non-interpolating censuses affect the
interpolation, The Midpoint Rule can be the factor that
determines whether or not a MEMBERS row has a 0 Interp
value.
---------------------------<snip>------------------------
A. Changes to Babase between 1.0 and 2.0
A number of changes were made to Babase in the transition
from FoxPro (Babase 1.0) to Postgresql (Babase 2.0). This
appendix attempts to document changes made to data
semantics.
Changes to BIOGRAPH.Statdate
The Statdate is now constrained, when the individual is
alive, to be the most recent date on which a census located
an individual in a group. Although this was true in
practice, the 1.0 system did not require it.
This constraint leads directly to another: when the
individual is alive and there are no (non-absent) censuses
then the individual's Statdate must be the individual's
birth date. Because arbitrary Statdates are not allowed, we
prevent automatic changes from erasing manually set
Statdates.
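The two constraints on a living individual's Statdate can be summarized in a short Python sketch. The function name is hypothetical, and the sketch only restates the rule; it does not show how Babase enforces it.

```python
from datetime import date
from typing import Iterable


def expected_statdate_if_alive(birth: date,
                               locating_census_dates: Iterable[date]) -> date:
    """Statdate the constraint requires of a living individual.

    The most recent date on which a census located the individual in
    a group, or the birth date when there is no such census.
    """
    return max(locating_census_dates, default=birth)
```

An individual born 2004-06-01 with no locating censuses would have a Statdate of 2004-06-01; once censused present on 2004-07-10 and 2004-08-02, the Statdate would be 2004-08-02.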
Changes To Interpolation and MEMBERS
The interpolation procedure changed somewhat. As the
interpolation is what creates the MEMBERS table this
appendix also describes the changes made to MEMBERS between
1.0 and 2.0.
o Individuals have a row in MEMBERS for every day of
their lives.
Interpolation now places individuals in the unknown
group when individuals' locations cannot be otherwise
assigned, for example outside of the 14 day
interpolation limit. Formerly, when the individual
could not be placed in a group on a particular day the
individual had no row in MEMBERS on that day.
o Individuals are no longer placed in a group, the group
in which they were last censused, on their Statdate and
this "location" no longer interpolates.
When first written, the interpolation procedure was
designed to work with females, who are unlikely to be
absent from their group for more than 28 days. (Twice
the 14 day interpolation limit.) By placing an
individual in a group on their Statdate, the group in
which they were last censused, the females were assured
a row in MEMBERS for every day of their lives. Further,
analysis was simplified as each of these rows
associated the females with their group (even though at
the end of their lives they may not have been present
in the group.)
The new interpolation procedure does not consider the
Statdate in its determination of the individual's
group membership on that day, although, as always, when
the Statdate is a death date it does stop
interpolation.
o There is a change in what happens when an individual is
censused absent on his birth date. In the new system, if
the individual is censused "absent" on his birth date,
interpolation will "override" the absence and place the
individual in his Matgrp group in MEMBERS.
In the old system, if the individual is censused
"absent" on his birth date, interpolation will not
"override" the absence and place the individual in a
group in MEMBERS. As the individual is expected to be
somewhere on his birth date, it's expected that there be
a demography note made for the individual on that date to
give the individual a location, that is, a row in MEMBERS.
o MEMBERS.Interp may now be NULL. The FoxPro system did
not have NULL values. In the new system Interp is NULL
when interpolation does not know where the nearest
locating census is. See Pre-Analyzed Data Disturbs
Interpolation.
o The behavior of interpolation on the last census is now
documented.
The interpolation procedure changed during the period
of use of Babase 1.0, but the changes were not
documented. The primary change was that interpolation
was altered so that it did not interpolate if there were
no subsequent censuses, absent or not. This prevented
(almost) every living individual currently monitored
from having a 14 day "tail" of interpolated values
following the last entered census -- a "tail" that
would disappear the next time the census information
was updated.
Changes To The Sexual Cycle Information
The structure of the sexual cycle portion of the database
was changed. The CYCLES table became CYCPOINTS. The CYCGAPS
table was added. And the CYCSTATS and REPSTATS were
modified and made useful. For further information please
compare the old and new documentation.
---------------------------<snip>------------------------
--------------
^[1] We do this rather than paying one of the regular
certification authorities to validate our identity. These
certification authorities appear to validate the identity
of their customers by virtue of having successfully been
paid.
^[2] As security restrictions permit, of course.
^[3] That way if you unknowingly revealed your password to
the terrorists last weekend when you were drunk, by the
time everybody sobers up the password will have been
changed and the amount of damage done will be limited.
^[4] Presently group 9.0. This is hardcoded at present.
^[5] This is unlikely as the database will not allow entry
of a duplicate Sname.
^[6] Or whatever you want to call it in the case of a fetal
loss.
^[7] An actual census does not have to be taken. As the
Statdate of live individuals is derived from the CENSUS
table, any observation of an individual in a group which
results in a row being added to CENSUS is sufficient.
^[8] This criterion is specifically phrased to account for
gaps in the recorded data during the time period in which
the peak turgescence probably occurred.
^[9] D usually occurs when a male is seen alone or in a
non-census group.
^[10] DEMOG nearly makes the M CENSUS Status code obsolete,
were it not so hard to search on textual data. Indeed, it
was created in response to difficulties with the M code.
^[11] One would think that, in order to maintain perfect
database consistency, the actor and actee participants in
an interaction should be in the same Supergroup, according
to the MEMBERS table. The database consistency checker
(integrit.prg) does report when the actor and actee are not
members of the same Supergroup. However, there is currently
no check for actor/actee location correspondence in the
update programs. This is for three reasons. First,
movements between groups and the timing of censuses and
interaction data collection may result in valid records of
interactions between individuals that are recorded as being
in different supergroups. The effects of this on the manual
data correction process could be reduced by having the
interaction master table update process add additional
location data into the MEMBERS table, but not totally
eliminated because the resolution of the MEMBERS table is
one day and individuals can move between groups during a
one day interval. Second, some of the interaction data are
entered with a date of the first of the month, not the
actual date of the interaction. Thus, the animals could be
in different groups on the first of the month and still
interact during the month. When this situation is
discovered, the date of the interaction for these
interactions should be manually changed to the first day of
the month on which the two animals were in the same group.
Third, the lack of a check allows the interaction data to
be entered before the census data for the month. Also, from
1989 through 1991, inclusive, the recorded group for the
sub-groups of Alto's group does not always represent the
actual location of the individual. (See the MEMBERS
documentation.)
^[12] At this time only DEMOG, the demography notes table,
contributes to CENSUS any information regarding group
membership.
^[13] Sometimes, when demography information is added into
other tables, CENSUS rows are altered rather than removed.
Likewise, CENSUS rows are removed (or altered as necessary)
when demography information is removed from other tables.
^[14] This is the one exception, if you wish to consider it
so, to the rule that an individual cannot be censused both
present and absent in the same group on the same day.
^[15] The "same group" condition is one that must be met
whenever interpolation examines intervals between presence
and absence.
^[16] As the individual is alive, every census that
post-dates the individual's Statdate must record an
absence, else the Statdate would be adjusted to reflect the
date of last census.
^[17] This is a heuristic. While it should work well enough
most of the time the Babase user must be aware of the
pitfalls in this approach. These are explained below.
^[18] Without this restriction interpolation would have to
insert rows forever, placing the individual in the unknown
group off into the indefinite future.
^[19] Notice that interpolation does not bother analyzing
absences, such as the last-most, that are not neighbor to
censuses.
^[20] As locating censuses are interpolated individually
the figure could diagram the intervals associated with each
census separately, as in Figure 18, work out group
membership from that, and then combine the results; the
outcome would be unaffected. The chosen presentation form
allows the interval endpoints to "match up" in a revealing
fashion. As an exercise the reader should prove to himself
that the intervals associated with each locating census are
accurately depicted, and that the order in which locating
censuses are interpolated does indeed make no difference.
^[21] Figure 18: "A Closer Look at Intervals" makes clear
that it is not necessary to show these intervals. By
definition, the omitted intervals will always be longer
than the "halfway to census" interval of the census being
interpolated. As the shorter interval is the one used the
longer may be ignored.
^[22] When there are two intervals. When there's no
"absence" interval the "Used:" line shows the "presence"
interval.
^[23] The proper term is "The Glorious Interpolation
Procedure", but we don't tell this to just anybody.
^[24] See MEMBERS.Origin.
^[25] It might be better if interpolation did not
interpolate at all on those intervals between interpolating
censuses that contain a non-interpolating census^[26] -- if
it put the individual in the unknown group, with an Interp
of 0 and an Origin of NULL whenever there was no locating
census. However, this could easily cause problems because
interpolation has always worked as the body of this
document describes. Although these situations are not
supposed to occur, it is likely the data contains such
situations and changes should not be made to interpolation
which break the database.
^[26] I have not thought this through. At first glance it
seems the code would be simpler, but perhaps not. And the
effect on data analysis is unclear. It is probably best to
adopt one of the solutions presented in the note below.
^[27] Although in this example we "count up" traversing the
timeline from left to right, had the N census been
closer to the right side of the diagram than the left we
would be "counting up" the interval by traversing the
timeline in the opposite direction, from right to left.
^[28] The same method is used to compute Interp values when
interpolation uses The 3 Interpolation Intervals, above.
^[29] This "same group" criterion corresponds with the
criterion found in The Halfway to Absence Interval.
^[30] Interp is fixed at 0 over the portion of The Halfway
to Census Interval that was truncated in the preceding
paragraph. Effectively, as MEMBERS Interp counts up with
increasing distance from the interpolating census, the
count is fixed at NULL upon encountering a
non-interpolating census until the point is reached at
which counting back down to the next interpolating census
begins, at which point the count downward resumes as though
never interrupted.
^[31] This is examined in detail in Interpolation At The
Statdate.
^[32] Be sure to read the edition that describes the
version of Docbook you're using. This text was written for
Docbook 4.3.