This section contains data about the origin, identity, location, and various other traits about the tissue and nucleic acid samples in the users' inventory. This includes samples currently residing in the users' inventory, as well as older samples that may have previously been in use but have since been sent to others, consumed, discarded, or lost. Because of this, the data in this section serve as both a historical record of all samples that have ever been in the users' possession and an active record of the samples that are currently in the users' possession.
The text in this section uses the terms "nucleic acid" and "nucleic acid sample" interchangeably[116]. At the time of this writing, the system does not attempt to record details at the molecular level, so the reader can be assured that comments about the location, source, etc. of a specific "nucleic acid" should be interpreted as referring to a sample and not a specific molecule.
This table contains one row for every location that may be used to store tissue or nucleic acid samples.
Samples may be stored in varied locations with different organizations/research groups ("institutions"). The Institution column is included to allow easy segregation of locations across these varying locales.
The name of each distinct location is recorded in the Location column. Different organizations have their own conventions about how to organize and name storage locations, so this code may be a very descriptive and specific space ("Shelf 1, Rack 2, Box 3, Position D") or something more general ("PINK BOX").
Each Institution-Location pair must be unique.
To allow the use of nondescriptive general Location values but retain the ability to enforce uniqueness of specific ones, the boolean column Is_Unique is included. When Is_Unique is TRUE, the row's LocId may occur at most once across both the NUCACID_DATA.LocId and TISSUE_DATA.LocId columns (once total, not once per table). When FALSE, the LocId may be used any number of times in either table.
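The Is_Unique rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `loc_id_conflicts` and the in-memory dictionary and lists are hypothetical stand-ins for the LOCATIONS, NUCACID_DATA, and TISSUE_DATA tables, and nothing here is meant to describe how the system actually enforces the rule.

```python
from collections import Counter


def loc_id_conflicts(is_unique, nucacid_loc_ids, tissue_loc_ids):
    """Return the LocIds that would break the Is_Unique rule.

    is_unique: dict mapping LocId -> bool (hypothetical stand-in for
        LOCATIONS.Is_Unique)
    nucacid_loc_ids, tissue_loc_ids: the LocId values used by each table
    """
    # Count uses across BOTH tables together: once total, not once per table.
    uses = Counter(nucacid_loc_ids) + Counter(tissue_loc_ids)
    return sorted(
        loc_id for loc_id, count in uses.items()
        if is_unique[loc_id] and count > 1
    )


# LocId 1 is a specific position, LocId 2 is a general "PINK BOX".
locations = {1: True, 2: False, 3: True}
conflicts = loc_id_conflicts(locations, [1, 2, 2], [1, 2, 3])
```

Here LocId 1 is reported because it is used once in each table, while LocId 2 may be reused freely because its Is_Unique is FALSE.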
A unique identifier for the location. This is an automatically generated sequential number that uniquely identifies the row.
This column is automatically maintained by the database, cannot be changed, and must not be NULL.
The INSTITUTIONS.Institution indicating the organization or research group at which this row's Location exists.
This column may not be NULL.
A boolean indicating whether or not this location at this institution is unique.
This column defaults to TRUE.
This column may not be NULL.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
This table contains one row for every quantification of a nucleic acid sample's concentration. All concentrations are recorded in picograms per microliter (pg/μL).
A nucleic acid sample cannot be quantified before it was created, before the source tissue sample was collected, or before the tissue sample's donor entered the study population (if applicable); the Conc_Date cannot be before the related NUCACID_DATA.Creation_Date, the related TISSUE_DATA.Collection_Date, or the related BIOGRAPH.Entrydate. These dates already have a required sequence (Entrydate <= Collection_Date <= Creation_Date <= Conc_Date), so in many cases it may be sufficient for the system to require only that Conc_Date is on or after the Creation_Date. However, any of these date columns can be NULL, so for the sake of completeness the system separately checks that Conc_Date is on or after each of them.
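The reason for checking each date separately can be illustrated with a short Python sketch. The function name `conc_date_problems` is hypothetical, `None` stands in for NULL, and the choice to skip a comparison when either date is unknown is an assumption made for this example.

```python
from datetime import date


def conc_date_problems(conc_date, creation_date=None,
                       collection_date=None, entrydate=None):
    """List the earlier dates that a Conc_Date wrongly precedes."""
    if conc_date is None:
        return []  # an unknown Conc_Date cannot be compared to anything
    earlier = {
        "Creation_Date": creation_date,
        "Collection_Date": collection_date,
        "Entrydate": entrydate,
    }
    # Compare against every known date; NULL (None) dates are skipped.
    return [name for name, value in earlier.items()
            if value is not None and conc_date < value]


# Creation_Date is unknown, so only the separate Collection_Date check
# can catch this impossible quantification date.
problems = conc_date_problems(
    date(2020, 3, 1),
    creation_date=None,
    collection_date=date(2020, 4, 1),
    entrydate=date(2010, 1, 1),
)
```

If the check relied on Creation_Date alone, the NULL in that column would let this row pass unnoticed.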
Some quantification methods may use a different unit of concentration than that used in this table. Nanograms per microliter (ng/μL) is especially common. Such concentrations must be converted to pg/μL before they are added to this table.
Use the NUCACID_CONCS view instead of this table. It includes an additional column that indicates concentration in ng/μL, and also allows the insertion of quantifications in ng/μL. The conversion between ng/μL and pg/μL is thus performed by the system and not the user.
Do not assume that the number of significant figures employed in the Pg_ul column is the "true" number of significant figures for this quantification. This table records concentrations from a variety of quantification methods with varying levels of accuracy and stores them all in a single column that records all data to the nearest 0.1 pg/μL[117]. When new data are added, this column pays no attention to the number of provided significant figures and may indicate more than were actually used at the time of quantification. See the example below.
Example 3.2. (Mis)Use of Significant Figures in NUCACID_CONC_DATA
The concentration of a new DNA sample is determined to be 10.0 ng/μL, which has 3 significant figures. When recorded in NUCACID_CONC_DATA, this concentration will be recorded in Pg_ul as 10000.0 pg/μL, with 6 significant figures. A user should not assume that this quantification was originally performed with 6 significant figures of accuracy.
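The arithmetic in this example can be reproduced with Python's `decimal` module. This is a sketch only: the helper names `ng_to_pg` and `sig_figs` are hypothetical, and the code makes no claim about how the system itself performs the conversion.

```python
from decimal import Decimal

PG_PER_NG = Decimal(1000)   # 1 ng = 1000 pg
TENTH = Decimal("0.1")      # Pg_ul is stored to the nearest 0.1 pg/uL


def ng_to_pg(ng_ul):
    """Convert a concentration in ng/uL (given as text) to pg/uL."""
    return (Decimal(ng_ul) * PG_PER_NG).quantize(TENTH)


def sig_figs(value):
    """Count the significant figures in a Decimal as it is written."""
    return len(value.as_tuple().digits)


measured = Decimal("10.0")   # 3 significant figures, as measured
stored = ng_to_pg("10.0")    # 10000.0, which is written with 6
```

The stored value carries more written digits than the measurement did, which is exactly why the number of digits in Pg_ul says nothing about the accuracy of the original quantification.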
A unique identifier. This is an automatically generated sequential number that uniquely identifies the row.
This column is automatically maintained by the database, cannot be changed, and must not be NULL.
The NUCACID_DATA.NAId of the quantified sample.
This column may not be NULL.
The NUCACID_CONC_METHODS.Conc_Method used to quantify this concentration.
This column may not be NULL.
The date that this concentration was quantified.
This column may be NULL, when the date is unknown.
The concentration of the sample according to this quantification, in picograms per microliter (pg/μL).
This column may not be NULL.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
This table contains one row for every person involved with the creation of a specific nucleic acid sample. When a nucleic acid sample has multiple creators, each of them is recorded here in a separate row.
Most nucleic acid samples are created via "extraction". This table favors using "creation" rather than "extraction", for reasons explained in the discussion of the NUCACID_DATA table.
Each NAId-Creator combination must be unique; a sample cannot have the same creator more than once.
Use the NUCACIDS view to insert data into this table. It provides a simple way to determine the appropriate NAId value to use, and for a human data enterer to provide multiple creators in a single row.
A unique identifier. This is an automatically generated sequential number that uniquely identifies the row.
This column is automatically maintained by the database, cannot be changed, and must not be NULL.
The NUCACID_DATA.NAId of the related nucleic acid sample.
This column may not be NULL.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
This table contains one row for every nucleic acid sample that is or ever has been in the inventory. Each nucleic acid sample is associated with a "source" tissue sample, which is indicated in the TId column.
Always use the NUCACIDS view in place of this table. It contains additional related columns which may be of interest.
This table records a nucleic acid sample's current location using the LocId column. Values in this column constrain and are constrained by values in the TISSUE_DATA.LocId column, and may or may not be unique, as discussed in the LOCATIONS table.
The Name_on_Tube column indicates whatever "name" or other identifying information is recorded on the tube. Because of labeling errors or misidentification in the field, this value may not indicate the true identity of the individual from whom this sample came.
To see the "true" identity of this individual, see the related line in the TISSUE_DATA table. This information is also provided in the NUCACIDS view.
Two columns in this table record information related to the sample's creation: Creation_Date and Creation_Method. See also the related table, NUCACID_CREATORS. In laboratory vernacular, the term "extraction" is usually favored over "creation" for most nucleic acid sample types. However, some samples are not "extracted" and are instead generated via a laboratory procedure (e.g. reverse transcription, dilution, PCR amplification, etc.). Because of this, the generic term "creation" is used here.
A sample's Creation_Date cannot be before the source tissue's Collection_Date, nor before the source individual's Entrydate, if any. It may often be redundant to verify that Creation_Date is on or after both dates, but this redundancy is intended, as discussed above.
This table attempts to keep an ongoing record of a sample's current volume in the Actual_Vol_ul column. It is left to the user to judge this column's accuracy, which depends greatly on 1) how diligently the lab personnel keep the data manager(s) informed of changes, and 2) the amount of time that has passed since this volume was determined[118]. To assist users in making these judgments, the date that the Actual_Vol_ul was last updated is recorded in the Actual_Vol_Date column. A sample's current volume cannot be recorded without also recording this date; the Actual_Vol_ul and Actual_Vol_Date columns must both be NULL or both be non-NULL.
A sample cannot have its current volume determined before the sample was created; the Actual_Vol_Date must be on or after the sample's Creation_Date.
It is unlikely, though not impossible, that a sample's volume might increase after its creation. The system will report a warning when a sample's Actual_Vol_ul is greater than its Initial_Vol_ul.
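The three volume rules above differ in severity: two are errors and one is only a warning. A minimal Python sketch of that distinction follows. The function name `check_volume` and the message texts are hypothetical, `None` stands in for NULL, and skipping a comparison when a value is unknown is an assumption of this example.

```python
from datetime import date


def check_volume(initial_vol, actual_vol, actual_vol_date, creation_date):
    """Return (errors, warnings) for one sample's volume columns."""
    errors, warnings = [], []
    # Actual_Vol_ul and Actual_Vol_Date must be both NULL or both non-NULL.
    if (actual_vol is None) != (actual_vol_date is None):
        errors.append("Actual_Vol_ul and Actual_Vol_Date must both be "
                      "NULL or both be non-NULL")
    # The volume cannot be determined before the sample was created.
    if (actual_vol_date is not None and creation_date is not None
            and actual_vol_date < creation_date):
        errors.append("Actual_Vol_Date is before Creation_Date")
    # An increase in volume is unlikely but possible: warn, do not reject.
    if (actual_vol is not None and initial_vol is not None
            and actual_vol > initial_vol):
        warnings.append("Actual_Vol_ul is greater than Initial_Vol_ul")
    return errors, warnings


errors, warnings = check_volume(
    initial_vol=50.0, actual_vol=60.0,
    actual_vol_date=date(2021, 1, 5), creation_date=date(2021, 1, 1),
)
```

In this call the row is acceptable (no errors) but draws one warning, because the recorded volume has grown since the sample was created.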
A unique identifier. This is an automatically generated sequential number that uniquely identifies the row.
This column is automatically maintained by the database, cannot be changed, and must not be NULL.
The TISSUE_DATA.TId of the tissue sample from which this nucleic acid sample originated.
This column may not be NULL.
The LOCATIONS.LocId indicating the current locale and location of the nucleic acid sample.
This column may not be NULL.
The name of the source individual, according to the label on the tube.
This column may be NULL, when there is no identifying information on the tube. This column may not be empty; it must contain at least one non-whitespace character.
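The "may not be empty" rule recurs for many text columns in this section, sometimes combined with "may be NULL" and sometimes not. It can be sketched as a small Python check. This is an illustration only: `text_value_ok` is a hypothetical name, `None` stands in for NULL, and Python's definition of whitespace in `str.strip()` may differ from the one the database uses.

```python
def text_value_ok(value, nullable=True):
    """Check a text column value against the 'not empty' rule.

    NULL (None) is acceptable only when the column is nullable.
    A non-NULL value needs at least one non-whitespace character.
    """
    if value is None:
        return nullable
    return value.strip() != ""


ok_missing_label = text_value_ok(None)    # no label on the tube
ok_blank_label = text_value_ok("   ")     # whitespace only
```

A tube with no label is recorded as NULL, not as an empty or blank string.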
The NUCACID_TYPES.NucAcid_Type of this nucleic acid sample.
This column may not be NULL.
The date that this nucleic acid sample was created. When the process to generate a sample lasts more than one day, this is the date that the procedure was completed.
This column may be NULL, when the creation date is unknown.
The NUCACID_CREATION_METHODS.Creation_Method describing how this nucleic acid sample was created.
This column may not be NULL.
The sample's volume, in microliters, when it was first created.
This column may be NULL, when the initial volume is unknown.
The sample's volume, in microliters, as of the Actual_Vol_Date.
This column may be NULL, when users have not updated the sample's "current" volume or when the sample has not yet been used.
The date that the Actual_Vol_ul was determined.
This column may be NULL, when users have not updated the sample's "current" volume or when the sample has not yet been used.
Comments or miscellaneous information about this nucleic acid sample.
This column may be NULL. This column may not be empty; it must contain at least one non-whitespace character.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
This table contains one row for every name or ID used only at a specific institution (an ID that is "local" to that institution) to describe a particular nucleic acid.
Identity of samples is maintained by the system as much as possible, but when working with samples in the laboratory this is often inconvenient or impractical. Different groups and institutions often have their own systems for giving unique names to their samples, and while these names may be useful and meaningful for humans, they are mostly unhelpful from the database's perspective. They're vulnerable to typos, and can be very confusing when a sample is shared between institutions. However, these "local names" remain important for the people who are actually using these samples, so these identifiers are recorded in this table, one per nucleic acid sample, per institution.
Every combination of NAId and Institution must be unique; an NAId cannot go by more than one local name at the same Institution.
Every combination of Institution and LocalId must be unique; the same local name cannot be used at a single Institution more than once.
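The two uniqueness rules work in opposite directions: one limits each sample to a single local name per institution, and the other limits each local name to a single sample per institution. A short Python sketch makes the difference concrete. The function name `local_id_violations`, the institution codes, and the local names are all hypothetical.

```python
def local_id_violations(rows):
    """Check (NAId, Institution, LocalId) rows against both rules."""
    seen_sample, seen_name, problems = set(), set(), []
    for na_id, institution, local_id in rows:
        # Rule 1: one local name per NAId per Institution.
        if (na_id, institution) in seen_sample:
            problems.append(
                f"NAId {na_id} already has a local name at {institution}")
        # Rule 2: one NAId per LocalId per Institution.
        if (institution, local_id) in seen_name:
            problems.append(
                f"{local_id!r} is already used at {institution}")
        seen_sample.add((na_id, institution))
        seen_name.add((institution, local_id))
    return problems


rows = [
    (1, "LAB_A", "DNA-7"),
    (1, "LAB_B", "DNA-7"),   # fine: same name, different institution
    (2, "LAB_A", "DNA-7"),   # breaks rule 2
    (1, "LAB_A", "X1"),      # breaks rule 1
]
problems = local_id_violations(rows)
```

Note that the same local name may be reused at a different institution, which is part of why these names are unreliable as identifiers once a sample is shared.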
The NUCACID_DATA.NAId of the nucleic acid sample.
This column may not be NULL.
The INSTITUTIONS.Institution indicating the organization or research group at which this NAId's name is used.
This column may not be NULL.
The local name used for this NAId at this Institution.
This column may not be NULL.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
This table contains one row for every nucleic acid sample having another nucleic acid as its source.
Often, nucleic acid samples are created through some "extraction" process in which the nucleic acids are purified from a tissue sample (e.g. a blood draw, a buccal swab, etc.). However, there are also numerous methods by which nucleic acid samples may instead be created from another nucleic acid sample (e.g. PCR[119], reverse transcription, dilution, etc.). In addition to recording the identity of the source nucleic acid, this table includes the Relationship column, which indicates the nature of the connection between the row's nucleic acid and its source nucleic acid. This relationship may be simple enough to explain in a single word (e.g. "DILUTION"), or complex enough to require a lengthy explanation. To allow this flexibility, Relationship is not constrained to a set of legal values in a support table.
A nucleic acid sample cannot indicate itself as its source; the NAId and Source_NAId cannot be equal.
A nucleic acid sample cannot have more than one other sample as its source; this table's NAId column is unique.
A nucleic acid cannot have been created before its source; the related Creation_Date of this NAId must be on or after the Source_NAId's related Creation_Date.
Although a nucleic acid sample may have been generated from another nucleic acid sample, there will always be a single tissue sample from which both the nucleic acid samples originated; both samples' related NUCACID_DATA.TId's must be equal.
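The four rules above can be gathered into one hypothetical check, sketched here in Python. The function name `source_problems`, the data structures, and the message texts are assumptions for illustration, as is the decision to skip the date comparison when either Creation_Date is NULL (`None`).

```python
from datetime import date


def source_problems(na_id, source_na_id, samples, existing_sources):
    """Check a proposed (NAId, Source_NAId) row.

    samples: dict NAId -> (TId, Creation_Date or None), a stand-in
        for NUCACID_DATA
    existing_sources: dict NAId -> Source_NAId already recorded
    """
    problems = []
    if na_id == source_na_id:
        problems.append("a sample cannot be its own source")
    if na_id in existing_sources:
        problems.append("this NAId already has a source")
    tid, created = samples[na_id]
    src_tid, src_created = samples[source_na_id]
    if (created is not None and src_created is not None
            and created < src_created):
        problems.append("sample was created before its source")
    if tid != src_tid:
        problems.append("sample and source come from different tissues")
    return problems


samples = {
    10: (1, date(2020, 1, 1)),    # extracted from tissue 1
    11: (1, date(2020, 2, 1)),    # a later dilution of sample 10
    12: (2, date(2019, 12, 1)),   # from a different tissue
}
ok = source_problems(11, 10, samples, {})
bad = source_problems(12, 10, samples, {12: 11})
```

The first call returns no problems; the second reports three, because sample 12 already has a source, predates sample 10, and comes from a different tissue sample.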
The NUCACID_DATA.NAId of the nucleic acid that has another nucleic acid as its source.
This column may not be NULL.
The NUCACID_DATA.NAId of the source nucleic acid.
This column may not be NULL.
A textual description of how this nucleic acid and its source are connected.
This column may not be NULL. This column may not be empty; it must contain at least one non-whitespace character.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
This table contains one row for every population under observation, and/or from which tissue or nucleic acid samples have been collected.
In this context, the term "population" refers to a particular species at a specific location. "The baboons in the Amboseli basin in Kenya", for example, are a population. "All baboons", or "all wildlife in the Amboseli basin", are not.
In the common vernacular, a population is often referred to only by the name of its site, e.g. "Gombe" when referring to the Gombe chimpanzees. Because of this, the Pop_Name and Site columns may seem redundant, but when setting vernacular aside it should be obvious that these two columns contain objectively different information. In practice, users may elect to enter the same value in both of these columns, but the two columns remain independent of each other.
PopId 1 has special meaning to the system. Data integrity rules for the UNIQUE_INDIVS table presume that the population with this PopId is the population whose individuals are recorded in BIOGRAPH. No other code should be created to refer to that population.
A unique identifier. This is an automatically generated sequential number that uniquely identifies the row.
This column is automatically maintained by the database, cannot be changed, and must not be NULL.
The name of the population.
This column may not be NULL. This column may not be empty; it must contain at least one non-whitespace character.
The scientific name of this population's species.
This column may be NULL, when unknown or not applicable. This column may not be empty; it must contain at least one non-whitespace character.
The common name of this population's species.
This column may not be NULL. This column may not be empty; it must contain at least one non-whitespace character.
A code indicating whether the population is wild or captive. The legal values are shown below.
POPULATIONS.Wild_Captive Values
Code | Description |
---|---|
W | Wild. |
C | Captive. |
U | Unknown. |
NA | Not applicable. |
This column may not be NULL.
The location of the population.
This column may not be NULL. This column may not be empty; it must contain at least one non-whitespace character.
Comments or miscellaneous information about this population.
This column may be NULL. This column may not be empty; it must contain at least one non-whitespace character.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
This table contains one row for every tissue sample that is or ever has been in the inventory.
Always use the TISSUES view in place of this table. It contains additional related columns which may be of interest.
This table records a tissue sample's current location using the LocId column. Values in this column constrain and are constrained by values in the NUCACID_DATA.LocId column, and may or may not be unique, as discussed in the LOCATIONS table.
If a sample was collected from an individual in BIOGRAPH (that is, if the related UNIQUE_INDIVS.UIId has a PopId of 1), the sample's Collection_Date must be on or after that individual's Entrydate. Depending on the sample's Tissue_Type, the Collection_Date may also be constrained by the individual's Statdate. See TISSUE_TYPES for more information.
The system will return a warning if a sample's Collection_Date is after the individual's Statdate, but only when the sample's Tissue_Type indicates that the Collection_Date is not constrained by the individual's Statdate; that is, when the related TISSUE_TYPES.Max_After_Statdate is NULL.
From time to time, field observers may mistakenly record the wrong collection date on a tube. To help identify when this has occurred, the system uses the CENSUS table to confirm whether the Collection_Date is a date that the individual was actually observed[120]. The result of that confirmation is indicated in the Collection_Date_Status column.
When a sample's Collection_Date is not a Date on which the individual was recorded present in CENSUS, the Collection_Date is not necessarily "wrong". There are numerous circumstances in which a sample may have been collected without a census being performed. Still, the absence of a related row in CENSUS is suspicious, so it elicits a warning. That is, the system will return a warning when a tissue sample's Collection_Date_Status is 1.
Do not assume that the date written on a sample's label will always match the Collection_Date. When data managers determine that the date written on a label is erroneous, they may be able to determine the true date and update the Collection_Date as needed.
A unique identifier for the tissue sample. This is an automatically generated sequential number that uniquely identifies the row.
This column is automatically maintained by the database, cannot be changed, and must not be NULL.
The UNIQUE_INDIVS.UIId of the individual from whom this tissue sample was collected.
This column may not be NULL.
The LOCATIONS.LocId indicating the current locale and location of the sample.
This column may not be NULL.
The name of the individual from whom this tissue sample was collected, according to the label on the tube.
This column may be NULL, when there is no identifying information on the tube. This column may not be empty; it must contain at least one non-whitespace character.
The date the sample was collected or originated.
This column may be NULL, when the date is unknown.
The time the sample was collected or originated.
This column may be NULL, when the time is unknown.
The STORAGE_MEDIA.Storage_Medium in which the sample is stored.
This column may not be NULL.
The MISID_STATUSES.Misid_Status of this tissue sample.
This column may not be NULL.
A code indicating whether this row's Collection_Date is or isn't plausible according to available CENSUS data. The legal values are:
Code | Description |
---|---|
0 | This individual is part of the main population and has a non-"absent" CENSUS row on this Collection_Date, OR this individual is not part of the main population and we have no basis to question the accuracy of this Collection_Date |
1 | This Collection_Date is NULL, OR this individual is part of the main population and either i) has no CENSUS rows on this Collection_Date or ii) has only "absent" censuses on this Collection_Date |
This column is automatically maintained by the database and may not be NULL. Attempts to manually populate or update this column are silently ignored.
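The decision described by the two codes can be sketched in Python as follows. This is a hedged illustration, not the system's implementation: the function name `collection_date_status`, the list of census statuses, and the literal string "absent" used to mark an absent census are all hypothetical.

```python
from datetime import date


def collection_date_status(collection_date, in_main_population,
                           census_statuses):
    """Return 0 (plausible) or 1 (questionable) for a tissue sample.

    census_statuses: the statuses of this individual's CENSUS rows on
        the Collection_Date; the string "absent" marks an absent census
        (a hypothetical encoding chosen for this example)
    """
    if collection_date is None:
        return 1    # a NULL Collection_Date is always code 1
    if not in_main_population:
        return 0    # no census data to question the date with
    present = any(status != "absent" for status in census_statuses)
    return 0 if present else 1


day = date(2015, 6, 1)
no_census = collection_date_status(day, True, [])
only_absent = collection_date_status(day, True, ["absent"])
seen = collection_date_status(day, True, ["absent", "present"])
```

A single non-"absent" census on the date is enough for code 0, even if other censuses that day recorded the individual as absent.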
Comments or miscellaneous information about this tissue sample.
This column may be NULL. This column may not be empty; it must contain at least one non-whitespace character.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
This table contains one row for every name or ID used only at a specific institution (an ID that is "local" to that institution) to describe a particular tissue sample.
For more details about the reason for this table and the difference between a "local" name/identifier and an ID generated by the database, see the discussion for the NUCACID_LOCAL_IDS table.
Every combination of TId and Institution must be unique; a TId cannot go by more than one name at the same Institution.
Every combination of Institution and LocalId must be unique; the same local name cannot be used at a single Institution to describe more than one sample.
The INSTITUTIONS.Institution indicating the locale in which this TId's name is used.
This column may not be NULL.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
This table contains one row for every tissue sample having another tissue as its source.
In addition to recording the identity of the source tissue, this table includes the Relationship column, which indicates the nature of the connection between the row's tissue and its source tissue. This relationship may be simple enough to explain in a single word (e.g. "ALIQUOT"), or complex enough to require a lengthy explanation. To allow this flexibility, Relationship is not constrained to a set of legal values in a support table.
A tissue sample cannot indicate itself as its source; the TId and Source_TId cannot be equal.
A tissue sample cannot have more than one other sample as its source; this table's TId column is unique[121].
A tissue sample cannot have been collected before its source; the related Collection_Date of this TId must be on or after the Source_TId's related Collection_Date.
Depending on the details of the Relationship, a tissue sample and its source may or may not be from the same individual. The system does not require that the related UIId's of the TId and Source_TId be equal. However, the system will return a warning when they are not equal.
The TISSUE_DATA.TId of the tissue sample that has another tissue sample as its source.
This column may not be NULL.
The TISSUE_DATA.TId of the source tissue sample.
This column may not be NULL.
A textual description of how this tissue sample and its source are connected.
This column may not be NULL. This column may not be empty; it must contain at least one non-whitespace character.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
This table contains one row for every individual under observation, and every individual from whom tissue or nucleic acid samples have been collected.
In contrast to BIOGRAPH, which records the identities of every individual in the main study population[122], this table also records the identities of all the individuals in other populations from whom there are tissue or nucleic acid samples recorded in the inventory. All individuals in BIOGRAPH are also included in this table, whether or not tissue or nucleic acid samples exist in the inventory. This presents a problem: there are two tables that separately track the identities of all individuals in the main population. To address this, the triggers have been written to ensure that BIOGRAPH retains primary authority over all individuals in the main population.
Management of individuals in the main population is done by BIOGRAPH (see its discussion for more information), so the ability to perform inserts/updates/deletes in this table for those individuals is heavily constrained, as follows:
Inserting rows for individuals in the main population is only allowed for the unknown individual or for individuals in BIOGRAPH who have not yet been added to this table[123].
The unknown individual's row can only be updated or deleted by an administrator.
Deleting rows for individuals in the main population is only allowed for individuals who are no longer in BIOGRAPH[124].
Updating rows for individuals in the main population is only allowed when changing only the Notes column.
Any individual's PopId cannot be updated to add or remove the individual from the main population.
Do not manually insert or delete rows in this table for individuals in BIOGRAPH. Perform those actions in BIOGRAPH, and the action will automatically be performed in this table, as well. Manual inserts and deletes in this table should only be done for individuals who are not in BIOGRAPH.
The IndivId column is used to record the individual's name or similar ID. Study projects and research institutions each have their own rules of nomenclature for their individuals, so this might be a lengthy name, an abbreviation, a series of numbers, or some mix of these. This value is not unique; the same identifier may be used more than once across different populations. However, per PopId, each IndivId must be unique; a population cannot use the same identifier more than once.
PopId 1 is the population recorded in BIOGRAPH, so any row with this PopId (with a few exceptions, discussed below) must use the individual's Bioid as its IndivId.
IndivId UNKNOWN indicates the unknown individual, and is allowed to have PopId 1 and not be a Bioid.
IndivId MULTIPLE is used to indicate when a TISSUE_DATA row includes samples from multiple individuals. It is allowed to have PopId 1 and not be a Bioid.
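The IndivId rules, including the two exceptions, can be sketched in Python. Everything named here is hypothetical: the function `individ_problems`, the set of Bioids represented as strings, and the example identifiers. The sketch covers only the naming rules, not the insert, update, and delete restrictions described earlier.

```python
MAIN_POPID = 1
SPECIAL_INDIVIDS = {"UNKNOWN", "MULTIPLE"}


def individ_problems(indiv_id, pop_id, bioids, existing):
    """Check a proposed (IndivId, PopId) pair.

    bioids: set of BIOGRAPH Bioid values, as strings (an assumption)
    existing: set of (PopId, IndivId) pairs already in the table
    """
    problems = []
    # Each IndivId must be unique within its population.
    if (pop_id, indiv_id) in existing:
        problems.append("IndivId already used in this population")
    # In the main population, IndivId must be a Bioid unless it is
    # one of the two special values.
    if (pop_id == MAIN_POPID and indiv_id not in SPECIAL_INDIVIDS
            and indiv_id not in bioids):
        problems.append("IndivId must be a Bioid for PopId 1")
    return problems


bioids = {"101", "102"}
existing = {(1, "101"), (2, "Frodo")}
reused_elsewhere = individ_problems("Frodo", 3, bioids, existing)
special = individ_problems("UNKNOWN", 1, bioids, existing)
not_a_bioid = individ_problems("Frodo", 1, bioids, existing)
```

The identifier "Frodo" is acceptable in population 3 even though population 2 already uses it, but it is rejected for PopId 1 because it is not a Bioid.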
A unique identifier for the individual. This is an automatically generated sequential number that uniquely identifies the row.
This column is automatically maintained by the database, cannot be changed, and must not be NULL.
The name/identifier for this individual.
This column may not be NULL. This column may not be empty; it must contain at least one non-whitespace character.
The POPULATIONS.PopId of the individual's population.
This column may not be NULL.
Comments or miscellaneous information about this individual.
This column may be NULL. This column may not be empty; it must contain at least one non-whitespace character.
The timestamp range during which this row's data are considered valid. See The Sys_Period Column for more information.
[116] Also "tissue" and "tissue sample", but those two terms aren't terribly different anyway.
[117] This is expected to be the highest plausible accuracy to ever be used for the concentrations stored in this table. This can easily be expanded if needed.
[118] Even in the coldest of cold storage, frozen samples will slowly evaporate over time. A 100-μL sample that is frozen and stored for 5 years is unlikely to still be the full 100 μL at the end of that time.
[119] It is presumed that any reader who cares enough about nucleic acid samples to read this documentation is already familiar with the polymerase chain reaction. We will not attempt to explain it here.
[120] Admittedly, this approach is imperfect and likely underestimates the true prevalence of the problem. The date written on a sample may not be the true date it was collected but may still be a date that the individual was censused. Unfortunately, there is little else that the system can do to recognize when this occurs.
[121] In real life, this rule could easily be violated. Something is going to need to change, before long.
[122] That is, the population whose data are recorded throughout the many tables in Babase.