Medical Bioinformatics in Cytomics
- Multiparameter flow cytometric data analysis is typically directed towards the determination of cell frequencies within one or several multidimensional gates. An essential part of the potentially useful information, such as fluorescence intensities, average fluorescence surface densities, intercolour fluorescence ratios, coefficients of variation of the fluorescence, light scatter intensities, or scatter and fluorescence ratio distributions of the remaining cell populations, frequently remains unconsidered.
- The goal of data pattern classification in
cytomics
(cell systems research) concerns the exhaustive
knowledge extraction from all available flow cytometric
(single cell molecular profiling) or
other multiparameter data by the determination of the
most discriminatory data patterns for individualized disease
course predictions or diagnostics.
- CLASSIF1 algorithmic data sieving
(fig.1)
and data pattern classification permit the development
of standardized, instrument and laboratory independent
data pattern classifiers from flow cytometric list mode, flow bead array,
high content image analysis, cDNA (Lymphochip, Affymetrix) or protein
expression chip arrays,
clinical chemistry, biomedical or clinical data for
predictive medicine
by cytomics
or for diagnostic purposes
(literature references:
bioinformatics,
medical and clinical cytomics,
further references
)
- Numeric data values are transformed into
triple matrix characters
(fig.2)
to permit subsequent data pattern classification.
- Lower and upper percentiles, such as the 10% and 90% percentiles
(fig.3),
are calculated for each data column of the reference patients.
- Data column values are transformed
(fig.3)
by assigning:
(- =diminished)
to values below the lower percentile,
(0 =unchanged)
to values between the lower and upper percentiles and
(+ =increased)
to values above the upper percentile.
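
The transformation described above can be sketched in a few lines of Python. This is a minimal illustration and not the CLASSIF1 implementation: the function names, the example values and the percentile interpolation rule (the "inclusive" method of the standard library) are assumptions, since the text does not specify how the percentiles are computed.

```python
from statistics import quantiles

def percentile_limits(reference_values, lower=10, upper=90):
    """Lower and upper percentile limits of one reference data column."""
    # quantiles(n=100) returns 99 cut points; entry p-1 is the p-th percentile.
    # The "inclusive" interpolation method is an assumption of this sketch.
    cuts = quantiles(reference_values, n=100, method="inclusive")
    return cuts[lower - 1], cuts[upper - 1]

def to_triple(value, limits):
    """Map a numeric value onto '-', '0' or '+' relative to the limits."""
    low, high = limits
    if value < low:
        return "-"   # diminished
    if value > high:
        return "+"   # increased
    return "0"       # unchanged

# Hypothetical reference column (values 1..11) and three patient values.
reference_column = [float(v) for v in range(1, 12)]
limits = percentile_limits(reference_column)
pattern = "".join(to_triple(v, limits) for v in [1.0, 6.0, 11.0])
print(limits, pattern)
```

With these hypothetical values the limits are 2.0 and 10.0, so the three patient values are encoded as "-0+".
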
- Disease classification masks
(fig.3a)
for each classification category are determined from the most
frequent triple matrix character in each data column of the learning set.
Individual patients are classified according to the highest positional
coincidence between the patient classification mask and any
one of the disease classification masks.
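
A possible reading of the mask construction and of the classification by positional coincidence is sketched below. The category labels N and R, the example patterns and the tie handling are illustrative assumptions and are not taken from the CLASSIF1 program.

```python
from collections import Counter

def category_mask(patterns):
    """Most frequent triple matrix character in each column of one category."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*patterns)
    )

def coincidence(pattern, mask):
    """Number of positions at which a patient pattern and a mask agree."""
    return sum(p == m for p, m in zip(pattern, mask))

def classify(pattern, masks):
    """Category whose mask has the highest positional coincidence.

    On equal coincidence the first category in dictionary order is returned,
    which is an assumption of this sketch.
    """
    return max(masks, key=lambda category: coincidence(pattern, masks[category]))

# Hypothetical learning set: three triple matrix patterns per category.
learning_set = {
    "N": ["000", "00-", "000"],
    "R": ["++0", "+++", "+0+"],
}
masks = {category: category_mask(p) for category, p in learning_set.items()}
result = classify("++0", masks)
print(masks, result)
```
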
- A classification (confusion) matrix is established between the
known predictive or diagnostic clinical classification of reference
and abnormal patient samples against the same
classification categories for the CLASSIF1 triple matrix
classification. An ideal classification result is characterized by 100%
specificity and sensitivity values in the diagonal boxes
as well as for the negative and positive predictive values,
while the values in all other boxes are 0%
(fig.4).
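
For a two-category case, the quantities of the classification matrix can be computed as in the following sketch. The labels N (reference) and R (abnormal) and the example classifications are hypothetical; the formulas are the standard definitions of specificity, sensitivity and predictive values.

```python
def two_by_two(known, predicted, reference="N", abnormal="R"):
    """Counts of the 2x2 classification matrix as (tn, fp, fn, tp)."""
    pairs = list(zip(known, predicted))
    tn = pairs.count((reference, reference))  # reference recognized as reference
    fp = pairs.count((reference, abnormal))   # reference classified as abnormal
    fn = pairs.count((abnormal, reference))   # abnormal classified as reference
    tp = pairs.count((abnormal, abnormal))    # abnormal recognized as abnormal
    return tn, fp, fn, tp

def percent(part, whole):
    """Percentage, or NaN if the denominator is empty."""
    return 100.0 * part / whole if whole else float("nan")

def summary(known, predicted):
    tn, fp, fn, tp = two_by_two(known, predicted)
    return {
        "specificity": percent(tn, tn + fp),
        "sensitivity": percent(tp, tp + fn),
        "negative predictive value": percent(tn, tn + fn),
        "positive predictive value": percent(tp, tp + fp),
    }

# Hypothetical clinical classification versus triple matrix classification.
known = ["N"] * 5 + ["R"] * 4
predicted = ["N", "N", "N", "N", "R", "R", "R", "R", "N"]
result = summary(known, predicted)
ideal = summary(known, known)
print(result)
```

In the ideal case (`ideal`) all four values are 100%, which corresponds to 0% in the off-diagonal boxes.
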
- The data columns, as in the case of myocardial infarction risk
assessment, are obtained by flow cytometric determination of
activation antigens on
peripheral blood thrombocytes.
Four databases, each containing eleven parameters
(fig.5)
were calculated from the thrombocyte evaluation windows (gates) in
two parameter histograms.
They were obtained by projection of flow cytometric
four parameter list mode data either onto the forward (FSC)/sideward (SSC)
light scatter or onto the FSC/fluorescence_1 plane to calculate the expression
of thrombocyte surface antigens CD62, CD63 and thrombospondin as
well as of spontaneously attached IgG on normal individuals
or angiographically identified myocardial risk patients.
- Correct classification for all patient samples of
the learning set as well as for the unknown test set
(validation set) of patients is achieved
(fig.6A,6B),
indicating that the CLASSIF1 algorithm has identified a
suitable discriminatory data pattern of thrombocyte parameters
(fig.7).
The selected parameters are statistically
significantly different between normal individuals and
myocardial risk patients
(fig.8).
- The unknown test set of patients was defined prior to
learning as data records 1,5,10,15... etc of the normal individuals
as well as of the myocardial risk patients and remained hidden to
the learning process.
As an embedded test set, it was measured under very
similar conditions to the samples of the learning set.
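
The separation of an embedded test set before learning can be sketched as follows. The rule used here (record 1 and every fifth record thereafter, counted from 1) is one reading of the sequence 1,5,10,15 given above and is an assumption; the record names are hypothetical.

```python
def is_test_record(record_number):
    """True for records 1, 5, 10, 15, ... (1-based numbering)."""
    return record_number == 1 or record_number % 5 == 0

def split_records(records):
    """Split the records of one category into learning set and test set."""
    learning, test = [], []
    for number, record in enumerate(records, start=1):
        (test if is_test_record(number) else learning).append(record)
    return learning, test

# Hypothetical records of one category (for example the normal individuals).
records = [f"patient_{n}" for n in range(1, 21)]
learning_set, test_set = split_records(records)
print(test_set)
```

The split is applied to each category separately, so that the test records remain hidden from the learning process.
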
- The principle of data pattern classification
(fig.9)
assures classification accuracy as well
as classification multiplicity to take care of the many
combinatorial possibilities between genotype and
internal or external exposure influences on observed
molecular cell phenotypes.
- Data patterns can be seen as the heat map of a
virtual data pattern chip with (-)=decreased, (0)=unchanged and
(+)=increased categories, substituting for the color code
green, yellow and red.
- The optimization of the initial classification result
that includes all data columns
(fig.10A)
is achieved by maximizing the sum of values in the
diagonal boxes of the classification matrix
(fig.10B).
The classification process reduces the number of informative
classification parameters in this example from initially 44
to finally 5 data columns.
- For this purpose, the most frequent triple matrix character of
each classification category and of each database column
is inserted into the category (disease) classification masks
of the first triple matrix.
Data columns without differences, that is without discrimination between
classification categories, are permanently removed from further
consideration. This provides the second triple matrix, which is used
for the subsequent iterative optimization process
(fig.11).
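
The step from the first to the second triple matrix can be illustrated with a short sketch. The mask strings and function names are hypothetical; the only rule taken from the text is that columns whose mask character is identical in all categories are dropped.

```python
def informative_columns(masks):
    """Indices of columns whose mask character differs between categories."""
    columns = zip(*masks.values())
    return [i for i, chars in enumerate(columns) if len(set(chars)) > 1]

def reduce_pattern(pattern, keep):
    """Restrict a triple matrix pattern to the retained columns."""
    return "".join(pattern[i] for i in keep)

# Hypothetical category masks of the first triple matrix.
first_masks = {"N": "0000", "R": "+0-0"}
keep = informative_columns(first_masks)
second_masks = {c: reduce_pattern(m, keep) for c, m in first_masks.items()}
print(keep, second_masks)
```
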
- Data records (patients) are classified according
to the highest positional coincidence of the individual
data record (patient) classification mask with any one of
the category (disease) classification masks.
- Multiple classifications occur at equal positional
coincidence with more than one of the category classification
masks
(for example record#17 (N,R) of the first classification
mask fig.10).
They may represent transitional states in disease
or classification errors, for example in the case of small
learning sets or of comparatively small differences
in the selected parameters among the different categories.
They may also occur in the first triple matrix
prior to the iterative removal of
non-informative parameters.
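
A multiple classification such as (N,R) arises when a record pattern has the same positional coincidence with several category masks. The following sketch reports all tied categories; the masks and the record pattern are hypothetical.

```python
def best_categories(pattern, masks):
    """All categories sharing the highest positional coincidence."""
    scores = {
        category: sum(p == m for p, m in zip(pattern, mask))
        for category, mask in masks.items()
    }
    top = max(scores.values())
    return [category for category in masks if scores[category] == top]

masks = {"N": "000", "R": "+++"}
unique = best_categories("++0", masks)     # two positions agree with R, one with N
multiple = best_categories("+0-", masks)   # one position agrees with each mask
print(",".join(unique), ",".join(multiple))
```
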
- The classification result is iteratively refined
by temporary removal of single database columns or variable combinations
of two database columns from the classification
process, followed by reclassification. It is recorded whether the
temporary absence of the column(s) has improved or deteriorated the
classification result.
- The columns are then reinserted and the next data columns are temporarily
removed until the positive or negative contribution of all data
columns to the classification process is known.
- Data columns having improved the classification result by their
absence are removed from further consideration. The remaining data
columns represent the category classification masks.
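
One pass of this optimization can be sketched as below. The sketch is simplified and rests on several assumptions: only single columns are removed (the combinations of two columns mentioned above are omitted), the category masks are kept fixed, a record with equal coincidence to several masks counts as not correctly classified, and the classification result is scored as the sum of the percentages of correctly classified records per category. All names and data are hypothetical.

```python
def classify(pattern, masks, columns):
    """Unique best category over the given columns, or None on a tie."""
    scores = {c: sum(pattern[i] == m[i] for i in columns) for c, m in masks.items()}
    top = max(scores.values())
    winners = [c for c, s in scores.items() if s == top]
    return winners[0] if len(winners) == 1 else None

def diagonal_sum(records, masks, columns):
    """Sum of the percentages of correctly classified records per category."""
    total = 0.0
    for category in masks:
        patterns = [p for known, p in records if known == category]
        correct = sum(classify(p, masks, columns) == category for p in patterns)
        total += 100.0 * correct / len(patterns)
    return total

def prune_columns(records, masks, columns):
    """Drop every column whose temporary absence improves the result."""
    baseline = diagonal_sum(records, masks, columns)
    improved_by_absence = []
    for column in columns:
        remaining = [i for i in columns if i != column]  # temporary removal
        if diagonal_sum(records, masks, remaining) > baseline:
            improved_by_absence.append(column)
        # the column is back in place for the next trial
    return [i for i in columns if i not in improved_by_absence]

masks = {"N": "0000", "R": "++++"}
records = [("N", "0000"), ("N", "0--+"), ("R", "++++"), ("R", "+++0")]
kept = prune_columns(records, masks, [0, 1, 2, 3])
print(kept)
```

In this example only the absence of the last column resolves the ambiguous record "0--+", so that column is removed and columns 0 to 2 are retained.
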
- The data records of the learning set
are reclassified against the category classification masks to
assess the achieved discrimination for the various classification categories
(fig.12).
- The data records of the unknown test (validation) set
are subsequently classified to verify the robustness of
classification for unknown data records
(fig.13).
- The classification operations described in chapters 3 and 4
are performed unsupervised, that is automatically, by the CLASSIF1
algorithm and do not require human intervention once the classification
process has been started.
- The classification of data records, unknown to a learned
classifier, is important to avoid erroneous classifications
of random statistical aberrations.
- To assess the susceptibility of the CLASSIF1 algorithm to
the detection of random statistical aberrations, a 133 column
wide data set of 40 data records was generated using a
random number generator and kindly made available by
Dr.W.Meyer and PD Dr.G.Haroske
(Pathologisches Institut, Universität Dresden). The data columns had
different means and coefficients of variation between 0.97% and 25.7%
(CV=100*standard deviation/mean). For the classification, the
data records were assigned alternately in sequence to either the arbitrary category#1
or category#2, resulting in 20 category#1 and 20
category#2 records.
- Records 1,5,10,15,20 of each category were assigned to the
unknown test (validation) set prior to the learning phase. They remained
hidden during the learning phase. This left a learning set of
15 category#1 and 15 category#2 data records.
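
A comparable random data set can be constructed as in the following sketch. The text does not state the distribution or the column means, so Gaussian values and means between 50 and 150 are assumptions, as are the seed and all names; only the dimensions, the CV range, the alternating category assignment and the test record numbers are taken from the description above.

```python
import random

def random_data_set(n_records=40, n_columns=133, seed=0):
    """Records of random values with column-specific means and CVs."""
    rng = random.Random(seed)
    means = [rng.uniform(50.0, 150.0) for _ in range(n_columns)]  # assumed range
    cvs = [rng.uniform(0.97, 25.7) for _ in range(n_columns)]     # CV in percent
    return [
        # CV = 100 * SD / mean, hence SD = mean * CV / 100
        [rng.gauss(m, m * cv / 100.0) for m, cv in zip(means, cvs)]
        for _ in range(n_records)
    ]

data = random_data_set()

# Alternate assignment of the records to the arbitrary categories 1 and 2.
categories = {1: data[0::2], 2: data[1::2]}

# Records 1,5,10,15,20 of each category form the hidden test set.
test_numbers = {1, 5, 10, 15, 20}
learning = {
    c: [r for n, r in enumerate(rows, start=1) if n not in test_numbers]
    for c, rows in categories.items()
}
test = {
    c: [r for n, r in enumerate(rows, start=1) if n in test_numbers]
    for c, rows in categories.items()
}
print(len(learning[1]), len(learning[2]), len(test[1]), len(test[2]))
```
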
- The classification of the learning set by the CLASSIF1 algorithm
provided a specificity of 100% for the correct
recognition of category#1 records at a sensitivity of 40%
for the recognition of category#2 data records
(fig.14A).
Parameters #29,#77,#133
(fig.15)
were selected with means ± standard deviation (SD) of 73.2±6.6/72.8±11.2,
53.3±8.0/59.0±11.3, 25.0±3.7/25.0±5.9, with no statistically significant
difference between the category#1 and category#2 means of the selected
data columns.
- The classification of the unknown test set of records
resulted in a low specificity of 33.7% for the correct
recognition of category#1 records and a low sensitivity of
20.0% for category#2 records
(fig.14B).
- The triple matrix display of the
learning set
(fig.16)
and of the unknown test set
(fig.17)
shows the low quality of the
classification of the random number data set.
- The result emphasizes the well known fact that the
discrimination of random statistical aberrations in a learning
set typically collapses during the classification of
unknown test sets.
- This contrasts with the robustness of classification in the case of
existing molecular differences, as in the discrimination of
risk patients for myocardial infarction by increased thrombocyte activation
antigens CD62, CD63 and thrombospondin
(fig.5)
where learning and test set patients are in >95% of the cases
correctly classified and statistically significant differences
of the selected parameters exist between both patient groups
(fig.7).
- Triple matrix classifiers are inherently standardized onto the
reference samples during the classification process
(standardized multiparameter data classification (SMDC)).
Classifiers can therefore be compared in an instrument and
laboratory independent way, in case no differences between the
various reference groups are detected by the CLASSIF1 algorithm.
This is advantageous for consensus formation e.g. on leukemia,
HIV and thrombocyte classifications by immunophenotyping.
- The identity of reference groups from different
institutions for classification purposes is assured by proving
that the various reference groups cannot be discriminated by
data pattern classification from each other.
- The systematic analysis of patient
classification masks provides information on individual
genotypic and exposure influences on
expressed data patterns. Such analysis may prove useful for
the development of a relational classification system similar
to the periodic system of elements. In such a system different
cell types and their activity states could be compared in a
standardised way for example during disease development,
under therapy but also during cell division, cell differentiation
or cell migration.
- The performance of triple matrix classifiers depends
on intralaboratory precision rather than on
accuracy since measurement accuracy cancels out
within certain limits through the normalization of the experimental
values on the respective mean values of the reference samples in each
database column. Reference groups are typically constituted
from age and sex matched patients.
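
The cancellation of measurement inaccuracy by normalization can be illustrated for the simplest case, a constant multiplicative factor between two instruments. This error model, the factor and all values are assumptions made for illustration only.

```python
from statistics import mean

def normalize(values, reference_values):
    """Express values relative to the mean of the reference samples."""
    reference_mean = mean(reference_values)
    return [v / reference_mean for v in values]

# Hypothetical column measured in laboratory A.
lab_a_reference = [10.0, 12.0, 14.0]
lab_a_patient = [18.0, 6.0]

# The same samples on an instrument in laboratory B that reads 2.5 times higher.
gain = 2.5
lab_b_reference = [gain * v for v in lab_a_reference]
lab_b_patient = [gain * v for v in lab_a_patient]

normalized_a = normalize(lab_a_patient, lab_a_reference)
normalized_b = normalize(lab_b_patient, lab_b_reference)
print(normalized_a, normalized_b)
```

Both laboratories obtain the same normalized values although their absolute readings differ, whereas an imprecise measurement within one laboratory would change the result.
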
- FCS1.0 and 2.0 list mode files or dBase3 database exports from database (e.g. Access) or spreadsheet programs (e.g. Excel) are classified by the CLASSIF1 algorithm in various clinical situations:
- the CLASSIF1 algorithm provides access to
predictive medicine
with >95% correct disease
course prediction in individual patients as well as to
standardized diagnostic classifications.
- the CLASSIF1 approach facilitates the elaboration of
interlaboratory consensus classifiers in complex
multiparameter data sieving or data mining analysis.
- as a practical consequence, diseases
can be classified at institutions where sufficient learning sets cannot
be generated in a reasonable time or where costly investigations are
necessary to establish appropriate learning sets.
- furthermore the molecular and biochemical properties of
many body cell systems during disease can be compared
by standardized classification e.g. blood leukocytes versus
tissue or effusion leukocytes.
|
CD1 1996(1)
ISBN: none |
CD2 1996(2)
ISSN: 1091-2037 |
CD3 1997(1)
ISBN: 1-890473-02-2 |
CD4 1997(2)
ISBN: 1-890473-03-0 |
CD5 2000
ISBN: 1-890475-05-7 |
CD6 2002
ISBN: 0-97117498-3-3 |
CD7 2003
ISBN: 0-97117498-8-4 |
CD8 2004
ISBN: 1-890473-C6-5 |
| © 2009 G.Valet |