Abstract
In the approach to compositional data analysis originated by John Aitchison, a set of linearly independent logratios (i.e., ratios of compositional parts, logarithmically transformed) explains all the variability in a compositional data set. Such a set of ratios can be represented by an acyclic connected graph of all the parts, with the number of edges one less than the number of parts. There are many such candidate sets of ratios, each of which explains 100% of the compositional logratio variance. A simple choice is to use additive logratios, and it is demonstrated how to identify one set that can serve as a substitute for the original data set in the sense of best approximating the essential multivariate structure. When all pairwise ratios of parts are candidates for selection, a smaller set of ratios, which explains as much variability as required to reveal the underlying structure of the data, can be determined by automatic selection, preferably assisted by expert knowledge. Conventional univariate statistical summary measures as well as multivariate methods can be applied to these ratios. Such a selection of a small set of ratios also implies the choice of a subset of parts, that is, a subcomposition, which explains a maximum percentage of variance. This approach of ratio selection, designed to simplify the task of the practitioner, is illustrated on an archaeometric data set as well as on three further data sets in the Appendix. Comparisons are also made with existing proposals for selecting variables in compositional data analysis.









References
Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc B 44:139–177
Aitchison J (1983) Principal component analysis of compositional data. Biometrika 70:57–65
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London. Reprinted in 2003 with additional material by Blackburn Press
Aitchison J (1990) Relative variation diagrams for describing patterns of compositional variability. Math Geol 22(4):487–511
Aitchison J (1992) On criteria for measures of compositional difference. Math Geol 24:365–379
Aitchison J (1994) Principles of compositional data analysis. In: Anderson TW, Olkin I, Fang KT (eds) Multivariate analysis and its applications. Institute of Mathematical Statistics, Hayward, pp 73–81
Aitchison J (2003) Compositional data analysis: where are we and where should we be heading? In: Proceedings of the compositional data analysis workshop, CoDaWork’03, Girona, Spain. CD-format, ISBN 84-8458-111-X
Aitchison J (2005) A concise guide to compositional data analysis. http://ima.udg.edu/Activitats/CoDaWork05/A_concise_guide_to_compositional_data_analysis.pdf. Accessed 29 May 2018
Aitchison J, Egozcue JJ (2005) The statistical analysis of compositional data: where are we and where should we be heading? Math Geol 37:829–850
Aitchison J, Greenacre MJ (2002) Biplots for compositional data. J R Stat Soc Ser C (Appl Stat) 51:375–392
Aitchison J, Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2000) Logratio analysis and compositional distance. Math Geol 32:271–275
Bacon-Shone J (2011) A short history of compositional data analysis. In: Pawlowsky-Glahn V, Buccianti A (eds) Compositional data analysis: theory and applications. Wiley, Chichester, pp 3–11
Baxter MJ, Cool HEM, Heyworth MP (1990) Principal component and correspondence analysis of compositional data: some similarities. J Appl Stat 17:229–235
Baxter MJ, Beardah CC, Cool HEM, Jackson CM (2005) Compositional data analysis of some alkaline glasses. Math Geol 37:183–196
Benzécri J-P (1973) Analyse des Données. Tome II, Analyses des Correspondances. Dunod, Paris
Bóna M (2006) A walk through combinatorics: an introduction to enumeration and graph theory, 2nd edn. World Scientific Publishing, Singapore
Box GEP, Cox DR (1964) An analysis of transformations. J R Stat Soc B 26:211–252
Cortés J (2009) On the Harker variation diagrams; a comment on “The statistical analysis of compositional data. Where are we and where should we be heading?” by Aitchison and Egozcue (2005). Math Geosci 41:817–828
Dijksterhuis G, Frøst MB, Byrne DV (2002) Selection of a subset of variables: minimisation of Procrustes loss between a subset and the full set. Food Qual Prefer 13:89–97
Filzmoser P, Hron K, Reimann C (2009) Univariate statistical analysis of environmental (compositional) data: problems and possibilities. Sci Total Environ 407:6100–6108
Gittins R (1985) Canonical analysis: a review with applications in ecology. Springer, New York
Gower JC, Dijksterhuis GB (2004) Procrustes problems. Oxford University Press, Oxford
Greenacre MJ (2009) Power transformations in correspondence analysis. Comput Stat Data Anal 53:3107–3116
Greenacre MJ (2010a) Logratio analysis is a limiting case of correspondence analysis. Math Geosci 42:129–134
Greenacre MJ (2010b) Biplots in practice. BBVA Foundation, Bilbao. www.multivariatestatistics.org. Accessed 29 May 2018
Greenacre MJ (2011a) Measuring subcompositional incoherence. Math Geosci 43:681–693
Greenacre MJ (2011b) Compositional data and correspondence analysis. In: Pawlowsky-Glahn V, Buccianti A (eds) Compositional data analysis: theory and applications. Wiley, Chichester, pp 104–113
Greenacre MJ (2013) Contribution biplots. J Comput Graph Stat 22:107–122
Greenacre MJ (2016) Correspondence analysis in practice, 3rd edn. Chapman & Hall/CRC, Boca Raton
Greenacre MJ, Lewi PJ (2009) Distributional equivalence and subcompositional coherence in the analysis of compositional data, contingency tables and ratio-scale measurements. J Classif 26:29–64
Harary F, Palmer EM (1973) Graphical enumeration. Academic Press, New York
Harker A (1909) Natural history of the igneous rocks. Methuen, London
Hron K, Filzmoser P, Donevska S, Fišerová E (2013) Covariance-based variable selection for compositional data. Math Geosci 45:487–498
Hron K, Filzmoser P, de Caritat P, Fišerová E, Gardlo A (2017) Weighted pivot coordinates for compositional data and their application to geochemical mapping. Math Geosci 49:777–796
Kraft A, Graeve M, Janssen D, Greenacre MJ, Falk-Petersen S (2015) Arctic pelagic amphipods: lipid dynamics and life strategy. J Plank Res 37:790–807
Krzanowski WJ (1987) Selection of variables to preserve multivariate data structure, using principal components. Appl Stat 36:22–33
Krzanowski WJ (2000) Principles of multivariate analysis: a user’s perspective. Oxford University Press, Oxford
Legendre P, Legendre L (2012) Numerical ecology, 3rd edn. Elsevier, Amsterdam
Lewi PJ (1976) Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim Forsch (Drug Res) 26:1295–1300
Lewi PJ (1980) Multivariate data analysis in APL. In: van der Linden GA (ed) Proceedings of APL-80 conference. North-Holland, Amsterdam, pp 267–271
Lewi PJ (1989) Spectral map analysis. Factorial analysis of contrasts, especially from log ratios. Chemometr Intell Lab 5:105–116
Lewi PJ (2005) Spectral mapping, a personal and historical account of an adventure in multivariate data analysis. Chemometr Intell Lab 77:215–223
Lovell D, Müller W, Taylor J, Zwart A, Helliwell C (2011) Proportions, percentages, ppm: do the molecular biosciences treat compositional data right? In: Pawlowsky-Glahn V, Buccianti A (eds) Compositional data analysis: theory and applications. Wiley, Chichester, pp 193–207
Martín-Fernández JA, Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2018) Advances in principal balances for compositional data. Math Geosci 50:273–298
Mert MC, Filzmoser P, Hron K (2015) Sparse principal balances. Stat Model 15:159–174
Murtagh F (1984) Counting dendrograms: a survey. Discrete Appl Math 7:191–199
Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Wagner H (2015) vegan: community ecology package. R package version 2.3-2. https://CRAN.R-project.org/package=vegan. Accessed 11 June 2018
Pawlowsky-Glahn V, Buccianti A (eds) (2011) Compositional data analysis. Wiley, Chichester
Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2007) Lecture notes on compositional data analysis. http://dugi-doc.udg.edu/bitstream/handle/10256/297/CoDa-book.pdf?sequence=1. Accessed 11 June 2018
Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2015) Modeling and analysis of compositional data. Wiley, Chichester
Rao CR (1964) The use and interpretation of principal component analysis in applied research. Sankhya A 26:329–358
R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Tanimoto S, Rehren T (2008) Interactions between silicate and salt melts in LBA glassmaking. J Archaeol Sci 35:2566–2573
van den Boogaart KG, Tolosana-Delgado R (2013) Analyzing compositional data with R. Springer, Berlin
Wollenberg AL (1977) Redundancy analysis—an alternative for canonical analysis. Psychometrika 42:207–219
Wouters L, Göhlmann HW, Bijnens L, Kass SU, Molenberghs G, Lewi PJ (2003) Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 59:1131–1139
Acknowledgements
This work is dedicated to the memory of John Aitchison who passed away in December 2016 and whom I met when he gave a seminar in Girona, Catalonia, in 2000. He started his talk with a slide containing a single blank triangle, following which, it was like the scales fell from my eyes.
Appendix
A.1 Three Additional Data Sets
Three more data sets are analyzed to demonstrate the benefit of using ALRs as a substitute for the full compositional data set. Two of these compositional data sets are taken from Aitchison (2005) and the third one is considered by Greenacre (2016) in the context of CA. For each data set, the sets of ALRs are computed, using each part in turn as the reference in the denominator. The set of ALRs that leads to inter-case distances that best match the logratio distances, using the Procrustes correlation as the criterion, is then identified.
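The search over reference parts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: parts are unweighted, the logratio distances are represented by Euclidean distances between centered logratio (CLR) rows, each ALR set is padded with the zero column of its own reference so that both configurations have the same shape, and the Procrustes correlation follows Appendix A.2 (with \( E = 2 - 2\sum \alpha \), which follows from steps 1 to 5 there). The function names and the synthetic data are hypothetical and are not taken from the article.

```python
import numpy as np

def procrustes_correlation(F1, F2):
    """Procrustes correlation r = sqrt(1 - E) of two equally shaped configurations."""
    F1 = F1 - F1.mean(axis=0)
    F2 = F2 - F2.mean(axis=0)
    F1 = F1 / np.linalg.norm(F1)   # Frobenius norm = sqrt(trace(F^T F))
    F2 = F2 / np.linalg.norm(F2)
    alpha = np.linalg.svd(F1.T @ F2, compute_uv=False)
    E = 2.0 - 2.0 * alpha.sum()    # squared error after the optimal rotation
    # Clipping at zero is a guard added for this sketch, not part of the source.
    return float(np.sqrt(max(0.0, 1.0 - E)))

def alr_reference_scores(X):
    """Procrustes correlation of each ALR set with the CLR configuration."""
    logX = np.log(np.asarray(X, dtype=float))
    clr = logX - logX.mean(axis=1, keepdims=True)
    scores = []
    for j in range(logX.shape[1]):
        alr = logX - logX[:, [j]]  # ALRs w.r.t. part j; column j is all zeros
        scores.append(procrustes_correlation(clr, alr))
    return np.array(scores)

# Synthetic composition (assumption): part 0 equals the geometric mean of the
# other parts in every row, so its CLR is constant and it is an ideal reference.
rng = np.random.default_rng(1)
Z = rng.normal(size=(30, 5))
Z[:, 0] = Z[:, 1:].mean(axis=1)
X = np.exp(Z)
X = X / X.sum(axis=1, keepdims=True)

scores = alr_reference_scores(X)
best = int(np.argmax(scores))      # index of the best reference part
```

On real data the scores would be inspected alongside subject-matter considerations, since several references may give nearly the same correlation.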
Data Set 1 (Aitchison 2005)
Mineral compositions: 21 samples, 8 minerals
qu = quartz; or = orthoclase; al = albite; an = anorthite; en = enstatite; ma = magnetite; il = ilmenite; ap = apatite
The ALRs with respect to quartz (qu) give the best agreement—the Procrustes correlation (between full space configurations) is equal to 0.995. Figure 10 shows the two-dimensional LRA based on all 28 logratios alongside the PCA of the 7 ALRs, showing the almost identical configurations of sample points.
Data Set 2 (Aitchison 2005)
Activity pattern of a statistician: 20 days, 6 activities
te = teaching; co = consultation; ad = administration; re = research; ot = other wakeful activities; sl = sleep
The ALRs with respect to sleep (sl) give the best agreement: the Procrustes correlation (between full space configurations) is equal to 0.960. Figure 11 shows the two-dimensional LRA based on all 15 logratios alongside the PCA of the 5 ALRs, showing the highly similar configurations of sample points. The first dimension of the ALR analysis accounts for a much higher percentage of variance, similar to the glass cup example in the main text, suggesting that there is only one relevant dimension and that the LRA is inflated with redundant variance.
Data Set 3 (see Greenacre 2016, Appendix E)
Fatty acid data: 42 samples, 25 fatty acids with nonzero values
This data set consists of groups of marine organisms collected in three different seasons. The ALRs with respect to fatty acid 16:0 give the best agreement to the multivariate structure—the Procrustes correlation (between full space configurations) is equal to 0.989. Figure 12 shows the two-dimensional LRA based on all 300 logratios alongside the PCA of the 24 ALRs, showing the similar groupings of the three seasonal subsets of data, separated by the ALR analysis just as well as by the LRA. The four ratios that stand out in the contribution biplot on the right are made up of the four parts prominently radiating out from the centre in the LRA on the left, expressed relative to the more centrally located fatty acid 16:0 (Fig. 12).
Fig. 13 Graph of the set of logratios in the first column of Table 4
A.2 Procrustes Analysis and Procrustes Correlation
The following matrix formulation summarizes the computations required:
Suppose F1 and F2 are two n × p matrices of coordinates defining two configurations of the same n labelled points in separate p-dimensional spaces (the cross-product and difference in the steps below require the two matrices to have the same shape). Both matrices are column-centered (i.e., column means are zero). Then the following steps lead to the Procrustes correlation.
1. Normalize both matrices: \( \mathbf{F}_{1}^{*} = \mathbf{F}_{1} / \sqrt{\operatorname{trace}(\mathbf{F}_{1}^{\text{T}} \mathbf{F}_{1})}, \quad \mathbf{F}_{2}^{*} = \mathbf{F}_{2} / \sqrt{\operatorname{trace}(\mathbf{F}_{2}^{\text{T}} \mathbf{F}_{2})} \)
2. Compute cross-product matrix: \( \mathbf{S} = \mathbf{F}_{1}^{*\text{T}} \mathbf{F}_{2}^{*} \)
3. Perform singular value decomposition (SVD): \( \mathbf{S} = \mathbf{U} \mathbf{D}_{\alpha} \mathbf{V}^{\text{T}} \)
4. Procrustes rotation matrix: \( \mathbf{Q} = \mathbf{V} \mathbf{U}^{\text{T}} \)
5. Sum of squared errors between normalized coordinates after rotation of the second matrix:
$$ E = \operatorname{trace}[(\mathbf{F}_{1}^{*} - \mathbf{F}_{2}^{*} \mathbf{Q})^{\text{T}} (\mathbf{F}_{1}^{*} - \mathbf{F}_{2}^{*} \mathbf{Q})] $$
6. Procrustes correlation: \( r = \sqrt{1 - E} \)
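As a rough illustration, the six steps can be transcribed almost literally into NumPy. The function name is hypothetical, the two inputs are assumed to have identical shapes, and the clipping of \( 1 - E \) at zero is a guard added for this sketch rather than part of the procedure above.

```python
import numpy as np

def procrustes_correlation(F1, F2):
    """Return (r, E, alpha) for two configurations of the same labelled points."""
    F1 = np.asarray(F1, dtype=float)
    F2 = np.asarray(F2, dtype=float)
    if F1.shape != F2.shape:
        raise ValueError("both configurations must have the same shape")
    # The text assumes column-centred input; centring again is harmless.
    F1 = F1 - F1.mean(axis=0)
    F2 = F2 - F2.mean(axis=0)
    # Step 1: normalize each matrix to unit sum of squares.
    F1s = F1 / np.sqrt(np.trace(F1.T @ F1))
    F2s = F2 / np.sqrt(np.trace(F2.T @ F2))
    # Step 2: cross-product matrix.
    S = F1s.T @ F2s
    # Step 3: SVD, S = U diag(alpha) V^T (NumPy returns V^T directly).
    U, alpha, Vt = np.linalg.svd(S)
    # Step 4: Procrustes rotation matrix Q = V U^T.
    Q = Vt.T @ U.T
    # Step 5: sum of squared errors after rotating the second configuration.
    R = F1s - F2s @ Q
    E = float(np.trace(R.T @ R))
    # Step 6: Procrustes correlation (clipped at zero; guard added here).
    r = float(np.sqrt(max(0.0, 1.0 - E)))
    return r, E, alpha
```

Because both normalized matrices have unit sum of squares and \( \operatorname{trace}(\mathbf{S}\mathbf{Q}) \) equals the sum of the singular values, the error in step 5 reduces to \( E = 2 - 2\sum \alpha \), which gives a quick check on the computation. A configuration compared with a rotated copy of itself yields \( E = 0 \) and \( r = 1 \).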
A.3 Comparison of the Present Logratio Approach with the Principal Balances of Martín-Fernández et al. (2018)
Martín-Fernández et al. (2018) developed an algorithm for a stepwise selection of ILR balances, by successively partitioning the parts using an exhaustive search at each step of this divisive algorithm. They apply their method to the ten-part Aar Massif geochemical data set from the book by van den Boogaart and Tolosana-Delgado (2013), and their approach uses unweighted parts, which is the present practice of the CODA school. A major difference between their approach and the one in the present article is that they do not use variance explained in the sense used here, but rather “variance contained” in, or “variance contributed” to, the logratio variance (they sometimes use the term “variance explained”, but mean “variance contained”). This is a weaker criterion than the variance explained one that is proposed in the present study, because the part of variance contributed by a logratio or a balance is measured in isolation from the variability in the rest of the data set (see Sect. 3.5 of the article for more explanation). Thus, in order to compare our results with those of Martín-Fernández et al. (2018), the explained variances had to be computed for the sequence of ILR balances published in that paper (Table 4, columns 3 and 4). In addition, the simpler approach of selecting logratios proposed in the present study was executed (Table 4, columns 1 and 2; see Fig. 13 for a graph of these ratios). As a further comparison, the simple logratios of amalgamated parts, using the same partitioning sequence as the ILR balances, were also computed together with their explained variances; these can be termed “amalgamation balances” (Table 4, column 5). Finally, the variances explained by the principal component axes (i.e., dimensions of the unweighted LRA of the data), which are the optimal explained variances, are reproduced (Table 4, columns 6 and 7).
Note that these last explained variances are the only ones where the definition of variance explained is equivalent to variance contained.
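The distinction between the two criteria can be made concrete with a small numerical sketch. It assumes, since Sect. 3.5 is not reproduced here, that “variance explained” is the share of the total variance of the centered logratio (CLR) matrix reproduced by least-squares regression on the selected variables (in the spirit of redundancy analysis), and that “variance contained” is the variance of a unit-norm log-contrast taken on its own. Function names and data are hypothetical.

```python
import numpy as np

def centred_clr(X):
    """Column-centred CLR matrix of a positive data matrix X (rows = samples)."""
    logX = np.log(np.asarray(X, dtype=float))
    Y = logX - logX.mean(axis=1, keepdims=True)
    return Y - Y.mean(axis=0)

def variance_contained(Y, b):
    """Share of total variance carried by z = Y b alone (b: unit-norm contrast)."""
    z = Y @ b
    return float(z @ z / (Y ** 2).sum())

def variance_explained(Y, Z):
    """Share of total variance of Y reproduced by regressing Y on the columns of Z."""
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    Z = Z - Z.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    fitted = Z @ coef
    return float((fitted ** 2).sum() / (Y ** 2).sum())

# Synthetic five-part composition (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.lognormal(size=(30, 5))
X = X / X.sum(axis=1, keepdims=True)
Y = centred_clr(X)

# A single pairwise logratio, scaled like a two-part ILR balance.
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
contained = variance_contained(Y, b)
explained = variance_explained(Y, Y @ b)

# A complete set of ALRs (last part as reference) explains all the variance.
alrs = Y[:, :-1] - Y[:, [-1]]
explained_all = variance_explained(Y, alrs)

# First principal axis: here the two measures coincide.
v1 = np.linalg.svd(Y, full_matrices=False)[2][0]
```

Under these assumptions the explained share of a single contrast is never smaller than its contained share, and the two agree when the contrast is a principal axis, which is consistent with the note above.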
The results are also presented graphically in Fig. 14 in the style of Table 3 of Martín-Fernández et al. (2018). The results have been graphed in two separate figures for clarity. In both, the PCA sequence of cumulative explained variances is shown to give common reference points. In the left-hand figure, the first ILR balance, involving nine of the ten parts, explains 1.5 percentage points more variance than the first logratio Na2O/MgO, which involves only two parts. At steps 3, 4 and 5, the simple logratio sequence is superior to the ILR sequence, after which the two sequences converge. In the right-hand figure, the ILR balance sequence is superior to the amalgamation balance sequence for the first two steps, but afterwards they are practically identical. Notice that the amalgamation balance sequence does not necessarily reach exactly 100% variance explained, but in this example it reaches 99.97% variance explained using 9 balances, lacking only 0.03%.
Fig. 14 Plots of cumulative variances in Table 4. The optimal values, obtained by PCA, are shown in both plots as a reference for comparison
In conclusion, this is another example where the sequence of simple logratios seems perfectly adequate to explain the variance of the whole compositional data set. They are comparable to the ILR sequence in terms of explained variance, sometimes even outperforming it, and are much easier to compute and interpret. Using amalgamations instead of geometric means is an alternative way of defining balances, and these also have an easier interpretation in practice.
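To make the two kinds of balance concrete, the sketch below computes both for a given split of parts into a numerator group and a denominator group. The ILR balance uses the standard normalized logratio of geometric means; the amalgamation balance is taken here to be the plain logratio of the summed parts, which is an interpretation of the description above. Function names and the tiny data matrix are hypothetical.

```python
import numpy as np

def ilr_balance(X, num, den):
    """ILR balance: sqrt(rs/(r+s)) * log(geometric mean(num) / geometric mean(den))."""
    logX = np.log(np.asarray(X, dtype=float))
    r, s = len(num), len(den)
    return np.sqrt(r * s / (r + s)) * (
        logX[:, num].mean(axis=1) - logX[:, den].mean(axis=1)
    )

def amalgamation_balance(X, num, den):
    """Logratio of amalgamated (summed) parts: log(sum(num) / sum(den))."""
    X = np.asarray(X, dtype=float)
    return np.log(X[:, num].sum(axis=1) / X[:, den].sum(axis=1))

# Two four-part compositions (made-up values, rows sum to 1).
X = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.4, 0.3, 0.2, 0.1]])

ilr = ilr_balance(X, num=[0, 1], den=[2, 3])
amal = amalgamation_balance(X, num=[0, 1], den=[2, 3])
```

With one part in each group the two definitions differ only by the factor \( 1/\sqrt{2} \); with larger groups they generally differ because sums and geometric means respond differently to small parts.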
A.4 Simulation Study of the Ward Dendrogram as Parts are Sequentially Randomized
The idea of this simulation is to study how the dendrogram from the Ward clustering breaks down as parts are sequentially randomized (i.e., columns are randomly permuted) to simulate growing random noise in the data set. The values of each part (i.e., oxide in the Roman glass cups data set) are permuted in turn, the data reclosed and the Ward clustering repeated. The parts are randomized in order, from the part with the smallest part of variance to the part with the largest (the parts being randomized are shown in the boxes next to the dendrograms). Figure 15 is read in horizontal steps: after three parts are randomized the structure is still fairly stable, but it starts to break down once the fourth part is randomized. The element Si is kept fixed throughout, but by the last randomization, the whole data set has been effectively converted to noise.
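The randomization scheme itself can be sketched as follows. Several details are assumptions made for illustration: each part's “part of variance” is approximated by the unweighted variance of its centered logratio (CLR), the permutations are applied cumulatively to the current data, and the fixed part is simply never permuted. The function names and the synthetic data are hypothetical, and the clustering step is only indicated in a comment.

```python
import numpy as np

def close(X):
    """Reclose: rescale every row to sum to 1."""
    return X / X.sum(axis=1, keepdims=True)

def clr_variances(X):
    """Variance of each part's CLR (assumed proxy for its part of variance)."""
    logX = np.log(X)
    Y = logX - logX.mean(axis=1, keepdims=True)
    return Y.var(axis=0)

def sequential_randomization(X, keep_fixed, rng):
    """Permute parts one at a time, lowest variance first, reclosing each time.

    Returns a list of (part_index, data_after_this_step) tuples.
    """
    current = close(np.asarray(X, dtype=float))
    order = [int(j) for j in np.argsort(clr_variances(current)) if j != keep_fixed]
    steps = []
    for j in order:
        current = current.copy()
        current[:, j] = rng.permutation(current[:, j])  # randomize one part
        current = close(current)                        # reclose the data
        # Ward clustering of the samples would be repeated here on `current`.
        steps.append((j, current))
    return steps

# Synthetic six-part data set (assumption); part 0 plays the role of Si.
rng = np.random.default_rng(0)
X = rng.lognormal(size=(12, 6))
steps = sequential_randomization(X, keep_fixed=0, rng=rng)
```

Each element of `steps` holds the data after one more part has been randomized, so a dendrogram could be drawn per step to reproduce the kind of sequence shown in Fig. 15.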
Cite this article
Greenacre, M. Variable Selection in Compositional Data Analysis Using Pairwise Logratios. Math Geosci 51, 649–682 (2019). https://doi.org/10.1007/s11004-018-9754-x
DOI: https://doi.org/10.1007/s11004-018-9754-x