Variable Selection in Compositional Data Analysis Using Pairwise Logratios

Greenacre, Michael

doi:10.1007/s11004-018-9754-x

Published: 09 July 2018

Mathematical Geosciences, Volume 51, pages 649–682 (2019)

Michael Greenacre^1,2 (ORCID: orcid.org/0000-0002-0054-3131)

Abstract

In the approach to compositional data analysis originated by John Aitchison, a set of linearly independent logratios (i.e., ratios of compositional parts, logarithmically transformed) explains all the variability in a compositional data set. Such a set of ratios can be represented by an acyclic connected graph of all the parts, with edges one less than the number of parts. There are many such candidate sets of ratios, each of which explains 100% of the compositional logratio variance. A simple choice consists in using additive logratios, and it is demonstrated how to identify one set that can serve as a substitute for the original data set in the sense of best approximating the essential multivariate structure. When all pairwise ratios of parts are candidates for selection, a smaller set of ratios can be determined by automatic selection, but preferably assisted by expert knowledge, which explains as much variability as required to reveal the underlying structure of the data. Conventional univariate statistical summary measures as well as multivariate methods can be applied to these ratios. Such a selection of a small set of ratios also implies the choice of a subset of parts, that is, a subcomposition, which explains a maximum percentage of variance. This approach of ratio selection, designed to simplify the task of the practitioner, is illustrated on an archaeometric data set as well as three further data sets in an “Appendix”. Comparisons are also made with existing proposals for selecting variables in compositional data analysis.

References

Aitchison J (1982) The statistical analysis of compositional data (with discussion). J R Stat Soc B 44:139–177
Aitchison J (1983) Principal component analysis of compositional data. Biometrika 70:57–65
Aitchison J (1986) The statistical analysis of compositional data. Chapman & Hall, London. Reprinted in 2003 with additional material by Blackburn Press
Aitchison J (1990) Relative variation diagrams for describing patterns of compositional variability. Math Geol 22(4):487–511
Aitchison J (1992) On criteria for measures of compositional difference. Math Geol 24:365–379
Aitchison J (1994) Principles of compositional data analysis. In: Anderson TW, Olkin I, Fang KT (eds) Multivariate analysis and its applications. Institute of Mathematical Statistics, Hayward, pp 73–81
Aitchison J (2003) Compositional data analysis: where are we and where should we be heading? In: Proceedings of the compositional data analysis workshop, CoDaWork’03, Girona, Spain. CD-format, ISBN 84-8458-111-X
Aitchison J (2005) A concise guide to compositional data analysis. http://ima.udg.edu/Activitats/CoDaWork05/A_concise_guide_to_compositional_data_analysis.pdf. Accessed 29 May 2018
Aitchison J, Egozcue JJ (2005) The statistical analysis of compositional data: where are we and where should we be heading? Math Geol 37:829–850
Aitchison J, Greenacre MJ (2002) Biplots for compositional data. J R Stat Soc Ser C (Appl Stat) 51:375–392
Aitchison J, Barceló-Vidal C, Martín-Fernández JA, Pawlowsky-Glahn V (2000) Logratio analysis and compositional distance. Math Geol 32:271–275
Bacon-Shone J (2011) A short history of compositional data analysis. In: Pawlowsky-Glahn V, Buccianti A (eds) Compositional data analysis: theory and applications. Wiley, Chichester, pp 3–11
Baxter MJ, Cool HEM, Heyworth MP (1990) Principal component and correspondence analysis of compositional data: some similarities. J Appl Stat 17:229–235
Baxter MJ, Beardah CC, Cool HEM, Jackson CM (2005) Compositional data analysis of some alkaline glasses. Math Geol 37:183–196
Benzécri J-P (1973) Analyse des Données. Tome II, Analyses des Correspondances. Dunod, Paris
Bóna M (2006) A walk through combinatorics: an introduction to enumeration and graph theory, 2nd edn. World Scientific Publishing, Singapore
Box GEP, Cox DR (1964) An analysis of transformations. J R Stat Soc B 26:211–252
Cortés J (2009) On the Harker variation diagrams; a comment on “The statistical analysis of compositional data. Where are we and where should we be heading?” by Aitchison and Egozcue (2005). Math Geosci 41:817–828
Dijksterhuis G, Frøst MB, Byrne DV (2002) Selection of a subset of variables: minimisation of Procrustes loss between a subset and the full set. Food Qual Prefer 13:89–97
Filzmoser P, Hron K, Reimann C (2009) Univariate statistical analysis of environmental (compositional) data: problems and possibilities. Sci Total Environ 407:6100–6108
Gittins R (1985) Canonical analysis: a review with applications in ecology. Springer, New York
Gower JC, Dijksterhuis GB (2004) Procrustes problems. Oxford University Press, Oxford
Greenacre MJ (2009) Power transformations in correspondence analysis. Comput Stat Data Anal 53:3107–3116
Greenacre MJ (2010a) Logratio analysis is a limiting case of correspondence analysis. Math Geosci 42:129–134
Greenacre MJ (2010b) Biplots in practice. BBVA Foundation, Bilbao. www.multivariatestatistics.org. Accessed 29 May 2018
Greenacre MJ (2011a) Measuring subcompositional incoherence. Math Geosci 43:681–693
Greenacre MJ (2011b) Compositional data and correspondence analysis. In: Pawlowsky-Glahn V, Buccianti A (eds) Compositional data analysis: theory and applications. Wiley, Chichester, pp 104–113
Greenacre MJ (2013) Contribution biplots. J Comput Graph Stat 22:107–122
Greenacre MJ (2016) Correspondence analysis in practice, 3rd edn. Chapman & Hall/CRC, Boca Raton
Greenacre MJ, Lewi PJ (2009) Distributional equivalence and subcompositional coherence in the analysis of compositional data, contingency tables and ratio-scale measurements. J Classif 26:29–64
Harary F, Palmer EM (1973) Graphical enumeration. Academic Press, New York
Harker A (1909) Natural history of the igneous rocks. Methuen, London
Hron K, Filzmoser P, Donevska S, Fišerová E (2013) Covariance-based variable selection for compositional data. Math Geosci 45:487–498
Hron K, Filzmoser P, de Caritat P, Fišerová E, Gardlo A (2017) Weighted pivot coordinates for compositional data and their application to geochemical mapping. Math Geosci 49:777–796
Kraft A, Graeve M, Janssen D, Greenacre MJ, Falk-Petersen S (2015) Arctic pelagic amphipods: lipid dynamics and life strategy. J Plank Res 37:790–807
Krzanowski WJ (1987) Selection of variables to preserve multivariate data structure, using principal components. Appl Stat 36:22–33
Krzanowski WJ (2000) Principles of multivariate analysis: a user’s perspective. Oxford University Press, Oxford
Legendre P, Legendre L (2012) Numerical ecology, 3rd edn. Elsevier, Amsterdam
Lewi PJ (1976) Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim Forsch (Drug Res) 26:1295–1300
Lewi PJ (1980) Multivariate data analysis in APL. In: van der Linden GA (ed) Proceedings of APL-80 conference. North-Holland, Amsterdam, pp 267–271
Lewi PJ (1989) Spectral map analysis. Factorial analysis of contrasts, especially from log ratios. Chemometr Intell Lab 5:105–116
Lewi PJ (2005) Spectral mapping, a personal and historical account of an adventure in multivariate data analysis. Chemometr Intell Lab 77:215–223
Lovell D, Müller W, Taylor J, Zwart A, Helliwell C (2011) Proportions, percentages, ppm: do the molecular biosciences treat compositional data right? In: Pawlowsky-Glahn V, Buccianti A (eds) Compositional data analysis: theory and applications. Wiley, Chichester, pp 193–207
Martín-Fernández JA, Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2018) Advances in principal balances for compositional data. Math Geosci 50:273–298
Mert MC, Filzmoser P, Hron K (2015) Sparse principal balances. Stat Model 15:159–174
Murtagh F (1984) Counting dendrograms: a survey. Discrete Appl Math 7:191–199
Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Wagner H (2015) vegan: community ecology package. R package version 2.3-2. https://CRAN.R-project.org/package=vegan. Accessed 11 June 2018
Pawlowsky-Glahn V, Buccianti A (eds) (2011) Compositional data analysis: theory and applications. Wiley, Chichester
Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2007) Lecture notes on compositional data analysis. http://dugi-doc.udg.edu/bitstream/handle/10256/297/CoDa-book.pdf?sequence=1. Accessed 11 June 2018
Pawlowsky-Glahn V, Egozcue JJ, Tolosana-Delgado R (2015) Modeling and analysis of compositional data. Wiley, Chichester
Rao CR (1964) The use and interpretation of principal component analysis in applied research. Sankhya A 26:329–358
Tanimoto S, Rehren T (2008) Interactions between silicate and salt melts in LBA glassmaking. J Archaeol Sci 35:2566–2573
R Core Team (2015) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
van den Boogaart KG, Tolosana-Delgado R (2013) Analyzing compositional data with R. Springer, Berlin
Wollenberg AL (1977) Redundancy analysis—an alternative for canonical analysis. Psychometrika 42:207–219
Wouters L, Göhlmann HW, Bijnens L, Kass SU, Molenberghs G, Lewi PJ (2003) Graphical exploration of gene expression data: a comparative study of three multivariate methods. Biometrics 59:1131–1139

Acknowledgements

This work is dedicated to the memory of John Aitchison who passed away in December 2016 and whom I met when he gave a seminar in Girona, Catalonia, in 2000. He started his talk with a slide containing a single blank triangle, following which, it was like the scales fell from my eyes.

Author information

Authors and Affiliations

1. Department of Economics and Business, Universitat Pompeu Fabra, Ramon Trias Fargas 25-27, 08005, Barcelona, Spain (Michael Greenacre)
2. Barcelona Graduate School of Economics, Ramon Trias Fargas 25-27, 08005, Barcelona, Spain (Michael Greenacre)

Corresponding author

Correspondence to Michael Greenacre.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (TXT 2 kb)

Appendix

A.1 Three Additional Data Sets

Three more data sets are analyzed to demonstrate the benefit of using ALRs as a substitute for the full compositional data set. Two of these compositional data sets are taken from Aitchison (2005) and the third is considered by Greenacre (2016) in the context of CA. For each data set, the sets of ALRs are computed, using each part in turn as the reference in the denominator. The set of ALRs that leads to inter-case distances that best match the logratio distances, using the Procrustes correlation as the criterion, is then identified.
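As a minimal sketch of the first step described above, the following Python snippet computes one ALR set per reference part. The function name `alr` and the three-part data are hypothetical and serve only as illustration; they are not taken from the article. Choosing among the candidate sets would additionally require the Procrustes correlation described in Sect. A.2, which is not shown here.

```python
import math

def alr(rows, ref):
    """Additive logratios of each row, with part `ref` as the denominator."""
    return [
        [math.log(x / row[ref]) for j, x in enumerate(row) if j != ref]
        for row in rows
    ]

# Hypothetical 3-part compositions (each row sums to 1); not data from the article.
X = [
    [0.2, 0.3, 0.5],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
]

# One candidate ALR set per reference part: J parts give J sets of J - 1 logratios.
alr_sets = {ref: alr(X, ref) for ref in range(len(X[0]))}
```

Each entry of `alr_sets` is an n × (J − 1) table of logratios that could then be compared with the full logratio structure.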

Data Set 1 (Aitchison 2005)
Mineral compositions: 21 samples, 8 minerals
qu = quartz; or = orthoclase; al = albite; an = anorthite;
en = enstatite; ma = magnetite; il = ilmenite; ap = apatite

The ALRs with respect to quartz (qu) give the best agreement—the Procrustes correlation (between full space configurations) is equal to 0.995. Figure 10 shows the two-dimensional LRA based on all 28 logratios alongside the PCA of the 7 ALRs, showing the almost identical configurations of sample points.

Data Set 2 (Aitchison 2005)
Activity pattern of a statistician: 20 days, 6 activities
te = teaching; co = consultation; ad = administration;
re = research; ot = other wakeful activities; sl = sleep

The ALRs with respect to sleep (sl) give the best agreement: the Procrustes correlation (between full space configurations) is equal to 0.960. Figure 11 shows the two-dimensional LRA based on all 15 logratios alongside the PCA of the 5 ALRs, showing the highly similar configurations of sample points. The first dimension of the ALR analysis accounts for a much higher percentage of variance, similar to the glass cup example in the main text, suggesting that there is only one relevant dimension and that the LRA is inflated with redundant variance.

Data Set 3 (see Greenacre 2016, Appendix E)
Fatty acid data: 42 samples, 25 fatty acids with nonzero values

This data set consists of groups of marine organisms collected in three different seasons. The ALRs with respect to fatty acid 16:0 give the best agreement to the multivariate structure—the Procrustes correlation (between full space configurations) is equal to 0.989. Figure 12 shows the two-dimensional LRA based on all 300 logratios alongside the PCA of the 24 ALRs, showing the similar groupings of the three seasonal subsets of data, separated by the ALR analysis just as well as by the LRA. The four ratios that stand out in the contribution biplot on the right are made up of the four parts prominently radiating out from the centre in the LRA on the left, expressed relative to the more centrally located fatty acid 16:0 (Fig. 12).
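The logratio counts quoted for the three data sets follow from simple counting: a composition with $J$ parts has $J(J-1)/2$ pairwise logratios, and $J-1$ ALRs for any given reference part. The symbol $J$ is introduced here only for this illustration.

```latex
\begin{align*}
J = 8:  \quad & \tfrac{1}{2}\cdot 8 \cdot 7 = 28 \ \text{pairwise logratios}, & 8 - 1 &= 7 \ \text{ALRs} \\
J = 6:  \quad & \tfrac{1}{2}\cdot 6 \cdot 5 = 15 \ \text{pairwise logratios}, & 6 - 1 &= 5 \ \text{ALRs} \\
J = 25: \quad & \tfrac{1}{2}\cdot 25 \cdot 24 = 300 \ \text{pairwise logratios}, & 25 - 1 &= 24 \ \text{ALRs}
\end{align*}
```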

A.2 Procrustes Analysis and Procrustes Correlation

The following matrix formulation summarizes the computations required:

Suppose F₁ and F₂ (both n × p) are two matrices of coordinates defining two configurations of the same n labelled points in separate p-dimensional spaces. Both matrices are column-centered (i.e., column means are zero). Then the following steps lead to the Procrustes correlation.

1. Normalize both matrices: $ \mathbf{F}_{1}^{*} = \mathbf{F}_{1} / \sqrt{\text{trace}(\mathbf{F}_{1}^{\text{T}} \mathbf{F}_{1})}, \quad \mathbf{F}_{2}^{*} = \mathbf{F}_{2} / \sqrt{\text{trace}(\mathbf{F}_{2}^{\text{T}} \mathbf{F}_{2})} $
2. Compute the cross-product matrix: $ \mathbf{S} = \mathbf{F}_{1}^{*\text{T}} \mathbf{F}_{2}^{*} $
3. Perform the singular value decomposition (SVD): $ \mathbf{S} = \mathbf{U} \mathbf{D}_{\alpha} \mathbf{V}^{\text{T}} $
4. Procrustes rotation matrix: $ \mathbf{Q} = \mathbf{V} \mathbf{U}^{\text{T}} $
5. Sum of squared errors between normalized coordinates after rotation of the second matrix:
$$ E = \text{trace}[(\mathbf{F}_{1}^{*} - \mathbf{F}_{2}^{*} \mathbf{Q})^{\text{T}} (\mathbf{F}_{1}^{*} - \mathbf{F}_{2}^{*} \mathbf{Q})] $$
6. Procrustes correlation: $ r = \sqrt{1 - E} $
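The six steps above can be sketched in Python with NumPy as follows. This is an illustrative transcription under stated assumptions, not the code used in the article: the function name `procrustes_correlation` and the test configuration are hypothetical, and clipping $1 - E$ at zero is an added safeguard. The steps as listed apply a rotation only, and the sketch does the same.

```python
import numpy as np

def procrustes_correlation(F1, F2):
    """Follow steps 1-6 for two column-centred n x p configurations."""
    F1 = np.asarray(F1, dtype=float)
    F2 = np.asarray(F2, dtype=float)
    # Step 1: scale each matrix to unit sum of squares.
    F1s = F1 / np.sqrt(np.trace(F1.T @ F1))
    F2s = F2 / np.sqrt(np.trace(F2.T @ F2))
    # Step 2: cross-product matrix S.
    S = F1s.T @ F2s
    # Step 3: SVD of S; NumPy returns V already transposed.
    U, alpha, Vt = np.linalg.svd(S)
    # Step 4: rotation matrix Q = V U^T.
    Q = Vt.T @ U.T
    # Step 5: sum of squared errors after rotating the second configuration.
    R = F1s - F2s @ Q
    E = np.trace(R.T @ R)
    # Step 6: r = sqrt(1 - E). Clipping at zero is an assumption added here
    # so that the square root stays real if E exceeds 1.
    return float(np.sqrt(max(0.0, 1.0 - E)))

# Hypothetical check: a rotated copy of a configuration should give r close to 1.
rng = np.random.default_rng(0)
F1 = rng.normal(size=(10, 2))
F1 -= F1.mean(axis=0)  # column-centre
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
F2 = F1 @ rot
r = procrustes_correlation(F1, F2)
```

In this check the second configuration differs from the first only by a rotation, so the fitted rotation undoes it and the error $E$ is zero up to rounding.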

A.3 Comparison of the Present Logratio Approach with the Principal Balances of Martín-Fernández et al. (2018)

Martín-Fernández et al. (2018) developed an algorithm for a stepwise selection of ILR balances, by successively partitioning the parts using an exhaustive search at each step of this divisive algorithm. They apply their method to the ten-part Aar Massif geochemical data set from the book by van den Boogaart and Tolosana-Delgado (2013), and their approach uses unweighted parts, which is the present practice of the CODA school. A major difference between their approach and the one in the present article is that they do not use variance explained in the sense used here, but rather “variance contained” in, or “variance contributed” to, the logratio variance (although they sometimes use the term “variance explained”, they mean “variance contained”). This is a weaker criterion than the variance-explained criterion proposed in the present study, because the part of variance contributed by a logratio or a balance is measured in isolation from the variability in the rest of the data set (see Sect. 3.5 of the article for more explanation). Thus, in order to compare our results with those of Martín-Fernández et al. (2018), the explained variances have had to be computed for the sequence of ILR balances published in that paper (Table 4, columns 3 and 4). In addition, the simpler approach of selecting logratios proposed in the present study was executed (Table 4, columns 1 and 2; see Fig. 13 for a graph of these ratios). As a further comparison, the simple logratios of amalgamated parts, using the same partitioning sequence as the ILR balances, were also computed, along with their explained variances; these can be termed “amalgamation balances” (Table 4, fifth column). Finally, the variances explained by the principal component axes (i.e., dimensions of the unweighted LRA of the data), which are the optimal explained variances, are reproduced (Table 4, columns 6 and 7).
Note that these last explained variances are the only ones where the definition of variance explained is equivalent to variance contained.
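The distinction between variance contained and variance explained can be illustrated with a small Python sketch. It assumes one particular reading: that "explained" means the share of the total sum of squares of the centred logratios (CLRs) reproduced by regressing every CLR column on the selected variable, as in a redundancy analysis, while "contained" means the variable's own sum of squares as a share of that total. Sect. 3.5 of the article is the authoritative definition; the function names and the simulated data are hypothetical.

```python
import numpy as np

def clr(X):
    """Centred logratios: log parts minus the row mean of the logs."""
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)

def variance_explained(Y, z):
    """Share of the total sum of squares of Y fitted by regressing each column on z."""
    z = z - z.mean()
    fitted = np.outer(z, z @ Y) / (z @ z)
    return float((fitted ** 2).sum() / (Y ** 2).sum())

def variance_contained(Y, z):
    """Sum of squares of z itself as a share of the total sum of squares of Y."""
    z = z - z.mean()
    return float((z @ z) / (Y ** 2).sum())

# Hypothetical 4-part compositions; not data from the article.
rng = np.random.default_rng(1)
X = rng.dirichlet(np.ones(4) * 5, size=30)
Y = clr(X)
Y = Y - Y.mean(axis=0)  # column-centre the CLRs

# ILR balance of part 0 against part 1: sqrt(1/2) * log(x0 / x1).
balance = np.sqrt(0.5) * np.log(X[:, 0] / X[:, 1])
explained = variance_explained(Y, balance)
contained = variance_contained(Y, balance)
```

Under this reading the explained share is never smaller than the contained share, because the regression also credits the balance with the variance it shares with the other parts.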

Table 4 Cumulative explained variances of sequences of simple logratios, ILR balances, amalgamation balances and principal components

The results are also presented graphically in Fig. 14 in the style of Table 3 of Martín-Fernández et al. (2018). The results have been graphed in two separate figures for clarity. In both, the PCA sequence of cumulative explained variances is shown to give common reference points. In the left-hand figure, the first ILR, involving nine out of the ten parts, is higher by 1.5 percentage points compared to the first logratio Na₂O/MgO, involving only two parts. At steps 3, 4 and 5, the simple logratio sequence is superior to the ILR sequence, after which the two sequences converge. In the right-hand figure, the ILR balance sequence is superior to the amalgamation balance sequence for the first two steps, but afterwards, they are practically identical. Notice that the amalgamation balance sequence does not necessarily reach exactly 100% variance explained, but in this example, it reaches 99.97% variance explained using 9 balances, lacking only 0.03%.

In conclusion, this is another example where the sequence of simple logratios seems perfectly adequate to explain the variance of the whole compositional data set. They are comparable to the ILR sequence in terms of explained variance, sometimes even outperforming it, and are much easier to compute and interpret. Using amalgamations instead of geometric means is an alternative way of defining balances, and these also have an easier interpretation in practice.
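To make the two ways of defining a balance concrete, the following Python sketch computes both for a single composition. The ILR balance uses the usual normalized logratio of geometric means, and the amalgamation balance is the plain logratio of summed parts. The function names, the four-part composition and the chosen partition are hypothetical and not taken from the article.

```python
import math

def ilr_balance(row, num, den):
    """sqrt(r*s/(r+s)) * log(gm(num) / gm(den)), gm being the geometric mean."""
    r, s = len(num), len(den)
    log_gm_num = sum(math.log(row[j]) for j in num) / r
    log_gm_den = sum(math.log(row[j]) for j in den) / s
    return math.sqrt(r * s / (r + s)) * (log_gm_num - log_gm_den)

def amalgamation_balance(row, num, den):
    """Logratio of the summed (amalgamated) parts in the two groups."""
    return math.log(sum(row[j] for j in num) / sum(row[j] for j in den))

# Hypothetical 4-part composition; part 0 is contrasted with parts 1 and 2.
x = [0.5, 0.2, 0.2, 0.1]
b_ilr = ilr_balance(x, [0], [1, 2])
b_amalg = amalgamation_balance(x, [0], [1, 2])
```

The amalgamation balance reads directly as "part 0 relative to parts 1 and 2 combined", which is the sense in which it has the easier interpretation.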

A.4 Simulation Study of the Ward Dendrogram as Parts are Sequentially Randomized

The idea of this simulation is to study how the dendrogram from the Ward clustering breaks down as parts are sequentially randomized (i.e., columns are randomly permuted) to simulate growing random noise in the data set. The values of each part (i.e., oxide in the Roman glass cups data set) are permuted in turn, the data reclosed and the Ward clustering repeated. The parts are randomized in order from the part contributing least to the variance to the part contributing most (the parts being randomized are shown in the boxes next to the dendrograms). Figure 15 is read in horizontal steps. After three parts are randomized, the structure is still fairly stable, but it starts to break down from the fourth randomized part onwards. The element Si is kept fixed throughout, but by the last randomization, the whole data set has been effectively converted to noise.
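The permute-and-reclose loop can be sketched in Python as follows. The sketch rests on assumptions: parts are ordered by the variance of their centred logratios as a stand-in for each part's contribution to the variance, the permutations accumulate from one step to the next, and the Ward clustering of each resulting data set is omitted. All names and the small data set are hypothetical.

```python
import math
import random

def close(row):
    """Rescale a row so that its parts sum to 1."""
    total = sum(row)
    return [x / total for x in row]

def clr_variances(rows):
    """Variance of each part's centred logratio (assumed ordering criterion)."""
    n, J = len(rows), len(rows[0])
    clr = []
    for row in rows:
        logs = [math.log(x) for x in row]
        m = sum(logs) / J
        clr.append([v - m for v in logs])
    out = []
    for j in range(J):
        col = [r[j] for r in clr]
        mean = sum(col) / n
        out.append(sum((v - mean) ** 2 for v in col) / n)
    return out

def sequential_randomization(rows, fixed, seed=0):
    """Yield (part index, reclosed data) after each cumulative column permutation."""
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    var = clr_variances(rows)
    # Lowest-variance part first; the part `fixed` is never permuted.
    order = sorted((j for j in range(len(var)) if j != fixed), key=lambda j: var[j])
    for j in order:
        col = [r[j] for r in data]
        rng.shuffle(col)
        for r, v in zip(data, col):
            r[j] = v
        data = [close(r) for r in data]
        yield j, [list(r) for r in data]

# Hypothetical 4-part data; part 0 plays the role of the fixed part.
X = [
    [0.70, 0.15, 0.10, 0.05],
    [0.65, 0.20, 0.05, 0.10],
    [0.72, 0.10, 0.12, 0.06],
    [0.60, 0.18, 0.14, 0.08],
]
steps = list(sequential_randomization(X, fixed=0))
```

Each yielded data set would then be passed to the clustering routine, so that the dendrograms can be compared step by step.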

About this article

Cite this article

Greenacre, M. Variable Selection in Compositional Data Analysis Using Pairwise Logratios. Math Geosci 51, 649–682 (2019). https://doi.org/10.1007/s11004-018-9754-x

Received: 22 September 2017
Accepted: 05 June 2018
Published: 09 July 2018
Issue Date: 01 July 2019
DOI: https://doi.org/10.1007/s11004-018-9754-x
