Algebraic statistics for computational biology
Pachter L., Sturmfels B., Cambridge University Press, New York, NY, 2005. 432 pp. Type: Book

Date Reviewed: Apr 28 2006

This book combines, in an exciting way, statistical theory, algorithms, and abstract algebra in an effort to provide a framework for the mathematical study of deoxyribonucleic acid (DNA) sequences. It is the joint work of Lior Pachter, Bernd Sturmfels, and their colleagues who were involved in a graduate course with the same title at the University of California at Berkeley.

Although the areas covered in the book have no obvious connection to one another, the authors manage to bring the pieces together and present a solid and comprehensive theory. The material is well organized into two parts and 22 chapters.

The first part provides an introduction to the four basic themes of the theory, namely, statistics, computation, algebra, and biology, in four corresponding chapters. The purpose is to give readers with different backgrounds the opportunity to connect their own area of expertise with the areas less familiar to them.

Chapter 1 is devoted to statistics, in particular to the new field of algebraic statistics. The authors present the appropriate statistical theory and methods for the analysis of discrete data concerning genomes and DNA sequences, that is, random sequences built from the letters A, C, G, and T. The originality of this introduction is that statistical notions are formulated using terminology from abstract algebra. Since only parametric statistical models are considered, the main issue discussed is the estimation of parameters by the maximum likelihood method (MLM). Special attention is given to linear and log-linear (toric) models, whose likelihood functions have at most one local maximum. The expectation-maximization (EM) algorithm, used to maximize the likelihood function in cases where there are multiple local maxima, is also described. Other issues discussed in this chapter involve Markov models (Markov chains, hidden Markov models, and Markov models on trees) and graphical models, where the graphs are the basis for studying and developing algorithms.
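To make the idea of maximum likelihood estimation for a Markov chain on A, C, G, and T concrete, here is a minimal sketch in Python. It is not taken from the book; the function name `markov_mle` and the toy sequence are assumptions made for illustration. It relies on the standard fact that, conditional on the first letter, the maximum likelihood estimate of each transition probability is the observed transition count divided by the total count for that row.

```python
from collections import Counter

ALPHABET = "ACGT"


def markov_mle(seq):
    """Estimate first-order Markov chain transition probabilities.

    Hypothetical helper for illustration: returns a dict mapping each
    letter x that has at least one observed successor to a dict of
    estimated probabilities P(next = y | current = x).
    """
    # Count every adjacent pair (x, y) in the sequence.
    counts = Counter(zip(seq, seq[1:]))
    estimate = {}
    for x in ALPHABET:
        row_total = sum(counts[(x, y)] for y in ALPHABET)
        if row_total == 0:
            # No transitions out of x were observed, so the row
            # cannot be estimated from this sequence.
            continue
        estimate[x] = {y: counts[(x, y)] / row_total for y in ALPHABET}
    return estimate


est = markov_mle("ACACAG")
```

In this toy sequence, A is followed by C twice and by G once, so the estimated probability of moving from A to C is 2/3. Rows for letters with no observed successor are simply omitted.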

Chapter 2 discusses computation, specifically discrete algorithms. First, the so-called tropical arithmetic is defined. This kind of arithmetic is then used to describe the dynamic programming algorithms for optimal sequence alignment. Notions such as polytopes and phylogenetic trees are also discussed. At the end of the chapter, there is a very useful section presenting relevant software packages for mathematics and computational biology.
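For readers unfamiliar with tropical arithmetic, the connection to alignment can be sketched as follows. In the (min, +) tropical semiring, "addition" is the minimum and "multiplication" is ordinary addition, so the usual alignment recurrence becomes a sum of products. The code below is an illustrative sketch only, not the book's implementation; the function names and the unit mismatch and gap costs are assumptions.

```python
def t_add(x, y):
    """Tropical addition: the minimum of the two values."""
    return min(x, y)


def t_mul(x, y):
    """Tropical multiplication: ordinary addition."""
    return x + y


def alignment_cost(a, b, mismatch=1.0, gap=1.0):
    """Minimum alignment cost of strings a and b by dynamic programming.

    The costs (mismatch=1, gap=1, match=0) are assumed for illustration.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per letter.
    for i in range(1, rows):
        d[i][0] = t_mul(d[i - 1][0], gap)
    for j in range(1, cols):
        d[0][j] = t_mul(d[0][j - 1], gap)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0.0 if a[i - 1] == b[j - 1] else mismatch
            # Tropical "sum" of three tropical "products":
            # substitute/match, gap in b, gap in a.
            d[i][j] = t_add(
                t_add(t_mul(d[i - 1][j - 1], sub), t_mul(d[i - 1][j], gap)),
                t_mul(d[i][j - 1], gap),
            )
    return d[-1][-1]


cost = alignment_cost("ACGT", "AGT")
```

With these unit costs the result coincides with the edit distance between the two strings; deleting the C from ACGT gives AGT at a cost of one gap.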

Chapter 3 is an introduction to algebraic concepts related to statistics and computational biology. Algebraic notions such as varieties, Gröbner bases, and the implicitization problem are connected to maximum likelihood estimation, and the principles of tropical geometry are presented.

In chapter 4, the authors explain how genome sequence data relate to the theory of statistics, computation, and algebra, and how all of these areas can be combined to model the random evolution of genomes and to study the structure and organization of functional elements. Hidden Markov models for gene identification and statistical models for the evolution of DNA sequences are also discussed.

Part 2 is made up of 18 chapters, offering thorough and insightful studies of a wide variety of relevant topics, such as polytope propagation, parametric inference, optimal sequence alignment, inference functions, the geometry of Markov chains, the equations defining hidden Markov models and the EM algorithm for them, the identification of evolutionary events, and a range of tree models. Furthermore, a numerical approach to the MLM problem is given for phylogenetic trees via interval methods.

In general, the book brings new ideas and perspectives to scientists dealing with biological data. Moreover, the whole approach should be of great interest to statisticians, since it provides formulations and tools applicable to the analysis of discrete data in general. Although the level of the content is advanced and requires some familiarity with at least one of the basic themes, the presentation of the concepts is clear and supported by good examples, so the book can be used as a textbook in advanced courses. I strongly recommend this fascinating and useful book for researchers in statistics, algebra, and computational biology.

Reviewer: Lefteris Angelis