Rating: Summary: Misleading title! Review: A better title for this book would be 'How BLAST works' because this book is centered around this topic. If you expect a general overview of statistical methods used in bioinformatics you should buy another book (e.g. Hastie, Baldi, Pevzner, Duda, Eddy, which provide more general methods). If you want to know in mathematical detail how BLAST works, this is your book. I think the level is advanced and one needs some mathematical background to appreciate it (certainly not recommended for biologists).
I don't think it is a really bad book but I think it gives a wrong impression of (statistical) methods in bioinformatics. Another reviewer wrote ...This is one of the books I have been waiting for. For a population geneticist who wants to learn bioinformatics, most texts are unacceptable: They present heuristic methods in a cookbook fashion, with little reference to what is going on biologically as well as mathematically....
This is exactly the problem with this book!! Bioinformatics is more machine learning than statistics and more heuristic than exact.
Rating: Summary: Great all-around review of probability Review: The book's title says 'Statistical Methods', but all of statistics is derived from probability theory. That's really where Ewens and Grant start, with the best high-density review of probability I know.
The first two chapters cover probabilities of one and many variables, respectively. This includes several topics that other authors frequently skip, including conditional and marginal probabilities, probability- and moment-generating functions, a little about entropy, distributions of sums, and extreme value statistics. All that takes about 100 pages. Two later chapters cover statistical inference (parameter estimation, hypothesis testing, and Bayesian techniques), two more cover stochastic processes including Markov models, a short chapter includes hidden Markov models and their training, and another chapter covers sampling techniques: bootstraps, permutation tests and such.
If the book contained only that material, it would still be a valuable review and summary of basic probability. It's way too dense to be a beginner's text. That's OK, those chapters were really intended as a review and as a statement of the terms and notation used in the book's real objectives: models of biological systems.
The chapters on biological applications are interspersed with chapters on basics, so that each application is presented as soon as its elements are covered. Those chapters describe statistical properties of a single DNA or protein string, relationships between two strings, BLAST and its scoring models, mutation modeling, and construction of phylogenetic trees. Coverage of each topic is brief but very dense. A surprising amount of information is packed into each brief chapter, and it's surprisingly readable. Still, these are big topics. Ewens and Grant don't, and don't try to, present any topic in full depth. Instead, they give enough discussion that a determined reader can learn the basics, and can understand more advanced discussions of specific topics.
The book does require a determined reader with some background in probability - this shouldn't be anyone's first book, unless you have a very skilled teacher. The prepared and careful reader will be very well rewarded, however. Despite the book's title about statistics and bioinformatics, this is a reference you may use for probability models in any field. It's certainly one that I keep coming back to.
//wiredweird
Rating: Summary: Pretty good overview Review: This book is a timely introduction to the mathematical statistics used in computational biology and bioinformatics. The authors have done a superb job in the overview of a subject that students of biology and bioinformatics can rely on for study and for reference. The mathematics is done at an advanced undergraduate level, but the authors are pragmatic in their approach, and interlace the discussion with biological applications immediately after the appropriate mathematical background has been developed. It thus seems appropriate to discuss the quality of the presentation with these applications in mind. Chapter one begins, appropriately, with an introduction to probability theory, with a consideration of discrete probability distributions of one variable beginning the chapter. The Bernoulli, binomial, uniform, geometric, generalized geometric, and Poisson distributions are discussed. The authors point out the use of geometric-like distributions in the BLAST application. They also caution the reader as to the difference between the mean and the average of a random variable. They then move on to consider continuous distributions, discussing briefly the uniform, Normal, exponential, gamma, and beta distributions. Moment-generating functions are also introduced, and they prove a "convexity" theorem for these functions that is important in the BLAST application. The authors also introduce the relative entropy and generalized support statistics, the latter also being used in BLAST. The next chapter is an overview of probability theory in many random variables. The results in chapter one are discussed in this context, and the authors give an interesting application to the sequencing of EST libraries. The authors also point out that the variance of the maximum of a collection of random variables is finite as the number of variables increases, a fact that is used quite often in bioinformatics.
Transformations of random variables are also discussed, with the goal of showing how these can be used to find the density function of a single random variable, this also being important in BLAST. The most important subject of the book begins in chapter 3, wherein the authors introduce statistical inference. They begin with a very brief discussion of the differences between the frequentist and Bayesian approaches to statistical inference and then move on to classical hypothesis testing and nonparametric tests. This chapter is of great value to those readers, for example biologists/would-be bioinformaticists, who are approaching statistics for the first time. Chapter 4 introduces concepts that are of utmost importance in probabilistic computational biology, namely Markov chains. The discussion in this chapter sets up the strategies used in the next chapter on analyzing a single DNA sequence and a later chapter on hidden Markov models. Shotgun sequencing is discussed as a tool to determine an actual DNA sequence, and the authors discuss the probabilistic issues that arise in the reconstruction of long DNA sequences from shorter sequences. Missing in this chapter is a mathematical analysis of the advantages/disadvantages between shotgun and whole genome sequencing strategies. Chapter 6 then generalizes the analysis of chapter 5 to multiple DNA and protein sequences. It is here that one begins to talk about alignments between sequences, which bring about some very subtle mathematical problems in computational biology. The computational complexity of the (global) alignment problem entails the use of softer techniques, such as dynamic programming, which is discussed in this chapter. The (local) alignment problem is also discussed in some detail, using the linear gap model. The alignment problem and the issues with scoring for protein sequences are also discussed in detail. The reader first encounters the famous PAM and BLOSUM matrices in this chapter.
The authors do not discuss any connections with the protein folding problem, unfortunately. The next chapter introduces the basic probability theory behind the BLAST algorithm, namely random walks. They do so with emphasis on moment generating functions, which might be a little abstract for the biologist reader. The authors return to statistical estimation and hypothesis testing in chapter 8, with maximum likelihood and fixed sample size tests discussed in some detail. Again connecting with the BLAST algorithm, the sequential probability ratio test is treated. The authors finally get down to the BLAST algorithm in chapter 9, using an older version of the software (1.4). The connection of the algorithm with random walks and how to assign scores is immediately apparent, as is the ability of BLAST to do database queries against a chosen sequence. The algorithm is compared with the sequential analysis discussed in the last chapter. The authors return to Markov chains in chapter 10, and give some numerical examples. In addition, they treat the important topic of Markov chain Monte Carlo via the Hastings-Metropolis algorithm, Gibbs sampling, and simulated annealing. An application of simulated annealing to the double digest problem is described. The authors also spend a little time discussing continuous-time Markov chains. Hidden Markov models are finally discussed in chapter 11. These have been the most effective tools in sequence analysis and the authors give a nice overview of their construction and properties in this chapter. The Pfam package is discussed as a software implementation of HMMs for determining protein domains. Unfortunately, they do not discuss the excellent package HMMER for implementing HMMs in sequence analysis. Chapter 12 discusses computationally intensive methods in classical inference. One of these methods, the bootstrap procedure, which is used for large sample sizes, is described.
The bootstrap is used to estimate confidence intervals in situations where there is not enough information to employ classical methods, and the authors detail a method using quantiles to estimate the confidence interval for the standard deviation of the expression intensity of a gene. This is followed by a return to the multiple testing problem of chapter 3 in the context of the data analysis of expression arrays. I did not read the last two chapters on evolutionary models and phylogenetic tree estimation so I will omit their review.
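For readers unfamiliar with the quantile approach mentioned in this review, here is a minimal sketch of a percentile bootstrap interval for a standard deviation. It is not taken from the book: the function name `bootstrap_sd_interval`, the resample count, the seed, and the intensity values are all assumptions chosen for illustration, and the book's own procedure may differ in its details.

```python
import random
import statistics


def bootstrap_sd_interval(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the sample standard deviation.

    Hypothetical helper for illustration only; not the book's code.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Resample the data with replacement, keeping the original size.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistics.stdev(resample))
    estimates.sort()
    # Read the interval endpoints off the empirical quantiles.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up expression intensities for a single gene (not real data).
intensities = [7.1, 6.8, 7.9, 8.4, 6.5, 7.7, 8.0, 7.3, 6.9, 8.8]
low, high = bootstrap_sd_interval(intensities)
print(f"95% percentile interval for the SD: ({low:.2f}, {high:.2f})")
```

The percentile method is only one of several bootstrap interval constructions; it is used here because it matches the "quantiles" wording above and needs nothing beyond the standard library.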
Rating: Summary: Disappointing overview Review: This book is a tremendous disappointment, given other Amazon reviews and the impressive Table of Contents. I picked several topics about which I know something: Likelihoods, P-values, bootstraps. I would have had NO idea about any of these subjects based on the poor delivery in this book. Topics are not well introduced, there are virtually no examples, and the introduction/discussion of most topics is wordy and not informative. A topic such as the two-sample t-statistic is scattered throughout the book, with the main part not even cited in the index! Unfortunately there are not a lot of books in the field of Statistics in Bioinformatics. However, I would recommend "The Elements of Statistical Learning" (Hastie et al.) for classifiers etc. (Duda and Hart's classic is also good). I would recommend "Biostatistical Analysis" by Zar for general coverage, and Terry Speed's "Stat Labs: Mathematical Statistics ...", which is not comprehensive but has good lab examples with associated statistical analysis.
Rating: Summary: guide into the right direction Review: This is one of the books I have been waiting for. For a population geneticist who wants to learn bioinformatics, most texts are unacceptable: They present heuristic methods in a cookbook fashion, with little reference to what is going on biologically as well as mathematically. This book is the first exception I know of. It builds, and rests on, solid foundations of genetic stochastic processes and still goes all the way to real-life problems. Let me illustrate this by means of an example, rather than enumerating all the topics in the book. Chap. 14, entitled `phylogenetic tree estimation' (as opposed to the more common term `phylogenetic tree reconstruction' - not without reason, I presume) builds on, and is firmly interlaced with, Chap. 13 about `evolutionary models', which systematizes the zoo (if not jungle) of substitution models in both discrete and continuous time. On this basis, the overview of tree-building methods makes a lot of sense. Even better, it does not stop here, but presents an application (to real sequence data), followed by a careful analysis of where the various methods agree, and where - and maybe why - they disagree. This way, it clears away some common misconceptions; in particular, it presents a careful analysis of what bootstrap does and what it does not in this context. The chapter closes with a discussion of unresolved problems (like inhomogeneity of substitution rates), and methods and possible pitfalls related to testing of nested and non-nested hypotheses in tree estimation. The book is written in an informal style without being imprecise, which makes it pleasant reading. It is particularly suitable for teaching at a high level. This is enhanced by realistic (and even real-life) examples that furnish the text, as well as carefully chosen exercises at the end of each chapter. 
Certainly, this first edition of `Statistical Methods in Bioinformatics' cannot be the last word in this fast-moving field. But it is an excellent guide into the `right' direction.
Rating: Summary: poor delivery but potentially useful Review: This topic should be of prime interest to statisticians. The authors are mathematical biologists and they bring out the theory and methodology in probability and statistics that is applicable to DNA and protein sequencing and matching. They provide a treatment of probability, stochastic processes and statistics that starts with the very basics and builds up. Topics include basic probability and statistical inference, Poisson processes and Markov chains, DNA sequencing, hidden Markov models, computer intensive methods, evolutionary models and phylogenetic tree estimation. Of particular interest to me is the material on permutation methods and the bootstrap. The bootstrap has been applied in phylogenetics and there has been some controversy about its application there. The authors cover this in Chapter 14 where they appear to have a resolution for the controversy. Permutation tests are first discussed in Chapter 3 "An Introduction to Statistical Inference" and are compared with other computer intensive methods in Chapter 12. In Section 12.3 they discuss the Behrens-Fisher problem, pointing out why permutation tests are not possible due to the unequal variances. They give the bootstrap t solution. Section 12.2.2 gives a brief, but nicely described, account of bootstrap estimation and confidence intervals and provides a number of references including the following books: Efron and Tibshirani (1993), Davison and Hinkley (1997), Efron (1982), Hall (1992), Manly (1997), Sprent (1998) and Chernick (1999). Bootstrap and permutation approaches to multiple testing are covered in Section 12.4.