This text is intended for a broad audience, both as an introduction to predictive models and as a guide to applying them. Non-mathematical readers will appreciate the intuitive explanations of the techniques, while an emphasis on problem-solving with real data across a wide variety of applications will aid practitioners who wish to extend their expertise. Readers should have knowledge of basic statistical ideas, such as correlation and linear regression analysis. While the text is biased against complex equations, a mathematical background is needed for advanced topics.
Dr. Kuhn is a Director of Non-Clinical Statistics at Pfizer Global R&D in Groton, Connecticut. He has been applying predictive models in the pharmaceutical and diagnostic industries for over 15 years and is the author of a number of R packages.
Dr. Johnson has more than a decade of statistical consulting and predictive modeling experience in pharmaceutical research and development. He is a co-founder of Arbor Analytics, a firm specializing in predictive modeling, and is a former Director of Statistics at Pfizer Global R&D. His scholarly work centers on the application and development of statistical methodology and learning algorithms.
Applied Predictive Modeling covers the overall predictive modeling process, beginning with the crucial steps of data preprocessing, data splitting and foundations of model tuning. The text then provides intuitive explanations of numerous common and modern regression and classification techniques, always with an emphasis on illustrating and solving real data problems. Addressing practical concerns extends beyond model fitting to topics such as handling class imbalance, selecting predictors, and pinpointing causes of poor model performance—all of which are problems that occur frequently in practice.
The text illustrates all parts of the modeling process through many hands-on, real-life examples, and every chapter contains extensive R code for each step of the process. The data sets and corresponding code are available in the book’s companion AppliedPredictiveModeling R package, which is freely available on the CRAN archive.
This multi-purpose text can be used as an introduction to predictive models and the overall modeling process, as a practitioner’s reference handbook, or as a text for advanced undergraduate or graduate level predictive modeling courses. To that end, each chapter contains problem sets that help solidify the covered concepts and use data available in the book’s R package.
Readers and students interested in implementing the methods should have some basic knowledge of R. And a handful of the more advanced topics require some mathematical knowledge.
tl;dr: A brilliant book covering predictive modelling in R. With a strong practical bent, it walks the reader through the application of modern classification and regression techniques to a broad range of varied and interesting data sets. It uses existing packages where possible so you can jump straight in (great for Kagglers), but there is a lot here to master. It is especially strong on preprocessing (both unsupervised and supervised), model tuning and model assessment. It should not be your first book on R or data analytics, but it offers the best balance of practical application without forgoing theory that I have seen. It is wonderful to see how professional data analysts approach predictive modelling tasks. The data sets are not toy examples chosen to highlight approaches but interesting and complex problems from a wide variety of disciplines. (Note that this book does not cover time series, generalised additive models or ensembles of different models.)
Data science has become very popular thanks to the increase in computing power (including services like AWS), the amount of data accessible on the internet and a number of open-source tools (R and Python, for example) that allow even relative beginners to build quite sophisticated models. Coursera lets anyone take courses on machine learning for free, and sites like Kaggle have even turned it into something of a sport where people compete to create predictive models for money or even job interviews. Part of the excitement is that predictive models can be applied to almost any field you can think of.
Given how easy it has become to make predictions with sophisticated techniques, the number of books on machine learning, data mining and predictive data analytics has grown to meet the demand from people looking to learn about the field. As data science is itself a combination of many different disciplines (statistics, computer science, artificial intelligence, etc.), there are many different points of entry. For this reason books can often be placed on a spectrum, from straightforward worked examples using already constructed programs to theoretical textbooks with lots of mathematical background that construct approaches from scratch. “Applied Predictive Modeling” tries to find a middle ground between these two extremes, though it unashamedly sides with the practical. In contrast to many other works, though, it utilises existing packages (notably caret) rather than having the reader construct the approaches themselves in code.
Applied Predictive Modeling contains 20 chapters, set out to be quasi-independent whilst still forming a coherent book. An abstract opens each chapter, followed by sections discussing the approaches used. The writing is excellent, very easy to follow and wonderfully informative, with an excellent choice of example data sets. The discussions are not afraid to highlight the problems of different approaches: in one of the later chapters noise is deliberately added to a data set so that its differing impact on a range of models can be seen. Theory is discussed insofar as it is useful for understanding the use of certain approaches, and references to further reading are clearly given. Each chapter concludes with a summary, followed by a computation section that contains all of the R code used in the chapter, with some discussion where important. Finally, most chapters have a section containing exercises. Usefully, these exercises use different data sets, so they are not merely a regurgitation of what one has just read. The chapters also have independent bibliographies, which is a little annoying when reading the book cover to cover but makes it excellent as a reference book.
After a few chapters of overview, the chapters largely work through the components of the data analytics process: the data-splitting and pre-processing chapters cover transforming, centering, dealing with missing values and setting up the data for the application of models. The next section of the book covers regression models. It utilises a pharmaceutical data set and works through the creation of models of increasing complexity. A chapter then works through an example of predicting concrete strength from its ingredients to show clearly how regression applications work end to end. A number of chapters then look at classification algorithms by building a model for a Kaggle competition from late 2010 on university grants. This highlights what this book offers that I have not seen in other comparable books: real-life examples of the steps a professional analyst takes in the construction of a model. The reader is almost always watching the construction of a real model throughout the discussion of the differing approaches. The book does discuss theory where it is useful. But rather than going into the minutiae of constructing things directly in code to highlight the underlying structure, existing packages are used where possible. This lowers the barrier to getting started with the techniques. Finally, the book is rounded out with chapters on model tuning, detecting variable importance, how to handle class imbalances and some broader issues in modelling, all again using real data sets from different fields.
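To make the splitting and pre-processing steps mentioned above concrete, here is a minimal sketch of the general idea. It is not code from the book, which works in R with packages such as caret; it is plain Python using only the standard library, and the function names and the tiny data set are made up for illustration. Missing values are written as `None`, imputed with the training-set median, and each column is then centered and scaled.

```python
import random
import statistics


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices reproducibly and split into training and test rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


def fit_preprocessor(train):
    """Learn each column's median, mean and standard deviation from training rows only."""
    params = []
    for column in zip(*train):
        observed = [v for v in column if v is not None]
        median = statistics.median(observed)
        filled = [median if v is None else v for v in column]
        mean = statistics.fmean(filled)
        sd = statistics.pstdev(filled) or 1.0  # guard against constant columns
        params.append((median, mean, sd))
    return params


def apply_preprocessor(rows, params):
    """Impute missing values with the stored median, then center and scale."""
    return [
        [((median if v is None else v) - mean) / sd
         for v, (median, mean, sd) in zip(row, params)]
        for row in rows
    ]


# Hypothetical two-predictor data set with a few missing values.
rows = [
    [1.0, 10.0], [2.0, None], [3.0, 30.0], [None, 40.0],
    [5.0, 50.0], [6.0, 60.0], [7.0, None], [8.0, 80.0],
]

train, test = split_rows(rows)
params = fit_preprocessor(train)
train_scaled = apply_preprocessor(train, params)
test_scaled = apply_preprocessor(test, params)
```

The design choice worth noting is that the medians, means and standard deviations are estimated on the training rows alone and then reused for the test rows, so nothing about the held-out data leaks into the pre-processing.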
The authors have created an R package for the book containing the code and data sets used, along with an excellent website and blog. The book ranges broadly across disciplines and includes separate data sets for the exercises. In all I count 21 data sets, ranging from concrete strength to caravan insurance, that are either covered in the book or given as exercises in the chapters.
In short, I congratulate the authors on an excellent book that I look forward to working through in depth over the coming months. If you are looking to improve your predictive modelling and are not yet at a professional standard, this is the book you are looking for. Whilst there is a lot to learn and master, you can jump in and use things from the book very quickly thanks to its use of impressive packages. One area I would love to see added to future editions is the ensembling of different models.
Product Details:
- Hardcover: 620 pages
- Publisher: Springer; 2013 edition (September 15, 2013)
- Language: English
- ISBN-10: 1461468485
- ISBN-13: 978-1461468486
- Product Dimensions: 6.3 x 9.2 inches