Handbook of Big Data provides a state-of-the-art overview of the analysis of large-scale datasets. Featuring contributions from well-known experts in statistics and computer science, this handbook presents a carefully curated collection of techniques from both industry and academia. Thus, the text instills a working understanding of key statistical and computing ideas that can be readily applied in research and practice.
Offering balanced coverage of methodology, theory, and applications, this handbook:
- Describes modern, scalable approaches for analyzing increasingly large datasets
- Defines the underlying concepts of the available analytical tools and techniques
- Details intercommunity advances in computational statistics and machine learning
Handbook of Big Data also identifies areas in need of further development, encouraging greater communication and collaboration between researchers in big data sub-specialties such as genomics, computational biology, and finance.
Table of Contents
GENERAL PERSPECTIVES ON BIG DATA
The Advent of Data Science: Some Considerations on the Unreasonable Effectiveness of Data
Big n versus Big p in Big Data
DATA-CENTRIC, EXPLORATORY METHODS
Divide and Recombine: Approach for Detailed Analysis and Visualization of Large Complex Data
Integrate Big Data for Better Operation, Control, and Protection of Power Systems
Interactive Visual Analysis of Big Data
A Visualization Tool for Mining Large Correlation Tables: The Association Navigator
Andreas Buja, Abba M. Krieger, and Edward I. George
High-Dimensional Computational Geometry
IRLBA: Fast Partial SVD Method
Structural Properties Underlying High-Quality Randomized Numerical Linear Algebra Algorithms
Michael W. Mahoney and Petros Drineas
Something for (Almost) Nothing: New Advances in Sublinear-Time Algorithms
Ronitt Rubinfeld and Eric Blais
Elizabeth L. Ogburn and Alexander Volfovsky
Mining Large Graphs
David F. Gleich and Michael W. Mahoney
MODEL FITTING AND REGULARIZATION
Estimator and Model Selection Using Cross-Validation
Stochastic Gradient Methods for Principled Estimation with Large Datasets
Panos Toulis and Edoardo M. Airoldi
Learning Structured Distributions
Penalized Estimation in Complex Models
Jacob Bien and Daniela Witten
High-Dimensional Regression and Inference
Divide and Recombine: Subsemble, Exploiting the Power of Cross-Validation
Stephanie Sapp and Erin LeDell
Scalable Super Learning
Tutorial for Causal Inference
Laura Balzer, Maya Petersen, and Mark van der Laan
A Review of Some Recent Advances in Causal Inference
Marloes H. Maathuis and Preetam Nandy
Targeted Learning for Variable Importance
Online Estimation of the Average Treatment Effect
Mining with Inference: Data-Adaptive Target Parameters
Alan Hubbard and Mark van der Laan
Peter Bühlmann is a professor of statistics at ETH Zürich, Switzerland, a fellow of the Institute of Mathematical Statistics, an elected member of the International Statistical Institute, and co-author of the book Statistics for High-Dimensional Data: Methods, Theory and Applications. He was named a Thomson Reuters 2014 Highly Cited Researcher in mathematics, has served on various editorial boards and as editor of the Annals of Statistics, and has delivered numerous presentations, including a Medallion Lecture at the 2009 Joint Statistical Meetings, a read paper to the Royal Statistical Society in 2010, the 14th Bahadur Memorial Lectures at the University of Chicago, Illinois, USA, and other named lectures.
Petros Drineas is an associate professor in the Computer Science Department at Rensselaer Polytechnic Institute, Troy, New York, USA. He is the recipient of an Outstanding Early Research Award from Rensselaer Polytechnic Institute, an NSF CAREER award, and two fellowships from the European Molecular Biology Organization. He has served as a visiting professor at the US Sandia National Laboratories; visiting fellow at the Institute for Pure and Applied Mathematics, University of California, Los Angeles; long-term visitor at the Simons Institute for the Theory of Computing, University of California, Berkeley; and program director in two divisions at the US National Science Foundation. He has also worked for industrial labs. He is a co-organizer of the series of workshops on Algorithms for Modern Massive Datasets, and his research has been featured in numerous popular press articles.
Michael Kane is a member of the research faculty at Yale University, New Haven, Connecticut, USA. He is a winner of the American Statistical Association’s Chambers Statistical Software Award for The Bigmemory Project, a set of software libraries that allow the R programming environment to accommodate large datasets for statistical analysis. He is a grantee on the Defense Advanced Research Projects Agency’s XDATA project, part of the White House’s Big Data Initiative, and on the Gates Foundation’s Round 11 Grand Challenges Exploration. He has collaborated with companies including AT&T Labs Research, Paradigm4, Sybase (an SAP company), and Oracle.
Mark van der Laan is the Jiann-Ping Hsu/Karl E. Peace professor of biostatistics and statistics at the University of California, Berkeley, USA. He is the inventor of targeted maximum likelihood estimation, a general semiparametric efficient estimation method that incorporates the state of the art in machine learning through the ensemble method super learning. He is the recipient of the 2005 COPSS Presidents’ and Snedecor Awards, the 2005 van Dantzig Award, and the 2004 Spiegelman Award. He is also the founding editor of the International Journal of Biostatistics and the Journal of Causal Inference, and the co-author of more than 250 publications and various books.
"The book contains a nice mix of philosophical musings, survey articles and cutting-edge research. It was designed as ‘a useful resource for seasoned practitioners and enthusiastic neophytes alike’ . . . Enthusiastic neophytes are still left with plenty to get their teeth into. In summary, I am happy to recommend the book to those seeking to broaden their understanding of the underpinning methodologies for analysing Big Data." ~ Richard J. Samworth, University of Cambridge, UK
". . . Handbook of Big Data is the first compilation on this emerging subject in our field and is therefore highly recommended to all statisticians and computer scientists."
~ The International Biometric Society
"The book strikes a great balance between the breadth and depth of recent research-active topics. It is an excellent reference book to keep for both academic researchers and industrial practitioners. It is also a good reference book for whoever teaches in the area of big data analysis."
~ Journal of the American Statistical Association