
Big Data Analytics

A Practical Guide for Managers, 1st Edition

By Kim H. Pries, Robert Dunnigan

Auerbach Publications

576 pages | 58 B/W Illus.

Purchasing Options: $ = USD
Hardback: 9781482234510
pub: 2015-02-05
SAVE ~$15.59
$77.95
$62.36
eBook (VitalSource): 9780429183522
pub: 2015-02-05
from $37.48



Description

This book gives managers and decision makers the tools to make more informed decisions about big data purchasing initiatives. Big Data Analytics: A Practical Guide for Managers not only supplies descriptions of common tools, but also surveys the various products and vendors that supply the big data market.

Comparing and contrasting the different types of analysis commonly conducted with big data, this accessible reference presents clear-cut explanations of the general workings of big data tools. Instead of spending time on HOW to install specific packages, it focuses on the reasons WHY readers would install a given package.

The book provides authoritative guidance on a range of tools, including open source and proprietary systems. It details the strengths and weaknesses of incorporating big data analysis into decision-making and explains how to leverage the strengths while mitigating the weaknesses.

  • Describes the benefits of distributed computing in simple terms
  • Includes substantial vendor/tool material, especially for open source decisions
  • Covers prominent software packages, including Hadoop and Oracle Endeca
  • Examines GIS and machine learning applications
  • Considers privacy and surveillance issues

The book further explores basic statistical concepts that, when misapplied, can be the source of errors. Time and again, big data is treated as an oracle that discovers results nobody would have imagined. While big data can serve this valuable function, all too often these results are incorrect, yet are still reported unquestioningly. The probability of having erroneous results increases as a larger number of variables are compared unless preventative measures are taken.
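The point about comparing many variables can be made concrete with a little arithmetic. The sketch below is not taken from the book; it assumes the tests are independent, uses a hypothetical significance level of 0.05, and picks the Bonferroni correction as one example of a preventative measure. The function names are illustrative only.

```python
def familywise_error_rate(alpha: float, comparisons: int) -> float:
    """Probability of at least one false positive across independent tests.

    Assumes every test is independent and run at the same level `alpha`.
    """
    return 1.0 - (1.0 - alpha) ** comparisons


def bonferroni_threshold(alpha: float, comparisons: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / comparisons


if __name__ == "__main__":
    alpha = 0.05  # assumed significance level, chosen for illustration
    for m in (1, 10, 100):
        uncorrected = familywise_error_rate(alpha, m)
        per_test = bonferroni_threshold(alpha, m)
        corrected = familywise_error_rate(per_test, m)
        print(f"{m:>3} comparisons: uncorrected={uncorrected:.3f}, "
              f"per-test threshold={per_test:.5f}, corrected={corrected:.3f}")
```

Under these assumptions, the chance of at least one spurious finding climbs from 5% for a single comparison to roughly 40% for ten, which is why it can be worth asking analysts how many comparisons were run and whether any correction was applied.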

The authors explain these concepts so that managers can ask their analysts and vendors better questions about whether the methods used to arrive at a conclusion are appropriate. Because the world of science and medicine has been grappling with similar issues in the publication of studies, the authors draw on those fields' efforts and apply them to big data.

Table of Contents

Introduction

So What Is Big Data?

Growing Interest in Decision Making

What This Book Addresses

The Conversation about Big Data

Technological Change as a Driver of Big Data

The Central Question: So What?

Our Goals as Authors

References

The Mother of Invention’s Triplets: Moore’s Law, the Proliferation of Data, and Data Storage Technology

Moore’s Law

Parallel Computing, Between and Within Machines

Quantum Computing

Recap of Growth in Computing Power

Storage, Storage Everywhere

Grist for the Mill: Data Used and Unused

Agriculture

Automotive

Marketing in the Physical World

Online Marketing

Asset Reliability and Efficiency

Process Tracking and Automation

Toward a Definition of Big Data

Putting Big Data in Context

Key Concepts of Big Data and Their Consequences

Summary

References

Hadoop

Power through Distribution

Cost Effectiveness of Hadoop

Not Every Problem Is a Nail

Some Technical Aspects

Troubleshooting Hadoop

Running Hadoop

Hadoop File System

MapReduce

Pig and Hive

Installation

Current Hadoop Ecosystem

Hadoop Vendors

Cloudera

Amazon Web Services (AWS)

Hortonworks

IBM

Intel

MapR

Microsoft

To Run Pig Latin Using Powershell

Pivotal

References

HBase and Other Big Data Databases

Evolution from Flat File to the Three V’s

Flat File

Hierarchical Database

Network Database

Relational Database

Object-Oriented Databases

Relational-Object Databases

Transition to Big Data Databases

What Is Different about HBase?

What Is Bigtable?

What Is MapReduce?

What Are the Various Modalities for Big Data Databases?

Graph Databases

How Does a Graph Database Work?

What Is the Performance of a Graph Database?

Document Databases

Key-Value Databases

Column-Oriented Databases

HBase

Apache Accumulo

References

Machine Learning

Machine Learning Basics

Classifying with Nearest Neighbors

Naive Bayes

Support Vector Machines

Improving Classification with Adaptive Boosting

Regression

Logistic Regression

Tree-Based Regression

K-Means Clustering

Apriori Algorithm

Frequent Pattern-Growth

Principal Component Analysis (PCA)

Singular Value Decomposition

Neural Networks

Big Data and MapReduce

Data Exploration

Spam Filtering

Ranking

Predictive Regression

Text Regression

Multidimensional Scaling

Social Graphing

References

Statistics

Statistics, Statistics Everywhere

Digging into the Data

Standard Deviation: The Standard Measure of Dispersion

The Power of Shapes: Distributions

Distributions: Gaussian Curve

Distributions: Why Be Normal?

Distributions: The Long Arm of the Power Law

The Upshot? Statistics Are Not Bloodless

Fooling Ourselves: Seeing What We Want to See in the Data

We Can Learn Much from an Octopus

Hypothesis Testing: Seeking a Verdict

Two-Tailed Testing

Hypothesis Testing: A Broad Field

Moving on to Specific Hypothesis Tests

Regression and Correlation

p Value in Hypothesis Testing: A Successful Gatekeeper?

Specious Correlations and Overfitting the Data

A Sample of Common Statistical Software Packages

Minitab

SPSS

R

SAS

Big Data Analytics

Hadoop Integration

Angoss

Statistica

Capabilities

Summary

References

Google

Big Data Giants

Google

Go

Android

Google Product Offerings

Google Analytics

Advertising and Campaign Performance

Analysis and Testing

Facebook

Ning

Non-United States Social Media

Tencent

Line

Sina Weibo

Odnoklassniki

Vkontakte

Nimbuzz

Ranking Network Sites

Negative Issues with Social Networks

Amazon

Some Final Words

References

Geographic Information Systems (GIS)

GIS Implementations

A GIS Example

GIS Tools

GIS Databases

References

Discovery

Faceted Search versus Strict Taxonomy

First Key Ability: Breaking Down Barriers

Second Key Ability: Flexible Search and Navigation

Underlying Technology

The Upshot

Summary

References

Data Quality

Know Thy Data and Thyself

Structured, Unstructured, and Semistructured Data

Data Inconsistency: An Example from This Book

The Black Swan and Incomplete Data

How Data Can Fool Us

Ambiguous Data

Aging of Data or Variables

Missing Variables May Change the Meaning

Inconsistent Use of Units and Terminology

Biases

Sampling Bias

Publication Bias

Survivorship Bias

Data as a Video, Not a Snapshot: Different Viewpoints as a Noise Filter

What Is My Toolkit for Improving My Data?

Ishikawa Diagram

Interrelationship Digraph

Force Field Analysis

Data-Centric Methods

Troubleshooting Queries from Source Data

Troubleshooting Data Quality beyond the Source System

Using Our Hidden Resources

Summary

References

Benefits

Data Serendipity

Converting Data Dreck to Usefulness

Sales

Returned Merchandise

Security

Medical

Travel

Lodging

Vehicle

Meals

Geographical Information Systems

New York City

Chicago CLEARMAP

Baltimore

San Francisco

Los Angeles

Tucson, Arizona, University of Arizona, and COPLINK

Social Networking

Education

General Educational Data

Legacy Data

Grades and Other Indicators

Testing Results

Addresses, Phone Numbers, and More

Concluding Comments

References

Concerns

Part Two: Basic Principles of National Application

Collection Limitation Principle

Data Quality Principle

Purpose Specification Principle

Use Limitation Principle

Security Safeguards Principle

Openness Principle

Individual Participation Principle

Accountability Principle

Logical Fallacies

Affirming the Consequent

Denying the Antecedent

Ludic Fallacy

Cognitive Biases

Confirmation Bias

Notational Bias

Selection/Sample Bias

Halo Effect

Consistency and Hindsight Biases

Congruence Bias

Von Restorff Effect

Data Serendipity

Converting Data Dreck to Usefulness

Sales

Merchandise Returns

Security

CompStat

Medical

Travel

Lodging

Vehicle

Meals

Social Networking

Education

Making Yourself Harder to Track

Misinformation

Disinformation

Reducing/Eliminating Profiles

Social Media

Self Redefinition

Identity Theft

Facebook

Concluding Comments

References

Epilogue

Michael Porter’s Five Forces Model

Bargaining Power of Customers

Bargaining Power of Suppliers

Threat of New Entrants

Others

The OODA Loop

Implementing Big Data

Nonlinear, Qualitative Thinking

Closing

References

About the Authors

Kim H. Pries has four college degrees: a bachelor of arts in history from the University of Texas at El Paso (UTEP), a bachelor of science in metallurgical engineering from UTEP, a master of science in engineering from UTEP, and a master of science in metallurgical engineering and materials science from Carnegie Mellon University.

Pries worked as a computer systems manager, a software engineer for an electrical utility, and a scientific programmer under a defense contract for Stoneridge, Incorporated (SRI). He has worked as software manager, engineering services manager, reliability section manager, and product integrity and reliability director.

In addition to his other responsibilities, Pries has provided Six Sigma training for both UTEP and SRI and cost reduction initiatives for SRI. Pries is also a founding faculty member of Practical Project Management. Additionally, in concert with Jon Quigley, Pries was a cofounder and principal with Value Transformation, LLC, a training, testing, cost improvement, and product development consultancy.

He trained for Introduction to Engineering Design and Computer Science and Software Engineering with Project Lead the Way. He currently teaches biotechnology, computer science and software engineering, and introduction to engineering design at the beautiful Parkland High School in the Ysleta Independent School District of El Paso, Texas.

Robert Dunnigan is a manager with Janus Consulting Partners and is based in Dallas, Texas. He holds a bachelor of science in psychology and in sociology with an anthropology emphasis from North Dakota State University. He also holds a master of business administration from INSEAD, "the business school for the world," where he attended the Singapore campus.

As a Peace Corps volunteer, Robert served over 3 years in Honduras developing agribusiness opportunities. As a consultant, he later worked on the Afghanistan Small and Medium Enterprise Development project in Afghanistan, where he traveled the country with his Afghan colleagues and friends seeking opportunities to develop a manufacturing sector in the country.

Robert is an American Society for Quality–certified Six Sigma Black Belt and a Scrum Alliance–certified Scrum Master.

Subject Categories

BISAC Subject Codes/Headings:
COM021000
COMPUTERS / Database Management / General
COM021030
COMPUTERS / Database Management / Data Mining
COM032000
COMPUTERS / Information Technology