Introduction to Data Science

Data Analysis and Prediction Algorithms with R, 1st Edition

By Rafael A. Irizarry

Chapman and Hall/CRC

784 pages

Hardback: 9780367357986
pub: 2019-10-22

Description

Introduction to Data Science: Data Analysis and Prediction Algorithms with R introduces concepts and skills that can help you tackle real-world data analysis challenges. It covers concepts from probability, statistical inference, linear regression, and machine learning. It also helps you develop skills such as R programming, data wrangling, data visualization, predictive algorithm building, file organization with UNIX/Linux shell, version control with Git and GitHub, and reproducible document preparation.

This book is a textbook for a first course in data science. No previous knowledge of R is necessary, although some experience with programming may be helpful. The book is divided into six parts: R, data visualization, statistics with R, data wrangling, machine learning, and productivity tools. Each part has several chapters, each meant to be presented as one lecture.

The author uses motivating case studies that realistically mimic a data scientist’s experience. He starts by asking specific questions and answers these through data analysis so concepts are learned as a means to answering the questions. Examples of the case studies included are: US murder rates by state, self-reported student heights, trends in world health and economics, the impact of vaccines on infectious disease rates, the financial crisis of 2007-2008, election forecasting, building a baseball team, image processing of hand-written digits, and movie recommendation systems.

The statistical concepts used to answer the case study questions are only briefly introduced, so complementing with a probability and statistics textbook is highly recommended for in-depth understanding of these concepts. If you read and understand the chapters and complete the exercises, you will be prepared to learn the more advanced concepts and skills needed to become an expert.

Reviews

"I think the book would be perfect for schools looking to make a transition to a model where introduction to data science takes the place of introduction to statistics and maybe introductory computer science." ~Arend Kuyper, Northwestern University

Table of Contents

I R

1. Installing R and RStudio

Installing R

Installing RStudio

2. Getting Started with R and RStudio

Why R?

The R console

Scripts

RStudio

The panes

Key bindings

Running commands while editing scripts

Changing global options

Installing R packages

3. R Basics

Case study: US Gun Murders

The very basics

Objects

The workspace

Functions

Other prebuilt objects

Variable names

Saving your workspace

Motivating scripts

Commenting your code

Exercises

Data types

Data frames

Examining an object

The accessor: $

Vectors: numerics, characters, and logical

Factors

Lists

Matrices

Exercises

Vectors

Creating vectors

Names

Sequences

Subsetting

Coercion

Not availables (NA)

Exercises

Sorting

sort

order

max and which.max

rank

Beware of recycling

Exercise

Vector arithmetics

Rescaling a vector

Two vectors

Exercises

Indexing

Subsetting with logicals

Logical operators

which

match

%in%

Exercises

Basic plots

plot

hist

boxplot

image

Exercises

4. Programming basics

Conditional expressions

Defining functions

Namespaces

For-loops

Vectorization and functionals

Exercises

5. The tidyverse

Tidy data

Exercises

Manipulating data frames

Adding a column with mutate

Subsetting with filter

Selecting columns with select

Exercises

The pipe: %>%

Exercises

Summarizing data

summarize

pull

Group then summarize with group_by

Sorting data frames

Nested sorting

The top n

Exercises

Tibbles

Tibbles display better

Subsets of tibbles are tibbles

Tibbles can have complex entries

Tibbles can be grouped

Create a tibble using tibble instead of data.frame

The dot operator

do

The purrr package

Tidyverse conditionals

case_when

between

Exercises

6. Importing data

Paths and the working directory

The filesystem

Relative and full paths

The working directory

Generating path names

Copying files using paths

The readr and readxl packages

readr

readxl

Exercises

Downloading files

R-base importing functions

scan

Text versus binary files

Unicode versus ASCII

Organizing Data with Spreadsheets

Exercises

II Data Visualization

7. Introduction to data visualization

8. ggplot2

The components of a graph

ggplot objects

Geometries

Aesthetic mappings

Layers

Tinkering with arguments

Global versus local aesthetic mappings

Scales

Labels and titles

Categories as colors

Annotation, shapes, and adjustments

Add-on packages

Putting it all together

Quick plots with qplot

Grids of plots

Exercises

9. Visualizing data distributions

Variable types

Case study: describing student heights

Distribution function

Cumulative distribution functions

Histograms

Smoothed density

Interpreting the y-axis

Densities permit stratification

Exercises

The normal distribution

Standard units

Quantile-quantile plots

Percentiles

Boxplots

Stratification

Case study: describing student heights (continued)

Exercises

ggplot2 geometries

Barplots

Histograms

Density plots

Boxplots

QQ-plots

Images

Quick plots

Exercises

10. Data visualization in practice

Case study: new insights on poverty

Hans Rosling’s quiz

Scatterplots

Faceting

facet_wrap

Fixed scales for better comparisons

Time series plots

Labels instead of legends

Data transformations

Log transformation

Which base?

Transform the values or the scale?

Visualizing multimodal distributions

Comparing multiple distributions with boxplots and ridge plots

Boxplots

Ridge plots

Example: 1970 versus 2010 income distributions

Accessing computed variables

Weighted densities

The ecological fallacy and importance of showing the data

Logistic transformation

Show the data

11. Data visualization principles

Encoding data using visual cues

Know when to include

Do not distort quantities

Order categories by a meaningful value

Show the data

Ease comparisons

Use common axes

Align plots vertically to see horizontal changes and horizontally to see vertical changes

Consider transformations

Visual cues to be compared should be adjacent

Use color

Think of the color blind

Plots for two variables

Slope charts

Bland-Altman plot

Encoding a third variable

Avoid pseudo-three-dimensional plots

Avoid too many significant digits

Know your audience

Exercises

Case study: impact of vaccines on battling infectious diseases

Exercises

12. Robust summaries

Outliers

Median

The inter quartile range (IQR)

Tukey’s definition of an outlier

Median absolute deviation

Exercises

Case study: self-reported student heights

III Statistics with R

13. Introduction to Statistics with R

14. Probability

Discrete probability

Relative frequency

Notation

Probability distributions

Monte Carlo simulations for categorical data

Setting the random seed

With and without replacement

Independence

Conditional probabilities

Addition and multiplication rules

Multiplication rule

Multiplication rule under independence

Addition rule

Combinations and permutations

Monte Carlo example

Examples

Monty Hall problem

Birthday problem

Infinity in practice

Exercises

Continuous probability

Theoretical continuous distributions

Theoretical distributions as approximations

The probability density

Monte Carlo simulations for continuous variables

Continuous distributions

Exercises

15. Random variables

Random variables

Sampling models

The probability distribution of a random variable

Distributions versus probability distributions

Notation for random variables

The expected value and standard error

Population SD versus the sample SD

Central Limit Theorem

How large is large in the Central Limit Theorem

Statistical properties of averages

Law of large numbers

Misinterpreting law of averages

Exercises

Case study: The Big Short

Interest rates explained with chance model

The Big Short

Exercises

16. Statistical Inference

Polls

The sampling model for polls

Populations, samples, parameters and estimates

The sample average

Parameters

Polling versus forecasting

Properties of our estimate: expected value and standard error

Exercises

Central Limit Theorem in practice

A Monte Carlo simulation

The spread

Bias: why not run a very large poll?

Exercises

Confidence intervals

A Monte Carlo simulation

The correct language

Exercises

Power

p-values

Association Tests

Lady Tasting Tea

Two-by-two tables

Chi-square Test

The odds ratio

Confidence intervals for the odds ratio

Small count correction

Large samples, small p-values

Exercises

17. Statistical models

Poll aggregators

Poll data

Pollster bias

Data driven models

Exercises

Bayesian statistics

Bayes theorem

Bayes Theorem simulation

Bayes in practice

Hierarchical models

Exercises

Case study: Election forecasting

Bayesian approach

The general bias

Mathematical representations of models

Predicting the electoral college

Forecasting

Exercises

The t-distribution

18. Regression

Case study: is height hereditary?

The correlation coefficient

Sample correlation is a random variable

Correlation is not always a useful summary

Conditional expectations

The regression line

Regression improves precision

Bivariate normal distribution (advanced)

Variance explained

Warning: there are two regression lines

Exercises

19. Linear Models

Case Study: Moneyball

Sabermetrics

Baseball basics

No awards for BB

Base on Balls or Stolen Bases?

Regression applied to baseball statistics

Confounding

Understanding confounding through stratification

Multivariate regression

Least Squares Estimates

Interpreting linear models

Least Squares Estimates (LSE)

The lm function

LSE are random variables

Predicted values are random variables

Exercises

Linear regression in the tidyverse

The broom package

Exercises

Case study: Moneyball (continued)

Adding salary and position information

Picking 9 players

The regression fallacy

Measurement error models

Exercises

20. Association is not causation

Spurious correlation

Outliers

Reversing cause and effect

Confounders

Example: UC Berkeley admissions

Confounding explained graphically

Average after stratifying

Simpson’s paradox

Exercises

IV Data Wrangling

21. Introduction to Data Wrangling

22. Reshaping data

gather

spread

separate

unite

Exercises

23. Joining tables

Joins

Left join

Right join

Inner join

Full join

Semi join

Anti-join

Binding

Binding columns

Binding by rows

Set operators

Intersect

Union

setdiff

setequal

Exercises

24. Web Scraping

HTML

The rvest package

CSS selectors

JSON

Exercises

25. String Processing

The stringr package

Case study 1: US murders data

Case study 2: self reported heights

How to escape when defining strings

Regular expressions

Strings are a regexp

Special characters

Character classes

Anchors

Quantifiers

White space \s

Quantifiers: *, ?, +

Groups

Search and replace with regex

Search and replace using groups

Testing and improving

Trimming

Changing lettercase

Case study 2: self reported heights (continued)

The extract function

Putting it all together

String splitting

Case study 3: extracting tables from a PDF

Recoding

Exercises

26. Parsing Dates and Times

The date data type

The lubridate package

Exercises

27. Text mining

Case study: Trump tweets

Text as data

Sentiment analysis

Exercises

V Machine Learning

28. Introduction to Machine Learning

Notation

An example

Exercises

Evaluation Metrics

Training and test sets

Overall accuracy

The confusion matrix

Sensitivity and specificity

Balanced accuracy and F1 score

Prevalence matters in practice

ROC and precision-recall curves

The loss function

Exercises

Conditional probabilities and expectations

Conditional probabilities

Conditional expectations

Conditional expectation minimizes squared loss function

Exercises

Case study: is it a 2 or a 7?

29. Smoothing

Bin smoothing

Kernels

Local weighted regression (loess)

Fitting parabolas

Beware of default smoothing parameters

Connecting smoothing to machine learning

Exercises

30. Cross validation

Motivation with k-nearest neighbors

Over-training

Over-smoothing

Picking the k in kNN

Mathematical description of cross validation

K-fold cross validation

Exercises

Bootstrap

Exercises

31. The caret package

The caret train function

Cross validation

Example: fitting with loess

32. Examples of algorithms

Linear regression

The predict function

Exercises

Logistic regression

Generalized Linear Models

Logistic regression with more than one predictor

Exercises

k-nearest neighbors

Exercises

Generative models

Naive Bayes

Controlling prevalence

Quadratic Discriminant Analysis

Linear discriminant analysis

Connection to distance

Case study: more than three classes

Exercises

Classification and Regression Trees (CART)

The curse of dimensionality

CART motivation

Regression trees

Classification (decision) trees

Random Forests

Exercises

33. Machine learning in practice

Preprocessing

k-Nearest Neighbor and Random Forest

Variable importance

Visual assessments

Ensembles

Exercises

34. Large datasets

Matrix algebra

Notation

Converting a vector to a matrix

Row and column summaries

apply

Filtering columns based on summaries

Indexing with matrices

Binarizing the data

Vectorization for matrices

Matrix algebra operations

Exercises

Distance

Euclidean distance

Distance in higher dimensions

Euclidean distance example

Predictor Space

Distance between predictors

Exercises

Dimension reduction

Preserving distance

Linear transformations (advanced)

Orthogonal transformations (advanced)

Principal Component Analysis

Iris Example

MNIST Example

Exercises

Recommendation systems

Movielens data

Recommendation systems as a machine learning challenge

Loss function

A first model

Modeling movie effects

User effects

Exercises

Regularization

Motivation

Penalized Least Squares

Choosing the penalty terms

Exercises

Matrix factorization

Factors analysis

Connection to SVD and PCA

Exercises

35. Clustering

Hierarchical clustering

k-means

Heatmaps

Filtering features

Exercises

VI Productivity tools

36. Introduction to productivity tools

37. Accessing the terminal and installing Git

Accessing the terminal on a Mac

Installing Git on the Mac

Installing Git and Git Bash on Windows

Accessing the terminal on Windows

38. Organizing with Unix

Naming convention

The terminal

The filesystem

Directories and subdirectories

The home directory

Working directory

Paths

Unix commands

ls: Listing directory content

mkdir and rmdir: make and remove a directory

cd: Navigating the filesystem by changing directories

Some examples

More Unix commands

mv: moving files

cp: copying files

rm: removing files

less: looking at a file

Preparing for a data science project

Advanced Unix

Arguments

Getting help

Pipes

Wild cards

Environment variables

Shells

Executables

Permissions and file types

Commands you should learn

File manipulation in R

39. Git and GitHub

Why use Git and GitHub?

GitHub accounts

GitHub repositories

Overview of Git

Clone

Initializing a Git directory

Using Git and GitHub in RStudio

40. Reproducible projects with RStudio and R markdown

RStudio projects

R markdown

The header

R code chunks

Global options

knitR

More on R markdown

Organizing a data science project

Create directories in Unix

Create an RStudio project

Edit some R Scripts

Create some more directories using Unix

Add a README file

Initializing a Git directory

Add, commit and push files using RStudio

About the Author

Rafael A. Irizarry is professor of data sciences at the Dana-Farber Cancer Institute, professor of biostatistics at Harvard, and a fellow of the American Statistical Association. Dr. Irizarry is an applied statistician and during the last 20 years has worked in diverse areas, including genomics, sound engineering, and public health. He disseminates solutions to data analysis challenges as open source software, tools that are widely downloaded and used. Prof. Irizarry has also developed and taught several data science courses at Harvard as well as popular online courses.

About the Series

Chapman & Hall/CRC Data Science Series

Reflecting the interdisciplinary nature of the field, this new data science book series brings together researchers, practitioners, and instructors from statistics, computer science, machine learning, and analytics. The series will publish cutting-edge research, industry applications, and textbooks in data science.

Features:
* Presents the latest research and applications in the field, including new statistical and computational techniques
* Covers a broad range of interdisciplinary topics
* Provides guidance on the use of software for data science, including R, Python, and Julia
* Includes both introductory and advanced material for students and professionals
* Presents concepts while assuming minimal theoretical background

The scope of the series is broad, including titles in machine learning, pattern recognition, artificial intelligence, predictive analytics, business analytics, visualization, programming, software, learning analytics, data collection and wrangling, interactive graphics, reproducible research, and more. The inclusion of examples, applications, and code implementation is essential.

Subject Categories

BISAC Subject Codes/Headings:
MAT029000
MATHEMATICS / Probability & Statistics / General