Cybersecurity Analytics is for the cybersecurity student and professional who wants to learn data science techniques critical for tackling cybersecurity challenges, and for the data science student and professional who wants to learn about cybersecurity adaptations. Trying to build a malware detector or a phishing email detector, or just interested in finding patterns in your datasets? This book can help you do it on your own. Numerous examples and dataset links are included so that the reader can "learn by doing." Anyone who has taken a basic college-level calculus course and has some knowledge of probability can easily understand most of the material.
The book covers unsupervised learning, semi-supervised learning, supervised learning, text mining, natural language processing, and more. It also includes background on security, statistics, and linear algebra. The website for the book contains a listing of datasets, updates, and other resources for serious practitioners.
Table of Contents
Preface
1 Introduction
2 What is Data Analytics? 2.1 Data Ingestion 2.2 Data Processing and Cleaning 2.3 Visualization and Exploratory Analysis 2.3.1 Scatterplots 2.4 Pattern Recognition 2.4.1 Classification 2.4.2 Clustering 2.5 Feature extraction 2.5.1 Feature Selection 2.5.2 Random Projections 2.6 Modeling 2.6.1 Model Specification 2.6.2 Model Selection and Fitting 2.7 Evaluation 2.8 Strengths and Limitations 2.8.1 The Curse of Dimensionality
3 Security: Basics and Security Analytics 3.1 Basics of Security 3.1.1 Know Thy Enemy – Attackers and Their Motivations 3.1.2 Security Goals 3.2 Mechanisms for Ensuring Security Goals 3.2.1 Confidentiality 3.2.2 Integrity 3.2.3 Availability 3.2.4 Authentication 3.2.5 Access Control 3.2.6 Accountability 3.2.7 Non-repudiation 3.3 Threats, Attacks and Impacts 3.3.1 Passwords 3.3.2 Malware 3.3.3 Spam, Phishing and its Variants 3.3.4 Intrusions 3.3.5 Internet Surfing 3.3.6 System Maintenance and Firewalls 3.3.7 Other Vulnerabilities 3.3.8 Protecting Against Attacks 3.4 Applications of Data Science to Security Challenges 3.4.1 Cybersecurity Datasets 3.4.2 Data Science Applications 3.4.3 Passwords 3.4.4 Malware 3.4.5 Intrusions 3.4.6 Spam/Phishing 3.4.7 Credit Card Fraud/Financial Fraud 3.4.8 Opinion Spam 3.4.9 Denial of Service 3.5 Security Analytics and Why Do We Need It
4 Statistics 4.1 Probability Density Estimation 4.2 Models 4.2.1 Poisson 4.2.2 Uniform 4.2.3 Normal 4.3 Parameter Estimation 4.3.1 The Bias-Variance Trade-Off 4.4 The Law of Large Numbers and the Central Limit Theorem 4.5 Confidence Intervals 4.6 Hypothesis Testing 4.7 Bayesian Statistics 4.8 Regression 4.8.1 Logistic Regression 4.9 Regularization 4.10 Principal Components 4.11 Multidimensional Scaling 4.12 Procrustes 4.13 Nonparametric Statistics 4.14 Time Series
5 Data Mining – Unsupervised Learning 5.1 Data Collection 5.2 Types of Data and Operations 5.2.1 Properties of Datasets 5.3 Data Exploration and Preprocessing 5.3.1 Data Exploration 5.3.2 Data Preprocessing/Wrangling 5.4 Data Representation 5.5 Association Rule Mining 5.5.1 Variations on the Apriori Algorithm 5.6 Clustering 5.6.1 Partitional Clustering 5.6.2 Choosing K 5.6.3 Variations on K-means Algorithm 5.6.4 Hierarchical Clustering 5.6.5 Other Clustering Algorithms 5.6.6 Measuring the Clustering Quality 5.6.7 Clustering Miscellany: Clusterability, Robustness, Incremental, 5.7 Manifold Discovery 5.7.1 Spectral Embedding 5.8 Anomaly Detection 5.8.1 Statistical Methods 5.8.2 Distance-based Outlier Detection 5.8.3 kNN based approach 5.8.4 Density-based Outlier Detection 5.8.5 Clustering-based Outlier Detection 5.8.6 One-class learning based Outliers 5.9 Security Applications and Adaptations 5.9.1 Data Mining for Intrusion Detection 5.9.2 Malware Detection 5.9.3 Stepping-stone Detection 5.9.4 Malware Clustering 5.9.5 Directed Anomaly Scoring for Spear Phishing Detection 5.10 Concluding Remarks and Further Reading
6 Machine Learning – Supervised Learning 6.1 Fundamentals of Supervised Learning 6.2 The Bayes Classifier 6.2.1 Naïve Bayes 6.3 Nearest Neighbors Classifiers 6.4 Linear Classifiers 6.5 Decision Trees and Random Forests 6.5.1 Random Forest 6.6 Support Vector Machines 6.7 Semi-Supervised Classification 6.8 Neural Networks and Deep Learning 6.8.1 Perceptron 6.8.2 Neural Networks 6.8.3 Deep Networks 6.9 Topological Data Analysis 6.10 Ensemble Learning 6.10.1 Majority 6.10.2 Adaboost 6.11 One-class Learning 6.12 Online Learning 6.13 Adversarial Machine Learning 6.13.1 Adversarial Examples 6.13.2 Adversarial Training 6.13.3 Adversarial Generation 6.13.4 Beyond Continuous Data 6.14 Evaluation of Machine Learning 6.14.1 Cost-sensitive Evaluation 6.14.2 New Metrics for Unbalanced Datasets 6.15 Security Applications and Adaptations 6.15.1 Intrusion Detection 6.15.2 Malware Detection 6.15.3 Spam and Phishing Detection 6.16 For Further Reading
7 Text Mining 7.1 Tokenization 7.2 Preprocessing 7.3 Bag-Of-Words 7.4 Vector space model 7.4.1 Weighting 7.5 Latent Semantic Indexing 7.6 Embedding 7.7 Topic Models: Latent Dirichlet Allocation 7.8 Sentiment Analysis
8 Natural Language Processing 8.1 Challenges of NLP 8.2 Basics of Language Study and NLP Techniques 8.3 Text Preprocessing 8.4 Feature Engineering on Text Data 8.4.1 Morphological, Word and Phrasal Features 8.4.2 Clausal and Sentence Level Features 8.4.3 Statistical Features 8.5 Corpus-based Analysis 8.6 Advanced NLP Tasks 8.6.1 Part of Speech Tagging 8.6.2 Word sense Disambiguation 8.6.3 Language Modeling 8.6.4 Topic Modeling 8.7 Sequence to Sequence Tasks 8.8 Knowledge Bases and Frameworks 8.9 Natural Language Generation 8.10 Issues with Pipelining 8.11 Security Applications of NLP 8.11.1 Password Checking 8.11.2 Email Spam Detection 8.11.3 Phishing Email Detection 8.11.4 Malware Detection 8.11.5 Attack Generation
9 Big Data Techniques and Security 9.1 Key terms 9.2 Ingesting the Data 9.3 Persistent Storage 9.4 Computing and Analyzing 9.5 Techniques for Handling Big Data 9.6 Visualizing 9.7 Streaming Data 9.8 Big Data Security 9.8.1 Implications of Big Data Characteristics on Security and Privacy 9.8.2 Mechanisms for Big Data Security Goals
A Linear Algebra Basics A.1 Vectors A.2 Matrices A.2.1 Eigenvectors and Eigenvalues A.2.2 The Singular Value Decomposition
B Graphs B.1 Graph Invariants B.2 The Laplacian
C Probability C.1 Probability C.1.1 Conditional Probability and Bayes’ Rule C.1.2 Base Rate Fallacy C.1.3 Expected Values and Moments C.1.4 Distribution Functions and Densities C.2 Models C.2.1 Bernoulli and Binomial C.2.2 Multinomial C.2.3 Uniform
Bibliography
Author Index
Index
Rakesh Verma is a professor of computer science at the University of Houston, where he leads a research group that applies reasoning and data science to cybersecurity challenges. He teaches a course on security analytics that includes some of the material here. Since 2015, he has been co-organizing and editing the proceedings of the ACM International Workshop on Security and Privacy Analytics. He is an editor of Frontiers of Big Data in the Cybersecurity Area, an ACM Distinguished Speaker (2011-2018), and the winner of two Best Paper Awards. He received the Lifetime Mentoring Award from the University of Houston, and he is a Fulbright Senior Specialist in Computer Science.
David Marchette is a principal scientist at the Naval Surface Warfare Center, Dahlgren Division, where he is responsible for leading basic and applied research projects in computational statistics, graph theory, network analysis, pattern recognition, computer intrusion detection, and text analysis. He is a fellow of the American Statistical Association (ASA) and the American Association for the Advancement of Science (AAAS) and an elected member of the International Statistical Institute (ISI).