Statistical Machine Learning

Overview

A crash course offering a gentle introduction to machine learning techniques and their applications in data science.

Topics Covered:

Basic Concepts
Data Models
Machine Learning
ML Techniques
Software Toolkit
Continued Learning

Basic Concepts

Definitions:

Statistical Machine Learning is a set of tools used to model and understand complex data sets
Data Science is a set of techniques in computing to support the analysis of data
- Not very useful without some domain knowledge: it is important to know your data.
Includes analytic techniques:
- descriptive statistics
- data visualization
- statistical machine learning
- neural networks
- actor-environment models
Also includes computational techniques:
- database administration
- management of information systems
- parallelization ${\textstyle \rightarrow }$ high performance computing

Basic Concepts

Knowing your data

Technical definition:

Let ${\textstyle n}$ represent the number of distinct observations, and let ${\textstyle p}$ represent the number of predictors. Then our observed data ${\textstyle {\textbf {X}}}$ is an ${\textstyle n\times p}$ matrix with row observation vectors ${\textstyle {\vec {x}}_{1..n}}$ and column predictor vectors ${\textstyle {\vec {x}}_{1..p}}$ .

In addition, we will also have response variable(s) ${\textstyle {\textbf {Y}}}$ , made up of one or more ${\textstyle n}$ -length vectors.

So, our combined dataset consists of $[({\vec {x}}_{1},{\vec {y}}_{1}),({\vec {x}}_{2},{\vec {y}}_{2}),...,({\vec {x}}_{n},{\vec {y}}_{n})]$
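
As a minimal sketch of this layout, assuming a made-up data set with three observations and two predictors (the values and variable names are illustrative only):

```python
# Hypothetical data: n = 3 observations (rows), p = 2 predictors (columns).
X = [
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 3.4],
]
Y = [0, 0, 1]  # one response value per observation

n = len(X)      # number of observations
p = len(X[0])   # number of predictors

# Row observation vectors are the rows of X as stored.
observations = X

# Column predictor vectors are obtained by transposing X.
predictors = [list(column) for column in zip(*X)]

# The combined data set pairs each observation with its response.
dataset = list(zip(X, Y))
```

Keeping rows as observations and columns as predictors matches the table convention most data libraries expect.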

Our Mission: determine relationships between ${\textstyle {\textbf {X}}}$ and ${\textstyle {\textbf {Y}}}$ which are mathematically sound, leading to better understanding

Typically a table has columns as features, rows as entries

Entries might be numeric or categorical.

Data sources are either Structured or Unstructured:

Unstructured data will require some transformation.

Some data may also be time series, taking a sampling of points over time and contributing to a 3-dimensional Data Cube.

Several techniques can be used to reduce complex data:

numeric representation: mapping of categorical information into numbers.
scaling: redefine a new range for a predictor vector.
normalization: redefine a predictor by its mean and standard deviation, giving values with mean 0 and standard deviation 1; the shape of the distribution is unchanged.
dimension reduction: lose fine grain of data, but gain understandability.
feature extraction: a data mining technique in which we can generate new predictors from known information.
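
The first three techniques can be sketched in a few lines of standard-library Python. The function names and sample values below are hypothetical, chosen only for illustration:

```python
import statistics

def encode_categories(values):
    """Numeric representation: map each distinct category to an integer code."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))  # first-seen order
    return [codes[value] for value in values], codes

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Scaling: linearly map a predictor onto the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

def standardize(values):
    """Normalization: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

colors = ["red", "green", "red", "blue"]
heights_cm = [150.0, 160.0, 170.0, 180.0]

codes, mapping = encode_categories(colors)
scaled = min_max_scale(heights_cm)
z_scores = standardize(heights_cm)
```

Integer codes imply an ordering that the categories may not have, so for unordered categories a one-hot encoding is often the safer choice.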

Modeling

What is a model?

Very well-known model: Gravity is a functional model between masses, distances, and force.

$F=G{\frac {m_{1}m_{2}}{r^{2}}}\rightarrow g={\frac {GM}{r^{2}}}\rightarrow v(t)=v(0)-gt.$

Another: the Arrhenius equation is a functional model between temperature and reaction rate.

$k=Ae^{-{\frac {E_{a}}{k_{B}T}}}.$

Statistics definition:

Let ${\textstyle X=({\vec {x}}_{1},{\vec {x}}_{2},...,{\vec {x}}_{p})}$ each of length ${\textstyle n}$ , and ${\textstyle Y=({\vec {y}})}$ of length ${\textstyle n}$ . Then for ${\textstyle X}$ and ${\textstyle Y}$ we assume a relationship with a systematic function ${\textstyle f}$ and an error term ${\textstyle \varepsilon }$ :

$Y=f(X)+\varepsilon$

Why do we even estimate ${\textstyle f}$ at all? Prediction or Inference.

Predictive models create an estimator ${\textstyle {\hat {f}}}$ which we can use to estimate ${\textstyle Y}$ using a sample ${\textstyle X}$ from a larger population:

${\hat {Y}}={\hat {f}}(X)$

With error: ${\textstyle E(Y-{\hat {Y}})^{2}=[f(X)-{\hat {f}}(X)]^{2}+\mathrm {Var} (\varepsilon )}$ , where the first term is reducible and the second is irreducible.

Inference models are primarily interested in how ${\textstyle Y}$ is affected by ${\textstyle X}$ :

Which predictors are associated with the response?
What is the relationship of each predictor to the response?
What is the overall nature of the relationship between ${\textstyle Y}$ and the predictors?

Signal vs Noise

Consider precision and accuracy.

Both contribute to a data set.

Noise is variation in the data which detracts from constructing information, as opposed to signal: data which is representative of the system under study and contains information.

A high signal-to-noise ratio allows us to minimize reducible error, which is caused by sampling technique.

This is different from irreducible error, which is created by factors we are not measuring.

Error and Fit

In modeling terms, the precision of a model corresponds to its variance, and the accuracy of a model to its degree of bias.

Generally, overly complex models generate high variance and can over-fit to the input data, making the model useless on new data.

Generally, Mean Squared Error (MSE) is used to determine goodness-of-fit for model calibration: ${\textstyle MSE={\frac {1}{n}}\sum _{i=1}^{n}(y_{i}-{\hat {f}}(x_{i}))^{2}}$ .

The error rate is used in classification: ${\textstyle {\frac {1}{n}}\sum _{i=1}^{n}I(y_{i}\neq {\hat {y}}_{i})}$ .

For reporting, however, the ${\textstyle R^{2}}$ statistic is more often used, because it gives a value, typically between 0 and 1, that indicates how much of the variance in ${\textstyle Y}$ is explained by ${\textstyle X}$ : ${\textstyle R^{2}=1-{\frac {\sum (y_{i}-{\hat {y}}_{i})^{2}}{\sum (y_{i}-{\bar {y}})^{2}}}=1-{\frac {RSS}{TSS}}}$
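
These three measures are short enough to write out directly. The sketch below uses the usual definitions, with the R-squared value computed as one minus the ratio of the residual sum of squares to the total sum of squares; the sample values are made up for illustration:

```python
def mse(y, y_hat):
    """Mean squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def error_rate(y, y_hat):
    """Fraction of classifications where the predicted label is wrong."""
    return sum(1 for a, b in zip(y, y_hat) if a != b) / len(y)

def r_squared(y, y_hat):
    """1 - RSS/TSS: share of the variance in y explained by the model."""
    y_bar = sum(y) / len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    tss = sum((a - y_bar) ** 2 for a in y)             # total sum of squares
    return 1 - rss / tss

# Hypothetical regression output.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]

# Hypothetical classification output.
labels_true = ["a", "b", "a", "b"]
labels_pred = ["a", "b", "b", "b"]

fit_mse = mse(y_true, y_pred)
fit_r2 = r_squared(y_true, y_pred)
misclassified = error_rate(labels_true, labels_pred)
```

Note that the R-squared value can fall below 0 when a model fits worse than simply predicting the mean.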

Example functions relating ${\textstyle X}$ and ${\textstyle Y}$ :

A linear function: ${\textstyle Y=\beta _{0}+\beta _{1}X+\varepsilon }$
A polynomial: ${\textstyle Y=\beta _{0}+\beta _{1}X+\beta _{2}X^{2}+...+\varepsilon }$
A natural function: ${\textstyle Y=e^{-\alpha _{0}X^{\alpha _{1}}}+\varepsilon }$
A logistic function: ${\textstyle p={\frac {e^{\beta _{0}+\beta _{1}X}}{1+e^{\beta _{0}+\beta _{1}X}}}}$
A series of nested if statements
A series of differential equations: ${\textstyle x'=x_{n}-{\bar {x}}_{n}:y'=y_{n}-\beta _{n}x_{n}'}$

Why Machine Learning?

Types of Questions:

Exact Solution is known: normal coding problems, linear models, and classical statistics.
Exact Solution is unknown, but can be extracted with work: work with systems experts and domain knowledge to create code.
Exact Solution is known, but not yet conveyable: ML is useful.
Exact Solution not known by humans: ML and/or Deep Learning needed.

Example: consider a prediction of temperature:

Knowledge Based Models
- Physics and atmospheric science based model.
- Up to differential equations on chaotic systems.
- As fine granularity of prediction increases, number of factors and density of data quickly becomes too much for most humans to consider
Data Driven Models
- Use ML to provide iterative gain to reduce error
  - Known data→model creation→point toward new factors.
- Uses a split in training and testing data, or sum of error to move toward the correct answer.
- Human researchers more free to find more data, improve prediction, develop theories

Techniques

Machine Learning in General

Supervised Learning: the model estimates, error verifies

If incorrect, needs user input for correction.
Example: a computer vision system trained to find features in images via user annotated images.

Unsupervised Learning: Clustering/Grouping of similar items

Need a similarity measure via feature vectors and ability to adjust weights
Example: Taste prediction algorithms used in web advertising.

Reinforcement Learning: the model estimates a sequence of guesses

Judged correct if and only if the entire sequence is correct, or scored by a parameterized output score
Instant feedback but high compute cost
Example: Game-play in actor-environment model

Types of Problems and Output

Numeric: function maps ${\textstyle X}$ to ${\textstyle Y}$ and output is in ${\textstyle \mathbb {R} }$ .

Categorization: non-orderable sorting

Clustering: finding the principal ways groups differ

Anomaly Detection: finding data points which are out of the ordinary

Actor Models: real-time decision making or detailed simulation

Predictive Models

Predictive ML models which are also Linear:

Utilize a split of training and test data: a train-test split or ${\textstyle k}$ -fold cross-validation.
Use a function mapping of one or more independent variables to the dependent variables, then re-evaluate to reduce error.
Include techniques for mixed model reduction.
Shrink coefficients via Lasso, Ridge, and Elastic Net techniques; Lasso and Elastic Net can also reduce the number of predictors by setting coefficients to exactly zero.
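
A minimal sketch of k-fold cross-validation follows, assuming a single-predictor linear model fitted by ordinary least squares. The helper names `fit_line` and `k_fold_mse` and the noise-free sample data are hypothetical, picked so the result is easy to check:

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out MSE over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Fit on the training folds only.
        b0, b1 = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Score on the fold the model has not seen.
        errors = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k

xs = [float(i) for i in range(20)]
ys = [2.0 + 3.0 * x for x in xs]   # noise-free line, so the held-out error is ~0
cv_mse = k_fold_mse(xs, ys, k=5)
intercept, slope = fit_line(xs, ys)
```

Each observation is used for testing exactly once, which gives a steadier error estimate than a single train-test split on small data sets.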

Feature-vector based models

Nested if statements
- try to find decision boundaries by distance between independent data and dependent outcome
- features have weighted probability
- most information by Bayesian inference

Decision Trees
- can be used to create decision models from a sequence of threshold splits on the features
- a single split gives a linear boundary, loosely comparable to a neural network with one neuron
Random Forest
- utilizes a number of differently-tuned trees
- trees provide a consensus, voting-based approach for non-linearly separable data
- smaller tree depth typically prevents over-fit

Clustering

Groups data into clusters such that distance within clusters is small, and distance between differing groups is large

Works with any well-defined "distance" function: Euclidean, Hamming, Inner Product, etc.

k-Means Clustering:

choose ${\textstyle k}$ , the number of clusters
randomly distribute ${\textstyle k}$ points, the centroids, into feature space
divide and classify data by distance to the ${\textstyle k}$ centroids
move each centroid to the center of its group
repeat until convergence to some epsilon value, where points no longer move across iterations
Goodness of Fit: for ${\textstyle N-M}$ possible values of ${\textstyle k}$ , look for an inflection in the overall likelihood ratio given by the probability function for the set
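
The loop above can be sketched as follows, using Euclidean distance. This is an illustrative implementation, not a reference one: the `initial` parameter is an assumption added so the example is reproducible, and when it is omitted the starting centroids are sampled from the data points rather than from the whole feature space.

```python
import math
import random

def kmeans(points, k, initial=None, seed=0, max_iter=100, eps=1e-9):
    """Cluster points (tuples of floats) into k groups."""
    rng = random.Random(seed)
    centroids = list(initial) if initial is not None else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda j: math.dist(point, centroids[j]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= eps:  # converged: centroids no longer move
            break
    return centroids, clusters

# Two well-separated hypothetical groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, clusters = kmeans(points, k=2, initial=[(0.0, 0.0), (10.0, 10.0)])
```

k-means only finds a local optimum, so with random starting centroids it is common practice to run it several times and keep the best result.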

K-nearest Neighbors:

Classifies a point from the labels of its ${\textstyle K}$ nearest neighbors, creating a probabilistic decision boundary within a feature space.

Supervised system: needs labeled examples, and works on a majority voting system among the neighbors.
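
A minimal sketch of majority voting among nearest neighbors, assuming Euclidean distance and a small hand-made set of labeled points (the function name and data are hypothetical):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label for query from the k closest labeled points.

    train is a list of (feature_tuple, label) pairs.
    """
    # Sort labeled points by distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((7.0, 6.5), "b"), ((6.5, 7.0), "b"),
]

near_a = knn_predict(train, (1.2, 1.3))
near_b = knn_predict(train, (6.4, 6.4))
```

An odd value of K avoids tied votes when there are two classes.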

Gradient Boosting:

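Built on gradient methods: the system attempts to find the direction and magnitude of change in a multi-dimensional field, and follows these iteratively to find local extrema. Boosting applies this by adding weak models one at a time, each fitted to the gradient of the remaining error.

Some implementations use a Newton or Quasi-Newton method for finding extrema, for faster convergence.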
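
The iterative gradient-following idea can be illustrated with plain gradient descent on a function whose minimum is known. This sketch shows only that core loop, not a boosting ensemble; the test function and parameter values are assumptions chosen for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200, tol=1e-10):
    """Follow the negative gradient from start until the steps become tiny."""
    x = list(start)
    for _ in range(steps):
        g = grad(x)
        x_new = [xi - learning_rate * gi for xi, gi in zip(x, g)]
        moved = max(abs(a - b) for a, b in zip(x, x_new))
        x = x_new
        if moved < tol:  # converged: the update no longer changes x
            break
    return x

def grad_f(point):
    """Gradient of f(x, y) = (x - 3)^2 + (y + 1)^2, minimized at (3, -1)."""
    return [2 * (point[0] - 3), 2 * (point[1] + 1)]

minimum = gradient_descent(grad_f, [0.0, 0.0])
```

On a function with several valleys the same loop stops at whichever local minimum is downhill from the starting point, which is why the choice of start and learning rate matters.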

Support Vector Machine:

Also known as SVM; applies a classifier to high-dimensional data to split points into groups. Some use a kernel trick.

Generates the inner product space of two arbitrary-dimensional numeric matrix spaces, showing the shape of data

Genetic Algorithms

Utilize some adversarial scoring method of initially randomized vectors:

'survivors' become the basis of new models
similar iterative concept to gradient methods
does not have to understand the topology of the space
requires the creator to specify scoring for the machine
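
A toy version of this survive-and-mutate loop is sketched below, assuming candidates are single numbers and mutation is Gaussian noise. The scoring function, population size, and mutation width are all hypothetical choices, and real genetic algorithms usually add crossover between survivors as well.

```python
import random

def evolve(fitness, pop_size=20, generations=60, sigma=0.5, seed=0):
    """Maximize fitness over real numbers by selection and mutation."""
    rng = random.Random(seed)
    # Start from randomized candidates.
    population = [rng.uniform(-10.0, 10.0) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Score and rank the population; the creator supplies the scoring.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # The top half survives unchanged.
        survivors = population[: pop_size // 2]
        # Each survivor produces one mutated child.
        children = [s + rng.gauss(0.0, sigma) for s in survivors]
        population = survivors + children
    best = max(population, key=fitness)
    return best, history

def score(x):
    """Hypothetical scoring function with its peak at x = 3."""
    return -(x - 3.0) ** 2

best, history = evolve(score)
```

Because the survivors are carried over unchanged, the best score never gets worse from one generation to the next, and no gradient of the scoring function is needed.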

Convolutional Neural Networks:

Utilize iterative scoring between training and testing, along with gradient descent on a number of layered, weighted vectors to extract features from a complex data set.

Vision Systems

Take high throughput data and simplify before work is done:

ex: 1080p @ 60fps ${\textstyle \rightarrow }$ 240x360 @ 15fps, broken into component channels
Gaussian blur filter kernel applied to high-res images
Edge detection via double threshold
ML or CNN used past this to determine actual features
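
Two of these steps are small enough to sketch on a grayscale image stored as a list of rows. The 3x3 kernel weights and the threshold values are assumptions for illustration; in Canny-style edge detection the double threshold is applied to gradient magnitudes, whereas here it is shown on arbitrary values.

```python
# A common 3x3 Gaussian approximation; the weights sum to 16.
GAUSS_3X3 = [
    [1, 2, 1],
    [2, 4, 2],
    [1, 2, 1],
]

def gaussian_blur(image):
    """Blur interior pixels with the 3x3 kernel; border pixels are copied."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    total += GAUSS_3X3[dr + 1][dc + 1] * image[r + dr][c + dc]
            out[r][c] = total / 16
    return out

def double_threshold(values, low, high):
    """Label each value 2 (strong), 1 (weak), or 0 (suppressed)."""
    return [
        [2 if v >= high else 1 if v >= low else 0 for v in row]
        for row in values
    ]

# Hypothetical 3x3 image with a single bright pixel in the center.
image = [
    [0, 0, 0],
    [0, 16, 0],
    [0, 0, 0],
]
blurred = gaussian_blur(image)
labels = double_threshold([[10, 50, 200]], low=40, high=100)
```

Weak values are normally kept only when they connect to strong ones, a step omitted here for brevity.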

Large Language Models

Large language models (LLMs) are stochastic systems which attempt to capture the 'shape' of a language (a recursively enumerable language by Chomsky Hierarchy) by predicting successive tokens. In this system, a corpus of text is transformed into tokens, which are substrings of the input language, and each token is mapped to a vector in a high-dimensional space where related tokens lie close together (for example by cosine similarity). The weights of the model are fitted to the corpus during training by gradient descent on a loss function. At generation time, the model uses the learned probabilities to arrive at what token goes next in a given completion, one token at a time. Self-attention is the mechanism which relates different positions of the text sequence in order to compute a representation of the sequence. The probability of a generated sequence takes the form:

$\prod _{t=1}^{T}P(x_{t}|{\vec {x}}_{<t},{\vec {x}}_{i:j})$

where tokens are generated until the limit T is reached, and where the probability of each new token is a function of the previous tokens and the input. This functional setup is called an encoder-decoder.
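
The token-by-token structure can be illustrated with a toy model. The probability of a whole sequence is the product of the per-token conditional probabilities, or equivalently the sum of their logarithms. The table below is entirely hypothetical and conditions only on the single previous token, which is a drastic simplification of a real language model.

```python
import math

# Hypothetical conditional probabilities P(next | previous); "<s>" marks the start.
BIGRAM = {
    "<s>": {"the": 0.8, "cat": 0.1, "sat": 0.1},
    "the": {"the": 0.1, "cat": 0.7, "sat": 0.2},
    "cat": {"the": 0.1, "cat": 0.1, "sat": 0.8},
    "sat": {"the": 0.5, "cat": 0.25, "sat": 0.25},
}

def sequence_log_prob(tokens):
    """Sum of log P(x_t | x_{t-1}) over the sequence."""
    previous = "<s>"
    total = 0.0
    for token in tokens:
        total += math.log(BIGRAM[previous][token])
        previous = token
    return total

def greedy_generate(max_tokens):
    """Generate up to the limit, always taking the most probable next token."""
    previous, output = "<s>", []
    for _ in range(max_tokens):
        options = BIGRAM[previous]
        previous = max(options, key=options.get)
        output.append(previous)
    return output

log_p = sequence_log_prob(["the", "cat", "sat"])
generated = greedy_generate(3)
```

Real systems usually sample from the distribution instead of always taking the maximum, which is where the stochastic behavior comes from.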

So, based on the corpus, an encoder-decoder will try to create a response to a given input. Usually, encoder-decoders are further trained by a process called Reinforcement Learning from Human Feedback (RLHF), in which human feedback is given in an iterative process until the underlying language weights reflect the trained preferences.

To consider in regard to LLMs:

They are stochastic word generator machines, and contain structure, rather than problem solving logic.
They are typically computationally expensive to run.
They are typically monstrously computationally expensive to train.
If equipped with enough memory for self-attention, they become a Turing Machine with Type-0 Grammar.
They may occasionally generate incorrect information.
They hallucinate, generating nonsense (verbal noise) when confronted with unexpected text not encountered in training.
Because the attention of a model is limited, LLMs perform badly with large multi-step processes which require dense context.

Software Toolkit

Python

General purpose programming language with many libraries.

Interpreted language: source code is compiled to bytecode, which is then run by a virtual machine.

Dependency Structure

system vs user python

virtualenv: $ python3 -m venv /path/to/new/environment

pip libraries: $ pip list --outdated --format=freeze | grep -v '^\-e' | cut -d = -f 1 | xargs -n1 pip install -U

Anaconda

separate virtualenv system specifically for data science:

Spyder: IDE with visual output
JupyterLab: notes with data visualizations
Orange: visual IDE for stats exploration
pandas, NumPy, and SciPy: libraries for data serialization and numerical work in Python
matplotlib: for data visualization
scikit-learn: main library for ML
Dask: distributed abstraction layer with pandas grammar to easily distribute Python tasks into 1-1000 compute nodes
PyTorch and TensorFlow: deep learning and CNN generation systems
- massive compute overhead to train models
- require data map reduction and/or imputation to run well

R language

Statistical programming language: interpreter invokes compiled C or FORTRAN.

Also works within Jupyter notebook for instant visualization, if wanted.

Open-source and extended by the Comprehensive R Archive Network (CRAN), which includes extensive documentation.

rmarkdown: format a document from R with optional LaTeX bindings
tidyverse
- dplyr: grammar for mass data manipulation
- ggplot2: a library for creating graphs and visualizations
doParallel: cost-free abstraction, pooling of CPU threads
mlr: interface to a large number of classification and regression techniques
shiny: provides ability to create web servers similar to NodeJS or Python Flask

Intel MKL (Math Kernel Library)

Improves performance for Fast Fourier Transforms, linear algebra operations, vector math, deep neural networks, and kernel solvers.

Default math backend for MATLAB and for the NumPy and SciPy builds shipped with Anaconda

Not hardware agnostic: has historically chosen slower code paths on non-Intel chips by default

OpenBLAS and LAPACK

LAPACK (Linear Algebra PACKage) provides APIs much like MKL, built on top of BLAS (Basic Linear Algebra Subprograms)

OpenBLAS is an optimized BLAS implementation which bundles LAPACK and adds optimizations for parallel computing

Default for R and Biopython

Message Passing Interface (MPI)

Supported by all major compilers (Intel MPI and Open MPI implementations)

An API for passing messages between processes, on one node or across many; provides the backend for many parallel computing systems

ex: mpirun -np $NUM_PROC /path/to/coolProgram < $INPUT > /path/to/output

GNU Parallel

simple vectorization of loops over processors for non-multithreaded processes

ex: parallel -j $NUM_PROC /path/to/thescript.sh ::: 1..n ::: 1..m

CUDA/OpenCL

Nvidia-specific CUDA and the open standard OpenCL provide a hardware abstracting API for using GPUs for compute tasks

must-have for PyTorch or TensorFlow workloads

Nomenclature Divergence

CUDA thread = OpenCL work item = CPU SIMD lane
CUDA multiprocessor = OpenCL compute unit = CPU core

High Performance Computers

HPC or super-computing clusters provide high throughput analysis.

Amazingly high amount of computational power.

Need to plan your analysis.

NDSU Center for Computationally-Assisted Science and Technology (CCAST) provides a platform for these workloads:

connect via ssh
uses loadable modules:

ex: module load parallel

batch processing via PBS scripting

Continued Learning

General Programming/Computers Websites

StackOverflow.com - check before asking new questions
RosettaCode.org - data structures and algorithms in many languages
Linux.die.net/man/ - the Linux manual
grymoire.com/Unix/ - more *nix CLI tutorials

Python

docs.python.org/3/ - the official python documentation
docs.python.org/3/tutorial - the official tutorial
diveintopython.net - guided tutorial online
pythontutor.com - visual debugger

R

cran.r-project.org - CRAN
cran.r-project.org/manuals.html
rdrr.io - meta-manual lookup and many other tools for R
swirlstats.com - learn R, in R
statlearning.com - statistical machine learning coursework
