Statistical Machine Learning - Revision history

Sysadmin: Added LLM section

2023-03-20T11:14:42Z

Added LLM section

@@ Line 215: / Line 215: @@
 Take high throughput data and simplify before work is done ''ex'': 1080p @ 60fps <math display="inline">\rightarrow</math> 240x360 @ 15fps, broken into component channels Gaussian blur filter kernel applied to high-res images Edge detection via double threshold ML or CNN used past this to determine actual features
 </div>
 == Software Toolkit ==
@@ Line 328: / Line 346: @@
 * rdrr.io - meta-manual lookup and many other tools for R
 * swirlstats.com - learn R, in R
-* statslearning.com - statistical machine learning coursework
+* statlearning.com - statistical machine learning coursework
 === Recommended Reading ===

Sysadmin: /* GNU Parallel */

2022-11-08T19:16:12Z

GNU Parallel

← Older revision		Revision as of 19:16, 8 November 2022
Line 280:		Line 280:
	simple vectorization of loops over processors for non-multithreaded processes		simple vectorization of loops over processors for non-multithreaded processes

	''ex'': <code>parallel -j $NUM_PROC /path/to/thescript.sh ::: ~~</code><span><code>~~1..n~~</code></span><code>~~ ::: ~~</code><span><code>~~1..m</code~~></span~~>		''ex'': <code>parallel -j $NUM_PROC /path/to/thescript.sh ::: 1..n ::: 1..m</code>

	=== CUDA/OpenCL ===		=== CUDA/OpenCL ===

Sysadmin at 21:08, 3 November 2022

2022-11-03T21:08:20Z

← Older revision		Revision as of 21:08, 3 November 2022
Line 9:		Line 9:
	* Machine Learning		* Machine Learning
	* ML Techniques		* ML Techniques
	* ~~SoftwareToolkit~~		* Software Toolkit
	* Continued Learning		* Continued Learning

Line 42:		Line 42:
	So, our combined dataset consists of <math>[(\vec{x}_1,\vec{y}_1),(\vec{x}_2, \vec{y}_2),...,(\vec{x}_n,\vec{y}_n)]</math>		So, our combined dataset consists of <math>[(\vec{x}_1,\vec{y}_1),(\vec{x}_2, \vec{y}_2),...,(\vec{x}_n,\vec{y}_n)]</math>

	'''Our Mission''': determine relationships between <math display="inline">\textbf{X}</math> and <math display="inline">\textbf{Y}</math> which are mathematically sound, leading to better ~~understandin~~		'''Our Mission''': determine relationships between <math display="inline">\textbf{X}</math> and <math display="inline">\textbf{Y}</math> which are mathematically sound, leading to better understanding

	Typically a table has columns as features, rows as entries		Typically a table has columns as features, rows as entries
Line 202:		Line 202:
	Generates the inner product space of two arbitrary-dimensional numeric matrix spaces, showing the shape of data		Generates the inner product space of two arbitrary-dimensional numeric matrix spaces, showing the shape of data

	=== ~~'''~~Genetic Algorithms~~'''~~ ===		=== Genetic Algorithms ===
	Utilize some adversarial scoring method of initially randomized vectors:		Utilize some adversarial scoring method of initially randomized vectors:

Line 230:		Line 230:
	pip libraries <code>$ pip list outdated format=freeze \| grep -v \| cut -d=" " -f1 \| xargs -n1 pip install -U</code>		pip libraries <code>$ pip list outdated format=freeze \| grep -v \| cut -d=" " -f1 \| xargs -n1 pip install -U</code>

	==== ~~'''~~Anaconda~~'''~~ ====		==== Anaconda ====
	separate virtualenv system specifically for data science:		separate virtualenv system specifically for data science:

Line 293:		Line 293:

	=== High Performance Computers ===		=== High Performance Computers ===
	HPC or ~~supercomputing~~ clusters provide high throughput analysis.		HPC or super-computing clusters provide high throughput analysis.

	Amazingly high amount of computational power.		Amazingly high amount of computational power.
Line 328:		Line 328:
	* rdrr.io - meta-manual lookup and many other tools for R		* rdrr.io - meta-manual lookup and many other tools for R
	* swirlstats.com - learn R, in R		* swirlstats.com - learn R, in R
	* ~~statlearning~~.com - statistical machine learning coursework		* statslearning.com - statistical machine learning coursework

	=== Recommended Reading ===		=== Recommended Reading ===

Sysadmin at 21:04, 1 November 2022

2022-11-01T21:04:05Z

← Older revision		Revision as of 21:04, 1 November 2022
Line 328:		Line 328:
	* rdrr.io - meta-manual lookup and many other tools for R		* rdrr.io - meta-manual lookup and many other tools for R
	* swirlstats.com - learn R, in R		* swirlstats.com - learn R, in R
	* ~~statslearning~~.com - statistical machine learning coursework		* statlearning.com - statistical machine learning coursework

	=== Recommended Reading ===		=== Recommended Reading ===

Sysadmin at 20:41, 21 October 2022

2022-10-21T20:41:25Z

← Older revision		Revision as of 20:41, 21 October 2022
Line 124:		Line 124:
	* '''Data Driven Models'''		* '''Data Driven Models'''
	** Use ML to provide '''iterative''' gain to reduce error		** Use ML to provide '''iterative''' gain to reduce error
	*** Known ~~data {\textstyle \rightarrow } model creation {\textstyle \rightarrow } point~~ toward new factors.		*** Known data→model creation→point toward new factors.
	** Uses a split in training and testing data, or sum of error to move toward the correct answer.		** Uses a split in training and testing data, or sum of error to move toward the correct answer.
	** Human researchers more free to find more data, improve prediction, develop theories		** Human researchers more free to find more data, improve prediction, develop theories

Sysadmin at 20:39, 21 October 2022

2022-10-21T20:39:41Z

← Older revision		Revision as of 20:39, 21 October 2022
Line 104:		Line 104:
	* A polynomial: <math display="inline">Y = \beta_0 + \beta_1 X + \beta_2 X + ... + \varepsilon</math>		* A polynomial: <math display="inline">Y = \beta_0 + \beta_1 X + \beta_2 X + ... + \varepsilon</math>
	* A natural function: <math display="inline">Y = e^{-\alpha_0 X^{\alpha_1}} + \varepsilon</math>		* A natural function: <math display="inline">Y = e^{-\alpha_0 X^{\alpha_1}} + \varepsilon</math>
	* A logistical function: <math display="inline">p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}</math>		* A logistical function: <math display="inline">p = \frac{ e^{\beta_0 + \beta_1 X} }{1 + e^{\beta_0 + \beta_1 X} }</math>
	* A series of nested <code>if</code> statements		* A series of nested <code>if</code> statements
	* A series of differential equations: <math display="inline">x~~^\prime~~ = x_n - \bar{x}_n: y~~^\prime~~ = y_n - \beta_n ({x_n~~}^\prime).~~</math>		* A series of differential equations: <math display="inline">x' = x_n - \bar{x}_n: y' = y_n - \beta_n x_n'</math>

	==== Why Machine Learning? ====		==== Why Machine Learning? ====
Line 124:		Line 124:
	* '''Data Driven Models'''		* '''Data Driven Models'''
	** Use ML to provide '''iterative''' gain to reduce error		** Use ML to provide '''iterative''' gain to reduce error
	*** Known data ~~<math display="inline">~~\rightarrow~~</math>~~ model creation ~~<math display="inline">~~\rightarrow~~</math>~~ point toward new factors.		*** Known data {\textstyle \rightarrow } model creation {\textstyle \rightarrow } point toward new factors.
	** Uses a split in training and testing data, or sum of error to move toward the correct answer.		** Uses a split in training and testing data, or sum of error to move toward the correct answer.
	** Human researchers more free to find more data, improve prediction, develop theories		** Human researchers more free to find more data, improve prediction, develop theories

Sysadmin: created and edited page

2022-10-21T20:23:33Z

created and edited page

New page

=== <span>Overview</span> ===
<div class="outline">
A crash course offering a gentle introduction to machine learning techniques and their applications in data science.

Topics Covered:

* Basic Concepts
* Data Models
* Machine Learning
* ML Techniques
* Software Toolkit
* Continued Learning

== <span>Basic Concepts</span> ==
<div class="outline">
=== Definitions: ===

* '''Statistical Machine Learning''' is a set of tools used to model and understand complex data sets
* '''Data Science''' is a set of techniques in computing to support the analysis of data
** Not very useful without some domain knowledge: it is important to ''know your data''.
* Includes analytic techniques:
** descriptive statistics
** data visualization
** statistical machine learning
** neural networks
** actor-environment models
* Also includes computational techniques:
** database administration
** management of information systems
** parallelization <math display="inline">\rightarrow</math> high performance computing
</div>

=== Basic Concepts ===
<div class="outline">
==== Knowing your data ====
Technical definition:

Let <math display="inline">n</math> represent the number of distinct '''observations''', and let <math display="inline">p</math> represent the number of '''predictors'''. Then our observed data <math display="inline">\textbf{X}</math> is an <math display="inline">n\times p</math> matrix with row observation vectors <math display="inline">\vec{x}_{1..n}</math> and column predictor vectors <math display="inline">\vec{x}_{1..p}</math>.

In addition, we will also have '''response''' variable(s) <math display="inline">\textbf{Y}</math>, made up of one or more <math display="inline">n</math>-length vectors.

So, our combined dataset consists of <math>[(\vec{x}_1,\vec{y}_1),(\vec{x}_2, \vec{y}_2),...,(\vec{x}_n,\vec{y}_n)]</math>

'''Our Mission''': determine relationships between <math display="inline">\textbf{X}</math> and <math display="inline">\textbf{Y}</math> which are mathematically sound, leading to better understanding.

Typically a table has columns as features, rows as entries

'''Entries''' might be '''numeric''' or '''categorical'''.

Data sources are either '''Structured''' or '''Unstructured''':

* Unstructured data will require some transformation.

Some data may also be '''time series''', taking a sampling of points over time and contributing to a 3-dimensional '''Data Cube'''.

Several techniques can be used to reduce complex data:

* '''numeric representation:''' mapping of categorical information into numbers.
* '''scaling:''' redefine a new range for a predictor vector.
* '''normalization:''' redefine a predictor by its mean and standard deviation, giving values with mean 0 and standard deviation 1.
* '''dimension reduction:''' lose fine grain of data, but gain understandability.
* '''feature extraction:''' a data mining technique in which we can generate new predictors from known information.
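The first three techniques in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function names <code>min_max_scale</code>, <code>standardize</code>, and <code>encode_categories</code> and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Scaling: map a predictor vector onto the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant predictor has no spread; place every value at new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def standardize(values):
    """Normalization: re-express a predictor by its mean and standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def encode_categories(labels):
    """Numeric representation: map each distinct category to an integer code."""
    codes = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels], codes


# Hypothetical sample data for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = min_max_scale(heights_cm)
z_scores = standardize(heights_cm)
encoded, mapping = encode_categories(["red", "green", "red", "blue"])
```

Libraries such as <code>scikit-learn</code> provide tested equivalents; the sketch only shows what each transformation does to a single predictor vector.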

==== Modeling ====
What is a model?

'''Very well-known model''': Gravity is a functional model between masses, distances, and force. <math display="block">F = G\frac{m_1m_2}{r^2} \rightarrow g = \frac{G M}{r^2} \rightarrow v(t) = v(0) - gt.</math>

Another well-known model, the Arrhenius equation, relates a reaction rate constant to temperature: <math display="block">k = Ae^{-\frac{E_a}{k_B T}}.</math>

'''Statistics definition''':

Let <math display="inline">X = (\vec{x}_1, \vec{x}_2, ..., \vec{x}_p)</math>, each of length <math display="inline">n</math>, and <math display="inline">Y = (\vec{y})</math> of length <math display="inline">n</math>. Then for <math display="inline">X</math> and <math display="inline">Y</math> there exists a relationship with a '''systematic''' component <math display="inline">f</math> and an '''error term''' <math display="inline">\varepsilon</math>: <math display="block">Y = f(X) + \varepsilon</math>

Why do we even estimate <math display="inline">f</math> at all? For '''Prediction''' or '''Inference'''.

'''Predictive models''' create an estimator <math display="inline">\hat{f}</math> which we can use to estimate <math display="inline">Y</math> using a sample <math display="inline">X</math> from a larger population: <math display="block">\hat{Y} = \hat{f}(X)</math> with expected squared error <math display="inline">E(Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\varepsilon)</math>.

'''Inference models''' are primarily interested in how <math display="inline">Y</math> is affected by <math display="inline">X</math>:

* Which predictors are associated with the response?
* What is the relationship of each predictor to the response?
* What is the overall nature of the relationship between <math display="inline">Y</math> and the predictors?

==== Signal vs Noise ====
Consider '''precision''' and '''accuracy'''.

* Both contribute to the quality of a data set.

'''Noise:''' variation in data which detracts from constructing '''information''', as opposed to '''signal''', data which is representative of a system under study and contains information.

A high signal-to-noise ratio allows us to minimize '''reducible error''', caused by sampling technique.

This is different from '''irreducible error''', created by factors we are not measuring.

==== Error and Fit ====
In modeling terms, the precision of a model is referred to as its '''variance''', and the accuracy of a model as its degree of '''bias'''.

Generally, overly complex models generate high variance and can '''over-fit''' to input data, making the model useless for new data.

Generally, '''Mean Square Error''', or MSE, is used to determine goodness-of-fit for model calibration: <math display="inline">MSE = \frac{1}{n}\sum^n_{i=1}(y_i - \hat{f}(x_i))^2</math>.

The error rate is used in classification: <math display="inline">\frac{1}{n} \sum^n_{i=1} I(y_i \ne \hat{y}_i)</math>.

However, for reporting, the <math display="inline">R^2</math> statistic is more often used, because it gives a value between 0 and 1 that indicates how much of the variance in <math display="inline">Y</math> is explained by <math display="inline">X</math>: <math display="inline">R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2} = 1 - \frac{RSS}{TSS}</math>
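The three fit measures above can be computed directly from their definitions. The following is a minimal sketch in plain Python; the function names and the sample values are hypothetical, and <math display="inline">R^2</math> is computed as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def mean_squared_error(y, y_hat):
    """MSE: average squared difference between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def error_rate(labels, predicted):
    """Classification error rate: fraction of labels predicted incorrectly."""
    return sum(1 for a, b in zip(labels, predicted) if a != b) / len(labels)


def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    y_bar = sum(y) / len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    tss = sum((a - y_bar) ** 2 for a in y)             # total sum of squares
    return 1 - rss / tss


# Hypothetical observations and predictions for illustration only.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 2.5, 4.0]
mse = mean_squared_error(y, y_hat)
r2 = r_squared(y, y_hat)
err = error_rate(["a", "b", "a", "b"], ["a", "b", "b", "b"])
```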

Example functions relating <math display="inline">X</math> and <math display="inline">Y</math>:

* A linear function: <math display="inline">Y = \beta_0 + \beta_1 X + \varepsilon</math>
* A polynomial: <math display="inline">Y = \beta_0 + \beta_1 X + \beta_2 X^2 + ... + \varepsilon</math>
* A natural function: <math display="inline">Y = e^{-\alpha_0 X^{\alpha_1}} + \varepsilon</math>
* A logistic function: <math display="inline">p = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}</math>
* A series of nested <code>if</code> statements
* A series of differential equations: <math display="inline">x^\prime = x_n - \bar{x}_n: y^\prime = y_n - \beta_n ({x_n}^\prime).</math>
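As a sketch of how two of the forms listed above behave in code, the block below estimates the coefficients of the linear function by ordinary least squares and evaluates the logistic function. Ordinary least squares is an assumption here (the text does not name a fitting method), and the function names and data are hypothetical.

```python
import math


def fit_linear(x, y):
    """Ordinary least squares estimates (beta_0, beta_1) for Y = beta_0 + beta_1 X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    beta_1 = sxy / sxx
    beta_0 = y_bar - beta_1 * x_bar
    return beta_0, beta_1


def logistic(x, beta_0, beta_1):
    """p = e^(b0 + b1 x) / (1 + e^(b0 + b1 x)), always between 0 and 1."""
    z = beta_0 + beta_1 * x
    return math.exp(z) / (1 + math.exp(z))


# Hypothetical noise-free data lying exactly on y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
beta_0, beta_1 = fit_linear(x, y)
p_mid = logistic(0.0, 0.0, 1.0)
```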

==== Why Machine Learning? ====
Types of Questions:

* '''Exact Solution is known:''' normal coding problems, linear models, and classical statistics.
* '''Exact Solution is unknown, but can be extracted with work:''' work with systems experts and domain knowledge to create code.
* '''Exact Solution is known, but not yet conveyable:''' ML is useful.
* '''Exact Solution not known by humans:''' ML and/or Deep Learning needed.

''Example:'' consider a prediction of temperature:

* '''Knowledge Based Models'''
** Physics and atmospheric science based model.
** Up to differential equations on chaotic systems.
** As the granularity of prediction becomes finer, the number of factors and the density of data quickly become too much for most humans to consider
* '''Data Driven Models'''
** Use ML to provide '''iterative''' gain to reduce error
*** Known data <math display="inline">\rightarrow</math> model creation <math display="inline">\rightarrow</math> point toward new factors.
** Uses a split in training and testing data, or sum of error to move toward the correct answer.
** Human researchers more free to find more data, improve prediction, develop theories

== <span>Techniques</span> ==

=== Machine Learning in General ===
'''Supervised Learning:''' the model estimates, error verifies

* If incorrect, needs user input for correction.
* ''Example'': a computer vision system trained to find features in images via user annotated images.

'''Unsupervised Learning:''' Clustering/Grouping of similar items

* Need a similarity measure via feature vectors and ability to adjust weights
* ''Example'': Taste prediction algorithms used in web advertising.

'''Reinforcement Learning:''' the model estimates a sequence of guesses

* Judged correct only on the entire sequence, or by a parameterized scoring of the output
* Instant feedback but high compute cost
* ''Example'': Game-play in actor-environment model

=== Types of Problems and Output ===
'''Numeric:''' function maps <math display="inline">X</math> to <math display="inline">Y</math> and output is in <math display="inline">\mathbb{R}</math>.

'''Categorization:''' non-orderable sorting

'''Clustering:''' finding principal ways groups differ

'''Anomaly Detection:''' finding data points which are out of the ordinary

'''Actor Models:''' real-time decision making or detailed simulation

=== Predictive Models ===
Predictive ML models which are also linear:

* Utilize a split of training and test data: a single train-test split or <math display="inline">k</math>-fold cross-validation
* Use a function mapping one or more independent variables to the dependent variables, then re-evaluate to reduce error
* Include techniques for mixed model reduction
* Shrink or reduce the number of predictors via '''Lasso''', '''Ridge''', and '''Elastic Net''' techniques
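The <math display="inline">k</math>-fold cross-validation mentioned in the list above can be sketched as follows. This is a minimal illustration with a one-predictor least-squares line as the model; the helper names <code>fit_linear</code>, <code>k_fold_indices</code>, and <code>cross_validate</code> are hypothetical, and real data would normally be shuffled before being split into folds.

```python
def fit_linear(x, y):
    """Least-squares line through the training data; returns (beta_0, beta_1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    beta_1 = sxy / sxx
    return y_bar - beta_1 * x_bar, beta_1


def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(x, y, k):
    """Average held-out MSE over k folds: train on k-1 folds, test on the rest."""
    scores = []
    for test_idx in k_fold_indices(len(x), k):
        held_out = set(test_idx)
        train_x = [x[i] for i in range(len(x)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        b0, b1 = fit_linear(train_x, train_y)
        mse = sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in test_idx) / len(test_idx)
        scores.append(mse)
    return sum(scores) / len(scores)


# Hypothetical noise-free data, so the held-out error should be essentially zero.
xs = [float(i) for i in range(10)]
ys = [3.0 + 0.5 * v for v in xs]
folds = k_fold_indices(len(xs), 3)
cv_mse = cross_validate(xs, ys, k=5)
```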

=== Feature-vector based models ===
Nested <code>if</code> statements try to find decision boundaries by distance between independent data and dependent outcome. Features have weighted probability, with the most informative identified by Bayesian inference.

* '''Decision Trees''' can be used to create decision models for linearly separable data
** effectively a neural network with one neuron
* '''Random Forest''' utilizes a number of differently-tuned trees
** trees provide a consensus, voting-based approach for non-linearly separable data
** smaller tree depth typically prevents over-fit

=== Clustering ===
Groups data into clusters such that distance within clusters is small, and distance between differing groups is large

Works with any well-defined "distance" function: Euclidean, Hamming, Inner Product, etc.

'''k-Means Clustering:'''

* choose <math display="inline">k</math>, the number of clusters
* randomly distribute <math display="inline">k</math> points, '''centroids''', into feature space
* divide and classify data by distance to the <math display="inline">k</math> centroids
* move centroids to the center of their groups
* repeat until convergence to some epsilon value, where points no longer move across iterations
* '''Goodness of Fit:''' for <math display="inline">N-M</math> possible values of <math display="inline">k</math>, look for an inflection in the overall likelihood ratio given by the probability function for the set
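The steps above can be sketched in plain Python. This is a minimal illustration under two assumptions: distance is Euclidean, and the random starting centroids are drawn from the data points themselves. The function name <code>k_means</code>, its parameters, and the sample points are hypothetical.

```python
import math
import random


def k_means(points, k, initial=None, epsilon=1e-9, max_iter=100, seed=0):
    """Return (centroids, groups) after iterating assignment and update steps."""
    rng = random.Random(seed)
    # Start from the given centroids, or pick k data points at random.
    centroids = list(initial) if initial is not None else rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the group of its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        new_centroids = []
        for j, group in enumerate(groups):
            if group:
                new_centroids.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                new_centroids.append(centroids[j])  # keep an empty group's centroid
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < epsilon:
            break  # centroids no longer move: converged
    return centroids, groups


# Hypothetical data: two well-separated clusters of three points each.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
centroids, groups = k_means(points, k=2)
```

Because the result depends on the starting centroids, it is common to run the procedure several times with different seeds and keep the tightest clustering.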

'''K-nearest Neighbors:'''

Creates a probabilistic decision boundary within a feature space by comparing each new point to its <math display="inline">K</math> nearest labeled neighbors

Works on a majority voting system; unlike k-means, it is a supervised method that needs labeled examples
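The majority-voting idea can be sketched as follows, assuming Euclidean distance and a set of points whose labels are already known. The function name <code>knn_predict</code> and the sample data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return (label, vote_share) from a majority vote of the k nearest points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    label, count = votes.most_common(1)[0]
    # The vote share acts as a rough probability for the winning label.
    return label, count / k


# Hypothetical labeled points: group "a" near the origin, group "b" near (5, 5).
train_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
                (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_predict(train_points, train_labels, (0.5, 0.5), k=3)
near_far_group = knn_predict(train_points, train_labels, (4.0, 4.0), k=3)
```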

'''Gradient Boosting:'''

System attempts to find the direction and magnitude of change (the gradient) in a multi-dimensional field, and follows it iteratively to find local extrema.

Most use some '''Quasi-Newton Method''' for faster convergence to the extrema.
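The iterative idea described above, following the gradient step by step toward a local extremum, can be sketched as plain gradient descent. This is a simpler technique than a quasi-Newton method or gradient boosting proper, and is shown only to illustrate the iteration; the function names and the example surface are hypothetical.

```python
def gradient_descent(grad, start, learning_rate=0.1, tolerance=1e-8, max_iter=10_000):
    """Follow the negative gradient from `start` until the steps become tiny."""
    x = list(start)
    for _ in range(max_iter):
        step = [learning_rate * g for g in grad(x)]
        x = [xi - si for xi, si in zip(x, step)]
        if max(abs(si) for si in step) < tolerance:
            break  # the position has effectively stopped changing
    return x


def grad_f(p):
    """Gradient of the example surface f(x, y) = (x - 3)^2 + (y + 1)^2."""
    return [2 * (p[0] - 3), 2 * (p[1] + 1)]


# The example surface has a single minimum at (3, -1).
minimum = gradient_descent(grad_f, [0.0, 0.0])
```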

'''Support Vector Machine:'''

Also known as SVM, applies a classifier to high-dimensional data to split points into groups; some use a '''kernel trick'''.

Generates the inner product space of two arbitrary-dimensional numeric matrix spaces, showing the shape of data

=== Genetic Algorithms ===
Utilize some adversarial scoring method of initially randomized vectors:

* 'survivors' become the basis of new models
** similar iterative concept to gradient methods
* Does not have to understand the topology of the space
** requires the creator to specify scoring for the machine
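A minimal sketch of this loop, assuming a creator-supplied scoring function, uniform crossover between two survivors, and Gaussian mutation. The names <code>evolve</code>, <code>closeness</code>, and <code>TARGET</code>, and all parameter values, are hypothetical choices for illustration.

```python
import random


def evolve(score, dimensions, population_size=30, generations=50,
           survivors=10, mutation_scale=0.3, seed=0):
    """Return (best_vector, best_score_per_generation)."""
    rng = random.Random(seed)
    # Start from randomized vectors.
    population = [[rng.uniform(-5, 5) for _ in range(dimensions)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        parents = population[:survivors]  # survivors carry over unchanged
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            # Each gene comes from one parent, plus a small random mutation.
            child = [rng.choice(pair) + rng.gauss(0, mutation_scale)
                     for pair in zip(a, b)]
            children.append(child)
        population = parents + children
    population.sort(key=score, reverse=True)
    history.append(score(population[0]))
    return population[0], history


TARGET = [1.0, -2.0, 0.5]  # hypothetical optimum the scoring function rewards


def closeness(vector):
    """Scoring supplied by the creator: higher means closer to TARGET."""
    return -sum((v - t) ** 2 for v, t in zip(vector, TARGET))


best, history = evolve(closeness, dimensions=3)
```

Keeping the survivors unchanged in each generation means the best score can never get worse from one generation to the next.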

'''Convolutional Neural Networks:'''

Utilize iterative scoring between training and testing, along with '''gradient descent''' on a number of layered, weighted vectors to extract features from a complex data set.

'''Vision Systems'''

Take high throughput data and simplify before work is done:

* ''ex'': 1080p @ 60fps <math display="inline">\rightarrow</math> 240x360 @ 15fps, broken into component channels
* Gaussian blur filter kernel applied to high-res images
* Edge detection via double threshold
* ML or CNN used past this to determine actual features
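Two of the preprocessing pieces named above, the Gaussian blur kernel and the double threshold, can be sketched in plain Python. This is only an illustration of the arithmetic: the function names, kernel size, and threshold values are hypothetical, and a real pipeline would use an image library.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized size x size Gaussian blur kernel (size must be odd)."""
    half = size // 2
    raw = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            for x in range(-half, half + 1)]
           for y in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    # Normalize so the weights sum to 1 and blurring preserves brightness.
    return [[v / total for v in row] for row in raw]


def double_threshold(magnitudes, low, high):
    """Label each gradient magnitude as a strong edge, a weak edge, or no edge."""
    def label(m):
        if m >= high:
            return "strong"
        if m >= low:
            return "weak"
        return "none"
    return [[label(m) for m in row] for row in magnitudes]


kernel = gaussian_kernel(5, 1.0)
# Hypothetical 2x2 grid of gradient magnitudes.
labels = double_threshold([[0.1, 0.5], [0.9, 0.3]], low=0.3, high=0.8)
```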
</div>

== Software Toolkit ==
<div class="outline">
=== Python ===
General purpose programming language with many libraries.

Interpreted language: code is run statement by statement by a virtual machine.

'''Dependency Structure'''

''system'' vs ''user'' python

virtualenv: <code>$ python3 -m venv /path/to/new/environment</code>

upgrade outdated pip libraries: <code>$ pip list --outdated | tail -n +3 | cut -d " " -f1 | xargs -n1 pip install -U</code>

==== Anaconda ====
separate virtualenv system specifically for data science:

* '''Spyder''' IDE with visual output
* '''JupyterLab''' notebooks with data visualizations
* '''Orange''' visual IDE for stats exploration
* <code>pandas</code>, <code>NumPy</code>, and <code>SciPy</code> libraries for data serialization and numerical work in Python
* <code>matplotlib</code> for data visualization
* <code>scikit-learn</code> main library for ML
* <code>Dask</code> distributed abstraction layer with <code>pandas</code> grammar to easily distribute Python tasks across 1-1000 compute nodes
* <code>PyTorch</code> and <code>TensorFlow</code> deep learning and CNN generation systems
** massive compute overhead to train models
** require data map reduction and/or imputation to run well</div><div class="outline">
=== R language ===
Statistical programming language: interpreter invokes compiled C or FORTRAN.

Also works within Jupyter notebook for instant visualization, if wanted.

Open-source and extended by the Comprehensive R Archive Network (CRAN), which includes extensive documentation.

* <code>rmarkdown</code> format a document from R with optional LaTeX bindings
* <code>tidyverse</code>
** <code>dplyr</code> grammar for mass data manipulation
** <code>ggplot2</code> a library for creating graphs and visualizations
* <code>doParallel</code> cost-free abstraction, pooling of CPU threads
* <code>mlr</code> interface to a large number of classification and regression techniques
* <code>shiny</code> provides ability to create web servers similar to NodeJS or Python Flask

=== Intel MKL (Math Kernel Library) ===
Improves performance for Fast Fourier Transforms, linear algebra operations, vector math, deep neural networks, and kernel solvers.

Default math backend for Anaconda builds of NumPy and SciPy, and for MATLAB

Not hardware agnostic: has historically chosen slower code paths on non-Intel chips by default

=== OpenBLAS and LAPACK ===
'''LAPACK''' (Linear Algebra PACKage) provides APIs much like MKL

'''OpenBLAS''' is an optimized implementation of BLAS (Basic Linear Algebra Subprograms), the low-level routines that LAPACK builds on, with optimizations for parallel computing

Default for R and Biopython

=== Message Passing Interface (MPI) ===
Implemented by all major vendors (for example Intel MPI and Open MPI)

An API for passing messages between processes, typically across distributed memory; provides the backend for many parallel computing systems, allowing one program to run across many processes and nodes

''ex'': <code>mpirun -np $NUM_PROC /path/to/coolProgram < $INPUT > /path/to/output</code>

=== GNU Parallel ===
simple vectorization of loops over processors for non-multithreaded processes

''ex'': <code>parallel -j $NUM_PROC /path/to/thescript.sh ::: 1..n ::: 1..m</code>

=== CUDA/OpenCL ===
Nvidia-specific '''CUDA''' and the open-standard '''OpenCL''' provide a hardware-abstracting API for using GPUs for compute tasks

must-have for PyTorch or TensorFlow workloads

Nomenclature Divergence

* CUDA thread = OpenCL work item = CPU SIMD lane
* CUDA multiprocessor = OpenCL compute unit = CPU core

=== High Performance Computers ===
HPC or supercomputing clusters provide high throughput analysis.

Amazingly high amount of computational power.

Need to plan your analysis.

'''NDSU Center for Computationally-Assisted Science and Technology (CCAST)''' provides a platform for these workloads:

* connect via <code>ssh</code>
* uses loadable modules, ''ex:'' <code>module load parallel</code>
* batch processing via PBS scripting
</div>

== <span>Continued Learning</span> ==
<div class="outline">
=== General Programming/Computers Websites ===

* StackOverflow.com - check before asking new questions
* RosettaCode.org - data structures and algorithms in many languages
* Linux.die.net/man/ - the Linux manual
* grymoire.com/Unix/ - more *nix CLI tutorials

=== Python ===

* docs.python.org/3/ - the official python documentation
* docs.python.org/3/tutorial - the official tutorial
* diveintopython.net - guided tutorial online
* pythontutor.com - visual debugger

=== R ===

* cran.r-project.org - CRAN
* cran.r-project.org/manuals.html
* rdrr.io - meta-manual lookup and many other tools for R
* swirlstats.com - learn R, in R
* statlearning.com - statistical machine learning coursework

=== Recommended Reading ===

* ''An Introduction to Statistical Learning'' by Gareth James et al.
* ''A Primer on Scientific Programming with Python'' by Hans Petter Langtangen
* ''R for Data Science'' by Wickham and Grolemund</div></div>