Overview

Data is organized as a table of rows and columns, like a spreadsheet. The rows are data points (also called observations or measurements) and the columns are the properties (features) recorded for each data point. The columns are partitioned into input columns and output columns. This partition is arbitrary, but the input is typically what you will be given from a measurement, and the output is what you would like to predict. The input values (one for each input column in a row) can be numerical (a continuous quantity, such as a height) or categorical (one of a discrete set of labels).

If the output is numerical, the problem is called regression; if the output is a category, it is called classification. Some data, like photos, can carry more than one label at once (multilabel classification); e.g. this picture contains both "mom" and "dad".
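As a minimal sketch of such a table (using the pandas library; the column names and numbers are made up for illustration):

```python
import pandas as pd

# A tiny illustrative dataset: each row is one observation (one person).
data = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],   # input column (numerical)
    "sex":       ["F", "M", "F", "M"],   # input column (categorical)
    "weight_kg": [50, 62, 58, 75],       # output column (numerical -> regression)
})

X = data[["height_cm", "sex"]]  # input columns
y = data["weight_kg"]           # output column we would like to predict
```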

A batch of rows is a subset of rows that are processed together for efficiency. For example, if you have 1024 rows of data, you might partition them into 8 batches of 128 records so that they can efficiently be loaded into a GPU. Batch sizes are typically a power of two.
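For example, here is a minimal NumPy sketch of splitting 1024 rows into 8 batches of 128 (the data itself is random placeholder values):

```python
import numpy as np

rows = np.random.rand(1024, 3)          # 1024 rows, 3 input columns of placeholder data
batch_size = 128                        # typically a power of two

batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
print(len(batches), batches[0].shape)   # -> 8 (128, 3)
```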

A model is a function that takes an input and returns an output. $f(x) = x^2$ is a trivial model that takes one real input value and returns a real output value. In school, you are taught to take the input (-3, -2, -1, 0, 1, 2, 3, ...) and generate the output (9, 4, 1, 0, 1, 4, 9) from the given model. But in science, nature (through experimentation) produces the input and the output, and you are supposed to find a model that best fits the data; i.e. one that minimizes the cost (the prediction error). For example, you might observe a baseball being thrown upward with some initial velocity v and record the height of the baseball at various times t. Later, you might surmise that the formula for the height at time t is $h(t) = -16t^2 + vt + h_0$. This is the standard formula derived from the laws of physics, but how well does it fit the observed data, which contains measurement errors and is subject to air friction, humidity, altitude, and wind? The model does not take these additional factors into account, so we would expect some error between the predicted output and the actual data (also called the target output, ground truth, or annotations). We can measure how well the model predicts the output from the input by computing an error (cost) score, such as the mean squared error between the predicted heights and the measured heights.
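A minimal sketch of that measurement (the velocity, starting height, and measurements below are made-up numbers for illustration):

```python
import numpy as np

def h(t, v=40.0, h0=5.0):
    """Predicted height (feet) at time t for the model h(t) = -16 t^2 + v t + h0."""
    return -16.0 * t**2 + v * t + h0

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])             # times of the measurements
measured = np.array([5.2, 20.5, 29.3, 28.8, 21.0])  # made-up observed heights

predicted = h(t)
mse = np.mean((predicted - measured) ** 2)  # mean squared error: lower means a better fit
print(mse)
```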

Suppose, for example, that your data has two columns, height and weight, and we want to predict the unknown weight of a person given the person's known height. We would then define the input column to be height and the output column to be weight. Then we'd "train" the model and use it to predict the output from the input.

However, one trivial model (for illustration) is to ignore the input altogether and simply predict the average of the output values. So regardless of a person's height, we would always guess the person's weight to be, say, 62 kg. Of course we can do better by taking into account information about the person, such as their height and sex.
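A minimal sketch of this baseline, assuming the known weights are in a NumPy array (the numbers are placeholders):

```python
import numpy as np

weights_kg = np.array([50, 62, 58, 75, 66])  # known output values (made up)

def predict_weight(height_cm):
    """Baseline model: ignore the input and always predict the mean weight."""
    return weights_kg.mean()

print(predict_weight(180))  # always the same guess, regardless of height
```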

In statistics, a popular model (invented in the early 1800s, long before computers) is called Linear Regression. We use the data to calculate the two parameters (Y-intercept and slope) of the line that best fits the data. We don't really "train" the model to find these two parameters; well, maybe we iterate through the data only one time, which is all that is needed to compute them, and maybe that is a form of training. The least squares method is used to compute the best-fit line, and the Pearson correlation coefficient measures how well it fits. If this sounds complicated, know that most $15 scientific calculators can do linear regression. Here are some screenshots from the calculator manual that illustrate an old way of finding a model from the data:

[Screenshots from the calculator manual illustrating its linear-regression functions]
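For comparison with the calculator approach, here is a minimal sketch of the same least-squares fit in Python (np.polyfit computes the slope and intercept; the heights and weights are made-up numbers):

```python
import numpy as np

heights_cm = np.array([150, 160, 170, 180, 190])   # input (made up)
weights_kg = np.array([50, 57, 64, 71, 78])        # output (made up)

# Least-squares fit of a line: weight = slope * height + intercept
slope, intercept = np.polyfit(heights_cm, weights_kg, 1)

# Pearson correlation coefficient, a measure of how well the line fits
r = np.corrcoef(heights_cm, weights_kg)[0, 1]

print(slope, intercept, r)
predicted = slope * 175 + intercept   # predict the weight of a 175 cm person
```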

As per George Box, "All models are wrong, but some are useful".

Linear Regression is a very simple model, but imagine a model that takes thousands of input values (like the pixels of a photo) and returns the category of what is in the picture. Such a model is too complicated for humans to create by hand, or even to understand fully. But it can be created from data.

Machine Learning (ML) is about algorithms that train such a complex model (creating it from scratch) using data. The field is very much described in the language of mathematics.

General Strategy

To train a model, the following general steps are used:

Import required libraries

Read in the dataset

Contains the input and output data for supervised learning.

Partition data columns into input and output
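A minimal sketch of the reading and column-partitioning steps with pandas (the file name and column names are assumptions for illustration):

```python
import pandas as pd

# Read in the dataset (hypothetical file and column names)
data = pd.read_csv("people.csv")

# Partition the columns: inputs (features) vs. output (label)
X = data[["height_cm", "age"]]   # input columns
y = data["weight_kg"]            # output column
```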

Partition data rows into training and testing data

Split the dataset into 75% of the rows, chosen at random, for training and the remaining 25% for testing.
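A minimal sketch using scikit-learn's train_test_split (assuming X and y from the previous step):

```python
from sklearn.model_selection import train_test_split

# 75% of the rows (chosen at random) for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```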

Scale the input data

If you can linearly scale the input data so that it lies between -1 and 1, your neural network (NN) will generally train better.
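A minimal sketch with scikit-learn's MinMaxScaler, which linearly rescales each input column into a chosen range (assuming the X_train and X_test from the previous step):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same scaling for test data
```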

Pick the model type

You'll draw upon the wisdom of others to decide this.

Train the model on the training data

Pass through all the rows in the training set to train the network; one complete pass is called an "epoch". During each epoch, the error (or cost score) is used to decide how to change the model's parameters (e.g. via backpropagation), with the size of each change scaled by a factor called the learning rate.
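A minimal Keras sketch of this step (the layer sizes, learning rate, and epoch count are arbitrary illustrative choices, and X_train_scaled and y_train come from the earlier sketches):

```python
from tensorflow import keras

# A small fully connected network with a single numerical output (regression)
model = keras.Sequential([
    keras.Input(shape=(X_train_scaled.shape[1],)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])

# The optimizer applies backpropagation updates scaled by the learning rate
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")

# Each epoch is one pass over all the training rows, processed in batches
model.fit(X_train_scaled, y_train, epochs=50, batch_size=32, verbose=0)
```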

Test the "trained" model on the testing data

For each row in the testing set:
  Use the model to predict the output from the input.
  Score it and see how well you did.

There are lots of types of reports, including the "confusion matrix".
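A minimal sketch of the testing step for the regression model above, plus the confusion-matrix report one would use for a classification task (the classification labels here are made-up placeholders):

```python
from sklearn.metrics import mean_squared_error, confusion_matrix

# Regression: predict the test outputs and score the error
y_pred = model.predict(X_test_scaled).ravel()
print("test MSE:", mean_squared_error(y_test, y_pred))

# Classification example: compare true vs. predicted category labels (made up)
y_true_cls = ["cat", "dog", "dog", "cat", "dog"]
y_pred_cls = ["cat", "dog", "cat", "cat", "dog"]
print(confusion_matrix(y_true_cls, y_pred_cls, labels=["cat", "dog"]))
```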

Tools and Libraries

TensorFlow

A mature, popular machine-learning library.

Keras

Adds very convenient wrapper classes to TensorFlow to make it easier to use.

PyTorch

Another widely used machine-learning library, developed at Meta (originally Facebook), known for its define-by-run (dynamic graph) style.

JAX

JAX is a Python library designed for high-performance numerical computing. It describes itself as "Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more."
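A tiny sketch of what two of those transformations look like (grad for differentiation, jit for compilation):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 2)       # a simple NumPy-style function

grad_f = jax.grad(f)             # differentiate: returns the gradient function
fast_f = jax.jit(f)              # JIT-compile for CPU/GPU/TPU

x = jnp.array([1.0, 2.0, 3.0])
print(fast_f(x), grad_f(x))      # 14.0 and [2. 4. 6.]
```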

References