[ML] Machine Learning

Modeling (What): the process of creating a model to understand how the collected data interact in a specified environment
Trial and Error (How): the process of making guesses about what will happen, measuring the output, and updating the model accordingly
Predictions (Why, Goal): making correct predictions with new data
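
The three ideas above can be sketched as one small loop. This is only a minimal illustration, assuming a straight-line model and plain gradient descent; the toy data, the function name fit_line, and the learning_rate and steps values are all made up for this example.

```python
# Toy data generated by a hypothetical rule y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit the model y = w*x + b by repeated guess, measure, update."""
    w, b = 0.0, 0.0  # initial guess for the model parameters
    n = len(xs)
    for _ in range(steps):
        # Guess: predict with the current model.
        preds = [w * x + b for x in xs]
        # Measure: how far off is each prediction?
        errors = [p - y for p, y in zip(preds, ys)]
        # Update: nudge w and b in the direction that lowers the mean squared error.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = fit_line(xs, ys)

# Prediction: apply the fitted model to an input it has not seen.
prediction = w * 10.0 + b
```

The model here is the pair (w, b), the loop is the trial and error, and the last line is the prediction on new data.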

Three terms in data science are often confused with one another.

Artificial Intelligence (AI): AI is the broad field of science concerned with how machines or computers mimic human intelligence, such as visual/audio perception and decision making.
Machine Learning (ML): ML is a subfield of AI. ML enables machines to improve their performance on a given task with experience. Not all AI engines are based on ML.
Deep Learning (DL): DL is a specialized field of ML. DL is based on deep Artificial Neural Networks (ANNs) trained on large datasets. ANNs mathematically mimic the behavior of the human brain by connecting artificial neurons in multiple (‘deep’) layers.

DS and ML

Both attempt to perform tasks or infer conclusions from new data without explicit instructions

Algorithm vs. Heuristic

Algorithm: A set of steps for solving a specific problem, intended to be repeatable: the same input always gives the same outcome.
Heuristic: A rule of thumb for doing a task that does not guarantee a consistent or optimal output. It trades the best solution for a good solution found much faster.
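
The trade-off between a guaranteed answer and a fast, good-enough answer can be sketched with a coin-change problem. This is an illustrative example only: both function names and the coin set are assumptions, and the greedy rule shown is deterministic, so it illustrates the "good but not best" side of a heuristic rather than the inconsistency side.

```python
def fewest_coins_exact(coins, amount):
    """Dynamic programming over every sub-amount: always finds the true minimum."""
    INF = float("inf")
    best = [0] + [INF] * amount  # best[t] = fewest coins that sum to t
    for total in range(1, amount + 1):
        for c in coins:
            if c <= total and best[total - c] + 1 < best[total]:
                best[total] = best[total - c] + 1
    return best[amount]

def fewest_coins_greedy(coins, amount):
    """Heuristic: always take the largest coin that still fits."""
    count = 0
    for c in sorted(coins, reverse=True):
        count += amount // c
        amount %= c
    # None means the greedy rule could not reach the amount at all.
    return count if amount == 0 else None

coins = [1, 3, 4]
exact = fewest_coins_exact(coins, 6)    # 3 + 3
greedy = fewest_coins_greedy(coins, 6)  # 4 + 1 + 1
```

With these coins the exact method needs 2 coins and the greedy shortcut uses 3: a usable answer with less work, but not the best one.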

Data

Structured Data: data that adheres to some rules or formats, easy to process
Unstructured Data: data without neat groups or formats, hard to process
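
A small sketch of the difference, using made-up sample data: the structured text can be parsed mechanically with Python's csv module, while the free text needs an ad hoc pattern that happens to work only for this sentence.

```python
import csv
import io
import re

# Structured: every row follows the same column layout, so parsing is mechanical.
structured = "name,age\nAda,36\nGrace,45\n"
rows = list(csv.DictReader(io.StringIO(structured)))
ages = [int(r["age"]) for r in rows]

# Unstructured: the same facts in free text; extraction relies on a guessed pattern
# (here, "any standalone number of up to three digits is an age").
unstructured = "Ada turned 36 last spring, while Grace, now 45, retired."
guessed_ages = [int(n) for n in re.findall(r"\b\d{1,3}\b", unstructured)]
```

The regular expression would break as soon as the text mentioned any other number, which is what makes unstructured data hard to process.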

Context

Provides the backstory for data: where it came from, who created it, how it was collected, and whether it is trustworthy.

Schema

Tensor

A generic multidimensional array (a generalization of vectors and matrices to any number of dimensions)
A commonly used data type in ML because it is flexible and capable of facilitating multidimensional math.
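
As a rough illustration, tensors of different ranks can be written as nested Python lists. The helper name shape is an assumption for this sketch; in practice a numerical library would normally provide the tensor type and its shape.

```python
def shape(t):
    """Return the shape of a regular (non-ragged) nested list as a tuple."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        if not t:  # empty list: nothing further to descend into
            break
        t = t[0]
    return tuple(dims)

scalar = 5.0                       # rank 0: a single number
vector = [1.0, 2.0, 3.0]           # rank 1: a list of numbers
matrix = [[1.0, 2.0], [3.0, 4.0]]  # rank 2: rows and columns
# rank 3: for example a 2x2 image with 3 color channels per pixel
image = [[[255, 0, 0], [0, 255, 0]],
         [[0, 0, 255], [255, 255, 255]]]

shapes = [shape(scalar), shape(vector), shape(matrix), shape(image)]
```

Each extra level of nesting adds one dimension, which is why the same data type can hold a table, an image, or a batch of images.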

Data Preparation

Upfront cleaning: removing errors, correcting missing and invalid values, removing duplicates
Aggregation: creating summary records through consolidation of data
Transformation: changing data into another format, unit, or value
Normalization: scaling values to a more consistent or manageable range
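
The four steps can be sketched on a tiny made-up dataset. Everything here is an assumption for illustration: the records, the -999 sentinel for invalid readings, the choice of min-max scaling for normalization, and the order of the steps (transformation is done before aggregation so the summaries come out in Celsius).

```python
from collections import defaultdict

# Hypothetical raw records: (city, temperature in Fahrenheit).
raw = [
    ("Austin", 95.0),
    ("Austin", 95.0),    # duplicate
    ("Boston", None),    # missing value
    ("Boston", 68.0),
    ("Austin", 86.0),
    ("Boston", -999.0),  # invalid sentinel value
    ("Boston", 77.0),
]

# Upfront cleaning: drop missing or invalid values and duplicates, keeping order.
seen = set()
clean = []
for row in raw:
    city, temp = row
    if temp is None or temp < -100 or row in seen:
        continue
    seen.add(row)
    clean.append(row)

# Transformation: convert Fahrenheit to Celsius.
celsius = [(city, (temp - 32) * 5 / 9) for city, temp in clean]

# Aggregation: one summary record (mean temperature) per city.
groups = defaultdict(list)
for city, temp in celsius:
    groups[city].append(temp)
mean_by_city = {city: sum(ts) / len(ts) for city, ts in groups.items()}

# Normalization: min-max scale all temperatures into the range [0, 1].
temps = [t for _, t in celsius]
lo, hi = min(temps), max(temps)
normalized = [(t - lo) / (hi - lo) for t in temps]
```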

Datasets

Training dataset: A batch of data used as the initial input to the ML process (example dataset)
Validation dataset: A subset of data used to evaluate the training process. It is used to tune or adjust the model. (development dataset)
Testing dataset: A batch of data used to evaluate the accuracy of the finished model. (hold-out dataset)
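
One common way to obtain the three datasets is to shuffle the records once and cut them into three parts. The sketch below assumes a 70/15/15 split and a fixed random seed; the function name split_dataset and those fractions are illustrative choices, not a rule.

```python
import random

def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of records and cut it into train/validation/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is held out
    return train, val, test

records = list(range(100))  # stand-in for 100 real examples
train, val, test = split_dataset(records)
```

The test part is set aside until the model is finished, so the final accuracy is measured on data that influenced neither training nor tuning.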

Published by P. L.