[ML] Machine Learning

Machine Learning

• Modeling (What): the process of creating a model to understand how the collected data interact in a specified environment
• Trial and Error (How): the process of making guesses about what will happen, measuring the output, and updating the model accordingly
• Predictions (Why, Goal): making correct predictions with new data
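The guess, measure, update cycle above can be sketched with a tiny example. This is only an illustration under stated assumptions: the toy data, the one-parameter model y = w * x, the function name fit_slope, and the learning rate are all hypothetical choices, not something these notes prescribe.

```python
# Hypothetical toy data: each output is exactly 2 * input.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def fit_slope(xs, ys, learning_rate=0.01, steps=500):
    """Fit the model y = w * x by repeated guess, measure, update."""
    w = 0.0  # initial guess for the model parameter
    for _ in range(steps):
        # Measure: gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Update: nudge the guess in the direction that reduces the error.
        w -= learning_rate * grad
    return w

w = fit_slope(xs, ys)
prediction = w * 5.0  # prediction for a new input the model has not seen
```

Here the learned w ends up very close to 2, so the prediction for the new input 5.0 is close to 10.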

AI, ML, and DL

Three terms in data science are often confused with one another.

• Artificial Intelligence (AI): AI is the broad field that studies how machines or computers can mimic human intelligence, such as visual/audio perception and decision making.
• Machine Learning (ML): ML is a subfield of AI. ML enables machines to improve their performance on a given task with experience. Not all AI engines are based on ML.
• Deep Learning (DL): DL is a specialized field of ML. DL is based on deep Artificial Neural Networks (ANNs) trained on large datasets. ANNs mathematically mimic the behavior of the human brain by connecting multi-layered (‘deep’) artificial neurons.

Data Science vs. Machine Learning

DS and ML

• Use computer power
• Deal with large sets of data
• Use a lot of mathematics and statistics

DS

• Analyze the existing data to draw conclusions

ML

• Attempt to perform tasks or infer conclusions with new data without explicit instructions

Terminologies

Algorithm vs. Heuristic

• Algorithm: A set of steps to follow to solve a specific problem, intended to be repeatable: the same input always gives the same outcome.
• Heuristic: Guidance on doing a task that does not guarantee a consistent output. It trades the best solution for a good solution found much faster.
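As an illustration of the "good but not necessarily best" side of this trade-off, here is a minimal sketch using coin change. The problem, the function names, and the coin set are hypothetical choices made for this example. Note that the greedy rule below happens to be deterministic; it counts as a heuristic here only because it gives no guarantee of finding the best answer.

```python
def greedy_coins(amount, coins):
    """Heuristic: always take the largest coin that still fits."""
    count = 0
    for coin in sorted(coins, reverse=True):
        count += amount // coin
        amount %= coin
    # None means the greedy rule could not reach the exact amount.
    return count if amount == 0 else None

def optimal_coins(amount, coins):
    """Exact method: dynamic programming over every amount up to the target."""
    inf = float("inf")
    best = [0] + [inf] * amount  # best[t] = fewest coins that sum to t
    for total in range(1, amount + 1):
        for coin in coins:
            if coin <= total and best[total - coin] + 1 < best[total]:
                best[total] = best[total - coin] + 1
    return best[amount] if best[amount] != inf else None

coins = [1, 3, 4]
greedy_result = greedy_coins(6, coins)    # 4 + 1 + 1, three coins
optimal_result = optimal_coins(6, coins)  # 3 + 3, two coins
```

The greedy version does far less work but settles for three coins where two are enough.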

Data

• Structured Data: data that adheres to some rules or formats, easy to process
• Unstructured Data: data without neat groups or formats, hard to process

Context

• Provides the back story for data, such as where it came from, who created it, how it was collected, and whether it is trustworthy.

Schema

• A pre-defined structure of how data is related.
• Schemas provide the mechanism of how data is stored, read, and processed.
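One simple way to picture a schema is as a mapping from field names to expected types that every record is checked against. The sketch below assumes that simplified view; USER_SCHEMA, its fields, and the conforms helper are hypothetical names invented for this example, and real schema systems typically cover relationships and constraints as well.

```python
# Hypothetical schema: field name -> expected Python type.
USER_SCHEMA = {"id": int, "name": str, "age": int}

def conforms(record, schema):
    """Return True if the record has exactly the schema's fields and types."""
    if record.keys() != schema.keys():
        return False  # a field is missing or an unexpected field is present
    return all(isinstance(record[field], expected)
               for field, expected in schema.items())

good = {"id": 1, "name": "Ada", "age": 36}
bad_type = {"id": "1", "name": "Ada", "age": 36}  # id should be an int
missing = {"id": 2, "name": "Alan"}               # age is absent

results = [conforms(r, USER_SCHEMA) for r in (good, bad_type, missing)]
```

Only the first record passes; the other two are rejected before any processing step has to deal with them.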

Tensor

• A generic multidimensional array that generalizes scalars, vectors, and matrices to any number of dimensions
• A commonly used data type in ML because it is flexible and capable of facilitating multidimensional math.
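To make the idea of dimensions concrete, the sketch below represents tensors as plain nested Python lists and reads off their shape. This is an assumption made for illustration only: ML libraries provide their own optimized tensor types, and the shape helper here is a hypothetical function that expects regular (non-ragged) nesting.

```python
def shape(tensor):
    """Return the shape of a regular nested list as a tuple of sizes."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        # Descend into the first element; stop on an empty list.
        tensor = tensor[0] if tensor else None
    return tuple(dims)

scalar = 5.0                     # 0 dimensions
vector = [1.0, 2.0, 3.0]         # 1 dimension
matrix = [[1, 2, 3], [4, 5, 6]]  # 2 dimensions
cube = [[[0, 1], [2, 3]],
        [[4, 5], [6, 7]],
        [[8, 9], [10, 11]]]      # 3 dimensions

shapes = [shape(t) for t in (scalar, vector, matrix, cube)]
```

The number of entries in each shape tuple is the number of dimensions of that tensor.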

Data Preparation

• Upfront cleaning: removing errors, correcting missing and invalid values, removing duplicates
• Aggregation: creating summary records through consolidation of data
• Transformation: changing data into another format, unit, or value
• Normalization: scaling values to a more consistent or manageable range
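The four steps above can be sketched on a tiny dataset. Everything concrete here is a hypothetical choice for illustration: the city temperature records, the Fahrenheit to Celsius conversion as the transformation, the per-city mean as the aggregation, and min-max scaling as the normalization.

```python
from collections import defaultdict

# Hypothetical raw records: (city, temperature in Fahrenheit).
raw = [
    ("Oslo", 50.0),
    ("Oslo", 50.0),   # exact duplicate
    ("Cairo", 95.0),
    ("Cairo", None),  # missing value
    ("Lima", 68.0),
    ("Oslo", 32.0),
]

# Upfront cleaning: drop missing values and exact duplicates, keeping order.
cleaned = list(dict.fromkeys(r for r in raw if r[1] is not None))

# Aggregation: one summary record (mean temperature) per city.
groups = defaultdict(list)
for city, temp_f in cleaned:
    groups[city].append(temp_f)
mean_f = {city: sum(temps) / len(temps) for city, temps in groups.items()}

# Transformation: convert the unit from Fahrenheit to Celsius.
mean_c = {city: (f - 32.0) * 5.0 / 9.0 for city, f in mean_f.items()}

# Normalization: min-max scale the values into the range [0, 1].
low, high = min(mean_c.values()), max(mean_c.values())
normalized = {city: (c - low) / (high - low) for city, c in mean_c.items()}
```

After these steps the coldest city maps to 0 and the warmest to 1, which keeps the values in a range that is easier for many models to work with.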

Datasets

• Training dataset: A batch of data used as the initial input to the ML process (example dataset)
• Validation dataset: A subset of data used to evaluate the training process. It is used to tune or adjust the model. (development dataset)
• Testing dataset: A batch of data used to evaluate the accuracy of the finished model. (hold-out dataset)
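A minimal sketch of carving one collection of records into these three datasets follows. The 70/15/15 proportions, the fixed seed, and the split_dataset helper are assumptions chosen for the example; suitable proportions depend on the amount of data and the task.

```python
import random

def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of the records and cut it into three disjoint parts."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out
    return train, validation, test

# Hypothetical data: 100 records identified by the integers 0..99.
train, validation, test = split_dataset(range(100))
```

Because the three slices never overlap, the testing dataset stays unseen during training and tuning, which is what makes it a fair check of the finished model.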