Machine Learning
- Modeling (What): the process of creating a model to understand how the collected data interact in a specified environment
- Trial and Error (How): the process of making guesses about what will happen, measuring the output, and updating the model accordingly
- Predictions (Why, Goal): making correct predictions with new data
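The three ideas above can be sketched in a few lines. This is a minimal illustration, not a definitive method: the toy data, the one-parameter model y = w * x, the learning rate, and the step count are all assumptions chosen for the example.

```python
# Hypothetical toy data in which y is exactly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mean_squared_error(w, xs, ys):
    """Measure how far the model's guesses (w * x) are from the observed y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit(xs, ys, learning_rate=0.01, steps=500):
    """Trial and error: guess, measure the error, update the model, repeat."""
    w = 0.0  # initial guess for the single model parameter
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad  # nudge the model to reduce the error
    return w

w = fit(xs, ys)            # Modeling: the learned parameter
prediction = w * 5.0       # Predictions: apply the model to new data (x = 5)
```

Here the model is the value of w, the loop is the trial and error, and the last line is the prediction on an input the model has not seen.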
AI, ML, and DL
Three terms in data science are often confused with one another.
- Artificial Intelligence (AI): AI is the broad field that studies how machines or computers can mimic human intelligence, such as visual and audio perception and decision making.
- Machine Learning (ML): ML is a subfield of AI. ML enables machines to improve their performance on a given task with experience. Not all AI engines are based on ML.
- Deep Learning (DL): DL is a specialized field of ML. DL is based on deep artificial neural networks (ANNs) trained on large datasets. ANNs mathematically mimic the behavior of the human brain by connecting multi-layered (‘deep’) artificial neurons.
Data Science vs. Machine Learning
DS and ML
- Use computing power
- Deal with large sets of data
- Use a lot of mathematics and statistics
DS
- Analyze the existing data to draw conclusions
ML
- Attempts to perform tasks or infer conclusions with new data without explicit instructions
Terminologies
Algorithm vs. Heuristic
- Algorithm: A set of steps to follow to solve a specific problem, intended to be repeatable: the same input always gives the same outcome.
- Heuristic: Guidance on how to do a task that does not guarantee a consistent output. It trades the best solution for a good solution that can be found much faster.
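As a rough illustration of the difference, consider making change with the fewest coins. The problem, the coin values, and both function names are assumptions chosen for this sketch. The exact version always finds the best answer; the greedy rule of thumb is simpler and faster but can miss it.

```python
def min_coins_exact(coins, amount):
    """Dynamic programming: always returns the fewest coins, or None."""
    INF = float("inf")
    best = [0] + [INF] * amount  # best[t] = fewest coins that sum to t
    for total in range(1, amount + 1):
        for coin in coins:
            if coin <= total and best[total - coin] + 1 < best[total]:
                best[total] = best[total - coin] + 1
    return None if best[amount] == INF else best[amount]

def min_coins_greedy(coins, amount):
    """Heuristic: repeatedly take the largest coin that still fits."""
    count = 0
    for coin in sorted(coins, reverse=True):
        taken, amount = divmod(amount, coin)
        count += taken
    return count if amount == 0 else None

coins = [1, 3, 4]
exact = min_coins_exact(coins, 6)    # 3 + 3
greedy = min_coins_greedy(coins, 6)  # 4 + 1 + 1
```

With these coin values the greedy rule uses three coins where two would do: a good answer, found quickly, but not the best one.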
Data
- Structured Data: data that adheres to some rules or formats, easy to process
- Unstructured Data: data without neat groups or formats, hard to process
Context
- Provides the backstory for data: where it came from, who created it, how it was collected, and whether it is trustworthy.
Schema
- A pre-defined structure of how data is related.
- Schemas provide the mechanism of how data is stored, read, and processed.
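One simple way to picture a schema is as a mapping from field names to expected types, against which each record is checked before it is stored or processed. The field names and the conforms helper below are hypothetical, and real systems (databases, serialization formats) define schemas in their own ways.

```python
# Hypothetical schema: each field name maps to the type its value must have.
SENSOR_SCHEMA = {
    "sensor_id": str,
    "temperature_c": float,
    "timestamp": int,
}

def conforms(record, schema):
    """Return True if the record has exactly the schema's fields and types."""
    if set(record) != set(schema):
        return False  # missing or unexpected field
    return all(isinstance(record[name], typ) for name, typ in schema.items())

good = {"sensor_id": "a1", "temperature_c": 21.5, "timestamp": 1700000000}
bad = {"sensor_id": "a1", "temperature_c": "warm", "timestamp": 1700000000}

good_ok = conforms(good, SENSOR_SCHEMA)
bad_ok = conforms(bad, SENSOR_SCHEMA)
```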
Tensor
- A generic multidimensional array that generalizes scalars, vectors, and matrices to any number of dimensions (for example, a stack of matrices)
- A commonly used data type in ML because it is flexible and capable of facilitating multidimensional math.
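A small sketch using plain nested lists may make the idea concrete. The shape helper is a hypothetical function written for this example and assumes regular, non-empty nesting; in practice, array libraries such as NumPy provide tensors with a built-in shape attribute.

```python
def shape(tensor):
    """Return the size of each dimension of a regular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]  # descend one level (assumes non-empty lists)
    return tuple(dims)

scalar = 5.0                        # 0 dimensions
vector = [1.0, 2.0, 3.0]            # 1 dimension
matrix = [[1.0, 2.0], [3.0, 4.0]]   # 2 dimensions
# 3 dimensions: two 2x2 matrices stacked together.
cube = [matrix, [[5.0, 6.0], [7.0, 8.0]]]

shapes = [shape(t) for t in (scalar, vector, matrix, cube)]
```

Each extra level of nesting adds one dimension, which is what lets the same data type hold a single number, a table, or a batch of images.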
Data Preparation
- Upfront cleaning: removing errors, correcting missing and invalid values, removing duplicates
- Aggregation: creating summary records through consolidation of data
- Transformation: changing data into another format, unit, or value
- Normalization: scaling values to a more consistent or manageable range
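The four steps above can be sketched on a tiny, made-up dataset. The records, the Fahrenheit-to-Celsius conversion, and the choice of min-max scaling are assumptions for illustration only; the steps are also run in the order that suits this data (transformation before aggregation).

```python
from collections import defaultdict

# Hypothetical raw records: (city, temperature in Fahrenheit or None).
raw = [
    ("Oslo", 50.0),
    ("Oslo", 50.0),    # duplicate
    ("Oslo", 68.0),
    ("Cairo", 104.0),
    ("Cairo", None),   # missing value
    ("Cairo", 86.0),
]

# Upfront cleaning: drop missing values and exact duplicates, keep order.
cleaned, seen = [], set()
for record in raw:
    if record[1] is None or record in seen:
        continue
    seen.add(record)
    cleaned.append(record)

# Transformation: convert the unit from Fahrenheit to Celsius.
celsius = [(city, (temp_f - 32) * 5 / 9) for city, temp_f in cleaned]

# Aggregation: consolidate into one summary record (the mean) per city.
by_city = defaultdict(list)
for city, temp_c in celsius:
    by_city[city].append(temp_c)
mean_c = {city: sum(vals) / len(vals) for city, vals in by_city.items()}

# Normalization: min-max scale the means into the range [0, 1].
# (This form assumes the values are not all equal, so high != low.)
low, high = min(mean_c.values()), max(mean_c.values())
normalized = {city: (v - low) / (high - low) for city, v in mean_c.items()}
```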
Datasets
- Training dataset: A batch of data used as initial inputs to the ML process (example dataset)
- Validation dataset: A subset of data used to evaluate the training process. It is used to tune or adjust the model. (development dataset)
- Testing dataset: A batch of data used to evaluate the accuracy of the finished model. (hold-out dataset)
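A minimal sketch of carving one collection of records into these three datasets follows. The split_dataset helper, the 70/15/15 proportions, and the fixed seed are assumptions for the example, not recommended values.

```python
import random

def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the records, then cut them into train, validation, and test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out
    return train, validation, test

# Hypothetical dataset of 100 record IDs.
train, validation, test = split_dataset(range(100))
```

The key property is that the three sets do not overlap: the testing dataset stays untouched during training and tuning, so it gives an honest measure of the finished model.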