[ML] Machine Learning

Machine Learning

  • Modeling (What): the process of creating a model to understand how the collected data interact in a specified environment
  • Trial and Error (How): the process of making guesses about what will happen, measuring the output, and updating the model accordingly
  • Predictions (Why, Goal): making correct predictions with new data
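
The three ideas above can be sketched as one small loop. This is a minimal illustration, not a production training routine: the toy data, the single-parameter model y ≈ w * x, and the names `fit_slope`, `learning_rate`, and `steps` are all assumptions made for the example.

```python
# Hypothetical toy data: the outputs follow y = 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def fit_slope(xs, ys, learning_rate=0.01, steps=500):
    """Fit the model y ~ w * x by repeated guess, measure, update."""
    w = 0.0  # Modeling: start from an initial guess for the single parameter
    for _ in range(steps):
        # Measure: average gradient of the squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Update: move w a small step against the gradient
        w -= learning_rate * grad
    return w


w = fit_slope(xs, ys)
prediction = w * 5.0  # Prediction: apply the fitted model to a new input
```

With this data the fitted `w` settles very close to 2, so the prediction for the unseen input 5.0 is close to 10.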

AI, ML, and DL

Three terms in data science are often confused or used interchangeably.

  • Artificial Intelligence (AI): AI is the broad field of science concerned with how machines or computers mimic human intelligence, such as visual/audio perception and decision making.
  • Machine Learning (ML): ML is a subfield of AI. ML enables machines to improve their performance on a given task with experience. Not all AI engines are based on ML.
  • Deep Learning (DL): DL is a specialized field of ML. DL is based on deep Artificial Neural Networks (ANNs) trained on large datasets. ANNs loosely mimic the behavior of the human brain in mathematical form by connecting multi-layered (‘deep’) artificial neurons.

Data Science vs. Machine Learning

Both DS and ML

  • Use computer power
  • Deal with large sets of data
  • Use a lot of mathematics and statistics

DS

  • Analyze the existing data to draw conclusions

ML

  • Attempt to perform tasks or infer conclusions from new data without explicit instructions

Terminologies

Algorithm vs. Heuristic

  • Algorithm: A set of steps followed to solve a specific problem, intended to be repeatable: the same input always gives the same outcome.
  • Heuristic: A rule of thumb that guides a task but does not guarantee the best or a consistent result. It trades optimality for speed: we get a good solution much faster instead of the best solution.
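
The difference can be seen on a small coin-change problem. This is an illustrative sketch: the problem, the coin set, and the function names `min_coins_exact` and `min_coins_greedy` are assumptions chosen for the example, and the greedy version assumes a coin of value 1 is available so that it always finishes.

```python
def min_coins_exact(coins, amount):
    """Algorithm (dynamic programming): always finds the minimum coin count."""
    inf = float("inf")
    best = [0] + [inf] * amount  # best[t] = fewest coins that sum to t
    for total in range(1, amount + 1):
        for coin in coins:
            if coin <= total and best[total - coin] + 1 < best[total]:
                best[total] = best[total - coin] + 1
    return best[amount]


def min_coins_greedy(coins, amount):
    """Heuristic: always take the largest coin that still fits."""
    count = 0
    for coin in sorted(coins, reverse=True):
        count += amount // coin
        amount %= coin
    return count


exact = min_coins_exact([1, 3, 4], 6)    # 3 + 3
greedy = min_coins_greedy([1, 3, 4], 6)  # 4 + 1 + 1
```

Here the exact method returns 2 coins while the greedy shortcut returns 3: the heuristic does less work but gives up the guarantee of the best answer.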

Data

  • Structured Data: data that adheres to some rules or formats, easy to process
  • Unstructured Data: data without neat groups or formats, hard to process

Context

  • Provides the back story for data: where it came from, who created it, how it was collected, and whether it can be trusted.

Schema

  • A pre-defined structure of how data is related.
  • Schemas provide the mechanism of how data is stored, read, and processed.
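
As a rough sketch of the idea, a schema can be written down as a mapping from field names to expected types and used to check records before they are processed. The field names and the helper `conforms` below are hypothetical, and real systems (databases, serialization formats) define schemas in their own ways.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "name": str, "score": float}


def conforms(record, schema=SCHEMA):
    """Return True if the record has exactly the schema's fields with matching types."""
    if set(record) != set(schema):
        return False  # missing or unexpected field
    return all(isinstance(record[field], expected)
               for field, expected in schema.items())


good = {"id": 1, "name": "Ada", "score": 0.9}
bad_type = {"id": "1", "name": "Ada", "score": 0.9}  # id should be an int
missing = {"id": 1, "name": "Ada"}                   # score is absent
```

Only `good` passes `conforms`; the other two records are rejected before any further processing would see them.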

Tensor

  • A generic multidimensional array that generalizes scalars, vectors, and matrices to any number of dimensions (for example, a matrix of matrices)
  • A commonly used data type in ML because it is flexible and capable of facilitating multidimensional math.
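
To make the dimensions concrete, the sketch below represents tensors as plain nested Python lists and reports their shape. The helper `shape` is hypothetical and assumes regular (non-ragged) nesting; in practice ML libraries supply their own optimized tensor or array types.

```python
def shape(tensor):
    """Return the dimensions of a regular nested list, outermost first."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:  # an empty list has no inner level to inspect
            break
        tensor = tensor[0]
    return tuple(dims)


scalar = 5.0                          # 0 dimensions
vector = [1.0, 2.0, 3.0]              # 1 dimension
matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]            # 2 dimensions
batch = [matrix, matrix]              # 3 dimensions: a list of matrices
```

Calling `shape` on these gives `()`, `(3,)`, `(2, 3)`, and `(2, 2, 3)` respectively.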

Data Preparation

  • Upfront cleaning: removing errors, correcting missing and invalid values, removing duplicates
  • Aggregation: creating summary records through consolidation of data
  • Transformation: changing data into another format, unit, or value
  • Normalization: scaling values to a more consistent or manageable range
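
The four steps above might look like this on a tiny, made-up table of temperature readings. The records, the choice of min-max scaling for normalization, and the order of the steps are assumptions for illustration; the scaling also assumes the values are not all equal.

```python
# Hypothetical raw records: (city, temperature in Fahrenheit); None marks a missing value.
raw = [("Oslo", 50.0), ("Oslo", 50.0), ("Oslo", 68.0),
       ("Cairo", 86.0), ("Cairo", None), ("Cairo", 104.0)]

# Upfront cleaning: drop records with missing values, then drop exact duplicates
# (dict.fromkeys keeps the first occurrence and preserves order).
cleaned = list(dict.fromkeys(r for r in raw if r[1] is not None))

# Transformation: convert Fahrenheit to Celsius.
celsius = [(city, (f - 32.0) * 5.0 / 9.0) for city, f in cleaned]

# Aggregation: one summary record (mean temperature) per city.
groups = {}
for city, c in celsius:
    groups.setdefault(city, []).append(c)
mean_by_city = {city: sum(vals) / len(vals) for city, vals in groups.items()}

# Normalization: min-max scale the Celsius values into the range [0, 1].
values = [c for _, c in celsius]
lo, hi = min(values), max(values)
normalized = [(c - lo) / (hi - lo) for c in values]
```

Six raw records become four clean ones, two per-city summaries, and a list of values that all lie between 0 and 1.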

Datasets

  • Training dataset: A batch of data used as the initial input to the ML process (example dataset)
  • Validation dataset: A subset of data used to evaluate the training process. It is used to tune or adjust the model. (development dataset)
  • Testing dataset: A batch of data used to evaluate the accuracy of the finished model. (hold-out dataset)
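
One common way to obtain the three datasets is to shuffle the available records once and cut them into three parts. The sketch below assumes a 70/15/15 split and a fixed random seed; both are illustrative choices, and `split_dataset` is a hypothetical helper rather than a library function.

```python
import random


def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of the records and cut it into training, validation, and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is held out
    return train, val, test


train, val, test = split_dataset(range(100))
```

Every record lands in exactly one part, so the testing dataset stays unseen while the model is trained and tuned.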
