Code&Data Insights

[Machine Learning] What is Machine Learning?

paka_corn 2023. 5. 13. 05:16

[ Terminology in Machine Learning  ]

- Training set : the data used to train the model, made up of input features and target variables ( x + y )

- x : input variable | (input) feature

- y : output variable | target variable

- m : number of training examples

- (x, y) : a single training example

- ŷ (y hat) : the predicted output
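
To make the notation concrete, here is a minimal sketch that maps each term onto a toy dataset. The data (house size and price) and the model parameters `w` and `b` are made-up values chosen for illustration, not something from a real dataset.

```python
# Hypothetical training set: (house size in m^2, price) pairs.
training_set = [(50.0, 150.0), (80.0, 240.0), (120.0, 360.0)]

m = len(training_set)       # m: number of training examples
x, y = training_set[0]      # (x, y): a single training example


def predict(x, w=3.0, b=0.0):
    """A hypothetical model f(x) = w * x + b; w and b are illustrative values."""
    return w * x + b


y_hat = predict(x)          # y_hat: the predicted output for input x
print(m, x, y, y_hat)
```

Here `y` is the known target from the data, while `y_hat` is what the model outputs for the same `x`.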

 

Types of data in ML

: Machine learning builds predictive models by learning from data. The data is usually divided into two sets: training data and testing data.

 

1) Training Data

- the initial dataset used to train machine learning algorithms

- it is usually larger than the testing data.

- more training data generally leads to more accurate predictive models, as long as the data is representative of the problem.

 

 

2) Test(ing) Data 

- Once the model has been built with the training (historical) data, it needs to be tested. Testing data is used to evaluate how well the model performs on examples it has not seen.

- The test set should only be used for evaluation. If the model is adjusted or tuned, that is normally done on a separate validation set, so that the test score remains an honest estimate.

 

 

 

 

Why do we have to use a new dataset for the test set?

> If we test on the training data, the model will almost certainly score well, because it has already seen and learned from that data. A high score then tells us nothing about overfitting, where the model becomes too specific to the training data and fails to generalize to unseen data from the real world; the model may be badly overfitted and we would not notice. Therefore the test data must be independent of the training data and consist of previously unseen examples. Only then does the test score show how well the model generalizes beyond the training set.
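
The effect can be sketched with a deliberately extreme model: a 1-nearest-neighbour "memorizer" that simply stores the training examples. The noisy data generator and all function names below are hypothetical and only serve the illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def make_example():
    """Hypothetical data: label is 1 when x > 0.5, with 20% of labels flipped as noise."""
    x = random.random()
    y = int(x > 0.5)
    if random.random() < 0.2:
        y = 1 - y
    return x, y


train = [make_example() for _ in range(50)]
test = [make_example() for _ in range(50)]


def predict(x, memory):
    """1-nearest-neighbour: return the label of the closest memorized example."""
    nearest = min(memory, key=lambda example: abs(example[0] - x))
    return nearest[1]


def accuracy(data, memory):
    """Fraction of examples in data whose label is predicted correctly."""
    correct = sum(predict(x, memory) == y for x, y in data)
    return correct / len(data)


train_accuracy = accuracy(train, train)  # evaluated on data the model memorized
test_accuracy = accuracy(test, train)    # evaluated on unseen data
print(train_accuracy, test_accuracy)
```

Because every training point is its own nearest neighbour, the training accuracy is perfect even though the labels contain noise. The accuracy on the unseen test set is the more realistic estimate, and with noisy labels like these it is typically noticeably lower.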

 

 

 

 

 

* Bias-Variance Tradeoff 

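: the balance between a model that is too simple to capture the pattern (high bias, underfitting) and a model that fits the training data well but makes poor predictions on new data (high variance, overfitting).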

 

 

==> How do we decide which data goes into the training set and which into the testing set?
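
One common answer is a random split: shuffle the examples and hold out a fixed fraction for testing. The sketch below assumes an 80/20 split and a hypothetical helper named `train_test_split`; the ratio and the seed are illustrative choices, not rules.

```python
import random


def train_test_split(examples, test_ratio=0.2, seed=42):
    """Shuffle a copy of the examples, then hold out the last part as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_ratio)
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]


# Hypothetical dataset of 10 (x, y) pairs.
dataset = [(i, 2 * i) for i in range(10)]
train_set, test_set = train_test_split(dataset)
print(len(train_set), len(test_set))
```

Each example ends up in exactly one of the two sets, so the test set stays unseen during training. In practice a library routine such as scikit-learn's `train_test_split` is often used for the same purpose.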

 

 

[ Supervised Learning  ]

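: learns from being given the "right answers" (labeled examples)

=> major types : Classification | Regression 

- Classification : Decision Tree, Random Forest, Logistic Regression

- Regression : Linear Regression

- Decision Tree and Random Forest can also be used for regression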
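
As a small supervised learning example, the sketch below fits a linear regression with one input feature using the closed-form least-squares solution. The data points and the helper name `fit_line` are hypothetical; the targets are the "right answers" the model learns from.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b for a single input feature."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance
    b = mean_y - w * mean_x
    return w, b


# Hypothetical labeled data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
y_hat = w * 5.0 + b   # prediction for a new input x = 5
print(w, b, y_hat)
```

Since the toy targets lie exactly on a line, the fit recovers the slope 2 and intercept 1. Real data is noisy, and the fitted line would only approximate it.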

 

 

 

[ Unsupervised Learning ] 

: the data comes only with inputs x, not output labels y, so the algorithm has to find the structure in the data on its own.

=> major types : Clustering | Dimensionality Reduction | Anomaly Detection

- Clustering (grouping) : K-Means Clustering

- Dimensionality Reduction : PCA

- Anomaly Detection
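
To illustrate how an algorithm can find structure without labels, here is a minimal sketch of K-Means on one-dimensional points. The data, the starting centroids, and the function name `k_means_1d` are assumptions made for the example.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster is empty
                centroids[i] = sum(cluster) / len(cluster)
    # clusters holds the assignment made in the final iteration
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No labels y are used anywhere: the two groups emerge only from the distances between the inputs. The result depends on the starting centroids, which is why practical implementations usually try several initializations.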

 

 

 

 

 
