
[Book] Hands On Machine Learning with Scikit Learn and TensorFlow - Chapter 2:End-to-End Machine Learning Project


Main Steps in Machine Learning Project

1. Look at the big picture

2. Get the data

3. Discover and visualize the data to gain insights

4. Prepare the data for ML algorithms (data cleaning, preprocessing)

5. Select a model and train it 

6. Fine-tune the model

7. Present your solution 

8. Launch, monitor, and maintain your system 

 

 

1. Look at the Big Picture

- Frame the Problem (business objective)

: framing the problem determines what algorithms you will select, what performance measure you will use to evaluate your model, and how much effort you should spend tweaking it.

 

- Pipelines

: sequence of data processing components 

=> Each component operates quite independently. 

=> The interface between components is simply the data store.
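As a concrete analogue, scikit-learn's Pipeline class chains processing components in exactly this way; a minimal sketch (the step choices here are illustrative, not the book's exact pipeline):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Each step is an independent component; data flows from one step to the next.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values
    ("scaler", StandardScaler()),                   # standardize features
])

# X_num would be an array/DataFrame of numerical attributes:
# X_prepared = num_pipeline.fit_transform(X_num)
```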

 

- Select a Performance Measure

=> regression : RMSE (root mean squared error)

=> classification : cross-entropy (CE)
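For regression, RMSE can be computed directly; a tiny generic sketch (the toy arrays below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # toy targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # toy predictions

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)
```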

 

 

 

2. Get the Data 

- Take a Quick Look at the Data Structure

value_counts() : for a categorical attribute, to see how many districts belong to each category

describe() : summary of the numerical attributes (count, mean, std, min, max, quartiles)
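Assuming the chapter's housing data has been loaded into a DataFrame called `housing` (the CSV path below is an assumption), the quick-look calls are roughly:

```python
import pandas as pd

housing = pd.read_csv("housing.csv")       # path is an assumption

housing.info()                             # column types and non-null counts
housing["ocean_proximity"].value_counts()  # districts per category
housing.describe()                         # count, mean, std, min, quartiles, max
```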

 

- Create a Test Set

: split off a test set and set it aside, separate from the training set

Data Snooping Bias : If the model is selected or adjusted based on specific features of the test data, it may become overfit to those features and fail to generalize well to new, unseen data.

 

stratified sampling - avoid sampling bias, better than random sampling 

(from sklearn.model_selection import StratifiedShuffleSplit)
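A sketch of the stratified split, assuming an income-category column `income_cat` has been added to `housing` as in the book's example:

```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_idx]  # same income-cat proportions
    strat_test_set = housing.loc[test_idx]    # as the full dataset
```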

 

 

3. Discover and visualize the data to gain insights

- Visualizing Geographical Data

=> scatterplot (latitude and longitude)
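A minimal version of that plot, assuming the `housing` DataFrame from above (a low alpha makes high-density areas stand out):

```python
import matplotlib.pyplot as plt

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
plt.show()
```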

 

- Looking for Correlations

=> compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes 

using the corr() method (see the example below)

=> The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation. 

=> Close to 0 : no linear correlation
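For example, the correlations of every numerical attribute with the target (`median_house_value` in the housing data) can be listed like this:

```python
# Pearson's r between every pair of numerical attributes
corr_matrix = housing.corr(numeric_only=True)  # numeric_only needs newer pandas;
                                               # otherwise drop the categorical column first

# Attributes most (positively) correlated with the target
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```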

 

- Experimenting with Attribute Combinations
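The book's housing example derives new ratio attributes from existing ones, roughly like this:

```python
# Combine raw counts into more informative per-household / per-room ratios
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
```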

 

* What is Data Visualization? 

- Using visual tools to discern patterns, features, relationships, and trends in data.

- Raw data need to be encoded into a visual representation (data viz) for human understanding

- Build a story from the data (cf. the 'grammar of graphics')

 

 

Most common problem in data visualization

- Omitting baselines (ignoring the existing solution)

=> Baseline : the reference point we want to beat

=> Comparing my model to existing solutions is good practice!

 

* What is EDA? 

- EDA (Exploratory Data Analysis) : exploring and summarizing a dataset, mostly with visual methods, to discover patterns and generate new hypotheses

 

- Visual EDA 

=> Variation

: bar chart / histogram, density 

 

=> Association - Covariance| Correlation

: scatter plot / box plot / facets

 

box plot

- compares a continuous variable across categories (continuous vs. categorical)

 

facets

- split one chart into small multiples by a categorical variable (continuous vs. categorical)

 

In EDA, we generate new insights! => exploratory (hypothesis-generating), in contrast to confirmatory analysis (hypothesis-testing)

 

 

4. Prepare the data for ML algorithms (data cleaning, preprocessing)

* Data Cleaning 

- drop uncorrelated columns, or fill missing values with the median

- Handling Text and Categorical Attributes (Label Encoding, One-Hot Encoding)

=> LabelBinarizer 
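A hedged sketch of these cleaning/encoding steps with current scikit-learn classes (SimpleImputer and OneHotEncoder; the book's first edition uses the older Imputer and LabelBinarizer):

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Fill missing numerical values with each column's median
housing_num = housing.drop("ocean_proximity", axis=1)
imputer = SimpleImputer(strategy="median")
housing_num_filled = imputer.fit_transform(housing_num)

# One-hot encode the categorical attribute
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing[["ocean_proximity"]])  # sparse matrix
```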

 

- Custom Transformers

- Feature Scaling

=> MinMaxScaler / StandardScaler

 

- Transformation Pipelines

=> FeatureUnion
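In the first edition, the numerical and categorical pipelines are merged with FeatureUnion plus a custom selector transformer; in current scikit-learn the same result is usually achieved with ColumnTransformer, roughly:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_attribs = ["longitude", "latitude", "total_rooms"]  # illustrative subset
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),    # numerical branch
    ("cat", OneHotEncoder(), cat_attribs), # categorical branch
])

# housing_prepared = full_pipeline.fit_transform(housing)
```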

 

 

 

5. Select and Train a Model 

- Training and Evaluating on the Training Set

- Better Evaluation Using Cross-Validation

=> K-fold cross-validation
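A sketch of K-fold cross-validation, following the book's variable names (`housing_prepared`, `housing_labels`) with a placeholder model:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=42)

# scikit-learn expects a utility score (higher is better), hence the
# negative MSE, which we convert back to an RMSE per fold
scores = cross_val_score(model, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print(rmse_scores.mean(), rmse_scores.std())
```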

 

- Ensemble Learning - Bagging / Boosting 

 

- Check the performance on the training/validation set

=> if overfitting, regularize the model, simplify it, or feed it more data

 

- Confusion Matrix

: counts of actual vs. predicted classes, showing which classes get confused with which

 

- f1 score : harmonic mean of precision and recall

 

==> with this measure (the classification report), we can see per-class metrics and the class distribution alongside accuracy (sketched below)

*useful for spotting data imbalance*
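These metrics belong to classification (not the housing regression task); a generic sketch with toy labels:

```python
from sklearn.metrics import confusion_matrix, f1_score, classification_report

y_true = [0, 0, 1, 1, 1, 2, 2]   # toy ground-truth labels
y_pred = [0, 1, 1, 1, 0, 2, 2]   # toy predictions

print(confusion_matrix(y_true, y_pred))       # rows: actual, columns: predicted
print(f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred))  # per-class precision / recall / f1
```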

 

==> How to address data imbalance?

- feed more data

- remove some samples from the majority class

 

- create synthetic data / resample

=> random oversampling - duplicate samples from the minority class (see the sketch below)

=> random undersampling - drop samples from the majority class

BUT, problems can arise, such as loss of information (undersampling) or overfitting (oversampling)
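A hedged sketch of random oversampling with sklearn.utils.resample (the DataFrame `df` and its binary `label` column are assumptions; imbalanced-learn's RandomOverSampler is a common alternative):

```python
import pandas as pd
from sklearn.utils import resample

# df is assumed to have an imbalanced binary "label" column
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Randomly duplicate minority samples until both classes have the same size
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_upsampled])
```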

 

 

 

6. Fine-Tune the Model

- Grid Search / Randomized Search

: find the best hyperparameters 
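A sketch of grid search over a RandomForestRegressor, roughly in the spirit of the book's example (the parameter grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    {"n_estimators": [10, 30, 100], "max_features": [4, 6, 8]},  # illustrative values
]

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=5, scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)

print(grid_search.best_params_)
best_model = grid_search.best_estimator_
```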

 

 
