Code&Data Insights

[Google Data Analytics Professional Certificate] Process Data from Dirty to Clean

paka_corn 2023. 5. 19. 05:49

[ Data Integrity ]

Data Integrity - the accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle

 

- Alignment to business objective + newly discovered variables + constraints = accurate conclusion

 

 

[ Types of insufficient data ]

- Data from only one source

- Data that keeps updating

- Outdated data

- Geographically-limited data

 

 

 

 

For questions such as consumer preferences, a smaller sample size at a lower cost can provide good-enough results.

=> Sample size calculator - lets you enter a desired confidence level and margin of error for a given population size. It then calculates the sample size needed to statistically achieve those results.
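
What such a calculator does internally can be sketched as below. This is only a minimal sketch and not the course's own tool: it assumes the common proportion-based formula with a finite population correction, and the function name `sample_size` and the default `p = 0.5` (the most conservative guess) are my own choices.

```python
from math import ceil
from statistics import NormalDist


def sample_size(population_size, confidence_level, margin_of_error, p=0.5):
    """Estimate the sample size needed for a survey of a finite population.

    Assumption: the standard proportion formula n0 = z^2 * p * (1 - p) / e^2,
    followed by a finite population correction.
    """
    # z-score for a two-sided confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    # sample size for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # shrink it for a finite population
    n = n0 / (1 + (n0 - 1) / population_size)
    return ceil(n)


# Hypothetical inputs: population of 500, 95% confidence, 5% margin of error
needed = sample_size(500, 0.95, 0.05)
print(needed)
```

A higher confidence level or a smaller margin of error both push the required sample size up, which is why a looser target can be cheaper to survey.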

 

 

Margin of error : the maximum amount that the sample results are expected to differ from those of the actual population.

 

 

 

 

=> To calculate the margin of error, we need the population size, the sample size, and the confidence level.

 

=> Margin of error is used to determine how close your sample’s result is to what the result would likely have been if you could have surveyed or tested the entire population.
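
The three inputs listed above can be combined roughly as follows. This is a sketch under assumptions, not the exact formula of any particular calculator: it uses the usual proportion-based margin of error with a finite population correction, and `margin_of_error` and `p = 0.5` are names and defaults I picked for illustration.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(population_size, sample_size, confidence_level, p=0.5):
    """Approximate margin of error for a surveyed proportion.

    Assumption: e = z * sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1)).
    """
    # z-score for a two-sided confidence level
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    # standard error of a proportion for a sample of size n
    standard_error = sqrt(p * (1 - p) / sample_size)
    # correction applied when the sample is a large share of the population
    correction = sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * correction


# Hypothetical inputs: 218 responses from a population of 500, 95% confidence
moe = margin_of_error(500, 218, 0.95)
print(f"{moe:.2%}")
```

Surveying more people shrinks the margin of error, and surveying the whole population brings it to zero.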

 

 

[ Data Cleaning Tools ]

- Data validation

- Conditional formatting

- COUNTIF

- Sorting

- Filtering 

 

 

 

[ VLOOKUP searches ] 

VLOOKUP 

 It is used to search for a value in the leftmost column of a table or range and retrieve a corresponding value from a specified column in the same row.

 

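=> formula : VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])

 

- range_lookup: 0 (FALSE) returns exact matches only; 1 (TRUE) returns an approximate match and requires the leftmost column to be sorted in ascending order.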

- col_index_num: This is the column number within the table array from which you want to retrieve the result. The leftmost column in the table array is considered column 1.
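
To make the lookup logic concrete, here is a small Python sketch that mimics an exact-match VLOOKUP (range_lookup = 0). It is an illustration only, not how a spreadsheet implements it; the `vlookup` function and the `products` table are hypothetical.

```python
def vlookup(lookup_value, table_array, col_index_num):
    """Mimic an exact-match VLOOKUP over a list of rows."""
    if col_index_num < 1:
        raise ValueError("col_index_num starts at 1")
    for row in table_array:
        # only the leftmost column is searched
        if row[0] == lookup_value:
            # col_index_num is 1-based, Python indexes are 0-based
            return row[col_index_num - 1]
    # a spreadsheet would show #N/A when nothing matches
    return None


# Hypothetical table: product id, name, price
products = [
    ("P-001", "Keyboard", 25.0),
    ("P-002", "Monitor", 40.0),
    ("P-003", "Mouse", 10.0),
]

print(vlookup("P-002", products, 3))  # price column of the matching row
print(vlookup("P-999", products, 2))  # no match
```

Because only the leftmost column is searched, the key you look up has to sit in column 1 of the range you pass in.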

 

 

 

[ SQL - Advanced Data Cleaning Functions ] 

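- SUBSTR : extracts part of a string, given a start position and a length

- LENGTH : returns the number of characters in a string

- CAST : converts a value from one data type to another

- CONCAT : joins strings together to create a new text string, which can be used as a unique key

- COALESCE : returns the first non-null value in a list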
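
The functions above can be tried out in one query. The sketch below runs it through Python's built-in sqlite3 module, which is my substitution for whatever SQL environment you use; the `customers` table and its rows are made up. Since `CONCAT` is not available in older SQLite versions, the `||` operator is used for concatenation instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(customer_id TEXT, country TEXT, zip TEXT, total TEXT, nickname TEXT, name TEXT)"
)
# Hypothetical rows: totals stored as text, one missing nickname
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("C1", "USA", "94103", "19.99", None, "Ana Lee"),
        ("C2", "CAN", "1234", "5", "Sam", "Samuel Roy"),
    ],
)

rows = conn.execute(
    """
    SELECT
        SUBSTR(country, 1, 2)        AS country_code,  -- first two characters
        LENGTH(zip)                  AS zip_length,    -- spot codes with the wrong length
        CAST(total AS REAL)          AS total_num,     -- text to number
        customer_id || '-' || country AS unique_key,   -- concatenation (CONCAT elsewhere)
        COALESCE(nickname, name)     AS display_name   -- first non-null value
    FROM customers
    ORDER BY customer_id
    """
).fetchall()

for row in rows:
    print(row)

conn.close()
```

In the second row `LENGTH(zip)` comes back as 4, which is the kind of value a length check is meant to flag.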

 

 

 

 

 

[ Key Terms ]

 

 

 

 

 

 

 

Confidence level : how confident you are that the survey results accurately reflect the greater population

Statistical significance : the determination of whether your result could be due to random chance or not. The greater the significance, the less likely the result is due to chance.

Statistical power : the probability of getting meaningful results from a test

Hypothesis testing : a way to see if a survey or experiment has meaningful results

A/B testing : the process of testing two variations of the same web page to determine which page is more successful at attracting user traffic and generating revenue

Data validation : a tool for checking the accuracy and quality of data before adding or importing it

Compatibility : how well two or more datasets are able to work together 

Verification : a process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable

Changelog : a file containing a chronologically ordered list of modifications made to a project.

Pivot table : a data summarization tool that is used in data processing
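
The hypothesis testing, A/B testing, and statistical significance entries above fit together in a short example. This is a sketch only: it assumes a two-proportion z-test as the test (one common choice, not something these notes prescribe), and the function name and visitor counts are invented.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # pooled rate under the hypothesis that both pages perform the same
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / standard_error
    # probability of a difference at least this large arising by chance
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical A/B test: page A converts 100 of 1000, page B converts 150 of 1000
p_value = two_proportion_p_value(100, 1000, 150, 1000)
print(p_value < 0.05)  # True here means the difference is unlikely to be due to chance
```

A small p-value suggests the difference between the two pages is statistically significant; with identical results on both pages the p-value is 1.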

 

 

 
