All Contents (143)
Code&Data Insights
[ R ] R : a programming language frequently used for statistical analysis, visualization, and other data analysis * R is case-sensitive! Common features - Open-source - Data stored in data frames - Formulas and functions readily available - Community for code development and support Unique advantages - Data manipulation, data visualization, and statistics packages - "Scalpel" approach to data:..
[ The 4 Phases of Analysis ] 1) Organize data 2) Format and adjust data 3) Get input from others 4) Transform data ( make calculations based on the data, find relationships ) [ Organize data ] - Sorting : Sort Sheet in spreadsheet : ORDER BY in SQL - Filtering : WHERE in SQL [ Data Formatting ] - From one type to another : CAST in SQL : CONVERT in spreadsheet [ Data validation ] Data validatio..
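As a rough sketch of the organize and format steps named above (sorting, filtering, converting a type), here is a plain-Python analogue of ORDER BY, WHERE, and CAST. The `rows` records and their field names are made up for illustration and do not come from the post.

```python
# Hypothetical records; field names and values are illustrative only.
rows = [
    {"name": "b", "price": "12.5"},
    {"name": "a", "price": "3.0"},
    {"name": "c", "price": "7.25"},
]

# Format: convert price from text to a number (comparable to CAST in SQL).
typed = [{**r, "price": float(r["price"])} for r in rows]

# Organize: sort by price, ascending (comparable to ORDER BY price).
sorted_rows = sorted(typed, key=lambda r: r["price"])

# Organize: keep only rows with price above 5 (comparable to WHERE price > 5).
filtered = [r for r in sorted_rows if r["price"] > 5]

print([r["name"] for r in sorted_rows])
print([r["name"] for r in filtered])
```

Converting the type first matters here: sorting the original strings would compare them character by character rather than numerically.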
[ Data Integrity ] Data Integrity - Data integrity is the accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle. - Alignment to business objective + newly discovered variables + constraints = accurate conclusion [ Types of insufficient data ] - Data from only one source - Data that keeps updating - Outdated data - Geographically-limited data [ Minimum sampl..
[ Data Collection ] How is data collected? - Interviews - Observations - Forms - Questionnaires - Surveys - Cookies (personal interests, habits) [ Data Formats ] Discrete data - data that is counted and has a limited number of values (ex) room maximum capacity Continuous data - data that is measured and can have almost any numeric value (ex) temperature Nominal data - a type of qualitative data th..
[ Common Problem Types ] 1. Making predictions 2. Categorizing things - categorized by specific keyword or score 3. Spotting something unusual 4. Identifying themes - Grouping categorized info into broader concepts 5. Discovering connections 6. Finding patterns - using historical data to understand what happened in the past and is therefore likely to happen again [ SMART questions ] => kinds of q..
[ The six phases of data analysis ] Ask : Business Challenge/Objective/Question Prepare : Data generation, collection, storage, and data management Process : Data cleaning/data integrity - what type of data do we have? is any data missing or collected incorrectly? Analyze : Data exploration, visualization, and analysis => should be unbiased! Look for patterns Share : Communicating and interpreting resul..
[ Cross Validation ] : Cross Validation allows us to compare different machine learning methods and get a sense of how well they will work in practice. - K-Fold ( K can be chosen arbitrarily! ) [ Confusion Matrix ] Confusion Matrix : To decide which method should be used with the given data sets, we need to summarize how each method performed on the testing data. => one way to do this is by creati..
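A minimal sketch of the two ideas above, written without any ML library: splitting sample indices into K folds, and tallying how predictions compare with actual labels. The helper names `k_fold_indices` and `confusion_counts` are hypothetical. The sketch assumes binary labels (1 = positive, 0 = negative) and does not shuffle the data, which a real workflow usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for each of k folds, in order."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def confusion_counts(actual, predicted):
    """Count the four cells of a binary confusion matrix."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # positive, predicted positive
        elif a == 0 and p == 1:
            counts["FP"] += 1   # negative, predicted positive
        elif a == 1 and p == 0:
            counts["FN"] += 1   # positive, predicted negative
        else:
            counts["TN"] += 1   # negative, predicted negative
    return counts


folds = list(k_fold_indices(10, 3))
cells = confusion_counts([1, 0, 1, 0, 1], [1, 1, 0, 0, 1])
print(len(folds), cells)
```

Each sample lands in exactly one test fold, so every method being compared is evaluated on all of the data once.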
[ Terminology in Machine Learning ] - Training set : data used to train the model => input features + target variables ( x + y ) - x : input variable | (input) features - y : output variable | target variable - m : number of training examples - (x,y) : single training example - ŷ (y hat) : predicted output The data type in ML : Machine learning builds predictive models based on your data and lea..
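To make the notation above concrete, here is a small sketch that maps each term onto a variable. The data values, and the choice of a straight-line model with parameters `w` and `b`, are assumptions made purely for illustration.

```python
# Training set: input features x paired with target values y (made-up numbers).
x_train = [1.0, 2.0, 3.0]
y_train = [300.0, 500.0, 700.0]

# m: number of training examples.
m = len(x_train)

# (x, y): a single training example, here the first one.
x_i, y_i = x_train[0], y_train[0]

# Hypothetical model parameters for a straight-line model (an assumption).
w, b = 200.0, 100.0

# y hat: the predicted output for each input.
y_hat = [w * x + b for x in x_train]

print(m, (x_i, y_i), y_hat)
```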