Code&Data Insights
[Mathematics of Data Management] study notes | basic concepts related to data analytics
paka_corn 2023. 6. 13. 03:14
[ Type of Data ]
[ Sampling Methods ]
1) Random Sampling : the sample is chosen without any pattern, so the selections are unrelated to one another
2) Simple Random Sampling : every selection is equally likely, for example drawing one name when each name has the same chance of being selected
3) Systematic Random Sampling : a more organized selection that follows a pattern, for example choosing every 10th name on a list after a random starting point
4) Stratified Random Sampling : the population is divided into groups (strata) based on gender, age, or any other characteristic, and a random sample is taken from each stratum
Stratify : to divide into layers (strata)
5) Cluster Random Sampling : the population is first organized into groups, then groups are randomly chosen and all the members of each chosen group are surveyed
6) Multi-stage Random Sampling : groups are randomly chosen from the population, subgroups are randomly chosen from these groups, and then individuals in these subgroups are randomly chosen to be surveyed
7) Destructive Sampling : situations where the sample is damaged or destroyed in order to extract information ( the sample is usually not human! )
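The first few methods above can be sketched in Python. This is only a rough illustration, not a standard implementation: the function names, the `seed` parameter, and the dictionary layout for strata and clusters are my own assumptions.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member has the same chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Pick a random start, then take every k-th member."""
    rng = random.Random(seed)
    k = len(population) // n      # sampling interval
    start = rng.randrange(k)      # random start inside the first interval
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, seed=None):
    """Sample the same fraction from every stratum (dict: name -> members)."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        size = round(len(members) * fraction)
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole groups, then survey everyone in them."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical population of 100 student IDs
students = [f"S{i:03d}" for i in range(100)]
print(systematic_sample(students, 10, seed=1))
```

Note the difference between stratified and cluster sampling: stratified sampling takes a few members from every group, while cluster sampling takes every member from a few groups.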
[ Types of Bias ]
1) Sampling Bias : the chosen sample does not accurately represent the population; this is the most common form of bias
(ex) people outside McDonald's are surveyed about their opinions of fast food
2) Non-Response Bias : data is not collected from all of the potential respondents, so those who do answer may not represent the whole group
(ex) people do not return mail-in surveys
3) Household Bias : occurs when a type of respondent is over- or under-represented because groups of different sizes are polled equally
(ex) one person per household is surveyed, regardless of how many people live in each household
4) Response Bias : occurs when the sampling method or the design of the study itself influences the answers
(ex) questions are poorly worded or the interviewer leads the respondent toward an answer
[ Central Tendency ]
Weighted Mean : when the data is organized in a frequency table, the mean can still be calculated by using the frequency as a weighting factor: multiply each value by its frequency, add the products, and divide by the total frequency
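As a quick check of the idea, here is a small Python sketch of the weighted mean, i.e. sum(value x frequency) / sum(frequency). The function name and the score table are made up for illustration.

```python
def weighted_mean(values, weights):
    """Mean of `values`, where each value counts `weights[i]` times."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("total weight must not be zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical frequency table: test score -> number of students
scores = [60, 70, 80, 90]
frequencies = [2, 5, 8, 5]
print(weighted_mean(scores, frequencies))  # 78.0
```

Here (60x2 + 70x5 + 80x8 + 90x5) / 20 = 1560 / 20 = 78, the same result as listing all 20 scores and averaging them.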
[ Measures of Spread ]
Variability : how far apart data points lie from each other and from the center of a distribution.
Along with measures of central tendency, measures of variability give you descriptive statistics that summarize your data. Variability is also referred to as spread, scatter or dispersion.
Interquartile Range (IQR) : the quartiles divide the ordered data into four equal groups; the IQR is the range of the middle 50% of the data, IQR = Q3 - Q1
https://www.youtube.com/watch?v=esskJJF8pCc
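A minimal Python sketch of the IQR, assuming the "median of each half" convention (Q1 is the median of the lower half, Q3 the median of the upper half, and the middle value is left out when the count is odd). Other tools such as Excel or `statistics.quantiles` use interpolation rules and may give slightly different quartiles. The function name and data are made up.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    if len(data) < 2:
        raise ValueError("need at least two data points")
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]       # values below the middle
    upper = ordered[-half:]      # values above the middle
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)        # 4 7.5 11
print("IQR =", q3 - q1)  # IQR = 7
```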
Standard Deviation : a valuable tool for measuring the spread of data ( How far? or How close? )
=> a measure that indicates how much the data scatter around the mean
(deviation : difference from the mean | variance : dispersion)
- deviation : the distance between a data point and the mean
- variance : a measure of spread, calculated by averaging the squared deviations
- standard deviation : the square root of the variance
** Calculating standard deviation using EXCEL : STDEV.S(first_cell:last_cell) for a sample (divides by n - 1), STDEV.P(first_cell:last_cell) for a whole population (divides by n)
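The steps deviation -> variance -> standard deviation can be written out in Python as follows. This is just a sketch with made-up function names and data. The `sample` flag is my own assumption: `sample=True` divides by n - 1, which is what Excel's STDEV.S does, while `sample=False` divides by n (population).

```python
import math

def deviations(data):
    """Distance of each data point from the mean."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]

def variance(data, sample=False):
    """Average of the squared deviations (n - 1 divisor if sample=True)."""
    squared = sum(d * d for d in deviations(data))
    divisor = len(data) - 1 if sample else len(data)
    return squared / divisor

def standard_deviation(data, sample=False):
    """Square root of the variance."""
    return math.sqrt(variance(data, sample))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(data))                         # 4.0
print(standard_deviation(data))               # 2.0
print(standard_deviation(data, sample=True))  # slightly larger than 2.0
```

The sample version is always a little larger because it divides by a smaller number.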
[ Normal Distribution ]
Normal Distribution : a distribution whose histogram has a symmetrical 'bell' shape centered on the mean
Mean == Median == Mode
[ Z Scores ]
Z Scores : a statistical measurement that describes a value's relationship to the mean of a group of data
=> calculated based on the number of standard deviations a data point is away from the mean
=> Positive value : above the mean | Negative value : below the mean
** Calculating a z-score using EXCEL : STANDARDIZE(x, mean, standard_dev)
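The same calculation in Python, z = (x - mean) / standard deviation. This is a small sketch that mirrors the argument order of STANDARDIZE. The function name and the test scores are made up.

```python
def z_score(x, mean, standard_dev):
    """Number of standard deviations that x lies from the mean."""
    if standard_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / standard_dev

# Hypothetical test with mean 70 and standard deviation 10
print(z_score(85, 70, 10))  # 1.5  -> above the mean
print(z_score(55, 70, 10))  # -1.5 -> below the mean
print(z_score(70, 70, 10))  # 0.0  -> exactly at the mean
```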
[ Mathematical Indices ]
Indices : indices are valuable because they express something as a single value that we can use to make comparisons
=> index values do not necessarily represent an actual measurement or quantity, but they are measured against a starting or ending reference point
(ex) BMI, SLG (Slugging Percentage), Consumer Price Index
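Two of these indices are easy to compute by hand, so here is a short Python sketch using the usual definitions: BMI = weight in kg / (height in m)^2, and SLG = total bases / at bats. The function names and the sample numbers are made up for illustration.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight divided by height squared."""
    return weight_kg / height_m ** 2

def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """SLG: total bases divided by at bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

print(round(bmi(70, 1.75), 1))                    # 22.9
print(slugging_percentage(100, 30, 5, 20, 500))   # 0.51
```

Neither number is a direct measurement. Each one only becomes meaningful when compared with other values of the same index.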