Code&Data Insights
[Google Data Analytics Professional Certificate] Prepare Data for Exploration
paka_corn 2023. 5. 18. 09:42
[ Data Collection ]
How is data collected?
- Interviews
- Observations
- Forms
- Questionnaires
- Surveys
- Cookies (personal interests, habits)
[ Data Formats ]
Discrete data
- data that is counted and has a limited number of values
(ex) room maximum capacity
Continuous data
- data that is measured and can have almost any numeric value
(ex) temperature
Nominal data
- a type of qualitative data that is categorized without a set order
(ex) First time customer, returning customer, regular customer
Ordinal data
- a type of qualitative data with a set order or scale
(ex) star rating, income level
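The difference between nominal and ordinal data shows up as soon as you try to sort it. The following is a minimal sketch, not part of the course material: the customer types and income levels are made-up sample values, and `INCOME_ORDER` is a hypothetical name for the explicit ordering that ordinal data needs.

```python
from collections import Counter

# Nominal data: categories with no set order, so counting is meaningful
# but ranking is not.
customer_types = ["returning", "first-time", "regular", "returning"]
type_counts = Counter(customer_types)

# Ordinal data: categories with a set order. The order is spelled out
# explicitly (hypothetical levels), because alphabetical sorting would
# put "high" before "low".
INCOME_ORDER = ["low", "medium", "high"]
incomes = ["high", "low", "medium", "low"]
sorted_incomes = sorted(incomes, key=INCOME_ORDER.index)

print(type_counts.most_common(1))  # [('returning', 2)]
print(sorted_incomes)              # ['low', 'low', 'medium', 'high']
```

Counting works for both kinds of data, but only the ordinal values can be ranked in a meaningful way.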
Structured data
- data organized in a certain format such as rows and columns => easily searchable
- sources of structured data: spreadsheets, relational databases
(ex) Tax returns
Unstructured data
- data that is not organized in an easily identifiable format such as rows and columns
(ex) audio, video files, emails, social media, text messages, images
Primary data
- collected by a researcher from first-hand sources
Secondary data
- gathered by other people or from other research
[ Data Transformation ]
- Adding, copying, or replicating data
- Deleting fields or records
- Standardizing the names of variables
- Renaming, moving, or combining(merging) columns in a database
- Joining one set of data with another
- Saving a file in a different format
(ex) saving a spreadsheet as a CSV file
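Several of the transformations listed above can be sketched in a few lines with Python's standard library. This is only an illustration under assumptions: the `customers` and `orders` records, the column names, and the `standardize` helper are all hypothetical.

```python
import csv
import io

# Hypothetical raw records with inconsistent column names.
customers = [
    {"Customer ID": "1", "Full Name": "Ana", "Fax": ""},
    {"Customer ID": "2", "Full Name": "Ben", "Fax": ""},
]
orders = [
    {"customer_id": "1", "total": "30.00"},
    {"customer_id": "2", "total": "12.50"},
]


def standardize(name):
    """Standardize a variable name: lowercase words joined by underscores."""
    return name.strip().lower().replace(" ", "_")


# Standardize the names of variables and delete the unused "fax" field.
cleaned = [
    {standardize(k): v for k, v in row.items() if standardize(k) != "fax"}
    for row in customers
]

# Join one set of data with another, matching on customer_id.
totals = {o["customer_id"]: o["total"] for o in orders}
joined = [{**row, "total": totals.get(row["customer_id"], "")} for row in cleaned]

# Save in a different format: write the joined rows out as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customer_id", "full_name", "total"])
writer.writeheader()
writer.writerows(joined)
csv_text = buffer.getvalue()
print(csv_text)
```

In a spreadsheet or SQL tool the same steps would be renaming columns, deleting a column, a lookup or JOIN, and an export to CSV.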
[ Data Ethics ]
- Ownership
- Transaction transparency
- Consent
- Currency
- Privacy
- Openness
[ Metadata ]
Metadata
- Metadata is data that describes or provides information about other data. It describes the characteristics, attributes, and context of the data, helping to organize, manage, and understand it effectively. For example, in the case of a photograph, metadata could include details such as the date of capture, camera model, and location. Metadata plays a crucial role in understanding and utilizing data, improving data quality, and enhancing efficiency in data management and retrieval processes.
Metadata management
=> Data governance : A process to ensure the formal management of a company's data assets
Metadata vs Data
Data
- Data refers to the raw form of information, such as numbers, text, images, etc., that is collected or generated. For example, a photograph file itself is considered data.
Metadata
- Metadata provides additional information about the data, describing and managing it. It includes details about the characteristics, attributes, quality, creation time, authorship, location, and more. For instance, the metadata of a photograph can include the capture date, camera model, and location.
In a company context, both metadata and data are utilized. Data contains the actual information being used, while metadata is used to describe and organize the data. Companies rely on both data and metadata to analyze and manage information. Metadata helps in understanding the quality, accuracy, source, and other aspects of the data, enabling effective utilization of the data resources.
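To make the data versus metadata distinction concrete, here is a small sketch that writes a file and then reads information about it. The file name `photo.jpg` and its contents are placeholders, and only file-system metadata is shown. Photo-specific details such as camera model or location live in the image's EXIF tags, which the standard library does not read.

```python
import os
import tempfile
from datetime import datetime, timezone

# Data: the actual content (a stand-in for a photograph's bytes).
content = b"pretend these bytes are a photograph"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "photo.jpg")
    with open(path, "wb") as f:
        f.write(content)

    # Metadata: information about the file, not its content.
    info = os.stat(path)
    metadata = {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }

print(metadata)
```

The bytes in `content` are the data, and the dictionary describes that data without containing any of it.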
[ Best Practices when organizing Data ]
- Naming conventions
- Foldering
- Archiving older files
- Align your naming and storage practices with your team
- Develop metadata practices
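One way to keep a naming convention consistent across a team is to generate file names instead of typing them by hand. The sketch below assumes a convention of project name, creation date as YYYYMMDD, and a two-digit version number. Both the convention and the `build_file_name` helper are illustrative choices, not something prescribed by the course.

```python
import re
from datetime import date


def build_file_name(project, created, version, extension="csv"):
    """Build a name like 'sales_report_20230518_v02.csv' (hypothetical convention)."""
    # Lowercase the project name and replace anything that is not a letter
    # or digit with a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", project.lower()).strip("_")
    return f"{slug}_{created:%Y%m%d}_v{version:02d}.{extension}"


name = build_file_name("Sales Report", date(2023, 5, 18), 2)
print(name)  # sales_report_20230518_v02.csv
```

Putting the date in year-month-day order means the files sort chronologically when listed by name.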
[ Public Datasets ]
Open data helps create a lot of public datasets that you can access to make data-driven decisions. Here are some resources you can use to start searching for public datasets on your own:
- The Google Cloud Public Datasets allow data analysts access to high-demand public datasets, and make it easy to uncover insights in the cloud.
- The Dataset Search can help you find available datasets online with keyword searches.
- Kaggle has an Open Data search function that can help you find datasets to practice with.
- Finally, BigQuery hosts 150+ public datasets you can access and use.
Public health datasets
- Global Health Observatory data: You can search for datasets from this page or explore featured data collections from the World Health Organization.
- The Cancer Imaging Archive (TCIA) dataset: Just like the earlier dataset, this data is hosted by the Google Cloud Public Datasets and can be uploaded to BigQuery.
- 1000 Genomes: This is another dataset from the Google Cloud Public resources that can be uploaded to BigQuery.
< Glossary >
Population : all possible data values in a certain dataset
First-party data: Data collected by an individual or group using their own resources
Second-party data: Data collected by a group directly from its audience and then sold
Third-party data: Data provided from outside sources who didn’t collect it directly
Data bias : a type of error that systematically skews results in a certain direction
Privacy : Preserving a data subject's information and activity any time a data transaction occurs
Metadata : Data about data (metadata creates a single source of truth by keeping things consistent and uniform)