The Data Sources You Need to Know About as a Data Enthusiast
Raise your hand if you’ve ever cited Google and Kaggle as examples of data sources during data debates (if it is a thing 💩). Let’s see if you can still defend your stance after reading this post. A data source suggests an origin, possibly in physical or digital form.
Here, we will discuss the physical sources of data and the nature of data as it transitions into digital form, then delve into digital data sources and the pros and cons of using them.
A data source is the point where data is first created or first digitized. Ultimately, it counts as a data source so long as a process or system accesses and uses it. The data could come through physical means, like surveys or interviews, or through digital means, like readings from sensors.
Physical Data Sources
Think of how data was traditionally collected for research before online forms were introduced: researchers printed sheets of questions and sought out subjects within a sample space to respond to the surveys. They then summarised the responses and converted them into digital form for visualization and analysis. The physical data sources are:
- Primary: produced by someone who witnessed or experienced an event firsthand. Examples include surveys, interviews, diaries, and memoirs. These sources act as first-hand evidence.
- Secondary: derived by a third party from primary data sources. Secondary sources analyze and interpret the primary sources. Examples include dictionaries and biographies.
- Tertiary: draws on data from both the secondary and primary sources.
The Nature of Data While Transitioning
The digitization process from physical data sources presents the data in either structured, unstructured, or semi-structured formats.
- Structured: the data is organized in a tabular format with rows and columns, and there is a relationship between the rows and columns. CSV files, Excel sheets, and tables in database management systems are structured.
- Unstructured: has no pre-defined structure and no data model. The data is irregular and ambiguous. Social media sentiments and open-ended survey responses are unstructured. A commonly cited estimate puts about 80% of the world's data in this category. Even though unstructured data has no fixed form, it contains a lot of valuable information and insights on data subjects.
- Semi-structured: sits in the middle, between structured and unstructured data. It lacks the complete structure needed to count as structured, yet is structured enough not to count as unstructured ( :D ). E-mails are semi-structured.
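To make the three formats concrete, here is a minimal sketch using only Python's standard library. The survey rows, the free-text answer, and the e-mail are all made-up sample data, and the variable names are just for illustration.

```python
import csv
import io
from email import message_from_string

# Structured: every row follows the same named columns.
csv_text = "respondent_id,age,city\n1,34,Accra\n2,27,Kumasi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Unstructured: free text with no data model. Anything we want from it
# (here, a simple word count) has to be derived by processing the text.
open_ended_answer = "I liked the service but the queue was far too long."
word_count = len(open_ended_answer.split())

# Semi-structured: an e-mail has labelled header fields (structured part)
# followed by a free-text body (unstructured part).
raw_email = (
    "From: ama@example.com\n"
    "To: kofi@example.com\n"
    "Subject: Survey feedback\n"
    "\n"
    "Thanks for the survey. The questions were clear.\n"
)
message = message_from_string(raw_email)
email_record = {
    "from": message["From"],
    "subject": message["Subject"],
    "body": message.get_payload().strip(),
}

print(rows[0]["city"], word_count, email_record["subject"])
```

Notice that the CSV rows can be addressed by column name straight away, while the e-mail only gives you that convenience for its headers.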
Digital Data Sources
Computer Files
Computer files are a common source of data for data science projects. These files can be in various formats, such as CSV, Excel, JSON, XML, text files, and more.
- Advantages: Computer files are easily accessible and can be shared across different platforms. They are lightweight and can be processed using standard programming languages and libraries.
- Challenges: Ensuring data consistency and quality can be a challenge when dealing with large volumes of files. Excel is very popular, but its tendency to autoformat data, particularly date columns, introduces changes that are much harder to track when the data is large. Additionally, file-based data sources might not support real-time data updates.
Database Sources
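One way to sidestep silent autoformatting is to read the file as plain text and convert each column yourself. The sketch below assumes a made-up CSV with `order_id`, `order_date`, and `code` columns; `parse_row` is a hypothetical helper, not part of any library.

```python
import csv
import io
from datetime import datetime

# Made-up sample file: IDs with leading zeros and codes that a
# spreadsheet might be tempted to reinterpret as dates.
csv_text = (
    "order_id,order_date,code\n"
    "001,2023-03-04,1-2\n"
    "002,2023-12-11,MARCH1\n"
)


def parse_row(row):
    """Convert one CSV row, choosing the type of each column explicitly."""
    return {
        # Keep as text so the leading zeros survive.
        "order_id": row["order_id"],
        # Parse with a fixed format; a malformed date raises ValueError
        # instead of being silently guessed.
        "order_date": datetime.strptime(row["order_date"], "%Y-%m-%d").date(),
        # Keep as text so it is never reinterpreted as a date.
        "code": row["code"],
    }


# csv.DictReader yields every value as a string, so nothing is converted
# until parse_row does it on purpose.
records = [parse_row(row) for row in csv.DictReader(io.StringIO(csv_text))]

for record in records:
    print(record)
```

Because every conversion is written down, a change in how a column is interpreted shows up in the code rather than hiding inside the file.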
Databases are structured repositories that store data in a well-organized manner, making it easier to query and retrieve information. Common types include SQL databases (e.g., MySQL, PostgreSQL) and NoSQL databases (e.g., MongoDB, Cassandra).
- Advantages: Databases provide fast access to large volumes of data, support data integrity through relationships and constraints, and allow concurrent access by multiple users.
- Challenges: Setting up and managing databases can be complex, especially for large-scale systems. Data schema changes and migrations require careful planning to avoid disruptions.
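As a small illustration of querying and of integrity through relationships and constraints, here is a sketch using SQLite through Python's built-in `sqlite3` module. The `respondents` and `answers` tables and their contents are invented for this example; a production database such as MySQL or PostgreSQL would need its own driver and setup.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE respondents (id INTEGER PRIMARY KEY, city TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE answers (
        id INTEGER PRIMARY KEY,
        respondent_id INTEGER NOT NULL REFERENCES respondents(id),
        score INTEGER NOT NULL CHECK (score BETWEEN 1 AND 5)
    )
    """
)

conn.executemany(
    "INSERT INTO respondents (id, city) VALUES (?, ?)",
    [(1, "Accra"), (2, "Kumasi")],
)
conn.executemany(
    "INSERT INTO answers (respondent_id, score) VALUES (?, ?)",
    [(1, 4), (1, 5), (2, 3)],
)

# Querying: average score per city, joining the two related tables.
avg_by_city = conn.execute(
    """
    SELECT r.city, AVG(a.score)
    FROM respondents AS r
    JOIN answers AS a ON a.respondent_id = r.id
    GROUP BY r.city
    ORDER BY r.city
    """
).fetchall()

# Integrity: an answer pointing at a respondent that does not exist
# is rejected by the foreign key constraint.
try:
    conn.execute(
        "INSERT INTO answers (respondent_id, score) VALUES (?, ?)", (99, 4)
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(avg_by_city, rejected)
```

The constraint check happens inside the database, so every program that writes to these tables gets the same protection.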
Web-Based Sources
Web-based sources encompass a wide range of data available on the internet. This includes data from APIs (Application Programming Interfaces), web scraping, social media platforms, online surveys, and more.
- Advantages: Web-based sources offer real-time data access and the ability to tap into vast amounts of dynamic information. APIs provide structured data, making it easier to work with specific endpoints.
- Challenges: Web scraping may raise legal and ethical concerns if not done responsibly. APIs might have rate limitations or require authentication, and their endpoints could change over time.
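Rate limits are usually handled by waiting and trying again. The sketch below shows one common pattern, retrying with a growing delay. Everything in it is hypothetical: `RateLimited` stands in for whatever error your API client raises when it is told to slow down, and `flaky_fetch` fakes an API call so the example runs without a network connection.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised when the API says 'too many requests'."""


def fetch_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), waiting longer after each rate-limit error.

    The delay doubles on every retry (base_delay, 2 * base_delay, ...).
    The last failure is re-raised so the caller can decide what to do.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# A stand-in for a real API call: fails twice, then succeeds.
calls = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise RateLimited()
    return {"ok": True}


# Record the delays instead of actually sleeping, to keep the demo instant.
delays = []
result = fetch_with_retry(flaky_fetch, sleep=delays.append)
print(result, delays)
```

With a real API you would pass a function that performs the request and leave `sleep` at its default, and it is worth checking the provider's documentation for the retry behaviour they recommend.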
Now that you know where data originates in the data science process, from physical sources to digital ones, along with the advantages and challenges of each digital source, how would you now classify Kaggle and Google when it comes to data acquisition?
The Beginner training series is still being compiled. In case you missed the latest update, Data Types, Types of Data, check it out!