The Data Sources You Need to Know About as a Data Enthusiast

Aug 9, 2023

Raise your hand if you’ve ever cited Google and Kaggle as examples of data sources during data debates (if it is a thing 💩). Let’s see if you can still defend your stance after reading this post. A data source suggests an origin, possibly in physical or digital form.

Here, we will discuss the physical sources of data, the nature of data as it transitions into digital form, and then digital data sources along with their pros and cons.

A data source is the point where data is first created or first digitized. Ultimately, it is a data source so long as a process or system accesses and uses it. The source can be physical, like surveys or interviews, or digital, like readings from sensors.

Physical Data Sources

Think of how data was traditionally collected for research. Before the introduction of online forms, researchers printed sheets with questions and sought subjects within a sample population to respond to the surveys. Researchers then summarised the responses and converted them into digital form for visualization and analysis. The physical data sources are:

Primary: produced by someone who witnessed or experienced an event firsthand. Examples include surveys, interviews, diaries, and memoirs. These sources act as first-hand evidence.
Secondary: derived by a third party from primary data sources. Secondary sources analyze and interpret the primary sources. Examples include dictionaries and biographies.
Tertiary: compiles and summarizes data from both primary and secondary sources.

The Nature of Data While Transitioning

Digitizing physical data sources yields data in one of three forms: structured, unstructured, or semi-structured.

Structured: the data is organized in a tabular format with rows and columns, where each row is a record and each column is an attribute of that record. CSV files, Excel sheets, and relational database tables are structured.
Unstructured: has no pre-defined structure and no data model. The data is irregular and ambiguous. Social media posts and open-ended survey responses are unstructured. By commonly cited estimates, around 80% of the world's data is unstructured. Even though unstructured data has no fixed form, it contains a lot of valuable information and insights on data subjects.
Semi-structured: sits between structured and unstructured data. It lacks the rigid structure needed to count as structured, yet has enough structure not to be unstructured ( :D ). E-mails are semi-structured: the headers (sender, recipient, date) are fixed fields, while the body is free text.
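
To make the three forms concrete, here is a minimal Python sketch using only the standard library. The sample records, names, and e-mail addresses are invented for illustration; they are not from any real dataset.

```python
import csv
import io
from email import message_from_string

# Structured: every row has the same columns, so a CSV reader can map
# each value to a named field. (Sample data is invented.)
csv_text = "name,age\nAma,29\nKofi,34\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Unstructured: free text with no predefined fields; any structure has
# to be inferred by the analysis itself.
survey_answer = "I liked the course, but the pace was a bit fast for me."

# Semi-structured: an e-mail has fixed header fields and a free-text body.
raw_email = (
    "From: ama@example.com\n"
    "To: kofi@example.com\n"
    "Subject: Survey results\n"
    "\n"
    "Hi Kofi, the responses are in. Most people enjoyed it!\n"
)
message = message_from_string(raw_email)

print(rows[0]["name"], rows[0]["age"])  # fields addressed by column name
print(len(survey_answer.split()))       # only crude measures come for free
print(message["Subject"])               # header addressed by field name
print(message.get_payload().strip())    # the body is still free text
```

Notice that the structured and semi-structured values can be looked up by name, while the free-text answer needs further processing before it can be analyzed.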

Digital Data Sources

Computer Files

Computer files are a common source of data for data science projects. These files can be in various formats, such as CSV, Excel, JSON, XML, text files, and more.

Advantages: Computer files are easily accessible and can be shared across different platforms. They are lightweight and can be processed using standard programming languages and libraries.
Challenges: Ensuring data consistency and quality can be a challenge when dealing with large volumes of files. Excel, for instance, is known for automatically reformatting data, particularly date columns, and such silent changes are much harder to track when the data is large. Additionally, file-based data sources might not support real-time data updates.
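
One way to guard against silently reformatted dates is to read every value as text and parse dates explicitly, so that anything not matching the expected format is flagged rather than guessed. The sketch below assumes a hypothetical CSV export with a `visit_date` column that should be in `YYYY-MM-DD` format; the column name, the sample rows, and the helper `find_bad_dates` are all made up for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical export: the second row's date was rewritten by a spreadsheet.
csv_text = (
    "patient_id,visit_date\n"
    "P001,2023-08-09\n"
    "P002,09/08/2023\n"
    "P003,2023-08-11\n"
)


def find_bad_dates(text, column="visit_date", fmt="%Y-%m-%d"):
    """Return (line_number, value) for every date that does not match fmt."""
    bad = []
    # The csv module reads every cell as a string, so nothing is converted implicitly.
    reader = csv.DictReader(io.StringIO(text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            datetime.strptime(row[column], fmt)
        except ValueError:
            bad.append((line_number, row[column]))
    return bad


print(find_bad_dates(csv_text))
```

Running a check like this right after loading a file surfaces the offending rows early, before they distort any analysis.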

Database Sources

Databases are structured repositories that store data in a well-organized manner, making it easier to query and retrieve information. Common types include SQL databases (e.g., MySQL, PostgreSQL) and NoSQL databases (e.g., MongoDB, Cassandra).

Advantages: Databases provide fast access to large volumes of data, support data integrity through relationships and constraints, and allow concurrent access by multiple users.
Challenges: Setting up and managing databases can be complex, especially for large-scale systems. Data schema changes and migrations require careful planning to avoid disruptions.
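
To illustrate what "data integrity through relationships and constraints" can look like in practice, here is a small sketch using SQLite through Python's built-in `sqlite3` module. The table and column names are hypothetical, and SQLite stands in for the larger systems named above only because it needs no setup.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE respondents (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE answers (
        id INTEGER PRIMARY KEY,
        respondent_id INTEGER NOT NULL REFERENCES respondents(id),
        score INTEGER NOT NULL CHECK (score BETWEEN 1 AND 5)
    )
    """
)

conn.execute("INSERT INTO respondents (id, name) VALUES (1, 'Ama')")
conn.execute("INSERT INTO answers (respondent_id, score) VALUES (1, 4)")

rejected = []
bad_rows = [(1, 9), (42, 3)]  # out-of-range score; unknown respondent
for respondent_id, score in bad_rows:
    try:
        conn.execute(
            "INSERT INTO answers (respondent_id, score) VALUES (?, ?)",
            (respondent_id, score),
        )
    except sqlite3.IntegrityError as error:
        # The database itself refuses rows that break a constraint.
        rejected.append((respondent_id, score, str(error)))

stored = conn.execute(
    "SELECT r.name, a.score FROM answers a JOIN respondents r ON r.id = a.respondent_id"
).fetchall()
print(stored)
print(rejected)
```

The invalid rows never reach the table, which is the kind of protection a folder of loose files cannot offer on its own.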

Web-Based Sources

Web-based sources encompass a wide range of data available on the internet. This includes data from APIs (Application Programming Interfaces), web scraping, social media platforms, online surveys, and more.

Advantages: Web-based sources offer real-time data access and the ability to tap into vast amounts of dynamic information. APIs provide structured data, making it easier to work with specific endpoints.
Challenges: Web scraping may raise legal and ethical concerns if not done responsibly. APIs might have rate limitations or require authentication, and their endpoints could change over time.
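
Rate limits and authentication can be handled in code. The sketch below, built on the standard library's `urllib`, retries a request with exponential backoff when the server answers with HTTP 429 (the conventional "Too Many Requests" status). The helper names `http_get_json` and `fetch_with_retry`, the bearer-token header, and the retry settings are assumptions for illustration; a real API's documentation decides how authentication and limits actually work.

```python
import json
import time
import urllib.error
import urllib.request


def http_get_json(url, token=None):
    """Fetch a URL and decode its JSON body. The bearer-token header is an assumption."""
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_with_retry(url, fetch=http_get_json, max_attempts=4,
                     base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on HTTP 429, wait 1x, 2x, 4x ... base_delay and try again."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except urllib.error.HTTPError as error:
            out_of_attempts = attempt == max_attempts - 1
            if error.code != 429 or out_of_attempts:
                raise  # not a rate limit, or we have already retried enough
            sleep(base_delay * 2 ** attempt)
```

A call would look like `fetch_with_retry("https://api.example.com/v1/readings")`, where the URL is a placeholder. Passing `fetch` and `sleep` as parameters keeps the retry logic testable without touching the network.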

Now that you're acquainted with where data originates in the data science process, from physical sources to digital ones, and with the advantages and disadvantages of the various digital sources, how would you now classify Kaggle and Google when it comes to data acquisition?

The Beginner training series is still growing. In case you missed the latest installment, Data Types, Types of Data, check it out!

Written by Joana Owusu-Appiah
