As a data scientist, you won’t always find it straightforward to track down high-quality datasets in your preferred niche, especially if we’re talking about free resources.
Luckily, there are various free websites you can use to find datasets for your next big data science project. Now, let’s explore them one by one!
API Spreadsheets is one of the best dataset resources for data scientists on the web.
It breaks down datasets into categories, making it easier to pick your preferred one. These include school grades, healthcare topics, car statistics, wine, German credit data, Forbes billionaires, computer hardware, and many more.
You can download any of these datasets for free. It’s also worth noting that the directory tells you how many times each dataset was downloaded.
Google Dataset Search is a search engine designed by the search giant just for datasets.
All you have to do is type a keyword, and Google will return matching datasets from as many external sources as it has indexed.
With each dataset, you get a brief description of the data, who the provider is, and whether it was recently updated.
Back in 2015, the US Government opened more than 200,000 of its datasets to the public. The topics are pretty diverse, ranging from crime rates and car accidents to climate change data.
The directory’s search feature is unexpectedly powerful. It lets you filter results according to file format, geographical region, and type of organization.
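If you prefer to script your searches, the same kind of filtering can be expressed as a query URL. The snippet below is only a minimal sketch: it assumes the directory is Data.gov’s catalog and that it exposes the standard CKAN search API at the address shown. The `build_search_url` helper is hypothetical, and the `organization_type` field name is an assumption based on the site’s filter labels. A geographic filter is left out because its field name isn’t something we can confirm here.

```python
from urllib.parse import urlencode

# Assumption: the catalog exposes the standard CKAN Action API at this address.
BASE_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(keyword, file_format=None, organization_type=None, rows=10):
    """Build a dataset search URL with optional filters (no request is sent)."""
    filters = []
    if file_format:
        # res_format is the CKAN search field for a resource's file format.
        filters.append(f'res_format:"{file_format}"')
    if organization_type:
        # Assumed field name; check the catalog's own filter list before relying on it.
        filters.append(f'organization_type:"{organization_type}"')

    params = {"q": keyword, "rows": rows}
    if filters:
        # Multiple filters are combined so that a result must match all of them.
        params["fq"] = " AND ".join(filters)
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url(
    "car accidents",
    file_format="CSV",
    organization_type="Federal Government",
)
print(url)
```

Paste the printed URL into a browser, or fetch it with your favorite HTTP client, to see whether the filters behave the way you expect.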
Another excellent resource is the US Census Bureau. It’s suitable for those looking for general data about the US population, such as birth rates and education.
The UCI Machine Learning Repository was launched by the University of California, Irvine in 1987. It contains a massive number of datasets categorized by attribute, task, niche, and data type.
Researchers, professors, and college students rely on this huge database for many of their machine learning and data science projects.
Want to demonstrate your ability to work with highly complex datasets? The CERN Open Data Portal is an excellent resource for exactly that.
With over two petabytes of data, you’ll definitely find what you’re looking for here. The library also includes datasets from the Large Hadron Collider particle accelerator! How cool is that?
The CERN database is a dream come true for data scientists interested in particle physics, but of course, you need to have adequate experience before getting your hands dirty with one of these datasets.
The BFI Film Industry Statistics is a good dataset resource if you’re looking for less-complicated datasets.
The datasets often include UK box office figures, home entertainment, movie production costs, and audience demographics.
Need something a bit more sophisticated? Check out their annual statistical yearbook. This one contains visual reports and statistical analyses that should satisfy data scientists of all experience levels.
Kaggle is a data science community that contains user-posted datasets for data science projects. These include customer sales, movie reviews, and others.
Most datasets are pre-processed, which can save you a great deal of time and effort in your project.
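Pre-processed doesn’t always mean spotless, so it’s worth running a quick check before you build on a downloaded file. Here’s a minimal sketch that counts empty cells per column using only Python’s standard library. The inline sample is made up for illustration (it’s not a real Kaggle dataset), and `count_missing` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample standing in for a downloaded CSV file; not real Kaggle data.
SAMPLE_CSV = """review_id,rating,text
1,5,Great movie
2,,Too long
3,4,
"""


def count_missing(csv_file):
    """Return {column: number of empty cells} for an open CSV file object."""
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # Short rows yield None; blank cells yield an empty string.
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing


missing_counts = count_missing(io.StringIO(SAMPLE_CSV))
print(missing_counts)
```

To run the same check on a real download, open the file with `open(path, newline="")` and pass the file object to `count_missing`.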
Whether you’re working on visualization, data analysis, or sentiment analysis projects, this collaborative library definitely has something for you.
GitHub’s Awesome-Public-Datasets is a dataset library that’s updated frequently. Its datasets are divided into categories to make the research process easier for you.
These include software, machine learning, and computer hardware. All of these resources can be downloaded with a free GitHub account.
The Amazon Web Services Open Data Registry contains a considerable number of datasets that you can download for free. The platform also lets you analyze the datasets using Hadoop via EMR or cloud-based EC2.
To access the dataset library, you must create a new AWS account if you don’t have one.
Data.World is a collaborative data community where data scientists can share projects and datasets they’ve worked on together. Examples include data journalism and social bot detection.
Additionally, the website contains datasets in lots of categories, including healthcare, Twitter, and crime.
Reddit is a social media platform where users can talk about a particular subject in sections called subreddits.
For data science enthusiasts, there’s a subreddit called /r/datasets, where users share datasets with the community. Since this is a general discussion subreddit, you’ll be able to get your hands on a wide variety of datasets on various subjects.
Compiled by NASA, this repository contains datasets from the space agency’s satellite observation data for planet Earth. These include vegetation mapping, atmospheric observations, climate change, and many more.
You can also head to NASA’s Planetary Data repository to download datasets about other planets, such as data from the Cassini probe, which orbited Saturn from 2004 to 2017.
FiveThirtyEight contains datasets from data journalism in sports, culture, and politics. All of the data is available to the public for free. This is an excellent resource for you to explore some beginner-level datasets that can be beneficial for your next data science project.
So that was a quick overview of some of the best free dataset resources for your next data science project.
You can start with API Spreadsheets’ library; there’s a pretty high chance you’ll find what you’re looking for there. Of course, you can also explore some of the other resources if you’re short on data.