Last modified on 01 Oct 2021.
Create artificial dataset
- sklearn dataset module: `from sklearn import datasets`. It also contains some popular reference datasets.
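As a minimal sketch of what the sklearn dataset module offers for artificial data, the snippet below uses two of its generators, `make_classification` and `make_blobs`. The sample counts, feature counts and `random_state` values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_blobs, make_classification

# Synthetic binary classification problem: 200 samples, 5 features,
# of which 3 are informative and 1 is a linear combination of them.
X, y = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_redundant=1,
    n_classes=2,
    random_state=0,  # fixed seed so the data is reproducible
)
print(X.shape, y.shape)

# Three Gaussian clusters in 2D, handy for testing clustering algorithms.
X_blobs, y_blobs = make_blobs(
    n_samples=150,
    centers=3,
    n_features=2,
    random_state=0,
)
print(X_blobs.shape, sorted(set(y_blobs.tolist())))
```

Both generators return a feature matrix and a label vector as NumPy arrays, so the result can be passed directly to any scikit-learn estimator.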
Source of datasets
- Google Dataset Search.
- Google Trends Datastore
- Google AI Datasets — In order to contribute to the broader research community, Google periodically releases data of interest to researchers in a wide range of computer science disciplines.
- Data Hub Datasets collection — high quality data and datasets organized by topic.
- Kaggle Datasets.
- awesome-public-datasets — A topic-centric list of HQ open datasets.
- Stanford Large Network Dataset Collection.
- FiveThirtyEight — hard data and statistical analysis to tell stories about politics, sports, societal matters and more.
- data.world.
- BuzzFeedNews/everything — data from BuzzFeed.
- data.gov — a large dataset aggregator and the home of the US Government’s open data.
- Quandl – useful for testing your machine learning algorithms without wasting time on cleaning data.
- r/datasets.
- Built-in datasets in Scikit-Learn.
- NLP-progress.
- UCI Machine Learning Repository
- The Yahoo Webscope Program
- TensorFlow Datasets
Datasets
- WordNet – A Lexical Database for English.
- ImageNet – ImageNet is an image database organized according to the WordNet hierarchy.
- Fruit-Images-Dataset — A dataset of images containing fruits and vegetables.
- Dataset samples from Machine Learning Mastery.
- UEA & UCR Time Series Classification Repository
- Sarcasm detection dataset.
- Insight - BBC News Datasets
- Large Movie Review Dataset (IMDB)
- COCO Dataset – a large-scale object detection, segmentation, and captioning dataset.
Vietnamese
- PhoBERT: Pre-trained language models for Vietnamese.
- IWSLT’15 English-Vietnamese data (small, from Stanford).
- NLP-progress - Vietnamese
Sample datasets
- Labeled Faces in the Wild Home (`from sklearn.datasets import fetch_lfw_people`).
- Iris flower dataset (`from sklearn.datasets import load_iris`).
- The digits dataset (`sklearn.datasets.load_digits`).
- pydatafaker – A Python package to create fake data with relationships between tables.
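For the scikit-learn loaders named in the list above, a short sketch of loading the Iris and digits datasets might look like the following. Both ship with scikit-learn, so no download is needed. `fetch_lfw_people` is left out here because it fetches its data over the network on first use.

```python
from sklearn.datasets import load_digits, load_iris

# Iris: 150 flowers, 4 measurements each, 3 species.
iris = load_iris()
print(iris.data.shape)          # feature matrix
print(list(iris.target_names))  # class names

# Digits: 8x8 grayscale images of handwritten digits, flattened to 64 features.
digits = load_digits()
print(digits.data.shape)    # flattened images
print(digits.images.shape)  # same images, kept as 8x8 arrays
```

Each loader returns a `Bunch` object whose `data` and `target` attributes hold the features and labels. Passing `return_X_y=True` returns the two arrays as a tuple instead.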
Tools
- TimeSynth – A Multipurpose Library for Synthetic Time Series Generation in Python.