
Top 10 Open Datasets For Machine Learning And AI

Finding useful, in-demand datasets for practicing machine learning can be difficult. Here are popular open datasets that are useful for beginners and students.


1. MNIST Handwritten Digits Image Dataset

MNIST is one of the most popular deep learning datasets. It is a dataset of handwritten digits with a training set of 60,000 examples and a test set of 10,000 examples. It is a good dataset for trying learning techniques and pattern recognition methods on real-world data while spending minimal time and effort on data preprocessing.


Size: 50 MB

Number of Records: 70,000 images in 10 classes
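The original MNIST downloads are stored in a simple binary layout called IDX: a short big-endian header followed by raw pixel or label bytes. Below is a minimal sketch of reading that layout with only the standard library. The function names and the tiny synthetic byte strings are made up for illustration; the real files are usually gzip-compressed, so you would decompress them first (for example with gzip.open).

```python
import struct


def parse_idx_images(raw):
    """Parse an IDX3 image file into a list of flat pixel byte strings."""
    # Header: four big-endian 32-bit ints (magic number, count, rows, cols).
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError("unexpected magic number %d for an image file" % magic)
    pixels = raw[16:]
    size = rows * cols
    if len(pixels) != count * size:
        raise ValueError("file length does not match the header")
    # Each image is rows*cols unsigned bytes (0-255), stored row by row.
    images = [pixels[i * size:(i + 1) * size] for i in range(count)]
    return images, rows, cols


def parse_idx_labels(raw):
    """Parse an IDX1 label file into a list of ints."""
    # Header: two big-endian 32-bit ints (magic number, count).
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError("unexpected magic number %d for a label file" % magic)
    return list(raw[8:8 + count])


# Synthetic stand-ins for the real files: two 2x2 "images" and two labels.
fake_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 255, 0, 10, 20, 30, 40])
fake_labels = struct.pack(">II", 2049, 2) + bytes([7, 3])

images, rows, cols = parse_idx_images(fake_images)
labels = parse_idx_labels(fake_labels)
```

With the real training file, the same parser would report 60,000 images of 28 x 28 pixels.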


2. MS COCO Dataset

COCO is a large-scale, richly annotated dataset for object detection, segmentation and captioning. It has several features:

  • Object segmentation

  • Recognition in context

  • Superpixel stuff segmentation

  • 330K images (>200K labeled)

  • 1.5 million object instances

  • 80 object categories



Size: 25GB

Number of Records: 330K images, 80 object categories, 5 captions per image, 250,000 people with keypoints
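COCO's detection annotations ship as JSON with three main lists: "images", "annotations" and "categories". Each annotation points at an image and a category by id and stores its box as [x, y, width, height]. Here is a small sketch of grouping boxes by image. The helper name and the sample dictionary are invented for illustration, and a real annotation file has many more fields per entry.

```python
from collections import defaultdict


def boxes_by_image(coco):
    """Group (category name, bbox) pairs by image id."""
    # Map category ids to readable names once, up front.
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    index = defaultdict(list)
    for ann in coco["annotations"]:
        index[ann["image_id"]].append((names[ann["category_id"]], ann["bbox"]))
    return dict(index)


# Hypothetical miniature annotation file (bbox is [x, y, width, height]).
sample = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
    "annotations": [
        {"image_id": 1, "category_id": 1, "bbox": [10.0, 20.0, 50.0, 80.0]},
        {"image_id": 1, "category_id": 18, "bbox": [5.0, 5.0, 30.0, 40.0]},
        {"image_id": 2, "category_id": 18, "bbox": [0.0, 0.0, 15.0, 15.0]},
    ],
}

index = boxes_by_image(sample)
```

For a real file you would load the dictionary with json.load and pass it to the same function.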


3. ImageNet Dataset

ImageNet is an image dataset organized according to the WordNet hierarchy. WordNet contains approximately 100,000 phrases, and ImageNet provides around 1,000 images on average to illustrate each phrase.



Size: 150GB

Number of Records: ~1,500,000 images, each with multiple bounding boxes and respective class labels


3. Open Images Dataset

Open Images is a dataset of almost 9 million image URLs. These images have been annotated with image-level labels and bounding boxes spanning thousands of classes. The dataset contains a training set of 9,011,219 images, a validation set of 41,260 images and a test set of 125,436 images.



Size: 500GB

Number of Records: 9,011,219 images with more than 5k labels


4. VisualQA Dataset

VQA is a dataset containing open-ended questions about images. These questions require an understanding of vision and language. Some of the interesting features of this dataset are:

  • 265,016 images (COCO and abstract scenes)

  • At least 3 questions (5.4 questions on average) per image

  • 10 ground truth answers per question

  • 3 plausible (but likely incorrect) answers per question

  • Automatic evaluation metric

Size: 25 GB

Number of Records: 265,016 images, at least 3 questions per image, 10 ground truth answers per question
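The automatic evaluation metric is usually summarized as: an answer counts as fully correct if at least 3 of the 10 human annotators gave it, and gets partial credit otherwise. The sketch below implements that simplified form. It is an approximation: the official evaluation script also normalizes answers (punctuation, articles, number words) and averages over subsets of annotators, which this version skips.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(number of matching humans / 3, 1)."""
    target = predicted.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == target)
    return min(matches / 3.0, 1.0)


# Hypothetical set of 10 ground truth answers for one question.
answers = ["yes", "yes", "no", "no", "no", "no", "no", "no", "no", "no"]

score_no = vqa_accuracy("no", answers)        # 8 humans agree, full credit
score_yes = vqa_accuracy("Yes", answers)      # 2 humans agree, partial credit
score_maybe = vqa_accuracy("maybe", answers)  # nobody agrees, no credit
```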


5. CIFAR-10 Dataset

This is another dataset for image classification. It consists of 60,000 images in 10 classes, split into 50,000 training images and 10,000 test images. The dataset is divided into 6 parts: 5 training batches and 1 test batch, each with 10,000 images.


Size: 170 MB

Number of Records: 60,000 images in 10 classes
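In the Python version of CIFAR-10, each batch stores every image as one flat row of 3,072 values: the first 1,024 are the red channel, the next 1,024 green, and the last 1,024 blue, each in row-major order for a 32 x 32 image. The sketch below turns one such row back into a grid of (R, G, B) pixels. The function name is made up, and the demo uses a tiny 2 x 2 "image" instead of a real batch so that it runs without downloading anything.

```python
def row_to_pixels(row, side=32):
    """Convert one flat channel-major row into a side x side grid of (R, G, B) tuples."""
    plane = side * side
    if len(row) != 3 * plane:
        raise ValueError("expected %d values, got %d" % (3 * plane, len(row)))
    # The three colour planes are stored one after another.
    red = row[:plane]
    green = row[plane:2 * plane]
    blue = row[2 * plane:]
    return [
        [(red[y * side + x], green[y * side + x], blue[y * side + x]) for x in range(side)]
        for y in range(side)
    ]


# Synthetic 2x2 example: 4 red values, then 4 green, then 4 blue.
tiny_row = [1, 2, 3, 4, 10, 20, 30, 40, 100, 110, 120, 130]
tiny_image = row_to_pixels(tiny_row, side=2)
```

With a real batch you would unpickle the file and call the function with the default side of 32 on each row of its data array.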


6. Fashion-MNIST Dataset

Fashion-MNIST consists of 60,000 training images and 10,000 test images. It is an MNIST-like database of fashion products. The developers believe MNIST has been overused, so they created this as a direct replacement for that dataset. Each image is in greyscale and is associated with a label from 10 classes.


Size: 30 MB

Number of Records: 70,000 images in 10 classes


7. IMDB Reviews Dataset

This is a dream dataset for movie lovers. It is meant for binary sentiment classification and has substantially more data than earlier benchmark datasets in this field. Apart from the training and test review examples, there is additional unlabeled data as well. Both raw text and a preprocessed bag-of-words format are included.


Size: 80 MB

Number of Records: 25,000 highly polar movie reviews for training, and 25,000 for testing
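A bag of words simply counts how often each word appears in a review and ignores word order. Here is a minimal sketch of building one from raw text. The tokenizer (lowercase, letters and apostrophes only) and the two sample reviews are assumptions for illustration; this is not the preprocessing used to build the dataset's own bag-of-words files.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count word occurrences in a review, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical labeled reviews: 1 = positive, 0 = negative.
reviews = [
    ("A great, great film.", 1),
    ("Dull plot and dull acting.", 0),
]

bags = [(bag_of_words(text), label) for text, label in reviews]
```

The resulting counts can then be fed to a simple classifier such as naive Bayes or logistic regression.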


8. WordNet Dataset

Mentioned in the ImageNet entry above, WordNet is a large database of English synsets. Synsets are groups of synonyms that each describe a different concept. WordNet’s structure makes it a very useful tool for NLP.


Size: 10 MB

Number of Records: 117,000 synsets, each linked to other synsets by means of a small number of “conceptual relations”
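One way to picture these conceptual relations is as a graph in which each synset points to a more general one (its hypernym). The sketch below walks such links on a toy dictionary. The entries are heavily simplified and are not the actual contents of WordNet, where a synset can have more than one hypernym; the function name is also made up.

```python
def hypernym_chain(synset, hypernym_of):
    """Follow 'is a kind of' links from a synset up to the most general one."""
    chain = [synset]
    seen = {synset}
    while chain[-1] in hypernym_of:
        parent = hypernym_of[chain[-1]]
        if parent in seen:  # guard against accidental cycles in the toy data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


# Toy, simplified relation table (hypothetical, one hypernym per synset).
toy_hypernyms = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}

chain = hypernym_chain("dog", toy_hypernyms)
```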


9. Million Song Dataset

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes

  • To provide a reference dataset for evaluating research

  • As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest’s)

  • To help new researchers get started in the MIR field

Size: 280 GB

Number of Records: One million songs


10. Yelp Reviews Dataset

This is an open dataset released by Yelp for learning purposes. It consists of millions of user reviews, business attributes and over 200,000 pictures from multiple metropolitan areas. It is very commonly used for NLP challenges globally.


Size: 2.66 GB JSON, 2.9 GB SQL and 7.5 GB Photos (all compressed)
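The JSON files in this dataset are typically laid out with one JSON object per line, so they can be processed as a stream instead of being loaded into memory all at once. The sketch below averages star ratings per business from such lines. The field names "business_id" and "stars" are assumed from the commonly documented review schema, and the three sample lines are invented.

```python
import json
from collections import defaultdict


def mean_stars(lines):
    """Compute the average star rating per business from JSON-lines reviews."""
    totals = defaultdict(lambda: [0.0, 0])  # business_id -> [sum of stars, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        entry = totals[review["business_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {business: total / count for business, (total, count) in totals.items()}


# Hypothetical review lines in the one-object-per-line layout.
sample_lines = [
    '{"business_id": "b1", "stars": 5, "text": "Great tacos"}',
    '{"business_id": "b1", "stars": 3, "text": "Slow service"}',
    '{"business_id": "b2", "stars": 4, "text": "Solid coffee"}',
]

averages = mean_stars(sample_lines)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, which keeps memory use low on the multi-gigabyte review file.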



Contact us for help with machine learning datasets, such as how to use a dataset and apply algorithms to it.


realcode4you@gmail.com