Open data

A topic-based overview of public repositories containing research data.

Agricultural sciences

AQUASTAT Dissemination System

Global information system of the Food and Agriculture Organization of the United Nations (FAO) on water resources and agricultural water management

Data Commons Agriculture

Data Commons brings together public data from many parts of the world, including surveys

FoodData Central

Food composition data from the United States Department of Agriculture (USDA)

Hyperspectral benchmark dataset on soil moisture

Hyperspectral and soil moisture data from a lysimeter field campaign based on a soil sample. Karlsruhe (Germany), 2017

Index DataBase

Vegetation indices from the Institute of Crop Science and Resource Conservation (INRES)

PLANTS Database

Standardised information on vascular plants, mosses and lichens in the USA

Biology

American-Gut

Open-access code and IPython notebooks from the American Gut project

ArrayExpress - Functional Genomics Data

Data from functional genomics experiments

Catalogue of Life (COL)

Integrated list of all known species worldwide

Cell Image Library (CIL)

Over 12,000 datasets from the Center for Research in Biological Systems (CRBS)

CytoImageNet

Extensive dataset for pre-training with microscopy images

Electron Microscopy Data Bank (EMDB)

Data from cryo-electron microscopy (cryo-EM) and representative tomograms of macromolecular complexes and subcellular structures

EMBL-EBI

Data sources and analysis tools from the European Bioinformatics Institute of the European Molecular Biology Laboratory (EMBL)

ENCODE portal

The ENCODE Consortium is an ongoing international collaborative project involving research groups, funded by the National Human Genome Research Institute (NHGRI)

EnsemblGenomes

Genomic data for non-vertebrate species, as well as tools for processing, analysing and visualising this data

FireBrowse portal

FireBrowse provides access to a wide range of cancer genomics data, including clinical annotations, DNA copy number, miR, miRseq, mRNA and mRNAseq

Gene Expression Omnibus

Public repository for functional genomics data that supports MIAME-compliant data submissions

Gene Ontology

Gene Ontology (GO) knowledge base on gene functions

Genomic Data Commons Data Portal

Data from genomic cancer studies

Global Biotic Interactions (GloBI)

Data on species interactions, such as predator-prey, pollinator-plant, pathogen-host and parasite-host

ICOS PSP benchmarks

A collection of practical benchmarks designed to test the scalability of classification and regression methods developed by the ICOS research group

IGSR: The International Genome Sample Resource

As part of the ‘1000 Genomes’ project, a catalogue of common genetic variations in humans was compiled

Journal of Cell Biology

A collection of image data relating to articles published in the ‘Journal of Cell Biology’

KEGG: Kyoto Encyclopedia of Genes and Genomes

Database on higher-level functions and relationships within biological systems such as cells, organisms, ecosystems and the biosphere, based on information at the molecular level

NIH Human Microbiome Project Catalog

Metadata on all reference genomes of human-related isolates and on metagenome samples from healthy humans

National Center for Biotechnology Information

Databases, including those on chemicals and bioassays, DNA and RNA, and homology

openSNP

Repository for genetic and phenotypic data

palmerpenguins

Dataset for data exploration and visualisation as an alternative to Iris

Pathguide

Resources on biological signalling pathways and molecular interactions

RCSB Protein Data Bank (RCSB PDB)

Data on experimentally determined 3D structures, integrative 3D structures and computer-generated structural models (CSM)

Personal Genome Project

Publicly accessible data on genomes, health and traits

PGC Data Access Portals

Portals for querying data at the individual level or with restricted access

Rfam

Collection of RNA families, each represented by multiple sequence alignments (MSAs), consensus RNA structures and covariance models

SSBD:database

Open resources for the analysis of microscopic images and quantitative data of biological objects, such as single molecules, cells, tissues, individuals, etc.

UniGene

NCBI database on the transcriptome and therefore not primarily a database for genes

UniProt

Freely accessible source of protein sequences and functional information

UCSC Genome Browser

Sequence and annotation data for the genome sequences displayed in the UCSC Genome Browser

Chemistry

Ionic Liquids Database - ILThermo

Online search tool for the thermodynamic and transport properties of ionic liquids, as well as binary and ternary mixtures containing ionic liquids

PubChem

A collection of freely accessible chemical information from the National Center for Biotechnology Information

Climate and Weather

38-Cloud: A Cloud Segmentation Dataset

38 Landsat-8 images and manually extracted pixel-level reference values for cloud detection

Aviation Weather Center

Warnings, forecasts and analyses of hazardous weather conditions for aviation

Actuaries Climate Index

Monthly and seasonal data by region and component

Average city temperatures

Daily data on average air temperatures in major cities worldwide

Canadian Weather Information

Historical data by station name, province, territory or distance

Caravan

Global dataset for large-sample hydrology

CDC – Climate Data Center

Climate data from the German Weather Service (Deutscher Wetterdienst, DWD)

Climate Data Online (CDO)

Statistics, current weather observations and climate data from the Australian Data Archive for Meteorology (ADAM)

Climatic Research Unit

Data provided by the CRU of the National Centre for Atmospheric Science (NCAS)

Copernicus Climate Change Service (C3S)

One of the six thematic services provided under the European Union’s Copernicus programme

European Climate Assessment & Dataset (ECA&D)

Datasets on changes in weather and climate extremes

GDELT Project: Four Massive Datasets Charting The Global Climate Change News Narrative 2009-2020

Four extensive datasets illustrating global climate change reporting from 2009 to 2020

NOAA Global Radiation and Aerosols (GRAD) Data

Long-term measurements of radiation, meteorological parameters and aerosols at various remote locations worldwide, as well as at sites across the Americas

NOAA Local Climatological Data (LCD)

Summaries of climatological conditions from airports and other major weather stations

Open-Meteo

Open weather data with a high resolution of 1 to 11 kilometres

WorldClim

Maps, graphs, tables and data on the global climate

Complex Networks

Archive-IT

Archived websites and web pages

CRAWDAD

‘Community Resource for Archiving Wireless Data at Dartmouth’ (CRAWDAD)

DIMACS

Synthetic and real-world benchmark inputs with their generators, shortest-path solvers and scripts for generating benchmark performance reports, as well as detailed documentation

DOI URLs

DOIs for nearly 50 million journal articles from the OAI-PMH server

Internet Archive Dataset Collection

Extensive data archives from both institutions and individuals

KONECT

Network datasets from the Koblenz Network Collection

Laboratory for Web Algorithmics

Data for the WebGraph framework

Mark Newman: Network data

Links to network datasets in GML format

Microsoft Research Tools: code, datasets, & models

Directory of datasets, SDKs, APIs and open-source tools developed by Microsoft researchers

NBER U.S. Patent Citations Data File

Findings, insights and methodological tools

Network Repository

Interactive data and network data repository with real-time visual analysis, featuring thousands of datasets spanning over 30 disciplines, from biological to social network data

NIST Complex Network Resources

Standard datasets against which algorithms and claims can be compared and verified

The R Datasets Package

The R datasets package

PyPi/Maven dependency data

Three LZMA-compressed files: mvn-deps.csv.lzma, mvn-minimal-deps.csv.lzma and pypi-deps.csv.lzma

Scopus

Database of abstracts and citations

Stack Overflow Annual Developer Survey

Annual developer survey by Stack Overflow

Stanford GraphBase

Literate Programming with more than 30 examples

Stanford Large Network Dataset Collection

Collection from the Stanford Network Analysis Project, including social networks, citation and collaboration networks, road networks and Wikipedia networks

SuiteSparse Matrix Collection

Collection of sparse matrices

UCI Network Data Repository

Data sets from the UCI Network Data Repository, including collections of classic network data sets and data sets curated by research groups or organisations

Computer Networks

CAIDA Data

Internet topology showing the arrangement and interconnection of devices within autonomous systems (AS) on the Internet

Click Dataset

Around 53.5 billion HTTP requests from users at Indiana University

ClueWeb09 Dataset

Approximately 1 billion web pages in ten languages, collected in January and February 2009

ClueWeb12 Dataset

733,019,372 English-language web pages, collected between 10 February 2012 and 10 May 2012

Common Crawl

Free, open repository of web crawling data

Criteo 1TB Click Logs Dataset

Feature values and click data for millions of display ads to evaluate algorithms for predicting click-through rate (CTR)

Merklemap DNS records database

Database of DNS records containing over 4 billion entries

MIRAGE Project

Reproducible architecture for capturing mobile app traffic and generating ground-truth data

MobiPerf

MobiPerf is an open-source application for measuring network performance (throughput, latency, etc.) on mobile platforms

Shopper Intent Prediction from Clickstream E‑Commerce Data

Prediction of purchase intent based on e-commerce clickstream data

Stanford Internet Research Data Repository

Public archive of research datasets describing hosts, services and websites on the internet

Open Observatory of Network Interference (OONI)

Non-profit open-source software project aimed at supporting decentralised initiatives to document internet censorship worldwide

Project Sonar

SSL, DNS, HTTP and UDP connections on public networks

UCSD Network Telescope

Passive traffic monitoring system based on a globally routed but lightly utilised /9 and /10 network

Energy sector

Almanac of Minutely Power dataset (AMPds)

Two years’ worth of minute-by-minute measurement data on electricity, water and natural gas

Commercial Building Energy Dataset (COMBED)

Energy-related dataset from a commercial building, with data recorded more than once per minute

Direct Borohydride Fuel Cell (DBFC) Dataset

Impedance and polarisation measurements at the anode using Pd/C, Pt/C and Pd-coated Ni–Co/rGO catalysts

Domestic Electrical Load Survey (DELS) Secure Data 1994–2014

The ‘DELS Secure Data’ dataset contains anonymised survey responses

ECO data set (Electricity Consumption & Occupancy)

Non-intrusive load monitoring and occupancy detection over an eight-month period in six Swiss households

EIA-923

The EIA-923 questionnaire collects detailed electricity data on power generation, fuel consumption, fossil fuel stocks and goods received at the level of power stations and generating units

Global Power Plant Database

Global open-source database for power stations

Household Electricity Study - EV0702

Data on household electricity consumption from April 2010 to April 2011 from domestic appliances in a total of 251 owner-occupied households across England

High Frequency EMI Data Set (HFED)

Data set on high-frequency electromagnetic interference (EMI) containing measurement curves derived from a signal analyser and a Universal Software Radio Peripheral (USRP)

Moroccan buildings’ electricity consumption dataset (MORED)

Data on the electricity consumption of various urban buildings in Moroccan cities

Marktstammdatenregister (MaStR)

Basic data on the electricity and gas market

Proton Exchange Membrane (PEM) Fuel Cell Dataset

Standard tests on Nafion-112 membranes and MEA activation tests of a PEM fuel cell under various operating conditions

Plug Load Appliance Identification Dataset (PLAID)

Voltage and current measurements at a sampling rate of 30 kHz on 11 different appliance types in more than 60 households in Pittsburgh, Pennsylvania

Public Utility Data Liberation Project (PUDL)

Open-source data processing pipeline that facilitates access to US energy data and its programmatic use

Smart Meter Data Listing

List of datasets relating to smart meters

SynD

Synthetic energy dataset for non-intrusive load monitoring (NILM), with a focus on residential buildings

tracebase data set

Collection of electricity consumption data for research purposes in the field of energy analysis

UK Domestic Appliance-Level Electricity (UK-DALE) dataset

Electricity demand from five homes and individual appliances recorded every six seconds

Indian Dataset for Ambient Water and Energy

Energy monitoring and energy consumption of a home in India over 73 days

Financial sector

BIS Data Portal

The Bank for International Settlements (BIS) provides statistics in collaboration with central banks and other national authorities

Cboe Futures Exchange Market Data

Daily market statistics and closing prices, price summaries and other market data services

EDGAR

Electronic Data Gathering, Analysis, and Retrieval (EDGAR) is the central system for companies filing documents under the Securities Act, Securities Exchange Act, Trust Indenture Act and Investment Company Act

FAANG- Complete Stock Data

Data on the shares of FAANG (Facebook, Amazon, Apple, Netflix and Google) companies since they were first listed

Federal Reserve Economic Data (FRED)

Online database comprising hundreds of thousands of time series of economic data from numerous national, international, public and private sources

Google Finance

Search for shares, ETFs, etc.

Nasdaq Data Link

Platform for financial and alternative data, providing financial professionals with useful information and tools for capturing, managing and analysing data

NYSE Exchange Proprietary Market Data

Low-latency real-time market data covering the various asset classes and markets of the NYSE Group

Yahoo Finance

Financial news, data and commentary, including share prices, press releases, financial reports and original content

Geosciences and Environmental Sciences

AODN Portal

Data from the Australian Ocean Data Network (AODN) and the Integrated Marine Observing System (IMOS)

Alabama’s Real-Time Coastal Observing System (ARCOS)

Environmental monitoring data in and around Mobile Bay

BODC Database

Collection of marine datasets from the British Oceanographic Data Centre (BODC)

Common Metadata Repository (CMR)

Search API for NASA metadata on remotely sensed geosciences

Earth Models

Modelling tools and datasets relating to the Earth

Earthdata Data Catalog

The Earth Science Data Systems (ESDS) programme offers free access to NASA’s archive of geoscience data

Earthquake Catalog

Current or past earthquakes, earthquake resources by state and web services

Global Volcanism Program

Catalogue of Holocene and Pleistocene volcanoes and their eruptions over the last 12,000 years

Global Wind Atlas

Web-based application for decision-makers, planners and investors to identify areas with strong winds for wind energy generation

Meteoritical Bulletin Database

International database of officially recognised meteorites and their locations

National Data Buoy Center

Meteorological and oceanographic measurements for the marine environment

National Estuarine Research Reserve System

Short-term fluctuations and long-term changes in the integrity and biodiversity of estuarine ecosystems and coastal waters

Norwegian Polar Data Centre: Datasets

Antarctica, the Arctic Ocean and Svalbard

PANGAEA Publisher for Earth & Environmental Science

Georeferenced data on chemistry, the lithosphere and atmosphere, biology and palaeontology, oceans and land areas, fisheries and agriculture etc.

Radiance – Global Light Pollution Visualization & Analysis

For astrophotography, astrophysics and the protection of the night sky

UC Irvine Machine Learning Repository

Machine learning datasets containing data on air quality, ozone detection, greenhouse gas concentrations, aquatic toxicity and more

UK National Data Repository (NDR) for offshore petroleum-related licence information

In future, records relating to licences for the exploration and storage of carbon dioxide will also be stored

WHPA Prediction

Dataset from the study ‘A new framework for experimental design using Bayesian Evidential Learning’

Government information

Datos Argentina

Data repository of the Argentine Nation

Australian Bureau of Statistics

Australian Bureau of Statistics

Data.gov.au

Open government data in Australia

data.gv.at

Central catalogue containing metadata from the decentralised data catalogues of Austrian public authorities

Data.Gov.be

The Belgian data portal

dados.gov.br

Brazilian Open Data Portal

GovData

Data portal for Germany with legal texts, studies and guidelines on ‘Open Government’

open.canada.ca

‘Open Government’ portal of the Government of Canada

datos.gob.cl

Data sets from public institutions in Chile

EU Open Data Portal

The official portal for European data

Metadaten Verbund (MetaVer)

Joint portal of the German federal states of Brandenburg, Bremen, Hamburg, Hesse, Mecklenburg-Western Pomerania, Saarland, Saxony and Saxony-Anhalt

National Bureau of Statistics of China (NBS)

Open data from the Chinese National Bureau of Statistics

Debt to the Penny

Information from the US Department of the Treasury on total outstanding government debt

National Archives

The National Archives and Records Administration (NARA) archives documents and materials produced in the course of the US federal government’s activities

Eurostat

Statistics and data on Europe

EveryPolitician

OpenSanctions’ global database of political office holders

StatsPolicy|gov

Decentralised network of the US federal statistics system

Finnish open data

Finnish open data portal

data.gouv

Platform for French open data

GENESIS-Online

Database of the German Federal Statistical Office

data.gov.gr

Greek register for open public sector data

Open Government Data (OGD) Platform India

Portal for open government data from the Indian Government’s National Informatics Centre (NIC)

data.go.id

Data and official public information from the Indonesian government

data.gov.ie

Ireland’s open data portal

data.gov.il

Databases of all Israeli ministries

dati.gov.it

Open data from the Italian public administration

e-Stat Portal Site of Official Statistics of Japan

Portal for Japanese government statistics

data.public.lu

Luxembourg Open Data Platform

data.gov.my

Malaysia’s official open data portal

datos.gob.mx

Mexican national open data platform

date.gov.md

Moldovan government data portal

data.overheid.nl

Dutch government data register

stats.govt.nz

Statistics from New Zealand’s official statistics agency, Stats NZ (Tatauranga Aotearoa)

OECD Data

Data from the Organisation for Economic Co-operation and Development

Open Data Hub

Open data catalogue focusing on mobility and tourism

pordata.pt

PORDATA was organised and developed by the Francisco Manuel dos Santos Foundation

data.gov.ro

Romanian open data sets provided by public authorities and institutions

data.gov.ru

Russia’s open data register

data.gov.sg

Singapore’s open data portal

Stats SA

Statistics of the Republic of South Africa

opendata.swiss

Swiss Open Government Data

data.gov.tw

Taiwanese Open Government Data

Tunisia Data Portal

Tunisia’s data portal

data.gov.uk directory

Data from the UK central government, local authorities and public bodies

Geographic Data Service

UK Research and Innovation (UKRI) Smart Data Research (SDR UK)

Healthy and Sustainable Places (HASP) Data Service

Smart data for a better understanding of the quality of life and sustainability of places

United States Census Bureau

Data from the United States Census Bureau

National Center for Health Statistics (CDC)

Data and analysis tools from the National Center for Health Statistics

U.S. Department of Housing and Urban Development’s Office of Policy Development and Research (PD&R)

Research findings, publications and datasets on housing, community development and other areas in the United States

data.gov

Data, tools and resources from the US government

OpenFDA

Data from the Food and Drug Administration (FDA) of the US Department of Health and Human Services

National Center for Education Statistics (NCES)

Data on the state of education in the United States

United States Patent and Trademark Office (USPTO)

The USPTO’s data platform

Congressional Research Service

Reports from the Congressional think tank

Uganda Bureau of Statistics

Data portals of the Uganda Bureau of Statistics

data.gov.ua

Ukraine’s data portal

catalogodatos.gub.uy

Uruguay’s open data

IATI Country Development Finance Data

Data on development and humanitarian activities, by country, reporting organisation and sector

UNdata

Resources from the United Nations (UN) statistical system and other international organisations

UNESCO Datahub

Data from UNESCO initiatives in the fields of education, science, culture and communication

UNICEF Data and Analytics

Data on the situation of children and women worldwide

World Bank Open Data

World Bank open data platform

Healthcare

covid-19-lake

AWS S3 Explorer

COVID-19 Case Surveillance Public Use Data

COVID-19 case surveillance up to 1 July 2024

Health Inspection Scores (2024-Present)

Results of health inspections carried out by the San Francisco Department of Public Health from 2024 to the present

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE

COVID-19 database from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (archived on 10 March 2023)

NYT Coronavirus (Covid-19) Data in the United States

A database containing data on coronavirus cases and deaths in the US

HealthData.gov

Data, tools and resources in the field of health and social care

The COVID Tracking Project

Reported data in various units and according to different definitions used by US states and territories

Vitalnet Data Scenarios

A Vitalnet ‘data scenario’ is a complete data analysis situation

Genomic Data Commons (GDC)

A repository and computing platform for cancer researchers focusing on cancer, its clinical course and response to therapies

Gapminder

Complete datasets with hundreds of indicators

Medical Subject Headings

The ‘Medical Subject Headings’ (MeSH) thesaurus is a controlled and hierarchically structured vocabulary created by the National Library of Medicine

MeDAL dataset

Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining (MeDAL)

Medicare Coverage Database (MCD)

Procedures and timelines for determining insurance coverage

data.cms.gov

Data from the Centers for Medicare & Medicaid Services (CMS)

Nightingale Open Science

Datasets on heart attacks, cancer metastases, cardiac arrest, bone ageing, Covid-19

Ebola Cases and Deaths in Affected Countries

Total number of probable, confirmed and suspected Ebola cases and deaths in Guinea, Liberia, Sierra Leone, Nigeria, Senegal, Mali, Spain, the USA, the UK and Italy

Organisation Data Service (ODS)

Data service of the National Health Service (NHS) in England

OpenPaymentsData.CMS.gov

Payments to hospitals and healthcare providers by medical companies

PhysioNet

Databases from PhysioNet

Spanish Flu Dataset

Mortality resulting from the 1918 influenza pandemic, Chicago, USA

Cancer Imaging Archive

The Cancer Imaging Archive (TCIA) is a service that anonymises and provides access to an extensive archive of medical cancer images

US Water Quality Data by ZIP Code

Daily data on water quality in the USA by ZIP code: breaches of EPA regulations, lead levels and safety ratings

The Global Health Observatory

The GHO data archive is the WHO’s portal for health-related statistics from its 194 member states

Informatics for Integrating Biology & the Bedside (i2b2)

NLP research datasets

Image Processing

10k US Adult Faces Database

Over ten thousand natural face photographs, along with various measurements for 2,222 of these faces, including memorability scores, psychological features and annotations on landmarks

Action Similarity Labeling (ASLAN) Challenge

Video database containing actions and a comprehensive test protocol for investigating the similarity of actions

Affective Image Classification

Affective image classification using features inspired by psychology and art theory

AI Detector Arena Benchmark Dataset

Dataset for evaluating tools that detect AI-generated images

Airborne Object Tracking Dataset (AOT)

Dataset for tracking airborne objects

All-Age-Faces (AAF) Database

The All-Age-Faces (AAF) dataset contains 13,322 facial images of predominantly Asian individuals from all age groups

animals with attributes

A dataset for attribute-based classification

Arabic Font Classification

Dataset for the classification of Arabic fonts

Biometrically Filtered Famous Figure (B3FD) Dataset

Dataset containing facial images for age estimation

CADDY Underwater Stereo-Vision Dataset

Human-Robot Interaction (HRI) for divers and autonomous underwater vehicles

Caltech Vision Lab Datasets

See also caltechvisionlab.github.io

Cat Dataset

Over 9,000 images of cats with annotated facial features

CCAgT

Images of cervical cells stained using the AgNOR method

Chars74K dataset

Character recognition in natural images

Cube++

4,890 images of various scenes under different conditions

Danbooru2021

Extensive anime image database with over 4.9 million images and over 162 million tags

Densely Annotated Video Driving Data Set

28 driving sequences recorded in the CARLA simulator, comprising a total of 10,767 individual frames

ETH Entomological Collection (ETHEC) Dataset

Data for hierarchical image classification using entailment cone embeddings

Face Image Project

Unfiltered faces for gender and age classification

Face Recognition Databases

Datasets for benchmarking face recognition algorithms

FlickrLogos

Company logos from Flickr in various situations

Fluorescent Neuronal Cells v2

Collection of fluorescence microscopy images and their corresponding ground-truth annotations

HumanEva Dataset

Seven calibrated video sequences synchronised with 3D body postures

IEEE DataPort: Image Processing

IEEE datasets for image processing

ImageNet

Image database organised according to the WordNet hierarchy

Indoor Scene Recognition

Images for indoor scene recognition

Iranis Dataset

Extensive dataset containing more than 83,000 images of Persian numbers and letters sourced from real-world vehicle number plates

KITTI Vision Benchmark Suite

Computer vision benchmarks for real-world environments, focusing on stereo, optical flow, visual odometry, 3D object detection and 3D tracking

Labeled Information Library of Alexandria: Biology and Conservation (LILA BC)

Repository for datasets from the fields of biology and conservation

Labeled Faces in the Wild (LFW) Dataset

Database of facial photographs for studying the problem of unconstrained face recognition

LLVIP: A Visible-infrared Paired Dataset for Low-light Vision

Paired visible-infrared datasets for image processing in low-light conditions

Multi-View Region of Interest Prediction Dataset for Autonomous Driving

Multi-view images captured in the CARLA simulator with annotations for regions of interest

Newspaper Navigator

Experimental application for locating historical newspaper images based on visual similarity

Open Images Dataset V6

1,743,042 training images with bounding boxes, object segmentations, visual relationships and localised descriptions

Oxford-IIIT Pet Dataset

Dataset with 37 categories of pets

Roboflow Computer Vision Datasets

Public datasets for computer vision

Stanford Dogs Dataset

Images of 120 dog breeds from around the world with annotations from ImageNet

SUN database project

Collection of annotated images featuring a wide variety of environmental scenes, locations and the objects within them

SVIRO Dataset and Benchmark

Synthetic dataset for the detection and classification of rear seat occupancy in vehicle interiors

TikTok dataset

Dataset published at CVPR 2021, presented in the paper ‘Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos’

Violent-Flows Database

Database and benchmark for crowd violence and non-violence

Visual Genome

Dataset and knowledge base for linking structured image concepts with language

X-ray images

The X-ray images contained in GDXray+ may only be used for research and educational purposes

YouTube-BoundingBoxes Dataset

Extensive dataset of video URLs with densely distributed, high-quality annotations of bounding boxes for individual objects

YouTube-8M Segments

Human-verified labels for approximately 237,000 segments across 1,000 classes

Medicine

BCNB

Whole-slide image (WSI) dataset of core needle biopsies in early-stage breast cancer

Broad Bioimage Benchmark Collection

The Broad Bioimage Benchmark Collection (BBBC) is a collection of microscopy image sets. In addition to images, each set contains a description of the biological application and expected results

Catalogue Of Somatic Mutations In Cancer (COSMIC)

Data from COSMIC, the Cell Lines Project, Actionability and the Cancer Mutation Census (CMC)

CCLE Cancer Cell Line Encyclopedia

Cancer cell line models for research into cancer biology, validation of cancer targets and determination of drug efficacy

Genomics of Drug Sensitivity in Cancer datasets

Datasets and features on the genomics of drug sensitivity in cancer

Grand Challenge

Platform for machine learning in medical imaging

HMS LINCS Project

The LINCS project collects and disseminates data and analytical tools to understand how human cells respond to perturbations caused by drugs, the environment and mutations

Serratus

Collaborative open-science project for virus detection

Stowers Original Data Repository

The data underlying scientific publications from the Stowers Institute for Medical Research

Natural Language

Automatic Keyphrase Extraction

Datasets for the automatic extraction of key phrases

The Big Bad NLP database

More than 400 well-structured NLP datasets for common NLP tasks and requirements, such as document classification, automatic image captioning, dialogues, clustering, intent classification, language modelling, machine translation, text corpora and much more

Blizzard Challenge 2018

Approx. 6.5 hours of British English speech data from a single female speaker

The Blog Authorship Corpus

Posts from 19,320 bloggers collected from blogger.com in August 2004

CLiPS Stylometry Investigation (CSI) Corpus

Annually updated corpus of student essays and reviews

DBpedia

Latest releases of core data extracted from en.wikipedia.org

List of Dirty, Naughty, Obscene, and Otherwise Bad Words

Filter for Shutterstock’s autocomplete server and recommendation engine

European Parliament Proceedings Parallel Corpus 1996-2011

A parallel corpus for statistical machine translation

Explanation Bank

Inference algorithms that answer complex questions and provide explanations understandable to humans

German Political Speeches Corpus and Visualization

Political speeches by leading German politicians, predominantly delivered from 1990 onwards

Google Books Ngram Viewer Datasets

The Google Books Ngram Viewer is optimised for quickly querying the usage of short phrases

Gutenberg Offline Catalogs

eBooks from Project Gutenberg

The LJ Speech Dataset

Public domain speech dataset consisting of 13,100 short audio clips

Making Sense of Microposts (#Microposts2016)

Tweets from the Redites project covering numerous notable events from 2011 and 2013

MC-AFP

Machine comprehension dataset based on the Gigaword dataset

Machine Comprehension Test (MCTest)

Collection of 660 stories and associated questions

MS MARCO

Datasets for natural language generation, text passage ranking, key term extraction, dialogue-oriented search and a crawling dataset

Multi-Domain Sentiment Dataset

Product reviews from Amazon.com across many different product categories (domains)

No Language Left Behind (NLLB - 200vo)

Dataset based on metadata for bitexts published by Meta AI

Noisy speech database for training speech enhancement algorithms and TTS models

Database containing clean and noisy parallel speech

Personae Corpus

The ‘Personae’ corpus was compiled for experiments on author attribution and personality prediction

SMS Spam Collection

The corpus was compiled from free or freely accessible sources on the internet for research purposes

SQuAD2.0 – The Stanford Question Answering Dataset

SQuAD 2.0 tests a system’s ability not only to answer reading comprehension questions, but also to provide no answer when a question cannot be answered

Universal Dependencies

Framework for the consistent annotation of grammar (parts of speech, morphological features and syntactic dependencies) in various human languages

USENET corpus

Collection of public USENET posts between October 2005 and January 2011

Web 1T 5-gram Version 1

The N-gram frequencies were generated from texts sourced from publicly accessible websites

Wikidata

Wikidata database dumps

Wordbank

An open database on children’s vocabulary development

WordNet – A Lexical Database for English wndb(5WN)

A comprehensive lexical database of the English language. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitively related words (synsets)

Neuroscience

Allen Institute Brain Knowledge Platform

The Brain Knowledge Platform’s data catalogue provides access to a wide range of projects and datasets

BrainOmics Neuroimaging Genetics

Links between neuroimaging, genetics and cognitive data

codeneuro neurofinder

Each dataset is available as a ZIP file and contains images, reference neuron regions, metadata and code for loading the data

CRCNS - Collaborative Research in Computational Neuroscience

Data from the first round of data-sharing projects supported under the CRCNS funding programme

Child Mind Institute

International Neuroimaging Data-Sharing Initiative (INDI)

Human Connectome Project (HCP) Young Adult

Study data from the HCP Young Adult (HCP-YA) project

National Database for Autism Research (NDAR)

Data on autism spectrum disorders at all levels of biological and behavioural organisation

NIMH Data Archive

Data archive of the National Institute of Mental Health (NIMH)

NeuroElectro

Electrophysiological properties, such as resting membrane potentials and membrane time constants, of various neuron types

NeuroMorpho.Org

Collection of digitally reconstructed neurons and glial cells

Open Access Series of Imaging Studies (OASIS)

Brain neuroimaging datasets

Open NeuroData Registry

Numerous neuroimaging datasets (as pre-computed Neuroglancer volumes) from various modalities

OpenfMRI

Archive of human brain imaging data collected using MRI and EEG techniques

OpenNeuro

Platform for the validation and exchange of BIDS-compliant MRI, PET, MEG, EEG and iEEG data

StudyForrest

Data on brain structure, brain function and the characteristics of film stimuli

GigaDB

2,669 discoverable, traceable and citable datasets

Physics

CERN Open Data portal

Archived results of various research activities, along with associated software and documentation

IceCube Neutrino Observatory

IceCube neutrino point-source data pointing towards TXS 0506+056

Gravitational Wave Open Science Center (GWOSC)

Data from gravitational-wave observatories

NASA Exoplanet Archive

Planetary parameters for confirmed planets

Entry Points to NASA Science Data

Thematic archives on stars, planets and other celestial bodies, the Sun, our Earth and cells

Quantum simulations

Numerical simulation of an electron in a two-dimensional confinement potential

Sloan Digital Sky Survey (SDSS)

Mapping the near and distant universe to understand the physical processes that govern our universe

Search engines

Academic Torrents

Scalable BitTorrent infrastructure

Data Basis

Non-governmental organisation operating Brazil’s largest public data platform

Data Commons

Data Commons is a Google initiative designed to enable the exploration of diverse, standardised data using a unified Knowledge Graph

DataHub Collections

Curated datasets

Domains Project

The world’s single largest internet domains dataset

ERIC - Education Resources Information Center

Internet-based database containing bibliographic references and full-text documents in the field of educational research and information

Galaxy Europe

Thousands of tools, quotas and computing infrastructure as part of ‘Training Infrastructure as a Service’ (TIaaS)

Google Dataset Search

Name, description, author and publication formats of datasets

Harvard Dataverse

Repository for research data and code

ICPSR

Bibliography, variable search and thematic collection of the Inter-university Consortium for Political and Social Research (ICPSR)

Kaggle Datasets

Kaggle supports a wide range of formats for the publication of datasets

National Technical Reports Library (NTRL)

Collection of technical reports funded by the US government

NFDI4DS Portal

Research data from the NFDI4DataScience (NFDI4DS) consortium

ODI Certified Datasets

Datasets certified by the Open Data Institute (ODI)

Open Data Inception

Open data portals worldwide

PaN-Finder

Building on the PaNOSC project, data catalogues from major research institutions are being linked together

Registry of research data repositories (re3data)

Global directory of research data repositories across all research disciplines

Statista

Portal for market data, market research and market studies

Zenodo

Repository for research results from the OpenAIRE project funded by the European Commission

Social sciences

ACLED

ACLED is an independent, impartial conflict monitoring organisation that provides real-time data and analysis on violent conflicts and protests in all countries and regions worldwide

ARED

ARED is a collection of biographical and professional information on individuals who form the top echelons of authoritarian regimes

CanLII

Canadian Legal Information Institute

CEWS

Statistical data on gender representation in academia

COW

COW promotes the collection, dissemination and use of accurate and reliable quantitative data in the field of international relations

Cryptome

Cryptome publishes open, secret and classified documents

data.police.uk

Street-level information on criminal offences, investigation outcomes and stop-and-searches, broken down by police district

Employment Research Data Center

Data from numerous research and evaluation projects at the Upjohn Institute, funded by the US Department of Labor

ESS Data Portal

ESS is a scientifically oriented cross-national survey

FBI Hate Crimes Report 2013

Aggregated data from all US states

Fragile States Index

The Fund for Peace (FFP) compiles the Fragile States Index, a ranking of 178 countries that assesses the risks and vulnerabilities of individual states using 12 indicators

GDELT Project

The Global Knowledge Graph connects people, organisations, places, topics, numbers, images and emotions into a single network spanning the entire planet

Global Religious Futures Project

Religious change and its impact on societies worldwide

GSS

The GSS data includes information on pets, credit history, social networks, the importance of cultural values, as well as interviewer characteristics and observations

Gun Violence Data

Database containing records of over 260,000 incidents of gun violence in the US from January 2013 to March 2018

Humanitarian Data Exchange

HDX is an open platform for sharing data across crises and organisations

IDB Open Data

IDB data on economic and social development in Latin America and the Caribbean

INED surveys and data

Online catalogue of surveys and data from the French National Institute for Demographic Studies (INED)

INFORM Severity Index

INFORM is a collaboration between the Inter-Agency Standing Committee Reference Group on Risk, Early Warning and Preparedness and the European Commission

INSCR

INSCR was established to coordinate and consolidate the information resources created and used by the Center for Systemic Peace

Integrated Civil Society Organizations System

The iCSO system facilitates cooperation between civil society organisations and DESA

International Networks Archive (INA)

The INA on weapons, books and capital flows

International Social Survey Programme

The ISSP is a cross-national collaborative programme that conducts annual surveys on various topics relevant to the social sciences

IPUMS

IPUMS provides census and survey data from around the world, linked together in time and space

Mass Mobilization Protest Data

Protests against governments in all countries, 1990–2020

Microsoft Academic Graph

The Microsoft Academic Graph is a heterogeneous graph comprising datasets on academic publications, citation relationships between these publications, as well as authors, institutions, journals, conferences and subject areas

ND-GAIN

ND-GAIN is a measurement tool that helps governments, businesses and communities to analyse risks exacerbated by climate change, such as overpopulation, food insecurity, inadequate infrastructure and civil conflicts

OpenSanctions

OpenSanctions is an international database of individuals and companies of political, criminal or economic interest

Our World in Data

‘Our World in Data’ focuses on the world’s major and daunting challenges: poverty, disease, hunger, climate change, war, existential risks and inequality

Oxford Research Encyclopedia of International Studies

The ‘Encyclopedia of International Studies’ is now available as the ‘Oxford Research Encyclopedia (ORE) of International Studies’, featuring new and revised articles

Reality Commons

Ways in which smartphones can be used to study human interactions beyond traditional methods based on surveys or simulations

Stack Exchange Data Explorer

Open-source tool for running queries on public data from the Stack Exchange network

Titanic Dataset

Dataset for predicting the survival of Titanic passengers

UC DATA

UC Berkeley’s archive of digitised social science data and statistics

UCLA Social Science Data Archive

The Social Science Data Archive has been operating at UCLA since 1961

Uppsala Conflict Data Program

The UCDP, part of the Department of Peace and Conflict Research, provides data on organised violence

World Inequality Database

The World Inequality Database (WID) provides a database on the historical development of global income and wealth distribution, both within individual countries and between countries

WorldPop

Population data at local level, including tracking progress towards the Sustainable Development Goals

Joshua Project

Data for a specific continent, region, country, religion, affinity group or population group

Transport and traffic

Autobahn App API

API for up-to-date administrative data on roadworks, traffic jams and charging stations

Aviation accident database

All accidents in civil and commercial aviation involving passenger aircraft in scheduled and non-scheduled services worldwide

BASt Datensammlungen

Data on bridge and civil engineering, road construction, behaviour and safety, and traffic engineering from the Federal Roads and Transport Agency BASt

Bike Share Data Systems

Data portals for bike-sharing schemes

BIXI Open data

Members vs. occasional users, journey history and station status

Chicago Metropolitan Agency for Planning: Transportation Data

CMAP traffic forecasts, based on a comprehensive regional modelling system

Czech National Traffic Information Registry

Overview of traffic information sources and their providers, including a technical description of formats and protocols

Darmstadt Mobilität

Mobility data from Darmstadt

Data Expo 2009: Airline on time data

Arrival and departure details for commercial flights within the USA from October 1987 to April 2008

data.europa.eu: Transport

EU transport datasets

DB AG APIs und Datenströme

OpenAPI, AsyncAPI, RIS-API and GTFS, GTFS-RT, RiFahrt

Datastore.brussels: Transport

Traffic datasets from Brussels

Düsseldorf Verkehrsmeldungen – Mobilitätsdaten

Traffic updates and geodata from the City of Düsseldorf

England National Highways

Up-to-date traffic information from the National Traffic Information Service

Fatality Analysis Reporting System (FARS)

FARS reports on fatal accidents

Finnish Transport Infrastructure Agency

Open data from the Finnish Transport Agency

Fintraffic Data sources

Traffic information from the traffic management systems of ITM Finland Ltd.

Freight Analysis Framework Data

Freight transport analysis by the BTS and the FHWA

gencat.cat

Mobility and traffic data for Catalonia

GeoLife GPS Trajectories

GPS movement data from the Geolife project (Microsoft Research Asia) covering 182 users from April 2007 to August 2012

Jena Open Data: Mobilität

Parking, traffic disruptions, cycle routes for tourists, roadworks, etc.

Köln: Transport und Verkehr

Transport and traffic data for the city of Cologne

Transport for London

List of available TfL data feeds

MobiData BW

Mobility data from the Baden-Württemberg Local Transport Authority

Mobilithek

A platform for the exchange of digital information between mobility providers, infrastructure operators, transport authorities and information providers

NDW Open Data

Dutch mobility data

Open Data im Tourismus

Knowledge graphs covering the domains of attractions, events, tours, accommodation providers and restaurants

Open.NRW: Verkehr

Transport datasets from the state of North Rhine-Westphalia

OpenFlights Airports Database

The basic data on airports comes from DAFIF and OurAirports

OpenStation

Central data source from DB InfraGO for open data on the infrastructure of passenger railway stations in Germany

Paris Data Comptage routier

Road counting – traffic data from permanent sensors

Pedestrian Counting System

Hourly pedestrian counts since 2009, recorded by pedestrian sensors in Melbourne

renfe Data

Data from the Spanish railways

Schweizer Bundesamt für Strassen ASTRA

Traffic data from ASTRA

Traffic Scotland Data Hub

Traffic and travel information from Traffic Scotland

SF Bay Area Bike Share

Bay Area Bike Share regularly publishes open data

Tark Tee Smart Road DATEX II data gateway

Traffic and road-related information from the Estonian Transport Authority in DATEX II format

TLC Trip Record Data

Trip records for yellow and green taxis from the New York City Taxi and Limousine Commission (TLC)

Toronto’s Open Data: Transportation

Transport datasets from Toronto

Uber TLC FOIL Response

Uber journey data requested via a Freedom of Information request to the New York City Taxi & Limousine Commission

UK National Highways

Highways Agency data on journey times and traffic flow on the road network

US Bureau of Transportation Statistics

BTS databases

US domestic flights from 1990 to 2009

US domestic flights from 1990 to 2009

US Traffic Volume Trends

Monthly report based on hourly traffic count data reported by US states

Vlaams Verkeerscentrum

Data from the Flemish Traffic Centre on traffic demand and a large-scale traffic survey