Open data¶
A topic-based overview of public repositories containing research data.
Agricultural sciences¶
- AQUASTAT Dissemination System
Global information system of the Food and Agriculture Organization of the United Nations (FAO) on water resources and agricultural water management
- Data Commons Agriculture
Data Commons brings together public data from many parts of the world, including surveys
- FoodData Central
Food composition data from the United States Department of Agriculture (USDA)
- Hyperspectral benchmark dataset on soil moisture
Hyperspectral and soil moisture data from a lysimeter field campaign based on a soil sample. Karlsruhe (Germany), 2017
- Index DataBase
Vegetation indices from the Institute of Crop Science and Resource Conservation (INRES)
- PLANTS Database
Standardised information on vascular plants, mosses and lichens in the USA
Biology¶
- American-Gut
Open-access code and IPython notebooks from the American Gut project
- ArrayExpress - Functional Genomics Data
Data from functional genomics experiments
- Catalogue of Life (COL)
Integrated list of all known species worldwide
- Cell Image Library (CIL)
Over 12,000 datasets from the Center for Research in Biological Systems (CRBS)
- CytoImageNet
Extensive dataset for pre-training with microscopy images
- Electron Microscopy Data Bank (EMDB)
Data from cryo-electron microscopy (cryo-EM) and representative tomograms of macromolecular complexes and subcellular structures
- EMBL-EBI
Data sources and analysis tools from the European Bioinformatics Institute of the European Molecular Biology Laboratory (EMBL)
- ENCODE portal
The ENCODE Consortium is an ongoing international collaborative project involving research groups, funded by the National Human Genome Research Institute (NHGRI)
- EnsemblGenomes
Genomic data for non-vertebrate species, as well as tools for processing, analysing and visualising this data
- FireBrowse portal
FireBrowse provides access to a wide range of cancer genomics data, including clinical annotations, DNA copy number, miR, miRseq, mRNA and mRNAseq
- Gene Expression Omnibus
Functional genomics data repository that supports MIAME-compliant data submissions
- Gene Ontology
Gene Ontology (GO) knowledge base on gene functions
- Genomic Data Commons Data Portal
Data from genomic cancer studies
- Global Biotic Interactions (GloBI)
Data on species interactions, such as predator-prey, pollinator-plant, pathogen-host and parasite-host
- ICOS PSP benchmarks
A collection of practical benchmarks designed to test the scalability of classification and regression methods developed by the ICOS research group
- IGSR: The International Genome Sample Resource
As part of the ‘1000 Genomes’ project, a catalogue of common genetic variations in humans was compiled
- Journal of Cell Biology
A collection of image data relating to articles published in the ‘Journal of Cell Biology’
- KEGG: Kyoto Encyclopedia of Genes and Genomes
Database on higher-level functions and relationships within biological systems such as cells, organisms, ecosystems and the biosphere, based on information at the molecular level
- NIH Human Microbiome Project Catalog
Metadata on all reference genomes of human-related isolates and on metagenome samples from healthy humans
- National Center for Biotechnology Information
Databases, including those on chemicals and bioassays, DNA and RNA, and homology
- openSNP
Repository for genetic and phenotypic data
- palmerpenguins
Dataset for data exploration and visualisation as an alternative to Iris
- Pathguide
Resources on biological signalling pathways and molecular interactions
- RCSB Protein Data Bank (RCSB PDB)
Data on experimentally determined 3D structures, integrative 3D structures and computer-generated structural models (CSM)
- Personal Genome Project
Publicly accessible data on genomes, health and traits
- PGC Data Access Portals
Portals for querying data at the individual level or with restricted access
- Rfam
Collection of RNA families, each represented by multiple sequence alignment (MSA), consensus RNA structures and covariance models
- SSBD:database
Open resources for the analysis of microscopic images and quantitative data of biological objects, such as single molecules, cells, tissues, individuals, etc.
- UniGene
NCBI database on the transcriptome and therefore not primarily a database for genes
- UniProt
Freely accessible source of protein sequences and functional information
- UCSC Genome Browser
Sequence and annotation data for the genome sequences displayed in the UCSC Genome Browser
Chemistry¶
- Ionic Liquids Database - ILThermo
Online search tool for the thermodynamic and transport properties of ionic liquids, as well as binary and ternary mixtures containing ionic liquids
- PubChem
A collection of freely accessible chemical information from the National Center for Biotechnology Information
Climate and Weather¶
- 38-Cloud: A Cloud Segmentation Dataset
38 Landsat-8 images and manually extracted pixel-level reference values for cloud detection
- Aviation Weather Center
Warnings, forecasts and analyses of hazardous weather conditions for aviation
- Actuaries Climate Index
Monthly and seasonal data by region and component
- Average city temperatures
Daily data on average air temperatures in major cities worldwide
- Canadian Weather Information
Historical data by station name, province, territory or distance
- Caravan
Global dataset for large-sample hydrology
- CDC – Climate Data Center
Climate data from the German Weather Service (DWD)
- Climate Data Online (CDO)
Statistics, current weather observations and climate data from the Australian Data Archive for Meteorology (ADAM)
- Climatic Research Unit
Data provided by the CRU of the National Centre for Atmospheric Science (NCAS)
- Copernicus Climate Change Service (C3S)
One of the six thematic services provided under the European Union’s Copernicus programme
- European Climate Assessment & Dataset (ECA&D)
Datasets on changes in weather and climate extremes
- GDELT Project: Four Massive Datasets Charting The Global Climate Change News Narrative 2009-2020
Four extensive datasets illustrating global climate change reporting from 2009 to 2020
- NOAA Global Radiation and Aerosols (GRAD) Data
Long-term measurements of radiation, meteorological parameters and aerosols at various remote locations worldwide, as well as at sites across the Americas
- NOAA Local Climatological Data (LCD)
Summaries of climatological conditions from airports and other major weather stations
- Open-Meteo
Open weather data at a high resolution of 1 to 11 kilometres
- WorldClim
Maps, graphs, tables and data on the global climate
Complex Network¶
- Archive-IT
Archived websites and web pages
- CRAWDAD
‘Community Resource for Archiving Wireless Data at Dartmouth’ (CRAWDAD)
- DIMACS
Benchmarks comprising synthetic and real-world input generators, shortest-path solvers and scripts for generating benchmark performance reports, as well as detailed documentation
- DOI URLs
DOIs for nearly 50 million journal articles from the OAI-PMH server
- Internet Archive Dataset Collection
Extensive data archives from both institutions and individuals
- KONECT
Network datasets from the Koblenz Network Collection
- Laboratory for Web Algorithmics
Data for the WebGraph framework
- Mark Newman: Network data
Links to network datasets in GML format
- Microsoft Research Tools: code, datasets, & models
Directory of datasets, SDKs, APIs and open-source tools developed by Microsoft researchers
- NBER U.S. Patent Citations Data File
Findings, insights and methodological tools
- Network Repository
Interactive data and network data repository with real-time visual analysis, featuring thousands of datasets spanning over 30 disciplines, from biological to social network data
- NIST Complex Network Resources
Standard datasets against which algorithms and claims can be compared and verified
- The R Datasets Package
The R datasets package
- PyPi/Maven dependency data
Three LZMA-compressed files: mvn-deps.csv.lzma, mvn-minimal-deps.csv.lzma and pypi-deps.csv.lzma
- Scopus
Database of abstracts and citations
- Stack Overflow Annual Developer Survey
Annual developer survey by Stack Overflow
- Stanford GraphBase
Literate Programming with more than 30 examples
- Stanford Large Network Dataset Collection
Collection from the Stanford Network Analysis Project, including social networks, citation and collaboration networks, road networks and Wikipedia networks
- SuiteSparse Matrix Collection
Collection of sparse matrices
- UCI Network Data Repository
Data sets from the UCI Network Data Repository, including collections of classic network data sets and data sets curated by research groups or organisations
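
Several entries in this section ship compressed tabular files; the PyPi/Maven dependency data, for example, is listed as three LZMA-compressed CSV files. The following is a minimal sketch of how such a file could be previewed with the Python standard library alone. The helper name `read_lzma_csv` and its `limit` parameter are hypothetical, and the sketch assumes a UTF-8 encoded, comma-separated file; the column layout of the actual files is not described here, so the rows are returned as plain lists of strings.

```python
import csv
import lzma
from itertools import islice


def read_lzma_csv(path, limit=5):
    """Return the first `limit` rows of an LZMA-compressed CSV file.

    Hypothetical helper: assumes UTF-8 text and comma-separated values.
    """
    # In text mode, lzma.open decompresses on the fly, so the archive
    # does not have to be unpacked to disk before reading.
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # islice stops after `limit` rows instead of reading the whole file.
        return list(islice(csv.reader(handle), limit))
```

Assuming one of the files has already been downloaded to the working directory, `read_lzma_csv("pypi-deps.csv.lzma")` would return its first five rows for inspection before loading the full dataset.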
Computer Networks¶
- CAIDA Data
Internet topology showing the arrangement and interconnection of devices within autonomous systems (AS) on the Internet
- Click Dataset
Around 53.5 billion HTTP requests from users at Indiana University
- ClueWeb09 Dataset
Approximately 1 billion web pages in ten languages, collected in January and February 2009
- ClueWeb12 Dataset
733,019,372 English-language web pages, collected between 10 February 2012 and 10 May 2012
- Common Crawl
Free, open repository of web crawling data
- Criteo 1TB Click Logs Dataset
Feature values and click data for millions of display ads to evaluate algorithms for predicting click-through rate (CTR)
- Merklemap DNS records database
Database of DNS records containing over 4 billion entries
- MIRAGE Project
Reproducible architecture for capturing mobile app traffic and generating ground-truth data
- MobiPerf
MobiPerf is an open-source application for measuring network performance (throughput, latency, etc.) on mobile platforms
- Shopper Intent Prediction from Clickstream E‑Commerce Data
Prediction of purchase intent based on e-commerce clickstream data
- Stanford Internet Research Data Repository
Public archive of research datasets describing hosts, services and websites on the internet
- Open Observatory of Network Interference (OONI)
Non-profit open-source software project aimed at supporting decentralised initiatives to document internet censorship worldwide
- Project Sonar
SSL, DNS, HTTP and UDP connections on public networks
- UCSD Network Telescope
Passive traffic monitoring system based on a globally routed but lightly utilised /9 and /10 network
Energy sector¶
- Almanac of Minutely Power dataset (AMPds)
Two years’ worth of minute-by-minute measurement data on electricity, water and natural gas
- Commercial Building Energy Dataset (COMBED)
Energy-related dataset from a commercial building, with data recorded more than once per minute
- Direct Borohydride Fuel Cell (DBFC) Dataset
Impedance and polarisation measurements at the anode using Pd/C, Pt/C and Pd-coated Ni–Co/rGO catalysts
- Domestic Electrical Load Survey (DELS) Secure Data 1994–2014
The ‘DELS Secure Data’ dataset contains anonymised survey responses
- ECO data set (Electricity Consumption & Occupancy)
Non-intrusive load monitoring and occupancy detection over an eight-month period in six Swiss households
- EIA-923
The EIA-923 questionnaire collects detailed electricity data on power generation, fuel consumption, fossil fuel stocks and goods received at the level of power stations and generating units
- Global Power Plant Database
Global open-source database for power stations
- Household Electricity Study - EV0702
Data on household electricity consumption from April 2010 to April 2011 from domestic appliances in a total of 251 owner-occupied households across England
- High Frequency EMI Data Set (HFED)
Data set on high-frequency electromagnetic interference (EMI) containing measurement curves derived from a signal analyser and a Universal Software Radio Peripheral (USRP)
- Moroccan buildings’ electricity consumption dataset (MORED)
Data on the electricity consumption of various urban buildings in Moroccan cities
- Marktstammdatenregister (MaStR)
Basic data on the electricity and gas market
- Proton Exchange Membrane (PEM) Fuel Cell Dataset
Standard tests on Nafion-112 membranes and MEA activation tests of a PEM fuel cell under various operating conditions
- Plug Load Appliance Identification Dataset (PLAID)
Voltage and current measurements at a sampling rate of 30 kHz on 11 different appliance types in more than 60 households in Pittsburgh, Pennsylvania
- Public Utility Data Liberation Project (PUDL)
Open-source data processing pipeline that facilitates access to US energy data and its programmatic use
- Smart Meter Data Listing
List of datasets relating to smart meters
- SynD
Synthetic energy dataset (SynD) for non-intrusive load monitoring, with a focus on residential buildings
- tracebase data set
Collection of electricity consumption data for research purposes in the field of energy analysis
- UK Domestic Appliance-Level Electricity (UK-DALE) dataset
Electricity demand from five homes and individual appliances recorded every six seconds
- Indian Dataset for Ambient Water and Energy
Energy monitoring and energy consumption of a home in India over 73 days
Financial sector¶
- BIS Data Portal
The Bank for International Settlements (BIS) provides statistics in collaboration with central banks and other national authorities
- Cboe Futures Exchange Market Data
Daily market statistics and closing prices, price summaries and other market data services
- EDGAR
Electronic Data Gathering, Analysis, and Retrieval (EDGAR) is the central system for companies filing documents under the Securities Act, Securities Exchange Act, Trust Indenture Act and Investment Company Act
- FAANG- Complete Stock Data
Data on the shares of FAANG (Facebook, Amazon, Apple, Netflix and Google) companies since they were first listed
- Federal Reserve Economic Data (FRED)
Online database comprising hundreds of thousands of time series of economic data from numerous national, international, public and private sources
- Google Finance
Search for shares, ETFs, etc.
- Nasdaq Data Link
Platform for financial and alternative data, providing financial professionals with useful information and tools for capturing, managing and analysing data
- NYSE Exchange Proprietary Market Data
Low-latency real-time market data covering the various asset classes and markets of the NYSE Group
- Yahoo Finance
Financial news, data and commentary, including share prices, press releases, financial reports and original content
Geosciences and Environmental Sciences¶
- AODN Portal
Data from the Australian Ocean Data Network (AODN) and the Integrated Marine Observing System (IMOS)
- Alabama’s Real-Time Coastal Observing System (ARCOS)
Environmental monitoring data in and around Mobile Bay
- BODC Database
Collection of marine datasets from the British Oceanographic Data Centre (BODC)
- Common Metadata Repository (CMR)
Search API for metadata on NASA’s remotely sensed geoscience data
- Earth Models
Modelling tools and datasets relating to the Earth
- Earthdata Data Catalog
The Earth Science Data Systems (ESDS) programme offers free access to NASA’s archive of geoscience data
- Earthquake Catalog
Current or past earthquakes, earthquake resources by state and web services
- Global Volcanism Program
Catalogue of Holocene and Pleistocene volcanoes and their eruptions over the last 12,000 years
- Global Wind Atlas
Web-based application for decision-makers, planners and investors to identify areas with strong winds for wind energy generation
- Meteoritical Bulletin Database
International database of officially recognised meteorites and their locations
- National Data Buoy Center
Meteorological and oceanographic measurements for the marine environment
- National Estuarine Research Reserve System
Short-term fluctuations and long-term changes in the integrity and biodiversity of estuarine ecosystems and coastal waters
- Norwegian Polar Data Centre: Datasets
Antarctica, the Arctic Ocean and Svalbard
- PANGAEA Publisher for Earth & Environmental Science
Georeferenced data on chemistry, the lithosphere and atmosphere, biology and palaeontology, oceans and land areas, fisheries and agriculture etc.
- Radiance – Global Light Pollution Visualization & Analysis
For astrophotography, astrophysics and the protection of the night sky
- UC Irvine Machine Learning Repository
Machine learning datasets containing data on air quality, ozone detection, greenhouse gas concentrations, aquatic toxicity and more
- UK National Data Repository (NDR) for offshore petroleum-related licence information
In future, records relating to licences for the exploration and storage of carbon dioxide will also be stored
- WHPA Prediction
Dataset from the study ‘A new framework for experimental design using Bayesian Evidential Learning’
Government information¶
- Datos Argentina
Data repository of the Argentine Nation
- Australian Bureau of Statistics
Australian Bureau of Statistics
- Data.gov.au
Open government data in Australia
- data.gv.at
Central catalogue containing metadata from the decentralised data catalogues of Austrian public authorities
- Data.Gov.be
The Belgian data portal
- dados.gov.br
Brazilian Open Data Portal
- GovData
Data portal for Germany with legal texts, studies and guidelines on ‘Open Government’
- open.canada.ca
‘Open Government’ of the Canadian government
- datos.gob.cl
Data sets from public institutions in Chile
- EU Open Data Portal
The official portal for European data
- Metadaten Verbund (MetaVer)
Joint portal of the German federal states of Brandenburg, Bremen, Hamburg, Hesse, Mecklenburg-Western Pomerania, Saarland, Saxony and Saxony-Anhalt
- National Bureau of Statistics of China (NBS)
Open data from the Chinese National Bureau of Statistics
- Debt to the Penny
Information from the US Department of the Treasury on total outstanding government debt
- National Archives
The National Archives and Records Administration (NARA) archives documents and materials produced in the course of the US federal government’s activities
- Eurostat
Statistics and data on Europe
- EveryPolitician
OpenSanctions’ global database of political office holders
- StatsPolicy|gov
Decentralised network of the US federal statistics system
- Finnish open data
Finnish open data portal
- data.gouv
Platform for French open data
- GENESIS-Online
Database of the German Federal Statistical Office
- data.gov.gr
Greek register for open public sector data
- Open Government Data (OGD) Platform India
Portal for open government data from the Indian Government’s National Informatics Centre (NIC)
- data.go.id
Data and official public information from the Indonesian government
- data.gov.ie
Ireland’s open data portal
- data.gov.il
Databases of all Israeli ministries
- dati.gov.it
Open data from the Italian public administration
- e-Stat Portal Site of Official Statistics of Japan
Portal for Japanese government statistics
- data.public.lu
Luxembourg Open Data Platform
- data.gov.my
Malaysia’s official open data portal
- datos.gob.mx
Mexican national open data platform
- date.gov.md
Moldovan government data portal
- data.overheid.nl
Dutch government data register
- stats.govt.nz
Statistics from New Zealand’s official statistics agency, Stats NZ (Tatauranga Aotearoa)
- OECD Data
Data from the Organisation for Economic Co-operation and Development
- Open Data Hub
Open data catalogue focusing on mobility and tourism
- pordata.pt
PORDATA was organised and developed by the Francisco Manuel dos Santos Foundation
- data.gov.ro
Romanian open data sets provided by public authorities and institutions
- data.gov.ru
Russia’s open data register
- Singapore’s open data portal
Singapore’s open data portal
- stats sa
Statistics of the Republic of South Africa
- opendata.swiss
Swiss Open Government Data
- data.gov.tw
Taiwanese Open Government Data
- Tunisia Data Portal
Tunisia’s data portal
- data.gov.uk directory
Data from the UK central government, local authorities and public bodies
- Geographic Data Service
UK Research and Innovation (UKRI) Smart Data Research (SDR UK)
- Healthy and Sustainable Places (HASP) Data Service
Smart data for a better understanding of the quality of life and sustainability of places
- United States Census Bureau
Data from the United States Census Bureau
- National Center for Health Statistics (CDC)
Data and analysis tools from the National Center for Health Statistics
- U.S. Department of Housing and Urban Development’s Office of Policy Development and Research (PD&R)
Research findings, publications and datasets on housing, community development and other areas in the United States
- data.gov
Data, tools and resources from the US government
- OpenFDA
Data from the Food and Drug Administration (FDA) of the US Department of Health and Human Services
- National Center for Education Statistics (NCES)
Data on the state of education in the United States
- United States Patent and Trademark Office (USPTO)
The USPTO’s data platform
- Congressional Research Service
Reports from the Congressional think tank
- Uganda Bureau of Statistics
Data portals of the Uganda Bureau of Statistics
- data.gov.ua
Ukraine’s data portal
- catalogodatos.gub.uy
Uruguay’s open data
- IATI Country Development Finance Data
Data on development and humanitarian activities, by country, reporting organisation and sector
- UNdata
Resources from the United Nations (UN) statistical system and other international organisations
- UNESCO Datahub
Data from UNESCO initiatives in the fields of education, science, culture and communication
- UNICEF Data and Analytics
Data on the situation of children and women worldwide
- World Bank Open Data
World Bank open data platform
Healthcare¶
- covid-19-lake
AWS S3 Explorer
- COVID-19 Case Surveillance Public Use Data
COVID-19 case surveillance up to 1 July 2024
- Health Inspection Scores (2024-Present)
Results of health inspections carried out by the San Francisco Department of Public Health from 2024 to the present
- Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
COVID-19 database from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (archived on 10 March 2023)
- NYT Coronavirus (Covid-19) Data in the United States
A database containing data on coronavirus cases and deaths in the US
- HealthData.gov
Data, tools and resources in the field of health and social care
- The COVID Tracking Project
Reported data in various units and according to different definitions used by US states and territories
- Vitalnet Data Scenarios
A Vitalnet ‘data scenario’ is a complete data analysis situation
- Genomic Data Commons (GDC)
A repository and computing platform for researchers studying cancer, its clinical course and response to therapies
- Gapminder
Complete datasets with hundreds of indicators
- Medical Subject Headings
The ‘Medical Subject Headings’ (MeSH) thesaurus is a controlled and hierarchically structured vocabulary created by the National Library of Medicine
- MeDAL dataset
Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining (MeDAL)
- Medicare Coverage Database (MCD)
Procedures and timelines for determining insurance coverage
- data.cms.gov
Data from the Centers for Medicare & Medicaid Services (CMS)
- Nightingale Open Science
Datasets on heart attacks, cancer metastases, cardiac arrest, bone ageing, Covid-19
- Ebola Cases and Deaths in Affected Countries
Total number of probable, confirmed and suspected Ebola cases and deaths in Guinea, Liberia, Sierra Leone, Nigeria, Senegal, Mali, Spain, the USA, the UK and Italy
- Organisation Data Service (ODS)
Data service of the National Health Service (NHS) in England
- OpenPaymentsData.CMS.gov
Payments to hospitals and healthcare providers by medical companies
- PhysioNet
Databases from PhysioNet
- Spanish Flu Dataset
Mortality resulting from the 1918 influenza pandemic, Chicago, USA
- Cancer Imaging Archive
The Cancer Imaging Archive (TCIA) is a service that anonymises and provides access to an extensive archive of medical cancer images
- US Water Quality Data by ZIP Code
Daily data on water quality in the USA by postcode – breaches of EPA regulations, lead levels, safety ratings
- The Global Health Observatory
The GHO data archive is the WHO’s portal for health-related statistics from its 194 member states
- Informatics for Integrating Biology & the Bedside (i2b2)
NLP research datasets
Image Processing¶
- 10k US Adult Faces Database
Over ten thousand natural face photographs, along with various measurements for 2,222 of these faces, including memorability scores, psychological features and annotations on landmarks
- Action Similarity Labeling (ASLAN) Challenge
Video database containing actions and a comprehensive test protocol for investigating the similarity of actions
- Affective Image Classification
Affective image classification using features inspired by psychology and art theory
- AI Detector Arena Benchmark Dataset
Dataset for evaluating AI image recognition tools
- Airborne Object Tracking Dataset (AOT)
Dataset for tracking airborne objects
- All-Age-Faces (AAF) Database
The All-Age-Faces (AAF) dataset contains 13,322 facial images of predominantly Asian individuals from all age groups
- animals with attributes
A dataset for attribute-based classification
- Arabic Font Classification
Classification of Arabic fonts, see also Arabic Font Classification
- Biometrically Filtered Famous Figure (B3FD) Dataset
Dataset containing facial images for age estimation
- CADDY Underwater Stereo-Vision Dataset
Human-Robot Interaction (HRI) for divers and autonomous underwater vehicles
- Caltech Vision Lab Datasets
See also caltechvisionlab.github.io
- Cat Dataset
Over 9,000 images of cats with annotated facial features
- CCAgT
Images of cervical cells stained using the AgNOR method
- Chars74K dataset
Character recognition in natural images
- Cube++
4,890 images of various scenes under different conditions
- Danbooru2021
Extensive anime image database with over 4.9 million images and over 162 million tags
- Densely Annotated Video Driving Data Set
28 driving sequences recorded in the CARLA simulator, comprising a total of 10,767 individual frames
- ETH Entomological Collection (ETHEC) Dataset
Data for hierarchical image classification using entailment cone embeddings
- Face Image Project
Unfiltered faces for gender and age classification
- Face Recognition Databases
Datasets for benchmarking face recognition algorithms
- FlickrLogos
Company logos from Flickr in various situations
- Fluorescent Neuronal Cells v2
Collection of fluorescence microscopy images and their corresponding ground-truth annotations
- HumanEva Dataset
Seven calibrated video sequences synchronised with 3D body postures
- IEEE DataPort: Image Processing
IEEE datasets for image processing
- ImageNet
Image database organised according to the WordNet hierarchy
- Indoor Scene Recognition
Images for indoor scene recognition
- Iranis Dataset
Extensive dataset containing more than 83,000 images of Persian numbers and letters sourced from real-world vehicle number plates
- KITTI Vision Benchmark Suite
Computer vision benchmarks for real-world environments, focusing on stereo, optical flow, visual odometry, 3D object detection and 3D tracking
- Labeled Information Library of Alexandria: Biology and Conservation (LILA BC)
Repository for datasets from the fields of biology and conservation
- Labeled Faces in the Wild (LFW) Dataset
Database of facial photographs for studying the problem of unconstrained face recognition
- LLVIP: A Visible-infrared Paired Dataset for Low-light Vision
Paired visible-infrared datasets for image processing in low-light conditions
- Multi-View Region of Interest Prediction Dataset for Autonomous Driving
Multi-view images captured in the CARLA simulator with annotations for regions of interest
- Newspaper Navigator
Experimental application for locating historical newspaper images based on visual similarity
- Open Images Dataset V6
1,743,042 training images with bounding boxes, object segmentations, visual relationships and localised descriptions
- Oxford-IIIT Pet Dataset
Dataset with 37 categories of pets
- Roboflow Computer Vision Datasets
Public datasets for computer vision
- Stanford Dogs Dataset
Images of 120 dog breeds from around the world with annotations from ImageNet
- SUN database project
Collection of annotated images featuring a wide variety of environmental scenes, locations and the objects within them
- SVIRO Dataset and Benchmark
Synthetic dataset for Vehicle Interior Rear seat Occupancy (SVIRO), intended for the detection and classification of rear seat occupancy
- TikTok dataset
Dataset published at CVPR 2021, presented in the paper ‘Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos’
- Violent-Flows Database
Database and benchmark for crowd violence and non-violence
- Visual Genome
Dataset and knowledge base for linking structured image concepts with language
- X-ray images
The X-ray images contained in GDXray+ may only be used for research and educational purposes
- YouTube-BoundingBoxes Dataset
Extensive dataset of video URLs with densely distributed, high-quality annotations of bounding boxes for individual objects
- YouTube-8M Segments
Human-verified labels for approximately 237,000 segments across 1,000 classes
Medicine¶
- BCNB
WSI dataset on core needle biopsy in early-stage breast cancer
- Broad Bioimage Benchmark Collection
The Broad Bioimage Benchmark Collection (BBBC) is a collection of microscopy image sets. In addition to images, each set contains a description of the biological application and expected results
- Catalogue Of Somatic Mutations In Cancer (COSMIC)
Data from COSMIC, the Cell Lines Project, Actionability and the Cancer Mutation Census (CMC)
- CCLE Cancer Cell Line Encyclopedia
Cancer cell line models for research into cancer biology, validation of cancer targets and determination of drug efficacy
- Genomics of Drug Sensitivity in Cancer datasets
Datasets and features on the genomics of drug sensitivity in cancer
- Grand Challenge
Platform for machine learning in medical imaging
- HMS LINCS Project
The LINCS project collects and disseminates data and analytical tools to understand how human cells respond to perturbations caused by drugs, the environment and mutations
- Serratus
Collaborative open-science project for virus detection
- Stowers Original Data Repository
The data underlying scientific publications from the Stowers Institute for Medical Research
Natural Language¶
- Automatic Keyphrase Extraction
Datasets for the automatic extraction of key phrases
- The Big Bad NLP database
More than 400 well-structured NLP datasets for common NLP tasks and requirements, such as document classification, automatic image captioning, dialogues, clustering, intent classification, language modelling, machine translation, text corpora and much more
- Blizzard Challenge 2018
Approx. 6.5 hours of British English speech data from a single female speaker
- The Blog Authorship Corpus
Posts from 19,320 bloggers collected from blogger.com in August 2004
- CLiPS Stylometry Investigation (CSI) Corpus
Annually updated corpus of student essays and reviews
- DBpedia
Current releases of core data extracted from en.wikipedia.org
- List of Dirty, Naughty, Obscene, and Otherwise Bad Words
Filter for Shutterstock’s autocomplete server and recommendation engine
- European Parliament Proceedings Parallel Corpus 1996-2011
A parallel corpus for statistical machine translation
- Explanation Bank
Inference algorithms that answer complex questions and provide explanations understandable to humans
- German Political Speeches Corpus and Visualization
Political speeches by leading German politicians, predominantly delivered from 1990 onwards
- Google Books Ngram Viewer Datasets
The Google Books Ngram Viewer is optimised for quickly querying the usage of short phrases
- Gutenberg Offline Catalogs
eBooks from Project Gutenberg
- The LJ Speech Dataset
Public domain speech dataset consisting of 13,100 short audio clips
- Making Sense of Microposts (#Microposts2016)
Tweets from the Redites project covering numerous notable events from 2011 and 2013
- MC-AFP
Machine comprehension dataset based on the Gigaword dataset
- Machine Comprehension Test (MCTest)
Collection of 660 stories and associated questions
- MS MARCO
Datasets for natural language generation, text passage ranking, key term extraction, dialogue-oriented search and a crawling dataset
- Multi-Domain Sentiment Dataset
Product reviews from Amazon.com across many different product categories (domains)
- No Language Left Behind (NLLB - 200vo)
Dataset based on metadata for bitexts published by Meta AI
- Noisy speech database for training speech enhancement algorithms and TTS models
Database containing clean and noisy parallel speech
- Personae Corpus
The ‘Personae’ corpus was compiled for experiments on author attribution and personality prediction
- SMS Spam Collection
The corpus was compiled from free or freely accessible sources on the internet for research purposes
- SQuAD2.0 – The Stanford Question Answering Dataset
SQuAD 2.0 tests a system’s ability not only to answer reading comprehension questions, but also to abstain when a question cannot be answered
- Universal Dependencies
Framework for the consistent annotation of grammar (parts of speech, morphological features and syntactic dependencies) in various human languages
- USENET corpus
Collection of public USENET posts between October 2005 and January 2011
- Web 1T 5-gram Version 1
The N-gram frequencies were generated from texts sourced from publicly accessible websites
- Wikidata
Wikidata database dumps
- Wordbank
An open database on children’s vocabulary development
- WordNet – A Lexical Database for English wndb(5WN)
A comprehensive lexical database of the English language. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitively related words (synsets)
Neuroscience¶
- Allen Institute Brain Knowledge Platform
The Brain Knowledge Platform’s data catalogue provides access to a wide range of projects and datasets
- BrainOmics Neuroimaging Genetics
Links between neuroimaging, genetics and cognitive data
- codeneuro neurofinder
Each dataset is available as a ZIP file and contains images, reference neuron regions, metadata and code for loading the data
- CRCNS - Collaborative Research in Computational Neuroscience
Data from the first round of data-sharing projects supported under the CRCNS funding programme
- Child Mind Institute
International Neuroimaging Data-Sharing Initiative (INDI)
- Human Connectome Project (HCP) Young Adult
Study data from the HCP Young Adult (HCP-YA) project
- National Database for Autism Research (NDAR)
Data on autism spectrum disorders at all levels of biological and behavioural organisation
- NIMH Data Archive
Data archive of the National Institute of Mental Health (NIMH)
- NeuroElectro
Electrophysiological properties, such as resting membrane potentials and membrane time constants, of various neuron types
- NeuroMorpho.Org
Collection of digitally reconstructed neurons and glial cells
- Open Access Series of Imaging Studies (OASIS)
Brain neuroimaging datasets
- Open NeuroData Registry
Numerous neuroimaging datasets (as pre-computed Neuroglancer volumes) from various modalities
- OpenfMRI
Archive of human brain imaging data collected using MRI and EEG techniques
- OpenNeuro
Platform for the validation and exchange of BIDS-compliant MRI, PET, MEG, EEG and iEEG data
- StudyForrest
Data on brain structure, brain function and the characteristics of film stimuli
- GigaDB
2,669 discoverable, traceable and citable datasets
Physics¶
- CERN Open Data portal
Archived results of various research activities, along with associated software and documentation
- IceCube Neutrino Observatory
IceCube neutrino point-source data pointing towards TXS 0506+056
- Gravitational Wave Open Science Center (GWOSC)
Data from gravitational-wave observatories
- NASA Exoplanet Archive
Planetary parameters for confirmed planets
- Entry Points to NASA Science Data
Thematic archives on stars, planets and other celestial bodies, the Sun, our Earth and cells
- Quantum simulations
Numerical simulation of an electron in a two-dimensional confinement potential
- Sloan Digital Sky Survey (SDSS)
Mapping the near and distant universe to understand the physical processes that govern our universe
Search engines¶
- Academic Torrents
Scalable BitTorrent infrastructure
- Data Basis
Non-governmental organisation operating Brazil’s largest public data platform
- Data Commons
Data Commons is a Google initiative designed to enable the exploration of diverse, standardised data using a unified Knowledge Graph
- DataHub Collections
Curated datasets
- Domains Project
The world’s single largest internet domains dataset
- ERIC - Education Resources Information Center
Internet-based database containing bibliographic references and full-text documents in the field of educational research and information
- Galaxy Europe
Thousands of tools, quotas and computing infrastructure as part of ‘Training Infrastructure as a Service’ (TIaaS)
- Google Dataset Search
Name, description, author and publication formats of datasets
- Harvard Dataverse
Repository for research data and code
- ICPSR
Bibliography, variable search and thematic collection of the Inter-university Consortium for Political and Social Research (ICPSR)
- Kaggle Datasets
Kaggle supports a wide range of formats for the publication of datasets
- National Technical Reports Library (NTRL)
Collection of technical reports funded by the US government
- NFDI4DS Portal
Research data from the NFDI4DataScience (NFDI4DS) consortium
- ODI Certified Datasets
Datasets certified by the Open Data Institute (ODI)
- Open Data Inception
Open data portals worldwide
- PaN-Finder
Building on the PaNOSC project, data catalogues from major research institutions are being linked together
- Registry of research data repositories (re3data)
Global directory of research data repositories across all research disciplines
- Statista
Portal for market data, market research and market studies
- Zenodo
Repository for research results from the OpenAIRE project funded by the European Commission
Transport and traffic¶
- Autobahn App API
API for up-to-date administrative data on roadworks, traffic jams and charging stations
- Aviation accident database
All accidents in civil and commercial aviation involving passenger aircraft in scheduled and non-scheduled services worldwide
- BASt Datensammlungen
Data on bridge and civil engineering, road construction, behaviour and safety, and traffic engineering from the Federal Highway and Transport Research Institute (BASt)
- Bike Share Data Systems
Data portals for bike-sharing schemes
- BIXI Open data
Members vs. occasional users, journey history and station status
- Chicago Metropolitan Agency for Planning: Transportation Data
CMAP traffic forecasts, based on a comprehensive regional modelling system
- Czech National Traffic Information Registry
Overview of traffic information sources and their providers, including a technical description of formats and protocols
- Darmstadt Mobilität
Mobility data from Darmstadt
- Data Expo 2009: Airline on time data
Arrival and departure details for commercial flights within the USA from October 1987 to April 2008
- data.europa.eu: Transport
EU transport datasets
- DB AG APIs und Datenströme
OpenAPI, AsyncAPI, RIS-API and GTFS, GTFS-RT, RiFahrt
- Datastore.brussels: Transport
Traffic datasets from Brussels
- Düsseldorf Verkehrsmeldungen – Mobilitätsdaten
Traffic updates and geodata from the City of Düsseldorf
- England National Highways
Up-to-date traffic information from the National Traffic Information Service
- Fatality Analysis Reporting System (FARS)
FARS reports on fatal accidents
- Finnish Transport Infrastructure Agency
Open data from the Finnish Transport Agency
- Fintraffic Data sources
Traffic information from the traffic management systems of ITM Finland Ltd.
- Freight Analysis Framework Data
Freight transport analysis by the BTS and the FHWA
- gencat.cat
Mobility and traffic data for Catalonia
- GeoLife GPS Trajectories
GPS movement data from the Geolife project (Microsoft Research Asia) covering 182 users from April 2007 to August 2012
- Jena Open Data: Mobilität
Parking, traffic disruptions, cycle routes for tourists, roadworks, etc.
- Köln: Transport und Verkehr
Transport and traffic data for the city of Cologne
- Transport for London
List of available TfL data feeds
- MobiData BW
Mobility data from the Baden-Württemberg Local Transport Authority
- Mobilithek
A platform for the exchange of digital information between mobility providers, infrastructure operators, transport authorities and information providers
- NDW Open Data
Dutch mobility data
- Open Data im Tourismus
Knowledge graphs covering the domains of attractions, events, tours, accommodation providers and restaurants
- Open.NRW: Verkehr
Transport datasets from the state of North Rhine-Westphalia
- OpenFlights Airports Database
The basic data on airports comes from DAFIF and OurAirports
- OpenStation
Central data source from DB InfraGO for open data on the infrastructure of passenger railway stations in Germany
- Paris Data Comptage routier
Road counting – traffic data from permanent sensors
- Pedestrian Counting System
Hourly pedestrian counts since 2009, recorded by pedestrian sensors in Melbourne
- renfe Data
Data from the Spanish railways
- Schweizer Bundesamt für Strassen ASTRA
Traffic data from ASTRA
- Traffic Scotland Data Hub
Traffic and travel information from Traffic Scotland
- SF Bay Area Bike Share
Bay Area Bike Share regularly publishes open data
- Tark Tee Smart Road DATEX II data gateway
Traffic and road-related information from the Estonian Transport Authority in DATEX II format
- TLC Trip Record Data
Trip records for yellow and green taxis from the New York City Taxi and Limousine Commission (TLC)
- Toronto’s Open Data: Transportation
Transport datasets from Toronto
- Uber TLC FOIL Response
Uber journey data requested via a Freedom of Information request to the New York City Taxi & Limousine Commission
- UK National Highways
Highways Agency data on journey times and traffic flow on the road network
- US Bureau of Transportation Statistics
BTS databases
- US domestic flights from 1990 to 2009
US domestic flights from 1990 to 2009
- US Traffic Volume Trends
Monthly report based on hourly traffic count data reported by US states
- Vlaams Verkeerscentrum
Data from the Flemish Traffic Centre on traffic demand and a large-scale traffic survey
Social sciences¶
ACLED is an independent, impartial conflict monitoring organisation that provides real-time data and analysis on violent conflicts and protests in all countries and regions worldwide
ARED is a collection of biographical and professional information on individuals who form the top echelons of authoritarian regimes
Canadian Institute for Legal Information
Statistical data on gender representation in academia
COW promotes the collection, dissemination and use of accurate and reliable quantitative data in the field of international relations
Cryptome publishes open, secret and classified documents
Street-level information on criminal offences, investigation outcomes and stop-and-searches, broken down by police force
Data from numerous research and evaluation projects at the Upjohn Institute, funded by the US Department of Labor
ESS is a scientifically oriented cross-national survey
Aggregated data from all US states
The Fund for Peace (FFP) compiles the Fragile States Index, a ranking of 178 countries that assesses the risks and vulnerabilities of individual states using 12 indicators
The Global Knowledge Graph connects people, organisations, places, topics, numbers, images and emotions into a single network spanning the entire planet
Religious change and its impact on societies worldwide
The GDD data includes information on pets, credit history, social networks, the importance of cultural values, as well as interviewer characteristics and observations
Database containing records of over 260,000 incidents of gun violence in the US from January 2013 to March 2018
HDX is an open platform for sharing data across crises and organisations
IDB data on economic and social development in Latin America and the Caribbean
Online catalogue of surveys and data from the French National Institute for Demographic Studies (INED)
INFORM is a collaboration between the Inter-Agency Standing Committee Reference Group on Risk, Early Warning and Preparedness and the European Commission
INSCR was established to coordinate and consolidate the information resources created and used by the Center for Systemic Peace
The iCSO system facilitates cooperation between civil society organisations and DESA
INA data on flows of weapons, books and capital
The ISSP is a cross-national collaborative programme that conducts annual surveys on various topics relevant to the social sciences
IPUMS provides census and survey data from around the world, linked together in time and space
Protests against governments in all countries, 1990–2020
The Microsoft Academic Graph is a heterogeneous graph comprising datasets on academic publications, citation relationships between these publications, as well as authors, institutions, journals, conferences and subject areas
ND-GAIN is a measurement tool that helps governments, businesses and communities to analyse risks exacerbated by climate change, such as overpopulation, food insecurity, inadequate infrastructure and civil conflicts
OpenSanctions is an international database of individuals and companies of political, criminal or economic interest
‘Our World in Data’ focuses on the world’s major and daunting challenges: poverty, disease, hunger, climate change, war, existential risks and inequality
The ‘Encyclopedia of International Studies’ is now available as the ‘Oxford Research Encyclopedia (ORE) of International Studies’, featuring new and revised articles
Ways in which smartphones can be used to study human interactions beyond traditional methods based on surveys or simulations
Open-source tool for running queries on public data from the Stack Exchange network
Dataset for predicting which Titanic passengers survived
UC Berkeley’s archive of digitised social science data and statistics
The Social Science Data Archive has been operating at UCLA since 1961
The UCDP, part of the Department of Peace and Conflict Research, provides data on organised violence
The World Inequality Database (WID) provides a database on the historical development of global income and wealth distribution, both within individual countries and between countries
Population data at local level, including tracking progress towards the Sustainable Development Goals
Data for a specific continent, region, country, religion, affinity group or population group