Datasets in the Data Commons Graph

The base schema in the Data Commons Graph, including taxonomy, is derived from schema.org.

Bureau of Economic Analysis (BEA) - GDP Datasets

Gross domestic product (GDP) measures the overall level of economic activity in a country or region. The BEA publishes GDP by industry data at the national, state, county, and metropolitan statistical area levels.

Data Commons includes the following BEA GDP datasets:

Data made available under the public domain.

Bureau of Economic Analysis (BEA) - Regional Price Parities by Metropolitan Statistical Areas

Regional price parity (RPP) measures "geographic price level differences for one period in time within the United States". Data Commons includes the RPP indices for metropolitan statistical areas broken down by goods and services categories from BEA's "Regional Price Parities by MSA" dataset.

Data made available under BEA Terms of Service.

Bureau of Labor Statistics (BLS) Consumer Price Index (CPI) Databases

"The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services." It is often used as an indicator for inflation. Data Commons includes seasonally unadjusted average CPI for All Urban Consumers (CPI-U), CPI for Urban Wage Earners and Clerical Workers (CPI-W), and Chained CPI for All Urban Consumers (C-CPI-U), provided by the BLS monthly since 1947.

Data made available under BLS Terms of Service.

Bureau of Labor Statistics (BLS) - Job Openings and Labor Turnover Survey

The Job Openings and Labor Turnover Survey (JOLTS) program produces monthly data on job openings, hires, and separations starting December 2000. Data Commons includes both seasonally adjusted and unadjusted quarterly numbers of job postings, hires, and separations broken down by industry.

Data made available under BLS Terms of Service.

Bureau of Labor Statistics (BLS) - Monthly County-Level Employment and Unemployment

The BLS Local Area Unemployment Statistics (LAUS) program produces monthly and annual "employment, unemployment, and labor force data for Census regions and divisions, States, counties, metropolitan areas, and many cities, by place of residence." Data Commons includes employment, unemployment, and labor force by state and county from the "Labor force data by county, not seasonally adjusted" table.

Data made available under BLS Terms of Service.

Bureau of Labor Statistics (BLS) - Quarterly Census of Employment and Wages (QCEW)

From the BLS Quarterly Census of Employment and Wages (QCEW) program, Data Commons includes quarterly and annual employment and wage statistics broken down by industry and ownership (private, state government-owned, or federal government-owned) at the country, state, county, and metropolitan area level, from "QCEW NAICS-Based Data Files".

Data made available under BLS Terms of Service.

Centers for Disease Control and Prevention (CDC) - 500 Cities: Local Data for Better Health

The 500 Cities Project datasets contain model-based small area estimates for 27 measures of chronic disease related to unhealthy behaviors (5), health outcomes (13), and use of preventive services (9) for the 500 largest cities in the US and approximately 28,000 census tracts within these cities. Data Commons includes all estimates for the 500 cities.

Data made available under CDC Data Terms of Service.

Centers for Disease Control and Prevention (CDC) - Compressed Mortality

Data Commons includes the CDC "Compressed Mortality, 1999-2016" dataset, which reports mortality counts for all US states and counties broken down by underlying cause of death, age, race, sex, and year.

Data made available under CDC Wonder Data Terms of Service.

Centers for Disease Control and Prevention (CDC) - Daily County-Level PM2.5 Concentrations

Data Commons includes the CDC Daily County-Level PM2.5 Concentrations dataset, which "provides modeled predictions of particulate matter (PM2.5) levels from the EPA's Downscaler Model. These data are used by the CDC's National Environmental Public Health Tracking Network to generate air quality measures. Data are at the county levels for 2001-2014".

Data made available under CDC Wonder Data Terms of Service.

Centers for Disease Control and Prevention (CDC) - Diabetes Surveillance System

The CDC Diabetes Surveillance System estimates the number and percentage of US adults who reported ever being told by a health professional that they had diabetes using data collected by the CDC's Behavioral Risk Factor Surveillance System (BRFSS) at the country, state, and county levels. Data Commons contains the age-adjusted percentages for diabetes, obesity, and physical inactivity for total population and gender breakdown at the county level.

Data made available under CDC Data Terms of Service.

Census Bureau - American Community Survey (ACS) 5-year Estimates

The American Community Survey (ACS) "covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population." The ACS 5-year estimates are updated every year, based on the last 5 years of collected data. Data Commons includes thousands of variables across the full range of ACS topics at the country, state, county, city, zip code tabulation area, school district, census tract levels, and more.

Data Commons also computes gender income inequality from the ACS 5-year estimates of median income for males and females 15 years or older. The computation divides the difference between male and female median income by the sum of male and female median income.
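The computation described above can be sketched as follows. This is a minimal illustration, not the Data Commons implementation: the function name and the sample incomes are hypothetical, and only the formula (difference divided by sum) comes from the text.

```python
def gender_income_inequality(male_median: float, female_median: float) -> float:
    """Return (male - female) / (male + female) for two median incomes."""
    total = male_median + female_median
    if total == 0:
        # The ratio is undefined when both medians are zero.
        raise ValueError("median incomes sum to zero; ratio is undefined")
    return (male_median - female_median) / total


# Illustrative values only, not taken from any ACS table.
example = gender_income_inequality(50_000, 40_000)
print(round(example, 4))
```

For non-negative incomes the result lies between -1 and 1. A value of 0 means the two medians are equal, and a positive value means the male median is higher.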

Data made available under US Census Terms of Service.

Census Bureau - American Community Survey Education Tabulation (ACS-ED)

The National Center for Education Statistics collaborates with the US Census Bureau to create a variety of custom American Community Survey (ACS) data files that describe the condition of school-age children in the United States at the country, state, and school district level. ACS-ED is updated annually based on ACS five-year period estimates. Data Commons includes statistics on dozens of variables used in the ACS-ED District Demographic Dashboard.

Data made available under the public domain.

Census Bureau - Cartographic Boundary Files

Data Commons has KML files from 2018 US Cartographic Boundaries for the following place types:

  • Congressional Districts: 116th Congress
  • County
  • Place
  • State
  • Census Tracts
  • County Subdivisions
  • ZIP Code Tabulation Areas
  • School Districts - Elementary
  • School Districts - Secondary
  • School Districts - Unified

Data made available under US Census Terms of Service.

Census Bureau - County Business Patterns

County Business Patterns is an annual series that includes per-industry "number of establishments, employment during the week of March 12, first quarter payroll, and annual payroll". Data Commons includes the number of establishments, employment during the week of March 12, and annual payroll from the 2011-2016 County Business Patterns datasets for US counties, metropolitan statistical areas, and zip codes.

Data made available under US Census Terms of Service.

Census Bureau - Economic Census

"Every five years, the US Census Bureau collects extensive statistics about businesses that are essential to understanding the American economy." Data Commons imports country, state, county, and city statistics on the number of businesses and amount of revenue, by business payroll status, industry, operation type, and tax status.

Data made available under US Census Terms of Service.

Census Bureau - Gazetteer Files

From the 2018 US Census Gazetteer, Data Commons has geographic information about:

  • Counties
  • County Subdivisions
  • 116th Congressional Districts
  • Census Tracts (2018)
  • Core Based Statistical Areas
  • Places (City, Census Designated Place, Village, Town)
  • School Districts - Elementary
  • School Districts - Secondary
  • School Districts - Unified
  • ZIP Code Tabulation Areas

Data made available under US Census Terms of Service.

Census Bureau - Population Estimates Program

The Census Bureau's Population Estimates Program (PEP) produces yearly estimates of the population for the United States, its states, counties, cities, and towns, as well as for the Commonwealth of Puerto Rico and its municipios. Data Commons imports the total population estimate data for the US and its states, counties, and cities.

Data made available under US Census Terms of Service.

Census Bureau - Small Area Health Insurance Estimates (SAHIE)

The Small Area Health Insurance Estimates program provides yearly estimates of health insurance coverage status for all counties and states. Data Commons includes all estimates, available by age, race, sex, and income.

Data made available under US Census Terms of Service.

College Scorecard - University Data

The College Scorecard dataset includes data about all undergraduate degree-granting institutions of higher education and "supporting data on student completion, debt and repayment, earnings, and more". Data Commons includes statistics on student family income, in-state and out-of-state status, tuition, graduation rate, and acceptance rate from 1996 to 2017.

Data made available under US Department of Education Terms of Service.

Department of Labor Weekly Claims and Extended Benefits Trigger Data

The US Department of Labor reports weekly new and continuing unemployment insurance claims for US states. The data is used in the Department of Labor's weekly News Release for unemployment insurance weekly claims.

Data Commons imports this data, and aggregates the following for users:

  • State new claims into US new claims
  • State continuing claims into US continuing claims
  • New and continuing claim counts into total claims (both state and US levels)
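The three aggregations listed above can be sketched as follows. The row layout, the function name, and every number are made up for illustration; the actual import pipeline is not described in this document.

```python
from collections import defaultdict

# Hypothetical rows: (state, week_ending, new_claims, continuing_claims).
state_rows = [
    ("CA", "2020-04-04", 900, 2000),
    ("NY", "2020-04-04", 300, 1000),
    ("CA", "2020-04-11", 600, 2500),
    ("NY", "2020-04-11", 400, 1200),
]


def aggregate_claims(rows):
    """Return (state_totals, us_totals).

    state_totals maps (state, week) to new + continuing claims.
    us_totals maps week to US-level new, continuing, and total claims.
    """
    state_totals = {}
    us = defaultdict(lambda: {"new": 0, "continuing": 0, "total": 0})
    for state, week, new, continuing in rows:
        state_totals[(state, week)] = new + continuing
        us[week]["new"] += new
        us[week]["continuing"] += continuing
        us[week]["total"] += new + continuing
    return state_totals, dict(us)


state_totals, us_totals = aggregate_claims(state_rows)
print(us_totals["2020-04-04"])
```

The sketch assumes every state reports for every week. A real pipeline would also need to decide how to treat weeks with missing states.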

Data made available under the public domain.

Drug Enforcement Administration (DEA) - Retail Drug Distributions by Drug at the County Level

Automated Reports and Consolidated Ordering System (ARCOS) is a data collection system in which manufacturers and distributors report their controlled substances transactions to the Drug Enforcement Administration (DEA).

Data Commons includes quarterly retail drug distributions from ARCOS Report 1, provided annually from 2006-2017. The 3-digit zip prefixes from the report were aggregated to the county level using 2010 ZIP Code Tabulation Area (ZCTA) Relationship records from the US Census.

Please see the disclaimers page about the scope of the data, as well as the US Department of Justice Legal Policies and Disclaimers Terms of Use.

Energy Information Administration (EIA) Form EIA-860 Data

"The survey Form EIA-860 collects generator-level specific information about existing and planned generators and associated environmental equipment at electric power plants with 1 megawatt or greater of combined nameplate capacity." Data Commons includes all reported generators and their properties from EIA-860_3_1.

Data made available under the public domain.

Federal Bureau of Investigation Uniform Crime Reporting (UCR) Program - Offenses Known to Law Enforcement, by State by City

Table 8 provided by the UCR Program's Crime in the US report "provides the volume of violent crime (murder and nonnegligent manslaughter, rape, robbery, and aggravated assault) and property crime (burglary, larceny-theft, and motor vehicle theft) as reported by city and town law enforcement agencies (listed alphabetically by state) that contributed data to the UCR Program." Data Commons includes counts of all crime types at the state and city level.

Data made available under US Department of Justice Legal Policies and Disclaimers Terms of Use.

Federal Election Commission

The FEC provides data from statements and reports filed with the FEC. Data Commons includes the following three datasets:

  • All Candidates: "The all candidate summary file contains one record including summary financial information for all candidates who raised or spent money during the period no matter when they are up for election."
  • Candidate Master: "The candidate master file contains one record for each candidate who has either registered with the Federal Election Commission or appeared on a ballot list prepared by a state elections office."
  • Candidate-Committee Linkages: "This file contains one record for each candidate to committee linkage."

Data made available under FEC Terms of Use.

Federal Reserve Treasury Nominal and Inflation-Indexed Constant Maturity Series

The United States Federal Reserve publishes daily updated Treasury Nominal and Inflation-Indexed Constant Maturity Series. Data Commons includes data for 1-Month, 3-Month, 6-Month, 1-Year, 2-Year, 3-Year, 5-Year, 7-Year, 10-Year, 20-Year, and 30-Year constant maturities. Data Commons includes all published statistics.

The data is in the public domain.

National Center for Education Statistics (NCES) - Public School and School District Data

NCES exposes data from an annual survey called "The Common Core of Data" (CCD). CCD contains general descriptive information such as name, address, and phone number; select demographic characteristics about students and staff; and fiscal data such as revenues and current expenditures. Data Commons includes school and school district level data about student populations by race, gender, lunch eligibility, and grade, as well as student-teacher ratio and teacher count statistics.

Data made available under NCES Data Usage Agreement and US Department of Education Copyright Status Notice.

National Climatic Data Center Storm Events Database

The Storm Events Database contains records documenting the "occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce; rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area; and other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event."

Data Commons includes data on all available types of storm events (e.g. Tornado, Hail) and storm episodes from the Storm Events Database. Variables include location, start datetime, end datetime, wind speed, precipitation type and amount, recorded description, affected places, number of direct and indirect injuries, number of direct and indirect deaths, property damage cost, crop damage cost, and more.

Data made available under National Weather Service Use of NOAA/NWS Data and Products Terms of Service.

National Interagency Fire Center Interagency Situation Report - 209 (SIT-209)

The SIT-209 web application "is used to collect intelligence information related to the wildland fire management incidents and resources. SIT collects daily fire activity and initial/extended attack resource information during the active fire season for the local dispatch office. 209 is used to collect and store ICS-209 large Incident Summary information."

Data Commons includes data on fires in the US reported via the SIT-209 application, starting from 1999. Variables include fire name, fire type, fire cause, location, discovery date, controlled date, affected area, and estimated costs.

Data made available under US Forest Service Terms of Service.

United States Geological Survey (USGS) Advanced National Seismic System Comprehensive Earthquake Catalog (ComCat)

ComCat "contains earthquake source parameters (e.g. hypocenters, magnitudes, phase picks and amplitudes) and other products (e.g. moment tensor solutions, macroseismic information, tectonic summaries, maps) produced by contributing seismic networks." Data Commons includes date, time, location, magnitudes, magnitude errors, depth, depth error, and review status of earthquakes of magnitude 3 and above, starting from 1900.

Data made available under USGS Copyrights and Credits Terms of Service.

United States Geological Survey (USGS) Geographic Names Information System (GNIS) - National Federal Codes

The National Federal Codes dataset includes codes, names, coordinates, and more information for all "named physical and cultural geographic features (except roads and highways) of the United States", maintained by GNIS. Data Commons uses this dataset to build containment relationships between places from the US Census Gazetteer dataset.

Data made available under USGS Copyrights and Credits Terms of Service.

Michigan Student Test of Educational Progress (M-STEP) Datasets

The Michigan Student Test of Educational Progress, or M-STEP, is "a 21st Century computer-based assessment designed to gauge how well students are mastering state standards". Data Commons includes the Mathematics and English Language Arts test counts and scores from the 2015 to 2019 "Assessment and Accountability: Grades 3-8 Assessments" files.

Data is publicly available with proper citation.

The COVID Tracking Project US State and Total Data

The COVID Tracking Project collects, cross-checks, and publishes COVID-19 testing and patient outcome data from 56 US states and territories. Visit their website for more about their data. Data Commons includes the testing and patient status statistics from the US Data and State Data files.

Data made available under Creative Commons CC BY-NC-4.0 license.

The Dartmouth Atlas of Health Care

The Dartmouth Atlas Project "uses Medicare and Medicaid data to provide information and analysis about national, regional, and local markets, as well as hospitals and their affiliated physicians." Data Commons includes the Medicare Reimbursements, Medicare Mortality Rates, and Selected Primary Care Access and Quality Measures datasets.

Data is made available under the Dartmouth Atlas Project Terms of Use.

DeepSolar

DeepSolar "analyzes satellite imagery to identify the GPS locations and sizes of solar photovoltaic (PV) panels for the contiguous U.S." Data Commons includes the count and total area of solar arrays at the state, county, and census tract level.

Data is made publicly available by the DeepSolar Project.

Google Health Reconciled COVID-19 Data

Google Health reconciles COVID data from The COVID Tracking Project, Johns Hopkins University, and California Health and Human Services, and makes the reconciled dataset publicly available for research and prediction use via Data Commons.

The New York Times Coronavirus (Covid-19) Data in the United States

The New York Times releases cumulative counts of coronavirus cases in the United States at the country, state, and county level, over time. The New York Times compiles this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak. Data Commons imports this data and computes incremental counts for users.
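Deriving incremental counts from a cumulative series amounts to taking successive differences. The sketch below shows the idea on made-up numbers; the function name is hypothetical, and treating the first cumulative value as the first increment is an assumption, not something this document states.

```python
def incremental_counts(cumulative):
    """Convert a date-ordered list of cumulative counts to per-period increments."""
    increments = []
    previous = 0  # Assumption: the series starts from zero cases.
    for value in cumulative:
        increments.append(value - previous)
        previous = value
    return increments


# Made-up cumulative case counts for five consecutive days.
cumulative_cases = [1, 4, 9, 9, 15]
daily_new = incremental_counts(cumulative_cases)
print(daily_new)
```

If a source revises its cumulative totals downward, the differences become negative. How such revisions are handled is not covered here.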

Data made available for non-commercial purposes only with proper citation.

Collaboration: Opportunity Insights - Outcomes and Neighborhood Datasets

Opportunity Insights provides datasets on "social mobility and a variety of other outcomes from life expectancy to patent rates by neighborhood, college, parental income level, and racial background." Data Commons includes the following Outcomes and Neighborhood Datasets:

  • All Outcomes by Census Tract, Race, Gender and Parental Income Percentile
  • All Outcomes by County, Race, Gender and Parental Income Percentile
  • All Outcomes by Commuting Zone, Race, Gender and Parental Income Percentile

Along with the neighborhood characteristics covariate datasets at the respective place levels:

  • Neighborhood Characteristics by Census Tract
  • Neighborhood Characteristics by County
  • Neighborhood Characteristics by Commuting Zone

Data made available under Opportunity Insights Data and Data Usage.

Collaboration: Western Interconnection Data Analytics (WIDAP) Project

Data Commons includes power plants in the Western grid up until 2017, with each power plant's name, latitude, longitude, county, Office of the Regulatory Information System Plant (ORISPL) code, and the names of its power plant units. The data is made available on Data Commons through a collaboration with the WIDAP project.

Eurostat Database Regional Statistics by NUTS Classification

The datasets under "Regional Statistics by NUTS Classification" provide various statistics on European Union countries and their NUTS regions. Data Commons includes the following datasets:

Data made available under the European Union, 1995 - today Copyright Notice.

Eurostat NUTS Geos

Data Commons has NUTS (Nomenclature of territorial units for statistics) geocodes from the 2016 classification. This covers NUTS levels 1 through 3.

Data made available under the European Union, 1995 - today Copyright Notice.

Google Community Mobility Reports

Google's COVID-19 Community Mobility Reports "chart movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential." Data Commons includes all statistics for countries, US states, and US counties.

Data made publicly available under the standard Google Terms of Service.

NOAA International Best Track Archive for Climate Stewardship (IBTrACS)

The IBTrACS project provides tropical cyclone best track data in a centralized location. Data Commons includes cyclone name, start date, end date, max wind speed, minimum pressure, max classification, oceanic basin, and affected places.

Data made available under the National Weather Service Use of NOAA/NWS Data and Products Terms of Service.

Related Places

Data Commons computes rankings and relations between places for all StatisticalVariables. For example, this dataset can answer queries such as: states with a population of PhDs similar to California's, US cities with the most or least violent crimes, and the rank of San Mateo County in terms of median income among US counties.

This information is available in the Data Commons KG accessible via the API and surfaced on the Place Explorer tool.
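As a rough illustration of what a ranking of places by one variable looks like, the sketch below sorts a small mapping of made-up values. It does not call the Data Commons API; the function name, place names, and incomes are all hypothetical.

```python
def rank_places(values, descending=True):
    """Return {place: rank}, with rank 1 for the highest (or lowest) value."""
    ordered = sorted(values, key=values.get, reverse=descending)
    return {place: position for position, place in enumerate(ordered, start=1)}


# Made-up median incomes for three hypothetical counties.
median_income = {"County A": 70_000, "County B": 120_000, "County C": 95_000}
ranks = rank_places(median_income)
print(ranks["County C"])
```

Ties are not given equal ranks in this sketch; places with the same value keep their input order.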

Data made available under CC BY 4.0.

UNdata Population Data

Data Commons includes population data for countries, capital cities, urban and rural areas from UNdata.

Data made available with citation under the UNdata conditions of use.

WHO Coronavirus Disease (COVID-19) Dashboard

The World Health Organization publishes national COVID-19 cases and death counts for countries across the world. Data Commons imports this data on a daily basis.

Data made available under CC BY-NC-SA 3.0 IGO.

Wikidata Places

Data Commons includes information about administrative divisions, municipalities, cities, villages and neighborhoods of all countries in the world from Wikidata. This also includes population statistics and various well-known identifiers associated with the places.

Data made available under CC0 1.0 Universal.

World Development Indicators from The World Bank

Data Commons includes the following country-level variables from the World Development Indicators dataset:

  • Population
  • Population growth rate
  • Life expectancy
  • Fertility rate
  • CO2 emissions
  • Electricity consumption
  • Energy use
  • GDP
  • Internet usage
  • Labor force participation rate, by gender, modeled
  • Female labor force as a fraction of total labor force
  • Educational attainment, by gender and level of attainment
  • Intentional homicides, by gender
  • Prevalence of overweight, weight for height children under 5, by gender
  • Suicide mortality rate, by gender

Data made available under CC-BY 4.0.

ChEMBL

"ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs." It includes information on drugs at all stages of drug discovery.

This data is made available under the EMBL-EBI Terms of Use. This data was formatted for Data Commons in part through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

Disease Ontology

Disease Ontology was developed as a project by the Institute of Genome Sciences at the University of Maryland School of Medicine. It "is a community driven, open source ontology that is designed to link disparate datasets through disease concepts". It provides a "standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts".

The data is made available under CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Data Commons includes the 3/7/19 update of the Disease Ontology. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

Encyclopedia of DNA Elements (ENCODE) - BED (Browser Extensible Data) Files

The ENCODE dataset contains information for approximately 7,000 experiments along with 14,000 BED files collected by The Encyclopedia of DNA Elements (ENCODE) Consortium. Examples of experiment metadata captured include the target biosample, assay type, and genome assembly. BED files link to individual BED lines, which state the genomic position of individual peaks. Data Commons ingested all experimental data in BED format.

Data made available under the ENCODE Data Use Policy for External Users. This data was formatted for Data Commons through a collaboration with Dr. Anthony Oro’s group at Stanford University.

FDA-Approved Drugs

"Drugs@FDA includes information about drugs, including biological products, approved for human use in the United States." Data Commons includes the information about the FDA application for the drug as well as the drug’s strength, active ingredients, dosage forms, administration routes, FDA therapeutic equivalence code, and marketing status.

This data is made available through openFDA terms of service.

FDA - Pharmacologic Class

The FDA established pharmacologic classes "associated with an approved indication of an active moiety that the FDA has determined to be scientifically valid and clinically meaningful". This includes the (1) description of pharmacologic class (2) active moiety code and description (3) compounds associated with each class.

This data is made available through openFDA terms of service. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

Genotype-Tissue Expression (GTEx)

The GTEx eGene and significant variant-gene association data were generated from samples "collected from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. Remaining samples are available from the GTEx Biobank." Data Commons uses the single-tissue cis-eQTL data from the v8 release. Due to the size of the datasets, only Skin - Not Sun Exposed and Skin - Sun Exposed are made available on the main graph. The data for all tissues can be accessed on the Biomedical Data Commons knowledge graph.

GTEx is an NIH human genomic data unrestricted-access data repository and the data was made available in compliance with GTEx Data Release and Publication Policy. GTEx outlines how to cite use of GTEx data in journal publication.

HUPO-PSI Working Groups and Outputs

The Molecular Interactions Controlled Vocabulary from the HUPO Proteomics Standards Initiative working groups is "a structured controlled vocabulary for the annotation of experiments concerned with protein-protein interactions". The ontologies dictionary is represented in a tree structure in the EMBL-EBI Ontology Lookup Service. Data Commons includes three subsets of the ontologies: "interaction detection method", "interaction type" and "database citation", which are commonly used in protein-protein interactions.

Data made available under Apache License 2.0. The license information of HUPO PSI can be found at the Community Practice. See also EBI terms of use.

Medical Subject Headings (MeSH)

The MeSH "thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information". Data Commons includes the Descriptor, Concept, and Term elements of MeSH as described here.

This data is from the National Library of Medicine (NLM) and is not subject to copyright and is freely reproducible as stated in the NLM’s copyright policy. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

The Molecular INTeraction (MINT) Database

The MINT Database "focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators."

MINT is a part of ELIXIR Core Data Resources, of which the resources are all committed to open access. Any use of this database should cite:

Licata, Luana, Leonardo Briganti, Daniele Peluso, Livia Perfetto, Marta Iannuccelli, Eugenia Galeota, Francesca Sacco et al. "MINT, the molecular interaction database: 2012 update." Nucleic acids research 40, no. D1 (2012): D857-D861.

NIH National Center for Biotechnology Information ClinVar

"ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence." It contains reports of genetic "variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data." Data Commons includes the January 6, 2020 release of the ClinVar archive supporting both hg19 and hg38 genome assemblies.

This data is from an NIH human genome unrestricted-access data repository and made accessible under the NIH Genomic Data Sharing (GDS) Policy.

NIH National Center for Biotechnology Information Gene

The NIH NCBI gene info datasets from NCBI Gene for a subset of species contain "gene-specific content based on NCBI's RefSeq project, information from model organism databases, and links to other resources." The NCBI RefSeq project is "a comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein". The datasets included are from the February 19, 2020 update. The gene info files for the following species have been added:

  • Caenorhabditis elegans
  • Danio rerio
  • Drosophila melanogaster
  • Gallus gallus
  • Homo sapiens
  • Mus musculus
  • Saccharomyces cerevisiae
  • Xenopus laevis

This data is from an NIH human genome unrestricted-access data repository and made accessible under the NIH Genomic Data Sharing (GDS) Policy.

Side Effect Resource (SIDER)

SIDER is a database of adverse drug reactions curated by the EMBL collaboration. "SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts. The available information include side effect frequency, drug and side effect classifications as well as links to further information, for example drug–target relations." Data Commons hosts version 4.1 of SIDER released on October 21, 2015.

This data is made available under the Creative Commons Attribution-Noncommercial-Share Alike 4.0 License. Information about citing SIDER can be found here. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

SPOKE Disease Symptom Associations

These are statistical associations between diseases and symptoms, computed by applying Fisher’s exact test to the co-occurrence of disease and symptom terms in PubMed entries, as described in Himmelstein et al. (2017).
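For readers unfamiliar with Fisher's exact test on co-occurrence counts, the sketch below computes a one-sided p-value from a 2x2 table using only the Python standard library (`math.comb`, Python 3.8+). Whether the original analysis used a one-sided or two-sided test is not stated here, so the one-sided choice, the function name, and the counts are all assumptions for illustration.

```python
from math import comb


def fisher_exact_greater(both, disease_only, symptom_only, neither):
    """One-sided Fisher's exact test for enrichment of co-occurrence.

    The 2x2 table counts entries mentioning both terms, only the disease,
    only the symptom, or neither. Returns P(X >= both) under the
    hypergeometric null with fixed row and column totals.
    """
    total = both + disease_only + symptom_only + neither
    disease_total = both + disease_only   # entries mentioning the disease
    symptom_total = both + symptom_only   # entries mentioning the symptom
    upper = min(disease_total, symptom_total)
    denominator = comb(total, disease_total)
    numerator = sum(
        comb(symptom_total, k) * comb(total - symptom_total, disease_total - k)
        for k in range(both, upper + 1)
    )
    return numerator / denominator


# Made-up counts: 3 entries mention both terms, 1 only the disease,
# 1 only the symptom, and 3 mention neither.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(round(p_value, 4))
```

A small p-value suggests the disease and symptom terms appear together more often than chance alone would explain.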

The data was previously hosted by UCSF Scalable Precision Medicine Knowledge Engine SPOKE. It was made available by the data’s owner, Sergio Baranzini, for use on Data Commons. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

The Tissue Atlas

The Human Protein Tissue Atlas contains information about the distribution of proteins in human tissues, derived from antibody-based protein profiling of 44 normal human tissue types and mRNA expression data from 37 different normal tissue types.

This dataset is available under CC BY-SA 3.0. Please also see their Disclaimer and Licence & Citation.

UCSC Genome Browser: Chromosome, Gene, RNA Transcript, and Genetic Variant Annotations

The UCSC Genome Browser originated from The Human Genome Project in 2000 to share and visualize genome data. It has grown to include an agglomeration of various genome assemblies and annotations. Data Commons includes data annotating chromosomes, genes, RNA transcripts, and genetic variants from the UCSC Genome Browser. The .chrom.sizes.txt files were downloaded from the UCSC Genome Browser Downloads page on August 13, 2019. The NCBI RefSeq files were downloaded from the UCSC Table Browser on August 2, 2019 for the following genome assemblies:

  • ce10
  • ce11
  • danRer10
  • danRer11
  • dm3
  • dm6
  • galGal5
  • galGal6
  • hg19
  • hg38
  • mm9
  • mm10
  • sacCer3
  • xenLae2

The All SNPs files were downloaded from the UCSC Table Browser on August 13, 2019 for the following genome assemblies and dbSNP builds:

  • galGal5 (dbSNP Build 147)
  • hg19 (dbSNP Build 151)
  • hg38 (dbSNP Build 151)
  • mm9 (dbSNP Build 128)
  • mm10 (dbSNP Build 142)

The annotation data is made freely available under the UCSC Genome Browser terms of use. The UCSC Genome Browser states how to cite use of their data in a journal article publication.

UniProt

Data Commons includes protein sequence and functional information including protein interaction with chemical compounds maintained by the UniProt Consortium.

The data is made available under the Creative Commons Attribution (CC BY 4.0) License. Further information on UniProt License and Disclaimer can be found here. The UniProt Consortium states how to cite UniProt data used in a journal article. This data was formatted for Data Commons in part through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

UniProt Controlled Vocabulary of Species

UniProt’s Controlled Vocabulary of Species contains organism species UniProt identification codes, NCBI Taxonomy database identifiers, scientific names, common names, synonyms, and organism kingdoms.

The dataset is available under the CC BY 4.0 license, as shown by the UniProt License and Disclaimer.

New York Botanical Garden (NYBG) - C. V. Starr Virtual Herbarium (Collaboration)

C. V. Starr Virtual Herbarium is a public specimen database with photos and detailed records about millions of plants, fungi, and algae.