Citing Data

Current publishing practice is to cite data sources in much the same manner that articles and books are cited. Such citations may be part of a regular Reference List or listed separately in their own list, either at the back of the article or in a Data Availability Statement. (See Data Availability Statement.)

Principles of Data Citation

The Force11 Joint Declaration of Data Citation Principles states (among other principles) that:
  • Data should be considered legitimate, citable products of research.
  • Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
  • In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.
  • A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
  • Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific timeslice, version, and/or granular portion of data retrieved subsequently is the same as was originally cited.

Data Citations in JATS

The JATS citation models are adequate to record most current practice in citing data even though data sets, protein sequences, and spreadsheets (to name a few data examples) are not tagged as uniformly by the industry as are cited journals and books.
Specific JATS structures that can assist in preserving data source information in a citation include:
  • <data-title> — the formal title or name of a cited data source (or a component of a cited data source) such as a dataset or protein structure.
    Because a data source can have several levels (for example, a repository, a dataset within it, and a component of that dataset), both the <source> element and the <data-title> element may be needed within a single citation to describe different levels of the data source. The <data-title> is typically used as an equivalent of an article title (<article-title>). See samples below.
  • <version> — A full version statement, which may be only a number, for data or software that is cited or described.
    The content of this element may be a simple version number (such as “<version>16</version>” or “<version>XII</version>”). More complex version statements may contain a textual statement including dates that the dataset covers. Whether or not the content is more than a simple number, the @designator attribute of this element can be used to hold the simple numerical or alphabetic version number, if there is such a number: <version designator="16.2">16th version, second release</version>.
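
As an illustration of how these elements might combine, the following is a minimal sketch of a data citation. The author, title, repository name, year, and DOI are all hypothetical placeholders, and the publication-type="data" value is an assumed convention for flagging the citation as data; only <data-title>, <source>, and <version> with @designator come from the descriptions above.

```xml
<!-- Hypothetical data citation: every name and identifier is a placeholder. -->
<element-citation publication-type="data">
  <person-group person-group-type="author">
    <name><surname>Doe</surname><given-names>J</given-names></name>
  </person-group>
  <year>2020</year>
  <!-- The dataset (or dataset component) being cited, analogous to an article title -->
  <data-title>Example survey responses, wave 3</data-title>
  <!-- The larger data source that holds the dataset -->
  <source>Example Data Repository</source>
  <!-- Full version statement as content; simple version number in @designator -->
  <version designator="16.2">16th version, second release</version>
  <!-- Placeholder persistent identifier, not a real DOI -->
  <pub-id pub-id-type="doi">10.5555/example-dataset</pub-id>
</element-citation>
```

Here <data-title> and <source> describe two different levels of the same data source, and the @designator value gives a processing system a simple version number without parsing the textual statement.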

How the Data was Used

For the purposes of citing data sources, three different uses of the data associated with an article can be recognized:
  • Generated Data: Included or referenced external data that were generated in the course of the study on which the article reports.
  • Analyzed Data: Referenced data that were analyzed in the course of the study on which the article reports, but that were not generated for the study. This may include publicly available datasets.
  • Non-analyzed Data: Referenced data that were neither generated nor analyzed during the study.
The @use-type attribute (on either <mixed-citation> or <element-citation>) may be set to record how the data were used in the research that led to the article, for example to distinguish among “generated-data”, “analyzed-data”, and “non-analyzed-data” (referenced data).
Note: In current practice, how the data were used is probably information that only contributors can supply. Publishers and archives may have no reliable way to determine use, because the text of an article typically does not state it.
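
A minimal sketch of how the @use-type attribute might appear on a citation follows. The reference id, author, title, and repository are hypothetical placeholders, and publication-type="data" is an assumed convention; the @use-type value is one of the three named above.

```xml
<!-- Hypothetical reference: the cited dataset was analyzed but not generated by the study. -->
<ref id="dataref1">
  <mixed-citation publication-type="data" use-type="analyzed-data">Doe J. <data-title>Example survey responses, wave 3</data-title>. <source>Example Data Repository</source>; <year>2020</year>. <version designator="16.2">16th version, second release</version>.</mixed-citation>
</ref>
```

A dataset produced by the study itself would carry use-type="generated-data" instead, and one that is merely mentioned would carry use-type="non-analyzed-data".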