Citing Data

Current publishing practice is to cite data sources in much the same way as articles and books. Such citations may be part of the regular Reference List or may appear in a separate list of their own, either at the back of the work or in a Data Availability Statement. (See Data Availability Statement.)
Principles of Data Citation
The Force11 Joint Declaration of Data Citation Principles states (among other principles) that:
  • Data should be considered legitimate, citable products of research.
  • Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
  • In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.
  • A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
  • Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific timeslice, version, and/or granular portion of data retrieved subsequently is the same as was originally cited.
Data Citations in BITS/JATS
The BITS/JATS citation models are adequate to record most current practice in citing data even though data sets, protein sequences, and spreadsheets (to name a few data examples) are not tagged as uniformly by the industry as are cited journals and books.
Specific JATS structures that can assist in preserving data source information in a citation include:
  • <data-title> — The formal title or name of a cited data source (or a component of a cited data source) such as a dataset or protein structure.
    Because the relationships among the parts of a cited data source can be complex, both the <source> element and the <data-title> element may be needed within a single citation to describe different levels of the data source. The <data-title> element is typically used as the equivalent of an article title (<article-title>) for cited data. One difference is that a citation may contain several levels of <data-title>, whereas <article-title> names a single level. See samples below.
  • <version> — A full version statement, which may be only a number, for data or software that is cited or described.
    The content of this element may be a simple version number (such as “<version>16</version>” or “<version>XII</version>”). More complex version statements may contain a textual statement including dates that the dataset covers. Whether or not the content is more than a simple number, the @designator attribute of this element can be used to hold the simple numerical or alphabetic version number, if there is such a number: <version designator="16.2">16th version, second release</version>.
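As a hedged illustration of the two structures above, the following sketch tags a component-level <data-title>, a dataset-level <data-title>, and a <version> with @designator in a single citation. The contributor name, titles, repository name, and DOI are hypothetical placeholders, not a real dataset, and the punctuation follows no particular publisher style.
...
<ref>
 <!-- Hypothetical citation: name, titles, repository, and DOI are placeholders -->
 <mixed-citation publication-type="data"><person-group person-group-type="author"
  ><string-name>Example A</string-name></person-group> 
  (<year iso-8601-date="2015">2015</year>). <data-title>Table 3: Station 
  readings</data-title>. In: <data-title>Example regional climate 
  dataset</data-title>, <version designator="16.2">16th version, second 
  release</version>. <source>Example Data Repository</source>. 
  doi:<pub-id pub-id-type="doi">10.1234/example.5678</pub-id></mixed-citation>
</ref>
...
Here the first <data-title> names the cited component, the second names the dataset that contains it, and <source> names the repository that holds the dataset.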
Describing How the Data Files Were Used
For the purposes of citing data sources, three different uses of the data associated with a work can be recognized:
  • Generated Data: Included or referenced external data generated in the course of the study on which the work reports.
  • Analyzed Data: Referenced data analyzed in the course of the study on which the work reports, but that was not generated for the study. This may include publicly available datasets.
  • Non-analyzed Data: Referenced data neither generated nor analyzed during the study.
The @use-type attribute (on either <mixed-citation> or <element-citation>) may be set to explain how the data were used in the research that led to the article, for example, to distinguish among “generated-data”, “analyzed-data”, and “non-analyzed-data” (referenced data).
Note: In current practice, exactly how the data were used is probably material that only contributors can supply. Publishers and archives may have no reliable way to determine use, as there is typically nothing in the text that states usage.
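A minimal sketch of @use-type on a data citation follows. The citation itself is hypothetical (placeholder contributor, title, repository, and DOI), and the value shown assumes the contributors reported that the dataset was analyzed for the study but not generated by it.
...
<ref>
 <!-- Hypothetical citation; use-type value assumed to be supplied by the contributors -->
 <mixed-citation publication-type="data" use-type="analyzed-data"><person-group 
  person-group-type="author"><string-name>Example A</string-name></person-group> 
  (<year iso-8601-date="2015">2015</year>). <data-title>Example survey 
  responses</data-title>. <source>Example Data Repository</source>. 
  doi:<pub-id pub-id-type="doi">10.1234/example.9012</pub-id></mixed-citation>
</ref>
...
Data produced by the study itself would instead carry use-type="generated-data", and data cited only for reference would carry use-type="non-analyzed-data".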
Examples of Data Citations
We would like to thank the Force11 group for the data citation examples given below.
Protein Data Bank in Europe sample:
...
<ref>
 <mixed-citation publication-type="data"><person-group
  ><string-name>Kollman JM</string-name>, 
  <string-name>Charles EJ</string-name>, <string-name
  >Hansen JM</string-name></person-group>, 
  <year iso-8601-date="2014">2014</year>, <data-title>Cryo-EM structure of 
  the CTP synthetase filament</data-title>, <ext-link ext-link-type="uri" 
  xlink:href="http://www.ebi.ac.uk/pdbe/entry/EMD-2700">
  http://www.ebi.ac.uk/pdbe/entry/EMD-2700</ext-link>, Publicly available 
  from <source>The Electron Microscopy Data Bank (EMDB)</source>.</mixed-citation>
</ref>
...
GigaScience sample:
...
<ref>
 <mixed-citation publication-type="data">Zheng LY, 
  Guo XS, He B, Sun LJ, Pi CM, Jing H-C: Genome data from 
  [<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5524/100012">
  http://dx.doi.org/10.5524/100012</ext-link>] <source>GigaScience</source> 
  <year iso-8601-date="2011">2011</year>.</mixed-citation>
</ref>
...
Data in figshare, referenced through a DOI:
...
<ref>
 <mixed-citation publication-type="data">Di Stefano B, Collombet S, 
  Graf T. <source>Figshare</source> <ext-link ext-link-type="doi" 
  assigning-authority="figshare" 
  xlink:href="http://dx.doi.org/10.6084/m9.figshare.939408">
  http://dx.doi.org/10.6084/m9.figshare.939408</ext-link> 
  (<year iso-8601-date="2014">2014</year>).</mixed-citation>
</ref>
...
Dryad Digital Repository, referenced through a DOI:
...
<ref>
 <mixed-citation publication-type="data"><person-group person-group-type="author"
  ><string-name>Dubuis JO</string-name>, <string-name>Samanta R</string-name>, 
  <string-name>Gregor T</string-name></person-group> 
  (<year iso-8601-date="2013">2013</year>).  Data from: 
  <data-title>Accurate measurements of dynamics and reproducibility 
  in small genetic networks</data-title>. <source>Dryad Digital 
  Repository</source> doi:<pub-id pub-id-type="doi">10.5061/dryad.35h8v</pub-id>
 </mixed-citation>
</ref>
...


GenBank sequence sample:
...
<ref>
 <mixed-citation publication-type="data">
  <data-title>Homo sapiens cAMP responsive element binding protein 1 
  (CREB1), transcript variant A, mRNA</data-title>. <source>GenBank</source> 
  <ext-link ext-link-type="genbank" xlink:href="NM_004379.3">NM_004379.3</ext-link>.
 </mixed-citation>
</ref>
...
RNA Sequence sample:
...
<ref>
 <mixed-citation publication-type="data">Xu, J. <etal/> 
  <data-title>Cross-platform ultradeep transcriptomic profiling 
  of human reference RNA samples by RNA-Seq</data-title>. 
  <source>Sci. Data</source> <volume>1</volume>:<elocation-id>140020</elocation-id>.  
  doi: <pub-id pub-id-type="doi">10.1038/sdata.2014.20</pub-id> 
  (<year iso-8601-date="2014">2014</year>).</mixed-citation>
</ref>
...
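The samples above all use <mixed-citation>. For comparison, here is a sketch of how the Dryad citation might be tagged with <element-citation>, which carries no punctuation or connecting text. Splitting each contributor into <surname> and <given-names> is an assumption made for illustration, and the element order shown is only one reasonable choice.
...
<ref>
 <!-- Illustrative re-tagging of the Dryad sample; name splitting is assumed -->
 <element-citation publication-type="data">
  <person-group person-group-type="author">
   <name><surname>Dubuis</surname><given-names>JO</given-names></name>
   <name><surname>Samanta</surname><given-names>R</given-names></name>
   <name><surname>Gregor</surname><given-names>T</given-names></name>
  </person-group>
  <year iso-8601-date="2013">2013</year>
  <data-title>Accurate measurements of dynamics and reproducibility 
  in small genetic networks</data-title>
  <source>Dryad Digital Repository</source>
  <pub-id pub-id-type="doi">10.5061/dryad.35h8v</pub-id>
 </element-citation>
</ref>
...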