Citing Data

Current publishing practice is to cite data sources in much the same manner that articles and books are cited. Such citations may be part of a regular Reference List or listed separately in their own list, either at the back of the article or in a Data Availability Statement. (See Data Availability Statement.)

Principles of Data Citation

The Force11 Joint Declaration of Data Citation Principles states (among other principles) that:
  • Data should be considered legitimate, citable products of research.
  • Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
  • In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.
  • A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
  • Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific timeslice, version, and/or granular portion of data retrieved subsequently is the same as was originally cited.

Data Citations in JATS

The JATS citation models are adequate to record most current practice in citing data even though data sets, protein sequences, and spreadsheets (to name a few data examples) are not tagged as uniformly by the industry as are cited journals and books.
Specific JATS structures that can assist in preserving data source information in a citation include:
  • Publication Type Attribute: The @publication-type on the citation element (<mixed-citation> or <element-citation>) should be set to “data”, to indicate a data citation.
  • Data Title: The element <data-title> holds the formal title or name of a cited data source (or a component of a cited data source) such as a dataset or protein structure. The <data-title> is typically used as an equivalent of an article title (<article-title>). See samples below.
    Because a cited data source can have several levels (for example, a repository and a dataset or component within it), both the <source> element and one or more <data-title> elements may be needed within a single citation to describe the different levels of the data source.
  • Version of Resource: The element <version> may hold a full version statement for data or software that is cited or described.
    The content of this element may be a simple version number (such as “<version>16</version>” or “<version>XII</version>”). More complex version statements may contain a textual statement including dates that the dataset covers.
    Whether or not the content is more than a simple number, the @designator attribute of this element can be used to hold the simple numerical or alphabetic version number, if there is such a number: <version designator="16.2">16th version, second release</version>.
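A minimal sketch of how these structures might combine in one citation is given here. The author name, titles, repository name, year, and version values are hypothetical placeholders, not a real data source. The <source> names the repository, the <data-title> names the dataset within it, and @designator carries the short version number.
...
<ref>
<!-- Hypothetical citation: all names and values below are illustrative only -->
<mixed-citation publication-type="data"><string-name><surname>Doe</surname> 
<given-names>J</given-names></string-name>. <data-title>Example survey 
responses</data-title>. <source>Example Data Repository</source>. 
<version designator="16.2">16th version, second release</version>. 
<year iso-8601-date="2020">2020</year>.</mixed-citation>
</ref>
...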

How the Data was Used

For the purposes of citing data sources, three different uses of the data associated with an article can be recognized:
  • Generated Data: Included or referenced external data that were generated in the course of the study on which the article reports.
  • Analyzed Data: Referenced data that were analyzed in the course of the study on which the article reports, but that were not generated for the study. This may include publicly available datasets.
  • Non-analyzed Data: Referenced data that were neither generated nor analyzed during the study.
The @use-type attribute (again on either <mixed-citation> or <element-citation>) may be set to record how the data were used in the research that led to the article, for example, to distinguish among “generated-data”, “analyzed-data”, and “non-analyzed-data” (referenced data).
Note: In current practice, exactly how the data were used is probably information that only contributors can supply. Publishers and archives may have no reliable way to determine use, as there is typically nothing in the text that states usage.
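A sketch of how @use-type might be recorded, assuming the contributors have supplied the usage information. The citation content is a hypothetical placeholder; only the attribute values are taken from the list above.
...
<ref>
<!-- Hypothetical dataset generated during the study -->
<element-citation publication-type="data" use-type="generated-data">
   <name><surname>Doe</surname><given-names>J</given-names></name>
   <data-title>Example measurements collected for this study</data-title>
   <source>Example Data Repository</source>
   <year iso-8601-date="2020">2020</year>
</element-citation>
</ref>
<ref>
<!-- Hypothetical existing dataset that the study analyzed but did not generate -->
<element-citation publication-type="data" use-type="analyzed-data">
   <name><surname>Roe</surname><given-names>R</given-names></name>
   <data-title>Example reference measurements</data-title>
   <source>Example Data Repository</source>
   <year iso-8601-date="2018">2018</year>
</element-citation>
</ref>
...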

Examples of Data Citations

We would like to thank the Force11 group for the data citation examples given below.
Protein Data Bank in Europe sample:
...
<ref>
<mixed-citation publication-type="data">Kollman JM, Charles EJ, Hansen JM, 
<year iso-8601-date="2014">2014</year>, <data-title>Cryo-EM structure of 
the CTP synthetase filament</data-title>, <ext-link ext-link-type="uri" 
xlink:href="http://www.ebi.ac.uk/pdbe/entry/EMD-2700">
http://www.ebi.ac.uk/pdbe/entry/EMD-2700</ext-link>, Publicly available 
from <source>The Electron Microscopy Data Bank (EMDB)</source>.</mixed-citation>
</ref>
...
GigaScience sample:
...
<ref>
<mixed-citation publication-type="data">Zheng LY, 
Guo XS, He B, Sun LJ, Pi CM, Jing H-C: Genome data from 
[<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5524/100012">
http://dx.doi.org/10.5524/100012</ext-link>] <source>GigaScience</source> 
<year iso-8601-date="2011">2011</year>.</mixed-citation>
</ref>
...
Data in figshare, referenced through a DOI:
...
<ref>
<mixed-citation publication-type="data">Di Stefano B, Collombet S, 
Graf T. <source>Figshare</source> <ext-link ext-link-type="doi" 
assigning-authority="figshare" 
xlink:href="http://dx.doi.org/10.6084/m9.figshare.939408">
http://dx.doi.org/10.6084/m9.figshare.939408</ext-link> 
(<year iso-8601-date="2014">2014</year>).</mixed-citation>
</ref>
...
Dryad Digital Repository, referenced through a DOI:
...
<ref>
<mixed-citation publication-type="data">Dubuis JO, Samanta R, 
Gregor T (<year iso-8601-date="2013">2013</year>).  Data from: 
<data-title>Accurate measurements of dynamics and reproducibility 
in small genetic networks</data-title>. <source>Dryad Digital 
Repository</source> doi:<pub-id pub-id-type="doi">10.5061/dryad.35h8v</pub-id>
</mixed-citation>
</ref>
...
GenBank Protein sample:
...
<ref>
<mixed-citation publication-type="data">
<data-title>Homo sapiens cAMP responsive element binding protein 1 
(CREB1), transcript variant A, mRNA</data-title>. <source>GenBank</source> 
<ext-link ext-link-type="genbank" xlink:href="NM_004379.3">NM_004379.3</ext-link>.
</mixed-citation>
</ref>
...
RNA Sequence sample:
...
<ref>
<mixed-citation publication-type="data">Xu, J. <etal/> 
<data-title>Cross-platform ultradeep transcriptomic profiling 
of human reference RNA samples by RNA-Seq</data-title>. 
<source>Sci. Data</source> <volume>1</volume>:<elocation-id>140020</elocation-id>.  
doi: <pub-id pub-id-type="doi">10.1038/sdata.2014.20</pub-id> 
(<year iso-8601-date="2014">2014</year>).</mixed-citation>
</ref>
...
Element Citation: Analyzed data are mentioned in the Data Availability Statement section, and the data are then cited in a Reference List within that section.
...
<back>
...
<sec sec-type="data-availability">
<title>Data Availability</title>
<p>The following datasets were generated or analyzed for this study:</p>
<ref-list>
<ref id="pone.0167830.data001">
   <label>D1</label>
   <element-citation publication-type="data" 
     specific-use="isSupplementedBy">
     <name><surname>Read</surname><given-names>K</given-names></name>
     <data-title>Sizing the Problem of Improving Discovery and Access
       to NIH-funded Data: A Preliminary Study (Datasets)</data-title>
     <source>Figshare</source><year iso-8601-date="2015">2015</year>
     <pub-id pub-id-type="doi" assigning-authority="figshare"
        xlink:href="https://doi.org/10.6084/m9.figshare.1285515">
        https://doi.org/10.6084/m9.figshare.1285515</pub-id>
   </element-citation>
</ref>
    
<ref id="pone.0167830.data002">
   <label>D2</label>
   <element-citation publication-type="data" 
     specific-use="references">
     <name><surname>Kok</surname><given-names>K</given-names></name>
     <name><surname>Ay</surname><given-names>A</given-names></name>
     <name><surname>Li</surname><given-names>L</given-names></name>
     <data-title>Genome-wide errant targeting by Hairy</data-title>
     <source>Dryad Digital Repository</source>
     <year iso-8601-date="2015">2015</year>
     <pub-id pub-id-type="doi" assigning-authority="dryad"
       xlink:href="https://doi.org/10.5061/dryad.cv323">
        https://doi.org/10.5061/dryad.cv323</pub-id>
   </element-citation>
</ref>
    
<ref id="pone.0167830.data003">
   <label>D3</label>
   <element-citation publication-type="data" 
     specific-use="references">
     <name><surname>Hoang</surname><given-names>C</given-names></name>
     <name><surname>Swift</surname><given-names>GH</given-names></name>
     <name><surname>Azevedo-Pouly</surname><given-names>A</given-names>
       </name>
     <name><surname>MacDonald</surname><given-names>RJ</given-names></name>
     <data-title>Effects on the transcriptome of adult mouse pancreas
        (principally acinar cells) by the inactivation of the Ptf1a gene 
        in vivo</data-title>
     <source>NCBI Gene Expression Omnibus</source>
     <year iso-8601-date="2015">2015</year>
     <pub-id pub-id-type="accession" 
        assigning-authority="NCBI"
        xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70542"
        >GSE70542</pub-id>
   </element-citation>
</ref>
</ref-list>
</sec>
</back>
...

Mixed Citation: Analyzed data sources are mentioned in the Data Availability Statement section, and the data sources are then cited in a Reference List within that section.
...
<back>
...
<sec sec-type="data-availability">
<title>Data Availability</title>
<p>The following datasets were generated or analyzed for this study:</p>
<ref-list>
<ref id="pone.0167830.data001">
<label>D1</label>
<mixed-citation publication-type="data" 
  specific-use="isSupplementedBy">
<string-name><surname>Read</surname> <given-names>K</given-names></string-name>. 
 <data-title>Sizing the Problem of Improving Discovery and Access
 to NIH-funded Data: A Preliminary Study (Datasets)</data-title>. 
 <source>Figshare</source>. <year iso-8601-date="2015">2015</year>. 
 <pub-id pub-id-type="doi" assigning-authority="figshare"
   xlink:href="https://doi.org/10.6084/m9.figshare.1285515">View Data</pub-id>
</mixed-citation>
</ref>
    
<ref id="pone.0167830.data002">
<label>D2</label>
<mixed-citation publication-type="data" 
  specific-use="references">
<string-name><surname>Kok</surname> <given-names>K</given-names></string-name>, 
 <string-name><surname>Ay</surname> <given-names>A</given-names></string-name>, 
 <string-name><surname>Li</surname> <given-names>L</given-names></string-name>.
 <data-title>Genome-wide errant targeting by Hairy</data-title>. 
 <source>Dryad Digital Repository</source>. <year iso-8601-date="2015">2015</year>. 
 <pub-id pub-id-type="doi" assigning-authority="dryad"
   xlink:href="https://doi.org/10.5061/dryad.cv323">View Data</pub-id>
</mixed-citation>
</ref>
    
<ref id="pone.0167830.data003">
<label>D3</label>
<mixed-citation publication-type="data" 
  specific-use="references">
<string-name><surname>Hoang</surname> <given-names>C</given-names></string-name>,
 <string-name><surname>Swift</surname> <given-names>GH</given-names></string-name>,
 <string-name><surname>Azevedo-Pouly</surname> <given-names>A</given-names></string-name>,
 <string-name><surname>MacDonald</surname> <given-names>RJ</given-names></string-name>.
 <data-title>Effects on the transcriptome of adult mouse pancreas
 (principally acinar cells) by the inactivation of the Ptf1a gene 
 in vivo</data-title>. <source>NCBI Gene Expression Omnibus</source>.
 <year iso-8601-date="2015">2015</year>.
 <pub-id pub-id-type="accession" 
   assigning-authority="NCBI"
   xlink:href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70542"
   >View Data</pub-id>
</mixed-citation>
</ref>
</ref-list>
</sec>
</back>
...