Tagging Keywords
Keywords are words and phrases that name the key concepts of an article or section for
search and retrieval purposes. Typically an author, publisher, or indexing service assigns a small
number of key terms to extend lookup beyond the full text, to highlight the most important topics
described in an article, or to map an article to a taxonomy. Indexers assigning keywords can make
sure that someone searching for “this topic” will find this article or section, even
if the exact words are not present in the text. Keywords may therefore be taken from the text itself, drawn from an outside
vocabulary or taxonomy, or chosen by authors, indexers, or publishers.
In this Tag Set, keywords come in sets (<kwd-group>), each of which may come from a particular source or ontology (such as
“author-created” or the
“Medical Subject Headings” (MeSH)). The sets are
named using the @kwd-group-type attribute.
Here are some sample tagged keywords that a contributor chose as best describing an
article:
<kwd-group kwd-group-type="author-created">
  <kwd>acid precipitation</kwd>
  <kwd>acid rainfall</kwd>
  <kwd>smelting region</kwd>
  <kwd>Aluminum residues</kwd>
  <kwd>Sulphur dioxide</kwd>
  <kwd>Copper-nickel smelters</kwd>
</kwd-group>
Historical Note: Earlier versions of this Tag Set allowed for multiple sets of keywords, but
the individual keywords in the set had no structure; they were just text, words, and
phrases — possibly with face markup, superscripts, and subscripts (such
as “<kwd>XML</kwd>”,
“<kwd>H<sub>2</sub>O</kwd>”, and
“<kwd>blood-brain barrier</kwd>”). The current NISO JATS
accommodates more elaborate keyword structures as described below.
Tagging Complex/Compound Keywords
Keywords can possess an internal structure of their own; for example, a keyword may
include both a textual phrase and its corresponding code (“863 Icelandic
sagas”). Many styles of such compound keywords can be handled in this Tag Set
with the <compound-kwd> element, which is
modeled as a series of repeatable parts (<compound-kwd-part>). These parts can differentiate a text/code pair, divide a coded
keyword into multiple code segments, describe a hierarchy, and name a variety of other
compound structures. The @content-type
attribute on the <compound-kwd-part>
element is used to name each part, describe the role it plays, or otherwise define how
each part functions within the keyword as a whole.
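As one possible illustration of dividing a coded keyword into multiple code segments, the sketch below splits an IPC-style classification code (G06F 17/30) into its parts. The @kwd-group-type and @content-type values are hypothetical names chosen for this illustration; this Tag Set does not prescribe them.

```xml
<!-- Sketch only: "ipc", "section", "class", "subclass", "main-group",
     and "subgroup" are assumed names, not values defined by the Tag Set. -->
<kwd-group kwd-group-type="ipc">
  <compound-kwd>
    <compound-kwd-part content-type="section">G</compound-kwd-part>
    <compound-kwd-part content-type="class">06</compound-kwd-part>
    <compound-kwd-part content-type="subclass">F</compound-kwd-part>
    <compound-kwd-part content-type="main-group">17</compound-kwd-part>
    <compound-kwd-part content-type="subgroup">30</compound-kwd-part>
  </compound-kwd>
</kwd-group>
```

Keeping each segment in its own <compound-kwd-part> lets a receiving system match on any level of the code without parsing the string.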
Keywords with Codes
The simplest case of a compound keyword is a keyword that includes both a textual
phrase and its corresponding code, for example, “863 Icelandic sagas”.
Both the code and the text can be tagged as keyword parts (<compound-kwd-part>) in the element <compound-kwd>, with the @content-type attribute used to name the role or
type of each part:
<kwd-group kwd-group-type="ISO-463">
  <compound-kwd>
    <compound-kwd-part content-type="ISO-463-code">863</compound-kwd-part>
    <compound-kwd-part content-type="ISO-463-text">Icelandic sagas</compound-kwd-part>
  </compound-kwd>
  ...
</kwd-group>
Abbreviation and Expansion Keywords
Compound keywords can also be used to handle keywords that hold an abbreviation
and its expansion. Both the abbreviation and the expansion are tagged as <compound-kwd-part> in a single <compound-kwd>. The @kwd-group-type attribute on <kwd-group>, which is sometimes used to name the
source or the descriptor for the keywords, can be used instead to name the type of
information, such as “abbreviations”. For example:
<kwd-group kwd-group-type="abbreviations">
  <compound-kwd>
    <compound-kwd-part content-type="abbrev">WT</compound-kwd-part>
    <compound-kwd-part content-type="expansion">Wild type</compound-kwd-part>
  </compound-kwd>
  <compound-kwd>
    <compound-kwd-part content-type="abbrev">CFU</compound-kwd-part>
    <compound-kwd-part content-type="expansion">Colony-forming unit</compound-kwd-part>
  </compound-kwd>
</kwd-group>
Tagging Nested or Hierarchical Keywords
Some publishers assign hierarchical topics to articles. For example, a publisher
might tag selected topics (“Blood–brain barrier”), nested inside themes
(“Cellular and Molecular Biology”), grouped into larger units like
“Neuroscience”, and grouped into still larger units such as
“Biological Sciences”, forming the following hierarchy:
Biological Sciences
  Neuroscience
    Cellular and Molecular Biology
      Blood–brain barrier
This kind of structure places an article in context or sorts articles into
categories. It is commonly seen in Tables of Contents, where all the Neuroscience
articles are grouped together, all the Biochemistry articles are grouped together, and so on.
Since keywords are intended to aid in the search and retrieval of articles
rather than to establish an article’s context, best practice is to tag this
topic structure as subject groups:
<article-categories>
  <subj-group subj-group-type="keywords">
    <subject>Biological Sciences</subject>
    <subj-group subj-group-type="keywords">
      <subject>Neuroscience</subject>
      <subj-group subj-group-type="keywords">
        <subject>Cellular and Molecular Biology</subject>
        <subj-group subj-group-type="keywords">
          <subject>Blood–brain barrier</subject>
        </subj-group>
      </subj-group>
    </subj-group>
  </subj-group>
</article-categories>
Hierarchical (nested) keyword structures are also possible. These, while still rare, are
becoming more common as taxonomies are used to tag keywords for articles. Each <nested-kwd> contains a single keyword and any levels of nesting under that
keyword. Since nested keywords are recursive, each lower level is tagged as an inner <nested-kwd>. The same example is used below to show how nested keywords work:
<kwd-group kwd-group-type="author" xml:lang="en">
  <nested-kwd>
    <kwd>Biological Sciences</kwd>
    <nested-kwd>
      <kwd>Neuroscience</kwd>
      <nested-kwd>
        <kwd>Cellular and Molecular Biology</kwd>
        <nested-kwd>
          <kwd>Blood–brain barrier</kwd>
        </nested-kwd>
      </nested-kwd>
    </nested-kwd>
  </nested-kwd>
</kwd-group>
Keywords from a Formal Taxonomy
Attributes for the <kwd-group>: If the keywords in a <kwd-group> element are terms from a thesaurus (ontology, taxonomy, term-list, vocabulary, industry glossary, or other known source), the JATS vocabulary attributes should be used to preserve the semantics. The source named can be a formal ontology or an informal field of study.
If the content of an <unstructured-kwd-group> element consists of several terms from a thesaurus (ontology, taxonomy, term-list, vocabulary, industry glossary, or other known source), the JATS vocabulary attributes can also be used.
Two attributes are used in this Tag Set to identify such a controlled or uncontrolled vocabulary:
Attribute | Description |
---|---|
vocab | Name of the controlled or uncontrolled vocabulary, taxonomy, ontology, index, database, or similar that is the source of the term. For example, for a subject term, a value might be the IPC Codes (“ipc”) or MeSH headings (“mesh”). For an uncontrolled term, the value might be an area of study such as “medical-devices” or merely the word “uncontrolled”. |
vocab-identifier | Unique identifier of the vocabulary, such as (but not limited to) a URI or DOI. For example, for Dublin Core (DCC), the identifier might be “http://dublincore.org/documents/2012/06/14/dces/”. |
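A minimal sketch of how these two attributes might appear on a keyword group follows. The @vocab value “mesh” comes from the description above; the @vocab-identifier URL and the two sample terms are assumptions made for illustration and should be replaced with the values your vocabulary actually publishes.

```xml
<!-- Sketch only: the vocab-identifier URL and the sample terms are
     illustrative assumptions, not values taken from this Tag Set. -->
<kwd-group kwd-group-type="mesh"
           vocab="mesh"
           vocab-identifier="https://www.nlm.nih.gov/mesh/">
  <kwd>Blood-Brain Barrier</kwd>
  <kwd>Neurosciences</kwd>
</kwd-group>
```

Naming the vocabulary once on the group suits sets in which every keyword comes from the same source.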
Attributes for keyword elements: If the content of a <kwd>,
<compound-kwd>, or
<nested-kwd> element is a term from a thesaurus (ontology, taxonomy, term-list, vocabulary, industry glossary, or other known source), the vocabulary attributes should be used to preserve the semantics. The source named can be a formal ontology or an informal field of study.
In addition to the @vocab and @vocab-identifier attributes described above, which identify the vocabulary of a keyword, two further vocabulary attributes identify an individual term from such a controlled or uncontrolled vocabulary:
Attribute | Description |
---|---|
vocab-term | The content of the element is the free prose version of the vocabulary or taxonomic term. The @vocab-term attribute holds the canonical version of the same term, as it appears in the vocabulary. For example, if the attribute value is “digitized-vor”, the element might contain the display text “Digitized Version of Record”. |
vocab-term-identifier | Unique identifier of the term within a specific vocabulary, such as (but not limited to) an item number, a URI, DOI, etc. |
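The sketch below shows one way the four vocabulary attributes might be combined on a single <kwd>. The element content is the display text, while @vocab-term carries the canonical form. The @vocab-identifier URL and the @vocab-term-identifier value are illustrative assumptions and should be checked against the current release of the vocabulary before use.

```xml
<!-- Sketch only: the identifier values are illustrative assumptions
     and must be verified against the vocabulary itself. -->
<kwd-group kwd-group-type="mesh">
  <kwd vocab="mesh"
       vocab-identifier="https://www.nlm.nih.gov/mesh/"
       vocab-term="Blood-Brain Barrier"
       vocab-term-identifier="D001812">blood–brain barrier</kwd>
</kwd-group>
```

Because the canonical term and its identifier travel in attributes, the display text can follow house style without losing the link to the vocabulary entry.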
String Keywords
Although it is not considered best practice, in some tag sets a single keyword
element has been used to hold an entire list of keywords, with connecting punctuation
used to mark the boundary between individual keywords rather than distinguishing
individual keywords with markup. Untangling this single list into individual keywords
might require publisher-specific string parsing or even human judgment, so some
receiving archives have chosen not to break the list into single keywords. The element
<unstructured-kwd-group> has been added
as an alternative to <kwd> inside
<kwd-group> to handle these lists. The
element is named <unstructured-kwd-group>
to indicate that it is a grouping of keywords and not a single keyword. Such lists can
then be tagged as shown in the following examples:
<kwd-group>
  <unstructured-kwd-group>XML, DTD, schema, RELAX NG, XSD, models, UML, Schematron</unstructured-kwd-group>
</kwd-group>

<kwd-group>
  <unstructured-kwd-group>molecular chaperones; surface plasmon resonance; dynamic light scattering; trypsin digestion; citrate synthase; <italic>Neurospora crassa</italic>; protéines chaperonnes, résonance des plasmons de surface; diffusion dynamique de la lumière; digestion par la trypsine; citrate synthase; <italic>Neurospora crassa</italic></unstructured-kwd-group>
</kwd-group>
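For comparison, the first list above could be tagged as individual keywords when its boundaries are unambiguous. This sketch assumes that a comma always separates two keywords, which holds for that list but cannot be assumed in general.

```xml
<!-- Sketch only: assumes every comma in the source list is a keyword boundary. -->
<kwd-group>
  <kwd>XML</kwd>
  <kwd>DTD</kwd>
  <kwd>schema</kwd>
  <kwd>RELAX NG</kwd>
  <kwd>XSD</kwd>
  <kwd>models</kwd>
  <kwd>UML</kwd>
  <kwd>Schematron</kwd>
</kwd-group>
```

The second list shows why such a split is not always safe: it mixes commas and semicolons as separators and combines two languages, so dividing it reliably would take human judgment.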