... not ...
Making (good clean) JATS data into (good clean) HTML - “down hill”
... not ...
“Shoveling” HTML data into JATS - “scraping”
... but getting “nice, clean, pretty” JATS out of “messy, sloppy, ambiguous” HTML ...
Attributing JATS semantics to arbitrary structures in HTML is difficult
(Typically we have only loose and variable indicators)
Called an “upconversion” because it requires energy to filter “signal” from “noise”
(Yes, this energy takes the form of person-hours spent in analysis and development)
Can be achieved successfully in XSLT
But each transformation is a one-off
(Covers a defined family of input, not general purpose)
And the inputs must be consistent
So typically (in the real world) the XSLT is never finished
This is also the solution!
We propose it become the responsibility of the uphill format (JATS profile) to define a convenient target for any downstream application wishing to come back up
This means we do not agree to accept Just Any HTML, but only a defined subset
The web developers say they can live with this
As long as it is clear, consistent, and transparent (exposed)
Why not define this subset as a mapping of our JATS profile (JATS tagging extant)?
Call it “JATS-flavored HTML”?
And it won't be messy/sloppy/ambiguous by design
XSLT coming back (producing JATS from HTML) can “sniff” not “scrape”
Do not attempt to read any HTML, only HTML that follows our rules
(where we can detect the JATS unequivocally)
Purposely allow the results to be invalid for source data that is not recognized (GIGO)
This turns the problem inside out:
Drive the process by designing a mapping:
A consistent HTML representation of (JATS) data we already have
(The “rules of the language” as an HTML profile)
Strict rules provide extra assurance that any valid data is intended as such, not simply happenstance
(Ideally, HTML becomes a “JATS carrier”)
Confidence in reversibility
Also presents a viable development pathway
Design the mapping based on actual requirements and analytics (actual data)
(Not only on “what if” scenarios)
<div class="book-part-meta">
  <div class="title-group">
    <div class="title">Ongeont Douq Iukt</div>
  </div>
</div>
<div class="body">
  <p class="p">Ok u fvonaronan unz setoquqh lonqo...</p>
  <p class="p">Lankezoqenl tvo velv fqamumeseth aw u Hoqkeun....</p>
@class assignments
@class
We can determine which element types are block (div) or inline (span) based on discoverable information.
Everything shouldn't be div and span elements, should it? (Arguments for and against.)
We can improve mappings based on prior knowledge:
JATS list can become HTML ul, ol or dl (with or without div element wrappers for lists inside p)
Inline formatting can be preserved as such
italic becomes i, bold becomes b, sup becomes sup etc.
sec/title elements can become h1, h2, h3 etc. for convenience
And/or exploit the superior structures of HTML5 to reflect document organization
Since we will be looking at @class values not element types here, we have flexibility.
Rules are very simple:
Each JATS element becomes an HTML div, span, p or what have you (as mapped)
Its name is assigned as the (only) value of @class
<sec> becomes <div class="sec">
<p> becomes <p class="p">
Exceptions can be made for mappings in cases where “functional semantics” must be respected
(img/@src, a/@href etc.)
The “exception layer” (XSLT) can sit on top of the “generic layer” (XSLT) for generality and reusability.
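A minimal sketch of the generic layer with a small exception table on top, written in Python rather than XSLT purely for illustration. The names `INLINE`, `HTML_NAME` and `to_html` are hypothetical, as is the choice of which JATS elements count as inline; attributes are ignored here.

```python
import xml.etree.ElementTree as ET

# Hypothetical classification of JATS element names; a real profile would
# derive this from the schema or from analysis of actual data.
INLINE = {"italic", "bold", "sup", "sub", "named-content"}

# Hypothetical "exception layer": HTML element types chosen for convenience.
HTML_NAME = {"p": "p", "italic": "i", "bold": "b", "sup": "sup", "sub": "sub"}


def to_html(jats: ET.Element) -> ET.Element:
    """Map one JATS element (and its subtree) to JATS-flavored HTML."""
    # Generic layer: block elements become div, inline elements become span.
    default = "span" if jats.tag in INLINE else "div"
    # The JATS element name is the only value of @class.
    html = ET.Element(HTML_NAME.get(jats.tag, default), {"class": jats.tag})
    html.text, html.tail = jats.text, jats.tail
    for child in jats:
        html.append(to_html(child))
    return html


sec = ET.fromstring(
    "<sec><title>Intro</title><p>Some <italic>text</italic>.</p></sec>"
)
print(ET.tostring(to_html(sec), encoding="unicode"))
# <div class="sec"><div class="title">Intro</div><p class="p">Some <i class="italic">text</i>.</p></div>
```

Because the JATS name always travels in @class, the exception table can change which HTML element is used without affecting the way back up.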
First we tried cramming these into @class attributes also:
<book-part book-part-number="B1" book-part-type="chapter"> ... </book-part>
becomes
<div class="book-part book-part-number..B1 book-part-type..chapter">
Although this works (yes, it supports round-tripping), there were ... concerns ...
<div class="book-part" data-book-part-number="B1" data-book-part-type="chapter">
(Our HTML team preferred this, and who wouldn't?)
Plus there are exceptions for certain attributes with generalized or global semantics such as @id and @lang, which can be mapped directly.
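The attribute mapping can be sketched as a small function, here in Python for illustration. The names `DIRECT` and `html_attributes` are hypothetical, and the sketch assumes that, apart from xml:lang, the JATS attributes carry no namespace.

```python
# Attributes with global semantics are mapped directly (hypothetical table).
# The long key is how Python's ElementTree spells the xml:lang attribute.
DIRECT = {
    "id": "id",
    "{http://www.w3.org/XML/1998/namespace}lang": "lang",
}


def html_attributes(jats_name: str, jats_attrs: dict) -> dict:
    """Build the HTML attributes for one JATS element."""
    attrs = {"class": jats_name}  # the element name is the only @class value
    for name, value in jats_attrs.items():
        # Everything not mapped directly becomes a data-* attribute.
        attrs[DIRECT.get(name, "data-" + name)] = value
    return attrs


print(html_attributes(
    "book-part",
    {"id": "bp1", "book-part-number": "B1", "book-part-type": "chapter"},
))
# {'class': 'book-part', 'id': 'bp1', 'data-book-part-number': 'B1', 'data-book-part-type': 'chapter'}
```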
Promote any permissible @class value to be an element type
<div class="disp-quote"> becomes <disp-quote>
To be permissible, a @class value must be:
The only such permissible value given on the @class (no ambiguity)
<div class="disp-quote fig"> is not changed
Results may be valid if @class of the parent is correct for the element (according to JATS)
On any element so renamed, also render data-* attributes (stripping “data-” prefix)
<div class="disp-quote" data-content-type="epigraph"> becomes <disp-quote content-type="epigraph">
(Yes, coming back into JATS we may lose other markup in the HTML.)
Copy everything else in place
<div class="figure"> becomes <div class="figure">
(“figure” is not recognized as a JATS element type)
No, this is not valid JATS
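The rules for coming back up can be sketched as follows, again in Python for illustration rather than XSLT. `JATS_NAMES` is a hypothetical, deliberately tiny subset of JATS element names, and `to_jats` is a hypothetical name. The sketch reads "permissible" as "exactly one @class token is a recognized JATS name", and it leaves out the direct mappings for @id and @lang.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of JATS element names, for illustration only.
JATS_NAMES = {"sec", "title", "p", "disp-quote", "fig", "italic", "bold"}


def to_jats(html: ET.Element) -> ET.Element:
    """Promote a recognized @class value to an element type, else copy."""
    recognized = [c for c in html.get("class", "").split() if c in JATS_NAMES]
    if len(recognized) == 1:
        # Unambiguous: rename the element and render its data-* attributes.
        out = ET.Element(recognized[0])
        for name, value in html.attrib.items():
            if name.startswith("data-"):
                out.set(name[len("data-"):], value)
    else:
        # Not recognized, or ambiguous: copy in place (GIGO).
        out = ET.Element(html.tag, dict(html.attrib))
    out.text, out.tail = html.text, html.tail
    for child in html:
        out.append(to_jats(child))
    return out


quote = ET.fromstring(
    '<div class="disp-quote" data-content-type="epigraph">'
    '<p class="p">Hi</p></div>'
)
print(ET.tostring(to_jats(quote), encoding="unicode"))
# <disp-quote content-type="epigraph"><p>Hi</p></disp-quote>
```

Nothing here checks validity; whether the result is valid JATS is left to a later validation step.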
The results coming back up may happen to be valid JATS, if
(Big if!)
Also, this is lossless only from the JATS point of view
Does not capture any arbitrary semantics from the HTML
Does not guarantee fidelity in HTML going up and then down
(e.g. HTML element types)
As usual, much time can be spent working at the edges. (How good is good enough?)
Perhaps surprisingly, these did not bog us down.
Also, it's easily extensible:
Mix this approach with special handling
Can be applied in either “strict” or “permissive” forms
Validation in the application has its limits. We want to know of issues before runtime!
However, JATS models map into HTML only as co-occurrence constraints.
(Combinations of values of @class or other attributes, assigned to parents and children, with allowances.)
This can be complex! Especially projected into crypto-JATS.
For example, report (using Schematron) if any data-* attributes other than data-content-type or data-specific-use appear on a div with @class of “disp-quote”*:
<rule context="*[a:classes(.)='disp-quote']">
  <let name="jats-attrs" value="'content-type','specific-use'"/>
  <let name="wrong-data-attrs" value="a:data-attr(.)[not(@name=$jats-attrs)]"/>
  <assert test="empty($wrong-data-attrs)">
    Wrong data attribute found on 'disp-quote':
    <value-of select="string-join($wrong-data-attrs/@name,', ')"/>
  </assert>
</rule>
<div class="disp-quote inner" data-inner-value="x"> ... </div>
(We wish to see an error because <disp-quote inner-value="x"> will be invalid in JATS.)
But we know which attributes are allowed where, in JATS:
Maybe we could auto-generate the Schematron code?
* JATS analogous rule: the only attributes permitted on disp-quote are @content-type and @specific-use (and others accounted for otherwise such as @id and @xml:lang)
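The same check can be driven from a table of which attributes JATS allows where, which is the kind of data a generator could work from. A minimal sketch in Python, for illustration only: `ALLOWED` and `wrong_data_attrs` are hypothetical names, and the table holds just the one entry from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: JATS element name -> attributes permitted as data-*.
# (@id and @lang are accounted for otherwise and never appear as data-*.)
ALLOWED = {"disp-quote": {"content-type", "specific-use"}}


def wrong_data_attrs(root: ET.Element) -> list:
    """Report data-* attributes that would be invalid once back in JATS."""
    messages = []
    for el in root.iter():
        for cls in el.get("class", "").split():
            if cls not in ALLOWED:
                continue  # no rule for this @class value
            wrong = [
                name for name in el.attrib
                if name.startswith("data-")
                and name[len("data-"):] not in ALLOWED[cls]
            ]
            if wrong:
                messages.append(
                    f"Wrong data attribute found on '{cls}': {', '.join(wrong)}"
                )
    return messages


bad = ET.fromstring('<div class="disp-quote inner" data-inner-value="x"/>')
print(wrong_data_attrs(bad))
# ["Wrong data attribute found on 'disp-quote': data-inner-value"]
```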
A methodology not a technology
Policies externalizing the requirements for HTML-based systems
To integrate well with data sets already described in XML (such as JATS)
A lesson of XML: people will always have to mess with the markup to get things right!
(Agreeing on a “standard” or platform is not like running a race; it's like putting on your shoes)
This will be true in the web environment also
And we can help
Should not spurn markdown and other non-angle-bracket approaches
(Quite the contrary - as long as they map to a tree-shaped document object, we can cope)
Specifically, this entails controlling the ways vocabulary may be hidden (latent or implicit) in the HTML
And offering, as a trade-off, the capabilities of XML-based processing in a semantically well-controlled environment
“Our discipline is strict so that life may be easy” (Idries Shah)
Yet only by declaring and formalizing means of enforcing this control outside the application are we able to define it
This is despite the fact that we are focused on data sets, not only on (nominal) standards
I.e., a processor must be able to validate against a set of declarations
If only so that parties know how to define scopes of action (and responsibility)
And where to start discussions over remedies
Without such assurances, round-tripping becomes another data mining operation
Yes to the extent we can make things easier for them ...?
Yes this means an HTML (subset) schema! and probably other semi-auto-generated tools as well ...?
Eric van der Vlist for his prior work:
e.g. http://eric.van-der-vlist.com/blog/2006/05/05/2277_validating_microformats/
Norm Walsh for his prior work:
e.g. http://norman.walsh.name/2006/04/13/validatingMicroformats
(Proposes essentially this architecture with some critical differences.)