From JATS to HTML, and back again

The general problem is ...

... not ...

Making (good clean) JATS data into (good clean) HTML - “downhill”
... not ...

“Shoveling” HTML data into JATS - “scraping”

... but getting “nice, clean, pretty” JATS out of “messy, sloppy, ambiguous” HTML ...

Why the problem is a problem

Attributing JATS semantics to arbitrary structures in HTML is difficult

(Typically we have only loose and variable indicators)
Called an “upconversion” because it requires energy to filter “signal” from “noise”

(Yes, this energy takes the form of person-hours spent in analysis and development)
Can be achieved successfully in XSLT
- But each transformation is a one-off
  
  (Covers a defined family of input, not general purpose)
- And the inputs must be consistent
- So typically (in the real world) the XSLT is never finished

(An outline of) a solution

This is also the solution!
- We propose it become the responsibility of the uphill format (JATS profile) to define a convenient target for any downstream application wishing to come back up
- This means we do not agree to accept Just Any HTML, but only a defined subset
- The web developers say they can live with this
  
  As long as it is clear, consistent, and transparent (exposed)
Why not define this subset as a mapping of our JATS profile (JATS tagging extant)?

Call it “JATS-flavored HTML”?

And it won't be messy/sloppy/ambiguous by design
(Assumption: the HTML provider is motivated to produce this for us)

No pressure

XSLT coming back (producing JATS from HTML) can “sniff” not “scrape”

Do not attempt to read any HTML, only HTML that follows our rules

(where we can detect the JATS unequivocally)

Purposely allow the results to be invalid for source data that is not recognized (GIGO)
This turns the problem inside out:
- Drive the process by designing a mapping:
  
  A consistent HTML representation of (JATS) data we already have
  
  (The “rules of the language” as an HTML profile)
- XSLT (both ways) supports this translation and nothing else; it stays small (and its growth is controlled)
- Strict rules provide extra assurance that any valid data is intended as such, not simply happenstance
  
  (Ideally, HTML becomes a “JATS carrier”)
  
  Confidence in reversibility
Also presents a viable development pathway
- Start small and build to address complexity as needed
- Design the mapping based on actual requirements and analytics (actual data)
  
  (Not only on “what if” scenarios)

Crypto-JATS HTML

<div class="book-part-meta">
  <div class="title-group">
    <div class="title">Ongeont Douq Iukt</div>
  </div>
</div>
<div class="body">
  <p class="p">Ok u fvonaronan unz setoquqh lonqo...</p>
  <p class="p">Lankezoqenl tvo velv fqamumeseth aw u Hoqkeun....</p>
</div>

JATS element types represented simply as @class assignments
More or less common practice on the web: put the goodness in the @class
Our rule #1: when converting from HTML, look for the JATS among the class values
Rule #1a: Define “JATS” as (a controlled list of element types we wish to allow)
Rule #1b: when multiple JATS-significant classes are given, don't read any
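
Rules #1, #1a and #1b can be sketched in a few lines. This is a Python illustration, not the XSLT used in the project; the function name `jats_name` and the contents of `ALLOWED_JATS` are assumptions made for the example.

```python
# Hypothetical controlled list (Rule #1a); a real profile would enumerate its own subset.
ALLOWED_JATS = {"sec", "title", "p", "disp-quote", "fig",
                "book-part-meta", "title-group", "body"}

def jats_name(class_attr):
    """Return the JATS element name carried by an HTML @class value, or None."""
    tokens = set((class_attr or "").split())
    hits = tokens & ALLOWED_JATS          # Rule #1: look for the JATS among the classes
    # Rule #1b: exactly one JATS-significant class, otherwise read nothing.
    return next(iter(hits)) if len(hits) == 1 else None

for value in ("disp-quote", "disp-quote inner", "disp-quote fig", "figure"):
    print(repr(value), "->", jats_name(value))
```

Classes that are not on the list (such as "inner") are simply ignored, so HTML developers keep the freedom to add their own hooks alongside the JATS name.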

Where to hide the JATS in the HTML

We can determine which element types are block (div) or inline (span) based on discoverable information.

JATS DTD (content models; elements in mixed content should be span)
Or: poll the data
Or: let XSLT poll the data (generating an XSLT for us)
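
“Polling the data” might look like the following sketch, which treats any element found alongside non-whitespace text as inline (a span candidate). It is written in Python rather than XSLT; `poll_inline` and the sample document are invented for illustration, and a real poll would run over a representative corpus rather than one fragment.

```python
import xml.etree.ElementTree as ET

def poll_inline(root):
    """Collect names of elements that occur in mixed content (next to text)."""
    inline = set()
    for parent in root.iter():
        children = list(parent)
        if not children:
            continue
        # Mixed content: real text before the first child or after any child.
        mixed = bool((parent.text or "").strip()) or any(
            (child.tail or "").strip() for child in children
        )
        if mixed:
            inline.update(child.tag for child in children)
    return inline

sample = ET.fromstring(
    "<sec><title>On <italic>Daphnia</italic></title>"
    "<p>Text with <bold>emphasis</bold> and a <xref rid='f1'>link</xref>.</p></sec>"
)
inline = poll_inline(sample)
print(sorted(inline))
```

Elements never seen in mixed content (here sec, title and p) default to block. An element that appears both ways in the data would need a decision by hand.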

Not everything should be div and span elements, should it? (Arguments for and against.)

We can improve mappings based on prior knowledge:

JATS list can become HTML ul, ol or dl

(with or without div element wrappers for lists inside p)
Inline formatting can be preserved as such

italic becomes i, bold becomes b, sup becomes sup etc.
sec/title elements can become h1, h2, h3 etc. for convenience
And/or exploit the superior structures of HTML5 to reflect document organization

Since we will be looking at @class values not element types here, we have flexibility.

Producing HTML from JATS: down the hill

Rules are very simple:

Any JATS element becomes div, span, p or what have you (as mapped)
Its name is assigned as the (only) value of @class
- <sec> becomes <div class="sec">
- <p> becomes <p class="p">
We accept (apparent) redundancy for the sake of simplicity (transparency)

Exceptions can be made for mappings in cases where “functional semantics” must be respected

E.g. images, links and cross-references (must propagate values to img/@src, a/@href etc.)
Where convenient, address processing requirements in the HTML system

The “exception layer” (XSLT) can sit on top of the “generic layer” (XSLT) for generality and reusability.
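
A minimal sketch of the two layers, in Python instead of XSLT. The mapping table `HTML_NAME`, the function `to_html` and the example.org link are assumptions for illustration; attributes other than the link target are left out here, and a fuller table would also distinguish div from span for unlisted inline elements.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
# Hypothetical mapping; anything not listed falls back to div.
HTML_NAME = {"p": "p", "italic": "i", "bold": "b", "sup": "sup", "ext-link": "a"}

def to_html(jats):
    # Generic layer: pick the HTML element, record the JATS name as the only @class.
    html = ET.Element(HTML_NAME.get(jats.tag, "div"), {"class": jats.tag})
    html.text, html.tail = jats.text, jats.tail
    # Exception layer: functional semantics must reach the attribute browsers act on.
    if jats.tag == "ext-link" and jats.get(XLINK_HREF):
        html.set("href", jats.get(XLINK_HREF))
    for child in jats:
        html.append(to_html(child))
    return html

doc = ET.fromstring(
    '<sec xmlns:xlink="http://www.w3.org/1999/xlink"><title>Intro</title>'
    '<p>See <ext-link xlink:href="http://example.org/">this</ext-link>.</p></sec>'
)
html = to_html(doc)
print(ET.tostring(html, encoding="unicode"))
```

Keeping the exception in one clearly marked place is what lets the generic rule stay a one-liner.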

Representing JATS attributes in the HTML

First we tried cramming these into @class attributes also:

<book-part book-part-number="B1" book-part-type="monograph"> ... </book-part>

becomes

<div class="book-part book-part-number..B1 book-part-type..monograph">

Although this works (yes, it supports round-tripping), there were ... concerns ...

HTML5 offers a better solution

<div class="book-part" data-book-part-number="B1" data-book-part-type="monograph">

(Our HTML team preferred this, and who wouldn't?)

Plus there are exceptions for certain attributes with generalized or global semantics such as @id and @lang, which can be mapped directly.
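
The attribute mapping, including the direct-mapped exceptions, can be sketched as a pair of functions. This is an illustrative Python sketch: the names `html_attrs`, `jats_attrs` and `DIRECT` are hypothetical, and it assumes that xml:lang is the only namespaced attribute in play.

```python
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Attributes with global semantics map directly instead of taking the data- prefix.
DIRECT = {"id": "id", XML_LANG: "lang"}

def html_attrs(jats_name, attrs):
    """Going down: element name to @class, attributes to data-* (or direct)."""
    out = {"class": jats_name}
    for name, value in attrs.items():
        out[DIRECT.get(name, "data-" + name)] = value
    return out

def jats_attrs(attrs):
    """Coming back up: strip the data- prefix, restore direct mappings, drop the rest."""
    back = {html: jats for jats, html in DIRECT.items()}
    out = {}
    for name, value in attrs.items():
        if name in back:
            out[back[name]] = value
        elif name.startswith("data-"):
            out[name[len("data-"):]] = value
    return out

source = {"book-part-number": "B1", "book-part-type": "monograph", "id": "ch1"}
as_html = html_attrs("book-part", source)
print(as_html)
print(jats_attrs(as_html) == source)
```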

Reading JATS back again (pulling back up)

“JATS sniffing”

Promote any permissible @class value to be an element type

<div class="disp-quote"> becomes <disp-quote>
To be permissible, a @class value must be
- One of a known list of JATS element types (in our controlled JATS subset)
- The only such permissible value given on the @class (no ambiguity)
  
  <div class="disp-quote fig"> is not changed
Results may be valid if
- @class of the parent is correct for the element (according to JATS)
- Other JATS constraints are respected re: ordering and cardinality
On any element so renamed, also render data-* attributes (stripping “data-” prefix)

<div class="disp-quote" data-content-type="epigraph"> becomes <disp-quote content-type="epigraph">

(Yes, coming back into JATS we may lose other markup in the HTML.)
Copy everything else in place

<div class="figure"> becomes <div class="figure">

(“figure” is not recognized as a JATS element type)

No, this is not valid JATS
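
The sniffing rules on this slide, sketched in Python rather than XSLT. `ALLOWED_JATS` and `sniff` are hypothetical names, the element list is a stand-in for a real controlled subset, and direct-mapped attributes such as @id are omitted for brevity.

```python
import xml.etree.ElementTree as ET

ALLOWED_JATS = {"sec", "title", "p", "disp-quote", "fig"}  # hypothetical subset

def sniff(html):
    hits = set(html.get("class", "").split()) & ALLOWED_JATS
    if len(hits) == 1:
        # Promote the single permissible class value to an element type,
        # carrying over data-* attributes with the prefix stripped.
        out = ET.Element(next(iter(hits)))
        for name, value in html.attrib.items():
            if name.startswith("data-"):
                out.set(name[len("data-"):], value)
    else:
        # Not recognized, or ambiguous: copy in place (GIGO).
        out = ET.Element(html.tag, dict(html.attrib))
    out.text, out.tail = html.text, html.tail
    for child in html:
        out.append(sniff(child))
    return out

page = ET.fromstring(
    '<div class="sec"><div class="disp-quote" data-content-type="epigraph">'
    '<p class="p">Quoted.</p></div><div class="figure">Unmapped</div>'
    '<div class="disp-quote fig">Ambiguous</div></div>'
)
result = sniff(page)
print(ET.tostring(result, encoding="unicode"))
```

In this input the first quote is renamed, while the "figure" and "disp-quote fig" elements come through as HTML div elements, which is exactly what makes the result detectably invalid.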

Imperfect by design

The results coming back up may happen to be valid JATS, if
- Nothing was left unrecognized as JATS (and therefore merely copied)
- Everything recognized as JATS is also structured properly
- Required attributes are given, attribute values are valid
(Big if!)
Also, this is lossless only from the JATS point of view
- Does not capture any arbitrary semantics from the HTML
- Does not guarantee fidelity in HTML going up and then down (e.g. HTML element types)

Fun problems

Metadata
Links and linking logic
Tables
Out-of-line or embedded formats - MathML, SVG ...

As usual, much time can be spent working at the edges. (How good is good enough?)

Perhaps surprisingly, these did not bog us down.

What works

Simplicity of rules (so far) makes problems and unresolved issues easy to detect, diagnose and correct.
I.e., when it works, it works, and when it doesn't, it's easy to see why not.
Also, it's easily extensible:
- Mix this approach with special handling
- Can be applied in either “strict” or “permissive” forms

Constraining the HTML crypto format

“JATS Architectural Form”?

Validation in the application has its limits. We want to know of issues before runtime!

However, JATS models map into HTML only as co-occurrence constraints.

(Combinations of values of @class or other attributes, assigned to parents and children, with allowances.)

Parent/child relations
Element type ordering/cardinality (as represented in grammars)
Restrictions on attribute assignments

This can be complex! Especially projected into crypto-JATS.

Just for instance (constraining the HTML crypto-JATS)

For example, report (using Schematron) if any data-* attributes other than data-content-type or data-specific-use appear on a div with @class of “disp-quote”*:

<rule context="*[a:classes(.)='disp-quote']">
  <let name="jats-attrs"       value="'content-type','specific-use'"/>
  <let name="wrong-data-attrs" value="a:data-attr(.)[not(@name=$jats-attrs)]"/>
  <assert test="empty($wrong-data-attrs)">
    Wrong data attribute found on 'disp-quote':
    <value-of select="string-join($wrong-data-attrs/@name,', ')"/>
  </assert>
</rule>

<div class="disp-quote inner" data-inner-value="x"> ... </div>

(We wish to see an error because <disp-quote inner-value="x"> will be invalid in JATS.)

Relies on an XSLT function library (out of line)
Awfully tedious to construct by hand!
But we know which attributes are allowed where, in JATS:

Maybe we could auto-generate the Schematron code?
Or build a table-driven version (or version that performs dynamic lookup)?

* JATS analogous rule: the only attributes permitted on disp-quote are @content-type and @specific-use (and others accounted for otherwise such as @id and @xml:lang)
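
One way the table-driven version might go, sketched in Python rather than Schematron. The `PERMITTED` table and the function `wrong_data_attrs` are hypothetical; in practice the table could be generated from the JATS schema rather than typed by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup: JATS element -> attributes it permits
# (beyond those handled separately, such as id and xml:lang).
PERMITTED = {"disp-quote": {"content-type", "specific-use"}}

def wrong_data_attrs(elem):
    """List data-* attributes that would become invalid JATS attributes."""
    hits = set(elem.get("class", "").split()) & set(PERMITTED)
    if len(hits) != 1:
        return []  # not recognized as a JATS element covered by this table
    allowed = PERMITTED[next(iter(hits))]
    return sorted(
        name[len("data-"):]
        for name in elem.attrib
        if name.startswith("data-") and name[len("data-"):] not in allowed
    )

bad = ET.fromstring('<div class="disp-quote inner" data-inner-value="x" '
                    'data-content-type="epigraph"/>')
print(wrong_data_attrs(bad))
```

Adding a rule for another element is then a one-line change to the table, not another hand-written rule.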

“Ascetic HTML”

A methodology not a technology
- Policies externalizing the requirements for HTML-based systems
  
  To integrate well with data sets already described in XML (such as JATS)
- Tools for translating back and forth between XML and HTML/CSS-based systems
- Obviously the same approach (and even tools) will work for JATS, TEI, you name it
- A lesson of XML: people will always have to mess with the markup to get things right!
  
  (Agreeing on a “standard” or platform is not like running a race; it's like putting on your shoes)
  
  This will be true in the web environment also
  
  And we can help
Should not spurn Markdown and other non-angle-bracket approaches

(Quite the contrary - as long as they map to a tree-shaped document object, we can cope)
Specifically, this entails controlling the ways vocabulary may be hidden (latent or implicit) in the HTML

And offering, as a trade-off, the capabilities of XML-based processing in a semantically well-controlled environment

“Our discipline is strict so that life may be easy” (Idries Shah)

Externalizing the problem

Yet only by declaring and formalizing means of enforcing this control outside the application are we able to define it

This is despite the fact that we are focused on data sets, not only on (nominal) standards
I.e., a processor must be able to validate against a set of declarations

If only so that parties know how to define scopes of action (and responsibility)

And where to start discussions over remedies
Without such assurances, round-tripping becomes another data mining operation

Will HTML-based systems buy it?

Yes to the extent we can make things easier for them ...?

Yes this means an HTML (subset) schema! and probably other semi-auto-generated tools as well ...?

RelaxNG framework for basic structures
Schematron (generated via schema inspection/query)
CSS as validation technology? (Using fallbacks to detect unwarranted markup)
Eric van der Vlist's Examplotron
Use an XML database to provide examples to test against?
Generate a schema from the home (JATS) schema?
A stopgap: run a mini-application (XProc) to perform transformation and validate to JATS directly

Is it worth the effort?

Can we have our JATS cake and eat it too?

Acknowledgements

De Gruyter Publishers for posing this problem and helping develop a prototype solution;
Eric van der Vlist for his prior work:

e.g. http://eric.van-der-vlist.com/blog/2006/05/05/2277_validating_microformats/
Norm Walsh for his prior work:

e.g. http://norman.walsh.name/2006/04/13/validatingMicroformats

(Proposes essentially this architecture with some critical differences.)

All Aboard!

Wendell Piez

JATS-Con 2015

April 21 2015

Contact: wapiez@wendellpiez.com