Beware of the laughing horse
Laurent Galichet, ISO
In 2011, ISO embarked on its XML journey. The base DTD chosen was JATS, and customizations were made to capture
standards metadata and content. The result became known as the ISO STS (Standards Tag Set). The first acid test was to convert
ISO's legacy content from Word/PDF to ISO STS-compliant XML. How ISO went about this task is the subject of this paper.
Aim: To convert over 30,000 standards (650,000 pages, in English and French) into XML in two years.
Method: An RFP was launched in early 2011 for potential providers of XML conversion. Two providers were shortlisted and, after site visits
by the then project manager, the director of IT, the director of standards development and the Secretary-General of ISO,
one provider was chosen.
Theory: The contract and pricing had already been agreed upon. The contract provided for a set-up period of two months, with a view to mass
conversion commencing on 1 January 2012 and ending in December 2013.
Practice: A project manager was appointed on 1 November 2011 to lead the project. The set-up period consisted of an iterative process
of marking up the same set of content, reviewing it and sending feedback to the provider, who then re-marked it up and sent it back. Source
files were MS Word, cPDFs and scanned PDFs. The set-up period took six months rather than the two planned.
Mass conversion: Batches were prepared for the mass conversion according to a few criteria. It was agreed that batches should be composed of
documents totaling about 625 pages. Also, in order for the team to get used to the structure of standards, short documents
(fewer than 20 pages) were batched up first. Equally, easier file formats were prioritized, so the initial batches were short
MS Word documents. These accounted for roughly 10,000 documents. Larger documents followed and, once the MS Word source files
had all been batched up, cPDFs were sent and, lastly, image PDFs. In retrospect, this ordering was probably a mistake.
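
The batching criteria described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure ISO actually used: the Doc record, the format labels and the rule of closing a batch whenever the source format changes are all hypothetical. Only the 625-page target, the 20-page threshold and the Word, cPDF, image PDF ordering come from the text.

```python
from dataclasses import dataclass

# Source formats, easiest first, as described in the text.
# The label strings themselves are assumptions made for this sketch.
FORMAT_ORDER = {"word": 0, "cpdf": 1, "image_pdf": 2}
TARGET_PAGES = 625    # approximate batch size stated in the text
SHORT_DOC_PAGES = 20  # documents shorter than this are batched first


@dataclass
class Doc:
    doc_id: str  # hypothetical identifier
    fmt: str     # one of the keys of FORMAT_ORDER
    pages: int


def make_batches(docs, target=TARGET_PAGES):
    """Group documents into batches of roughly `target` pages.

    Ordering: easier formats first; within a format, short documents
    first, then the rest by increasing page count.
    """
    ordered = sorted(
        docs,
        key=lambda d: (FORMAT_ORDER[d.fmt], d.pages >= SHORT_DOC_PAGES, d.pages),
    )
    batches, current, current_pages = [], [], 0
    for doc in ordered:
        # Close the current batch when the format changes (an assumption)
        # or when adding this document would exceed the page target.
        if current and (
            current[-1].fmt != doc.fmt or current_pages + doc.pages > target
        ):
            batches.append(current)
            current, current_pages = [], 0
        current.append(doc)
        current_pages += doc.pages
    if current:
        batches.append(current)
    return batches
```

A document longer than the target ends up in a batch of its own, so the 625-page figure is treated as approximate rather than as a hard limit, in keeping with the "about 625 pages" wording.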
Results: Conversion ended in February 2014, so the project overran by two months. The budget was also spent in full, including the 25% contingency
amount. The quality obtained was very good and the XML sent was fed directly into ISO's online browsing platform, which is
regularly used by many.
Conclusion: You cannot just expect anyone to get it right without getting your hands dirty. The old adage "you get out what you put
in" is very much appropriate if you are considering a legacy conversion project.