Technical Publications Software: Classification vs XML Metadata
on 05-19-2015 04:20 PM - edited
Technical publications software products vary in their approach to classification. Tech pubs authoring requires classification of the content – it’s one of the most important but often most difficult parts of the tech writer's job. The content you create in a repository to promote reuse needs to be found before it can be used by others. As tech writers create XML content, they add metadata and keywords to classify it. Some tech pubs organizations go so far as to create approved keyword lists, hierarchies of categorization and formal taxonomies; if a user needs to add a term, there’s a formal process to review, approve, and add it.
However, as our product lines grow and change, our “known” – our ontology – grows, and the context of metadata and keywords changes as well. Now we’ve got an entire set of extra tasks and significant time devoted to managing the classification of technical publication content.
In XML content, the time and additional work grow even faster, because we put metadata and keywords not only on the object but also in the XML content itself. We’ve been working towards richer, more complete data. But does all that richness have to be in the XML?
Fair warning – for those of you who are XML purists, if that last statement made you uncomfortable, you might want to stop reading now.
Many industries create product documentation in XML for its inherent benefits – i.e., reuse and the associated savings for authoring, translating and automated publishing, along with the benefits of more consistent data and shorter time to create documents. But not all industries are required to exchange XML, or deliver raw XML. Commercial and hi-tech products, software, life sciences and medical devices, heavy equipment manufacturers, and energy groups often create their technical publications and publish directly to formatted online or hard copy documents. So their XML is never interchanged with customers or program partners, and they may never need to deliver an electronic, industry-standard-compliant set of XML topics.
In those situations, why should we impose the extra time and effort to plug additional classification into the XML? For those tech writers, the end goal isn’t to make that XML as rich and complete as possible. The goal is to make the final publication as complete and effective as is appropriate for their end users.
This frees us up to look at classification and all the related ways of managing XML content in a whole new way.
The truth is, XML attribute classification is not enough. Unless you deal with products simpler than a toaster, keywords and metadata taxonomies struggle to hold all the complexity of the ways we manage product and document configuration. That means we could continue adding and updating classifications to content until, in some cases, there’s more metadata than content.
In an optimal scenario, classification comes as a side effect or result of doing work, or of the data you’re working with, not as extra steps to input classification content into the XML topic. For customers with products managed in a PLM environment, much of this semantic and product classification is already available. Options and variants that are applied to product configuration, change information and release states of products – all are available to the tech writer to use, and not just as source information.
Prof. Dr. Wolfgang Ziegler and Prof. Dr. Heiko Beier define this characteristic in technical publications software as semantic classification in a recent tcworld article, “Content delivery portals: The future of modular content.” They note that “Ideally, the automated classification and metadata enrichment is controlled through semantic models. The knowledge that specific customer-groups or industries have specific requirements can be provided centrally through a model for example, and considered while indexing content. Data of this type is usually present manifold, but is not brought in context with the contents.”
When this information can be inherited from the engineering and product world into technical documentation through technical publications software, and organized by both category or part relationships and reuse of parts directly as content, we realize multiple benefits:
First, the writer is saving time – both in the initial creation of the illustration or topic, and in future research and updates to that content.
Second, the risk of inaccurate information is reduced, because we’re applying it from the source, instead of re-creating, retyping, or reinterpreting it into a classification hierarchy.
Third, we’re providing all writers in the organization with a more complete view of the entire ontology of information both available and applicable to documents and products.
Finally, we can modify and apply classification as needed, without driving additional, often inaccurate, revisions to the content.
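To make the idea of inherited classification concrete, here is a minimal sketch in Python. It assumes a topic only records which parts it documents, and that category, variant and release state live on the product data. The `Part` and `Topic` classes, the field names and the sample values are all hypothetical illustrations, not the API of Teamcenter or any other PLM product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Part:
    """Hypothetical stand-in for a product item managed in a PLM system."""
    part_id: str
    category: str
    variants: frozenset
    release_state: str


@dataclass
class Topic:
    """An XML topic that records only which parts it documents."""
    topic_id: str
    part_ids: list = field(default_factory=list)


def inherited_classification(topic, parts_by_id):
    """Derive a topic's classification from the parts it documents."""
    parts = [parts_by_id[pid] for pid in topic.part_ids]
    return {
        "categories": sorted({p.category for p in parts}),
        "variants": sorted({v for p in parts for v in p.variants}),
        "release_states": sorted({p.release_state for p in parts}),
    }


# Sample product data (invented for illustration only).
parts_by_id = {
    "P-100": Part("P-100", "hydraulics", frozenset({"standard", "arctic"}), "released"),
    "P-200": Part("P-200", "electrical", frozenset({"standard"}), "in-work"),
}
topic = Topic("T-1", ["P-100", "P-200"])

classification = inherited_classification(topic, parts_by_id)
print(classification)
```

In this sketch the writer never types a category or variant into the topic. If the product data changes, the derived classification changes with it the next time it is computed.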
When we have to modify an XML topic simply to change the metadata, we’re imposing additional constraints and rules about how and where that topic can be reused. At a minimum, writers would have to spend additional time analyzing whether the change to the content impacts the other documents it is referenced from. By relating the classifications that are not critical to content, instead of embedding them, we allow topics to be more flexible in their reuse. We can add context and applicability without impacting the existing uses for that content.
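The difference between relating and embedding can be sketched in a few lines of Python. This is an illustration under assumptions, not how any particular content management system stores its data: the `topics` dictionary, the `relations` table and the function names are all hypothetical. Classifications are held beside the topic, so adding one leaves the topic's XML unchanged.

```python
import hashlib
from collections import defaultdict

# Hypothetical topic store: topic ID -> XML content as authored.
topics = {
    "T-1": "<topic id='T-1'><title>Replace the filter</title></topic>",
}

# Classifications are kept as relations beside the topics, not inside the XML.
relations = defaultdict(set)


def fingerprint(topic_id):
    """Hash of the topic's XML; it changes only if the content changes."""
    return hashlib.sha256(topics[topic_id].encode("utf-8")).hexdigest()


def classify(topic_id, term):
    """Relate a classification term to a topic without touching its XML."""
    relations[topic_id].add(term)


def find(term):
    """Return the IDs of all topics related to the given term."""
    return sorted(tid for tid, terms in relations.items() if term in terms)


before = fingerprint("T-1")
classify("T-1", "maintenance")
classify("T-1", "model-2016")
after = fingerprint("T-1")

print(before == after)     # the topic content was not revised
print(find("model-2016"))  # the topic is still findable by its new term
```

Because the fingerprint is identical before and after, documents that already reference the topic have nothing new to review, yet the topic can be found under the added terms.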
Obviously this doesn’t work for all customers. Aerospace and defense customers who are required to provide their content in S1000D and similar compliant XML structures will always have metadata- and attribute-heavy authoring processes. Tech pubs groups who produce interactive online documents depend on metadata and attributes in the XML to drive the appropriate display of document content. In the future, though, they may want to consider pushing that classification into the final output of the technical publications software, instead of imposing that work on the writers up front.
About the blogger: Trish Laedtke is the Product Manager for Content and Document Management applications in Teamcenter. Her focus is on integrating tech pubs and supporting roles into the PLM environment, and taking advantage of the knowledge stored in Teamcenter to provide more accurate and effective documentation.