
Document Engineering


Related link: http://mitpress.mit.edu/catalog/item/default.asp?tid=10476&ttype=2

The most surprising thing I picked up at last week's Open Standards 2005 conference in Sydney came from the CTO of an established company that provides electronic data exchange capability, especially for the shipping industry. He said that the documents they received were mainly EDI, then CSV, with very little XML, but that they converted all the data in-house to XML for easier processing. I expected it would be the other way around: lots of XML data coming in and being processed by old-school database tools.


But my surprise was not so much at the low external use rates of XML. After all, if you already supply your documents in one structured notation, it is tempting to see a move to XML as satisfying only cosmetic rather than business requirements. What raised my eyebrows was how intermixed XML, CSV and EDI all are now: an EDI house uses XML internally.
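As a rough illustration of that kind of in-house conversion, here is a minimal Python sketch that turns CSV rows into XML using only the standard library. The element names (shipments, shipment), the sample columns and the function name csv_to_xml are all hypothetical, invented for the example; nothing here reflects what the company in question actually runs. It also assumes every CSV column header is a valid XML element name.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text, root_tag="shipments", row_tag="shipment"):
    """Convert CSV text with a header row into a flat XML document.

    Assumption: each column header is a valid XML element name.
    """
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, row_tag)
        for field, value in row.items():
            # One child element per CSV column, named after the header.
            ET.SubElement(record, field).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data, not taken from any real feed.
sample = "id,port,weight_kg\n1,Sydney,1200\n2,Singapore,850\n"
xml_text = csv_to_xml(sample)
print(xml_text)
```

Once the data is in this form, the same XML tooling can be applied regardless of whether the original arrived as CSV, EDI or XML, which is presumably the attraction.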

Towards a New Discipline

Bob Glushko and Tim McGrath's standout new book, Document Engineering: Analyzing and Designing Documents for Business Informatics and Web Services, takes this intermixing even further. Glushko is from an SGML publishing and XML e-commerce background; McGrath is from an EDI and UBL background.


They see a new discipline of Document Engineering emerging. A nice summary is given in an IBM Research seminar. Document Engineering applies a dataflow approach to the whole organization, identifying and modeling which documents get sent between business processes, and what those documents contain. The documents could be transactional (EDI, XML invoices), publications (HTML, PDF, custom DTDs) or even a mix of both.

It's not Software Engineering, it's not IT, it's not web publishing, it's not Enterprise Architecture, it's not Business Process Re-engineering, but it straddles all these. Document Engineering is, of course, more sophisticated than simple dataflow: the analysis also includes signals and routing aspects.

Glushko teaches at the Center for Document Engineering at Berkeley, and this book, published by MIT Press, is clearly intended as an undergraduate textbook for similar courses. I recommend it for anyone involved in adopting a highly automated, loosely coupled Service Oriented Architecture.

Taster

The book features little key points (floating outdented paragraphs) throughout to provide easy summaries. Here is a taster:

  • Document interfaces maintain clean and stable relationships between business partners.

  • No single vocabulary can have enough semantic precision for all applications.

  • The decision about where to transform is a business one.

  • A common metamodel helps align different models.

  • Document Engineering treats supply chains as information flows.

  • Closer collaboration does not always mean more information exchange.

  • XML is not self-describing.

  • Semantic conflicts should be resolved when the context of use is being defined.

  • There is no sharp line between requirements analysis and document analysis.

  • Requirements that are so fundamental that everyone assumes them are precisely those that should be made explicit.

  • Customization by subtractive refinement doesn't work because the overlapping information isn't explicitly identified.

  • Huge estimates of potential savings are emerging from many document-centric industries.

  • Incomplete automation can leave the enterprise with a slow link in its information flow that nullifies most of the investments to improve other processes.

Practical, well-expressed and timely.

Document Analysis and Design

When XML was created, SGML authors had moved on from issues of syntax and were dealing with issues of how to model information (publishing-related in the first instance) in documents. Most prominently, the mid-90s Prentice Hall books

Document Engineering mainly takes the "analyze then assemble" kind of approach of the Maler book and pays only lip service (in s15.1.1.3, 'review' and 're-use') to the detailed knowledge of alternatives advocated by my book and (in s5.1.1.4) to customizing standard components as in Dave's book. This is, in one sense, fair enough, because it is not the place of a textbook to deal with the minutiae of particular schemas. However, the publishing experience is that people who use the "analyze then assemble" approach, but who don't have a good grounding in the tradeoffs of the different ways to implement structures, frequently make lousy DTDs or schemas.

This mirrors Christopher Alexander's finding that the first people who adopted his pattern language approach to building houses ended up with buildings that looked familiar: if you are only aware of one corner of the solution space you will only sit there.

But the book is primarily concerned with model-based XML, influenced by the database, object-oriented and business rules analysis realms. Other influences are UBL, RosettaNet, CMM, UML and pattern languages.

Quibbles

The only quibbles I have with the book are minor. The word 'context' is used throughout, but it is (perhaps necessarily) such a vague word as to make any sentence using it seem amorphous and suspect; there is a chapter, Analyzing the Context of Use, that helps. And in the discussion of transaction patterns, particularly "Offer and Acceptance", some brief treatment of the legal aspects would be appropriate for undergraduates: in particular, what is a legal contract, and which country's law applies to international transactions over the web. I don't want to shock my gentle readers, but the biggest sign of how far XML has emerged from its publishing roots is that the index refers to section numbers, not page numbers: probably unthinkable for an SGML book!

Read More Entries by Rick Jelliffe.

5 Comments

rjelliffe said:

Page numbers?
Indexing by section number is regarded, in the quality market, as a hack that you use when your indexes are created as a separate process from your typesetting, or when you are using typesetting systems not up to the job of making whole books, or as a sign of a tight deadline.

It is lower quality because it is more cumbersome to find the page: rather than being able to do a binary chop or other approximation method based on page numbers, the user has to locate section headings (where the numbers are) in the text and guess how long each section is in order to find their target.

In SGML books, one of the things that had to be proved was that you could make books of just as high quality with structural markup as you could using presentation tags (troff, TeX) or hand-made indexes.

Also, one of the selling points for SGML was the ability to handle large documents: SGML would be contrasted with weaker technologies such as using MS Word, where you were stuck with producing one chapter at a time, and so had to index by section number rather than page number.

When I made my book, which was quite large (about 650 pages), production issues made me divide it into three or so parts, so the index items have a part number and then a page number within that part, which is an intermediate approach that is quite acceptable as far as predictability for users goes.

Another issue to be factored in to the decision about whether to index by page number or section number is that usually people will use the same thing in the cross-references. If you use page numbers for cross-references, then when you make any last-minute changes to the text, you may have to go through the entire book to check that automated and forced page breaking still produce acceptable results: adding text in one place may cause the IDed object to cross a page boundary, which may cause references to it to require an extra digit, which may trigger different line and page breaks that flow through the whole document. That is an issue where deadlines fight against best quality, so where there are deadlines it can be prudent not to index by page number.

So my comment is that Document Engineering was not produced with the intent of proving, through the book itself, that marked-up documents can yield just as good a product as hand-indexed books. That is not to say that they haven't made the appropriate production decision; no professional slight was remotely intended! Indeed, I have no idea whether the book was made using declarative markup at all: it may have been written in an unstructured word processor with styles for all I know. (The design looks like the kind of thing that people do with TeX-based typesetting systems, as a guess, but nowadays it is difficult to say.) SGML books were read by publishing people, so their production values had to confirm their subject matter; XML books have a different market and so don't need anything special in them.

(As a further example, in my book I also have bullet lists using "radio buttons" with one item, the default or most important or typical case, selected. Or a bullet list made from tears, signifying woes. The intent was to demonstrate that SGML/XML didn't prevent you from playfulness or innovation in design. We had to eat our own dogfood, and it tasted nice in parts.)

bobfoster said:

Page numbers?
How do section numbers vs. page numbers relate to XML vs. SGML?

rjelliffe said:

docengineering.com -- sample chapters
My pleasure, Bob, and congratulations on the book: it really is a great achievement and a major step forward for the industry.

glushko said:

docengineering.com -- sample chapters
Thanks, Rick, for your review of our Document Engineering book. We have set up a site for the book where we have sample chapters and will be putting up lecture notes, news, and other useful stuff...

docengineering.com

-bob glushko

MrBubbles712 said:

Bad link... but it will be back soon.
Just a heads up: the MIT store is doing a bit of an update today ('doh!), according to the front page. So when it comes back up today or tomorrow, maybe we can see this book.
