Using XML features in MS Word for content migration

I’d talked about content migration in the ECM Offshoring piece recently. Content authors often use Microsoft Word to create unstructured content and then publish it to their existing system. This becomes a problem while we implement a new CMS for them, because the content entered during that period has to be moved to the new CMS-powered site. One way, obviously, is to migrate it manually. Here’s another way, though: by using the improved XML support in Microsoft Word to create structured content, content migration can be made slightly less painful. Word allows one to define an XML schema that can be applied to a document. So we created a schema that has the fields the CMS will use (Title, Byline, Author etc.). Authors create Word documents as they always did and apply these tags. Here are some screenshots:

Original Word Document

This is what the original content looked like. Fields like Headline and Author have their own styles. As you can see, there’s not much discipline here, which makes it very difficult to import this kind of content.

The next screenshot shows the Word document after using the XML features. The authors create it as they did before but apply tags.
Document after using XML schema

The authors select the appropriate part of the document and apply a tag using the XML structure pane on the right side. In fact, they don’t even need to do this, because we can create a Word template as a starting point, with these tags already in place and placeholders like “Enter Headline Here” in the appropriate styles. So now we have a Word document that is displayed on their existing site as is (obviously without tags) and an equivalent XML file that we will use for migrating to the new CMS.

The next screenshot shows how the above Word document is saved as XML.

XML Representation of Word document

This XML does not contain any formatting or style information; only the tags defined in the XML schema are saved. It should be trivial now to write some scripts to import it into any CMS that uses a structured content repository.
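To give a feel for what such an import script might look like, here is a minimal Python sketch. It is only an illustration: the root element `article`, the `Body` field, the namespace URL and the sample values are all made up, and the real element names would come from whatever schema you attach in Word. The step that actually pushes the record into the CMS is left out, since that depends entirely on the target system.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of the data-only XML saved from Word.
# Element names and the namespace are assumptions, not Word's actual output.
SAMPLE_XML = """<article xmlns="http://example.com/cms/article">
  <Title>Using XML features in MS Word</Title>
  <Byline>Notes on content migration</Byline>
  <Author>Jane Doe</Author>
  <Body>First paragraph of the story.</Body>
</article>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_fields(xml_text):
    """Return a dict mapping each top-level field name to its text."""
    root = ET.fromstring(xml_text)
    return {local_name(child.tag): (child.text or "").strip() for child in root}


if __name__ == "__main__":
    record = extract_fields(SAMPLE_XML)
    # A real script would hand this record to the CMS import API here.
    for field, value in record.items():
        print(f"{field}: {value}")
```

The resulting dictionary maps field names to plain text, which is usually close to what a structured content repository expects as input.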

So what happens if an author deletes a tag by mistake? Would the resulting document still be XML? Word provides the ability to protect parts of a document, so we can ensure that no one can delete tags.
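As an extra safety net on the import side, a script could also check that the expected fields are present before anything goes into the CMS. This is a rough sketch and not something Word provides; the list of required fields and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed list of fields the CMS requires; adjust to the real schema.
REQUIRED_FIELDS = ("Title", "Byline", "Author")


def missing_fields(xml_text, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in the document."""
    root = ET.fromstring(xml_text)
    present = {
        child.tag.rsplit("}", 1)[-1]  # drop any '{namespace}' prefix
        for child in root
        if (child.text or "").strip()
    }
    return [name for name in required if name not in present]


# Hypothetical document where Byline was deleted and Author left blank.
broken = "<article><Title>Hello</Title><Author> </Author></article>"
print(missing_fields(broken))  # ['Byline', 'Author']
```

Documents that come back with a non-empty list can be sent back to the author instead of being imported half-complete.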

The other issue could be that some authors might not like to see tags while they are typing. No problem: the tags can be hidden.

In summary, I think this mechanism gives us a way to reuse content without having to import it manually. Also, the users can keep using their existing tools without having to learn anything new.
