Section breaks in Office Open XML and WordML

Identifying the section breaks in a pure Office Open XML (OOXML) document is the ultimate nightmare: the only indication of a section break is the presence of a w:sectPr element within the last paragraph of the section. To complicate matters further, the last section in the document could be represented by a w:sectPr element that is a sibling of the w:p elements … and of course you could have documents without sections, in which case there would be no w:sectPr element anywhere within the XML. Just try to imagine writing an XSLT stylesheet that performs OOXML-to-HTML translation and splits the Word text into DIVs (a DIV for each section).

Fortunately, the task is much easier if you use WordML, which contains auxiliary hints in the wx namespace; in our case, the wx:sect element, which encloses all the paragraphs within a section.

For example, the following Word text …… generates this WordML markup (to get the corresponding OOXML, remove the wx:sect elements).
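
The OOXML rules described above (a w:sectPr inside the last paragraph of a section, an optional body-level w:sectPr for the final section, or no w:sectPr at all) can be sketched outside XSLT as well. The Python fragment below is only an illustration under a few assumptions: the function name split_into_sections and the sample markup are made up for this example, the paragraph-level w:sectPr is looked up inside w:pPr, and anything that is not a w:p (tables, for instance) is ignored.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by OOXML (.docx) documents.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def split_into_sections(body):
    """Group the w:p children of a w:body element into per-section lists.

    Hypothetical helper: a paragraph whose properties contain w:sectPr
    closes the current section; paragraphs left over at the end form the
    last (or only) section.
    """
    sections, current = [], []
    for child in body:
        if child.tag != f"{{{W}}}p":
            continue  # tables and the body-level w:sectPr are skipped in this sketch
        current.append(child)
        if child.find("w:pPr/w:sectPr", NS) is not None:
            sections.append(current)
            current = []
    if current:
        sections.append(current)
    return sections


# Made-up sample: two sections, the second one closed by a body-level w:sectPr.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>First section</w:t></w:r></w:p>
  <w:p><w:pPr><w:sectPr/></w:pPr><w:r><w:t>End of first</w:t></w:r></w:p>
  <w:p><w:r><w:t>Second section</w:t></w:r></w:p>
  <w:sectPr/>
</w:body>
"""

sections = split_into_sections(ET.fromstring(SAMPLE))
texts = [["".join(p.itertext()) for p in section] for section in sections]
print(texts)
```

Each inner list corresponds to one DIV in the HTML output; with WordML the same grouping is already provided by the wx:sect elements.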

Create post excerpts in Blogger

One of the features sorely missing in Blogger is the ability to write an excerpt for your post (WordPress supports several different methods), so I had to write my own JavaScript solution that provides functionality similar to the more tag in WordPress. It hides parts of the post’s text and displays a More button which reveals the hidden text. The hidden text is enclosed within SCRIPT tags:

<script>startHide()</script>
… extra text …
<script>endHide()</script>

I’m including a JavaScript library from one of my web sites into the Blogger template. If you want to have a Blogger-only solution, include the following JavaScript code in your template:

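// Excerpts are collapsed only after setMainPage() has been called.
var isMainPage = 0;
// Number of blocks written so far; used to build unique element ids.
var hideCount = 0;

function dw(t) { document.write(t); }

function setMainPage() { isMainPage = 1; }

// Opens a block. Without setMainPage() the text is written in a plain,
// visible DIV; otherwise a More link and a hidden DIV are written.
function startHide() {
  hideCount++;
  if (!isMainPage) { dw('<div id="show_' + hideCount + '">'); return; }
  dw('<p class="hideMenu" id="hideMenu_' + hideCount +
     '"><a href="javascript:showRest(' + hideCount + ')">More ...</a></p>');
  dw('<div id="hide_' + hideCount + '" style="display: none;">');
}

// Reveals the hidden block with the given number and hides its More link.
function showRest(id) {
  var e = document.getElementById('hide_' + id);
  if (e && e.style) e.style.display = "";
  e = document.getElementById('hideMenu_' + id);
  if (e && e.style) e.style.display = "none";
}

// Closes the DIV opened by startHide().
function endHide() {
  dw('</div>');
}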

Spelling errors might break your WordML translations

The code that generates a single PRE block from multiple Word paragraphs worked perfectly … until Word decided I made a spelling error in one of the listings. Depending on where it thinks the error is, Word can generate a w:proofErr element between w:p elements, breaking my code, which assumed that the w:p elements are adjacent.

To fix this problem, I had to change the following-sibling::*[1] expression into following-sibling::w:p[1] throughout the affected code. The fixed templates are included below:

<xsl:template match="w:p[w:pPr/w:pStyle/@w:val = 'code']">
  <xsl:if test="not(preceding-sibling::w:p[1]/w:pPr/w:pStyle/@w:val = 'code')">
    <pre class="{w:pPr/w:pStyle/@w:val}">
      <xsl:apply-templates select="." mode="pre" />
    </pre>
  </xsl:if>
</xsl:template>

<xsl:template match="w:p[w:pPr/w:pStyle/@w:val = 'code']" mode="pre">
  <xsl:apply-templates />
  <xsl:if test="following-sibling::w:p[1]/w:pPr/w:pStyle/@w:val = 'code'">
    <xsl:text>&#x0a;</xsl:text>
    <xsl:apply-templates mode="pre" select="following-sibling::w:p[1]" />
  </xsl:if>
</xsl:template>

<xsl:template match="*" mode="pre" />
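
To see why looking at the next w:p (rather than the next element of any kind) matters, here is a rough Python sketch of the same grouping. It is not part of the converter: the function names and the sample markup are invented for the illustration, and it assumes the Word 2003 WordML namespace.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Word 2003 (WordML) documents.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def style_of(paragraph):
    """Return the paragraph style name, or None if no style is set."""
    node = paragraph.find("w:pPr/w:pStyle", NS)
    return node.get(f"{{{W}}}val") if node is not None else None


def code_blocks(body):
    """Join runs of consecutive 'code' paragraphs into one string each.

    Only w:p children are inspected (the counterpart of
    following-sibling::w:p[1]), so a w:proofErr element sitting between
    two code paragraphs does not split the block.
    """
    blocks, current = [], []
    for paragraph in body.findall("w:p", NS):
        if style_of(paragraph) == "code":
            current.append("".join(paragraph.itertext()))
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


# Made-up sample with a spelling marker between two code paragraphs.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="code"/></w:pPr><w:r><w:t>line one</w:t></w:r></w:p>
  <w:proofErr w:type="spellStart"/>
  <w:p><w:pPr><w:pStyle w:val="code"/></w:pPr><w:r><w:t>line two</w:t></w:r></w:p>
  <w:p><w:r><w:t>Outro</w:t></w:r></w:p>
</w:body>
"""

blocks = code_blocks(ET.fromstring(SAMPLE))
print(blocks)
```

Had the loop walked over all child elements and treated anything without the code style as the end of a block, the w:proofErr element would have produced two separate blocks, which is the failure described above.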

Check for empty string or empty node-set

After my close encounter with the ternary logic of XSLT (details are here), I started worrying about the results of every test that could involve empty elements (depending on how you phrase the test, an empty string might not be equal to an empty node-set). To ensure that I’m comparing strings even when one of the operands might be empty, I’m using the XSLT string function, which converts whatever input it gets into a string (which can be reliably compared to another string). For example, to test whether the name attribute of the current element is empty or missing, use this test:

<xsl:if test="string(@name) = '' ">

WordML: Translate spaces in fixed-font text

I had problems with Blogger formatting, so I’ve decided to translate all fixed-font spaces in my Word texts into non-breakable spaces in translated Blogger-ready HTML. To do this conversion, I had to find the font of the current range … but it could be stored in the range properties, character style or paragraph style, and the range property could use a proportional font that overrides the character- or paragraph style fixed font.

I’ve defined the xsl:key instructions that extract the paragraph or character style fonts …

<xsl:key name="parafont" match="w:rFonts/@w:ascii" 
use="ancestor::w:style[@w:type = 'paragraph']/@w:styleId" />
<xsl:key name="rangefont" match="w:rFonts/@w:ascii"
use="ancestor::w:style[@w:type = 'character']/@w:styleId" />

… and used them in a pretty complex xsl:choose statement (it has to handle so many cases) in the w:t (text-within-range) template:

<xsl:template match="w:t/text()">
  <xsl:variable name="pfont"
    select="key('parafont',ancestor::w:p/w:pPr/w:pStyle/@w:val)" />
  <xsl:variable name="rfont"
    select="key('rangefont',ancestor::w:r/w:rPr/w:rStyle/@w:val)" />
  <!-- the context node is a text node, so the range properties
       have to be reached through the enclosing w:r element -->
  <xsl:variable name="font" select="ancestor::w:r/w:rPr/w:rFonts/@w:ascii" />
  <xsl:variable name="xlate" select="translate(.,' ','&#160;')" />
  <xsl:choose>
    <!-- font set directly on the range -->
    <xsl:when test="contains($font,'Courier')">
      <xsl:value-of select="$xlate" /></xsl:when>
    <xsl:when test="string($font) != ''">
      <xsl:value-of select="." /></xsl:when>
    <!-- font from the character style -->
    <xsl:when test="contains($rfont,'Courier')">
      <xsl:value-of select="$xlate" /></xsl:when>
    <xsl:when test="string($rfont) != ''">
      <xsl:value-of select="." /></xsl:when>
    <!-- font from the paragraph style -->
    <xsl:when test="contains($pfont,'Courier')">
      <xsl:value-of select="$xlate" /></xsl:when>
    <xsl:otherwise><xsl:value-of select="." /></xsl:otherwise>
  </xsl:choose>
</xsl:template>
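
The order of the xsl:when branches encodes a simple precedence rule: the font set on the range wins over the character style, which wins over the paragraph style. A minimal Python sketch of that rule follows; the function and parameter names are hypothetical and the font values would come from the lookups described above.

```python
NBSP = "\u00a0"  # non-breaking space, the same character as &#160;


def effective_font(range_font="", char_style_font="", para_style_font=""):
    """Return the first non-empty font: range, then character style, then paragraph style."""
    for font in (range_font, char_style_font, para_style_font):
        if font:
            return font
    return ""


def render_text(text, range_font="", char_style_font="", para_style_font=""):
    """Replace spaces with non-breaking spaces only when the effective font is a Courier font."""
    font = effective_font(range_font, char_style_font, para_style_font)
    if "Courier" in font:
        return text.replace(" ", NBSP)
    return text


# A proportional font set on the range overrides the fixed paragraph style font.
overridden = render_text("a b", range_font="Arial", para_style_font="Courier New")
# With no range or character style font, the paragraph style font decides.
inherited = render_text("a b", para_style_font="Courier New")
print(repr(overridden), repr(inherited))
```

Writing the rule as "first non-empty font wins" makes it easier to check that every branch of the xsl:choose statement is covered.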

WordML: Extract font from paragraph style

The paragraph styles in WordProcessingML have paragraph properties (w:pPr element) and range properties (w:rPr element). The font of a paragraph style is stored in the range properties (w:rPr/w:rFonts/@w:ascii attribute), as shown in the following snapshot from XML Notepad:

The way to extract the font name from a paragraph style based on the style name is very similar to the character style case: define an xsl:key that matches the font attribute of paragraph styles (w:type = 'paragraph') and uses the style name as the key value.

<xsl:key name="parafont" match="w:rFonts/@w:ascii"
  use="ancestor::w:style[@w:type = 'paragraph']/@w:styleId" />

WordML: generate PRE element from multiple paragraphs

In my Word-to-Blogger converter, I wanted to convert a block of paragraphs with the code style into a single PRE element. As Word does not generate a grouping of paragraphs of the same style, I had to develop a pretty convoluted solution:

  • A special template matches paragraphs with the code style.
  • The template performs the translation only if the preceding paragraph does not have the same style. This ensures the subsequent paragraphs with the code style don’t generate extra PRE elements.
  • The template generates the PRE element and sends the current element through another translation using pre mode.

<xsl:template match="w:p[w:pPr/w:pStyle/@w:val = 'code']">
  <xsl:if test="not(preceding-sibling::w:p[1]/w:pPr/w:pStyle/@w:val = 'code')">
    <pre class="{w:pPr/w:pStyle/@w:val}">
      <xsl:apply-templates select="." mode="pre" />
    </pre>
  </xsl:if>
</xsl:template>

The paragraph matching with mode=’pre’ is quite simple:

  • Child elements are processed (producing translated paragraph text).
  • If the following sibling has the code style (we haven’t reached the end of the PRE block), a newline character is appended with the xsl:text instruction and the sibling (the next code paragraph) is translated with mode=’pre’.

<xsl:template match="w:p[w:pPr/w:pStyle/@w:val = 'code']" mode="pre">
  <xsl:apply-templates />
  <xsl:if test="following-sibling::*[1]/w:pPr/w:pStyle/@w:val = 'code'">
    <xsl:text>&#x0a;</xsl:text>
    <xsl:apply-templates mode="pre" select="following-sibling::*[1]" />
  </xsl:if>
</xsl:template>

The default template for mode=’pre’ is empty, ensuring that non-paragraph elements accidentally translated with mode=’pre’ do not generate output text.

<xsl:template match="*" mode="pre" />

The solution would have been simpler if I had used the wx:pBdrGroup element that Word inserts around my code paragraphs (I’ve configured a border on the paragraph style), but the wx:pBdrGroup-based approach would fail if someone decided to change the border of the code style.

WordML: extract font from character style

The font of a character style (the style of a range of text, not the whole paragraph) is stored in the w:styles/w:style/w:rPr/w:rFonts/@w:ascii attribute, as shown in the following snapshot from XML Notepad:

To get the character font from the style name, use the following xsl:key definition:

<xsl:key name="rangefont" match="w:rFonts/@w:ascii"
  use="ancestor::w:style[@w:type = 'character']/@w:styleId" />

You should check the style type in the key definition to ensure that the key matches only the character styles.

Later on, you can use the rangefont key to check the font in a range-matching template. For example, to check whether the current range uses a Courier font, use the following xsl:choose block:

<xsl:choose>
  <xsl:when test="contains(w:rPr/wx:font/@wx:val,'Courier')">
    <!-- fixed-font specified in the range -->
  </xsl:when>
  <xsl:when test="contains(key('rangefont',w:rPr/w:rStyle/@w:val),'Courier')">
    <!-- fixed-font specified in the character style -->
  </xsl:when>
  <xsl:otherwise>
    <!-- not a fixed font -->
  </xsl:otherwise>
</xsl:choose>
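
If you find xsl:key hard to picture, think of it as a dictionary built once from the w:styles element and queried by style name. The Python sketch below mimics that idea; it is an illustration only, the function name font_index and the sample styles are invented, and unlike the key definition it only looks at w:rFonts directly inside the style's w:rPr.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by Word 2003 (WordML) documents.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def font_index(styles, style_type):
    """Map style ids to font names for styles of one type ('character' or 'paragraph')."""
    index = {}
    for style in styles.findall("w:style", NS):
        if style.get(f"{{{W}}}type") != style_type:
            continue  # same role as the [@w:type = '...'] predicate in the key
        fonts = style.find("w:rPr/w:rFonts", NS)
        if fonts is not None and fonts.get(f"{{{W}}}ascii"):
            index[style.get(f"{{{W}}}styleId")] = fonts.get(f"{{{W}}}ascii")
    return index


# Made-up styles: one paragraph style and one character style with a font,
# plus a paragraph style without any font definition.
STYLES = f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="code">
    <w:rPr><w:rFonts w:ascii="Courier New"/></w:rPr>
  </w:style>
  <w:style w:type="character" w:styleId="CodeChar">
    <w:rPr><w:rFonts w:ascii="Courier New"/></w:rPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Normal"/>
</w:styles>
"""

styles = ET.fromstring(STYLES)
rangefont = font_index(styles, "character")
parafont = font_index(styles, "paragraph")
print(rangefont, parafont)
```

A lookup such as rangefont.get("CodeChar", "") then plays the role of key('rangefont', ...), with the empty string standing in for the empty node-set returned for an unknown style.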

Drawing charts in your web pages

Here are a few links that will help you draw great charts in your web pages:

Open XML text file in Microsoft Word

The Word-to-Wiki converter macro I’ve described in one of the previous posts (developed in Word 2003) worked perfectly, but when I wanted to add a Word-to-Blogger macro (along the lines of the Word-to-Wiki concept, but with a different XSLT), things got complex. I didn’t want to have any whitespace between P tags generated by XSLT (Blogger interprets line breaks as implicit <br /> tags), so I wanted to generate XML, not HTML … only to find out that the default text converter used by Microsoft Word (wdOpenFormatAuto) …

Documents.Open FileName:=TxtPath, ConfirmConversions:=False, _
ReadOnly:=False, AddToRecentFiles:=False, _
Format:=wdOpenFormatAuto, Encoding:=65001

… removes the XML tags (leaving only the text nodes) when importing XML files as text. Next I’ve tried the wdOpenFormatText converter, only to find out that it cannot handle Unicode text. Great news … Finally I’ve managed to get exactly what I needed with the wdOpenFormatUnicodeText converter and msoEncodingUTF8 encoding:

Documents.Open FileName:=TxtPath, ConfirmConversions:=False, _
ReadOnly:=False, AddToRecentFiles:=False, _
Format:=wdOpenFormatUnicodeText, Encoding:=msoEncodingUTF8