Archives
-
LINQ over OOXML: Loving it
In a previous blog post I showed how to efficiently iterate over a WordprocessingML document’s content when creating an object model.
One of the interesting quirks of WordprocessingML is that the content of a section (a section defines information like page size and orientation), instead of being nested inside a section element, is determined by a marker section element in the final paragraph of the section. On top of that, the final section of a document is handled completely differently: its section element instead appears directly on the body, at the very end of the document.
While I suspect there are good reasons for doing this (I’m guessing the aim was to minimize the amount of change to the document XML structure when doing updates. I'd love to find out if anyone knows), it does make parsing the document and splitting content into sections more difficult. Fortunately with the power of LINQ we can solve this problem in just a couple of statements.
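To make the layout described above concrete, here is a minimal sketch that parses a hand-written, heavily trimmed body with two sections. The sample markup and the helper names (SectionMarkupSketch, CountMarkerParagraphs, HasFinalSectionProperties) are assumptions made for illustration and are not part of the object model used in this post.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SectionMarkupSketch
{
    // The WordprocessingML main namespace.
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Illustrative sample only: the first section ends with a marker inside the
    // second paragraph's properties, the final section's marker sits on the body.
    public const string SampleBody = @"<w:body xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <w:p><w:r><w:t>First section, first paragraph</w:t></w:r></w:p>
  <w:p>
    <w:pPr><w:sectPr><w:pgSz w:w='12240' w:h='15840'/></w:sectPr></w:pPr>
    <w:r><w:t>First section, last paragraph</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Final section</w:t></w:r></w:p>
  <w:sectPr><w:pgSz w:w='15840' w:h='12240' w:orient='landscape'/></w:sectPr>
</w:body>";

    // Counts paragraphs whose properties carry a section marker.
    public static int CountMarkerParagraphs(XElement body)
    {
        return body.Elements(W + "p")
                   .Count(p => p.Elements(W + "pPr").Elements(W + "sectPr").Any());
    }

    // True when the last child of the body is the final section's properties.
    public static bool HasFinalSectionProperties(XElement body)
    {
        XElement last = body.Elements().LastOrDefault();
        return last != null && last.Name == W + "sectPr";
    }
}
```

Parsing SampleBody with XElement.Parse and passing the result to both helpers shows the two places a section's properties can live: inside a paragraph, and directly under the body.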
Example
private List<Section> CreateSections(List<Content> content, SectionProperties finalSectionProperties)
{
    var sectionParagraphs =
        content.Select((c, i) => new { Paragraph = c as Paragraph, Index = i })
            // only want paragraphs with section properties
            .Where(o => o.Paragraph != null && o.Paragraph.Properties.SectionProperties != null)
            // get the section properties
            .Select(o => new { SectionProperties = o.Paragraph.Properties.SectionProperties, o.Index })
            // add the final section properties plus end index to result
            .Union(new[] { new { SectionProperties = finalSectionProperties, Index = content.Count - 1 } })
            .ToList();

    List<Section> sections = new List<Section>();
    int previousSectionEndIndex = -1;

    foreach (var sectionParagraph in sectionParagraphs)
    {
        List<Content> sectionContent =
            content.Select((c, i) => new { Content = c, Index = i })
                .Where(o => o.Index <= sectionParagraph.Index && o.Index > previousSectionEndIndex)
                .Select(o => o.Content)
                .ToList();

        Section section = new Section(this, sectionContent, sectionParagraph.SectionProperties);
        sections.Add(section);

        previousSectionEndIndex = sectionParagraph.Index;
    }

    return sections;
}
The first LINQ statement queries the document content and gets all paragraphs that have associated SectionProperties, along with their positions within the content. That information is returned in an anonymous type. The position comes from the Select method, which has an overload that passes each item’s position to the lambda as well as the item itself. Since the final section properties object is outside the content, and therefore not in a paragraph, it is unioned on to the end of the result with its position set to the last index of the content collection.
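The indexed Select overload is easy to try in isolation. The following is a minimal sketch on plain strings rather than the Content types used above; IndexedSelectSketch and IndexesWhere are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IndexedSelectSketch
{
    // Returns the positions of the items that satisfy the predicate, using the
    // Select overload whose lambda receives both the item and its index.
    public static List<int> IndexesWhere<T>(IEnumerable<T> items, Func<T, bool> predicate)
    {
        return items.Select((item, index) => new { Item = item, Index = index })
                    .Where(o => predicate(o.Item))
                    .Select(o => o.Index)
                    .ToList();
    }
}
```

Calling IndexesWhere(new[] { "p", "p+sectPr", "table", "p+sectPr" }, s => s.EndsWith("sectPr")) gives the positions 1 and 3. Note that an anonymous object can only be combined with the query result when its property names, types and order match those of the anonymous type the query produces.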
Now that we have all of the sections and their end positions, we loop over the query’s result and create a Section object for each one. Each Section is passed its section properties and the document content that lies after the previous section’s end index, up to and including its own end index. The Select overload that provides an element’s position is used again to find the wanted content.
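The same slicing can also be expressed with Skip and Take instead of filtering on an index. This is a swapped-in alternative, not the approach used in the listing above, and it is sketched on generic lists with hypothetical names (SectionSplitSketch, SplitAtEndIndexes).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SectionSplitSketch
{
    // Splits items into consecutive groups, where each value in endIndexes is
    // the inclusive index of the last item in a group. Assumes endIndexes is
    // sorted in ascending order.
    public static List<List<T>> SplitAtEndIndexes<T>(IList<T> items, IEnumerable<int> endIndexes)
    {
        var groups = new List<List<T>>();
        int previousEnd = -1;

        foreach (int end in endIndexes)
        {
            // skip everything up to the previous end, then take up to this end
            groups.Add(items.Skip(previousEnd + 1).Take(end - previousEnd).ToList());
            previousEnd = end;
        }

        return groups;
    }
}
```

Splitting the five items a, b, c, d, e at end indexes 1 and 4 produces the groups (a, b) and (c, d, e). The trade-off is mostly readability: the index filter mirrors the prose description, while Skip and Take avoid projecting every item into an anonymous type on each pass.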
And that is pretty much it. There is nothing here that couldn’t be achieved in C# 2.0, but with LINQ, lambda expressions, anonymous types and type inference, C# 3.0 has probably halved the amount of code that would otherwise be required and made what is there much more concise and understandable. I’m definitely looking forward to using LINQ more in the future.
-
LINQ to XML over large documents
I have been parsing WordprocessingML OOXML over the past week using LINQ to XML and it has been a great learning experience. LINQ to XML is a clean break from the somewhat antiquated DOM that we all know and tolerate, and the new API provides many improvements over the DOM based XmlDocument.
Probably the most talked about change is the functional approach that LINQ to XML encourages. Building a document and querying data can often be accomplished in a single statement compared to the sprawling imperative code that XmlDocument would require.
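As a rough illustration of that single-statement style, the sketch below builds a tiny WordprocessingML-like body and reads the paragraph text back, each in one expression. FunctionalConstructionSketch, BuildBody and ParagraphTexts are hypothetical names, and the markup is reduced to the bare minimum for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FunctionalConstructionSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds <w:body> with one <w:p><w:r><w:t> chain per text, in one expression.
    public static XElement BuildBody(params string[] paragraphTexts)
    {
        return new XElement(W + "body",
            new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
            paragraphTexts.Select(text =>
                new XElement(W + "p",
                    new XElement(W + "r",
                        new XElement(W + "t", text)))));
    }

    // Returns the concatenated text of each paragraph, in one expression.
    public static List<string> ParagraphTexts(XElement body)
    {
        return body.Elements(W + "p")
                   .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)))
                   .ToList();
    }
}
```

The equivalent XmlDocument code would need a CreateElement and AppendChild call for every node, which is where the imperative sprawl comes from.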
A less visible and less talked about improvement in LINQ to XML is its ability to work with large documents. DOM is notorious for the amount of memory it consumes when loading large amounts of XML. A typical solution in the past was to use the memory efficient XmlReader object but XmlReader is forward only and frankly a pain in the butt to use in nontrivial situations.
LINQ to XML memory usage and XElement streaming
LINQ to XML brings two notable improvements to working with large documents. The first is a reduction in the amount of memory required. LINQ to XML stores XML documents in memory more efficiently than DOM.
The second improvement is LINQ to XML’s ability to combine the document approach with the XmlReader approach. The static method XNode.ReadFrom takes an XmlReader and returns an XNode for the node at the reader’s current position, so we can create XNode objects one by one and work on them as needed. You still aren’t loading the entire document into memory but you get the ease of use of working with a document. The best of both worlds! Let’s see an example...
Example: Streaming LINQ to XML over OOXML
A WordprocessingML document is split into about a dozen separate files, each of which is quite small and can easily be loaded in its entirety into an XDocument. The exception is the main document file. This file stores the content of the document and has the potential to grow from just a few kilobytes for Hello World to 100MB+ in the case of a document with 10,000 pages. Obviously we want to avoid loading a huge document like this if we can help it. This example compares code that loads the entire document at once with code that loads it one piece at a time.
Before:
public void LoadMainDocument(TextReader mainDocumentReader)
{
    if (mainDocumentReader == null)
        throw new ArgumentNullException("mainDocumentReader");

    XDocument mainDocument = XDocument.Load(mainDocumentReader);
    XNamespace ns = Document.Main;
    XElement bodyElement = mainDocument.Root.Element(ns + "body");

    List<Content> content = DocumentContent.Create(this, bodyElement.Elements());
    SectionProperties finalSectionProperties = new SectionProperties(this, bodyElement.Element(ns + "sectPr"));

    // divide content up into sections
    _sections.AddRange(CreateSections(content, finalSectionProperties));
}
Here you can see the entire document being loaded into an XDocument with XDocument.Load(mainDocumentReader). The problem is of course that the document could potentially be quite large and cause an OutOfMemoryException. XNode.ReadFrom to the rescue...
After:
public void LoadMainDocument(TextReader mainDocumentReader)
{
    if (mainDocumentReader == null)
        throw new ArgumentNullException("mainDocumentReader");

    XNamespace ns = Document.WordprocessingML;
    List<Content> content = null;
    SectionProperties finalSectionProperties = null;

    using (XmlReader reader = XmlReader.Create(mainDocumentReader))
    {
        while (reader.Read())
        {
            // move to body content
            if (reader.LocalName == "body" && reader.NodeType == XmlNodeType.Element)
            {
                content = DocumentContent.Create(this, GetChildContentElements(reader));

                // the reader now sits on whatever stopped the enumeration
                if (reader.LocalName != "sectPr")
                    throw new Exception("Final section properties element expected.");

                finalSectionProperties = new SectionProperties(this, (XElement)XNode.ReadFrom(reader));
            }
        }
    }

    // divide content up into sections
    _sections.AddRange(CreateSections(content, finalSectionProperties));
}

private IEnumerable<XElement> GetChildContentElements(XmlReader reader)
{
    // move to first child
    reader.Read();

    while (true)
    {
        // skip whitespace between elements
        reader.MoveToContent();

        // break on end document section, or on anything that is not an element
        // (such as the closing body tag when the final sectPr is missing), so
        // that the caller reports the problem instead of ReadFrom throwing
        if (reader.NodeType != XmlNodeType.Element || reader.LocalName == "sectPr")
            yield break;

        yield return (XElement)XNode.ReadFrom(reader);
    }
}
Now rather than loading the entire document into an XDocument, XElements are created from the XmlReader as needed. Each one can be used and queried before falling out of scope and being made available for garbage collection, avoiding the danger of running out of memory. Even though this is quite different to what we were doing previously, no code outside of this method needed to be modified!
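For experimenting with the same pattern outside this object model, here is a self-contained sketch that streams the children of a named element from any TextReader. StreamingSketch and StreamChildren are hypothetical names, and the helper matches on local name only, which is an assumption that keeps the example short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class StreamingSketch
{
    // Lazily yields each child element of the first element whose local name
    // is parentLocalName. Only one child is materialised at a time.
    public static IEnumerable<XElement> StreamChildren(TextReader input, string parentLocalName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            // advance to the parent's start tag
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == parentLocalName)
                    break;
            }

            // nothing to stream if the parent is missing or has no content
            if (reader.EOF || reader.IsEmptyElement)
                yield break;

            // move to the first node inside the parent
            reader.Read();

            // MoveToContent skips whitespace and comments; the loop ends at the
            // parent's end tag or at any other non-element node
            while (reader.MoveToContent() == XmlNodeType.Element)
            {
                // ReadFrom builds one element and leaves the reader on the node after it
                yield return (XElement)XNode.ReadFrom(reader);
            }
        }
    }
}
```

Because the method is an iterator, nothing is read until the caller starts enumerating, and the XmlReader is disposed when enumeration finishes. Calling ToList on the result would bring everything back into memory, so the benefit only holds if each element is processed and released as it arrives.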
-
Getting started with OOXML
It can be hard knowing where to get started when working with a new technology. I have recently commenced work on a project heavily involving OOXML and I thought I’d share the websites and resources I found most useful to other people just starting out.
Open XML Developer - http://openxmldeveloper.org/
Open XML Developer is the best all-in-one OOXML site on the web. It features OOXML news, articles, examples and an active community. If you have questions that aren’t answered in the articles, the site has forums on just about every OOXML topic you could think of.
Microsoft SDK for Open XML Formats - http://msdn2.microsoft.com/en-us/library/bb448854.aspx
This is Microsoft’s SDK for working with Open XML. Right now the name is slightly confusing as the SDK only provides an API over the OOXML package, not the OOXML file formats themselves. You are able to read a docx for example, and pick out all the individual style, formatting and document parts; but the actual part contents are still XML that you must read and write yourself. The SDK is still in preview at the moment so I’m sure that support for the markup languages will improve as time goes on.
Open XML Explained - http://openxmldeveloper.org/articles/1970.aspx
Open XML Explained is the first book on Open XML development and is freely available to download. The book is 128 pages long and provides a good high level introduction to OOXML and the three main markup languages: WordprocessingML, SpreadsheetML and PresentationML.
Ecma Office Open XML specification - http://www.ecma-international.org/publications/standards/Ecma-376.htm
If you really want to dig into the details of OOXML, the specification is the best place to look. Although there has been much rending of clothing and gnashing of teeth over the specification’s 6,000-page length, that page count includes introductions and primers to the specification. Also the markup reference document, which is by far the largest of the specification documents, is padded out significantly, with many elements and attributes described a number of times.
- Part 1: Fundamentals (174 pages): Gives an overview of Open XML packages and the parts that make up the markup languages.
- Part 2: Open Packaging Convention (129 pages): Goes into more detail of the Open XML package conventions.
- Part 3: Primer (472 pages): Describes the markup languages and how they work. Recommended as a good introduction to OOXML.
- Part 4: Markup Language Reference (5219 pages): Provides descriptions of every element and attribute. There is a lot of detail in this document but repetition also contributes to its size. I have found using the document links is a good way to navigate the content and find what you are looking for.
- Part 5: Markup Compatibility and Extensibility (43 pages): Describes how additional markup can be added to the format while still conforming to the specification.
WinRAR - http://www.rarlab.com/
I’m sure there are better tools for this, but I have been using WinRAR to explore existing OOXML packages. Since the packages are just zip archives any zip tool will let you view the contents.
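Since a package is just a zip archive, its parts can also be listed from code. The sketch below assumes the System.IO.Compression.ZipArchive class, which arrived in later versions of the .NET Framework than the one this post was written against; PackageListingSketch and ListParts are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class PackageListingSketch
{
    // Returns the full name of every entry in an OOXML package.
    // The caller keeps ownership of the stream (leaveOpen is true).
    public static List<string> ListParts(Stream package)
    {
        using (var archive = new ZipArchive(package, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries.Select(entry => entry.FullName).ToList();
        }
    }
}
```

Opening a .docx with File.OpenRead and passing the stream to ListParts should show entries such as [Content_Types].xml and word/document.xml, the latter being the main document part discussed in the streaming post above.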
If you know of any other OOXML resources I’d love to hear about them.