Deduping an XML document with XSLT (listing only unique elements)

XSLT provides a key() function that retrieves a subset of nodes in an XML document based on selection criteria specified in an xsl:key element. In essence, xsl:key is an index of the XML document, and key() serves as the index lookup function. This gives XSLT a lot of versatility and lets you accomplish a lot with a little. Let’s look at how we could use this to dedupe an XML document that contains many duplicate items, listing each unique element only once. Consider the following XML:

<Inventory>
<Product>
	<ProductID>1</ProductID>
	<name>Sword of the ancients</name>
	<description>A legendary sword fit for a hero</description> 
</Product>
<Product>
	<ProductID>2</ProductID>
	<name>BFG 9000</name>
	<description>A slayer of demons</description> 
</Product>
<Product>
	<ProductID>2</ProductID>
	<name>BFG 9000</name>
	<description>A slayer of demons</description> 
</Product>
<Product>
	<ProductID>3</ProductID>
	<name>Flaming sword</name>
	<description>Effective against the undead</description> 
</Product>
<Product>
	<ProductID>4</ProductID>
	<name>Aegis bulwark</name>
	<description>A stalwart shield that blocks the deadliest blows</description> 
</Product>
<Product>
	<ProductID>4</ProductID>
	<name>Aegis bulwark</name>
	<description>A stalwart shield that blocks the deadliest blows</description> 
</Product>
<Product>
	<ProductID>4</ProductID>
	<name>Aegis bulwark</name>
	<description>A stalwart shield that blocks the deadliest blows</description> 
</Product>
</Inventory>

Each Product is uniquely identified by its ProductID. Some of the products appear more than once. Let’s write a transform to clean up the XML and list each item only once. To do so, we will need the key() and generate-id() functions. Let’s look at key() in detail first. key() takes two parameters: the name of the key to do the lookup in, and a search string to match against:

key(key-name, search-string)

search-string is then matched against the selection criteria defined in the xsl:key element named key-name. An xsl:key element has three attributes: @name, @match, and @use. @name is the key-name parameter that is passed to the key() function. @match is an XPath pattern that specifies which nodes in the XML document to index. @use is an XPath expression that produces the lookup value for each indexed node. It is evaluated relative to the nodes matched by @match, and its value is compared against the search-string parameter of key(). The following xsl:key indexes all the Products in the previous XML document by their ProductID element:

<xsl:key name="ID-key" match="Product" use="ProductID" />  
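To see what the key actually indexes, here is a minimal sketch of a lookup on its own. It assumes the sample Inventory document above as input and reuses the ID-key definition; the text output method and the hard-coded search string '2' are choices made only for this illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Index every Product by the value of its ProductID child -->
  <xsl:key name="ID-key" match="Product" use="ProductID"/>

  <xsl:template match="/">
    <!-- key() returns every Product whose ProductID equals '2';
         count() reports how many nodes came back -->
    <xsl:value-of select="count(key('ID-key', '2'))"/>
  </xsl:template>
</xsl:stylesheet>
```

Run against the sample input, this should print 2, one for each BFG 9000 entry.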

key('ID-key', '2') would return the two BFG 9000 Product elements. In order to extract only the unique elements, we will need to use generate-id(). generate-id() takes a node as a parameter and generates an ID for it that is unique within the document. No two nodes in the same document will ever map to the same ID, and the same node always gets the same ID within a single transformation. However, generate-id() does not guarantee that the value generated for a given node will be the same on a different run. To identify the unique elements, we will use generate-id() as a test for node identity. The following expression selects the first Product for each distinct ProductID in the document:

<xsl:for-each select="//Product[generate-id(.)=generate-id(key('ID-key', ProductID)[1])]">

Let’s break this gnarly expression down into smaller, more manageable parts. The xsl:for-each selects the set of Products that satisfy the equality test inside the bracketed predicate. What is the predicate trying to accomplish? Starting with the innermost expression on the right hand side of the test, we see that key('ID-key', ProductID) returns the set of nodes that have the same ProductID value as the current node being evaluated. [1] picks the first node of that set in document order, which is then passed as an argument to generate-id(). Note that we are always guaranteed at least one result from the key lookup: the current node is itself indexed under its own ProductID, so it always appears in the node-set returned by key().

On the left hand side of the conditional is the expression generate-id(.). This generates an id for the current product node being evaluated. If the two generated IDs are equal, we include the node in the result-set. Subsequent nodes with the same ProductID will fail this equality test, since we are only comparing to the *first* node returned by key(). The rest will be ignored. This guarantees that we get a list of Product nodes where no two ProductIDs will be the same. The equality test essentially filters the redundant elements out.
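The same identity test can also be written without generate-id(). The sketch below is an alternative formulation, not the one used in the rest of this post: it relies on the fact that the union of two node-sets contains a single node only when both sides are the same node. It assumes the same sample input and the same ID-key definition.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:key name="ID-key" match="Product" use="ProductID"/>

  <xsl:template match="/">
    <Inventory>
      <!-- (. | first-in-group) has exactly one node only when the
           current Product IS the first node in its ProductID group -->
      <xsl:for-each select="//Product[count(. | key('ID-key', ProductID)[1]) = 1]">
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </Inventory>
  </xsl:template>
</xsl:stylesheet>
```

Both predicates select the same nodes, so which one to use is largely a matter of taste.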

With that hairy expression out of the way, it’s just a simple matter of printing out all the unique elements:

<xsl:for-each select="//Product[generate-id(.)=generate-id(key('ID-key', ProductID)[1])]">
    <xsl:copy-of select="."/>
</xsl:for-each>

The complete code listing is here:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
  <xsl:output method="xml" indent="yes"/>
  <!-- Index every Product by the value of its ProductID child -->
  <xsl:key name="ID-key" match="Product" use="ProductID" />

  <xsl:template match="/">
    <xsl:element name="Inventory">
      <!-- Keep only the first Product in each ProductID group -->
      <xsl:for-each select="//Product[generate-id(.)=generate-id(key('ID-key', ProductID)[1])]">
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

Running the transform on the previously listed input XML will output the following XML file:

<?xml version="1.0" encoding="utf-8"?>
<Inventory>
  <Product>
    <ProductID>1</ProductID>
    <name>Sword of the ancients</name>
    <description>A legendary sword fit for a hero</description>
  </Product>
  <Product>
    <ProductID>2</ProductID>
    <name>BFG 9000</name>
    <description>A slayer of demons</description>
  </Product>
  <Product>
    <ProductID>3</ProductID>
    <name>Flaming sword</name>
    <description>Effective against the undead</description>
  </Product>
  <Product>
    <ProductID>4</ProductID>
    <name>Aegis bulwark</name>
    <description>A stalwart shield that blocks the deadliest blows</description>
  </Product>
</Inventory>
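If the document contains more than just Product elements, the same key can be combined with an identity transform instead of a for-each. The following is a sketch of that variation, assuming the same input and ID-key definition: everything is copied through unchanged, and an empty template drops any Product that is not the first in its ProductID group. The xsl:strip-space declaration is an addition here so that the whitespace left behind by removed duplicates does not end up in the output.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:key name="ID-key" match="Product" use="ProductID"/>

  <!-- Identity template: copy every node and attribute as-is -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Duplicates: any Product that is not the first node in its
       ProductID group matches this empty template and is dropped -->
  <xsl:template match="Product[generate-id(.) != generate-id(key('ID-key', ProductID)[1])]"/>
</xsl:stylesheet>
```

For the sample input this should keep the same four Product elements as the listing above. The more specific Product pattern takes precedence over the identity template because of XSLT’s default template priorities.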
