Inferring an XML Schema from an XML Document

Introduction

After a couple of attempts, XML isn’t that hard to write. Create a text document with matching opening and closing tags, like <Customer></Customer>, with text values in between. That’s not too hard. Unfortunately, I don’t write XSD (XML Schema) documents from scratch often enough for them to be easy.

An XSD document is to an XML document what a SQL schema is to a SQL element like a table. To be valid relative to a schema, an XML document must contain the elements in the order and with the types that the XSD document defines. That is, the XSD document describes what to anticipate in an XML document that matches the schema. The challenge can be that an XSD document contains attributes and namespace elements that are a little cryptic and can be hard to remember. Fortunately, you don’t have to remember.

The .NET framework provides for inferring the schema from a document. If you have the document, you can generate the schema. This article shows you how.

Defining an XML Document

For your purposes, any XML document will do. The XML contained in Listing 1 is an XML document containing columns from the Northwind Customers table. (It was used because it is convenient.) The XML document begins with the XML declaration (the <?xml ?> line), which carries the version and encoding attributes, and the rest of the document describes the content.

Listing 1: A sample XML document containing customer information.

<?xml version="1.0" encoding="utf-8" ?>
<!--Generated XML-->
<Root>
   <Customer>
      <CustomerID>ALFKI</CustomerID>
      <CompanyName>Alfreds Futterkiste</CompanyName>
      <ContactName>Paul Kimmel</ContactName>
      <ContactTitle>Sales Representative</ContactTitle>
      <Address>Obere Str. 57</Address>
      <City>Berlin</City>
      <Region></Region>
      <PostalCode>12209</PostalCode>
      <Country>Germany</Country>
      <Phone>030-0074321</Phone>
      <Fax>030-0076541</Fax>
   </Customer>
   <Customer>
      <CustomerID>ANATR</CustomerID>
      <CompanyName>Ana Trujillo Emparedados y helados</CompanyName>
      <ContactName>Ana Trujillo</ContactName>
      <ContactTitle>Owner</ContactTitle>
      <Address>Avda. de la Constitución 2222</Address>
      <City>México D.F.</City>
      <Region></Region>
      <PostalCode>05021</PostalCode>
      <Country>Mexico</Country>
      <Phone>(5) 555-4729</Phone>
      <Fax>(5) 555-3745</Fax>
   </Customer>
</Root>

The number of records was shortened to conserve space, but the size of the document doesn’t matter. This XML document (refer to Listing 1) repeats Customer elements, with each child element corresponding to a column in the Northwind Customers table.

A corresponding XSD document would need to describe the contents that one would expect in all Customer XML documents, such as the fact that the contents are multiple complex types and each type has specific fields. The field names and types would be expressed in the XSD as well.

Writing Code to Infer the XML Schema and Return an XDocument

The XDocument type is a new type that is part of LINQ to XML. (For more on LINQ to XML, check out my book LINQ Unleashed for C#. VB programmers shouldn’t have that much trouble following the C# examples in the book.)

XDocument represents an XML document, and in fact, XSD documents are also XML documents. Listing 2 demonstrates how to use streams, basic IO, and System.Xml classes to get the framework to infer (figure out) what the schema should be as indicated by the XML data.

Listing 2: Inferring the XSD (schema) for the XML document in Listing 1.

Imports System.IO
Imports System.Linq
Imports System.Text
Imports System.Xml
Imports System.Xml.Linq
Imports System.Xml.Schema

Module Module1

   Sub Main()

      Console.WriteLine(CreateXSD("....Customers.xml"))
      Console.ReadLine()

   End Sub

   Public Function CreateXSD(ByVal filename As String) As XDocument

      Dim xml As XDocument = XDocument.Load(filename)
      Dim inference As XmlSchemaInference = New XmlSchemaInference

      ' Encode as UTF-8 so non-ASCII characters (such as the accented
      ' letters in Listing 1) are not replaced with "?"
      Dim stream As MemoryStream = _
         New MemoryStream(Encoding.UTF8.GetBytes(xml.ToString()))
      Dim reader As XmlTextReader = New XmlTextReader(stream)
      Dim schemaSet As XmlSchemaSet = inference.InferSchema(reader)

      ' Only one schema is inferred for this document; take the first
      Dim schema As XmlSchema = _
         schemaSet.Schemas().Cast(Of XmlSchema)().First()

      ' Write the schema to a string and parse that string as XML
      Using target As TextWriter = New StringWriter()
         schema.Write(target)
         Return XDocument.Parse(target.ToString())
      End Using

   End Function
End Module

The code is pretty straightforward. Declare an XDocument and load the XML document Customers.xml. Create an XmlSchemaInference instance and a MemoryStream initialized with the bytes of the XML document’s text. Use the MemoryStream to initialize an XmlTextReader, and pass the reader to InferSchema.

InferSchema returns an XmlSchemaSet that contains every schema inferred from the XML document. There is only one for your document (from Listing 1). Create a StringWriter and write the schema to it; XmlSchema.Write accepts a TextWriter, the StringWriter’s base class. Then, XDocument.Parse accepts a string and constructs an XDocument from the contents of the StringWriter.

Finally, call XDocument.ToString() to display the contents of the XDocument, which is an XML document containing the XSD for the XML in Listing 1. The inferred schema is shown in Listing 3.

Listing 3: The resulting XSD inferred from the XML document in Listing 1.

<xs:schema attributeFormDefault="unqualified"
           elementFormDefault="qualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xs:element name="Root">
      <xs:complexType>
         <xs:sequence>
            <xs:element maxOccurs="unbounded" name="Customer">
               <xs:complexType>
                  <xs:sequence>
                     <xs:element name="CustomerID"
                                 type="xs:string" />
                     <xs:element name="CompanyName"
                                 type="xs:string" />
                     <xs:element name="ContactName"
                                 type="xs:string" />
                     <xs:element name="ContactTitle"
                                 type="xs:string" />
                     <xs:element name="Address" type="xs:string" />
                     <xs:element name="City" type="xs:string" />
                     <xs:element name="Region" />
                     <xs:element name="PostalCode"
                                 type="xs:unsignedShort" />
                     <xs:element name="Country" type="xs:string" />
                     <xs:element name="Phone" type="xs:string" />
                     <xs:element name="Fax" type="xs:string" />
                  </xs:sequence>
               </xs:complexType>
            </xs:element>
         </xs:sequence>
      </xs:complexType>
   </xs:element>
</xs:schema>

That’s it. Now you no longer have to write XSD from scratch. Start with an XML document that is an exemplar containing the data that you’d expect in all XML documents with a given schema and let .NET do the rest.
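One use for an inferred schema is checking other documents against it. The following is a minimal sketch of that idea, not part of the original listings. It assumes the XDocument.Validate extension method from System.Xml.Schema, builds the reader with XDocument.CreateReader instead of a MemoryStream, and uses small inline sample documents. The names ValidateSketch, IsValid, and OnValidationEvent are hypothetical.

```vb
Imports System
Imports System.Xml
Imports System.Xml.Linq
Imports System.Xml.Schema

Module ValidateSketch

   ' Counts the problems reported during the most recent validation
   Private errorCount As Integer

   Sub Main()

      ' Hypothetical sample data, trimmed from Listing 1
      Dim good As XDocument = XDocument.Parse( _
         "<Root><Customer><CustomerID>ALFKI</CustomerID>" & _
         "<City>Berlin</City></Customer></Root>")

      ' Same elements, but in a different order than the exemplar
      Dim bad As XDocument = XDocument.Parse( _
         "<Root><Customer><City>Berlin</City>" & _
         "<CustomerID>ALFKI</CustomerID></Customer></Root>")

      ' Infer the schema from the exemplar document
      Dim inference As New XmlSchemaInference()
      Dim schemaSet As XmlSchemaSet
      Using reader As XmlReader = good.CreateReader()
         schemaSet = inference.InferSchema(reader)
      End Using
      schemaSet.Compile()

      Console.WriteLine("good is valid: {0}", IsValid(good, schemaSet))
      Console.WriteLine("bad is valid: {0}", IsValid(bad, schemaSet))

   End Sub

   ' Returns True when validation reports no errors or warnings
   Function IsValid(ByVal doc As XDocument, _
                    ByVal schemas As XmlSchemaSet) As Boolean

      errorCount = 0
      doc.Validate(schemas, AddressOf OnValidationEvent)
      Return errorCount = 0

   End Function

   ' Called once for each validation problem found
   Private Sub OnValidationEvent(ByVal sender As Object, _
                                 ByVal e As ValidationEventArgs)

      errorCount += 1
      Console.WriteLine("{0}: {1}", e.Severity, e.Message)

   End Sub
End Module
```

Because the inferred schema describes the child elements as a sequence, the second document should be reported as invalid: its elements appear in a different order than in the exemplar.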

There are two things you can do from here. The first is you can tweak the XSD’s type attributes to more closely match the desired data types, and the second is that you could create a macro or wizard in Visual Studio to integrate this solution into Visual Studio. That’s left for the reader or another day.
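As an alternative to editing the type attributes by hand, XmlSchemaInference exposes a TypeInference property. The sketch below is an illustration under assumptions, not the article’s method: it sets TypeInference to InferenceOption.Relaxed, which asks the inference engine to treat simple content as xs:string rather than guessing a numeric type, as happened with PostalCode in Listing 3. The module name and the inline sample XML are hypothetical.

```vb
Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.Linq
Imports System.Xml.Schema

Module RelaxedInferenceSketch

   Sub Main()

      ' Hypothetical sample: postal codes that look numeric
      Dim xml As XDocument = XDocument.Parse( _
         "<Root>" & _
         "<Customer><PostalCode>12209</PostalCode></Customer>" & _
         "<Customer><PostalCode>05021</PostalCode></Customer>" & _
         "</Root>")

      ' Relaxed type inference: simple values are inferred as strings
      Dim inference As New XmlSchemaInference()
      inference.TypeInference = _
         XmlSchemaInference.InferenceOption.Relaxed

      Dim schemaSet As XmlSchemaSet
      Using reader As XmlReader = xml.CreateReader()
         schemaSet = inference.InferSchema(reader)
      End Using

      ' Print each inferred schema (one is expected here)
      For Each schema As XmlSchema In schemaSet.Schemas()
         Using target As New StringWriter()
            schema.Write(target)
            Console.WriteLine(target.ToString())
         End Using
      Next

   End Sub
End Module
```

Relaxed inference applies to every simple value in the document, so it is a blunt instrument. If only one or two elements need a different type, editing those type attributes in the generated XSD is likely the better fit.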

Summary

If you are an XSD writing whiz, you probably write XSD all of the time. Even in that case, a tool that infers XSD from XML is going to be faster and less prone to syntax errors than writing XSD from scratch. With a few lines of code and a little know-how, you can let the .NET framework write your XSD for you.

About the Author

Paul Kimmel is the VB Today columnist for www.codeguru.com and has written several books on object-oriented programming and .NET. Check out his upcoming book LINQ Unleashed for C#; order your copy today at Amazon.com. Paul Kimmel is an Application Architect for EDS. You may contact him for technology questions at pkimmel@softconcepts.com.

Copyright © 2008 by Paul T. Kimmel. All Rights Reserved.
