Text Transformation with MGrammar and the Oslo SDK

Tuesday Feb 17th 2009 by Jeffrey Juday

Learn to build a text DSL in MGrammar.

My job involves moving different shapes and sizes of data linking systems and business processes together. Normally, I use Integration or Data Warehousing tools. Until I started using the Oslo SDK CTP "M" language, I've never considered building my own Domain Specific Language (DSL) as a tool in my repertoire.

If you've been following my Oslo SDK articles http://www.codeguru.com/columns/experts/article.php/c15779/, you've been introduced to MSchema, MGraph, and the Repository, all components of the Oslo SDK Modeling backbone. MGrammar is the third component in the "M" language. However, instead of defining data structure like MSchema, MGrammar defines data transformation, in particular, human-readable Text data transformation. Continuing to use the sample model I've developed in the other articles, I'm going to show you how MGrammer can be employed to populate Repository data.

Oslo Overview

Oslo is composed of the following components displayed in Figure 1.

Figure 1: Oslo Architecture

Source: "Microsoft PDC 2008—A Lap Around Olso"

  • "M", a language for composing models
  • The Repository, a SQL Server database designed for storing models
  • Quadrant, a tool for editing and viewing model data

Currently, Quadrant is only available to PDC attendees. M and the Repository come with the Oslo SDK available on the Oslo Developer Center site http://msdn.microsoft.com/en-us/oslo/default.aspx.

Oslo's goal is to deliver a foundation for building and storing models of all types. Models are application metadata formatted for runtime consumption. Separate Microsoft initiatives aim to build runtimes and tooling into applications such as Visual Studio that are Oslo model aware.

As I mentioned earlier, the M Language is composed of MSchema, MGraph, and MGrammar. A complete introduction to M is beyond the scope of this article. MSchema and MGraph were covered in my prior articles this article will acquaint you with MGrammar.

MGrammar Overview

Unlike XML, Text is a natural human-consumable data medium. Although text can be semi-structured like XML, text is not standardized like XML is. Parsing text to store it in, for example, a relational database using traditional development tools, though not difficult, is difficult to do right. MGrammar bridges the gap between plain human-readable, composable text data and XML, making semi-structured text parsing more approachable.

In a typical MGrammar program, a developer defines the patterns to search in the text and defines how the pattern is translated into MGraph. MGraph looks a lot like inline C# collections. Oslo utilizes MGraph to populate MSchema models in the Repository.

In MGrammar, a developer defines a set of Rules for transforming text into MGraph. MGrammar has three types of Rules:

  • Token rules work and look a lot like Regular expressions
  • Syntax rules can be composed of Tokens and define the MGraph produced from text input.
  • Interleave rules define ignored text.

There are other features of MGrammar. However, a complete survey of the language is beyond the scope of this article and Rules are really the core of the language. So, I'm going to focus on Rules and, in particular, Token and Syntax rules. Using a sample, I'll illustrate how some basic Token and Syntax capabilities are employed to parse text.

Sample Overview

The sample leverages the models I built in my prior article http://www.codeguru.com/columns/experts/article.php/c15779/. Model code snippets appear below.

type Requirement : Item
   Description : Text?;
   ApplicationId : Integer64;

Requirements : Requirement* where
item.ApplicationId in MyApplications.Id;

type ServerConfiguration : Item
   Server : Text;
   ApplicationId : Integer64;

ServerConfigInfo : ServerConfiguration* where
item.ApplicationId in MyApplications.Id;

Figure 2 depicts the sample application running in Intellipad, a development tool shipping with the Oslo SDK.

Figure 2: Application Execution

Enter a typical phrase and the text is translated to MGraph targeting the Requirement model. Enter Server and the application targets the ServerConfiguration model.

I've shown what the application does. Now, I want to show how it works.

MGrammar Structure

Following is the entire MGrammar sample.

module Sample.MGrammar
   language ModelMGraph
      syntax Main = Req | ServerConfig | Nil;

      syntax Req = InputText:MultipleTextValues =>
         Requirement { Description {InputText},
         ApplicationId  {100} };
      syntax ServerConfig = "Server="
         InputNum:MultipleTextValues =>
         ServerConfiguration { Server {InputNum},
         ApplicationId  {100} };
      //Case where nothing there without this there is an error
      syntax Nil = empty;

      token TextValue = "a".."z" | "A".."Z" | " ";
      token MultipleTextValues = (TextValue)+;

As with other "M" programs, MGrammar applications are scoped to a module. Also like MSchema, MGrammar application can import and export libraries of Modules. As you can see, MGrammar supports comments and as I mentioned before the language syntax supports other things like, for example, preprocessor directives like #if and #define. Keep in mind, though, that the Oslo SDK is a CTP and MGrammar's current incarnation is not complete. The Language keyword begins the application definition.

Main is the application entry point. Main must always be a Syntax. Text input into the MGrammar must match a defined Syntax or the application emits an error. In the example, there are three Syntaxes defining three distinct patterns: Req, ServerConfig, and Nil. Later in the article, I'll explain how a Syntax is constructed. I want to start with the Tokens.

Token Rules

The Tokens in the sample application appear below.

token TextValue = "a".."z" | "A".."Z" | " ";
token MultipleTextValues = (TextValue)+;

Tokens look and act a lot like Regular Expressions. Like Regular Expressions, Tokens can define patterns of text, numeric ranges, whitespace characters, and operators (&, |, *, +, etc.). Tokens can also include other Tokens.

In the example, TextValue defines a pattern containing any lowercase character, uppercase character, or space. MultipleTextValues defines a pattern matching one or more alphabetic characters.

Syntax Rules

The application Syntaxes appear below.

syntax Main = Req | ServerConfig | Nil;

syntax Req = InputText:MultipleTextValues =>
   Requirement { Description {InputText}, ApplicationId  {100} };
syntax ServerConfig = "Server="
   InputNum:MultipleTextValues =>
   ServerConfiguration { Server {InputNum}, ApplicationId  {100} };
//Case where nothing there without this there is an error
syntax Nil = empty;

Syntaxes handle the input and resulting MGraph output, also called a Projection. Everything to the right of the "=>" symbol is the Projection (output text).

Syntaxes can include scoped variables. Variables can be used in the Projection. Syntaxes can be composed from multiple Syntaxes and Tokens.

In the example Req uses a variable called InputText. InputText contains the full text passed into the application and is used in the MGraph output. ServerConfig utilizes a variable called InputNum in a similar fashion.

Nil is a special case Syntax. If nothing is passed into the application an empty Projection is generated. Empty is a MGrammar keyword.

Main includes all of the defined Syntaxes. As I mentioned before, Main is the entry point of the application.

Running the Sample

There are some tricks to running the sample code.

First, to build and run an MGrammar application you must select the "Sample Enabled" Intellipad. Next you must select "Minibuffer" and type SetMode("MGMode") to enable MGrammer. The graphic below demonstrates enabling MGrammar mode.


Figure 3: Setting MGrammar Mode

Finally, you must select the "Tree Preview" option to be able to enter text and see it transformed to MGraph. The graphic below demonstrates how to do this.


Figure 4: Tree Preview Selection


MGrammar is a feature in the "M" programming language shipping with the Oslo SDK. MGrammar was built to make transforming Semi-structured text easier. Rules are the Core of MGrammar. Syntax and Token Rules define the text patterns and transformations.


About the Author

Jeffrey Juday is a software developer specializing in enterprise integration solutions utilizing BizTalk, SharePoint, WCF, WF, and SQL Server. Jeff has been developing software with Microsoft tools for more than 15 years in a variety of industries including: military, manufacturing, financial services, management consulting, and computer security. Jeff is a Microsoft BizTalk MVP. Jeff spends his spare time with his wife Sherrill and daughter Alexandra. You can reach Jeff at me@jeffjuday.com.

Mobile Site | Full Site
Copyright 2018 © QuinStreet Inc. All Rights Reserved