Parsing with SAX in Delphi

SAX for Pascal provides a specification, in the form of a set of Delphi interfaces, for parsing XML documents. Your client code receives a series of events that describe the contents of the XML document. The main interface a SAX parser provides is IXMLReader (in the SAX.pas unit). This interface allows an application to set and query features and properties in the parser, to register event handlers for document processing, and to initiate a document parse. You use it like this:

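procedure TMyContentHandler.Parse (const aURL : string);
var
  xmlReader : IXMLReader;
begin
  // TSomeSAXParser stands for whichever SAX parser class you use
  xmlReader := TSomeSAXParser.Create as IXMLReader;
  // Self must implement IContentHandler to receive the events
  xmlReader.setContentHandler (Self);
  xmlReader.parse (aURL);
end;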

One of the most important methods of IXMLReader is setContentHandler. With this method you register your content handler, a class that implements the IContentHandler interface, with the parser. IContentHandler contains methods such as startDocument, endDocument, startElement, endElement, and characters, which the parser calls when the respective events occur.
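To make the callback flow concrete, here is a minimal, self-contained sketch of the pattern. It does not use the SAX for Pascal units: IDemoContentHandler, TTraceHandler, and FakeParse are hypothetical stand-ins, and the real IContentHandler has more methods and richer signatures. FakeParse simply fires the events a parser would report for <greeting>Hello</greeting>.

```pascal
program SaxEventsDemo;
{$APPTYPE CONSOLE}

type
  // Hypothetical, simplified stand-in for IContentHandler.
  IDemoContentHandler = interface
    procedure StartElement(const QName: string);
    procedure EndElement(const QName: string);
    procedure Characters(const Text: string);
    function Trace: string;
  end;

  // Appends every event it receives to a trace string.
  TTraceHandler = class(TInterfacedObject, IDemoContentHandler)
  private
    FTrace: string;
  public
    procedure StartElement(const QName: string);
    procedure EndElement(const QName: string);
    procedure Characters(const Text: string);
    function Trace: string;
  end;

procedure TTraceHandler.StartElement(const QName: string);
begin
  FTrace := FTrace + '<' + QName + '>';
end;

procedure TTraceHandler.EndElement(const QName: string);
begin
  FTrace := FTrace + '</' + QName + '>';
end;

procedure TTraceHandler.Characters(const Text: string);
begin
  FTrace := FTrace + Text;
end;

function TTraceHandler.Trace: string;
begin
  Result := FTrace;
end;

// Hypothetical stand-in for a parser: it reports a fixed event sequence.
procedure FakeParse(const Handler: IDemoContentHandler);
begin
  Handler.StartElement('greeting');
  Handler.Characters('Hello');
  Handler.EndElement('greeting');
end;

// Registers a handler with the fake parser and returns what it recorded.
function TraceOfFakeParse: string;
var
  Handler: IDemoContentHandler;
begin
  Handler := TTraceHandler.Create;
  FakeParse(Handler);
  Result := Handler.Trace;
end;

begin
  Writeln(TraceOfFakeParse); // prints <greeting>Hello</greeting>
end.
```

The handler never asks the parser for data; it only reacts to calls in document order, which is the essence of the SAX model.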

SAX parsing the RAD way

Of course, the code above isn't true Rapid Application Development. That's why the SAX for Pascal packages come with components, registered on the SAX tab of Delphi's Component Palette, that implement the SAX interfaces. These components are in the SAXComps.pas unit. Let's look at an example, step by step.

  1. Fire up Delphi, and start a new application.
  2. Drop a TSAXDelphi component on the main form.
  3. Drop a TSAXXMLReader component on the form.
  4. Drop a TSAXContentHandler component on the form.
  5. Select SAXContentHandler1 for the ContentHandler property of SAXXMLReader1.
  6. Enter the following value for SAXXMLReader1's URL property:
  7. Write the following code in SAXContentHandler1's OnStartElement event:
    MessageDlg (QName, mtInformation, [mbOk], 0);
  8. Drop a TButton on the form.
  9. Write the following code in Button1's OnClick event: SAXXMLReader1.Parse;.
  10. Run the application and click the button.

This will show you all the elements in the pathToYourFile.xml file. Note: If you see question marks instead of valid element names, then refer to the section about ansi and wide strings below.

More examples

The SAX for Pascal download comes with several demo programs that further illustrate the use of the SAX interfaces and components.


You may have noticed the TSAXDelphi component in the example above. It has no properties, and we did nothing with it except drop it on the form. So what's its use? Well, TSAXDelphi is an example of a vendor.

A vendor is an implementation of the SAX for Pascal interfaces, i.e. a SAX parser. Take a look at the Vendor property of the TSAXXMLReader component. With this property, you can switch to a different parser. All your other code is completely independent of the parser you have chosen. In other words, your code won't suffer from vendor lock-in. The vendor scheme for SAX for Pascal is modeled after the vendor scheme used with Delphi's TXMLDocument component, which provides access to an XML DOM.

SAX for Pascal comes bundled with two vendors: TSAXDelphi (a native Delphi parser by Keith Wood), which we encountered before, and TSAXMSXML, an adapter to Microsoft's XML parser. Other vendors exist, but they must be downloaded separately.

Ansi and wide strings

The XML specification says that any valid Unicode character is valid in XML. Unicode consists of tens of thousands of characters, each of which has a unique code. To be able to store any Unicode character you'll need 4 bytes (a 32-bit value) for every character. This encoding of Unicode is called UTF-32.

To simplify matters, Unicode defines almost all commonly used characters in the first 65536 characters. This means that most Unicode strings can be encoded using 2 bytes (a 16-bit value) for every character. This encoding is called UTF-16, which in Delphi is represented using WideChar and WideString. Characters outside the first 65536 are encoded in UTF-16 as a pair of 16-bit values (a surrogate pair).
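Characters beyond the first 65536 do not fit in a single 16-bit value, so UTF-16 represents each of them as two 16-bit values. A minimal sketch of that arithmetic follows; HighSurrogate and LowSurrogate are hypothetical helper names, not Delphi RTL routines, and the functions assume a code point in the range $10000..$10FFFF.

```pascal
program SurrogateDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// High (first) surrogate for a code point in the range $10000..$10FFFF.
function HighSurrogate(CodePoint: Cardinal): Word;
begin
  Result := Word($D800 + ((CodePoint - $10000) shr 10));
end;

// Low (second) surrogate for the same code point.
function LowSurrogate(CodePoint: Cardinal): Word;
begin
  Result := Word($DC00 + ((CodePoint - $10000) and $3FF));
end;

begin
  // U+1D11E (musical symbol G clef) lies outside the first 65536 characters.
  Writeln(IntToHex(HighSurrogate($1D11E), 4), ' ',
          IntToHex(LowSurrogate($1D11E), 4)); // prints D834 DD1E
end.
```

A WideString holding such a character therefore has a Length of 2 for what a reader sees as one character.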

To simplify matters further, Unicode defines the first 128 characters to be identical to the characters from ASCII. An encoding that makes use of this fact is UTF-8. UTF-8 is a variable-length encoding with several properties that make it ideal for storing Unicode when the majority of characters are ASCII characters.

Delphi provides some conversion routines that you may find useful, like AnsiToUtf8, UCS4StringToWideString, Utf8Decode, etc.
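As an illustration of these routines, the sketch below converts a WideString to UTF-8 and back with UTF8Encode and Utf8Decode and compares the storage sizes. Utf8ByteCount and RoundTrips are hypothetical helper names introduced for this example only.

```pascal
program Utf8Demo;
{$APPTYPE CONSOLE}

// Number of bytes that S occupies once encoded as UTF-8.
function Utf8ByteCount(const S: WideString): Integer;
begin
  Result := Length(UTF8Encode(S));
end;

// True when S survives a WideString -> UTF-8 -> WideString round trip.
function RoundTrips(const S: WideString): Boolean;
begin
  Result := Utf8Decode(UTF8Encode(S)) = S;
end;

var
  Sample: WideString;
begin
  // 'caf' followed by U+00E9, built from a code point so the example
  // does not depend on the encoding of the source file.
  Sample := 'caf';
  Sample := Sample + WideChar($00E9);
  Writeln(Length(Sample));        // 4 UTF-16 code units (8 bytes)
  Writeln(Utf8ByteCount(Sample)); // 5 bytes: 3 for ASCII, 2 for U+00E9
  Writeln(RoundTrips(Sample));
end.
```

For mostly ASCII text the UTF-8 form is close to one byte per character, which is the memory saving mentioned in this section.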

When you expect to work with lots of international text, use WideStrings. You should note that WideStrings are not reference counted on Windows (they are on Linux), which makes them less efficient to use than AnsiStrings. When you expect to work with text which is mostly ASCII, but which may contain the occasional international text, use UTF8Strings. They use less memory and are reference counted in Delphi.

SAX for Pascal gives you the option to use either. If you look in the SAX.pas unit, you will see the use of the SAX_WIDESTRINGS conditional directive. If you define this in your project's options, then it will use wide strings, otherwise it will use ansi strings. The SAX for Pascal packages are compiled with SAX_WIDESTRINGS by default.
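The mechanism can be sketched roughly as follows. This is not the actual declaration from SAX.pas: TDemoString, TDemoChar, and BytesPerChar are hypothetical names, and the symbol is defined in the source here only to keep the example self-contained, whereas a real project would set it in the project options.

```pascal
program SaxStringSwitch;
{$APPTYPE CONSOLE}

// Remove the space before the $ to select wide strings, as the
// project option would:
{ $DEFINE SAX_WIDESTRINGS}

type
{$IFDEF SAX_WIDESTRINGS}
  TDemoString = WideString; // hypothetical stand-in for the SAX string type
  TDemoChar = WideChar;
{$ELSE}
  TDemoString = AnsiString;
  TDemoChar = AnsiChar;
{$ENDIF}

// Size in bytes of one character of the selected string type.
function BytesPerChar: Integer;
begin
  Result := SizeOf(TDemoChar);
end;

var
  S: TDemoString;
begin
  S := 'element';
  // The same source compiles against either string type.
  Writeln(Length(S), ' characters, ', BytesPerChar, ' byte(s) each');
end.
```

Because every declaration goes through the one type name, switching the symbol changes the string type everywhere without touching the rest of the code.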

The SAX_WIDESTRINGS conditional directive provides a convenient way to switch between ansi and wide strings. But what if you have used the SAX for Pascal components, and have created event handlers? Delphi will have created code like the following:

procedure TForm1.SAXContentHandler1Characters(Sender: TObject;
  const PCh: WideString);

Note the WideString type used. This happens because of the following definition in SAX.pas: SAXString = WideString;. Delphi considers SAXString to be just an alias for the WideString type. We can force Delphi to consider SAXString as a new type by writing SAXString = type WideString; instead. This happens when your project's options define SAX_SEPARATETYPES. So if you haven't made up your mind about ansi or wide strings, and you want to be able to switch between them, make sure to define SAX_SEPARATETYPES (and rebuild the SAX for Pascal packages).
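The difference between the two declaration forms can be observed through run-time type information. In the sketch below, TAliasString and TDistinctString are hypothetical names standing in for the two possible SAXString declarations; the expectation, under the usual Delphi rules for the type keyword, is that only the alias shares its type information with WideString.

```pascal
program DistinctTypeDemo;
{$APPTYPE CONSOLE}

type
  TAliasString = WideString;         // plain alias for WideString
  TDistinctString = type WideString; // new type with its own type information

// True when the alias shares WideString's type information.
function AliasSharesTypeInfo: Boolean;
begin
  Result := TypeInfo(TAliasString) = TypeInfo(WideString);
end;

// True when the distinct type shares WideString's type information.
function DistinctSharesTypeInfo: Boolean;
begin
  Result := TypeInfo(TDistinctString) = TypeInfo(WideString);
end;

begin
  Writeln('Alias shares type info:    ', AliasSharesTypeInfo);
  Writeln('Distinct shares type info: ', DistinctSharesTypeInfo);
end.
```

Since the IDE derives event handler signatures from this type information, a distinct type lets the generated code say SAXString rather than WideString.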