SAX for Pascal provides a specification, in the form of a set of Delphi interfaces, for
parsing XML documents. As the parser reads a document, your client code receives a series of
events that describe the contents of that document. The main interface a SAX parser provides is
IXMLReader (in the SAX.pas unit). This interface allows an application to set and query features
and properties in the parser, to register event handlers for document processing, and to
initiate a document parse. You use it like this:
procedure TMyContentHandler.Parse (const aURL : string);
var
  xmlReader : IXMLReader;
begin
  xmlReader := TSomeSAXParser.Create as IXMLReader;
  xmlReader.setContentHandler (Self);
  xmlReader.parse (aURL);
end;
One of the most important methods of IXMLReader is setContentHandler. With this method you
register your content handler, a class that implements the IContentHandler interface, with the
parser. IContentHandler contains methods like startDocument/endDocument, startElement/endElement,
and characters, which the parser calls when the respective events occur.
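The callback pattern described above can be sketched without the library. The following is a minimal, self-contained model and not the SAX for Pascal API: IToyContentHandler, TPrintingHandler, ToyParse and RunToyParse are hypothetical names invented for this illustration, and the "parser" simply fires the events for the fixed document <greeting>Hello</greeting>.

```pascal
program ToySaxEvents;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical, cut-down stand-in for a content handler interface.
  // It is NOT the IContentHandler declared in SAX.pas.
  IToyContentHandler = interface
    ['{6B1F0C52-3A7E-4D55-9C0E-2F4B8A1D7E10}']
    procedure StartElement(const QName: string);
    procedure Characters(const Chars: string);
    procedure EndElement(const QName: string);
    function EventCount: Integer;
  end;

  // Client code: reacts to each event by printing it and counting it.
  TPrintingHandler = class(TInterfacedObject, IToyContentHandler)
  private
    FEventCount: Integer;
  public
    procedure StartElement(const QName: string);
    procedure Characters(const Chars: string);
    procedure EndElement(const QName: string);
    function EventCount: Integer;
  end;

procedure TPrintingHandler.StartElement(const QName: string);
begin
  Inc(FEventCount);
  WriteLn('start element: ', QName);
end;

procedure TPrintingHandler.Characters(const Chars: string);
begin
  Inc(FEventCount);
  WriteLn('characters: ', Chars);
end;

procedure TPrintingHandler.EndElement(const QName: string);
begin
  Inc(FEventCount);
  WriteLn('end element: ', QName);
end;

function TPrintingHandler.EventCount: Integer;
begin
  Result := FEventCount;
end;

// Stand-in for a parser: reports the events a SAX-style parser would
// raise for the fixed document <greeting>Hello</greeting>.
procedure ToyParse(const Handler: IToyContentHandler);
begin
  Handler.StartElement('greeting');
  Handler.Characters('Hello');
  Handler.EndElement('greeting');
end;

// Creates a handler, runs the toy parse and returns the event count.
function RunToyParse: Integer;
var
  Handler: IToyContentHandler;
begin
  Handler := TPrintingHandler.Create;
  ToyParse(Handler);
  Result := Handler.EventCount;
end;

var
  Count: Integer;
begin
  Count := RunToyParse;
  WriteLn('events received: ', Count);
end.
```

The handler never asks the parser for data. It only reacts to calls in the order the parser makes them, which is the same division of labour that setContentHandler sets up between IXMLReader and your IContentHandler implementation.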
Of course, the code above isn't true Rapid Application Development. That's why the
SAX for Pascal packages come with components, registered on the SAX tab of Delphi's Component
Palette, that implement the SAX interfaces. These components are in the SAXComps.pas unit.
Let's look at an example, step by step.
1. Drop a TSAXDelphi component on the main form.
2. Drop a TSAXXMLReader component on the form.
3. Drop a TSAXContentHandler component on the form.
4. Select SAXContentHandler1 for the ContentHandler property of SAXXMLReader1.
5. Set SAXXMLReader1's URL property: file://pathToYourFile.xml.
6. Write the following in SAXContentHandler1's OnStartElement event:
   MessageDlg (QName, mtInformation, [mbOk], 0);
7. Drop a TButton on the form.
8. Write the following in Button1's OnClick event:
   SAXXMLReader1.Parse;
This will show you the name of every element in the pathToYourFile.xml file, one message
dialog per element.
Note: If you see question marks instead of valid element names, then refer
to the section about ansi and wide strings below.
The SAX for Pascal download comes with several demo programs that further illustrate the use of the SAX interfaces and components.
You may have noticed the TSAXDelphi component in the example above. It has no
properties, and we did nothing with it except drop it on the form. So what's its use? Well,
TSAXDelphi is an example of a vendor.
A vendor is an implementation of the SAX for Pascal interfaces, i.e. a SAX parser. Take a look
at the Vendor property of the TSAXXMLReader component. With this property, you can switch to a
different parser. All your other code is completely independent of the parser you have chosen.
In other words, your code won't suffer from vendor lock-in. The vendor scheme for
SAX for Pascal is modeled after the vendor scheme used with Delphi's TXMLDocument component,
which provides access to an XML DOM.
SAX for Pascal comes bundled with two vendors: TSAXDelphi (a native Delphi parser by
Keith Wood), which we encountered before, and TSAXMSXML, an adapter to Microsoft's
XML parser. Other vendors exist, but they must be downloaded separately.
The XML specification says that any valid Unicode character is valid in XML. Unicode consists of tens of thousands of characters, each of which has a unique code. To store any Unicode character in a fixed-size unit you need 4 bytes (a 32-bit value) for every character. This encoding of Unicode is called UTF-32.
To simplify matters, Unicode defines almost all commonly used characters in the first
65536 characters. This means that most Unicode strings can be encoded using 2 bytes (a
16-bit value) for every character. This encoding is called UTF-16; characters outside the
first 65536 take two 16-bit values (a surrogate pair). In Delphi, UTF-16 is represented using
WideChar and WideString.
To simplify matters further, Unicode defines the first 128 characters to be identical to the characters from ASCII. An encoding that makes use of this fact is UTF-8. UTF-8 is a variable-length encoding that is well suited to storing Unicode when the majority of characters are ASCII characters: each ASCII character takes a single byte, identical to its ASCII value, and every other character takes two to four bytes.
Delphi provides some conversion routines that you may find useful, like
AnsiToUtf8, UCS4StringToWideString, Utf8Decode, etc.
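The size difference between UTF-16 and UTF-8 can be made visible with one of these routines. The sketch below uses UTF8Encode to convert a WideString and then compares lengths. Utf8ByteCount is a hypothetical helper written for this illustration, and the accented character is built from its code point (U+00E9) so that the result does not depend on the encoding of the source file.

```pascal
program Utf8Sizes;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: number of bytes S occupies once encoded as UTF-8.
function Utf8ByteCount(const S: WideString): Integer;
begin
  Result := Length(UTF8Encode(S));
end;

var
  Plain, Accented: WideString;
begin
  Plain := 'cafe';                         // ASCII only
  Accented := 'caf';
  Accented := Accented + WideChar($00E9);  // append U+00E9, e with acute

  // Length of a WideString counts 16-bit units, so both strings report 4.
  WriteLn('plain:    ', Length(Plain), ' UTF-16 units, ',
    Utf8ByteCount(Plain), ' UTF-8 bytes');
  // The accented character needs two bytes in UTF-8, giving 5 bytes in total.
  WriteLn('accented: ', Length(Accented), ' UTF-16 units, ',
    Utf8ByteCount(Accented), ' UTF-8 bytes');
end.
```

For text that is mostly ASCII, the UTF-8 form stays close to one byte per character, while the UTF-16 form always spends at least two.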
When you expect to work with lots of international text, use WideStrings.
You should note that WideStrings are not reference counted on Windows (they
are on Linux), which makes them less efficient to use than AnsiStrings.
When you expect to work with text that is mostly ASCII, but that may contain the
occasional international character, use UTF8Strings. They use
less memory and are reference counted in Delphi.
SAX for Pascal gives you the option to use either.
If you look in the SAX.pas unit, you will see the use of the SAX_WIDESTRINGS
conditional directive. If you define this in your project's
options, then SAX for Pascal will use wide strings; otherwise it will use ansi
strings. The SAX for Pascal packages are compiled with SAX_WIDESTRINGS by default.
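The mechanism can be pictured with a small stand-alone program. This is a sketch of the technique only, not a copy of SAX.pas: the SAXChar name and the exact layout of the declarations are assumptions, and the define is placed in the source here only so that the program is self-contained (in a real project it would go in the project's options).

```pascal
program StringSwitch;
{$APPTYPE CONSOLE}

// Defined here so the sketch is self-contained. Remove this line to
// switch the declarations below to ansi strings.
{$DEFINE SAX_WIDESTRINGS}

type
{$IFDEF SAX_WIDESTRINGS}
  SAXString = WideString;   // 16-bit characters
  SAXChar = WideChar;       // assumed companion type, for illustration
{$ELSE}
  SAXString = AnsiString;   // 8-bit characters
  SAXChar = AnsiChar;
{$ENDIF}

var
  Name: SAXString;
begin
  Name := 'element';
  // Reports 2 with SAX_WIDESTRINGS defined, 1 without it.
  WriteLn('bytes per character: ', SizeOf(SAXChar));
  WriteLn('characters in name:  ', Length(Name));
end.
```

Because every declaration refers to SAXString rather than to a concrete string type, a single define is enough to change the string type throughout.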
The SAX_WIDESTRINGS conditional directive provides a convenient way to switch
between ansi and wide strings. But what if you have used the
SAX for Pascal components, and have created
event handlers? Delphi will have created code like the following:
procedure TForm1.SAXContentHandler1Characters(Sender: TObject;
  const PCh: WideString);
begin
end;
Note the WideString type used. This happens because of the following definition
in SAX.pas: SAXString = WideString;
Delphi considers SAXString to be just an alias for the WideString type, so the generated
handler names WideString and no longer compiles after a switch to ansi strings.
We can force Delphi to consider SAXString as a new type by writing
SAXString = type WideString; instead. This happens when your
project's options define SAX_SEPARATETYPES. So if you haven't made up your mind
about ansi or wide strings, and you want to be able to switch between
them, make sure to define SAX_SEPARATETYPES (and rebuild the
SAX for Pascal packages).
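The difference between the two declarations can be observed through run-time type information. The sketch below is an illustration under assumptions: TAliasString and TDistinctString are hypothetical names, and it relies on the TypInfo unit to read the type names. The expectation is that the plain alias reports the name of the underlying type while the distinct type reports its own name, although the exact output may vary between compilers and versions.

```pascal
program SeparateTypes;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  TypInfo;

type
  TAliasString = WideString;          // plain alias of WideString
  TDistinctString = type WideString;  // new type with its own type information

var
  Distinct: TDistinctString;
  Wide: WideString;
begin
  // Print the type name recorded for each declaration.
  WriteLn('alias:    ', PTypeInfo(TypeInfo(TAliasString))^.Name);
  WriteLn('distinct: ', PTypeInfo(TypeInfo(TDistinctString))^.Name);

  // The distinct type is still assignment-compatible with WideString.
  Wide := 'still compatible';
  Distinct := Wide;
  WriteLn(Length(Distinct));
end.
```

A tool that generates code from type names, as the form designer does for event handlers, can therefore emit SAXString instead of WideString once the type is declared as distinct.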