Friday, November 13, 2009

SAXParser usage code sample

How to parse XML in Java is always a tricky question. There are three main types of parser available: DOM-style parsers (JDOM, Dom4j, XOM, etc.), SAX parsers (Xerces, Piccolo, etc.) and StAX parsers (XPP3, Woodstox, Aalto, etc.). Which type you select depends on the nature and requirements of the project.

StAX parsers, also known as pull parsers, are normally the fastest at processing XML of any size. The implementations available today generally don't support XSD validation; some support DTD validation only. StAX parsers also provide an API to create XML.
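To make the pull model concrete, here is a minimal sketch using the standard javax.xml.stream API that ships with JDK 6 and later. The class name StaxPullSketch, its two helper methods and the sample document are assumptions made up for illustration; they are not part of the validator shown later in this post.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

final class StaxPullSketch {

    // Returns the text of the first element with the given name, or null if there is none.
    static String firstElementText(String xml, String elementName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The application asks ("pulls") for the next event when it is ready.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
    }

    // Uses the StAX writer API to create a tiny XML document.
    static String writeProject(String artifactId) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("project");
            writer.writeStartElement("artifactId");
            writer.writeCharacters(artifactId);
            writer.writeEndElement();
            writer.writeEndElement();
            writer.flush();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<project><artifactId>demo</artifactId><version>1.0</version></project>";
        System.out.println(firstElementText(xml, "version"));
        System.out.println(writeProject("demo"));
    }
}
```

Because the loop stops as soon as the element is found, the rest of the document is never read, which is a large part of why pull parsing suits big files.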

SAX parsers, also known as push parsers, are the next best in terms of performance. Most implementations support both XSD and DTD validation.
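For contrast with the pull model, a push parser drives the loop itself and calls back into your handler. The following is a small sketch with the JDK's SAX API; SaxPushSketch and elementNames are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxPushSketch {

    // Collects the names of all elements in document order.
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes attributes) {
                // The parser "pushes" each start tag to this callback as it reads it.
                names.add(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        } catch (SAXException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames(
                "<project><artifactId>demo</artifactId><version>1.0</version></project>"));
    }
}
```

Any state you need between callbacks (here, the list of names) has to live in the handler, which is the main reason SAX code tends to be more verbose than DOM code.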

DOM parsers are also fast but have a memory overhead because they build the whole XML tree in memory. Most implementations support both XSD and DTD validation. DOM is a good fit when you work with small XML documents and need to move backward and forward through them, change the XML, output new XML, and so on. DOM parsers also come with a good API for traversing the XML, compared to the handler implementation you need to write for SAX and StAX parsers.
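A short sketch of the read, change and write-back cycle that DOM makes easy, using only JDK classes. The class DomEditSketch, the method setVersion and the "version" element are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomEditSketch {

    // Loads the whole document into memory, rewrites every <version> element
    // and serializes the modified tree back to a string.
    static String setVersion(String xml, String newVersion) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList versions = doc.getElementsByTagName("version");
            for (int i = 0; i < versions.getLength(); i++) {
                versions.item(i).setTextContent(newVersion);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (SAXException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        } catch (ParserConfigurationException | IOException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(setVersion(
                "<project><artifactId>demo</artifactId><version>1.0</version></project>", "2.0"));
    }
}
```

The convenience comes at the cost described above: the complete tree is held in memory, so this approach is best kept for small documents.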

I needed to parse a huge XML file and also validate it against a defined XSD. I chose a SAX parser because DOM has a memory overhead and StAX doesn't support XSD validation. I used the Xerces implementation of the SAX parser that ships with Sun JDK 6.

Here is a code snippet that parses the XML and validates it at the same time. The comments in the code explain the functionality.


import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class SAXXMLValidator {
    private final SAXParser saxParser;

    public SAXXMLValidator(String schemaFilePath) throws SAXException,
            ParserConfigurationException {
        // Create a schema factory for W3C XML Schema (XSD) validation.
        SchemaFactory schemaFactory = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);

        // schemaFilePath is the absolute path; you can provide the file in any other way.
        Schema schema = schemaFactory.newSchema(new File(schemaFilePath));
        SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();

        // Keep this false: setValidating(true) only validates against the DTD
        // mentioned in the XML document. XSD validation is enabled by setSchema below.
        saxParserFactory.setValidating(false);

        // If your XSD uses a namespace, set this to true; otherwise you will get an error
        // like "cvc-elt.1: Cannot find the declaration of element 'project'".
        saxParserFactory.setNamespaceAware(true);

        // Provide the schema to the factory so the parser validates the XML.
        saxParserFactory.setSchema(schema);

        // Create the SAXParser once in the constructor to save the creation cost on each call.
        // Note: a SAXParser is NOT thread-safe, so use one SAXXMLValidator per thread.
        // The compiled Schema object is thread-safe and can be shared.
        saxParser = saxParserFactory.newSAXParser();
    }

    public void validate(String xmlFilePath, DefaultHandler defaultHandler)
            throws SAXException, IOException {
        // xmlFilePath is the absolute path; you can provide the file in any other way.
        // Extend DefaultHandler to create your own handler that parses the XML and collects
        // the errors. DefaultHandler implements ErrorHandler, but its error() and warning()
        // methods do nothing, so you must override them to see validation errors.
        saxParser.parse(new File(xmlFilePath), defaultHandler);
    }

    public static void main(String[] args) {
        try {
            // I am using the Maven XSD and a POM to test the code.
            SAXXMLValidator saxxmlValidator = new SAXXMLValidator(
                    "D:\\maven-v4_0_0.xsd");
            saxxmlValidator.validate("D:\\pom.xml", new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    // Rethrow so that a validation error is not silently ignored.
                    throw e;
                }
            });
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
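If you would rather collect every validation problem than stop at the first one, a handler along the following lines could be used. This is only a sketch: CollectingValidatorSketch, CollectingHandler and the tiny inline XSD are hypothetical and exist just to keep the example self-contained; in a real project you would pass such a handler to the validate method of the class above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class CollectingValidatorSketch {

    // Records every problem reported by the parser instead of ignoring it.
    static final class CollectingHandler extends DefaultHandler {
        final List<String> problems = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) { record("WARNING", e); }

        @Override
        public void error(SAXParseException e) { record("ERROR", e); }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            record("FATAL", e);
            throw e; // The document is not well-formed; parsing cannot continue.
        }

        private void record(String level, SAXParseException e) {
            problems.add(level + " line " + e.getLineNumber() + ": " + e.getMessage());
        }
    }

    // Validates xml against xsd (both given as strings) and returns the problems found.
    static List<String> validate(String xsd, String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("Invalid XSD", e);
        }
        CollectingHandler handler = new CollectingHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setSchema(schema);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Fatal errors are normally recorded by the handler before they are thrown.
            if (handler.problems.isEmpty()) {
                handler.problems.add("FATAL: " + e.getMessage());
            }
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        String xsd = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
                + "<xs:element name=\"project\"><xs:complexType><xs:sequence>"
                + "<xs:element name=\"version\" type=\"xs:decimal\"/>"
                + "</xs:sequence></xs:complexType></xs:element></xs:schema>";
        for (String problem : validate(xsd, "<project><version>abc</version></project>")) {
            System.out.println(problem);
        }
    }
}
```

An empty list means the document is valid against the schema. Validation errors are recoverable, so the parser keeps going after error() returns and the list can contain several entries for one document.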

4 comments:

  1. SAX is a somewhat outdated technology that is considered too difficult to use. You may want to check out vtd-xml:

    http://vtd-xml.sf.net

  2. Thanks, Anonymous, for reading the blog. I also prefer StAX over SAX, but I require a validating parser and the only option is SAX. Hopefully StAX parsers will also support XSD validation in the near future.

    I did take a look at VTD, but it's also a non-validating parser and it looks like an in-memory parser, so I am not sure how it will perform for huge XML files to the tune of GBs. I will surely try to compare it in cases where I don't require XSD validation.

  3. vtd-xml actually has an extended edition that does memory mapping. It requires a 64-bit JVM and supports documents up to 256 GB in size.

  4. Thanks, dontcare!
    I will certainly dig into vtd-xml for cases where I don't need to validate my XML against an XSD.
