pugixml documentation

Document Object Model

pugixml is a DOM-based parser, which means that the XML document is converted to a tree. Each XML tag becomes a node in the DOM tree. If a tag is contained in another tag, its node is a child of the outer tag's node. Comments, CDATA sections and PIs (processing instructions) are also transformed into tree nodes, as is standalone text. Each node has a type.

Here is an example of an XML document:

    <?xml version="1.0"?>
    <mesh name="mesh_root">
        <!-- here is a mesh node -->
        some text
        <![CDATA[someothertext]]>
        some more text
        <node attr1="value1" />
        <node attr1="value2">
            <?TARGET somedata?>
            <innernode/>
        </node>
    </mesh>

It gets converted to the following tree. Note that with some parsing options comments, PIs and CDATA sections are not stored in the tree; with other options whitespace-only nodes are stored as well, and the contents of PCDATA sections differ slightly (leading/trailing whitespace is kept). In general, the resulting DOM tree depends on the parsing options.

The parent-child relations are shown with lines. Some nodes have previous and next siblings: for example, the next sibling of the node_comment node is the node_pcdata node with value "some text", and the previous sibling of the node_element with name "mesh" is the node_pi with target "xml" (the target of a PI node is stored in the node name).

Documentation

Introduction

pugixml is a library for parsing XML files: you give it XML data in some way, and it gives you the DOM tree along with ways to traverse it and extract useful information from it. The library source consists of two files, the header pugixml.hpp and the source file pugixml.cpp. You can compile the cpp file as part of your project, build a static library (or perhaps even a DLL), or make the whole code use inline linkage and produce one big file (as was done in pugxml). All library classes reside in namespace pugi, so you can either use fully qualified names (pugi::xml_node) or write a using directive or declaration (using namespace pugi;, using pugi::xml_node;) and use plain names. All classes have the xml_ prefix.

By default it is assumed that you compile the source file with your project (add it to your project, add the relevant entry to your Makefile, or do whatever your build environment requires). The library is written in standard-conformant C++ and was tested on the win32 platform (MSVC 7.1 (2003), MSVC 8.0 (2005)).

xml_parser class

The xml_parser class is the core of the parsing process: you initiate parsing with it, you get the DOM tree from it, and the nodes and attributes are stored in it. You have two ways to load a file. You can provide a string with XML data (it has to be null-terminated, and it will be modified during parsing, so it cannot be a piece of read-only memory), or you can provide an std::istream object (any input stream, like std::ifstream, std::istringstream, etc.); in this case the parser will allocate the necessary amount of memory (equivalent to the stream's size) and read everything from the stream.

The functions for parsing are:

        void parse(std::istream& stream, unsigned int optmsk = parse_noset);

This function creates a buffer the size of the provided stream, reads the data from the stream and parses it with the provided options (optmsk). The stream does not have to persist after the call; the lifetime of the internal buffer holding the stream's data is managed by pugixml.

        char* parse(char* xmlstr, unsigned int optmsk = parse_noset);

This function parses the provided string with the provided options, and returns the position where parsing stopped (do not expect parsing to stop on every error, or even on most of them: pugixml is error ignorant). The input string is modified. The string must persist for the lifetime of the parser.

        xml_parser(std::istream& stream, unsigned int optmsk = parse_default);

A convenience constructor that calls the corresponding parse() function.

        xml_parser(char* xmlstr, unsigned int optmsk = parse_default);

A convenience constructor that calls the corresponding parse() function.

If you want to provide XML data after the creation of the parser, use the default constructor. Otherwise you are free to use either the parsing constructors, or the default constructor followed by a parsing function.

After parsing an XML file, you get a DOM tree. To get access to it (or, more precisely, to its root), either call the document() function or convert the xml_parser object to xml_node, using the following functions:

        operator xml_node() const;
        xml_node document() const;

Ok, the easy part is behind us - now let's dive into the parsing options. There is a variety of them, and you must choose them wisely to get the needed results and the best speed/least memory overhead. First, there are flags that determine which parts of the document are put into the DOM tree, and which are just skipped:

If parse_pi is on, processing instructions (<? ... ?>) are put into the DOM tree (with node type node_pi); otherwise they are discarded. Note that for now the prolog (<?xml ... ?>) is parsed as a processing instruction.
Default value: on
In W3C mode: on

If parse_comments is on, comments (<!-- ... -->) are put into the DOM tree (with node type node_comment); otherwise they are discarded.
Default value: on
In W3C mode: on

If parse_cdata is on, the content of a CDATA section (<![CDATA[ ... ]]>) is put into the DOM tree (with node type node_cdata); otherwise it is discarded.
Default value: on
In W3C mode: on

If parse_ws_pcdata is off, the content of a PCDATA section (the plain text in a node, like in <some_tag>Hello!</some_tag>) is discarded if it consists only of space-like characters (spaces, tabs and newlines).
Default value: off
In W3C mode: on

If parse_ext_pcdata is off, the content of a PCDATA section is discarded if it belongs to the root (document) node, that is, if it does not have a parent tag.
Default value: on
In W3C mode: off

Then there are flags that determine how the retrieved data is processed. There are several reasons for these flags, mainly:

parsing speed. The less processing, the more speed.
data fetching comfort. Sometimes you're ok with messed-up linefeeds, sometimes you're not. Sometimes you want your PCDATA trimmed, sometimes you do not. Sometimes you want your attribute values normalized, sometimes you do not. Some of these are normally specified in DOCTYPE, though...
...the parser is not DOCTYPE aware (and never will be), so you need a way to set those properties - if not on a per-node basis, then on a per-document basis.

So, these are the processing flags:

If parse_trim_pcdata is on, leading/trailing space-like characters are trimmed from PCDATA content.
Default value: on
In W3C mode: off

If parse_trim_attribute is on, leading/trailing space-like characters are trimmed from attribute values.
Default value: on
In W3C mode: off

If parse_escapes_pcdata is on, character reference expansion is done for PCDATA content (replacing &lt; with <, &#x4c; with L, etc.).
Default value: on
In W3C mode: on

If parse_escapes_attribute is on, character reference expansion is done for attribute values (replacing &lt; with <, &#x4c; with L, etc.).
Default value: on
In W3C mode: on

If parse_wnorm_pcdata is on, whitespace normalization is done for PCDATA content (this includes replacing any space-like character with a space character and converting sequences of spaces into a single space).
Default value: on
In W3C mode: off

If parse_wnorm_attribute is on, whitespace normalization is done for attribute values.
Default value: on
In W3C mode: off

If parse_wconv_attribute is on, whitespace conversion is done for attribute values (this is a subset of whitespace normalization, and includes only replacing space-like characters with spaces). If parse_wnorm_attribute is on, this flag has no effect.
Default value: on
In W3C mode: on

If parse_eol_cdata is on, end-of-line handling is done for CDATA content (this includes converting any pair of 0x0d 0x0a characters to a single 0x0a and converting any standalone 0x0d to 0x0a). Note that end-of-line handling is always done for all other content (PCDATA, attribute values); this flag only controls whether it is also done for CDATA sections.
Default value: on
In W3C mode: on
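
The end-of-line rule described for parse_eol_cdata is simple enough to sketch. The following is only an illustration and not pugixml's actual code; normalize_eol is a hypothetical helper name. It rewrites the buffer in place, which is possible because the result is never longer than the input.

```cpp
#include <cassert>
#include <cstring>

// Convert every "\r\n" pair and every standalone '\r' to '\n', in place.
// Returns the new length; the buffer stays null-terminated.
std::size_t normalize_eol(char* text)
{
    char* write = text;

    for (const char* read = text; *read; ++read)
    {
        if (*read == '\r')
        {
            *write++ = '\n';
            if (read[1] == '\n') ++read; // swallow the '\n' of a "\r\n" pair
        }
        else
        {
            *write++ = *read;
        }
    }

    *write = 0;
    return static_cast<std::size_t>(write - text);
}
```

The write position never overtakes the read position, so no temporary buffer is needed.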

Finally, there are two more flags that control closing tag parsing. When pugixml meets a closing tag, it can do one of three things:

check that the tag name matches the opening tag, and return an error if it does not. This is the standard-compliant way; it is controlled by the parse_check_end_tags flag, which is on in W3C mode
try to find the corresponding tag name (so that <foo> <bar> </foo> will be parsed correctly). This is controlled by parse_match_end_tags, which is on by default
just treat the tag as a closing tag for the current node (so that <foo> ... </bar> will be parsed as <foo> ... </foo>). This is the fastest way, and this is what pugxml does, but it can corrupt your DOM tree. This way is chosen if both parse_check_end_tags and parse_match_end_tags are off.

Note that these two flags are mutually exclusive.

Did I say finally? Ok, so finally there are some helper flags, or rather groups of flags. These are:

parse_minimal - no flag is set (this also means the fastest parsing)
parse_default - default set of flags
parse_noset - use the current parser options (see below)
parse_w3c - use the W3C compliance mode

A couple of words on flag usage. The parsing options are just a set of bits, with each bit corresponding to one flag. You can turn a flag on by OR-ing the options value with the flag's constant:

        parse_w3c | parse_wnorm_pcdata

or turn a flag off by AND-ing the options value with the bitwise negation of the flag's constant:

        parse_w3c & ~parse_comments

You can access the current options of the parser with the options() method:

        unsigned int options() const;
        unsigned int options(unsigned int optmsk);

The latter takes a new options value and returns the previous options. These options are used when the parse_noset flag set is passed to the parse() functions (which is the default value of the corresponding parameter).
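
To make the bit arithmetic concrete, here is a small sketch. The demo_* constants and their values are made up for illustration (the real flag values are defined in pugixml.hpp and may differ); flag_on, flag_off and flag_set are hypothetical helpers, not library functions.

```cpp
#include <cassert>

// Hypothetical flag values, for illustration only.
const unsigned int demo_parse_pi       = 0x01;
const unsigned int demo_parse_comments = 0x02;
const unsigned int demo_parse_cdata    = 0x04;

// Turn a flag on by OR-ing it into the mask.
inline unsigned int flag_on(unsigned int mask, unsigned int flag)
{
    return mask | flag;
}

// Turn a flag off by AND-ing the mask with the flag's complement.
inline unsigned int flag_off(unsigned int mask, unsigned int flag)
{
    return mask & ~flag;
}

// Check whether every bit of the flag is set in the mask.
inline bool flag_set(unsigned int mask, unsigned int flag)
{
    return (mask & flag) == flag;
}
```

Turning off a flag that is not set leaves the mask unchanged, so these operations can be applied without checking the current state first.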

xml_node class

If xml_parser is the heart of constructing a DOM tree from a file, xml_node is the heart of processing the tree. It is a simple wrapper, so it is small (4 or 8 bytes, depending on the size of a pointer), you are free to copy it, and it does not own anything. I'll continue with a list of methods and their descriptions, with one note in advance. Some functions that act on a string-like parameter have a counterpart with the suffix _w. The _w suffix means that the function does wildcard matching instead of simple string comparison. You can use the wildcards * (matches any sequence of characters, possibly empty) and ? (matches any single character), as well as character sets ([Abc] means 'any of A, b and c', [A-Z4] means 'any character from A to Z, or 4', [!0-9] means 'any character that is not a digit'). So the wildcard ?ell_[0-9][0-9]_* will match strings like 'cell_23_xref' and 'hell_00_', but will not match strings like 'ell_23_xref', 'cell_0_x' or 'cell_0a_x'.
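
The wildcard rules can be sketched as a small recursive matcher. This is an illustration of the rules as stated above, not the matcher used inside pugixml; wildcard_match and match_set are hypothetical names.

```cpp
#include <cassert>

// Match character c against a set whose text starts right after '['.
// On return, *end points just past the closing ']'.
static bool match_set(const char* set, char c, const char** end)
{
    bool negate = (*set == '!');
    if (negate) ++set;

    bool found = false;
    while (*set && *set != ']')
    {
        if (set[1] == '-' && set[2] && set[2] != ']')
        {
            // a range such as A-Z
            if (set[0] <= c && c <= set[2]) found = true;
            set += 3;
        }
        else
        {
            if (*set == c) found = true;
            ++set;
        }
    }

    if (*set == ']') ++set;
    *end = set;
    return found != negate;
}

// Returns true if the whole string matches the whole pattern.
bool wildcard_match(const char* pattern, const char* str)
{
    while (*pattern)
    {
        if (*pattern == '*')
        {
            // try every possible length for the part matched by '*'
            ++pattern;
            for (;;)
            {
                if (wildcard_match(pattern, str)) return true;
                if (!*str) return false;
                ++str;
            }
        }

        if (!*str) return false; // the pattern still needs a character

        if (*pattern == '?')
        {
            ++pattern;
        }
        else if (*pattern == '[')
        {
            const char* end;
            if (!match_set(pattern + 1, *str, &end)) return false;
            pattern = end;
        }
        else
        {
            if (*pattern != *str) return false;
            ++pattern;
        }

        ++str;
    }

    return *str == 0;
}
```

The recursion on * makes the worst case exponential for patterns with many stars, which is acceptable for short node and attribute names.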


        /// Access iterators for this node's collection of child nodes.
        iterator begin() const;
        iterator end() const;

        /// Access iterators for this node's collection of child nodes (same as begin/end).
        iterator children_begin() const;
        iterator children_end() const;

        /// Access iterators for this node's collection of attributes.
        attribute_iterator attributes_begin() const;
        attribute_iterator attributes_end() const;

        /// Access iterators for this node's collection of siblings.
        iterator siblings_begin() const;
        iterator siblings_end() const;

These functions return iterators for walking through children, siblings and attributes. More on that in the Iterators section.

        operator unspecified_bool_type() const;

This is a safe bool-like conversion operator. You can use it to check a node's validity (if (xml_node), if (!xml_node), if (node1 && node2 && !node3 && cond1 && ...), you get the idea).

        bool operator==(const xml_node& r) const;
        bool operator!=(const xml_node& r) const;
        bool operator<(const xml_node& r) const;
        bool operator>(const xml_node& r) const;
        bool operator<=(const xml_node& r) const;
        bool operator>=(const xml_node& r) const;

Comparison operators.

        bool empty() const;

if (node.empty()) is equivalent to if (!node).

        xml_node_type type() const;
        const char* name() const;
        const char* value() const;

Access the node's properties (type, name and value). If there is no name/value, the corresponding functions return "" - they never return NULL.


        xml_node child(const char* name) const;
        xml_node child_w(const char* name) const;

Get a child node with the specified name, or xml_node() (this is an invalid node) if nothing is found.

        xml_attribute attribute(const char* name) const;
        xml_attribute attribute_w(const char* name) const;

Get an attribute with the specified name, or xml_attribute() (this is an invalid attribute) if nothing is found.

        xml_node sibling(const char* name) const;
        xml_node sibling_w(const char* name) const;

Get a sibling of the node with the specified name, or xml_node() if nothing is found. node.sibling(name) is equivalent to node.parent().child(name).

        xml_node next_sibling(const char* name) const;
        xml_node next_sibling_w(const char* name) const;
        xml_node next_sibling() const;

These functions get the next sibling, that is, one of the siblings of the node that lies to its right. next_sibling() just returns the immediate right sibling of the node (or xml_node()); the two other functions search for a sibling with the given name.

        xml_node previous_sibling(const char* name) const;
        xml_node previous_sibling_w(const char* name) const;
        xml_node previous_sibling() const;

These functions do exactly the same as the next_sibling ones, except that they search the left siblings.

        xml_node parent() const;

Get the parent node. The parent of the root node (the document) is considered to be the document itself.


        const char* child_value() const;

Look for the first node of type node_pcdata or node_cdata among the children of the current node and return its contents (or "" if nothing is found).

        xml_attribute first_attribute() const;
        xml_attribute last_attribute() const;

These functions get the first and last attributes of the node (or xml_attribute() if the node has no attributes).

        xml_node first_child() const;
        xml_node last_child() const;

These functions get the first and last children of the node (or xml_node() if the node has no children).

        template <typename OutputIterator> void all_elements_by_name(const char* name, OutputIterator it) const;
        template <typename OutputIterator> void all_elements_by_name_w(const char* name, OutputIterator it) const;

Get all elements with the specified name in the subtree (depth-first search) and return them through an output iterator (e.g. std::back_inserter).

        template <typename Predicate> xml_attribute find_attribute(Predicate pred) const;
        template <typename Predicate> xml_node find_child(Predicate pred) const;
        template <typename Predicate> xml_node find_element(Predicate pred) const;

Find an attribute, a child, or a node in the subtree (find_element does a depth-first search) with the help of the given predicate. The predicate should behave like a function that accepts an xml_node (or an xml_attribute, for find_attribute) and returns bool. The first entity for which the predicate returns true is returned. If the predicate returns false for all entities, xml_node() or xml_attribute() is returned.


        xml_node first_element(const char* name) const;
        xml_node first_element_w(const char* name) const;

        xml_node first_element_by_value(const char* name, const char* value) const;
        xml_node first_element_by_value_w(const char* name, const char* value) const;

        xml_node first_element_by_attribute(const char* name, const char* attr_name, const char* attr_value) const;
        xml_node first_element_by_attribute_w(const char* name, const char* attr_name, const char* attr_value) const;

        xml_node first_element_by_attribute(const char* attr_name, const char* attr_value) const;
        xml_node first_element_by_attribute_w(const char* attr_name, const char* attr_value) const;

Find the first node (depth-first search) that satisfies the given criteria (i.e. it has a matching name, or a matching name and value, or an attribute with the given name/value, or such an attribute together with a matching name). Note that the _w versions treat all parameters as wildcards.

        xml_node first_node(xml_node_type type) const;

Return the first node (depth-first search) with the given type, or xml_node().

        std::string path(char delimiter = '/') const;

Get the path of the node (i.e. the names of the nodes on the path from the DOM tree root to the node, separated by the delimiter, / by default).

        xml_node first_element_by_path(const char* path, char delimiter = '/') const;

Get the first element that has the given path. The path can be absolute (beginning with the delimiter) or relative; '..' means 'one level up' (so if we are at the path mesh/fragment/geometry/stream, ../.. will lead us to mesh/fragment, and /mesh will lead us to mesh).

        bool traverse(xml_tree_walker& walker) const;

Traverse the subtree (beginning with the current node) with the walker and return the result. See the Miscellaneous section for details.

xml_attribute class

Like xml_node, xml_attribute is a simple wrapper around a node's attribute.

        bool operator==(const xml_attribute& r) const;
        bool operator!=(const xml_attribute& r) const;
        bool operator<(const xml_attribute& r) const;
        bool operator>(const xml_attribute& r) const;
        bool operator<=(const xml_attribute& r) const;
        bool operator>=(const xml_attribute& r) const;

Comparison operators.

        operator unspecified_bool_type() const;

Safe bool conversion - as with xml_node, use this to check for validity.

        bool empty() const;

As with xml_node, if (attr.empty()) is equivalent to if (!attr).

        xml_attribute next_attribute() const;
        xml_attribute previous_attribute() const;

Get the next/previous attribute of the node that owns the current attribute. Return xml_attribute() if no such attribute is found.

        const char* name() const;
        const char* value() const;

Get the name and value of the attribute. These methods never return NULL - they return "" instead.

        int as_int() const;
        double as_double() const;
        float as_float() const;

Convert the value of an attribute to the desired type. If the conversion is not successful, return the default value (0 for int, 0.0 for double, 0.0f for float). These functions rely on the CRT ato* functions.

        bool as_bool() const;

Convert the value of an attribute to bool. This method returns true if the first character of the value is '1', 't', 'T', 'y' or 'Y'. Otherwise it returns false.
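
The as_bool() rule looks only at the first character, which has a few consequences worth seeing. The sketch below mirrors the rule as described; demo_as_bool is a hypothetical stand-alone function, not the library method.

```cpp
#include <cassert>

// True if the first character is '1', 't', 'T', 'y' or 'Y'; false otherwise
// (including for an empty string or a null pointer).
bool demo_as_bool(const char* value)
{
    if (!value) return false;

    char c = value[0];
    return c == '1' || c == 't' || c == 'T' || c == 'y' || c == 'Y';
}
```

Under this rule "true", "Yes" and "10" all convert to true, while "on" converts to false, so attribute values should be written with that in mind.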

Iterators

Sometimes you have to cycle through the children or the attributes of a node. You can do it either by using next_sibling, previous_sibling, next_attribute and previous_attribute (along with first_child, last_child, first_attribute and last_attribute), or by using an iterator-like interface. There are two iterator types, xml_node_iterator and xml_attribute_iterator. They are bidirectional constant iterators, which means that you can increment or decrement them, and use the dereferencing and member access operators to get constant access to the node/attribute (the constness of iterators may change with the introduction of mutable trees).

In order to get the iterators, use the corresponding functions of xml_node. Note that the _end() functions return a past-the-end iterator, so in order to get the last attribute you have to do something like:

        if (node.attributes_begin() != node.attributes_end()) // we have at least one attribute
        {
            xml_attribute last_attrib = *(--node.attributes_end());
            ...
        }

Miscellaneous

If you want to traverse a subtree, you can use the traverse function. There is a class xml_tree_walker with some functions that you can override in order to customize the traversal (the default walker just does nothing).

        virtual bool begin(const xml_node&);
        virtual bool end(const xml_node&);

These functions are called when the processing of a node starts/ends. First begin() is called, then all children of the node are processed recursively, then end() is called. If any of these functions returns false, the traversal is stopped and the traverse() function returns false.

        virtual void push();
        virtual void pop();

These functions are called before and after the processing of a node's children. If the node has no children, neither of them is called. The default behavior is to increment/decrement the current node depth.

        virtual int depth() const;

Get the current depth. You can use this function to do your own indentation, for example.
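
The call order of begin(), push(), pop() and end() is easier to see on a toy tree. The sketch below re-creates the protocol as described in this section with stand-in types; toy_node, toy_walker, toy_traverse and outline_walker are hypothetical names and do not come from pugixml.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A stand-in tree node: a name plus pointers to child nodes.
struct toy_node
{
    std::string name;
    std::vector<const toy_node*> children;
};

// A stand-in walker with the same set of hooks as described above.
class toy_walker
{
    int depth_;

public:
    toy_walker(): depth_(0) {}
    virtual ~toy_walker() {}

    virtual bool begin(const toy_node&) { return true; }
    virtual bool end(const toy_node&) { return true; }
    virtual void push() { ++depth_; }
    virtual void pop() { --depth_; }
    virtual int depth() const { return depth_; }
};

// begin(), then the children (wrapped in push()/pop()), then end().
// Stops and returns false as soon as begin() or end() returns false.
bool toy_traverse(const toy_node& node, toy_walker& walker)
{
    if (!walker.begin(node)) return false;

    if (!node.children.empty())
    {
        walker.push();

        for (std::size_t i = 0; i < node.children.size(); ++i)
            if (!toy_traverse(*node.children[i], walker)) return false;

        walker.pop();
    }

    return walker.end(node);
}

// Example walker: builds an indented outline using depth().
struct outline_walker: toy_walker
{
    std::string out;

    virtual bool begin(const toy_node& node)
    {
        out += std::string(static_cast<std::size_t>(depth()) * 2, ' ');
        out += node.name;
        out += "\n";
        return true;
    }
};
```

In this sketch an aborted traversal leaves the depth counter where it was, so a walker should not be reused after toy_traverse returns false.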

Let's get to some minor notes. You can safely write something like:

        bool value = node.child("stream").attribute("compress").as_bool();

If node has a child named 'stream', and this child has an attribute 'compress', then everything is ok. If node has a child named 'stream' with no attribute 'compress', then attribute("compress") will return xml_attribute(), and the call to as_bool() will return the default value (false). If there is no child node 'stream', the child(...) call will return xml_node(), the subsequent call to attribute(...) will return xml_attribute() (because no attributes belong to an invalid node), and as_bool() will again return false, so this call sequence is perfectly safe.
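
Such chains are safe because an invalid handle answers every query with another invalid handle or a default value. Here is a minimal sketch of that convention with stand-in types; attr_data, node_data, demo_attribute and demo_node are hypothetical and only imitate the behavior described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Plain data that the handles point to.
struct attr_data { const char* name; const char* value; };

struct node_data
{
    const char* name;
    const attr_data* attrs;    std::size_t attr_count;
    const node_data* children; std::size_t child_count;
};

// Handle to an attribute; a null pointer means "invalid attribute".
class demo_attribute
{
    const attr_data* d;

public:
    explicit demo_attribute(const attr_data* p = 0): d(p) {}

    bool empty() const { return d == 0; }
    const char* value() const { return d ? d->value : ""; }

    bool as_bool() const
    {
        char c = value()[0];
        return c == '1' || c == 't' || c == 'T' || c == 'y' || c == 'Y';
    }
};

// Handle to a node; a null pointer means "invalid node".
class demo_node
{
    const node_data* d;

public:
    explicit demo_node(const node_data* p = 0): d(p) {}

    bool empty() const { return d == 0; }

    // An invalid node has no children, so the lookup yields an invalid node.
    demo_node child(const char* name) const
    {
        if (d)
            for (std::size_t i = 0; i < d->child_count; ++i)
                if (std::strcmp(d->children[i].name, name) == 0)
                    return demo_node(&d->children[i]);

        return demo_node();
    }

    // An invalid node has no attributes either.
    demo_attribute attribute(const char* name) const
    {
        if (d)
            for (std::size_t i = 0; i < d->attr_count; ++i)
                if (std::strcmp(d->attrs[i].name, name) == 0)
                    return demo_attribute(&d->attrs[i]);

        return demo_attribute();
    }
};
```

The price of this convenience is that a misspelled name silently yields the default value, so check empty() when the difference between "absent" and "false" matters.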

Lifetime issues and memory management

As parsing is done in-situ, the XML data has to persist during the lifetime of xml_parser. If parsing is started via an xml_parser function that accepts char*, you have to ensure yourself that the string outlives the xml_parser object.

The memory for nodes and attributes is allocated in blocks (the blocks form a linked list). The default block size is 32 kb, though you can change it by changing the memory_block_size constant in the pugixml.hpp file. Remember that the first block resides inside the xml_parser object (so it is on the stack when the parser is a local variable), and all subsequent blocks are allocated on the heap, so expect a stack overflow if you set the memory block size too large. Because the xml_parser object contains the blocks, it should outlive all xml_node and xml_attribute objects (as well as iterators) that belong to the parser's tree. Again, you should ensure this yourself.

Example

Ok, so you are not much of a documentation reader, are you? Neither am I. Let's assume that you're going to parse an XML file... something like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <mesh name="Cathedral">
        <fragment name="Cathedral">
            <geometry>
                <stream usage="main" source="StAnna.dmesh" compress="true" />
                <stream usage="ao" source="StAnna.ao" />
            </geometry>
        </fragment>
        <fragment name="Cathedral">
            ...
        </fragment>
        ...
    </mesh>

<mesh> is the root node. It has 0 or more <fragment>s, each of which has a <geometry> node containing <stream> nodes with the attributes shown. We'd like to parse the file and... well, do something with its contents. There are several ways of doing that; I'll show two of them (the remaining one uses iterators).

Here we exploit the knowledge of the strict hierarchy of our XML document and read the nodes from the DOM tree accordingly. When we have an xml_node object, we can get the desired information from it (name, value, attribute list, nearby nodes in the tree - siblings, parent and children).


    #include <fstream>
    #include <vector>
    #include <algorithm>
    #include <iterator>

    #include "pugixml.hpp"

    using namespace pugi;

    int main()
    {
        std::ifstream in("mesh.xml");
        in.unsetf(std::ios::skipws);

        // read the whole file into a writable buffer
        std::vector<char> buf;
        std::copy(std::istream_iterator<char>(in), std::istream_iterator<char>(), std::back_inserter(buf));
        buf.push_back(0); // zero-terminate

        xml_parser parser(&buf[0], pugi::parse_w3c);

        xml_node doc = parser.document();

        if (xml_node mesh = doc.first_element("mesh"))
        {
            // store mesh.attribute("name").value()

            // use the named next_sibling overload so that text and comment siblings are skipped
            for (xml_node fragment = mesh.first_element("fragment"); fragment; fragment = fragment.next_sibling("fragment"))
            {
                // store fragment.attribute("name").value()

                if (xml_node geometry = fragment.first_element("geometry"))
                    for (xml_node stream = geometry.first_element("stream"); stream; stream = stream.next_sibling("stream"))
                    {
                        // store stream.attribute("usage").value()
                        // store stream.attribute("source").value()

                        if (stream.attribute("compress"))
                        {
                            // store stream.attribute("compress").as_bool()
                        }
                    }
            }
        }
    }

We can also write a class that will traverse the DOM tree and store the information from nodes based on their names, depths, attributes, etc. This approach is well known to users of SAX parsers. To do that, we have to write an implementation of the xml_tree_walker interface.

    #include <cstring>
    #include <fstream>
    #include <vector>
    #include <algorithm>
    #include <iterator>

    #include "pugixml.hpp"

    using namespace pugi;

    struct mesh_parser: public xml_tree_walker
    {
        virtual bool begin(const xml_node& node)
        {
            if (strcmp(node.name(), "mesh") == 0)
            {
                // store node.attribute("name").value()
            }
            else if (strcmp(node.name(), "fragment") == 0)
            {
                // store node.attribute("name").value()
            }
            else if (strcmp(node.name(), "geometry") == 0)
            {
                // ...
            }
            else if (strcmp(node.name(), "stream") == 0)
            {
                // store node.attribute("usage").value()
                // store node.attribute("source").value()

                if (node.attribute("compress"))
                {
                    // store node.attribute("compress").as_bool()
                }
            }
            else if (node.type() == node_element)
            {
                // unknown element: stop the traversal
                // (the document node, text, comments and PIs are let through)
                return false;
            }

            return true;
        }
    };

    int main()
    {
        std::ifstream in("mesh.xml");
        in.unsetf(std::ios::skipws);

        // read the whole file into a writable buffer
        std::vector<char> buf;
        std::copy(std::istream_iterator<char>(in), std::istream_iterator<char>(), std::back_inserter(buf));
        buf.push_back(0); // zero-terminate

        xml_parser parser(&buf[0], pugi::parse_w3c);

        mesh_parser mp;

        if (!parser.document().traverse(mp))
        {
            // generate an error
        }
    }

Parsing process

So, let's talk a bit about the parsing process, and about the reason for providing XML data as a contiguous writable block of memory. Parsing is done in-situ. This means that the strings representing parts of the DOM tree (node names, attribute names and values, CDATA content, etc.) are not separately allocated on the heap, but are instead parts of the original data. This is the key to parsing speed, because it minimizes both the number of memory allocations (more on that below) and the amount of data copying.

In-situ parsing can be done in two ways: by zero-segmenting the string (that is, setting the character just past the end of each part of the XML string to 0; see this image for further details), or by storing a pointer plus the size of the string instead of a pointer to the beginning of an ASCIIZ string.
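
As a rough illustration of zero-segmenting, the sketch below splits a run of name="value" pairs in place: it overwrites the '=' and the closing quote with 0 and keeps pointers into the original buffer. This is a toy under simplifying assumptions (single spaces as separators, double quotes only); split_attributes and demo_attr are hypothetical names, and real XML attribute parsing has more cases.

```cpp
#include <cassert>
#include <cstring>
#include <utility>
#include <vector>

// A (name, value) pair of pointers into the caller's buffer.
typedef std::pair<const char*, const char*> demo_attr;

// Splits 'name="value" name2="value2"' in place. No string is copied:
// terminators are written into the buffer and pointers to the parts are kept.
// Stops at the first malformed pair.
std::vector<demo_attr> split_attributes(char* s)
{
    std::vector<demo_attr> result;

    while (*s)
    {
        while (*s == ' ') ++s; // skip separators
        if (!*s) break;

        char* name = s;
        while (*s && *s != '=') ++s;
        if (*s != '=' || s[1] != '"') break; // malformed input: stop

        *s = 0; // terminate the name in place
        s += 2; // skip the former '=' and the opening quote

        char* value = s;
        while (*s && *s != '"') ++s;
        if (!*s) break; // unterminated value: stop

        *s++ = 0; // terminate the value in place

        result.push_back(demo_attr(name, value));
    }

    return result;
}
```

The returned pointers stay valid only as long as the buffer does, which is exactly why the XML data has to outlive the parser.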

Originally, pugxml had only the first way, but then its authors added the second method: 'non-segmenting', or non-destructive, parsing. The advantages of this method are that you no longer need non-constant storage, and you can even read data from memory-mapped files directly. Well, there are disadvantages. For one thing, you cannot do any of the transformations in-situ. The transformations that are required by the XML standard are:

End of line handling (replacing 0x0d 0x0a with 0x0a and any standalone 0x0d with 0x0a), for the whole document
White space normalization for attribute values (converting space-like characters to spaces (0x20), sometimes trimming leading/trailing spaces and converting sequences of spaces to a single space)
Character reference expansion (&lt; and the like, &#x0a; and the like, &#40; and the like)
Entity reference expansion (&entityname;)

None of these can be done without modifying the data. pugxml did neither character nor entity reference expansion, and allocated new memory when normalizing white space in non-destructive mode. I chose complete in-situ parsing (the good thing about it is that any transformation except entity reference expansion can be done in-situ, because it does not increase the number of characters - even converting a character reference to UTF-8). There is no entity reference expansion because of this, and because I do not want to parse DOCTYPE and, moreover, use DOCTYPE in subsequent parsing (performing selective whitespace normalization in attributes and CDATA sections and so on).
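
The claim that reference expansion never grows the text can be checked with a small sketch. It expands the five predefined entities and numeric character references (encoded as UTF-8) in place; anything it does not recognize is copied unchanged. The function names are hypothetical and the code is an illustration, not pugixml's implementation (for instance, it does not reject surrogate code points).

```cpp
#include <cassert>
#include <cstring>

// Write the UTF-8 encoding of a code point (<= 0x10FFFF); return the new end.
static char* write_utf8(char* out, unsigned long cp)
{
    if (cp < 0x80)
    {
        *out++ = static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Parse "&#NN;" or "&#xHH;" at s. On success store the code point and
// return the position after ';'; otherwise return a null pointer.
static const char* parse_char_ref(const char* s, unsigned long* cp)
{
    if (s[0] != '&' || s[1] != '#') return 0;

    unsigned base = 10;
    const char* p = s + 2;
    if (*p == 'x') { base = 16; ++p; }

    const char* digits = p;
    unsigned long v = 0;
    for (;; ++p)
    {
        unsigned d;
        if (*p >= '0' && *p <= '9') d = *p - '0';
        else if (base == 16 && *p >= 'a' && *p <= 'f') d = *p - 'a' + 10;
        else if (base == 16 && *p >= 'A' && *p <= 'F') d = *p - 'A' + 10;
        else break;

        v = v * base + d;
        if (v > 0x10FFFF) return 0; // outside the Unicode range
    }

    if (p == digits || *p != ';' || v == 0) return 0;
    *cp = v;
    return p + 1;
}

// Expand predefined entities and numeric character references in place.
void expand_references(char* text)
{
    static const struct { const char* name; char ch; } named[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' },
        { "&quot;", '"' }, { "&apos;", '\'' }
    };

    char* write = text;
    const char* read = text;
    while (*read)
    {
        unsigned long cp;
        const char* after = (*read == '&') ? parse_char_ref(read, &cp) : 0;
        if (after)
        {
            // the encoded form is never longer than the reference itself
            write = write_utf8(write, cp);
            read = after;
            continue;
        }

        bool replaced = false;
        for (std::size_t i = 0; *read == '&' && i < sizeof(named) / sizeof(named[0]); ++i)
        {
            std::size_t len = std::strlen(named[i].name);
            if (std::strncmp(read, named[i].name, len) == 0)
            {
                *write++ = named[i].ch;
                read += len;
                replaced = true;
                break;
            }
        }

        if (!replaced) *write++ = *read++;
    }
    *write = 0;
}
```

The shortest reference that needs four UTF-8 bytes is nine characters long (&#x10000;), so the write position can never overtake the read position.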

In order to be able to modify the tree (change attribute/node names and values) with in-situ parsing, one needs to implement two ways of storing data (both in-situ and not). The DOM tree is currently immutable, but this will change in future releases (without introducing speed/memory overhead, except at the clean-up stage).

The parsing process itself looks more or less straightforward when you see it, but the impression is false, because explicit jumps are made (i.e. we know that if we come to a closing bracket (>), we should expect CDATA after it (or a new tag), so let's just jump to the corresponding code), and, well, there can be bugs (see the Bugs section).

And, to make things worse, memory allocation (which is done only for node and attribute structures) is done in pools. The pools are singly-linked lists of blocks with a predefined block size (32 kb by default), and, well, this increases speed a lot (allocations are slow, and memory gets fragmented when allocating a bunch of 16-byte (attribute) or 40-byte (node) structures).
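
A pool of this kind can be sketched in a few lines. The sketch assumes a bump allocator over a singly-linked list of fixed-size blocks, with the first block embedded in the pool object; demo_pool and demo_block_size are hypothetical names, the code uses C++11 features (alignas, deleted functions), and the real layout in pugixml may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical block size; 32 kb is the default mentioned in the text.
const std::size_t demo_block_size = 32 * 1024;

class demo_pool
{
    struct block
    {
        block* next;
        std::size_t used;
        alignas(std::max_align_t) char data[demo_block_size];
    };

    block first_;     // the first block lives inside the pool object itself
    block* current_;  // the block that allocations are served from
    std::size_t block_count_;

public:
    demo_pool(): current_(&first_), block_count_(1)
    {
        first_.next = 0;
        first_.used = 0;
    }

    // Individual allocations are never freed; all heap blocks go at once.
    ~demo_pool()
    {
        block* b = first_.next;
        while (b)
        {
            block* next = b->next;
            delete b;
            b = next;
        }
    }

    demo_pool(const demo_pool&) = delete;
    demo_pool& operator=(const demo_pool&) = delete;

    // Returns a null pointer for requests that cannot fit into one block.
    void* allocate(std::size_t size)
    {
        if (size == 0 || size > demo_block_size) return 0;

        // round the size up so that every allocation stays aligned
        const std::size_t align = alignof(std::max_align_t);
        size = (size + align - 1) / align * align;

        if (current_->used + size > demo_block_size)
        {
            // the current block is full: chain a new heap block
            block* b = new block;
            b->next = 0;
            b->used = 0;
            current_->next = b;
            current_ = b;
            ++block_count_;
        }

        void* result = current_->data + current_->used;
        current_->used += size;
        return result;
    }

    std::size_t block_count() const { return block_count_; }
};
```

Because the first block is a member, a demo_pool declared as a local variable puts 32 kb on the stack, which is the stack usage the lifetime section warns about.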


W3C compliance

pugixml is not a compliant XML parser. The main reason for that is that it does not reject most malformed XML files. A more or less complete list of incompatibilities follows (I am talking about the ones that remain when using parse_w3c mode):

The parser is completely DOCTYPE-ignorant, that is, it does not even skip all possible DOCTYPEs correctly, let alone use them for parsing
It accepts multiple attributes with the same name in one node
It is charset-ignorant
It accepts invalid tag names
It accepts invalid attribute values (those with < in them) and does not reject invalid entity references or character references (in fact, it does not do DOCTYPE parsing, so it does not perform entity reference expansion)
It does not reject comments with -- inside
It does not reject PIs with names like 'xml'; in fact, it parses the prolog as a PI, which is not conformant
All characters from #x1 to #x20 are considered to be whitespace
And some other things that I forgot to mention

In short, it accepts most malformed XML files and does not do anything related to DOCTYPE. This is because the main goal was to develop a fast, easy-to-use and error-ignorant parser (so you can always get something even from a malformed document); there are some good validating and conformant parsers already.

FAQ

I'm always open to questions; feel free to write them to zeux@mathcentre.com.

Bugs

I'm always open to bug reports; feel free to write them to zeux@mathcentre.com. Please provide as much information as possible: the version of pugixml, the compiler and OS environment (compiler and its version, STL version, OS version, etc.), a description of the situation in which the bug arises, the code and data files that show the bug, etc. - the more, the better. Though, please, do not send executable files.

Future work

Here are some improvements that will be made in future versions (they are sorted by priority; the upper ones will get there sooner).

Support for altering the tree (both changing nodes'/attributes' names and values and adding/deleting attributes/nodes) and writing the tree to a stream
Support for UTF-16 files (parsing the BOM to get the file's type and converting a UTF-16 file to a UTF-8 buffer if necessary)
Improved API (I'm going to look at SelectNode from MS XML and perhaps there will be some other changes)
Externally provided entity reference table (or perhaps even taken from DOCTYPE?)
More intelligent parsing of DOCTYPE (it does not always skip DOCTYPE for now)
XML 1.1 changes (changed EOL handling, normalization issues)
XPath support
Name your own?