From 0a97bad6608a2b1ea01ae6ce18bab63abf0c9210 Mon Sep 17 00:00:00 2001
From: "arseny.kapoulkine"
Date: Wed, 21 Feb 2007 19:41:31 +0000
Subject: Merged 0.3 in trunk

git-svn-id: http://pugixml.googlecode.com/svn/trunk@68 99668b35-9821-0410-8761-19e4c4f06640
---
 docs/index.html | 1059 ++++++++++++++++---------------------------------------
 1 file changed, 302 insertions(+), 757 deletions(-)

(limited to 'docs/index.html')

diff --git a/docs/index.html b/docs/index.html
index c843bdb..7c8392f 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -15,17 +15,8 @@

Contents

Introduction
-
Document Object Model
-
Documentation -
Introduction
-
xml_parser class
-
xml_node class
-
xml_attribute class
-
Iterators
-
Miscellaneous
-
Lifetime issues and memory management
- -
Parsing process
+
Quick start
+
Reference
W3C compliance
Comparison with existing parsers
FAQ
@@ -43,7 +34,7 @@

pugixml is just another XML parser. This is a successor to pugxml (well, to be honest, the only part that is left as is is wildcard matching code; the rest was either heavily refactored or rewritten
-from scratch). The main features (call it USP) are:
+from scratch). The main features are:

@@ -84,10 +75,8 @@ ok with DOM - it should not be a problem, because the overall memory consumption you'll need a contiguous chunk of memory, which can be a problem).
  • lack of validation, DTD processing, XML namespaces, proper handling of encoding. If you need those - go take MSXML or XercesC or anything like that.
  • -
  • lack of XPath & UTF-16/32 parsing. These are not implemented for now, but they are the features -for the next release.
  • -
  • immutability of DOM tree. It's constant. You can't change it. There are good reasons for prohibiting -that, though it is a thing that will likely be in the next release.
  • +
  • lack of UTF-16/32 parsing. This is not implemented for now, but this is the features for the next +release.

  • @@ -101,803 +90,345 @@ an XML file were measured.
    - -

    Document Object Model

    +
    +

    Quick start

    -

    pugixml is a DOM-based parser. This means, that the XML document is converted to a tree. -Each XML tag is converted to a node in DOM tree. If a tag is contained in some other tag, its node -is a child to the outer tag's one. Comments, CDATA sections and PIs (Processing Instructions) also are -transformed into tree nodes, as is the standalone text. Each node has its type.

    +

    Here there is a small collection of code snippets to help the reader begin using pugixml.

    -

    Here is an example of an XML document: +

    For everything you can do with pugixml, you need a document. There are several ways to obtain it:

    -
    -<?xml version="1.0"?>
    -<mesh name="mesh_root">
    -    <!-- here is a mesh node -->
    -    some text
    -    <![CDATA[[someothertext]]>
    -    some more text
    -    <node attr1="value1" />
    -    <node attr1="value2">
    -        <?TARGET somedata?>
    -        <innernode/>
    -    </node>
    -</mesh>
    -
    - -It gets converted to the following tree (note, that with some parsing options comments, PIs and CDATA -sections are not stored in the tree, and with some options there are also nodes with whitespaces -and the contents of PCDATA sections is a bit different (with trailing/leading whitespaces). So generally -the resulting DOM tree depends on the parsing options):

    - -

    - -

    The parent-children relations are shown with lines. Some nodes have previous and next siblings -(for example, the next sibling for node_comment node is node_pcdata with value "some text", and the -previous sibling for node_element with name "mesh" is node_pi with target "xml" (target for PI nodes -is stored in the node name)).

    -
    - -
    -

    Documentation

    - -
    -

    Introduction

    - -

    pugixml is a library for parsing XML files, which means that you give it XML data some way, -and it gives you the DOM tree and the ways to traverse it and to get some useful information from it. -The library source consist of two files, the header pugixml.hpp, and the source code pugixml.cpp. -You can either compile cpp file in your project, or build a static library (or perhaps even a DLL), -or make the whole code use inline linkage and make one big file (as it was done in pugxml). -All library classes reside in namespace pugi, so you can either use fully qualified -names (pugi::xml_node) or write a using declaration (using namespace pugi;, using -pugi::xml_node) and use plain names. All classes have the xml_ prefix.

    - -

    By default it's supposed that you compile the source file with your project (add it into your -project, or add relevant entry in your Makefile, or do whatever you need to do with your compilation -environment). The library is written in standard-conformant C++ and was tested on win32 platform -(MSVC 7.1 (2003), MSVC 8.0 (2005)).

    - -
    -

    xml_parser class

    - -

    xml_parser class is the core of parsing process; you initiate parsing with it, you get DOM -tree from it, the nodes and attributes are stored in it. You have two ways to load a file: either -provide a string with XML-data (it has to be null-terminated, and it will be modified during parsing -process, so it can not be a piece of read-only memory), or with an std::istream object (any input -stream, like std::ifstream, std::istringstream, etc.) - in this case the parser will allocate -the necessary amount of memory (equivalent to stream's size) and read everything from the stream.

    - -

    The functions for parsing are: -

    -
    -
    
    -        void parse(std::istream& stream, unsigned int optmsk = parse_noset);
    -
    This function will create a buffer with the size equal to that of provided stream, -read the chunk of data from the stream and parse it with provided options (optmsk). -The stream does not have to persist after the call to the function, the lifetime of internal buffer -with stream's data is managed by pugixml. -
    - -
     
    -
    -
    
    -        char* parse(char* xmlstr, unsigned int optmsk = parse_noset);
    -
    -
    This function parses the provided string with provided options, and returns the position where the -parsing stopped (do not expect, that parsing will stop on every error, or on most of them - as I've -said, pugixml is error ignorant). The input string is modified. The string must persist for the -lifetime of the parser. - -
     
    -
    -
    
    -        char* parse(const ownership_transfer_tag&, char* xmlstr, unsigned int optmsk = parse_noset);
    -
    -
    This function parses the provided string with provided options, and returns the position where the -parsing stopped (do not expect, that parsing will stop on every error, or on most of them - as I've -said, pugixml is error ignorant). The input string is modified. The string's ownership is -managed by parser (string's memory is freed automatically when parser's destructor is called). -
     
    -
    -
    
    -        xml_parser(std::istream& stream, unsigned int optmsk = parse_default);
    -
    Just a convenience ctor, that calls the corresponding parse() function.
    - -
     
    -
    -
    
    -        xml_parser(char* xmlstr, unsigned int optmsk = parse_default);
    -
    Just a convenience ctor, that calls the corresponding parse() function.
    - -
     
    -
    
    -        xml_parser(const ownership_transfer_tag&, char* xmlstr, unsigned int optmsk = parse_default);
    -
    Just a convenience ctor, that calls the corresponding parse() function.
    - -
    - -

    If you want to provide XML data after the creation of the parser, use the default ctor. Otherwise -you are free to use either parsing ctors or default ctor and later - parsing function.

    - -

    After parsing an XML file, you'll get a DOM tree. To get access to it (or, more precisely, to its -root), call either document() function or cast xml_parser object to xml_node by -using the following functions:

    - -
    
    -        operator xml_node() const;
    -        xml_node document() const;
    -
    - -

    Ok, easy part is behind - now let's dive into parsing options. There is a variety of them, and you -must choose them wisely to get the needed results and the best speed/least memory overhead. At first, -there are flags that determine which parts of the document will be put into DOM tree, and which will -be just skipped:

    - -
      -
    • If parse_pi is on, then processing instructions (<? ... ?>) are put into DOM -tree (with node type node_pi) otherwise they are discarded. Note that for now the prolog -(<?xml ... ?>) is parsed as a processing instruction. -
      Default value: off -
      In W3C mode: on
    • -
    • If parse_comments is on, then comments (<!-- ... -->) are put into DOM -tree (with node type node_comment) otherwise they are discarded. -
      Default value: off -
      In W3C mode: on
    • -
    • If parse_cdata is on, then the content of CDATA section (<![CDATA[[ ... ]]>) -is put into DOM tree (with node type node_cdata) otherwise it is discarded. -
      Default value: on -
      In W3C mode: on
    • -
    • If parse_ws_pcdata is off, then the content of PCDATA section (it's the plain text -in the node, like in <some_tag>Hello!</some_tag>) is discarded if it consists only -of space-like characters (spaces, tabs and newlines). -
      Default value: off -
      In W3C mode: on
    • -
    • If parse_ext_pcdata is off, then the content of PCDATA section is discarded if it belongs -to root (document) node, that is it does not have a parent tag. -
      Default value: on -
      In W3C mode: off
    • -
    - -

    Then there are flags that determine how the processing of the retrieved data is done. There are -several reasons for these flags, mainly: -

      -
    • parsing speed. The less processing - the more speed.
    • -
    • data fetching comfort. Sometimes you're ok with messed linefeeds, sometimes you're not. Sometimes -you want your PCDATA trimmed, sometimes you do not. Sometimes you want your attribute values normalized, -sometimes you do not. Some of these are normally specified in DOCTYPE, though... -
    • ...parser is not DOCTYPE aware (and will never be), so you need a way to set those properties - -if not on per-node basis, then on per-document
    • -
    -So, these are the processing flags: -

    - -
      -
    • If parse_escapes is on, then the character reference expansion is done for PCDATA content -and for attribute values (replacing <lt; with <, &#4c; with L, etc.). -
      Default value: on -
      In W3C mode: on
    • -
    • If parse_wnorm_attribute is on, then the whitespace normalisation is done for attribute -values (this includes replacing any space-like character by a space character, converting sequences of -spaces into a single space and trimming of leading/trailing spaces) -
      Default value: off -
      In W3C mode: off
    • -
    • If parse_wconv_attribute is on, then the whitespace conversion is done for attribute -values (this is a subset of whitespace normalization, and includes only replacing space-like characters -with spaces). If parse_wnorm_attribute is on, this flag has no effect. -
      Default value: on -
      In W3C mode: on
    • -
    • If parse_eol is on, then the end-of-line handling is done for PCDATA/CDATA content and for -attribute values (this includes converting any pair of 0x0d 0x0a characters to a single 0x0a and -converting any standalone 0x0d to 0x0a). -
      Default value: on -
      In W3C mode: on
    • -
    - -

    Finally, there are two more flags, that indicate closing tag parsing. When pugixml meets a -close tags, there are three ways: -

      -
    • check that the tag name matches the opening tag, return an error if it does not. This is a -standard-compliant way, is controlled by parse_check_end_tags flag, which is on in W3C mode
    • -
    • try to find the corresponding tag name (so that <foo> <bar> </foo> will be parsed -correctly). This is controlled by parse_match_end_tags, which is on by default
    • -
    • just treat the tag as a closing tag for the node (so that <foo> ... </bar> will -be parsed as <foo> ... </foo>). This is the fastest way, and this is what pugxml -is doing, but it can corrupt your DOM tree. This way is chosen if both parse_check_end_tags and -parse_match_end_tags are off. -
    -Note, that these 2 flags are mutually exclusive. -

    - -

    Did I say finally? Ok, so finally there are some helper flags, or better groups of flags. -These are: -

      -
    • parse_minimal - no flag is set (this also means the fastest parsing)
    • -
    • parse_default - default set of flags
    • -
    • parse_noset - use the current parser options (see below)
    • -
    • parse_w3c - use the W3C compliance mode
    • -
    -

    - -

    A couple of words on flag usage. The parsing options are just a set of bits, with each bit corresponding -to one flag. You can turn the flag on by OR-ing the options value with this flag's constant: -

    -	parse_w3c | parse_wnorm_attribute
    -
    -or turn the flag off by AND-ing the options value with the NEGation of this flag's constant: -
    -	parse_w3c & ~parse_comments
    -
    -You can access the current options of parser by options() method: -
    
    -        unsigned int options() const;
    -        unsigned int options(unsigned int optmsk);
    -
    -(the latter one returns previous options). These options are used when parse_noset flag set is -passed to parse() functions (which is the default value of corresponding parameter). -

    - -
    -

    xml_node class

    - -

    If xml_parser is a heart of constructing a DOM tree from file, xml_node is a heart -of processing the tree. This is a simple wrapper, so it's small (4/8 bytes, depending on the size of -pointer), you're free to copy it and it does not own anything. I'll continue with a list of methods -with their description, with one note in advance. Some functions, that do something according to a -string-like parameter, have a pair with a suffix _w. The _w suffix tells, that this -function is doing a wildcard matching, instead of simple string comparison. You're free to use wildcards -* (that is equal to any sequence of characters (possibly empty)), ? (that is equal to -any character) and character sets ([Abc] means 'any symbol of A, b and c', [A-Z4] means -'any symbol from A to Z, or 4', [!0-9] means 'any symbol, that is not a digit'). So the wildcard -?ell_[0-9][0-9]_* will match strings like 'cell_23_xref', 'hell_00_', but will not match the -strings like 'ell_23_xref', 'cell_0_x' or 'cell_0a_x'.

    - -
    
    -        /// Access iterators for this node's collection of child nodes.
    -        iterator begin() const;
    -        iterator end() const;
    -        
    -        /// Access iterators for this node's collection of child nodes (same as begin/end).
    -        iterator children_begin() const;
    -        iterator children_end() const;
    -    
    -        /// Access iterators for this node's collection of attributes.
    -        attribute_iterator attributes_begin() const;
    -        attribute_iterator attributes_end() const;
    -
    -        /// Access iterators for this node's collection of siblings.
    -        iterator siblings_begin() const;
    -        iterator siblings_end() const;
    -
    - -

    Functions, returning the iterators to walk through children/siblings/attributes. More on that in -Iterators section.

    - -
    
    -        operator unspecified_bool_type() const;
    -
    - -

    This is a safe bool-like conversion operator. You can check node's validity (if (xml_node), - if (!xml_node), if (node1 && node2 && !node3 && cond1 && ...) - you get the idea) with -it. -

    - -
    
    -        bool operator==(const xml_node& r) const;
    -        bool operator!=(const xml_node& r) const;
    -        bool operator<(const xml_node& r) const;
    -        bool operator>(const xml_node& r) const;
    -        bool operator<=(const xml_node& r) const;
    -        bool operator>=(const xml_node& r) const;
    -
    - -

    Comparison operators

    - -
    
    -        bool empty() const;
    -
    - -

    if (node.empty()) is equivalent to if (!node)

    - -
    
    -        xml_node_type type() const;
    -        const char* name() const;
    -        const char* value() const;
    -
    - -

    Access node's properties (type, name and value). If there is no name/value, the corresponding functions -return "" - they never return NULL.

    - -
    
    -        xml_node child(const char* name) const;
    -        xml_node child_w(const char* name) const;
    -
    - -

    Get a child node with specified name, or xml_node() (this is an invalid node) if nothing is -found

    - -
    
    -        xml_attribute attribute(const char* name) const;
    -        xml_attribute attribute_w(const char* name) const;
    -
    - -

    Get an attribute with specified name, or xml_attribute() (this is an invalid attribute) if -nothing is found

    - -
    
    -        xml_node sibling(const char* name) const;
    -        xml_node sibling_w(const char* name) const;
    -
    - -

    Get a node's sibling with specified name, or xml_node() if nothing is found.
    -node.sibling(name) is equivalent to node.parent().child(name).

    - -
    
    -        xml_node next_sibling(const char* name) const;
    -        xml_node next_sibling_w(const char* name) const;
    -        xml_node next_sibling() const;
    -
    - -

    These functions get the next sibling, that is, one of the siblings of that node, that is to the -right. next_sibling() just returns the right brother of the node (or xml_node()), -the two other functions are searching for the sibling with the given name

    - -
    
    -        xml_node previous_sibling(const char* name) const;
    -        xml_node previous_sibling_w(const char* name) const;
    -        xml_node previous_sibling() const;
    -
    - -

    These functions do exactly the same as next_sibling ones, with the exception that they -search for the left siblings.

    - -
    
    -        xml_node parent() const;
    -
    - -

    Get a parent node. The parent node for the root one (the document) is considered to be the document -itself.

    - -
    
    -        const char* child_value() const;
    -
    - -

    Look for the first node of type node_pcdata or node_cdata among the -children of the current node and return its contents (or "" if nothing is found)

    - -
    
    -    const char* child_value(const char* name) const;
    -
    +#include <fstream>
    +#include <iostream>
    -

    This is the convenient way of looking into child's child value - that is, node.child_value(name) is equivalent to node.child(name).child_value().

    +#include "pugixml.hpp"
    -
    
    -    const char* child_value_w(const char* name) const;
    -
    +using namespace std;
    +using namespace pugi;
    -

    This is the convenient way of looking into child's child value - that is, node.child_value_w(name) is equivalent to node.child_w(name).child_value().

    +int main()
    +{
    +    // Several ways to get XML document
    -
    
    -        xml_attribute first_attribute() const;
    -        xml_attribute last_attribute() const;
    -
    +    {
    +        // Load from string
    +        xml_document doc;
    -

    These functions get the first and last attributes of the node (or xml_attribute() if the node -has no attributes).

    +        cout << doc.load("<sample-xml>some text <b>in bold</b> here</sample-xml>") << endl;
    +    }
    -
    
    -        xml_node first_child() const;
    -        xml_node last_child() const;
    -
    +    {
    +        // Load from file
    +        xml_document doc;
    -

    These functions get the first and last children of the node (or xml_node() if the node has -no children).

    +        cout << doc.load_file("sample.xml") << endl;
    +    }
    -
    
    -        template <typename OutputIterator> void all_elements_by_name(const char* name, OutputIterator it) const;
    -        template <typename OutputIterator> void all_elements_by_name_w(const char* name, OutputIterator it) const;
    -
    +    {
    +        // Load from any input stream (STL)
    +        xml_document doc;
    -

    Get all elements with the specified name in the subtree (depth-first search) and return them with -the help of output iterator (i.e. std::back_inserter)

    +        std::ifstream in("sample.xml");
    +        cout << doc.load(in) << endl;
    +    }
    -
    
    -        template <typename Predicate> xml_attribute find_attribute(Predicate pred) const;
    -        template <typename Predicate> xml_node find_child(Predicate pred) const;
    -        template <typename Predicate> xml_node find_element(Predicate pred) const;
    -
    +    {
    +        // More advanced: parse the specified string without duplicating it
    +        xml_document doc;
    -

    Find attribute, child or a node in the subtree (find_element - depth-first search) with the help -of the given predicate. Predicate should behave like a function which accepts a xml_node or -xml_attribute (for find_attribute) parameter and returns bool. The first entity for which -the predicate returned true is returned. If predicate returned false for all entities, xml_node() -or xml_attribute() is returned.

    +        char* s = new char[100];
    +        strcpy(s, "<sample-xml>some text <b>in bold</b> here</sample-xml>");
    +        cout << doc.parse(transfer_ownership_tag(), s) << endl;
    +    }
    -
    
    -        xml_node first_element(const char* name) const;
    -        xml_node first_element_w(const char* name) const;
    +    {
    +        // Even more advanced: assume manual lifetime control
    +        xml_document doc;
     
    -        xml_node first_element_by_value(const char* name, const char* value) const;
    -        xml_node first_element_by_value_w(const char* name, const char* value) const;
    +        char* s = new char[100];
    +        strcpy(s, "<sample-xml>some text <b>in bold</b> here</sample-xml>");
    +        cout << doc.parse(s) << endl;
     
    -        xml_node first_element_by_attribute(const char* name, const char* attr_name, const char* attr_value) const;
    -        xml_node first_element_by_attribute_w(const char* name, const char* attr_name, const char* attr_value) const;
    +        delete[] s; // <-- after this point, all string contents of document is invalid!
    +    }
     
    -        xml_node first_element_by_attribute(const char* attr_name, const char* attr_value) const;
    -        xml_node first_element_by_attribute_w(const char* attr_name, const char* attr_value) const;
    -
    +    {
    +        // Or just create document from code?
    +        xml_document doc;
    -

    Find the first node (depth-first search), which corresponds to the given criteria (i.e. either has -a matching name, or a matching value, or has an attribute with given name/value, or has an attribute -and has a matching name). Note that _w versions treat all parameters as wildcards.

    +        // add nodes to document (see next samples)
    +    }
    +}
    +_Winnie C++ Colorizer
    -
    
    -        xml_node first_node(xml_node_type type) const;
    -
    +

    This sample should print a row of 1, meaning that all load/parse functions returned true (of course, if sample.xml does not exist or is malformed, there will be 0's)

    -

    Return a first node (depth-first search) with a given type, or xml_node().

    +

    Once you have your document, there are several ways to extract data from it.

    
    -        std::string path(char delimiter = '/') const;
    -
    +#include <iostream>
    -

    Get a path of the node (i.e. the string of names of the nodes on the path from the DOM tree root -to the node, separated with delimiter (/ by default).

    +#include "pugixml.hpp"
    -
    
    -        xml_node first_element_by_path(const char* path, char delimiter = '/') const;
    -
    +using namespace std;
    +using namespace pugi;
    -

    Get the first element that has the following path. The path can be absolute (beginning with delimiter) or -relative, '..' means 'up-level' (so if we are at the path mesh/fragment/geometry/stream, ../.. -will lead us to mesh/fragment, and /mesh will lead us to mesh).

    +struct bookstore_traverser: public xml_tree_walker
    +{
    +    virtual bool for_each(xml_node& n)
    +    {
    +        for (int i = 0; i < depth(); ++i) cout << " "; // indentation
    -
    
    -        bool traverse(xml_tree_walker& walker) const;
    -
    +        if (n.type() == node_element) cout << n.name() << endl;
    +        else cout << n.value() << endl;
    -

    Traverse the subtree (beginning with current node) with the walker, return the result. See -Miscellaneous section for details.

    +        return true; // continue traversal
    +    }
    +};
    -
    -

    xml_attribute class

    +int main()
    +{
    +    xml_document doc;
    +    doc.load("<bookstore><book title='ShaderX'><price>3</price></book><book title='GPU Gems'><price>4</price></book></bookstore>");
    -

    Like xml_node, xml_attribute is a simple wrapper of the node's attribute.

    +    // If you want to iterate through nodes...
    -
    
    -        bool operator==(const xml_attribute& r) const;
    -        bool operator!=(const xml_attribute& r) const;
    -        bool operator<(const xml_attribute& r) const;
    -        bool operator>(const xml_attribute& r) const;
    -        bool operator<=(const xml_attribute& r) const;
    -        bool operator>=(const xml_attribute& r) const;
    -
    +    {
    +        // Get a bookstore node
    +        xml_node bookstore = doc.child("bookstore");
    -

    Comparison operators.

    +        // Iterate through books
    +        for (xml_node book = bookstore.child("book"); book; book = book.next_sibling("book"))
    +        {
    +            cout << "Book " << book.attribute("title").value() << ", price " << book.child("price").first_child().value() << endl;
    +        }
    -
    
    -        operator unspecified_bool_type() const;
    -
    +        // Output:
    +        // Book ShaderX, price 3
    +        // Book GPU Gems, price 4
    +    }
    -

    Safe bool conversion - like in xml_node, use this to check for validity.

    +    {
    +        // Alternative way to get a bookstore node (wildcards)
    +        xml_node bookstore = doc.child_w("*[sS]tore"); // this will select bookstore, anyStore, Store, etc.
    -
    
    -        bool empty() const;
    -
    +        // Iterate through books with STL compatible iterators
    +        for (xml_node::iterator it = bookstore.begin(); it != bookstore.end(); ++it)
    +        {
    +            // Note the use of helper function child_value()
    +            cout << "Book " << it->attribute("title").value() << ", price " << it->child_value("price") << endl;
    +        }
    +
    +        // Output:
    +        // Book ShaderX, price 3
    +        // Book GPU Gems, price 4
    +    }
    -

    Like with xml_node, if (attr.empty()) is equivalent to if (!attr). -

    +    {
    +        // You can also traverse the whole tree (or a subtree)
    +        bookstore_traverser t;
    -
    
    -        xml_attribute next_attribute() const;
    -        xml_attribute previous_attribute() const;
    -
    +        doc.traverse(t);
    +
    +        // Output:
    +        // bookstore
    +        // book
    +        // price
    +        // 3
    +        // book
    +        // price
    +        // 4
    +
    +        doc.first_child().traverse(t);
    +
    +        // Output:
    +        // book
    +        // price
    +        // 3
    +        // book
    +        // price
    +        // 4
    +    }
    -

    Get the next/previous attribute of the node, that owns the current attribute. Return xml_attribute() -if no such attribute is found.

    +    // If you want a distinct node...
    -
    
    -        const char* name() const;
    -        const char* value() const;
    -
    +    {
    +        // You can specify the way to it through child() functions
    +        cout << doc.child("bookstore").child("book").next_sibling().attribute("title").value() << endl;
    -

    Get the name and value of the attribute. These methods never return NULL - they return "" instead.

    +        // Output:
    +        // GPU Gems
    +
    +        // You can use a sometimes convenient path function
    +        cout << doc.first_element_by_path("bookstore/book/price").child_value() << endl;
    +
    +        // Output:
    +        // 3
    -
    
    -        int as_int() const;
    -        double as_double() const;
    -        float as_float() const;
    -
    +        // And you can use powerful XPath expressions
    +        cout << doc.select_single_node("/bookstore/book[@title = 'ShaderX']/price").node().child_value() << endl;
    +
    +        // Output:
    +        // 3
    -

    Convert the value of an attribute to the desired type. If the conversion is not successfull, return -default value (0 for int, 0.0 for double, 0.0f for float). These functions rely on CRT functions ato*.

    +        // Of course, XPath is much more powerful
    -
    
    -        bool as_bool() const;
    -
    +        // Compile query that prints total price of all Gems book in store
    +        xpath_query query("sum(/bookstore/book[contains(@title, 'Gems')]/price)");
    -

    Convert the value of an attribute to bool. This method returns true if the first character of the -value is '1', 't', 'T', 'y' or 'Y'. Otherwise it returns false.

    +        cout << query.evaluate_number(doc) << endl;
    -
    -

    Iterators

    +        // Output:
    +        // 4
    -

    Sometimes you have to cycle through the children or the attributes of the node. You can do it either -by using next_sibling, previous_sibling, next_attribute and previous_attribute -(along with first_child, last_child, first_attribute and last_attribute), -or you can use an iterator-like interface. There are two iterator types, xml_node_iterator and -xml_attribute_iterator. They are bidirectional constant iterators, which means that you can -either increment or decrement them, and use dereferencing and member access operators to get constant -access to node/attribute (the constness of iterators may change with the introducing of mutable trees).

    +        // You can apply the same XPath query to any document. For example, let's add another Gems
    +        // book (more detail about modifying tree in next sample):
    +        xml_node book = doc.child("bookstore").append_child();
    +        book.set_name("book");
    +        book.append_attribute("title") = "Game Programming Gems 2";
    +
    +        xml_node price = book.append_child();
    +        price.set_name("price");
    -

    In order to get the iterators, use corresponding functions of xml_node. Note that _end() -functions return past-the-end iterator, that is, in order to get the last attribute, you'll have to -do something like:
    +        xml_node price_text = price.append_child(node_pcdata);
    +        price_text.set_value("5.3");
    +
    +        // Now let's reevaluate query
    +        cout << query.evaluate_number(doc) << endl;
    -
    
    -    if (node.attributes_begin() != node.attributes_end()) // we have at least one attribute
    -    {
    -        xml_attribute last_attrib = *(--node.attributes_end());
    -        ...
    +        // Output:
    +        // 9.3
         }
    -
    -

    - -
    -

    Miscellaneous

    - -

    If you want to traverse a subtree, you can use traverse function. There is a class -xml_tree_walker, which has some functions that you can override in order to get custom traversing -(the default one just does nothing). - -
    
    -        virtual bool begin(const xml_node&);
    -        virtual bool end(const xml_node&);
    -
    - -

    These functions are called when the processing of the node starts/ends. First begin() -is called, then all children of the node are processed recursively, then end() is called. If -any of these functions returns false, the traversing is stopped and the traverse() function -returns false.

    - -
    
    -        virtual void push();
    -        virtual void pop();
    -
    +} +
    _Winnie C++ Colorizer -

    These functions are called before and after the processing of node's children. If node has no children, -none of these is called. The default behavior is to increment/decrement current node depth.

    +

    Finally, let's get into more details about tree modification and saving.

    
    -        virtual int depth() const;
    -
    - -

    Get the current depth. You can use this function to do your own indentation, for example.

    - -

    Lets get to some minor notes. You can safely write something like:
    +#include <iostream>
    -
    
    -        bool value = node.child("stream").attribute("compress").as_bool();
    -
    - -If node has a child with the name 'geometry', and this child has an attribute 'compress', than everything -is ok. If node has a child with the name 'geometry' with no attribute 'compress', then attribute("compress") -will return xml_attribute(), and the corresponding call to as_bool() will return default value (false). -If there is no child node 'geometry', the child(...) call will return xml_node(), the subsequent call -to attribute(...) will return xml_attribute() (because there are no attributes belonging to invalid -node), and as_bool() will again return false, so this call sequence is perfectly safe.

    - - -

    Lifetime issues and memory management

    - -

    As parsing is done in-situ, the XML data is to persist during the lifetime of xml_parser. If -the parsing is called via a function of xml_parser, that accepts char*, you have to ensure -yourself, that the string will outlive the xml_parser object.

    - -

    The memory for nodes and attributes is allocated in blocks of data (the blocks form a linked list; -the default size of the block is 32 kb, though you can change it via changing a memory_block_size -constant in pugixml.hpp file. Remember that the first block is allocated on stack (it resides -inside xml_parser object), and all subsequent blocks are allocated on heap, so expect a stack overflow -when setting too large memory block size), so the xml_parser object (which contains the blocks) -should outlive all xml_node and xml_attribute objects (as well as iterators), which belong -to the parser's tree. Again, you should ensure it yourself.

    +#include "pugixml.hpp" -
    +using namespace std; +using namespace pugi; -
    -

    Example

    +int main() +{ + // For this example, we'll start with an empty document and create nodes in it from code + xml_document doc; -

    Ok, so you are not much of documentation reader, are you? So am I. Let's assume that you're going -to parse an xml file... something like this: + // Append several children and set values/names at once + doc.append_child(node_comment).set_value("This is a test comment"); + doc.append_child().set_name("application"); -

    -<?xml version="1.0" encoding="UTF-8"?>
    -<mesh name="Cathedral">
    -    <fragment name="Cathedral">    
    -        <geometry>
    -            <stream usage="main" source="StAnna.dmesh" compress="true" />
    -            <stream usage="ao" source="StAnna.ao" />
    -        </geometry>
    -    </fragment>
    -    <fragment name="Cathedral">    
    -    	...
    -    </fragment>
    -	...
    -</mesh>
    -
    + // Let's add a few modules + xml_node application = doc.child("application"); -

    <mesh> is a root node, it has 0 or more <fragment>s, each of them has a <geometry> -node, and there are <stream> nodes with the shown attributes. We'd like to parse the file and... -well, and do something with it's contents. There are several methods of doing that; I'll show 2 of them -(the remaining one is using iterators).

        + // Save node wrapper for convenience + xml_node module_a = application.append_child(); + module_a.set_name("module"); + + // Add an attribute, immediately setting its value + module_a.append_attribute("name").set_value("A"); -
    

    Here we exploit the knowledge of the strict hierarchy of our XML document and read the nodes from -DOM tree accordingly. When we have an xml_node object, we can get the desired information from -it (name, value, attributes list, nearby nodes in a tree - siblings, parent and children).

    + // You can use operator= + module_a.append_attribute("folder") = "/work/app/module_a"; -
    
    -#include <fstream>
    -#include <vector>
    -#include <algorithm>
    -#include <iterator>
    +    // Or even assign numbers
    +    module_a.append_attribute("status") = 85.4;
     
    -#include "pugixml.hpp"
    +    // Let's add another module
    +    xml_node module_c = application.append_child();
    +    module_c.set_name("module");
    +    module_c.append_attribute("name") = "C";
    +    module_c.append_attribute("folder") = "/work/app/module_c";
     
    -using namespace pugi;
    +    // Oh, we missed module B. Not a problem, let's insert it before module C
    +    xml_node module_b = application.insert_child_before(node_element, module_c);
    +    module_b.set_name("module");
    +    module_b.append_attribute("folder") = "/work/app/module_b";
     
    -int main()
    -{
    -    std::ifstream in("mesh.xml");
    -    in.unsetf(std::ios::skipws);
    -                
    -    std::vector<char> buf;
    -    std::copy(std::istream_iterator<char>(in), std::istream_iterator<char>(), std::back_inserter(buf));
    -    buf.push_back(0); // zero-terminate
    +    // We can do the same thing for attributes
    +    module_b.insert_attribute_before("name", module_b.attribute("folder")) = "B";
         
    -    xml_parser parser(&buf[0], pugi::parse_w3c);
    +    // Let's add some text in module A
    +    module_a.append_child(node_pcdata).set_value("Module A description");
     
    -    xml_node doc = parser.document();
    -        
    -    if (xml_node mesh = doc.first_element("mesh"))
    -    {
    -        // store mesh.attribute("name").value()
    +    // Well, there's not much left to do here. Let's output our document to file using several formatting options
     
    -        for (xml_node fragment = mesh.first_element("fragment"); fragment; fragment = fragment.next_sibling())
    -        {
    -            // store fragment.attribute("name").value()
    +    doc.save_file("sample_saved_1.xml");
         
    -            if (xml_node geometry = fragment.first_element("geometry"))
    -                for (xml_node stream = geometry.first_element("stream"); stream; stream = stream.next_sibling())
    -                {
    -                    // store stream.attribute("usage").value()
    -                    // store stream.attribute("source").value()
    -                    
    -                    if (stream.attribute("compress"))
    -                        // store stream.attribute("compress").as_bool()
    +    // Contents of file sample_saved_1.xml (tab size = 4):
    +    // <?xml version="1.0"?>
    +    // <!--This is a test comment-->
    +    // <application>
    +    //     <module name="A" folder="/work/app/module_a" status="85.4">Module A description</module>
    +    //     <module name="B" folder="/work/app/module_b" />
    +    //     <module name="C" folder="/work/app/module_c" />
    +    // </application>
    +
    +    // Let's use two spaces for indentation instead of tab character
    +    doc.save_file("sample_saved_2.xml", "  ");
    +
    +    // Contents of file sample_saved_2.xml:
    +    // <?xml version="1.0"?>
    +    // <!--This is a test comment-->
    +    // <application>
    +    //   <module name="A" folder="/work/app/module_a" status="85.4">Module A description</module>
    +    //   <module name="B" folder="/work/app/module_b" />
    +    //   <module name="C" folder="/work/app/module_c" />
    +    // </application>
         
    -                }
    -        }
    -    }
    -}
    -
    + // Let's save a raw XML file + doc.save_file("sample_saved_3.xml", "", format_raw); + + // Contents of file sample_saved_3.xml: + // <?xml version="1.0"?><!--This is a test comment--><application><module name="A" folder="/work/app/module_a" status="85.4">Module A description</module><module name="B" folder="/work/app/module_b" /><module name="C" folder="/work/app/module_c" /></application> -

    We can also write a class that will traverse the DOM tree and store the information from nodes based -on their names, depths, attributes, etc. This way is well known by the users of SAX parsers. To do that, -we have to write an implementation of xml_tree_walker interface

    + // Finally, you can print a subtree to any output stream (including cout) + doc.child("application").child("module").print(cout); -
                       
    -#include <fstream>
    -#include <vector>
    -#include <algorithm>
    -#include <iterator>
    +    // Output:
    +    // <module name="A" folder="/work/app/module_a" status="85.4">Module A description</module>
    +}
    +
    _Winnie C++ Colorizer
    -#include "pugixml.hpp" +

        Note that these examples do not cover the whole pugixml API. For further information, see the reference section.
    

    -using namespace pugi; +
    -struct mesh_parser: public xml_tree_walker -{ - virtual bool begin(const xml_node& node) - { - if (strcmp(node.name(), "mesh") == 0) - { - // store node.attribute("name").value() - } - else if (strcmp(node.name(), "fragment") == 0) - { - // store node.attribute("name").value() - } - else if (strcmp(node.name(), "geometry") == 0) - { - // ... - } - else if (strcmp(node.name(), "stream") == 0) - { - // store node.attribute("usage").value() - // store node.attribute("source").value() - - if (node.attribute("compress")) - // store stream.attribute("compress").as_bool() - } - else return false; +
    +

    Reference

    - return true; - } -}; +

        pugixml is a library for parsing XML files: you give it XML data in some way, +and it gives you a DOM tree along with ways to traverse it and extract useful information from it. +The library source consists of two headers, pugixml.hpp and pugiconfig.hpp, and two source +files, pugixml.cpp and pugixpath.cpp. +You can either compile the cpp files in your project or build a static library. +All library classes reside in namespace pugi, so you can either use fully qualified +names (pugi::xml_node) or write a using directive or declaration (using namespace pugi;, using +pugi::xml_node) and use plain names. All classes have either an xml_ or xpath_ prefix.
    

    -int main() -{ - std::ifstream in("mesh.xml"); - in.unsetf(std::ios::skipws); - - std::vector<char> buf; - std::copy(std::istream_iterator<char>(in), std::istream_iterator<char>(), std::back_inserter(buf)); - buf.push_back(0); // zero-terminate - - xml_parser parser(&buf[0], pugi::parse_w3c); +

        By default, you are expected to compile the source files with your project (add them to your +project, or add the relevant entries to your Makefile, or do whatever your build +environment requires). The library is written in standard-conforming C++ and was tested on the following platforms:
    

    - mesh_parser mp; +

    +

    +

    - if (!parser.document().traverse(mp)) - // generate an error -} - +

    The documentation for pugixml classes, functions and constants is available here.


    - -

    Parsing process

    - -

    So, let's talk a bit about parsing process, and about the reason for providing XML data as a contiguous -writeable block of memory. Parsing is done in-situ. This means, that the strings, representing the -parts of DOM tree (node names, attribute names and values, CDATA content, etc.) are not separately -allocated on heap, but instead are parts of the original data. This is the keypoint to parsing speed, -because it helps achieve the minimal amount of memory allocations (more on that below) and minimal -amount of copying data.

    - -

    In-situ parsing can be done in two ways, with zero-segmenting the string (that is, set the past-the-end -character for the part of XML string to 0, see -this image for further details), and storing pointer + size of the string instead of pointer to -the beginning of ASCIIZ string.

    - -

    Originally, pugxml had only the first way, but then authors added the second method, 'non-segmenting' -or non-destructive parsing. The advantages of this method are: you no longer need non-constant storage; -you can even read data from memory-mapped files directly. Well, there are disadvantages. -For one thing, you can not do any of the transformations in-situ. The transformations that are required -by XML standard are: -

    - -None of these can be done in-situ. pugxml did neither character nor entity reference expansion, -and allocated new memory when normalizing white spaces when in non-destructive mode. I chose complete -in-situ parsing (the good thing about it is that any transformation, except entity reference, can be -done in-situ because it does not increase the amount of characters - even converting a character -reference to UTF-8). There is no entity reference expansion because of this and because I do not want -to parse DOCTYPE and, moreover, use DOCTYPE in following parsing (performing selective whitespace -normalization in attributes and CDATA sections and so on).

    - -

    In order to be able to modify the tree (change attribute/node names & values) with in-situ parsing, -one needs to implement two ways of storing data (both in-situ and not). The DOM tree is now mutable, -but it will change in the future releases (without introducing speed/memory overhead, except on clean- -up stage).

    - -

    The parsing process itself is more or less straightforward, when you see it - but the impression -is fake, because the explicit jumps are made (i.e. we know, that if we come to a closing brace (>), -we should expect CDATA after it (or a new tag), so let's just jump to the corresponding code), and, -well, there can be bugs (see Bugs section).

    - -

    And, to make things worse, memory allocation (which is done only for node and attribute structures) -is done in pools. The pools are single-linked lists with predefined block size (32 kb by default), and -well, it increases speed a lot (allocations are slow, and the memory gets fragmented when allocating -a bunch of 16-byte (attribute) or 40-byte (node) structures)

    +3 MSVC is Microsoft Visual C++ Compiler
    +
    4 ICC is Intel C++ Compiler
    +
    5 BCC is Borland C++ Compiler
    @@ -913,20 +444,17 @@ of ones when using parse_w3c mode): correctly, let alone use them for parsing
  • It accepts multiple attributes with the same name in one node
  • It is charset-ignorant -
  • It accepts invalid names of tags
  • It accepts invalid attribute values (those with < in them) and does not reject invalid entity references or character references (in fact, it does not do DOCTYPE parsing, so it does not perform entity reference expansion)
  • It does not reject comments with -- inside -
  • It does not reject PI with the names of 'xml' and alike; in fact, it parses prolog as a PI, which -is not conformant -
  • All characters from #x1 to #x20 are considered to be whitespaces +
  • It does not reject PI with the names of 'xml' and alike
  • And some other things that I forgot to mention -In short, it accepts most malformed XML files and does not do anything that is related to DOCTYPE. -This is because the main goal was developing fast, easy-to-use and error ignorant (so you can always -get something even from a malformed document) parser, there are some good validating and conformant +In short, it accepts some malformed XML files and does not do anything that is related to DOCTYPE. +This is because the main goal was developing fast, easy-to-use and error ignorant (so you can get +something even from a malformed document) parser, there are some good validating and conformant parsers already.


    @@ -1015,9 +543,10 @@ off. The test system is AMD Sempron 2500+, 512 Mb RAM.

    Q: I do not have/want STL support. How can I compile pugixml without STL?

    A: There is an undocumented define PUGIXML_NO_STL. If you uncomment the relevant line in pugixml header file, it will compile without any STL classes. The reason it is undocumented -are that it will make some documented functions not available (specifically, xml_parser() ctor and -parse() function that operate on std::istream, xml_node::path function, utf16 and utf8 conversion -functions). Otherwise, it will work fine.

        +is that it makes some documented functions unavailable (specifically, the version of xml_document::load that +operates on std::istream, the xml_node::path function, the saving functions (xml_node::print, xml_document::save), +XPath-related functions and classes, and the as_utf16 and as_utf8 conversion functions). Otherwise, it will +work fine.
    
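        For readers unfamiliar with this kind of switch, here is a minimal sketch of how a define can remove STL-dependent functions from a header while keeping the rest usable. Everything in it is an assumption made for illustration: TOY_NO_STL, the toy namespace and both functions are hypothetical stand-ins, not pugixml code, and the real header may be organized differently.

    ```cpp
    #include <cassert>
    #include <cstring>

    // Uncomment to mimic a build without STL support.
    // TOY_NO_STL is a hypothetical macro used only in this sketch.
    // #define TOY_NO_STL

    #ifndef TOY_NO_STL
    #include <string>
    #endif

    namespace toy
    {
        // Always available: works on plain C strings only.
        inline bool name_equals(const char* lhs, const char* rhs)
        {
            return std::strcmp(lhs, rhs) == 0;
        }

    #ifndef TOY_NO_STL
        // Only declared when STL support is enabled, because it returns std::string.
        inline std::string make_path(const char* parent, const char* child)
        {
            return std::string(parent) + "/" + child;
        }
    #endif
    }

    int main()
    {
        assert(toy::name_equals("node", "node"));

    #ifndef TOY_NO_STL
        // This call would not compile if TOY_NO_STL were defined.
        assert(toy::make_path("root", "node") == "root/node");
    #endif

        return 0;
    }
    ```

        With the define enabled, any code that calls the guarded function fails to compile, which is the trade-off described in the answer above.
    
    
    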

    Q: Do paths that are accepted by first_element_by_path have to end with delimiter?

        A: Either way will work; both /path/to/node/ and /path/to/node are fine.
    
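        One way to see why the two forms can be treated as equivalent: if a path is split on the delimiter and empty components are skipped, a trailing delimiter contributes nothing. The sketch below illustrates that idea only. split_path is a hypothetical helper, not a pugixml function, and this is not necessarily how first_element_by_path is implemented.

    ```cpp
    #include <cassert>
    #include <string>
    #include <vector>

    // Hypothetical helper (not part of pugixml): split a path into its
    // non-empty components, so leading and trailing delimiters are ignored.
    std::vector<std::string> split_path(const std::string& path, char delimiter = '/')
    {
        std::vector<std::string> parts;
        std::string current;

        for (std::string::size_type i = 0; i < path.size(); ++i)
        {
            if (path[i] == delimiter)
            {
                // Close the current component; skip it if it is empty.
                if (!current.empty()) parts.push_back(current);
                current.clear();
            }
            else
            {
                current += path[i];
            }
        }

        // Keep the last component when the path does not end with a delimiter.
        if (!current.empty()) parts.push_back(current);

        return parts;
    }

    int main()
    {
        // Both spellings yield the same three components: path, to, node.
        assert(split_path("/path/to/node/") == split_path("/path/to/node"));
        assert(split_path("/path/to/node").size() == 3);

        return 0;
    }
    ```
    
    
    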

    @@ -1048,16 +577,10 @@ do not send executable files.

    upper ones will get there sooner).

      -
    • Support for altering the tree (both changing nodes'/attributes' names and values and adding/deleting -attributes/nodes) and writing the tree to stream
    • Support for UTF-16 files (parsing BOM to get file's type and converting UTF-16 file to UTF-8 buffer if necessary) -
    • Improved API (I'm going to look at SelectNode from MS XML and perhaps there will be some other -changes) -
    • Externally provided entity reference table (or perhaps even taken from DOCTYPE?)
    • More intelligent parsing of DOCTYPE (it does not always skip DOCTYPE for now)
    • XML 1.1 changes (changed EOL handling, normalization issues, etc.) -
    • XPath support
    • Name your own?
    @@ -1079,6 +602,28 @@ changes)
  • Optimizations of strconv_t +
    21.02.2007 - v0.3 +
    Refactored, reworked and improved version. Changes:
      +
    • Interface:
        +
      • Added XPath +
      • Added tree modification functions +
      • Added no STL compilation mode +
      • Added saving document to file +
      • Refactored parsing flags +
      • Removed xml_parser class in favor of xml_document +
      • Added transfer ownership parsing mode +
      • Modified the way xml_tree_walker works +
      • Iterators are now non-constant +
      +
    • Implementation:
        +
      • Support of several compilers and platforms +
      • Refactored and sped up parsing core +
      • Improved standard compliancy +
      • Added XPath implementation +
      • Fixed several bugs +
      +
    +

  • @@ -1099,7 +644,7 @@ changes)

    The pugixml parser is distributed under the MIT license:

    -Copyright (c) 2006 Arseny Kapoulkine
    +Copyright (c) 2006-2007 Arseny Kapoulkine
     
     Permission is hereby granted, free of charge, to any person
     obtaining a copy of this software and associated documentation
    @@ -1125,7 +670,7 @@ OTHER DEALINGS IN THE SOFTWARE.
     
     
    -

    Revised 8 December, 2006

    -

    © Copyright Arseny Kapoulkine 2006. All Rights Reserved.

    +

    Revised 21 February, 2007

    +

    © Copyright Arseny Kapoulkine 2006-2007. All Rights Reserved.

    