Loading document

Loading document from file
Loading document from memory
Loading document from C++ IOstreams
Handling parsing errors
Parsing options
Encodings
Conformance to W3C specification

pugixml provides several functions for loading XML data from various sources: files, C++ iostreams, and memory buffers. All functions use an extremely fast non-validating parser. This parser is not fully W3C conformant: it can load any valid XML document, but does not perform some well-formedness checks. While considerable effort is made to reject invalid XML documents, some validation is skipped for performance reasons. Some XML transformations (e.g. EOL handling or attribute value normalization) can also affect parsing speed and can therefore be disabled. However, for the vast majority of XML documents there is no performance difference between parsing options. Parsing options also control whether certain XML nodes are parsed; see Parsing options for more information.

XML data is always converted to the internal character format (see Unicode interface) before parsing. pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) and handles all encoding conversions automatically. Unless an explicit encoding is specified, loading functions detect the encoding automatically from the first few characters of the XML data, so in almost all cases you do not have to specify the document encoding. Encoding conversion is described in more detail in Encodings.

Loading document from file

The most common source of XML data is files; pugixml provides dedicated functions for loading an XML document from a file:

xml_parse_result xml_document::load_file(const char* path, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load_file(const wchar_t* path, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);

These functions accept the file path as their first argument, plus two optional arguments that specify parsing options (see Parsing options) and input data encoding (see Encodings). The path is in the format of the target operating system: it can be relative or absolute, it should use the delimiters of the target system, it should match the exact case if the target file system is case-sensitive, and so on.

The first function (which accepts const char* path) passes the file path to the system file opening function as is; the second function either uses a special file opening function if the runtime library provides one, or converts the path to UTF-8 and uses the system file opening function.

load_file destroys the existing document tree and then tries to load the new tree from the specified file. The result of the operation is returned in an xml_parse_result object; this object contains the operation status and related information (e.g. the last successfully parsed position in the input file, if parsing fails). See Handling parsing errors for error handling details.

This is an example of loading an XML document from a file (samples/load_file.cpp):

pugi::xml_document doc;

pugi::xml_parse_result result = doc.load_file("tree.xml");

std::cout << "Load result: " << result.description() << ", mesh name: " << doc.child("mesh").attribute("name").value() << std::endl;

Loading document from memory

Sometimes XML data has to be loaded from a source other than a file, e.g. an HTTP URL; you may also want to load XML data from a file using non-standard functions, e.g. to use your virtual file system facilities or to load XML from gzip-compressed files. All these scenarios require loading the document from memory. First prepare a contiguous memory block with all the XML data; then invoke one of the buffer loading functions. These functions handle the necessary encoding conversions, if any, and then parse the data into the corresponding XML tree. There are several buffer loading functions, which differ in behavior and thus in performance and memory usage:

xml_parse_result xml_document::load_buffer(const void* contents, size_t size, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load_buffer_inplace(void* contents, size_t size, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load_buffer_inplace_own(void* contents, size_t size, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);

All functions accept the buffer, which is represented by a pointer to the XML data, contents, and the data size in bytes. There are also two optional arguments, which specify parsing options (see Parsing options) and input data encoding (see Encodings). The buffer does not have to be zero-terminated.
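As a rough illustration of the "prepare a contiguous memory block" step, the sketch below reads a whole file into a std::vector<char> using only the standard library. The helper name read_file is hypothetical and not part of pugixml; a custom data source (virtual file system, decompressor, network client) would fill the vector in its own way.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical helper (not a pugixml function): reads the whole file at
// 'path' into 'out' as raw bytes. Returns false if the file cannot be
// opened or read.
bool read_file(const char* path, std::vector<char>& out)
{
    // Open in binary mode so no newline translation changes the byte count;
    // std::ios::ate positions the stream at the end so tellg() gives the size.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) return false;

    std::streamsize size = file.tellg();
    if (size < 0) return false;

    out.resize(static_cast<std::size_t>(size));

    // Rewind and read everything into the contiguous vector storage.
    file.seekg(0, std::ios::beg);
    if (size > 0 && !file.read(out.data(), size)) return false;

    return true;
}
```

The resulting block is described by out.data() and out.size(), which correspond to the contents and size arguments of the buffer loading functions above. No terminating zero is appended, which is consistent with the statement that the buffer does not have to be zero-terminated.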

The load_buffer function works with an immutable buffer: it never modifies the buffer. Because of this restriction it has to create a private buffer and copy the XML data to it before parsing (applying encoding conversions if necessary). This copy operation carries a performance penalty, so in-place functions are provided: load_buffer_inplace and load_buffer_inplace_own store the document data in the buffer, modifying it in the process. For the document to stay valid, you have to make sure that the buffer's lifetime exceeds that of the tree if you're using the in-place functions. In addition, load_buffer_inplace does not assume ownership of the buffer, so you'll have to destroy it yourself; load_buffer_inplace_own assumes ownership of the buffer and destroys it once it is no longer needed. This means that if you're using load_buffer_inplace_own, you have to allocate the memory with the pugixml allocation function (you can get it via get_memory_allocation_function).

From the performance and memory point of view, the best way to load a document is load_buffer_inplace_own; this function has maximum control over the buffer with XML data, so it can avoid redundant copies and reduce peak memory usage while parsing. This is the recommended function if you have to load the document from memory and performance is critical.

There is also a simple helper function for cases when you want to load the XML document from a null-terminated character string:

xml_parse_result xml_document::load_string(const char_t* contents, unsigned int options = parse_default);

It is equivalent to calling load_buffer with size being either strlen(contents) or wcslen(contents) * sizeof(wchar_t), depending on the character type. This function assumes native encoding for the input data, so it does not do any encoding conversion. In general, this function is fine for loading small documents from string literals, but it has more overhead and less functionality than the buffer loading functions.
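The size computation mentioned above can be written out as a pair of overloads. This is only a sketch of the arithmetic; the function name string_size_in_bytes is hypothetical and is not something pugixml exposes.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical helpers: size in bytes of a null-terminated string,
// excluding the terminator, for narrow and wide character types.
std::size_t string_size_in_bytes(const char* contents)
{
    return std::strlen(contents);
}

std::size_t string_size_in_bytes(const wchar_t* contents)
{
    // wcslen counts wide characters, so scale by the character size.
    return std::wcslen(contents) * sizeof(wchar_t);
}
```

Note that applying sizeof to a character array such as const char source[] = "<a/>" yields one more than strlen(source), because the array includes the terminating zero.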

This is an example of loading an XML document from memory using different functions (samples/load_memory.cpp):

const char source[] = "<mesh name='sphere'><bounds>0 0 1 1</bounds></mesh>";
size_t size = sizeof(source);

// You can use load_buffer to load document from immutable memory block:
pugi::xml_parse_result result = doc.load_buffer(source, size);

// You can use load_buffer_inplace to load document from mutable memory block; the block's lifetime must exceed that of document
char* buffer = new char[size];
memcpy(buffer, source, size);

// The block can be allocated by any method; the block is modified during parsing
pugi::xml_parse_result result = doc.load_buffer_inplace(buffer, size);

// You have to destroy the block yourself after the document is no longer used
delete[] buffer;

// You can use load_buffer_inplace_own to load document from mutable memory block and to pass the ownership of this block
// The block has to be allocated via pugixml allocation function - using e.g. operator new here is incorrect
char* buffer = static_cast<char*>(pugi::get_memory_allocation_function()(size));
memcpy(buffer, source, size);

// The block will be deleted by the document
pugi::xml_parse_result result = doc.load_buffer_inplace_own(buffer, size);

// You can use load_string to load document from null-terminated strings, for example literals:
pugi::xml_parse_result result = doc.load_string("<mesh name='sphere'><bounds>0 0 1 1</bounds></mesh>");

Loading document from C++ IOstreams

To enhance interoperability, pugixml provides functions for loading a document from any object which implements the C++ std::istream interface. This allows you to load documents from any standard C++ stream (e.g. a file stream) or any third-party compliant implementation (e.g. Boost Iostreams). There are two functions: one works with narrow character streams, the other handles wide character ones:

xml_parse_result xml_document::load(std::istream& stream, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load(std::wistream& stream, unsigned int options = parse_default);

load with a std::istream argument loads the document from the stream, from the current read position to the end, treating the stream contents as a byte stream of the specified encoding (with encoding autodetection as necessary). Thus calling xml_document::load on an opened std::ifstream object is equivalent to calling xml_document::load_file.

load with a std::wistream argument treats the stream contents as a wide character stream (the encoding is always encoding_wchar). Because of this, using load with wide character streams requires careful (usually platform-specific) stream setup (e.g. using the imbue function). Use of wide streams is generally discouraged; however, it lets you load documents in non-Unicode encodings, e.g. you can load Shift-JIS encoded data if you set the correct locale.

This is a simple example of loading an XML document from a file using streams (samples/load_stream.cpp); read the sample code for more complex examples involving wide streams and locales:

std::ifstream stream("weekly-utf-8.xml");
pugi::xml_parse_result result = doc.load(stream);

Handling parsing errors

All document loading functions return the parsing result via an xml_parse_result object. It contains the parsing status, the offset of the last successfully parsed character from the beginning of the source stream, and the encoding of the source stream:

struct xml_parse_result
{
    xml_parse_status status;
    ptrdiff_t offset;
    xml_encoding encoding;

    operator bool() const;
    const char* description() const;
};

Parsing status is represented as the xml_parse_status enumeration and can be one of the following:

- status_ok means that no error was encountered during parsing; the source stream represents a valid XML document which was fully parsed and converted to a tree.

- status_file_not_found is only returned by the load_file function and means that the file could not be opened.
- status_io_error is returned by the load_file function and by the load functions with std::istream/std::wistream arguments; it means that an I/O error occurred while reading the file/stream.
- status_out_of_memory means that there was not enough memory during some allocation; any allocation failure during parsing results in this error.
- status_internal_error means that something went horribly wrong; currently this error does not occur.

- status_unrecognized_tag means that parsing stopped due to a tag with either an empty name or a name which starts with an incorrect character, such as #.
- status_bad_pi means that parsing stopped due to an incorrect document declaration/processing instruction.
- status_bad_comment, status_bad_cdata, status_bad_doctype and status_bad_pcdata mean that parsing stopped due to an invalid construct of the respective type.
- status_bad_start_element means that parsing stopped because the starting tag either had no closing > symbol or contained an incorrect symbol.
- status_bad_attribute means that parsing stopped because there was an incorrect attribute, such as an attribute without a value or with a value that is not quoted (note that <node attr=1> is incorrect in XML).
- status_bad_end_element means that parsing stopped because the ending tag had incorrect syntax (e.g. extra non-whitespace symbols between the tag name and >).
- status_end_element_mismatch means that parsing stopped because the closing tag did not match the opening one (e.g. <node></nedo>) or because some tag was not closed at all.
- status_no_document_element means that no element nodes were discovered during parsing; this usually indicates an empty or invalid document.

The description() member function can be used to convert the parsing status to a string; the returned message is always in English, so you'll have to write your own function if you need a localized string. Note, however, that the exact messages returned by description() may change from version to version, so any complex status handling should be based on the status value. Note also that description() returns a char string even in PUGIXML_WCHAR_MODE; you'll have to call as_wide to get the wchar_t string.

If parsing failed because the source data was not valid XML, the resulting tree is not destroyed: even though the load function returns an error, you can use the part of the tree that was successfully parsed. Obviously, the last element may have an unexpected name/value; for example, if the attribute value does not end with the necessary quotation mark, as in <node attr="value>some data</node>, the value of attribute attr will contain the string value>some data</node>.

In addition to the status code, the parsing result has an offset member, which contains the offset of the last successfully parsed character if parsing failed because of an error in the source data; otherwise offset is 0. For parsing efficiency reasons, pugixml does not track the current line during parsing; this offset is in units of pugi::char_t (bytes for character mode, wide characters for wide character mode). Many text editors support a 'Go To Position' feature, which you can use to locate the exact error position. Alternatively, if you're loading the document from memory, you can display the error chunk along with the error description (see the example code below).

	Caution
	Offset is calculated in the XML buffer in native encoding; if encoding conversion is performed during parsing, offset cannot be used to reliably track the error position.
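Since the parser reports only an offset, a line and column have to be computed separately when they are needed. The sketch below shows one way to do this for a narrow-character buffer that is already in the native encoding. The names text_position, locate and error_chunk are hypothetical helpers, not pugixml API, and the code assumes that lines are separated by '\n' and that columns are counted in bytes rather than in Unicode characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: 1-based line and column of an offset.
struct text_position
{
    std::size_t line;
    std::size_t column;
};

// Walks the text up to 'offset' and counts line breaks.
// Offsets past the end are clamped to the end of the text.
text_position locate(const std::string& text, std::size_t offset)
{
    if (offset > text.size()) offset = text.size();

    text_position pos = {1, 1};

    for (std::size_t i = 0; i < offset; ++i)
    {
        if (text[i] == '\n')
        {
            ++pos.line;
            pos.column = 1;
        }
        else
            ++pos.column;
    }

    return pos;
}

// Returns at most 'max_length' characters starting at the error offset,
// which is convenient for printing next to the error description.
std::string error_chunk(const std::string& text, std::size_t offset, std::size_t max_length)
{
    if (offset > text.size()) offset = text.size();

    return text.substr(offset, max_length);
}
```

The caution above still applies: if the source data went through encoding conversion, an offset reported for the converted buffer does not map back to the original bytes, so this kind of lookup is only meaningful against the text in native encoding.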

The parsing result also has an encoding member, which can be used to check that the source data encoding was guessed correctly. It is equal to the exact encoding used during parsing (i.e. with the exact endianness); see Encodings for more information.

The parsing result object can be implicitly converted to bool; if you do not want to handle parsing errors thoroughly, you can just check the return value of load functions as if it was a bool: if (doc.load_file("file.xml")) { ... } else { ... }.

This is an example of handling loading errors (samples/load_error_handling.cpp):

pugi::xml_document doc;
pugi::xml_parse_result result = doc.load_string(source);

if (result)
    std::cout << "XML [" << source << "] parsed without errors, attr value: [" << doc.child("node").attribute("attr").value() << "]\n\n";
else
{
    std::cout << "XML [" << source << "] parsed with errors, attr value: [" << doc.child("node").attribute("attr").value() << "]\n";
    std::cout << "Error description: " << result.description() << "\n";
    std::cout << "Error offset: " << result.offset << " (error at [..." << (source + result.offset) << "]\n\n";
}

Parsing options

All document loading functions accept the optional parameter options. This is a bitmask that customizes the parsing process: you can select the node types that are parsed and the transformations that are applied to the XML text. Disabling certain transformations can improve parsing performance for some documents; however, the code for all transformations is very well optimized, so the majority of documents won't get any performance benefit. As a rule of thumb, only modify parsing flags if you want to get nodes in the document that are excluded by default (e.g. declaration or comment nodes).

	Note
	Use the usual bitwise operations to manipulate the bitmask: to enable a flag, use `mask | flag`; to disable a flag, use `mask & ~flag`.
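The bitmask operations from the note can be sketched in isolation. The flag constants below are stand-ins with made-up values inside a hypothetical demo namespace, so the snippet compiles without pugixml; real code would use the pugi::parse_* constants, whose numeric values are not assumed here.

```cpp
// Stand-in flags for illustration only (hypothetical names and values).
namespace demo
{
    const unsigned int flag_comments = 1u << 0;
    const unsigned int flag_escapes  = 1u << 1;
    const unsigned int flag_eol      = 1u << 2;

    // A stand-in "default" mask with two flags enabled.
    const unsigned int mask_default = flag_escapes | flag_eol;

    // Enable a flag: bitwise OR sets the bit and leaves the others alone.
    inline unsigned int enable(unsigned int mask, unsigned int flag)
    {
        return mask | flag;
    }

    // Disable a flag: AND with the complement clears only that bit.
    inline unsigned int disable(unsigned int mask, unsigned int flag)
    {
        return mask & ~flag;
    }

    // Query a flag: a non-zero AND means the bit is set.
    inline bool is_set(unsigned int mask, unsigned int flag)
    {
        return (mask & flag) != 0;
    }
}
```

Note the operator precedence when combining both operations in one expression: to add one flag and remove another, parenthesize the OR first, as in (mask | added) & ~removed.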

These flags control the resulting tree contents:

- parse_declaration determines if the XML document declaration (node with type node_declaration) is to be put in the DOM tree. If this flag is off, it is not put in the tree, but is still parsed and checked for correctness. This flag is off by default.
- parse_doctype determines if the XML document type declaration (node with type node_doctype) is to be put in the DOM tree. If this flag is off, it is not put in the tree, but is still parsed and checked for correctness. This flag is off by default.
- parse_pi determines if processing instructions (nodes with type node_pi) are to be put in the DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. Note that <?xml ...?> (document declaration) is not considered to be a PI. This flag is off by default.
- parse_comments determines if comments (nodes with type node_comment) are to be put in the DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. This flag is off by default.
- parse_cdata determines if CDATA sections (nodes with type node_cdata) are to be put in the DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. This flag is on by default.
- parse_trim_pcdata determines if leading and trailing whitespace characters are to be removed from PCDATA nodes. While for some applications leading/trailing whitespace is significant, often the application only cares about the non-whitespace contents, so it's easier to trim whitespace from text during parsing. This flag is off by default.
- parse_ws_pcdata determines if PCDATA nodes (nodes with type node_pcdata) that consist only of whitespace characters are to be put in the DOM tree. Often whitespace-only data is not significant for the application, and the cost of allocating and storing such nodes (both memory and speed-wise) can be significant. For example, after parsing the XML string <node> <a/> </node>, the <node> element will have three children when parse_ws_pcdata is set (a child with type node_pcdata and value " ", a child with type node_element and name "a", and another child with type node_pcdata and value " "), and only one child when parse_ws_pcdata is not set. This flag is off by default.
- parse_ws_pcdata_single determines if whitespace-only PCDATA nodes that have no sibling nodes are to be put in the DOM tree. In some cases the application needs to parse the whitespace-only contents of nodes, e.g. <node> </node>, but is not interested in whitespace markup elsewhere. It is possible to use the parse_ws_pcdata flag in this case, but it results in excessive allocations and complicates document processing in some cases; this flag is intended to avoid that. As an example, after parsing the XML string <node> <a> </a> </node> with the parse_ws_pcdata_single flag set, the <node> element will have one child <a>, and the <a> element will have one child with type node_pcdata and value " ". This flag has no effect if parse_ws_pcdata is enabled. This flag is off by default.
- parse_fragment determines if the document should be treated as a fragment of a valid XML document. Parsing a document as a fragment causes top-level PCDATA content (i.e. text that is not located inside a node) to be added to the tree, and additionally treats documents without element nodes as valid. This flag is off by default.

Caution

Using in-place parsing (load_buffer_inplace) with the parse_fragment flag may result in the loss of the last character of the buffer if it is a part of PCDATA. Since PCDATA values are null-terminated strings, the only way to resolve this is to provide a null-terminated buffer as an input to load_buffer_inplace, i.e. doc.load_buffer_inplace("test\0", 5, pugi::parse_default | pugi::parse_fragment).

These flags control the transformation of tree element contents:

- parse_escapes determines if character and entity references are to be expanded during the parsing process. Character references have the form &#...; or &#x...; (... is the Unicode numeric representation of the character in either decimal (&#...;) or hexadecimal (&#x...;) form); entity references are &lt;, &gt;, &amp;, &apos; and &quot; (note that as pugixml does not handle DTD, the only allowed entities are the predefined ones). If a character/entity reference cannot be expanded, it is left as is, so you can do additional processing later. Reference expansion is performed on attribute values and PCDATA content. This flag is on by default.
- parse_eol determines if EOL handling (that is, replacing the sequence 0x0d 0x0a by a single 0x0a character, and replacing all standalone 0x0d characters by 0x0a) is to be performed on input data (that is, comment contents, PCDATA/CDATA contents and attribute values). This flag is on by default.
- parse_wconv_attribute determines if attribute value normalization should be performed for all attributes. This means that whitespace characters (new line, tab and space) are replaced with a space (' '). New line characters are always treated as if parse_eol is set, i.e. \r\n is converted to a single space. This flag is on by default.
- parse_wnorm_attribute determines if extended attribute value normalization should be performed for all attributes. This means that after attribute values are normalized as if parse_wconv_attribute was set, leading and trailing space characters are removed, and all sequences of space characters are replaced by a single space character. parse_wconv_attribute has no effect if this flag is on. This flag is off by default.
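To make the parse_escapes rule concrete (expand what is recognized, leave everything else untouched), here is a simplified standalone sketch. It is not pugixml's implementation: the function names are hypothetical, the output is assumed to be UTF-8, and surrogate code points are not specially validated.

```cpp
#include <cstddef>
#include <string>

// Appends code point cp (assumed to be at most 0x10FFFF) to out as UTF-8.
void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80)
        out += static_cast<char>(cp);
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Parses decimal (base 10) or hexadecimal (base 16) digits into cp.
// Returns false on an empty string, an invalid digit, or an out-of-range value.
bool parse_code_point(const std::string& digits, unsigned int base, unsigned long& cp)
{
    if (digits.empty()) return false;

    cp = 0;
    for (char c : digits)
    {
        unsigned int d;
        if (c >= '0' && c <= '9') d = static_cast<unsigned int>(c - '0');
        else if (base == 16 && c >= 'a' && c <= 'f') d = static_cast<unsigned int>(c - 'a' + 10);
        else if (base == 16 && c >= 'A' && c <= 'F') d = static_cast<unsigned int>(c - 'A' + 10);
        else return false;

        cp = cp * base + d;
        if (cp > 0x10FFFF) return false;
    }
    return true;
}

// Expands the five predefined entities and numeric character references;
// anything that is not recognized is copied to the output unchanged.
std::string expand_references(const std::string& in)
{
    static const struct { const char* name; char value; } predefined[] = {
        {"lt", '<'}, {"gt", '>'}, {"amp", '&'}, {"apos", '\''}, {"quot", '"'}
    };

    std::string out;
    std::size_t i = 0;

    while (i < in.size())
    {
        std::size_t end = in[i] == '&' ? in.find(';', i) : std::string::npos;
        bool expanded = false;

        if (end != std::string::npos)
        {
            std::string body = in.substr(i + 1, end - i - 1);
            unsigned long cp = 0;

            if (body.size() > 1 && body[0] == '#')
            {
                bool hex = body[1] == 'x';
                expanded = parse_code_point(body.substr(hex ? 2 : 1), hex ? 16u : 10u, cp);
                if (expanded) append_utf8(out, cp);
            }
            else
            {
                for (const auto& entity : predefined)
                    if (body == entity.name) { out += entity.value; expanded = true; break; }
            }
        }

        if (expanded) i = end + 1;
        else out += in[i++]; // not a recognized reference: copy the character as is
    }

    return out;
}
```

A reference to an entity that is not predefined, such as &unknown;, passes through unchanged, which mirrors the "left as is" behavior described for parse_escapes.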

	Note
	The parse_wconv_attribute option performs transformations that are required by the W3C specification for attributes that are declared as CDATA; parse_wnorm_attribute performs transformations required for NMTOKENS attributes. In the absence of a document type declaration all attributes should behave as if they are declared as CDATA, thus parse_wconv_attribute is the default option.
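The three text transformations described for parse_eol, parse_wconv_attribute and parse_wnorm_attribute can be illustrated with plain string functions. This is a sketch of the rules as stated in this section, not pugixml's parser code (which works in place during parsing); the function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// EOL handling: CR LF becomes LF, and a standalone CR becomes LF.
std::string normalize_eol(const std::string& in)
{
    std::string out;
    out.reserve(in.size());

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '\r')
        {
            out += '\n';
            if (i + 1 < in.size() && in[i + 1] == '\n') ++i; // skip the LF of a CR LF pair
        }
        else
            out += in[i];
    }

    return out;
}

// Attribute value normalization (CDATA rules): after EOL handling,
// every new line and tab is replaced by a space.
std::string wconv_attribute(const std::string& in)
{
    std::string out = normalize_eol(in);

    for (char& c : out)
        if (c == '\n' || c == '\t') c = ' ';

    return out;
}

// Extended normalization (NMTOKENS rules): apply wconv_attribute, then
// drop leading/trailing spaces and collapse runs of spaces into one.
std::string wnorm_attribute(const std::string& in)
{
    std::string conv = wconv_attribute(in);
    std::string out;

    for (char c : conv)
    {
        if (c == ' ')
        {
            // Emit a space only after a non-space character.
            if (!out.empty() && out.back() != ' ') out += ' ';
        }
        else
            out += c;
    }

    // At most one trailing space can remain at this point.
    if (!out.empty() && out.back() == ' ') out.pop_back();

    return out;
}
```

For instance, wconv_attribute turns "a\r\nb" into "a b" (a single space, because the CR LF pair is first reduced to one LF), while wnorm_attribute additionally trims and collapses spaces.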

Additionally there are three predefined option masks:

- parse_minimal has all options turned off. This option mask means that pugixml does not add declaration nodes, document type declaration nodes, PI nodes, CDATA sections and comments to the resulting tree and does not perform any conversion for input data, so theoretically it is the fastest mode. However, as mentioned above, in practice parse_default is usually equally fast.
- parse_default is the default set of flags, i.e. it has all options set to their default values. It includes parsing CDATA sections (comments/PIs are not parsed), performing character and entity reference expansion, replacing whitespace characters with spaces in attribute values and performing EOL handling. Note that PCDATA sections consisting only of whitespace characters are not parsed (by default) for performance reasons.
- parse_full is the set of flags which adds nodes of all types to the resulting tree and performs default conversions for input data. It includes parsing CDATA sections, comments, PI nodes, the document declaration node and the document type declaration node, performing character and entity reference expansion, replacing whitespace characters with spaces in attribute values and performing EOL handling. Note that PCDATA sections consisting only of whitespace characters are not parsed in this mode.

This is an example of using different parsing options (samples/load_options.cpp):

const char* source = "<!--comment--><node>&lt;</node>";

// Parsing with default options; note that comment node is not added to the tree, and entity reference &lt; is expanded
doc.load_string(source);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with additional parse_comments option; comment node is now added to the tree
doc.load_string(source, pugi::parse_default | pugi::parse_comments);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with additional parse_comments option and without the (default) parse_escapes option; &lt; is not expanded
doc.load_string(source, (pugi::parse_default | pugi::parse_comments) & ~pugi::parse_escapes);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with minimal option mask; comment node is not added to the tree, and &lt; is not expanded
doc.load_string(source, pugi::parse_minimal);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

Encodings

pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) and handles all encoding conversions. Most loading functions accept the optional parameter encoding. This is a value of the enumeration type xml_encoding, which can have the following values:

- encoding_auto means that pugixml will try to guess the encoding based on the source XML data. The algorithm is a modified version of the one presented in Appendix F.1 of the XML recommendation; it tries to match the first few bytes of input data with the following patterns in strict order:
  - If the first four bytes match the UTF-32 BOM (Byte Order Mark), the encoding is assumed to be UTF-32 with the endianness equal to that of the BOM;
  - If the first two bytes match the UTF-16 BOM, the encoding is assumed to be UTF-16 with the endianness equal to that of the BOM;
  - If the first three bytes match the UTF-8 BOM, the encoding is assumed to be UTF-8;
  - If the first four bytes match the UTF-32 representation of <, the encoding is assumed to be UTF-32 with the corresponding endianness;
  - If the first four bytes match the UTF-16 representation of <?, the encoding is assumed to be UTF-16 with the corresponding endianness;
  - If the first two bytes match the UTF-16 representation of <, the encoding is assumed to be UTF-16 with the corresponding endianness (this guess may yield an incorrect result, but it's better than UTF-8);
  - Otherwise the encoding is assumed to be UTF-8.
- encoding_utf8 corresponds to UTF-8 encoding as defined in the Unicode standard; UTF-8 sequences with length equal to 5 or 6 are not standard and are rejected.
- encoding_utf16_le corresponds to little-endian UTF-16 encoding as defined in the Unicode standard; surrogate pairs are supported.
- encoding_utf16_be corresponds to big-endian UTF-16 encoding as defined in the Unicode standard; surrogate pairs are supported.
- encoding_utf16 corresponds to UTF-16 encoding as defined in the Unicode standard; the endianness is assumed to be that of the target platform.
- encoding_utf32_le corresponds to little-endian UTF-32 encoding as defined in the Unicode standard.
- encoding_utf32_be corresponds to big-endian UTF-32 encoding as defined in the Unicode standard.
- encoding_utf32 corresponds to UTF-32 encoding as defined in the Unicode standard; the endianness is assumed to be that of the target platform.
- encoding_wchar corresponds to the encoding of the wchar_t type; it has the same meaning as either encoding_utf16 or encoding_utf32, depending on the wchar_t size.
- encoding_latin1 corresponds to ISO-8859-1 encoding (also known as Latin-1).

The algorithm used for encoding_auto correctly detects any supported Unicode encoding for all well-formed XML documents (since they start with a document declaration) and for all other XML documents that start with <; if your XML document does not start with < and has an encoding that is different from UTF-8, specify the encoding explicitly.
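The detection order described for encoding_auto can be sketched as a standalone function. This is an illustration of the pattern list in this section, not pugixml's source code: the guessed_encoding enumeration and the guess_encoding function are hypothetical names, and the handling of inputs shorter than a given pattern (such patterns are simply skipped) is an assumption.

```cpp
#include <cstddef>

// Hypothetical result type for the sketch.
enum class guessed_encoding { utf8, utf16_le, utf16_be, utf32_le, utf32_be };

// Tries the patterns in the documented order; the first match wins.
guessed_encoding guess_encoding(const unsigned char* d, std::size_t size)
{
    // 1. UTF-32 BOM. This must be tested before the UTF-16 BOM, because
    //    the little-endian UTF-32 BOM (FF FE 00 00) starts with FF FE.
    if (size >= 4)
    {
        if (d[0] == 0x00 && d[1] == 0x00 && d[2] == 0xFE && d[3] == 0xFF) return guessed_encoding::utf32_be;
        if (d[0] == 0xFF && d[1] == 0xFE && d[2] == 0x00 && d[3] == 0x00) return guessed_encoding::utf32_le;
    }

    // 2. UTF-16 BOM.
    if (size >= 2)
    {
        if (d[0] == 0xFE && d[1] == 0xFF) return guessed_encoding::utf16_be;
        if (d[0] == 0xFF && d[1] == 0xFE) return guessed_encoding::utf16_le;
    }

    // 3. UTF-8 BOM.
    if (size >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF) return guessed_encoding::utf8;

    if (size >= 4)
    {
        // 4. '<' encoded as UTF-32.
        if (d[0] == 0x00 && d[1] == 0x00 && d[2] == 0x00 && d[3] == 0x3C) return guessed_encoding::utf32_be;
        if (d[0] == 0x3C && d[1] == 0x00 && d[2] == 0x00 && d[3] == 0x00) return guessed_encoding::utf32_le;

        // 5. "<?" encoded as UTF-16.
        if (d[0] == 0x00 && d[1] == 0x3C && d[2] == 0x00 && d[3] == 0x3F) return guessed_encoding::utf16_be;
        if (d[0] == 0x3C && d[1] == 0x00 && d[2] == 0x3F && d[3] == 0x00) return guessed_encoding::utf16_le;
    }

    // 6. '<' encoded as UTF-16 (a weaker guess, as noted in the list).
    if (size >= 2)
    {
        if (d[0] == 0x00 && d[1] == 0x3C) return guessed_encoding::utf16_be;
        if (d[0] == 0x3C && d[1] == 0x00) return guessed_encoding::utf16_le;
    }

    // 7. Fallback.
    return guessed_encoding::utf8;
}
```

The strict ordering matters: reordering steps 1 and 2 would misclassify little-endian UTF-32 data with a BOM as UTF-16.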

	Note
	The current behavior for Unicode conversion is to skip all invalid UTF sequences during conversion. This behavior should not be relied upon; moreover, in case no encoding conversion is performed, the invalid sequences are not removed, so you'll get them as is in node/attribute contents.

Conformance to W3C specification

pugixml is not fully W3C conformant: it can load any valid XML document, but does not perform some well-formedness checks. While considerable effort is made to reject invalid XML documents, some validation is not performed for performance reasons.

There is only one non-conformant behavior when dealing with valid XML documents: pugixml does not use information supplied in the document type declaration for parsing. This means that entities declared in DOCTYPE are not expanded, and all attribute/PCDATA values are always processed in a uniform way that depends only on parsing options.

As for rejecting invalid XML documents, there are a number of incompatibilities with the W3C specification, including:

- Multiple attributes of the same node can have equal names.
- All non-ASCII characters are treated in the same way as symbols of the English alphabet, so some invalid tag names are not rejected.
- Attribute values which contain < are not rejected.
- Invalid entity/character references are not rejected and are instead left as is.
- Comment values can contain --.
- XML data is not required to begin with a document declaration; additionally, the document declaration can appear after comments and other nodes.
- Invalid document type declarations are silently ignored in some cases.