pugixml 1.6 manual | Overview | Installation | Document: Object model · Loading · Accessing · Modifying · Saving | XPath | API Reference | Table of Contents |
pugixml provides several functions for loading XML data from various places: files, C++ iostreams, and memory buffers. All functions use an extremely fast non-validating parser. This parser is not fully W3C conformant: it can load any valid XML document, but does not perform some well-formedness checks. While considerable effort is made to reject invalid XML documents, some validation is not performed for performance reasons. Also, some XML transformations (e.g. EOL handling or attribute value normalization) can impact parsing speed and thus can be disabled. However, for the vast majority of XML documents there is no performance difference between different parsing options. Parsing options also control whether certain XML nodes are parsed; see Parsing options for more information.
XML data is always converted to the internal character format (see Unicode interface) before parsing. pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) and handles all encoding conversions automatically. Unless an explicit encoding is specified, loading functions perform automatic encoding detection based on the first few characters of XML data, so in almost all cases you do not have to specify the document encoding. Encoding conversion is described in more detail in Encodings.
The most common source of XML data is files; pugixml provides dedicated functions for loading an XML document from file:
xml_parse_result xml_document::load_file(const char* path, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load_file(const wchar_t* path, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
These functions accept the file path as their first argument, and also two optional arguments, which specify parsing options (see Parsing options) and input data encoding (see Encodings). The path uses the target operating system format: it can be relative or absolute, it should use the delimiters of the target system, it should have the exact case if the target file system is case-sensitive, etc.
The file path is passed to the system file opening function as is in the case of the first function (which accepts const char* path); the second function either uses a special file opening function if it is provided by the runtime library, or converts the path to UTF-8 and uses the system file opening function.
load_file destroys the existing document tree and then tries to load the new tree from the specified file. The result of the operation is returned in an xml_parse_result object; this object contains the operation status and the related information (e.g. the last successfully parsed position in the input file, if parsing fails). See Handling parsing errors for error handling details.
This is an example of loading XML document from file (samples/load_file.cpp):
pugi::xml_document doc;
pugi::xml_parse_result result = doc.load_file("tree.xml");
std::cout << "Load result: " << result.description() << ", mesh name: " << doc.child("mesh").attribute("name").value() << std::endl;
Sometimes XML data should be loaded from a source other than a file, e.g. an HTTP URL; you may also want to load XML data from a file using non-standard functions, e.g. to use your virtual file system facilities or to load XML from gzip-compressed files. All these scenarios require loading the document from memory. First you should prepare a contiguous memory block with all XML data; then you have to invoke one of the buffer loading functions. These functions will handle the necessary encoding conversions, if any, and then will parse the data into the corresponding XML tree. There are several buffer loading functions, which differ in behavior and thus in performance/memory usage:
xml_parse_result xml_document::load_buffer(const void* contents, size_t size, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load_buffer_inplace(void* contents, size_t size, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load_buffer_inplace_own(void* contents, size_t size, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
All functions accept the buffer which is represented by a pointer to XML data, contents, and data size in bytes. Also there are two optional arguments, which specify parsing options (see Parsing options) and input data encoding (see Encodings). The buffer does not have to be zero-terminated.
The load_buffer function works with an immutable buffer: it never modifies the buffer. Because of this restriction it has to create a private buffer and copy XML data to it before parsing (applying encoding conversions if necessary). This copy operation carries a performance penalty, so inplace functions are provided: load_buffer_inplace and load_buffer_inplace_own store the document data in the buffer, modifying it in the process. In order for the document to stay valid, you have to make sure that the buffer's lifetime exceeds that of the tree if you're using inplace functions. In addition to that, load_buffer_inplace does not assume ownership of the buffer, so you'll have to destroy it yourself; load_buffer_inplace_own assumes ownership of the buffer and destroys it once it is not needed. This means that if you're using load_buffer_inplace_own, you have to allocate memory with the pugixml allocation function (you can get it via get_memory_allocation_function).
The best way from the performance/memory point of view is to load the document using load_buffer_inplace_own; this function has maximum control of the buffer with XML data so it is able to avoid redundant copies and reduce peak memory usage while parsing. This is the recommended function if you have to load the document from memory and performance is critical.
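Preparing the contiguous memory block is left to the caller. The following is a minimal sketch of one way to do it, assuming a hypothetical helper named read_all_bytes (it is not part of pugixml) that drains any std::istream, for example a decompressing or virtual file system stream, into a std::vector<char>:

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>  // std::istringstream, handy for in-memory sources
#include <string>
#include <vector>

// Hypothetical helper (not part of pugixml): reads everything from the
// current read position of the stream to its end into one contiguous block.
std::vector<char> read_all_bytes(std::istream& in)
{
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}
```

The resulting block could then be handed to the immutable-buffer function as doc.load_buffer(data.data(), data.size()). It should not be handed to load_buffer_inplace_own, since the vector's storage is not allocated with the pugixml allocation function.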
There is also a simple helper function for cases when you want to load the XML document from null-terminated character string:
xml_parse_result xml_document::load_string(const char_t* contents, unsigned int options = parse_default);
It is equivalent to calling load_buffer with size being either strlen(contents) or wcslen(contents) * sizeof(wchar_t), depending on the character type. This function assumes native encoding for input data, so it does not do any encoding conversion. In general, this function is fine for loading small documents from string literals, but has more overhead and less functionality than the buffer loading functions.
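The size computation described above can be sketched as a pair of overloads. The name string_size_in_bytes is hypothetical and only illustrates which byte count corresponds to each character type; it is not a pugixml function:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Byte size of a null-terminated narrow string, excluding the terminator.
inline std::size_t string_size_in_bytes(const char* contents)
{
    return std::strlen(contents);
}

// Byte size of a null-terminated wide string, excluding the terminator.
inline std::size_t string_size_in_bytes(const wchar_t* contents)
{
    return std::wcslen(contents) * sizeof(wchar_t);
}
```

Note that the wide overload multiplies by sizeof(wchar_t) because the buffer loading functions expect a size in bytes, not in characters.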
This is an example of loading XML document from memory using different functions (samples/load_memory.cpp):
const char source[] = "<mesh name='sphere'><bounds>0 0 1 1</bounds></mesh>";
size_t size = sizeof(source);

// You can use load_buffer to load document from immutable memory block:
pugi::xml_parse_result result = doc.load_buffer(source, size);

// You can use load_buffer_inplace to load document from mutable memory block; the block's lifetime must exceed that of document
char* buffer = new char[size];
memcpy(buffer, source, size);

// The block can be allocated by any method; the block is modified during parsing
pugi::xml_parse_result result = doc.load_buffer_inplace(buffer, size);

// You have to destroy the block yourself after the document is no longer used
delete[] buffer;

// You can use load_buffer_inplace_own to load document from mutable memory block and to pass the ownership of this block
// The block has to be allocated via pugixml allocation function - using e.g. operator new here is incorrect
char* buffer = static_cast<char*>(pugi::get_memory_allocation_function()(size));
memcpy(buffer, source, size);

// The block will be deleted by the document
pugi::xml_parse_result result = doc.load_buffer_inplace_own(buffer, size);

// You can use load_string to load document from null-terminated strings, for example literals:
pugi::xml_parse_result result = doc.load_string("<mesh name='sphere'><bounds>0 0 1 1</bounds></mesh>");
To enhance interoperability, pugixml provides functions for loading a document from any object which implements the C++ std::istream interface. This allows you to load documents from any standard C++ stream (e.g. a file stream) or any third-party compliant implementation (e.g. Boost Iostreams). There are two functions: one works with narrow character streams, the other handles wide character ones:
xml_parse_result xml_document::load(std::istream& stream, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load(std::wistream& stream, unsigned int options = parse_default);
load with std::istream argument loads the document from the stream from the current read position to the end, treating the stream contents as a byte stream of the specified encoding (with encoding autodetection as necessary). Thus calling xml_document::load on an opened std::ifstream object is equivalent to calling xml_document::load_file.
load with std::wistream argument treats the stream contents as a wide character stream (encoding is always encoding_wchar). Because of this, using load with wide character streams requires careful (usually platform-specific) stream setup (e.g. using the imbue function). Generally use of wide streams is discouraged; however, it provides you the ability to load documents from non-Unicode encodings, e.g. you can load Shift-JIS encoded data if you set the correct locale.
This is a simple example of loading XML document from file using streams (samples/load_stream.cpp); read the sample code for more complex examples involving wide streams and locales:
std::ifstream stream("weekly-utf-8.xml");
pugi::xml_parse_result result = doc.load(stream);
All document loading functions return the parsing result via xml_parse_result object. It contains parsing status, the offset of last successfully parsed character from the beginning of the source stream, and the encoding of the source stream:
struct xml_parse_result
{
    xml_parse_status status;
    ptrdiff_t offset;
    xml_encoding encoding;

    operator bool() const;
    const char* description() const;
};
Parsing status is represented as the xml_parse_status enumeration and can be one of the following:
status_ok means that no error was encountered during parsing; the source stream represents the valid XML document which was fully parsed and converted to a tree.
status_file_not_found is only returned by load_file function and means that file could not be opened.
status_io_error is returned by load_file function and by load functions with std::istream/std::wistream arguments; it means that some I/O error has occurred during reading the file/stream.
status_out_of_memory means that there was not enough memory during some allocation; any allocation failure during parsing results in this error.
status_internal_error means that something went horribly wrong; currently this error does not occur.
status_unrecognized_tag means that parsing stopped due to a tag with either an empty name or a name which starts with incorrect character, such as #.
status_bad_pi means that parsing stopped due to incorrect document declaration/processing instruction.
status_bad_comment, status_bad_cdata, status_bad_doctype and status_bad_pcdata mean that parsing stopped due to the invalid construct of the respective type.
status_bad_start_element means that parsing stopped because starting tag either had no closing > symbol or contained some incorrect symbol.
status_bad_attribute means that parsing stopped because there was an incorrect attribute, such as an attribute without value or with value that is not quoted (note that <node attr=1> is incorrect in XML).
status_bad_end_element means that parsing stopped because ending tag had incorrect syntax (e.g. extra non-whitespace symbols between tag name and >).
status_end_element_mismatch means that parsing stopped because the closing tag did not match the opening one (e.g. <node></nedo>) or because some tag was not closed at all.
status_no_document_element means that no element nodes were discovered during parsing; this usually indicates an empty or invalid document.
The description() member function can be used to convert parsing status to a string; the returned message is always in English, so you'll have to write your own function if you need a localized string. However please note that the exact messages returned by description() function may change from version to version, so any complex status handling should be based on status value. Note that description() returns a char string even in PUGIXML_WCHAR_MODE; you'll have to call as_wide to get the wchar_t string.
If parsing failed because the source data was not a valid XML, the resulting tree is not destroyed - despite the fact that load function returns error, you can use the part of the tree that was successfully parsed. Obviously, the last element may have an unexpected name/value; for example, if the attribute value does not end with the necessary quotation mark, as in <node attr="value>some data</node>, the value of attribute attr will contain the string value>some data</node>.
In addition to the status code, parsing result has an offset member, which contains the offset of last successfully parsed character if parsing failed because of an error in source data; otherwise offset is 0. For parsing efficiency reasons, pugixml does not track the current line during parsing; this offset is in units of pugi::char_t (bytes for character mode, wide characters for wide character mode). Many text editors support 'Go To Position' feature - you can use it to locate the exact error position. Alternatively, if you're loading the document from memory, you can display the error chunk along with the error description (see the example code below).
Caution | |
---|---|
Offset is calculated in the XML buffer in native encoding; if encoding conversion is performed during parsing, offset can not be used to reliably track the error position. |
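Since the parser reports only an offset, a line and column have to be derived from the original text when they are needed. The following is a minimal sketch using a hypothetical helper named offset_to_line_column (it is not part of pugixml). It assumes character mode, so that the offset is a byte index, and it assumes that no encoding conversion took place, in line with the caution above:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: maps a 0-based offset into `text` to a 1-based
// (line, column) pair. Columns are counted in bytes, not in characters.
std::pair<std::size_t, std::size_t> offset_to_line_column(const std::string& text, std::size_t offset)
{
    // Clamp offsets that point past the end of the text.
    if (offset > text.size()) offset = text.size();

    std::size_t line = 1, column = 1;

    for (std::size_t i = 0; i < offset; ++i)
    {
        if (text[i] == '\n')
        {
            ++line;
            column = 1;
        }
        else
        {
            ++column;
        }
    }

    return std::make_pair(line, column);
}
```

With a parse result in hand, the helper could be called as offset_to_line_column(source_text, static_cast<std::size_t>(result.offset)), where source_text is the caller's own copy of the parsed data.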
Parsing result also has an encoding member, which can be used to check that the source data encoding was correctly guessed. It is equal to the exact encoding used during parsing (i.e. with the exact endianness); see Encodings for more information.
Parsing result object can be implicitly converted to bool; if you do not want to handle parsing errors thoroughly, you can just check the return value of load functions as if it was a bool: if (doc.load_file("file.xml")) { ... } else { ... }.
This is an example of handling loading errors (samples/load_error_handling.cpp):
pugi::xml_document doc;
pugi::xml_parse_result result = doc.load_string(source);

if (result)
    std::cout << "XML [" << source << "] parsed without errors, attr value: [" << doc.child("node").attribute("attr").value() << "]\n\n";
else
{
    std::cout << "XML [" << source << "] parsed with errors, attr value: [" << doc.child("node").attribute("attr").value() << "]\n";
    std::cout << "Error description: " << result.description() << "\n";
    std::cout << "Error offset: " << result.offset << " (error at [..." << (source + result.offset) << "]\n\n";
}
All document loading functions accept the optional parameter options. This is a bitmask that customizes the parsing process: you can select the node types that are parsed and various transformations that are performed with the XML text. Disabling certain transformations can improve parsing performance for some documents; however, the code for all transformations is very well optimized, and thus the majority of documents won't get any performance benefit. As a rule of thumb, only modify parsing flags if you want to get some nodes in the document that are excluded by default (e.g. declaration or comment nodes).
Note | |
---|---|
You should use the usual bitwise arithmetics to manipulate the bitmask: to enable a flag, use mask \| flag; to disable a flag, use mask & ~flag. |
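The bitwise manipulation of an options mask can be sketched in isolation. The constants below are illustrative stand-ins with made-up values, not pugixml's real flags; in real code the same expressions would be applied to values such as pugi::parse_default and pugi::parse_comments:

```cpp
#include <cassert>

// Illustrative stand-ins only: NOT pugixml's real constants or values.
const unsigned int flag_comments = 0x1;
const unsigned int flag_escapes = 0x2;
const unsigned int flag_eol = 0x4;
const unsigned int mask_default = flag_escapes | flag_eol;

// Enable a flag: bitwise OR sets its bit and leaves the others untouched.
inline unsigned int enable_flag(unsigned int mask, unsigned int flag)
{
    return mask | flag;
}

// Disable a flag: AND with the complement clears only its bit.
inline unsigned int disable_flag(unsigned int mask, unsigned int flag)
{
    return mask & ~flag;
}

// Test whether a flag is present in the mask.
inline bool has_flag(unsigned int mask, unsigned int flag)
{
    return (mask & flag) != 0;
}
```

Starting from a default mask and adjusting it this way keeps every flag that was not explicitly touched, which is usually safer than building a mask from scratch.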
These flags control the resulting tree contents:
parse_declaration determines if XML document declaration (node with type node_declaration) is to be put in DOM tree. If this flag is off, it is not put in the tree, but is still parsed and checked for correctness. This flag is off by default.
parse_doctype determines if XML document type declaration (node with type node_doctype) is to be put in DOM tree. If this flag is off, it is not put in the tree, but is still parsed and checked for correctness. This flag is off by default.
parse_pi determines if processing instructions (nodes with type node_pi) are to be put in DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. Note that <?xml ...?> (document declaration) is not considered to be a PI. This flag is off by default.
parse_comments determines if comments (nodes with type node_comment) are to be put in DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. This flag is off by default.
parse_cdata determines if CDATA sections (nodes with type node_cdata) are to be put in DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. This flag is on by default.
parse_trim_pcdata determines if leading and trailing whitespace characters are to be removed from PCDATA nodes. While for some applications leading/trailing whitespace is significant, often the application only cares about the non-whitespace contents so it's easier to trim whitespace from text during parsing. This flag is off by default.
parse_ws_pcdata determines if PCDATA nodes (nodes with type node_pcdata) that consist only of whitespace characters are to be put in DOM tree. Often whitespace-only data is not significant for the application, and the cost of allocating and storing such nodes (both memory and speed-wise) can be significant. For example, after parsing XML string <node> <a/> </node>, <node> element will have three children when parse_ws_pcdata is set (child with type node_pcdata and value " ", child with type node_element and name "a", and another child with type node_pcdata and value " "), and only one child when parse_ws_pcdata is not set. This flag is off by default.
parse_ws_pcdata_single determines if whitespace-only PCDATA nodes that have no sibling nodes are to be put in DOM tree. In some cases application needs to parse the whitespace-only contents of nodes, e.g. <node> </node>, but is not interested in whitespace markup elsewhere. It is possible to use parse_ws_pcdata flag in this case, but it results in excessive allocations and complicates document processing in some cases; this flag is intended to avoid that. As an example, after parsing XML string <node> <a> </a> </node> with parse_ws_pcdata_single flag set, <node> element will have one child <a>, and <a> element will have one child with type node_pcdata and value " ". This flag has no effect if parse_ws_pcdata is enabled. This flag is off by default.
parse_fragment determines if document should be treated as a fragment of a valid XML. Parsing document as a fragment causes top-level PCDATA content (i.e. text that is not located inside a node) to be added to the tree, and additionally treats documents without element nodes as valid. This flag is off by default.
Caution | |
---|---|
Using in-place parsing (load_buffer_inplace)
with |
These flags control the transformation of tree element contents:
parse_escapes determines if character and entity references are to be expanded during the parsing process. Character references have the form &#...; or &#x...; (... is Unicode numeric representation of character in either decimal (&#...;) or hexadecimal (&#x...;) form), entity references are &lt;, &gt;, &amp;, &apos; and &quot; (note that as pugixml does not handle DTD, the only allowed entities are predefined ones). If character/entity reference can not be expanded, it is left as is, so you can do additional processing later. Reference expansion is performed on attribute values and PCDATA content. This flag is on by default.
parse_eol determines if EOL handling (that is, replacing sequences 0x0d 0x0a by a single 0x0a character, and replacing all standalone 0x0d characters by 0x0a) is to be performed on input data (that is, comments contents, PCDATA/CDATA contents and attribute values). This flag is on by default.
parse_wconv_attribute determines if attribute value normalization should be performed for all attributes. This means that whitespace characters (new line, tab and space) are replaced with space (' '). New line characters are always treated as if parse_eol is set, i.e. \r\n is converted to a single space. This flag is on by default.
parse_wnorm_attribute determines if extended attribute value normalization should be performed for all attributes. This means that after attribute values are normalized as if parse_wconv_attribute was set, leading and trailing space characters are removed, and all sequences of space characters are replaced by a single space character. parse_wconv_attribute has no effect if this flag is on. This flag is off by default.
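The three whitespace transformations described for parse_eol, parse_wconv_attribute and parse_wnorm_attribute can be illustrated with a standalone sketch. The functions below are hypothetical reimplementations written from the descriptions above, operating on std::string; they are not pugixml's own code, which works in place on its internal buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// EOL handling as described for parse_eol: "\r\n" becomes "\n",
// and every standalone '\r' becomes '\n'.
std::string normalize_eol(const std::string& in)
{
    std::string out;
    out.reserve(in.size());

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '\r')
        {
            out += '\n';
            if (i + 1 < in.size() && in[i + 1] == '\n') ++i; // swallow the '\n' of "\r\n"
        }
        else
        {
            out += in[i];
        }
    }

    return out;
}

// Conversion as described for parse_wconv_attribute: after EOL handling,
// new line and tab characters are replaced with a space.
std::string convert_attribute_whitespace(const std::string& in)
{
    std::string out = normalize_eol(in);

    for (std::size_t i = 0; i < out.size(); ++i)
        if (out[i] == '\n' || out[i] == '\t') out[i] = ' ';

    return out;
}

// Normalization as described for parse_wnorm_attribute: convert first,
// then trim leading/trailing spaces and collapse runs of spaces.
std::string normalize_attribute_whitespace(const std::string& in)
{
    const std::string converted = convert_attribute_whitespace(in);
    std::string out;
    bool pending_space = false;

    for (std::size_t i = 0; i < converted.size(); ++i)
    {
        if (converted[i] == ' ')
        {
            // Remember a separator only once some content has been emitted.
            pending_space = !out.empty();
        }
        else
        {
            if (pending_space) out += ' ';
            pending_space = false;
            out += converted[i];
        }
    }

    return out;
}
```

Layering the functions this way mirrors the text: the extended normalization starts from the result of the plain conversion, which in turn always applies EOL handling.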
Additionally there are three predefined option masks:
parse_minimal has all options turned off. This option mask means that pugixml does not add declaration nodes, document type declaration nodes, PI nodes, CDATA sections and comments to the resulting tree and does not perform any conversion for input data, so theoretically it is the fastest mode. However, as mentioned above, in practice parse_default is usually equally fast.
parse_default is the default set of flags, i.e. it has all options set to their default values. It includes parsing CDATA sections (comments/PIs are not parsed), performing character and entity reference expansion, replacing whitespace characters with spaces in attribute values and performing EOL handling. Note that PCDATA sections consisting only of whitespace characters are not parsed (by default) for performance reasons.
parse_full is the set of flags which adds nodes of all types to the resulting tree and performs default conversions for input data. It includes parsing CDATA sections, comments, PI nodes, document declaration node and document type declaration node, performing character and entity reference expansion, replacing whitespace characters with spaces in attribute values and performing EOL handling. Note that PCDATA sections consisting only of whitespace characters are not parsed in this mode.
This is an example of using different parsing options (samples/load_options.cpp):
const char* source = "<!--comment--><node>&lt;</node>";

// Parsing with default options; note that comment node is not added to the tree, and entity reference &lt; is expanded
doc.load_string(source);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with additional parse_comments option; comment node is now added to the tree
doc.load_string(source, pugi::parse_default | pugi::parse_comments);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with additional parse_comments option and without the (default) parse_escapes option; &lt; is not expanded
doc.load_string(source, (pugi::parse_default | pugi::parse_comments) & ~pugi::parse_escapes);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with minimal option mask; comment node is not added to the tree, and &lt; is not expanded
doc.load_string(source, pugi::parse_minimal);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";
pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) and handles all encoding conversions. Most loading functions accept the optional parameter encoding. This is a value of enumeration type xml_encoding, that can have the following values:
encoding_auto means that pugixml will try to guess the encoding based on source XML data. The algorithm is a modified version of the one presented in Appendix F.1 of XML recommendation; it tries to match the first few bytes of input data with the following patterns in strict order: if they match the UTF-32 representation of <, encoding is assumed to be UTF-32 with the corresponding endianness; if they match the UTF-16 representation of <?, encoding is assumed to be UTF-16 with the corresponding endianness; if they match the UTF-16 representation of <, encoding is assumed to be UTF-16 with the corresponding endianness (this guess may yield incorrect result, but it's better than UTF-8).
encoding_utf8 corresponds to UTF-8 encoding as defined in the Unicode standard; UTF-8 sequences with length equal to 5 or 6 are not standard and are rejected.
encoding_utf16_le corresponds to little-endian UTF-16 encoding as defined in the Unicode standard; surrogate pairs are supported.
encoding_utf16_be corresponds to big-endian UTF-16 encoding as defined in the Unicode standard; surrogate pairs are supported.
encoding_utf16 corresponds to UTF-16 encoding as defined in the Unicode standard; the endianness is assumed to be that of the target platform.
encoding_utf32_le corresponds to little-endian UTF-32 encoding as defined in the Unicode standard.
encoding_utf32_be corresponds to big-endian UTF-32 encoding as defined in the Unicode standard.
encoding_utf32 corresponds to UTF-32 encoding as defined in the Unicode standard; the endianness is assumed to be that of the target platform.
encoding_wchar corresponds to the encoding of wchar_t type; it has the same meaning as either encoding_utf16 or encoding_utf32, depending on wchar_t size.
encoding_latin1 corresponds to ISO-8859-1 encoding (also known as Latin-1).
The algorithm used for encoding_auto correctly detects any supported Unicode encoding for all well-formed XML documents (since they start with document declaration) and for all other XML documents that start with <; if your XML document does not start with < and has encoding that is different from UTF-8, use the specific encoding.
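To make the "strict order" of the detection concrete, the following sketch checks the leading bytes for the three patterns named for encoding_auto (UTF-32 <, then UTF-16 <?, then UTF-16 <) and falls back to UTF-8 otherwise. It is an assumption-laden simplification: the enumeration and function names are hypothetical, the byte patterns are derived from the encodings themselves rather than from pugixml's source, and any other checks that the library may perform are not modeled:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type; pugixml itself reports xml_encoding values.
enum guessed_encoding
{
    guess_utf8,
    guess_utf16_le,
    guess_utf16_be,
    guess_utf32_le,
    guess_utf32_be
};

// Simplified sketch: matches the first bytes against '<' (0x3c) and '?' (0x3f)
// in UTF-32 and UTF-16 form, in the order described in the text above.
guessed_encoding guess_encoding(const unsigned char* data, std::size_t size)
{
    if (size >= 4)
    {
        const unsigned char d0 = data[0], d1 = data[1], d2 = data[2], d3 = data[3];

        // '<' in UTF-32; must be tested before UTF-16, since the little-endian
        // form 3c 00 00 00 also begins with the UTF-16 LE form of '<'.
        if (d0 == 0 && d1 == 0 && d2 == 0 && d3 == 0x3c) return guess_utf32_be;
        if (d0 == 0x3c && d1 == 0 && d2 == 0 && d3 == 0) return guess_utf32_le;

        // "<?" in UTF-16
        if (d0 == 0 && d1 == 0x3c && d2 == 0 && d3 == 0x3f) return guess_utf16_be;
        if (d0 == 0x3c && d1 == 0 && d2 == 0x3f && d3 == 0) return guess_utf16_le;
    }

    if (size >= 2)
    {
        // '<' in UTF-16; a weaker guess, as noted in the text
        if (data[0] == 0 && data[1] == 0x3c) return guess_utf16_be;
        if (data[0] == 0x3c && data[1] == 0) return guess_utf16_le;
    }

    return guess_utf8;
}
```

The ordering matters: testing the two-byte UTF-16 pattern first would misclassify little-endian UTF-32 input, which is why the longer patterns are tried before the shorter ones.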
Note | |
---|---|
The current behavior for Unicode conversion is to skip all invalid UTF sequences during conversion. This behavior should not be relied upon; moreover, in case no encoding conversion is performed, the invalid sequences are not removed, so you'll get them as is in node/attribute contents. |
pugixml is not fully W3C conformant - it can load any valid XML document, but does not perform some well-formedness checks. While considerable effort is made to reject invalid XML documents, some validation is not performed for performance reasons.
There is only one non-conformant behavior when dealing with valid XML documents: pugixml does not use information supplied in document type declaration for parsing. This means that entities declared in DOCTYPE are not expanded, and all attribute/PCDATA values are always processed in a uniform way that depends only on parsing options.
As for rejecting invalid XML documents, there are a number of incompatibilities with W3C specification, including:
<
are not rejected.
--
.