Diffstat (limited to 'docs/manual/loading.html')
-rw-r--r-- | docs/manual/loading.html | 183 |
1 files changed, 98 insertions, 85 deletions
diff --git a/docs/manual/loading.html b/docs/manual/loading.html index a3c1515..5b5576b 100644 --- a/docs/manual/loading.html +++ b/docs/manual/loading.html @@ -4,14 +4,15 @@ <title>Loading document</title> <link rel="stylesheet" href="../pugixml.css" type="text/css"> <meta name="generator" content="DocBook XSL Stylesheets V1.75.2"> -<link rel="home" href="../manual.html" title="pugixml 0.9"> -<link rel="up" href="../manual.html" title="pugixml 0.9"> +<link rel="home" href="../manual.html" title="pugixml 1.0"> +<link rel="up" href="../manual.html" title="pugixml 1.0"> <link rel="prev" href="dom.html" title="Document object model"> <link rel="next" href="access.html" title="Accessing document data"> </head> <body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"> <table width="100%"><tr> -<td>pugixml 0.9 manual | +<td> +<a href="http://pugixml.org/">pugixml 1.0</a> manual | <a href="../manual.html">Overview</a> | <a href="install.html">Installation</a> | Document: @@ -44,11 +45,11 @@ non-validating parser. This parser is not fully W3C conformant - it can load any valid XML document, but does not perform some well-formedness checks. While considerable effort is made to reject invalid XML documents, some validation - is not performed because of performance reasons. Also some XML transformations - (i.e. EOL handling or attribute value normalization) can impact parsing speed - and thus can be disabled. However for vast majority of XML documents there - is no performance difference between different parsing options. Parsing options - also control whether certain XML nodes are parsed; see <a class="xref" href="loading.html#manual.loading.options" title="Parsing options"> Parsing options</a> for + is not performed for performance reasons. Also some XML transformations (i.e. + EOL handling or attribute value normalization) can impact parsing speed and + thus can be disabled. 
However for vast majority of XML documents there is no + performance difference between different parsing options. Parsing options also + control whether certain XML nodes are parsed; see <a class="xref" href="loading.html#manual.loading.options" title="Parsing options"> Parsing options</a> for more information. </p> <p> @@ -65,43 +66,36 @@ <div class="titlepage"><div><div><h3 class="title"> <a name="manual.loading.file"></a><a class="link" href="loading.html#manual.loading.file" title="Loading document from file"> Loading document from file</a> </h3></div></div></div> -<a name="xml_document::load_file"></a><p> - The most common source of XML data is files; pugixml provides a separate - function for loading XML document from file: +<a name="xml_document::load_file"></a><a name="xml_document::load_file_wide"></a><p> + The most common source of XML data is files; pugixml provides dedicated functions + for loading an XML document from file: </p> <pre class="programlisting"><span class="identifier">xml_parse_result</span> <span class="identifier">xml_document</span><span class="special">::</span><span class="identifier">load_file</span><span class="special">(</span><span class="keyword">const</span> <span class="keyword">char</span><span class="special">*</span> <span class="identifier">path</span><span class="special">,</span> <span class="keyword">unsigned</span> <span class="keyword">int</span> <span class="identifier">options</span> <span class="special">=</span> <span class="identifier">parse_default</span><span class="special">,</span> <span class="identifier">xml_encoding</span> <span class="identifier">encoding</span> <span class="special">=</span> <span class="identifier">encoding_auto</span><span class="special">);</span> +<span class="identifier">xml_parse_result</span> <span class="identifier">xml_document</span><span class="special">::</span><span class="identifier">load_file</span><span class="special">(</span><span class="keyword">const</span> <span 
class="keyword">wchar_t</span><span class="special">*</span> <span class="identifier">path</span><span class="special">,</span> <span class="keyword">unsigned</span> <span class="keyword">int</span> <span class="identifier">options</span> <span class="special">=</span> <span class="identifier">parse_default</span><span class="special">,</span> <span class="identifier">xml_encoding</span> <span class="identifier">encoding</span> <span class="special">=</span> <span class="identifier">encoding_auto</span><span class="special">);</span> </pre> <p> - This function accepts file path as its first argument, and also two optional - arguments, which specify parsing options (see <a class="xref" href="loading.html#manual.loading.options" title="Parsing options"> Parsing options</a>) and - input data encoding (see <a class="xref" href="loading.html#manual.loading.encoding" title="Encodings"> Encodings</a>). The path has the target + These functions accept the file path as its first argument, and also two + optional arguments, which specify parsing options (see <a class="xref" href="loading.html#manual.loading.options" title="Parsing options"> Parsing options</a>) + and input data encoding (see <a class="xref" href="loading.html#manual.loading.encoding" title="Encodings"> Encodings</a>). The path has the target operating system format, so it can be a relative or absolute one, it should - have the delimiters of target system, it should have the exact case if target - file system is case-sensitive, etc. File path is passed to system file opening - function as is. + have the delimiters of the target system, it should have the exact case if + the target file system is case-sensitive, etc. 
+ </p> +<p> + File path is passed to the system file opening function as is in case of + the first function (which accepts <code class="computeroutput"><span class="keyword">const</span> + <span class="keyword">char</span><span class="special">*</span> <span class="identifier">path</span></code>); the second function either uses + a special file opening function if it is provided by the runtime library + or converts the path to UTF-8 and uses the system file opening function. </p> <p> <code class="computeroutput"><span class="identifier">load_file</span></code> destroys the existing document tree and then tries to load the new tree from the specified file. - The result of the operation is returned in an <code class="computeroutput"><span class="identifier">xml_parse_result</span></code> - object; this object contains the operation status, and the related information + The result of the operation is returned in an <a class="link" href="loading.html#xml_parse_result">xml_parse_result</a> + object; this object contains the operation status and the related information (i.e. last successfully parsed position in the input file, if parsing fails). See <a class="xref" href="loading.html#manual.loading.errors" title="Handling parsing errors"> Handling parsing errors</a> for error handling details. </p> -<div class="note"><table border="0" summary="Note"> -<tr> -<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../images/note.png"></td> -<th align="left">Note</th> -</tr> -<tr><td align="left" valign="top"><p> - As of version 0.9, there is no function for loading XML document from wide - character path. Unfortunately, there is no portable way to do this; the - version 1.0 will provide such function only for platforms with the corresponding - functionality. You can use stream-loading functions as a workaround if - your STL implementation can open file streams via <code class="computeroutput"><span class="keyword">wchar_t</span></code> - paths. 
- </p></td></tr> -</table></div> <p> This is an example of loading XML document from file (<a href="../samples/load_file.cpp" target="_top">samples/load_file.cpp</a>): </p> @@ -122,7 +116,7 @@ <a name="manual.loading.memory"></a><a class="link" href="loading.html#manual.loading.memory" title="Loading document from memory"> Loading document from memory</a> </h3></div></div></div> <a name="xml_document::load_buffer"></a><a name="xml_document::load_buffer_inplace"></a><a name="xml_document::load_buffer_inplace_own"></a><p> - Sometimes XML data should be loaded from some other source than file, i.e. + Sometimes XML data should be loaded from some other source than a file, i.e. HTTP URL; also you may want to load XML data from file using non-standard functions, i.e. to use your virtual file system facilities or to load XML from gzip-compressed files. All these scenarios require loading document @@ -177,12 +171,12 @@ </pre> <p> It is equivalent to calling <code class="computeroutput"><span class="identifier">load_buffer</span></code> - with <code class="computeroutput"><span class="identifier">size</span> <span class="special">=</span> - <span class="identifier">strlen</span><span class="special">(</span><span class="identifier">contents</span><span class="special">)</span></code>. - This function assumes native encoding for input data, so it does not do any - encoding conversion. In general, this function is fine for loading small - documents from string literals, but has more overhead and less functionality - than buffer loading functions. 
+ with <code class="computeroutput"><span class="identifier">size</span></code> being either <code class="computeroutput"><span class="identifier">strlen</span><span class="special">(</span><span class="identifier">contents</span><span class="special">)</span></code> + or <code class="computeroutput"><span class="identifier">wcslen</span><span class="special">(</span><span class="identifier">contents</span><span class="special">)</span> <span class="special">*</span> <span class="keyword">sizeof</span><span class="special">(</span><span class="keyword">wchar_t</span><span class="special">)</span></code>, + depending on the character type. This function assumes native encoding for + input data, so it does not do any encoding conversion. In general, this function + is fine for loading small documents from string literals, but has more overhead + and less functionality than the buffer loading functions. </p> <p> This is an example of loading XML document from memory using different functions @@ -246,7 +240,7 @@ <a name="manual.loading.stream"></a><a class="link" href="loading.html#manual.loading.stream" title="Loading document from C++ IOstreams"> Loading document from C++ IOstreams</a> </h3></div></div></div> <a name="xml_document::load_stream"></a><p> - For additional interoperability pugixml provides functions for loading document + To enhance interoperability, pugixml provides functions for loading document from any object which implements C++ <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">istream</span></code> interface. This allows you to load documents from any standard C++ stream (i.e. file stream) or any third-party compliant implementation (i.e. 
Boost @@ -267,10 +261,10 @@ <p> <code class="computeroutput"><span class="identifier">load</span></code> with <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">wstream</span></code> argument treats the stream contents as a wide character stream (encoding - is always <code class="computeroutput"><span class="identifier">encoding_wchar</span></code>). - Because of this, using <code class="computeroutput"><span class="identifier">load</span></code> - with wide character streams requires careful (usually platform-specific) - stream setup (i.e. using the <code class="computeroutput"><span class="identifier">imbue</span></code> + is always <a class="link" href="loading.html#encoding_wchar">encoding_wchar</a>). Because + of this, using <code class="computeroutput"><span class="identifier">load</span></code> with + wide character streams requires careful (usually platform-specific) stream + setup (i.e. using the <code class="computeroutput"><span class="identifier">imbue</span></code> function). Generally use of wide streams is discouraged, however it provides you the ability to load documents from non-Unicode encodings, i.e. you can load Shift-JIS encoded data if you set the correct locale. @@ -330,7 +324,7 @@ </li> <li class="listitem"> <a name="status_io_error"></a><code class="literal">status_io_error</code> is returned by <code class="computeroutput"><span class="identifier">load_file</span></code> function and by <code class="computeroutput"><span class="identifier">load</span></code> functions with <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">istream</span></code>/<code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">wstream</span></code> arguments; it means that some - I/O error has occured during reading the file/stream. 
+ I/O error has occurred during reading the file/stream. </li> <li class="listitem"> <a name="status_out_of_memory"></a><code class="literal">status_out_of_memory</code> means that @@ -407,11 +401,11 @@ member, which contains the offset of last successfully parsed character if parsing failed because of an error in source data; otherwise <code class="computeroutput"><span class="identifier">offset</span></code> is 0. For parsing efficiency reasons, pugixml does not track the current line during parsing; this offset is in - units of <code class="computeroutput"><span class="identifier">pugi</span><span class="special">::</span><span class="identifier">char_t</span></code> (bytes for character mode, wide - characters for wide character mode). Many text editors support 'Go To Position' - feature - you can use it to locate the exact error position. Alternatively, - if you're loading the document from memory, you can display the error chunk - along with the error description (see the example code below). + units of <a class="link" href="dom.html#char_t">pugi::char_t</a> (bytes for character + mode, wide characters for wide character mode). Many text editors support + 'Go To Position' feature - you can use it to locate the exact error position. + Alternatively, if you're loading the document from memory, you can display + the error chunk along with the error description (see the example code below). </p> <div class="caution"><table border="0" summary="Caution"> <tr> @@ -490,9 +484,15 @@ <li class="listitem"> <a name="parse_declaration"></a><code class="literal">parse_declaration</code> determines if XML document declaration (node with type <a class="link" href="dom.html#node_declaration">node_declaration</a>) - are to be put in DOM tree. If this flag is off, it is not put in the - tree, but is still parsed and checked for correctness. This flag is - <span class="bold"><strong>off</strong></span> by default. <br><br> + is to be put in DOM tree. 
If this flag is off, it is not put in the tree, + but is still parsed and checked for correctness. This flag is <span class="bold"><strong>off</strong></span> by default. <br><br> + + </li> +<li class="listitem"> + <a name="parse_doctype"></a><code class="literal">parse_doctype</code> determines if XML document + type declaration (node with type <a class="link" href="dom.html#node_doctype">node_doctype</a>) + is to be put in DOM tree. If this flag is off, it is not put in the tree, + but is still parsed and checked for correctness. This flag is <span class="bold"><strong>off</strong></span> by default. <br><br> </li> <li class="listitem"> @@ -525,13 +525,13 @@ the cost of allocating and storing such nodes (both memory and speed-wise) can be significant. For example, after parsing XML string <code class="computeroutput"><span class="special"><</span><span class="identifier">node</span><span class="special">></span> <span class="special"><</span><span class="identifier">a</span><span class="special">/></span> <span class="special"></</span><span class="identifier">node</span><span class="special">></span></code>, <code class="computeroutput"><span class="special"><</span><span class="identifier">node</span><span class="special">></span></code> element will have three children when <code class="computeroutput"><span class="identifier">parse_ws_pcdata</span></code> - is set (child with type <code class="computeroutput"><span class="identifier">node_pcdata</span></code> + is set (child with type <a class="link" href="dom.html#node_pcdata">node_pcdata</a> and value <code class="computeroutput"><span class="string">" "</span></code>, - child with type <code class="computeroutput"><span class="identifier">node_element</span></code> - and name <code class="computeroutput"><span class="string">"a"</span></code>, and - another child with type <code class="computeroutput"><span class="identifier">node_pcdata</span></code> - and value <code class="computeroutput"><span 
class="string">" "</span></code>), - and only one child when <code class="computeroutput"><span class="identifier">parse_ws_pcdata</span></code> + child with type <a class="link" href="dom.html#node_element">node_element</a> and + name <code class="computeroutput"><span class="string">"a"</span></code>, and another + child with type <a class="link" href="dom.html#node_pcdata">node_pcdata</a> and value + <code class="computeroutput"><span class="string">" "</span></code>), and only + one child when <code class="computeroutput"><span class="identifier">parse_ws_pcdata</span></code> is not set. This flag is <span class="bold"><strong>off</strong></span> by default. </li> </ul></div> @@ -551,7 +551,7 @@ that as pugixml does not handle DTD, the only allowed entities are predefined ones). If character/entity reference can not be expanded, it is left as is, so you can do additional processing later. Reference expansion - is performed in attribute values and PCDATA content. This flag is <span class="bold"><strong>on</strong></span> by default. <br><br> + is performed on attribute values and PCDATA content. This flag is <span class="bold"><strong>on</strong></span> by default. <br><br> </li> <li class="listitem"> @@ -569,9 +569,9 @@ if attribute value normalization should be performed for all attributes. This means, that whitespace characters (new line, tab and space) are replaced with space (<code class="computeroutput"><span class="char">' '</span></code>). - New line characters are always treated as if <code class="computeroutput"><span class="identifier">parse_eol</span></code> + New line characters are always treated as if <a class="link" href="loading.html#parse_eol">parse_eol</a> is set, i.e. <code class="computeroutput"><span class="special">\</span><span class="identifier">r</span><span class="special">\</span><span class="identifier">n</span></code> - is converted to single space. 
This flag is <span class="bold"><strong>on</strong></span> + is converted to a single space. This flag is <span class="bold"><strong>on</strong></span> by default. <br><br> </li> @@ -579,10 +579,10 @@ <a name="parse_wnorm_attribute"></a><code class="literal">parse_wnorm_attribute</code> determines if extended attribute value normalization should be performed for all attributes. This means, that after attribute values are normalized as - if <code class="computeroutput"><span class="identifier">parse_wconv_attribute</span></code> + if <a class="link" href="loading.html#parse_wconv_attribute">parse_wconv_attribute</a> was set, leading and trailing space characters are removed, and all sequences of space characters are replaced by a single space character. The value - of <code class="computeroutput"><span class="identifier">parse_wconv_attribute</span></code> + of <a class="link" href="loading.html#parse_wconv_attribute">parse_wconv_attribute</a> has no effect if this flag is on. This flag is <span class="bold"><strong>off</strong></span> by default. </li> @@ -595,24 +595,25 @@ <tr><td align="left" valign="top"><p> <code class="computeroutput"><span class="identifier">parse_wconv_attribute</span></code> option performs transformations that are required by W3C specification for attributes - that are declared as <code class="literal">CDATA</code>; <code class="computeroutput"><span class="identifier">parse_wnorm_attribute</span></code> + that are declared as <code class="literal">CDATA</code>; <a class="link" href="loading.html#parse_wnorm_attribute">parse_wnorm_attribute</a> performs transformations required for <code class="literal">NMTOKENS</code> attributes. 
- In the absence of document type declaration all attributes behave as if - they are declared as <code class="literal">CDATA</code>, thus <code class="computeroutput"><span class="identifier">parse_wconv_attribute</span></code> + In the absence of document type declaration all attributes should behave + as if they are declared as <code class="literal">CDATA</code>, thus <a class="link" href="loading.html#parse_wconv_attribute">parse_wconv_attribute</a> is the default option. </p></td></tr> </table></div> <p> - Additionally there are two predefined option masks: + Additionally there are three predefined option masks: </p> <div class="itemizedlist"><ul class="itemizedlist" type="disc"> <li class="listitem"> <a name="parse_minimal"></a><code class="literal">parse_minimal</code> has all options turned off. This option mask means that pugixml does not add declaration nodes, - PI nodes, CDATA sections and comments to the resulting tree and does - not perform any conversion for input data, so theoretically it is the - fastest mode. However, as discussed above, in practice <code class="computeroutput"><span class="identifier">parse_default</span></code> is usually equally fast. - <br><br> + document type declaration nodes, PI nodes, CDATA sections and comments + to the resulting tree and does not perform any conversion for input data, + so theoretically it is the fastest mode. However, as mentioned above, + in practice <a class="link" href="loading.html#parse_default">parse_default</a> is usually + equally fast. <br><br> </li> <li class="listitem"> @@ -622,7 +623,18 @@ entity reference expansion, replacing whitespace characters with spaces in attribute values and performing EOL handling. Note, that PCDATA sections consisting only of whitespace characters are not parsed (by default) - for performance reasons. + for performance reasons. 
<br><br> + + </li> +<li class="listitem"> + <a name="parse_full"></a><code class="literal">parse_full</code> is the set of flags which adds + nodes of all types to the resulting tree and performs default conversions + for input data. It includes parsing CDATA sections, comments, PI nodes, + document declaration node and document type declaration node, performing + character and entity reference expansion, replacing whitespace characters + with spaces in attribute values and performing EOL handling. Note, that + PCDATA sections consisting only of whitespace characters are not parsed + in this mode. </li> </ul></div> <p> @@ -705,36 +717,36 @@ </li> <li class="listitem"> <a name="encoding_utf8"></a><code class="literal">encoding_utf8</code> corresponds to UTF-8 encoding - as defined in Unicode standard; UTF-8 sequences with length equal to - 5 or 6 are not standard and are rejected. + as defined in the Unicode standard; UTF-8 sequences with length equal + to 5 or 6 are not standard and are rejected. </li> <li class="listitem"> <a name="encoding_utf16_le"></a><code class="literal">encoding_utf16_le</code> corresponds to - little-endian UTF-16 encoding as defined in Unicode standard; surrogate + little-endian UTF-16 encoding as defined in the Unicode standard; surrogate pairs are supported. </li> <li class="listitem"> <a name="encoding_utf16_be"></a><code class="literal">encoding_utf16_be</code> corresponds to - big-endian UTF-16 encoding as defined in Unicode standard; surrogate + big-endian UTF-16 encoding as defined in the Unicode standard; surrogate pairs are supported. </li> <li class="listitem"> <a name="encoding_utf16"></a><code class="literal">encoding_utf16</code> corresponds to UTF-16 - encoding as defined in Unicode standard; the endianness is assumed to - be that of target platform. + encoding as defined in the Unicode standard; the endianness is assumed + to be that of the target platform. 
</li> <li class="listitem"> <a name="encoding_utf32_le"></a><code class="literal">encoding_utf32_le</code> corresponds to - little-endian UTF-32 encoding as defined in Unicode standard. + little-endian UTF-32 encoding as defined in the Unicode standard. </li> <li class="listitem"> <a name="encoding_utf32_be"></a><code class="literal">encoding_utf32_be</code> corresponds to - big-endian UTF-32 encoding as defined in Unicode standard. + big-endian UTF-32 encoding as defined in the Unicode standard. </li> <li class="listitem"> <a name="encoding_utf32"></a><code class="literal">encoding_utf32</code> corresponds to UTF-32 - encoding as defined in Unicode standard; the endianness is assumed to - be that of target platform. + encoding as defined in the Unicode standard; the endianness is assumed + to be that of the target platform. </li> <li class="listitem"> <a name="encoding_wchar"></a><code class="literal">encoding_wchar</code> corresponds to the encoding @@ -823,7 +835,8 @@ </tr></table> <hr> <table width="100%"><tr> -<td>pugixml 0.9 manual | +<td> +<a href="http://pugixml.org/">pugixml 1.0</a> manual | <a href="../manual.html">Overview</a> | <a href="install.html">Installation</a> | Document: |
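The manual text in this diff says that pugixml does not track the current line during parsing and instead reports the error position as an offset in `pugi::char_t` units. Below is a minimal sketch of turning such an offset into a 1-based line and column. It assumes character (byte) mode and that the original buffer is still in memory. `LineCol` and `offset_to_line_col` are hypothetical names used for illustration; they are not part of pugixml.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper types, not part of pugixml.
struct LineCol {
    std::size_t line;    // 1-based line number
    std::size_t column;  // 1-based column, counted in bytes
};

// Walks the buffer up to `offset` and counts '\n' characters.
// Assumes byte offsets (character mode); a '\r' is counted as an ordinary byte.
LineCol offset_to_line_col(const std::string& buffer, std::size_t offset) {
    // Clamp so that an out-of-range offset maps to the position just past the end.
    if (offset > buffer.size()) offset = buffer.size();

    LineCol pos{1, 1};
    for (std::size_t i = 0; i < offset; ++i) {
        if (buffer[i] == '\n') {
            ++pos.line;
            pos.column = 1;
        } else {
            ++pos.column;
        }
    }
    return pos;
}
```

In a real program the offset would presumably come from the `offset` member of the parse result described above, and the buffer would be the one passed to the buffer-loading function. In wide character mode the same idea should apply with `std::wstring`, though that variant is not shown here.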