From 4f2ad720c867f29f3156b953eadfe9be5efb511a Mon Sep 17 00:00:00 2001
From: Arseny Kapoulkine
Date: Mon, 21 Aug 2017 20:52:58 -0700
Subject: docs: Update encoding conversion description

We support Latin-1 and automatically detect it by parsing the encoding from document declaration; both of these were omitted from the description of the automatic detection. Additionally, the description has been rewritten to be more concise and a bit more abstract - there's no need to specify the algorithm precisely here.

Fixes #158.
---
 docs/manual.adoc | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/docs/manual.adoc b/docs/manual.adoc
index 6688bd6..7f4fc8b 100644
--- a/docs/manual.adoc
+++ b/docs/manual.adoc
@@ -556,7 +556,7 @@ On 32-bit architectures document structure in compact mode is typically reduced
 pugixml provides several functions for loading XML data from various places - files, C{plus}{plus} iostreams, memory buffers. All functions use an extremely fast non-validating parser. This parser is not fully W3C conformant - it can load any valid XML document, but does not perform some well-formedness checks. While considerable effort is made to reject invalid XML documents, some validation is not performed for performance reasons. Also some XML transformations (i.e. EOL handling or attribute value normalization) can impact parsing speed and thus can be disabled. However for vast majority of XML documents there is no performance difference between different parsing options. Parsing options also control whether certain XML nodes are parsed; see <> for more information.
-XML data is always converted to internal character format (see <>) before parsing. pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) and handles all encoding conversions automatically. Unless explicit encoding is specified, loading functions perform automatic encoding detection based on first few characters of XML data, so in almost all cases you do not have to specify document encoding. Encoding conversion is described in more detail in <>.
+XML data is always converted to internal character format (see <>) before parsing. pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) as well as some non-Unicode encodings (Latin-1) and handles all encoding conversions automatically. Unless explicit encoding is specified, loading functions perform automatic encoding detection based on source XML data, so in most cases you do not have to specify document encoding. Encoding conversion is described in more detail in <>.
 [[loading.file]]
 === Loading document from file
@@ -784,17 +784,9 @@ include::samples/load_options.cpp[tags=code]
 === Encodings
 [[xml_encoding]]
-pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) and handles all encoding conversions. Most loading functions accept the optional parameter `encoding`. This is a value of enumeration type `xml_encoding`, that can have the following values:
-
-* [[encoding_auto]]`encoding_auto` means that pugixml will try to guess the encoding based on source XML data. The algorithm is a modified version of the one presented in http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info[Appendix F.1 of XML recommendation]; it tries to match the first few bytes of input data with the following patterns in strict order:
-** If first four bytes match UTF-32 BOM (Byte Order Mark), encoding is assumed to be UTF-32 with the endianness equal to that of BOM;
-** If first two bytes match UTF-16 BOM, encoding is assumed to be UTF-16 with the endianness equal to that of BOM;
-** If first three bytes match UTF-8 BOM, encoding is assumed to be UTF-8;
-** If first four bytes match UTF-32 representation of `<`, encoding is assumed to be UTF-32 with the corresponding endianness;
-** If first four bytes match UTF-16 representation of `