From ce4ac177801e31ffd309c91cb9e464d8cab205a3 Mon Sep 17 00:00:00 2001
From: Arseny Kapoulkine
Date: Thu, 13 Aug 2015 14:07:19 +0100
Subject: docs: Clarify UTF-8 vs wchar_t memory efficiency

---
 docs/manual.adoc | 2 +-
 docs/manual.html | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/manual.adoc b/docs/manual.adoc
index cd3d8f8..af48a10 100644
--- a/docs/manual.adoc
+++ b/docs/manual.adoc
@@ -420,7 +420,7 @@ bool xml_node::set_name(const wchar_t* value);
 [[char_t]][[string_t]] There is a special type, `pugi::char_t`, that is defined as the character type and depends on the library configuration; it will be also used in the documentation hereafter. There is also a type `pugi::string_t`, which is defined as the STL string of the character type; it corresponds to `std::string` in char mode and to `std::wstring` in wchar_t mode.
-In addition to the interface, the internal implementation changes to store XML data as `pugi::char_t`; this means that these two modes have different memory usage characteristics. The conversion to `pugi::char_t` upon document loading and from `pugi::char_t` upon document saving happen automatically, which also carries minor performance penalty. The general advice however is to select the character mode based on usage scenario, i.e. if UTF-8 is inconvenient to process and most of your XML data is non-ASCII, wchar_t mode is probably a better choice.
+In addition to the interface, the internal implementation changes to store XML data as `pugi::char_t`; this means that these two modes have different memory usage characteristics - generally UTF-8 mode is more memory and performance efficient, especially if `sizeof(wchar_t)` is 4. The conversion to `pugi::char_t` upon document loading and from `pugi::char_t` upon document saving happen automatically, which also carries minor performance penalty. The general advice however is to select the character mode based on usage scenario, i.e. if UTF-8 is inconvenient to process and most of your XML data is non-ASCII, wchar_t mode is probably a better choice.
 [[as_utf8]][[as_wide]] There are cases when you'll have to convert string data between UTF-8 and wchar_t encodings; the following helper functions are provided for such purposes:
diff --git a/docs/manual.html b/docs/manual.html
index 0c679a7..36acf43 100644
--- a/docs/manual.html
+++ b/docs/manual.html
@@ -1274,7 +1274,7 @@ If the size of wchar_t is 2, pugixml assumes UTF-16 encoding instea
 There is a special type, pugi::char_t, that is defined as the character type and depends on the library configuration; it will be also used in the documentation hereafter. There is also a type pugi::string_t, which is defined as the STL string of the character type; it corresponds to std::string in char mode and to std::wstring in wchar_t mode.

-In addition to the interface, the internal implementation changes to store XML data as pugi::char_t; this means that these two modes have different memory usage characteristics. The conversion to pugi::char_t upon document loading and from pugi::char_t upon document saving happen automatically, which also carries minor performance penalty. The general advice however is to select the character mode based on usage scenario, i.e. if UTF-8 is inconvenient to process and most of your XML data is non-ASCII, wchar_t mode is probably a better choice.

+In addition to the interface, the internal implementation changes to store XML data as pugi::char_t; this means that these two modes have different memory usage characteristics - generally UTF-8 mode is more memory and performance efficient, especially if sizeof(wchar_t) is 4. The conversion to pugi::char_t upon document loading and from pugi::char_t upon document saving happen automatically, which also carries minor performance penalty. The general advice however is to select the character mode based on usage scenario, i.e. if UTF-8 is inconvenient to process and most of your XML data is non-ASCII, wchar_t mode is probably a better choice.

@@ -1458,10 +1458,10 @@ You can use the following accessor functions to change or get current memory man

If you are processing big documents or your platform is memory constrained and you’re willing to sacrifice a bit of performance for memory, you can compile pugixml with PUGIXML_COMPACT define which will activate compact mode. Compact mode uses a different representation of the document structure that assumes locality of reference between nodes and attributes to optimize memory usage. As a result you get significantly smaller node/attribute objects; usually most objects in most documents don’t require additional storage, but in the worst case - if assumptions about locality of reference don’t hold - additional memory will be allocated to store the extra data required.

-The compact storage supports all existing operations - including tree modification - with the same amortized complexity (that is, all basic document manipulations are still amortized O(1)). The operations are slightly slower; you can usually expect 10-50% slowdown in terms of processing time unless your processing was memory-bound.

+The compact storage supports all existing operations - including tree modification - with the same amortized complexity (that is, all basic document manipulations are still O(1) on average). The operations are slightly slower; you can usually expect 10-50% slowdown in terms of processing time unless your processing was memory-bound.

-On 32-bit architectures document structure is typically reduced by around 2.5x; on 64-bit architectures the ratio is around 5x. Thus for big markup-heavy documents compact mode can make the difference between the processing of a multi-gigabyte document running completely in RAM vs requiring swapping to disk. Even if the document fits into memory, compact storage can use CPU caches more efficiently by taking less space and causing less cache/TLB misses.

+On 32-bit architectures document structure in compact mode is typically reduced by around 2.5x; on 64-bit architectures the ratio is around 5x. Thus for big markup-heavy documents compact mode can make the difference between the processing of a multi-gigabyte document running completely from RAM vs requiring swapping to disk. Even if the document fits into memory, compact storage can use CPU caches more efficiently by taking less space and causing less cache/TLB misses.

@@ -5532,7 +5532,7 @@ If exceptions are disabled, then in the event of parsing failure the query is in
--
cgit v1.2.3