From c55e5512355d23483d521d7c7dd38e67ba7835f9 Mon Sep 17 00:00:00 2001
From: Arseny Kapoulkine
Date: Thu, 13 Aug 2015 14:03:10 +0100
Subject: docs: Add PUGIXML_COMPACT documentation

Also add PUGIXML_COMPACT to pugiconfig.hpp
---
 docs/manual.adoc   | 14 ++++++++++++++
 docs/manual.html   | 31 +++++++++++++++++++++++++++----
 src/pugiconfig.hpp |  3 +++
 3 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/docs/manual.adoc b/docs/manual.adoc
index 62b8f4d..cd3d8f8 100644
--- a/docs/manual.adoc
+++ b/docs/manual.adoc
@@ -216,6 +216,8 @@ pugixml uses several defines to control the compilation process. There are two w

 [[PUGIXML_WCHAR_MODE]]`PUGIXML_WCHAR_MODE` define toggles between UTF-8 style interface (the in-memory text encoding is assumed to be UTF-8, most functions use `char` as character type) and UTF-16/32 style interface (the in-memory text encoding is assumed to be UTF-16/32, depending on `wchar_t` size, most functions use `wchar_t` as character type). See <> for more details.

+[[PUGIXML_COMPACT]]`PUGIXML_COMPACT` define activates a different internal representation of document storage that is much more memory efficient for documents with a lot of markup (i.e. nodes and attributes), but is slightly slower to parse and access. For details see <<dom.memory.compact>>.
+
 [[PUGIXML_NO_XPATH]]`PUGIXML_NO_XPATH` define disables XPath. Both XPath interfaces and XPath implementation are excluded from compilation. This option is provided in case you do not need XPath functionality and need to save code space.

 [[PUGIXML_NO_STL]]`PUGIXML_NO_STL` define disables use of STL in pugixml. The functions that operate on STL types are no longer present (i.e. load/save via iostream) if this macro is defined. This option is provided in case your target platform does not have a standard-compliant STL implementation.
@@ -536,6 +538,17 @@ When the document is loaded from file/buffer, unless an inplace loading function

 All additional memory, such as memory for document structure (node/attribute objects) and memory for node/attribute names/values is allocated in pages on the order of 32 Kb; actual objects are allocated inside the pages using a memory management scheme optimized for fast allocation/deallocation of many small objects. Because of the scheme specifics, the pages are only destroyed if all objects inside them are destroyed; also, generally destroying an object does not mean that subsequent object creation will reuse the same memory. This means that it is possible to devise a usage scheme which will lead to higher memory usage than expected; one example is adding a lot of nodes, and then removing all even-numbered ones; not a single page is reclaimed in the process. However this is an example specifically crafted to produce unsatisfying behavior; in all practical usage scenarios the memory consumption is less than that of a general-purpose allocator because allocation meta-data is very small in size.

+[[dom.memory.compact]]
+==== Compact mode
+
+By default nodes and attributes are optimized for efficiency of access. This can cause them to take a significant amount of memory - for documents with a lot of nodes and not a lot of contents (short attribute values/node text), and depending on the pointer size, the document structure can take noticeably more memory than the document itself (e.g. on a 64-bit platform in UTF-8 mode a markup-heavy document with the file size of 2.1 Mb can use 2.1 Mb for document buffer and 8.3 Mb for document structure).
+
+If you are processing big documents or your platform is memory constrained and you're willing to sacrifice a bit of performance for memory, you can compile pugixml with `PUGIXML_COMPACT` define which will activate compact mode. Compact mode uses a different representation of the document structure that assumes locality of reference between nodes and attributes to optimize memory usage. As a result you get significantly smaller node/attribute objects; usually most objects in most documents don't require additional storage, but in the worst case - if assumptions about locality of reference don't hold - additional memory will be allocated to store the extra data required.
+
+The compact storage supports all existing operations - including tree modification - with the same amortized complexity (that is, all basic document manipulations are still O(1) on average). The operations are slightly slower; you can usually expect 10-50% slowdown in terms of processing time unless your processing was memory-bound.
+
+On 32-bit architectures document structure in compact mode is typically reduced by around 2.5x; on 64-bit architectures the ratio is around 5x. Thus for big markup-heavy documents compact mode can make the difference between the processing of a multi-gigabyte document running completely from RAM vs requiring swapping to disk. Even if the document fits into memory, compact storage can use CPU caches more efficiently by taking less space and causing fewer cache/TLB misses.
+
 [[loading]]
 == Loading document

@@ -2461,6 +2474,7 @@ This is the reference for all macros, types, enumerations, classes and functions
 [source,subs="+macros"]
 ----
 #define +++PUGIXML_WCHAR_MODE+++
+#define +++PUGIXML_COMPACT+++
 #define +++PUGIXML_NO_XPATH+++
 #define +++PUGIXML_NO_STL+++
 #define +++PUGIXML_NO_EXCEPTIONS+++
diff --git a/docs/manual.html b/docs/manual.html
index 99cc654..0c679a7 100644
--- a/docs/manual.html
+++ b/docs/manual.html
@@ -930,6 +930,9 @@ can include pugixml.cpp in your project (see B

PUGIXML_WCHAR_MODE define toggles between UTF-8 style interface (the in-memory text encoding is assumed to be UTF-8, most functions use char as character type) and UTF-16/32 style interface (the in-memory text encoding is assumed to be UTF-16/32, depending on wchar_t size, most functions use wchar_t as character type). See Unicode interface for more details.

+

PUGIXML_COMPACT define activates a different internal representation of document storage that is much more memory efficient for documents with a lot of markup (i.e. nodes and attributes), but is slightly slower to parse and access. For details see Compact mode.

+
+

PUGIXML_NO_XPATH define disables XPath. Both XPath interfaces and XPath implementation are excluded from compilation. This option is provided in case you do not need XPath functionality and need to save code space.

@@ -1215,7 +1218,7 @@ Both xml_node and xml_attribute have the default const

-If you want to use xml_node or xml_attribute objects as keys in hash-based associative containers, you can use the hash_value member functions. They return the hash values that are guaranteed to be the same for all handles to the same underlying object. The hash value for null handles is 0.

+If you want to use xml_node or xml_attribute objects as keys in hash-based associative containers, you can use the hash_value member functions. They return the hash values that are guaranteed to be the same for all handles to the same underlying object. The hash value for null handles is 0. Note that the hash value does not depend on the content of the node, only on the location of the underlying structure in memory - this means that loading the same document twice will likely produce different hash values, and copying the node will not preserve the hash.

@@ -1382,7 +1385,7 @@ You can use the following accessor functions to change or get current memory man

-

Allocation function is called with the size (in bytes) as an argument and should return a pointer to a memory block with alignment that is suitable for storage of primitive types (usually a maximum of void* and double types alignment is sufficient) and size that is greater than or equal to the requested one. If the allocation fails, the function has to return null pointer (throwing an exception from allocation function results in undefined behavior).

+

Allocation function is called with the size (in bytes) as an argument and should return a pointer to a memory block with alignment that is suitable for storage of primitive types (usually a maximum of void* and double types alignment is sufficient) and size that is greater than or equal to the requested one. If the allocation fails, the function has to either return a null pointer or throw an exception.

Deallocation function is called with the pointer that was returned by some call to allocation function; it is never called with a null pointer. If memory management functions are not thread-safe, library thread safety is not guaranteed.
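The allocation contract described above can be sketched as a pair of functions. This is a minimal illustration, not pugixml code: `counting_allocate`, `counting_deallocate` and `g_live_blocks` are hypothetical names, and the sketch takes the return-null route on failure rather than throwing. Functions of this shape are what `pugi::set_memory_management_functions` expects to receive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical bookkeeping, not part of pugixml: number of blocks currently allocated.
// The counter is a plain integer, so this sketch is NOT thread-safe.
static std::size_t g_live_blocks = 0;

// Allocation function: takes the size in bytes and returns a block of at least
// that size. std::malloc already returns memory aligned for any primitive type.
void* counting_allocate(std::size_t size)
{
    void* ptr = std::malloc(size);
    if (ptr) ++g_live_blocks;
    return ptr; // null pointer signals allocation failure
}

// Deallocation function: per the text above it is never called with a null pointer.
void counting_deallocate(void* ptr)
{
    assert(ptr != nullptr);
    --g_live_blocks;
    std::free(ptr);
}

// With the real library these would be registered once, before any document is created:
//   pugi::set_memory_management_functions(counting_allocate, counting_deallocate);
```

Because the counter is not atomic, this version only suits single-threaded use; as noted above, library thread safety depends on the memory management functions being thread-safe.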

@@ -1446,6 +1449,21 @@ You can use the following accessor functions to change or get current memory man

All additional memory, such as memory for document structure (node/attribute objects) and memory for node/attribute names/values is allocated in pages on the order of 32 Kb; actual objects are allocated inside the pages using a memory management scheme optimized for fast allocation/deallocation of many small objects. Because of the scheme specifics, the pages are only destroyed if all objects inside them are destroyed; also, generally destroying an object does not mean that subsequent object creation will reuse the same memory. This means that it is possible to devise a usage scheme which will lead to higher memory usage than expected; one example is adding a lot of nodes, and then removing all even-numbered ones; not a single page is reclaimed in the process. However this is an example specifically crafted to produce unsatisfying behavior; in all practical usage scenarios the memory consumption is less than that of a general-purpose allocator because allocation meta-data is very small in size.

+
+

3.6.4. Compact mode

+
+

By default nodes and attributes are optimized for efficiency of access. This can cause them to take a significant amount of memory - for documents with a lot of nodes and not a lot of contents (short attribute values/node text), and depending on the pointer size, the document structure can take noticeably more memory than the document itself (e.g. on a 64-bit platform in UTF-8 mode a markup-heavy document with the file size of 2.1 Mb can use 2.1 Mb for document buffer and 8.3 Mb for document structure).

+
+
+

If you are processing big documents or your platform is memory constrained and you’re willing to sacrifice a bit of performance for memory, you can compile pugixml with PUGIXML_COMPACT define which will activate compact mode. Compact mode uses a different representation of the document structure that assumes locality of reference between nodes and attributes to optimize memory usage. As a result you get significantly smaller node/attribute objects; usually most objects in most documents don’t require additional storage, but in the worst case - if assumptions about locality of reference don’t hold - additional memory will be allocated to store the extra data required.

+
+
+

The compact storage supports all existing operations - including tree modification - with the same amortized complexity (that is, all basic document manipulations are still amortized O(1)). The operations are slightly slower; you can usually expect 10-50% slowdown in terms of processing time unless your processing was memory-bound.

+
+
+

On 32-bit architectures document structure is typically reduced by around 2.5x; on 64-bit architectures the ratio is around 5x. Thus for big markup-heavy documents compact mode can make the difference between the processing of a multi-gigabyte document running completely in RAM vs requiring swapping to disk. Even if the document fits into memory, compact storage can use CPU caches more efficiently by taking less space and causing fewer cache/TLB misses.

+
+
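The ratios quoted above lend themselves to a quick back-of-the-envelope estimate. The helper below is a hypothetical sketch (the name `estimate_compact_structure_mb` is not part of pugixml) that simply divides the default-mode structure size by the approximate factors from the text; actual savings depend on the document.

```cpp
#include <cassert>

// Hypothetical helper: rough size of the document structure in compact mode,
// using the approximate reduction factors quoted in the text (2.5x on 32-bit,
// 5x on 64-bit). The document buffer itself is not affected by this estimate.
double estimate_compact_structure_mb(double default_structure_mb, bool is_64_bit)
{
    assert(default_structure_mb >= 0.0);
    const double reduction = is_64_bit ? 5.0 : 2.5;
    return default_structure_mb / reduction;
}
```

For the 64-bit example quoted earlier, 8.3 Mb of document structure would come out at roughly 1.7 Mb under this estimate, which is less than the 2.1 Mb document buffer.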
@@ -3295,7 +3313,10 @@ You should use the usual bitwise arithmetics to manipulate the bitmask: to enabl
  • -

    format_indent determines if all nodes should be indented with the indentation string (this is an additional parameter for all saving functions, and is "\t" by default). If this flag is on, before every node the indentation string is output several times, where the amount of indentation depends on the node’s depth relative to the output subtree. This flag has no effect if format_raw is enabled. This flag is on by default.

    +

    format_indent determines if all nodes should be indented with the indentation string (this is an additional parameter for all saving functions, and is "\t" by default). If this flag is on, the indentation string is printed several times before every node, where the amount of indentation depends on the node’s depth relative to the output subtree. This flag has no effect if format_raw is enabled. This flag is on by default.

    +
  • +
  • +

    format_indent_attributes determines if all attributes should be printed on a new line, indented with the indentation string according to the attribute’s depth. This flag implies format_indent. This flag has no effect if format_raw is enabled. This flag is off by default.

  • format_raw switches between formatted and raw output. If this flag is on, the nodes are not indented in any way, and also no newlines that are not part of document text are printed. Raw mode can be used for serialization where the result is not intended to be read by humans; also it can be useful if the document was parsed with parse_ws_pcdata flag, to preserve the original document formatting as much as possible. This flag is off by default.
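The flag descriptions above combine through ordinary bitwise operations. The sketch below uses stand-in constants in a hypothetical `sketch` namespace; their numeric values are illustrative assumptions, and real code should use the constants declared in pugixml.hpp instead. The helper names are also hypothetical.

```cpp
#include <cassert>

namespace sketch
{
    // Stand-in values for illustration only; the real constants live in namespace pugi.
    const unsigned int format_indent = 0x01;
    const unsigned int format_raw = 0x04;
    const unsigned int format_indent_attributes = 0x40;
    const unsigned int format_default = format_indent;
}

// Turn a flag on in the mask.
unsigned int enable_flag(unsigned int mask, unsigned int flag) { return mask | flag; }

// Turn a flag off in the mask.
unsigned int disable_flag(unsigned int mask, unsigned int flag) { return mask & ~flag; }

// Check whether every bit of the flag is set in the mask.
bool has_flag(unsigned int mask, unsigned int flag) { return (mask & flag) == flag; }

// Build a mask that asks for one attribute per line on top of the defaults.
unsigned int attributes_on_new_lines()
{
    unsigned int mask = enable_flag(sketch::format_default, sketch::format_indent_attributes);
    assert(has_flag(mask, sketch::format_indent)); // indentation stays enabled
    return mask;
}
```

With the real library, the resulting mask would be passed as the flags argument of a saving function such as `xml_document::save_file`.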

    @@ -5028,6 +5049,7 @@ If exceptions are disabled, then in the event of parsing failure the query is in
    #define PUGIXML_WCHAR_MODE
    +#define PUGIXML_COMPACT
     #define PUGIXML_NO_XPATH
     #define PUGIXML_NO_STL
     #define PUGIXML_NO_EXCEPTIONS
    @@ -5115,6 +5137,7 @@ If exceptions are disabled, then in the event of parsing failure the query is in
     
    // Formatting options bit flags:
     const unsigned int format_default
     const unsigned int format_indent
    +const unsigned int format_indent_attributes
     const unsigned int format_no_declaration
     const unsigned int format_no_escapes
     const unsigned int format_raw
    @@ -5509,7 +5532,7 @@ If exceptions are disabled, then in the event of parsing failure the query is in
     
diff --git a/src/pugiconfig.hpp b/src/pugiconfig.hpp
index 5ee5131..6edf64c 100644
--- a/src/pugiconfig.hpp
+++ b/src/pugiconfig.hpp
@@ -17,6 +17,9 @@
 // Uncomment this to enable wchar_t mode
 // #define PUGIXML_WCHAR_MODE

+// Uncomment this to enable compact mode
+// #define PUGIXML_COMPACT
+
 // Uncomment this to disable XPath
 // #define PUGIXML_NO_XPATH

--
cgit v1.2.3