xxHash 0.8.2
Extremely fast non-cryptographic hash function
Macros

#define XXH_NO_LONG_LONG
    Define this to disable 64-bit code.
#define XXH_FORCE_MEMORY_ACCESS 0
    Controls how unaligned memory is accessed.
#define XXH_SIZE_OPT 0
    Controls how much xxHash optimizes for size.
#define XXH_FORCE_ALIGN_CHECK 0
    If defined to non-zero, adds a special path for aligned inputs (XXH32() and XXH64() only).
#define XXH_NO_INLINE_HINTS 0
    When non-zero, sets all functions to static.
#define XXH3_INLINE_SECRET 0
    Determines whether to inline the XXH3 withSecret code.
#define XXH32_ENDJMP 0
    Whether to use a jump for XXH32_finalize.
#define XXH_OLD_NAMES
#define XXH_NO_STREAM
    Disables the streaming API.
#define XXH_DEBUGLEVEL 0
    Sets the debugging level.
#define XXH_CPU_LITTLE_ENDIAN XXH_isLittleEndian()
    Whether the target is little endian.
#define XXH_VECTOR XXH_SCALAR
    Overrides the vectorization implementation chosen for XXH3.
#define XXH_ACC_ALIGN 8
    Selects the minimum alignment for XXH3's accumulators.
#define XXH3_NEON_LANES XXH_ACC_NB
    Controls the NEON to scalar ratio for XXH3.

Enumerations

enum XXH_VECTOR_TYPE { XXH_SCALAR = 0, XXH_SSE2 = 1, XXH_AVX2 = 2, XXH_AVX512 = 3, XXH_NEON = 4, XXH_VSX = 5, XXH_SVE = 6 }
    Possible values for XXH_VECTOR.
Various macros to control xxHash's behavior.
#define XXH_NO_LONG_LONG
Define this to disable 64-bit code.
Useful if only using the XXH32 family and you have a strict C90 compiler.
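For illustration, a minimal sketch of a strict C90 translation unit that only needs the XXH32 family. The checksum32 wrapper is a hypothetical name, and XXH_INLINE_ALL (single-header usage) is assumed; the same defines can instead be passed with -D when building xxhash.c separately:

    #define XXH_NO_LONG_LONG   /* removes XXH64 and XXH3, which need a 64-bit type */
    #define XXH_INLINE_ALL
    #include "xxhash.h"

    /* Hypothetical wrapper: 32-bit hash of a buffer with seed 0. */
    unsigned int checksum32(const void* buf, size_t len)
    {
        return (unsigned int)XXH32(buf, len, 0);
    }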
#define XXH_FORCE_MEMORY_ACCESS 0
Controls how unaligned memory is accessed.
By default, access to unaligned memory is controlled by memcpy(), which is safe and portable.
Unfortunately, on some target/compiler combinations, the generated assembly is sub-optimal.
The switch below allows selection of a different access method in the search for improved performance.
Possible options:

- XXH_FORCE_MEMORY_ACCESS=0 (default): memcpy()
  Safe and portable. Note that most modern compilers will eliminate the function call and treat it as an unaligned access.
- XXH_FORCE_MEMORY_ACCESS=1: __attribute__((aligned(1)))
  Relies on a compiler extension, and is generally as fast as or faster than memcpy().
- XXH_FORCE_MEMORY_ACCESS=2: Direct cast
  Casts and dereferences the unaligned pointer directly.
- XXH_FORCE_MEMORY_ACCESS=3: Byteshift
  Assembles values byte by byte; this avoids small memcpy() calls that old compilers fail to inline, and it might also be faster on big-endian systems which lack a native byteswap instruction. However, some compilers will emit literal byteshifts even if the target supports unaligned access.

Prefer these methods in priority order (0 > 3 > 1 > 2).
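As an illustrative sketch (not a recommendation), the byteshift method could be selected as shown below. The macro only takes effect in the translation unit that compiles the xxHash implementation, so -DXXH_FORCE_MEMORY_ACCESS=3 when building xxhash.c is equivalent:

    /* Example only: force the byteshift access method, e.g. for an old
     * compiler that neither inlines small memcpy() calls nor handles
     * unaligned loads well. Profile before overriding the default. */
    #define XXH_FORCE_MEMORY_ACCESS 3
    #define XXH_INLINE_ALL
    #include "xxhash.h"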
#define XXH_SIZE_OPT 0
Controls how much xxHash optimizes for size.
xxHash, when compiled, tends to result in a rather large binary size. This is mostly due to heavy usage of forced inlining and constant folding in the XXH3 family to increase performance.
However, some developers prefer size over speed. This option can significantly reduce the size of the generated code. When using the -Os or -Oz options on GCC or Clang, this is defined to 1 by default; otherwise it is defined to 0.
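A minimal sketch of a size-focused build, assuming single-header usage via XXH_INLINE_ALL (defining XXH_SIZE_OPT with -D when compiling xxhash.c works the same way); the possible values are described in the list below:

    /* Trade speed for the smallest possible xxHash code paths. */
    #define XXH_SIZE_OPT 2
    #define XXH_INLINE_ALL
    #include "xxhash.h"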
Most of these size optimizations can be controlled manually.
This is a number from 0-2.
- XXH_SIZE_OPT == 0: Default. xxHash makes no size optimizations. Speed comes first.
- XXH_SIZE_OPT == 1: Default for -Os and -Oz. xxHash is more conservative and disables hacks that increase code size. It implies the options XXH_NO_INLINE_HINTS == 1, XXH_FORCE_ALIGN_CHECK == 0, and XXH3_NEON_LANES == 8 if they are not already defined.
- XXH_SIZE_OPT == 2: xxHash tries to make itself as small as possible. Performance may cry. For example, the single shot functions just use the streaming API.

#define XXH_FORCE_ALIGN_CHECK 0
If defined to non-zero, adds a special path for aligned inputs (XXH32() and XXH64() only).
This is an important performance trick for architectures without decent unaligned memory access performance.
It checks for input alignment, and when conditions are met, uses a "fast path" employing direct 32-bit/64-bit reads, resulting in dramatically faster read speed.
The check costs one initial branch per hash, which is generally negligible, but not zero.
Moreover, it's not useful to generate an additional code path if memory access uses the same instruction for both aligned and unaligned addresses (e.g. x86 and aarch64).
In these cases, the alignment check can be removed by setting this macro to 0. Then the code will always use unaligned memory access. The alignment check is automatically disabled on x86, x64, ARM64, and some ARM chips, which are platforms known to offer good unaligned memory access performance.
It is also disabled by default when XXH_SIZE_OPT >= 1.
This option does not affect XXH3 (only XXH32 and XXH64).
#define XXH_NO_INLINE_HINTS 0
When non-zero, sets all functions to static.
By default, xxHash tries to force the compiler to inline almost all internal functions.
This can usually improve performance due to reduced jumping and improved constant folding, but significantly increases the size of the binary which might not be favorable.
Additionally, sometimes the forced inlining can be detrimental to performance, depending on the architecture.
XXH_NO_INLINE_HINTS marks all internal functions as static, giving the compiler full control on whether to inline or not.
When not optimizing (-O0), when using -fno-inline with GCC or Clang, or if XXH_SIZE_OPT >= 1, this will automatically be defined.
#define XXH3_INLINE_SECRET 0
Determines whether to inline the XXH3 withSecret code.
When the secret size is known, the compiler can improve the performance of XXH3_64bits_withSecret() and XXH3_128bits_withSecret().
However, if the secret size is not known, it doesn't have any benefit. This happens when xxHash is compiled into a global symbol. Therefore, if XXH_INLINE_ALL is not defined, this will be defined to 0.
Additionally, this defaults to 0 on GCC 12+, which has an issue with function pointers that are sometimes force-inlined at -Og, an optimization level that is impossible to detect automatically.
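For illustration, a sketch of the case described above: a custom secret whose size is a compile-time constant. The kSecret and hash_with_secret names are hypothetical, and the secret contents shown are placeholders; a real secret should be high-entropy (for example, produced by XXH3_generateSecret()) and at least XXH3_SECRET_SIZE_MIN bytes long:

    #define XXH_INLINE_ALL
    #include "xxhash.h"

    /* Placeholder secret: real contents should be high-entropy bytes. */
    static const unsigned char kSecret[XXH3_SECRET_SIZE_MIN] = { 0x9E };

    XXH64_hash_t hash_with_secret(const void* data, size_t len)
    {
        /* sizeof(kSecret) is a compile-time constant, so with
         * XXH3_INLINE_SECRET enabled the call can be specialized. */
        return XXH3_64bits_withSecret(data, len, kSecret, sizeof(kSecret));
    }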
#define XXH32_ENDJMP 0
Whether to use a jump for XXH32_finalize.
For performance, XXH32_finalize uses multiple branches in the finalizer. This is generally faster, but depending on the exact architecture, a single jump (jmp) may be preferable.
This setting is only likely to make a difference for very small inputs.
#define XXH_NO_STREAM
Disables the streaming API.
When xxHash is not inlined and the streaming functions are not used, disabling the streaming functions can improve code size significantly, especially with the XXH3 family which tends to make constant folded copies of itself.
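A minimal sketch, assuming the whole input is available in one buffer (hash_buffer is a hypothetical wrapper): with XXH_NO_STREAM, state-based functions such as XXH3_createState() and XXH3_64bits_update() are compiled out, leaving only the one-shot entry points:

    #define XXH_NO_STREAM
    #define XXH_INLINE_ALL
    #include "xxhash.h"

    XXH64_hash_t hash_buffer(const void* data, size_t len)
    {
        return XXH3_64bits(data, len);  /* one-shot hash of the full buffer */
    }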
#define XXH_DEBUGLEVEL 0
Sets the debugging level.
XXH_DEBUGLEVEL is expected to be defined externally, typically via the compiler's command line options. The value must be a number.
#define XXH_CPU_LITTLE_ENDIAN XXH_isLittleEndian()
Whether the target is little endian.
Defined to 1 if the target is little endian, or 0 if it is big endian. It can be defined externally, for example on the compiler command line.
If it is not defined, a runtime check (which is usually constant folded) is used instead.
#define XXH_VECTOR XXH_SCALAR
Overrides the vectorization implementation chosen for XXH3.
Can be defined to 0 to disable SIMD, or to any of the values listed in XXH_VECTOR_TYPE.
If this is not defined, xxHash uses predefined macros to determine the best implementation.
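As a hedged example, the scalar path could be forced as shown below, e.g. to rule out SIMD-specific issues or to work around a target the auto-detection mishandles; any XXH_VECTOR_TYPE value (XXH_SSE2, XXH_AVX2, XXH_NEON, ...) can be used instead, provided the target actually supports it:

    #define XXH_VECTOR XXH_SCALAR   /* or 0; disables SIMD for XXH3 */
    #define XXH_INLINE_ALL
    #include "xxhash.h"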
#define XXH_ACC_ALIGN 8
Selects the minimum alignment for XXH3's accumulators.
When using SIMD, this should match the alignment required for said vector type, so, for example, 32 for AVX2.
Default: Auto detected.
#define XXH3_NEON_LANES XXH_ACC_NB
Controls the NEON to scalar ratio for XXH3.
This can be set to 2, 4, 6, or 8.
ARM Cortex CPUs are very sensitive to how their pipelines are used.
For example, the Cortex-A73 can dispatch 3 micro-ops per cycle, but only 2 of those can be NEON. If you are only using NEON instructions, you are only using 2/3 of the CPU bandwidth.
This is even more noticeable on the more advanced cores like the Cortex-A76 which can dispatch 8 micro-ops per cycle, but still only 2 NEON micro-ops at once.
Therefore, to make the most out of the pipeline, it is beneficial to run 6 NEON lanes and 2 scalar lanes, which is chosen by default.
This does not apply to Apple processors or 32-bit processors, which run better with full NEON. These will default to 8. Additionally, size-optimized builds run 8 lanes.
This change benefits CPUs with large micro-op buffers without negatively affecting most other CPUs:
Chipset               | Dispatch type       | NEON only | 6:2 hybrid | Diff.
----------------------|---------------------|-----------|------------|------
Snapdragon 730 (A76)  | 2 NEON/8 micro-ops  | 8.8 GB/s  | 10.1 GB/s  | ~16%
Snapdragon 835 (A73)  | 2 NEON/3 micro-ops  | 5.1 GB/s  | 5.3 GB/s   | ~5%
Marvell PXA1928 (A53) | In-order dual-issue | 1.9 GB/s  | 1.9 GB/s   | 0%
Apple M1              | 4 NEON/8 micro-ops  | 37.3 GB/s | 36.1 GB/s  | ~-3%
It also seems to fix some bad codegen on GCC, making it almost as fast as clang.
When using WASM SIMD128, if this is set to 2 or 6, SIMDe will scalarize two of the lanes, meaning it effectively becomes a worse version of 4.
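As an illustrative sketch only: on a core where benchmarks show full NEON winning, the split can be pinned explicitly (the value must be 2, 4, 6, or 8; benchmark on the actual target before overriding the default heuristic):

    #define XXH3_NEON_LANES 8   /* all accumulator lanes on NEON, no scalar lanes */
    #define XXH_INLINE_ALL
    #include "xxhash.h"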
enum XXH_VECTOR_TYPE
Possible values for XXH_VECTOR.
Note that these are actually implemented as macros.
If XXH_VECTOR is not defined, it is detected automatically. The internal macro XXH_X86DISPATCH overrides this.