Skip to content

[Hashing] Use a non-deterministic seed if LLVM_ENABLE_ABI_BREAKING_CHECKS #96282

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 28, 2024

Conversation

MaskRay
Copy link
Member

@MaskRay MaskRay commented Jun 21, 2024

Hashing.h provides hash_value/hash_combine/hash_combine_range, which are
primarily used by DenseMap<StringRef, X>

Users shouldn't rely on specific hash values due to size_t differences
on 32-bit/64-bit platforms and potential algorithm changes.
set_fixed_execution_hash_seed is provided but it has never been used.

In LLVM_ENABLE_ABI_BREAKING_CHECKS builds, take the the address of a
static storage duration variable as the seed like
absl/hash/internal/hash.h kSeed. (See https://reviews.llvm.org/D93931
for workaround for older Clang. Mach-O x86-64 forces PIC, so absl's
__apple_build_version__ check is unnecessary.)

LLVM_ENABLE_ABI_BREAKING_CHECKS defaults to WITH_ASSERTS and is
enabled in an assertion build.

In a non-assertion build, get_execution_seed returns the fixed value
regardless of NDEBUG. Removing a variable load yields noticeable
size/performance improvement.

A few users relying on the iteration order of DenseMap<StringRef, X>
have been fixed (e.g., f8f4235 c025bd1 89e8e63
86eb6bf eb8d036 0ea6b8e 58d7a6e 8ea31db
592abf2 6644975).
From my experience fixing StringMap
iteration order issues, the scale of issues is similar.

MaskRay added 2 commits June 20, 2024 23:46
Created using spr 1.3.5-bogner

[skip ci]
Created using spr 1.3.5-bogner
@llvmbot
Copy link
Member

llvmbot commented Jun 21, 2024

@llvm/pr-subscribers-llvm-support

Author: Fangrui Song (MaskRay)

Changes

Hashing.h provides hash_value/hash_combine/hash_combine_range, which are
primarily used by

  • DenseMap&lt;StringRef, X&gt;
  • FoldingSetNodeIDRef::ComputeHash (will be fixed by #96136)

Users shouldn't rely on specific hash values due to potential algorithm
changes. set_fixed_execution_hash_seed is provided but it has never
been used.

Take the the address of a static storage duration variable as the seed
like absl/hash/internal/hash.h kSeed.
(See https://reviews.llvm.org/D93931 for workaround for older Clang.
Mach-O x86-64 forces PIC, so absl's __apple_build_version__ check is
unnecessary.)

A few users relying on the iteration order of DenseMap&lt;StringRef, X&gt;
have been fixed (e.g., f8f4235 c025bd1 89e8e63
86eb6bf eb8d036 0ea6b8e 58d7a6e 8ea31db
255986e).
From my experience fixing DenseMap&lt;A *, X&gt; and
StringMap
iteration order issues, the scale of breakage is smaller.


Full diff: https://github.com/llvm/llvm-project/pull/96282.diff

5 Files Affected:

  • (modified) llvm/include/llvm/ADT/Hashing.h (+11-34)
  • (modified) llvm/lib/Support/CMakeLists.txt (-1)
  • (removed) llvm/lib/Support/Hashing.cpp (-28)
  • (modified) llvm/unittests/ADT/HashingTest.cpp (-72)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Support/BUILD.gn (-1)
diff --git a/llvm/include/llvm/ADT/Hashing.h b/llvm/include/llvm/ADT/Hashing.h
index a5477362a5079..3c87894635787 100644
--- a/llvm/include/llvm/ADT/Hashing.h
+++ b/llvm/include/llvm/ADT/Hashing.h
@@ -126,23 +126,6 @@ hash_code hash_value(const std::basic_string<T> &arg);
 /// Compute a hash_code for a standard string.
 template <typename T> hash_code hash_value(const std::optional<T> &arg);
 
-/// Override the execution seed with a fixed value.
-///
-/// This hashing library uses a per-execution seed designed to change on each
-/// run with high probability in order to ensure that the hash codes are not
-/// attackable and to ensure that output which is intended to be stable does
-/// not rely on the particulars of the hash codes produced.
-///
-/// That said, there are use cases where it is important to be able to
-/// reproduce *exactly* a specific behavior. To that end, we provide a function
-/// which will forcibly set the seed to a fixed value. This must be done at the
-/// start of the program, before any hashes are computed. Also, it cannot be
-/// undone. This makes it thread-hostile and very hard to use outside of
-/// immediately on start of a simple program designed for reproducible
-/// behavior.
-void set_fixed_execution_hash_seed(uint64_t fixed_value);
-
-
 // All of the implementation details of actually computing the various hash
 // code values are held within this namespace. These routines are included in
 // the header file mainly to allow inlining and constant propagation.
@@ -322,24 +305,18 @@ struct hash_state {
   }
 };
 
-
-/// A global, fixed seed-override variable.
-///
-/// This variable can be set using the \see llvm::set_fixed_execution_seed
-/// function. See that function for details. Do not, under any circumstances,
-/// set or read this variable.
-extern uint64_t fixed_seed_override;
-
+/// The seed is non-deterministic (address of a variable) to prevent having
+/// users depend on the particular hash values. On platforms without ASLR, this
+/// is still likely non-deterministic per build.
 inline uint64_t get_execution_seed() {
-  // FIXME: This needs to be a per-execution seed. This is just a placeholder
-  // implementation. Switching to a per-execution seed is likely to flush out
-  // instability bugs and so will happen as its own commit.
-  //
-  // However, if there is a fixed seed override set the first time this is
-  // called, return that instead of the per-execution seed.
-  const uint64_t seed_prime = 0xff51afd7ed558ccdULL;
-  static uint64_t seed = fixed_seed_override ? fixed_seed_override : seed_prime;
-  return seed;
+  static const char seed = 0;
+  // Work around x86-64 negative offset folding for old Clang -fno-pic
+  // https://reviews.llvm.org/D93931
+#if !defined(__clang__) || __clang_major__ > 11
+  return static_cast<uint64_t>(reinterpret_cast<uintptr_t>(&seed));
+#else
+  return 0xff51afd7ed558ccdULL;
+#endif
 }
 
 
diff --git a/llvm/lib/Support/CMakeLists.txt b/llvm/lib/Support/CMakeLists.txt
index 0c69ac99f5bc6..31343df4d8b8b 100644
--- a/llvm/lib/Support/CMakeLists.txt
+++ b/llvm/lib/Support/CMakeLists.txt
@@ -187,7 +187,6 @@ add_llvm_component_library(LLVMSupport
   FormatVariadic.cpp
   GlobPattern.cpp
   GraphWriter.cpp
-  Hashing.cpp
   HexagonAttributeParser.cpp
   HexagonAttributes.cpp
   InitLLVM.cpp
diff --git a/llvm/lib/Support/Hashing.cpp b/llvm/lib/Support/Hashing.cpp
deleted file mode 100644
index 1b20a670434f1..0000000000000
--- a/llvm/lib/Support/Hashing.cpp
+++ /dev/null
@@ -1,28 +0,0 @@
-//===-------------- lib/Support/Hashing.cpp -------------------------------===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-//
-// This file provides implementation bits for the LLVM common hashing
-// infrastructure. Documentation and most of the other information is in the
-// header file.
-//
-//===----------------------------------------------------------------------===//
-
-#include "llvm/ADT/Hashing.h"
-
-using namespace llvm;
-
-// Provide a definition and static initializer for the fixed seed. This
-// initializer should always be zero to ensure its value can never appear to be
-// non-zero, even during dynamic initialization.
-uint64_t llvm::hashing::detail::fixed_seed_override = 0;
-
-// Implement the function for forced setting of the fixed seed.
-// FIXME: Use atomic operations here so that there is no data race.
-void llvm::set_fixed_execution_hash_seed(uint64_t fixed_value) {
-  hashing::detail::fixed_seed_override = fixed_value;
-}
diff --git a/llvm/unittests/ADT/HashingTest.cpp b/llvm/unittests/ADT/HashingTest.cpp
index ab13ee833ce55..c28356e229e66 100644
--- a/llvm/unittests/ADT/HashingTest.cpp
+++ b/llvm/unittests/ADT/HashingTest.cpp
@@ -235,78 +235,6 @@ TEST(HashingTest, HashCombineRangeLengthDiff) {
   }
 }
 
-TEST(HashingTest, HashCombineRangeGoldenTest) {
-  struct { const char *s; uint64_t hash; } golden_data[] = {
-#if SIZE_MAX == UINT64_MAX || SIZE_MAX == UINT32_MAX
-    { "a",                                0xaeb6f9d5517c61f8ULL },
-    { "ab",                               0x7ab1edb96be496b4ULL },
-    { "abc",                              0xe38e60bf19c71a3fULL },
-    { "abcde",                            0xd24461a66de97f6eULL },
-    { "abcdefgh",                         0x4ef872ec411dec9dULL },
-    { "abcdefghijklm",                    0xe8a865539f4eadfeULL },
-    { "abcdefghijklmnopqrstu",            0x261cdf85faaf4e79ULL },
-    { "abcdefghijklmnopqrstuvwxyzabcdef", 0x43ba70e4198e3b2aULL },
-    { "abcdefghijklmnopqrstuvwxyzabcdef"
-      "abcdefghijklmnopqrstuvwxyzghijkl"
-      "abcdefghijklmnopqrstuvwxyzmnopqr"
-      "abcdefghijklmnopqrstuvwxyzstuvwx"
-      "abcdefghijklmnopqrstuvwxyzyzabcd", 0xdcd57fb2afdf72beULL },
-    { "a",                                0xaeb6f9d5517c61f8ULL },
-    { "aa",                               0xf2b3b69a9736a1ebULL },
-    { "aaa",                              0xf752eb6f07b1cafeULL },
-    { "aaaaa",                            0x812bd21e1236954cULL },
-    { "aaaaaaaa",                         0xff07a2cff08ac587ULL },
-    { "aaaaaaaaaaaaa",                    0x84ac949d54d704ecULL },
-    { "aaaaaaaaaaaaaaaaaaaaa",            0xcb2c8fb6be8f5648ULL },
-    { "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 0xcc40ab7f164091b6ULL },
-    { "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 0xc58e174c1e78ffe9ULL },
-    { "z",                                0x1ba160d7e8f8785cULL },
-    { "zz",                               0x2c5c03172f1285d7ULL },
-    { "zzz",                              0x9d2c4f4b507a2ac3ULL },
-    { "zzzzz",                            0x0f03b9031735693aULL },
-    { "zzzzzzzz",                         0xe674147c8582c08eULL },
-    { "zzzzzzzzzzzzz",                    0x3162d9fa6938db83ULL },
-    { "zzzzzzzzzzzzzzzzzzzzz",            0x37b9a549e013620cULL },
-    { "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", 0x8921470aff885016ULL },
-    { "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", 0xf60fdcd9beb08441ULL },
-    { "a",                                0xaeb6f9d5517c61f8ULL },
-    { "ab",                               0x7ab1edb96be496b4ULL },
-    { "aba",                              0x3edb049950884d0aULL },
-    { "ababa",                            0x8f2de9e73a97714bULL },
-    { "abababab",                         0xee14a29ddf0ce54cULL },
-    { "ababababababa",                    0x38b3ddaada2d52b4ULL },
-    { "ababababababababababa",            0xd3665364219f2b85ULL },
-    { "abababababababababababababababab", 0xa75cd6afbf1bc972ULL },
-    { "abababababababababababababababab"
-      "abababababababababababababababab"
-      "abababababababababababababababab"
-      "abababababababababababababababab"
-      "abababababababababababababababab", 0x840192d129f7a22bULL }
-#else
-#error This test only supports 64-bit and 32-bit systems.
-#endif
-  };
-  for (unsigned i = 0; i < sizeof(golden_data)/sizeof(*golden_data); ++i) {
-    StringRef str = golden_data[i].s;
-    hash_code hash = hash_combine_range(str.begin(), str.end());
-#if 0 // Enable this to generate paste-able text for the above structure.
-    std::string member_str = "\"" + str.str() + "\",";
-    fprintf(stderr, " { %-35s 0x%016llxULL },\n",
-            member_str.c_str(), static_cast<uint64_t>(hash));
-#endif
-    EXPECT_EQ(static_cast<size_t>(golden_data[i].hash),
-              static_cast<size_t>(hash));
-  }
-}
-
 TEST(HashingTest, HashCombineBasicTest) {
   // Hashing a sequence of homogenous types matches range hashing.
   const int i1 = 42, i2 = 43, i3 = 123, i4 = 999, i5 = 0, i6 = 79;
diff --git a/llvm/utils/gn/secondary/llvm/lib/Support/BUILD.gn b/llvm/utils/gn/secondary/llvm/lib/Support/BUILD.gn
index af4a6993cbc79..d6b6b71a1e59b 100644
--- a/llvm/utils/gn/secondary/llvm/lib/Support/BUILD.gn
+++ b/llvm/utils/gn/secondary/llvm/lib/Support/BUILD.gn
@@ -90,7 +90,6 @@ static_library("Support") {
     "FormattedStream.cpp",
     "GlobPattern.cpp",
     "GraphWriter.cpp",
-    "Hashing.cpp",
     "HexagonAttributeParser.cpp",
     "HexagonAttributes.cpp",
     "InitLLVM.cpp",

@llvmbot
Copy link
Member

llvmbot commented Jun 21, 2024

@llvm/pr-subscribers-llvm-adt

Author: Fangrui Song (MaskRay)

Changes

Hashing.h provides hash_value/hash_combine/hash_combine_range, which are
primarily used by

  • DenseMap&lt;StringRef, X&gt;
  • FoldingSetNodeIDRef::ComputeHash (will be fixed by #96136)

Users shouldn't rely on specific hash values due to potential algorithm
changes. set_fixed_execution_hash_seed is provided but it has never
been used.

Take the the address of a static storage duration variable as the seed
like absl/hash/internal/hash.h kSeed.
(See https://reviews.llvm.org/D93931 for workaround for older Clang.
Mach-O x86-64 forces PIC, so absl's __apple_build_version__ check is
unnecessary.)

A few users relying on the iteration order of DenseMap&lt;StringRef, X&gt;
have been fixed (e.g., f8f4235 c025bd1 89e8e63
86eb6bf eb8d036 0ea6b8e 58d7a6e 8ea31db
255986e).
From my experience fixing DenseMap&lt;A *, X&gt; and
StringMap
iteration order issues, the scale of breakage is smaller.


Full diff: https://github.com/llvm/llvm-project/pull/96282.diff

5 Files Affected:

  • (modified) llvm/include/llvm/ADT/Hashing.h (+11-34)
  • (modified) llvm/lib/Support/CMakeLists.txt (-1)
  • (removed) llvm/lib/Support/Hashing.cpp (-28)
  • (modified) llvm/unittests/ADT/HashingTest.cpp (-72)
  • (modified) llvm/utils/gn/secondary/llvm/lib/Support/BUILD.gn (-1)
diff --git a/llvm/include/llvm/ADT/Hashing.h b/llvm/include/llvm/ADT/Hashing.h
index a5477362a5079..3c87894635787 100644
--- a/llvm/include/llvm/ADT/Hashing.h
+++ b/llvm/include/llvm/ADT/Hashing.h
@@ -126,23 +126,6 @@ hash_code hash_value(const std::basic_string<T> &arg);
 /// Compute a hash_code for a standard string.
 template <typename T> hash_code hash_value(const std::optional<T> &arg);
 
-/// Override the execution seed with a fixed value.
-///
-/// This hashing library uses a per-execution seed designed to change on each
-/// run with high probability in order to ensure that the hash codes are not
-/// attackable and to ensure that output which is intended to be stable does
-/// not rely on the particulars of the hash codes produced.
-///
-/// That said, there are use cases where it is important to be able to
-/// reproduce *exactly* a specific behavior. To that end, we provide a function
-/// which will forcibly set the seed to a fixed value. This must be done at the
-/// start of the program, before any hashes are computed. Also, it cannot be
-/// undone. This makes it thread-hostile and very hard to use outside of
-/// immediately on start of a simple program designed for reproducible
-/// behavior.
-void set_fixed_execution_hash_seed(uint64_t fixed_value);
-
-
 // All of the implementation details of actually computing the various hash
 // code values are held within this namespace. These routines are included in
 // the header file mainly to allow inlining and constant propagation.
@@ -322,24 +305,18 @@ struct hash_state {
   }
 };
 
-
-/// A global, fixed seed-override variable.
-///
-/// This variable can be set using the \see llvm::set_fixed_execution_seed
-/// function. See that function for details. Do not, under any circumstances,
-/// set or read this variable.
-extern uint64_t fixed_seed_override;
-
+/// The seed is non-deterministic (address of a variable) to prevent having
+/// users depend on the particular hash values. On platforms without ASLR, this
+/// is still likely non-deterministic per build.
 inline uint64_t get_execution_seed() {
-  // FIXME: This needs to be a per-execution seed. This is just a placeholder
-  // implementation. Switching to a per-execution seed is likely to flush out
-  // instability bugs and so will happen as its own commit.
-  //
-  // However, if there is a fixed seed override set the first time this is
-  // called, return that instead of the per-execution seed.
-  const uint64_t seed_prime = 0xff51afd7ed558ccdULL;
-  static uint64_t seed = fixed_seed_override ? fixed_seed_override : seed_prime;
-  return seed;
+  static const char seed = 0;
+  // Work around x86-64 negative offset folding for old Clang -fno-pic
+  // https://reviews.llvm.org/D93931
+#if !defined(__clang__) || __clang_major__ > 11
+  return static_cast<uint64_t>(reinterpret_cast<uintptr_t>(&seed));
+#else
+  return 0xff51afd7ed558ccdULL;
+#endif
 }
 
 
diff --git a/llvm/lib/Support/CMakeLists.txt b/llvm/lib/Support/CMakeLists.txt
index 0c69ac99f5bc6..31343df4d8b8b 100644
--- a/llvm/lib/Support/CMakeLists.txt
+++ b/llvm/lib/Support/CMakeLists.txt
@@ -187,7 +187,6 @@ add_llvm_component_library(LLVMSupport
   FormatVariadic.cpp
   GlobPattern.cpp
   GraphWriter.cpp
-  Hashing.cpp
   HexagonAttributeParser.cpp
   HexagonAttributes.cpp
   InitLLVM.cpp
diff --git a/llvm/lib/Support/Hashing.cpp b/llvm/lib/Support/Hashing.cpp
deleted file mode 100644
index 1b20a670434f1..0000000000000
--- a/llvm/lib/Support/Hashing.cpp
+++ /dev/null
@@ -1,28 +0,0 @@
-//===-------------- lib/Support/Hashing.cpp -------------------------------===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-//
-// This file provides implementation bits for the LLVM common hashing
-// infrastructure. Documentation and most of the other information is in the
-// header file.
-//
-//===----------------------------------------------------------------------===//
-
-#include "llvm/ADT/Hashing.h"
-
-using namespace llvm;
-
-// Provide a definition and static initializer for the fixed seed. This
-// initializer should always be zero to ensure its value can never appear to be
-// non-zero, even during dynamic initialization.
-uint64_t llvm::hashing::detail::fixed_seed_override = 0;
-
-// Implement the function for forced setting of the fixed seed.
-// FIXME: Use atomic operations here so that there is no data race.
-void llvm::set_fixed_execution_hash_seed(uint64_t fixed_value) {
-  hashing::detail::fixed_seed_override = fixed_value;
-}
diff --git a/llvm/unittests/ADT/HashingTest.cpp b/llvm/unittests/ADT/HashingTest.cpp
index ab13ee833ce55..c28356e229e66 100644
--- a/llvm/unittests/ADT/HashingTest.cpp
+++ b/llvm/unittests/ADT/HashingTest.cpp
@@ -235,78 +235,6 @@ TEST(HashingTest, HashCombineRangeLengthDiff) {
   }
 }
 
-TEST(HashingTest, HashCombineRangeGoldenTest) {
-  struct { const char *s; uint64_t hash; } golden_data[] = {
-#if SIZE_MAX == UINT64_MAX || SIZE_MAX == UINT32_MAX
-    { "a",                                0xaeb6f9d5517c61f8ULL },
-    { "ab",                               0x7ab1edb96be496b4ULL },
-    { "abc",                              0xe38e60bf19c71a3fULL },
-    { "abcde",                            0xd24461a66de97f6eULL },
-    { "abcdefgh",                         0x4ef872ec411dec9dULL },
-    { "abcdefghijklm",                    0xe8a865539f4eadfeULL },
-    { "abcdefghijklmnopqrstu",            0x261cdf85faaf4e79ULL },
-    { "abcdefghijklmnopqrstuvwxyzabcdef", 0x43ba70e4198e3b2aULL },
-    { "abcdefghijklmnopqrstuvwxyzabcdef"
-      "abcdefghijklmnopqrstuvwxyzghijkl"
-      "abcdefghijklmnopqrstuvwxyzmnopqr"
-      "abcdefghijklmnopqrstuvwxyzstuvwx"
-      "abcdefghijklmnopqrstuvwxyzyzabcd", 0xdcd57fb2afdf72beULL },
-    { "a",                                0xaeb6f9d5517c61f8ULL },
-    { "aa",                               0xf2b3b69a9736a1ebULL },
-    { "aaa",                              0xf752eb6f07b1cafeULL },
-    { "aaaaa",                            0x812bd21e1236954cULL },
-    { "aaaaaaaa",                         0xff07a2cff08ac587ULL },
-    { "aaaaaaaaaaaaa",                    0x84ac949d54d704ecULL },
-    { "aaaaaaaaaaaaaaaaaaaaa",            0xcb2c8fb6be8f5648ULL },
-    { "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 0xcc40ab7f164091b6ULL },
-    { "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
-      "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", 0xc58e174c1e78ffe9ULL },
-    { "z",                                0x1ba160d7e8f8785cULL },
-    { "zz",                               0x2c5c03172f1285d7ULL },
-    { "zzz",                              0x9d2c4f4b507a2ac3ULL },
-    { "zzzzz",                            0x0f03b9031735693aULL },
-    { "zzzzzzzz",                         0xe674147c8582c08eULL },
-    { "zzzzzzzzzzzzz",                    0x3162d9fa6938db83ULL },
-    { "zzzzzzzzzzzzzzzzzzzzz",            0x37b9a549e013620cULL },
-    { "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", 0x8921470aff885016ULL },
-    { "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
-      "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz", 0xf60fdcd9beb08441ULL },
-    { "a",                                0xaeb6f9d5517c61f8ULL },
-    { "ab",                               0x7ab1edb96be496b4ULL },
-    { "aba",                              0x3edb049950884d0aULL },
-    { "ababa",                            0x8f2de9e73a97714bULL },
-    { "abababab",                         0xee14a29ddf0ce54cULL },
-    { "ababababababa",                    0x38b3ddaada2d52b4ULL },
-    { "ababababababababababa",            0xd3665364219f2b85ULL },
-    { "abababababababababababababababab", 0xa75cd6afbf1bc972ULL },
-    { "abababababababababababababababab"
-      "abababababababababababababababab"
-      "abababababababababababababababab"
-      "abababababababababababababababab"
-      "abababababababababababababababab", 0x840192d129f7a22bULL }
-#else
-#error This test only supports 64-bit and 32-bit systems.
-#endif
-  };
-  for (unsigned i = 0; i < sizeof(golden_data)/sizeof(*golden_data); ++i) {
-    StringRef str = golden_data[i].s;
-    hash_code hash = hash_combine_range(str.begin(), str.end());
-#if 0 // Enable this to generate paste-able text for the above structure.
-    std::string member_str = "\"" + str.str() + "\",";
-    fprintf(stderr, " { %-35s 0x%016llxULL },\n",
-            member_str.c_str(), static_cast<uint64_t>(hash));
-#endif
-    EXPECT_EQ(static_cast<size_t>(golden_data[i].hash),
-              static_cast<size_t>(hash));
-  }
-}
-
 TEST(HashingTest, HashCombineBasicTest) {
   // Hashing a sequence of homogenous types matches range hashing.
   const int i1 = 42, i2 = 43, i3 = 123, i4 = 999, i5 = 0, i6 = 79;
diff --git a/llvm/utils/gn/secondary/llvm/lib/Support/BUILD.gn b/llvm/utils/gn/secondary/llvm/lib/Support/BUILD.gn
index af4a6993cbc79..d6b6b71a1e59b 100644
--- a/llvm/utils/gn/secondary/llvm/lib/Support/BUILD.gn
+++ b/llvm/utils/gn/secondary/llvm/lib/Support/BUILD.gn
@@ -90,7 +90,6 @@ static_library("Support") {
     "FormattedStream.cpp",
     "GlobPattern.cpp",
     "GraphWriter.cpp",
-    "Hashing.cpp",
     "HexagonAttributeParser.cpp",
     "HexagonAttributes.cpp",
     "InitLLVM.cpp",

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, but I'm not sure this should be always enabled. This sounds like a lot of debugging fun when a seed-dependent case is not caught in testing.

Can we have this part of LLVM_REVERSE_ITERATION instead, which is used to detect similar hash iteration stability issues? We have a build bot for it.

@MaskRay
Copy link
Member Author

MaskRay commented Jun 21, 2024

I like the idea, but I'm not sure this should be always enabled. This sounds like a lot of debugging fun when a seed-dependent case is not caught in testing.

Can we have this part of LLVM_REVERSE_ITERATION instead, which is used to detect similar hash iteration stability issues? We have a build bot for it.

I have fixed a lot of non-determinism bugs in the past few years.
In my experience, non-determinism caused by DenseMap<A *, X> is a more frequent issue than over reliance on DenseMap<StringRef, X>.
DenseMap<A *, X> non-determinism issues could have a low failure rate with certain malloc implementations, especially when the bucket count is small.

Potential non-determinism concerns are outweighed by the benefits.
The argument in abseil/abseil-cpp#339 might be useful:

Part of Abseil's philosophy is engineering at scale. In addition to making it harder to execute a hash flooding attack, another reason we randomize iteration order is that is makes it easier for use to change the underlying implementation if we need to. When we wanted to change our hash function (for example), we found thousands of unit tests that depended on iteration order. We could not just break these tests and say "sorry, you are violating Hyrum's Law" since we would have thousands of angry Google developers. So instead, we fixed the tests and implemented randomization so that next time we wanted to change something in the implementation, this would not be an issue. The need for iteration order determinism is relatively rare compared to the potential wins we get by being free to change the implementation (maybe to make the hash table faster, which could save millions of cycles at scale, for example). By giving users a knob to disable the randomness, we'd be in the same situation all over again.

While -DLLVM_ENABLE_REVERSE_ITERATION=on helps, it adds complexity for bot maintainers and configurations.
This PR allows easier detection of DenseMap<StringRef, X> order reliance by all contributors.
Eventually, perhaps we can extend this to all DenseMap<Key, X>. For types where using a non-deterministic address has performance implications, we could consider restrict the nondeterminism to LLVM_ENABLE_ASSERTIONS=on.


BTW, LLVM_ENABLE_REVERSE_ITERATION currently only applies to pointer-like types.

% cat llvm/include/llvm/Support/ReverseIteration.h
...
template<class T = void *>
bool shouldReverseIterate() {
#if LLVM_ENABLE_REVERSE_ITERATION
  return detail::IsPointerLike<T>::value;
#else
  return false;
#endif
}

@efriedma-quic
Copy link
Collaborator

We restricted reverse-iteration when we added it just to save time when we were enabling it: we wanted to prioritize issues that were actually likely to cause non-determinism (as opposed to relying on the hash algorithm, which is annoying but not actually non-deterministic). If you're willing to fix all the resulting breakage, it should be fine to apply it more widely.


I'm a little concerned that doing this in release builds is going to lead to weird bug reports. Especially given the current approach for getting randomness: ASLR isn't really that random, particularly on Windows, so the probability of getting a particular seed isn't uniform.

@MaskRay
Copy link
Member Author

MaskRay commented Jun 22, 2024

Due to ASLR and malloc non-determinism, DenseMap<A *, X> already introduces a lot of non-determinism to release builds.
DenseMap<StringRef, X> issues are less pronounced compared with DenseMap<A *, X>.

I'm a little concerned that doing this in release builds is going to lead to weird bug reports

I understand the concern. If we receive 10 weird bug reports root caused to DenseMap<A *, X>,
the number of DenseMap<StringRef, X> bug reports might just be 1.


The next step is to make integer type keys (e.g. DenseMap<uint64_t, X>) non-deterministic.
IIRC the recent uint64_t getHashKey changes does not require any test update.
The number of DenseMap<integer, X> bug reports might be small.

As my previous comment says:

we could consider restrict the nondeterminism to LLVM_ENABLE_ASSERTIONS=on.

but I haven't found it necessary.


Due to Avalanche effects, even a few ASLR bits are sufficient to cover many different scenarios and expose latent bugs.

MaskRay added 4 commits June 22, 2024 14:33
Created using spr 1.3.5-bogner

[skip ci]
Created using spr 1.3.5-bogner

[skip ci]
@MaskRay MaskRay changed the title [Hashing] Use a non-deterministic seed [Hashing] Use a non-deterministic seed if LLVM_ENABLE_ABI_BREAKING_CHECKS Jun 27, 2024
@MaskRay
Copy link
Member Author

MaskRay commented Jun 27, 2024

The non-determinism is now restricted to LLVM_ENABLE_ABI_BREAKING_CHECKS builds.

LLVM_ENABLE_ABI_BREAKING_CHECKS defaults to WITH_ASSERTS . Release builds that disable assertions disable LLVM_ENABLE_ABI_BREAKING_CHECKS. This change yields a slight code size/performance advantage by eliminating a variable read. get_execution_seed returns the fixed value regardless of NDEBUG.

Note: llvm/include/llvm/ADT/STLExtras.h and llvm/include/llvm/Support/Error.h also use LLVM_ENABLE_ABI_BREAKING_CHECKS.

const uint64_t seed_prime = 0xff51afd7ed558ccdULL;
static uint64_t seed = fixed_seed_override ? fixed_seed_override : seed_prime;
return seed;
[[maybe_unused]] static const char seed = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this inside the #if?

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please wait a bit in case there is more feedback.

Created using spr 1.3.5-bogner
@MaskRay
Copy link
Member Author

MaskRay commented Jun 27, 2024

https://llvm-compile-time-tracker.com/compare.php?from=abfff89b743584d2796000318198bf60d3622a1f&to=5c2a6b5ba62d2b7ed2c0ad3be29fba8558f5627b&stat=instructions:u

There is significant improvement on compile time through the stage2-O3: instruction:u metric.

Benchmark Old New
kimwitu++ 38847M 38705M (-0.37%)
sqlite3 35002M 34917M (-0.24%)
consumer-typeset 31851M 31794M (-0.18%)
Bullet 93072M 92815M (-0.28%)
tramp3d-v4 78154M 77899M (-0.33%)
mafft 32841M 32718M (-0.37%)
ClamAV 50244M 50128M (-0.23%)
lencod 60998M 60793M (-0.34%)
SPASS 42834M 42725M (-0.25%)
7zip 191523M 191032M (-0.26%)
geomean 55146M 54990M (-0.28%)

This is great, but it also suggests that replacing DenseMap with absl swiss table would likely yield a greater code size regression, and very likely performance regression as well. If we end up introducing a swiss table like container, it might be its own type, not DenseMap.

Copy link
Collaborator

@efriedma-quic efriedma-quic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm happier restricting the non-determinism to +Asserts for now, at least as an incremental step.

Due to Avalanche effects, even a few ASLR bits are sufficient to cover many different scenarios and expose latent bugs.

On Windows specifically, I'm less concerned about the total number of bits, and more concerned that ASLR isn't randomized for each run of an executable.

// Work around x86-64 negative offset folding for old Clang -fno-pic
// https://reviews.llvm.org/D93931
#if LLVM_ENABLE_ABI_BREAKING_CHECKS && \
(!defined(__clang__) || __clang_major__ > 11)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it an ABI problem that this ifdef exists? I mean, LLVM libraries built with clang<11 can't be used by programs built with clang>11. With LLVM_ENABLE_ABI_BREAKING_CHECKS, I guess it's unlikely to cause issues, though. (I guess you could use an empty inline asm as a workaround if you wanted to.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clang condition introduced a slight ABI problem when mixing clang<11 and clang>=11. In practice it is rare that the llvm-project build and a downstream client exchange the hash values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this affect any access to a DenseMap across the boundary for example?

That is any API where LLVM would initialize a DenseMap and return a reference to it to the user which then access elements by computing key hashes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is definitely a problem for LLVM built using GCC and an external project, based on LLVM, built using CLANG 11. Is there a reliable check that we can add (e.g. in llvm/include/llvm/Config/abi-breaking.h.cmake) to detect this at the build time rather than failing during the execution?

Copy link
Member Author

@MaskRay MaskRay Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LLVM built with GCC are typically release builds and LLVM_ENABLE_ABI_BREAKING_CHECKS is 0. The else branch is taken and there is no impact.

If we want to be over-cautious, this condition can be refined to:
LLVM_ENABLE_ABI_BREAKING_CHECKS && (!defined(__clang__) || __clang_major__ > 11 || defined(__PIC__)).
We could code the condition to utils/bazel/llvm_configs/abi-breaking.h.cmake, but I am hoping that the complexity isn't needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would make sense to move the function to the source file for the LLVM_ENABLE_ABI_BREAKING_CHECKS case? It will add cost, but it's okay for assertion-enabled builds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my case it is a Release build with GCC and -DLLVM_ENABLE_ASSERTIONS=ON.

Could you please explain how __PIC__ check helps?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are still dealing with the fallout from this change in Triton.

It seems that breaking the ABI across different compilers is acceptable, but it would be good to detect this and error out. We already detect mismatches of LLVM_ENABLE_ABI_BREAKING_CHECKS. Would it be acceptable to just add the clang > 11 condition there?

@MaskRay
Copy link
Member Author

MaskRay commented Jun 28, 2024

Eli is happier now. I plan to land this in a few hours.

MaskRay added a commit that referenced this pull request Jun 28, 2024
Otherwise clang/test/CodeGenCXX/attr-annotate.cpp
output could fail when llvm::hash_value(StringRef) changes
(#96282).

EmitGlobalAnnotations iterates over DeferredAnnotations.
lravenclaw pushed a commit to lravenclaw/llvm-project that referenced this pull request Jul 3, 2024
clangd/test/trace.test might fail as llvm::hash_value(StringRef) is
non-deterministic per process (llvm#96282).
lravenclaw pushed a commit to lravenclaw/llvm-project that referenced this pull request Jul 3, 2024
Follow-up to llvm#96282.
Since llvm/lib/Target files are compiled with -fvisibility=hidden,
the seed variable in libLLVM.so is hidden in a -DLLVM_BUILD_LLVM_DYLIB=on build.

This would cause `hash_value(std::string()), hash_value(StringRef())` to
fail since the former (might be part of the main executable) and the
latter (llvm/lib/Support/StringRef.cpp in libLLVM.so) use copies in
different components.
lravenclaw pushed a commit to lravenclaw/llvm-project that referenced this pull request Jul 3, 2024
This test from llvm#74253 relies on particular hash values from Hashing.h.
The test fails in LLVM_ENABLE_ABI_BREAKING_CHECKS=on modes (llvm#96282) or
whenever Hashing.h implementation changes.
kbluck pushed a commit to kbluck/llvm-project that referenced this pull request Jul 6, 2024
This test from llvm#74253 relies on particular hash values from Hashing.h.
The test fails in LLVM_ENABLE_ABI_BREAKING_CHECKS=on modes (llvm#96282) or
whenever Hashing.h implementation changes.
fsfod pushed a commit to fsfod/llvm-project that referenced this pull request Jul 20, 2024
Otherwise clang/test/CodeGenCXX/attr-annotate.cpp
output could fail when llvm::hash_value(StringRef) changes
(llvm#96282).

EmitGlobalAnnotations iterates over DeferredAnnotations.

(cherry picked from commit 6644975)
@quic-seaswara
Copy link

Aren't hashes supposed to be deterministic ?

Can we add a override for builds with asserts turned ON to favor determinism ?

Could have an API to set the determinism flag ?

This also has an effect on Build ID computation ?

@MaskRay
Copy link
Member Author

MaskRay commented Jul 24, 2024

Aren't hashes supposed to be deterministic ?

Can we add a override for builds with asserts turned ON to favor determinism ?

Could have an API to set the determinism flag ?

This also has an effect on Build ID computation ?

Some hashes are supposed to be deterministic, but Hashing.h isn't .
The file-level documentation and get_execution_seed are very clear about the unstability.

If you want to deploy assertions builds and have concern about non-deterministism, set LLVM_ABI_BREAKING_CHECKS to FORCE_OFF.

The unstable property is important to ensure that no test relies on the particular implementation and allow us to replace the implementation, which would be infeasible if people keep adding brittle tests.

jsji added a commit to intel/llvm that referenced this pull request Jul 27, 2024
After llvm/llvm-project#96282, the order of the
attr changed, so update the test .
jsji added a commit to intel/llvm that referenced this pull request Jul 27, 2024
…of attr

llvm/llvm-project#96282 changed the hashing,
we may get non-detrministic iteration order if using DenseMap.

This is causing tests to beome flaky.

Fixing it similar to 6644975
MaskRay added a commit that referenced this pull request Jul 28, 2024
FNV, used by stable_hash_combine_string is extremely slow. For string
hashing with good avalanche effects, we prefer xxh3_64bits.

StableHashing.h might still be useful as it provides a stable
hash_combine while Hashing.h's might be non-deterministic (#96282).

Pull Request: #100668
bnbarham added a commit to bnbarham/swift that referenced this pull request Aug 1, 2024
llvm/llvm-project#96282 changed
`get_execution_seed` to be non-deterministic. Use stable hashes instead.
bnbarham added a commit to bnbarham/swift that referenced this pull request Aug 1, 2024
llvm/llvm-project#96282 changed
`get_execution_seed` to be non-deterministic. Use stable hashes instead.
bnbarham added a commit to bnbarham/swift that referenced this pull request Aug 14, 2024
llvm/llvm-project#96282 changed
`get_execution_seed` to be non-deterministic. Use stable hashes instead.
karupayun added a commit to openxla/triton that referenced this pull request Sep 19, 2024
This commit is fixing the root cause of the issue that appeared during
an llvm update, and was raised in triton-lang#4212 (comment).

The issue was temporary fixed in triton-lang#4512,
where we set LLVM_ABI_BREAKING_CHECKS to FORCE_OFF constant in order to
not use a non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80).

A further investigation (with chsigg@) found that the issue is that LLVM
is built with clang11, while Triton use a newer version. The ABI issue
is brought up here: llvm/llvm-project#96282 (comment),
but the consensus seemed to be that this setup is rare.

Updated the clang version to 12 in the ubuntu build fixed the issue and
therefore we can revert setting LLVM_ABI_BREAKING_CHECKS constant. I am
additionaly erasing LLVM_ABI_BREAKING_CHECKS in the setup of the other
hardwares (it seems it was not needed) and I am splitting ubuntu and
macOS configurations because it seems cleaner than having a variable
that sets the compiler version for each of them.
karupayun added a commit to openxla/triton that referenced this pull request Sep 19, 2024
This commit is fixing the root cause of the issue that appeared during
an llvm update, and was raised in triton-lang#4212 (comment).

The issue was temporary fixed in triton-lang#4512,
where we set LLVM_ABI_BREAKING_CHECKS to FORCE_OFF constant in order to
not use a non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80).

A further investigation (with chsigg@) found that the issue is that LLVM
is built with clang11, while Triton use a newer version. The ABI issue
is brought up here: llvm/llvm-project#96282 (comment),
but the consensus seemed to be that this setup is rare.

Updated the clang version to 12 in the ubuntu build fixed the issue and
therefore we can revert setting LLVM_ABI_BREAKING_CHECKS constant. I am
additionaly erasing LLVM_ABI_BREAKING_CHECKS in the setup of the other
hardwares (it seems it was not needed) and I am splitting ubuntu and
macOS configurations because it seems cleaner than having a variable
that sets the compiler version for each of them.
karupayun added a commit to triton-lang/triton that referenced this pull request Sep 26, 2024
This commit is fixing the root cause of the issue that appeared during
an llvm update, and was raised in #4212 (comment).

The issue was temporary fixed in #4512,
where we set LLVM_ABI_BREAKING_CHECKS to FORCE_OFF constant in order to
not use a non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80).

A further investigation (with chsigg@) found that the issue is that LLVM
is built with clang11, while Triton use a newer version. The ABI issue
is brought up here: llvm/llvm-project#96282 (comment),
but the consensus seemed to be that this setup is rare.

Updated the clang version to 12 in the ubuntu build fixed the issue and
therefore we can revert setting LLVM_ABI_BREAKING_CHECKS constant. I am
additionaly erasing LLVM_ABI_BREAKING_CHECKS in the setup of the other
hardwares (it seems it was not needed) and I am splitting ubuntu and
macOS configurations because it seems cleaner than having a variable
that sets the compiler version for each of them.
adrian-prantl pushed a commit to adrian-prantl/llvm-project that referenced this pull request Oct 21, 2024
FNV, used by stable_hash_combine_string is extremely slow. For string
hashing with good avalanche effects, we prefer xxh3_64bits.

StableHashing.h might still be useful as it provides a stable
hash_combine while Hashing.h's might be non-deterministic (llvm#96282).

Pull Request: llvm#100668

(cherry picked from commit c538434)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants