[lldb][debugserver] Read/write SME registers on arm64 #119171
The Apple M4 line of cores includes the Scalable Matrix Extension (SME) feature. The M4s do not implement the Scalable Vector Extension (SVE), although the processor is in Streaming SVE (SSVE) Mode when SME is being used. The most obvious side effects of being in SSVE Mode are that (on the M4 cores) NEON instructions cannot be used, and watchpoints may get false positives because the address comparisons are done at a lowered granularity.

When SSVE mode is enabled, the kernel will provide the Streaming Vector Length (SVL) register, which is a maximum of 64 bytes with the M4. Also provided are SVCR (with bits indicating if SSVE mode and SME mode are enabled), TPIDR2, and SVL. Then the SVE registers Z0..31 (SVL bytes long), P0..15 (SVL/8 bytes), the ZA matrix register (SVL*SVL bytes), and, because the M4 supports SME2, the ZT0 register (64 bytes). When SSVE/SME are disabled, none of these registers are provided by the kernel - reads and writes of them will fail.

Unlike Linux, lldb cannot modify the SVL through a thread_set_state call, or change the processor state's SSVE/SME status. There is also no way for a process to request a lowered SVL size today, so the work that David did to handle VL/SVL changing while stepping through a process is not an issue on Darwin today. But debugserver should be providing everything necessary so we can reuse all of David's work on resizing the register contexts in lldb if it happens in the future. debugserver sends svl, svcr, and tpidr2 in the expedited registers when a thread stops, if SSVE|SME mode are enabled (if the kernel allows it to read the ARM_SME_STATE register set).

While the maximum SVL is 64 bytes on M4, the AArch64 maximum possible SVL is 256 bytes; this would give us a 64 KiB (65,536-byte) ZA register.
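To make the size relationships in the description concrete, here is a minimal Python sketch that derives each register size from an SVL given in bytes. The names (`SMERegisterSizes`, `sme_register_sizes`) are made up for illustration and are not part of lldb or debugserver; the formulas are the ones stated above (Z is SVL bytes, P is SVL/8 bytes, ZA is SVL*SVL bytes, ZT0 is a fixed 64 bytes).

```python
from dataclasses import dataclass

ZT0_BYTES = 64  # ZT0 is fixed at 512 bits, independent of SVL
VALID_SVL_BYTES = (16, 32, 64, 128, 256)  # powers of two, 128..2048 bits


@dataclass(frozen=True)
class SMERegisterSizes:
    """Hypothetical container for per-register sizes, all in bytes."""
    z_bytes: int    # each of Z0..Z31
    p_bytes: int    # each of P0..P15
    za_bytes: int   # the ZA matrix register
    zt0_bytes: int  # ZT0 (SME2 only)


def sme_register_sizes(svl_bytes: int) -> SMERegisterSizes:
    """Derive register sizes from a streaming vector length in bytes."""
    if svl_bytes not in VALID_SVL_BYTES:
        raise ValueError(f"unexpected SVL: {svl_bytes} bytes")
    return SMERegisterSizes(
        z_bytes=svl_bytes,
        p_bytes=svl_bytes // 8,
        za_bytes=svl_bytes * svl_bytes,
        zt0_bytes=ZT0_BYTES,
    )


m4 = sme_register_sizes(64)         # the M4 maximum quoted above
arch_max = sme_register_sizes(256)  # the architectural maximum
print(m4)
print(arch_max)
```

With the M4's 64-byte SVL the ZA register is 4096 bytes; at the architectural maximum of 256 bytes it is 65,536 bytes, which is where the memory concern comes from.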
If debugserver sized all of its register contexts assuming the largest possible SVL, we could easily use 2MB more memory for the register contexts of all threads in a process -- and on iOS et al, processes must run within a small memory allotment and this would push us over that.

Much of the work in debugserver was changing the arm64 register context from being a static compile-time array of register sets, to being initialized at runtime if debugserver is running on a machine with SME. The ZA storage is only sized to the machine's actual maximum SVL. The size of the 32 SVE Z registers is less significant so I am statically allocating those to the architecturally largest possible SVL value today.

Also, debugserver includes information about registers that share the same part of the register file, e.g. S0 and D0 are the lower parts of the NEON 128-bit V0 register. And when running on an SME machine, V0 is the lower 128 bits of the SVE Z0 register. So the register maps used when defining the VFP registers must differ depending on the runtime state of the CPU.

I also changed register reading in debugserver. Formerly, when debugserver was asked to read a register and the thread_get_state read of that register failed, it would return all zeros. This is necessary when constructing a `g` packet that gets all registers - because there is no separation between register bytes, the offsets are fixed. But when we are asking for a single register (e.g. Z0) when not in SSVE/SME mode, this should return an error.

This does mean that when you're running on an SME-capable machine, but not in SME mode, and do `register read -a`, lldb will report that 48 SVE registers were unavailable and 5 SME registers were unavailable. But that's only when `-a` is used.

The register reading and writing depends on new register flavor support in thread_get_state/thread_set_state in the kernel, which is not yet in a release. The test case I wrote is skipped on current OSes.
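The difference between the `g` packet and a single-register read can be sketched as follows. This is an illustration only, not debugserver's code: `read_all_registers`, `read_one_register`, and the tiny three-register layout are hypothetical, and a reader returning `None` stands in for a failed thread_get_state call.

```python
from typing import Callable, Optional, Sequence, Tuple

# A reader returns the register bytes, or None if the read failed.
RegisterReader = Callable[[str], Optional[bytes]]


def read_all_registers(layout: Sequence[Tuple[str, int]],
                       read: RegisterReader) -> bytes:
    """Build a g-packet-style payload: registers are concatenated with no
    separators, so an unreadable register must be zero-filled to keep the
    offsets of every later register valid."""
    payload = bytearray()
    for name, size in layout:
        data = read(name)
        if data is None:
            data = bytes(size)  # zero-fill to preserve fixed offsets
        assert len(data) == size
        payload += data
    return bytes(payload)


def read_one_register(name: str, read: RegisterReader) -> bytes:
    """A single-register read has no offsets to protect, so report failure."""
    data = read(name)
    if data is None:
        raise LookupError(f"register {name} is unavailable")
    return data


# Hypothetical layout: z0 is unreadable because the thread is not in SSVE mode.
layout = [("x0", 8), ("z0", 64), ("x1", 8)]


def fake_read(name: str) -> Optional[bytes]:
    values = {"x0": (1).to_bytes(8, "little"), "x1": (2).to_bytes(8, "little")}
    return values.get(name)


payload = read_all_registers(layout, fake_read)
print(len(payload))
```

In the payload, x1 still starts at byte offset 72 even though z0 could not be read, whereas asking for z0 on its own raises an error instead of returning zeros.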
I pilfered the SME register setup from some of David's existing SME test files; there were a few Linux-specific details in those tests that meant they weren't easy to reuse on Darwin.

rdar://121608074
@llvm/pr-subscribers-lldb

Author: Jason Molenda (jasonmolenda)

Patch is 67.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/119171.diff

9 Files Affected:
diff --git a/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp b/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp
index 181ba4e7d87721..6a072354972acd 100644
--- a/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp
+++ b/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp
@@ -100,6 +100,25 @@ bool ArchitectureAArch64::ReconfigureRegisterInfo(DynamicRegisterInfo &reg_info,
if (reg_value != fail_value && reg_value <= 32)
svg_reg_value = reg_value;
}
+ if (!svg_reg_value) {
+ const RegisterInfo *darwin_svg_reg_info = reg_info.GetRegisterInfo("svl");
+ if (darwin_svg_reg_info) {
+ uint32_t svg_reg_num = darwin_svg_reg_info->kinds[eRegisterKindLLDB];
+ uint64_t reg_value =
+ reg_context.ReadRegisterAsUnsigned(svg_reg_num, fail_value);
+ // UpdateARM64SVERegistersInfos and UpdateARM64SMERegistersInfos
+ // expect the number of 8-byte granules; darwin provides number of
+ // bytes.
+ if (reg_value != fail_value && reg_value <= 256) {
+ svg_reg_value = reg_value / 8;
+ // Apple hardware only implements Streaming SVE mode, so
+ // the non-streaming Vector Length is not reported by the
+ // kernel. Set both svg and vg to this svl value.
+ if (!vg_reg_value)
+ vg_reg_value = reg_value / 8;
+ }
+ }
+ }
if (!vg_reg_value && !svg_reg_value)
return false;
diff --git a/lldb/test/API/macosx/sme-registers/Makefile b/lldb/test/API/macosx/sme-registers/Makefile
new file mode 100644
index 00000000000000..d4173d262ed270
--- /dev/null
+++ b/lldb/test/API/macosx/sme-registers/Makefile
@@ -0,0 +1,5 @@
+C_SOURCES := main.c
+
+CFLAGS_EXTRAS := -mcpu=apple-m4
+
+include Makefile.rules
diff --git a/lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py b/lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py
new file mode 100644
index 00000000000000..82a5eb0dc81a6b
--- /dev/null
+++ b/lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py
@@ -0,0 +1,164 @@
+import lldb
+from lldbsuite.test.lldbtest import *
+from lldbsuite.test.decorators import *
+import lldbsuite.test.lldbutil as lldbutil
+import os
+
+
+class TestSMERegistersDarwin(TestBase):
+
+ NO_DEBUG_INFO_TESTCASE = True
+ mydir = TestBase.compute_mydir(__file__)
+
+ @skipIfRemote
+ @skipUnlessDarwin
+ @skipUnlessFeature("hw.optional.arm.FEAT_SME")
+ @skipUnlessFeature("hw.optional.arm.FEAT_SME2")
+ # thread_set_state/thread_get_state only avail in macOS 15.4+
+ @skipIf(macos_version=["<", "15.4"])
+ def test(self):
+ """Test that we can read the contents of the SME/SVE registers on Darwin"""
+ self.build()
+ (target, process, thread, bkpt) = lldbutil.run_to_source_breakpoint(
+ self, "break here", lldb.SBFileSpec("main.c")
+ )
+ frame = thread.GetFrameAtIndex(0)
+ self.assertTrue(frame.IsValid())
+
+ if self.TraceOn():
+ self.runCmd("reg read -a")
+
+ svl_reg = frame.register["svl"]
+ svl = svl_reg.GetValueAsUnsigned()
+
+ # SSVE and SME modes should be enabled (reflecting PSTATE.SM and PSTATE.ZA)
+ svcr = frame.register["svcr"]
+ self.assertEqual(svcr.GetValueAsUnsigned(), 3)
+
+ z0 = frame.register["z0"]
+ self.assertEqual(z0.GetNumChildren(), svl)
+ self.assertEqual(z0.GetChildAtIndex(0).GetValueAsUnsigned(), 0x1)
+ self.assertEqual(z0.GetChildAtIndex(svl - 1).GetValueAsUnsigned(), 0x1)
+
+ z31 = frame.register["z31"]
+ self.assertEqual(z31.GetNumChildren(), svl)
+ self.assertEqual(z31.GetChildAtIndex(0).GetValueAsUnsigned(), 32)
+ self.assertEqual(z31.GetChildAtIndex(svl - 1).GetValueAsUnsigned(), 32)
+
+ p0 = frame.register["p0"]
+ self.assertEqual(p0.GetNumChildren(), svl / 8)
+ self.assertEqual(p0.GetChildAtIndex(0).GetValueAsUnsigned(), 0xFF)
+ self.assertEqual(
+ p0.GetChildAtIndex(p0.GetNumChildren() - 1).GetValueAsUnsigned(), 0xFF
+ )
+
+ p15 = frame.register["p15"]
+ self.assertEqual(p15.GetNumChildren(), svl / 8)
+ self.assertEqual(p15.GetChildAtIndex(0).GetValueAsUnsigned(), 0xFF)
+ self.assertEqual(
+ p15.GetChildAtIndex(p15.GetNumChildren() - 1).GetValueAsUnsigned(), 0xFF
+ )
+
+ za = frame.register["za"]
+ self.assertEqual(za.GetNumChildren(), (svl * svl))
+ za_0 = za.GetChildAtIndex(0)
+ self.assertEqual(za_0.GetValueAsUnsigned(), 4)
+ za_final = za.GetChildAtIndex(za.GetNumChildren() - 1)
+ self.assertEqual(za_final.GetValueAsUnsigned(), 67)
+
+ zt0 = frame.register["zt0"]
+ self.assertEqual(zt0.GetNumChildren(), 64)
+ zt0_0 = zt0.GetChildAtIndex(0)
+ self.assertEqual(zt0_0.GetValueAsUnsigned(), 0)
+ zt0_final = zt0.GetChildAtIndex(63)
+ self.assertEqual(zt0_final.GetValueAsUnsigned(), 63)
+
+ z0_old_values = []
+ z0_new_str = '"{'
+ for i in range(svl):
+ z0_old_values.append(z0.GetChildAtIndex(i).GetValueAsUnsigned())
+ z0_new_str = z0_new_str + ("0x%02x " % (z0_old_values[i] + 5))
+ z0_new_str = z0_new_str + '}"'
+ self.runCmd("reg write z0 %s" % z0_new_str)
+
+ z31_old_values = []
+ z31_new_str = '"{'
+ for i in range(svl):
+ z31_old_values.append(z31.GetChildAtIndex(i).GetValueAsUnsigned())
+ z31_new_str = z31_new_str + ("0x%02x " % (z31_old_values[i] + 3))
+ z31_new_str = z31_new_str + '}"'
+ self.runCmd("reg write z31 %s" % z31_new_str)
+
+ p0_old_values = []
+ p0_new_str = '"{'
+ for i in range(int(svl / 8)):
+ p0_old_values.append(p0.GetChildAtIndex(i).GetValueAsUnsigned())
+ p0_new_str = p0_new_str + ("0x%02x " % (p0_old_values[i] - 5))
+ p0_new_str = p0_new_str + '}"'
+ self.runCmd("reg write p0 %s" % p0_new_str)
+
+ p15_old_values = []
+ p15_new_str = '"{'
+ for i in range(int(svl / 8)):
+ p15_old_values.append(p15.GetChildAtIndex(i).GetValueAsUnsigned())
+ p15_new_str = p15_new_str + ("0x%02x " % (p15_old_values[i] - 8))
+ p15_new_str = p15_new_str + '}"'
+ self.runCmd("reg write p15 %s" % p15_new_str)
+
+ za_old_values = []
+ za_new_str = '"{'
+ for i in range(svl * svl):
+ za_old_values.append(za.GetChildAtIndex(i).GetValueAsUnsigned())
+ za_new_str = za_new_str + ("0x%02x " % (za_old_values[i] + 7))
+ za_new_str = za_new_str + '}"'
+ self.runCmd("reg write za %s" % za_new_str)
+
+ zt0_old_values = []
+ zt0_new_str = '"{'
+ for i in range(64):
+ zt0_old_values.append(zt0.GetChildAtIndex(i).GetValueAsUnsigned())
+ zt0_new_str = zt0_new_str + ("0x%02x " % (zt0_old_values[i] + 2))
+ zt0_new_str = zt0_new_str + '}"'
+ self.runCmd("reg write zt0 %s" % zt0_new_str)
+
+ thread.StepInstruction(False)
+ frame = thread.GetFrameAtIndex(0)
+
+ if self.TraceOn():
+ self.runCmd("reg read -a")
+
+ z0 = frame.register["z0"]
+ for i in range(z0.GetNumChildren()):
+ self.assertEqual(
+ z0_old_values[i] + 5, z0.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
+
+ z31 = frame.register["z31"]
+ for i in range(z31.GetNumChildren()):
+ self.assertEqual(
+ z31_old_values[i] + 3, z31.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
+
+ p0 = frame.register["p0"]
+ for i in range(p0.GetNumChildren()):
+ self.assertEqual(
+ p0_old_values[i] - 5, p0.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
+
+ p15 = frame.register["p15"]
+ for i in range(p15.GetNumChildren()):
+ self.assertEqual(
+ p15_old_values[i] - 8, p15.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
+
+ za = frame.register["za"]
+ for i in range(za.GetNumChildren()):
+ self.assertEqual(
+ za_old_values[i] + 7, za.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
+
+ zt0 = frame.register["zt0"]
+ for i in range(zt0.GetNumChildren()):
+ self.assertEqual(
+ zt0_old_values[i] + 2, zt0.GetChildAtIndex(i).GetValueAsUnsigned()
+ )
diff --git a/lldb/test/API/macosx/sme-registers/main.c b/lldb/test/API/macosx/sme-registers/main.c
new file mode 100644
index 00000000000000..00bbb4a5551622
--- /dev/null
+++ b/lldb/test/API/macosx/sme-registers/main.c
@@ -0,0 +1,123 @@
+/// BUILT with
+/// xcrun -sdk macosx.internal clang -mcpu=apple-m4 -g sme.c -o sme
+
+
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+
+void write_sve_regs() {
+ asm volatile("ptrue p0.b\n\t");
+ asm volatile("ptrue p1.h\n\t");
+ asm volatile("ptrue p2.s\n\t");
+ asm volatile("ptrue p3.d\n\t");
+ asm volatile("pfalse p4.b\n\t");
+ asm volatile("ptrue p5.b\n\t");
+ asm volatile("ptrue p6.h\n\t");
+ asm volatile("ptrue p7.s\n\t");
+ asm volatile("ptrue p8.d\n\t");
+ asm volatile("pfalse p9.b\n\t");
+ asm volatile("ptrue p10.b\n\t");
+ asm volatile("ptrue p11.h\n\t");
+ asm volatile("ptrue p12.s\n\t");
+ asm volatile("ptrue p13.d\n\t");
+ asm volatile("pfalse p14.b\n\t");
+ asm volatile("ptrue p15.b\n\t");
+
+ asm volatile("cpy z0.b, p0/z, #1\n\t");
+ asm volatile("cpy z1.b, p5/z, #2\n\t");
+ asm volatile("cpy z2.b, p10/z, #3\n\t");
+ asm volatile("cpy z3.b, p15/z, #4\n\t");
+ asm volatile("cpy z4.b, p0/z, #5\n\t");
+ asm volatile("cpy z5.b, p5/z, #6\n\t");
+ asm volatile("cpy z6.b, p10/z, #7\n\t");
+ asm volatile("cpy z7.b, p15/z, #8\n\t");
+ asm volatile("cpy z8.b, p0/z, #9\n\t");
+ asm volatile("cpy z9.b, p5/z, #10\n\t");
+ asm volatile("cpy z10.b, p10/z, #11\n\t");
+ asm volatile("cpy z11.b, p15/z, #12\n\t");
+ asm volatile("cpy z12.b, p0/z, #13\n\t");
+ asm volatile("cpy z13.b, p5/z, #14\n\t");
+ asm volatile("cpy z14.b, p10/z, #15\n\t");
+ asm volatile("cpy z15.b, p15/z, #16\n\t");
+ asm volatile("cpy z16.b, p0/z, #17\n\t");
+ asm volatile("cpy z17.b, p5/z, #18\n\t");
+ asm volatile("cpy z18.b, p10/z, #19\n\t");
+ asm volatile("cpy z19.b, p15/z, #20\n\t");
+ asm volatile("cpy z20.b, p0/z, #21\n\t");
+ asm volatile("cpy z21.b, p5/z, #22\n\t");
+ asm volatile("cpy z22.b, p10/z, #23\n\t");
+ asm volatile("cpy z23.b, p15/z, #24\n\t");
+ asm volatile("cpy z24.b, p0/z, #25\n\t");
+ asm volatile("cpy z25.b, p5/z, #26\n\t");
+ asm volatile("cpy z26.b, p10/z, #27\n\t");
+ asm volatile("cpy z27.b, p15/z, #28\n\t");
+ asm volatile("cpy z28.b, p0/z, #29\n\t");
+ asm volatile("cpy z29.b, p5/z, #30\n\t");
+ asm volatile("cpy z30.b, p10/z, #31\n\t");
+ asm volatile("cpy z31.b, p15/z, #32\n\t");
+}
+
+#define MAX_VL_BYTES 256
+void set_za_register(int svl, int value_offset) {
+ uint8_t data[MAX_VL_BYTES];
+
+ // ldr za will actually wrap the selected vector row, by the number of rows
+ // you have. So setting one that didn't exist would actually set one that did.
+ // That's why we need the streaming vector length here.
+ for (int i = 0; i < svl; ++i) {
+ // This may involve instructions that require the smefa64 extension.
+ for (int j = 0; j < MAX_VL_BYTES; j++)
+ data[j] = i + value_offset;
+ // Each one of these loads a VL sized row of ZA.
+ asm volatile("mov w12, %w0\n\t"
+ "ldr za[w12, 0], [%1]\n\t" ::"r"(i),
+ "r"(&data)
+ : "w12");
+ }
+}
+
+static uint16_t
+arm_sme_svl_b(void)
+{
+ uint64_t ret = 0;
+ asm volatile (
+ "rdsvl %[ret], #1"
+ : [ret] "=r"(ret)
+ );
+ return (uint16_t)ret;
+}
+
+
+// lldb/test/API/commands/register/register/aarch64_sme_z_registers/save_restore/main.c
+void
+arm_sme2_set_zt0() {
+#define ZT0_LEN (512 / 8)
+ uint8_t data[ZT0_LEN];
+ for (unsigned i = 0; i < ZT0_LEN; ++i)
+ data[i] = i + 0;
+
+ asm volatile("ldr zt0, [%0]" ::"r"(&data));
+#undef ZT0_LEN
+}
+
+int main()
+{
+
+ printf("Enable SME mode\n");
+
+ asm volatile ("smstart");
+
+ write_sve_regs();
+
+ set_za_register(arm_sme_svl_b(), 4);
+
+ arm_sme2_set_zt0();
+
+ int c = 10; // break here
+ c += 5;
+ c += 5;
+
+ asm volatile ("smstop");
+}
diff --git a/lldb/tools/debugserver/source/DNBDefs.h b/lldb/tools/debugserver/source/DNBDefs.h
index dacee652b3ebfc..df8ca809d412c7 100644
--- a/lldb/tools/debugserver/source/DNBDefs.h
+++ b/lldb/tools/debugserver/source/DNBDefs.h
@@ -312,16 +312,21 @@ struct DNBRegisterValue {
uint64_t uint64;
float float32;
double float64;
- int8_t v_sint8[64];
- int16_t v_sint16[32];
- int32_t v_sint32[16];
- int64_t v_sint64[8];
- uint8_t v_uint8[64];
- uint16_t v_uint16[32];
- uint32_t v_uint32[16];
- uint64_t v_uint64[8];
- float v_float32[16];
- double v_float64[8];
+ // AArch64 SME's ZA register max size is 64k, this object must be
+ // large enough to hold that much data. The current Apple cores
+ // have a much smaller maximum ZA reg size, but there are not
+ // multiple copies of this object so increase the static size to
+ // maximum possible.
+ int8_t v_sint8[65536];
+ int16_t v_sint16[32768];
+ int32_t v_sint32[16384];
+ int64_t v_sint64[8192];
+ uint8_t v_uint8[65536];
+ uint16_t v_uint16[32768];
+ uint32_t v_uint32[16384];
+ uint64_t v_uint64[8192];
+ float v_float32[16384];
+ double v_float64[8192];
void *pointer;
char *c_str;
} value;
diff --git a/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp b/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp
index b6f52cb5cf496d..ba2a8116d68bec 100644
--- a/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp
+++ b/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp
@@ -93,6 +93,55 @@ DNBArchMachARM64::SoftwareBreakpointOpcode(nub_size_t byte_size) {
uint32_t DNBArchMachARM64::GetCPUType() { return CPU_TYPE_ARM64; }
+static std::once_flag g_cpu_has_sme_once;
+bool DNBArchMachARM64::CPUHasSME() {
+ static bool g_has_sme = false;
+ std::call_once(g_cpu_has_sme_once, []() {
+ int ret = 0;
+ size_t size = sizeof(ret);
+ if (sysctlbyname("hw.optional.arm.FEAT_SME", &ret, &size, NULL, 0) != -1)
+ g_has_sme = ret == 1;
+ });
+ return g_has_sme;
+}
+
+static std::once_flag g_cpu_has_sme2_once;
+bool DNBArchMachARM64::CPUHasSME2() {
+ static bool g_has_sme2 = false;
+ std::call_once(g_cpu_has_sme2_once, []() {
+ int ret = 0;
+ size_t size = sizeof(ret);
+ if (sysctlbyname("hw.optional.arm.FEAT_SME2", &ret, &size, NULL, 0) != -1)
+ g_has_sme2 = ret == 1;
+ });
+ return g_has_sme2;
+}
+
+static std::once_flag g_sme_max_svl_once;
+unsigned int DNBArchMachARM64::GetSMEMaxSVL() {
+ static unsigned int g_sme_max_svl = 0;
+ std::call_once(g_sme_max_svl_once, []() {
+ if (CPUHasSME()) {
+ unsigned int ret = 0;
+ size_t size = sizeof(ret);
+ if (sysctlbyname("hw.optional.arm.sme_max_svl_b", &ret, &size, NULL, 0) !=
+ -1)
+ g_sme_max_svl = ret;
+ else
+ g_sme_max_svl = get_svl_bytes();
+ }
+ });
+ return g_sme_max_svl;
+}
+
+// This function can only be called on systems with hw.optional.arm.FEAT_SME
+// It will return the maximum SVL length for this process.
+uint16_t __attribute__((target("sme"))) DNBArchMachARM64::get_svl_bytes(void) {
+ uint64_t ret = 0;
+ asm volatile("rdsvl %[ret], #1" : [ret] "=r"(ret));
+ return (uint16_t)ret;
+}
+
static uint64_t clear_pac_bits(uint64_t value) {
uint32_t addressing_bits = 0;
if (!DNBGetAddressingBits(addressing_bits))
@@ -415,6 +464,103 @@ kern_return_t DNBArchMachARM64::GetDBGState(bool force) {
return kret;
}
+kern_return_t DNBArchMachARM64::GetSVEState(bool force) {
+ int set = e_regSetSVE;
+ // Check if we have valid cached registers
+ if (!force && m_state.GetError(set, Read) == KERN_SUCCESS)
+ return KERN_SUCCESS;
+
+ if (!CPUHasSME())
+ return KERN_INVALID_ARGUMENT;
+
+ // Read the registers from our thread
+ mach_msg_type_number_t count = ARM_SVE_Z_STATE_COUNT;
+ kern_return_t kret =
+ ::thread_get_state(m_thread->MachPortNumber(), ARM_SVE_Z_STATE1,
+ (thread_state_t)&m_state.context.sve.z[0], &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read SVE registers z0..z15 return value %d",
+ kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+
+ count = ARM_SVE_Z_STATE_COUNT;
+ kret = thread_get_state(m_thread->MachPortNumber(), ARM_SVE_Z_STATE2,
+ (thread_state_t)&m_state.context.sve.z[16], &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read SVE registers z16..z31 return value %d",
+ kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+
+ count = ARM_SVE_P_STATE_COUNT;
+ kret = thread_get_state(m_thread->MachPortNumber(), ARM_SVE_P_STATE,
+ (thread_state_t)&m_state.context.sve.p[0], &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read SVE registers p0..p15 return value %d",
+ kret);
+
+ return kret;
+}
+
+kern_return_t DNBArchMachARM64::GetSMEState(bool force) {
+ int set = e_regSetSME;
+ // Check if we have valid cached registers
+ if (!force && m_state.GetError(set, Read) == KERN_SUCCESS)
+ return KERN_SUCCESS;
+
+ if (!CPUHasSME())
+ return KERN_INVALID_ARGUMENT;
+
+ // Read the registers from our thread
+ mach_msg_type_number_t count = ARM_SME_STATE_COUNT;
+ kern_return_t kret =
+ ::thread_get_state(m_thread->MachPortNumber(), ARM_SME_STATE,
+ (thread_state_t)&m_state.context.sme.svcr, &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read ARM_SME_STATE return value %d", kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+
+ memset(m_state.context.sme.za.data(), 0, m_state.context.sme.za.size());
+
+ size_t za_size = m_state.context.sme.svl_b * m_state.context.sme.svl_b;
+ const size_t max_chunk_size = 4096;
+ int n_chunks;
+ size_t chunk_size;
+ if (za_size <= max_chunk_size) {
+ n_chunks = 1;
+ chunk_size = za_size;
+ } else {
+ n_chunks = za_size / max_chunk_size;
+ chunk_size = max_chunk_size;
+ }
+ for (int i = 0; i < n_chunks; i++) {
+ count = ARM_SME_ZA_STATE_COUNT;
+ arm_sme_za_state_t za_state;
+ kret = thread_get_state(m_thread->MachPortNumber(), ARM_SME_ZA_STATE1 + i,
+ (thread_state_t)&za_state, &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read ARM_SME_STATE return value %d", kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+ memcpy(m_state.context.sme.za.data() + (i * chunk_size), &za_state,
+ chunk_size);
+ }
+
+ if (CPUHasSME2()) {
+ count = ARM_SME2_STATE_COUNT;
+ kret = thread_get_state(m_thread->MachPortNumber(), ARM_SME2_STATE,
+ (thread_state_t)&m_state.context.sme.zt0, &count);
+ m_state.SetError(set, Read, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Read ARM_SME2_STATE return value %d", kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+ }
+
+ return kret;
+}
+
kern_return_t DNBArchMachARM64::SetGPRState() {
int set = e_regSetGPR;
kern_return_t kret = ::thread_set_state(
@@ -441,6 +587,80 @@ kern_return_t DNBArchMachARM64::SetVFPState() {
return kret; // Return the error code
}
+kern_return_t DNBArchMachARM64::SetSVEState() {
+ if (!CPUHasSME())
+ return KERN_INVALID_ARGUMENT;
+
+ int set = e_regSetSVE;
+ kern_return_t kret = thread_set_state(
+ m_thread->MachPortNumber(), ARM_SVE_Z_STATE1,
+ (thread_state_t)&m_state.context.sve.z[0], ARM_SVE_Z_STATE_COUNT);
+ m_state.SetError(set, Write, kret);
+ DNBLogThreadedIf(LOG_THREAD, "Write ARM_SVE_Z_STATE1 return value %d", kret);
+ if (kret != KERN_SUCCESS)
+ return kret;
+
+ kret = thread_set_state(m_thread->MachPortNumber(), ARM_SVE_Z_STATE2,
+ (thread_state_t)&m_state.context.sve.z[16],
+ ARM_SVE_Z_STATE_COUNT);
+ m_state.SetError(set, Write, kret);
+ DNBLogThreadedIf(LOG_TH...
[truncated]
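The chunking arithmetic in `GetSMEState` above can be sketched in Python to show how many ZA flavors are read for a given SVL. This mirrors only the `za_size` / `max_chunk_size` computation visible in the diff; `za_chunks` is a hypothetical helper, and the list of valid SVLs is an assumption based on the architectural range quoted in the description.

```python
MAX_CHUNK_BYTES = 4096  # max_chunk_size in the diff


def za_chunks(svl_bytes: int) -> list:
    """Return (offset, length) pairs covering the ZA register, split the
    same way the read loop in the diff splits it."""
    za_size = svl_bytes * svl_bytes
    if za_size <= MAX_CHUNK_BYTES:
        return [(0, za_size)]
    n_chunks = za_size // MAX_CHUNK_BYTES
    return [(i * MAX_CHUNK_BYTES, MAX_CHUNK_BYTES) for i in range(n_chunks)]


for svl in (16, 32, 64, 128, 256):
    chunks = za_chunks(svl)
    # For these SVLs the chunks cover ZA exactly, with no remainder.
    assert sum(length for _, length in chunks) == svl * svl
    print(svl, len(chunks))
```

On an M4 (SVL of 64 bytes) ZA fits in one 4096-byte chunk; at the architectural maximum of 256 bytes it takes 16 chunks.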
✅ With the latest revision this PR passed the Python code formatter.
✅ With the latest revision this PR passed the C/C++ code formatter.
One difference between debugserver and lldb-server is that lldb-server provides "vg" and "svg" registers (vector granule and streaming vector granule, depending on Streaming mode), which give the vector length in 8-byte granules. On Darwin, debugserver provides only "svl", in bytes. I considered having debugserver report the vector length in granules to match the Linux behavior, but the kernel was giving me the value in bytes and I think it's a more natural representation, so I stuck with it.
I assume you did what we did for Linux, where the inactive registers do not disappear from the register list but just fail to read. Linux also returns all 0s for the array register when it's disabled, but I think this is more about keeping the register offsets valid than anything practical.
To repeat what I said over email to Jason - this "vector granule" possibly comes from a time before SVE was finalised and we (Arm) did not know exactly how the vector length would be reported. Our simulators use this "granule" term and it made it into GDB via that. The Linux kernel does not report in granules; it does the same as Darwin and reports bytes. We (Linaro) added "vg" to be compatible with GDB and QEMU. Bytes is the more useful reporting unit because the SVE programming question is always "how many units of X bytes fit into the vector length".
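As a small illustration of the two units being discussed, the sketch below converts a byte-denominated vector length to granules and answers the "how many units of X bytes fit" question. Both function names are hypothetical helpers, not lldb APIs.

```python
GRANULE_BYTES = 8  # a "vector granule" is 8 bytes


def vector_granules(vl_bytes: int) -> int:
    """Convert a vector length in bytes to the vg/svg granule count."""
    return vl_bytes // GRANULE_BYTES


def lanes(vl_bytes: int, element_bytes: int) -> int:
    """How many elements of element_bytes fit into one vector register."""
    if element_bytes not in (1, 2, 4, 8):
        raise ValueError("element size must be 1, 2, 4 or 8 bytes")
    return vl_bytes // element_bytes


svl = 64  # bytes, the M4 maximum from the PR description
print(vector_granules(svl))  # svg as Linux-style tooling would report it
print(lanes(svl, 4))         # number of 32-bit lanes
```

Starting from bytes, the lane count is a single division; starting from granules, the value has to be multiplied by 8 first.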
Lightly skimmed the debugserver parts, I assume one of your colleagues will help there.
// large enough to hold that much data. The current Apple cores
// have a much smaller maximum ZA reg size, but there are not
// multiple copies of this object so increase the static size to
// maximum possible.
For Linux I remember heap allocating the object that represented the array register, because of the potential size. Perhaps that just uses a buffer in the background though.
The problem you have with this is that even `x0` will take up 64k, right? Or is this object used as an overlay to a buffer and doesn't actually get allocated?
Yeah, this object is allocated to read/write a single register, so a read of x0 will be a 64k object. But looking at the debugserver sources, we don't store an array of them anywhere - we read / write individual registers one at a time with this object for a short time period, so I don't think the memory increase is a problem. It might be better to have a dynamically allocated size here though, as you did. I did that for the DNBArm64ArchImpl register contexts stored for each thread, where we will have one for each thread when stopped; that memory use made me more nervous.
I think for Linux we were also stack allocating the register value and I didn't want 64k stack frames everywhere we used one. d99d9d8 in case any of the concerns apply to debugserver also.
(I am also very aware of these issues because in a previous job when we added MIPS MSA support we accidentally turned every register object into 512 bits, even the 8 and 16 bit ones we read from non-MIPS DSP chips)
I looked at the `DNBRegisterValue` use a little more, and I think I want to change it to a heap allocated object, but it's going to touch all of the arch plugins in debugserver, so I will do it as a separate change from this one. On the macOS environment, the single 64k register on the stack isn't blowing anything, but it's not ideal and could cause a problem in our more constrained environments.
+1 on changing this to a heap object. This seems unnecessarily wasteful when not in SME mode, which I expect to remain the majority of the time.
I was thinking of restructuring the internals of the object to heap-allocate the value space, which would require touching all of the DNBArchImpl back-ends, but actually just heap-allocating the object in RNBRemote (the main place this object is created) would be much easier than changing it at all.
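The size concern in this thread follows from how a C union is laid out: it is as large as its largest member, so every value pays for the ZA-sized arrays, including one holding `x0`. A rough sketch using Python's `ctypes`, with two cut-down stand-ins for the `value` union (only two members each, chosen for illustration; the real `DNBRegisterValue` has more fields):

```python
import ctypes


class OldValue(ctypes.Union):
    """Stand-in for the union before the change: 64-byte vector storage."""
    _fields_ = [
        ("uint64", ctypes.c_uint64),
        ("v_uint8", ctypes.c_uint8 * 64),
    ]


class NewValue(ctypes.Union):
    """Stand-in for the union after the change: room for a maximal ZA."""
    _fields_ = [
        ("uint64", ctypes.c_uint64),
        ("v_uint8", ctypes.c_uint8 * 65536),
    ]


old_size = ctypes.sizeof(OldValue)
new_size = ctypes.sizeof(NewValue)
print(old_size, new_size)

# Even a value that only ever uses the 8-byte uint64 member occupies the
# full union size.
x0 = NewValue()
x0.uint64 = 0x1234
print(ctypes.sizeof(x0))
```

Heap-allocating the object where it is created, as suggested above, keeps that footprint off the stack without changing the union's layout.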
1. Confirm that we cannot read SME/SVE regs when not in SSVE mode.
2. Make it clearer how I'm modifying all of the SVE/SME registers, then instruction stepping, then reading them back to confirm that they were modified.

Remove `DNBArchMachARM64::get_svl_bytes`, depend entirely on the hw.optional.arm.sme_max_svl_b sysctl to get the system's maximum SVL, instead of debugserver's maximum SVL. They're always the same today, but it's possible to imagine it not being like that in the future.
Updated the API test case as per David's suggestions, and removed the code that was using debugserver's SVL as the hardware maximum, depending entirely on the newer sysctl to get the correct value instead.
I did a pass and left some nits but I'm not enough of an expert to review this in more detail. I like the test coverage and I appreciate @DavidSpickett taking the time to review this!
if (reg_value != fail_value && reg_value <= 256) {
  svg_reg_value = reg_value / 8;
  // Apple hardware only implements Streaming SVE mode, so
  // the non-streaming Vector Length is not reported by the
  // kernel. Set both svg and vg to this svl value.
  if (!vg_reg_value)
    vg_reg_value = reg_value / 8;

Nit: lots of magic values here but in all fairness that's consistent with the surrounding code. The comment covers the 8 byte granule so I'm not too concerned, though some constants might make this easier to read.
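
To illustrate the nit, here is a minimal sketch of how the same range check and division might read with named constants. The names `kMaxArchSVLBytes`, `kGranuleBytes`, and `svl_bytes_to_granules` are hypothetical and do not appear in the patch, and treating zero as the "could not read" case is an assumption made only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical names for illustration; they do not appear in the patch.
constexpr uint64_t kMaxArchSVLBytes = 256; // largest SVL the architecture allows
constexpr uint64_t kGranuleBytes = 8;      // vg/svg are counted in 8-byte granules

// Convert an SVL in bytes to the vg/svg granule count. Returns std::nullopt
// when the value is zero or larger than the architectural maximum
// (assumption: zero stands in for the "could not read" case here).
std::optional<uint64_t> svl_bytes_to_granules(uint64_t svl_bytes) {
  if (svl_bytes == 0 || svl_bytes > kMaxArchSVLBytes)
    return std::nullopt;
  return svl_bytes / kGranuleBytes;
}
```

With the M4's 64-byte SVL this yields 8 granules, which is the value the snippet above assigns to both svg and vg.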
@@ -0,0 +1,113 @@
/// BUILT with

Nit: `s/built/build/`. But also this is covered by the makefile so maybe something like "Requires -mcpu=apple-m4" would be more to the point.
// large enough to hold that much data. The current Apple cores
// have a much smaller maximum ZA reg size, but there are not
// multiple copies of this object so increase the static size to
// maximum possible.

+1 on changing this to a heap object. This seems unnecessarily wasteful when not in SME mode, which I expect to remain the majority of the time.
#include <mach/mach.h>
#include <stdint.h>

// define the SVE/SME/SME2 thread status structures

- // define the SVE/SME/SME2 thread status structures
+ // Define the SVE/SME/SME2 thread status structures
@@ -2567,10 +2568,13 @@ rnb_err_t RNBRemote::HandlePacket_QSetProcessEvent(const char *p) {
  return SendPacket("OK");
}

void register_value_in_hex_fixed_width(std::ostream &ostrm, nub_process_t pid,
// if a fail_value is provided, a correct-length reply is always provided,

- // if a fail_value is provided, a correct-length reply is always provided,
+ // If a fail_value is provided, a correct-length reply is always provided,
// Pad out the reply to the correct size to maintain correct offsets,
// even if we could not read the register value.
std::vector<uint8_t> zeros(reg->nub_info.size, *fail_value);

Nit: You call the vector `zeros` assuming that's the fail value, but maybe `failed` would be more generic.
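
The padding behavior discussed here can be sketched roughly as follows. This is not the patch's `register_value_in_hex_fixed_width`; `hex_encode` and `encode_register` are hypothetical helpers, and modelling a failed read as an empty `std::optional` is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hex-encode bytes as two lowercase hex digits per byte.
std::string hex_encode(const std::vector<uint8_t> &bytes) {
  static const char digits[] = "0123456789abcdef";
  std::string out;
  out.reserve(bytes.size() * 2);
  for (uint8_t b : bytes) {
    out.push_back(digits[b >> 4]);
    out.push_back(digits[b & 0xf]);
  }
  return out;
}

// Hypothetical helper, not the patch's function. `value` is empty when the
// register read failed.
std::optional<std::string>
encode_register(const std::optional<std::vector<uint8_t>> &value,
                size_t reg_size, std::optional<uint8_t> fail_value) {
  if (value) {
    assert(value->size() == reg_size);
    return hex_encode(*value);
  }
  if (!fail_value)
    return std::nullopt; // no fail value given: report the read as an error
  // A fail value was given: pad to the register's full width so that the
  // registers that follow keep their fixed offsets in the reply.
  return hex_encode(std::vector<uint8_t>(reg_size, *fail_value));
}
```

A caller building a whole-context reply would pass a fail value, while a caller answering a single-register request would pass none and surface the error.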
TestRegisters.py expected there to be no registers unable to be read. With an SME system when in non-SME mode, all of the SVE/SME registers will show as unreadable.

TestGdbRemoteGPacket.py was testing the g/G packet with debugserver (which lldb doesn't use any more) and this exposed the fact that I wasn't handling SME/SVE registers correctly in DNBArchImplArm64::GetRegisterContext/SetRegisterContext. Fix those to correctly account for the size of these register contexts (the ZA register is stored in a vector, instead of having a compile-time register context size).

Also, RNBRemote::HandlePacket_G() was not handling the thread specifier properly (it would try to interpret it as content for the packet, and the hex decoding would fail). I think this test might be setting the current thread with Hg<tid> before it sends only "g" or "G", that might work.

In debugserver, changed all stack-allocated DNBRegisterValue objects to be unique_ptr managed so they're on the heap. This object is now 64k and putting that on the stack could be a problem. We don't have multiple DNBRegisterValue objects alive at the same time, so I didn't dynamically size it to the maximum ZA register size on the current machine.

Fix the comment formatting suggestions from Jonas.
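
As a rough illustration of the stack-to-heap change described in that commit message, the sketch below uses a hypothetical `BigRegisterValue` struct as a stand-in for `DNBRegisterValue`; the real type has more fields and a different layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for DNBRegisterValue, sized for a ZA register at the
// architectural maximum SVL of 256 bytes (256 * 256 = 65536 bytes).
struct BigRegisterValue {
  uint8_t bytes[256 * 256];
};

// Allocate the value on the heap, so the caller's stack frame only holds a
// pointer. std::make_unique value-initializes, so the bytes start zeroed.
std::unique_ptr<BigRegisterValue> make_register_value() {
  return std::make_unique<BigRegisterValue>();
}
```

The 64k cost is still paid on every allocation, but it no longer lands on the stack of whichever packet handler happens to need a register value.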
I pushed an update addressing Jonas' suggestions, and also fixing two testsuite issues I found by testing the patch on an SME and a non-SME machine running the same OS. I'm still seeing one bonus failure in TestFirmwareCorefiles.py on the SME system that I need to debug, but the rest of the testsuite looks good.
When the process is not in Streaming SVE Mode, all of the SME thread_get_state calls will fail. Zero out the buffers that we were going to write data into, so we don't leave old data in them that might be sent to lldb (particularly in a GetRegisterContext "g" packet response, where we need to read all registers to complete the response).
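
A minimal sketch of that zero-on-failure pattern, assuming a hypothetical `StateReader` callback in place of the real thread_get_state call (the patch's actual buffers and call sites are not reproduced here):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical reader signature standing in for a thread_get_state call:
// fills the buffer and returns true on success, returns false on failure.
using StateReader = std::function<bool(uint8_t *, size_t)>;

// Try to read register state into buf. On failure, clear the buffer so no
// stale bytes from an earlier read can leak into a later reply.
bool read_state_or_zero(std::vector<uint8_t> &buf, const StateReader &reader) {
  assert(reader);
  if (reader(buf.data(), buf.size()))
    return true;
  std::fill(buf.begin(), buf.end(), uint8_t{0});
  return false;
}
```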
I debugged the two test failures I'm seeing on an M4; neither is related to the SME changes. I will handle those two issues separately; this is ready to land.
LGTM!
@@ -32,11 +50,19 @@ def test_register_commands(self):
        # verify that logging does not assert
        self.log_enable("registers")

        error_str_matched = False
        if self.get_sme_available() == True and self.platformIsDarwin():

-         if self.get_sme_available() == True and self.platformIsDarwin():
+         if self.get_sme_available() and self.platformIsDarwin():
I've taken the liberty of repeating the fact that the required APIs are not in a publicly available OS release at the top of the PR description, since we had someone on Discord try this and they could not read the registers. It might save a few enthusiasts some time if they see that note first. The good thing is that lldb didn't break:
I think that's exactly what should be happening on an unsupported OS version.
Thanks, I missed that. Yeah, the unrecognized register flavor will be rejected by the thread_get_state call and we'll report the registers as unavailable on current macOSes. I've got the macOS version where it is available in the test already, so I'll make the comment more explicit.
Thanks.
**Note:** The register reading and writing depends on new register flavor support in thread_get_state/thread_set_state in the kernel, which will be first available in macOS 15.4.

The Apple M4 line of cores includes the Scalable Matrix Extension (SME) feature. The M4s do not implement the Scalable Vector Extension (SVE), although the processor is in Streaming SVE Mode when SME is being used. The most obvious side effects of being in SSVE Mode are that (on the M4 cores) NEON instructions cannot be used, and watchpoints may get false positives because the address comparisons are done at a lowered granularity.

When SSVE mode is enabled, the kernel will provide the Streaming Vector Length register, which is a maximum of 64 bytes with the M4. Also provided are SVCR (with bits indicating if SSVE mode and SME mode are enabled), TPIDR2, SVL. Then the SVE registers Z0..31 (SVL bytes long), P0..15 (SVL/8 bytes), the ZA matrix register (SVL*SVL bytes), and, because the M4 supports SME2, the ZT0 register (64 bytes).

When SSVE/SME are disabled, none of these registers are provided by the kernel - reads and writes of them will fail.

Unlike Linux, lldb cannot modify the SVL through a thread_set_state call, or change the processor state's SSVE/SME status. There is also no way for a process to request a lowered SVL size today, so the work that David did to handle VL/SVL changing while stepping through a process is not an issue on Darwin today. But debugserver should be providing everything necessary so we can reuse all of David's work on resizing the register contexts in lldb if it happens in the future. debugserver sends svl, svcr, and tpidr2 in the expedited registers when a thread stops, if SSVE|SME mode are enabled (if the kernel allows it to read the ARM_SME_STATE register set).

While the maximum SVL is 64 bytes on M4, the AArch64 maximum possible SVL is 256; this would give us a 64k ZA register. If debugserver sized all of its register contexts assuming the largest possible SVL, we could easily use 2MB more memory for the register contexts of all threads in a process -- and on iOS et al, processes must run within a small memory allotment and this would push us over that.

Much of the work in debugserver was changing the arm64 register context from being a static compile-time array of register sets, to being initialized at runtime if debugserver is running on a machine with SME. The ZA is only created to the machine's actual maximum SVL. The size of the 32 SVE Z registers is less significant so I am statically allocating those to the architecturally largest possible SVL value today.

Also, debugserver includes information about registers that share the same part of the register file. e.g. S0 and D0 are the lower parts of the NEON 128-bit V0 register. And when running on an SME machine, v0 is the lower 128 bits of the SVE Z0 register. So the register maps used when defining the VFP registers must differ depending on the capabilities of the cpu at runtime.

I also changed register reading in debugserver, where formerly when debugserver was asked to read a register, and the thread_get_state read of that register failed, it would return all zeros. This is necessary when constructing a `g` packet that gets all registers - because there is no separation between register bytes, the offsets are fixed. But when we are asking for a single register (e.g. Z0) when not in SSVE/SME mode, this should return an error.

This does mean that when you're running on an SME capable machine, but not in SME mode, and do `register read -a`, lldb will report that 48 SVE registers were unavailable and 5 SME registers were unavailable. But that's only when `-a` is used.

The register reading and writing depends on new register flavor support in thread_get_state/thread_set_state in the kernel, which is not yet in a release. The test case I wrote is skipped on current OSes. I pilfered the SME register setup from some of David's existing SME test files; there were a few Linux specific details in those tests, so they weren't easy to reuse on Darwin.

rdar://121608074

(cherry picked from commit 46e7823)
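
The register sizes quoted in the description all follow from the SVL, and the arithmetic can be sketched as below. `SMERegisterSizes` and `sme_register_sizes` are hypothetical names for illustration, not debugserver code; the formulas are the ones stated in the description (Z is SVL bytes, P is SVL/8, ZA is SVL*SVL, ZT0 is 64 bytes).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical struct for illustration; sizes are in bytes.
struct SMERegisterSizes {
  size_t z_bytes;   // each of Z0..Z31
  size_t p_bytes;   // each of P0..P15
  size_t za_bytes;  // the ZA matrix register
  size_t zt0_bytes; // ZT0 (SME2), fixed size
};

SMERegisterSizes sme_register_sizes(size_t svl_bytes) {
  // The SVL is a power of two between 16 and 256 bytes.
  assert(svl_bytes >= 16 && svl_bytes <= 256 &&
         (svl_bytes & (svl_bytes - 1)) == 0);
  return {svl_bytes, svl_bytes / 8, svl_bytes * svl_bytes, 64};
}
```

For the M4's 64-byte SVL this gives a 4096-byte ZA register. At the architectural maximum of 256 bytes, ZA is 65536 bytes, which is the 64k figure above and the reason ZA is sized to the machine's actual maximum SVL rather than the architectural one.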