[lldb][debugserver] Read/write SME registers on arm64 #119171

Merged

Conversation

jasonmolenda
Collaborator

@jasonmolenda jasonmolenda commented Dec 9, 2024

Note: The register reading and writing depends on new register flavor support in thread_get_state/thread_set_state in the kernel, which will be first available in macOS 15.4.

The Apple M4 line of cores includes the Scalable Matrix Extension (SME) feature. The M4s do not implement the Scalable Vector Extension (SVE), although the processor is in Streaming SVE (SSVE) mode when SME is being used. The most obvious side effects of being in SSVE mode are that (on the M4 cores) NEON instructions cannot be used, and watchpoints may get false positives because the address comparisons are done at a lowered granularity.

When SSVE mode is enabled, the kernel will provide the Streaming Vector Length (SVL) register, which is a maximum of 64 bytes on the M4. Also provided are SVCR (with bits indicating whether SSVE mode and SME mode are enabled) and TPIDR2. Then come the SVE registers Z0..31 (SVL bytes long) and P0..15 (SVL/8 bytes), the ZA matrix register (SVL*SVL bytes), and, because the M4 supports SME2, the ZT0 register (64 bytes).
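
As a rough illustration of the sizes listed above, here is a minimal Python sketch. The helper name `sme_register_sizes` is made up for this example and is not part of lldb or debugserver; it only restates the arithmetic from the paragraph.

```python
def sme_register_sizes(svl_bytes):
    """Return per-register sizes in bytes for a streaming vector length.

    Hypothetical helper for illustration only. The sizes follow the
    description above: Z is SVL bytes, P is SVL/8, ZA is SVL*SVL, and
    ZT0 is a fixed 64 bytes.
    """
    return {
        "z": svl_bytes,                # each of Z0..Z31
        "p": svl_bytes // 8,           # each of P0..P15
        "za": svl_bytes * svl_bytes,   # the ZA matrix register
        "zt0": 64,                     # SME2 ZT0, independent of SVL
    }


m4_sizes = sme_register_sizes(64)     # 64 bytes is the M4 maximum SVL
arch_sizes = sme_register_sizes(256)  # architectural maximum SVL
print(m4_sizes)
print(arch_sizes)
```

With the M4's 64-byte SVL the ZA register is 4096 bytes; at the architectural maximum of 256 bytes it grows to 65536 bytes.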

When SSVE/SME are disabled, none of these registers are provided by the kernel - reads and writes of them will fail.

Unlike on Linux, lldb cannot modify the SVL through a thread_set_state call, or change the processor state's SSVE/SME status. There is also no way for a process to request a lowered SVL size today, so the work that David did to handle VL/SVL changing while stepping through a process is not an issue on Darwin today. But debugserver should be providing everything necessary so we can reuse all of David's work on resizing the register contexts in lldb if it happens in the future. debugserver sends svl, svcr, and tpidr2 in the expedited registers when a thread stops, if SSVE or SME mode is enabled (that is, if the kernel allows it to read the ARM_SME_STATE register set).
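
On the lldb side, the ArchitectureAArch64.cpp change in this patch converts the Darwin `svl` value (bytes) into the 8-byte granule count that the existing SVE/SME resizing code expects, and uses it for both `svg` and `vg`. A minimal Python sketch of that conversion follows; the function name is hypothetical, and treating 0 as "not available" is an assumption of this sketch, not something the patch states.

```python
def svl_bytes_to_granules(svl_bytes, max_svl_bytes=256):
    """Convert a Darwin 'svl' value in bytes to 8-byte granules.

    Hypothetical helper. Returns None when the value is missing
    (assumed here to be 0) or larger than the architectural maximum.
    """
    if svl_bytes <= 0 or svl_bytes > max_svl_bytes:
        return None
    return svl_bytes // 8


svg = svl_bytes_to_granules(64)  # an M4 reporting svl = 64 bytes
# Apple hardware only implements Streaming SVE mode, so there is no
# separate non-streaming vector length; reuse the same value for vg.
vg = svg
print(svg, vg)
```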

While the maximum SVL is 64 bytes on M4, the AArch64 maximum possible SVL is 256 bytes; this would give us a 64 KiB ZA register. If debugserver sized all of its register contexts assuming the largest possible SVL, we could easily use 2MB more memory for the register contexts of all threads in a process. On iOS et al., processes must run within a small memory allotment, and this would push us over it.
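
To make the memory concern concrete, here is a small worked calculation in Python. The thread count of 32 is an assumption chosen for illustration; it does not come from any measurement in this PR.

```python
ARCH_MAX_SVL = 256  # bytes, architectural maximum SVL
M4_MAX_SVL = 64     # bytes, maximum SVL on M4 per the description


def za_storage_bytes(num_threads, svl_bytes):
    """ZA storage needed if each thread's context holds one SVL*SVL ZA."""
    return num_threads * svl_bytes * svl_bytes


threads = 32  # hypothetical number of threads in the debugged process
worst_case = za_storage_bytes(threads, ARCH_MAX_SVL)
actual = za_storage_bytes(threads, M4_MAX_SVL)
print(worst_case, actual)
```

Under that assumption, sizing ZA for the architectural maximum costs 2 MiB across the threads, against 128 KiB when sized for the M4's real maximum.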

Much of the work in debugserver was changing the arm64 register context from being a static compile-time array of register sets to being initialized at runtime, depending on whether debugserver is running on a machine with SME. Storage for ZA is sized only to the machine's actual maximum SVL. The size of the 32 SVE Z registers is less significant, so I am statically allocating those to the architecturally largest possible SVL value today.

Also, debugserver includes information about registers that share the same part of the register file. For example, S0 and D0 are the lower parts of the NEON 128-bit V0 register, and when running on an SME machine, V0 is the lower 128 bits of the SVE Z0 register. So the register maps used when defining the VFP registers must differ depending on the capabilities of the CPU at runtime.
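
The overlap can be pictured with plain byte slicing. This Python sketch assumes the register bytes are stored lowest byte first (little-endian), and the contents of `z0` are made up for the example.

```python
# Pretend contents of Z0 for a 64-byte SVL: byte i holds the value i.
z0 = bytes(range(64))

# Each narrower register is the low-order prefix of the wider one.
v0 = z0[:16]  # NEON V0: low 128 bits of Z0
d0 = v0[:8]   # D0: low 64 bits of V0
s0 = v0[:4]   # S0: low 32 bits of V0

print(v0.hex())
print(hex(int.from_bytes(d0, "little")))
print(hex(int.from_bytes(s0, "little")))
```

On a machine without SME there is no Z0, so V0 is the widest register in that chain, which is why the register map has to be chosen at runtime.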

I also changed register reading in debugserver. Formerly, when debugserver was asked to read a register and the thread_get_state read of that register failed, it would return all zeros. This is necessary when constructing a `g` packet that gets all registers: there is no separation between register bytes, so the offsets are fixed. But when we are asking for a single register (e.g. Z0) when not in SSVE/SME mode, this should return an error.
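
The difference between the two cases can be sketched in Python. The register layout, the function names, and the `E01` error code below are all hypothetical; only the general shape (hex-encoded bytes at fixed offsets for `g`, an `Exx` style error reply for a failed single read) reflects the gdb-remote protocol.

```python
# Hypothetical register layout: (name, size in bytes). Offsets in the
# g payload are implied by this order, so every slot must be filled.
LAYOUT = [("x0", 8), ("x1", 8), ("z0", 64)]


def build_g_payload(values):
    """Build a g-style payload; `values` lacks names whose read failed."""
    out = bytearray()
    for name, size in LAYOUT:
        data = values.get(name)
        if data is None:
            data = bytes(size)  # zero-fill so later offsets stay correct
        out += data
    return out.hex()


def read_single_register(values, name):
    """Answer a single-register read, reporting failure as an error."""
    data = values.get(name)
    if data is None:
        return "E01"  # placeholder error reply for this sketch
    return data.hex()


# z0 is absent, as it would be when the thread is not in SSVE/SME mode.
regs = {"x0": (1).to_bytes(8, "little"), "x1": (2).to_bytes(8, "little")}
print(build_g_payload(regs))
print(read_single_register(regs, "z0"))
```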

This does mean that when you're running on an SME capable machine, but not in SME mode, and do `register read -a`, lldb will report that 48 SVE registers (Z0..31 and P0..15) were unavailable and 5 SME registers were unavailable. But that's only when `-a` is used.

The register reading and writing depend on new register flavor support in thread_get_state/thread_set_state in the kernel, which is not yet in a release. The test case I wrote is skipped on current OSes. I pilfered the SME register setup from some of David's existing SME test files; there were a few Linux-specific details in those tests, so they weren't easy to reuse on Darwin.

rdar://121608074

@llvmbot
Member

llvmbot commented Dec 9, 2024

@llvm/pr-subscribers-lldb

Author: Jason Molenda (jasonmolenda)

Changes


Patch is 67.81 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/119171.diff

9 Files Affected:

  • (modified) lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp (+19)
  • (added) lldb/test/API/macosx/sme-registers/Makefile (+5)
  • (added) lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py (+164)
  • (added) lldb/test/API/macosx/sme-registers/main.c (+123)
  • (modified) lldb/tools/debugserver/source/DNBDefs.h (+15-10)
  • (modified) lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp (+720-186)
  • (modified) lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.h (+66-7)
  • (added) lldb/tools/debugserver/source/MacOSX/arm64/sme_thread_status.h (+86)
  • (modified) lldb/tools/debugserver/source/RNBRemote.cpp (+49-38)
diff --git a/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp b/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp
index 181ba4e7d87721..6a072354972acd 100644
--- a/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp
+++ b/lldb/source/Plugins/Architecture/AArch64/ArchitectureAArch64.cpp
@@ -100,6 +100,25 @@ bool ArchitectureAArch64::ReconfigureRegisterInfo(DynamicRegisterInfo &reg_info,
     if (reg_value != fail_value && reg_value <= 32)
       svg_reg_value = reg_value;
   }
+  if (!svg_reg_value) {
+    const RegisterInfo *darwin_svg_reg_info = reg_info.GetRegisterInfo("svl");
+    if (darwin_svg_reg_info) {
+      uint32_t svg_reg_num = darwin_svg_reg_info->kinds[eRegisterKindLLDB];
+      uint64_t reg_value =
+          reg_context.ReadRegisterAsUnsigned(svg_reg_num, fail_value);
+      // UpdateARM64SVERegistersInfos and UpdateARM64SMERegistersInfos
+      // expect the number of 8-byte granules; darwin provides number of
+      // bytes.
+      if (reg_value != fail_value && reg_value <= 256) {
+        svg_reg_value = reg_value / 8;
+        // Apple hardware only implements Streaming SVE mode, so
+        // the non-streaming Vector Length is not reported by the
+        // kernel. Set both svg and vg to this svl value.
+        if (!vg_reg_value)
+          vg_reg_value = reg_value / 8;
+      }
+    }
+  }
 
   if (!vg_reg_value && !svg_reg_value)
     return false;
diff --git a/lldb/test/API/macosx/sme-registers/Makefile b/lldb/test/API/macosx/sme-registers/Makefile
new file mode 100644
index 00000000000000..d4173d262ed270
--- /dev/null
+++ b/lldb/test/API/macosx/sme-registers/Makefile
@@ -0,0 +1,5 @@
+C_SOURCES := main.c
+
+CFLAGS_EXTRAS := -mcpu=apple-m4
+
+include Makefile.rules
diff --git a/lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py b/lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py
new file mode 100644
index 00000000000000..82a5eb0dc81a6b
--- /dev/null
+++ b/lldb/test/API/macosx/sme-registers/TestSMERegistersDarwin.py
@@ -0,0 +1,164 @@
+import lldb
+from lldbsuite.test.lldbtest import *
+from lldbsuite.test.decorators import *
+import lldbsuite.test.lldbutil as lldbutil
+import os
+
+
+class TestSMERegistersDarwin(TestBase):
+
+    NO_DEBUG_INFO_TESTCASE = True
+    mydir = TestBase.compute_mydir(__file__)
+
+    @skipIfRemote
+    @skipUnlessDarwin
+    @skipUnlessFeature("hw.optional.arm.FEAT_SME")
+    @skipUnlessFeature("hw.optional.arm.FEAT_SME2")
+    # thread_set_state/thread_get_state only avail in macOS 15.4+
+    @skipIf(macos_version=["<", "15.4"])
+    def test(self):
+        """Test that we can read the contents of the SME/SVE registers on Darwin"""
+        self.build()
+        (target, process, thread, bkpt) = lldbutil.run_to_source_breakpoint(
+            self, "break here", lldb.SBFileSpec("main.c")
+        )
+        frame = thread.GetFrameAtIndex(0)
+        self.assertTrue(frame.IsValid())
+
+        if self.TraceOn():
+            self.runCmd("reg read -a")
+
+        svl_reg = frame.register["svl"]
+        svl = svl_reg.GetValueAsUnsigned()
+
+        # SSVE and SME modes should be enabled (reflecting PSTATE.SM and PSTATE.ZA)
+        svcr = frame.register["svcr"]
+        self.assertEqual(svcr.GetValueAsUnsigned(), 3)
+
+        z0 = frame.register["z0"]
+        self.assertEqual(z0.GetNumChildren(), svl)
+        self.assertEqual(z0.GetChildAtIndex(0).GetValueAsUnsigned(), 0x1)
+        self.assertEqual(z0.GetChildAtIndex(svl - 1).GetValueAsUnsigned(), 0x1)
+
+        z31 = frame.register["z31"]
+        self.assertEqual(z31.GetNumChildren(), svl)
+        self.assertEqual(z31.GetChildAtIndex(0).GetValueAsUnsigned(), 32)
+        self.assertEqual(z31.GetChildAtIndex(svl - 1).GetValueAsUnsigned(), 32)
+
+        p0 = frame.register["p0"]
+        self.assertEqual(p0.GetNumChildren(), svl / 8)
+        self.assertEqual(p0.GetChildAtIndex(0).GetValueAsUnsigned(), 0xFF)
+        self.assertEqual(
+            p0.GetChildAtIndex(p0.GetNumChildren() - 1).GetValueAsUnsigned(), 0xFF
+        )
+
+        p15 = frame.register["p15"]
+        self.assertEqual(p15.GetNumChildren(), svl / 8)
+        self.assertEqual(p15.GetChildAtIndex(0).GetValueAsUnsigned(), 0xFF)
+        self.assertEqual(
+            p15.GetChildAtIndex(p15.GetNumChildren() - 1).GetValueAsUnsigned(), 0xFF
+        )
+
+        za = frame.register["za"]
+        self.assertEqual(za.GetNumChildren(), (svl * svl))
+        za_0 = za.GetChildAtIndex(0)
+        self.assertEqual(za_0.GetValueAsUnsigned(), 4)
+        za_final = za.GetChildAtIndex(za.GetNumChildren() - 1)
+        self.assertEqual(za_final.GetValueAsUnsigned(), 67)
+
+        zt0 = frame.register["zt0"]
+        self.assertEqual(zt0.GetNumChildren(), 64)
+        zt0_0 = zt0.GetChildAtIndex(0)
+        self.assertEqual(zt0_0.GetValueAsUnsigned(), 0)
+        zt0_final = zt0.GetChildAtIndex(63)
+        self.assertEqual(zt0_final.GetValueAsUnsigned(), 63)
+
+        z0_old_values = []
+        z0_new_str = '"{'
+        for i in range(svl):
+            z0_old_values.append(z0.GetChildAtIndex(i).GetValueAsUnsigned())
+            z0_new_str = z0_new_str + ("0x%02x " % (z0_old_values[i] + 5))
+        z0_new_str = z0_new_str + '}"'
+        self.runCmd("reg write z0 %s" % z0_new_str)
+
+        z31_old_values = []
+        z31_new_str = '"{'
+        for i in range(svl):
+            z31_old_values.append(z31.GetChildAtIndex(i).GetValueAsUnsigned())
+            z31_new_str = z31_new_str + ("0x%02x " % (z31_old_values[i] + 3))
+        z31_new_str = z31_new_str + '}"'
+        self.runCmd("reg write z31 %s" % z31_new_str)
+
+        p0_old_values = []
+        p0_new_str = '"{'
+        for i in range(int(svl / 8)):
+            p0_old_values.append(p0.GetChildAtIndex(i).GetValueAsUnsigned())
+            p0_new_str = p0_new_str + ("0x%02x " % (p0_old_values[i] - 5))
+        p0_new_str = p0_new_str + '}"'
+        self.runCmd("reg write p0 %s" % p0_new_str)
+
+        p15_old_values = []
+        p15_new_str = '"{'
+        for i in range(int(svl / 8)):
+            p15_old_values.append(p15.GetChildAtIndex(i).GetValueAsUnsigned())
+            p15_new_str = p15_new_str + ("0x%02x " % (p15_old_values[i] - 8))
+        p15_new_str = p15_new_str + '}"'
+        self.runCmd("reg write p15 %s" % p15_new_str)
+
+        za_old_values = []
+        za_new_str = '"{'
+        for i in range(svl * svl):
+            za_old_values.append(za.GetChildAtIndex(i).GetValueAsUnsigned())
+            za_new_str = za_new_str + ("0x%02x " % (za_old_values[i] + 7))
+        za_new_str = za_new_str + '}"'
+        self.runCmd("reg write za %s" % za_new_str)
+
+        zt0_old_values = []
+        zt0_new_str = '"{'
+        for i in range(64):
+            zt0_old_values.append(zt0.GetChildAtIndex(i).GetValueAsUnsigned())
+            zt0_new_str = zt0_new_str + ("0x%02x " % (zt0_old_values[i] + 2))
+        zt0_new_str = zt0_new_str + '}"'
+        self.runCmd("reg write zt0 %s" % zt0_new_str)
+
+        thread.StepInstruction(False)
+        frame = thread.GetFrameAtIndex(0)
+
+        if self.TraceOn():
+            self.runCmd("reg read -a")
+
+        z0 = frame.register["z0"]
+        for i in range(z0.GetNumChildren()):
+            self.assertEqual(
+                z0_old_values[i] + 5, z0.GetChildAtIndex(i).GetValueAsUnsigned()
+            )
+
+        z31 = frame.register["z31"]
+        for i in range(z31.GetNumChildren()):
+            self.assertEqual(
+                z31_old_values[i] + 3, z31.GetChildAtIndex(i).GetValueAsUnsigned()
+            )
+
+        p0 = frame.register["p0"]
+        for i in range(p0.GetNumChildren()):
+            self.assertEqual(
+                p0_old_values[i] - 5, p0.GetChildAtIndex(i).GetValueAsUnsigned()
+            )
+
+        p15 = frame.register["p15"]
+        for i in range(p15.GetNumChildren()):
+            self.assertEqual(
+                p15_old_values[i] - 8, p15.GetChildAtIndex(i).GetValueAsUnsigned()
+            )
+
+        za = frame.register["za"]
+        for i in range(za.GetNumChildren()):
+            self.assertEqual(
+                za_old_values[i] + 7, za.GetChildAtIndex(i).GetValueAsUnsigned()
+            )
+
+        zt0 = frame.register["zt0"]
+        for i in range(zt0.GetNumChildren()):
+            self.assertEqual(
+                zt0_old_values[i] + 2, zt0.GetChildAtIndex(i).GetValueAsUnsigned()
+            )
diff --git a/lldb/test/API/macosx/sme-registers/main.c b/lldb/test/API/macosx/sme-registers/main.c
new file mode 100644
index 00000000000000..00bbb4a5551622
--- /dev/null
+++ b/lldb/test/API/macosx/sme-registers/main.c
@@ -0,0 +1,123 @@
+///  BUILT with
+///     xcrun -sdk macosx.internal clang -mcpu=apple-m4 -g sme.c -o sme 
+
+
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+
+void write_sve_regs() {
+  asm volatile("ptrue p0.b\n\t");
+  asm volatile("ptrue p1.h\n\t");
+  asm volatile("ptrue p2.s\n\t");
+  asm volatile("ptrue p3.d\n\t");
+  asm volatile("pfalse p4.b\n\t");
+  asm volatile("ptrue p5.b\n\t");
+  asm volatile("ptrue p6.h\n\t");
+  asm volatile("ptrue p7.s\n\t");
+  asm volatile("ptrue p8.d\n\t");
+  asm volatile("pfalse p9.b\n\t");
+  asm volatile("ptrue p10.b\n\t");
+  asm volatile("ptrue p11.h\n\t");
+  asm volatile("ptrue p12.s\n\t");
+  asm volatile("ptrue p13.d\n\t");
+  asm volatile("pfalse p14.b\n\t");
+  asm volatile("ptrue p15.b\n\t");
+
+  asm volatile("cpy  z0.b, p0/z, #1\n\t");
+  asm volatile("cpy  z1.b, p5/z, #2\n\t");
+  asm volatile("cpy  z2.b, p10/z, #3\n\t");
+  asm volatile("cpy  z3.b, p15/z, #4\n\t");
+  asm volatile("cpy  z4.b, p0/z, #5\n\t");
+  asm volatile("cpy  z5.b, p5/z, #6\n\t");
+  asm volatile("cpy  z6.b, p10/z, #7\n\t");
+  asm volatile("cpy  z7.b, p15/z, #8\n\t");
+  asm volatile("cpy  z8.b, p0/z, #9\n\t");
+  asm volatile("cpy  z9.b, p5/z, #10\n\t");
+  asm volatile("cpy  z10.b, p10/z, #11\n\t");
+  asm volatile("cpy  z11.b, p15/z, #12\n\t");
+  asm volatile("cpy  z12.b, p0/z, #13\n\t");
+  asm volatile("cpy  z13.b, p5/z, #14\n\t");
+  asm volatile("cpy  z14.b, p10/z, #15\n\t");
+  asm volatile("cpy  z15.b, p15/z, #16\n\t");
+  asm volatile("cpy  z16.b, p0/z, #17\n\t");
+  asm volatile("cpy  z17.b, p5/z, #18\n\t");
+  asm volatile("cpy  z18.b, p10/z, #19\n\t");
+  asm volatile("cpy  z19.b, p15/z, #20\n\t");
+  asm volatile("cpy  z20.b, p0/z, #21\n\t");
+  asm volatile("cpy  z21.b, p5/z, #22\n\t");
+  asm volatile("cpy  z22.b, p10/z, #23\n\t");
+  asm volatile("cpy  z23.b, p15/z, #24\n\t");
+  asm volatile("cpy  z24.b, p0/z, #25\n\t");
+  asm volatile("cpy  z25.b, p5/z, #26\n\t");
+  asm volatile("cpy  z26.b, p10/z, #27\n\t");
+  asm volatile("cpy  z27.b, p15/z, #28\n\t");
+  asm volatile("cpy  z28.b, p0/z, #29\n\t");
+  asm volatile("cpy  z29.b, p5/z, #30\n\t");
+  asm volatile("cpy  z30.b, p10/z, #31\n\t");
+  asm volatile("cpy  z31.b, p15/z, #32\n\t");
+}
+
+#define MAX_VL_BYTES 256
+void set_za_register(int svl, int value_offset) {
+  uint8_t data[MAX_VL_BYTES];
+
+  // ldr za will actually wrap the selected vector row, by the number of rows
+  // you have. So setting one that didn't exist would actually set one that did.
+  // That's why we need the streaming vector length here.
+  for (int i = 0; i < svl; ++i) {
+    // This may involve instructions that require the smefa64 extension.
+    for (int j = 0; j < MAX_VL_BYTES; j++)
+      data[j] = i + value_offset;
+    // Each one of these loads a VL sized row of ZA.
+    asm volatile("mov w12, %w0\n\t"
+                 "ldr za[w12, 0], [%1]\n\t" ::"r"(i),
+                 "r"(&data)
+                 : "w12");
+  }
+}
+
+static uint16_t
+arm_sme_svl_b(void)
+{
+        uint64_t ret = 0;
+        asm volatile (
+                "rdsvl  %[ret], #1"
+                : [ret] "=r"(ret)
+        );
+        return (uint16_t)ret;
+}
+
+
+// lldb/test/API/commands/register/register/aarch64_sme_z_registers/save_restore/main.c
+void
+arm_sme2_set_zt0() {
+#define ZT0_LEN (512 / 8)
+    uint8_t data[ZT0_LEN];
+    for (unsigned i = 0; i < ZT0_LEN; ++i)
+      data[i] = i + 0;
+
+    asm volatile("ldr zt0, [%0]" ::"r"(&data));
+#undef ZT0_LEN
+}
+
+int main()
+{
+
+  printf("Enable SME mode\n");
+
+  asm volatile ("smstart");
+ 
+  write_sve_regs();
+
+  set_za_register(arm_sme_svl_b(), 4);
+
+  arm_sme2_set_zt0();
+
+  int c = 10; // break here
+  c += 5;
+  c += 5;
+
+  asm volatile ("smstop");
+}
diff --git a/lldb/tools/debugserver/source/DNBDefs.h b/lldb/tools/debugserver/source/DNBDefs.h
index dacee652b3ebfc..df8ca809d412c7 100644
--- a/lldb/tools/debugserver/source/DNBDefs.h
+++ b/lldb/tools/debugserver/source/DNBDefs.h
@@ -312,16 +312,21 @@ struct DNBRegisterValue {
     uint64_t uint64;
     float float32;
     double float64;
-    int8_t v_sint8[64];
-    int16_t v_sint16[32];
-    int32_t v_sint32[16];
-    int64_t v_sint64[8];
-    uint8_t v_uint8[64];
-    uint16_t v_uint16[32];
-    uint32_t v_uint32[16];
-    uint64_t v_uint64[8];
-    float v_float32[16];
-    double v_float64[8];
+    // AArch64 SME's ZA register max size is 64k, this object must be
+    // large enough to hold that much data.  The current Apple cores
+    // have a much smaller maximum ZA reg size, but there are not
+    // multiple copies of this object so increase the static size to
+    // maximum possible.
+    int8_t v_sint8[65536];
+    int16_t v_sint16[32768];
+    int32_t v_sint32[16384];
+    int64_t v_sint64[8192];
+    uint8_t v_uint8[65536];
+    uint16_t v_uint16[32768];
+    uint32_t v_uint32[16384];
+    uint64_t v_uint64[8192];
+    float v_float32[16384];
+    double v_float64[8192];
     void *pointer;
     char *c_str;
   } value;
diff --git a/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp b/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp
index b6f52cb5cf496d..ba2a8116d68bec 100644
--- a/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp
+++ b/lldb/tools/debugserver/source/MacOSX/arm64/DNBArchImplARM64.cpp
@@ -93,6 +93,55 @@ DNBArchMachARM64::SoftwareBreakpointOpcode(nub_size_t byte_size) {
 
 uint32_t DNBArchMachARM64::GetCPUType() { return CPU_TYPE_ARM64; }
 
+static std::once_flag g_cpu_has_sme_once;
+bool DNBArchMachARM64::CPUHasSME() {
+  static bool g_has_sme = false;
+  std::call_once(g_cpu_has_sme_once, []() {
+    int ret = 0;
+    size_t size = sizeof(ret);
+    if (sysctlbyname("hw.optional.arm.FEAT_SME", &ret, &size, NULL, 0) != -1)
+      g_has_sme = ret == 1;
+  });
+  return g_has_sme;
+}
+
+static std::once_flag g_cpu_has_sme2_once;
+bool DNBArchMachARM64::CPUHasSME2() {
+  static bool g_has_sme2 = false;
+  std::call_once(g_cpu_has_sme2_once, []() {
+    int ret = 0;
+    size_t size = sizeof(ret);
+    if (sysctlbyname("hw.optional.arm.FEAT_SME2", &ret, &size, NULL, 0) != -1)
+      g_has_sme2 = ret == 1;
+  });
+  return g_has_sme2;
+}
+
+static std::once_flag g_sme_max_svl_once;
+unsigned int DNBArchMachARM64::GetSMEMaxSVL() {
+  static unsigned int g_sme_max_svl = 0;
+  std::call_once(g_sme_max_svl_once, []() {
+    if (CPUHasSME()) {
+      unsigned int ret = 0;
+      size_t size = sizeof(ret);
+      if (sysctlbyname("hw.optional.arm.sme_max_svl_b", &ret, &size, NULL, 0) !=
+          -1)
+        g_sme_max_svl = ret;
+      else
+        g_sme_max_svl = get_svl_bytes();
+    }
+  });
+  return g_sme_max_svl;
+}
+
+// This function can only be called on systems with hw.optional.arm.FEAT_SME
+// It will return the maximum SVL length for this process.
+uint16_t __attribute__((target("sme"))) DNBArchMachARM64::get_svl_bytes(void) {
+  uint64_t ret = 0;
+  asm volatile("rdsvl	%[ret], #1" : [ret] "=r"(ret));
+  return (uint16_t)ret;
+}
+
 static uint64_t clear_pac_bits(uint64_t value) {
   uint32_t addressing_bits = 0;
   if (!DNBGetAddressingBits(addressing_bits))
@@ -415,6 +464,103 @@ kern_return_t DNBArchMachARM64::GetDBGState(bool force) {
   return kret;
 }
 
+kern_return_t DNBArchMachARM64::GetSVEState(bool force) {
+  int set = e_regSetSVE;
+  // Check if we have valid cached registers
+  if (!force && m_state.GetError(set, Read) == KERN_SUCCESS)
+    return KERN_SUCCESS;
+
+  if (!CPUHasSME())
+    return KERN_INVALID_ARGUMENT;
+
+  // Read the registers from our thread
+  mach_msg_type_number_t count = ARM_SVE_Z_STATE_COUNT;
+  kern_return_t kret =
+      ::thread_get_state(m_thread->MachPortNumber(), ARM_SVE_Z_STATE1,
+                         (thread_state_t)&m_state.context.sve.z[0], &count);
+  m_state.SetError(set, Read, kret);
+  DNBLogThreadedIf(LOG_THREAD, "Read SVE registers z0..z15 return value %d",
+                   kret);
+  if (kret != KERN_SUCCESS)
+    return kret;
+
+  count = ARM_SVE_Z_STATE_COUNT;
+  kret = thread_get_state(m_thread->MachPortNumber(), ARM_SVE_Z_STATE2,
+                          (thread_state_t)&m_state.context.sve.z[16], &count);
+  m_state.SetError(set, Read, kret);
+  DNBLogThreadedIf(LOG_THREAD, "Read SVE registers z16..z31 return value %d",
+                   kret);
+  if (kret != KERN_SUCCESS)
+    return kret;
+
+  count = ARM_SVE_P_STATE_COUNT;
+  kret = thread_get_state(m_thread->MachPortNumber(), ARM_SVE_P_STATE,
+                          (thread_state_t)&m_state.context.sve.p[0], &count);
+  m_state.SetError(set, Read, kret);
+  DNBLogThreadedIf(LOG_THREAD, "Read SVE registers p0..p15 return value %d",
+                   kret);
+
+  return kret;
+}
+
+kern_return_t DNBArchMachARM64::GetSMEState(bool force) {
+  int set = e_regSetSME;
+  // Check if we have valid cached registers
+  if (!force && m_state.GetError(set, Read) == KERN_SUCCESS)
+    return KERN_SUCCESS;
+
+  if (!CPUHasSME())
+    return KERN_INVALID_ARGUMENT;
+
+  // Read the registers from our thread
+  mach_msg_type_number_t count = ARM_SME_STATE_COUNT;
+  kern_return_t kret =
+      ::thread_get_state(m_thread->MachPortNumber(), ARM_SME_STATE,
+                         (thread_state_t)&m_state.context.sme.svcr, &count);
+  m_state.SetError(set, Read, kret);
+  DNBLogThreadedIf(LOG_THREAD, "Read ARM_SME_STATE return value %d", kret);
+  if (kret != KERN_SUCCESS)
+    return kret;
+
+  memset(m_state.context.sme.za.data(), 0, m_state.context.sme.za.size());
+
+  size_t za_size = m_state.context.sme.svl_b * m_state.context.sme.svl_b;
+  const size_t max_chunk_size = 4096;
+  int n_chunks;
+  size_t chunk_size;
+  if (za_size <= max_chunk_size) {
+    n_chunks = 1;
+    chunk_size = za_size;
+  } else {
+    n_chunks = za_size / max_chunk_size;
+    chunk_size = max_chunk_size;
+  }
+  for (int i = 0; i < n_chunks; i++) {
+    count = ARM_SME_ZA_STATE_COUNT;
+    arm_sme_za_state_t za_state;
+    kret = thread_get_state(m_thread->MachPortNumber(), ARM_SME_ZA_STATE1 + i,
+                            (thread_state_t)&za_state, &count);
+    m_state.SetError(set, Read, kret);
+    DNBLogThreadedIf(LOG_THREAD, "Read ARM_SME_ZA_STATE return value %d",
+                     kret);
+    if (kret != KERN_SUCCESS)
+      return kret;
+    memcpy(m_state.context.sme.za.data() + (i * chunk_size), &za_state,
+           chunk_size);
+  }
+
+  if (CPUHasSME2()) {
+    count = ARM_SME2_STATE_COUNT;
+    kret = thread_get_state(m_thread->MachPortNumber(), ARM_SME2_STATE,
+                            (thread_state_t)&m_state.context.sme.zt0, &count);
+    m_state.SetError(set, Read, kret);
+    DNBLogThreadedIf(LOG_THREAD, "Read ARM_SME2_STATE return value %d", kret);
+    if (kret != KERN_SUCCESS)
+      return kret;
+  }
+
+  return kret;
+}
+
 kern_return_t DNBArchMachARM64::SetGPRState() {
   int set = e_regSetGPR;
   kern_return_t kret = ::thread_set_state(
@@ -441,6 +587,80 @@ kern_return_t DNBArchMachARM64::SetVFPState() {
   return kret;                             // Return the error code
 }
 
+kern_return_t DNBArchMachARM64::SetSVEState() {
+  if (!CPUHasSME())
+    return KERN_INVALID_ARGUMENT;
+
+  int set = e_regSetSVE;
+  kern_return_t kret = thread_set_state(
+      m_thread->MachPortNumber(), ARM_SVE_Z_STATE1,
+      (thread_state_t)&m_state.context.sve.z[0], ARM_SVE_Z_STATE_COUNT);
+  m_state.SetError(set, Write, kret);
+  DNBLogThreadedIf(LOG_THREAD, "Write ARM_SVE_Z_STATE1 return value %d", kret);
+  if (kret != KERN_SUCCESS)
+    return kret;
+
+  kret = thread_set_state(m_thread->MachPortNumber(), ARM_SVE_Z_STATE2,
+                          (thread_state_t)&m_state.context.sve.z[16],
+                          ARM_SVE_Z_STATE_COUNT);
+  m_state.SetError(set, Write, kret);
+  DNBLogThreadedIf(LOG_TH...
[truncated]

@jasonmolenda jasonmolenda requested review from DavidSpickett and JDevlieghere and removed request for JDevlieghere December 9, 2024 06:32

github-actions bot commented Dec 9, 2024

✅ With the latest revision this PR passed the Python code formatter.

github-actions bot commented Dec 9, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@jasonmolenda
Collaborator Author

One difference between debugserver and lldb-server is that lldb-server provides "vg" and "svg" registers (vector granule and streaming vector granule, depending on Streaming mode), which give the vector length in 8-byte granules. On Darwin, debugserver provides only "svl", in bytes. I considered having debugserver report the vector length in granules to match the Linux behavior, but the kernel was giving me the value in bytes and I think it's a more natural representation, so I stuck with it.

@DavidSpickett
Collaborator

> But when we are asking for a single register (e.g. Z0) when not in SSVE/SME mode, this should return an error.

I assume you did what we did for Linux, where the inactive registers do not disappear from the register list but just fail to read.

Linux also returns all 0s for the array register when it's disabled, but this is I think more about keeping the register offsets valid than anything practical.

> I considered having debugserver report the vector length in granules to match the Linux behavior, but the kernel was giving me the value in bytes and I think it's a more natural representation, so I stuck with it.

To repeat what I said over email to Jason -

This "vector granule" possibly comes from a time before SVE was finalised and we (Arm) did not know exactly how the vector length would be reported. Our simulators use this "granule" term and it made it into GDB via that.

The Linux kernel does not report in granules; it does the same as Darwin and reports bytes. We (Linaro) added "vg" to be compatible with GDB and QEMU.

Bytes is the more useful reporting unit because the SVE programming question is always "how many units of X bytes fit into the vector length".

Collaborator

@DavidSpickett DavidSpickett left a comment

Lightly skimmed the debugserver parts, I assume one of your colleagues will help there.

// large enough to hold that much data. The current Apple cores
// have a much smaller maximum ZA reg size, but there are not
// multiple copies of this object so increase the static size to
// maximum possible.
Collaborator

For Linux I remember heap allocating the object that represented the array register, because of the potential size. Perhaps that just uses a buffer in the background though.

The problem you have with this is that even x0 will take up 64k, right? Or is this object used as an overlay to a buffer and doesn't actually get allocated?

Collaborator Author

Yeah, this object is allocated to read/write a single register, so a read of x0 will be a 64k object. But looking at the debugserver sources, we don't store an array of them anywhere; we read and write individual registers one at a time with this object for a short time period, so I don't think the memory increase is a problem. It might be better to have a dynamically allocated size here though, as you did. I did that for the DNBArm64ArchImpl register contexts stored for each thread, since we will have one for each thread when stopped and that memory use made me more nervous.

Collaborator

I think for Linux we were also stack allocating the register value and I didn't want 64k stack frames everywhere we used one. d99d9d8 in case any of the concerns apply to debugserver also.

(I am also very aware of these issues because in a previous job when we added MIPS MSA support we accidentally turned every register object into 512 bits, even the 8 and 16 bit ones we read from non-MIPS DSP chips)

Collaborator Author

I looked at the DNBRegisterValue use a little more, and I think I want to change it to a heap allocated object, but it's going to touch all of the arch plugins in debugserver, so I will do it as a separate change from this one. On the macOS environment, the single 64k register on the stack isn't blowing anything up, but it's not ideal and could cause a problem in our more constrained environments.

Member

+1 on changing this to a heap object. This seems unnecessarily wasteful when not in SME mode, which I expect to remain the majority of the time.

Collaborator Author

I was thinking of restructuring the internals of the object to heap-allocate the value space, which would require touching all of the DNBArchImpl back-ends, but actually just heap-allocating the object in RNBRemote (the main place this object is created) would be much easier than changing it at all.

 1. confirm that we cannot read SME/SVE regs
    when not in SSVE mode.
 2. Make it clearer how I'm modifying all of the
    SVE/SME registers, then instruction stepping, then
    reading them back to confirm that they were modified.

Remove `DNBArchMachARM64::get_svl_bytes`, depend entirely on the
hw.optional.arm.sme_max_svl_b sysctl to get the system's maximum
SVL, instead of debugserver's maximum SVL.  They're always the same
today, but it's possible to imagine it not being like that in the
future.
@jasonmolenda
Collaborator Author

Updated the API test case as per David's suggestions, and removed the code that was using debugserver's SVL as the hardware maximum, depending entirely on the newer sysctl to get the correct value instead.

Member

@JDevlieghere JDevlieghere left a comment

I did a pass and left some nits but I'm not enough of an expert to review this in more detail. I like the test coverage and I appreciate @DavidSpickett taking the time to review this!

Comment on lines +112 to +118
if (reg_value != fail_value && reg_value <= 256) {
  svg_reg_value = reg_value / 8;
  // Apple hardware only implements Streaming SVE mode, so
  // the non-streaming Vector Length is not reported by the
  // kernel. Set both svg and vg to this svl value.
  if (!vg_reg_value)
    vg_reg_value = reg_value / 8;
Member

Nit: lots of magic values here but in all fairness that's consistent with the surrounding code. The comment covers the 8 byte granule so I'm not too concerned, though some constants might make this easier to read.
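
As an illustration of the nit above, the byte-to-granule conversion could be written with named constants. This is only a sketch: `kGranuleBytes`, `kMaxSVLBytes`, and `svl_bytes_to_granules` are hypothetical names that do not appear in the patch, and rejecting a value that is not a whole number of granules is an assumption added here. A return value of 0 stands for "unknown", mirroring the `!vg_reg_value` check in the quoted lines.

```cpp
#include <cstdint>

// Hypothetical constants; the patch itself uses the literals 8 and 256.
constexpr uint64_t kGranuleBytes = 8;  // one vector granule is 8 bytes
constexpr uint64_t kMaxSVLBytes = 256; // architectural maximum SVL in bytes

// Convert a streaming vector length in bytes to a granule count.
// Returns 0 when the input is out of range or not a whole number of
// granules (the second check is an assumption of this sketch).
constexpr uint64_t svl_bytes_to_granules(uint64_t svl_bytes) {
  return (svl_bytes > kMaxSVLBytes || svl_bytes % kGranuleBytes != 0)
             ? 0
             : svl_bytes / kGranuleBytes;
}
```

With these definitions, an SVL of 64 bytes maps to 8 granules and an SVL of 256 bytes maps to 32.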

@@ -0,0 +1,113 @@
/// BUILT with
Member

Nit: s/built/build/. But also this is covered by the makefile so maybe something like "Requires -mcpu=apple-m4" would be more to the point.

#include <mach/mach.h>
#include <stdint.h>

// define the SVE/SME/SME2 thread status structures
Member

Suggested change
- // define the SVE/SME/SME2 thread status structures
+ // Define the SVE/SME/SME2 thread status structures

@@ -2567,10 +2568,13 @@ rnb_err_t RNBRemote::HandlePacket_QSetProcessEvent(const char *p) {
return SendPacket("OK");
}

void register_value_in_hex_fixed_width(std::ostream &ostrm, nub_process_t pid,
// if a fail_value is provided, a correct-length reply is always provided,
Member

Suggested change
- // if a fail_value is provided, a correct-length reply is always provided,
+ // If a fail_value is provided, a correct-length reply is always provided,


// Pad out the reply to the correct size to maintain correct offsets,
// even if we could not read the register value.
std::vector<uint8_t> zeros(reg->nub_info.size, *fail_value);
Member

Nit: You call the vector zeros assuming that's the fail value, but maybe failed would be more generic.

TestRegisters.py expected there to be no registers unable to be
read.  With an SME system when in non-SME mode, all of the SVE/SME
registers will show as unreadable.

TestGdbRemoteGPacket.py was testing the g/G packet with debugserver
(which lldb doesn't use any more) and this exposed the fact that I
wasn't handling SME/SVE registers correctly in
DNBArchImplArm64::GetRegisterContext/SetRegisterContext.  Fix those
to correctly account for the size of these register contexts (the
ZA register is stored in a vector, instead of having a compile-time
register context size).  Also, RNBRemote::HandlePacket_G() was not
handling the thread specifier properly (it would try to interpret
it as content for the packet, and the hex decoding would fail).
I think this test might be setting the current thread with Hg<tid>
before it sends only "g" or "G", that might work.

In debugserver, changed all stack-allocated DNBRegisterValue objects
to be unique_ptr managed so they're on heap.  This object is now
64k and putting that on the stack could be a problem.  We don't
have multiple DNBRegisterValue objects alive at the same time, so
I didn't dynamically size it to the maximum ZA register size on
the current machine.
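
The stack-versus-heap point in this commit message can be sketched as follows. `BigRegisterValue` and `make_register_value` are hypothetical stand-ins, not the real `DNBRegisterValue` definition; the only detail taken from the discussion is that the object must be able to hold a 64k ZA register.

```cpp
#include <array>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a register value object large enough to hold
// a ZA register at the architectural maximum SVL (256 * 256 bytes).
struct BigRegisterValue {
  std::array<uint8_t, 256 * 256> bytes;
};

// Allocate the object on the heap. Only the unique_ptr itself lives in
// the caller's stack frame, and the storage is released automatically
// when the pointer goes out of scope.
std::unique_ptr<BigRegisterValue> make_register_value() {
  return std::make_unique<BigRegisterValue>(); // value-initialized to zero
}
```

A caller would hold the result in a local `std::unique_ptr` for the duration of a single register read or write, which keeps the stack frame small without changing the object's layout.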

Fix the comment formatting suggestions from Jonas.
@jasonmolenda
Collaborator Author

I pushed an update addressing Jonas' suggestions, and also fixing two testsuite issues I found by testing the patch on an SME and non-SME machine running the same OS. I'm still seeing one bonus failure in TestFirmwareCorefiles.py on the SME system that I need to debug, but the rest of the testsuite looks good.

When the process is not in Streaming SVE Mode, all of the SME
thread_get_state calls will fail.  Zero out the buffers that we
were going to write data into, so we don't leave old data in them
that might be sent to lldb (particularly in a GetRegisterContext
"g" packet response, where we need to read all registers to complete
the response).
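
A minimal sketch of the "zero first, then read" idea described above, combined with a chunked read of a large buffer. `ChunkReader` and `read_za` are hypothetical; the reader callback stands in for a per-chunk `thread_get_state` call and is not the debugserver API. Clearing the buffer again after a failed chunk is a choice made in this sketch so that no partial data survives.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for one per-chunk kernel read: fills `dst` with
// `len` bytes for chunk number `index` and returns true on success.
using ChunkReader =
    std::function<bool(size_t index, uint8_t *dst, size_t len)>;

// Clear the destination before reading so a failed read cannot leave
// stale bytes behind, then read it in chunks of at most `max_chunk` bytes.
bool read_za(std::vector<uint8_t> &za, size_t max_chunk,
             const ChunkReader &read_chunk) {
  std::fill(za.begin(), za.end(), 0);
  if (max_chunk == 0)
    return false;
  size_t index = 0;
  for (size_t offset = 0; offset < za.size(); offset += max_chunk, ++index) {
    const size_t len = std::min(max_chunk, za.size() - offset);
    if (!read_chunk(index, za.data() + offset, len)) {
      // Drop any chunks that were already copied in.
      std::fill(za.begin(), za.end(), 0);
      return false;
    }
  }
  return true;
}
```

When the reader fails, as the SME reads do outside Streaming SVE Mode, the caller is left with an all-zero buffer that is safe to place in a fixed-offset reply.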
@jasonmolenda
Collaborator Author

I debugged the two test failures I'm seeing on an M4; neither is related to the SME changes. I will handle those two issues separately. This is ready to land.

Member

@JDevlieghere JDevlieghere left a comment

LGTM!

@@ -32,11 +50,19 @@ def test_register_commands(self):
# verify that logging does not assert
self.log_enable("registers")

error_str_matched = False
if self.get_sme_available() == True and self.platformIsDarwin():
Member

Suggested change
- if self.get_sme_available() == True and self.platformIsDarwin():
+ if self.get_sme_available() and self.platformIsDarwin():

@DavidSpickett
Collaborator

I've taken the liberty of repeating the fact that the required APIs are not in a publicly available OS release, at the top of the PR description.

Since we had someone on Discord try this and they could not read the registers. Might save a few enthusiasts some time if they see that note first.

The good thing is that lldb didn't break:

Process 24700 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step over
    frame #0: 0x0000000100003f90 sme`main at main.c:13:5
   10   int main(void) {
   11       __asm("smstart");
   12       __asm("ptrue p0.b");
   13       __asm("fmla z0.s, p0/m, z30.s, z31.s");
-> 14       __asm("smopa za0.s, p0/m, p0/m, z30.b, z30.b");
   15       __asm("luti2  z5.h, zt0, z0[2]");
   16       __asm("smstop");
[...]
# register read -a
Scalable Matrix Extension Registers:
4 registers were unavailable.

I think that's exactly what should be happening on an unsupported OS version.

@jasonmolenda
Collaborator Author

> I've taken the liberty of repeating the fact that the required APIs are not in a publicly available OS release, at the top of the PR description.

Thanks, I missed that. Yeah the unrecognized register flavor will be rejected by the thread_get_state call and we'll report the registers as unavailable on current macOSes. I've got the macOS version where it is available in the test already so I'll make the comment more explicit.

@digantdesai

Thanks.

@jasonmolenda jasonmolenda merged commit 46e7823 into llvm:main Dec 19, 2024
5 of 6 checks passed
@jasonmolenda jasonmolenda deleted the add-sme-register-support-to-debugserver branch December 19, 2024 17:57
jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Dec 19, 2024
**Note:** The register reading and writing depends on new register
flavor support in thread_get_state/thread_set_state in the kernel, which
will be first available in macOS 15.4.

The Apple M4 line of cores includes the Scalable Matrix Extension (SME)
feature. The M4s do not implement Scalable Vector Extension (SVE),
although the processor is in Streaming SVE Mode when the SME is being
used. The most obvious side effects of being in SSVE Mode are that (on
the M4 cores) NEON instructions cannot be used, and watchpoints may get
false positives because the address comparisons are done at a lowered
granularity.

When SSVE mode is enabled, the kernel will provide the Streaming Vector
Length register, which is a maximum of 64 bytes with the M4. Also
provided are SVCR (with bits indicating if SSVE mode and SME mode are
enabled), TPIDR2, SVL. Then the SVE registers Z0..31 (SVL bytes long),
P0..15 (SVL/8 bytes), the ZA matrix register (SVL*SVL bytes), and the M4
supports SME2, so the ZT0 register (64 bytes).

When SSVE/SME are disabled, none of these registers are provided by the
kernel - reads and writes of them will fail.

Unlike Linux, lldb cannot modify the SVL through a thread_set_state
call, or change the processor state's SSVE/SME status. There is also no
way for a process to request a lowered SVL size today, so the work that
David did to handle VL/SVL changing while stepping through a process is
not an issue on Darwin today. But debugserver should be providing
everything necessary so we can reuse all of David's work on resizing the
register contexts in lldb if it happens in the future. debugserver
sends svl, svcr, and tpidr2 in the expedited registers when a thread
stops, if SSVE|SME mode are enabled (if the kernel allows it to read the
ARM_SME_STATE register set).

While the maximum SVL is 64 bytes on M4, the AArch64 maximum possible
SVL is 256; this would give us a 64k ZA register. If debugserver sized
all of its register contexts assuming the largest possible SVL, we could
easily use 2MB more memory for the register contexts of all threads in a
process -- and on iOS et al, processes must run within a small memory
allotment and this would push us over that.
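
The size arithmetic in the paragraphs above can be checked with a small sketch. `SMERegisterSizes` and `sme_register_sizes` are hypothetical names; the formulas are the ones listed in this description (Z registers are SVL bytes, P registers SVL/8 bytes, ZA is SVL*SVL bytes, ZT0 is 64 bytes).

```cpp
#include <cstddef>

// Sizes in bytes of the SVE/SME registers for a given streaming vector
// length (SVL). Hypothetical helper type for illustration only.
struct SMERegisterSizes {
  size_t z;   // each of Z0..Z31
  size_t p;   // each of P0..P15
  size_t za;  // the ZA matrix
  size_t zt0; // ZT0 (SME2), fixed size
};

constexpr SMERegisterSizes sme_register_sizes(size_t svl_bytes) {
  return {svl_bytes, svl_bytes / 8, svl_bytes * svl_bytes, 64};
}

// M4 maximum SVL of 64 bytes gives a 4 KiB ZA register.
static_assert(sme_register_sizes(64).za == 4096, "64 * 64 bytes");
// The architectural maximum SVL of 256 bytes gives a 64 KiB ZA register.
static_assert(sme_register_sizes(256).za == 65536, "256 * 256 bytes");
```

The 16x difference between the two ZA sizes is why the ZA storage is sized to the machine's actual maximum SVL rather than the architectural one.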

Much of the work in debugserver was changing the arm64 register context
from being a static compile-time array of register sets, to being
initialized at runtime if debugserver is running on a machine with SME.
The ZA is only created to the machine's actual maximum SVL. The size of
the 32 SVE Z registers is less significant so I am statically allocating
those to the architecturally largest possible SVL value today.

Also, debugserver includes information about registers that share the
same part of the register file. e.g. S0 and D0 are the lower parts of
the NEON 128-bit V0 register. And when running on an SME machine, v0 is
the lower 128 bits of the SVE Z0 register. So the register maps used
when defining the VFP registers must differ depending on the
capabilities of the cpu at runtime.
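
One way to picture the runtime-dependent overlap information is a pair of small helpers that list which larger registers contain a given register. This is a sketch under assumptions: the function names and the use of string register names are hypothetical and do not reflect how debugserver actually stores its register maps.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: registers that contain Sn or Dn. They are always
// the low part of Vn, and also of Zn when the CPU has SME.
std::vector<std::string> registers_containing_s_or_d(unsigned n,
                                                     bool cpu_has_sme) {
  std::vector<std::string> containers = {"v" + std::to_string(n)};
  if (cpu_has_sme)
    containers.push_back("z" + std::to_string(n));
  return containers;
}

// Hypothetical helper: registers that contain Vn. On an SME machine Vn is
// the low 128 bits of Zn; otherwise nothing contains it.
std::vector<std::string> registers_containing_v(unsigned n,
                                                bool cpu_has_sme) {
  std::vector<std::string> containers;
  if (cpu_has_sme)
    containers.push_back("z" + std::to_string(n));
  return containers;
}
```

Because the answer depends on `cpu_has_sme`, a table like this cannot be a single compile-time constant, which matches the motivation given above for building the register maps at runtime.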

I also changed register reading in debugserver, where formerly when
debugserver was asked to read a register, and the thread_get_state read
of that register failed, it would return all zeros. This is necessary
when constructing a `g` packet that gets all registers - because there
is no separation between register bytes, the offsets are fixed. But when
we are asking for a single register (e.g. Z0) when not in SSVE/SME mode,
this should return an error.
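
The two behaviors described above (pad a failed read inside a `g` packet, report an error for a single-register read) can be sketched with one helper. `register_hex` and its parameters are hypothetical and are not the debugserver function; a null `value` models a failed `thread_get_state` read, and a null `fail_value` models a caller that wants failures reported.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper, not the debugserver implementation.
// `value` is nullptr when the register read failed. `fail_value` is
// nullptr when the caller wants a failed read surfaced as an error.
bool register_hex(const std::vector<uint8_t> *value, size_t reg_size,
                  const uint8_t *fail_value, std::string &hex_out) {
  std::vector<uint8_t> bytes;
  if (value)
    bytes = *value;
  else if (fail_value)
    bytes.assign(reg_size, *fail_value); // pad so later offsets stay fixed
  else
    return false; // single-register read: report the failure

  hex_out.clear();
  char buf[3];
  for (uint8_t b : bytes) {
    std::snprintf(buf, sizeof(buf), "%02x", static_cast<unsigned>(b));
    hex_out += buf;
  }
  return true;
}
```

A `g` packet builder would pass a fail value so every register occupies its full width, while a single-register read would pass `nullptr` and turn a `false` return into an error reply.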

This does mean that when you're running on an SME-capable machine, but
not in SME mode, and do `register read -a`, lldb will report that 48 SVE
registers were unavailable and 5 SME registers were unavailable. But
that's only when `-a` is used.

The register reading and writing depends on new register flavor support
in thread_get_state/thread_set_state in the kernel, which is not yet in
a release. The test case I wrote is skipped on current OSes. I pilfered
the SME register setup from some of David's existing SME test files;
there were enough Linux-specific details in those tests that they weren't
easy to reuse on Darwin.

rdar://121608074
(cherry picked from commit 46e7823)
JDevlieghere added a commit to swiftlang/llvm-project that referenced this pull request Dec 19, 2024
…ster-support

[lldb][debugserver] Read/write SME registers on arm64 (llvm#119171)