Skip to content

Conversation

harp-intel
Copy link
Contributor

currently assumes that all Gaudi devices installed in a system are of the same microarchitecture

@harp-intel harp-intel requested a review from Copilot August 29, 2025 17:29
@harp-intel harp-intel linked an issue Aug 29, 2025 that may be closed by this pull request
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds Gaudi microarchitecture information to the system report by detecting the device type based on PCI IDs and including it in the Gaudi device table.

  • Adds a new script to detect Gaudi microarchitecture (gaudi, gaudi2, gaudi3) based on PCI device IDs
  • Updates the Gaudi struct and table to include microarchitecture information
  • Modifies table field mapping to accommodate the new microarchitecture column

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
internal/script/script_defs.go Adds new script definition for detecting Gaudi microarchitecture via PCI ID matching
internal/report/table_helpers.go Updates Gaudi struct with Microarchitecture field and populates it from script output
internal/report/table_defs.go Adds microarchitecture field to Gaudi table definition and updates field mappings

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

__DEFAULT_HL_DEVICE="gaudi"
elif echo $__pcidev | grep -qE '1020|1030'; then
__DEFAULT_HL_DEVICE="gaudi2"
elif echo $__pcidev | grep -qE '106[0-9]'; then
Copy link
Preview

Copilot AI Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern '106[0-9]' is overly broad and could match unintended PCI IDs. Consider using a more specific pattern like '1060|1061|1062' if those are the exact IDs for gaudi3, or document the rationale for the broad range.

Suggested change
elif echo $__pcidev | grep -qE '106[0-9]'; then
elif echo $__pcidev | grep -qE '1060|1061|1062'; then

Copilot uses AI. Check for mistakes.

sort.Slice(gaudis, func(i, j int) bool {
return gaudis[i].ModuleID < gaudis[j].ModuleID
})
// set microarchitecture (assumes same arch for all gaudi devices)
Copy link
Preview

Copilot AI Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation assigns the same microarchitecture to all Gaudi devices, but the PR description states this assumption may not be valid. Consider detecting microarchitecture per device or updating the comment to reflect this known limitation.

Suggested change
// set microarchitecture (assumes same arch for all gaudi devices)
// Set microarchitecture for all Gaudi devices.
// NOTE: This assumes all Gaudi devices have the same microarchitecture, which may not be valid.
// See PR description and CodeQL warning for details. If per-device microarchitecture is needed,
// update this code to parse and assign microarchitecture per device.

Copilot uses AI. Check for mistakes.

harp-intel and others added 3 commits August 29, 2025 20:02
* Metric names modernized to match perfmon. "Legacy" names optional and deprecated. All metric descriptions in HTML report.

Signed-off-by: Harper, Jason M <[email protected]>

* temporary threshold value

Signed-off-by: Harper, Jason M <[email protected]>

* tma metric thresholds

Signed-off-by: Harper, Jason M <[email protected]>

* style the tooltips

Signed-off-by: Harper, Jason M <[email protected]>

* indent on TMA child metrics

Signed-off-by: Harper, Jason M <[email protected]>

* add tuning tips

Signed-off-by: Harper, Jason M <[email protected]>

* skip threshold if no metric variables defined

Signed-off-by: Harper, Jason M <[email protected]>

* revert to legacy names

Signed-off-by: Harper, Jason M <[email protected]>

* set legacy name to name for legacy metric defs

Signed-off-by: Harper, Jason M <[email protected]>

* fix kernel_CPI metric formula and add threshold to kernel utilization

Signed-off-by: Harper, Jason M <[email protected]>

---------

Signed-off-by: Harper, Jason M <[email protected]>
@harp-intel harp-intel merged commit f17ba11 into main Sep 1, 2025
5 checks passed
@harp-intel harp-intel deleted the gaudiarch branch September 1, 2025 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add information about version of gaudi device
1 participant