Skip to content

dwc_otg: disconnects of devices plugged into Model A ports #257

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
P33M opened this issue Mar 23, 2013 · 8 comments
Closed

dwc_otg: disconnects of devices plugged into Model A ports #257

P33M opened this issue Mar 23, 2013 · 8 comments

Comments

@P33M
Copy link
Contributor

P33M commented Mar 23, 2013

If I do cat /dev/sda1 > /dev/null & and let it run on a pendrive or USB hdd plugged directly into the Pi, after several seconds the device stops responding. About 10 seconds after this, the device gets reset due to some sort of timeout. During this period, CPU is 99% wa time, 0% idle time.

If the device is plugged into a downstream hub, this never happens. The command runs until the end of the drive is reached.

Pendrive breakage:

[  154.023059] usb 1-1: USB disconnect, device number 3
[  154.043164] sd 1:0:0:0: [sda] Unhandled error code
[  154.043200] sd 1:0:0:0: [sda]
[  154.043211] Result: hostbyte=0x01 driverbyte=0x00
[  154.043224] sd 1:0:0:0: [sda] CDB:
[  154.043233] cdb[0]=0x28: 28 00 00 01 08 70 00 00 f0 00
[  154.043273] end_request: I/O error, dev sda, sector 67696
[  154.043289] quiet_error: 50 callbacks suppressed
[  154.043301] Buffer I/O error on device sda1, logical block 8458
[  154.043337] Buffer I/O error on device sda1, logical block 8459
[  154.043356] Buffer I/O error on device sda1, logical block 8460
[  154.043371] Buffer I/O error on device sda1, logical block 8461
[  154.043384] Buffer I/O error on device sda1, logical block 8462
[  154.043397] Buffer I/O error on device sda1, logical block 8463
[  154.043411] Buffer I/O error on device sda1, logical block 8464
[  154.043424] Buffer I/O error on device sda1, logical block 8465
[  154.043437] Buffer I/O error on device sda1, logical block 8466
[  154.043450] Buffer I/O error on device sda1, logical block 8467
[  154.043733] sd 1:0:0:0: [sda] Unhandled error code
[  154.043753] sd 1:0:0:0: [sda]
[  154.043764] Result: hostbyte=0x01 driverbyte=0x00
[  154.043775] sd 1:0:0:0: [sda] CDB:
[  154.043783] cdb[0]=0x28: 28 00 00 01 09 60 00 00 40 00
[  154.043818] end_request: I/O error, dev sda, sector 67936
[  154.232992] Indeed it is in host mode hprt0 = 00021501
[  154.412988] usb 1-1: new high-speed USB device number 4 using dwc_otg
[  154.413465] Indeed it is in host mode hprt0 = 00001101
[  154.615922] usb 1-1: New USB device found, idVendor=05dc, idProduct=a761
[  154.615950] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  154.615965] usb 1-1: Product: JD FireFly
[  154.615977] usb 1-1: Manufacturer: Lexar
[  154.615990] usb 1-1: SerialNumber: L7IIICTQ421A99161YKW
[  154.626128] scsi2 : usb-storage 1-1:1.0
[  156.428219] scsi 2:0:0:0: Direct-Access     Lexar    JD FireFly       1100 PQ: 0 ANSI: 0 CCS
[  156.430479] sd 2:0:0:0: [sda] 3915776 512-byte logical blocks: (2.00 GB/1.86 GiB)
[  156.431291] sd 2:0:0:0: [sda] Write Protect is off
[  156.431323] sd 2:0:0:0: [sda] Mode Sense: 43 00 00 00
[  156.432208] sd 2:0:0:0: [sda] No Caching mode page present
[  156.432238] sd 2:0:0:0: [sda] Assuming drive cache: write through
[  156.438599] sd 2:0:0:0: [sda] No Caching mode page present
[  156.438632] sd 2:0:0:0: [sda] Assuming drive cache: write through
[  156.440189]  sda: sda1
[  156.444186] sd 2:0:0:0: [sda] No Caching mode page present
[  156.444219] sd 2:0:0:0: [sda] Assuming drive cache: write through
[  156.444237] sd 2:0:0:0: [sda] Attached SCSI removable disk

USB HDD breakage:

[  193.194254] Indeed it is in host mode hprt0 = 00001101
[  193.374195] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[  193.374390] Indeed it is in host mode hprt0 = 00001101
[  226.185542] Indeed it is in host mode hprt0 = 00001101
[  226.425294] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[  226.425576] Indeed it is in host mode hprt0 = 00001101
[  226.665342] usb 1-1: device descriptor read/64, error -32
[  226.775515] Indeed it is in host mode hprt0 = 00001101
[  227.015252] usb 1-1: device descriptor read/64, error -32
[  227.125290] Indeed it is in host mode hprt0 = 00001101
[  227.305405] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[  227.305677] Indeed it is in host mode hprt0 = 00001101
[  227.485261] usb 1-1: device descriptor read/64, error -32
[  227.595490] Indeed it is in host mode hprt0 = 00001101
[  227.775269] usb 1-1: device descriptor read/64, error -32
[  227.885303] Indeed it is in host mode hprt0 = 00001101
[  228.065273] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[  228.085578] usb 1-1: device descriptor read/8, error -71
[  228.215635] usb 1-1: device descriptor read/8, error -71
[  228.325316] Indeed it is in host mode hprt0 = 00001101
[  228.505291] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[  228.525565] usb 1-1: device descriptor read/8, error -71
[  228.655612] usb 1-1: device descriptor read/8, error -71
[  228.765369] usb 1-1: USB disconnect, device number 2
[  228.778680] sd 0:0:0:0: Device offlined - not ready after error recovery
[  228.786321] sd 0:0:0:0: [sda] Unhandled error code
[  228.786351] sd 0:0:0:0: [sda]
[  228.786363] Result: hostbyte=0x01 driverbyte=0x00
[  228.786376] sd 0:0:0:0: [sda] CDB:
[  228.786386] cdb[0]=0x28: 28 00 00 03 fb 60 00 00 f0 00
[  228.786425] end_request: I/O error, dev sda, sector 260960
[  228.786445] Buffer I/O error on device sda1, logical block 32364
[  228.786484] Buffer I/O error on device sda1, logical block 32365
[  228.786502] Buffer I/O error on device sda1, logical block 32366
[  228.786517] Buffer I/O error on device sda1, logical block 32367
[  228.786530] Buffer I/O error on device sda1, logical block 32368
[  228.786544] Buffer I/O error on device sda1, logical block 32369
[  228.786556] Buffer I/O error on device sda1, logical block 32370
[  228.786569] Buffer I/O error on device sda1, logical block 32371
[  228.786582] Buffer I/O error on device sda1, logical block 32372
[  228.786596] Buffer I/O error on device sda1, logical block 32373
[  228.786759] sd 0:0:0:0: [sda] Unhandled error code
[  228.786779] sd 0:0:0:0: [sda]
[  228.786788] Result: hostbyte=0x01 driverbyte=0x00
[  228.786800] sd 0:0:0:0: [sda] CDB:
[  228.786807] cdb[0]=0x28: 28 00 00 03 fc 50 00 00 f0 00
[  228.786841] end_request: I/O error, dev sda, sector 261200
[  229.045327] Indeed it is in host mode hprt0 = 00001501
[  229.225301] usb 1-1: new high-speed USB device number 3 using dwc_otg
[  229.225567] Indeed it is in host mode hprt0 = 00001101
[  229.405352] usb 1-1: device descriptor read/64, error -32
[  229.515530] Indeed it is in host mode hprt0 = 00001101
[  229.695313] usb 1-1: device descriptor read/64, error -32
[  229.805444] Indeed it is in host mode hprt0 = 00001101
[  229.985337] usb 1-1: new high-speed USB device number 4 using dwc_otg
[  229.985604] Indeed it is in host mode hprt0 = 00001101
[  230.165384] usb 1-1: device descriptor read/64, error -32
[  230.275545] Indeed it is in host mode hprt0 = 00001101
[  230.455398] usb 1-1: device descriptor read/64, error -32
[  230.565483] Indeed it is in host mode hprt0 = 00001101
[  230.745353] usb 1-1: new high-speed USB device number 5 using dwc_otg
[  230.765658] usb 1-1: device descriptor read/8, error -71
[  230.895621] usb 1-1: device descriptor read/8, error -71
[  231.005420] Indeed it is in host mode hprt0 = 00001101
[  231.185382] usb 1-1: new high-speed USB device number 6 using dwc_otg
[  231.205701] usb 1-1: device descriptor read/8, error -71
[  231.335663] usb 1-1: device descriptor read/8, error -71
[  231.445414] hub 1-0:1.0: unable to enumerate USB device on port 1

The USB HDD appears to have more of a tantrum and stops responding altogether, even through a Pi reboot.

This happens with or without dwc_otg.fiq_fix_enable=0 OR dwc_otg.speed=1 on the command line.

@ghollingworth
Copy link

The second looks like a power issue, the first one is possible as well, how are you powering them?

Gordon

Gordon Hollingworth, Director of Engineering
RaspberryPi (M) 07450946289

On 23 Mar 2013, at 12:57, P33M [email protected] wrote:

If I do cat /dev/sda1 > /dev/null & and let it run on a pendrive or USB hdd plugged directly into the Pi, after several seconds the device stops responding. About 10 seconds after this, the device gets reset due to some sort of timeout. During this period, CPU is 99% wa time, 0% idle time.

If the device is plugged into a downstream hub, this never happens. The command runs until the end of the drive is reached.

Pendrive breakage:

[ 154.023059] usb 1-1: USB disconnect, device number 3
[ 154.043164] sd 1:0:0:0: [sda] Unhandled error code
[ 154.043200] sd 1:0:0:0: [sda]
[ 154.043211] Result: hostbyte=0x01 driverbyte=0x00
[ 154.043224] sd 1:0:0:0: [sda] CDB:
[ 154.043233] cdb[0]=0x28: 28 00 00 01 08 70 00 00 f0 00
[ 154.043273] end_request: I/O error, dev sda, sector 67696
[ 154.043289] quiet_error: 50 callbacks suppressed
[ 154.043301] Buffer I/O error on device sda1, logical block 8458
[ 154.043337] Buffer I/O error on device sda1, logical block 8459
[ 154.043356] Buffer I/O error on device sda1, logical block 8460
[ 154.043371] Buffer I/O error on device sda1, logical block 8461
[ 154.043384] Buffer I/O error on device sda1, logical block 8462
[ 154.043397] Buffer I/O error on device sda1, logical block 8463
[ 154.043411] Buffer I/O error on device sda1, logical block 8464
[ 154.043424] Buffer I/O error on device sda1, logical block 8465
[ 154.043437] Buffer I/O error on device sda1, logical block 8466
[ 154.043450] Buffer I/O error on device sda1, logical block 8467
[ 154.043733] sd 1:0:0:0: [sda] Unhandled error code
[ 154.043753] sd 1:0:0:0: [sda]
[ 154.043764] Result: hostbyte=0x01 driverbyte=0x00
[ 154.043775] sd 1:0:0:0: [sda] CDB:
[ 154.043783] cdb[0]=0x28: 28 00 00 01 09 60 00 00 40 00
[ 154.043818] end_request: I/O error, dev sda, sector 67936
[ 154.232992] Indeed it is in host mode hprt0 = 00021501
[ 154.412988] usb 1-1: new high-speed USB device number 4 using dwc_otg
[ 154.413465] Indeed it is in host mode hprt0 = 00001101
[ 154.615922] usb 1-1: New USB device found, idVendor=05dc, idProduct=a761
[ 154.615950] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ 154.615965] usb 1-1: Product: JD FireFly
[ 154.615977] usb 1-1: Manufacturer: Lexar
[ 154.615990] usb 1-1: SerialNumber: L7IIICTQ421A99161YKW
[ 154.626128] scsi2 : usb-storage 1-1:1.0
[ 156.428219] scsi 2:0:0:0: Direct-Access Lexar JD FireFly 1100 PQ: 0 ANSI: 0 CCS
[ 156.430479] sd 2:0:0:0: [sda] 3915776 512-byte logical blocks: (2.00 GB/1.86 GiB)
[ 156.431291] sd 2:0:0:0: [sda] Write Protect is off
[ 156.431323] sd 2:0:0:0: [sda] Mode Sense: 43 00 00 00
[ 156.432208] sd 2:0:0:0: [sda] No Caching mode page present
[ 156.432238] sd 2:0:0:0: [sda] Assuming drive cache: write through
[ 156.438599] sd 2:0:0:0: [sda] No Caching mode page present
[ 156.438632] sd 2:0:0:0: [sda] Assuming drive cache: write through
[ 156.440189] sda: sda1
[ 156.444186] sd 2:0:0:0: [sda] No Caching mode page present
[ 156.444219] sd 2:0:0:0: [sda] Assuming drive cache: write through
[ 156.444237] sd 2:0:0:0: [sda] Attached SCSI removable disk
USB HDD breakage:

[ 193.194254] Indeed it is in host mode hprt0 = 00001101
[ 193.374195] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[ 193.374390] Indeed it is in host mode hprt0 = 00001101
[ 226.185542] Indeed it is in host mode hprt0 = 00001101
[ 226.425294] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[ 226.425576] Indeed it is in host mode hprt0 = 00001101
[ 226.665342] usb 1-1: device descriptor read/64, error -32
[ 226.775515] Indeed it is in host mode hprt0 = 00001101
[ 227.015252] usb 1-1: device descriptor read/64, error -32
[ 227.125290] Indeed it is in host mode hprt0 = 00001101
[ 227.305405] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[ 227.305677] Indeed it is in host mode hprt0 = 00001101
[ 227.485261] usb 1-1: device descriptor read/64, error -32
[ 227.595490] Indeed it is in host mode hprt0 = 00001101
[ 227.775269] usb 1-1: device descriptor read/64, error -32
[ 227.885303] Indeed it is in host mode hprt0 = 00001101
[ 228.065273] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[ 228.085578] usb 1-1: device descriptor read/8, error -71
[ 228.215635] usb 1-1: device descriptor read/8, error -71
[ 228.325316] Indeed it is in host mode hprt0 = 00001101
[ 228.505291] usb 1-1: reset high-speed USB device number 2 using dwc_otg
[ 228.525565] usb 1-1: device descriptor read/8, error -71
[ 228.655612] usb 1-1: device descriptor read/8, error -71
[ 228.765369] usb 1-1: USB disconnect, device number 2
[ 228.778680] sd 0:0:0:0: Device offlined - not ready after error recovery
[ 228.786321] sd 0:0:0:0: [sda] Unhandled error code
[ 228.786351] sd 0:0:0:0: [sda]
[ 228.786363] Result: hostbyte=0x01 driverbyte=0x00
[ 228.786376] sd 0:0:0:0: [sda] CDB:
[ 228.786386] cdb[0]=0x28: 28 00 00 03 fb 60 00 00 f0 00
[ 228.786425] end_request: I/O error, dev sda, sector 260960
[ 228.786445] Buffer I/O error on device sda1, logical block 32364
[ 228.786484] Buffer I/O error on device sda1, logical block 32365
[ 228.786502] Buffer I/O error on device sda1, logical block 32366
[ 228.786517] Buffer I/O error on device sda1, logical block 32367
[ 228.786530] Buffer I/O error on device sda1, logical block 32368
[ 228.786544] Buffer I/O error on device sda1, logical block 32369
[ 228.786556] Buffer I/O error on device sda1, logical block 32370
[ 228.786569] Buffer I/O error on device sda1, logical block 32371
[ 228.786582] Buffer I/O error on device sda1, logical block 32372
[ 228.786596] Buffer I/O error on device sda1, logical block 32373
[ 228.786759] sd 0:0:0:0: [sda] Unhandled error code
[ 228.786779] sd 0:0:0:0: [sda]
[ 228.786788] Result: hostbyte=0x01 driverbyte=0x00
[ 228.786800] sd 0:0:0:0: [sda] CDB:
[ 228.786807] cdb[0]=0x28: 28 00 00 03 fc 50 00 00 f0 00
[ 228.786841] end_request: I/O error, dev sda, sector 261200
[ 229.045327] Indeed it is in host mode hprt0 = 00001501
[ 229.225301] usb 1-1: new high-speed USB device number 3 using dwc_otg
[ 229.225567] Indeed it is in host mode hprt0 = 00001101
[ 229.405352] usb 1-1: device descriptor read/64, error -32
[ 229.515530] Indeed it is in host mode hprt0 = 00001101
[ 229.695313] usb 1-1: device descriptor read/64, error -32
[ 229.805444] Indeed it is in host mode hprt0 = 00001101
[ 229.985337] usb 1-1: new high-speed USB device number 4 using dwc_otg
[ 229.985604] Indeed it is in host mode hprt0 = 00001101
[ 230.165384] usb 1-1: device descriptor read/64, error -32
[ 230.275545] Indeed it is in host mode hprt0 = 00001101
[ 230.455398] usb 1-1: device descriptor read/64, error -32
[ 230.565483] Indeed it is in host mode hprt0 = 00001101
[ 230.745353] usb 1-1: new high-speed USB device number 5 using dwc_otg
[ 230.765658] usb 1-1: device descriptor read/8, error -71
[ 230.895621] usb 1-1: device descriptor read/8, error -71
[ 231.005420] Indeed it is in host mode hprt0 = 00001101
[ 231.185382] usb 1-1: new high-speed USB device number 6 using dwc_otg
[ 231.205701] usb 1-1: device descriptor read/8, error -71
[ 231.335663] usb 1-1: device descriptor read/8, error -71
[ 231.445414] hub 1-0:1.0: unable to enumerate USB device on port 1
The USB HDD appears to have more of a tantrum and stops responding altogether, even through a Pi reboot.

This happens with or without dwc_otg.fiq_fix_enable=0 OR dwc_otg.speed=1 on the command line.


Reply to this email directly or view it on GitHub.

@foreverguest
Copy link

These same errors happen to me on a model B with an external powered USB HDD

@P33M
Copy link
Contributor Author

P33M commented Apr 7, 2013

The USB HDD is definitely a power hungry beast so I took pains to make sure the Pi had plenty of power available (direct wire to GPIO +5V).

This also affects certain wifi devices whereby they just stop responding after a while - my RTL5370 dies:

Bus 001 Device 002: ID 148f:5370 Ralink Technology, Corp. RT5370 Wireless Adapter
<3>[  150.747780] phy0 -> rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x1208 with error -110.
<3>[  250.749927] phy0 -> rt2x00usb_vendor_request: Error - Vendor Request 0x07 failed for offset 0x7010 with error -110.

Those messages repeated ad infinitum. If I use the network link (scp or ssh with activity) then the wlan dongle does not hang.

No messages out of the driver - my hunch is that it's a bus error that's never handled/reset.

@P33M
Copy link
Contributor Author

P33M commented Apr 7, 2013

It's a URB pileup in dwc_otg_hcd_enqueue().

With USB storage debug enabled, I can see the URBs being submitted. I added some debug output to enqueue:

diff --git a/drivers/usb/host/dwc_otg/dwc_otg_hcd.c b/drivers/usb/host/dwc_otg/dwc_otg_hcd.c
index 91eefec..1df91d5 100644
--- a/drivers/usb/host/dwc_otg/dwc_otg_hcd.c
+++ b/drivers/usb/host/dwc_otg/dwc_otg_hcd.c
@@ -518,6 +518,8 @@ int dwc_otg_hcd_urb_enqueue(dwc_otg_hcd_t * hcd,
                tr_type = dwc_otg_hcd_select_transactions(hcd);
                if (tr_type != DWC_OTG_TRANSACTION_NONE) {
                        dwc_otg_hcd_queue_transactions(hcd, tr_type);
+               } else {
+                       DWC_ERROR("Scheduler returned no transactions but needed sched!");
                }
                DWC_SPINUNLOCK_IRQRESTORE(hcd->lock, flags);
        }
<7>[  156.489418] usb-storage: usb_stor_bulk_transfer_sglist: xfer 122880 bytes, 2 entries
<4>[  156.489506] ERROR::dwc_otg_hcd_urb_enqueue:522: Scheduler returned no transactions but needed sched!
<7>[  156.496736] usb-storage: Status code 0; transferred 122880/122880
<7>[  156.496759] usb-storage: -- transfer complete
<7>[  156.496770] usb-storage: Bulk data transfer result 0x0
<7>[  156.496777] usb-storage: Attempting to get CSW...
<7>[  156.496788] usb-storage: usb_stor_bulk_transfer_buf: xfer 13 bytes
<7>[  156.496880] usb-storage: Status code 0; transferred 13/13
<7>[  156.496892] usb-storage: -- transfer complete
<7>[  156.496900] usb-storage: Bulk status result = 0
<7>[  156.496911] usb-storage: Bulk Status S 0x53425355 T 0x53b R 0 Stat 0x0
<7>[  156.496923] usb-storage: scsi cmd done, result=0x0
<7>[  156.496943] usb-storage: *** thread sleeping.
<7>[  156.497105] usb-storage: queuecommand_lck called
<7>[  156.497195] usb-storage: *** thread awakened.
<7>[  156.497214] usb-storage: Command READ_10 (10 bytes)
<7>[  156.497221] usb-storage:  28 00 00 02 e8 98 00 00 f0 00
<7>[  156.497261] usb-storage: Bulk Command S 0x43425355 T 0x53c L 122880 F 128 Trg 0 LUN 0 CL 10
<7>[  156.497271] usb-storage: usb_stor_bulk_transfer_buf: xfer 31 bytes
<7>[  156.497360] usb-storage: Status code 0; transferred 31/31
<7>[  156.497373] usb-storage: -- transfer complete
<7>[  156.497380] usb-storage: Bulk command transfer result=0
<7>[  156.497392] usb-storage: usb_stor_bulk_transfer_sglist: xfer 122880 bytes, 2 entries
<4>[  156.497480] ERROR::dwc_otg_hcd_urb_enqueue:522: Scheduler returned no transactions but needed sched!

This causes submitted URBs to just sit there until some high-level error handling takes over - SCSI eh thread times out and calls the abort function, which dequeues the URBs in the case of usb_storage, which clears the blockage. In the case of the wlan dongle, it never does anything to requeue URBs therefore the device becomes unresponsive.

The pileup occurs because if we have only one URB being submitted, and it didn't get scheduled, there is no SOF interrupt unmasked to kick the scheduler again. I have yet to establish exactly why a freshly queued transaction was just ignored for being scheduled. I suspect another concurrency issue, given that for the wlan dongle these enqueue failures happen sporadically.

@P33M
Copy link
Contributor Author

P33M commented Apr 10, 2013

The problem is evident when more than one URB is queued at one time for a single endpoint with no other endpoint transfers in the queue. usb-storage is a particularly prone case because of its habit of queueing a large number of URBs (14-30 typical) using usbcore's scatter-gather helpers.

If multiple endpoints are active, the call to dwc_otg_hcd_select_transactions is repeated on each xfercomp interrupt completion - making the stalls disappear as other transfers will schedule a stalled URB.

The bug can be squashed by permanently enabling the SOF interrupt on the first URB queued - but for a model A this is not particularly useful as it leads to the FIQ/SOF handler now being active (and consuming CPU time and energy) when it doesn't need to be.

There doesn't seem to be a reason why multiple transfers queued for a single endpoint shouldn't be possible - each completed transfer should then cause the next to be assigned to a hc.

@P33M
Copy link
Contributor Author

P33M commented Apr 17, 2013

I've made some progress.

It appears that on occasion a transaction is assigned to a HC, started but the transfer never completes. It is simply "lost". This produces visible breakage with only a single device attached to the BCM2835 USB port because the transfer stream to the device (with no SOF interrupt to keep kicking it) is dependent on the xfercomp interrupt handler queueing the next transfer for the same endpoint. I think the data toggles come into play at this point to ensure that the "lost" packet is catered for.

My first idea was that the transaction was being performed too close to the end of a microframe - overrunning and causing general mayhem, however looking at frrem when the transaction was started didn't seem to show any correlation. Transfers would fail when scheduled anywhere in the microframe.

@P33M
Copy link
Contributor Author

P33M commented Apr 20, 2013

It took @ghollingworth's USB analyzer but we found out why -

The NAK holdoff is erroneously enabled for all non-periodic endpoints. In addition to this, there is a bug where at hfnum == HFNUM_MAX_FRAME_NUM the qh->nak_frame check returns true (when set to an otherwise intentionally bogus value), which then skips the active QH for queueing and also delays the FIQ triggering for up to 1 full-speed frame.

If this is the only QH in either of the non-periodic queues, this results in all further transactions for the same endpoint getting blocked until external action intervenes . In the case of usb_storage devices, the error handling thread would wakeup and then dequeue all URBs and clear the problem.

Thinking about it further, this could be why we are seeing very occasional drops of packets in #238 (webcams) after the tasklet patch - the FIQ is being commanded to do work on a future frame rather than at the very small interval required for a usually quite frequent periodic transaction.

@P33M
Copy link
Contributor Author

P33M commented Apr 27, 2013

Wlan dongle now working for 24h+ on a model A - closing this issue

@P33M P33M closed this as completed Apr 27, 2013
popcornmix pushed a commit that referenced this issue Feb 26, 2018
The adis_probe_trigger() creates a new IIO trigger and requests an
interrupt associated with the trigger. The interrupt uses the generic
iio_trigger_generic_data_rdy_poll() function as its interrupt handler.

Currently the driver initializes some fields of the trigger structure after
the interrupt has been requested. But an interrupt can fire as soon as it
has been requested. This opens up a race condition.

iio_trigger_generic_data_rdy_poll() will access the trigger data structure
and dereference the ops field. If the ops field is not yet initialized this
will result in a NULL pointer deref.

It is not expected that the device generates an interrupt at this point, so
typically this issue did not surface unless e.g. due to a hardware
misconfiguration (wrong interrupt number, wrong polarity, etc.).

But some newer devices from the ADIS family start to generate periodic
interrupts in their power-on reset configuration and unfortunately the
interrupt can not be masked in the device.  This makes the race condition
much more visible and the following crash has been observed occasionally
when booting a system using the ADIS16460.

	Unable to handle kernel NULL pointer dereference at virtual address 00000008
	pgd = c0004000
	[00000008] *pgd=00000000
	Internal error: Oops: 5 [#1] PREEMPT SMP ARM
	Modules linked in:
	CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-04126-gf9739f0-dirty #257
	Hardware name: Xilinx Zynq Platform
	task: ef04f640 task.stack: ef050000
	PC is at iio_trigger_notify_done+0x30/0x68
	LR is at iio_trigger_generic_data_rdy_poll+0x18/0x20
	pc : [<c042d868>]    lr : [<c042d924>]    psr: 60000193
	sp : ef051bb8  ip : 00000000  fp : ef106400
	r10: c081d80a  r9 : ef3bfa00  r8 : 00000087
	r7 : ef051be  r6 : 00000000  r5 : ef3bfa00  r4 : ee92ab00
	r3 : 00000000  r2 : 00000000  r1 : 00000000  r0 : ee97e400
	Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
	Control: 18c5387d  Table: 0000404a  DAC: 00000051
	Process swapper/0 (pid: 1, stack limit = 0xef050210)
	[<c042d868>] (iio_trigger_notify_done) from [<c0065b10>] (__handle_irq_event_percpu+0x88/0x118)
	[<c0065b10>] (__handle_irq_event_percpu) from [<c0065bbc>] (handle_irq_event_percpu+0x1c/0x58)
	[<c0065bbc>] (handle_irq_event_percpu) from [<c0065c30>] (handle_irq_event+0x38/0x5c)
	[<c0065c30>] (handle_irq_event) from [<c0068e28>] (handle_level_irq+0xa4/0x130)
	[<c0068e28>] (handle_level_irq) from [<c0064e74>] (generic_handle_irq+0x24/0x34)
	[<c0064e74>] (generic_handle_irq) from [<c021ab7c>] (zynq_gpio_irqhandler+0xb8/0x13c)
	[<c021ab7c>] (zynq_gpio_irqhandler) from [<c0064e74>] (generic_handle_irq+0x24/0x34)
	[<c0064e74>] (generic_handle_irq) from [<c0065370>] (__handle_domain_irq+0x5c/0xb4)
	[<c0065370>] (__handle_domain_irq) from [<c000940c>] (gic_handle_irq+0x48/0x8c)
	[<c000940c>] (gic_handle_irq) from [<c0013e8c>] (__irq_svc+0x6c/0xa8)

To fix this make sure that the trigger is fully initialized before
requesting the interrupt.

Fixes: ccd2b52 ("staging:iio: Add common ADIS library")
Reported-by: Robin Getz <[email protected]>
Signed-off-by: Lars-Peter Clausen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
nathanchance pushed a commit to nathanchance/pi-kernel that referenced this issue Mar 3, 2018
commit f027e0b upstream.

The adis_probe_trigger() creates a new IIO trigger and requests an
interrupt associated with the trigger. The interrupt uses the generic
iio_trigger_generic_data_rdy_poll() function as its interrupt handler.

Currently the driver initializes some fields of the trigger structure after
the interrupt has been requested. But an interrupt can fire as soon as it
has been requested. This opens up a race condition.

iio_trigger_generic_data_rdy_poll() will access the trigger data structure
and dereference the ops field. If the ops field is not yet initialized this
will result in a NULL pointer deref.

It is not expected that the device generates an interrupt at this point, so
typically this issue did not surface unless e.g. due to a hardware
misconfiguration (wrong interrupt number, wrong polarity, etc.).

But some newer devices from the ADIS family start to generate periodic
interrupts in their power-on reset configuration and unfortunately the
interrupt can not be masked in the device.  This makes the race condition
much more visible and the following crash has been observed occasionally
when booting a system using the ADIS16460.

	Unable to handle kernel NULL pointer dereference at virtual address 00000008
	pgd = c0004000
	[00000008] *pgd=00000000
	Internal error: Oops: 5 [raspberrypi#1] PREEMPT SMP ARM
	Modules linked in:
	CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-04126-gf9739f0-dirty raspberrypi#257
	Hardware name: Xilinx Zynq Platform
	task: ef04f640 task.stack: ef050000
	PC is at iio_trigger_notify_done+0x30/0x68
	LR is at iio_trigger_generic_data_rdy_poll+0x18/0x20
	pc : [<c042d868>]    lr : [<c042d924>]    psr: 60000193
	sp : ef051bb8  ip : 00000000  fp : ef106400
	r10: c081d80a  r9 : ef3bfa00  r8 : 00000087
	r7 : ef051be  r6 : 00000000  r5 : ef3bfa00  r4 : ee92ab00
	r3 : 00000000  r2 : 00000000  r1 : 00000000  r0 : ee97e400
	Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
	Control: 18c5387d  Table: 0000404a  DAC: 00000051
	Process swapper/0 (pid: 1, stack limit = 0xef050210)
	[<c042d868>] (iio_trigger_notify_done) from [<c0065b10>] (__handle_irq_event_percpu+0x88/0x118)
	[<c0065b10>] (__handle_irq_event_percpu) from [<c0065bbc>] (handle_irq_event_percpu+0x1c/0x58)
	[<c0065bbc>] (handle_irq_event_percpu) from [<c0065c30>] (handle_irq_event+0x38/0x5c)
	[<c0065c30>] (handle_irq_event) from [<c0068e28>] (handle_level_irq+0xa4/0x130)
	[<c0068e28>] (handle_level_irq) from [<c0064e74>] (generic_handle_irq+0x24/0x34)
	[<c0064e74>] (generic_handle_irq) from [<c021ab7c>] (zynq_gpio_irqhandler+0xb8/0x13c)
	[<c021ab7c>] (zynq_gpio_irqhandler) from [<c0064e74>] (generic_handle_irq+0x24/0x34)
	[<c0064e74>] (generic_handle_irq) from [<c0065370>] (__handle_domain_irq+0x5c/0xb4)
	[<c0065370>] (__handle_domain_irq) from [<c000940c>] (gic_handle_irq+0x48/0x8c)
	[<c000940c>] (gic_handle_irq) from [<c0013e8c>] (__irq_svc+0x6c/0xa8)

To fix this make sure that the trigger is fully initialized before
requesting the interrupt.

Fixes: ccd2b52 ("staging:iio: Add common ADIS library")
Reported-by: Robin Getz <[email protected]>
Signed-off-by: Lars-Peter Clausen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
popcornmix pushed a commit that referenced this issue Mar 8, 2018
commit f027e0b upstream.

The adis_probe_trigger() creates a new IIO trigger and requests an
interrupt associated with the trigger. The interrupt uses the generic
iio_trigger_generic_data_rdy_poll() function as its interrupt handler.

Currently the driver initializes some fields of the trigger structure after
the interrupt has been requested. But an interrupt can fire as soon as it
has been requested. This opens up a race condition.

iio_trigger_generic_data_rdy_poll() will access the trigger data structure
and dereference the ops field. If the ops field is not yet initialized this
will result in a NULL pointer deref.

It is not expected that the device generates an interrupt at this point, so
typically this issue did not surface unless e.g. due to a hardware
misconfiguration (wrong interrupt number, wrong polarity, etc.).

But some newer devices from the ADIS family start to generate periodic
interrupts in their power-on reset configuration and unfortunately the
interrupt can not be masked in the device.  This makes the race condition
much more visible and the following crash has been observed occasionally
when booting a system using the ADIS16460.

	Unable to handle kernel NULL pointer dereference at virtual address 00000008
	pgd = c0004000
	[00000008] *pgd=00000000
	Internal error: Oops: 5 [#1] PREEMPT SMP ARM
	Modules linked in:
	CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-04126-gf9739f0-dirty #257
	Hardware name: Xilinx Zynq Platform
	task: ef04f640 task.stack: ef050000
	PC is at iio_trigger_notify_done+0x30/0x68
	LR is at iio_trigger_generic_data_rdy_poll+0x18/0x20
	pc : [<c042d868>]    lr : [<c042d924>]    psr: 60000193
	sp : ef051bb8  ip : 00000000  fp : ef106400
	r10: c081d80a  r9 : ef3bfa00  r8 : 00000087
	r7 : ef051be  r6 : 00000000  r5 : ef3bfa00  r4 : ee92ab00
	r3 : 00000000  r2 : 00000000  r1 : 00000000  r0 : ee97e400
	Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
	Control: 18c5387d  Table: 0000404a  DAC: 00000051
	Process swapper/0 (pid: 1, stack limit = 0xef050210)
	[<c042d868>] (iio_trigger_notify_done) from [<c0065b10>] (__handle_irq_event_percpu+0x88/0x118)
	[<c0065b10>] (__handle_irq_event_percpu) from [<c0065bbc>] (handle_irq_event_percpu+0x1c/0x58)
	[<c0065bbc>] (handle_irq_event_percpu) from [<c0065c30>] (handle_irq_event+0x38/0x5c)
	[<c0065c30>] (handle_irq_event) from [<c0068e28>] (handle_level_irq+0xa4/0x130)
	[<c0068e28>] (handle_level_irq) from [<c0064e74>] (generic_handle_irq+0x24/0x34)
	[<c0064e74>] (generic_handle_irq) from [<c021ab7c>] (zynq_gpio_irqhandler+0xb8/0x13c)
	[<c021ab7c>] (zynq_gpio_irqhandler) from [<c0064e74>] (generic_handle_irq+0x24/0x34)
	[<c0064e74>] (generic_handle_irq) from [<c0065370>] (__handle_domain_irq+0x5c/0xb4)
	[<c0065370>] (__handle_domain_irq) from [<c000940c>] (gic_handle_irq+0x48/0x8c)
	[<c000940c>] (gic_handle_irq) from [<c0013e8c>] (__irq_svc+0x6c/0xa8)

To fix this make sure that the trigger is fully initialized before
requesting the interrupt.

Fixes: ccd2b52 ("staging:iio: Add common ADIS library")
Reported-by: Robin Getz <[email protected]>
Signed-off-by: Lars-Peter Clausen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
ED6E0F17 pushed a commit to ED6E0F17/linux that referenced this issue Mar 15, 2018
commit f027e0b upstream.

The adis_probe_trigger() creates a new IIO trigger and requests an
interrupt associated with the trigger. The interrupt uses the generic
iio_trigger_generic_data_rdy_poll() function as its interrupt handler.

Currently the driver initializes some fields of the trigger structure after
the interrupt has been requested. But an interrupt can fire as soon as it
has been requested. This opens up a race condition.

iio_trigger_generic_data_rdy_poll() will access the trigger data structure
and dereference the ops field. If the ops field is not yet initialized this
will result in a NULL pointer deref.

It is not expected that the device generates an interrupt at this point, so
typically this issue did not surface unless e.g. due to a hardware
misconfiguration (wrong interrupt number, wrong polarity, etc.).

But some newer devices from the ADIS family start to generate periodic
interrupts in their power-on reset configuration and unfortunately the
interrupt can not be masked in the device.  This makes the race condition
much more visible and the following crash has been observed occasionally
when booting a system using the ADIS16460.

	Unable to handle kernel NULL pointer dereference at virtual address 00000008
	pgd = c0004000
	[00000008] *pgd=00000000
	Internal error: Oops: 5 [raspberrypi#1] PREEMPT SMP ARM
	Modules linked in:
	CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-04126-gf9739f0-dirty raspberrypi#257
	Hardware name: Xilinx Zynq Platform
	task: ef04f640 task.stack: ef050000
	PC is at iio_trigger_notify_done+0x30/0x68
	LR is at iio_trigger_generic_data_rdy_poll+0x18/0x20
	pc : [<c042d868>]    lr : [<c042d924>]    psr: 60000193
	sp : ef051bb8  ip : 00000000  fp : ef106400
	r10: c081d80a  r9 : ef3bfa00  r8 : 00000087
	r7 : ef051be  r6 : 00000000  r5 : ef3bfa00  r4 : ee92ab00
	r3 : 00000000  r2 : 00000000  r1 : 00000000  r0 : ee97e400
	Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
	Control: 18c5387d  Table: 0000404a  DAC: 00000051
	Process swapper/0 (pid: 1, stack limit = 0xef050210)
	[<c042d868>] (iio_trigger_notify_done) from [<c0065b10>] (__handle_irq_event_percpu+0x88/0x118)
	[<c0065b10>] (__handle_irq_event_percpu) from [<c0065bbc>] (handle_irq_event_percpu+0x1c/0x58)
	[<c0065bbc>] (handle_irq_event_percpu) from [<c0065c30>] (handle_irq_event+0x38/0x5c)
	[<c0065c30>] (handle_irq_event) from [<c0068e28>] (handle_level_irq+0xa4/0x130)
	[<c0068e28>] (handle_level_irq) from [<c0064e74>] (generic_handle_irq+0x24/0x34)
	[<c0064e74>] (generic_handle_irq) from [<c021ab7c>] (zynq_gpio_irqhandler+0xb8/0x13c)
	[<c021ab7c>] (zynq_gpio_irqhandler) from [<c0064e74>] (generic_handle_irq+0x24/0x34)
	[<c0064e74>] (generic_handle_irq) from [<c0065370>] (__handle_domain_irq+0x5c/0xb4)
	[<c0065370>] (__handle_domain_irq) from [<c000940c>] (gic_handle_irq+0x48/0x8c)
	[<c000940c>] (gic_handle_irq) from [<c0013e8c>] (__irq_svc+0x6c/0xa8)

To fix this make sure that the trigger is fully initialized before
requesting the interrupt.

Fixes: ccd2b52 ("staging:iio: Add common ADIS library")
Reported-by: Robin Getz <[email protected]>
Signed-off-by: Lars-Peter Clausen <[email protected]>
Cc: <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
popcornmix pushed a commit that referenced this issue Dec 31, 2020
[ Upstream commit e246b7c ]

As part of the cma_dev release, that pointer will be set to NULL.  In case
it happens in rdma_bind_addr() (part of an error flow), the next call to
addr_handler() will have a call to cma_acquire_dev_by_src_ip() which will
overwrite sgid_attr without releasing it.

  WARNING: CPU: 2 PID: 108 at drivers/infiniband/core/cma.c:606 cma_bind_sgid_attr drivers/infiniband/core/cma.c:606 [inline]
  WARNING: CPU: 2 PID: 108 at drivers/infiniband/core/cma.c:606 cma_acquire_dev_by_src_ip+0x470/0x4b0 drivers/infiniband/core/cma.c:649
  CPU: 2 PID: 108 Comm: kworker/u8:1 Not tainted 5.10.0-rc6+ #257
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Workqueue: ib_addr process_one_req
  RIP: 0010:cma_bind_sgid_attr drivers/infiniband/core/cma.c:606 [inline]
  RIP: 0010:cma_acquire_dev_by_src_ip+0x470/0x4b0 drivers/infiniband/core/cma.c:649
  Code: 66 d9 4a ff 4d 8b 6e 10 49 8d bd 1c 08 00 00 e8 b6 d6 4a ff 45 0f b6 bd 1c 08 00 00 41 83 e7 01 e9 49 fd ff ff e8 90 c5 29 ff <0f> 0b e9 80 fe ff ff e8 84 c5 29 ff 4c 89 f7 e8 2c d9 4a ff 4d 8b
  RSP: 0018:ffff8881047c7b40 EFLAGS: 00010293
  RAX: ffff888104789c80 RBX: 0000000000000001 RCX: ffffffff820b8ef8
  RDX: 0000000000000000 RSI: ffffffff820b9080 RDI: ffff88810cd4c998
  RBP: ffff8881047c7c08 R08: ffff888104789c80 R09: ffffed10209f4036
  R10: ffff888104fa01ab R11: ffffed10209f4035 R12: ffff88810cd4c800
  R13: ffff888105750e28 R14: ffff888108f0a100 R15: ffff88810cd4c998
  FS:  0000000000000000(0000) GS:ffff888119c00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000104e60005 CR4: 0000000000370ea0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   addr_handler+0x266/0x350 drivers/infiniband/core/cma.c:3190
   process_one_req+0xa3/0x300 drivers/infiniband/core/addr.c:645
   process_one_work+0x54c/0x930 kernel/workqueue.c:2272
   worker_thread+0x82/0x830 kernel/workqueue.c:2418
   kthread+0x1ca/0x220 kernel/kthread.c:292
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296

Fixes: ff11c6c ("RDMA/cma: Introduce and use cma_acquire_dev_by_src_ip()")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
spockfish pushed a commit to RoPieee/linux that referenced this issue Feb 1, 2021
[ Upstream commit e246b7c ]

As part of the cma_dev release, that pointer will be set to NULL.  In case
it happens in rdma_bind_addr() (part of an error flow), the next call to
addr_handler() will have a call to cma_acquire_dev_by_src_ip() which will
overwrite sgid_attr without releasing it.

  WARNING: CPU: 2 PID: 108 at drivers/infiniband/core/cma.c:606 cma_bind_sgid_attr drivers/infiniband/core/cma.c:606 [inline]
  WARNING: CPU: 2 PID: 108 at drivers/infiniband/core/cma.c:606 cma_acquire_dev_by_src_ip+0x470/0x4b0 drivers/infiniband/core/cma.c:649
  CPU: 2 PID: 108 Comm: kworker/u8:1 Not tainted 5.10.0-rc6+ raspberrypi#257
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Workqueue: ib_addr process_one_req
  RIP: 0010:cma_bind_sgid_attr drivers/infiniband/core/cma.c:606 [inline]
  RIP: 0010:cma_acquire_dev_by_src_ip+0x470/0x4b0 drivers/infiniband/core/cma.c:649
  Code: 66 d9 4a ff 4d 8b 6e 10 49 8d bd 1c 08 00 00 e8 b6 d6 4a ff 45 0f b6 bd 1c 08 00 00 41 83 e7 01 e9 49 fd ff ff e8 90 c5 29 ff <0f> 0b e9 80 fe ff ff e8 84 c5 29 ff 4c 89 f7 e8 2c d9 4a ff 4d 8b
  RSP: 0018:ffff8881047c7b40 EFLAGS: 00010293
  RAX: ffff888104789c80 RBX: 0000000000000001 RCX: ffffffff820b8ef8
  RDX: 0000000000000000 RSI: ffffffff820b9080 RDI: ffff88810cd4c998
  RBP: ffff8881047c7c08 R08: ffff888104789c80 R09: ffffed10209f4036
  R10: ffff888104fa01ab R11: ffffed10209f4035 R12: ffff88810cd4c800
  R13: ffff888105750e28 R14: ffff888108f0a100 R15: ffff88810cd4c998
  FS:  0000000000000000(0000) GS:ffff888119c00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000000104e60005 CR4: 0000000000370ea0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   addr_handler+0x266/0x350 drivers/infiniband/core/cma.c:3190
   process_one_req+0xa3/0x300 drivers/infiniband/core/addr.c:645
   process_one_work+0x54c/0x930 kernel/workqueue.c:2272
   worker_thread+0x82/0x830 kernel/workqueue.c:2418
   kthread+0x1ca/0x220 kernel/kthread.c:292
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296

Fixes: ff11c6c ("RDMA/cma: Introduce and use cma_acquire_dev_by_src_ip()")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Mar 4, 2021
[ Upstream commit fe454dc ]

ucma_process_join() allocates struct ucma_multicast mc and frees it if an
error occurs during its run.  Specifically, if an error occurs in
copy_to_user(), a use-after-free might happen in the following scenario:

1. mc struct is allocated.
2. rdma_join_multicast() is called and succeeds. During its run,
   cma_iboe_join_multicast() enqueues a work that will later use the
   aforementioned mc struct.
3. copy_to_user() is called and fails.
4. mc struct is deallocated.
5. The work that was enqueued by cma_iboe_join_multicast() is run and
   calls ucma_create_uevent() which tries to access mc struct (which is
   freed by now).

Fix this bug by cancelling the work enqueued by cma_iboe_join_multicast().
Since cma_work_handler() frees struct cma_work, we don't use it in
cma_iboe_join_multicast() so we can safely cancel the work later.

The following syzkaller report revealed it:

   BUG: KASAN: use-after-free in ucma_create_uevent+0x2dd/0x;3f0 drivers/infiniband/core/ucma.c:272
   Read of size 8 at addr ffff88810b3ad110 by task kworker/u8:1/108

   CPU: 1 PID: 108 Comm: kworker/u8:1 Not tainted 5.10.0-rc6+ #257
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS   rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
   Workqueue: rdma_cm cma_work_handler
   Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0xbe/0xf9 lib/dump_stack.c:118
    print_address_description.constprop.0+0x3e/0×60 mm/kasan/report.c:385
    __kasan_report mm/kasan/report.c:545 [inline]
    kasan_report.cold+0x1f/0×37 mm/kasan/report.c:562
    ucma_create_uevent+0x2dd/0×3f0 drivers/infiniband/core/ucma.c:272
    ucma_event_handler+0xb7/0×3c0 drivers/infiniband/core/ucma.c:349
    cma_cm_event_handler+0x5d/0×1c0 drivers/infiniband/core/cma.c:1977
    cma_work_handler+0xfa/0×190 drivers/infiniband/core/cma.c:2718
    process_one_work+0x54c/0×930 kernel/workqueue.c:2272
    worker_thread+0x82/0×830 kernel/workqueue.c:2418
    kthread+0x1ca/0×220 kernel/kthread.c:292
    ret_from_fork+0x1f/0×30 arch/x86/entry/entry_64.S:296

   Allocated by task 359:
     kasan_save_stack+0x1b/0×40 mm/kasan/common.c:48
     kasan_set_track mm/kasan/common.c:56 [inline]
     __kasan_kmalloc mm/kasan/common.c:461 [inline]
     __kasan_kmalloc.constprop.0+0xc2/0xd0 mm/kasan/common.c:434
     kmalloc include/linux/slab.h:552 [inline]
     kzalloc include/linux/slab.h:664 [inline]
     ucma_process_join+0x16e/0×3f0 drivers/infiniband/core/ucma.c:1453
     ucma_join_multicast+0xda/0×140 drivers/infiniband/core/ucma.c:1538
     ucma_write+0x1f7/0×280 drivers/infiniband/core/ucma.c:1724
     vfs_write fs/read_write.c:603 [inline]
     vfs_write+0x191/0×4c0 fs/read_write.c:585
     ksys_write+0x1a1/0×1e0 fs/read_write.c:658
     do_syscall_64+0x2d/0×40 arch/x86/entry/common.c:46
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

   Freed by task 359:
     kasan_save_stack+0x1b/0×40 mm/kasan/common.c:48
     kasan_set_track+0x1c/0×30 mm/kasan/common.c:56
     kasan_set_free_info+0x1b/0×30 mm/kasan/generic.c:355
     __kasan_slab_free+0x112/0×160 mm/kasan/common.c:422
     slab_free_hook mm/slub.c:1544 [inline]
     slab_free_freelist_hook mm/slub.c:1577 [inline]
     slab_free mm/slub.c:3142 [inline]
     kfree+0xb3/0×3e0 mm/slub.c:4124
     ucma_process_join+0x22d/0×3f0 drivers/infiniband/core/ucma.c:1497
     ucma_join_multicast+0xda/0×140 drivers/infiniband/core/ucma.c:1538
     ucma_write+0x1f7/0×280 drivers/infiniband/core/ucma.c:1724
     vfs_write fs/read_write.c:603 [inline]
     vfs_write+0x191/0×4c0 fs/read_write.c:585
     ksys_write+0x1a1/0×1e0 fs/read_write.c:658
     do_syscall_64+0x2d/0×40 arch/x86/entry/common.c:46
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
     The buggy address belongs to the object at ffff88810b3ad100
     which belongs to the cache kmalloc-192 of size 192
     The buggy address is located 16 bytes inside of
     192-byte region [ffff88810b3ad100, ffff88810b3ad1c0)

Fixes: b5de0c6 ("RDMA/cma: Fix use after free race in roce multicast join")
Link: https://lore.kernel.org/r/[email protected]
Reported-by: Amit Matityahu <[email protected]>
Signed-off-by: Avihai Horon <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Mar 16, 2021
[ Upstream commit fe454dc ]

ucma_process_join() allocates struct ucma_multicast mc and frees it if an
error occurs during its run.  Specifically, if an error occurs in
copy_to_user(), a use-after-free might happen in the following scenario:

1. mc struct is allocated.
2. rdma_join_multicast() is called and succeeds. During its run,
   cma_iboe_join_multicast() enqueues a work that will later use the
   aforementioned mc struct.
3. copy_to_user() is called and fails.
4. mc struct is deallocated.
5. The work that was enqueued by cma_iboe_join_multicast() is run and
   calls ucma_create_uevent() which tries to access mc struct (which is
   freed by now).

Fix this bug by cancelling the work enqueued by cma_iboe_join_multicast().
Since cma_work_handler() frees struct cma_work, we don't use it in
cma_iboe_join_multicast() so we can safely cancel the work later.

The following syzkaller report revealed it:

   BUG: KASAN: use-after-free in ucma_create_uevent+0x2dd/0x;3f0 drivers/infiniband/core/ucma.c:272
   Read of size 8 at addr ffff88810b3ad110 by task kworker/u8:1/108

   CPU: 1 PID: 108 Comm: kworker/u8:1 Not tainted 5.10.0-rc6+ #257
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS   rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
   Workqueue: rdma_cm cma_work_handler
   Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0xbe/0xf9 lib/dump_stack.c:118
    print_address_description.constprop.0+0x3e/0×60 mm/kasan/report.c:385
    __kasan_report mm/kasan/report.c:545 [inline]
    kasan_report.cold+0x1f/0×37 mm/kasan/report.c:562
    ucma_create_uevent+0x2dd/0×3f0 drivers/infiniband/core/ucma.c:272
    ucma_event_handler+0xb7/0×3c0 drivers/infiniband/core/ucma.c:349
    cma_cm_event_handler+0x5d/0×1c0 drivers/infiniband/core/cma.c:1977
    cma_work_handler+0xfa/0×190 drivers/infiniband/core/cma.c:2718
    process_one_work+0x54c/0×930 kernel/workqueue.c:2272
    worker_thread+0x82/0×830 kernel/workqueue.c:2418
    kthread+0x1ca/0×220 kernel/kthread.c:292
    ret_from_fork+0x1f/0×30 arch/x86/entry/entry_64.S:296

   Allocated by task 359:
     kasan_save_stack+0x1b/0×40 mm/kasan/common.c:48
     kasan_set_track mm/kasan/common.c:56 [inline]
     __kasan_kmalloc mm/kasan/common.c:461 [inline]
     __kasan_kmalloc.constprop.0+0xc2/0xd0 mm/kasan/common.c:434
     kmalloc include/linux/slab.h:552 [inline]
     kzalloc include/linux/slab.h:664 [inline]
     ucma_process_join+0x16e/0×3f0 drivers/infiniband/core/ucma.c:1453
     ucma_join_multicast+0xda/0×140 drivers/infiniband/core/ucma.c:1538
     ucma_write+0x1f7/0×280 drivers/infiniband/core/ucma.c:1724
     vfs_write fs/read_write.c:603 [inline]
     vfs_write+0x191/0×4c0 fs/read_write.c:585
     ksys_write+0x1a1/0×1e0 fs/read_write.c:658
     do_syscall_64+0x2d/0×40 arch/x86/entry/common.c:46
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

   Freed by task 359:
     kasan_save_stack+0x1b/0×40 mm/kasan/common.c:48
     kasan_set_track+0x1c/0×30 mm/kasan/common.c:56
     kasan_set_free_info+0x1b/0×30 mm/kasan/generic.c:355
     __kasan_slab_free+0x112/0×160 mm/kasan/common.c:422
     slab_free_hook mm/slub.c:1544 [inline]
     slab_free_freelist_hook mm/slub.c:1577 [inline]
     slab_free mm/slub.c:3142 [inline]
     kfree+0xb3/0×3e0 mm/slub.c:4124
     ucma_process_join+0x22d/0×3f0 drivers/infiniband/core/ucma.c:1497
     ucma_join_multicast+0xda/0×140 drivers/infiniband/core/ucma.c:1538
     ucma_write+0x1f7/0×280 drivers/infiniband/core/ucma.c:1724
     vfs_write fs/read_write.c:603 [inline]
     vfs_write+0x191/0×4c0 fs/read_write.c:585
     ksys_write+0x1a1/0×1e0 fs/read_write.c:658
     do_syscall_64+0x2d/0×40 arch/x86/entry/common.c:46
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
     The buggy address belongs to the object at ffff88810b3ad100
     which belongs to the cache kmalloc-192 of size 192
     The buggy address is located 16 bytes inside of
     192-byte region [ffff88810b3ad100, ffff88810b3ad1c0)

Fixes: b5de0c6 ("RDMA/cma: Fix use after free race in roce multicast join")
Link: https://lore.kernel.org/r/[email protected]
Reported-by: Amit Matityahu <[email protected]>
Signed-off-by: Avihai Horon <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
angiglesias pushed a commit to angiglesias/raspberry-kernel that referenced this issue Jul 12, 2022
…one()

The subpage we calculate is an invalid pointer for device private pages,
because device private pages are mapped via non-present device private
entries, not ordinary present PTEs.

Let's just not compute broken pointers and fixup later.  Move the proper
assignment of the correct subpage to the beginning of the function and
assert that we really only have a single page in our folio.

This currently results in a BUG when tying to compute anon_exclusive,
because:

[  528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0
[  528.739585] #PF: supervisor read access in kernel mode
[  528.745324] #PF: error_code(0x0000) - not-present page
[  528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0
[  528.756026] Oops: 0000 [raspberrypi#1] PREEMPT SMP NOPTI
[  528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex raspberrypi#257
[  528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021
[  528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000
[  528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29
c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74
0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02
[  528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202
[  528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0
[  528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8
[  528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000
[  528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540
[  528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff
[  528.850865] FS:  00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000
[  528.859891] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0
[  528.874275] PKRU: 55555554
[  528.877286] Call Trace:
[  528.880016]  <TASK>
[  528.882356]  ? lock_is_held_type+0xdf/0x130
[  528.887033]  rmap_walk_anon+0x167/0x410
[  528.891316]  try_to_migrate+0x90/0xd0
[  528.895405]  ? try_to_unmap_one+0xe10/0xe10
[  528.900074]  ? anon_vma_ctor+0x50/0x50
[  528.904260]  ? put_anon_vma+0x10/0x10
[  528.908347]  ? invalid_mkclean_vma+0x20/0x20
[  528.913114]  migrate_vma_setup+0x5f4/0x750
[  528.917691]  dmirror_devmem_fault+0x8c/0x250 [test_hmm]
[  528.923532]  do_swap_page+0xac0/0xe50
[  528.927623]  ? __lock_acquire+0x4b2/0x1ac0
[  528.932199]  __handle_mm_fault+0x949/0x1440
[  528.936876]  handle_mm_fault+0x13f/0x3e0
[  528.941256]  do_user_addr_fault+0x215/0x740
[  528.945928]  exc_page_fault+0x75/0x280
[  528.950115]  asm_exc_page_fault+0x27/0x30
[  528.954593] RIP: 0033:0x40366b
...

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 6c28760 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: "Sierra Guiza, Alejandro (Alex)" <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
Tested-by: Alistair Popple <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "Matthew Wilcox (Oracle)" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
herrnst pushed a commit to herrnst/linux-raspberrypi that referenced this issue Jan 2, 2023
[ Upstream commit 9c4557e ]

Fix the following lockdep warning

[ 1119.158984] ============================================
[ 1119.158988] WARNING: possible recursive locking detected
[ 1119.158996] 6.0.0-rc1+ raspberrypi#257 Tainted: G            E    N
[ 1119.158999] --------------------------------------------
[ 1119.159001] bash/80100 is trying to acquire lock:
[ 1119.159007] ffff88803e79b4a0 (&ns->lock/1){+.+.}-{4:4}, at: destroy_ns.part.0+0x43/0x140
[ 1119.159028]
               but task is already holding lock:
[ 1119.159030] ffff8881009764a0 (&ns->lock/1){+.+.}-{4:4}, at: aa_remove_profiles+0x3f0/0x640
[ 1119.159040]
               other info that might help us debug this:
[ 1119.159042]  Possible unsafe locking scenario:

[ 1119.159043]        CPU0
[ 1119.159045]        ----
[ 1119.159047]   lock(&ns->lock/1);
[ 1119.159051]   lock(&ns->lock/1);
[ 1119.159055]
                *** DEADLOCK ***

Which is caused by an incorrect lockdep nesting notation

Fixes: feb3c76 ("apparmor: fix possible recursive lock warning in __aa_create_ns")
Signed-off-by: John Johansen <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Jan 3, 2023
[ Upstream commit 9c4557e ]

Fix the following lockdep warning

[ 1119.158984] ============================================
[ 1119.158988] WARNING: possible recursive locking detected
[ 1119.158996] 6.0.0-rc1+ #257 Tainted: G            E    N
[ 1119.158999] --------------------------------------------
[ 1119.159001] bash/80100 is trying to acquire lock:
[ 1119.159007] ffff88803e79b4a0 (&ns->lock/1){+.+.}-{4:4}, at: destroy_ns.part.0+0x43/0x140
[ 1119.159028]
               but task is already holding lock:
[ 1119.159030] ffff8881009764a0 (&ns->lock/1){+.+.}-{4:4}, at: aa_remove_profiles+0x3f0/0x640
[ 1119.159040]
               other info that might help us debug this:
[ 1119.159042]  Possible unsafe locking scenario:

[ 1119.159043]        CPU0
[ 1119.159045]        ----
[ 1119.159047]   lock(&ns->lock/1);
[ 1119.159051]   lock(&ns->lock/1);
[ 1119.159055]
                *** DEADLOCK ***

Which is caused by an incorrect lockdep nesting notation

Fixes: feb3c76 ("apparmor: fix possible recursive lock warning in __aa_create_ns")
Signed-off-by: John Johansen <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Jan 13, 2023
[ Upstream commit 9c4557e ]

Fix the following lockdep warning

[ 1119.158984] ============================================
[ 1119.158988] WARNING: possible recursive locking detected
[ 1119.158996] 6.0.0-rc1+ #257 Tainted: G            E    N
[ 1119.158999] --------------------------------------------
[ 1119.159001] bash/80100 is trying to acquire lock:
[ 1119.159007] ffff88803e79b4a0 (&ns->lock/1){+.+.}-{4:4}, at: destroy_ns.part.0+0x43/0x140
[ 1119.159028]
               but task is already holding lock:
[ 1119.159030] ffff8881009764a0 (&ns->lock/1){+.+.}-{4:4}, at: aa_remove_profiles+0x3f0/0x640
[ 1119.159040]
               other info that might help us debug this:
[ 1119.159042]  Possible unsafe locking scenario:

[ 1119.159043]        CPU0
[ 1119.159045]        ----
[ 1119.159047]   lock(&ns->lock/1);
[ 1119.159051]   lock(&ns->lock/1);
[ 1119.159055]
                *** DEADLOCK ***

Which is caused by an incorrect lockdep nesting notation

Fixes: feb3c76 ("apparmor: fix possible recursive lock warning in __aa_create_ns")
Signed-off-by: John Johansen <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Oct 9, 2023
Add various tests to check maximum number of supported programs
being attached:

  # ./vmtest.sh -- ./test_progs -t tc_opts
  [...]
  ./test_progs -t tc_opts
  [    1.185325] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.186826] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  [    1.270123] tsc: Refined TSC clocksource calibration: 3407.988 MHz
  [    1.272428] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fc932722, max_idle_ns: 440795381586 ns
  [    1.276408] clocksource: Switched to clocksource tsc
  #252     tc_opts_after:OK
  #253     tc_opts_append:OK
  #254     tc_opts_basic:OK
  #255     tc_opts_before:OK
  #256     tc_opts_chain_classic:OK
  #257     tc_opts_chain_mixed:OK
  #258     tc_opts_delete_empty:OK
  #259     tc_opts_demixed:OK
  #260     tc_opts_detach:OK
  #261     tc_opts_detach_after:OK
  #262     tc_opts_detach_before:OK
  #263     tc_opts_dev_cleanup:OK
  #264     tc_opts_invalid:OK
  #265     tc_opts_max:OK              <--- (new test)
  #266     tc_opts_mixed:OK
  #267     tc_opts_prepend:OK
  #268     tc_opts_replace:OK
  #269     tc_opts_revision:OK
  Summary: 18/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
popcornmix pushed a commit that referenced this issue Oct 16, 2023
Add a new test case which performs double query of the bpf_mprog through
libbpf API, but also via raw bpf(2) syscall. This is testing to gather
first the count and then in a subsequent probe the full information with
the program array without clearing passed structs in between.

  # ./vmtest.sh -- ./test_progs -t tc_opts
  [...]
  ./test_progs -t tc_opts
  [    1.398818] tsc: Refined TSC clocksource calibration: 3407.999 MHz
  [    1.400263] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fd336761, max_idle_ns: 440795243819 ns
  [    1.402734] clocksource: Switched to clocksource tsc
  [    1.426639] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.428112] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #252     tc_opts_after:OK
  #253     tc_opts_append:OK
  #254     tc_opts_basic:OK
  #255     tc_opts_before:OK
  #256     tc_opts_chain_classic:OK
  #257     tc_opts_chain_mixed:OK
  #258     tc_opts_delete_empty:OK
  #259     tc_opts_demixed:OK
  #260     tc_opts_detach:OK
  #261     tc_opts_detach_after:OK
  #262     tc_opts_detach_before:OK
  #263     tc_opts_dev_cleanup:OK
  #264     tc_opts_invalid:OK
  #265     tc_opts_max:OK
  #266     tc_opts_mixed:OK
  #267     tc_opts_prepend:OK
  #268     tc_opts_query:OK            <--- (new test)
  #269     tc_opts_replace:OK
  #270     tc_opts_revision:OK
  Summary: 19/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
popcornmix pushed a commit that referenced this issue Oct 16, 2023
Add a new test case to query on an empty bpf_mprog and pass the revision
directly into expected_revision for attachment to assert that this does
succeed.

  ./test_progs -t tc_opts
  [    1.406778] tsc: Refined TSC clocksource calibration: 3407.990 MHz
  [    1.408863] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fcaf6eb0, max_idle_ns: 440795321766 ns
  [    1.412419] clocksource: Switched to clocksource tsc
  [    1.428671] bpf_testmod: loading out-of-tree module taints kernel.
  [    1.430260] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #252     tc_opts_after:OK
  #253     tc_opts_append:OK
  #254     tc_opts_basic:OK
  #255     tc_opts_before:OK
  #256     tc_opts_chain_classic:OK
  #257     tc_opts_chain_mixed:OK
  #258     tc_opts_delete_empty:OK
  #259     tc_opts_demixed:OK
  #260     tc_opts_detach:OK
  #261     tc_opts_detach_after:OK
  #262     tc_opts_detach_before:OK
  #263     tc_opts_dev_cleanup:OK
  #264     tc_opts_invalid:OK
  #265     tc_opts_max:OK
  #266     tc_opts_mixed:OK
  #267     tc_opts_prepend:OK
  #268     tc_opts_query:OK
  #269     tc_opts_query_attach:OK     <--- (new test)
  #270     tc_opts_replace:OK
  #271     tc_opts_revision:OK
  Summary: 20/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
popcornmix pushed a commit that referenced this issue Nov 28, 2023
hid_debug_events_release releases resources bound to the HID device instance.
hid_device_release releases the underlying HID device instance potentially
before hid_debug_events_release has completed releasing debug resources bound
to the same HID device instance.

Reference count to prevent the HID device instance from being torn down
preemptively when HID debugging support is used. When count reaches zero,
release core resources of HID device instance using hiddev_free.

The crash:

[  120.728477][ T4396] kernel BUG at lib/list_debug.c:53!
[  120.728505][ T4396] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[  120.739806][ T4396] Modules linked in: bcmdhd dhd_static_buf 8822cu pcie_mhi r8168
[  120.747386][ T4396] CPU: 1 PID: 4396 Comm: hidt_bridge Not tainted 5.10.110 #257
[  120.754771][ T4396] Hardware name: Rockchip RK3588 EVB4 LP4 V10 Board (DT)
[  120.761643][ T4396] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[  120.768338][ T4396] pc : __list_del_entry_valid+0x98/0xac
[  120.773730][ T4396] lr : __list_del_entry_valid+0x98/0xac
[  120.779120][ T4396] sp : ffffffc01e62bb60
[  120.783126][ T4396] x29: ffffffc01e62bb60 x28: ffffff818ce3a200
[  120.789126][ T4396] x27: 0000000000000009 x26: 0000000000980000
[  120.795126][ T4396] x25: ffffffc012431000 x24: ffffff802c6d4e00
[  120.801125][ T4396] x23: ffffff8005c66f00 x22: ffffffc01183b5b8
[  120.807125][ T4396] x21: ffffff819df2f100 x20: 0000000000000000
[  120.813124][ T4396] x19: ffffff802c3f0700 x18: ffffffc01d2cd058
[  120.819124][ T4396] x17: 0000000000000000 x16: 0000000000000000
[  120.825124][ T4396] x15: 0000000000000004 x14: 0000000000003fff
[  120.831123][ T4396] x13: ffffffc012085588 x12: 0000000000000003
[  120.837123][ T4396] x11: 00000000ffffbfff x10: 0000000000000003
[  120.843123][ T4396] x9 : 455103d46b329300 x8 : 455103d46b329300
[  120.849124][ T4396] x7 : 74707572726f6320 x6 : ffffffc0124b8cb5
[  120.855124][ T4396] x5 : ffffffffffffffff x4 : 0000000000000000
[  120.861123][ T4396] x3 : ffffffc011cf4f90 x2 : ffffff81fee7b948
[  120.867122][ T4396] x1 : ffffffc011cf4f90 x0 : 0000000000000054
[  120.873122][ T4396] Call trace:
[  120.876259][ T4396]  __list_del_entry_valid+0x98/0xac
[  120.881304][ T4396]  hid_debug_events_release+0x48/0x12c
[  120.886617][ T4396]  full_proxy_release+0x50/0xbc
[  120.891323][ T4396]  __fput+0xdc/0x238
[  120.895075][ T4396]  ____fput+0x14/0x24
[  120.898911][ T4396]  task_work_run+0x90/0x148
[  120.903268][ T4396]  do_exit+0x1bc/0x8a4
[  120.907193][ T4396]  do_group_exit+0x8c/0xa4
[  120.911458][ T4396]  get_signal+0x468/0x744
[  120.915643][ T4396]  do_signal+0x84/0x280
[  120.919650][ T4396]  do_notify_resume+0xd0/0x218
[  120.924262][ T4396]  work_pending+0xc/0x3f0

[ Rahul Rameshbabu <[email protected]>: rework changelog ]
Fixes: cd667ce ("HID: use debugfs for events/reports dumping")
Signed-off-by: Charles Yi <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
popcornmix pushed a commit that referenced this issue Dec 5, 2023
[ Upstream commit fc43e9c ]

hid_debug_events_release releases resources bound to the HID device instance.
hid_device_release releases the underlying HID device instance potentially
before hid_debug_events_release has completed releasing debug resources bound
to the same HID device instance.

Reference count to prevent the HID device instance from being torn down
preemptively when HID debugging support is used. When count reaches zero,
release core resources of HID device instance using hiddev_free.

The crash:

[  120.728477][ T4396] kernel BUG at lib/list_debug.c:53!
[  120.728505][ T4396] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[  120.739806][ T4396] Modules linked in: bcmdhd dhd_static_buf 8822cu pcie_mhi r8168
[  120.747386][ T4396] CPU: 1 PID: 4396 Comm: hidt_bridge Not tainted 5.10.110 #257
[  120.754771][ T4396] Hardware name: Rockchip RK3588 EVB4 LP4 V10 Board (DT)
[  120.761643][ T4396] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[  120.768338][ T4396] pc : __list_del_entry_valid+0x98/0xac
[  120.773730][ T4396] lr : __list_del_entry_valid+0x98/0xac
[  120.779120][ T4396] sp : ffffffc01e62bb60
[  120.783126][ T4396] x29: ffffffc01e62bb60 x28: ffffff818ce3a200
[  120.789126][ T4396] x27: 0000000000000009 x26: 0000000000980000
[  120.795126][ T4396] x25: ffffffc012431000 x24: ffffff802c6d4e00
[  120.801125][ T4396] x23: ffffff8005c66f00 x22: ffffffc01183b5b8
[  120.807125][ T4396] x21: ffffff819df2f100 x20: 0000000000000000
[  120.813124][ T4396] x19: ffffff802c3f0700 x18: ffffffc01d2cd058
[  120.819124][ T4396] x17: 0000000000000000 x16: 0000000000000000
[  120.825124][ T4396] x15: 0000000000000004 x14: 0000000000003fff
[  120.831123][ T4396] x13: ffffffc012085588 x12: 0000000000000003
[  120.837123][ T4396] x11: 00000000ffffbfff x10: 0000000000000003
[  120.843123][ T4396] x9 : 455103d46b329300 x8 : 455103d46b329300
[  120.849124][ T4396] x7 : 74707572726f6320 x6 : ffffffc0124b8cb5
[  120.855124][ T4396] x5 : ffffffffffffffff x4 : 0000000000000000
[  120.861123][ T4396] x3 : ffffffc011cf4f90 x2 : ffffff81fee7b948
[  120.867122][ T4396] x1 : ffffffc011cf4f90 x0 : 0000000000000054
[  120.873122][ T4396] Call trace:
[  120.876259][ T4396]  __list_del_entry_valid+0x98/0xac
[  120.881304][ T4396]  hid_debug_events_release+0x48/0x12c
[  120.886617][ T4396]  full_proxy_release+0x50/0xbc
[  120.891323][ T4396]  __fput+0xdc/0x238
[  120.895075][ T4396]  ____fput+0x14/0x24
[  120.898911][ T4396]  task_work_run+0x90/0x148
[  120.903268][ T4396]  do_exit+0x1bc/0x8a4
[  120.907193][ T4396]  do_group_exit+0x8c/0xa4
[  120.911458][ T4396]  get_signal+0x468/0x744
[  120.915643][ T4396]  do_signal+0x84/0x280
[  120.919650][ T4396]  do_notify_resume+0xd0/0x218
[  120.924262][ T4396]  work_pending+0xc/0x3f0

[ Rahul Rameshbabu <[email protected]>: rework changelog ]
Fixes: cd667ce ("HID: use debugfs for events/reports dumping")
Signed-off-by: Charles Yi <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Dec 5, 2023
[ Upstream commit fc43e9c ]

hid_debug_events_release releases resources bound to the HID device instance.
hid_device_release releases the underlying HID device instance potentially
before hid_debug_events_release has completed releasing debug resources bound
to the same HID device instance.

Reference count to prevent the HID device instance from being torn down
preemptively when HID debugging support is used. When count reaches zero,
release core resources of HID device instance using hiddev_free.

The crash:

[  120.728477][ T4396] kernel BUG at lib/list_debug.c:53!
[  120.728505][ T4396] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[  120.739806][ T4396] Modules linked in: bcmdhd dhd_static_buf 8822cu pcie_mhi r8168
[  120.747386][ T4396] CPU: 1 PID: 4396 Comm: hidt_bridge Not tainted 5.10.110 #257
[  120.754771][ T4396] Hardware name: Rockchip RK3588 EVB4 LP4 V10 Board (DT)
[  120.761643][ T4396] pstate: 60400089 (nZCv daIf +PAN -UAO -TCO BTYPE=--)
[  120.768338][ T4396] pc : __list_del_entry_valid+0x98/0xac
[  120.773730][ T4396] lr : __list_del_entry_valid+0x98/0xac
[  120.779120][ T4396] sp : ffffffc01e62bb60
[  120.783126][ T4396] x29: ffffffc01e62bb60 x28: ffffff818ce3a200
[  120.789126][ T4396] x27: 0000000000000009 x26: 0000000000980000
[  120.795126][ T4396] x25: ffffffc012431000 x24: ffffff802c6d4e00
[  120.801125][ T4396] x23: ffffff8005c66f00 x22: ffffffc01183b5b8
[  120.807125][ T4396] x21: ffffff819df2f100 x20: 0000000000000000
[  120.813124][ T4396] x19: ffffff802c3f0700 x18: ffffffc01d2cd058
[  120.819124][ T4396] x17: 0000000000000000 x16: 0000000000000000
[  120.825124][ T4396] x15: 0000000000000004 x14: 0000000000003fff
[  120.831123][ T4396] x13: ffffffc012085588 x12: 0000000000000003
[  120.837123][ T4396] x11: 00000000ffffbfff x10: 0000000000000003
[  120.843123][ T4396] x9 : 455103d46b329300 x8 : 455103d46b329300
[  120.849124][ T4396] x7 : 74707572726f6320 x6 : ffffffc0124b8cb5
[  120.855124][ T4396] x5 : ffffffffffffffff x4 : 0000000000000000
[  120.861123][ T4396] x3 : ffffffc011cf4f90 x2 : ffffff81fee7b948
[  120.867122][ T4396] x1 : ffffffc011cf4f90 x0 : 0000000000000054
[  120.873122][ T4396] Call trace:
[  120.876259][ T4396]  __list_del_entry_valid+0x98/0xac
[  120.881304][ T4396]  hid_debug_events_release+0x48/0x12c
[  120.886617][ T4396]  full_proxy_release+0x50/0xbc
[  120.891323][ T4396]  __fput+0xdc/0x238
[  120.895075][ T4396]  ____fput+0x14/0x24
[  120.898911][ T4396]  task_work_run+0x90/0x148
[  120.903268][ T4396]  do_exit+0x1bc/0x8a4
[  120.907193][ T4396]  do_group_exit+0x8c/0xa4
[  120.911458][ T4396]  get_signal+0x468/0x744
[  120.915643][ T4396]  do_signal+0x84/0x280
[  120.919650][ T4396]  do_notify_resume+0xd0/0x218
[  120.924262][ T4396]  work_pending+0xc/0x3f0

[ Rahul Rameshbabu <[email protected]>: rework changelog ]
Fixes: cd667ce ("HID: use debugfs for events/reports dumping")
Signed-off-by: Charles Yi <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Feb 6, 2024
[ Upstream commit c511471 ]

Currently when packet is shrunk via bpf_xdp_adjust_tail() and memory
type is set to MEM_TYPE_XSK_BUFF_POOL, null ptr dereference happens:

[1136314.192256] BUG: kernel NULL pointer dereference, address:
0000000000000034
[1136314.203943] #PF: supervisor read access in kernel mode
[1136314.213768] #PF: error_code(0x0000) - not-present page
[1136314.223550] PGD 0 P4D 0
[1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI
[1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257
[1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT,
BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210
[1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86
[1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246
[1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX:
0000000000000000
[1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
ffffc9003168c000
[1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09:
0000000000010000
[1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12:
0000000000000001
[1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15:
0000000000000001
[1136314.373298] FS:  00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000)
knlGS:0000000000000000
[1136314.386105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4:
00000000007706f0
[1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1136314.431890] PKRU: 55555554
[1136314.439143] Call Trace:
[1136314.446058]  <IRQ>
[1136314.452465]  ? __die+0x20/0x70
[1136314.459881]  ? page_fault_oops+0x15b/0x440
[1136314.468305]  ? exc_page_fault+0x6a/0x150
[1136314.476491]  ? asm_exc_page_fault+0x22/0x30
[1136314.484927]  ? __xdp_return+0x6c/0x210
[1136314.492863]  bpf_xdp_adjust_tail+0x155/0x1d0
[1136314.501269]  bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60
[1136314.511263]  ice_clean_rx_irq_zc+0x206/0xc60 [ice]
[1136314.520222]  ? ice_xmit_zc+0x6e/0x150 [ice]
[1136314.528506]  ice_napi_poll+0x467/0x670 [ice]
[1136314.536858]  ? ttwu_do_activate.constprop.0+0x8f/0x1a0
[1136314.546010]  __napi_poll+0x29/0x1b0
[1136314.553462]  net_rx_action+0x133/0x270
[1136314.561619]  __do_softirq+0xbe/0x28e
[1136314.569303]  do_softirq+0x3f/0x60

This comes from __xdp_return() call with xdp_buff argument passed as
NULL which is supposed to be consumed by xsk_buff_free() call.

To address this properly, in ZC case, a node that represents the frag
being removed has to be pulled out of xskb_list. Introduce
appropriate xsk helpers to do such node operation and use them
accordingly within bpf_xdp_adjust_tail().

Fixes: 24ea501 ("xsk: support mbuf on ZC RX")
Acked-by: Magnus Karlsson <[email protected]> # For the xsk header part
Signed-off-by: Maciej Fijalkowski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
popcornmix pushed a commit that referenced this issue Feb 6, 2024
[ Upstream commit c511471 ]

Currently when packet is shrunk via bpf_xdp_adjust_tail() and memory
type is set to MEM_TYPE_XSK_BUFF_POOL, null ptr dereference happens:

[1136314.192256] BUG: kernel NULL pointer dereference, address:
0000000000000034
[1136314.203943] #PF: supervisor read access in kernel mode
[1136314.213768] #PF: error_code(0x0000) - not-present page
[1136314.223550] PGD 0 P4D 0
[1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI
[1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257
[1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT,
BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210
[1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86
[1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246
[1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX:
0000000000000000
[1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
ffffc9003168c000
[1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09:
0000000000010000
[1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12:
0000000000000001
[1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15:
0000000000000001
[1136314.373298] FS:  00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000)
knlGS:0000000000000000
[1136314.386105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4:
00000000007706f0
[1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1136314.431890] PKRU: 55555554
[1136314.439143] Call Trace:
[1136314.446058]  <IRQ>
[1136314.452465]  ? __die+0x20/0x70
[1136314.459881]  ? page_fault_oops+0x15b/0x440
[1136314.468305]  ? exc_page_fault+0x6a/0x150
[1136314.476491]  ? asm_exc_page_fault+0x22/0x30
[1136314.484927]  ? __xdp_return+0x6c/0x210
[1136314.492863]  bpf_xdp_adjust_tail+0x155/0x1d0
[1136314.501269]  bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60
[1136314.511263]  ice_clean_rx_irq_zc+0x206/0xc60 [ice]
[1136314.520222]  ? ice_xmit_zc+0x6e/0x150 [ice]
[1136314.528506]  ice_napi_poll+0x467/0x670 [ice]
[1136314.536858]  ? ttwu_do_activate.constprop.0+0x8f/0x1a0
[1136314.546010]  __napi_poll+0x29/0x1b0
[1136314.553462]  net_rx_action+0x133/0x270
[1136314.561619]  __do_softirq+0xbe/0x28e
[1136314.569303]  do_softirq+0x3f/0x60

This comes from __xdp_return() call with xdp_buff argument passed as
NULL which is supposed to be consumed by xsk_buff_free() call.

To address this properly, in ZC case, a node that represents the frag
being removed has to be pulled out of xskb_list. Introduce
appropriate xsk helpers to do such node operation and use them
accordingly within bpf_xdp_adjust_tail().

Fixes: 24ea501 ("xsk: support mbuf on ZC RX")
Acked-by: Magnus Karlsson <[email protected]> # For the xsk header part
Signed-off-by: Maciej Fijalkowski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants