Skip to content

Filesystem bugs, including volume corruption #5780

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bmcdonnell-ionx opened this issue Jan 3, 2018 · 62 comments
Closed

Filesystem bugs, including volume corruption #5780

bmcdonnell-ionx opened this issue Jan 3, 2018 · 62 comments

Comments

@bmcdonnell-ionx
Copy link
Contributor

bmcdonnell-ionx commented Jan 3, 2018

Description

  • Type: Bugs
  • Priority: Major?

Bug

Target
LPC4088

Toolchain:
GCC_ARM - yes
ARM|IAR - untested

Toolchain version:

mbed-cli version:
1.2.2

mbed-os sha:
2b4ff78

Expected behavior

  1. readdir() returns NULL to terminate a directory listing.
  2. FATFileSystem writes files to SD Card at a relatively consistent speed.
  3. FATFileSystem never corrupts the file system on the SD Card.

Actual behavior

  1. When operating on a directory on a FAT32 volume containing the maximum number of files per directory (65534), readdir() never returns NULL.
  2. Increasing the number of files in a directory slows down FATFileSystem writes.
  3. Under certain cirumstances, FATFileSystem (or SDBlockDevice?) corrupts the file system on the SD Card. (i.e. It can no longer be read by FATFileSystem, nor by a Windows PC.)

Steps to reproduce

I forked ARMmbed/mbed-os-example-filesystem to create demo code for items 1 and 2. There are tags (linked below) for each of these. I don't have code to share for 3 yet, but I give details below.

All testing was done with an Embedded Artists' LPC4088 QuickStart Board connected to an LPC4088 Experiment Base Board (EBB). The EBB is used for its SD Card slot. Jumpers are configured to use the SPI interface to the SD Card (per Figure 7 in the EBB user guide (PDF)).

I made a small patch to mbed OS so that GCC_ARM will malloc() from the LPC4088's external SDRAM like the mbed online compiler does. Demo programs 1 & 2 use megabytes of memory for a HeapBlockDevice. You should be able to replace this with an SDBlockDevice to avoid the large memory use and test on another target device, but of course this is slower to run.

Of course, add your choice of Serial device to output printf() text.

  1. bug01-mbed-os-5.7.1-65534-files-dirent-ne-null (1fa0b0b)
  2. bug02a-mbed-os.5.7.1-write-slowdown-empty-files (6ab378b)
    bug02b-mbed-os.5.7.1-write-slowdown-small-files (d891ad3)
  3. Using an 8 GB SD Card, write many files to a directory. After each file write, append text via fprintf() to a metadata file in the directory, and a log file in the parent directory.
    • Circumstance 1: Files are exactly 512 KiB each, written via fwrite(). Filesystem is corrupted after writing to the 8160th file. I observed this particular failure mode on at least two units.
    • Circumstance 2: Files are approximately 7.5 MB each, written via fprintf(). Filesystem is corrupted after filling up the volume. Observed on one unit so far.
@0xc0170
Copy link
Contributor

0xc0170 commented Jan 3, 2018

@deepikabhavnani @geky Please review

@bmcdonnell-ionx
Copy link
Contributor Author

Note, #3 takes about two days to run one scenario on the dev board. I haven't yet tried pre-filling the SD Card, though - probably should. 😄

@bmcdonnell-ionx
Copy link
Contributor Author

bmcdonnell-ionx commented Jan 3, 2018

#3 may have something to do with me having a logfile open, the volume fills up from another file, then I try to write to the logfile again. Further investigation required.

@deepikabhavnani
Copy link

deepikabhavnani commented Jan 4, 2018

@bmcdonnell-ionx - Thanks for reporting the issues.
I have looked into 1. and it seems logic at https://github.com/ARMmbed/mbed-os/blob/master/features/filesystem/fat/FATFileSystem.cpp#L638 is incorrect

ChanFS reports FR_OK for end of directory, hence the check should be just if condition and not else if

-- else if (finfo.fname[0] == 0) {
++ if (finfo.fname[0] == 0) {
            break;
        } 

I do not have target device with that big HEAP memory, and testing with SD card will take long. Will confirm once I have verified.

@deepikabhavnani
Copy link

  1. Increasing the number of files in a directory slows down FATFileSystem writes.

Yes it will, as per your example files are created in same directory, and while open call all files are scanned in the directory before creation. See https://github.com/ARMmbed/mbed-os/blob/master/features/filesystem/fat/ChaN/ff.cpp#L3046

You can split files in multiple directories to get better performance.

We would appreciate a PR to improve performance here, else we might pick it up in future based on priorities.

@bmcdonnell-ionx
Copy link
Contributor Author

OK, so @deepikabhavnani is working on #1, and has explained #2.

We would appreciate a PR to improve performance here [#2], else we might pick it up in future based on priorities.

I don't anticipate doing so myself, but for reference, how could one improve the performance here?

@deepikabhavnani
Copy link

@bmcdonnell-ionx - @geky might help you with that, else you can add query in ChanFs community (http://elm-chan.org/fsw/ff/bd/) and see if you can get some help.

@bmcdonnell-ionx
Copy link
Contributor Author

bmcdonnell-ionx commented Jan 8, 2018

I've published test cases for the following. mbed team (@deepikabhavnani, @geky): Is this enough information for you to troubleshoot the issue?

  1. Under certain cirumstances, FATFileSystem (or SDBlockDevice?) corrupts the file system on the SD Card. (i.e. It can no longer be read by FATFileSystem, nor by a Windows PC.)
  1. Using an 8 GB SD Card, write many files to a directory. After each file write, append text via fprintf() to a metadata file in the directory, and a log file in the parent directory.
    • Circumstance 1: Files are exactly 512 KiB each, written via fwrite(). Filesystem is corrupted after writing to the 8160th file. I observed this particular failure mode on at least two units.

bug03a-mbed-os-5.7.2-filesystem-corruption (650c265)
bug03b-mbed-os-5.7.2-filesystem-corruption (167f1e1)

Notes:

  • I'm using malloc() to allocate a 512 KiB block of memory. With my mbed-os patch, malloc() goes to external memory with GCC_ARM.
  • This is imitating an unpublished program of mine where the failure occurred, which is why version a includes faux log and checksum files.
    • Results of some test runs are summarized in the table below. I used identical board pairs with 8 GB SD cards. (Each board pair is denoted as a numbered "unit" in the table.)
  • As before, add your Serial device for console output.
Run# program Date Unit# mbed OS ver# Result
1 (unpublished) 12/22 2 5.7.0 (maybe 5.7.1) Failed to re-open the log after writing the file 0x00001fe0.bin (8160 decimal). Writing that file may have failed too. There was also a checksum error when reading back the file two loop iterations before the failure. Filesystem corrupted.
2 (unpublished) 12/22 3 5.7.0 (maybe 5.7.1) Same as run 1.
3 (unpublished) 1/6 1 5.7.2 Same as run 1, except it failed one iteration earlier (00001fdf.bin).
4 (unpublished) 1/6 2 5.7.2 Same as run 1, except it failed much earlier (0000191f.bin).
5 bug03a (650c265) 1/6 2 5.7.2 Same as run 1.
6 bug03b (167f1e1) 1/7 2 5.7.2 Failed to create 00001fe2.bin. Filesystem corrupted.
7 bug03b (167f1e1) 1/7 3 5.7.2 Same as run 6.

I have another test run executing now without the faux log and checksum files; I'll report results when I have them. If I have time I may try smaller fwrite()s to reduce RAM usage, and see if the results are the same.

@bmcdonnell-ionx
Copy link
Contributor Author

Test runs of my unpublished program did not fail or corrupt the filesystem on 2 GB or 4 GB SD Cards, which can't fit the number of files it took to get to the failure demonstrated above.

@bmcdonnell-ionx
Copy link
Contributor Author

I edited my comment above, including the table. I added the version (03b) without the faux log and checksum files.

I'm experimenting with pre-filling the test directory. I'll update later.

@bmcdonnell-ionx
Copy link
Contributor Author

bmcdonnell-ionx commented Jan 9, 2018

OK. Finally I published:

mbed team (@deepikabhavnani, @geky) - is this test enough good enough for you to use to troubleshoot?

This version doesn't require the external memory. And it runs much more quickly - around 15 minutes on my board. But you must pre-populate the files.

To pre-populate the SD Card, with your PC, create a folder fs-test. Inside that folder, create 8128 files of 512 KiB each. To follow my naming convention, they are named with hex numbers, 00000000.bin through 00001fbf.bin. The data doesn't matter; I just used /dev/zero as the source.

A bash script like this should do the trick:

#!/usr/bin/bash

mkdir fs-test
cd fs-test

# create the first file
dd if=/dev/zero of=00000000.bin bs=512K count=1

# make copies
for f in `printf '%08x.bin\n' $( seq 1 8127 )`
do
   echo $f
   cp 00000000.bin $f
done

I recommend you create the folder on your hard drive, so you can just copy it repeatedly for multiple test runs. If using Windows, I recommend using Windows Explorer to copy/paste the folder to the SD Card. (I had an issue where the mbed device isn't seeing all the files under certain circumstances, such as when things were moved from one location to another within the SD Card, or when files were copied to the SD Card using Cygwin. If I can nail it down, I'll report here or in another issue. But Ctrl+C/Ctrl+V on the fs-test folder in Windows Explorer to copy it from the hard drive to the SD Card is working reliably for me on Windows 10.)

I did three test runs, on three different units, and on all three it failed to create file 00001fe2.bin, and then corrupted the filesystem. (i.e. You then put it in your PC, and it tells you to format it.)

(Sorry it took so long. I had tried changing several things at once, and I couldn't reproduce the errors, so I made one change at a time. With test runs taking 1+ days, it took a while...)

@deepikabhavnani
Copy link

When operating on a directory on a FAT32 volume containing the maximum number of files per directory (65534), readdir() never returns NULL.

@bmcdonnell-ionx Were you able to create 65534 files successfully on HEAP partition/SD card? I was trying with 8GB SD card formatted using FatFilesystem::format() API and using host machine for file creation, below is my observation.

  1. Linux system created 32765 files in any directory and then failed with error (No space left on device). Creation of files in another directory was allowed. Read was successful with application on mbed board.
  2. Windows created 37,767 files and then failed with error. readdir does not return NULL. Using the same SD card on Linux machine, ls -l hangs and does not list the filenames.

@bmcdonnell-ionx
Copy link
Contributor Author

Oops - somehow the call to sdram_init() remained in my bug03d release. Not sure how that happened. Skip that one; use this one:

@deepikabhavnani
Copy link

deepikabhavnani commented Jan 10, 2018

@bmcdonnell-ionx - I already did those changes and tried once. Below is the observation, I will be trying again with format using FatFilesystem::format()

Steps:

  1. Format on windows
  2. 8,128 - 512KB each created on linux
  3. Insert in windows and check - Windows complains after linux dump of files as well. Did scan and repair - no error found. (Windows pop error might be because of *.bin file name. TODO: Rename all files to .txt and try later)
  4. Edited test to create files from 0x1fc0 (8128) - https://gist.github.com/deepikabhavnani/cdfc66003b5f213b31551bd02338b378

Output:
Open logfile /fs/fs-test/log.txt.
Create test directory /fs/fs-test/00000000.
Create checksums file /fs/fs-test/00000000/checks.txt.
/fs/fs-test/00001fc0.bin 13053 ms - Because overwrites
/fs/fs-test/00001fc1.bin 12666 ms
|
|
/fs/fs-test/00002200.bin 12666 ms and still going on (Killed)

Next Step: Try format with FatFilesystem::format() and repeat steps

@bmcdonnell-ionx
Copy link
Contributor Author

@deepikabhavnani, you bring up a good point - I did not track which methods I used to format the SD Cards as it related to the tests. I think I mostly used Windows 10 to format them (with "default allocation size"), but occasionally used the FATFileSystem::format() function on my mbed device.

I already did those changes and tried once.

Which changes?

Looks like your gist was based on my Rev c. I think you'll find my latest Rev e easier to work with.

Windows complains after linux dump of files as well.

I don't have a Linux machine to test with, so I can't duplicate that test. I'm using Cygwin.

Windows pop error might be because of *.bin file name.

Why would that matter? I don't recall it giving me any trouble (Windows 10).

@bmcdonnell-ionx
Copy link
Contributor Author

@deepikabhavnani,

Were you able to create 65534 files successfully on HEAP partition/SD card?

Yes, this demo code does that with empty files on the HeapBlockDevice.

  1. bug01-mbed-os-5.7.1-65534-files-dirent-ne-null (1fa0b0b)

@deepikabhavnani
Copy link

deepikabhavnani commented Jan 10, 2018

@bmcdonnell-ionx I was able to reproduce issue 3 with FatFilesystem::format(), more on that once we find out actual root cause of corruption. Thanks.

@deepikabhavnani
Copy link

@bmcdonnell-ionx - Can you please verify the fix in PR#5829 for issue 3?

@bmcdonnell-ionx
Copy link
Contributor Author

bmcdonnell-ionx commented Jan 11, 2018

@deepikabhavnani - Does it matter if the card was formatted by the mbed device or the PC?

@deepikabhavnani
Copy link

No it should not matter. Block division is done differently by different format utilities, with this case with fat filesystem format we easily got the case where write to 0x800000 block was done. With other methods, it might take more time or some extra file writes.

@bmcdonnell-ionx
Copy link
Contributor Author

I'm running some other testing on my devices now. I'll probably try test 3e tomorrow (abbreviated with prepopulation), and maybe run an extended test (without prepopulation) over the weekend.

Thanks.

@deepikabhavnani
Copy link

deepikabhavnani commented Jan 11, 2018

@bmcdonnell-ionx - Thanks for helping us find this issue

@bmcdonnell-ionx
Copy link
Contributor Author

@deepikabhavnani - You're welcome. Thanks for being responsive and fixing things. :)

@bmcdonnell-ionx
Copy link
Contributor Author

@deepikabhavnani - Until now, I've been testing with mbed-os-5.7.2. Should I merge or cherry-pick your PR to test it?

The merge brings in a lot of other stuff, as shown below.

$ git merge fat_issue_5780_3
Auto-merging targets/TARGET_STM/rtc_api.c
Auto-merging features/FEATURE_BLE/ble/BLE.h
Merge made by the 'recursive' strategy.
 TESTS/host_tests/rtc_calc_auto.py                  |  138 ++++++++++++++
 TESTS/mbed_hal/rtc_time/main.cpp                   |  329 ++++++++++++++------------------
 TESTS/mbed_hal/rtc_time_conv/main.cpp              |  214 +++++++++++++++++++++
 TESTS/mbedmicro-rtos-mbed/CircularBuffer/main.cpp  |  469 +++++++++++++++++++++++++++++++++++++++++++++
 features/FEATURE_BLE/ble/BLE.h                     |    9 +
 features/FEATURE_BLE/ble/generic/GenericGap.h      |  297 +++++++++++++++++++++++++++++
 features/FEATURE_BLE/source/BLE.cpp                |   24 +++
 features/FEATURE_BLE/source/generic/GenericGap.cpp | 1091 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 features/FEATURE_LWIP/lwip-interface/lwip_stack.c  |   43 +++++
 features/filesystem/fat/FATFileSystem.cpp          |   14 +-
 platform/CircularBuffer.h                          |   30 ++-
 platform/CriticalSectionLock.h                     |   49 ++++-
 platform/mbed_mktime.c                             |  117 +++++++-----
 platform/mbed_mktime.h                             |   54 ++++--
 targets/TARGET_Atmel/TARGET_SAM_CortexM4/rtc_api.c |    9 +-
 targets/TARGET_NUVOTON/TARGET_M451/rtc_api.c       |    9 +-
 targets/TARGET_NUVOTON/TARGET_M480/rtc_api.c       |    7 +-
 targets/TARGET_NUVOTON/TARGET_NANO100/rtc_api.c    |    9 +-
 targets/TARGET_NUVOTON/TARGET_NUC472/rtc_api.c     |    9 +-
 targets/TARGET_NXP/TARGET_LPC176X/rtc_api.c        |    9 +-
 targets/TARGET_NXP/TARGET_LPC408X/rtc_api.c        |    9 +-
 targets/TARGET_NXP/TARGET_LPC43XX/rtc_api.c        |    9 +-
 targets/TARGET_RENESAS/TARGET_RZ_A1H/rtc_api.c     |    9 +-
 targets/TARGET_RENESAS/TARGET_VK_RZ_A1H/rtc_api.c  |    9 +-
 targets/TARGET_STM/rtc_api.c                       |    7 +-
 tools/config/__init__.py                           |   26 ++-
 tools/test.py                                      |    2 +-
 tools/test_configs/__init__.py                     |    7 +-
 28 files changed, 2703 insertions(+), 305 deletions(-)
 create mode 100644 TESTS/host_tests/rtc_calc_auto.py
 create mode 100644 TESTS/mbed_hal/rtc_time_conv/main.cpp
 create mode 100644 TESTS/mbedmicro-rtos-mbed/CircularBuffer/main.cpp
 create mode 100644 features/FEATURE_BLE/ble/generic/GenericGap.h
 create mode 100644 features/FEATURE_BLE/source/generic/GenericGap.cpp

@deepikabhavnani
Copy link

Fix did not work on re-running test, let me see more why I am still getting addr 0x0 for block 0x800000. May be some compiler optimization.
@bmcdonnell-ionx - You can cherry-pick. But please wait till I confirm

@deepikabhavnani
Copy link

@bmcdonnell-ionx - Fix in place, you can verify as per your convenience. Thanks

@bmcdonnell-ionx
Copy link
Contributor Author

@deepikabhavnani

Fix in place

Where?

@deepikabhavnani
Copy link

where?

I updated the PR with fix #5829

@bmcdonnell-ionx
Copy link
Contributor Author

@deepikabhavnani, what do you think of my suggested course of action above (wait & see on #3)?

Also, what's the status of issue #1? Did you merge a fix?

For ease of reference, here's our conversation so far on that.

I said:

  1. When operating on a directory on a FAT32 volume containing the maximum number of files per directory (65534), readdir() never returns NULL.

You said:

I have looked into 1. and it seems logic at https://github.com/ARMmbed/mbed-os/blob/master/features/filesystem/fat/FATFileSystem.cpp#L638 is incorrect

ChanFS reports FR_OK for end of directory, hence the check should be just if condition and not else if

-- else if (finfo.fname[0] == 0) {
++ if (finfo.fname[0] == 0) {
            break;
   } 

I do not have target device with that big HEAP memory, and testing with SD card will take long. Will confirm once I have verified.

You said:

Were you able to create 65534 files successfully on HEAP partition/SD card? I was trying with 8GB SD card formatted using FatFilesystem::format() API and using host machine for file creation, below is my observation.

  1. Linux system created 32765 files in any directory and then failed with error (No space left on device). Creation of files in another directory was allowed. Read was successful with application on mbed board.
  2. Windows created 37,767 files and then failed with error. readdir does not return NULL. Using the same SD card on Linux machine, ls -l hangs and does not list the filenames.

I said:

Yes, this demo code does that with empty files on the HeapBlockDevice.

  1. bug01-mbed-os-5.7.1-65534-files-dirent-ne-null (1fa0b0b)

@deepikabhavnani
Copy link

I am pretty sure file creation in SD card failed beyond 32765 files on Linux system, hence was unable to reproduce that issue. Also, I don’t have any system with huge heap to try this. Will let you know if I am working back on that issue.

@deepikabhavnani
Copy link

For issue #3, I would say leave it if you are not able to reproduce it.

@bmcdonnell-ionx
Copy link
Contributor Author

bmcdonnell-ionx commented Jan 19, 2018 via email

@bmcdonnell-ionx
Copy link
Contributor Author

bmcdonnell-ionx commented Jan 19, 2018 via email

@deepikabhavnani
Copy link

Yes

@bmcdonnell-ionx
Copy link
Contributor Author

bmcdonnell-ionx commented Jan 19, 2018 via email

@deepikabhavnani
Copy link

Bug was not in format, still I will give it a try sometime next week.

@bmcdonnell-ionx
Copy link
Contributor Author

"Buggy mbed formatting" was poor phrasing on my part. I meant that f_mkfs() calls the functions you fixed (disk_write() and disk_read()), so I figured maybe the bug would manifest in formatting too - not necessarily that it would be noticed during the format, but maybe it would be a problem when you try to write to certain areas on a large enough volume.

Anyway, if you want to decouple questions, you could try creating the empty files on your HDD, and copy them to the SD card later.

@bmcdonnell-ionx
Copy link
Contributor Author

@deepikabhavnani, re #1, could you either make a commit in a branch somewhere and leave a link here, or answer the following?

I see two lines in FATFileSystem.cpp containing else if (finfo.fname[0] == 0). Should they both be changed to if (finfo.fname[0] == 0)? Is that all that needs to be changed?

@deepikabhavnani
Copy link

@bmcdonnell-ionx - I tried creation after new format and same result.

Should they both be changed

Yes, but that change might not help, I missed the return statement for failure case, so checking fname[0] only in case of success makes sense.

@bmcdonnell-ionx
Copy link
Contributor Author

@deepikabhavnani, if you want me to test, can you just put the commit in a branch somewhere, and point me to it?

@bmcdonnell-ionx
Copy link
Contributor Author

@deepikabhavnani, can you give a status update? Do you think you know what needs to be fixed for bug01? If so, can you please commit in a branch or tag somewhere, and point me to it so I can test?

I'm not clear on the required changes from what you've said here. I'd love to wrap this up and get the changes into Mbed OS 5.7.4.

@deepikabhavnani
Copy link

@bmcdonnell-ionx As mentioned earlier, the proposed change might not work as I missed the return statement in failure case during analysis.
Also, sorry but I am not working on filesystem issues at present, so wont be able to help much.

@bmcdonnell-ionx
Copy link
Contributor Author

Re bug01, I think it is a problem is with FatFs. I made a post on their forum about it.

f_readdir() bug - bad termination when dir full

@ciarmcom
Copy link
Member

ciarmcom commented Jun 1, 2018

ARM Internal Ref: MBOTRIAGE-296

@dannybenor
Copy link

@bmcdonnell-ionx IMHO issues 2 and 3 (slow performance on big directories and corruption of FATFS) are well known issues with FAT file system.
Regarding issue #1, looks like a bug in CHAN.
Any concerns for closing this issue?

@bmcdonnell-ionx
Copy link
Contributor Author

@dannybenor,

Apparently [2] is a known inherent issue. [3] was fixed.

Agreed, [1] appears to be an issue w/ ChaN FatFs.

Any concerns for closing this issue?

Can you document [1] in Mbed OS as a known issue, until ChaN fixes it?

I reported it to them (link above), but there's been no response, other than one other user confirming the issue. Do you know if that's the right place, or any other way to encourage them to take action on it?

@yossi2le
Copy link
Contributor

@bmcdonnell-ionx
I found out that problem has been addressed by ChaN on May 23 2018 for FatFs R0.13b and a workaround have been supplied.
http://elm-chan.org/fsw/ff/patches.html
You may either upgrade the ChaN FatFs implementation in mbed-os and then deploy the patch or in my opinion (however not tested) use the current FatFs R0.13a and deploy the patch at line 1742 instead of 1728.

Do you still has any concerns regarding closing this bug?

@bmcdonnell-ionx
Copy link
Contributor Author

@yossi2le, thanks for your input. I hadn't realized that ChaN had taken any action on my bug report, since there was no mention in my thread on their forum.

You may either [option 1] upgrade the ChaN FatFs implementation in mbed-os and then deploy the patch or in my opinion (however not tested) [option 2] use the current FatFs R0.13a and deploy the patch at line 1742 instead of 1728.

On your [option 1], it doesn't look like the patch has been included in any FatFs release yet.

I will try your [option 2] when I get a chance. Then hopefully I can close this with a PR. 😃

@deepikabhavnani
Copy link

Then hopefully I can close this with a PR.

PR is merged, hence closing this. Please reopen if anything is missed out.

@bmcdonnell-ionx
Copy link
Contributor Author

(Oops, I forgot to close it. Thanks!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants