Bad sectors playing hide and seek on my hard drive

Trenix25

I am using Debian 11.7 and a 4 TB Western Digital hard drive. One of my partitions, /dev/sdb15, supposedly had up to eight bad sectors. This file system is used for virtual machines and is not required for the general operation of the Linux computer. The bad sectors showed up in the system journal. I have tried using badblocks with mke2fs in Debian 10 and Debian 11, and it always locked up and caused trouble, so that really isn't an option for me.

I wrote a bash script to create a huge number of 1 MiB files using dd and /dev/zero to use up all of the free blocks on the file system. Then I tried reading all of these files, but only one of them appeared to have a bad sector. I read all of the files 100 times in a row using direct I/O (iflag=direct with dd) so the reads wouldn't keep coming from the system cache. When it got to the file with the bad sector, the computer locked up because it couldn't access the hard drive. The block size and physical sector size on this hard drive are both 4096 bytes. I isolated the file with the bad sector and finished reading all of the other files without any trouble. Then I erased the problematic file and created new files, each 4096 bytes, in a different subdirectory so I could read those and isolate the bad block in a single, smaller file. These two directories were already using quite a lot of space, like lost+found does. I was able to use /usr/bin/ls to read the directory just fine, and e2fsck -fv /dev/sdb15 didn't complain about any problems with it. I could use /usr/bin/cat to verify that the 1 MiB file had a bad sector before removing it. After creating the 256 4096-byte files, none of them appeared to have any bad sectors, even when reading each one hundreds of times using direct I/O. How can a sector go bad, test as bad numerous times, and then suddenly not be bad anymore?

I have erased the 256 smaller files and created a 1 MiB file again to hold on to the possibly bad sector. I have written a bash script to erase all of those other 1 MiB files, because the command line can't handle over five hundred thousand arguments and I don't want to remove the directory itself; I might need it again later. It will take hours to remove all of them this way. This is currently a work in progress. The crash didn't cause any real harm because there wasn't much happening on the system when it locked up and I had to shut it down. The files were only being read, not written to.
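For anyone curious, the fill-and-read part boils down to roughly this; the mount point and file names below are placeholders, not my actual paths:

Code:
#!/bin/bash
# Fill the file system with 1 MiB files, then read each one back with
# direct I/O so the data comes from the disk instead of the page cache.
dir=/mnt/vms/fill            # placeholder mount point
mkdir -p "$dir"
i=0
# dd exits non-zero once the file system runs out of free blocks.
while dd if=/dev/zero of="$dir/f$i" bs=1M count=1 conv=fsync 2>/dev/null; do
    i=$((i + 1))
done
# Read every file back, bypassing the cache; report any file that fails.
for f in "$dir"/f*; do
    dd if="$f" of=/dev/null bs=4096 iflag=direct 2>/dev/null || echo "Read error: $f"
done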

When I tried to create a VM, one of the files supposedly had a bad sector, and /usr/bin/tar complained about this, saying that the file shrank, even though the VM was not running. Then the tar file itself appeared to have a bad sector, but I didn't know this right away. I made a compressed file from the tar file using /usr/bin/gzip, but when I ran /usr/bin/gzip -t to test the file's integrity it failed. The compressed file appeared to have a bad sector too. There was already another file that supposedly had a bad sector and had been isolated. I erased all of the files related to the current VM, along with the other file that had the bad sector, and created all of those 1 MiB files in a special directory made for that purpose. Only one of them had a bad sector when I tested them. No files had any bad sectors when I tested the 256 small files. How can this be explained?

I may have purchased the hard drive in 2021. A good quality hard drive should last no less than ten to twenty years without any problems. This is not at all unreasonable, especially considering the advances in technology over the years. Anyone who believes otherwise has learned to accept a lower standard of quality in our throwaway culture. Western Digital is supposed to be a good quality brand. I sure wish I could get a new hard drive from Quantum; I hear those were really nice. Let me know if you want a copy of any of those bash scripts.
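For the cleanup, a find-based sketch that sidesteps the argument-limit problem (again, the directory name is a placeholder) would remove the files one at a time without touching the directory itself:

Code:
# find never builds a giant argument list, so ARG_MAX is not an issue,
# and only the files are deleted; the directory is kept for later use.
find /mnt/vms/fill -maxdepth 1 -type f -name 'f*' -delete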

Signed,

Matthew Campbell
 


It was either 'Disks' or 'GParted' I used when I got a message in big red letters that told me 'Drive Failure is Imminent'. It was only one of those apps that gave me that message; I think it was GParted. I didn't have to do anything other than open the app to get the message. That's the very first thing I would do if I were having all those problems.
 
@Trenix25 wrote:
How can a sector go bad, test as bad numerous times, and then suddenly not be bad anymore?

The smartctl tool in long test mode is usually able to mark a sector that has become unreadable as "uncorrectable" or similar. When that sector is again written to, it should get put aside and marked as bad so that the next read of the disk avoids the bad sector and then reads the disk as okay. Disks in recent years have "defect management" where the attempt to write to the bad block triggers a remapping to put it out of action.
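For example, a typical sequence would be something like this (assuming the drive shows up as /dev/sdb; the exact attribute names vary a bit by vendor):

Code:
# Start an extended (long) self-test, wait for it to finish, then review
# the results and the remapping-related attributes.
sudo smartctl -t long /dev/sdb
sudo smartctl -a /dev/sdb | grep -Ei 'reallocated|pending|uncorrectable|self-test'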

I can't say why badblocks failed in your case, but perhaps it was because badblocks uses a default blocksize of 1024 bytes. (See manpage)
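If you do run badblocks by hand, you would want to match the 4096-byte sector size you mentioned, for example (a read-only scan; the partition name is taken from your first post):

Code:
# Read-only surface scan using 4096-byte blocks; -s shows progress, -v is verbose.
sudo badblocks -b 4096 -sv /dev/sdb15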

This observation from https://wiki.archlinux.org/title/badblocks may be helpful:
Typical recommended practice for testing a storage device for bad sectors is to use the manufacturer's testing program; most manufacturers have programs that do this. The main reasoning is that manufacturers usually have their own standards built into the test programs, which will tell you whether the drive needs to be replaced or not. The caveat is that some manufacturers' testing programs do not print full test results and allow a certain number of bad sectors, saying only whether the drive passes or not. Manufacturer programs, however, are generally quicker than badblocks, sometimes by a fair amount.
 
@Trenix25 wrote:


The smartctl tool in long test mode is usually able to mark a sector that has become unreadable as "uncorrectable" or similar. When that sector is again written to
I didn't use that app, but 'Disks' was able to deal with the bad sector. GParted recognized the imminent failure. A few days later, it failed.
 
I didn't use that app, but 'Disks' was able to deal with the bad sector. GParted recognized the imminent failure. A few days later, it failed.

 


I still have HDDs. Maybe that has something to do with it??? IDK....
 
I don't use an SSD because it wouldn't last very long the way I use my system. I don't think I have GParted. I have been having problems with a couple of bad sectors for months. I have heard that bad sectors can be mapped away on newer hard drives. It came back as bad as long as I kept the file with the bad sector, but when the sector became unused it became writable again. The question becomes this: what sectors are replacing the bad sectors if and when this happens?

Signed,

Matthew Campbell
@Trenix25 wrote:


The smartctl tool in long test mode is usually able to mark a sector that has become unreadable as "uncorrectable" or similar. When that sector is again written to, it should get put aside and marked as bad so that the next read of the disk avoids the bad sector and then reads the disk as okay. Disks in recent years have "defect management" where the attempt to write to the bad block triggers a remapping to put it out of action.

I can't say why badblocks failed in your case, but perhaps it was because badblocks uses a default blocksize of 1024 bytes. (See manpage)

This observation from https://wiki.archlinux.org/title/badblocks may be helpful:
This is one of the daemons that I turned off. Perhaps I should turn it back on. I am hesitant to use anything that says it's "smart."

I believe that the man page talks about using badblocks with mke2fs to avoid block size mismatches.
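If I understand it correctly, that amounts to letting mke2fs run the scan itself so the block sizes match, something along these lines (this recreates the file system, so everything on the partition would be lost):

Code:
# -b 4096 matches the drive's sector size; -c runs a read-only badblocks pass
# during mkfs (-cc would do a slower read-write test instead).
sudo mkfs.ext4 -b 4096 -c /dev/sdb15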

Signed,

Matthew Campbell
 
Hard drive with bad sectors or errors = throw it in the trash and replace it with a working drive.

Only purchase from known manufacturers and sources. I will only use Western Digital, Kingston, Seagate, or Crucial because they stand behind the product. Many others sell through Amazon, and when you need warranty service they tell you to buy a new one from Amazon and return the old one, letting Amazon pick up the cost while the manufacturer makes money on a substandard product. You get what you pay for.
 
I still have HDDs. Maybe that has something to do with it??? IDK....
I'm still using bare metal and have no problem with the GNOME disk utility, aka Disks.
[Two screenshots: SMART data for the drive, shown in Disks.]

Reason for the two screenshots is so I can get all of the results shown.
 
I believe the HDD manufacturers allow for bad sectors from the get-go.

Once sectors are detected as bad or unusable, they are marked as unusable and no longer used, so to speak.

I'm no Guru but that's how it was explained to me by someone who knows and is a Guru.

I have a good stash of unopened brand new never used HDDs and I plan on using them until they are gone.

I don't have to have the newest or latest computer hardware.

Check out this $5.00 garage sale find from 2007.
Still had a working Windows Vista OS on it.

Code:
ubuntu@ubuntu:~$ inxi -Fxz
System:
  Kernel: 6.5.0-41-generic x86_64 bits: 64 compiler: N/A Desktop: GNOME 42.9
    Distro: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Machine:
  Type: Desktop Mobo: Gigabyte model: P35-DS3L v: x.x
    serial: <superuser required> BIOS: Award v: F4 date: 08/13/2007
CPU:
  Info: quad core model: Intel Core2 Quad Q6600 bits: 64 type: MCP
    arch: Core Merom rev: B cache: L1: 256 KiB L2: 8 MiB
  Speed (MHz): avg: 1800 high: 2400 min/max: 1600/2400 cores: 1: 1600
    2: 1600 3: 1600 4: 2400 bogomips: 19201
  Flags: ht lm nx pae sse sse2 sse3 ssse3 vmx
Graphics:
  Device-1: AMD Park [Mobility Radeon HD 5430] vendor: ASUSTeK Caicos
    driver: radeon v: kernel bus-ID: 01:00.0
  Display: wayland server: X.Org v: 1.22.1.1 with: Xwayland v: 22.1.1
    compositor: gnome-shell driver: gpu: radeon resolution: 1024x768~75Hz
  OpenGL: renderer: AMD CEDAR (DRM 2.50.0 / 6.5.0-41-generic LLVM 15.0.7)
    v: 4.5 Mesa 23.2.1-1ubuntu3.1~22.04.2 direct render: Yes
Audio:
  Device-1: Intel 82801I HD Audio vendor: Gigabyte driver: snd_hda_intel
    v: kernel bus-ID: 00:1b.0
  Device-2: AMD Cedar HDMI Audio [Radeon HD 5400/6300/7300 Series]
    vendor: ASUSTeK driver: snd_hda_intel v: kernel bus-ID: 01:00.1
  Sound Server-1: ALSA v: k6.5.0-41-generic running: yes
  Sound Server-2: PulseAudio v: 15.99.1 running: yes
  Sound Server-3: PipeWire v: 0.3.48 running: yes
Network:
  Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
    vendor: Gigabyte driver: r8169 v: kernel port: d000 bus-ID: 04:00.0
  IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:
  Local Storage: total: 74.5 GiB used: 11.76 GiB (15.8%)
  ID-1: /dev/sda vendor: Seagate model: ST380819AS size: 74.5 GiB
Partition:
  ID-1: / size: 72.29 GiB used: 11.76 GiB (16.3%) fs: ext4 dev: /dev/sda3
  ID-2: /boot/efi size: 512 MiB used: 6.1 MiB (1.2%) fs: vfat
    dev: /dev/sda2
Swap:
  ID-1: swap-1 type: file size: 2 GiB used: 0 KiB (0.0%) file: /swapfile
Sensors:
  System Temperatures: cpu: 45.0 C mobo: 40.0 C gpu: radeon temp: 50.5 C
  Fan Speeds (RPM): cpu: 629 fan-2: 0 fan-3: 0 fan-4: 2142
  Power: 12v: N/A 5v: 2.94 3.3v: N/A vbat: 3.07
Info:
  Processes: 221 Uptime: 33m Memory: 7.75 GiB used: 1.52 GiB (19.6%)
  Init: systemd runlevel: 5 Compilers: gcc: 11.4.0 Packages: 1659 Shell: Bash
  v: 5.1.16 inxi: 3.3.13
ubuntu@ubuntu:~$
 
I'm still using bare metal and have no problem with the GNOME disk utility, aka Disks.
[Two screenshots: SMART data for the drive, shown in Disks.]
Reason for the two screenshots is so I can get all of the results shown.
That drive needs to be replaced. Note the seek error rate; that is an early sign of death. It means the drive has to retry when reading data, which shows up as slowdowns or a frozen system. The ECC errors show the drive is in bad shape and is likely locking up the system for a time while it corrects errors. The only acceptable value for any error is zero. Once a drive begins using its allocation of spare sectors, it is time to replace it. The sectors reserved by the manufacturer are there so you can replace the drive and hopefully not lose data; they are not meant for continued use. It is like the donut spare in your car: it is meant to get you to the next tire, not to drive on until you sell the car.

GET A NEW DRIVE, THIS ONE IS FUBAR
 
When it fails I'll replace it with one of my NOS HDDs. ;)
I've had drives with way more errors than that last for years.
 
When it fails I'll replace it with one of my NOS HDDs. ;)
I've had drives with way more errors than that last for years.
No problem; it is your choice to ignore good tech advice. I hope you have good backups and don't mind the issues.
 
I can't believe the complacency some have with their data...bad sectors...imminent failure...she'll be right.


Back in the day...when using HDDs (windoze) I had software to test HDDs from Western Digital and Seagate...every time I ran the software I got...Disk is good.

Then one day the HDD failed without warning...from that day I never take for granted what testing software says...I always have a spare SSD...just in case and 3 backup solutions too.


Every now and then I'll run a test with either...SMART test or GSmartControl just to be on the safe side...you can't do any more.

 
Why has nobody mentioned that it's important to format your drive and do a slow bad-block scan to mark the bad sectors as unusable?
If you don't do this, the bad sectors will keep being used and then you get read/write failures.

The question becomes this: what sectors are replacing the bad sectors if and when this happens?
Bad sectors can't be replaced, but they must be marked as "don't use".
Reformat your drive, letting mkfs run the bad-block scan, as follows:
sudo mkfs.ext4 -cc -v -t ext4 -L "LabelName" -b 4096 /dev/sdXn

where sdXn is the partition name.
This is a very slow scan, around 12 hours for a 1 TB HDD.
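If reformatting isn't practical, e2fsck can run the same badblocks scan against an existing (unmounted) ext4 file system and add whatever it finds to the bad block list, for example:

Code:
# Unmount first. -f forces a full check; -c runs a read-only badblocks scan,
# and giving -c twice (-cc) runs the slower non-destructive read-write test.
sudo umount /dev/sdXn
sudo e2fsck -f -cc /dev/sdXn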
 
Why has nobody mentioned that it's important to format your drive and do a slow bad-block scan to mark the bad sectors as unusable?
Reformat your drive, letting mkfs run the bad-block scan, as follows:
sudo mkfs.ext4 -cc -v -t ext4 -L "LabelName" -b 4096 /dev/sdXn
This is just a band-aid. Bad sectors tend to spread like a cancer. Once you have more than zero in any SMART error category, you should replace the drive. Waiting for it to fail completely is like driving a car when you know the transmission is failing but you won't fix it until it can't move any longer.
I run a computer shop, and we advocate that if there are errors on a drive, including any bad sector, it is time to replace it. That is the best practice, and drives are not expensive, so I do not understand why you hold on to them until they fall apart. My data is more important than that.
 
Waiting for it to fail completely is like driving a car when you know the transmission is failing but you won't fix it until it can't move any longer.
drives are not expensive, so I do not understand why you hold on to them until they fall apart.
Yes, drives are not expensive; that's why I have two backup HDDs with the same contents and don't care if one of them fails soon.

One of my HDDs (internal) is at least six years old; the other one (external) is brand new, but I'm not dumping the older one until it's completely worn out.

While your reasoning is OK, I think replacing a drive only because drives are cheap is not a good enough reason, at least not for home users; in a company where data is critical it would make more sense.
 
Yes, drives are not expensive; that's why I have two backup HDDs with the same contents and don't care if one of them fails soon.

One of my HDDs (internal) is at least six years old; the other one (external) is brand new, but I'm not dumping the older one until it's completely worn out.

While your reasoning is OK, I think replacing a drive only because drives are cheap is not a good enough reason, at least not for home users; in a company where data is critical it would make more sense.
I think you misunderstand. I do not replace drives because they are cheap; I replace them because they are failing, usually at the first signs of failure. They are replaced because they are failing, not because they are cheap.

If you do not mind the slowdowns and other issues resulting from a near-death drive, then that is fine. My customers demand top performance, and that means replacing things before they become a true problem. My home users agree, and I ONLY replace a drive if it has problems. Many drives I test turn out to be old but perfectly fine. I still have a working 40 MB (yes, MB, not GB) IDE hard drive.
 
I think you misunderstand. I do not replace drives because they are cheap; I replace them because they are failing, usually at the first signs of failure. They are replaced because they are failing, not because they are cheap.
Yes, I understand that.
I had an HDD with bad sectors and it kept running for a long time afterwards despite them.
But I had never heard that bad sectors cause performance issues.

fsck and similar utilities exist to address bad sectors; these utilities have been around for a very long time and I'm glad they have, otherwise the only option to prevent data loss would be replacing the drive.
 
Yes, I understand that.
I had an HDD with bad sectors and it kept running for a long time afterwards despite them.
But I had never heard that bad sectors cause performance issues.

fsck and similar utilities exist to address bad sectors; these utilities have been around for a very long time and I'm glad they have, otherwise the only option to prevent data loss would be replacing the drive.
Performance issues show up on mechanical drives. Think about how they work: they spin a platter, and the head has to be over the right spot at the right time to read, like a record player. If the information is not where it should be at that moment, it causes a seek error or sometimes an ECC error. Each time that happens the drive has to retry the read, and that takes time. If you see millions of seek errors or ECC errors, it means this is happening millions of times. Multiply that fraction of a second and you see delays; sometimes the delay is as much as two seconds to reread the data.
Now take an unmarked bad sector and those retries go through the roof, and you can see a frozen system. The system is attempting to read or write a bad area, so it must time out and try again, and possibly time out again, until it finds a good area.
What this means is that as the drive continues to fail, it loses integrity and speed due to the errors. In the beginning this may not be noticed, or it may be tolerated, but as the problem gets worse so do the symptoms. System performance is greatly affected by the health of the drive. I have pushed drives when they were failing, but the bottom line is they will lose data; during the failure period that gets ignored, they can also corrupt files, sometimes just a little, sometimes badly. If those are system files, you will again see performance degrade, if not halt completely. You would be surprised how many problems are caused by a failing HDD, or by an off-brand SSD of poor quality.
 

