Environment:
OS: OEL 7.6
Kernel: 3.10.0
Total Disks: 5
Description:
We have 5 disks of 4 TB each, and content is being written to all of them. The average write time per disk is about 5 ms/request. One of the disks suddenly failed and its filesystem was remounted read-only. After this failure, the response time of the remaining disks degraded to roughly 400 ms/request.
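Numbers like the 5 ms and ~400 ms per-request latencies above can be sampled with a simple write+fsync probe. A minimal python3 sketch is below; the /data1../data5 mount points are placeholders, not our real layout:

    #!/usr/bin/env python3
    # Hypothetical latency probe: writes and fsyncs small blocks on a mount
    # point and reports the average per-request latency in ms. Mount points
    # below are placeholders.
    import os
    import time

    def sample_write_latency(path, requests=100, block_size=4096):
        buf = os.urandom(block_size)
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            total = 0.0
            for _ in range(requests):
                start = time.monotonic()
                os.write(fd, buf)
                os.fsync(fd)  # push the request all the way through the controller
                total += time.monotonic() - start
            return total / requests * 1000.0
        finally:
            os.close(fd)

    if __name__ == "__main__":
        for mnt in ("/data1", "/data2", "/data3", "/data4", "/data5"):
            probe = os.path.join(mnt, ".latency_probe")
            try:
                print("%s: %.1f ms/req" % (mnt, sample_write_latency(probe)))
            finally:
                if os.path.exists(probe):
                    os.remove(probe)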
Failure logs in /var/log/messages:
12:26:48 blk_update_request: I/O error, dev sdi, sector 4744059704
sd 0:2:6:0 [sdi] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
'' '' CDB: Write(16) 8a .....
Aborting journal on device sdi1-8.
Buffer I/O error on dev sdi1, logical block 593007207, lost async page write
sd 0:2:0:0 target reset called for scmd
" " [sda] tag#75 megasas: target reset FAILED!!
tag#75 controller reset is requested due to IO timeout #012SCSI command pointer: ()#011SCSI host state: 5#011 SCSI
Buffer I/O error on dev sdi1, logical block 481144********, lost async page write
JBD2: Error -5 detected when updating journal superblock for sdi1-8
blk_update_request: I/O error, dev sdi, sector 2048
Buffer I/O error on dev sdi1, logical block 0, lost async page write
EXT4-fs error (device sdi1): ext4_journal_check_start:56
" " : Detected aborted journal
12:27:00 EXT4-fs (sdi1): Remounting filesystem read-only
EXT4-fs (sdi1): previous I/O error to superblock detected
12:28:33 megaraid_sas 0000:02:00.0 scanning for scsi0...
megaraid_sas - Controller cache pinned for missing or offline VD 06/8
megaraid_sas - VD 06/8 is now offline
12:26:03 node2 kernel: INFO: task jbd2/sdh1-8:14517 blocked for more than 120 seconds.
task jbd2/sdg1-8:15006 ""
task jbd2/sdf1-8:15470 ""
task jbd2/sda3-8:16962 ""
task jbd2/sda5-8:17588 ""
task rs:main Q:Reg:18320 ""
task writer_service:19063 ""
task writer_service:19094 ""
task writer_service:19096 ""
Call Trace:
blk_peek_request
bit_wait
schedule
schedule_timeout
__blk_run_queue
queue_unplugged
ktime_get_ts
bit_wait
io_schedule_timeout
io_schedule
bit_wait_io
__wait_on_bit
wait_on_page_bit
? wake_bit_function
__filemap_fdatawait_range
filemap_write_and_wait_range
ext4_sync_file
do_fsync
SyS_fsync
system_call_fastpath
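The trace shows the blocked tasks parked in fsync (ext4_sync_file -> filemap_write_and_wait_range), waiting on page writeback that never completes. To get a quick count of which tasks the kernel flagged as hung, a small helper like the sketch below can be run against the log; it only matches the full "INFO: task ... blocked for more than N seconds" lines, not the abbreviated ones quoted above:

    #!/usr/bin/env python3
    # Counts hung-task reports per task name in /var/log/messages. Matches
    # only lines of the full form shown above, e.g.
    #   INFO: task jbd2/sdh1-8:14517 blocked for more than 120 seconds.
    import re
    import sys
    from collections import Counter

    HUNG_TASK = re.compile(r"INFO: task (.+):(\d+) blocked for more than \d+ seconds")

    def summarize(log_path):
        counts = Counter()
        with open(log_path, errors="replace") as log:
            for line in log:
                match = HUNG_TASK.search(line)
                if match:
                    counts[match.group(1)] += 1  # key on the task name only
        return counts

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
        for task, hits in summarize(path).most_common():
            print("%4d  %s" % (hits, task))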
Validations Done:
- Checked resource utilization and looked for any resource crunch; didn't find any.
- No overheating of disks
We were finally able to resolve the issue by clearing the RAID controller cache, but we don't know the exact reason for the slowdown of the other disks.
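In our case the "cleaning" amounted to discarding the preserved (pinned) cache for the offline VD. A rough sketch of that step with MegaCli is below; the binary path, the flags, and the VD/adapter numbers (taken from the "VD 06/8" message above) are assumptions and should be verified against your own controller utility:

    #!/usr/bin/env python3
    # Rough sketch of the cache cleanup. The MegaCli path and flags below
    # are assumptions; check them against the utility shipped for your
    # controller. VD 6 / adapter 0 come from the "VD 06/8" message in the
    # log and may differ on your system.
    import subprocess

    MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"

    def run(args):
        print("+ " + " ".join(args))
        return subprocess.check_output(args, universal_newlines=True)

    if __name__ == "__main__":
        # 1. List VDs whose write-back cache is pinned (preserved) because
        #    the VD went missing or offline.
        print(run([MEGACLI, "-GetPreservedCacheList", "-aALL"]))
        # 2. Discard the preserved cache for the offline VD.
        print(run([MEGACLI, "-DiscardPreservedCache", "-L6", "-a0"]))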
Does anyone know what could be the reason behind the slowness of the other disks?
Regards,
Aman Gupta