Hello everyone,
I am currently facing an issue with a server running AlmaLinux 9 that holds 54 TB of data on a RAID 5 array built from a total of 14 disks. Recently, two of the disks encountered corruption and were replaced. Following the replacement and a server reboot, the filesystem indicated that a repair was required, so we initiated the xfs_repair command to address the issue.
However, the xfs_repair process has been running continuously for the past six days, with the same message being displayed every 15 minutes: “rebuild AG headers and trees - 55 of 55 allocation groups done.”
I think it's taking too long and that something may need to be done. I would greatly appreciate any insights, suggestions, or thoughts.
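For reference, the repair was started with nothing more than a plain invocation along these lines (the device name below is only a placeholder, not our actual array device):
# xfs_repair /dev/md0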
Sounds way too long. I’ve just run xfs_repair on a known faulty filesystem of about 1 TiB:
# time xfs_repair /dev/mapper/Tamar-Photos
Phase 1 - find and verify superblock...
- reporting progress in intervals of 15 minutes
Phase 2 - using internal log
- zero log...
totally zeroed log
- scan filesystem freespace and inode maps...
- 08:48:00: scanning filesystem freespace - 33 of 33 allocation groups done
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- 08:48:00: scanning agi unlinked lists - 33 of 33 allocation groups done
- process known inodes and perform inode discovery...
- agno = 0
- agno = 30
...
- agno = 14
- 08:48:01: process known inodes and inode discovery - 23296 of 23296 inodes done
- process newly discovered inodes...
- 08:48:01: process newly discovered inodes - 33 of 33 allocation groups done
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- 08:48:01: setting up duplicate extent list - 33 of 33 allocation groups done
- check for inodes claiming duplicate blocks...
- agno = 1
- agno = 3
- agno = 2
...
- agno = 31
- agno = 32
- 08:48:01: check for inodes claiming duplicate blocks - 23296 of 23296 inodes done
Phase 5 - rebuild AG headers and trees...
- 08:48:01: rebuild AG headers and trees - 33 of 33 allocation groups done
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
- 08:48:01: verify and correct link counts - 33 of 33 allocation groups done
Maximum metadata LSN (1:139264) is ahead of log (0:0).
Format log to cycle 4.
xfs_repair: libxfs_device_zero write failed: Input/output error
real 0m34.308s
user 0m0.121s
sys 0m0.101s
Obviously you can’t simply scale from one to the other, but 34 seconds to 6 days seems a bit much for a 50-fold increase in filesystem size.
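To put rough numbers on that: 54 TB is roughly fifty times 1 TiB, and fifty times 34 seconds is under half an hour, whereas six days is about 520,000 seconds, some 15,000 times longer than my run. Repair time certainly doesn’t scale simply with capacity (inode count, fragmentation and spindle speed matter far more), but that is still an enormous gap.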
Thank you @MartinR
Do you think it would be OK to reboot and attempt the process again?
I’m sorry, but without a great deal more information, particularly how the server is used, I can’t give a simple yes/no. As the saying goes: “if you break it, you get to keep the pieces”.
Having been unhelpful, IME it is safe to stop xfs_repair and restart it; indeed, the man page says under “EXIT STATUS” that for some errors “xfs_repair should be restarted”. Are there other filesystems/services being used on this machine? If the machine is otherwise unusable then I would be tempted to:
- Ensure you are monitoring the system for any hardware problems, for instance tailing /var/log/messages through grep to watch for disk errors as they appear (see the example commands after this list). If you are using an external hardware RAID controller, keep an eye on its logs.
- Keep monitoring the system for IO activity: pcp, iotop, etc.
- If nothing is happening, stop the xfs_repair and restart it, and watch the various monitors.
- Consider running benchmarks on the replaced spindles.
- Finally, reboot. XFS is a pretty stable filesystem with normally good recovery.
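To make those items a little more concrete, here is roughly what I have in mind; /dev/md0 and /dev/sdX are only placeholders since I don't know your actual array and disk device names, and the grep pattern is just a starting point.
Watch the logs for hardware errors:
# tail -f /var/log/messages | grep -iE 'error|fail|sector|scsi|ata'
Watch IO activity (either tool will do):
# pmstat
# iotop -o
Stopping and restarting the repair is just a matter of killing the process and rerunning it:
# pkill xfs_repair
# xfs_repair /dev/md0
And a quick read test on a replaced spindle:
# hdparm -t /dev/sdX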
Oh, and you do have good backups, don't you?