After a reboot, my server would intermittently fail to boot with RAID arrays coming up degraded. On closer inspection, the culprit was nvme1 failing to initialize -- kernel logs showed CSTS=0x1 controller reset failures from the Phison E18 controller inside the Seagate FireCuda 530 drives.
The cascade looked like this:
nvme1 controller reset (CSTS=0x1)
-> RAID assembly stalls (nvme1n1p* never appears)
-> /boot/efi mount times out (90s)
-> local-fs.target fails -> cascading service failures
-> boot fails entirely, requires cold power cycle
| Component | Details |
|-----------|---------|
| Motherboard | Supermicro H12SSL-NT v1.02, BIOS 3.3 |
| CPU | AMD EPYC (single socket) |
| NVMe Drives | 2x Seagate FireCuda 530 4TB (ZP4000GM30013) |
| NVMe Controller | Phison PS5018-E18 |
| RAID | 3x mdraid1 (md0, md1, md2) across both drives |
The Phison E18 NVMe controller has a known issue with power state transitions on AMD EPYC platforms. It turned out the problem involved three independent layers of power management -- fixing just one isn't enough:
| Layer | What it controls | Kernel parameter |
|-------|-----------------|-----------------|
| **APST** | NVMe controller's internal autonomous power states | `nvme_core.default_ps_max_latency_us=0` |
| **ASPM** | PCIe link power management (the wires between CPU and device) | `pcie_aspm=off` |
| **D3cold** | Full power removal from the PCIe device during shutdown | udev rule: `ATTR{d3cold_allowed}="0"` |
During shutdown, the kernel puts PCIe devices into D3 (low power state). With D3cold enabled (the default), the NVMe controller's power can be fully removed. On warm reboot, the PCIe bus reactivates but the Phison E18 controller fails to transition from D3cold back to D0 -- resulting in CSTS=0x1 (Controller Fatal Status). A cold boot (full power cycle) works because the entire PCIe bus gets a clean power-on initialization.
Both drives are identical model and firmware, but they occupy different PCIe slots routed through different IOMMU groups on the AMD EPYC I/O die. The different physical slot routing causes different power state timing behavior. This is consistent with reports in Linux kernel bug databases.
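To see how the drives land in different IOMMU groups, sysfs can be queried directly. This is a read-only sketch; the PCI class filter (0x010802 = NVMe) is standard, but whether it prints anything depends on your hardware and whether the IOMMU is enabled:

```shell
# List the IOMMU group of every NVMe controller (PCI class 0x010802).
# Read-only sysfs walk; prints nothing if no NVMe device or IOMMU is present.
for dev in /sys/bus/pci/devices/*; do
  [ "$(cat "$dev/class" 2>/dev/null)" = "0x010802" ] || continue
  group=$(basename "$(readlink -f "$dev/iommu_group" 2>/dev/null)")
  echo "$(basename "$dev") -> IOMMU group ${group:-unknown}"
done
```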
Both drives were initially running firmware SU6SM005. Seagate's release notes for SU6SM100 address controller stability for this class of issue -- but as I found out, the firmware update alone wasn't sufficient.
# Confirm the SMART error counters are climbing
sudo nvme smart-log /dev/nvme0
sudo nvme smart-log /dev/nvme1
# Check firmware revision on both drives
sudo nvme id-ctrl /dev/nvme0 | grep ^fr
sudo nvme id-ctrl /dev/nvme1 | grep ^fr
# Output: fr : SU6SM005
# Confirm the controller errors in the ring buffer
dmesg | grep -iE "nvme.*reset|nvme.*CSTS|nvme.*timeout"
The error log confirmed nvme1 was the repeat offender, with occasional resets on nvme0 under load as well.
Before touching firmware, I applied an immediate mitigation: disabling APST entirely with a kernel parameter. Edit /etc/default/grub and add to GRUB_CMDLINE_LINUX:
GRUB_CMDLINE_LINUX="... nvme_core.default_ps_max_latency_us=0 degradedboot=true"
Two parameters added at the same time:
- nvme_core.default_ps_max_latency_us=0 -- sets the maximum acceptable APST latency to 0us, disabling all autonomous power state transitions
- degradedboot=true -- allows the system to boot even if a RAID array is degraded, so a partially-failed boot no longer requires physical intervention
sudo update-grub
sudo update-initramfs -u -k all
Verify after reboot:
cat /proc/cmdline
# Should contain both new parameters
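Beyond the command line, the APST feature can be read back from the controller itself to confirm the parameter took effect. This assumes nvme-cli is installed and the device node matches your system:

```shell
# Read back the Autonomous Power State Transition feature (FID 0x0c).
# With default_ps_max_latency_us=0, the decoded output should show
# APST disabled (APSTE bit clear) and an empty transition table.
sudo nvme get-feature /dev/nvme1 -f 0x0c -H
```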
This fixed the APST-level issue, but the problem came back. After a warm reboot a few days later, nvme1 again failed with CSTS=0x1 -- the firmware update and APST disable weren't enough because the failure was happening at the PCIe level (ASPM and D3cold), not the NVMe protocol level.
With APST disabled as a safety net, I flashed the firmware on both drives. The NVMe spec's two-phase commit (download + activate) means the new firmware doesn't take effect until the next power cycle, giving you time to stage both drives before committing.
# Stage the firmware image on each drive
sudo nvme fw-download /dev/nvme0 --fw=FireCuda530_SU6SM100.bin
sudo nvme fw-download /dev/nvme1 --fw=FireCuda530_SU6SM100.bin
# Commit with deferred activation (takes effect after power cycle, not reboot)
sudo nvme fw-commit /dev/nvme0 --slot=1 --action=1
sudo nvme fw-commit /dev/nvme1 --slot=1 --action=1
--action=1 means "activate on next reset" -- the controller won't switch to the new firmware until a full power cycle. This is safer than --action=3 (activate immediately, without waiting for a reset), which switches the controller to the new firmware in place while the OS is running.
After a full power cycle (not just reboot -- firmware activation requires the PCIe bus to fully power down):
sudo nvme id-ctrl /dev/nvme0 | grep ^fr
sudo nvme id-ctrl /dev/nvme1 | grep ^fr
# Expected: fr : SU6SM100
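The firmware slot log (log page 0x03) shows which slot holds which image and which slot is active, which is a useful cross-check before and after the power cycle:

```shell
# Firmware slot information log: "afi" indicates the active slot,
# and frs1/frs2/... list the firmware revision staged in each slot.
sudo nvme fw-log /dev/nvme0
sudo nvme fw-log /dev/nvme1
```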
Even with the firmware fix and APST disabled, I wanted a safety net: if a drive ever gets kicked out of an array, the system should self-heal without manual intervention.
mdadm can re-add a drive to a degraded array with mdadm --add, but by default this triggers a full resync -- every block on the array is compared and rewritten. On 4TB NVMe RAID arrays this takes a long time, puts sustained load on the drives, and leaves the array degraded for the entire duration.
Write-intent bitmaps solve this. The bitmap is a small on-disk structure that tracks which regions of the array have been written since the member was last in sync. When a drive re-joins, mdadm only needs to resync the dirty regions -- on a lightly-written array after a clean reboot, this completes in seconds instead of hours.
md2 (the root volume) already had a bitmap. md0 and md1 needed them added:
sudo mdadm --grow /dev/md0 --bitmap=internal
sudo mdadm --grow /dev/md1 --bitmap=internal
Verify bitmaps exist on all arrays:
sudo mdadm --detail /dev/md0 | grep -i bitmap
sudo mdadm --detail /dev/md1 | grep -i bitmap
sudo mdadm --detail /dev/md2 | grep -i bitmap
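The bitmap can also be inspected directly on a member device. The partition name below is an example -- substitute one of your own array members:

```shell
# Dump the write-intent bitmap superblock from a member partition
# (-X is the short form of --examine-bitmap). The "Bitmap" line shows
# total vs dirty chunks; a mostly-clean bitmap means a re-add after a
# clean reboot resyncs almost nothing.
sudo mdadm --examine-bitmap /dev/nvme0n1p3
```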
Rather than relying on mdadm --monitor (which has unreliable re-add behavior), I wrote a dedicated recovery script that runs on a systemd timer:
/usr/local/bin/raid-recovery.sh:
#!/bin/bash
# raid-recovery.sh -- Re-add kicked RAID members using bitmap (no full resync)
# Scans all partitions for matching superblocks -- works regardless of partition layout.
set -euo pipefail
LOGPREFIX="raid-recovery"
log() { logger -t "$LOGPREFIX" "$@"; echo "$@"; }
RECOVERED=0
# Iterate every active md array on the system
for md_path in /sys/block/md*; do
[ -d "$md_path" ] || continue
array=$(basename "$md_path")
dev="/dev/$array"
[ -b "$dev" ] || continue
# Check if degraded via /proc/mdstat
state=$(grep "^$array " /proc/mdstat 2>/dev/null || true)
[ -z "$state" ] && continue
# Extract the [UU] / [_U] / [U_] sync pattern
sync_pattern=$(echo "$state" | grep -oP '\[[U_]+\]' || true)
if [ -z "$sync_pattern" ]; then
# mdstat sometimes splits across lines
sync_pattern=$(awk "/^$array /{found=1;next} found{print;exit}" /proc/mdstat | grep -oP '\[[U_]+\]' || true)
fi
# Skip if fully synced (no underscores = nothing missing)
echo "$sync_pattern" | grep -q '_' || continue
log "$array is degraded ($sync_pattern), scanning for missing members..."
# Get this array's UUID -- the authoritative identity check
array_uuid=$(mdadm --detail "$dev" 2>/dev/null | awk '/UUID/{print $3}' || true)
[ -z "$array_uuid" ] && continue
# Scan every partition on the system for a matching RAID superblock
while IFS= read -r part; do
[ -b "$part" ] || continue
# Skip if this partition is already active in the array
mdadm --detail "$dev" 2>/dev/null | grep -qE "${part}( |$)" && continue
# Check if the partition's superblock UUID matches this array
part_uuid=$(mdadm --examine "$part" 2>/dev/null | awk '/Array UUID/{print $4}' || true)
[ "$part_uuid" = "$array_uuid" ] || continue
# Try --re-add first (bitmap-aware, fast partial resync)
log "Re-adding $part to $dev (bitmap-aware)..."
if mdadm --re-add "$dev" "$part" 2>/dev/null; then
log "SUCCESS: $part re-added to $dev (partial resync via bitmap)"
RECOVERED=$((RECOVERED + 1))
elif mdadm --add "$dev" "$part" 2>/dev/null; then
log "SUCCESS: $part added to $dev (full resync -- bitmap re-add failed)"
RECOVERED=$((RECOVERED + 1))
else
log "FAILED: could not add $part to $dev"
fi
done < <(lsblk -lnp -o NAME,TYPE | awk '$2 == "part" {print $1}')
done
if [ "$RECOVERED" -gt 0 ]; then
log "Recovered $RECOVERED member(s)"
else
log "No action needed -- all arrays healthy or no eligible members found"
fi
The script discovers everything dynamically -- it iterates every md array in /sys/block/, checks if it's degraded, then scans every partition on the system using lsblk. For each partition, it compares the RAID superblock UUID against the degraded array's UUID. No hardcoded partition maps, so it works no matter how your drives are laid out.
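The UUID comparison at the heart of the script is plain field extraction from mdadm's output. A quick sanity check of that parsing against a mocked `mdadm --examine` line (the UUID value here is made up):

```shell
# Mocked "Array UUID" line as printed by `mdadm --examine`; the script's awk
# expression picks the 4th whitespace-separated field: Array, UUID, :, value.
mock_line='     Array UUID : 9f2a1c3e:4b5d6e7f:8a9b0c1d:2e3f4a5b'
part_uuid=$(echo "$mock_line" | awk '/Array UUID/{print $4}')
echo "$part_uuid"
# 9f2a1c3e:4b5d6e7f:8a9b0c1d:2e3f4a5b
```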
**`/etc/systemd/system/mdadm-raid-recovery.service`** -- oneshot service:
[Unit]
Description=RAID degraded member auto-recovery
[Service]
Type=oneshot
ExecStart=/usr/local/bin/raid-recovery.sh
**`/etc/systemd/system/mdadm-raid-recovery.timer`** -- runs 90 seconds after boot (giving arrays time to fully assemble), then every 5 minutes:
[Unit]
Description=Periodic RAID recovery check
[Timer]
OnBootSec=90s
OnUnitActiveSec=5min
[Install]
WantedBy=timers.target
systemctl enable --now mdadm-raid-recovery.timer
The 90-second boot delay is intentional -- arrays that are mid-assembly or mid-check at boot shouldn't be prodded immediately.
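To confirm the timer is registered and to review what each run actually did (the script logs through `logger -t raid-recovery`):

```shell
# Show last/next activation of the recovery timer
systemctl list-timers mdadm-raid-recovery.timer
# Logs from each run, filtered by the script's syslog tag
journalctl -t raid-recovery --since today
```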
**`--re-add` instead of `--add`.** `--re-add` tells mdadm "this device was previously a member and was removed cleanly" -- mdadm checks the write-intent bitmap and only resyncs the blocks marked dirty. If the drive was simply absent for a reboot and not written to independently, the bitmap is nearly clean and the re-add completes almost instantly. `--add` always does a full resync regardless.
The UUID check before re-adding is the safety guard -- it confirms the device being added actually belongs to this array before touching anything.
After running stable for a couple of days on SU6SM100 with APST disabled, the problem returned during a routine warm reboot. The journal from the failed boot told the story:
Mar 12 16:58:23 kernel: nvme nvme0: pci function 0000:01:00.0
Mar 12 16:58:23 kernel: nvme nvme1: pci function 0000:02:00.0
Mar 12 16:58:23 kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
Mar 12 16:58:24 mdadm: DegradedArray event detected on md device /dev/md0
Mar 12 16:58:24 mdadm: DegradedArray event detected on md device /dev/md1
Mar 12 16:58:24 mdadm: DegradedArray event detected on md device /dev/md2
Mar 12 16:59:53 systemd: Timed out waiting for device dev-disk-by-uuid-919E-9464
Mar 12 16:59:53 systemd: Dependency failed for boot-efi.mount - /boot/efi
Mar 12 16:59:53 systemd: Dependency failed for local-fs.target - Local File Systems
A second warm reboot attempt was even worse -- nvme1 didn't even register as a PCI function. It was completely invisible to the kernel. smartd only saw one NVMe device. Only a full power-off (30 seconds, then power on) recovered the drive.
Checking the system's power management state revealed the gap:
# ASPM policy -- firmware was deciding, not the kernel
cat /sys/module/pcie_aspm/parameters/policy
# [default] performance powersave powersupersave
# D3cold -- full power removal was ALLOWED on both drives
cat /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed
# 1
cat /sys/bus/pci/devices/0000:02:00.0/d3cold_allowed
# 1
The NVMe power stack has three independent layers, and I'd only addressed one:
| Layer | State at this point |
|-------|--------------------|
| APST | Disabled (`nvme_core.default_ps_max_latency_us=0`) |
| ASPM | `default` (firmware decides) |
| D3cold | Allowed (`d3cold_allowed=1`) |

During shutdown, the kernel puts PCIe devices into D3. With D3cold allowed, the controller's power is fully removed. On warm reboot, the PCIe link comes back up but the Phison E18 can't transition from D3cold to D0 -- that's the CSTS=0x1.
Create /etc/udev/rules.d/60-nvme-no-d3cold.rules:
# Disable D3cold for Seagate/Phison NVMe controllers
# Vendor 0x1bb1 = Seagate Technology PLC (Phison E18 controller)
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x1bb1", ATTR{d3cold_allowed}="0"
Apply immediately without reboot:
sudo udevadm control --reload-rules && sudo udevadm trigger -s pci
echo 0 | sudo tee /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed
echo 0 | sudo tee /sys/bus/pci/devices/0000:02:00.0/d3cold_allowed
Verify:
cat /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed
# 0
cat /sys/bus/pci/devices/0000:02:00.0/d3cold_allowed
# 0
Add pcie_aspm=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="... nvme_core.default_ps_max_latency_us=0 degradedboot=true pcie_aspm=off"
sudo update-grub
Verify after reboot:
grep -o 'pcie_aspm=off' /proc/cmdline
# pcie_aspm=off
**`nofail` for Boot Partitions.** The EFI partition (/boot/efi) and boot partition (/boot) are only needed for kernel/GRUB updates, not normal operation. Adding nofail prevents a missing partition from cascading into a local-fs.target failure:
/dev/disk/by-id/md-uuid-...-part1 /boot ext4 defaults,nofail 0 2
/dev/disk/by-uuid/919E-9464 /boot/efi vfat defaults,nofail 0 1
This is a defense-in-depth measure -- even if a drive fails to initialize, the system boots on the remaining drive.
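Before rebooting on an edited fstab, it's worth a consistency check. `findmnt --verify` (part of util-linux) parses fstab without mounting anything:

```shell
# Parse and sanity-check /etc/fstab -- mount points, filesystem types,
# and options -- without touching any mounts. Warnings are non-fatal.
sudo findmnt --verify
```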
After applying all three layers of fixes, a warm reboot (sudo reboot) succeeded cleanly:
Mar 12 20:03:56 kernel: nvme nvme0: pci function 0000:01:00.0
Mar 12 20:03:56 kernel: nvme nvme1: pci function 0000:02:00.0
Mar 12 20:03:56 kernel: nvme nvme1: missing or invalid SUBNQN field.
Mar 12 20:03:56 kernel: nvme nvme1: Shutdown timeout set to 10 seconds
Mar 12 20:03:56 kernel: nvme nvme0: missing or invalid SUBNQN field.
Mar 12 20:03:56 kernel: nvme nvme0: Shutdown timeout set to 10 seconds
Mar 12 20:03:56 kernel: nvme nvme0: 32/0/0 default/read/poll queues
Mar 12 20:03:56 kernel: nvme nvme1: 32/0/0 default/read/poll queues
Mar 12 20:03:58 systemd: Mounted boot-efi.mount - /boot/efi.
Both drives initialized within the same second. All RAID arrays came up [UU]. No CSTS errors, no degraded arrays, no timeouts.
| Layer | Fix | Why it's needed |
|-------|-----|----------------|
| NVMe APST | `nvme_core.default_ps_max_latency_us=0` | Prevents controller from entering deep autonomous power states it can't exit |
| PCIe ASPM | `pcie_aspm=off` | Prevents PCIe link from entering low-power states during shutdown/reboot |
| PCIe D3cold | udev rule: `ATTR{d3cold_allowed}="0"` | Prevents full power removal from controller during shutdown sequence |
| Firmware | SU6SM100 (latest) | Seagate's fix for E18 controller stability (necessary but not sufficient) |
| Boot resilience | `degradedboot=true` | Allows boot with degraded RAID |
| Mount resilience | `nofail` on `/boot` and `/boot/efi` | Prevents missing partition from blocking entire boot |
| Self-healing | RAID recovery timer + write-intent bitmaps | Automatic re-add of kicked members with fast partial resync |
- `nvme_core.default_ps_max_latency_us=0` disables APST immediately with no firmware change needed -- use it as a first mitigation.
- `pcie_aspm=off` disables OS-managed PCIe link power states. Combine with the D3cold udev rule for complete coverage.
- `degradedboot=true` in GRUB prevents a degraded array from making the server unbootable without physical access.
- `nofail` in fstab for non-essential mounts (/boot, /boot/efi) prevents cascading boot failures when a drive is slow or absent.
- `--action=1` (deferred activate) is the correct firmware commit strategy on a live system -- stage it, then do a planned power cycle.
- `--re-add` vs `--add` -- always try `--re-add` first on a rejoining member. It uses the bitmap; `--add` ignores it.