Debugging Seagate FireCuda 530 NVMe Boot Failures on Linux: Firmware, APST, ASPM, and D3cold

The Problem

After a reboot, my server would intermittently fail to boot with RAID arrays coming up degraded. On closer inspection, the culprit was nvme1 failing to initialize -- kernel logs showed CSTS=0x1 controller reset failures from the Phison E18 controller inside the Seagate FireCuda 530 drives.

The cascade looked like this:

  1. nvme1 controller reset (CSTS=0x1)
  2. RAID assembly stalls (nvme1n1p* never appears)
  3. /boot/efi mount times out (90s)
  4. local-fs.target fails
  5. cascading service failures
  6. boot fails entirely; requires a cold power cycle

The Hardware

| Component | Details |
|-----------|---------|
| Motherboard | Supermicro H12SSL-NT v1.02, BIOS 3.3 |
| CPU | AMD EPYC (single socket) |
| NVMe Drives | 2x Seagate FireCuda 530 4TB (ZP4000GM30013) |
| NVMe Controller | Phison PS5018-E18 |
| RAID | 3x mdraid1 (md0, md1, md2) across both drives |

Root Cause: Three Layers of NVMe Power Management

The Phison E18 NVMe controller has a known issue with power state transitions on AMD EPYC platforms. It turned out the problem involved three independent layers of power management, and fixing just one isn't enough:

| Layer | What it controls | Kernel parameter |
|-------|-----------------|-----------------|
| **APST** | NVMe controller's internal autonomous power states | `nvme_core.default_ps_max_latency_us=0` |
| **ASPM** | PCIe link power management (the wires between CPU and device) | `pcie_aspm=off` |
| **D3cold** | Full power removal from the PCIe device during shutdown | udev rule: `ATTR{d3cold_allowed}="0"` |

What happens during a warm reboot

During shutdown, the kernel puts PCIe devices into D3 (low power state). With D3cold enabled (the default), the NVMe controller's power can be fully removed. On warm reboot, the PCIe bus reactivates but the Phison E18 controller fails to transition from D3cold back to D0 -- resulting in CSTS=0x1 (Controller Fatal Status). A cold boot (full power cycle) works because the entire PCIe bus gets a clean power-on initialization.

Why only one drive fails

Both drives are identical model and firmware, but they occupy different PCIe slots routed through different IOMMU groups on the AMD EPYC I/O die. The different physical slot routing causes different power state timing behavior. This is consistent with reports in Linux kernel bug databases.

Both drives were initially running firmware SU6SM005. Seagate's release notes for SU6SM100 address controller stability for this class of issue -- but as I found out, the firmware update alone wasn't sufficient.

Step 1: Diagnosing with nvme-cli

# Confirm the SMART error counters are climbing
sudo nvme smart-log /dev/nvme0
sudo nvme smart-log /dev/nvme1

# Check firmware revision on both drives
sudo nvme id-ctrl /dev/nvme0 | grep ^fr
sudo nvme id-ctrl /dev/nvme1 | grep ^fr
# Output: fr : SU6SM005

# Confirm the controller errors in the ring buffer
dmesg | grep -iE "nvme.*reset|nvme.*CSTS|nvme.*timeout"

The error log confirmed nvme1 was the repeat offender, with occasional resets on nvme0 under load as well.
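The grep pattern above can be sanity-checked against canned log lines (these are illustrative, not captured from this machine) -- it should keep the controller reset and CSTS lines and drop unrelated noise:

```shell
# Canned dmesg excerpt: two NVMe failure lines plus one unrelated line
sample='nvme nvme1: I/O 12 QID 4 timeout, reset controller
nvme nvme1: Device not ready; aborting reset, CSTS=0x1
usb 1-1: new high-speed USB device number 2'

# Same filter as used against the real ring buffer
printf '%s\n' "$sample" | grep -iE "nvme.*reset|nvme.*CSTS|nvme.*timeout"
```

Only the two nvme lines survive the filter.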

Step 2: Workaround First -- Disable APST via GRUB

Before touching firmware, I applied an immediate mitigation: disabling APST entirely with a kernel parameter. Edit /etc/default/grub and add to GRUB_CMDLINE_LINUX:

GRUB_CMDLINE_LINUX="... nvme_core.default_ps_max_latency_us=0 bootdegraded=true"

Two parameters added at the same time:

  • nvme_core.default_ps_max_latency_us=0 -- sets the maximum acceptable APST latency to 0us, disabling all autonomous power state transitions
  • bootdegraded=true -- lets the initramfs assemble and boot from a degraded RAID array, so a partially-failed boot no longer requires physical intervention

sudo update-grub
sudo update-initramfs -u -k all

Verify after reboot:

cat /proc/cmdline
# Should contain both new parameters

This mitigated the APST-level issue, but the problem came back: after a warm reboot a few days later, nvme1 again failed with CSTS=0x1. As the next steps show, neither disabling APST nor updating the firmware was enough on its own, because the failure was happening at the PCIe level (ASPM and D3cold), not the NVMe protocol level.

Step 3: Firmware Update -- SU6SM100

With APST disabled as a safety net, I flashed the firmware on both drives. The NVMe spec's two-phase commit (download + activate) means the new firmware doesn't take effect until the next power cycle, giving you time to stage both drives before committing.

# Stage the firmware image on each drive
sudo nvme fw-download /dev/nvme0 --fw=FireCuda530_SU6SM100.bin
sudo nvme fw-download /dev/nvme1 --fw=FireCuda530_SU6SM100.bin

# Commit with deferred activation (takes effect after power cycle, not reboot)
sudo nvme fw-commit /dev/nvme0 --slot=1 --action=1
sudo nvme fw-commit /dev/nvme1 --slot=1 --action=1

--action=1 tells the controller to replace the slot's image and activate it at the next reset -- it won't switch to the new firmware until then (in this case, the next full power cycle). This is safer than --action=3, which requests that the controller activate the new image immediately, while the OS is running.

After a full power cycle (not just reboot -- firmware activation requires the PCIe bus to fully power down):

sudo nvme id-ctrl /dev/nvme0 | grep ^fr
sudo nvme id-ctrl /dev/nvme1 | grep ^fr
# Expected: fr : SU6SM100

Step 4: Automatic RAID Re-Add with Write-Intent Bitmaps

Even with the firmware fix and APST disabled, I wanted a safety net: if a drive ever gets kicked out of an array, the system should self-heal without manual intervention.

The Problem with Full Resyncs

mdadm can re-add a drive to a degraded array with mdadm --add, but by default this triggers a full resync -- every block on the array is compared and rewritten. On 4TB NVMe RAID arrays this takes a long time, puts sustained load on the drives, and leaves the array degraded for the entire duration.

Write-intent bitmaps solve this. The bitmap is a small on-disk structure that tracks which regions of the array have been written since the member was last in sync. When a drive re-joins, mdadm only needs to resync the dirty regions -- on a lightly-written array after a clean reboot, this completes in seconds instead of hours.
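A back-of-envelope comparison makes the difference concrete. Assuming the kernel's default resync ceiling (dev.raid.speed_limit_max, roughly 200 MB/s) and a small illustrative dirty set:

```shell
# Rough resync-time comparison. The rate is the kernel's default
# dev.raid.speed_limit_max ceiling, not a measurement from these drives;
# the dirty-region size is illustrative.
size_gib=3726      # ~4 TB member, in GiB
rate_mib_s=200     # assumed sustained resync rate
dirty_gib=2        # dirty regions after a clean reboot (illustrative)

full_s=$(( size_gib * 1024 / rate_mib_s ))
partial_s=$(( dirty_gib * 1024 / rate_mib_s ))
echo "full resync: ~$(( full_s / 60 )) min; bitmap re-add: ~${partial_s}s"
```

Real numbers depend on drive throughput and on how much was written while the member was absent, but the orders of magnitude hold.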

Adding Bitmaps

md2 (the root volume) already had a bitmap. md0 and md1 needed them added:

sudo mdadm --grow /dev/md0 --bitmap=internal
sudo mdadm --grow /dev/md1 --bitmap=internal

Verify bitmaps exist on all arrays:

sudo mdadm --detail /dev/md0 | grep -i bitmap
sudo mdadm --detail /dev/md1 | grep -i bitmap
sudo mdadm --detail /dev/md2 | grep -i bitmap

The Auto-Recovery Service

Rather than relying on mdadm --monitor (which has unreliable re-add behavior), I wrote a dedicated recovery script that runs on a systemd timer:

/usr/local/bin/raid-recovery.sh:

#!/bin/bash
# raid-recovery.sh -- Re-add kicked RAID members using bitmap (no full resync)
# Scans all partitions for matching superblocks -- works regardless of partition layout.

set -euo pipefail

LOGPREFIX="raid-recovery"

log() { logger -t "$LOGPREFIX" "$@"; echo "$@"; }

RECOVERED=0

# Iterate every active md array on the system
for md_path in /sys/block/md*; do
    [ -d "$md_path" ] || continue
    array=$(basename "$md_path")
    dev="/dev/$array"
    [ -b "$dev" ] || continue

    # Check if degraded via /proc/mdstat
    state=$(grep "^$array " /proc/mdstat 2>/dev/null || true)
    [ -z "$state" ] && continue

    # Extract the [UU] / [_U] / [U_] sync pattern
    sync_pattern=$(echo "$state" | grep -oP '\[U+_*\]' || true)
    if [ -z "$sync_pattern" ]; then
        # mdstat sometimes splits across lines
        sync_pattern=$(awk "/^$array /{found=1;next} found{print;exit}" /proc/mdstat | grep -oP '\[U+_*\]' || true)
    fi

    # Skip if fully synced (no underscores = nothing missing)
    echo "$sync_pattern" | grep -q '_' || continue

    log "$array is degraded ($sync_pattern), scanning for missing members..."

    # Get this array's UUID -- the authoritative identity check
    array_uuid=$(mdadm --detail "$dev" 2>/dev/null | awk '/UUID/{print $3}' || true)
    [ -z "$array_uuid" ] && continue

    # Scan every partition on the system for a matching RAID superblock
    while IFS= read -r part; do
        [ -b "$part" ] || continue

        # Skip if this partition is already active in the array
        mdadm --detail "$dev" 2>/dev/null | grep -q "$part" && continue

        # Check if the partition's superblock UUID matches this array
        part_uuid=$(mdadm --examine "$part" 2>/dev/null | awk '/Array UUID/{print $4}' || true)
        [ "$part_uuid" = "$array_uuid" ] || continue

        # Try --re-add first (bitmap-aware, fast partial resync)
        log "Re-adding $part to $dev (bitmap-aware)..."
        if mdadm --re-add "$dev" "$part" 2>/dev/null; then
            log "SUCCESS: $part re-added to $dev (partial resync via bitmap)"
            RECOVERED=$((RECOVERED + 1))
        elif mdadm --add "$dev" "$part" 2>/dev/null; then
            log "SUCCESS: $part added to $dev (full resync -- bitmap re-add failed)"
            RECOVERED=$((RECOVERED + 1))
        else
            log "FAILED: could not add $part to $dev"
        fi
    done < <(lsblk -lnp -o NAME,TYPE | awk '$2 == "part" {print $1}')
done

if [ "$RECOVERED" -gt 0 ]; then
    log "Recovered $RECOVERED member(s)"
else
    log "No action needed -- all arrays healthy or no eligible members found"
fi

The script discovers everything dynamically -- it iterates every md array in /sys/block/, checks if it's degraded, then scans every partition on the system using lsblk. For each partition, it compares the RAID superblock UUID against the degraded array's UUID. No hardcoded partition maps, so it works no matter how your drives are laid out.
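The script's degraded-detection logic can be exercised outside the script against a canned /proc/mdstat excerpt (array layout and block counts here are made up):

```shell
# Canned /proc/mdstat: md1 healthy, md0 missing a member
mdstat='md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      1048512 blocks super 1.2 [2/2] [UU]
md0 : active raid1 nvme0n1p3[0]
      524224 blocks super 1.2 [2/1] [U_]'

for array in md0 md1; do
    # mdstat puts the [UU]/[U_] pattern on the line after the array name
    pattern=$(printf '%s\n' "$mdstat" \
        | awk -v a="$array" '$0 ~ "^"a" " {found=1; next} found {print; exit}' \
        | grep -oP '\[U+_*\]' || true)
    if printf '%s' "$pattern" | grep -q '_'; then
        echo "$array degraded ($pattern)"
    else
        echo "$array healthy ($pattern)"
    fi
done
```

This prints `md0 degraded ([U_])` and `md1 healthy ([UU])` -- the underscore is what flags a missing member.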

/etc/systemd/system/mdadm-raid-recovery.service -- oneshot service:

[Unit]
Description=RAID degraded member auto-recovery

[Service]
Type=oneshot
ExecStart=/usr/local/bin/raid-recovery.sh

/etc/systemd/system/mdadm-raid-recovery.timer -- runs 90 seconds after boot (giving arrays time to fully assemble), then every 5 minutes:

[Unit]
Description=Periodic RAID recovery check

[Timer]
OnBootSec=90s
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target


systemctl enable --now mdadm-raid-recovery.timer

The 90-second boot delay is intentional -- arrays that are mid-assembly or mid-check at boot shouldn't be prodded immediately.

Why --re-add instead of --add

--re-add tells mdadm "this device was previously a member and was removed cleanly" -- mdadm checks the write-intent bitmap and only resyncs the blocks marked dirty. If the drive was simply absent for a reboot and not written to independently, the bitmap is nearly clean and the re-add completes almost instantly. --add always does a full resync regardless.

The UUID check before re-adding is the safety guard -- it confirms the device being added actually belongs to this array before touching anything.
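That guard can be exercised in isolation on canned mdadm output (this UUID is invented):

```shell
# Canned lines in the shape mdadm --detail / --examine produce
detail_line='           UUID : 1f2e3d4c:5b6a7988:aabbccdd:eeff0011'
examine_line='     Array UUID : 1f2e3d4c:5b6a7988:aabbccdd:eeff0011'

# Same awk field positions as the script: $3 for --detail, $4 for --examine
array_uuid=$(printf '%s\n' "$detail_line" | awk '/UUID/{print $3}')
part_uuid=$(printf '%s\n' "$examine_line" | awk '/Array UUID/{print $4}')

if [ "$part_uuid" = "$array_uuid" ]; then
    echo "UUIDs match -- safe to attempt --re-add"
else
    echo "UUID mismatch -- leave this partition alone"
fi
```

The field positions differ because `--examine` prefixes the label with "Array", shifting the value one field to the right.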

Step 5: The Firmware Fix Wasn't Enough -- PCIe Power Management

After running stable for a couple of days on SU6SM100 with APST disabled, the problem returned during a routine warm reboot. The journal from the failed boot told the story:

Mar 12 16:58:23 kernel: nvme nvme0: pci function 0000:01:00.0
Mar 12 16:58:23 kernel: nvme nvme1: pci function 0000:02:00.0
Mar 12 16:58:23 kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1

Mar 12 16:58:24 mdadm: DegradedArray event detected on md device /dev/md0
Mar 12 16:58:24 mdadm: DegradedArray event detected on md device /dev/md1
Mar 12 16:58:24 mdadm: DegradedArray event detected on md device /dev/md2

Mar 12 16:59:53 systemd: Timed out waiting for device dev-disk-by-uuid-919E-9464
Mar 12 16:59:53 systemd: Dependency failed for boot-efi.mount - /boot/efi
Mar 12 16:59:53 systemd: Dependency failed for local-fs.target - Local File Systems

A second warm reboot attempt was even worse -- nvme1 didn't even register as a PCI function. It was completely invisible to the kernel. smartd only saw one NVMe device. Only a full power-off (30 seconds, then power on) recovered the drive.

Diagnosing the PCIe Layer

Checking the system's power management state revealed the gap:

# ASPM policy -- firmware was deciding, not the kernel
cat /sys/module/pcie_aspm/parameters/policy
# [default] performance powersave powersupersave

# D3cold -- full power removal was ALLOWED on both drives
cat /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed
# 1
cat /sys/bus/pci/devices/0000:02:00.0/d3cold_allowed
# 1

The NVMe power stack has three independent layers, and I'd only addressed one:

  1. APST (NVMe controller internal) -- disabled via nvme_core.default_ps_max_latency_us=0
  2. ASPM (PCIe link power management) -- set to default (firmware decides)
  3. D3cold (full power removal from device) -- allowed on both drives

During shutdown, the kernel puts PCIe devices into D3. With D3cold allowed, the controller's power is fully removed. On warm reboot, the PCIe link comes back up but the Phison E18 can't transition from D3cold to D0 -- that's the CSTS=0x1.

Fix 1: Disable D3cold via udev Rule

Create /etc/udev/rules.d/60-nvme-no-d3cold.rules:

# Disable D3cold for Seagate/Phison NVMe controllers
# Vendor 0x1bb1 = Seagate Technology PLC (Phison E18 controller)
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x1bb1", ATTR{d3cold_allowed}="0"

Apply immediately without reboot:

sudo udevadm control --reload-rules && sudo udevadm trigger -s pci -c add
echo 0 | sudo tee /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed
echo 0 | sudo tee /sys/bus/pci/devices/0000:02:00.0/d3cold_allowed

Verify:

cat /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed
# 0
cat /sys/bus/pci/devices/0000:02:00.0/d3cold_allowed
# 0

Fix 2: Disable PCIe ASPM

Add pcie_aspm=off to GRUB_CMDLINE_LINUX in /etc/default/grub:

GRUB_CMDLINE_LINUX="... nvme_core.default_ps_max_latency_us=0 bootdegraded=true pcie_aspm=off"

sudo update-grub

Verify after reboot:

grep -o 'pcie_aspm=off' /proc/cmdline
# pcie_aspm=off

Fix 3: fstab nofail for Boot Partitions

The EFI partition (/boot/efi) and boot partition (/boot) are only needed for kernel/GRUB updates, not normal operation. Adding nofail prevents a missing partition from cascading into a local-fs.target failure:

/dev/disk/by-id/md-uuid-...-part1 /boot ext4 defaults,nofail 0 2
/dev/disk/by-uuid/919E-9464       /boot/efi vfat defaults,nofail 0 2

This is a defense-in-depth measure -- even if a drive fails to initialize, the system boots on the remaining drive.
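A related refinement, not applied above but worth knowing: the 90-second stall in the failure logs is systemd's default device timeout, and it can be shortened per mount with the x-systemd.device-timeout option. A hypothetical variant of the EFI line:

```
/dev/disk/by-uuid/919E-9464  /boot/efi  vfat  defaults,nofail,x-systemd.device-timeout=10s  0  2
```

With both nofail and a short device timeout, a missing drive costs a few seconds of boot time instead of a minute and a half.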

Results

After applying all three layers of fixes, a warm reboot (sudo reboot) succeeded cleanly:

Mar 12 20:03:56 kernel: nvme nvme0: pci function 0000:01:00.0
Mar 12 20:03:56 kernel: nvme nvme1: pci function 0000:02:00.0
Mar 12 20:03:56 kernel: nvme nvme1: missing or invalid SUBNQN field.
Mar 12 20:03:56 kernel: nvme nvme1: Shutdown timeout set to 10 seconds
Mar 12 20:03:56 kernel: nvme nvme0: missing or invalid SUBNQN field.
Mar 12 20:03:56 kernel: nvme nvme0: Shutdown timeout set to 10 seconds
Mar 12 20:03:56 kernel: nvme nvme0: 32/0/0 default/read/poll queues
Mar 12 20:03:56 kernel: nvme nvme1: 32/0/0 default/read/poll queues
Mar 12 20:03:58 systemd: Mounted boot-efi.mount - /boot/efi.

Both drives initialized within the same second. All RAID arrays came up [UU]. No CSTS errors, no degraded arrays, no timeouts.

The Complete Fix Stack

| Layer | Fix | Why it's needed |
|-------|-----|----------------|
| NVMe APST | `nvme_core.default_ps_max_latency_us=0` | Prevents controller from entering deep autonomous power states it can't exit |
| PCIe ASPM | `pcie_aspm=off` | Prevents PCIe link from entering low-power states during shutdown/reboot |
| PCIe D3cold | udev rule: `ATTR{d3cold_allowed}="0"` | Prevents full power removal from controller during shutdown sequence |
| Firmware | SU6SM100 (latest) | Seagate's fix for E18 controller stability (necessary but not sufficient) |
| Boot resilience | `bootdegraded=true` | Allows boot with degraded RAID |
| Mount resilience | `nofail` on `/boot` and `/boot/efi` | Prevents missing partition from blocking entire boot |
| Self-healing | RAID recovery timer + write-intent bitmaps | Automatic re-add of kicked members with fast partial resync |

Key Takeaways

  • CSTS=0x1 in dmesg is the fingerprint for a Phison E18 power state failure -- check this first when FireCuda 530s misbehave at boot.
  • NVMe power management has three layers -- APST (controller), ASPM (PCIe link), and D3cold (device power). Fixing one layer may not be enough. On the Phison E18, all three need to be addressed.
  • Warm reboot vs cold boot is the diagnostic clue. If cold boot always works but warm reboot intermittently fails, the issue is in the power state transition path (D3cold to D0), not in the drive itself.
  • nvme_core.default_ps_max_latency_us=0 disables APST immediately with no firmware change needed -- use it as a first mitigation.
  • pcie_aspm=off disables OS-managed PCIe link power states. Combine with the D3cold udev rule for complete coverage.
  • bootdegraded=true in GRUB prevents a degraded array from making the server unbootable without physical access.
  • nofail in fstab for non-essential mounts (/boot, /boot/efi) prevents cascading boot failures when a drive is slow or absent.
  • --action=1 (deferred activate) is the correct firmware commit strategy on a live system -- stage it, then do a planned power cycle.
  • Write-intent bitmaps turn hour-long full resyncs into second-long partial ones -- add them to all arrays, not just the root volume.
  • --re-add vs --add -- always try --re-add first on a rejoining member. It uses the bitmap. --add ignores it.
