
High-Performance pNFS v4.2 Distributed Storage Architecture

A deep dive into building a clustered, high-availability parallel NFS storage system with load-balanced metadata servers, NVMe-backed storage nodes, and low-latency interconnects.

Architecture Overview

This architecture implements a production-grade parallel NFS (pNFS) v4.2 deployment designed for GPU compute clusters requiring high-throughput, low-latency storage with built-in redundancy and horizontal scalability.

Key Design Goals

  • Parallel I/O Performance: Direct client-to-storage data paths bypassing metadata bottlenecks
  • Metadata High Availability: Clustered MDS with automatic failover
  • Horizontal Scalability: Add storage nodes without downtime
  • Low Latency: InfiniBand/RoCE interconnects for single-digit-microsecond fabric latencies
  • Fault Tolerance: No single points of failure in the architecture

System Topology

sequenceDiagram
    participant Client as Client<br/>(pNFS v4.2)
    participant MDS as MDS Cluster<br/>(Active-Active via VIP)
    participant S1 as Storage Node 1<br/>(NVMe)
    participant S2 as Storage Node 2<br/>(NVMe)
    participant S3 as Storage Node 3<br/>(NVMe)

    Note over Client,S3: ━━━━━━━ PHASE 1: METADATA PATH ━━━━━━━
    Note over MDS: Virtual IP load balances to any MDS<br/>All MDS nodes share distributed state

    Client->>+MDS: LAYOUTGET(file_handle)
    Note right of MDS: MDS queries distributed<br/>backend for file layout
    MDS-->>-Client: LAYOUT(stripe_pattern, DS_list)
    Note left of Client: ✓ Client caches layout<br/>Stripe unit: 1MB<br/>Stripe count: 3 nodes

    Note over Client,S3: ━━━━━━━ PHASE 2: DATA PATH (MDS BYPASSED) ━━━━━━━

    par Parallel Direct I/O over InfiniBand/RoCE
        Client->>+S1: WRITE Stripe 0
        S1->>S1: NVMe I/O
        S1-->>-Client: ACK
    and
        Client->>+S2: WRITE Stripe 1
        S2->>S2: NVMe I/O
        S2-->>-Client: ACK
    and
        Client->>+S3: WRITE Stripe 2
        S3->>S3: NVMe I/O
        S3-->>-Client: ACK
    end

    Note over Client,S3: ⚡ Aggregate: 3 × 7 GB/s = 21 GB/s raw, ~20 GB/s effective throughput

Key Architecture Points:

| Layer | Component | Function |
| --- | --- | --- |
| Control Plane | MDS Cluster (Active-Active) | Virtual IP load-balances metadata requests; distributed backend (GFS2/OCFS2) shares state; co-located with storage nodes |
| Data Plane | Storage Nodes | Direct parallel I/O bypasses the MDS entirely; each node runs MDS service + Data Service (DS) + NVMe; high-speed fabric (InfiniBand or 100GbE RoCE) |
| Client | pNFS v4.2 | One-time layout fetch, then caches the stripe pattern; direct parallel writes to multiple storage nodes; no metadata bottleneck on the data path |

Architecture Advantage

Separation of Control and Data Planes: Client contacts MDS once to get file layout, then performs all subsequent I/O directly to storage nodes over high-speed network. MDS handles only metadata operations (LAYOUTGET, OPEN, CLOSE), while bulk data transfer happens in parallel across multiple storage nodes, eliminating the metadata server bottleneck.


Component Breakdown

1. Client Layer (pNFS v4.2 Clients)

Role: GPU compute nodes running pNFS-aware clients

Characteristics:
  • Protocol: NFSv4.2 with pNFS layout extensions
  • Parallelism: Multiple concurrent I/O streams to storage nodes
  • Two-phase operations:
    1. Metadata phase: Request file layout from MDS via VIP
    2. Data phase: Direct parallel I/O to multiple storage nodes

Advantages:
  • Metadata and data paths are separated
  • MDS only handles the control plane; the data plane scales independently
  • Clients cache layouts, reducing metadata round-trips


2. Metadata Virtual IP (VIP) / Load Balancer

Role: Distribute metadata requests across clustered MDS instances

Implementation Options:

| Technology | Use Case | Pros | Cons |
| --- | --- | --- | --- |
| Keepalived + VRRP | Simple HA | Easy setup, fast failover | Layer 3 only, single active |
| HAProxy | Advanced LB | Health checks, stats, multiple algorithms | Additional component |
| Pacemaker + Corosync | Enterprise HA | Full cluster manager | Complex configuration |

Configuration Considerations:
  • Failover time: Target <2 seconds for MDS failover
  • Session stickiness: Not required (stateless metadata operations)
  • Health checks: Monitor MDS service health on each node (see the keepalived sketch below)
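
As one concrete example of the Keepalived + VRRP row above, a minimal health-checked VIP definition could look like the following sketch; the interface name, VIP address, and check script are assumptions to adapt to your environment.

# Hedged keepalived sketch for the metadata VIP (interface, addresses, and script are assumptions)
cat > /etc/keepalived/keepalived.conf <<'EOF'
vrrp_script chk_nfsd {
    script "/usr/bin/pgrep -x nfsd"   # node is unhealthy if no nfsd threads are running
    interval 2
    fall 2
    rise 2
}

vrrp_instance MDS_VIP {
    state BACKUP
    interface ens1f0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.10.1.100/24
    }
    track_script {
        chk_nfsd
    }
}
EOF
systemctl restart keepalived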


3. MDS Cluster (Metadata Servers)

Role: Manage namespace, permissions, file layouts, and client coordination

Clustering Strategy:

Active-Active Clustering

All MDS instances are active simultaneously, sharing load via the VIP. This differs from traditional active-passive designs and requires:

  • Shared backend: Distributed consensus or shared storage for metadata
  • State synchronization: Real-time metadata replication
  • Lock coordination: Distributed locking for file operations

Backend Options:

Option 1: Shared Block Device (DRBD + GFS2/OCFS2)
  pros:
    - Battle-tested clustering
    - POSIX semantics
  cons:
    - Block-level sync overhead
    - Limited to 2-3 nodes typically

Option 2: Distributed Database (etcd/Consul)
  pros:
    - Raft consensus built-in
    - Horizontal scaling
    - Cloud-native
  cons:
    - Additional latency
    - More complex integration

Option 3: Lustre MGS/MDT (if using Lustre as pNFS backend)
  pros:
    - Native high availability
    - Proven at exascale
  cons:
    - Lustre-specific
    - Complex deployment

Heartbeat Mechanism:
  • Interval: 500ms - 1s between nodes
  • Quorum: Majority voting prevents split-brain
  • Fencing: STONITH (Shoot The Other Node In The Head) for failed nodes (example below)
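
If Pacemaker/Corosync is the chosen cluster manager, fencing and quorum policy can be wired up roughly as follows; the IPMI address and credentials are placeholders, not values from this deployment.

# Fence an MDS node over IPMI (address and credentials are placeholders)
pcs stonith create fence-mds1 fence_ipmilan \
    pcmk_host_list="mds1" ip="10.0.0.11" username="admin" password="changeme" lanplus=1

# Stop services when quorum is lost rather than risking split-brain writes
pcs property set no-quorum-policy=stop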


4. High-Speed Network Fabric

Role: Low-latency, high-bandwidth interconnect for storage traffic

Technology Comparison:

| Technology | Bandwidth | Latency | Use Case |
| --- | --- | --- | --- |
| InfiniBand HDR | 200 Gbps | <1 μs | HPC, AI training clusters |
| 100GbE RoCE v2 | 100 Gbps | <5 μs | Cost-effective alternative |
| Omni-Path | 100 Gbps | <1 μs | Intel ecosystem |

Network Design:

- Dedicated storage VLAN/subnet
- Jumbo frames (MTU 9000) for throughput
- RDMA for zero-copy transfers
- Lossless Ethernet (PFC) if using RoCE
- Multiple paths for redundancy (LACP/MLAG)
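
A quick way to sanity-check the jumbo-frame and RDMA items above; the interface name and peer address are assumptions.

# Set MTU 9000 on the storage interface (interface name is an assumption)
ip link set dev ens1f0 mtu 9000

# Verify the jumbo-frame path end to end (8972 = 9000 - 28 bytes of IP/ICMP headers)
ping -M do -s 8972 -c 3 10.10.1.11

# Confirm RDMA-capable devices are visible and ports are active
ibv_devinfo | grep -E 'hca_id|state'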


5. Storage Nodes

Role: Serve actual file data via pNFS Data Service (DS)

Node Architecture:

Each storage node runs:
├── MDS Service (part of cluster)
├── Data Service (DS) (serves pNFS I/O)
└── Physical Storage (NVMe SSDs)

NVMe Configuration:
  • Device: PCIe Gen4 NVMe SSDs (7000+ MB/s per device)
  • RAID: No RAID (rely on pNFS striping across nodes)
  • File System: XFS or ZFS for local storage
  • Tuning (applied in the sketch below):
    - nvme.io_timeout=4294967295 (disable timeout)
    - elevator=none (bypass I/O scheduler for NVMe)
    - vm.dirty_ratio=5 (aggressive writeback)
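
A sketch of applying these tunables on a single storage node; device names, mount points, and file paths are assumptions.

# Bypass the I/O scheduler for the NVMe device
echo none > /sys/block/nvme0n1/queue/scheduler

# Disable the NVMe I/O timeout (module parameter; nvme_core on recent kernels, nvme on older ones)
echo "options nvme_core io_timeout=4294967295" > /etc/modprobe.d/nvme-tuning.conf

# Start writeback earlier to smooth out large sequential writes
sysctl -w vm.dirty_ratio=5

# Local filesystem backing the data service export
mkfs.xfs -f /dev/nvme0n1
mkdir -p /storage/nvme0
mount /dev/nvme0n1 /storage/nvme0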

Capacity Planning:

Per-node capacity:
  - 4x 4TB NVMe = 16TB raw per node
  - 10 nodes = 160TB aggregate raw
  - No RAID overhead (redundancy via replication)
  - Effective capacity: ~140TB (accounting for metadata)


Data Flow: Read Operation

Phase 1: Layout Request (Metadata Path)

sequenceDiagram
    participant Client
    participant VIP
    participant MDS1
    participant Backend

    Client->>VIP: LAYOUTGET (file handle)
    VIP->>MDS1: Forward request
    MDS1->>Backend: Query file layout
    Backend-->>MDS1: Layout map
    MDS1-->>Client: LAYOUT (stripe pattern, DS list)
    Note over Client: Client caches layout

Layout Information Returned:

{
  "layout_type": "LAYOUT4_NFSV4_1_FILES",
  "stripe_unit": 1048576,
  "stripe_count": 4,
  "data_servers": [
    "10.10.1.11:2049",  // Storage Node 1
    "10.10.1.12:2049",  // Storage Node 2
    "10.10.1.13:2049",  // Storage Node 3
    "10.10.1.14:2049"   // Storage Node 4
  ]
}

Phase 2: Parallel Data I/O (Data Path)

graph LR
    Client -->|Stripe 0| DS1[Storage Node 1]
    Client -->|Stripe 1| DS2[Storage Node 2]
    Client -->|Stripe 2| DS3[Storage Node 3]
    Client -->|Stripe 3| DS4[Storage Node 4]

    DS1 --> NVMe1[NVMe SSD]
    DS2 --> NVMe2[NVMe SSD]
    DS3 --> NVMe3[NVMe SSD]
    DS4 --> NVMe4[NVMe SSD]

Throughput Calculation:

Single NVMe: 7 GB/s read
4-way stripe: 7 GB/s × 4 = 28 GB/s aggregate
Overhead (20%): ~22 GB/s effective client throughput

Key Advantage

The MDS is completely bypassed during data I/O. Only initial layout fetch requires MDS contact, then client directly streams data from multiple storage nodes in parallel.


Write Operation with Coherency

Challenges

  • Cache coherency: Multiple clients may access same file
  • Consistency: Must maintain POSIX semantics
  • Layout revocation: MDS may recall layouts during conflicts

Write Flow

sequenceDiagram
    participant Client1
    participant Client2
    participant MDS
    participant DS1

    Client1->>MDS: OPEN (file, WRITE)
    MDS-->>Client1: LAYOUT (read-write)
    Client1->>DS1: WRITE data

    Client2->>MDS: OPEN (same file, WRITE)
    MDS->>Client1: CB_LAYOUTRECALL
    Client1->>DS1: COMMIT writes
    Client1->>MDS: LAYOUTRETURN
    MDS-->>Client2: LAYOUT (read-write)

Layout Recall Scenarios:
  1. Write-write conflict: Second writer needs exclusive layout
  2. Read-write conflict: Writer needs to invalidate reader caches
  3. Layout change: File being migrated or restriped


High Availability Scenarios

Scenario 1: MDS Node Failure

Before:
  VIP → MDS1 (active)
      → MDS2 (active)
      → MDS3 (active)  ← fails

After (within 2 seconds):
  VIP → MDS1 (active)  ← absorbs load
      → MDS2 (active)  ← absorbs load

  MDS3: Fenced by cluster, removed from VIP pool
  Client layouts: Still valid, no client disruption

Recovery Actions:
  • Quorum maintained (2/3 nodes)
  • Clients continue data I/O unaffected
  • New metadata requests distributed to healthy MDS nodes
  • Failed MDS auto-rejoins after recovery

Scenario 2: Storage Node Failure

pNFS File with 4-way striping across nodes 1-4:
  Node 3 fails → Stripe 2 (stored on node 3) unavailable

Client behavior:
  1. Client detects I/O error on stripe 2
  2. Client returns partial read/write to application
  3. Application must handle EIO (or use replication)

Recovery:
  - Option A: File replicated (pNFS server-side replication)
             → Automatic failover to replica stripe
  - Option B: No replication → Data loss for affected stripes

Data Durability

pNFS itself does NOT provide redundancy. You must implement:

  • Server-side replication (e.g., Lustre OST pools)
  • Client-side RAID (mdadm over pNFS)
  • Application-level erasure coding
  • Regular snapshots/backups
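
As a rough illustration of the client-side RAID option above, mirroring can be layered over two independent pNFS mounts; the mount points, image size, and loop devices are assumptions, and this approach doubles client-side write traffic.

# Back each mirror leg with a file on a different pNFS mount (paths and size are assumptions)
truncate -s 1T /mnt/pnfs-a/vol0.img /mnt/pnfs-b/vol0.img

# Expose the files as block devices
losetup /dev/loop0 /mnt/pnfs-a/vol0.img
losetup /dev/loop1 /mnt/pnfs-b/vol0.img

# Client-side RAID1 across the two pNFS-backed devices
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop0 /dev/loop1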

Scenario 3: Network Partition (Split-Brain Prevention)

Network partition splits cluster:
  Partition A: MDS1, MDS2 (2 nodes)
  Partition B: MDS3 (1 node)

Quorum voting:
  Partition A: 2/3 nodes = HAS QUORUM → continues operation
  Partition B: 1/3 nodes = NO QUORUM → enters read-only mode

Prevention:
  - Fencing agent (IPMI, PDU) forcibly powers off minority partition
  - Prevents conflicting writes to shared backend
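
To confirm quorum state during such an event (assuming a Corosync/Pacemaker stack):

# Show membership, expected votes, and whether this partition is quorate
corosync-quorumtool -s

# Cluster-wide view of node, fencing, and resource state
pcs status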

Performance Tuning

Client-Side Tunables

# /etc/nfsmount.conf or mount options
#   vers=4.2        NFSv4.2; pNFS layouts are negotiated automatically when the
#                   server offers them (there is no separate "pnfs" mount option)
#   rsize/wsize     1 MB read/write size
#   timeo=600       60 s timeout (timeo is in tenths of a second)
#   retrans=2       2 retransmissions before a major timeout
#   hard            Hard mount (retry forever instead of returning errors)
#   async           Asynchronous writes (the default)
#   ac              Attribute caching enabled
#   actimeo=3600    1-hour attribute cache
mount -t nfs4 \
  -o vers=4.2,rsize=1048576,wsize=1048576,timeo=600,retrans=2,hard,async,ac,actimeo=3600 \
  10.10.1.100:/export /mnt/pnfs

Server-Side Tunables

# NFS server threads (per node)
echo 256 > /proc/fs/nfsd/threads    # or: rpc.nfsd 256

# Network receive buffers
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728

# NVMe queue depth
echo 1024 > /sys/block/nvme0n1/queue/nr_requests

# Disable CPU frequency scaling (performance mode)
cpupower frequency-set -g performance
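
To keep these settings across reboots, one approach is shown below; the file paths follow common defaults, and the /etc/nfs.conf mechanism assumes nfs-utils 2.x.

# Persist the socket buffer sizes
cat > /etc/sysctl.d/90-pnfs.conf <<'EOF'
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
EOF
sysctl --system

# nfs-utils 2.x reads the nfsd thread count from /etc/nfs.conf at service start
cat >> /etc/nfs.conf <<'EOF'
[nfsd]
threads = 256
EOF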

Monitoring Metrics

Key metrics to track:
  - MDS operations/sec (LAYOUTGET, OPEN, CLOSE)
  - Data server throughput (GB/s per node)
  - Latency percentiles (p50, p95, p99)
  - Client cache hit rates
  - Network utilization (per fabric)
  - NVMe IOPS and latency

Implementation: Deployment Checklist

Phase 1: Network Setup

  • Deploy InfiniBand/RoCE fabric
  • Configure storage VLAN with jumbo frames
  • Enable RDMA on all nodes
  • Verify bandwidth with ib_write_bw / ib_read_bw
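
For the bandwidth check in the last item, a typical perftest invocation looks like this; the RDMA device name and peer address are assumptions.

# On one storage node (acts as the server)
ib_write_bw -d mlx5_0 -a -F

# On the peer, pointing at the server's fabric address
ib_write_bw -d mlx5_0 -a -F 10.10.1.11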

Phase 2: Storage Node Provisioning

  • Install NVMe SSDs and verify nvme list
  • Create XFS/ZFS filesystems
  • Apply NVMe performance tunings
  • Install nfs-kernel-server with pNFS support

Phase 3: MDS Cluster Setup

  • Choose clustering backend (DRBD, etcd, etc.)
  • Configure Pacemaker/Corosync or equivalent
  • Set up VIP with failover tests
  • Deploy metadata synchronization
  • Test quorum and fencing

Phase 4: pNFS Configuration

  • Configure pNFS layouts on each storage node
  • Export file systems via NFS4 with pNFS enabled
  • Register data servers with MDS
  • Test layout distribution from clients
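
For the export step, a minimal /etc/exports entry might look like the following; the path and subnet are assumptions, and the pnfs export option only takes effect on kernels and filesystems with pNFS server support.

# /etc/exports — offer pNFS layouts to NFSv4.1+ clients on the storage subnet
/export  10.10.1.0/24(rw,sync,no_subtree_check,pnfs)

# Apply the export change without restarting nfsd
exportfs -ra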

Phase 5: Client Deployment

  • Mount pNFS export with optimized parameters
  • Verify parallel I/O with dd or fio
  • Test layout recall and coherency
  • Run application workload benchmarks
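
A quick first-pass check before running full benchmarks; the file and mount paths are assumptions, and fio (see the benchmark section) gives the real numbers.

# Write then read back 8 GiB with direct I/O, bypassing the client page cache
dd if=/dev/zero of=/mnt/pnfs/ddtest bs=1M count=8192 oflag=direct
dd if=/mnt/pnfs/ddtest of=/dev/null bs=1M iflag=direct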

Phase 6: Production Hardening

  • Set up monitoring (Prometheus + Grafana)
  • Configure alerting for node failures
  • Document failover procedures
  • Schedule regular disaster recovery drills
  • Implement backup strategy

Real-World Performance

Benchmark Environment

Hardware:
  - 10x storage nodes (Dell R750)
  - 4x 7.68TB NVMe per node (Samsung PM9A3)
  - 100GbE RoCE network
  - 2x AMD EPYC 7543 per node

Workload:
  - FIO sequential read (4MB block size)
  - 8 clients, 16 threads each
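
A per-client fio job approximating that workload; the directory, file size, and runtime are assumptions.

fio --name=seqread --directory=/mnt/pnfs --rw=read --bs=4M --size=32G \
    --numjobs=16 --iodepth=32 --ioengine=libaio --direct=1 \
    --time_based --runtime=120 --group_reporting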

Results

| Metric | Value | Notes |
| --- | --- | --- |
| Aggregate throughput | 82 GB/s | 10 nodes × ~8 GB/s each |
| Per-client throughput | 10.2 GB/s | 82 GB/s / 8 clients |
| Latency (p99) | 3.2 ms | Network + NVMe + pNFS overhead |
| MDS load | 2,300 ops/s | Only layout requests |
| CPU utilization | 35% avg | Plenty of headroom |

Key Takeaway

pNFS achieved near-linear scaling across 10 storage nodes. MDS remained under 10% CPU utilization, proving effective metadata/data path separation.


Troubleshooting Guide

Problem: Clients not using pNFS (falling back to standard NFS)

Symptoms:

# All I/O is going through the MDS node; no layout driver is active on the mount
grep 'pnfs=' /proc/self/mountstats   # shows "pnfs=not configured" instead of a layout driver name

Diagnosis:

# Check that the server is handing out layouts (layout op counters should be non-zero)
nfsstat -s | grep -i layout

# Check client kernel support
grep PNFS /boot/config-$(uname -r)  # Should show CONFIG_PNFS_FILE_LAYOUT=m

Solution:
  • Ensure the server exports with the pnfs option
  • Verify the client kernel has the nfs_layout_nfsv41_files module loaded (see the check below)
  • Check for layout request denials in /var/log/messages
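
Two quick client-side checks for the first two items; the module name is the standard Linux one.

# Confirm the NFSv4.1 file-layout driver is available and loaded on the client
modprobe nfs_layout_nfsv41_files
lsmod | grep nfs_layout

# Look for layout grant/recall errors on the client
dmesg | grep -i layout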


Problem: High MDS CPU usage

Symptoms:

# MDS nodes showing >80% CPU
top  # nfsd threads consuming CPU

Diagnosis:

# Check for excessive LAYOUTGET operations
nfsstat -s | grep -i layoutget

Possible Causes:
  • Clients not caching layouts (check actimeo)
  • Frequent layout recalls (check for conflicting access patterns)
  • Insufficient MDS threads (check the nfsd thread count)

Solution:

# Increase client attribute cache timeout
mount -o remount,actimeo=3600 /mnt/pnfs

# Add more MDS threads
echo 512 > /proc/fs/nfsd/threads    # or: rpc.nfsd 512


Problem: Uneven storage utilization

Symptoms:

# One storage node at 90%, others at 40%
df -h /storage/*

Diagnosis:

# Check file layout distribution
# (Requires pNFS-aware tooling or manual inspection)

Solution:
  • Re-stripe files: Use pNFS restripe tools if available
  • Balance new files: Adjust the MDS layout selection algorithm
  • Add/remove nodes: Trigger cluster rebalancing


Advanced Topics

1. Hierarchical Storage Management (HSM) with pNFS

Implement tiered storage by combining:
  • Hot tier: NVMe-backed pNFS for active data
  • Warm tier: SATA SSD pNFS for recent data
  • Cold tier: HDD-based object storage (S3) for archives

Layout policy:

# Tier identifiers for an illustrative placement policy
TIER_NVME, TIER_SSD, TIER_HDD = "nvme", "ssd", "hdd"

def select_storage_tier(file_metadata):
    if file_metadata.access_count > 100:   # frequently accessed → hot NVMe tier
        return TIER_NVME
    elif file_metadata.age_days < 30:      # recent but not hot → warm SSD tier
        return TIER_SSD
    else:                                  # cold data → HDD/object tier
        return TIER_HDD

2. Erasure Coding for Space Efficiency

Instead of replication (2x-3x overhead), use erasure coding:
  • Reed-Solomon (8+3): 1.375x overhead for 3-drive fault tolerance
  • RAID 6 equivalent: Stripe across pNFS with parity
  • Rebuild time: ~2 hours for 10TB per failed drive

3. Multi-Site pNFS Replication

For disaster recovery:

Site A (Primary):          Site B (DR):
  10 storage nodes    →      10 storage nodes
  Active MDS cluster  →      Standby MDS cluster

Async replication (rsync/DRBD async or Lustre HSM)
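
One low-tech way to run the asynchronous replication leg is a periodic rsync; the hostname and paths are assumptions, and DRBD async or Lustre HSM would replace this at larger scale.

# Incremental async replication of the export tree to the DR site
rsync -aHAX --delete --partial /mnt/pnfs/ siteb-gateway:/mnt/pnfs/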


Conclusion

This pNFS v4.2 architecture provides:

✅ High throughput: 80+ GB/s aggregate via parallel I/O
✅ Low latency: <5 ms p99 with InfiniBand/RoCE
✅ High availability: No single points of failure
✅ Horizontal scalability: Add nodes without downtime
✅ Operational simplicity: Standard NFS client compatibility

Trade-offs:
  • Complexity: More moving parts than traditional NAS
  • Data durability: Requires additional replication/erasure coding
  • Cost: High-speed network and NVMe SSDs increase CapEx

Ideal for:
  • AI/ML training clusters (GPU → storage throughput)
  • HPC workloads (parallel file access patterns)
  • Video rendering farms (large file streaming)
  • High-frequency trading (low-latency shared storage)



Tags: #pNFS #distributed-storage #NVMe #high-availability #load-balancing #metadata #clustering #InfiniBand #RoCE #parallel-io #file-systems #linux #performance-tuning #scalability

Category: Storage, Architecture


Have questions or running a similar setup? Open a discussion or reach out.