This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making this a major barrier to scaling AI’s impact.
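
To put the cited MTTF figures in operational terms, the short Python sketch below converts them into expected interruptions per day. It assumes failures arrive at a steady average rate of one per MTTF interval, which is a simplification. The function name is illustrative and is not taken from any Clockwork.io or Meta tooling.

```python
HOURS_PER_DAY = 24.0


def failures_per_day(mttf_hours: float) -> float:
    """Expected interruptions per day, assuming one failure per MTTF on average."""
    if mttf_hours <= 0:
        raise ValueError("mttf_hours must be positive")
    return HOURS_PER_DAY / mttf_hours


# MTTF figures quoted in the text (cluster size in GPUs -> MTTF in hours).
CITED_MTTF_HOURS = {1024: 7.9, 16384: 1.8}

for gpus, mttf in CITED_MTTF_HOURS.items():
    print(f"{gpus:>6} GPUs: about {failures_per_day(mttf):.1f} failures per day")
```

Under that assumption, the cited figures correspond to roughly 3 interruptions per day at 1,024 GPUs and roughly 13 per day at 16,384 GPUs.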

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem by handling costly AI workload failures proactively, resolving them before the job stops or needs to restart. Vital for enterprises running large AI workloads and AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. For AI clouds, which can now service impacted GPUs while preserving the training run as planned, this translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
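
The three scenarios above can be summarized as a simple lookup from triggering event to migration type. The Python sketch below is illustrative only: the event labels and the `classify` helper are hypothetical names derived from the examples in the list, not part of any TorchPass API.

```python
from enum import Enum


class MigrationKind(Enum):
    UNPLANNED = "unplanned"
    PREEMPTIVE = "pre-emptive"
    PLANNED = "planned"


# Hypothetical event labels, taken from the examples in the list above.
EVENT_TO_KIND = {
    "kernel_crash": MigrationKind.UNPLANNED,
    "power_failure": MigrationKind.UNPLANNED,
    "gpu_fault": MigrationKind.UNPLANNED,
    "rising_temperature": MigrationKind.PREEMPTIVE,
    "ecc_memory_error": MigrationKind.PREEMPTIVE,
    "maintenance": MigrationKind.PLANNED,
    "patching": MigrationKind.PLANNED,
    "workload_rebalancing": MigrationKind.PLANNED,
}


def classify(event: str) -> MigrationKind:
    """Return the migration type for a known event label."""
    try:
        return EVENT_TO_KIND[event]
    except KeyError:
        raise ValueError(f"unknown event: {event!r}") from None


print(classify("ecc_memory_error").value)
```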

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
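
As a rough consistency check, the figures quoted in this release can be combined in a few lines of Python. The sketch assumes one failure per MTTF interval on average (7.9 hours at 1,024 GPUs) and, as an upper bound, that each recovery costs the full three minutes of progress. The function and variable names are illustrative.

```python
MTTF_HOURS_1024_GPUS = 7.9         # cited MTTF for a 1,024-GPU cluster
BASELINE_LOST_MIN_PER_DAY = 180.0  # "approximately three hours per day"
RECOVERY_MIN_PER_FAILURE = 3.0     # "approximately three minutes" per recovery


def daily_loss_summary() -> dict:
    """Combine the quoted figures into per-day and per-failure estimates."""
    daily_failures = 24.0 / MTTF_HOURS_1024_GPUS
    lost_with_migration = daily_failures * RECOVERY_MIN_PER_FAILURE
    return {
        "failures_per_day": daily_failures,
        # Upper bound: assumes every recovery loses the full three minutes.
        "lost_min_per_day_with_migration": lost_with_migration,
        # Implied average cost of one checkpoint restart in the baseline case.
        "baseline_min_per_failure": BASELINE_LOST_MIN_PER_DAY / daily_failures,
        "reduction": 1.0 - lost_with_migration / BASELINE_LOST_MIN_PER_DAY,
    }


for name, value in daily_loss_summary().items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, the result is about nine minutes lost per day with migration, an implied cost of roughly an hour per checkpoint restart, and a reduction of about 95%, in line with the figures stated above.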

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

“When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

“Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out-of-memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork


Information contained on this page is provided by an independent third-party content provider. XPRMedia and this Site make no warranties or representations in connection therewith. If you are affiliated with this page and would like it removed please contact pressreleases@xpr.media
