Broadcom Advocates Ethernet Over InfiniBand for AI Infrastructure Networking

August 29, 2024

During VMware by Broadcom’s Explore event, Broadcom’s emphasis on using Ethernet as the core technology for AI infrastructure networking took center stage. Broadcom asserts that Ethernet, rather than NVIDIA’s InfiniBand, is the superior technology for linking GPUs and other data center components crucial for running AI workloads. Broadcom’s CEO, Hock Tan, set the tone by advocating a future in which enterprises lean towards private cloud solutions, with a focus on private data and private AI. Ram Velaga, Senior Vice President and General Manager of Broadcom’s Core Switching Group, provided a deeper dive into the rationale behind this shift, emphasizing the critical role of networking in distributed computing, particularly for machine learning (ML) workloads.

Private Cloud and AI: The Future of Enterprise Operations

Broadcom is pushing for private cloud adoption, viewing it as vital for the future of enterprise-level operations. This includes running private AI on private data within these infrastructure frameworks. As data privacy concerns continue to grow, the emphasis on securing sensitive enterprise data becomes more pronounced. Moving to private cloud solutions not only promises heightened security but also offers organizations more control over their data handling and processing. According to Broadcom, this control becomes paramount as companies increasingly rely on AI to drive operational decisions and strategies.

Private cloud and AI signify a paradigm shift where enterprises can leverage their data in a more controlled environment. This setup ensures data confidentiality while taking advantage of AI technologies to enhance business processes. By embracing private cloud infrastructure, companies can safeguard their data from external threats while optimizing their AI workloads. This approach promises better data integrity, compliance with regulations, and the agility to adapt to changing business requirements.

Furthermore, private AI allows firms to develop and deploy machine learning models using their proprietary datasets without the risk of exposing sensitive information to public cloud environments. This capability is especially critical for industries where data privacy is a top priority, such as healthcare, finance, and government sectors. Broadcom’s focus on private cloud and AI demonstrates a commitment to providing secure, efficient, and scalable solutions tailored to the evolving needs of modern enterprises.

Networking for AI Workloads: A Distributed Computing Challenge

One of the core arguments presented is the challenge associated with machine learning workloads compared to traditional cloud computing. Velaga noted that, unlike CPU-driven cloud computing, which focuses on maximizing CPU utilization, ML workloads are more complex, often requiring the seamless networking of multiple GPUs across possibly vast physical distances. This inherently transforms the task into a distributed computing challenge. The complexity of ML workloads necessitates robust networking solutions to facilitate efficient data throughput and minimize latency.

This distributed nature of AI computing means that networking infrastructure must be capable of handling vast amounts of data being transferred swiftly between various computational nodes. Effective networking is crucial to ensuring that machine learning models can be trained and deployed efficiently. The ability to interconnect GPUs with minimal lag and maximal data fidelity is fundamental to the success of AI applications. Without a reliable and high-performance network, the overall efficiency and performance of AI computations can be severely compromised.
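To make "interconnecting GPUs" concrete, consider ring all-reduce, a collective operation commonly used to synchronize gradients across accelerators during distributed training (this example is an illustration of the general technique, not something presented at the event). Nearly the entire gradient crosses the network on every training step, which is why fabric bandwidth and latency directly bound training throughput. A minimal pure-Python simulation of the algorithm:

```python
def ring_all_reduce(vectors):
    """Simulate ring all-reduce over n nodes. Each node starts with one
    vector; afterwards every node holds the element-wise sum of all
    vectors. Each node sends 2*(n-1)/n of the vector size over the wire,
    so the fabric carries almost the full gradient every step."""
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "vector length must divide evenly into n chunks"
    chunk = size // n
    data = [list(v) for v in vectors]  # working copy, one buffer per node

    # Phase 1: reduce-scatter. In step s, node i sends chunk (i - s) % n
    # to its ring neighbor, which accumulates it. All payloads in a step
    # are captured first to model the synchronous exchange.
    for s in range(n - 1):
        sends = [(i, (i - s) % n) for i in range(n)]
        payloads = [data[i][c * chunk:(c + 1) * chunk] for i, c in sends]
        for (i, c), vals in zip(sends, payloads):
            dst = data[(i + 1) % n]
            for k in range(chunk):
                dst[c * chunk + k] += vals[k]

    # Phase 2: all-gather. Node i now owns the fully reduced chunk
    # (i + 1) % n and circulates complete chunks around the ring.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n) for i in range(n)]
        payloads = [data[i][c * chunk:(c + 1) * chunk] for i, c in sends]
        for (i, c), vals in zip(sends, payloads):
            data[(i + 1) % n][c * chunk:(c + 1) * chunk] = vals

    return data
```

Because every one of the 2·(n−1) steps is a network exchange, and the next step cannot begin until the slowest link finishes, a single congested or lossy hop stalls every GPU in the ring.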

Moreover, the distributed computing challenge extends beyond mere connectivity. It encompasses ensuring synchronized operations across different nodes, managing data dependencies, and maintaining system stability under high loads. The right networking framework can significantly impact the scalability and reliability of AI deployments. In this context, Ethernet’s wide acceptance and matured technologies present a compelling case. By offering a seamless and robust networking fabric, Ethernet empowers enterprises to harness the full potential of their AI infrastructure, driving innovation and efficiencies across various industries.

Ethernet vs. InfiniBand: Analyzing the Key Differences

Velaga advocates strongly in favor of Ethernet as the networking solution over NVIDIA’s InfiniBand. While InfiniBand is marketed as well-suited for complex workloads demanding ultra-fast processing, Velaga argues that it is costlier, more fragile, and fundamentally based on the flawed assumption of a lossless physical network. Cost becomes a critical consideration for enterprises looking to scale AI operations without escalating infrastructure expenses. InfiniBand’s fragility, as highlighted by Velaga, means that it may not be as reliable or scalable under real-world conditions where network stability can fluctuate.

This potential for instability makes Ethernet’s robustness an attractive alternative. InfiniBand’s assumption of a lossless physical network does not always hold true in practical deployments. Ethernet, by contrast, is designed to tolerate loss: mature congestion-control and retransmission mechanisms, combined with broad compatibility, make it a more resilient foundation for AI workloads. Ethernet’s reliability and proven track record across diverse network environments make it a trustworthy choice for enterprises seeking stable and efficient AI operations.

Furthermore, Ethernet’s widespread adoption and standardization across the industry mean that enterprises can leverage a vast ecosystem of tools and technologies to support their networking needs. This ubiquity translates to easier implementation, reduced training requirements, and more readily available expertise. In contrast, the niche nature of InfiniBand can pose challenges in sourcing compatible equipment, troubleshooting issues, and maintaining the overall network. Velaga’s emphasis on these practical considerations underscores the importance of choosing a networking solution that offers both performance and ease of use, paving the way for more seamless AI integration and operation.

Advantages of Ethernet: Cost, Standardization, and Reliability

Broadcom cites several reasons why Ethernet is seen as the superior choice. Firstly, Ethernet’s ubiquity is a major advantage. It is widely deployed and well understood by industry professionals, facilitating easier integration into existing networks. This means that enterprises do not need to invest in specialized training or equipment, reducing the overall cost and complexity of implementation. Secondly, Ethernet is based on open standards, ensuring broad compatibility and future-proofing. This openness promotes widespread industry adoption and collaborative innovation.

Cost-effectiveness is another critical factor in favor of Ethernet. Ethernet offers higher performance for AI fabrics at a lower cost, making it an economical choice for widespread deployment. This cost advantage enables organizations to scale their AI infrastructure without facing prohibitive expenses. Additionally, Ethernet’s consistency allows it to be uniformly applied across various network types, from front-end to back-end, storage, and management networks. This uniformity simplifies network management and maintenance, further reducing operational costs and complexities.

Reliability and ease of use are also significant advantages of Ethernet. It is noted for its high availability, reliability, and user-friendliness. Enterprises can rely on its consistent performance over time, ensuring that their AI operations run smoothly without frequent interruptions or the need for constant troubleshooting. Finally, Ethernet benefits from an extensive ecosystem, covering everything from silicon to hardware, software, automation, monitoring, and debugging tools. This comprehensive support ensures that enterprises have access to a wide range of resources and tools, making it easier to deploy, manage, and optimize their AI networks.

These advantages collectively position Ethernet as a robust, cost-effective, and scalable solution for AI infrastructure networking. By leveraging Ethernet, enterprises can achieve greater efficiency, reliability, and scalability in their AI operations, driving innovation and competitive advantage.

Innovation and Competition in the Ethernet Space

Because Ethernet is built on open standards, no single vendor controls its roadmap: a broad ecosystem of silicon, hardware, and software suppliers competes to push performance up and costs down, in contrast to InfiniBand’s single-vendor model. Closing out the discussion at Explore, Velaga reiterated that efficient networking is essential for the scalability and performance of ML workloads, underscoring why Broadcom sees this open, competitive Ethernet ecosystem as the optimal foundation for AI infrastructure, and why, together with Hock Tan’s case for private cloud and private AI, the company is betting on Ethernet rather than InfiniBand for these advanced computing environments.
