In the field of artificial intelligence chips, Nvidia's dominance has seemed unchallenged. However, Meta's latest move may be quietly changing the landscape. Recently, it was reported that Meta is testing its first AI training chip based on the RISC-V architecture, a move that not only marks a further step in Meta's chip efforts but could also introduce new variables into the industry.
From Inference to Training: Meta's Chip Ambitions
Meta's exploration of the RISC-V architecture is not a sudden move. Several years ago, Meta developed RISC-V-based chips, mainly for AI inference tasks, aiming to cut costs and reduce its dependence on NVIDIA. Now, Meta has gone a step further and, with Broadcom's help, designed an in-house AI training accelerator. With this chip, Meta hopes to lessen its reliance on NVIDIA's high-end AI GPUs, such as the H100/H200 and B100/B200, which currently dominate the training of advanced large language models.
Chip testing in progress: a double test of performance and power consumption
According to Reuters, Meta, in partnership with Broadcom and TSMC, has completed the tape-out of the AI training accelerator and manufactured the first usable chip samples. Meta has begun a small-scale deployment to evaluate the chip's performance and power consumption. While no specific benchmark results have been announced, the chip is already running real workloads.
By design, the chip is likely to use a systolic array structure: identical processing elements (PEs) arranged in rows and columns, each responsible for a portion of the matrix or vector computations, with data flowing sequentially through the array. To cope with the massive data requirements of AI training, the chip may be equipped with HBM3 or HBM3E memory. In addition, as a custom processor, it lets Meta define the data formats and instructions it supports, optimizing the chip's size, power consumption, and performance. On performance, the chip will need to compete with NVIDIA's latest AI GPUs, such as the H200, B200, and even the next-generation B300.
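The details of Meta's design are not public, so as a generic illustration only: in an output-stationary systolic array, each PE(i, j) holds a running sum for one output element C[i, j], while rows of A stream in from the left and columns of B from the top, skewed so that A[i, k] and B[k, j] meet at PE(i, j) on cycle i + j + k. A minimal cycle-by-cycle sketch of that dataflow (all names here are illustrative, not from Meta's chip):

```python
import numpy as np

def systolic_matmul(A, B):
    """Sketch of an output-stationary systolic array computing C = A @ B.

    PE(i, j) accumulates C[i, j]. Inputs are skewed so the operand pair
    (A[i, k], B[k, j]) arrives at PE(i, j) exactly on cycle i + j + k.
    """
    n, m = A.shape
    m2, p = B.shape
    assert m == m2, "inner dimensions must match"
    C = np.zeros((n, p), dtype=A.dtype)
    # The last operands (i=n-1, j=p-1, k=m-1) meet on cycle n+p+m-3.
    for cycle in range(n + p + m - 2):
        for i in range(n):
            for j in range(p):
                k = cycle - i - j  # operand index reaching PE(i, j) this cycle
                if 0 <= k < m:
                    C[i, j] += A[i, k] * B[k, j]
    return C
```

Each multiply-accumulate is local to one PE and data moves only between neighbors, which is why this layout maps so well to the dense matrix products at the heart of AI training.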
Figure: Processor designed by Broadcom (Source: Meta)
The winding path of the MTIA program
The chip is the latest addition to Meta's Meta Training and Inference Accelerator (MTIA) initiative. However, the MTIA program has not been all smooth sailing. Previously, one of Meta's in-house inference processors failed to meet performance and power targets in a small-scale deployment test and was eventually discontinued. That setback prompted Meta to adjust its strategy in 2022 and purchase Nvidia's GPUs in large quantities to meet its AI processing needs.
Despite this, Meta has not given up on developing custom chips. Last year, Meta began using MTIA chips for inference tasks, and it plans to apply custom chips to AI training by 2026. If the chip meets its goals, Meta will gradually scale up its use, a key step toward more customized hardware for its data center operations.
The potential of the RISC-V architecture
It's worth mentioning that MTIA's inference accelerators use open-source RISC-V cores, which allow Meta to tailor the instruction set architecture to its needs without paying royalties to third parties. Whether MTIA's training accelerator is also based on the RISC-V instruction set architecture has not been confirmed, but it is highly likely. If true, Meta may end up building the highest-performing RISC-V-based chip in the industry.
Breaking the Monopoly: Challenges and Opportunities for Meta
At present, the AI chip market is highly competitive, and NVIDIA holds most of the market share thanks to its first-mover advantage and strong technology. If Meta's chip succeeds, it will not only reduce Meta's own costs but could also break the existing market structure and offer other companies a new path for AI chip research and development.