For the first time, a large language model (a key driver of recent AI hype and hope) has been added to MLPerf, a set of neural-network training benchmarks that have previously been called the Olympics of machine learning. Computers built around Nvidia's H100 GPU and Intel's Habana Gaudi2 chips were the first to be tested on how quickly they could perform a modified training of GPT-3, the large language model behind ChatGPT. A 3,584-GPU computer run as a collaboration between Nvidia and cloud provider CoreWeave performed this task in just under 11 minutes. The smallest entrant, a 256-Gaudi2 system, did it in a little over 7 hours. On a per-chip basis, H100 systems were 3.6 times as fast at the task as Gaudi2. However, the Gaudi2 computers were operating "with one hand tied behind their back," says Jordan Plawner, senior director of AI products at Intel, because a capability called mixed precision has not yet been enabled on the chips.

By one estimate, Nvidia and CoreWeave's 11-minute record-setting training time would scale up to about two days of full-scale training.

Computer scientists have found that for GPT-3's type of neural network, called a transformer network, training can be greatly accelerated by doing parts of the process using less-precise arithmetic. Versions of 8-bit floating-point numbers (FP8) can be used in certain layers of the network, while more precise 16-bit or 32-bit numbers are needed in others. Figuring out which layers are which is the key. Both H100 and Gaudi2 were built with mixed-precision hardware, but it has taken time for each company's engineers to discover the right layers and enable them. Nvidia's system in the H100 is called the transformer engine, and it was fully engaged for the GPT-3 results. Habana engineers will have Gaudi2's FP8 capabilities ready for GPT-3 training in September, says Plawner. At that point, he says, Gaudi2 will be "competitive" with H100, and he expects Gaudi2 to beat H100 on the combination of price and performance. Gaudi2, for what it's worth, is made using the same process technology (7 nanometers) as the H100's predecessor, the A100.

Large language models "and generative AI have fundamentally changed how AI is used in the market," says Dave Salvatore, Nvidia's director of AI benchmarking and cloud computing. But turning GPT-3 into a useful industry benchmark was no easy task. A complete training of the full 175-billion-parameter network with an entire training dataset could take weeks and cost millions of dollars, so finding a way to benchmark these behemoths was important. Instead of training on an entire dataset, participants trained on a representative portion. "We wanted to keep the runtime reasonable," says David Kanter, executive director of MLPerf's parent organization, MLCommons. "But this is still far and away the most computationally demanding of our benchmarks." Most of the benchmark networks in MLPerf can be run on a single processor, but GPT-3 takes 64 at a minimum, he says.
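The article doesn't include code, but the core idea behind mixed precision (that values in some layers survive being squeezed into 8 bits while others do not) can be sketched with a toy quantizer. The format below mimics the E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits) commonly used for FP8; the function name and the simplified clamping are illustrative only, not any vendor's actual hardware behavior:

```python
import math

def to_fp8_e4m3(x: float) -> float:
    """Round x to the nearest value representable in a toy FP8 E4M3-like
    format: 1 sign bit, 4 exponent bits, 3 mantissa bits. Subnormals,
    NaN/Inf, and exact saturation behavior are ignored for simplicity."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    m, e = math.frexp(abs(x))   # abs(x) == m * 2**e, with 0.5 <= m < 1
    e = max(min(e, 9), -5)      # clamp to roughly the E4M3 normal range
    m = round(m * 16) / 16      # keep 4 significant bits (1 implicit + 3 stored)
    return sign * math.ldexp(m, e)

# Quantizing a weight introduces a relative error of up to a few percent.
w = 0.1
w8 = to_fp8_e4m3(w)
rel_err = abs(w8 - w) / w
```

The rounding error here is bounded by about one part in 16, which (as a general matter of mixed-precision practice, not something the MLPerf results specify) the big matrix multiplications in a transformer often tolerate, while accumulation-heavy steps such as normalization typically still need 16- or 32-bit arithmetic.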