Decoding AI Performance: TOPS and Token Analysis on NVIDIA RTX PCs
The era of PC AI has arrived, powered by NVIDIA RTX and GeForce RTX technologies. This shift introduces a new way of evaluating performance for AI-accelerated tasks, along with metrics that can be difficult to decipher when choosing between desktops and laptops, according to the NVIDIA Blog.
Coming out on TOPS
The first baseline is TOPS, or trillions of operations per second. This metric is similar to an engine’s horsepower rating, with higher numbers indicating better performance. For example, Microsoft’s Copilot+ line of PCs includes neural processing units (NPUs) capable of running up to 40 TOPS, enough for light AI-assisted tasks. By contrast, NVIDIA RTX and GeForce RTX GPUs deliver far higher performance, with the GeForce RTX 4090 GPU offering over 1,300 TOPS, essential for demanding generative AI tasks such as AI-assisted digital content creation and querying of large language models (LLMs).
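As a rough illustration (not taken from the article), a TOPS figure can be understood as an operation count divided by runtime. The sketch below estimates effective TOPS from a matrix multiply; the runtime used is a hypothetical value chosen only to show the arithmetic.

```python
def effective_tops(ops: float, seconds: float) -> float:
    """Convert an operation count and a runtime into TOPS
    (trillions of operations per second)."""
    return ops / seconds / 1e12

# A matrix multiply of shape (M, K) x (K, N) costs roughly
# 2 * M * N * K operations (one multiply plus one add per term).
M = N = K = 4096
ops = 2 * M * N * K          # ~137 billion operations

# Hypothetical runtime for this matmul on 40-TOPS-class hardware:
print(effective_tops(ops, 3.4e-3))   # roughly 40 TOPS
```

The same operation count completed in a few hundred microseconds would correspond to the 1,000+ TOPS range cited for high-end RTX GPUs.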
Insert tokens to play
LLM performance is measured in the number of tokens generated by the model. Tokens can be words, punctuation marks, or whitespace, so AI performance is commonly quantified in “tokens per second”. Another crucial factor is batch size, the number of inputs processed simultaneously. Larger batch sizes improve throughput but require more memory. RTX GPUs excel in this area thanks to their large video random access memory (VRAM), Tensor Cores, and TensorRT-LLM software.
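A tokens-per-second figure is just total tokens generated divided by wall-clock time, measured while feeding the model inputs in batches. The sketch below shows the measurement logic with a stand-in `generate` callable; any real model wrapper with the same list-in, token-lists-out shape could be substituted (the stand-in and its interface are assumptions, not part of the article).

```python
import time

def tokens_per_second(generate, prompts, batch_size):
    """Measure aggregate throughput: total tokens produced per second
    across all prompts. `generate` takes a list of prompts and returns
    a list of token lists, one per prompt."""
    start = time.perf_counter()
    total_tokens = 0
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        outputs = generate(batch)            # one pass per batch
        total_tokens += sum(len(toks) for toks in outputs)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

# Stand-in model: emits 32 placeholder tokens per prompt.
fake_generate = lambda batch: [["tok"] * 32 for _ in batch]
rate = tokens_per_second(fake_generate, ["hello"] * 8, batch_size=4)
```

With a real model, raising `batch_size` typically raises this aggregate rate until VRAM or compute becomes the bottleneck, which is why the article pairs the two metrics.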
GeForce RTX GPUs offer up to 24GB of high-speed VRAM and NVIDIA RTX GPUs up to 48GB, enabling higher batch sizes and larger models. Tensor Cores, dedicated AI accelerators, significantly accelerate the operations required for deep learning and generative AI models. Applications using the NVIDIA TensorRT software development kit (SDK) can unlock maximum performance on more than 100 million Windows PCs and workstations equipped with RTX GPUs.
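To see why VRAM capacity gates model and batch size, a back-of-envelope estimate of weight memory is useful. This is a simplification I am adding for illustration: it counts only the weights (FP16, 2 bytes per parameter) and ignores activations and the KV cache, which grow with batch size and context length.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed just to hold model weights.
    FP16 uses 2 bytes per parameter; activations and the KV cache,
    which scale with batch size, are not counted here."""
    return n_params * bytes_per_param / 1e9

# A 7B-parameter model in FP16 needs about 14 GB, leaving headroom
# for batching on a 24 GB GeForce RTX card; a 33B model (~66 GB)
# would target the 48 GB workstation parts, quantization, or both.
print(weight_memory_gb(7e9))     # 14.0
print(weight_memory_gb(33e9))    # 66.0
```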
Text to image, faster than ever
Measuring image generation speed is another way to evaluate performance. Stable Diffusion, a popular image-generation AI model, allows users to convert text descriptions into complex visual representations. With RTX GPUs, these results can be generated faster than with CPUs or NPUs. Performance is further improved by using the TensorRT extension for the Automatic1111 interface, allowing RTX users to generate images from prompts up to 2x faster with the SDXL Base checkpoint.
ComfyUI, another popular Stable Diffusion interface, recently added TensorRT acceleration, allowing RTX users to generate images from prompts up to 60% faster and convert these images to videos up to 70% faster. The new UL Procyon AI Image Generation benchmark shows a 50% speed increase on a GeForce RTX 4080 SUPER GPU compared to the fastest non-TensorRT implementation.
TensorRT acceleration will soon be available for Stable Diffusion 3, Stability AI’s new text-to-image model, increasing performance by 50%. The TensorRT Model Optimizer accelerates it further, resulting in a 70% increase in speed and a 50% reduction in memory consumption.
The real proof of these advances is in real-world use cases. Users can refine image generation by modifying prompts significantly faster on RTX GPUs, taking seconds per iteration compared to minutes on other systems. This speed comes with the security of everything running locally on an RTX-powered PC or workstation.
The results are in, and open source
The AI researchers behind Jan.ai recently integrated TensorRT-LLM into their local chatbot app and benchmarked the optimizations. They found that TensorRT is “30-70% faster than llama.cpp on the same hardware” and more efficient across consecutive processing runs. The team’s methodology is open for others to measure generative AI performance themselves.
From games to generative AI, speed is key. TOPS, images per second, tokens per second, and batch size are all key metrics in determining performance.
Image source: Shutterstock