LLM Performance on Consumer Hardware
LLMs are huge, slow to run, and need terabytes of RAM on the newest graphics cards, right? Well, maybe not. In this article we compare the performance of two popular current LLM families, llama 3.x and deepseek-r1, on a range of consumer hardware from laptop GPUs to dual Nvidia 5090s. We test quantized and full-precision models and show which ones fit into the memory of your graphics card. We also test Apple's M-series chips and explore what can be achieved with older but cheaper server hardware.
Test Setup
We test two of the currently popular LLM families, llama 3.x and deepseek-r1, in quantized and 16-bit floating-point (fp16) versions for their performance on consumer hardware.
We run the models with ollama 0.6.5 and open-webui as the frontend. We measure the response tokens per second after executing the following query:
I need a summary of the book "War and Peace". Please write at least 500 words.
For each test, the model fits completely into the memory of the graphics card (VRAM); combinations where it does not are skipped. Mobile devices are allowed to cool down before each test run.
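For reference, a run like this can also be timed directly against ollama's HTTP API. The sketch below is a minimal illustration, assuming a default installation listening on localhost:11434 and using the eval_count and eval_duration fields returned by the /api/generate endpoint; it is not necessarily the exact way the numbers in this article were collected.

```python
import requests

PROMPT = 'I need a summary of the book "War and Peace". Please write at least 500 words.'

def measure_tokens_per_second(model: str, host: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation and return response tokens per second."""
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=1800,  # large models on slow cards can take a while
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = number of generated response tokens,
    # eval_duration = time spent generating them, in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    print(f"llama3.2:3b: {measure_tokens_per_second('llama3.2:3b'):.1f} t/s")
```

If you prefer not to script it, running ollama on the command line with the --verbose flag should print a comparable eval rate after each response.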
Tested Models
In this article we focus on two of the currently popular LLMs:
- Llama 3, developed by Meta, represents the latest evolution in open-source large language models, building upon its predecessors with enhanced capabilities, improved efficiency, and greater contextual understanding. It comes in various sizes, typically ranging from smaller, more efficient models suitable for edge computing and personal devices, to larger, more powerful variants designed for complex tasks and extensive data processing. Its versatility makes it ideal for applications such as chatbots, content generation, coding assistance, and research. For the AI community, Llama 3 signifies a significant step toward democratizing advanced AI technology, fostering innovation, and enabling broader access to sophisticated AI tools and research opportunities.
- DeepSeek-R1, developed by DeepSeek AI, is a reasoning-focused large language model trained to produce an explicit chain of thought before answering. Besides the full model, DeepSeek has distilled its reasoning behaviour into smaller Qwen- and Llama-based variants, which are the versions available through ollama and tested here. Its strengths lie in math, logic, coding, and other technical tasks where step-by-step reasoning helps. For the AI community, DeepSeek-R1 represents a significant advancement in open reasoning models and contributes to the broader adoption of locally run AI solutions.
Model | Variant | Precision | VRAM Size |
---|---|---|---|
llama3.2:1b | instruct | Q8_0 | 2.7GB |
llama3.2:3b | instruct | Q4_K_M | 4GB |
llama3.1:8b | instruct | Q4_K_M | 6.9GB |
llama3.3:70b | instruct | Q4_K_M | 49GB |
llama3.2:1b | instruct | fp16 | 3.9GB |
llama3.2:3b | instruct | fp16 | 8.5GB |
llama3.1:8b | instruct | fp16 | 17GB |
deepseek-r1:7b | qwen distill | Q4_K_M | 6GB |
deepseek-r1:8b | llama distill | Q4_K_M | 6.9GB |
deepseek-r1:14b | qwen distill | Q4_K_M | 11GB |
deepseek-r1:32b | qwen distill | Q4_K_M | 25GB |
deepseek-r1:70b | qwen distill | Q4_K_M | 49GB |
deepseek-r1:1.5b | qwen distill | fp16 | 4.2GB |
deepseek-r1:7b | qwen distill | fp16 | 16GB |
deepseek-r1:14b | qwen distill | fp16 | 32GB |
Table 1 shows the tested LLMs. Ollama uses quantized models by default; if you specify a more detailed tag, you can choose different quantizations as well as the non-quantized floating-point versions. We select llama 3.x and deepseek-r1 for their popularity and start with the smallest versions parameter-wise, then step up to higher-parameter models until they no longer fit into the VRAM of any test system.
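As a rough sanity check before pulling a model, the weight footprint can be estimated from the parameter count and the effective bits per weight of the chosen format. The sketch below is only a rule of thumb; the bit widths are approximate values we assume for these formats, and the loaded sizes in Table 1 are larger because they also include the context (KV cache) and runtime buffers.

```python
# Back-of-the-envelope estimate of a model's weight footprint.
# The bit widths below are approximate effective bits per weight that we
# assume for the respective formats (including quantization metadata).
BITS_PER_WEIGHT = {"fp16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}

def weight_footprint_gb(params_billion: float, precision: str) -> float:
    """Approximate size of the model weights alone, in GB."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[precision] / 8 / 1e9

# Example: an 8b model is roughly 16 GB of weights in fp16 (Table 1: 17GB loaded)
# and roughly 5 GB in Q4_K_M (Table 1: 6.9GB loaded).
print(f"{weight_footprint_gb(8, 'fp16'):.1f} GB, {weight_footprint_gb(8, 'Q4_K_M'):.1f} GB")
```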
The last column in the table shows the size of the model when loaded into VRAM as reported by ollama ps. We can already see here that not all models will fit into the VRAM of the test systems from Table 2.
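The same information is available programmatically: ollama exposes a /api/ps endpoint that reports, per loaded model, its total size and the portion held in GPU memory. The sketch below, again assuming a default installation on localhost:11434, prints whether a loaded model sits fully on the GPU or is partially offloaded to system RAM.

```python
import requests

def report_vram_usage(host: str = "http://localhost:11434") -> None:
    """Print, for every currently loaded model, how much of it sits in GPU memory."""
    models = requests.get(f"{host}/api/ps", timeout=10).json().get("models", [])
    for m in models:
        total = m.get("size", 0)         # total bytes of the loaded model
        in_vram = m.get("size_vram", 0)  # bytes of the model held in GPU memory
        status = "fully on GPU" if in_vram >= total else "partially offloaded to system RAM"
        print(f"{m['name']}: {total / 1e9:.1f} GB loaded, "
              f"{in_vram / 1e9:.1f} GB in VRAM ({status})")

report_vram_usage()
```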
Tested Systems
System | VRAM | Architecture | Chip |
---|---|---|---|
2x Nvidia 5090 | 64GB | Blackwell | GB202 |
Nvidia 5090 | 32GB | Blackwell | GB202 |
Nvidia 4090 | 24GB | Ada Lovelace | AD102 |
Apple M4 Max | 64GB | M4 | 40-core M4 Max |
Apple M3 Max | 64GB | M3 | 40-core M3 Max |
Apple M1 Pro | 16GB | M1 | 16-core M1 Pro |
Nvidia 4070 Mobile | 8GB | Ada Lovelace | AD106 |
Nvidia 2080 Ti | 11GB | Turing | TU102 |
2x Nvidia P40 | 48GB | Pascal | GP102 |
Nvidia 3060 Mobile | 6GB | Ampere | GA106 |
Table 2 shows the test systems available to us, including their respective architectures and VRAM sizes. The Apple systems as well as the 4070 and 3060 systems are laptops; the rest are desktop PCs.
Results
We test the two LLM families, llama 3.x and deepseek-r1, in quantized and fp16 versions on the consumer hardware listed in Table 2.
Llama 3
Here we test llama 3.x performance across our test systems, first in quantized form and then at the full precision available on Hugging Face.
As shown in Chart 1, all test systems provide usable performance for the smaller llama 3.2 models. Even the weakest system, the Nvidia 3060 Mobile, delivers 60 response tokens per second (t/s) on llama3.2:3b. Larger models will not fit into every test system's VRAM. The slowest test system that can run llama3.1:8b is the MacBook Pro with the M1 Pro and its 16 graphics cores. It manages 26 t/s, which already feels sluggish: at that rate, a 500-word answer (on the order of 700 tokens) takes roughly half a minute to finish.
Only four of our test systems are able to run the largest model tested, llama3.3:70b, which requires 49GB of VRAM. Even a single Nvidia 4090 or 5090 cannot run this model. The dual 5090 setup of course can, as can the MacBooks with M3 Max and M4 Max thanks to their unified memory, and, interestingly, our dual P40 setup.
The dual P40 setup is an outlier: the P40 is a server graphics card from 2016 that does not even have a video output. The chip is closely related to that of an Nvidia Titan, but each card sports 24GB of VRAM. It might be a budget solution for someone who just wants to run large models and can wait for the results, because the performance is poor. On smaller models, the dual cards are even beaten by the 3060 Mobile chip.
We did not expect the P40s to be able to run llama3.3:70b at all, as it requires 49GB of VRAM and the two cards combined only offer 48GB. It works nonetheless, and ollama reports the model running 100% on the GPU.
Interestingly, the dual 5090 setup sometimes delivers slightly less performance than the single card. So not only does ollama fail to scale performance across two cards, the multi-GPU overhead also seems to cost a few tokens per second. The dual-card setup can, however, run the largest tested model, llama3.3:70b.
Also noteworthy: the 3060 Mobile and 4070 Mobile are not far apart performance-wise; not the upgrade the name suggests.
Moving on to the full-precision (fp16) versions of llama 3.x, even fewer systems can run the larger models. Chart 2 shows the remaining possible combinations. Performance also takes a hit: across all systems we now get only a little more than half the response tokens per second compared to the quantized versions.
Here there is no edge case where the dual P40s can run a particularly large model that other systems cannot, as the next model size up is too large for any of the tested systems. Again, the dual 5090 setup is slightly slower than the single card.
DeepSeek-R1
Next we test deepseek-r1 performance across our test systems, first in quantized form and then at the full precision available on Hugging Face. Since deepseek-r1 has no 1b variant, we choose the 1.5b model for the fp16 tests.
At first it looks as if the deepseek-r1 performance on our test systems, shown in Chart 3, is significantly worse, by about 50%, than in the llama tests. But comparing same-sized models such as llama3.1:8b and deepseek-r1:7b, the numbers are actually very close. Of course that does not make it any more usable. And the smallest tested model, deepseek-r1:7b, already will not fit into the VRAM of the 3060 Mobile, at least if a GUI is still running alongside it.
However, the dual P40 setup excels again: it can run deepseek-r1:70b from VRAM, albeit slowly. This is something not even a single Nvidia 4090 or 5090 can do. It shows that, at some point, a single user is better served by more VRAM than by a faster card, because the extra speed brings less benefit than the ability to run larger models at all.
For the fp16 variant of deepseek-r1 we start with a smaller model: 1.5b-qwen-distill-fp16. This variant can run on the 3060 Mobile, and the results in Chart 4 show that its performance is again close to that of the 4070.
The larger variants require 32GB of VRAM, so if you do not have an M3 or M4 Max or a dual 5090 setup, you are out of luck, unless you settle for the performance level of the dual P40s.
Conclusion
Modern LLMs in their smaller, quantized variants deliver acceptable to good performance on consumer hardware. Running them with ollama and open-webui as a frontend will suffice for many day-to-day tasks, without a subscription to a paid service and with the peace of mind that your data stays on your system.
Even larger models will perform well on current hardware, but VRAM may be the restricting factor. If you can live with reduced performance but need to run larger models, older server hardware such as the Nvidia P40 or Quadro P6000 provides the necessary VRAM.