LLMs are huge, slow to run, and require terabytes of RAM on the newest graphics cards, right? Well, maybe not. In this article we compare the performance of two currently popular LLM families, llama 3.x and deepseek-r1, on a variety of consumer hardware, from laptop GPUs to dual Nvidia 5090s. We test quantized and full-precision models and show which ones fit into the memory of your graphics card. We also test Apple's M-series chips and explore what can be achieved with older but cheaper server hardware.

Test Setup

We test two of the currently popular LLM families, llama 3 and deepseek-r1, in quantized and 16-bit floating point (fp16) versions for their performance on consumer hardware.
We run the models using ollama 0.6.5 with open-webui as the frontend. We record the response tokens per second reported after executing the following query:

I need a summary of the book “War and Peace”. Please write at least 500 words.

For each test, the model has to fit completely into the memory of the graphics card (VRAM). Mobile devices are allowed to cool down before each test run.
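For reference, our throughput numbers can also be reproduced with a short script against ollama's REST API instead of open-webui. The following is a minimal sketch, assuming a local ollama instance on the default port 11434 and one of the model tags from Table 1; the eval_count and eval_duration fields of the /api/generate response contain the number of response tokens and the generation time in nanoseconds.

import requests

PROMPT = 'I need a summary of the book "War and Peace". Please write at least 500 words.'

# Assumption: ollama is running locally on its default port and the model tag has been pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": PROMPT, "stream": False},
    timeout=600,
)
resp.raise_for_status()
data = resp.json()

# eval_count = number of generated response tokens, eval_duration is given in nanoseconds
tokens_per_second = data["eval_count"] / data["eval_duration"] * 1e9
print(f'{data["eval_count"]} tokens at {tokens_per_second:.1f} t/s')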

Tested Models

In this article we focus on two of the currently popular LLMs:

  • Llama 3, developed by Meta, represents the latest evolution in open-source large language models, building upon its predecessors with enhanced capabilities, improved efficiency, and greater contextual understanding. It comes in various sizes, typically ranging from smaller, more efficient models suitable for edge computing and personal devices, to larger, more powerful variants designed for complex tasks and extensive data processing. Its versatility makes it ideal for applications such as chatbots, content generation, coding assistance, and research. For the AI community, Llama 3 signifies a significant step toward democratizing advanced AI technology, fostering innovation, and enabling broader access to sophisticated AI tools and research opportunities.
  • DeepSeek-R1, developed by DeepSeek AI, is a cutting-edge large language model designed to excel at reasoning-heavy tasks such as mathematics, step-by-step problem solving, and coding. It is published alongside several smaller variants distilled onto Qwen and Llama base models, making it versatile enough to run on a wide range of hardware. Its primary uses include code generation, debugging assistance, and detailed technical explanations that benefit from explicit reasoning. For the AI community, DeepSeek-R1 represents a significant advancement in openly available reasoning models, enhancing productivity in technical work and contributing to the broader adoption of locally run AI solutions.
Model             Variant        Precision  VRAM Size
llama3.2:1b       instruct       Q8_0       2.7 GB
llama3.2:3b       instruct       Q4_K_M     4 GB
llama3.1:8b       instruct       Q4_K_M     6.9 GB
llama3.3:70b      instruct       Q4_K_M     49 GB
llama3.2:1b       instruct       fp16       3.9 GB
llama3.2:3b       instruct       fp16       8.5 GB
llama3.1:8b       instruct       fp16       17 GB
deepseek-r1:7b    qwen distill   Q4_K_M     6 GB
deepseek-r1:8b    llama distill  Q4_K_M     6.9 GB
deepseek-r1:14b   qwen distill   Q4_K_M     11 GB
deepseek-r1:32b   qwen distill   Q4_K_M     25 GB
deepseek-r1:70b   llama distill  Q4_K_M     49 GB
deepseek-r1:1.5b  qwen distill   fp16       4.2 GB
deepseek-r1:7b    qwen distill   fp16       16 GB
deepseek-r1:14b   qwen distill   fp16       32 GB
Table 1: Tested LLMs

Table 1 shows the tested LLMs. Ollama uses quantized models by default; if you specify a more detailed tag, you can choose different quantizations as well as the non-quantized floating-point versions. We select llama 3.x and deepseek-r1 for their popularity, start with the smallest versions parameter-wise, and step up to larger models until they no longer fit into the VRAM of any test system.
The last column of the table shows the size of the model when loaded into VRAM, as reported by ollama ps. It is already apparent that not all models will fit into the VRAM of the test systems from Table 2.
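As a rough sanity check of these sizes: the weights alone take approximately parameters × bits-per-weight / 8 bytes, with fp16 at 16 bits and Q4_K_M averaging a little under 5 bits per weight. The sketch below uses assumed parameter counts and bit widths and ignores the KV cache, context buffers and runtime overhead, which is why ollama ps reports somewhat larger numbers.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    # parameters (in billions) * bits per weight / 8 bits per byte = gigabytes of weights
    return params_billion * bits_per_weight / 8

# Assumed parameter counts (~8.0B, ~70.6B) and an assumed ~4.85 bits/weight average for Q4_K_M.
for name, params, bits in [
    ("llama3.1:8b  Q4_K_M", 8.0, 4.85),   # ~4.9 GB of weights vs. 6.9 GB loaded
    ("llama3.1:8b  fp16",   8.0, 16.0),   # ~16 GB of weights vs. 17 GB loaded
    ("llama3.3:70b Q4_K_M", 70.6, 4.85),  # ~43 GB of weights vs. 49 GB loaded
]:
    print(f"{name:20s} ~{weight_gb(params, bits):5.1f} GB of weights")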

Tested Systems

System              VRAM   Architecture  Chip
2x Nvidia 5090      64 GB  Blackwell     GB202
Nvidia 5090         32 GB  Blackwell     GB202
Nvidia 4090         24 GB  Ada Lovelace  AD102
Apple M4 Max        64 GB  M4            40-core M4 Max
Apple M3 Max        64 GB  M3            40-core M3 Max
Apple M1 Pro        16 GB  M1            16-core M1 Pro
Nvidia 4070 Mobile  8 GB   Ada Lovelace  AD106
Nvidia 2080 Ti      12 GB  Turing        TU102
2x Nvidia P40       48 GB  Pascal        GP102
Nvidia 3060 Mobile  6 GB   Ampere        GA106
Table 2: Tested systems

Table 2 shows the test systems available to us, including their respective architectures and VRAM sizes. The Apple systems as well as the 4070 Mobile and 3060 Mobile systems are laptops; the rest are desktop PCs.

Results

We benchmark both model families, llama 3.x and deepseek-r1, in quantized and fp16 versions on the consumer hardware listed in Table 2.

Llama 3

Here we test llama 3.x performance across our test systems, first in quantized form and then in the full precision available on Hugging Face.

Chart 1: llama 3.x quantized performance on consumer hardware

As shown in Chart 1, all test systems provide usable performance for the smaller llama 3.2 models. Even the weakest system, the Nvidia 3060 Mobile, delivers 60 response tokens per second (t/s) on llama3.2:3b. The larger models, however, will not fit into every test system's VRAM. The slowest test system that can run llama3.1:8b is the MacBook Pro with the M1 Pro and its 16 GPU cores; it manages 26 t/s, which already feels sluggish.
Only four of our test systems are able to run the largest model tested, llama3.3:70b, which requires 49GB of VRAM. Even a single Nvidia 4090 or 5090 cannot run it. The dual 5090 setup can, the MacBooks with M3 Max and M4 Max can thanks to their unified memory, and, interestingly, so can our dual P40 setup.
The dual P40 setup is an outlier: the P40 is a server graphics card from 2016 that doesn’t even have a video output. Its chip is closely related to the Pascal-era Nvidia Titan, but each card sports 24GB of VRAM. It might be a budget solution for someone who just wants to run large models and can wait for the results, because the performance is poor: on smaller models the dual cards even get beaten by the 3060 Mobile chip.
We didn’t even expect the P40s to be able to run llama3.3:70b at all, as it requires 49GB of VRAM and the two cards combined only have 48GB. It works nonetheless, and ollama reports the model as running 100% on the GPU.
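Whether a model really runs fully from VRAM can also be checked programmatically. The sketch below assumes ollama's /api/ps endpoint (the same information ollama ps prints): if size_vram equals size, the model is fully offloaded to the GPU(s); anything less means part of it spills over to system RAM and the CPU.

import requests

ps = requests.get("http://localhost:11434/api/ps", timeout=10).json()
for m in ps.get("models", []):
    size, size_vram = m["size"], m["size_vram"]
    gpu_share = 100 * size_vram / size if size else 0
    # e.g. "llama3.3:70b: 49.0 of 49.0 GB in VRAM (100% GPU)"
    print(f'{m["name"]}: {size_vram / 1e9:.1f} of {size / 1e9:.1f} GB in VRAM ({gpu_share:.0f}% GPU)')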
Interestingly, the dual 5090 setup sometimes delivers slightly lower performance than the single card. So not only does ollama not scale performance across two cards, the multi-GPU overhead even costs a few tokens per second. The dual-card setup can, however, run the largest tested model, llama3.3:70b.
Also noteworthy: the 3060 Mobile and the 4070 Mobile are not far apart performance-wise; not the upgrade the name suggests.

Chart 2: llama 3.x fp16 performance on consumer hardware

Moving on to llama 3.x in the full precision in which it is distributed, even fewer systems can run the larger models. Chart 2 shows the remaining possible combinations. The performance also takes a hit: across all systems we now get only a little more than half the response tokens per second compared to the quantized versions.
Here there is no edge case where the dual P40s can run a particularly large model that other systems cannot, as the next model size up is too large for any of the tested systems. Again, the dual 5090 setup is slightly slower than the single card.

DeepSeek-R1

Next we test deepseek-r1 performance across our test systems, first in quantized form and then in the full precision available on Hugging Face. There is no 1b variant of deepseek-r1, so for fp16 we choose the 1.5b model instead.

Chart 3: deepseek-r1 quantized performance on consumer hardware

At first it looks as if the deepseek-r1 performance on our test systems shown in Chart 3 is significantly worse, by about 50%, than in the llama test. But comparing same-sized models such as llama3.1:8b and deepseek-r1:7b, the numbers are actually very close. That alone does not make every combination usable, though: even the smallest quantized model, deepseek-r1:7b, will not fit into the VRAM of the 3060 Mobile, at least if you still have a GUI running alongside it.
However, the dual P40 setup can shine again: it runs deepseek-r1:70b entirely from VRAM, albeit slowly. That is something not even a single Nvidia 4090 or 5090 can do. It shows that, at some point, more VRAM is preferable to a faster card for a single user: the extra speed does not help as much as the ability to run larger models at all.

Chart 4: deepseek-r1 fp16 performance on consumer hardware

For the fp16 tests of deepseek-r1 we start with a smaller variant, deepseek-r1:1.5b-qwen-distill-fp16, which still runs on the 3060 Mobile. The results in Chart 4 show that its performance is again close to that of the 4070 Mobile.
The larger fp16 variants require up to 32GB of VRAM, so if you don’t have an M3 Max, an M4 Max, or a dual 5090 setup, you are out of luck, unless you settle for the performance level of the dual P40s.

Conclusion

Modern LLMs in their smaller, quantized variants show acceptable to good performance on consumer hardware. Using ollama to run them with open-webui as a frontend will suffice for a lot of day-to-day tasks, without the need for a subscription to a paid service and with the peace of mind that your data stays on your system.
Even larger models perform well on current hardware, but VRAM can be a limiting factor. If you can live with reduced performance but need to run larger models, older server hardware such as the Nvidia P40 or Quadro P6000 will provide the necessary VRAM.