Qwen3 and Gemma3 Performance on Consumer Hardware
Large Language Models (LLMs), often simply called AI, are widely adopted by now. Most of us use them daily for work, leisure, and private matters. That also means we share a lot of potentially sensitive information with the LLM. This can be problematic when using third-party services like ChatGPT, especially now that OpenAI has to retain all chat logs indefinitely. So how about running an LLM on your own hardware?
We previously measured two popular current-generation models, Llama 3 and DeepSeek-R1. This time we follow up with Qwen3 and Gemma3, both released in recent months. We want to give you an idea of which kind of model will run on what hardware and what performance you can expect.
Test Setup
We test two of the currently popular LLM families, Qwen3 and Gemma3, in quantized and 16-bit floating-point (fp16) versions to measure their performance on consumer hardware.
We run the models using ollama 0.9.2 with Open WebUI as the frontend. We collect the measured response tokens after executing the following query:
I need a summary of the book "War and Peace". Please write at least 500 words.
For each test, the model fits into the memory of the graphics card (VRAM). Mobile devices are allowed to cool down before each test run.
For Qwen3, the thinking mode was left at default, which is “on”.
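For readers who want to reproduce the measurement without Open WebUI, the same numbers can be read directly from ollama's REST API: the `/api/generate` endpoint returns `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds) with each non-streaming response. The following Python sketch is our own minimal illustration of that approach, not the exact harness used for the results in this article.

```python
import requests  # assumes the `requests` package is installed

PROMPT = 'I need a summary of the book "War and Peace". Please write at least 500 words.'

def measure_tokens_per_second(model: str, host: str = "http://localhost:11434") -> float:
    """Run one generation against a local ollama instance and return the response token rate."""
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # eval_count = number of generated tokens, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    print(f"{measure_tokens_per_second('qwen3:4b'):.1f} tokens/s")
```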
Tested Models
In this article we focus on two of the currently popular LLMs:
Qwen3, developed by Alibaba, is optimized for deployment on Alibaba Cloud infrastructure, leveraging GPUs and specialized AI accelerators. It excels in advanced conversational AI, content generation, multilingual interactions, and complex reasoning tasks. Typical applications include customer support automation, virtual assistants, translation, summarization, and sentiment analysis. Its scalable architecture ensures efficient integration into cloud-based business solutions, particularly within Alibaba’s ecosystem.
Gemma3, created by Google DeepMind, is designed for versatility and efficiency across diverse hardware platforms, including consumer-grade GPUs, CPUs, and edge devices. It effectively handles tasks such as text generation, summarization, conversational AI, and question-answering. Its lightweight, open-source architecture makes it ideal for resource-constrained environments, enabling applications like personal assistants, educational tools, and interactive chatbots. Gemma3’s open-source nature encourages customization, experimentation, and broad adoption in both research and industry contexts.
Model | Variant | Precision | VRAM Size |
---|---|---|---|
qwen3:0.6b | | Q4_K_M | 2.2GB |
qwen3:4b | | Q4_K_M | 5.2GB |
qwen3:8b | | Q4_K_M | 7.5GB |
qwen3:14b | | Q4_K_M | 12GB |
qwen3:32b | | Q4_K_M | 25GB |
qwen3:0.6b | | fp16 | 3GB |
qwen3:1.7b | | fp16 | 5.3GB |
qwen3:4b | | fp16 | 10GB |
qwen3:8b | | fp16 | 18GB |
qwen3:14b | | fp16 | 32GB |
gemma3:1b | instruct | Q4_K_M | 1.9GB |
gemma3:4b | instruct | Q4_K_M | 6GB |
gemma3:12b | instruct | Q4_K_M | 11GB |
gemma3:27b | instruct | Q4_K_M | 21GB |
gemma3:1b | instruct, quantization aware | Q4_K_M | 2.1GB |
gemma3:4b | instruct, quantization aware | Q4_K_M | 6.6GB |
gemma3:12b | instruct, quantization aware | Q4_K_M | 12GB |
gemma3:27b | instruct, quantization aware | Q4_K_M | 22GB |
gemma3:1b | instruct | fp16 | 3.1GB |
gemma3:4b | instruct | fp16 | 11GB |
gemma3:12b | instruct | fp16 | 31GB |
gemma3:27b | instruct | fp16 | 63GB |
Table 1 shows the tested LLMs. Ollama uses quantized models by default; if you specify a more detailed tag, you can choose different quantizations as well as the non-quantized floating-point versions. We select Qwen3 and Gemma3 for their popularity and start with the smallest variants parameter-wise, then successively move to models with more parameters until they would no longer fit into the VRAM of any of the test systems.
The last column in the table shows the size of the model when loaded into VRAM, as reported by `ollama ps`. We can already see here that not all models will fit into the VRAM of the test systems from Table 2.
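As a rough plausibility check on the sizes in Table 1, the weight footprint of a model can be estimated from its parameter count and the bits spent per weight: 16 bits per parameter for fp16 and roughly 4 to 5 bits for Q4_K_M. The sketch below is a back-of-the-envelope estimate under these assumptions; it deliberately ignores the KV cache and runtime buffers, which is why the values reported by `ollama ps` in Table 1 come out noticeably larger.

```python
def estimate_weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory estimate in GB (ignores KV cache and runtime overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Assumed bit widths: 16 for fp16, ~4.5 for Q4_K_M (a mixed 4/6-bit quantization scheme).
for params, label in [(4, "qwen3:4b"), (8, "qwen3:8b"), (14, "qwen3:14b")]:
    fp16 = estimate_weight_footprint_gb(params, 16)
    q4 = estimate_weight_footprint_gb(params, 4.5)
    print(f"{label}: ~{fp16:.0f} GB fp16, ~{q4:.1f} GB Q4_K_M (weights only)")
```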