
Ollama Now Runs Faster on Macs Thanks to Apple's MLX Framework

Ollama, the popular app for running AI models locally on a computer, has released an update that takes advantage of Apple's own machine learning framework, MLX. The result is a hefty speed boost on Macs with Apple silicon.

According to Ollama, the new version processes prompts around 1.6 times faster (prefill speed) and nearly doubles the speed at which it generates responses (decode speed). Macs with M5-series chips are said to see the largest improvements, thanks to Apple's new GPU Neural Accelerators.
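
For anyone who wants to check these numbers on their own machine, the ollama CLI prints per-response timing stats when a model is run with the --verbose flag; the "prompt eval rate" line corresponds to prefill speed and the "eval rate" line to decode speed. A minimal sketch (the model tag here is illustrative):

    $ ollama run qwen3.5 --verbose
    # after each reply, check the printed stats for:
    #   prompt eval rate: ... tokens/s   (prefill)
    #   eval rate:        ... tokens/s   (decode)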

The update also includes smarter memory management, which should make AI-powered coding tools and chat assistants feel noticeably more responsive during extended use.

Ollama says the new performance boost should especially benefit macOS users who run personal assistants like OpenClaw or coding agents like Claude Code, OpenCode, or Codex.

The preview release is available to download as Ollama 0.19 – just make sure you have a Mac with more than 32GB of unified memory to run it. Support is currently limited to Alibaba's Qwen3.5, though Ollama says more models are planned.
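
If you want to try it, the usual CLI flow should work; this is a minimal sketch, assuming the Qwen3.5 tag a commenter below reports using (exact tags may vary):

    $ ollama --version                           # confirm the 0.19 preview
    $ ollama pull qwen3.5:35b-a3b-coding-nvfp4   # fetch the MLX-backed model
    $ ollama run qwen3.5:35b-a3b-coding-nvfp4    # start an interactive session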

Top Rated Comments

23 hours ago at 04:02 am

"This is going to be some serious cash flow for Apple this year."
I think this could be a major business for Apple - it's way cheaper for a small business to buy a powerful Mac and run Qwen 3.5 than to pay for an enterprise license for a frontier model - and you don't need to worry about privacy issues.
Score: 10 Votes
23 hours ago at 04:09 am
On-device is definitely gonna be the future.

I can’t help but wonder if Apple looked ahead and foresaw this when developing the M series, or if they’ve lucked into it.
Score: 8 Votes
1 day ago at 03:27 am
This is going to be some serious cash flow for Apple this year.
Score: 6 Votes
Justin Cymbal
1 day ago at 03:23 am
M-Series chips at work😎
Score: 6 Votes
16 hours ago at 11:00 am
Just tested the new Ollama MLX runner via the CLI (0.19.0 preview) on my Mac mini M2 Pro with 32GB, running qwen3.5:35b-a3b-coding-nvfp4.

It works, and it's noticeably faster than the standard non-MLX models! The standard Qwen3.5 35B isn't usable on my Mac, but this one is. It's incredible!

What works well:
- The model loads and runs without issues on 32GB
- With /set nothink enabled, it's blazing fast (see the sketch below)
- Token generation speed is much higher than with equivalent non-MLX models
- RAM pressure stays in the green/yellow zone during normal use

Limitations:
- 32GB is the hard ceiling — the model itself takes ~20GB, leaving ~12GB for KV cache. Short sessions are fine; long context pushes into swap
- Can't comfortably use it as a backend for agentic frameworks (like OpenClaw) where the context grows large — hits the 32GB wall quickly

Bottom line: great for interactive chat sessions via ollama run, but at 32GB it's more of a fun experiment than a daily driver. For production agentic use or long context, you'd want 48GB+. Looking forward to seeing this on an M5 Max.
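
For reference, the /set nothink toggle mentioned above is issued inside the interactive session, not on the shell command line; a minimal sketch using the same model tag:

    $ ollama run qwen3.5:35b-a3b-coding-nvfp4
    >>> /set nothink                # skip the thinking phase for faster replies
    >>> write a quicksort in Swift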
Score: 5 Votes
21 hours ago at 06:01 am
As someone who downloads and experiments with everything possible…

There is a lot of delusion in this thread. Local language models below 100 billion parameters are quite useless, and even 100 billion is on the weak side. They're fun to play with for a while, but boredom and frustration set in quickly.

So what happens is people want the next model… and then the next one… and then the next one… falsely believing their 16GB or 32GB machine will one day run the holy grail of a small yet powerful local language model.

But it doesn't happen. The models keep growing, and aside from raw memory capacity, the thing that most determines whether they're usable is memory bandwidth.

The top five language models in the world are all over a trillion parameters, and what makes them useful and responsive is that they run on GPUs with over a terabyte per second of memory bandwidth.
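
To put a rough number on the bandwidth point: decoding a dense model is memory-bound, because generating each token means streaming essentially all the weights from memory, so decode speed is capped at roughly bandwidth divided by weight size. As an illustrative example, a model whose weights occupy ~20GB on a machine with ~400GB/s of memory bandwidth tops out around 400 / 20 = 20 tokens per second, no matter how fast the compute is.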
Score: 5 Votes
