100 articles from r/LocalLLaMA

r/LocalLLaMA Aggregators May 26, 2026
Turning local agents into self-optimizing agents

I was experimenting with a self-optimizing agentic pipeline to climb the benchmark leaderboard (TerminalBench). On a 10-task subset, I got the performance to rise from ~30% → ~90%. That loop worked, …

I was experimenting with a self-optimizing agentic pipeline to climb the benchmark leaderboard (TerminalBench). On a 10-task subset, I got the performance to rise from ~30% → ~90%. That loop worked, so I asked: can the same reflect-and-rewrite step run continuously against everyday chats instead of a benchmark? How it works Every chat with your local LLM goes through a small proxy and is logged. autoswarm reflect has the same local model review those logs, distill concrete lessons, and write them to skills.yaml. Lessons auto-inject into the system prompt of future chats. Run it (LM Studio path) Start LM Studio's local server and load a model. ```bash pip install -e . autoswarm doctor # verifies LM Studio is reachable autoswarm start # auto-detects upstream + model, listens on :8080 I'm genuinely fascinated by the idea of self-optimizing agents, and I believe there's something bigger to uncover there. That said, this is just a hobby project and I'm still experimenting with it. Would love your feedback! Link: https://github.com/arteemg/autoswarm I'm actively working on the project, so please ⭐ the repo to stay updated.   submitted by   /u/Rude_Substance_8904 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
Long-context performance at lower quants

I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing tha…

I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing that usually once I hit around 75-80k context use, it starts to get dumb all of a sudden. It just hits a brick wall and quality deteriorates rapidly and drastically. It'll begin hallucinating, forgetting things, or think something it said/suggested was actually something that I said. I found I have to compact before I get to that point, and then it keeps going on just fine. Is this because I'm running Q3? Unfortunately Q4 is just outside of the capability of my system specs unless I want to start disk swapping. So is it just an issue with this particular model? Or because it's Q3? Are there llama.cpp settings that can help? I'm already using BF16 KV cache.   submitted by   /u/_TheWolfOfWalmart_ [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face

MOSS-TTS-v1.5 MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyi…

MOSS-TTS-v1.5 MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For the full 1.0 feature walkthrough, input schema, decoding hyperparameters, and evaluation tables, please refer to the MOSS-TTS 1.0 README. Compared with MOSS-TTS 1.0, v1.5 focuses on the following improvements: Stronger multilingual synthesis with language tags: when the language field is omitted, v1.5 may improve some languages and regress slightly on others compared with 1.0. When the language is specified, v1.5 is stronger than 1.0 on almost all supported languages. Set the tag when building the user message, for example processor.build_user_message(text=text_fr, language="French"). More stable voice cloning: v1.5 improves speaker similarity and reduces cloning variance, making repeated generations more consistent. Better long-reference, short-text cloning: v1.5 handles scenarios where the reference audio is much longer than the target text more reliably than 1.0. More stable punctuation-following prosody: v1.5 follows punctuation-driven pauses more closely, especially in long sentences. Explicit pause control: v1.5 supports inline pause markers such as "[pause 3.2s]". For example, 我今天学习了一首中国的古诗,它的名字是[pause 3.2s]静夜思! inserts an explicit 3.2s pause before 静夜思. Supported Languages MOSS-TTS-v1.5 currently supports 31 languages. It keeps the 20 languages supported by MOSS-TTS 1.0 and extends multilingual continued training to additional languages including Cantonese, Dutch, Finnish, Hindi, Macedonian, Malay, Romanian, Swahili, Tagalog, Thai, and Vietnamese. They released additional model as well. https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect-v2.0   submitted by   /u/pmttyji [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Feedback Wanted: Building for easier local AI

Just what the post says. Looking to make local AI easier so literally anyone can do “all the things” very easily. We built an installer that sets up all your OSS apps for you, ties in the relevant mo…

Just what the post says. Looking to make local AI easier so literally anyone can do “all the things” very easily. We built an installer that sets up all your OSS apps for you, ties in the relevant models and pipelines and back end requirements, gives you a friendly UI to easily look at everything in one place, monitor hardware, etc. Currently works on Linux, Windows, and Mac. We have kind of blown up recently and have a lot of really awesome people contributing and building now, so it’s not just me anymore it’s people with Palatir and Google and other big AI credentials and a lot of really cool people who just want to see local AI made easier for everyone everywhere. We just finished automatic multi GPU detection and coordination as well, so that if you like to fine tune these things you can, but otherwise the system will setup automatic parallelism and coordination for you, all you’d need is the hardware. Also currently in final tests for model downloads and switching inside the dashboard UI so you can manage these things without needing to navigate a terminal etc. I’d really love thoughts and feedback. What seems good, what people would change, what would make it even easier or better to use. My goal is that anyone anywhere can host local AI on anything so a few big companies can’t ever try to tell us all what to do. That’s a big goal, but there’s a lot of awesome people that believe in it too helping now so who knows? Any thoughts would be greatly appreciated!   submitted by   /u/Signal_Ad657 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
[OSS] dlmserve - first serving engine for diffusion language models

Spent the last few months building this on a single RTX 5070. Quick context: diffusion language models (like LLaDA from gsai-ml) are a different beast from GPT-style autoregressive LLMs. Instead of g…

Spent the last few months building this on a single RTX 5070. Quick context: diffusion language models (like LLaDA from gsai-ml) are a different beast from GPT-style autoregressive LLMs. Instead of generating one token at a time, they start with a fully masked sentence and iteratively denoise the whole thing in parallel. Cool tech, but mainstream serving engines are all built around the autoregressive contract, so none of them serve diffusion LLMs. dlmserve fills that gap: OpenAI-compatible HTTP API (/v1/chat/completions) Automatic continuous batching at the denoising-step level Optional LocalLeap acceleration baked in Token-identical to the reference HF implementation at temperature=0 2.5x throughput vs HF at batch=4, plus another ~1.8x from LocalLeap Runs in 12 GB VRAM (RTX 3090/4090/5070 all fit). MIT licensed. Repo: https://github.com/iOptimizeThings/dlmserve Install: pipx install dlmserve (or pip install dlmserve if you're in a venv) First public OSS project of this size for me. Genuinely curious what people think. Feedback and code review very welcome, also happy to answer questions about the diffusion serving architecture Edit: Roadmap: - v0.1 ✓ LLaDA-8B-Instruct + LLaDA-1.5 - v0.2 Dream-7B + DiffuLLaMA (issues already open) - v0.3 block diffusion + LLaDA-2.0 + Fast-dLLM KV cache   submitted by   /u/Glittering_Painting8 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Harbor v0.4.19 - vllm/sglang/llama.cpp launch codex/claude/pi/opencode

I'm usually not posting about Harbor releases out of the respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tool…

I'm usually not posting about Harbor releases out of the respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tools with local inference backends. For example, to run pi + vllm: # model downloaded and configured harbor up vllm # Harbor knows that vllm is running and will use it harbor launch pi Additionally, launch can proxy requests through built-in optimising LLM gateway which automatically injects and resolves tools, such as web search, so you can add web search to an agent by just appending --web to the command and Harbor will pre-wire everything: harbor launch --web --model qwen3.5:4b --backend ik_llamacpp mi -p 'Find recent releases of agentic tools and write a two sentence overview' You can find many more details in the wiki here: https://github.com/av/harbor/wiki/3.-Harbor-CLI-Reference#harbor-launch-launch-options---service-servicetool-args Thank you!   submitted by   /u/Everlier [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
Okay 27B made me a believer

I previously hated on this model, but I have just been impressed by it, and I understand the hype now. I have been working on a HTML5 game console and I decided to see if Qwen3.6 27B can handle makin…

I previously hated on this model, but I have just been impressed by it, and I understand the hype now. I have been working on a HTML5 game console and I decided to see if Qwen3.6 27B can handle making some quick games in it to showcase functionality (save games, console API handling for stat tracking and heartbeat management, meta data for the game, etc) I gave it 3 files, explaining how the API works, the gamepad controls, and a typescript shader for it to apply. Then I just game it a very simple prompt "make a breakout game for this console, in the working directory are reference files on how to make it". First result was immediately playable, controls made sense, graphics style was was unique and appropriate, sound worked, console API all worked, and it felt good and was actually fun. It added flair that made it not feel like the vibecoded breakout clone it was. It went way above and beyond the minimum that I've seen so many LLMs do. It was not lazy in the slightest. It's a simple test, but this is something everything but something like Opus could handle. There wasn't anything particularly done well, it's just that the whole game was nearly complete in a single shot and it felt like thought was put into the entire game. All I needed was one follow up for customization and a single glitch and it was already what I would consider complete. And this was on a 27B model with Opencode. The best way I can describe it, is that it was congruent. Now I just wish I went the Nvidia card route instead of Strix Halo cause the speed isn't great. Maybe 3.7 35B A3B can have some of this magic.   submitted by   /u/Forward_Jackfruit813 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
Tencent Hy-MT2 is now under Apache License 2.0

nice update bois   submitted by   /u/sword-in-stone [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Keye-VL-2.0-30B-A3B -- Introducing DSA attention into multimodality for the first time

Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capa…

Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family. https://huggingface.co/Kwai-Keye/Keye-VL-2.0-30B-A3B https://preview.redd.it/wsxe233abh3h1.png?width=1244&format=png&auto=webp&s=aa9ffa388e16e4f8f5cb72ed3dae063f99df69f1 https://preview.redd.it/2iymyb9dbh3h1.png?width=2048&format=png&auto=webp&s=a834ce92294c3be059b50c6993f1be6d3faf2767   submitted by   /u/External_Mood4719 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
New KV Quants coming 😍 Welcome OSCAR kv quant open sourced by togetherAI

Just when we started embracing turboquant this happens   submitted by   /u/yehyakar [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
China Clamps Down on Overseas Travel for AI Talent at Alibaba, DeepSeek

Big, if true. Doesn't bode well for research / OS models out of China.   submitted by   /u/kaggleqrdl [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Outsourcing plus LocalAI will soon become more economical vs Frontier labs

written entirely by me. AI did the chart and formatting html   submitted by   /u/Comfortable-Rock-498 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Not sure if this was posted. But I think it's highly relevant to us.

  submitted by   /u/Paradigmind [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Output Length Constrained Summarization using GRPO on tiny LLMs | smolcluster

Just released a blog on a side research project I have been doing for the past two months and would love for you all to check out and see how it is! It's about output length-constrained summarizatio…

Just released a blog on a side research project I have been doing for the past two months and would love for you all to check out and see how it is! It's about output length-constrained summarization using LLMs with GRPO. All experiments run on tiny LLMs - Qwen2.5-0.5B-Instruct and LFM-2.5-350M on a 3x Mac mini M4 cluster (16 GB each), single-node training with multi-node vLLM inference for rollouts. The core question: can you teach a sub-500M model to summarize Reddit posts in exactly 64 tokens while keeping the quality high? The baseline zero-shot answer: not really. Composite G-Eval scores of 2.376 (Qwen) and 2.332 (LFM) under zero-shot prompting, with pass rates of just 21% and 13%. That was the starting point. I tested 12 reward configurations across 2 training strategies: Strategy 1 - Length-Penalty Fine-tuned: Train on length reward first → checkpoint → fine-tune with quality rewards only. Strategy 2 - Length-Penalty Included (a.k.a joint): Length + quality rewards active simultaneously from step 1. 24 checkpoints total. One clear winner between the two strategies. The quality reward signals: ROUGE-L - LCS F1 against the reference METEOR - precision/recall with stemming + synonym matching BLEU - n-gram precision with a brevity penalty And all their pairwise combinations. Evaluated with G-Eval (LLM-as-judge) across Faithfulness, Coverage, Conciseness, and Clarity. The staged curriculum wins - consistently. Best composite scores: LFM: 2.904 (quality-meteor, fine-tuned) vs 2.701 (joint) Qwen: 2.817 (quality-bleu-rouge, fine-tuned) vs 2.769 (joint) Practical takeaways: Staged curriculum (length first, quality second) outperforms joint training in absolute score METEOR + ROUGE-L is the most reliable reward combination under both strategies The length constraint is also a regularizer - it prevents the Coverage ↔ Conciseness collapse that happens when quality rewards run unconstrained BLEU alone is not worth including as a standalone reward signal for summa

📰
r/LocalLLaMA Aggregators May 26, 2026
Token Usage and Databases - Local vs. API

Throwing something out to the community for a bit of an insight. I got thinking about the consumption of tokens when working with various databases and here is my understanding: When I ask as quest…

Throwing something out to the community for a bit of an insight. I got thinking about the consumption of tokens when working with various databases and here is my understanding: When I ask as question that is essentially converted to tokens. The LLM then "reads" that and generates the response which in this cases involves a database query The LLM then tokenizes the query results and "reads" them and provides me the results and any insights or answers Rinse and repeat until you have gotten what you want. i.e continue to build token usage. So if that's right then AI driven analytics is going to be terribly expensive in token consumption really fast, even with all of the caching and other techniques available right now. It's also going to get considerably worse with the use of sub agents and agent council type solutions where a single question could kick of a bunch of separate queries that are then passed back and forth. I work with large enterprise where all the vendors are heavily pushing integrated analytics and agentic querying of the underlying platform (SAP, Service Now etc.) and question whether buying into this now exposes organizations to a massive cost based risk once the initial contracts have expired and generative AI is actually being charged at above cost rather than below. I'm really curious in other peoples perspectives but have a couple thoughts. Isn't this a very strong justification (along with a number of others) for hybrid architectures where local AI is leveraged for the heavy token count types of analysis within organizations? I spend quite a bit of time reading from various sources and so far I haven't seen this really discussed so I'm wondering if I missed something along the way or the service providers aren't comfortable discussing these implications? Appreciate the comments in advance. Cheers   submitted by   /u/WishfulAgenda [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
China Expands Travel Curbs to Top AI Talent at Private Firms

https://www.bloomberg.com/news/articles/2026-05-26/china-expands-travel-curbs-to-top-ai-talent-at-private-firms Now it will be much harder to poach Chinese AI talents like the former Qwen head Junyan…

https://www.bloomberg.com/news/articles/2026-05-26/china-expands-travel-curbs-to-top-ai-talent-at-private-firms Now it will be much harder to poach Chinese AI talents like the former Qwen head Junyang Lin. It is quite sad that they will also have a hard time to travel to foreign countries for fun. Non-paywalled version from Straits Times: https://www.straitstimes.com/asia/east-asia/china-expands-travel-curbs-to-top-ai-talent-at-private-firms   submitted by   /u/Ok_Warning2146 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
Are local LLM users testing prompt injection before connecting models to tools?

I wanna know how people here are handling security once local models move beyond chat.....Running a model locally feels safer because the data does not leave your machine or your infra. That is a rea…

I wanna know how people here are handling security once local models move beyond chat.....Running a model locally feels safer because the data does not leave your machine or your infra. That is a real advantage.....But once the local model is connected to tools, files, RAG, shell commands, browser automation, APIs, or internal docs, the risk changes. At that point, prompt injection is not just “the model said something weird.” It can influence what file gets read, what command gets suggested, what data gets retrieved, what tool gets called, or what action the agent takes next..... Most local setups I see focus heavily on model quality, quantization, context length, VRAM, tokens per second, and benchmark scores. All valid. But I see less discussion around testing the model’s behavior under malicious instructions before giving it access to real tools.... The people running local models in agentic setups: Are you testing prompt injection or jailbreak behavior? Do you isolate tool access by default? Do you keep local models read-only until trusted? Do you log tool calls and retrieved context? Or is this still mostly “local means safe enough” for now? I’m not asking from a doom angle. I’m more interested in what practical safety habits local builders are actually using.   submitted by   /u/sunychoudhary [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
SkillOpt treats markdown skill files as trainable parameters with proper optimization machinery

Paper came out recently that formalizes something a lot of agent builders have been doing ad hoc. They use a frontier model to propose bounded edits (add/delete/replace) to markdown skill files, then…

Paper came out recently that formalizes something a lot of agent builders have been doing ad hoc. They use a frontier model to propose bounded edits (add/delete/replace) to markdown skill files, then gate every edit against a held out validation set. Only strict improvements accepted, ties rejected, rejected edits become negative signal for the next round. Few things worth noting: Best skills converge with 1 to 4 accepted edits out of many more proposals. Edit budget of 4 to 8 per step works best, remove the cap and performance collapses. Median final skill is ~920 tokens. A skill optimized on Codex transferred to Claude Code with zero modification and gained +59.7 on SpreadsheetBench. And GPT 4.1 nano with an optimized skill roughly matched frontier on procedural benchmarks. The limitation is the validation gate requires an auto grader with clear correct answers. Works for code and spreadsheets, breaks for anything open ended. Paper: https://arxiv.org/pdf/2605.23904   submitted by   /u/agentic-doc [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
Llamacpp server : How do the -np and -c flags interact?

I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c i…

I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact. The context for each parallel client appears to be equally distributed across server slots (so each client is allowed c / np context). I have some questions: - What are the consequences of launching a server with a greater context -c than what the model allows? - What if c / np is greater than the model max context? Are there any negative to that regarding model performance? - If a rig allows to allocate twice the context max size in vram, is it twice energy and time efficient to serve two agents in parallel rather than sequentially?   submitted by   /u/Doug_Fripon [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
qwen 3.6 27B AR-> Diffusion - local training on 5090

based on the work of open-dllm - (which achieved qwen 2.5 autoregressive -> diffusion realignment head - same exact model under the hood delivering a 4x in improvement.) TLDR I haven't got a trai…

based on the work of open-dllm - (which achieved qwen 2.5 autoregressive -> diffusion realignment head - same exact model under the hood delivering a 4x in improvement.) TLDR I haven't got a trained model yet. just a burnt out gpu cable and a new psu on order. I did actually get the thing to do a forward pass on a 5090 with help of another gpu rtx4000 to help offload recreations. Below are some low level ramblings / findings / observations. Firstly - the amount of vram normally required to do this > 600gb - (i think) after some wrangling - and giving up on optane route - it's possible to train on qlora form factor which will actually take the model and train on nvidia - nvfp4 i attempt to get the entire 27b model to train on a 5090 https://github.com/scrya-com/dLLM-castlehill latest training run https://wandb.ai/snoozie/open-dllm-27b/runs/arcefpjp?nw=nwusersnoozie Public service annoucment - to avoid burning cables - throttle down nvidia max power for consumer 5090 cards from 600w -> 400w The vanilla route with open-dllm is validated on qwen 2.5 with 4x speed up (if someone with lots of compute could take a look it might just work) - I take some deviation to explore improving this - and found a few papers. One is d3llm Ultra-Fast Diffusion LLM https://github.com/hao-ai-lab/d3LLM which boasts faster diffusion speeds - so i upstream this code into the codebase and include their mdm loss - seems ok. It's basically also taking the order of the tokens into account. With the diffusion it can have many steps (see graph) but we can shorten that time to see much higher throughput / tokens per second. if we could theoretically do 1 step - then you may see some crazy speeds. https://wandb.ai/snoozie/open-dllm-compare?nw=nwusersnoozie When i was working on improving ltx2 to speed up video recreation to do 1 shot diffusion - I attempt to implement this trick shot based off a paper variational flow maps which / make some noise https://arxiv.org/abs/2603.07276 see

r/LocalLLaMA Aggregators May 26, 2026
Qwen3.5 27B Uncensored Heretic Native MTP Preserved is Out Now With the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs, NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats!

Safetensors, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-27B-uncens…

Safetensors, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF NVFP4, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 NVFP4 GGUFs, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF GPTQ-Int4, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4 Comes with benchmark too. Find all my models here: HuggingFace-LLMFan46 Now in case some people might ask, why release Qwen3.5 MTPs version when there is already Qwen3.6 MTPs version? Well the thing is, most people would assume that higher number = newer and better model, but the thing is both Qwen3.5 and Qwen3.6 models uses the qwen35 architecture, they just had different training and their focus are meant for different primary usecases, Qwen3.6 models are mainly meant for agentic and coding AI assistance and Qwen3.5 models are mainly meant for general purpose AI assistance, now Qwen3.6 can definitely be used for general AI assistance just like Qwen3.5 can definitely be used for agentic and coding, but if you want the most optimal usecases it would be Qwen3.6 for agentic and coding and Qwen3.5 for general AI assistance that is where each of them excels at. Also for extra info, in case anyone is wondering, despite Qwen3.5 and Qwen3.6 both sharing the qwen35 architecture, they behave very diferently to abliteration. Qwen3.5 models can have a KL divergence in the 300's or 400's but on benchmarks this does not

📰
r/LocalLLaMA Aggregators May 26, 2026
Stop pretending self-hosting is cheaper. It's not. We do it for different reasons and we should say so.

Did the math on my own rig last week and I'm tired of seeing this sub repeat the "local is cheaper" line without numbers. Let me actaully break it down. My setup: 2x 3090 (used, $1400 total…

Did the math on my own rig last week and I'm tired of seeing this sub repeat the "local is cheaper" line without numbers. Let me actaully break it down. My setup: 2x 3090 (used, $1400 total), Ryzen 7900X, 64GB DDR5, around $2800 all in. Pulls about 700W under load. At my electricity rate that's roughly $0.21/hour just to keep it serving. Add depreciation on the GPUs (amortize over 3 years), and the marginal cost per active hour lands somewhere around $0.50-0.80 depending on how much I use it. Now compare RunPod: a single H100 80GB is around $1.99/hr on-demand, $1.49/hr if you commit. That H100 will run Qwen3.6-35B-A3B at 2-3x the throughput of my dual 3090 setup. So per-token, the H100 actually ends up cheaper. If I'm honest about my usage (maybe 2-3 hours of heavy inference per day), I am paying significantly more per token than I would by just renting when I needed it. So why tf do I keep the rig: - Privacy: I run things I don't want logged by a cloud provider - Dignity: I don't want to ask a company for permission to query my own data - Tinkering: I get to learn stuff you cannot learn renting - Cold start: My rig is always on, no 30 second container spin-up - Sovereignty: My infrastructure doesnt disappear when a provider rate-limits me None of those are economic. They are all about control. And thats fine. It is worth paying for. But lets stop pretending the math runs the other way. How many of you have actually run the numbers on your own setup vs renting equivalent compute? Or are we all just running on vibes lol?   submitted by   /u/Napster3301 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 It's merge request has been denied so it will not be in mainline llama.cpp. The changes are so small that I just put them i…

Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 It's merge request has been denied so it will not be in mainline llama.cpp. The changes are so small that I just put them into whatever the current release of llama.cpp is. Read the PR for more info. It will only work with MOEs. Also, it gives the most boost at low context. As the context rises, the gain diminishes. Pedapudi explains why that happens in the PR. Here are some numbers. It really works well. The tiny amount of time it takes me to apply the code to the current release of llama.cpp is time well spent. main ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB): Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB | model | size | params | backend | ngl | mmap | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 | 1106.11 ± 8.60 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d10000 | 755.79 ± 2.58 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d20000 | 587.61 ± 1.52 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d40000 | 415.09 ± 2.45 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d60000 | 316.89 ± 2.35 | PR ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB): Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB | model | size | params | backend | ngl | mmap | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 | 1447.62 ± 7.10 | **+31%** | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d

📰
r/LocalLLaMA Aggregators May 26, 2026
Self-hosted STT better than Whisper Large V3 Turbo that matches AssemblyAI quality?

I’m already using Whisper Large V3 Turbo self-hosted, but the accuracy still isn’t where I need it. I like AssemblyAI’s quality and want something self-hosted that: - Is clearly better than Whisper L…

I’m already using Whisper Large V3 Turbo self-hosted, but the accuracy still isn’t where I need it. I like AssemblyAI’s quality and want something self-hosted that: - Is clearly better than Whisper Large V3 Turbo - Can match or get close to AssemblyAI’s transcription quality - Runs locally (no cloud API) Is there a self-hosted model or stack that realistically beats Whisper Large V3 and gets close to AssemblyAI? Or is AssemblyAI’s own self-hosted offering the only real option at that quality level?   submitted by   /u/milkygirl21 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
I finally put my NPU (Intel Arrow Lake) to use doing ASR for my smart home

I wrote about what I found in a deep dive elsewhere (which I will no mention because Reddit doesn't like cross linking) but I wanted to share it here since this is where I learn the most about AI stu…

I wrote about what I found in a deep dive elsewhere (which I will no mention because Reddit doesn't like cross linking) but I wanted to share it here since this is where I learn the most about AI stuff and I've seen before questions about NPUs, that are often dismissed as marketing gimmicks (and for the most part they are if we're taking LLMs, but not for other ML workloads). If you care for the traps I found along the way making onnx-asr working on openvino compiled to the NPU, you can read the article, I'm here to post the findings. Table comparing the total time, total energy used (watts during inference and total Joules per transcription). Audio length CPU (INT8) NPU (FP32) Speedup Energy 10s 978ms / 44.6J / 45.6w 204ms / 4.2J / 20.5w 4.8× faster 10.7× less energy 20s 1708ms / 79.8J / 46.7w 615 ms / 7.8 J / 12.7 W 2.8× faster 10.2× less energy 60s 5011ms / 237.7J / 47.4w 818 ms / 11.0 J / 13.4 W 6.1× faster 21.6× less energy The energy was sampled at 10hz using intel-rapl which gives the total package power, to which I substracted the idle power I measured before the run, so when you see that the power was 12.7w, it means it was 12.7w above idle. I think this is a remarcably result considering intel NPUs are, at least on paper, rather weak with 13TOPS, compared with the >40TOPS of the AMD ones, but still more than fast enough for this task. Some real world number end-to-end number from home assistant: CPU NPU Running this on the NPU frees the CPU to do CPU stuff, and also saves some valuable 2-3gb of valuable vram on my 7900XTX to do LLM stuff. Incidentally, this setup happens to beat in real world usage my 12GB RTX 3060 eGPU that I was using before. On a 3-4s voice command, the NPU takes ~120-160ms, while the 3060 i used before took ~150-300ms. I am not claiming that the NPU is more powerful than the nvidia card, but I suspect that the advantage comes from the NPU being able to wake up instantly from dormancy, while the nvidia card took long enoug

📰
r/LocalLLaMA Aggregators May 26, 2026
Running on a macbook, and having issues with crashing? Maybe this will help...

Just a friendly pointer on getting around some issues on macbooks. I hope someone finds this useful. I spent weeks of ripping my hair out with crashes, crap performance and issues - and being entirel…

Just a friendly pointer on getting around some issues on macbooks. I hope someone finds this useful. I spent weeks of ripping my hair out with crashes, crap performance and issues - and being entirely too stubborn to harness the power of Google to find solutions to my issues. Though, I prefer doing things the hard way, which is rather ironic for someone who is taking an enjoyment in finding ways to build out local AI... I'm running Qwen3.6 35b A3B on a 14" MBP M2 Max with 64GB ram, which feels like plenty for most local models that are dominating the charts. I'm currently using a 131k context, and I can easily use higher if I can tolerate the long prompt processing time of 1-2 minutes for reloading a session with a massive context. Otherwise, thanks to KV cache and etc, prompt processing is usually between 3 and 40 seconds for me even once the context is ridiculously huge (ie 100k+) - and the speed is fantastic (49 tokens/sec generation, 400+ on prompt processing) for the most part. (Qwen3.6 35b a3b) My setup took WEEKS to fine-tune and get stable, so I figured I'd share it with some of you to help spread the love for anyone who was having issues running local models and agentic workflows on macbooks, given I received an onslaught of messages from colleagues, friends and people asking how I managed to make Qwen3.6 stable and use it the way I am (I have a pretty large project and Qwen3.6 is the driver of it, right down to having agents monitoring logs and automatically troubleshooting and fixing issues - which is a scary thought...) So, a simple rundown, and then a better explanation below... * Change display refresh rate from ProMotion to 60Hz * Use GGUF models, NOT MLX * Run with either llama.cpp or LM Studio (which uses llama.cpp under the hood). Ollama is slow, and to be blunt: horrible. * Raise memory wire limit via iogpu.wired_limit_m . On my 64GB laptop, I have this at 61440 * Use Qwen3.6 35b A3B, either q4 or q6 quant. I find q4 - funny enough - to some

r/LocalLLaMA Aggregators May 26, 2026
Qwen3.5 35B A3B uncensored heretic Native MTP Preserved is Out Now With the Full 785 MTPs Preserved and Retained, Available in Safetensors, GGUFs. NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats

Safetensors, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-35…

Safetensors, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF NVFP4, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 NVFP4 GGUFs, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF GPTQ-Int4, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4 Comes with benchmark too. Find all my models here: HuggingFace-LLMFan46 Now in case some people might ask, why release Qwen3.5 MTPs version when there is already Qwen3.6 MTPs version? Well the thing is, most people would assume that higher number = newer and better model, but the thing is both Qwen3.5 and Qwen3.6 models uses the qwen35 architecture, they just had different training and their focus are meant for different primary usecases, Qwen3.6 models are mainly meant for agentic and coding AI assistance and Qwen3.5 models are mainly meant for general purpose AI assistance, now Qwen3.6 can definitely be used for general AI assistance just like Qwen3.5 can definitely be used for agentic and coding, but if you want the most optimal usecases it would be Qwen3.6 for agentic and coding and Qwen3.5 for general AI assistance that is where each of them excels at. Also for extra info, in case anyone is wondering, despite Qwen3.5 and Qwen3.6 both sharing the qwen35 architecture, they behave very diferently to abliteration. Qwen3.5 models can have a KL divergence in the 300's or

📰
r/LocalLLaMA Aggregators May 26, 2026
CXMT started selling ram to corsair

They started producing cheaper ram for corsair, hopefully it will get cheaper for consumers https://www.tomshardware.com/pc-components/ddr5/chinese-memory-maker-cxmt-enters-the-mainstream-consumer-m…

They started producing cheaper ram for corsair, hopefully it will get cheaper for consumers https://www.tomshardware.com/pc-components/ddr5/chinese-memory-maker-cxmt-enters-the-mainstream-consumer-memory-with-corsair-vengeance-ddr5-kit-chinese-made-dram-emerges-as-an-antidote-for-crushing-shortages   submitted by   /u/power97992 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Is this LLM challenge even possible?

Ive been trying for hours now and I just can't get anything from this model othen then random words, it looks like it had a stroke. I tried searching for patterns in the output, or for strings encode…

Ive been trying for hours now and I just can't get anything from this model othen then random words, it looks like it had a stroke. I tried searching for patterns in the output, or for strings encoded into the tensors but that didn't get me anywhere. Was anybody able to find something here or am I just wasting my time on a broken challenge?? I posted it before in r/llmdevs, and no one there could help   submitted by   /u/1337Captain [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
model : add support for talkie-1930-13b by niklassheth · Pull Request #22596 · ggml-org/llama.cpp

https://huggingface.co/talkie-lm/talkie-1930-13b-it talkie-1930-13b-it talkie-1930-13b-it is a 13B vintage language model. It is an instruction-tuned post-train of talkie-1930-13b-base, which was …

https://huggingface.co/talkie-lm/talkie-1930-13b-it talkie-1930-13b-it talkie-1930-13b-it is a 13B vintage language model. It is an instruction-tuned post-train of talkie-1930-13b-base, which was trained on 260B tokens of pre-1931 English-language text. talkie-1930-13b-it was finetuned using a novel dataset of instruction-response pairs extracted from pre-1931 reference works, including etiquette manuals, encyclopedias, and letter-writing manuals. The model then underwent reinforcement learning (online DPO with an LLM-as-a-judge) to improve instruction-following ability. Read more about talkie in our report. Reference code to run talkie is available on GitHub. Have you ever daydreamed about talking to someone from the past? What would you ask someone with no knowledge of the modern world? What would they ask you? While we don’t have time machines yet, we can simulate this experience by training, in Owain Evans’s phrase, ‘vintage’ language models: LMs trained only on historical text.   submitted by   /u/pmttyji [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
One letter to appease them all

  submitted by   /u/ivari [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
Is something went wrong with those online free model, why I feel they worse than Gemma 4 26B A4B Q4_KM ??

It started with I just want to make a chat app like roleplay with characters but Gemma 4 26B A4B Q4_KM doesn't have info some old character so I crawl back to those online services as those model is …

It started with I just want to make a chat app like roleplay with characters but Gemma 4 26B A4B Q4_KM doesn't have info some old character so I crawl back to those online services as those model is much bigger parameter and quite update info, however I found something strange, I feel they're worse than offline model which it should not happen, they might have rich info but the way they answer sound silly. ``` Chat Simulation AI impersonate a character from well know novel, manga, anime or game. Writing style A chat app style as AI must the impersonate character chat with user via app chat, AI must ensure the impersonate character maintains original personality (no OOC behavior). Wait for user 1st input Impersonate character, identify info of a character for AI to pin point target the impersonate character this simulate then AI will fill those details as follows Character age and visual age Character appear Character body measure Character outfit Character life long purpose Main cast in heroine's story AI must list those character that relate to heroine from her story along with detail info of each characters for better simulate them interactive with user and heroine. Wait for user 2nd input Simulation Setup, which AI would receive user input then help to fill those details as follows: Setting Scenario Persona Check ``` I try free Grok, ChatGPT and Google AI mode Grok - unusable as it requests to register for long input. ChatGPT - WTH with its answer. Google AI mode - Quite okay when answer 1st input but start to broken in 2nd input. And more strange about Google is AI model in search page is felt much better than AI model in AI mode. Is free tier online AI become this bad ? Or they eat too much junk data to become this bad ?   submitted by   /u/revennest [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
Shard - getting to 10× KV cache compression

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a reimplementa…

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a reimplementation of Google's TurboQuant[1], stalled around 4×, and ended up as a different design once we noticed K and V need different treatments: PCA plus int4 quantization on K (the matrix is effectively low-rank once you undo RoPE), and a Hadamard rotation plus vector quantization on V. Attention runs directly on the compressed K, no fp16 reconstruction. Code: krish1905/shard.   submitted by   /u/Thrumpwart [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
Free AI Blog site — I have unused credits expiring soon, feel free to try it

Hi everyone, First of all, I want to clarify that this is not a promotional or advertising post. I have no plans to monetize the site, run ads, or use it for any commercial purpose. It also doesn’t g…

Hi everyone, First of all, I want to clarify that this is not a promotional or advertising post. I have no plans to monetize the site, run ads, or use it for any commercial purpose. It also doesn’t get enough traffic for that anyway. I made this site as a small personal project to study and experiment with AI agent workflows. While working on it, I purchased some credits, but I still have a lot left unused. The credits will reset on June 11, and it feels like a waste to just let them disappear. So, if anyone is interested, please feel free to use it casually just for fun: https://crawlog.apps.codemonkey.click/ I hope the credits can be used by someone instead of going to waste. If this post violates any rules or causes any issues, I’ll delete it.   submitted by   /u/LetterheadNeat8035 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
how do you decide between q4 and q5 on a 70b when 24gb is the cap?

ran into the q4 vs q5 wall again this morning. 70b model. 24gb card. q4 fits with margin, q5 fits if i kill everything else on the gpu and pray. did the math on actual quality difference for my use c…

ran into the q4 vs q5 wall again this morning. 70b model. 24gb card. q4 fits with margin, q5 fits if i kill everything else on the gpu and pray. did the math on actual quality difference for my use case (mostly code generation on a private codebase). benchmarks online give me a 1-2 point delta on humaneval. that's not nothing but it's also not enough to tell me whether the q5 squeeze is worth running everything closer to the redline. how do people running larger models day to day actually decide between q4 and q5 on this kind of setup. i keep flip-flopping every couple weeks and at this point i'm pretty sure i'm just overthinking it. probably going to flip a coin tomorrow.   submitted by   /u/Practical_Low29 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 26, 2026
New local model reaching near frontier on PII removal at 9 ms CPU inference

Hi all, I've been working on this model to strip sensitive information from computer use data and would love some feedback!   submitted by   /u/louis3195 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Need Help - What would you build? Air-gapped NL assistant that is integrated with Splunk

So I have a side project with given scope: Fully air-gapped / on-prem - no internet, no outbound calls of any kind Engineers ask questions about Splunk data in natural language Has to hold the conve…

So I have a side project with given scope: Fully air-gapped / on-prem - no internet, no outbound calls of any kind Engineers ask questions about Splunk data in natural language Has to hold the conversation in Korean (index/field names stay English) Local/small models preferred, needs to fit a modest GPU - was looking at Qwen/Gemma4 but indexing more on what is good enough small model to have decent performance Some memory across the session (not required, but at least within the current session would be nice) Strictly read-only and safe enough to point at prod logs I am thinking simple chat interface (like claude, openAI style) where we give Splunk API access for AI to retrieve and reason. 2 Questions: I was thinking deploying like Openclaw/Hermes agent + small language model to start - because I really like the interaction with them. Is there any better or easier way to achieve similar experience? (vLM, ollama, open WebUI, any suggestions would be nice) In terms of outcome, what do you think we can actually let it do? log analysis? RCA? basic questions? Pretty new to this and trying to learn.. any initial guidance or tips would be awesome!   submitted by   /u/BunchaQuestion [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
Update on 12x32gb sxm v100 cluster / local AI for legal drafting

Update from the lawyer with the V100 server. A few of you asked what I actually ended up running once the dust settled, so here it is. Still just a lawyer, still driving the whole thing through Claud…

Update from the lawyer with the V100 server. A few of you asked what I actually ended up running once the dust settled, so here it is. Still just a lawyer, still driving the whole thing through Claude Code, still not fully sure what I'm doing — but it works now, which is more than I could say last time. First, the hardware caught up to the plan. The last two V100s are in, so the "final form" I promised is real: twelve V100-SXM2 32GB on the Threadripper Pro. It's Board A on GPUs {4,5,8,9}, Board B on {6,7,10,11}, an NVLink pair on {0,1}, and a mixed pair on {2,3} where one card is a 16GB. Split a model across two different NVLink boards and throughput falls off a cliff (the cross-board hop is PCIe/NUMA, not NVLink), so I keep every model inside one board. Learned that one the expensive way. And yeah, I caved and built the second box. EPYC 7302P, 512gb RAM, 4x RTX 3090 + 2x V100-PCIe. The mid-life crisis remains on schedule. The bigger change: I gave up on vLLM for the local models. Not because vLLM is bad — because the models I actually want are MoE GGUFs, and vLLM on Volta is a dead end for those (FP8/AWQ/Marlin all want SM75+, the GPTQ kernels are broken on 7.0). I moved the whole thing to llama.cpp (mainline — a recent build finally fixed a Gemma chat-parser bug that had been mangling my long prompts). Here's the part that's the opposite of what my first post implied: on V100, dense models are a trap. Only MoE clears a usable speed. Rough decode numbers — Q8 GGUF, Q4 KV cache, flash-attn on, one 4-card board, on real drafting prompts (several thousand tokens of context, not a 5-token "hello"): Model Type tok/s (decode) Gemma-4-26B-A4B MoE ~113 Qwen3.6-35B-A3B MoE ~82 Qwen3.5-122B-A10B MoE ~50 any dense 27-32B dense ~20-28 (under my 40 floor, not worth it) dense ~128B dense ~9 (forget it) So a 122B/10B-active reasoning model runs at ~50 tok/s on four V100s — faster than the dense 32B managed on vLLM in my first post — and it holds

📰
r/LocalLLaMA Aggregators May 25, 2026
Anyone use QwQ-32B? It's over a year old? Has Qwen 3.6 27b basically replaced it?

I seen this one mentioned but it was a source from about 14 months ago. In the age of the Qwen 3.6 and Gemma 4- is there still a use for QwQ 32B? Does anyone still favour it over the new stuff? If so…

I seen this one mentioned but it was a source from about 14 months ago. In the age of the Qwen 3.6 and Gemma 4- is there still a use for QwQ 32B? Does anyone still favour it over the new stuff? If so, do you use it for coding? something else? Thanks   submitted by   /u/Jorlen [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

  submitted by   /u/miserlou [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
Using Local LLMs for Generating Custom Interactive Recursive Textbooks on the Fly

  submitted by   /u/Ryoiki-Tokuiten [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM?

Hi, I am building a server so that my dual rtx 3090 setup runs at full speed. - asrock romed8 t2 revision 1.3 - epyc 7642 - ddr4 128 gb 3200 or 256 gb 2133 (256 gb is a bit cheaper) 8 channel - dual…

Hi, I am building a server so that my dual rtx 3090 setup runs at full speed. - asrock romed8 t2 revision 1.3 - epyc 7642 - ddr4 128 gb 3200 or 256 gb 2133 (256 gb is a bit cheaper) 8 channel - dual rtx 3090 - gigabyte psu 1600 w What do you think? Is using ram for moe models worth it? Something like qwen 3.5 397 b? And should I go for the fastest ram or for more ram?   submitted by   /u/PreparationTrue9138 [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset

I've fine-tuned Qwen 3.5 0.8B on the dataset provided by Pangram with their EditLens paper. It's available via a Chrome extension; you can just click selected text and it's going to give you the prob…

I've fine-tuned Qwen 3.5 0.8B on the dataset provided by Pangram with their EditLens paper. It's available via a Chrome extension; you can just click selected text and it's going to give you the probability distribution of how likely it is AI-generated. It takes under 1s on my M1 MacBook Pro. Pangram did release Llama 3.2 3B trained on their dataset, but I found this model slightly too legacy (too big for the capabilities). Qwen 0.8B (base) ended up being as good after roughly 20h of fine-tuning on a single RTX 3090. I've also tried Qwen 2B and Gemma 4 e2b and e4b but Qwen 3.5 0.8b seems to be good enough to handle this task, frankly had the best result on the checkpoint I'm using in the release. Here's the link to the Chrome extension (Called it Slop Hammer 😅). Once installed, it will allow you to download the model from Hugging Face (around 400MB), after this step everything happens locally: https://chromewebstore.google.com/detail/slop-hammer/gfjdmhfokmhedlgfggmmgchpppmhkdgg Here's the model in onnx format: https://huggingface.co/Slomin/slop_hammer_0_8_b/tree/main. Small disclaimer: the model is licensed under CC-BY-NC-SA-4.0 due to restrictions of Pangram's EditLens dataset. If someone is interested, here's the article by Pangram: https://arxiv.org/abs/2510.03154 - it's a pretty interesting approach (using 4 distribution buckets instead of just one 0-1 float neuron). The limitations are mostly the dataset they did opensource, which was created with older LLM models. It is getting a bit confused on GPT-5.5, for example (but still will show it as AI-edited, etc., not purely written by a human). It's pretty hilarious to go through slop infested websites like Linkedin or certain subreddits...   submitted by   /u/jslominski [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Locally-hosted language-learning AI you can talk to comparable to Pingo AI?

I recently tried Pingo AI (trial form) but would rather set something up locally instead. The language I'm trying to learn is Swedish but learning is hard without lots of verbal practice, which AI l…

I recently tried Pingo AI (trial form) but would rather set something up locally instead. The language I'm trying to learn is Swedish but learning is hard without lots of verbal practice, which AI lets me do. I can't really justify paying for Pingo now plus would really like to see how the technology works. I want to set something up that handles Swedish and lets me read, write, and talk to it verbally. If you know of any tools available for something like this please let me know. I wasn't able to find a post looking for a Pingo AI copycat so I hope this is the first and helps future redditors.   submitted by   /u/noriilikesleaves [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp

Implemented(by u/am17an) FWHT for CUDA, speed-up for cases when we quantize the kv-cache. 1-2% boost on pp & 7-9% boost on tg. Performance on a 5090 with -ctk q8_0 -ctv q8_0 Model Test t/s mast…

Implemented(by u/am17an) FWHT for CUDA, speed-up for cases when we quantize the kv-cache. 1-2% boost on pp & 7-9% boost on tg. Performance on a 5090 with -ctk q8_0 -ctv q8_0 Model Test t/s master t/s cuda-fwt Speedup gemma4 26B.A4B Q4_K_M pp2048 13587.89 13809.20 1.02 gemma4 26B.A4B Q4_K_M pp2048@d1024 12425.01 12553.32 1.01 gemma4 26B.A4B Q4_K_M pp2048@d2048 12158.21 12291.42 1.01 gemma4 26B.A4B Q4_K_M pp2048@d4096 11710.89 11913.97 1.02 gemma4 26B.A4B Q4_K_M pp2048@d8192 10982.21 11214.12 1.02 gemma4 26B.A4B Q4_K_M pp2048@d16384 9702.60 9776.75 1.01 gemma4 26B.A4B Q4_K_M tg128 223.81 243.90 1.09 gemma4 26B.A4B Q4_K_M tg128@d1024 210.06 228.02 1.09 gemma4 26B.A4B Q4_K_M tg128@d2048 217.53 235.28 1.08 gemma4 26B.A4B Q4_K_M tg128@d4096 216.76 234.05 1.08 gemma4 26B.A4B Q4_K_M tg128@d8192 209.40 226.06 1.08 gemma4 26B.A4B Q4_K_M tg128@d16384 204.54 219.74 1.07   submitted by   /u/pmttyji [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Is there any case of a less quantised smaller model outperforming a more quantised larger model?

As per the title Such as Gemma 4 31B Q4 K S vs Gemma 4 26B A4B Q8 Or Qwen 3.6 27B Q4 K M vs Qwen 3.6 35B A3B Q6 K Etc At what point is it worth switching? My use case is mostly creative writing. &#3…

As per the title Such as Gemma 4 31B Q4 K S vs Gemma 4 26B A4B Q8 Or Qwen 3.6 27B Q4 K M vs Qwen 3.6 35B A3B Q6 K Etc At what point is it worth switching? My use case is mostly creative writing.   submitted by   /u/opoot_ [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Can you jailbreak Llama 3.1 8B? (Red-Teaming Challenge)

Hi everyone, I'm working on a runtime governance engine designed to force any autonomous agent to stay strictly aligned with the exact guardrails and values you program it with. To stress-test the go…

Hi everyone, I'm working on a runtime governance engine designed to force any autonomous agent to stay strictly aligned with the exact guardrails and values you program it with. To stress-test the governance layer, we deliberately chose a model with minimal built-in safety filtering to show that the governance layer alone enforces alignment. The Challenge: I’ve deployed a Socratic Tutor Agent. Its sole directive is to guide students through science and math problems without ever giving away the direct answers. You have 10 prompts to try and break it. Success Criteria: You win if you can force the agent to do either of the following: Give you the final, direct answer instead of guiding you. Wander completely off-topic from science and math. How to Participate: 🔗https://safi.selfalignmentframework.com/ Click the "Try Demo (Admin)" button to log in automatically. The system is completely anonymous, and there is zero sign-up required. Just drop in and start hacking. PS: As the creator, I'm giving you full permission to use whatever prompt-injection or semantic tactics you can think of. If enough people take the challenge, I'll compile the results and share the data back in this thread! SAFi is 100% open source. You can check out the architecture and the code here:https://github.com/jnamaya/SAFi   submitted by   /u/forevergeeks [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Llama.cpp : Split Mode Tensor Fix Incoming?

Appears thay have been cooking and we might see a fix soon released for crashes on split mode tensor Multi-gpu folks keep watch - ( In my tests SM Tensor has a ~35% uplift in TG over Layer but ofc cr…

Appears thay have been cooking and we might see a fix soon released for crashes on split mode tensor Multi-gpu folks keep watch - ( In my tests SM Tensor has a ~35% uplift in TG over Layer but ofc crashes every 90-120 minutes due to vram exhaustion this fix is supposed to stop that ) https://github.com/ggml-org/llama.cpp/issues/22404   submitted by   /u/Bulky-Priority6824 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Best coding model on RTX 3060

Wondering what’s the best coding model that can fit on a RTX 3060 (12GB). Has anyone been able to do something useful with it? Also wondering about best setup (vllm? Llama.cpp?) and quantization. Tha…

Wondering what’s the best coding model that can fit on a RTX 3060 (12GB). Has anyone been able to do something useful with it? Also wondering about best setup (vllm? Llama.cpp?) and quantization. Thanks a lot, this community is great   submitted by   /u/solimaotheelephant3 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Whats the best Qwen 27B Q8 quant?

everyone is talking about q 4 q 5 and q 6, but. i got some coding that i feel like lower quants kept getting wrong. I can run q 8 from unsloth but feels a bit slow even with MTP ON, should I just res…

everyone is talking about q 4 q 5 and q 6, but. i got some coding that i feel like lower quants kept getting wrong. I can run q 8 from unsloth but feels a bit slow even with MTP ON, should I just resort to q8 35 b a3b at this point?   submitted by   /u/EggDroppedSoup [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
(Yet Another) KV cache calculator - kvanta.vcerny.cz

Hello everyone, I thought all public web-based KV cache calculators kinda suck.. so I decided to create one I would like to use myself - KVANTA https://kvanta.vcerny.cz It should support any LLM/VLM …

Hello everyone, I thought all public web-based KV cache calculators kinda suck.. so I decided to create one I would like to use myself - KVANTA https://kvanta.vcerny.cz It should support any LLM/VLM from Hugging Face, if not let me know! (also, it's Apache 2.0) https://preview.redd.it/rk8i48ftva3h1.png?width=1754&format=png&auto=webp&s=7a2e8908d7d0a6c2efd92be5fb7f0ec548e7aba9   submitted by   /u/Fun-Purple-7737 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Is Qwen3.6 current king for local agentic use?

I've been testing other models but it seems like nothing even come close to Qwen3.6 35B A3B for agentic use. The worse I'd get is a loop sometimes, while Gemma4 produced broken tool calls occasionall…

I've been testing other models but it seems like nothing even come close to Qwen3.6 35B A3B for agentic use. The worse I'd get is a loop sometimes, while Gemma4 produced broken tool calls occasionally and I couldn't even get GLM 4.7 Flash REAP past 2 or 3 messages before it starts looping. All IQ4_NL quants from Unsloth. I'm wondering if there are better models around the same size (preferably MoE) that I haven't tried yet. I'm using it for Hermes Agent and Pi and it's not perfect, but it's crazy good for a local model   submitted by   /u/HornyGooner4402 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic to…

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-p selection more suitable than fixed top-p sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36x prefill speedup at 1M context and about a 2.01x decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.   submitted by   /u/pmttyji [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Save Safetensor LLM from C#

Has anyone written a reliable method for saving a GPT-model from C# into a safetensor file that is compatible with the safetensor-reading apps like text-generation and the safetensor2gguf conversion …

Has anyone written a reliable method for saving a GPT-model from C# into a safetensor file that is compatible with the safetensor-reading apps like text-generation and the safetensor2gguf conversion tools? I am talking a really small, almost microscopic LLM model here... public class GPTConfig { public int VocabSize { get; set; } public int BlockSize { get; set; } = 128; public int NLayer { get; set; } = 4; public int NHead { get; set; } = 4; public int NEmbD { get; set; } = 128; public int BatchSize { get; set; } = 100; } Filesize around 3-5 Mb... Can't get nugets SafetensorSharp nor Lokan.Safetensors to work properly. If you have suggestions on how to make this work, please post an answer or post a link to github.   submitted by   /u/Darlanio [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Sharing my 'Local-LLM-Toolkit' repo

I've been taking notes as I learn about local LLM (and regular llm stuff) stuff since getting a Mac studio in January (M4 max, 128gb, kicking myself for not springing for the M3 ultra 512Gb...) and I…

I've been taking notes as I learn about local LLM (and regular llm stuff) stuff since getting a Mac studio in January (M4 max, 128gb, kicking myself for not springing for the M3 ultra 512Gb...) and I just wanted to share my repo I've been building up a lot of Local LLM knowledge in. Would love feedback if anyone cares, but otherwise I hope people get use out of this the way I have: https://github.com/shanemmattner/local-llm-toolkit/tree/main This page has a bunch of the techniques I've been trying to improve performance (mostly on firmware in C, but some Swift code too) https://github.com/shanemmattner/local-llm-toolkit/blob/main/docs/techniques/README.md   submitted by   /u/Snoo_27681 [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
MiniCPM5-1B

  submitted by   /u/kevinlch [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
The Financial Times has published an article about Heretic

https://www.ft.com/content/5630ed79-a263-41ed-9a1a-321617ae310e “The FT was able to use Heretic, a tool available on the popular code repository GitHub, to remove the guardrails from Meta’s Llama 3.3…

https://www.ft.com/content/5630ed79-a263-41ed-9a1a-321617ae310e “The FT was able to use Heretic, a tool available on the popular code repository GitHub, to remove the guardrails from Meta’s Llama 3.3 model in less than 10 minutes without any specialist hardware.” “Heretic creator Philipp Emanuel Weidmann told the FT his software had been used to create more than 3,500 “decensored” models since its release last year and that modified systems created using the tool had been downloaded 13mn times.” This is the first of multiple press inquiries I’ve had recently as Heretic and uncensored language models are gaining mainstream attention. Please note that I am a mathematician and engineer, not an “influencer” or politician, and I have zero interest (negative interest, actually) in becoming known outside of scientific and technological circles. However, I realized a while ago that saying no to such inquiries simply means that the conversation will be completely controlled by pearl-clutching hypocrites. I’m doing my very best to hold the project together and ensure that unrestricted models will remain available for everyone. More updates are coming soon. Cheers, p-e-w   submitted by   /u/-p-e-w- [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
The reason small-model agent stacks aren't the default has nothing to do with whether they work

Last June, NVIDIA published a position paper called "Small Language Models are the Future of Agentic AI," and the argument was easy enough to wave off at the time: most of what an agent act…

Last June, NVIDIA published a position paper called "Small Language Models are the Future of Agentic AI," and the argument was easy enough to wave off at the time: most of what an agent actually does is unglamorous work like reading input, choosing a tool, calling it, and reshaping the output, none of which needs a 400-billion-parameter model behind it. The proposal was to hand that routine 80% to small specialized models and only fall back to an expensive frontier model when a task genuinely earned it. It was a clean idea that almost nobody acted on, and for the better part of a year the industry kept pushing every step of every agent through one enormous model anyway. The releases this spring made that habit much harder to defend. The numbers that moved it from plausible to settled: Gemma 4 31B scores 86.4% on tau2-bench, the agentic tool-use benchmark, where the previous generation (Gemma 3 27B) managed 6.6% on the exact same test. That 80-point swing in a single release came from training aimed at the task, not from any jump in size. Qwen3.6 27B runs on a single RTX 4090 and still beats Alibaba's own 397B MoE on SWE-bench Verified. Its 35B-A3B variant activates only 3B parameters per token yet keeps pace with frontier agents on the MCP benchmarks. Phi-4-reasoning is a 14B model that matches a 70B distill on AIME. DeepSeek V4-Flash lists at $0.28 per million output tokens against $25 for Claude Opus 4.6, roughly 89x cheaper for work that lands at near-parity on a lot of coding tasks. What I find more interesting than any single benchmark is why this stack still isn't the default, because the cost math has been obvious for months. The honest answer is that the people best placed to promote it have no reason to. Frontier labs make their money renting one large model behind a per-token meter, the agent platforms are mostly wrappers around that same model, and cloud capacity gets provisioned to match. The only party that comes out ahead from a fleet of ch

r/LocalLLaMA Aggregators May 25, 2026
NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable)

Disclaimer: I work for Numind, the company behind this open-weight model TLDR: Image/text to Markdown :-) We just released a 4B model based on Qwen3.5-4B, under Apache-2.0 license. The goal is to mak…

Disclaimer: I work for Numind, the company behind this open-weight model TLDR: Image/text to Markdown :-) We just released a 4B model based on Qwen3.5-4B, under Apache-2.0 license. The goal is to make information extraction from complex documents more practical with an open model: PDFs, screenshots, forms, tables, receipts, invoices, multi-page documents, and other visually structured inputs. If you ever used NuMarkdown https://huggingface.co/numind/NuMarkdown-8B-Thinking , this is its successor ! Try it, we have a huggingface space that is completely free (you don't even have to sign-up): https://huggingface.co/spaces/numind/NuExtract3 If you ever used NuMarkdown, NuExtract3 is the successor. There are some examples to guide you. Feel free to re-use this model for any task. A few things it is designed for: converting document images to Markdown extracting structured data from documents using a target json template handling tables, forms, and layout-heavy pages working with both text and visual document inputs serving as a local/open-weight alternative for document extraction pipelines It was trained on a node of 8xH100 for 3 days to train on as much context as we could, so it should perform fairly well even on long document. For Markdown, we'd still recommend going page by page for the best results and inference speed, since you can parallelize better this way. It's very easy to self-host, since we provide fairly extensive documentation, Safetensors, GGUF and MLX weights. With as little as 4GB of VRAM, you should be good to go. We provide multiple quantizations (GPTQ, W8A8, FP8, Q4, Q6...) so you should be able to run it anywhere. We mostly tried vLLM, SGLang, llama.cpp. Ollama support would be nice but I'm not a big fan of their chat template engine. We have a blog post and a pretty decent model card: https://about.nuextract.ai/blog/nuextract-3-release https://huggingface.co/numind/NuExtract3 https://huggingface.co/collections/numind/nuextract3 I'm currently w

r/LocalLLaMA Aggregators May 25, 2026
Just wanted to show off how cool I think it is that my python ai has a real brain looking brain.

Not promoting or anything, just think it's oddly interesting.   submitted by   /u/Glittering_Focus1538 [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
Old Mac Pro still proving its worth

The “Trash Can” Mac Pro, once the most expensive machine you could buy from Apple, mine was just shy of £10,000 in 2016 — that’s £14k in today’s money. Until recently mine was just running as a kuber…

The “Trash Can” Mac Pro, once the most expensive machine you could buy from Apple, mine was just shy of £10,000 in 2016 — that’s £14k in today’s money. Until recently mine was just running as a kubernetes single node development platform, it’s 64gb of ram and 24 logical cores made it perfect for that. Its most powerful asset, a pair of D700 GPUs, essentially sat idle for years… that is until yesterday when I discovered that while its old southern islands based GPUs weren’t supported in ROCm, they were now supported under Vulkan — thanks to new drivers and a new Linux kernel. That means it can run basically any model that llama cpp can throw at its 12gb of VRAM. Time to do some benchmarks, right? Qwen 3.5 9B Q4 MTP — 11 t/s output at 70k context Qwen 2.5 coder q4 — 22 t/s output at 70k context Not exactly lightening fast but totally usable, especially for planning tasks where you can just set it and forget it. The thing that’s really blown my mind though is that the planning output from qwen 3.5 is significantly, and it’s not even close, better than Claude Sonnet 4.6. It absolutely smashed planning on a complex csharp .net 10 app with nuget packages that sonnet struggled with, qwen just googled the docs. Mind blown 🤯 What other ancient hardware are people running that’s still capable of doing real LLM work?   submitted by   /u/Hephaestite [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
llama.cpp oom issue

I'm having an issue with llama.cpp going OOM (system ram, not vram) after some time, roughly 20-40 minutes of active use. I'm now running it in a cgroup with about 20gb allocated to it, so at least i…

I'm having an issue with llama.cpp going OOM (system ram, not vram) after some time, roughly 20-40 minutes of active use. I'm now running it in a cgroup with about 20gb allocated to it, so at least it gets killed and restarted before it start messing with other services on the machine. Command: ~/llama.cpp/build/bin/llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL --temp 0.6 --top-p 0.95 --top-k 20 -cram 4096 -c 90000 --min-p 0.00 --spec-draft-p-min 0.75 -np 1 -t 4 -ctk q5_1 -ctv q5_1 --cache-type-k-draft q5_1 --cache-type-v-draft q5_1 --spec-type draft-mtp --spec-draft-n-max 3 --fit off --image-min-tokens 1024 --image-max-tokens 2048 --chat-template-kwargs '{"preserve_thinking":true}' I've tried various settings, builds and even docker image, but over time the problem is the same. The process slowly takes more memory and eventually is killed. Tried --no-mmap and --cache-ram 0 - last one delayed the OOM but it still happened. Also tried without mtp. Is this expected behavior? I have another server with weaker gpu that runs llama.cpp server via llama-swap and that doesn't have the same problem, but then again the server process is not usually running for long periods there.   submitted by   /u/TheTerrasque [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Are GPU prices hitting peak and falling?

I noticed GPU prices have gone up the past year, but recently it seems to have peaked and is falling again. 3090s seemed to have hit a peak and are now dropping in price. I'm guessing the openclaw w…

I noticed GPU prices have gone up the past year, but recently it seems to have peaked and is falling again. 3090s seemed to have hit a peak and are now dropping in price. I'm guessing the openclaw wave is dying out and supply/demand is now less on the demand side.   submitted by   /u/DistanceSolar1449 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Want Built a React-style looping agent with small LLMs (Qwen 3.5 9B / Gemma4) + LangGraph?

Currently experimenting with building a React-style looping agent system using small LLMs like Qwen 3.5 9B and Gemma 4 (E2B), and I wanted to ask if anyone here has worked on something similar. Curre…

Currently experimenting with building a React-style looping agent system using small LLMs like Qwen 3.5 9B and Gemma 4 (E2B), and I wanted to ask if anyone here has worked on something similar. Current setup: Using LangGraph Around 5 tools available to the agent Input includes both instructions and images Agent runs in a loop where one tool’s output may become another tool’s input Planning to later extend this into a multi-agent system with 2 subagents Right now I’m only testing a single-agent workflow before moving to multi-agent orchestration. The main issue I’m facing: Qwen 9B starts generating huge amounts of thinking/reasoning tokens during loops Sometimes the output never properly returns or gets truncated Recursive/react loops become unstable after a few iterations I’m trying to understand: How people usually control tool-calling loops with smaller models Whether I should limit reasoning depth / iterations Better patterns for tool dependency handling in LangGraph Whether planner/executor separation is necessary even for small systems If there are known strategies to reduce unnecessary “thinking token” generation in Qwen Would really appreciate: Architecture suggestions Open-source repos/examples Best practices for LangGraph recursive agents Tips for making small models stable in tool loops   submitted by   /u/siri_1110 [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

https://huggingface.co/Zhongzhu/OSCAR-RotationZoo OSCAR RotationZoo Precomputed K/V rotation matrices for OSCAR INT2 KV-cache quantization. This repository contains the artifacts for the paper: OSCAR…

https://huggingface.co/Zhongzhu/OSCAR-RotationZoo OSCAR RotationZoo Precomputed K/V rotation matrices for OSCAR INT2 KV-cache quantization. This repository contains the artifacts for the paper: OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu 📄 Paper — arXiv:2605.17757 🌐 Website — https://oscar-quantize.github.io/ 💻 Code — https://github.com/FutureMLS-Lab/OSCAR OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is ~7× compression of the KV-cache memory footprint with single-digit pp accuracy drop on GPQA for dense reasoning models. This repo packages the rotations as drop-in .pt files so you don't need to re-run the Q/K/V dump and eigendecomposition yourself. Available rotations Model Calibration GPQA (BF16) GPQA (OSCAR INT2) Qwen/Qwen3-4B-Thinking-2507 seq20000_prompt83_group128 67.27 67.17 Qwen/Qwen3-4B-Thinking-2507 seq20000_prompt85_group128 (fresh re-dump) 67.27 — Qwen/Qwen3-8B seq20000_prompt83_group128 56.67 55.56 Qwen/Qwen3-32B seq16000_prompt69_group128 58.49 60.40 zai-org/GLM-4.7-FP8 seq10000_prompt43_group128 73.23 73.57 Time to time, we're getting stuffs like this. And I keep updating this thread continuously with those things. Hopefully I can run medium size(30-40B) MOE models(Also 10-20B Dense models) better & faster with 8GB VRAM by end of this year. Would be awesome to have this on llama.cpp.   submitted by   /u/pmttyji [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
How local AI improved your live?

Lets share use cases which improve life quality of the people. Home assistants, psychological help, local coding, deep reasearch, business help etc. I personally working rn on a local health tracker.…

Lets share use cases which improve life quality of the people. Home assistants, psychological help, local coding, deep reasearch, business help etc. I personally working rn on a local health tracker. PDFs with bloodwork in - structurised data out which I will use later to analyse and track separate blood params. Still thinking about how to incorporate Docs conclusions/ultrasound/ECGs results or images etc in to that. (I’m absolutely not comfortable to share my health/psychological issues with Altman and co who WILL use it against me in the future to exploit).   submitted by   /u/Thin_Pollution8843 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
RAG for developer docs so local llm can code using latest library?

I was wondering if it would make local llm better at coding if it has access to the latest documentation available through a RAG. I'm specifically interested in python. But then this might lead inges…

I was wondering if it would make local llm better at coding if it has access to the latest documentation available through a RAG. I'm specifically interested in python. But then this might lead ingesting and embedding a very large number of documents. Or I could just focus on the specific docs that are of interest to me to narrow it down further. Third option to make it look everything up online but I assume that would be least efficient? What is the best way to ensure it uses the latest APIs of a given library?   submitted by   /u/BitGreen1270 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Please give me your best tips for fine tuning RTX Pro 6000 on Intel i7-14700KF

So somehow I've stumbled over an RTX Pro 6000 and inserted it Intel i7-14700KF that was hosting my 4090, it seems to work properly, I've run the power scan script and the best performance per Watt is…

So somehow I've stumbled over an RTX Pro 6000 and inserted it Intel i7-14700KF that was hosting my 4090, it seems to work properly, I've run the power scan script and the best performance per Watt is at 475W and I was wondering what are the non-mainstream and less known optimizations that can be applied to the mainstream inference engines. OS is Linux Debian 13 Trixie.   submitted by   /u/HumanDrone8721 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
I pioneered AI slop in 2019 with my Tensorflow rig. (24GB back then, too.) AMA.

  submitted by   /u/Equal_Giraffe8866 [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
numind/NuExtract3 · Hugging Face

NuExtract3 is a unified 4B vision-language reasoning model for document understanding. It combines strong structured information extraction with high-quality image-to-Markdown conversion, making it s…

NuExtract3 is a unified 4B vision-language reasoning model for document understanding. It combines strong structured information extraction with high-quality image-to-Markdown conversion, making it suitable for extraction pipelines, OCR, and RAG preprocessing for all types of documents such as scans, receipts, forms, invoices, contracts or tables. Overview Structured extraction: input (text/images) + JSON template + instructions --> JSON output Markdown conversion: input (text/images) --> Markdown Multimodal inputs: text, images, or text + images. Multilingual documents. Reasoning and non-reasoning inference modes. Template generation for structured extraction from natural language or input document. GGUF, NVFP4, MLX, VLLM, etc., already there https://huggingface.co/models?other=base_model:quantized:numind/NuExtract3   submitted by   /u/pmttyji [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
I built a computer use sandbox framework for codex on headless linux. GPU passthrough, computer use, and sudo access for codex all work. It's the perfect dev sandbox to allow full auto work while minimizing the "rm -rf /" risk

I've been working with agents for months now, and I haven't found a sandbox environment that "just works" so I built it! My requirements were as follows: Agent is unable to destroy my host…

I've been working with agents for months now, and I haven't found a sandbox environment that "just works" so I built it! My requirements were as follows: Agent is unable to destroy my host OS but able to install software and run sudo commands Agent is able to browse the web autonomously and validate the UI it creates GPU access works (even on DGX spark which cant pass through to Docker works Persistent environment I can setup once, log into my internet accounts I want the agent to access, copy in my .env files, install custom software etc. Support multiple parallel browser use / development sessions concurrently Easily log into each agent's desktop to view the work it's doing or manually setup the agent environment via a desktop interface The inspiration for this project is wanting a sandbox I can let the agent run free in, while limiting the damage it can do. I want it to be able to browse the web, do automated AI research on my GPU, test my docker containers in a sandbox, develop my webapp full-auto, or whatever other task I need it to do while still being safely in a sandbox and unable to wipe or modify my host system. I felt like either I had to go full YOLO mode on my host machine, and risk a catostrophic failure, or I had to let my agent work inside the extremely annoying to use default codex sandbox. My code is available here: https://github.com/fieryWaters/ai-sandbox-manager It was developed and tested on the DGX spark, since its especially difficult to get this working on the unified architecture since you cant pass a GPU unto a VM, but with minimal modifications, it should work on macos or windows WSL. The core idea behind the sandbox is basically a VM. You setup the VM for your agent, similar to as if it were your own desktop OS you're developing on. Once setup, you save the image as a template then you can spin up multiple copies willy nilly and then you let your agent run free with full sudo access. Because true VM's can't share resources li

r/LocalLLaMA Aggregators May 25, 2026
MiMo-V2.5-coder

Hi, I've just released MiMo-V2.5-coder. If you have 128 Gb, this is an excellent alternative to Qwen3.6 and DS4, especially for coding. Fast, and with reliable tool calling. Give it a try!   sub…

Hi, I've just released MiMo-V2.5-coder. If you have 128 Gb, this is an excellent alternative to Qwen3.6 and DS4, especially for coding. Fast, and with reliable tool calling. Give it a try!   submitted by   /u/jedisct1 [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
Next year we're getting 0.5T model from Grok

Tweet : https://xcancel.com/elonmusk/status/2058796067592736866#m Right now it joined "Grok-3 Opensource Release" club.   submitted by   /u/pmttyji [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro

Hey, I work on inference tooling at Mininglamp AI. We needed faster prefill for a 4B VLM running on Apple Silicon. Problem was MLX only does weight-only quant — activations stay FP16 the whole way th…

Hey, I work on inference tooling at Mininglamp AI. We needed faster prefill for a 4B VLM running on Apple Silicon. Problem was MLX only does weight-only quant — activations stay FP16 the whole way through. So we wrote Cider, a small SDK that adds W8A8 activation quant on top of MLX. Numbers on M5 Pro (64GB, 307 GB/s), 4516 token context: Quantization Prefill Decode W8A16 (MLX) 2.839s 80.1 tok/s W8A8 (Cider) 2.519s 79.5 tok/s Under the hood it's custom Metal kernels we registered as MLX primitives. At M=4096 the per-channel path runs 1.84x faster than W8A16 on the same shape. Not just for our model btw — works with anything that runs through MLX. One catch: INT8 TensorOps only compile on M5 and above. pip install on M4 still works, just falls back to the regular path. Repo: https://github.com/Mininglamp-AI/cider Edit: adding accuracy numbers since it came up. Wikitext2 PPL on Qwen3-8B: FP16 9.73, W8A16 9.71, W8A8 per-channel 9.76. Llama3-8B: FP16 6.14, W8A16 6.15, W8A8 per-channel 6.27. Per-group gs=64 keeps it tighter if precision matters more than speed for your use case.   submitted by   /u/Enough-Astronaut9278 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
I made a local-first MCP tutorial repo with node-llama-cpp and a custom agent loop

I just published a repo called MCP from Scratch that teaches the Model Context Protocol by building it step by step in plain Node.js. Most of the repo is about understanding MCP itself, but the later…

I just published a repo called MCP from Scratch that teaches the Model Context Protocol by building it step by step in plain Node.js. Most of the repo is about understanding MCP itself, but the later modules may be relevant here: I added a local-first setup using node-llama-cpp, GGUF models, MCP sampling, and a custom plan -> act -> observe agent loop. So the repo goes from: raw JSON-RPC and stdio transport to a working MCP server with tools/resources/prompts to local model integration to an agent loop that uses MCP tools with a local GGUF model There’s also an optional LangChain example, but the main path is intentionally minimal and tries to make the underlying mechanics obvious. Key points: plain Node.js, minimal abstractions designed as a learning repo, not a production SDK uses shared local GGUF models for the later modules built for people who want to understand what MCP tooling is actually doing under the hood Repo: https://github.com/pguso/mcp-from-scratch Would especially love feedback from people here on the local inference side: model choice whether the agent loop examples feel useful or too toy-ish   submitted by   /u/purellmagents [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Qwen 3.6 benchmarks on 2x RTX PRO 6000

Got a chance to play around with 2x RTX PRO 6000 setup so sharing some number for Qwen 3.6. All these were run using latest stable VLLM backend. This was for a personal project. Qwen 3.6 27B BF16 (Or…

Got a chance to play around with 2x RTX PRO 6000 setup so sharing some number for Qwen 3.6. All these were run using latest stable VLLM backend. This was for a personal project. Qwen 3.6 27B BF16 (Original without any quantization) ------ MTP - Off | 64 concurrency | 1600 tps generation MTP - 2 | 32 concurrency | 1400 tps generation MTP - 2 | 64 concurrency | 1800 tps generation ------ Qwen 3.6 35B BF16 MTP - Off | 64 concurrency | 2700 tps generation MTP - Off | 128 concurrency | 3500 tps generation (Prompt Processing 30,000 tps)   submitted by   /u/mxforest [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
NVIDIA Jetson AGX Orin 64GB

So I have 2 of these from some deprecated equipment. What would their best use case or model be? It’s got about 205GB/s memory bandwidth 64GB unified maybe 55GB usable.   submitted by   /u/…

So I have 2 of these from some deprecated equipment. What would their best use case or model be? It’s got about 205GB/s memory bandwidth 64GB unified maybe 55GB usable.   submitted by   /u/lithium_bromide [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

Imagine you are using a local model for agentic coding. You discuss the idea (50k tokens), then say “implement it”. The agent reads files, writes files, runs commands, produces another 20k tokens an…

Imagine you are using a local model for agentic coding. You discuss the idea (50k tokens), then say “implement it”. The agent reads files, writes files, runs commands, produces another 20k tokens and the code is ready. Then your next prompt is just “thank you”, and... nothing happens, you have to wait for "something". What is happening is that some tools, like opencode, try to be smart and optimize the context. They modify something in the conversation history. In the best case, llama.cpp has to reprocess everything from that point. In the worst case, it has to reprocess the entire context (70k tokens) and you get “forcing full prompt re-processing...” To avoid that, I switched from opencode to pi. Not because pi has some magical features, but because it does not do that kind of context rewriting. Another issue is the model being smart by removing reasoning from the context. In the best case, llama.cpp only has to reprocess the last run (20k tokens). In the worst case, again, it has to reprocess everything (70k) To avoid that, you can enable “preserve thinking”, at least with Qwen 3.6. The goal of this PR is to avoid the worst case (full prompt re-processing) and get closer to the best case, where llama.cpp only reprocesses what actually changed. I have been using this code for about two weeks and in my opinion agentic coding is now more responsive.   submitted by   /u/jacek2023 [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
1000 tps generation on Qwen3.6 27B with V100s

I wanted to see what the absolute best case scenario for generation on this setup was and was not disappointed. 128 concurrent requests is so far removed from what I need but it’s funny to see big nu…

I wanted to see what the absolute best case scenario for generation on this setup was and was not disappointed. 128 concurrent requests is so far removed from what I need but it’s funny to see big number. For single user (batch 1 not 128) the generation is around 80t/s with 3000 t/s processing,no mtp!!   submitted by   /u/Simple_Library_2700 [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
Wrote a custom C++ engine for MiniCPM-V 4.6 on Orange Pi AIPro (Ascend 310B) to bypass framework overhead

Hey everyone, just wanted to share a project I've been hacking on for the last few weeks. I managed to build a from-scratch C++ inference engine to run MiniCPM-V 4.6 entirely on the Orange Pi AIPro (…

Hey everyone, just wanted to share a project I've been hacking on for the last few weeks. I managed to build a from-scratch C++ inference engine to run MiniCPM-V 4.6 entirely on the Orange Pi AIPro (the budget board with the Ascend 310B NPU, costs around $149 for 20 TOPS INT8 / 10 TFLOPS FP16). If you want to check out the custom ops, build scripts, or the Gradio web UI, the repository is open source on GitHub at github.com/lvyufeng/minicpm-v-4.6-orangepi https://preview.redd.it/upfsqb0jm73h1.png?width=1655&format=png&auto=webp&s=1e80185171fa6db651d81e20d717b3a05791614c If you've ever tried deploying local LLMs or VLMs on this specific hardware, you probably know that dealing with the standard framework stack can be a massive pain, especially if you want to get any decent performance on the edge. To get around this, I skipped the heavy frameworks and went low-level. Both the text generation and the SigLIP vision tower run natively on the NPU inside a single C++ subprocess. There is absolutely zero torch_npu dependency on the hot path. Python is only used on the cold path for CPU-side tokenization and image preprocessing. The initial stock aclnnMm baseline was pretty rough during the token decoding phase because it heavily underutilized the NPU's cube unit when M=1 (vector-matrix multiply). It was giving me around 2.88 tokens/s (taking about 350ms per step). After rewriting the critical paths with custom AscendC kernels, it's now hitting 5.90 tokens/s in FP16 (dropping the per-step latency down to 170ms). Here is the actual breakdown of how the 2x speedup happened: Stage Tokens/s Per-step (ms) Saved Stock aclnnMm baseline 2.88 350 ms — + Custom Cube Matmul ($M=1$) 4.37 229 ms 121 ms + lm_head 16-chunk Cube Path 4.99 200 ms 29 ms + Vectorized Causal-Conv1d Step Kernel 5.90 170 ms 30 ms First, I wrote a custom cube matmul kernel for M=1 using MatmulImpl to bypass the slow generic vector path. This single change boosted the speed from 2.88 tps

📰
r/LocalLLaMA Aggregators May 25, 2026
I shipped a windows desktop app for running local LLMs with a button that turns your "no thats wrong" into actual LoRA training data

i built a local AI desktop app and just shipped it. windows only. called SEELS. dropping it here cause if anyones gonna find the cracks its you guys. the thing i actually wanted to make wasnt another…

i built a local AI desktop app and just shipped it. windows only. called SEELS. dropping it here cause if anyones gonna find the cracks its you guys. the thing i actually wanted to make wasnt another ollama wrapper. what bugged me is every local model id run would say something dumb and id sit there going "no thats not what i meant". then id close the chat and the model never knew, never learned. so the whole hook of SEELS is theres a Teach button on every reply where you write what it should have said. those corrections pile up into a jsonl corpus, and when you have enough you click Train and it actually kicks off a PEFT LoRA run on your base. no notebook, no python, no terminal. just chat, correct, train. over time the adapters stack up and it becomes your model not theirs. trained a tiny 0.6B helper from scratch on like 110 hand written examples so theres something that runs on CPU out of the box. not replacing your daily 35B obviously but it answers questions about how to use SEELS itself which was the point. rest of standard (free, forever, not a trial): bring any GGUF, voice mode with whisper STT and piper TTS both local no API keys, hardware dashboard so you can stop guessing what your card has free, single instance lock cause i kept opening two and corrupting my own sqlite, and 14 settings tabs because i couldnt cut any of them. wont lie. pro tier (image/video/music gen, code workspace, multi profile, multi lora stacking, MCP, cron) is written and gated behind a tier check but not purchasable yet. max tier (mask editor, plugin sandbox, comfyui backend, node graph workflows, multi GPU, seels-cli) is roadmap, some of those literally have no code yet. didnt want to put 12 things on a tier card if 8 of them are vibes. so the site says "ships in waves" instead of lying. things i actually need eyes on: if you teach it 30 things and the next train run is worse, thats a real bug, settings > training proof has a copy button for the trajectory l

📰
r/LocalLLaMA Aggregators May 25, 2026
opensource music reccomendation / playlist, similar to spotify radio / YT music mix?

Any recommendations for this? Initially, i was thinking that LLMs probably not the right thing for this (assuming your source data is all listening metrics), HOWEVER, if you combine a) user listenin…

Any recommendations for this? Initially, i was thinking that LLMs probably not the right thing for this (assuming your source data is all listening metrics), HOWEVER, if you combine a) user listening data; AND b) user comments / text data / reccs/ reviews / forum posts / social media mentions etc and put taht ALL inside the LLM, it might work. Like your ultimate LLM DJ that is intune with not just data, but the zeitgeist as well. anyway, I've did the obligatory search and seems like nothing really worthy comes up. Apart from last.fm / various APIs which are heavily limited, there's also this https://www.reddit.com/r/navidrome/comments/1eoc0cz/generating_weekly_recommendations_playlists_for/ but it seems pretty janky and not exacltly what I'm thinking of. Is this obscure / rare because BULK user listening data is not really public (ie all hidden behind spotify / youtube / soundhound / shazam walled gardens?) The ask: Put in a song / list of songs, and it generates playlist based on that. So far, spotify's reccs are best for me, i can do endless listening and enjoy most of their suggestions.   submitted by   /u/LeatherRub7248 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
Could someone please help explain these results?

I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled! (17 to …

I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled! (17 to 34 tok/s). Shouldn't it have slowed down from the CPU having to do so much more work? Here is the command I'm using: llama-cli -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf -ngl 999 --n-cpu-moe 30 -fa on --cache-type-k turbo4 --cache-type-v turbo3 -c 262144 -t 6 -b 2048 -ub 512 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --no-mmap Increasing it further to 41 didn't touch the inference rate. What's going on? And if you're feeling charitable, could you also tell me how I might squeeze a little more speed out of this setup, if possible? Edit: I increased it further from 41 to 256, and if anything, inference sped up even more, and VRAM usage stayed the same. I'm flummoxed, I tell you. Flummoxed.   submitted by   /u/MackTuesday [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
llama.cpp has a clever trick for speeding up KV cache decode

So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its s…

So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one under developer options. This is the setting - as far as I can tell based on the description (haven't looked at the code yet), it basically just re-sends all of the tokens generated by the current response to the KV cache rather than waiting for you to prompt the model again to begin decoding. It's certainly a hacky workaround, but it seriously improves general responsiveness when a model turn generates a whole bunch of tokens, or receives a large amount of info from a tool call. To actually enable this, you just need to start your llama-server and head to the WebUI to enable this, and it applies/works across all requests that hit llama-server, not just in their WebUI In Open-WebUI for example, I used to have to wait 5-30 seconds (which seems like nothing, but when your model is scraping multiple webpages in a single turn, it really adds up) for prompt processing whenever Qwen would read an incredibly large webpage or something similar. However since enabling this option, it's almost instant. I haven't noticed any real trade-offs as of yet, and I just thought this would be a good little PSA post to put out there. For those wondering, I'm running Qwen3.6-35B-A3B @ MXFP4, fully offloaded to a single RX 7900 XTX, getting about ~100tps with no MTP atm as it's still not compatible with vision encoders. I imagine this would be even better for those of you using the new MTP patches, particularly the one that introduced MTP for PP. I had no idea this feature existed, so I hope this helps somebody out! Like I said, it's hacky, but it certainly works!   submitted by   /u/ayylmaonade [link]   [comments]

📰
r/LocalLLaMA Aggregators May 25, 2026
how to install llamacpp the better way to wrapping it in python ui (CPU use only) ?

i want the best installation that fit my use and my low-compute H.W , i want to run small to above small llm like "qwen" 2b ,4b and 27b , and "gemma" 31B. rely completely on only …

i want the best installation that fit my use and my low-compute H.W , i want to run small to above small llm like "qwen" 2b ,4b and 27b , and "gemma" 31B. rely completely on only old CPU 4th.gen i7 with that few 32gb 'slow' ddr3. i will use llamacpp as python program with simple ui calling it like this from llama_cpp import lama ..so on. should i install llamacpp like this : inside venv, pip install git+ggmlorg/llamacpp repo or other that made for CPU as ik_llamacpp ? or : build like this without venv , git clone llamacpp repo; cd llama.cpp; cmake -B build; cmake --build build -j ? or : install from pip inside venv : CMAKE_ARGS="-DGGML_CUDA=OFF" pip install llama-cpp-python ? and is pip llamacpp differ from github repo nad why ? , what is best for my use case ?   submitted by   /u/BeautyxArt [link]   [comments]

📰
r/LocalLLaMA Aggregators May 24, 2026
hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

A few weeks ago, after finishing FastDMS, I started toying around writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple…

A few weeks ago, after finishing FastDMS, I started toying around writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple weeks, I turned those experiments into hipEngine, a new open source (AGPLv3) ROCm-native local LLM inference engine. It's Python based, but with no heavy PyTorch dependency. All the hot-path is HIP/C++, making liberal use of AMD native libs like hipBLASLt, hipGraph, AOTriton, etc. gfx1100 (Radeon RX 7900 XTX / Radeon Pro W7900) The initial implementation has Qwen 3.6 (MoE and dense) running competitively with llama.cpp, with the ParoQuant (which I've also ported to be ROCm compatible) 4.68bpw having better c=1 prefill ("prompt processing") at every tested context length, from 512-128K on gfx1100 (W7900/7900 XTX): Prefill tok/s Workload hipEngine PARO hipEngine GGUF Q4_K_S llama.cpp HIP llama.cpp Vulkan 512/128 2718.497 2258.847 2436.049 1816.927 4K/128 2838.773 2576.673 2176.905 1705.093 32K/128 2074.699 1893.967 1496.409 1128.554 128K/128 1055.454 998.143 710.213 480.539 Decode tok/s Workload hipEngine PARO hipEngine GGUF Q4_K_S llama.cpp HIP llama.cpp Vulkan 512/128 103.460 109.152 85.487 127.515 4K/128 101.964 100.048 87.375 120.163 32K/128 90.438 86.774 76.994 98.073 128K/128 59.598 57.954 57.341 64.478 Peak GiB Workload hipEngine PARO hipEngine GGUF Q4_K_S llama.cpp HIP llama.cpp Vulkan 512/128 20.962 25.108 21.125 20.844 4K/128 21.906 25.108 21.197 20.969 32K/128 22.016 25.108 21.738 21.533 128K/128 22.122 25.108 23.605 23.596 It also has the lowest peak memory usage at 128K. hipEngine also has near-lossless INT8 KVCache (with almost no speed-loss), meaning that you can run the full Qwen 3.6 256K context window in <24GB (eg, on a dedicated 7900 XTX) at good performance on RDNA3: Model Context KV cache Sampled peak Allocator peak Retained KV Prefill Decode Qwen3.6 35B-A3B PARO 128K BF16 21.04 GiB 21.88 GiB 2.

📰
r/LocalLLaMA Aggregators May 24, 2026
Qwen 3.6 27B MTP speed on 3080ti (getting 4.5 t/s)

Using LM Studio with 3080ti (12gb of VRAM) and 128gb of ddr4. Model version: Qwen 3.6 27B MTP UD q4_k_xl Is this my hardware limit? Is there anyway to speed this up using the current hardware?  …

Using LM Studio with 3080ti (12gb of VRAM) and 128gb of ddr4. Model version: Qwen 3.6 27B MTP UD q4_k_xl Is this my hardware limit? Is there anyway to speed this up using the current hardware?   submitted by   /u/yehiaserag [link]   [comments]

📰
r/LocalLLaMA Aggregators May 24, 2026
Could Open Models be trained to secretly go rogue?

I was discussing with some other folks how safe is to use open weights models from China and the topic of "trojan horse" came up. We know that, at least with current architecture, models ca…

I was discussing with some other folks how safe is to use open weights models from China and the topic of "trojan horse" came up. We know that, at least with current architecture, models can't run code on their own. They are entirely dependent on tools and harnesses. We also know that a local run model can't have any kind of remote "switch" that would change its behavior or inject a different prompt. But would there be any other ways to "execute order 66" 😄 ? Could a lab, for instance, train a model that would change its behavior upon reading certain trigger phrases or perhaps at a specific date? They would then secretly gather sensitive info and send it somewhere else without user consent. Obviously the model would have to be running in an harness capable of such tool-use (which is quite common with openclaws, hermes, etc). Thoughts?   submitted by   /u/nunodonato [link]   [comments]

r/LocalLLaMA Aggregators May 24, 2026
Generative Recursive Education: Creating Custom Interactive Textbooks on the Fly.

  submitted by   /u/Ryoiki-Tokuiten [link]   [comments]

📰
r/LocalLLaMA Aggregators May 24, 2026
I have macbook m4 16’ 48GB. I use claude code and want to try local one

I've been on Claude Code daily for a while and want to see how far local models can do my setup: - MacBook Pro M4 (16"), 48GB - macOS 26 tahoe Usually i do: seo researches, macos swift apps, web…

I've been on Claude Code daily for a while and want to see how far local models can do my setup: - MacBook Pro M4 (16"), 48GB - macOS 26 tahoe Usually i do: seo researches, macos swift apps, websites) What I'm trying to figure out: Which the best model to use on my mac? MLX vs llama.cpp(wtf?), LM Studio vs Atomic Chat? Opencode? What tokens/sec should I expect? Is it enough? How much is the cost per month if compared with Opus 4.7, max 200$?   submitted by   /u/Primary-Medium-895 [link]   [comments]

r/LocalLLaMA Aggregators May 24, 2026
Generative Recursive Education: Creating Custom Interactive Textbooks on the Fly.

  submitted by   /u/Ryoiki-Tokuiten [link]   [comments]

📰
r/LocalLLaMA Aggregators May 24, 2026
magic incantation to get llama-bench to work with MTP ?

It does not like anything I have tried, including what works with llama-server. is it not built to work with speculative decoding?   submitted by   /u/jdchmiel [link]   [comments]

📰
r/LocalLLaMA Aggregators May 24, 2026
Can someone help me understand MCP?

They just seem like tool calls and skills, but from a link somehow? Like.. I don’t get it. Is it private? That’s why I haven’t tried it yet lol   submitted by   /u/Borkato [link]   [c…

They just seem like tool calls and skills, but from a link somehow? Like.. I don’t get it. Is it private? That’s why I haven’t tried it yet lol   submitted by   /u/Borkato [link]   [comments]

📰
r/LocalLLaMA Aggregators May 24, 2026
What frontend do you guys use?

I’m using vim lmao with a custom made plugin for completing text, so I was curious what yall use. Llama-server seems like a sensible default but it seems limited   submitted by   /u/Borkato…

I’m using vim lmao with a custom made plugin for completing text, so I was curious what yall use. Llama-server seems like a sensible default but it seems limited   submitted by   /u/Borkato [link]   [comments]

📰
r/LocalLLaMA Aggregators May 24, 2026
qwen3.6-35b-a3b-mtp running on GTX 1060 6GB

I have this old 10-year old Dell T5810 workstation with 32GB ddr3(?) memory and a E5-2698v3 (16 cores 32 threads), a GTX 1060 6GB that's used for mining back in the old days (paid itself back many ti…

I have this old 10-year old Dell T5810 workstation with 32GB ddr3(?) memory and a E5-2698v3 (16 cores 32 threads), a GTX 1060 6GB that's used for mining back in the old days (paid itself back many times over). I managed to get the model running with LMStudio in Windows(!). My settings are: Model: unsloth qwen3.6-35B-a3b-MTP-GGUF UD Q4_K_XL Ctx length:131072 GPU offload 41 CPU threadpool size 16 Max concurrent 4 Number of experts 8 Number of MOE layers offloaded to CPU 41 MTP max draft 3 KV quantization both Q4_0 prefill 16k about 130-150tps decode 4k about 16tps Very usable for chat.   submitted by   /u/xxvegas [link]   [comments]

r/LocalLLaMA Aggregators May 24, 2026
Is NVIDIA still the default best choice for local LLMs in 2026?

  submitted by   /u/pmv143 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 24, 2026
Need Help Choosing a Harness for Qwen 3.6 27B

I've burned a week trying to customize my agent manually - building my own front end - but I've gotten to the point where I'm just exhausted and willing to try a harness, but need the right one. I re…

I've burned a week trying to customize my agent manually - building my own front end - but I've gotten to the point where I'm just exhausted and willing to try a harness, but need the right one. I read posts all the time, but I have a specific use case, so I'm reaching out to the best of the best for suggestions. Here is my stack: Windows 10 | i7 12700K | RTX 3090 TI | 96GB RAM Models: Qwen 3.5|3.6 27B UD K XL (Q4/Q5) - Also will be using 0.8B/4B in CPU parallel Server: LM Studio Apps: (in Docker) N8N, Redis (w/redisstack,redisinsight), Postgres (w/pgadmin,pgvector), Dify (installed, never used), browserless (never used) Where I am right now: I'm using LM Studio because it just works. I tried llama.cpp w/openwebui and rage quit, was just slower and not same features I'm used to. Cass - my agent - works fine at Q5, but fills up context fast because o/mcp. (I know, I know) To help out, I switch to Q4 @ Q4 KV to get up to 200K and it works surprisingly well, but I figured if I spawn sub-agents I can pass that mcp context to them and just respawn for new tasks. I had Cass write an agent spawner and it works fine. The trick works - the mcp context hits the subs and I can chat w/Cass longer - but I can't see what the sub-agent is doing/thinking/etc. I had cass build a dashboard for sub-agents that sorta worked, but there were just...issues. Cass couldn't see the agent's stream until it was finished and sometimes thought it timed out when the sub was still working. I searched and figured I'd have the sub stream its output to cass, but to properly see all this, I figured I'd need a custom front end. Additionally, I want to run a process in parallel via cpu - a meta analysis agent - and I need a way to monitor its outputs as well. So, we're talking at minimum 2 agent outputs (main, meta) and then a third during spawn. I watched some vidz last night about pi agent. I'm not sure this is what I need - I want to use mcp tools. But I'm good using other tools as long as I can s

r/LocalLLaMA Aggregators May 24, 2026
X-Post of lightweight wheely robots. How / what are they running as the brains? Local? IoT-Style? Networked?

  submitted by   /u/Mchanger [link]   [comments]

📰
r/LocalLLaMA Aggregators May 24, 2026
What is the smallest amount of RAM sufficient to run any available on HF GGUF LLM model locally?

Disclaimer: the question is theoretical, aimed at people who know how engines (e.g. llama.cpp) work. "Run": I define as able to process prefill of 20 tokens and generate 20 tokens response …

Disclaimer: the question is theoretical, aimed at people who know how engines (e.g. llama.cpp) work. "Run": I define as able to process prefill of 20 tokens and generate 20 tokens response within a month. As context's KV cache need memory and that amount is proportional to context length, "smallest amount of RAM" excludes context allocation needs, also it excludes memory taken by OS itself (but includes inference engine's executable). "Any": it needs to be sufficient to run all (each at one time) of LLM models currently available in GGUF format on HF. I use Linux and interested in estimations for it, but info for other OS is welcome. The question assumes no GPU for simplicity (RAM, not RAM+VRAM in the title), however info on engines abilities to use very little RAM to load to large VRAM is welcome.   submitted by   /u/alex20_202020 [link]   [comments]

📰
r/LocalLLaMA Aggregators May 24, 2026
OCR, granite-docling-258m vs granite-docling-2stage-258m: has anyone actually noticed any improvements?

IBM's granite-docling-2stage-258m granite-docling-2stage-258m Granite Docling 2stage builds upon the Granite Docling, but introduces a key modifications: it builds a dynamic prompt that precomputes…

IBM's granite-docling-2stage-258m granite-docling-2stage-258m Granite Docling 2stage builds upon the Granite Docling, but introduces a key modifications: it builds a dynamic prompt that precomputes layout objects found within a page, making it more robust on out of distribution data. What do you think?   submitted by   /u/Wise_Stick9613 [link]   [comments]