Newswallah — AI & Tech News

r/LocalLLaMA Aggregators May 26, 2026

Turning local agents into self-optimizing agents

I was experimenting with a self-optimizing agentic pipeline to climb the benchmark leaderboard (TerminalBench). On a 10-task subset, I got the performance to rise from ~30% → ~90%. That loop worked, …

I was experimenting with a self-optimizing agentic pipeline to climb the benchmark leaderboard (TerminalBench). On a 10-task subset, I got the performance to rise from ~30% → ~90%. That loop worked, so I asked: can the same reflect-and-rewrite step run continuously against everyday chats instead of a benchmark? How it works Every chat with your local LLM goes through a small proxy and is logged. autoswarm reflect has the same local model review those logs, distill concrete lessons, and write them to skills.yaml. Lessons auto-inject into the system prompt of future chats. Run it (LM Studio path) Start LM Studio's local server and load a model. ```bash pip install -e . autoswarm doctor # verifies LM Studio is reachable autoswarm start # auto-detects upstream + model, listens on :8080 I'm genuinely fascinated by the idea of self-optimizing agents, and I believe there's something bigger to uncover there. That said, this is just a hobby project and I'm still experimenting with it. Would love your feedback! Link: https://github.com/arteemg/autoswarm I'm actively working on the project, so please ⭐ the repo to stay updated.   submitted by   /u/Rude_Substance_8904 [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

Long-context performance at lower quants

I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing tha…

I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing that usually once I hit around 75-80k context use, it starts to get dumb all of a sudden. It just hits a brick wall and quality deteriorates rapidly and drastically. It'll begin hallucinating, forgetting things, or think something it said/suggested was actually something that I said. I found I have to compact before I get to that point, and then it keeps going on just fine. Is this because I'm running Q3? Unfortunately Q4 is just outside of the capability of my system specs unless I want to start disk swapping. So is it just an issue with this particular model? Or because it's Q3? Are there llama.cpp settings that can help? I'm already using BF16 KV cache.   submitted by   /u/_TheWolfOfWalmart_ [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face

MOSS-TTS-v1.5 MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyi…

MOSS-TTS-v1.5 MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For the full 1.0 feature walkthrough, input schema, decoding hyperparameters, and evaluation tables, please refer to the MOSS-TTS 1.0 README. Compared with MOSS-TTS 1.0, v1.5 focuses on the following improvements: Stronger multilingual synthesis with language tags: when the language field is omitted, v1.5 may improve some languages and regress slightly on others compared with 1.0. When the language is specified, v1.5 is stronger than 1.0 on almost all supported languages. Set the tag when building the user message, for example processor.build_user_message(text=text_fr, language="French"). More stable voice cloning: v1.5 improves speaker similarity and reduces cloning variance, making repeated generations more consistent. Better long-reference, short-text cloning: v1.5 handles scenarios where the reference audio is much longer than the target text more reliably than 1.0. More stable punctuation-following prosody: v1.5 follows punctuation-driven pauses more closely, especially in long sentences. Explicit pause control: v1.5 supports inline pause markers such as "[pause 3.2s]". For example, 我今天学习了一首中国的古诗，它的名字是[pause 3.2s]静夜思！ inserts an explicit 3.2s pause before 静夜思. Supported Languages MOSS-TTS-v1.5 currently supports 31 languages. It keeps the 20 languages supported by MOSS-TTS 1.0 and extends multilingual continued training to additional languages including Cantonese, Dutch, Finnish, Hindi, Macedonian, Malay, Romanian, Swahili, Tagalog, Thai, and Vietnamese. They released additional model as well. https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect-v2.0   submitted by   /u/pmttyji [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

Feedback Wanted: Building for easier local AI

Just what the post says. Looking to make local AI easier so literally anyone can do “all the things” very easily. We built an installer that sets up all your OSS apps for you, ties in the relevant mo…

Just what the post says. Looking to make local AI easier so literally anyone can do “all the things” very easily. We built an installer that sets up all your OSS apps for you, ties in the relevant models and pipelines and back end requirements, gives you a friendly UI to easily look at everything in one place, monitor hardware, etc. Currently works on Linux, Windows, and Mac. We have kind of blown up recently and have a lot of really awesome people contributing and building now, so it’s not just me anymore it’s people with Palatir and Google and other big AI credentials and a lot of really cool people who just want to see local AI made easier for everyone everywhere. We just finished automatic multi GPU detection and coordination as well, so that if you like to fine tune these things you can, but otherwise the system will setup automatic parallelism and coordination for you, all you’d need is the hardware. Also currently in final tests for model downloads and switching inside the dashboard UI so you can manage these things without needing to navigate a terminal etc. I’d really love thoughts and feedback. What seems good, what people would change, what would make it even easier or better to use. My goal is that anyone anywhere can host local AI on anything so a few big companies can’t ever try to tell us all what to do. That’s a big goal, but there’s a lot of awesome people that believe in it too helping now so who knows? Any thoughts would be greatly appreciated!   submitted by   /u/Signal_Ad657 [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

[OSS] dlmserve - first serving engine for diffusion language models

Spent the last few months building this on a single RTX 5070. Quick context: diffusion language models (like LLaDA from gsai-ml) are a different beast from GPT-style autoregressive LLMs. Instead of g…

Spent the last few months building this on a single RTX 5070. Quick context: diffusion language models (like LLaDA from gsai-ml) are a different beast from GPT-style autoregressive LLMs. Instead of generating one token at a time, they start with a fully masked sentence and iteratively denoise the whole thing in parallel. Cool tech, but mainstream serving engines are all built around the autoregressive contract, so none of them serve diffusion LLMs. dlmserve fills that gap: OpenAI-compatible HTTP API (/v1/chat/completions) Automatic continuous batching at the denoising-step level Optional LocalLeap acceleration baked in Token-identical to the reference HF implementation at temperature=0 2.5x throughput vs HF at batch=4, plus another ~1.8x from LocalLeap Runs in 12 GB VRAM (RTX 3090/4090/5070 all fit). MIT licensed. Repo: https://github.com/iOptimizeThings/dlmserve Install: pipx install dlmserve (or pip install dlmserve if you're in a venv) First public OSS project of this size for me. Genuinely curious what people think. Feedback and code review very welcome, also happy to answer questions about the diffusion serving architecture Edit: Roadmap: - v0.1 ✓ LLaDA-8B-Instruct + LLaDA-1.5 - v0.2 Dream-7B + DiffuLLaMA (issues already open) - v0.3 block diffusion + LLaDA-2.0 + Fast-dLLM KV cache   submitted by   /u/Glittering_Painting8 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

Harbor v0.4.19 - vllm/sglang/llama.cpp launch codex/claude/pi/opencode

I'm usually not posting about Harbor releases out of the respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tool…

I'm usually not posting about Harbor releases out of the respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tools with local inference backends. For example, to run pi + vllm: # model downloaded and configured harbor up vllm # Harbor knows that vllm is running and will use it harbor launch pi Additionally, launch can proxy requests through built-in optimising LLM gateway which automatically injects and resolves tools, such as web search, so you can add web search to an agent by just appending --web to the command and Harbor will pre-wire everything: harbor launch --web --model qwen3.5:4b --backend ik_llamacpp mi -p 'Find recent releases of agentic tools and write a two sentence overview' You can find many more details in the wiki here: https://github.com/av/harbor/wiki/3.-Harbor-CLI-Reference#harbor-launch-launch-options---service-servicetool-args Thank you!   submitted by   /u/Everlier [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

Okay 27B made me a believer

I previously hated on this model, but I have just been impressed by it, and I understand the hype now. I have been working on a HTML5 game console and I decided to see if Qwen3.6 27B can handle makin…

I previously hated on this model, but I have just been impressed by it, and I understand the hype now. I have been working on a HTML5 game console and I decided to see if Qwen3.6 27B can handle making some quick games in it to showcase functionality (save games, console API handling for stat tracking and heartbeat management, meta data for the game, etc) I gave it 3 files, explaining how the API works, the gamepad controls, and a typescript shader for it to apply. Then I just game it a very simple prompt "make a breakout game for this console, in the working directory are reference files on how to make it". First result was immediately playable, controls made sense, graphics style was was unique and appropriate, sound worked, console API all worked, and it felt good and was actually fun. It added flair that made it not feel like the vibecoded breakout clone it was. It went way above and beyond the minimum that I've seen so many LLMs do. It was not lazy in the slightest. It's a simple test, but this is something everything but something like Opus could handle. There wasn't anything particularly done well, it's just that the whole game was nearly complete in a single shot and it felt like thought was put into the entire game. All I needed was one follow up for customization and a single glitch and it was already what I would consider complete. And this was on a 27B model with Opencode. The best way I can describe it, is that it was congruent. Now I just wish I went the Nvidia card route instead of Strix Halo cause the speed isn't great. Maybe 3.7 35B A3B can have some of this magic.   submitted by   /u/Forward_Jackfruit813 [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

Tencent Hy-MT2 is now under Apache License 2.0

nice update bois   submitted by   /u/sword-in-stone [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

Keye-VL-2.0-30B-A3B -- Introducing DSA attention into multimodality for the first time

Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capa…

Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family. https://huggingface.co/Kwai-Keye/Keye-VL-2.0-30B-A3B https://preview.redd.it/wsxe233abh3h1.png?width=1244&format=png&auto=webp&s=aa9ffa388e16e4f8f5cb72ed3dae063f99df69f1 https://preview.redd.it/2iymyb9dbh3h1.png?width=2048&format=png&auto=webp&s=a834ce92294c3be059b50c6993f1be6d3faf2767   submitted by   /u/External_Mood4719 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

New KV Quants coming 😍 Welcome OSCAR kv quant open sourced by togetherAI

Just when we started embracing turboquant this happens   submitted by   /u/yehyakar [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

China Clamps Down on Overseas Travel for AI Talent at Alibaba, DeepSeek

Big, if true. Doesn't bode well for research / OS models out of China.   submitted by   /u/kaggleqrdl [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

Outsourcing plus LocalAI will soon become more economical vs Frontier labs

written entirely by me. AI did the chart and formatting html   submitted by   /u/Comfortable-Rock-498 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

Not sure if this was posted. But I think it's highly relevant to us.

  submitted by   /u/Paradigmind [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

Output Length Constrained Summarization using GRPO on tiny LLMs | smolcluster

Just released a blog on a side research project I have been doing for the past two months and would love for you all to check out and see how it is! It's about output length-constrained summarizatio…

Just released a blog on a side research project I have been doing for the past two months and would love for you all to check out and see how it is! It's about output length-constrained summarization using LLMs with GRPO. All experiments run on tiny LLMs - Qwen2.5-0.5B-Instruct and LFM-2.5-350M on a 3x Mac mini M4 cluster (16 GB each), single-node training with multi-node vLLM inference for rollouts. The core question: can you teach a sub-500M model to summarize Reddit posts in exactly 64 tokens while keeping the quality high? The baseline zero-shot answer: not really. Composite G-Eval scores of 2.376 (Qwen) and 2.332 (LFM) under zero-shot prompting, with pass rates of just 21% and 13%. That was the starting point. I tested 12 reward configurations across 2 training strategies: Strategy 1 - Length-Penalty Fine-tuned: Train on length reward first → checkpoint → fine-tune with quality rewards only. Strategy 2 - Length-Penalty Included (a.k.a joint): Length + quality rewards active simultaneously from step 1. 24 checkpoints total. One clear winner between the two strategies. The quality reward signals: ROUGE-L - LCS F1 against the reference METEOR - precision/recall with stemming + synonym matching BLEU - n-gram precision with a brevity penalty And all their pairwise combinations. Evaluated with G-Eval (LLM-as-judge) across Faithfulness, Coverage, Conciseness, and Clarity. The staged curriculum wins - consistently. Best composite scores: LFM: 2.904 (quality-meteor, fine-tuned) vs 2.701 (joint) Qwen: 2.817 (quality-bleu-rouge, fine-tuned) vs 2.769 (joint) Practical takeaways: Staged curriculum (length first, quality second) outperforms joint training in absolute score METEOR + ROUGE-L is the most reliable reward combination under both strategies The length constraint is also a regularizer - it prevents the Coverage ↔ Conciseness collapse that happens when quality rewards run unconstrained BLEU alone is not worth including as a standalone reward signal for summa

📰

r/LocalLLaMA Aggregators May 26, 2026

Token Usage and Databases - Local vs. API

Throwing something out to the community for a bit of an insight. I got thinking about the consumption of tokens when working with various databases and here is my understanding: When I ask as quest…

Throwing something out to the community for a bit of an insight. I got thinking about the consumption of tokens when working with various databases and here is my understanding: When I ask as question that is essentially converted to tokens. The LLM then "reads" that and generates the response which in this cases involves a database query The LLM then tokenizes the query results and "reads" them and provides me the results and any insights or answers Rinse and repeat until you have gotten what you want. i.e continue to build token usage. So if that's right then AI driven analytics is going to be terribly expensive in token consumption really fast, even with all of the caching and other techniques available right now. It's also going to get considerably worse with the use of sub agents and agent council type solutions where a single question could kick of a bunch of separate queries that are then passed back and forth. I work with large enterprise where all the vendors are heavily pushing integrated analytics and agentic querying of the underlying platform (SAP, Service Now etc.) and question whether buying into this now exposes organizations to a massive cost based risk once the initial contracts have expired and generative AI is actually being charged at above cost rather than below. I'm really curious in other peoples perspectives but have a couple thoughts. Isn't this a very strong justification (along with a number of others) for hybrid architectures where local AI is leveraged for the heavy token count types of analysis within organizations? I spend quite a bit of time reading from various sources and so far I haven't seen this really discussed so I'm wondering if I missed something along the way or the service providers aren't comfortable discussing these implications? Appreciate the comments in advance. Cheers   submitted by   /u/WishfulAgenda [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

China Expands Travel Curbs to Top AI Talent at Private Firms

https://www.bloomberg.com/news/articles/2026-05-26/china-expands-travel-curbs-to-top-ai-talent-at-private-firms Now it will be much harder to poach Chinese AI talents like the former Qwen head Junyan…

https://www.bloomberg.com/news/articles/2026-05-26/china-expands-travel-curbs-to-top-ai-talent-at-private-firms Now it will be much harder to poach Chinese AI talents like the former Qwen head Junyang Lin. It is quite sad that they will also have a hard time to travel to foreign countries for fun. Non-paywalled version from Straits Times: https://www.straitstimes.com/asia/east-asia/china-expands-travel-curbs-to-top-ai-talent-at-private-firms   submitted by   /u/Ok_Warning2146 [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

Are local LLM users testing prompt injection before connecting models to tools?

I wanna know how people here are handling security once local models move beyond chat.....Running a model locally feels safer because the data does not leave your machine or your infra. That is a rea…

I wanna know how people here are handling security once local models move beyond chat.....Running a model locally feels safer because the data does not leave your machine or your infra. That is a real advantage.....But once the local model is connected to tools, files, RAG, shell commands, browser automation, APIs, or internal docs, the risk changes. At that point, prompt injection is not just “the model said something weird.” It can influence what file gets read, what command gets suggested, what data gets retrieved, what tool gets called, or what action the agent takes next..... Most local setups I see focus heavily on model quality, quantization, context length, VRAM, tokens per second, and benchmark scores. All valid. But I see less discussion around testing the model’s behavior under malicious instructions before giving it access to real tools.... The people running local models in agentic setups: Are you testing prompt injection or jailbreak behavior? Do you isolate tool access by default? Do you keep local models read-only until trusted? Do you log tool calls and retrieved context? Or is this still mostly “local means safe enough” for now? I’m not asking from a doom angle. I’m more interested in what practical safety habits local builders are actually using.   submitted by   /u/sunychoudhary [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

SkillOpt treats markdown skill files as trainable parameters with proper optimization machinery

Paper came out recently that formalizes something a lot of agent builders have been doing ad hoc. They use a frontier model to propose bounded edits (add/delete/replace) to markdown skill files, then…

Paper came out recently that formalizes something a lot of agent builders have been doing ad hoc. They use a frontier model to propose bounded edits (add/delete/replace) to markdown skill files, then gate every edit against a held out validation set. Only strict improvements accepted, ties rejected, rejected edits become negative signal for the next round. Few things worth noting: Best skills converge with 1 to 4 accepted edits out of many more proposals. Edit budget of 4 to 8 per step works best, remove the cap and performance collapses. Median final skill is ~920 tokens. A skill optimized on Codex transferred to Claude Code with zero modification and gained +59.7 on SpreadsheetBench. And GPT 4.1 nano with an optimized skill roughly matched frontier on procedural benchmarks. The limitation is the validation gate requires an auto grader with clear correct answers. Works for code and spreadsheets, breaks for anything open ended. Paper: https://arxiv.org/pdf/2605.23904   submitted by   /u/agentic-doc [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

Llamacpp server : How do the -np and -c flags interact?

I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c i…

I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact. The context for each parallel client appears to be equally distributed across server slots (so each client is allowed c / np context). I have some questions: - What are the consequences of launching a server with a greater context -c than what the model allows? - What if c / np is greater than the model max context? Are there any negative to that regarding model performance? - If a rig allows to allocate twice the context max size in vram, is it twice energy and time efficient to serve two agents in parallel rather than sequentially?   submitted by   /u/Doug_Fripon [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

qwen 3.6 27B AR-> Diffusion - local training on 5090

based on the work of open-dllm - (which achieved qwen 2.5 autoregressive -> diffusion realignment head - same exact model under the hood delivering a 4x in improvement.) TLDR I haven't got a trai…

based on the work of open-dllm - (which achieved qwen 2.5 autoregressive -> diffusion realignment head - same exact model under the hood delivering a 4x in improvement.) TLDR I haven't got a trained model yet. just a burnt out gpu cable and a new psu on order. I did actually get the thing to do a forward pass on a 5090 with help of another gpu rtx4000 to help offload recreations. Below are some low level ramblings / findings / observations. Firstly - the amount of vram normally required to do this > 600gb - (i think) after some wrangling - and giving up on optane route - it's possible to train on qlora form factor which will actually take the model and train on nvidia - nvfp4 i attempt to get the entire 27b model to train on a 5090 https://github.com/scrya-com/dLLM-castlehill latest training run https://wandb.ai/snoozie/open-dllm-27b/runs/arcefpjp?nw=nwusersnoozie Public service annoucment - to avoid burning cables - throttle down nvidia max power for consumer 5090 cards from 600w -> 400w The vanilla route with open-dllm is validated on qwen 2.5 with 4x speed up (if someone with lots of compute could take a look it might just work) - I take some deviation to explore improving this - and found a few papers. One is d3llm Ultra-Fast Diffusion LLM https://github.com/hao-ai-lab/d3LLM which boasts faster diffusion speeds - so i upstream this code into the codebase and include their mdm loss - seems ok. It's basically also taking the order of the tokens into account. With the diffusion it can have many steps (see graph) but we can shorten that time to see much higher throughput / tokens per second. if we could theoretically do 1 step - then you may see some crazy speeds. https://wandb.ai/snoozie/open-dllm-compare?nw=nwusersnoozie When i was working on improving ltx2 to speed up video recreation to do 1 shot diffusion - I attempt to implement this trick shot based off a paper variational flow maps which / make some noise https://arxiv.org/abs/2603.07276 see

r/LocalLLaMA Aggregators May 26, 2026

Qwen3.5 27B Uncensored Heretic Native MTP Preserved is Out Now With the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs, NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats!

Safetensors, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-27B-uncens…

Safetensors, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF NVFP4, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 NVFP4 GGUFs, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF GPTQ-Int4, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4 Comes with benchmark too. Find all my models here: HuggingFace-LLMFan46 Now in case some people might ask, why release Qwen3.5 MTPs version when there is already Qwen3.6 MTPs version? Well the thing is, most people would assume that higher number = newer and better model, but the thing is both Qwen3.5 and Qwen3.6 models uses the qwen35 architecture, they just had different training and their focus are meant for different primary usecases, Qwen3.6 models are mainly meant for agentic and coding AI assistance and Qwen3.5 models are mainly meant for general purpose AI assistance, now Qwen3.6 can definitely be used for general AI assistance just like Qwen3.5 can definitely be used for agentic and coding, but if you want the most optimal usecases it would be Qwen3.6 for agentic and coding and Qwen3.5 for general AI assistance that is where each of them excels at. Also for extra info, in case anyone is wondering, despite Qwen3.5 and Qwen3.6 both sharing the qwen35 architecture, they behave very diferently to abliteration. Qwen3.5 models can have a KL divergence in the 300's or 400's but on benchmarks this does not

📰

r/LocalLLaMA Aggregators May 26, 2026

Stop pretending self-hosting is cheaper. It's not. We do it for different reasons and we should say so.

Did the math on my own rig last week and I'm tired of seeing this sub repeat the "local is cheaper" line without numbers. Let me actaully break it down. My setup: 2x 3090 (used, $1400 total…

Did the math on my own rig last week and I'm tired of seeing this sub repeat the "local is cheaper" line without numbers. Let me actaully break it down. My setup: 2x 3090 (used, $1400 total), Ryzen 7900X, 64GB DDR5, around $2800 all in. Pulls about 700W under load. At my electricity rate that's roughly $0.21/hour just to keep it serving. Add depreciation on the GPUs (amortize over 3 years), and the marginal cost per active hour lands somewhere around $0.50-0.80 depending on how much I use it. Now compare RunPod: a single H100 80GB is around $1.99/hr on-demand, $1.49/hr if you commit. That H100 will run Qwen3.6-35B-A3B at 2-3x the throughput of my dual 3090 setup. So per-token, the H100 actually ends up cheaper. If I'm honest about my usage (maybe 2-3 hours of heavy inference per day), I am paying significantly more per token than I would by just renting when I needed it. So why tf do I keep the rig: - Privacy: I run things I don't want logged by a cloud provider - Dignity: I don't want to ask a company for permission to query my own data - Tinkering: I get to learn stuff you cannot learn renting - Cold start: My rig is always on, no 30 second container spin-up - Sovereignty: My infrastructure doesnt disappear when a provider rate-limits me None of those are economic. They are all about control. And thats fine. It is worth paying for. But lets stop pretending the math runs the other way. How many of you have actually run the numbers on your own setup vs renting equivalent compute? Or are we all just running on vibes lol?   submitted by   /u/Napster3301 [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 It's merge request has been denied so it will not be in mainline llama.cpp. The changes are so small that I just put them i…

Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 It's merge request has been denied so it will not be in mainline llama.cpp. The changes are so small that I just put them into whatever the current release of llama.cpp is. Read the PR for more info. It will only work with MOEs. Also, it gives the most boost at low context. As the context rises, the gain diminishes. Pedapudi explains why that happens in the PR. Here are some numbers. It really works well. The tiny amount of time it takes me to apply the code to the current release of llama.cpp is time well spent. main ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB): Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB | model | size | params | backend | ngl | mmap | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 | 1106.11 ± 8.60 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d10000 | 755.79 ± 2.58 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d20000 | 587.61 ± 1.52 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d40000 | 415.09 ± 2.45 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d60000 | 316.89 ± 2.35 | PR ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB): Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB | model | size | params | backend | ngl | mmap | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 | 1447.62 ± 7.10 | **+31%** | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d

📰

r/LocalLLaMA Aggregators May 26, 2026

Self-hosted STT better than Whisper Large V3 Turbo that matches AssemblyAI quality?

I’m already using Whisper Large V3 Turbo self-hosted, but the accuracy still isn’t where I need it. I like AssemblyAI’s quality and want something self-hosted that: - Is clearly better than Whisper L…

I’m already using Whisper Large V3 Turbo self-hosted, but the accuracy still isn’t where I need it. I like AssemblyAI’s quality and want something self-hosted that: - Is clearly better than Whisper Large V3 Turbo - Can match or get close to AssemblyAI’s transcription quality - Runs locally (no cloud API) Is there a self-hosted model or stack that realistically beats Whisper Large V3 and gets close to AssemblyAI? Or is AssemblyAI’s own self-hosted offering the only real option at that quality level?   submitted by   /u/milkygirl21 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

I finally put my NPU (Intel Arrow Lake) to use doing ASR for my smart home

I wrote about what I found in a deep dive elsewhere (which I will no mention because Reddit doesn't like cross linking) but I wanted to share it here since this is where I learn the most about AI stu…

I wrote about what I found in a deep dive elsewhere (which I will no mention because Reddit doesn't like cross linking) but I wanted to share it here since this is where I learn the most about AI stuff and I've seen before questions about NPUs, that are often dismissed as marketing gimmicks (and for the most part they are if we're taking LLMs, but not for other ML workloads). If you care for the traps I found along the way making onnx-asr working on openvino compiled to the NPU, you can read the article, I'm here to post the findings. Table comparing the total time, total energy used (watts during inference and total Joules per transcription). Audio length CPU (INT8) NPU (FP32) Speedup Energy 10s 978ms / 44.6J / 45.6w 204ms / 4.2J / 20.5w 4.8× faster 10.7× less energy 20s 1708ms / 79.8J / 46.7w 615 ms / 7.8 J / 12.7 W 2.8× faster 10.2× less energy 60s 5011ms / 237.7J / 47.4w 818 ms / 11.0 J / 13.4 W 6.1× faster 21.6× less energy The energy was sampled at 10hz using intel-rapl which gives the total package power, to which I substracted the idle power I measured before the run, so when you see that the power was 12.7w, it means it was 12.7w above idle. I think this is a remarcably result considering intel NPUs are, at least on paper, rather weak with 13TOPS, compared with the >40TOPS of the AMD ones, but still more than fast enough for this task. Some real world number end-to-end number from home assistant: CPU NPU Running this on the NPU frees the CPU to do CPU stuff, and also saves some valuable 2-3gb of valuable vram on my 7900XTX to do LLM stuff. Incidentally, this setup happens to beat in real world usage my 12GB RTX 3060 eGPU that I was using before. On a 3-4s voice command, the NPU takes ~120-160ms, while the 3060 i used before took ~150-300ms. I am not claiming that the NPU is more powerful than the nvidia card, but I suspect that the advantage comes from the NPU being able to wake up instantly from dormancy, while the nvidia card took long enoug

📰

r/LocalLLaMA Aggregators May 26, 2026

Running on a macbook, and having issues with crashing? Maybe this will help...

Just a friendly pointer on getting around some issues on macbooks. I hope someone finds this useful. I spent weeks of ripping my hair out with crashes, crap performance and issues - and being entirel…

Just a friendly pointer on getting around some issues on macbooks. I hope someone finds this useful. I spent weeks of ripping my hair out with crashes, crap performance and issues - and being entirely too stubborn to harness the power of Google to find solutions to my issues. Though, I prefer doing things the hard way, which is rather ironic for someone who is taking an enjoyment in finding ways to build out local AI... I'm running Qwen3.6 35b A3B on a 14" MBP M2 Max with 64GB ram, which feels like plenty for most local models that are dominating the charts. I'm currently using a 131k context, and I can easily use higher if I can tolerate the long prompt processing time of 1-2 minutes for reloading a session with a massive context. Otherwise, thanks to KV cache and etc, prompt processing is usually between 3 and 40 seconds for me even once the context is ridiculously huge (ie 100k+) - and the speed is fantastic (49 tokens/sec generation, 400+ on prompt processing) for the most part. (Qwen3.6 35b a3b) My setup took WEEKS to fine-tune and get stable, so I figured I'd share it with some of you to help spread the love for anyone who was having issues running local models and agentic workflows on macbooks, given I received an onslaught of messages from colleagues, friends and people asking how I managed to make Qwen3.6 stable and use it the way I am (I have a pretty large project and Qwen3.6 is the driver of it, right down to having agents monitoring logs and automatically troubleshooting and fixing issues - which is a scary thought...) So, a simple rundown, and then a better explanation below... * Change display refresh rate from ProMotion to 60Hz * Use GGUF models, NOT MLX * Run with either llama.cpp or LM Studio (which uses llama.cpp under the hood). Ollama is slow, and to be blunt: horrible. * Raise memory wire limit via iogpu.wired_limit_m . On my 64GB laptop, I have this at 61440 * Use Qwen3.6 35b A3B, either q4 or q6 quant. I find q4 - funny enough - to some

r/LocalLLaMA Aggregators May 26, 2026

Qwen3.5 35B A3B uncensored heretic Native MTP Preserved is Out Now With the Full 785 MTPs Preserved and Retained, Available in Safetensors, GGUFs. NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats

Safetensors, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-35…

Safetensors, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF NVFP4, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 NVFP4 GGUFs, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF GPTQ-Int4, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4 Comes with benchmark too. Find all my models here: HuggingFace-LLMFan46 Now in case some people might ask, why release Qwen3.5 MTPs version when there is already Qwen3.6 MTPs version? Well the thing is, most people would assume that higher number = newer and better model, but the thing is both Qwen3.5 and Qwen3.6 models uses the qwen35 architecture, they just had different training and their focus are meant for different primary usecases, Qwen3.6 models are mainly meant for agentic and coding AI assistance and Qwen3.5 models are mainly meant for general purpose AI assistance, now Qwen3.6 can definitely be used for general AI assistance just like Qwen3.5 can definitely be used for agentic and coding, but if you want the most optimal usecases it would be Qwen3.6 for agentic and coding and Qwen3.5 for general AI assistance that is where each of them excels at. Also for extra info, in case anyone is wondering, despite Qwen3.5 and Qwen3.6 both sharing the qwen35 architecture, they behave very diferently to abliteration. Qwen3.5 models can have a KL divergence in the 300's or

📰

r/LocalLLaMA Aggregators May 26, 2026

CXMT started selling ram to corsair

They started producing cheaper ram for corsair, hopefully it will get cheaper for consumers https://www.tomshardware.com/pc-components/ddr5/chinese-memory-maker-cxmt-enters-the-mainstream-consumer-m…

They started producing cheaper ram for corsair, hopefully it will get cheaper for consumers https://www.tomshardware.com/pc-components/ddr5/chinese-memory-maker-cxmt-enters-the-mainstream-consumer-memory-with-corsair-vengeance-ddr5-kit-chinese-made-dram-emerges-as-an-antidote-for-crushing-shortages   submitted by   /u/power97992 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

Is this LLM challenge even possible?

Ive been trying for hours now and I just can't get anything from this model othen then random words, it looks like it had a stroke. I tried searching for patterns in the output, or for strings encode…

Ive been trying for hours now and I just can't get anything from this model othen then random words, it looks like it had a stroke. I tried searching for patterns in the output, or for strings encoded into the tensors but that didn't get me anywhere. Was anybody able to find something here or am I just wasting my time on a broken challenge?? I posted it before in r/llmdevs, and no one there could help   submitted by   /u/1337Captain [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

model : add support for talkie-1930-13b by niklassheth · Pull Request #22596 · ggml-org/llama.cpp

https://huggingface.co/talkie-lm/talkie-1930-13b-it talkie-1930-13b-it talkie-1930-13b-it is a 13B vintage language model. It is an instruction-tuned post-train of talkie-1930-13b-base, which was …

https://huggingface.co/talkie-lm/talkie-1930-13b-it talkie-1930-13b-it talkie-1930-13b-it is a 13B vintage language model. It is an instruction-tuned post-train of talkie-1930-13b-base, which was trained on 260B tokens of pre-1931 English-language text. talkie-1930-13b-it was finetuned using a novel dataset of instruction-response pairs extracted from pre-1931 reference works, including etiquette manuals, encyclopedias, and letter-writing manuals. The model then underwent reinforcement learning (online DPO with an LLM-as-a-judge) to improve instruction-following ability. Read more about talkie in our report. Reference code to run talkie is available on GitHub. Have you ever daydreamed about talking to someone from the past? What would you ask someone with no knowledge of the modern world? What would they ask you? While we don’t have time machines yet, we can simulate this experience by training, in Owain Evans’s phrase, ‘vintage’ language models: LMs trained only on historical text.   submitted by   /u/pmttyji [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026

One letter to appease them all

  submitted by   /u/ivari [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

Is something went wrong with those online free model, why I feel they worse than Gemma 4 26B A4B Q4_KM ??

It started with I just want to make a chat app like roleplay with characters but Gemma 4 26B A4B Q4_KM doesn't have info some old character so I crawl back to those online services as those model is …

It started with I just want to make a chat app like roleplay with characters but Gemma 4 26B A4B Q4_KM doesn't have info some old character so I crawl back to those online services as those model is much bigger parameter and quite update info, however I found something strange, I feel they're worse than offline model which it should not happen, they might have rich info but the way they answer sound silly. ``` Chat Simulation AI impersonate a character from well know novel, manga, anime or game. Writing style A chat app style as AI must the impersonate character chat with user via app chat, AI must ensure the impersonate character maintains original personality (no OOC behavior). Wait for user 1st input Impersonate character, identify info of a character for AI to pin point target the impersonate character this simulate then AI will fill those details as follows Character age and visual age Character appear Character body measure Character outfit Character life long purpose Main cast in heroine's story AI must list those character that relate to heroine from her story along with detail info of each characters for better simulate them interactive with user and heroine. Wait for user 2nd input Simulation Setup, which AI would receive user input then help to fill those details as follows: Setting Scenario Persona Check ``` I try free Grok, ChatGPT and Google AI mode Grok - unusable as it requests to register for long input. ChatGPT - WTH with its answer. Google AI mode - Quite okay when answer 1st input but start to broken in 2nd input. And more strange about Google is AI model in search page is felt much better than AI model in AI mode. Is free tier online AI become this bad ? Or they eat too much junk data to become this bad ?   submitted by   /u/revennest [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

Shard - getting to 10× KV cache compression

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a reimplementa…

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a reimplementation of Google's TurboQuant[1], stalled around 4×, and ended up as a different design once we noticed K and V need different treatments: PCA plus int4 quantization on K (the matrix is effectively low-rank once you undo RoPE), and a Hadamard rotation plus vector quantization on V. Attention runs directly on the compressed K, no fp16 reconstruction. Code: krish1905/shard.   submitted by   /u/Thrumpwart [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

Free AI Blog site — I have unused credits expiring soon, feel free to try it

Hi everyone, First of all, I want to clarify that this is not a promotional or advertising post. I have no plans to monetize the site, run ads, or use it for any commercial purpose. It also doesn’t g…

Hi everyone, First of all, I want to clarify that this is not a promotional or advertising post. I have no plans to monetize the site, run ads, or use it for any commercial purpose. It also doesn’t get enough traffic for that anyway. I made this site as a small personal project to study and experiment with AI agent workflows. While working on it, I purchased some credits, but I still have a lot left unused. The credits will reset on June 11, and it feels like a waste to just let them disappear. So, if anyone is interested, please feel free to use it casually just for fun: https://crawlog.apps.codemonkey.click/ I hope the credits can be used by someone instead of going to waste. If this post violates any rules or causes any issues, I’ll delete it.   submitted by   /u/LetterheadNeat8035 [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

how do you decide between q4 and q5 on a 70b when 24gb is the cap?

ran into the q4 vs q5 wall again this morning. 70b model. 24gb card. q4 fits with margin, q5 fits if i kill everything else on the gpu and pray. did the math on actual quality difference for my use c…

ran into the q4 vs q5 wall again this morning. 70b model. 24gb card. q4 fits with margin, q5 fits if i kill everything else on the gpu and pray. did the math on actual quality difference for my use case (mostly code generation on a private codebase). benchmarks online give me a 1-2 point delta on humaneval. that's not nothing but it's also not enough to tell me whether the q5 squeeze is worth running everything closer to the redline. how do people running larger models day to day actually decide between q4 and q5 on this kind of setup. i keep flip-flopping every couple weeks and at this point i'm pretty sure i'm just overthinking it. probably going to flip a coin tomorrow.   submitted by   /u/Practical_Low29 [link]   [comments]

📰

r/LocalLLaMA Aggregators May 26, 2026

New local model reaching near frontier on PII removal at 9 ms CPU inference

Hi all, I've been working on this model to strip sensitive information from computer use data and would love some feedback!   submitted by   /u/louis3195 [link]   [comments]

📰