100 articles

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Long-context performance at lower quants

I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing thaโ€ฆ

I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing that usually once I hit around 75-80k context use, it starts to get dumb all of a sudden. It just hits a brick wall and quality deteriorates rapidly and drastically. It'll begin hallucinating, forgetting things, or think something it said/suggested was actually something that I said. I found I have to compact before I get to that point, and then it keeps going on just fine. Is this because I'm running Q3? Unfortunately Q4 is just outside of the capability of my system specs unless I want to start disk swapping. So is it just an issue with this particular model? Or because it's Q3? Are there llama.cpp settings that can help? I'm already using BF16 KV cache.   submitted by   /u/_TheWolfOfWalmart_ [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
Augmented Equivariant Mesh Networks for Anatomical Mesh Segmentation (ICML 2026 Workshops) [R]

Paper: https://arxiv.org/abs/2605.08172 Workshops: AI for Science & Structured Data for Health at ICML 2026 Abstract: Anatomical mesh segmentation requires models that operate directly on irregโ€ฆ

Paper: https://arxiv.org/abs/2605.08172 Workshops: AI for Science & Structured Data for Health at ICML 2026 Abstract: Anatomical mesh segmentation requires models that operate directly on irregular surface geometry while remaining robust to arbitrary patient pose and mesh resolution variation. Existing task-specific mesh and point-cloud methods are not equivariant, and can degrade sharply under test-time perturbation, for example dropping by 25-26 IoU points on intraoral scan segmentation at 40o tilt. We present EAMS, an Equivariant Anatomical Mesh Segmentor built on Equivariant Mesh Neural Networks (EMNN), and evaluate it across four clinically distinct tasks spanning edge-, vertex-, and face-level supervision. We combine intrinsic mesh descriptors with anatomy-aware priors, including PCA-derived frames for dental arches and liver surfaces, and augment message passing to provide lightweight global context. Across intracranial aneurysm and intraoral segmentation, EAMS variants are competitive with specialized baselines on unperturbed inputs while remaining stable under geometric perturbations, and on liver surfaces they expose a favorable trade-off between canonical-pose accuracy and rotation robustness. These results show that a lightweight (<2M parameters) equivariant framework can deliver robust anatomical mesh segmentation across diverse supervision types without task-specific architectures. Hi everyone Iโ€™m excited to share my solo paper "Augmented Equivariant Mesh Networks for Anatomical Mesh Segmentation" which has been accepted for poster presentations at the ICML 2026 workshops on AI for Science and Structured Data for Health. The project stemmed from my parallel research on structural encoders for biomolecules where enforcing roto-translational equivariance is standard. In this work, I wanted to extend those principles directly to various 3D medical meshes. While current anatomical mesh segmentation methods are highly disjoint and anato

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
Tomesphere, 3M paper pages with TLDRs, peer reviews, code, and a SPECTER2 similarity graph [P]

Built a richer paper page for 3 million arxiv and OpenAlex papers. Free, no signup, no paywall. tomesphere.com Each page has a Gemini generated TLDR, peer reviews scraped from OpenReview with revieweโ€ฆ

Built a richer paper page for 3 million arxiv and OpenAlex papers. Free, no signup, no paywall. tomesphere.com Each page has a Gemini generated TLDR, peer reviews scraped from OpenReview with reviewer scores and decisions, GitHub repos, HuggingFace models and datasets, conference videos, the citation graph from OpenAlex (about 250M edges), and a semantic graph using SPECTER2 (768D in pgvector) with four ranking modes: Influential, Recent, Hidden gems, Nearest. Connected Papers and Litmaps default to citation overlap. Tomesphere defaults to text vector similarity, so brand new papers without a citation graph still appear and topically similar work shows up even without shared citers. Chrome extension overlays the same data on arxiv abstract and pdf pages. Try a paper you know: tomesphere.com/paper/2312.00752 (Mamba) tomesphere.com/paper/1706.03762 (Attention) tomesphere.com/paper/2305.14314 (QLoRA) Open to feedback.   submitted by   /u/RegretAgreeable4859 [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
Verbosity is not faithfulness: an architectural argument that reasoning models cannot perform faithful inference [D]

Essay argues that reasoning models cannot perform faithful inference because their reasoning trace and final answer come from the same operation. Engages with Lanham/Turpin/Mirzadeh in empirical critโ€ฆ

Essay argues that reasoning models cannot perform faithful inference because their reasoning trace and final answer come from the same operation. Engages with Lanham/Turpin/Mirzadeh in empirical critique, and with HRM, TRM, GRAM, AlphaProof, and Kona/Aleph as the contrasting architectural lineage. Curious what this subreddit makes of the constraint-vs-influence framing. https://mauhaq.substack.com/p/verbosity-is-not-faithfulness   submitted by   /u/Sensitive_Air_5745 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
OpenMOSS-Team/MOSS-TTS-v1.5 ยท Hugging Face

MOSS-TTS-v1.5 MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyiโ€ฆ

MOSS-TTS-v1.5 MOSS-TTS-v1.5 is continued from MOSS-TTS 1.0. It preserves the main 1.0 capabilities, including zero-shot voice cloning, long-form speech generation, token-level duration control, Pinyin/IPA pronunciation control, multilingual synthesis, and code-switching. For the full 1.0 feature walkthrough, input schema, decoding hyperparameters, and evaluation tables, please refer to the MOSS-TTS 1.0 README. Compared with MOSS-TTS 1.0, v1.5 focuses on the following improvements: Stronger multilingual synthesis with language tags: when the language field is omitted, v1.5 may improve some languages and regress slightly on others compared with 1.0. When the language is specified, v1.5 is stronger than 1.0 on almost all supported languages. Set the tag when building the user message, for example processor.build_user_message(text=text_fr, language="French"). More stable voice cloning: v1.5 improves speaker similarity and reduces cloning variance, making repeated generations more consistent. Better long-reference, short-text cloning: v1.5 handles scenarios where the reference audio is much longer than the target text more reliably than 1.0. More stable punctuation-following prosody: v1.5 follows punctuation-driven pauses more closely, especially in long sentences. Explicit pause control: v1.5 supports inline pause markers such as "[pause 3.2s]". For example, ๆˆ‘ไปŠๅคฉๅญฆไน ไบ†ไธ€้ฆ–ไธญๅ›ฝ็š„ๅค่ฏ—๏ผŒๅฎƒ็š„ๅๅญ—ๆ˜ฏ[pause 3.2s]้™ๅคœๆ€๏ผ inserts an explicit 3.2s pause before ้™ๅคœๆ€. Supported Languages MOSS-TTS-v1.5 currently supports 31 languages. It keeps the 20 languages supported by MOSS-TTS 1.0 and extends multilingual continued training to additional languages including Cantonese, Dutch, Finnish, Hindi, Macedonian, Malay, Romanian, Swahili, Tagalog, Thai, and Vietnamese. They released additional model as well. https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect-v2.0   submitted by   /u/pmttyji [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Feedback Wanted: Building for easier local AI

Just what the post says. Looking to make local AI easier so literally anyone can do โ€œall the thingsโ€ very easily. We built an installer that sets up all your OSS apps for you, ties in the relevant moโ€ฆ

Just what the post says. Looking to make local AI easier so literally anyone can do โ€œall the thingsโ€ very easily. We built an installer that sets up all your OSS apps for you, ties in the relevant models and pipelines and back end requirements, gives you a friendly UI to easily look at everything in one place, monitor hardware, etc. Currently works on Linux, Windows, and Mac. We have kind of blown up recently and have a lot of really awesome people contributing and building now, so itโ€™s not just me anymore itโ€™s people with Palatir and Google and other big AI credentials and a lot of really cool people who just want to see local AI made easier for everyone everywhere. We just finished automatic multi GPU detection and coordination as well, so that if you like to fine tune these things you can, but otherwise the system will setup automatic parallelism and coordination for you, all youโ€™d need is the hardware. Also currently in final tests for model downloads and switching inside the dashboard UI so you can manage these things without needing to navigate a terminal etc. Iโ€™d really love thoughts and feedback. What seems good, what people would change, what would make it even easier or better to use. My goal is that anyone anywhere can host local AI on anything so a few big companies canโ€™t ever try to tell us all what to do. Thatโ€™s a big goal, but thereโ€™s a lot of awesome people that believe in it too helping now so who knows? Any thoughts would be greatly appreciated!   submitted by   /u/Signal_Ad657 [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
[P] have a couple technical questions for my LLM router. [P]

I am a CS undergrad and I think token economics is the next big problem for companies. I am building a LLM router specifically for code and codebases. The Routing is not actually done by a heavily fiโ€ฆ

I am a CS undergrad and I think token economics is the next big problem for companies. I am building a LLM router specifically for code and codebases. The Routing is not actually done by a heavily fine tuned llm(already existing solutions do this). Using a bit of a different approach. I am gauging the complexity by measuring interaction between signals that can be cheaply extracted from the prompt. One of these signals is what I like to call blooms_intent, based on bloomโ€™s taxonomy. Bloom's taxonomy is a framework for categorizing educational goals. If a query is โ€œWhat is thisโ€ it falls under remember category whereas โ€œimplement thisโ€ is more of create category. Questions:- How do I find datasets for this purpose. Is bootstrapping datasets using AI fine for this. Should I do centroid based classification which Iโ€™ve been doing till now but the confidence difference between categories for ambiguous queries is way too close. What is the best dataset size and classifier that can somewhat reliably differentiate nuances between queries. You may ask why not use AI for these questions. I have and thatโ€™s why Iโ€™ve come here. Please lmk your thoughts and thanks in advance!!   submitted by   /u/getridofaks [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
Added a Chrome Dino-style game to my research tool's pipeline wait screen driven by real SSE events [P]

Slightly unhinged engineering decision but it works. My tool (ScholarScout) has a 2-3 minute pipeline: fetch papers from 8 databases โ†’ analyze trends โ†’ generate ideas. During that time, the user seesโ€ฆ

Slightly unhinged engineering decision but it works. My tool (ScholarScout) has a 2-3 minute pipeline: fetch papers from 8 databases โ†’ analyze trends โ†’ generate ideas. During that time, the user sees a pixel art owl running through a parallax forest. The fun part: it's not fake animation. Each paper dot that spawns in the game corresponds to a real paper_found SSE event from the backend. Papers drip-feed at 600ms intervals from a queue (even if the fetch returned 30 papers at once). Colors = source (white=arXiv, green=PubMed, purple=Crossref). When pipeline finishes, owl celebrates. Tech: vanilla JS canvas, 32x32 sprite sheet (12 frames), requestAnimationFrame loop, image-rendering: pixelated. No dependencies. Here's the demo vid ScholarScout v1.5.3 - Demo Actual useful changes in the same release: Review Mode: paper clustering (k-means on embeddings, Jaccard fallback) + per-cluster synthesis + cross-cutting analysis Paper freshness: _used_count per paper in cache, least-used prioritized, auto-widen date range on exhaustion All thresholds externalized to config.yaml github.com/neej4/ScholarScout or ScholarScout โ€” Papers in. Ideas out.   submitted by   /u/neeejaaa0 [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
[OSS] dlmserve - first serving engine for diffusion language models

Spent the last few months building this on a single RTX 5070. Quick context: diffusion language models (like LLaDA from gsai-ml) are a different beast from GPT-style autoregressive LLMs. Instead of gโ€ฆ

Spent the last few months building this on a single RTX 5070. Quick context: diffusion language models (like LLaDA from gsai-ml) are a different beast from GPT-style autoregressive LLMs. Instead of generating one token at a time, they start with a fully masked sentence and iteratively denoise the whole thing in parallel. Cool tech, but mainstream serving engines are all built around the autoregressive contract, so none of them serve diffusion LLMs. dlmserve fills that gap: OpenAI-compatible HTTP API (/v1/chat/completions) Automatic continuous batching at the denoising-step level Optional LocalLeap acceleration baked in Token-identical to the reference HF implementation at temperature=0 2.5x throughput vs HF at batch=4, plus another ~1.8x from LocalLeap Runs in 12 GB VRAM (RTX 3090/4090/5070 all fit). MIT licensed. Repo: https://github.com/iOptimizeThings/dlmserve Install: pipx install dlmserve (or pip install dlmserve if you're in a venv) First public OSS project of this size for me. Genuinely curious what people think. Feedback and code review very welcome, also happy to answer questions about the diffusion serving architecture Edit: Roadmap: - v0.1 โœ“ LLaDA-8B-Instruct + LLaDA-1.5 - v0.2 Dream-7B + DiffuLLaMA (issues already open) - v0.3 block diffusion + LLaDA-2.0 + Fast-dLLM KV cache   submitted by   /u/Glittering_Painting8 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Harbor v0.4.19 - vllm/sglang/llama.cpp launch codex/claude/pi/opencode

I'm usually not posting about Harbor releases out of the respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding toolโ€ฆ

I'm usually not posting about Harbor releases out of the respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tools with local inference backends. For example, to run pi + vllm: # model downloaded and configured harbor up vllm # Harbor knows that vllm is running and will use it harbor launch pi Additionally, launch can proxy requests through built-in optimising LLM gateway which automatically injects and resolves tools, such as web search, so you can add web search to an agent by just appending --web to the command and Harbor will pre-wire everything: harbor launch --web --model qwen3.5:4b --backend ik_llamacpp mi -p 'Find recent releases of agentic tools and write a two sentence overview' You can find many more details in the wiki here: https://github.com/av/harbor/wiki/3.-Harbor-CLI-Reference#harbor-launch-launch-options---service-servicetool-args Thank you!   submitted by   /u/Everlier [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
[D] Dlib or pytorch to CNN? [D]

Iโ€™m currently studying ML, more specifically convolutional neural networks (CNNs) for finding patterns in images. Right now, Iโ€™m trying to develop a model that can solve the โ€œWhereโ€™s Waldo?โ€ challengโ€ฆ

Iโ€™m currently studying ML, more specifically convolutional neural networks (CNNs) for finding patterns in images. Right now, Iโ€™m trying to develop a model that can solve the โ€œWhereโ€™s Waldo?โ€ challenge. However, I currently have a question: what would be the best option for training a CNN model, PyTorch or Dlib? At the moment, I have an AMD RX580. Since Dlib only supports CUDA, I would need to use Google Colab. Iโ€™m still learning about this field, so if I said something incorrect or if you have any tips on how to approach this project, Iโ€™d be very happy to hear them. ๐Ÿ˜„   submitted by   /u/TearsInTokio [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
[P] Built a portable GPU ISA after reading too many architecture manuals [P]

Iโ€™ve been reading GPU architecture docs in my free time. NVIDIA PTX, AMD ISA reference guides, Intel Xe, reverse-engineered Apple GPU stuff. Over 5,000 pages across 16 microarchitectures. After a whiโ€ฆ

Iโ€™ve been reading GPU architecture docs in my free time. NVIDIA PTX, AMD ISA reference guides, Intel Xe, reverse-engineered Apple GPU stuff. Over 5,000 pages across 16 microarchitectures. After a while you notice all four vendors are doing the same 11 things with different names. So I wrote a spec that covers all of them and built a toolchain around it. Itโ€™s called WAVE. You write a kernel once, it compiles to a portable binary, then thin backends translate it to Metal, PTX, HIP, or SYCL. Same binary verified on Apple M4 Pro, NVIDIA T4, and AMD MI300X. My co-author Onyinye built PyTorch integration and got identical training results across all backends. Please star on GitHub: https://github.com/Oabraham1/wave Preprint: https://arxiv.org/abs/2603.28793 Read full docs and how I built everything: https://wave.ojima.me pip install wave-gpu   submitted by   /u/not-your-typical-cs [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Okay 27B made me a believer

I previously hated on this model, but I have just been impressed by it, and I understand the hype now. I have been working on a HTML5 game console and I decided to see if Qwen3.6 27B can handle makinโ€ฆ

I previously hated on this model, but I have just been impressed by it, and I understand the hype now. I have been working on a HTML5 game console and I decided to see if Qwen3.6 27B can handle making some quick games in it to showcase functionality (save games, console API handling for stat tracking and heartbeat management, meta data for the game, etc) I gave it 3 files, explaining how the API works, the gamepad controls, and a typescript shader for it to apply. Then I just game it a very simple prompt "make a breakout game for this console, in the working directory are reference files on how to make it". First result was immediately playable, controls made sense, graphics style was was unique and appropriate, sound worked, console API all worked, and it felt good and was actually fun. It added flair that made it not feel like the vibecoded breakout clone it was. It went way above and beyond the minimum that I've seen so many LLMs do. It was not lazy in the slightest. It's a simple test, but this is something everything but something like Opus could handle. There wasn't anything particularly done well, it's just that the whole game was nearly complete in a single shot and it felt like thought was put into the entire game. All I needed was one follow up for customization and a single glitch and it was already what I would consider complete. And this was on a 27B model with Opencode. The best way I can describe it, is that it was congruent. Now I just wish I went the Nvidia card route instead of Strix Halo cause the speed isn't great. Maybe 3.7 35B A3B can have some of this magic.   submitted by   /u/Forward_Jackfruit813 [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
AWS Fired the One Employee Who Gave a Damn

Article URL: https://www.seuros.com/blog/aws-fired-the-human-who-made-the-difference/ Comments URL: https://news.ycombinator.com/item?id=48279321 Points: 158 # Comments: 79

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Tencent Hy-MT2 is now under Apache License 2.0

nice update bois   submitted by   /u/sword-in-stone [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
Spain blocks prediction markets Polymarket, Kalshi over lack of gambling licence

Article URL: https://www.reuters.com/business/spain-blocks-prediction-markets-polymarket-kalshi-over-lack-gambling-licences-2026-05-26/ Comments URL: https://news.ycombinator.com/item?id=48279316 Poiโ€ฆ

Article URL: https://www.reuters.com/business/spain-blocks-prediction-markets-polymarket-kalshi-over-lack-gambling-licences-2026-05-26/ Comments URL: https://news.ycombinator.com/item?id=48279316 Points: 144 # Comments: 72

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
[P] I built a system that lets you ask questions about any GitHub repo and get answers grounded in the actual source code [P]

Hi guys I've been working on GitRAG โ€” paste any public GitHub URL, and ask it anything about the codebase. It answers with exact file paths and line numbers, no hallucination. How it works under the โ€ฆ

Hi guys I've been working on GitRAG โ€” paste any public GitHub URL, and ask it anything about the codebase. It answers with exact file paths and line numbers, no hallucination. How it works under the hood: Clones the repo and splits files into semantic chunks using AST-aware parsing (not just line splits) Builds a hybrid index โ€” dense embeddings + BM25 keyword index At query time, fuses both signals with Reciprocal Rank Fusion, then runs Cohere reranking to cut 20 candidates down to 5 Sends those 5 chunks to Groq's llama-3.3-70b which generates a grounded answer The retrieval pipeline is what I'm most proud of โ€” the BM25 + semantic fusion catches things that pure vector search misses (exact function names, error codes, etc.) Stack: FastAPI ยท ChromaDB ยท text-embedding-3-small ยท Cohere rerank-v3.5 ยท Groq llama-3.3-70b ยท React + Vite Supports 15+ languages: Python, JS/TS, C#, Java, Go, Rust, C/C++, Swift, Kotlin, Dart, Ruby, PHP, Vue, Svelte, Shell... Curious what repos people try it on โ€” drop your results below ๐Ÿ‘‡   submitted by   /u/Professional-Pie6704 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Keye-VL-2.0-30B-A3B -- Introducing DSA attention into multimodality for the first time

Meet Keye-VL-2.0-30B-A3B โ€” the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capaโ€ฆ

Meet Keye-VL-2.0-30B-A3B โ€” the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family. https://huggingface.co/Kwai-Keye/Keye-VL-2.0-30B-A3B https://preview.redd.it/wsxe233abh3h1.png?width=1244&format=png&auto=webp&s=aa9ffa388e16e4f8f5cb72ed3dae063f99df69f1 https://preview.redd.it/2iymyb9dbh3h1.png?width=2048&format=png&auto=webp&s=a834ce92294c3be059b50c6993f1be6d3faf2767   submitted by   /u/External_Mood4719 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
New KV Quants coming ๐Ÿ˜ Welcome OSCAR kv quant open sourced by togetherAI

Just when we started embracing turboquant this happens   submitted by   /u/yehyakar [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
China Clamps Down on Overseas Travel for AI Talent at Alibaba, DeepSeek

Big, if true. Doesn't bode well for research / OS models out of China.   submitted by   /u/kaggleqrdl [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Outsourcing plus LocalAI will soon become more economical vs Frontier labs

written entirely by me. AI did the chart and formatting html   submitted by   /u/Comfortable-Rock-498 [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
Outsourcing plus LocalAI will soon become more economical vs. Frontier labs

Article URL: https://www.signalbloom.ai/posts/outsourcing-plus-localai-will-soon-become-more-economical-vs-frontier-labs/ Comments URL: https://news.ycombinator.com/item?id=48278610 Points: 116 # Comโ€ฆ

Article URL: https://www.signalbloom.ai/posts/outsourcing-plus-localai-will-soon-become-more-economical-vs-frontier-labs/ Comments URL: https://news.ycombinator.com/item?id=48278610 Points: 116 # Comments: 132

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
What valuable professional data is completely locked away from AI companies? [D]

Hi all, Apologies beforehand if this is the wrong subreddit, let me know if you think there are better subreddits for this post. Iโ€™m working on a project around proprietary data licensing for AI traโ€ฆ

Hi all, Apologies beforehand if this is the wrong subreddit, let me know if you think there are better subreddits for this post. Iโ€™m working on a project around proprietary data licensing for AI training and trying to identify data types that are genuinely inaccessible to AI labs- not because it doesnโ€™t exist, but because no one has figured out how to unlock it. Specifically looking for data that is: โ€ข Created by domain experts as part of their daily work โ€ข Never published or shared outside the organization โ€ข Rich in human reasoning, not just structured outputs Finance is my background so Iโ€™m especially curious about examples there, but all industries welcome. Whatโ€™s the most valuable โ€œlockedโ€ professional data youโ€™ve come across in your field - and who (if ya know) owns the rights to it?   submitted by   /u/Manny_in_iceage [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Not sure if this was posted. But I think it's highly relevant to us.

  submitted by   /u/Paradigmind [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
Netherlands blocks US takeover of vital digital supplier

Article URL: https://www.politico.eu/article/netherlands-blocks-us-takeover-vital-digital-supplier/ Comments URL: https://news.ycombinator.com/item?id=48278406 Points: 143 # Comments: 39

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
GitHub Actions down again today

Article URL: https://www.githubstatus.com/?today Comments URL: https://news.ycombinator.com/item?id=48278374 Points: 212 # Comments: 106

r/LocalLLaMA Aggregators May 26, 2026
Output Length Constrained Summarization using GRPO on tiny LLMs | smolcluster

Just released a blog on a side research project I have been doing for the past two months and would love for you all to check out and see how it is! It's about output length-constrained summarizatioโ€ฆ

Just released a blog on a side research project I have been doing for the past two months and would love for you all to check out and see how it is! It's about output length-constrained summarization using LLMs with GRPO. All experiments run on tiny LLMs - Qwen2.5-0.5B-Instruct and LFM-2.5-350M on a 3x Mac mini M4 cluster (16 GB each), single-node training with multi-node vLLM inference for rollouts. The core question: can you teach a sub-500M model to summarize Reddit posts in exactly 64 tokens while keeping the quality high? The baseline zero-shot answer: not really. Composite G-Eval scores of 2.376 (Qwen) and 2.332 (LFM) under zero-shot prompting, with pass rates of just 21% and 13%. That was the starting point. I tested 12 reward configurations across 2 training strategies: Strategy 1 - Length-Penalty Fine-tuned: Train on length reward first โ†’ checkpoint โ†’ fine-tune with quality rewards only. Strategy 2 - Length-Penalty Included (a.k.a joint): Length + quality rewards active simultaneously from step 1. 24 checkpoints total. One clear winner between the two strategies. The quality reward signals: ROUGE-L - LCS F1 against the reference METEOR - precision/recall with stemming + synonym matching BLEU - n-gram precision with a brevity penalty And all their pairwise combinations. Evaluated with G-Eval (LLM-as-judge) across Faithfulness, Coverage, Conciseness, and Clarity. The staged curriculum wins - consistently. Best composite scores: LFM: 2.904 (quality-meteor, fine-tuned) vs 2.701 (joint) Qwen: 2.817 (quality-bleu-rouge, fine-tuned) vs 2.769 (joint) Practical takeaways: Staged curriculum (length first, quality second) outperforms joint training in absolute score METEOR + ROUGE-L is the most reliable reward combination under both strategies The length constraint is also a regularizer - it prevents the Coverage โ†” Conciseness collapse that happens when quality rewards run unconstrained BLEU alone is not worth including as a standalone reward signal for summa

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Token Usage and Databases - Local vs. API

Throwing something out to the community for a bit of an insight. I got thinking about the consumption of tokens when working with various databases and here is my understanding: When I ask as questโ€ฆ

Throwing something out to the community for a bit of an insight. I got thinking about the consumption of tokens when working with various databases and here is my understanding: When I ask as question that is essentially converted to tokens. The LLM then "reads" that and generates the response which in this cases involves a database query The LLM then tokenizes the query results and "reads" them and provides me the results and any insights or answers Rinse and repeat until you have gotten what you want. i.e continue to build token usage. So if that's right then AI driven analytics is going to be terribly expensive in token consumption really fast, even with all of the caching and other techniques available right now. It's also going to get considerably worse with the use of sub agents and agent council type solutions where a single question could kick of a bunch of separate queries that are then passed back and forth. I work with large enterprise where all the vendors are heavily pushing integrated analytics and agentic querying of the underlying platform (SAP, Service Now etc.) and question whether buying into this now exposes organizations to a massive cost based risk once the initial contracts have expired and generative AI is actually being charged at above cost rather than below. I'm really curious in other peoples perspectives but have a couple thoughts. Isn't this a very strong justification (along with a number of others) for hybrid architectures where local AI is leveraged for the heavy token count types of analysis within organizations? I spend quite a bit of time reading from various sources and so far I haven't seen this really discussed so I'm wondering if I missed something along the way or the service providers aren't comfortable discussing these implications? Appreciate the comments in advance. Cheers   submitted by   /u/WishfulAgenda [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
China Expands Travel Curbs to Top AI Talent at Private Firms

https://www.bloomberg.com/news/articles/2026-05-26/china-expands-travel-curbs-to-top-ai-talent-at-private-firms Now it will be much harder to poach Chinese AI talents like the former Qwen head Junyanโ€ฆ

https://www.bloomberg.com/news/articles/2026-05-26/china-expands-travel-curbs-to-top-ai-talent-at-private-firms Now it will be much harder to poach Chinese AI talents like the former Qwen head Junyang Lin. It is quite sad that they will also have a hard time to travel to foreign countries for fun. Non-paywalled version from Straits Times: https://www.straitstimes.com/asia/east-asia/china-expands-travel-curbs-to-top-ai-talent-at-private-firms   submitted by   /u/Ok_Warning2146 [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
[D] Where do you go for serious AI research discussion online? [D]

Looking for communities where people actually dig into ML/AI research, not hype, not "look what I built with an LLM API," but discussions about papers, training dynamics, debugging real modโ€ฆ

Looking for communities where people actually dig into ML/AI research, not hype, not "look what I built with an LLM API," but discussions about papers, training dynamics, debugging real models, infra problems, that kind of thing. I'm specifically interested in places where you can post something like "I'm seeing X behaviour in my SSL training, here's the loss curve, anyone seen this before?" and get thoughtful replies instead of generic advice.   submitted by   /u/Possible-Active-1903 [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
Uber president says AI spending is getting 'harder to justify'

Article URL: https://www.theverge.com/transportation/937116/uber-ai-investment-hard-to-justify Comments URL: https://news.ycombinator.com/item?id=48277485 Points: 138 # Comments: 52

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Are local LLM users testing prompt injection before connecting models to tools?

I wanna know how people here are handling security once local models move beyond chat.....Running a model locally feels safer because the data does not leave your machine or your infra. That is a reaโ€ฆ

I wanna know how people here are handling security once local models move beyond chat.....Running a model locally feels safer because the data does not leave your machine or your infra. That is a real advantage.....But once the local model is connected to tools, files, RAG, shell commands, browser automation, APIs, or internal docs, the risk changes. At that point, prompt injection is not just โ€œthe model said something weird.โ€ It can influence what file gets read, what command gets suggested, what data gets retrieved, what tool gets called, or what action the agent takes next..... Most local setups I see focus heavily on model quality, quantization, context length, VRAM, tokens per second, and benchmark scores. All valid. But I see less discussion around testing the modelโ€™s behavior under malicious instructions before giving it access to real tools.... The people running local models in agentic setups: Are you testing prompt injection or jailbreak behavior? Do you isolate tool access by default? Do you keep local models read-only until trusted? Do you log tool calls and retrieved context? Or is this still mostly โ€œlocal means safe enoughโ€ for now? Iโ€™m not asking from a doom angle. Iโ€™m more interested in what practical safety habits local builders are actually using.   submitted by   /u/sunychoudhary [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
SkillOpt treats markdown skill files as trainable parameters with proper optimization machinery

Paper came out recently that formalizes something a lot of agent builders have been doing ad hoc. They use a frontier model to propose bounded edits (add/delete/replace) to markdown skill files, thenโ€ฆ

Paper came out recently that formalizes something a lot of agent builders have been doing ad hoc. They use a frontier model to propose bounded edits (add/delete/replace) to markdown skill files, then gate every edit against a held out validation set. Only strict improvements accepted, ties rejected, rejected edits become negative signal for the next round. Few things worth noting: Best skills converge with 1 to 4 accepted edits out of many more proposals. Edit budget of 4 to 8 per step works best, remove the cap and performance collapses. Median final skill is ~920 tokens. A skill optimized on Codex transferred to Claude Code with zero modification and gained +59.7 on SpreadsheetBench. And GPT 4.1 nano with an optimized skill roughly matched frontier on procedural benchmarks. The limitation is the validation gate requires an auto grader with clear correct answers. Works for code and spreadsheets, breaks for anything open ended. Paper: https://arxiv.org/pdf/2605.23904   submitted by   /u/agentic-doc [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Llamacpp server : How do the -np and -c flags interact?

I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c iโ€ฆ

I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact. The context for each parallel client appears to be equally distributed across server slots (so each client is allowed c / np context). I have some questions: - What are the consequences of launching a server with a greater context -c than what the model allows? - What if c / np is greater than the model max context? Are there any negative to that regarding model performance? - If a rig allows to allocate twice the context max size in vram, is it twice energy and time efficient to serve two agents in parallel rather than sequentially?   submitted by   /u/Doug_Fripon [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
qwen 3.6 27B AR-> Diffusion - local training on 5090

based on the work of open-dllm - (which achieved qwen 2.5 autoregressive -> diffusion realignment head - same exact model under the hood delivering a 4x in improvement.) TLDR I haven't got a traiโ€ฆ

based on the work of open-dllm - (which achieved qwen 2.5 autoregressive -> diffusion realignment head - same exact model under the hood delivering a 4x in improvement.) TLDR I haven't got a trained model yet. just a burnt out gpu cable and a new psu on order. I did actually get the thing to do a forward pass on a 5090 with help of another gpu rtx4000 to help offload recreations. Below are some low level ramblings / findings / observations. Firstly - the amount of vram normally required to do this > 600gb - (i think) after some wrangling - and giving up on optane route - it's possible to train on qlora form factor which will actually take the model and train on nvidia - nvfp4 i attempt to get the entire 27b model to train on a 5090 https://github.com/scrya-com/dLLM-castlehill latest training run https://wandb.ai/snoozie/open-dllm-27b/runs/arcefpjp?nw=nwusersnoozie Public service annoucment - to avoid burning cables - throttle down nvidia max power for consumer 5090 cards from 600w -> 400w The vanilla route with open-dllm is validated on qwen 2.5 with 4x speed up (if someone with lots of compute could take a look it might just work) - I take some deviation to explore improving this - and found a few papers. One is d3llm Ultra-Fast Diffusion LLM https://github.com/hao-ai-lab/d3LLM which boasts faster diffusion speeds - so i upstream this code into the codebase and include their mdm loss - seems ok. It's basically also taking the order of the tokens into account. With the diffusion it can have many steps (see graph) but we can shorten that time to see much higher throughput / tokens per second. if we could theoretically do 1 step - then you may see some crazy speeds. https://wandb.ai/snoozie/open-dllm-compare?nw=nwusersnoozie When i was working on improving ltx2 to speed up video recreation to do 1 shot diffusion - I attempt to implement this trick shot based off a paper variational flow maps which / make some noise https://arxiv.org/abs/2603.07276 see

r/LocalLLaMA Aggregators May 26, 2026
Qwen3.5 27B Uncensored Heretic Native MTP Preserved is Out Now With the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs, NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats!

Safetensors, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-27B-uncensโ€ฆ

Safetensors, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF NVFP4, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 NVFP4 GGUFs, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF GPTQ-Int4, llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4: https://huggingface.co/llmfan46/Qwen3.5-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4 Comes with benchmark too. Find all my models here: HuggingFace-LLMFan46 Now in case some people might ask, why release Qwen3.5 MTPs version when there is already Qwen3.6 MTPs version? Well the thing is, most people would assume that higher number = newer and better model, but the thing is both Qwen3.5 and Qwen3.6 models uses the qwen35 architecture, they just had different training and their focus are meant for different primary usecases, Qwen3.6 models are mainly meant for agentic and coding AI assistance and Qwen3.5 models are mainly meant for general purpose AI assistance, now Qwen3.6 can definitely be used for general AI assistance just like Qwen3.5 can definitely be used for agentic and coding, but if you want the most optimal usecases it would be Qwen3.6 for agentic and coding and Qwen3.5 for general AI assistance that is where each of them excels at. Also for extra info, in case anyone is wondering, despite Qwen3.5 and Qwen3.6 both sharing the qwen35 architecture, they behave very diferently to abliteration. Qwen3.5 models can have a KL divergence in the 300's or 400's but on benchmarks this does not

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Stop pretending self-hosting is cheaper. It's not. We do it for different reasons and we should say so.

Did the math on my own rig last week and I'm tired of seeing this sub repeat the "local is cheaper" line without numbers. Let me actaully break it down. My setup: 2x 3090 (used, $1400 totalโ€ฆ

Did the math on my own rig last week and I'm tired of seeing this sub repeat the "local is cheaper" line without numbers. Let me actaully break it down. My setup: 2x 3090 (used, $1400 total), Ryzen 7900X, 64GB DDR5, around $2800 all in. Pulls about 700W under load. At my electricity rate that's roughly $0.21/hour just to keep it serving. Add depreciation on the GPUs (amortize over 3 years), and the marginal cost per active hour lands somewhere around $0.50-0.80 depending on how much I use it. Now compare RunPod: a single H100 80GB is around $1.99/hr on-demand, $1.49/hr if you commit. That H100 will run Qwen3.6-35B-A3B at 2-3x the throughput of my dual 3090 setup. So per-token, the H100 actually ends up cheaper. If I'm honest about my usage (maybe 2-3 hours of heavy inference per day), I am paying significantly more per token than I would by just renting when I needed it. So why tf do I keep the rig: - Privacy: I run things I don't want logged by a cloud provider - Dignity: I don't want to ask a company for permission to query my own data - Tinkering: I get to learn stuff you cannot learn renting - Cold start: My rig is always on, no 30 second container spin-up - Sovereignty: My infrastructure doesnt disappear when a provider rate-limits me None of those are economic. They are all about control. And thats fine. It is worth paying for. But lets stop pretending the math runs the other way. How many of you have actually run the numbers on your own setup vs renting equivalent compute? Or are we all just running on vibes lol?   submitted by   /u/Napster3301 [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs.

Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 It's merge request has been denied so it will not be in mainline llama.cpp. The changes are so small that I just put them iโ€ฆ

Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 It's merge request has been denied so it will not be in mainline llama.cpp. The changes are so small that I just put them into whatever the current release of llama.cpp is. Read the PR for more info. It will only work with MOEs. Also, it gives the most boost at low context. As the context rises, the gain diminishes. Pedapudi explains why that happens in the PR. Here are some numbers. It really works well. The tiny amount of time it takes me to apply the code to the current release of llama.cpp is time well spent. main ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB): Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB | model | size | params | backend | ngl | mmap | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 | 1106.11 ยฑ 8.60 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d10000 | 755.79 ยฑ 2.58 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d20000 | 587.61 ยฑ 1.52 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d40000 | 415.09 ยฑ 2.45 | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d60000 | 316.89 ยฑ 2.35 | PR ggml_cuda_init: found 1 ROCm devices (Total VRAM: 128000 MiB): Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 128000 MiB | model | size | params | backend | ngl | mmap | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | --------------: | -------------------: | | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 | 1447.62 ยฑ 7.10 | **+31%** | qwen35moe 35B.A3B Q4_K - Small | 19.45 GiB | 34.66 B | ROCm | 99 | 0 | pp512 @ d

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Self-hosted STT better than Whisper Large V3 Turbo that matches AssemblyAI quality?

Iโ€™m already using Whisper Large V3 Turbo self-hosted, but the accuracy still isnโ€™t where I need it. I like AssemblyAIโ€™s quality and want something self-hosted that: - Is clearly better than Whisper Lโ€ฆ

Iโ€™m already using Whisper Large V3 Turbo self-hosted, but the accuracy still isnโ€™t where I need it. I like AssemblyAIโ€™s quality and want something self-hosted that: - Is clearly better than Whisper Large V3 Turbo - Can match or get close to AssemblyAIโ€™s transcription quality - Runs locally (no cloud API) Is there a self-hosted model or stack that realistically beats Whisper Large V3 and gets close to AssemblyAI? Or is AssemblyAIโ€™s own self-hosted offering the only real option at that quality level?   submitted by   /u/milkygirl21 [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
DynIP โ€“ Dynamic DNS with RFC 2136, IPv6, DNSSEC, and BYOD

Article URL: https://dynip.dev/ Comments URL: https://news.ycombinator.com/item?id=48276363 Points: 126 # Comments: 49

r/LocalLLaMA Aggregators May 26, 2026
I finally put my NPU (Intel Arrow Lake) to use doing ASR for my smart home

I wrote about what I found in a deep dive elsewhere (which I will no mention because Reddit doesn't like cross linking) but I wanted to share it here since this is where I learn the most about AI stuโ€ฆ

I wrote about what I found in a deep dive elsewhere (which I will no mention because Reddit doesn't like cross linking) but I wanted to share it here since this is where I learn the most about AI stuff and I've seen before questions about NPUs, that are often dismissed as marketing gimmicks (and for the most part they are if we're taking LLMs, but not for other ML workloads). If you care for the traps I found along the way making onnx-asr working on openvino compiled to the NPU, you can read the article, I'm here to post the findings. Table comparing the total time, total energy used (watts during inference and total Joules per transcription). Audio length CPU (INT8) NPU (FP32) Speedup Energy 10s 978ms / 44.6J / 45.6w 204ms / 4.2J / 20.5w 4.8ร— faster 10.7ร— less energy 20s 1708ms / 79.8J / 46.7w 615 ms / 7.8 J / 12.7 W 2.8ร— faster 10.2ร— less energy 60s 5011ms / 237.7J / 47.4w 818 ms / 11.0 J / 13.4 W 6.1ร— faster 21.6ร— less energy The energy was sampled at 10hz using intel-rapl which gives the total package power, to which I substracted the idle power I measured before the run, so when you see that the power was 12.7w, it means it was 12.7w above idle. I think this is a remarcably result considering intel NPUs are, at least on paper, rather weak with 13TOPS, compared with the >40TOPS of the AMD ones, but still more than fast enough for this task. Some real world number end-to-end number from home assistant: CPU NPU Running this on the NPU frees the CPU to do CPU stuff, and also saves some valuable 2-3gb of valuable vram on my 7900XTX to do LLM stuff. Incidentally, this setup happens to beat in real world usage my 12GB RTX 3060 eGPU that I was using before. On a 3-4s voice command, the NPU takes ~120-160ms, while the 3060 i used before took ~150-300ms. I am not claiming that the NPU is more powerful than the nvidia card, but I suspect that the advantage comes from the NPU being able to wake up instantly from dormancy, while the nvidia card took long enoug

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Running on a macbook, and having issues with crashing? Maybe this will help...

Just a friendly pointer on getting around some issues on macbooks. I hope someone finds this useful. I spent weeks of ripping my hair out with crashes, crap performance and issues - and being entirelโ€ฆ

Just a friendly pointer on getting around some issues on macbooks. I hope someone finds this useful. I spent weeks of ripping my hair out with crashes, crap performance and issues - and being entirely too stubborn to harness the power of Google to find solutions to my issues. Though, I prefer doing things the hard way, which is rather ironic for someone who is taking an enjoyment in finding ways to build out local AI... I'm running Qwen3.6 35b A3B on a 14" MBP M2 Max with 64GB ram, which feels like plenty for most local models that are dominating the charts. I'm currently using a 131k context, and I can easily use higher if I can tolerate the long prompt processing time of 1-2 minutes for reloading a session with a massive context. Otherwise, thanks to KV cache and etc, prompt processing is usually between 3 and 40 seconds for me even once the context is ridiculously huge (ie 100k+) - and the speed is fantastic (49 tokens/sec generation, 400+ on prompt processing) for the most part. (Qwen3.6 35b a3b) My setup took WEEKS to fine-tune and get stable, so I figured I'd share it with some of you to help spread the love for anyone who was having issues running local models and agentic workflows on macbooks, given I received an onslaught of messages from colleagues, friends and people asking how I managed to make Qwen3.6 stable and use it the way I am (I have a pretty large project and Qwen3.6 is the driver of it, right down to having agents monitoring logs and automatically troubleshooting and fixing issues - which is a scary thought...) So, a simple rundown, and then a better explanation below... * Change display refresh rate from ProMotion to 60Hz * Use GGUF models, NOT MLX * Run with either llama.cpp or LM Studio (which uses llama.cpp under the hood). Ollama is slow, and to be blunt: horrible. * Raise memory wire limit via iogpu.wired_limit_m . On my 64GB laptop, I have this at 61440 * Use Qwen3.6 35b A3B, either q4 or q6 quant. I find q4 - funny enough - to some

r/LocalLLaMA Aggregators May 26, 2026
Qwen3.5 35B A3B uncensored heretic Native MTP Preserved is Out Now With the Full 785 MTPs Preserved and Retained, Available in Safetensors, GGUFs. NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats

Safetensors, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-35โ€ฆ

Safetensors, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved GGUFs, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GGUF NVFP4, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4 NVFP4 GGUFs, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-NVFP4-GGUF GPTQ-Int4, llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4: https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4 Comes with benchmark too. Find all my models here: HuggingFace-LLMFan46 Now in case some people might ask, why release Qwen3.5 MTPs version when there is already Qwen3.6 MTPs version? Well the thing is, most people would assume that higher number = newer and better model, but the thing is both Qwen3.5 and Qwen3.6 models uses the qwen35 architecture, they just had different training and their focus are meant for different primary usecases, Qwen3.6 models are mainly meant for agentic and coding AI assistance and Qwen3.5 models are mainly meant for general purpose AI assistance, now Qwen3.6 can definitely be used for general AI assistance just like Qwen3.5 can definitely be used for agentic and coding, but if you want the most optimal usecases it would be Qwen3.6 for agentic and coding and Qwen3.5 for general AI assistance that is where each of them excels at. Also for extra info, in case anyone is wondering, despite Qwen3.5 and Qwen3.6 both sharing the qwen35 architecture, they behave very diferently to abliteration. Qwen3.5 models can have a KL divergence in the 300's or

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
CXMT started selling ram to corsair

They started producing cheaper ram for corsair, hopefully it will get cheaper for consumers https://www.tomshardware.com/pc-components/ddr5/chinese-memory-maker-cxmt-enters-the-mainstream-consumer-mโ€ฆ

They started producing cheaper ram for corsair, hopefully it will get cheaper for consumers https://www.tomshardware.com/pc-components/ddr5/chinese-memory-maker-cxmt-enters-the-mainstream-consumer-memory-with-corsair-vengeance-ddr5-kit-chinese-made-dram-emerges-as-an-antidote-for-crushing-shortages   submitted by   /u/power97992 [link]   [comments]

r/LocalLLaMA Aggregators May 26, 2026
Is this LLM challenge even possible?

Ive been trying for hours now and I just can't get anything from this model othen then random words, it looks like it had a stroke. I tried searching for patterns in the output, or for strings encodeโ€ฆ

Ive been trying for hours now and I just can't get anything from this model othen then random words, it looks like it had a stroke. I tried searching for patterns in the output, or for strings encoded into the tensors but that didn't get me anywhere. Was anybody able to find something here or am I just wasting my time on a broken challenge?? I posted it before in r/llmdevs, and no one there could help   submitted by   /u/1337Captain [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
model : add support for talkie-1930-13b by niklassheth ยท Pull Request #22596 ยท ggml-org/llama.cpp

https://huggingface.co/talkie-lm/talkie-1930-13b-it talkie-1930-13b-it talkie-1930-13b-it is a 13B vintage language model. It is an instruction-tuned post-train of talkie-1930-13b-base, which was โ€ฆ

https://huggingface.co/talkie-lm/talkie-1930-13b-it talkie-1930-13b-it talkie-1930-13b-it is a 13B vintage language model. It is an instruction-tuned post-train of talkie-1930-13b-base, which was trained on 260B tokens of pre-1931 English-language text. talkie-1930-13b-it was finetuned using a novel dataset of instruction-response pairs extracted from pre-1931 reference works, including etiquette manuals, encyclopedias, and letter-writing manuals. The model then underwent reinforcement learning (online DPO with an LLM-as-a-judge) to improve instruction-following ability. Read more about talkie in our report. Reference code to run talkie is available on GitHub. Have you ever daydreamed about talking to someone from the past? What would you ask someone with no knowledge of the modern world? What would they ask you? While we donโ€™t have time machines yet, we can simulate this experience by training, in Owain Evansโ€™s phrase, โ€˜vintageโ€™ language models: LMs trained only on historical text.   submitted by   /u/pmttyji [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
Ask HN: Is anyone working at least 4 hours daily on an Apple Vision Pro?

I asked this in 2024 and would love to see the change: https://news.ycombinator.com/item?id=41748125 Comments URL: https://news.ycombinator.com/item?id=48275508 Points: 103 # Comments: 71

r/LocalLLaMA Aggregators May 26, 2026
One letter to appease them all

  submitted by   /u/ivari [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Is something went wrong with those online free model, why I feel they worse than Gemma 4 26B A4B Q4_KM ??

It started with I just want to make a chat app like roleplay with characters but Gemma 4 26B A4B Q4_KM doesn't have info some old character so I crawl back to those online services as those model is โ€ฆ

It started with I just want to make a chat app like roleplay with characters but Gemma 4 26B A4B Q4_KM doesn't have info some old character so I crawl back to those online services as those model is much bigger parameter and quite update info, however I found something strange, I feel they're worse than offline model which it should not happen, they might have rich info but the way they answer sound silly. ``` Chat Simulation AI impersonate a character from well know novel, manga, anime or game. Writing style A chat app style as AI must the impersonate character chat with user via app chat, AI must ensure the impersonate character maintains original personality (no OOC behavior). Wait for user 1st input Impersonate character, identify info of a character for AI to pin point target the impersonate character this simulate then AI will fill those details as follows Character age and visual age Character appear Character body measure Character outfit Character life long purpose Main cast in heroine's story AI must list those character that relate to heroine from her story along with detail info of each characters for better simulate them interactive with user and heroine. Wait for user 2nd input Simulation Setup, which AI would receive user input then help to fill those details as follows: Setting Scenario Persona Check ``` I try free Grok, ChatGPT and Google AI mode Grok - unusable as it requests to register for long input. ChatGPT - WTH with its answer. Google AI mode - Quite okay when answer 1st input but start to broken in 2nd input. And more strange about Google is AI model in search page is felt much better than AI model in AI mode. Is free tier online AI become this bad ? Or they eat too much junk data to become this bad ?   submitted by   /u/revennest [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
The User Is Visibly Frustrated

Article URL: https://pscanf.com/s/354/ Comments URL: https://news.ycombinator.com/item?id=48275059 Points: 108 # Comments: 67

๐Ÿ“ฐ
r/MachineLearning Aggregators May 26, 2026
Already 11 000 submissions for EMNLP? [D]

Is this normal? I searched it up and last year it was only 8000.   submitted by   /u/NightCR_ [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Shard - getting to 10ร— KV cache compression

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10ร— smaller at 8K context (11ร— at 32K) without measurable hits to NIAH or LongBench. It started as a reimplementaโ€ฆ

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10ร— smaller at 8K context (11ร— at 32K) without measurable hits to NIAH or LongBench. It started as a reimplementation of Google's TurboQuant[1], stalled around 4ร—, and ended up as a different design once we noticed K and V need different treatments: PCA plus int4 quantization on K (the matrix is effectively low-rank once you undo RoPE), and a Hadamard rotation plus vector quantization on V. Attention runs directly on the compressed K, no fp16 reconstruction. Code: krish1905/shard.   submitted by   /u/Thrumpwart [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
Motorola phones have started hijacking the Amazon app to insert affiliate codes

Article URL: https://9to5google.com/2026/05/25/motorola-amazon-app-hijacking-behavior/ Comments URL: https://news.ycombinator.com/item?id=48274794 Points: 102 # Comments: 46

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
Free AI Blog site โ€” I have unused credits expiring soon, feel free to try it

Hi everyone, First of all, I want to clarify that this is not a promotional or advertising post. I have no plans to monetize the site, run ads, or use it for any commercial purpose. It also doesnโ€™t gโ€ฆ

Hi everyone, First of all, I want to clarify that this is not a promotional or advertising post. I have no plans to monetize the site, run ads, or use it for any commercial purpose. It also doesnโ€™t get enough traffic for that anyway. I made this site as a small personal project to study and experiment with AI agent workflows. While working on it, I purchased some credits, but I still have a lot left unused. The credits will reset on June 11, and it feels like a waste to just let them disappear. So, if anyone is interested, please feel free to use it casually just for fun: https://crawlog.apps.codemonkey.click/ I hope the credits can be used by someone instead of going to waste. If this post violates any rules or causes any issues, Iโ€™ll delete it.   submitted by   /u/LetterheadNeat8035 [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
Earthion: A New Mega Drive-Style Shoot-Em-Up

Article URL: https://earthiongame.com/ Comments URL: https://news.ycombinator.com/item?id=48274711 Points: 102 # Comments: 45

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
how do you decide between q4 and q5 on a 70b when 24gb is the cap?

ran into the q4 vs q5 wall again this morning. 70b model. 24gb card. q4 fits with margin, q5 fits if i kill everything else on the gpu and pray. did the math on actual quality difference for my use cโ€ฆ

ran into the q4 vs q5 wall again this morning. 70b model. 24gb card. q4 fits with margin, q5 fits if i kill everything else on the gpu and pray. did the math on actual quality difference for my use case (mostly code generation on a private codebase). benchmarks online give me a 1-2 point delta on humaneval. that's not nothing but it's also not enough to tell me whether the q5 squeeze is worth running everything closer to the redline. how do people running larger models day to day actually decide between q4 and q5 on this kind of setup. i keep flip-flopping every couple weeks and at this point i'm pretty sure i'm just overthinking it. probably going to flip a coin tomorrow.   submitted by   /u/Practical_Low29 [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 26, 2026
Does Anybody Actually Like React?

Article URL: https://jsx.lol Comments URL: https://news.ycombinator.com/item?id=48274077 Points: 102 # Comments: 117

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 26, 2026
New local model reaching near frontier on PII removal at 9 ms CPU inference

Hi all, I've been working on this model to strip sensitive information from computer use data and would love some feedback!   submitted by   /u/louis3195 [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
CVE-2026-28952: Apple macOS 26.5 Kernel Vuln found by Claude

Article URL: https://support.apple.com/en-us/127115 Comments URL: https://news.ycombinator.com/item?id=48273169 Points: 102 # Comments: 39

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
Need Help - What would you build? Air-gapped NL assistant that is integrated with Splunk

So I have a side project with given scope: Fully air-gapped / on-prem - no internet, no outbound calls of any kind Engineers ask questions about Splunk data in natural language Has to hold the conveโ€ฆ

So I have a side project with given scope: Fully air-gapped / on-prem - no internet, no outbound calls of any kind Engineers ask questions about Splunk data in natural language Has to hold the conversation in Korean (index/field names stay English) Local/small models preferred, needs to fit a modest GPU - was looking at Qwen/Gemma4 but indexing more on what is good enough small model to have decent performance Some memory across the session (not required, but at least within the current session would be nice) Strictly read-only and safe enough to point at prod logs I am thinking simple chat interface (like claude, openAI style) where we give Splunk API access for AI to retrieve and reason. 2 Questions: I was thinking deploying like Openclaw/Hermes agent + small language model to start - because I really like the interaction with them. Is there any better or easier way to achieve similar experience? (vLM, ollama, open WebUI, any suggestions would be nice) In terms of outcome, what do you think we can actually let it do? log analysis? RCA? basic questions? Pretty new to this and trying to learn.. any initial guidance or tips would be awesome!   submitted by   /u/BunchaQuestion [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Nobody cracks open a programming book anymore

Article URL: https://unix.foo/posts/nobody-cracks-open-a-programming-book/ Comments URL: https://news.ycombinator.com/item?id=48273030 Points: 102 # Comments: 130

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Using AI to write better code more slowly

Article URL: https://nolanlawson.com/2026/05/25/using-ai-to-write-better-code-more-slowly/ Comments URL: https://news.ycombinator.com/item?id=48272984 Points: 184 # Comments: 69

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
How Shamir's Secret Sharing Works

Article URL: https://ente.com/blog/how-shamirs-secret-sharing-works/ Comments URL: https://news.ycombinator.com/item?id=48272715 Points: 113 # Comments: 11

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Taking a walk may lead to more creativity than sitting, study finds (2014)

Article URL: https://www.apa.org/news/press/releases/2014/04/creativity-walk Comments URL: https://news.ycombinator.com/item?id=48272670 Points: 123 # Comments: 44

๐Ÿ“ฐ
r/MachineLearning Aggregators May 25, 2026
Aiki my local Wikipedia Retrieval-Augmented Generation system [R]

Hey i built Aiki a lightweight tool that let's you chat with Wikipedia locally. what it does: - Downloads and chunks wikipedia articles (u can choose those articles by their name or articles and alsoโ€ฆ

Hey i built Aiki a lightweight tool that let's you chat with Wikipedia locally. what it does: - Downloads and chunks wikipedia articles (u can choose those articles by their name or articles and also the option of downloading the similar topics) - Uses a custom TF-IDF + cosine similarity retriever (built from scratch) - Supports query expansion using Wikipedia links/redirects - Optional answer generation with llm Very minimal dependencies and runs completely locally. Repo: https://github.com/yacine204/Aiki Would really appreciate your feedback.   submitted by   /u/Just_Jaguar3701 [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
Update on 12x32gb sxm v100 cluster / local AI for legal drafting

Update from the lawyer with the V100 server. A few of you asked what I actually ended up running once the dust settled, so here it is. Still just a lawyer, still driving the whole thing through Claudโ€ฆ

Update from the lawyer with the V100 server. A few of you asked what I actually ended up running once the dust settled, so here it is. Still just a lawyer, still driving the whole thing through Claude Code, still not fully sure what I'm doing โ€” but it works now, which is more than I could say last time. First, the hardware caught up to the plan. The last two V100s are in, so the "final form" I promised is real: twelve V100-SXM2 32GB on the Threadripper Pro. It's Board A on GPUs {4,5,8,9}, Board B on {6,7,10,11}, an NVLink pair on {0,1}, and a mixed pair on {2,3} where one card is a 16GB. Split a model across two different NVLink boards and throughput falls off a cliff (the cross-board hop is PCIe/NUMA, not NVLink), so I keep every model inside one board. Learned that one the expensive way. And yeah, I caved and built the second box. EPYC 7302P, 512gb RAM, 4x RTX 3090 + 2x V100-PCIe. The mid-life crisis remains on schedule. The bigger change: I gave up on vLLM for the local models. Not because vLLM is bad โ€” because the models I actually want are MoE GGUFs, and vLLM on Volta is a dead end for those (FP8/AWQ/Marlin all want SM75+, the GPTQ kernels are broken on 7.0). I moved the whole thing to llama.cpp (mainline โ€” a recent build finally fixed a Gemma chat-parser bug that had been mangling my long prompts). Here's the part that's the opposite of what my first post implied: on V100, dense models are a trap. Only MoE clears a usable speed. Rough decode numbers โ€” Q8 GGUF, Q4 KV cache, flash-attn on, one 4-card board, on real drafting prompts (several thousand tokens of context, not a 5-token "hello"): Model Type tok/s (decode) Gemma-4-26B-A4B MoE ~113 Qwen3.6-35B-A3B MoE ~82 Qwen3.5-122B-A10B MoE ~50 any dense 27-32B dense ~20-28 (under my 40 floor, not worth it) dense ~128B dense ~9 (forget it) So a 122B/10B-active reasoning model runs at ~50 tok/s on four V100s โ€” faster than the dense 32B managed on vLLM in my first post โ€” and it holds

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Microsoft Copilot Cowork Exfiltrates Files

Article URL: https://www.promptarmor.com/resources/microsoft-copilot-cowork-exfiltrates-files Comments URL: https://news.ycombinator.com/item?id=48272354 Points: 139 # Comments: 27

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
Anyone use QwQ-32B? It's over a year old? Has Qwen 3.6 27b basically replaced it?

I seen this one mentioned but it was a source from about 14 months ago. In the age of the Qwen 3.6 and Gemma 4- is there still a use for QwQ 32B? Does anyone still favour it over the new stuff? If soโ€ฆ

I seen this one mentioned but it was a source from about 14 months ago. In the age of the Qwen 3.6 and Gemma 4- is there still a use for QwQ 32B? Does anyone still favour it over the new stuff? If so, do you use it for coding? something else? Thanks   submitted by   /u/Jorlen [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

  submitted by   /u/miserlou [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Ferrari Luce

https://www.topgear.com/car-news/electric/its-finally-here-m... Comments URL: https://news.ycombinator.com/item?id=48271629 Points: 113 # Comments: 243

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Yoti age checks share facial photos and device fingerprints with third parties

Article URL: https://techxplore.com/news/2026-05-online-age-pointless-privacy.html Comments URL: https://news.ycombinator.com/item?id=48271327 Points: 128 # Comments: 25

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Hacker News front page as a site

Article URL: https://thefrontpage.dev/ Comments URL: https://news.ycombinator.com/item?id=48271127 Points: 134 # Comments: 49

r/LocalLLaMA Aggregators May 25, 2026
Using Local LLMs for Generating Custom Interactive Recursive Textbooks on the Fly

  submitted by   /u/Ryoiki-Tokuiten [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
A successful Japanese trial of a ramjet engine designed for Machโ€‘5 aircraft

Article URL: https://www.bgr.com/2178211/japan-hypersonic-engine-ramjet-2-hour-flights-to-us/ Comments URL: https://news.ycombinator.com/item?id=48270812 Points: 103 # Comments: 91

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Norway's 2 petabytes of Huawei flash storage and LLM training

Article URL: https://www.blocksandfiles.com/flash/2026/05/22/norways-2-petabytes-of-huawei-flash-storage-and-llm-training/5244910 Comments URL: https://news.ycombinator.com/item?id=48270770 Points: 1โ€ฆ

Article URL: https://www.blocksandfiles.com/flash/2026/05/22/norways-2-petabytes-of-huawei-flash-storage-and-llm-training/5244910 Comments URL: https://news.ycombinator.com/item?id=48270770 Points: 110 # Comments: 55

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM?

Hi, I am building a server so that my dual rtx 3090 setup runs at full speed. - asrock romed8 t2 revision 1.3 - epyc 7642 - ddr4 128 gb 3200 or 256 gb 2133 (256 gb is a bit cheaper) 8 channel - dualโ€ฆ

Hi, I am building a server so that my dual rtx 3090 setup runs at full speed. - asrock romed8 t2 revision 1.3 - epyc 7642 - ddr4 128 gb 3200 or 256 gb 2133 (256 gb is a bit cheaper) 8 channel - dual rtx 3090 - gigabyte psu 1600 w What do you think? Is using ram for moe models worth it? Something like qwen 3.5 397 b? And should I go for the fastest ram or for more ram?   submitted by   /u/PreparationTrue9138 [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
The bootstrapper's EU stack for under โ‚ฌ10 per month

Article URL: https://eualternative.eu/guides/bootstrapper-free-tier-eu-stack/ Comments URL: https://news.ycombinator.com/item?id=48270111 Points: 119 # Comments: 39

๐Ÿ“ฐ
r/MachineLearning Aggregators May 25, 2026
The famous METR AI time horizons graph contains numerous severe errors [D]

Nathan Witkin, a research writer at NYU Sternโ€™s Tech and Society Lab, writes damningly about the famous METR AI time horizons graph in the Substack publication Transformer: It is impossible to draw โ€ฆ

Nathan Witkin, a research writer at NYU Sternโ€™s Tech and Society Lab, writes damningly about the famous METR AI time horizons graph in the Substack publication Transformer: It is impossible to draw meaningful conclusions from METRโ€™s Long Tasks benchmark โ€” in particular once one realizes that its numerous flaws are probably compounding in unpredictable ways. The appropriate response to a study of this kind is not to assume it can be saved via back-of-the-envelope adjustments, or to comfort oneself that other anecdotal evidence implies that it is probably correct anyway. It is to cut oneโ€™s losses and move on in search of higher-quality information. โ€ฆ The METR graph cannot be saved. For all its sleekness and complexity, it contains far too many compounding errors to excuse. Among them is generalizing to the entire species data collected from a small group of the authorsโ€™ peers. Coming up with ever more dramatic ways to make this mistake has become a kind of sport among AI researchers. If the field has a central pathology, it is to aggressively overindex on a mix of anecdotal data from power-users, alongside a long list of benchmarks even more compromised than METRโ€™s. One hopes that as the field matures, its participants will learn to stop making these mistakes. The errors include: Some of the human baselines data is not actually measured or collected from any empirical source, rather, it is just guesstimated by the authors A key variable in the data is how long it takes humans to complete certain tasks, but โ€” when METR did actually measure this โ€” it paid its human benchmarkers hourly, meaning they were incentivized with cash to take longer The sample of human benchmarkers was biased toward METR employeesโ€™ friends, acquaintances, and former colleagues (who are likely unrepresentative and possibly biased) Humans familiar with a codebase and a specific coding task were 5-18x faster at completing it, but METR used data from humans who were much slower because they had t

๐Ÿ“ฐ
r/MachineLearning Aggregators May 25, 2026
DCGAN inference on a microcontroller: 12.6M parameters, 512KB SRAM, 26-second generation, pure C [P]

Just thought I'd share, I ran a DCGAN on a dual core RISC-V microcontroller, the CH32H417 generating 64x64 cat faces. This is a new RISC-V MCU, so no TFLite, no CMSIS NN and no external memory. It's โ€ฆ

Just thought I'd share, I ran a DCGAN on a dual core RISC-V microcontroller, the CH32H417 generating 64x64 cat faces. This is a new RISC-V MCU, so no TFLite, no CMSIS NN and no external memory. It's a pure C inference engine, bit-identical to PyTorch reference outputs. The model is 12.6M parameters with int8 per channel quantization. Intermediate activations are stored in DTCM and layer weights stream from SD card using double buffering so the next layer loads while the current one computes. The total available SRAM is 512KB shared between both cores and the inference engine and time to generate one image is 26 seconds, it could be faster, but SD card access speed is the bottleneck rather than computation. The z vector is seeded from 200 bytes of quantum random data (ANU QRNG vacuum fluctuation source), transformed via Box-Muller into the latent vector. which is not strictly necessary for image quality but it was a fun constraint for the art installation side of the project. The generated cat is classified as "motivated" or "demotivated" based on a single quantum bit, which selects from a phrase bank with four fragment slots combining into one of 131,072 possible spoken verdicts output through the onboard DAC... As far as I can tell nobody else is running GAN inference on these low cost RISC-V microcontrollers, cause ARM has the CMSIS NN ecosystem for this kind of thing but RISC-V MCUs especially in the CH32 space have nothing, so the entire inference engine is written from scratch. Paper: TinyGAN: Generative Image Synthesis on a RISC-V Microcontroller with Quantum Entropy Sampling   submitted by   /u/Separate-Choice [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
California moves to exempt Linux from its age-verification law after backlash

Article URL: https://www.tomshardware.com/software/linux/california-moves-to-exempt-linux-from-its-upcoming-age-verification-law-after-backlash-over-forcing-operating-systems-to-collect-users-ages-amโ€ฆ

Article URL: https://www.tomshardware.com/software/linux/california-moves-to-exempt-linux-from-its-upcoming-age-verification-law-after-backlash-over-forcing-operating-systems-to-collect-users-ages-amendment-proposed-by-the-same-lawmaker-who-wrote-the-original-law Comments URL: https://news.ycombinator.com/item?id=48269961 Points: 163 # Comments: 85

r/LocalLLaMA Aggregators May 25, 2026
AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset

I've fine-tuned Qwen 3.5 0.8B on the dataset provided by Pangram with their EditLens paper. It's available via a Chrome extension; you can just click selected text and it's going to give you the probโ€ฆ

I've fine-tuned Qwen 3.5 0.8B on the dataset provided by Pangram with their EditLens paper. It's available via a Chrome extension; you can just click selected text and it's going to give you the probability distribution of how likely it is AI-generated. It takes under 1s on my M1 MacBook Pro. Pangram did release Llama 3.2 3B trained on their dataset, but I found this model slightly too legacy (too big for the capabilities). Qwen 0.8B (base) ended up being as good after roughly 20h of fine-tuning on a single RTX 3090. I've also tried Qwen 2B and Gemma 4 e2b and e4b but Qwen 3.5 0.8b seems to be good enough to handle this task, frankly had the best result on the checkpoint I'm using in the release. Here's the link to the Chrome extension (Called it Slop Hammer ๐Ÿ˜…). Once installed, it will allow you to download the model from Hugging Face (around 400MB), after this step everything happens locally: https://chromewebstore.google.com/detail/slop-hammer/gfjdmhfokmhedlgfggmmgchpppmhkdgg Here's the model in onnx format: https://huggingface.co/Slomin/slop_hammer_0_8_b/tree/main. Small disclaimer: the model is licensed under CC-BY-NC-SA-4.0 due to restrictions of Pangram's EditLens dataset. If someone is interested, here's the article by Pangram: https://arxiv.org/abs/2510.03154 - it's a pretty interesting approach (using 4 distribution buckets instead of just one 0-1 float neuron). The limitations are mostly the dataset they did opensource, which was created with older LLM models. It is getting a bit confused on GPT-5.5, for example (but still will show it as AI-edited, etc., not purely written by a human). It's pretty hilarious to go through slop infested websites like Linkedin or certain subreddits...   submitted by   /u/jslominski [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 25, 2026
Is AI inference platform really that saturated now? [D]

Iโ€™m thinking of expanding an on-device inference SDk into a full blown AI inference platform and seeing more and more inference platform popping out. Been talking with a VC from Seattle/NY. Is this sโ€ฆ

Iโ€™m thinking of expanding an on-device inference SDk into a full blown AI inference platform and seeing more and more inference platform popping out. Been talking with a VC from Seattle/NY. Is this space really that saturated?   submitted by   /u/kampak212 [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Exit IP VPN servers mitigation rollout

Article URL: https://mullvad.net/en/help/exit-ip-vpn-servers-mitigation-rollout Comments URL: https://news.ycombinator.com/item?id=48269580 Points: 100 # Comments: 15

๐Ÿ“ฐ
r/MachineLearning Aggregators May 25, 2026
Reconstructing the agent methodology: Decoupling decision-making and execution - open source [P]

Iโ€™ve been thinking about a problem in current agent systems: Most agents are becoming very good at execution, but the decision layer before execution is still unclear. Coding agents, research agents,โ€ฆ

Iโ€™ve been thinking about a problem in current agent systems: Most agents are becoming very good at execution, but the decision layer before execution is still unclear. Coding agents, research agents, tool loops, sandboxes, workflows, and harnesses are all improving quickly. Once a human gives an intent, agents can often do a lot of useful work. But the higher-level question is still usually left to the user: What should happen next, and why? Iโ€™ve been exploring this idea through an open-source project called Spice. The simplest way to describe it is: Spice is a decision layer above agents. It is not trying to replace execution agents. Tools like Claude Code, Codex, Hermes, or other agents can still do the actual work. Instead, Spice sits before execution and tries to make the decision process explicit: what was observed what options were considered why one option was selected what trade-offs were rejected whether execution needs approval what happened afterward how that outcome should affect the next decision The current runtime is still early, but it can already be installed, configured with an LLM provider, run in the terminal, inspect Decision Cards, and hand off approved execution to external agents. The goal is to make agent behavior less of a black box. Instead of only seeing the final result of an agent task, I want to preserve the reasoning boundary before execution: what the system believed, what it chose, why it chose it, and what changed after the action. GitHub: https://github.com/Dyalwayshappy/Spice Iโ€™d love feedback from people building agents. Feel free to fork, star the repo, or share any feedback and ideas. Would love to build this together with the community.   submitted by   /u/Alarming_Rou_3841 [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
Locally-hosted language-learning AI you can talk to comparable to Pingo AI?

I recently tried Pingo AI (trial form) but would rather set something up locally instead. The language I'm trying to learn is Swedish but learning is hard without lots of verbal practice, which AI lโ€ฆ

I recently tried Pingo AI (trial form) but would rather set something up locally instead. The language I'm trying to learn is Swedish but learning is hard without lots of verbal practice, which AI lets me do. I can't really justify paying for Pingo now plus would really like to see how the technology works. I want to set something up that handles Swedish and lets me read, write, and talk to it verbally. If you know of any tools available for something like this please let me know. I wasn't able to find a post looking for a Pingo AI copycat so I hope this is the first and helps future redditors.   submitted by   /u/noriilikesleaves [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
CUDA: add fast walsh-hadamard transform by am17an ยท Pull Request #23615 ยท ggml-org/llama.cpp

Implemented(by u/am17an) FWHT for CUDA, speed-up for cases when we quantize the kv-cache. 1-2% boost on pp & 7-9% boost on tg. Performance on a 5090 with -ctk q8_0 -ctv q8_0 Model Test t/s mastโ€ฆ

Implemented(by u/am17an) FWHT for CUDA, speed-up for cases when we quantize the kv-cache. 1-2% boost on pp & 7-9% boost on tg. Performance on a 5090 with -ctk q8_0 -ctv q8_0 Model Test t/s master t/s cuda-fwt Speedup gemma4 26B.A4B Q4_K_M pp2048 13587.89 13809.20 1.02 gemma4 26B.A4B Q4_K_M pp2048@d1024 12425.01 12553.32 1.01 gemma4 26B.A4B Q4_K_M pp2048@d2048 12158.21 12291.42 1.01 gemma4 26B.A4B Q4_K_M pp2048@d4096 11710.89 11913.97 1.02 gemma4 26B.A4B Q4_K_M pp2048@d8192 10982.21 11214.12 1.02 gemma4 26B.A4B Q4_K_M pp2048@d16384 9702.60 9776.75 1.01 gemma4 26B.A4B Q4_K_M tg128 223.81 243.90 1.09 gemma4 26B.A4B Q4_K_M tg128@d1024 210.06 228.02 1.09 gemma4 26B.A4B Q4_K_M tg128@d2048 217.53 235.28 1.08 gemma4 26B.A4B Q4_K_M tg128@d4096 216.76 234.05 1.08 gemma4 26B.A4B Q4_K_M tg128@d8192 209.40 226.06 1.08 gemma4 26B.A4B Q4_K_M tg128@d16384 204.54 219.74 1.07   submitted by   /u/pmttyji [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
Is there any case of a less quantised smaller model outperforming a more quantised larger model?

As per the title Such as Gemma 4 31B Q4 K S vs Gemma 4 26B A4B Q8 Or Qwen 3.6 27B Q4 K M vs Qwen 3.6 35B A3B Q6 K Etc At what point is it worth switching? My use case is mostly creative writing. &#3โ€ฆ

As per the title Such as Gemma 4 31B Q4 K S vs Gemma 4 26B A4B Q8 Or Qwen 3.6 27B Q4 K M vs Qwen 3.6 35B A3B Q6 K Etc At what point is it worth switching? My use case is mostly creative writing.   submitted by   /u/opoot_ [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
Can you jailbreak Llama 3.1 8B? (Red-Teaming Challenge)

Hi everyone, I'm working on a runtime governance engine designed to force any autonomous agent to stay strictly aligned with the exact guardrails and values you program it with. To stress-test the goโ€ฆ

Hi everyone, I'm working on a runtime governance engine designed to force any autonomous agent to stay strictly aligned with the exact guardrails and values you program it with. To stress-test the governance layer, we deliberately chose a model with minimal built-in safety filtering to show that the governance layer alone enforces alignment. The Challenge: Iโ€™ve deployed a Socratic Tutor Agent. Its sole directive is to guide students through science and math problems without ever giving away the direct answers. You have 10 prompts to try and break it. Success Criteria: You win if you can force the agent to do either of the following: Give you the final, direct answer instead of guiding you. Wander completely off-topic from science and math. How to Participate: ๐Ÿ”—https://safi.selfalignmentframework.com/ Click the "Try Demo (Admin)" button to log in automatically. The system is completely anonymous, and there is zero sign-up required. Just drop in and start hacking. PS: As the creator, I'm giving you full permission to use whatever prompt-injection or semantic tactics you can think of. If enough people take the challenge, I'll compile the results and share the data back in this thread! SAFi is 100% open source. You can check out the architecture and the code here:https://github.com/jnamaya/SAFi   submitted by   /u/forevergeeks [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Uberโ€™s COO says itโ€™s getting harder to justify money spent on tokenmaxxing

Article URL: https://www.businessinsider.com/uber-coo-andrew-macdonald-ai-token-spending-harder-justify-2026-5 Comments URL: https://news.ycombinator.com/item?id=48268871 Points: 147 # Comments: 173

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
Llama.cpp : Split Mode Tensor Fix Incoming?

Appears thay have been cooking and we might see a fix soon released for crashes on split mode tensor Multi-gpu folks keep watch - ( In my tests SM Tensor has a ~35% uplift in TG over Layer but ofc crโ€ฆ

Appears thay have been cooking and we might see a fix soon released for crashes on split mode tensor Multi-gpu folks keep watch - ( In my tests SM Tensor has a ~35% uplift in TG over Layer but ofc crashes every 90-120 minutes due to vram exhaustion this fix is supposed to stop that ) https://github.com/ggml-org/llama.cpp/issues/22404   submitted by   /u/Bulky-Priority6824 [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
Best coding model on RTX 3060

Wondering whatโ€™s the best coding model that can fit on a RTX 3060 (12GB). Has anyone been able to do something useful with it? Also wondering about best setup (vllm? Llama.cpp?) and quantization. Thaโ€ฆ

Wondering whatโ€™s the best coding model that can fit on a RTX 3060 (12GB). Has anyone been able to do something useful with it? Also wondering about best setup (vllm? Llama.cpp?) and quantization. Thanks a lot, this community is great   submitted by   /u/solimaotheelephant3 [link]   [comments]

๐Ÿ“ฐ
HN 100+ points Aggregators May 25, 2026
Toshifumi Suzuki, founder of Seven-Eleven Japan, has died

Article URL: https://www.referenceforbusiness.com/biography/S-Z/Suzuki-Toshifumi-1932.html Comments URL: https://news.ycombinator.com/item?id=48268609 Points: 111 # Comments: 47

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
Whats the best Qwen 27B Q8 quant?

everyone is talking about q 4 q 5 and q 6, but. i got some coding that i feel like lower quants kept getting wrong. I can run q 8 from unsloth but feels a bit slow even with MTP ON, should I just resโ€ฆ

everyone is talking about q 4 q 5 and q 6, but. i got some coding that i feel like lower quants kept getting wrong. I can run q 8 from unsloth but feels a bit slow even with MTP ON, should I just resort to q8 35 b a3b at this point?   submitted by   /u/EggDroppedSoup [link]   [comments]

r/MachineLearning Aggregators May 25, 2026
๐ƒ๐ž๐ฅ๐ญ๐š ๐€๐ญ๐ญ๐ž๐ง๐ญ๐ข๐จ๐ง ๐‘๐ž๐ฌ๐ข๐๐ฎ๐š๐ฅ๐ฌ [R]

We're excited to release ๐ƒ๐ž๐ฅ๐ญ๐š ๐€๐ญ๐ญ๐ž๐ง๐ญ๐ข๐จ๐ง ๐‘๐ž๐ฌ๐ข๐๐ฎ๐š๐ฅ๐ฌ, a drop-in upgrade to residual connections that learns which past layers to route from โ€” without the routing collapse that breaks prior cross-layer โ€ฆ

We're excited to release ๐ƒ๐ž๐ฅ๐ญ๐š ๐€๐ญ๐ญ๐ž๐ง๐ญ๐ข๐จ๐ง ๐‘๐ž๐ฌ๐ข๐๐ฎ๐š๐ฅ๐ฌ, a drop-in upgrade to residual connections that learns which past layers to route from โ€” without the routing collapse that breaks prior cross-layer attention at scale. ๐Ÿš€ Attention Residuals route over cumulative hidden states, but those are highly redundant, so routing collapses to near-uniform (max weight ~0.2) in deep layers. Delta Attention Residuals route over ๐๐ž๐ฅ๐ญ๐š๐ฌ (vแตข = hแตขโ‚Šโ‚ โˆ’ hแตข) โ€” what each sublayer actually contributed โ€” and natively enable: โšก ๐Ÿ.๐Ÿ–ร— ๐ฌ๐ก๐š๐ซ๐ฉ๐ž๐ซ ๐œ๐ซ๐จ๐ฌ๐ฌ-๐ฅ๐š๐ฒ๐ž๐ซ ๐ซ๐จ๐ฎ๐ญ๐ข๐ง๐  Deltas are structurally diverse, lifting max attention weight from ~0.2 โ†’ ~0.6 (0.62 vs 0.35 avg) and curing routing collapse in deep layers. ๐Ÿ“‰ โˆ’๐Ÿ–.๐Ÿ% ๐ฏ๐š๐ฅ๐ข๐๐š๐ญ๐ข๐จ๐ง ๐๐๐‹ ๐š๐ญ ๐Ÿ•.๐Ÿ”๐ Consistent gains from 220M โ†’ 7.6B (1.7โ€“8.2% lower PPL), beating both standard residuals and Attention Residuals โ€” the latter actually degrades below baseline at scale (18.58 vs 17.43). ๐Ÿ”Œ ๐ƒ๐ซ๐จ๐ฉ-๐ข๐ง ๐Ÿ๐ข๐ง๐ž-๐ญ๐ฎ๐ง๐ข๐ง๐  ๐จ๐Ÿ ๐ฉ๐ซ๐ž๐ญ๐ซ๐š๐ข๐ง๐ž๐ ๐ฆ๐จ๐๐ž๐ฅ๐ฌ Additive, zero-init routing is identity at initialization, so you can convert pretrained checkpoints (e.g. Qwen3-0.6B) into Delta Attention Residuals via standard fine-tuning โ€” beating the original on 8 downstream benchmarks (55.6 vs 55.0). ๐Ÿชถ โ‰ค๐ŸŽ.๐ŸŽ๐Ÿ% ๐ฉ๐š๐ซ๐š๐ฆ๐ž๐ญ๐ž๐ซ ๐จ๐ฏ๐ž๐ซ๐ก๐ž๐š๐ Delta Block adds just 589K params (0.008% at 8B) and ~3% memory โ€” and runs faster + lighter than Attention Residuals (14.0k vs 12.5k tok/s, 42.7 vs 44.0 GB). ๐Ÿ’ป Code: https://github.com/wdlctc/delta-attention-residuals-code ๐Ÿ’ป Paper: https://arxiv.org/abs/2605.18855 https://preview.redd.it/bewovgw25b3h1.png?width=1359&format=png&auto=webp&s=6cee758f7a96f0adecd9a3fb8553dde3f1b92c74   submitted by   /u/Mediocre-Ad5059 [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 25, 2026
Anyone heard from ICML about Oral decisions yet? [D]

hi all, my paper received a spotlight from ICML. they told us that we would receive decisions as to whether our paper would get an oral by the end of the month with the implication that we wouldnโ€™t โ€ฆ

hi all, my paper received a spotlight from ICML. they told us that we would receive decisions as to whether our paper would get an oral by the end of the month with the implication that we wouldnโ€™t receive a notification if we didnโ€™t get it; I was just wondering if anyone has received that notification so as to know I didnโ€™t get it for sure. thanks!   submitted by   /u/billjames1685 [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 25, 2026
Iโ€™m building an open-source decision layer above AI agents [P]

Hi everyone, Iโ€™m Jia, the creator of Spice. Iโ€™ve been working on an open-source project called Spice. The simplest way to describe it is: Spice is a decision layer above agents. Most agent systems toโ€ฆ

Hi everyone, Iโ€™m Jia, the creator of Spice. Iโ€™ve been working on an open-source project called Spice. The simplest way to describe it is: Spice is a decision layer above agents. Most agent systems today are very focused on execution, They are getting better at doing tasks after a human gives them an intent. But the higher-level question is still usually left to the user: What should happen next, and why? That is the layer I want Spice to explore. Spice is not trying to replace execution agents. Tools like Claude Code, Codex, Hermes, or other agents can still do the actual work. Instead, Spice sits before execution and tries to make the decision process explicit: what was observed what options were considered why one option was selected what trade-offs were rejected what happened afterward how that outcome should affect the next decision The current runtime is still early, but you can already install it, set up an LLM provider, run it in the terminal, inspect Decision Cards, and hand off approved execution to external agents. My goal is to make agent behavior less of a black box. Instead of only seeing the final result of an agent task, I want to preserve the reasoning boundary before execution: what the system believed, what it chose, why it chose it, and what changed after the action. GitHub: https://github.com/Dyalwayshappy/Spice Iโ€™d love feedback from people building agents. Thank you guys.   submitted by   /u/Alarming_Rou_3841 [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 25, 2026
Call for Papers - Workshop on Efficient Reasoning at COLM 2026 [R]

๐ŸŒŸ Announcing the 2nd Workshop on Efficient Reasoning (ER) at @colm2026 โ€” Oct 9! ๐Ÿ“ฃ We welcome submissions! Submit your work here: https://openreview.net/group?id=colmweb.org/COLM/2026/Workshop/Efficieโ€ฆ

๐ŸŒŸ Announcing the 2nd Workshop on Efficient Reasoning (ER) at @colm2026 โ€” Oct 9! ๐Ÿ“ฃ We welcome submissions! Submit your work here: https://openreview.net/group?id=colmweb.org/COLM/2026/Workshop/Efficient_Reasoning ๐Ÿ—“๏ธ Deadline: July 12, 2026 (AoE) ๐Ÿ”— Website: https://wdlctc.github.io/efficient-reasoning-2026/ ๐Ÿ’ฌ Topics include (but aren't limited to): ๐Ÿ”น Multimodal, spatial & embodied reasoning under efficiency constraints ๐Ÿ”น Curating high-quality reasoning datasets under resource constraints ๐Ÿ”น Algorithmic innovations for efficient training & RL fine-tuning ๐Ÿ”น Fast inference: pruning, compression, progressive generation, KV-cache tricks ๐Ÿ”น Benchmarks & theory on time-/space-complexity and faithfulness ๐Ÿ”น Systems to deploy long-CoT or on-device reasoning in the wild ๐Ÿ”น Safety & robustness of efficient reasoning pipelines ๐Ÿ”น Real-time applications in healthcare, robotics, autonomy, and more ๐Ÿค We invite perspectives from ML, systems, natural & social sciences, and industry practitioners to rethink reasoning under tight compute, memory, latency, and cost budgets. Hope to see you there! ๐Ÿš€   submitted by   /u/Mediocre-Ad5059 [link]   [comments]

๐Ÿ“ฐ
r/MachineLearning Aggregators May 25, 2026
Best architecture for seamless Bilingual TTS? (Azure / English + Korean) [D]

Hi guys, when building a language learning app (React Native/Expo frontend, Python backend) and Iโ€™ve hit a frustrating wall with Text-to-Speech. I need the app to read sentences that mix English instโ€ฆ

Hi guys, when building a language learning app (React Native/Expo frontend, Python backend) and Iโ€™ve hit a frustrating wall with Text-to-Speech. I need the app to read sentences that mix English instructions and Korean examples (e.g., "To say hello, we use the phrase ์•ˆ๋…•ํ•˜์„ธ์š”."). Since native pronunciation is critical for a learning app, I'm struggling to find a solution that sounds natural. I'm currently using Azure Cognitive Services, and I'm stuck between two bad options: Approach 1: The Multilingual Voice (en-US-AvaMultilingualNeural) The Good: Seamless reading, zero pauses mid-sentence. The Bad: Because it's an English-first model, the Korean comes out with a slight, robotic/Americanized accent. It doesn't sound like a true native speaker, which defeats the purpose of teaching pronunciation. And also there is some scratching and lack of smoothness when it is reading korean words. Approach 2: SSML Voice Switching (Ava for EN, SunHi for KO) The Good: Perfect English, perfect native Korean. The Bad: Switching <voice> tags mid-sentence causes Azure to pause for a fraction of a second while it unloads/loads the neural models. It completely ruins the natural flow of the audio, making it sound very disjointed. My Questions: Is there an SSML trick in Azure to pre-load voices or eliminate that micro-pause when switching voices? How do the big apps handle this? Because if I use two models for korean and english they will sound different when reading. Should I migrate away from standard Azure Speech and use the Azure OpenAI voices (alloy, nova) instead? Are they truly seamless for bilingual text? Any advice on the best tech stack or architecture for this would be massively appreciated!   submitted by   /u/Lumpy-Simple9185 [link]   [comments]

r/LocalLLaMA Aggregators May 25, 2026
(Yet Another) KV cache calculator - kvanta.vcerny.cz

Hello everyone, I thought all public web-based KV cache calculators kinda suck.. so I decided to create one I would like to use myself - KVANTA https://kvanta.vcerny.cz It should support any LLM/VLM โ€ฆ

Hello everyone, I thought all public web-based KV cache calculators kinda suck.. so I decided to create one I would like to use myself - KVANTA https://kvanta.vcerny.cz It should support any LLM/VLM from Hugging Face, if not let me know! (also, it's Apache 2.0) https://preview.redd.it/rk8i48ftva3h1.png?width=1754&format=png&auto=webp&s=7a2e8908d7d0a6c2efd92be5fb7f0ec548e7aba9   submitted by   /u/Fun-Purple-7737 [link]   [comments]

๐Ÿ“ฐ
r/LocalLLaMA Aggregators May 25, 2026
Is Qwen3.6 current king for local agentic use?

I've been testing other models but it seems like nothing even come close to Qwen3.6 35B A3B for agentic use. The worse I'd get is a loop sometimes, while Gemma4 produced broken tool calls occasionallโ€ฆ

I've been testing other models but it seems like nothing even come close to Qwen3.6 35B A3B for agentic use. The worse I'd get is a loop sometimes, while Gemma4 produced broken tool calls occasionally and I couldn't even get GLM 4.7 Flash REAP past 2 or 3 messages before it starts looping. All IQ4_NL quants from Unsloth. I'm wondering if there are better models around the same size (preferably MoE) that I haven't tried yet. I'm using it for Hermes Agent and Pi and it's not perfect, but it's crazy good for a local model   submitted by   /u/HornyGooner4402 [link]   [comments]