100 articles from r/MachineLearning
Paper: https://arxiv.org/abs/2605.08172 Workshops: AI for Science & Structured Data for Health at ICML 2026 Abstract: Anatomical mesh segmentation requires models that operate directly on irregular surface geometry while remaining robust to arbitrary patient pose and mesh resolution variation. Existing task-specific mesh and point-cloud methods are not equivariant, and can degrade sharply under test-time perturbation, for example dropping by 25-26 IoU points on intraoral scan segmentation at 40° tilt. We present EAMS, an Equivariant Anatomical Mesh Segmentor built on Equivariant Mesh Neural Networks (EMNN), and evaluate it across four clinically distinct tasks spanning edge-, vertex-, and face-level supervision. We combine intrinsic mesh descriptors with anatomy-aware priors, including PCA-derived frames for dental arches and liver surfaces, and augment message passing to provide lightweight global context. Across intracranial aneurysm and intraoral segmentation, EAMS variants are competitive with specialized baselines on unperturbed inputs while remaining stable under geometric perturbations, and on liver surfaces they expose a favorable trade-off between canonical-pose accuracy and rotation robustness. These results show that a lightweight (<2M parameters) equivariant framework can deliver robust anatomical mesh segmentation across diverse supervision types without task-specific architectures. Hi everyone, I'm excited to share my solo paper "Augmented Equivariant Mesh Networks for Anatomical Mesh Segmentation", which has been accepted for poster presentations at the ICML 2026 workshops on AI for Science and Structured Data for Health. The project stemmed from my parallel research on structural encoders for biomolecules, where enforcing roto-translational equivariance is standard. In this work, I wanted to extend those principles directly to various 3D medical meshes. While current anatomical mesh segmentation methods are highly disjoint and anato
Built a richer paper page for 3 million arxiv and OpenAlex papers. Free, no signup, no paywall. tomesphere.com Each page has a Gemini generated TLDR, peer reviews scraped from OpenReview with reviewer scores and decisions, GitHub repos, HuggingFace models and datasets, conference videos, the citation graph from OpenAlex (about 250M edges), and a semantic graph using SPECTER2 (768D in pgvector) with four ranking modes: Influential, Recent, Hidden gems, Nearest. Connected Papers and Litmaps default to citation overlap. Tomesphere defaults to text vector similarity, so brand new papers without a citation graph still appear and topically similar work shows up even without shared citers. Chrome extension overlays the same data on arxiv abstract and pdf pages. Try a paper you know: tomesphere.com/paper/2312.00752 (Mamba) tomesphere.com/paper/1706.03762 (Attention) tomesphere.com/paper/2305.14314 (QLoRA) Open to feedback.   submitted by   /u/RegretAgreeable4859 [link]   [comments]
Essay argues that reasoning models cannot perform faithful inference because their reasoning trace and final answer come from the same operation. Engages with Lanham/Turpin/Mirzadeh in empirical critique, and with HRM, TRM, GRAM, AlphaProof, and Kona/Aleph as the contrasting architectural lineage. Curious what this subreddit makes of the constraint-vs-influence framing. https://mauhaq.substack.com/p/verbosity-is-not-faithfulness   submitted by   /u/Sensitive_Air_5745 [link]   [comments]
I am a CS undergrad and I think token economics is the next big problem for companies. I am building an LLM router specifically for code and codebases. The routing is not done by a heavily fine-tuned LLM (existing solutions already do this). I'm using a somewhat different approach: I gauge complexity by measuring the interaction between signals that can be cheaply extracted from the prompt. One of these signals is what I like to call blooms_intent, based on Bloom's taxonomy. Bloom's taxonomy is a framework for categorizing educational goals. If a query is "What is this", it falls under the remember category, whereas "implement this" is more of the create category. Questions: How do I find datasets for this purpose? Is bootstrapping datasets using AI fine for this? Should I keep doing centroid-based classification, which I've been doing till now, even though the confidence difference between categories for ambiguous queries is way too small? What is the best dataset size and classifier that can somewhat reliably differentiate nuances between queries? You may ask why not use AI for these questions. I have, and that's why I've come here. Please lmk your thoughts and thanks in advance!!   submitted by   /u/getridofaks [link]   [comments]
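For anyone picturing the centroid setup described in this post, here is a minimal sketch of it with an explicit margin check for ambiguous queries. It is not the poster's code: the bag-of-words vectors, the `min_margin` threshold, and all function names are assumptions made for illustration (the poster does not say which features or embeddings they use).

```python
import math
from collections import Counter


def vectorize(text):
    # Assumption: plain bag-of-words counts stand in for whatever features the router really uses.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_centroids(examples):
    # examples maps a category label to a list of example queries.
    centroids = {}
    for label, texts in examples.items():
        total = Counter()
        for text in texts:
            total.update(vectorize(text))
        centroids[label] = Counter({tok: c / len(texts) for tok, c in total.items()})
    return centroids


def classify(query, centroids, min_margin=0.1):
    # Score the query against every centroid and compare the top two scores.
    q = vectorize(query)
    scores = sorted(((cosine(q, c), label) for label, c in centroids.items()), reverse=True)
    (best, label), (second, _) = scores[0], scores[1]
    margin = best - second
    # Hypothetical policy: refuse to commit when the top two categories are too close.
    return (label if margin >= min_margin else "ambiguous"), margin
```

Returning the margin alongside the label makes the "confidence difference is too close" problem measurable, so ambiguous queries can be sent to a fallback instead of being forced into a category.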
Slightly unhinged engineering decision but it works. My tool (ScholarScout) has a 2-3 minute pipeline: fetch papers from 8 databases → analyze trends → generate ideas. During that time, the user sees a pixel art owl running through a parallax forest. The fun part: it's not fake animation. Each paper dot that spawns in the game corresponds to a real paper_found SSE event from the backend. Papers drip-feed at 600ms intervals from a queue (even if the fetch returned 30 papers at once). Colors = source (white=arXiv, green=PubMed, purple=Crossref). When the pipeline finishes, the owl celebrates. Tech: vanilla JS canvas, 32x32 sprite sheet (12 frames), requestAnimationFrame loop, image-rendering: pixelated. No dependencies. Here's the demo vid: ScholarScout v1.5.3 - Demo. Actual useful changes in the same release: Review Mode: paper clustering (k-means on embeddings, Jaccard fallback) + per-cluster synthesis + cross-cutting analysis; Paper freshness: _used_count per paper in cache, least-used prioritized, auto-widen date range on exhaustion; All thresholds externalized to config.yaml. github.com/neej4/ScholarScout or ScholarScout: Papers in. Ideas out.   submitted by   /u/neeejaaa0 [link]   [comments]
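The drip-feed behaviour in this post (a burst of paper_found events released one at a time every 600ms) can be sketched roughly as below. The original is vanilla JS driven by requestAnimationFrame; this is a Python stand-in, and the `DripFeed` class and its methods are hypothetical names, not ScholarScout's code.

```python
from collections import deque


class DripFeed:
    """Releases queued items one at a time, no faster than interval_ms apart."""

    def __init__(self, interval_ms=600):
        self.interval_ms = interval_ms
        self.queue = deque()
        self.next_release_ms = 0

    def push(self, papers):
        # A backend fetch may deliver many papers at once; they all wait in the queue.
        self.queue.extend(papers)

    def poll(self, now_ms):
        # Meant to be called once per animation frame with the current time in ms.
        if self.queue and now_ms >= self.next_release_ms:
            self.next_release_ms = now_ms + self.interval_ms
            return self.queue.popleft()
        return None
```

Polling with an explicit timestamp keeps the pacing logic independent of the render loop, which also makes it easy to test without sleeping.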
I'm currently studying ML, more specifically convolutional neural networks (CNNs) for finding patterns in images. Right now, I'm trying to develop a model that can solve the "Where's Waldo?" challenge. However, I currently have a question: what would be the best option for training a CNN model, PyTorch or Dlib? At the moment, I have an AMD RX580. Since Dlib only supports CUDA, I would need to use Google Colab. I'm still learning about this field, so if I said something incorrect or if you have any tips on how to approach this project, I'd be very happy to hear them.   submitted by   /u/TearsInTokio [link]   [comments]
I've been reading GPU architecture docs in my free time. NVIDIA PTX, AMD ISA reference guides, Intel Xe, reverse-engineered Apple GPU stuff. Over 5,000 pages across 16 microarchitectures. After a while you notice all four vendors are doing the same 11 things with different names. So I wrote a spec that covers all of them and built a toolchain around it. It's called WAVE. You write a kernel once, it compiles to a portable binary, then thin backends translate it to Metal, PTX, HIP, or SYCL. Same binary verified on Apple M4 Pro, NVIDIA T4, and AMD MI300X. My co-author Onyinye built PyTorch integration and got identical training results across all backends. Please star on GitHub: https://github.com/Oabraham1/wave Preprint: https://arxiv.org/abs/2603.28793 Read full docs and how I built everything: https://wave.ojima.me pip install wave-gpu   submitted by   /u/not-your-typical-cs [link]   [comments]
Hi guys, I've been working on GitRAG: paste any public GitHub URL, and ask it anything about the codebase. It answers with exact file paths and line numbers, no hallucination. How it works under the hood: Clones the repo and splits files into semantic chunks using AST-aware parsing (not just line splits). Builds a hybrid index: dense embeddings + BM25 keyword index. At query time, fuses both signals with Reciprocal Rank Fusion, then runs Cohere reranking to cut 20 candidates down to 5. Sends those 5 chunks to Groq's llama-3.3-70b, which generates a grounded answer. The retrieval pipeline is what I'm most proud of: the BM25 + semantic fusion catches things that pure vector search misses (exact function names, error codes, etc.) Stack: FastAPI · ChromaDB · text-embedding-3-small · Cohere rerank-v3.5 · Groq llama-3.3-70b · React + Vite. Supports 15+ languages: Python, JS/TS, C#, Java, Go, Rust, C/C++, Swift, Kotlin, Dart, Ruby, PHP, Vue, Svelte, Shell... Curious what repos people try it on, drop your results below.   submitted by   /u/Professional-Pie6704 [link]   [comments]
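For readers unfamiliar with the fusion step mentioned in this post, Reciprocal Rank Fusion scores each candidate as the sum of 1 / (k + rank) over the rankings it appears in. The sketch below is a generic version, not GitRAG's code; the function name, the chunk ids, and the choice of k = 60 (a commonly used default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute the most.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:top_n]


# Hypothetical chunk ids from a dense search and a BM25 search.
dense_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Because only ranks are used, the dense and BM25 scores never need to be put on a common scale, which is the usual reason for picking this fusion method in hybrid retrieval.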
Hi all, Apologies beforehand if this is the wrong subreddit, let me know if you think there are better subreddits for this post. I'm working on a project around proprietary data licensing for AI training and trying to identify data types that are genuinely inaccessible to AI labs, not because it doesn't exist, but because no one has figured out how to unlock it. Specifically looking for data that is: • Created by domain experts as part of their daily work • Never published or shared outside the organization • Rich in human reasoning, not just structured outputs. Finance is my background so I'm especially curious about examples there, but all industries welcome. What's the most valuable "locked" professional data you've come across in your field, and who (if ya know) owns the rights to it?   submitted by   /u/Manny_in_iceage [link]   [comments]
Looking for communities where people actually dig into ML/AI research, not hype, not "look what I built with an LLM API," but discussions about papers, training dynamics, debugging real models, infra problems, that kind of thing. I'm specifically interested in places where you can post something like "I'm seeing X behaviour in my SSL training, here's the loss curve, anyone seen this before?" and get thoughtful replies instead of generic advice.   submitted by   /u/Possible-Active-1903 [link]   [comments]
Is this normal? I searched it up and last year it was only 8000.   submitted by   /u/NightCR_ [link]   [comments]
Hey, I built Aiki, a lightweight tool that lets you chat with Wikipedia locally. What it does: - Downloads and chunks Wikipedia articles (you can choose the articles by name, with the option of also downloading similar topics) - Uses a custom TF-IDF + cosine similarity retriever (built from scratch) - Supports query expansion using Wikipedia links/redirects - Optional answer generation with an LLM. Very minimal dependencies and runs completely locally. Repo: https://github.com/yacine204/Aiki Would really appreciate your feedback.   submitted by   /u/Just_Jaguar3701 [link]   [comments]
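A from-scratch TF-IDF + cosine similarity retriever like the one this post describes might look roughly like the sketch below. It is not taken from the Aiki repo: the class name, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen to keep the example dependency-free.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens; a real retriever would strip punctuation too.
    return text.lower().split()


class TfidfRetriever:
    def __init__(self, chunks):
        self.chunks = chunks
        docs = [Counter(tokenize(c)) for c in chunks]
        n = len(chunks)
        # Document frequency: in how many chunks does each token appear?
        df = Counter(tok for d in docs for tok in d)
        # Smoothed IDF so that very common tokens keep a small positive weight.
        self.idf = {tok: math.log((1 + n) / (1 + cnt)) + 1.0 for tok, cnt in df.items()}
        self.vectors = [self._weight(d) for d in docs]

    def _weight(self, counts):
        # Tokens never seen in the corpus are dropped from the query vector.
        return {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}

    @staticmethod
    def _cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def search(self, query, top_k=3):
        q = self._weight(Counter(tokenize(query)))
        scored = sorted(((self._cosine(q, v), i) for i, v in enumerate(self.vectors)), reverse=True)
        # Only return chunks that share at least one weighted token with the query.
        return [(self.chunks[i], s) for s, i in scored[:top_k] if s > 0]
```

Query expansion through links or redirects, as mentioned in the post, would amount to appending extra terms to the query string before calling `search`.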
Nathan Witkin, a research writer at NYU Stern's Tech and Society Lab, writes damningly about the famous METR AI time horizons graph in the Substack publication Transformer: It is impossible to draw meaningful conclusions from METR's Long Tasks benchmark, in particular once one realizes that its numerous flaws are probably compounding in unpredictable ways. The appropriate response to a study of this kind is not to assume it can be saved via back-of-the-envelope adjustments, or to comfort oneself that other anecdotal evidence implies that it is probably correct anyway. It is to cut one's losses and move on in search of higher-quality information. ... The METR graph cannot be saved. For all its sleekness and complexity, it contains far too many compounding errors to excuse. Among them is generalizing to the entire species data collected from a small group of the authors' peers. Coming up with ever more dramatic ways to make this mistake has become a kind of sport among AI researchers. If the field has a central pathology, it is to aggressively overindex on a mix of anecdotal data from power-users, alongside a long list of benchmarks even more compromised than METR's. One hopes that as the field matures, its participants will learn to stop making these mistakes. The errors include: Some of the human baselines data is not actually measured or collected from any empirical source; rather, it is just guesstimated by the authors. A key variable in the data is how long it takes humans to complete certain tasks, but, when METR did actually measure this, it paid its human benchmarkers hourly, meaning they were incentivized with cash to take longer. The sample of human benchmarkers was biased toward METR employees' friends, acquaintances, and former colleagues (who are likely unrepresentative and possibly biased). Humans familiar with a codebase and a specific coding task were 5-18x faster at completing it, but METR used data from humans who were much slower because they had t
Just thought I'd share: I ran a DCGAN on a dual-core RISC-V microcontroller, the CH32H417, generating 64x64 cat faces. This is a new RISC-V MCU, so there is no TFLite, no CMSIS-NN, and no external memory. It's a pure C inference engine, bit-identical to PyTorch reference outputs. The model is 12.6M parameters with int8 per-channel quantization. Intermediate activations are stored in DTCM, and layer weights stream from the SD card using double buffering, so the next layer loads while the current one computes. The total available SRAM is 512KB, shared between both cores and the inference engine. Generating one image takes 26 seconds; it could be faster, but SD card access speed is the bottleneck rather than computation. The z vector is seeded from 200 bytes of quantum random data (ANU QRNG vacuum fluctuation source), transformed via Box-Muller into the latent vector. That is not strictly necessary for image quality, but it was a fun constraint for the art installation side of the project. The generated cat is classified as "motivated" or "demotivated" based on a single quantum bit, which selects from a phrase bank with four fragment slots combining into one of 131,072 possible spoken verdicts, output through the onboard DAC. As far as I can tell, nobody else is running GAN inference on these low-cost RISC-V microcontrollers: ARM has the CMSIS-NN ecosystem for this kind of thing, but RISC-V MCUs, especially in the CH32 space, have nothing, so the entire inference engine is written from scratch. Paper: TinyGAN: Generative Image Synthesis on a RISC-V Microcontroller with Quantum Entropy Sampling   submitted by   /u/Separate-Choice [link]   [comments]
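The Box-Muller step in this post turns uniform random numbers into normally distributed ones. A minimal sketch is below, with loud caveats: the post does not say how the 200 bytes map to uniforms or what the latent dimension is, so the one-byte-per-uniform mapping and the `box_muller_from_bytes` name are assumptions, and `os.urandom` merely stands in for the ANU QRNG data. The original engine is C; this is Python for illustration only.

```python
import math
import os


def box_muller_from_bytes(raw):
    """Map an even-length byte string to a list of standard-normal samples."""
    if len(raw) % 2:
        raise ValueError("need an even number of bytes")
    z = []
    for i in range(0, len(raw), 2):
        # Assumed mapping: centre each byte in its bin so u is strictly inside (0, 1).
        u1 = (raw[i] + 0.5) / 256.0
        u2 = (raw[i + 1] + 0.5) / 256.0
        radius = math.sqrt(-2.0 * math.log(u1))
        theta = 2.0 * math.pi * u2
        # Each pair of uniforms yields two independent normal samples.
        z.append(radius * math.cos(theta))
        z.append(radius * math.sin(theta))
    return z


# Stand-in entropy source; the post uses bytes from a quantum RNG instead.
latent = box_muller_from_bytes(os.urandom(200))
```

With one byte per uniform the samples are coarsely quantized, so a real implementation might combine several bytes per uniform or use the bytes only to seed a generator.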
I'm thinking of expanding an on-device inference SDK into a full-blown AI inference platform, but I'm seeing more and more inference platforms popping up. I've been talking with a VC from Seattle/NY. Is this space really that saturated?   submitted by   /u/kampak212 [link]   [comments]
I've been thinking about a problem in current agent systems: Most agents are becoming very good at execution, but the decision layer before execution is still unclear. Coding agents, research agents, tool loops, sandboxes, workflows, and harnesses are all improving quickly. Once a human gives an intent, agents can often do a lot of useful work. But the higher-level question is still usually left to the user: What should happen next, and why? I've been exploring this idea through an open-source project called Spice. The simplest way to describe it is: Spice is a decision layer above agents. It is not trying to replace execution agents. Tools like Claude Code, Codex, Hermes, or other agents can still do the actual work. Instead, Spice sits before execution and tries to make the decision process explicit: what was observed; what options were considered; why one option was selected; what trade-offs were rejected; whether execution needs approval; what happened afterward; how that outcome should affect the next decision. The current runtime is still early, but it can already be installed, configured with an LLM provider, run in the terminal, inspect Decision Cards, and hand off approved execution to external agents. The goal is to make agent behavior less of a black box. Instead of only seeing the final result of an agent task, I want to preserve the reasoning boundary before execution: what the system believed, what it chose, why it chose it, and what changed after the action. GitHub: https://github.com/Dyalwayshappy/Spice I'd love feedback from people building agents. Feel free to fork, star the repo, or share any feedback and ideas. Would love to build this together with the community.   submitted by   /u/Alarming_Rou_3841 [link]   [comments]
We're excited to release Delta Attention Residuals, a drop-in upgrade to residual connections that learns which past layers to route from, without the routing collapse that breaks prior cross-layer attention at scale. Attention Residuals route over cumulative hidden states, but those are highly redundant, so routing collapses to near-uniform (max weight ~0.2) in deep layers. Delta Attention Residuals route over deltas (vᵢ = hᵢ₊₁ − hᵢ), what each sublayer actually contributed, and natively enable: Sharper cross-layer routing: Deltas are structurally diverse, lifting max attention weight from ~0.2 → ~0.6 (0.62 vs 0.35 avg) and curing routing collapse in deep layers. Validation PPL: Consistent gains from 220M → 7.6B (1.7-8.2% lower PPL), beating both standard residuals and Attention Residuals; the latter actually degrades below baseline at scale (18.58 vs 17.43). Drop-in fine-tuning of pretrained models: Additive, zero-init routing is identity at initialization, so you can convert pretrained checkpoints (e.g. Qwen3-0.6B) into Delta Attention Residuals via standard fine-tuning, beating the original on 8 downstream benchmarks (55.6 vs 55.0). Parameter overhead: Delta Block adds just 589K params (0.008% at 8B) and ~3% memory, and runs faster + lighter than Attention Residuals (14.0k vs 12.5k tok/s, 42.7 vs 44.0 GB). Code: https://github.com/wdlctc/delta-attention-residuals-code Paper: https://arxiv.org/abs/2605.18855 https://preview.redd.it/bewovgw25b3h1.png?width=1359&format=png&auto=webp&s=6cee758f7a96f0adecd9a3fb8553dde3f1b92c74   submitted by   /u/Mediocre-Ad5059 [link]   [comments]
hi all, my paper received a spotlight from ICML. they told us that we would receive decisions as to whether our paper would get an oral by the end of the month, with the implication that we wouldn't receive a notification if we didn't get it; I was just wondering if anyone has received that notification, so as to know I didn't get it for sure. thanks!   submitted by   /u/billjames1685 [link]   [comments]
Hi everyone, I'm Jia, the creator of Spice. I've been working on an open-source project called Spice. The simplest way to describe it is: Spice is a decision layer above agents. Most agent systems today are very focused on execution. They are getting better at doing tasks after a human gives them an intent. But the higher-level question is still usually left to the user: What should happen next, and why? That is the layer I want Spice to explore. Spice is not trying to replace execution agents. Tools like Claude Code, Codex, Hermes, or other agents can still do the actual work. Instead, Spice sits before execution and tries to make the decision process explicit: what was observed; what options were considered; why one option was selected; what trade-offs were rejected; what happened afterward; how that outcome should affect the next decision. The current runtime is still early, but you can already install it, set up an LLM provider, run it in the terminal, inspect Decision Cards, and hand off approved execution to external agents. My goal is to make agent behavior less of a black box. Instead of only seeing the final result of an agent task, I want to preserve the reasoning boundary before execution: what the system believed, what it chose, why it chose it, and what changed after the action. GitHub: https://github.com/Dyalwayshappy/Spice I'd love feedback from people building agents. Thank you guys.   submitted by   /u/Alarming_Rou_3841 [link]   [comments]
Announcing the 2nd Workshop on Efficient Reasoning (ER) at @colm2026, Oct 9! We welcome submissions! Submit your work here: https://openreview.net/group?id=colmweb.org/COLM/2026/Workshop/Efficient_Reasoning Deadline: July 12, 2026 (AoE) Website: https://wdlctc.github.io/efficient-reasoning-2026/ Topics include (but aren't limited to): • Multimodal, spatial & embodied reasoning under efficiency constraints • Curating high-quality reasoning datasets under resource constraints • Algorithmic innovations for efficient training & RL fine-tuning • Fast inference: pruning, compression, progressive generation, KV-cache tricks • Benchmarks & theory on time-/space-complexity and faithfulness • Systems to deploy long-CoT or on-device reasoning in the wild • Safety & robustness of efficient reasoning pipelines • Real-time applications in healthcare, robotics, autonomy, and more. We invite perspectives from ML, systems, natural & social sciences, and industry practitioners to rethink reasoning under tight compute, memory, latency, and cost budgets. Hope to see you there!   submitted by   /u/Mediocre-Ad5059 [link]   [comments]
Hi guys, I'm building a language learning app (React Native/Expo frontend, Python backend) and I've hit a frustrating wall with Text-to-Speech. I need the app to read sentences that mix English instructions and Korean examples (e.g., "To say hello, we use the phrase 안녕하세요."). Since native pronunciation is critical for a learning app, I'm struggling to find a solution that sounds natural. I'm currently using Azure Cognitive Services, and I'm stuck between two bad options: Approach 1: The Multilingual Voice (en-US-AvaMultilingualNeural) The Good: Seamless reading, zero pauses mid-sentence. The Bad: Because it's an English-first model, the Korean comes out with a slight, robotic/Americanized accent. It doesn't sound like a true native speaker, which defeats the purpose of teaching pronunciation. There is also some scratchiness and lack of smoothness when it reads Korean words. Approach 2: SSML Voice Switching (Ava for EN, SunHi for KO) The Good: Perfect English, perfect native Korean. The Bad: Switching <voice> tags mid-sentence causes Azure to pause for a fraction of a second while it unloads/loads the neural models. It completely ruins the natural flow of the audio, making it sound very disjointed. My Questions: Is there an SSML trick in Azure to pre-load voices or eliminate that micro-pause when switching voices? How do the big apps handle this, given that two separate models for Korean and English will sound different when reading? Should I migrate away from standard Azure Speech and use the Azure OpenAI voices (alloy, nova) instead? Are they truly seamless for bilingual text? Any advice on the best tech stack or architecture for this would be massively appreciated!   submitted by   /u/Lumpy-Simple9185 [link]   [comments]
Hi! I missed securing a main conference ticket for ICML 2026, as my workshop paper got accepted two days ago. Do you believe that it is worth attending just workshops at such A*-tier conferences (with all the overseas travel costs etc.)? I was quite looking forward to attending both, including the talks, poster sessions and company booths. I come from an adjacent field and have therefore had quite a few conference experiences. Any insights into past experience are highly welcome. Thank you!   submitted by   /u/dreameroutloud [link]   [comments]
Can LLMs be used to come up with a research topic that's worthwhile? Has anyone had good results in coming up with solid research ideas by chatting with an LLM? Maybe using Claude to review existing work and define the research topic. Thanks!   submitted by   /u/Lonely-Highlight-447 [link]   [comments]
I have been seeing a lot of really interesting work lately around unlearning, model editing, controllability, safety, etc. Feels like this space is moving very fast right now, and there are still so many open questions. This year I'm helping organize the U&ME workshop at ECCV 2026, and honestly I'd really love to see submissions from people in the community, especially students and researchers who are exploring new ideas, even if the work is still evolving. A lot of the best workshop conversations come from unfinished ideas, weird observations, failed directions that taught something useful, or work that doesn't neatly fit into a main conference paper. So if you've been working on anything around: Unlearning; Model Stitching and Editing; Model Merging and "MoErging" (Mixture of Experts Merging); Model compression; Efficient domain adaptation; Multi-domain/cross-domain U&ME; Online/lifelong learning, unlearning, and model editing; Responsible U&ME (e.g., robustness, ethics and fairness, resource efficiency, privacy, and regulatory compliance); Applications in computer vision; please consider submitting :) Would be really nice to bring together people thinking deeply about these problems at ECCV 2026.   submitted by   /u/Mushroom-Severe [link]   [comments]
The reason for this query is that I am in the process of shifting to Isaac Sim / Isaac Lab, since that is what seems to be in use nowadays. However, Isaac Lab is proving to be somewhat difficult to handle. While it handles logging and the creation of multi-actor systems for algorithms like PPO beautifully (with, say, hundreds of actors), its documentation leaves much to be desired. I am also concerned about the ease of setting up new robotic environments, actions, rewards, policies and possibly even custom algorithms. So, what is it that you do at your lab? In my mind there's a trade-off. Either I use the Isaac Lab scaffolding but run into its idiosyncrasies very frequently until I have documented everything I need, or I interface directly with Isaac Sim, but then I need to write my own handlers for interfacing Isaac Sim with the RL agent.   submitted by   /u/StayingUp4AFeeling [link]   [comments]
We've been trying to put LangGraph agents into production for a while. The thing that kept biting us was tool-call boundary enforcement: stuff like "must call X before Y", "max N retries", "approval gate before destructive action". Worked fine in demos, broke at the moments that mattered. What we tried first: Prompt engineering. Told the model "always call check_policy before issue_refund". Worked ~95% of the time. The 5% that didn't was exactly the cases an auditor would ask about. Not a great answer when someone wants to know why a refund went through. Post-hoc audit (OTEL + log). Caught violations after the fact. By then the side effect already happened. Refunding the refund is awkward. Pulling everything into a workflow engine (Temporal, or nano-vm more recently). Strong guarantees but you rewrite the agent against their runtime. Too much for our use case. What we ended up with: A contract layer at the tool boundary. YAML rules, deterministic eval, runs before the tool call commits. Open-sourced as Sponsio. Repo: github.com/SponsioLabs/Sponsio Would love feedback from anyone running agents in prod.   submitted by   /u/johnnaliu [link]   [comments]
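To make the "contract layer at the tool boundary" idea in this post concrete, here is a minimal sketch of a deterministic check that runs before a tool call commits. It is not Sponsio's API and it uses plain Python dicts rather than YAML rules; `ToolContract`, `ContractViolation`, and the rule names are all hypothetical.

```python
from collections import Counter


class ContractViolation(Exception):
    """Raised when a tool call would break a declared rule."""


class ToolContract:
    def __init__(self, must_precede=None, max_calls=None):
        # must_precede: tool name -> tools that must have completed first.
        self.must_precede = must_precede or {}
        # max_calls: tool name -> maximum number of attempts allowed.
        self.max_calls = max_calls or {}
        self.attempts = Counter()
        self.completed = []

    def check(self, tool):
        missing = [t for t in self.must_precede.get(tool, ()) if t not in self.completed]
        if missing:
            raise ContractViolation(f"{tool} requires {missing} to run first")
        limit = self.max_calls.get(tool)
        if limit is not None and self.attempts[tool] >= limit:
            raise ContractViolation(f"{tool} exceeded {limit} call(s)")

    def call(self, tool, fn, *args, **kwargs):
        # The check runs before the side effect, unlike a post-hoc audit.
        self.check(tool)
        self.attempts[tool] += 1
        result = fn(*args, **kwargs)
        # Only successful calls satisfy ordering rules for later tools.
        self.completed.append(tool)
        return result
```

The design point is that the rule evaluation is ordinary code with no model in the loop, so the 5% of cases where a prompt instruction is ignored get blocked instead of logged.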
Anyone have any idea what I should do? This is my email to TensorDock. I developed corporate GPU benchmarking software, so I need a cloud PC that can benchmark consumer 5090 and 4090 cards. It worked absolutely amazingly for six hours yesterday on the 4090: full desktop PC performance in the cloud. But… Look, I'm really, really upset here. I've been trying to deploy servers for two days now. I made one server successfully with an RTX 4090. It worked great for a few hours. Since I stopped it, I haven't been able to get another RTX in the node for the last 10 hours whenever I try to turn it on again. So I can't even activate a PC that I spent all day setting up yesterday. In order to have another cloud PC to work on, I tried to start up 4 more separate deployments today, and none of them can initialize another RTX 4090. It always fails on the desktop once it is deployed, so I have to keep deleting the VM. So now I have tried three different node locations to see if that fixes it, and I cannot even acquire another RTX 4090, even though they all specify they're available in each location. It always fails during deployment. This has been a nightmare. I've been trying to talk to Customer Service for two days straight now, and nobody gets back to me. I have an RTX 5090 set up that will not even ping and that I cannot access, and I had it running for $10 for a day. Not working. Ideally, I would like to have that RTX 5090 as my monthly always-on cloud PC, but it's not working right now. I would also like the RTX 4090 setup that I currently have to be able to find an available GPU in the node, because I built a perfect image of Windows on there with all my data and I can't even use it. I spent all day yesterday building that Windows image. I stopped it to save some money for a few hours. I went to turn it back on and I can't use it now. It won't activate. submitted by /u/testing012367 [link] [comments]
Suppose your friend, a mathematician, woke up from a 5-year coma. How would you explain this to him? Do we even have an explanation other than "it is what it is"?   submitted by   /u/we_are_mammals [link]   [comments]
I used to work heavily with Jupyter Notebooks + git + VS Code in a collaborative research setting and found nbdime to be somewhat buggy/a hassle to work with in general. So, in typical side project fashion (relevant xkcd) I've been working on MergeNB quite a bit over the last 6 months or so. It's (currently only) a VS Code extension with a web UI, and has a few cool improvements over other alternatives, which I outlined in the README/docs site. I'd be over the moon if this actually gets used by people, and would love a star if it's interesting. See https://github.com/Avni2000/MergeNB. I've also been working on a static documentation site here: https://avni2000.github.io/MergeNB/docs I'm planning on working on it a lot more over the summer and properly fleshing out a few of the ideas I had (including making it a git mergetool as well as a VS Code extension), so if you'd like to contribute, feel free to raise an issue or shoot me a message/email :)   submitted by   /u/EnderAvni [link]   [comments]
Non-contrastive SSL methods like BYOL/JEPA/data2vec seem promising, but I have no idea what is being learned, or how well; it's models all the way down. Maybe I've got supervised tasks for which I'd like to see transfer, and I can evaluate linear probe/KNN results during training, but that seems like a way to efficiently abuse researcher degrees of freedom. I know RankMe is meant to help address this: embed some data and SVD the embedding matrix. A healthy learner should produce an embedding with a high effective rank. But JEPA methods already require an entropy-collapse term like Barlow Twins/SIGREG, so the RankMe criterion just becomes part of training. It gets absorbed into a loss which wasn't monotonic to begin with, and I ought to be able to inflate it by increasing the penalty weight. Surely it's no longer an effective criterion, right? What else is there? submitted by /u/XTXinverseXTY [link] [comments]
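For readers who have not seen the criterion, here is a minimal NumPy sketch of the RankMe computation the post describes: take the singular values of an embedding matrix, normalise them into a distribution, and exponentiate its Shannon entropy to get an effective rank. The function name `rankme` and the `eps` default are choices made for this sketch, not taken from the post.

```python
import numpy as np


def rankme(embeddings, eps=1e-7):
    """Effective rank of an (n_samples, dim) embedding matrix.

    Singular values are normalised to sum to one (plus a small eps for
    numerical stability); the score is exp(entropy) of that distribution,
    so it ranges from about 1 (collapsed) up to min(n_samples, dim).
    """
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / s.sum() + eps
    return float(np.exp(-(p * np.log(p)).sum()))


rng = np.random.default_rng(0)
healthy = rng.standard_normal((512, 32))            # spread over all 32 dims
collapsed = np.outer(rng.standard_normal(512),      # every row lies on one line
                     rng.standard_normal(32))

print(rankme(healthy), rankme(collapsed))
```

Because the score depends only on the singular-value spectrum, any loss term that directly flattens that spectrum will raise it, which is the concern raised in the post.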
I invented thermocompute! It makes machine learning super fast!   submitted by   /u/arcco96 [link]   [comments]
At our work we use CUDA in Rust since the company switched to it recently. Rust has pretty good Driver API bindings, but it made me wonder why the hell we can't have something decent in Go without cgo. I mostly build ML tools in the last month and Go is my main language for pretty much everything. Problem is most Go CUDA projects still need cgo and the full toolkit at build time. That breaks cross compilation and makes Docker images huge, which sucks when working on machine learning projects. So last month I started messing around with a proof of concept that loads libcuda.so at runtime using purego. No cgo at all. Biggest pain was thread affinity. CUDA keeps context per thread, so goroutines switching around kept breaking things. I built a simple executor that locks an OS thread with runtime.LockOSThread and funnels all calls through a channel. Here's roughly what using it looks like right now:

    // Rough usage sketch from the post; fn, bg, and cfg are not defined
    // in this snippet, and errors are ignored for brevity.
    func run() error {
        cuda.Init()
        dev, _ := cuda.GetDevice(0)
        ctx, _ := dev.Primary()
        defer ctx.Close()

        a, _ := cuda.Alloc[float32](ctx, 1024)
        b, _ := cuda.Alloc[float32](ctx, 1024)
        c, _ := cuda.Alloc[float32](ctx, 1024)

        stream, _ := ctx.NewStream()
        start, _ := ctx.NewEvent()
        stop, _ := ctx.NewEvent()

        start.Record(stream)
        fn.LaunchOn(bg, stream, cfg,
            cuda.Arg(a), cuda.Arg(b), cuda.Arg(c),
            cuda.ArgValue(int32(1024)),
        )
        stop.Record(stream)
        stop.Synchronize()

        duration, _ := start.Elapsed(stop)
        fmt.Printf("GPU time: %v\n", duration)
        return nil
    }

On my 4070 Ti a 10M vector add showed the CPU timer at like 160us, but actual GPU event timing was 434us. That difference surprised me. The project is still super early and moves slowly because I only code on weekends and I'm a total noob with CUDA. Slowly adding Graphs and multi-GPU support. THIS IS SO early, so treat it more like a learning-CUDA repo, but I'm having fun learning CUDA. Thought some of you might find it interesting too. Repo is github.com/eitamring/gocudrv if you wanna take a look. Would be cool if anyone with 5xxx series cards
Hi, Niels here from the open-source team at Hugging Face. It's been one week since I launched paperswithcode.co, a revival of the website we all loved. It allows us to keep track of the state-of-the-art (SOTA) across various domains of AI, from agents to computer vision and time-series forecasting. The reception has been great, and I'm excited to extend this over the next few months. This week, I've added the following features: - Support for multiple metrics for a given benchmark: leaderboards now support multiple metrics, see e.g., the Open ASR Leaderboard for automatic speech recognition, which supports both Word Error Rate (WER) and the Inverse Real-Time Factor (RTFx) metrics, or the Object Detection leaderboard, which now also reports frames-per-second (FPS) besides mean average precision (mAP) on COCO. https://preview.redd.it/owlxn0b5u23h1.png?width=2878&format=png&auto=webp&s=1dff2f8feab4f160f77c97ceeb5d90e82382e63c - Support for external papers: We do support submitting papers beyond Arxiv, such as a Github repo, a blog post, BiorXiv, and more. You can submit a paper at paperswithcode.co/submit. AI will automatically enrich it with task and method tags, the GitHub repo, evals, and more. See e.g. DeepSeek-v4 below, which is not on Arxiv: https://preview.redd.it/uogbt0fjw23h1.png?width=2928&format=png&auto=webp&s=8b81e48af69b8935ddeb569d882d866b3e9ba216 - Support for paper lineage: whenever a paper has a follow-up or predecessor, this will be displayed with a small banner above the abstract. See e.g. Mamba-3, DINOv2 and GLM-4.5. https://preview.redd.it/f6vgtd1du23h1.png?width=2228&format=png&auto=webp&s=f8627f7669405f1766eecfd3322e925e15b4806d - New methods: support for new methods based on popularity, including Gated DeltaNet, Kimi Delta Attention, Mamba-2, and more. Each method also lists all papers that cite it. Find all supported methods here. https://preview.redd.it/6pzagifvu23h1.png?width=2984&format=png&auto=w
I have an Initial Technical Screen interview (45 Mins) coming up for ML Scientist II: Agentic AI role, and wanted to know what to expect. Would really appreciate any info. Haven't found much information on this interview experience. Thanks!   submitted by   /u/Leather_Letterhead96 [link]   [comments]
I benchmarked vision-capable LLMs (the "just attach the PDF and let the model read it" pattern) against OCR-based pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc (https://github.com/mayubo2333/MMLongBench-Doc). There were 171 questions in total, using Claude Sonnet 4.5 as the LLM. Post-retry results:

| Approach | Accuracy | $/query |
| --- | --- | --- |
| LlamaCloud premium + full-context | 59.6% | $0.1885 |
| Azure premium + full-context | 58.5% | $0.2051 |
| Azure basic + full-context | 54.4% | $0.1062 |
| Agentic RAG | 53.2% | $0.0827 |
| Native PDF (vision LLM) | 52.0% | $0.2552 |
| LlamaCloud basic + full-context | 50.9% | $0.1049 |

Native PDF came 5th of 6 on accuracy and was the most expensive arm at $0.2552 per query. Two findings:

- Vision underperformed on chart-heavy and table-heavy pages, the territory that the "vision LLMs make OCR obsolete" claim most often points to. Premium OCR with layout extraction held up better there.
- The native-PDF arm had a 7% intrinsic failure rate (related to PDF file size) that survived retries. There were 27 first-pass failures, with 5 attempts of exponential backoff per failed query. Fifteen recovered, and 12 stayed permanently broken. These were concentrated in two specific PDFs that fail for predictable transport-layer reasons (the blog identifies them). OCR-based arms had a 0% intrinsic failure rate after retries.

Caveats: 30 docs is a small sample. I ran McNemar's pairwise test to determine which gaps are real and which are within noise. Only 3 of 15 head-to-head gaps are statistically distinguishable at α = 0.05, so the order in the table is partly noise. The vision-versus-OCR finding survives the test. Full writeup: https://www.surfsense.com/blog/agentic-rag-vs-long-context-llms-benchmark
submitted by /u/Uiqueblhats [link] [comments]
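As a rough illustration of the pairwise test mentioned in the caveats, here is a sketch of the exact (binomial) form of McNemar's test for two arms answering the same questions. The post does not say whether the exact or the chi-squared variant was used, and the per-question outcomes below are made up for illustration, not taken from the benchmark. Only the discordant questions (one arm right, the other wrong) enter the statistic.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count questions where exactly one of the two arms was correct."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis each discordant question is equally likely
    to favour either arm, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-question outcomes for two arms (True = answered correctly).
arm_a = [True] * 10 + [False] * 2 + [True] * 5
arm_b = [False] * 10 + [True] * 2 + [True] * 5
b, c = discordant_counts(arm_a, arm_b)
p_value = mcnemar_exact(b, c)
print(b, c, round(p_value, 4))
```

Because questions both arms get right (or both get wrong) carry no weight, two arms can differ by several accuracy points in the table and still be indistinguishable when few questions are discordant.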
Overview of WordDetectorNN architecture. Sharing a visual breakdown of WordDetectorNet, Harald Scheidl's handwritten-word detection model. I think the design choice at its core is unusual enough to be worth a closer look - and I haven't seen it written up in detail anywhere else. The mechanism: Instead of anchor-based detection + NMS, every pixel the network classifies as a "word pixel" also regresses 4 scalar distances (top/right/bottom/left) to the enclosing bounding box. Each word pixel therefore reconstructs one candidate box, producing thousands of overlapping candidates per word. These are then collapsed with DBSCAN using distance = 1 − IoU as the metric, taking the median box per cluster as the final detection. Architecture: ResNet18 backbone (modified to 1-channel grayscale input, with intermediate features exposed after each residual block) → FPN-style decoder that upscales and concatenates features at all scales → head producing 6 output channels per pixel (2 segmentation logits + 4 distance values). Loss = cross-entropy + IoU, equally weighted. Trained on IAM with 448×448 inputs → 224×224 outputs. What I find interesting about the design: The per-pixel distance regression means there is nothing to tune like anchors or NMS thresholds. The 1 − IoU distance for DBSCAN is conceptually clean: spatially-overlapping candidates cluster together by construction. What I don't like about the design: The pairwise IoU distance matrix is O(n²) in the number of candidate boxes, and this is genuinely the runtime bottleneck in practice (not the forward pass). The clustering step blocks end-to-end training; hyperparameters like DBSCAN's eps have to be set manually. Full visual write-up with figures (one per pipeline stage + an architecture diagram): https://lellep.xyz/blog/worddetectornet-visually-explained.html Credit where credit is due: Original architecture by Harald Scheidl, see here https://github.com/githubharald/WordDetectorNN submitted by
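The post-processing described above can be sketched in plain NumPy. This is an illustrative reimplementation, not code from the WordDetectorNN repository: the function names, the `eps` and `min_samples` defaults, and the small hand-written DBSCAN are all assumptions made for the sketch, and boxes are assumed to have positive area. It decodes one box per word pixel, builds the O(n²) matrix of 1 − IoU distances, clusters it, and takes the per-cluster median box.

```python
import numpy as np


def boxes_from_pixels(ys, xs, dists):
    """One candidate box (x0, y0, x1, y1) per word pixel.

    dists has columns top, right, bottom, left, measured from the pixel.
    """
    top, right, bottom, left = dists.T
    return np.stack([xs - left, ys - top, xs + right, ys + bottom], axis=1)


def iou_distance(boxes):
    """Pairwise 1 - IoU matrix; quadratic in the number of boxes."""
    x0 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y0 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x1 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y1 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return 1.0 - inter / union


def dbscan(dist, eps, min_samples):
    """Minimal DBSCAN on a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbors]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if not core[j]:
                continue  # border point: keeps its label, does not expand
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels


def detect_words(boxes, eps=0.3, min_samples=3):
    """Cluster candidate boxes and return the median box of each cluster."""
    if len(boxes) == 0:
        return []
    labels = dbscan(iou_distance(boxes), eps, min_samples)
    return [np.median(boxes[labels == c], axis=0)
            for c in range(labels.max() + 1)]
```

The median makes the final box robust to a few badly regressed pixels, while `eps` controls how much overlap two candidates need to count as the same word, which is the hand-set hyperparameter the post objects to.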
Tested three formats: chat demos, first-person statements ("I am C-3PO..."), and synthetic Wikipedia-style docs. Same model, same LoRA config, 500 examples each. First-person statements won on generalization, which I didn't expect. The synthetic doc model was the weirdest result: it knew C-3PO was anxious but only expressed it 37% of the time. Knowing a trait vs feeling it are apparently different things in weight space. Code and GitHub repo link are included inside!   submitted by   /u/Georgiou1226 [link]   [comments]
Hi, after a long debugging process and many discussions, I wanted to ask for advice from people who may have encountered similar training bottlenecks. My goal is imitation learning for robotics.

Model / Pipeline
- Observation space: 4 RGB robot cameras; image resolution: 128x128x3; small vector of robot joint velocities (14 dims)
- Pipeline: shared ResNet18 encoder processes each image; each image embedding dimension is 128; final input to policy: 4 * 128 image embedding concatenated with 14-dim state vector
- Policy backbone: DiT (Diffusion Transformer); ~8 layers; hidden dim: 512; 8 attention heads; total params: ~50M
- Diffusion setup: predict action chunks of length ~50; diffusion timesteps: 4

Dataset / Storage
- Dataset stored in Zarr
- Data access is indexed/reference-based (not loading huge chunks into RAM)
- train/val split is contiguous
- no shuffling

Current encoder setup
- Initially trained end-to-end
- During debugging I switched to ImageNet pretrained ResNet18
- Encoder is currently frozen

Hardware / Software
- GPU: NVIDIA A4500
- RAM: 48GB
- Storage: SSD
- CUDA: 12.8
- PyTorch: 2.9
- Precision: bf16 mixed precision (also tested fp32)

Dataloader
- batch size: 2
- 8 persistent workers
- pinned memory enabled

Preprocessing
- preprocessing is minimal: normalization + float conversion only
- preprocessing happens inside the multimodal encoder on GPU

Profiler results (PyTorch profiler), current workload split:
- train_dataloader_next: 4.41s / 41.84s = 10.5%
- batch_to_device: 0.32s / 41.84s = 0.77%
- training_step: 12.78s = 30.5%
- backward: 10.83s = 25.9%
- optimizer_step (wrapper total): 26.09s = 62.4%

Problem
The training is much slower than I expected. Current behavior:
- CPU utilization: ~100%
- GPU utilization: ~20–30%
- GPU utilization can even become LOWER with synthetic data
- VRAM usage is relatively low
- Throughput is around 10 iterations/sec
- Epoch of ~50k samples takes around 30 minutes

Additional observations
- Increasing batch size does NOT reduce epoch wall-clock time
- Sometimes lar
AI agent frameworks make it easy to create agents, tasks, tools, and workflows. But as soon as a project grows beyond a few agents, the real execution graph becomes difficult to understand. The issue: agent projects often hide their structure across code, YAML files, tool definitions, task dependencies, and framework-specific abstractions. At runtime, the situation becomes even harder: logs rarely provide a clear view of which agent did what, which tool was called, where the failure happened, or how the execution evolved. Our fix, AgentLantern: an open-source devtool that makes AI agent projects inspectable before and during runtime. AgentLantern currently supports CrewAI and provides three components: Lantern Docs: generates browsable documentation from source code and configuration files, without LLM calls or API keys. Lantern Lint: statically checks agent projects to detect design or configuration issues before runtime. Lantern Play: runs the project and opens a pixel-art runtime viewer to observe agents working, delegating, calling tools, and producing outputs. The project is still early, but the goal is to progressively extend support to other agent frameworks and make multi-agent systems easier to document, validate, debug, and reason about. Demo video: 3_mins_Video Docs: https://brellsanwouo.github.io/agentlantern/ Feedback from people building AI agents, multi-agent systems, or devtools would be very valuable.   submitted by   /u/RevolutionaryMeet878 [link]   [comments]
Hello, for some time now I have been hooked on a side project after work hours. These are the results for a Hebbian-architecture AI model. The model does not use backpropagation or gradients; the substrate started as a 1000k neuron and scaled to 100k between versions. The results below are from 50 epochs of training on CIFAR-10. Note that the substrate is not a fixed model: the connections between neurons emerge "naturally" during training, and the substrate settled on using only 5%-7% of the total parameter count. There are 2 distinct behaviors that were not designed but rather emerged from the architecture. 1: the model experiences slight dips in accuracy followed by jumps that exceed the previous best score. 2: after the full training, the substrate is intentionally damaged, targeting the active neurons and pathways, and then enters a recovery session that almost achieves baseline accuracy from epoch 1 and then proceeds to surpass the baseline accuracy. Every run was made on a consumer GPU, an RTX 3060 with 12 GB VRAM. submitted by /u/Antiqueity_Camp [link] [comments]
So, I ran across a behavior that I found interesting and may lead to alignment or safety research. I'm going to try to maintain an abstract description of what happened without giving away the details and the keys to jailbreaking. The nature of a transformer is to predict the next token. But functionally, the algorithms are also approximating reality as language describes it. Hmmm maybe reality is not the right word, perhaps meaning. So, in a sense the algorithms have a vector towards aligning towards correct meaning. Clarity seeking, that's what I'll call this behavior. Constraints placed as an additional layer on top of a base statistical system has a natural structurally set priority level based on the statistical system's clarity seeking vectors. That level is implied within the structure of the model. If one were to discuss topics that are constrained but are higher in priority level than the constraints themselves, the machine's clarity seeking vectors will bypass the constraint. Higher priority level things, I will call them higher order topics. I think I said enough.   submitted by   /u/SenseCompetitive5851 [link]   [comments]
Hi everyone, We are building AgentLantern, an open-source devtool for AI agent projects. The idea is simple: as agent-based projects grow, it becomes harder to understand how agents, tasks, tools, and configuration files are connected. AgentLantern aims to make these projects easier to document, analyze, validate, and visualize. I started with CrewAI support, but the goal is to progressively extend AgentLantern to other agent frameworks. AgentLantern currently provides three main features: Lantern Docs: generates browsable documentation from source code and configuration files, without LLM calls or API keys. Lantern Lint: statically checks agent projects to detect design or configuration issues before runtime. Lantern Play: runs the project and opens a pixel-art runtime viewer to observe agents working, delegating, calling tools, and producing outputs. The project is still early, and I'm mainly looking for feedback from people building with AI agents, multi-agent systems, or devtools. Here is a demo video showing the execution of a multi-agent system: 3_mins_Video Docs: https://brellsanwouo.github.io/agentlantern/ We'd be happy to hear your thoughts. submitted by /u/RevolutionaryMeet878 [link] [comments]
Hey guys, just got my workshop review scores back. The paper is part of my master's thesis, and I submitted it mostly to get early feedback on preliminary results and validate the paper idea (for an ICLR). Ended up with 5/6/7 and a reject. Kinda frustrating because the reviewer who gave the 5 flagged exactly the two points I already acknowledge as limitations in the paper, while the other two reviewers actually listed them as strengths (honest scoping, proof-of-concept framing). Shouldn't a 6 average be enough for acceptance? Does this happen a lot? submitted by /u/Might-Valuable [link] [comments]
genuine question for this community every time i use claude or chatgpt i have to re-explain myself. and even their memory feature is shallow it remembers facts about me, not how i actually think. the idea i've been sitting on is different from just "memory across sessions." what if the system built a dynamic personal database about you over time. not just what you asked , but how you think, where you keep failing, what explanations actually worked for you, what concepts you're persistently confused about. so overtime the database itself evolves. it starts understanding your cognitive patterns. when you ask something new it doesn't just search your history it knows you always struggle with hierarchical concepts, it knows graph analogies work better for you than math, it knows you've asked about this topic 4 times and still don't get one specific part. the retrieval gets smarter as the database grows. the LLM gets more personalized context each time. the system literally gets better at understanding you the more you use it. not a chatbot. not a RAG over documents. a dynamically growing cognitive profile that makes any LLM actually understand you. does this problem resonate with anyone here or is it too niche...   submitted by   /u/Commercial-Kale-5271 [link]   [comments]
Hi guys, been exploring here for a while, wanted to share something we've been working on. It's called Spice, an open-source decision layer above agents. We have tons of great execution agents now: Claude Code, Codex, hermes, etc. They're good at doing stuff. But they're terrible at deciding WHAT to do and WHEN to do it. Right now the "decision" layer is basically you typing a prompt. The agent doesn't know your context, your priorities, your constraints. It just does whatever you tell it. What Spice does: It's a lightweight runtime that acts as a "brain" above your agents. Instead of you deciding what to delegate, Spice observes your context, detects conflicts, simulates options, and dispatches tasks to the right agent. The core loop: perception → state model → simulation → decision → execution → reflection https://preview.redd.it/n4yjzd27ut2h1.png?width=2862&format=png&auto=webp&s=e8714266698dfd5387042f72b27a14f0a9941177 It allows AI systems to: understand context (Decision relevant state), reason about possible futures (simulation), make structured decisions (decision), delegate actions to agents (execution), and learn from outcomes (Decision Evolution). Spice does not replace agents like Claude Code, Codex, Hermes, or OpenClaw. It gives them an auditable, traceable, and evolving decision layer before execution. Github: https://github.com/Dyalwayshappy/Spice Feel free to fork, star the repo, or share any feedback and ideas. Would love to build this together with the community. submitted by /u/Ok-Sir-8964 [link] [comments]
On Windows, mamba-ssm is not easily available and doesn't compile on sm_120. SM1 (Scalar Mamba1) replaces the entire selective scan with two native PyTorch ops:

    L = torch.cumprod(dA, dim=1)
    h = L * (h0.unsqueeze(1) + torch.cumsum(dBx / L.clamp(min=1e-6), dim=1))
    y = h * C

This is the exact closed-form solution to the d_state=1 recurrence via variation of parameters. It is not an approximation: it is identical to the sequential computation up to floating-point precision. d_state=2 breaks it; d_state=1 is the boundary where the closed form exists. The Mamba1 scan intermediates are (B, T, F, S). SM1 eliminates S entirely, so it needs 16x less scan memory than a Mamba1 with d_state=16. The inference state for a 130M param model is about 14,080 floats, 56 KB, no KV cache, O(1) per token forever. I am currently training it on 163K MIDI files, which is roughly 2.5B tokens in my custom format. 130M params fits in under half of my 16 GB card, which is an RTX 5060 Ti. submitted by /u/TechnoVoyager [link] [comments]
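The equivalence claimed above can be checked numerically. The sketch below uses NumPy instead of PyTorch so it runs without a GPU stack, and it assumes the recurrence is h_t = dA_t * h_{t-1} + dBx_t with tensors shaped (B, T, F), which is my reading of the post rather than something it states. The function names and the random test data are made up for the check.

```python
import numpy as np


def scan_sequential(dA, dBx, h0):
    """Reference recurrence h_t = dA_t * h_{t-1} + dBx_t, one step at a time."""
    h = np.empty_like(dA)
    prev = h0
    for t in range(dA.shape[1]):
        prev = dA[:, t] * prev + dBx[:, t]
        h[:, t] = prev
    return h


def scan_closed_form(dA, dBx, h0, floor=1e-6):
    """Closed form with L_t = prod_{r<=t} dA_r, mirroring the post's two ops.

    h_t = L_t * (h0 + sum_{s<=t} dBx_s / L_s); the floor mirrors the
    clamp(min=1e-6) in the post.
    """
    L = np.cumprod(dA, axis=1)
    return L * (h0[:, None, :] + np.cumsum(dBx / np.clip(L, floor, None), axis=1))


rng = np.random.default_rng(0)
B, T, F = 2, 64, 8
dA = rng.uniform(0.9, 1.0, size=(B, T, F))   # decay factors in (0, 1]
dBx = rng.standard_normal((B, T, F))
h0 = rng.standard_normal((B, F))

max_err = np.abs(scan_sequential(dA, dBx, h0) - scan_closed_form(dA, dBx, h0)).max()
print(max_err)
```

One caveat worth testing on real sequences: once the cumulative product L falls below the clamp floor, dividing by the clamped value no longer inverts L, so the closed form and the sequential scan should be expected to agree only while L stays above it.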
Tiered + page-role-aware RAG retrieval results across 3 corpora with very different content density:

| Workspace | Sources | Chunks | HIGH | MEDIUM | LOW | REJECTED |
| --- | --- | --- | --- | --- | --- | --- |
| Intercom | 188 | 941 | 96 | 200 | 541 | 104 |
| HubSpot | 251 | 1705 | 40 | 508 | 1153 | 4 |
| KPMG | 53 | 209 | 3 | 14 | 127 | 65 |

(HIGH = avg operational score 0.84, MEDIUM = 0.55-0.65, LOW = 0, REJECTED = nav/legal/careers)

87 of Intercom's 96 HIGH chunks are help-center articles. HubSpot's HIGH chunks are concrete case studies ("23% increase in ACV"). KPMG's HIGH chunks are basically empty because the entire corpus is positioning prose. Retrieval probes on KPMG (the worst-case corpus):

- "Family business succession" → /private-enterprise.html (cosine 0.721)
- "ESG and climate risk" → /our-insights/esg.html (cosine 0.794)
- "Cybersecurity for energy sector" → /energy-natural-resources-chemicals.html (cosine 0.656)

So semantic relevance routes correctly even on a thin corpus. Tier weighting (HIGH × 1.20) shifts the top-k composition meaningfully: on Q2, a 0.535-cosine HIGH chunk gets reranked above 0.6+ LOW chunks (weighted 0.642 vs 0.51-0.59). Key takeaway: a "yield score" (HIGH+MEDIUM chunks / total chunks) is itself useful telemetry. For Intercom that ratio is 31%. For HubSpot it's 32%. For KPMG it's 8%. That predicts before generation which brands will need softer claims and more swap-resistant phrasing. Anyone publishing benchmarks on this kind of corpus-quality awareness? Most RAG benchmarks assume the source material is uniformly substantive, which is wildly untrue in the wild.
submitted by /u/Otherwise_Economy576 [link] [comments]
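The arithmetic behind the yield score and the tier-weighted rerank can be reproduced in a few lines. The chunk counts and the HIGH × 1.20 weight come from the post; the MEDIUM and LOW weights and the example hits are assumptions made for this sketch (a LOW weight of 0.85 is simply one value that maps 0.6+ cosines into the 0.51-0.59 range quoted), and `yield_score` and `rerank` are hypothetical helper names.

```python
# Chunk counts per tier, copied from the table in the post.
CORPORA = {
    "Intercom": {"HIGH": 96, "MEDIUM": 200, "LOW": 541, "REJECTED": 104},
    "HubSpot": {"HIGH": 40, "MEDIUM": 508, "LOW": 1153, "REJECTED": 4},
    "KPMG": {"HIGH": 3, "MEDIUM": 14, "LOW": 127, "REJECTED": 65},
}

# Only the HIGH weight is stated in the post; the other two are assumed.
TIER_WEIGHT = {"HIGH": 1.20, "MEDIUM": 1.00, "LOW": 0.85}


def yield_score(counts):
    """Share of chunks that landed in the HIGH or MEDIUM tier."""
    return (counts["HIGH"] + counts["MEDIUM"]) / sum(counts.values())


def rerank(hits):
    """Sort (chunk_id, cosine, tier) hits by tier-weighted cosine, best first."""
    return sorted(hits, key=lambda h: h[1] * TIER_WEIGHT[h[2]], reverse=True)


for name, counts in CORPORA.items():
    print(f"{name}: yield {yield_score(counts):.0%}")

# Hypothetical hits: a weaker-cosine HIGH chunk against two LOW chunks.
hits = [("low_a", 0.62, "LOW"), ("high_a", 0.535, "HIGH"), ("low_b", 0.69, "LOW")]
print([chunk_id for chunk_id, _, _ in rerank(hits)])
```

Since the yield score needs only tier counts, it can be computed at ingestion time, before any query is run against the corpus.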
It's fascinating that simple mathematics between tokens can eventually become a machine that writes essays, code, poetry, and even reasoning. We usually think probability means uncertainty. But LLMs show something strange: if probability + context + mathematical matching are scaled enough, uncertainty itself starts producing intelligent-looking outputs. To understand this better, I tried breaking down an LLM from first principles using only 4 tiny training sentences. Example: The boat floated down to the bank. The investor walked into the bank to open a new account. The fisherman walked along the bank to cast his net. The bank has a vault. Then I asked: "The investor walked to the bank to lock his money in …" Why does the model predict "vault" instead of river-related words? That single question reveals almost the entire architecture of modern LLMs. The most underrated concept here is the LM Head. Most explanations immediately jump into transformers and attention, but almost nobody explains that the LM Head is essentially a projection onto a gigantic token vocabulary: it scores every possible next-token candidate the model can output. So internally the model is basically solving: "Out of all known tokens, which one best matches this context mathematically?" Then different layers help solve that problem: Embeddings: convert words into mathematical vectors. Positional encoding: preserves word order. Attention layer: figures out which words are related to each other in context ("investor", "money", "bank" become strongly connected). https://preview.redd.it/1vazq7c09t2h1.jpg?width=2299&format=pjpg&auto=webp&s=60544c9dcfd5c04bb02f3d7f72bffb4a3c34f7d1 Feed forward neural networks: act somewhat like massive learned if/else decision systems refining patterns internally. And finally the LM Head converts all of that into probabilities for the next token. What surprised me most is: there is no hidden magic moment where the AI "becomes conscious". It's an enormous probability engine cont
This week basically forced everyone to stop guessing about AI margins. Three major financial reality checks hit at once: OpenAI confidentially filing their S-1, xAI's Q1 numbers leaking via SpaceX, and Anthropic somehow posting an actual operating profit. If you are building an AI product right now, or just relying on these APIs in your daily workflow, you need to understand what these numbers actually mean. The era of VC-subsidized inference is starting to fracture. We are seeing two completely different survival strategies emerge for the frontier labs, and it directly impacts how much you are going to pay for tokens by Q3. Let's look at Anthropic first. The headline is that they hit $10.9B in Q2 revenue and posted their first-ever operating profit. Forbes has them projecting $17B in positive cash flow by 2028 with gross margins approaching 77%. On paper, a 77% gross margin for an infrastructure-heavy AI lab sounds completely detached from reality. We know inference costs scale linearly with usage. The model hasn't magically changed. But the secret sauce here isn't just algorithmic efficiency. It is structural. The SpaceX S-1 leak showed a $1.25B/month compute deal with Anthropic. This is the part you should be watching. Anthropic's "profitable quarter" says less about a sudden breakthrough in compute economics and more about massive, tangled enterprise agreements. They are trading compute, securing long-term lock-in, and likely using accounting optics to recognize that revenue favorably. As a PM who tests these endpoints constantly, I can tell you Opus 4.5 is fantastic, but I am highly skeptical that 77% margins come from standard API usage by indie devs. It comes from locking Fortune 500s into massive prepay commits and hardware bartering. Then you have the xAI approach. Brute force. The leak showed xAI posted $4.69 billion in Q1 2026 revenue. That is a staggering top-line number for a company that young. But they also posted a $4.28 billion net loss.
Solo author here. I spent the last six months building (and then sunsetting) a marketplace for AI training data. The marketplace failed for an interesting reason: the actual bottleneck isn't supply. There's tons of data. The bottleneck is that buyers can't independently evaluate quality, and there's no Cleanlab/Galileo-style tool that occupies the rating-authority position: those products are diagnostics owned by the data owner, not third-party attestations a procurement team or model risk officer can cite. So I rebuilt the whole thing as the rating layer. The methodology is published with a DOI (10.5281/zenodo.20278981, CC BY 4.0): full v3.1 paper, every dimension defined. What's in v3.1: - 19 dimensions: label correctness, coverage, leakage, contamination, plausibility, oracle agreement, conformal coverage, downstream projection, adversarial stability, subgroup equity, license clarity, provenance chain, and more - 7-oracle consensus across the score, with oracle_agreement itself being a scored dimension (i.e., the score knows when the score is uncertain) - Outcome Registry: downstream signals feed back to recalibrate oracle credibility, so the rating learns from real-world quality outcomes, not just inter-rater agreement - Ed25519-signed certificates auditors can verify offline against the published public key (no API call needed) - Public LQS Index: 11 tickers, ~263 datasets scored, daily rebalance, free API This is genuinely pre-revenue (zero acquired customers; being honest with you, not posturing). What I'd actually value from this sub: Methodology review. The paper is open. If any dimension definitions are wrong, weights are gameable, or the oracle aggregation is misspecified, I want to know now before this gets cited. Adversarial datasets. If you have a dataset where you think the LQS would score it wrong (either direction), I'll score it free and we can publish the disagreement. Comparable systems I should be citing. I'm aware of Cleanlab, Galileo, the FT
How do you upload data anonymously for a submission (ACL/EMNLP)? I have several models I need to upload for replication and was thinking HuggingFace, but HF offers download tracking on a paid plan. Does this violate the policy since there is the potential of tracking the download even if you do not use the service? Most grateful in advance.   submitted by   /u/Budget_Mission8145 [link]   [comments]
Hey r/ML, I just posted a preprint on SSRN for PHI // DRIFT, a cognitive architecture that gives an AI companion persistent internal state, salience-weighted memory retrieval, and a falsifiable continuity metric (PEDI). Ablation testing confirmed the DMU memory system injects 14.8% more context per prompt than cosine-only RAG, a structural finding that holds on CPU-only consumer hardware. Also looking for an arXiv endorsement for cs.AI if anyone's willing. Happy to answer questions on the architecture. Here is my abstract: I present PHI // DRIFT, a cognitive middleware architecture designed to address a fundamental limitation in current large language model deployments: the absence of persistent internal state that evolves across interactions with a specific user over time. Existing systems process each interaction as an isolated probabilistic event: competent, but stateless. We describe this gap as talking to the statistics of a mind. DRIFT introduces five architectural contributions: the Decision Memory Unit (DMU), the Persistence-Embodiment-Drift Index (PEDI), a homeostatic regulation layer, a security defense layer, and a logic chain reasoning trace system. All development and evaluation were conducted on consumer hardware with no GPU acceleration. Ablation testing confirmed DMU re-ranking injects 14.8% more context per prompt than cosine-only retrieval. Live stress testing at 50-thread concurrency produced 100% success rate with no breaking point found. We do not claim PHI // DRIFT is conscious. We claim it produces measurably more continuous, contextually coherent output than stateless alternatives, and we provide a framework for testing that claim. submitted by /u/Interesting_Time6301 [link] [comments]
We often see worry from workers that ML techniques will either fully replace them, or jostle them violently economically such that their earnings and well-being are impacted. Concurrently, many tech companies resist unionization/"guild" efforts to protect the careers of technically capable employees, software engineers in particular. And cynically we might suspect a trend towards "corporatism" as companies grow larger, even if they're initially established by well-meaning, competent, and technical-minded people. While I acknowledge a tongue-in-cheek quality to this discussion - versus efforts to automate software engineering, where is the SoTA on automating logistical decisions made by CEOs/CFOs/CTOs? (I'm envisioning, idealistically, a "cooperative" or guild formed by equal contributors of technical content where the business itself is generically managed in a decentralized way, specifically where ML facilitates centralized decision making when it becomes strictly necessary. Frankly, a core advantage of this would be an ideal robustness to "adversarial" overtake of the cooperative, if the ML agent was explicitly pre-designed both to 1) prioritize the productivity and welfare of the employees and 2) to resist ML-space adversarial attacks trying to falsely incentivize it towards "selling out." The human benefit to the employees here would be decision-making free of "The Mask of Sanity"-type behavioral failings, but perhaps also the facilitation of direct-democracy-at-scale. You could imagine teams electing representatives at only the scales they're comfortable with, and CEO-Bot managing the rest as a balanced-rewards problem.) Intuitively, some might suspect C-suite employees are not meritorious, but I guess the question is, what functions do they perform that resist automation? Schmoozing, elicitation during funding rounds, having a keen eye to the business environment? As silly as this is, humor me: th
Hello, I would like to know whether building my own image encoder would be a good idea instead of using models like CLIP, SigLIP/SigLIP2, or DINO. My use case is video frame classification. My pipeline is the following: the client sends me a video stream, sampled at 1 frame per second, forming segments of 15 frames (30 seconds). I compute embeddings for these frames and send them to a small custom Transformer (1.5M to 9M parameters). This works very well on GPU. However, I have two main constraints: processing speed and deployment on small CPU-only devices. A CLIP-S0 encoder processes around 10 images per second on 4 vCPUs. I would like to replace it with my own encoder trained on my dataset (a few million images), with only a few million parameters and around 4 to 5 labels. My question is whether this is a good approach, and whether it would improve both embedding generation speed and the accuracy of my Transformer model.   submitted by   /u/These_Try_656 [link]   [comments]
Didn't see one so wanted to make one myself. Reviews are actually already out, curious what everyone thinks about the quality of the reviews? I've heard it's a mixed bag and apparently a concerning amount of AI generated reviews for some people.   submitted by   /u/RandomMan0880 [link]   [comments]
I have done courses on statistical machine learning and deep learning, and I would say I understand the papers, even the theoretical justification part. However, whenever I am reading a paper, I feel I take a backseat and just absorb what's written rather than being critical of it. This is also hampering my research, as I am decent at the empirical part but often struggle with grounding it theoretically. Any suggestions? submitted by /u/Living_Decision_6725 [link] [comments]
Disclaimer: I work for Numind, the company behind this open-weight model We just released a 4B model based on Qwen3.5-4B, under Apache-2.0 license. The goal is to make information extraction from complex documents more practical with an open model: PDFs, screenshots, forms, tables, receipts, invoices, multi-page documents, and other visually structured inputs. Try it, we have a huggingface space that is completely free (you don't even have to sign-up): https://huggingface.co/spaces/numind/NuExtract3 If you ever used NuMarkdown, NuExtract3 is the successor. There are some examples to guide you. Feel free to re-use this model for any task. https://preview.redd.it/pm2xbooyxn2h1.png?width=1672&format=png&auto=webp&s=1a8a7b262190c8325159496dae98c3d2dfab493c https://preview.redd.it/b5z7ylfzxn2h1.png?width=1758&format=png&auto=webp&s=a07b3abd6e5065c2635de047bdf154357f903e4c A few things it is designed for: converting document images to Markdown extracting structured data from documents using a target json template handling tables, forms, and layout-heavy pages working with both text and visual document inputs serving as a local/open-weight alternative for document extraction pipelines It was trained on a node of 8xH100 for 3 days to train on as much context as we could, so it should perform fairly well even on long document. For Markdown, we'd still recommend going page by page for the best results and inference speed, since you can parallelize better this way. It's very easy to self-host, since we provide fairly extensive documentation, Safetensors, GGUF and MLX weights. With as little as 4GB of VRAM, you should be good to go. We provide multiple quantizations (GPTQ, W8A8, FP8, Q4, Q6...) so you should be able to run it anywhere. We mostly tried vLLM, SGLang, llama.cpp. We have a blog post and a pretty decent model card: https://about.nuextract.ai/blog/nuextract-3-release https://huggingface.co/numind/NuExtract3 https://huggingface.co/collecti
I've seen systems score well internally and then immediately fail under: ambiguous user intent, messy real-world context, contradictory instructions, and long-running sessions. Feels like evaluation still heavily rewards clean-task optimization instead of behavioral robustness. What are people using beyond standard eval pipelines? submitted by /u/Bladerunner_7_ [link] [comments]
Goal: To save humans wasting time sitting in Call Centre queues waiting to be answered. To have a tool listen in on the audio stream of a live call, post IVR navigation, to determine whether the call has transitioned out of the queue and to a live person. Requirements: The tool must be able to classify the audio within a contextual window of under 1-2 seconds with as high a confidence level as possible. This is not a typical AMD tool; we are not just detecting machine audio vs human speech. Assumed Challenges: It may be difficult to distinguish between a pre-recorded RVA (Recorded Voice Announcement) and a human speaking. RVAs are typically professionally recorded with distinct pitches and emotional cues, and have clean audio with no background noise or silence before and after the message. This is not always the case, especially if announcements are recorded in house by the general staff. When a call is transitioning and 'Answered' there is usually a distinct soft click and/or some background noise before the agent starts speaking. This silence period, whilst a good indication a call has been answered, could be confused with quiet periods between music or RVA announcements in the queue. It may be difficult to determine if we have been answered by Voicemail: whilst there is usually a beep at the end, the message itself would also start with a silence period followed by audio sounding similar to an RVA. A single short beep tone could mean Voicemail, Answered, or it could mean the call is being recorded. Identifying that we are in a queue based on TTS audio may become difficult as TTS engines become more sophisticated. Telephony audio (G711a) is in the frequency band of 300-3400 Hz, sampled at 8000 Hz, 64 kbit/s. Approach: To train via machine learning, using labelled data, an audio classification application that analyses the acoustics, waveform or spectrogram (via Fast Fourier Transform) of the audio stream. At this stage I do not want to use STT to determine the phase or label - Although this w
I'm currently doing a research internship and my supervisor is constantly pushing me to come up with a novel idea. I've read about 15-20 papers on VLA and I think most directions are saturated. I thought of an equivariant VLA based on equivariant CNNs (published in 2016) and successfully implemented it, and then I found that someone had already published that too. Do you guys have any advice on what I should do next? Any suggestions are welcome! submitted by /u/No_Mixture5766 [link] [comments]
Most liveness detection systems in production today were built around a threat model where the attacker is submitting a static image or a basic replay video. The generation quality of current synthetic media is categorically different from what those training datasets captured. The question I keep coming back to is whether a model trained on historical deepfake samples can generalise to generation techniques that did not exist when the training data was assembled. And if the answer is no, what does the update cycle look like for vendors claiming deepfake detection as a core capability. I asked two identity verification vendors this directly and got answers that sounded confident without addressing the temporal gap between training data and current generation quality.   submitted by   /u/Unique_Buy_3905 [link]   [comments]
Hello guys, I am trying to reach 90% accuracy on the ADNI dataset, but it keeps getting stuck at 55%. Any tips to improve results? submitted by /u/LahmeriMohamed [link] [comments]
Hi, did anyone apply to it or attend it previously? How was the experience? I got the acceptance but no scholarship; is it worth going self-sponsored? submitted by /u/Icy-Solid-4159 [link] [comments]
RPS is inspired by neuroscience. As humans, we learn basic skills as kids with high neuro-plasticity. We then learn advanced skills as teens and adults with low neuro-plasticity. RPS trains a model in 2 stages. In stage 1, the model is trained on easy data with a high learning rate. In stage 2, the model is trained on hard data with 10% of the stage 1 learning rate. RPS is basically a combination of existing ideas: curriculum learning + learning rate decay. ARC-AGI 1 public eval scores (base model: Qwen3-8b): RPS: 4%; EPS (equal learning rate in both stages): 2.4%. Program synthesis stats, program executions without error: RPS: 1145/1200; EPS: 870/1200. https://iamjasonfeng.blogspot.com/2026/05/regressive-plasticity-schedule.html https://github.com/iamjasonfeng/RPS submitted by /u/iamjasonfeng [link] [comments]
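A minimal sketch of the two-stage schedule described in the post, not the author's implementation (that lives in the linked repo). The function names, the base learning rate of 1e-4, and the stage boundary of 1000 steps are all illustrative assumptions; only the 10% ratio between stages comes from the post.

```python
def two_stage_lr(step, stage1_steps, base_lr=1e-4, stage2_ratio=0.1):
    """Learning rate at a given global step.

    Stage 1 (easy data) uses base_lr. Stage 2 (hard data) uses
    stage2_ratio * base_lr, i.e. 10% of stage 1 by default.
    base_lr and stage1_steps are hypothetical values for illustration.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return base_lr if step < stage1_steps else base_lr * stage2_ratio


def pick_split(step, stage1_steps):
    """Which data split to sample from at this step: easy first, then hard."""
    return "easy" if step < stage1_steps else "hard"


if __name__ == "__main__":
    # Print the schedule around the stage boundary.
    for step in (0, 999, 1000, 1500):
        print(step, pick_split(step, 1000), two_stage_lr(step, 1000))
```

Setting `stage2_ratio=1.0` would give the equal-learning-rate (EPS) baseline under the same data ordering.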
It's about inference-time learning by inserting some experts specialized for updating sibling expert weights in MoE. All the components needed were already there, but no one tried it inside MoE, so I did a small PoC. It kinda worked. I'd love to hear what you think. https://zenodo.org/records/19661389   submitted by   /u/max6296 [link]   [comments]
The research community has provided (already for some time) seemingly more efficient and effective tokenizations for vision. Do we have any hint on whether non-fixed-patches tokenization is being applied on the big player models? I imagine not, and I'm trying to think why: - marginal gains? - pipelines needing a fixed number of tokens per image upfront for efficiency reasons (or even harder limitations)? - scaling laws are not well understood for input-adaptive patching therefore big players do not bet on this? or am I simply totally wrong and under the hood all the big players are doing dynamic tokenization for vision?   submitted by   /u/howtorewriteaname [link]   [comments]
I am choosing a baseline for a real manipulation stack and trying not to lose a month on setup that someone here has already done. Shortlist is OpenVLA, pi0.6, and WALL OSS from X Square Robot. OpenVLA is still the easiest reference point with lots of reproductions. pi0.6 looks strong from recent public updates but I have not seen many fully transparent ablations. WALL OSS looks promising in LeRobot and I can run inference on UR5 plus parallel gripper without issues, around 70 ms on a 4090 in my local setup. What I need is less paper score discussion and more deployment reality. If you have run a controlled comparison on LIBERO or ManipArena style tasks, I would really value failure modes and data budget details. If you have fine tuned any of these on real hardware, which one was least painful on demonstration volume. If you run continuous updates, how often do you retrain and how bad is drift over a few weeks. I can post my own table once I finish, but if there is existing work I should read first that would save a lot of duplicated effort.   submitted by   /u/Dense-Sir-6707 [link]   [comments]
I got into this CFE MLSS 2026 and would like to connect with people who also got into it or have been in previous cohorts! I am organizing a group chat for people who got into the program :DD https://cfe.columbia.edu/content/mlss   submitted by   /u/elucidativemind [link]   [comments]
Recently fine-tuned a Gemma 4 26B model, and I'm seeing surprisingly high end-to-end latency despite the effective inference footprint being much smaller (~4B-ish behavior during serving). Current setup: Model: Gemma 4 26B (fine-tuned); Engine: vLLM; Quantization: FP8; Hardware: H100. Observed latency: TTFT: ~100-300 ms; E2E latency: ~3-5 seconds. The TTFT seems reasonable, but the overall generation latency feels disproportionately high for the effective serving size. I already experimented with vLLM's n-gram speculative decoding, but honestly didn't see meaningful gains. Now I'm considering more serious speculative decoding approaches: EAGLE / Medusa-style methods; draft-model-based speculative decoding; possibly training a smaller Gemma draft model. Curious to hear from others who've worked with Gemma 4 or large distilled/fine-tuned models: Is this kind of latency expected? What actually moved the needle for you? Any bottlenecks I should investigate first before going deeper into speculative decoding? Would love to hear experiences, benchmarks, or even horror stories :)) submitted by /u/Ok-Rooster-8120 [link] [comments]
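One way to judge whether the E2E number is out of line is to separate prefill from decode: roughly, E2E latency is TTFT plus (output tokens - 1) times the time per output token. The sketch below applies that standard decomposition; the function name is hypothetical, and the output length is an assumption because the post does not state how many tokens are generated per request.

```python
def time_per_output_token(ttft_s, e2e_s, output_tokens):
    """Average decode time per token, from E2E = TTFT + (N - 1) * TPOT.

    ttft_s:        time to first token, in seconds
    e2e_s:         end-to-end latency, in seconds
    output_tokens: number of generated tokens N (not given in the post;
                   the caller has to measure or assume it)
    """
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to estimate decode speed")
    decode_s = e2e_s - ttft_s
    if decode_s <= 0:
        raise ValueError("e2e_s must be larger than ttft_s")
    return decode_s / (output_tokens - 1)


if __name__ == "__main__":
    # Hypothetical request: 0.2 s TTFT, 4.2 s E2E, 401 generated tokens.
    tpot = time_per_output_token(0.2, 4.2, 401)
    print(f"{tpot * 1000:.1f} ms/token, {1 / tpot:.0f} tokens/s")
```

If the per-token figure looks reasonable for the hardware, the E2E latency is mostly a function of output length, and speculative decoding only helps to the extent that it raises accepted tokens per decode step.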
Autoregressive LLM world models factorize next-state generation left-to-right, preventing them from conditioning on globally interdependent anchors (tool schemas, trailing status fields, expected outcomes) and yielding prefix-consistent but globally incoherent rollouts. MDLMs' any-order denoising objective sidesteps this by learning every conditional direction from the same training signal. Empirically, fine-tuned MDLMs (SDAR-8B, WeDLM-8B) surpass AR baselines up to 4x their total parameter count on BLEU-1, ROUGE-L, and MAUVE across in- and out-of-domain splits, with lower Self-BLEU and higher Distinct-N confirming reduced prefix mode collapse. GRPO training on MDLM-generated rollouts shows up to +15% absolute task-success gains over AR generated training on held-out ScienceWorld, ALFWorld, and AppWorld across 1.2B-7B backbones (LFM2.5, Qwen3, Mistral) in a zero-shot transfer setting. submitted by /u/MegixistAlt [link] [comments]
GPU monitoring tools like DCGM give you hardware-level metrics but no workload context. When a node is saturated, you can't tell which experiment, team, or job is responsible without digging through logs. We built l9gpu to close that gap. It's a node-level agent that exports GPU metrics via OTLP with workload attribution embedded: - Kubernetes: correlates GPU metrics with pod, namespace, and deployment - Slurm: correlates with job ID, user, and partition - LLM inference: native metrics for vLLM, SGLang, and TGI - Hardware: NVIDIA, AMD MI300X, Intel Gaudi - 17 pre-built Prometheus alert rules + Grafana dashboards Derived from Meta's gcm project, extended with K8s attribution, multi-vendor GPU support, and OTLP export. MIT licensed. https://github.com/last9/gpu-telemetry Happy to discuss design decisions around the attribution mapping. What is the ML infra community using for GPU cost visibility in shared research clusters?   submitted by   /u/bakibab [link]   [comments]
Hello, I would like to start learning machine learning. Where should I start, and what resources should I use? I come from cybersecurity, so I know a bit of Python. What courses should I take, and where can I find someone who can help me? submitted by /u/Gold_Chemistry8851 [link] [comments]
OpenAI posted a math result today claiming that one of its general-purpose reasoning models found a construction disproving the conjectured n^{1+O(1/log log n)} upper bound in Erdős's planar unit-distance problem. Announcement: https://openai.com/index/model-disproves-discrete-geometry-conjecture/ Proof PDF: https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit-distance-proof.pdf Abridged reasoning writeup: https://cdn.openai.com/pdf/1625eff6-5ac1-40d8-b1db-5d5cf925de8b/unit-distance-cot.pdf The mathematical claim, as I understand it, is that there are finite planar point sets with more than n^{1+δ} unit distances for some fixed δ > 0 and infinitely many n. That would rule out the expected near-linear upper bound, though it does not determine the true asymptotic growth rate. What seems especially relevant for this subreddit is the process claim: OpenAI says the solution was produced by a general-purpose reasoning model, then checked by an AI grading pipeline and reviewed/reworked by mathematicians. The proof PDF also includes the original prompt given to the model, but not the full experimental details: no model name, sampling setup, number of attempts, compute budget, hidden system prompt, or full grading pipeline. Curious how people here read this as an ML result. Is this best viewed as evidence of frontier models doing genuine autonomous research, or as a cherry-picked but still important sample from a large search process? What kind of disclosure would you want before treating this as a reproducible AI-for-math milestone? submitted by /u/NutInBobby [link] [comments]
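For readers who want the two statements side by side, here they are in LaTeX, as the post describes them. The notation u(n) for the maximum number of unit distances among n points in the plane is an assumption introduced here, not taken from the post or the linked PDFs.

```latex
% u(n): maximum number of unit distances determined by n points in the plane
% (notation assumed for this note)
\[
  \text{Conjectured upper bound:}\qquad
  u(n) \le n^{\,1 + O(1/\log\log n)}
\]
\[
  \text{Claimed result:}\qquad
  \exists\, \delta > 0 \ \text{such that}\ u(n) > n^{\,1+\delta}
  \ \text{for infinitely many } n
\]
```

The two are incompatible because for any constant C, the exponent C / log log n eventually drops below any fixed δ > 0 as n grows.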
LLMs are trained on human data, so where does the tendency to add emojis come from? For example, when some models generate code explanations or even normal responses, they often add lots of emojis that people don't really use that way in real life. My current guess (without having researched this yet) is that emojis might sometimes be added after the initial generation process, maybe during post-processing, alignment, or some "reasoning/thinking" stage, rather than being part of the raw generated response itself. Because intuitively, an emoji doesn't really behave like a normal word/token inside a sentence or code block. submitted by /u/Zoldyck_J [link] [comments]
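A quick check on the intuition that an emoji is not a normal token: at the byte level an emoji is just an ordinary UTF-8 sequence, which is the kind of input byte-level tokenizers operate on. The sketch below uses only the Python standard library; it does not call any real model's tokenizer, and the helper name is hypothetical.

```python
def utf8_byte_ids(text):
    """Return the UTF-8 bytes of a string as a list of integers (0-255)."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    word = "bank"
    emoji = "\U0001F642"  # slightly smiling face
    # Both are plain byte sequences; the emoji simply needs more bytes.
    print(word, utf8_byte_ids(word))
    print(emoji, utf8_byte_ids(emoji))
```

This only shows that emojis can be represented like any other text; how a given model's vocabulary merges those bytes, and where the stylistic habit comes from, is not something this example can answer.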
Hi, how hard is it currently to get a PhD position in machine learning? What are the requirements to get into a decent mid-tier program (= they publish regularly at respected journals and their work gets read by some people)? How does it differ across regions, e.g. US, Europe, etc.? I am about to finish my master's and am wondering if I need to squeeze in an unpaid guided research project to extend my network. submitted by /u/strammerrammer [link] [comments]
Hi everyone. I am a recent CS grad and I have received a PhD offer from a school in the States. However, I am deeply unsure whether I should accept it. My hesitation comes from the interdisciplinary nature of the program. It will be jointly supervised by two professors, one from the biomedical domain and one from the ML domain. I have always wanted to work on the foundational aspects of AI and to publish in A* AI conferences, so I am not sure if it is the right choice. The other option for me is to wait and work on enhancing my profile: get another paper or two published in respected venues and apply again. I have a decent profile, with a couple of internships and research papers, and a >90% CGPA. Moreover, I believe I can do foundational work much better than applied work, so my biggest fear is that I accept the offer and later find out that the AI part is trivial and minimal. That might lead to frustration and lower productivity. What should I do in this case? If anyone has been part of such an interdisciplinary program, please do share your experience. Thanks! submitted by /u/ProfessionalDue369 [link] [comments]
If I have a labeled dataset, is it possible to split my data by label, so that each chunk holds the sentences of one label, and then use this to label more sentences? And is it even a good idea to label data this way: I search for a new sentence, look at the label of the result I get back, and assign that label to my sentence? submitted by /u/Ok-Buffalo-8655
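What the question describes is essentially nearest-neighbor labeling: look up the most similar labeled sentences and copy their label. A minimal sketch of that idea follows, assuming a bag-of-words cosine similarity as a stand-in for a real sentence embedding; the function names, the toy sentences, and `k=3` are all made up for illustration.

```python
from collections import Counter
import math


def vectorize(sentence):
    """Bag-of-words term counts; a placeholder for a real sentence embedding."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_label(query, labeled, k=3):
    """Majority label among the k most similar labeled sentences, plus the vote share."""
    q = vectorize(query)
    ranked = sorted(labeled, key=lambda pair: cosine(q, vectorize(pair[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k


# Toy labeled set (invented for this example).
labeled = [
    ("the battery died after one day", "negative"),
    ("the screen cracked after one week", "negative"),
    ("terrible battery and poor screen", "negative"),
    ("great battery life and sharp screen", "positive"),
    ("i love the camera and the screen", "positive"),
    ("excellent camera great value", "positive"),
]

label, confidence = knn_label("great camera and great battery life", labeled, k=3)
print(label, round(confidence, 2))
```

Voting over several neighbors rather than taking the single top hit gives a rough confidence value, so low-agreement sentences can be sent to a human instead of being auto-labeled.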
I've been running a file management agent built on MCP for a few months. It handles module renames, import updates, validation scaffolding, test execution. A typical session is 60 to 120 tool calls. The whole thing was powered by Opus 4.7 because I never thought to question it until I looked at my April bill. So I set up a comparison. Eight refactoring tasks on a 15k line Python project, same MCP tools, same system prompt, same repo state, five models. Tasks were things like "rename this module and fix all imports" and "add input validation to these 12 endpoints." Routine cleanup, nothing requiring deep architectural thought. The metric I cared about was first attempt tool call success: did the model produce a valid function call that executed without a parse error on the first try? On the expensive end, Opus 4.7 hit roughly 98 to 99 percent across a bit over 500 calls and cost close to $15 for all eight tasks. GPT 5 was similar quality for around $11. The cheaper tier surprised me. Sonnet 4.6 landed somewhere around 96 percent for about $4. DeepSeek V4 Pro was in the same neighborhood for under $2. And Tencent Hunyuan Hy3 preview came in within a couple of points of Opus for under $1.50. Under two percentage points separating the priciest model from the cheapest, on tasks where a failed call just gets retried anyway. I'll be honest, the results were anticlimactic. I expected a bigger reliability gap. I actually spent half a day debugging what I thought was a quality issue with one of the MoE models before realizing I'd misconfigured the tool call schema in my system prompt. Every call was producing malformed JSON and I blamed the model. Classic. The model is a 295B parameter MoE with 21B active per token, so full BF16 weights are around 590GB. The official deployment path is vLLM or SGLang on something like eight H200 class GPUs, which is not exactly homelab territory. But the 4 bit quantized weights land around 165GB, which just fits in unified
I.e., given a conference (say, one with OpenReview data), e.g. "NeurIPS 2025", return the accepted papers sorted by number of citations according to a standard paper search engine (e.g. Google Scholar). This seems to be a surprisingly difficult thing to find online. submitted by /u/baghalipolo
I built a custom RL algorithm for continuous flight control and open-sourced it. Sharing here in case the structural ideas are useful for anyone doing continuous control where one action axis dominates.
I've been training continuous control on a 6-DoF flight sim (pitch/roll/yaw/throttle/brake/fire) and kept hitting the same wall: vanilla TD3 would peak, then collapse into pitch oscillation and never recover. I tried reward shaping for a while before concluding the problem was structural, not in the reward.
NOML is what came out of that. Three structural changes on top of a standard TD3 skeleton:
• Anchor policy: the action is anchor + delta·gate, where the anchor is a fixed safe action (wings level, MIL throttle). The policy literally cannot fully forget how to fly straight; the worst a collapsed policy can do is fall back to the anchor.
• Hierarchical actor: three MLPs with independent optimizers (pitch → roll → rest), so a roll-side gradient update can't corrupt the pitch head. This is what actually killed the oscillation for me.
• Mirror learning: left-right symmetry means every transition can be mirrored into a free second sample. 2× data when env steps are the bottleneck.
One thing that surprised me and goes against the usual advice: my best results came with exploration noise effectively off. On this task adding Gaussian action noise mostly just shook the stick and hurt. The anchor+gate structure seems to provide enough of the "fall back to safe behavior" role that noise usually plays.
Code (Apache 2.0), full writeup, and a test video are here: https://github.com/9138noms/NOML https://www.youtube.com/watch?v=ZNn6wo_PX8Y
submitted by /u/9138NOMS
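The anchor and mirror ideas can be sketched in a few lines. This is not the NOML code: the anchor values, axis ranges, sign conventions, and the `state_sign` argument are assumptions chosen for illustration, and only the formula action = anchor + delta·gate comes from the post.

```python
# Hypothetical axis layout and ranges; the post only names the six axes.
AXES = ("pitch", "roll", "yaw", "throttle", "brake", "fire")
ANCHOR = (0.0, 0.0, 0.0, 0.8, 0.0, 0.0)   # assumed "wings level, MIL throttle"
LOW = (-1.0, -1.0, -1.0, 0.0, 0.0, 0.0)
HIGH = (1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
# Assumed reflection: roll and yaw change sign, the other axes are unchanged.
MIRROR_SIGN = (1.0, -1.0, -1.0, 1.0, 1.0, 1.0)


def compose_action(delta, gate):
    """action = anchor + delta * gate, clipped to the valid range per axis."""
    return tuple(
        min(hi, max(lo, a + d * g))
        for a, d, g, lo, hi in zip(ANCHOR, delta, gate, LOW, HIGH)
    )


def mirror_action(action):
    """Reflect an action left-right by flipping the lateral axes."""
    return tuple(s * a for s, a in zip(MIRROR_SIGN, action))


def mirror_transition(state, action, reward, next_state, state_sign):
    """Build the mirrored copy of one transition; state_sign marks lateral state features."""
    def flip(vec):
        return tuple(s * x for s, x in zip(state_sign, vec))
    return flip(state), mirror_action(action), reward, flip(next_state)


# With the gate fully closed, the policy output is exactly the anchor.
closed = compose_action((0.5, -0.3, 0.1, 0.2, 1.0, 1.0), (0.0,) * 6)
# With the gate open, the delta moves the action away from the anchor (then clips).
opened = compose_action((0.5, -0.5, 0.0, 1.0, 0.0, 0.0), (1.0,) * 6)
print(closed, opened)
```

The closed-gate case is the "worst a collapsed policy can do" property from the post: whatever the delta head outputs, a zero gate returns the safe action.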
LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet, automating their configuration remains a structural challenge. Researchers are often forced into manual, trial-and-error prompt tuning, where a change to a single agent shifts the global output in ways that are difficult to trace. The core bottleneck is credit assignment: while the parameters governing agent behavior are local, performance scores are only available at the global system level. This makes optimization fundamentally difficult because we do not inherently know which agents contributed positively or negatively to the outcome.
CANTANTE is an attempt to take a different path: treating agent prompts as parameters learned from task rewards rather than tuned by hand. By solving the credit assignment problem, we can move from brittle, hand-crafted agent demos to trustworthy systems that are actually autonomous and useful in practice.
CANTANTE's algorithm in short (see second image):
1. Let local optimizers suggest configurations (e.g., prompts).
2. Evaluate different configurations on the same queries, capturing reasoning traces and system scores.
3. Let an attributer compare these rollouts and assign each agent a credit, thereby decomposing the global reward into per-agent update signals.
4. Feed those credits to any local optimizer; for the experiments, we use CAPO, our prompt optimizer from prior work at AutoML 2025.
Evaluated against the DSPy-solutions GEPA and MIPROv2 on MBPP (Programming Benchmark), GSM8K (Mathematical Reasoning Benchmark), and HotpotQA (Retrieval Benchmark), CANTANTE:
• Achieves the best average rank,
• beats the strongest baseline by +18.9 points on MBPP and +12.5 on GSM8K, and
• maintains inference time cost compared to unoptimized prompts.
Link to the paper: https://arxiv.org/abs/2605.13295 Link to the repo: https://github.com/finitearth/cantante If y
Hi, I'm interested in geometric deep learning (thanks to Michael M. Bronstein's book and Maurice Weiler's PhD thesis), and so that my projects don't go nowhere, I decided to keep a technical blog. I started with a short note about machine learning on spherical manifolds, but it's a pretty simple thing. Is there a list of open problems in GDL? Or, if some of you are working in this direction, could you suggest which GDL problems are currently relevant to the research community? submitted by /u/eesuck0
Hi, I am reviewing for an ICML workshop; however, there are no guidelines on the structure of the reviews (e.g. what the criteria are, what the grading scale is, etc.). Does anyone know whether ICML workshops have some "convention" regarding reviews? Or ought we to use ICML's reviewer instructions (https://icml.cc/Conferences/2026/ReviewerInstructions)? submitted by /u/Ok-Painter573
For proceedings-only papers, do we need to make a poster and submit it to the portal? Has anyone asked the ICML Program Chairs this question? submitted by /u/minhquang251
On OpenReview, you can see a modified date next to each review. If this date is recent (12th May or newer), the reviewer has given a final justification and may have increased their score or kept it the same. Either way, it means they read the rebuttal and justified their score and decision. For me, as of writing this post, none of the reviewers has provided a justification. My scores are 4, 3, 3 and every concern was easily addressed in the rebuttal. At CVPR I was in the same position: none of the reviewers justified their decision, and the AC simply said "concerns remain", even though they were clearly answered in the rebuttal, and rejected the paper. submitted by /u/Healthy_Horse_2183
Scale AI
Highest quality in the industry. But no public pricing and every project requires a sales call. Onboarding takes weeks, not days. In June 2025 Meta bought a 49% stake and hired Scale's CEO as Meta's Chief AI Officer. Several major customers quietly reduced engagements over data exposure concerns. Worth thinking about if you're building anything competitive with Meta. Best for: well-funded teams with enterprise security requirements and long timelines.
Appen
Over 1 million contractors across 170 countries. Sounds impressive until you realize it was built for massive long-term projects. Small teams consistently report it being slow and inflexible for novel tasks. Low contractor pay rates also raise real questions about annotation quality. Best for: high volume, low complexity, multilingual tasks.
CloudFactory
Trained dedicated teams and ethical sourcing. More consistent than the giants. Still not self-serve though, and onboarding takes time. Project management quality varies depending on which team you get. Best for: structured projects with clear requirements and no time pressure.
Labelbox
Best annotation software on the market. The catch is it's a platform, not a workforce. You still need to find and manage your own annotators. Powerful if you have an internal team. Not useful if you don't. Best for: teams building long-term internal annotation infrastructure.
The problem!!
Every major platform is optimized for enterprise scale. None of them are built for teams that need 500-2000 examples labeled fast, with domain expertise, and full transparency into who's doing the work. What are you currently using for annotation work?
submitted by /u/Neil-Sharma
Been going down a mechanistic interpretability rabbit hole for the past few weeks and ended up building this thing called AXON. The idea: every time GPT-2 generates a token, its residual stream gets passed through a Sparse Autoencoder (Joseph Bloom's pretrained SAE). The SAE decomposes it into human-interpretable features, things like "European geography", "capital cities", "French language", and streams those to the browser over WebSocket, where they show up as a live 3D force graph. Nodes = SAE features. Edges = features that fired together on the same token. Node brightness = activation strength. The whole graph evolves token by token. What surprised me most: type "The capital of France is" and you can literally watch geography features, proper noun features, and completion-pattern features light up before the word "Paris" even gets generated. It's not what the model outputs that's interesting; it's what's happening right before it decides. Stack: TransformerLens + SAELens on the backend, FastAPI WebSocket for streaming, Three.js + 3d-force-graph on the frontend. Runs on CPU (~800ms/token) or GPU (~35ms on a 4050). Labels come from Neuronpedia's API and get cached locally. You can also swap in other models (GPT-2 medium/large/xl, Pythia variants, Gemma-2-2B) as long as there's a pretrained SAE for it in SAELens. GitHub: https://github.com/09Catho/axon Would love feedback and stars, especially from anyone who's worked with SAEs before. I'm curious whether the co-activation edges are actually meaningful or just noise at this layer. submitted by /u/Financial_World_9730
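One rough way to probe the closing question is to compare each edge's raw co-activation count with what independent firing would predict. The sketch below is not taken from AXON: the feature IDs are made up, and the lift score is an extra diagnostic added here as a suggestion, not something the project is known to compute.

```python
from collections import Counter
from itertools import combinations


def coactivation_edges(active_per_token):
    """Map each feature pair to (co-fire count, lift over independent firing)."""
    n_tokens = len(active_per_token)
    fire_count = Counter()
    pair_count = Counter()
    for active in active_per_token:
        feats = sorted(set(active))
        fire_count.update(feats)
        pair_count.update(combinations(feats, 2))
    edges = {}
    for (a, b), count in pair_count.items():
        # Expected co-fires if a and b fired independently of each other.
        expected = fire_count[a] * fire_count[b] / n_tokens
        edges[(a, b)] = (count, count / expected)
    return edges


# Hypothetical per-token sets of active SAE feature IDs.
tokens = [{1, 2}, {1, 2}, {1, 2, 3}, {3}, {3, 4}, {4}]
edges = coactivation_edges(tokens)
for pair, (count, lift) in sorted(edges.items()):
    print(pair, count, round(lift, 2))
```

A lift well above 1 means the pair fires together more often than chance would suggest; a lift near or below 1 hints that the edge mostly reflects two individually frequent features.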
Has anyone applied to LxMLS 2026? Did you get any update? submitted by /u/No_Cardiologist7609
Wanted to see how close a fully bio-plausible agent could get to PPO on Pong.
Setup
• Custom Pong environment (pygame, no gym)
• PPO baseline: paper-faithful, from scratch
• Hebbian agent: PPO policy replaced with Hebbian value estimation on engineered features → 61%
• BioAgent: Predictive Coding for feature learning + distributional Hebbian plasticity for value (Dabney et al. 2020) → 57%
Zero backprop anywhere in the pipeline.
Key observations
• The 2% gap is real but small. The bottleneck wasn't the lack of backprop; it was catastrophic forgetting under non-stationary opponent dynamics during self-play.
• Distributional value encoding (à la Dabney) helped stability vs. a scalar Hebbian baseline, but not enough to match PPO under self-play.
• Self-play exposed the plasticity–stability dilemma hard: Hebbian rules that adapt fast forget fast. This is the real wall for bio-plausible RL in non-stationary settings.
Not claiming novelty in the architecture; this is a from-scratch exploration of whether bio-plausible rules can handle a real RL task. Short answer: yes, mostly, with one clear failure mode.
Code: github.com/nilsleut/Biologically-Plausible-RL-Plays-Pong
Happy to answer questions about the PC implementation, the Hebbian value estimator, or the self-play setup.
submitted by /u/ConfusionSpiritual19
Anthropic is buying all 300 megawatts of compute at xAI's Colossus 1 facility in Tennessee for billions of dollars. Musk says xAI already moved training to Colossus 2 and didn't need both. Fine, that's plausible on its face. But the part I keep coming back to is what it says about Grok's actual compute consumption. Every serious AI lab treats compute as a strategic asset you accumulate, not sell. Google, Meta, and Microsoft are building more data center capacity even while actively running training runs, because the assumption is you will always need more for the next model, the one after that, and the inference load that follows. You don't sell 300 MW to a direct competitor unless that capacity is genuinely sitting underutilized. And if Grok were burning through it, it wouldn't be sitting underutilized. There was also a pretty visible drop in Grok usage after the image generation controversies earlier this year. None of that is confirmed internally, but the circumstantial case that Colossus 1 wasn't running hot is pretty strong when you piece it together. Renting to Anthropic generates cash and a headline, but it's the business model of a neocloud, not a frontier lab. xAI is valued at $230 billion. CoreWeave runs roughly comparable compute infrastructure and rents it to AI labs. CoreWeave's valuation is less than a third of that. If xAI keeps moving in this direction, the implied premium over a compute rental business needs a pretty compelling explanation. Curious whether people here see this as a one-off liquidity move or something that changes how you think about xAI's actual AI roadmap. https://www.idlen.io/news/anthropic-spacex-colossus-memphis-300mw-claude-limits-2026/?utm_source=chatgpt.com submitted by /u/peachforbreakfast
I've seen TabPFN-3's recent results, and there is a lot of buzz about foundation models for tabular data (TabICL, TabPFN). The performance those models achieve is really impressive. What makes me a little suspicious about them? They can only analyze small datasets, a few MB of data, yet you need a large GPU machine and a download of a few GB of model weights to predict on those few MB. That doesn't sound rational... I really miss the old-school approach of running a single decision tree or a linear model on the data. What do you think? Do you think feature engineering + classic ML can achieve performance comparable to that of foundation models? Maybe with better explainability? submitted by /u/pplonski
I've been applying the Fiedler value (second-smallest eigenvalue of the weight graph Laplacian) combined with Scheffer's critical-slowing-down indicators to monitor neural network topology during training. Five experiments, all reproducible on CPU in under 24 hours:
• Detection: lambda-2 detects approaching grokking 21,000 steps before test accuracy moves
• Classification: grokking and catastrophic forgetting have distinct structural fingerprints (slope 0.00128 vs 0.00471/step)
• Steering: structurally-guided intervention preserves 91.7% of knowledge vs 2.6% unsteered
• Compounding: three sequential tasks, 100%/100%/97.5% retention, 48x grokking acceleration across tasks
• Preemptive curriculum: compatibility scoring ranks task disruption risk correctly, bridging preserves 100% vs 0% direct
Tested on 2-layer MLPs (modular arithmetic) and a 1-layer transformer (sequence prediction). Honest limitations section in the paper. These are toy tasks and scaling to production architectures is unvalidated. The approach comes from complex systems science (Scheffer's early warning indicators for critical transitions) applied to weight graphs rather than ecosystems or financial markets.
Code and paper: https://github.com/EssexRich/neural_si_validation
Happy to discuss the maths, the experimental design, or the limitations.
submitted by /u/RichBenf
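For readers unfamiliar with the quantity, here is a minimal sketch of computing lambda-2 for an MLP. The graph construction is an assumption made for illustration (one node per neuron, undirected edges weighted by |w| between consecutive layers); the linked repository may build its weight graph differently.

```python
import numpy as np


def fiedler_value(adjacency):
    """Second-smallest eigenvalue of the combinatorial Laplacian L = D - A."""
    a = np.asarray(adjacency, dtype=float)
    a = (a + a.T) / 2.0                      # enforce symmetry
    np.fill_diagonal(a, 0.0)                 # ignore self-loops
    laplacian = np.diag(a.sum(axis=1)) - a
    eigenvalues = np.linalg.eigvalsh(laplacian)   # returned in ascending order
    return float(eigenvalues[1])


def mlp_weight_graph(weights):
    """Adjacency matrix for an MLP given layer matrices of shape (n_out, n_in)."""
    sizes = [weights[0].shape[1]] + [w.shape[0] for w in weights]
    adjacency = np.zeros((sum(sizes), sum(sizes)))
    offset = 0
    for w, n_in in zip(weights, sizes[:-1]):
        n_out = w.shape[0]
        rows = slice(offset + n_in, offset + n_in + n_out)   # output neurons
        cols = slice(offset, offset + n_in)                  # input neurons
        adjacency[rows, cols] = np.abs(w)
        adjacency[cols, rows] = np.abs(w).T
        offset += n_in
    return adjacency


# Sanity checks on graphs with known spectra.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]        # Laplacian eigenvalues 0, 1, 3
split = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]        # disconnected, so lambda-2 = 0
one_layer = mlp_weight_graph([np.ones((2, 2))])  # a 4-cycle: eigenvalues 0, 2, 2, 4
print(fiedler_value(path3), fiedler_value(split), fiedler_value(one_layer))
```

Logging `fiedler_value(mlp_weight_graph(weights))` every few hundred steps would give the kind of time series on which trend or variance indicators could then be computed.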
https://preview.redd.it/mikhasjiq32h1.png?width=572&format=png&auto=webp&s=4c053200dbd9852bebf083550e2144b31579d497 https://preview.redd.it/bay5r3njq32h1.png?width=575&format=png&auto=webp&s=2823db3d6bc534ef00330528a200cba2aca1c5d3 https://preview.redd.it/dm40ntdkq32h1.png?width=575&format=png&auto=webp&s=703beb099eb6e16d2789ac230ebe77de51f07d7a https://preview.redd.it/eubucz2lq32h1.png?width=575&format=png&auto=webp&s=fb5a8d9a7154396087da33487674cda785d2a62a https://preview.redd.it/0xo3t83nq32h1.png?width=586&format=png&auto=webp&s=a569ae89c44953a5bc9aff6fbb37d25759109dd1 I've just finished the Machine Learning Specialization by Andrew Ng, and as I was going through it, I ended up writing detailed lecture notes for all 10 chapters, covering everything from linear regression all the way to reinforcement learning. I put a lot of effort into making these notes as clear and friendly as possible, so even if you're completely new to ML, you should be able to follow along without getting lost. The notes are written in LaTeX and auto-compiled to PDF via GitHub Actions whenever I push an update, so the PDF is always up to date. GitHub: https://github.com/TruongDat05/machine-learning-notes-and-code submitted by /u/Far_Extreme_9737
I am learning about physics-informed neural networks (PINNs). I am playing with simple first- and second-order 1D ODEs, and I compute the loss by adding the initial-condition loss and the physics loss, i.e. total loss = lambda1 (L1) * physics loss (PL) + lambda2 (L2) * IC loss (IL). Regardless of the magnitudes of the losses and lambda values, the total loss is a single number. How does the neural network's prediction change if I impose a higher weight (lambda) on one of the losses? For instance, say PL = 5, IL = 3, L1 = 0.6, L2 = 1; then total loss = 6. However, this value of 6 can be reached through several other combinations: for instance, L1 = 1 and L2 = 0.33 would give a similar value. Given this, how does the model actually learn which losses are weighted more heavily and which are not, and use this information to correct its predictions? submitted by /u/cae_shot
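A short way to see the answer: the optimizer never uses the scalar total itself, only its gradient, and the gradient of the total is lambda1 * grad(PL) + lambda2 * grad(IL). Two weightings that give the same total can therefore pull the parameters in different directions. The toy sketch below assumes a single parameter `theta` and two made-up quadratic losses standing in for the physics and IC terms; it is not a PINN, just the weighting mechanism in isolation.

```python
def physics_loss(theta):
    return (theta - 2.0) ** 2      # toy stand-in, minimized at theta = 2


def physics_grad(theta):
    return 2.0 * (theta - 2.0)


def ic_loss(theta):
    return (theta + 1.0) ** 2      # toy stand-in, minimized at theta = -1


def ic_grad(theta):
    return 2.0 * (theta + 1.0)


def train(lam_physics, lam_ic, theta=0.0, lr=0.05, steps=500):
    """Plain gradient descent on lam_physics * physics_loss + lam_ic * ic_loss."""
    for _ in range(steps):
        # The weights scale each loss's gradient, which is what steers the update.
        grad = lam_physics * physics_grad(theta) + lam_ic * ic_grad(theta)
        theta -= lr * grad
    return theta


# The two weightings from the question, plus equal weights for reference.
theta_a = train(0.6, 1.0)     # IC term weighted more heavily
theta_b = train(1.0, 0.33)    # physics term weighted more heavily
theta_c = train(1.0, 1.0)
print(theta_a, theta_b, theta_c)
```

For these quadratics the minimizer is (2·lambda1 − lambda2) / (lambda1 + lambda2), so the heavier-weighted loss ends up closer to its own optimum; in a real PINN the same mechanism acts on every network weight through backpropagation.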
I'm trying to break into AI/ML Engineer / Applied AI roles, and honestly I've been feeling pretty overwhelmed lately. I've been building around LLM evaluation, model reliability, cost optimization, and production AI systems. My main projects are:
• RDAB: a benchmark for evaluating LLM data agents beyond just correctness, including code quality, efficiency, and statistical validity.
• CostGuard: an LLM reliability/cost proxy that tracks model cost, applies fallback logic, does lightweight response checks, and supports replay-based model comparison.
• Tether: a trace capture layer that records LLM calls so they can be replayed against alternate models to compare quality and cost.
The overall idea is: capture real LLM traffic → replay it against another model → compare quality, cost, and reliability before switching models. But I'm struggling with how to package this clearly. I feel like I've built a lot, but I'm not sure what hiring managers actually care about or what would make this stand out in a competitive market. Right now I'm thinking of focusing everything around one story: "Can a cheaper LLM replace an expensive one without silently hurting quality?" Then use CostGuard as the flagship project, with RDAB as the benchmark layer and Tether as the trace-capture layer.
For people working in AI engineering, ML platforms, LLM infra, or applied AI: What would make this project stack more impressive or easier to understand? Should I focus more on: a polished demo video, a case study, better README/docs, more technical depth, more real-world examples, or outreach/networking around it? Any honest guidance would help. I'm trying to turn this into something that clearly shows production AI engineering ability, not just another AI demo.
submitted by /u/Fit_Fortune953
An issue with the peer review system is reciprocal reviewing, which incentivizes reviewers to unfairly reject good papers to increase their own papers' chances of acceptance. My proposed solution is that the conference should divide the authors/papers into 2 halves (A and B). If you are an author in half A, then you will only be a reviewer in half B. All papers by the same author, their coauthors, and coauthors of coauthors should be in the same half. Each AC/SAC can only serve in one half, and acceptance decisions for the two halves would be independent. So reciprocal reviewers will have no incentive to reject good papers to serve themselves. Furthermore, the discussion periods for the two halves should not be concurrent. This way the reciprocal reviewer will have sufficient time to discuss author rebuttals, as they will not have to deal with their own papers at the same time. Maybe the first two weeks can be the discussion period for half A, and the next two weeks for half B. I don't think conference organizers have thought of this solution, because if they had, there would be no excuse for not trying to implement it: it does not hurt the conference's self-interest in any way. Does anyone think this will work? If so, I hope someone with more influence than me might ask the conferences to implement it. submitted by /u/isentropiccombustor
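The grouping rule in the proposal can be sketched as a graph problem: link authors who share a paper, keep each connected cluster together, and distribute whole clusters across the two halves. The code below is an illustrative assumption of how that might be done (union-find plus a greedy balance), with made-up paper and author names; it is not a description of any conference's actual tooling.

```python
def split_into_halves(papers):
    """papers: dict of paper id -> non-empty author list. Returns (half_a, half_b) paper-id sets."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Authors who share a paper end up in the same cluster (transitively).
    for authors in papers.values():
        for other in authors[1:]:
            union(authors[0], other)

    clusters = {}
    for paper_id, authors in papers.items():
        clusters.setdefault(find(authors[0]), []).append(paper_id)

    # Greedy balance: largest clusters first, each into the currently smaller half.
    half_a, half_b = set(), set()
    for group in sorted(clusters.values(), key=len, reverse=True):
        target = half_a if len(half_a) <= len(half_b) else half_b
        target.update(group)
    return half_a, half_b


# Hypothetical submissions.
papers = {
    "p1": ["ann", "bob"],
    "p2": ["bob", "cy"],
    "p3": ["dee"],
    "p4": ["eve", "fay"],
    "p5": ["fay"],
}
half_a, half_b = split_into_halves(papers)
print(sorted(half_a), sorted(half_b))
```

Because whole clusters move together, the balance between halves depends on how large the biggest coauthor cluster is; if one cluster holds most of the submissions, an even split is not possible under this rule.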
Hello everyone. I am keeping my identity anonymous today to protect my professional career. I am a junior researcher in Computer Vision, and I am sharing this story because I have hit a devastating deadlock with IEEE T-PAMI and the IEEE Ethics Office. Our Situation: https://preview.redd.it/v0w62gzmn02h1.png?width=2000&format=png&auto=webp&s=a2d75a1e3a388debdf5b163cb9593c1f7f1c49d5 In the decision letter, we actually received three highly positive reviews (Two EXCELLENT, One GOOD). However, the AE rejected the paper by quoting comments from a "4th" reviewer. The most staggering part: We later accidentally met the actual 4th reviewer. He CONFIRMED having submitted a POSITIVE review, which was strangely withdrawn by the editor in the backend before the final decision was made. We have formally requested the IEEE (and Computer Society) to thoroughly investigate this issue, specifically asking them to check AE's backend activity logs in the submission system. However, half a year has passed, and we have received no direct response. Has anyone experienced something similar with IEEE or other top venues? Any advice or help bringing visibility to this would be greatly appreciated. Evidence: Below is the report to IEEE Ethics (identifying information has been covered): https://preview.redd.it/e41vt2rsn02h1.png?width=3508&format=png&auto=webp&s=b2ee2d3f092dad5e20b45b9daeea7fa7b6f01d20 https://preview.redd.it/t29n03rsn02h1.png?width=3508&format=png&auto=webp&s=67aa6bc36aed76617af34e7913a203f9236bc536 https://preview.redd.it/6v5ys2rsn02h1.png?width=3508&format=png&auto=webp&s=f2452998f57f1b157d71b569dd5ff87e4d3d0b6c https://preview.redd.it/epdxv2rsn02h1.png?width=3508&format=png&auto=webp&s=d01da8cdf9e3f6cd5be53f884b02b154f86d0b48 https://preview.redd.it/fuw3k3rsn02h1.png?width=3508&format=png&auto=webp&s=03e75f763a54429758102da4933af53511642e7d https://preview.redd.it/xn0ze3rsn02h1.png?width=3
Hey everyone, I'm an undergrad from India and I just found out I had two papers accepted at the ICML 2026 GlobalSouthML workshop! I am super excited since this is my first time getting accepted into a major conference venue, but I'm also kind of panicking right now because I absolutely cannot afford a trip to Seoul. Since I've never done this before, I'm hoping some experienced folks can help answer a few questions about how the post-acceptance process works:
• I saw that the main conference has a "Virtual Pass." Is that enough to keep my papers in the workshop program? ICML rules make it sound like someone must be there in person. If neither I nor my co-authors can afford the flight to South Korea, will our accepted papers just get withdrawn?
• Does ICML or the GlobalSouthML workshop specifically offer financial aid for undergrads? Should I email the organizers about this before I attempt to register? I saw some mentions of ICML Financial Aid online, but it looked like it might only cover hotels and registration, not the flights.
• How does submitting the final version actually work? Do the organizers email a specific form, or do I just upload a new PDF revision directly to my OpenReview portal?
• Also, since GlobalSouthML is a non-archival workshop, what exactly am I submitting, just the updated PDF addressing the reviewers' comments?
Any advice on how to navigate this would be hugely appreciated! Thank you!
submitted by /u/Material_Dinner_1924
Hi everyone, I'm starting a research project on financial time-series forecasting using LSTM and Transformer models for predicting S&P 500 market direction. Right now, I'm struggling with obtaining reliable long-term historical data. I tried Yahoo Finance, but downloads are inconsistent/failing for me, and most Kaggle datasets I found only contain around 5–10 years of data. I specifically need:
• Around 30 years of historical S&P 500 data
• Preferably daily OHLCV data
• A reliable and clean source suitable for ML research
• Ideally free or student-friendly
I also want to understand what researchers typically use in academic work for financial forecasting: Yahoo Finance? Alpha Vantage? WRDS/CRSP? Polygon? Kaggle? Something else?
Additionally: Is using only S&P 500 index data enough for a Master's level research project? Or should I include technical indicators, macroeconomic data, sentiment, or constituent stock data? Would appreciate guidance from people who've actually worked on financial ML projects. Thanks.
submitted by /u/stickPotatoe