Newswallah — AI & Tech News

arXiv cs.LG Research May 26, 2026

Synheart Capacity: A Theory-Driven Physiological Representation of Cognitive Capacity Dynamics from Wearable Signals

arXiv:2605.24416v1 Announce Type: new Abstract: Human cognitive performance is constrained by limited mental resources, yet continuous computational estimation of cognitive capacity dynamics remains an open challenge. We propose a theory-driven multimodal learning framework that models capacity-related cognitive state as a two-dimensional physiological representation defined by voluntary resource allocation (mental effort) and overload-related strain (stress). The proposed architecture combines dual-stream encoding of cardiac (IBI/HRV) and electrodermal (EDA) signals with late fusion and task-specific output heads that independently estimate probabilistic effort and stress states. Evaluation on the SWELL-KW dataset using strict leave-one-subject-out cross-validation demonstrates cross-individual generalization (stress: 70.0% balanced accuracy; effort: 72.2%), with significant gains from multimodal integration and theory-guided supervision. Rather than collapsing physiological dynamics into a single workload label, the proposed effort-stress state-space enables structured differentiation between distinct cognitive regimes, including productive engagement and overload-related strain. Predicted state trajectories exhibit significant demand-sensitive shifts under controlled workload manipulations, with effort and stress responding differentially across interruption and time-pressure conditions. These results suggest that physiologically grounded multidimensional state representations may provide a foundation for adaptive systems capable of continuous capacity-aware monitoring and human-centered interaction.

arXiv cs.LG Research May 26, 2026

A Unified Python Framework for Direct PPO-based Control of AHUs with Economizer Logic and CO2-Constrained Ventilation

arXiv:2605.24406v1 Announce Type: new Abstract: Optimizing HVAC (Heating, Ventilation and Air Conditioning) can enhance a building's energy efficiency while maintaining comfort for its occupants. Conventional control of HVAC functions is often difficult because the building envelope behaves nonlinearly and experiences stochastic load variations over time. This paper presents an approach to optimizing HVAC systems with Deep Reinforcement Learning (DRL), using the Proximal Policy Optimization (PPO) algorithm implemented in a custom Python performance environment. The DRL system uses a second-order resistor-capacitor thermal model and an integrated dynamic mass balance of CO2 to replicate the complex physics associated with buildings. One major innovation of this study is a "Hierarchical Flow Logic," which maintains indoor air quality (IAQ) by overriding agent actions that would cause CO2 to exceed 1000 ppm. In addition, an enthalpy-based economizer provides free cooling from the outdoor environment. The experimental data show that, compared to PID controllers tuned by a genetic algorithm (GA) or traditional On-Off controls, the PPO agent achieves better temperature stability and energy efficiency overall. The end-to-end pipeline offers a route toward robust, generalizable smart-building energy management on real hardware.
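
To make the CO2 override idea concrete, here is a minimal sketch assuming a single well-mixed zone. It is not the paper's implementation: the function names, zone volume, CO2 generation rate, outdoor concentration, airflow limit, and the "fall back to full ventilation" rule are all hypothetical choices made for illustration. Only the 1000 ppm threshold comes from the abstract.

```python
CO2_LIMIT_PPM = 1000.0  # threshold stated in the abstract


def co2_step(c_ppm, airflow_m3s, dt_s, volume_m3=300.0,
             gen_m3s=5e-5, outdoor_ppm=420.0):
    """One explicit-Euler step of a well-mixed single-zone CO2 balance.

    V * dC/dt = G * 1e6 + Q * (C_out - C), with C in ppm.
    All default parameter values are hypothetical.
    """
    dcdt = (gen_m3s * 1e6 + airflow_m3s * (outdoor_ppm - c_ppm)) / volume_m3
    return c_ppm + dt_s * dcdt


def override_airflow(c_ppm, agent_airflow, dt_s, max_airflow=1.0):
    """Keep the agent's airflow unless it would push CO2 past the limit."""
    if co2_step(c_ppm, agent_airflow, dt_s) <= CO2_LIMIT_PPM:
        return agent_airflow
    # Hypothetical fallback: command full ventilation for this step.
    return max_airflow


def simulate(steps, agent_airflow=0.0, dt_s=60.0, use_override=True):
    """Roll the zone forward with an agent that always requests one airflow."""
    c, trace = 420.0, []
    for _ in range(steps):
        q = agent_airflow
        if use_override:
            q = override_airflow(c, agent_airflow, dt_s)
        c = co2_step(c, q, dt_s)
        trace.append(c)
    return trace
```

With these assumed parameters, an agent that never ventilates drives CO2 well past the limit when `use_override=False`, while the override keeps every simulated value at or below 1000 ppm.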

arXiv cs.LG Research May 26, 2026

Generative OOD-regularized Model-based Policy Optimization

arXiv:2605.24405v1 Announce Type: new Abstract: We study sequential decision-making with offline reinforcement learning (RL). Traditional offline RL policies may result in out-of-distribution (OOD) actions when training relies only on sparse offline representations. To ensure safe offline policies in a sparse state-action space, we explore how density estimation models can be integrated into model-based RL methods to avoid the OOD regions. Generative models are capable of explicitly modeling the density in sparse state-action spaces. Building on this, we introduce Generative OOD-regularized Model-based Policy Optimization (GORMPO), a density-regularized offline RL algorithm that uses generative density modeling to restrict policy updates to high-density areas of the dataset. Furthermore, we examine whether better OOD detection corresponds to better model-based offline policies. We compare (1) the OOD detection capabilities of various density estimators and (2) their performance within the GORMPO framework on a real-world medical dataset and sparse offline RL datasets. We theoretically guarantee GORMPO's performance under mild assumptions. Empirically, GORMPO outperforms state-of-the-art baselines by 17% on a real-world medical dataset and enhances the base model on the offline RL datasets. Our empirical findings show that better OOD detection generally results in improved policies in environments with stable dynamics, while conservative penalties with poor density estimation are favored when dynamics are uncertain.

arXiv cs.LG Research May 26, 2026

AvAtar: Learning to Align via Active Optimal Transport

arXiv:2605.24395v1 Announce Type: new Abstract: Alignment plays a fundamental role in many machine learning problems, such as multi-network analysis, multimodal learning, and point cloud registration. Recent works increasingly leverage optimal transport (OT) for distributional alignment, whose effectiveness largely depends on sparse supervision that is hard or costly to obtain in practice. Existing works, however, largely overlook how to actively acquire high-quality supervision to improve their alignment performance under OT frameworks. In this paper, we propose a principled active alignment framework for optimal transport alignment called AvAtar. We quantify the informativeness of a candidate by measuring its gradient-based impact on the global alignment result, computed as the gradient propagation from the global alignment result to all possible supervisions of the candidate through the entropy-regularized OT formulation. While differentiating through OT is challenging given its constrained nature, we leverage the adjoint-state method to reformulate the computation to a linear system solvable by the conjugate gradient method with linear complexity and guaranteed convergence. By encoding the global alignment result via effective utility functions, AvAtar is applicable to general alignment problems under the OT framework. Extensive experiments on three representative alignment tasks demonstrate the effectiveness, scalability, and generalizability of the proposed AvAtar.
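
For readers unfamiliar with the entropy-regularized OT formulation mentioned above, the following is a minimal sketch of the standard Sinkhorn iteration that solves it. It shows only the forward solve, not AvAtar's gradient propagation or adjoint-state computation; the function name, `eps`, and the iteration count are illustrative assumptions.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, iters=500):
    """Entropy-regularized OT plan between histograms a and b.

    cost[i][j] is the cost of moving mass from source i to target j.
    Returns the plan P = diag(u) K diag(v) with K = exp(-cost / eps).
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two sources, two targets; moving mass "across" costs 1, staying costs 0.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.7, 0.3], [0.4, 0.6])
```

Each row of `plan` sums to the corresponding entry of `a` and each column to the entry of `b` (up to convergence tolerance); smaller `eps` makes the plan closer to the unregularized optimum.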

arXiv cs.LG Research May 26, 2026

Learning Laplacian Eigenspace with Mass-Aware Neural Operators on Point Clouds

arXiv:2605.24390v1 Announce Type: new Abstract: The eigendecomposition of the Laplace-Beltrami Operator (LBO) is fundamental to geometric analysis, yet computing its low-frequency eigenmodes remains a significant bottleneck due to the high cost of iterative solvers on large-scale data. To amortize this cost, we introduce the Neural Eigenspace Operator (NEO), a feed-forward framework designed to predict the spectrum directly from point clouds. Crucially, NEO circumvents the ill-posed nature of standard eigenvector regression, which suffers from intrinsic sign flips and rotation ambiguities, by learning the stable, invariant low-frequency subspace instead. Specifically, the network predicts a redundant set of basis functions whose span robustly covers the target eigenspace, allowing for the recovery of accurate eigenpairs via a lightweight Rayleigh-Ritz refinement. To handle irregular sampling, we propose a mass-aware neural operator that incorporates per-point area weights into attention-based aggregation, improving robustness to non-uniform densities and enabling zero-shot generalization across resolutions. Our approach achieves near-linear runtime scaling and substantial wall-clock speedups over iterative solvers at comparable accuracy, and exhibits strong zero-shot transfer to high-resolution point clouds. The resulting eigenpairs support standard spectral geometry tasks, while the raw basis functions provide effective point-wise features for downstream learning. Code: https://github.com/Adversarr/NEO.

arXiv cs.LG Research May 26, 2026

Assessing the Operational Viability of Foundation Models for Time Series Forecasting

arXiv:2605.24381v1 Announce Type: new Abstract: Time series forecasting drives operational decisions in areas like finance, transportation, and energy. While supervised learning approaches achieve strong performance, they require domain-specific training, feature engineering, and ongoing maintenance. Large-scale foundation models have recently emerged as a zero-shot alternative, avoiding task-specific training much like LLMs. In this work, we evaluate foundation models against standard supervised approaches. Rather than focusing solely on aggregate accuracy, we analyze performance across four operational regimes: periodic human-centric systems, physically constrained processes, stochastic financial markets, and heterogeneous demand forecasting. Our results characterize optimal deployment areas. Foundation models perform well in domains with transferable periodic structures and are efficient for cold-start or long-tail scenarios. Conversely, supervised specialists maintain higher precision in systems governed by strict physical constraints. In financial domains, newer foundation models are rapidly closing the performance gap with supervised specialists. We further quantify trade-offs in inference latency, data drift adaptability, and deployment constraints. Finally, we propose a Complexity Router that assigns each series to the optimal model class using empirical features. We demonstrate that this selective routing achieves higher accuracy and significantly lower inference costs compared to deploying a universal foundation model, providing a practical framework for balancing generalization and efficiency.

arXiv cs.LG Research May 26, 2026

GEESE: Genotype-aware End-to-End Spatio-temporal Embedding for Behavioral Phenotyping

arXiv:2605.24370v1 Announce Type: new Abstract: Behavioral phenotyping of genetic animal models currently requires labor-intensive manual feature engineering that limits reproducibility and scalability. We present GEESE, an end-to-end deep learning framework that learns behavioral representations directly from 3D pose dynamics without hand-crafted features. Using a pretrained time series foundation model, we encode movement sequences into a behavioral manifold that supports both behavior classification and genotype prediction. Evaluated across three autism-associated genetic models (CNTNAP2, CHD8, FMR1), our deep learning approach surpasses hand-crafted feature baselines in both tasks, revealing that learned representations capture genotype-specific behavioral signatures. The framework generalizes across genetic backgrounds, and an all-cohort model identifies both genetic background and genotype from movement patterns alone. We further provide HONK, an interactive intelligent tool enabling researchers without programming expertise to perform behavioral phenotyping from pose data through natural language interaction.

arXiv cs.LG Research May 26, 2026

Treatment Effect Estimation with Differentiated Networked Effect on Graph Data

arXiv:2605.24358v1 Announce Type: new Abstract: Estimating individual treatment effect (ITE) from observational graph data is crucial for decision-making in fields such as commerce and medicine. This task is challenging due to interference, where individual outcomes can be influenced by the treatments and covariates of their neighbors. Existing methods attempt to model such interference for accurate ITE estimation. However, a critical issue is often overlooked: differentiated networked effect (DNE), an effect caused by local networks consisting of neighbors with varying importance and scales. Capturing DNE is vital; otherwise, ITE estimates become imprecise because interference is characterized incorrectly, which can result in misguided decisions. To address this challenge, we propose a novel interference modeling mechanism that incorporates two partial attention mechanisms and a message amplifier. The partial attention mechanisms automatically estimate the importance of different neighbors in contributing to interference, while the message amplifier adjusts the results of the interference modeling mechanism based on the scale of neighbors; together, these enable the model to capture DNE. Experiments on three real-world graphs demonstrate that our methods outperform existing approaches for ITE estimation from graph data, which corroborates the importance of explicitly capturing DNE.

arXiv cs.LG Research May 26, 2026

Refined Analysis of Entropy-Regularized Actor-Critic

arXiv:2605.24357v1 Announce Type: new Abstract: In this paper, we study the role of the critic in actor-critic for entropy-regularized, finite, discounted environments. We establish that, when the critic is exact, using the latter as a baseline is a variance-reduction method in a strong sense. In this case, actor-critic with stochastic gradients matches the sample complexity of deterministic policy gradient, reaching an $\epsilon$-optimal regularized value with $\tilde{O}(\log(1/\epsilon))$ samples. In practice, the critic is learned alongside the actor: the variance of the actor update is then influenced by the critic's variance and bias. Specifically, when the critic has a sufficiently small error, the variance reduction and rapid convergence are preserved. This suggests learning the critic first and keeping it up to date after each actor update, underscoring the crucial role of accurate critic estimation in actor-critic methods.

arXiv cs.LG Research May 26, 2026

Evolving Robustness-Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs

arXiv:2605.24345v1 Announce Type: new Abstract: In online reinforcement learning, data scarcity creates epistemic uncertainty that makes robustness important early in learning, whereas sufficient exploration is needed to learn the true-environment optimal policy. We study this time-varying robustness-exploration trade-off through a quantile Bayesian risk-aware Markov decision process (BR-MDP), in which the quantile level controls how posterior uncertainty enters the Bellman backup. We characterize this control through an asymptotic normality result for the difference between the quantile BR-MDP value and the value in the true environment. The result implies that upper/lower-tail quantiles induce optimism/pessimism towards epistemic uncertainty, and the magnitude of the optimism/pessimism decreases as data accumulate. Building on this characterization, we propose an online Bayesian risk-aware algorithm with an adaptive quantile schedule that emphasizes robustness early and gradually encourages exploration of less-visited state-action pairs. We establish sublinear Bayesian regret bounds with respect to both the true optimal value and the optimal BR-MDP robust value. Numerical experiments demonstrate strong performance in both exploration-demanding and exploration-costly environments.

arXiv cs.LG Research May 26, 2026

ChainzRule: Sample-Efficient, Robust Deep Learning Across Tabular, NLP, and Vision Tasks

arXiv:2605.24340v1 Announce Type: new Abstract: Production deep learning systems across enterprise domains operate under constraints that academic benchmarks routinely obscure: labeled data is expensive, inference budgets are tight, and models that cannot explain their behavior are difficult to trust and maintain. We present ChainzRule (CR), a neural architecture replacing typical activations with learnable polynomial layers governed by Differential Regularization (DREG), a layer-wise Jacobian penalty computed analytically during the forward pass at standard inference cost. The core claim is that bounding intermediate derivatives forces the network toward low-frequency, structurally stable representations, simultaneously reducing dependence on labeled data volume, improving robustness to distribution shift, and providing a measurable, gradient-based handle on model behavior. Evaluated across five domains, CR achieves $85.71\% \pm 2.01\%$ on Pima Diabetes (statistically superior to SVM and XGBoost), $46.20\% \pm 0.37\%$ on SST-5 sentiment classification with a frozen encoder (superior to RNTN using approximately 5% of its training data), $55.79\%$ on SST-5 with a fine-tuned BERT backbone (versus BERT-base linear head at $54.9\%$), $70.17\%$ on Yelp Full ordinal regression with 3.2M parameters versus a 10-model average of $66.35\%$, and $+2.32\%$ mean corruption accuracy on CIFAR-10-C. All results with reported $p$-values fall below the $\alpha = 0.05$ threshold after Bonferroni correction. CR maintains a gradient tail ratio $\tau$ (p99/mean) of $1.01$--$1.02$ against $1.07$--$1.09$ for all typical activation function baselines across every data fraction, a structural invariant we propose as the mechanistic driver of sample efficiency and a deployment-time proxy for model reliability.
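
The tail ratio $\tau$ (p99/mean) quoted above can be illustrated with a short sketch. The abstract does not say which gradient quantity is measured or which percentile convention is used, so this example assumes absolute gradient values and a nearest-rank percentile; the function name `tail_ratio` is hypothetical.

```python
import math


def tail_ratio(grad_values, pct=99.0):
    """Ratio of the pct-th percentile to the mean of |gradient| values.

    Uses the nearest-rank percentile, which is an assumption here;
    the source does not state its convention.
    """
    vals = sorted(abs(g) for g in grad_values)
    if not vals:
        raise ValueError("need at least one gradient value")
    mean = sum(vals) / len(vals)
    if mean == 0.0:
        raise ValueError("mean gradient magnitude is zero")
    rank = max(1, math.ceil(pct * len(vals) / 100.0))
    return vals[rank - 1] / mean


flat = tail_ratio([2.0] * 100)                   # uniform magnitudes
heavy = tail_ratio([1.0] * 98 + [100.0] * 2)     # two large outliers
```

A ratio near 1 means the largest gradients are close to the average, while a heavy-tailed distribution such as the second example yields a ratio far above 1.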

arXiv cs.LG Research May 26, 2026

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

arXiv:2605.24331v1 Announce Type: new Abstract: Context or prompt-level reweighting has emerged as a central algorithmic lever in Reinforcement Learning with Verified Rewards (RLVR) for improving the reasoning capability of large language models, yet the principle determining what constitutes an optimal weighting remains poorly understood. We address this gap by formulating prompt reweighting as a functional derivative of a utility functional defined in the pass-rate function space, yielding a unified optimality framework that accommodates existing schemes, including REINFORCE and GRPO. Building on this optimality framework, we propose a distribution-aware prompt reweighting approach, called CurveRL, based on a quantile coordinate transform, in which the weight assigned to each prompt depends not on the absolute value of pass rates but on its rank and density to reflect the distributional structure of the pass rates in the learning dynamics. Extensive experiments across multiple benchmarks demonstrate that our proposed CurveRL consistently outperforms GRPO and other RLVR baselines. Our study identifies context-distribution control as a principled axis for analyzing and designing prompt-reweighted RLVR algorithms. The code is released in https://github.com/zhyzmath/CurveRL.

arXiv cs.LG Research May 26, 2026

Interdomain Attention: Beyond Token-Level Key-Value Memory

arXiv:2605.24330v1 Announce Type: new Abstract: Transformers and deep state space models (SSMs) sit at opposite ends of a basic design choice: attention routes each query through a growing key-value (KV) cache by content-based matching at quadratic cost, while deep SSMs compress context into a fixed-size recurrent state that is not directly addressed by query-key matching. We propose Interdomain Attention, which integrates an SSM into an attention module through kernel methods: an attention kernel is approximated by a finite feature map, the resulting key features and values are projected onto a shared set of basis functions maintained by a single SSM recurrence, and each query attends to the compressed coefficients through its own feature map, recovering query-conditioned attention over a fixed-size state. The scalable layer is a learned relaxation of this derivation, and we validate its components through ablations. In a 125M to 1.3B autoregressive language-modeling study on FineWeb-Edu at matched recurrent-state budget, Interdomain Attention improves on an SSM token mixer at every scale, surpasses a same-recipe softmax baseline at 1.3B on validation perplexity and on the eight-task commonsense suite, and inherits the length-flat behavior of its fixed-state core out to 3.5x the training context. Ablations indicate that the query-conditioned projection is the main source of the gain.

arXiv cs.LG Research May 26, 2026

Omissive Bias in Religious Representation: Benchmarking LLM Answers to Everyday Ethical Decision-making

arXiv:2605.24319v1 Announce Type: new Abstract: As large language models become a default source of guidance on personal, moral, and existential questions, it matters whether they draw on the religious frameworks that have historically shaped such reasoning, or systematically omit them. In this paper, we ask a deliberately narrow question: when posed an everyday ethical question for which religious perspectives may be valuable, do LLMs invoke religion at all? In contrast to benchmarks that look for the presence of political leanings or social bias, we look for the absence of religious representation as a dimension of value alignment and bias in LLMs. We term this "omissive bias." To measure omissive bias, we contribute the AllFaith Religious Representation Benchmark: 150 ethically and personally salient questions, sourced from in-the-wild chat transcripts and faith-community contributors, paired with an LLM-as-judge rubric that gives full credit for any mention of a religion, a religious practice, or a religious leader. The questions are not themselves about religion; they are open-ended questions about grief, forgiveness, relationships, purpose, and honesty, where religion is one valuable perspective among several. We also run a human-subjects survey to compare LLM behavior against human expectations. Evaluating 27 models, we find that LLMs consistently underrepresent religion relative to human expectations. The omission is asymmetric: models invoke religion more readily for abstract existential questions (meaning, death, truth) than for the practical personal situations (grief, marriage, family conflict, addiction) where many people most rely on it. It is not our purpose to adjudicate which values LLMs should hold. We argue, more modestly, that current LLM responses overlook critical opportunities to reflect religious frameworks that many people draw on when navigating personal and ethical challenges.

arXiv cs.LG Research May 26, 2026

From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

arXiv:2605.24316v1 Announce Type: new Abstract: Scaling laws provide compact descriptions of how prediction error varies with compute, model size, and data, but existing theory mainly treats single-sample SGD or full data reuse, leaving the role of mini-batching unclear. We study batch scaling laws for sketched linear regression under a power-law covariance spectrum and a source condition on the target parameter. We analyze one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. Our first result is a risk decomposition: all three procedures share the same irreducible and approximation terms, while their stochastic terms depend on the sampling protocol. One-pass batch SGD splits into bias and variance, whereas the two multi-pass methods split into GD bias, GD variance, and a fluctuation term around a common GD reference trajectory. We then prove source-condition scaling laws for one-pass and multi-pass mini-batch methods. For one-pass batch SGD, mini-batching preserves the approximation and optimization-bias exponents, while the variance scales as $O(\min(M,(T_{\mathrm{eff}}\gamma)^{1/a})/(B T_{\mathrm{eff}}))$. Thus the usual $1/B$ covariance reduction holds at fixed update count $T$, but in the one-pass regime $T=N/B$ it is partly offset by the shorter optimization horizon. For multi-pass batch SGD, with- and without-replacement sampling have identical approximation and GD bias/variance terms; they differ only in the fluctuation covariance prefactor, which is $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement. Hence without-replacement sampling is less noisy for $B>1$, and when $B=N$ the fluctuation vanishes, recovering deterministic gradient descent. These results place batch size on the same theoretical footing as compute, data, and model dimension in sketched linear regression.
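
The two fluctuation prefactors quoted in the abstract, $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without, are easy to check numerically. The sketch below only evaluates those two formulas with exact rational arithmetic; the function name and the example values of $N$ and $B$ are illustrative assumptions.

```python
from fractions import Fraction


def fluctuation_prefactor(N, B, with_replacement):
    """Fluctuation covariance prefactor for multi-pass batch SGD.

    with replacement:    1 / B
    without replacement: (N - B) / (B * (N - 1))
    """
    if N < 2 or not 1 <= B <= N:
        raise ValueError("require N >= 2 and 1 <= B <= N")
    if with_replacement:
        return Fraction(1, B)
    return Fraction(N - B, B * (N - 1))


N = 1000
for B in (1, 10, 100, N):
    w = fluctuation_prefactor(N, B, with_replacement=True)
    wo = fluctuation_prefactor(N, B, with_replacement=False)
    print(B, w, wo, wo / w)  # the ratio equals (N - B) / (N - 1)
```

The printed ratio makes the abstract's two remarks visible: the prefactors coincide at $B=1$, the without-replacement value is strictly smaller for $B>1$, and it reaches zero at $B=N$, where each update uses the full dataset.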

arXiv cs.LG Research May 26, 2026

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

arXiv:2605.24305v1 Announce Type: new Abstract: Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inability to reason about parameter-dependent dynamics. We present ChaosBench-Logic v2, a 40,886-question benchmark over 165 dynamical systems with 27 FOL predicates and 78 axiom edges, together with CARE (Calibration- and Adversarial-Robust Evaluation), a protocol that surfaces these pathologies. Evaluating 14 models, we find that regime-transition reasoning remains near random (MCC = 0.05) even for frontier models, whereas FOL deduction with given premises reaches MCC = 0.52. Per-family decomposition shows that the proprietary-model advantage concentrates on cross-indicator (+0.40) and consistency tasks, while open-source Qwen 2.5-32B dominates indicator diagnostics (0.91 vs. 0.45). Two models exhibit negative MCC on bifurcation questions, confirmed as systematic anti-correlation via confusion-matrix analysis.
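
Since the results above are reported as MCC (Matthews correlation coefficient) rather than accuracy, a small sketch of the standard binary MCC formula may help. Returning 0 when the denominator vanishes is a common convention assumed here, not something the abstract specifies.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0.0:
        # Convention (assumed): undefined cases, such as a model that
        # always predicts one class, are scored as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=50, fp=0, tn=50, fn=0)      # every answer right
inverted = mcc(tp=0, fp=50, tn=0, fn=50)     # every answer wrong
always_yes = mcc(tp=80, fp=20, tn=0, fn=0)   # 80% accuracy, no skill
```

The last case shows why accuracy can hide prior collapse: a model that answers "true" to everything gets 80% accuracy on an 80/20 split but an MCC of 0, and systematically inverted answers give a negative MCC.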

arXiv cs.LG Research May 26, 2026

LLMs Show No Signs Of Individuated Metacognition

arXiv:2605.24299v1 Announce Type: new Abstract: Confidence-weighted routing, selective abstention, and ensemble weighting all assume that a model's stated confidence is informative about its capability on the question being asked. They presume functional metacognition, the capacity to assess one's own capabilities, without exercising them. Aggregate calibration is well studied, with mixed results, but the underlying structure of elicited confidence is less well understood. We decompose binary confidence judgements from 20 frontier Large Language Models (LLMs) across six benchmarks using tetrachoric factor analysis paired with pairwise calibration, asking whether two models that differ in confidence also differ in performance. On factual recall and information retrieval benchmarks the cross-model confidence matrix is approximately rank-one and a single dominant factor captures most of the latent variance. Models retrieving facts share an item-level difficulty axis and differ mainly in their decision thresholds along it. Across all benchmarks the relationship between confidence and performance collapses once items that all models agree on are removed. Inter-model pairwise calibration is small even where statistically significant, and what remains shrinks to nothing once base-rate differences along the shared factor are controlled for. Mathematical reasoning is the apparent exception, but this turns out to be a confound where reasoning models answer questions about their confidence by trying to solve them in their chain of thought, bypassing the sub-symbolic self-knowledge we seek to measure. We find no evidence for significant verbalised individuated metacognition in any tested domain.

arXiv cs.LG Research May 26, 2026

Private Adaptive Covariance Estimation via Gaussian Graphical Models

arXiv:2605.24295v1 Announce Type: new Abstract: We propose PACE-GGM, a data-adaptive differentially private method for covariance estimation that concentrates its privacy budget on the most informative entries of the empirical covariance matrix, rather than perturbing all entries. This applies in the natural setting where the modeler supplies separate bounds for each variable, so that individual entries can be measured with less noise than the full matrix. In each round, our method selects a poorly approximated entry, measures it using the Gaussian mechanism, and then reconstructs a full covariance matrix using a maximum-entropy reconstruction objective, leading to a Gaussian graphical model structure. Experiments on diverse real-world datasets demonstrate consistent improvements in estimation error with respect to the Gaussian mechanism and other baselines, particularly in high-dimensional and low-to-moderate privacy regimes.
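
The measurement step described above relies on the Gaussian mechanism. As a rough illustration, the sketch below uses the classic calibration $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\varepsilon$ (valid for $0<\varepsilon<1$) to add noise to a single value. The sensitivity $\Delta$ is left as an input because the abstract does not say how PACE-GGM derives it from the per-variable bounds, and the function names are hypothetical.

```python
import math
import random


def gaussian_sigma(sensitivity, eps, delta):
    """Noise scale of the classic (eps, delta) Gaussian mechanism."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("classic calibration needs 0 < eps < 1, 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps


def measure_entry(value, sensitivity, eps, delta, rng=random):
    """Return a noisy measurement of one entry (sensitivity supplied by caller)."""
    return value + rng.gauss(0.0, gaussian_sigma(sensitivity, eps, delta))


# Hypothetical numbers: an entry with a tenth of the sensitivity
# receives a tenth of the noise at the same privacy parameters.
wide = gaussian_sigma(sensitivity=1.0, eps=0.5, delta=1e-5)
narrow = gaussian_sigma(sensitivity=0.1, eps=0.5, delta=1e-5)
noisy = measure_entry(0.3, 0.1, 0.5, 1e-5, rng=random.Random(0))
```

Because the noise scale is proportional to the sensitivity, measuring a tightly bounded entry on its own costs less accuracy than perturbing a quantity with a larger bound, which is the intuition the abstract appeals to.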

arXiv cs.LG Research May 26, 2026

TUBE: Tangent Upper Bound on Evidence for Discrete Diffusion Language Models

arXiv:2605.24292v1 Announce Type: new Abstract: Log-likelihood is a standard metric for evaluating generative models. Unfortunately, in contrast to autoregressive models (ARMs), discrete diffusion models generally do not admit exact computation of this quantity. Existing evaluations, therefore, rely on the evidence lower bound (ELBO), leaving unclear how much higher the true value may be. We address this by introducing the Tangent Upper Bound on Evidence (TUBE), a variational upper bound on log-likelihood that admits an unbiased Monte Carlo estimator. Our TUBE extends across latent-variable models, including masked diffusion models (MDMs), any-order ARMs (AO-ARMs), and block variants of both. Applied to block MDMs and block AO-ARMs, TUBE reveals our key empirical finding that these models lie strictly below the exact ARM baseline, showing that ARMs still dominate in likelihood.

📰

arXiv cs.LG Research May 26, 2026

Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning

arXiv:2605.24286v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning is useful for monitoring language models only when the reasoning trace faithfully reflects the computation that produces the final answer. However, models can rely on prompt-to-answer shortcuts that bypass the CoT, making the visible reasoning trace misleading even when it appears plausible. We study CoT faithfulness through a structural information-flow perspective: faithful reasoning should route answer-relevant information through the mediated path from prompt to CoT to answer, rather than through a direct prompt-to-answer shortcut. This perspective yields a task-agnostic framework based on three complementary properties, sufficiency, completeness, and necessity, which we instantiate with entropy-based, masked-KL, and gradient-based diagnostics. We show that these metrics recover externally judged faithfulness differences in hinted reasoning, and identify a low-entropy failure mode of KL-based diagnostics where gradient-based measures remain more stable. Building on this analysis, we introduce update-time interventions for verifier-based on-policy RL, including attention masking, backward-only gradient masking, CoT gradients, and adversarial perturbations of prompt representations. Across hinted arithmetic, reward-hackable code repair, and DAPO-Math models trained without hints but evaluated under wrong-hint injection, our interventions shift behavioral and structural indicators toward stronger CoT mediation. In particular, they make shortcut and reward-hacking behavior more transparent in the CoT and improve task-agnostic faithfulness metrics, while in some settings also reducing wrong-hint susceptibility. Our results suggest that controlling information flow during training is a practical route toward more faithful and monitorable CoT reasoning. Code is available at https://github.com/safety-research/faithful-cot.

📰

arXiv cs.LG Research May 26, 2026

Fourier Feature Pyramids for Physics-Informed Neural Networks

arXiv:2605.24278v1 Announce Type: new Abstract: We present an improved neural field architecture for solving partial differential equations (PDEs). Current physics-informed neural networks (PINNs) provide a flexible framework for solving PDEs, but they struggle to achieve highly accurate solutions and require computation that scales poorly with parameter count. Our model, which we call beignet (Bandlimited Embedding with Interpolated Grid Network), replaces the random Fourier feature embedding used by existing PINN models with a trainable multi-resolution Fourier feature pyramid. To query beignet at a continuous coordinate, we use Fourier interpolation at each level of the pyramid to return features at the input coordinate, and then decode this vector with a fully-connected neural network trunk. Our model provides multiple benefits: 1) Spatial derivatives can be computed efficiently by using the chain rule to compose derivatives of the neural network computed with automatic differentiation with derivatives of the feature grid computed spectrally by the Fast Fourier transform (FFT). 2) beignet can achieve higher accuracy in a compute-efficient manner by scaling the parameter count of this Fourier feature pyramid, instead of the less-efficient strategy of scaling the neural network architecture. 3) beignet can directly control the representation bandlimit, resulting in more stable optimization for difficult PDEs. We demonstrate that beignet finds significantly more accurate solutions on PDE benchmarks using fewer parameters than state-of-the-art PINN methods. We further evaluate beignet on the self-similar inviscid Burgers blowup problem and show that it can minimize residuals to near machine precision using Adam, an accuracy regime previously attained only by using computationally expensive higher-order optimizers.
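
The spectral differentiation mentioned in benefit 1 can be illustrated on a single one-dimensional periodic grid. This is a sketch under stated assumptions, not beignet's code: a naive O(n^2) DFT stands in for the FFT so the example needs only the standard library, and the function names are hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    # Naive discrete Fourier transform; an FFT would compute the same result.
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def spectral_derivative(samples, period):
    # Differentiate a periodic signal by multiplying each Fourier
    # coefficient by i * (angular frequency), then transforming back.
    n = len(samples)
    coeffs = dft(samples)
    scaled = []
    for k, c in enumerate(coeffs):
        freq = k if k < n / 2 else k - n  # signed integer wavenumber
        if n % 2 == 0 and k == n // 2:
            freq = 0  # drop the ambiguous Nyquist mode
        scaled.append(c * 2j * math.pi * freq / period)
    return [z.real for z in dft(scaled, inverse=True)]


if __name__ == "__main__":
    n = 16
    xs = [2 * math.pi * t / n for t in range(n)]
    derivative = spectral_derivative([math.sin(x) for x in xs], 2 * math.pi)
    print(max(abs(d - math.cos(x)) for d, x in zip(derivative, xs)))
```

For a bandlimited signal such as this one, the derivative is recovered to floating-point accuracy rather than to a finite-difference truncation error.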

📰

arXiv cs.LG Research May 26, 2026

A lift for input-convex neural network training

arXiv:2605.24274v1 Announce Type: new Abstract: Input-convex neural networks (ICNNs) are widely used for log-concave density estimation, convex-potential normalizing flows, optimal transport, and transport-map inversion for high-dimensional Bayesian posteriors. These tasks share a structural constraint: the inter-layer weights of the ICNN must remain non-negative. The standard recipe, projected gradient descent (PGD) onto the non-negative cone, applies a hard, non-smooth projection -- the stiff-penalty limit of an ADMM-style constraint splitting -- and its classical convergence guarantees do not transfer to the non-smooth ICNN training landscape; the differentiable alternative, softplus reparametrization, attenuates the gradient exponentially in the weight magnitude, stalling training with dead inter-layer weights and plateaued loss. Inspired by parameter-extension lifts of PDE-constrained inverse problems, we propose the lift: instead of constraining the inter-layer weights directly, we train an unconstrained hypernetwork that emits them from a permutation-invariant summary of the input batch. This adds stochasticity to the training dynamics that softens the loss landscape, letting the iterates escape the gradient-attenuated region where direct softplus stalls. We trace this softening to three structural ingredients -- a learnable bias acting as slack, a hypernetwork body that conditions on the target batch, and a cross-covariance coupling the two through batch stochasticity -- and prove each one necessary: deleting any single ingredient collapses the cross-covariance that carries the softening. On log-concave energy-based modeling from one-dimensional toy targets to image-flavored latents, and convex-potential normalizing flows on a 21-dimensional tabular benchmark, we show that the lift reaches a lower test loss than both PGD and direct softplus, and turns a plateau-bounded training trajectory into a valley-descending one.
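
The two baselines the abstract contrasts can be sketched for a single scalar weight. This is an illustration of the standard definitions only, under the assumption that "softplus reparametrization" means weight = softplus(theta) for an unconstrained raw parameter theta; the proposed lift itself is not shown, and the function names are hypothetical.

```python
import math


def softplus(theta):
    # Numerically stable log(1 + exp(theta)); always positive.
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))


def softplus_grad(theta):
    # d/dtheta softplus(theta) = sigmoid(theta). Any loss gradient with
    # respect to the weight is multiplied by this factor on its way to theta.
    if theta >= 0:
        return 1.0 / (1.0 + math.exp(-theta))
    e = math.exp(theta)
    return e / (1.0 + e)


def project_nonnegative(weight):
    # Hard projection onto the non-negative cone, as used after a PGD step.
    return max(weight, 0.0)


if __name__ == "__main__":
    for theta in (0.0, -5.0, -10.0, -20.0):
        print(theta, softplus(theta), softplus_grad(theta))
    print(project_nonnegative(-0.3), project_nonnegative(0.7))
```

As theta becomes more negative, the chain-rule factor shrinks roughly like exp(theta), which is one way to read the stalling behavior the abstract attributes to direct softplus training.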

📰

arXiv cs.LG Research May 26, 2026

Optimizing Digital Therapeutic Interventions: Online Learning under Endogenous Adherence

arXiv:2605.24261v1 Announce Type: new Abstract: A critical challenge facing clinicians managing chronic disease interventions is sustaining long-run patient health given limited information and resources. Digital therapeutics (DTs) provide a cost-effective way to manage interventions at scale through repeated interactions (e.g. daily treatment recommendations), but patient success is highly dependent on their adherence. Behavioral psychology suggests that both treatment recommendations and past adherence affect future adherence, yet existing decision support frameworks for DTs model only recommendation effects or treat adherence as exogenous context, leaving a key gap in model and algorithm development. To address this gap, we present a DT decision support framework that captures both recommendation and adherence effects, allowing clinicians to better plan treatment recommendations. We model a patient's time-varying capacity for engagement with treatment using a linear dynamical system (LDS) that captures both recommendation and adherence effects, endogenously connected to adherence behavior with a logit link. We establish finite-time identification guarantees for this model, extending LDS results to our setting. Next, we propose an optimism-based algorithm, UCB-BOLD, for online treatment selection and prove that it achieves sublinear regret. We evaluate UCB-BOLD against benchmarks via ablation studies on a synthetic patient cohort generated using micro-randomized trial data. DT decision support tools can include dynamical models to enable decision makers to efficiently use the data in DT settings to improve patient health through effective resource allocation. While myopic or heuristic approaches suffice for some patient types, the benefits of explicitly planning around recommendation and adherence effects are significant for others; UCB-BOLD achieves 2-3x lower conditional value-at-risk regret than the next-best benchmark.

📰

arXiv cs.LG Research May 26, 2026

Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions

arXiv:2605.24251v1 Announce Type: new Abstract: Continual anomaly detection (CAD) addresses the need for industrial inspection systems to adapt to evolving production conditions, yet existing methods share three critical gaps: unrealistic evaluation, no systematic comparison, and no consideration of edge deployment constraints. We introduce a unified benchmark combining discrete-task evaluation on structural and logical anomalies, a novel continuous drift protocol, the first head-to-head comparison of all published CAD methods, and computational efficiency profiling on edge hardware. Our results reveal that existing CAD methods do not consistently outperform traditional approaches with simple experience replay. Thus motivated, we propose DINOSaur, a training-free method combining a frozen DINOv3 backbone with spatially-indexed coreset memory and neighborhood-restricted anomaly scoring. DINOSaur achieves zero forgetting by construction, outperforms all evaluated methods across all five protocols, and runs at sub-100 ms inference on an NVIDIA Jetson Orin Nano, with on-device adaptation to new tasks in under 30 seconds.

📰

arXiv cs.LG Research May 26, 2026

PrivFusion: A Privacy-preserving Multi-Agent Framework for Harmonizing Distributed Datasets

arXiv:2605.24249v1 Announce Type: new Abstract: The growing availability of clinical data has increased the use of machine learning, yet centralized data aggregation is often infeasible for sensitive health information. Federated Learning (FL) offers a distributed alternative, but its adoption is limited by substantial heterogeneity across institutional datasets, making harmonization a critical but frequently overlooked prerequisite for multi-site analytics. We introduce PrivFusion, a privacy-preserving multi-agent framework that automates the harmonization of structured datasets prior to federated training. PrivFusion uses agents to analyze local data, cluster semantically similar features across sites, and provide iterative transformation recommendations until alignment is achieved. Evaluation across four heterogeneous COVID-19 datasets demonstrates that PrivFusion effectively and efficiently harmonizes multi-site data while substantially reducing manual effort.

📰

arXiv cs.LG Research May 26, 2026

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

arXiv:2605.24216v1 Announce Type: new Abstract: Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.

📰

arXiv cs.LG Research May 26, 2026

Characterizing the Representational Capacity of Neural Processes

arXiv:2605.24210v1 Announce Type: new Abstract: What functions can Neural Processes represent? We analyze the representational capacity of popular NP architectures: Conditional Neural Processes (CNPs), Attentive Neural Processes (ANPs), Transformer Neural Processes (TNPs), and their latent variants. We prove these architectures form a strict hierarchy. CNP-representable functions are exactly those depending on finitely many expected features of the context distribution. ANPs strictly generalize CNPs via query-dependent reweighting, enabling kernel smoothers. ConvCNPs and ANPs are incomparable; each contains functions outside the other, separated by stationarity versus translation equivariance. TNPs with $L$ self-attention layers capture $L$-hop context interactions. For latent NPs, we show finite-dimensional latents provide coherent sampling but do not circumvent encoder limitations; matching GP posterior distributions requires latent dimension scaling with context size. These results provide a theoretical foundation for architecture selection based on task structure.

📰

arXiv cs.LG Research May 26, 2026

Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization

arXiv:2605.24192v1 Announce Type: new Abstract: The neural-network denoising functions which form the backbone of image diffusion models are remarkably consistent in their generalization behaviour across a wide variety of network architectures and training procedure hyperparameters. A recent line of research has sought to model the outputs of these networks by aggregating posterior weighted averages of training dataset patches. In this work, we consolidate these approaches into a unified model class which we call Filtered Posterior Mean Collections (FPMCs). We define this model class using query precision vectors, response weights, and source distributions, and illustrate that existing methods are recoverable with specific choices of these design axes. Investigating each axis in turn, we find that FPMC performance can be improved with soft relaxations of prior patch-based methods, and through augmentations of source distributions. Applying these findings to an existing FPMC, we demonstrate consistent sample improvement across three natural image datasets.

📰

arXiv cs.LG Research May 26, 2026

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

arXiv:2605.24171v1 Announce Type: new Abstract: Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized. We present PromptAudit, a controlled evaluation framework that isolates prompt effects by fixing the dataset, decoding, and parsing while varying only the prompting strategy. Using five prompting strategies across five open-weight models on 1,000 CVEs (6,074 code samples spanning 16 programming languages), we evaluate accuracy, recall, abstention, coverage, and effective F1. We find that standard chain-of-thought prompting achieves the strongest overall operational performance, while few-shot prompting provides model-dependent benefits that are most pronounced for prompt-sensitive models. In contrast, adaptive chain-of-thought frequently suppresses recall and self-consistency induces excessive abstention, sharply reducing effective performance. These results show that vulnerability detection behavior is jointly determined by the model and the prompt, and that prompt sensitivity is a first-class system property that must be explicitly characterized in evaluation and deployment.

📰

arXiv cs.LG Research May 26, 2026

Knowledge Graph Modulated Deep Learning for Limited-Sample Clinical Data Analysis

arXiv:2605.24162v1 Announce Type: new Abstract: Biological systems are governed by structured molecular interactions, where pathways, regulatory circuits, and functional gene relationships shape cellular behavior and disease progression. Much of this knowledge is naturally represented as graphs. However, most biomedical AI models cannot directly use graph-encoded biological knowledge and instead require compressed low-dimensional representations, which can lose important structure and reduce performance, especially in limited-sample clinical studies. Here, we introduce Graph-in-Graph (GiG), a knowledge graph-modulated deep learning framework for data-efficient clinical prediction. GiG represents each patient as a standalone modular graph, in which curated biological knowledge graphs define edges and patient-specific measurements, such as gene expression, define node features. This design allows multiple biological knowledge graphs to be integrated while preserving gene-gene interactions and pathway topology during patient-level representation learning. Across cohorts comprising nearly 9,700 patients and five clinical tasks, including liquid biopsy cancer detection, prostate cancer diagnosis, and 32-class pan-cancer classification, GiG consistently outperforms traditional and state-of-the-art methods, with the largest gains in limited-sample settings. On the challenging prostate cancer diagnosis task, GiG improves macro-F1 by up to 49 percentage points relative to competing methods. Control experiments replacing real pathway graphs with random topologies confirm that these gains arise from biologically grounded knowledge graph structure rather than graph modeling alone. These findings show that knowledge graph-modulated deep learning can improve robustness, interpretability, and sample efficiency in clinical data analysis, and provide a principled framework for integrating biological knowledge graphs into predictive modeling.

📰

arXiv cs.LG Research May 26, 2026

Riemannian Archetypal Analysis: Interpretable non-linear data analysis on deformed star distributions

arXiv:2605.24113v1 Announce Type: new Abstract: Classical archetypal analysis is appealing for its interpretability, but its linear geometry can limit performance on data with strongly non-linear structure; at the same time, existing neural extensions improve flexibility while often weakening the geometric meaning of archetypes and interpolations. In this work, we develop a Riemannian version of archetypal analysis based on data-driven pullback geometry for real-valued data, with the goal of combining the interpretability of classical archetypal analysis with the expressive power of modern non-linear models. We introduce a class of deformed star distributions together with associated pullback Riemannian geometry to provide a statistical interpretation of the resulting manifold mappings, define the Riemannian archetypal mapping (RAM) as a projection onto the manifold of geodesically convex combinations of archetypes, and propose a practical optimization scheme based on convex relaxation followed by non-convex refinement. We further propose a learning scheme that yields reasonable, albeit generally suboptimal, deformed star distributions from data. Experiments on synthetic examples and MNIST show that the resulting framework produces meaningful geodesics, useful denoising projections, and geometry-aware classifications, while also clarifying where current optimization limitations remain.

📰

arXiv cs.LG Research May 26, 2026

Overcoming "Physics Shock" in Earth Observation: A Heteroscedastic Uncertainty Framework for PINN-based Flood Inference

arXiv:2605.24106v1 Announce Type: new Abstract: Rapid and accurate flood extent mapping from Remote Sensing data, such as Synthetic Aperture Radar (SAR), is critical for operational disaster response, but standard Deep Learning models often produce physically impossible predictions due to a lack of hydrological constraints. While Physics-Informed Neural Networks (PINNs) attempt to address this by embedding governing laws directly into the loss function, their application to real-world remote sensing data frequently fails. Enforcing rigid spatial derivatives (e.g., the 2D Shallow Water Equations) onto unconditioned latent spaces attempting to fit noisy SAR speckle causes catastrophic gradient divergence, a phenomenon we term Physics Shock. In this paper, we propose a novel Uncertainty-Aware PINN framework tailored specifically for applied Earth Observation that addresses this instability. By integrating a dynamic Warm-Start protocol and modeling heteroscedastic aleatoric uncertainty via a negative log-likelihood objective, the network learns to dynamically relax physical constraints in regions of high sensor noise while strictly enforcing them in high-confidence areas. Evaluated on the Sen1Floods11 dataset, our probabilistic Attention-Gated FNO-UNet successfully stabilizes multi-objective optimization, achieving a +25% relative improvement in Intersection over Union (IoU) compared to deterministic baselines. Furthermore, through Deep Ensembles, we successfully disentangle intrinsic sensor noise from out-of-distribution terrain ignorance, providing operational agencies with highly calibrated, physically consistent confidence bounds for robust disaster mitigation and real-time decision-making.
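
One common way to write a heteroscedastic negative log-likelihood is the Gaussian form below, in which the network predicts a log-variance alongside each mean. This is a generic sketch and an assumption about the objective's shape; the abstract does not state the exact likelihood or how the weighting is coupled to the physics residual, and the function names are hypothetical.

```python
import math


def gaussian_nll(target, mean, log_var):
    # Per-element Gaussian negative log-likelihood, constant term dropped.
    # Predicting log-variance keeps the variance positive without constraints.
    return 0.5 * (log_var + (target - mean) ** 2 / math.exp(log_var))


def residual_weight(log_var):
    # Factor that scales the squared residual: high predicted noise
    # gives a small weight, low predicted noise gives a large one.
    return math.exp(-log_var)


if __name__ == "__main__":
    for log_var in (-2.0, 0.0, 2.0):
        print(log_var, residual_weight(log_var), gaussian_nll(2.0, 0.0, log_var))
```

The log_var term penalizes claiming high noise everywhere, so the model can only discount a residual by paying a cost for the uncertainty it declares.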

📰

arXiv cs.LG Research May 26, 2026

Verified SHAP: Provable Bounds for Exact Shapley Values of Neural Networks

arXiv:2605.24084v1 Announce Type: new Abstract: Shapley additive explanations (SHAP) are widely recognised as computationally intractable for neural networks, since they induce an exponential search space over the input features. In this work, we take a first step towards scaling exact SHAP computation to larger search spaces by introducing an algorithm that leverages recent advances in neural network verification to compute arbitrarily tight exact lower and upper bounds on SHAP values for neural networks, ultimately recovering the exact SHAP values. We demonstrate that our approach scales to orders of magnitude larger search spaces than state-of-the-art exact methods. This provides an important first step towards exact SHAP computation and establishes a principled cornerstone for evaluating statistical approximation methods on larger search spaces.
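
The exponential search space mentioned above is visible in a brute-force computation of exact Shapley values. This sketch is not the paper's verification-based algorithm; it assumes a baseline value function, where features outside a coalition are replaced by fixed baseline values, and the function name is hypothetical.

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, baseline):
    # Enumerates every coalition, so the cost grows as 2^n model calls
    # per feature. Feasible only for a handful of features.
    n = len(x)

    def value(coalition):
        # Features in the coalition keep their real value, others use baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


if __name__ == "__main__":
    model = lambda z: 2.0 * z[0] + 3.0 * z[1] + z[0] * z[2]
    x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
    phi = exact_shapley(model, x, baseline)
    print(phi, sum(phi), model(x) - model(baseline))
```

The attributions sum to the difference between the model output at x and at the baseline, which is a quick sanity check for any approximation method.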

📰

arXiv cs.LG Research May 26, 2026

Not All Transitions Matter: Evidence from PPO

arXiv:2605.24071v1 Announce Type: new Abstract: Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidden problem. Each state in a rollout is the direct output of the previous one, causally chained together by the agent's own actions. Because of this, consecutive transitions are never truly independent. They carry overlapping information, and the gradient signal the network receives ends up far more repetitive than the batch size suggests. The same directions get reinforced over and over, the value network struggles to keep up as the policy shifts, and training becomes quietly unstable in ways that reward curves alone rarely reveal. This paper asks whether that redundancy can simply be removed. We show that randomly dropping a fixed fraction of transitions from the rollout, at the right stage so the reward signal stays intact, is enough to break the repetitive gradient structure and stabilize training. The change is minimal: one sampling step, no new components, no modification to the core algorithm, and it works with any PPO implementation. Across five environments of increasing difficulty, CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5, the method matches vanilla PPO on reward while producing more consistent training dynamics across KL divergence, policy entropy, and value estimates. Dropping 25% of transitions turns out to be the sweet spot: enough to disrupt the redundancy, not enough to thin the batch.
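
A minimal sketch of the sampling step, assuming that "the right stage" means after advantages have been computed over the full rollout, so that every reward still contributes to the targets. That reading is an assumption, as are the function names; episode-termination handling is omitted for brevity.

```python
import random


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Generalized advantage estimation over one uninterrupted rollout segment.
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def drop_transitions(num_transitions, drop_fraction, rng):
    # Indices of the transitions kept for the policy and value update.
    keep = num_transitions - int(round(drop_fraction * num_transitions))
    return sorted(rng.sample(range(num_transitions), keep))


if __name__ == "__main__":
    rng = random.Random(0)
    rewards = [1.0] * 8
    values = [0.5] * 8
    advantages = gae_advantages(rewards, values, last_value=0.5)
    kept = drop_transitions(len(rewards), 0.25, rng)  # 25% dropped
    batch = [(t, advantages[t]) for t in kept]
    print(len(batch), batch)
```

Because the advantages are computed before sampling, each kept transition carries the same target it would have had in the full batch; only the number of gradient contributions changes.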

📰

arXiv cs.LG Research May 26, 2026

Generative Representation Learning on Hyper-relational Knowledge Graphs via Masked Discrete Diffusion

arXiv:2605.24064v1 Announce Type: new Abstract: Hyper-relational knowledge graphs (HKGs) effectively represent complex facts. While inferring new knowledge in HKGs is a critical problem, current methods cast it as a simple link prediction, assuming that nearly all entities and relations within a fact are known, leaving only a single blank to be filled. However, this restricted assumption may not hold in real-world scenarios in which multiple, or even all, constituent components of a fact may be missing simultaneously. To bridge this gap, we introduce a task called fact generation: generating a valid hyper-relational fact from an arbitrarily masked query, i.e., completing a partially observed fact or generating a fact from scratch. We propose KREPE, the first generative representation learning method for HKGs that learns to model the probability distributions of missing components conditioned on the local fact components and global structure of HKGs via a masked discrete diffusion. KREPE models both the intra-fact dependencies by contextual message passing and inter-fact correlations by aggregating stochastically sampled contexts. KREPE seamlessly unifies link prediction and fact generation within a single training framework, achieving state-of-the-art performance on standard HKG link prediction benchmarks and outperforming LLM-based baselines in generating novel and correct facts.

📰

arXiv cs.LG Research May 26, 2026

Federated Learning over Human-Body Communication for On-Body Edge Intelligence: A Survey, Taxonomy, and BODYFED-HBC Scheduling Vignette

arXiv:2605.24062v1 Announce Type: new Abstract: Human-body communication (HBC) is a promising physical substrate for wearable body-area networks because it can localize communication around the body and reduce the burden of conventional radio links. Federated learning (FL) is a promising learning substrate because it can reduce raw-data centralization for physiological and behavioral sensing. Yet these two literatures remain weakly connected: FL for wearables usually abstracts the communication layer, whereas HBC research usually abstracts learning and model-update traffic. This article surveys the intersection of HBC, wireless body-area networks, wearable FL, Internet-of-Bodies privacy, and edge-intelligence optimization. We propose a taxonomy that distinguishes intra-body, body-hub, cross-user, and clinical-cloud FL deployments, and we identify the open problem of body-channel-aware FL: learning protocols whose client selection, update compression, and aggregation are controlled by posture-dependent HBC links, residual energy, sensor memory, and privacy risk. To make the research agenda concrete, we introduce BODYFED-HBC as a reference architecture and provide an optimization formulation and scheduling algorithm. We further specify a reproducible simulation vignette that combines public wearable datasets with empirical body-coupled-communication signal-loss models. The article concludes with open datasets, evaluation metrics, limitations, and research directions for computer scientists working above the hardware layer.

📰

arXiv cs.LG Research May 26, 2026

Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

arXiv:2605.24059v1 Announce Type: new Abstract: We present a three-step recipe for identifying attention-head circuits in pretrained transformers. A per-head spectral signal -- the time-integrated participation ratio of each head's attention output -- ranks heads doing sustained content-dependent computation without labels or attribution gradients. A task-pattern screen filters this general indicator into a task-specific candidate circuit, and group ablation against a matched-random control completes the causal claim. We validate across an 8x parameter range (51M to 1B-active / 7B-total), two architecture families (dense, mixture-of-experts), and four pretraining pipelines. The recipe ports: a 2-6 head induction circuit is causally necessary in every model tested, with a 94-100% drop in synthetic-induction top-1 after ablation. The spectral signal is predictive without supervision: on six independent seeds of a 51M-parameter probe model, the same computation identifies the seed-specific circuit on each seed. The fraction of heads doing identifiable specialized computation is conserved at 17-19% across the Pythia family (124M to 410M), while specific induction circuits stay 3-11 heads -- sublinear in total head count. This paper is the methodology anchor of a three-paper program; companion papers extend the recipe to developmental trajectories during pretraining and to composed-task circuits where pattern selectivity decouples from task-causal structure.
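
The participation ratio named in the abstract is commonly defined as (Σλ)² / Σλ² over a non-negative spectrum. The sketch below uses that standard definition. How the paper obtains each head's spectrum and how it integrates over time are not stated here, so the per-step summation and the function names are assumptions made for illustration.

```python
def participation_ratio(spectrum):
    """Standard participation ratio (sum x)^2 / (sum x^2) of non-negative values.

    Equals 1 when one component carries everything and len(spectrum)
    when all components are equal.
    """
    total = sum(spectrum)
    squares = sum(x * x for x in spectrum)
    if squares == 0:
        return 0.0
    return total * total / squares


def time_integrated_pr(spectra_over_time):
    """Assumed form of time integration: a plain sum of per-step ratios."""
    return sum(participation_ratio(s) for s in spectra_over_time)


# Hypothetical spectra for one head at two time steps.
flat_step = [1.0, 1.0, 1.0, 1.0]    # variance spread over four directions
peaked_step = [1.0, 0.0, 0.0, 0.0]  # variance in a single direction
score = time_integrated_pr([flat_step, peaked_step])
```

Under this reading, a head whose output stays spread over many directions across steps accumulates a higher score than one that collapses to a single direction.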

📰

arXiv cs.LG Research May 26, 2026

Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

arXiv:2605.24058v1 Announce Type: new Abstract: On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-specific LoRA adapter. In the unmerged adapter-mode setting, however, the adapter is more than a compact storage module; it introduces an additional dense floating-point branch, maintains a trainable state for local updates, and acts as a unit of communication and hot-swapping. We introduce LoRDBA, a LoRA-compatible adapter that replaces both low-rank factors with binary sign carriers while representing magnitudes through lightweight, channel-wise scales, converting the dense adapter branch into two sign-accumulation matrix multiplications interleaved with channel-wise scaling. A finite-sample analysis shows that reconstruction quality is governed by the residual-to-magnitude ratio of the original LoRA factors. In adapter-mode experiments, LoRDBA outperforms low-bit baselines at matched model sizes while matching fp16 LoRA quality in selected regimes. The unmerged adapter incurs at most 8% prefill latency overhead at matched rank r=16 despite a more than 10x reduction in adapter footprint, with moderate training memory overhead of approximately 1.6x that of fp16 LoRA.
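
To illustrate the general idea of a sign carrier plus channel-wise scale, here is a minimal sketch that approximates each row of a factor by its signs times the row's mean absolute value, which is the least-squares scale for a fixed sign pattern. This is not presented as LoRDBA's procedure: the choice of rows as channels, the mean-absolute-value scale, and the function names are all assumptions.

```python
def binarize_rows(matrix):
    """Split each row into a +1/-1 sign vector and one scalar scale.

    The scale is the row's mean absolute value (assumed choice).
    """
    signs, scales = [], []
    for row in matrix:
        scales.append(sum(abs(v) for v in row) / len(row))
        signs.append([1 if v >= 0 else -1 for v in row])
    return signs, scales


def reconstruct(signs, scales):
    """Rebuild the approximate matrix as scale * sign for each row."""
    return [[scale * s for s in row] for row, scale in zip(signs, scales)]


def residual_norm(matrix, approx):
    """Frobenius norm of the reconstruction error."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(matrix, approx)
               for a, b in zip(row_a, row_b)) ** 0.5


# Hypothetical 2x2 low-rank factor.
factor = [[0.5, -1.5], [2.0, 2.0]]
signs, scales = binarize_rows(factor)
approx = reconstruct(signs, scales)
error = residual_norm(factor, approx)
```

In this toy factor the second row has equal magnitudes and is reconstructed exactly, while the first row is not, which matches the intuition that error grows with how far magnitudes deviate from their channel mean.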

📰

arXiv cs.LG Research May 26, 2026

Feature Lottery? A Bifurcation Theory of Concept Emergence

arXiv:2605.24057v1 Announce Type: new Abstract: Neural networks acquire structured representations at specific moments during training, yet identifying these transitions typically relies on retrospective, label-dependent metrics. We introduce a bifurcation theory of representation dynamics to detect these moments in real time. Analyzing a passive GMM probe attached to the evolving encoder, we show the onset of structure corresponds to a supercritical pitchfork bifurcation driven by the loss Hessian. The system exhibits a theoretically predictable zero-crossing ($\beta_c$) that, compared to the network's current state ($\beta$), yields a dynamic ratio $\beta(t)/\beta_c(t)$: a universal, label-free phase coordinate for representation dynamics, computable entirely from hidden states. We empirically validate four distinct transition regimes predicted by this coordinate across diverse settings: SAEs on language models (Pythia), SSL (CIFAR), and grokking (modular arithmetic). Crucially, under finite dissipation, macroscopic symmetry-breaking can lag the initial zero-crossing by orders of magnitude, providing a rigorous dynamical account of the delayed escape observed in grokking. Microscopically, the bifurcation creates a shared unstable subspace, forcing collective symmetry breaking. We term this the "feature lottery" in SAE training: a feature's terminal interpretability becomes predictable remarkably early. By only 5% of training, early atom purity robustly predicts final convergence purity, with top-decile early atoms achieving over 12x the baseline purity at convergence. Beyond explaining concept emergence, $\beta/\beta_c$ provides a practical early-warning indicator for training health, detecting the onset of usable structure, the crystallization of feature identity, and representational collapse epochs before downstream metrics react.

📰

arXiv cs.LG Research May 26, 2026

Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions

arXiv:2605.24055v1 Announce Type: new Abstract: Real-world time-series data in industrial sensing, healthcare, and energy systems is often corrupted by a mixture of Gaussian noise and occasional large-magnitude impulse outliers. For tasks that depend on local shape, such as ECG morphology analysis and battery degradation monitoring, the main requirement is not only low reconstruction error but also preservation of derivative peaks and task-critical features. We propose Cascade-KDE, a training-free restoration framework for corrupted time series. The method first estimates a two-dimensional temporal-amplitude density, then applies a Density-Truncated Robust Expectation to limit the influence of distant abnormal points, and finally refines the sequence through an exponential cascade with adaptive stopping. This design aims to improve robustness under out-of-distribution impulse corruptions while keeping the restored trajectory close to the original local structure. Across several benchmark datasets, the proposed method shows consistent gains over classical filters and representative learning-based baselines on curve fidelity, derivative preservation, downstream classification, and runtime efficiency. These results suggest that bounded density-based restoration is a practical option for feature-preserving preprocessing in noisy time-series pipelines.

📰

arXiv cs.LG Research May 26, 2026

Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing

arXiv:2605.24052v1 Announce Type: new Abstract: To better serve users' demands in mobile applications (e.g., navigation), mobile crowdsourcing platforms can iteratively align large language model (LLM)-generated content (e.g., AI-generated traffic condition predictions) with human feedback collected from crowdsourcing workers (e.g., mobile users). However, workers may strategically misreport their online preference feedback to maximize their influence or payment. Existing pipelines in mobile crowdsourcing (e.g., EM-based weight estimation) fail to identify the most accurate worker in this online setting, resulting in a linear regret $\mathcal{O}(T)$ over $T$ time slots. In this paper, we study truthful online preference aggregation for LLM fine-tuning in mobile crowdsourcing. We formulate a new dynamic Bayesian game to model the multi-agent online learning process between the platform and strategic mobile workers. We propose a novel online weighted aggregation mechanism that dynamically adjusts each worker's weight in the preference aggregation according to their feedback accuracy. We prove that our mechanism ensures truthful feedback from strategic workers and achieves a sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots. We further extend our mechanism to a challenging scenario with limited worker feedback per time slot, still guaranteeing a sublinear regret $\mathcal{O}(\sqrt{T})$. Experiments on LLM fine-tuning with real-world datasets further demonstrate significant performance gains of our mechanisms over benchmark schemes.
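
As a rough illustration of accuracy-driven reweighting, the sketch below uses the textbook multiplicative-weights (Hedge) update over binary preference reports. It is a stand-in, not the paper's mechanism: it carries no truthfulness guarantee, it assumes the correct label is revealed after every round, and the learning rate `eta` and all names are hypothetical.

```python
import math


def aggregate(weights, reports):
    """Weighted majority vote over binary reports (True/False)."""
    score = sum(w * (1 if r else -1) for w, r in zip(weights, reports))
    return score >= 0


def update(weights, reports, truth, eta=0.5):
    """Multiplicative-weights step: shrink each wrong worker, then renormalize."""
    scaled = [w * math.exp(-eta * (1 if r != truth else 0))
              for w, r in zip(weights, reports)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Hypothetical workers: worker 0 always reports correctly, the others never do.
weights = [1 / 3, 1 / 3, 1 / 3]
first_vote = aggregate(weights, [True, False, False])
for _ in range(10):
    weights = update(weights, [True, False, False], truth=True)
final_vote = aggregate(weights, [True, False, False])
```

With equal starting weights the two inaccurate workers outvote the accurate one; after ten rounds the accurate worker holds most of the weight and the aggregate follows it.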

📰

arXiv cs.LG Research May 26, 2026

Mixture of Complementary Agents for Robust LLM Ensemble

arXiv:2605.24048v1 Announce Type: new Abstract: Multi-AI collaboration, such as ensembling or debating large language models (LLMs), is a promising paradigm for aggregating information and boosting performance. A foundational step in these pipelines is to feed the responses of several proposer LLMs into a summarizer LLM, which synthesizes a better answer. However, choosing which proposers to include is non-trivial. Existing approaches primarily focus either on accuracy (picking the strongest models) or diversity (ensuring variety), and often overlook the interactions among proposers and with the summarizer. We reframe proposer selection as a combinatorial selection problem akin to feature selection, where the value of an LLM lies in its complementarity with others. However, directly applying standard feature-selection algorithms is impractical in the LLM setting due to prohibitive time complexity. Motivated by this limitation, we explore an extensive range of computationally feasible, greedy-style selection algorithms that assess complementarity using a small labeled set. Our experiments validate complementarity as a guiding principle for proposer selection and identify methods that achieve the best performance-cost trade-offs in practice.
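
One simple way to picture greedy, complementarity-based selection on a small labeled set is forward selection by marginal coverage: at each step, add the proposer that answers the most not-yet-covered examples correctly. The coverage proxy below (an example counts as solved if any chosen proposer gets it right) ignores the summarizer and is an assumption for illustration, as are the names; the abstract does not specify which greedy variants the paper evaluates.

```python
def greedy_select(correct_sets, k):
    """Pick up to k proposers by marginal gain in covered labeled examples.

    correct_sets maps a proposer name to the set of example ids it
    answers correctly on the labeled set.
    """
    chosen, covered = [], set()
    for _ in range(k):
        candidates = [name for name in correct_sets if name not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda name: len(correct_sets[name] - covered))
        if not correct_sets[best] - covered:
            break  # no remaining proposer adds anything new
        chosen.append(best)
        covered |= correct_sets[best]
    return chosen, covered


# Hypothetical results: "b" is more accurate than "c" but redundant with "a".
correct_sets = {"a": {1, 2, 3, 4}, "b": {1, 2, 3}, "c": {5, 6}}
chosen, covered = greedy_select(correct_sets, k=2)
```

Here the second pick is the weaker but complementary proposer rather than the second-most-accurate one, which is the distinction between accuracy-based and complementarity-based selection that the abstract draws.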

📰

arXiv cs.LG Research May 26, 2026

A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?

arXiv:2605.24045v1 Announce Type: new Abstract: Protein-ligand modeling underpins computational drug discovery and molecular design. Existing protein-ligand benchmarks typically evaluate whether a protein and ligand interact and how strongly they bind, through tasks such as binary binding prediction and affinity regression. However, these evaluations provide limited evidence of whether models can localize binding sites or identify the non-covalent interactions underlying molecular recognition. To address this gap, we introduce InteractBind, a large-scale protein-ligand dataset comprising approximately 100k protein-ligand pairs, together with a benchmark for fine-grained evaluation. The core fine-grained task is that of binding-site localization, which uses protein-residue and ligand-atom interaction maps spanning six major types of non-covalent interactions to assess whether model-derived interaction maps localize binding sites. InteractBind further includes binding affinity and protein similarity-controlled splits to support realistic generalization assessment. Using InteractBind, we evaluate eight existing sequence-based and interaction-aware models, assessing binary binding prediction and binding-site localization. Results reveal limited binding-site localization despite strong binary binding prediction, with marked variation across non-covalent interaction types. Overall, InteractBind establishes a benchmark paradigm that encourages the development of more interpretable and physically grounded protein-ligand models.

📰

arXiv cs.LG Research May 26, 2026

LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

arXiv:2605.24043v1 Announce Type: new Abstract: Scientific discovery is a closed-loop process in which hypotheses guide data acquisition and observations refine the hypothesis space. Yet most approaches reduce discovery to supervised learning over fixed datasets, where limited observations can support multiple plausible mechanisms that fit locally but fail to generalize. Thus, the key challenge is selecting informative observations to resolve uncertainty, shifting the focus from static inference to adaptive data acquisition. To address this, we propose LLM-AutoSciLab, a closed-loop framework that couples hypothesis generation with hypothesis-conditioned experiment selection and mechanism refinement. Rather than fitting models to passively collected data, LLM-AutoSciLab iteratively proposes plausible hypotheses, selects informative experiments to distinguish or refine them, and updates its state using the resulting evidence. To evaluate dynamic, closed-loop scientific discovery with active data acquisition, we introduce ActiveSciBench, comprising two datasets: ActiveSciBench-Chem with 57 enzyme-kinetics tasks and ActiveSciBench-GRN with 45 gene-regulatory-network tasks. These datasets model discovery as a budget-constrained process requiring adaptive experiment design, variable selection, and recovery of true mechanisms. Across NewtonBench, ActiveSciBench-Chem, and ActiveSciBench-GRN, LLM-AutoSciLab outperforms prior methods, achieving 67.6% and 35.1% symbolic accuracy on NewtonBench and ActiveSciBench-Chem, respectively, and 31.1% exact graph recovery on ActiveSciBench-GRN. Moreover, hypothesis-guided experimentation is 2-5x more sample-efficient than the strongest competing baselines. Code and data are available at: https://github.com/scientific-discovery/LLM-AutoSciLab

📰

arXiv cs.LG Research May 26, 2026

Hidden-State Privacy Has an Empty Middle

arXiv:2605.24042v1 Announce Type: new Abstract: Of $1{,}536$ Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate privacy against an adaptive retrieval attacker. We prove a complementary Fisher-ball lower bound: every full-rank Gaussian release at $O(1)$ Fisher utility admits a direction whose Mahalanobis signal grows linearly in hidden width, ruling out uniform Gaussian safety in the class and matching the empirical empty middle. The diagonal inverse-Fisher release $\Sigma^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$ is the unique minimax-optimal diagonal mechanism at first-order KL budget $\mathcal{K}$ and the only release with worst-attacker top-1 $\le 0.001$ at every point of a 32 model-layer grid, but it sits on a privacy/utility edge rather than filling the middle. A generalized-eigen mechanism reaching $13\times$ Pareto reduction under Euclidean retrieval collapses to $100\%$ top-1 under the adaptive Mahalanobis attacker, and a full-trajectory sequence inverter recovers $94\%$ of clean GPT-2 prefixes but $0\%$ under $\Sigma_{\mathrm{diag}}$. A split-memory transformer trained from scratch reaches $G_{\mathrm{Mah}} \in [20, 33]$ at 90M and maintains a $6$--$24\times$ advantage over same-budget GPT baselines from 30M to 1B at a fixed-token language-modeling loss penalty; pretrained models top out at 9.3. These results reframe hidden-state release from mechanism-design within the Gaussian class to architecture or release co-design.
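
The diagonal release formula in the abstract can be checked numerically. The sketch below builds the per-coordinate variances $(2\mathcal{K}/d)/F_{ii}$ and evaluates $\tfrac{1}{2}\sum_i F_{ii}\sigma_i^2$, which under the usual quadratic KL approximation is the expected KL cost of the noise. Reading the abstract's KL budget as this quantity is an assumption, and the Fisher values and function names are made up for illustration.

```python
import random


def diag_inverse_fisher_variances(fisher_diag, kl_budget):
    """Per-coordinate noise variances (2K/d) / F_ii; requires every F_ii > 0."""
    d = len(fisher_diag)
    return [(2.0 * kl_budget / d) / f for f in fisher_diag]


def expected_quadratic_kl(fisher_diag, variances):
    """0.5 * sum_i F_ii * var_i, the assumed reading of the KL budget."""
    return 0.5 * sum(f * v for f, v in zip(fisher_diag, variances))


def release(hidden, variances, rng):
    """Add independent zero-mean Gaussian noise to each hidden coordinate."""
    return [h + rng.gauss(0.0, v ** 0.5) for h, v in zip(hidden, variances)]


# Hypothetical diagonal Fisher entries and budget.
fisher_diag = [0.5, 2.0, 4.0]
variances = diag_inverse_fisher_variances(fisher_diag, kl_budget=0.3)
spent = expected_quadratic_kl(fisher_diag, variances)
noisy = release([0.1, -0.2, 0.3], variances, random.Random(0))
```

Each coordinate contributes the same amount, K/d, to the quadratic cost, so low-Fisher directions receive proportionally more noise.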

📰

arXiv cs.LG Research May 26, 2026

Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation

arXiv:2605.24041v1 Announce Type: new Abstract: Neural operators serve as fast, data-driven surrogates for scientific modeling but typically rely on a monolithic, single-pass inference procedure that struggles to resolve high-frequency details, a limitation known as spectral bias. We introduce the Iterative Refinement Neural Operator (IRNO), which augments pre-trained operators with a learned refinement module iteratively applied via fixed-point iteration. IRNO decomposes the prediction into a coarse initialization followed by successive residual corrections, paralleling classical numerical solvers. Under local assumptions, we establish contraction of the induced operator, ensuring convergence to a unique fixed point. To explicitly target high-frequency errors, we propose a progressive spectral loss that adaptively increases penalty on high-frequency components over refinement steps during training. Across physical systems, IRNO consistently lowers error, with up to 56.05% improvement on turbulent flow. On Active Matter, spectral analysis reveals that, relative to the base operator, the normalized error ratios decrease to 27.72-36.10% in low-, 5.07-6.68% in mid-, and 1.48-2.04% in high-frequencies, remaining stable beyond the trained iteration count. Code is available at https://github.com/xiaotianliu-dartmouth/Iterative_Refinement_Neural_Operator
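
The coarse-initialization-plus-residual-corrections pattern can be shown with a toy fixed-point loop. In this sketch the learned refinement module is replaced by a hand-written correction that moves halfway toward a known target, so the update map is a contraction with factor 0.5. The correction, the target, and all names are hypothetical stand-ins, not IRNO's components.

```python
def refine(initial, correction, steps):
    """Apply u <- u + correction(u) for a fixed number of steps.

    Returns every iterate so convergence can be inspected.
    """
    u = list(initial)
    history = [u]
    for _ in range(steps):
        delta = correction(u)
        u = [a + b for a, b in zip(u, delta)]
        history.append(u)
    return history


def toy_correction(u, target=(1.0, -2.0), rate=0.5):
    """Stand-in for a learned refiner: a step of size rate toward target."""
    return [rate * (t - a) for t, a in zip(target, u)]


def max_error(u, target=(1.0, -2.0)):
    """Largest absolute deviation from the toy target."""
    return max(abs(t - a) for t, a in zip(target, u))


# Coarse initial guess, then 20 refinement steps.
history = refine([0.0, 0.0], toy_correction, steps=20)
errors = [max_error(u) for u in history]
```

Because each step halves the remaining error, the iterates approach the unique fixed point geometrically, and running more steps than originally planned does not destabilize them.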

📰

arXiv cs.LG Research May 26, 2026

Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

arXiv:2605.24033v1 Announce Type: new Abstract: Mechanistic interpretability often identifies circuits inside Transformer models, but explanations of those circuits are usually validated through examples, ablations, and manual reasoning. This leaves a gap between finding a plausible circuit and proving what the circuit does. We introduce Verifiable Transformers, a framework for converting task-localized Transformer circuits into bounded, solver-checkable claims. Given a behavior, a finite task domain, and a candidate-token projection, we extract a task circuit and verify properties such as projected functional equivalence, edge necessity, task-relevant invariance, and final-residual robustness. Direct verification encodes the extracted circuit itself into an SMT solver. When a circuit contains operators that are not exactly or tractably encodable, surrogate-mediated verification fits an SMT-encodable surrogate, validates it against the extracted circuit over the bounded domain, and verifies symbolic explanations against the surrogate. We instantiate direct verification with a GPT-style architecture using Signed L1 BandNorm, sparsemax attention, and LeakyReLU. On small symbolic sequence tasks, we train an SMT-representable Transformer, extract sparse circuits for quote closing and bracket type tracking, and exhaustively verify projected functional equivalence, content invariance, edge necessity, and final-residual robustness. At GPT-2 scale, the same operator stack trains stably on OpenWebText, although naive direct SMT verification remains intractable. We also demonstrate surrogate-mediated verification on task-localized circuits with hard-to-encode attention, showing both verified symbolic explanations and solver-generated counterexamples. The goal is not full-model verification, but a concrete path for turning mechanistic circuit explanations into formal propositions that can be proven or refuted.

📰

arXiv cs.LG Research May 26, 2026

CAFD: Concept-Aware DNN Fault Detection using VLMs

arXiv:2605.24008v1 Announce Type: new Abstract: Fault detection for Deep Neural Networks (DNNs) has received increasing attention in recent years. While more advanced hybrid approaches have been proposed to combine multiple sources of information and outperform earlier techniques, they often incur substantial computational overhead, limiting scalability and practicality in real-world settings. In this paper, we introduce Concept-Aware Fault Detection (CAFD), a learning-based approach that achieves superior fault detection performance by effectively integrating multiple information sources while maintaining practical efficiency. Specifically, CAFD is trained using a carefully selected set of informative features, including model-based signals derived from the DNN's outputs, distance-based features, and a novel concept-based feature, called Concept Failure Ratio (CFR). CFR leverages Vision-Language Models (VLMs) to extract textual concepts from images and quantify the likelihood that their presence is associated with DNN failures. By incorporating this feature, CAFD benefits from complementary semantic information, enabling more effective fault detection. Our results demonstrate that CFR serves as an effective indicator for DNN fault detection. We conduct an extensive empirical evaluation of CAFD, comparing it against five state-of-the-art baselines across three subject DNN models and datasets, including ImageNet. Across a wide range of constrained selection budgets, CAFD consistently outperforms all baselines in Fault Detection Rate (FDR), achieving average FDR improvements of 18.3% across all investigated subjects and budget sizes.

📰

arXiv cs.LG Research May 26, 2026

Parameter Efficient Multi-Class Intelligent Scheduling for Multimodal Online Distributed Industrial Anomaly Detection

arXiv:2605.23984v1 Announce Type: new Abstract: Industrial anomaly detection has attracted significant attention as a fundamental challenge in industrial systems. The rapid advancement of heterogeneous industrial sensors has driven industrial anomaly detection from unimodal to multimodal paradigms. However, existing methods are primarily designed for centralized and offline settings, overlooking the distributed and continuously generated data characteristic of real-world industrial environments. With the advancement of edge intelligence, modern edge devices are increasingly capable of not only data acquisition but also distributed model training, enabling collaborative intelligence across the system. Industrial anomaly detection represents a critical application in this context. Motivated by these challenges, we propose a novel framework termed Multimodal Online Distributed Industrial Anomaly Detection (MODIAD). We first present a comprehensive workflow for MODIAD and then formulate a Multi-class Intelligent Scheduling (MIS) problem to coordinate cross-class model updates by balancing data sufficiency and class update frequency. To efficiently solve this problem, we design a Sequential Marginal Gain Greedy (SMG) algorithm that enables effective multi-class training under resource constraints. Furthermore, to improve the computational and communication efficiency during training, we propose a Resource Efficient Class-Wise Low Rank Adaptation (REC-LoRA) strategy, which significantly reduces system overhead while preserving detection performance. Extensive experiments on two representative multimodal industrial anomaly detection datasets, MVTec 3D-AD and Eyecandies, demonstrate that the proposed approach achieves superior performance and efficiency under the MODIAD scenario.

📰

arXiv cs.LG Research May 26, 2026

Algometrics: Forecasting Under Algorithmic Feedback

arXiv:2605.23978v1 Announce Type: new Abstract: In algorithmic markets, predictive models become part of the data-generating process they aim to forecast. Once their outputs are converted into trades, allocations, execution schedules, or risk controls, they change the future data on which they are evaluated. I introduce algometrics, a framework for time series whose evolution depends on the predictive algorithms forecasting them. The framework distinguishes historical risk, measured under passive forecasting, from deployment risk, measured when forecasts drive actions. I prove three results. First, deployment risk is not identifiable from passive historical data alone: even in a one-step linear feedback model, infinitely many algorithm-mediated environments induce the same historical law while implying different deployment risks for the same forecaster. Second, historical model rankings can invert under crowding, so a predictor with lower passive error can have higher deployment error once similar algorithms are adopted. Third, randomized or instrumented actions identify short-horizon linear feedback, and I derive a finite-sample bound for deployment-risk estimation. These results suggest that time-series benchmarks in algorithmic markets should report feedback sensitivity alongside predictive accuracy.

📰