-
Linear equivalence of nonlinear recurrent neural networks
Large nonlinear recurrent neural networks with random couplings generate high-dimensional, potentially chaotic activity whose structure is of interest in neuroscience, machine learning, ecology, and other fields. A fundamental object encoding the collective structure of this activity is the $N \times N$ covariance matrix. Prior analytical work on the covariance matrix has been limited to low-dimen…
-
Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation
Reward hacking in code generation, where models exploit evaluation loopholes to obtain full reward without correctly solving the tasks, poses a critical challenge for Reinforcement Learning (RL) and the deployment of reasoning models. Existing studies have been conducted primarily on synthetic hacking trajectories. However, whether these synthetic behaviors faithfully represent naturally emerging …
-
Optimality Conditions and Numerical Algorithms for a Class of Minimax Bilevel Optimization Problems
In many applications, including Stackelberg games, machine learning, and power systems \cite{Mackay2018Selftuning,Heinrich1952The,Wang2021Bi-Level}, the decisions in a minimax optimization problem can be constrained by a solution to an optimization problem. In this paper, we introduce optimality conditions of this novel minimax bilevel optimization problem and develop efficient first-order algori…
-
Your Students Don't Use LLMs Like You Wish They Did
Educational NLP systems are typically evaluated using engagement metrics and satisfaction surveys, which are at best a proxy for meeting pedagogical goals. We introduce six computational metrics for automated evaluation of pedagogical alignment in student-AI dialogue. We validate our metrics through analysis of 12,650 messages across 500 conversations from four courses. Using our metrics, we ident…
-
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradient access, and strict query budgets. We formalize this strict black-box threat model and propose a two-agent evasion framework operating in a semantic perturbation spa…
-
Leveraging Spatial Transcriptomics as Alternative to Manual Annotations for Deep Learning-Based Nuclei Analysis
Deep learning-based nuclei segmentation and classification in pathology images typically rely on large-scale pixel-level manual annotations, which are costly and difficult to obtain across diverse tissues and staining conditions. To address this limitation, we propose a framework that leverages spatial transcriptomics (ST) data as supervision for nuclei segmentation and classification. By incorpor…
-
Resource-Constrained Shortest Path with Polytopic Reset Sets
This paper investigates the problem of computing the shortest path between two states under resource constraints in environments with resource-replenishment regions. Namely, the length of the path is limited by a budget that can be restored within polytopic replenishment regions. We show that the optimal path in this problem exhibits a distinct geometric structure: it consists of straight-line seg…
-
JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a framework and benchmark for quantifying this property via the Judge Sensitivity Score (JSS), defined as the fraction of paraphrase pairs on which a judge returns an identi…
-
SEMA-SQL: Beyond Traditional Relational Querying with Large Language Models
Relational databases excel at structured data analysis, but real-world queries increasingly require capabilities beyond standard SQL, such as semantically matching entities across inconsistent names, extracting information not explicitly stored in schemas, and analyzing unstructured text. While text-to-SQL systems enable natural language querying, they remain limited to relational operations and c…
-
Supernodes and Halos: Loss-Critical Hubs in LLM Feed-Forward Layers
We study the organization of channel-level importance in transformer feed-forward networks (FFNs). Using a Fisher-style loss proxy (LP) based on activation-gradient second moments, we show that loss sensitivity is concentrated in a small set of channels within each layer. In Llama-3.1-8B, the top 1% of channels per layer accounts for a median of 58.7% of LP mass, with a range of 33.0% to 86.1%. We…
-
GeoCert: Certified Geometric AI for Reliable Forecasting
Forecasting systems in science must be accurate, physically consistent, and certifiably reliable. Most existing models address prediction, constraint enforcement, and verification separately, limiting scalability and interpretability. We introduce GeoCert, a geometric AI framework that unifies forecasting, physical reasoning, and formal verification within a single differentiable computation. GeoC…
-
Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization
While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted heuristics, inherently limiting their potential for open-ended improvement. To address this, we propose Escher-Loop, a fully closed-loop framework that operationalizes the mutual evolution of two distinct populations: Task Agents that solve concrete problems, a…
-
Can Humans Detect AI? Mining Textual Signals of AI-Assisted Writing Under Varying Scrutiny Conditions
This study asks whether the threat of AI detection changes how people write with AI, and whether other people can tell the difference. In a two-phase controlled experiment, 21 participants wrote opinion pieces on remote work using an AI chatbot. Half were randomly warned that their submission would be scanned by an AI detection tool. The other half received no warning. Both groups had access to th…
-
Estimation of MIDAS Regressions with Errors-in-the-Variables
In this paper, a Mixed Data Sampling (MIDAS) model is studied when both low- and high-frequency variables are contaminated with measurement error. It is shown that the profile likelihood estimator becomes inconsistent in the presence of measurement error. Using the corrected score approach along with the profile likelihood approach, a consistent estimator for the parameters of the MIDAS Measurement Error model…
-
A Milestone in Formalization: The Sphere Packing Problem in Dimension 8
In 2016, Viazovska famously solved the sphere packing problem in dimension $8$, using modular forms to construct a 'magic' function satisfying optimality conditions determined by Cohn and Elkies in 2003. In March 2024, Hariharan and Viazovska launched a project to formalize this solution and related mathematical facts in the Lean Theorem Prover. A significant milestone was achieved in February 202…
-
Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference
Large Language Models (LLMs) have achieved strong performance across natural language and multimodal tasks, yet their practical deployment remains constrained by inference latency and kernel launch overhead, particularly in interactive, short-sequence settings. This paper presents a hybrid runtime framework that combines Just-In-Time (JIT) compilation with CUDA Graph execution to reduce launch ove…
-
V.O.I.C.E (Voice, Ownership, Identity, Control, Expression): Risk Taxonomy of Synthetic Voice Generation From Empirical Data
As generative voice models are rapidly advancing in both capabilities and public utilization, the unconsented collection, reuse, and synthesis of voice data are introducing new classes of privacy, security and governance risk that are poorly captured by existing, largely uniform threat models. To fill the gap, we present V.O.I.C.E, a taxonomy of voice generation risk grounded in a multi-source thr…
-
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs…
-
Machine learning models for estimating counterfactuals in a single-arm inflammatory bowel disease study
Single-arm trials accelerate study timelines by reducing the number of patients that must be recruited for a concurrent control group. However, these designs require an alternative comparator to estimate treatment effects. One approach is to construct a virtual control arm using a machine learning (ML) model trained on external control data to predict the counterfactual outcomes of the treatment a…
-
On cross-validation for small area estimators
Subnational monitoring of public health often relies on household surveys where data are sparse at the desired spatial resolution. Small area estimation (SAE) methods address this challenge by borrowing strength across areas and incorporating auxiliary information. However, comparing these estimators remains difficult in the absence of ground truth. We propose a cross-validation framework for eval…
-
A theory of ROC analysis of rule-out and rule-in diagnostics with applications to mammography data
Multiple diagnostic tests are frequently used to determine the presence of a disease condition in patients. In this paper, we use bivariate copulas to examine the properties of receiver operating characteristic (ROC) curves formed when two correlated diagnostic tests are used together to rule-out ("believe the negative") and rule-in ("believe the positive") patients for disease. We use this theory…
-
Scaling limit of Sinkhorn-rescaled Random Matrices via Stability of Static Schrödinger Bridges
We analyze the asymptotic behavior and scaling limits of large random matrices rescaled via the Sinkhorn algorithm to match prescribed row and column margins. For a random matrix with independent sub-exponential entries, we show that its Sinkhorn rescaling concentrates around the rescaling of its mean matrix, both at the level of the Schrödinger potentials and as random measures on the unit square…
-
Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster infer…
-
Architecture Matters for Multi-Agent Security
Multi-agent systems (MAS), composed of networks of two or more autonomous AI agents, have become increasingly popular in production deployments, yet introduce security risks that do not arise in single-agent settings. Even if individual agents exhibit robust security, architectural decisions governing their coordination can create attack surfaces that have not been systematically characterized. In…
-
A Benchmark Suite of Reddit-Derived Datasets for Mental Health Detection
The growing availability of online support groups has opened up new windows to study mental health through natural language processing (NLP). However, this work is hindered by a lack of high-quality, well-validated datasets. Existing studies tend to build task-specific corpora without collecting them into widely available resources, and this makes reproducibility as well as cross-task comparis…