The Memory Floor for Detecting Hidden Structure in Streams
When data arrives in a stream and memory is limited, detecting planted structures – hidden cliques, bicliques, or sparse signals buried in noise – requires trading off between how much you can remember and how many samples you need. Garg, Hastings, Pabbaraju, and Sharan establish a unified framework for proving memory-sample lower bounds across these detection problems. They derive the first such tradeoffs for sparse principal component analysis in the spiked covariance model, with bounds that are nearly tight in the low-memory regime of O(log n) bits.
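To make the setting concrete, here is a toy simulation of the spiked covariance model for sparse PCA. The parameters, the planted direction, and the diagonal-thresholding detector are my own illustration of the detection problem, not the paper's construction; the detector shown uses O(n) memory, which is exactly the kind of budget the lower bounds rule out shrinking.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, theta, T = 200, 5, 3.0, 800  # dimension, sparsity, spike strength, sample count

# Sparse unit-norm spike direction v with k nonzero coordinates.
v = np.zeros(n)
v[rng.choice(n, size=k, replace=False)] = 1.0 / np.sqrt(k)

def sample(spiked):
    """Null: x ~ N(0, I). Planted: x ~ N(0, I + theta * v v^T)."""
    x = rng.standard_normal(n)
    if spiked:
        x += np.sqrt(theta) * rng.standard_normal() * v
    return x

def detect(xs):
    """Diagonal thresholding: flag a spike if any empirical variance is large.
    Support coordinates have true variance 1 + theta/k; the rest have 1."""
    diag = sum(x * x for x in xs) / len(xs)
    return bool(diag.max() > 1 + theta / (2 * k))

null_hits = sum(detect([sample(False) for _ in range(T)]) for _ in range(10))
spike_hits = sum(detect([sample(True) for _ in range(T)]) for _ in range(10))
print(null_hits, spike_hits)  # spiked trials should trip the test far more often
```

At these parameters the variance gap theta/k is large relative to sampling noise, so even this naive detector separates the two distributions; the paper's question is how much smaller the memory footprint can be made before that separation becomes impossible.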
At the technical core is a general distinguishing problem over matrices: separate a null distribution from one containing a planted outlier structure. The authors develop a new distributed data processing inequality that measures information cost under the null distribution, then apply direct-sum arguments to lift single-pass results to multi-pass streaming. The key structural finding is that multiple passes help less than one might expect: in the low-memory regime, each additional pass over the data yields diminishing returns, because the bottleneck is not exposure to the data but the capacity to accumulate evidence about the planted structure across observations. The information about a hidden clique, for instance, is diffused across many entries of the adjacency matrix, and compressing that information below a threshold destroys it in a way that further passes cannot recover.
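The diffusion point can be made quantitative with a back-of-the-envelope calculation. The parameters below are hypothetical and this marginal-KL accounting is my own illustration rather than the paper's argument: under a planted clique, any single adjacency entry is only marginally more likely to be an edge, so each entry carries a vanishing amount of evidence on its own.

```python
import math

n, k = 1000, 30  # graph size and clique size (hypothetical parameters)

# Under the null, each edge is present with probability 1/2. Under the
# planted distribution, a uniformly random entry lies inside the clique
# with probability q = C(k,2)/C(n,2), in which case it is present w.p. 1.
q = math.comb(k, 2) / math.comb(n, 2)
p_planted = q * 1.0 + (1 - q) * 0.5  # marginal edge probability

def kl_bernoulli(p, r):
    """KL(Bern(p) || Bern(r)) in bits."""
    return p * math.log2(p / r) + (1 - p) * math.log2((1 - p) / (1 - r))

per_entry = kl_bernoulli(p_planted, 0.5)
total = math.comb(n, 2) * per_entry
print(f"per-entry KL: {per_entry:.2e} bits, summed over all entries: {total:.2f} bits")
```

Summed over all ~500,000 entries, the marginal evidence comes to well under one bit; detection is only possible by exploiting correlations among entries, which is exactly the kind of accumulated evidence a small memory cannot hold across a stream.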
This connects to a deep question about the geometry of statistical evidence: when a signal is distributed uniformly across a high-dimensional object, detecting it requires a memory footprint proportional to the signal's spread, regardless of how many times you inspect the object. The framework's generality, covering cliques, bicliques, sparse means, and sparse PCA within a single argument, suggests that the phenomenon is information-theoretic rather than problem-specific.
(arXiv:2603.00770)