Deep learning inference and training tasks are often accompanied by additional numerical data analysis tasks such as clustering, dimensionality reduction, data transformation, and linear modeling. While matrix engines are designed primarily with deep neural network workloads in mind, they have also been used to accelerate general-purpose matrix processing workloads. The matrix multiplication components of numerical data analysis workloads differ from those of deep neural network models in matrix shape, size, and layout. In this wide problem space, subtle static scheduling or system-level effects cause the accelerator to observe variable memory latency in small-matrix regimes, degrading accelerator utilization by up to 30%. We observe that minor modifications to a matrix accelerator’s hardware controller can substantially improve the suitability of the accelerator for these problem types. We demonstrate up to a 1.25x improvement in the utilization of a matrix engine on small matrices through hardware-managed static scheduling, and up to a 1.15x improvement through dynamic scheduling and hardware-managed commutative micro-threading, improving matrix engine utilization for general-purpose linear algebra workloads.