Hybrid Vision Transformers (ViTs) achieve high accuracy and strong generalization across a diverse range of computer vision tasks. However, their deployment on hardware-constrained platforms, such as autonomous driving systems, remains challenging due to their substantial compute and memory demands. To address these limitations, we propose Super-QHViT, a novel framework that combines Neural Architecture Search (NAS) with quantization to design efficient deep neural network architectures. Super-QHViT employs NAS to optimize the architecture for reduced computational and memory overhead, while quantization lowers the numerical precision of the network weights, further improving efficiency.
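As a rough illustration of the weight-quantization step (a minimal sketch, not the paper's exact scheme; the function name and bit-width choice are assumptions), a uniform symmetric quantizer could look like the following:

```python
import torch

def quantize_weights(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Uniform symmetric quantization of a weight tensor (illustrative only)."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8-bit
    scale = w.abs().max() / qmax              # per-tensor scale factor
    w_int = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return w_int * scale                      # dequantized ("fake-quantized") weights
```

Lower bit-widths shrink both model size and arithmetic cost, at the price of some quantization error that the training and distillation procedure must absorb.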
We train and evaluate Super-QHViT on standard benchmarks such as ImageNet-1k to ensure robust performance. In addition, we investigate how different knowledge distillation methods affect both network training and quantization, with the goal of improving the effectiveness of the resulting models. The culmination of our work is a one-shot NAS supernet that, once trained, enables the rapid identification of efficient subnet architectures tailored to specific hardware requirements without any additional training.
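To make the one-shot supernet idea concrete, the sketch below shows how subnet configurations (depth, width, and per-layer bit-width) might be sampled from a trained supernet's search space and filtered against a hardware budget; the search-space values, names, and cost proxy are illustrative assumptions, not the paper's actual interface.

```python
import random

# Hypothetical search space for a quantized hybrid ViT supernet.
SEARCH_SPACE = {
    "depth":       [8, 10, 12],        # number of transformer blocks
    "embed_dim":   [192, 256, 320],    # token embedding width
    "mlp_ratio":   [2, 3, 4],          # MLP expansion factor
    "weight_bits": [4, 6, 8],          # per-layer quantization bit-width
}

def sample_subnet() -> dict:
    """Randomly pick one option per dimension of the search space."""
    cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items() if k != "weight_bits"}
    cfg["weight_bits"] = [random.choice(SEARCH_SPACE["weight_bits"])
                          for _ in range(cfg["depth"])]
    return cfg

def estimated_cost(cfg: dict) -> float:
    """Toy proxy for compute/memory cost (real systems use latency tables or profilers)."""
    avg_bits = sum(cfg["weight_bits"]) / len(cfg["weight_bits"])
    return cfg["depth"] * cfg["embed_dim"] * cfg["mlp_ratio"] * avg_bits / 8

# Candidate subnets under a (hypothetical) hardware budget; each feasible candidate
# would then be evaluated with weights inherited from the trained supernet,
# with no retraining required.
budget = 5e5
feasible = [cfg for cfg in (sample_subnet() for _ in range(1000))
            if estimated_cost(cfg) <= budget]
print(f"{len(feasible)} feasible candidates, e.g. {feasible[0] if feasible else None}")
```

Because the supernet's weights are shared by all subnets, this search amounts to sampling and scoring configurations, which is far cheaper than training each candidate from scratch.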