Abstract
With the widespread application of automatic speech recognition (ASR) systems, their vulnerability to adversarial attacks has been extensively studied. However, most existing adversarial examples are generated against a specific individual model and therefore transfer poorly to other models. In real-world scenarios, attackers often cannot access detailed information about the target model, making query-based attacks infeasible. To address this challenge, we propose a technique called Acoustic Representation Optimization that aligns adversarial perturbations with low-level acoustic characteristics derived from speech representation models. Rather than relying on model-specific, higher-layer abstractions, our approach leverages fundamental acoustic representations that remain consistent across diverse ASR architectures. By enforcing an acoustic representation loss that guides perturbations toward these robust, lower-level representations, we enhance the cross-model transferability of adversarial examples without degrading audio quality. Our method is plug-and-play and can be integrated with any existing attack method. We evaluate our approach on three modern ASR models, and the experimental results demonstrate that it significantly improves the transferability of adversarial examples generated by previous methods while preserving audio quality.