Three Papers in NeurIPS 2022

Video Computing Group members have three papers in NeurIPS 2022. The topics are "generative adversarial attacks using vision-language models", "audio-visual-language embodied navigation (with MERL)", and "blackbox attack via surrogate ensemble search".
  • The majority of methods for crafting adversarial attacks have focused on scenes with a single dominant object (e.g., images from ImageNet). Natural scenes, on the other hand, include multiple dominant objects that are semantically related. It is therefore crucial to design attack strategies that look beyond learning on single-object scenes or attacking single-object victim classifiers. This paper presents the first approach to using generative models for adversarial attacks on multi-object scenes. To represent the relationships between different objects in the input scene, we leverage the open-source pre-trained vision-language model CLIP (Contrastive Language-Image Pre-training), with the motivation of exploiting the semantics encoded in the language space along with the visual space. We call this attack approach Generative Adversarial Multi-Object Scene Attacks (GAMA). GAMA demonstrates the utility of the CLIP model as an attacker’s tool for training formidable perturbation generators for multi-object scenes.

GAMA: Generative Adversarial Multi-Object Scene Attacks, A. Aich*, C. K.-Ta*, A. Gupta, C. Song, S. V. Krishnamurthy, M. S. Asif, A. Roy-Chowdhury, Neural Information Processing Systems (NeurIPS), 2022 (* joint first authors)


  • Recent years have seen embodied visual navigation advance in two distinct directions: (i) equipping the AI agent to follow natural language instructions, and (ii) making the navigable world multimodal, e.g., audio-visual navigation. However, the real world is not only multimodal but also often complex, so despite these advances, agents still need to understand the uncertainty in their actions and seek instructions to navigate. To this end, we present AVLEN, an interactive agent for Audio-Visual-Language Embodied Navigation. As in audio-visual navigation tasks, the goal of our embodied agent is to localize an audio event by navigating the 3D visual world; however, the agent may also seek help from a human (oracle), where the assistance is provided in free-form natural language. To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone that learns: (a) high-level policies that choose either to use audio cues for navigation or to query the oracle, and (b) lower-level policies that select navigation actions based on its audio-visual and language inputs.

AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments, S. Paul*, A. Roy-Chowdhury, A. Cherian*, Neural Information Processing Systems (NeurIPS), 2022


  • Blackbox adversarial attacks can be categorized into transfer-based and query-based attacks. Transfer-based methods do not require any feedback from the victim model, but achieve lower success rates than query-based methods, which in turn often require a large number of queries to succeed. To achieve the best of both approaches, recent efforts have tried to combine them, but they still require hundreds of queries to achieve high success rates (especially for targeted attacks). In this paper, we propose a novel method for blackbox attacks via surrogate ensemble search (BASES) that can generate highly successful blackbox attacks using an extremely small number of queries. We first define a perturbation machine that generates a perturbed image by minimizing a weighted loss function over a fixed set of surrogate models. To generate an attack for a given victim model, we then search over the weights in the loss function, querying the victim with the images generated by the perturbation machine.

Blackbox Attacks via Surrogate Ensemble Search, Z. Cai, C. Song, S. V. Krishnamurthy, A. Roy-Chowdhury, M. S. Asif, Neural Information Processing Systems (NeurIPS), 2022
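
The two-level structure described for BASES (an inner perturbation machine that minimizes a weighted surrogate loss, and an outer search over the weights driven by victim queries) can be illustrated with a toy sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the surrogates and the victim are small linear classifiers, the loss is a targeted cross-entropy, the inner solver is signed-gradient descent in an L-infinity ball, the outer search is a simple coordinate-wise step, and all function names (`perturbation_machine`, `search_weights`, `linear_victim`) are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    # W is a list of rows (one per class); x is a flat feature vector.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def loss_and_grad(W, x, target):
    """Targeted cross-entropy of a toy linear classifier and its gradient w.r.t. x."""
    p = softmax(logits(W, x))
    err = [pc - (1.0 if c == target else 0.0) for c, pc in enumerate(p)]
    grad = [sum(err[c] * W[c][d] for c in range(len(W))) for d in range(len(x))]
    return -math.log(p[target] + 1e-12), grad

def perturbation_machine(x, surrogates, weights, target, eps, step=0.05, iters=40):
    """Signed-gradient descent on the weighted surrogate loss inside an L-inf ball."""
    delta = [0.0] * len(x)
    for _ in range(iters):
        x_cur = [a + b for a, b in zip(x, delta)]
        grads = [loss_and_grad(W, x_cur, target)[1] for W in surrogates]
        for d in range(len(x)):
            g = sum(w * gr[d] for w, gr in zip(weights, grads))
            sign = (g > 0) - (g < 0)
            delta[d] = max(-eps, min(eps, delta[d] - step * sign))
    return [a + b for a, b in zip(x, delta)]

def linear_victim(W, target):
    """Hypothetical score-based victim: returns (targeted loss, predicted label)."""
    def query(x):
        p = softmax(logits(W, x))
        return -math.log(p[target] + 1e-12), max(range(len(p)), key=p.__getitem__)
    return query

def search_weights(x, surrogates, victim_query, target, eps=0.5, lr=0.2, max_queries=10):
    """Coordinate-wise search over the loss weights; each candidate costs one query."""
    n = len(surrogates)
    weights = [1.0 / n] * n
    x_adv = perturbation_machine(x, surrogates, weights, target, eps)
    best_loss, label = victim_query(x_adv)
    queries, idx = 1, 0
    while label != target and queries < max_queries:
        for sign in (1.0, -1.0):
            cand = list(weights)
            cand[idx] = max(cand[idx] + sign * lr, 0.0)
            total = sum(cand)
            cand = [c / total for c in cand]
            x_c = perturbation_machine(x, surrogates, cand, target, eps)
            loss, lab = victim_query(x_c)
            queries += 1
            if loss < best_loss:  # keep the candidate only if the victim loss drops
                weights, x_adv, best_loss, label = cand, x_c, loss, lab
            if label == target or queries >= max_queries:
                break
        idx = (idx + 1) % n
    return x_adv, weights, queries
```

In this sketch the victim is touched only through `victim_query`, so the query budget is explicit: `search_weights(x, surrogates, linear_victim(W_victim, target), target)` returns the perturbed input, the final weights, and the number of queries spent. The search space has one dimension per surrogate model rather than one per pixel, which is why a search of this kind can plausibly get by with few queries.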