ViperGPT: Visual Inference via Python Execution for Reasoning

Dídac Surís*, Sachit Menon*, Carl Vondrick
*Equal contribution
Columbia University
https://viper.cs.columbia.edu/

ViperGPT decomposes visual queries into interpretable steps.

Abstract

Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.

Logical Reasoning

ViperGPT can perform logic operations because it directly executes Python code.

Spatial Understanding

We show ViperGPT's spatial understanding.

Knowledge

ViperGPT can access the knowledge of large language models.

Consistency

ViperGPT answers similar questions with consistent reasoning.

Math

ViperGPT can count, and divide. All using Python.

Attributes

We show some ViperGPT examples involving attributes.

Relational Reasoning

Reasoning about relations.

Negation

Negation is programmatic, not neural.
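
The sections above note that logic, counting, division, and negation are handled by executing Python rather than by a neural model. The sketch below is only an illustration of that idea, not ViperGPT's actual API or generated output: `ToyPatch`, its `find` and `exists` methods, and the two query functions are hypothetical names, and the "detections" are hard-coded counts standing in for what vision modules would return.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPatch:
    """Hypothetical stand-in for an image region; not ViperGPT's real API."""

    # Maps an object name to the number of instances "detected" in this patch.
    # In a real system these counts would come from a vision model.
    detections: dict = field(default_factory=dict)

    def find(self, name: str) -> list:
        """Return one placeholder entry per detected instance of `name`."""
        return [name] * self.detections.get(name, 0)

    def exists(self, name: str) -> bool:
        """True if at least one instance of `name` was detected."""
        return len(self.find(name)) > 0


def muffins_per_kid(patch: ToyPatch) -> int:
    """Count two object types, then divide with ordinary Python arithmetic."""
    muffins = patch.find("muffin")
    kids = patch.find("kid")
    if not kids:
        return 0  # avoid dividing by zero when no kids are detected
    return len(muffins) // len(kids)


def has_cat_but_no_dog(patch: ToyPatch) -> bool:
    """Logic and negation expressed with Python's own `and` / `not`."""
    return patch.exists("cat") and not patch.exists("dog")


if __name__ == "__main__":
    # Assumed toy scene: 8 muffins, 2 kids, 1 cat, no dog.
    scene = ToyPatch(detections={"muffin": 8, "kid": 2, "cat": 1})
    print(muffins_per_kid(scene))
    print(has_cat_but_no_dog(scene))
```

Because the arithmetic and the `not` are plain Python, each intermediate value (the two counts, the two existence checks) can be inspected directly, which is the interpretability benefit the page describes.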

BibTeX

@article{surismenon2023vipergpt,
  author  = {Sur\'is, D\'idac and Menon, Sachit and Vondrick, Carl},
  title   = {ViperGPT: Visual Inference via Python Execution for Reasoning},
  journal = {arXiv preprint arXiv:2303.08128},
  year    = {2023},
}

This research is based on work partially supported by the DARPA MCS program under Federal Agreement No. N660011924032 and the NSF CAREER Award #2046910. DS is supported by the Microsoft PhD Fellowship and SM is supported by the NSF GRFP.