Devi Parikh is an Assistant Professor in the Bradley Department of Electrical and Computer Engineering at Virginia Tech (VT) and an Allen Distinguished Investigator in Artificial Intelligence. She leads the Computer Vision Lab at VT, and is also a member of the Virginia Center for Autonomous Systems (VaCAS) and the VT Discovery Analytics Center (DAC).
Prior to this, she was a Research Assistant Professor at the Toyota Technological Institute at Chicago (TTIC), an academic computer science institute affiliated with the University of Chicago. She has held visiting positions at Cornell University, the University of Texas at Austin, Microsoft Research, MIT, and Carnegie Mellon University. She received her M.S. and Ph.D. degrees from the Electrical and Computer Engineering department at Carnegie Mellon University in 2007 and 2009, respectively. She received her B.S. in Electrical and Computer Engineering from Rowan University in 2005.
Her research interests include computer vision, pattern recognition, and AI in general, and visual recognition problems in particular. Her recent work involves leveraging human-machine collaboration to build smarter machines, and exploring problems at the intersection of vision and language. She has also worked on other topics such as ensembles of classifiers, data fusion, inference in probabilistic models, 3D reassembly, barcode segmentation, computational photography, interactive computer vision, contextual reasoning, hierarchical representations of images, and human-debugging.
She is a recipient of an NSF CAREER award, a Sloan Research Fellowship, an Office of Naval Research (ONR) Young Investigator Program (YIP) award, an Army Research Office (ARO) Young Investigator Program (YIP) award, an Allen Distinguished Investigator Award in Artificial Intelligence from the Paul G. Allen Family Foundation, three Google Faculty Research Awards, an Outstanding New Assistant Professor award from the College of Engineering at Virginia Tech, and a Marr Best Paper Prize awarded at the International Conference on Computer Vision (ICCV).
Wouldn't it be nice if machines could understand the content of images and communicate this understanding as effectively as humans? Such technology would be immensely powerful, be it for helping a visually impaired user navigate a world built by the sighted, assisting an analyst in extracting relevant information from a surveillance feed, educating a child playing a game on a touch screen, providing information to a spectator at an art gallery, or interacting with a robot. As computer vision and natural language processing techniques mature, we are closer to achieving this dream than we have ever been.
In this talk, I will describe the task, our dataset (the largest and most complex of its kind), and our model for free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image (e.g., “What kind of store is this?”, “How many people are waiting in the queue?”, “Is it safe to cross the street?”), the machine’s task is to automatically produce an accurate natural language answer (“bakery”, “5”, “Yes”). We have collected and released a dataset containing >250,000 images, >760,000 questions, and ~10 million answers. Our dataset is enabling the next generation of AI systems, often based on deep learning techniques, that understand images and language and perform complex reasoning.
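To make the "accurate answer" criterion concrete: the released VQA dataset collects multiple human answers per question and scores a candidate answer by consensus with those annotators. A minimal sketch of that style of metric (the function name and sample data here are illustrative, not the talk's code):

```python
def vqa_accuracy(predicted, human_answers):
    """Consensus-style VQA accuracy: a predicted answer earns full
    credit if at least 3 human annotators gave the same answer,
    and partial credit (matches / 3) otherwise."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Illustrative: ten human answers collected for "What kind of store is this?"
answers = ["bakery"] * 7 + ["shop", "store", "bread shop"]
print(vqa_accuracy("bakery", answers))  # full credit: 7 annotators agree
print(vqa_accuracy("store", answers))   # partial credit: 1 annotator agrees
```

Averaging this score over all questions rewards answers that most humans would give, while still giving partial credit for plausible minority answers.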
I am an Associate Professor in the Department of Computer, Control and Management Engineering of the University of Rome La Sapienza, where I lead the Visual Learning and Multimodal Applications Laboratory (VANDAL). My main research interest is developing algorithms for the learning, recognition, and categorization of visual and multimodal patterns for artificial autonomous systems. These capabilities are crucial to enable robots to represent and understand their surroundings, to learn and reason about them, and ultimately to equip them with cognitive capabilities. My research is sponsored by the Swiss National Science Foundation (SNSF), the Italian Ministry for Education, University and Research (MIUR), the European Commission (EC), and the European Research Council (ERC). I have published more than 90 papers in the areas of visual recognition, multimodal and open-ended learning, semantic spatial modeling, and adaptive control of prosthetic hands. My work has more than 6000 citations (Google Scholar Profile), and I have served as area chair and keynote speaker at several conferences and events in the fields of Computer Vision, Machine Learning, and Robotics.
Robots are meant to operate in the real world. However, even the best system we can engineer today is bound to fail whenever the setting is not heavily constrained. This is because the real world is generally too nuanced and unpredictable to be summarized within a limited set of specifications. There will inevitably be novel situations, and the system will always have gaps or ambiguities in its own knowledge. This calls for robots able to learn continuously over time. In this talk I will focus mainly on the lifelong learning of perceptual and semantic object knowledge, i.e., (a) knowledge about the visual appearance of objects, which the robot needs in order to recognize and localize objects in its own environment, and (b) knowledge about properties that directly affect how an object should be manipulated, where it should be found, and where it should be placed. I will show how issues like domain adaptation and transfer learning are vital to the robot vision community, and will present results that aim to bridge these two communities, with the goal of achieving lifelong learning of visual patterns in autonomous systems.
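The open-ended ingredient above, adding knowledge of new object categories as they appear rather than retraining from scratch, can be sketched with a toy incremental classifier. Everything here is illustrative (not the speaker's system); a real robot would feed it features from a pretrained network, the usual transfer-learning setup:

```python
class NearestMeanLearner:
    """Toy open-ended learner: one running mean per object class.
    New classes can be added at any time, and updates are incremental,
    so nothing learned earlier is discarded."""

    def __init__(self):
        self.sums = {}    # label -> elementwise sum of seen features
        self.counts = {}  # label -> number of examples seen

    def learn(self, label, feature):
        # feature: a list of floats (e.g., a CNN embedding of an object view)
        if label not in self.sums:
            self.sums[label] = [0.0] * len(feature)
            self.counts[label] = 0
        self.sums[label] = [s + f for s, f in zip(self.sums[label], feature)]
        self.counts[label] += 1

    def predict(self, feature):
        # Return the class whose mean feature is closest (squared distance).
        def dist(label):
            mean = [s / self.counts[label] for s in self.sums[label]]
            return sum((f - m) ** 2 for f, m in zip(feature, mean))
        return min(self.sums, key=dist)

learner = NearestMeanLearner()
learner.learn("mug", [1.0, 0.0])
learner.learn("bottle", [0.0, 1.0])
learner.learn("plate", [1.0, 1.0])     # a new class, added later, no retraining
```

The sketch sidesteps the hard problems the talk addresses (domain shift between training and deployment conditions, deciding when an observation is truly a new category), but it shows the shape of a learner whose knowledge grows over the robot's lifetime.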