We propose a full-day workshop, "Machine Learning for Autonomous Driving" (ML4AD), as a venue for machine learning (ML) researchers to discuss research problems concerning autonomous driving (AD). Our goal is to promote ML research, and its real-world impact, on self-driving technologies. Full self-driving capability ("Level 5") remains far from solved and is too complex for any one institution or company to achieve alone; it demands the larger-scale communication and collaboration that we believe workshop formats help provide.
We propose a large-attendance talk format of approximately 500 attendees, including: (1) a call for papers with poster sessions and spotlight presentations; (2) keynote talks to communicate the state of the art; (3) panel debates to discuss future research directions; (4) a challenge to encourage interaction around a common benchmark task; and (5) social breaks for newer researchers to network and meet others.
Mon 7:50 a.m. - 8:00 a.m. | Opening Remarks | Xinshuo Weng
Mon 8:00 a.m. - 8:30 a.m. | Reinforcement Learning for Autonomous Driving (Keynote Talk) | Jeff Schneider
Mon 8:30 a.m. - 8:40 a.m. | Q&A: Jeff Schneider (Live Q/A)
Mon 8:40 a.m. - 9:10 a.m. | AV2.0: Deploying End to End Deep Learning Policies at Fleet Scale (Keynote Talk) | Alex Kendall
Mon 9:10 a.m. - 9:20 a.m. | Q&A: Alex Kendall (Live Q/A)
Mon 9:20 a.m. - 9:30 a.m. | (Best Paper) UMBRELLA: Uncertainty-Aware Model-Based Offline Reinforcement Learning Leveraging Planning (Oral) | Christopher Diehl
Mon 9:30 a.m. - 10:30 a.m. | Poster Session and Social
Mon 10:30 a.m. - 11:00 a.m. | Break
Mon 11:00 a.m. - 11:30 a.m. | Physics-Guided AI for Modeling Autonomous Vehicle Dynamics (Keynote Talk) | Rose Yu
Mon 11:30 a.m. - 11:40 a.m. | Q&A: Rose Yu (Live Q/A)
Mon 11:40 a.m. - 12:10 p.m. | The Ongoing Research in the University of Michigan & Ford Center for Autonomous Vehicles (FCAV) (Keynote Talk) | Matthew Johnson-Roberson
Mon 12:10 p.m. - 12:20 p.m. | Q&A: Matthew Johnson-Roberson (Live Q/A)
Mon 12:20 p.m. - 1:20 p.m. | CARLA Challenge (Challenge) | German Ros

The CARLA Autonomous Driving Challenge 2021 is organized as part of the Machine Learning for Autonomous Driving Workshop at NeurIPS 2021. The competition is open to any participant from academia and industry; you only need to sign up on the CARLA AD Leaderboard, providing your team name and institution. The challenge leverages the existing CARLA AD Leaderboard platform to evaluate submissions. Its main goal remains the assessment of the driving proficiency of autonomous agents in realistic traffic situations, as defined in the leaderboard mechanics. Teams must complete 10 routes in 2 weather conditions over 5 repetitions, for a total of 173 km of driving experience. The challenge follows the same structure and rules defined for the CARLA AD Leaderboard. You can participate in either of the two available tracks, SENSORS or MAP, using the canonical sensors available for the challenge.
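
For teams preparing a submission, the sketch below outlines the general shape of a leaderboard agent. It is a schematic illustration only: the module paths, the `get_entry_point` hook, and the sensor fields follow the public CARLA Leaderboard repository as we understand it, and may differ between leaderboard versions.

```python
# Schematic sketch of a CARLA Leaderboard submission agent. Module paths and
# field names follow the public leaderboard repository at the time of writing
# and may differ between versions; treat this as an illustration, not a
# reference implementation of any challenge entry.
import carla
from leaderboard.autoagents.autonomous_agent import AutonomousAgent, Track


def get_entry_point():
    # The leaderboard discovers the agent class through this hook.
    return "MyAgent"


class MyAgent(AutonomousAgent):
    def setup(self, path_to_conf_file):
        # SENSORS track: raw sensor access only; Track.MAP additionally
        # exposes an HD map.
        self.track = Track.SENSORS

    def sensors(self):
        # Declare the sensor suite; the ids key into input_data below.
        return [
            {"type": "sensor.camera.rgb", "x": 0.7, "y": 0.0, "z": 1.6,
             "roll": 0.0, "pitch": 0.0, "yaw": 0.0,
             "width": 800, "height": 600, "fov": 100, "id": "front_cam"},
            {"type": "sensor.speedometer", "reading_frequency": 20, "id": "speed"},
        ]

    def run_step(self, input_data, timestamp):
        # input_data maps sensor ids to (frame, data) tuples; return a control.
        _, image = input_data["front_cam"]
        control = carla.VehicleControl(throttle=0.3, steer=0.0, brake=0.0)
        return control
```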

Mon 1:20 p.m. - 1:30 p.m. | Break
Mon 1:30 p.m. - 2:00 p.m. | Fantastic Failures and Where to Find Them: Designing Safe, Robust Autonomy (Keynote Talk) | Katherine Driggs-Campbell
Mon 2:00 p.m. - 2:10 p.m. | Q&A: Katie Driggs-Campbell (Live Q/A)
Mon 2:10 p.m. - 2:40 p.m. | Safely Learning Behaviors of Other Agents (Keynote Talk) | Claire Tomlin
Mon 2:40 p.m. - 2:50 p.m. | Q&A: Claire Tomlin (Live Q/A)
Mon 2:50 p.m. - 3:00 p.m. | Spotlight Talks (Oral)
Mon 3:00 p.m. - 4:00 p.m. | Poster Session and Social
Mon 4:00 p.m. - 4:30 p.m. | Learning Driving Agents from Simulation (Keynote Talk) | Mark Palatucci
Mon 4:30 p.m. - 4:40 p.m. | Q&A: Mark Palatucci (Live Q/A)
Mon 4:40 p.m. - 5:10 p.m. | Autonomous Vehicle Decision-Making Policy Fast Adaptation Using Meta Reinforcement Learning (Keynote Talk) | Songan Zhang
Mon 5:10 p.m. - 5:20 p.m. | Q&A: Songan Zhang (Live Q/A)
Mon 5:20 p.m. - 5:50 p.m. | Robotics for an ML-Driven World (Keynote Talk) | Sarah Tang
Mon 5:50 p.m. - 6:00 p.m. | Q&A: Sarah Tang (Live Q/A)

Mon 6:00 p.m. - 6:20 p.m. | Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks (Challenge) | Andrey Malinin

While much research has been done on developing methods for improving robustness to distributional shift and uncertainty estimation, limited work has examined developing standard datasets and benchmarks for assessing these approaches. Moreover, most of these methods were developed only for small-scale regression or image classification tasks. However, many tasks of practical interest have different modalities, such as tabular data, audio, text, or sensor data, which offer significant challenges involving regression and discrete or continuous structured prediction. Given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shifts is necessary. This will enable researchers to meaningfully evaluate the plethora of recently developed uncertainty quantification methods, assessment criteria, and baselines. In this work, we propose the Shifts Dataset for evaluating uncertainty estimates and robustness to distributional shift. The dataset, collected from industrial sources and services, is composed of three tasks, each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, 'in-the-wild' distributional shifts and pose interesting challenges with respect to uncertainty estimation.

Mon 6:20 p.m. - 6:30 p.m. | Closing Remarks | Rowan McAllister

AA3DNet: Attention Augmented Real Time 3D Object Detection (Poster) | Abhinav Sagar

In this work, we address the problem of 3D object detection from point cloud data in real time. For autonomous vehicles to work, it is very important for the perception component to detect real-world objects with both high accuracy and fast inference. We propose a novel neural network architecture, along with the training and optimization details, for detecting 3D objects using point cloud data. We present our anchor design and the custom loss functions used in this work, and employ a combination of spatial and channel-wise attention modules. We use the KITTI 3D Bird's Eye View dataset for benchmarking and validating our results. Our method surpasses the previous state of the art in this domain in both average precision and speed, running at over 30 FPS. Finally, we present an ablation study to demonstrate that the performance of our network generalizes, making it a feasible option for deployment in real-time applications like self-driving cars.

UMBRELLA: Uncertainty-Aware Model-Based Offline Reinforcement Learning Leveraging Planning (Poster) | Christopher Diehl

Offline reinforcement learning (RL) provides a framework for learning decision making from offline data and therefore constitutes a promising approach for real-world applications such as automated driving. Self-driving vehicles (SDVs) learn a policy that can potentially even outperform the behavior in the sub-optimal data set. Especially in safety-critical applications such as automated driving, explainability and transferability are key to success. This motivates the use of model-based offline RL approaches, which leverage planning. However, current state-of-the-art methods often neglect the influence of aleatoric uncertainty arising from the stochastic behavior of multi-agent systems. This work proposes a novel approach for Uncertainty-aware Model-Based Offline REinforcement Learning Leveraging plAnning (UMBRELLA), which solves the prediction, planning, and control problem of the SDV jointly in an interpretable learning-based fashion. A trained action-conditioned stochastic dynamics model captures distinctively different future evolutions of the traffic scene. Our analysis provides empirical evidence for the effectiveness of the approach in challenging automated driving simulations and on a real-world public dataset.

Compressing Sensor Data for Remote Assistance of Autonomous Vehicles using Deep Generative Models (Poster) | Daniel Bogdoll · Marius Zöllner

In the foreseeable future, autonomous vehicles will require human assistance in situations they cannot resolve on their own. In such scenarios, remote assistance from a human can provide the input the vehicle needs to continue its operation. Typical sensors used in autonomous vehicles include camera and lidar sensors. Due to the massive volume of sensor data that must be sent in real time, highly efficient data compression is essential to prevent an overload of the network infrastructure. Sensor data compression using deep generative neural networks has been shown to outperform traditional compression approaches for both image and lidar data, in compression rate as well as reconstruction quality. However, there is a lack of research on the performance of generative-neural-network-based compression algorithms for remote assistance. To gain insights into the feasibility of deep generative models for remote assistance, we evaluate state-of-the-art algorithms regarding their applicability and identify potential weaknesses. Further, we implement an online pipeline for processing sensor data and demonstrate its performance for remote assistance using the CARLA simulator.

Efficient Unknown Object Detection with Discrepancy Networks for Semantic Segmentation (Poster) | Ryo Kamoi · Takumi Iida · Kaname Tomite

The detection of unknown objects such as lost cargo is a required ability for self-driving cars. This is the first work focusing on reducing the computational cost of discrepancy networks for unknown object detection on monocular camera images. We propose an efficient discrepancy network based solely on semantic segmentation, which has 50% fewer parameters and 140% faster inference than an existing method, while improving detection performance by a large margin. In a major departure from prior work, we remove GANs from discrepancy networks. While previous studies have used GANs as a necessary component, our GAN-free model outperforms them. We improve detection performance by analyzing the properties of intermediate representations and by introducing feature selection and deep supervision. Our experiments on three datasets for obstacle detection show a significant improvement of more than 5% in AUROC.

Self-supervised Sun Glare Detection CNN for Self-aware Autonomous Driving (Poster) | Yiqiang CHEN · Feng Liu

Fully autonomous driving systems that must work under any circumstances need to detect degradation of their perception models and be aware of the robustness of these algorithms to different weather conditions. Sun glare has always been an issue for manual driving and is becoming a real problem for autonomous driving as well, since it can obstruct critical information in the overexposed region. Ignoring it and letting the algorithms work on corrupted camera images can lead to fatal consequences. To achieve this self-awareness, we propose a sun glare detection approach and a benchmark of robustness to sun glare corruption based on glare rendering. In the benchmark, different severity levels of glare are added to assess the vulnerability of CNN detectors. With the help of self-supervised learning, our detection approach tackles the problem of glare data collection and annotation. Online glare synthesis lets the CNN train on varied and diverse data, which makes the model robust and easy to generalize. We experimentally show that our method outperforms state-of-the-art methods.

Meta Guided Metric Learner for Overcoming Class Confusion in Few-Shot Road Object Detection (Poster) | Anay Majee · Anbumani Subramanian · Kshitij Agrawal

Localization and recognition of less-occurring road objects has been a challenge in autonomous driving applications due to the scarcity of data samples. Few-Shot Object Detection (FSOD) techniques extend the knowledge from existing base object classes to learn novel road objects given few training examples. Popular FSOD techniques adopt either meta- or metric-learning approaches, which are prone to class confusion and base-class forgetting. In this work, we introduce a novel Meta Guided Metric Learner (MGML) to overcome class confusion in FSOD. We re-weight the features of the novel classes higher than the base classes through a novel Squeeze-and-Excite module and encourage the learning of truly discriminative class-specific features by applying an orthogonality constraint to the meta learner. Our method outperforms state-of-the-art (SoTA) approaches in FSOD on the India Driving Dataset (IDD) by up to 11 mAP points, while suffering the least class confusion of 20% given only 10 examples of each novel road object. We further show similar improvements on the few-shot splits of the PASCAL VOC dataset, where we outperform SoTA approaches by up to 5.8 mAP across all splits.
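
As a concrete illustration of the orthogonality constraint mentioned above, the sketch below shows one common formulation, penalizing the off-diagonal entries of the Gram matrix of class weight vectors; the exact formulation used in MGML may differ.

```python
# A minimal sketch of one common orthogonality constraint, of the kind the
# abstract applies to the meta learner (MGML's exact formulation may differ).
# Penalizing ||W W^T - I||_F^2 encourages class-specific weight vectors to be
# mutually orthogonal, i.e. truly discriminative.
import torch
import torch.nn.functional as F

def orthogonality_loss(W: torch.Tensor) -> torch.Tensor:
    """W: (num_classes, feat_dim) matrix of class-specific weight vectors."""
    W = F.normalize(W, dim=1)                     # unit-norm rows
    gram = W @ W.t()                              # pairwise cosine similarities
    eye = torch.eye(W.size(0), device=W.device)
    return ((gram - eye) ** 2).sum()              # squared Frobenius norm

# Usage: total_loss = detection_loss + lam * orthogonality_loss(classifier.weight)
```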

Watch out for the risky actors: Assessing risk in dynamic environments for safe driving (Poster) | Saurabh Jha · Yan Miao · Zbigniew Kalbarczyk · Ravishankar Iyer

Driving in a dynamic environment that consists of other actors is inherently a risky task, as each actor influences the driving decision and may significantly limit the number of choices in terms of navigation and safety planning. The total risk encountered by the ego actor depends on the driving scenario and the uncertainty associated with predicting the future trajectories of the other actors in that scenario. However, not all objects pose similar risk. Depending on an object's type, trajectory, position, and the uncertainty associated with these quantities, some objects pose much higher risk than others. The higher the risk associated with an actor, the more attention must be directed towards that actor in terms of resources and safety planning. In this paper, we propose a novel risk metric to calculate the importance of each actor in the world and demonstrate its usefulness through a case study.

A Step Towards Efficient Evaluation of Complex Perception Tasks in Simulation (Poster) | Jonathan Sadeghi · Blaine Rogers · Sina Samangooei · Puneet Dokania · John Redford

There has been increasing interest in characterising the error behaviour of systems that contain deep learning models before deploying them into any safety-critical scenario. However, characterising such behaviour usually requires large-scale testing of the model, which can be extremely computationally expensive for complex real-world tasks, for example tasks involving compute-intensive object detectors as one of their components. In this work, we propose an approach that enables efficient large-scale testing using simplified low-fidelity simulators and without the computational cost of executing expensive deep learning models. Our approach relies on designing an efficient surrogate model corresponding to the compute-intensive components of the task under test. We demonstrate the efficacy of our methodology by evaluating the performance of an autonomous driving task in the CARLA simulator with reduced computational expense, training efficient surrogate models for the PIXOR and CenterPoint LiDAR detectors while demonstrating that the accuracy of the simulation is maintained.

Temporal Transductive Inference for Few-Shot Video Object Segmentation (Poster) | Mennatullah Siam · Richard Wildes

Few-shot object segmentation has focused on segmenting static images in the query set. Recently, few-shot video object segmentation (FS-VOS), where the query images to be segmented belong to a video, has been introduced but remains under-explored. We propose a simple but effective temporal transductive inference (TTI) approach that uses the temporal continuity in videos to improve segmentation with a few-shot support set. We use both global and local cues: global cues focus on learning a consistent prototype at the sequence level, whereas local cues focus on a consistent foreground/background region proportion within a local temporal window. Our model outperforms its state-of-the-art attention-based counterpart on few-shot YouTube-VIS by 2% in mean intersection over union (mIoU). Finally, we propose a more realistic FS-VOS setup that operates cross-domain. Our method outperforms the transductive inference baseline that uses static images by 1.3% on two different benchmarks. This demonstrates that our method is a promising direction and opens the door to a label-efficient approach for annotating video datasets with rare classes that occur in robotics settings such as autonomous driving.

Spatial-Temporal Gated Transformers for Efficient Video Processing (Poster) | Yawei Li · Babak Ehteshami Bejnordi · Bert Moons · Tijmen Blankevoort · Amirhossein Habibian · Radu Timofte · Luc V Gool

We focus on the problem of efficient video stream processing with fully transformer-based architectures. Recent advances brought by transformers on image-based tasks have inspired interest in applying transformers to videos. Yet, when applying image-based transformer solutions to videos, the computation becomes inefficient due to the redundant information in adjacent video frames. An analysis of the computation cost of the video object detection framework DETR identifies the linear layers as the major computation bottleneck. Thus, we propose dynamic gating layers to perform conditional computation: with the generated binary or ternary gates, it is possible to skip the computation for the stable background tokens in the video frames. Experimental results validate the effectiveness of the dynamic gating mechanism for transformers. For video object detection, FLOPs can be reduced by 48.3% without a significant drop in accuracy.
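
The sketch below illustrates the per-token conditional computation idea behind such gating layers, using a straight-through binary gate; the paper's actual gate design, ternary variant, and training objective are not reproduced here.

```python
# A minimal sketch of per-token conditional computation with a binary gate,
# in the spirit of the gating layers described above (the paper's gate design
# and training objective may differ). Tokens gated off reuse their previous-
# frame features instead of recomputing the linear layer.
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)  # predicts a keep/skip logit per token

    def forward(self, x: torch.Tensor, prev_out: torch.Tensor) -> torch.Tensor:
        # x, prev_out: (batch, tokens, dim); prev_out is last frame's output.
        logits = self.gate(x)
        # Straight-through estimator: hard 0/1 gate forward, soft gradient back.
        soft = torch.sigmoid(logits)
        hard = (soft > 0.5).float()
        g = hard + soft - soft.detach()
        # At inference, self.linear need only run on tokens where g == 1;
        # here we compute densely for clarity.
        return g * self.linear(x) + (1 - g) * prev_out
```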

How Far Can I Go?: A Self-Supervised Approach for Deterministic Video Depth Forecasting (Poster) | Sauradip Nag · Nisarg Shah · Raghavendra Ramachandra

In this paper we present a novel self-supervised method to anticipate the depth estimate for a future, unobserved real-world urban scene. This work is the first to explore self-supervised learning for estimating monocular depth of future unobserved frames of a video. Existing works rely on a large number of annotated samples to generate probabilistic depth predictions for unseen frames, which is unrealistic given the amount of annotated video depth data required. In addition, the probabilistic setting, where one past can have multiple future outcomes, often leads to incorrect depth estimates. Unlike previous methods, we model depth estimation of the unobserved frame as a view-synthesis problem, which treats the depth estimate of the unseen video frame as an auxiliary task while synthesizing back the views using learned pose. This approach is not only cost-effective (we use no ground-truth depth for training, hence practical) but also deterministic (a sequence of past frames maps to an immediate future). To address this task we first develop a novel depth forecasting network, DeFNet, which estimates the depth of the unobserved future by forecasting latent features. Second, we develop a channel-attention-based pose estimation network that estimates the pose of the unobserved frame. Using this learned pose, the estimated depth map is reconstructed back into the image domain, forming a self-supervised solution. Our proposed approach shows significant improvements of 5%/8% in the Abs Rel metric compared to state-of-the-art alternatives on short- and mid-term forecasting settings, benchmarked on KITTI and Cityscapes.

A Scenario-Based Platform for Testing Autonomous Vehicle Behavior Prediction Models in Simulation (Poster) | Francis Indaheng · Edward Kim · Kesav Viswanadha · Jay Shenoy · Jinkyu Kim · Daniel Fremont · Sanjit Seshia

Behavior prediction remains one of the most challenging tasks in the autonomous vehicle (AV) software stack. Forecasting the future trajectories of nearby agents plays a critical role in ensuring road safety, as it equips AVs with the information necessary to plan safe routes of travel. However, these prediction models are data-driven and trained on data collected in real life that may not represent the full range of scenarios an AV can encounter. Hence, it is important that these prediction models are extensively tested on various scenarios involving interactive behaviors prior to deployment. To support this need, we present a simulation-based testing platform which supports (1) intuitive scenario modeling with a probabilistic programming language called Scenic, (2) specifying a multi-objective evaluation metric with a partial priority ordering, (3) falsification of the provided metric, and (4) parallelization of simulations for scalable testing. As part of the platform, we provide a library of 25 Scenic programs that model challenging test scenarios involving interactive traffic participant behaviors. We demonstrate the effectiveness and scalability of our platform by testing a trained behavior prediction model and searching for failure scenarios.

TITRATED: Learned Human Driving Behavior without Infractions via Amortized Inference (Poster) | Vasileios Lioutas · Adam Scibior · Frank Wood

Learned models of human driving behavior have long been used for prediction in autonomous vehicles, but recently have also started being used to create non-playable characters for driving simulations. While such models are in many respects realistic, they tend to suffer from unacceptably high rates of driving infractions, such as collisions or off-road driving, particularly when deployed in locations not covered by the training dataset. In this paper we present a novel method for fine-tuning a model of human driving behavior in novel locations where human demonstrations are not available, which reduces the incidence of such infractions. The method relies on conditional sampling from the learned model given the absence of infractions, with extra infraction penalties applied at training time, and can be regarded as a form of amortized inference. We evaluate it using the ITRA model trained on the INTERACTION dataset and transferred to CARLA. We demonstrate around a 70% reduction in the infraction rate at a modest computational cost and provide evidence that further gains are possible with more computation or better inference algorithms.

PolyTrack: Tracking with Bounding Polygons (Poster) | Gaspar Faure · Hughes Perreault · Guillaume-Alexandre Bilodeau · Nicolas Saunier

In this paper, we present a novel method called PolyTrack for fast multi-object tracking and segmentation using bounding polygons. PolyTrack detects objects by producing heatmaps of their center keypoints. For each object, a rough segmentation is produced by computing a bounding polygon over each instance instead of the traditional bounding box. Tracking is done by taking two consecutive frames as input and computing a center offset for each object detected in the first frame to predict its location in the second frame. A Kalman filter is also applied to reduce the number of ID switches. Since our target application is automated driving systems, we apply our method to urban environment videos. We trained and evaluated PolyTrack on the MOTS and KITTIMOTS datasets. Results show that tracking polygons can be a good alternative to bounding-box and mask tracking.
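
A minimal sketch of the center-offset association step described above follows; PolyTrack's full pipeline (heatmap decoding, polygon regression, Kalman updates) is considerably richer, so treat this purely as an illustration of the matching mechanism.

```python
# A minimal sketch of center-offset association: each object detected at frame
# t carries a predicted 2D offset to its center at frame t+1, and tracks are
# matched greedily to the nearest new detection. PolyTrack's actual matching
# and its Kalman-filter smoothing are more involved.
import numpy as np

def associate(track_centers, track_offsets, det_centers, max_dist=50.0):
    """Greedy nearest-neighbor matching of propagated tracks to detections.

    track_centers, track_offsets: (T, 2) arrays; det_centers: (D, 2) array.
    Returns a list of (track_idx, det_idx) matches.
    """
    if len(det_centers) == 0:
        return []
    predicted = track_centers + track_offsets      # where each track should be
    matches, used = [], set()
    for t_idx, p in enumerate(predicted):
        dists = np.linalg.norm(det_centers - p, axis=1)
        d_idx = int(np.argmin(dists))
        if dists[d_idx] < max_dist and d_idx not in used:
            matches.append((t_idx, d_idx))
            used.add(d_idx)
    return matches
```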

DriverGym: Democratising Reinforcement Learning for Autonomous Driving (Poster) | Parth Kothari · Christian Perone · Luca Bergamini · Alexandre Alahi · Peter Ondruska

Despite promising progress in reinforcement learning (RL), developing algorithms for autonomous driving (AD) remains challenging; one of the critical issues is the absence of an open-source platform capable of training and effectively validating RL policies on real-world data. We propose DriverGym, an open-source, OpenAI Gym-compatible environment specifically tailored for developing RL algorithms for autonomous driving. DriverGym provides access to more than 1000 hours of expert-logged data and also supports reactive and data-driven agent behavior. The performance of an RL policy can be easily validated on real-world data using our extensive and flexible closed-loop evaluation protocol. We also provide behavior cloning baselines, trained in DriverGym using supervised learning and RL. We make the DriverGym code, as well as all the baselines, publicly available to further stimulate development from the community.
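
Because DriverGym is OpenAI Gym-compatible, interacting with it follows the standard Gym loop, sketched below. The environment id here is hypothetical; consult the DriverGym repository for the real registration name and any configuration arguments.

```python
# A minimal sketch of the standard OpenAI Gym interaction loop that any
# Gym-compatible environment such as DriverGym supports. "DriverGym-v0" is a
# hypothetical id used for illustration only.
import gym

env = gym.make("DriverGym-v0")          # hypothetical environment id
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()  # replace with a trained RL policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```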

Incorporating Voice Instructions in Model-Based Reinforcement Learning for Self-Driving Cars (Poster) | Mingze Wang · Ziyang Zhang · Grace Yang

This paper presents a novel approach that uses natural language voice instructions to guide deep reinforcement learning (DRL) algorithms when training self-driving cars. DRL methods are popular approaches for autonomous vehicle (AV) agents. However, most existing methods are sample- and time-inefficient and lack a natural communication channel with the human expert. In this paper, the way new human drivers learn from human coaches motivates us to study new forms of human-in-the-loop learning and a more natural and approachable training interface for the agents. We propose incorporating natural language voice instructions (NLI) into model-based deep reinforcement learning to train self-driving cars. We evaluate the proposed method alongside several state-of-the-art DRL methods in the CARLA simulator. The results show that NLI can help ease the training process and significantly boost the agents' learning speed.

Switching Recurrent Kalman Networks (Poster) | Giao Nguyen-Quynh · Philipp Becker · Chen Qiu · Gerhard Neumann

Forecasting driving behavior or other sensor measurements is an essential component of autonomous driving systems. Real-world multivariate time-series data are often hard to model because the underlying dynamics are nonlinear and the observations are noisy. In addition, driving data are often multimodal in distribution, meaning that there are several distinct likely predictions, and averaging over them can hurt model performance. To address this, we propose the Switching Recurrent Kalman Network (SRKN) for efficient inference and prediction on nonlinear and multimodal time-series data. The model switches among several Kalman filters that model different aspects of the dynamics in a factorized latent state. We empirically test the resulting scalable and interpretable deep state-space model on toy datasets and real driving data from taxis in Porto. In all cases, the model captures the multimodal nature of the dynamics in the data.
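
The sketch below illustrates the switching mechanism at the heart of such models, a learned mixture over several linear transition models; the actual SRKN factorizes the latent state and performs closed-form Kalman updates, which are omitted here.

```python
# A minimal sketch of the switching idea: K linear (Kalman-style) transition
# models whose predictions are mixed by a learned switching distribution.
# This only illustrates the mixture-of-dynamics mechanism, not the SRKN's
# factorized latent state or its closed-form filtering updates.
import torch
import torch.nn as nn

class SwitchingDynamics(nn.Module):
    def __init__(self, state_dim: int, num_modes: int):
        super().__init__()
        # One linear transition matrix per dynamics mode.
        self.transitions = nn.Parameter(
            torch.randn(num_modes, state_dim, state_dim) * 0.01)
        self.switch = nn.Linear(state_dim, num_modes)  # mode logits from state

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, state_dim) latent state.
        weights = torch.softmax(self.switch(z), dim=-1)                 # (B, K)
        candidates = torch.einsum("kij,bj->bki", self.transitions, z)   # (B, K, D)
        return (weights.unsqueeze(-1) * candidates).sum(dim=1)          # mixed next state
```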

Object-Level Targeted Selection via Deep Template Matching (Poster) | Suraj Kothawade · Michele Fenzi · Elmar Haussmann · Jose M. Alvarez · Christoph Angerer

Retrieving images containing objects that are semantically similar to objects of interest (OOI) in a query image has many practical use cases, such as fixing failures like false negatives/positives of a learned model or mitigating class imbalance in a dataset. The targeted selection task requires finding relevant data in a large-scale pool of unlabeled data; manual mining at this scale is infeasible. Further, the OOI are often small, occupy less than 1% of the image area, are occluded, and co-exist with many semantically different objects in cluttered scenes. Existing semantic image retrieval methods often focus on mining for larger-sized geographical landmarks and/or require extra labeled data, such as images or image pairs with similar objects, for mining images with generic objects. We propose a fast and robust template matching algorithm in the DNN feature space that retrieves semantically similar images at the object level from a large unlabeled pool of data. We project the region(s) around the OOI in the query image into the DNN feature space for use as the template, which enables our method to focus on the semantics of the OOI without requiring extra labeled data. In the context of autonomous driving, we evaluate our system for targeted selection by using failure cases of object detectors as OOI. We demonstrate its efficacy on a large unlabeled dataset of 2.2M images, showing high recall when mining for images with small-sized OOI. We compare our method against a well-known semantic image retrieval method, which likewise requires no extra labeled data. Lastly, we show that our method is flexible and seamlessly retrieves images with one or more semantically different co-occurring OOI.
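
The core scoring step, template matching in a DNN feature space, can be illustrated with a short sketch: pool the feature region around the OOI into a template vector and score every location of a candidate feature map by cosine similarity. The paper's full pipeline (region projection, large-scale search) is not shown.

```python
# A minimal sketch of template matching in a DNN feature space, the core idea
# described above; the region-projection and large-scale retrieval machinery
# of the actual system is omitted.
import torch
import torch.nn.functional as F

def match_template(template: torch.Tensor, feat_map: torch.Tensor) -> torch.Tensor:
    """template: (C,) pooled feature of the OOI region from the query image.
    feat_map: (C, H, W) backbone features of a candidate image.
    Returns an (H, W) cosine-similarity heatmap."""
    t = F.normalize(template, dim=0)
    f = F.normalize(feat_map, dim=0)          # normalize channels per location
    return torch.einsum("c,chw->hw", t, f)

# Candidate images whose maximum heatmap score exceeds a threshold are retrieved.
```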

Self-Supervised Pretraining for Scene Change Detection (Poster) | Vijaya Raghavan Thiruvengadathan Ramkumar · Prashant Bhat · Elahe Arani · Bahram Zonooz

High-definition (HD) maps provide highly accurate details of the surrounding environment that aid the precise localization of autonomous vehicles. To provide the most recent information, these HD maps must remain up to date with the changes present in the real world. Scene change detection (SCD) is a critical perception task that helps keep these maps updated by identifying the changes in scenes captured at different time instances. Deep neural network (DNN)-based SCD methods hinge on the availability of large-scale labeled images that are expensive to obtain, so current SCD methods depend heavily on transfer learning from large ImageNet datasets. However, this induces a domain shift that results in a drop in change detection performance. To address these challenges, we propose a novel self-supervised pretraining method for SCD, called D-SSCD, that learns temporally consistent representations between pairs of images. D-SSCD uses absolute feature differencing to learn distinctive representations belonging to the changed region directly from unlabeled pairs of images. Our experimental results on the VL-CMU-CD and Panoramic change detection datasets demonstrate the effectiveness of the proposed method: compared to the widely used ImageNet pretraining strategy, which uses more than a million additional labeled images, D-SSCD can match or surpass it without using any additional data. Our results also demonstrate the robustness of D-SSCD to natural corruptions and out-of-distribution generalization, as well as its superior performance in limited-label scenarios.
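
The absolute feature differencing operation at the core of D-SSCD is simple to illustrate, as in the sketch below; the self-supervised losses built on top of it are omitted.

```python
# A minimal sketch of absolute feature differencing, the operation the
# abstract builds its self-supervised objective on: encode both images of a
# pair and take the element-wise absolute difference so that changed regions
# dominate the representation. The actual D-SSCD losses are not shown.
import torch
import torch.nn as nn

def change_features(encoder: nn.Module,
                    img_t0: torch.Tensor,
                    img_t1: torch.Tensor) -> torch.Tensor:
    """img_t0, img_t1: (B, 3, H, W) co-registered images of the same scene."""
    f0 = encoder(img_t0)          # (B, C, h, w)
    f1 = encoder(img_t1)
    return torch.abs(f0 - f1)     # large where the scene changed
```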

Does Thermal data make the detection systems more reliable? (Poster) | Shruthi Gowda · Bahram Zonooz · Elahe Arani

Deep learning-based detection networks have made remarkable progress in autonomous driving systems (ADS). ADS should perform reliably across a variety of ambient lighting and adverse weather conditions. However, luminance degradation and visual obstructions (such as glare or fog) result in poor-quality images from the visual camera, which leads to a performance decline. To overcome these challenges, we explore the idea of leveraging a different data modality that is disparate yet complementary to the visual data. We propose a comprehensive detection system based on a multimodal-collaborative framework that learns from both RGB (from visual cameras) and thermal (from infrared cameras) data. This framework trains two networks collaboratively and provides flexibility in learning optimal features of each network's own modality while also incorporating the complementary knowledge of the other. Our extensive empirical results show that while the improvement in accuracy is nominal, the value lies in challenging and extremely difficult edge cases, which are crucial in safety-critical applications such as AD. We provide a holistic view of both the merits and limitations of using a thermal imaging system in detection.

Reinforcement Learning as an Alternative to Reachability Analysis for Falsification of AD Functions (Poster) | Angel Molina Acosta · Alexander Schliep

Reachability analysis (RA) is one of the classical approaches to studying the safety of autonomous systems, for example through falsification: the identification of initial system states that can, under the right disturbances, lead to unsafe or undesirable outcome states. Obtaining exact answers via RA, however, requires analytical system models that are often unavailable for the simulation environments used for AD systems. RA also suffers from rapidly rising computational costs as dimensionality increases and from ineffectiveness in dealing with nonlinearities such as saturation. Here we present an alternative in the form of a reinforcement learning (RL) approach which empirically shows good agreement with RA falsification for an adaptive cruise controller, can deal with saturation, and, in preliminary data, compares favorably against RA in computational effort. Due to the choice of reward function, the RL agent's estimated value function provides insights into how easily unsafe outcomes can be caused and allows direct comparison with the RA falsification results.

ORDER: Open World Object Detection on Road Scenes (Poster) | Deepak Singh · Shyam Nandan Rai · Joseph K J · Rohit Saluja · Vineeth N Balasubramanian · Chetan Arora · Anbumani Subramanian · C.V. Jawahar

Object detection is a crucial component of autonomous navigation systems. Current object detectors are trained and tested on a fixed number of known classes. However, in real-world or open-world settings, the test set may contain objects of unknown classes; unknown objects are then falsely detected as known objects, leading to failures in the decision making of autonomous navigation systems. We propose Open World Object Detection on Road Scenes (ORDER) to resolve this problem. We introduce Feature-Mix, which widens the gap between known and unknown classes in the latent feature space and improves unknown object detection in the ORDER framework. We also identify two inherent problems in autonomous driving datasets: (i) a significant proportion of each dataset comprises small objects, and (ii) intra-class bounding-box scale variations. We address small object detection and intra-class bounding-box variations by proposing a novel focal regression loss, and further improve small object detection through curriculum learning. We present an extensive evaluation on two road scene datasets, BDD and IDD, where our experiments show consistent improvements over the current state-of-the-art method. We believe this work will lay the foundation for real-world object detection for road scenes.

Scalable Primitives for Generalized Sensor Fusion in Autonomous Vehicles (Poster) | Sammy Sidhu · Aayush Ahuja

In autonomous driving, there has been an explosion in the use of deep neural networks for perception, prediction, and planning tasks. As autonomous vehicles (AVs) move closer to production, multi-modal sensor inputs and heterogeneous vehicle fleets with different sets of sensor platforms are becoming increasingly common in the industry. However, neural network architectures typically target specific sensor platforms and are not robust to changes in input, making the problems of scaling and model deployment particularly difficult. Furthermore, most players still treat optimizing software and hardware as entirely independent problems. We propose a new end-to-end architecture, Generalized Sensor Fusion (GSF), designed so that both sensor inputs and target tasks are modular and modifiable. This enables AV system designers to easily experiment with different sensor configurations and methods, and opens up the ability to deploy on heterogeneous fleets using the same models shared across a large engineering organization. Using this system, we report experimental results demonstrating near-parity between an expensive high-density (HD) LiDAR sensor and a cheap low-density (LD) LiDAR-plus-camera setup on the 3D object detection task. This paves the way for the industry to jointly design hardware and software architectures as well as large fleets with heterogeneous configurations.

Hierarchical Adaptable and Transferable Networks (HATN) for Driving Behavior Prediction (Poster) | Letian Wang · Yeping Hu · Liting Sun · Wei Zhan · Masayoshi TOMIZUKA · Changliu Liu

While autonomous vehicles still struggle with challenging situations during on-road driving, humans have long mastered the essence of driving with an efficient, transferable, and adaptable driving capability. By mimicking humans' cognition model and semantic understanding during driving, we present HATN, a hierarchical framework that generates high-quality driving behaviors in multi-agent, dense-traffic environments. Our method consists of a high-level intention-identification policy and a low-level action-generation policy. With semantic sub-task definitions and a generic state representation, the hierarchical framework is transferable across different driving scenarios. Our model is also able to capture variations in driving behavior across individuals and scenarios through an online adaptation module. We demonstrate our algorithms on the task of trajectory prediction for real traffic data at intersections and roundabouts, where extensive studies show that our method outperforms other methods in prediction accuracy and transferability.

NSS-VAEs: Generative Scene Decomposition for Visual Navigable Space Construction (Poster) | Zheng Chen · Lantao Liu

Detecting navigable space is the first and a critical step for successful robot navigation. In this work, we treat visual navigable space segmentation as a scene decomposition problem and propose a new network, NSS-VAEs (Navigable Space Segmentation Variational AutoEncoders), a representation-learning-based framework that enables robots to learn navigable space segmentation in an unsupervised manner. Unlike prevalent segmentation techniques, which rely heavily on supervised learning strategies and typically demand immense numbers of pixel-level annotated images, the proposed framework leverages a generative model, the Variational Auto-Encoder (VAE), to learn a probabilistic polyline representation that compactly outlines the desired navigable space boundary. Uniquely, our method also assesses the prediction uncertainty related to the unstructuredness of the scenes, which is important for robot navigation in unstructured environments. Through extensive experiments, we validate that our proposed method achieves remarkably high accuracy (over 90%) even without a single label. We also show that the predictions of NSS-VAEs can be further improved using just a few labels, with results significantly outperforming the SOTA fully supervised learning-based method.

PKCAM: Previous Knowledge Channel Attention Module (Poster) | Eslam MOHAMED-ABDELRAHMAN · Ahmad El Sallab · Mohsen Rashwan

Attention mechanisms have been explored with CNNs across both the spatial and channel dimensions. However, all existing methods devote their attention modules to capturing local interactions from the current feature map only, disregarding the valuable previous knowledge acquired by earlier layers. This paper tackles the following question: can one incorporate previous knowledge aggregation while learning channel attention more efficiently? To this end, we propose a Previous Knowledge Channel Attention Module (PKCAM) that captures channel-wise relations across different layers to model the global context. PKCAM is easily integrated into any feed-forward CNN architecture and trained in an end-to-end fashion with a negligible footprint due to its lightweight property. We validate our architecture through extensive experiments on image classification and object detection tasks with different backbones. Our experiments show consistent improvements in performance over the counterparts, and we also conduct experiments that probe the robustness of the learned representations.

MTL-TransMODS: Cascaded Multi-Task Learning for Moving Object Detection and Segmentation with Unified Transformers (Poster) | Eslam MOHAMED-ABDELRAHMAN · Ahmad El Sallab

Recently, transformer-based networks have achieved state-of-the-art performance in computer vision tasks. In this paper, we propose a new cascaded multi-task-learning transformer-based framework, termed MTL-TransMODS, that tackles moving object detection and segmentation, given their importance for autonomous driving tasks. A critical problem in these tasks is how to model the spatial correlation within each frame and the temporal relationship across multiple frames to capture motion cues. MTL-TransMODS introduces a vision transformer to exploit temporal and spatial associations, tackling both tasks with a single, fully shared transformer architecture and unified queries. Extensive experiments demonstrate the superiority of MTL-TransMODS over state-of-the-art methods on the KittiMoSeg dataset (Rashed et al., 2019): 0.3% mAP improvement for moving object detection and 5.7% IoU improvement for moving object segmentation over state-of-the-art techniques.

Real-time Generalized Sensor Fusion with Transformers (Poster) | Aayush Ahuja

3D multi-object tracking (MOT) is essential for the safe deployment of self-driving vehicles. While major progress has been made in 3D object detection and machine-learned tracking approaches, real-time 3D MOT remains a challenging problem in dense urban scenes. Commercial deployment of self-driving cars requires high recall and redundancy, often achieved by using multiple sensor modalities. While existing approaches have been shown to work well with a fixed input-modality setting, it is generally hard to reconfigure the tracking pipeline for optimal performance when the input sources change. In this paper, we propose a generalized learnable framework for multi-modal data association leveraging transformers. Our method encodes tracks and observations as embeddings using joint attention to capture spatio-temporal context. From these embeddings, pairwise similarity scores can be computed between tracks and observations, which are then used to classify track-observation association proposals. We experimentally demonstrate that our data-driven approach achieves better performance than heuristics-based solutions on our in-house large-scale dataset, and show that it generalizes to different combinations of input modalities without any specific hand-tuning. Our approach also runs in real time even with a large number of inputs.

Offline Reinforcement Learning for Autonomous Driving with Safety and Exploration Enhancement (Poster) | TIANYU SHI · Dong Chen

Reinforcement learning (RL) is a powerful data-driven control method that has been widely explored in autonomous driving tasks. However, conventional RL approaches learn control policies through trial-and-error interactions with the environment and may therefore cause disastrous consequences, such as collisions, when tested in real traffic. Offline RL has recently emerged as a promising framework for learning effective policies from previously collected, static datasets without requiring active interaction, making it especially appealing for autonomous driving applications. Despite this promise, existing offline RL algorithms such as Batch-Constrained deep Q-learning (BCQ) generally lead to rather conservative policies with limited exploration efficiency. To address these issues, this paper presents an enhanced BCQ algorithm that employs a learnable parameter-noise scheme in the perturbation model to increase the diversity of observed actions. In addition, a Lyapunov-based safety enhancement strategy is incorporated to constrain the explorable state space within a safe region. Experimental results in highway and parking traffic scenarios show that our approach outperforms the conventional RL method as well as state-of-the-art offline RL algorithms.
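
One plausible reading of a learnable parameter-noise scheme for BCQ's perturbation model is sketched below, NoisyNet-style noise on the perturbation network's weights; the paper's exact scheme and the Lyapunov safety layer are not reproduced.

```python
# A minimal sketch of learnable parameter noise on a BCQ-style perturbation
# model, in the spirit of the enhancement described above (the paper's exact
# scheme may differ). Each weight gets a learnable noise scale, so the
# diversity of perturbed actions is itself optimized during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, sigma0: float = 0.017):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        self.sigma = nn.Parameter(torch.full((out_dim, in_dim), sigma0))
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        noise = torch.randn_like(self.sigma)       # resampled on every call
        weight = self.mu + self.sigma * noise      # learnable noise scale
        return F.linear(x, weight, self.bias)

class PerturbationModel(nn.Module):
    """Outputs a bounded adjustment xi(s, a) added to actions sampled by BCQ's
    generative model; parameter noise increases the diversity of adjustments."""
    def __init__(self, state_dim: int, action_dim: int, phi: float = 0.05):
        super().__init__()
        self.net = nn.Sequential(
            NoisyLinear(state_dim + action_dim, 256), nn.ReLU(),
            NoisyLinear(256, action_dim),
        )
        self.phi = phi  # BCQ's perturbation range

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        xi = self.phi * torch.tanh(self.net(torch.cat([state, action], dim=-1)))
        return (action + xi).clamp(-1.0, 1.0)
```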

Circular-Symmetric Correlation Layer (Poster) | Bahare Azari · Deniz Erdogmus

Despite the vast success of standard planar convolutional neural networks, they are not the most efficient choice for analyzing signals that lie on an arbitrarily curved manifold, such as a cylinder. The problem arises when one performs a planar projection of these signals, which inevitably causes them to be distorted or broken precisely where there is valuable information. We propose a Circular-symmetric Correlation Layer (CCL) based on the formalism of roto-translation equivariant correlation on the continuous group $S^1 \times \mathbb{R}$, and implement it efficiently using the well-known Fast Fourier Transform (FFT) algorithm. We showcase the performance of a general network equipped with the CCL on a popular autonomous driving dataset, nuScenes (Caesar et al., 2020), for semantic segmentation of 3D point clouds obtained from the $360^\circ$ panoramic projections of LiDAR sweeps. A PyTorch implementation of the CCL is provided online.
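
The FFT-based circular correlation underlying the CCL can be sketched in a few lines. This minimal version handles only the azimuthal ($S^1$) axis of a real-valued signal; the full layer also covers the translational axis and multi-channel filters.

```python
# A minimal sketch of circular correlation along the azimuthal axis computed
# with the FFT, the mechanism the CCL builds on; the full layer also handles
# the translation (R) axis and multi-channel filters.
import torch
import torch.fft

def circular_correlation(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x, w: (..., N) real signals on the circle S^1 sampled at N points.
    Returns their circular cross-correlation, also of length N."""
    X = torch.fft.rfft(x, dim=-1)
    W = torch.fft.rfft(w, dim=-1)
    # Correlation theorem: cross-correlation is conj(X) * W in Fourier space.
    return torch.fft.irfft(torch.conj(X) * W, n=x.shape[-1], dim=-1)
```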

Improved Object Detection in Thermal Imaging Through Context Enhancement and Information Fusion: A Case Study in Autonomous Driving (Poster) | Junchi Bin · Ran Zhang · Shan Du · Erik Blasch · Zheng Liu

With advances in sensory technologies, autonomous driving systems incorporate more imaging sensors, such as thermal cameras, to extend environmental perception beyond the visible spectrum. This paper proposes an integrated context enhancement and information fusion framework (CEIFF) that generates enhanced, colorized synthetic visible (SVI) images from thermal images. The SVI and thermal images are then fused for improved perception quality. A case study shows the effectiveness of the proposed CEIFF on a multimodal autonomous driving dataset.

Monocular 3D Object Detection by Leveraging Self-Supervised Visual Pre-training (Poster) | Can Erhan · Anıl Öztürk · Burak Gunel · Nazim Kemal Ure

Precise detection of 3D objects is a critical task in autonomous driving. The monocular 3D object detection problem is defined as predicting 3D bounding boxes in metric space from a single monocular image. Most 3D detectors follow the standard pre-training strategy based on the supervised ImageNet dataset, which was created for a dissimilar classification task. In this paper, a simple and effective pre-training strategy is proposed for the monocular 3D object detection problem that requires no human supervision or annotated data: a dense depth estimation pretext task is incorporated into the pre-training pipeline by taking advantage of self-supervised learning. Experiments show that transferring the pre-trained weights to the detection network improves performance in 3D object detection and bird's-eye-view evaluations by up to 25% relative to baseline networks based on ImageNet pre-training. The strategy is potentially applicable to other 3D object detection methods without any modification to the existing algorithm design.

Fast Polar Attentive 3D Object Detection based on Point Cloud (Poster) | Manoj Bhat

3D object detection using LiDAR point-cloud data is widely used in many applications, including autonomous driving and map building. Existing solutions mainly leverage deep learning models; nevertheless, an underlying challenge is reducing the computational load, and thus latency, while maintaining high accuracy, particularly for detecting objects at long range. Here, we introduce a novel streaming-style detector utilizing polar-space feature representations to provide faster inference for 3D object detection. Our method improves detection performance using pseudo-image features and can support edge devices with limited memory. Compared with other state-of-the-art methods, our experimental validations show superior performance on the Waymo and KITTI datasets. On the KITTI validation set, it achieves 94.7% AP for cars in BEV detection.

Are Socially-Aware Trajectory Prediction Models Really Socially-Aware? (Poster) | Saeed Saadatnejad · Mohammadhossein Bahari

Our field has recently witnessed an arms race of neural network-based trajectory predictors. While these predictors are at the core of many applications, such as autonomous navigation and pedestrian flow simulation, their adversarial robustness has not been carefully studied. In this paper, we introduce a socially-attended attack to assess the social understanding of prediction models in terms of collision avoidance. An attack is a small yet carefully crafted perturbation designed to make predictors fail. Technically, we define collision as a failure mode of the output and propose hard- and soft-attention mechanisms to guide our attack. Through this attack, we shed light on the limitations of current models in terms of their social understanding. We demonstrate the strengths of our method on recent trajectory prediction models. Finally, we show that our attack can be employed to increase the social understanding of state-of-the-art models.
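
A minimal sketch of the general mechanism, gradient-based perturbation of an observed trajectory to induce a predicted collision, is given below. The collision loss and the hard/soft attention guidance of the actual method are simplified to "minimize the predicted separation of two agents", and the predictor signature is hypothetical.

```python
# A minimal sketch of a gradient-based attack on a trajectory predictor, the
# general mechanism behind the socially-attended attack described above. The
# predictor interface is hypothetical: it maps observed trajectories
# (num_agents, T_obs, 2) to predicted futures (num_agents, T_pred, 2).
import torch

def attack_trajectory(predictor, obs, agent_a=0, agent_b=1,
                      eps=0.1, steps=20, lr=0.01):
    """Perturb agent_b's observed history within an L-inf ball of radius eps
    so that the predicted futures of agents a and b come close (collide)."""
    delta = torch.zeros_like(obs[agent_b], requires_grad=True)
    for _ in range(steps):
        perturbed = obs.clone()
        perturbed[agent_b] = obs[agent_b] + delta
        pred = predictor(perturbed)                    # (num_agents, T_pred, 2)
        # Loss: minimum predicted separation between the two agents.
        loss = (pred[agent_a] - pred[agent_b]).norm(dim=-1).min()
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()            # descend on separation
            delta.clamp_(-eps, eps)                    # stay in the L-inf ball
            delta.grad.zero_()
    return obs[agent_b] + delta.detach()
```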