Eye gaze has proven to be a cost-efficient way to collect large-scale physiological data that can reveal underlying human attentional patterns in real-life workflows, and it has therefore long been explored as a signal for directly measuring human cognition in various domains. Physiological data (including but not limited to eye gaze) offer new perception capabilities that could be used in several ML domains, e.g., egocentric perception, embodied AI, and NLP. They can help infer human perception, intentions, beliefs, goals, and other cognitive properties that are much needed for human-AI interaction and agent coordination. In addition, large collections of eye-tracking data have enabled data-driven modeling of human visual attention mechanisms, for both saliency and scanpath prediction, with a twofold advantage: from the neuroscientific perspective, a better understanding of biological mechanisms; and from the AI perspective, agents that can mimic or predict human behavior, with improved interpretability and interaction.
With the emergence of immersive technologies, there is now, more than ever, a need for experts from various backgrounds (e.g., the machine learning, vision, and neuroscience communities) to share expertise and contribute to a deeper understanding of the intricacies of cost-efficient human supervision signals (e.g., eye gaze) and their use in bridging human cognition and AI in machine learning research and development. The goal of this workshop is to bring together an active research community to collectively drive progress in defining and addressing core problems in gaze-assisted machine learning.
Sat 5:30 a.m. - 6:00 a.m. | Coffee - Meet and Greet and Getting Started

Sat 6:00 a.m. - 6:10 a.m. | Opening Remarks (10 mins)
Organizers

Sat 6:10 a.m. - 6:55 a.m. | Learning gaze control, external attention, and internal attention since 1990-91 (Keynote)
First I’ll discuss our early work of 1990 on attentive neural networks that learn to steer foveas, and on learning internal spotlights of attention in Transformer-like systems since 1991. Then I’ll mention what happened in the subsequent three decades in terms of representing percepts and action plans in hierarchical neural networks, at multiple levels of abstraction and multiple time scales. In preparation for this workshop, I made two overview web sites: 1. End-to-End Differentiable Sequential Neural Attention 1990-93: https://people.idsia.ch/~juergen/neural-attention-1990-1993.html. 2. Learning internal spotlights of attention with what’s now called "Transformers with linearized self-attention", which are formally equivalent to the 1991 Fast Weight Programmers: https://people.idsia.ch/~juergen/fast-weight-programmer-1991-transformer.html
Jürgen Schmidhuber

Sat 7:00 a.m. - 7:30 a.m. | Eye-tracking what's going on in the mind (Keynote)
When you try to see what's going on inside a black box, it helps if there is a window. In this talk, I will illustrate how eye-tracking allows us to peek into the inner workings of the human mind. I will use two case studies to make my point. In study 1, I will use eye-tracking evidence to demonstrate that people spontaneously simulate counterfactual possibilities when making causal judgments. In study 2, eye-tracking evidence reveals how people engage in mental simulation when trying to infer what happened in the past based on visual and auditory evidence. Together these studies show that people build rich mental models of the world and that much of human thought can be understood as cognitive operations on these mental models.
Tobias Gerstenberg

Sat 7:30 a.m. - 8:00 a.m. | Neural encoding and decoding of facial movements (Keynote)
New recording technologies have revolutionized neuroscience, allowing scientists to record the spiking activity of thousands of neurons simultaneously. One of the most surprising findings to arise from these new capabilities is that orofacial movements, including eye movements, are much more predictive of neural activity in mouse visual cortex than previously thought. To better understand the relationship between face and eye movements and visual coding, however, we need new machine learning methods for relating these neural and behavioral time series. I will present my lab's research on deep probabilistic models for
Scott Linderman

Sat 8:00 a.m. - 8:15 a.m. | Coffee Break

Sat 8:20 a.m. - 8:32 a.m. | Electrode Clustering and Bandpass Analysis of EEG Data for Gaze Estimation (Spotlight)
In this study, we validate the findings of previously published papers, showing the feasibility of Electroencephalography (EEG)-based gaze estimation. Moreover, we extend previous research by demonstrating that, with only a slight drop in model performance, we can significantly reduce the number of electrodes, indicating that a high-density, expensive EEG cap is not necessary for EEG-based eye tracking. Using data-driven approaches, we establish which electrode clusters impact gaze estimation and how different types of EEG data preprocessing affect the models’ performance. Finally, we also inspect which recorded frequencies are most important for the defined tasks.
Ard Kastrati · Martyna Plomecka · Joël Küchler · Nicolas Langer · Roger Wattenhofer

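As a rough illustration of the band-limited, reduced-electrode pipeline described in the abstract above, the sketch below band-pass filters a multi-channel EEG array and keeps only a small electrode subset before feature extraction. The channel indices, band edges, and sampling rate are illustrative assumptions, not the data-driven clusters or frequencies identified in the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(eeg, fs, low, high, order=4):
    """Zero-phase Butterworth band-pass filter, applied channel-wise.
    eeg: (n_channels, n_samples) array; fs: sampling rate in Hz."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, eeg, axis=-1)

# Hypothetical electrode subset standing in for a data-driven cluster.
SUBSET = [0, 1, 2, 3, 10, 11, 12, 13]

fs = 500                                        # assumed sampling rate
eeg = np.random.randn(128, 10 * fs)             # stand-in for a 128-channel recording
alpha = bandpass(eeg[SUBSET], fs, 8.0, 13.0)    # restrict to the 8-13 Hz alpha band
features = alpha.reshape(len(SUBSET), -1)       # per-channel features for a gaze regressor
```
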
Sat 8:32 a.m. - 8:44 a.m. | Modeling Human Eye Movements with Neural Networks in a Maze-Solving Task (Spotlight)
From smoothly pursuing moving objects to rapidly shifting gazes during visual search, humans employ a wide variety of eye movement strategies in different contexts. While eye movements provide a rich window into mental processes, building generative models of eye movements is notoriously difficult, and to date the computational objectives guiding eye movements remain largely a mystery. In this work, we tackled these problems in the context of a canonical spatial planning task, maze-solving. We collected eye movement data from human subjects and built deep generative models of eye movements using a novel differentiable architecture for gaze fixations and gaze shifts. We found that human eye movements are best predicted by a model that is optimized not to perform the task as efficiently as possible but instead to run an internal simulation of an object traversing the maze. This not only provides a generative model of eye movements in this task but also suggests a computational theory for how humans solve the task, namely that humans use mental simulation.
Jason Li · Nicholas Watters · Sandy Wang · Hansem Sohn · Mehrdad Jazayeri

Sat 8:44 a.m. - 8:56 a.m. | Intention Estimation via Gaze for Robot Guidance in Hierarchical Tasks (Spotlight)
To provide effective guidance to a human agent performing hierarchical tasks, a robot must determine the level at which to provide guidance. This relies on estimating the agent's intention at each level of the hierarchy. Unfortunately, observations of task-related movements provide direct information about intention only at the lowest level. In addition, lower level tasks may be shared. The resulting ambiguity impairs timely estimation of higher level intent. This can be resolved by incorporating observations of secondary behaviors like gaze. We propose a probabilistic framework enabling robot guidance in hierarchical tasks via intention estimation from observations of both task-related movements and eye gaze. Experiments with a virtual humanoid robot demonstrate that gaze is a very powerful cue that largely compensates for simplifying assumptions made in modelling task-related movements, enabling a robot controlled by our framework to nearly match the performance of a human wizard. We examine the effect of gaze in improving both the precision and timeliness of guidance cue generation, finding that while both improve with gaze, improvements in timeliness are more significant. Our results suggest that gaze observations are critical in achieving natural and fluid human-robot collaboration, which may enable human agents to undertake significantly more complex tasks and perform them more safely and effectively than would be possible without guidance.
Yifan SHEN · Xiaoyu Mo · Vytas Krisciunas · David Hanson · Bertram Shi

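The framework above is probabilistic; while its hierarchical structure is beyond a short example, the sketch below shows the basic idea in miniature: a Bayesian update over a discrete set of candidate intentions in which an informative gaze likelihood disambiguates otherwise ambiguous movement evidence. The flat (non-hierarchical) intention set and all numbers are illustrative assumptions, not the paper's model.

```python
import numpy as np

def update_intention(prior, gaze_likelihood, movement_likelihood):
    """One Bayesian filtering step over a discrete set of candidate intentions.

    prior: P(intention) from the previous step, shape (n_intentions,)
    gaze_likelihood: P(gaze observation | intention)
    movement_likelihood: P(movement observation | intention)
    Observations are treated as conditionally independent given the intention.
    """
    posterior = prior * gaze_likelihood * movement_likelihood
    return posterior / posterior.sum()

# Toy example: three candidate sub-goals; gaze strongly favours the second,
# while movement alone is nearly uninformative.
prior = np.array([1 / 3, 1 / 3, 1 / 3])
gaze_lik = np.array([0.1, 0.8, 0.1])
move_lik = np.array([0.4, 0.35, 0.25])
print(update_intention(prior, gaze_lik, move_lik))
```
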
Sat 8:56 a.m. - 9:08 a.m. | Facial Composite Generation with Iterative Human Feedback (Spotlight)
We propose the first method in which human and AI collaborate to iteratively reconstruct the human's mental image of another person's face only from their eye gaze. Current tools for generating digital human faces involve a tedious and time-consuming manual design process. While gaze-based mental image reconstruction represents a promising alternative, previous methods still assumed prior knowledge about the target face, thereby severely limiting their practical usefulness. The key novelty of our method is a collaborative, iterative query engine: based on the user's gaze behaviour in each iteration, our method predicts which images to show to the user in the next iteration. Results from two human studies (N=12 and N=22) show that our method can visually reconstruct digital faces that are more similar to the mental image and is more usable than other methods. As such, our findings point to the significant potential of human-AI collaboration for reconstructing mental images, potentially also beyond faces, and of human gaze as a rich source of information and a powerful mediator in said collaboration.
Florian Strohm · Ekta Sood · Dominike Thomas · Mihai Bace · Andreas Bulling

Sat 9:08 a.m. - 9:20 a.m. | Simulating Human Gaze with Neural Visual Attention (Spotlight)
Existing models of human visual attention are generally unable to incorporate direct task guidance and therefore cannot model an intent or goal when exploring a scene. To integrate guidance from any downstream visual task into attention modeling, we propose the Neural Visual Attention (NeVA) algorithm. To this end, we impose on neural networks the biological constraint of foveated vision and train an attention mechanism to generate visual explorations that maximize performance on the downstream task. We observe that biologically constrained neural networks generate human-like scanpaths without being trained for this objective. Extensive experiments on three common benchmark datasets show that our method outperforms state-of-the-art unsupervised human attention models in generating human-like scanpaths.
Leo Schwinn · Doina Precup · Bjoern Eskofier · Dario Zanca

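To make the foveated-vision constraint mentioned above concrete, here is a crude, non-differentiable sketch that keeps full resolution around a fixation point and blurs the periphery. NeVA's actual trainable attention mechanism and foveation model are not reproduced here, and the sigma values are arbitrary assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(image, fixation, sigma_mask=30.0, blur_sigma=8.0):
    """Crude foveation: full acuity near the fixation, blurred periphery.

    image: (H, W) grayscale array; fixation: (row, col) in pixels.
    """
    h, w = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    # Gaussian blending mask: 1 at the fovea, ~0 far in the periphery.
    mask = np.exp(-((rows - fixation[0]) ** 2 + (cols - fixation[1]) ** 2)
                  / (2 * sigma_mask ** 2))
    blurred = gaussian_filter(image, blur_sigma)   # low-acuity peripheral version
    return mask * image + (1 - mask) * blurred

img = np.random.rand(224, 224)
fov = foveate(img, fixation=(112, 80))
```
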
Sat 9:20 a.m. - 9:50 a.m. | Foveated Models of Visual Search and Medical Image Perception (Keynote)
Modern 3D medical imaging modalities such as computed tomography and digital breast tomosynthesis require that the radiologist scroll through a stack of 2D images (slices) to search for abnormalities (signals) predictive of disease. Here, I review the use of foveated computational models to understand various perceptual effects with radiologists. A foveated model processes the images with reduced fidelity away from the fixation point, makes eye movements to explore the image, scrolls through images, and integrates sensory evidence across fixations and slices to reach decisions about the presence of signals. I show how the model predicts and explains a number of important findings with observers and radiologists. Small signals that are difficult to see in the periphery are often missed during search with 3D images. A synthetic 2D image created by combining all 3D slices can be presented as a preview to radiologists to guide their search through the 3D image stack and minimize the search errors of small signals. Variations in observers’ eye movements when searching for small signals in 3D images can fully account for variability in observer search performance. I will conclude by discussing recent efforts to integrate foveated models with deep Q-learning techniques to estimate near-optimal (performance-maximizing) eye movements during search with medical images.
Miguel Eckstein

Sat 10:00 a.m. - 10:30 a.m. | Lunch and Poster Walk-Around

Sat 10:30 a.m. - 11:30 a.m. | Appearance-Based Gaze Estimation for Driver Monitoring (Poster)
Driver inattention is a leading cause of road accidents through its impact on reaction time in the face of incidents. In the case of Level-3 (L3) vehicles, inattention adversely impacts the quality of driver take-over and therefore the safe performance of L3 vehicles. There is a high correlation between a driver's visual attention and eye movement. Gaze angle is an excellent surrogate for assessing driver attention zones, in both cabin-interior and on-road scenarios. We propose appearance-based gaze estimation approaches using convolutional neural networks (CNNs) to estimate gaze angle directly from eye images and also from eye landmark coordinates. The goal is to improve learning by utilizing synthetic data with more accurate annotations. Performance analysis shows that our proposed landmark-based model, trained synthetically, is capable of predicting gaze angle on real data with a reasonable angular error. In addition, we discuss how evaluation metrics are application specific: measuring the driver's gaze direction in L3 autonomy, so that a control take-over request can be issued at the proper time relative to the driver's attention focus, requires a more reliable assessment metric than the common mean angular error, to avoid ambiguities.
Soodeh Nikan · Devesh Upadhyay

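As a minimal illustration of appearance-based gaze-angle regression with a CNN (not the paper's architecture, and leaving aside its landmark-based variant and synthetic training data), the sketch below maps a grayscale eye patch to a (yaw, pitch) pair. The input size, layer widths, and L1 loss stand-in are assumptions.

```python
import torch
import torch.nn as nn

class GazeAngleCNN(nn.Module):
    """Tiny appearance-based regressor: eye patch in, (yaw, pitch) in radians out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 2)   # yaw and pitch

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = GazeAngleCNN()
eye_patches = torch.randn(8, 1, 36, 60)        # batch of grayscale eye crops (assumed size)
angles = model(eye_patches)                    # (8, 2) predicted gaze angles
# L1 loss as a simple training proxy; angular error would be computed on 3D gaze vectors.
loss = nn.functional.l1_loss(angles, torch.zeros_like(angles))
```
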
Sat 10:30 a.m. - 11:30 a.m. | Selection of XAI Methods Matters: Evaluation of Feature Attribution Methods for Oculomotoric Biometric Identification (Poster)
Substantial advances in oculomotoric biometric identification have been made due to deep neural networks that process non-aggregated time series data, replacing methods that process theoretically motivated engineered features. However, the interpretability of deep neural networks is not trivial and needs to be thoroughly investigated for future eye tracking applications. Especially in medical or legal applications, explanations may be required alongside predictions. In this work, we apply several attribution methods to a state-of-the-art model for eye movement-based biometric identification. To assess the quality of the generated attributions, this work focuses on the quantitative evaluation of a range of established metrics. We find that Layer-wise Relevance Propagation generates the most robust and least complex attributions, while DeepLIFT attributions are the most faithful. Due to the absence of a correlation between the attributions of these two methods, we advocate considering both methods for their potentially complementary attributions.
Daniel Krakowczyk · David Robert Reich · Paul Prasse · Sebastian Lapuschkin · Lena A. Jäger · Tobias Scheffer

Sat 10:30 a.m. - 11:30 a.m. | Time-to-Saccade metrics for real-world evaluation (Poster)
In this paper, we explore multiple metrics for the evaluation of time-to-saccade problems. We define a new sampling strategy that takes the sequential nature of gaze data and time-to-saccade problems into account, avoiding samples of the same event being placed into different datasets. This also allows us to define novel error metrics that evaluate predicted durations utilizing classical eye-movement classifiers. Furthermore, we define metrics to evaluate the consistency of a predictor and the modification of the error over time. We evaluate our method using state-of-the-art methods along with an average baseline on three different datasets.
Tim Rolff · Niklas Stein · Markus Lappe · Frank Steinicke · Simone Frintrop

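The event-aware sampling idea above (never letting samples from the same gaze event land in both the training and test splits) can be approximated with a group-based split, sketched below. The feature layout and the use of scikit-learn's GroupShuffleSplit are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_samples = 1000
X = rng.normal(size=(n_samples, 6))              # stand-in gaze features per sample
y = rng.uniform(50, 400, size=n_samples)         # time-to-saccade targets in ms
event_id = rng.integers(0, 120, size=n_samples)  # which gaze event each sample belongs to

# All samples of one event go either to train or to test, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=event_id))
assert set(event_id[train_idx]).isdisjoint(event_id[test_idx])
```
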
Sat 10:30 a.m. - 11:30 a.m. | Electrode Clustering and Bandpass Analysis of EEG Data for Gaze Estimation (Poster)
In this study, we validate the findings of previously published papers, showing the feasibility of Electroencephalography (EEG)-based gaze estimation. Moreover, we extend previous research by demonstrating that, with only a slight drop in model performance, we can significantly reduce the number of electrodes, indicating that a high-density, expensive EEG cap is not necessary for EEG-based eye tracking. Using data-driven approaches, we establish which electrode clusters impact gaze estimation and how different types of EEG data preprocessing affect the models’ performance. Finally, we also inspect which recorded frequencies are most important for the defined tasks.
Ard Kastrati · Martyna Plomecka · Joël Küchler · Nicolas Langer · Roger Wattenhofer

Sat 10:30 a.m. - 11:30 a.m. | Skill, or Style? Classification of Fetal Sonography Eye-Tracking Data (Poster)
We present a method for classifying human skill at fetal ultrasound scanning from eye-tracking and pupillary data of sonographers. Human skill characterization for this clinical task typically creates groupings of clinician skill, such as expert and beginner, based on the number of years of professional experience; experts typically have more than 10 years and beginners between 0-5 years. In some cases, they also include trainees who are not yet fully-qualified professionals. Prior work has considered eye movement events, which necessitates separating eye-tracking data into movements such as fixations and saccades. Our method does not rely on prior assumptions about the relationship between years of experience and skill, and does not require the separation of eye-tracking data. Our best performing skill classification model achieves an F1 score of 98% and 70% for the expert and trainee classes respectively. We also show that years of experience, as a direct measure of skill, is significantly correlated with the expertise of a sonographer.
Clare Teng · Lior Drukker · Aris Papageorghiou · Alison Noble

Sat 10:30 a.m. - 11:30 a.m. | Decoding Attention from Gaze: A Benchmark Dataset and End-to-End Models (Poster)
Eye-tracking has the potential to provide rich behavioral data about human cognition in ecologically valid environments. However, analyzing this rich data is often challenging. Most automated analyses are specific to simplistic artificial visual stimuli with well-separated, static regions of interest, while most analyses in the context of complex visual stimuli, such as most natural scenes, rely on laborious and time-consuming manual annotation. This paper studies using computer vision tools for "attention decoding", the task of assessing the locus of a participant's overt visual attention over time. We provide a publicly available Multiple Object Eye-Tracking (MOET) dataset, consisting of gaze data from participants tracking specific objects, annotated with labels and bounding boxes, in crowded real-world videos, for training and evaluating attention decoding algorithms. We also propose two end-to-end deep learning models for attention decoding and compare these to state-of-the-art heuristic methods.
Karan Uppal · Jaeah Kim · Shashank Singh

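For readers unfamiliar with the heuristic baselines this line of work compares against, here is a simple illustrative decoder (not the paper's method): assign the gaze point to the tracked object whose bounding box contains it, falling back to the nearest box centre. Object names and coordinates are hypothetical.

```python
import math

def decode_attention(gaze, boxes):
    """Heuristic attention decoding for one video frame.

    gaze: (x, y) gaze point; boxes: dict name -> (x_min, y_min, x_max, y_max).
    Returns the name of the attended object.
    """
    gx, gy = gaze
    containing = [name for name, (x0, y0, x1, y1) in boxes.items()
                  if x0 <= gx <= x1 and y0 <= gy <= y1]
    if containing:
        return containing[0]
    # Otherwise pick the object whose box centre is closest to the gaze point.
    centres = {name: ((x0 + x1) / 2, (y0 + y1) / 2)
               for name, (x0, y0, x1, y1) in boxes.items()}
    return min(centres, key=lambda n: math.hypot(centres[n][0] - gx, centres[n][1] - gy))

frame_boxes = {"ball": (100, 80, 160, 140), "car": (300, 200, 420, 300)}
print(decode_attention((130, 100), frame_boxes))   # -> "ball"
```
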
Sat 10:30 a.m. - 11:30 a.m. | Learning to count visual objects by combining "what" and "where" in recurrent memory (Poster)
Counting the number of objects in a visual scene is easy for humans but challenging for modern deep neural networks. Here we explore what makes this problem hard and study the neural computations that allow transfer of counting ability to new objects and contexts. Previous work has implicated posterior parietal cortex (PPC) in numerosity perception and in visual scene understanding more broadly. It has been proposed that action-related saccadic signals computed in PPC provide object-invariant information about the number and arrangement of scene elements, and may contribute to relational reasoning in visual displays. Here, we built a glimpsing recurrent neural network that combines gaze contents ("what") and gaze location ("where") to count the number of items in a visual array. The network successfully learns to count and generalizes to several out-of-distribution test sets, including images with novel items. Through ablations and comparison to control models, we establish the contribution of brain-inspired computational principles to this generalization ability. This work provides a proof-of-principle demonstration that a neural network that combines "what" and "where" can learn a generalizable concept of numerosity and points to a promising approach for other visual reasoning tasks.
Jessica Thompson · Hannah Sheahan · Christopher Summerfield

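A minimal sketch of the "what" plus "where" fusion idea described above: glimpse contents and gaze locations are embedded, summed, and fed to a recurrent network that reads out a count. The dimensions, the GRU choice, and additive fusion are assumptions; the paper's glimpsing network may differ.

```python
import torch
import torch.nn as nn

class WhatWhereCounter(nn.Module):
    """Recurrent counter fusing glimpse content ("what") and gaze location ("where")."""
    def __init__(self, glimpse_dim=64, hidden=128, max_count=9):
        super().__init__()
        self.what = nn.Linear(glimpse_dim, hidden)   # embeds the glimpse contents
        self.where = nn.Linear(2, hidden)            # embeds the (x, y) gaze location
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, max_count + 1)

    def forward(self, glimpses, locations):
        # glimpses: (B, T, glimpse_dim), locations: (B, T, 2)
        fused = torch.relu(self.what(glimpses) + self.where(locations))
        _, h = self.rnn(fused)
        return self.readout(h[-1])                   # logits over counts 0..max_count

model = WhatWhereCounter()
logits = model(torch.randn(4, 12, 64), torch.rand(4, 12, 2))   # 12 glimpses per image
```
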
Sat 10:30 a.m. - 11:30 a.m. | Modeling Human Eye Movements with Neural Networks in a Maze-Solving Task (Poster)
From smoothly pursuing moving objects to rapidly shifting gazes during visual search, humans employ a wide variety of eye movement strategies in different contexts. While eye movements provide a rich window into mental processes, building generative models of eye movements is notoriously difficult, and to date the computational objectives guiding eye movements remain largely a mystery. In this work, we tackled these problems in the context of a canonical spatial planning task, maze-solving. We collected eye movement data from human subjects and built deep generative models of eye movements using a novel differentiable architecture for gaze fixations and gaze shifts. We found that human eye movements are best predicted by a model that is optimized not to perform the task as efficiently as possible but instead to run an internal simulation of an object traversing the maze. This not only provides a generative model of eye movements in this task but also suggests a computational theory for how humans solve the task, namely that humans use mental simulation.
Jason Li · Nicholas Watters · Sandy Wang · Hansem Sohn · Mehrdad Jazayeri

Sat 10:30 a.m. - 11:30 a.m. | Generating Attention Maps from Eye-gaze for the Diagnosis of Alzheimer's Disease (Poster)
Convolutional neural networks (CNNs) are currently the best computational methods for the diagnosis of Alzheimer's disease (AD) from neuroimaging. CNNs are able to automatically learn a hierarchy of spatial features, but they are not optimized to incorporate domain knowledge. In this work, we study the generation of attention maps based on a human expert's gaze over the brain scans (domain knowledge) to guide the deep model to focus on the regions most relevant for AD diagnosis. Two strategies to generate the maps from eye gaze were investigated: the use of average class maps, and supervising a network to generate the attention maps. These approaches were compared with masking (hard attention) with regions of interest (ROI) and with CNNs with traditional attention mechanisms. For our experiments, we used positron emission tomography (PET) scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. For the task of normal control (NC) vs Alzheimer's (AD), the best performing model was the one with insertion of regions of interest (ROI), which achieved 95.6% accuracy, 0.4% higher than the baseline CNN.
Carlos Antunes · Margarida Silveira

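One of the two strategies above, average class attention maps from expert gaze, can be sketched briefly: accumulate fixation points from scans of a given class into a smoothed 2-D map and use it to re-weight the input. The map resolution, smoothing, and soft weighting scheme are assumptions, not the paper's exact recipe.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def average_class_map(fixations_per_scan, shape=(96, 96), sigma=4.0):
    """Average gaze-density map for one diagnostic class.

    fixations_per_scan: list of arrays of (row, col) fixation coordinates, one per scan.
    """
    acc = np.zeros(shape)
    for fixations in fixations_per_scan:
        heat = np.zeros(shape)
        for r, c in fixations:
            heat[int(r), int(c)] += 1.0        # accumulate fixation counts
        acc += gaussian_filter(heat, sigma)    # smooth each scan's map
    acc /= max(len(fixations_per_scan), 1)
    return acc / (acc.max() + 1e-8)            # normalise to [0, 1]

# Toy usage: weight a (96, 96) slice by the class attention map before feeding a CNN.
ad_map = average_class_map([np.random.randint(0, 96, size=(50, 2)) for _ in range(10)])
scan_slice = np.random.rand(96, 96)
attended_input = scan_slice * (0.5 + 0.5 * ad_map)   # soft attention, not a hard ROI mask
```
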
Sat 10:30 a.m. - 11:30 a.m. | Do They Look Where They Go? Gaze Classification During Walking (Poster)
In many applications of human-computer interaction, a prediction of the human's next intended action is highly valuable. For locomotor actions, a walking person relies on visual input obtained by eye and head movements to control the direction and orientation of the body. The analysis of these parameters can be used to infer the intended goal of the walker. However, such a prediction of human locomotion intentions is a challenging task, since interactions between these parameters are non-linear and highly dynamic. Distinguishing gazes on future waypoints from other gazes can be a helpful source of information. We employed LSTM models to investigate whether gaze and walk data can be used to predict whether walkers are currently looking at locations along their future path or looking in a direction that is away from their future path. Our models were trained on egocentric data from a virtual reality experiment in which 18 participants walked freely through a virtual environment while performing various tasks (walking along a curved path, avoiding obstacles and searching for a target). The dataset included only egocentric features (position, orientation and gaze) and no information about the environment. These features were used to determine when gaze was directed at future waypoints and when it was not. The trained model achieved an overall accuracy of 80%. Biasing the model to focus on correct classification of gazes away from the path increased the detection rate of these gazes to 90%. An analysis of model performance in the different walking tasks showed that accuracy was highest (85%) for curved path walking and lowest (73%) for the target search task. We conclude that online gaze measurements during walking can be used to estimate a walker's intention and to determine whether they look at the target of their future trajectory or away from it.
Gianni Bremer · Niklas Stein · Markus Lappe

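A compact sketch of the kind of sequence classifier described above: an LSTM over windows of egocentric features with a weighted binary loss that biases the model toward detecting gazes away from the future path. The feature layout, window length, and class-weight value are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class WaypointGazeLSTM(nn.Module):
    """Binary classifier: is the walker currently looking at a future waypoint?"""
    def __init__(self, n_features=9, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (B, T, n_features) egocentric window
        _, (h, _) = self.lstm(x)
        return self.head(h[-1]).squeeze(-1)    # logit; > 0 means "gaze on future path"

model = WaypointGazeLSTM()
# Assumed egocentric features: position (3), head orientation (3), gaze direction (3).
windows = torch.randn(16, 120, 9)              # 16 windows of 120 time steps
labels = torch.randint(0, 2, (16,)).float()    # 1 = gaze on path, 0 = gaze away
# pos_weight < 1 down-weights the "on path" class, biasing toward detecting "away" gazes.
loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.5]))(model(windows), labels)
```
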
Sat 10:30 a.m. - 11:30 a.m. | Intention Estimation via Gaze for Robot Guidance in Hierarchical Tasks (Poster)
To provide effective guidance to a human agent performing hierarchical tasks, a robot must determine the level at which to provide guidance. This relies on estimating the agent's intention at each level of the hierarchy. Unfortunately, observations of task-related movements provide direct information about intention only at the lowest level. In addition, lower level tasks may be shared. The resulting ambiguity impairs timely estimation of higher level intent. This can be resolved by incorporating observations of secondary behaviors like gaze. We propose a probabilistic framework enabling robot guidance in hierarchical tasks via intention estimation from observations of both task-related movements and eye gaze. Experiments with a virtual humanoid robot demonstrate that gaze is a very powerful cue that largely compensates for simplifying assumptions made in modelling task-related movements, enabling a robot controlled by our framework to nearly match the performance of a human wizard. We examine the effect of gaze in improving both the precision and timeliness of guidance cue generation, finding that while both improve with gaze, improvements in timeliness are more significant. Our results suggest that gaze observations are critical in achieving natural and fluid human-robot collaboration, which may enable human agents to undertake significantly more complex tasks and perform them more safely and effectively than would be possible without guidance.
Yifan SHEN · Xiaoyu Mo · Vytas Krisciunas · David Hanson · Bertram Shi

Sat 10:30 a.m. - 11:30 a.m. | Comparing radiologists' gaze and saliency maps generated by interpretability methods for chest x-rays (Poster)
We use a dataset of eye-tracking data from five radiologists to compare the regions used by deep learning models for their decisions and the heatmaps representing where radiologists looked. We conduct a class-independent analysis of the saliency maps generated by two methods selected from the literature: Grad-CAM and attention maps from an attention-gated model. For the comparison, we use shuffled metrics, avoiding biases from fixation locations. We achieve scores comparable to an interobserver baseline in one metric, highlighting the potential of saliency maps from Grad-CAM to mimic a radiologist's attention over an image. We also divide the dataset into subsets to evaluate in which cases similarities are higher.
Ricardo Bigolin Lanfredi · Ambuj Arora · Trafton Drew · Joyce Schroeder · Tolga Tasdizen

Sat 10:30 a.m. - 11:30 a.m. | Integrating eye gaze into machine learning using fractal curves (Poster)
Eye gaze tracking has traditionally employed a camera to capture a participant's eye movements and characterise their visual fixations. However, gaze pattern recognition is still challenging. This is due to both gaze point sparsity and the seemingly random approach participants take to viewing unfamiliar stimuli without a set task. Our paper proposes a method for integrating eye gaze into machine learning by converting a fixation's two-dimensional (x, y) coordinate into a one-dimensional Hilbert curve distance metric, making it well suited for use in machine learning. We compare this approach to a traditional grid-based string substitution technique, with example implementations in a Support Vector Machine and a Convolutional Neural Network, and examine which method performs better. Results show that this method can be useful both to dynamically quantise scanpaths for tuning statistical significance in large datasets and to investigate the nuances of similarity found in shared bottom-up processing when participants observe unfamiliar stimuli in a free-viewing experiment. Real-world applications include expertise-related eye gaze prediction, medical screening, and image saliency identification.
Robert Ahadizad Newport · Sidong Liu · Antonio Di Ieva

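The (x, y) to Hilbert-distance conversion at the heart of the method above lends itself to a short sketch: the standard bit-manipulation mapping on a 2^k by 2^k grid, plus a hypothetical helper that quantises normalised gaze coordinates before the conversion. The grid size and normalisation are assumptions, not the paper's exact parameters.

```python
def _rotate(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the curve orientation is preserved."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def xy_to_hilbert(n, x, y):
    """Distance along the Hilbert curve of cell (x, y) on an n x n grid (n a power of two)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d

def fixation_to_distance(fx, fy, grid=256):
    """Quantise a fixation given in normalised [0, 1] screen coordinates, then map it."""
    x = min(int(fx * grid), grid - 1)
    y = min(int(fy * grid), grid - 1)
    return xy_to_hilbert(grid, x, y)

scanpath = [(0.12, 0.80), (0.45, 0.43), (0.90, 0.10)]            # toy normalised fixations
encoded = [fixation_to_distance(fx, fy) for fx, fy in scanpath]  # 1-D sequence for an SVM/CNN
```

Because the Hilbert curve preserves locality, fixations that are close on screen tend to receive nearby distance values, which is what makes the resulting 1-D sequence usable as a machine-learning feature.
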
Sat 10:30 a.m. - 11:30 a.m. | Facial Composite Generation with Iterative Human Feedback (Poster)
We propose the first method in which human and AI collaborate to iteratively reconstruct the human's mental image of another person's face only from their eye gaze. Current tools for generating digital human faces involve a tedious and time-consuming manual design process. While gaze-based mental image reconstruction represents a promising alternative, previous methods still assumed prior knowledge about the target face, thereby severely limiting their practical usefulness. The key novelty of our method is a collaborative, iterative query engine: based on the user's gaze behaviour in each iteration, our method predicts which images to show to the user in the next iteration. Results from two human studies (N=12 and N=22) show that our method can visually reconstruct digital faces that are more similar to the mental image and is more usable than other methods. As such, our findings point to the significant potential of human-AI collaboration for reconstructing mental images, potentially also beyond faces, and of human gaze as a rich source of information and a powerful mediator in said collaboration.
Florian Strohm · Ekta Sood · Dominike Thomas · Mihai Bace · Andreas Bulling

Sat 10:30 a.m. - 11:30 a.m. | Federated Learning for Appearance-based Gaze Estimation in the Wild (Poster)
Gaze estimation methods have significantly matured in recent years, but the large number of eye images required to train deep learning models poses significant privacy risks. In addition, the heterogeneous data distribution across different users can significantly hinder the training process. In this work, we propose the first federated learning approach for gaze estimation to preserve the privacy of gaze data. We further employ pseudo-gradient optimisation to adapt our federated learning approach to the divergent model updates, addressing the heterogeneous nature of in-the-wild gaze data in collaborative setups. We evaluate our approach on a real-world dataset (MPIIGaze) and show that our work enhances the privacy guarantees of conventional appearance-based gaze estimation methods, handles the convergence issues of gaze estimators, and significantly outperforms vanilla federated learning by 15.8% (from a mean error of 10.63 degrees to 8.95 degrees). As such, our work paves the way for privacy-aware collaborative learning setups for gaze estimation while maintaining the model's performance.
Mayar Elfares · Zhiming Hu · Pascal Reisert · Andreas Bulling · Ralf Küsters

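A minimal sketch of the server-side pseudo-gradient idea referenced above: clients return locally trained weights, and the server treats the size-weighted average update as a pseudo-gradient to which it applies its own learning rate. The paper's actual optimiser, model, and aggregation details are not reproduced; names and numbers here are illustrative.

```python
import numpy as np

def server_round(global_w, client_weights, client_sizes, server_lr=1.0):
    """One federated round with a server-side pseudo-gradient update.

    Each client trains locally and returns its weights; the server treats the
    size-weighted average update as a pseudo-gradient and takes its own step.
    """
    total = sum(client_sizes)
    avg_w = sum(n / total * w for w, n in zip(client_weights, client_sizes))
    pseudo_grad = global_w - avg_w               # update direction implied by the clients
    return global_w - server_lr * pseudo_grad    # server_lr = 1.0 recovers plain FedAvg

# Toy example with a 3-parameter "model" and two clients of unequal size.
w = np.zeros(3)
clients = [np.array([0.2, -0.1, 0.4]), np.array([0.1, 0.0, 0.3])]
w = server_round(w, clients, client_sizes=[100, 300], server_lr=0.8)
```
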
Sat 10:30 a.m. - 11:30 a.m. | Simulating Human Gaze with Neural Visual Attention (Poster)
Existing models of human visual attention are generally unable to incorporate direct task guidance and therefore cannot model an intent or goal when exploring a scene. To integrate guidance from any downstream visual task into attention modeling, we propose the Neural Visual Attention (NeVA) algorithm. To this end, we impose on neural networks the biological constraint of foveated vision and train an attention mechanism to generate visual explorations that maximize performance on the downstream task. We observe that biologically constrained neural networks generate human-like scanpaths without being trained for this objective. Extensive experiments on three common benchmark datasets show that our method outperforms state-of-the-art unsupervised human attention models in generating human-like scanpaths.
Leo Schwinn · Doina Precup · Bjoern Eskofier · Dario Zanca

Sat 10:30 a.m. - 11:30 a.m. | Contrastive Representation Learning for Gaze Estimation (Poster)
Self-supervised learning (SSL) has become prevalent for learning representations in computer vision. Notably, SSL exploits contrastive learning to encourage visual representations to be invariant under various image transformations. The task of gaze estimation, on the other hand, demands not just invariance to various appearances but also equivariance to geometric transformations. In this work, we propose a simple contrastive representation learning framework for gaze estimation, named Gaze Contrastive Learning (GazeCLR). GazeCLR exploits multi-view data to promote equivariance and relies on selected data augmentation techniques that do not alter gaze directions for invariance learning. Our experiments demonstrate the effectiveness of GazeCLR for several settings of the gaze estimation task. In particular, our results show that GazeCLR improves the performance of cross-domain gaze estimation, yielding a relative improvement of up to 17.2%. Moreover, the GazeCLR framework is competitive with state-of-the-art representation learning methods for few-shot evaluation.
Swati Jindal · Roberto Manduchi

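As a generic illustration of the contrastive building block behind approaches like the one above (not GazeCLR's exact objective, which additionally handles gaze equivariance across camera views), the sketch below computes a symmetric InfoNCE loss between embeddings of two views of the same samples. The embedding size, batch size, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Symmetric InfoNCE loss between two batches of embeddings.

    z1[i] and z2[i] are embeddings of two views of the same sample (positives);
    all other pairs in the batch act as negatives.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # (B, B) cosine-similarity matrix
    labels = torch.arange(z1.size(0))            # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# For invariance, z1/z2 would come from appearance augmentations that leave gaze unchanged;
# for equivariance, multi-view pairs would additionally have their gaze rotations aligned.
z_a, z_b = torch.randn(32, 128), torch.randn(32, 128)
loss = info_nce(z_a, z_b)
```
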
Sat 10:30 a.m. - 11:30 a.m. | SecNet: Semantic Eye Completion in Implicit Field (Poster)
If we take a depth image of an eye, noise artifacts and holes significantly affect the depth values on the eye due to the specularity of the sclera. This paper aims at solving this problem through semantic shape completion. We propose an end-to-end approach to train a neural network, called SecNet (semantic eye completion network), that predicts a point cloud with accurate eye geometry coupled with the semantic labels of each point. These labels correspond to the essential eye regions, i.e., pupil, iris and sclera. In particular, our work performs implicit estimation of the query points with semantic labels, where both the semantic and occupancy predictions are trained in an end-to-end way. To evaluate the approach, we use synthetic eye-scans rendered in the UnityEyes simulator environment. Compared to the state of the art, the proposed method improves shape-completion accuracy for 3D eye-scans by 8.2%. In practice, we also demonstrate the application of our semantic eye completion for gaze estimation.
Yida Wang · Yiru Shen · David Joseph Tan · Federico Tombari · Sachin S Talathi

Sat 11:30 a.m. - 12:00 p.m. | Use of Machine Learning and Gaze Tracking to Predict Radiologists’ Decisions in Breast Cancer Detection (Keynote)
Breast cancer is the most common cancer for women worldwide. In 2020 the GLOBOCAN estimated that 2,261,419 new breast cancer cases were diagnosed around the world, which corresponds to 11.7% of all cancers diagnosed. Moreover, incidence of this disease has been slowly rising in the US, by about 0.5% per year since the mid-2000s. The most commonly used imaging modality to screen for breast cancer is digital mammography, but it has low sensitivity (particularly in dense breasts) and a relatively high number of false positives. Perhaps because of this, there has historically been much interest in developing computer-assisted tools to aid radiologists in the task of detecting early cancerous lesions. In 2022, breast imaging is a major area of interest for the developers of Artificial Intelligence (AI), and applications to detect breast cancer account for 14% of all AI applications on the medical imaging market.
Claudia Mello-Thoms

Sat 12:00 p.m. - 12:30 p.m. | Gabriel A. Silva Keynote (Keynote)
TBD
Gabriel Silva

Sat 12:30 p.m. - 1:30 p.m. | Breakout Session (discussion within onsite small groups on preselected themes)

Sat 1:30 p.m. - 1:45 p.m. | Coffee

Sat 1:45 p.m. - 1:57 p.m. | Contrastive Representation Learning for Gaze Estimation (Spotlight)
Self-supervised learning (SSL) has become prevalent for learning representations in computer vision. Notably, SSL exploits contrastive learning to encourage visual representations to be invariant under various image transformations. The task of gaze estimation, on the other hand, demands not just invariance to various appearances but also equivariance to geometric transformations. In this work, we propose a simple contrastive representation learning framework for gaze estimation, named Gaze Contrastive Learning (GazeCLR). GazeCLR exploits multi-view data to promote equivariance and relies on selected data augmentation techniques that do not alter gaze directions for invariance learning. Our experiments demonstrate the effectiveness of GazeCLR for several settings of the gaze estimation task. In particular, our results show that GazeCLR improves the performance of cross-domain gaze estimation, yielding a relative improvement of up to 17.2%. Moreover, the GazeCLR framework is competitive with state-of-the-art representation learning methods for few-shot evaluation.
Swati Jindal · Roberto Manduchi

Sat 1:57 p.m. - 2:09 p.m. | SecNet: Semantic Eye Completion in Implicit Field (Spotlight)
If we take a depth image of an eye, noise artifacts and holes significantly affect the depth values on the eye due to the specularity of the sclera. This paper aims at solving this problem through semantic shape completion. We propose an end-to-end approach to train a neural network, called SecNet (semantic eye completion network), that predicts a point cloud with accurate eye geometry coupled with the semantic labels of each point. These labels correspond to the essential eye regions, i.e., pupil, iris and sclera. In particular, our work performs implicit estimation of the query points with semantic labels, where both the semantic and occupancy predictions are trained in an end-to-end way. To evaluate the approach, we use synthetic eye-scans rendered in the UnityEyes simulator environment. Compared to the state of the art, the proposed method improves shape-completion accuracy for 3D eye-scans by 8.2%. In practice, we also demonstrate the application of our semantic eye completion for gaze estimation.
Yida Wang · Yiru Shen · David Joseph Tan · Federico Tombari · Sachin S Talathi

Sat 2:45 p.m. - 3:00 p.m. | Wrap Up - Closing Remarks (Closing)