Projects

Supervised/Self-Supervised Scene Depth Estimation
Depth is a fundamental element for a machine to perceive and understand a scene. While estimating the depth of a scene from a single image is a natural ability for humans, devising computational models that accurately predict depth from RGB data is a challenging task. Recent works have achieved remarkable performance thanks to powerful deep learning models. In a supervised setting, assuming the availability of a large training set of RGB-depth pairs, monocular depth prediction is cast as a pixel-level regression problem and Convolutional Neural Network (CNN) architectures are typically employed. In an unsupervised/self-supervised setting, videos are usually used to learn the matching between temporal video frames via the predicted scene depth, camera pose and/or intrinsic parameters.

Projects and Codes
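As a rough illustration of the self-supervised formulation, the sketch below (PyTorch-style; the tensor shapes and the simple L1 photometric objective are illustrative choices, not taken from a specific released model) warps a source frame into the target view using predicted depth, relative camera pose and intrinsics, then penalises the photometric difference:

```python
# Minimal self-supervised depth sketch: warp a source frame into the target
# view with predicted depth, relative pose and intrinsics, then compare photometrically.
import torch
import torch.nn.functional as F

def photometric_loss(target, source, depth, T_src_tgt, K):
    """target, source: (B,3,H,W); depth: (B,1,H,W); T_src_tgt: (B,4,4); K: (B,3,3)."""
    B, _, H, W = target.shape
    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().view(3, -1).to(target.device)
    # Back-project target pixels to 3D camera points: X = depth * K^-1 * pix.
    cam = torch.inverse(K) @ pix.unsqueeze(0).expand(B, -1, -1)           # (B,3,HW)
    cam = cam * depth.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=cam.device)], dim=1)  # (B,4,HW)
    # Transform into the source view and project with the intrinsics.
    src = (T_src_tgt @ cam_h)[:, :3]                                      # (B,3,HW)
    proj = K @ src
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                        # (B,2,HW)
    # Normalise to [-1,1] for grid_sample and reconstruct the target view.
    u = 2 * uv[:, 0] / (W - 1) - 1
    v = 2 * uv[:, 1] / (H - 1) - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(source, grid, padding_mode="border", align_corners=True)
    return (warped - target).abs().mean()
```

In practice the depth, pose and (optionally) intrinsics come from separate prediction networks trained jointly by minimising this reprojection error over consecutive frames.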
Joint Deep Multi-task Learning for Scene Understanding

Typical deep multi-task learning approaches have mainly focused on the final prediction level, either employing cross-modal interactions to mutually refine the tasks [18, 51] or designing more effective joint-optimization objective functions. These methods directly learn to predict the tasks from the same input training data. However, simultaneously learning different tasks with distinct loss functions complicates network optimization, and it is generally not easy to obtain good generalization for all the tasks. We explore multi-task deep learning by jointly learning a set of intermediate auxiliary tasks ranging from low level to high level, and the predictions from these intermediate auxiliary tasks are then utilized as multi-modal input for the final tasks.

Projects and Codes
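A schematic sketch of this idea, assuming edges and surface normals as the intermediate auxiliary tasks and depth plus segmentation as the final tasks (all module names and channel sizes are illustrative placeholders, not the published architecture):

```python
import torch
import torch.nn as nn

class MultiModalMultiTaskNet(nn.Module):
    """Auxiliary predictions (low-to-high level) are fed back as extra input
    channels (multi-modal input) for the final task heads."""
    def __init__(self, feat_ch=256, num_classes=21):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        # Intermediate auxiliary tasks: edges (1 channel), surface normals (3 channels).
        self.edge_head = nn.Conv2d(feat_ch, 1, 1)
        self.normal_head = nn.Conv2d(feat_ch, 3, 1)
        # Final tasks consume backbone features concatenated with auxiliary predictions.
        in_ch = feat_ch + 1 + 3
        self.depth_head = nn.Conv2d(in_ch, 1, 1)
        self.seg_head = nn.Conv2d(in_ch, num_classes, 1)

    def forward(self, x):
        feats = self.backbone(x)
        edges = self.edge_head(feats)
        normals = self.normal_head(feats)
        fused = torch.cat([feats, edges, normals], dim=1)
        return {"edges": edges, "normals": normals,
                "depth": self.depth_head(fused), "seg": self.seg_head(fused)}
```

Training then supervises both the auxiliary and the final outputs, so the intermediate predictions act as learned multi-modal cues rather than only as regularisers.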
Supervised/Self-Supervised Object Localization

We target object localization in weakly supervised or open-world (zero-shot) settings, using only image labels or large-scale cross-modal pretrained models (e.g. CLIP) as supervision for training. The trained model is able to provide discriminative localization for multiple different object categories in the same image (i.e., a sample containing multiple different object categories). A further research objective is to learn object localization and detection from image-caption pairs alone, purely through large-scale contrastive cross-modal learning.

Projects and Codes
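A minimal sketch of how a cross-modal pretrained model can provide localization cues: given dense visual features already projected into the shared image-text embedding space (how they are extracted from the pretrained model is deliberately left abstract here) and one text embedding per category prompt, per-category localization maps are obtained from cosine similarity at every spatial position. This is an assumption-laden illustration, not any specific method's pipeline.

```python
import torch
import torch.nn.functional as F

def clip_style_localization_maps(dense_feats, text_embeds):
    """dense_feats: (B, C, H, W) visual features in the shared embedding space.
    text_embeds: (K, C), one embedding per category prompt.
    Returns per-category localization maps of shape (B, K, H, W) in [0, 1]."""
    B, C, H, W = dense_feats.shape
    v = F.normalize(dense_feats.view(B, C, -1), dim=1)        # (B, C, HW)
    t = F.normalize(text_embeds, dim=1)                       # (K, C)
    sim = torch.einsum("kc,bcn->bkn", t, v).view(B, -1, H, W) # cosine similarity per pixel
    # Min-max normalise each map so it can be thresholded into a localization mask.
    flat = sim.view(B, sim.shape[1], -1)
    mn = flat.min(-1, keepdim=True).values
    mx = flat.max(-1, keepdim=True).values
    return ((flat - mn) / (mx - mn + 1e-6)).view_as(sim)
```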
Human and Head Video Generation

Human/head video generation aims to produce a synthetic human face video that contains the identity and pose information from a given source image and a driving video, respectively. Existing works for this task rely heavily on 2D representations (e.g. appearance and motion) learned from the input images. However, dense 3D facial geometry (e.g. pixel-wise depth) is extremely important for this task, as it is particularly beneficial for generating accurate 3D face structures and for distinguishing noisy information from a possibly cluttered background.

Projects and Codes
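A high-level sketch of depth-conditioned reenactment, where a depth network supplies pixel-wise geometry that is concatenated with RGB before appearance and motion encoding. All sub-modules here are placeholders standing in for real networks; this is an interface sketch, not the released model.

```python
import torch
import torch.nn as nn

class DepthAwareReenactor(nn.Module):
    """source image -> identity features, driving frame -> motion code,
    both augmented with predicted pixel-wise depth as a geometric cue."""
    def __init__(self, depth_net, appearance_enc, motion_enc, generator):
        super().__init__()
        self.depth_net = depth_net            # RGB -> (B,1,H,W) depth (placeholder)
        self.appearance_enc = appearance_enc  # RGB-D source -> identity features (placeholder)
        self.motion_enc = motion_enc          # RGB-D driving -> motion representation (placeholder)
        self.generator = generator            # (identity, motion) -> synthesized frame (placeholder)

    def forward(self, source_img, driving_frame):
        src_d = self.depth_net(source_img)
        drv_d = self.depth_net(driving_frame)
        identity = self.appearance_enc(torch.cat([source_img, src_d], dim=1))
        motion = self.motion_enc(torch.cat([driving_frame, drv_d], dim=1))
        return self.generator(identity, motion)
```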
Deep Visual SLAM

End-to-end deep visual SLAM is an important research domain with a wide range of applications in computer vision and robotics. Visual SLAM involves high-level scene understanding problems that integrate data and tasks from both 2D and 3D perspectives. In 2D, common tasks include object detection, segmentation and tracking based on image or video data. In 3D, we need to deal with tasks such as 2.5D depth estimation, 3D point cloud regression, and spatio-temporal map construction and localization. These different tasks and data sources can be integrated into a single deep framework, ultimately contributing to a joint end-to-end high-level scene understanding framework.

Projects and Codes
Joint Video Object Detection and Tracking

Multiple object tracking and video object detection are fundamental tasks in intelligent scene understanding. We explore building a joint framework that models the interaction between these two tasks at both the feature and the prediction levels, so that they can effectively refine each other through inherent spatial and temporal relationships. Our work targets deriving scene depth information from RGB data, and thus does not require additional depth sensors beyond standard RGB cameras. One of our works explored using scene geometry for video object detection in CNNs. Instead of estimating accurate 3D geometry, we derive and utilize scene-specific geometry and enforce the convolutional operations to be conditioned on object scales and positions, leading to geometry-aware deep learning.

Projects and Codes
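One simple way to realise a geometry-conditioned convolution is to blend parallel branches with different dilation rates using per-pixel weights predicted from a coarse depth map, so the effective receptive field follows the expected object scale. The sketch below is an illustrative realisation under that assumption, not the exact published operator.

```python
import torch
import torch.nn as nn

class GeometryAwareConv(nn.Module):
    """Depth-conditioned convolution: branch weights are predicted per pixel
    from a coarse depth map aligned with the feature map."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d) for d in dilations)
        self.gate = nn.Conv2d(1, len(dilations), 3, padding=1)  # depth -> branch weights

    def forward(self, feats, depth):
        # feats: (B,C,H,W); depth: (B,1,H,W) normalised scene depth.
        w = torch.softmax(self.gate(depth), dim=1)                # (B,D,H,W)
        outs = torch.stack([b(feats) for b in self.branches], 1)  # (B,D,C,H,W)
        return (w.unsqueeze(2) * outs).sum(dim=1)                 # (B,C,H,W)
```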
Statistical Deep Learning

Another focus of my current research is statistics-based structured representation learning, considering both performance and computational complexity, together with large-scale graph-based deep learning models and theory. Probabilistic graph models have been widely studied to model structural and interdependent data in traditional non-deep machine learning. However, applying probabilistic models to understand deep learning in a principled manner, and developing novel theoretical statistical inference strategies to further improve the performance of deep learning, are still in their infancy. A deep network can essentially be treated as a large-scale graph, and research on statistical graph theory with deep graph networks would be remarkably beneficial to the development of new deep learning technologies. We conduct experiments on different challenging tasks, i.e. scene parsing on Cityscapes, and instance segmentation and object detection on COCO, with different strong backbone networks, demonstrating the generalisability and effectiveness of deep graph models.

Projects and Codes
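As a minimal sketch of graph-style reasoning over deep features (generic and assumption-based, not a specific paper's layer): spatial CNN features are pooled onto a small set of graph nodes, information is propagated among nodes through a learned adjacency, and the result is projected back with a residual connection.

```python
import torch
import torch.nn as nn

class GraphReasoning(nn.Module):
    """Pixel features -> graph nodes -> message passing -> back to pixels."""
    def __init__(self, channels, num_nodes=32):
        super().__init__()
        self.assign = nn.Conv2d(channels, num_nodes, 1)           # pixel-to-node assignment
        self.adj = nn.Linear(num_nodes, num_nodes, bias=False)    # learned node-to-node graph
        self.update = nn.Linear(channels, channels)               # node feature update

    def forward(self, x):
        B, C, H, W = x.shape
        a = torch.softmax(self.assign(x).view(B, -1, H * W), dim=-1)   # (B,N,HW)
        nodes = a @ x.view(B, C, -1).transpose(1, 2)                    # (B,N,C)
        mixed = self.adj(nodes.transpose(1, 2)).transpose(1, 2)         # propagate across nodes
        nodes = torch.relu(self.update(mixed))                          # (B,N,C)
        out = a.transpose(1, 2) @ nodes                                  # (B,HW,C)
        return x + out.transpose(1, 2).view(B, C, H, W)                  # residual refinement
```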
Cross-domain Image Translation

Semantic-guided scene generation is a hot research topic covering several mainstream research directions, including cross-view image translation and semantic image synthesis. The cross-view image translation task is essentially an ill-posed problem due to the large ambiguity in the generation when only a single RGB image is given as input. To alleviate this problem, recent works such as SelectionGAN generate the target image based on an image of the scene and several novel semantic maps. Adding a semantic map allows the model to learn the correspondences in the target view with appropriate object relations and transformations. On the other hand, the semantic image synthesis task aims to generate a photo-realistic image from a semantic map. With this useful semantic information, existing methods on both tasks achieve promising performance in scene generation. However, one can still observe unsatisfying results, especially in the generation of local scene structure and details as well as small-scale objects.

Projects and Codes
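For concreteness, a minimal conditional-generation baseline for this setting: the generator consumes the source-view image concatenated with a one-hot semantic map of the target view. This simple concatenation scheme is an illustrative sketch only, not SelectionGAN's multi-channel attention selection mechanism.

```python
import torch
import torch.nn as nn

class SemanticGuidedGenerator(nn.Module):
    """Target-view image from a source image plus a target-view semantic map."""
    def __init__(self, sem_classes=20, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + sem_classes, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 3, 3, padding=1), nn.Tanh())   # output image in [-1, 1]

    def forward(self, source_img, target_semantic_onehot):
        # source_img: (B,3,H,W); target_semantic_onehot: (B,sem_classes,H,W)
        return self.net(torch.cat([source_img, target_semantic_onehot], dim=1))
```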