
Research Blog


Updated: Apr 19




 

Abstract


Traditional deep-learning techniques for image reconstruction often demand extensive training datasets, which might not always be readily available. In response to this challenge, methods not requiring pre-trained models have been developed, leveraging the training of networks to reverse-engineer the physical principles behind image creation. In this context, we introduce an innovative approach with our untrained Res-U2Net model for phase retrieval. This model allows us to extract phase information, crucial for detecting alterations on an object's surface. We can use this information to create a mesh model to represent the object's three-dimensional structure visually. Our study evaluates the effectiveness of the Res-U2Net model in phase retrieval tasks, comparing its performance with that of the UNet and U2Net models, specifically using images from the GDXRAY dataset.


Fig.1. 3D phase retrieval: (a) 2D X-ray test image, (b) 2D phase retrieval estimate, and (c) resulting 3D mesh.
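The untrained strategy can be sketched in a few lines: the network weights are fitted directly to a single measurement by propagating the estimated phase through a physics-based forward model. The sketch below is a minimal illustration assuming a Fourier-magnitude forward model, with a generic CNN net standing in for Res-U2Net; all names are illustrative rather than the paper's exact implementation.

# Minimal sketch of untrained phase retrieval (deep-image-prior style).
# Assumes a Fourier-magnitude forward model; `net` is any CNN standing
# in for Res-U2Net. Illustrative only, not the paper's implementation.
import torch
import torch.nn.functional as F

def forward_model(phase):
    """Propagate a unit-amplitude field carrying the estimated phase and
    return the simulated intensity measurement."""
    field = torch.exp(1j * phase)
    return torch.abs(torch.fft.fft2(field)) ** 2

def fit_untrained(net, measurement, steps=2000, lr=1e-3):
    """Fit the network weights to a single measurement; no dataset needed."""
    z = torch.randn(1, 1, 440, 440)           # fixed random network input
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        phase = net(z)                         # network output = phase estimate
        loss = F.mse_loss(forward_model(phase), measurement)
        loss.backward()                        # gradients flow through the physics
        opt.step()
    return net(z).detach()                     # final 2D phase estimate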


Method

Overview of the proposed architecture for Phase Retrieval:



Fig.2. Res-U2Net architecture: (a) U2Net model configuration, based on a multi-scale sequence of Res-UNet models; (b) Res-UNet model: the encoder extracts features using convolutional layers (Conv2D) with batch normalization, ReLU activation (ResBlock), and spatial resolution reduction via max pooling (MaxPooling2D). A decoder then assigns phases to the features by upsampling with transpose convolutions (Conv2DTranspose) and skip connections. Residual connections link the encoder and decoder layers to improve training performance. Finally, a convolutional layer produces the 1×440×440 segmentation mask that forms the network output.
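The building blocks in the caption can be made concrete with a compact PyTorch sketch; channel counts and depth here are illustrative and do not reproduce the exact Res-U2Net configuration.

# Sketch of the Res-UNet building blocks described above (PyTorch).
# Channel counts and depth are illustrative, not the exact configuration.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Conv2D + batch norm + ReLU, twice, with a residual shortcut."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1)    # match channels for the shortcut
    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class ResUNet(nn.Module):
    """Encoder (ResBlock + max pooling) and decoder (transpose conv)
    joined by skip connections, ending in a 1x1 output convolution."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = ResBlock(1, 32), ResBlock(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = ResBlock(64, 32)              # 64 = upsampled 32 + skip 32
        self.out = nn.Conv2d(32, 1, 1)           # 1x1 conv -> output mask
    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))
        return self.out(d)                       # e.g. 1 x 440 x 440 output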


BibTeX

@article{OsorioQuero:24,
  author    = {Carlos Osorio Quero and Daniel Leykam and Irving Rondon Ojeda},
  journal   = {J. Opt. Soc. Am. A},
  keywords  = {Biomedical imaging; Computational imaging; Fluorescence lifetime imaging; Imaging techniques; Inverse design; Phase retrieval},
  number    = {5},
  pages     = {766--773},
  publisher = {Optica Publishing Group},
  title     = {Res-U2Net: untrained deep learning for phase retrieval and image reconstruction},
  volume    = {41},
  month     = {May},
  year      = {2024},
  url       = {https://opg.optica.org/josaa/abstract.cfm?URI=josaa-41-5-766},
  doi       = {10.1364/JOSAA.511074}
}

 





 

Abstract

Capturing 3D details of human pose and body shape from just one monocular image poses a significant challenge in computer vision. Traditional approaches rely on RGB images, which are limited by changes in lighting and obstructions. However, advances in imaging technology have led to novel methods like single-pixel imaging (SPI), which overcome these obstacles. SPI, especially in the near-infrared (NIR) spectrum, excels at detecting 3D human poses: NIR light can penetrate clothing and is less affected by lighting changes than visible light, offering a dependable way to capture accurate body shape and pose information, even in challenging environments. In this research, we investigate using an SPI camera operating in the NIR spectrum with time-of-flight (TOF) technology at 850 and 1550 nm wavelengths. This setup is designed to identify humans in low-light conditions. We employ the Vision Transformer (ViT) model to recognize and extract human features, integrating them into a 3D body model called SMPL-X through deep-learning-based 3D body shape regression. To test the effectiveness of NIR-SPI for 3D image reconstruction, we created a lab environment that mimics night conditions, allowing us to explore the potential of NIR-SPI as a vision sensor in outdoor night settings. By analyzing the data from this experiment, we aim to showcase NIR-SPI's capabilities as a powerful tool for nighttime human detection and for capturing precise 3D human body pose and shape.


Human modeling


Using parametric human models, such as SMPL-X, allows for a concise representation of human shapes by utilizing shape and pose parameters to encode variations [6]. The SMPL-X model offers various advantages:


  • It disentangles the human shape and pose, allowing independent analysis and control of each.

  • It avoids directly modeling rugged and twisted shapes, which can pose difficulties for neural-network-based methods, by using a skinning process to model deformation.

  • It is differentiable and can be easily integrated with neural networks.

For this research, we used SMPL-X as the underlying representation for modeling 3D humans; a minimal usage sketch follows.
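As an illustration, the official smplx Python package exposes these shape and pose parameters directly. Below is a minimal sketch; the SMPL-X model files must be downloaded separately, and the 'models/' path is a placeholder.

# Minimal sketch of driving the SMPL-X parametric model with the
# official `smplx` package; 'models/' is a placeholder path.
import torch
import smplx

model = smplx.create('models/', model_type='smplx', gender='neutral')

betas = torch.zeros(1, 10)         # shape parameters (body proportions)
body_pose = torch.zeros(1, 63)     # pose parameters: 21 joints x 3 (axis-angle)
global_orient = torch.zeros(1, 3)  # root orientation

# Shape and pose are disentangled: editing `betas` leaves the pose
# untouched, and vice versa. The forward pass is differentiable, so
# these parameters can be regressed by a neural network.
output = model(betas=betas, body_pose=body_pose,
               global_orient=global_orient, return_verts=True)
vertices = output.vertices  # (1, 10475, 3) mesh vertices
joints = output.joints      # 3D joint locations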


Proposed Method


The process used to obtain the 3D human model from NIR-SPI imaging involves several steps, each applying a different computer vision technique to reconstruct a 3D human pose from a single low-resolution image. Here is a detailed explanation of each step:


  • Capture a single-pixel low-resolution image of the person. The image contrast is adjusted to extract the basic shape, and the background is then removed using U2Net, a deep-learning model that accurately segments the foreground and background of an image. This segmentation isolates the person from the background and yields a silhouette image, which shows only the person's outline without any surface or texture detail.

  • Apply ViT over the silhouette image to identify one of four human poses: lying, bending, sitting, or standing. Once the pose is identified, it is used to generate a 3D human pose with the VIBE method, a deep-learning model that estimates the 3D pose of a human from a single image or video.

  • Finally, we reconstruct the human body shape and pose in 3D space. As discussed above, this can be done using a tool such as SMPL-X. A sketch of the full pipeline is shown below.
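The pipeline can be summarized in Python as follows. Background removal is illustrated with the U2Net-based rembg package; classify_pose and regress_smplx_params are hypothetical stand-ins for the ViT classification and VIBE/SMPL-X regression stages.

# High-level sketch of the reconstruction pipeline described above.
# `classify_pose` and `regress_smplx_params` are hypothetical stand-ins
# for the ViT classifier and VIBE/SMPL-X regression stages.
import numpy as np
from rembg import remove           # U2Net-based background removal

def extract_silhouette(nir_spi_image: np.ndarray) -> np.ndarray:
    """Adjust contrast, then strip the background with U2Net to keep
    only the person's outline."""
    stretched = np.clip((nir_spi_image - nir_spi_image.min()) /
                        (np.ptp(nir_spi_image) + 1e-8), 0, 1)
    return remove((stretched * 255).astype(np.uint8))

def classify_pose(silhouette):
    """Placeholder for the ViT classifier over four classes:
    lying, bending, sitting, standing."""
    return 'standing'

def regress_smplx_params(silhouette, pose_label):
    """Placeholder for VIBE-style regression of SMPL-X parameters
    (pose theta, shape beta, camera s, R, T)."""
    return {'theta': None, 'beta': None, 'camera': None}

def reconstruct_3d(nir_spi_image):
    silhouette = extract_silhouette(nir_spi_image)
    pose_label = classify_pose(silhouette)
    return regress_smplx_params(silhouette, pose_label)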

Fig.1. Overview of the proposed network architecture, which takes NIR single-pixel imaging as input and outputs a 3D body reconstruction based on SMPL-X shape and pose parameters. The network consists of three main modules: (i) NIR-SPI-based image acquisition. (ii) Feature extraction using deep learning: the background is removed from the NIR-SPI image to obtain the silhouette. (iii) 3D pose estimation using a regression-based approach: the silhouette image is used to obtain gait features (shape estimation), which, together with skeleton joint features, are used to identify the pose with ViT. These features pre-define the SMPL-X parameters (pose θ, shape β, and camera s, R, T), which are fed to the off-the-shelf SMPL-X model to obtain the reconstructed 3D human mesh.


Fig.2. Capture human pose imaging at a distance of 1 m: (a) Capture NIR-SPI imaging of human pose standing, sitting, and bending, (b) silhouette image, and (c) 3D human pose regression based on SMPL-X model.


We evaluated the proposed method for obtaining a 3D human model from NIR-SPI imaging on four human poses: lying, bending, sitting, and standing. The best accuracy was achieved in the sitting position, around 91%, as reflected in the V2V and MPJPE errors. The results demonstrate the effectiveness of the proposed approach, with limitations in hand positioning due to the low contrast of the NIR-SPI image. Nevertheless, core body detection yields an accurate estimate of the person's 3D pose in both qualitative and quantitative evaluations. These findings highlight the potential of the proposed approach for 3D human modeling from a single low-resolution image.
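For reference, the two reported error metrics have standard definitions, sketched below (alignment conventions vary between papers):

# Standard definitions of the two errors reported above (a sketch;
# alignment conventions vary between papers).
import numpy as np

def mpjpe(pred_joints, gt_joints, root=0):
    """Mean per-joint position error: average Euclidean distance over
    joints after aligning both skeletons at the root joint."""
    pred = pred_joints - pred_joints[root]
    gt = gt_joints - gt_joints[root]
    return np.linalg.norm(pred - gt, axis=-1).mean()

def v2v(pred_vertices, gt_vertices):
    """Vertex-to-vertex error: mean distance between corresponding
    mesh vertices, sensitive to shape as well as pose."""
    return np.linalg.norm(pred_vertices - gt_vertices, axis=-1).mean()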


In comparison, the presented SMPL-X model captures the body, face, and hands jointly, and the SMPL-X approach fits the model to a single NIR-SPI image and 2D joint detections. The results of this work demonstrate the expressivity of SMPL-X in capturing bodies, hands, and faces from NIR-SPI images. However, we observed that the bending and lying poses presented the highest V2V and MPJPE error levels, indicating limitations in the pose parameters θ; we therefore recommend implementing a compensation model in future applications. Future work may involve the development of a dataset of in-the-wild SMPL-X fits and the direct regression of SMPL-X parameters from NIR-SPI images.


BibTeX

@article{OsorioQuero:24,
  author    = {Carlos Osorio Quero and Daniel Durini and Jose Rangel-Magdaleno and Jose Martinez-Carranza and Ruben Ramos-Garcia},
  journal   = {J. Opt. Soc. Am. A},
  keywords  = {Image metrics; Imaging techniques; Machine vision; Single pixel imaging; Three dimensional imaging; Three dimensional reconstruction},
  number    = {3},
  pages     = {414--423},
  publisher = {Optica Publishing Group},
  title     = {Enhancing 3D human pose estimation with NIR single-pixel imaging and time-of-flight technology: a deep learning approach},
  volume    = {41},
  month     = {Mar},
  year      = {2024},
  url       = {https://opg.optica.org/josaa/abstract.cfm?URI=josaa-41-3-414},
  doi       = {10.1364/JOSAA.499933}
}
 

Carlos Osorio


Abstract

Recent progress in edge computing has been significantly influenced by innovations that have introduced specialized accelerators for achieving high levels of hardware parallelism. This has been particularly impactful in the field of computational imaging (CI), where GPU acceleration plays a crucial role, notably in reconstructing 2D images through techniques such as single-pixel imaging (SPI). Within SPI, algorithms such as compressive sensing (CS), deep learning, and Fourier transformation are essential for the reconstruction of 2D images. These algorithms benefit immensely from parallel processing, which enhances performance by shortening processing times. To optimize GPU performance, strategies such as memory usage optimization, loop unrolling, the creation of efficient kernels to minimize operations, the use of asynchronous operations, and an increase in the utilization of active threads and warps are employed. In laboratory settings, the integration of embedded GPUs is key to improving the efficiency of algorithms on System-on-Chip GPUs (SoC-GPUs). This study emphasizes the accelerated optimization of the fast Hartley transform (FHT) for 2D image reconstruction on the Nvidia Xavier platform. Through the application of various parallelism techniques using PyCUDA, we have managed to triple the processing speed, approaching real-time processing capabilities.


Fig.1. Improving the 2D FHT reconstruction process through diverse optimization methods, with a particular emphasis on leveraging CUDA to parallelize the algorithms.


Our team has developed a range of optimization methods for enhancing the Fast Hartley Transform (FHT) algorithm on the NVIDIA Xavier NX GPU. We've introduced two distinct kernel types: one to improve the calculation of the Inverse FHT (IKFHT) and another for managing the digit reversal process. Our tests reveal notable execution time disparities for the FHT algorithm on different computing platforms. On a Central Processing Unit (CPU), the FHT algorithm's execution time is 103 ms, while on the GPU, it drops to 45 ms. This significant reduction highlights the GPU's superior parallel processing capabilities, making it exceptionally suited for the FHT algorithm's requirements.
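As an illustration of the digit-reversal kernel idea, here is a minimal PyCUDA bit-reversal permutation, the kind of reordering applied before in-place transform butterflies; the paper's actual kernels may differ.

# Sketch of a PyCUDA bit-reversal permutation kernel, the kind used to
# reorder samples before in-place FHT butterflies. Illustrative only.
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void bit_reverse(const float *src, float *dst, int log2n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n = 1 << log2n;
    if (i >= n) return;
    // reverse the low log2n bits of the index
    unsigned r = __brev((unsigned)i) >> (32 - log2n);
    dst[r] = src[i];
}
""")
bit_reverse = mod.get_function("bit_reverse")

log2n = 10
n = 1 << log2n
src = gpuarray.to_gpu(np.arange(n, dtype=np.float32))
dst = gpuarray.empty_like(src)
bit_reverse(src.gpudata, dst.gpudata, np.int32(log2n),
            block=(256, 1, 1), grid=((n + 255) // 256, 1))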

Furthermore, we've adopted a pre-indexing technique to boost the FHT algorithm's efficiency further. Pre-indexing pre-calculates specific frequently used values, thus shortening each algorithm iteration. With pre-indexing on the CPU, execution time is nearly halved to 43 ms, a substantial improvement. On the GPU, this technique reduces processing time to 34 ms, demonstrating its effectiveness in decreasing the computational effort needed for tasks that require quick recalculations and adjustments in image gradients. Memory consumption remains modest on both platforms, though the GPU exhibits slightly higher memory usage; pre-processing significantly lowers memory demand, particularly on the GPU, where it falls from 1.92% to 1.24%.

Lastly, we measured the speedup, an essential indicator of the performance improvement from using the GPU over the CPU. Without pre-processing, the GPU is roughly 2.28 times faster than the baseline CPU. With pre-processing, the speedup over the baseline CPU rises to 2.39 times for the CPU implementation and 3 times for the GPU. This underscores the advantages of combining the GPU with pre-processing methods when optimizing the FHT algorithm.
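The pre-indexing idea itself is simple to illustrate: precompute the Hartley cas kernel once and reuse it across transforms, so the per-frame cost drops to a table lookup and a product (a sketch; the table layout is illustrative, not the paper's exact scheme).

# Illustration of the pre-indexing idea: precompute the Hartley "cas"
# kernel once instead of re-evaluating cos/sin every iteration.
import numpy as np

def build_cas_table(n):
    """cas(2*pi*k*m/n) = cos(.) + sin(.), precomputed for all (k, m)."""
    k = np.arange(n)
    angles = 2 * np.pi * np.outer(k, k) / n
    return np.cos(angles) + np.sin(angles)

def dht(signal, cas_table):
    """Discrete Hartley transform as a table lookup + matrix product,
    so repeated transforms pay the trig cost only once."""
    return cas_table @ signal

n = 256
table = build_cas_table(n)           # computed once, reused per frame
x = np.random.rand(n)
X = dht(x, table)
# Round trip: the DHT is its own inverse up to a 1/n factor.
x_rec = dht(X, table) / n
assert np.allclose(x, x_rec)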
