Single image processing time is critical for generating 2D digital images and video streams in SPI for real-time applications. To reduce the time required by the 2D image processing system, to generate an image, we determined the minimum number of illumination patterns equivalent to a compression factor of 2% to be able to adapt the OMP-Batch Algorithm [68] based on hardware architecture and a GPU. The OMP-Batch Algorithm [68], compared with other compressed sensings (CS) Algorithms such as OMP and OMP-Cholesky, presents improvements in processing time because they do not make operations of the inverse matrix. On the contrary, using the definition of the Gram-matrix, G=ATA, initially, pre-calculating Gram-matrix G with initial projection i, where an initial projection as p0=ATy is defined. This allows finding the new atom Ai, which will be used as stop criteria for the system solution calculation. For implementing the OMP-Batch algorithm, we defined four kernels that must operate in parallel.
The input information is defined in the first kernel, the Gram-matrix (G= ATA) is generated, and the residual norm r is calculated.
The second kernel was used to calculate the new atom Ai.
The third kernel was used to calculate the Cholesky decomposition, where the matrix N×N was defined to calculate the matrix L using.
The fourth kernel was used to calculate the matrix space-vector product and the normal error e.
Implementation of the OMP-Batch algorithm on GPU
For the implementation of the OMP-Batch algorithm on GPU, we must take into consideration some details, such as: The Cholesky factorization scheme, the memory layout, the matrix batched-matrix products, normalized columns of the dictionary, packed representation (library Python), and efficient batched argmax
Memory layout: The main bottleneck in the process is matrix multiplication. To get more speed, transpose the matrix columns, as one column can span into a single line in the memory. After that, we apply the operation of matrix multiplication.
Cholesky factorization scheme: The OMP-Batch uses an inverse Cholesky factorization method based on the precomputedATy, and ATA, in the parallel computing environment. We seek to calculate Lk, which we do not need to store in each iteration. We only need to perform are iteration
to obtain the new projections w.
Matrix batched-matrix products: A less-known fact is that we can sometimes get higher performance. If we apply the matrix times batched-matrix product. We can see that this matrix times batched-
vector product is equivalent to a simple matrix-matrix product, with which we can use the library BLAS (Basic Linear Algebra Subprograms) most efficiently.
Normalized columns of the used dictionary: The main approach to speeding up the algorithm is to minimize the number of operations to perform each iteration. Many algorithms assume normalized columns in A such that correlation projection⟨an, rk−1⟩/∥an∥ turns into a simple projection ⟨an, rk−1⟩=[ATrk−1]. This is valid since the algorithm is invariant to the column norm, as it will be divided out in the correlation step, and lastly, the least-squares estimateˆyk=AAT y=A,argminx∈Rk∥y−Ax∥ is unique. Thus, it is also invariant to column scaling. For the final estimate ˆx from ATAˆx=ATy, one should not use the pre-normalized A, or at least the scale ˆxappropriately (by the reciprocal of the column norm) to account for this.
Packed representation: Specialized calls for batched matrix-matrix multiplication, batched Cholesky factorization and batched triangular system solving for GPU exist. In our approach, we use the PyTorch library.
Efficient batched argmax: A core part of the OMP loop is argmax, which can be performed on a batch by k= arg maxK|p|. One issue is that abs (|p|) creates an intermediate M×N array in the first pass, and then a second pass over this is needed to get the ”argmax.” The ”argmax” line takes 5-25% of the
GPU computation time. To optimize this, we applied libraries of CuPy to increase the speed-up of computation. The function arg maxK is compiled in C++ and called in Python as an external function.
By implementing the OMP-Batch on GPU, we can reach an acceleration of x27 compared to running the same algorithm on the CPU (see Fig. 1). In table 1, we compared the complexity and memory requirements of the different OMP reconstruction test algorithms.
Table 1. Comparison of complexities and memory requirements in the k-th iteration,
where the dictionary Φ is MxN.
Fig.1.Comparative reconstruction time for running the OMP-Batch algorithm on different platforms: CPU Intel i5 OMP Basic, CPU-1 with Linux OMP basic, GPU-VO using Pytourch GPU reach an acceleration x 17 on Jetson Nano, GPU-V1 using PyTorch GPU compile in C++ the function arg maxK with an acceleration x 27 on Jetson Nano
Комментарии