Generating Interactive Computer Graphics
How it works
Nvidia, a major GPU maker, recently demonstrated a way to generate 3D graphics without a traditional polygon-based rendering engine. Instead of using 3D models and textures, the researchers generate three-dimensional virtual worlds by training a Generative Adversarial Network on video. The trained network is then fed segmentation maps derived from real videos, and it generates the features of each image based on the segmentation map. Nvidia calls this technology video-to-video synthesis. Nvidia has published a video [6] that shows researchers using a steering wheel to navigate the AI-generated environment.
This is achieved by using the game engine Unreal Engine 4 to create the segmentation maps, which change in response to user input. The network then takes the Unreal Engine-generated segmentation maps and produces a photorealistic image in real time. Nvidia was able to render life-like cities in real time by using its tensor cores. Tensor cores are specialized units on an Nvidia GPU that perform matrix multiply-accumulate operations, which Nvidia claims can run up to 40 times faster than a traditional CPU. Tensor cores currently ship in Nvidia GeForce RTX graphics cards for consumers, as well as in the recent Tesla and Quadro product lines for professionals.
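As a rough illustration of the kind of operation tensor cores accelerate, the following PyTorch sketch runs a half-precision matrix multiply-accumulate on the GPU; the sizes are arbitrary and it assumes a CUDA-capable card. This is not Nvidia's code, only the general pattern the hardware is built for.

import torch

# A matrix multiply-accumulate in half precision (FP16), the kind of operation tensor
# cores are built to speed up. Sizes are arbitrary; this assumes a CUDA-capable GPU.
device = torch.device("cuda")
a = torch.randn(4096, 4096, device=device, dtype=torch.float16)
b = torch.randn(4096, 4096, device=device, dtype=torch.float16)
c = torch.zeros(4096, 4096, device=device, dtype=torch.float16)

d = torch.addmm(c, a, b)   # D = C + A @ B; on RTX/Tesla hardware this maps onto tensor cores
print(d.shape)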
Researchers at Nvidia set out to create a mapping function that converts frames of an input segmentation map into photorealistic video, and they used a Generative Adversarial Network to achieve their results. According to skymind.ai, a Generative Adversarial Network is a neural network architecture in which two neural networks compete with each other [1]. A GAN, for short, consists of one network called the generator and a separate network called the discriminator, which evaluates the generated data to judge how realistic it is.
The generator network of a GAN typically produces fake data from random noise as input. The discriminator then analyzes that output to decide whether it looks real or fake. In this case, a segmentation-mapped video is fed into the generator side of the GAN, and the generated video becomes an input to the discriminator side, which classifies it as either fake or real. Each half of the GAN is trained by holding the other network constant: for example, to train the generator you hold the discriminator's weights fixed.
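A minimal sketch of this alternating training scheme, using toy fully connected networks and made-up dimensions (this is not Nvidia's code, just the generic generator/discriminator loop described above):

import torch
import torch.nn as nn

# Toy generator and discriminator; real vid2vid networks are convolutional and far larger.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_batch = torch.rand(32, 784)           # stand-in for real data
real_lbl, fake_lbl = torch.ones(32, 1), torch.zeros(32, 1)

for step in range(100):
    # --- Train the discriminator (generator held constant) ---
    noise = torch.randn(32, 64)
    fake_batch = G(noise).detach()         # detach so no gradients flow into G
    loss_d = bce(D(real_batch), real_lbl) + bce(D(fake_batch), fake_lbl)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- Train the generator (discriminator held constant) ---
    noise = torch.randn(32, 64)
    loss_g = bce(D(G(noise)), real_lbl)    # generator tries to fool the discriminator
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()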
Nvidia's GAN architecture for vid2vid, the program that takes an input segmentation map and generates photorealistic video from it, is based on a coarse-to-fine architecture. The lower-resolution generator network has two inputs, a semantic map and previous video footage, which is fed in one frame at a time. After the data passes through the network there are three outputs: a flow map, a mask, and an intermediate image (a frame of video). The high-resolution generator architecture wraps around the lower-resolution network. Semantic maps and previous images are downsampled so they can be fed into the lower-resolution GAN, and the outputs of the lower-resolution network are then sent into the last part of the high-resolution network, a set of residual blocks that output the final video.
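In the vid2vid paper these three outputs are combined by warping the previous output frame with the flow map and then blending the warped frame with the hallucinated intermediate image using the mask. A rough sketch of that compositing step, with invented tensor shapes and no claim to match Nvidia's exact implementation:

import torch
import torch.nn.functional as F

def composite_frame(prev_frame, flow, mask, hallucinated):
    """Blend a flow-warped previous frame with a newly hallucinated frame.

    prev_frame:   (1, 3, H, W) previous output frame
    flow:         (1, 2, H, W) predicted optical flow (in pixels)
    mask:         (1, 1, H, W) soft blending mask in [0, 1]
    hallucinated: (1, 3, H, W) intermediate image produced by the generator
    """
    n, _, h, w = prev_frame.shape
    # Build a sampling grid and shift it by the predicted flow (grid_sample expects [-1, 1]).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0)   # (1, H, W, 2)
    grid = grid + flow.permute(0, 2, 3, 1)                      # displace by the flow field
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    warped = F.grid_sample(prev_frame, grid, align_corners=True)

    # The mask decides, per pixel, how much comes from warping vs. hallucination.
    return (1.0 - mask) * warped + mask * hallucinated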
The PatchGAN architecture was implemented for the image discriminator part of the GAN; it only penalizes structure at the scale of image patches [2]. For example, the discriminator tries to decide whether a given portion of the image, such as a wheel on a car or the leaves in a tree, looks fake or not. The per-patch real/fake decisions are then averaged across the image to compute the final output error.
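A minimal PatchGAN-style discriminator sketch, assuming a 3-channel input and illustrative layer sizes: because it is fully convolutional, its output is a grid of per-patch real/fake scores rather than a single scalar, and those scores are averaged into one value.

import torch
import torch.nn as nn

# Fully convolutional PatchGAN-style discriminator: each output "pixel" scores one
# receptive-field-sized patch of the input as real or fake. Layer sizes are illustrative.
patch_d = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(256, 1, kernel_size=4, stride=1, padding=1),   # one score per patch
)

image = torch.randn(1, 3, 256, 256)        # stand-in for a generated frame
patch_scores = patch_d(image)              # grid of patch-level real/fake logits
loss = torch.sigmoid(patch_scores).mean()  # per-patch decisions averaged into one value
print(patch_scores.shape, loss.item())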
Nvidia also developed a multi-scale video discriminator that operates on temporally downsampled versions of both the real and generated videos. At the finest scale, the discriminator takes K consecutive frames of the input video sequence as input. At the coarsest scale, it subsamples frames by a factor of K, which Nvidia describes as skipping K - 1 frames of the input video sequence between samples. By using multiple temporal scales, Nvidia was able to achieve good short-term and long-term consistency in the GAN's output.
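A small illustration of the two temporal scales described above, using an invented 9-frame clip and K = 3: consecutive frames at the finest scale versus frames sampled K apart at the coarsest scale.

# Illustration of the two temporal scales with K = 3 and a made-up 9-frame clip.
K = 3
frames = list(range(9))                 # frame indices 0..8 of a video clip

fine_scale = frames[:K]                 # K consecutive frames: [0, 1, 2]
coarse_scale = frames[::K][:K]          # every K-th frame (skip K - 1): [0, 3, 6]

print(fine_scale, coarse_scale)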
Nvidia trains the GAN for sequential video synthesis by solving a learning objective over the mapping function F [7]. LI is the GAN loss from the image discriminator DI, LV is the GAN loss on consecutive frames from the video discriminator DV, and LW(F) is the flow estimation loss, multiplied by a weight that was set to 10 throughout Nvidia's experiments. Nvidia then solves this objective for F so that the GAN trains accurately on video footage.
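Putting those terms together, the full objective from the vid2vid paper can be reconstructed as a min-max problem over F and the two discriminators (reconstructed here from the definitions above, so the notation may differ slightly from the paper):

\min_{F} \Big( \max_{D_I} \mathcal{L}_I(F, D_I) + \max_{D_V} \mathcal{L}_V(F, D_V) \Big) + \lambda_W \mathcal{L}_W(F), \qquad \lambda_W = 10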
The segmentation mask is filled in with GAN-generated imagery by dividing the scene into foreground and background based on the object classes in the segmentation mask. The network determines which objects belong to the foreground and which belong to the background: objects such as buildings and roads are categorized as background, while cars and pedestrians are categorized as foreground. The network then hallucinates the content needed to fill in the segmentation maps it is given. Nvidia found that backgrounds can be hallucinated more accurately than foreground objects, since they have less motion and are easy to generate using warping. The foreground is harder to generate, which results in the smudging effect seen in Nvidia's demo video; foreground objects are hard to estimate because they occupy a small portion of the image and move a lot throughout a video.
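A rough sketch of that foreground/background split, assuming a label map of integer class IDs and a hypothetical set of foreground classes (the actual IDs depend on the dataset's label definitions):

import numpy as np

# Hypothetical class IDs; real datasets such as Cityscapes define their own label set.
FOREGROUND_CLASSES = {24, 25, 26, 27, 28}   # e.g. person, rider, car, truck, bus

def split_foreground_background(label_map: np.ndarray):
    """Return boolean masks for foreground (moving objects) and background (static scene)."""
    foreground = np.isin(label_map, list(FOREGROUND_CLASSES))
    background = ~foreground
    return foreground, background

# Example with a made-up 4x4 label map.
labels = np.array([[7, 7, 26, 26],
                   [7, 7, 26, 26],
                   [11, 11, 24, 7],
                   [11, 11, 7, 7]])
fg, bg = split_foreground_background(labels)
print(fg.astype(int))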
Many datasets were used to test the video-to-video network. The highest-resolution dataset, called Cityscapes, contains videos captured in German cities [7]. The segmentation maps for Cityscapes were generated by a trained semantic segmentation network, and the ground-truth flow for the dataset was extracted by an optical flow estimator called FlowNet2. Nvidia produced its AI-generated graphics with a network trained on only 2,975 videos.
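As an illustration of what "extracting ground-truth flow" means, the sketch below estimates dense optical flow between two consecutive frames. It uses OpenCV's Farneback estimator only as a simple stand-in, since FlowNet2 is a separate learned model with its own setup; the file name is hypothetical.

import cv2

# Estimate dense optical flow between two consecutive frames of a video.
# Farneback is used here only as a stand-in for the FlowNet2 network mentioned above.
cap = cv2.VideoCapture("city_drive.mp4")       # hypothetical input clip
ok1, frame1 = cap.read()
ok2, frame2 = cap.read()

gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# flow[y, x] = (dx, dy) displacement of each pixel from frame1 to frame2
flow = cv2.calcOpticalFlowFarneback(gray1, gray2, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
print(flow.shape)   # (H, W, 2)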
Video-to-video synthesis is also able to take a video of someone dancing, extract the pose, and map those poses onto another person the network was trained on. The network can create videos containing poses that the initial training video didn't have, which is particularly interesting because it means the network can hallucinate object configurations absent from the training set. This may become an issue for surveillance in the future, as someone could tamper with video footage and swap the person shown in it.
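A sketch of the pose-extraction step that feeds such a motion-transfer pipeline, using the MediaPipe pose estimator as an off-the-shelf stand-in (the input file name is hypothetical, and this is not the extractor Nvidia used):

import cv2
import mediapipe as mp

# Extract per-frame body keypoints from a dance video; keypoint sequences like these are
# the kind of intermediate representation a pose-to-video network is conditioned on.
cap = cv2.VideoCapture("dancer.mp4")            # hypothetical input video
with mp.solutions.pose.Pose(static_image_mode=False) as pose:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            # Each landmark has normalized (x, y, z) coordinates plus a visibility score.
            keypoints = [(lm.x, lm.y) for lm in result.pose_landmarks.landmark]
cap.release()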
The modified GAN video generator that Nvidia uses has three scales: 512 x 256, 1024 x 512, and 2048 x 1024. Training starts with the network generating only a few frames of video; after the discriminator part of the GAN examines the generated frames, the weights are recalculated. Nvidia uses different scales of video in order to predict objects in the image more accurately. The GAN was trained on eight GPUs, with four GPUs running the discriminator and four GPUs running the generator. Due to the complexity of training the network, it took Nvidia 10 days to train it at 2048 x 1024 resolution.
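A sketch of how that generator/discriminator GPU split might look in PyTorch; the placeholder networks, device indices, and DataParallel wrapper here are illustrative assumptions, not Nvidia's actual training code, and the snippet requires a machine with eight GPUs to run.

import torch
import torch.nn as nn

# Placeholder networks standing in for the vid2vid generator and discriminator.
generator = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(64, 3, 3, padding=1))
discriminator = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.LeakyReLU(0.2),
                              nn.Conv2d(64, 1, 3, padding=1))

# Put the generator on GPUs 0-3 and the discriminator on GPUs 4-7.
generator = nn.DataParallel(generator.to("cuda:0"), device_ids=[0, 1, 2, 3])
discriminator = nn.DataParallel(discriminator.to("cuda:4"), device_ids=[4, 5, 6, 7])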
The DGX-1 station is a product that Nvidia built for AI research, and it was used to train the network [3]. It draws an astounding 3.5 kilowatts and costs $149,000 per system. Eight 32 GB Tesla V100 GPUs power the system, for a combined deep learning throughput of 1,000 teraflops (125 TFLOPS per V100 times 8 GPUs) [3]. To feed the GPUs, the system uses two 20-core Xeon CPUs with access to a whopping 512 GB of DDR4-2133 memory. Nvidia claims the DGX-1 is 140 times faster than an equivalent CPU-only server without GPUs.
Better yet, Nvidia released an even faster system this quarter (Q4 2018), the DGX-2H [4]. The DGX-2H uses the same Tesla V100 GPUs, but instead of eight GPUs it has 16. Nvidia claims it has 10 times the performance of the DGX-1 in GPU-memory-limited applications, such as neural networks for language translation. The DGX-2H also provides unified GPU memory through NVSwitch, which lets the GPUs communicate directly with each other at 300 GB/s per GPU, for a combined GPU-to-GPU bandwidth of an astounding 2.4 terabytes per second. The DGX-2H has 1.5 TB of system memory and 0.5 TB of GPU memory, which lets it crunch large neural networks; sadly, the system pulls 12,000 watts from the wall.
As hardware improves, even with Moore's law slowing down, researchers will continue to find more efficient ways to extract more computing power. Once Nvidia tunes its video-to-video network for the DGX-2H system, it will be interesting to see whether the network's fidelity can be improved without increasing training time.
I find artificial-intelligence-generated graphics particularly interesting, as I've always been astounded by the yearly increase in the visual quality of video games. Once GANs become more adept at creating graphics, development time will speed up as developers create 3D content with generative adversarial networks. Graphics cards may end up being one big tensor core rendering graphics, rather than CUDA cores or stream processors rendering 3D polygons.