Jetson Containers - Samples
Introduction
The last post covered how to set up the infrastructure needed to begin building applications for the NVIDIA Jetson platform.
For our first application, we’re going to build the JetPack samples and package them into a minimal container. This involves a multi-stage build, which allows us to isolate our build environment from the deployment environment and leverage the base images created for the device.
Note: We’re going to start off with Xavier (jax) but you can also run Nano/TX2 builds here (just substitute nano-dev or tx2 for the text jax). Both UI and Terminal options are listed for each step. For the UI commands it is assumed that the jetson-containers repository is open in VS Code.
Note: This build uses a lot of RAM and CPU. You can lower the impact by changing RUN make -j$(($(nproc) - 1)) in jetson-containers/docker/examples/samples/Dockerfile to use a lower job count, which will reduce RAM usage as well. Replace $(($(nproc) - 1)) with a smaller number; the expression calculates the number of available processors and subtracts one.
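To see what the expression evaluates to on your build machine:
echo $(($(nproc) - 1))
Or hard-code a smaller job count in the Dockerfile (the 2 here is just an illustrative value):
RUN make -j2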
Building the Samples
UI:
Press Ctrl+Shift+B, select make <build samples>, select build-32.2-jax-jetpack-4.2.1-samples, and press Enter.
Terminal:
~/jetson-containers$ make build-32.2-jax-jetpack-4.2.1-samples
Which runs:
docker build --build-arg IMAGE_NAME=l4t \
-t l4t:32.2-jax-jetpack-4.2.1-samples \
-f /home/<user>/dev/jetson-containers/docker/examples/samples/Dockerfile \
.
The Dockerfile with those variables defined is very straightforward. It compiles the cuda-10.0 samples using the devel image, then, leveraging the multi-stage build, creates the final image from the runtime image, installs the dependencies needed to run the samples, and copies the compiled samples from the devel-based container into the final image.
The dependent libraries installed on top of the runtime image were found by running ldd against each binary to identify its missing dependencies. For your applications this process will be simpler, as you’ll be building far fewer than the 130+ samples compiled here.
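As a sketch of that process, run inside a container based on the runtime image with the samples copied in (the path below matches this image’s layout):
for bin in /samples/bin/aarch64/linux/release/*; do
    ldd "$bin" 2>/dev/null | grep "not found"
done | sort -u
Each line of output is a shared library the runtime image is missing; map those back to apt packages and add them to the install list.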
FROM l4t:32.2-jax-jetpack-4.2.1-devel as builder
WORKDIR /usr/local/cuda-10.0/samples
RUN make -j$(($(nproc) - 1))
FROM l4t:32.2-jax-jetpack-4.2.1-runtime
# Prereqs
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
freeglut3 \
libegl1 \
libx11-dev \
libgles2-mesa \
libgl1-mesa-glx \
libglu1-mesa \
libgomp1 \
libxi-dev \
libxmu-dev \
openmpi-bin \
&& \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN mkdir samples
COPY --from=builder /usr/local/cuda-10.0/samples/ /samples
WORKDIR /samples/bin/aarch64/linux/release/
The final l4t:32.2-jax-jetpack-4.2.1-samples image adds 1.11 GB, of which 0.86 GB is the sample binaries themselves:
Component | Size |
---|---|
l4t:32.2-jax-jetpack-4.2.1-devel | 5.67 GB |
l4t:32.2-jax-jetpack-4.2.1-runtime | 1.21 GB |
/samples | 859.7 MB |
l4t:32.2-jax-jetpack-4.2.1-samples | 2.32 GB |
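You can reproduce these numbers yourself: docker images reports overall image sizes, and docker history breaks an image down layer by layer:
docker images l4t
docker history l4t:32.2-jax-jetpack-4.2.1-samples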
The l4t:32.2-jax-jetpack-4.2.1-samples image has a separate layer for the external dependencies, and that layer stays cached should the sample binaries need updating later. This layering is by design: large layers, and layers that change infrequently, come first, and the more volatile pieces are laid in on top. This gives us smaller updates to deployments.
If you are not running the builds on your device, push the l4t:32.2-jax-jetpack-4.2.1-samples image to your container registry so that the device can pull it down, or take a look at pushing images to devices.
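For example, with <registry> standing in for your registry’s address (Docker Hub, Azure Container Registry, etc.):
docker tag l4t:32.2-jax-jetpack-4.2.1-samples <registry>/l4t:32.2-jax-jetpack-4.2.1-samples
docker push <registry>/l4t:32.2-jax-jetpack-4.2.1-samples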
Running the Samples
From here we need a device for the first time (if you have been building the images on the x86_64 host). If your images are pushed to your container registry, this run command will pull them automatically.
If running remotely, set the DOCKER_HOST variable in the .env file to proxy the run to the device: DOCKER_HOST=ssh://<user>@<device>.local.
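A quick way to confirm the proxying works (assuming SSH access to the device is already configured) is to ask the remote daemon for its hostname:
DOCKER_HOST=ssh://<user>@<device>.local docker info --format '{{.Name}}'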
Note: You may need to log into your container registry on the device in order to pull the images if you haven’t built on or pushed to the device.
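For example, run on the device itself (or prefixed with the DOCKER_HOST setting above), with <registry> again standing in for your registry’s login server:
docker login <registry>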
UI:
Press Ctrl+Shift+B, select make <run samples>, select run-32.2-jax-jetpack-4.2.1-samples, and press Enter.
Terminal:
~/jetson-containers$ make run-32.2-jax-jetpack-4.2.1-samples
Which runs:
docker run \
--rm \
-it \
--device=/dev/nvhost-ctrl \
--device=/dev/nvhost-ctrl-gpu \
--device=/dev/nvhost-prof-gpu \
--device=/dev/nvmap \
--device=/dev/nvhost-gpu \
--device=/dev/nvhost-as-gpu \
--device=/dev/nvhost-vic \
l4t:32.2-jax-jetpack-4.2.1-samples
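The --device flags map the Jetson’s GPU device nodes (the nvhost-* interfaces and the nvmap memory manager) into the container; without them, CUDA inside the container cannot see the hardware.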
From here you’ll be given a command prompt. Let’s run the device’s "Hello, World!", deviceQuery:
root@e1283970319e:/samples/bin/aarch64/linux/release# ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Xavier"
CUDA Driver Version / Runtime Version 10.0 / 10.0
CUDA Capability Major/Minor version number: 7.2
Total amount of global memory: 15700 MBytes (16462909440 bytes)
( 8) Multiprocessors, ( 64) CUDA Cores/MP: 512 CUDA Cores
GPU Max Clock rate: 1500 MHz (1.50 GHz)
Memory Clock rate: 1377 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
We see that the device is working and that the GPU is visible in the container. From here you can run various samples, though some of them will fail if they try to create windows.
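For example, a compute-only sample such as matrixMul should pass, while a windowed one such as simpleGL will fail since no display is available in the container:
./matrixMul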
If you want to see the samples’ UI, we have a couple of options, but we’ll have to save that for a future post.