RPMS for TORQUE with Nvidia GPU Support

Getting TORQUE built into RPMs with GPU support was considerably more frustrating than I expected. I’m really not a fan of TORQUE: I often run into silly problems or serious limitations, and pbs_sched is so simplistic that it’s not a good fit for most users. But I still have to support it, so here goes.

First we need to install CUDA. Thankfully, Nvidia has added a yum repo so this whole process has gotten a little bit easier. The Getting Started Guide has all of the info, but it’s a bit to wade through since they tackle multiple distros. The basic process is to enable the EPEL repository, enable the Nvidia repository (install the appropriate RPM from the CUDA Downloads page), then install the cuda and gpu-deployment-kit packages with yum.

yum -y install cuda gpu-deployment-kit
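Before moving on, it can save some head-scratching to confirm that the NVML header and library actually landed where the configure flags later in this post expect them. The following is a minimal sketch, not part of the Nvidia tooling: check_nvml is a hypothetical helper name, the default paths are simply the ones used in this post, and I am assuming the header is named nvml.h and the library libnvidia-ml.so (possibly with a version suffix).

```shell
# Hypothetical helper: check that the NVML header and library exist.
# Defaults are the paths used for --with-nvml-include and --with-nvml-lib
# in this post; pass other directories as arguments if yours differ.
check_nvml() {
    inc_dir=${1:-/usr/include/nvidia/gdk}
    lib_dir=${2:-/usr/lib64/nvidia}
    status=0

    if [ -f "$inc_dir/nvml.h" ]; then
        echo "found $inc_dir/nvml.h"
    else
        echo "missing $inc_dir/nvml.h" >&2
        status=1
    fi

    # The library may carry a version suffix, so match on the prefix.
    if ls "$lib_dir"/libnvidia-ml.so* >/dev/null 2>&1; then
        echo "found libnvidia-ml in $lib_dir"
    else
        echo "missing libnvidia-ml.so* in $lib_dir" >&2
        status=1
    fi

    return $status
}
```

Run check_nvml with no arguments to test the default locations. If either file is reported missing, find where your packages put it and adjust the two configure paths to match.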

Download the source for TORQUE from Cluster Resources/Adaptive Computing. I used version 4.2.7 and all of the examples will reference this.

Untar the TORQUE source and run configure with a few options. The annoying one is --with-default-server, since omitting it makes the clients connect only to localhost instead of the actual pbs_server process. No amount of config file changes or environment settings changes this behavior.

./configure --with-default-server=head.cluster --enable-nvidia-gpus --with-nvml-lib=/usr/lib64/nvidia --with-nvml-include=/usr/include/nvidia/gdk

Now, you would think that would add all of the correct options and building would just go smoothly. NOPE! None of the GPU stuff gets added to the torque.spec file. Fancy! So edit torque.spec and look for the %configure section. It will look like:

%configure --includedir=%{_includedir}/%{name} --with-default-server=%{torque_server} \
--with-server-home=%{torque_home} %{ac_with_debug} %{ac_with_libcpuset} \
--with-sendmail=%{sendmail_path} %{ac_with_numa} %{ac_with_memacct} %{ac_with_top} \
--disable-dependency-tracking %{ac_with_gui} %{ac_with_scp} %{ac_with_syslog} \
--disable-gcc-warnings %{ac_with_munge} %{ac_with_pam} %{ac_with_drmaa} \
--disable-qsub-keep-override %{ac_with_blcr} %{ac_with_cpuset} %{ac_with_spool} %{?acflags}
%{__make} %{?_smp_mflags} %{?mflags}

Change this to:

%configure --includedir=%{_includedir}/%{name} --with-default-server=%{torque_server} \
--enable-nvidia-gpus --with-nvml-lib=/usr/lib64/nvidia --with-nvml-include=/usr/include/nvidia/gdk \
--with-server-home=%{torque_home} %{ac_with_debug} %{ac_with_libcpuset} \
--with-sendmail=%{sendmail_path} %{ac_with_numa} %{ac_with_memacct} %{ac_with_top} \
--disable-dependency-tracking %{ac_with_gui} %{ac_with_scp} %{ac_with_syslog} \
--disable-gcc-warnings %{ac_with_munge} %{ac_with_pam} %{ac_with_drmaa} \
--disable-qsub-keep-override %{ac_with_blcr} %{ac_with_cpuset} %{ac_with_spool} %{?acflags}
%{__make} %{?_smp_mflags} %{?mflags}
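If you rebuild often, the spec edit above can be scripted instead of done by hand. This is only a sketch under a few assumptions: add_gpu_flags is a hypothetical helper name, the %configure line in your torque.spec ends with a backslash as in the listing above, and the NVML paths match the ones used in this post.

```shell
# Hypothetical helper: insert the GPU flags right after the %configure line.
# Takes the spec file path as an optional argument (default: torque.spec).
add_gpu_flags() {
    spec=${1:-torque.spec}
    flags='--enable-nvidia-gpus --with-nvml-lib=/usr/lib64/nvidia --with-nvml-include=/usr/include/nvidia/gdk'

    # Do nothing if the spec has already been patched.
    if grep -q -- '--enable-nvidia-gpus' "$spec"; then
        echo "$spec already has the GPU flags"
        return 0
    fi

    # Keep a copy of the untouched spec next to the patched one.
    cp "$spec" "$spec.orig"

    # Print every line; after the %configure line (which ends in a
    # backslash), also print the flags with their own line continuation.
    awk -v flags="$flags" '
        { print }
        /^%configure .*\\$/ { print flags " \\" }
    ' "$spec.orig" > "$spec"
}
```

Run add_gpu_flags from the top of the source tree after configure has finished. Running it a second time leaves the file alone, and the original is kept as torque.spec.orig in case you want to diff the two.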

Now run make -j, then make rpm, and your RPMs will be joyfully created in ~/rpmbuild/RPMS/x86_64.
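Given how quietly the GPU options went missing earlier, I like a quick check that the build really produced packages. A small sketch, where list_built_rpms is a hypothetical helper name and the default directory is the output path mentioned above:

```shell
# Hypothetical helper: list the RPMs in the build output directory,
# or fail with a message if there are none.
list_built_rpms() {
    out_dir=${1:-$HOME/rpmbuild/RPMS/x86_64}

    # Expand the glob into the function's own positional parameters.
    set -- "$out_dir"/*.rpm

    # If nothing matched, the pattern is left unexpanded and no such file exists.
    if [ ! -e "$1" ]; then
        echo "no RPMs found in $out_dir" >&2
        return 1
    fi

    printf '%s\n' "$@"
}
```

Call list_built_rpms after make rpm finishes. A non-zero exit status means nothing was built into that directory.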
