c - crash b - bug fix e - enhancement f - new feature n - note
NOTE: the CHANGELOG file is now deprecated. Please check the release notes page
on the Adaptive Computing website. For example, the 6.0.1 release notes can be
found at:
http://docs.adaptivecomputing.com/9-0-1/releaseNotes/help.htm
6.0.0
b - TRQ-3245. Enable reporter mom to correctly handle UNKNOWN role.
b - TRQ-3242. Fix a problem where the resource string argument to the prologue
script was getting garbled.
b - TRQ-3232 Start threadpool at pbs_mom start.
b - TRQ-3117. Fix a misspelling in the number_successful tag (was number_successfull)
e - TRQ-3131. Add capability to pass environment variables to pbsdsh.
b - TRQ-3185. Create subdirs when the server attribute use_jobs_subdirs is set.
5.1.2
b - TRQ-2675. Fix small errors in suse init.d scripts.
b - TRQ-3235. Fix problem when path to error, output or execution environment
contains one or more spaces.
e - TRQ-3098. Add the ability to set a parameter exit_code_canceled_job to force all
canceled jobs to have the same exit code regardless of the state they were in
when they were canceled.
e - TRQ-2836. Make node health check run on sister nodes when configured for job
start and job end as well.
e - TRQ-2843. Add the qmgr setting dont_write_nodes_file to prevent nodes from
being edited dynamically.
f - TRQ-2897. Add the ability to adopt running processes into a job with pbs_track.
b - TRQ-3189. Never delete a running job because of a dependency.
5.1.1.2
e - TRQ-3197. Add support for RHEL7 and SLES12.
5.1.1
b - TRQ-2947. Fix a race condition on deleting jobs which are failing to start.
c - TRQ-3068. Fix a race condition where a job may be deleted while its pointer
is still in the alljobs container.
b - TRQ-2753. Fix a memory leak in generating the authoritative okclients list.
b - TRQ-2332. Fix a job dependency problem when the failover server comes up. This
only affects users running high availability.
b - TRQ-3023. Fix a bug when ALPS incorrectly returns a permanent confirmation failure.
b - TRQ-2833. Set CUDA_VISIBLE_DEVICES to only the indices for this host when
it will be set.
b - TRQ-3039. Fix a deadlock when deleting a job where other jobs have afterany dependencies
on the first job.
f - TRQ-2782. Distribute job files into subdirectories when the server attribute use_jobs_subdirs
is set to true. Default is false (do not distribute job files).
b - TRQ-3116. Make qsub only retry on transient errors.
b - TRQ-3122. Fix a problem with login_property not working correctly (cray only).
b - TRQ-3114. Fix an issue where an asynchronously started job is stuck with a
substate of starting after a failed job start.
b - TRQ-3110. Handle slot limits correctly when jobs are preempted.
e - TRQ-3095. Add the server setting disable_automatic_requeue to stop jobs from being
requeued if they experience a transient failure on the mom.
e - TRQ-2307. Fix problems where mom restarts intermittently fail.
b - TRQ-2946. Make qmgr able to handle Cray numeric node ids.
b - TRQ-2790. Make offlining cray compute nodes persist across restarts.
e - TRQ-3104. Add millisecond precision to the Torque log file
e - TRQ-2881. Add node health check error messages to a node's notes and therefore pbsnodes
output.
b - TRQ-3166. Add another safety check before killing stray jobs.
5.0.2
b - TRQ-3029. Make it so that pbs_server can't have active threads when the main
thread exits.
b - TRQ-3012. Fix memory leaks that happen each time a job is run.
b - TRQ-2966. Improved job rerun speed which had been significantly slowed down
starting in 4.2.6. Also, pbs_mom now correctly accounts for job resources when
user jobs call setsid more than once.
c - TRQ-2987. Fix a crash around job exits due to incorrect error code handling.
b - TRQ-2841. Fix some ways that max_user_queuable can become incorrect
b - TRQ-3097. Fixed a problem where failed job submissions would count against
the max_user_queuable count and could not be cleared until pbs_server
was restarted.
b - TRQ-3087. Fixed a problem where completed jobs were counted against max_user_queuable
when restarting pbs_server. Also if the max_user_queuable was set on a
queue and the number of queued jobs and completed jobs were over the
maximum then the last jobs submitted would not get loaded.
5.0.1.h2
b - Reverted a change in 5.0.0 which made it so a user could not submit a
job from a node which had been allowed using the acl_hosts list. The
change to 5.0.0 made it so the ruserok call could not be made to check
for user authorization.
5.0.1
e - TRQ-2410. Improved qstat behavior in cases where bad job IDs were referenced
in the command.
e - TRQ-2460. Two new fields were added to the accounting file for completed jobs:
total_execution_slots and unique_node_count. total_execution_slots should be
20 for a job that requests nodes=2:ppn=10. unique_node_count should be the
number of unique hosts the job occupied.
e - TRQ-2594. TORQUE now uses the Munge API rather than forking when configured
with the --enable-munge-auth option.
e - TRQ-2863. Reduced verbosity in error logging in HA environments.
e - TRQ-2868. TORQUE now allows for the modification of the output location
based on the Mother superior hostname. An environment variable ($HOSTNAME)
has been added to the job's environment.
e - TRQ-2882. Improved trqauthd error messages to be more meaningful and less
redundant.
e - TRQ-2890. Added stderr capturing when using -o option.
b - TRQ-2025. Fixed bug where giving a bad queue name to qstat -Q results in
duplicate output.
b - TRQ-2292. Fixed bug where some tasks were incorrectly listed as 0 in 'qstat -a'
when requesting specific nodes.
b - TRQ-2367. Fixed bug related to accounting records on large systems.
b - TRQ-2411. Fixed output format bug in cases where multiple job IDs are passed
into qstat.
b - TRQ-2646. Fixed bug where qsub did not process args correctly when using
a submit filter.
b - TRQ-2652. Fixed parsing bug when using hostlist ranges in qsub.
b - TRQ-2653. Fixed build bug related to newer Intel MIC libraries installing
in different locations.
b - TRQ-2730. Fixed problem where GPUs were not split between NUMA nodes. You
now need to specify which gpus belong to each node board in the mom.layout
file. A sample mom.layout file might look like:
nodes=0 gpu=0
nodes=1 gpu=1
Also please note that this only works if you use nvml. The nvidia-smi
command is not supported.
b - TRQ-2732. Fixed bug where OU files were being left in spool when job was
preempted or requeued.
b - TRQ-2759. Fixed bug where reported cput was incorrect.
b - TRQ-2760. Fixed unexpected error when running 'pbsnodes -l offline -n'.
b - TRQ-2795. Fixed bug where jobs were rejected because the max_user_queuable limit
had been reached even though no jobs were in the queue.
b - TRQ-2828. Fixed bug where 'momctl -q clearmsg' didn't clear error messages
properly.
b - TRQ-2837. Fixed bug where GPU modes were not passed to sister nodes.
b - TRQ-2852. Fixed bug while writing resources_default units to serverdb file.
b - TRQ-2885, CVE-2014-3684. Fixed issue around unauthorized termination
of processes.
b - TRQ-2890. Improved pbsdsh to better handle simultaneous use of -o and -s
options. Also fixed some problems where -o output was sometimes getting
truncated.
b - TRQ-2904. Fixed bug where TORQUE was not honoring KeepCompleted server
parameter when job_nanny was set to true.
b - TRQ-2918. Fixed problem with remote client job submission during
ruserok() calls.
b - TRQ-2919. Fixed deadlock issue when running 'qdel -p' as non-root user.
b - TRQ-2937. Fixed bug in qsub -m when TORQUE is configured --with-sendmail.
Some missing newlines were added.
b - TRQ-2956. Fixed bug where HOST_NAME_SUFFIX was no longer adding suffix to job names.
c - TRQ-2928, TRQ-2921, TRQ-2855, TRQ-2854, TRQ-2853, TRQ-2835. Fixed various crashes.
5.0.0
e - TRQ-2083. Remove job status polling from TORQUE. Have pbs_server only poll a
mom for a job's information if the information hasn't been received for 5
minutes. Otherwise, this information is communicated with the mom's status
information.
e - TRQ-2309. Have TORQUE recognize when a request to run a job specifies a node
list and directly access those nodes instead of searching linearly.
e - TRQ-1539. Condense the exec_host list to have one entry per node instead of one
entry per execution slot. The node entry contains a string specifying each
execution slot index. Also no longer display the value of exec_port in qstat.
f - TRQ-2363. Make it so that if you execute qrerun all - which previously
returned an error - it will ask for confirmation, and then place all running
jobs in a queued state without contacting the moms. This is meant to be used
only when the entire cluster has gone down and can't be contacted.
4.5.0
b - TRQ-2319. Replace two Torque functions with ones from the hwloc package.
n - Portable Hardware Locality (hwloc) package version 1.2 or higher must be
installed when using cpusets (--enable-cpuset). Previously at least version
1.1 was required.
b - TRQ-2373. Fix login nodes restricting the number of jobs to the number specified
by np=X.
e - TRQ-2044. Create a unique identifier for all jobs in TORQUE. This makes it
so that we're performing integer comparisons instead of string comparisons
for finding jobs.
4.2.9
b - TRQ-2730. Make nvml and numa-support configurations work together. The admin must
now specify which gpus are on which node board the same way it is done with mic
co-processors, adding gpu=X[-Y] to the mom.layout line for that node board.
4.2.8
b - TRQ-2501. Fix the total number of execution slots having a count that is off-by-one for
every Cray compute node.
b - TRQ-2498. Fixed a memory leak when using qrun -a (asynchronous). Also fixed a write
after free error that could lead to memory corruption.
b - Fixed the thread pool manager so it would free idle nodes. Also changed the default
thread stack sizes to a maximum of 8 MB and a minimum of 1 MB.
4.2.7
b - TRQ-2423. Fix a bug where cpusets would incorrectly be reported on mpi jobs
b - TRQ-2329. Fix a problem where nodes could be allocated to array subjobs even
after the job was deleted.
b - TRQ-2351. Fix an issue where moms that are before 4.2.6 can't run jobs if the
server is 4.2.6.
e - Made it so trqauthd cannot be loaded more than once. trqauthd opens a UNIX
domain name file to do its communication with client commands. If the
UNIX domain name file exists, trqauthd will not load. By default this file
is /tmp/trqauthd-unix. It can be configured to point to a different directory.
If trqauthd will not start and you know there are no other instances of trqauthd
running, you should delete the UNIX domain file and try again.
b - TRQ-2319. Replace two Torque functions with ones from the hwloc package.
n - Portable Hardware Locality (hwloc) package version 1.2 or higher must be
installed when using cpusets (--enable-cpuset). Previously at least version
1.1 was required.
b - TRQ-2373. Fix login nodes restricting the number of jobs to the number specified
by np=X.
b - TRQ-2354. Fix an issue with potential overflow in user job counts. Also fix a
user being considered different if from a different submit host.
b - TRQ-2369. Fix a problem with pbs_mom recovering which cpu indices were in use for
jobs that were running at shutdown and still running at the time the mom restarted.
b - TRQ-2377. Jobs with future start dates were being placed in queued after being
deleted if they were deleted before their start date and keep_completed kept them
around long enough. Fix this.
c - TRQ-2347. Fix a segfault around re-sending batch requests.
b - TRQ-2270. Fix some problems with TORQUE continuing to have nodes in a free state
when the host is down.
b - TRQ-2395. Fix a problem when running jobs on non-Cray nodes reporting to a pbs
server running in cray enabled mode.
n - TRQ-2299. Make it so that the reporter mom doesn't fork to send its update.
4.2.6.1.h1
b - TRQ-2395. Fix a problem when running jobs on non-Cray nodes reporting to a pbs
server running in cray enabled mode.
4.2.6.1
b - TRQ-2351. Fix an issue where moms that are before 4.2.6 can't run jobs if the
server is 4.2.6.
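e - Made it so trqauthd cannot be loaded more than once. trqauthd opens a UNIX
domain name file to do its communication with client commands. If the
UNIX domain name file exists, trqauthd will not load. By default this file
is /tmp/trqauthd-unix. It can be configured to point to a different directory.
If trqauthd will not start and you know there are no other instances of trqauthd
running, you should delete the UNIX domain file and try again.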
4.2.6
b - TRQ-2273. Job start time is hard coded to 5 minutes. If the prolog takes longer
than that to run the job will be requeued without killing the prolog. This
b - TRQ-2395. Fix a problem when running jobs on non-Cray nodes reporting to a pbs
server running in cray enabled mode.
n - TRQ-2299. Make it so that the reporter mom doesn't fork to send its update.
b - TRQ-2111. Fix a rare case of running jobs being deleted without having their
resources freed.
b - TRQ-2208. Stop having pbs_mom use trqauthd when it is checkpointing a job.
e - TRQ-2022. Make pbs_mom capable of handling either naming convention for cpuset
files, those with the 'cpuset.' prefix and those without.
b - TRQ-2259. Fix a problem for multi-node jobs: vmem was being stored in mem and
vice versa from the sisters.
b - TRQ-2280. Save properties added to cray compute nodes in the nodes file if the
file is overwritten by pbs_server.
4.2.5
e - Remove the mom asking for a job status before sending an obit to pbs_server
for a job that has exited. This is unnecessary overhead.
b - TRQ-2097. Make it so that the proper errno is stored for non-blocking sockets
at connect time.
b - TRQ-2111. Make queued jobs never hold node resources.
c - TRQ-2155. Fix a crash in trqauthd.
e - TRQ-2058. Add the option of having the pbs_mom daemon read the mom hierarchy
file instead of having to get it from pbs_server. To do this, copy the
hierarchy to mom_priv/mom_hierarchy.
e - TRQ-2058. Add the -n option to pbs_server, telling pbs_server not to send a
hierarchy over the network unless it is requested by pbs_mom.
e - TRQ-2020. Add the option of setting properties (features) for cray compute
nodes in the nodes file. Syntax: node_id cray_compute property_name.
4.2.4
b - TRQ-1802. Make the environment variable $PBS_NUM_NODES accurate for multi-req
jobs.
e - TRQ-1832. Add the ability to add a login_property to a job at the queue level
by setting required_login_property on the queue.
e - TRQ-1925. Make pbs_mom smart enough to reserve extra memory nodes for non-numa
configured TORQUE when more memory is requested than reserved.
e - TRQ-1923. Make job aborts for a mother superior not recognizing the job a bit
more intelligent - if the job has been reported in the last 180 seconds in the
mom's status update don't abort it.
b - TRQ-1934. Ask for canonical hostnames on the default address family without
specifying for uniformity in the code.
b - TRQ-2003. For cray fix a miscalculation of nppn and width when mppdepth is
provided for the job.
e - TRQ-1833. Optimize starting jobs by not internally tracking the jobid for each
execution slot used by the job. Reduce string buildup and manipulation in other
internal places as well. Job start for large jobs has been optimized to be up
to 150X faster according to internal testing.
b - TRQ-2030. Fix an ALPS 1.2 bug with labels on nodes. In 1.2 labels would be
repeated like this: labelnamelabelname... Cray only.
b - TRQ-1914. Fix after type dependencies not being removed from arrays.
b - TRQ-2015. Fix a problem where pbs_mom processes get stuck in a defunct state when
doing a qrerun on a job. qrerun is not required to make this happen. Just the
action of requeuing a running job on the mom causes this to happen.
4.2.3
b - TRQ-1653. Arrays depending on non-array jobs were broken. Fix this.
b - Add retries on transient failures to setuid and seteuid calls. TRQ-1541.
e - Add support for qstat -f -u <user>. This results in qstat -f output for only
the specified user.
e - TRQ-1798. Make pbs_server calculate mppmaxnodect more accurately for Cray.
e - Add a timeout for mother superior when cleaning up a job. Instead of waiting
infinitely for sisters to confirm that a job has exited, consider the job dead
after 10 minutes. This time can be adjusted by setting $job_exit_wait_time in
the mom's config file (time in seconds). This prevents jobs from being stuck
infinitely if a compute node crashes or if a mom daemon becomes unresponsive.
TRQ-1776.
e - Add the parameter default_features to queues. TRQ-1794. The other way of adding
a feature to all jobs in a queue (setting resources_default.neednodes) is
circumvented if a user requests a feature in the nodes request. Setting
default_features overcomes this issue.
b - If privileged ports are disabled, make pbs_moms not check if incoming connections
from mother superior are on privileged ports. TRQ-1669.
c - TRQ-1784, bugzilla #231. Fix a crash for modifying arrays with qalter.
e - Add two mom config parameters: max_join_job_wait_time and resend_join_job_wait_time.
The first specifies how long pbs_mom should wait before deciding that join jobs
will never be received, and defaults to 10 minutes. The latter specifies how long
pbs_mom should wait before attempting to resend join jobs to moms that it hasn't
received replies from, and this defaults to 5 minutes. Both are specified in
seconds. Prior to this functionality mother superior would wait indefinitely
for the join job replies. Please carefully consider what these values should be
for your site and set them appropriately. TRQ-1790.
e - If an error happens communicating with one MIC, attempt to communicate with the
others instead of failing the entire routine.
e - Reintroduced the procct resource for queues which allows jobs to be managed based
on the number of procs requested. TRQ-1623
b - TRQ-1709. Fix -l gpus=X,other_things being parsed incorrectly.
b - TRQ-1639. Gpu status information wasn't being displayed correctly.
b - TRQ-1826. mppdepth is now passed correctly to the ALPS reservation.
4.2.2
b - Make job_starter work for parallel jobs as well as serial. (TRQ-1577 - thanks
to NERSC for the patch)
b - Fix one issue with being able to submit jobs to the cray while offline. TRQ-1595.
e - Make the abort and email messages for jobs more specific when they are killed
for going over a limit. TRQ-1076.
e - Add mom parameter mom_oom_immunize, making the mom immune to being killed in out
of memory conditions. Default is now true. (thanks to Lukasz Flis for this work)
b - Don't count completed jobs against max_user_queuable. TRQ-1420.
e - For mics, set the variable $OFFLOAD_DEVICES with a list of MICs to use for the
job.
b - make pbs_track compatible with display_job_server_suffix = false. The user
has to set NO_SERVER_SUFFIX in the environment. TRQ-1389
b - Fix the way we monitor if a thread is active. Before we used the id, but if the
thread has exited, the id is no longer valid and this will cause a crash. Use
pthread_cleanup functionality instead. TRQ-1745.
b - TRQ-1751. Add some code to handle a corrupted job file where the job file says it
is running but there is no exec host list. These jobs now will receive a system
hold.
b - Fixed problem where max_queuable and max_user_queuable would fail incorrectly.
TRQ-1494
b - Cray: nppn wasn't being specified in reservations. Fix this. TRQ-1660.
4.2.1
b - Fix a deadlock when submitting two large arrays consecutively, the second
depending on the first. TRQ-1646 (reported by Jorg Blank).
4.2.0
f - Support the MIC architecture. This was co-developed with Doug Johnson at
Ohio Supercomputer Center (OSC) and provides support for the Intel® MIC
architecture similar to GPU support in TORQUE.
b - Fix a queue deadlock. TRQ-1435
b - Fix an issue with multi-node jobs not reporting resources completely. TRQ-1222.
b - Make the API not retry for 5 consecutive timeouts. TRQ-1425
b - Fix a deadlock when no files can be copied from compute nodes to pbs_server.
TRQ-1447.
b - Don't strip quotes from values in scripts before specific processing. TRQ-1632
4.1.6
b - Make job_starter work for parallel jobs as well as serial. (TRQ-1577 - thanks
to NERSC for the patch, backported from 4.2.2)
b - Fix one issue with being able to submit jobs to the cray while offline. TRQ-1595.
backported from 4.2.2
4.1.5
b - For cray: make sure that reservations are released when jobs are requeued. TRQ-1572.
b - For cray: support the mppdepth directive. Bugzilla #225.
c - If the job is no longer valid after attempting to lock the array in get_jobs_array(),
make sure the array is valid before attempting to unlock it. TRQ-1598.
e - For cray: make it so you can continue to submit jobs to pbs_server even if you have
restarted it while the cray is offline. TRQ-1595.
b - Don't log an invalid connection message when close_conn() is called on 65535
(PBS_LOCAL_CONNECTION). TRQ-1557.
4.1.4
e - When in cray mode, write physmem and availmem in addition to totmem so that
Moab correctly reads memory info.
e - Specifying size, nodes, and mppwidth are all mutually exclusive, so reject
job submissions that attempt to specify more than one of these. TRQ-1185.
b - Merged changes for revision 7000 by hand because the merge was not clean. This
fixes problems with a deadlock when doing job dependencies using synccount/syncwith.
TRQ-1374
b - Fix a segfault in req_jobobit due to an off-by-one error. TRQ-1361.
e - Add the svn revision to --version outputs. TRQ-1357.
b - Fix a race condition in mom hierarchy reporting. TRQ-1378.
b - Fixed pbs_mom so epilogue will only run once. TRQ-1134
b - Fix some debug output escaping into job output. TRQ-1360.
b - Fixed a problem where server threads all get stuck in a poll. The problem
was an infinite loop created in socket_wait_for_read if poll returned -1.
TRQ-1382
b - Fix a Cray-mode bug with jobs ending immediately when spanning nodes of
different proc counts when specifying -l procs. TRQ-1365.
b - Don't fail to make the tmpdir for sister moms. bugzilla #220, TRQ-1403.
c - Fix crashes due to unprotected array accesses. TRQ-1395.
b - Fixed a deadlock in get_parent_dest_queues when the queue_parent_name
and queue_dest_name are the same. TRQ-1413. 11/7/12
b - Fixed segfault in req_movejob where the job ji_qhdr was NULL. TRQ-1416
b - Fix a conflict in the code for heterogeneous jobs and regular jobs.
b - For alps jobs, use the login nodes evenly even when one goes down. TRQ-1317.
b - Display the correct 'Assigned Cpu Count' in momctl output. TRQ-1307.
b - Make pbs_original_connect() no longer hang if the host is down. TRQ-1388.
b - Make epilogues run only once and be executed by the child and not the main
pbs_mom process. TRQ-937.
b - Reduce the error messages in HA mode from moms. They now only log errors if
no server could be contacted. TRQ-1385.
b - Fixed a seg-fault in send_depend_req. Also fixed a deadlock in depend_on_term.
TRQ-1430 and TRQ-1436
b - Fixed a null pointer dereference seg-fault when checking for disallowed types
TRQ-1408.
b - Fix a counting problem when running multi-req ALPS jobs (cray only). TRQ-1431.
b - Remove red herring error messages 'did not find work task for local request'.
These tasks are no longer created since issue_Drequest blocks until it gets the
reply and then processes it. TRQ-1423.
b - Fixed a problem where qsub was not applying the submit filter when given in the torque.cfg
file. TRQ-1446
e - When the mom has no jobs, check the aux path to make sure it is clean and
that we aren't leaving any files there. TRQ-1240.
b - Made it so that threads taken up by poll job tasks cannot consume all available
threads in the thread pool. This will make it so other work can continue if
poll jobs get stuck for whatever reason and that the server will recover. TRQ-1433
b - Fix a deadlock when recording alps reservations. TRQ-1421.
b - Fixed a segfault in req_jobobit caused by NULL pointer assignment to variable
pa. TRQ-1467
b - Fixed deadlock in remove_array. remove_array was calling get_arry with allarrays_mutex
locked. TRQ-1466
b - Fixed a problem with an end of file error when running momctl -dx. TRQ-1432.
b - Fix a deadlock in rare cases on job insertion. TRQ-1472.
b - Fix a deadlock after restarting pbs_server when it was SIGKILL'd before a
job array was done cloning. TRQ-1474.
b - Fix a Cray-related deadlock. Always lock the reporter mom before a compute
node. TRQ-1445
b - Additional fix for TRQ-1472. In rm_request on the mom pbs_tcp_timeout was
getting set to 0 which made it so the MOM would fail reading incoming data
if it had not already arrived. This would cause momctl -to fail with an
end of file message.
e - Add a safety net to resend any obits for exiting jobs on the mom that still
haven't cleaned up after five minutes. TRQ-1458.
b - Fix cray running jobs being cancelled after a restart due to jobs not being
set to the login nodes. TRQ-1482.
b - Fix a bug that using -V got rid of -v. TRQ-1457.
b - Make qsub -I -x work again. TRQ-1483.
c - Fix a potential crash when getting the status of a login node in cray mode.
TRQ-1491.
4.1.3
b - fix a security loophole that potentially allowed an interactive job to run
as root due to not resetting a value when $attempt_to_make_dir and $tmpdir
are set. TRQ-1078.
b - fix down_on_error for the server. TRQ-1074.
b - prevent pbs_server from spinning in select due to sockets in CLOSE_WAIT.
TRQ-1161.
e - Have pbs_server save the queues each time before exiting so that legacy
formats are converted to xml after upgrading. TRQ-1120.
b - Fix phantom jobs being left on the pbs_moms and blocking jobs for Cray
hardware. TRQ-1162. (Thanks Matt Ezell)
b - Fix a race condition on free'd memory when checking for orphaned alps
reservations. TRQ-1181. (Thanks Matt Ezell)
b - If interrupted when reading the terminal type for an interactive job continue
trying to read instead of giving up. TRQ-1091.
b - Fix displaying elapsed time for a job. TRQ-1133.
b - Make offlining nodes persistent after shutting down. TRQ-1087.
b - Fixed a memory leak when calling net_move. net_move allocates memory for args
and starts a thread on send_job. However, args were not getting released
in send_job. TRQ-1199
b - Changed pbs_connect to check for a server name. If it is passed in only that
server name is tried for a connection. If no server name is given then the
default list is used. The previous behavior was to try the name passed in and
the default server list. This would lead to confusion in utilities like qstat
when querying for a specific server. If the server specified was not available,
information from the remaining list would still be returned.
TRQ-1143.
e - Make issue_Drequest wait for the reply and have functions continue processing
immediately after instead of the added overhead of using the threadpool.
c - tm_adopt() calls caused pbs_mom to crash. Fix this. TRQ-1210.
b - Array element 0 wasn't showing up in qstat -t output. TRQ-1155.
b - Cores with multiple processing units were being incorrectly assigned in cpusets.
Additionally, multi-node jobs were getting the cpu list from each node in each
cpuset, also causing problems. TRQ-1202.
b - Finding subjobs (for heterogeneous jobs) wasn't compatible with hostnames that
have dashes. TRQ-1229.
b - Removed the call to wait_request from the main_loop on pbs_server. All of our communication
is handled directly and there is no longer a need to wait for an out of band
reply from a client. TRQ-1161.
e - Modified output for qstat -r. Expanded Req'd Time to include seconds and centered Elap Time
over its column.
b - Fixed a bug found at Univ. of Michigan where a corrupt .JB file would cause
pbs_server to seg-fault and restart.
b - Don't leave quotes on any arguments passed to the resource list. TRQ-1209.
b - Fix a race condition that causes deadlock when two threads are routing the same job.
b - Fixed a bug with qsub where environment variables were not getting populated with the
-v option. TRQ-1228.
b - This time for sure. TRQ-1228. When max_queuable or max_user_queuable were set it
was still possible to go over the limit. This was because a job is qualified
in the call to req_quejob but does not get inserted into the queue until svr_enquejob
is called in req_commit, four network requests later. In a multi-threaded environment
this allowed several jobs to be qualified and put in the pipeline before they
were actually committed to a queue.
b - If max_user_queuable or max_queuable were set on a queue TORQUE would not honor
the limit when filling those queues from a routing queue. This has now
been fixed. TRQ-1088.
b - Fixed seg-fault when running jobs asynchronously. TRQ-1252.
b - Fixed a bug with SIGHUP to pbs_server. The signal handler (change_logs()) does file I/O
which is not allowed for signal interruption. This caused pbs_server to be up but
unresponsive to any commands. TRQ-1250 and TRQ-1224
b - Job dependencies didn't work with display_server_suffix=false. Fixed. TRQ-1255.
b - Don't report alps reservation ids if a node is in interactive mode. TRQ-1251.
b - Only attempt to cancel an orphaned alps reservation a maximum of one time per
iteration. TRQ-1251.
b - Fix a deadlock when recording an alps reservation on the server side. Cray only.
TRQ-1272.
c - Fix mismanagement of the ji_globid. TRQ-1262.
c - Setting display_job_server_suffix=false crashed with job arrays. Fixed. bugzilla #216
b - Restore the asynchronous functionality. TRQ-1284.
e - Made it so pbs_server will come up even if a job cannot recover because of a missing
job dependency. TRQ-1287
b - Fixed a segfault in the path from do_tcp to tm_request to tm_eof. In this path we freed
the tcp channel three times. the call to DIS_tcp_cleanup was removed from tm_eof and
tm_request. TRQ-1232.
b - Fix a deadlock in logging when the machine is out of disk space. TRQ-1302.
b - Fixed a deadlock which occurs when there is a job with a dependency that is being moved
from a routing queue to an execution queue. TRQ-1294
e - Retry cleanup with the mom every 20 seconds for jobs that are stuck in an exiting state.
TRQ-1299.
b - Enabled qsub filters to be accessed from a non-default location. TRQ-1127
b - Put the ability to write the resources_used data to the accounting logs. This was in 4.1.1
and 4.1.2 but failed to make it into 4.1.3. TRQ-1329
c - Fix a double free if the same chan is stored on two tasks for a job. TRQ-1299.
b - Changed pbs_original_connect to retry a failed connect attempt
MAX_RETRIES (5) times before returning failure. This will
reduce the number of client commands that fail due to a connection
failure. TRQ-1355
b - Fix the proliferation of "Non-digit found where a digit was expected" messages, due
to an off-by-one error. TRQ-1230.
b - Fixed a deadlock caused by queue not getting released when jobs are aborted when
moving jobs from a routing queue to an execution queue. TRQ-1344.
4.1.2
e - Add the ability to run a single job partially on CRAY hardware and partially
on hardware external to the CRAY in order to allow visualization of
large simulations.
4.1.1
e - pbs_server will now detect and release orphaned ALPS reservations
b - Fixed a deadlock with nodes in stream_eof after call to svr_connect.
b - resources_used information now appears in the accounting log again
TRQ-1083 and bugzilla 198.
  b - Fixed a seg-fault found at LBNL where freeaddrinfo would crash because
of uninitialized memory.
b - Fixed a deadlock in handle_complete_second_time. We were not unlocking
when exiting svr_job_purge.
e - Added the wrappers lock_ji_mutex and unlock_ji_mutex to do the mutex locking
for all job->ji_mutex locks.
e - admins can now set the global max_user_queuable limit using qmgr. TRQ-978.
b - No longer make multiple alps reservation parameters for each alps reservation.
This creates problems for the aprun -B command.
b - Fix a problem running extremely large jobs with alps 1.1 and 1.2. Reservations
weren't correctly created in the past. TRQ-1092.
  b - Fixed a deadlock with a queue mutex caused by calling qstat -a <queue1> <queue2>
b - Fixed a memory corruption bug, double free in check_if_orphaned. To fix this
issue_Drequest was modified to always free the batch request regardless of
any errors.
b - Fix a potential segfault when using munge but not having set authorized users.
TRQ-1102
b - Added a modified version of a patch submitted by Matt Ezell for Bugzilla 207.
This fixes a seg-fault in qsub if Moab passes an environment variable without
a value.
b - fix an error in parsing environment variables with commas, newlines, etc. TRQ-1113
b - fixed a deadlock with array jobs running simultaneously with qstat.
b - Fixed qsub -v option. Variable list was not getting passed in to job environment.
TRQ-1128
b - TRQ-1116. mail is now sent on job start again.
b - TRQ-1118. Cray jobs are now recovered correctly after a restart.
b - TRQ-1109. Fixed x11 forwarding for interactive jobs. (qsub -I -X). Previous to
this fix interactive jobs would not run any x applications such as xterm, xclock,
etc.
b - TRQ-1161, Fixes a problem where TORQUE gets into a high CPU utilization condition.
      The problem was that in the function process_pbs_server_port there was no
      error returned if the call to getpeername() failed in the default case.
b - TRQ-1161. This fixes another case that would cause a thread to spin on poll
in start_process_pbs_server_port. A call to the dis function would return
      an error but the code would close the connection and return the error code which
was a value less than 20. start_process_pbs_server_port did not recognize the low
error code value and would keep calling into process_pbs_server_port.
b - qdel'ing a running job in the cray environment was trying to communicate with the
cray compute instead of the login node. This is now fixed. TRQ-1184.
b - TRQ-1161. Fixed a problem in stream_eof where a svr_connect was used to connect
to a MOM to see if it was still there. On successful connection the connection
is closed but the wrong function (close_conn) with the wrong argument (the
handle returned by svr_connect()) was used. Replaced with svr_disconnect
b - Make it so that procct is never shown to Moab or users. TRQ-872.
b - TRQ-1182. Fixed a problem where jobs with dependencies were deleted on
the restart of pbs_server.
b - TRQ-1199. Fixed memory leaks found by Valgrind. Fixed a leak when routing jobs
to a remote server, memory leak with procct, memory leak creating queues,
memory leak with mom_server_valid_message_source and a memory leak in req_track.
4.1.0
e - make free_nodes() only look at nodes in the exec_host list and not examine
all nodes to check if the job at hand was there. This should greatly speed
up freeing nodes.
f - add the server parameter interactive_jobs_can_roam (Cray only). When set to
true, interactive jobs can have any login as mother superior, but by default
      all interactive jobs will have their submit_host as mother superior
b - Fixed TRQ-696. Jobs get stuck in running state.
b - Fixed a problem where interactive jobs using X-forwarding would fail
      because TORQUE thought DISPLAY was not set. The problem was that
DISPLAY was set using lowercase internally. TRQ-1010
e - Add a hostname/address caching feature to alleviate stress on DNS.
4.0.3
b - fix qdel -p all - was performing a qdel all. TRQ-947
b - fix some memory leaks in 4.0.2 on the mom and server TRQ-944
c - TRQ-973. Fix a possibility of a segfault in netcounter_incr()
b - removed memory manager from alloc_br and free_br to solve a memory leak
b - fixes to communications between pbs_sched and pbs_server. TRQ-884
b - fix server crash caused by gpu mode not being right after gpus=x:. TRQ-948.
b - fix logic in torque.setup so it does not say successfully started when
trqauthd failed to start. TRQ-938.
b - fix segfaults on job deletes, dependencies, and cases where a batch
request is held in multiple places. TRQ-933, 988, 990
e - TRQ-961/bugzilla-176 - add the configure option --with-hwloc-path=PATH
to allow installing hwloc to a non-default location.
c - fix a crash when using job dependencies that fail - TRQ-990
e - Cache addresses and names to prevent calling getnameinfo() and getaddrinfo()
too often. TRQ-993
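      A minimal sketch of the caching idea, assuming a simple name-to-address map.
      The AddressCache class and its pluggable resolver argument are hypothetical
      names used for illustration; the changelog does not describe TORQUE's actual
      cache structure.

```python
import socket


class AddressCache:
    """Remember hostname -> address lookups so the resolver is hit once per name."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver  # real lookup function, replaceable for testing
        self._by_name = {}
        self.lookups = 0  # number of real resolver calls, for illustration

    def resolve(self, hostname):
        # Only fall through to the resolver the first time a name is seen.
        if hostname not in self._by_name:
            self.lookups += 1
            self._by_name[hostname] = self._resolver(hostname)
        return self._by_name[hostname]
```

      Repeated calls to resolve() for the same node name return the stored address
      without another resolver call, which is the load reduction the entry refers to.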
c - fix a crash around re-running jobs
  e - change so some Moab environment variables will be put into environment for
the prologue and epilogue scripts. TRQ-967.
b - make command line arguments override the job script arguments. TRQ-1033.
b - fix a pbs_mom crash when using blcr. TRQ-1020.
e - Added patch to buildutils/pbs_mkdirs.in which enables pbs_mkdirs to run
silently. Patch submitted by Bas van der Vlies. Bugzilla 199.
4.0.2
e - Change so init.d script variables get set based on the configure command.
TRQ-789, TRQ-792.
b - Fix so qrun jobid[] does not cause pbs_server segfault. TRQ-865.
b - Fix to validate qsub -l nodes=x against resources_max.nodes the same as v2.4.
TRQ-897.
b - bugzilla #185. Empty arrays should no longer be loaded and now when qdel'ed
they will be deleted.
b - bugzilla #182. The serverdb will now correctly write out memory allocated.
b - bugzilla #188. The deadlock when using job logging is resolved
b - bugzilla #184. pbs_server will no longer log an erroneous error when the 12th
job array is submitted.
  e - Allow pbs_mom to change the user's group on stderr/stdout files. Enabled by configuring
Torque with CFLAGS='-DRESETGROUP'. TRQ-908.
e - Have the parent intermediate mom process wait for the child to open the demux before
moving on for more precise synchronization for radix jobs.
e - Changed the way jobs queued in a routing queue are updated. A thread is now launched
at startup and by default checks every 10 seconds to see if there are jobs
in the routing queues that can be promoted to execution queues.
b - Fix so pbs_mom will compile when configured with --with-nvml-lib=/usr/lib and
--with-nvml-include. TRQ-926.
b - fix pbs_track to add its process to the cpuset as well. TRQ-925.
b - Fix so gpu count gets written out to server nodes file when using
--enable-nvidia-gpus. TRQ-927.
b - change pbs_server to listen on all interfaces. TRQ-923
b - Fix so "pbs_server --ha" does not fail when checking path for server.lock file. TRQ-907.
b - Fixed a problem in qmgr where only 9 commands could be completed before a failure.
Bugzilla 192 and TRQ-931
b - Fix to prevent deadlock on server restart with completed job that had a dependency.
TRQ-936.
b - prevent TORQUE from losing connectivity with Moab when starting jobs asynchronously
TRQ-918
b - prevent the API from segfaulting when passed a negative socket descriptor
b - don't allow pbs_tcp_timeout to ever be less than 5 minutes - may be temporary
b - fix pbs_server so it fails if another instance of pbs_server is already
running on same port. TRQ-914.
4.0.1
b - Fix trqauthd init scripts to use correct path to trqauthd.
b - fix so multiple stage in/out files can again be used with qsub -W
b - fix so comma separated file list can be used with qsub -W stagein/stageout.
Matches qsub documentation again.
b - Only seed the random number generator once
b - The code to run the epilogue set of scripts was removed when refactoring the
obit code. The epilogues are now run as part of post_epilogue. preobit_reply
is no longer used.
b - if using a default hierarchy and moms on non-default ports, pass that information
along in the hierarchy
e - Make pbs_server contact pbs_moms in the order in which they appear in the hierarchy
in order to reduce errors on start-up of a large cluster.
b - fix another possibility for deadlock with routing queues
  e - move some of the main loop functionality to the threadpool in order to increase
responsiveness.
e - Enabled the configuration to be able to write the path of the library directory
to /etc/ld.so.conf.d in a file named libtorque.conf. The file will be created
by default during make install. The configuration can be made to not install this
file by using the configure option --without-loadlibfile
b - Fixed a bug where Moab was using the option SYNCJOBID=TRUE which allows Moab
to create the job ids in TORQUE. With this in place if TORQUE were terminated
it would delete all jobs submitted through msub when pbs_server was restarted.
This fix recovers all jobs whether submitted with msub or qsub when pbs_server
restarts.
b - fix for where pbsnodes displays outdated gpu_status information.
  b - fix problem with '+' and segfault when using multiple node gpu requests.
b - Fixed a bug in svr_connect. If the value for func were null then the newly
created connection was not added to the svr_conn table. This was not right.
We now always add the new connection to svr_conn.
b - fix problem with mom segfault when using 8 or more gpus on mom node.
b - Fix so child pbs_mom does not remain running after qdel on slow starting job.
TRQ-860.
b - Made it so the MOM will let pbs_server know it is down after momctl -s is invoked.
e - Made it so localhost is no longer hard coded. The string comes from getnameinfo.
  b - fix a mom hierarchy error for running the moms on non-default ports
b - Fix server segfault for where mom in nodes file is not in mom_hierarchy. TRQ-873.
b - Fix so pbs_mom won't segfault after a qdel is done for a job that is still
running the prologue. TRQ-832.
b - Fix for segfault when using routing queues in pbs_server. TRQ-808
b - Fix so epilogue.precancel runs only once and only for cancelled jobs. TRQ-831.
b - Added a close socket to validate_socket to properly terminate the connection.
Moved the free of the incoming variable sock to process_svr_conn from the
beginning of the function to the end. This fixed a problem where the client
would always get a RST when trying to close its end of the connection.
b - routing to a routing queue now works again, TRQ-905, bugzilla 186
b - Fix server segfaults that happened doing qhold for blcr job. TRQ-900.
n - TORQUE 4.0.1 released 5/3/2012
4.0.0
e - make a threadpool for TORQUE server. The number of threads is
customizable using min_threads and max_threads, and idle time before
exiting can be set using thread_idle_seconds.
e - make pbs_server multi-threaded in order to increase responsiveness and scalability.
e - remove the forking from pbs_server running a job, the thread handling the request just
waits until the job is run.
e - change qdel to simply send qdel all - previously this was executed by a qstat and a qdel
of every individual job
e - no longer fork to send mail, just use a thread
e - use hwloc as the backbone for cpuset support in TORQUE (contributed by Dr. Bernd Kallies)
e - add the boolean variable $use_smt to mom config. If set to false, this skips logical
cores and uses only physical cores for the job. It is true by default.
(contributed by Dr. Bernd Kallies)
n - with the multi-threading the pbs_server -t create and -t cold commands could no longer
ask for user input from the command line. The call to ask if the user wants to continue
was moved higher in the initialization process and some of the wording changed to
reflect what is now happening.
e - if cpusets are configured but aren't found and cannot be mounted, pbs_mom will now fail to
start instead of failing silently.
e - Change node_spec from an N^2 (but average 5N) algorithm to an N algorithm with respect
to nodes. We only loop over each node once at a maximum.
e - Abandon pbs_iff in favor of trqauthd. trqauthd is a daemon to be started once that can
perform pbs_iff's functionality, increasing speed and enabling future security
enhancements
e - add mom_hierarchy functionality for reporting. The file is located in
<TORQUE_HOME>/server_priv/mom_hierarchy, and can be written to tell moms to send
updates to other moms who will pass them on to pbs_server. See docs for details
e - add a unit testing framework (check). It is compiled with --with-check and tests
are executed using make check. The framework is complete but not many tests have
been written as of yet.
  b - Made changes to IM protocol where commands were either not waiting for a reply
      or not sending a reply. Also made changes to close connections that were left
open.
b - Fix for where qmgr record_job_info is True and server hangs on startup.
e - Mom rejection messages are now passed back to qrun when possible
e - Added the option -c for startup. By default, the server attempts to send the mom
hierarchy file to all moms on startup, and all moms update the server and request
the hierarchy file. If both are trying to do this at once, it can cause a lot of
traffic. -c tells pbs_server to wait 10 minutes to attempt to contact moms that
haven't contacted it, reducing this traffic.
  e - Added mom parameter -w to reduce start times. With this parameter, pbs_mom waits to send its
first update until the server sends it the mom hierarchy file, or until 10
minutes have passed. This should reduce large cluster startup times.
3.0.5
b - fix for writing too much data when job_script is saved to job log.
b - fix for where pbs_mom would not automatically set gpu mode.
  b - fix for aligning qstat -r output when configured with -DTXT.
e - Change size of transfer block used on job rerun from 4k to 64k.
b - With nvidia gpus, TORQUE was losing the directive of what nodes it should
run the job on from Moab. Corrected.
e - add the $PBS_WALLTIME variable to jobs, thanks to a patch from Mark Roberts
n - change moab_array_compatible server parameter so it defaults to true
e - change to allow pbs_mom to run if configured with --enable-nvidia-gpus but
installed on a node without Nvidia gpus.
3.0.4
c - fix a buffer being overrun with nvidia gpus enabled
b - no longer leave zombie processes when munge authenticating.
b - no longer reject procs if it is the second argument to -l
b - when having pbs_mom re-read the config file, old servers were kept, and pbs_mom
attempted to communicate with those as well. Now they are cleared and only the
new server(s) are contacted.
b - pbsnodes -l can now search on all valid node states
e - Added functionality that allows the values for the server parameter
authorized_users to use wild cards for both the user and host portion.
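      A rough sketch of wildcard matching on both halves of a user@host entry. It
      assumes shell-style patterns ('*', '?') and uses Python's fnmatch module; the
      is_authorized helper is hypothetical, and the pattern syntax pbs_server
      actually accepts may differ.

```python
from fnmatch import fnmatchcase


def is_authorized(user, host, patterns):
    """Return True if user@host matches any 'user@host' pattern.

    Both halves of each pattern may contain shell-style wildcards.
    """
    for pattern in patterns:
        # Split once on '@' so the user and host parts are matched separately.
        user_pat, _, host_pat = pattern.partition("@")
        if fnmatchcase(user, user_pat) and fnmatchcase(host, host_pat):
            return True
    return False
```

      For example, the pattern "*@*.example.org" would admit any user from any host
      in that (made-up) domain, while "alice@*" would admit only alice from any host.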
e - Improvements in munge handling of client connections and authentication.
3.0.3
b - fix for bugzilla #141 - qsub was overwriting the path variable in PBSD_authenticate
e - automatically create and mount /dev/cpuset when TORQUE is configured but the cpuset
directory isn't there
b - fix a bug where node lines past 256 characters were rejected. This buffer has been
made much larger (8192 characters)
b - clear out exec_gpus as needed
b - fix for bugzilla #147 - recreate $PBS_NODESFILE file when restarting a blcr
checkpointed job
b - Applied patch submitted by Eric Roman for resmom/Makefile.am (Bugzilla #147)
b - Fix for adding -lcr for BLCR makefiles (Bugzilla #146)
c - fix a potential segfault when using asynchronous runjob with an array slot limit
b - fix bugzilla #135, stagein was deleting directory instead of file
b - fix bugzilla #133, qsub submit filter, the -W arguments are not all there
e - add a mom config option - $attempt_to_make_dir - to give the user the option to
have TORQUE attempt to create the directories for their output file if they don't exist
b - Fixed momctl to return an error on failure. Prior to this fix momctl always returned 0
regardless of success or failure.
e - Change to allow qsub -l ncpus=x:gpus=x which adds a resource list entry for both
b - fix so user epilogues are run as user instead of root
b - No longer report a completion code if a job is pre-empted using qrerun.
c - Fix a crash in record_jobinfo() - this is fixed by backporting dynamic strings from
4.0.0 so that all of the resizing is done in a central location, fixing the crash.
b - No longer count down walltime for jobs that are suspending or have stopped running
for any other reasons
e - add a mom config option - $ext_pwd_retry - to specify # of retries on
checking for password validity.
3.0.2
c - check if the file pointer to /dev/console can be opened. If not, don't attempt to write it
b - fix a potential buffer overflow security issue in job names and host address names
b - restore += functionality for nodes when using qmgr. It was overwriting old properties
b - fix bugzilla #134, qmgr -= was deleting all entries
e - added the ability in qsub to submit jobs requesting total gpus for job instead of gpus per node:
-l ncpus=X,gpus=Y
b - do not prepend ${HOME} with the current dir for -o and -e in qsub
  e - allow an administrator using the proxy user submission to also set the job id to be used
in TORQUE. This makes TORQUE easier to use in grid configurations.
b - fix jobs named with -J not always having the server name appended correctly
b - make it so that jobs named like arrays via -J have legal output and error file names
b - make a fix for ATTR_node_exclusive - qsub wasn't accepting -n as a valid argument
3.0.1
e - updated qsub's man page to include ATTR_node_exclusive
b - when updating the nodes file, write out the ports for the mom if needed
b - fix a bug for non-NUMA systems that was continuously increasing memory values
e - the queue files are now stored as XML, just like the serverdb
e - Added code from 2.5-fixes which will try and find nodes that did not
resolve when pbs_server started up. This is in reference to Bugzilla
bug 110.
e - make gpus compatible with NUMA systems, and add the node attribute
numa_gpu_node_str for an additional way to specify gpus on node boards
e - Add code to verify the group list as well when VALIDATEGROUPS is set in torque.cfg
b - Fix a bug where if geometry requests are enabled and cpusets are enabled, the cpuset
wasn't deleted unless a geometry request was made.
  b - Fix a race condition for pbs_mom -q, exitstatus was getting overwritten and as a result
      jobs weren't always re-queued by pbs_server, but were being deleted instead.
e - Add a configure option --with-tcp-retry-limit to prevent potential 4+ hour hangs on
pbs_server. We recommend --with-tcp-retry-limit=2
n - Changing the way to set ATTR_node_exclusive from -E to -n, in order to continue
compatibility with Moab.
b - preserve the order on array strings in TORQUE, like the route_destinations for a
routing queue
b - fix bugzilla #111, multi-line environment variables causing errors in TORQUE.
b - allow apostrophes in Mail_Users attributes, as apostrophes are rare but legal email
characters
b - restored functionality for -W umask as reported in bugzilla 115
b - Updated torque.spec.in to be able to handle the snapshot names of builds.
b - fix pbs_mom -q to work with parallel jobs
b - Added code to free the mom.lock file during MOM shutdown.
  e - Added new MOM configure option job_starter. This option will pass
      the script submitted in qsub to the executable or script provided
      as the argument to the job_starter option of the MOM configure file.
b - fixed a bug in set_resources that prevented the last resource in a list from being
checked. As a result the last item in the list would always be added
without regard to previous entries.
e - altered the prologue/epilogue code to allow root squashing
f - added the mom config parameter $reduce_prolog_checks. This makes it so TORQUE only checks
to verify that the file is a regular file and is executable.
e - allow more than 5 concurrent connections to TORQUE using pbsD_connect. Increase it to 10
b - fix a segfault when receiving an obit for a job that no longer exists
e - Added options to conditionally build munge, BLCR, high-availability, cpusets,
and spooling. Also allows customization of the sendmail path and allows for
optional XML conversion to serverdb.
b - also remove the procct resource when it is applied because of a default
c - fix a segfault when queue has acl_group_enable and acl_group_sloppy set
true and no acl_groups are defined.
3.0.0
e - serverdb is now stored as xml, this is no longer configurable.
f - added --enable-numa-support for supporting NUMA-type architectures. We
have tested this build on UV and Altix machines. The server treats the
mom as a node with several special numa nodes embedded, and the pbs_mom
reports on these numa nodes instead of itself as a whole.
f - for numa configurations, pbs_mom creates cpusets for memory as well as
cpus
e - adapted the task manager interface to interact properly with NUMA
systems, including tm_adopt
  e - Added autogen.sh to make life easier in a Makefile.in-less world.
e - Modified buildutils/pbs_mkdirs.in to create server_priv/nodes file
at install time. The file only shows examples and a link to the
TORQUE documentation.
f - added ATTR_node_exclusive to allow a job to have a node exclusively.
f - added --enable-memacct to use an extra protocol in order to
      accurately track jobs that exceed their memory limits and kill
them
e - when ATTR_node_exclusive is set, reserve the entire node (or entire
numa node if applicable) in the cpuset
n - Changed the protocol versions for all client-to-server, mom-to-server and
mom-to-mom protocols from 1 to 2. The changes to the protocol in this version
of TORQUE will make it incompatible with previous versions.
e - when a select statement is used, tally up the memory requests and mark
the total in the resource list. This allows memory enforcement for
NUMA jobs, but doesn't affect others as memory isn't enforced for
multinode jobs
e - add an asynchronous option to qdel
b - do not reply when an asynchronous reply has already been sent
e - make the mem, vmem, and cput usage available on a per-mom basis using momctl -d2
(Dr. Bernd Kallies)
e - move the memory monitor functionality to linux/mom_mach.c in order to store the
more accurate statistics for usage, and still use it for applying limits.
(Dr. Bernd Kallies)
e - when pbs_mom is compiled to use cpusets, instead of looking at all processes,
only examine the ones in cpuset task files. For busy machines (especially large
      systems like UVs) this can greatly reduce job monitoring/harvesting times.
(Dr. Bernd Kallies)
e - when cpusets are configured and memory pressure enabled, add the ability to
check memory pressure for a job. Using $memory_pressure_threshold and
$memory_pressure_duration in the mom's config, the admin sets a threshold at
which a job becomes a problem. If duration is set, the job will be killed if
it exceeds the threshold for the configured number of checks. If duration isn't
      set, then an error is logged.
(Dr. Bernd Kallies)
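      The threshold/duration rule above can be sketched as a small decision
      function. This is an illustration only: evaluate_pressure is a hypothetical
      name, and counting consecutive checks over the threshold (rather than a
      running total) is an assumption the entry does not confirm.

```python
def evaluate_pressure(samples, threshold, duration=None):
    """Return 'ok', 'log', or 'kill' for a sequence of pressure readings.

    Assumes 'duration' counts consecutive checks above the threshold.
    """
    over = 0
    action = "ok"
    for value in samples:
        if value > threshold:
            over += 1
            if duration is None:
                action = "log"   # no duration configured: only log an error
            elif over >= duration:
                return "kill"    # over the threshold for enough checks
        else:
            over = 0             # back under the threshold: reset the count
    return action
```

      With duration unset the function never returns 'kill', which mirrors the
      log-only behaviour described in the entry.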
e - change pbs_track to look for the executable in the existing path so it doesn't always
need a complete path.
(Dr. Bernd Kallies)
e - report sessions on a per numa node basis when NUMA is enabled
(Dr. Bernd Kallies)
b - Merged revision 4325 from 2.5-fixes. Fixed a problem where the -m n
      (request no mail on qsub) was not always being recognized.
e - Merged buildutils/torque.spec.in from 2.4-fixes.
Refactored torque spec file to comply with established RPM best
practices, including the following:
- Standard installation locations based on RPM macro configuration
(e.g., %{_prefix})
- Latest upstream RPM conditional build semantics with fallbacks for
older versions of RPM (e.g., RHEL4)
- Initial set of optional features (GUI, PAM, syslog, SCP) with more
planned
- Basic working configuration automatically generated at install-time
- Reduce the number of unnecessary subpackages by consolidating where
it makes sense and using existing RPM features (e.g., --excludedocs).
2.5.10
  b - Fixed a problem where pbs_mom would crash if check_pwd returned NULL. This could
happen for example if LDAP was down and getpwnam returns NULL.
e - Added code to delete a job on the MOM if a job is in the EXITED substate and
      going through the scan_for_exiting code. This happens when an obit has been
      sent and the obit reply has been received, but the PBS_BATCH_DeleteJob request
      has not been received from the server on the MOM. This fix allows the MOM to delete the
job and free up resources even if the server for some reason does not send
the delete job request.
b - TRQ-608: Removed code to check for blocking mode in write_nonblocking_socket().
Fixes problem with interactive jobs (qsub -I) exiting prematurely.
c - fix a buffer being overrun with nvidia gpus enabled (backported from 3.0.4)
  b - Fixed a problem in 2.5.9 where the job_array structure was modified
without changing the version or creating an upgrade path. This made
it incompatible with previous versions of TORQUE 2.5 and 3.0.
Added new array structure job_array_259. This is the original torque
2.5.9 job_array structure with the num_purged element added in the middle
of the structure. job_array_259 was created so users could upgrade from 2.5.9
and 3.0.3 to later versions of TORQUE. The job_array structure was
modified by moving the num_purged element to the bottom of the structure.
pbsd_init now has an upgrade path for job arrays from version 3 to version
4. However, there is an exceptional case when upgrading from 2.5.9 or 3.0.3
where pbs_server must be started using a new -u option.
b - no longer leave zombie processes when munge authenticating. (backported from 3.0.4)
2.5.9
e - change mom to only log "cannot find nvidia-smi in PATH" once when built
with --enable-nvidia-gpus and running on a node that does not have Nvidia
drivers installed.
b - Change so gpu states get set/unset correctly. Fixes problems with multiple
exclusive jobs being assigned to same gpu and where next job gets rejected
because gpu state was not reset after last shared gpu job finished.
e - Added a 1 millisecond sleep to src/lib/Libnet/net_client.c client_to_svr()
      if connect fails with EADDRINUSE, EINVAL or EADDRNOTAVAIL. For these cases
TORQUE will retry the connect again. This fix increases the chance of success
on the next iteration.
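      A sketch of the pause-then-retry pattern for those three error codes. The
      connect_retrying helper, the connect_once callable and the limit of three
      attempts are hypothetical; only the 1 millisecond pause and the errno values
      come from the entry.

```python
import errno
import time

# Failures the entry says are worth retrying after a short pause.
RETRYABLE = {errno.EADDRINUSE, errno.EINVAL, errno.EADDRNOTAVAIL}


def connect_retrying(connect_once, attempts=3, pause=0.001):
    """Retry connect_once() after a 1 ms pause when it fails with a retryable errno."""
    for attempt in range(attempts):
        try:
            return connect_once()
        except OSError as exc:
            # Give up at once on other errors, or when attempts are exhausted.
            if exc.errno not in RETRYABLE or attempt == attempts - 1:
                raise
            time.sleep(pause)
```

      Any other errno is re-raised immediately, so only the listed transient
      conditions trigger another attempt.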
b - Changes to decrease some gpu error messages and to detect unusual gpu
drivers and configurations.
b - Change so user cannot impersonate a different user when using munge.
  e - Added new option to torque.cfg named TRQ_IFNAME. This allows the user to designate
a preferred outbound interface for TORQUE requests. The interface is the name
of the NIC interface, for example eth0.
e - Added instructions concerning the server parameter moab_array_compatible to the
README.array_changes file.
b - Fixed a problem where pbs_server would seg-fault if munged was not running. It would
      also seg-fault if an invalid credential were sent from a client. The seg-fault
      occurred in the same place for both cases.
b - Fixed a problem where jobs dependent on an array using afteranyarray would not start
when a job element of the array completed.
b - Fixed a bug where array jobs .AZ file would not be deleted when the array job was done.
e - Modified qsub so that it will set PBS_O_HOST on the server from the incoming interface.
(with this fix QSUBHOST from torque.cfg will no longer work. Do we need to make it
to override the host name?)
b - fix so user epilogues are run as user instead of root (backported from 3.0.3)
  b - fix to prevent pbs_server from hanging when doing server to server job moves.
(backported from 3.0.3)
b - Fixed a problem where array jobs would always lose their state when pbs_server was
restarted. Array jobs now retain their previous state between restarts of the server
the same as non-array jobs. This fix takes care of a problem where Moab and TORQUE
      would get out of sync on jobs because of this discrepancy between states.
b - Made a fix related to procct. If no resources are requested on the qsub line previous
versions of TORQUE did not create a Resource_List attribute. Specifically a node and
nodect element for Resource_List. Adding this broke some applications. I made it so
if no nodes or procs resources are requested the procct is set to 1 without creating
the nodes element.
e - Changed enable-job-create to with-job-create with an optional CFLAG argument.
--with-job-create=<CFLAG options>
e - Changed qstat.c to display 6 instead of 5 digits for Req'd Memory for a qstat -a.
2.5.8
e - added util function getpwnam_ext() that has retry and errno logging
capability for calls to getpwnam().
c - fix a potential segfault when using asynchronous runjob with an array slot limit
(backported from 3.0.3)
b - In pbs_original_connect() only the first NCONNECT entries of the connection table
were checked for availability. NCONNECT is defined as 10. However, the connection
table is PBS_NET_MAX_CONNECTIONS in size. PBS_NET_MAX_CONNECTIONS is 10240.
NCONNECT is now defined as PBS_NET_MAX_CONNECTIONS.
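      The effect of the scan limit can be shown with a small sketch. The two
      constants come from the entry; find_free_slot and the boolean table layout
      are hypothetical stand-ins for the real connection table.

```python
PBS_NET_MAX_CONNECTIONS = 10240  # table size quoted in the entry
OLD_NCONNECT = 10                # old scan limit quoted in the entry


def find_free_slot(in_use, scan_limit):
    """Return the index of the first free slot among the first scan_limit entries, or -1."""
    for index in range(min(scan_limit, len(in_use))):
        if not in_use[index]:
            return index
    return -1


# Table where the first ten slots are busy and the rest are free.
table = [True] * OLD_NCONNECT + [False] * (PBS_NET_MAX_CONNECTIONS - OLD_NCONNECT)
old_result = find_free_slot(table, OLD_NCONNECT)             # scan stops too early
new_result = find_free_slot(table, PBS_NET_MAX_CONNECTIONS)  # scan covers the table
```

      With the old limit the scan reports no free slot even though thousands remain;
      scanning the full table finds the first free one.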
b - fix bugzilla #135, stagein was deleting directory instead of file (backported
from 3.0.3)
b - If the resources nodes or procs are not submitted on the qsub command line then
the nodes attribute does not get set. This causes a problem if procct is set on
queues because there is no proc count available to evaluate. This fix sets
a default nodes value of 1 if the nodes or procs resources are not requested.
e - Change so Nvidia drivers 260, 270 and above are recognized.
e - Added server attribute no_mail_force which when set True eliminates all
e-mail when job mail_points is set to "n"
2.5.7
e - Added new qsub argument -F. This argument takes a quoted string as
an argument. The string is a list of space separated commandline
arguments which are available to the job script.
b - Fixed a potential buffer overflow problem in src/resmom/checkpoint.c function
      mom_checkpoint_recover. I modified the code to change strcpy and strcat to strncpy
      and strncat.
b - Fixed a bug for high availability. The -l listener option for pbs_server was not
complete and did not allow pbs_server to properly communicate with the scheduler.
Also fixed a bug with job dependencies where the second server or later in the
      $TORQUE_HOME/server_name directory was not added as part of the job dependency
so dependent jobs would get stuck on hold if the current server was not the first
server in the server_name file.
2.5.6
b - Made changes to record_jobinfo and supporting functions to be
      able to use dynamically allocated buffers for data. This fixed
a problem where incoming data overran fixed sized buffers.
b - Updated torque.spec.in to be able to handle the snapshot
names of builds.
  e - Added new MOM configure option job_starter. This option will pass
the script submitted in qsub to the executable or script provided
as the argument to the job_starter option of the MOM configure file.
b - fixed a problem with pbs_server high availability where the current
server could not keep the HA lock. The problem was a result of truncating
the directory name where the lock file was kept. TORQUE would fail to
validate permissions because it would do a stat on the wrong directory.
b - Added code to free the mom.lock file during MOM shutdown.
b - fixed a bug in set_resources that prevented the last resource in a list from being
checked. As a result the last item in the list would always be added
without regard to previous entries.
e - Added new symbol JOB_EXEC_OVERLIMIT. When a job exceeds a limit (i.e. walltime) the
job will fail with the JOB_EXEC_OVERLIMIT value and
also produce an abort case for mailing purposes. Previous to this change
a job exceeding a limit returned 0 on success and no mail
was sent to the user if requested on abort.
e - Added options to buildutils/torque.spec.in to conditionally build munge, BLCR,
high-availability, cpusets, and spooling. Also allows customization of the
sendmail path and allows for optional XML conversion to serverdb.
b - --with-tcp-retry-limit now actually changes things without needing to run autoheader
b - Fixed a problem with minimum sizes in queues. Minimum sizes were not getting enforced because
the logic checking the queue against the user request used an && when it needed a || in the
comparison.
e - The -e and -o options of qsub allow a user to specify a path or optionally a filename for output.
If the path given by the user ended with a directory name but no '/' character at the end then
TORQUE was confused and would not convert the .OU or .ER file to the final output/error file. The
code has now been changed to stat the path to see if the end path element is a file or directory
and handle it appropriately.
e - Added new MOM configuration option $rpp_throttle. The syntax for this in the
$TORQUE_HOME/mom_priv/config file is $rpp_throttle <value> where value is a long
representing microseconds. Setting this value causes rpp data to pause after every
sendto for <value> microseconds. This may help with large jobs where full data does
not arrive at sister nodes.
c - check if the file pointer to /dev/console can be opened. If not, don't attempt to write it
(backported from 3.0.2)
b - Added patch from Michael Jennings to buildutils/torque.spec.in. This patch
allows an rpm configured with DRMAA to complete even if all of the
support files are not present on the system.
b - committed patch submitted by Michael Jennings to fix bug 130. TORQUE on the MOM would call
lstat as root when it should call it as user in open_std_file.
f - Added the ability to detect Nvidia gpus using nvidia-smi (default) or NVML.
Server receives gpu statuses from pbs_mom. Added server attribute auto_node_gpu
that allows automatically setting number of gpus for nodes based on gpu
statuses. Added new configure options --enable-nvidia-gpus,
--with-nvml-include and --with-nvml-lib.
c - fix a segfault when using --enable-nvidia-gpus and pbs_mom has Nvidia driver
older than 260 that still has nvidia-smi command
e - Added capability to automatically set mode on Nvidia gpus. Added support for
gpu reseterr option on qsub. The nodes file will be updated with Nvidia gpu
count when --enable-nvidia-gpus configure option is used. Moved some code
out of job_purge_thread to prevent segfault on mom.
e - Applied patch submitted by Eric Roman. This patch addresses some build issues
with BLCR, and fixes an error where BLCR would report -ENOSUPPORT when trying
to checkpoint a parallel job. The patch adds a --with-blcr option to configure
to find the path to the BLCR libraries. There are --with-blcr-include,
--with-blcr-lib and --with-blcr-bin to override the search paths, if necessary.
The last option, --with-blcr-bin is used to generate contrib/blcr/checkpoint_script
and contrib/blcr/restart_script from the information supplied at configure time.
b - Fixed problem where calling qstat with a non-existent job id would hang the qstat
command. This was only a problem when configured with MUNGE.
b - fix a potential buffer overflow security issue in job names and host address names
2.5.5
b - change so gpus get written back to nodes file
e - make it so that even if an array request has multiple consecutive '%' the slot
limit will be set correctly
b - Fixed bug in job_log_open where the global variable logpath was freed instead
of joblogpath.
b - Fixed memory leak in function procs_requested.
b - Validated incoming data for escape_xml to prevent a seg-fault with incoming
null pointers
e - Added submit_host and init_work_dir as job attributes. These two
values are now displayed with a qstat -f. The submit_host is
the name of the host from where the job was submitted. init_work_dir
is the working directory as in PBS_O_WORKDIR.
e - change so blcr checkpoint jobs can restart on different node. Use
configure --enable-blcr to allow.
b - remove the use of a GNU specific function, and fix an error for solaris builds
b - Updated PBS_License.txt to remove the implication that the software
is not freely redistributable.
b - remove the $PBS_GPUFILE when job is done on mom
b - fix a race condition when issuing a qrerun followed by a qdel that caused
the job to be queued instead of deleted sometimes.
e - Implemented Bugzilla Bug 110. If a host in the nodes file cannot be resolved
at startup the server will retry once every 5 minutes until the node
resolves, and will then add it to the nodes list.
e - Added a "create" method to pbs_server init.d script so a serverdb file
can be created if it does not exist at startup time. This is an enhancement
in reference to Bugzilla bug 90.
b - Fixed a problem in parse_node_token where the local static variable pt would be advanced
past the end of the line input if there is no newline character at the end of the nodes
file.
e - To fix Bugzilla Bug 121 I created a thread in job_purge on the mom in the file src/resmom/job_func.c
All job purging now happens on its own thread. If any of the system calls fail to return
the thread will hang but the MOM will still be able to process work.
2.5.4
f - added the ability to track gpus. Users set gpus=X in the nodes file for
relevant node, and then request gpus in the nodes request:
-l nodes=X[:ppn=Y][:gpus=Z]. The gpus appear in $PBS_GPUFILE, a new
environment variable, in the form: <hostname>-gpu<index> and in a
new job attribute exec_gpus:
<hostname>-gpu/<index>[+<hostname>-gpu/<index>...]
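The two string forms described above can be illustrated with a small sketch. It is not TORQUE code: the function names and the list-of-pairs input are assumptions made for illustration, and only the output formats come from the entry above.

```python
def gpufile_lines(assignments):
    """Return the lines of a $PBS_GPUFILE-style listing.

    assignments is a hypothetical list of (hostname, gpu_index) pairs.
    Each line has the form <hostname>-gpu<index>.
    """
    return [f"{host}-gpu{index}" for host, index in assignments]


def exec_gpus_value(assignments):
    """Return an exec_gpus-style string.

    Entries have the form <hostname>-gpu/<index> and are joined with '+'.
    """
    return "+".join(f"{host}-gpu/{index}" for host, index in assignments)


if __name__ == "__main__":
    # Hypothetical allocation: two gpus on node01, one on node02.
    allocation = [("node01", 0), ("node01", 1), ("node02", 0)]
    print("\n".join(gpufile_lines(allocation)))
    print(exec_gpus_value(allocation))
```

Note that the file form has no separator between "gpu" and the index, while the job attribute form uses a slash.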
b - clean up job mom checkpoint directory on checkpoint failure
e - Bugzilla bug 91. Check the status before the service is actually started.
(Steve Traylen - CERN)
e - Bugzilla bug 89. Only touch lock/subsys files if service actually starts.
(Steve Traylen - CERN)
c - when using job_force_cancel_time, fix a crash in rare cases
e - add server parameter moab_array_compatible. When set to true, this parameter
places a limit hold on jobs past the slot limit. Once one of the unheld jobs
completes or is deleted, one of the held jobs is freed.
b - fix a potential memory corruption for walltime remaining for jobs
(Vikentsi Lapa)
b - fix potential buffer overrun in pbs_sched (Bugzilla #98, patch from
Stephen Usher @ University of Oxford)
e - check if a process still exists before killing it and sleeping. This speeds up
the time for killing a task exponentially, although this will show mostly for
SMP/NUMA systems, but it will help everywhere.
(Dr. Bernd Kallies)
b - Fix for requeue failures on mom. Forked pbs_mom would silently segfault and
job was left in Exiting state.
b - change so "mom_checkpoint_job_has_checkpoint" and "execing command" log
messages do not always get logged
2.5.3
b - stop reporting errors on success when modifying array ranges
b - don't try to set the user id multiple times
b - added some retrying to get connection and changed some log messages when
doing a pbs_alterjob after a checkpoint
c - fix segfault in tracejob. It wasn't malloc'ing space for the null
terminator
e - add the variables PBS_NUM_NODES and PBS_NUM_PPN to the job environment
(TRQ-6)
e - be able to append to the job's variable_list through the API
(TRQ-5)
e - Added support for munge authentication. This is an alternative for the
default ruserok remote authentication and pbs_iff. This is a compile
time option. The configure option to use is --enable-munge-auth.
Ken Nielson (TRQ-7) September 15, 2010.
b - fix the dependency hold for arrays. They were accidentally cleared
before (RT 8593)
e - add a logging statement if sendto fails at any point in rpp_send_out
b - Applied patch submitted by Will Nolan to fix bug 76.
"blocking read does not time out using signal handler"
b - fix a bug in the $spool_as_final_name code if HAVE_WORDEXP is
undefined
b - Bugzilla bug 84. Security bug on the way checkpoint is being handled.
(Robin R. - Miami Univ. of Ohio)
e - Now saving serverdb as an xml file instead of a byte-dump, thus
allowing canned installations without qmgr scripts, as well as more
portability. Able to upgrade automatically from 2.1, 2.3, and 2.4
b - fix to cleanup job files on mom after a BLCR job is checkpointed and held
b - make the tcp reading buffer able to grow dynamically to read larger
values in order to avoid "invalid protocol" messages
e - change so checkpoint files are transferred as the user, not as root.
f - Added configure option --with-servchkptdir which allows specifying path
for server's checkpoint files
b - could not set the server HA parameters lock_file_update_time and
lock_file_check_time previously. Fixed.
e - qpeek now has the options --ssh, --rsh, --spool, --host, -o, and
-e. Can now output both the STDOUT and STDERR files. Eliminated
numlines, which didn't work.
b - fix to prevent a possible segfault when using checkpointing.
2.5.2
e - Allow the nodes file to use the syntax node[0-100] in the name to
create identical nodes with names node0, node1, ..., node100.
(also node[000-100] => node000, node001, ... node100)
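The range syntax above can be sketched as follows. This is an illustration only, not the server's parser: the function name is hypothetical, and the rule that a leading zero on the start value sets the padding width is an assumption inferred from the node[000-100] example.

```python
import re

# Matches a single [start-end] numeric range anywhere in the name.
_NODE_RANGE = re.compile(r"^(.*)\[(\d+)-(\d+)\](.*)$")


def expand_node_name(spec):
    """Expand a name such as node[0-100] into the individual node names."""
    match = _NODE_RANGE.match(spec)
    if match is None:
        return [spec]  # no range present: the name stands for itself
    prefix, start, end, suffix = match.groups()
    # Assumption: a zero-padded start value (e.g. 000) fixes the width.
    width = len(start) if len(start) > 1 and start[0] == "0" else 0
    return [
        prefix + str(i).zfill(width) + suffix
        for i in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    print(expand_node_name("node[0-3]"))
    print(expand_node_name("node[000-003]"))
```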
b - fix support of the 'procs' functionality for qsub.
b - remove square brackets [] from job and default stdout/stderr filenames
for job arrays (fixes conflict with some non-bash shells)
n - fix build system so README.array_changes is included in tar.gz file made
with "make dist"
n - fix build system so contrib/pbsweb-lite-0.95.tar.gz, contrib/qpool.gz
and contrib/README.pbstools are included in the tar.gz file made
with "make dist"
c - fixed crash when moving the job to a different queue (bugzilla 73)
e - Modified buildutils/pbs_mkdirs.in to create server_priv/nodes file
at install time. The file only shows examples and a link to the
TORQUE documentation. This enhancement was first committed to trunk.
c - fix pbs_server crash from invalid qsub -t argument
b - fix so blcr checkpoint jobs work correctly when put on hold
b - fixed bugzilla #75 where pbs_server would segfault with a double free when
calling qalter on a running job or job array.
e - Changed free_br back to its original form and modified copy_batchrequest
to make a copy of the rq_extend element which will be freed in
free_br.
b - fix condition where job array "template" may not get cleaned up properly
after a server restart
b - fix to get new pagg ID and add additional CSA records when restarting from
checkpoint
e - added documentation for pbs_alterjob_async(), pbs_checkpointjob(),
pbs_fbserver(), pbs_get_server_list() and pbs_sigjobasync().
b - Committed patch from Eygene Ryanbinkin to fix bug 61. /dev/null would
under some circumstances have its permissions modified when jobs exited
on a compute node.
e - add --enable-top-tempdir-only to only create the top directory of the
job's temporary directory when configured
b - make the code for reconnecting to the server more robust, and remove
elements of not connecting if a job isn't running
e - allow input of walltime in the format of [DD]:HH:MM:SS
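As a rough illustration of the [DD]:HH:MM:SS format, the sketch below converts such a string to seconds. The function name is hypothetical, and accepting fewer than three fields (for example MM:SS) is an assumption of this sketch rather than documented behavior.

```python
def walltime_to_seconds(text):
    """Convert a [DD]:HH:MM:SS walltime string to a number of seconds."""
    fields = text.split(":")
    if not 1 <= len(fields) <= 4:
        raise ValueError("expected at most four colon-separated fields")
    # Read fields right to left: seconds, minutes, hours, days.
    multipliers = (1, 60, 3600, 86400)
    return sum(m * int(f) for m, f in zip(multipliers, reversed(fields)))


if __name__ == "__main__":
    print(walltime_to_seconds("02:03:04"))     # hours, minutes, seconds
    print(walltime_to_seconds("01:02:03:04"))  # with a leading day field
```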
b - Fix so BLCR checkpoint files get copied to server on qchkpt and periodic
checkpoints
c - corrected a segfault when display_job_server_suffix is set to false
and job_suffix_alias was unset.
2.5.1
b - modified Makefile.in and Makefile.am at root to include contrib/AddPrivileges
2.5.0
e - Added new server config option alias_server_name. This option allows
the MOM to add an additional server name to the list
of trusted addresses. The point of this is to be able to handle
alias ip addresses. UDP requests that come into an aliased ip address
are returned through the primary ip address in TORQUE. Because
the address of the reply packet from the server is not the same address
the MOM sent its HELLO1 request to, the MOM drops the packet and the MOM
cannot be added to the server.
n - auto_node_np will now adjust np values down as well as up.
e - Enabled TORQUE to be able to parse the -l procs=x node spec. Previously
TORQUE simply recorded the value of x for procs in Resources_List. It
now takes that value and allocates x processors packed on any available
node. (Ken Nielson Adaptive Computing. June 17, 2010)
f - added full support (server-scheduler-mom) for Cygwin (UIIP NAS of Belarus,
uiip.bas-net.by)
b - fixed handling of EINPROGRESS in net_client.c. This error code appears on
every connection attempt and requires separate handling. The old erroneous
handling caused a large network delay, especially on Cygwin.
e - improved signal processing after connecting in client_to_svr and added own
implementation of bindresvport for OS which lack it (Igor Ilyenko,
UIIP Minsk)
f - created permission checking of Windows (Cygwin) users, using mkpasswd,
mkgroup and own functions IamRoot, IamUser (Yauheni Charniauski,
UIIP Minsk)
f - created permission checking of submitted jobs (Vikentsi Lapa,
UIIP Minsk)
f - Added the --disable-daemons configure option for starting server-sched-mom
as Windows services; cygrunsrv.exe puts them into the background
independently.
e - Adapted output of Cygwin's diagnostic information (Yauheni
Charniauski, UIIP Minsk)
b - Changed pbsd_main to call daemonize_server early only if
high_availability_mode is set.
e - added new qmgr server attributes (clone_batch_size, clone_batch_delay)
for controlling job cloning (Bugzilla #4)
e - added new qmgr attribute (checkpoint_defaults) for setting default
checkpoint values on Execution queues (Bugzilla #1)
e - print a more informative error if pbs_iff isn't found when trying to
authenticate a client
e - added qmgr server attribute job_start_timeout, specifies timeout to be
used for sending job to mom. If not set, tcp_timeout is used.
e - added -DUSESAVEDRESOURCES code that uses the server's saved resources used
for accounting end record instead of current resources used for jobs that
stopped running while mom was not up.
e - TORQUE job arrays now use arrays to hold the job pointers and not
linked lists (allows constant lookup).
f - Allow users to delete a range of jobs from the job array (qdel -t)
f - Added a slot limit to the job arrays - this restricts the number of
jobs that can concurrently run from one job array.
f - added support for holding ranges of jobs from an array with a single
qhold (using the -t option).
f - now ranges of jobs in an array can be modified through qalter
(using the -t option).
f - jobs can now depend on arrays using these dependencies:
afterstartarray, afterokarray, afternotokarray, afteranyarray
f - added support for using qrls on arrays with the -t option
e - complete overhaul of job array submission code
f - by default show only a single entry in qstat output for the whole array
(qstat -t expands the job array)
f - server parameter max_job_array_size limits the number of jobs allowed
in an array
b - job arrays can no longer circumvent max_user_queuable
b - job arrays can no longer circumvent max_queuable
f - added server parameter max_slot_limit to restrict slot limits
e - changed array names from jobid-index to jobid[index] for consistency
2.4.13
e - change so blcr checkpoint jobs can restart on different node. Use
configure --enable-blcr to allow. (Bugzilla 68, backported from 2.5.5)
e - Add code to verify the group list as well when VALIDATEGROUPS is set in torque.cfg
(backported from 3.0.1)
b - Fix a bug where if geometry requests are enabled and cpusets are enabled, the cpuset
wasn't deleted unless a geometry request was made. (backported from 3.0.1)
b - Fix a race condition for pbs_mom -q, exitstatus was getting overwritten and as a result
jobs weren't always re-queued by pbs_server, but were being deleted instead. (backported from 3.0.1)
b - allow apostrophes in Mail_Users attributes, as apostrophes are rare but legal email
characters (backported from 3.0.1)
b - Fixed a problem in parse_node_token where the local static variable pt would be advanced
past the end of the line input if there is no newline character at the end of the nodes
file.
b - Updated torque.spec.in to be able to handle the snapshot
names of builds.
b - Merged revisions 4555, 4556 and 4557 from 2.5-fixes branch. These revisions fix problems in
High availability mode and also a problem where the MOM was not releasing the lock on
mom.lock on exit.
b - fix pbs_mom -q to work with parallel jobs (backported from 3.0.1)
b - fixed a bug in set_resources that prevented the last resource in a list from being
checked. As a result the last item in the list would always be added
without regard to previous entries.
e - allow more than 5 concurrent connections to TORQUE using pbsD_connect. Increase it to 10
(backported from 3.0.1)
b - fix a segfault when receiving an obit for a job that no longer exists (backported from 3.0.1)
b - Fixed a problem with minimum sizes in queues. Minimum sizes were not getting enforced because
the logic checking the queue against the user request used an && when it needed a || in the
comparison.
c - fix a segfault when queue has acl_group_enable and acl_group_sloppy set
true and no acl_groups are defined. (backported from 3.0.1)
e - To fix Bugzilla Bug 121 I created a thread in job_purge on the mom in the file src/resmom/job_func.c
All job purging now happens on its own thread. If any of the system calls fail to return
the thread will hang but the MOM will still be able to process work.
e - Updated Makefile.in, configure, etc. to reflect change in configure.ac to add
libpthread to the build. This was done for the fix for Bugzilla Bug 121.
2.4.12
b - Bugzilla bug 84. Security bug on the way checkpoint is being handled.
(Robin R. - Miami Univ. of Ohio, back-ported from 2.5.3)
b - make the tcp reading buffer able to grow dynamically to read larger
values in order to avoid "invalid protocol" messages (backported from
2.5.3)
b - could not set the server HA parameters lock_file_update_time and
lock_file_check_time previously. Fixed. (backported from 2.5.3)
e - qpeek now has the options --ssh, --rsh, --spool, --host, -o, and
-e. Can now output both the STDOUT and STDERR files. Eliminated
numlines, which didn't work. (backported from 2.5.3)
b - Modified the pbs_server startup routine to skip unknown hosts in the
nodes file instead of terminating the server startup.
b - fix to prevent a possible segfault when using checkpointing (back-ported
from 2.5.3).
b - fix to cleanup job files on mom after a BLCR job is checkpointed and held
(back-ported from 2.5.3)
c - when using job_force_cancel_time, fix a crash in rare cases
(backported from 2.5.4)
b - fix a potential memory corruption for walltime remaining for jobs
(Vikentsi Lapa, backported from 2.5.4)
b - fix potential buffer overrun in pbs_sched (Bugzilla #98, patch from
Stephen Usher @ University of Oxford, backported from 2.5.4)
e - check if a process still exists before killing it and sleeping. This speeds up
the time for killing a task exponentially, although this will show mostly for
SMP/NUMA systems, but it will help everywhere. (backported from 2.5.4)
(Dr. Bernd Kallies)
e - Refactored torque spec file to comply with established RPM best
practices, including the following:
- Standard installation locations based on RPM macro configuration
(e.g., %{_prefix})
- Latest upstream RPM conditional build semantics with fallbacks for
older versions of RPM (e.g., RHEL4)
- Initial set of optional features (GUI, PAM, syslog, SCP) with more
planned
- Basic working configuration automatically generated at install-time
- Reduce the number of unnecessary subpackages by consolidating where
it makes sense and using existing RPM features (e.g.,
--excludedocs).
b - Merged revision 4325 from 2.5-fixes. Fixed a problem where the -m n
(request no mail on qsub) was not always being recognized.
b - Fix for requeue failures on mom. Forked pbs_mom would silently segfault and
job was left in Exiting state. (backported from 2.5.4)
b - prevent the nodes file from being overwritten when running make packages
b - change so "mom_checkpoint_job_has_checkpoint" and "execing command" log
messages do not always get logged (back-ported from 2.5.4)
b - remove the use of a GNU specific function. (back-ported from 2.5.5)
2.4.11
b - changed type cast for calloc of ioenv from sizeof(char) to sizeof(char *)
in pbsdsh.c. This fixes bug 79.
b - Added patch to fix bug 76, "blocking read does not time out using
signal handler".
b - Modified the pbs_server startup routine to skip unknown hosts in the
nodes file instead of terminating the server startup.
2.4.10
b - fix for bug 61. The fix takes care of a problem where pbs_mom under
some situations will change the mode and permissions of /dev/null.
2.4.9
b - Bugzilla bug 57. Check return value of malloc for tracejob for Linux
(Chris Samuel - Univ. of Melbourne)
b - fix so "gres" config gets displayed by pbsnodes
b - use QSUBHOST as the default host for output files when no host is
specified. (RT 7678)
e - allow users to use cpusets and geometry requests at the same time by
specifying both at configure time.
b - Bugzilla bug 55. Check return value of malloc for pbs_mom for Linux
(Chris Samuel - Univ. of Melbourne)
e - added server parameter job_force_cancel_time. When configured to X
seconds, a job that is still there X seconds after a qdel will be
purged. Useful for freeing nodes from a job when one node goes down
midjob.
b - fixed gcc warnings reported by Skip Montanaro
e - added RPT_BAVAIL define that allows pbs_mom to report f_bavail instead of
f_bfree on Linux systems
b - no longer consider -t and -T the same in qsub
e - make PBS_O_WORKDIR accessible in the environment for prolog scripts
e - Bugzilla 59. Applied patch to allow '=' for qdel -m.
(Chris Samuel - Univ. of Melbourne)
b - properly escape characters (&"'<>) in XML output
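For readers unfamiliar with the five characters involved, the sketch below shows the standard XML entity mapping using Python's standard library. It illustrates the escaping rule only and is not the TORQUE implementation; the function name is hypothetical.

```python
from xml.sax.saxutils import escape


def escape_xml_text(value):
    """Escape the five XML special characters: & < > " '

    escape() handles &, < and > itself; the quote characters are
    supplied through the optional entities mapping.
    """
    return escape(value, {'"': "&quot;", "'": "&apos;"})


if __name__ == "__main__":
    print(escape_xml_text("name=\"a&b\" <tag> it's"))
```

The ampersand is replaced first, so the entities produced for the other characters are not escaped a second time.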
b - ignore port when checking host in svr_get_privilege()
b - restore ability to parse -W x=geometry:{...,...}
e - from Simon Toth: If no available amount is specified for a resource
and the max limit is set, the requirement should be checked against
the maximum only (for scheduler, bugzilla 23).
b - check return values from fwrite in cpuset.c to avoid warnings
e - expand acl host checking to allow * in the middle of hostnames, not
just at the beginning. Also allow ranges like a[10-15] to mean a10,
a11, ..., a15.
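A minimal sketch of this kind of matching follows, assuming that * matches any run of characters and that [N-M] stands for each integer in the range. The function name, the case-insensitive comparison, and the use of shell-style globbing are assumptions of the sketch, not a description of the server's ACL code.

```python
import fnmatch
import re

# Matches a numeric range such as [10-15] inside a host pattern.
_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def acl_host_matches(pattern, hostname):
    """Return True if hostname matches an ACL-style host pattern."""
    match = _RANGE.search(pattern)
    if match is None:
        # No range left: compare with * acting as a wildcard.
        return fnmatch.fnmatchcase(hostname.lower(), pattern.lower())
    low, high = int(match.group(1)), int(match.group(2))
    for number in range(low, high + 1):
        # Substitute one value from the range and try again.
        candidate = pattern[:match.start()] + str(number) + pattern[match.end():]
        if acl_host_matches(candidate, hostname):
            return True
    return False


if __name__ == "__main__":
    print(acl_host_matches("a[10-15]", "a12"))
    print(acl_host_matches("node*.example.com", "node7.example.com"))
```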
2.4.8
e - Bugzilla bug 22. HIGH_PRECISION_FAIRSHARE for fifo scheduling.
c - no longer sigabrt with "running" jobs not in an execution queue. log
an error.
c - fixed segfault for when TORQUE thinks there's a nanny but there isn't
e - mapped 'qsub -P user:group' to qsub -P user -W group_list=group
b - reverted to old behavior where interactive scripts are checked for
directives and not run without a parameter.
e - setting a queue's resource_max.nodes now actually restricts things,
although so far it only limits based on the number of nodes (i.e. not
ppn)
f - added QSUBSENDGROUPLIST to qsub. This allows the server to know the
correct group name when disable_server_id_check is set to true and
the user doesn't exist on the server.
e - Bugzilla bug 54. Patch submitted by Bas van der Vlies to make pbs_mkdirs
more robust, provide a help function and new option -C <chk_tree_location>
2.4.7
b - fixed a bug for when a resource_list has been set, but isn't completely
initialized, causing a segfault
b - stop counting down walltime remaining after a job is completed
b - correctly display the number for tasks as used in TORQUE in qstat -a output
b - no longer ignoring fread return values in linux cpuset code (gcc 4.3.3)
b - fixed a bug where job was added to obit retry list multiple times, causing
a segfault
b - Fix for Bugzilla bug 43. "configure ignores with-modulefiles=no"
b - no longer try to decide when to start with -t create in init.d scripts,
-t creates should be done manually by the user
f - added -P to qsub. When submitting a job as root, the root user may add -P
<username> to submit the job as the proxy user specified by <username>
2.4.6
f - added an asynchronous option for qsig, specified with -a.
b - fix to cleanup job that is left in running state after mom restart
f - added two server parameters: display_job_server_suffix and job_suffix_alias.
The first defaults to true and is whether or not jobs should be appended
by .server_name. The second defaults to NULL, but if it is defined it
will be appended at the end of the jobid, i.e. jobid.job_suffix_alias.
f - added -l option to qstat so that it will display a server name and an
alias if both are used. If these aren't used, -l has no effect.
e - qstat -f now includes an extra field "Walltime Remaining" that tells
the remaining walltime in seconds. This field does not account for
weighted walltime.
b - fixed open_std_file to setegid as well, this caused a problem with
epilogue.user scripts.
e - qsub's -W can now parse attributes with quoted lists, for example:
qsub script -W attr="foo,foo1,foo2,foo3" will set foo,foo1,foo2,foo3
as attr's value.
b - split Cray job library and CSA functionality since CSA is dependent on job
library but job library is not dependent on CSA
2.4.5
b - epilogue.user scripts were being run with prologue arguments. Fixed
bug in run_pelog() to include PE_EPILOGUSER so epilogue arguments get
passed to epilogue.user script.
b - Ticket 6665. pbs_mom and job recovery. Fixed a bug where the -q option
would terminate running processes as well as requeue jobs. This made the
-q option the same as the -r option for pbs_mom. -q will now only requeue
jobs and will not attempt to kill running processes. I also added a -P
option to start pbs_mom. This is similar to the -p option except the -P
option will only delete any left over jobs from the queue and will not
attempt to adopt any running processes.
e - Modified man page for pbs_mom. Added new -P option plus edited -p, -q
and -r options to hopefully make them more understandable.
n - 01/15/2010 created snapshot torque-2.4.5-snap201001151416.tar.gz.
b - now checks secondary groups (as well as primary) for creating a file
when spooling. Before it wouldn't create the spool file if a user had
permission through a secondary group.
n - 01/18/2010. Items above this point merged into trunk.
b - fixed a file descriptor error with high availability. Before it was possible
to try to regain a file descriptor which was never held, now this is fixed.
b - No longer overwrites the user's environment when spoolasfinalname is set.
Now the environment is handled correctly.
b - No longer will segfault if pbs_mom restarts in a bad state (user environment
not initialized)
e - Changing MAXNOTDEFAULT behavior. Now, by default, max is not default and max
can be configured as default with --enable-maxdefault.
2.4.4
b - fixed contrib/init.d/pbs_mom so that it doesn't overwrite $args defined in
/etc/sysconfig/pbs_mom
b - when spool_as_final_name is configured for the mom, no longer send email
messages about not being able to copy the spool file
b - when spool_as_final_name is configured for the mom, correctly substitute
job environment variables
f - added logging for email events, allows the admin to check if emails are
being sent correctly
b - Made a fix to svr_get_privilege(). On some architectures a non-root user
name would be set to null after the line " host_no_port[num_host_chars] = 0;"
because num_host_chars was = 1024 which was the size of host_no_port.
The null termination needed to happen at 1023. There were other problems
with this function so code was added to validate the incoming
variables before they were used. The symptom of this bug was that non-root
managers and operators could not perform operations where they should
have had rights.
b - Missed a format statement in an sprintf statement for the bug fix above.
b - Fixed a way that a file descriptor (for the server lockfile) could be used without
initialization. RT 6756
2.4.3
b - fix PBSD_authenticate so it correctly splits PATH with : instead of ;
(bugzilla #33)
b - pbs_mom now sets resource limits for tasks started with tm_spawn (Chris
Samuel, VPAC)
c - fix assumption about size of unsocname.sun_path in Libnet/net_server.c
b - Fix for Bugzilla bug 34. "torque 2.4.X breaks OSC's mpiexec". Fix in
src/server/stat_job.c revision 3268.
b - Fix for Bugzilla bug 35 - printing the wrong pid (normal mode) and not
printing any pid for high availability mode.
f - added a diagnostic script (contrib/diag/tdiag.sh). This script grabs
the log files for the server and the mom, records the output of qmgr
-c 'p s' and the nodefile, and creates a tarfile containing these.
b - Changed momctl -s to use exit(EXIT_FAILURE) instead of return(-1) if
a mom is not running.
b - Fix for Bugzilla bug 36. "qsub crashes with long dependency list".
b - Fix for Bugzilla bug 41. "tracejob creates a file in the local directory".
2.4.2
b - Changed predicate in pbsd_main.c for the two locations where
daemonize_server is called to check for the value of high_availability_mode
to determine when to put the server process in the background.
b - Added pbs_error_db.h to src/include/Makefile.am and src/include/Makefile.in.
pbs_error_db.h now needed for install.
e - Modified pbs_get_server_list so the $TORQUE_HOME/server_name file will work with
a comma delimited string or a list of server names separated by a new line.
b - fix tracejob so it handles multiple server and mom logs for the same day
f - Added a new server parameter np_default. This allows the administrator to
change the number of processors to a unified value dynamically for the
entire cluster.
e - high availability enhanced so that the server spawns a separate thread to
update the "lock" on the lockfile. Thread update and check time are both
settable parameters in qmgr.
b - close empty ACL files
2.4.1
e - added a prologue and epilogue option to the list of resources for qsub -l
which allows a per job prologue or epilogue script. The syntax for
the new option is qsub -l prologue=<prologue script>,
epilogue=<epilogue script>
f - added a "-w" option to qsub to override the working directory
e - changes needed to allow relocatable checkpoint jobs. Job checkpoint files
are now under the control of the server.
c - check filename for NULL to prevent crash
b - changed so we don't try to copy a local file when the destination is a
directory and the file is already in that directory
f - changes to allow TORQUE to operate without pbs_iff (merged from 2.3)
e - made logging functions reentrant-safe by using localtime_r instead of
localtime() (merged from 2.3)
e - Merged in more logging and NOSIGCHLDMOM capability from Yahoo branch
e - merged in new log_ext() function to allow more fine grained syslog events,
you can now specify severity level. Also added more logging statements
b - fixed a bug where CPU time was not being added up properly in all cases
(fix for Linux only)
c - fixed a few memory errors due to some uninitialized memory being allocated
(ported from 2.3 R2493)
e - added code to allow compilers to override CLONE_BATCH_SIZE at configure
time (allows for finer grained control on how arrays are created) (ported
from Yahoo R2461)
e - added code which prefixes the severity tag on all log_ext() and log_err()
messages (ported from Yahoo R2358)
f - added code from 2.3-extreme that allows TORQUE to handle more than 1024 sockets.
Also, increased the size of TORQUE's internal socket handle table to avoid
running out of handles under busy conditions.
e - TORQUE can now handle server names larger than 64 bytes (now set to 1024,
which should be larger than the max for hostnames)
e - added qmgr option accounting_keep_days, specifies how long to keep
accounting files.
e - changed mom config varattr so invoked script returns the varattr name
and value(s)
e - improved the performance of pbs_server when submitting large numbers of
jobs with dependencies defined
e - added new parameter "log_keep_days" to both pbs_server and pbs_mom.
Specifies how long to keep log files before they are automatically removed
e - added qmgr server attribute lock_file, specifies where server lock file
is located
b - change so we use default file name for output / error file when just a
directory is specified on qsub / qalter -e -o options
e - modified to allow retention of completed jobs across server shutdown
e - added job_must_report qmgr configuration which says the job must be
reported to scheduler. Added job attribute "reported". Added PURGECOMP
functionality which allows scheduler to confirm jobs are reported. Also
added -c option to qdel. Used to clean up unreported jobs.
b - Fix so interactive jobs run when using $job_output_file_umask userdefault
f - Allow adding extra End accounting record for a running job that is rerun.
Provides usage data. Enabled by CFLAGS=-DRERUNUSAGE.
b - Fix to use queue/server resources_defaults to validate mppnodect against
resources_max when mppwidth or mppnppn are not specified for job
f - merged in new dynamic array struct and functions to implement a new (and
more efficient) way of loading jobs at startup--should help by 2 orders of
magnitude!
f - changed TORQUE_MAXCONNECTTIMEOUT to be a global variable that is now
changed by the MOM to be smaller than the pbs_server and is also
configurable on the MOM ($max_conn_timeout_micro_sec)
e - change so queued jobs that get deleted go to complete and get displayed
in qstat based on keep_completed
b - Changes to improve the qstat -x XML output and documentation
b - Change so BATCH_PARTITION_ID does not pass through to child jobs
c - fix to prevent segfault on pbs_server -t cold
b - fix so find_resc_entry still works after setting server extra_resc
c - keep pbs_server from trying to free empty attrlist after receiving
bad request (Michael Meier, University of Erlangen-Nurnberg) (merged from
2.3.8)
f - new fifo scheduler config option. ignore_queue: queue_name
allows the scheduler to be instructed to ignore up to 16 queues on the server
(Simon Toth, CESNET z.s.p.o.)
e - add administrator customizable email notifications (see manpage for
pbs_server_attributes) - (Roland Haas, Georgia Tech)
e - moving jobs can now trigger a scheduling iteration (merged from 2.3.8)
e - created a utility module that is shared between both server and mom but
does NOT get placed in the libtorque library
e - allow the user to request a specific processor geometry for their job using
a bitmap, and then bind their jobs to those processors using cpusets.
b - fix how qsub sets PBS_O_HOST and PBS_SERVER (Eirikur Hjartarson, deCODE
genetics) (merged from 2.3.8)
b - fix to prevent some jobs from getting deleted on startup.
f - add qpool.gz to contrib directory
e - improve how error constants and text messages are represented (Simon Toth,
CESNET z.s.p.o.)
f - new boolean queue attribute "is_transit" that allows jobs to exceed
server resource limits (queue limits are respected). This allows routing
queues to route jobs that would be rejected for exceeding local resources
even when the job won't be run locally. (Simon Toth, CESNET z.s.p.o.)
e - add support for "job_array" as a type for queue disallowed_types attribute
e - added pbs_mom config option ignmem to ignore mem/pmem limit enforcement
e - added pbs_mom config option igncput to ignore pcput limit enforcement
2.4.0
f - added a "-q" option to pbs_mom which does *not* perform the default -p
behavior
e - made "pbs_mom -p" the default option when starting pbs_mom
e - added -q to qalter to allow quicker response to modify requests
f - added basic qhold support for job arrays
b - clear out ji_destin in obit_reply
f - add qchkpt command
e - renamed job.h to pbs_job.h
b - fix logic error in checkpoint interval test
f - add RERUNNABLEBYDEFAULT parameter to torque.cfg. allows admin to
change the default value of the job rerunnable attribute from true
to false
e - added preliminary Comprehensive System Accounting (CSA) functionality for
Linux. Configure option --enable-csa will cause workload management
records to be written if CSA is installed and wkmg is turned on.
b - changes to allow post_checkpoint() to run when checkpoint is completed,
not when it has just started. Also corrected issue when checkpoint fails
while trying to put job on hold.
b - update server immediately with changed checkpoint name and time attributes
after successful checkpoint.
e - Changes so checkpoint jobs that fail after being restarted are put on hold
or requeued
e - Added checkpoint_restart_status job attribute used for restart status
b - Updated manpages for qsub and qterm to reflect changed checkpointing
options.
b - reject a qchkpt request if checkpointing is not enabled for the job
b - Mom should not send checkpoint name and time to server unless checkpoint
was successful
b - fix so that running jobs that have a hold type and that fail on checkpoint
restart get deleted when qdel is used
b - fix so we reset start_time, if needed, when restarting a checkpointed job
f - added experimental fault_tolerant job attribute (set to true by passing
-f to qsub). This attribute indicates that a job can survive the loss of
a sister mom. Also added corresponding fault_tolerant and
fault_intolerant types to the "disallowed_types" queue attribute
b - fixes for pbs_mom's updating of comment and checkpoint name and time
e - change so we can reject hold requests on running jobs that do not have
checkpoint enabled if system was configured with --enable-blcr
e - change to qsub so only the host name can be specified on the -e/-o options
e - added -w option to qsub that allows setting of PBS_O_WORKDIR
2.3.8
c - keep pbs_server from trying to free empty attrlist after receiving
bad request (Michael Meier, University of Erlangen-Nurnberg)
e - moving jobs can now trigger a scheduling iteration
b - fix how qsub sets PBS_O_HOST and PBS_SERVER (Eirikur Hjartarson, deCODE
genetics)
f - add qpool.gz to contrib directory
b - fix return value of cpuset_delete() for Linux (Chris Samuel - VPAC)
e - Set PBS_MAXUSER to 32 from 16 in order to accommodate systems that
use a 32 bit user name. (Ken Nielson Cluster Resources)
c - modified acct_job in server/accounting.c to dynamically allocate memory
to accommodate strings larger than PBS_ACCT_MAX_RCD. (Ken Nielson Cluster
Resources)
e - allow the user to turn off credential lifetimes so they don't have to lose
iterations while credentials are renewed.
e - added OS independent resending of failed job obits (from D Beer), also
removed OS specific CACHEOBITFAILURES code.
b - fix so after* dependencies are handled correctly for exiting / completed
jobs
2.3.7
b - fixed a bug where UNIX domain socket communication was failing when
"--disable-privports" was used.
e - add job exit status as 10th argument to the epilogue script
b - fix truncated output in qmgr (peter h IPSec+jan n NANCO)
b - change so set_jobexid() gets called if JOB_ATR_egroup is not set
e - pbs_mom sisters can now tolerate an explicit group ID instead of only a
valid group name. This helps TORQUE be more robust to group lookup failures.
2.3.6
b - change back to not sending status updates until we get cluster addr
message from server, also only try to send hello when the server stream
is down.
b - change pbs_server so log_file_max_size of zero behavior matches documentation
e - added periodic logging of version and loglevel to help in support
e - added pbs_mom config option ignvmem to ignore vmem/pvmem limit enforcement
b - change to correct strtoks that accidentally got changed in astyle
formatting
e - in Linux, a pbs_mom will now "kill" a job's task, even if that task can no
longer be found in the OS process table. This prevents jobs from getting
"stuck" when the PID vanishes in some rare cases.
2.3.5
e - added new init.d scripts for Debian/Ubuntu systems
b - fixed a bug where TORQUE's exponential backoff for sending messages to the
MOM could overflow
2.3.4
c - fixed segfault when loading array files of an older/incompatible version
b - fixed a bug where if attempt to send job to a pbs_mom failed due to
timeout, the job would indefinitely remain in the 'R' state
b - qsub now properly interprets -W umask=0XXX as octal umask
e - allow $HOME to be specified for path
e - added --disable-qsub-keep-override to allow the qsub -k flag to not
override -o -e.
e - updated with security patches for setuid, setgid, setgroups
b - fixed correct_ct() in svr_jobfunc.c so we don't crash if we hit COMPLETED
job
b - fixed problem where momctl -d 0 showed ConfigVersion twice
e - if a .JB file gets upgraded pbs_server will back up the original
b - removed qhold / qrls -h n option since there is no code to support it
b - set job state and substate correctly when job has a hold attribute and
is being rerun
b - fixed a bug preventing multiple TORQUE servers and TORQUE MOMs from
operating properly all from the same host
e - fixed several compiler errors and warnings for AIX 5.2 systems
b - fixed a bug with "max_report" where jobs not in the Q state were not always
being reported to scheduler
2.3.3
b - fixed bug where pbs_mom would sometimes not connect properly with
pbs_server after network failures
b - changed so run_pelog opens correct stdout/stderr when join is used
b - corrected pbs_server man page for SIGUSR1 and SIGUSR2
f - added new pbs_track command which may be used to launch an external
process and a pbs_mom will then track the resource usage of that process
and attach it to a specified job (experimental) (special thanks to David
Singleton and David Houlder from APAC)
e - added alternate method for sending cluster addresses to mom
(ALT_CLSTR_ADDR)
2.3.2
e - added --disable-posixmemlock to force mom not to use POSIX MEMLOCK.
b - fix potential buffer overrun in qsub
b - keep pbs_mom, pbs_server, pbs_sched from closing sockets opened by
nss_ldap (SGI)
e - added PBS_VERSION environment variable
e - added --enable-acct-x to allow adding of x attributes to accounting log
b - fix net_server.h build error
2.3.1
b - fixed a bug where torque would fail to start if there was no LF in nodes
file
b - fixed a bug where TORQUE would ignore the "pbs_asyrunjob" API extension
string when starting jobs in asynchronous mode
b - fixed memory leak in free_br for PBS_BATCH_MvJobFile case
e - torque can now compile on Linux and OS X with NDEBUG defined
f - when using qsub it is now possible to specify both -k and -o/-e
(before -o/-e did not behave as expected if -k was also used)
e - changed pbs_server to have "-l" option. Specifies a host/port that event
messages will be sent to. Event messages are the same as what the
scheduler currently receives.
e - added --enable-autorun to allow qsub jobs to automatically try to run
if there are any nodes available.
e - added --enable-quickcommit to allow qsub to combine the ready to commit
and commit phases into 1 network transmission.
e - added --enable-nochildsignal to allow pbs_server to use inline checking
for SIGCHLD instead of using the signal handler.
e - change qsub so '-v var=' will look in environment for value. If value
is not found set it to "".
b - fix qdel of entire job arrays for non operator/managers
b - fix so we continue to process exiting jobs for other servers
e - added source_login_batch and source_login_interactive to mom config. This
allows us to bypass the sourcing of /etc/profile, etc. type files.
b - fixed pbs_server segmentation fault when job_array submissions are
rejected before ji_arraystruct was initialized
e - add some casts to fix some compiler warnings with gcc-4.1 on i386 when
-D_FILE_OFFSET_BITS=64 is set
e - added --enable-maxnotdefault to allow not using resources_max as defaults.
b - added new values to TJobAttr so we don't have mismatch with job.h values.
b - reset ji_momhandle so we cannot have more than one pjob for obit_reply to
find.
e - change qdel to accept 'ALL' as well as 'all'
b - changed order of searching so we find most recent jobs first. Prevents
finding old leftover job when pids rollover. Also some CACHEOBITFAILURES
updates.
b - handle case where mom replies with an unknown job error to a stat request
from the server
b - allow qalter to modify HELD jobs if BLCR is not enabled
b - change to update errpath/outpath attributes when -e -o are used with qsub
e - added string output for errnos, etc.
2.3.0
b - fixed a bug where TORQUE would ignore the "pbs_asyrunjob" API extension
string when starting jobs in asynchronous mode
e - redesign how torque.spec is built
e - added -a to qrun to allow asynchronous job start
e - allow qrerun on completed jobs
e - allow qdel to delete all jobs
e - make qdel -m functionality match the documentation
b - prevent runaway hellos being sent to server when mom's node is removed
from the server's node list
e - local client connections use a unix domain socket, bypassing inet and
pbs_iff
f - Linux 2.6 cpuset support (in development)
e - new job array submission syntax
b - fixed SIGUSR1 / SIGUSR2 to correctly change the log level
f - health check script can now be run at job start and end
e - tm tasks are now stored in a single .TK file rather than eat lots of
inodes
f - new "extra_resc" server attribute
b - "pbs_version" attr is now correctly read-only
e - increase max size of .JB and .SC file names
e - new "sched_version" server attribute
f - new printserverdb tool
e - pbs_server/pbs_mom hostname arg is now -H, -h is help
e - added $umask to pbs_mom config, used for generated output files.
e - minor pbsnodes overhaul
b - fixed memory leak in pbs_server
2.2.2
b - correctly parse /proc/pid/stat that contains parens (Meier)
b - prevent runaway hellos being sent to server when mom's node is removed
from the server's node list
b - fix qdel of entire job arrays for non operator/managers
b - fix problem where job array .AR files are not saved to disk
b - fixed problem with tracking job memory usage on OS X
b - pbs_server doesn't try to "upgrade" .JB files if they have a newer
version of the job_qs struct
2.2.1
b - fix a bug where dependent jobs get put on hold when the previous job has
completed but its state is still available for life of keep_completed
b - fixed a bug where pbs_server never deleted files from the "jobs" directory
b - fixed a bug where compute nodes were being put in an indefinite "down"
state
e - added job_array_size attribute to pbs_submit documentation
2.2.0
e - improve RPP logging for corruption issues
f - dynamic resources
e - use mlockall() in pbs_mom if _POSIX_MEMLOCK
f - consumable resource "tokens" support (Harte-Hanks)
e - build process sets default submit filter path to ${libexecdir}/qsub_filter;
we fall back to /usr/local/sbin/torque_submitfilter to maintain
compatibility
e - allow long job names when not using -N
f - new MOM $varattr config
e - daemons are no longer installed 700
e - tighten directory path checks
f - new mom configs: $auto_ideal_load and $auto_max_load
e - pbs_mom on Darwin (OS X) no longer depends on libkvm (now works on all
versions without need to re-enable /dev/kmem on newer PPC or all x86
versions)
e - added PBS_SERVER env variable for job scripts
e - add --about support to daemons and client commands
f - added qsub -t (primitive job array)
e - add PBS_RESOURCE_GRES to prolog/epilog environment
e - add -h hostname to pbs_mom (NCIFCRF)
e - filesec enhancements (StockholmU)
e - added ERS and IDS documentation
e - allow export of specific variables into prolog/epilog environment
b - change fclose to pclose to close submit filter pipe (ABCC)
e - add support for Cray XT size and larger qstat task reporting (ORNL)
b - pbs_demux is now built with pbs_mom instead of with clients
e - epilogue will only run if job is still valid on exec node
e - add qnodes, qnoded, qserverd, and qschedd symlinks
e - enable DEFAULTCKPT torque.cfg parameter
e - allow compute host and submit host suffix with nodefile_suffix
f - add --with-modulefiles=[DIR] support
b - be more careful about broken tclx installs
2.1.11
b - nqs2pbs is now a generated script
b - correct handling of priv job attr
b - change font selectors in manpages to bold
b - on pbs_server startup, don't skip job-exclusive nodes on initial MOM scan
b - pbs_server should not connect to "down" MOMs for any job operation
b - use alarm() around writing to job's stdio in case it happens to be a stopped tty
2.1.10
b - fix buffer overflow in rm_request,
fix 2 printf that should be sprintf (Umea University)
b - correct updating trusted client list (Yahoo)
b - Catch newlines in log messages, split messages text (Eygene Ryabinkin)
e - pbs_mom remote reconfig now disabled by default
use $remote_reconfig to enable it
b - fix pam configure (Adrian Knoth)
b - handle /dev/null correctly when job rerun
2.1.9
f - new queue attribute disallowed_types, currently recognized types:
interactive, batch, rerunable, and nonrerunable
e - refine "node note" feature with pbsnodes -N
e - bypass pbs_server's uid 0 check on cygwin
e - update suse initscripts
b - fix mom memory locking
b - fix sum buffer length checks in pbs_mom
b - fix memory leak in fifo scheduler
b - fix nonstandard usage of 'tail' in tpackage
b - fix aliasing error with brp_txtlen
f - allow manager to set "next job number" via hidden qmgr attribute
next_job_number
2.1.8
b - stop possible memory corruption with an invalid request type (StockholmU)
b - add node name to pbsnodes XML output (NCIFCRF)
b - correct Resource_list in qstat XML output (NCIFCRF)
b - pam_authuser fixes from uam.es
e - allow 'pbsnodes -l' to work with a node spec
b - clear exec_host and session_id on job requeue
b - fix mom child segfault when a user env var has a '%'
b - correct buggy logging in chk_job_request() (StockholmU)
e - pbs_mom shouldn't require server_name file unless it is
actually going to be read (StockholmU)
f - "node notes" with pbsnodes -n (sandia)
2.1.7
b - fix bison syntax error in Parser.y
b - fix 2.1.4 regression with spool file group owner on freebsd
b - don't exit if mlockall sets errno ENOSYS
f - qalter -v variable_list
f - MOMSLEEPTIME env delays pbs_mom initialization
e - minor log message fixups
e - enable node-reuse in qsub eval if server resources_available.nodect is set
e - pbs_mom and pbs_server can now use PBS_MOM_SERVER_PORT,
PBS_BATCH_SERVICE_PORT, and PBS_MANAGER_SERVICE_PORT env vars.
e - pbs_server can also use PBS_SCHEDULER_SERVICE_PORT env var.
e - add "other" resource to pelog's 5th argument
2.1.6
b - freebsd5 build fix
b - fix 2.1.4 regression with TM on single-node jobs
b - fix 2.1.4 regression with rerunning jobs
b - additional spool handling security fixes
2.1.5
b - fix 2.1.4 regression with -o/dev/null
2.1.4
b - fix cput job status
b - Fix "Spool Job Race condition"
2.1.3
b - correct run-time symbol in pam module on RHEL4
b - some minor hpux11 build fixes (PACCAR)
b - fix bug with log roll and automatic log filenames
b - compile error with size_fs() on digitalunix
e - pbs_server will now print build details with --about
e - new freebsd5 mom arch for Freebsd 5.x and 6.x (trasz)
e - optimize acl_group_sloppy
e - fix "list_head" symbol clash on Solaris 10
e - allow pam_pbssimpleauth to be built on OSX and Solaris
b - networking fixes for HPUX, fixes pbs_iff (PACCAR)
e - allow long job names when not using -N
c - using depend=syncwith crashed pbs_server
c - races with down nodes and purging jobs crashed pbs_server
b - staged out files will retain proper permission bits
f - may now specify umask to use while creating stderr and stdout spools
e.g. qsub -W umask=22
b - correct some fast startup behaviour
e - queue attribute max_queuable accounts for C jobs
2.1.2
b - fix momctl queries with multiple hosts
b - don't fail make install if --without-sched
b - correct MOM compile error with atol()
f - qsub will now retry connecting to pbs_server (see manpage)
f - X11 forwarding for single-node, interactive jobs with qsub -X
f - new pam_pbssimpleauth PAM module, requires --with-pam=DIR
e - add logging for node state adjustment
f - correctly track node state and allocation for suspended jobs
e - entries can always be deleted from manager ACL,
even if ACL contains host(s) that no longer exist
e - more informative error message when modifying manager ACL
f - all queue create, set, and unset operations now set a queue mtime
f - added support for log rolling to libtorque
f - pbs_server and pbs_mom have two new attributes
log_file_max_size, log_file_roll_depth
e - support installing client libs and cmds on unsupported OSes (like cygwin)
b - fix subnode allocation with pbs_sched
b - fix node allocation with suspend-resume
b - fix stale job-exclusive state when restarting pbs_server
b - don't fall over when duplicate subnodes are assigned after suspend-resume
b - handle suspended jobs correctly when restarting pbs_server
b - allow long host lists in runjob request
b - fix truncated XML output in qstat and pbsnodes
b - typo broke compile on irix6array and unicos8
e - momctl now skips down nodes when selecting by property
f - added submit_args job attribute
2.1.1
c - fix mom_sync_job code that crashes pbs_server (USC)
b - checking disk space in $PBS_SERVER_HOME was mistakenly disabled (USC)
e - node's np now accessible in qmgr (USC)
f - add ":ALL" as a special node selection when stat'ing nodes (USC)
f - momctl can now use :property node selection (USC)
f - send cluster addrs to all nodes when a node is created in qmgr (USC)
- new nodes are marked offline
- all nodes get new cluster ipaddr list
- new nodes are cleared of offline bit
f - set a node's np from the status' ncpus (only if ncpus > np) (USC)
- controlled by new server attribute "auto_node_np"
c - fix possible pbs_server crash when nodes are deleted in qmgr (USC)
e - avoid dup streams with nodes for quicker pbs_server startup (USC)
b - configure program prefix/suffix will now work correctly (USC)
b - handle shared libs in tpackages (USC)
f - qstat's -1 option can now be used with -f for easier parsing (USC)
b - fix broken TM on OSX (USC)
f - add "version" and "configversion" RM requests (USC)
b - in pbs-config --libs, don't print rpath if libdir is in the sys dlsearch
path (USC)
e - don't reject job submits if nodes are temporarily down (USC)
e - if MOM can't resolve $pbsserver at startup, try again later (USC)
- $pbsclient still suffers this problem
c - fix nd_addrs usage in bad_node_warning() after deleting nodes (MSIC)
b - enable build of xpbsmon on darwin systems (JAX)
e - run-time config of MOM's rcp cmd (see pbs_mom(8)) (USC)
e - momctl can now accept query strings with spaces, multiple -q opts (USC)
b - fix linking order for single-pass linkers like IRIX (ncifcrf)
b - fix mom compile on solaris with statfs (USC)
b - memory corruption on job exit causing cpu0 to be allocated more than once (USC)
e - add increased verbosity to tracejob and added '-q' commandline option
e - support larger values in qstat output (might break scripts!) (USC)
e - make 'qterm -t quick' shutdown pbs_server faster (USC)
2.1.0p0
fixed job tracking with SMP job suspend/resume (MSIC)
modify pbs_mom to enforce memory limits for serial jobs (GaTech)
- linux only
enable 'never' qmgr maildomain value to disable user mail
enable qsub reporting of job rejection reason
add suspend/resume diagnostics and logging
prevent stale job handler from destroying suspended jobs
prevent rapid hello from MOM from doing DOS on pbs_server
add diagnostics for why node not considered available
add caching of local serverhost addr lookup
enable job centric vs queue centric queue limit parameter
brand new autoconf+automake+libtool build system (USC)
automatic MOM restarts for easier upgrades (USC)
new server attributes: acl_group_sloppy, acl_logic_or, keep_completed, kill_delay
new server attributes: server_name, allow_node_submit, submit_hosts
torque.cfg no longer used by pbs_server
pbsdsh and TM enhancements (USC)
- tm_spawn() returns an error if execution fails
- capture TM stdout with -o
- run on unique nodes with -u
- run on a given hostname with -h
largefile support in staging code and when removing $TMPDIR (USC)
use bindresvport() instead of looping over calls to bind() (USC)
fix qsub "out of memory" for large resource requests (SANDIA)
pbsnodes default arg is now '-a' (USC)
new ":property" node selection when node stat and manager set (pbsnodes) (USC)
fix race with new jobs reporting wrong walltime (USC)
sister moms weren't setting job state to "running" (USC)
don't reject jobs if requested nodes is too large node_pack=T (USC)
add epilogue.parallel and epilogue.user.parallel (SARA)
add $PBS_NODENUM, $PBS_MSHOST, and $PBS_NODEFILE to pelogs (USC)
add more flexible --with-rcp='scp|rcp|mom_rcp' instead of --with-scp (USC)
build/install a single libtorque.so (USC)
nodes are no longer checked against server host acl list (USC)
Tcl's buildindex now supports a 3rd arg for "destdir" to aid fakeroot installs (USC)
fixed dynamic node destroy qmgr option
install rm.h (USC)
printjob now prints saved TM info (USC)
make MOM restarts with running jobs more reliable (USC)
fix return check in pbs_rescquery fixing segfault in pbs_sched (USC)
add README.pbstools to contrib directory
workaround buggy recvfrom() in Tru64 (USC)
attempt to handle socklen_t portably (USC)
fix infinite loop in is_stat_get() triggered by network congestion (USC)
job suspend/resume enhancements (see qsig manpage) (USC)
support higher file descriptors in TM by using poll() instead of select() (USC)
immediate job delete feedback to interactive queued jobs (USC)
move qmgr manpage from section 8 to section 1
add SuSE initscripts to contrib/init.d/
fix ctrl-c race while starting interactive jobs (USC)
fix memory corruption when tm_spawn() is interrupted (USC)
2.0.0p8
really fix torque.cfg parsing (USC)
fix possible overlapping memcpy in ACL parsing (USC)
fix rare self-inflicted sigkill in MOM (USC)
2.0.0p7
fixed pbs_mom SEGV in req_stat_job()
fixed torque.cfg parameter handling
fixed qmgr memory leak
2.0.0p6
fix segfault in new "acl_group_sloppy" code if a group doesn't exist (USC)
configure defaults changed to enable syslog, enable docs, and disable filesync (USC)
pelog now correctly restores previous alarm handler (Sandia)
misc fixes with syscalls returns, sign-mismatches, and mem corruption (USC)
prevent MOM from killing herself on new job race condition (USC)
- so far, only linux is fixed
remove job delete nanny earlier to not interrupt long stageouts (USC)
display C state later when using keep_completed (USC)
add 'printtracking' command in src/tools (USC)
stop overriding the user with name resolution on qsub's -o/-e args (USC)
xpbsmon now works with Tcl 8.4 (BCGSC)
don't bother spooling/keeping job output intended for /dev/null (USC)
correct missing hpux11 manpage (USC)
fix compile for freebsd - missing symbols (yahoo)
fix momctl exit code (yahoo)
new "exit_status" job attribute (USC)
new "mail_domain" server attribute (overrides --maildomain) (USC)
configure fixes for linux x86_64 and tcl install weirdness (USC)
extended mom parameter buffer space
change pbs_mkdirs to use standard var names so that chroot installs work better (USC)
torque.spec now has tcl/gui and wordexp enabled by default
enable multiple dynamic+static generic resources per node (GATech)
make sure attrs on job launch are sent to server (fixes session_id) (USC)
add resmom job modify logging
torque.cfg parsing fixes
2.0.0p5
reorganize ji_newt structure to eliminate 64 bit data packing issues
enable '--disable-spool' configure directive
enable stdout/stderr stageout to search through $HOME and $HOME/.pbs_spool
fixes to qsub's env handling for newlines and commas (UMU)
fixes to at_arst encoding and decoding for newlines and commas (USC)
use -p with rcp/scp (USC)
several fixes around .pbs_spool usage (USC)
don't create "kept" stdout/err files ugo+rw (avoid insane umask) (USC)
qsub -V shouldn't clobber qsub's environ (USC)
don't prevent connects to "down" nodes that are still talking (USC)
allow file globs to work correctly under --enable-wordexp (USC)
enable secondary group checking when evaluating queue acl_group attribute
- enable the new queue parameter "acl_group_sloppy"
sol10 build system fixes (USC)
fixed node manager buffer overflow (UMU)
fix "pbs_version" server attribute (USC)
torque.spec updates (USC)
remove the leading space on the node session attribute on darwin (USC)
prevent SEGV if config file is missing/corrupt
"keep_completed" execution queue attribute
several misc code fixes (UMU)
2.0.0p4
fix up socklen_t issues
fixed epilog to report total job resource utilization
improved RPM spec (USC)
modified qterm to drop hung connections to bad nodes
enhance HPUX operation
2.0.0p3
fixed dynamic gres loading in pbs_mom (CRI)
added torque.spec (rpmbuild -tb should work) (USC)
new 'packages' make target (see INSTALL) (USC)
added '-1' qstat option to display node info (UMICH)
various fixes in file staging and copying (USC)
- reenable stageout of directories
- fix confusing email messages on failed stageout
- child processes can't use MOM's logging, must use syslog
fix overflow in RM netload (USC)
don't check walltime on sister nodes, only on MS (ANU)
kill_task wasn't being declared properly for all mach types (USC)
don't unnecessarily link with libelf and libdl (USC)
fix compile warnings with qsort/bsearch on bsd/darwin (USC)
fix --disable-filesync to actually work (USC)
added prolog diagnostics to 'momctl -d' output (CRI)
added logging for job file management (CRI)
added mom parameter $ignwalltime (CRI)
added $PBS_VNODENUM to job/TM env (USC)
fix self-referencing job deps (USC)
Use --enable-wordexp to enable variables in data staging (USC)
$PBS_HOME/server_name is now used by MOM _iff $pbsserver isn't used_ (USC)
Fix TRU64 compile issues (NCIFCRF)
Expand job limits up to ULONG_MAX (NCIFCRF)
user-supplied TMPDIR no longer treated specially (USC)
remtree() now deals with symlinks correctly (USC)
enable configurable mail domain (Sandia)
configure now handles darwin8 (USC)
configure now handles --with-scp=path and --without-scp correctly (USC)
2.0.0p2
fix check_pwd() memory leak (USC)
2.0.0p1
fix mpiexec stdout regression from 2.0.0p0 (USC)
add 'qdel -m' support to enable annotating job cancellation (CRI)
add mom diagnostics for prolog failures and timeouts (CRI)
interactive jobs cannot be rerunable (USC)
be sure nodefile is removed when job is purged (USC)
don't run epilogue multiple times when multiple jobs exit at once (USC)
fix clearjob MOM request (momctl -c) (USC)
fix detection of local output files with localhost or /dev/null (USC)
new qstat/qselect -e option to only select jobs in exec queues (USC)
$clienthost and $headnode removed, $pbsclient and $pbsserver added (USC)
$PBS_HOME/server_name is now added to MOM's server list (USC)
resmom transient TMPDIR (USC)
add joblist to MOM's status & add experimental server "mom_job_sync" (USC)
export PBS_SCHED_HINT to pelogues if set in the job (USC)
don't build or install pbs_rcp if --enable-scp (USC)
set user hold on submitted jobs with invalid deps (USC)
add initial multi-server support for HA (CRI)
Altix cpuset enhancements (CSIRO)
enhanced momctl to diagnose and report on connectivity issues (CRI)
added hostname resolution diagnostics and logging (CRI)
fixed 'first node down' rpp failure (USC)
improved qsub response time
2.0.0p0
torque patches for RCP and resmom (UCHSC)
enhanced DIS logging
improved start-up to support quick startup with down nodes
fixed corrupt job/node/queue API reporting
fixed tracejob for large jobs (Sandia)
changed qdel to only send one SIGTERM at mom level
fixed doc build by adding AIX 5 resources docs
added prerun timeout change (RENTEC)
added code to handle select() EBADF - 9
disabled MOM quota feature by default, enabled with -DTENABLEQUOTA
cleanup MOM child error messages (USC)
fix makedepend-sh for gcc-3.4 and higher (DTU)
don't fall back to mom_rcp if configured to use scp (USC)
1.2.0p6
enabled opsys mom config (USC)
enabled arch mom config (CRI)
fixed qrun based default scheduling to ignore down nodes (USC)
disable unsetting of key/integer server parameters (USC)
allow FC4 support - quota struct fix (USC)
add fix for out of memory failure (USC)
add file recovery failure messages (USC)
add direct support for external scheduler extensions
add passwd file corruption check
add job cancel nanny patch (USC)
recursively remove job dependencies if children can never be satisfied (USC)
make poll_jobs the default behavior with a restat time of 45 seconds
added 'shell-use-arg' patch (OSC)
improved API timeout disconnect feature
added improved rapid start up
reworked mom-server state management (USC)
- removed 'unknown' state
- improved pbsnodes 'offline' management
- fixed 'momctl -C' which actually _prevented_ an update
- fixed incorrect math on 'tmpTime'
- added 'polltime' to the math on 'tmpTime'
- consolidated node state changes to new 'update_node_state()'
- tightened up the "node state machine"
- changed mom's state to follow the documented state guidelines
- correctly handle "down" from mom
- moved server stream handling out of 'is_update_stat()' to new
'init_server_stream()'
- refactored the top of the main loop to tighten up state changes
- fixed interval counting on the health check script
- forced health check script if update state is forced
- don't spam the server with updates on startup
- required new addr list after connections are dropped
- removed duplicate state updates because of broken multi-server support
- send "down" if internal_state is down (aix's query_adp() can do this)
- removed ferror() check on fread() because fread() randomly fails on initial
mom startup.
- send "down" if health check returns "ERROR"
- send "down" if disk space check fails.
1.2.0p5
make '-t quick' default behavior for qterm
added '-p' flag to qdel to enable forced job purge (USC)
fixed server resources_available n-1 issue
added further Altix CPUSet support (NCSA)
added local checkpoint script support for linux
fixed 'premature end of message' warning
clarify job deleted mail message (SDSC)
fixed AIX 5.3 support in configure (WestGrid)
fixed crash when qrun issued on job with incomplete requeue
added support for >= 4GB memory usage (GMX)
log job execution limits failures
added more detailed error messages for missing user shell on mom
fixed qsub env overflow issue
1.2.0p4
extended job prolog to include jobname, resource, queue, and account info (MAINE)
added support for Darwin 8/OS X 10.4 (MAINE)
fixed suspend/resume for MPI jobs (NORWAY)
added support for epilog.precancel to enable local job cancellation handling
fixed build for case insensitive filesystems
fixed relative path based Makefiles for xpbsmom
added support for gcc 4.0
added PBSDEBUG support to client commands to allow more verbose diagnostics of client failures
added ALLOWCOMPUTEHOSTSUBMIT option to torque.cfg
fixed dynamic pbs_server loglevel support
added mom-server rpp socket diagnostics
added support for multi-homed hosts w/SERVERHOST parameter in torque.cfg
added support for static linking w/PBSBINDIR
added availmem/totmem support to Darwin systems (MAINE)
added netload support to Darwin systems (MAINE)
1.2.0p3
enable multiple server to mom communication
fixed node reject message overwrite issue
enable pre-start node health check (BOEING)
fixed pid scanning for RHEL3 (VPAC)
added improved vmem/mem limit enforcement and reporting (UMU)
added submit filter return code processing to qsub
1.2.0p2
enhance network failure messages
fixed tracejob tool to only match correct jobs (WESTGRID)
modified reporting of linux availmem and totmem to allow larger file sizes
fixed pbs_demux for OSF/TRU64 systems to stop orphaned demux processes
added dynamic pbs_server loglevel specification
added intelligent mom job stat sync'ing for improved scalability (USC/CRI)
added mom state sync patch for dup join (USC)
added spool dir space check (MAINE)
1.2.0p1
add default DEFAULTMAILDOMAIN configure option
improve configure options to use pbs environment (USC)
use openpty() based tty management by default
enable default resource manager extensions
make mom config parameters case insensitive
added jobstartblocktime mom parameter
added bulk read in pbs_disconnect() (USC)
added support for solaris 5
added support for program args in pbsdsh (USC)
added improved task recovery (USC)
1.2.0p0
fixed MOM state update behavior (USC/Poland)
fixed set_globid() crash
added support for > 2GB file size job requirements
updated config.guess to 2003 release
general patch to initialize all function variables (USC)
added patch for serial job TJE leakage (USC)
add "hw.memsize" based physmem MOM query for darwin (Maine)
add configure option (--disable-filesync) to speed up job submission
set PBS mail precedence to bulk to avoid vacation responses (VPAC)
added multiple changes to address gcc warnings (USC)
enabled auto-sizing of 'qstat -Q' columns
purge DOS EOL characters from submit scripts
1.1.0p6
added failure logging for various MOM job launch failures (USC)
allow relative path specification for qsub '-d'
enabled $restricted parameter w/in FIFO to allow use of non-privileged ports (SAIC)
checked job launch status code for retry decisions
added nodect resource_available checking to FIFO
disabled client port binding by default for darwin systems (use --enable-darwinbind to re-enable)
- workaround for darwin bind and pclose OS bugs
fixed interactive job terminal control for MAC (NCIFCRF)
added support for MAC MOM-level cpu usage tracking (Maine)
fixed __P warning (USC)
added support for server level resources_avail override of job nodect limits (VPAC)
modify MOM copy files and delete file requests to handle NFS root issues (USC/CRI)
enhance port retry code to support mac socket behavior
clean up file/socket descriptors before execing prolog/epilog
enable dynamic cpu set management (ORNL)
enable array services support for memory management (ORNL)
add server command logging to diagnostics
fix linux setrlimit persistence on failures
1.1.0p5
added loglevel as MOM config parameter
distributed job start sequence into multiple routines
force node state/subnode state offline stat synchronization (NCSA)
fixed N-1 cpu allocation issue (no sanity checking in set_nodes)
enhance job start failure logging
added continued port checking if connect fails (RENTEC)
added case insensitive host authentication checks
added support for submitfilter command line args
added support for relocatable submitfilter via torque.cfg
fixed offline status cleared when server restarted (USC)
updated PBSTop to 4.05 (USC)
fixed PServiceType array to correctly report service messages
fixed pbs_server crash from job dependencies
prevent mom from truncating lock file when mom is already running
tcp timeout added as config option
1.1.0p4
added 15004 error logging
added use of openpty() call for locating pseudo terminals (SNL)
add diagnostic reporting of config and executable version info
add support for config push
add support for MOM config version parameters
log node offline/online and up/down state changes in pbs_server logs
add mom fork logging and home directory check
add timeout checking in rpp socket handling
added buffer overflow prevention routines
added lockfile logging
supported protected env variables with qstat
1.1.0p3
added support for node specification w/pbsnodes -a
added hostfile support to momctl
added chroot (-D) support (SRCE)
added mom chdir pjob check (SRCE)
fixed MOM HELLO initialization procedure
added momctl diagnostic/admin command (shutdown, reconfig, query, diagnose)
added mom job abort bailout to prevent infinite loops
added network reinitialization when socket failure detected
added mom-to-scheduler reporting when existing job detected
added mom state machine failure logging
1.1.0p2
add support for disk size reporting via pbs_mom
fixed netload initialization
fixed orphans on mom fork failure
updated to pbstop v 3.9 (USC)
fixed buffer overflow issue in net_server.c
added pestat package to contrib (ANU)
added parameter checking to cpy_stage() (NCSA)
added -x (xml output) support for 'qstat -f' and 'pbsnodes -a'
added SSS xml library (SSS)
updated user-project mapping enforcement (ANL)
fix bogus 'cannot find submitfilter' message for interactive jobs
fix incorrect job allocation issue for interactive jobs (NCSA)
prevent failure with invalid 'servername' specification (NCSA)
provide more meaningful 'post processing error' messages (NCSA)
check for corrupt jobs in server database and remove them immediately
enable SIGUSR1/SIGUSR2 pbs_mom dynamic loglevel adjustment
profiling enhancements
use local directory variable in scan_non_child_tasks() to prevent race condition (VPAC)
added AIX 5 odm support for realmem reporting (VPAC)
1.1.0p1
added pbstop to contrib (USC)
added OSC mpiexec patch (OSC)
confirmed OSC mom-restart patch (OSC)
fix pbsd_init purge job tracking
allow tracking of completed jobs (w/TORQUEKEEPCOMPLETED env)
added support for MAC OS 10
added qsub wrapper support
added '-d' qsub command line flag for specifying working directory
fixed numerous spelling issues in pbs docs
enable logical or'ing of user and group ACLs
allow large memory sizes for physmem under solaris (USC)
fixed qsub SEGV on bad '-o' specification
add null checking on ap->value
fixed physmem() routine for tru64 systems to load compute node physical memory
added netload tracking
1.1.0p0
fixed linux swap space checking
fixed AIX5 resmom ODM memory leak
handle split var/etc directories for default server check (CHPC)
add pbs_check utility
added TERAGRID nospool log bounds checking
add code to force host domains to lower case
verified integration of OSC prologue-environment.patch (export Resource_List.nodes in an environment variable for prologue)
verified integration of OSC no-munge-server-name.patch (do not install over existing server_name)
verified integration of OSC docfix.patch (fix minor manpage typo)
1.0.1p6
add messaging to report remote data staging failures to pbs_server
added tcp_timeout server parameter
add routine to mark hung nodes as down
add torque.setup initialization script
track okclient status
fixed INDIANA ji_grpcache MOM crash
fixed pbs_mom PBSLOGLEVEL/PBSDEBUG support
fixed pbs_mom usage
added rentec patch to mom 'sessions' output
fixed pbs_server --help option
added OSC patch to allow jobs to survive mom shutdown
added patch to support server level node comments
added support for reporting of node static resources via sss interface
added support for tracking available physical memory for IRIX/Linux systems
added support for per node probes to dynamically report local state of arbitrary value
fixed qsub -c (checkpoint) usage
1.0.1p5
add SUSE 9.0 support
add Linux 2.4 meminfo support
add support for inline comments in mom_priv/conf
allow support for up to 100 million unique jobs
add pbs_resources_all documentation
fix kill_task references
add contrib/pam_authuser
1.0.1p4
fixed multi-line readline buffer overflow
extended TORQUE documentation
fixed node health check management
1.0.1p3
added support for pbs_server health check and routing to scheduler
added support for specification of more than one clienthost parameter
added PW unused-tcp-interrupt patch
added PW mom-file-descriptor-leak patch
added PW prologue-bounce patch
added PW mlockall patch (release mlock for mom children)
added support for job names up to 256 chars in length
added PW errno-fix patch
1.0.1p2
added support for macintosh (darwin)
fixed qsub 'usage' message to correctly represent '-j',
'-k', '-m', and '-q' support
add support for 'PBSAPITIMEOUT' env variable
fixed mom dec/hp/linux physmem probes to support 64 bit
fixed mom dec/hp/linux availmem probes to support 64 bit
fixed mom dec/hp/linux totmem probes to support 64 bit
fixed mom dec/hp/linux disk_fs probes to support 64 bit
removed pbs server request to bogus probe
added support for node 'message' attribute to report internal
failures to server/scheduler
corrected potential buffer overflow situations
improved logging replacing 'unknown' error with real error message
enlarged internal tcp message buffer to support 2000 proc systems
fixed enc_attr return code checking
Patches incorporated prior to patch 2:
HPUX superdome support
add proper tracking of HP resources - Oct 2003 (NOR)
is_status memory leak patches - Oct 2003 (CRI)
corrects various memory leaks
Bash test - Sep 2003 (FHCRC)
allows support for linked shells at configure time
AIXv5 support - Sep 2003 (CRI)
allows support for AIX 5.x systems
OSC Meminfo -- Dec 2001 (P. Wyckoff)
corrects how pbs_mom figures out how much physical memory each node has under Linux
Sandia CPlant Fault Tolerance I (w/OSC enhancements) -- Dec 2001 (L. Fisk/P. Wyckoff)
handles server-MOM hangs
OSC Timeout I -- Dec 2001 (P. Wyckoff)
enables longer inter daemon timeouts
OSC Prologue Env I -- Jan 2002 (P. Wyckoff)
add support for env variable PBS_RESOURCE_NODES in job prolog
OSC Doc/Install I -- Dec 2001 (P. Wyckoff)
fix to the pbsnodes man page
Configuration information for Linux on the IA64 architecture
fix the build process to make it clean out the documentation directories during a "make distclean"
fix the installation process to keep it from overwriting ${PBS_HOME}/server_name if it already exists
correct code creating compile time warnings
allow PBS to compile on Linux systems which do not have the Linux kernel source installed
Maui RM Extension -- Dec 2002 (CRI)
enable Maui resource manager extensions including QOS, reservations, etc
NCSA Scaling I -- Mar 2001 (G. Arnold)
increase number of nodes supported by PBS to 512
NCSA No Spool -- Apr 2001 (G. Arnold)
support $HOME/.pbs_spool for large jobs
NCSA MOM Pin
pin PBS MOM into memory to keep it from getting swapped
ANL RPP Tuning -- Sep 2000 (J Navarro)
tuning RPP for large systems
WGR Server Node Allocation -- Jul 2000 (B Webb)
addresses issue where PBS server incorrectly claims insufficient nodes
WGR MOM Soft Kill -- May 2002 (B Webb)
processes are killed with SIGTERM followed by SIGKILL
PNNL SSS Patch -- Jun 2002 (Skousen)
improves server-mom and server-scheduler communication
CRI Job Init Patch -- Jul 2003 (CRI)
correctly initializes new jobs eliminating unpredictable behavior and crashes
VPAC Crash Trap -- Jul 2003 (VPAC)
supports PBSCOREDUMP env variable
CRI Node Init Patch -- Aug 2003 (CRI)
correctly initializes new nodes eliminating unpredictable behavior and crashes
SDSC Log Buffer Patch -- Aug 2003 (SDSC)
addresses log message overruns