c - crash b - bug fix e - enhancement f - new feature n - note

NOTE: the CHANGELOG file is now deprecated. Please check the release notes page
on the Adaptive Computing Website. For example, the 6.0.1 release notes can be
found:

http://docs.adaptivecomputing.com/9-0-1/releaseNotes/help.htm

6.0.0
b - TRQ-3245. Enable reporter mom to correctly handle UNKNOWN role.
b - TRQ-3242. Fix problem where resource string argument to prologue
    script was getting garbled.
b - TRQ-3232. Start threadpool at pbs_mom start.
b - TRQ-3117. Fix a misspelling in the number_successful tag (was number_successfull)
e - TRQ-3131. Add capability to pass environment variables to pbsdsh.
b - TRQ-3185. Create subdirs when server attribute use_job_subdirs is set.

5.1.2
b - TRQ-2675. Fix small errors in suse init.d scripts.
b - TRQ-3235. Fix problem when path to error, output or execution environment
    contains one or more spaces.
e - TRQ-3098. Add the ability to set a parameter exit_code_canceled_job to force all
    canceled jobs to have the same exit code regardless of the state they were in
    when they were canceled.
e - TRQ-2836. Make node health check run on sister nodes when configured for job
    start and job end as well.
e - TRQ-2843. Add the qmgr setting dont_write_nodes_file to make it so that nodes
    cannot be edited dynamically.
f - TRQ-2897. Add the ability to adopt running processes into a job with pbs_track.
b - TRQ-3189. Never delete a running job because of a dependency.

5.1.1.2
e - TRQ-3197. Add support for RHEL7 and SLES12.

5.1.1
b - TRQ-2947. Fix a race condition on deleting jobs which are failing to start.
c - TRQ-3068. Fix a race condition where a job may be deleted but its pointer
    may still be in the alljobs container.
b - TRQ-2753. Fix a memory leak in generating the authoritative okclients list.
b - TRQ-2332. Fix a job dependency problem when the failover server comes up. This
    only affects users running high availability.
b - TRQ-3023. Fix a bug when ALPS incorrectly returns a permanent confirmation failure.
b - TRQ-2833. Set CUDA_VISIBLE_DEVICES to only the indices for this host when
    it will be set.
b - TRQ-3039. Fix a deadlock when deleting a job where other jobs have after any dependencies
    on the first job.
f - TRQ-2782. Distribute job files into subdirectories when server attribute use_jobs_subdirs
    is set to true. Default is false (do not distribute job files).
b - TRQ-3116. Make qsub only retry on transient errors.
b - TRQ-3122. Fix a problem with login_property not working correctly (cray only).
b - TRQ-3114. Fix an issue where an asynchronously started job is stuck with a
    substate of starting after a failed job start.
b - TRQ-3110. Handle slot limits correctly when jobs are preempted.
e - TRQ-3095. Add the server setting disable_automatic_requeue to stop jobs from being
    requeued if they experience a transient failure on the mom.
e - TRQ-2307. Fix problems where mom restarts intermittently fail.
b - TRQ-2946. Make qmgr able to handle Cray numeric node ids.
b - TRQ-2790. Make offlining cray compute nodes persist across restarts.
e - TRQ-3104. Add millisecond precision to the Torque log file.
e - TRQ-2881. Add node health check error messages to a node's notes and therefore pbsnodes
    output.
b - TRQ-3166. Add another safety check before killing stray jobs.

5.0.2
b - TRQ-3029. Make it so that pbs_server can't have active threads when the main
    thread exits.
b - TRQ-3012. Fix memory leaks that happen each time a job is run.
b - TRQ-2966. Improved job rerun speed which had been significantly slowed down
    starting in 4.2.6. Also, pbs_mom now correctly accounts for job resources when
    user jobs call setsid more than once.
c - TRQ-2987. Fix a crash around job exits due to incorrect error code handling.
b - TRQ-2841. Fix some ways that max_user_queuable can become incorrect.
b - TRQ-3097. Fixed a problem where failed job submissions would count against
    the max_user_queuable count and could not be cleared until pbs_server
    was restarted.
b - TRQ-3087. Fixed a problem where completed jobs were counted against max_user_queuable
    when restarting pbs_server. Also if the max_user_queuable was set on a
    queue and the number of queued jobs and completed jobs were over the
    maximum then the last jobs submitted would not get loaded.

5.0.1.h2
b - Reverted a change in 5.0.0 which made it so a user could not submit a
    job from a node which had been allowed using the acl_hosts list. The
    change to 5.0.0 made it so the ruserok call could not be made to check
    for user authorization.

5.0.1
e - TRQ-2410. Improved qstat behavior in cases where bad job IDs were referenced
    in the command.
e - TRQ-2460. Two new fields were added to the accounting file for completed jobs:
    total_execution_slots and unique_node_count. total_execution_slots should be
    20 for a job that requests nodes=2:ppn=10. unique_node_count should be the
    number of unique hosts the job occupied.
e - TRQ-2594. TORQUE now uses the Munge API rather than forking when configured
    with the --enable-munge-auth option.
e - TRQ-2863. Reduced verbosity in error logging in HA environments.
e - TRQ-2868. TORQUE now allows for the modification of the output location
    based on the Mother superior hostname. An environment variable ($HOSTNAME)
    has been added to the job's environment.
e - TRQ-2882. Improved trqauthd error messages to be more meaningful and less
    redundant.
e - TRQ-2890. Added stderr capturing when using -o option.
b - TRQ-2025. Fixed bug where giving a bad queue name to qstat -Q results in
    duplicate output.
b - TRQ-2292. Fixed bug where some tasks were incorrectly listed as 0 in 'qstat -a'
    when requesting specific nodes.
b - TRQ-2367. Fixed bug related to accounting records on large systems.
b - TRQ-2411. Fixed output format bug in cases where multiple job IDs are passed
    into qstat.
b - TRQ-2646. Fixed bug where qsub did not process args correctly when using
    a submit filter.
b - TRQ-2652. Fixed parsing bug when using hostlist ranges in qsub.
b - TRQ-2653. Fixed build bug related to newer Intel MIC libraries installing
    in different locations.
b - TRQ-2730. Fixed problem where GPUs were not split between NUMA nodes. You
    now need to specify which gpus belong to each node board in the mom.layout
    file. A sample mom.layout file might look like:

    nodes=0 gpu=0
    nodes=1 gpu=1

    Also please note that this only works if you use nvml. The nvidia-smi
    command is not supported.
b - TRQ-2732. Fixed bug where OU files were being left in spool when job was
    preempted or requeued.
b - TRQ-2759. Fixed bug where reported cput was incorrect.
b - TRQ-2760. Fixed unexpected error when running 'pbsnodes -l offline -n'.
b - TRQ-2795. Fixed bug where jobs were rejected because the max_user_queuable
    limit was reached even though no jobs were in the queue.
b - TRQ-2828. Fixed bug where 'momctl -q clearmsg' didn't clear error messages
    properly.
b - TRQ-2837. Fixed bug where GPU modes were not passed to sister nodes.
b - TRQ-2852. Fixed bug while writing resources_default units to serverdb file.
b - TRQ-2885, CVE-2014-3684. Fixed issue around unauthorized termination
    of processes.
b - TRQ-2890. Improved pbsdsh to better handle simultaneous use of -o and -s
    options. Also fixed some problems where -o output was sometimes getting
    truncated.
b - TRQ-2904. Fixed bug where TORQUE was not honoring KeepCompleted server
    parameter when job_nanny was set to true.
b - TRQ-2918. Fixed problem with remote client job submission during
    ruserok() calls.
b - TRQ-2919. Fixed deadlock issue when running 'qdel -p' as non-root user.
b - TRQ-2937. Fixed bug in qsub -m when TORQUE is configured --with-sendmail.
    Some missing newlines were added.
b - TRQ-2956. Fixed bug where HOST_NAME_SUFFIX was no longer adding suffix to job names.
c - TRQ-2928, TRQ-2921, TRQ-2855, TRQ-2854, TRQ-2853, TRQ-2835. Fixed various crashes.
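
The TRQ-2460 entry above gives one worked value: nodes=2:ppn=10 should yield total_execution_slots of 20. A minimal sketch of that arithmetic follows. The helper names are hypothetical and are not part of TORQUE, and the parsing assumes only the simple "count:ppn=N" and "hostname:ppn=N" forms joined by "+"; other request forms are not handled here.

```python
def total_execution_slots(nodes_request):
    """Sum nodes * ppn over each '+'-separated chunk, e.g. '2:ppn=10' -> 20.

    nodes_request is the value that follows 'nodes=' (hypothetical helper).
    """
    total = 0
    for chunk in nodes_request.split("+"):
        parts = chunk.split(":")
        # A leading integer is a node count; anything else is treated as one named host.
        count = int(parts[0]) if parts[0].isdigit() else 1
        ppn = 1  # assume one slot per node when ppn is not given
        for part in parts[1:]:
            if part.startswith("ppn="):
                ppn = int(part[len("ppn="):])
        total += count * ppn
    return total


def unique_node_count(hosts):
    """Number of distinct hosts in a per-slot host list (hypothetical helper)."""
    return len(set(hosts))
```

For instance, a per-slot host list of n1, n1, n2 occupies two unique hosts, while the request 2:ppn=10 accounts for 20 slots.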

5.0.0
e - TRQ-2083. Remove job status polling from TORQUE. Have pbs_server only poll a
    mom for a job's information if the information hasn't been received for 5
    minutes. Otherwise, this information is communicated with the mom's status
    information.
e - TRQ-2309. Have TORQUE recognize when a request to run a job specifies a node
    list and directly access those nodes instead of searching linearly.
e - TRQ-1539. Condense the exec_host list to have one entry per node instead of one
    entry per execution slot. The node entry contains a string specifying each
    execution slot index. Also no longer display the value of exec_port in qstat.
f - TRQ-2363. Make it so that if you execute qrerun all - which previously
    returned an error - it will ask for confirmation, and then place all running
    jobs in a queued state without contacting the moms. This is meant to be used
    only when the entire cluster has gone down and can't be contacted.
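
The TRQ-1539 entry above describes condensing exec_host from one entry per execution slot to one entry per node. The sketch below illustrates the idea only. It assumes the per-slot form is "host/index" joined by "+" and that the condensed index string uses comma-separated ranges such as "0-2"; the changelog does not spell out either format, and the function names are hypothetical.

```python
def format_ranges(indices):
    """Turn a sorted list of ints into a range string, e.g. [0, 2, 3] -> '0,2-3'."""
    ranges = []
    start = prev = indices[0]
    for i in indices[1:]:
        if i == prev + 1:
            prev = i
            continue
        ranges.append((start, prev))
        start = prev = i
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else "%d-%d" % (a, b) for a, b in ranges)


def condense_exec_host(per_slot):
    """Collapse 'n1/0+n1/1+n2/0' style input to one entry per host."""
    slots = {}  # host -> list of slot indices, in first-seen host order
    for entry in per_slot.split("+"):
        host, index = entry.rsplit("/", 1)
        slots.setdefault(host, []).append(int(index))
    return "+".join(
        host + "/" + format_ranges(sorted(indices))
        for host, indices in slots.items()
    )
```

Under these assumptions, a four-slot list such as n1/0+n1/1+n1/2+n2/0 condenses to two entries, one per host.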

4.5.0
b - TRQ-2319. Replace two Torque functions with ones from the hwloc package.
n - Portable Hardware Locality (hwloc) package version 1.2 or higher must be
    installed when using cpusets (--enable-cpuset). Previously at least version
    1.1 was required.
b - TRQ-2373. Fix login nodes restricting the number of jobs to the number specified
    by np=X.
e - TRQ-2044. Create a unique identifier for all jobs in TORQUE. This makes it
    so that we're performing integer comparisons instead of string comparisons
    for finding jobs.

4.2.9
b - TRQ-2730. Make nvml and numa-support configurations work together. The admin must
    now specify which gpus are on which node board the same way it is done with mic
    co-processors, adding gpu=X[-Y] to the mom.layout line for that node board.

4.2.8
b - TRQ-2501. Fix the total number of execution slots having a count that is off-by-one for
    every Cray compute node.
b - TRQ-2498. Fixed a memory leak when using qrun -a (asynchronous). Also fixed a write
    after free error that could lead to memory corruption.
b - Fixed the thread pool manager so it would free idle nodes. Also changed the default
    thread stack sizes to a maximum of 8 MB and a minimum of 1 MB.

4.2.7
b - TRQ-2423. Fix a bug where cpusets would incorrectly be reported on mpi jobs.
b - TRQ-2329. Fix a problem where nodes could be allocated to array subjobs even
    after the job was deleted.
b - TRQ-2351. Fix an issue where moms that are before 4.2.6 can't run jobs if the
    server is 4.2.6.
e - Made it so trqauthd cannot be loaded more than once. trqauthd opens a UNIX
    domain name file to do its communication with client commands. If the
    UNIX domain name file exists trqauthd will not load. By default this file
    is /tmp/trqauthd-unix. It can be configured to point to a different directory.
    If trqauthd will not start and you know there are no other instances of trqauthd
    running you should delete the UNIX domain file and try again.
b - TRQ-2319. Replace two Torque functions with ones from the hwloc package.
n - Portable Hardware Locality (hwloc) package version 1.2 or higher must be
    installed when using cpusets (--enable-cpuset). Previously at least version
    1.1 was required.
b - TRQ-2373. Fix login nodes restricting the number of jobs to the number specified
    by np=X.
b - TRQ-2354. Fix an issue with potential overflow in user job counts. Also fix a
    user being considered different if from a different submit host.
b - TRQ-2369. Fix a problem with pbs_mom recovering which cpu indices were in use for
    jobs that were running at shutdown and still running at the time the mom restarted.
b - TRQ-2377. Jobs with future start dates were being placed in queued after being
    deleted if they were deleted before their start date and keep_completed kept them
    around long enough. Fix this.
c - TRQ-2347. Fix a segfault around re-sending batch requests.
b - TRQ-2270. Fix some problems with TORQUE continuing to have nodes in a free state
    when the host is down.
b - TRQ-2395. Fix a problem when running jobs on non-Cray nodes reporting to a pbs
    server running in cray enabled mode.
n - TRQ-2299. Make it so that the reporter mom doesn't fork to send its update.

4.2.6.1.h1
b - TRQ-2395. Fix a problem when running jobs on non-Cray nodes reporting to a pbs
    server running in cray enabled mode.

4.2.6.1
b - TRQ-2351. Fix an issue where moms that are before 4.2.6 can't run jobs if the
    server is 4.2.6.
e - Made it so trqauthd cannot be loaded more than once. trqauthd opens a UNIX
    domain name file to do its communication with client commands. If the

4.2.6
b - TRQ-2273. Job start time is hard coded to 5 minutes. If the prolog takes longer
    than that to run the job will be requeued without killing the prolog. This
b - TRQ-2395. Fix a problem when running jobs on non-Cray nodes reporting to a pbs
    server running in cray enabled mode.
n - TRQ-2299. Make it so that the reporter mom doesn't fork to send its update.
b - TRQ-2111. Fix a rare case of running jobs being deleted without having their
    resources freed.
b - TRQ-2208. Stop having pbs_mom use trqauthd when it is checkpointing a job.
e - TRQ-2022. Make pbs_mom capable of handling either naming convention for cpuset
    files, those with the 'cpuset.' prefix and those without.
b - TRQ-2259. Fix a problem for multi-node jobs: vmem was being stored in mem and
    vice versa from the sisters.
b - TRQ-2280. Save properties added to cray compute nodes in the nodes file if the
    file is overwritten by pbs_server.
around long enough. Fix this.

4.2.5
e - Remove the mom asking for a job status before sending an obit to pbs_server
    for a job that has exited. This is unnecessary overhead.
b - TRQ-2097. Make it so that the proper errno is stored for non-blocking sockets
    at connect time.
b - TRQ-2111. Make queued jobs never hold node resources.
c - TRQ-2155. Fix a crash in trqauthd.
e - TRQ-2058. Add the option of having the pbs_mom daemon read the mom hierarchy
    file instead of having to get it from pbs_server. To do this, copy the
    hierarchy to mom_priv/mom_hierarchy.
e - TRQ-2058. Add the -n option to pbs_server, telling pbs_server not to send a
    hierarchy over the network unless it is requested by pbs_mom.
e - TRQ-2020. Add the option of setting properties (features) for cray compute
    nodes in the nodes file. Syntax: node_id cray_compute property_name.

4.2.4
b - TRQ-1802. Make the environment variable $PBS_NUM_NODES accurate for multi-req
    jobs.
e - TRQ-1832. Add the ability to add a login_property to a job at the queue level
    by setting required_login_property on the queue.
e - TRQ-1925. Make pbs_mom smart enough to reserve extra memory nodes for non-numa
    configured TORQUE when more memory is requested than reserved.
e - TRQ-1923. Make job aborts for a mother superior not recognizing the job a bit
    more intelligent - if the job has been reported in the last 180 seconds in the
    mom's status update don't abort it.
b - TRQ-1934. Ask for canonical hostnames on the default address family without
    specifying for uniformity in the code.
b - TRQ-2003. For cray fix a miscalculation of nppn and width when mppdepth is
    provided for the job.
e - TRQ-1833. Optimize starting jobs by not internally tracking the jobid for each
    execution slot used by the job. Reduce string buildup and manipulation in other
    internal places as well. Job start for large jobs has been optimized to be up
    to 150X faster according to internal testing.
b - TRQ-2030. Fix an ALPS 1.2 bug with labels on nodes. In 1.2 labels would be
    repeated like this: labelnamelabelname... Cray only.
b - TRQ-1914. Fix after type dependencies not being removed from arrays.
b - TRQ-2015. Fix a problem where pbs_mom processes get stuck in a defunct state when
    doing a qrerun on a job. qrerun is not required to make this happen. Just the
    action of requeuing a running job on the mom causes this to happen.

4.2.3
b - TRQ-1653. Arrays depending on non-array jobs were broken. Fix this.
b - Add retries on transient failures to setuid and seteuid calls. TRQ-1541.
e - Add support for qstat -f -u <user>. This results in qstat -f output for only
    the specified user.
e - TRQ-1798. Make pbs_server calculate mppmaxnodect more accurately for Cray.
e - Add a timeout for mother superior when cleaning up a job. Instead of waiting
    infinitely for sisters to confirm that a job has exited, consider the job dead
    after 10 minutes. This time can be adjusted by setting $job_exit_wait_time in
    the mom's config file (time in seconds). This prevents jobs from being stuck
    infinitely if a compute node crashes or if a mom daemon becomes unresponsive.
    TRQ-1776.
e - Add the parameter default_features to queues. TRQ-1794. The other way of adding
    a feature to all jobs in a queue (setting resources_default.neednodes) is
    circumvented if a user requests a feature in the nodes request. Setting
    default_features overcomes this issue.
b - If privileged ports are disabled, make pbs_moms not check if incoming connections
    from mother superior are on privileged ports. TRQ-1669.
c - TRQ-1784, bugzilla #231. Fix a crash for modifying arrays with qalter.
e - Add two mom config parameters: max_join_job_wait_time and resend_join_job_wait_time.
    The first specifies how long pbs_mom should wait before deciding that join jobs
    will never be received, and defaults to 10 minutes. The latter specifies how long
    pbs_mom should wait before attempting to resend join jobs to moms that it hasn't
    received replies from, and this defaults to 5 minutes. Both are specified in
    seconds. Prior to this functionality mother superior would wait indefinitely
    for the join job replies. Please carefully consider what these values should be
    for your site and set them appropriately. TRQ-1790.
e - If an error happens communicating with one MIC, attempt to communicate with the
    others instead of failing the entire routine.
e - Reintroduced the procct resource for queues which allows jobs to be managed based
    on the number of procs requested. TRQ-1623
b - TRQ-1709. Fix -l gpus=X,other_things being parsed incorrectly.
b - TRQ-1639. Gpu status information wasn't being displayed correctly.
b - TRQ-1826. mppdepth is now passed correctly to the ALPS reservation.
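
The TRQ-1790 entry above describes two timers for join job replies: resend after 5 minutes by default, give up after 10 minutes by default, both expressed in seconds. The following is a rough sketch of that decision logic only, not pbs_mom's actual code. The function name, its arguments, and the returned labels are hypothetical; the defaults of 300 and 600 seconds come from the entry.

```python
def join_job_action(seconds_waiting, seconds_since_last_send,
                    resend_join_job_wait_time=300,
                    max_join_job_wait_time=600):
    """Decide what to do about a sister that has not replied to a join job.

    seconds_waiting: time since the join job was first sent.
    seconds_since_last_send: time since the most recent (re)send.
    """
    if seconds_waiting >= max_join_job_wait_time:
        # The reply is assumed never to arrive; stop waiting.
        return "give_up"
    if seconds_since_last_send >= resend_join_job_wait_time:
        # No reply yet, but still within the overall limit: try again.
        return "resend"
    return "wait"
```

A site that raises max_join_job_wait_time without also raising resend_join_job_wait_time would, under this sketch, simply see more resend attempts before giving up.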

4.2.2
b - Make job_starter work for parallel jobs as well as serial. (TRQ-1577 - thanks
    to NERSC for the patch)
b - Fix one issue with being able to submit jobs to the cray while offline. TRQ-1595.
e - Make the abort and email messages for jobs more specific when they are killed
    for going over a limit. TRQ-1076.
e - Add mom parameter mom_oom_immunize, making the mom immune to being killed in out
    of memory conditions. Default is now true. (thanks to Lukasz Flis for this work)
b - Don't count completed jobs against max_user_queuable. TRQ-1420.
e - For mics, set the variable $OFFLOAD_DEVICES with a list of MICs to use for the
    job.
b - Make pbs_track compatible with display_job_server_suffix = false. The user
    has to set NO_SERVER_SUFFIX in the environment. TRQ-1389
b - Fix the way we monitor if a thread is active. Before we used the id, but if the
    thread has exited, the id is no longer valid and this will cause a crash. Use
    pthread_cleanup functionality instead. TRQ-1745.
b - TRQ-1751. Add some code to handle a corrupted job file where the job file says it
    is running but there is no exec host list. These jobs now will receive a system
    hold.
b - Fixed problem where max_queuable and max_user_queuable would fail incorrectly.
    TRQ-1494
b - Cray: nppn wasn't being specified in reservations. Fix this. TRQ-1660.

4.2.1
b - Fix a deadlock when submitting two large arrays consecutively, the second
    depending on the first. TRQ-1646 (reported by Jorg Blank).

4.2.0
f - Support the MIC architecture. This was co-developed with Doug Johnson at
    Ohio Supercomputer Center (OSC) and provides support for the Intel® MIC
    architecture similar to GPU support in TORQUE.
b - Fix a queue deadlock. TRQ-1435
b - Fix an issue with multi-node jobs not reporting resources completely. TRQ-1222.
b - Make the API not retry for 5 consecutive timeouts. TRQ-1425
b - Fix a deadlock when no files can be copied from compute nodes to pbs_server.
    TRQ-1447.
b - Don't strip quotes from values in scripts before specific processing. TRQ-1632

4.1.6
b - Make job_starter work for parallel jobs as well as serial. (TRQ-1577 - thanks
    to NERSC for the patch, backported from 4.2.2)
b - Fix one issue with being able to submit jobs to the cray while offline. TRQ-1595.
    backported from 4.2.2

4.1.5
b - For cray: make sure that reservations are released when jobs are requeued. TRQ-1572.
b - For cray: support the mppdepth directive. Bugzilla #225.
c - If the job is no longer valid after attempting to lock the array in get_jobs_array(),
    make sure the array is valid before attempting to unlock it. TRQ-1598.
e - For cray: make it so you can continue to submit jobs to pbs_server even if you have
    restarted it while the cray is offline. TRQ-1595.
b - Don't log an invalid connection message when close_conn() is called on 65535
    (PBS_LOCAL_CONNECTION). TRQ-1557.

4.1.4
e - When in cray mode, write physmem and availmem in addition to totmem so that
    Moab correctly reads memory info.
e - Specifying size, nodes, and mppwidth are all mutually exclusive, so reject
    job submissions that attempt to specify more than one of these. TRQ-1185.
b - Merged changes for revision 7000 by hand because the merge was not clean. This
    fixes problems with a deadlock when doing job dependencies using synccount/syncwith.
    TRQ-1374
b - Fix a segfault in req_jobobit due to an off-by-one error. TRQ-1361.
e - Add the svn revision to --version outputs. TRQ-1357.
b - Fix a race condition in mom hierarchy reporting. TRQ-1378.
b - Fixed pbs_mom so epilogue will only run once. TRQ-1134
b - Fix some debug output escaping into job output. TRQ-1360.
b - Fixed a problem where server threads all get stuck in a poll. The problem
    was an infinite loop created in socket_wait_for_read if poll returned -1.
    TRQ-1382
b - Fix a Cray-mode bug with jobs ending immediately when spanning nodes of
    different proc counts when specifying -l procs. TRQ-1365.
b - Don't fail to make the tmpdir for sister moms. bugzilla #220, TRQ-1403.
c - Fix crashes due to unprotected array accesses. TRQ-1395.
b - Fixed a deadlock in get_parent_dest_queues when the queue_parent_name
    and queue_dest_name are the same. TRQ-1413. 11/7/12
b - Fixed segfault in req_movejob where the job ji_qhdr was NULL. TRQ-1416
b - Fix a conflict in the code for heterogeneous jobs and regular jobs.
b - For alps jobs, use the login nodes evenly even when one goes down. TRQ-1317.
b - Display the correct 'Assigned Cpu Count' in momctl output. TRQ-1307.
b - Make pbs_original_connect() no longer hang if the host is down. TRQ-1388.
b - Make epilogues run only once and be executed by the child and not the main
    pbs_mom process. TRQ-937.
b - Reduce the error messages in HA mode from moms. They now only log errors if
    no server could be contacted. TRQ-1385.
b - Fixed a seg-fault in send_depend_req. Also fixed a deadlock in depend_on_term.
    TRQ-1430 and TRQ-1436
b - Fixed a null pointer dereference seg-fault when checking for disallowed types.
    TRQ-1408.
b - Fix a counting problem when running multi-req ALPS jobs (cray only). TRQ-1431.
b - Remove red herring error messages 'did not find work task for local request'.
    These tasks are no longer created since issue_Drequest blocks until it gets the
    reply and then processes it. TRQ-1423.
b - Fixed a problem where qsub was not applying the submit filter when given in the torque.cfg
    file. TRQ-1446
e - When the mom has no jobs, check the aux path to make sure it is clean and
    that we aren't leaving any files there. TRQ-1240.
b - Made it so that threads taken up by poll job tasks cannot consume all available
    threads in the thread pool. This will make it so other work can continue if
    poll jobs get stuck for whatever reason and that the server will recover. TRQ-1433
b - Fix a deadlock when recording alps reservations. TRQ-1421.
b - Fixed a segfault in req_jobobit caused by NULL pointer assignment to variable
    pa. TRQ-1467
b - Fixed deadlock in remove_array. remove_array was calling get_array with allarrays_mutex
    locked. TRQ-1466
b - Fixed a problem with an end of file error when running momctl -dx. TRQ-1432.
b - Fix a deadlock in rare cases on job insertion. TRQ-1472.
b - Fix a deadlock after restarting pbs_server when it was SIGKILL'd before a
    job array was done cloning. TRQ-1474.
b - Fix a Cray-related deadlock. Always lock the reporter mom before a compute
    node. TRQ-1445
b - Additional fix for TRQ-1472. In rm_request on the mom pbs_tcp_timeout was
    getting set to 0 which made it so the MOM would fail reading incoming data
    if it had not already arrived. This would cause momctl to fail with an
    end of file message.
e - Add a safety net to resend any obits for exiting jobs on the mom that still
    haven't cleaned up after five minutes. TRQ-1458.
b - Fix cray running jobs being cancelled after a restart due to jobs not being
    set to the login nodes. TRQ-1482.
b - Fix a bug where using -V discarded -v. TRQ-1457.
b - Make qsub -I -x work again. TRQ-1483.
c - Fix a potential crash when getting the status of a login node in cray mode.
    TRQ-1491.
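
The TRQ-1382 entry above attributes stuck server threads to a loop in socket_wait_for_read that never ended once poll returned -1. The changelog does not say how the loop was fixed, so the sketch below only illustrates the general pattern of a wait loop that cannot spin forever on errors. It models the C convention of a return code plus errno through a hypothetical poll_once callable; none of these names come from TORQUE.

```python
import errno


def wait_for_read(poll_once, max_interrupts=3):
    """Wait until poll_once reports readiness, a timeout, or a hard error.

    poll_once() returns (rc, err): rc > 0 means readable, rc == 0 means the
    timeout expired, rc == -1 means failure with err holding the errno value.
    """
    interrupts = 0
    while True:
        rc, err = poll_once()
        if rc > 0:
            return "ready"
        if rc == 0:
            return "timeout"
        # rc == -1: retry only for an interrupted call, and only a bounded
        # number of times, so a persistent failure always exits the loop.
        if err == errno.EINTR and interrupts < max_interrupts:
            interrupts += 1
            continue
        return "error"
```

The key property is that every branch for rc == -1 either consumes part of a finite retry budget or returns, so no error value can keep the thread looping.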

4.1.3
b - Fix a security loophole that potentially allowed an interactive job to run
    as root due to not resetting a value when $attempt_to_make_dir and $tmpdir
    are set. TRQ-1078.
b - Fix down_on_error for the server. TRQ-1074.
b - Prevent pbs_server from spinning in select due to sockets in CLOSE_WAIT.
    TRQ-1161.
e - Have pbs_server save the queues each time before exiting so that legacy
    formats are converted to xml after upgrading. TRQ-1120.
b - Fix phantom jobs being left on the pbs_moms and blocking jobs for Cray
    hardware. TRQ-1162. (Thanks Matt Ezell)
b - Fix a race condition on free'd memory when checking for orphaned alps
    reservations. TRQ-1181. (Thanks Matt Ezell)
b - If interrupted when reading the terminal type for an interactive job continue
    trying to read instead of giving up. TRQ-1091.
b - Fix displaying elapsed time for a job. TRQ-1133.
b - Make offlining nodes persistent after shutting down. TRQ-1087.
b - Fixed a memory leak when calling net_move. net_move allocates memory for args
    and starts a thread on send_job. However, args were not getting released
    in send_job. TRQ-1199
b - Changed pbs_connect to check for a server name. If it is passed in only that
    server name is tried for a connection. If no server name is given then the
    default list is used. The previous behavior was to try the name passed in and
    the default server list. This would lead to confusion in utilities like qstat
    when querying for a specific server. If the server specified was not available,
    information from the remaining list would still be returned.
    TRQ-1143.
e - Make issue_Drequest wait for the reply and have functions continue processing
    immediately after instead of the added overhead of using the threadpool.
c - tm_adopt() calls caused pbs_mom to crash. Fix this. TRQ-1210.
b - Array element 0 wasn't showing up in qstat -t output. TRQ-1155.
b - Cores with multiple processing units were being incorrectly assigned in cpusets.
    Additionally, multi-node jobs were getting the cpu list from each node in each
    cpuset, also causing problems. TRQ-1202.
b - Finding subjobs (for heterogeneous jobs) wasn't compatible with hostnames that
    have dashes. TRQ-1229.
b - Removed the call to wait_request in the main_loop on pbs_server. All of our communication
    is handled directly and there is no longer a need to wait for an out of band
    reply from a client. TRQ-1161.
e - Modified output for qstat -r. Expanded Req'd Time to include seconds and centered Elap Time
    over its column.
b - Fixed a bug found at Univ. of Michigan where a corrupt .JB file would cause
    pbs_server to seg-fault and restart.
b - Don't leave quotes on any arguments passed to the resource list. TRQ-1209.
b - Fix a race condition that causes deadlock when two threads are routing the same job.
b - Fixed a bug with qsub where environment variables were not getting populated with the
    -v option. TRQ-1228.
b - This time for sure. TRQ-1228. When max_queuable or max_user_queuable were set it
    was still possible to go over the limit. This was because a job is qualified
    in the call to req_quejob but does not get inserted into the queue until svr_enquejob
    is called in req_commit, four network requests later. In a multi-threaded environment
    this allowed several jobs to be qualified and put in the pipeline before they
    were actually committed to a queue.
b - If max_user_queuable or max_queuable were set on a queue TORQUE would not honor
    the limit when filling those queues from a routing queue. This has now
    been fixed. TRQ-1088.
|
|
b - Fixed seg-fault when running jobs asynchronously. TRQ-1252.
|
|
b - Fixed a bug with SIGHUP to pbs_server. The signal handler (change_logs()) does file I/O
|
|
which is not allowed for signal interruption. This caused pbs_server to be up but
|
|
unresponsive to any commands. TRQ-1250 and TR!-1224
|
|
b - Job dependencies didn't work with display_server_suffix=false. Fixed. TRQ-1255.
|
|
b - Don't report alps reservation ids if a node is in interactive mode. TRQ-1251.
|
|
b - Only attempt to cancel an orphaned alps reservation a maximum of one time per
|
|
iteration. TRQ-1251.
|
|
b - Fix a deadlock when recording an alps reservation on the server side. Cray only.
|
|
TRQ-1272.
|
|
c - Fix mismanagement of the ji_globid. TRQ-1262.
|
|
c - Setting display_job_server_suffix=false crashed with job arrays. Fixed. bugzilla #216
|
|
b - Restore the asynchronous functionality. TRQ-1284.
|
|
e - Made it so pbs_server will come up even if a job cannot recover because of a missing
|
|
job dependency. TRQ-1287
|
|
b - Fixed a segfault in the path from do_tcp to tm_request to tm_eof. In this path we freed
|
|
the tcp channel three times. The call to DIS_tcp_cleanup was removed from tm_eof and
|
|
tm_request. TRQ-1232.
|
|
b - Fix a deadlock in logging when the machine is out of disk space. TRQ-1302.
|
|
b - Fixed a deadlock which occurs when there is a job with a dependency that is being moved
|
|
from a routing queue to an execution queue. TRQ-1294
|
|
e - Retry cleanup with the mom every 20 seconds for jobs that are stuck in an exiting state.
|
|
TRQ-1299.
|
|
b - Enabled qsub filters to be accessed from a non-default location. TRQ-1127
|
|
b - Restored the ability to write the resources_used data to the accounting logs. This was in 4.1.1
|
|
and 4.1.2 but failed to make it into 4.1.3. TRQ-1329
|
|
c - Fix a double free if the same chan is stored on two tasks for a job. TRQ-1299.
|
|
b - Changed pbs_original_connect to retry a failed connect attempt
|
|
MAX_RETRIES (5) times before returning failure. This will
|
|
reduce the number of client commands that fail due to a connection
|
|
failure. TRQ-1355
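A minimal sketch of the retry behavior described in this entry, in Python rather than C. The function name connect_with_retry and the `connect` callable are hypothetical. The sketch assumes that MAX_RETRIES counts total attempts, which the entry does not state explicitly.

```python
import time

MAX_RETRIES = 5  # value quoted in the entry above


def connect_with_retry(connect, max_retries=MAX_RETRIES, delay=0.0):
    """Call connect() until it succeeds or max_retries attempts have failed.

    Assumption: max_retries is the total number of attempts.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(max_retries):
        try:
            return connect()
        except OSError as err:
            last_error = err
            # Optionally pause before the next attempt, but not after the last.
            if delay and attempt < max_retries - 1:
                time.sleep(delay)
    raise last_error
```

A client command wrapped this way fails only when every attempt fails, which is the reduction in spurious failures the entry refers to.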
|
|
b - Fix the proliferation of "Non-digit found where a digit was expected" messages, due
|
|
to an off-by-one error. TRQ-1230.
|
|
b - Fixed a deadlock caused by queue not getting released when jobs are aborted when
|
|
moving jobs from a routing queue to an execution queue. TRQ-1344.
|
|
|
|
4.1.2
|
|
e - Add the ability to run a single job partially on CRAY hardware and partially
|
|
on hardware external to the CRAY in order to allow visualization of
|
|
large simulations.
|
|
|
|
4.1.1
|
|
e - pbs_server will now detect and release orphaned ALPS reservations
|
|
b - Fixed a deadlock with nodes in stream_eof after call to svr_connect.
|
|
b - resources_used information now appears in the accounting log again
|
|
TRQ-1083 and bugzilla 198.
|
|
b - Fixed a seg-fault found at LBNL where freeaddrinfo would crash because
|
|
of uninitialized memory.
|
|
b - Fixed a deadlock in handle_complete_second_time. We were not unlocking
|
|
when exiting svr_job_purge.
|
|
e - Added the wrappers lock_ji_mutex and unlock_ji_mutex to do the mutex locking
|
|
for all job->ji_mutex locks.
|
|
e - admins can now set the global max_user_queuable limit using qmgr. TRQ-978.
|
|
b - No longer make multiple alps reservation parameters for each alps reservation.
|
|
This creates problems for the aprun -B command.
|
|
b - Fix a problem running extremely large jobs with alps 1.1 and 1.2. Reservations
|
|
weren't correctly created in the past. TRQ-1092.
|
|
b - Fixed a deadlock with a queue mutex caused by calling qstat -a <queue1> <queue2>
|
|
b - Fixed a memory corruption bug, double free in check_if_orphaned. To fix this
|
|
issue_Drequest was modified to always free the batch request regardless of
|
|
any errors.
|
|
b - Fix a potential segfault when using munge but not having set authorized users.
|
|
TRQ-1102
|
|
b - Added a modified version of a patch submitted by Matt Ezell for Bugzilla 207.
|
|
This fixes a seg-fault in qsub if Moab passes an environment variable without
|
|
a value.
|
|
b - fix an error in parsing environment variables with commas, newlines, etc. TRQ-1113
|
|
b - fixed a deadlock with array jobs running simultaneously with qstat.
|
|
b - Fixed qsub -v option. Variable list was not getting passed in to job environment.
|
|
TRQ-1128
|
|
b - TRQ-1116. mail is now sent on job start again.
|
|
b - TRQ-1118. Cray jobs are now recovered correctly after a restart.
|
|
b - TRQ-1109. Fixed x11 forwarding for interactive jobs. (qsub -I -X). Previous to
|
|
this fix interactive jobs would not run any x applications such as xterm, xclock,
|
|
etc.
|
|
b - TRQ-1161, Fixes a problem where TORQUE gets into a high CPU utilization condition.
|
|
The problem was that in the function process_pbs_server_port there was no
|
|
error returned if the call to getpeername() failed in the default case.
|
|
b - TRQ-1161. This fixes another case that would cause a thread to spin on poll
|
|
in start_process_pbs_server_port. A call to the dis function would return
|
|
an error but the code would close the connection and return the error code which
|
|
was a value less than 20. start_process_pbs_server_port did not recognize the low
|
|
error code value and would keep calling into process_pbs_server_port.
|
|
b - qdel'ing a running job in the cray environment was trying to communicate with the
|
|
cray compute instead of the login node. This is now fixed. TRQ-1184.
|
|
b - TRQ-1161. Fixed a problem in stream_eof where a svr_connect was used to connect
|
|
to a MOM to see if it was still there. On successful connection the connection
|
|
is closed but the wrong function (close_conn) with the wrong argument (the
|
|
handle returned by svr_connect()) was used. Replaced with svr_disconnect
|
|
b - Make it so that procct is never shown to Moab or users. TRQ-872.
|
|
b - TRQ-1182. Fixed a problem where jobs with dependencies were deleted on
|
|
the restart of pbs_server.
|
|
b - TRQ-1199. Fixed memory leaks found by Valgrind. Fixed a leak when routing jobs
|
|
to a remote server, memory leak with procct, memory leak creating queues,
|
|
memory leak with mom_server_valid_message_source and a memory leak in req_track.
|
|
|
|
4.1.0
|
|
e - make free_nodes() only look at nodes in the exec_host list and not examine
|
|
all nodes to check if the job at hand was there. This should greatly speed
|
|
up freeing nodes.
|
|
f - add the server parameter interactive_jobs_can_roam (Cray only). When set to
|
|
true, interactive jobs can have any login as mother superior, but by default
|
|
all interactive jobs will have their submit_host as mother superior
|
|
b - Fixed TRQ-696. Jobs get stuck in running state.
|
|
b - Fixed a problem where interactive jobs using X-forwarding would fail
|
|
because TORQUE thought DISPLAY was not set. The problem was that
|
|
DISPLAY was set using lowercase internally. TRQ-1010
|
|
e - Add a hostname/address caching feature to alleviate stress on DNS.
|
|
|
|
4.0.3
|
|
b - fix qdel -p all - was performing a qdel all. TRQ-947
|
|
b - fix some memory leaks in 4.0.2 on the mom and server TRQ-944
|
|
c - TRQ-973. Fix a possibility of a segfault in netcounter_incr()
|
|
b - removed memory manager from alloc_br and free_br to solve a memory leak
|
|
b - fixes to communications between pbs_sched and pbs_server. TRQ-884
|
|
b - fix server crash caused by gpu mode not being right after gpus=x:. TRQ-948.
|
|
b - fix logic in torque.setup so it does not say successfully started when
|
|
trqauthd failed to start. TRQ-938.
|
|
b - fix segfaults on job deletes, dependencies, and cases where a batch
|
|
request is held in multiple places. TRQ-933, 988, 990
|
|
e - TRQ-961/bugzilla-176 - add the configure option --with-hwloc-path=PATH
|
|
to allow installing hwloc to a non-default location.
|
|
c - fix a crash when using job dependencies that fail - TRQ-990
|
|
e - Cache addresses and names to prevent calling getnameinfo() and getaddrinfo()
|
|
too often. TRQ-993
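A minimal sketch of the caching idea in this entry, in Python rather than C. AddressCache is a hypothetical name and is not TORQUE's implementation. It assumes a simple cache with no expiry, which may differ from what TORQUE does.

```python
import socket


class AddressCache:
    """Hypothetical cache; the resolver defaults to socket.getaddrinfo."""

    def __init__(self, resolver=socket.getaddrinfo):
        self._resolver = resolver
        self._by_name = {}
        self.lookups = 0   # number of times the real resolver was called

    def resolve(self, hostname, port=None):
        """Return cached results, calling the resolver only on a miss."""
        key = (hostname, port)
        if key not in self._by_name:
            self.lookups += 1
            self._by_name[key] = self._resolver(hostname, port)
        return self._by_name[key]
```

Repeated lookups of the same node name then hit the dictionary instead of DNS.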
|
|
c - fix a crash around re-running jobs
|
|
e - change so some Moab environment variables will be put into environment for
|
|
the prologue and epilogue scripts. TRQ-967.
|
|
b - make command line arguments override the job script arguments. TRQ-1033.
|
|
b - fix a pbs_mom crash when using blcr. TRQ-1020.
|
|
e - Added patch to buildutils/pbs_mkdirs.in which enables pbs_mkdirs to run
|
|
silently. Patch submitted by Bas van der Vlies. Bugzilla 199.
|
|
|
|
4.0.2
|
|
e - Change so init.d script variables get set based on the configure command.
|
|
TRQ-789, TRQ-792.
|
|
b - Fix so qrun jobid[] does not cause pbs_server segfault. TRQ-865.
|
|
b - Fix to validate qsub -l nodes=x against resources_max.nodes the same as v2.4.
|
|
TRQ-897.
|
|
b - bugzilla #185. Empty arrays should no longer be loaded and now when qdel'ed
|
|
they will be deleted.
|
|
b - bugzilla #182. The serverdb will now correctly write out memory allocated.
|
|
b - bugzilla #188. The deadlock when using job logging is resolved
|
|
b - bugzilla #184. pbs_server will no longer log an erroneous error when the 12th
|
|
job array is submitted.
|
|
e - Allow pbs_mom to change users group on stderr/stdout files. Enabled by configuring
|
|
Torque with CFLAGS='-DRESETGROUP'. TRQ-908.
|
|
e - Have the parent intermediate mom process wait for the child to open the demux before
|
|
moving on for more precise synchronization for radix jobs.
|
|
e - Changed the way jobs queued in a routing queue are updated. A thread is now launched
|
|
at startup and by default checks every 10 seconds to see if there are jobs
|
|
in the routing queues that can be promoted to execution queues.
|
|
b - Fix so pbs_mom will compile when configured with --with-nvml-lib=/usr/lib and
|
|
--with-nvml-include. TRQ-926.
|
|
b - fix pbs_track to add its process to the cpuset as well. TRQ-925.
|
|
b - Fix so gpu count gets written out to server nodes file when using
|
|
--enable-nvidia-gpus. TRQ-927.
|
|
b - change pbs_server to listen on all interfaces. TRQ-923
|
|
b - Fix so "pbs_server --ha" does not fail when checking path for server.lock file. TRQ-907.
|
|
b - Fixed a problem in qmgr where only 9 commands could be completed before a failure.
|
|
Bugzilla 192 and TRQ-931
|
|
b - Fix to prevent deadlock on server restart with completed job that had a dependency.
|
|
TRQ-936.
|
|
b - prevent TORQUE from losing connectivity with Moab when starting jobs asynchronously
|
|
TRQ-918
|
|
b - prevent the API from segfaulting when passed a negative socket descriptor
|
|
b - don't allow pbs_tcp_timeout to ever be less than 5 minutes - may be temporary
|
|
b - fix pbs_server so it fails if another instance of pbs_server is already
|
|
running on same port. TRQ-914.
|
|
|
|
4.0.1
|
|
b - Fix trqauthd init scripts to use correct path to trqauthd.
|
|
b - fix so multiple stage in/out files can again be used with qsub -W
|
|
b - fix so comma separated file list can be used with qsub -W stagein/stageout.
|
|
Matches qsub documentation again.
|
|
b - Only seed the random number generator once
|
|
b - The code to run the epilogue set of scripts was removed when refactoring the
|
|
obit code. The epilogues are now run as part of post_epilogue. preobit_reply
|
|
is no longer used.
|
|
b - if using a default hierarchy and moms on non-default ports, pass that information
|
|
along in the hierarchy
|
|
e - Make pbs_server contact pbs_moms in the order in which they appear in the hierarchy
|
|
in order to reduce errors on start-up of a large cluster.
|
|
b - fix another possibility for deadlock with routing queues
|
|
e - move some of the main loop functionality to the threadpool in order to increase
|
|
responsiveness.
|
|
e - Enabled the configuration to be able to write the path of the library directory
|
|
to /etc/ld.so.conf.d in a file named libtorque.conf. The file will be created
|
|
by default during make install. The configuration can be made to not install this
|
|
file by using the configure option --without-loadlibfile
|
|
b - Fixed a bug where Moab was using the option SYNCJOBID=TRUE which allows Moab
|
|
to create the job ids in TORQUE. With this in place if TORQUE were terminated
|
|
it would delete all jobs submitted through msub when pbs_server was restarted.
|
|
This fix recovers all jobs whether submitted with msub or qsub when pbs_server
|
|
restarts.
|
|
b - fix for where pbsnodes displays outdated gpu_status information.
|
|
b - fix problem with '+ and segfault when using multiple node gpu requests.
|
|
b - Fixed a bug in svr_connect. If the value for func were null then the newly
|
|
created connection was not added to the svr_conn table. This was not right.
|
|
We now always add the new connection to svr_conn.
|
|
b - fix problem with mom segfault when using 8 or more gpus on mom node.
|
|
b - Fix so child pbs_mom does not remain running after qdel on slow starting job.
|
|
TRQ-860.
|
|
b - Made it so the MOM will let pbs_server know it is down after momctl -s is invoked.
|
|
e - Made it so localhost is no longer hard coded. The string comes from getnameinfo.
|
|
b - fix a mom hierarchy error for running the moms on non-default ports
|
|
b - Fix server segfault for where mom in nodes file is not in mom_hierarchy. TRQ-873.
|
|
b - Fix so pbs_mom won't segfault after a qdel is done for a job that is still
|
|
running the prologue. TRQ-832.
|
|
b - Fix for segfault when using routing queues in pbs_server. TRQ-808
|
|
b - Fix so epilogue.precancel runs only once and only for cancelled jobs. TRQ-831.
|
|
b - Added a close socket to validate_socket to properly terminate the connection.
|
|
Moved the free of the incoming variable sock to process_svr_conn from the
|
|
beginning of the function to the end. This fixed a problem where the client
|
|
would always get a RST when trying to close its end of the connection.
|
|
b - routing to a routing queue now works again, TRQ-905, bugzilla 186
|
|
b - Fix server segfaults that happened doing qhold for blcr job. TRQ-900.
|
|
n - TORQUE 4.0.1 released 5/3/2012
|
|
|
|
4.0.0
|
|
e - make a threadpool for TORQUE server. The number of threads is
|
|
customizable using min_threads and max_threads, and idle time before
|
|
exiting can be set using thread_idle_seconds.
|
|
e - make pbs_server multi-threaded in order to increase responsiveness and scalability.
|
|
e - remove the forking from pbs_server running a job, the thread handling the request just
|
|
waits until the job is run.
|
|
e - change qdel to simply send qdel all - previously this was executed by a qstat and a qdel
|
|
of every individual job
|
|
e - no longer fork to send mail, just use a thread
|
|
e - use hwloc as the backbone for cpuset support in TORQUE (contributed by Dr. Bernd Kallies)
|
|
e - add the boolean variable $use_smt to mom config. If set to false, this skips logical
|
|
cores and uses only physical cores for the job. It is true by default.
|
|
(contributed by Dr. Bernd Kallies)
|
|
n - with the multi-threading the pbs_server -t create and -t cold commands could no longer
|
|
ask for user input from the command line. The call to ask if the user wants to continue
|
|
was moved higher in the initialization process and some of the wording changed to
|
|
reflect what is now happening.
|
|
e - if cpusets are configured but aren't found and cannot be mounted, pbs_mom will now fail to
|
|
start instead of failing silently.
|
|
e - Change node_spec from an N^2 (but average 5N) algorithm to an N algorithm with respect
|
|
to nodes. We only loop over each node once at a maximum.
|
|
e - Abandon pbs_iff in favor of trqauthd. trqauthd is a daemon to be started once that can
|
|
perform pbs_iff's functionality, increasing speed and enabling future security
|
|
enhancements
|
|
e - add mom_hierarchy functionality for reporting. The file is located in
|
|
<TORQUE_HOME>/server_priv/mom_hierarchy, and can be written to tell moms to send
|
|
updates to other moms who will pass them on to pbs_server. See docs for details
|
|
e - add a unit testing framework (check). It is compiled with --with-check and tests
|
|
are executed using make check. The framework is complete but not many tests have
|
|
been written as of yet.
|
|
b - Made changes to IM protocol where commands were not either waiting for a reply
|
|
or not sending a reply. Also made changes to close connections that were left
|
|
open.
|
|
b - Fix for where qmgr record_job_info is True and server hangs on startup.
|
|
e - Mom rejection messages are now passed back to qrun when possible
|
|
e - Added the option -c for startup. By default, the server attempts to send the mom
|
|
hierarchy file to all moms on startup, and all moms update the server and request
|
|
the hierarchy file. If both are trying to do this at once, it can cause a lot of
|
|
traffic. -c tells pbs_server to wait 10 minutes to attempt to contact moms that
|
|
haven't contacted it, reducing this traffic.
|
|
e - Added mom parameter -w to reduce start times. This parameter makes the mom wait to send its
|
|
first update until the server sends it the mom hierarchy file, or until 10
|
|
minutes have passed. This should reduce large cluster startup times.
|
|
|
|
3.0.5
|
|
b - fix for writing too much data when job_script is saved to job log.
|
|
b - fix for where pbs_mom would not automatically set gpu mode.
|
|
b - fix for aligning qstat -r output when configured with -DTXT.
|
|
e - Change size of transfer block used on job rerun from 4k to 64k.
|
|
b - With nvidia gpus, TORQUE was losing the directive of what nodes it should
|
|
run the job on from Moab. Corrected.
|
|
e - add the $PBS_WALLTIME variable to jobs, thanks to a patch from Mark Roberts
|
|
n - change moab_array_compatible server parameter so it defaults to true
|
|
e - change to allow pbs_mom to run if configured with --enable-nvidia-gpus but
|
|
installed on a node without Nvidia gpus.
|
|
|
|
3.0.4
|
|
c - fix a buffer being overrun with nvidia gpus enabled
|
|
b - no longer leave zombie processes when munge authenticating.
|
|
b - no longer reject procs if it is the second argument to -l
|
|
b - when having pbs_mom re-read the config file, old servers were kept, and pbs_mom
|
|
attempted to communicate with those as well. Now they are cleared and only the
|
|
new server(s) are contacted.
|
|
b - pbsnodes -l can now search on all valid node states
|
|
e - Added functionality that allows the values for the server parameter
|
|
authorized_users to use wild cards for both the user and host portion.
|
|
e - Improvements in munge handling of client connections and authentication.
|
|
|
|
3.0.3
|
|
b - fix for bugzilla #141 - qsub was overwriting the path variable in PBSD_authenticate
|
|
e - automatically create and mount /dev/cpuset when TORQUE is configured but the cpuset
|
|
directory isn't there
|
|
b - fix a bug where node lines past 256 characters were rejected. This buffer has been
|
|
made much larger (8192 characters)
|
|
b - clear out exec_gpus as needed
|
|
b - fix for bugzilla #147 - recreate $PBS_NODESFILE file when restarting a blcr
|
|
checkpointed job
|
|
b - Applied patch submitted by Eric Roman for resmom/Makefile.am (Bugzilla #147)
|
|
b - Fix for adding -lcr for BLCR makefiles (Bugzilla #146)
|
|
c - fix a potential segfault when using asynchronous runjob with an array slot limit
|
|
b - fix bugzilla #135, stagein was deleting directory instead of file
|
|
b - fix bugzilla #133, qsub submit filter, the -W arguments are not all there
|
|
e - add a mom config option - $attempt_to_make_dir - to give the user the option to
|
|
have TORQUE attempt to create the directories for their output file if they don't exist
|
|
b - Fixed momctl to return an error on failure. Prior to this fix momctl always returned 0
|
|
regardless of success or failure.
|
|
e - Change to allow qsub -l ncpus=x:gpus=x which adds a resource list entry for both
|
|
b - fix so user epilogues are run as user instead of root
|
|
b - No longer report a completion code if a job is pre-empted using qrerun.
|
|
c - Fix a crash in record_jobinfo() - this is fixed by backporting dynamic strings from
|
|
4.0.0 so that all of the resizing is done in a central location, fixing the crash.
|
|
b - No longer count down walltime for jobs that are suspending or have stopped running
|
|
for any other reasons
|
|
e - add a mom config option - $ext_pwd_retry - to specify # of retries on
|
|
checking for password validity.
|
|
|
|
3.0.2
|
|
c - check if the file pointer to /dev/console can be opened. If not, don't attempt to write it
|
|
b - fix a potential buffer overflow security issue in job names and host address names
|
|
b - restore += functionality for nodes when using qmgr. It was overwriting old properties
|
|
b - fix bugzilla #134, qmgr -= was deleting all entries
|
|
e - added the ability in qsub to submit jobs requesting total gpus for job instead of gpus per node:
|
|
-l ncpus=X,gpus=Y
|
|
b - do not prepend ${HOME} with the current dir for -o and -e in qsub
|
|
e - allow an administrator using the proxy user submission to also set the job id to be used
|
|
in TORQUE. This makes TORQUE easier to use in grid configurations.
|
|
b - fix jobs named with -J not always having the server name appended correctly
|
|
b - make it so that jobs named like arrays via -J have legal output and error file names
|
|
b - make a fix for ATTR_node_exclusive - qsub wasn't accepting -n as a valid argument
|
|
|
|
3.0.1
|
|
e - updated qsub's man page to include ATTR_node_exclusive
|
|
b - when updating the nodes file, write out the ports for the mom if needed
|
|
b - fix a bug for non-NUMA systems that was continuously increasing memory values
|
|
e - the queue files are now stored as XML, just like the serverdb
|
|
e - Added code from 2.5-fixes which will try and find nodes that did not
|
|
resolve when pbs_server started up. This is in reference to Bugzilla
|
|
bug 110.
|
|
e - make gpus compatible with NUMA systems, and add the node attribute
|
|
numa_gpu_node_str for an additional way to specify gpus on node boards
|
|
e - Add code to verify the group list as well when VALIDATEGROUPS is set in torque.cfg
|
|
b - Fix a bug where if geometry requests are enabled and cpusets are enabled, the cpuset
|
|
wasn't deleted unless a geometry request was made.
|
|
b - Fix a race condition for pbs_mom -q, exitstatus was getting overwritten and as a result
|
|
jobs weren't always re-queued on pbs_server, but were being deleted instead.
|
|
e - Add a configure option --with-tcp-retry-limit to prevent potential 4+ hour hangs on
|
|
pbs_server. We recommend --with-tcp-retry-limit=2
|
|
n - Changing the way to set ATTR_node_exclusive from -E to -n, in order to continue
|
|
compatibility with Moab.
|
|
b - preserve the order on array strings in TORQUE, like the route_destinations for a
|
|
routing queue
|
|
b - fix bugzilla #111, multi-line environment variables causing errors in TORQUE.
|
|
b - allow apostrophes in Mail_Users attributes, as apostrophes are rare but legal email
|
|
characters
|
|
b - restored functionality for -W umask as reported in bugzilla 115
|
|
b - Updated torque.spec.in to be able to handle the snapshot names of builds.
|
|
b - fix pbs_mom -q to work with parallel jobs
|
|
b - Added code to free the mom.lock file during MOM shutdown.
|
|
e - Added new MOM configure option job_starter. This option will execute
|
|
the script submitted in qsub to the executable or script provided
|
|
b - fixed a bug in set_resources that prevented the last resource in a list from being
|
|
checked. As a result the last item in the list would always be added
|
|
without regard to previous entries.
|
|
e - altered the prologue/epilogue code to allow root squashing
|
|
f - added the mom config parameter $reduce_prolog_checks. This makes it so TORQUE only checks
|
|
to verify that the file is a regular file and is executable.
|
|
e - allow more than 5 concurrent connections to TORQUE using pbsD_connect. Increase it to 10
|
|
b - fix a segfault when receiving an obit for a job that no longer exists
|
|
e - Added options to conditionally build munge, BLCR, high-availability, cpusets,
|
|
and spooling. Also allows customization of the sendmail path and allows for
|
|
optional XML conversion to serverdb.
|
|
b - also remove the procct resource when it is applied because of a default
|
|
c - fix a segfault when queue has acl_group_enable and acl_group_sloppy set
|
|
true and no acl_groups are defined.
|
|
|
|
3.0.0
|
|
e - serverdb is now stored as xml, this is no longer configurable.
|
|
f - added --enable-numa-support for supporting NUMA-type architectures. We
|
|
have tested this build on UV and Altix machines. The server treats the
|
|
mom as a node with several special numa nodes embedded, and the pbs_mom
|
|
reports on these numa nodes instead of itself as a whole.
|
|
f - for numa configurations, pbs_mom creates cpusets for memory as well as
|
|
cpus
|
|
e - adapted the task manager interface to interact properly with NUMA
|
|
systems, including tm_adopt
|
|
e - Added autogen.sh to make life easier in a Makefile.in-less world.
|
|
e - Modified buildutils/pbs_mkdirs.in to create server_priv/nodes file
|
|
at install time. The file only shows examples and a link to the
|
|
TORQUE documentation.
|
|
f - added ATTR_node_exclusive to allow a job to have a node exclusively.
|
|
f - added --enable-memacct to use an extra protocol in order to
|
|
accurately track jobs that exceed their memory limits and kill
|
|
them
|
|
e - when ATTR_node_exclusive is set, reserve the entire node (or entire
|
|
numa node if applicable) in the cpuset
|
|
n - Changed the protocol versions for all client-to-server, mom-to-server and
|
|
mom-to-mom protocols from 1 to 2. The changes to the protocol in this version
|
|
of TORQUE will make it incompatible with previous versions.
|
|
e - when a select statement is used, tally up the memory requests and mark
|
|
the total in the resource list. This allows memory enforcement for
|
|
NUMA jobs, but doesn't affect others as memory isn't enforced for
|
|
multinode jobs
|
|
e - add an asynchronous option to qdel
|
|
b - do not reply when an asynchronous reply has already been sent
|
|
e - make the mem, vmem, and cput usage available on a per-mom basis using momctl -d2
|
|
(Dr. Bernd Kallies)
|
|
e - move the memory monitor functionality to linux/mom_mach.c in order to store the
|
|
more accurate statistics for usage, and still use it for applying limits.
|
|
(Dr. Bernd Kallies)
|
|
e - when pbs_mom is compiled to use cpusets, instead of looking at all processes,
|
|
only examine the ones in cpuset task files. For busy machines (especially large
|
|
systems like UVs) this can exponentially reduce job monitoring/harvesting times.
|
|
(Dr. Bernd Kallies)
|
|
e - when cpusets are configured and memory pressure enabled, add the ability to
|
|
check memory pressure for a job. Using $memory_pressure_threshold and
|
|
$memory_pressure_duration in the mom's config, the admin sets a threshold at
|
|
which a job becomes a problem. If duration is set, the job will be killed if
|
|
it exceeds the threshold for the configured number of checks. If duration isn't
|
|
set, then an error is logged.
|
|
(Dr. Bernd Kallies)
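A minimal sketch of the threshold and duration logic described in this entry, in Python rather than C. MemoryPressureMonitor is a hypothetical name. The sketch assumes that the checks must be consecutive and that readings over the threshold are logged until the kill point is reached; the entry does not confirm either detail.

```python
class MemoryPressureMonitor:
    """Hypothetical illustration of $memory_pressure_threshold/_duration."""

    def __init__(self, threshold, duration=None):
        self.threshold = threshold
        self.duration = duration   # None means "duration not set"
        self._over = 0             # consecutive checks above the threshold

    def check(self, pressure):
        """Return 'ok', 'log', or 'kill' for one memory pressure reading."""
        if pressure <= self.threshold:
            self._over = 0         # assumption: the count resets when below
            return "ok"
        self._over += 1
        if self.duration is not None and self._over >= self.duration:
            return "kill"
        return "log"
```

With no duration configured the monitor only ever reports 'log', matching the entry's statement that an error is logged in that case.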
|
|
e - change pbs_track to look for the executable in the existing path so it doesn't always
|
|
need a complete path.
|
|
(Dr. Bernd Kallies)
|
|
e - report sessions on a per numa node basis when NUMA is enabled
|
|
(Dr. Bernd Kallies)
|
|
b - Merged revision 4325 from 2.5-fixes. Fixed a problem where the -m n
|
|
(request no mail on qsub) was not always being recognized.
|
|
e - Merged buildutils/torque.spec.in from 2.4-fixes.
|
|
Refactored torque spec file to comply with established RPM best
|
|
practices, including the following:
|
|
- Standard installation locations based on RPM macro configuration
|
|
(e.g., %{_prefix})
|
|
- Latest upstream RPM conditional build semantics with fallbacks for
|
|
older versions of RPM (e.g., RHEL4)
|
|
- Initial set of optional features (GUI, PAM, syslog, SCP) with more
|
|
planned
|
|
- Basic working configuration automatically generated at install-time
|
|
- Reduce the number of unnecessary subpackages by consolidating where
|
|
it makes sense and using existing RPM features (e.g., --excludedocs).
|
|
|
|
2.5.10
|
|
b - Fixed a problem where pbs_mom will crash if check_pwd returns NULL. This could
|
|
happen for example if LDAP was down and getpwnam returns NULL.
|
|
e - Added code to delete a job on the MOM if a job is in the EXITED substate and
|
|
going through the scan_for_exiting code. This happens when an obit has been
|
|
sent and the obit reply received, but the PBS_BATCH_DeleteJob has not been
|
|
received from the server on the MOM. This fix allows the MOM to delete the
|
|
job and free up resources even if the server for some reason does not send
|
|
the delete job request.
|
|
b - TRQ-608: Removed code to check for blocking mode in write_nonblocking_socket().
|
|
Fixes problem with interactive jobs (qsub -I) exiting prematurely.
|
|
c - fix a buffer being overrun with nvidia gpus enabled (backported from 3.0.4)
|
|
b - To fix a problem in 2.5.9 where the job_array structure was modified
|
|
without changing the version or creating an upgrade path. This made
|
|
it incompatible with previous versions of TORQUE 2.5 and 3.0.
|
|
Added new array structure job_array_259. This is the original torque
|
|
2.5.9 job_array structure with the num_purged element added in the middle
|
|
of the structure. job_array_259 was created so users could upgrade from 2.5.9
|
|
and 3.0.3 to later versions of TORQUE. The job_array structure was
|
|
modified by moving the num_purged element to the bottom of the structure.
|
|
pbsd_init now has an upgrade path for job arrays from version 3 to version
|
|
4. However, there is an exceptional case when upgrading from 2.5.9 or 3.0.3
|
|
where pbs_server must be started using a new -u option.
|
|
b - no longer leave zombie processes when munge authenticating. (backported from 3.0.4)
|
|
|
|
|
|
2.5.9
|
|
e - change mom to only log "cannot find nvidia-smi in PATH" once when built
|
|
with --enable-nvidia-gpus and running on a node that does not have Nvidia
|
|
drivers installed.
|
|
b - Change so gpu states get set/unset correctly. Fixes problems with multiple
|
|
exclusive jobs being assigned to same gpu and where next job gets rejected
|
|
because gpu state was not reset after last shared gpu job finished.
|
|
e - Added a 1 millisecond sleep to src/lib/Libnet/net_client.c client_to_svr()
|
|
if connect fails with EADDRINUSE, EINVAL or EADDRNOTAVAIL. For these cases
|
|
TORQUE will retry the connect again. This fix increases the chance of success
|
|
on the next iteration.
|
|
b - Changes to decrease some gpu error messages and to detect unusual gpu
|
|
drivers and configurations.
|
|
b - Change so user cannot impersonate a different user when using munge.
|
|
e - Added new option to torque.cfg named TRQ_IFNAME. This allows the user to designate
|
|
a preferred outbound interface for TORQUE requests. The interface is the name
|
|
of the NIC interface, for example eth0.
|
|
e - Added instructions concerning the server parameter moab_array_compatible to the
|
|
README.array_changes file.
|
|
b - Fixed a problem where pbs_server would seg-fault if munged was not running. It would
|
|
also seg-fault if an invalid credential were sent from a client. The seg-fault
|
|
occurred in the same place for both cases.
|
|
b - Fixed a problem where jobs dependent on an array using afteranyarray would not start
|
|
when a job element of the array completed.
|
|
b - Fixed a bug where array jobs .AZ file would not be deleted when the array job was done.
|
|
e - Modified qsub so that it will set PBS_O_HOST on the server from the incoming interface.
|
|
(with this fix QSUBHOST from torque.cfg will no longer work. Do we need to make it
|
|
to override the host name?)
|
|
b - fix so user epilogues are run as user instead of root (backported from 3.0.3)
|
|
b - fix to prevent pbs_server from hanging when doing server to server job moves.
|
|
(backported from 3.0.3)
|
|
b - Fixed a problem where array jobs would always lose their state when pbs_server was
|
|
restarted. Array jobs now retain their previous state between restarts of the server
|
|
the same as non-array jobs. This fix takes care of a problem where Moab and TORQUE
|
|
would get out of sync on jobs because of this discrepancy between states.
|
|
b - Made a fix related to procct. If no resources are requested on the qsub line previous
|
|
versions of TORQUE did not create a Resource_List attribute. Specifically a node and
|
|
nodect element for Resource_List. Adding this broke some applications. I made it so
|
|
if no nodes or procs resources are requested the procct is set to 1 without creating
|
|
the nodes element.
|
|
e - Changed enable-job-create to with-job-create with an optional CFLAG argument.
|
|
--with-job-create=<CFLAG options>
|
|
e - Changed qstat.c to display 6 instead of 5 digits for Req'd Memory for a qstat -a.
|
|
|
|
2.5.8
|
|
e - added util function getpwnam_ext() that has retry and errno logging
|
|
capability for calls to getpwnam().
|
|
c - fix a potential segfault when using asynchronous runjob with an array slot limit
|
|
(backported from 3.0.3)
|
|
b - In pbs_original_connect() only the first NCONNECT entries of the connection table
|
|
were checked for availability. NCONNECT is defined as 10. However, the connection
|
|
table is PBS_NET_MAX_CONNECTIONS in size. PBS_NET_MAX_CONNECTIONS is 10240.
|
|
NCONNECT is now defined as PBS_NET_MAX_CONNECTIONS.
|
|
b - fix bugzilla #135, stagein was deleting directory instead of file (backported
|
|
from 3.0.3)
|
|
b - If the resources nodes or procs are not submitted on the qsub command line then
|
|
the nodes attribute does not get set. This causes a problem if procct is set on
|
|
queues because there is no proc count available to evaluate. This fix sets
|
|
a default nodes value of 1 if the nodes or procs resources are not requested.
|
|
e - Change so Nvidia drivers 260, 270 and above are recognized.
|
|
e - Added server attribute no_mail_force which when set True eliminates all
|
|
e-mail when job mail_points is set to "n"
|
|
|
|
2.5.7
|
|
e - Added new qsub argument -F. This argument takes a quoted string as
|
|
an argument. The string is a list of space separated commandline
|
|
arguments which are available to the job script.
|
|
b - Fixed a potential buffer overflow problem in src/resmom/checkpoint.c function
|
|
mom_checkpoint_recover. I modified the code to change strcpy and strcat to strncpy
|
|
and strncat.
|
|
b - Fixed a bug for high availability. The -l listener option for pbs_server was not
|
|
complete and did not allow pbs_server to properly communicate with the scheduler.
|
|
Also fixed a bug with job dependencies where the second server or later in the
|
|
$TORQUE_HOME/server_name directory was not added as part of the job dependency
|
|
so dependent jobs would get stuck on hold if the current server was not the first
|
|
server in the server_name file.
|
|
|
|
2.5.6
|
|
b - Made changes to record_jobinfo and supporting functions to be
|
|
able to use dynamically allocated buffers for data. This fixed
|
|
a problem where incoming data overran fixed sized buffers.
|
|
b - Updated torque.spec.in to be able to handle the snapshot
|
|
names of builds.
|
|
e - Added new MOM configure option job_starter. This option will pass
|
|
the script submitted in qsub to the executable or script provided
|
|
as the argument to the job_starter option of the MOM configure file.
|
|
b - fixed a problem with pbs_server high availability where the current
|
|
server could not keep the HA lock. The problem was a result of truncating
|
|
the directory name where the lock file was kept. TORQUE would fail to
|
|
validate permissions because it would do a stat on the wrong directory.
|
|
b - Added code to free the mom.lock file during MOM shutdown.
|
|
b - fixed a bug in set_resources that prevented the last resource in a list from being
|
|
checked. As a result the last item in the list would always be added
|
|
without regard to previous entries.
|
|
e - Added new symbol JOB_EXEC_OVERLIMIT. When a job exceeds a limit (e.g. walltime) the
|
|
job will fail with the JOB_EXEC_OVERLIMIT value and
|
|
also produce an abort case for mailing purposes. Previous to this change
|
|
a job exceeding a limit returned 0 (success) and no mail
|
|
was sent to the user if requested on abort.
|
|
e - Added options to buildutils/torque.spec.in to conditionally build munge, BLCR,
|
|
high-availability, cpusets, and spooling. Also allows customization of the
|
|
sendmail path and allows for optional XML conversion to serverdb.
|
|
b - --with-tcp-retry-limit now actually changes things without needing to run autoheader
|
|
b - Fixed a problem with minimum sizes in queues. Minimum sizes were not getting enforced because
|
|
the logic checking the queue against the user request used an && when it needed a || in the
|
|
comparison.
|
|
e - The -e and -o options of qsub allow a user to specify a path or optionally a filename for output.
|
|
If the path given by the user ended with a directory name but no '/' character at the end then
|
|
TORQUE was confused and would not convert the .OU or .ER file to the final output/error file. The
|
|
code has now been changed to stat the path to see if the end path element is a file or directory
|
|
and handle it appropriately.
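
The directory check described in this entry can be sketched as follows. This is an illustration in Python under stated assumptions, not TORQUE's C implementation; resolve_output_path and default_name are hypothetical names, with default_name standing in for the job's default output file name.

```python
import os


def resolve_output_path(path, default_name):
    """Return the final output file for a qsub -o/-e style path."""
    # A trailing separator or an existing directory both mean
    # "put the default file name inside this directory".
    if path.endswith(os.sep) or os.path.isdir(path):
        return os.path.join(path, default_name)
    # Otherwise the last path element is taken to be the file itself.
    return path
```

Checking the file system instead of relying on a trailing '/' is what lets a bare directory name be handled correctly.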
|
|
e - Added new MOM configuration option $rpp_throttle. The syntax for this in the
|
|
$TORQUE_HOME/mom_priv/config file is $rpp_throttle <value> where value is a long
|
|
representing microseconds. Setting this value causes rpp data to pause after every
|
|
sendto for <value> microseconds. This may help with large jobs where full data does
|
|
not arrive at sister nodes.
|
|
c - check if the file pointer to /dev/console can be opened. If not, don't attempt to write to it
|
|
(backported from 3.0.2)
|
|
b - Added patch from Michael Jennings to buildutils/torque.spec.in. This patch
|
|
allows an rpm configured with DRMAA to complete even if all of the
|
|
support files are not present on the system.
|
|
b - committed patch submitted by Michael Jennings to fix bug 130. TORQUE on the MOM would call
|
|
lstat as root when it should call it as user in open_std_file.
|
|
f - Added the ability to detect Nvidia gpus using nvidia-smi (default) or NVML.
|
|
Server receives gpu statuses from pbs_mom. Added server attribute auto_node_gpu
|
|
that allows automatically setting number of gpus for nodes based on gpu
|
|
statuses. Added new configure options --enable-nvidia-gpus,
|
|
--with-nvml-include and --with-nvml-lib.
|
|
c - fix a segfault when using --enable-nvidia-gpus and pbs_mom has Nvidia driver
|
|
older than 260 that still has nvidia-smi command
|
|
e - Added capability to automatically set mode on Nvidia gpus. Added support for
|
|
gpu reseterr option on qsub. The nodes file will be updated with Nvidia gpu
|
|
count when --enable-nvidia-gpus configure option is used. Moved some code
|
|
out of job_purge_thread to prevent segfault on mom.
|
|
e - Applied patch submitted by Eric Roman. This patch addresses some build issues
|
|
with BLCR, and fixes an error where BLCR would report -ENOSUPPORT when trying
|
|
to checkpoint a parallel job. The patch adds a --with-blcr option to configure
|
|
to find the path to the BLCR libraries. There are --with-blcr-include,
|
|
--with-blcr-lib and --with-blcr-bin to override the search paths, if necessary.
|
|
The last option, --with-blcr-bin is used to generate contrib/blcr/checkpoint_script
|
|
and contrib/blcr/restart_script from the information supplied at configure time.
|
|
b - Fixed problem where calling qstat with a non-existent job id would hang the qstat
|
|
command. This was only a problem when configured with MUNGE.
|
|
b - fix a potential buffer overflow security issue in job names and host address names
|
|
|
|
|
|
2.5.5
|
|
b - change so gpus get written back to nodes file
|
|
e - make it so that even if an array request has multiple consecutive '%' the slot
|
|
limit will be set correctly
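
One way to tolerate repeated '%' characters when reading a slot limit is sketched below. This is an illustration in Python, not TORQUE's parser; parse_array_request is a hypothetical name, and treating a run of '%' as a single separator is this sketch's reading of the entry.

```python
def parse_array_request(request):
    """Split 'RANGE%LIMIT' into (range_text, slot_limit).

    slot_limit is None when the request carries no limit.
    """
    range_text, sep, limit_text = request.partition("%")
    # Skip any extra '%' characters that follow the first one.
    limit_text = limit_text.lstrip("%")
    if not sep or not limit_text:
        return range_text, None
    return range_text, int(limit_text)
```

For example, under this sketch both "0-99%5" and "0-99%%%5" yield the range "0-99" with a slot limit of 5.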
|
|
b - Fixed bug in job_log_open where the global variable logpath was freed instead
|
|
of joblogpath.
|
|
b - Fixed memory leak in function procs_requested.
|
|
b - Validated incoming data for escape_xml to prevent a seg-fault with incoming
|
|
null pointers
|
|
e - Added submit_host and init_work_dir as job attributes. These two
|
|
values are now displayed with a qstat -f. The submit_host is
|
|
the name of the host from where the job was submitted. init_work_dir
|
|
is the working directory as in PBS_O_WORKDIR.
|
|
e - change so blcr checkpoint jobs can restart on different node. Use
|
|
configure --enable-blcr to allow.
|
|
b - remove the use of a GNU specific function, and fix an error for solaris builds
|
|
b - Updated PBS_License.txt to remove the implication that the software
|
|
is not freely redistributable.
|
|
b - remove the $PBS_GPUFILE when job is done on mom
|
|
b - fix a race condition when issuing a qrerun followed by a qdel that caused
|
|
the job to be queued instead of deleted sometimes.
|
|
e - Implemented Bugzilla Bug 110. If a host in the nodes file cannot be resolved
|
|
at startup the server will try once every 5 minutes until the node
|
|
resolves, and then add it to the nodes list.
|
|
e - Added a "create" method to pbs_server init.d script so a serverdb file
|
|
can be created if it does not exist at startup time. This is an enhancement
|
|
in reference to Bugzilla bug 90.
|
|
b - Fixed a problem in parse_node_token where the local static variable pt would be advanced
|
|
past the end of the line input if there is no newline character at the end of the nodes
|
|
file.
|
|
e - To fix Bugzilla Bug 121 I created a thread in job_purge on the mom in the file src/resmom/job_func.c
|
|
All job purging now happens on its own thread. If any of the system calls fail to return
|
|
the thread will hang but the MOM will still be able to process work.
|
|
|
|
|
|
2.5.4
|
|
f - added the ability to track gpus. Users set gpus=X in the nodes file for
|
|
relevant node, and then request gpus in the nodes request:
|
|
-l nodes=X[:ppn=Y][:gpus=Z]. The gpus appear in $PBS_GPUFILE, a new
|
|
environment variable, in the form: <hostname>-gpu<index> and in a
|
|
new job attribute exec_gpus:
|
|
<hostname>-gpu/<index>[+<hostname>-gpu/<index>...]
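
The two naming formats in this entry differ only by the '/' before the index. The sketch below, in Python, builds both; gpu_file_lines and exec_gpus_value are hypothetical helper names, and numbering GPUs from 0 is an assumption.

```python
def gpu_file_lines(hostname, count):
    """Lines in the $PBS_GPUFILE form: <hostname>-gpu<index>."""
    return [f"{hostname}-gpu{index}" for index in range(count)]


def exec_gpus_value(allocations):
    """exec_gpus form: <hostname>-gpu/<index> entries joined by '+'.

    allocations is a list of (hostname, index) pairs.
    """
    return "+".join(f"{host}-gpu/{index}" for host, index in allocations)
```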
|
|
b - clean up job mom checkpoint directory on checkpoint failure
|
|
e - Bugzilla bug 91. Check the status before the service is actually started.
|
|
(Steve Traylen - CERN)
|
|
e - Bugzilla bug 89. Only touch lock/subsys files if service actually starts.
|
|
(Steve Traylen - CERN)
|
|
c - when using job_force_cancel_time, fix a crash in rare cases
|
|
e - add server parameter moab_array_compatible. When set to true, this parameter
|
|
places a limit hold on jobs past the slot limit. Once one of the unheld jobs
|
|
completes or is deleted, one of the held jobs is freed.
|
|
b - fix a potential memory corruption for walltime remaining for jobs
|
|
(Vikentsi Lapa)
|
|
b - fix potential buffer overrun in pbs_sched (Bugzilla #98, patch from
|
|
Stephen Usher @ University of Oxford)
|
|
e - check if a process still exists before killing it and sleeping. This speeds up
|
|
the time for killing a task exponentially, although this will show mostly for
|
|
SMP/NUMA systems, but it will help everywhere.
|
|
(Dr. Bernd Kallies)
|
|
b - Fix for requeue failures on mom. Forked pbs_mom would silently segfault and
|
|
job was left in Exiting state.
|
|
b - change so "mom_checkpoint_job_has_checkpoint" and "execing command" log
|
|
messages do not always get logged
|
|
|
|
2.5.3
|
|
b - stop reporting errors on success when modifying array ranges
|
|
b - don't try to set the user id multiple times
|
|
b - added some retrying to get connection and changed some log messages when
|
|
doing a pbs_alterjob after a checkpoint
|
|
c - fix segfault in tracejob. It wasn't malloc'ing space for the null
|
|
terminator
|
|
e - add the variables PBS_NUM_NODES and PBS_NUM_PPN to the job environment
|
|
(TRQ-6)
|
|
e - be able to append to the job's variable_list through the API
|
|
(TRQ-5)
|
|
e - Added support for munge authentication. This is an alternative for the
|
|
default ruserok remote authentication and pbs_iff. This is a compile
|
|
time option. The configure option to use is --enable-munge-auth.
|
|
Ken Nielson (TRQ-7) September 15, 2010.
|
|
b - fix the dependency hold for arrays. They were accidentally cleared
|
|
before (RT 8593)
|
|
e - add a logging statement if sendto fails at any points in rpp_send_out
|
|
b - Applied patch submitted by Will Nolan to fix bug 76.
|
|
"blocking read does not time out using signal handler"
|
|
b - fix a bug in the $spool_as_final_name code if HAVE_WORDEXP is
|
|
undefined
|
|
b - Bugzilla bug 84. Security bug on the way checkpoint is being handled.
|
|
(Robin R. - Miami Univ. of Ohio)
|
|
e - Now saving serverdb as an xml file instead of a byte-dump, thus
|
|
allowing canned installations without qmgr scripts, as well as more
|
|
portability. Able to upgrade automatically from 2.1, 2.3, and 2.4
|
|
b - fix to cleanup job files on mom after a BLCR job is checkpointed and held
|
|
b - make the tcp reading buffer able to grow dynamically to read larger
|
|
values in order to avoid "invalid protocol" messages
|
|
e - change so checkpoint files are transferred as the user, not as root.
|
|
f - Added configure option --with-servchkptdir which allows specifying path
|
|
for server's checkpoint files
|
|
b - could not set the server HA parameters lock_file_update_time and
|
|
lock_file_check_time previously. Fixed.
|
|
e - qpeek now has the options --ssh, --rsh, --spool, --host, -o, and
|
|
-e. Can now output both the STDOUT and STDERR files. Eliminated
|
|
numlines, which didn't work.
|
|
b - fix to prevent a possible segfault when using checkpointing.
|
|
|
|
2.5.2
|
|
e - Allow the nodes file to use the syntax node[0-100] in the name to
|
|
create identical nodes with names node0, node1, ..., node100.
|
|
(also node[000-100] => node000, node001, ... node100)
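
The range syntax in this entry, including the zero-padded form, can be sketched as follows. This is an illustration in Python, not the nodes-file parser itself; expand_node_range is a hypothetical name, and taking the padding width from the first bound is an assumption drawn from the node[000-100] example.

```python
import re


def expand_node_range(name):
    """Expand 'node[0-100]' style names into a list of node names."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", name)
    if match is None:
        return [name]  # no range present: a single node
    prefix, first, last = match.groups()
    # A leading zero on the first bound requests zero-padded indices.
    width = len(first) if len(first) > 1 and first[0] == "0" else 0
    return [prefix + str(index).zfill(width)
            for index in range(int(first), int(last) + 1)]
```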
|
|
b - fix support of the 'procs' functionality for qsub.
|
|
b - remove square brackets [] from job and default stdout/stderr filenames
|
|
for job arrays (fixes conflict with some non-bash shells)
|
|
n - fix build system so README.array_changes is included in tar.gz file made
|
|
with "make dist"
|
|
n - fix build system so contrib/pbsweb-lite-0.95.tar.gz, contrib/qpool.gz
|
|
and contrib/README.pbstools are included in the tar.gz file made
|
|
with "make dist"
|
|
c - fixed crash when moving the job to a different queue (bugzilla 73)
|
|
e - Modified buildutils/pbs_mkdirs.in to create server_priv/nodes file
|
|
at install time. The file only shows examples and a link to the
|
|
TORQUE documentation. This enhancement was first committed to trunk.
|
|
c - fix pbs_server crash from invalid qsub -t argument
|
|
b - fix so blcr checkpoint jobs work correctly when put on hold
|
|
b - fixed bugzilla #75 where pbs_server would segfault with a double free when
|
|
calling qalter on a running job or job array.
|
|
e - Changed free_br back to its original form and modified copy_batchrequest
|
|
to make a copy of the rq_extend element which will be freed in
|
|
free_br.
|
|
b - fix condition where job array "template" may not get cleaned up properly
|
|
after a server restart
|
|
b - fix to get new pagg ID and add additional CSA records when restarting from
|
|
checkpoint
|
|
e - added documentation for pbs_alterjob_async(), pbs_checkpointjob(),
|
|
pbs_fbserver(), pbs_get_server_list() and pbs_sigjobasync().
|
|
b - Committed patch from Eygene Ryanbinkin to fix bug 61. /dev/null would
|
|
under some circumstances have its permissions modified when jobs exited
|
|
on a compute node.
|
|
e - add --enable-top-tempdir-only to only create the top directory of the
|
|
job's temporary directory when configured
|
|
b - make the code for reconnecting to the server more robust, and remove
|
|
elements of not connecting if a job isn't running
|
|
e - allow input of walltime in the format of [DD]:HH:MM:SS
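
Converting the day-prefixed format to seconds can be sketched as follows. This is an illustration in Python, not TORQUE's parser; walltime_to_seconds is a hypothetical name, and the sketch accepts only the three-field and four-field forms.

```python
def walltime_to_seconds(text):
    """Convert 'HH:MM:SS' or 'DD:HH:MM:SS' to a number of seconds."""
    fields = [int(part) for part in text.split(":")]
    if len(fields) not in (3, 4):
        raise ValueError(f"expected [DD:]HH:MM:SS, got {text!r}")
    if len(fields) == 3:
        fields.insert(0, 0)  # no day field given
    days, hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```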
|
|
b - Fix so BLCR checkpoint files get copied to server on qchkpt and periodic
|
|
checkpoints
|
|
c - corrected a segfault when display_job_server_suffix is set to false
|
|
and job_suffix_alias was unset.
|
|
|
|
2.5.1
|
|
b - modified Makefile.in and Makefile.am at root to include contrib/AddPrivileges
|
|
|
|
2.5.0
|
|
|
|
e - Added new server config option alias_server_name. This option allows
|
|
the MOM to add an additional server name to the list
|
|
of trusted addresses. The point of this is to be able to handle
|
|
alias ip addresses. UDP requests that come into an aliased ip address
|
|
are returned through the primary ip address in TORQUE. Because
|
|
the address of the reply packet from the server is not the same address
|
|
the MOM sent its HELLO1 request to, the MOM drops the packet and the MOM
|
|
cannot be added to the server.
|
|
n - auto_node_np will now adjust np values down as well as up.
|
|
e - Enabled TORQUE to be able to parse the -l procs=x node spec. Previously
|
|
TORQUE simply recorded the value of x for procs in Resource_List. It
|
|
now takes that value and allocates x processors packed on any available
|
|
node. (Ken Nielson Adaptive Computing. June 17, 2010)
|
|
f - added full support (server-scheduler-mom) for Cygwin (UIIP NAS of Belarus,
|
|
uiip.bas-net.by)
|
|
b - fixed EINPROGRESS in net_client.c. This error appears on every attempt at
|
|
connecting and requires individual processing. The old erroneous
|
|
processing brought a large network delay, especially on Cygwin.
|
|
e - improved signal processing after connecting in client_to_svr and added own
|
|
implementation of bindresvport for OS which lack it (Igor Ilyenko,
|
|
UIIP Minsk)
|
|
f - created permission checking of Windows (Cygwin) users, using mkpasswd,
|
|
mkgroup and own functions IamRoot, IamUser (Yauheni Charniauski,
|
|
UIIP Minsk)
|
|
f - created permission checking of submitted jobs (Vikentsi Lapa,
|
|
UIIP Minsk)
|
|
f - Added the --disable-daemons configure option for starting server-sched-mom
|
|
as Windows services; cygrunsrv.exe puts them into the background
|
|
independently.
|
|
e - Adapted output of Cygwin's diagnostic information (Yauheni
|
|
Charniauski, UIIP Minsk)
|
|
b - Changed pbsd_main to call daemonize_server early only if
|
|
high_availability_mode is set.
|
|
e - added new qmgr server attributes (clone_batch_size, clone_batch_delay)
|
|
for controlling job cloning (Bugzilla #4)
|
|
e - added new qmgr attribute (checkpoint_defaults) for setting default
|
|
checkpoint values on Execution queues (Bugzilla #1)
|
|
e - print a more informative error if pbs_iff isn't found when trying to
|
|
authenticate a client
|
|
e - added qmgr server attribute job_start_timeout, specifies timeout to be
|
|
used for sending job to mom. If not set, tcp_timeout is used.
|
|
e - added -DUSESAVEDRESOURCES code that uses the server's saved resources used
|
|
for accounting end record instead of current resources used for jobs that
|
|
stopped running while mom was not up.
|
|
e - TORQUE job arrays now use arrays to hold the job pointers and not
|
|
linked lists (allows constant lookup).
|
|
f - Allow users to delete a range of jobs from the job array (qdel -t)
|
|
f - Added a slot limit to the job arrays - this restricts the number of
|
|
jobs that can concurrently run from one job array.
|
|
f - added support for holding ranges of jobs from an array with a single
|
|
qhold (using the -t option).
|
|
f - now ranges of jobs in an array can be modified through qalter
|
|
(using the -t option).
|
|
f - jobs can now depend on arrays using these dependencies:
|
|
afterstartarray, afterokarray, afternotokarray, afteranyarray
|
|
f - added support for using qrls on arrays with the -t option
|
|
e - complete overhaul of job array submission code
|
|
f - by default show only a single entry in qstat output for the whole array
|
|
(qstat -t expands the job array)
|
|
f - server parameter max_job_array_size limits the number of jobs allowed
|
|
in an array
|
|
b - job arrays can no longer circumvent max_user_queuable
|
|
b - job arrays can no longer circumvent max_queuable
|
|
f - added server parameter max_slot_limit to restrict slot limits
|
|
e - changed array names from jobid-index to jobid[index] for consistency
|
|
|
|
2.4.13
|
|
e - change so blcr checkpoint jobs can restart on different node. Use
|
|
configure --enable-blcr to allow. (Bugzilla 68, backported from 2.5.5)
|
|
e - Add code to verify the group list as well when VALIDATEGROUPS is set in torque.cfg
|
|
(backported from 3.0.1)
|
|
b - Fix a bug where if geometry requests are enabled and cpusets are enabled, the cpuset
|
|
wasn't deleted unless a geometry request was made. (backported from 3.0.1)
|
|
b - Fix a race condition for pbs_mom -q, exitstatus was getting overwritten and as a result
|
|
jobs weren't always re-queued, but were being deleted instead. (backported from 3.0.1)
|
|
b - allow apostrophes in Mail_Users attributes, as apostrophes are rare but legal email
|
|
characters (backported from 3.0.1)
|
|
b - Fixed a problem in parse_node_token where the local static variable pt would be advanced
|
|
past the end of the line input if there is no newline character at the end of the nodes
|
|
file.
|
|
b - Updated torque.spec.in to be able to handle the snapshot
|
|
names of builds.
|
|
b - Merged revisions 4555, 4556 and 4557 from 2.5-fixes branch. These revisions fix problems in
|
|
High availability mode and also a problem where the MOM was not releasing the lock on
|
|
mom.lock on exit.
|
|
b - fix pbs_mom -q to work with parallel jobs (backported from 3.0.1)
|
|
b - fixed a bug in set_resources that prevented the last resource in a list from being
|
|
checked. As a result the last item in the list would always be added
|
|
without regard to previous entries.
|
|
e - allow more than 5 concurrent connections to TORQUE using pbsD_connect. Increase it to 10
|
|
(backported from 3.0.1)
|
|
b - fix a segfault when receiving an obit for a job that no longer exists (backported from 3.0.1)
|
|
b - Fixed a problem with minimum sizes in queues. Minimum sizes were not getting enforced because
|
|
the logic checking the queue against the user request used an && when it needed a || in the
|
|
comparison.
|
|
c - fix a segfault when queue has acl_group_enable and acl_group_sloppy set
|
|
true and no acl_groups are defined. (backported from 3.0.1)
|
|
e - To fix Bugzilla Bug 121 I created a thread in job_purge on the mom in the file src/resmom/job_func.c
|
|
All job purging now happens on its own thread. If any of the system calls fail to return
|
|
the thread will hang but the MOM will still be able to process work.
|
|
e - Updated Makefile.in, configure, etc. to reflect change in configure.ac to add
|
|
libpthread to the build. This was done for the fix for Bugzilla Bug 121.
|
|
2.4.12
|
|
b - Bugzilla bug 84. Security bug on the way checkpoint is being handled.
|
|
(Robin R. - Miami Univ. of Ohio, back-ported from 2.5.3)
|
|
b - make the tcp reading buffer able to grow dynamically to read larger
|
|
values in order to avoid "invalid protocol" messages (backported from
|
|
2.5.3)
|
|
b - could not set the server HA parameters lock_file_update_time and
|
|
lock_file_check_time previously. Fixed. (backported from 2.5.3)
|
|
e - qpeek now has the options --ssh, --rsh, --spool, --host, -o, and
|
|
-e. Can now output both the STDOUT and STDERR files. Eliminated
|
|
numlines, which didn't work. (backported from 2.5.3)
|
|
b - Modified the pbs_server startup routine to skip unknown hosts in the
|
|
nodes file instead of terminating the server startup.
|
|
b - fix to prevent a possible segfault when using checkpointing (back-ported
|
|
from 2.5.3).
|
|
b - fix to cleanup job files on mom after a BLCR job is checkpointed and held
|
|
(back-ported from 2.5.3)
|
|
c - when using job_force_cancel_time, fix a crash in rare cases
|
|
(backported from 2.5.4)
|
|
b - fix a potential memory corruption for walltime remaining for jobs
|
|
(Vikentsi Lapa, backported from 2.5.4)
|
|
b - fix potential buffer overrun in pbs_sched (Bugzilla #98, patch from
|
|
Stephen Usher @ University of Oxford, backported from 2.5.4)
|
|
e - check if a process still exists before killing it and sleeping. This speeds up
|
|
the time for killing a task exponentially, although this will show mostly for
|
|
SMP/NUMA systems, but it will help everywhere. (backported from 2.5.4)
|
|
(Dr. Bernd Kallies)
|
|
e - Refactored torque spec file to comply with established RPM best
|
|
practices, including the following:
|
|
- Standard installation locations based on RPM macro configuration
|
|
(e.g., %{_prefix})
|
|
- Latest upstream RPM conditional build semantics with fallbacks for
|
|
older versions of RPM (e.g., RHEL4)
|
|
- Initial set of optional features (GUI, PAM, syslog, SCP) with more
|
|
planned
|
|
- Basic working configuration automatically generated at install-time
|
|
- Reduce the number of unnecessary subpackages by consolidating where
|
|
it makes sense and using existing RPM features (e.g.,
|
|
--excludedocs).
|
|
b - Merged revision 4325 from 2.5-fixes. Fixed a problem where the -m n
|
|
(request no mail on qsub) was not always being recognized.
|
|
b - Fix for requeue failures on mom. Forked pbs_mom would silently segfault and
|
|
job was left in Exiting state. (backported from 2.5.4)
|
|
b - prevent the nodes file from being overwritten when running make packages
|
|
b - change so "mom_checkpoint_job_has_checkpoint" and "execing command" log
|
|
messages do not always get logged (back-ported from 2.5.4)
|
|
b - remove the use of a GNU specific function. (back-ported from 2.5.5)
|
|
|
|
|
|
2.4.11
|
|
b - changed type cast for calloc of ioenv from sizeof(char) to sizeof(char *)
|
|
in pbsdsh.c. This fixes bug 79.
|
|
b - Added patch to fix bug 76, "blocking read does not time out using
|
|
signal handler".
|
|
b - Modified the pbs_server startup routine to skip unknown hosts in the
|
|
nodes file instead of terminating the server startup.
|
|
|
|
2.4.10
|
|
b - fix for bug 61. The fix takes care of a problem where pbs_mom under
|
|
some situations will change the mode and permissions of /dev/null.
|
|
|
|
2.4.9
|
|
b - Bugzilla bug 57. Check return value of malloc for tracejob for Linux
|
|
(Chris Samuel - Univ. of Melbourne)
|
|
b - fix so "gres" config gets displayed by pbsnodes
|
|
b - use QSUBHOST as the default host for output files when no host is
|
|
specified. (RT 7678)
|
|
e - allow users to use cpusets and geometry requests at the same time by
|
|
specifying both at configure time.
|
|
b - Bugzilla bug 55. Check return value of malloc for pbs_mom for Linux
|
|
(Chris Samuel - Univ. of Melbourne)
|
|
e - added server parameter job_force_cancel_time. When configured to X
|
|
seconds, a job that is still there X seconds after a qdel will be
|
|
purged. Useful for freeing nodes from a job when one node goes down
|
|
midjob.
|
|
b - fixed gcc warnings reported by Skip Montanaro
|
|
e - added RPT_BAVAIL define that allows pbs_mom to report f_bavail instead of
|
|
f_bfree on Linux systems
|
|
b - no longer consider -t and -T the same in qsub
|
|
e - make PBS_O_WORKDIR accessible in the environment for prolog scripts
|
|
e - Bugzilla 59. Applied patch to allow '=' for qdel -m.
|
|
(Chris Samuel - Univ. of Melbourne)
|
|
b - properly escape characters (&"'<>) in XML output
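
Escaping the five characters named in this entry can be sketched as follows. This is an illustration in Python using the standard library's xml.sax.saxutils.escape, not the C code TORQUE uses; escape_xml_text is a hypothetical wrapper name.

```python
from xml.sax.saxutils import escape


def escape_xml_text(value):
    """Escape & < > and both quote characters for XML output."""
    # escape() handles &, < and > itself; the quotes are passed as extras.
    return escape(value, {'"': "&quot;", "'": "&apos;"})
```

The ampersand is replaced first, so the entities produced for the other characters are not escaped a second time.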
|
|
b - ignore port when checking host in svr_get_privilege()
|
|
b - restore ability to parse -W x=geometry:{...,...}
|
|
e - from Simon Toth: If no available amount is specified for a resource
|
|
and the max limit is set, the requirement should be checked against
|
|
the maximum only (for scheduler, bugzilla 23).
|
|
b - check return values from fwrite in cpuset.c to avoid warnings
|
|
e - expand acl host checking to allow * in the middle of hostnames, not
|
|
just at the beginning. Also allow ranges like a[10-15] to mean a10,
|
|
a11, ..., a15.
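
The wildcard and range matching described in this entry can be sketched as follows. This is an illustration in Python, not TORQUE's ACL code; host_matches is a hypothetical name, the host names in the usage note are made up, and the sketch handles at most one [lo-hi] range per pattern and compares case-sensitively.

```python
import fnmatch
import re


def host_matches(pattern, host):
    """Match host against an ACL pattern with '*' and one '[lo-hi]' range."""
    numeric_range = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if numeric_range is None:
        # '*' may appear anywhere in the pattern, not only at the start.
        return fnmatch.fnmatchcase(host, pattern)
    low, high = int(numeric_range.group(1)), int(numeric_range.group(2))
    head = pattern[:numeric_range.start()]
    tail = pattern[numeric_range.end():]
    # Try each number in the range in place of the bracket expression.
    return any(fnmatch.fnmatchcase(host, f"{head}{number}{tail}")
               for number in range(low, high + 1))
```

With this sketch, "a[10-15]" accepts a12 but rejects a16, and "login*-ib.cluster" accepts login3-ib.cluster.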
|
|
|
|
2.4.8
|
|
e - Bugzilla bug 22. HIGH_PRECISION_FAIRSHARE for fifo scheduling.
|
|
c - no longer sigabrt with "running" jobs not in an execution queue. log
|
|
an error.
|
|
c - fixed segfault for when TORQUE thinks there's a nanny but there isn't
|
|
e - mapped 'qsub -P user:group' to qsub -P user -W group_list=group
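
The mapping in this entry amounts to splitting the -P value at the colon. The sketch below, in Python, shows the equivalent argument list; split_proxy_user is a hypothetical name and this is not qsub's own option handling.

```python
def split_proxy_user(value):
    """Turn a '-P user:group' value into the equivalent qsub arguments."""
    user, _, group = value.partition(":")
    arguments = ["-P", user]
    if group:
        # The group part becomes a group_list attribute passed with -W.
        arguments += ["-W", f"group_list={group}"]
    return arguments
```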
|
|
b - reverted to old behavior where interactive scripts are checked for
|
|
directives and not run without a parameter.
|
|
e - setting a queue's resource_max.nodes now actually restricts things,
|
|
although so far it only limits based on the number of nodes (i.e. not
|
|
ppn)
|
|
f - added QSUBSENDGROUPLIST to qsub. This allows the server to know the
|
|
correct group name when disable_server_id_check is set to true and
|
|
the user doesn't exist on the server.
|
|
e - Bugzilla bug 54. Patch submitted by Bas van der Vlies to make pbs_mkdirs
|
|
more robust, provide a help function and new option -C <chk_tree_location>
|
|
|
|
2.4.7
|
|
b - fixed a bug for when a resource_list has been set, but isn't completely
|
|
initialized, causing a segfault
|
|
b - stop counting down walltime remaining after a job is completed
|
|
b - correctly display the number for tasks as used in TORQUE in qstat -a output
|
|
b - no longer ignoring fread return values in linux cpuset code (gcc 4.3.3)
|
|
b - fixed a bug where job was added to obit retry list multiple times, causing
|
|
a segfault
|
|
b - Fix for Bugzilla bug 43. "configure ignores with-modulefiles=no"
|
|
b - no longer try to decide when to start with -t create in init.d scripts,
|
|
-t creates should be done manually by the user
|
|
f - added -P to qsub. When submitting a job as root, the root user may add -P
|
|
<username> to submit the job as the proxy user specified by <username>
|
|
|
|
2.4.6
|
|
f - added an asynchronous option for qsig, specified with -a.
|
|
b - fix to cleanup job that is left in running state after mom restart
|
|
f - added two server parameters: display_job_server_suffix and job_suffix_alias.
|
|
The first defaults to true and is whether or not jobs should be appended
|
|
by .server_name. The second defaults to NULL, but if it is defined it
|
|
will be appended at the end of the jobid, i.e. jobid.job_suffix_alias.
|
|
f - added -l option to qstat so that it will display a server name and an
|
|
alias if both are used. If these aren't used, -l has no effect.
|
|
e - qstat -f now includes an extra field "Walltime Remaining" that tells
|
|
the remaining walltime in seconds. This field does not account for
|
|
weighted walltime.
|
|
b - fixed open_std_file to setegid as well, this caused a problem with
|
|
epilogue.user scripts.
|
|
e - qsub's -W can now parse attributes with quoted lists, for example:
|
|
qsub script -W attr="foo,foo1,foo2,foo3" will set foo,foo1,foo2,foo3
|
|
as attr's value.
|
|
b - split Cray job library and CSA functionality since CSA is dependent on job
|
|
library but job library is not dependent on CSA
|
|
|
|
2.4.5
|
|
b - epilogue.user scripts were being run with prologue arguments. Fixed
|
|
bug in run_pelog() to include PE_EPILOGUSER so epilogue arguments get
|
|
passed to epilogue.user script.
|
|
b - Ticket 6665. pbs_mom and job recovery. Fixed a bug where the -q option
|
|
would terminate running processes as well as requeue jobs. This made the
|
|
-q option the same as the -r option for pbs_mom. -q will now only requeue
|
|
jobs and will not attempt to kill running processes. I also added a -P
|
|
option to start pbs_mom. This is similar to the -p option except the -P
|
|
option will only delete any left over jobs from the queue and will not
|
|
attempt to adopt any running processes.
|
|
e - Modified man page for pbs_mom. Added new -P option plus edited -p, -q
|
|
and -r options to hopefully make them more understandable.
|
|
n - 01/15/2010 created snapshot torque-2.4.5-snap201001151416.tar.gz.
|
|
b - now checks secondary groups (as well as primary) for creating a file
|
|
when spooling. Before it wouldn't create the spool file if a user had
|
|
permission through a secondary group.
|
|
n - 01/18/2010. Items above this point merged into trunk.
|
|
b - fixed a file descriptor error with high availability. Before it was possible
|
|
to try to regain a file descriptor which was never held, now this is fixed.
|
|
b - No longer overwrites the user's environment when spoolasfinalname is set.
|
|
Now the environment is handled correctly.
|
|
b - No longer will segfault if pbs_mom restarts in a bad state (user environment
|
|
not initialized)
|
|
e - Changing MAXNOTDEFAULT behavior. Now, by default, max is not default and max
|
|
can be configured as default with --enable-maxdefault.
|
|
|
|
2.4.4
|
|
b - fixed contrib/init.d/pbs_mom so that it doesn't overwrite $args defined in
|
|
/etc/sysconfig/pbs_mom
|
|
b - when spool_as_final_name is configured for the mom, no longer send email
|
|
messages about not being able to copy the spool file
|
|
b - when spool_as_final_name is configured for the mom, correctly substitute
|
|
job environment variables
|
|
f - added logging for email events, allows the admin to check if emails are
|
|
being sent correctly
|
|
b - Made a fix to svr_get_privilege(). On some architectures a non-root user
|
|
name would be set to null after the line " host_no_port[num_host_chars] = 0;"
|
|
because num_host_chars was = 1024 which was the size of host_no_port.
|
|
The null termination needed to happen at 1023. There were other problems
|
|
with this function so code was added to validate the incoming
|
|
variables before they were used. The symptom of this bug was that non-root
|
|
managers and operators could not perform operations where they should
|
|
have had rights.
|
|
b - Missed a format statement in an sprintf statement for the bug fix above.
|
|
b - Fixed a way that a file descriptor (for the server lockfile) could be used without
|
|
initialization. RT 6756
|
|
|
|
2.4.3
|
|
b - fix PBSD_authenticate so it correctly splits PATH with : instead of ;
|
|
(bugzilla #33)
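
Illustration only (not TORQUE source): a short Python sketch of a PATH lookup that splits on ':' as POSIX requires. The helper names split_search_path and find_in_path are hypothetical. Splitting on ';' instead would leave a normal PATH as one long, nonexistent directory name, so nothing would ever be found.

```python
import os


def split_search_path(path):
    """Split a POSIX search path on ':'; an empty entry means the current directory."""
    return [entry or "." for entry in path.split(":")]


def find_in_path(program, path):
    """Return the first executable named `program` found on `path`, or None."""
    for directory in split_search_path(path):
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```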
|
|
b - pbs_mom now sets resource limits for tasks started with tm_spawn (Chris
|
|
Samuel, VPAC)
|
|
c - fix assumption about size of unsocname.sun_path in Libnet/net_server.c
|
|
b - Fix for Bugzilla bug 34. "torque 2.4.X breaks OSC's mpiexec". Fix in
|
|
src/server/stat_job.c revision 3268.
|
|
b - Fix for Bugzilla bug 35 - printing the wrong pid (normal mode) and not
|
|
printing any pid for high availability mode.
|
|
f - added a diagnostic script (contrib/diag/tdiag.sh). This script grabs
|
|
the log files for the server and the mom, records the output of qmgr
|
|
-c 'p s' and the nodefile, and creates a tarfile containing these.
|
|
b - Changed momctl -s to use exit(EXIT_FAILURE) instead of return(-1) if
|
|
a mom is not running.
|
|
b - Fix for Bugzilla bug 36. "qsub crashes with long dependency list".
|
|
b - Fix for Bugzilla bug 41. "tracejob creates a file in the local directory".
|
|
|
|
2.4.2
|
|
b - Changed predicate in pbsd_main.c for the two locations where
|
|
daemonize_server is called to check for the value of high_availability_mode
|
|
to determine when to put the server process in the background.
|
|
b - Added pbs_error_db.h to src/include/Makefile.am and src/include/Makefile.in.
|
|
pbs_error_db.h now needed for install.
|
|
e - Modified pbs_get_server_list so the $TORQUE_HOME/server_name file will work with
|
|
a comma delimited string or a list of server names separated by a new line.
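
Illustration only (not TORQUE source): a minimal Python sketch of reading a server list that may be comma delimited, newline separated, or both. The function name parse_server_list and the host names in the test are hypothetical; trimming whitespace and skipping empty entries are assumptions of this sketch.

```python
def parse_server_list(text):
    """Return server names from text that uses commas and/or newlines as separators."""
    servers = []
    for line in text.splitlines():
        for name in line.split(","):
            name = name.strip()
            if name:  # ignore blank lines and stray trailing commas
                servers.append(name)
    return servers
```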
|
|
b - fix tracejob so it handles multiple server and mom logs for the same day
|
|
f - Added a new server parameter np_default. This allows the administrator to
|
|
change the number of processors to a unified value dynamically for the
|
|
entire cluster.
|
|
e - high availability enhanced so that the server spawns a separate thread to
|
|
update the "lock" on the lockfile. Thread update and check time are both
|
|
settable parameters in qmgr.
|
|
b - close empty ACL files
|
|
|
|
2.4.1
|
|
e - added a prologue and epilogue option to the list of resources for qsub -l
|
|
which allows a per job prologue or epilogue script. The syntax for
|
|
the new option is qsub -l prologue=<prologue script>,
|
|
epilogue=<epilogue script>
|
|
f - added a "-w" option to qsub to override the working directory
|
|
e - changes needed to allow relocatable checkpoint jobs. Job checkpoint files
|
|
are now under the control of the server.
|
|
c - check filename for NULL to prevent crash
|
|
b - changed so we don't try to copy a local file when the destination is a
|
|
directory and the file is already in that directory
|
|
f - changes to allow TORQUE to operate without pbs_iff (merged from 2.3)
|
|
e - made logging functions reentrant by using localtime_r() instead of
|
|
localtime() (merged from 2.3)
|
|
e - Merged in more logging and NOSIGCHLDMOM capability from Yahoo branch
|
|
e - merged in new log_ext() function to allow more fine grained syslog events,
|
|
you can now specify severity level. Also added more logging statements
|
|
b - fixed a bug where CPU time was not being added up properly in all cases
|
|
(fix for Linux only)
|
|
c - fixed a few memory errors due to some uninitialized memory being allocated
|
|
(ported from 2.3 R2493)
|
|
e - added code to allow compilers to override CLONE_BATCH_SIZE at configure
|
|
time (allows for finer grained control on how arrays are created) (ported
|
|
from Yahoo R2461)
|
|
e - added code which prefixes the severity tag on all log_ext() and log_err()
|
|
messages (ported from Yahoo R2358)
|
|
f - added code from 2.3-extreme that allows TORQUE to handle more than 1024 sockets.
|
|
Also, increased the size of TORQUE's internal socket handle table to avoid
|
|
running out of handles under busy conditions.
|
|
e - TORQUE can now handle server names larger than 64 bytes (now set to 1024,
|
|
which should be larger than the max for hostnames)
|
|
e - added qmgr option accounting_keep_days, specifies how long to keep
|
|
accounting files.
|
|
e - changed mom config varattr so invoked script returns the varattr name
|
|
and value(s)
|
|
e - improved the performance of pbs_server when submitting large numbers of
|
|
jobs with dependencies defined
|
|
e - added new parameter "log_keep_days" to both pbs_server and pbs_mom.
|
|
Specifies how long to keep log files before they are automatically removed
|
|
e - added qmgr server attribute lock_file, specifies where server lock file
|
|
is located
|
|
b - change so we use default file name for output / error file when just a
|
|
directory is specified on qsub / qalter -e -o options
|
|
e - modified to allow retention of completed jobs across server shutdown
|
|
e - added job_must_report qmgr configuration which says the job must be
|
|
reported to scheduler. Added job attribute "reported". Added PURGECOMP
|
|
functionality which allows scheduler to confirm jobs are reported. Also
|
|
added -c option to qdel. Used to clean up unreported jobs.
|
|
b - Fix so interactive jobs run when using $job_output_file_umask userdefault
|
|
f - Allow adding extra End accounting record for a running job that is rerun.
|
|
Provides usage data. Enabled by CFLAGS=-DRERUNUSAGE.
|
|
b - Fix to use queue/server resources_defaults to validate mppnodect against
|
|
resources_max when mppwidth or mppnppn are not specified for job
|
|
f - merged in new dynamic array struct and functions to implement a new (and
|
|
more efficient) way of loading jobs at startup--should help by 2 orders of
|
|
magnitude!
|
|
f - changed TORQUE_MAXCONNECTTIMEOUT to be a global variable that is now
|
|
changed by the MOM to be smaller than the pbs_server and is also
|
|
configurable on the MOM ($max_conn_timeout_micro_sec)
|
|
e - change so queued jobs that get deleted go to complete and get displayed
|
|
in qstat based on keep_completed
|
|
b - Changes to improve the qstat -x XML output and documentation
|
|
b - Change so BATCH_PARTITION_ID does not pass through to child jobs
|
|
c - fix to prevent segfault on pbs_server -t cold
|
|
b - fix so find_resc_entry still works after setting server extra_resc
|
|
c - keep pbs_server from trying to free empty attrlist after receiving
|
|
bad request (Michael Meier, University of Erlangen-Nurnberg) (merged from
|
|
2.3.8)
|
|
f - new fifo scheduler config option. ignore_queue: queue_name
|
|
allows the scheduler to be instructed to ignore up to 16 queues on the server
|
|
(Simon Toth, CESNET z.s.p.o.)
|
|
e - add administrator customizable email notifications (see manpage for
|
|
pbs_server_attributes) - (Roland Haas, Georgia Tech)
|
|
e - moving jobs can now trigger a scheduling iteration (merged from 2.3.8)
|
|
e - created a utility module that is shared between both server and mom but
|
|
does NOT get placed in the libtorque library
|
|
e - allow the user to request a specific processor geometry for their job using
|
|
a bitmap, and then bind their jobs to those processors using cpusets.
|
|
b - fix how qsub sets PBS_O_HOST and PBS_SERVER (Eirikur Hjartarson, deCODE
|
|
genetics) (merged from 2.3.8)
|
|
b - fix to prevent some jobs from getting deleted on startup.
|
|
f - add qpool.gz to contrib directory
|
|
e - improve how error constants and text messages are represented (Simon Toth,
|
|
CESNET z.s.p.o)
|
|
f - new boolean queue attribute "is_transit" that allows jobs to exceed
|
|
server resource limits (queue limits are respected). This allows routing
|
|
queues to route jobs that would be rejected for exceeding local resources
|
|
even when the job won't be run locally. (Simon Toth, CESNET z.s.p.o)
|
|
e - add support for "job_array" as a type for queue disallowed_types attribute
|
|
e - added pbs_mom config option ignmem to ignore mem/pmem limit enforcement
|
|
e - added pbs_mom config option igncput to ignore pcput limit enforcement
|
|
|
|
2.4.0
|
|
f - added a "-q" option to pbs_mom which does *not* perform the default -p
|
|
behavior
|
|
e - made "pbs_mom -p" the default option when starting pbs_mom
|
|
e - added -q to qalter to allow quicker response to modify requests
|
|
f - added basic qhold support for job arrays
|
|
b - clear out ji_destin in obit_reply
|
|
f - add qchkpt command
|
|
e - renamed job.h to pbs_job.h
|
|
b - fix logic error in checkpoint interval test
|
|
f - add RERUNNABLEBYDEFAULT parameter to torque.cfg. allows admin to
|
|
change the default value of the job rerunnable attribute from true
|
|
to false
|
|
e - added preliminary Comprehensive System Accounting (CSA) functionality for
|
|
Linux. Configure option --enable-csa will cause workload management
|
|
records to be written if CSA is installed and wkmg is turned on.
|
|
b - changes to allow post_checkpoint() to run when checkpoint is completed,
|
|
not when it has just started. Also corrected issue when checkpoint fails
|
|
while trying to put job on hold.
|
|
b - update server immediately with changed checkpoint name and time attributes
|
|
after successful checkpoint.
|
|
e - Changes so checkpoint jobs failing after restarted are put on hold or
|
|
requeued
|
|
e - Added checkpoint_restart_status job attribute used for restart status
|
|
b - Updated manpages for qsub and qterm to reflect changed checkpointing
|
|
options.
|
|
b - reject a qchkpt request if checkpointing is not enabled for the job
|
|
b - Mom should not send checkpoint name and time to server unless checkpoint
|
|
was successful
|
|
b - fix so that running jobs that have a hold type and that fail on checkpoint
|
|
restart get deleted when qdel is used
|
|
b - fix so we reset start_time, if needed, when restarting a checkpointed job
|
|
f - added experimental fault_tolerant job attribute (set to true by passing
|
|
-f to qsub). This attribute indicates that a job can survive the loss of
|
|
a sister mom. Also added corresponding fault_tolerant and
|
|
fault_intolerant types to the "disallowed_types" queue attribute
|
|
b - fixes for pbs_moms updating of comment and checkpoint name and time
|
|
e - change so we can reject hold requests on running jobs that do not have
|
|
checkpoint enabled if system was configured with --enable-blcr
|
|
e - change to qsub so only the host name can be specified on the -e/-o options
|
|
e - added -w option to qsub that allows setting of PBS_O_WORKDIR
|
|
|
|
2.3.8
|
|
c - keep pbs_server from trying to free empty attrlist after receiving
|
|
bad request (Michael Meier, University of Erlangen-Nurnberg)
|
|
e - moving jobs can now trigger a scheduling iteration
|
|
b - fix how qsub sets PBS_O_HOST and PBS_SERVER (Eirikur Hjartarson, deCODE
|
|
genetics)
|
|
f - add qpool.gz to contrib directory
|
|
b - fix return value of cpuset_delete() for Linux (Chris Samuel - VPAC)
|
|
e - Set PBS_MAXUSER to 32 from 16 in order to accommodate systems that
|
|
use a 32 bit user name. (Ken Nielson, Cluster Resources)
|
|
c - modified acct_job in server/accounting.c to dynamically allocate memory
|
|
to accommodate strings larger than PBS_ACCT_MAX_RCD. (Ken Nielson Cluster
|
|
Resources)
|
|
e - allow the user to turn off credential lifetimes so they don't have to lose
|
|
iterations while credentials are renewed.
|
|
e - added OS independent resending of failed job obits (from D Beer), also
|
|
removed OS specific CACHEOBITFAILURES code.
|
|
b - fix so after* dependencies are handled correctly for exiting / completed
|
|
jobs
|
|
|
|
|
|
2.3.7
|
|
b - fixed a bug where UNIX domain socket communication was failing when
|
|
"--disable-privports" was used.
|
|
e - add job exit status as 10th argument to the epilogue script
|
|
b - fix truncated output in qmgr (peter h IPSec+jan n NANCO)
|
|
b - change so set_jobexid() gets called if JOB_ATR_egroup is not set
|
|
e - pbs_mom sisters can now tolerate an explicit group ID instead of only a
|
|
valid group name. This helps TORQUE be more robust to group lookup failures.
|
|
|
|
2.3.6
|
|
b - change back to not sending status updates until we get cluster addr
|
|
message from server, also only try to send hello when the server stream
|
|
is down.
|
|
b - change pbs_server so log_file_max_size of zero behavior matches documentation
|
|
e - added periodic logging of version and loglevel to help in support
|
|
e - added pbs_mom config option ignvmem to ignore vmem/pvmem limit enforcement
|
|
b - change to correct strtoks that accidentally got changed in astyle
|
|
formatting
|
|
e - in Linux, a pbs_mom will now "kill" a job's task, even if that task can no
|
|
longer be found in the OS process table. This prevents jobs from getting
|
|
"stuck" when the PID vanishes in some rare cases.
|
|
|
|
2.3.5
|
|
e - added new init.d scripts for Debian/Ubuntu systems
|
|
b - fixed a bug where TORQUE's exponential backoff for sending messages to the
|
|
MOM could overflow
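
Illustration only (not TORQUE source): a Python sketch of a capped exponential backoff. The names backoff_delay and to_int32 and the base/cap values are hypothetical. Python integers do not overflow, so to_int32 merely simulates what happens when a doubling delay is stored in a 32-bit signed integer, and the clamp shows one common way to avoid it.

```python
def to_int32(value):
    """Simulate storing `value` in a 32-bit signed integer (two's complement wrap)."""
    return ((value + 2**31) % 2**32) - 2**31


def backoff_delay(attempt, base=1, cap=300):
    """Delay in seconds for a retry: base * 2**attempt, never above `cap`."""
    # Clamp the exponent first so the intermediate product stays small
    # even in a fixed-width implementation.
    exponent = min(max(attempt, 0), 30)
    return min(base * (1 << exponent), cap)
```

Without a cap, the 31st doubling of a 1-second delay wraps to a negative number in a 32-bit signed integer; with the cap, the delay simply levels off.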
|
|
|
|
2.3.4
|
|
c - fixed segfault when loading array files of an older/incompatible version
|
|
b - fixed a bug where if attempt to send job to a pbs_mom failed due to
|
|
timeout, the job would indefinitely remain in the 'R' state
|
|
b - qsub now properly interprets -W umask=0XXX as octal umask
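
Illustration only (not TORQUE source): a Python sketch of reading a umask string as octal. The function name parse_umask and the 0 to 0777 range check are assumptions of this sketch. Reading "022" as decimal would give 22 (octal 026), which masks different permission bits than intended.

```python
def parse_umask(text):
    """Interpret a umask string such as '022' or '0077' as an octal number."""
    value = int(text, 8)  # raises ValueError on non-octal digits such as '9'
    if not 0 <= value <= 0o777:
        raise ValueError("umask out of range: %r" % text)
    return value
```

For example, a file created with mode 0666 under umask 022 ends up with mode 0644.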
|
|
e - allow $HOME to be specified for path
|
|
e - added --disable-qsub-keep-override to allow the qsub -k flag to not
|
|
override -o -e.
|
|
e - updated with security patches for setuid, setgid, setgroups
|
|
b - fixed correct_ct() in svr_jobfunc.c so we don't crash if we hit COMPLETED
|
|
job
|
|
b - fixed problem where momctl -d 0 showed ConfigVersion twice
|
|
e - if a .JB file gets upgraded pbs_server will back up the original
|
|
b - removed qhold / qrls -h n option since there is no code to support it
|
|
b - set job state and substate correctly when job has a hold attribute and
|
|
is being rerun
|
|
b - fixed a bug preventing multiple TORQUE servers and TORQUE MOMs from
|
|
operating properly all from the same host
|
|
e - fixed several compiler errors and warnings for AIX 5.2 systems
|
|
b - fixed a bug with "max_report" where jobs not in the Q state were not always
|
|
being reported to scheduler
|
|
|
|
2.3.3
|
|
b - fixed bug where pbs_mom would sometimes not connect properly with
|
|
pbs_server after network failures
|
|
b - changed so run_pelog opens correct stdout/stderr when join is used
|
|
b - corrected pbs_server man page for SIGUSR1 and SIGUSR2
|
|
f - added new pbs_track command which may be used to launch an external
|
|
process and a pbs_mom will then track the resource usage of that process
|
|
and attach it to a specified job (experimental) (special thanks to David
|
|
Singleton and David Houlder from APAC)
|
|
e - added alternate method for sending cluster addresses to mom
|
|
(ALT_CLSTR_ADDR)
|
|
|
|
2.3.2
|
|
e - added --disable-posixmemlock to force mom not to use POSIX MEMLOCK.
|
|
b - fix potential buffer overrun in qsub
|
|
b - keep pbs_mom, pbs_server, pbs_sched from closing sockets opened by
|
|
nss_ldap (SGI)
|
|
e - added PBS_VERSION environment variable
|
|
e - added --enable-acct-x to allow adding of x attributes to accounting log
|
|
b - fix net_server.h build error
|
|
|
|
2.3.1
|
|
b - fixed a bug where torque would fail to start if there was no LF in nodes
|
|
file
|
|
b - fixed a bug where TORQUE would ignore the "pbs_asyrunjob" API extension
|
|
string when starting jobs in asynchronous mode
|
|
b - fixed memory leak in free_br for PBS_BATCH_MvJobFile case
|
|
e - torque can now compile on Linux and OS X with NDEBUG defined
|
|
f - when using qsub it is now possible to specify both -k and -o/-e
|
|
(before -o/-e did not behave as expected if -k was also used)
|
|
e - changed pbs_server to have "-l" option. Specifies a host/port that event
|
|
messages will be sent to. Event messages are the same as what the
|
|
scheduler currently receives.
|
|
e - added --enable-autorun to allow qsub jobs to automatically try to run
|
|
if there are any nodes available.
|
|
e - added --enable-quickcommit to allow qsub to combine the ready to commit
|
|
and commit phases into 1 network transmission.
|
|
e - added --enable-nochildsignal to allow pbs_server to use inline checking
|
|
for SIGCHLD instead of using the signal handler.
|
|
e - change qsub so '-v var=' will look in environment for value. If value
|
|
is not found set it to "".
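
Illustration only (not TORQUE source): a Python sketch of resolving a variable list in which an empty value ('var=') falls back to the submitting environment and then to the empty string. The function name resolve_variable_list is hypothetical, and treating a bare 'var' (no '=') the same way is an assumption of this sketch.

```python
def resolve_variable_list(spec, environ):
    """Resolve 'A=1,B=' style lists; empty values are looked up in `environ`."""
    resolved = {}
    for item in spec.split(","):
        name, _, value = item.partition("=")
        name = name.strip()
        if not name:
            continue
        if value:
            resolved[name] = value
        else:
            # 'var=' (or bare 'var'): take the value from the environment,
            # falling back to the empty string when it is not set there.
            resolved[name] = environ.get(name, "")
    return resolved
```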
|
|
b - fix qdel of entire job arrays for non operator/managers
|
|
b - fix so we continue to process exiting jobs for other servers
|
|
e - added source_login_batch and source_login_interactive to mom config. This
|
|
allows us to bypass the sourcing of /etc/profile, etc. type files.
|
|
b - fixed pbs_server segmentation fault when job_array submissions are
|
|
rejected before ji_arraystruct was initialized
|
|
e - add some casts to fix some compiler warnings with gcc-4.1 on i386 when
|
|
-D_FILE_OFFSET_BITS=64 is set
|
|
e - added --enable-maxnotdefault to allow not using resources_max as defaults.
|
|
b - added new values to TJobAttr so we don't have mismatch with job.h values.
|
|
b - reset ji_momhandle so we cannot have more than one pjob for obit_reply to
|
|
find.
|
|
e - change qdel to accept 'ALL' as well as 'all'
|
|
b - changed order of searching so we find most recent jobs first. Prevents
|
|
finding old leftover job when pids rollover. Also some CACHEOBITFAILURES
|
|
updates.
|
|
b - handle case where mom replies with an unknown job error to a stat request
|
|
from the server
|
|
b - allow qalter to modify HELD jobs if BLCR is not enabled
|
|
b - change to update errpath/outpath attributes when -e -o are used with qsub
|
|
e - added string output for errnos, etc.
|
|
|
|
2.3.0
|
|
b - fixed a bug where TORQUE would ignore the "pbs_asyrunjob" API extension
|
|
string when starting jobs in asynchronous mode
|
|
e - redesign how torque.spec is built
|
|
e - added -a to qrun to allow asynchronous job start
|
|
e - allow qrerun on completed jobs
|
|
e - allow qdel to delete all jobs
|
|
e - make qdel -m functionality match the documentation
|
|
b - prevent runaway hellos being sent to server when mom's node is removed
|
|
from the server's node list
|
|
e - local client connections use a unix domain socket, bypassing inet and
|
|
pbs_iff
|
|
f - Linux 2.6 cpuset support (in development)
|
|
e - new job array submission syntax
|
|
b - fixed SIGUSR1 / SIGUSR2 to correctly change the log level
|
|
f - health check script can now be run at job start and end
|
|
e - tm tasks are now stored in a single .TK file rather than eat lots of
|
|
inodes
|
|
f - new "extra_resc" server attribute
|
|
b - "pbs_version" attr is now correctly read-only
|
|
e - increase max size of .JB and .SC file names
|
|
e - new "sched_version" server attribute
|
|
f - new printserverdb tool
|
|
e - pbs_server/pbs_mom hostname arg is now -H, -h is help
|
|
e - added $umask to pbs_mom config, used for generated output files.
|
|
e - minor pbsnodes overhaul
|
|
b - fixed memory leak in pbs_server
|
|
|
|
2.2.2
|
|
b - correctly parse /proc/pid/stat that contains parens (Meier)
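
Illustration only (not TORQUE source): a Python sketch of parsing a Linux /proc/<pid>/stat line whose command name contains parentheses or spaces. The function name parse_proc_stat is hypothetical. The command field is delimited by the first '(' and the last ')', so splitting the whole line on whitespace is not safe.

```python
def parse_proc_stat(line):
    """Return (pid, comm, state, remaining_fields) from a /proc/<pid>/stat line."""
    open_paren = line.index("(")
    close_paren = line.rindex(")")  # last ')' ends comm, even if comm contains ')'
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    rest = line[close_paren + 1:].split()
    return pid, comm, rest[0], rest[1:]
```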
|
|
b - prevent runaway hellos being sent to server when mom's node is removed
|
|
from the server's node list
|
|
b - fix qdel of entire job arrays for non operator/managers
|
|
b - fix problem where job array .AR files are not saved to disk
|
|
b - fixed problem with tracking job memory usage on OS X
|
|
b - pbs_server doesn't try to "upgrade" .JB files if they have a newer
|
|
version of the job_qs struct
|
|
|
|
2.2.1
|
|
b - fix a bug where dependent jobs get put on hold when the previous job has
|
|
completed but its state is still available for life of keep_completed
|
|
b - fixed a bug where pbs_server never deleted files from the "jobs" directory
|
|
b - fixed a bug where compute nodes were being put in an indefinite "down"
|
|
state
|
|
e - added job_array_size attribute to pbs_submit documentation
|
|
|
|
2.2.0
|
|
e - improve RPP logging for corruption issues
|
|
f - dynamic resources
|
|
e - use mlockall() in pbs_mom if _POSIX_MEMLOCK
|
|
f - consumable resource "tokens" support (Harte-Hanks)
|
|
e - build process sets default submit filter path to ${libexecdir}/qsub_filter
|
|
we fall back to /usr/local/sbin/torque_submitfilter to maintain
|
|
compatibility
|
|
e - allow long job names when not using -N
|
|
f - new MOM $varattr config
|
|
e - daemons are no longer installed 700
|
|
e - tighten directory path checks
|
|
f - new mom configs: $auto_ideal_load and $auto_max_load
|
|
e - pbs_mom on Darwin (OS X) no longer depends on libkvm (now works on all
|
|
versions without need to re-enable /dev/kmem on newer PPC or all x86
|
|
versions)
|
|
e - added PBS_SERVER env variable for job scripts
|
|
e - add --about support to daemons and client commands
|
|
f - added qsub -t (primitive job array)
|
|
e - add PBS_RESOURCE_GRES to prolog/epilog environment
|
|
e - add -h hostname to pbs_mom (NCIFCRF)
|
|
e - filesec enhancements (StockholmU)
|
|
e - added ERS and IDS documentation
|
|
e - allow export of specific variables into prolog/epilog environment
|
|
b - change fclose to pclose to close submit filter pipe (ABCC)
|
|
e - add support for Cray XT size and larger qstat task reporting (ORNL)
|
|
b - pbs_demux is now built with pbs_mom instead of with clients
|
|
e - epilogue will only run if job is still valid on exec node
|
|
e - add qnodes, qnoded, qserverd, and qschedd symlinks
|
|
e - enable DEFAULTCKPT torque.cfg parameter
|
|
e - allow compute host and submit host suffix with nodefile_suffix
|
|
f - add --with-modulefiles=[DIR] support
|
|
b - be more careful about broken tclx installs
|
|
|
|
2.1.11
|
|
b - nqs2pbs is now a generated script
|
|
b - correct handling of priv job attr
|
|
b - change font selectors in manpages to bold
|
|
b - on pbs_server startup, don't skip job-exclusive nodes on initial MOM scan
|
|
b - pbs_server should not connect to "down" MOMs for any job operation
|
|
b - use alarm() around writing to job's stdio in case it happens to be a stopped tty
|
|
|
|
2.1.10
|
|
b - fix buffer overflow in rm_request,
|
|
fix 2 printf that should be sprintf (Umea University)
|
|
b - correct updating trusted client list (Yahoo)
|
|
b - Catch newlines in log messages, split messages text (Eygene Ryabinkin)
|
|
e - pbs_mom remote reconfig now disabled by default
|
|
use $remote_reconfig to enable it
|
|
b - fix pam configure (Adrian Knoth)
|
|
b - handle /dev/null correctly when job rerun
|
|
|
|
2.1.9
|
|
f - new queue attribute disallowed_types, currently recognized types:
|
|
interactive, batch, rerunable, and nonrerunable
|
|
e - refine "node note" feature with pbsnodes -N
|
|
e - bypass pbs_server's uid 0 check on cygwin
|
|
e - update suse initscripts
|
|
b - fix mom memory locking
|
|
b - fix sum buffer length checks in pbs_mom
|
|
b - fix memory leak in fifo scheduler
|
|
b - fix nonstandard usage of 'tail' in tpackage
|
|
b - fix aliasing error with brp_txtlen
|
|
f - allow manager to set "next job number" via hidden qmgr attribute
|
|
next_job_number
|
|
|
|
2.1.8
|
|
b - stop possible memory corruption with an invalid request type (StockholmU)
|
|
b - add node name to pbsnodes XML output (NCIFCRF)
|
|
b - correct Resource_list in qstat XML output (NCIFCRF)
|
|
b - pam_authuser fixes from uam.es
|
|
e - allow 'pbsnodes -l' to work with a node spec
|
|
b - clear exec_host and session_id on job requeue
|
|
b - fix mom child segfault when a user env var has a '%'
|
|
b - correct buggy logging in chk_job_request() (StockholmU)
|
|
e - pbs_mom shouldn't require server_name file unless it is
|
|
actually going to be read (StockholmU)
|
|
f - "node notes" with pbsnodes -n (sandia)
|
|
|
|
2.1.7
|
|
b - fix bison syntax error in Parser.y
|
|
b - fix 2.1.4 regression with spool file group owner on freebsd
|
|
b - don't exit if mlockall sets errno ENOSYS
|
|
f - qalter -v variable_list
|
|
f - MOMSLEEPTIME env delays pbs_mom initialization
|
|
e - minor log message fixups
|
|
e - enable node-reuse in qsub eval if server resources_available.nodect is set
|
|
e - pbs_mom and pbs_server can now use PBS_MOM_SERVER_PORT,
|
|
PBS_BATCH_SERVICE_PORT, and PBS_MANAGER_SERVICE_PORT env vars.
|
|
e - pbs_server can also use PBS_SCHEDULER_SERVICE_PORT env var.
|
|
e - add "other" resource to pelog's 5th argument
|
|
|
|
2.1.6
|
|
b - freebsd5 build fix
|
|
b - fix 2.1.4 regression with TM on single-node jobs
|
|
b - fix 2.1.4 regression with rerunning jobs
|
|
b - additional spool handling security fixes
|
|
|
|
2.1.5
|
|
b - fix 2.1.4 regression with -o/dev/null
|
|
|
|
2.1.4
|
|
b - fix cput job status
|
|
b - Fix "Spool Job Race condition"
|
|
|
|
2.1.3
|
|
|
|
b - correct run-time symbol in pam module on RHEL4
|
|
b - some minor hpux11 build fixes (PACCAR)
|
|
b - fix bug with log roll and automatic log filenames
|
|
b - compile error with size_fs() on digitalunix
|
|
e - pbs_server will now print build details with --about
|
|
e - new freebsd5 mom arch for Freebsd 5.x and 6.x (trasz)
|
|
e - optimize acl_group_sloppy
|
|
e - fix "list_head" symbol clash on Solaris 10
|
|
e - allow pam_pbssimpleauth to be built on OSX and Solaris
|
|
b - networking fixes for HPUX, fixes pbs_iff (PACCAR)
|
|
e - allow long job names when not using -N
|
|
c - using depend=syncwith crashed pbs_server
|
|
c - races with down nodes and purging jobs crashed pbs_server
|
|
b - staged out files will retain proper permission bits
|
|
f - may now specify umask to use while creating stderr and stdout spools
|
|
e.g. qsub -W umask=22
|
|
b - correct some fast startup behaviour
|
|
e - queue attribute max_queuable accounts for C jobs
|
|
|
|
2.1.2
|
|
|
|
b - fix momctl queries with multiple hosts
|
|
b - don't fail make install if --without-sched
|
|
b - correct MOM compile error with atol()
|
|
f - qsub will now retry connecting to pbs_server (see manpage)
|
|
f - X11 forwarding for single-node, interactive jobs with qsub -X
|
|
f - new pam_pbssimpleauth PAM module, requires --with-pam=DIR
|
|
e - add logging for node state adjustment
|
|
f - correctly track node state and allocation for suspended jobs
|
|
e - entries can always be deleted from manager ACL,
|
|
even if ACL contains host(s) that no longer exist
|
|
e - more informative error message when modifying manager ACL
|
|
f - all queue create, set, and unset operations now set a queue mtime
|
|
f - added support for log rolling to libtorque
|
|
f - pbs_server and pbs_mom have two new attributes
|
|
log_file_max_size, log_file_roll_depth
|
|
e - support installing client libs and cmds on unsupported OSes (like cygwin)
|
|
b - fix subnode allocation with pbs_sched
|
|
b - fix node allocation with suspend-resume
|
|
b - fix stale job-exclusive state when restarting pbs_server
|
|
b - don't fall over when duplicate subnodes are assigned after suspend-resume
|
|
b - handle suspended jobs correctly when restarting pbs_server
|
|
b - allow long host lists in runjob request
|
|
b - fix truncated XML output in qstat and pbsnodes
|
|
b - typo broke compile on irix6array and unicos8
|
|
e - momctl now skips down nodes when selecting by property
|
|
f - added submit_args job attribute
|
|
|
|
2.1.1
|
|
|
|
c - fix mom_sync_job code that crashes pbs_server (USC)
|
|
b - checking disk space in $PBS_SERVER_HOME was mistakenly disabled (USC)
|
|
e - node's np now accessible in qmgr (USC)
|
|
f - add ":ALL" as a special node selection when stat'ing nodes (USC)
|
|
f - momctl can now use :property node selection (USC)
|
|
f - send cluster addrs to all nodes when a node is created in qmgr (USC)
|
|
- new nodes are marked offline
|
|
- all nodes get new cluster ipaddr list
|
|
- new nodes are cleared of offline bit
|
|
f - set a node's np from the status' ncpus (only if ncpus > np) (USC)
|
|
- controlled by new server attribute "auto_node_np"
|
|
c - fix possible pbs_server crash when nodes are deleted in qmgr (USC)
|
|
e - avoid dup streams with nodes for quicker pbs_server startup (USC)
|
|
b - configure program prefix/suffix will now work correctly (USC)
|
|
b - handle shared libs in tpackages (USC)
|
|
f - qstat's -1 option can now be used with -f for easier parsing (USC)
|
|
b - fix broken TM on OSX (USC)
|
|
f - add "version" and "configversion" RM requests (USC)
|
|
b - in pbs-config --libs, don't print rpath if libdir is in the sys dlsearch
|
|
path (USC)
|
|
e - don't reject job submits if nodes are temporarily down (USC)
|
|
e - if MOM can't resolve $pbsserver at startup, try again later (USC)
|
|
- $pbsclient still suffers this problem
|
|
c - fix nd_addrs usage in bad_node_warning() after deleting nodes (MSIC)
|
|
b - enable build of xpbsmom on darwin systems (JAX)
|
|
e - run-time config of MOM's rcp cmd (see pbs_mom(8)) (USC)
|
|
e - momctl can now accept query strings with spaces, multiple -q opts (USC)
|
|
b - fix linking order for single-pass linkers like IRIX (ncifcrf)
|
|
b - fix mom compile on solaris with statfs (USC)
|
|
b - memory corruption on job exit causing cpu0 to be allocated more than once (USC)
|
|
e - add increased verbosity to tracejob and added '-q' commandline option
|
|
e - support larger values in qstat output (might break scripts!) (USC)
|
|
e - make 'qterm -t quick' shut down pbs_server faster (USC)

2.1.0p0

fixed job tracking with SMP job suspend/resume (MSIC)
modify pbs_mom to enforce memory limits for serial jobs (GaTech)
- linux only
enable 'never' qmgr maildomain value to disable user mail
enable qsub reporting of job rejection reason
add suspend/resume diagnostics and logging
prevent stale job handler from destroying suspended jobs
prevent rapid hellos from MOM from causing a DoS on pbs_server
add diagnostics for why node not considered available
add caching of local serverhost addr lookup
enable job centric vs queue centric queue limit parameter
brand new autoconf+automake+libtool build system (USC)
automatic MOM restarts for easier upgrades (USC)
new server attributes: acl_group_sloppy, acl_logic_or, keep_completed, kill_delay
new server attributes: server_name, allow_node_submit, submit_hosts
torque.cfg no longer used by pbs_server
pbsdsh and TM enhancements (USC)
- tm_spawn() returns an error if execution fails
- capture TM stdout with -o
- run on unique nodes with -u
- run on a given hostname with -h
largefile support in staging code and when removing $TMPDIR (USC)
use bindresvport() instead of looping over calls to bind() (USC)
fix qsub "out of memory" for large resource requests (SANDIA)
pbsnodes default arg is now '-a' (USC)
new ":property" node selection when node stat and manager set (pbsnodes) (USC)
fix race with new jobs reporting wrong walltime (USC)
sister moms weren't setting job state to "running" (USC)
don't reject jobs if requested nodes is too large node_pack=T (USC)
add epilogue.parallel and epilogue.user.parallel (SARA)
add $PBS_NODENUM, $PBS_MSHOST, and $PBS_NODEFILE to pelogs (USC)
add more flexible --with-rcp='scp|rcp|mom_rcp' instead of --with-scp (USC)
build/install a single libtorque.so (USC)
nodes are no longer checked against server host acl list (USC)
Tcl's buildindex now supports a 3rd arg for "destdir" to aid fakeroot installs (USC)
fixed dynamic node destroy qmgr option
install rm.h (USC)
printjob now prints saved TM info (USC)
make MOM restarts with running jobs more reliable (USC)
fix return check in pbs_rescquery fixing segfault in pbs_sched (USC)
add README.pbstools to contrib directory
work around buggy recvfrom() in Tru64 (USC)
attempt to handle socklen_t portably (USC)
fix infinite loop in is_stat_get() triggered by network congestion (USC)
job suspend/resume enhancements (see qsig manpage) (USC)
support higher file descriptors in TM by using poll() instead of select() (USC)
immediate job delete feedback to interactive queued jobs (USC)
move qmgr manpage from section 8 to section 1
add SuSE initscripts to contrib/init.d/
fix ctrl-c race while starting interactive jobs (USC)
fix memory corruption when tm_spawn() is interrupted (USC)

2.0.0p8
really fix torque.cfg parsing (USC)
fix possible overlapping memcpy in ACL parsing (USC)
fix rare self-inflicted sigkill in MOM (USC)

2.0.0p7

fixed pbs_mom SEGV in req_stat_job()
fixed torque.cfg parameter handling
fixed qmgr memory leak

2.0.0p6

fix segfault in new "acl_group_sloppy" code if a group doesn't exist (USC)
configure defaults changed to enable syslog, enable docs, and disable filesync (USC)
pelog now correctly restores previous alarm handler (Sandia)
misc fixes with syscall returns, sign-mismatches, and mem corruption (USC)
prevent MOM from killing herself on new job race condition (USC)
- so far, only linux is fixed
remove job delete nanny earlier to not interrupt long stageouts (USC)
display C state later when using keep_completed (USC)
add 'printtracking' command in src/tools (USC)
stop overriding the user with name resolution on qsub's -o/-e args (USC)
xpbsmon now works with Tcl 8.4 (BCGSC)
don't bother spooling/keeping job output intended for /dev/null (USC)
correct missing hpux11 manpage (USC)
fix compile for freebsd - missing symbols (yahoo)
fix momctl exit code (yahoo)
new "exit_status" job attribute (USC)
new "mail_domain" server attribute (overrides --maildomain) (USC)
configure fixes for linux x86_64 and tcl install weirdness (USC)
extended mom parameter buffer space
change pbs_mkdirs to use standard var names so that chroot installs work better (USC)
torque.spec now has tcl/gui and wordexp enabled by default
enable multiple dynamic+static generic resources per node (GATech)
make sure attrs on job launch are sent to server (fixes session_id) (USC)
add resmom job modify logging
torque.cfg parsing fixes

2.0.0p5

reorganize ji_newt structure to eliminate 64 bit data packing issues
enable '--disable-spool' configure directive
enable stdout/stderr stageout to search through $HOME and $HOME/.pbs_spool
fixes to qsub's env handling for newlines and commas (UMU)
fixes to at_arst encoding and decoding for newlines and commas (USC)
use -p with rcp/scp (USC)
several fixes around .pbs_spool usage (USC)
don't create "kept" stdout/err files ugo+rw (avoid insane umask) (USC)
qsub -V shouldn't clobber qsub's environ (USC)
don't prevent connects to "down" nodes that are still talking (USC)
allow file globs to work correctly under --enable-wordexp (USC)
enable secondary group checking when evaluating queue acl_group attribute
- enable the new queue parameter "acl_group_sloppy"
sol10 build system fixes (USC)
fixed node manager buffer overflow (UMU)
fix "pbs_version" server attribute (USC)
torque.spec updates (USC)
remove the leading space on the node session attribute on darwin (USC)
prevent SEGV if config file is missing/corrupt
"keep_completed" execution queue attribute
several misc code fixes (UMU)

2.0.0p4

fix up socklen_t issues
fixed epilog to report total job resource utilization
improved RPM spec (USC)
modified qterm to drop hung connections to bad nodes
enhance HPUX operation

2.0.0p3

fixed dynamic gres loading in pbs_mom (CRI)
added torque.spec (rpmbuild -tb should work) (USC)
new 'packages' make target (see INSTALL) (USC)
added '-1' qstat option to display node info (UMICH)
various fixes in file staging and copying (USC)
- reenable stageout of directories
- fix confusing email messages on failed stageout
- child processes can't use MOM's logging, must use syslog
fix overflow in RM netload (USC)
don't check walltime on sister nodes, only on MS (ANU)
kill_task wasn't being declared properly for all mach types (USC)
don't unnecessarily link with libelf and libdl (USC)
fix compile warnings with qsort/bsearch on bsd/darwin (USC)
fix --disable-filesync to actually work (USC)
added prolog diagnostics to 'momctl -d' output (CRI)
added logging for job file management (CRI)
added mom parameter $ignwalltime (CRI)
added $PBS_VNODENUM to job/TM env (USC)
fix self-referencing job deps (USC)
Use --enable-wordexp to enable variables in data staging (USC)
$PBS_HOME/server_name is now used by MOM _iff $pbsserver isn't used_ (USC)
Fix TRU64 compile issues (NCIFCRF)
Expand job limits up to ULONG_MAX (NCIFCRF)
user-supplied TMPDIR no longer treated specially (USC)
remtree() now deals with symlinks correctly (USC)
enable configurable mail domain (Sandia)
configure now handles darwin8 (USC)
configure now handles --with-scp=path and --without-scp correctly (USC)

2.0.0p2

fix check_pwd() memory leak (USC)

2.0.0p1

fix mpiexec stdout regression from 2.0.0p0 (USC)
add 'qdel -m' support to enable annotating job cancellation (CRI)
add mom diagnostics for prolog failures and timeouts (CRI)
interactive jobs cannot be rerunable (USC)
be sure nodefile is removed when job is purged (USC)
don't run epilogue multiple times when multiple jobs exit at once (USC)
fix clearjob MOM request (momctl -c) (USC)
fix detection of local output files with localhost or /dev/null (USC)
new qstat/qselect -e option to only select jobs in exec queues (USC)
$clienthost and $headnode removed, $pbsclient and $pbsserver added (USC)
$PBS_HOME/server_name is now added to MOM's server list (USC)
resmom transient TMPDIR (USC)
add joblist to MOM's status & add experimental server "mom_job_sync" (USC)
export PBS_SCHED_HINT to pelogues if set in the job (USC)
don't build or install pbs_rcp if --enable-scp (USC)
set user hold on submitted jobs with invalid deps (USC)
add initial multi-server support for HA (CRI)
Altix cpuset enhancements (CSIRO)
enhanced momctl to diagnose and report on connectivity issues (CRI)
added hostname resolution diagnostics and logging (CRI)
fixed 'first node down' rpp failure (USC)
improved qsub response time

2.0.0p0

torque patches for RCP and resmom (UCHSC)
enhanced DIS logging
improved start-up to support quick startup with down nodes
fixed corrupt job/node/queue API reporting
fixed tracejob for large jobs (Sandia)
changed qdel to only send one SIGTERM at mom level
fixed doc build by adding AIX 5 resources docs
added prerun timeout change (RENTEC)
added code to handle select() EBADF - 9
disabled MOM quota feature by default, enabled with -DTENABLEQUOTA
clean up MOM child error messages (USC)
fix makedepend-sh for gcc-3.4 and higher (DTU)
don't fall back to mom_rcp if configured to use scp (USC)

1.2.0p6

enabled opsys mom config (USC)
enabled arch mom config (CRI)
fixed qrun based default scheduling to ignore down nodes (USC)
disable unsetting of key/integer server parameters (USC)
allow FC4 support - quota struct fix (USC)
add fix for out of memory failure (USC)
add file recovery failure messages (USC)
add direct support for external scheduler extensions
add passwd file corruption check
add job cancel nanny patch (USC)
recursively remove job dependencies if children can never be satisfied (USC)
make poll_jobs the default behavior with a restat time of 45 seconds
added 'shell-use-arg' patch (OSC)
improved API timeout disconnect feature
added improved rapid start up

reworked mom-server state management (USC)
- removed 'unknown' state
- improved pbsnodes 'offline' management
- fixed 'momctl -C' which actually _prevented_ an update
- fixed incorrect math on 'tmpTime'
- added 'polltime' to the math on 'tmpTime'
- consolidated node state changes to new 'update_node_state()'
- tightened up the "node state machine"
- changed mom's state to follow the documented state guidelines
- correctly handle "down" from mom
- moved server stream handling out of 'is_update_stat()' to new 'init_server_stream()'
- refactored the top of the main loop to tighten up state changes
- fixed interval counting on the health check script
- forced health check script if update state is forced
- don't spam the server with updates on startup
- required new addr list after connections are dropped
- removed duplicate state updates because of broken multi-server support
- send "down" if internal_state is down (aix's query_adp() can do this)
- removed ferror() check on fread() because fread() randomly fails on initial mom startup.
- send "down" if health check returns "ERROR"
- send "down" if disk space check fails.

1.2.0p5

make '-t quick' default behavior for qterm
added '-p' flag to qdel to enable forced job purge (USC)
fixed server resources_available n-1 issue
added further Altix CPUSet support (NCSA)
added local checkpoint script support for linux
fixed 'premature end of message warning'
clarify job deleted mail message (SDSC)
fixed AIX 5.3 support in configure (WestGrid)
fixed crash when qrun issued on job with incomplete requeue
added support for >= 4GB memory usage (GMX)
log job execution limits failures
added more detailed error messages for missing user shell on mom
fixed qsub env overflow issue

1.2.0p4

extended job prolog to include jobname, resource, queue, and account info (MAINE)
added support for Darwin 8/OS X 10.4 (MAINE)
fixed suspend/resume for MPI jobs (NORWAY)
added support for epilog.precancel to enable local job cancellation handling
fixed build for case insensitive filesystems
fixed relative path based Makefiles for xpbsmon
added support for gcc 4.0
added PBSDEBUG support to client commands to allow more verbose diagnostics of client failures
added ALLOWCOMPUTEHOSTSUBMIT option to torque.cfg
fixed dynamic pbs_server loglevel support
added mom-server rpp socket diagnostics
added support for multi-homed hosts w/SERVERHOST parameter in torque.cfg
added support for static linking w/PBSBINDIR
added availmem/totmem support to Darwin systems (MAINE)
added netload support to Darwin systems (MAINE)

1.2.0p3

enable multiple server to mom communication
fixed node reject message overwrite issue
enable pre-start node health check (BOEING)
fixed pid scanning for RHEL3 (VPAC)
added improved vmem/mem limit enforcement and reporting (UMU)
added submit filter return code processing to qsub

1.2.0p2

enhance network failure messages
fixed tracejob tool to only match correct jobs (WESTGRID)
modified reporting of linux availmem and totmem to allow larger file sizes
fixed pbs_demux for OSF/TRU64 systems to stop orphaned demux processes
added dynamic pbs_server loglevel specification
added intelligent mom job stat sync'ing for improved scalability (USC/CRI)
added mom state sync patch for dup join (USC)
added spool dir space check (MAINE)

1.2.0p1

add default DEFAULTMAILDOMAIN configure option
improve configure options to use pbs environment (USC)
use openpty() based tty management by default
enable default resource manager extensions
make mom config parameters case insensitive
added jobstartblocktime mom parameter
added bulk read in pbs_disconnect() (USC)
added support for solaris 5
added support for program args in pbsdsh (USC)
added improved task recovery (USC)

1.2.0p0

fixed MOM state update behavior (USC/Poland)
fixed set_globid() crash
added support for > 2GB file size job requirements
updated config.guess to 2003 release
general patch to initialize all function variables (USC)
added patch for serial job TJE leakage (USC)
add "hw.memsize" based physmem MOM query for darwin (Maine)
add configure option (--disable-filesync) to speed up job submission
set PBS mail precedence to bulk to avoid vacation responses (VPAC)
added multiple changes to address gcc warnings (USC)
enabled auto-sizing of 'qstat -Q' columns
purge DOS EOL characters from submit scripts

1.1.0p6

added failure logging for various MOM job launch failures (USC)
allow relative path specification with qsub '-d'
enabled $restricted parameter w/in FIFO to allow use of non-privileged ports (SAIC)
checked job launch status code for retry decisions
added nodect resource_available checking to FIFO
disabled client port binding by default for darwin systems (use --enable-darwinbind to re-enable)
- workaround for darwin bind and pclose OS bugs
fixed interactive job terminal control for MAC (NCIFCRF)
added support for MAC MOM-level cpu usage tracking (Maine)
fixed __P warning (USC)
added support for server level resources_avail override of job nodect limits (VPAC)
modify MOM copy files and delete file requests to handle NFS root issues (USC/CRI)
enhance port retry code to support mac socket behavior
clean up file/socket descriptors before execing prolog/epilog
enable dynamic cpu set management (ORNL)
enable array services support for memory management (ORNL)
add server command logging to diagnostics
fix linux setrlimit persistence on failures

1.1.0p5

added loglevel as MOM config parameter
distributed job start sequence into multiple routines
force node state/subnode state offline stat synchronization (NCSA)
fixed N-1 cpu allocation issue (no sanity checking in set_nodes)
enhance job start failure logging
added continued port checking if connect fails (rentec)
added case insensitive host authentication checks
added support for submitfilter command line args
added support for relocatable submitfilter via torque.cfg
fixed offline status cleared when server restarted (USC)
updated PBSTop to 4.05 (USC)
fixed PServiceType array to correctly report service messages
fixed pbs_server crash from job dependencies
prevent mom from truncating lock file when mom is already running
tcp timeout added as config option

1.1.0p4

added 15004 error logging
added use of openpty() call for locating pseudo terminals (SNL)
add diagnostic reporting of config and executable version info
add support for config push
add support for MOM config version parameters
log node offline/online and up/down state changes in pbs_server logs
add mom fork logging and home directory check
add timeout checking in rpp socket handling
added buffer overflow prevention routines
added lockfile logging
supported protected env variables with qstat

1.1.0p3

added support for node specification w/pbsnodes -a
added hostfile support to momctl
added chroot (-D) support (SRCE)
added mom chdir pjob check (SRCE)
fixed MOM HELLO initialization procedure
added momctl diagnostic/admin command (shutdown, reconfig, query, diagnose)
added mom job abort bailout to prevent infinite loops
added network reinitialization when socket failure detected
added mom-to-scheduler reporting when existing job detected
added mom state machine failure logging

1.1.0p2

add support for disk size reporting via pbs_mom
fixed netload initialization
fixed orphans on mom fork failure
updated to pbstop v 3.9 (USC)
fixed buffer overflow issue in net_server.c
added pestat package to contrib (ANU)
added parameter checking to cpy_stage() (NCSA)
added -x (xml output) support for 'qstat -f' and 'pbsnodes -a'
added SSS xml library (SSS)
updated user-project mapping enforcement (ANL)
fix bogus 'cannot find submitfilter' message for interactive jobs
fix incorrect job allocation issue for interactive jobs (NCSA)
prevent failure with invalid 'servername' specification (NCSA)
provide more meaningful 'post processing error' messages (NCSA)
check for corrupt jobs in server database and remove them immediately
enable SIGUSR1/SIGUSR2 pbs_mom dynamic loglevel adjustment
profiling enhancements
use local directory variable in scan_non_child_tasks() to prevent race condition (VPAC)
added AIX 5 odm support for realmem reporting (VPAC)

1.1.0p1

added pbstop to contrib (USC)
added OSC mpiexec patch (OSC)
confirmed OSC mom-restart patch (OSC)
fix pbsd_init purge job tracking
allow tracking of completed jobs (w/TORQUEKEEPCOMPLETED env)
added support for MAC OS 10
added qsub wrapper support
added '-d' qsub command line flag for specifying working directory
fixed numerous spelling issues in pbs docs
enable logical or'ing of user and group ACLs
allow large memory sizes for physmem under solaris (USC)
fixed qsub SEGV on bad '-o' specification
add null checking on ap->value
fixed physmem() routine for tru64 systems to load compute node physical memory
added netload tracking

1.1.0p0

fixed linux swap space checking
fixed AIX5 resmom ODM memory leak
handle split var/etc directories for default server check (CHPC)
add pbs_check utility
added TERAGRID nospool log bounds checking
add code to force host domains to lower case
verified integration of OSC prologue-environment.patch (export Resource_List.nodes in an environment variable for prologue)
verified integration of OSC no-munge-server-name.patch (do not install over existing server_name)
verified integration of OSC docfix.patch (fix minor manpage typo)

1.0.1p6

add messaging to report remote data staging failures to pbs_server
added tcp_timeout server parameter
add routine to mark hung nodes as down
add torque.setup initialization script
track okclient status
fixed INDIANA ji_grpcache MOM crash
fixed pbs_mom PBSLOGLEVEL/PBSDEBUG support
fixed pbs_mom usage
added rentec patch to mom 'sessions' output
fixed pbs_server --help option
added OSC patch to allow jobs to survive mom shutdown
added patch to support server level node comments
added support for reporting of node static resources via sss interface
added support for tracking available physical memory for IRIX/Linux systems
added support for per node probes to dynamically report local state of arbitrary value
fixed qsub -c (checkpoint) usage

1.0.1p5

add SUSE 9.0 support
add Linux 2.4 meminfo support
add support for inline comments in mom_priv/conf
allow support for up to 100 million unique jobs
add pbs_resources_all documentation
fix kill_task references
add contrib/pam_authuser

1.0.1p4

fixed multi-line readline buffer overflow
extended TORQUE documentation
fixed node health check management

1.0.1p3

added support for pbs_server health check and routing to scheduler
added support for specification of more than one clienthost parameter
added PW unused-tcp-interrupt patch
added PW mom-file-descriptor-leak patch
added PW prologue-bounce patch
added PW mlockall patch (release mlock for mom children)
added support for job names up to 256 chars in length
added PW errno-fix patch

1.0.1p2

added support for macintosh (darwin)
fixed qsub 'usage' message to correctly represent '-j', '-k', '-m', and '-q' support
add support for 'PBSAPITIMEOUT' env variable
fixed mom dec/hp/linux physmem probes to support 64 bit
fixed mom dec/hp/linux availmem probes to support 64 bit
fixed mom dec/hp/linux totmem probes to support 64 bit
fixed mom dec/hp/linux disk_fs probes to support 64 bit
removed pbs server request to bogus probe
added support for node 'message' attribute to report internal failures to server/scheduler
corrected potential buffer overflow situations
improved logging replacing 'unknown' error with real error message
enlarged internal tcp message buffer to support 2000 proc systems
fixed enc_attr return code checking

Patches incorporated prior to patch 2:

HPUX superdome support

add proper tracking of HP resources - Oct 2003 (NOR)

is_status memory leak patches - Oct 2003 (CRI)

corrects various memory leaks

Bash test - Sep 2003 (FHCRC)

allows support for linked shells at configure time

AIXv5 support - Sep 2003 (CRI)

allows support for AIX 5.x systems

OSC Meminfo -- Dec 2001 (P. Wycoff)

corrects how pbs_mom figures out how much physical memory each node has under Linux

Sandia CPlant Fault Tolerance I (w/OSC enhancements) -- Dec 2001 (L. Fisk/P. Wycoff)

handles server-MOM hangs

OSC Timeout I -- Dec 2001 (P. Wycoff)

enables longer inter daemon timeouts

OSC Prologue Env I -- Jan 2002 (P. Wycoff)

add support for env variable PBS_RESOURCE_NODES in job prolog

OSC Doc/Install I -- Dec 2001 (P. Wycoff)

fix to the pbsnodes man page
Configuration information for Linux on the IA64 architecture
fix the build process to make it clean out the documentation directories during a "make distclean"
fix the installation process to keep it from overwriting ${PBS_HOME}/server_name if it already exists
correct code creating compile time warnings
allow PBS to compile on Linux systems which do not have the Linux kernel source installed

Maui RM Extension -- Dec 2002 (CRI)

enable Maui resource manager extensions including QOS, reservations, etc

NCSA Scaling I -- Mar 2001 (G. Arnold)

increase number of nodes supported by PBS to 512

NCSA No Spool -- Apr 2001 (G. Arnold)

support $HOME/.pbs_spool for large jobs

NCSA MOM Pin

pin PBS MOM into memory to keep it from getting swapped

ANL RPP Tuning -- Sep 2000 (J Navarro)

tuning RPP for large systems

WGR Server Node Allocation -- Jul 2000 (B Webb)

addresses issue where PBS server incorrectly claims insufficient nodes

WGR MOM Soft Kill -- May 2002 (B Webb)

processes are killed with SIGTERM followed by SIGKILL

PNNL SSS Patch -- Jun 2002 (Skousen)

improves server-mom and server-scheduler communication

CRI Job Init Patch -- Jul 2003 (CRI)

correctly initializes new jobs eliminating unpredictable behavior and crashes

VPAC Crash Trap -- Jul 2003 (VPAC)

supports PBSCOREDUMP env variable

CRI Node Init Patch -- Aug 2003 (CRI)

correctly initializes new nodes eliminating unpredictable behavior and crashes

SDSC Log Buffer Patch -- Aug 2003 (SDSC)

addresses log message overruns