The SubprocessExecutionManager was enqueuing multiprocessing.Event objects onto _term_events and never cleaning them up. Because an Event is backed by a POSIX semaphore, each one stayed open between pipeline runs. This went unnoticed because we rarely ran enough simultaneous jobs on a single node to make the system fall over.
The fix is to clean up the _term_event dictionary during _check_for_zombies. This will not help if we fire and forget a large number of simultaneous processes, but doing that is a bad idea for many reasons anyway (which is why we need pooling or queueing by default; I will address that in a future revision).
The fix and tests are included.
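
For reviewers who want the shape of the cleanup without opening the diff, here is a minimal sketch. It is not the actual patch: ExecutionManagerSketch, launch, _worker, the _processes dictionary, and the run_id keys are all hypothetical stand-ins, and the real manager's bookkeeping may look different. The only point is that the sweep which notices dead processes also drops the matching Event.

```python
import multiprocessing


def _worker(term_event):
    # Hypothetical job body: idle until the manager asks it to stop.
    term_event.wait()


class ExecutionManagerSketch:
    """Hypothetical stand-in for SubprocessExecutionManager (illustration only)."""

    def __init__(self):
        self._processes = {}    # run_id -> process handle (assumed layout)
        self._term_events = {}  # run_id -> multiprocessing.Event (assumed layout)

    def launch(self, run_id):
        # One Event per run; before the fix these were never released.
        term_event = multiprocessing.Event()
        process = multiprocessing.Process(target=_worker, args=(term_event,))
        process.start()
        self._processes[run_id] = process
        self._term_events[run_id] = term_event

    def terminate(self, run_id):
        # Signal the child to exit; cleanup happens on the next sweep.
        self._term_events[run_id].set()

    def _check_for_zombies(self):
        # Collect ids first so the dictionaries are not mutated mid-iteration.
        finished = [rid for rid, proc in self._processes.items()
                    if not proc.is_alive()]
        for rid in finished:
            # Reap the exited child.
            self._processes.pop(rid).join()
            # Drop our reference to the Event so it can be released.
            self._term_events.pop(rid, None)
        return finished
```

Runs that are still alive keep their Event, so a burst of fire-and-forget launches can still pile up semaphores until those runs exit, which is the limitation noted above.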