The SandBox Tutorial

Image (sandbox): https://images.unsplash.com/photo-1525298995976-d6c547e7f3f3?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1170&q=80

Photo by Ostap Senyuk on Unsplash

The fake job scheduler “sandbox” was created for three purposes:

  • provide mock HPC resources to lemmings during continuous integration tests on GitLab.

  • provide a fast testing context for lemmings core developers.

  • provide a playground for workflow developers trying to figure out lemmings functionalities.

Main commands

Start the sandbox

You can start the sandbox with

>lem_sandbox start

Use the flag -d to specify a custom duration in seconds. Here is what happens during the first ten seconds of a minute-long session. Each time the sandbox checks for new tasks, it prints Daemon spawn. to the standard output.

>lem_sandbox start -d 60
INFO - Starting sandbox...
INFO - Freq: 3s
INFO - Max duration: 60s
INFO - Daemon spawn.
INFO - Daemon spawn.
(...)

Submit a job

You can submit a run via a batch file.

>lem_sandbox submit my_batch
Job submitted with PID 32877

The command returns the process ID (PID) of the task. Obviously, nothing will happen if the sandbox is not running.

For a conditional submission, where the job must wait for another job to finish, use the flag -a or --after followed by the PID of that job:

>lem_sandbox submit my_batch -a 32877
Job submitted with PID 32931
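
When several dependent jobs must be chained from a script, the PID printed by one submission has to be fed to the next. Here is a minimal sketch in Python, assuming the message keeps the form "Job submitted with PID N" shown above. The helper names parse_pid and submit are hypothetical and are not part of lemmings; reading both stdout and stderr is a precaution, since this tutorial does not say which stream carries the message.

```python
import re
import subprocess

# Matches the message shown in this tutorial: "Job submitted with PID 32877"
PID_PATTERN = re.compile(r"Job submitted with PID (\d+)")


def parse_pid(output):
    """Extract the PID from the text printed by `lem_sandbox submit`."""
    match = PID_PATTERN.search(output)
    if match is None:
        raise ValueError(f"no PID found in: {output!r}")
    return int(match.group(1))


def submit(batch_file, after=None):
    """Submit a batch file, optionally after another PID (hypothetical helper)."""
    cmd = ["lem_sandbox", "submit", batch_file]
    if after is not None:
        cmd += ["-a", str(after)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Assumption: the message may land on either stream, so search both.
    return parse_pid(result.stdout + result.stderr)
```

With a sandbox running, first = submit("my_batch") followed by submit("my_batch", after=first) would reproduce the two commands above.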

Check the queuing status

The qstat command prints the following table:

>lem_sandbox qstat
+---------------+---------------+-------+-------+-------------------+-------+
|    job name   |     queue     |  pid  | state |    last update    | after |
+---------------+---------------+-------+-------+-------------------+-------+
| twicer_PUBU75 |  long00:00:30 | 27784 |  done | 05/27/22 12:07:00 |   -   |
| twicer_PUBU75 | short00:00:10 | 27785 |  done | 05/27/22 12:07:05 | 27784 |
| twicer_PUBU75 |  long00:00:30 | 27797 |  done | 05/27/22 12:07:08 |   -   |
| twicer_PUBU75 | short00:00:10 | 27798 |  done | 05/27/22 12:07:12 | 27797 |
| twicer_PUBU75 |  long00:00:30 | 27810 |  done | 05/27/22 12:07:15 |   -   |
| twicer_PUBU75 | short00:00:10 | 27811 |  done | 05/27/22 12:07:20 | 27810 |
| twicer_PUBU75 |  long00:00:30 | 27823 |  done | 05/27/22 12:07:23 |   -   |
| twicer_PUBU75 | short00:00:10 | 27825 |  done | 05/27/22 12:07:28 | 27823 |
| twicer_PUBU75 |  long00:00:30 | 27835 |  done | 05/27/22 12:07:31 |   -   |
| twicer_PUBU75 | short00:00:10 | 27836 |  done | 05/27/22 12:07:36 | 27835 |
| twicer_PUBU75 |  long00:00:30 | 27845 |  done | 05/27/22 12:07:39 |   -   |
| twicer_PUBU75 | short00:00:10 | 27846 |  done | 05/27/22 12:07:43 | 27845 |
+---------------+---------------+-------+-------+-------------------+-------+

Cancel a job

To cancel a job, use cancel:

>lem_sandbox cancel 32877
PID 32877 cancelled

Accounting

You can ask how much time the job took using the acct command:

>lem_sandbox acct 27823
3

The answer is the number of seconds.

Minimal tutorial

To test the sandbox, create two batch files.

  1. mybatch will be the first job:

#SBX job_name=snake
echo "Hello world"

  2. mybatch2 will be the second job, to be executed after the first:

#SBX job_name=plissken
echo "lorem ipsum"

Start the sandbox for 60 seconds in a terminal:

>lem_sandbox start -d 60
INFO - Starting sandbox...
INFO - Freq: 3s
INFO - Max duration: 60s
INFO - Daemon spawn.
(...)

In another terminal, in the same directory, submit your tasks:

>lem_sandbox submit mybatch
Job submitted with PID 33502
>lem_sandbox submit mybatch2 -a 33502
Job submitted with PID 33510

(Adapt the PID to the one returned by your own first submission.)

Check whether the jobs are running:

>lem_sandbox qstat
+----------+-------+-------+---------+-------------------+-------+
| job name | queue |  pid  |  state  |    last update    | after |
+----------+-------+-------+---------+-------------------+-------+
|  snake   | dummy | 33502 |   done  | 05/27/22 14:08:26 |   -   |
| plissken | dummy | 33510 | running | 05/27/22 14:08:29 | 33502 |
+----------+-------+-------+---------+-------------------+-------+

Check what happened from the sandbox point of view, looking back into the first terminal:

(...)
INFO - Daemon spawn.
INFO - Daemon spawn.
INFO - start job snake_33502
INFO - Command echo "Hello world"
INFO - Daemon spawn.
INFO - stop job snake_33502
INFO - Daemon spawn.
INFO - start job plissken_33510
INFO - Command echo "lorem ipsum"
INFO - Daemon spawn.
INFO - stop job plissken_33510
INFO - Daemon spawn.
INFO - Daemon spawn.
INFO - Daemon spawn.

And that is all.

How does it work?

The sandbox is actually a Python script that runs for a limited time (e.g. one minute) and checks a file on disk every 3 seconds for anything to launch as a subprocess.
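
The loop described above can be sketched as follows. This is not the actual lemmings code, only an illustration of a time-limited polling loop; the function poll and its parameters are hypothetical names. The clock and sleep functions are passed in so the loop can be tried out without really waiting.

```python
import time


def poll(check_for_tasks, freq=3.0, max_duration=60.0,
         clock=time.monotonic, sleep=time.sleep):
    """Call check_for_tasks() every `freq` seconds until `max_duration` elapses.

    Returns the number of checks performed (one per "Daemon spawn.").
    """
    start = clock()
    spawns = 0
    while clock() - start < max_duration:
        check_for_tasks()  # e.g. read the file on disk and launch pending jobs
        spawns += 1
        sleep(freq)
    return spawns
```

Under these assumptions, poll(lambda: print("Daemon spawn.")) would print a line every 3 seconds for one minute.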

The file $HOME/lem_sandbox_ddb.json looks like this:

{"pid": "27784", "time": "05/27/22 12:06:55", "state": "pending", "after": null, "batchfile": "batch_job", "job_name": "twicer_PUBU75", "queue": "long00:00:30"}
{"pid": "27785", "time": "05/27/22 12:06:55", "state": "pending", "after": "27784", "batchfile": "batch_pjob", "job_name": "twicer_PUBU75", "queue": "short00:00:10"}
{"pid": "27784", "time": "05/27/22 12:06:56", "state": "running", "after": null, "batchfile": "batch_job", "job_name": "twicer_PUBU75", "queue": "long00:00:30"}
{"pid": "27784", "time": "05/27/22 12:07:00", "state": "done", "after": null, "batchfile": "batch_job", "job_name": "twicer_PUBU75", "queue": "long00:00:30"}

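Well OK, this is not vanilla JSON: the file holds one JSON object per line, with no enclosing []. If you spotted this, you probably also figured out why the [] are missing: a new record can be appended without rewriting the rest of the file.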
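
Since each PID appears several times in the file, once per state change, the current status of a job is the last record written for its PID. Here is a minimal sketch of how such a file could be read, assuming one JSON object per line as in the excerpt above. The function latest_states is a hypothetical name, not the lemmings implementation.

```python
import json


def latest_states(lines):
    """Return {pid: last record seen for that pid} from one-object-per-line JSON."""
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        latest[record["pid"]] = record  # later lines override earlier ones
    return latest
```

It could be applied to the database with open(os.path.expanduser("~/lem_sandbox_ddb.json")) as the source of lines.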

A job submission simply adds a line to this file:

{"pid": "27784", "time": "05/27/22 12:06:55", "state": "pending", "after": null, "batchfile": "batch_job", "job_name": "twicer_PUBU75", "queue": "long00:00:30"}

Pending jobs become running jobs. Once the batch subprocess is done, running jobs move to the state done.
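
The submission and the state changes above can be sketched as appends to the same file. The field names and the time format are taken from the records shown in this section; the functions append_record, submit and change_state are hypothetical names, and the real implementation may differ.

```python
import json
from datetime import datetime

TIME_FORMAT = "%m/%d/%y %H:%M:%S"  # e.g. "05/27/22 12:06:55"


def append_record(path, record):
    """Append one JSON object as a single line at the end of the file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def submit(path, pid, batchfile, job_name, queue, after=None, now=None):
    """Register a new job in the state "pending"."""
    record = {
        "pid": str(pid),
        "time": (now or datetime.now()).strftime(TIME_FORMAT),
        "state": "pending",
        "after": None if after is None else str(after),
        "batchfile": batchfile,
        "job_name": job_name,
        "queue": queue,
    }
    append_record(path, record)
    return record


def change_state(path, record, new_state, now=None):
    """Append a copy of the record with a new state ("running" or "done")."""
    stamp = (now or datetime.now()).strftime(TIME_FORMAT)
    updated = dict(record, state=new_state, time=stamp)
    append_record(path, updated)
    return updated
```

Nothing is ever rewritten in place in this sketch: the history of a job is the sequence of lines that carry its PID.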

Known limitations

  • This job scheduler only starts batch processes; what you put in those processes is up to you (rm -rf ~ or mpirun are possibilities we never tried).

  • Do not start multiple sessions at the same time

  • The database is a mere file on disk, without any I/O lock for the moment. Concurrent I/O could fail (e.g. if you call qstat 10 times per second). We have never actually triggered this crash, but you never know.

  • By no means can this compete with an actual job scheduler (SLURM, PBS). If your job lasts more than 3 min on several cores, or is production work, switch to the real deal.
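
For the missing I/O lock mentioned in the list above, here is one possible approach, given only as an illustration: an advisory lock taken with the standard fcntl module around each append. The sandbox does not do this today, the function locked_append is a hypothetical name, and fcntl is only available on POSIX systems.

```python
import fcntl  # POSIX only
import json


def locked_append(path, record):
    """Append one JSON line while holding an exclusive advisory lock."""
    with open(path, "a", encoding="utf-8") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until no other process holds it
        try:
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # push the line to the OS before releasing the lock
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

An advisory lock only protects against processes that also take it, so readers such as qstat would need a matching shared lock (fcntl.LOCK_SH) for this to help.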

Well it ain’t much, but it’s honest work, and gets the work done…