The SandCastle Farming Tutorial
The Sandcastle Farming tutorial follows the Sandcastle tutorial. Here you will see how to automatically replicate your workflow. You will again run the mock-up solver twicer using the fake job scheduler sandbox. Make sure you have already completed these tutorials.
This tutorial is part of the lemmings test suite. We will create the files here step by step, but you can fetch everything from tests/sandbox/test_sandcastle_farming.
A code with a parameter changing its output
We extend the code twicer.py so it can take one optional parameter: multiplier.
""" Mock up code twicer for lemmings tests """
import os
import yaml
### Main actions
def main(init_file, sol_path, multiplier=2):
"""Main function of twicer
init_file (str): initial solution path to read
sol_path (str): final solution folder
"""
print("Twicer execution...")
# Load data
print(f" - reading data from {init_file} ...")
sol = _load_data(init_file)
print(" - twicing...")
# Increase time
sol["time"]+=1
# Double the content of data
sol["data"] = [multiplier*dat for dat in sol["data"]]
# create output folder if missing
if not os.path.isdir(sol_path):
os.mkdir(sol_path)
# dump solution
out_file = sol_path+"/twicer_sol.yml"
print(f" - dumping data to {out_file} ...")
_dump_data(sol, out_file)
print(f"Execution complete")
### I/O using yaml
def _load_data(filename):
"""Load a yaml file as dict"""
with open(filename, "r") as fin:
data = yaml.load(fin, Loader=yaml.SafeLoader)
return data
def _dump_data(data,filename):
"""Write dict to a a yaml file"""
with open(filename, "w") as fout:
yaml.dump(data,fout, Dumper=yaml.SafeDumper)
## Main function call
if __name__ == "__main__":
"""If called directly from terminal"""
input_params = _load_data("twicer_in.yml")
init_file = input_params["init_sol"]
try:
multiplier = input_params["multiplier"]
except KeyError as e:
multiplier = 2
sol_path = input_params["out_path"]
main(init_file, sol_path, multiplier=multiplier)
The small try/except in the main call makes this code suitable for both input files:
init_sol: TEMP_0000/twicer_sol.yml
multiplier: 0.6
out_path: TEMP_0001
or
init_sol: TEMP_0000/twicer_sol.yml
out_path: TEMP_0001
where multiplier defaults to 2, as in the sandcastle tutorial.
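The fallback logic can also be written with dict.get(). The following is a hypothetical sketch of that logic, not part of twicer.py, shown with plain dicts instead of the YAML files:

```python
# Sketch of the multiplier fallback (hypothetical helper, not part of
# the tutorial code): a missing key falls back to the default of 2.
def get_multiplier(input_params, default=2):
    """Return the multiplier, or the default when the key is absent."""
    return input_params.get("multiplier", default)

full_input = {"init_sol": "TEMP_0000/twicer_sol.yml",
              "multiplier": 0.6,
              "out_path": "TEMP_0001"}
short_input = {"init_sol": "TEMP_0000/twicer_sol.yml",
               "out_path": "TEMP_0001"}

print(get_multiplier(full_input))   # 0.6
print(get_multiplier(short_input))  # 2
```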
Create a workflow for farming
Replicate the workflow five times
The farming mode requires additional data: the parameters that change between the members of a farming job. These parameters are provided using an extended version of the control file:
#---Required Settings--#
exec: | # your environment, slurm exec, module etc. (written in the batch)
  echo "hello world exec"
  pwd
  python twicer.py
exec_pj: | # exec for the PJ (written in the batch)
  echo "hello world exec_pj"
  pwd
job_queue: long # name of the job queue based on your '{machine}.yml'
pjob_queue: short # name of the post-job queue based on your '{machine}.yml'

#---Optional Settings--#
job_prefix: twicer # Add a prefix to the run_name

#---Custom Settings--#
custom_params:
  simulation_end_time: 3 # final end time desired

farming:
  active: True
  max_parallel_workflows: 2 # maximum number of workflows to submit at once
  parameter_array:
    - multiplier: 0.6
      dummy: 666
      foo: "bar"
    - multiplier: 0.8
    - multiplier: 1.0
    - multiplier: 1.4
    - multiplier: 2.8
We see at the bottom a farming section, with three elements:

- active: True allows the farming to be processed. Set it to False if, when encountering problems, you want to run a single workflow without erasing the whole section.
- max_parallel_workflows: 2 is the number of workflows run at the same time. Use this to adjust the greediness of your farming. Other people on the same HPC resources might get angry seeing a single user queuing hundreds of jobs.
- parameter_array: [] is the array of data you must provide. Each member gets one element of this list, called member_params. Elements can differ from one member to another. Indeed, one member can request a simple sub-model with zero inputs, and another a more complex sub-model with several inputs.
As you can see, this is deliberately simple: for 12 members in your farming, add 12 member params in the parameter_array.
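The mapping from parameter_array members to replicated workflows can be sketched as follows (the WF_XXX_<prefix> folder naming is assumed from the tree listings later in this tutorial):

```python
# Each parameter_array item becomes the member_params of one workflow.
parameter_array = [
    {"multiplier": 0.6, "dummy": 666, "foo": "bar"},
    {"multiplier": 0.8},
    {"multiplier": 1.0},
    {"multiplier": 1.4},
    {"multiplier": 2.8},
]
job_prefix = "twicer"

# One workflow folder per member, each holding its own member_params.
for wf_id, member_params in enumerate(parameter_array):
    print(f"WF_{wf_id:03d}_{job_prefix} -> {member_params}")
```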
Using the Farming CLI to create the farming structure
Your initial folder should look like the folder from the sandcastle tutorial:
> tree
.
├── TEMP_0000
│ └── twicer_sol.yml
├── sandbox.yml
├── sandcastle_farm.py
├── sandcastle_farm.yml
└── twicer.py
Here, we simply renamed sandcastle
files into sandcastle_farm
to avoid confusion.
Run the farming CLI, without starting the job scheduler sandbox first.
We have five members in the farming, with a limit of 2 simultaneous workflows.
Let's brace ourselves and recap, because a lot will happen: i) we launch a lemmings-farming run (called twicer_HODE56), which ii) starts a lemmings run on a first workflow WF_000 (twicer_LOCA75), iii) then starts a second lemmings run on the workflow WF_001 (twicer_MAVU18).
│> lemmings-farming run (twicer_HODE56)
├── WF_000 > lemmings run (twicer_LOCA75)
├── WF_001 > lemmings run (twicer_MAVU18)
├── WF_002
├── WF_003
└── WF_004
Ok, now you are ready: we dive in! First, launch the run without the sandbox, to see what happens without actually starting the work:
> lemmings-farming run --machine-file sandbox.yml --inputfile sandcastle_farm.yml --job-prefix twicer sandcastle_farm.py
INFO -
##############################
Starting Lemmings 0.8.0...
##############################
INFO - Job name :twicer_HODE56
INFO - Loop :1
INFO - Status :start
INFO - Worflow path :/Users/dauptain/TEST/test_workflow_farming/sandcastle_farm.py
INFO - Imput path :/Users/dauptain/TEST/test_workflow_farming/sandcastle_farm.yml
INFO - Machine path :/Users/dauptain/TEST/test_workflow_farming/sandbox.yml
INFO - Farming mode :True
Disclaimer: this automated job has been submitted by user dauptain
under his/her scrutiny. In case of a wasteful usage of this computer ressources,
the aforementionned user will be found, and forced to take responsibility...
- use `lemmings(-farming) status` to follow your jobs
- use `lemmings(-farming) cancel` to cancel it
- use `find . -name *.log` to see the log files created
INFO -
##############################
Starting Lemmings 0.8.0...
##############################
INFO - Job name :twicer_LOCA75
INFO - Loop :1
INFO - Status :start
INFO - Worflow path :/Users/dauptain/TEST/test_workflow_farming/WF_000_twicer/sandcastle_farm.py
INFO - Imput path :/Users/dauptain/TEST/test_workflow_farming/sandcastle_farm.yml
INFO - Machine path :/Users/dauptain/TEST/test_workflow_farming/WF_000_twicer/sandbox.yml
INFO - Farming mode :True
Disclaimer: this automated job has been submitted by user dauptain
under his/her scrutiny. In case of a wasteful usage of this computer ressources,
the aforementionned user will be found, and forced to take responsibility...
- use `lemmings(-farming) status` to follow your jobs
- use `lemmings(-farming) cancel` to cancel it
- use `find . -name *.log` to see the log files created
INFO - Lemmings START
INFO - Check on startTrue (False -> Exit)
INFO - Prior to job
INFO - Lemmings SPAWN
INFO - Prepare run
INFO - Submit batch 53630
INFO - Submit batch post job 53631
INFO - Launch workflow WF_000_twicer
INFO -
##############################
Starting Lemmings 0.8.0...
##############################
INFO - Job name :twicer_MAVU18
INFO - Loop :1
INFO - Status :start
INFO - Worflow path :/Users/dauptain/TEST/test_workflow_farming/WF_001_twicer/sandcastle_farm.py
INFO - Imput path :/Users/dauptain/TEST/test_workflow_farming/sandcastle_farm.yml
INFO - Machine path :/Users/dauptain/TEST/test_workflow_farming/WF_001_twicer/sandbox.yml
INFO - Farming mode :True
Disclaimer: this automated job has been submitted by user dauptain
under his/her scrutiny. In case of a wasteful usage of this computer ressources,
the aforementionned user will be found, and forced to take responsibility...
- use `lemmings(-farming) status` to follow your jobs
- use `lemmings(-farming) cancel` to cancel it
- use `find . -name *.log` to see the log files created
INFO - Lemmings START
INFO - Check on startTrue (False -> Exit)
INFO - Prior to job
INFO - Lemmings SPAWN
INFO - Prepare run
INFO - Submit batch 53635
INFO - Submit batch post job 53636
INFO - Launch workflow WF_001_twicer
INFO - Switch to exit
INFO - Reason : Replicate workflows launched according to max parallel chains 2
This log, quite verbose, describes what happened to the farming run and its first two workflows. The command starts the farming run. Without the sandbox job scheduler, nothing will actually run. However, the farming structure is already created. Have a look at it:
.
├── TEMP_0000
│ └── twicer_sol.yml
├── WF_000_twicer
│ ├── TEMP_0000
│ │ └── twicer_sol.yml
│ ├── _farming_params.json
│ ├── batch_job
│ ├── batch_pjob
│ ├── database.yml
│ ├── sandbox.yml
│ ├── sandcastle_farm.py
│ ├── twicer.py
│ ├── twicer_YUDA86
│ └── twicer_in.yml
├── WF_001_twicer
│ ├── TEMP_0000
│ │ └── twicer_sol.yml
│ ├── _farming_params.json
│ ├── batch_job
│ ├── batch_pjob
│ ├── database.yml
│ ├── sandbox.yml
│ ├── sandcastle_farm.py
│ ├── twicer.py
│ ├── twicer_ZOVI36
│ └── twicer_in.yml
├── WF_002_twicer
│ ├── TEMP_0000
│ │ └── twicer_sol.yml
│ ├── _farming_params.json
│ ├── sandbox.yml
│ ├── sandcastle_farm.py
│ └── twicer.py
├── WF_003_twicer
│ ├── TEMP_0000
│ │ └── twicer_sol.yml
│ ├── _farming_params.json
│ ├── sandbox.yml
│ ├── sandcastle_farm.py
│ └── twicer.py
├── WF_004_twicer
│ ├── TEMP_0000
│ │ └── twicer_sol.yml
│ ├── _farming_params.json
│ ├── sandbox.yml
│ ├── sandcastle_farm.py
│ └── twicer.py
├── database.yml
├── sandbox.yml
├── sandcastle_farm.py
├── sandcastle_farm.yml
└── twicer.py
Several observations:

- The initial folder has been replicated several times: one folder for each item of parameter_array.
- The first two folders already have their batch_job and batch_pjob created. This means the CLI lemmings-farming already tried to start as many workflows as possible according to max_parallel_workflows (here 2).
- A new file _farming_params.json popped up in these folders.

If you look at these _farming_params.json files, you will see they store the content of each parameter_array item.
>cat WF_000_twicer/_farming_params.json
{"multiplier": 0.6, "dummy": 666, "foo": "bar"}
>cat WF_001_twicer/_farming_params.json
{"multiplier": 0.8}}
We now have a workflow ready to run and the parameters at hand, but how do we connect the two?
Adapt your workflow to a farming
First, add to sandcastle_farm.py the import of the function read_farming_params():
from lemmings.chain.lemmingjob_base import LemmingJobBase, read_farming_params
This function automatically loads the parameter file relative to the workflow in use. Here we will add multiplier to twicer_in.yml from the prepare_run() method of our workflow:
def prepare_run(self):
    """
    Refresh the input file
    """
    last_id = last_folder_id()
    farm_params = read_farming_params()
    input_twicer = {
        "init_sol": _name_folder(last_id) + "/twicer_sol.yml",
        "multiplier": farm_params["multiplier"],
        "out_path": _name_folder(last_id + 1)
    }
    with open("./twicer_in.yml", "w") as fout:
        yaml.dump(input_twicer, fout)
In the present case, the variable farm_params is a dict loaded with:
{"multiplier": 0.6, "dummy": 666, "foo": "bar"}
The value of the key multiplier is added to the input file.
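To make the wiring concrete, here is a hypothetical trace of the dict that prepare_run() would build for member WF_000 on the first loop. The _name_folder() helper is assumed to format the TEMP_XXXX names seen in the folder listings:

```python
# Member parameters as stored in WF_000_twicer/_farming_params.json
farm_params = {"multiplier": 0.6, "dummy": 666, "foo": "bar"}
last_id = 0  # first loop: the last existing folder is TEMP_0000

def _name_folder(idx):
    """Assumed helper: format the TEMP_XXXX folder name."""
    return f"TEMP_{idx:04d}"

input_twicer = {
    "init_sol": _name_folder(last_id) + "/twicer_sol.yml",
    "multiplier": farm_params["multiplier"],
    "out_path": _name_folder(last_id + 1),
}
print(input_twicer["init_sol"])    # TEMP_0000/twicer_sol.yml
print(input_twicer["multiplier"])  # 0.6
print(input_twicer["out_path"])    # TEMP_0001
```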
Try the farming workflow… for real
Before running the farming workflow for real, clean the working directory with:
>lemmings-farming clean
Now start the job scheduler sandbox in a separate window, then resubmit your farming :
>lemmings-farming run --machine-file sandbox.yml --inputfile sandcastle_farm.yml --job-prefix twicer sandcastle_farm.py
You're about to start a farming of lemmings chains. Are you sure? (yes/no) yes
#---Starting farming mode---#
Initialise farming database.yml
Checking sandbox.yml
Overriding the $LEMMINGS_MACHINE by a user defined sandbox.yml
Parallel mode enabled
Starting chain named twicer_SAJI25
Starting chain named twicer_REFI93
Replicate workflows launched according to, max parallel chains = 2
You can follow the progress of the farming using :
>lemmings-farming status
S: Submitted, F: Finished, W: Wait, E: Error, K: Killed
+-----------------+--------+
| Workflow number | Status |
+-----------------+--------+
| WF_000_twicer | F |
| WF_001_twicer | F |
| WF_002_twicer | S |
| WF_003_twicer | S |
| WF_004_twicer | W |
+-----------------+--------+
Once the farming is done, you can check that each workflow produced all the results:
>tree WF_00*/TEMP*
WF_000_twicer/TEMP_0000
└── twicer_sol.yml
WF_000_twicer/TEMP_0001
└── twicer_sol.yml
WF_000_twicer/TEMP_0002
└── twicer_sol.yml
WF_001_twicer/TEMP_0000
└── twicer_sol.yml
WF_001_twicer/TEMP_0001
└── twicer_sol.yml
WF_001_twicer/TEMP_0002
└── twicer_sol.yml
WF_002_twicer/TEMP_0000
└── twicer_sol.yml
WF_002_twicer/TEMP_0001
└── twicer_sol.yml
WF_002_twicer/TEMP_0002
└── twicer_sol.yml
WF_003_twicer/TEMP_0000
└── twicer_sol.yml
WF_003_twicer/TEMP_0001
└── twicer_sol.yml
WF_003_twicer/TEMP_0002
└── twicer_sol.yml
WF_004_twicer/TEMP_0000
└── twicer_sol.yml
WF_004_twicer/TEMP_0001
└── twicer_sol.yml
WF_004_twicer/TEMP_0002
└── twicer_sol.yml
Check also that each workflow created different outputs:
>cat WF*/TEMP_0002/twicer_sol.yml
data:
- 0.36
- 1.7999999999999998
- 2.52
- 1.0799999999999998
time: 3.0
data:
- 0.6400000000000001
- 3.2
- 4.48
- 1.9200000000000004
time: 3.0
data:
- 1.0
- 5.0
- 7.0
- 3.0
time: 3.0
data:
- 1.9599999999999997
- 9.799999999999999
- 13.719999999999997
- 5.879999999999999
time: 3.0
data:
- 7.839999999999999
- 39.199999999999996
- 54.87999999999999
- 23.519999999999996
time: 3.0
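These numbers can be checked by hand. Each lemmings loop multiplies the data by the member's multiplier and advances time by 1, so with simulation_end_time set to 3 the multiplier is applied twice. A small sketch, assuming the initial solution holds data [1, 5, 7, 3] (which the multiplier-1.0 member WF_002 suggests):

```python
# Reproduce each member's final data by applying its multiplier twice,
# as twicer.py does over the two lemmings loops.
init_data = [1, 5, 7, 3]  # assumed initial TEMP_0000 solution

for multiplier in (0.6, 0.8, 1.0, 1.4, 2.8):
    data = list(init_data)
    for _ in range(2):  # two loops until simulation_end_time = 3
        data = [multiplier * d for d in data]
    print(multiplier, data)
```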
Congratulations, you have performed your first farming workflow!
Improve the workflow
A workflow for both farming and single runs
It is good practice to keep your workflow compatible with both farming and non-farming usages. The function read_farming_params() will return None if there is no _farming_params.json. Use this to make your prepare_run() suitable for both farming and non-farming runs:
def prepare_run(self):
    """
    Refresh the input file
    """
    last_id = last_folder_id()
    multiplier_val = 2
    pars = read_farming_params()
    if pars is not None:
        multiplier_val = pars["multiplier"]
    input_twicer = {
        "init_sol": _name_folder(last_id) + "/twicer_sol.yml",
        "multiplier": multiplier_val,
        "out_path": _name_folder(last_id + 1)
    }
    with open("./twicer_in.yml", "w") as fout:
        yaml.dump(input_twicer, fout)
Using the presence of _farming_params.json as the marker for farming mode is convenient, because you can resubmit any of the workflows from its own folder using >lemmings run (...).
A farming based on an external file
You can use an external file to provide your farming parameters. The format expected is a JSON list. For example, imagine some software generating the file:
import json

max_wf = 10
list_farming_params = list()
for wf_id in range(max_wf):
    alpha = 1. * wf_id / (max_wf - 1)
    params = {"multiplier": 2. + 1. * alpha, "wf_id": wf_id}
    list_farming_params.append(params)

with open("farming_params.json", "w") as fout:
    json.dump(list_farming_params, fout, indent=4)
You can look at the content of this file:
>cat farming_params.json
[
{
"multiplier": 2.0,
"wf_id": 0
},
{
"multiplier": 2.111111111111111,
"wf_id": 1
},
{
"multiplier": 2.2222222222222223,
"wf_id": 2
},
{
"multiplier": 2.3333333333333335,
"wf_id": 3
},
{
"multiplier": 2.4444444444444446,
"wf_id": 4
},
{
"multiplier": 2.5555555555555554,
"wf_id": 5
},
{
"multiplier": 2.6666666666666665,
"wf_id": 6
},
{
"multiplier": 2.7777777777777777,
"wf_id": 7
},
{
"multiplier": 2.888888888888889,
"wf_id": 8
},
{
"multiplier": 3.0,
"wf_id": 9
}
]
You must now adapt the workflow input file to reference this file:
#---Required Settings--#
exec: | # your environment, slurm exec, module etc. (written in the batch)
  echo "hello world exec"
  pwd
  python twicer.py
exec_pj: | # exec for the PJ (written in the batch)
  echo "hello world exec_pj"
  pwd
job_queue: long # name of the job queue based on your '{machine}.yml'
pjob_queue: short # name of the post-job queue based on your '{machine}.yml'
cpu_limit: null # The maximal CPU limit for your Lemmings chain [Hours]

#---Optional Settings--#
job_prefix: twicer # Add a prefix to the run_name

#---Custom Settings--#
custom_params:
  simulation_end_time: 3 # final end time desired

farming:
  active: True
  max_parallel_workflows: 2 # maximum number of workflows to submit at once
  parameter_file: "./farming_params.json"
You can re-submit your farming and see the results:
>lemmings-farming status
S: Submitted, F: Finished, W: Wait, E: Error, K: Killed
+-----------------+--------+
| Workflow number | Status |
+-----------------+--------+
| WF_000_twicer | S |
| WF_001_twicer | S |
| WF_002_twicer | W |
| WF_003_twicer | W |
| WF_004_twicer | W |
| WF_005_twicer | W |
| WF_006_twicer | W |
| WF_007_twicer | W |
| WF_008_twicer | W |
| WF_009_twicer | W |
+-----------------+--------+
Takeaway
The farming mode of Lemmings is basically a helper that replicates your workflow directory.
A single function, read_farming_params(), must be imported to retrieve the set of parameters relative to your farming member.
For the moment, remember to use >lemmings-farming clean between two runs. Rest assured: soon, lemmings will automagically take care of this.