The SandCastle Farming Tutorial
The Sandcastle Farming tutorial follows the Sandcastle tutorial. Here you will see how to automatically replicate your workflow. You will again run the mock-up solver twicer using the fake job scheduler sandbox. Make sure you have already completed these tutorials.
This tutorial is part of the lemmings test suite. We will create the files here step by step, but you can fetch everything from tests/sandbox/test_sandcastle_farming.
A code with a parameter changing its output
We extend the code twicer.py so it can take one optional parameter: multiplier.
""" Mock up code twicer for lemmings tests """
import os
import yaml
### Main actions
def main(init_file, sol_path, multiplier=2):
"""Main function of twicer
init_file (str): initial solution path to read
sol_path (str): final solution folder
"""
print("Twicer execution...")
# Load data
print(f" - reading data from {init_file} ...")
sol = _load_data(init_file)
print(" - twicing...")
# Increase time
sol["time"]+=1
# Double the content of data
sol["data"] = [multiplier*dat for dat in sol["data"]]
# create output folder if missing
if not os.path.isdir(sol_path):
os.mkdir(sol_path)
# dump solution
out_file = sol_path+"/twicer_sol.yml"
print(f" - dumping data to {out_file} ...")
_dump_data(sol, out_file)
print(f"Execution complete")
### I/O using yaml
def _load_data(filename):
"""Load a yaml file as dict"""
with open(filename, "r") as fin:
data = yaml.load(fin, Loader=yaml.SafeLoader)
return data
def _dump_data(data,filename):
"""Write dict to a a yaml file"""
with open(filename, "w") as fout:
yaml.dump(data,fout, Dumper=yaml.SafeDumper)
## Main function call
if __name__ == "__main__":
"""If called directly from terminal"""
input_params = _load_data("twicer_in.yml")
init_file = input_params["init_sol"]
try:
multiplier = input_params["multiplier"]
except KeyError as e:
multiplier = 2
sol_path = input_params["out_path"]
main(init_file, sol_path, multiplier=multiplier)
The small try/except in the main call makes this code suitable for both input files:
init_sol: TEMP_0000/twicer_sol.yml
multiplier: 0.6
out_path: TEMP_0001
or
init_sol: TEMP_0000/twicer_sol.yml
out_path: TEMP_0001
where multiplier defaults to 2, as in the sandcastle tutorial.
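The fallback logic can also be written with dict.get(). The following is a hypothetical sketch of that logic, not part of twicer.py, shown with plain dicts instead of the YAML files:

```python
# Sketch of the multiplier fallback (hypothetical helper, not part of
# the tutorial code): a missing key falls back to the default of 2.
def get_multiplier(input_params, default=2):
    """Return the multiplier, or the default when the key is absent."""
    return input_params.get("multiplier", default)

full_input = {"init_sol": "TEMP_0000/twicer_sol.yml",
              "multiplier": 0.6,
              "out_path": "TEMP_0001"}
short_input = {"init_sol": "TEMP_0000/twicer_sol.yml",
               "out_path": "TEMP_0001"}

print(get_multiplier(full_input))   # 0.6
print(get_multiplier(short_input))  # 2
```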
Create a workflow for farming
Replicate the workflow five times
The farming mode requires additional data: the parameters that change between the members of a farming job. These parameters are provided using an extended version of the control file:
#---Required Settings--#
exec: | # your environment, slurm exec, module etc. (written in the batch)
  echo "hello world exec"
  pwd
  python twicer.py
exec_pj: | # exec for the PJ (written in the batch)
  echo "hello world exec_pj"
  pwd
job_queue: long # name of the job queue based on your '{machine}.yml'
pjob_queue: short # name of the post-job queue based on your '{machine}.yml'

#---Optional Settings--#
job_prefix: twicer # Add a prefix to the run_name

#---Custom Settings--#
custom_params:
  simulation_end_time: 3 # final end time desired

farming:
  active: True
  max_parallel_workflows: 2 # maximum number of workflows to submit at once
  parameter_array:
    - multiplier: 0.6
      dummy: 666
      foo: "bar"
    - multiplier: 0.8
    - multiplier: 1.0
    - multiplier: 1.4
    - multiplier: 2.8
We see at the bottom a farming section, with three elements:

- active: True allows the farming to be processed. Set it to False if, when encountering problems, you want to run a single workflow without erasing the whole section.
- max_parallel_workflows: 2 is the number of workflows run at the same time. Use this to adjust the greediness of your farming. Other people on the same HPC resources might get angry seeing a single user queuing hundreds of jobs.
- parameter_array: [] is the array of data you must provide. Each member gets one element of this list, called member_params. Elements can differ from one member to another. Indeed, one member can request a simple sub-model with zero inputs, and another a more complex sub-model with several inputs.
As you can see, this is deliberately simple: for 12 members in your farming, add 12 member params in the parameter_array.
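The mapping from parameter_array members to replicated workflows can be sketched as follows (the WF_XXX_<prefix> folder naming is assumed from the tree listings later in this tutorial):

```python
# Each parameter_array item becomes the member_params of one workflow.
parameter_array = [
    {"multiplier": 0.6, "dummy": 666, "foo": "bar"},
    {"multiplier": 0.8},
    {"multiplier": 1.0},
    {"multiplier": 1.4},
    {"multiplier": 2.8},
]
job_prefix = "twicer"

# One workflow folder per member, each holding its own member_params.
for wf_id, member_params in enumerate(parameter_array):
    print(f"WF_{wf_id:03d}_{job_prefix} -> {member_params}")
```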
Using the Farming CLI to create the farming structure
Your initial folder should look like the folder from the sandcastle tutorial:
> tree
.
├── TEMP_0000
│ └── twicer_sol.yml
├── sandbox.yml
├── sandcastle_farm.py
├── sandcastle_farm.yml
└── twicer.py
Here, we simply renamed sandcastle
files into sandcastle_farm
to avoid confusion.
Run the farming CLI, without starting the job scheduler sandbox first.
We have five members in the farming, with a limit of 2 simultaneous workflows.
Let's brace ourselves and recap, because a lot will happen: i) we launch a lemmings-farming run (called twicer_HODE56), which ii) starts a lemmings run on a first workflow WF_000 (twicer_LOCA75), iii) then starts a second lemmings run on the workflow WF_001 (twicer_MAVU18).
│> lemmings-farming run (twicer_HODE56)
├── WF_000 > lemmings run (twicer_LOCA75)
├── WF_001 > lemmings run (twicer_MAVU18)
├── WF_002
├── WF_003
└── WF_004
Ok, now you are ready: we dive in! First, launch the run without the sandbox, to see what happens without actually starting the work:
> lemmings-farming run --machine-file sandbox.yml --inputfile sandcastle_farm.yml --job-prefix twicer sandcastle_farm.py
INFO -
##############################
Starting Lemmings 0.8.0...
##############################
INFO - Job name :twicer_HODE56
INFO - Loop :1
INFO - Status :start
INFO - Worflow path :/Users/dauptain/TEST/test_workflow_farming/sandcastle_farm.py
INFO - Imput path :/Users/dauptain/TEST/test_workflow_farming/sandcastle_farm.yml
INFO - Machine path :/Users/dauptain/TEST/test_workflow_farming/sandbox.yml
INFO - Farming mode :True
Disclaimer: this automated job has been submitted by user dauptain
under his/her scrutiny. In case of a wasteful usage of this computer ressources,
the aforementionned user will be found, and forced to take responsibility...
- use `lemmings(-farming) status` to follow your jobs
- use `lemmings(-farming) cancel` to cancel it
- use `find . -name *.log` to see the log files created
INFO -
##############################
Starting Lemmings 0.8.0...
##############################
INFO - Job name :twicer_LOCA75
INFO - Loop :1
INFO - Status :start
INFO - Worflow path :/Users/dauptain/TEST/test_workflow_farming/WF_000_twicer/sandcastle_farm.py
INFO - Imput path :/Users/dauptain/TEST/test_workflow_farming/sandcastle_farm.yml
INFO - Machine path :/Users/dauptain/TEST/test_workflow_farming/WF_000_twicer/sandbox.yml
INFO - Farming mode :True
Disclaimer: this automated job has been submitted by user dauptain
under his/her scrutiny. In case of a wasteful usage of this computer ressources,
the aforementionned user will be found, and forced to take responsibility...
- use `lemmings(-farming) status` to follow your jobs
- use `lemmings(-farming) cancel` to cancel it
- use `find . -name *.log` to see the log files created
INFO - Lemmings START
INFO - Check on startTrue (False -> Exit)
INFO - Prior to job
INFO - Lemmings SPAWN
INFO - Prepare run
INFO - Submit batch 53630
INFO - Submit batch post job 53631
INFO - Launch workflow WF_000_twicer
INFO -
##############################
Starting Lemmings 0.8.0...
##############################
INFO - Job name :twicer_MAVU18
INFO - Loop :1
INFO - Status :start
INFO - Worflow path :/Users/dauptain/TEST/test_workflow_farming/WF_001_twicer/sandcastle_farm.py
INFO - Imput path :/Users/dauptain/TEST/test_workflow_farming/sandcastle_farm.yml
INFO - Machine path :/Users/dauptain/TEST/test_workflow_farming/WF_001_twicer/sandbox.yml
INFO - Farming mode :True
Disclaimer: this automated job has been submitted by user dauptain
under his/her scrutiny. In case of a wasteful usage of this computer ressources,
the aforementionned user will be found, and forced to take responsibility...
- use `lemmings(-farming) status` to follow your jobs
- use `lemmings(-farming) cancel` to cancel it
- use `find . -name *.log` to see the log files created
INFO - Lemmings START
INFO - Check on startTrue (False -> Exit)
INFO - Prior to job
INFO - Lemmings SPAWN
INFO - Prepare run
INFO - Submit batch 53635
INFO - Submit batch post job 53636
INFO - Launch workflow WF_001_twicer
INFO - Switch to exit
INFO - Reason : Replicate workflows launched according to max parallel chains 2
This log, quite verbose, describes what happened to the farming run and its first two workflows. The command starts the farming run. Without the sandbox job scheduler, nothing will actually run. However, the farming structure is already created. Have a look at it:
.
├── TEMP_0000
│ └── twicer_sol.yml
├── WF_000_twicer
│ ├── TEMP_0000
│ │ └── twicer_sol.yml
│ ├── _farming_params.json
│ ├── batch_job
│ ├── batch_pjob
│ ├── database.yml
│ ├── sandbox.yml
│ ├── sandcastle_farm.py
│ ├── twicer.py
│ ├── twicer_YUDA86
│ └── twicer_in.yml
├── WF_001_twicer
│ ├── TEMP_0000
│ │ └── twicer_sol.yml
│ ├── _farming_params.json
│ ├── batch_job
│ ├── batch_pjob
│ ├── database.yml
│ ├── sandbox.yml
│ ├── sandcastle_farm.py
│ ├── twicer.py
│ ├── twicer_ZOVI36
│ └── twicer_in.yml
├── WF_002_twicer
│ ├── TEMP_0000
│ │ └── twicer_sol.yml
│ ├── _farming_params.json
│ ├── sandbox.yml
│ ├── sandcastle_farm.py
│ └── twicer.py
├── WF_003_twicer
│ ├── TEMP_0000
│ │ └── twicer_sol.yml
│ ├── _farming_params.json
│ ├── sandbox.yml
│ ├── sandcastle_farm.py
│ └── twicer.py
├── WF_004_twicer
│ ├── TEMP_0000
│ │ └── twicer_sol.yml
│ ├── _farming_params.json
│ ├── sandbox.yml
│ ├── sandcastle_farm.py
│ └── twicer.py
├── database.yml
├── sandbox.yml
├── sandcastle_farm.py
├── sandcastle_farm.yml
└── twicer.py
Several observations:

- The initial folder has been replicated several times: one folder for each item of parameter_array.
- The first two folders already have their batch_job and batch_pjob created. This means the CLI lemmings-farming already tried to start as many workflows as possible according to max_parallel_workflows (here 2).
- A new file _farming_params.json popped up in these folders.

If you look at these _farming_params.json files, you will see they store the content of each parameter_array item.
>cat WF_000_twicer/_farming_params.json
{"multiplier": 0.6, "dummy": 666, "foo": "bar"}
>cat WF_001_twicer/_farming_params.json
{"multiplier": 0.8}}
We now have a workflow ready to run and the parameters at hand, but how do we connect the two?
Adapt your workflow to a farming
First, add to sandcastle_farm.py the import of the function read_farming_params():
from lemmings.chain.lemmingjob_base import LemmingJobBase, read_farming_params
This function automatically loads the parameter file relative to the workflow in use. Here we will add multiplier to twicer_in.yml from the prepare_run() method of our workflow:
def prepare_run(self):
    """
    Refresh the input file
    """
    last_id = last_folder_id()
    farm_params = read_farming_params()
    input_twicer = {
        "init_sol": _name_folder(last_id) + "/twicer_sol.yml",
        "multiplier": farm_params["multiplier"],
        "out_path": _name_folder(last_id + 1)
    }
    with open("./twicer_in.yml", "w") as fout:
        yaml.dump(input_twicer, fout)
In the present case, the variable farm_params is a dict loaded with:
{"multiplier": 0.6, "dummy": 666, "foo": "bar"}
The value of the key multiplier is added to the input file.
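To make the wiring concrete, here is a hypothetical trace of the dict that prepare_run() would build for member WF_000 on the first loop. The _name_folder() helper is assumed to format the TEMP_XXXX names seen in the folder listings:

```python
# Member parameters as stored in WF_000_twicer/_farming_params.json
farm_params = {"multiplier": 0.6, "dummy": 666, "foo": "bar"}
last_id = 0  # first loop: the last existing folder is TEMP_0000

def _name_folder(idx):
    """Assumed helper: format the TEMP_XXXX folder name."""
    return f"TEMP_{idx:04d}"

input_twicer = {
    "init_sol": _name_folder(last_id) + "/twicer_sol.yml",
    "multiplier": farm_params["multiplier"],
    "out_path": _name_folder(last_id + 1),
}
print(input_twicer["init_sol"])    # TEMP_0000/twicer_sol.yml
print(input_twicer["multiplier"])  # 0.6
print(input_twicer["out_path"])    # TEMP_0001
```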
Try the farming workflow… for real
Before running the farming workflow for real, clean the working directory with:
>lemmings-farming clean
Now start the job scheduler sandbox in a separate window, then resubmit your farming :
>lemmings-farming run --machine-file sandbox.yml --inputfile sandcastle_farm.yml --job-prefix twicer sandcastle_farm.py
You're about to start a farming of lemmings chains. Are you sure? (yes/no) yes
#---Starting farming mode---#
Initialise farming database.yml
Checking sandbox.yml
Overriding the $LEMMINGS_MACHINE by a user defined sandbox.yml
Parallel mode enabled
Starting chain named twicer_SAJI25
Starting chain named twicer_REFI93
Replicate workflows launched according to, max parallel chains = 2
You can follow the progress of the farming using :
>lemmings-farming status
S: Submitted, F: Finished, W: Wait, E: Error, K: Killed
+-----------------+--------+
| Workflow number | Status |
+-----------------+--------+
| WF_000_twicer | F |
| WF_001_twicer | F |
| WF_002_twicer | S |
| WF_003_twicer | S |
| WF_004_twicer | W |
+-----------------+--------+
Once the farming is done, you can check that each workflow produced all the results:
>tree WF_00*/TEMP*
WF_000_twicer/TEMP_0000
└── twicer_sol.yml
WF_000_twicer/TEMP_0001
└── twicer_sol.yml
WF_000_twicer/TEMP_0002
└── twicer_sol.yml
WF_001_twicer/TEMP_0000
└── twicer_sol.yml
WF_001_twicer/TEMP_0001
└── twicer_sol.yml
WF_001_twicer/TEMP_0002
└── twicer_sol.yml
WF_002_twicer/TEMP_0000
└── twicer_sol.yml
WF_002_twicer/TEMP_0001
└── twicer_sol.yml
WF_002_twicer/TEMP_0002
└── twicer_sol.yml
WF_003_twicer/TEMP_0000
└── twicer_sol.yml
WF_003_twicer/TEMP_0001
└── twicer_sol.yml
WF_003_twicer/TEMP_0002
└── twicer_sol.yml
WF_004_twicer/TEMP_0000
└── twicer_sol.yml
WF_004_twicer/TEMP_0001
└── twicer_sol.yml
WF_004_twicer/TEMP_0002
└── twicer_sol.yml
Check also that each workflow created different outputs:
>cat WF*/TEMP_0002/twicer_sol.yml
data:
- 0.36
- 1.7999999999999998
- 2.52
- 1.0799999999999998
time: 3.0
data:
- 0.6400000000000001
- 3.2
- 4.48
- 1.9200000000000004
time: 3.0
data:
- 1.0
- 5.0
- 7.0
- 3.0
time: 3.0
data:
- 1.9599999999999997
- 9.799999999999999
- 13.719999999999997
- 5.879999999999999
time: 3.0
data:
- 7.839999999999999
- 39.199999999999996
- 54.87999999999999
- 23.519999999999996
time: 3.0
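These numbers can be checked by hand. Each lemmings loop multiplies the data by the member's multiplier and advances time by 1, so with simulation_end_time set to 3 the multiplier is applied twice. A small sketch, assuming the initial solution holds data [1, 5, 7, 3] (which the multiplier-1.0 member WF_002 suggests):

```python
# Reproduce each member's final data by applying its multiplier twice,
# as twicer.py does over the two lemmings loops.
init_data = [1, 5, 7, 3]  # assumed initial TEMP_0000 solution

for multiplier in (0.6, 0.8, 1.0, 1.4, 2.8):
    data = list(init_data)
    for _ in range(2):  # two loops until simulation_end_time = 3
        data = [multiplier * d for d in data]
    print(multiplier, data)
```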
Congratulations, you have performed your first farming workflow!
Improve the workflow
A workflow for both farming and single runs
It is good practice to keep your workflow compatible with both farming and non-farming usages. The function read_farming_params() will return None if there is no _farming_params.json. Use this to make your prepare_run() suitable for both farming and non-farming runs:
def prepare_run(self):
    """
    Refresh the input file
    """
    last_id = last_folder_id()
    multiplier_val = 2
    pars = read_farming_params()
    if pars is not None:
        multiplier_val = pars["multiplier"]
    input_twicer = {
        "init_sol": _name_folder(last_id) + "/twicer_sol.yml",
        "multiplier": multiplier_val,
        "out_path": _name_folder(last_id + 1)
    }
    with open("./twicer_in.yml", "w") as fout:
        yaml.dump(input_twicer, fout)
Using the presence of _farming_params.json as the marker for farming mode is convenient, because you can resubmit any of the workflows from its own folder using >lemmings run (...).
A farming based on an external file
You can use an external file to provide your farming parameters. The format expected is a JSON list. For example, imagine some software generating the file:
import json

max_wf = 10
list_farming_params = list()
for wf_id in range(max_wf):
    alpha = 1. * wf_id / (max_wf - 1)
    params = {"multiplier": 2. + 1. * alpha, "wf_id": wf_id}
    list_farming_params.append(params)

with open("farming_params.json", "w") as fout:
    json.dump(list_farming_params, fout, indent=4)
You can look at the content of this file:
>cat farming_params.json
[
{
"multiplier": 2.0,
"wf_id": 0
},
{
"multiplier": 2.111111111111111,
"wf_id": 1
},
{
"multiplier": 2.2222222222222223,
"wf_id": 2
},
{
"multiplier": 2.3333333333333335,
"wf_id": 3
},
{
"multiplier": 2.4444444444444446,
"wf_id": 4
},
{
"multiplier": 2.5555555555555554,
"wf_id": 5
},
{
"multiplier": 2.6666666666666665,
"wf_id": 6
},
{
"multiplier": 2.7777777777777777,
"wf_id": 7
},
{
"multiplier": 2.888888888888889,
"wf_id": 8
},
{
"multiplier": 3.0,
"wf_id": 9
}
]
You must now adapt the workflow input file to reference this file:
#---Required Settings--#
exec: | # your environment, slurm exec, module etc. (written in the batch)
  echo "hello world exec"
  pwd
  python twicer.py
exec_pj: | # exec for the PJ (written in the batch)
  echo "hello world exec_pj"
  pwd
job_queue: long # name of the job queue based on your '{machine}.yml'
pjob_queue: short # name of the post-job queue based on your '{machine}.yml'
cpu_limit: null # The maximal CPU limit for your Lemmings chain [Hours]

#---Optional Settings--#
job_prefix: twicer # Add a prefix to the run_name

#---Custom Settings--#
custom_params:
  simulation_end_time: 3 # final end time desired

farming:
  active: True
  max_parallel_workflows: 2 # maximum number of workflows to submit at once
  parameter_file: "./farming_params.json"
You can re-submit your farming and see the results:
>lemmings-farming status
S: Submitted, F: Finished, W: Wait, E: Error, K: Killed
+-----------------+--------+
| Workflow number | Status |
+-----------------+--------+
| WF_000_twicer | S |
| WF_001_twicer | S |
| WF_002_twicer | W |
| WF_003_twicer | W |
| WF_004_twicer | W |
| WF_005_twicer | W |
| WF_006_twicer | W |
| WF_007_twicer | W |
| WF_008_twicer | W |
| WF_009_twicer | W |
+-----------------+--------+
Takeaway
The farming mode of Lemmings is basically a helper that replicates your workflow directory.
A single function, read_farming_params(), must be imported to retrieve the set of parameters relative to your farming member.
For the moment, remember to use >lemmings-farming clean between two runs. Rest assured: soon, lemmings will automagically take care of this.