Benchmarking Utilities
Module odc.stac.bench provides utilities for benchmarking data loading. It is both a
library that can be used directly from a notebook and a command-line application.
Usage: python -m odc.stac.bench [OPTIONS] COMMAND [ARGS]...
Benchmarking tool for odc.stac.
Options:
--help Show this message and exit.
Commands:
dask Launch local Dask Cluster.
prepare Prepare benchmarking dataset.
report Collate results of multiple benchmark experiments.
run Run data load benchmark using Dask.
Define Test Site
To start you need to define a test site, or use one of the pre-configured examples. A site
configuration is a JSON file that describes a STAC API query and some other metadata. Below is the
definition of the s2-ms-mosaic sample site.
{
"file_id": "s2-ms-mosaic_2020-06-06--P1D",
"api": "https://planetarycomputer.microsoft.com/api/stac/v1",
"search": {
"collections": ["sentinel-2-l2a"],
"datetime": "2020-06-06",
"bbox": [ 27.345815, -14.98724, 27.565542, -7.710992]
}
}
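The search block maps directly onto STAC API search parameters (collections, datetime, bbox). Below is a small sketch, using only plain Python and no network access, that sanity-checks a site configuration before using it; the check_site helper is illustrative, not part of odc.stac.bench:

```python
import json
from datetime import date

# The sample site configuration from above
SITE = """
{
  "file_id": "s2-ms-mosaic_2020-06-06--P1D",
  "api": "https://planetarycomputer.microsoft.com/api/stac/v1",
  "search": {
    "collections": ["sentinel-2-l2a"],
    "datetime": "2020-06-06",
    "bbox": [27.345815, -14.98724, 27.565542, -7.710992]
  }
}
"""

def check_site(cfg: dict) -> None:
    # bbox is [west, south, east, north] in degrees
    west, south, east, north = cfg["search"]["bbox"]
    assert -180 <= west < east <= 180, "bad longitudes"
    assert -90 <= south < north <= 90, "bad latitudes"
    date.fromisoformat(cfg["search"]["datetime"])  # raises if malformed
    assert cfg["search"]["collections"], "need at least one collection"

site = json.loads(SITE)
check_site(site)
print("site config OK:", site["file_id"])
```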
This queries the Planetary Computer STAC API endpoint for the Sentinel-2 collection and stores the
results in a GeoJSON file named {file_id}.geojson. Try it now:
python -m odc.stac.bench prepare --sample-site s2-ms-mosaic
The command above will write a GeoJSON file to your current directory. We will use this file to run benchmarks later on.
Prepare Load Configuration
Let’s create a base data loading configuration file suitable for running benchmarks with the site
configuration produced previously. Save the example below as cfg.json.
{
"method": "odc-stac",
"bands": ["B02", "B03", "B04"],
"patch_url": "planetary_computer.sas.sign",
"extra": {
"stackstac": {
"dtype": "uint16",
"fill_value": 0
},
"odc-stac": {
"groupby": "solar_day",
"stac_cfg": {"*": {"warnings": "ignore"}}
}
}
}
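Note that the extra section holds per-method options: only the sub-dictionary whose key matches method applies to a given run, so one file can drive both loaders. A small sketch of that selection in plain Python (the selection logic here is illustrative, not odc.stac.bench internals):

```python
import json

# The cfg.json contents from above
CFG = """
{
  "method": "odc-stac",
  "bands": ["B02", "B03", "B04"],
  "patch_url": "planetary_computer.sas.sign",
  "extra": {
    "stackstac": {"dtype": "uint16", "fill_value": 0},
    "odc-stac": {"groupby": "solar_day", "stac_cfg": {"*": {"warnings": "ignore"}}}
  }
}
"""

cfg = json.loads(CFG)
# Only the options for the active method apply to a given run
method_opts = cfg.get("extra", {}).get(cfg["method"], {})
print(cfg["method"], "->", method_opts)
```

Switching --method on the command line would pick up the stackstac sub-dictionary instead, with no edits to the file.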
Making your own is simple:
1. Create a BenchLoadParams object
2. Modify configuration options to match your needs
3. Dump it to JSON
from odc.stac.bench import BenchLoadParams

# Start from defaults, then override the options you care about
params = BenchLoadParams()
params.scenario = "web-zoom-8"
params.bands = ["red", "green", "blue"]
params.crs = "EPSG:3857"
params.resolution = 610
params.chunks = (512, 512)
params.resampling = "bilinear"

print(params.to_json())  # paste the output into your cfg.json
Start Dask Cluster
Before we can run the benchmark we need to have an active Dask cluster. You can connect to a remote cluster or run a local one. A convenience local Dask cluster launcher is provided. In a separate shell run this command:
> python -m odc.stac.bench dask --memory-limit=8GiB
GDAL_DISABLE_READDIR_ON_OPEN = EMPTY_DIR
GDAL_HTTP_MAX_RETRY = 10
GDAL_HTTP_RETRY_DELAY = 0.5
GDAL_DATA = /srv/conda/envs/notebook/share/gdal
Launched Dask Cluster: tcp://127.0.0.1:43677
--scheduler='tcp://127.0.0.1:43677'
This will start a local Dask cluster, configure GDAL on the Dask workers and print out the address of
the Dask scheduler. Leave this running and take note of the --scheduler=... option that was
printed out; we will use it in the next step.
Run Benchmark
We are now ready to run some benchmarking with the run
command documented below:
Usage: python -m odc.stac.bench run [OPTIONS] SITE
Run data load benchmark using Dask.
SITE is a GeoJSON file produced by `prepare` step.
Options:
-c, --config FILE Experiment configuration in json format
-n, --ntimes INTEGER Configure number of times to run
--method [odc-stac|stackstac] Data loading method
--bands TEXT Comma separated list of bands
--chunks INTEGER... Chunk size Y,X order
--resolution FLOAT Set output resolution
--crs TEXT Set CRS
--resampling [nearest|bilinear|cubic|cubic_spline|lanczos|average|mode|gauss|max|min|med|q1|q3|sum|rms]
Resampling method when changing
resolution/projection
--show-config Show configuration only, don't run
--scheduler TEXT Dask server to connect to
--help Show this message and exit.
First let’s check the configuration. Note that we will run at a reduced resolution for quicker
turnaround (the --resolution=80 option). Command line arguments take precedence over configuration
parameters supplied in the JSON file.
python -m odc.stac.bench run \
s2-ms-mosaic_2020-06-06--P1D.geojson \
--config cfg.json \
--resolution=80 \
--show-config
If the above went well we can start the benchmark: remove the --show-config option and add the
--scheduler= option that was printed when we started the Dask cluster. Let’s also configure the
number of benchmarking passes to run with the -n 10 option.
python -m odc.stac.bench run \
s2-ms-mosaic_2020-06-06--P1D.geojson \
--config cfg.json \
--resolution=80 \
-n 10 \
--scheduler='tcp://127.0.0.1:43677'
Note
Don’t forget to edit the --scheduler= part of the above command.
This will first print out the configuration that will be used:
Loaded: 9 STAC items from 's2-ms-mosaic_2020-06-06--P1D.geojson'
Will use following load configuration
------------------------------------------------------------
{ /** NOTE: this section was edited for brevity **/
"scenario": "s2-ms-mosaic_2020-06-06--P1D",
"method": "odc-stac",
"chunks": [ 2048, 2048 ],
"bands": [ "B02", "B03", "B04" ],
"resolution": 80.0,
"crs": null,
"resampling": null,
"patch_url": "planetary_computer.sas.sign",
"extra": {
"stackstac": { "dtype": "uint16", "fill_value": 0 },
"odc-stac": { "groupby": "solar_day", "stac_cfg": {"*": {"warnings": "ignore" }}}
}
}
------------------------------------------------------------
followed by information about the data being loaded and some statistics about the Dask cluster on which the benchmark will run:
Connecting to Dask Scheduler: tcp://127.0.0.1:43677
Constructing Dask graph
Starting benchmark run (10 runs)
============================================================
Will write results to: s2-ms-mosaic_2020-06-06--P1D_20220104T080235.133458.pkl
method : odc-stac
Scenario : s2-ms-mosaic_2020-06-06--P1D
T.slice : 2020-06-06
Data : 1.3.11373.1374.uint16, 89.42 MiB
Chunks : 1.1.2048.1374 (T.B.Y.X)
GEO : epsg:32735
| 80, 0, 499920|
| 0,-80, 9200080|
Cluster : 1 workers, 4 threads, 8.00 GiB
------------------------------------------------------------
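The Data line encodes the array as T.B.Y.X.dtype, so the reported size can be checked by hand from the shape above (1 time slice, 3 bands, 11373 x 1374 pixels of uint16), as can the chunk layout:

```python
import math

t, b, y, x = 1, 3, 11373, 1374   # shape from the "Data" line (T.B.Y.X)
bytes_per_px = 2                 # uint16 is 2 bytes per pixel

npix = t * b * y * x
size_mib = npix * bytes_per_px / 2**20
print(f"{npix:,} pixels, {size_mib:.2f} MiB")   # -> 46,879,506 pixels, 89.42 MiB

# Requested 2048x2048 chunks clip to the image edge,
# hence the 2048.1374 chunk shape in the "Chunks" line
chunks_y = math.ceil(y / 2048)   # 6 chunks tall
chunks_x = math.ceil(x / 2048)   # 1 chunk wide (image width < 2048)
print(f"{chunks_y * chunks_x} spatial chunks per band")
```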
As benchmark runs complete, brief summaries are printed:
T.Elapsed : 2.845 seconds
T.Submit : 0.228 seconds
Throughput : 16.480 Mpx/second (overall)
| 4.120 Mpx/second (per thread)
------------------------------------------------------------
T.Elapsed : 2.448 seconds
T.Submit : 0.015 seconds
Throughput : 19.152 Mpx/second (overall)
| 4.788 Mpx/second (per thread)
... continues
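Throughput is derived from the pixel count and wall time: total pixels divided by T.Elapsed, with the per-thread figure dividing that by the cluster's thread count. Reproducing the first summary above (small differences come from the elapsed time being rounded in the printout):

```python
npix = 1 * 3 * 11373 * 1374   # pixel count from the "Data" line
elapsed = 2.845               # seconds, T.Elapsed as printed (rounded)
threads = 4                   # from the "Cluster" line

overall = npix / elapsed / 1e6
print(f"{overall:.3f} Mpx/second (overall)")
print(f"{overall / threads:.3f} Mpx/second (per thread)")
```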
You can terminate early without losing data with Ctrl-C. Benchmark results are saved after each
benchmark pass (overwriting the previous save-point) in case there is a crash or some other fatal
error.
Review Results
To convert benchmark results stored in .pkl file(s) to CSV, use the following:
python -m odc.stac.bench report *.pkl --output results.csv
The idea is to run benchmarks with different load configurations (different chunk sizes, for example, or a comparison of the relative costs of resampling modes) and then combine those results into one data table.
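From there the CSV can be aggregated with whatever tooling you prefer. Below is a standard-library sketch that averages elapsed time per method; the column names "method" and "elapsed" are assumptions here, so check the header of your actual results.csv:

```python
import csv
import statistics
from collections import defaultdict

def mean_elapsed_by_method(fp):
    """Average the (assumed) 'elapsed' column, grouped by the 'method' column."""
    groups = defaultdict(list)
    for row in csv.DictReader(fp):
        groups[row["method"]].append(float(row["elapsed"]))
    return {m: statistics.mean(v) for m, v in groups.items()}

# Usage:
#   with open("results.csv") as f:
#       print(mean_elapsed_by_method(f))
```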