docs/content/integrations/dagstermill.mdx
[34 lines not shown]
> - [K-means clustering for the Iris data set](https://github.com/dagster-io/dagster/blob/0.10.8/examples/docs_snippets/docs_snippets/legacy/data_science/iris-kmeans.ipynb).

Like many notebooks, this example does some fairly sophisticated work, producing diagnostic plots and a (flawed) statistical model -- which are then locked away in the .ipynb format, can only be reproduced using a complex Jupyter setup, and are only programmatically accessible within the notebook context.

We can simply turn a notebook into a solid using <PyObject module="dagstermill" object="define_dagstermill_solid" />. Once we create a solid, we can start to make its outputs more accessible.
```python file=/legacy/data_science/iris_pipeline.py
import dagstermill as dm
from dagster import (
    ModeDefinition,
    fs_io_manager,
    local_file_manager,
    pipeline,
)
from dagster.utils import script_relative_path

k_means_iris = dm.define_dagstermill_solid(
    "k_means_iris", script_relative_path("iris-kmeans.ipynb")
)


@pipeline(
    mode_defs=[
        ModeDefinition(
            resource_defs={
                "io_manager": fs_io_manager,
                "file_manager": local_file_manager,
            }
        )
    ]
)
def iris_pipeline():
    k_means_iris()
```
This is the simplest form of notebook integration -- we don't actually have to make any changes in the notebook itself to run it using the Dagster tooling. Just run:
[35 lines not shown]
import dagstermill as dm
from dagster import InputDefinition, ModeDefinition, fs_io_manager, pipeline
from dagster.utils import script_relative_path
from docs_snippets.legacy.data_science.download_file import download_file

k_means_iris = dm.define_dagstermill_solid(
    "k_means_iris",
    script_relative_path("iris-kmeans_2.ipynb"),
    input_defs=[
        InputDefinition(
            "path", str, description="Local path to the Iris dataset"
        )
    ],
)


@pipeline(
    mode_defs=[ModeDefinition(resource_defs={"io_manager": fs_io_manager})]
)
def iris_pipeline():
    k_means_iris(download_file())
```
We'll configure the `download_file` solid with the URL to download the file from, and the local path at which to save it. This solid has one output: the path to the downloaded file.
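As a rough sketch of what that configuration might look like, the run config below nests the `url` and `path` values under the solid's name. The URL and local path are hypothetical placeholders, not values taken from the example repository.

```python
import json

# Hypothetical values for illustration only; substitute a real dataset URL
# and a writable local path.
IRIS_URL = "https://example.com/iris.csv"
LOCAL_PATH = "/tmp/iris.csv"

# Solid config lives under "solids" -> <solid name> -> "config" and must
# match the solid's config_schema ("url" and "path", both strings).
run_config = {
    "solids": {
        "download_file": {
            "config": {"url": IRIS_URL, "path": LOCAL_PATH},
        }
    }
}

print(json.dumps(run_config, indent=2))
```

A dictionary like this would typically be passed as the `run_config` argument when executing the pipeline, or written out as YAML for use with the Dagster tooling.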
```python file=/legacy/data_science/download_file.py
from urllib.request import urlretrieve

from dagster import Field, OutputDefinition, String, solid
from dagster.utils import script_relative_path


@solid(
    name="download_file",
    config_schema={
        "url": Field(
            String, description="The URL from which to download the file"
        ),
        "path": Field(
            String, description="The path to which to download the file"
        ),
    },
    output_defs=[
        OutputDefinition(
            String,
            name="path",
            description="The path to which the file was downloaded",
        )
    ],
    description=(
        "A simple utility solid that downloads a file from a URL to a path using "
        "urllib.urlretrieve"
    ),
)
def download_file(context):
[72 lines not shown]
As with the parameters that dagstermill injects, you can also construct a context object for interactive exploration and development by using the `dagstermill.get_context` API in the tagged `parameters` cell of your input notebook. When dagstermill executes your notebook, this development context will be replaced with the injected runtime context.

You can use the development context to access solid config and resources, to log messages, and to yield results and other Dagster events just as you would in production. When the runtime context is injected by dagstermill, none of your other code needs to change.

For instance, suppose we want to make the number of clusters (the \_k\_ in k-means) configurable. We'll change our solid definition to include a config field:
```python literalinclude showLines emphasize-lines=10-12 caption=iris_pipeline_3.py file=/legacy/data_science/iris_pipeline_3.py
import dagstermill as dm
from dagster import (
    Field,
    InputDefinition,
    Int,
    ModeDefinition,
    fs_io_manager,
    pipeline,
)
from dagster.utils import script_relative_path
from docs_snippets.legacy.data_science.download_file import download_file

k_means_iris = dm.define_dagstermill_solid(
    "k_means_iris",
    script_relative_path("iris-kmeans_2.ipynb"),
    input_defs=[
        InputDefinition(
            "path", str, description="Local path to the Iris dataset"
        )
    ],
    config_schema=Field(
        Int,
        default_value=3,
        is_required=False,
        description="The number of clusters to find",
    ),
)


@pipeline(
    mode_defs=[ModeDefinition(resource_defs={"io_manager": fs_io_manager})]
)
def iris_pipeline():
    k_means_iris(download_file())
```
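Because the config field is a single optional `Int` with a default of 3, the run config entry for `k_means_iris` can be either a bare integer or left out entirely. The helper below is a hypothetical illustration (`k_means_run_config` is not part of Dagster or dagstermill) of how that fragment might be built:

```python
def k_means_run_config(k=None):
    """Build the `k_means_iris` fragment of a run config.

    Hypothetical helper for illustration. Passing None omits the entry,
    so the solid falls back to its declared default_value of 3.
    """
    if k is None:
        return {}
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(k, bool) or not isinstance(k, int) or k < 1:
        raise ValueError("k must be a positive integer")
    # The schema is a scalar Field(Int), so "config" holds the integer itself.
    return {"solids": {"k_means_iris": {"config": k}}}


print(k_means_run_config(4))
print(k_means_run_config())
```

In a full run config this fragment would sit alongside the `download_file` entry under the same `solids` key, since that solid's `url` and `path` still need to be supplied.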
In our notebook, we'll stub the context as follows (in the `parameters` cell):

<!-- do not hardcode code snippets https://github.com/dagster-io/dagster/issues/2706 -->
[32 lines not shown]