
[caprisun] Run ops after assets
Changes Planned · Public

Authored by sandyryza on Jul 27 2021, 5:00 AM.

Details

Summary

This extends build_assets_job to enable inserting nodes that run after the nodes that update assets.

This enables specifying dependencies between post-nodes and particular assets. An alternative approach that is more restrictive, but simpler, would be to say that post-nodes run after every asset-node has completed.

I went with this approach because I believe it's often useful for post-nodes to have data dependencies on asset nodes. For example, imagine a job built to run an email campaign. The job contains an asset that is a table of all the emails that should be sent out, and it also includes a post-node that sends the emails. That post-node should have a single input: the dataset of emails. Also, if there are assets in the job that summarize the emails table, those shouldn't hold up sending the emails.
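To make the intended usage concrete, here's a minimal sketch of the email campaign example. Note the assumptions: `post_nodes` is a hypothetical name for the parameter this diff adds (the real one isn't shown here), and the sketch assumes dependencies between post-nodes and assets are inferred by matching input names; exact import paths varied during the experimental asset-API period.

```python
from dagster import asset, op, build_assets_job

@asset
def emails_to_send():
    # Stand-in for the real query that builds the table of emails to send.
    return [{"to": "user@example.com", "body": "hello"}]

@op
def send_emails(emails_to_send):
    # Single input: the emails_to_send asset. Summary assets elsewhere
    # in the job would not hold this node up.
    for email in emails_to_send:
        print(f"sending to {email['to']}")

# "post_nodes" is a hypothetical parameter name, not confirmed by this diff.
email_campaign_job = build_assets_job(
    "email_campaign",
    assets=[emails_to_send],
    post_nodes=[send_emails],
)
```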

Depends on D8969.

Test Plan

bk

Diff Detail

Repository
R1 dagster
Branch
nodes-after
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Jul 27 2021, 6:05 AM
Harbormaster failed remote builds in B34259: Diff 42342!

While I can def see the utility of the idea, this interface feels a bit off to me. From what I understand, the difference between nodes and logical assets isn't entirely cut and dried, and I'm wondering if people shouldn't be able to just add nodes to their list of assets, and we infer them as such / foist them into an asset?

Interesting. So in that world, the nodes would need to encode the assets they depend on, which I think means it comes down to whether we expect these nodes to be reusable. If they're not reusable, it's sensible for them to encode the assets they depend on. If they are reusable, it requires users to write factories, which is a pain (sketched below).
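To illustrate the factory pain point: if a reusable node can't name the asset it depends on up front, the user ends up stamping out a new op definition per asset, along these lines (all names here are hypothetical; `In(Nothing)` is just one way to express an order-only dependency):

```python
from dagster import In, Nothing, op

def make_notify_op(asset_name: str):
    # One op definition per asset the notification should follow.
    @op(name=f"notify_after_{asset_name}", ins={"start": In(Nothing)})
    def _notify():
        print(f"{asset_name} was updated")
    return _notify

notify_after_emails = make_notify_op("emails_to_send")
```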

@sandyryza Writing out / formalizing some stuff from our conversation yesterday, it feels like there are two distinct cases where a person might want to include things that don't produce "assets" (long-lived objects external to the dagster ecosystem) within their asset-oriented pipeline.

  1. You need to do some pre-computation (like downloading/transforming temp files that you don't care about after the run)
  2. You want to do something that has no data side effect after an asset has been created (alert stakeholders, update documentation, etc.)

I think these do have fundamentally different properties, especially when we imagine functionality to "refresh an asset".

With a pipeline that looks like:

(download s3 users log data to tmp file <not asset>)
  ----> @asset[bot filter users data, insert into database]
  ----> (query top users data, store to tmp file <not asset>)
  ----> (filter rows of tmp file <not asset>)
  ----> @asset[upload table to database]
  ----> (alert stakeholders <not asset>)
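For concreteness, here's the same graph sketched as plain ops in a job (using current Dagster names; at the time these would have been solids and pipelines), with comments marking which nodes are conceptually the assets. This is illustrative only, not the API the diff adds:

```python
from dagster import job, op

@op
def download_users_log():
    # <not asset>: temp file, discarded after the run
    return "/tmp/users.log"

@op
def bot_filter_users(log_path):
    # @asset in the picture above: filter bots, insert into database
    return f"filtered({log_path})"

@op
def query_top_users(users_table):
    # <not asset>: tmp file
    return f"top({users_table})"

@op
def filter_rows(tmp_file):
    # <not asset>: tmp file
    return f"rows({tmp_file})"

@op
def upload_table(rows):
    # @asset in the picture above
    return f"uploaded({rows})"

@op
def alert_stakeholders(table):
    # <not asset>: the "post-op"
    print(f"new table ready: {table}")

@job
def users_pipeline():
    alert_stakeholders(
        upload_table(filter_rows(query_top_users(bot_filter_users(download_users_log()))))
    )
```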

refreshing the bot-filtered users asset should only run the first two nodes of this pipeline. It would be inefficient / useless to perform the next two non-asset steps, as those tmp files would simply get deleted before the next run anyway. However, if we imagine refreshing the second asset, we actually would expect / want that non-asset entity (alert stakeholders) to run after the asset is created, as we've defined this "post-op" purely in the context of running after that particular asset.

There are definitely thoughts to be had on how we should actually represent the non-asset nodes that live between assets (graphs? transient assets?), but I think in either of those worlds, this post-op thing feels a little out of place as a node in the graph, considering its weird rerun semantics.

I thought about it a bit more, and it seems like we already have a concept for "thing that should get run after a node" with SolidHooks. Certainly they are a bit more limited in functionality than Ops (hard to chain together), but considering they have access to the output object of the node they're applied to, as well as the resource system, it seems like they should be expressive enough for any alerting-type behavior, have the benefit of easy reusability, have the correct rerun behavior, and don't require the addition of a new concept. Curious to hear your thoughts.
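For reference, a minimal hook along these lines, using the @success_hook decorator (the "email" resource is a hypothetical stand-in):

```python
from dagster import HookContext, success_hook

@success_hook(required_resource_keys={"email"})
def alert_stakeholders(context: HookContext):
    # Hooks have access to the resource system, so alerting-type behavior
    # is straightforward; this fires after the decorated node succeeds.
    context.resources.email.send(subject="emails table updated")

# Applied at invocation time inside a graph/job:
#   emails_table.with_hooks({alert_stakeholders})()
```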

There are a couple things I'm worried are missing from solid hooks:

  • Re-execution and retries. E.g. imagine you refresh an asset that's a table with a set of emails you want to send, and then you want to send the emails. If the email-sending code fails to connect to the email-sending service, you'd want to be able to re-execute that step without also re-executing the step that generates the table.
  • Executing in separate processes. Generating a table might require running a Spark job on a special Spark cluster or doing inference on a GPU. If you want to do a time-consuming operation afterwards (like sending a bunch of emails) that doesn't need the cluster or GPU, it would be wasteful to use the same process. (Both points are sketched after this list.)
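A sketch of what running this as a real node buys, using RetryPolicy (current name; the delivery call is a stand-in). As its own step it can be retried or re-executed alone, and under the multiprocess executor it runs in its own process:

```python
from dagster import RetryPolicy, op

def deliver(email):
    # Stand-in for a call to the email-sending service.
    print(f"sent {email}")

@op(retry_policy=RetryPolicy(max_retries=3, delay=30))
def send_emails(emails):
    # If connecting to the email service fails, this step retries (and can
    # be re-executed from the UI) without recomputing the upstream table,
    # and without occupying the Spark cluster / GPU that built it.
    for email in emails:
        deliver(email)
```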

Hmmm for the beefier use cases I wonder then if asset sensors would be the correct abstraction? For simple things like sending a few emails, I hear you on independent re-execution being nice to have, but I think there are a lot of practical cases where either a) the operation you want to execute is extremely unlikely to fail or b) if it fails, it's not a big deal (so you're happy to just use hooks). If these are not the case, then asset sensors will give you that extra structure (independent retries, monitoring, etc.), while fitting in with concepts that we've already established in the non-asset world.
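The asset-sensor version of the email example might look roughly like this (current names; everything here is illustrative, and the job is a trivial stand-in):

```python
from dagster import AssetKey, RunRequest, asset_sensor, job, op

@op
def send_emails():
    print("sending emails")

@job
def send_emails_job():
    send_emails()

@asset_sensor(asset_key=AssetKey("emails_to_send"), job=send_emails_job)
def emails_to_send_updated(context, asset_event):
    # Each new materialization of the emails table kicks off an independent
    # run, with its own retries and monitoring.
    yield RunRequest(run_key=context.cursor)
```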

sandyryza retitled this revision from "Run ops after assets" to "[caprisun] Run ops after assets". Aug 2 2021, 8:20 PM