Page MenuHomeElementl

ObjectStoreOperation/AssetStoreOperation -> HandledOutput/LoadedInput
ClosedPublic

Authored by yuhan on Dec 8 2020, 3:48 AM.

Details

Summary

RIP AssetStoreOperation and ObjectStoreOperation in event logs.

This diff does several things

  • introduces HandledOutput/LoadedInput to log the completion of IO manager's handle_output/load_input operations (previously in D5519 which is a bit hard to test separately without the core changes, so i moved it back to this diff)
  • completely gets rid of AssetStoreOperation which is replaced by HandledOutput/LoadedInput. https://github.com/dagster-io/dagster/issues/3509
  • no long emits ObjectStoreOperation for GET/SET objects in the event logs, notes:
    • ObjectStoreOperation as an event definition still exists because it would be yielded in the user code when a user write their own Intermediate Storage
    • in order to get rid of ObjectStoreOperation in the event logs and (kinda) avoid regression like https://github.com/dagster-io/dagster/issues/3368, this diff implements a stopgap solution to context.log.info the address of an intermediate object see [1]
Test Plan

bk

~/dev/dagster/examples/docs_snippets/docs_snippets/overview/io_managers
dagster-3.7.4 $ dagit -p 3333 -f default_io_manager.py

io manager:

Screen Shot 2021-01-11 at 3.13.57 PM.png (3×2 px, 1 MB)

intermediate storage:

Screen Shot 2021-01-11 at 3.33.28 PM.png (3×2 px, 1 MB)

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Might be good to split the diff into two or three: first two (AssetStoreOperation -> HandledOutput & LoadedInput) + do a full pass on ObjectStoreOperation deprecation

Yeah, agreed that there's a lot going on here. I split out the addition of HandledOutput and LoadedInput into a separate diff: https://dagster.phacility.com/D5519.

The reason I included the AssetStoreOperation and ObjectStoreOperation changes together in this diff was that I didn't want to further complicate the logic inside execute_step/ core_dagster_event_sequence_for_step.

What's your plan of attack with regards to ObjectStoreOperation and the intermediate storage wrapper? Are you planning to preserve ObjectStoreOperation?

sandyryza retitled this revision from handled_output and loaded_input events to ObjectStoreOperation/AssetStoreOperation -> HandledOutput/LoadedInput.Dec 9 2020, 11:10 PM

Im preserving ObjectStoreOperation for the first two diffs on my stack - so when i rebased off of this diff, it complicated my logic lol
I think not to further complicating execute_step logics makes sense here, but let's make it clear that ObjectStoreOperation is still some where in the code space and we will remove the rest on my end :)

lgtm

will let you to handle since you are coordinating interleaved stacks!

Overall I think the idea of replacing ObjectStoreOperation with HandledOutput and LoadedInput makes sense, but I think now we should hold it off until HandledOutput and LoadedInput log the same information as ObjectStoreOperation, i.e. file path.

python_modules/dagster/dagster/core/execution/plan/execute_step.py
379–382

given the regression in https://github.com/dagster-io/dagster/issues/3368, maybe holding off on removing ObjectStoreOperation for now would be safer.

yuhan edited reviewers, added: sandyryza; removed: yuhan.
yuhan edited the test plan for this revision. (Show Details)
yuhan edited the test plan for this revision. (Show Details)

up

squashed D5519 in because D5519 solo can't be tested

guard all log.dagster_event.event_type_value with log.is_dagster_event

Seems good. Will let others do a detailed review!

Looks almost there!

The pattern that we had with ObjectStoreOperation where we put the obj on it and then call serializable to strip it out is a little confusing. Now that the user code isn't responsible for generating the event, do we need this pattern anymore?

A couple potential ways of implementing this:

  • Add a get_load_event method to StepInputSource which returns an event with the appropriate params.
  • Have load_input_object return a LoadResult, which could be a tuple of LoadedEvent and obj
python_modules/dagster/dagster/core/execution/context/system.py
200

Any reason we need to move this?

This revision now requires changes to proceed.Jan 12 2021, 4:48 PM
python_modules/dagster/dagster/core/storage/intermediate_storage.py
65–82

[1]

105–126

[1]

I like LoadResult

python_modules/dagster/dagster/core/execution/context/system.py
200

yes in memoization.py we would need pipeline_context.using_io_manager(step_output_handle).
but im re visiting that function since i landed D5834 (default to mem_io_manager) just now.

@sandyryza re:

The pattern that we had with ObjectStoreOperation where we put the obj on it and then call serializable to strip it out is a little confusing. Now that the user code isn't responsible for generating the event, do we need this pattern anymore?

without this pattern, we would involve DagsterEvent in the inputs.py.

One thing im not sure about (which is why im creating this LoadedInput event def) is whether we want to consolidate all the yield DagsterEvent.some_event() event emitting only in core_dagster_event_sequence_for_step in execute_step.py. I see this is the current pattern. But if it's fine to yield DagsterEvent in load_input_object in input.py as well, i can switch back to the initial approach which yield event or value from load_input_object.
cc @alangenfeld ^regarding the core event sequence for step code path.

oops had some old inlines i forgot to submit too

we want to consolidate all the yield DagsterEvent.some_event() event emitting only in core_dagster_event_sequence_for_step in execute_step.py. I see this is the current pattern.

This doesn't seem critical to me, you'll still have an isinstance check in step.py so I don't think you are losing much understandability

python_modules/dagster/dagster/core/definitions/events.py
877–890

The pattern that we had with ObjectStoreOperation where we put the obj on it and then call serializable to strip it out is a little confusing. Now that the user code isn't responsible for generating the event, do we need this pattern anymore?

++ to ^

this code definitely stands out as very odd

python_modules/dagster/dagster/core/events/__init__.py
1135–1191

does it make sense to convert the old ObjectStore events to the new ones?

yield DagsterEvent in load_input_value

yuhan added inline comments.
python_modules/dagster/dagster/core/events/__init__.py
1135–1191

the old ObjectStoreOperation events also include CP_OBJECT and RM_OBJECT. will investigate it in a follow up to see if we can get rid of copy_required_intermediates_for_execution entirely and then handle back compat from there. if we cannot, it won't make too sense to convert because the old events would still exist in re-execution cases.

don't think it's release-blocking because dagit/gql would still handle the old events.

lgtm!

python_modules/dagster/dagster/core/storage/intermediate_storage.py
67

Nitpick: Store -> Stored

This revision is now accepted and ready to land.Jan 12 2021, 9:39 PM

handle_back_compat for AssetStoreOperation

yuhan added inline comments.
python_modules/dagster/dagster/core/events/__init__.py
1135–1191

While it's fine not to convert ObjectStoreOperation to the new events, we need to convert the AsseStoreOperation events to the new ones because they are completely gone and without the back compat, we will end up breaking our dogfood pipelines' historical run views in dagit, like https://github.com/dagster-io/dagster/issues/3533

requesting review again in terms of the handle_back_compat for AssetStoreOperation - sorry for the back and force.

This revision is now accepted and ready to land.Jan 12 2021, 10:28 PM
yuhan added inline comments.
python_modules/dagster/dagster/core/execution/plan/execute_step.py
261–270