Page MenuHomePhabricator

add PersistentTuple, to add layer of indirection from serialization
ClosedPublic

Authored by prha on Jul 8 2020, 7:53 PM.

Details

Summary

We want to be able to separate out construction of namedtuples from
user-land from the reconstruction of serialized namedtuples from persistence
layer. This adds hooks in the serdes layer to call per-class defined custom
serdes logic.

While the ideal API is a straight-up replacement of the namedtuple factory
function, that requires delving into some dark magic:
https://hg.python.org/cpython/file/b14308524cff/Lib/collections/__init__.py#l232

In future diffs, we could transform all of our persisted classes to use this
and enforce some checks in our storage layer.

Test Plan

bk

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

prha created this revision.Jul 8 2020, 7:53 PM
prha requested review of this revision.Jul 8 2020, 8:08 PM
schrockn added a comment.EditedJul 8 2020, 8:59 PM

I think it would be useful to show what PipelineRun would look like with this to clearly demonstrate the value

python_modules/dagster/dagster/serdes/__init__.py
165–176

I think this should be a different top-level API.
e.g.
@whitelist_for_persistence

schrockn added inline comments.Jul 8 2020, 9:00 PM
python_modules/dagster/dagster/serdes/__init__.py
73

is there a reason to use kwargs here rather than just have a storage_dict param?

We want to be able to separate out construction of namedtuples from user-land from the reconstruction of serialized namedtuples from persistence layer.

Is this mostly motivated by wanting to to move backcompat shit out of constructors or is there other context i am missing

python_modules/dagster/dagster/serdes/__init__.py
30

nit: related to comment below reading "persistence" as prefix all over confuses my brain. persistable makes more sense to me for example

68–74

this could work for any class, not just tuples right? IControlSerdes / ManualSerdes or something like that could be a better name. Maybe Persistable

max added a comment.Jul 9 2020, 3:22 PM

agree it would be helpful to see a motivating example

schrockn requested changes to this revision.Jul 9 2020, 4:22 PM

q mgmt

This revision now requires changes to proceed.Jul 9 2020, 4:22 PM
prha updated this revision to Diff 18365.Jul 9 2020, 5:38 PM
  • rename persistent, extract whitelist decorator
  • convert pipeline_run
schrockn added inline comments.Jul 9 2020, 5:40 PM
python_modules/dagster/dagster/core/storage/pipeline_run.py
129

oh i see kwargs allows you to list out arguments, which might be better?

prha added a comment.Jul 9 2020, 5:41 PM

It doesn't decrease the amount of code needed, but it separates out the backcompat from the constructor

python_modules/dagster/dagster/core/storage/pipeline_run.py
129

yeah, exactly... saves all the storage_dict.get garbage...

140

whoops, will remove.

in this form it seems mostly like just shuffling complexity around but im not opposed. Are there aspects we expect this to set up for?

python_modules/dagster/dagster/core/storage/pipeline_run.py
60–71

unless there are more ambitious plans for how this class/decorator will work and what it will do around persisting these artifacts, I feel like a more honest representation is @whitelist_for_serdes + ManualSerdes/ControlSerdes/SerdesOverride type of name

schrockn added a comment.EditedJul 9 2020, 5:52 PM

@alangenfeld it can be viewed as just shuffling around complexity quite a bit but what I do think it makes much, much clear is how the thing is supposed to be used when it is built by our business logic code as opposed to reconstructing from persistence.

python_modules/dagster/dagster/core/storage/pipeline_run.py
129

yeah i didn't think about that, my b.

prha added a comment.Jul 9 2020, 5:53 PM

One specific use case that triggered this, is the Materialization => AssetMaterialization rename. It's not really a rename so much as the deprecation of Materialization which should include warnings to switch over to AssetMaterialization. In order to do this, but not spam warnings whenever we reconstruct historical runs, we need to have separate paths between constructing new Materialization events, and reconstructing persisted Materialization events from the logs.

schrockn added a comment.EditedJul 9 2020, 5:54 PM

also having the persistence decorator I think it really valuable. right now it's impossible to know by inspection what serdes classes are persisted versus not. therefore if you add a param or field or whatever, you have to inspect the whole system and make sure that it is not persisted anywhere, else you risk breakage. this makes it much more clear what is actually persisted.

In future diffs, we could transform all of our persisted classes to use this and enforce some checks in our storage layer.

I missed ^ from the summary which answered my questions.

prha added inline comments.Jul 9 2020, 6:43 PM
python_modules/dagster/dagster/core/storage/pipeline_run.py
129

actually, i'm wrong.... in order to have the same signature, it needs to be **kwargs explicitly, which is not any better...

schrockn added inline comments.Jul 9 2020, 6:56 PM
python_modules/dagster/dagster/core/storage/pipeline_run.py
129

ah. well as a compromise there could be a convention of an inner method that has the explicit arguments and then explode the kwargs into that

prha updated this revision to Diff 18401.Jul 9 2020, 9:29 PM
  • use helper method with expanded kwargs
prha updated this revision to Diff 18420.Jul 10 2020, 12:01 AM
  • add kwargs for catchall
schrockn requested changes to this revision.Jul 10 2020, 1:32 AM

req'ing changes mostly based on the comment about the __class__ key

python_modules/dagster/dagster/core/storage/pipeline_run.py
129

to go back all the way around, can be just eliminate the helper method and explode the kwargs into the call to from_storage_dict

154

warn if kwargs has values?

python_modules/dagster/dagster/serdes/__init__.py
182

could be slightly cleaner to consolidate these three maps into a single maps namedtuple tuple. it feels like a
data clump if you buy into that

191–195

is there a reason why we don't include the __class__ attribute in the persistence case?

python_modules/dagster/dagster/utils/__init__.py
479–480

this dark magic is very comment worthy :-)

This revision now requires changes to proceed.Jul 10 2020, 1:32 AM

also another question: do we have to change any of the verbiage or logic in _check_serdes_tuple_class_invariants to account for this? Seems like verbiage should be changed at a minimum, but I think we might be able to get rid of some of it?

prha updated this revision to Diff 18525.Jul 10 2020, 11:32 PM
  • combine serdes whitelist dictionaries, rename to_storage_dict => to_storage_value
prha added a comment.Jul 10 2020, 11:35 PM

I think _check_serdes_tuple_class_invariants is actually fine... it's checking the __new__ arguments against the underlying named tuple. None of those invariants should change, since we're still supporting the constructor both for persistable and the non-persistable...

python_modules/dagster/dagster/serdes/__init__.py
191–195

This is kind of underspecified, but I think we want the flexibility to create whatever stored value we want, even if it's a different __class__.

The storage dict returned by default_from_storage_dict has the appropriate __class__ value.

I'll update this and rename to from_storage_value...

prha added inline comments.Jul 10 2020, 11:37 PM
python_modules/dagster/dagster_tests/test_serdes.py
358–360 ↗(On Diff #18525)

@schrockn this last test should flex the different __class__ behavior

schrockn accepted this revision.Mon, Jul 13, 5:25 PM

This is great.

Would like to see @max's feedback before merge but I'm very supportive.

python_modules/dagster/dagster/core/storage/pipeline_run.py
224

weird that this pylint error happens

python_modules/dagster/dagster/serdes/__init__.py
296–298

not a huge fan of defaults in these cases, would rather callsites all be explicit. there should only be a single production call site anyways so default seems unnecessary

This revision is now accepted and ready to land.Mon, Jul 13, 5:25 PM
prha updated this revision to Diff 18658.Mon, Jul 13, 8:49 PM
  • remove default_to_storage_value default
prha retitled this revision from RFC: add PersistentTuple, to add layer of indirection from serialization to add PersistentTuple, to add layer of indirection from serialization.Mon, Jul 13, 9:09 PM