Page MenuHomeElementl

feat: retrieve partition set names from external partition set in schedule based partitions
Needs RevisionPublic

Authored by rexledesma on Jun 3 2021, 4:24 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, May 27, 7:53 AM
Unknown Object (File)
Mon, May 22, 8:57 AM
Unknown Object (File)
Thu, May 18, 3:01 AM
Unknown Object (File)
Wed, May 17, 10:18 PM
Unknown Object (File)
Wed, May 17, 10:15 AM
Unknown Object (File)
Thu, May 11, 9:32 PM
Unknown Object (File)
Thu, May 11, 8:10 AM
Unknown Object (File)
Apr 8 2023, 11:55 PM
Subscribers
None

Details

Summary

Here, we serialize all the necessary information to reconstruct a schedule based partition set from the
parameters in the external partition set, rather than hydrating these values from the grpc server.

Since datetimes cannot be serialized, we deconstruct datetimes as components that can be serialized, and
use that to reconstruct the original datetime.

Test Plan

pytest

Diff Detail

Repository
R1 dagster
Branch
rl/external-partition-set-params
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

rexledesma held this revision as a draft.
rexledesma retitled this revision from Add partition params to external partition set to feat: retrieve partition set names from external partition set in schedule based partitions.Jun 11 2021, 5:47 PM
rexledesma edited the summary of this revision. (Show Details)

overall this makes sense to me - tweak some names and remove some extra timezone fields and i think we're good

python_modules/dagster/dagster/core/definitions/decorators/schedule.py
228–230

is this "if" backwards? this would make sense to me if it were if *not* isinstance (pendulum timestamps implement timestamp() correctly)

229–240

i don't think we need to include three timezones in this object - a start UTC timestamp, end UTC timestamp, and execution timezone should be sufficient. We can assume that the execution timezone is always UTC if it is not set, and assume that the start and end dates either have no timezone (in which case we can assume they are expressed in the execution timezone) or must be pendulum datetimes with a timezone that matches the execution timezone.

so i think the logic is something like in schedule_partition_range

tz = execution_timezone if execution_timezone else "UTC"

_start = (
    to_timezone(start_date, tz)
    if isinstance(start, PendulumDateTime)
    else pendulum.instance(start, tz=tz).  # Change "2020-01-01" to the same date, but in the right timezone
)
start_timestamp = _start.timestamp()
python_modules/dagster/dagster/core/definitions/partition.py
161

comment explaining the format here? day of month / day of week / etc. depending on the schedule type right?

(you could consider having separate fields so that they can be validated differently -day of month can go past 6)

178–182

as above, i think we can make this start_timestamp / end_timestamp / timezone and we're good

242–249

this doesn't have to be a perfect recreation of what they passed into the schedule function. We can coerce it into a correctly-timezoned pendulum time here even if they originally passed in a naive datetime

python_modules/dagster/dagster/core/host_representation/external.py
119

has_external_partition_names

122

get_external_partition_names (although as per above I think we need to adjust the name here to clarify that this won't always be there for all partition sets)

556

this should probably be called partition_names not partition_set_names

can we do more to explain here when/why this is set on some partition sets but not others? I think as a new person to this function I would be very surprised to find that this was not set on every partition set

I would say static_partition_names but that doesn't really encompass the time-based case

Is there a succinct way to express 'doesnt require running user code' and including that in the name here? I think that would help a lot with the confusion

python_modules/dagster/dagster/core/host_representation/repository_location.py
668–671

some naming tweaks will make this logic make a lot more sense I think, if we can just find the right names... I will sleep on it

if has_****_partition_names():
  compute it locally
else
  go out to gRPC

has_locally_computable_partition_names? Bit of a mouthful, but maybe more clear?

python_modules/dagster/dagster_tests/core_tests/host_representation_tests/test_repository_location.py
58–70

its too bad that this case does not get to take advantage of your stuff. are we still planning to introduce a breaking change to create_schedule_definition at some point to make it only work with time based partition param partition sets? (not blocking for this diff)

This revision now requires changes to proceed.Jun 15 2021, 4:09 AM