Page MenuHomeElementl

s3 sensor toy pipeline
ClosedPublic

Authored by prha on Dec 1 2020, 5:52 AM.

Details

Diff Detail

Repository
R1 dagster
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

prha requested review of this revision.Dec 1 2020, 6:13 AM

This makes sense to me. @schrockn did you want to do a pass here as well

python_modules/dagster-test/dagster_test/toys/log_s3.py
13–16

read_s3_key?

python_modules/dagster-test/dagster_test/toys/sensors.py
57–74

since there are two kinds of keys here (run keys and s3 keys) it might be worth being more explicit about which are which (which probably involves prefixing all the s3 keys with s3_)

60–62

this seems more like an exception than a SkipReason?

python_modules/libraries/dagster-aws/dagster_aws/s3/sensor.py
30–31

is there a weird edge case here when it returns exactly MAX_KEYS because that's exactly how many there are? Or does that just make it do one more loop with 0 returned and then we're fine

python_modules/dagster-test/dagster_test/toys/sensors.py
58

i do think we will evitably attach resources to this stuff btw but no need to address it now :-)

65–67

how is last_completion_time computed. I was imagining querying the runs database here and getting the last run_key directly. That would seem to be a nice generalizable pattern where you can implement reconciliation logic

python_modules/libraries/dagster-aws/dagster_aws/s3/sensor.py
10

consider passing in s3_session so this can be tested easily

python_modules/dagster-test/dagster_test/toys/sensors.py
65–67

believe last_completion_time is the time we ran the last successful loop - so if you fail partway through that would not affect last_completion_time

at one point we had a 'cursor' field on context which was exactly that (the last run_key), not sure where exactly we landed there. It's also nice to expose the time as well for the poor souls who don't use run keys I think?

python_modules/dagster-test/dagster_test/toys/sensors.py
65–67

or even if they are using run keys, the data might still only be naturally range-queryable by time rather than by key?

I absolutely think we should expose both, but I do think last_run_key is the more natural way to express reconciliation

schrockn requested changes to this revision.Dec 1 2020, 6:50 PM

interested to see implementation that uses last_run_key. feel free to bounce back in my queue if you want to proceed with this

This revision now requires changes to proceed.Dec 1 2020, 6:50 PM

add another sensor type for s3 bucket, one for updates, one for new files only

dgibson added inline comments.
python_modules/dagster-test/dagster_test/toys/sensors.py
91

re: the other diff - this seems good, and it's probably establishing a convention that you yield a SkipReason when there's no new updates?

schrockn requested changes to this revision.Dec 2 2020, 4:27 PM

So in the end what is your assessment of the time vs key approach? I'm curious to hear about the tradeoffs. A docblock in the test case explaining the tradeoffs would be fantastic.

This revision now requires changes to proceed.Dec 2 2020, 4:27 PM

looking good! will let someone else do final accept

This revision is now accepted and ready to land.Jan 6 2021, 7:25 PM
This revision was automatically updated to reflect the committed changes.