
RFC: GBFS download pipeline
Closed, Public

Authored by max on Tue, Nov 19, 2:29 AM.

Details

Summary

Pipeline to download real-time JSON files from Bay Bikes.

Test Plan

Local scheduled & ad-hoc execution

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

max created this revision. Tue, Nov 19, 2:29 AM
max edited reviewers, added: nate, alangenfeld; removed: Restricted Project. Tue, Nov 19, 2:55 AM
themissinghlink accepted this revision. Tue, Nov 19, 3:15 AM

Seems fine! A few suggestions and open questions for my edification. We might also want an integration test for this to make sure things work, or at least a preset for confirmation?
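
For reference, a minimal sketch of what a confirmation preset could look like, written against the legacy @pipeline/PresetDefinition API; the preset name and the empty run_config are placeholders (and exact parameter names varied across Dagster versions at the time), so treat this as illustrative rather than what the diff actually ships:

from dagster import PresetDefinition, pipeline

@pipeline(
    preset_defs=[
        PresetDefinition(
            name='confirmation',
            # Placeholder: fill in whatever solid config a small smoke-test
            # run of the download solids needs.
            run_config={},
        )
    ]
)
def download_gbfs_files():
    ...

With a preset attached, a one-click run from dagit doubles as a cheap confirmation that the wiring still works.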

examples/dagster_examples/bay_bikes/gbfs/pipelines.py
14

Is the point of this pipeline to download schemas from gbfs periodically to make sure that the schemas haven't changed? If so, can we rename the pipeline to something to that effect, `sync_gbfs_schemas` or something?

15

Instead of making the same repetitive call, which can be a bit overwhelming, I find it easier on the eyes to throw the inputs into a constant list that sits outside of the pipeline, for example:

SCHEMA_INFORMATION = [
    ('schema/gbfs.json', 'https://gbfs.baywheels.com/gbfs/gbfs.json', 'download_and_validate_gbfs'),
    # ...
]

@pipeline
def download_gbfs_files():
    for expected_schema_file, bay_bikes_schema_url, solid_name in SCHEMA_INFORMATION:
        valid_schema_download_solid = download_and_validate_json(
            expected_schema_file, bay_bikes_schema_url, name=solid_name
        )
        valid_schema_download_solid()

That way you decouple setup from pipeline logic, which makes life easier for everyone when you inevitably start to make changes to the pipeline.

examples/dagster_examples/bay_bikes/gbfs/solids.py
49

This might just be me, but I actually like it when people are explicit about solid factories; that's up to you, though!

53

So this seems legit, but why go the composite solid route, where you have a ton of these factories, as opposed to a pipeline that aliases different types of solids that download the JSON, validate it, and save it to disk? Is it because you don't actually need to configure them manually via dagit?
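
For context, a rough sketch of the aliasing alternative described here, assuming the legacy solid API; the solid name, config keys, and config_schema spelling are illustrative and may differ from the Dagster version in this diff:

from dagster import pipeline, solid

@solid(config_schema={'url': str, 'target': str})
def download_validate_and_save(context):
    # Placeholder body: fetch context.solid_config['url'], validate it against
    # the expected schema, and write it to context.solid_config['target'].
    ...

@pipeline
def download_feeds():
    download_validate_and_save.alias('download_gbfs')()
    download_validate_and_save.alias('download_station_status')()

Here each aliased instance has to be configured separately at run time, which is exactly the manual-configuration burden the question is getting at.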

python_modules/dagster/dagster/core/definitions/decorators.py
673

Nice. I like this!!!!

This revision is now accepted and ready to land. Tue, Nov 19, 3:15 AM
max added inline comments. Tue, Nov 19, 5:39 PM
examples/dagster_examples/bay_bikes/gbfs/pipelines.py
14

hm, no. this assumes the schemas are already present and downloads the actual files; i'll try to find more descriptive names.

15

hm, ok

examples/dagster_examples/bay_bikes/gbfs/solids.py
49

what do you mean by explicit?

53

i'm not sure i understand...

examples/dagster_examples/bay_bikes/gbfs/solids.py
49

literally calling it a factory. That's a nit tho, up to you!

53

This is more for my edification. At a high level, why the composite solid approach as opposed to one big solid that is aliased for each schema type?

max added inline comments. Wed, Nov 20, 3:29 AM
examples/dagster_examples/bay_bikes/gbfs/solids.py
53

if i catch your drift, it's because these components are reusable and would typically come from a library shared across pipelines
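
To make that concrete, here's a hypothetical sketch of the factory-of-composite-solids pattern being discussed, assuming the legacy @solid/@composite_solid API; the individual solid names and bodies are placeholders standing in for the reusable, library-provided pieces, not the actual code in this diff:

from dagster import composite_solid, solid

@solid
def download_json(context, url: str) -> dict:
    # Placeholder: fetch and parse the JSON feed at `url`.
    ...

@solid
def validate_json(context, payload: dict) -> dict:
    # Placeholder: check the payload against the expected GBFS schema.
    ...

@solid
def save_json(context, payload: dict):
    # Placeholder: persist the payload to disk.
    ...

def download_and_validate_json_factory(name):
    # The factory returns a composite solid wiring download -> validate -> save,
    # so the same building blocks can be reused across pipelines.
    @composite_solid(name=name)
    def _download_validate_save(url: str):
        save_json(validate_json(download_json(url)))

    return _download_validate_save

Because the factory only composes pre-built solids, any pipeline in the repository can stamp out a named instance without re-implementing the download, validate, and save steps.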

max updated this revision to Diff 7084. Tue, Dec 3, 9:59 PM

Rebase

This revision was automatically updated to reflect the committed changes.