
add structured logging to bay bikes
Closed, Public

Authored by themissinghlink on Thu, Mar 19, 10:09 PM.

Details

Summary

changelog

  • Redid the DB read/write logic so that we use JSON blobs instead of strict schemas, leaving schema validation to the type system.
  • Added the right materializations to ensure the logging story makes sense.
  • Updated tests.

Call out: I added a redundant pipeline because the backfill API doesn't allow for sub-solid selection.

To use: run the backfill API on the weather pipeline to backfill weather data for the year 2020:

dagster pipeline backfill daily_weather_pipeline --mode=development

Then use the playground to execute trip_etl to get trip data for 2020, with this config:

solids:
  trip_etl:
    solids:
      download_baybike_zipfile_from_url:
        inputs:
          file_name:
            value: 202001-baywheels-tripdata.csv.zip
          base_url:
            value: https://s3.amazonaws.com/baywheels-data
      load_baybike_data_into_dataframe:
        inputs:
          target_csv_file_in_archive:
            value: 202001-baywheels-tripdata.csv.zip
      insert_trip_data_into_table:
        inputs:
          table_name:
            value: trips_staging

and then run the rest of the pipeline subset!
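
Equivalently, from the Python API. This is a sketch only: it assumes the 0.7-era execute_pipeline/RunConfig signatures, a hypothetical import path for the pipeline, and build_sub_pipeline for subset selection (also an assumption); the environment dict just mirrors the YAML above.

from dagster import RunConfig, execute_pipeline
from dagster_examples.bay_bikes.pipelines import generate_training_set_and_train_model

# mirrors the playground YAML above
environment_dict = {
    'solids': {
        'trip_etl': {
            'solids': {
                'download_baybike_zipfile_from_url': {
                    'inputs': {
                        'file_name': {'value': '202001-baywheels-tripdata.csv.zip'},
                        'base_url': {'value': 'https://s3.amazonaws.com/baywheels-data'},
                    }
                },
                'load_baybike_data_into_dataframe': {
                    'inputs': {
                        'target_csv_file_in_archive': {
                            'value': '202001-baywheels-tripdata.csv.zip'
                        }
                    }
                },
                'insert_trip_data_into_table': {
                    'inputs': {'table_name': {'value': 'trips_staging'}}
                },
            }
        }
    }
}

# select just the trip_etl subtree, then execute it in the development mode
trip_etl_subset = generate_training_set_and_train_model.build_sub_pipeline(['trip_etl'])
execute_pipeline(
    trip_etl_subset,
    environment_dict=environment_dict,
    run_config=RunConfig(mode='development'),
)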

Test Plan

unit

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

themissinghlink added a reviewer: schrockn.
  • fix json_normalize so that it works for python3.5
  • disable pylint import issues with json_normalize
  • fixed snapshots
  • tweaked versions
  • got rid of errant prints
schrockn requested changes to this revision.Fri, Mar 20, 12:34 PM

Looks pretty good. Mostly req'ing changes because I can't load it locally:

(dagster-dev-3.7.6) [schrockn@mbp ~/code/dagster/examples (arcpatch-D2291)]$ python --version
Python 3.7.6
(dagster-dev-3.7.6) [schrockn@mbp ~/code/dagster/examples (arcpatch-D2291)]$ pip list | grep pandas
dagster-pandas                0.7.4rc0    /Users/schrockn/code/dagster/python_modules/libraries/dagster-pandas
geopandas                     0.6.2
pandas                        0.25.3
examples/dagster_examples/bay_bikes/schedules.py:56

seems like this could just be a vanilla dictionary rather than a merge
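
Something like this, purely illustrative (the names below are made up, not the actual code at that line):

# before: merging dicts just to build a static value
base_config = {'database': 'bay_bikes'}
query_config = dict(base_config, resolution='daily')

# after: nothing here is dynamic, so a vanilla dictionary literal is clearer
query_config = {'database': 'bay_bikes', 'resolution': 'daily'}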

examples/dagster_examples/bay_bikes/solids.py:190

trailing comma after ) can be deleted

examples/dagster_examples/bay_bikes/solids.py:195

expected_dagster_pandas_dataframe_type instead of expected_dagster_pandas_dataframe_schema?

examples/dagster_examples/bay_bikes/solids.py:195

also given that this is a function that is dynamically creating a solid, you might want to consider naming it to indicate that, like `create_download_table_as_dataframe_solid` or something
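
i.e. something shaped like this (a sketch with made-up internals, using the 0.7-era @solid decorator):

from dagster import OutputDefinition, solid

def create_download_table_as_dataframe_solid(name, expected_dagster_pandas_dataframe_type):
    # the create_*_solid prefix signals that this function returns a
    # SolidDefinition rather than being a solid itself
    @solid(name=name, output_defs=[OutputDefinition(expected_dagster_pandas_dataframe_type)])
    def _download_table_as_dataframe(context):
        ...  # download the table and return it as a validated dataframe

    return _download_table_as_dataframe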

examples/dagster_examples/bay_bikes/solids.py:206

so here the default_value is []

examples/dagster_examples/bay_bikes/solids.py:216

but here you use .get, which allows 'subsets' to not be there.

I would recommend getting rid of the default_value in the subsets config field and allowing it to be absent.
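
Concretely, a sketch (the solid here is made up; Field/is_optional follow the 0.7-era config API):

from dagster import Field, solid

@solid(config={'subsets': Field([str], is_optional=True)})  # note: no default_value
def materialize_subsets(context):
    # handle absence explicitly at the call site instead of via default_value
    subsets = (context.solid_config or {}).get('subsets', [])
    for subset in subsets:
        context.log.info('materializing subset {}'.format(subset))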

This revision now requires changes to proceed.Fri, Mar 20, 12:34 PM

Will make the changes. RE the local issue: I realized json_normalize only works with pandas>=1.0.0. Can I add that dependency to the setup.py?
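
For reference, the import moved to the top level in pandas 1.0, so a small compatibility shim would be an alternative to pinning:

try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # pandas < 1.0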

themissinghlink marked 6 inline comments as done.
  • made pandas 1.0.0 a requirement for baybikes
  • update snapshots
  • readded fix for python version
themissinghlink edited the summary of this revision. Fri, Mar 20, 9:12 PM

btw airflow complains about pandas 1.0.0

unlikely to cause an issue, but just noting it in case something weird comes up

Given all of these compatibility issues, I was debating making a baybikes extra with the pandas 1.0.0 dependency and changing the tox file for tests. The tradeoff is that it adds friction for people getting things working, so I figured I would wait to do this until an actual issue arose.

Yeah, I think it's fine; was just noting it in case some weird bug pops up later.


schrockn requested changes to this revision.Fri, Mar 20, 10:05 PM

So it's a bit confusing.

There is the standalone weather_etl pipeline. It fails because it doesn't find the DARK_SKY env variable. There are no instructions on how to resolve this.

Then the generate_training_set_and_train_model pipeline

It has a weather_etl preset. But there's a config parse error.

Same with the other presets.

Any preset should have a unit test to make sure config parses.
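
A sketch of such a test (preset_defs and environment_dict are assumptions about the 0.7-era PresetDefinition API, and the import path is hypothetical):

import pytest

from dagster_examples.bay_bikes.pipelines import generate_training_set_and_train_model

@pytest.mark.parametrize(
    'preset', generate_training_set_and_train_model.preset_defs, ids=lambda p: p.name
)
def test_preset_config_parses(preset):
    # at a minimum the preset's environment should load into a non-empty dict;
    # pushing it through the pipeline's config schema would be stricter still
    assert preset.environment_dict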

The partitions load, but they also fail because of the DARK_SKY thing. I don't know how to resolve it.

This revision now requires changes to proceed.Fri, Mar 20, 10:05 PM

So I think you have to write a README with some instructions as to how to set this up.

themissinghlink added a comment. (Edited) Fri, Mar 20, 10:10 PM

In hindsight, I should definitely get a README going. While I do that, in the meantime: the DARK_SKY issue is an API key you need to fetch weather data (https://darksky.net/dev). As far as the presets go, because we smashed everything into one pipeline, the presets correspond to the sub-solids they represent. In the images shown above, you are selecting "*", but you really want to select either weather_etl, trip_etl, or produce_trip_dataset* produce_weather_dataset*.

If we want this to work smoothly from the get-go, I wonder whether I should provide a SQL dump to populate the DBs, so that people can just execute produce_trip_dataset* produce_weather_dataset* and don't have to deal with the ETL sub-pipelines before training a model.

themissinghlink edited the summary of this revision.
  • added basic readme
  • updated snapshots
alangenfeld resigned from this revision.Mon, Mar 23, 6:39 PM

I'll let schrock finish this one out.


schrockn requested changes to this revision.Mon, Mar 23, 6:49 PM

I still can't load any of the presets

This revision now requires changes to proceed.Mon, Mar 23, 6:49 PM
  • ensured presets work with solid subsets

alright we are good to go!

schrockn requested changes to this revision.Mon, Mar 23, 11:52 PM

Good stuff. Code looks good and I was able to run it locally.

Just req'ing to see if you think these are good ideas:

  • Seems like we could make everything simpler by combining all the solids in "produce_trip_dataset* produce_weather_dataset*" into a single composite? There will still be parallelism because it will be compiled out in the execution plan.
  • Would be good to also have a preset that runs the entire pipeline? (See the sketch after this list.)
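
For the second bullet, something like this (a sketch assuming the 0.7-era PresetDefinition API with environment_files; the YAML path is made up):

from dagster import PresetDefinition
from dagster.utils import file_relative_path

full_pipeline_preset = PresetDefinition(
    name='all',
    mode='development',
    environment_files=[file_relative_path(__file__, 'environments/full_pipeline.yaml')],
)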

If you agree and want to address in follow on diffs I'm up for that too

examples/dagster_examples/bay_bikes/README.md:6

problem

examples/dagster_examples/bay_bikes/README.md:17

So, no need to block the diff, but asking people to sign up for a service to run an example is a heavy lift. At a minimum we should have a mode with no external deps.

examples/dagster_examples/bay_bikes/README.md:45

"sub pipelines" reads a bit odd since we don't have a concept of sub-pipelines.

This revision now requires changes to proceed.Mon, Mar 23, 11:52 PM

RE 1: I think that is possible, but then the pipeline is just 3 composite solids that don't actually fit together, which could be confusing. What problem is this trying to solve?

RE 2: I mean we could, but then it gets weird because weather_etl/trip_etl are really not meant to be run with the rest of the pipeline. They exist to represent an implicit data dependency between produce_trip_dataset* produce_weather_dataset* and weather_etl/trip_etl.

examples/dagster_examples/bay_bikes/README.md:17

Yeah, I was thinking the best way to solve this problem is to provide CSV files pre-populated with data. So one mode would just read those CSVs instead of using a database?
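
i.e. a second mode whose resource reads bundled CSVs instead of hitting the database, roughly like this (resource and mode names are made up; @resource and ModeDefinition per the 0.7-era API):

import pandas as pd

from dagster import ModeDefinition, resource

@resource
def csv_trip_store(_init_context):
    # read the pre-populated CSV shipped with the example (path illustrative)
    return pd.read_csv('data/trips.csv')

@resource
def postgres_trip_store(init_context):
    ...  # the existing database-backed resource would live here

modes = [
    ModeDefinition(name='local_csv', resource_defs={'trip_store': csv_trip_store}),
    ModeDefinition(name='development', resource_defs={'trip_store': postgres_trip_store}),
]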

what do you mean they don't fit together?

Actually forget it. I am still getting used to the semantics of composite solids.

train_model(trip_etl(), weather_etl())

looks perfectly fine. I will add it in this diff with the README fixes since it's a mishmash of stuff anyway.
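
Roughly, for the squash (a sketch using the 0.7-era @composite_solid; trip_etl, weather_etl, and train_model stand for the existing solids in this diff):

from dagster import composite_solid

@composite_solid
def train_daily_bike_supply_model():
    # the composition function wires outputs to inputs; parallelism between
    # trip_etl and weather_etl is preserved when the plan is compiled
    return train_model(trip_etl(), weather_etl())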

  • squash everything into train_daily_bike_supply_model
  • fixed tests

The new pipeline with three composite solids is ready for review. I'm unsure how to have a preset that runs the entire pipeline. The reasoning is that weather_etl inserts just a row for a day, trip_etl inserts a bunch of rows for a month, and train_daily_bike_supply_model downloads data from those tables and trains a model. However, it requires a lot more weather data to actually run (which comes from the backfill API), so running the whole pipeline gets a bit confusing in terms of displaying what is happening.

Ah I see. Well maybe this is a good reason to have weather_etl in its own pipeline eventually. But let's move forward for now.

Yeah, that makes sense. The first bullet, however, is ready to go. After writing it out, you are right: this is a lot neater.

schrockn accepted this revision.Tue, Mar 24, 11:47 PM

cool. let's move forward. thanks for working through this

This revision is now accepted and ready to land.Tue, Mar 24, 11:47 PM
This revision was automatically updated to reflect the committed changes.