
add scaffolding for pipeline and trip datatype
ClosedPublic

Authored by themissinghlink on Thu, Nov 21, 3:58 AM.

Details

Summary

This revision starts the scaffolding for the Feature pipeline and sets up a Trip DagsterType with validation checks!

Then we tack on TrafficDataTypes and WeatherDataFrame types along with a consolidated validation verification engine! With great expectations comes great responsibility. #RIPUncleBen

Test Plan

Unit Tests and presets. If the typecheck passes then we have successfully downloaded this dataset.

Proof this works: When you execute the production preset on dagit, you should see the following!

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

themissinghlink edited the summary of this revision. Thu, Nov 21, 4:00 AM
themissinghlink edited the test plan for this revision.
themissinghlink edited the test plan for this revision.
  • fixed pipeline to get demo working
  • fixed lint errors
themissinghlink edited the test plan for this revision. Fri, Nov 22, 1:38 AM
themissinghlink added reviewers: max, nate.
  • got rid of errant print statement
  • added traffic dataframe solid
alangenfeld resigned from this revision. Mon, Nov 25, 4:45 PM

seems solid, defer to others for last pass

examples/dagster_examples/bay_bikes/solids.py
221–236

what is this sorcery

examples/dagster_examples/bay_bikes/types.py
96–102

maybe add some tests using check_dagster_type?

examples/dagster_examples/bay_bikes/solids.py
221–236

Hahaha, welcome to Pandas. The pandas equivalent of select * from ... where ... uses masks: boolean expressions evaluated at the row level. Pandas overloads the __getitem__ method on DataFrames to accept a mask, which holds the result of the expression (True or False) for every row. Rows where the mask is True are selected. So what this code is saying is:

select *
from trips
where start_time between interval and upper_bound_interval  -- did the trip start in the sample interval?
   or end_time between interval and upper_bound_interval    -- did the trip end in the sample interval?
   or (start_time < interval and end_time >= upper_bound_interval);  -- was the bike in transit for the whole interval?

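
For future readers, the same filter can be sketched as a pandas mask. This is a minimal illustration, not the code in solids.py: the sample trips and the trip_id column are made up, and only start_time, end_time, interval, and upper_bound_interval follow the names used in the query above.

```python
import pandas as pd

# Hypothetical sample data; only the start_time/end_time names come from the thread.
trips = pd.DataFrame(
    {
        "trip_id": [1, 2, 3, 4],
        "start_time": pd.to_datetime(
            ["2019-11-21 08:10", "2019-11-21 07:50", "2019-11-21 07:00", "2019-11-21 10:00"]
        ),
        "end_time": pd.to_datetime(
            ["2019-11-21 08:20", "2019-11-21 08:05", "2019-11-21 09:30", "2019-11-21 10:15"]
        ),
    }
)

interval = pd.Timestamp("2019-11-21 08:00")
upper_bound_interval = pd.Timestamp("2019-11-21 09:00")

# Each expression yields a boolean Series with one entry per row.
started_in_interval = trips["start_time"].between(interval, upper_bound_interval)
ended_in_interval = trips["end_time"].between(interval, upper_bound_interval)
in_transit = (trips["start_time"] < interval) & (trips["end_time"] >= upper_bound_interval)

# Combine with | (element-wise OR); indexing with the mask keeps the True rows.
mask = started_in_interval | ended_in_interval | in_transit
active_trips = trips[mask]

print(active_trips["trip_id"].tolist())  # [1, 2, 3]
```

Note that the masks are combined with | and &, not the Python keywords or and and, and each comparison needs its own parentheses because those operators bind more tightly than < and >=.
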
examples/dagster_examples/bay_bikes/types.py
96–102

Ooooh, I will look into this! Great call!

  • added weather data type and validation
themissinghlink edited the summary of this revision. Mon, Nov 25, 9:47 PM
themissinghlink added reviewers: max, nate.
nate accepted this revision. Mon, Nov 25, 10:21 PM

Few small nits, otherwise LGTM

examples/dagster_examples/bay_bikes/solids.py
212

mind adding some comments on why you're doing the transformations below?

221–236

can you add some version of this ^ as a comment? would be great for future versions of us to remember what this is

examples/dagster_examples/bay_bikes/types.py
21

categories* -> and maybe make these DEFINES at the top to avoid magic strings?
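
One way to read this suggestion is sketched below. The constant names and category strings are hypothetical, since the actual values in types.py are not shown in this thread.

```python
# Hypothetical module-level constants replacing string literals scattered through the file.
USER_TYPE_SUBSCRIBER = "Subscriber"
USER_TYPE_CUSTOMER = "Customer"
USER_TYPE_CATEGORIES = (USER_TYPE_SUBSCRIBER, USER_TYPE_CUSTOMER)


def has_valid_categories(values, allowed=USER_TYPE_CATEGORIES):
    """Return True when every value is one of the allowed categories."""
    return all(value in allowed for value in values)
```

With the strings defined once at the top, a typo shows up as a NameError at import time instead of a validation check that silently never matches.
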

186

categories

This revision is now accepted and ready to land. Mon, Nov 25, 10:21 PM
themissinghlink marked 3 inline comments as done.
  • addressed feedback
This revision was automatically updated to reflect the committed changes.