
added scaffolding and first set of pipelines for fetching bay bike data
ClosedPublic

Authored by themissinghlink on Mon, Nov 4, 8:07 AM.

Details

Summary

Background

This revision sets up the basic scaffolding for the bay bikes project. It also implements two solids and a pipeline that download zip files from a public S3 bucket and unzip them locally.
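For context, the core of those two solids can be sketched as plain Python. The function names match the solid names in the config below, but the bodies here are illustrative stand-ins, not the actual code in the diff:

```python
import os
import urllib.request
import zipfile


def download_zipfiles_from_urls(base_url, file_names, target_dir, chunk_size=8192):
    """Stream each archive from base_url into target_dir in chunk_size pieces."""
    for file_name in file_names:
        url = "{}/{}".format(base_url, file_name)
        target_path = os.path.join(target_dir, file_name)
        with urllib.request.urlopen(url) as response, open(target_path, "wb") as f:
            chunk = response.read(chunk_size)
            while chunk:
                f.write(chunk)
                chunk = response.read(chunk_size)


def unzip_files(file_names, source_dir, target_dir):
    """Extract each downloaded archive from source_dir into target_dir."""
    for file_name in file_names:
        with zipfile.ZipFile(os.path.join(source_dir, file_name)) as zf:
            zf.extractall(target_dir)
```

Wrapping these in `@solid` decorators and wiring them into a `@pipeline` is what the scaffolding in this diff does.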

Proof This Works

PROTIP: Ensure you have a data directory in the examples directory for this to work. I added it to the .gitignore because we don't want to check in CSV files.

If you execute the download_csv_locally_pipeline pipeline with the following input config:

solids:
  download_zipfiles_from_urls:
    inputs:
      base_url:
        value: 'https://s3.amazonaws.com/baywheels-data'
      chunk_size:
        value: 8192
      file_names:
          - value: 201801-fordgobike-tripdata.csv.zip
          - value: 201802-fordgobike-tripdata.csv.zip
          - value: 201803-fordgobike-tripdata.csv.zip
          - value: 201804-fordgobike-tripdata.csv.zip
          - value: 201805-fordgobike-tripdata.csv.zip
          - value: 201806-fordgobike-tripdata.csv.zip
          - value: 201807-fordgobike-tripdata.csv.zip
          - value: 201808-fordgobike-tripdata.csv.zip
          - value: 201809-fordgobike-tripdata.csv.zip
          - value: 201810-fordgobike-tripdata.csv.zip
          - value: 201811-fordgobike-tripdata.csv.zip
          - value: 201812-fordgobike-tripdata.csv.zip
          - value: 201901-fordgobike-tripdata.csv.zip
          - value: 201902-fordgobike-tripdata.csv.zip
          - value: 201903-fordgobike-tripdata.csv.zip
          - value: 201904-fordgobike-tripdata.csv.zip
          - value: 201905-baywheels-tripdata.csv.zip
          - value: 201906-baywheels-tripdata.csv.zip
          - value: 201907-baywheels-tripdata.csv.zip
          - value: 201908-baywheels-tripdata.csv.zip
          - value: 201909-baywheels-tripdata.csv.zip
      target_dir:
        value: /tmp
  unzip_csv_files:
    inputs:
      source_dir:
        value: /tmp
      target_dir:
        value: ./data

You should get all of the unzipped CSV files placed in a local data directory.

Test Plan

unit
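The unit coverage looks roughly like this: build a fixture archive, run the unzip step, and assert the CSV lands in the target directory. The function body here is an illustrative stand-in for the solid's core logic, not the actual test in the diff:

```python
import os
import tempfile
import zipfile


def unzip_files(file_names, source_dir, target_dir):
    # illustrative stand-in for the solid's core logic
    for name in file_names:
        with zipfile.ZipFile(os.path.join(source_dir, name)) as zf:
            zf.extractall(target_dir)


def test_unzip_files_extracts_csv():
    with tempfile.TemporaryDirectory() as source_dir:
        with tempfile.TemporaryDirectory() as target_dir:
            # build a fixture archive containing a tiny csv
            csv_name = "201801-fordgobike-tripdata.csv"
            zip_name = csv_name + ".zip"
            with zipfile.ZipFile(os.path.join(source_dir, zip_name), "w") as zf:
                zf.writestr(csv_name, "duration_sec,start_time\n100,2018-01-01\n")
            unzip_files([zip_name], source_dir, target_dir)
            assert os.path.isfile(os.path.join(target_dir, csv_name))
```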

Diff Detail

Repository
R1 dagster
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

themissinghlink created this revision.Mon, Nov 4, 8:07 AM
themissinghlink edited the summary of this revision. (Show Details)Mon, Nov 4, 8:12 AM
  • added tests for bay bike pipelines
  • made lint fixes
  • fixed linter check that did apply
  • forgot to apply black changes
themissinghlink added reviewers: Restricted Project, max.Mon, Nov 4, 11:27 PM
  • got rid of pinned example dependencies...I get the problem now
  • forgot to run black; I lost my pre-commit hook
max accepted this revision.Tue, Nov 5, 2:02 AM

This is lovely. We should probably be using dagster.utils.mkdir_p to make sure the data/target directory is where we expect it in the solids.
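For reference, `dagster.utils.mkdir_p` is just an idempotent directory creator; a stdlib-equivalent sketch (illustrative, not dagster's actual source):

```python
import os


def mkdir_p(path):
    """Create path (and any parents) if missing; no error if it already exists.

    Calling this at the top of a solid guarantees target_dir exists
    before any files are written into it.
    """
    os.makedirs(path, exist_ok=True)
    return path
```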

examples/dagster_examples/bay_bikes/repository.py
4

why locally?

9

fixme

examples/dagster_examples/bay_bikes/solids.py
6

let's use urllib3

42

probably let's just call this unzip_files, and alias it as unzip_csv_files in the pipeline

This revision is now accepted and ready to land.Tue, Nov 5, 2:02 AM
themissinghlink added inline comments.Tue, Nov 5, 3:50 AM
examples/dagster_examples/bay_bikes/repository.py
4

Yeah, I meant save on the machine, not locally as in a laptop. I am just gonna get rid of "locally" in favor of download_csv_pipeline

9

lol good call. cons of copying and pasting XD

examples/dagster_examples/bay_bikes/solids.py
6

OOP, bad habit from py2 days. Will do.

42

Good call!

@max I will add mkdir_p and use the alias in the next step (which is setting up the dagstermill stuff). The reason for this is that the next step will actually consume this pipeline so it will make a lot of sense when folks see it in the diff there!

  • addressed PR feedback

You should get all of the unzipped csv files placed in a local data directory.

Is this the intended end state or just a transitory one? It seems odd to me that we want to materialize the unzipped files somewhere permanently as opposed to using a file cache or tempfile-esque resource

That's a good callout. My thought was that we ought not to take an opinion on whether the data directory should be a temp file; that should be determined by the user, who can create the temp file if need be. An example of why you might not care about temp files: if you were running a dagster pipeline on a remote machine that would spin down and delete everything anyway, it wouldn't matter whether the directory was transient.
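Concretely, a user who does want transient storage can get it with no pipeline support at all, by supplying a temp directory as target_dir themselves (a sketch, not code from the diff):

```python
import os
import tempfile
import zipfile

# The pipeline stays unopinionated: a user who wants transient storage
# can pass a temp directory as target_dir themselves.
with tempfile.TemporaryDirectory() as scratch:
    # stand-in for a downloaded archive
    archive = os.path.join(scratch, "example.csv.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("example.csv", "a,b\n1,2\n")

    with tempfile.TemporaryDirectory() as target_dir:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target_dir)
        assert os.path.isfile(os.path.join(target_dir, "example.csv"))
    # leaving the context deletes the unzipped csvs automatically
```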

themissinghlink updated this revision to Diff 6242.EditedTue, Nov 5, 7:36 PM
