Dagster Pandas Guide Docs
Needs Revision, Public

Authored by themissinghlink on Fri, Jan 3, 10:50 PM.

Details

Summary

This diff adds documentation for the custom dataframe type factory and validation logic available in dagster-pandas.

Test Plan

unit

Diff Detail

Repository
R1 dagster
Branch
dagster-pandas-guide (branched from master)
Lint
Lint OK
Unit
No Unit Test Coverage

Event Timeline

themissinghlink edited the summary of this revision. Fri, Jan 3, 10:50 PM
prha added inline comments. Fri, Jan 3, 11:10 PM
docs/sections/learn/guides/dagster_pandas/dagster_pandas.md
28–32

The above example provides a simple introduction to using dataframe types in dagster solids. There are also a lot of ways to maximize your workflow development experience by extending your plain DataFrame types. Luckily, dagster-pandas does this for you and provides an API for creating custom dataframe types that perform data quality checks, emit summary statistics, and enable safe/reliable IO for dataframe serialization/deserialization.
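To make "emit summary statistics" concrete, here is a minimal sketch in plain pandas. It does not use the dagster-pandas API; the `summarize` helper and the sample columns are hypothetical and only illustrate the kind of metadata a custom dataframe type might report.

```python
import pandas as pd


def summarize(df):
    """Collect a few summary statistics for a dataframe (hypothetical helper)."""
    return {
        "row_count": len(df),
        "column_count": len(df.columns),
        # Count missing values per column and convert to plain ints.
        "null_count_by_column": {
            column: int(count) for column, count in df.isna().sum().items()
        },
    }


# Illustrative data only; the column names are assumptions.
frame = pd.DataFrame({"station": ["a", "b", None], "docks": [10, 15, 20]})
stats = summarize(frame)
```

A dictionary like `stats` could then be attached to whatever metadata mechanism the dataframe type exposes.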

121

s/extensible dataframe//
s/is also/are also/

"Furthermore" just sounds awkwardly formal.

126

s/synctactic/syntactic/

themissinghlink edited the summary of this revision.
  • made copy edits to documentation to ensure it reads better. Thanks prha
prha resigned from this revision. Sat, Jan 4, 12:40 AM

will let other folks take a look...

schrockn requested changes to this revision. Tue, Jan 7, 8:42 PM

Other than the inline comments, I think we should do a progression where we introduce three layers of constraints.

  1. Pure datatype checking. Frame this as schema.
  2. Add mins/maxes etc.
  3. Add a totally custom constraint.
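
The three layers above can be sketched in plain pandas. This is an illustration of the progression only, not the dagster-pandas API: the helper names (`check_schema`, `check_bounds`, `check_custom`) and the sample columns are hypothetical.

```python
import pandas as pd


def check_schema(df, expected_dtypes):
    """Layer 1: every expected column exists and has the expected dtype."""
    errors = []
    for column, dtype in expected_dtypes.items():
        if column not in df.columns:
            errors.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            errors.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    return errors


def check_bounds(df, bounds):
    """Layer 2: values fall inside an inclusive (min, max) range per column."""
    errors = []
    for column, (low, high) in bounds.items():
        if not df[column].between(low, high).all():
            errors.append(f"{column}: values outside [{low}, {high}]")
    return errors


def check_custom(df, predicates):
    """Layer 3: arbitrary predicates over the whole frame; returns failed names."""
    return [name for name, predicate in predicates.items() if not predicate(df)]


# Illustrative data only; dtypes are set explicitly so the schema check is stable.
trips = pd.DataFrame({
    "bike_id": pd.Series([1, 2, 3], dtype="int64"),
    "duration_min": pd.Series([12.5, 48.0, 7.25], dtype="float64"),
})

schema_errors = check_schema(trips, {"bike_id": "int64", "duration_min": "float64"})
bounds_errors = check_bounds(trips, {"duration_min": (0, 24 * 60)})
custom_errors = check_custom(
    trips, {"unique bike_id": lambda df: df["bike_id"].is_unique}
)
```

Each layer returns a list of failures, so a guide can introduce them one at a time and show how the checks get stricter.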
docs/sections/learn/guides/dagster_pandas/dagster_pandas.md
12

let's do import from top-level

30

I would also include "schema validation".

35

just use create_dagster_pandas_dataframe_type. I don't think we need to include "factory" verbiage

37

column-wise

46

stick to top level includes

100

probably overkill to have a named output? just return the df?

This revision now requires changes to proceed. Tue, Jan 7, 8:42 PM
themissinghlink marked 3 inline comments as done.
  • made copy edits to documentation to ensure it reads better. Thanks prha
  • made inline fixes
  • edit docs based on nicks feedback
themissinghlink edited the summary of this revision. Fri, Jan 10, 1:18 AM
themissinghlink edited the summary of this revision. Fri, Jan 10, 1:27 AM
themissinghlink updated this revision to Diff 8584. Edited Fri, Jan 10, 1:41 AM

Repushed to pick up new master changes because the previous version of master was potentially breaking k8s tests.

  • added documentation for dataframe level validation

@schrockn Just added more changes since last time; will need a once-over whenever you're free!

schrockn requested changes to this revision. Fri, Jan 10, 11:46 PM

Let me expand on the schema validation stuff. I think we should talk about it in its own section and then introduce data quality as a *separate* section after it.

docs/sections/learn/guides/dagster_pandas/dagster_pandas.md
11

from dagster_pandas import DataFrame

This revision now requires changes to proceed. Fri, Jan 10, 11:46 PM