Page MenuHomeElementl

[guide] file processing
Needs RevisionPublic

Authored by sandyryza on Jun 4 2021, 9:46 PM.


Test Plan

bk, manual inspection

Diff Detail

R1 dagster
file-io-manager (branched from master)
Lint Passed
No Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 4 2021, 10:09 PM
Harbormaster failed remote builds in B31732: Diff 39071!

I like the structure of the guide.

Regarding the section titles, I kinda feel it's a bit too technical and less user-friendly. maybe something like:

  • Writing processing logics
  • Using local filesystem
  • Using S3
  • Using temporary directory
  • Putting it all together

Also, when explaining dagster nouns, would be good to link to the relevant concept pages - imo the idea of a guide is a reader can follow through it without prior dagster knowledge. however, if we believe that's not the case, we could also link "relevant apis/concepts" at the beginning to make this guide easier to follow


the guide isn't about cloud storage only - it also mentions local filesystem. maybe reword the description to be more generic about file-based processing?


finish the sentence?


why tmpfile? imo i get the idea of passing some path between solids, but temfile is a bit less practical that i don't get the real value of it so im a bit concerned that users may find it hard to follow.

can we hardcode a file path in create_file and pass it down to the second solid make things simpler? or we could make the file path a solid config in the first solid which allows users to specify a custom file path via run config and also demonstrates run_config at the same time.

if we want a "dynamic file path" example, we could expand it to do so later -- i just feel making it a tempdir in the first example feels a bit intimidating for a new comer.


maybe worth explaining the data dependency between solids as well, or link to the Solids concept page, or maybe even include the diagram on IO Manager?

at this point of the guide, a reader without sufficient dagster knowledge could get lost with "why am i returning a path".


would be cool to include screenshots later in this guide.

Overall, I like the guide, but I think we can simplify it a little bit.


re: the "why am I returning a path" thing, I agree that it would be nice to be extra explicit at the top about this being a pattern that is useful for some use cases but not all. I just want to avoid cases where people assume they need to do this path indirection stuff for every pipeline they make, even if there are better options for them.

I do think there is a certain amount of elegance in this idea of tempfiles being scoped only to solid execution, but there is a good bit of mental overhead in untangling all the different temp files and their purposes, and this complexity isn't really necessary for people that just want to get their pipeline up and running.


The reason for the tempfile is that the file-processing tools that are being used can't write directly to S3. So they need to write to a local file that then gets uploaded to S3. It's temporary because there's no reason to keep the local file around after it's been uploaded.

Let me know if I'm misunderstanding your suggestion. I agree that this adds a layer of complexity, but no solution for simplifying it immediately stood out to me.


ohh i see, ya that makes total sense. my reservation of including it in the first example is it may throw too much context at a reader at once in the beginning.

i guess would be great to either

  • include tempdir here and explicit communicate its purpose
  • or, move the filesystem example before s3 so the first step doesn't require this tempfile extra piece, and later in the s3 example change the local filepath to be a tempfile and communicate its purpose.
yuhan requested changes to this revision.Jun 15 2021, 12:03 AM

requesting changes for q management

This revision now requires changes to proceed.Jun 15 2021, 12:03 AM