The testing concepts section comes off pretty intensely and makes some strong statements that are easy to misconstrue in ways that many data engineers would disagree with.
A few typos that I trust you to fix without me re-reviewing plus an idea for the future.
I'm one of the aforementioned data engineers who doesn't agree with everything in this section (for example, I think there can be a lot of merit to running things like Spark locally to assert against actual dataframe transformations) but I do think there are some really interesting philosophies that we're losing by cutting this out in its entirety.
I wonder if this content that we're removing might be worth re-introducing at some point as a blog post or series of blog posts. I could envision a couple of more opinionated pieces that talk about different testing philosophies and show how Dagster can accommodate each one. For example, a post on TDDing data pipelines, a post on running integration tests against non-prod external services, etc.
I think you forgot a word
So as the original author of this let me explain what I was trying to get across.
It was effectively what Erik was expressing in the article that went around: that when you are testing business logic, it should be in environments as similar as possible to production.
Now I don't agree with everything in Erik's article either, but what I wanted to make sure to get across is that the existence of the resource system doesn't mean we are making claims that everything can be locally testable. E.g. it is not recommended to run Postgres locally and Redshift in prod and assume that everything is good. In fact, I think that type of environment is likely to cause more problems than it solves, because of SQL dialect incompatibility and implementation differences.
However, the fact that @jordansanders thinks that this makes an argument that we shouldn't run Spark locally indicates that the wording should be changed. That is *not* what this was trying to say. I was arguing about complex in-memory mocking of dataframe computations.
I stand by a lot of this content, though (I agree that some sentences are a bit hectoring on a re-read), and I think we are losing something by blanking it entirely. In particular, I think the "principle" (aside: I'm beyond mortified I used the wrong variant!) and "corollary" bit is a novel and thought-provoking formulation.
Chatted with Sandy briefly and got to agreement. This content should eventually be massaged and resurfaced in a guide or post. I archived it here: https://docs.google.com/document/d/1B7fEOM8Ga03NbDwSu29LF7--JDri8JPGV1YBeE4rIIE/edit
Convo with Sandy on Slack:
sandy 9:02 AM
hey - I think we have agreement on core content. aside from the specific spark callout, I agreed with all the content that I took out.
I don't think I explained this very well in the summary I wrote for that diff, but what motivated that docs change for me was the sense that we were requiring our readers to sit through a lecture on our philosophy of how to test pipelines before telling them how they could do it in Dagster.
the rest of our concept sections stick to talking about what Dagster provides, so this one stuck out, because it spends a lot of time on more general data engineering advice. my understanding of the history of that content is that it was written when our site was organized in a different way, and its focus made more sense then. IMO there's a place for that kind of content, but it's probably a blog post or a guide
schrockn 9:18 AM
I think that was originally a “guideline
agreed this content could be messaged better in a guide or post (i like the post idea better). the original content was a guide/learn piece which communicated dagster's philosophy and its take on data engineering - i found it very compelling back then (before joining) so converting it to a post sounds like a great plan for public comms