Page MenuHomePhabricator

RFC Serialization Strategy Refactor
Needs RevisionPublic

Authored by themissinghlink on Dec 5 2019, 10:03 PM.

Details

Summary

Observation

When building ML Models, model serde is probably the hardest problem (in my experience) that ml engineers have to deal with. The problem is that different model training libraries have different interfaces to doing serde which means that our SerializationStrategy interface ought to be as flexible as possible. Currently, SerializationStrategy is setup so that people subclass it and implement the serialize and deserialize abstract methods. This would be fine except that these methods require a file object handle instead of a file path. This is not conducive to API's like Keras which require you to send a file path instead. However, Max correctly pointed out that certain serde could also come from a non file buffer. As a result, you would have to do something like this to make it work:

class KerasModelSerializationStrategy(SerializationStrategy):
	def __init__(self):
		super(KerasModelSerializationStrategy, self).__init__(
	            'keras_strategy', read_mode='r', write_mode='w'
	        )
	def serialize(self, value, write_file_obj):
		pass
	def deserialize(self, read_file_obj):
	    pass
	def serialize_to_file(self, value, write_path):
		value.save(write_path)
		return write_path
	def deserialize_from_file(self, read_path):
		return load_model(read_path)

This is not ideal and is emblematic of a design with artificial coupling.

Solution

To solve this I implemented the following SerializationStrategy design which strives for more explicit decoupling of files and buffers. The con is that you will have to write more with open(...) as blah code, but the pro is more readable and "hopefully" easier to maintain code (just because there is less indirection going on). Also the above KerasSerializationStrategy code then is simplified to:

class KerasModelSerializationStrategy(FileBasedSerializationStrategy):

	def serialize_to_file(self, value, write_path):
		check.str_param(write_file_path, 'write_path')
		value.save(write_path)

	def deserialize_from_file(self, read_path):
		return load_model(read_path)
Test Plan

unit

Diff Detail

Repository
R1 dagster
Branch
asingh-fix-serialization-strategy-interface
Lint
Lint WarningsExcuse: im a bad person
Unit
No Unit Test Coverage

Event Timeline

themissinghlink edited the summary of this revision. (Show Details)Dec 5 2019, 11:12 PM
alangenfeld requested changes to this revision.Dec 5 2019, 11:38 PM

get the tests passing - I'm interested in seeing what the callsite changes look like

This revision now requires changes to proceed.Dec 5 2019, 11:38 PM
alangenfeld added inline comments.Dec 5 2019, 11:40 PM
python_modules/dagster/dagster/core/types/marshal.py
33–51

just making serialize_to_file abstract on SerializationStrategy might be sufficient - its not much or complex code that is getting duplicated

  • moved to a simpler design and got rid of serde methods
  • aligned params in intermediate store tests
  • wip decided to switch to buffer design but the gcp tests are wild
max added a comment.Dec 6 2019, 6:20 PM

i also think that changing the callsites will make this look like less of a win. I also don't like a world where we have a parameter called writable that is a string, i.e., specifically not something you can write to using the file APIs.

python_modules/dagster/dagster/core/types/marshal.py
10

let's not delete this docstring?

30

very 18th century

max added a comment.Dec 6 2019, 6:23 PM

it is also possible that serialization strategy is the wrong place to deal with this (compare https://github.com/dagster-io/dagster/blob/master/examples/dagster_examples/airline_demo/unzip_file_handle.py)

alangenfeld requested changes to this revision.Dec 9 2019, 4:46 PM

wip

back to your queue

This revision now requires changes to proceed.Dec 9 2019, 4:46 PM