Page MenuHomePhabricator

RFC Serialization Strategy Refactor

Authored by themissinghlink on Dec 5 2019, 10:03 PM.




When building ML Models, model serde is probably the hardest problem (in my experience) that ml engineers have to deal with. The problem is that different model training libraries have different interfaces to doing serde which means that our SerializationStrategy interface ought to be as flexible as possible. Currently, SerializationStrategy is setup so that people subclass it and implement the serialize and deserialize abstract methods. This would be fine except that these methods require a file object handle instead of a file path. This is not conducive to API's like Keras which require you to send a file path instead. However, Max correctly pointed out that certain serde could also come from a non file buffer. As a result, you would have to do something like this to make it work:

class KerasModelSerializationStrategy(SerializationStrategy):
	def __init__(self):
		super(KerasModelSerializationStrategy, self).__init__(
	            'keras_strategy', read_mode='r', write_mode='w'
	def serialize(self, value, write_file_obj):
	def deserialize(self, read_file_obj):
	def serialize_to_file(self, value, write_path):
		return write_path
	def deserialize_from_file(self, read_path):
		return load_model(read_path)

This is not ideal and is emblematic of a design with artificial coupling.


To solve this I implemented the following SerializationStrategy design which strives for more explicit decoupling of files and buffers. The con is that you will have to write more with open(...) as blah code, but the pro is more readable and "hopefully" easier to maintain code (just because there is less indirection going on). Also the above KerasSerializationStrategy code then is simplified to:

class KerasModelSerializationStrategy(FileBasedSerializationStrategy):

	def serialize_to_file(self, value, write_path):
		check.str_param(write_file_path, 'write_path')

	def deserialize_from_file(self, read_path):
		return load_model(read_path)
Test Plan


Diff Detail

R1 dagster
Lint WarningsExcuse: im a bad person
No Unit Test Coverage

Event Timeline

alangenfeld requested changes to this revision.Dec 5 2019, 11:38 PM

get the tests passing - I'm interested in seeing what the callsite changes look like

This revision now requires changes to proceed.Dec 5 2019, 11:38 PM
alangenfeld added inline comments.Dec 5 2019, 11:40 PM

just making serialize_to_file abstract on SerializationStrategy might be sufficient - its not much or complex code that is getting duplicated

  • moved to a simpler design and got rid of serde methods
  • aligned params in intermediate store tests
  • wip decided to switch to buffer design but the gcp tests are wild
max added a comment.Dec 6 2019, 6:20 PM

i also think that changing the callsites will make this look like less of a win. I also don't like a world where we have a parameter called writable that is a string, i.e., specifically not something you can write to using the file APIs.


let's not delete this docstring?


very 18th century

max added a comment.Dec 6 2019, 6:23 PM

it is also possible that serialization strategy is the wrong place to deal with this (compare

alangenfeld requested changes to this revision.Dec 9 2019, 4:46 PM


back to your queue

This revision now requires changes to proceed.Dec 9 2019, 4:46 PM

Going to abandon this refactor. Knowing everything I know now, there are some issues with this approach and I need to rethink this whole thing.


So the reason this doesn't work is the situation where you are actually serializing or deserializing from a ByteStream like in the case of GCP