
Move log stream out of Apollo, memoize row contents and fix cascading renders

Authored by bengotow on Sep 27 2019, 5:45 AM.



This diff makes the run log viewer faster.

It turns out several problems had built up in here. We've essentially been implementing the "streaming GraphQL subscription which appends to an array" pattern ourselves, but on each inbound update we were committing the data to Apollo's normalized store. We have no interest in a normalized store for log entries, and Alex noted earlier today that it's almost shockingly slow. We then turn around and pull the data right back out for display, and it seems to re-pluck each requested key-value pair. Oof.

This diff merges the RunSubscriptionProvider and LogFilterProvider into a single LogProvider that opens a websocket, sends the GraphQL request, and then accumulates the logs into our own array, which is vended to its children. The logs are not requested through Apollo at all. It turns out it's faster to retrieve the "initial" / old logs through the socket too, by using a cursor of 0. (That way we don't have to insert / retrieve the older logs from Apollo either.)
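The accumulation idea can be sketched roughly like this — a minimal, hypothetical shape (names like LogAccumulator and onMessages are illustrative, not the actual dagit classes):

```typescript
// Hypothetical sketch: accumulate streamed log messages in a plain array
// outside Apollo's normalized store, and notify subscribers directly.
type LogMessage = {message: string; timestamp: number};

class LogAccumulator {
  private nodes: LogMessage[] = [];
  private listeners: Array<(nodes: LogMessage[]) => void> = [];

  // Append a batch received from the websocket. The array is append-only
  // (as long as filters don't change), so consumers can rely on that.
  onMessages(batch: LogMessage[]) {
    this.nodes = this.nodes.concat(batch);
    for (const listener of this.listeners) {
      listener(this.nodes);
    }
  }

  subscribe(listener: (nodes: LogMessage[]) => void) {
    this.listeners.push(listener);
  }
}
```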

This diff also avoids throwing away the virtualized grid's cell offset + height cache every time new logs arrive. We were doing this because the search / filters in the view hide items and invalidate the row offsets. Now we generate a key from the filters - as long as the filters remain the same, we know the logs are append-only and it's safe to keep using the cached row offsets. This eliminates one heavy double-render caused by the height calculations.
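A sketch of the filter-derived cache key, assuming a simple filter shape (the real LogFilter in dagit may have different fields):

```typescript
// Sketch: derive a stable key from the active filters. While the key is
// unchanged, the log array is append-only and the virtualized grid's
// cached row offsets remain valid; a new key invalidates them.
interface LogFilter {
  text: string;
  levels: string[];
}

function filterCacheKey(filter: LogFilter): string {
  // Sort levels so the key doesn't depend on toggle order.
  return JSON.stringify({text: filter.text, levels: [...filter.levels].sort()});
}
```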

Because we know we're receiving the logs in order, I also made the LogProvider attach a clientsideKey to each message. Having a unique key for each item in the array allows us to keep the cell height cache even when you're filtering the array and the items change array indexes. (So we invalidate the row offsets as you filter, but not the row sizes).
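A minimal sketch of attaching the clientsideKey, assuming a simple counter since messages arrive in order (the helper name and key format are hypothetical):

```typescript
// Sketch: tag each inbound message with a monotonically increasing
// clientsideKey. Keying the height cache on this (rather than array
// index) lets row sizes survive re-filtering, even though the items'
// array indexes change.
let keyCounter = 0;

function attachClientsideKeys<T extends object>(
  batch: T[],
): Array<T & {clientsideKey: string}> {
  return batch.map((msg) => ({...msg, clientsideKey: `csk${keyCounter++}`}));
}
```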

It turns out we were also spending a good deal of time just rendering the actual cells. Because each cell is rendered and then re-rendered with its fixed size, memoizing the cells /inside/ the main row container is effective, but memoizing them at the top level doesn't work because the style prop is changing.
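The idea can be shown without React: memoize the expensive cell *contents* keyed on the message alone, so the outer render (which receives a fresh positioning style on every pass) can reuse the inner result. This is a sketch of the principle, not dagit's actual component code:

```typescript
// Sketch: cache rendered cell contents by message identity. The outer
// row still re-renders when its style prop changes, but the expensive
// inner work runs once per message.
function memoizeByMessage<M extends object, R>(
  render: (msg: M) => R,
): (msg: M) => R {
  const cache = new Map<M, R>();
  return (msg: M): R => {
    if (!cache.has(msg)) {
      cache.set(msg, render(msg));
    }
    return cache.get(msg)!;
  };
}
```

In React terms this corresponds to wrapping the inner cell body in React.memo while leaving the positioned outer container unmemoized.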

Anyway. Here we go - this is just a log_spew run with the Debug message filter toggled on:

Test Plan

Run log_spew (faster)

Diff Detail

R1 dagster
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

bengotow created this revision.Sep 27 2019, 5:45 AM
bengotow retitled this revision from Move log stream out of Apollo, memoize row contents and fix 2x cascading renders to Move log stream out of Apollo, memoize row contents and fix cascading renders.Sep 27 2019, 6:01 AM
bengotow edited the summary of this revision.
bengotow edited the summary of this revision.Sep 27 2019, 6:01 AM
bengotow updated this revision to Diff 5036.Sep 27 2019, 6:05 AM
bengotow edited the summary of this revision.

Remove extra GraphQL types, use debounce because smaller updates are smoother than infrequent large ones.
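A sketch of the debounced batching, assuming a leading-edge timer that coalesces a burst of socket messages into one update (names and interval are illustrative, not the actual dagit code):

```typescript
// Sketch: queue inbound items and flush them in one callback after a
// short delay, so the UI gets small, frequent batches rather than one
// update per message (or one giant update after a long wait).
function batchCalls<T>(
  flush: (items: T[]) => void,
  ms: number,
): (item: T) => void {
  let queued: T[] = [];
  let timer: ReturnType<typeof setTimeout> | null = null;
  return (item: T) => {
    queued.push(item);
    if (timer === null) {
      timer = setTimeout(() => {
        const batch = queued;
        queued = [];
        timer = null;
        flush(batch); // one render per batch instead of one per message
      }, ms);
    }
  };
}
```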

bengotow added inline comments.Sep 27 2019, 6:09 AM

Whoops - forgot I need to bring this code back so that the pipeline transitions to the finished state (mostly I think this is just for the status bar dot color). Will fix this in the AM

bengotow edited the summary of this revision.Sep 27 2019, 6:12 PM
bengotow updated this revision to Diff 5045.Sep 27 2019, 2:49 PM

Sync pipeline run status changes back to Apollo
Rename and move types a bit

bengotow updated this revision to Diff 5049.Sep 27 2019, 3:20 PM

Add file I forgot to commit



I would accept, but I think it's important that the reconnect behavior continues to work, which is not detailed in your test plan


Does this hand-rolled one handle reconnects effectively? Might want to move this out into its own module RawGraphQLSubscription / DirectGraphQLSubscription since we're likely to use it for compute (and it'll get more complicated if it has reconnect logic)






alangenfeld requested changes to this revision.Sep 27 2019, 4:49 PM
This revision now requires changes to proceed.Sep 27 2019, 4:49 PM
bengotow planned changes to this revision.Sep 27 2019, 6:19 PM

Sounds good! I think handling reconnects is a good idea - will see if we can do that and pull the websocket impl out into a class for reuse.

bengotow updated this revision to Diff 5219 (edited).Sep 30 2019, 9:14 PM

Abstract custom subscription into DirectGraphQLSubscription, implement retries

Tested the retry logic by A) killing the dagit process and seeing that it tries to reconnect every 500ms, B) adding a small bit of code to end the socket every 2s and seeing that it re-opens the connection and the UI continues to reflect the log stream properly.
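The retry behavior described above can be sketched abstractly, using a socket-like interface so the loop is testable without a real WebSocket; the 500ms default matches the test plan, but the real DirectGraphQLSubscription may be structured quite differently:

```typescript
// Hypothetical sketch of the reconnect loop. When the server drops the
// socket, re-open after a short delay and re-send the subscription
// (currently from cursor 0, so the full log stream is re-fetched).
interface SocketLike {
  onclose: (() => void) | null;
}

function connectWithRetry(open: () => SocketLike, retryMs = 500): SocketLike {
  const ws = open();
  ws.onclose = () => {
    setTimeout(() => connectWithRetry(open, retryMs), retryMs);
  };
  return ws;
}
```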

I think this should incorporate all of your feedback now @alangenfeld!

bengotow updated this revision to Diff 5223.Sep 30 2019, 9:30 PM

Rebase on master to fix build(?)

alangenfeld accepted this revision.Sep 30 2019, 9:57 PM
alangenfeld added inline comments.

Hmm, this feels off - we should probably revisit this. We could update the subscription query to keep sending the cursor, I think


anything to worry about with this hard-coded id?

This revision is now accepted and ready to land.Sep 30 2019, 9:57 PM
bengotow added inline comments.Sep 30 2019, 10:08 PM

Yeah I think you're right. It'd be nice to standardize the way we send / receive cursors too so this class can manage them rather than it being provided per-use-case. I'll try to do this when I switch the stderr/stdout log stream to use this class.

Right now I'm handling retries, but assuming that they're pretty infrequent - I think asking for all the data again is OK on localhost, but if we ship cloud-dagit and the server drops sockets (hopefully not likely) we'd definitely want them to pick up at the cursor.
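The cursor-resume idea being discussed could look something like this — an entirely hypothetical shape, since the actual cursor plumbing isn't standardized yet:

```typescript
// Sketch: track the cursor of the last message received so a reconnect
// can resubscribe from that point instead of re-requesting everything
// from cursor 0.
class ResumableCursor {
  private cursor = 0;

  // Record the cursor carried by each inbound message.
  record(messageCursor: number) {
    this.cursor = Math.max(this.cursor, messageCursor);
  }

  // On reconnect, resubscribe from here rather than from 0.
  resumeFrom(): number {
    return this.cursor;
  }
}
```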


Hmm, that's a good question - I can't find any documentation for this value, but it looks like Apollo resets it on each connection. I think it's just a unique identifier for associating responses with the request.