
Give ownership of byte buffers to write ops #12306

Merged
merged 4 commits on Aug 31, 2017

Conversation

@kpayson64 (Contributor) commented Aug 28, 2017

This implementation does not give ownership of the entire byte_buffer to the transport, but rather allows the transport to steal the reference to the slices.

The reason for this is that the transport only holds on to the slice buffer, which is a field in the byte buffer.

The application is still responsible for calling grpc_byte_buffer_destroy(), but the contents of that buffer are no longer guaranteed.
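
To make the new contract concrete, here is a hedged sketch (not code from this PR) of how a core-API caller is expected to treat a byte buffer handed to a send op after this change; call, cq, and tag are assumed to already exist, and initial metadata is assumed to have been sent:

#include <grpc/byte_buffer.h>
#include <grpc/grpc.h>
#include <grpc/slice.h>
#include <grpc/support/time.h>

#include <cstring>

// Sketch only: send one message and clean up under the new ownership rules.
void SendOneMessage(grpc_call* call, grpc_completion_queue* cq, void* tag) {
  grpc_slice slice = grpc_slice_from_copied_string("payload");
  grpc_byte_buffer* bb = grpc_raw_byte_buffer_create(&slice, 1);
  grpc_slice_unref(slice);  // the byte buffer holds its own slice reference

  grpc_op op;
  std::memset(&op, 0, sizeof(op));
  op.op = GRPC_OP_SEND_MESSAGE;
  op.data.send_message.send_message = bb;
  grpc_call_start_batch(call, &op, 1, tag, nullptr);

  // After this PR the transport may steal the slices inside bb, so its
  // contents must not be read or reused once the batch has been started.
  grpc_event ev = grpc_completion_queue_next(
      cq, gpr_inf_future(GPR_CLOCK_REALTIME), nullptr);
  (void)ev;

  // The application still owns the container and must destroy it.
  grpc_byte_buffer_destroy(bb);
}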

@grpc-kokoro
[trickle] No significant performance differences

@grpc-kokoro
[microbenchmarks] Performance differences noted:
Benchmark                                                                                 atm_add_per_iteration    atm_cas_per_iteration    cpu_time    real_time
----------------------------------------------------------------------------------------  -----------------------  -----------------------  ----------  -----------
BM_PumpStreamServerToClient<InProcessCHTTP2>/16M                                                                                            -6%         -6%
BM_StreamingPingPongWithCoalescingApi<InProcessCHTTP2, NoOpMutator, NoOpMutator>/16M/2/1                                                    -6%         -6%
BM_TransportStreamSend/128M                                                               -99%                     -98%                     -99%        -99%
BM_TransportStreamSend/16M                                                                -99%                     -88%                     -99%        -99%
BM_TransportStreamSend/256k                                                               -73%                                              -74%        -74%
BM_TransportStreamSend/2M                                                                 -95%                     -50%                     -96%        -96%
BM_TransportStreamSend/32k                                                                -33%                                              -26%        -26%
BM_TransportStreamSend/4k                                                                 -14%
BM_TransportStreamSend/512                                                                -14%
BM_TransportStreamSend/64                                                                 -14%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                  -28%                     -41%        -41%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                               -29%                     -32%        -33%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2M/0                                                -5%

Makefile Outdated
@@ -105,7 +105,7 @@ CC_asan = clang
CXX_asan = clang++
LD_asan = clang++
LDXX_asan = clang++
CPPFLAGS_asan = -O0 -fsanitize-coverage=edge -fsanitize=address -fno-omit-frame-pointer -Wno-unused-command-line-argument -DGPR_NO_DIRECT_SYSCALLS
Member

Why is this changing?

Contributor Author

I was running ASAN tests on an older clang, so I removed this for testing. I've reverted it.

@@ -308,7 +308,11 @@ class CallOpSendMessage {
 template <class M>
 Status CallOpSendMessage::SendMessage(const M& message, WriteOptions options) {
   write_options_ = options;
-  return SerializationTraits<M>::Serialize(message, &send_buf_, &own_buf_);
+  Status result = SerializationTraits<M>::Serialize(message, &send_buf_, &own_buf_);
+  if (own_buf_) {
Member

Under which situations is own_buf_ true?

Member

(should we just eliminate this flag from the API?)

Member

+1 -- removing this option seems like the right thing. I'm not really sure why it's here to begin with.

Contributor Author

I've left own_buf in there because SerializationTraits is part of the public API.

Removing the ability to not free the byte buffer would be an API-breaking change, so I'd rather defer that change.
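
For readers wondering when own_buf_ would actually differ, here is a hedged sketch (a hypothetical RawMessage type, not from the gRPC tree) of a custom SerializationTraits specialization whose Serialize signature matches the call site in the diff above:

#include <grpc++/impl/codegen/serialization_traits.h>
#include <grpc++/impl/codegen/status.h>
#include <grpc/byte_buffer.h>
#include <grpc/slice.h>

// Hypothetical application type that sometimes already has a serialized form.
struct RawMessage {
  grpc_byte_buffer* cached;  // non-null if a serialized buffer is cached
  grpc_slice payload;        // otherwise serialize from this slice
};

namespace grpc {

template <>
class SerializationTraits<RawMessage> {
 public:
  static Status Serialize(const RawMessage& msg, grpc_byte_buffer** bp,
                          bool* own_buffer) {
    if (msg.cached != nullptr) {
      // The message keeps ownership of its cached buffer; the library must
      // not destroy it.
      *bp = msg.cached;
      *own_buffer = false;
    } else {
      // A freshly created buffer is handed over to the library, which now
      // owns it and destroys it after the write.
      grpc_slice s = msg.payload;  // shallow copy; create() takes its own ref
      *bp = grpc_raw_byte_buffer_create(&s, 1);
      *own_buffer = true;
    }
    return Status::OK;
  }
  // Deserialize omitted; only the send path matters for this discussion.
};

}  // namespace grpc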


@@ -255,10 +255,13 @@ cdef class ByteBuffer:

   def __cinit__(self, bytes data):
     grpc_init()
+    self._own_buffer = True
Member

Similar comment here as in C++. Is there any case where this attribute will actually be true?

Contributor Author

This would end up being true for write ops. However, it is no longer used in the current implementation.

@kpayson64
Contributor Author

I've switched up the implementation a bit to address memory leaks.

@grpc-kokoro
[trickle] No significant performance differences

@grpc-kokoro
[microbenchmarks] Performance differences noted:
Benchmark                                                                               atm_add_per_iteration    atm_cas_per_iteration    cpu_time    locks_per_iteration    nows_per_iteration    real_time
--------------------------------------------------------------------------------------  -----------------------  -----------------------  ----------  ---------------------  --------------------  -----------
BM_PumpStreamClientToServer<InProcess>/128M                                                                                               +10%                                                     +9%
BM_PumpStreamClientToServer<InProcess>/256k                                                                                               +23%                                                     +24%
BM_PumpStreamClientToServer<InProcess>/32k                                                                                                +20%                                                     +20%
BM_PumpStreamServerToClient<InProcess>/128M                                                                                               +7%                                                      +7%
BM_PumpStreamServerToClient<InProcess>/2M                                                                                                 +9%                                                      +7%
BM_PumpStreamServerToClient<InProcess>/32k                                                                                                +22%                                                     +23%
BM_PumpStreamServerToClient<InProcessCHTTP2>/256k                                                                                                                                                  +4%
BM_PumpStreamServerToClient<InProcessCHTTP2>/32k                                                                                          +10%                                                     +9%
BM_PumpStreamServerToClient<TCP>/128M                                                                                                     +10%                                                     +9%
BM_PumpStreamServerToClient<UDS>/2M                                                                                                       +7%                                                      +7%
BM_SliceFromStatic                                                                                                                        +15%                                                     +15%
BM_StreamingPingPong<InProcess, NoOpMutator, NoOpMutator>/128M/2                                                                          +4%                                                      +4%
BM_StreamingPingPong<InProcess, NoOpMutator, NoOpMutator>/16M/2                                                                           +7%                                                      +6%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/16M/2                                                                                 +11%                                                     +11%
BM_StreamingPingPongMsgs<MinTCP, NoOpMutator, NoOpMutator>/128M                                                                           +4%                                                      +4%
BM_StreamingPingPongMsgs<TCP, NoOpMutator, NoOpMutator>/128M                                                                              +4%                                                      +4%
BM_StreamingPingPongWithCoalescingApi<InProcess, NoOpMutator, NoOpMutator>/128M/1/0                                                       +7%                                                      +7%
BM_StreamingPingPongWithCoalescingApi<InProcess, NoOpMutator, NoOpMutator>/128M/1/1                                                                                          +5%
BM_StreamingPingPongWithCoalescingApi<InProcess, NoOpMutator, NoOpMutator>/128M/2/1                                                       +4%                                                      +4%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/128M/1/0                                                    +4%                                                      +4%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/128M/1/1                                                                                       +4%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/128M/2/0                                                    +6%                                                      +6%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/128M/2/1                                                    +6%                                                      +6%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/16M/1/1                                                     +10%                                                     +10%
BM_TransportStreamSend/128M                                                             -99%                     -98%                     -99%                                                     -99%
BM_TransportStreamSend/16M                                                              -99%                     -88%                     -99%                                                     -99%
BM_TransportStreamSend/256k                                                             -73%                                              -76%                                                     -76%
BM_TransportStreamSend/2M                                                               -95%                     -50%                     -96%                                                     -96%
BM_TransportStreamSend/32k                                                              -33%                                              -29%                                                     -29%
BM_TransportStreamSend/4k                                                               -14%
BM_TransportStreamSend/512                                                              -14%
BM_TransportStreamSend/64                                                               -14%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                -18%                     -24%        -4%                                          -25%
BM_UnaryPingPong<MinInProcess, NoOpMutator, NoOpMutator>/0/16M                                                   +9%
BM_UnaryPingPong<MinInProcess, NoOpMutator, NoOpMutator>/0/2M                                                    +7%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                             -16%                     -20%                                                     -21%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/128M/0                                                                                    +5%                                                      +5%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/128M/128M                                                                                                                                          +4%

@ctiller
Member
ctiller commented Aug 29, 2017

If the transport stream op batch structure owns the payload, I'd expect to see some cleanup code in grpc_transport_stream_op_batch_finish_with_failure.

@kpayson64
Contributor Author

@ctiller
More specifically, I've changed grpc_slice_buffer_stream to take ownership of the slices in the backing stream. As long as we are calling grpc_buffer_stream_destroy at the end of the transport stream ops, the memory gets freed. (I've confirmed the "early free" behavior).
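
For context, a minimal sketch of the hand-off being described, using the internal byte_stream API roughly as it looked at the time of this PR (grpc_exec_ctx still threaded through; signatures approximate, and this is not the PR's actual diff):

#include "src/core/lib/transport/byte_stream.h"

// Sketch: the write path wraps the byte buffer's slice buffer (assuming a
// GRPC_BB_RAW buffer) in a grpc_slice_buffer_stream; after this PR the
// stream owns the references to those slices.
static void send_path_sketch(grpc_exec_ctx* exec_ctx, grpc_byte_buffer* bb) {
  grpc_slice_buffer_stream stream;
  grpc_slice_buffer_stream_init(&stream, &bb->data.raw.slice_buffer, 0);

  // ... the transport pulls slices out of the stream and writes them ...

  // Destroying the stream at the end of the transport stream op (including
  // the grpc_transport_stream_op_batch_finish_with_failure path) unrefs the
  // slices it took ownership of, so nothing leaks if the op fails early.
  grpc_byte_stream_destroy(exec_ctx, &stream.base);

  // The application still calls grpc_byte_buffer_destroy(bb) later, but by
  // now the buffer's contents are no longer guaranteed.
}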

@grpc-kokoro
[trickle] No significant performance differences

@grpc-kokoro
[microbenchmarks] Performance differences noted:
Benchmark                                                                                  atm_add_per_iteration    atm_cas_per_iteration    cpu_time    locks_per_iteration    real_time
-----------------------------------------------------------------------------------------  -----------------------  -----------------------  ----------  ---------------------  -----------
BM_PumpStreamClientToServer<InProcessCHTTP2>/16M                                                                                             -4%
BM_SliceFromStatic                                                                                                                           +8%                                +7%
BM_StreamingPingPongWithCoalescingApi<InProcessCHTTP2, NoOpMutator, NoOpMutator>/128M/2/0                                                    -9%                                -9%
BM_TransportStreamSend/128M                                                                -99%                     -98%                     -99%                               -99%
BM_TransportStreamSend/16M                                                                 -99%                     -88%                     -99%                               -99%
BM_TransportStreamSend/256k                                                                -73%                                              -75%                               -75%
BM_TransportStreamSend/2M                                                                  -95%                     -50%                     -96%                               -96%
BM_TransportStreamSend/32k                                                                 -33%                                              -23%                               -23%
BM_TransportStreamSend/4k                                                                  -14%
BM_TransportStreamSend/512                                                                 -14%
BM_TransportStreamSend/64                                                                  -14%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                   -26%                     -35%                               -35%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                -34%                     -31%        -4%                    -31%

@markdroth
Member

Please add a comment documenting the semantics somewhere around here:

https://github.com/grpc/grpc/blob/master/include/grpc/impl/codegen/grpc_types.h#L514
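
For reference, one possible wording, based on the semantics described in this PR (hedged; not necessarily the comment that actually landed next to the send_message field in grpc_types.h):

/** Message to be sent by the call.
 *
 *  After a GRPC_OP_SEND_MESSAGE op is passed to grpc_call_start_batch, the
 *  transport may take ownership of (steal the references to) the slices
 *  inside this byte buffer. The caller remains responsible for calling
 *  grpc_byte_buffer_destroy on it once the batch completes, but must make
 *  no assumptions about the buffer's contents after the batch has been
 *  started. */
struct grpc_byte_buffer *send_message;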

@grpc-kokoro
[microbenchmarks] Performance differences noted:
Benchmark                                                                                     atm_add_per_iteration    atm_cas_per_iteration    cpu_time    locks_per_iteration    real_time
--------------------------------------------------------------------------------------------  -----------------------  -----------------------  ----------  ---------------------  -----------
BM_StreamingPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/128M/2                                                                          -5%                                -5%
BM_StreamingPingPongMsgs<MinInProcess, NoOpMutator, NoOpMutator>/32k                                                                            +10%                               +9%
BM_StreamingPingPongMsgs<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/32k                                                                      +9%                                +8%
BM_StreamingPingPongWithCoalescingApi<InProcessCHTTP2, NoOpMutator, NoOpMutator>/128M/2/0                                                       -5%                                -4%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/32k/1/0                                                           +8%                                +8%
BM_StreamingPingPongWithCoalescingApi<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/128M/2/0                                                    -6%                                -6%
BM_StreamingPingPongWithCoalescingApi<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/128M/2/1                                                    -4%                                -4%
BM_TransportStreamSend/128M                                                                   -99%                     -98%                     -99%                               -99%
BM_TransportStreamSend/16M                                                                    -99%                     -88%                     -99%                               -99%
BM_TransportStreamSend/256k                                                                   -73%                                              -75%                               -75%
BM_TransportStreamSend/2M                                                                     -95%                     -50%                     -96%                               -96%
BM_TransportStreamSend/32k                                                                    -33%                                              -28%                               -28%
BM_TransportStreamSend/4k                                                                     -14%
BM_TransportStreamSend/512                                                                    -14%
BM_TransportStreamSend/64                                                                     -14%                                              -8%                                -8%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                      -30%                     -35%                               -34%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/2M/0                                                       -12%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                   -24%                     -36%        -4%                    -36%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2M/0                                                    -6%                      -11%                               -12%

@grpc-kokoro
Corrupt JSON data (indicates timeout or crash): 
    bm_fullstack_trickle.BM_PumpStreamServerToClient_Trickle_2M_32k.opt.new: 1


[trickle] No significant performance differences

@kpayson64
Contributor Author

Issues:
#12315
#11737
#10643

@RyanGordon

In developing a gRPC HHVM extension, I've come across a memory corruption issue related to this. The issue appears to happen in the following situation:

  • In the client, a low timeout is set on grpc_completion_queue_next, and the timeout is hit before the response is received by the client.
  • Thus grpc_byte_buffer_destroy is called to clean up, since grpc_completion_queue_next has returned.
  • Then, at some point, the response may still be received, and grpc_slice_buffer_reset_and_unref_internal will be called on already-freed memory, potentially causing memory corruption that leads to other issues.

This behavior is observed in valgrind in the timeout case as follows:

==21456== Invalid read of size 8
==21456==    at 0x1678C591: grpc_slice_buffer_reset_and_unref_internal (slice_buffer.c:169)
==21456==    by 0x167A1143: slice_buffer_stream_destroy (byte_stream.c:92)
==21456==    by 0x167A0F1A: grpc_byte_stream_destroy (byte_stream.c:49)
==21456==    by 0x167A6D78: grpc_transport_stream_op_batch_finish_with_failure (transport.c:218)
==21456==    by 0x167E782B: waiting_for_pick_batches_fail (client_channel.c:909)
==21456==    by 0x167E8030: pick_done_locked (client_channel.c:1038)
==21456==    by 0x167E80E2: async_pick_done_locked (client_channel.c:1056)
==21456==    by 0x167E86F4: pick_after_resolver_result_cancel_locked (client_channel.c:1196)
==21456==    by 0x16763568: grpc_combiner_continue_exec_ctx (combiner.c:259)
==21456==    by 0x167758B0: grpc_exec_ctx_flush (exec_ctx.c:93)
==21456==    by 0x16775A57: run_closures (executor.c:81)
==21456==    by 0x167760FB: executor_thread (executor.c:181)
==21456==  Address 0x15cc5208 is 40 bytes inside a block of size 320 free'd
==21456==    at 0x51C4CDD: free (vg_replace_malloc.c:530)
==21456==    by 0x16811FAE: gpr_free (alloc.c:78)
==21456==    by 0x1678F659: grpc_byte_buffer_destroy (byte_buffer.c:80)
==21456==    by 0x164B46F4: HPHP::OpsManaged::freeMessage(grpc_byte_buffer*) (call.cpp:309)
==21456==    by 0x164BD39F: void (*std::for_each<grpc_byte_buffer**, void (*)(grpc_byte_buffer*)>(grpc_byte_buffer**, grpc_byte_buffer**, void (*)(grpc_byte_buffer*)))(grpc_byte_buffer*) (stl_algo.h:3756)
==21456==    by 0x164B4697: HPHP::OpsManaged::destroy() (call.cpp:302)
==21456==    by 0x164B456D: HPHP::OpsManaged::~OpsManaged() (call.cpp:291)
==21456==    by 0x164BE8DF: std::default_delete<HPHP::OpsManaged>::operator()(HPHP::OpsManaged*) const (unique_ptr.h:76)
==21456==    by 0x164BD0B4: std::unique_ptr<HPHP::OpsManaged, std::default_delete<HPHP::OpsManaged> >::reset(HPHP::OpsManaged*) (unique_ptr.h:344)
==21456==    by 0x164BD159: std::unique_ptr<HPHP::OpsManaged, std::default_delete<HPHP::OpsManaged> >::operator=(std::unique_ptr<HPHP::OpsManaged, std::default_delete<HPHP::OpsManaged> >&&) (unique_ptr.h:251)
==21456==    by 0x164B24E9: HPHP::CallData::setOpsManaged(std::unique_ptr<HPHP::OpsManaged, std::default_delete<HPHP::OpsManaged> >&&) (call.cpp:127)
==21456==    by 0x164B4F6D: HPHP::c_Call_ni_startBatch(HPHP::ObjectData*, HPHP::Array const&) (call.cpp:391)

@kpayson64
Contributor Author

@RyanGordon Do you have a pointer to the HHVM code that you are using that is running into this issue?

Specifically, I'm interested in your second bullet point:
- Thus grpc_byte_buffer_destroy is called to clean up, since grpc_completion_queue_next has returned.

You shouldn't be calling grpc_byte_buffer_destroy until the operation has completed. A completion_queue_next call timing out is not the same as the call timing out; the latter needs to happen before you destroy any memory passed in with the call.

@RyanGordon

@kpayson64 Yes absolutely:

This is where we do the completion_queue_next call with the timeout argument: https://github.com/grpc/grpc/pull/11553/files#diff-1a64209204460e9c92631584fb266047R597

If the timeout happens then we call grpc_call_cancel_with_status: https://github.com/grpc/grpc/pull/11553/files#diff-1a64209204460e9c92631584fb266047R576

And then before we start the next call grpc_byte_buffer_destroy gets called: https://github.com/grpc/grpc/pull/11553/files#diff-1a64209204460e9c92631584fb266047R312

I guess I don't know how we would know when the call itself has timed out, or how to wait on that differently from what we currently do?

@kpayson64
Contributor Author

@RyanGordon
I think you should rely on the call timing out rather than grpc_completion_queue_next() timing out.

grpc_channel_create_call() has a parameter for a deadline. If the deadline is reached, the call will be cancelled internally, and grpc_completion_queue_next() will return an event that the call has finished (unsuccessfully). At this point it is safe to destroy memory associated with the call.

If for some reason you need to manually call grpc_call_cancel_with_status(), you should call grpc_completion_queue_next() after cancelling the call, which will return an unsuccessful completion for the call. At this point, it is safe to call OpsManaged::freeMessage.
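
A hedged sketch of that pattern against the core C API (names such as channel, cq, method, and send_buf are assumed; error handling omitted):

#include <grpc/grpc.h>
#include <grpc/support/time.h>

// Give the call itself a deadline instead of timing out the completion queue.
grpc_call* StartCallWithDeadline(grpc_channel* channel,
                                 grpc_completion_queue* cq, grpc_slice method) {
  gpr_timespec deadline = gpr_time_add(
      gpr_now(GPR_CLOCK_MONOTONIC), gpr_time_from_seconds(5, GPR_TIMESPAN));
  // The core cancels the call internally once this deadline passes.
  return grpc_channel_create_call(channel, /*parent_call=*/nullptr,
                                  GRPC_PROPAGATE_DEFAULTS, cq, method,
                                  /*host=*/nullptr, deadline,
                                  /*reserved=*/nullptr);
}

// Free memory passed in with the call only after the completion queue reports
// the batch as complete (successfully or not).
void WaitThenCleanup(grpc_completion_queue* cq, grpc_byte_buffer* send_buf) {
  grpc_event ev = grpc_completion_queue_next(
      cq, gpr_inf_future(GPR_CLOCK_REALTIME), nullptr);
  // ev.success is 0 if the deadline was hit or the call was cancelled, but
  // either way the batch is finished, so it is now safe to clean up.
  (void)ev;
  grpc_byte_buffer_destroy(send_buf);
}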

@RyanGordon

@kpayson64 Thank you for explaining that - definitely did not realize the interactions there.

Honestly, the only reason we added a timeout to grpc_completion_queue_next is that during stress testing epoll_wait/epoll_pwait would sometimes hang indefinitely and we couldn't track down why. We think it might be related to this or to other memory corruption issues, but unfortunately we haven't been able to pin down the cause yet.

lock bot locked as resolved and limited conversation to collaborators on Jan 22, 2019