
Give ownership of byte buffers to write ops #12306

Merged
merged 4 commits on Aug 31, 2017

Conversation

@kpayson64 (Contributor) commented Aug 28, 2017

This implementation does not give ownership of the entire byte_buffer to the transport, but rather allows the transport to steal the reference to the slices.

The reason for this is that the transport only holds on to the slice buffer, which is a field in the byte buffer.

The application is still responsible for calling grpc_byte_buffer_destroy(), but the contents of that buffer are no longer guaranteed.
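
To make the new contract concrete, here is a hedged sketch (not code from this PR) of how a core-API caller is expected to treat a byte buffer handed to a send op after this change; call, cq, and tag are assumed to already exist, and initial metadata is assumed to have been sent:

#include <grpc/byte_buffer.h>
#include <grpc/grpc.h>
#include <grpc/slice.h>
#include <grpc/support/time.h>

#include <cstring>

// Sketch only: send one message and clean up under the new ownership rules.
void SendOneMessage(grpc_call* call, grpc_completion_queue* cq, void* tag) {
  grpc_slice slice = grpc_slice_from_copied_string("payload");
  grpc_byte_buffer* bb = grpc_raw_byte_buffer_create(&slice, 1);
  grpc_slice_unref(slice);  // the byte buffer holds its own slice reference

  grpc_op op;
  std::memset(&op, 0, sizeof(op));
  op.op = GRPC_OP_SEND_MESSAGE;
  op.data.send_message.send_message = bb;
  grpc_call_start_batch(call, &op, 1, tag, nullptr);

  // After this PR the transport may steal the slices inside bb, so its
  // contents must not be read or reused once the batch has been started.
  grpc_event ev = grpc_completion_queue_next(
      cq, gpr_inf_future(GPR_CLOCK_REALTIME), nullptr);
  (void)ev;

  // The application still owns the container and must destroy it.
  grpc_byte_buffer_destroy(bb);
}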

@grpc-kokoro
[trickle] No significant performance differences

@grpc-kokoro
[microbenchmarks] Performance differences noted:
Benchmark                                                                                 atm_add_per_iteration    atm_cas_per_iteration    cpu_time    real_time
----------------------------------------------------------------------------------------  -----------------------  -----------------------  ----------  -----------
BM_PumpStreamServerToClient<InProcessCHTTP2>/16M                                                                                            -6%         -6%
BM_StreamingPingPongWithCoalescingApi<InProcessCHTTP2, NoOpMutator, NoOpMutator>/16M/2/1                                                    -6%         -6%
BM_TransportStreamSend/128M                                                               -99%                     -98%                     -99%        -99%
BM_TransportStreamSend/16M                                                                -99%                     -88%                     -99%        -99%
BM_TransportStreamSend/256k                                                               -73%                                              -74%        -74%
BM_TransportStreamSend/2M                                                                 -95%                     -50%                     -96%        -96%
BM_TransportStreamSend/32k                                                                -33%                                              -26%        -26%
BM_TransportStreamSend/4k                                                                 -14%
BM_TransportStreamSend/512                                                                -14%
BM_TransportStreamSend/64                                                                 -14%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                  -28%                     -41%        -41%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                               -29%                     -32%        -33%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2M/0                                                -5%

Makefile Outdated
@@ -105,7 +105,7 @@ CC_asan = clang
CXX_asan = clang++
LD_asan = clang++
LDXX_asan = clang++
CPPFLAGS_asan = -O0 -fsanitize-coverage=edge -fsanitize=address -fno-omit-frame-pointer -Wno-unused-command-line-argument -DGPR_NO_DIRECT_SYSCALLS
Member

Why is this changing?

Contributor Author

I was running ASAN tests on an older clang, so I removed this for testing. I've reverted it.

@@ -308,7 +308,11 @@ class CallOpSendMessage {
 template <class M>
 Status CallOpSendMessage::SendMessage(const M& message, WriteOptions options) {
   write_options_ = options;
-  return SerializationTraits<M>::Serialize(message, &send_buf_, &own_buf_);
+  Status result = SerializationTraits<M>::Serialize(message, &send_buf_, &own_buf_);
+  if (own_buf_) {
Member

Under which situations is own_buf_ true?

Member

(should we just eliminate this flag from the API?)

Member

+1 -- removing this option seems like the right thing. I'm not really sure why it's here to begin with.

Contributor Author

I've left own_buf in there because SerializationTraits is part of the public API.

Removing the ability to not free the byte buffer would be an API-breaking change, so I'd rather defer that change.
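
For readers wondering when own_buf_ would actually differ, here is a hedged sketch (a hypothetical RawMessage type, not from the gRPC tree) of a custom SerializationTraits specialization whose Serialize signature matches the call site in the diff above:

#include <grpc++/impl/codegen/serialization_traits.h>
#include <grpc++/impl/codegen/status.h>
#include <grpc/byte_buffer.h>
#include <grpc/slice.h>

// Hypothetical application type that sometimes already has a serialized form.
struct RawMessage {
  grpc_byte_buffer* cached;  // non-null if a serialized buffer is cached
  grpc_slice payload;        // otherwise serialize from this slice
};

namespace grpc {

template <>
class SerializationTraits<RawMessage> {
 public:
  static Status Serialize(const RawMessage& msg, grpc_byte_buffer** bp,
                          bool* own_buffer) {
    if (msg.cached != nullptr) {
      // The message keeps ownership of its cached buffer; the library must
      // not destroy it.
      *bp = msg.cached;
      *own_buffer = false;
    } else {
      // A freshly created buffer is handed over to the library, which now
      // owns it and destroys it after the write.
      grpc_slice s = msg.payload;  // shallow copy; create() takes its own ref
      *bp = grpc_raw_byte_buffer_create(&s, 1);
      *own_buffer = true;
    }
    return Status::OK;
  }
  // Deserialize omitted; only the send path matters for this discussion.
};

}  // namespace grpc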


@@ -255,10 +255,13 @@ cdef class ByteBuffer:

   def __cinit__(self, bytes data):
     grpc_init()
+    self._own_buffer = True
Member

Similar comment here as in C++. Is there any case where this attribute will actually be true?

Contributor Author

This would end up being true for write ops. However, it is no longer used in the current implementation.

@kpayson64
Contributor Author

I've switched up the implementation a bit to address memory leaks.

@grpc-kokoro
[trickle] No significant performance differences

@grpc-kokoro
[microbenchmarks] Performance differences noted:
Benchmark                                                                               atm_add_per_iteration    atm_cas_per_iteration    cpu_time    locks_per_iteration    nows_per_iteration    real_time
--------------------------------------------------------------------------------------  -----------------------  -----------------------  ----------  ---------------------  --------------------  -----------
BM_PumpStreamClientToServer<InProcess>/128M                                                                                               +10%                                                     +9%
BM_PumpStreamClientToServer<InProcess>/256k                                                                                               +23%                                                     +24%
BM_PumpStreamClientToServer<InProcess>/32k                                                                                                +20%                                                     +20%
BM_PumpStreamServerToClient<InProcess>/128M                                                                                               +7%                                                      +7%
BM_PumpStreamServerToClient<InProcess>/2M                                                                                                 +9%                                                      +7%
BM_PumpStreamServerToClient<InProcess>/32k                                                                                                +22%                                                     +23%
BM_PumpStreamServerToClient<InProcessCHTTP2>/256k                                                                                                                                                  +4%
BM_PumpStreamServerToClient<InProcessCHTTP2>/32k                                                                                          +10%                                                     +9%
BM_PumpStreamServerToClient<TCP>/128M                                                                                                     +10%                                                     +9%
BM_PumpStreamServerToClient<UDS>/2M                                                                                                       +7%                                                      +7%
BM_SliceFromStatic                                                                                                                        +15%                                                     +15%
BM_StreamingPingPong<InProcess, NoOpMutator, NoOpMutator>/128M/2                                                                          +4%                                                      +4%
BM_StreamingPingPong<InProcess, NoOpMutator, NoOpMutator>/16M/2                                                                           +7%                                                      +6%
BM_StreamingPingPong<TCP, NoOpMutator, NoOpMutator>/16M/2                                                                                 +11%                                                     +11%
BM_StreamingPingPongMsgs<MinTCP, NoOpMutator, NoOpMutator>/128M                                                                           +4%                                                      +4%
BM_StreamingPingPongMsgs<TCP, NoOpMutator, NoOpMutator>/128M                                                                              +4%                                                      +4%
BM_StreamingPingPongWithCoalescingApi<InProcess, NoOpMutator, NoOpMutator>/128M/1/0                                                       +7%                                                      +7%
BM_StreamingPingPongWithCoalescingApi<InProcess, NoOpMutator, NoOpMutator>/128M/1/1                                                                                          +5%
BM_StreamingPingPongWithCoalescingApi<InProcess, NoOpMutator, NoOpMutator>/128M/2/1                                                       +4%                                                      +4%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/128M/1/0                                                    +4%                                                      +4%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/128M/1/1                                                                                       +4%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/128M/2/0                                                    +6%                                                      +6%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/128M/2/1                                                    +6%                                                      +6%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/16M/1/1                                                     +10%                                                     +10%
BM_TransportStreamSend/128M                                                             -99%                     -98%                     -99%                                                     -99%
BM_TransportStreamSend/16M                                                              -99%                     -88%                     -99%                                                     -99%
BM_TransportStreamSend/256k                                                             -73%                                              -76%                                                     -76%
BM_TransportStreamSend/2M                                                               -95%                     -50%                     -96%                                                     -96%
BM_TransportStreamSend/32k                                                              -33%                                              -29%                                                     -29%
BM_TransportStreamSend/4k                                                               -14%
BM_TransportStreamSend/512                                                              -14%
BM_TransportStreamSend/64                                                               -14%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                -18%                     -24%        -4%                                          -25%
BM_UnaryPingPong<MinInProcess, NoOpMutator, NoOpMutator>/0/16M                                                   +9%
BM_UnaryPingPong<MinInProcess, NoOpMutator, NoOpMutator>/0/2M                                                    +7%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                             -16%                     -20%                                                     -21%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/128M/0                                                                                    +5%                                                      +5%
BM_UnaryPingPong<TCP, NoOpMutator, NoOpMutator>/128M/128M                                                                                                                                          +4%

@ctiller
Member
ctiller commented Aug 29, 2017

If the transport stream op batch structure owns the payload, I'd expect to see some cleanup code in grpc_transport_stream_op_batch_finish_with_failure.

@kpayson64
Contributor Author

@ctiller
More specifically, I've changed grpc_slice_buffer_stream to take ownership of the slices in the backing stream. As long as we are calling grpc_buffer_stream_destroy at the end of the transport stream ops, the memory gets freed. (I've confirmed the "early free" behavior).
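
For context, a minimal sketch of the hand-off being described, using the internal byte_stream API roughly as it looked at the time of this PR (grpc_exec_ctx still threaded through; signatures approximate, and this is not the PR's actual diff):

#include "src/core/lib/transport/byte_stream.h"

// Sketch: the write path wraps the byte buffer's slice buffer (assuming a
// GRPC_BB_RAW buffer) in a grpc_slice_buffer_stream; after this PR the
// stream owns the references to those slices.
static void send_path_sketch(grpc_exec_ctx* exec_ctx, grpc_byte_buffer* bb) {
  grpc_slice_buffer_stream stream;
  grpc_slice_buffer_stream_init(&stream, &bb->data.raw.slice_buffer, 0);

  // ... the transport pulls slices out of the stream and writes them ...

  // Destroying the stream at the end of the transport stream op (including
  // the grpc_transport_stream_op_batch_finish_with_failure path) unrefs the
  // slices it took ownership of, so nothing leaks if the op fails early.
  grpc_byte_stream_destroy(exec_ctx, &stream.base);

  // The application still calls grpc_byte_buffer_destroy(bb) later, but by
  // now the buffer's contents are no longer guaranteed.
}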

@grpc-kokoro
[trickle] No significant performance differences

@grpc-kokoro
[microbenchmarks] Performance differences noted:
Benchmark                                                                                  atm_add_per_iteration    atm_cas_per_iteration    cpu_time    locks_per_iteration    real_time
-----------------------------------------------------------------------------------------  -----------------------  -----------------------  ----------  ---------------------  -----------
BM_PumpStreamClientToServer<InProcessCHTTP2>/16M                                                                                             -4%
BM_SliceFromStatic                                                                                                                           +8%                                +7%
BM_StreamingPingPongWithCoalescingApi<InProcessCHTTP2, NoOpMutator, NoOpMutator>/128M/2/0                                                    -9%                                -9%
BM_TransportStreamSend/128M                                                                -99%                     -98%                     -99%                               -99%
BM_TransportStreamSend/16M                                                                 -99%                     -88%                     -99%                               -99%
BM_TransportStreamSend/256k                                                                -73%                                              -75%                               -75%
BM_TransportStreamSend/2M                                                                  -95%                     -50%                     -96%                               -96%
BM_TransportStreamSend/32k                                                                 -33%                                              -23%                               -23%
BM_TransportStreamSend/4k                                                                  -14%
BM_TransportStreamSend/512                                                                 -14%
BM_TransportStreamSend/64                                                                  -14%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                   -26%                     -35%                               -35%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                -34%                     -31%        -4%                    -31%

@markdroth
Member

Please add a comment documenting the semantics somewhere around here:

https://github.com/grpc/grpc/blob/master/include/grpc/impl/codegen/grpc_types.h#L514
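
For reference, one possible wording, based on the semantics described in this PR (hedged; not necessarily the comment that actually landed next to the send_message field in grpc_types.h):

/** Message to be sent by the call.
 *
 *  After a GRPC_OP_SEND_MESSAGE op is passed to grpc_call_start_batch, the
 *  transport may take ownership of (steal the references to) the slices
 *  inside this byte buffer. The caller remains responsible for calling
 *  grpc_byte_buffer_destroy on it once the batch completes, but must make
 *  no assumptions about the buffer's contents after the batch has been
 *  started. */
struct grpc_byte_buffer *send_message;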

@grpc-kokoro
[microbenchmarks] Performance differences noted:
Benchmark                                                                                     atm_add_per_iteration    atm_cas_per_iteration    cpu_time    locks_per_iteration    real_time
--------------------------------------------------------------------------------------------  -----------------------  -----------------------  ----------  ---------------------  -----------
BM_StreamingPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/128M/2                                                                          -5%                                -5%
BM_StreamingPingPongMsgs<MinInProcess, NoOpMutator, NoOpMutator>/32k                                                                            +10%                               +9%
BM_StreamingPingPongMsgs<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/32k                                                                      +9%                                +8%
BM_StreamingPingPongWithCoalescingApi<InProcessCHTTP2, NoOpMutator, NoOpMutator>/128M/2/0                                                       -5%                                -4%
BM_StreamingPingPongWithCoalescingApi<MinInProcess, NoOpMutator, NoOpMutator>/32k/1/0                                                           +8%                                +8%
BM_StreamingPingPongWithCoalescingApi<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/128M/2/0                                                    -6%                                -6%
BM_StreamingPingPongWithCoalescingApi<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/128M/2/1                                                    -4%                                -4%
BM_TransportStreamSend/128M                                                                   -99%                     -98%                     -99%                               -99%
BM_TransportStreamSend/16M                                                                    -99%                     -88%                     -99%                               -99%
BM_TransportStreamSend/256k                                                                   -73%                                              -75%                               -75%
BM_TransportStreamSend/2M                                                                     -95%                     -50%                     -96%                               -96%
BM_TransportStreamSend/32k                                                                    -33%                                              -28%                               -28%
BM_TransportStreamSend/4k                                                                     -14%
BM_TransportStreamSend/512                                                                    -14%
BM_TransportStreamSend/64                                                                     -14%                                              -8%                                -8%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                      -30%                     -35%                               -34%
BM_UnaryPingPong<InProcessCHTTP2, NoOpMutator, NoOpMutator>/2M/0                                                       -12%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/0/16M                                                   -24%                     -36%        -4%                    -36%
BM_UnaryPingPong<MinInProcessCHTTP2, NoOpMutator, NoOpMutator>/2M/0                                                    -6%                      -11%                               -12%

@grpc-kokoro
Corrupt JSON data (indicates timeout or crash): 
    bm_fullstack_trickle.BM_PumpStreamServerToClient_Trickle_2M_32k.opt.new: 1


[trickle] No significant performance differences

@kpayson64
Contributor Author

Issues:
#12315
#11737
#10643

@RyanGordon

In developing a gRPC HHVM extension, I've come across a memory corruption issue related to this. The issue appears to happen in the following situation:

  • In the client, a low timeout is set on grpc_completion_queue_next, and the timeout is hit before the response is received by the client.
  • Thus grpc_byte_buffer_destroy is called to clean up, since grpc_completion_queue_next has returned.
  • Then, at some point, the response may still be received, and grpc_slice_buffer_reset_and_unref_internal will be called on already-freed memory, potentially causing memory corruption that leads to other issues.

This behavior is observed in valgrind in the timeout case as follows:

==21456== Invalid read of size 8
==21456==    at 0x1678C591: grpc_slice_buffer_reset_and_unref_internal (slice_buffer.c:169)
==21456==    by 0x167A1143: slice_buffer_stream_destroy (byte_stream.c:92)
==21456==    by 0x167A0F1A: grpc_byte_stream_destroy (byte_stream.c:49)
==21456==    by 0x167A6D78: grpc_transport_stream_op_batch_finish_with_failure (transport.c:218)
==21456==    by 0x167E782B: waiting_for_pick_batches_fail (client_channel.c:909)
==21456==    by 0x167E8030: pick_done_locked (client_channel.c:1038)
==21456==    by 0x167E80E2: async_pick_done_locked (client_channel.c:1056)
==21456==    by 0x167E86F4: pick_after_resolver_result_cancel_locked (client_channel.c:1196)
==21456==    by 0x16763568: grpc_combiner_continue_exec_ctx (combiner.c:259)
==21456==    by 0x167758B0: grpc_exec_ctx_flush (exec_ctx.c:93)
==21456==    by 0x16775A57: run_closures (executor.c:81)
==21456==    by 0x167760FB: executor_thread (executor.c:181)
==21456==  Address 0x15cc5208 is 40 bytes inside a block of size 320 free'd
==21456==    at 0x51C4CDD: free (vg_replace_malloc.c:530)
==21456==    by 0x16811FAE: gpr_free (alloc.c:78)
==21456==    by 0x1678F659: grpc_byte_buffer_destroy (byte_buffer.c:80)
==21456==    by 0x164B46F4: HPHP::OpsManaged::freeMessage(grpc_byte_buffer*) (call.cpp:309)
==21456==    by 0x164BD39F: void (*std::for_each<grpc_byte_buffer**, void (*)(grpc_byte_buffer*)>(grpc_byte_buffer**, grpc_byte_buffer**, void (*)(grpc_byte_buffer*)))(grpc_byte_buffer*) (stl_algo.h:3756)
==21456==    by 0x164B4697: HPHP::OpsManaged::destroy() (call.cpp:302)
==21456==    by 0x164B456D: HPHP::OpsManaged::~OpsManaged() (call.cpp:291)
==21456==    by 0x164BE8DF: std::default_delete<HPHP::OpsManaged>::operator()(HPHP::OpsManaged*) const (unique_ptr.h:76)
==21456==    by 0x164BD0B4: std::unique_ptr<HPHP::OpsManaged, std::default_delete<HPHP::OpsManaged> >::reset(HPHP::OpsManaged*) (unique_ptr.h:344)
==21456==    by 0x164BD159: std::unique_ptr<HPHP::OpsManaged, std::default_delete<HPHP::OpsManaged> >::operator=(std::unique_ptr<HPHP::OpsManaged, std::default_delete<HPHP::OpsManaged> >&&) (unique_ptr.h:251)
==21456==    by 0x164B24E9: HPHP::CallData::setOpsManaged(std::unique_ptr<HPHP::OpsManaged, std::default_delete<HPHP::OpsManaged> >&&) (call.cpp:127)
==21456==    by 0x164B4F6D: HPHP::c_Call_ni_startBatch(HPHP::ObjectData*, HPHP::Array const&) (call.cpp:391)

@kpayson64
Contributor Author

@RyanGordon Do you have a pointer to the HHVM code that you are using that is running into this issue?

Specifically, I'm interested in your second bullet point:
- Thus grpc_byte_buffer_destroy is called to clean up, since grpc_completion_queue_next has returned.

You shouldn't be calling grpc_byte_buffer_destroy until the operation has completed. A completion_queue_next call timing out is not the same as the call timing out; the latter needs to happen before you destroy any memory passed in with the call.

@RyanGordon

@kpayson64 Yes absolutely:

This is where we do the completion_queue_next call with the timeout argument: https://github.com/grpc/grpc/pull/11553/files#diff-1a64209204460e9c92631584fb266047R597

If the timeout happens then we call grpc_call_cancel_with_status: https://github.com/grpc/grpc/pull/11553/files#diff-1a64209204460e9c92631584fb266047R576

And then before we start the next call grpc_byte_buffer_destroy gets called: https://github.com/grpc/grpc/pull/11553/files#diff-1a64209204460e9c92631584fb266047R312

I guess I don't know how we would know when the call itself has timed out, or how to wait on that differently from what we currently do?

@kpayson64
Contributor Author

@RyanGordon
I think you should rely on the call timing out rather than grpc_completion_queue_next() timing out.

grpc_channel_create_call() has a parameter for a deadline. If the deadline is reached, the call will be cancelled internally, and grpc_completion_queue_next() will return an event that the call has finished (unsuccessfully). At this point it is safe to destroy memory associated with the call.

If for some reason you need to manually call grpc_call_cancel_with_status(), you should call grpc_completion_queue_next() after cancelling the call, which will return an unsuccessful completion for the call. At this point, it is safe to call OpsManaged::freeMessage.
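
A hedged sketch of that pattern against the core C API (names such as channel, cq, method, and send_buf are assumed; error handling omitted):

#include <grpc/grpc.h>
#include <grpc/support/time.h>

// Give the call itself a deadline instead of timing out the completion queue.
grpc_call* StartCallWithDeadline(grpc_channel* channel,
                                 grpc_completion_queue* cq, grpc_slice method) {
  gpr_timespec deadline = gpr_time_add(
      gpr_now(GPR_CLOCK_MONOTONIC), gpr_time_from_seconds(5, GPR_TIMESPAN));
  // The core cancels the call internally once this deadline passes.
  return grpc_channel_create_call(channel, /*parent_call=*/nullptr,
                                  GRPC_PROPAGATE_DEFAULTS, cq, method,
                                  /*host=*/nullptr, deadline,
                                  /*reserved=*/nullptr);
}

// Free memory passed in with the call only after the completion queue reports
// the batch as complete (successfully or not).
void WaitThenCleanup(grpc_completion_queue* cq, grpc_byte_buffer* send_buf) {
  grpc_event ev = grpc_completion_queue_next(
      cq, gpr_inf_future(GPR_CLOCK_REALTIME), nullptr);
  // ev.success is 0 if the deadline was hit or the call was cancelled, but
  // either way the batch is finished, so it is now safe to clean up.
  (void)ev;
  grpc_byte_buffer_destroy(send_buf);
}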

@RyanGordon

@kpayson64 Thank you for explaining that - definitely did not realize the interactions there.

Honestly, the only reason we added a timeout to grpc_completion_queue_next is that during stress testing epoll_wait/epoll_pwait would sometimes hang indefinitely and we couldn't track down why. We think it might be related to this or to other memory corruption issues, but unfortunately we haven't been able to pin down the cause yet.

lock bot locked as resolved and limited conversation to collaborators on Jan 22, 2019