
Java QpsWorker produces many exceptions #2372


Closed
jtattermusch opened this issue Oct 26, 2016 · 13 comments · Fixed by #2376

@jtattermusch
Contributor

We've recently enabled printing output from QPS workers in the benchmark suite, but the logs are 93 MB, which makes them practically useless.
It seems that's mostly because the Java worker is extremely chatty (tons of exceptions get printed to the console):

https://grpc-testing.appspot.com/view/Performance/job/gRPC_performance_experiment/3192/consoleFull

@ejona86
Member

ejona86 commented Oct 26, 2016

I agree that the log file is very big, but I don't see the exceptions you are talking about. Could you paste a bit of it here or say which scenario it is in?

The most output I saw for a single test was:

Oct 26, 2016 12:03:08 AM io.grpc.internal.ManagedChannelImpl <init>
INFO: [ManagedChannelImpl@677df3d4] Created with target directaddress:///grpc-performance-server-8core/10.240.0.90:38564
Oct 26, 2016 12:03:53 AM io.grpc.benchmarks.driver.LoadClient$GenericAsyncPingPongWorker$1 onClose
INFO: Error in Generic Async Ping-Pong call
Oct 26, 2016 12:03:53 AM io.grpc.internal.ManagedChannelImpl maybeTerminateChannel
INFO: [ManagedChannelImpl@677df3d4] Terminated
Oct 26, 2016 12:03:54 AM io.grpc.benchmarks.driver.LoadWorker$WorkerServiceImpl quitWorker
INFO: Received quitWorker request.
Oct 26, 2016 12:03:54 AM io.grpc.benchmarks.driver.LoadWorker main
INFO: DriverServer has terminated.

@carl-mastrangelo
Contributor

@ejona86 It's a huge number of "call was half-closed" exceptions from sending on calls that were already half-closed.

ejona86 added a commit to ejona86/grpc-java that referenced this issue Oct 26, 2016
This doesn't fix any failure in the qps test, but causes it to produce less
useless log output of the form:

java.lang.IllegalStateException: call was half-closed
	at com.google.common.base.Preconditions.checkState(Preconditions.java:174)
	at io.grpc.internal.ClientCallImpl.sendMessage(ClientCallImpl.java:380)
	at io.grpc.stub.ClientCalls$CallToStreamObserverAdapter.onNext(ClientCalls.java:299)
	at io.grpc.benchmarks.driver.LoadClient$AsyncPingPongWorker$1.onNext(LoadClient.java:406)
	at io.grpc.benchmarks.driver.LoadClient$AsyncPingPongWorker$1.onNext(LoadClient.java:400)
	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:382)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessageRead.runInContext(ClientCallImpl.java:473)
	... 7 more

Fixes grpc#2372
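
The pattern behind these exceptions: the ping-pong worker's response observer immediately sends the next request with onNext(), and during shutdown a response can arrive after the call has already been half-closed. Below is a minimal sketch of the kind of guard that silences this; all names are hypothetical and this is not necessarily what PR #2376 actually does:

import io.grpc.stub.StreamObserver;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a ping-pong response observer that stops
// echoing requests once the worker begins shutting down.
final class GuardedPingPongObserver<ReqT, RespT> implements StreamObserver<RespT> {
  private final StreamObserver<ReqT> requestObserver;
  private final ReqT request;
  private final AtomicBoolean done = new AtomicBoolean();

  GuardedPingPongObserver(StreamObserver<ReqT> requestObserver, ReqT request) {
    this.requestObserver = requestObserver;
    this.request = request;
  }

  @Override public void onNext(RespT response) {
    if (done.get()) {
      return;  // without this check: IllegalStateException: call was half-closed
    }
    // Note: check-then-send still races with shutdown(); a real fix must
    // synchronize the two, but the guard alone removes most of the log spam.
    requestObserver.onNext(request);
  }

  @Override public void onError(Throwable t) {}
  @Override public void onCompleted() {}

  // Called by the load worker when the benchmark run ends.
  void shutdown() {
    if (done.compareAndSet(false, true)) {
      requestObserver.onCompleted();
    }
  }
}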
ejona86 added a commit that referenced this issue Oct 26, 2016

Fixes #2372
@jtattermusch
Contributor Author

Thanks!

Now the log went from 90 MB down to 60 MB, but there are still some unwanted log messages:
https://grpc-testing.appspot.com/view/Performance/job/gRPC_performance_experiment/3203/consoleFull

Oct 27, 2016 7:19:33 AM io.grpc.netty.NettyServerHandler onStreamError
WARNING: Stream Error
io.netty.handler.codec.http2.Http2Exception$StreamException: Received DATA frame for an unknown stream 143
    at io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:127)
    at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.shouldIgnoreHeadersOrDataFrame(DefaultHttp2ConnectionDecoder.java:556)
    at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onDataRead(DefaultHttp2ConnectionDecoder.java:176)
    at io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onDataRead(Http2InboundFrameLogger.java:48)
    at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readDataFrame(DefaultHttp2FrameReader.java:411)
    at io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:245)
    at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:155)
    at io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
    at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:118)
    at io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:335)
    at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:395)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1069)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:902)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:351)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:373)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:651)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:574)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:488)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:450)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
Oct 27, 2016 7:20:25 AM io.grpc.netty.NettyServerHandler onConnectionError
WARNING: Connection Error
java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

@jtattermusch jtattermusch reopened this Oct 27, 2016
@jtattermusch
Contributor Author

Reopened this issue.

@ejona86 ejona86 changed the title Java QpsWorker is very verbose Java QpsWorker produces many exceptions Oct 27, 2016
@ejona86
Member

ejona86 commented Oct 28, 2016

@jtattermusch, I'm finding the performance tests far too hard to use. All I want to do is run one locally to test. I've spent hours trying to find the right script and am still pretty discontent. tools/run_tests/run_performance_tests.py -l java worked, but appears to run a ton of different things, and I want to run just one and modify the test to shorten it. I tried running benchmarks/build/install/grpc-benchmarks/bin/benchmark_worker manually, but then couldn't figure out how to point a driver at it. I tried running qps_json_driver directly, and I see what looks like corruption/an uninitialized variable:

RUNNING SCENARIO: java_protobuf_async_streaming_qps_unconstrained_secure
I1027 17:38:09.551090312    2199 driver.cc:251]              Starting server on ���v�� (worker #0)
I1027 17:38:09.551406215    2199 ev_epoll_linux.c:86]        epoll engine will be using signal: 40
D1027 17:38:09.551454482    2199 ev_posix.c:106]             Using polling engine: epoll
D1027 17:38:10.080072811    2200 dns_resolver.c:186]         dns resolution failed: {"created":"@1477615090.080004673","description":"OS Error","errno":-2,"file":"src/core/lib/iomgr/resolve_address_posix.c","file_line":115,"os_error":"Name or service not known","syscall":"getaddrinfo","target_address":"\u00d8\u0019\u00e1v\u00ca\u007f"}

I remember at one point you had to use environment variables to drive some of this, but I can't find a script to get me going further in that direction.

@jtattermusch
Contributor Author

Not sure how running the performance tests could be any easier. If you want to run just one scenario, just run:

tools/run_tests/run_performance_tests.py -l java -r java_protobuf_async_streaming_qps_unconstrained_secure

(see scenario_config.py for the possible scenarios and their configs)

In local mode, that just builds Java and C++ (C++ is needed because you need qps_json_driver) and runs the scenario with the right parameters.

If you really DO want to use qps_json_driver directly (you absolutely shouldn't need to), you can run qps_json_driver --scenario_json='YOUR JSON CONFIG' and set the env variable QPS_WORKERS to the list of workers to use (servers come first in the list).
I agree that the QPS_WORKERS env var is not the best way - we could easily have a cmdline arg for that, but it's legacy behavior we inherited from the early versions of qps_driver and it never bothered us enough to change it.
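
For illustration, a direct invocation could look like the following. This is a sketch: the hostnames are taken from the logs above, scenario.json is a hypothetical file holding one scenario's JSON config, and the comma-separated host:port format (servers listed first) is an assumption based on reading driver.cc rather than any documentation:

QPS_WORKERS=grpc-performance-server-8core:10400,grpc-performance-client-8core:10400,grpc-performance-client2-8core:10400 \
  bins/opt/qps_json_driver --scenario_json="$(cat scenario.json)"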

@carl-mastrangelo
Contributor

@jtattermusch In general, none of the scripts inside the tools/ directory work right off the bat; they all require extra setup. Running the command you just posted results in a compilation error:

Fri Oct 28 10:18:27$ tools/run_tests/run_performance_tests.py -l java -r java_protobuf_async_streaming_qps_unconstrained_secure
START: Preparing hosts.
PASSED: local_prepare [time=0.2sec; retries=0:0]
SUCCESS: Prepare step completed successfully.
START: Building.

tools/run_tests/performance/build_performance.sh: line 31: /home/notcarl/.rvm/scripts/rvm: No such file or directory
++ dirname tools/run_tests/performance/build_performance.sh
+ cd tools/run_tests/performance/../../..
+ CONFIG=opt
+ make CONFIG=opt EMBED_OPENSSL=true EMBED_ZLIB=true qps_worker qps_json_driver -j8
[AR]      Creating /home/notcarl/git/grpc/libs/opt/libz.a
[HOSTCXX] Compiling src/compiler/csharp_generator.cc
[C]       Compiling test/core/util/passthru_endpoint.c
[C]       Compiling test/core/util/port_posix.c
[C]       Compiling test/core/util/port_server_client.c
[C]       Compiling test/core/util/port_windows.c
[C]       Compiling test/core/util/slice_splitter.c
[CXX]     Compiling src/cpp/client/channel.cc
[CXX]     Compiling src/cpp/client/client_context.cc
[CXX]     Compiling src/cpp/client/create_channel.cc
[CXX]     Compiling src/cpp/client/create_channel_internal.cc
[CXX]     Compiling src/cpp/client/create_channel_posix.cc
[CXX]     Compiling src/cpp/client/credentials.cc
[CXX]     Compiling src/cpp/client/generic_stub.cc
[CXX]     Compiling src/cpp/client/insecure_credentials.cc
[CXX]     Compiling src/cpp/common/channel_arguments.cc
[CXX]     Compiling src/cpp/common/completion_queue.cc
[CXX]     Compiling src/cpp/common/core_codegen.cc
src/compiler/csharp_generator.cc:47:43: error: ‘google::protobuf::compiler::csharp::GetReflectionClassName’ has not been declared
 using google::protobuf::compiler::csharp::GetReflectionClassName;
                                           ^
src/compiler/csharp_generator.cc: In function ‘void grpc_csharp_generator::{anonymous}::GenerateServiceDescriptorProperty(grpc::protobuf::io::Printer*, const ServiceDescriptor*)’:
src/compiler/csharp_generator.cc:291:64: error: ‘GetReflectionClassName’ was not declared in this scope
              "umbrella", GetReflectionClassName(service->file()), "index",
                                                                ^
[CXX]     Compiling src/cpp/common/rpc_method.cc
make: *** [/home/notcarl/git/grpc/objs/opt/src/compiler/csharp_generator.o] Error 1
make: *** Waiting for unfinished jobs....
FAILED: local_build [ret=2, pid=94261]
FAILED: Build failed.

@ejona86
Member

ejona86 commented Oct 28, 2016

@carl-mastrangelo I didn't have that problem. Maybe you need a git submodule update --init?

@carl-mastrangelo
Contributor

Nope, that has already been done.

@carl-mastrangelo
Contributor

It looks like the build process prefers the system protobuf over the one included as a submodule. Forcing it to use the repo one (why isn't this the default?) gets a little further:

test/cpp/util/cli_credentials.cc:36:27: fatal error: gflags/gflags.h: No such file or directory
 #include <gflags/gflags.h>

@ejona86
Member

ejona86 commented Oct 28, 2016

@jtattermusch, I don't require the process to be super-easy and I expect bumps, but I write below to give you a glimpse of what I tried. This is unlikely to be exhaustive, but it should be clear I'm suffering death by a thousand cuts. Many of these issues seem addressable.

And all of this is with having a little bit of knowledge of the overall architecture. If I didn't have that, I'd have been even more hosed.

> Not sure how running performance tests could be any easier.

Is there any documentation, if I don't already know the proper script? There are lots of scripts. It's really hard to find the "top level" ones. And while digging through, it's not clear what the prerequisites for these scripts are, which I will grant is partially because I don't know what they do, when they are used, or what their purpose is. I'm okay with digging in some, but a few hints go a long way. Maybe I just don't know where the hints are.

My #1 goal was to find a command to run to reproduce results. I saw benchmarks/build/install/grpc-benchmarks/bin/benchmark_worker --driver_port=10400, but realized I needed something else to drive that. I looked higher and saw "bins/opt/qps_json_driver". That looks great. But how does it know the workers?

What are the workers? I only see one Java worker in the logs. Surely there is more than one worker. It seemed C++ was in the mix because I saw C-core-style log statements. But is it the client, or the server? Is C++ always used? I'm still not quite certain. It seems maybe only Java is used. If I grep for java_protobuf_async_streaming_ping_pong_secure I don't find anything; but now that I've read scenario_config.py I see that if I had searched for java_protobuf_async_streaming_ping_pong_ I'd have found it. I see a call to benchmarks/build/install/grpc-benchmarks/bin/benchmark_worker --driver_port=10400 but no other processes started before PASSED. Surely the client and server are in different processes... Is that output for the client or the server?

If I have to know something out-of-band, like QPS_WORKERS, then drop me a bone if it is missing. Don't try to connect to corrupt hostname strings. In this case, I knew that some scripts used a magical environment variable, but I couldn't figure out its name or format. I didn't know that this particular script needed it, but if I could have found more information I would have tried it just to see if it fixed my problem. The format of QPS_WORKERS doesn't seem to be documented anywhere (not even in a comment in the code). The fact that the order matters is only clear if you read the majority of RunScenario in driver.cc.

That wasn't going well, so I started at the top of the logs, saw run_performance_tests, and ran it; it runs things locally. Unfortunately, it takes forever. I used --help on run_performance_tests.py, and "-r" was not obviously related to scenarios. This seems to be a naming problem where the name is the type instead of what it's used for. It's like having an --int argument for the port to use.

The log output was really hard for me to get through. It was a long time before I found where a scenario started ("RUNNING SCENARIO"), because I stopped looking higher when I saw "PASSED: qps_worker_node_0". But then I couldn't find the scenario name. So I looked higher and saw things like "START: qps_json_driver.quit". "Well, that certainly seems like it would be part of the previous test." But no. And I see stuff like "QPS worker "grpc-performance-server-8core:10400" is still running" as part of qps_json_driver.quit, which worries me because that could contribute to the errors I'm seeing, and I also think "You had one job!" Even now I have trouble finding RUNNING SCENARIO, because it blends in with the JSON; I have to Ctrl+F to find it. Also, RUNNING SCENARIO appears after the command I need to run to produce that scenario. I expected the highest-level script to demarcate its work.

This log snippet is obviously supposed to deal with Java because it has java in the name, but there is no Java log output. Wat? You can see the qps_json_driver.quit at the end, which would seem like it would come after any Java output. How are these logs put together? I still don't think I understand.

RUNNING SCENARIO: java_protobuf_async_streaming_qps_unconstrained_secure
I1027 07:17:53.728253413   31687 ev_epoll_linux.c:86]        epoll engine will be using signal: 40
D1027 07:17:53.728314356   31687 ev_posix.c:107]             Using polling engine: epoll
I1027 07:17:53.729576143   31687 subchannel.c:639]           Connect failed: {"created":"@1477552673.729541732","description":"Failed to connect to remote host","errno":111,"file":"src/core/lib/iomgr/tcp_client_posix.c","file_line":192,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.240.0.90:10400"}
I1027 07:17:53.729607289   31687 subchannel.c:644]           Retry in 0.999722383 seconds
I1027 07:17:54.729747080   31687 subchannel.c:604]           Failed to connect to channel, retrying
D1027 07:17:54.933270724   31687 ev_posix.c:107]             Using polling engine: epoll
D1027 07:17:55.112332076   31687 ev_posix.c:107]             Using polling engine: epoll
I1027 07:17:55.307211825   31687 driver.cc:252]              Starting server on grpc-performance-server-8core:10400 (worker #0)
D1027 07:17:55.307290458   31687 ev_posix.c:107]             Using polling engine: epoll
I1027 07:17:55.755524804   31687 driver.cc:320]              Starting client on grpc-performance-client-8core:10400 (worker #1)
D1027 07:17:55.755628072   31687 driver.cc:359]              Client 0 gets 32 channels
I1027 07:17:55.759135847   31687 driver.cc:320]              Starting client on grpc-performance-client2-8core:10400 (worker #2)
D1027 07:17:55.759179387   31687 driver.cc:359]              Client 1 gets 32 channels
I1027 07:17:56.220167415   31687 driver.cc:381]              Initiating
I1027 07:17:56.248746886   31687 driver.cc:402]              Warming up
I1027 07:18:11.249009706   31687 driver.cc:408]              Starting
I1027 07:18:11.287700172   31687 driver.cc:435]              Running
I1027 07:18:41.248983505   31687 driver.cc:448]              Finishing clients
I1027 07:18:41.308016884   31687 driver.cc:462]              Received final status from client 0
I1027 07:18:41.340913302   31687 driver.cc:462]              Received final status from client 1
I1027 07:18:41.341028875   31687 driver.cc:485]              Finishing servers
I1027 07:18:41.344040256   31687 driver.cc:499]              Received final status from server 0
I1027 07:18:41.344466227   31687 report.cc:75]               QPS: 398688.0
I1027 07:18:41.345413461   31687 report.cc:79]               QPS: 398688.0 (inf/server core)
I1027 07:18:41.345430812   31687 report.cc:84]               Latencies (50/90/95/99/99.9%-ile): 3783.0/6341.6/7465.4/11651.0/19113.9 us
I1027 07:18:41.345435452   31687 report.cc:94]               Server system time: 0.00%
I1027 07:18:41.345439274   31687 report.cc:96]               Server user time:   787.20%
I1027 07:18:41.345442452   31687 report.cc:98]               Client system time: 0.00%
I1027 07:18:41.345445975   31687 report.cc:100]              Client user time:   343.00%
+ '[' performance_test.performance_experiment '!=' '' ']'
+ tools/run_tests/performance/bq_upload_result.py --bq_result_table=performance_test.performance_experiment
Warning: Table performance_experiment already exists
Successfully uploaded scenario_result.json to BigQuery.


PASSED: qps_json_driver.java_protobuf_async_streaming_qps_unconstrained_secure [time=48.7sec; retries=0:0]
START: qps_json_driver.quit

> If you really DO want to use qps_json_driver directly (you absolutely shouldn't need to)

I'd love not to need to, but I didn't want to run a full benchmark. I wanted to run it for just a little while to test. I just want warmup_seconds = 1 and benchmark_seconds = 1. If that were an argument, that'd be fine. Since I also hadn't figured out -r, I was trying to kill two birds with one stone.

@carl-mastrangelo
Contributor

I have no idea why, but it seems like the log spam dropped 10x. The change I made should only have affected shutdown scenarios, not the invalid DATA frames.

Since it isn't reproducing anymore, I'm not sure what else there is to do here.

@ejona86 ejona86 closed this as completed Nov 3, 2016
@jtattermusch
Contributor Author

On Fri, Oct 28, 2016 at 8:40 PM, Eric Anderson [email protected] wrote:

> Is there any documentation, if I don't already know the proper script? There are lots of scripts. It's really hard to find the "top level" ones. […]

I wrote this one a while ago, to make identifying the top-level scripts easier: https://github.com/grpc/grpc/blob/master/tools/run_tests/README.md
The fact that you haven't discovered it makes me think it's not discoverable enough and perhaps we should move it higher up in the directory hierarchy.

I agree there are a lot of scripts in tools/run_tests. That's partially because they've been added organically and no one has ever found the time to clean up the directory (I've done partial refactoring for some of the scripts, but it's a lot of work and requires a lot of insight) - also, it is hard not to break stuff when touching these scripts. But yeah, we should clean up the directory a bit - I filed grpc/grpc#8748.

> My #1 goal was to find a command to run to reproduce results. […] I look higher and see "bins/opt/qps_json_driver". That looks great. But how does it know the workers?

It uses an env variable, which I think is confusing. Of course this is fixable. Not a high priority (because one should rarely need to execute qps_json_driver directly), but filed grpc/grpc#8749.

> If I have to know something out-of-band, like QPS_WORKERS, then drop me a bone if it is missing. […] It doesn't seem like the format of QPS_WORKERS is documented anywhere (not even a comment in the code). The fact that the order matters is only clear if you read the majority of RunScenario in driver.cc.

Agreed, we should fix that. Part of #8749. Also please note that I didn't write the code ;-)

> I used --help to run_performance_tests.py and the "-r" was not obviously related to scenarios. […]

Mind coming up with the right wording for the "-r" help text and sending a PR? Also, there's an example for -r in https://github.com/grpc/grpc/blob/master/tools/run_tests/README.md, but we already know you haven't seen that one.

> The log output was really hard for me to get through. […] I expected the highest level script to demarcate its work.

Ok, maybe we need better documentation of what run_performance_tests.py is doing (it should become part of https://github.com/grpc/grpc/blob/master/tools/run_tests/README.md):

  1. First it runs the workers in the background as jobset.py tasks.
  2. Then it runs the scenario using qps_json_driver (and prints out its output).
  3. Then it starts a fake "quit" scenario that invokes the quit RPC on all previously started workers to shut them down.
  4. Finally it waits for the workers to finish.

The capabilities of jobset.py are a bit limited, so I think I've implemented the best result I could with what I had. Suggestions are welcome though.
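
For context on step 3: the driver shuts workers down via the WorkerService's QuitWorker RPC (that's the "Received quitWorker request." line in the Java worker logs above). A rough sketch of what such a handler can look like on the Java side; the actual io.grpc.benchmarks.driver.LoadWorker code may differ, and the generated class names here are assumptions:

import io.grpc.benchmarks.proto.Control;
import io.grpc.benchmarks.proto.WorkerServiceGrpc;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.CountDownLatch;
import java.util.logging.Logger;

// Assumed sketch: answers QuitWorker, then releases a latch that main()
// awaits before logging "DriverServer has terminated." and exiting.
final class WorkerServiceImpl extends WorkerServiceGrpc.WorkerServiceImplBase {
  private static final Logger log = Logger.getLogger(WorkerServiceImpl.class.getName());
  private final CountDownLatch finished;

  WorkerServiceImpl(CountDownLatch finished) {
    this.finished = finished;
  }

  @Override
  public void quitWorker(Control.Void request, StreamObserver<Control.Void> responseObserver) {
    log.info("Received quitWorker request.");
    responseObserver.onNext(Control.Void.getDefaultInstance());
    responseObserver.onCompleted();
    finished.countDown();  // unblocks the worker's shutdown path
  }
}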

> This log snippet is obviously supposed to deal with Java because it has java in the name, but there is no Java log output. Wat?

There's no Java output because it's the C++ qps_json_driver that runs the java scenario against the already-running qps_workers (which are Java).

> You can see the qps_json_driver.quit at the end, which would seem like it would come after any Java output. How are these logs put together? I still don't think I understand. […]

> I'd love not to need to, but I didn't want to run a full benchmark. […] I just want warmup_seconds = 1 and benchmark_seconds = 1. If that was an argument that'd be fine.

You can easily change the constants for test duration in scenario_config.py (but of course you need to know that). I still don't believe it's worth exposing everything as a cmdline argument, because that would make the script more and more complex, and eventually we wouldn't be sure it's actually passing all the arguments to the scenarios as it should (in this particular case, java uses a different warmup time than the other languages - so it's better to keep that logic in scenario_config.py).
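
For reference, the duration knobs live in the scenario JSON that qps_json_driver consumes; a trimmed illustration showing only the two fields mentioned above (the rest of the scenario config is elided here, not optional):

{
  "name": "java_protobuf_async_streaming_qps_unconstrained_secure",
  "warmup_seconds": 1,
  "benchmark_seconds": 1,
  ...
}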

Thanks for the feedback!


