Friday 14 May 2021

Load testing on linux - access.log is pretty rich in content

I've been doing some more load testing using OATS and Linux machines.

This could be one of my last load tests using OATS, as this is not going to be supported going forward.  This is a bit of a shame to me, as I've always been able to completely load test a JD Edwards implementation solo.  From the database, web servers, application servers - everything!  I know that we also had a couple of team members that had the same level of competence.  It's nice when you can simulate a load on JDE and have complete confidence for go-live.

The site that I've been working on are looking to go-live on 9.2 and have not changed their platform.  They run a pretty tight ship and are also running JDE on an ODA.  I have lots of war stories, but this machine is a beast.  768GB of memory per node and all SSD makes it pretty hard to get things wrong to be honest...  No wait, there are a million things to get wrong - but you have a lot of realestate.

So, back to the JDE content here.  The load test is somewhat different, as we are comfortable with the platform and the throughput of the hardware.  The load testing we want to do here is about stability and configuration.  We have changed database to 19c, new weblogic, new Linux and of course tools and apps - so there is a lot of change - but we want to ensure that the basic hardware and database configuration is going to be okay, as we are confident in the platform.

To do this, I find the best testing is simple long running scripts hitting popular applications.  So, lots of Sales order inquirys, item locations, pricing and more.  

While running this load, I happened to get about 3% of sessions failing with a fairly strange error: 


The HTTP error (12152) is: The server returned an invalid or unrecognized response

So, OATS is getting the server to respond, but the response is invalid.  This is going to be a little harder to track down, because it's client side.  and we don't get complete responses client side.


Another annoying (or perhaps positive) observation is that the errors are occurring all over the place - they are not on a particular screen / particular request.  You can also see that overall 36 failed out of 1100, but among the detail of the page, 3 of 1100 are failing - very small representation - but I do want to know more.

Also, I see that there is nothing in the WLS logs - I guess because it's client side.  I even went for the details in the access logging, hoping that this might lead to an observation...  But remember, my OATS scripts easily detect 404's or 500's - they are NOT getting this.

[oracle@jdevpweb03 logs]$ cat access.log |grep '1" ' | awk '{print $(NF -1) }' | sort |uniq -c
   3425 200
      8 302
      1 404


But, good to know that you can sum all of the errors / HTTP return codes from the access logs, you can also see how much traffic is being generated.  This is pretty interesting too, if you are looking for strange behavior or patterns for your security POV.

I can also see this in the OATS logs:  C:\OracleATS\logs\java_agent.log

2021-05-14 13:01:32,541 ERROR [IteratingAgent 1648][oracle.oats.scripting.modules.basic.api.internal.IteratingAgent.1648] Iteration 13 failed at line (script.java:2182). Failed to read HTTP response header from server. 
Comparable WinInet error code: Error 12152: ERROR_HTTP_INVALID_SERVER_RESPONSE The server response could not be parsed. Caused by: java.net.SocketException occurred. Error Message:Connection reset
oracle.oats.scripting.modules.basic.api.exceptions.ScriptException: Failed to read HTTP response header from server. 


Interestingly this message says that the connection was reset, but - the script has not failed - only the individual page response.  OATS actually does not recover very well from this and we get orphan sessions in server manager.  Eventually this will cause the JAS servers to have too many open sessions and crap out all of your testing.

It seems that I need to follow the bouncing ball, as the client is getting "network disconnect"

Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)

I need to look at the LB, to see if this is causing some problems.

I have a theory that this is a problem with the LB, so I'm going to change my configuration that is currently going to the LB to go directly to each web server.  If I get reliability in that test, then everything is going back to the LB.  Incidentally I have run a 24 hour test against the non LB end points and did not get anything like the above (3% failure).

Guess what, all of the sessions that went direct to the webservers had 0 errors [after 2 hours].  We think that dark trace might be killing the sessions.  So the LB is causing my problems, and maybe dark trace.  Good lessons learned in a complex load testing environment