Thursday 29 August 2019

blocking / locking... potato... potato


Blocking is a funny problem and quite often one of the last things that I look for when there are JD Edwards issues.

We've had some recent problems with some serious blocking, but the reconciliation between client and server has made the log analysis almost impossible...  What I mean is that there are often client errors with no server errors.  The web server will give up, nothing in the enterprise server logs...  Until there are IPC errors because queues are too large [too many BSFNs running in queue]  But there are lessons in this.

First, we could see that there was an increase in the number of instances of COSE#1000 
ERP analytics shows us all historical details, wow that is a lot of problems:


Detailed AWS cloudwatch insights allow us unparalleled capability to query and audit all of our logs:

Can u see what is actually being done here, is it SOOOO amazing.   Looking at all logs in the last X[hours|days|minutes] in all log files for 14 servers, enterprise and web…  Looking for the relevant client [web server] and server [app server] logs that relate to and BSFN errors between the two.  Then showing this as a timeline.

I think that the big lesson here is adopting a consolidated approach to your logging, like a SPLUNK type approach.  If you adopt a some consistency, then all of the advantages of monitoring and better global interrogation are opened up to you.

What we have actually done in this instance is use cloudwatch to ingest all of the JD Edwards log files.  We are consolidating server manager, weblogic, enterprise server, system out and  /var/log into our own log streams that we can query.

Any example of a fairly complex query is below

**CloudWatch Logs Insights** 
region: ap-southeast-2 
#Here are the log files that I want to query
log-group-names: jde_wlsServerJASLogsout, jde_wlsServerJASLogs, jde_webServerLogs, jde_entServerLogs  
#looking at the last hour
start-time: -3600s 
end-time: 0s 
query-string:
```
#fields I want to look at and display on the console
fields @timestamp  ,@message,  @logStream, @log, BSFN
#search string in the logs - this is a simple example that will only match web server logs
|filter @message like 'COSE#1000'
#Allows me to create my own fields from the output and summarise based upon these with a little bit of regex magic
|parse @message /(?<date>\d{2}\s+\S{3}\s+\d{4})\s+(?
| sort @timestamp asc
| limit 10000



#And the results:

```
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|       @timestamp        |                                                                                                                 @message                                                                                                                  |                @logStream                 |              @log              |        BSFN         |    date     |     time     | errorlevel | module  |   user    |   Env    |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------|--------------------------------|---------------------|-------------|--------------|------------|---------|-----------|----------|
| 2019-08-28 06:15:19.809 | 28 Aug 2019 07:15:19,573 [WARN  ]  - [RUNTIME]         *ERROR* CallObject@3f09721b: COSE#1000 Request timeout: timeout after 90000ms host JDEPROD1:6017(6025) SocID:37364 PID:12066 BSFN:CommitReceiptHeader user:USER Env:JPD920UK  | WEBUK_ip_10_116_22_119i-084b990b0bb60c66e | 202760777498:jde_webServerLogs | CommitReceiptHeader | 28 Aug 2019 | 07:15:19,573 | WARN       | RUNTIME | AMRE001 | JPD920UK |
| 2019-08-28 06:24:31.258 | 28 Aug 2019 07:24:31,092 [WARN  ]  - [RUNTIME]         *ERROR* CallObject@2c3526cf: COSE#1000 Request timeout: timeout after 90000ms host JDEPROD1:6017(6025) SocID:37364 PID:12066 BSFN:CommitReceiptHeader user: USER Env:JPD920UK  | WEBUK_ip_10_116_22_119i-084b990b0bb60c66e | 202760777498:jde_webServerLogs | CommitReceiptHeader | 28 Aug 2019 | 07:24:31,092 | WARN       | RUNTIME | AMI001 | JPD920UK |
| 2019-08-28 06:34:21.978 | 28 Aug 2019 07:34:21,802 [WARN  ]  - [RUNTIME]         *ERROR* CallObject@74a86b0a: COSE#1000 Request timeout: timeout after 90000ms host JDEPROD1:6017(6025) SocID:37364 PID:12066 BSFN:CommitReceiptHeader user: USER Env:JPD920UK  | WEBUK_ip_10_116_22_119i-084b990b0bb60c66e | 202760777498:jde_webServerLogs | CommitReceiptHeader | 28 Aug 2019 | 07:34:21,802 | WARN       | RUNTIME | AME001 | JPD920UK |
| 2019-08-28 06:42:52.420 | 28 Aug 2019 07:42:52,371 [WARN  ]  - [RUNTIME]         *ERROR* CallObject@12ddb7bb: COSE#1000 Request timeout: timeout after 90000ms host JDEPROD1:6017(6025) SocID:37364 PID:12066 BSFN:CommitReceiptHeader user: USER Env:JPD920UK  | WEBUK_ip_10_116_22_119i-084b990b0bb60c66e | 202760777498:jde_webServerLogs | CommitReceiptHeader | 28 Aug 2019 | 07:42:52,371 | WARN       | RUNTIME | AE001 | JPD920UK |
| 2019-08-28 06:45:25.972 | 28 Aug 2019 07:45:25,747 [WARN  ]  - [RUNTIME]         *ERROR* CallObject@256577d3: COSE#1000 Request timeout: timeout after 90000ms host JDEPROD1:6017(6024) SocID:37846 PID:12066 BSFN:CommitReceiptHeader user: USER Env:JPD920UK  | WEBUK_ip_10_116_22_119i-084b990b0bb60c66e | 202760777498:jde_webServerLogs | CommitReceiptHeader | 28 Aug 2019 | 07:45:25,747 | WARN       | RUNTIME | AM001 | JPD920UK |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 Entries like the above can mean many things, network problems, security problems...  But when you look at the server logs and see nothing...  Start to think about blocking.  Also look at the BSFN summary page for the web server in question, this will show you the BSFN runtimes - max, min and average.  If this shows the functions are generally very fast - then you know that you might have some locking problems.

Now, validate this at the database.  These statements will only work while there is locking / blocking - so have them ready.  These are also oracle statements.

At the database:

Show me the blocking:

select
   (select username from v$session where sid=a.sid) blocker,   a.sid,   ' is blocking ',   (select username from v$session where sid=b.sid) blockee,   b.sidfrom
   v$lock a,
   v$lock bwhere    a.block = 1and    b.request > 0 and    a.id1 = b.id1 and    a.id2 = b.id2;

What are they doing / blocking / locking?

Select    c.owner,   c.object_name,   c.object_type,   b.sid,   b.serial#,   b.status,   b.osuser,   b.machinefrom    v$locked_object a ,   v$session b,   dba_objects cwhere   b.sid = a.session_idand  a.object_id = c.object_idand b.sid = 1340;

All in one – with JDE information - machine process type and statement

Select    c.owner,   c.object_name, c.object_type,   b.sid,   b.serial#,   b.status,   b.osuser,   b.machine, b.process, b.program, b.sql_id, REPLACE(d.SQL_TEXT,CHR(10),'') STMTfrom    v$locked_object a ,   v$session b,   dba_objects c, v$sqltext dwhere   b.sid = a.session_idand  a.object_id = c.object_idand d.address = b.sql_addressand b.sql_hash_value = d.hash_valueand b.sid in(select
   a.sidfrom
   v$lock a,
   v$lock bwhere    a.block = 1and    b.request > 0 and    a.id1 = b.id1 and    a.id2 = b.id2);

If you see a long list of results (or a single line), you might want to consider your next actions.  Blocking can be pretty normal, but can also be a result of a deadlock (surely that should be a movie title).   A dead lock occurs when two sessions are fighting for the same lock.  Interestingly I've seen locks last for hours without being deadlocks.  Another very interesting problem that does occur is that a client (JDBC) process can lock be locked by a server process (jdenet_k).  I've seen this quite a bit and even programmed enough bad code to create it myself - ha!

Things to remember is that if a jdenet_k process is blocking for > 5 minutes, there is probably something wrong (the client will have given up - timeout's probably 90 seconds)...  So the BSFN is still running.  The actual human that tried the first time is probably trying again in another JAS session...  Things could be escalating...  Generally I think you can kill those sessions.  UBE's are a little different, leave them alone.

If you see java doing the locking.  You need to make the call.  Try and get back to the JDE session that is doing the damage and see what the user is/was doing [they'll generally deny it].  It can be little things that cause big locks.


2 comments:

Glen said...

The biggest challenge I have faced in resolving the database blocking is finding the actual query that creates the block with the values. Configuring the query diagnostics would definitely help here; setting a timeout of 30 sec to a minute; the query gets recorded in the kernel. Helped with resolving db locks for me!

Mat Egan said...

Thanks Shannon - helpful queries.

I had to clean them up from the blog as they have some "and"s and "where" clauses running together which makes it difficult to copy/paste ;)

Mat