Monday 30 November 2009

Transaction server, 8.98, WAS, JAS, RTE, TLA

A bit more information about the transaction server.

Everything has been running for a couple of months without any problems.  Regular scrutiny of F90710 was revealing 0 records, regular scrutiny of MQQueues via SIBExplorer (oh how I love that program), is revealing 0 records in all of the queues – or at least when the system is very busy there are a couple of records in the various queues.

We’ve had one incident of everything falling over, but it was repaired quickly with a transaction server restart…  until…

We started to get a bunch of messages staying at status 4 in the sy812.f90710

sometimes > 5000 a day.  This is nay good I was thinking.  What has changed?

It finally occurred to me (like when Bart was trying to explain to Homer about Sideshow Bob’s plans to kill aunt Selma) that an additional server must be stealing the messages…  And because the prod txn is the only one that should be getting the messages, this is the problem. I’m still trying to jump through the hoops to understand more about this model.

I think that the txn server is updating the records to 4, but then because the message never gets to the txn server – they are not getting deleted (nice and reliable I guess). So what we do have is a bunch of messages at status 4 that the transaction server does not know about.  This is “sub optimal”.

The transaction server seems to poll the F90710 table for any type 3 records.  When it finds them it updates them to 4.  It then grabs the message and puts the message into the MQ queue and then deletes the message once this has been done.  We are having the problem that another txn server is grabbing the messages (I think) and not able to process them.  Therefore they are staying at status 4.  All I need to do is update all of them to a 3 and they process successfully.

A couple of risks with the above is that the messages in the F90710 might contain old data (or might contain new data??) and therefore the old data might be going to all of the servers?  I can’t be sure of this.

So, pretty amazing really.  If you are getting messages staying in the F90710 at status 4 and you have multiple transaction servers, I’d be checking on them and making sure that your OCMs are right.  Make sure that your other txn servers are polling the correct F90710 tables!

Package deploy and locks and more info

I know that I’ve talked about deploying packages and locks and things before…  I don’t want to cover old ground, but I do want to record more information.  We’ve talked about a scenario where you have PD and PP installed on the one enterprise server.  If there are UBE’s running in PD, you can still deploy to PP without affecting business continuity. 

I believe that being a good technical consultant is balancing these two requirements (system availability and technical productivity).  That is that they can get the maximum amount of work done with minimum disruption to the business.

Anyway, off my soap box.  I had a situation the other day where I was deploying to PP in a similar environment as mentioned above.  The hardware was supporting two pathcodes with a instance of JDE.

I deployed the package and waited… and waited…  sweated… waited…

I got very nervous and looked at the deployment logs and saw that the package was trying to get locks on the server.  I looked at task manager and saw that one of the kernels was “spinning it’s wheels”, it has 100000’s of seconds on it’s runsheet, obviously a runaway.  I killed the kernel in question and almost immediately the system took a HUGE gulp of air and things were back to normal.

So, what does this mean…  We’ll it means that you definitely lock the entire instance of JDE when you deploy a package, not just the pathcode that you are using.  I guess the small justification is that if UBEs were running in PD, the package would still go through.

So, my recommendation is changing.  If you are deployed for a pathcode that shares the PD instance, don’t deploy until the system is quiet.  Check your UBE’s before hitting deploy, check for any kernels that are “spinning their wheels”.

<rant mode>

  • I personally think that the new 812 deployments are a huge step backwards.  I took in the marketing hype and thought that it was going to be magical, but in the hard light of day, the process is quite cumbersome and prone to errors. 
  • It takes way too long to deploy (although this is a configurable option). 
  • I get seemingly random “network errors”, “internal server” errors etc. 
  • I like the generation option, but let’s be honest…  Why is not on the deployment server and an option with the build / deploy?  Generate java objects should be another tick box when you are defining the build.  They should generate the objects to a couple of temp F98999% files and then copy then in when deploying.  Job done!  Generation server??  What a waste of resources and a burden on the environment!

</rant mode>

Friday 27 November 2009

Moving your media objects

This is not as easy as you think.  The path to some of the objects is written into the F00165.  So, you might move all of the files and they just might not work anymore…  Arrgghhh – what are you going to do…

You put your thinking cap on, and decide that you could rename the new location (via cname) the same as the old location.  ie.

\\oldMediaObjLocation

and you want to whack them on a new server, say \\newserver

and you still want to reference your chares with \\oldMediaObjLocation

So, you’re smart – you use a cname or an alias which basically (via DNS) points oldMediaObjLocation  to newserver – but it does not work.

I get prompted for a username and password or something about the server already exists, which it does not.

So, now you need to do this for 2003:

1. Create the CNAME record for the file server on the appropriate DNS server, if the CNAME record is not already present.
2. Apply the following registry change to the file server. To do so, follow these steps:
      1. Start Registry Editor (Regedt32.exe).
      2. Locate and click the following key in the registry:
         HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LanmanServer\Parameters
      3. On the Edit menu, click Add Value, and then add the following registry value:
         Value name: DisableStrictNameChecking
         Data type: REG_DWORD
         Radix: Decimal
         Value: 1
      4. Quit Registry Editor.
3. Restart your computer.

Then you’ll be able to reference unc shares that are named via a cname.

So, oyu can move your media objects and they will all still work!  Woo hoo!

P98613 – low level cache awesomeness

What a cracker of a table controlling a cracker of a concept.  How many rows of each table is going to be cached by JDE.  It’s often a problem that a customer changes a period for a customer and it takes hours for this to propagate through the JDE infrastructure, not matter how often you clear the web cache.

Remember that we deal with two very distinct caches in JDE. 

1.  JDBj based cache (clear cache from SM or SAW)

2.  JDE middleware cache  (wsj – row exit for clear cache)*

*I’ve seen this one crash servers so be careful

If you have constant problems with particular tables (F0010 company master for example), fire up P98613.  Take F0010 out of the table.  Not more caching of company information.

Remember that this is a system table, not a data table..  So if you change it in DV, it’s going to change it EVERYWHERE!

Thursday 26 November 2009

Oracle Dates confusement

I know that confusement (sp??) is not a real word, but I have heard it used by a lot of people, so I think that it should be a real word.

Anyway, because my experiences of oracle have been mainly with JDE, I don’t understand the oracle date type very well.  Say I want to group data by an oracle date day, how do I do that???  Because the oracle date secretly contains time…

With the adoption of UTIME, I need to update my skills to be able to take this on…  So google search here I come.

My query is simple, I want to fine all of the records in the F90710 that seem to be sticking.  Ie. the trigger message is not making it to the transaction server for these messages and they are stuck!!!

so I want something like:

select count(1), ETEVNTST, ETEVNTNAME, ETEVNTTIME from sy812.f90710 group by ETEVNTNAME,ETEVNTST, ETEVNTTIME

And I get many separate lines for every day, so I need to convert the date to a day I guess.

select count(1), ETEVNTST, ETEVNTNAME, trunc(ETEVNTTIME, 'MONTH')
from sy812.f90710
group by ETEVNTNAME,ETEVNTST, trunc(ETEVNTTIME, 'MONTH')

This did nothing for me, I think it was grouped by month.  A little handy I guess.  I tried using ‘DAY’ in the TRUNC function, was not what I needed.

So, am going to try:

select count(1), ETEVNTST, ETEVNTNAME, to_char(ETEVNTTIME, 'DDMMYY')
from sy812.f90710
group by ETEVNTNAME,ETEVNTST, to_char(ETEVNTTIME, 'DDMMYY')

That’s the ticket.  The use of tochar is awesomely simple. HH:MM:SS are all available.

Friday 20 November 2009

Collaborative portal for authentication

I need to support long usernames for a 8.98.1.1 site via MAD LDAP.  I’m thinking that I can get the collaborative portal to do the full length username authentication, and then map a field in the LDAP to be the E1user id.  This will then be passed into the authentication token.  Job done.  I know that a portal install is not going to be the most efficient mechanism for this, but it might just be okay.

Should I look into GSSAPI?  Can it do username translations?

Alternatively you can choose to enable Single Sign-On and NOT configure LDAP to work with the JDE security kernel. This is key and is the point I have been trying to make. You can do this and enable GSS-API to work with Microsoft's Kerberos implementation (turn on Integrated Windows Authentication in the Internet Explorer browser) and experience SharePoint like Single Sign-On with JDE. I do this for my employer and wouldn't have it any other way for them. Others may choose another method and I wouldn't have any problem with that.

from http://www.jdelist.com/ubb/showthreaded.php?Cat=&Number=140716&page=&view=&sb=5&o=&vc=1

Anyway, watch this space. A solution will be on it’s way.

Sunday 15 November 2009

long running queries in oracle, it’s simple

If you know the session you are looking for (say using a query like the previous post), then with the SID, use the below to get all of the info about it’s progress.

select sofar, time_remaining/60 as time_rem_mins from v$session_longops where sid = 1630

Awesome!

session hogs

stolen from http://www.dba-oracle.com/oracle_tips_top_session.htm

select
sid,
username,
round(100 * total_user_io/total_io,2) tot_io_pct
from
(select
b.sid sid,
nvl(b.username,p.name) username,
sum(value) total_user_io
from
sys.v_$statname c,
sys.v_$sesstat a,
sys.v_$session b,
sys.v_$bgprocess p
where
a.statistic#=c.statistic# and
p.paddr (+) = b.paddr and
b.sid=a.sid and
c.name in ('physical reads',
'physical writes',
'physical writes direct',
'physical reads direct',
'physical writes direct (lob)',
'physical reads direct (lob)')
group by
b.sid, nvl(b.username,p.name)),
(select
sum(value) total_io
from
sys.v_$statname c,
sys.v_$sesstat a
where
a.statistic#=c.statistic# and
c.name in ('physical reads',
'physical writes',
'physical writes direct',
'physical reads direct',
'physical writes direct (lob)',
'physical reads direct (lob)'))
order by
3 desc;

Tuesday 10 November 2009

Big tables in SQLServer

SELECT
    [TableName] = so.name,
    [RowCount] = MAX(si.rows)
FROM
    sysobjects so,
    sysindexes si
WHERE
    so.xtype = 'U'
    AND
    si.id = OBJECT_ID(so.name)
GROUP BY
    so.name
ORDER BY
    2 DESC

 

That’ll give you the ones with lots of rows, then use the following SP, to get the sizes:

sp_spaceused 'PRODDTA.F4211'
sp_spaceused 'PRODDTA.F49211'
sp_spaceused 'PRODDTA.F42199'
sp_spaceused 'PRODDTA.F4111'
sp_spaceused 'PRODDTA.F41051'

How big is that oracle table??

Use this query first if you want the size of 1 table and it’s indexes:  Note that you need to escape the _, because that is usually a wildcard for any character.

select sum(bytes)/(1024*1024) from dba_segments where segment_name = 'F0911' or segment_name like 'F0911!_%' escape '!'

spool a .sql file with the command below, and that will generate all of the statements for you. 

set pagesize 0

set echo off

select 'select sum(bytes)/(1024*1024)' || ', ''' || table_name || '''' || ' from dba_segments where segment_name = ' || '''' || table_name || '''' || ' or segment_name like ' || '''' || table_name || '!_%' || '''' || 'escape ' || '''' || '!' || ''';'  from all_tables where table_name like 'F%' 

This will generate a bunch of statements like:

select sum(bytes)/(1024*1024), 'F0040Z1' from dba_segments where segment_name = 'F0040Z1' or segment_name like 'F0040Z1!_%'escape '!';     

Above is tables and indexes

select 'select sum(bytes)/(1024*1024)' || ', ''' || table_name || '''' || ',''' || owner || '''' || ' from dba_segments where segment_name = ' || '''' || table_name || '''' || ';'  from all_tables where table_name like 'F%'

Above is tables only

i hate vista so much it hurts (sometimes)

Generally if I’m using find, I hate it more…  I hate it like I want to punch it… 

But this has allowed my hate to recede (somewhat)

http://www.microsoft.com/windows/products/winfamily/desktopsearch/technicalresources/advquery.mspx

Actually wait, I still hate it.  The above is GREAT for finding files…  But cr@p for finding things in files…

I hate statements like “windows does it automatically” – my kingdom for grep!!!

http://support.microsoft.com/kb/932989

That lists about 1 million things that could be wrong.

That is it, I’ve had enough.  I’m going to use grep for win32.

Monday 9 November 2009

View csv from wsj not working

Problem

In HTML client you launch WSJ (Work With Submitted Jobs), select your Enterprise/Batch Server, select your finished report  and click on View CSV. Nothing occurs.  On a FAT client, this works correctly.

Solution

To fix the issue, you will need to do the following:

  1. Add the following parameter in your JAS.INI
         [OWWEB]
    UBEContentDisposition=true

Using ServerManager, click on WEB Runtime configuration and Check 'UBEContentDisposition'

    2.  Add the following Windows Registry key to the machine your using to View CSV:
              [HKEY_CLASSES_ROOT\MIME\Database\Content Type\application/csv]
              extension=.csv
If issue persists, check your Internet Explorer settings:

  • Tools
  • Internet Options
  • On Security tab, click on Custom level, scroll to Downloads section and select: Automatic prompting for file download
  • Enable and apply
  • Close Internet Explorer and restart

You should now be able to open your CSV file.

This was copied from oracle support, ID 781462.1

Friday 6 November 2009

tracing tricky connect issues in SQLServer 2008

 

SQL Server 2008 contains a new feature designed to help troubleshoot particularly tricky connectivity problems. This new feature, the Connectivity Ring Buffer, captures a record for each server-initiated connection closure, including each kill of a session or Login Failed event. To facilitate efficient troubleshooting, the ring buffer attempts to provide information to correlate between the client failure and the server’s closing action. Up to 1k records in the ring buffer are persisted for as long as the server is online, and after 1000 records, the buffer wraps around and begins replacing the oldest records. The Connectivity Ring Buffer’s records are exposed by a DMV which can be queried:

SELECT CAST(record AS XML) FROM sys.dm_os_ring_buffers
WHERE ring_buffer_type = 'RING_BUFFER_CONNECTIVITY'

from http://blogs.msdn.com/sql_protocols/archive/2008/05/20/connectivity-troubleshooting-in-sql-server-2008-with-the-connectivity-ring-buffer.aspx

But…  I’m on 2005 and am getting a lot of

“SQL Native Client TCP Provider An existing connection was forcibly closed by the remote host” in my APP server and BAT server jde logs.

I think that it might be network related, but I’ll keep you posted.

Right, so we’ve come a long way with this error.

It seems that this is related to either:

EnableTCPChimney = 0  Follow http://support.microsoft.com/kb/945977 to disable

Another registry change DisableTaskOffload=1  http://blogs.technet.com/brad_rutkowski/archive/2007/08/10/how-to-know-if-tcp-offload-is-working.aspx

There is a lot of documentation on some bad NICs http://support.microsoft.com/kb/942861

Typically, this problem occurs when the network adapter uses the Broadcom 5708 chipset. For example, the following network adapters use the Broadcom 5708 chipset:

  • Broadcom NetXtreme II
  • Hewlett-Packard NC373i Multifunction Gigabit Server Adapter

So, disable Chimney, disableTaskOffload, get the latest NIC drivers & bios

Thursday 5 November 2009

Movember donations, bring it on!

Sponsor me for Movember, it’s for a good cause:

http://au.movember.com/mospace/316906

That is it!

JDESPECRESULT_DUPLICATEKEY is metadata kernel logs

We’ve been getting UBE’s failing randomly.

In the UBE Log we have:

Failed to load job spec cache for job 103247 report R09801_ZJDE0002. Cannot load specs for UBE. Please remove the cache directories for this report under the /PD812/spec/runtimeCache directory and rerun the report. If errors persist, check the metadata kernel logs for more information (object could be missing from package, failed to access package database, etc.)

Hmm, not really that helpful.  But the metadata kernel logs are much better:

79420/6683 WRK:Metadata job 804 Thu Nov 5 00:17:40.634859 ServerDispatch.cpp1419
MD_INFO: Previous UBE Job Failed (R03B551_SMNG051S_103114, Pathcode= isTemplate=<1>)
479420/515 WRK:Metadata job 791 Thu Nov 5 00:17:45.898359 specmisc.c2448
CreateJobCache() - Failed to find version record for R03B551_+ in RDASpec source repository for job 0.
479420/515 WRK:Metadata job 791 Thu Nov 5 00:17:45.899167 ServerDispatch.cpp728
Job Cache Template creation for report_version = R03B551_NNWN050S Failed, job = 103118
479420/6426 WRK:Metadata job 803 Thu Nov 5 00:17:50.790448 specmisc.c3386
copySpecs() - jdeSpecInsert() failed with error code, JDESPECRESULT_DUPLICATEKEY. Record was selected from another repository using the k
ey, R03B551.+.

So, we see the duplicate key problem.  The duplicate is converting F98761 to TAM in the runtimeCache dir. 

I had a sniff around with TAM Browse, thinking that there might be a corruption in the base object, but nothing much could be found.  It all looked good.  There were all of the correct constraints on the tables too, so no duplicates could have hit the database..

Then oracle support came up with the goods.

You CANNOT run the same REPORT|VERSION combo at the same time!!!!  WTF?  what happened to concurrent applications, concurrent UBEs?  Their magical solution is to ensure that jobs are not run at the same time (for the first time) HUH????  This is madness. So you deploy a full package, take JDE down…  Wait 10 mins…  Scheduler is ready to fire off a bunch of jobs… There might be some waiting jobs already…  They all want to run at the same time down multi-threaded queues and corrupt each other.  so the jobs you run more often are more prone to the corruption!  nice work!

SAR is 8933751

NEWS FLASH

It gets worse.  You cannot run parallel reports until a runtimeCache has been established for the report.  So just say you want to launch 4 versions of a report at the same time from the scheduler (after a full package deploy), you cannot.  They will fight each other to write the runtimeCache specs for the base report /PD812/spec/runtimeCache/R0010P/rdaspec.ddb etc and will write corrupt specs or fail or both! 

Remember that a full package deploy will / should delete all runtimeCache specs.

So all UBEs need to be single threaded until at least one has been run before.  Nice business rules!