Page error causes server to crash
These are the symptoms: we start loading a page and Oracle launches a process that hogs as much CPU as it can, slowing everything on the server to a near standstill. If we don't catch it within about 5 minutes, we cannot even log on to kill the process. Within a short space of time the OS is paging like crazy, and within about 10 minutes the 4GB of physical memory and 8GB of swap is all used up. The page never loads, and a few minutes after starting to load the page we get a 404 error. What happens after the available memory falls to zero is unpredictable but never good.
Our APEX/Oracle environment is as follows:
Product Build: 3.0.0.00.20
Language Preference: en-gb
NLS_CHARACTERSET: WE8ISO8859P1
DAD CHARACTERSET: ISO-8859-1
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit
PL/SQL Release 10.2.0.1.0 - Production
CORE 10.2.0.1.0 Production
TNS for Linux: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 - Production
OS: Red Hat Linux ES4 64bit
First, I have to hold my hands up and say that it's always caused by an error that we have made, but often these are relatively trivial errors on the page that APEX will either gloss over or handle correctly 99 times out of a hundred (and certainly nothing obvious like an infinite loop). These kinds of problems also occur very rarely (thankfully), but what gets me is why the result is quite so catastrophic. I should also say that we're talking about a development environment containing very little data - the example we tested is a page with 3 report regions each based on fewer than 100 rows of data, comprising around 20 numeric columns, some dates, flags and audit information.
What I'm keen to understand is whether this is a consequence of Oracle misconfiguration, or if there is anything else (other than never making any coding errors) that we can do to stop this from happening.
These are examples of the kinds of errors that have that lead to the above symptoms (although not necessarily to the output listed below):
1. Deleting a page item that a PL/SQL function within a page region references, without removing that reference. Mostly APEX will ignore such errors, treating the item value as null, but the difference seems to be where the item had a value in session state, and we did nothing to clear it after deleting the page item, although I suspect that this isn't the whole story, as this scenario doesn't predictably lead to the problem we're occasionally seeing.
2. Returning SQL from a PL/SQL function that did not correspond either with the report template in use (for example dummy SQL (within an error block) that doesn't match the report template we're using (for example a column needed for conditional row formatting is missing)), or with the report definition (numbers and attributes of columns displayed - sorry I can't be very specific on this score, but in these cases deleting and recreating the report region with the same source has cured it).
3. Other errors in the region PL/SQL source, such as referring to the wrong package when declaring a type, causing a type mismatch error.
The latest page we had this problem with was returning SQL from a packaged PL/SQL function, in which creating a SQL block had thrown a 'character string buffer too small' error, and returned a dummy SQL string that did not contain a header column referenced in the report template, although some supposition has gone into that, as the focus was more on fixing the problem than pinning down its exact cause.
The Apache logs don't tell us a great deal, other than that the server is in a great deal of trouble. There is nothing to indicate the start of the problem, but after a while we gets lots of entries like this:
[<date/time>] [warn] [client <ip>] mod_plsql: Long running URL [pls/htmldb/wwv_flow.accept] timed out
Later we get lots of:
[<date/time>] [error] [client <ip>] [ecid: 1248707261:<ip>:24047:0:1,0] mod_plsql: /pls/htmldb/f HTTP-503 ORA-3113
And a few:
[<date/time>] [warn] [client <ip>] [ecid: 1248707264:<ip>:24081:0:1,0] mod_plsql: DMS Logging: Unable to read data from <Apache path>/Apache/modplsql/cache/modplsql.lck
Plus a smattering of:
[<date/time>] [error] [client <ip>] [ecid: 1248707384:<ip>:25588:0:6710,0] mod_plsql: Database (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=<server name>)(PORT=1530)))(CONNECT_DATA=(SERVICE_NAME=<service>))) Failed (time=10190ms rc=-1 ORA-1013)
[<date/time>] [error] [client 10.100.36.48] [ecid: 1248707384:<ip>:25588:0:6710,0] mod_plsql: Failed to reset database connection (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=<server name>)(PORT=1530)))(CONNECT_DATA=(SERVICE_NAME=<service>))) rc=-1 ORA-1013 ORA-01013: user requested cancel of current operation\n
[<date/time>] [warn] [client <ip>] [ecid: 1248707384:<ip>:25588:0:6710,0] mod_plsql: Discarded pooled connection (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=<server name>)(PORT=1530)))(CONNECT_DATA=(SERVICE_NAME=<service>))) because of ping failure
On the database side, it's difficult for us to figure out what's happening, but it looks like APEX is trying to run an error-handling routine, which makes sense, but it does appear to be locked into a try-fail-retry cycle. We loaded an APEX page that had previously exhibited those symptoms with the DBAs running a trace, and the highlights of the TKPROF output were:
This SQL: SELECT MESSAGE_TEXT, DECODE(MESSAGE_LANGUAGE,LOWER(NVL(:B3 ,:B2 )),1,0)
MESSAGE_ORDER
FROM
WWV_FLOW_MESSAGES$ WHERE SECURITY_GROUP_ID = 10 AND (MESSAGE_LANGUAGE =
LOWER(NVL(:B3 ,:B2 )) OR MESSAGE_LANGUAGE = LOWER(SUBSTR(NVL(:B3 ,:B2 ),1,2)
)) AND NAME = UPPER(:B1 ) ORDER BY 2 DESC
was run approx 45,000 times, returning no rows.
This SQL: SELECT MESSAGE_TEXT, DECODE(MESSAGE_LANGUAGE,LOWER(NVL(:B5 ,:B4 )),1,0)
MESSAGE_ORDER
FROM
WWV_FLOW_MESSAGES$ WHERE FLOW_ID = NVL(:B7 , :B6 ) AND (MESSAGE_LANGUAGE =
LOWER(NVL(:B5 ,:B4 )) OR MESSAGE_LANGUAGE = LOWER(SUBSTR(NVL(:B5 ,:B4 ),1,2)
)) AND SECURITY_GROUP_ID = DECODE(NVL(:B2 ,0),0,:B3 ,:B2 ) AND NAME =
UPPER(:B1 ) ORDER BY 2 DESC
was also run approx 45,000 times, returning no rows.
This SQL: SELECT M.MESSAGE_TEXT
FROM
WWV_FLOW_MESSAGES$ M, WWV_FLOWS F WHERE M.NAME = UPPER(:B2 ) AND
M.MESSAGE_LANGUAGE = F.FLOW_LANGUAGE AND F.ID = :B1 AND M.SECURITY_GROUP_ID
= 10
was also run approx 45,000 times, returning no rows.
This SQL: SELECT DECODE(MESSAGE_LANGUAGE,'en-us',0,1) WEIGHT, MESSAGE_TEXT
FROM
WWV_FLOW_MESSAGES$ WHERE NAME = UPPER(:B1 ) AND SECURITY_GROUP_ID = 10 ORDER
BY 1 ASC
was also run approx 45,000 times, returning 1 row per execution.
This SQL: SELECT PLUG_SOURCE
FROM
WWV_FLOW_PAGE_PLUGS WHERE ID = :B1
was run nearly 13,000 times, returning 1 row per execution.
This SQL: SELECT USER_ID
FROM
SYS.ALL_USERS WHERE USERNAME = UPPER(:B1 )
was run nearly 6,500 times, returning no rows.
This SQL: SELECT MESSAGE
FROM
WWV_FLOW_DEBUG WHERE ID = :B1 ORDER BY SEQ
was also run nearly 6,500 times, returning no rows.
And so it goes on. Later we have:
SELECT ID, CURRENT_TAB, CURRENT_TAB_FONT_ATTR, NON_CURRENT_TAB,
NON_CURRENT_TAB_FONT_ATTR, CURRENT_IMAGE_TAB, NON_CURRENT_IMAGE_TAB,
TOP_CURRENT_TAB, TOP_CURRENT_TAB_FONT_ATTR, TOP_NON_CURRENT_TAB,
TOP_NON_CURRENT_TAB_FONT_ATTR, HEADER_TEMPLATE, BOX, FOOTER_TEMPLATE,
NAVIGATION_BAR, NAVBAR_ENTRY, SUCCESS_MESSAGE, BODY_TITLE, MESSAGE,
TABLE_BGCOLOR, TABLE_CATTRIBUTES, REGION_TABLE_CATTRIBUTES, HEADING_BGCOLOR,
FONT_SIZE, FONT_FACE, APP_TAB_BEFORE_TABS, APP_TAB_CURRENT_TAB,
APP_TAB_NON_CURRENT_TAB, APP_TAB_AFTER_TABS, NVL(DEFAULT_BUTTON_POSITION,
'TOP') DEFAULT_BUTTON_POSITION
FROM
WWV_FLOW_TEMPLATES WHERE ID = :B1
executing nearly 6,500 times and returning 1 row per execution.
and the same for this SQL: SELECT ERROR_PAGE_TEMPLATE
FROM
WWV_FLOW_TEMPLATES WHERE ID = NVL(:B3 ,NVL(:B2 ,:B1 ))
Later we have some very obscure-looking data dictionary lookups.
I'd be interested to know if anyone has any thoughts on what might be occurring, or what other information I should obtain to provide clues as to what is going wrong.
Many thanks for your help.
Regards,
John.