name servers fail == 11g fails also
716545, Sep 11 2009, edited Sep 16 2009
I have a problem with my RAC installation. Today, all of a sudden, every application server stopped responding.
At first glance the problem seemed to be with the connection to the database. Running sqlplus from such a server [taken from the client_11 install] resulted in a failed connection due to a timeout.
Also, running "lsnrctl status" on a RAC node behaved strangely: instead of returning the services in a blink, it took some time to print the registered/running services. Connecting through IPC worked okay. The situation was similar to the case in Metalink note 334587.1, "Lsnrctl Start is Extremely Slow or Appears to Hang". Worth mentioning is that the "telnet to the RAC node's IP or VIP address on port 1521" test worked okay.
We figured out that the first nameserver in /etc/resolv.conf had gone dead for some reason. The quick fix was to add "options rotate", which makes the system resolver use the nameservers defined in this file in round-robin fashion.
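To make the quick fix concrete, here is a minimal Python sketch that reads resolv.conf-style text and reports the nameserver order and whether "rotate" is set. The sample contents are hypothetical: only 192.168.101.123 appears in this post, the second address is made up, and the timeout/attempts options are shown only as other standard resolver options, not as something we actually configured.

```python
from typing import List, NamedTuple


class ResolverConfig(NamedTuple):
    nameservers: List[str]
    options: List[str]


def parse_resolv_conf(text: str) -> ResolverConfig:
    """Collect nameserver addresses and option keywords from resolv.conf text."""
    nameservers: List[str] = []
    options: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or ';' at the start).
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            options.extend(fields[1:])
    return ResolverConfig(nameservers, options)


def uses_rotate(cfg: ResolverConfig) -> bool:
    """True when queries are spread over the listed nameservers."""
    return "rotate" in cfg.options


# Hypothetical file contents, for illustration only.
SAMPLE = """\
nameserver 192.168.101.123
nameserver 192.168.101.124
options rotate timeout:1 attempts:2
"""

if __name__ == "__main__":
    cfg = parse_resolv_conf(SAMPLE)
    mode = "rotate" if uses_rotate(cfg) else "fixed order, first server always tried first"
    print(cfg.nameservers, mode)
```

Without "rotate" the resolver always starts with the first entry, so a dead first nameserver adds a timeout to every lookup, which would fit the slow "lsnrctl status" we saw.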
But why did name resolution matter?
None of the tnsnames.ora files we use contain hostnames; they all use IP addresses only. From network dump analysis we've figured out that when an application connects to the RAC listener [on the "vip" address] it gets some kind of redirect to a DNS name. WHY? There's not a single hostname in tnsnames.ora or listener.ora on either RAC node, nor in the application servers' tnsnames.ora. And all the hostnames that Oracle might eventually need to use or resolve are in /etc/hosts on the RAC nodes and the app servers. But sqlplus still has to resolve the RAC node's hostname when trying to connect. Also, the RAC instances ask for DNS records almost constantly:
16:11:27.866431 IP 192.168.101.74.10897 > 192.168.101.123.domain: 23402+ AAAA? dbrac-74.i. (27)
16:11:27.866662 IP 192.168.101.123.domain > 192.168.101.74.10897: 23402* 0/1/0 (74)
16:11:27.902975 IP 192.168.101.74.24988 > 192.168.101.123.domain: 35224+ AAAA? dbrac-74.i. (27)
16:11:27.903183 IP 192.168.101.123.domain > 192.168.101.74.24988: 35224* 0/1/0 (74)
16:11:27.979018 IP 192.168.101.74.29821 > 192.168.101.123.domain: 11034+ AAAA? dbrac-74.i. (27)
16:11:27.979255 IP 192.168.101.123.domain > 192.168.101.74.29821: 11034* 0/1/0 (74)
16:11:28.258696 IP 192.168.101.74.61203 > 192.168.101.123.domain: 21381+ AAAA? dbrac-74.i. (27)
16:11:28.258963 IP 192.168.101.123.domain > 192.168.101.74.61203: 21381* 0/1/0 (74)
16:11:28.330877 IP 192.168.101.74.23137 > 192.168.101.123.domain: 3346+ AAAA? dbrac-74.i. (27)
16:11:28.331109 IP 192.168.101.123.domain > 192.168.101.74.23137: 3346* 0/1/0 (74)
16:11:28.829856 IP 192.168.101.74.35464 > 192.168.101.123.domain: 62385+ AAAA? dbrac-74.i. (27)
16:11:28.830221 IP 192.168.101.123.domain > 192.168.101.74.35464: 62385* 0/1/0 (74)
16:11:30.131158 IP 192.168.101.74.39835 > 192.168.101.123.domain: 34121+ AAAA? dbrac-74.i. (27)
16:11:30.131382 IP 192.168.101.123.domain > 192.168.101.74.39835: 34121* 0/1/0 (74)
16:11:30.166057 IP 192.168.101.74.39187 > 192.168.101.123.domain: 42780+ AAAA? dbrac-74.i. (27)
16:11:30.166279 IP 192.168.101.123.domain > 192.168.101.74.39187: 42780* 0/1/0 (74)
16:11:30.172000 IP 192.168.101.74.58269 > 192.168.101.123.domain: 35227+ AAAA? dbrac-74.i. (27)
16:11:30.172227 IP 192.168.101.123.domain > 192.168.101.74.58269: 35227* 0/1/0 (74)
16:11:30.247576 IP 192.168.101.74.61457 > 192.168.101.123.domain: 18621+ AAAA? dbrac-74.i. (27)
16:11:30.247826 IP 192.168.101.123.domain > 192.168.101.74.61457: 18621* 0/1/0 (74)
16:11:30.289915 IP 192.168.101.74.13906 > 192.168.101.123.domain: 56846+ AAAA? dbrac-74.i. (27)
16:11:30.290134 IP 192.168.101.123.domain > 192.168.101.74.13906: 56846* 0/1/0 (74)
16:11:30.667396 IP 192.168.101.74.36239 > 192.168.101.123.domain: 11083+ AAAA? dbrac-74.i. (27)
16:11:30.667693 IP 192.168.101.123.domain > 192.168.101.74.36239: 11083* 0/1/0 (74)
16:11:30.774745 IP 192.168.101.74.7890 > 192.168.101.123.domain: 45904+ AAAA? dbrac-74.i. (27)
16:11:30.775057 IP 192.168.101.123.domain > 192.168.101.74.7890: 45904* 0/1/0 (74)
16:11:31.119668 IP 192.168.101.74.53591 > 192.168.101.123.domain: 41298+ AAAA? dbrac-74.i. (27)
16:11:31.119933 IP 192.168.101.123.domain > 192.168.101.74.53591: 41298* 0/1/0 (74)
16:11:31.366476 IP 192.168.101.74.61967 > 192.168.101.123.domain: 48340+ AAAA? dbrac-74.i. (27)
16:11:31.366684 IP 192.168.101.123.domain > 192.168.101.74.61967: 48340* 0/1/0 (74)
16:11:34.171092 IP 192.168.101.74.41493 > 192.168.101.123.domain: 10076+ AAAA? dbrac-74.i. (27)
16:11:34.171394 IP 192.168.101.123.domain > 192.168.101.74.41493: 10076* 0/1/0 (74)
16:11:34.210090 IP 192.168.101.74.23754 > 192.168.101.123.domain: 12030+ AAAA? dbrac-74.i. (27)
16:11:34.210452 IP 192.168.101.123.domain > 192.168.101.74.23754: 12030* 0/1/0 (74)
16:11:34.537118 IP 192.168.101.74.11850 > 192.168.101.123.domain: 52502+ AAAA? dbrac-74.i. (27)
16:11:34.537402 IP 192.168.101.123.domain > 192.168.101.74.11850: 52502* 0/1/0 (74)
16:11:34.831453 IP 192.168.101.74.35855 > 192.168.101.123.domain: 39374+ AAAA? dbrac-74.i. (27)
16:11:34.831670 IP 192.168.101.123.domain > 192.168.101.74.35855: 39374* 0/1/0 (74)
16:11:35.272409 IP 192.168.101.74.56118 > 192.168.101.123.domain: 12528+ AAAA? dbrac-74.i. (27)
16:11:35.272629 IP 192.168.101.123.domain > 192.168.101.74.56118: 12528* 0/1/0 (74)
16:11:35.308928 IP 192.168.101.74.10511 > 192.168.101.123.domain: 300+ AAAA? dbrac-74.i. (27)
16:11:35.309167 IP 192.168.101.123.domain > 192.168.101.74.10511: 300* 0/1/0 (74)
16:11:35.596096 IP 192.168.101.74.44572 > 192.168.101.123.domain: 14462+ AAAA? dbrac-74.i. (27)
16:11:35.596300 IP 192.168.101.123.domain > 192.168.101.74.44572: 14462* 0/1/0 (74)
16:11:35.631935 IP 192.168.101.74.40755 > 192.168.101.123.domain: 52948+ AAAA? dbrac-74.i. (27)
16:11:35.632229 IP 192.168.101.123.domain > 192.168.101.74.40755: 52948* 0/1/0 (74)
16:11:35.704574 IP 192.168.101.74.56062 > 192.168.101.123.domain: 13250+ AAAA? dbrac-74.i. (27)
The RAC node resolves "dbrac-74.i", which in fact is present in /etc/hosts along with the corresponding IP. [The ".i" thingy is correct, we use the ".i" domain internally].
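For anyone who wants to quantify a dump like the one above, here is a rough Python sketch that counts the DNS questions per record type/name and per second. It assumes the exact tcpdump text format shown in this post; the function name and regular expression are my own. Note that "0/1/0" in the replies is tcpdump's answer/authority/additional record count, so each AAAA query appears to come back with no answer record.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches query lines only: destination port is "domain" and the ID ends in '+'.
QUERY_RE = re.compile(
    r"^(?P<sec>\d\d:\d\d:\d\d)\.\d+ IP (?P<src>\S+)\.\d+ > "
    r"(?P<dst>\S+)\.domain: \d+\+ (?P<qtype>\w+)\? (?P<name>\S+) \(\d+\)$"
)


def summarize_queries(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count DNS questions by (record type, name) and by wall-clock second."""
    by_question: Counter = Counter()
    by_second: Counter = Counter()
    for line in lines:
        m = QUERY_RE.match(line.strip())
        if m is None:
            continue  # replies and unrelated lines are ignored
        by_question[(m.group("qtype"), m.group("name"))] += 1
        by_second[m.group("sec")] += 1
    return by_question, by_second


# First four lines of the dump quoted in this post.
SAMPLE = [
    "16:11:27.866431 IP 192.168.101.74.10897 > 192.168.101.123.domain: 23402+ AAAA? dbrac-74.i. (27)",
    "16:11:27.866662 IP 192.168.101.123.domain > 192.168.101.74.10897: 23402* 0/1/0 (74)",
    "16:11:27.902975 IP 192.168.101.74.24988 > 192.168.101.123.domain: 35224+ AAAA? dbrac-74.i. (27)",
    "16:11:27.903183 IP 192.168.101.123.domain > 192.168.101.74.24988: 35224* 0/1/0 (74)",
]

if __name__ == "__main__":
    questions, per_second = summarize_queries(SAMPLE)
    for (qtype, name), count in questions.items():
        print(qtype, name, count)
    for second, count in sorted(per_second.items()):
        print(second, count)
```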
Also, when running strace on a test sqlplus connection from an app server to dbrac-74/192.168.101.75 I get:
connect(4, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("192.168.101.75")}, 16) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(62120), sin_addr=inet_addr("192.168.101.70")}, [47519518162960]) = 0
The address 192.168.101.75 is taken from the app server's tnsnames.ora:
DBRAC =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.101.75)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.101.77)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (FAILOVER = yes)
    (FAILOVER_MODE =
      (TYPE = SELECT)
      (METHOD = BASIC)
      (RETRIES = 6)
      (DELAY = 5)
    )
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = DBRAC)
    )
  )
The redirect then shows up in this network conversation:
read(4, "\0\n\0\0\5\2\0\0\1Z", 8208) = 10
read(4, "\1d\0\0\6\0\0\0\0@(ADDRESS=(PROTOCOL=TCP)(HOST=dbrac-76)(PORT=1521))\0(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.101.75)(PORT=1521))(LOAD_BALANCE=yes)(FAILOVER=yes)(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=6)(DELAY=5))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=CMSRAC)(CID=(PROGRAM=sqlplus)(HOST=v3-as-70)(USER=glassfish))(INSTANCE_NAME=DBCMSR2)))", 8208) = 356
And then sqlplus talks to the nameserver [192.168.101.123] to resolve "dbrac-76":
recvfrom(4, "&\245\205\200\0\1\0\0\0\1\0\0\7dbrac-76\1i\0\0\34\0\1\300\24\0\6\0\1\0\0\1,\0#\3dns\4domain\2pl\0\300\'w\300L\34\0\0*1\0\0\1,\0\22u\1\0\0\1,
", 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.101.123")}, [16]) = 74
It seems that "(ADDRESS=(PROTOCOL=TCP)(HOST=dbrac-76)(PORT=1521))" is the redirect, since the client later tries to connect to dbrac-76/192.168.101.76.
DBCMSR2 is the instance name on one of the RAC nodes.
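To check addresses like the one in that redirect without reading packets by eye, a small Python sketch along these lines could list every ADDRESS host that is not an IP literal and would therefore need name resolution. The helper names and regular expressions are my own and only handle the flat ADDRESS=(...)(...) form seen in this post; the HOST inside CID is deliberately skipped because it is the client's name, not an address to connect to.

```python
import ipaddress
import re
from typing import List

# One ADDRESS entry whose body is a flat run of (KEY=value) pairs.
ADDRESS_RE = re.compile(r"\(ADDRESS\s*=\s*((?:\([^()]*\)\s*)+)\)", re.IGNORECASE)
HOST_RE = re.compile(r"\(HOST\s*=\s*([^()\s]+)\s*\)", re.IGNORECASE)


def address_hosts(descriptor: str) -> List[str]:
    """Return the HOST value of every ADDRESS entry in a descriptor string."""
    hosts: List[str] = []
    for body in ADDRESS_RE.findall(descriptor):
        hosts.extend(HOST_RE.findall(body))
    return hosts


def is_ip_literal(host: str) -> bool:
    """True for IPv4/IPv6 literals, False for anything that needs a lookup."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def hosts_needing_dns(descriptor: str) -> List[str]:
    return [h for h in address_hosts(descriptor) if not is_ip_literal(h)]


# The redirect address as it appears in the strace output in this post.
REDIRECT = "(ADDRESS=(PROTOCOL=TCP)(HOST=dbrac-76)(PORT=1521))"

if __name__ == "__main__":
    print(hosts_needing_dns(REDIRECT))
```

Running the same check over the tnsnames.ora entry above returns an empty list, which matches the point of this post: the hostname only appears in what the listener sends back.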
Why is RAC using hostnames when a) I don't want it to and b) any kind of DNS failure results in a DoS?
It's an 11.1.0.7 two-node RAC system.