Skip to Main Content

Hardware

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Interested in getting your voice heard by members of the Developer Marketing team at Oracle? Check out this post for AppDev or this post for AI focus group information.

vdbench failure: Returning error 797... then BAD_READ_RETURN: 'Read was successful, but data buffer

User_KRFV6Jun 2 2020 — edited Jun 2 2020

Hi All,

I'm looking at a vdbench failure that says we fail at start of workload on first read for BAD_READ_RETURN: 'Read was successful, but data buffer contents not changed' with a before/after pattern of "Front: dea96590 End: deadeeee" --> This seems to indicate a data validation problem, but let's not be so quick to claim this. Workload starts and we first get a 797 error reported by vdbench.

Sys:

18.04LTS Kernel 5.3

vdbench 50403

java 1.8

Workload (script6.txt):

data_errors=1

sd=sd1,lun=/dev/nvme6n1,align=4k

wd=wd1,sd=sd1,rdpct=100,seekpct=-1,xfersize=64k,openflags=o_direct

rd=run1,wd=wd1,iorate=max,threads=64,elapsed=115200,interval=60

vdbench output:

17:43:04.001 Starting RD=run1; I/O rate: Uncontrolled MAX; elapsed=115200; For loops: threads=64

17:43:04.333 17:43:04.333 localhost-0: 17:43:04.332 op: read lun: /dev/nvme6n1 lba: 131072 0x00020000 xfer: 65536 errno: BAD_READ_RETURN: 'Read was successful, but data buffer contents not changed'

17:43:04.836 killCommand: /vdbench/vdbench SlaveJvm -m localhost -n localhost-10-200519-17.43.03.086 -l localhost-0 -p 5582

17:43:04.836

17:43:04.836 'data_errors=1' requested. Abort rd=run1 after last error.

17:43:04.836

17:43:04.836 removeKillAtEnd(): Unable to find /vdbench/vdbench SlaveJvm -m localhost -n localhost-10-200519-17.43.03.086 -l localhost-0 -p 5582

java.lang.RuntimeException: 'data_errors=1' requested. Abort rd=run1 after last error.

at Vdb.common.failure(common.java:350)

at Vdb.common.failure(common.java:297)

at Vdb.ErrorLog.countErrorsOnMaster(ErrorLog.java:224)

at Vdb.SlaveOnMaster.processSlave(SlaveOnMaster.java:207)

at Vdb.SlaveOnMaster.run(SlaveOnMaster.java:42)

Std Out:

17:43:03.350 17:43:03.349 SlaveJvm execution parameter: '-m localhost'

17:43:03.350 17:43:03.349 SlaveJvm execution parameter: '-n localhost-10-200519-17.43.03.086'

17:43:03.350 17:43:03.350 SlaveJvm execution parameter: '-l localhost-0'

17:43:03.350 17:43:03.350 SlaveJvm execution parameter: '-p 5582'

17:43:03.350 17:43:03.350 SlaveJvm positional parameter: 'SlaveJvm'

17:43:03.448 17:43:03.447 successfully connected to master localhost

17:43:03.448 17:43:03.447 Connection to localhost using port 5582 successful

17:43:03.536 17:43:03.535 ticks_per_second: 100

17:43:03.547 17:43:03.547 Configuration interpretation took 0.01 seconds

17:43:03.679 17:43:03.679 Beginning of run setup

17:43:03.680 17:43:03.679 File '/var/log/messages' does not exist. Monitoring bypassed

17:43:03.680 17:43:03.679 **********************

17:43:03.680

17:43:03.680

17:43:03.680 17:43:03.679 File '/var/log/messages.1' does not exist. Monitoring bypassed

17:43:03.758 17:43:03.758 Opening sd=sd1,lun=/dev/nvme6n1; write: false; flags: OpenFlags: 0x00004000 OtherFlags: 0x00000000

17:43:03.770 17:43:03.770 Started 1 Workload Generator threads.

17:43:03.834 17:43:03.834 Started 64 i/o threads for sd1

17:43:03.834 17:43:03.834 Started a total of 64 i/o threads.

17:43:03.834 17:43:03.834 Waiting for task synchronization

17:43:03.835 17:43:03.834 Requesting full garbage collection

17:43:03.835 17:43:03.835 GcTracker: cum: 1 intv: 1 ms: 7 mss: 0.00% Heap_MB max: 911 curr: 252 used: 155 free: 96

17:43:03.855 17:43:03.855 GcTracker: cum: 3 intv: 2 ms: 18 mss: 0.00% Heap_MB max: 911 curr: 284 used: 17 free: 267

17:43:03.855 17:43:03.855 task_wait_start_complete() end

17:43:04.001 Starting RD=run1; I/O rate: Uncontrolled MAX; elapsed=115200; For loops: threads=64

17:43:04.002 17:43:04.002 task_run_all(): 65 tasks

17:43:04.329 17:43:04.328 Pre-formatted read buffer contents still found after successful read. Returning error 797

17:43:04.329 17:43:04.328 Negative response time. Usually caused by out of sync cpu timers.

17:43:04.329 17:43:04.329 Failure 2, buffer: 0x7fb784001000

17:43:04.329 17:43:04.329 Negative response time. Usually caused by out of sync cpu timers.

17:43:04.329 17:43:04.329 Will report a maximum of 100 times after which Vdbench will continue.

17:43:04.329 17:43:04.329 Will report a maximum of 100 times after which Vdbench will continue.

17:43:04.329 17:43:04.329 Front: 00110020 dea96590

17:43:04.329 17:43:04.329 Response time (microseconds): -1589928011188412

17:43:04.329 17:43:04.329 Response time (microseconds): -1589928011188663

17:43:04.329 17:43:04.329 End: 84001000 deadeeee

17:43:04.333 17:43:04.333 17:43:04.332 op: read lun: /dev/nvme6n1 lba: 131072 0x00020000 xfer: 65536 errno: BAD_READ_RETURN: 'Read was successful, but data buffer contents not changed'

17:43:04.436 17:43:04.436 Sent 64 interrupts to 64 waiting IO_task threads

17:43:04.437 17:43:04.436 Java Heap in MB. max: 910.500; current: 284.000; used: 28.442; free: 255.558

17:43:04.437 17:43:04.437 Maximum native memory allocation: 4,194,304; Current allocation: 0

17:43:04.437 17:43:04.437 End of run

17:43:04.437 17:43:04.437 **********

What is 797?

According to @"Henk Vandenbergh-Oracle" (correct me if I'm wrong): "The 797 check is actually not a Data Validation check, it is done for every single read operation, with or without DV being active... Reporting 'read' versus 'expected' data caused serious confusion, letting users think that "we read the wrong data". We did not read the wrong data, we did not read at all... THAT by the way is the main reason I introduced the 797 check, and the fact that the 797 also was done and worked without DV being active was an accidental and very lucky side effect."

Question:

So if the drive failed bc the data was never read, then this seems to point to a system issue, but what exactly? Should focus be on better JVM tuning or workload optimization? What is "Negative response time. Usually caused by out of sync cpu timers?"

Anything else I should be looking at? I looked at kernel log and syslog, but nothing stood out (let me know and I can post).

Thanks!!!

-Brad

Comments
Post Details
Added on Jun 2 2020
0 comments
1,687 views