The same query was run both ways during the same hour. The buffer cache is "warm", but that doesn't help much at all in the serial plan. Serial execution (it does not finish even after 10 hours), SQL Monitoring report:
https://www.dropbox.com/s/dgvifjt1n28nx6q/noparallel_conv_path_SQL_Monitoring.html?dl=0
Parallel plan -- lots of direct path reads, runs in less than two minutes:
https://www.dropbox.com/s/6u84jnnra0fjhhs/parallel16_direct_path_SQL_Monitoring.html?dl=0
PGA usage is not much higher in the parallel plan, and given the extremely short execution time, the customer can tolerate having 32 extra CPU threads allocated. This sounds too good to be true -- what am I missing here? Is the optimizer picking a different plan in the serial version because of bad cardinality estimates (line 37 of the serial plan)? Thank you!
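
In case it helps anyone check the cardinality theory without opening the reports, here is a minimal sketch of how estimated and actual rows per plan line could be compared while the serial run is still executing. It assumes the statement is still being monitored (so it appears in V$SQL_PLAN_MONITOR) and that the Tuning Pack is licensed; `&sql_id` is only a placeholder for the real SQL_ID from the monitoring report, not a value taken from it.

```sql
-- Sketch only: '&sql_id' is a placeholder, substitute the SQL_ID of the monitored statement.
-- PLAN_CARDINALITY is the optimizer's estimate; OUTPUT_ROWS is what the plan line
-- has actually produced so far. A large gap (e.g. at plan line 37) would point
-- to a cardinality misestimate.
SELECT sql_exec_id,
       plan_line_id,
       plan_operation,
       plan_options,
       plan_cardinality AS estimated_rows,
       output_rows      AS actual_rows
FROM   v$sql_plan_monitor
WHERE  sql_id = '&sql_id'
ORDER  BY sql_exec_start, sql_exec_id, plan_line_id;
```

Running this for both the serial and the parallel execution and comparing the rows for the same operations should show whether the two plans diverge where the estimates are furthest off.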