I'm on 11.2.0.3 Enterprise on Linux (OEL 5) on a relatively fast machine (both CPU and I/O subsystem). I exported (expdp) a single schema with a couple of hundred objects at PARALLEL=4. export estimated the data volume at 166 GB and took 6 hours to do it. I ran the exact same export with PARALLEL=8 and it took only 1 hour. (The schema and its data did not change.)
I reran both exports and traced each one (TRACE=0480300).
The master process trace for the PARALLEL=4 run showed that for the first and largest table (77 GB) it chose the direct_path access and calculated a parallelism of 2 for it (of course the calculated parallel value didn't mean anything since direct_path does not use it). All other segments also got run with direct_path, so there was no parallel execution (PX) used at all by any of the worker processes. The worker process assigned to this table was the one that ran for the 6 hours basically single threading on that one large segment.
The trace for the PARALLEL=8 run showed that for the same first and largest table it chose the external_table access method, calculated a parallel degree of 4 and ran that in a worker process with that number of PX slaves. This drastically reduced the run time of the whole export simply due to the faster throughput of this one worker.
My question is: how does datapump chose an access method? This one table didn't change in any way, and it could have used external_table and 2 PX slaves in the PARALLEL=4 run but did not.