I am trying to run glpsol (the standalone solver from the GNU Linear Programming Kit) on a very large model. I don't have enough physical memory to fit the entire model, so I configured a lot of swap. Unfortunately, glpsol uses more memory to parse and preprocess the model than it does to actually run the core solver, so my roughly 2-3GB model requires 11GB of memory just to get started. (Much of this access is sequential, however.)
What I am encountering is that my new machine, running Solaris 10 (11/06) on a dual-core Athlon (64-bit, naturally) with 2GB of memory, starts up much, much more slowly than my old desktop machine, running Linux (2.6.3) on a single-core Athlon 64 with 1GB of memory. Both machines use identical SATA drives for swap, though with different motherboard controllers. The Linux machine gets through startup in about three hours; Solaris takes 9 hours or more.
So, here's what I've found out so far, and tried.
On Solaris, swapping takes place one page (4KB) at a time. You can see from this sample iostat output that I'm getting about 6-7ms latency from the disk, but that each read is just 4KB (628.8KB/s divided by 157.2 reads/s = 4KB/read):
device r/s w/s kr/s kw/s wait actv svc_t %w %b
cmdk0 157.2 14.0 628.8 784.0 0.1 1.0 6.6 2 99
cmdk1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
sd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
Linux has a feature called page clustering which swaps in multiple 4KB pages at once --- controlled by /proc/sys/vm/page-cluster, currently at its default of 3, i.e. 2^3 = 8 pages (32KB) per swap-in:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
hda 1270.06 2.99 184.23 6.39 11635.93 76.65 61.45 1.50 7.74 5.21 99.28
hdc 0.00 0.00 0.40 0.20 4.79 1.60 10.67 0.00 0.00 0.00 0.00
md0 0.00 0.00 1.00 0.00 11.18 0.00 11.20 0.00 0.00 0.00 0.00
hdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
(11636 512-byte sectors/sec = 5818KB/sec. Divided by 184 reads/sec, that gives just under 32KB per read.)
I didn't find anything I could tune in the Solaris kernel that would increase the granularity at which pages are swapped to disk.
I did find that Solaris supports large pages (2MB on x64, verified with "pagesize -a"), so I modified glpsol to use larger chunks (16MB) for its custom allocator and used memalign to allocate these chunks at 2MB boundaries. Then I rebooted the system and ran glpsol with
ppgsz -o heap=2MB glpsol ...
I verified with pmap -s that 2MB pages were being used, but only a very few of them.
8148: glpsol --cpxlp 3cljf-5.cplex --output solution-5 --log log-5
Address Bytes Pgsz Mode Mapped File
0000000000400000 116K - r-x-- /usr/local/bin/glpsol
000000000041D000 4K 4K r-x-- /usr/local/bin/glpsol
000000000041E000 432K - r-x-- /usr/local/bin/glpsol
0000000000499000 4K - rw--- /usr/local/bin/glpsol
0000000000800000 25556K - rw--- [ heap ]
00000000020F5000 944K 4K rw--- [ heap ]
00000000021E1000 4K - rw--- [ heap ]
00000000021E2000 68K 4K rw--- [ heap ]
00000000021F3000 4K - rw--- [ heap ]
....
00000000087C3000 4K 4K rw--- [ heap ]
00000000087C4000 2288K - rw--- [ heap ]
0000000008A00000 2048K 2M rw--- [ heap ]
0000000008C00000 2876K - rw--- [ heap ]
0000000008ECF000 480K 4K rw--- [ heap ]
0000000008F47000 4K - rw--- [ heap ]
...
000000003F4E8000 4K 4K rw--- [ heap ]
000000003F4E9000 5152K - rw--- [ heap ]
000000003F9F1000 60K 4K rw--- [ heap ]
000000003FA00000 2048K 2M rw--- [ heap ]
000000003FC00000 6360K - rw--- [ heap ]
0000000040236000 368K 4K rw--- [ heap ]
etc.
There are only 19 large pages listed (a total of 38MB of physical memory.)
I think my next step, if I don't receive any advice, is to try to preallocate the entire region of memory which stores (most of) the model as a single allocation. But I'd appreciate any insight into how to get better performance without a complete rewrite of the GLPK library.
1. When using large pages, is the entire 2MB page swapped out at once? Or is the 'large page' only used for mapping in the TLB? The documentation I read on swap/paging and on large pages didn't really explain the interaction. (I wrote a dtrace script which logs which pages get swapped into glpsol but I haven't tried using it to see if any 2MB pages are swapped in yet.)
2. If so, how can I increase the amount of memory that is mapped using large pages? Is there a command I can run that will tell me how many large pages are available? (Could I boot the kernel in a mode which uses 2MB pages only, and no 4KB pages?)
3. Is there anything I should do to increase the performance of swap? Can I give a hint to the kernel that it should assume sequential access? (Would "madvise" help in this case? The disk appears to be 100% active so I don't think adding more requests for 4KB pages is the answer--- I want to do more efficient disk access by loading bigger chunks of data.)