
Java Programming


Problem reading big file. No, bigger than that. Bigger.

844979 · Mar 4 2011 (edited Mar 5 2011)
I am trying to read a file roughly 340 GB in size. Yes, that's "Three hundred forty". Yes, gigabytes. (I've been doing searches on "big file java reading" and I keep finding things like "I have this huge file, it's 600 megabytes!".)

"Why don't you split it, you moron?" you ask. Well, I'm trying to.

Specifically, I need a slice "x" rows in. It's nicely delimited, so, in theory:

(pseudocode)

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

BufferedReader fr = new BufferedReader(new FileReader(new File(myhugefile)));
int startLine = 70000000;
String line;
int linesRead = 0;
while ((line = fr.readLine()) != null && linesRead < startLine)
{
    linesRead++; // we don't care about this
}
// ok, we're where we want to be, start caring
int linesWeWant = 100;
linesRead = 0;
while ((line = fr.readLine()) != null && linesRead < linesWeWant)
{
    doSomethingWith(line);
    linesRead++;
}

(Please assume the real code is better written and has been proven to work with hundreds of "small" files (under a gigabyte or two). I'm happy with my file read/file slice logic, overall.)

Here's the problem. No matter how I try reading the file (whether I start at a specific line or not, whether I save each line out to a String or not), it always dies with an OOM at around row 793,000,000. The OutOfMemoryError is thrown from BufferedReader.readLine(). Please note I'm not trying to read the whole file into a buffer, just one line at a time. Further, it dies at the same point no matter how high or low (within reason) I set my heap size, and watching the memory allocation shows it's not coming close to filling memory. I suspect the problem occurs once I've read more than an int's worth of bytes (2^31 - 1) from the file.
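
One thing I want to rule out before blaming the heap: if some "row" in there is missing its terminator and is actually gigabytes long, readLine() would try to buffer the entire thing before returning, and that would look exactly like this OOM. A quick diagnostic scan like the sketch below (chunked reads, no per-line Strings; the class name is just for illustration) should tell me whether there's a runaway line and where it starts:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LongestLine {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]), 1 << 16);
        char[] buf = new char[1 << 16];
        long longest = 0, current = 0, lineNo = 0, longestAt = 0;
        int read;
        while ((read = in.read(buf)) != -1) {
            for (int i = 0; i < read; i++) {
                if (buf[i] == '\n') {
                    if (current > longest) { longest = current; longestAt = lineNo; }
                    current = 0;
                    lineNo++;
                } else {
                    current++;
                }
            }
        }
        // account for a final line with no trailing '\n'
        if (current > longest) { longest = current; longestAt = lineNo; }
        in.close();
        System.out.println("Longest line: " + longest + " chars, at line " + longestAt);
    }
}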

Now, the problem is that it's not just this one file: the program needs to handle a general class of comma- or tab-delimited files, which may have any number of characters per row and any number of rows, and it needs to do so in a moderately sane timeframe. So this isn't a one-off where we can hand-tweak an algorithm because we know the file structure. I'm trying it now with RandomAccessFile.readLine(), since that's not buffered (I think...), but, my god, is it slow: my old code read 79 million lines and crashed in under three minutes, while the RandomAccessFile version has taken about 45 minutes and has only read 2 million lines.
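
For the speed side: if the only thing I need RandomAccessFile for is seek(), it should be possible to get buffering back by seeking with the RandomAccessFile and then wrapping its file descriptor in a buffered reader. A sketch under that assumption (resumeOffset is a made-up variable, and I haven't benchmarked this on a file this size):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;

// Seek with RandomAccessFile, then read through its descriptor with a
// buffered reader, so readLine() hits a 64 KB buffer instead of doing
// one unbuffered read per byte.
RandomAccessFile raf = new RandomAccessFile(myhugefile, "r");
raf.seek(resumeOffset); // hypothetical byte offset to resume from
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(raf.getFD())), 1 << 16);
String line = in.readLine(); // buffered from here on
// Caveat: once the buffer reads ahead, raf.getFilePointer() no longer
// matches the line you are on; closing 'in' also closes raf's descriptor.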

Likewise, we might start at line 1 and want a million lines, or start at line 50 million and want 2 lines. Nothing can be assumed about where we start caring about data or how much of it we care about; the only assumption is that it's a delimited file (tab or comma, though it might actually be any other delimiter) with one record per line.
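
So the general shape I'm after is something like the sketch below: skip the first startLine lines by counting newline characters in the buffered stream (skipped "lines" are never materialized as Strings, which keeps memory bounded even if one of them turns out to be enormous), then switch to readLine() for the region we actually want. slice is a placeholder name, and doSomethingWith is the same stand-in as in my pseudocode above:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Placeholder slice helper: skip startLine lines without building Strings,
// then hand the next 'count' lines to doSomethingWith().
static void slice(String path, long startLine, long count) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(path), 1 << 16);
    try {
        long skipped = 0;
        int c;
        while (skipped < startLine && (c = in.read()) != -1) {
            if (c == '\n') skipped++; // count terminators, discard everything else
        }
        String line;
        for (long kept = 0; kept < count && (line = in.readLine()) != null; kept++) {
            doSomethingWith(line); // per-line processing, as above
        }
    } finally {
        in.close();
    }
}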

And if I'm missing something brain-dead obvious...well, fine, I'm a moron. I'm a moron who needs to get files of this size read and sliced on a regular basis, so I'm happy to be told I'm a moron if I'm also told the answer. Thank you.