Hi, everyone! I have a very annoying problem with a rare character in the gb2312 charset and would really appreciate your help.
I am writing a project that crawls a series of webpages and extracts some specific information from them. I don't save the webpages to my local disk; I just open them online, extract the information I am interested in, and then close the connection.
InputStream wpInStream = webPage2InputStream(threadHplink);
ThreadAnalyzer.Analyze(wpInStream, wpEncoding, threadBuffer);
I read the webpage via webPage2InputStream. Then I use ThreadAnalyzer.Analyze to extract the information I need, using the charset wpEncoding (gb2312 in this case), and store the result in threadBuffer.
However, a rare character (this one: "�E") in gb2312 often appears in the information I am interested in. It looks like a blank slightly wider than a normal space " ". When I paste it into a Java program, it shows up as a rectangle (paste "�E" into the Eclipse editor and you'll see). I want to match this symbol in my code, but a pattern like ("...|"�E"|...") (where it shows as a rectangle in the Java source) doesn't work. I don't know how to write a regular expression that matches it.
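One way to pin down exactly which character this is would be to print the code points of a string that contains it, so the pattern can then refer to it by number instead of by a pasted glyph. A minimal sketch follows; the class name CodePointDump is made up for illustration, and U+3000 (ideographic space) is only a guess at the character, based on how it looks.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointDump {
    /** Returns one "U+XXXX" label per code point in the text, in order. */
    static List<String> codePoints(String text) {
        List<String> labels = new ArrayList<>();
        text.codePoints().forEach(cp -> labels.add(String.format("U+%04X", cp)));
        return labels;
    }

    public static void main(String[] args) {
        // The wide space between 'a' and 'b' is a stand-in for the pasted
        // character; that it is U+3000 is an assumption, not a known fact.
        System.out.println(codePoints("a\u3000b"));
    }
}
```

Running this on a piece of the decoded page text should show which code point the rectangle really is.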
Strangely, if I copy the .html file (including this damned symbol, of course) and save it as a .txt file in UTF-8, the match works.
This makes me wonder whether I should convert the InputStream to UTF-8 before extracting the information. Can anyone show me how to deal with this problem?
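A minimal sketch of that idea, to make the question concrete: decode the bytes into a Java String with the page's charset first, then match the character by its code point rather than by pasting it into the source. The class and method names here are hypothetical, and the pattern assumes the character is the ideographic space U+3000 (bytes 0xA1 0xA1 in GB2312), which may not be what the page actually contains.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.regex.Pattern;

final class DecodeThenMatch {
    // Assumption: the troublesome character is the ideographic space U+3000.
    private static final Pattern WIDE_SPACE = Pattern.compile("\\u3000");

    /** Reads the whole stream as text, decoding the bytes with the given charset. */
    static String readAll(InputStream in, String charsetName) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, Charset.forName(charsetName)))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    /** Replaces every wide space with an ordinary ASCII space. */
    static String normalizeWideSpaces(String text) {
        return WIDE_SPACE.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // These four bytes stand in for a downloaded page: 'a', wide space, 'b'.
        byte[] page = {'a', (byte) 0xA1, (byte) 0xA1, 'b'};
        String text = readAll(new ByteArrayInputStream(page), "GB2312");
        System.out.println(normalizeWideSpaces(text));
    }
}
```

Once the bytes are decoded, the String is Unicode internally, so the file encoding of the Java source no longer matters for the match as long as the pattern uses an escape instead of the literal character.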
I really need your help~
It's really annoying because this is just the beginning. I don't know how many more rare characters are waiting ahead of me~~~ >_<