An annoying problem with a rare character in gb2312 (Chinese charset)

807589, Jul 3 2008 (edited Jul 3 2008)
Hi, everyone! I have a very annoying problem with a rare character in the gb2312 charset and would really appreciate your help.
I am writing a project that crawls a series of web pages and extracts specific information from them. I don't save the pages to my local disk; I just open them online, extract the information I am interested in, and then close the connection.
    InputStream wpInStream = webPage2InputStream(threadHplink);
    ThreadAnalyzer.Analyze(wpInStream, wpEncoding, threadBuffer);
I read the web page via webPage2InputStream. Then I use ThreadAnalyzer.Analyze to extract the information I need, decoding with the charset wpEncoding (gb2312 in this case), and store the result in threadBuffer.
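The bodies of webPage2InputStream and ThreadAnalyzer.Analyze are not shown here. For reference, a minimal sketch of decoding an InputStream with an explicitly named charset might look like the following. The class name CharsetDecodeSketch and the method decode are hypothetical and only illustrate the idea; they are not the code from this project.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, not the poster's actual code.
class CharsetDecodeSketch {

    // Reads the whole stream and decodes its bytes with the named charset.
    // Once decoded, the text is an ordinary Java String (UTF-16 internally),
    // so the page's original encoding no longer matters for matching.
    static String decode(InputStream in, String charsetName) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new BufferedReader(
                new InputStreamReader(in, Charset.forName(charsetName)))) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample GB2312 bytes: two Chinese characters with the
        // full-width space (bytes 0xA1 0xA1) between them.
        byte[] raw = {(byte) 0xD6, (byte) 0xD0, (byte) 0xA1, (byte) 0xA1,
                      (byte) 0xCE, (byte) 0xC4};
        String text = decode(new ByteArrayInputStream(raw), "GB2312");
        System.out.println(text.length() + " chars decoded");
    }
}
```

If a page declares gb2312 but actually contains characters outside that set, decoding with a superset such as GBK may be worth trying; whether that applies to these pages is an assumption to verify.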
However, a rare character (this one: "�E") in gb2312 often appears in the information I am interested in. It looks like a blank that is a little wider than a normal space " ". When I paste it into a Java program, it shows up as a rectangle (paste this "�E" into the Eclipse editor and you'll see). I want to match this symbol in my code, but using something like ("...|"�E"|...") (where it appears as a rectangle in the Java source) doesn't work. I don't know how to write a regular expression that matches it.
Strangely, if I copy the .html file (including this damned symbol, of course) and save it as a .txt file in UTF-8, then it matches.
This makes me wonder whether I should convert the input stream to UTF-8 before extracting the information. Can anyone show me how to deal with this problem?
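One way to avoid pasting the character into the source file at all is to write it in the regular expression as a Unicode escape, and to print the code points of the extracted text to find out which character it really is. The sketch below assumes the wide blank is U+3000 (IDEOGRAPHIC SPACE, the full-width space in gb2312); that is a guess based on the description, not something confirmed here. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class RareCharMatchSketch {

    // ASSUMPTION: the "wide blank" is U+3000. Writing it as \\u3000 keeps the
    // source file pure ASCII, so the editor's file encoding cannot corrupt it.
    static final Pattern WIDE_BLANK = Pattern.compile("\\u3000");

    // Replaces every wide blank with an ordinary ASCII space.
    static String normalizeBlanks(String text) {
        return WIDE_BLANK.matcher(text).replaceAll(" ");
    }

    // Lists the code points of the text, e.g. to identify an unknown
    // character that only shows up as a rectangle in the editor.
    static String describeCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "a\u3000b";
        System.out.println(describeCodePoints(sample));
        System.out.println(normalizeBlanks(sample));
    }
}
```

If describeCodePoints reports a different value for the troublesome character, the same escape form can be used with that value in place of 3000.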
I really need your help.
This is especially annoying because it is only the beginning: I don't know how many more rare characters lie ahead of me. >_<