Format some text with Java-Thread 2

794113, Feb 17 2010 (edited Feb 28 2010)
Hi all,

I am starting a new thread because the previous one has already been answered and has served its purpose.
This time I need something more.
My text now looks like this:
 [NP The/DT U/NNP ]

 P/.

 [NP Workers/NNPS April/NNP skip/NN ] [PP to/TO ] [NP main/JJ skip/NN ] [PP to/TO ] [NP sidebar/NN ] [NP The/DT U/NNP ]

 P/.

 Workers/NNPS [NP This/DT site/NN ] [VP is/VBZ ] [ADJP open/JJ ] [PP for/IN ] [NP posting/VBG and/CC comments/NNS ] [PP by/IN ] [NP all/DT rank/NN and/CC file/NN administrative/JJ employees/NNS ] [PP of/IN ] [NP the/DT University/NNP ] [PP of/IN ] [NP the/DT Philippines/NNPS ] and/CC [NP the/DT Philippine/NNP General/NNP Hospital/NNP The/NNP National/NNP University/NNP Hospital/NNP ] [ADVP especially/RB ] [NP the/DT officers/NNS and/CC members/NNS ] [PP of/IN ] [NP the/DT All/NNP U/NNP ]

 P/.

 [NP Workers/NNPS Union/NNP ]

 [NP Friday/NNP April/NNP Stop/NNP Paying/NNP Nuke/NNP Plant/NNP Debt/NNP SC/NNP Justice/NNP Urges/NNPS Gov't/NNP ] [VP Posted/VBD pm/VBN ] [NP Mla/NNP time/NN April/NNP By/NNP Vincent/NNP Cabreza/NNP Inquirer/NNP News/NNP Service/NNP Published/NNP ] [PP on/IN ] [NP page/NN A/NNP ] [PP of/IN ] [NP the/DT Apr/NNP ]

But/CC [NP Puno/NNP ] [VP points/VBZ ] [PRT out/RP ] [SBAR that/IN ] [NP the/DT US/NNP law/NN ] [VP bars/VBZ ] [NP the/DT towns/NNS ] [PP from/IN ] [VP issuing/VBG ] [NP new/JJ taxes/NNS ] [VP to/TO pay/VB ] [PP for/IN ] [NP their/PRP$ debts/NNS ] unsafe/JJ

www/WRB
I need to convert the text into the following format (this is the desired output):
The	DT	B-NP
U	NNP	I-NP

P

Workers	NNPS	B-NP
April	NNP	I-NP
skip	NN	I-NP
to	TO	B-PP
main	JJ	B-NP
skip	NN	I-NP
to	TO	B-PP
sidebar	NN	B-NP
The	DT	B-NP
U	NNP	I-NP

P
Workers  NNPS
.........
etc
.......
I have written code to transform the text, but its output does not match the format above, so the requirement is not met.

I am using a regex to solve the problem:
Pattern p = Pattern
            .compile("\\[(\\p{Alpha}+) +(\\p{Graph}+)/(\\p{Alpha}+)(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))? ]+(?:(\\./. |\\./.$))?(?: +(\\./. |\\./.$))?(?: +(\\p{Alnum}+)/(\\p{Alpha}+))?(?:(\\p{Alnum}+)/(\\p{Alpha}+))?",Pattern.MULTILINE);	
I print the output like this:
while (matcher.find()) {
    // System.out.println();
    System.out.println("For: " + matcher.group());
    System.out.println(matcher.group(2) + "\t" + matcher.group(3)
            + "\tB-" + matcher.group(1));

    if (matcher.group(4) != null) {
        System.out.println(matcher.group(4) + "\t" + matcher.group(5)
                + "\tI-" + matcher.group(1));
    }
-------etc---------------------------------------
The regex is long because I wrote it to capture every kind of word inside the brackets []. But it fails to generate any output when it sees "But/CC " or a similar pattern in my text, whereas for a later one such as "unsafe/JJ" it does generate output.
So my current output (which is wrong) looks like this, with no blank lines after a sentence:
The	DT	B-NP
U	NNP	I-NP
Workers	NNPS	B-NP
April	NNP	I-NP
skip	NN	I-NP
to	TO	B-PP
main	JJ	B-NP
skip	NN	I-NP
to	TO	B-PP
sidebar	NN	B-NP
The	DT	B-NP
U	NNP	I-NP
This	DT	B-NP
site	NN	I-NP
is	VBZ	B-VP

-------
As you can see, some words have been omitted outright.
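
One general point that may explain the missing words: Matcher.find() silently skips any text that no match covers, so a pattern that must start at "[" never reports a token sitting outside the brackets. The sketch below is only a diagnostic aid. It uses a simplified stand-in pattern (CHUNK, not the long pattern above) and a hypothetical helper, skipped(), to list whatever falls between successive matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkippedTextDemo {
    // Assumption: a simplified stand-in for the long pattern in the post.
    // It matches one bracketed chunk and nothing else.
    private static final Pattern CHUNK = Pattern.compile("\\[[^\\]]*\\]");

    /** Returns the non-blank pieces of text that lie between successive matches. */
    static List<String> skipped(String line) {
        List<String> gaps = new ArrayList<>();
        Matcher m = CHUNK.matcher(line);
        int last = 0; // end offset of the previous match
        while (m.find()) {
            String gap = line.substring(last, m.start()).trim();
            if (!gap.isEmpty()) {
                gaps.add(gap);
            }
            last = m.end();
        }
        // Text after the final match is skipped as well.
        String tail = line.substring(last).trim();
        if (!tail.isEmpty()) {
            gaps.add(tail);
        }
        return gaps;
    }

    public static void main(String[] args) {
        String line = "But/CC [NP Puno/NNP ] [VP points/VBZ ] unsafe/JJ";
        System.out.println(skipped(line));
    }
}
```

Running the same gap check with the real pattern would show exactly which tokens it never matches.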

So I have 2 requirements:

1. How can I capture a token such as "But/CC" (or anything of this type) that is not inside brackets?
2. In the input text, every sentence or pattern is followed by a blank line. I need the output to have a line break after each sentence as well, just as in the input text file. [There should also be a line break after P/., as there is in the input.]

Please refer to the desired output shown earlier in this thread. I need regex code to solve this. Please help me modify my code or write new code.

Thanks!
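
One possible approach to both requirements, offered as a minimal sketch rather than a definitive answer: read the input line by line, and on each line let a single alternation match either a whole bracketed chunk or one unbracketed token. The class and method names (ChunkFormatter, formatLine, convert) are made up for illustration. The sketch assumes that an unbracketed token is printed as word and tag with no chunk label (as in the "Workers  NNPS" line of the desired output), so "P/." comes out as "P" followed by a tab and ".", and that every blank input line becomes one blank output line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkFormatter {
    // Either a bracketed chunk "[LABEL tok/TAG ... ]" (groups 1 and 2)
    // or a single unbracketed token such as "But/CC" (group 3).
    private static final Pattern ITEM =
            Pattern.compile("\\[(\\p{Alpha}+)\\s+([^\\]]*)\\]|(\\S+)");

    /** Converts one input line into output rows. */
    static List<String> formatLine(String line) {
        List<String> rows = new ArrayList<>();
        Matcher m = ITEM.matcher(line);
        while (m.find()) {
            if (m.group(3) != null) {
                // Token outside brackets: no chunk label (an assumption).
                rows.add(row(m.group(3), null));
                continue;
            }
            String label = m.group(1);
            String prefix = "B-"; // first token of a chunk
            for (String tok : m.group(2).trim().split("\\s+")) {
                if (tok.isEmpty()) {
                    continue;
                }
                rows.add(row(tok, prefix + label));
                prefix = "I-"; // every later token of the same chunk
            }
        }
        return rows;
    }

    // Splits "word/TAG" at the last slash, so words containing '/' stay intact.
    private static String row(String token, String chunkTag) {
        int slash = token.lastIndexOf('/');
        String word = slash < 0 ? token : token.substring(0, slash);
        String pos = slash < 0 ? "" : token.substring(slash + 1);
        return word + "\t" + pos + (chunkTag == null ? "" : "\t" + chunkTag);
    }

    /** Converts a whole text; each blank input line becomes one blank output line. */
    static String convert(String text) {
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\\R")) {
            if (line.trim().isEmpty()) {
                out.append('\n');
            } else {
                for (String r : formatLine(line)) {
                    out.append(r).append('\n');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = " [NP The/DT U/NNP ]\n\n P/.\n\nBut/CC [NP Puno/NNP ] [VP points/VBZ ]";
        System.out.print(convert(text));
    }
}
```

Because the tokens inside a chunk are split on whitespace instead of being captured by one optional group each, the chunk length is not limited, and tags such as "PRP$" or words such as "Gov't" need no special handling. Malformed input, for example a "[" with no closing "]", would fall through to the single-token branch and would need separate treatment.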
Locked on Mar 28 2010
Added on Feb 17 2010