I have a file containing several lines similar to:
Name: Peter
Address: St. Serrano número 12, España
Country: Spain
And I need to extract the address using a regular expression, taking into account that it can contain dots, special characters (ñ, ç), áéíóú...
The current code works, but it looks quite ugly:
Pattern p = Pattern.compile("^(.+?)Address: ([a-zA-Z0-9ñÑçÇáéíóú., ]+)(.+?)$",
Pattern.MULTILINE | Pattern.DOTALL);
Matcher m = p.matcher(content);
if (m.matches()) { ... }
Edit: The Address field could also be split across multiple lines:
Name: Peter
Address: St. Serrano número 12,
Madrid
España
Country: Spain
Edit: I can't use a Properties object or a YAML parser, as the file contains other kinds of information, too.
-
Not a Java person, but wouldn't a
"Address: (.*)$"
work?
Edit: Without the Pattern.MULTILINE | Pattern.DOTALL option it should match only on that line.
-
Can it contain a newline? If it cannot contain a newline, you don't need to use the multiline modifier, and can do instead
Pattern p = Pattern.compile("^Address: (.*)$");
If it can, an alternative I can think of is
Pattern p = Pattern.compile("Address: (.*)\nCountry", Pattern.MULTILINE);
Without the DOTALL, the dot won't match a newline, so you can explicitly specify it in the regexp, allowing you to do what you asked about.
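For the single-line case, here is a minimal sketch, assuming content holds the whole file as one String. Because the Address line is not the first line of that input, the sketch keeps MULTILINE so that ^ and $ can match at line boundaries, and it calls find() rather than matches(), since matches() only succeeds when the pattern covers the entire input. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleLineAddress {
    // MULTILINE only: ^ and $ match at line boundaries; without DOTALL
    // the dot stops at the line break, so only one line is captured.
    private static final Pattern ADDRESS =
            Pattern.compile("^Address:\\s*(.*)$", Pattern.MULTILINE);

    /** Returns the text after "Address:" on its line, or null if there is no such line. */
    static String findAddress(String content) {
        Matcher m = ADDRESS.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content = "Name: Peter\nAddress: St. Serrano número 12, España\nCountry: Spain\n";
        // matches() has to consume the whole input, which starts with "Name:", so it fails here.
        System.out.println(ADDRESS.matcher(content).matches());
        System.out.println(findAddress(content));
    }
}
```

This only captures the first line of the field; a multi-line address needs one of the other approaches in this thread.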
-
You might want to look into the Properties class instead of a regex. It provides ways to manage plain-text or XML files that represent key-value pairs.
So you can read in your example file and then get the values like so after loading it into a Properties object:
Properties properties = new Properties();
properties.load(/* InputStream of your file */);
Assert.assertEquals("Peter", properties.getProperty("Name"));
Assert.assertEquals("St. Serrano número 12, España", properties.getProperty("Address"));
Assert.assertEquals("Spain", properties.getProperty("Country"));
cletus : Why use Apache Commons Assert instead of Java assert?
-
You should definitely check out YAML.
You could try JYaml.
Best of all it has implementations in many languages.
P.S. I have tried the sample text in YAML::XS, and it works perfectly.
-
I don't mean to be a stick in the mud, but do you have to use a regex? Why not spare your future self (or others) the headache and do:
String line = reader.readLine();
while (line != null) {
    line = line.trim();
    if (line.startsWith("Address: ")) {
        // Java's method is substring(), not substr()
        return line.substring("Address: ".length()).trim();
    }
    line = reader.readLine();
}
return null;
Of course this can be parameterized a bit as well and put into a method.
Otherwise, I'd second the Properties or JYaml suggestions.
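If the Address value can continue over several lines, as in the question's edit, the same line-by-line idea can be extended. The following is only a sketch under one assumption that the question does not state: a line consisting of letters or digits followed by a colon (such as "Country: Spain") starts the next field. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class AddressLineReader {
    private static final String PREFIX = "Address:";

    // Assumption: a line like "Country: Spain" (letters/digits, then a colon) begins a new field.
    static boolean startsNewField(String line) {
        int colon = line.indexOf(':');
        if (colon <= 0) {
            return false;
        }
        for (int i = 0; i < colon; i++) {
            if (!Character.isLetterOrDigit(line.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Returns the Address value with continuation lines joined by '\n', or null if absent. */
    static String readAddress(BufferedReader reader) throws IOException {
        StringBuilder address = null;
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (address == null) {
                // Still looking for the Address line.
                if (trimmed.startsWith(PREFIX)) {
                    address = new StringBuilder(trimmed.substring(PREFIX.length()).trim());
                }
            } else if (startsNewField(trimmed)) {
                break; // the next field ends the address
            } else if (!trimmed.isEmpty()) {
                address.append('\n').append(trimmed); // continuation line
            }
        }
        return address == null ? null : address.toString();
    }

    /** Convenience overload for content that is already in memory. */
    static String readAddress(String content) {
        try (BufferedReader reader = new BufferedReader(new StringReader(content))) {
            return readAddress(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A continuation line that itself looks like "Word: ..." would be mistaken for a new field, so the field-start test would need adjusting to the real file format.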
-
Assuming "content" is a string containing the file's contents, your main problem is that you're using matches() where you should be using find().
Pattern p = Pattern.compile("^Address:\\s*(.*)$", Pattern.MULTILINE);
Matcher m = p.matcher(content);
if (m.find()) { ... }
There seems to be some confusion in other answers about MULTILINE and DOTALL modes. MULTILINE is what lets the ^ and $ anchors match the beginning and end, respectively, of a logical line. DOTALL lets the dot (period, full stop, whatever) match line separator characters like \n (linefeed) and \r (carriage return). This regex must use MULTILINE mode and must not use DOTALL mode.
Guido : Thanks. What if the address is a multiline field? Is it possible to capture it without depending on the next field name?
Alan Moore : Both of Nick's regexes will match if the Address field is at the end of the input. Is that what you mean?
-
I don't know Java's regex objects that well, but something like this pattern will do it:
^Address:\s*((?:(?!^\w+:).)+)$
assuming multiline and dotall modes are on.
This will match any line starting with Address, followed by anything until a newline character and a single word followed by a colon.
If you know the next field has to be "Country", you can simplify this a little bit:
^Address:\s*((?:(?!^Country:).)+)$
The trick is in the lookahead assertion in the repeating group. '(?!Country:).' matches any single character that is not the start of the string 'Country:', so we just stick it in noncapturing parentheses (?:...) and quantify it with +, then group all of that in normal capturing parentheses.
Guido : It worked ! Thank you ! I have to read more about regex :)
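For reference, the generic lookahead pattern from this answer can be used from Java roughly as follows. This is a minimal sketch, assuming content holds the whole file as one String; the class and method names are made up for illustration. Both MULTILINE and DOTALL are switched on, as the answer requires, and the backslashes are doubled because the pattern sits in a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineAddress {
    // MULTILINE: ^ and $ match at line boundaries. DOTALL: the dot also matches line breaks.
    // (?!^\w+:) stops the repetition just before any line that begins with "Word:".
    private static final Pattern ADDRESS = Pattern.compile(
            "^Address:\\s*((?:(?!^\\w+:).)+)$",
            Pattern.MULTILINE | Pattern.DOTALL);

    /** Returns the (possibly multi-line) Address value, or null if there is no Address field. */
    static String extractAddress(String content) {
        Matcher m = ADDRESS.matcher(content);
        // trim() drops a trailing line break when Address is the last field in the input.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String content = "Name: Peter\nAddress: St. Serrano número 12,\nMadrid\nEspaña\nCountry: Spain\n";
        System.out.println(extractAddress(content));
    }
}
```

Note that \w in Java matches only ASCII letters, digits and underscore by default, so a following field whose name contains accented letters or spaces would not end the match.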