I have a file containing several lines similar to:
Name: Peter
Address: St. Serrano número 12, España
Country: Spain
And I need to extract the address using a regular expression, taking into account that it can contain dots, special characters (ñ, ç), áéíóú...
The current code works, but it looks quite ugly:
Pattern p = Pattern.compile("^(.+?)Address: ([a-zA-Z0-9ñÑçÇáéíóú., ]+)(.+?)$",
                            Pattern.MULTILINE | Pattern.DOTALL);
Matcher m = p.matcher(content);
if (m.matches()) { ... }
Edit: The Address field could also be split across multiple lines:
Name: Peter
Address: St. Serrano número 12,   
Madrid
España
Country: Spain
Edit: I can't use a Properties object or a YAML parser, as the file contains other kinds of information, too.
- 
                        Not a Java person, but wouldn't "Address: (.*)$" work?
                        Edit: Without the Pattern.MULTILINE | Pattern.DOTALL options it should match only on that line.
- 
                        Can it contain a newline? If it cannot contain a newline, you don't need to use the multiline modifier, and can do instead
                        Pattern p = Pattern.compile("^Address: (.*)$");
                        If it can, an alternative I can think of is
                        Pattern p = Pattern.compile("Address: (.*)\nCountry", Pattern.MULTILINE);
                        Without the DOTALL, the dot won't match a newline, so you can explicitly specify it in the regexp, allowing you to do what you asked about.
- 
                        You might want to look into the Properties class instead of regex. It provides ways to manage plain-text or XML files that represent key-value pairs. So you can read in your example file and then get the values like so after loading it into a Properties object:
                        Properties properties = new Properties();
                        properties.load(/* InputStream of your file */);
                        Assert.assertEquals("Peter", properties.getProperty("Name"));
                        Assert.assertEquals("St. Serrano número 12, España", properties.getProperty("Address"));
                        Assert.assertEquals("Spain", properties.getProperty("Country"));
                        cletus: Why use Apache Commons Assert instead of Java assert?
- 
                        You should definitely check out YAML. You could try JYaml. Best of all, it has implementations in many languages. P.S. I have tried the sample text in YAML::XS, and it works perfectly.
- 
                        I don't mean to be a stick in the mud, but do you have to use a regex? Why not spare your future self (or others) the headache and do:
                        String line = reader.readLine();
                        while (line != null) {
                            line = line.trim();
                            if (line.startsWith("Address: ")) {
                                // Java's method is substring(); String has no substr()
                                return line.substring("Address: ".length()).trim();
                            }
                            line = reader.readLine();
                        }
                        return null;
                        Of course this can be parameterized a bit as well and put into a method. Otherwise, I'd second the Properties or JYaml suggestions.
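The loop in this answer returns only the first line of the field. A rough sketch of the same line-by-line idea extended to the multi-line case from the question's edit could look like the following. It assumes that any line not starting with a word followed by a colon continues the address; the class and method names (AddressLineScanner, readAddress) are made up for illustration and do not come from the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scans line by line instead of using one big regex.
final class AddressLineScanner {

    /** Returns the Address lines joined with '\n', or null if no Address field exists. */
    static String readAddress(BufferedReader reader) throws IOException {
        List<String> parts = null; // stays null until the Address line is seen
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (parts == null) {
                if (line.startsWith("Address:")) {
                    parts = new ArrayList<>();
                    parts.add(line.substring("Address:".length()).trim());
                }
            } else if (line.matches("\\w+:.*")) {
                break; // assumption: "Word:" at the start of a line begins the next field
            } else if (!line.isEmpty()) {
                parts.add(line); // continuation line of the address
            }
        }
        return parts == null ? null : String.join("\n", parts);
    }

    /** Convenience overload for content that is already in memory. */
    static String readAddress(String content) {
        try (BufferedReader reader = new BufferedReader(new StringReader(content))) {
            return readAddress(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An address line that itself starts with something like "Apt: 3" would end the field early under this assumption, so the field-start test may need tightening for real data.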
- 
                        Assuming "content" is a string containing the file's contents, your main problem is that you're using matches() where you should be using find().
                        Pattern p = Pattern.compile("^Address:\\s*(.*)$", Pattern.MULTILINE);
                        Matcher m = p.matcher(content);
                        if (m.find()) { ... }
                        There seems to be some confusion in other answers about MULTILINE and DOTALL modes. MULTILINE is what lets the ^ and $ anchors match the beginning and end, respectively, of a logical line. DOTALL lets the dot (period, full stop, whatever) match line separator characters like \n (linefeed) and \r (carriage return). This regex must use MULTILINE mode and must not use DOTALL mode.
                        Guido: Thanks. What if the address is a multiline field? Is it possible to capture it without depending on the next field name?
                        Alan Moore: Both of Nick's regexes will match if the Address field is at the end of the input. Is that what you mean?
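To make the matches() versus find() point concrete, here is a minimal, self-contained sketch. The class and method names (SingleLineAddress, extract, matchesWholeInput) are illustrative only; the pattern is the one given in this answer. Non-ASCII letters in the sample are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the single-line pattern from the answer.
final class SingleLineAddress {

    // MULTILINE only: ^ and $ anchor at line boundaries, and '.' stops at the line end.
    private static final Pattern ADDRESS =
            Pattern.compile("^Address:\\s*(.*)$", Pattern.MULTILINE);

    /** Returns the text after "Address:" on the first matching line, or null. */
    static String extract(String content) {
        Matcher m = ADDRESS.matcher(content);
        return m.find() ? m.group(1).trim() : null; // find() scans for a match anywhere
    }

    /** matches() succeeds only if the pattern covers the entire input. */
    static boolean matchesWholeInput(String content) {
        return ADDRESS.matcher(content).matches();
    }

    public static void main(String[] args) {
        String content = "Name: Peter\n"
                + "Address: St. Serrano n\u00famero 12, Espa\u00f1a\n"
                + "Country: Spain\n";
        System.out.println(extract(content));
        System.out.println(matchesWholeInput(content)); // false: input starts with "Name:"
    }
}
```

One caveat of this pattern: because \s* can also consume a line break, an empty Address line would capture the following line instead.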
- 
                        I don't know Java's regex objects that well, but something like this pattern will do it:
                        ^Address:\s*((?:(?!^\w+:).)+)$
                        assuming multiline and dotall modes are on. This will match any line starting with Address, followed by anything until a newline character and a single word followed by a colon. If you know the next field has to be "Country", you can simplify this a little bit:
                        ^Address:\s*((?:(?!^Country:).)+)$
                        The trick is in the lookahead assertion in the repeating group. '(?!Country:).' will match everything except the start of the string 'Country:', so we just stick it in noncapturing parentheses (?:...) and quantify it with +, then group all of that in normal capturing parentheses.
                        Guido: It worked! Thank you! I have to read more about regex :)
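For reference, the first pattern from this answer might be wired into Java roughly as follows. This is a sketch, not the answerer's own code: the class and method names are hypothetical, and the trim() on the captured group is an added convenience. Backslashes are doubled because the pattern lives in a Java string literal, and the sample's non-ASCII letters are written as Unicode escapes so the source encoding does not matter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the lookahead pattern from the answer.
final class MultiLineAddress {

    // The negative lookahead refuses to step onto a line that begins with "word:",
    // which is taken to be the start of the next field.
    // MULTILINE makes ^ and $ work per line; DOTALL lets '.' cross line breaks.
    private static final Pattern ADDRESS = Pattern.compile(
            "^Address:\\s*((?:(?!^\\w+:).)+)$",
            Pattern.MULTILINE | Pattern.DOTALL);

    /** Returns the (possibly multi-line) address, or null if there is none. */
    static String extract(String content) {
        Matcher m = ADDRESS.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String content = "Name: Peter\n"
                + "Address: St. Serrano n\u00famero 12,   \n"
                + "Madrid\n"
                + "Espa\u00f1a\n"
                + "Country: Spain\n";
        System.out.println(extract(content)); // prints the three address lines
    }
}
```

The captured text keeps its internal line breaks and trailing spaces on each line; collapsing them (for example with replaceAll("\\s*\\R\\s*", " ") on Java 8 or later) is left to the caller.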
 