mykek.com Java Community: Regular Expression Tutorial Part 2

Capturing Groups

The first step to processing variable substitution is to recognize when a variable substitution string is present. This can be done by defining an appropriate pattern in a regular expression. It is also necessary to extract the variable name so that the appropriate value can be retrieved for the substitution. This can be accomplished using groups in the regular expression.

The substitution string ${variable_name} can be matched with the regular expression \$\{[a-zA-Z0-9.]+\}, assuming that a variable name may contain only alphanumeric characters and periods. The character class [a-zA-Z0-9.] matches any single character listed within the square brackets, in this case any alphanumeric character or a period. The + means one or more occurrences of the preceding element, so [a-zA-Z0-9.]+ matches any non-empty string consisting only of alphanumeric characters and periods. The characters $ and { have special meaning in regular expressions ($ anchors the end of the input, and { opens a repetition count). Here we want the literal characters themselves, so we escape them using the \ character. The closing } is escaped as well, which keeps the expression symmetric.
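As a quick check of the expression above, the following minimal sketch tests whether a string contains a substitution string at all. The class name MatchSubstitutionDemo and the helper containsSubstitution are hypothetical names introduced only for this illustration; they are not part of the tutorial's test cases.

```java
import java.util.regex.Pattern;

class MatchSubstitutionDemo {
    // Hypothetical helper: true if the input contains at least one
    // ${variable_name} construct. Each \ of the regular expression is
    // doubled because it sits inside a Java string literal.
    static boolean containsSubstitution(String input) {
        Pattern pattern = Pattern.compile("\\$\\{[a-zA-Z0-9.]+\\}");
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSubstitution("Hello ${user.name}!")); // true
        System.out.println(containsSubstitution("Hello ${}!"));          // false: + needs at least one character
        System.out.println(containsSubstitution("Hello $user!"));        // false: no braces
    }
}
```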

In order to extract the variable name, we need to introduce the concept of a capturing group. (A backreference is a related but distinct feature: it refers back to the text that a capturing group has already matched.) We can create a capturing group by modifying the above regular expression to \$\{([a-zA-Z0-9.]+)\}. The substring matched by the expression contained within the round brackets can be captured and processed in various ways. We will see later (in the Java code sample) how we can use a capturing group to extract the variable name from the string of interest.
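To make the idea concrete before the full code sample, here is a small sketch that pulls the variable name out of the first substitution string it finds. The class CaptureGroupDemo and the helper firstVariableName are hypothetical names used for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Hypothetical helper: returns the name inside the first ${...}
    // construct, or null if the input contains none.
    static String firstVariableName(String input) {
        Matcher matcher = Pattern.compile("\\$\\{([a-zA-Z0-9.]+)\\}").matcher(input);
        // group(0) would be the whole match, e.g. "${user.name}";
        // group(1) is only the text matched inside the round brackets.
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstVariableName("Hello ${user.name}!")); // user.name
        System.out.println(firstVariableName("no variables here"));   // null
    }
}
```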

We also need to capture the strings that do not form part of the substitution string. We can accomplish this by using another capturing group. The regular expression would now look like (.*?)\$\{([a-zA-Z0-9.]+)\}. The . stands for any character, while the string *? is a reluctant quantifier, as opposed to a greedy quantifier (such as *). This is important, since a greedy (.*) could itself match text such as ${name}. On a line with several substitution strings, it would swallow all but the last one into the prefix, which would prevent us from processing them properly.
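The difference between the two quantifiers is easiest to see side by side. The sketch below applies both variants of the expression to the same input and reports what each capturing group holds after the first match. The class QuantifierDemo and the helper firstMatch are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns {prefix, variable name} for the first
    // match of the given regular expression, or null if there is no match.
    static String[] firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    public static void main(String[] args) {
        String input = "a${x}b${y}";

        // Reluctant: the prefix stops at the first substitution string.
        String[] reluctant = firstMatch("(.*?)\\$\\{([a-zA-Z0-9.]+)\\}", input);
        System.out.println(reluctant[0] + " | " + reluctant[1]); // a | x

        // Greedy: the prefix swallows ${x}, so only ${y} is recognized.
        String[] greedy = firstMatch("(.*)\\$\\{([a-zA-Z0-9.]+)\\}", input);
        System.out.println(greedy[0] + " | " + greedy[1]); // a${x}b | y
    }
}
```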

substituteString, Take 1

Let's take a look at our first implementation of the substituteString method.

	private static String substituteString(String line) {
		Pattern variablePattern = Pattern.compile("(.*?)\\$\\{([a-zA-Z0-9.]+)\\}");
		Matcher substitutionMatcher = variablePattern.matcher(line);
		if (substitutionMatcher.find()) {
			StringBuffer buffer = new StringBuffer();
			int lastLocation = 0;
			do {
				// Find prefix, preceding ${var} construct
				String prefix = substitutionMatcher.group(1);
				buffer.append(prefix);
				// Retrieve value of variable
				String key = substitutionMatcher.group(2);
				String value = getValue(key);
				buffer.append(value);
				// Update lastLocation
				lastLocation = substitutionMatcher.end();
			} while (substitutionMatcher.find(lastLocation));
			// Append final segment of the string
			buffer.append(line.substring(lastLocation));
			return buffer.toString();
		} else {
			return line;
		}		
	}

Note that every \ in the regular expression has been escaped as \\ in the pattern string, in accordance with the rules for Java string literals. A Matcher object is created for the string that we want to match against the regular expression. If there is no match, the string contains no substitution strings, and we return it unchanged. Otherwise, for each match found, we extract the string captured by the first capturing group (group #1), which represents the prefix preceding the variable substitution string, and append it to the buffer. We then extract the string captured by the second capturing group (the variable name) and use it as the key to retrieve the associated value, which is also appended to the buffer. In essence, this replaces the variable substitution string with the value associated with the variable. We repeatedly look for additional matches in the rest of the string until none can be found. Finally, we append the remaining string to the buffer and return the buffer content as the result.
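For readers who want to experiment without the full test case, here is a condensed, self-contained sketch of the same loop. It is not the author's TestRegex01.java: the class name SubstituteTake1Demo is made up, the do/while has been folded into a single while loop, and getValue is a hypothetical stand-in that simply upper-cases the key (an assumption chosen only because it is easy to check by eye).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubstituteTake1Demo {
    // Hypothetical stand-in for the tutorial's getValue method:
    // assumes the value of a variable is its name in upper case.
    static String getValue(String key) {
        return key.toUpperCase();
    }

    static String substituteString(String line) {
        Pattern variablePattern = Pattern.compile("(.*?)\\$\\{([a-zA-Z0-9.]+)\\}");
        Matcher matcher = variablePattern.matcher(line);
        StringBuilder buffer = new StringBuilder();
        int lastLocation = 0;
        // find(int) resets the matcher and searches from the given index.
        while (matcher.find(lastLocation)) {
            buffer.append(matcher.group(1));           // prefix before ${var}
            buffer.append(getValue(matcher.group(2))); // value of the variable
            lastLocation = matcher.end();
        }
        // Append whatever follows the last substitution string.
        buffer.append(line.substring(lastLocation));
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(substituteString("abc"));                   // abc
        System.out.println(substituteString("abc${abc}xy${defg}xyz")); // abcABCxyDEFGxyz
        // The escape character is ignored by this version:
        System.out.println(substituteString("abc\\${abc}"));           // abc\ABC
    }
}
```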

See TestRegex01.java for the full source code of the test case. If we run the main method at this point, we get the following output:

"abc" becomes "abc"
"a${abc}" becomes "aABC"
"${abc}" becomes "ABC"
"${abc}${defg}" becomes "ABCDEFG"
"${abc}z${defg}" becomes "ABCzDEFG"
"abc${abc}${defg}xyz" becomes "abcABCDEFGxyz"
"abc${abc}xy${defg}xyz" becomes "abcABCxyDEFGxyz"
"abc\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "abc\ABCxy\\DEFGxyz\\\ABCxy"
"\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\ABCxy\\DEFGxyz\\\ABCxy"
"\\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\\ABCxy\\DEFGxyz\\\ABCxy"
"\\${abc}\\${defg}xyz\\\${abc}xy" becomes "\\ABC\\DEFGxyz\\\ABCxy"
"\\\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\\\ABCxy\\DEFGxyz\\\ABCxy"
"\a\${abc}xy\\${defg}xyz\\\${abc}xy\\\lmn" becomes "\a\ABCxy\\DEFGxyz\\\ABCxy\\\lmn"

Comparing the results with the expected output, we see that we are not doing too badly. The only problem is that we ignored the escape character, so the "escaped variable substitution" strings were replaced inappropriately. We will now attempt to fix that problem.

Character Class and Non-Capturing Groups

Let's examine the following regular expression: (.*?[^\\](?:\\\\)*)\$\{([a-zA-Z0-9.]+)\}. The substring [^\\] is a negated character class that matches any single character except \, while the substring (?:\\\\)* matches any string that consists of an even number of \ characters (including none). The non-capturing group, indicated by (?:...), allows us to apply the repetition operator * to more than one character without creating an additional capturing group. As a result, the second capturing group continues to correspond to the variable name. The addition of this subexpression allows a variable substitution string to match only if it is preceded by either zero or an even number of \ characters.
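The effect of the escape handling can be checked in isolation. In the sketch below, the class EscapeAwareDemo and the helper matches are hypothetical names. Note how the \ characters multiply: each \ of the regular expression becomes \\ in the Java string literal, so [^\\] is written with four and (?:\\\\)* with eight.

```java
import java.util.regex.Pattern;

class EscapeAwareDemo {
    // The regular expression discussed above, written as a Java string literal.
    static final Pattern PATTERN =
            Pattern.compile("(.*?[^\\\\](?:\\\\\\\\)*)\\$\\{([a-zA-Z0-9.]+)\\}");

    // Hypothetical helper: true if the expression finds a substitution
    // string anywhere in the input.
    static boolean matches(String input) {
        return PATTERN.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("a${abc}"));          // true:  no \ before $
        System.out.println(matches("a\\${abc}"));        // false: one \ (escaped)
        System.out.println(matches("a\\\\${abc}"));      // true:  two \
        System.out.println(matches("a\\\\\\${abc}"));    // false: three \
    }
}
```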

substituteString, Take 2

If we modify the test case to use the above regular expression, we get TestRegex02.java. Running its main method produces the following output:

"abc" becomes "abc"
"a${abc}" becomes "aABC"
"${abc}" becomes "${abc}"
"${abc}${defg}" becomes "${abc}DEFG"
"${abc}z${defg}" becomes "${abc}zDEFG"
"abc${abc}${defg}xyz" becomes "abcABC${defg}xyz"
"abc${abc}xy${defg}xyz" becomes "abcABCxyDEFGxyz"
"abc\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "abc\${abc}xy\\DEFGxyz\\\${abc}xy"
"\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\${abc}xy\\DEFGxyz\\\${abc}xy"
"\\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\\${abc}xy\\DEFGxyz\\\${abc}xy"
"\\${abc}\\${defg}xyz\\\${abc}xy" becomes "\\${abc}\\DEFGxyz\\\${abc}xy"
"\\\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\\\${abc}xy\\DEFGxyz\\\${abc}xy"
"\a\${abc}xy\\${defg}xyz\\\${abc}xy\\\lmn" becomes "\a\${abc}xy\\DEFGxyz\\\${abc}xy\\\lmn"

While some of the escaped variable substitution strings are treated properly now, some valid variable substitution strings did not get substituted as expected. We will examine why next.

Written by Mike Kwong

