
mykek.com Java Community: Regular Expression Tutorial Part 3

Alternation and Beginning of Line

The first problem we notice is that variable substitution strings at the beginning of the line are ignored. The regular expression requires at least one non-\ character before the variable substitution string. When the variable substitution string is located at the beginning of the line, no such character exists before it. As a result, there is no match.

To solve this problem, we can make use of the ^ symbol, which indicates the beginning of the line, and the concept of alternation. Consider the regular expression (^(?:\\\\)*|.*?[^\\](?:\\\\)*)\$\{([a-zA-Z0-9.]+)\}. The first capturing group is modified to match either an even number of \ characters at the beginning of the line (through the subexpression ^(?:\\\\)*), or a string of any characters followed by at least one non-\ character and, optionally, an even number of \ characters.
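As a rough illustration of the alternation, the sketch below compiles the expression above as a Java string literal (every backslash in the regular expression is doubled) and reports the variable name of the first match. The class name AlternationDemo and the helper firstVariable are made up for this example; they are not part of the tutorial's test case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationDemo {
    // The regular expression from the text, escaped for a Java string literal.
    static final Pattern TAKE3 = Pattern.compile(
        "(^(?:\\\\\\\\)*|.*?[^\\\\](?:\\\\\\\\)*)\\$\\{([a-zA-Z0-9.]+)\\}");

    // Hypothetical helper: returns the variable name captured by the
    // first match, or null when nothing in the input matches.
    static String firstVariable(String input) {
        Matcher m = TAKE3.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // ${abc} at the beginning of the line: matched through ^(?:\\\\)*
        System.out.println(firstVariable("${abc}"));     // abc
        // \${abc}: a single (odd) backslash escapes the string, so no match
        System.out.println(firstVariable("\\${abc}"));   // null
        // \\${abc}: an even number of backslashes, so the string is matched
        System.out.println(firstVariable("\\\\${abc}")); // abc
    }
}
```

The second alternative still handles the ordinary case in which some non-\ character comes before the variable substitution string, for example a${abc}.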

substituteString, Take 3

If we modify the test case to use the above regular expression, we end up with TestRegex03.java. Running the main method, we have the following output:

"abc" becomes "abc"
"a${abc}" becomes "aABC"
"${abc}" becomes "ABC"
"${abc}${defg}" becomes "ABC${defg}"
"${abc}z${defg}" becomes "ABCzDEFG"
"abc${abc}${defg}xyz" becomes "abcABC${defg}xyz"
"abc${abc}xy${defg}xyz" becomes "abcABCxyDEFGxyz"
"abc\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "abc\${abc}xy\\DEFGxyz\\\${abc}xy"
"\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\${abc}xy\\DEFGxyz\\\${abc}xy"
"\\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\\ABCxy\\DEFGxyz\\\${abc}xy"
"\\${abc}\\${defg}xyz\\\${abc}xy" becomes "\\ABC\\${defg}xyz\\\${abc}xy"
"\\\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\\\${abc}xy\\DEFGxyz\\\${abc}xy"
"\a\${abc}xy\\${defg}xyz\\\${abc}xy\\\lmn" becomes "\a\${abc}xy\\DEFGxyz\\\${abc}xy\\\lmn"

This is better, but there are still a few variable substitution strings that were ignored erroneously. We shall now look at how we can remedy that.

Lookbehind Quantifiers

If we examine the output from substituteString, Take 3, we notice that a variable substitution string that immediately follows another does not get substituted. The problem is that the regular expression we have so far expects each variable substitution string to be preceded, within the same match, by either the beginning of the line or a character other than \. When one variable substitution string immediately follows another, the } in front of it has already been consumed by the previous match, so the second string ends up being consumed by .*? (or skipped entirely) instead of being matched by \$\{([a-zA-Z0-9.]+)\} as desired.

The lookbehind quantifiers (more precisely, lookbehind assertions) can come to the rescue in these cases. Lookahead and lookbehind are "zero-width", in the sense that they do not "consume" the substring they match. The matched substring either forms a prefix or a suffix that lies outside the string matched by the regular expression, or must also be matched by another part of the regular expression. Lookahead and lookbehind can therefore be used to specify constraints on what must precede or follow a particular regular expression construct.
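To make the "zero-width" behaviour concrete, here is a small sketch that compares an ordinary pattern with a lookbehind version on the same input. The class LookbehindDemo and the helper describeFirstMatch are illustrative names only, and the patterns are simplified stand-ins rather than the tutorial's expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindDemo {
    // Hypothetical helper: describes the first match as "start-end:text",
    // or returns null when the pattern does not match the input.
    static String describeFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + m.end() + ":" + m.group() : null;
    }

    public static void main(String[] args) {
        // Ordinary pattern: the '}' is consumed and is part of the match.
        System.out.println(describeFirstMatch("\\}\\$", "a}$b"));      // 1-3:}$
        // Lookbehind: the '}' must precede the '$' but is not part of the match.
        System.out.println(describeFirstMatch("(?<=\\})\\$", "a}$b")); // 2-3:$
        // The constraint is still enforced: without a preceding '}' there is no match.
        System.out.println(describeFirstMatch("(?<=\\})\\$", "a$b"));  // null
    }
}
```

Because the lookbehind leaves the } out of the match, a } that was already consumed by an earlier match can still satisfy the constraint for the next one.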

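Consider the regular expression (^(?:\\\\)*|.*?[^\\](?:\\\\)*|(?<=})(?:\\\\)*)\$\{([a-zA-Z0-9.]+)\}. Here we added the subexpression (?<=})(?:\\\\)* to the alternation. The (?<=}) part of the new subexpression is a positive lookbehind, which enforces the constraint that the character } immediately precedes whatever follows, without making it part of the match. Taken together, the regular expression matches any variable substitution string that is preceded by either an even number (possibly zero) of \ characters at the beginning of the line, any characters followed by a non-\ character and an even number of \ characters, or a } followed by an even number of \ characters.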

substituteString, Take 4

If we modify the test case to use the above regular expression, we end up with TestRegex04.java. Running the main method, we have the following output:

"abc" becomes "abc"
"a${abc}" becomes "aABC"
"${abc}" becomes "ABC"
"${abc}${defg}" becomes "ABCDEFG"
"${abc}z${defg}" becomes "ABCzDEFG"
"abc${abc}${defg}xyz" becomes "abcABCDEFGxyz"
"abc${abc}xy${defg}xyz" becomes "abcABCxyDEFGxyz"
"abc\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "abc\${abc}xy\\DEFGxyz\\\${abc}xy"
"\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\${abc}xy\\DEFGxyz\\\${abc}xy"
"\\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\\ABCxy\\DEFGxyz\\\${abc}xy"
"\\${abc}\\${defg}xyz\\\${abc}xy" becomes "\\ABC\\DEFGxyz\\\${abc}xy"
"\\\${abc}xy\\${defg}xyz\\\${abc}xy" becomes "\\\${abc}xy\\DEFGxyz\\\${abc}xy"
"\a\${abc}xy\\${defg}xyz\\\${abc}xy\\\lmn" becomes "\a\${abc}xy\\DEFGxyz\\\${abc}xy\\\lmn"

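TestRegex04.java itself is not reproduced on this page, so the following is only a sketch of how a substituteString method built on the Take 4 expression might look. The class name SubstituteSketch, the vars map, the choice to leave unknown variables untouched, and the use of Matcher.appendReplacement are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubstituteSketch {
    // The Take 4 regular expression, escaped for a Java string literal.
    static final Pattern TAKE4 = Pattern.compile(
        "(^(?:\\\\\\\\)*|.*?[^\\\\](?:\\\\\\\\)*|(?<=})(?:\\\\\\\\)*)"
        + "\\$\\{([a-zA-Z0-9.]+)\\}");

    // Hypothetical implementation: replaces each matched ${name} with its
    // value from vars, keeping the prefix captured by group 1 unchanged.
    static String substituteString(String input, Map<String, String> vars) {
        Matcher m = TAKE4.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2);
            String value = vars.get(name);
            // Assumption: unknown variables are left as they were written.
            String replacement = m.group(1)
                + (value != null ? value : "${" + name + "}");
            // quoteReplacement keeps '\' and '$' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = Map.of("abc", "ABC", "defg", "DEFG");
        System.out.println(substituteString("${abc}${defg}", vars));       // ABCDEFG
        System.out.println(substituteString("abc${abc}${defg}xyz", vars)); // abcABCDEFGxyz
    }
}
```

With this sketch, the second variable substitution string in each input is picked up through the lookbehind alternative, since the } before it was consumed by the previous match.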


Written by Mike Kwong