Monday, May 23, 2005

Regular Expression Internals - II


This article continues "Regular Expression Internals - Part I" (http://www.codeproject.com/useritems/RegEx.asp), where I discussed the regex engine and how it backtracks. Here I discuss how to frame an expression logically, with a few examples.


How to formulate a regular expression?

Before we actually formulate a regular expression, we will look at a few concepts, such as literal characters and special characters, with examples.

The simplest pattern matched by a regular expression is a literal character or a sequence of literal characters. Anything in the target text that consists of exactly those characters in exactly the order listed will match. A lowercase character is not identical to its uppercase version, and vice versa. A space in a regular expression, by the way, matches a literal space in the target.

For instance:

Criteria: /a/

"An institute has posted guidance that protects against a reported vulnerability in all versions of software that could allow a Web site visitor to view secured content by using specially crafted requests to a Web server."

Criteria: /Web/

"An institute has posted guidance that protects against a reported vulnerability in all versions of software that could allow a Web site visitor to view secured content by using specially crafted requests to a Web server."
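To make the literal-matching rules concrete, here is a minimal Python sketch. Python is my choice for illustration only; the same patterns behave the same way in most regex engines. The variable names are hypothetical.

```python
import re

text = ("An institute has posted guidance that protects against a reported "
        "vulnerability in all versions of software that could allow a Web site "
        "visitor to view secured content by using specially crafted requests "
        "to a Web server")

# /Web/ matches the literal characters W, e, b in exactly that order.
web_hits = re.findall(r"Web", text)

# Matching is case-sensitive by default, so /web/ finds nothing in this text.
lower_hits = re.findall(r"web", text)

# /a/ matches every lowercase "a"; the capital "A" in "An" is not a match.
a_hits = re.findall(r"a", text)

print(len(web_hits), len(lower_hits), len(a_hits))
```

Passing a flag such as re.IGNORECASE would make /web/ match "Web" as well, but that is an explicit opt-in.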

Special characters, as mentioned above, have special meanings in regular expressions. Any special character (such as * or \) can still be matched literally, but to do so we must prefix it with a backslash. This includes the backslash itself: to match a backslash in the target, the regular expression should include "\\".



For instance: /\.\*/ matches a literal dot followed by a literal asterisk.
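A short sketch of escaping in Python follows. The sample string and variable names are made up for illustration; re.escape is the standard-library helper that adds the backslashes for you.

```python
import re

sample = r"path C:\temp costs 3*2 dollars (approx.)"

# An escaped asterisk matches a literal "*" instead of acting as a quantifier.
star = re.search(r"3\*2", sample)

# Two backslashes in the pattern match one literal backslash in the target.
backslash = re.search(r"C:\\temp", sample)

# re.escape() escapes every special character in a piece of literal text,
# which is handy when the text to search for comes from user input.
escaped = re.escape("(approx.)")
paren = re.search(escaped, sample)

print(star.group(), backslash.group(), paren.group())
```

Using raw strings (the r"..." prefix) keeps Python itself from interpreting the backslashes before the regex engine sees them.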

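Among the special characters, the caret (^) anchors a match to the start of the target, the dollar sign ($) anchors it to the end, and the dot (.) matches any single character. For instance: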

Criteria: /^Web/

"An institute has posted guidance that protects against a reported vulnerability in all versions of software that could allow a Web site visitor to view secured content by using specially crafted requests to a Web server."

Criteria: /Web$/

"An institute has posted guidance that protects against a reported vulnerability in all versions of software that could allow a Web site visitor to view secured content by using specially crafted requests to a Web server."

Criteria: /.a/

"An institute has posted guidance that protects against a reported vulnerability in all versions of software that could allow a Web site visitor to view secured content by using specially crafted requests to a Web server."
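The three criteria above can be tried against the sample sentence with a small Python sketch (variable names are hypothetical). Note that, taken as one string, this sentence starts with "An" and ends with "server", so the anchored "Web" patterns have nothing to match here.

```python
import re

text = ("An institute has posted guidance that protects against a reported "
        "vulnerability in all versions of software that could allow a Web site "
        "visitor to view secured content by using specially crafted requests "
        "to a Web server")

# ^Web matches only when the target begins with "Web".
starts_web = re.search(r"^Web", text)
# Web$ matches only when the target ends with "Web".
ends_web = re.search(r"Web$", text)

# The same anchors succeed with the words that really sit at each end.
starts_an = re.search(r"^An", text)
ends_server = re.search(r"server$", text)

# .a matches any one character (except a newline) followed by "a".
dot_hits = re.findall(r".a", text)

print(starts_web, ends_web, starts_an.group(), ends_server.group())
print(dot_hits[:5])
```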

A character class consists of a pair of square brackets ([]), optionally with the range symbol (-). Negation is expressed by a caret (^) placed immediately after the opening bracket.


Instead of supplying only a single character, we can include a pattern in a regular expression that matches any one of a set of characters. A set of characters can be given as a simple list inside square brackets.



For instance:

Criteria: /[aeiou]/ will match any single lowercase vowel.


For letter or number ranges we may also give only the first and last character of the range, with the range symbol (-) in the middle.

For instance:

Criteria: /[A-Za-z0-9]/


This will match any single lowercase or uppercase letter, or any digit from 0 to 9.
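Here is a small Python sketch of a listed set and a range-based set. The input strings are arbitrary examples of mine.

```python
import re

# [aeiou] matches exactly one lowercase vowel per match.
vowels = re.findall(r"[aeiou]", "Regular Expression")

# [A-Za-z0-9] matches one letter or digit; spaces and punctuation are skipped.
alnum = re.findall(r"[A-Za-z0-9]", "Web 2.0!")

print(vowels)  # ['e', 'u', 'a', 'e', 'i', 'o']
print(alnum)   # ['W', 'e', 'b', '2', '0']
```

A character class always consumes a single character; to match a run of them, follow the class with a quantifier such as + or *.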

Negation in this context means that any character not included in the listed set is matched.

For instance:


Criteria: /[^a-e]in/

"An institute has posted guidance that protects against a reported vulnerability in all versions of software that could allow a Web site visitor to view secured content by using specially crafted requests to a Web server."
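The negated class can be checked with a short Python sketch. I am assuming the intended criteria is [^a-e]in, i.e. one character outside a-e followed by "in"; the variable names are hypothetical.

```python
import re

text = ("An institute has posted guidance that protects against a reported "
        "vulnerability in all versions of software that could allow a Web site "
        "visitor to view secured content by using specially crafted requests "
        "to a Web server")

# [^a-e]in: one character that is NOT a, b, c, d or e, followed by "in".
hits = re.findall(r"[^a-e]in", text)

# The "ain" in "against" is skipped because "a" falls inside the excluded range.
for hit in hits:
    print(repr(hit))
```

A negated class still has to consume a character, so a space before "in" counts as a match, while an "in" at the very start of the target would not.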

So far we have looked at the basics of regular expressions: what metacharacters are and how to formulate an expression. Now we will see a couple of examples of building real-world expressions.


Examples:


  1. Website URL:

^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))://)?(www.|[a-zA-Z0-9].)[a-zA-Z0-9-.]+.[a-zA-Z]*$


The above expression accepts URLs with or without http/https, such as:

a) http://www.aaa.com

b) https://www.ddd.sds

c) www.abc.co.in


Here, the end-user can type the URL in capital or lowercase letters. Moreover, the end-user may or may not include the "http" prefix. The logical grouping of the expression has to be based on this: we formulate the http/https part of the expression separately from the URL part.


- To mark the beginning of the string, the '^' symbol is used, and '$' marks the end of the string.

- '(...)' is used for logical grouping of part of an expression. Taking the protocol part, we have '(h|H?)', where '|' means alternation and '?' means 0 or 1 of the preceding element. Since the '?' applies only to 'H', the group matches 'h', 'H', or nothing. The expression evaluates the other letters of 'http' / 'https' in the same way.

- Then coming to the URL part, i.e. '(www.|[a-zA-Z0-9].)', the end-user can type 'www' or any letter from a-z (caps or small) or any numeral, followed by one more character. Note that an unescaped '.' matches any character; write '\.' where only a literal dot should be accepted. Following the www portion, the domain and the suffix (.com, .org) must follow the same constraint.


- Finally, we are left with joining the protocol and the URL. A '?' is placed after the protocol group, between the two parts, which makes the protocol optional: a full URL may or may not contain a protocol.
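The URL expression can be exercised with the Python sketch below. It assumes the alternation bars ('|') sit where the explanation above describes them. The helper name is_url is hypothetical, and I have moved the hyphen to the end of the character class, which keeps the same meaning while avoiding any ambiguity about ranges.

```python
import re

# The article's expression with the '|' alternations written out.
# Assumption: "[a-zA-Z0-9-.]" is written as "[a-zA-Z0-9.-]" (same characters).
URL_PATTERN = re.compile(
    r"^(((h|H?)(t|T?)(t|T?)(p|P?)(s|S?))://)?"
    r"(www.|[a-zA-Z0-9].)[a-zA-Z0-9.-]+.[a-zA-Z]*$"
)

def is_url(candidate):
    """Return True when the whole candidate string fits the pattern."""
    return URL_PATTERN.match(candidate) is not None

accepted = [is_url(u) for u in (
    "http://www.aaa.com",
    "https://www.ddd.sds",
    "www.abc.co.in",
    "HTTP://www.aaa.com",
)]
rejected = is_url("not a url")

# Because the dots are unescaped, each one stands for "any character",
# so a string without a single real dot is accepted as well.
loose = is_url("www-aaa-com")

print(accepted, rejected, loose)
```

If only real dots should be accepted, the dots meant literally would need to be written as \. in the pattern; a character class such as [hH] is also a stricter way to accept one letter in either case.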


  2. Validating a Number:

^([1-9]|[1-9]\d|100)$


The above expression matches whole numbers from 1 to 100, such as:


a) 1

b) 50

c) 100

- As usual, '^' marks the beginning of the expression and '$' marks the end.

- '[...]' is used for an explicit set of characters to match, e.g. a[bB]c -> abc, aBc.


- So, the first [1-9] set matches any single digit from 1 to 9. Note that an alternation ('|') follows, after which there is another [1-9] set. The check for the second digit uses '\d', which matches any digit. The alternation accepts either option.


- This is an expression for checking whole numbers from 1 to 100, so finally 100 is also included in the alternation.
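As a quick check of the number expression, the Python sketch below runs it over 0 to 101. It assumes the alternation is written as [1-9]|[1-9]\d|100; the helper name is hypothetical.

```python
import re

# One digit 1-9, or two digits not starting with 0, or exactly 100.
NUMBER_PATTERN = re.compile(r"^([1-9]|[1-9]\d|100)$")

def is_1_to_100(value):
    """Return True when the string is a whole number from 1 to 100."""
    return NUMBER_PATTERN.match(value) is not None

# Collect every number from 0 to 101 whose decimal form is accepted.
valid = [n for n in range(0, 102) if is_1_to_100(str(n))]

print(valid[0], valid[-1], len(valid))  # 1 100 100
print(is_1_to_100("0"), is_1_to_100("05"), is_1_to_100("101"))
```

Leading zeros are rejected because every alternative starts with a digit from 1 to 9.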


Happy Programming...

Comments:
good work.. keep going.. and visit http://forum.only4gurus.org
 
Hello.

My name is Gianni, I'm from Brazil. I just read your article but I'm still having some problems creating a regular expression.

I want a regular expression for a password where the minimum size must be 10 chars and the maximum 14. The problem happens when the user must insert at least 4 digits.

At this moment I'm trying to do something like \w+(\d{4}).

I'll be glad if you help me.

Thanks

Gianni Bernardes
 