Thursday, January 06, 2005

Regular Expression Internals - I


Many people avoid regular expressions because they can look confusing and complicated, and nothing written here is going to change how they look. A regular expression is basically a formula for matching strings that follow some pattern. A bit of practice is the easiest way to become comfortable with these intricate expressions. This article covers how a regular expression engine works internally.

How does a Regular Expression Engine work internally?

Regular expressions are made up of normal characters and metacharacters. Normal characters include upper- and lower-case letters and digits. Metacharacters are special characters with special meanings. Understanding metacharacters is the key to making good use of regular expressions. Beyond knowing what an expression outputs, once we know how the engine processes it internally, it becomes much easier to formulate any expression, simple or complex. This also saves us a lot of guesswork and confusion when formulating an expression.

There are basically two kinds of regular expression engines: text-directed engines and regex-directed engines.

A text-directed engine is a DFA (Deterministic Finite Automaton). It runs in linear time because it does not require backtracking (and thus never tests the same character twice). It also matches the longest possible string. However, since a DFA engine contains only finite state, it cannot match a pattern with backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.
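To make the "never tests the same character twice" point concrete, here is a minimal sketch of a table-driven DFA. The transition table is hand-built for the illustrative pattern a(b|c)*d; the names DFA, ACCEPTING and dfa_fullmatch are assumptions made for this example and do not come from any library.

```python
# Hypothetical hand-built DFA for the pattern a(b|c)*d.
# Keys are (current state, input character); values are the next state.
DFA = {
    ("start", "a"): "loop",
    ("loop", "b"): "loop",
    ("loop", "c"): "loop",
    ("loop", "d"): "accept",
}
ACCEPTING = {"accept"}


def dfa_fullmatch(text):
    """Return True if the whole text is accepted by the DFA.

    Each character is read exactly once, so the running time is
    proportional to len(text) and there is nothing to backtrack to.
    """
    state = "start"
    for ch in text:
        state = DFA.get((state, ch))
        if state is None:  # no transition for this character: reject
            return False
    return state in ACCEPTING


print(dfa_fullmatch("abcbd"))  # True
print(dfa_fullmatch("abca"))   # False
```

Note that the loop only remembers the current state. That is why such an engine has no record of which part of the text matched which part of the pattern, and so cannot offer captures or backreferences.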

The features of a text-directed engine are:

  • searching is fast (linear time per pass, O(n^2) worst case)
  • search time depends on length of string, not on regex.
  • takes more memory (state explosion in NFA to DFA construction) than NFA
  • takes longer to compile the regex
    - might be done when program is compiled
    - might be done at runtime (just before string matching is needed)
    - some implementations even compile the regex in the midst of matching (if a match is found before the entire DFA is constructed, they can just stop)

  • Sample diagram is given below:

    A regex-directed engine is an NFA (Non-deterministic Finite Automaton). Its algorithm tests all possible expansions of a regular expression in a specific order and accepts the first match. Because an NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and match backreferences. However, since an NFA backtracks, it can visit exactly the same state multiple times if the state is reached over different paths. As a result, it can run slowly in the worst case. Since an NFA accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.
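These behaviours are easy to observe with Python's re module, which is a backtracking engine of this kind. The sample strings below are made up for illustration.

```python
import re

# Alternatives are tried left to right and the first success is accepted,
# so the longer alternative "category" is never discovered here.
first = re.match(r"cat|category", "category").group()
print(first)  # cat

# Reordering the alternatives changes the result.
longer = re.match(r"category|cat", "category").group()
print(longer)  # category

# Because the engine builds one specific expansion, it knows what each
# parenthesised subexpression matched.
captured = re.match(r"(\w+)@(\w+)", "alice@example").groups()
print(captured)  # ('alice', 'example')

# A backreference (\1) must match the same text that group 1 captured.
repeated = re.search(r"\b(\w+) \1\b", "it was the the best").group()
print(repeated)  # the the
```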

    Regex-directed engines are much favored by programmers because they are more expressive than DFA engines. Although they can run slowly in the worst case, we can steer them to find matches in linear or polynomial time by using patterns that reduce ambiguity and limit backtracking.
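The effect of ambiguity can be sketched with a toy backtracking matcher that counts how many character comparisons it makes. Everything here (lit, seq, plus, count_steps) is a hypothetical helper written for this illustration, not how any real engine is implemented. It compares the ambiguous pattern (a+)+b with the equivalent but unambiguous a+b on a string of twelve a's that cannot match.

```python
counter = {"steps": 0}  # number of single-character comparisons made


def lit(c):
    """Match one literal character, then hand over to the continuation k."""
    def match(s, i, k):
        counter["steps"] += 1
        return i < len(s) and s[i] == c and k(i + 1)
    return match


def seq(a, b):
    """Match a, then b."""
    return lambda s, i, k: a(s, i, lambda j: b(s, j, k))


def plus(a):
    """Greedy one-or-more: after each a, first try to repeat, then move on."""
    def match(s, i, k):
        return a(s, i, lambda j: (j > i and match(s, j, k)) or k(j))
    return match


def count_steps(pattern, text):
    """Try the pattern at position 0 only; return (matched, steps used)."""
    counter["steps"] = 0
    matched = bool(pattern(text, 0, lambda j: True))
    return matched, counter["steps"]


simple = seq(plus(lit("a")), lit("b"))        # a+b
nested = seq(plus(plus(lit("a"))), lit("b"))  # (a+)+b

text = "a" * 12  # no "b", so both patterns must fail
print(count_steps(simple, text))
print(count_steps(nested, text))
```

Both patterns accept exactly the same strings, but the nested version can split the run of a's between the inner and outer loops in exponentially many ways, and the matcher tries every one of them before giving up. The simple version has only one way to consume each a, so its step count grows linearly with the input.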

    The .NET Framework regular expression engine is a backtracking regular expression matcher that incorporates a Nondeterministic Finite Automaton (NFA) engine such as that used by Perl, Python, Emacs, and Tcl.

    Note: A very important point to understand when using a regex-directed engine is that it will always return the leftmost match, even if a "better" match could be found later. When applying a regular expression to a string, the engine starts at the first character of the string and tries all possible permutations of the regular expression there. Only if all possibilities have been tried and found to fail will the engine continue with the second character in the text. Again, it tries all possible permutations of the expression, in exactly the same order. The result is that the regex-directed engine returns the leftmost match.
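A quick way to see the leftmost-match rule, again using Python's re module with a made-up sample string:

```python
import re

text = "a1 then abc123"

# The pattern could match the longer "abc123" further to the right, but
# the engine succeeds at position 0 and never looks any further.
m = re.search(r"[a-z]+\d+", text)
print(m.group(), m.start())  # a1 0
```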

    The features of a regex-directed engine are:

  • searching can be slow (more on this later)
  • search time depends on regex
  • takes less memory than DFA (how much depends on regex)
  • compiles more quickly than DFA (how much depends on regex)
  • need to simulate the NFA (i.e., explore paths)
    - use backtracking algorithm to "try out the guesses"
    - different flavors of backtracking give different performance
    - when there are options, what order do we try them in?
  • can add in "extras" with no serious costs
    - sub-expression capturing
    - back-references (non-regular!)

    Sample diagram is given below:

    What is Backtracking?

    Backtracking is like leaving a pile of bread crumbs at every fork in the road. If the path we choose turns out to be a dead end, we can retrace our steps, giving up ground until we come across a pile of crumbs that indicates an untried path. Should that path, too, turn out to be a dead end, we can continue to backtrack, retracing our steps to the next pile of crumbs, and so on, until we eventually find a path that leads to our goal or until we run out of untried paths.

    The general idea of how backtracking works is fairly simple, but two details are quite important for real-world use. First, when faced with multiple choices, which choice should be tried first? Second, when forced to backtrack, which saved choice should the engine use?

    In situations where the decision is between "make an attempt" and "skip an attempt", as with items governed by a question mark, the engine always chooses to first make the attempt (this is the default, greedy behaviour). It will return later (to try skipping the item) only if forced by the overall need to reach a global expression-wide match.
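The "attempt first, skip later" rule can be observed with two optional groups competing for a single character. The lazy form a?? is shown only for contrast: it reverses the order and tries the skip first.

```python
import re

# Greedy: the first a? makes the attempt and takes the only "a",
# leaving nothing for the second group.
greedy = re.match(r"(a?)(a?)", "a").groups()
print(greedy)  # ('a', '')

# Lazy: the first a?? skips first, so the second group gets the "a".
lazy = re.match(r"(a??)(a?)", "a").groups()
print(lazy)  # ('', 'a')
```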

    This simple rule has far-reaching repercussions. For starters, it helps explain regex greediness, but not completely. To complete the picture, we need to know which (among possibly many) saved options to use when we backtrack. Simply put: the most recently saved option is the one returned to when a local failure forces backtracking. It's LIFO (last in, first out).

    This is easily understood in the crummy analogy: if your path becomes blocked, you simply retrace your steps until you come across a pile of bread crumbs. The first pile you return to is the one most recently laid. The traditional analogy for describing LIFO also holds: like stacking and unstacking dishes, the most recently stacked dish is the first one you unstack.
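The bread-crumb idea can be sketched with an explicit stack. The toy matcher below is an illustration only: it understands literal characters, "." for any character, and the greedy quantifiers ? and *, and the names parse and match_at_start are hypothetical. Every time it makes an attempt that could have been skipped, it pushes a crumb; on a local failure it pops the most recent one.

```python
def parse(pattern):
    """Split a tiny pattern into (char, quantifier) pairs.

    The quantifier is '', '?' or '*'. No escaping or grouping is supported.
    """
    tokens = []
    i = 0
    while i < len(pattern):
        ch, quant = pattern[i], ""
        if i + 1 < len(pattern) and pattern[i + 1] in "?*":
            quant = pattern[i + 1]
            i += 1
        tokens.append((ch, quant))
        i += 1
    return tokens


def match_at_start(pattern, text):
    """Match at position 0 only; return the matched prefix or None."""
    tokens = parse(pattern)
    saved = []      # stack of (token index, text index) crumbs
    ti, si = 0, 0   # current position in the pattern and in the text
    while True:
        if ti == len(tokens):
            return text[:si]  # every token satisfied: overall match
        ch, quant = tokens[ti]
        ok = si < len(text) and (ch == "." or text[si] == ch)
        if quant == "" and ok:
            ti, si = ti + 1, si + 1
            continue
        if quant == "?":
            if ok:
                saved.append((ti + 1, si))  # crumb: skip the item instead
                si += 1
            ti += 1
            continue
        if quant == "*":
            if ok:
                saved.append((ti + 1, si))  # crumb: stop repeating here
                si += 1
            else:
                ti += 1
            continue
        # Local failure: return to the most recently saved option (LIFO).
        if not saved:
            return None
        ti, si = saved.pop()


print(match_at_start("a*ab", "aaab"))   # aaab
print(match_at_start(".*c", "abcabd"))  # abc
print(match_at_start("ab", "ac"))       # None
```

In the first call, a* greedily takes all three a's, the following literal a fails against "b", and the engine pops the newest crumb, giving back exactly one a. In the second call, .* swallows the whole string and then gives back one character at a time until the c can match.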


    To sum up, when several alternatives have to be found, a simple regular expression engine applying an expression once can outperform a state-of-the-art plain-text search algorithm that has to search through the data five times. Regular expressions also reduce development time. With a regular expression engine, it takes only one line (e.g. in Perl, PHP, Java or .NET) or a couple of lines (e.g. in C using PCRE) of code to, say, check whether the user's input looks like a valid email address.
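As a rough illustration of the one-line check, here it is in Python. The pattern is a deliberately loose assumption ("something@something.something") and is not a complete email validator; EMAIL_LIKE and looks_like_email are names made up for this example.

```python
import re

# Loose shape check only: non-space, non-@ text, an @, then a dotted domain.
EMAIL_LIKE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(value):
    """Return True if the whole string has the rough shape of an email address."""
    return EMAIL_LIKE.fullmatch(value) is not None


print(looks_like_email("user@example.com"))  # True
print(looks_like_email("not-an-email"))      # False
```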

    We will see the concepts in regular expressions and how to formulate an expression in the next article.

    Happy Programming....



