Today I stumbled across something really funny called the “lazy star”, a term used in regular expressions. I was working on an expression and could not get it to work until I found the lazy “guy”, which turned out to be the solution to the problem.
The problem briefly:
Let’s assume the following string (some pseudo html) is given…
- something in between!
and we want to strip all script tags out of the string, including their contents, so that it looks like the following afterwards:
- something in between!
My first attempt said, in words: “grab everything which starts with <script, followed by any character ([\s\S]) any number of times (*), and finally ends with </script>”. This works in principle, but it removed my whole content. The reason is that it grabs the first <script> tag and strips everything up to the last </script> tag, so unfortunately the “something in between!” string vanished as well. I then ended up playing around with expressions (which I really love, I cannot imagine doing anything better) so that the pattern grabs everything within the script tag, but not if a new script tag has already started. It took me at least 2 hours to manage this! Damn, I was really frustrated.
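To make the greedy behaviour concrete, here is a minimal Python sketch. The sample string and the variable names are made up for illustration, and the pattern is only my reading of the description above, not necessarily the exact expression I started with.

```python
import re

# Hypothetical input: two script blocks with plain text between them.
html = "<script>alert(1);</script> - something in between! <script>alert(2);</script>"

# Greedy star: [\s\S]* first runs to the end of the string and then
# backtracks only as far as the LAST </script> it can find.
greedy = re.compile(r"<script[\s\S]*</script>")

greedy_result = greedy.sub("", html)
print(repr(greedy_result))  # '' (the text in between is gone too)
```

The single match spans from the first <script to the last </script>, which is why nothing survives.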
But all of a sudden I found the MAGIC question mark (?), which turns stars into lazy stars. This terminology alone saved my day (no coffee, no drugs, no pills anymore,…), and as a “nice” side effect the regular expression finally worked the way it should. Here is the working pattern with the lazy star:
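A pattern of this kind, rebuilt from the description in this post, could look roughly like the Python sketch below. The sample string and names are assumptions for illustration, and the exact pattern may have differed in details such as delimiters or escaping.

```python
import re

# Hypothetical input: two script blocks with plain text between them.
html = "<script>alert(1);</script> - something in between! <script>alert(2);</script>"

# Lazy star: [\s\S]*? grows one character at a time and stops at the
# FIRST </script> that lets the whole pattern match.
lazy = re.compile(r"<script[\s\S]*?</script>")

lazy_result = lazy.sub("", html)
print(repr(lazy_result))  # ' - something in between! '
```

Each script block is now matched separately, so the text between them is left alone.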
Can you feel the laziness? I did, especially for myself. As an explanation, I refer to an article which describes this behavior best: http://www.regular-expressions.info/examples.html
<TAG\b[^>]*>(.*?)</TAG> matches the opening and closing pair of a specific HTML tag. Anything between the tags is captured into the first backreference. The question mark in the regex makes the star lazy, to make sure it stops before the first closing tag rather than before the last, like a greedy star would do. This regex will not properly match tags nested inside themselves, like in <TAG>one<TAG>two</TAG>one</TAG>.
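The quoted pattern can be tried out directly. The following is a small Python sketch with made-up sample strings; it shows the captured group for flat tags and what goes wrong with the nested case mentioned in the quote.

```python
import re

# The pattern from the quoted article, with TAG used literally.
tag_pattern = re.compile(r"<TAG\b[^>]*>(.*?)</TAG>")

# Flat tags: each pair is matched on its own, contents land in group 1.
flat = '<TAG class="x">one</TAG> and <TAG>two</TAG>'
flat_contents = tag_pattern.findall(flat)
print(flat_contents)  # ['one', 'two']

# Nested tags: the lazy star stops at the first </TAG>, which belongs
# to the inner tag, so the outer pair is cut off too early.
nested = "<TAG>one<TAG>two</TAG>one</TAG>"
nested_content = tag_pattern.search(nested).group(1)
print(nested_content)  # one<TAG>two
```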
I have the feeling that I somehow understand the usage of the question mark here, but I am not 100% confident about it, because the theory says the following about quantification (http://en.wikipedia.org/wiki/Regular_expression):
? The question mark indicates there is 0 or 1 of the previous expression. For example, “colou?r” matches both color and colour.
* The asterisk indicates there are 0, 1 or any number of the previous expression. For example, “go*gle” matches ggle, gogle, google, gooogle, etc.
+ The plus sign indicates that there is at least 1 of the previous expression. For example, “go+gle” matches gogle, google, gooogle, etc. (but not ggle).
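The three quantifiers above can be checked with a few lines of Python. This is just a sketch; the helper name `matching` is made up, and `re.fullmatch` is used so that a word only counts if the whole word fits the pattern.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches completely."""
    return [word for word in candidates if re.fullmatch(pattern, word)]

words = ["ggle", "gogle", "google", "gooogle"]

optional = matching(r"colou?r", ["color", "colour", "colouur"])
star = matching(r"go*gle", words)
plus = matching(r"go+gle", words)

print(optional)  # ['color', 'colour']
print(star)      # ['ggle', 'gogle', 'google', 'gooogle']
print(plus)      # ['gogle', 'google', 'gooogle']
```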
That's kind of strange to me, because it only mentions how often the preceding expression is allowed to occur and not how the matching is done. Maybe someone has a better explanation for me…
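As far as I can tell, the question mark simply has two jobs. After a normal expression it is the “0 or 1” quantifier from the list above. Placed directly after another quantifier (*, + or ?) it does not count anything; it only switches that quantifier from greedy to lazy. A small Python sketch, with sample strings made up for illustration:

```python
import re

text = "aaa"

# ? as a quantifier: zero or one "a", and greedy, so it takes one.
optional = re.match(r"a?", text).group()
# ? after another quantifier: same counts allowed, but as few as possible.
lazy_optional = re.match(r"a??", text).group()

# Greedy star takes as many "a" as it can.
greedy_star = re.match(r"a*", text).group()
# Lazy star takes as few as it can, which here is none at all.
lazy_star = re.match(r"a*?", text).group()
# A lazy star still grows when the rest of the pattern needs it to.
lazy_then_b = re.match(r"a*?b", "aaab").group()

print(repr(optional), repr(lazy_optional))  # 'a' ''
print(repr(greedy_star), repr(lazy_star))   # 'aaa' ''
print(repr(lazy_then_b))                    # 'aaab'
```

So the number of allowed repetitions stays the same; only the order in which the engine tries them changes.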