Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
Posted: Thu May 01, 2008 11:08 pm
[2.23] Initial subregex documentation
I guess it is about time for me to post some details so everyone can imagine ways to break it. I couldn't actually test it in CMud; instead I ran it through my own test app. Zugg also mentioned making some adjustments to the point at which the CMud parser is engaged.
First, I am going to start with the final exam substitution I used.
Text:
aabbcc112233 aaa |345 ccc d87 ghi=jkl xyz|*
Pattern:
(((a|b|c|d)|e|f|g)|h)
SubText:
(?ERROR)(?DEBUG)(?(1:+1)v(?(>=:7)\'2:-3'x\k<3:+4>|(?LIST:2:$)y)(?INSTANCE)|(?MEMBER:3))
Result:
<ERROR><\ERROR><match 1 a>v"a"|"a"|"b"|"b"|"c"|"c"|"a"|"a"|"a"|"c"|"c"|"c"|"d"|"g"|""y1</match><match 2 a>v"a"|"a"|"b"|"b"|"c"|"c"|"a"|"a"|"a"|"c"|"c"|"c"|"d"|"g"|""y2</match><match 3 b>v"a"|"a"|"b"|"b"|"c"|"c"|"a"|"a"|"a"|"c"|"c"|"c"|"d"|"g"|""y3</match><match 4 b>v"a"|"a"|"b"|"b"|"c"|"c"|"a"|"a"|"a"|"c"|"c"|"c"|"d"|"g"|""y4</match><match 5 c>v"a"|"a"|"b"|"b"|"c"|"c"|"a"|"a"|"a"|"c"|"c"|"c"|"d"|"g"|""y5</match><match 6 c>v"a"|"a"|"b"|"b"|"c"|"c"|"a"|"a"|"a"|"c"|"c"|"c"|"d"|"g"|""y6</match><text 6>112233 </text><match 7 a>vbxc7</match><match 8 a>vcxc8</match><match 9 a>vcxd9</match><text 9> |345 </text><match 10 c>vax10</match><match 11 c>vax11</match><match 12 c>vax12</match><text 12> </text><match 13 d>vcx13</match><text 13>87 </text><match 14 g>vcx14</match><match 15 h>1</match><text 15>i=jkl xyz|*</text>
Now the explanation.
(?ERROR) and (?DEBUG) both produce XML tags. The error output is tacked onto the front of the result string, and tries to help with both the pattern and the substitution string. Since any error in your pattern is known at all times, it will be included as soon as (?ERROR) is encountered. Problems in the substitution string, however, are only recorded after (?ERROR) has been encountered.
The next item in the substitution is (?DEBUG), which is what adds all those <text ##> and <match ## text> tags. They let you know what was matched, what was substituted, and what was just copied.
Everything after that is a single conditional: "(?(1:+1)v(?(>=:7)\'2:-3'x\k<3:+4>|(?LIST:2:$)y)(?INSTANCE)|(?MEMBER:3))". I expanded the back reference syntax considerably to allow relative instances. In this particular case the condition is "does \1 get matched on the next match?" The relation is specified with the optional :## part. When just a number is used, like :7, it means exactly the 7th match instance; when a + or - is used, the instance is relative to the current match.
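To make the relation syntax concrete, here is a rough Python sketch of how a relation such as +1, -3 or 7 could be resolved to a match instance. It is not the CMud implementation: the names resolve_instance and group_matched are made up for illustration, and Python's re module stands in for the real regex engine. It reuses the text and pattern from the final exam above.

```python
import re

TEXT = "aabbcc112233 aaa |345 ccc d87 ghi=jkl xyz|*"
PATTERN = re.compile(r"(((a|b|c|d)|e|f|g)|h)")


def resolve_instance(relation, current, total):
    """Map a relation ('7', '+1', '-3') to a 1-based match instance, or None if out of range."""
    if relation[0] in "+-":
        target = current + int(relation)  # relative to the current match
    else:
        target = int(relation)            # exact match instance
    return target if 1 <= target <= total else None


def group_matched(matches, group, relation, current):
    """Emulate a condition like (?(1:+1)...): did `group` take part in the targeted match?"""
    target = resolve_instance(relation, current, len(matches))
    return target is not None and matches[target - 1].group(group) is not None


matches = list(PATTERN.finditer(TEXT))
for current in (14, 15):
    print(current, group_matched(matches, 1, "+1", current))
```

Under these assumptions the condition holds on match 14, because match 15 (the h) exists, and fails on match 15, because there is no 16th match. That agrees with the Result above, where only match 15 falls through to (?MEMBER:3).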
Looking a little deeper into the condition we see a true phrase of "v(?(>=:7)\'2:-3'x\k<3:+4>|(?LIST:2:$)y)(?INSTANCE)". The first item is the straight text "v"; the next item is another conditional. This particular condition is tested based on the match instance. The section ">=:7" means "when the instance is greater than or equal to 7, do the true part". Supported signifiers are > < >= <= = == <> !=. There are also 2 special instance signifiers: ^ meaning start of string, and $ meaning end of string. Plus and minus cannot be used with instance matching.
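The instance test can be pictured as a plain comparison against the current match number. The Python sketch below is only an illustration: instance_condition is a hypothetical name, and it assumes that = and == are synonyms, as are <> and !=. The special ^ and $ signifiers are left out.

```python
import operator

# Comparison signifiers listed above, mapped to Python operators.
# Assumption: '=' and '==' mean the same thing, as do '<>' and '!='.
COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "=": operator.eq,
    "==": operator.eq,
    "<>": operator.ne,
    "!=": operator.ne,
}


def instance_condition(signifier, value, current):
    """Emulate a condition like (?(>=:7)...) for the 1-based match instance `current`."""
    return COMPARATORS[signifier](current, value)


true_for = [i for i in range(1, 16) if instance_condition(">=", 7, i)]
print(true_for)
```

With the 15 matches of the final exam, (>=:7) selects instances 7 through 15, which is why the first six matches take the false part of this inner conditional.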
Since the second condition is false for the first bunch of matches, I will look at its false substitution first: "(?LIST:2:$)y". Here we have the text "y" after the new (?LIST) item. List generates a | separated list for the requested back reference; in this case \2 is what we want a list of. You will notice that each item is enclosed in quotes; this is to allow nesting of lists. You will also notice that the use of (?LIST) here has another section, ":$". This is an optional relation section just like the one in conditionals. In fact, everything that can interact with a captured portion allows a relation or instance to be specified. The use of a $ for this relation means the final match, and that is why the full list is shown during the first 6 matches above.
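As a cross-check of the list shown in the Result, the following Python sketch rebuilds what (?LIST:2:$) produced there. The helper name quoted_list is hypothetical, Python's re module stands in for the real engine, and the sketch assumes a match in which \2 did not take part contributes an empty quoted entry.

```python
import re

TEXT = "aabbcc112233 aaa |345 ccc d87 ghi=jkl xyz|*"
PATTERN = re.compile(r"(((a|b|c|d)|e|f|g)|h)")


def quoted_list(matches, group):
    """Emulate (?LIST:group:$): one quoted entry per match, empty if the group was not matched."""
    return "|".join('"{}"'.format(m.group(group) or "") for m in matches)


matches = list(PATTERN.finditer(TEXT))
print(quoted_list(matches, 2))
```

The trailing "" comes from match 15, the h, which matches \1 but not \2. A later post in this thread changes how null items are output, so this reflects the 2.23 output shown in the Result.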
Now in the true part of the subcondition we have "\'2:-3'x\k<3:+4>". This is 2 different forms of capture back reference replacement, with the text "x" sandwiched in the middle. It is always recommended to use a delimiter with your captures when you want to put text directly next to them, and it is required to use delimiters when you want to use a relation on that reference. The supported delimiters are '' <> {} and may be used with any of the Perl standard references of \ \k \g. Additionally, the reference syntax of (?P=##) is supported and may have a relation section without additional delimiters. All references support both names and numbers.
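The two relative references can be replayed against the final exam data with a few lines of Python. This is a sketch under stated assumptions, not the real substitution code: reference is a made-up helper, and it assumes that a reference to a missing instance or an unmatched capture substitutes nothing, which is what the Result suggests.

```python
import re

TEXT = "aabbcc112233 aaa |345 ccc d87 ghi=jkl xyz|*"
PATTERN = re.compile(r"(((a|b|c|d)|e|f|g)|h)")


def reference(matches, group, offset, current):
    """Text captured by `group` at instance current+offset, or '' if there is no such capture."""
    target = current + offset
    if not 1 <= target <= len(matches):
        return ""
    return matches[target - 1].group(group) or ""


matches = list(PATTERN.finditer(TEXT))
pieces = []
for current in range(7, 15):
    # Emulates \'2:-3'x\k<3:+4> for the current match instance.
    piece = reference(matches, 2, -3, current) + "x" + reference(matches, 3, 4, current)
    pieces.append(piece)
    print(current, piece)
```

Adding the leading "v" and the trailing instance number gives vbxc7, vcxc8, vcxd9, vax10 and so on, the same bodies that appear in matches 7 through 14 of the Result.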
Now that the secondary condition has been covered we can finish the true side of the first condition. The only thing left there is (?INSTANCE), which lets you put the match instance into your substituted string.
On the false side we have "(?MEMBER:3)". This means: for reference 3, output its matched list position in the pattern. In the above example it is only used on match 15, when the original condition did not get a 16th match for \1. This also supports the full relation syntax. Looking at it right now I think I might have a bug, since this should be 0 to indicate that \3 was not matched during the 15th match, but a 1 was substituted. I will have to hunt for that over the weekend.
The final note is on the use of \. It can be used to quote out a portion that would otherwise be a valid substitution command. This is done by putting a backslash before the opening ( or the \ that leads into the command. If the command isn't valid then the extra backslash will just be treated like more text. In other words, don't use them unless you need them. They are also supported within the conditionals to quote the | and a closing ), which allows you to put a text version of either of those items in. If you need to put a backslash into your text just put it there, and it will work by itself unless what immediately follows has a special meaning. You can quote a backslash with another backslash for those times when you need it. For example, with a pattern that has 1 capture, "\1" is a back reference, "\\1" is the text "\1", and "\\\1" is the text "\" followed by a reference "\1". If the pattern had no captures then \1 would not be a valid reference and everything would be text with no change in the backslash replacement.
_________________ The only good questions are the ones we have never answered before.
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
Posted: Fri May 02, 2008 12:50 am
I decided it would be better to split this off into its own topic. I will be devoting a little bit of time to making some small final changes and corrections in the code for this during the weekend. As I mentioned above it looks like there is a bug with ?MEMBER that snuck past me, so I have at least 1 thing to fix.
Please do me a favor and post any bugs with the subregex code in this topic.
If anyone has any bright ideas on how to present all of the capabilities in a help document, post that too. What I have above really is more of a guide for testing it than a proper help document.
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
Posted: Sun May 04, 2008 7:19 pm
I made a few adjustments to this today that will be available in 2.24:
Corrected (?LIST:nn) to output the current list for reference nn, instead of a null.
Corrected (?LIST:nn:i) to output the list at instance i for reference nn each time; this was giving strange results.
Changed the output of null items in ?LIST from "" to a null, because CMud has some trouble with "" in some places.
I still have to look into what I noticed with ?MEMBER, and am still looking for ideas on how to better document all of this. I really need to be able to create a proper supporting help document for the public version.
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
Posted: Mon May 05, 2008 10:18 pm
I finished working through the bug I spotted with ?MEMBER this morning, and periodically thought about what was missing throughout the day.
Yesterday when I did the pouch capture script for oldguy2 I really wanted to be able to format the output so that the herb names and counts were interleaved in a neat list for easy summation. Instead I put in a list of each capture with 2 (?LIST) entries, then coordinated between the 2 in further script. It is a lot more elegant than it used to be, but can it be better still?
I think it can, so I added 1 more syntax to the conditional group. This new addition does its contents once for each match that occurred. For example:
#SHOW %subregex("the quick brown fox jumped over the lazy dog","(?<=(.))o","-(?(*)\1)-")
Displays
the quick br-rf d-wn f-rf d-x jumped -rf d-ver the lazy d-rf d-g
While this
#SHOW %subregex("the quick brown fox jumped over the lazy dog","(?<=(.))o","(?(^)-(?(*)\1)-)")
Displays
-rf d-the quick brwn fx jumped ver the lazy dg
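Both outputs can be reproduced with a small Python sketch, which may help when reasoning about what (?(*)...) expands to. This is only an emulation with Python's re module, not the %subregex code. The handling of (?(^)...) is inferred purely from the output shown: the block appears once at the start of the string and each individual match is replaced with nothing.

```python
import re

TEXT = "the quick brown fox jumped over the lazy dog"
PATTERN = re.compile(r"(?<=(.))o")

matches = list(PATTERN.finditer(TEXT))

# (?(*)\1): the \1 capture of every match that occurred, concatenated.
every_capture = "".join(m.group(1) for m in matches)

# First example: "-(?(*)\1)-" replaces each matched "o".
first = PATTERN.sub(lambda m: "-" + every_capture + "-", TEXT)

# Second example: "(?(^)-(?(*)\1)-)" is emitted once at the start of the
# string, and each matched "o" is replaced with nothing (inferred behaviour).
second = "-" + every_capture + "-" + PATTERN.sub("", TEXT)

print(first)
print(second)
```

The four captures are "r", "f", a space and "d" (the characters before each "o"), which is where the "rf d" in both displayed results comes from.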
Also, while testing this I found out that references such as \1 were not properly using the instance they were told to use. That is now fixed as well.