Zugg Software :: View topic

SubAdmin Joined: 18 Nov 2001 Posts: 5182

After tweaking around with some time tests I realized I really have to rewrite how the backreferences are handled to smooth them out. Right now they can cause a rather significant slowdown. I doubt I would get a rewrite of that handling to optimize it done before Zugg was ready to release a new version, but I will work at it for another version.

I have also been kicking around the idea of putting something like the pattern conditional into the substitution string. For those that don't know, (?(1)yes|no) when used in a pattern matches 'yes' if backreference 1 was matched, and 'no' otherwise. Using it in a substitution would mean it inserts those words under the same condition.
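For readers who want to try the pattern form, here is a small sketch in Python, whose re module accepts the same (?(1)yes|no) conditional. The pattern, the helper name matches, and the sample strings are made up for illustration; this is not CMUD code.

```python
import re

# Group 1 is an optional "a". The conditional then demands "yes" if
# group 1 took part in the match, and "no" if it did not.
pattern = re.compile(r"(a)?(?(1)yes|no)")

def matches(text):
    """Return True when the whole text matches the conditional pattern."""
    return pattern.fullmatch(text) is not None

with_group = matches("ayes")    # group 1 matched, so "yes" is required
without_group = matches("no")   # group 1 unset, so "no" is required
mismatch = matches("ano")       # group 1 matched but "no" follows
```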

The reason I started thinking about that addition is that there are far too many times when I am matching against a list, and then end up having to do %ismember on the matched part. This is essentially double matching. If I were instead able to do something like "(?:()item1|()item2|()item3)" for the pattern and then "(?(1)1)(?(2)2)(?(3)3)" for the substitution, then it's all done. Obviously the conditional syntax doesn't quite lend itself to being an ismember, but since it is also an if, I think it might be the better way to add it. The benefit I see is that it should also eliminate some of the really messy situations with having to nest quotes with %char(34) or other syntaxes.
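A rough sketch of how the proposed substitution conditional could behave, emulated in Python rather than zScript. The helper names expand and sub_with_conditionals are hypothetical, the sketch handles only non-nested conditionals, and it says nothing about how %subregex would actually implement the idea.

```python
import re

# Matches a non-nested "(?(n)yes)" or "(?(n)yes|no)" inside a template.
_COND = re.compile(r"\(\?\((\d+)\)([^|()]*)(?:\|([^()]*))?\)")

def expand(template, match):
    """Resolve each conditional in the template against one match."""
    def pick(cond):
        group_no = int(cond.group(1))
        yes_text = cond.group(2)
        no_text = cond.group(3) or ""
        # A group that did not take part in the match is None in Python.
        return yes_text if match.group(group_no) is not None else no_text
    return _COND.sub(pick, template)

def sub_with_conditionals(pattern, template, text):
    """Replace every match of pattern with the expanded template."""
    return re.sub(pattern, lambda m: expand(template, m), text)

# The empty groups act as flags that record which alternative matched.
result = sub_with_conditionals(
    r"(?:()item1|()item2|()item3)",
    "(?(1)1)(?(2)2)(?(3)3)",
    "item2 item3 item1",
)
```

With the pattern and substitution from the post, each list item is replaced by its position in the alternation in a single pass, with no second %ismember-style lookup.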

I am wondering what people think about the addition before I start really coding on any of it. I will also have to double check that I am properly notified when a capture is skipped/bypassed in order to make that addition. I really have to check on that anyway, since I didn't put any handling for it in originally, and it might cause a problem depending on how and whether I am notified. I will also look into whether there is some way I can get a list capture to tell me which of its items was used, but I won't try too hard for that if it is not obvious.

Posted: Mon Mar 24, 2008 11:09 pm

I try to avoid backreferences on the principle that they slow things down. I've never had an issue doing the remaining checking in the trigger itself.

Posted: Mon Mar 24, 2008 11:25 pm

Vijilante is talking about "backreferences" within the substitution string, and since the whole purpose of %subregex is to replace a regular expression with other text, using the originally matched text is going to be *very* common. What ReedN is saying is like someone saying that they don't use %1..%99 in their trigger scripts. So I think you are missing the point of Vijilante's post. This has nothing to do with triggers. The %subregex function is like the "preg_replace" function in PHP and handling backreferences in the substitution string is very important.
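To illustrate the kind of substitution-string backreference being discussed, here is the equivalent in Python's re.sub, which works along the same lines as preg_replace. The sample text and pattern are invented for the example.

```python
import re

text = "You get 12 gold from the corpse."

# \1 and \2 in the replacement refer to the first and second captures
# of the pattern, so the matched number and word swap places.
result = re.sub(r"(\d+) (\w+)", r"\2: \1", text)
```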

Vijilante: I'll be happy to make any mods that you come up with. Right now I think it works fine for most people and is still a big improvement over the old version, so I wouldn't stress over it too much.

And I'd worry about the normal backreference stuff before worrying about pattern conditionals. I've never seen pattern conditionals before...does this syntax exist in any other programs? I'd like to keep %subregex working like normal PCRE or preg_replace as much as possible.

Posted: Mon Mar 24, 2008 11:38 pm

Ah yes, I misread Vijilante's post, I was thinking of triggers.

SubAdmin Joined: 18 Nov 2001 Posts: 5182

The conditional syntax is the same as is used in a pattern. A common usage of it is to put the capture you will condition on within a look ahead or behind. This causes the capture to not actually use any of the string but get set when the look is valid. Then the pattern mutates based on the condition.

Another common usage is the nested parentheses example that is in nearly every regex book at this point.
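Both usages can be sketched in Python, which supports the same conditional syntax in patterns. The two patterns and the helper full are illustrative assumptions, not taken from CMUD or from any particular book.

```python
import re

# Capture inside a lookahead: group 1 is set when the next character is
# a digit, but the lookahead itself consumes nothing. The empty
# alternative lets the match continue with group 1 unset.
lookahead_cond = re.compile(r"(?:(?=(\d))|)(?(1)\d+|[a-z]+)")

# Optional parentheses: a closing ")" is required only if "(" matched.
paren_cond = re.compile(r"(\()?[^()]+(?(1)\))")

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None

digits_ok = full(lookahead_cond, "123")      # digit branch chosen
letters_ok = full(lookahead_cond, "abc")     # letter branch chosen
wrapped_ok = full(paren_cond, "(abc)")       # both parentheses present
unbalanced = full(paren_cond, "(abc")        # opening without closing
```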

Adept Joined: 21 Sep 2005 Posts: 250 Location: Austin, TX

While you're working on the backreferences, please take a look at the %pat() function, as well - I imagine that they are very closely related. The following code used to work correctly in 2.18:

SubAdmin Joined: 18 Nov 2001 Posts: 5182

The use of %pat was completely removed. When I wrote the subregex routine I had no way to know how to enable it, and so I wrote a rather simple backreference replacement in. Zugg decided he liked eliminating the confusion with %pat and kept what I wrote. The %pat confusion is that using the %subregex from a trigger you will likely already have values associated with the %pat function, and you may even be mixing some of those references in with the captures from the subregex.

I wrote a reasonably lengthy comment into the help that explained the new usage. Your example properly changed for 2.20 is

Posted: Tue Mar 25, 2008 12:19 pm

This is weird as a bag of snakes; I replied to a bunch of threads earlier (or I thought I did), including this one, yet my reply's now gone. I must've made the whole thing up =/

The simple answer, as Viji says, is just to replace %pat(1) with \1.

Adept Joined: 21 Sep 2005 Posts: 250 Location: Austin, TX

Aha....that'll teach me to use the help file from inside CMUD rather than coming to the website. Makes sense now.

Wizard Joined: 25 Mar 2003 Posts: 1113 Location: USA

The help files and web site draw from the same source, except that the web site shows these additional comments. I'm sure it'll get rolled into the help entry soon, especially after this thread, eh?

Posted: Tue Mar 25, 2008 6:13 pm

I'm surprised Viji didn't just add it to the article, actually. I'm loath to do it myself because I haven't looked into \k and \K thoroughly.

SubAdmin Joined: 18 Nov 2001 Posts: 5182

I didn't add it to the help because 2.18 is the public version and that is what the help has to reflect. I put the full explanation there as a comment to document the changes for the beta version, figuring that beta testers frequent the forums and would notice there was an update.

In any case since no one seems to have any opinion about the idea of adding the conditional I will let it stew some more as I look into how hard it would be to produce a number return from a capture pattern like (abc|def|ghi). That would be the equivalent of doing %ismember at the same time which was really what got me thinking about it in the first place.
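As a sketch of what a "number return" from a list pattern could look like, here is a Python emulation that gives each alternative its own capture group and reports which one matched. The function name member_index is hypothetical, and returning 0 for no match is my assumption about how an %ismember-style result would be reported.

```python
import re

def member_index(items, text):
    """Return the 1-based position of the first list item found in text,
    or 0 when none of the items occur."""
    # One capture group per item, so the group number is the list index.
    pattern = "|".join("(" + re.escape(item) + ")" for item in items)
    match = re.search(pattern, text)
    # lastindex is the number of the last group that took part in the match.
    return match.lastindex if match else 0

position = member_index(["abc", "def", "ghi"], "xx def yy")
```

The match and the membership lookup happen in the same pass, which is the double matching the post wants to avoid.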

Adept Joined: 21 Sep 2005 Posts: 250 Location: Austin, TX

SubAdmin Joined: 18 Nov 2001 Posts: 5182

Ok. That is 1 vote for the conditionals. I have checked on a few of the things needed for that and they should be pretty easy until I try to handle nesting them as is in your example, but I already have thoughts on how to handle that.

Also for your example the way subregex works and the addition of the conditional would result in "@myFunc(start)@myFunc(match, abc )def@myFunc(match, ghi )jkl@myFunc(match, mno )pqr@myFunc(end)" being passed to the zScript parser for evaluation and return. So it should actually work, but part of the idea was to make it use less script and be easier to use. For your example the final return would be "def jkl pqrabc|ghi|mno"; your $op=end #RETURN does not provide a separator.

On point A, yep. I would have to invent some syntax. I would probably want to keep it something like the regex pattern syntax, but different enough that it is not likely to ever clash with something that might be added to Perl or another language's regex system. Probably something like (?DEBUG=separator text) and (?LIST=name/number|separator text). If I can do it these would be for each instance substituted, (?MEMBER=name/number) and (?MATCHED=[i]name/number|relative/absolute instance)

On point B, no it isn't clear how to express it. I would probably just make it the end of the line all the time, and then document it. I don't really like doing something like that, but I definitely see a few uses for such a return value in my scripts. For example I have taken to using %subregex to remove all unwanted items from a list. Sometimes they are garbage, other times I just don't want to look at them in the following #FORALL. If I could split the list into 2 parts with 1 %subregex it would definitely shorten some of my scripts. I also want to try and keep that substitution string short, easy to read, and easy to debug.
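To make the "split the list into 2 parts" idea concrete, here is a hedged Python sketch that removes unwanted items from a |-separated list and collects them in one pass. The function split_list, the sample list, and the choice to return two strings are all assumptions for illustration; the post does not settle on how %subregex would return the second part.

```python
import re

def split_list(items, unwanted_pattern):
    """Split a |-separated list into (kept, removed) with one regex pass."""
    removed = []

    def collect(match):
        # Remember the unwanted item and delete it from the list.
        removed.append(match.group(1))
        return ""

    # Match a whole item, together with its leading separator if any.
    item_re = r"(?:^|\|)(" + unwanted_pattern + r")(?=\||$)"
    kept = re.sub(item_re, collect, items)
    # If the first item was removed, a leading separator is left behind.
    return kept.lstrip("|"), "|".join(removed)

kept, removed = split_list("sword|junk1|shield|junk2", r"junk\d")
```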

You know the funniest thing is I used to hate %subregex. Every time I tried to use it I would lock up zMud, and then I didn't have much better luck with CMud. Not that long ago, around 2.14, it became my best friend for many parsing things.

I am still just thinking about many of the things, but since Zugg is giving me a bit of trust to let me write code that may be added to CMud I want to make sure it is done right.

SubAdmin Joined: 18 Nov 2001 Posts: 5182

I thought I would give those interested an update. I obviously didn't get anything done in time for 2.22.

I finished all the design work for the data structure to hold the parsed substitution string. I think JQuilici was right when he said, "I think this way lies madness." In any case the structure I designed should be rather efficient on both memory and speed, and will handle a bit more than I described earlier.

I am currently working on the parsing required to handle it. I have the basic code structure for it done, and nearly all of the simple backreference support is written. Right now I am aiming at getting the nasty part, nested stuff inside a conditional, done. For those that want a peek at the thinking behind it, this is most of the regex that will be responsible for parsing the substitution string.

Posted: Sat Apr 05, 2008 3:10 am

Well you've got a month so that should be plenty of time.

SubAdmin Joined: 18 Nov 2001 Posts: 5182

I haven't given up on this. I am still plugging away at it. I really have to laugh at Delphi sometimes. The current one I find totally funny is these few lines of code:

SubAdmin Joined: 18 Nov 2001 Posts: 5182

I am now into the final phases of writing this and need some input on a few items.

First is the new (?ERROR) function. By putting this into your substitution string it will give error information. It starts with any pattern compile error there might be, then as parsing of your substitution string continues it will report possible errors there. How to return it to you is the real question. I was planning on tacking it on at the front of the returned string, but then I got this bright idea.

What if I tacked it on such that the CMud parser would see it as a function reference? By that I mean doing something like

Posted: Sun Apr 20, 2008 2:35 pm

Seems to me like raising an event would be a much better course than using a function, if you decided to use that route.

However, it'll be obvious that there's an error in the regex when it starts substituting incorrectly. If you're manipulating a string in that way, you're going to be displaying it somewhere eventually. You'll notice that an incorrect string's being displayed and be able to backtrack to the problem and fix it. The only time this'd really be useful is if your %subregex was taking a dynamic regex rather than a fixed one, but I can't think of (m)any tasks that'd require that. Seems very niche.

EDIT: I did just think of one time you might use subregex where an error checker like this might be useful - when building a list to be passed to #forall. I guess it does have its uses.

Similarly with the debug option - I can't see it being all that useful. The only time you'd need it is if you built your regex wrong to capture what you wanted it to (perhaps using a greedy quantifier when you shouldn't have). Regex novices probably won't understand the debug output or how to apply that to their regex, and advanced users probably don't need the crutch.

Adept Joined: 21 Sep 2005 Posts: 250 Location: Austin, TX

Rather than stuffing these strings into the substitution, why not return them through additional arguments to the %subregex function? By that I mean, allow a call like:

SubAdmin Joined: 18 Nov 2001 Posts: 5182

I have more or less reached a decision with this, and the reason for it was pretty simple