|
Dharkael Enchanter
Joined: 05 Mar 2003 Posts: 593 Location: Canada
|
Posted: Wed May 07, 2008 4:44 am
IDEA:Optimized Stringlists and Database Variables in Triggers |
Inspired by this thread
I was thinking you could make an option to optimize stringlist and database in triggers by to reduce the number of backtrackings necessary to match and more importantly the number of backtracking necessary to fail.
Using a Patricia trie like the one shown here
Using that example you could take a string list containing :
romane
romanus
romulus
rubens
ruber
rubicon
rubicundus
Which normally would would make a regex like
Code: |
(?:romane|romanus|romulus|rubens|ruber|rubicon|rubicundus) |
you could transform it into something like
Code: |
r(?>om(?>ulus|an(?>e|us))|ub(?>e(?>r|ns)|ic(?>on|undus)))
|
This transformation would lower the average time to match and even more so the time to fail.
On larger stringlist and DB variables the speed gains would, I think, be tremendous. |
|
_________________ -Dharkael-
"No matter how subtle the wizard, a knife between the shoulder blades will seriously cramp his style." |
|
|
|
shalimar GURU
Joined: 04 Aug 2002 Posts: 4691 Location: Pensacola, FL, USA
|
Posted: Wed May 07, 2008 5:19 am |
I can only imagine how ugly it would be for my thousand+ strong list..
Would it not be timely to form the regex you speak of from the list?
Especially when the list continues to expand? |
|
_________________ Discord: Shalimarwildcat |
|
|
|
Zugg MASTER
Joined: 25 Sep 2000 Posts: 23379 Location: Colorado, USA
|
Posted: Wed May 07, 2008 6:38 am |
Well, I wouldn't be surprised if the PCRE library is already doing that kind of optimization itself, so I don't think CMUD needs to do it too. But Vijilante might have more insight into this. Remember that PCRE has a "compile" step that converts the initial regular expression into a compiled pattern. I think this is the point where PCRE is building this kind of search tree internally. CMUD keeps the trigger compiled until you change it, so this is basically already being done.
|
|
|
|
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
|
Posted: Wed May 07, 2008 10:53 am |
The algorithm used for regex matching actually specifies a list syntax as an if..else chain. It is sometimes important to use the order of the items of the list to control matching, and the PCRE respects that.
With a short set of words like you have shown the gains would be small, but they would be there. With longer phrases the gains would be more noticeable, but with longer phrases you can often find common portions to group together and making some portions optional achieves nearly the same effect. Those small gains from controlling backtracing and the match path would probably be matched by a loss of speed from the extra layers of nesting. Making the net gain/loss near 0. |
|
_________________ The only good questions are the ones we have never answered before.
Search the Forums |
|
|
|
Dharkael Enchanter
Joined: 05 Mar 2003 Posts: 593 Location: Canada
|
Posted: Fri May 09, 2008 3:45 am |
shalimar wrote: |
I can only imagine how ugly it would be for my thousand+ strong list.. |
It's not very pretty, you wouldn't want to write it by hand.
shalimar wrote: |
Would it not be timely to form the regex you speak of from the list?
Especially when the list continues to expand? |
Yeah it would definitely take more time to reformulate the regex, but with the right data structure, ie a Patricia Trie it doesn't take that long.
And unless your list is being changed every few lines or so you're still definitely gonna reap gains from it.
Vijilante wrote: |
Those small gains from controlling backtracing and the match path would probably be matched by a loss of speed from the extra layers of nesting. Making the net gain/loss near 0. |
That's definitely not the case, especially for larger lists.
Using Shalimar's data I filled 2 variables with both a standard alternation regex like the one CMUD currently uses on lists and one using the technique i described above.
Using the %regex function to test it out, The alternation regex's elapsed time was always a multiple of the Trie regex and the longer the string it was tested on the greater the magnitude
I wasn't really sure how to test the triggers so I made REGEX triggers with the same two variables and used CTRL-Q to test them while toggling between the enabled status of the triggers.
and using #CLR to reset each time. On my system the Trie regex returned approx 7.2 and Alternation regex returned 10.2
No doubt you guys have much better ways to test it.
But definitely it doesn't cancel out.
There is a difference in size as well.
The Alternation regex is 19289 characters
The Trie Regex is 16268
I have the code and can post it if you wan't but its pretty large because of the regex variables.
EDIT:I threw both patterns into RegexBuddy and to match large McSweeney bodyguard which i guess alphabetically is about halfway.
It took 671 steps to match the alternation pattern and only 55 to match the Trie pattern.
I really can't see how this could be a bad thing.
And in the case where there are no common prefixes whatsoever (the longer the list the more unlikely that is) the pattern it makes is identical to the naked alternation pattern.
Also I don't think the trie pattern needs to be sorted so that should save some time(although I think sorting might impove its speed, I'll have to consider and do some testing) |
|
_________________ -Dharkael-
"No matter how subtle the wizard, a knife between the shoulder blades will seriously cramp his style." |
|
|
|
Seb Wizard
Joined: 14 Aug 2004 Posts: 1269
|
Posted: Wed Jul 23, 2008 11:35 pm |
So, the PCRE hasn't compiled an optimal version of
Code: |
(?:romane|romanus|romulus|rubens|ruber|rubicon|rubicundus) |
... |
|
|
|
|
|