|
Fang Xianfu GURU
Joined: 26 Jan 2004 Posts: 5155 Location: United Kingdom
|
Posted: Tue Jun 10, 2008 11:23 pm
So I'm looking for a regex |
that'll capture a variable number of words. There're a couple of options for this, the most obvious being ([\w ]+) but if the word is followed by a space (or more than one space), they'll all be captured as well. Something like ((?:\w+ )+) seems like it'll work better, but it'll still capture one too many spaces at the end. So, what regex will capture a string that both begins and ends with a word character? Here's some sample text for you to experiment with:
Code: |
[1533] bayberry bark [2000] bellwort flower [1244] black cohosh
[1998] bloodroot leaf [ 8] blue ink [ 9] cloth
[ 1] crystal pentagon [1999] echinacea [2000] ginger root
[1490] ginseng root [1998] goldenseal root [ 1] green ink
[1995] hawthorn berry [ 505] irid moss [ 407] kuzu root
[2000] lady's slipper [ 267] lobelia seed [1994] myrrh gum
[1909] prickly ash bark [ 7] red ink [ 1] rope
[ 64] skullcap [ 552] valerian [ 19] venom sac
[ 3] yellow ink |
My current train of thought involves \b, but I haven't actually tried anything with it yet. Answers on a postcard, please. |
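A quick way to see both problems outside the MUD client is to run the two candidate patterns through Python's `re` module. This is only a sketch: the sample string and its run of seven padding spaces are assumptions, since the real column widths aren't known here.

```python
import re

# Assumed sample: an item name, some column padding, then the next item.
text = "bayberry bark" + " " * 7 + "[2000] bellwort flower"

# Character class of word chars and spaces: swallows all of the padding.
greedy = re.search(r"([\w ]+)", text).group(1)

# Repeated "word followed by one space": keeps exactly one trailing space.
grouped = re.search(r"((?:\w+ )+)", text).group(1)

print(repr(greedy))
print(repr(grouped))
```

Neither capture ends on a word character, which is the gap the question is about.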
|
|
|
Brenex Beginner
Joined: 13 May 2008 Posts: 25
|
Posted: Wed Jun 11, 2008 12:05 am |
My Lusternia one:
\[\s*(\d+)\]\s(\w+(?:\s+\w+)*)\s*
With Trigger Multiple Times in line selected it works perfectly fine in RegexBuddy without capturing the spaces. |
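As a rough check outside RegexBuddy, the same idea can be sketched with Python's `re.findall`, which here stands in for the repeat-within-line option. The literal brackets are escaped, and the spacing of the sample line is an assumption.

```python
import re

# The pattern above, with the literal brackets escaped.
ITEM = re.compile(r"\[\s*(\d+)\]\s(\w+(?:\s+\w+)*)\s*")

# Assumed spacing; the real output is column-aligned.
line = "  [1533] bayberry bark       [2000] bellwort flower    [1244] black cohosh"

# findall returns one (count, name) tuple per item found on the line.
items = ITEM.findall(line)
print(items)
```

Each extra word must be preceded by whitespace and followed by nothing the group can't match, so the name always ends on a word character.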
|
|
|
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
|
Posted: Wed Jun 11, 2008 4:27 am |
First, don't bother trying to match it. The only things that are truly definite are that there will be at least one item per line, that the opening bracket will occur 2 spaces after the start of the line, and that the closing bracket will be 4 characters later. Use encapsulating triggers, then parse the recorded lines with %subregex.
|
|
_________________ The only good questions are the ones we have never answered before.
Search the Forums |
|
|
|
chamenas Wizard
Joined: 26 Mar 2008 Posts: 1547
|
Posted: Wed Jun 11, 2008 12:58 pm |
Vijilante wrote: |
First, don't bother trying to match it. The only things that are truly definite are that there will be at least one item per line, that the opening bracket will occur 2 spaces after the start of the line, and that the closing bracket will be 4 characters later. Use encapsulating triggers, then parse the recorded lines with %subregex. |
Could you explain why? Good learning experience! |
|
|
|
Caled Sorcerer
Joined: 21 Oct 2000 Posts: 821 Location: Australia
|
Posted: Wed Jun 11, 2008 2:07 pm |
chamenas wrote: |
Vijilante wrote: |
First, don't bother trying to match it. The only things that are truly definite are that there will be at least one item per line, that the opening bracket will occur 2 spaces after the start of the line, and that the closing bracket will be 4 characters later. Use encapsulating triggers, then parse the recorded lines with %subregex. |
Could you explain why? Good learning experience! |
"The simplest answer is often the best."
Mind you, I do it the way Brenex suggested, but basically, there are certain problems with capturing rift data (the items span multiple lines), and the way around it is to simply sidestep the problem. You either do this by firing the trigger multiple times per line, or you treat the entire block as a single line. Either or. |
|
_________________ Athlon 64 3200+
Win XP Pro x64 |
|
|
|
Dharkael Enchanter
Joined: 05 Mar 2003 Posts: 593 Location: Canada
|
Posted: Wed Jun 11, 2008 2:13 pm |
I would use something like this.
Code: |
^\s\s(?:\[((?>\s|\d(?!\s)){3,3}\d)\]\s\b([\w\s'\-]+)\b\s*)(?:\[((?>\s|\d(?!\s)){3,3}\d)\]\s\b([\w\s'\-]+)\b\s*)?(?:\[((?>\s|\d(?!\s)){3,3}\d)\]\s\b([\w\s'\-]+)\b\s*)?$ |
The heart of which is simply
Code: |
(?:\[((?>\s|\d(?!\s)){3,3}\d)\]\s\b([\w\s'\-]+)\b\s*) |
It matches between 1 and 3 groups (inclusive); the first must be at the start of the line, preceded by 2 spaces.
It fails pretty quickly on non-matching lines and removes the need for reparsing. |
|
_________________ -Dharkael-
"No matter how subtle the wizard, a knife between the shoulder blades will seriously cramp his style." |
|
|
|
Brenex Beginner
Joined: 13 May 2008 Posts: 25
|
Posted: Wed Jun 11, 2008 3:39 pm |
Dharkael, I'm using your trigger now. Just a heads-up: it captures the padding whitespace in the numbers. Thanks for making me go look up Atomic Grouping and Look(?:aheads|behinds). I think I might read over regex again to learn some more new tricks.
|
|
|
|
Fang Xianfu GURU
Joined: 26 Jan 2004 Posts: 5155 Location: United Kingdom
|
Posted: Wed Jun 11, 2008 3:58 pm |
I really didn't want to use a function call for this if I didn't need to (yes, it's much simpler, but it's also probably much slower). However, Dharkael's trigger is an order of magnitude more complex than I asked for.
The main problem I originally wanted solving was avoiding capturing spaces in the words, and Dharkael's trigger should do that. My understanding is that his [\w\s'\-]+ will capture all those spaces, sure, until it gets to the opening [ of the next item (or the end of the line), and then realise that the gap between its final character (a space) and the next character ([ or $) isn't valid for \b, since both characters are non-word characters. So it backtracks to the last word-boundary, which was between the end of the final word and the beginning of the spaces, where \b matches successfully and \s* captures the rest of the spaces.
This does exactly what I wanted it to, and hoorays to it for that - but I wonder if there's a way to do it while avoiding the backtracking? |
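The backtracking described above can be reproduced outside CMUD with Python's `re` module. This is a sketch under assumptions: the count part is simplified to `\[\s*(\d+)\]` (the atomic group is left out), and the spacing of the sample line is made up.

```python
import re

# Greedy class, then a word boundary, then the padding.
# The count part is simplified for this sketch.
ITEM = re.compile(r"\[\s*(\d+)\]\s\b([\w\s'\-]+)\b\s*")

# Assumed spacing; the real output is column-aligned.
line = "  [1995] hawthorn berry     [ 505] irid moss          [ 407] kuzu root"

# The class first swallows the padding, then gives it back one character
# at a time until \b can match right after the last word character.
items = ITEM.findall(line)
print(items)
```

The captured names carry no trailing spaces, but only because the engine backs up over the padding, which is the cost being asked about.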
|
|
|
oldguy2 Wizard
Joined: 17 Jun 2006 Posts: 1201
|
Posted: Wed Jun 11, 2008 9:05 pm |
The one Brenex had wasn't bad but it won't work on things like "lady's slipper".
This one works fine and doesn't capture trailing spaces.
Code: |
<trigger priority="10" repeat="true" regex="true" id="1">
<pattern>\[\s*(\d+)\]\s(\w+'?(?>\s?\w+?)+)</pattern>
<value>#addkey Rift %2 %1</value>
</trigger> |
|
|
|
|
Vijilante SubAdmin
Joined: 18 Nov 2001 Posts: 5182
|
Posted: Wed Jun 11, 2008 9:20 pm |
Off the top of my head.
Code: |
#CLASS RiftCapture
#VAR Rift {} {}
#TRIG RiftCap {^Glancing into the rift you see:} {Rift=""}
#COND {} {#IF (%match(%line,"%dh")) {
Rift=%subregex(@Rift,"\s*\[\s*(\d+)\] ([\w']+ ??)+\s*(?=\[)","\'2'=\'1'|")
#CALL %vartype(Rift,5) //Not sure about the number for record var
#STATE RiftCap 0
} {
Rift=%concat(@Rift," ",%line)
}} {prompt|looplines|param=30|stop}
#CLASS 0 |
I am pretty sure you will find it is faster to do it this way. Adjusting the priorities and using an #ONINPUT for the state 0 trigger can make it even faster. |
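The record-then-parse idea can be sketched in Python for anyone who wants to try it away from the client. Everything named here is hypothetical: `parse_rift`, the `ITEM` pattern (which is not the one in the trigger above), and the sample lines with their spacing.

```python
import re

# Hypothetical item pattern for this sketch; words are joined by single spaces.
ITEM = re.compile(r"\[\s*(\d+)\]\s+([\w'\-]+(?: [\w'\-]+)*)")

def parse_rift(lines):
    """Join the recorded lines into one block and map item name -> count."""
    block = " ".join(lines)
    return {name: int(count) for count, name in ITEM.findall(block)}

# Assumed recorded output between the header line and the prompt.
recorded = [
    "  [1533] bayberry bark       [2000] bellwort flower",
    "  [   3] yellow ink",
]
rift = parse_rift(recorded)
print(rift)
```

Treating the whole block as one string means the number of items per line stops mattering.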
|
|
|
|
Fang Xianfu GURU
Joined: 26 Jan 2004 Posts: 5155 Location: United Kingdom
|
Posted: Wed Jun 11, 2008 11:45 pm |
I didn't really want to argue the relative merits of either method, since both are based on the same idea (a regex that doesn't capture the extra spaces) and beyond that, finding out which I prefer is a simple case of doing them both and seeing. But thanks for your regex suggestion, anyway. The ([\w']+ ??) is nice to look at.
And thanks for yours, Oldguy. It's pretty trivial to change Brenex' \w+ to [\w']+ and if you wanted to use this principle for other strings, not just the rift (the rift only came to mind because there was another thread about it) you'd need to do that anyway. Even if I don't end up using it exactly as it is there (not sure I prefer using a single optional apostrophe and optional spaces/word characters) it's definitely given me something to think about, and thanks for that. |
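The \w+ to [\w']+ swap is easy to compare side by side in Python's `re` module. This is a sketch: the names `BRENEX` and `WITH_APOSTROPHE` and the spacing of the sample line are assumptions made for illustration.

```python
import re

# Original word pattern, and the same pattern with the apostrophe allowed.
BRENEX = re.compile(r"\[\s*(\d+)\]\s(\w+(?:\s+\w+)*)")
WITH_APOSTROPHE = re.compile(r"\[\s*(\d+)\]\s([\w']+(?:\s+[\w']+)*)")

# Assumed spacing; the real output is column-aligned.
line = "  [2000] lady's slipper     [ 267] lobelia seed"

plain = BRENEX.findall(line)           # stops at the apostrophe
fixed = WITH_APOSTROPHE.findall(line)  # keeps the whole name

print(plain)
print(fixed)
```

With \w+ alone the first name is cut off at "lady"; with [\w']+ both names come through whole and still without trailing spaces.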
|
|
|
alluran Adept
Joined: 14 Sep 2005 Posts: 223 Location: Sydney, Australia
|
Posted: Thu Jun 12, 2008 3:07 pm |
I'd try:
Code: |
^(?:\[\s{0,3}(\d+)\] (.*?)\s+)(?:\[\s{0,3}(\d+)\] (.*?)\s+)?(?:\[\s{0,3}(\d+)\] (.*?)\s+)?$
|
I don't see why people were trying all these super complex patterns...
May need a \s+ after the leading ^, not sure if the whitespace at start was in the output or not |
|
_________________ The Drake Forestseer |
|
|
|
Dharkael Enchanter
Joined: 05 Mar 2003 Posts: 593 Location: Canada
|
Posted: Thu Jun 12, 2008 3:25 pm |
Very simple... but it doesn't work.
|
|
|
|
|
alluran Adept
Joined: 14 Sep 2005 Posts: 223 Location: Sydney, Australia
|
Posted: Thu Jun 12, 2008 3:47 pm |
fires for me in 2.26
|
|
|
|
|
Brenex Beginner
Joined: 13 May 2008 Posts: 25
|
Posted: Thu Jun 12, 2008 4:06 pm |
It doesn't work because it doesn't take into account the two spaces the lines begin with. Your pattern starts with ^(?:\[ and it should start with ^\s\s(?:\[
|
|
|
|
Fang Xianfu GURU
Joined: 26 Jan 2004 Posts: 5155 Location: United Kingdom
|
Posted: Thu Jun 12, 2008 4:57 pm |
No, it really doesn't work, even with that fix. That was one of the very first things I looked at, and it fires fine but captures wrong. If you give it the first line of example text I have there, you get %1=1533 and %2="bayberry" and that's it. I don't know why it's not backtracking to expand the non-greedy .*? match, but it's not.
|
|
|
|
Brenex Beginner
Joined: 13 May 2008 Posts: 25
|
Posted: Thu Jun 12, 2008 6:38 pm |
Weird, it works fine in RegexBuddy (bug?). I didn't test it in CMUD since what I had changed it to worked fine anyway. Now that I've tried it in CMUD I see what you mean. This works for the first two:
Code: |
^\s\s(?:\[\s{0,3}(\d+)\] (.*?)\s+)(?:\[\s{0,3}(\d+)\] (.*?)\s+)? |
but adding:
Code: |
(?:\[\s{0,3}(\d+)\] (.*?)\s+)?$ |
to make:
Code: |
^\s\s(?:\[\s{0,3}(\d+)\] (.*?)\s+)(?:\[\s{0,3}(\d+)\] (.*?)\s+)?(?:\[\s{0,3}(\d+)\] (.*?)\s+)?$ |
doesn't work. Go figure, but then again I only picked up regex a few days ago anywho. Perhaps it doesn't work that way. |
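For comparison, here is a sketch of the full three-group pattern run through Python's `re` module. That is a different engine from the one in CMUD, so it says nothing about why CMUD captures what it does; it only shows how a standard backtracking engine treats the pattern. The sample spacing and the helper names `GROUP` and `PATTERN` are assumptions.

```python
import re

# One item group, repeated: required, then optional, then optional.
GROUP = r"(?:\[\s{0,3}(\d+)\] (.*?)\s+)"
PATTERN = re.compile(r"^\s\s" + GROUP + GROUP + "?" + GROUP + "?$")

# Assumed spacing; note there is no whitespace after the last item.
line = "  [1533] bayberry bark       [2000] bellwort flower    [1244] black cohosh"

# Every group ends in \s+, so the last item needs whitespace before $.
no_trailing = PATTERN.match(line)
with_trailing = PATTERN.match(line + " ")

print(no_trailing)
print(with_trailing.groups())
```

In this engine the lazy .*? does expand as expected, but the match only succeeds when the line ends in whitespace, because the final \s+ must consume at least one character before $.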
|
|
|
|
|