Confusing Behaviour of regex in Python
I'm trying to match a specific pattern using the re module in Python. I wish to match a full sentence (more correctly, alphanumeric string sequences separated by spaces and/or punctuation).
E.g.:
- "this is a regular sentence."
- "this valid"
- "so one"
I've tried various combinations of regular expressions, but I'm unable to grasp how the patterns work; each expression gives me a different yet inexplicable result (I admit I'm a beginner, but still).
I've tried:
- "((\w+)(\s?))*"
To the best of my knowledge, this should match one or more alphanumerics greedily, followed by either one or no whitespace character, and then match that entire pattern greedily. That is not what it seems to do, so I'm clearly wrong, but I'd like to know why. (I expected it to return the entire sentence as the result.) The result for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
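- "(\w+ ?)*"
I'm not sure how this one should work. The official documentation (Python help('re')) says that *, +, ? match 0 or more, 1 or more, and 0 or 1 (greedy) repetitions of the preceding RE, respectively. In that case, is the space the preceding RE for '?', or is '\w+ ' the preceding RE? And what is the preceding RE for the '*' operator? The output is ['sentence'].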
Others, such as "(\w+\s?)+)" and "((\w*)(\s??))", are variations of the same idea: a sentence is a set of alphanumerics followed by a single/finite number of whitespace characters, with that pattern repeated over and over.
Can someone tell me where I go wrong, and why the above expressions do not work the way I expect them to?
P.S. I got "[ \w]+" to work for me, but with it I cannot limit the number of consecutive whitespace characters.
Your reasoning about the regex is correct; your problem comes from using capturing groups with *. Here's an alternative:
>>> s = "this is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['this ', 'is ', 'a ', 'regular ', 'sentence']

In this case it might make more sense to use \b in order to match word boundaries.

>>> re.findall(r'\w+\b', s)
['this', 'is', 'a', 'regular', 'sentence']

Alternatively, you can match the entire sentence via re.match and call group(0) on the returned match object to get the whole match:

>>> r = r"((\w+)(\s?))*"
>>> s = "this is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'this is a regular sentence'
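
To make the point about capturing groups concrete, here is a minimal sketch. It relies on two documented behaviours of the re module: re.findall returns the groups rather than the whole match when the pattern contains capturing groups, and a repeated group only keeps its last iteration. It also touches on the question of what ? applies to, and on the P.S. about limiting whitespace. The variable names and the final `sentence` pattern are assumptions made up for illustration, not part of the original question or answer.

```python
import re

s = "this is a regular sentence."

# With capturing groups, findall reports the groups, and a group under *
# only remembers the text from its final repetition.
groups = re.findall(r"((\w+)(\s?))*", s)
first = groups[0]
print(first)  # ('sentence', 'sentence', '')

# Non-capturing groups (?:...) make findall return whole matches instead.
# Using + rather than * also avoids the empty matches.
whole = re.findall(r"(?:\w+\s?)+", s)
print(whole)  # ['this is a regular sentence']

# The ? applies only to the item directly before it (the single space),
# not to the whole "\w+ " sequence.
one_word = re.fullmatch(r"\w+ ?", "word ") is not None
two_words = re.fullmatch(r"\w+ ?", "word word ") is not None
print(one_word, two_words)  # True False

# Hypothetical pattern: words separated by exactly one whitespace
# character, optionally ending in one punctuation mark.
sentence = re.compile(r"\w+(?:\s\w+)*[.!?]?")
single_spaced = sentence.fullmatch(s) is not None
double_spaced = sentence.fullmatch("two  spaces") is not None
print(single_spaced, double_spaced)  # True False
```

Unlike "[ \w]+", the last pattern allows only one whitespace character between words, because each repetition of the group consumes exactly one \s before the next word.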