Regexp: capturing groups in repetitions #51381
Comments
For now, when capturing groups are used within repetitions, it is impossible to capture what they match E.g. the following regular expression: (0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)(?:\.(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)){3} is a regexp that contains two capturing groups (\1 and \2), but whose second one is repeated (3 times) to For now, capturing groups don't record the full list of matches, but just override the last occurrence of the I'd like to have the possibility to have a compilation flag "R" that would indicate that capturing groups
Effectively, with the same regexp above, we will be able to retrieve (and possibly substitute):
This should work with all repetition patterns (both greedy and not greedy, atomic or not, or possessive), in This idea should also be submitted to the developers of the PCRE library (and Perl from which they originate, If there's already a candidate syntax or compilation flag in those libraries, this syntax should be used for |
I'd like to add that the same behavior should also affect the span(index) This also means that the regular expression compilation flag R should be |
Rationale for the compilation flag: You could think that the compilation flag should not be needed. However, The reason is that the MatchObject will have to store lists of The MatchObject.groups() will also continue to return a list of single |
Implementation details: Currently, the capturing groups behave quite randomly in the values returned by MatchObject, when backtracking occurs in a repetition. This |
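The current behaviour can be checked with the standard re module. The following is a minimal sketch, using the simpler dotted-quad pattern that appears later in this thread rather than the full pattern from the report: after a successful match, a group inside a repetition reports only its last iteration, both in groups() and in span().

```python
import re

# Group 2 sits inside a {3} repetition, so it is matched three times.
m = re.match(r'^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1')

groups = m.groups()      # only the last iteration of group 2 is kept
last_span = m.span(2)    # offsets of that last iteration only

print(groups)            # ('192', '1')
print(last_span)         # (10, 11)
```

The intermediate values '168' and '0' were matched but are not retrievable from the match object.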
I'm skeptical about what you are proposing for the following reasons:
Using a flag like re.R to change the behavior might solve the issue 2), Let's take a simpler ipv4 address as example: you may want to use
'^(\d{1,3})(?:\.(\d{1,3})){3}$' to capture the digits (without checking
if they are in range(256)).
This currently only returns:
>>> re.match(r'^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1').groups()
('192', '1')
If I understood correctly what you are proposing, you would like it to In these situations where some part is repeating, it's usually easier to
use re.findall() or re.split() (or just a plain str.split for simple
cases like this):
>>> addr = '192.168.0.1'
>>> re.findall(r'(?:^|\.)(\d{1,3})', addr)
['192', '168', '0', '1']
>>> re.split(r'\.', addr) # no need to use re.split here
['192', '168', '0', '1']
In both the examples a single step is enough to get what you want '^(\d{1,3})(?:\.(\d{1,3})){3}$' can still be used to check if the string So I'm -1 about the whole idea and -0.8 about an additional flag. |
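The two-step approach described here (one regex to check the string, a plain split to extract the parts) might look like the sketch below. The helper name parse_ipv4 and the octet pattern are assumptions made for illustration; in particular, this octet pattern restricts values to 0-255 and rejects leading zeros, which is stricter than the '\d{1,3}' used in the comment.

```python
import re

# Hypothetical octet pattern: 0-255 with no leading zeros (an assumption,
# not the pattern used in the thread).
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'
IPV4 = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}')

def parse_ipv4(addr):
    """Return the four octets as ints, or None if addr is not a dotted quad."""
    # Step 1: the regex only validates the overall shape.
    if IPV4.fullmatch(addr) is None:
        return None
    # Step 2: a plain split extracts the repeated parts.
    return [int(part) for part in addr.split('.')]

print(parse_ipv4('192.168.0.1'))
print(parse_ipv4('256.1.1.1'))
```

Here the regex needs no capturing groups at all, because extraction is delegated to str.split once the string is known to be well formed.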
You're wrong, it WILL be compatible, because it is only conditioned by a Without the regular compilation flag set, as I said, there will be NO Reopening the proposal, which is perfectly valid ! |
Note that I used the IPv4 address format only as an example. There are I'm NOT asking you how to parse it using MULTIPLE regexps and functions. |
In addition, your suggested regexp for IPv4: '^(\d{1,3})(?:\.(\d{1,3})){3}$' is completely WRONG ! It will match INVALID IPv4 address formats like |
Summary of your points with my responses :
That's exactly why I proposed to discuss it with the developers of other
Wrong. This does not even change the syntax of regular expressions
Wrong. All the mechanics are already implemented: when the parser will
Already suggested above. This will however NOT affect the compatibility
There are really a lot ! Using multiple split operations and multiple |
And anyway, my suggestion is certainly much more useful than atomic groups |
If you read what Ezio wrote carefully you will see that he addressed Simple. obvious feature requests can be opened and acted upon directly Note that we don't really have a resolution that says 'sent to |
Just to clarify, when I said "in most cases such an issue would need to |
I had carefully read ALL that ezio said, this is clear in the fact that Capturing groups is a VERY useful feature of regular expressions, but My proposal would have absolutely NO performance impact when capturing It would also not affect the case where capturing groups are used in the Using multiple parsing operations with multiple regexps is really This extension will also NOT affect the non-capturing groups like: If my suggestion to keep the existing MatchObject.function(index) API
MatchObject.groupOccurences(index)
MatchObject.startOccurences(index)
MatchObject.endOccurences(index)
MatchObject.spanOccurences(index)
MatchObject.groupsOccurences(index)
But I don't think this is necessary; it will be already expected that Maybe only PCRE (written for C/C++) would need a new API name to return My proposal is not inconsistent: it returns consistent datatypes when Anyway I'll submit my idea to other groups, if I can find where to post It really reduces the number of transformation steps needed to process |
Sorry, I missed that you mentioned the flag already in the first
Can you provide some example where your solution is better than the
Even with your solution, in most of the cases you will need additional I can see a very limited set of hypothetical corner cases where your
proposal may save a few lines of code but I don't think it's worth
implementing all this just for them.
An example could be:
>>> re.match(r'^([0-9A-F]{2}){4} ([a-z]\d){5}$', '3FB52A0C a2c4g3k9d3',
re.R).groups()
(['3F', 'B5', '2A', '0C'], ['a2', 'c4', 'g3', 'k9', 'd3'])
but it's not really a real-world case, if you have some real-world
example I'd like to see it.
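For comparison, the same pair of lists can be obtained with the existing re module in two steps. This is only a sketch: the function name repeated_groups is hypothetical, and the pattern is restructured so that each capturing group wraps its whole repetition instead of a single iteration.

```python
import re

# Each capturing group now encloses the entire repeated run.
PATTERN = re.compile(r'^((?:[0-9A-F]{2}){4}) ((?:[a-z]\d){5})$')

def repeated_groups(text):
    """Return one list of items per repeated run, or None if text does not match."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # Second step: cut each validated run back into its individual items.
    hex_pairs = re.findall(r'[0-9A-F]{2}', m.group(1))
    letter_digits = re.findall(r'[a-z]\d', m.group(2))
    return hex_pairs, letter_digits

print(repeated_groups('3FB52A0C a2c4g3k9d3'))
```

The cost is that the inner sub-pattern of each group has to be written twice, once for validation and once for extraction.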
That's why I wrote 'without checking if they are in range(256)'; the
So maybe this is not the right place to ask.
What I meant is that a regex that uses the re.R flag in Python won't
Usually when the text to be parsed starts to be too complex it is better to
Then why no one implemented it yet? :) |
ezio said:
>>> re.match(r'^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1').groups()
('192', '1')
> If I understood correctly what you are proposing, you would like it to
return (['192'], ['168', '0', '1']) instead. Yes, exactly ! That's the correct answer that should be returned, when
Yes, but this is necessary for full consistency of the group indexes. It is then assumed that when the R flag is set, ALL occurrences of |
NO ! You have to check also the number of digits for values below 100 (2 And when processing web log files for example, or when parsing Wiki The real need is to match things exactly, within their context, and I gave the IPv4 regexp only as a simple example to show the need, but |
Yes, but this step is trivial and fully predictable. Much more viable How many bugs have been found in code using split() for example to parse And in fine, the only solution is to simply rewrite the parser |
I know this problem, and I have already written about this. It is not Such module can be reduced to just a couple of lines with a single
|
That's because they had to use something else than regexps to do their And then later they regretted it, because they had to fix their |
>>> re.match(r'^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1').groups()
('192', '1')
> If I understood correctly what you are proposing, you would like it to
return (['192'], ['168', '0', '1']) instead. In fact it can be assembled in a single array directly in the regexp, by >>> re.match('^(?P<parts>=\d{1,3})(?:\.(?P<parts>=\d{1,3})){3}$',
'192.168.0.1').groups() would return ("parts": ['192', '168', '0', '1']) in the same first This could be used as well in PHP (which supports associative arrays for |
Instead of a new flag, a '*' could be put after the quantifier, eg:
MatchObject.group(1) would be a string and MatchObject.group(2) would be The group references could be \g<1>, \g<2:0>, \g<2:1>, \g<2:2>. However, I think that it's extending regexes too far; something else -1 from me |
You said that this extension was not implemented anywhere, and you were I've found that it IS implemented in Perl 6! Look at this discussion: http://www.perlmonks.org/?node_id=602361 Look at how the matches in quantified capture groups are returned as So my idea is not stupid. Given that Perl rules the world of the Regexp Already, this is used in CPAN libraries for Perl v6... (when the X flag |
Anyway, there are ways to speed up regexps, even without instructing the See http://swtch.com/~rsc/regexp/regexp1.html Java now uses the Thompson approach in its latest releases, but I wonder Note that I've been using the DFA simulation for more than 20 years in This algorithm has been implemented in some tools replacing the old The Perl 6 extension for quantified capturing groups will have a slow But my suggestion is much more general, as it should not just apply to And the way I specified it, it does not depend on the way the engine is The simple test case is effectively to try to match /(aa?)*a+/ with |
Umm.... I said that the attribution to Thompson was wrong, in fact it The paper published on swtch.com was effectively written in 2007, but its The cache for DFA states will fill up while matching the regexp against However the paper still does not discuss how to make the DFA states Then the DFA cache can be used in a LIFO manner, to purge it Apparently, GNU awk does not use the cached DFA approach: it just uses I'll try to implement this newer approach first in Java (just because I In Java, there's a clean way to automatically clean up objects from Maybe I'll port it later in Python, but don't expect that I'll port it |
I would find this functionality very useful. While I agree that it's often simpler to extract the relevant information in several steps, there are situations in which I'd prefer to do it all in one go. The application I'm writing at the moment needs to extract metadata from text files. This metadata actually appears as text at the top of each file. For example: title: Example title Example title Here is the first paragraph. I had expected something like this to get the job done: meta = re.match(r'(?ms)(?:^(\S+):\s*(.*?)$\n)+^\s*$', contents_of_file) Ideally in this case, meta.groups() would return: ('title', 'Example title', 'tags', 'Django, Python, regular expressions') |
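With the current re module, the metadata case above can be handled in two passes: one regex delimits the header block, a second one pulls out each "key: value" line. This is a minimal sketch; the function name extract_metadata, both patterns, and the sample file contents (including the "tags" line implied by the desired result) are assumptions made for illustration.

```python
import re

# Header block: consecutive "key: value" lines at the very top of the text,
# followed by a blank line.
HEADER = re.compile(r'(?m)\A(?:^\S+:[ \t]*.*\n)+(?=^\s*$)')
# One "key: value" line inside that block.
FIELD = re.compile(r'(?m)^(\S+):[ \t]*(.*)$')

def extract_metadata(contents):
    """Return a list of (key, value) pairs from the header, or [] if none."""
    header = HEADER.match(contents)
    if header is None:
        return []
    return FIELD.findall(header.group(0))

# Assumed sample input, reconstructed from the fields named in the comment.
sample = (
    "title: Example title\n"
    "tags: Django, Python, regular expressions\n"
    "\n"
    "Example title\n"
    "\n"
    "Here is the first paragraph.\n"
)
print(extract_metadata(sample))
```

The pairs come back as a list of tuples rather than the flat tuple the comment asks for; dict(extract_metadata(sample)) would give a mapping if keys are unique.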
Earlier this week I discovered that .Net supports repeated capture and its API suggested a much cleaner approach than what Perl offered, so I'll be adding it to the regex module at:
The new methods will follow the example of .group() & co. Given a match object m, m.group(i) returns the last match of group i (or None if there's no match), so I'll be adding m.captures(i) to return a tuple of the captures (an empty tuple if there's no match). I'll also be adding m.starts(i), m.ends(i) and m.spans(i). The issue for this work is bpo-2636. Unit tests are welcome. |
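A sketch of how m.captures(i) could be used on the dotted-quad example from earlier in this thread, assuming the third-party regex module is installed. The wrapper function octet_captures and the standard-library fallback are hypothetical additions for illustration, and the fallback only works for this particular pattern. The result is converted with list() so the sketch does not depend on whether captures() returns a tuple or a list.

```python
import re

try:
    import regex  # third-party module; assumed to be installed separately
except ImportError:
    regex = None

# Same shape as the thread's pattern; fullmatch replaces the ^...$ anchors.
PATTERN = r'(\d{1,3})(?:\.(\d{1,3})){3}'

def octet_captures(addr):
    """Return every capture of group 1 and of group 2 as two lists."""
    if regex is not None:
        m = regex.fullmatch(PATTERN, addr)
        if m is None:
            return None
        # captures(i) keeps all iterations of group i, not just the last one.
        return list(m.captures(1)), list(m.captures(2))
    # Fallback without the regex module: validate, then split.
    if re.fullmatch(PATTERN, addr) is None:
        return None
    first, *rest = addr.split('.')
    return [first], rest

print(octet_captures('192.168.0.1'))
```

On the same match object, m.group(2) would still return only the last iteration, so existing code that relies on .group() is unaffected.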
Can this be closed as has happened with numerous other issues as a result of work done on the new regex module via bpo-2636? |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.