
Regexp: capturing groups in repetitions #51381

Open
verdyp mannequin opened this issue Oct 14, 2009 · 27 comments
Labels
stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement

Comments

verdyp mannequin commented Oct 14, 2009

BPO 7132
Nosy @ezio-melotti, @davidchambers

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2009-10-14.20:08:14.901>
labels = ['type-feature', 'library']
title = 'Regexp: capturing groups in repetitions'
updated_at = <Date 2014-03-22.23:46:07.052>
user = 'https://bugs.python.org/verdyp'

bugs.python.org fields:

activity = <Date 2014-03-22.23:46:07.052>
actor = 'BreamoreBoy'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2009-10-14.20:08:14.901>
creator = 'verdy_p'
dependencies = []
files = []
hgrepos = []
issue_num = 7132
keywords = []
message_count = 27.0
messages = ['94022', '94025', '94029', '94030', '94034', '94038', '94040', '94043', '94045', '94046', '94047', '94049', '94050', '94051', '94053', '94056', '94057', '94058', '94060', '94065', '94066', '94067', '94068', '94071', '102080', '121484', '214523']
nosy_count = 5.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'verdy_p', 'davidchambers', 'BreamoreBoy']
pr_nums = []
priority = 'low'
resolution = 'rejected'
stage = 'resolved'
status = 'languishing'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue7132'
versions = ['Python 3.2']

verdyp mannequin commented Oct 14, 2009

For now, when capturing groups are used within repetitions, it is impossible to capture what they match
individually within the list of matched repetitions.

E.g. the following regular expression:

(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)(?:\.(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)){3}

is a regexp that contains two capturing groups (\1 and \2), of which the second is repeated (3 times) to
match an IPv4 address in dotted decimal format. We'd like to be able to get the individual matches
for the second group.

For now, capturing groups don't record the full list of matches; they just keep the last occurrence of the
capturing group (or just the first if the repetition is not greedy, which is not the case here because the
repetition "{3}" is not followed by a "?"). So \1 will effectively return the first decimal component of the
IPv4 address, but \2 will just return the last (fourth) decimal component.

I'd like to have the possibility of a compilation flag "R" that would indicate that capturing groups
will not just return a single occurrence, but all occurrences of the same group. If this "R" flag is enabled,
then:

  • the Match.group(index) will not just return a single string but a list of strings, with as many occurrences
    as the number of effective repetitions of the same capturing group. The last element in that list will be the
    one equal to the current behavior;

  • the Match.start(index) and Match.end(index) will also both return a list of positions, those lists having
    the same length as the list returned by Match.group(index);

  • for consistency, the returned values as lists of strings (instead of just single strings) will apply to all
    capturing groups, even if they are not repeated.

Effectively, with the same regexp above, we will be able to retrieve (and possibly substitute):

  • the first decimal component of the IPv4 address with "{\1:1}" (or "{\1:}" or "{\1}" or "\1" as before),
    i.e. the 1st (and last) occurrence of capturing group 1, in Match.group(1)[1], or between string positions Match.start(1)[1] and Match.end(1)[1];

  • the second decimal component of the IPv4 address with "{\2:1}", i.e. the 1st occurrence of capturing group
    2, in Match.group(2)[1], or between string positions Match.start(2)[1] and Match.end(2)[1];

  • the third decimal component of the IPv4 address with "{\2:2}", i.e. the 2nd occurrence of capturing group 2,
    in Match.group(2)[2], or between string positions Match.start(2)[2] and Match.end(2)[2];

  • the fourth decimal component of the IPv4 address with "{\2:3}" (or "{\2:}" or "{\2}" or "\2"), i.e. the 3rd
    (and last) occurrence of capturing group 2, in Match.group(2)[3], or between string positions
    Match.start(2)[3] and Match.end(2)[3].

This should work with all repetition patterns (both greedy and not greedy, atomic or not, or possessive), in
which the repeated pattern contains any capturing group.

This idea should also be submitted to the developers of the PCRE library (and Perl, from which PCRE
originates, and PHP, where PCRE is also used), so that they adopt a similar behavior in their regular expressions.

If there's already a candidate syntax or compilation flag in those libraries, this syntax should be used for
repeated capturing groups.
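
For reference, a minimal sketch of the current last-match-wins behavior described above, using the stdlib re module and the octet pattern quoted earlier (the proposed "R" flag does not exist, so group 2 keeps only its final repetition):

>>> import re
>>> octet = r'(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)'
>>> m = re.match(octet + r'(?:\.' + octet + r'){3}$', '192.168.0.1')
>>> m.group(1)   # the first octet, captured once
'192'
>>> m.group(2)   # only the LAST repetition; '168' and '0' are lost
'1'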

@verdyp verdyp mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Oct 14, 2009
verdyp mannequin commented Oct 14, 2009

I'd like to add that the same behavior should also affect the span(index)
method of MatchObject, which should likewise not return a single (start,
end) pair, but a list of pairs, one for
each occurrence, when the "R" compilation flag is specified.

This also means that the regular expression compilation flag R should be
supported as these constants:
Regexp.R, or Regexp.REPETITIONS

verdyp mannequin commented Oct 14, 2009

Rationale for the compilation flag:

You could think that the compilation flag should not be needed. However,
not using it would mean that a LOT of existing regular expressions that
already contain capturing groups in repetitions, and for which the
capturing group is effectively not used and would have been better
encoded as a non-capturing group like (?:X) instead of (X), will suffer a
negative performance impact and higher memory usage.

The reason is that the MatchObject would have to store lists of
(start,end) pairs instead of just a single pair. Using a list will not
be the default, so MatchObject.group(groupIndex),
MatchObject.start(groupIndex), MatchObject.end(groupIndex), and
MatchObject.span(groupIndex) will continue to return a single value or a
single pair when the R compilation flag is not set (these values will
continue to return only the last occurrence, which is overwritten after
each matched occurrence of the capturing group).

MatchObject.groups() will also continue to return a list of single
strings without that flag set (i.e. a list of the last occurrence of each
capturing group). But when the R flag is specified, it will instead
return a list of lists, each element corresponding to the group with the
same index and being itself a list of strings, one for each occurrence of
the capturing group.
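
To make the proposed shapes concrete, here is a sketch with a simpler pattern, emulating with str.split what the engine would record itself (re.R is hypothetical and does not exist in any released engine):

>>> import re
>>> addr = '192.168.0.1'
>>> re.match(r'^(\d{1,3})(?:\.(\d{1,3})){3}$', addr).groups()
('192', '1')
>>> parts = addr.split('.')     # emulate the proposed re.R result shape
>>> (parts[:1], parts[1:])
(['192'], ['168', '0', '1'])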

verdyp mannequin commented Oct 14, 2009

Implementation details:

Currently, the capturing groups behave quite randomly in the values
returned by MatchObject when backtracking occurs in a repetition. This
proposal will help fix that behavior, because it will also be much easier
to backtrack cleanly, occurrence by occurrence, by just dropping the last
element in the list of (start,end) pairs stored in the MatchObject for
all capturing groups specified WITHIN the repeated sub-expression.

@ezio-melotti

I'm skeptical about what you are proposing for the following reasons:

  1. it doesn't exist in any other implementation that I know of;
  2. if implemented as default behavior:
    • it won't be backward-compatible;
    • it will increase the complexity;
  3. it will be a proprietary extension and it will reduce the
    compatibility with other implementations;
  4. I can't think of any real-world situation where this would be really
    useful.

Using a flag like re.R to change the behavior might solve issue 2),
but I'll explain why I don't think this is useful.

Let's take a simpler ipv4 address as an example: you may want to use
r'^(\d{1,3})(?:\.(\d{1,3})){3}$' to capture the digits (without checking
if they are in range(256)).
This currently only returns:
>>> re.match(r'^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1').groups()
('192', '1')

If I understood correctly what you are proposing, you would like it to
return (['192'], ['168', '0', '1']) instead. This will also require an
additional step to join the two lists to get the list with the 4 values.

In these situations where some part is repeating, it's usually easier to
use re.findall() or re.split() (or just a plain str.split for simple
cases like this):
>>> addr = '192.168.0.1'
>>> re.findall(r'(?:^|\.)(\d{1,3})', addr)
['192', '168', '0', '1']
>>> re.split(r'\.', addr)  # no need to use re.split here
['192', '168', '0', '1']

In both examples a single step is enough to get what you want
without changing the existing behavior.

r'^(\d{1,3})(?:\.(\d{1,3})){3}$' can still be used to check that the string
has the right "format" before using the other methods to extract the data.

So I'm -1 on the whole idea and -0.8 on an additional flag.
Maybe you should discuss this on the python-ideas ML.

verdyp mannequin commented Oct 14, 2009

You're wrong, it WILL be compatible, because it is only conditioned by a
FLAG. The flag is there specifically for instructing the parser to
generate lists of values rather than single values.

Without the compilation flag set, as I said, there will be NO change.

Reopening the proposal, which is perfectly valid!

verdyp mannequin commented Oct 14, 2009

Note that I used the IPv4 address format only as an example. There are
plenty of other more complex cases for which we really need to capture the
multiple occurrences of a capturing group within a repetition.

I'm NOT asking you how to parse it using MULTIPLE regexps and functions.
Of course you can, but that is a distinct problem, and certainly NOT a
general solution (your solution using split() will NOT work with really A
LOT of other regular expressions).

verdyp mannequin commented Oct 14, 2009

In addition, your suggested regexp for IPv4:

r'^(\d{1,3})(?:\.(\d{1,3})){3}$'

is completely WRONG! It will match INVALID IPv4 address formats like
"000.000.000.000". Reread the RFCs... because "000.000.000.000" is
CERTAINLY NOT an IPv4 address (if it is found in a URL) but a domain name
that must be resolved into an IP address using domain name resolution
requests.

verdyp mannequin commented Oct 14, 2009

Summary of your points with my responses:

  1. it doesn't exist in any other implementation that I know of;

That's exactly why I proposed to discuss it with the developers of other
implementations (I cited PCRE, Perl and PHP developers; there are
others).

  2. if implemented as default behavior:
  • it won't be backward-compatible;

Wrong. This does not even change the syntax of regular expressions
themselves.

  • it will increase the complexity;

Wrong. All the mechanics are already implemented: when the parser
stores string index positions for a matched group, it will just have to
append a pair to the list stored in MatchObject.group(index) when the
R flag is set (creating the list if it is not already there, though it
should already be initialized to an empty list by the compiler),
instead of overwriting the existing pair without wondering whether there
was already another occurrence of the same capturing group.

  3. it will be a proprietary extension and it will reduce the
    compatibility with other implementations;

Already addressed above. This will however NOT affect the compatibility
of existing implementations that don't have the R flag.

  4. I can't think of any real-world situation where this would be really
    useful.

There are really a lot! Using multiple split operations and multiple
parsing passes on partly parsed regular expressions will not be a solution in
many situations (think about how you would perform matches and use
them in 'vi' or 'ed' with a single "s/regexp/replacement/flag"
instruction, if there's no extension with a flag and a syntax for
accessing the individual elements in the replacement string).

verdyp mannequin commented Oct 14, 2009

And anyway, my suggestion is certainly much more useful than atomic groups
and possessive groups, which have far less use, and which are already
being tested in Perl but that Python (or PCRE, PHP, and most
implementations of 'vi'/'ed', or 'sed') still does not support.

@bitdancer

If you read what Ezio wrote carefully you will see that he addressed
both of your points: he acknowledged that a flag would solve (2) (but
disagreed that it was worth it), and he said you could use the first
expression to validate the string before using the split to obtain the
data. Doing it all in one regex might seem more efficient, but at least
in the single use case you have presented it would lead to more
complicated code.

Simple, obvious feature requests can be opened and acted upon directly
in the tracker, but more complicated requests should be made as
proposals on the python-ideas list, and if they are well received there
then an issue can be opened (or in this case you could reopen this one)
in the tracker with a pointer to the python-ideas thread. In most
cases, such an issue would need to include a proposed patch.

Note that we don't really have a resolution that says 'sent to
python-ideas', thus the use of the 'rejected' status.

@bitdancer

Just to clarify, when I said "in most cases such an issue would need to
include a proposed patch", I mean that even if everyone agrees it is a
good idea it isn't likely to happen unless there is a proposed patch :)

verdyp mannequin commented Oct 14, 2009

I had carefully read ALL that ezio said; this is clear from the fact that
I have summarized my responses to ALL 4 points given by ezio.

Capturing groups are a VERY useful feature of regular expressions, but
they currently DON'T work as expected (in a useful way) when they are
used within repetitions (unless you don't need any captures at all, for
example when just using find() and not performing substitutions on the
groups).

My proposal would have absolutely NO performance impact when capturing
groups are not used (find only, without replacement, so the R flag
can be safely ignored there).

It would also not affect the case where capturing groups are used in the
regexp, but these groups are not referenced in the substitution or in
the code using MatchObject.group(index): these indexes are already not
used (or should not be, because it is most of the time a bug when this
just returns the last occurrence).

Using multiple parsing operations with multiple regexps is really
tricky, when all could be done directly from the original regexp,
without modifying it. In addition, using split() or similar will not
work as expected when the splitting operations do not correctly parse
the context in which the multiple occurrences are safely separated (this
context is only correctly specified in the original regexp where the
groups, capturing or not, are specified).

This extension will also NOT affect the non-capturing groups like:
(?:X){m,n}
(?:X)*
(?:X)+
It will ONLY affect the CAPTURING groups like:
(X){m,n}
(X)*
(X)+
and only if the R flag is set (in which case this will NOT affect the
backtracking behavior, or which strings will effectively be matched,
but only the values of the returned "\n" indexed groups).

If my suggestion to keep the existing MatchObject.function(index) API
looks too dangerous to you, because it would change the type of the
returned values when the R flag is set, you could as well rename them to
get a specific occurrence of a group. Such as:

 MatchObject.groupOccurrences(index)
 MatchObject.startOccurrences(index)
 MatchObject.endOccurrences(index)
 MatchObject.spanOccurrences(index)
 MatchObject.groupsOccurrences(index)

But I don't think this is necessary; it will already be expected that
they return lists of values (or lists of pairs), instead of just
single values (or single pairs) for each group: Python (as well as PHP
or Perl) can already manage return values with varying datatypes.

Maybe only PCRE (written for C/C++) would need a new API name to return
lists of values instead of single values for each group, due to existing
datatype restrictions.

My proposal is not inconsistent: it returns consistent datatypes when
the R flag is set, for ALL capturing groups (not just those that are
repeated).

Anyway I'll submit my idea to other groups, if I can find where to post
it. Note that I've already implemented it in my own local
implementation of PCRE, and this works perfectly with effectively very
few changes (I have had to change the datatypes for matching
objects so that they can return varying types), and I have used it to
create a modified version of 'sed' to perform massive filtering of data:

It really reduces the number of transformation steps needed to process
such data correctly, because a single regexp (exactly the same one that is
already used in the first step to match the substrings we are
interested in, when using existing 'sed' implementations) can be used to
perform the substitutions using indexes within captured groups. And I
would like to have it incorporated in Python (and also Perl or PHP) as
well.

@ezio-melotti

You're wrong, it WILL be compatible, because it is only conditioned
by a FLAG.

Sorry, I missed that you mentioned the flag already in the first
message, but what I said in 1), 3) and 4) is still valid.

There are plenty of other more complex cases for which we really
need to capture the multiple occurences of a capturing group within
a repetition.

Can you provide some example where your solution is better than the
other available ways of doing it? Over the years lots of extensions have
been added to regex engines; if no one has added this one, it is probably
because these problems can already be solved in other ways.

I'm NOT asking you how to parse it using MULTIPLE regexps and
functions.
Of course you can, but this is a distinct problem, but certinaly NOT
a general solution (your solution using split() will NOT work with
really A LOT of other regular expressions).

Even with your solution, in most of the cases you will need additional
steps to assemble the results (at least in the cases with some kind of
separator, where you have to join the first element with the following ones).

I can see a very limited set of hypothetical corner cases where your
proposal may save a few lines of code, but I don't think it's worth
implementing all this just for them.
An example could be:
>>> re.match(r'^([0-9A-F]{2}){4} ([a-z]\d){5}$', '3FB52A0C a2c4g3k9d3', re.R).groups()
(['3F', 'B5', '2A', '0C'], ['a2', 'c4', 'g3', 'k9', 'd3'])
but it's not really a real-world case; if you have some real-world
example I'd like to see it.

> In addition, your suggested regexp for IPv4:
> r'^(\d{1,3})(?:\.(\d{1,3})){3}$'
> is completely WRONG!

That's why I wrote 'without checking if they are in range(256)'; the
fact that this regex matches invalid digits was not relevant in my
example (and it's usually easier to convert the digits to int and check
if 0 <= digits <= 255). :)

> 1) it doesn't exist in any other implementation that I know;

That's exactly why I proposed to discuss it with the developers of
other implementations (I cited PCRE, Perl and PHP developers, there
are others).

So maybe this is not the right place to ask.

> 3) it will be a proprietary extension and it will reduce the
> compatibility with other implementations;

Already addressed above. This will however NOT affect the
compatibility of existing implementations that don't have the R flag.

What I meant is that a regex that uses the re.R flag in Python won't
work in other languages/implementations because they don't have it, and
a "general" regex (e.g. for an ipv6 address) will have to be
adapted/rewritten in order to take advantage of re.R.

> 4) I can't think to any real word situation where this would be
> really useful.

There are really a lot! Using multiple split operations and multiple
parsing passes on partly parsed regular expressions will not be a solution
in many situations (think about how you would perform matches and
use them in 'vi' or 'ed' with a single
"s/regexp/replacement/flag" instruction, if there's no extension
with a flag and a syntax for accessing the individual elements in the
replacement string).

Usually when the text to be parsed starts to be too complex, it is better
to use another approach, e.g. using a real parser or dividing the text
into smaller units and working on them independently. Even if re.R could
make this easier I would still prefer to have a few more lines of code
that do different things than a single big regex that does everything.

And anyway, my suggestion is certainly much more useful than atomic
groups and possessive groups that have much lower use [...]

Then why no one implemented it yet? :)

verdyp mannequin commented Oct 14, 2009

ezio said:
>>> re.match(r'^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1').groups()
('192', '1')
> If I understood correctly what you are proposing, you would like it to
> return (['192'], ['168', '0', '1']) instead.

Yes, exactly! That's the correct answer that should be returned when
the R flag is set.

> This will also require an additional step to join the two lists to get
> the list with the 4 values.

Yes, but this is necessary for full consistency of the group indexes.
The current return value is clearly inconsistent (generally it returns
the last occurrence of the capturing group, but I've discovered that this
is not always the case, because of matches that are returned after
backtracking...)

It is then assumed that when the R flag is set, ALL occurrences of
repeated groups will be returned individually, instead of just a
'random' one. Note that for full generalization of the concept, there
should even be lists of lists if a capturing group itself contains
another inner capturing group (with its own index), in order to
associate them correctly with each occurrence of the outer capturing
group (however I've still not experimented with this in my local
implementation, so all occurrences are grouped in the same returned
list, independently of the occurrence of the outer capturing group in
which they were found).

verdyp mannequin commented Oct 14, 2009

> That's why I wrote 'without checking if they are in range(256)'; the
> fact that this regex matches invalid digits was not relevant in my
> example (and it's usually easier to convert the digits to int and check
> if 0 <= digits <= 255). :)

NO! You also have to check the number of digits for values below 100 (2
digits only) or below 10 (1 digit only).

And when processing web log files, for example, or when parsing Wiki
pages or emails in which you want to autodetect the presence of ONLY
valid IP addresses within some contexts, where you want to transform
them into another form (for example when converting them to links, or to
differentiate 'anonymous' users in wiki pages from registered named
users), you need to correctly match these IP addresses. In addition,
these files will often contain many other occurrences that you don't want
to transform, but just some of them in specific contexts given by the
regexp. For this reason, your suggestion will often not work as
expected.

The real need is to match things exactly, within their context, and to
capture all occurrences of capturing groups.

I gave the IPv4 regexp only as a simple example to show the need, but
there are of course much more complex cases, and it's exactly for
those cases that I would like the extension: using alternate code with
partial matches and extra split() operations gives code that becomes
tricky, and most often bogus. Only the original regexp is precise
enough to parse the content correctly, find only the matches we want,
and capture all the groups that we really want, in a single operation,
and with a near-zero cost (and without complication in the rest of the
code using it).
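
For the record, a sketch of the two-step approach being debated here, with the stricter checks verdyp insists on (the helper name is_strict_ipv4 is illustrative, not from the thread):

import re

_SHAPE = re.compile(r'^(\d{1,3})(?:\.(\d{1,3})){3}$')

def is_strict_ipv4(s):
    # Step 1: check the dotted-decimal shape with ezio's regexp.
    if not _SHAPE.match(s):
        return False
    # Step 2: range-check each octet and reject extra leading zeros,
    # so '000.000.000.000' is refused, as verdyp requires.
    octets = s.split('.')
    return all(o == str(int(o)) and int(o) <= 255 for o in octets)

print(is_strict_ipv4('192.168.0.1'))      # True
print(is_strict_ipv4('000.000.000.000'))  # False (leading zeros)
print(is_strict_ipv4('256.1.1.1'))        # False (out of range)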

verdyp mannequin commented Oct 15, 2009

> Even with your solution, in most of the cases you will need additional
> steps to assemble the results (at least in the cases with some kind of
> separator, where you have to join the first element with the
> following ones).

Yes, but this step is trivial and fully predictable. It is much more
viable than the other proposed solutions, which give tricky and often
complex and bogus code.

How many bugs have been found in code that uses split(), for example, to
parse URLs? There are countless ones in many programs (and the list is
not over!)

And in the end, the only solution is to simply rewrite the parser
completely, without regexps at all, or to reduce the generality of the
problems that the program was supposed to solve (i.e. asserting some
implementation limits in the code, to reject some forms that were
previously considered valid). Think about it: the capturing groups are
the perfect solution for solving the problem cleanly, provided that they
work as intended and return all their occurrences.

verdyp mannequin commented Oct 15, 2009

a "general" regex (e.g. for an ipv6 address)

I know this problem, and I have already written about it. It is not
possible to parse one with a single regexp unless the regexp is written
without repetitions. But in that case, the regexp becomes really HUGE, and
the number of groups in the returned match object is prohibitive. That's
why CPAN had to provide a specific module for IPv6 addresses in Perl.

Such a module could be reduced to just a couple of lines with a single
regexp, if its capturing groups correctly returned ALL their occurrences in
the regexp engine: it requires no further processing and analysis, and
the data can effectively be reassembled cleanly, just from the returned
groups (as lists):

  • \1 and \2 (for hex components of IPv6 in hex format only, where \1 can
    occur 0 or 1 time, and \2 can occur 0 to 7 times)
  • or from \1 to \2 and \3 to \4 (for hex components in \1..\2, where \1
    occurs 0 or 1 time and \2 occurs 0 to 5 times, and for decimal
    components in \3..\4, where \3 occurs 1 time and \4 occurs exactly 3
    times).

verdyp mannequin commented Oct 15, 2009

> And anyway, my suggestion is certainly much more useful than atomic
> groups and possessive groups that have much lower use [...]
Then why no one implemented it yet? :)

That's because they had to use something other than regexps to do their
parsing. All those that had to do that *complained* that the regexps were
not capturing all occurrences.

And then later they regretted it, because they had to fix their
alternate code (such as the bogus split() alternatives...)
and finally wrote their own parsers (sometimes with a combination of
(f)lex+yacc/bison, even though the full expression was given in a single
regexp which was expressive enough to match only the exact match they
wanted, but without using the returned captured groups): this means an
extra parse of the found substring (in its correct context) in order to
process it.

verdyp mannequin commented Oct 15, 2009

>>> re.match(r'^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1').groups()
('192', '1')
> If I understood correctly what you are proposing, you would like it to
> return (['192'], ['168', '0', '1']) instead.

In fact it can be assembled into a single array directly in the regexp, by
naming the destination capturing group (with the same name, it would get
the same group index, instead of allocating a new one). E.g., with
something like:

>>> re.match(r'^(?P<parts>\d{1,3})(?:\.(?P<parts>\d{1,3})){3}$',
'192.168.0.1').groups()

would return {'parts': ['192', '168', '0', '1']} in the same first
group.

This could be used as well in PHP (which supports associative arrays for
named groups that are also indexed positionally).

mrabarnett mannequin commented Oct 15, 2009

Instead of a new flag, a '*' could be put after the quantifier, e.g.:

(\d+)(?:\.(\d+)){3}*

MatchObject.group(1) would be a string and MatchObject.group(2) would be
a list of strings.

The group references could be \g<1>, \g<2:0>, \g<2:1>, \g<2:2>.

However, I think that this is extending regexes too far; something else
should be used, e.g. pyparsing or some type of context-free grammar with
optional constraints.

-1 from me

verdyp mannequin commented Oct 15, 2009

You said that this extension was not implemented anywhere, and you were
wrong.

I've found that it IS implemented in Perl 6! Look at this discussion:

http://www.perlmonks.org/?node_id=602361

Look at how the matches in quantified capture groups are returned as
arrayrefs (i.e. references to a numbered array).

So my idea is not stupid. Given that Perl rules the world of the Regexp
language, it will be implemented elsewhere sooner or later, notably in
PCRE, awk, vi, sed, PHP, the .NET library...

Already, this is used in CPAN libraries for Perl v6... (when the X flag
is set for extensions)

verdyp mannequin commented Oct 15, 2009

Anyway, there are ways to speed up regexps, even without instructing the
regexps with anti-backtracking syntaxes.

See http://swtch.com/~rsc/regexp/regexp1.html
(article dated January 2007),
which discusses how Perl, PCRE (and PHP), Python, Java, Ruby, the .NET
library... are slow because they backtrack through a single state
of the NFA, instead of using simultaneously active states (which
correctly emulates the DFA, without having to actually build the DFA
transition state table, which could grow combinatorially, as seen in
yacc and Bison).

Java now uses the Thompson approach in its latest releases, but I wonder
how Python works: does it use the DFA simulation? Do you use PCRE?

Note that I've been using the DFA simulation for more than 20 years, since
1987, when I built my first regexp matcher (because the existing
implementations at that time were really damn slow), after I read the
Aho/Sethi/Ullman book, which already demonstrated the algorithm.

This algorithm has been implemented in some tools replacing the old
yacc/Bison tools, because those generate huge DFA transition tables (and
this was the reason why Lex and Yacc were maintained as separate tools,
Lex using the single-state NFA approach with excessive backtracking, and
Yacc/Bison trying to generate the full DFA transition tables): the
first tool to use this approach was the Purdue University Compiler
Construction Tool Set (PCCTS), which was initially written in C and is
now fully written and supported in Java.

The Perl 6 extension for quantified capturing groups will see slow
adoption as long as Perl continues with the slow single-state NFA
approach with excessive backtracking, instead of the Aho/Sethi/Ullman
(ASU) approach (which some attribute to Thompson due to the article
in 2007, but this is false) using simultaneously active states. But
anyway, it already exists (and Perl developers are already working on
rewriting the engine using the ASU approach).

But my suggestion is much more general, as it should not just apply to
quantified capturing groups, but to any capturing group that is part of
a subexpression which is quantified.

And the way I specified it, it does not depend on the way the engine is
written (whether it uses a single-state NFA or a multi-state NFA, or
generates and uses a DFA transition table with a single state like in
Yacc/Bison), because capturing groups are just used to store position
pairs, and regexp engines already have to count repetitions for effectively
matching the greedy or non-greedy quantifiers, so this immediately
provides a usable index for storing at successive positions in a
numbered array for captured groups.

The simple test case is effectively to try to match /(aa?)*a+/ with
strings longer than a few dozen 'a's.
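
A quick probe of that test case with the stdlib engine (one caveat added here: a backtracking engine only blows up on a near-miss, so the probe appends a 'b' to force failure; on a pure run of 'a's the match succeeds quickly):

import re, time

pattern = re.compile(r'(aa?)*a+$')
for n in (20, 22, 24, 26):
    s = 'a' * n + 'b'    # the trailing 'b' forces exhaustive backtracking
    t0 = time.perf_counter()
    pattern.match(s)     # returns None, but only after exploring all splits
    print(n, round(time.perf_counter() - t0, 3))  # time grows exponentially with n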

verdyp mannequin commented Oct 15, 2009

Umm.... I said that the attribution to Thompson was wrong; in fact it
was correct. Thompson designed and documented the algorithm in 1968,
long before the Aho/Sethi/Ullman green book... so the algorithm is more
than 40 years old, and still not in Python, Perl or PCRE (but it is
present in GNU awk...)

The paper published on swtch.com was indeed written in 2007, but its
conclusions are perfectly valid. The interesting aspect of this paper is
that it demonstrates that Thompson's multi-state NFA approach is
still the best one, and better than what Perl, PCRE (and PHP), Python,
Ruby and others are using, but that it can also be optimized further by
caching the DFA construction "on the fly" (see the blue curve on the
displayed diagram) while running the already precompiled multi-state
NFA.

The cache of DFA states will fill up while matching the regexp against
actual strings, so it will be much faster (and much less memory- and
time-greedy) than generating the full DFA transition table at compilation
time like in Bison/Yacc.

However, the paper still does not discuss how to keep the DFA state
cache limited in size, notably because the longer the input text,
the more DFA states the cache will contain. One simple rule is
to limit the number of cached DFA states, and then to allow all stored
transitions to go to multiple states in the NFA, and optionally to a
single DFA state in the cache.

Then the DFA cache can be used in a LIFO manner, to purge it
automatically of unused states, in order to reuse them (for caching
another new DFA state which is not yet present in the cache, when the
cache has reached its maximum size): if this occurs, the other existing
DFA states that point to it must be cleaned (their DFA state pointer or
reference, stored in their NFA or DFA transitions, must be cleared/set
to null, so that they will only contain the list of pointers to outgoing
NFA states). The problem is how to look for a multi-state in the DFA
cache: this requires a fast lookup, but it can be implemented
efficiently using hash tables (by hashing the list of NFA states
represented in the DFA state).

Apparently, GNU awk does not use the cached DFA approach: it just uses
the NFA directly when the input text is shorter than about two dozen
characters, then builds the full DFA as soon as the input text
becomes larger (this explains the sudden, but moderate, increase in
time). But I've seen that GNU awk has the defect of this unlimited DFA
generation approach: its excessive use of memory when the input text
grows, because the number of DFA states added to the cache is
constantly growing with the input, and the time to match each character
from the input slowly increases too. At some point, it will crash and
exit due to memory exhaustion, when no more DFA states can be
stored. That's why the DFA cache MUST be kept under some limit.

I'll try to implement this newer approach first in Java (just because I
know that language better than Python, and because I think it is more
solid in terms of type-safety, so it can reduce the number of bugs to
fix before getting something stable).

In Java, there's a clean way to automatically clean up objects from
collections when they are no longer needed: you just need to use weak
references (the garbage collector will automatically clean up the older
objects when needed). But this approach is not easy to port, and in
fact, even if it can effectively solve some problems, it will still not
prevent the cache from growing up to the maximum VM size. For performance
reasons, I see little interest in storing more than about 1 million DFA
states in the cache (also because the hash table used would be less
efficient when looking up the key of an NFA multi-state
where the DFA state is stored). So I will probably not use weak
references, but will first use a maximum size (even if weak references
could help keep the cache at even lower bounds than the allowed
maximum, when VM memory size is more constrained: it is a good idea in all
Java programs to allow caches introduced only for performance reasons to
also collaborate with the garbage collector, in order to avoid the
explosion of all caches used in various programs or libraries). I don't
know if Python supports the concept of weak references for handling
performance caches.

Maybe I'll port it to Python later, but don't expect me to port it
to C/C++ (as a replacement for PCRE), because I now hate these unsafe
languages even though I have practiced them for many years: others can do
that for me once I have published my Java implementation.

davidchambers mannequin commented Apr 1, 2010

I would find this functionality very useful. While I agree that it's often simpler to extract the relevant information in several steps, there are situations in which I'd prefer to do it all in one go.

The application I'm writing at the moment needs to extract metadata from text files. This metadata actually appears as text at the top of each file. For example:

title: Example title
tags: Django, Python, regular expressions

Example title
=============

Here is the first paragraph.

I had expected something like this to get the job done:

meta = re.match(r'(?ms)(?:^(\S+):\s*(.*?)$\n)+^\s*$', contents_of_file)

Ideally in this case, meta.groups() would return:

('title', 'Example title', 'tags', 'Django, Python, regular expressions')
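
For what it's worth, a workaround that works with today's re, assuming the metadata block ends at the first blank line (the split-then-findall two-step is the kind of approach ezio suggested earlier in the thread):

import re

contents_of_file = 'title: Example title\ntags: Django, Python, regular expressions\n\nExample title\n=============\n\nHere is the first paragraph.\n'

# Split the metadata block off at the first blank line, then findall the pairs.
header = contents_of_file.split('\n\n', 1)[0]
meta = dict(re.findall(r'^(\S+):\s*(.*?)\s*$', header, re.M))
print(meta)   # {'title': 'Example title', 'tags': 'Django, Python, regular expressions'}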

@ezio-melotti ezio-melotti added the stale Stale PR or inactive for long period of time. label Apr 9, 2010
mrabarnett mannequin commented Nov 18, 2010

Earlier this week I discovered that .Net supports repeated capture and its API suggested a much cleaner approach than what Perl offered, so I'll be adding it to the regex module at:

http://pypi.python.org/pypi/regex

The new methods will follow the example of .group() & co.

Given a match object m, m.group(i) returns the last match of group i (or None if there's no match), so I'll be adding m.captures(i) to return a tuple of the captures (an empty tuple if there's no match). I'll also be adding m.starts(i), m.ends(i) and m.spans(i).

The issue for this work is bpo-2636.

Unit tests are welcome.
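
For completeness, this is how the API ended up looking in the regex module on PyPI (note it returns lists rather than the tuples mentioned above; the pattern is ezio's IPv4 example from earlier in the thread):

>>> import regex
>>> m = regex.match(r'^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1')
>>> m.group(2)      # last match only, same as re
'1'
>>> m.captures(2)   # every match of group 2
['168', '0', '1']
>>> m.spans(2)
[(4, 7), (8, 9), (10, 11)]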

BreamoreBoy mannequin commented Mar 22, 2014

Can this be closed as has happened with numerous other issues as a result of work done on the new regex module via bpo-2636?

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022