Find shortest matches between two strings

Question

I have a large log file, and I want to extract a multi-line string between two strings: start and end.

The following is a sample from the input file:

start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end

The desired solution should print:

start wait for it...
    profit!
here end
start second match
win. end

I tried a simple regex but it returned everything from start spam. How should this be done?

Edit: Additional info on real-life computational complexity:

actual file size: 2GB
occurrences of 'start': ~ 12 M, evenly distributed
occurrences of 'end': ~800, near the end of the file.

Well, if you want to match between start and end, then it's normal that you get start spam as the beginning result... Could you clarify the behavior that you want? — lcoderre, Commented Jul 8, 2014 at 19:35

famousgarkin · Accepted Answer · 2014-07-08 20:14:52Z

19

This regex should match what you want:

(start((?!start).)*?end)

Use the re.findall method with the single-line (DOTALL) modifier re.S, so that the dot also matches newlines, to get all occurrences in a multi-line string:

re.findall('(start((?!start).)*?end)', text, re.S)

See a test here.
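
For reference, here is a minimal runnable sketch of this pattern applied to the sample from the question. It assumes the whole input fits in memory as one string, and the variable names are only illustrative. The groups are written as non-capturing (?:...) here, because with capturing groups re.findall returns a tuple per match rather than the matched text.

```python
import re

# Sample input from the question
text = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "    profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)

# (?:(?!start).)*? consumes one character at a time, but only while the
# text at the current position does not begin another 'start'.
pattern = re.compile(r'start(?:(?!start).)*?end', re.S)

matches = pattern.findall(text)
for m in matches:
    print(m)
    print('---')
```

Each element of matches is the full text of one block, running from the last start before an end up to that end.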

edited Jul 8, 2014 at 20:14

answered Jul 8, 2014 at 19:40

famousgarkin


Good answer and demo on regex101. The key that I was missing was the negative lookahead. Really useful.
– Eero Aaltonen
Commented Jul 9, 2014 at 9:25
Working in JS as well.
– semanser
Commented Aug 11, 2017 at 9:33
Could you explain ((?!start).)?
– roschach
Commented Jan 27, 2019 at 10:32
@FrancescoBoi See Tempered Greedy Token - What is different about placing the dot before the negative lookahead.
– Wiktor Stribiżew
Commented Aug 14, 2019 at 13:54
In case you start having performance issues using this pattern, use re.findall(r'start[^se]*(?:s(?!tart)[^se]*|e(?!nd)[^se]*)*end', text)
– Wiktor Stribiżew
Commented Aug 28, 2019 at 18:33


gkusner · Answer · 2014-07-08 19:49:42Z

1

Do it with code - basic state machine:

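in_block = False  # True once a line containing 'start' has been seen
block = []        # lines collected since the most recent 'start'
with open('input.log') as fi:  # placeholder file name
    for ln in fi:
        if 'start' in ln:
            # A newer 'start' discards whatever was collected before it
            block = []
            in_block = True

        if in_block:
            block.append(ln)

        if 'end' in ln and in_block:
            # Lines keep their own newline, so do not add another one
            print(''.join(block), end='')
            block = []
            in_block = False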

answered Jul 8, 2014 at 19:49

gkusner


TheSoundDefense · Answer · 2014-07-08 19:38 (last edited by Community Bot, 2017-05-23 12:18:03Z)

0

This is tricky to do because by default, the re module does not return overlapping matches. The third-party regex module, which is installed separately from PyPI and is not part of the standard library, allows for overlapping matches.

https://pypi.python.org/pypi/regex

You'd want to use something like

regex.findall(pattern, string, overlapped=True)

If you can't install regex and have to stay with the standard re module, it's still possible with some trickery. One brilliant person solved it here:

Python regex find all overlapping matches?

Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.
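
A rough sketch of this idea using only the standard re module. It relies on a common trick for overlapping matches, wrapping the pattern in a capturing lookahead, which is an assumption about the approach rather than a quote from the linked answer. The sample text and variable names are illustrative.

```python
import re

# Sample input from the question
text = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "    profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)

# The lookahead consumes nothing, so finditer yields one lazy candidate
# for every 'start' position, and the candidates may overlap.
candidates = re.finditer(r'(?=(start.*?end))', text, re.S)

shortest = {}
for m in candidates:
    # Candidates stopping at the same 'end' arrive in order of their
    # start position, so the last one stored is the shortest.
    shortest[m.end(1)] = m.group(1)

result = list(shortest.values())
print(result)
```

Only one candidate per end position is kept at a time, but every start triggers its own scan forward to the next end, so this can get very slow when there are many starts far from any end.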

edited May 23, 2017 at 12:18

CommunityBot


answered Jul 8, 2014 at 19:38

TheSoundDefense


I added some information on the actual size of the log file. In this case, storing all overlapping matches would exceed the disk space of my computer.
– Eero Aaltonen
Commented Jul 9, 2014 at 12:18
Well, the solution I linked to returns an iterator, so you wouldn't actually need to store all overlapping matches, just one or two at a time. But given the format of the file you're trying to parse, the accepted solution is probably better for your purposes.
– TheSoundDefense
Commented Jul 9, 2014 at 14:04


David Ehrmann · Answer · 2014-07-08 19:42:12Z

0

You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in "end".
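
A minimal sketch of how that could look, assuming the text is already loaded into a string (the sample and variable names are illustrative):

```python
import re

# Sample input from the question
text = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "    profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)

# Each candidate runs from one 'start' up to the next 'start' or 'end';
# the trailing 'end' is included only when that is what stopped it.
candidate_re = re.compile(r'(?s)start.*?(?=end|start)(?:end)?')

# Keep only the candidates that were closed by an 'end'
matches = [m for m in candidate_re.findall(text) if m.endswith('end')]
print(matches)
```

The regex never has to back up past a start, and the discarded candidates (such as "start spam\n") are simply filtered out afterwards.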

answered Jul 8, 2014 at 19:42

David Ehrmann
