5

I have a large log file, and I want to extract a multi-line string between two strings: start and end.

The following is a sample from the input file:

start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end

The desired solution should print:

start wait for it...
    profit!
here end
start second match
win. end

I tried a simple regex but it returned everything from start spam. How should this be done?

Edit: Additional info on real-life computational complexity:

  • actual file size: 2GB
  • occurrences of 'start': ~ 12 M, evenly distributed
  • occurrences of 'end': ~800, near the end of the file.
  • Well, if you want to match between start and end, then it's normal that you get start spam as the beginning of the result... Could you clarify the behavior that you want?
    – lcoderre
    Commented Jul 8, 2014 at 19:35

4 Answers

19

This regex should match what you want:

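start(?:(?!start).)*?end

Use the re.findall method with the single-line (DOTALL) modifier re.S to get all the occurrences in a multi-line string. The inner group is non-capturing so that findall returns each whole match instead of a tuple of groups:

re.findall(r'start(?:(?!start).)*?end', text, re.S)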

See a test here.
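For anyone who wants to try this locally, here is a minimal sketch that runs the pattern against the sample from the question. The names SAMPLE, PATTERN and innermost_blocks are made up for illustration, and it assumes the whole text fits in memory, which may not hold for the 2GB file mentioned in the question.

```python
import re

# Sample text copied from the question.
SAMPLE = """start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end
"""

# After "start", consume one character at a time, but only while
# another "start" does not begin at the current position.
PATTERN = re.compile(r"start(?:(?!start).)*?end", re.S)


def innermost_blocks(text):
    """Return every start...end span that contains no further 'start'."""
    return [m.group(0) for m in PATTERN.finditer(text)]


for block in innermost_blocks(SAMPLE):
    print(block)
```

Using finditer and m.group(0) sidesteps the question of how findall treats groups, so it works the same whether or not the pattern contains capturing groups.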

1

Do it with code - basic state machine:

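# fi is assumed to be an open file object (or any iterable of lines)
in_block = False  # named in_block because open would shadow the built-in
tmp = []
for ln in fi:
    if 'start' in ln:
        # a new start discards whatever was collected since the previous one
        tmp = []
        in_block = True

    if in_block:
        tmp.append(ln)

    if 'end' in ln:
        in_block = False
        for x in tmp:
            print(x, end='')  # each line already ends with a newline
        tmp = []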
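If you would rather collect the blocks than print them, the same state machine can be wrapped in a generator. This is only a sketch: iter_blocks and its start/end parameters are names invented here, and it uses plain substring checks, so a word such as "pending" would also count as containing "end". Only the current candidate block is held in memory, which should suit a large file where most starts are never closed.

```python
def iter_blocks(lines, start="start", end="end"):
    """Yield each block running from the last 'start' line to the next 'end' line."""
    block = None  # None means no 'start' has been seen since the last yield
    for line in lines:
        if start in line:
            block = []  # a newer start replaces the pending candidate
        if block is not None:
            block.append(line)
            if end in line:
                yield "".join(block)
                block = None


# Sample text copied from the question.
SAMPLE = """start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end
"""

for found in iter_blocks(SAMPLE.splitlines(keepends=True)):
    print(found, end="")
```

For a real log you would pass an open file object instead of the split sample, since iterating over a file yields one line at a time.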
0

This is tricky to do because by default, the re module does not look at overlapping matches. The third-party regex module, which is installed from PyPI rather than shipped with Python, allows for overlapping matches.

https://pypi.python.org/pypi/regex

You'd want to use something like

regex.findall(pattern, string, overlapped=True)

If you're stuck in an environment where you can't install regex, it's still possible with some trickery using the standard re module. One brilliant person solved it here:

Python regex find all overlapping matches?

Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.

  • I added some information on the actual size of the log file. In this case, storing all overlapping matches would exceed the disk space of my computer. Commented Jul 9, 2014 at 12:18
  • Well, the solution I linked to returns an iterator, so you wouldn't actually need to store all overlapping matches, just one or two at a time. But given the format of the file you're trying to parse, the accepted solution is probably better for your purposes. Commented Jul 9, 2014 at 14:04
0

You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in "end".
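A rough sketch of that two-step idea, run on the sample from the question. CANDIDATE and blocks_ending_in_end are hypothetical names, and the text is assumed to fit in memory.

```python
import re

# Each match runs from a "start" up to the next "start" or "end";
# the trailing "end" is included only when that is what stopped the match.
CANDIDATE = re.compile(r"(?s)start.*?(?=end|start)(?:end)?")


def blocks_ending_in_end(text):
    """Keep only the candidates that were closed by 'end'."""
    return [m for m in CANDIDATE.findall(text) if m.endswith("end")]


# Sample text copied from the question.
SAMPLE = """start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end
"""

for block in blocks_ending_in_end(SAMPLE):
    print(block)
```

The pattern has no capturing groups, so findall returns the full matched strings and the endswith filter can be applied to them directly.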
