Forgiving regex to extract key-value pairs from plain text files

Recording data by hand (typing it in yourself) in plain text files is still a viable option, even if it is not that common.

If what you are recording consists of multiple data values, you need some kind of key-value format. A simple format like this does the job:

First key:
Lorem ipsum dolor
 
Second key:
Nunc volutpat cursus

The rules can be summarized like this:

the key ends with a colon (:)
the value is on the next line
key-value pairs are delimited by empty lines

Whatever format you choose, you have to leave room for human error; stray extra spaces, for example, are very common.

You might also want to allow some flexibility, for instance when you expect long strings of text that should be broken into multiple lines for readability.

Here's an entry with some exaggerated formatting issues:

Alpha:
Lorem ipsum
 
Beta gamma:
dolor sit amet
consectetur adipiscing
elit
 
delta: quam vehicula   
 
Epsilon:-Zeta:
Curabitur interdum massa
 
   Eta:
Maecenas ac felis
 
Theta:
 
 
 
Iota:   
Morbi at lobortis

In the end, if you want to extract a list of keys and a list of values like the ones below, you need a regex that goes beyond the basics:

[
    [
        "Alpha",
        "Beta gamma",
        "delta",
        "Epsilon:-Zeta",
        "Eta",
        "Theta",
        "Iota"
    ],
    [
        "Lorem ipsum",
        "dolor sit amet\nconsectetur adipiscing\nelit",
        "quam vehicula",
        "Curabitur interdum massa",
        "Maecenas ac felis",
        "",
        "Morbi at lobortis"
    ]
]

The regex that allows this amount of forgiveness looks like this:

[ ]*(.+):[ ]*\n?((?:.+\n?[^\s])*)

Use this pattern with PHP's preg_match_all function, and you have a solid starting point for working with the data.
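As a minimal sketch of that step, the snippet below runs the pattern over the sample entry and pairs keys with values. The "/" delimiters, the variable names, and building the text from an array of lines are assumptions made for illustration; in practice the text would more likely be read from a file.

```php
<?php
// The pattern from the article, wrapped in "/" delimiters as PCRE requires.
// Single quotes keep the backslash escapes intact for the regex engine.
$pattern = '/[ ]*(.+):[ ]*\n?((?:.+\n?[^\s])*)/';

// The sample entry, built line by line so the stray spaces stay visible.
// Assumption: real input might instead come from file_get_contents()
// on a hypothetical file such as 'notes.txt'.
$text = implode("\n", [
    'Alpha:',
    'Lorem ipsum',
    '',
    'Beta gamma:',
    'dolor sit amet',
    'consectetur adipiscing',
    'elit',
    '',
    'delta: quam vehicula   ',
    '',
    'Epsilon:-Zeta:',
    'Curabitur interdum massa',
    '',
    '   Eta:',
    'Maecenas ac felis',
    '',
    'Theta:',
    '',
    '',
    '',
    'Iota:   ',
    'Morbi at lobortis',
]);

// With the default flags, $matches[0] holds the full matches,
// $matches[1] the captured keys and $matches[2] the captured values.
preg_match_all($pattern, $text, $matches);
$keys   = $matches[1];
$values = $matches[2];

// Pair each key with its value by position.
foreach ($keys as $i => $key) {
    echo $key, ' => ', json_encode($values[$i]), "\n";
}
```

Dropping `$matches[0]` and keeping only the two capture groups gives the keys-and-values structure shown earlier. If every key is known to be unique, `array_combine($keys, $values)` would turn the two lists into a single associative array.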

To deconstruct the pattern, head over to RegEx101, where you will see the captured groups highlighted and the tokens annotated.
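For an offline view of the same breakdown, here is a sketch of the pattern rewritten in PCRE's extended mode (the x modifier), where whitespace outside character classes and anything after a # are ignored. The comments are one reading of each token, and the short sample string and variable names are made up for the comparison.

```php
<?php
// The article's pattern in extended mode, one token per line.
$annotated = <<<'REGEX'
/
[ ]*        # optional spaces before the key
(.+)        # group 1: the key; greedy, so it runs to the last colon on the line
:           # the colon that ends the key
[ ]*        # optional spaces after the colon
\n?         # optional line break, so a value may also start on the key's line
(           # group 2: the value
  (?:       #   one chunk of value text, repeated as often as possible
    .+      #     characters up to the end of the line
    \n?     #     optional line break inside a multi-line value
    [^\s]   #     a chunk must end on a non-whitespace character, which
            #     stops the value at blank lines and drops trailing spaces
  )*        #   zero chunks are allowed, giving an empty value
)
/x
REGEX;

// Compact form from the article, for comparison.
$compact = '/[ ]*(.+):[ ]*\n?((?:.+\n?[^\s])*)/';

// A short made-up sample: a plain pair, a same-line value with
// trailing spaces, and a key with no value.
$sample = "Alpha:\nLorem ipsum\n\ndelta: quam vehicula   \n\nTheta:\n";

preg_match_all($annotated, $sample, $a);
preg_match_all($compact, $sample, $c);

// Both forms should capture the same keys and values.
var_dump($a === $c);
print_r($a[1]);
print_r($a[2]);
```

The `[ ]` character classes matter here: in extended mode a bare space would be ignored, while a space inside a class is still matched literally.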

2021-01-05