LFCS Regular Expression (Part 1)

Jarret B

When working with files and folders on a system, or searching through the contents of a file, you will need to understand ‘Regular Expressions’. A ‘Regular Expression’ is a sequence of characters that defines a pattern. The pattern can be used to search through a text file or a listing of files and folders.

Sometimes instead of ‘Regular Expression’ you may see ‘regex’ or ‘regexp’.

The parts of a ‘Regular Expression’ can be divided up in many ways, but I will separate them into a few categories to keep it simple.

The parts of the ‘Regular Expression’ are:

  1. Characters and Groups
  2. Anchors
  3. Modifiers
  4. Class/Range
  5. Quantifier
  6. Special Characters
NOTE: Part 1 of this article will deal with Characters and Groups, and Anchors.

Performing Searches

The majority of searches will be done with a command called ‘grep’. The ‘grep’ command allows you to specify ‘Regular Expression’ searches to perform on a file or a text stream.

To search a file, give ‘grep’ the ‘Regular Expression’ followed by the file name. The syntax is:

grep <regexp> <filename>

If the text is coming from a stream then the information is piped (|) from one command to ‘grep’. For example, to stream the output from a list of files and folders using the command ‘ls’ we would use a command like ‘ls | grep <regexp>’. The output of the command ‘ls’ is sent to the command ‘grep’ with whatever ‘regexp’ we want.
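Both forms can be tried with a minimal sketch like the one below. The file name ‘sample.txt’ and the sample text are made up for illustration and are not part of the article's examples.

```shell
# Create a small throwaway file (hypothetical contents) in a temporary folder.
workdir=$(mktemp -d)
printf 'first line\nsecond row\nthird line\n' > "$workdir/sample.txt"

# Search a file: grep <regexp> <filename>
from_file=$(grep 'line' "$workdir/sample.txt")

# Search a stream: the output of one command is piped to grep.
from_stream=$(printf 'notes.txt\nphoto.jpg\ntodo.txt\n' | grep 'txt')

echo "$from_file"
echo "$from_stream"

# Remove the temporary folder again.
rm -rf "$workdir"
```

In both cases ‘grep’ prints every line that contains the pattern somewhere in it.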

Characters and Groups

The most common searches will be for specific words, characters or groups of words.

To do a basic search for the characters ‘line’ we would use the command ‘grep line <filename>’. Keep in mind that we are not searching for the word ‘line’, but any four characters in a row which are ‘line’. The search is also case-sensitive. The search can return the lines of a file which contain the words ‘line’, ‘lines’, ‘in-line’, ‘new-line’ etc.

NOTE: If you wanted to perform a case-insensitive search then add the parameter ‘-i’ to the ‘grep’ command. The command would be ‘grep -i <regexp> <filename>’. You can also show the line number of each line on which a match occurs by adding the ‘-n’ parameter. You can get just a count of the matching lines by using the parameter ‘-c’, but the matching lines themselves will not be shown. A line that contains the pattern more than once is still counted only once.
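A short sketch of the three parameters, again using a made-up file (‘sample.txt’ and its contents are assumptions for illustration only):

```shell
# Create a small throwaway file (hypothetical contents).
workdir=$(mktemp -d)
printf 'Line one\nnothing here\nin-line text\nlines and lines\n' > "$workdir/sample.txt"

# Default search is case-sensitive, so 'Line one' is not matched.
sensitive=$(grep 'line' "$workdir/sample.txt")

# -i ignores case, so 'Line one' is matched as well.
insensitive=$(grep -i 'line' "$workdir/sample.txt")

# -n prefixes each matching line with its line number in the file.
numbered=$(grep -n 'line' "$workdir/sample.txt")

# -c prints only the number of matching lines, not the lines themselves.
count=$(grep -c 'line' "$workdir/sample.txt")

echo "$sensitive"
echo "$insensitive"
echo "$numbered"
echo "$count"

rm -rf "$workdir"
```

The last line of the sample contains ‘line’ twice, but ‘-c’ still counts it as one matching line.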

If you wanted to find all cases where the letters ‘in’ were followed by some other character then you can use a wild-card to represent one character. The single-character wild-card is a period (.). It matches any one character, not only letters. To find all matches for the letters ‘in’ followed by some other character the search string would be ‘in.’. The search would return words like ‘using’, ‘binary’, ‘input’, ‘information’, etc.

The period represents exactly one character. The asterisk (*) works differently: it repeats the previous character 0 or more times. For instance, a search for the string ‘ing*’ will find all cases of ‘in’, ‘ing’, ‘ingg’, ‘inggg’, etc. The ‘g’ will exist after ‘in’ 0 or more times. If the search string were ‘in*’ then the matches would be ‘i’, ‘in’, ‘inn’, etc.
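The difference between the period and the asterisk can be seen in a small sketch. The word lists are made up, and the ‘-o’ parameter (not covered in this article, but supported by GNU ‘grep’) is assumed here only to print the matched text instead of the whole line.

```shell
# 'in.' needs one more character after 'in', so the bare line 'in' is not matched.
dot_matches=$(printf 'using\nbinary\ninput\nin\n' | grep 'in.')

# 'ing*' is 'in' followed by zero or more 'g' characters.
# -o prints only the part of each line that matched.
g_star=$(printf 'in\ning\ningg\nig\n' | grep -o 'ing*')

# 'in*' is 'i' followed by zero or more 'n' characters.
n_star=$(printf 'i\nin\ninn\n' | grep -o 'in*')

echo "$dot_matches"
echo "$g_star"
echo "$n_star"
```

Without ‘-o’ the whole matching line is printed, which can hide how little of the line the pattern really needed.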

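You can also create a grouping of characters. Groups are placed inside parentheses. By default ‘grep’ uses basic regular expressions, in which a plain ‘(’ or ‘)’ is treated as a literal character, so each parenthesis must be preceded by a backslash (\) to act as a grouping character: ‘\(’ and ‘\)’. For most other special characters the backslash works the other way around: placing it before a character such as the period makes ‘grep’ see the character as a literal. If you add the parameter ‘-E’ then ‘grep’ uses extended regular expressions, where the parentheses create a group without the backslash.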

Let’s look at some examples.

Characters and Groups Examples

Let’s make a file to work with for some of the examples. In a Terminal type the command ‘grep --help > grephelp.txt’. The command will create a file named ‘grephelp.txt’ which will contain the help information listed when you type the command ‘grep --help’. The contents of this file are what we will use for the examples.

If we want to look for the letters ‘ine’ we would use the command “grep 'ine' grephelp.txt” and quite a bit shows up. Let’s get a count on this output by using the command “grep -c 'ine' grephelp.txt”. The result is a count of 15 matching lines as shown in Figure 1 along with the output for the command “grep 'ine' grephelp.txt”.

FIGURE 1

If I wanted to find all matches for the letters ‘in’ followed by some other character, then I would enter the command “grep 'in.' grephelp.txt”.

You can also match more than one unknown character by using more than one period. For example, the command “grep 'in..' grephelp.txt” would find words like ‘lines’, ‘line’, ‘binary’, ‘invocation’, etc.

Next, try the command “grep 'ing*' grephelp.txt”. Notice how it finds strings with ‘in’, ‘ing’ and ‘ingg’. Try the regex ‘in*’ and see how it finds ‘i’, ‘in’ and ‘inn’.

What if I wanted to search a file for either of two words? If I wanted to find the lines containing ‘before’ or ‘after’ then I would use the command “grep '\(before\|after\)' grephelp.txt”. Let’s look at the regex and see what is going on here. The regex is ‘\(before\|after\)’. Because ‘grep’ uses basic regular expressions by default, you need to place a backslash before each of the characters ‘(’, ‘|’ and ‘)’ so that they act as special characters rather than literals. The expression shows that we are looking for the word ‘before’ or the word ‘after’. The pipe ‘|’ here is used to represent ‘or’.

If I wanted to search for the words ‘matching’ and ‘matches’ I would start with the same base word of ‘match’ and add on either ‘ing’ or ‘es’ as shown in the command “grep 'match\(ing\|es\)' grephelp.txt”. The output is shown in Figure 2. More items can be added to the group to make more search items. Remember that anything before the parentheses is joined to each item in the group to make longer search strings.
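The group examples can be tried on a few made-up lines as sketched below. The sample text is an assumption for illustration. The ‘\|’ form in a basic regular expression relies on GNU ‘grep’ behavior, and the ‘-E’ (extended) form is shown next to it for comparison.

```shell
# A few hypothetical lines to search through.
text='before the match
after matching
no matches found
a matched pair'

# Basic regular expression: the group and the "or" need backslashes.
bre=$(printf '%s\n' "$text" | grep 'match\(ing\|es\)')

# Extended regular expression (-E): the same group without backslashes.
ere=$(printf '%s\n' "$text" | grep -E 'match(ing|es)')

# A group of two whole words: lines with 'before' or 'after'.
either=$(printf '%s\n' "$text" | grep -E '(before|after)')

echo "$bre"
echo "$ere"
echo "$either"
```

The lines with ‘match’ alone and with ‘matched’ are left out of the first two results because neither ‘ing’ nor ‘es’ follows the base word.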

FIGURE 2

Anchors

An Anchor is a character that ties a match to a specific position in a line. There are two Anchors to keep in mind.

The first Anchor is the ‘^’ which signifies the beginning of a line.

The second Anchor is the ‘$’ which denotes the end of the line.

With the Anchors we can ‘anchor’ our search to either the beginning or ending of a line.
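A minimal sketch of both Anchors on three made-up lines (the sample text is an assumption for illustration):

```shell
# Three hypothetical lines that all contain 'dog' in different positions.
sample='dog days
hot dog
dogs'

# '^dog' matches only when 'dog' is at the beginning of the line.
starts=$(printf '%s\n' "$sample" | grep '^dog')

# 'dog$' matches only when 'dog' is at the end of the line.
ends=$(printf '%s\n' "$sample" | grep 'dog$')

echo "$starts"
echo "$ends"
```

The line ‘dogs’ starts with ‘dog’ but ends with ‘s’, so it appears only in the first result.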

Anchor Examples

To give an example, let’s say we are searching through a list of files. We can use the ‘ls -l’ command. Open a Terminal and go to the Home folder by typing ‘cd ~’. Type the command ‘ls -l’. You should get a listing of files and folders. If the left-most character is a ‘d’ then the listing is for a directory. If the character is a dash (‘-’) then it is a regular file.

If you wanted to list only the folders you can use the command ‘ls -l | grep ^d’ or “ls -l | grep '^d'”. The output of the ‘ls -l’ command is piped to the ‘grep’ command which is used to search text. The ‘Regular Expression’ matches any line whose first character is a ‘d’.

NOTE: The character is case-sensitive. If you use an uppercase ‘D’ then no results would be found. Also, this ‘Regular Expression’ works with or without single quotes, but quoting is the safer habit because the shell gives its own special meaning to characters such as ‘*’, ‘|’ and ‘\’.
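The folder listing can be tried safely in a temporary folder instead of the Home folder. In this sketch the names ‘docs’ and ‘notes.txt’ are made up for illustration.

```shell
# Build a temporary folder holding one directory and one regular file.
workdir=$(mktemp -d)
mkdir "$workdir/docs"
touch "$workdir/notes.txt"

# Lines of 'ls -l' starting with 'd' are directories.
dirs_only=$(ls -l "$workdir" | grep '^d')

# Lines starting with a dash are regular files.
files_only=$(ls -l "$workdir" | grep '^-')

# The search is case-sensitive: an uppercase 'D' finds nothing here.
# '|| true' keeps the script going, since grep reports "no match" as a failure.
upper=$(ls -l "$workdir" | grep '^D' || true)

echo "$dirs_only"
echo "$files_only"
echo "uppercase result: [$upper]"

rm -rf "$workdir"
```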

If we want to find all lines which end with an ‘s’ the command would be ‘grep s$ grephelp.txt’. The output should be similar to Figure 3.

FIGURE 3

Your output may differ if your version of ‘grep’ prints different help text, since that changes the contents of the test file.

We can also search for all lines which end with the letters ‘ne’. The command would be ‘grep ne$ grephelp.txt’. Sample results are shown in Figure 4, but your results may vary.

FIGURE 4

If you wanted to apply more than one condition to a file, such as matching both the first letter and the last letters of a line, you can chain two searches. Run the first search, then pipe its output to the second. To find all lines which end with ‘ne’ we would use the command ‘grep ne$ grephelp.txt’. To find the lines starting with a ‘W’ the command is ‘grep ^W grephelp.txt’. To get the two to work together we pipe the output of the first to the second with the command ‘grep ne$ grephelp.txt | grep ^W’. The second ‘grep’ does not need an input file since its input comes from the first ‘grep’ command.
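The chained search can be sketched on a few made-up lines (the sample text is an assumption for illustration). The last command is an alternative that is not used in this article: a single pattern combining both Anchors with ‘.*’, which stands for any run of characters in between.

```shell
# A few hypothetical lines to filter.
text='Write one line
Write two lines
read one line
Wine'

# First keep the lines ending in 'ne', then keep those starting with 'W'.
piped=$(printf '%s\n' "$text" | grep 'ne$' | grep '^W')

# The same two conditions in one pattern: 'W' at the start, 'ne' at the end.
single=$(printf '%s\n' "$text" | grep '^W.*ne$')

echo "$piped"
echo "$single"
```

Both commands print the same two lines here. The piped form is easier to read and to build up one step at a time, while the single pattern needs only one ‘grep’ process.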

Practice ‘Regular Expressions’ as given in the examples. Try searches of your own. Understand how everything works so far before going on to the next article about ‘Regular Expressions’.
 