The asterisk wildcard in sed does some really weird things

CrazedNerd

Guest

In sed, the "*" by itself is just interpreted literally in my experience. For example, you could remove every asterisk in a document like this:

Code:
sed 's/*//g' file
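
For what it's worth, here is a small sketch of that literal behavior, assuming GNU sed; the sample text is made up for illustration. A "*" at the very start of a basic regular expression has nothing before it to repeat, so it stands for itself. Other sed implementations may differ, and "\*" is the portable spelling.

```shell
# Hypothetical sample text; assumes GNU sed.
sample='a*b*c'

# A leading * has nothing before it to repeat, so it is taken literally.
leading=$(printf '%s\n' "$sample" | sed 's/*//g')

# Escaping the asterisk is the portable way to say the same thing.
escaped=$(printf '%s\n' "$sample" | sed 's/\*//g')

echo "$leading"   # abc
echo "$escaped"   # abc
```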

However, when you start using it to try and match regular expressions, it does something far stranger than bash or grep does. These are the contents of my "perfect sentence" document that I use to test text editing at times:

Code:
The quick brown fox jumped over the lazy dog. It was very lazy today.
The quick brown fox jumped over the lazy dog. It was very lazy today.

If I do this with sed:

Code:
sed 's/T*/SQUIRMY/g' perfect-sentence

I get this, and I have no idea why, and the documentation doesn't clarify this behavior at all:

Code:
SQUIRMYhSQUIRMYeSQUIRMY SQUIRMYqSQUIRMYuSQUIRMYiSQUIRMYcSQUIRMYkSQUIRMY SQUIRMYbSQUIRMYrSQUIRMYoSQUIRMYwSQUIRMYnSQUIRMY SQUIRMYfSQUIRMYoSQUIRMYxSQUIRMY SQUIRMYjSQUIRMYuSQUIRMYmSQUIRMYpSQUIRMYeSQUIRMYdSQUIRMY SQUIRMYoSQUIRMYvSQUIRMYeSQUIRMYrSQUIRMY SQUIRMYtSQUIRMYhSQUIRMYeSQUIRMY SQUIRMYlSQUIRMYaSQUIRMYzSQUIRMYySQUIRMY SQUIRMYdSQUIRMYoSQUIRMYgSQUIRMY.SQUIRMY SQUIRMYISQUIRMYtSQUIRMY SQUIRMYwSQUIRMYaSQUIRMYsSQUIRMY SQUIRMYvSQUIRMYeSQUIRMYrSQUIRMYySQUIRMY SQUIRMYlSQUIRMYaSQUIRMYzSQUIRMYySQUIRMY SQUIRMYtSQUIRMYoSQUIRMYdSQUIRMYaSQUIRMYySQUIRMY.SQUIRMY
SQUIRMYhSQUIRMYeSQUIRMY SQUIRMYqSQUIRMYuSQUIRMYiSQUIRMYcSQUIRMYkSQUIRMY SQUIRMYbSQUIRMYrSQUIRMYoSQUIRMYwSQUIRMYnSQUIRMY SQUIRMYfSQUIRMYoSQUIRMYxSQUIRMY SQUIRMYjSQUIRMYuSQUIRMYmSQUIRMYpSQUIRMYeSQUIRMYdSQUIRMY SQUIRMYoSQUIRMYvSQUIRMYeSQUIRMYrSQUIRMY SQUIRMYtSQUIRMYhSQUIRMYeSQUIRMY SQUIRMYlSQUIRMYaSQUIRMYzSQUIRMYySQUIRMY SQUIRMYdSQUIRMYoSQUIRMYgSQUIRMY.SQUIRMY SQUIRMYISQUIRMYtSQUIRMY SQUIRMYwSQUIRMYaSQUIRMYsSQUIRMY SQUIRMYvSQUIRMYeSQUIRMYrSQUIRMYySQUIRMY SQUIRMYlSQUIRMYaSQUIRMYzSQUIRMYySQUIRMY SQUIRMYtSQUIRMYoSQUIRMYdSQUIRMYaSQUIRMYySQUIRMY.SQUIRMY

Not only does it alter the start of the word "The" as intended, it inserts "SQUIRMY" between all the other letters in the document! Maybe God doesn't actually exist...and the universe is governed by chaos...which is what I believe anyway.
 


It looks like the asterisk is matching zero or more occurrences of the previous pattern and /g is applying the substitution to every occurrence of the pattern.
 
Yeah, it would make sense if it just did that for the word "The", but to me it's still weird that it gets put between every letter.
 
That's where the /g comes in. The spot before every letter matches "0 or more occurrences" of T. So it's replacing every match of nothing (0 occurrences) with SQUIRMY.
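
A shorter input makes the zero-occurrence matches easier to see. This is only a sketch, assuming GNU sed; the one-word input and the X replacement are stand-ins for the sentence and SQUIRMY.

```shell
# T* (zero or more T): the real T matches, and so do the empty spots later on.
zero_or_more=$(printf 'The\n' | sed 's/T*/X/g')

# TT* (one T, then zero or more): an empty spot can no longer match.
one_or_more=$(printf 'The\n' | sed 's/TT*/X/g')

echo "$zero_or_more"   # XhXeX
echo "$one_or_more"    # Xhe
```

Requiring at least one T, as in the second command, is what most people expect "T*" to mean.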
 
Well, the asterisk in sed is doing what it does in a much less straightforward manner than bash... it's kinda weird in grep too, but at least in grep what it does is more benign and predictable.

There's only one occurrence of T* in that sentence; the fact it's treating each letter like a match is still strange to me.
 
Sed is pretty straightforward once you learn it.
The one occurrence of T* is the key. In a regex, "T*" is not T plus everything after it (that would be "T.*"); the * means "zero or more of the preceding character". It was doing exactly what you told it to do: match zero or more T's, and an empty match still counts.

If you had left out the "/g", it would have only replaced the first match of T*
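
As a rough illustration of the point about "/g" (assuming GNU sed; the shortened sample lines are made up), dropping the flag leaves only the first match replaced:

```shell
# Without /g only the first match on each line is replaced.
first_only=$(printf 'The quick brown fox\n' | sed 's/T*/SQUIRMY/')

# On a line with no leading T, the first match is the empty string at the start.
no_t=$(printf 'quick\n' | sed 's/T*/SQUIRMY/')

echo "$first_only"   # SQUIRMYhe quick brown fox
echo "$no_t"         # SQUIRMYquick
```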

I'm pretty sure I posted this before, but I'll do it again, it may help you out. I found it a LOOONG time ago. ;)
 

Attachments

  • sed.txt
    15.8 KB
No need to be silly, I already put your tricks in a file on my computer.

Once you learn it

It's actually impossible to understand a program like sed without understanding the minutiae and mechanics, which is something I've been working on. If I have to evoke responses from strangers on the internet so that I'm lessening the frustration for my brain, then so be it. What I uncovered with this experiment is what a "thing" is to sed...sort of!
 
If you’re expecting "T*" to match only the word "The" - it’s not going to do that.
Sed uses regular expressions, not globbing. The asterisk works differently in regular expressions.
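
To make the globbing versus regex difference concrete, here is a minimal sketch. It assumes a POSIX shell and GNU sed, and the variable names are only for illustration. In a glob, "*" means any run of characters; in a regex, that job is done by something like "[[:alpha:]]*" or ".*" after the T.

```shell
word='The'

# Shell globbing: * stands for "any run of characters", so T* matches The.
case "$word" in
  T*) glob_result='glob matched' ;;
  *)  glob_result='glob did not match' ;;
esac

# Regex: T followed by zero or more letters is the closest equivalent.
regex_result=$(printf 'The quick brown fox\n' | sed 's/T[[:alpha:]]*/SQUIRMY/g')

echo "$glob_result"    # glob matched
echo "$regex_result"   # SQUIRMY quick brown fox
```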

@MattWinter and @Dart hit the nail on the head with their posts.

For a short word like "The", you may as well just type the whole word, instead of using a regex to match it. It’s not worth writing a regex as it would require more than three characters.

It’s not the minutiae of sed that you need to learn, it’s more about learning how regular expressions work and about some of the slightly different implementations of regex syntax.

Some programs use Perl style regular expressions, others use their own implementation, but with slight differences, or extensions.

But the core of regular expression syntax is usually pretty much the same.

And there are plenty of examples/resources/tutorials all over the internet.
 
@MattWinter and @Dart hit the nail on the head with their posts.
I still haven't seen an explanation as to why "T*" matches in between every letter when used as a regex, but not the spaces in the sentence. If my SQUIRMY matched each letter, then it would seem to do what it's supposed to do.

Plus, there are lots of tutorials and videos on sed online, but it's not particularly helpful when it's either just going over something you understand, not addressing something that's immediately useful, or is an example of some long sed command with a brief explanation.


The asterisk works differently in regular expressions.

That's part of what I was looking for, where regexes are seemingly just a particular type of pattern search with POSIX standards.
 
It isn't actually "matching" the other letters. It's matching the zero occurrences of T. Also, it's treating spaces the same as the letters. It might be more accurate to say "character" rather than letter, because it's also treating the period the same as letters and spaces.

So, in your example, the "T" is one occurrence of T, so it gets replaced with SQUIRMY. The "h" itself doesn't match, so it stays. Then, right before the "e", are there zero or more occurrences of "T"? Yes, zero of them, so that empty spot is replaced with "SQUIRMY". The "e" stays. Same before the space, and so on to the end of the line.

That's why you have "SQUIRMY" + "h" + "SQUIRMY" + "e" + "SQUIRMY" + " " + "SQUIRMY" ... etc
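
A short replacement makes it easier to see that spaces and periods are kept just like letters. This is a sketch assuming GNU sed; the input "a b." and the underscore are arbitrary choices for illustration.

```shell
# An underscore replacement keeps the original characters easy to spot.
marked=$(printf 'a b.\n' | sed 's/T*/_/g')

# Deleting the inserted underscores gives the original text back, space included.
restored=$(printf '%s\n' "$marked" | tr -d '_')

echo "$marked"     # _a_ _b_._
echo "$restored"   # a b.
```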
 
Okay, that's better, but it still doesn't make any sense why the space in between two letters (a non-character) matches the expression whereas whitespace does not.
 
What makes you think whitespace is being treated differently?
The above copied and pasted example... sed overall actually does recognize whitespace; one of the things I love about it is that you can search and replace with whitespace as part of the strings. For example, I just made this document:

Code:
echo A R E Y O U O K A Y >> whitespace

Now, you can replace all of the whitespace (not including blank lines, because they wouldn't match) like this. Below the command for changing the file is the output:

Code:
sed 's/ /-/g' whitespace
A-R-E-Y-O-U-O-K-A-Y

Sed just did EXACTLY what I told it to do.

...but don't focus on that, the important thing to me is what the heck is being matched in between the letters "h" and "e", and "q" and "u", because there is LITERALLY nothing there, so how is sed finding a match between letters?
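
One way to see what is being matched is to put "&" (the matched text) in the replacement. This is a sketch, assuming GNU sed; the brackets are only there to make the match visible.

```shell
# & expands to whatever the pattern matched; the brackets mark where each match sits.
shown=$(printf 'The\n' | sed 's/T*/[&]/g')

echo "$shown"   # [T]h[]e[]
```

The empty "[]" pairs are matches of zero T's: the regex succeeds at that position without consuming any character.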
 
Part of the reason why I wanted to flesh out this small aspect of sed regular expressions is that asterisks have a small but not very specific function in sed. I had this document a few months ago where the headers were like '**TITLE**', so I decided to remove the asterisks when they are used in that fashion. However, it ended up erasing the entire document:

Code:
sed '/**/d' <file>

And I was utterly confused as to the why of it, because when you do this, it simply purges your document of asterisks without affecting anything else:

Code:
sed 's/*//g' <file>
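
A likely explanation for the erased document, sketched here under the assumption of GNU sed with a made-up two-line file: in '/**/d' the first asterisk is literal and the second means "zero or more of it", so the address matches every line, including lines with no asterisk at all.

```shell
# Hypothetical two-line document with a '**TITLE**' style header.
doc=$(printf '**TITLE**\nbody text\n')

# ** is a literal * followed by "zero or more", so every line matches and is deleted.
all_gone=$(printf '%s\n' "$doc" | sed '/**/d')

# Escaping both asterisks only matches lines that really contain "**".
title_gone=$(printf '%s\n' "$doc" | sed '/\*\*/d')

echo "[$all_gone]"    # []
echo "$title_gone"    # body text
```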

There are actually two ways you can get rid of a string of "**":

Code:
sed 's/\*\*//g'

sed 's/*\|*//g'

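The first gets rid of exactly "**" wherever it appears in a line. The second matches a single "*" on either side of the "\|", so it removes every asterisk, which also makes "**" disappear.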
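
A quick check on a line with both a single and a double asterisk suggests the two commands are not quite equivalent. This sketch assumes GNU sed ("\|" is a GNU extension in basic regular expressions), and the sample text is made up.

```shell
# Hypothetical sample with one lone asterisk and one pair.
sample='a*b**c'

# Matches only two asterisks in a row.
pairs_only=$(printf '%s\n' "$sample" | sed 's/\*\*//g')

# Each side of \| is a leading (literal) *, so any single asterisk matches.
every_star=$(printf '%s\n' "$sample" | sed 's/*\|*//g')

echo "$pairs_only"   # a*bc
echo "$every_star"   # abc
```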

And here's another way to just remove everything in a document, which I found out the hard way!

Code:
sed '/$/d' <file>

Many ways to skin a cat ;)
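
For reference, '/$/d' wipes everything because "$" matches the end of every line, and every line has an end. A sketch with a made-up three-line file, compared against '/^$/d', which only matches lines that are empty from start to end:

```shell
# Hypothetical document: two text lines with a blank line between them.
doc=$(printf 'first\n\nsecond\n')

# $ matches at the end of every line, so every line is deleted.
everything=$(printf '%s\n' "$doc" | sed '/$/d')

# ^$ only matches lines with nothing between start and end, i.e. blank lines.
blank_only=$(printf '%s\n' "$doc" | sed '/^$/d')

echo "[$everything]"   # []
echo "$blank_only"     # first, then second, on separate lines
```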
 
Despite all of my thinking (clearly not worth a huge amount...) and community input, the way that sed sees a "zero or more occurrence" and white space baffles me. I decided to try matching based on zero or more occurrences of white space:

Code:
sed 's/ */SQUIRMY/g' perfect-sentence

And not only does it fill the white space like I thought it would, it also inserts the replacement both before and after every letter (into "the nothing"):

Code:
SQUIRMYTSQUIRMYhSQUIRMYeSQUIRMYqSQUIRMYuSQUIRMYiSQUIRMYcSQUIRMYkSQUIRMYbSQUIRMYrSQUIRMYoSQUIRMYwSQUIRMYnSQUIRMYfSQUIRMYoSQUIRMYxSQUIRMYjSQUIRMYuSQUIRMYmSQUIRMYpSQUIRMYeSQUIRMYdSQUIRMYoSQUIRMYvSQUIRMYeSQUIRMYrSQUIRMYtSQUIRMYhSQUIRMYeSQUIRMYlSQUIRMYaSQUIRMYzSQUIRMYySQUIRMYdSQUIRMYoSQUIRMYgSQUIRMY.SQUIRMYISQUIRMYtSQUIRMYwSQUIRMYaSQUIRMYsSQUIRMYvSQUIRMYeSQUIRMYrSQUIRMYySQUIRMYlSQUIRMYaSQUIRMYzSQUIRMYySQUIRMYtSQUIRMYoSQUIRMYdSQUIRMYaSQUIRMYySQUIRMY.SQUIRMY
SQUIRMYTSQUIRMYhSQUIRMYeSQUIRMYqSQUIRMYuSQUIRMYiSQUIRMYcSQUIRMYkSQUIRMYbSQUIRMYrSQUIRMYoSQUIRMYwSQUIRMYnSQUIRMYfSQUIRMYoSQUIRMYxSQUIRMYjSQUIRMYuSQUIRMYmSQUIRMYpSQUIRMYeSQUIRMYdSQUIRMYoSQUIRMYvSQUIRMYeSQUIRMYrSQUIRMYtSQUIRMYhSQUIRMYeSQUIRMYlSQUIRMYaSQUIRMYzSQUIRMYySQUIRMYdSQUIRMYoSQUIRMYgSQUIRMY.SQUIRMYISQUIRMYtSQUIRMYwSQUIRMYaSQUIRMYsSQUIRMYvSQUIRMYeSQUIRMYrSQUIRMYySQUIRMYlSQUIRMYaSQUIRMYzSQUIRMYySQUIRMYtSQUIRMYoSQUIRMYdSQUIRMYaSQUIRMYySQUIRMY.SQUIRMY
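
The same zero-occurrence rule seems to explain this output: " *" also matches zero spaces, so it succeeds at the empty spots around the letters. A sketch, assuming GNU sed, with a shortened input and a dash in place of SQUIRMY; writing two spaces before the asterisk requires at least one real space.

```shell
# One space then *: zero or more spaces, so empty spots match too.
zero_or_more=$(printf 'a b\n' | sed 's/ */-/g')

# Two spaces then *: one space, then zero or more, so only real spaces match.
one_or_more=$(printf 'a b\n' | sed 's/  */-/g')

echo "$zero_or_more"   # -a-b-
echo "$one_or_more"    # a-b
```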
 
