How to shorten first-line characters in multiple files

jhcuarta

Hi
I have a series of .fasta files whose first line looks like this:
file 1: >Vibrio_cholerae_strain_1Mo
file 2: >Vibrio_cholerae_2012EL-1097_89x
file 3: >Vibrio_cholerae_strain_4536

I would like to remove the "Vibrio_cholerae_" characters from the first line, leaving the ">" symbol. It is worth noting that this is for hundreds of files.
 


OK. No probs.
The way we can identify the word to remove is by the ">Vibrio_cholerae_" - which is common to all of them.
So we just need to come up with a regular expression that will describe that.
We want to take everything from ">Vibrio_cholerae_" up to the next space character.
So I think this regex will work: >Vibrio_cholerae_[^ ]*
The [^ ]* matches any run of non-space characters, so the match extends up to the next space (or to the end of the line if there is no space).
And we'll replace that pattern with > , including a space, to keep the > separate from whatever other data is on that first line (if any!).

We'll use sed to perform the edits.
I'm pretty certain it will work. But to make sure, we'll test the regex is correct, without actually editing any files.
So pick one of your .fasta files and carefully type (or copy/paste) the following command:
Bash:
sed 's/>Vibrio_cholerae_[^ ]*/> /g' filename.fasta | head -n 2
Where filename.fasta is the actual name of the .fasta file you want to test with.
That will attempt to replace ">Vibrio_cholerae_....." in the first line with "> " and will show you the first two lines in the processed file.

If the output of the first line doesn't look right, let me know what you see and how you actually want it to appear and I'll see if I can tweak things.
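
The single-file check above can also be run across every file before committing to an in-place edit. The following is a minimal sketch, not part of the original reply: it builds two made-up sample files in a scratch directory (the names file1.fasta and file3.fasta and the trailing word "extra" are assumptions for illustration) and prints each header before and after the substitution, which makes it easy to confirm that the part you want to keep actually survives.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>Vibrio_cholerae_strain_1Mo\nACGT\n' > file1.fasta
printf '>Vibrio_cholerae_strain_4536 extra\nACGT\n' > file3.fasta

# Dry run: show the original and the rewritten first line of every file.
preview=$(
    for f in *.fasta; do
        before=$(head -n 1 "$f")
        after=$(sed 's/>Vibrio_cholerae_[^ ]*/> /g' "$f" | head -n 1)
        printf '%s: %s -> %s\n' "$f" "$before" "$after"
    done
)
printf '%s\n' "$preview"
```

Nothing is written back to the files here, so the loop is safe to repeat while adjusting the pattern.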

But if you're happy with what you see, choose ONE of the following commands to edit all of the .fasta files in one go. But before doing so, please read the rest of my post!
Bash:
sed -i.bak -s 's/>Vibrio_cholerae_[^ ]*/> /g' *.fasta
OR
Bash:
sed -i -s 's/>Vibrio_cholerae_[^ ]*/> /g' *.fasta

The -s flag tells sed to treat each file as a separate file, rather than as one massive, single stream of data.
The -i flag tells sed to edit the files in-place - so we aren't going to redirect the output to a new file, we're going to overwrite the original files themselves.

If the -i flag has a file extension specified immediately after it, then sed creates a backup of the original files with that extension appended to the end.

The first sed command edits all of the .fasta files in place AND creates backup copies of the original files, with an appended .bak extension.

The second command edits all .fasta files in place, but does NOT create a backup.

The first option is safer, because it creates a backup of your original data files.
Whereas the second one does not. So the first option might be the best one to take.
If you don't need the backup files afterwards, you could always remove them with rm.
e.g.
rm *.fasta.bak
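
If an edit turns out to be wrong, the .bak files can also be used to put the originals back. A minimal sketch, assuming the backups were made with the .bak suffix as described above; the sample files are made up for the demonstration and live in a scratch directory.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>edited\nACGT\n' > sample.fasta
printf '>Vibrio_cholerae_strain_1Mo\nACGT\n' > sample.fasta.bak

# Copy each backup over its edited counterpart, keeping the backup itself.
for backup in *.fasta.bak; do
    cp -- "$backup" "${backup%.bak}"
done

restored=$(head -n 1 sample.fasta)
printf '%s\n' "$restored"
```

Using cp rather than mv leaves the backups in place, so they can still be removed with rm once the final result has been checked.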


Now you've read the whole post - go ahead and pick one of the final sed commands to edit the files.
Anyway, hopefully that helps!
 
Hi
Thanks for the quick response, but I tried both commands and they eliminate everything after "Vibrio_cholerae_". I need to preserve what's next to it, for instance:

file 1: >strain_1Mo
file 2: >2012EL-1097_89x
file 3: >strain_4536

Thanks in advance, really grateful
 
Sorry about that. I thought you meant you wanted to remove the entire word starting from Vibrio_cholerae_, so I took my fix a bit too far. After posting my initial reply, I posted replies in a couple of other threads before falling asleep. It was almost 2am here, ha ha!

Thanks to @digitaltrails for chiming in with the correct amendment to my fix!
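
The amended command is not quoted above. As a minimal sketch of what such an amendment could look like, assuming the only change needed is to drop the [^ ]* part so that just the literal prefix is removed: the 1 address (first line only) and the ^ anchor are additions for illustration, GNU sed is assumed, and the sample files are made up.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>Vibrio_cholerae_strain_1Mo\nACGT\n' > file1.fasta
printf '>Vibrio_cholerae_2012EL-1097_89x\nACGT\n' > file2.fasta

# On line 1 of each file, remove only the literal prefix after ">".
# -i.bak edits in place and keeps a .bak copy of every original.
sed -i.bak '1s/^>Vibrio_cholerae_/>/' *.fasta

header1=$(head -n 1 file1.fasta)
header2=$(head -n 1 file2.fasta)
printf '%s\n%s\n' "$header1" "$header2"
```

Because nothing after the prefix is part of the match, the strain name is left untouched.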
 
Hi
I was wondering if you could help me out a little bit more.
After the first line is edited:
file 1: >strain_1Mo
file 2: >2012EL-1097_89x
file 3: >strain_4536

I need to add ".chr1" after the last character,

for instance
file 1: >strain_1Mo.chr1
file 2: >2012EL-1097_89x.chr1
file 3: >strain_4536.chr1

Any help would be greatly appreciated
 
You'd have to start with the original files and capture a bit of each match so you can append .chr1 onto it:

Bash:
sed 's/>Vibrio_cholerae_\([^[:space:]]*\)/>\1.chr1/' filename.fasta

The \( \) enclose a capturing group that follows Vibrio_cholerae_. The group to be captured is [^[:space:]]*: a bracket expression [] whose leading ^ negates it, so it matches any character that is not whitespace ([:space:]), and the * accepts zero or more of them. (Shorthand escapes such as \s are not recognised inside a bracket expression, where a backslash is literal, hence the POSIX class.)

Then, in the output, we add the captured group \1 back into the result (each captured group is numbered; we only have one).

The reason I match non-whitespace is because I'm unsure whether your lines might have spaces on the end of them.

At least that's one way to do it.
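
Another way, if the files have already been shortened, is to append to the end of the first line directly. This is a minimal sketch under two assumptions that are not confirmed in the thread: the header is always line 1, and it contains nothing after the identifier except possible trailing whitespace. The sample files are made up and GNU sed is assumed.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>strain_1Mo  \nACGT\n' > file1.fasta   # header with trailing spaces
printf '>strain_4536\nACGT\n' > file3.fasta    # header without

# On line 1 only, replace any trailing whitespace (possibly none) with .chr1.
sed -i.bak '1s/[[:space:]]*$/.chr1/' *.fasta

header1=$(head -n 1 file1.fasta)
header3=$(head -n 1 file3.fasta)
printf '%s\n%s\n' "$header1" "$header3"
```

If a header carried a description after a space, this would append .chr1 after the description rather than after the identifier, so the capturing-group version is the safer choice in that case.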
 