How to shorten first line characters in multiple files

jhcuarta · Feb 12, 2023

Hi
I have a series of .fasta files whose first line looks like this:
file 1: >Vibrio_cholerae_strain_1Mo
file 2: >Vibrio_cholerae_2012EL-1097_89x
file 3: >Vibrio_cholerae_strain_4536

I would like to remove the "Vibrio_cholerae_" characters from the first line while keeping the ">" symbol. It is worth noting that this is for hundreds of files.

JasKinasis · Feb 13, 2023

OK. No probs.
The way we can identify the word to remove is by the ">Vibrio_cholerae_" - which is common to all of them.
So we just need to come up with a regular expression that will describe that.
We want to take everything from ">Vibrio_cholerae_" up to the next space character.
So I think this regex will work: >Vibrio_cholerae_[^ ]*
The [^ ]* matches every character up to (but not including) the next space.
And we'll replace that pattern with > , including a space, to keep the > separate from whatever other data is on that first line (if any!).

We'll use sed to perform the edits.
I'm pretty certain it will work. But to make sure, we'll test the regex is correct, without actually editing any files.
So pick one of your .fasta files and carefully type (or copy/paste) the following command:

Bash:

sed 's/>Vibrio_cholerae_[^ ]*/> /g' filename.fasta | head -n 2

Where filename.fasta is the actual name of the .fasta file you want to test with.
That will attempt to replace ">Vibrio_cholerae_....." in the first line with "> " and will show you the first two lines in the processed file.

If the output of the first line doesn't look right, let me know what you see and how you actually want it to appear and I'll see if I can tweak things.

But if you're happy with what you see, choose ONE of the following commands to edit all of the .fasta files in one go. But before doing so, please read the rest of my post!

Bash:

sed -i.bak -s 's/>Vibrio_cholerae_[^ ]*/> /g' *.fasta

OR

Bash:

sed -i -s 's/>Vibrio_cholerae_[^ ]*/> /g' *.fasta

The -s flag tells sed to treat each file as a separate file, rather than as one massive, single stream of data.
The -i flag tells sed to edit the files in-place - so we aren't going to redirect the output to a new file, we're going to overwrite the original files themselves.

If the -i flag has a file extension specified immediately after it, then sed creates a backup of the original files with that extension appended to the end.

The first sed command edits all of the .fasta files in place AND creates backup copies of the original files, with an appended .bak extension.

The second command edits all .fasta files in place, but does NOT create a backup.

The first option is safer because it keeps a backup of your original data files, whereas the second does not. So the first option is probably the better one to take.
If you don't need the backup files afterwards, you can always remove them with rm, e.g.:
rm *.fasta.bak
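
For anyone who wants to see the backup mechanism in action first, here is a minimal sketch. It assumes GNU sed, and the temporary directory and the sample1.fasta / sample2.fasta names are made up for the demo. It runs the first command from above on two throwaway files, then restores the originals from their .bak copies.

```shell
# Sketch only: throwaway files in a temporary directory (names are invented).
workdir=$(mktemp -d)
printf '>Vibrio_cholerae_strain_1Mo\nACGT\n' > "$workdir/sample1.fasta"
printf '>Vibrio_cholerae_strain_4536\nTTGA\n' > "$workdir/sample2.fasta"

# In-place edit; each original is kept next to it as <name>.fasta.bak
sed -i.bak -s 's/>Vibrio_cholerae_[^ ]*/> /g' "$workdir"/*.fasta

edited=$(head -n 1 "$workdir/sample1.fasta")
backup=$(head -n 1 "$workdir/sample1.fasta.bak")
echo "edited header: $edited"
echo "backup header: $backup"

# Undo the edit by moving every backup over its edited file
for bak in "$workdir"/*.fasta.bak; do
    mv -- "$bak" "${bak%.bak}"
done
restored=$(head -n 1 "$workdir/sample1.fasta")
echo "restored header: $restored"
```

Moving the backups back like this is only possible while the .bak files still exist, so it is worth checking the edited headers before running rm on them.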

Now you've read the whole post - go ahead and pick one of the final sed commands to edit the files.
Anyway, hopefully that helps!

jhcuarta · Feb 13, 2023

Hi
Thanks for the quick response, but I tried both commands and they eliminate everything after "Vibrio_cholerae_". I need to preserve what comes after it, for instance:

file 1: >strain_1Mo
file 2: >2012EL-1097_89x
file 3: >strain_4536

Thanks in advance, really grateful

digitaltrails · Feb 13, 2023

The pattern argument to sed is slightly wrong; this would work:

Code:

sed 's/>Vibrio_cholerae_/>/g'

This just changes >Vibrio_cholerae_ to >. If there is only one line to change in each file, the `g` is unnecessary, so the pattern could be:

Code:

sed 's/>Vibrio_cholerae_/>/'
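
Since only the first line of each file needs changing, the substitution can also be limited to line 1 and previewed before anything is written. A minimal sketch, assuming GNU sed (where -s restarts line numbering for every file); the a.fasta and b.fasta files are invented for the demo:

```shell
# Sketch only: two throwaway files in a temporary directory (names are invented).
workdir=$(mktemp -d)
printf '>Vibrio_cholerae_strain_1Mo\nACGT\n' > "$workdir/a.fasta"
printf '>Vibrio_cholerae_2012EL-1097_89x\nGGCA\n' > "$workdir/b.fasta"

# The leading 1 restricts the substitution to line 1 of each file.
# -n with the p flag prints only the header lines that would change;
# no -i is given, so the files themselves are not modified.
preview=$(sed -n -s '1s/>Vibrio_cholerae_/>/p' "$workdir"/*.fasta)
printf '%s\n' "$preview"
# >strain_1Mo
# >2012EL-1097_89x
```

If the preview looks right, the same '1s/.../' expression can be used with -i (or -i.bak) to make the change for real.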

jhcuarta · Feb 13, 2023

Hi
Indeed, worked fine

sed -i -s 's/>Vibrio_cholerae_/>/' *.fasta

best regards both of you

JasKinasis · Feb 13, 2023

jhcuarta said:
Hi
Thanks for the quick response, but I tried both commands and they eliminate everything after "Vibrio_cholerae_". I need to preserve what comes after it, for instance:

file 1: >strain_1Mo
file 2: >2012EL-1097_89x
file 3: >strain_4536

Thanks in advance, really grateful

Sorry about that. I thought you meant you wanted to remove the entire word starting from Vibrio_cholerae_, so I took my fix a bit too far. After posting my initial reply, I posted replies in a couple of other threads before falling asleep. It was almost 2am here, ha ha!

Thanks to @digitaltrails for chiming in with the correct amendment to my fix!

jhcuarta · Feb 14, 2023

Hi
I was wondering if you could help me out a little bit more.
After the first line is edited, I have:
file 1: >strain_1Mo
file 2: >2012EL-1097_89x
file 3: >strain_4536

I need to append ".chr1" after the last character,

for instance
file 1: >strain_1Mo.chr1
file 2: >2012EL-1097_89x.chr1
file 3: >strain_4536.chr1

Any help would be greatly appreciated

digitaltrails · Feb 14, 2023

You'd have to start with the original files and capture a bit of each match so you can append .chr1 onto it:

Bash:

sed 's/>Vibrio_cholerae_\([^\W]*\)/>\1.chr1/'

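The \( \) enclose a capturing group that follows Vibrio_cholerae_. The group to be captured is [^\W]*: a set of characters enclosed by [], negated by the leading ^, of which we will accept zero or more instances (*). Strictly speaking, a backslash is literal inside a bracket expression, so this set means "any character other than a backslash or W" rather than "any non-whitespace character" (and outside brackets, \W in GNU sed means non-word characters; whitespace is \s). On headers like the ones above, which contain neither of those characters, it captures the rest of the line. A stricter non-whitespace match would be [^[:space:]]*.

Then, in the output, we add the captured group \1 back into the result (captured groups are numbered; we only have one).

The reason I aimed for non-whitespace is that I'm unsure whether your lines might have spaces on the end of them.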

At least that's one way to do it.
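
If the goal is specifically to stop at the first whitespace character, the POSIX class [[:space:]] says that directly. Here is a minimal sketch of that variant, plus a second form for files whose prefix has already been removed. The sample headers and the strip_and_tag variable name are invented for illustration, and nothing here edits a file:

```shell
# Sketch only: the sample headers below are invented for illustration.
# [^[:space:]]* captures everything up to the first whitespace character,
# so .chr1 is placed directly after the name.
strip_and_tag='s/>Vibrio_cholerae_\([^[:space:]]*\)/>\1.chr1/'
plain=$(printf '>Vibrio_cholerae_strain_1Mo\n' | sed "$strip_and_tag")
spaced=$(printf '>Vibrio_cholerae_strain_4536 extra note\n' | sed "$strip_and_tag")
echo "$plain"     # >strain_1Mo.chr1
echo "$spaced"    # >strain_4536.chr1 extra note

# If the prefix has already been removed, .chr1 can instead be appended to
# line 1, replacing any trailing whitespace on that line.
already=$(printf '>strain_1Mo  \nACGT\n' | sed '1s/[[:space:]]*$/.chr1/')
echo "$already"   # >strain_1Mo.chr1, then ACGT on the next line
```

The second form appends at the very end of line 1, so it only suits headers that hold nothing but the name.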

jhcuarta · Feb 14, 2023

sed -i -s 's/>Vibrio_cholerae_\([^\W]*\)/>\1.chr1/' *.fasta

worked just fine

Many thanks

Best regards
