How to delete everything after the first word on lines starting with ">" in multiple text files

jhcuarta

Hi
I was wondering if you could help me with a command for a problem I need to solve. I need to delete every character after the first word, including the white space, on each line that starts with the symbol ">" in a plain text file (extension .fa), and I need to do this for multiple files in the same folder. For instance, the first lines of one file look like this:

>KLDFOOAE_00001 tape measure protein [Vibrio cholerae]

MANNLKTDIVLNLQGDLAQKARSYSKEMTTLATRSKAAFSMISSSAIAASRGIDTFGNRL
LFITGAAAVGFERTFVKTAAEFERYQTMLNKLQGSPEAGAKAMAWIEEFTQNTPYAIDEV
TQSFVRLKAFGIDPMDGTMQSIADQAAMIGGTAETVEGIATALGQAWTKGKLQSEEALQL
>KLDFOOAE_00002 MULTISPECIES: phage tail assembly protein [Vibrio]
MAVMTFNLEDGFKVGDAQCHEVGLKELTPKDVFDAQLASEKIGILNGRPHAYTSDVQMGM
ELLCRQVEFIGNVQGPFSVKEILKLSSRDFATLQQKARELDDIMFSDDALEGLEARGRD
>KLDFOOAE_00003 MULTISPECIES: hypothetical protein [Vibrio]
MEHVYQLVDGIVFKGKLQKQVTLHPIDSVSYDLVEQLVEEQLQHIQNQADVVLVNDSHLQ
GLKGYMLLNESAASSISKIGDENVDLMFFDLCQLKISAQDWNVILTANLAIAEYYANQAA
MLA


I need to keep the symbol and the sequence code, not the description, so the file I need would look like this:


>KLDFOOAE_00001

MANNLKTDIVLNLQGDLAQKARSYSKEMTTLATRSKAAFSMISSSAIAASRGIDTFGNRL
LFITGAAAVGFERTFVKTAAEFERYQTMLNKLQGSPEAGAKAMAWIEEFTQNTPYAIDEV
TQSFVRLKAFGIDPMDGTMQSIADQAAMIGGTAETVEGIATALGQAWTKGKLQSEEALQL
>KLDFOOAE_00002
MAVMTFNLEDGFKVGDAQCHEVGLKELTPKDVFDAQLASEKIGILNGRPHAYTSDVQMGM
ELLCRQVEFIGNVQGPFSVKEILKLSSRDFATLQQKARELDDIMFSDDALEGLEARGRD
>KLDFOOAE_00003
MEHVYQLVDGIVFKGKLQKQVTLHPIDSVSYDLVEQLVEEQLQHIQNQADVVLVNDSHLQ
GLKGYMLLNESAASSISKIGDENVDLMFFDLCQLKISAQDWNVILTANLAIAEYYANQAA
MLA


Besides, I need to do this for multiple files inside the same folder.

Best regards, and thanks in advance
 


This sounds like homework or an assignment, is that the case?

We can't do homework for you or you will not retain the knowledge.

Chris Turner
wizardfromoz
 
Hi
It's for my research project, not homework. I need to re-edit the files so that they work and pass the filters in a software package I'm trying to use (syntenet). I'm a bioinformatician without much programming expertise.
 
I understand, thanks for clarifying.

Someone will probably be along in the next 24 hours - we are scattered around the world.

Wizard

and Good luck
 
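As long as every header line follows the same pattern, with the first space coming right after the content you want to keep, here's how you'd do that with sed. Note that the command acts on every line that contains a space, not only on lines starting with ">"; the sequence lines in your sample contain no spaces, so they are left alone: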

sed 's/ .*//g' text-file

That would make your text look like the second example in your post. If you run it like that on the command line, it only prints the result to standard output and leaves the file unchanged. If you want to write the result to a separate file:

sed 's/ .*//g' text-file > altered_file

I'd recommend you run the first version and check in standard output that it makes the changes you want, then run the second one so you keep the original and the altered file as separate lab documentation.
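
In case a sequence line ever does contain a space, the substitution can be limited to header lines. The following is only a sketch: it assumes headers start with ">" in the first column, and the file names sample.fa and sample_clean.fa are made up for the example.

```shell
# Scratch directory so no real data is touched.
workdir=$(mktemp -d)

# Hypothetical sample file mimicking the headers in the question.
cat > "$workdir/sample.fa" <<'EOF'
>KLDFOOAE_00001 tape measure protein [Vibrio cholerae]
MANNLKTDIVLNLQGDLAQKARSYSKEMTTLATRSKAAFSMISSSAIAASRGIDTFGNRL
>KLDFOOAE_00002 MULTISPECIES: phage tail assembly protein [Vibrio]
MAVMTFNLEDGFKVGDAQCHEVGLKELTPKDVFDAQLASEKIGILNGRPHAYTSDVQMGM
EOF

# /^>/ selects only lines that start with ">"; on those lines, delete
# everything from the first whitespace character to the end of the line.
sed '/^>/ s/[[:space:]].*//' "$workdir/sample.fa" > "$workdir/sample_clean.fa"

cat "$workdir/sample_clean.fa"
```

The address /^>/ leaves every other line untouched, and [[:space:]] also catches a tab after the identifier, which a plain space in the pattern would miss.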
 
Hi
That was helpful. It does work for one file at a time, but I need to do this for several files at once (415). If you could help me out a little further, I would appreciate it.
 
Okay, well, there is one way to run the above command on all those files in one go without putting your data in danger, but all the files need to be in the same folder, and the folder must not contain any other files (including an altered_file left over from an earlier run, since * would match it as input too):

sed 's/ .*//g' * > altered_file

That will combine all your files, with the changes applied, into a single file (you can choose whatever name you like for it, of course). The asterisk inside the sed expression ( .*) is part of a regular expression, but the asterisk following it is a shell wildcard that stands for every file in the folder. The command applies the change to the contents of every file and redirects the result into one file that contains all the data; the original files are left untouched.
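
If you need one cleaned file per input file rather than a single combined file, a shell loop is one way to do it. This is a sketch under assumptions: the files a.fa and b.fa and the subfolder name cleaned are invented for the example, and the sed expression is limited to lines starting with ">".

```shell
# Scratch directory with two hypothetical input files.
workdir=$(mktemp -d)
printf '>ID_1 some description\nMANNLK\n' > "$workdir/a.fa"
printf '>ID_2 another description [Vibrio]\nMAVMTF\n' > "$workdir/b.fa"

# Write the cleaned copies into a separate subfolder so the originals stay
# untouched and the outputs are never matched by the *.fa wildcard.
mkdir -p "$workdir/cleaned"
for f in "$workdir"/*.fa; do
    sed '/^>/ s/[[:space:]].*//' "$f" > "$workdir/cleaned/$(basename "$f")"
done

ls "$workdir/cleaned"
```

Matching *.fa instead of * means other files in the folder are ignored, and each output keeps the name of its input.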

However, sed accepts any number of files as arguments, so you can also do it like this:

sed 's/ .*//g' file1 file2 file3....
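
Listing the files as arguments still sends everything to standard output as one stream. If you would rather have each file changed where it is, the -i option is worth a look. It is not part of POSIX sed, but to my knowledge both GNU sed and BSD/macOS sed accept it with a backup suffix attached, as below. Treat this as a sketch; a.fa and b.fa are made-up files.

```shell
# Scratch directory with two hypothetical input files.
workdir=$(mktemp -d)
printf '>ID_1 some description\nMANNLK\n' > "$workdir/a.fa"
printf '>ID_2 another description\nMAVMTF\n' > "$workdir/b.fa"

# -i.bak edits each file in place and keeps the original as <name>.bak.
sed -i.bak '/^>/ s/[[:space:]].*//' "$workdir"/*.fa

# Show the edited header next to the backed-up original.
head -n 1 "$workdir/a.fa" "$workdir/a.fa.bak"
```

Once you have checked the results, the .bak files can be deleted or moved somewhere safe.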

I personally recommend that you learn Bash and/or Python if you have to do this kind of thing on a regular basis, but you can click on the link in my signature to get started on sed. It has a lot of powerful features.

However, Python is currently the go-to programming language for lab data manipulation, so if you read my guide and then dive into the Python books/training, I think that will be your most efficient educational route.
 
