Command line to convert large word list to UTF-8

None-yet · Aug 17, 2020

I have a wordlist that is currently 40 GB in size. I need to convert it to UTF-8 and am not sure how to do so. I assume it can be done from the command line. Can someone help me out and provide a command that I can try? Thank you!

captain-sensible · Aug 17, 2020

Was that made using crunch? It's best to get it to produce small files in the first place.

I had a 1 GB file called wordlist.txt and broke it up like this:

split -b 50M wordlist.txt

It then produced about 17 files of 50 MB each, called xaa, xab, etc., with no suffix. The files were still valid, though, and worked with wpscan. I was doing some "white hat" work for an investigative journalist on their website.
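One caveat with split -b: it cuts at exact byte counts, so a word that straddles a chunk boundary ends up half in one file and half in the next. A minimal sketch of splitting on line boundaries instead, using a throwaway six-word file (the file names and the part_ prefix are made up for illustration):

```shell
#!/bin/sh
# Demo in a scratch directory so nothing real is touched.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for the real word list: one word per line.
printf '%s\n' alpha bravo charlie delta echo foxtrot > wordlist.txt

# -l N puts N whole lines in each piece, so no word is cut in half.
# Pieces are named part_aa, part_ab, ... (a real list would use a much larger N).
split -l 2 wordlist.txt part_

ls part_*

# Concatenating the pieces in name order must reproduce the original.
cat part_* | cmp - wordlist.txt && echo "pieces reassemble to the original"
```

For a real list the line count per piece would be in the millions rather than 2; the reassembly check at the end is a cheap way to confirm nothing was lost.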

bash-5.0$ env | grep LANG
LANG=en_US.UTF-8

What do you get? Maybe you can tweak that. I don't know what your file type is, but what I can say is that it took 11 hours to get through 2% of a 50 MB file. So I wonder how long 40 GB would take?

split(1) - Linux manual page

captain-sensible · Aug 17, 2020

Just estimated it: using a wordlist on a laptop equivalent to mine, with similar Internet bandwidth, running 24 hours a day, it would take 1.2 years to complete!

None-yet · Aug 17, 2020

First it rarely ever finishes. It has a little more than 80% success rate. A very targeted attack makes all the difference. Last week with a target I started it at about 30% in and let it run a couple hours and got in. The list is broken into sections so depending on the target I do a search inside and start it at that point. Also any time I do have to generate a list for whatever reason then when I am done I just add to this one. I never throw away a list I make. Dis list has been bery bery good tu me. lol

JasKinasis · Aug 17, 2020

iconv is the only thing I can think of offhand.

And if memory serves - to use it, it's something like this:

Bash:

iconv -t UTF8 /path/to/inputfile -o /path/to/convertedFile

But I'd imagine a 40 GB file will take a long time to convert!
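For reference, a small sketch of the same conversion with the source encoding spelled out, which avoids relying on whatever iconv assumes by default. The sample file and the ISO-8859-1 source encoding are assumptions for the demo, not what the real list necessarily uses:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# "café" encoded as ISO-8859-1: the é is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > latin1.txt

# -f names the source encoding, -t the target. Output goes to stdout,
# so a plain redirect works with any iconv implementation.
iconv -f ISO-8859-1 -t UTF-8 latin1.txt > utf8.txt

# In UTF-8 the é becomes the two bytes 0xC3 0xA9; dump the bytes to check.
od -An -tx1 utf8.txt
```

Trying this on a tiny sample first is a quick way to confirm the encoding names are right before committing to a multi-hour run on the full file.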

captain-sensible · Aug 17, 2020

bash-5.0$ iconv --help
Usage: iconv [OPTION...] [FILE...]
Convert encoding of given files from one encoding to another.

Input/Output format specification:
-f, --from-code=NAME encoding of original text
-t, --to-code=NAME encoding for output

JasKinasis · Aug 18, 2020

captain-sensible said:
bash-5.0$ iconv --help
Usage: iconv [OPTION...] [FILE...]
Convert encoding of given files from one encoding to another.

Input/Output format specification:
-f, --from-code=NAME encoding of original text
-t, --to-code=NAME encoding for output

I don't think you have to specify the encoding of the original file, but iconv doesn't work it out for itself either: if you leave out -f, it assumes the input is in your current locale's encoding (check with locale charmap). If the file is in some other encoding, pass -f as well as -t, along with the file to convert and the output file.

None-yet · Aug 19, 2020

Thank yous all!

captain-sensible · Aug 19, 2020

Any luck then @None-yet

None-yet · Aug 22, 2020

Have not been able to try yet but plan to in just a bit.

captain-sensible · Aug 22, 2020

Let us know how you get on. Some Kali users would probably be interested in whether you created it with crunch, or whether it's some type of rainbow download or custom mash-up.

None-yet · Aug 24, 2020

This is what I keep getting :
root@kali:/media/sf_Share/Newest Version# iconv --verbose -f UTF8 /media/sf_Share/Newest Version/Full-Master-8-16-20b.txt -o /media/sf_Share/Newest Version/finished.txt
/media/sf_Share/Newest:
iconv: cannot open input file `/media/sf_Share/Newest': No such file or directory
Version/Full-Master-8-16-20b.txt:
iconv: cannot open input file `Version/Full-Master-8-16-20b.txt': No such file or directory
Version/finished.txt:
iconv: cannot open input file `Version/finished.txt': No such file or directory
root@kali:/media/sf_Share/Newest Version#

JasKinasis · Aug 24, 2020

That's because of the spaces in the "Newest Version" directory in both of the paths you have specified for the input and output files.

You need to either enclose the paths in double quotes, or escape the spaces with a backslash.
e.g.
Enclose the paths in quotes:

Bash:

iconv --verbose -f UTF8 "/media/sf_Share/Newest Version/Full-Master-8-16-20b.txt" -o "/media/sf_Share/Newest Version/finished.txt"

OR - escape the spaces with a backslash:

Bash:

iconv --verbose -f UTF8 /media/sf_Share/Newest\ Version/Full-Master-8-16-20b.txt -o /media/sf_Share/Newest\ Version/finished.txt
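To see why the unquoted path fails, here is a small sketch that counts how many arguments the shell actually passes. The directory name mirrors the one above but lives in a scratch directory, and count_args is a made-up helper:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
dir="$workdir/Newest Version"
mkdir -p "$dir"
printf 'password\n' > "$dir/in.txt"

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell splits the path at the space into two arguments.
count_args $dir/in.txt

# Quoted: the path stays a single argument.
count_args "$dir/in.txt"

# So quoting both paths lets iconv find the file.
iconv -f UTF-8 -t UTF-8 "$dir/in.txt" > "$dir/out.txt"
cmp "$dir/in.txt" "$dir/out.txt" && echo "converted copy written"
```

This matches the error output above, where iconv treated each fragment of the path as a separate input file.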

None-yet · Aug 24, 2020

Each one says the same error "iconv: illegal input sequence at position 1"

JasKinasis · Aug 24, 2020

None-yet said:
Each one says the same error "iconv: illegal input sequence at position 1"

That stumped me for a while, but I think I see the problem.
Didn't you say that you were converting the original file TO UTF8?
Because your command is attempting to convert the original file FROM UTF8.

Perhaps try using

Code:

-t UTF8

instead of

Code:

-f UTF8
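A quick way to check the direction question, sketched under the assumption that the list might not be UTF-8 at all: converting from UTF-8 to UTF-8 succeeds only if the input really is valid UTF-8, so the exit status works as a validity test. The two sample files and the is_utf8 helper are invented for the demo:

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'caf\303\251\n' > good.txt   # valid UTF-8 (é as bytes C3 A9)
printf 'caf\351\n'     > bad.txt    # ISO-8859-1 é (byte E9): not valid UTF-8

# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
is_utf8() { iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; }

for f in good.txt bad.txt; do
    if is_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not valid UTF-8, so -f UTF8 would fail on it"
    fi
done
```

If the real list fails this check, the "illegal input sequence" error is telling you the -f value is wrong, not that the file is damaged.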

None-yet · Aug 25, 2020

Dude, first allow me to say thank you for your help. Often when someone puts out this amount of effort to help someone in a forum they are very often not provided the respect and the thank you's that they deserve. So I wanted to make sure you knew how much I appreciate you.

OK, let me update. Yes, you are correct. However, I misspoke. This list is supposed to be ASCII. I think it is currently UTF8. I have been in contact with the team that wrote the software I am attempting to run this list through. I know that there shouldn't be any difference between the two because they were initially created to support the English charset only and not any accents such as ç, ñ, á, ü, etc. They think my list could be in Unicode because of the way their software reacts when I run it through. They are adamant about no Unicode.

At the moment I am not sure exactly what encoding the list may be in. I know I have made you roll your eyes a bit here. In my line of work I have never faced an issue where I needed to know some of this because most of my tools are pre-built. I have done much work for this team and when they asked me to help them with some issues I was happy to do so. Once we were done they wanted to know just how this software would run under certain heavy conditions such as running my list. I explained it was a bit out of my area but would be glad to do what I could. I love to learn new things so I wanted to take it on.

At this point I guess I need to learn what encoding this list is before I continue down the path I was on in this post. I started this morning to research how to determine what encoding this is, in between taking phone calls. So this is where I am. Can you tell I am confused? lol

JasKinasis · Aug 25, 2020

None-yet said:
Dude, first allow me to say thank you for your help. Often when someone puts out this amount of effort to help someone in a forum they are very often not provided the respect and the thank you's that they deserve. So I wanted to make sure you knew how much I appreciate you.

OK, let me update. Yes, you are correct. However, I misspoke. This list is supposed to be ASCII. I think it is currently UTF8. I have been in contact with the team that wrote the software I am attempting to run this list through. I know that there shouldn't be any difference between the two because they were initially created to support the English charset only and not any accents such as ç, ñ, á, ü, etc. They think my list could be in Unicode because of the way their software reacts when I run it through. They are adamant about no Unicode.

At the moment I am not sure exactly what encoding the list may be in. I know I have made you roll your eyes a bit here. In my line of work I have never faced an issue where I needed to know some of this because most of my tools are pre-built. I have done much work for this team and when they asked me to help them with some issues I was happy to do so. Once we were done they wanted to know just how this software would run under certain heavy conditions such as running my list. I explained it was a bit out of my area but would be glad to do what I could. I love to learn new things so I wanted to take it on.

At this point I guess I need to learn what encoding this list is before I continue down the path I was on in this post. I started this morning to research how to determine what encoding this is, in between taking phone calls. So this is where I am. Can you tell I am confused? lol

No worries!
I’m always glad to help out with programming and scripting/terminal related problems.

Old Tom Bombadil · Aug 26, 2020

None-yet said:
At this point I guess I need to learn what encoding this list is

Maybe this will help:

Code:

file -bi <filename>

Source
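Since the target software wants plain ASCII, another rough check is simply to count bytes outside the 7-bit range: zero means the file is already pure ASCII, whatever label it carries. A sketch with invented sample files (count_non_ascii is a made-up helper, and the file --mime-encoding call only runs if file is installed):

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'password\nletmein\n' > ascii.txt
printf 'caf\303\251\n'       > utf8.txt    # é as UTF-8 bytes C3 A9

# Hypothetical helper: delete every 7-bit byte (octal 000-177) and
# count what is left, i.e. the number of non-ASCII bytes.
count_non_ascii() { LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '; }

for f in ascii.txt utf8.txt; do
    echo "$f: $(count_non_ascii "$f") non-ASCII byte(s)"
    # Optional second opinion from file(1), if it is available.
    if command -v file > /dev/null 2>&1; then
        file -b --mime-encoding "$f"
    fi
done
```

The byte count streams through the file once, so it should stay practical even on a very large list.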

None-yet · Aug 27, 2020

Hey all, I am going to close this out. I, along with you all have it all worked out. Although it has been very frustrating to me, the forum really made it work out very well. I don't like to copy and paste what you guys have put in your posts without understanding what I am pasting in. This is where you all were fantastic. Much of the knowledge you all shoveled into my brain was done by you all in a manner that I understood what I was doing. That makes this forum a priceless jewel. I truly appreciate the time, effort and patience you all put into helping me solve this issue. With no personal benefit to any of you and providing the effort you all did makes it even more meaningful to me.

If this post shows up in any future searches by anyone out there, consider joining up. This is a great goldmine of information!
