Command line to convert large word list to UTF-8



captain-sensible

Well-Known Member
Credits
6,832
Was that using crunch? It's best to get it to produce small files in the first place.

I had a 1 GB file called wordlist.txt and broke it up like this:

split -b 50M wordlist.txt

It then produced circa 17 files of 50 MB each, called xaa, xab, etc., with no suffix. The files were still valid, though, and worked with wpscan. I was doing some "white hat" work for an investigative journalist on their website.
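A side note on the split step: -b cuts at exact byte counts, so a word that straddles a 50 MB boundary ends up half in one piece and half in the next. The sketch below assumes GNU coreutils split and uses -C (line-bytes) to keep lines whole. The part_ prefix, the .txt suffix and the small generated stand-in list are placeholders for illustration, not anything from the original setup.

```shell
# Sketch only: assumes GNU coreutils split; all file names are placeholders.
workdir=$(mktemp -d)

# Build a small stand-in for wordlist.txt (1000 lines) so the example runs anywhere.
seq 1 1000 | sed 's/^/password/' > "$workdir/wordlist.txt"

# -C 2K  : at most 2 KiB per piece, but never cut in the middle of a line
#          (something like -C 50M would be the equivalent for a real list).
# -d     : numeric suffixes (00, 01, ...) instead of aa, ab, ...
# --additional-suffix=.txt : give every piece a .txt extension.
split -C 2K -d --additional-suffix=.txt "$workdir/wordlist.txt" "$workdir/part_"

ls "$workdir"

# Sanity check: the pieces, concatenated in order, reproduce the original file.
cat "$workdir"/part_*.txt | cmp -s - "$workdir/wordlist.txt" && echo "pieces match original"
```

With -b instead of -C the pieces would still concatenate back to the original, but individual pieces could begin or end with a partial word.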

bash-5.0$ env | grep LANG
LANG=en_US.UTF-8



What do you get? Maybe you can tweak that. I don't know what your file type is, but what I can say is that it took 11 hours to do 2% of a 50 MB file. So I wonder how long 40 GB would take?


 

captain-sensible

Well-Known Member
Credits
6,832
Just estimated it: using a wordlist on an equivalent laptop like mine, with similar Internet bandwidth, running 24 hours a day, it would take 1.2 years to complete!
 

None-yet

Member
Credits
701
First, it rarely ever finishes. It has a little more than an 80% success rate. A very targeted attack makes all the difference. Last week, with one target, I started it at about 30% in, let it run a couple of hours and got in. The list is broken into sections, so depending on the target I do a search inside and start it at that point. Also, any time I do have to generate a list for whatever reason, when I am done I just add it to this one. I never throw away a list I make. Dis list has been bery bery good tu me. lol
 

JasKinasis

Well-Known Member
Credits
3,300
iconv is the only thing I can think of offhand.

And if memory serves - to use it, it's something like this:
Bash:
iconv -t UTF8 /path/to/inputfile -o /path/to/convertedFile
But I'd imagine a 40 GB file will take a long time to convert!
 

captain-sensible

Well-Known Member
Credits
6,832
bash-5.0$ iconv --help
Usage: iconv [OPTION...] [FILE...]
Convert encoding of given files from one encoding to another.

Input/Output format specification:
-f, --from-code=NAME encoding of original text
-t, --to-code=NAME encoding for output
 

JasKinasis

Well-Known Member
Credits
3,300
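You don't strictly have to specify the encoding of the original file, but note that iconv does not detect it. If -f is omitted, iconv assumes the input is in the current locale's encoding (the LANG value shown earlier). So you can tell it only the encoding to convert to and pass it the file to convert and the output file, but that works only when the input really is in the locale's encoding. If in doubt, state the source encoding with -f.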
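As a rough sketch of a conversion with both encodings stated explicitly: ISO-8859-1 below is purely an assumed source encoding for the demo, and the file names are placeholders. Plain output redirection is used here; it does the same job as -o.

```shell
# Sketch only: ISO-8859-1 is an ASSUMED source encoding; file names are placeholders.
workdir=$(mktemp -d)

# Stand-in input: the word "café" in ISO-8859-1, where é is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > "$workdir/latin1.txt"

# State both the source (-f) and the target (-t) encoding.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/latin1.txt" > "$workdir/utf8.txt"

# In UTF-8, é takes two bytes (0xC3 0xA9), so the output is one byte longer.
wc -c < "$workdir/latin1.txt"   # 5 bytes
wc -c < "$workdir/utf8.txt"     # 6 bytes
```

iconv streams its input, so the same command shape applies to a very large list; it just takes longer.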
 

captain-sensible

Well-Known Member
Credits
6,832
let us know how you get on. Some kali users would probably be interested if you created it with crunch or its a type of rainbow download or custom sort of mash up
 

None-yet

Member
Credits
701
This is what I keep getting :
[email protected]:/media/sf_Share/Newest Version# iconv --verbose -f UTF8 /media/sf_Share/Newest Version/Full-Master-8-16-20b.txt -o /media/sf_Share/Newest Version/finished.txt
/media/sf_Share/Newest:
iconv: cannot open input file `/media/sf_Share/Newest': No such file or directory
Version/Full-Master-8-16-20b.txt:
iconv: cannot open input file `Version/Full-Master-8-16-20b.txt': No such file or directory
Version/finished.txt:
iconv: cannot open input file `Version/finished.txt': No such file or directory
[email protected]:/media/sf_Share/Newest Version#
 

JasKinasis

Well-Known Member
Credits
3,300
That's because of the spaces in the "Newest Version" directory in both of the paths you have specified for the input and output files.

You need to either enclose the paths in double quotes, or escape the spaces with a backslash.
e.g.
Enclose the paths in quotes:
Bash:
iconv --verbose -f UTF8 "/media/sf_Share/Newest Version/Full-Master-8-16-20b.txt" -o "/media/sf_Share/Newest Version/finished.txt"
OR - escape the spaces with a backslash:
Bash:
iconv --verbose -f UTF8 /media/sf_Share/Newest\ Version/Full-Master-8-16-20b.txt -o /media/sf_Share/Newest\ Version/finished.txt
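The same rule applies when the paths live in variables. A minimal sketch, with a throwaway temp directory standing in for the real share and count_args as a hypothetical helper that only reports how many arguments it received:

```shell
# Sketch only: the directory and file names are stand-ins for the real paths.
workdir=$(mktemp -d)
mkdir "$workdir/Newest Version"
printf 'hello\n' > "$workdir/Newest Version/input.txt"

in="$workdir/Newest Version/input.txt"
out="$workdir/Newest Version/finished.txt"

# Hypothetical helper: prints the number of arguments it was given.
count_args() { echo "$#"; }

# Quoted: the whole path, space included, arrives as ONE argument.
count_args "$in"

# Unquoted: the shell splits the value at the space, which is why iconv
# reported ".../Newest" and "Version/..." as separate missing files.
count_args $in

# So always quote the expansions when calling iconv.
iconv -f UTF-8 -t UTF-8 "$in" > "$out"
```

Assuming the temp directory path itself contains no spaces, the two count_args calls report 1 and 2 respectively.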
 

JasKinasis

Well-Known Member
Credits
3,300
Each one says the same error "iconv: illegal input sequence at position 1"
That stumped me for a while, but I think I see the problem.
Didn't you say that you were converting the original file TO UTF8?
Because your command is attempting to convert the original file FROM UTF8.

Perhaps try using
Code:
-t UTF8
instead of
Code:
-f UTF8
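To illustrate why a wrong -f matters, here is a small sketch: when the input is declared as UTF-8 but contains bytes that are not valid UTF-8, iconv stops with an error and a non-zero exit status. The sample file is a made-up stand-in, and the exact wording of the error message can differ between iconv implementations.

```shell
# Sketch only: latin1.txt is a made-up stand-in, not the real list.
workdir=$(mktemp -d)

# "café" in ISO-8859-1: the lone byte 0xE9 is not a valid UTF-8 sequence.
printf 'caf\351\n' > "$workdir/latin1.txt"

# Declaring the input as UTF-8 makes iconv reject it.
if iconv -f UTF-8 -t UTF-8 "$workdir/latin1.txt" > /dev/null 2> "$workdir/err.txt"; then
    result="accepted as UTF-8"
else
    result="rejected as UTF-8"
fi

echo "$result"
cat "$workdir/err.txt"   # the error text iconv printed, if any
```

The same check, run against the real file, is a cheap way to find out whether a list is valid UTF-8 at all.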
 

None-yet

Member
Credits
701
Dude, first allow me to say thank you for your help. Often when someone puts out this amount of effort to help someone in a forum they are very often not provided the respect and the thank you's that they deserve. So I wanted to make sure you knew how much I appreciate you.

OK, let me update. Yes, you are correct. However, I misspoke. This list is supposed to be ASCII. I think it is currently UTF8. I have been in contact with the team that wrote the software I am attempting to run this list through. I know that there shouldn't be any difference between the two because they were initially created to support the English charset only and not any accents such as ç, ñ, á, ü, etc. They think my list could be in Unicode because of the way their software reacts when I run it through. They are adamant about no Unicode.

At the moment I am not sure exactly what encoding the list may be in. I know I have made you roll your eyes a bit here. In my line of work I have never faced an issue where I needed to know some of this because most of my tools are pre-built. I have done much work for this team and when they asked me to help them with some issues I was happy to do so. Once we were done they wanted to know just how this software would run under certain heavy conditions such as running my list. I explained it was a bit out of my area but would be glad to do what I could. I love to learn new things so I wanted to take it on.

At this point I guess I need to learn what encoding this list is in before I continue down the path I was on in this post. I started this morning to research how to determine what encoding this is, in between taking phone calls. So this is where I am. Can you tell I am confused? lol
 

JasKinasis

Well-Known Member
Credits
3,300
No worries!
I’m always glad to help out with programming and scripting/terminal related problems.
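On the open question of working out what encoding the list is in, one possible approach (a sketch, not necessarily what was used in the end) is to ask file for its guess and then count the bytes that fall outside 7-bit ASCII. A file with zero such bytes is plain ASCII, and plain ASCII is byte-for-byte valid UTF-8 as well, so no conversion is needed in that case. The list.txt below is a tiny made-up stand-in.

```shell
# Sketch only: list.txt is a tiny made-up stand-in for the real word list.
workdir=$(mktemp -d)
list="$workdir/list.txt"

# Three words; the middle one is "café" in UTF-8 (é = bytes 0xC3 0xA9).
printf 'password\ncaf\303\251\nletmein\n' > "$list"

# 1. Ask file(1) for its best guess, if it is installed. This is only a heuristic.
if command -v file > /dev/null 2>&1; then
    file -b --mime-encoding "$list"
fi

# 2. Count bytes that are NOT 7-bit ASCII: delete every ASCII byte
#    (octal 000-177) and count what is left.
non_ascii=$(LC_ALL=C tr -d '\000-\177' < "$list" | wc -c | tr -d ' ')
echo "non-ASCII bytes: $non_ascii"

if [ "$non_ascii" -eq 0 ]; then
    echo "pure ASCII (and therefore already valid UTF-8 too)"
else
    echo "contains non-ASCII bytes, so it is not plain ASCII"
fi
```

For this sample the count is 2, the two bytes of the é. On a real list, a non-zero count means some entries contain accented or other non-English characters that would have to be converted or dropped before feeding the list to ASCII-only software.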
 

None-yet

Member
Credits
701
Hey all, I am going to close this out. I, along with you all, have it all worked out. Although it has been very frustrating to me, the forum really made it work out very well. I don't like to copy and paste what you guys have put in your posts without understanding what I am pasting in. This is where you all were fantastic. Much of the knowledge you all shoveled into my brain was delivered in a manner that let me understand what I was doing. That makes this forum a priceless jewel. I truly appreciate the time, effort and patience you all put into helping me solve this issue. That you did it with no personal benefit to any of you makes it even more meaningful to me.

If this post shows up in any future searches by anyone out there, consider joining up. This is a great goldmine of information!
 

