sort --unique not working properly

FBClark

New Member
Joined
Oct 19, 2020
Messages
17
Reaction score
19
Credits
268
Foist off, a little intro. I played with DOS batch files back in the early 80s so I'm not only out of practice but I'm switching script language from batch to bash. Yeah, this is my first foray into the world of bash after using Linux for just shy of 15 years.
I'm downloading, merging and organizing 11 hosts files from online sources into one. I'm sudo cat'ing these three files to write over the /etc/hosts: I have a copy of my original hosts header with several of my own additions for blocking, then I make a current date and time file prepended to that so I know when I've updated the hosts file. Then 11 source files from online are cat'ed and cleaned up. That assemblage is then my new hosts file. I've added a test section to determine if a copy of the original hosts file exists and if it doesn't, one is made. Can't be too careful! Just call me Sir Blockalot.
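The backup test and the assembly order described above could look roughly like the sketch below. Every path and file name here is a placeholder invented for illustration, and the work happens in a scratch directory rather than /etc, so this is an outline of the idea and not the actual script.

```shell
#!/usr/bin/env bash
# Sketch only: all paths and names are hypothetical stand-ins.
work=$(mktemp -d)
hosts="$work/hosts"               # stands in for /etc/hosts
backup="$work/hosts.original"     # stands in for the saved original
header="$work/header.txt"         # personal header with hand-made entries
stamp="$work/stamp.txt"           # date and time marker
blocklist="$work/blocklist.txt"   # merged, cleaned-up downloads

# Made-up sample content so the sketch runs on its own.
printf '127.0.0.1 localhost\n' > "$hosts"
printf '127.0.0.1 localhost\n0.0.0.0 example.invalid\n' > "$header"
printf '0.0.0.0 zhirok.com #[Spamdexing]\n' > "$blocklist"

# Keep a copy of the original hosts file only if one does not exist yet.
if [ ! -e "$backup" ]; then
    cp "$hosts" "$backup"
fi

# Date stamp first, then the header, then the merged block list.
printf '# Updated: %s\n' "$(date '+%Y-%m-%d %H:%M')" > "$stamp"
cat "$stamp" "$header" "$blocklist" > "$hosts"
```

On a real system the last step writes to a root-owned file, so it would need sudo in some form; that part is left out here on purpose.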
I’ve run into a problem I just can’t figure out. Using ‘sort --unique’ I end up with a tiny handful of duplicate lines, always pairs, no more than two of each duplicate. Naturally when all those files are downloaded there are many-several duplicates and many of them are more than just double copies. Not only is this the first time I’ve tackled bash, but also a large mix of files from the GSW - Great Spider Web.
I'm assuming, since they're downloaded, that the problem is caused by either MS or HTML formatting? If I'm right, how can I figure it out and then fix it? It's either that or the size of the finished file, 565,000 lines give or take. I don't think it's the size, but I'll toss that out as a possibility.
Here's what one pair looks like.
0.0.0.0 zhirok.com #[Spamdexing]
0.0.0.0 zhirok.com #[Spamdexing]
I update the hosts files on my old laptop, my wife's laptop and her 2in1 monthly and I decided to automate the process and just for spits and gurgles I decided to add all those other online hosts files. I've made the script file mostly generic, using "${USER}" in the file paths so I can toss it onto all of our PCs without any editing of the script.
BTW, did you ever notice that there are two large industries where the customer is referred to as a user?
Drugs and software.
For all the more duplicates there are, it's not mission critical. I could easily just ignore them, but it bugs me, ya know. I'm a prefectionist - a poorfuctionist - I'd just like it tidy.
This isn't critical to my question, but I'm using LMDE4.
 


JasKinasis

Well-Known Member
Joined
Apr 25, 2017
Messages
1,321
Reaction score
1,870
Credits
8,343
I can't say I've ever had a problem with sort --unique letting duplicates through.

Is it possible that the line-endings on one of the duplicates is different?
e.g.
one has MSDOS line endings and the other has UNIX line endings?
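One way to check for this, as a rough sketch: count the lines that end in a carriage return (CR) and dump the bytes so the invisible character shows up. The sample file is made up from the pair quoted in the first post, and `count_crlf` is a hypothetical helper name, not something from the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count lines in a file that end with a carriage return.
count_crlf() {
    cr=$(printf '\r')          # a literal CR character
    grep -c "${cr}\$" "$1"     # lines with a CR right before the newline
}

# Made-up sample: two visually identical lines, one with a DOS (CRLF) ending.
sample=$(mktemp)
printf '0.0.0.0 zhirok.com #[Spamdexing]\r\n' >  "$sample"
printf '0.0.0.0 zhirok.com #[Spamdexing]\n'   >> "$sample"

count_crlf "$sample"

# LC_ALL=C forces a byte-wise comparison, so the result does not depend on locale.
# Both lines survive because their bytes differ.
LC_ALL=C sort --unique "$sample" | wc -l

# In the od dump the carriage return appears explicitly as \r.
od -c "$sample" | head -n 3
```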

If so, consider running the final, merged file through dos2unix to normalise all of the line-endings before doing the final, unique sort.
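If installing dos2unix is not an option, a similar normalisation can be sketched with sed alone. This assumes the only difference between the lines is a single trailing carriage return; `strip_cr` is a hypothetical helper name and the sample data is invented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip one trailing carriage return from every line,
# which is roughly what dos2unix does to DOS line endings.
strip_cr() {
    cr=$(printf '\r')          # a literal CR character
    sed "s/${cr}\$//" "$@"
}

# Made-up sample: the same entry once with CRLF and once with LF.
merged=$(mktemp)
printf '0.0.0.0 zhirok.com #[Spamdexing]\r\n' >  "$merged"
printf '0.0.0.0 zhirok.com #[Spamdexing]\n'   >> "$merged"

# Normalise first, then do the unique sort.
# LC_ALL=C keeps the comparison byte-wise and independent of locale.
strip_cr "$merged" | LC_ALL=C sort --unique
```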

Or perhaps one of the duplicate lines has trailing whitespace at the end of the line?
In which case, run the merged file through sed to remove trailing whitespace characters (spaces, tabs etc) at the end of each line, before doing your final, unique sort.
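A minimal sketch of that sed step, with made-up sample lines and a hypothetical helper name `trim_trailing_ws`. The `[[:space:]]` class also matches a carriage return, so under that assumption the same expression covers DOS line endings as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove any run of whitespace at the end of each line.
trim_trailing_ws() {
    sed 's/[[:space:]]*$//' "$@"
}

# Made-up sample: the same entry with trailing spaces, a trailing tab, and clean.
raw=$(mktemp)
printf '0.0.0.0 zhirok.com #[Spamdexing]  \n'  >  "$raw"
printf '0.0.0.0 zhirok.com #[Spamdexing]\t\n'  >> "$raw"
printf '0.0.0.0 zhirok.com #[Spamdexing]\n'    >> "$raw"

# Trim first, then do the unique sort (LC_ALL=C for a byte-wise comparison).
trim_trailing_ws "$raw" | LC_ALL=C sort --unique
```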

And if the size of the file being sorted is somehow the problem - which seems unlikely, then you could perhaps try an extra sort, in order to trim any remaining duplicates.
 
FBClark
Thanks. I tried adding a couple lines of sorts through the script. I also tried piping through uniq as well as using the --unique switch in sort. There was the possibility of a few trailing spaces and I'd already trimmed them. I'm leaning toward the EOL being the problem. If it is, then I'll have to install dos2unix on my PC as well as each one I'm planning on using this script on. I was really hoping for a bash cure, but maybe there isn't one.
I have to say, for my first shot with bash, I've surprised the crap out of myself. After playing around with bash some more, I might take a look at a more advanced language. If it rolls well, I might look for a part time gig as an amateur. Since I'm retired that'll keep the brain cells firing!
 
FBClark
Is it possible that the line-endings on one of the duplicates is different?
I was leaning that way. After trying some other options including another new one for me 'tr' with no success, I broke down, installed dos2unix and used it with success. That was the item! Then I played a bit and came up with a way to test for d2u on the system, then inserted an installation if it isn't on the system. I'd hoped not to go that route, but, I've learned quite a bit in the last couple weeks!
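The test-then-install step mentioned above might look something like the sketch below. It is not the script from this thread: `have_cmd` and `ensure_dos2unix` are hypothetical names, and the install line assumes a Debian-based system such as LMDE where apt-get is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the named command is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Assumption: Debian-based system with apt-get as the package manager.
ensure_dos2unix() {
    if have_cmd dos2unix; then
        echo "dos2unix already installed"
    else
        echo "dos2unix missing, installing"
        # Report failure to the caller if the install does not succeed.
        sudo apt-get install -y dos2unix || return 1
    fi
}
```

Calling `ensure_dos2unix || exit 1` near the top of the script would stop the run early if the install fails, rather than carrying on into the hosts file assembly.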
 
FBClark
Well, had to download dos2unix and try it. That did the trick. Then I had to add a test to see if dos2unix would be on the system and add an installation if it wasn't. I also went through the script and reorganized to make it more efficient and orderly and managed to remove some lines of script in the process!
I took it to my wife's 2in1 and it purred like a kitten! Everything including the dos2unix section went smoothly. Then ...
I took it to my wife's old Toshiba and crap broke loose. First, the dos2unix install mangled and I had to shut the script down and manually install. I restarted and then it threw a 'file not found' at me during the hosts file assembly. I hadn't got anywhere near that section with the install fiasco. I shut down again and checked the path. File was there. I checked my typing. It was OK. What??? I ran it again and this time it flew through without a hitch.
Oh! Toshiba! Maybe I should have translated the script into Japanese?
 