Trying to count CAD files recursively on root while avoiding counting duplicates (RHEL 7.9)

peskysushi

New Member
Credits
32
I am currently counting files of the form ??-????-????-?.prt on root using the following syntax:
[~] find / -iname "??-????-????-?.prt" | wc -l

It works and returns a number, but I am not sure whether this gives me a recursive search count or not.
I also want to be able to count duplicates of this file type so I can eliminate them from the total count.

I also tried to save the output to a file, prt_files.txt:
find /na-* -name "??-????-????-?.prt" >> /na-projects8/prt_files.txt


Then I sorted the output with sort and used uniq -d to identify the duplicates, but it still came back with zero duplicates. Not sure why it isn't working.

Thanks
 


JasKinasis

Well-Known Member
Credits
4,005
I imagine it’s because although some of the file-names might be the same - the paths to them are different. So the uniq command will view them as unique objects.
E.g.
If find found two .prt files called file.prt on different paths:
/path/to/file.prt
And:
/path/to/another/file.prt

uniq would view them both as unique because their paths are different.

So to work out if there are duplicate file-names, you’d need to process the results from find slightly differently.
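That filename-only processing could be sketched like this (assuming GNU find's -printf, which RHEL 7.9 ships; /tmp/prt_demo and the part numbers below are made-up examples):

```shell
# Build a small test tree containing one duplicated filename
mkdir -p /tmp/prt_demo/a /tmp/prt_demo/b
touch /tmp/prt_demo/a/01-2345-6789-0.prt
touch /tmp/prt_demo/b/01-2345-6789-0.prt   # same name, different path
touch /tmp/prt_demo/a/99-1111-2222-3.prt

# -printf '%f\n' prints only the filename, not the full path,
# so sort | uniq -d can spot names that occur more than once
find /tmp/prt_demo -iname "??-????-????-?.prt" -printf '%f\n' | sort | uniq -d
# → 01-2345-6789-0.prt
```

The key point is that uniq only compares whole lines, so the paths have to be stripped before it can see two filenames as equal.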

I’ll have a think and will try to come up with something this evening, after I’ve finished work.

Are you determining that a file is a duplicate purely based on filename? Or are there other criteria involved?

Is it possible that there could be two files with the same name, but with completely different, unrelated content?

One of the duplicates could be an earlier, or later iteration of a file too. So in this case, I’d assume that you want to keep the latest version.

There are a number of pre-existing tools that can be used to locate duplicate files. Offhand I think they’re more generalised and look at the entire file-system for duplicate files of any type. I’m not sure if there are any where you can specify a particular file-type to search for duplicates of.
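One way to restrict a content-level duplicate check to a single file type, without a dedicated tool, is to checksum only the matching files and group on the hash column. A sketch (md5sum output starts with a 32-character hash; /tmp/dup_demo is a made-up example tree):

```shell
# Two files with identical content, one with different content
mkdir -p /tmp/dup_demo/x /tmp/dup_demo/y
echo "rev A" > /tmp/dup_demo/x/part.prt
echo "rev A" > /tmp/dup_demo/y/part.prt    # true duplicate (same bytes)
echo "rev B" > /tmp/dup_demo/x/other.prt

# Checksum only .prt files, sort by hash, then show groups whose
# first 32 characters (the md5 hash) repeat
find /tmp/dup_demo -name "*.prt" -exec md5sum {} + \
  | sort | uniq -w32 --all-repeated=separate
```

uniq -w32 compares only the first 32 characters of each line, i.e. the hash, so files are grouped by content regardless of name or path.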
Again, I’ll take a look later if nobody else manages to find a solution for this!
 

peskysushi

New Member
Credits
32
Thank you JasKinasis. Yes, two files could have the same name, but may have different content. Any slight change an engineer makes to the CAD design would change the content. So the duplicates can be alike in name, but different in content.

I want to add some clarifying remarks that might help you find any potential flaws in my process. The problem might be in the uniq -d argument. Not sure. I think each duplicate will still have a unique inode.
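Since each copy has its own inode, comparing checksums instead would show whether same-named files also share content. A sketch of that check (/tmp/name_demo and the part number are hypothetical):

```shell
# Two files with the same name AND the same bytes
mkdir -p /tmp/name_demo/p /tmp/name_demo/q
echo "same" > /tmp/name_demo/p/11-2222-3333-4.prt
echo "same" > /tmp/name_demo/q/11-2222-3333-4.prt

# Emit "filename hash" pairs, then keep lines where both repeat:
# a repeated pair means a same-name, same-content duplicate
find /tmp/name_demo -name "*.prt" -exec md5sum {} + \
  | awk '{n=split($2,a,"/"); print a[n], $1}' | sort | uniq -d
```

A name that repeats with a *different* hash would be a name clash (different revisions), not a true duplicate.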


Thanks for the update, KGILL. Here are some clarifying remarks that I hope might help.

JasKinasis I finally ended up using "sort -u" to report only unique files and "cut -d'/' -f1" to remove file path information, along with "| wc -l" to count the files, and found that there are more duplicates than there are unique files: 34 million vs 1.2 million. Thanks, Jeff

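As a side note, that total-vs-unique comparison can be done without cut at all, since GNU find's -printf '%f\n' prints the basename directly (a sketch; /tmp/count_demo is a made-up tree):

```shell
# Three matching files, two of which share a name
mkdir -p /tmp/count_demo/a /tmp/count_demo/b
touch /tmp/count_demo/a/10-0000-0000-1.prt /tmp/count_demo/b/10-0000-0000-1.prt
touch /tmp/count_demo/a/20-0000-0000-2.prt

total=$(find /tmp/count_demo -iname "??-????-????-?.prt" -printf '%f\n' | wc -l)
unique=$(find /tmp/count_demo -iname "??-????-????-?.prt" -printf '%f\n' | sort -u | wc -l)
echo "$total total, $unique unique"
# → 3 total, 2 unique
```

One caveat on the cut approach: with absolute paths, -f1 of a '/'-delimited cut is the empty field before the first slash, so the last field (the filename) is the one to keep.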
 

wizardfromoz

Super Moderator
Staff member
Gold Supporter
Credits
7,453
Also, wiz is probably going to move this to the command line section.
You got that in one, Maine friend ;)

G'day Jeff and welcome to linux.org :)

I am moving this to Command Line, which also handles scripting inquiries - all participants and Viewers are notified.

Cheers

Chris Turner
wizardfromoz
 
