theweaselking: (Work now)
[personal profile] theweaselking
Due to (inherited, not my fault) poor planning, I have a folder structure with a few million files in a few tens of thousands of nested folders.

These folders are all stored in an ext3 file system, which is case sensitive.
They are largely being accessed by Windows clients, which are not case sensitive, via Samba, which is kinda case sensitive but mostly defers to Windows.
These files are being collected from other places, and being dropped into this location by both Windows and Linux clients.

There are an unknown-but-at-least-three-so-far number of folders with the same name, differing only in case, e.g. "Pogo" and "POGO". And this is a royal pain in the ass when a Linux rsync job drops files from otherplace\POGO into thisplace\POGO and then a Windows user clicks on POGO and gets the contents of Pogo, because it's alphabetically first and it's the same name and thus the same folder, right? Hey, where are my files? Why aren't my files there?

There's gotta be an easy way - some "find" flag or some reasonably non-stupid bash script - to get a list of all cases where there's multiple paths differing only in case, that I can wind up and let run on this for a week or so and find all the duplicates. Ideally there'd also be a way to trigger an automated rename on one of them, but a complete list would be a perfectly cromulent start.

I mean, I *could* write a script to do it. But I don't WANNA. This is a wheel that has to have been invented previously, right? Someone's got a magic spell to do this in a much simpler way?

(no subject)

Date: 2014-05-06 06:40 pm (UTC)
From: [identity profile] rbarclay.livejournal.com
Ugly:
find /tmp/ 2>/dev/null | tr '[:lower:]' '[:upper:]' | sort | uniq -c | grep -Ev '^ *1 '

(no subject)

Date: 2014-05-06 07:04 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
That is, indeed, quite ugly.

(no subject)

Date: 2014-05-06 08:47 pm (UTC)
From: [identity profile] the-imp.livejournal.com
Slightly less ugly:

find /tmp/ 2>/dev/null | sort -f | uniq -id
Edited Date: 2014-05-06 08:49 pm (UTC)

(no subject)

Date: 2014-05-07 06:54 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
No good: This appears to consider the directory /POGO to be a match for the file /POGO/pong.txt in some circumstances, resulting in ~800,000 false positives.

(no subject)

Date: 2014-05-09 11:19 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
For the record, I'm an idiot who saw 800,000 entries, couldn't find the first few matches, and figured there must be false positives.

No, there's 800,000 duplicates. And I'm not 100% sure why your script outputs /POGO when /POGO doesn't have a dupe but there's /POGO/pong.txt and /POGO/PonG.txt, and I missed that when looking because they weren't next to each other. Because, of course, "p" and "P" *shouldn't* be next to each other.

I'm a dumbass. Thanks for the help.
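On the "only one of the pair shows up" point: `uniq -d` prints a single line per group of repeated lines, so the other members of each group never appear in the output. A variant that prints every member, with a blank line between groups, might look like the sketch below. It assumes GNU `uniq` (for `--all-repeated`) and paths without newlines, and the function name `show_case_groups` is invented for the example.

```shell
#!/usr/bin/env bash
# Sketch: read paths on stdin and print every member of each group of paths
# that are identical when case is ignored, with a blank line between groups.
# Assumes GNU sort and uniq; show_case_groups is a hypothetical name.
show_case_groups() {
    # LC_ALL=C makes sort and uniq agree on how case is folded.
    LC_ALL=C sort -f | LC_ALL=C uniq -i --all-repeated=separate
}
```

Used as `find /tmp/ 2>/dev/null | show_case_groups`, this keeps /POGO/PonG.txt and /POGO/pong.txt next to each other in the output, which makes the matches easier to spot by eye.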

(no subject)

Date: 2014-05-06 10:47 pm (UTC)
From: [identity profile] lafinjack.livejournal.com
Whosever bright idea case sensitivity was should have automatically included this function back when the idiot thought case sensitivity was a good idea.

(no subject)

Date: 2014-05-06 11:04 pm (UTC)
From: [identity profile] rbarclay.livejournal.com
ITYM "case INsensitivity". TYVM.

(no subject)

Date: 2014-05-07 12:41 am (UTC)
From: [identity profile] theweaselking.livejournal.com
I'm pretty sure that guy's also dead.

(no subject)

Date: 2014-05-07 12:41 am (UTC)
From: [identity profile] theweaselking.livejournal.com
I'm pretty sure that guy's dead.

(no subject)

Date: 2014-05-07 08:12 am (UTC)
From: [identity profile] lafinjack.livejournal.com
The guy who wrote Fortran only died in 2007, so this guy could easily still be alive.

(no subject)

Date: 2014-05-07 01:19 pm (UTC)
From: [identity profile] pappy-legba.livejournal.com
Case sensitivity: because the world does not yet have enough opportunities for syntax errors.

(no subject)

Date: 2014-05-07 10:27 pm (UTC)
From: [identity profile] jsbowden.livejournal.com
Samba has an option to ignore case sensitivity for this very reason:

https://lists.samba.org/archive/samba/2008-January/137622.html

(no subject)

Date: 2014-05-07 11:00 pm (UTC)
From: [identity profile] theweaselking.livejournal.com
That appears to prevent a Windows user from creating duplicate directories via Samba.

It does not appear to address "the directories are fucked up by non-Samba means, and then Samba shows them to the Windows users"

And it definitely doesn't fix "Well, the shit's fucked up NOW"

(I have eliminated the process that creates duplicate entries. Now, any duplicates will have to be deliberately created. However, that doesn't get rid of any past duplicates.)
