Code Answer: Elegant way to search for UTF-8 files with BOM?

For debugging purposes, I need to recursively search a directory for all files which start with a UTF-8 byte order mark (BOM). My current solution is a simple shell script:

find . -type f |
while IFS= read -r file
do
    # Compare the first three bytes with the UTF-8 BOM (bash $'...' quoting)
    if [ "$(head -c 3 -- "$file")" = $'\xef\xbb\xbf' ]
    then
        echo "found BOM in: $file"
    fi
done

Or, if you prefer short, unreadable one-liners:

find . -type f|while IFS= read -r file;do [ "$(head -c3 -- "$file")" = $'\xef\xbb\xbf' ] && echo "found BOM in: $file";done

It doesn't work with filenames that contain a line break, but such files are not to be expected anyway.
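
The line-break limitation can be avoided by letting find pass the names to a small inline script as arguments instead of piping them through read. The following is only a minimal sketch, assuming a POSIX sh and that head -c, od, and tr are available; find_bom_files is a hypothetical helper name, not an existing tool.

```shell
# find_bom_files is a hypothetical helper name, not part of any tool.
# Prints every regular file under DIR (default: .) whose first three
# bytes are EF BB BF.
find_bom_files() {
    # find passes the names to the inner sh as arguments, so they are
    # never parsed as lines and embedded newlines are harmless.
    find "${1:-.}" -type f -exec sh -c '
        for file do
            # Dump the first three bytes as hex: efbbbf means BOM.
            first=$(head -c 3 -- "$file" | od -An -tx1 | tr -d " \n")
            if [ "$first" = efbbbf ]; then
                printf "found BOM in: %s\n" "$file"
            fi
        done
    ' sh {} +
}
```

Calling find_bom_files with a directory argument should list the matching files in the same "found BOM in:" format as the script above. Comparing a hex dump rather than the raw bytes also sidesteps shell quirks with binary data in command substitutions.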

Is there any shorter or more elegant solution?

Are there any interesting text editors or macros for text editors?

From Stack Overflow: vog

If you accept some false positives (in case there are non-text files, or in the unlikely case there is a ZWNBSP in the middle of a file), you can use grep:
```
fgrep -rl `echo -ne '\xef\xbb\xbf'` .
```
From CesarB
```
find -type f -print0 | xargs -0 grep -l `printf '^\xef\xbb\xbf'` | sed 's/^/found BOM in: /'
```
- find -print0 puts a null \0 between each file name instead of using new lines
- xargs -0 expects null separated arguments instead of line separated
- grep -l lists the files which match the regex
- The regex ^\xef\xbb\xbf isn't entirely correct, as it will also match UTF-8 files without a BOM if they have a zero-width non-breaking space at the start of any line
MSalters: You still need a "head 1" in the pipe before the grep
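
To illustrate the caveat about the line-anchored regex, here is a small sketch that builds a file whose second line starts with the bytes EF BB BF and compares the grep match with a check of the first three bytes only. The helper name has_bom and the temporary demo file are assumptions made for this example.

```shell
# has_bom is a hypothetical helper: it succeeds only when the first
# three bytes of the file are EF BB BF.
has_bom() {
    [ "$(head -c 3 -- "$1" | od -An -tx1 | tr -d ' \n')" = efbbbf ]
}

# Demo file: plain first line, U+FEFF at the start of the second line.
demo=$(mktemp)
printf 'first line\n\357\273\277second line\n' > "$demo"

# The three BOM bytes, written with octal escapes for portability.
bom=$(printf '\357\273\277')

# The line-anchored grep reports the file ...
LC_ALL=C grep -l "^$bom" "$demo"

# ... while the check of the first three bytes does not.
if has_bom "$demo"; then
    echo "BOM at byte 0"
else
    echo "no leading BOM"
fi
```

Here grep prints the name of the demo file even though it has no BOM, while the byte check prints "no leading BOM".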

From Jonathan Wright
I would use something like:
```
grep -orHbm1 "^`echo -ne '\xef\xbb\xbf'`" . | sed '/:0:/!d;s/:0:.*//'
```
Which will ensure that the BOM occurs starting at the first byte of the file.

From Marcus Griep
What about this one simple command, which not only finds but also removes the nasty BOM? :)
```
find . -type f -exec sed '1s/^\xEF\xBB\xBF//' -i.bak {} \; -exec rm {}.bak \;
```
I love "find" :)
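
For anyone who would rather not depend on GNU sed's \x escapes and -i option, the removal can also be sketched with head and tail. This is a different technique from the sed command above, shown only as an illustration: strip_bom is a hypothetical helper name, and the sketch assumes head -c, tail -c, od, and mktemp are available.

```shell
# strip_bom is a hypothetical helper name. It removes a UTF-8 BOM only
# when the file starts with one, and leaves every other byte untouched.
strip_bom() {
    if [ "$(head -c 3 -- "$1" | od -An -tx1 | tr -d ' \n')" = efbbbf ]; then
        tmp=$(mktemp) || return 1
        # tail -c +4 copies everything from the fourth byte onward.
        # Writing back with cat keeps the original inode and permissions.
        tail -c +4 -- "$1" > "$tmp" && cat -- "$tmp" > "$1"
        status=$?
        rm -f -- "$tmp"
        return "$status"
    fi
}
```

Because the check looks only at the first three bytes, an EF BB BF sequence later in the file is left alone, and running the helper twice on the same file changes nothing the second time.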

If you just want to list the files that contain the BOM bytes (anywhere in the file, not only at the start), use this one:
```
grep -rl $'\xEF\xBB\xBF' .
```
From Denis
```
find . -type f -print0 | xargs -0r awk '
    /^\xEF\xBB\xBF/ {print FILENAME}
    {nextfile}'
```
Most of the solutions given above test more than the first line of the file, even if some (such as Marcus's solution) then filter the results. This solution only tests the first line of each file so it should be a bit quicker.

From

Saturday, February 12, 2011
