Remove BOM

Learn how to remove the BOM


Many programmers run into irritating problems caused by the Byte Order Mark (BOM). They try to remove the BOM character, but it stubbornly stays put in their HTML, XML, or plain text files. On investigation, they find that they need to remove the ÿþ bytes, or the Unicode character 65279, to get rid of an extra space or newline in their files. Let’s examine the issue and see how to remove the BOM, the byte order mark.

What is Byte Order Mark?

Byte Order Mark (or BOM) is a signal that tells the computer how the bytes are ordered in a Unicode document. Because Unicode text can be encoded in units of 8, 16 or 32 bits (UTF-8, UTF-16 and UTF-32), the computer needs to know which encoding, and which byte order, the document uses. The BOM at the very start of the file tells it exactly that.

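BOM is actually the “zero-width non-breaking space” character, represented as U+FEFF (decimal 65279). It is invisible when the file is decoded correctly, but it is not a NULL character: it occupies two to four bytes at the start of the file, depending on the encoding.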

In ISO-8859-1, the BOM appears as:

UTF-8                     ï»¿
UTF-16 (big endian)       þÿ
UTF-16 (little endian)    ÿþ
UTF-32 (big endian)       □□þÿ (□ is the ASCII null character)
UTF-32 (little endian)    ÿþ□□ (□ is the ASCII null character)
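To see where these odd-looking characters come from, here is a small Python sketch (not part of the original workflow) that takes the BOM byte sequences from the standard codecs module and misreads them as ISO-8859-1. The names BOMS and as_latin1 are made up for this illustration.

```python
import codecs

# BOM byte sequences as defined in Python's standard codecs module.
BOMS = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16 (big endian)": codecs.BOM_UTF16_BE,
    "UTF-16 (little endian)": codecs.BOM_UTF16_LE,
    "UTF-32 (big endian)": codecs.BOM_UTF32_BE,
    "UTF-32 (little endian)": codecs.BOM_UTF32_LE,
}


def as_latin1(bom: bytes) -> str:
    """Return the text you get when BOM bytes are misread as ISO-8859-1."""
    return bom.decode("latin-1")


if __name__ == "__main__":
    for name, bom in BOMS.items():
        # repr() makes the NUL bytes of the UTF-32 BOMs visible as \x00.
        print(f"{name:24} {bom.hex(' '):12} {as_latin1(bom)!r}")
    # U+FEFF written as a decimal number, the form HTML entities use.
    print(ord("\ufeff"))
```

Every row is the same character, U+FEFF, encoded in a different way. Only the bytes on disk differ.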

In HTML code, the BOM character can also appear as &#65279;

Remove BOM from an XML file

Just open the file in the vim text editor, run the “nobomb” command, and save the file:

# vim file.xml
:set nobomb
:wq
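If vim is not at hand, roughly the same effect can be had from a few lines of Python. This is only a sketch, assuming the file is UTF-8 encoded. The function name rewrite_without_bom is hypothetical, while the "utf-8-sig" codec is part of the Python standard library.

```python
from pathlib import Path


def rewrite_without_bom(path: str) -> None:
    """Re-save a UTF-8 text file without a leading BOM (illustrative helper)."""
    p = Path(path)
    # "utf-8-sig" drops one leading BOM if present and is harmless otherwise.
    text = p.read_bytes().decode("utf-8-sig")
    # The plain "utf-8" codec never writes a BOM.
    p.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    import sys

    # Usage: python rewrite_without_bom.py file.xml
    for name in sys.argv[1:]:
        rewrite_without_bom(name)
```

Reading and writing raw bytes, rather than opening the file in text mode, leaves the line endings exactly as they were.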
Removal from HTML Files

When faced with the BOM character problem, many web page developers try setting the encoding of their page to “charset=utf-8” through a meta tag. But this only declares the encoding of the page; it does not remove a BOM that is already saved in the file. If a BOM character is causing problems in your HTML display, the problem actually lies in how the text editor saves the file and not in your HTML/CSS code.

Most HTML editors, such as Dreamweaver, Programmer’s Notepad and TextPad, provide a way to disable the BOM. The option usually sits in the place where you set the file encoding, under a name like “UTF-8 without BOM” or “UTF-8 No BOM”.

The appearance of the &#65279; character in your HTML code can also be solved with the same encoding change in your HTML editor. Just set the encoding to UTF-8 without BOM and then save the file again.

Setting UTF without BOM character in Macromedia Dreamweaver


Setting UTF without BOM in Programmer's Notepad


Detection and Removal of BOM in Linux

Linux commands make it easy to find the BOM character and then remove it from files. Tools like find, grep and sed, combined with a little shell programming, make it a cakewalk. Here is how we can do it:

Find the list of files containing BOM characters

find /var/www/website/ -type f -print -exec hd -n 3 {} \; | grep -B 1 "ef bb bf" | grep "some_part_of_the_path" > bom_lines.txt
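The same search can be sketched in Python for systems where hd or grep are not available. This is an illustration rather than a drop-in replacement for the command above: it only looks for the UTF-8 BOM (the "ef bb bf" bytes), and the names find_bom_files and UTF8_BOM are hypothetical.

```python
import os
from typing import List

UTF8_BOM = b"\xef\xbb\xbf"


def find_bom_files(root: str) -> List[str]:
    """Return the paths under root whose first three bytes are the UTF-8 BOM."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    # Only the start of the file matters, so read 3 bytes.
                    if fh.read(3) == UTF8_BOM:
                        found.append(path)
            except OSError:
                # Unreadable file: skip it and carry on.
                continue
    return sorted(found)


if __name__ == "__main__":
    import sys

    # Usage: python find_bom_files.py /var/www/website/ > bom_lines.txt
    for path in find_bom_files(sys.argv[1]):
        print(path)
```

Because only the first three bytes of each file are read, the scan stays quick even on large directory trees.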

Remove BOM character

while IFS= read -r l; do sed -i '1 s/^\xef\xbb\xbf//' "$l"; done < bom_lines.txt
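For comparison, here is a rough Python counterpart of the loop above. It assumes a list file with one path per line, as in bom_lines.txt, and handles only the UTF-8 BOM. The function names are invented for this sketch.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def strip_bom(data: bytes) -> bytes:
    """Drop a single leading UTF-8 BOM, if there is one."""
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data


def strip_bom_from_listed_files(list_path: str) -> int:
    """Rewrite every file named in list_path without its BOM.

    Returns the number of files that were actually changed.
    """
    with open(list_path, encoding="utf-8") as listing:
        paths = [line.rstrip("\r\n") for line in listing if line.strip()]
    changed = 0
    for path in paths:
        with open(path, "rb") as fh:
            data = fh.read()
        cleaned = strip_bom(data)
        if cleaned != data:
            # Write back only when something was removed.
            with open(path, "wb") as fh:
                fh.write(cleaned)
            changed += 1
    return changed


if __name__ == "__main__":
    print(strip_bom_from_listed_files("bom_lines.txt"))
```

Files that have no BOM are left untouched, so running the script a second time is harmless.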

So that was it! This is how you can remove the BOM character from your program or text file. I decided to write this article because I had to waste two hours learning how to get rid of the nuisance of the ÿþ Unicode 65279 character. Once I had learned it, I thought it should be documented so that other programmers can save some time!

I hope it was useful for you. Thank you for using TechWelkin.