BOM Character? What is that? What does it have to do with Microsoft Excel?
Yes, I know, the title of this post alone raises a lot of questions.
This blog tells the story of how file encoding, and Microsoft Excel's interpretation of it, can cause big issues. In one of our projects, we had to create CSV files. CSV files are really simple to construct: the columns are separated by commas and the rows by newline characters. Our development machines are mostly Linux, so during development we used OpenOffice Calc to open the files and check the validity of the data. All the data would show up perfectly, no matter what language was used in the file. At no point in the development cycle did we think we would spend two weeks solving the mystery of why the same file contained gibberish characters when opened with Microsoft Excel, while OpenOffice Calc showed everything nice and dandy. Our Quality Assurance guy documented the issue; I almost did not want to believe him, but eventually I saw the truth in his tests. Now the mystery had to be solved.
We had built CSV files before this project, but XML/XSL transformations (some proprietary stuff) had always taken care of the encoding. After lots of Google searching and forum reading, we found a good Stack Overflow discussion about MS Excel and CSV files. Many approaches were suggested there; we tried several, but only one finally worked: adding a Byte Order Mark (BOM) at the very beginning of the file. We ran a small experiment that exercised only the file-creation code and confirmed that Microsoft Excel could now read the file correctly.
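To make the fix concrete, here is a minimal sketch of the idea in Python (the function name and sample data are my own, not from our project): write the three UTF-8 BOM bytes (EF BB BF) before any CSV content, so Excel detects the file as UTF-8.

```python
import csv

# UTF-8 byte order mark; Excel uses it to detect the file's encoding.
UTF8_BOM = "\ufeff"

def write_csv_with_bom(path, rows):
    """Write rows to a CSV file, prefixed with a UTF-8 BOM so that
    Microsoft Excel interprets the file as UTF-8 instead of guessing."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(UTF8_BOM)  # becomes the bytes EF BB BF on disk
        writer = csv.writer(f)
        writer.writerows(rows)

write_csv_with_bom("report.csv", [["Name", "City"], ["Müller", "Zürich"]])
```

In Python specifically, opening the file with `encoding="utf-8-sig"` writes the same BOM automatically; the explicit write above just makes the trick visible.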
Giveaways:
A] Two absolutely great readings:
1) Joel Spolsky's blog on character encoding.
2) BOM Character Wikipedia Entry
B] Write a small, specialized class that solves only the given problem; this reduces testing time.
Next time you have to construct a CSV file that may or may not contain foreign characters, use a BOM :)
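As an illustration of lesson B], a small, single-purpose class might look like this. The class name and its API are hypothetical, not the one from our project; the point is that it does exactly one thing, which keeps it easy to test in isolation.

```python
import csv
import io

class BomCsvBuilder:
    """Hypothetical single-purpose builder: it only knows how to produce
    UTF-8 CSV content, prefixed with a BOM, that opens cleanly in Excel."""

    def __init__(self):
        self._buffer = io.StringIO()
        self._writer = csv.writer(self._buffer)

    def add_row(self, *cells):
        self._writer.writerow(cells)
        return self  # allow chaining

    def build(self):
        # Prepend the BOM exactly once, when the content is assembled.
        return "\ufeff" + self._buffer.getvalue()

content = BomCsvBuilder().add_row("Name", "City").add_row("渡辺", "東京").build()
```

Because the class never touches the filesystem, a unit test only has to check the returned string: does it start with the BOM, and are the non-ASCII cells intact?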