Friday, April 8, 2011

C# Help reading foreign characters using StreamReader

I'm using the code below to read a text file that contains foreign characters. The file is encoded as ANSI and looks fine in Notepad. The code doesn't work: when the file values are read and shown in the datagrid, the characters appear as squares. Could there be another problem elsewhere?

    StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.ANSI);
    using (reader = File.OpenText(inputFilePath))

Thanks

Update: I have tried all the encodings found under System.Text.Encoding, and all of them fail to show the file correctly.

Update 2: I've changed the file encoding (resaved the file) to Unicode and used System.Text.Encoding.Unicode, and it worked just fine. So why did Notepad read the original correctly? And why didn't System.Text.Encoding.Unicode read the ANSI file?

From Stack Overflow
  • Yes, the problem could be the actual encoding of the file, which is probably Unicode. Try UTF-8, as that is the most common Unicode encoding. Otherwise, if the file is ASCII, then the standard ASCII encoding should work.

  • Try a different encoding such as Encoding.UTF8. You can also try letting StreamReader find the encoding itself:

        StreamReader reader = new StreamReader(inputFilePath, System.Text.Encoding.UTF8, true);

    Edit: Just saw your update. Try letting StreamReader do the guessing.

  • You may also try the Default encoding, which uses the current system's ANSI code page.

        StreamReader reader = new StreamReader(inputFilePath, Encoding.Default, true);

    When you open the original file in Notepad and use the "Save As" menu, look at the encoding combo box. It will tell you which encoding Notepad guessed the file uses.

    Also, if it is an ANSI file, the detectEncodingFromByteOrderMarks parameter will probably not help much, since ANSI files carry no byte order mark.

  • Using Encoding.Unicode won't accurately decode an ANSI file in the same way that a JPEG decoder won't understand a GIF file.

    I'm surprised that Encoding.Default didn't work for the ANSI file if it really was ANSI. If you ever find out exactly which code page Notepad was using, you could use Encoding.GetEncoding(int).

    In general, where possible I'd recommend using UTF-8.
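
The suggestions in the answers above can be combined into a minimal sketch: pass an explicit code page as the fallback encoding, and leave byte order mark detection switched on so that a file resaved as Unicode is still read correctly. The names EncodingSketch, ReadWithCodePage and DetectedEncodingName are hypothetical, and code page 28591 (Latin-1) in the usage note only stands in for whatever code page the file really uses. This is an assumption-laden illustration, not the asker's actual fix. On newer .NET runtimes, Windows code pages such as 1252 may need an extra encoding provider registered before GetEncoding accepts them.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingSketch
{
    // Reads the whole file using the given code page as the fallback encoding.
    // The third constructor argument (detectEncodingFromByteOrderMarks) is true,
    // so a UTF-8 or UTF-16 byte order mark, if present, overrides the fallback.
    public static string ReadWithCodePage(string path, int codePage)
    {
        Encoding fallback = Encoding.GetEncoding(codePage);
        using (var reader = new StreamReader(path, fallback, true))
        {
            return reader.ReadToEnd();
        }
    }

    // Reports which encoding StreamReader settled on for the file.
    // Detection only happens on the first read, so one character is read first.
    public static string DetectedEncodingName(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            reader.Read();
            return reader.CurrentEncoding.WebName;
        }
    }
}
```

For example, EncodingSketch.ReadWithCodePage(inputFilePath, 28591) would decode a Latin-1 file, while DetectedEncodingName would be expected to report "utf-16" for a file that Notepad saved as Unicode, because that file starts with a byte order mark. Separately, the snippet in the question replaces its reader with File.OpenText(inputFilePath), which decodes as UTF-8, so the encoding passed to the StreamReader constructor may never have been applied. That could explain why every encoding appeared to fail.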
