NetTalk Central

Author Topic: Problems with non-ascii characters  (Read 16611 times)

vklemet

  • Newbie
  • *
  • Posts: 30
Problems with non-ascii characters
« on: March 24, 2015, 11:30:27 PM »
Hello,

We are using C9.1 and NT 8.39 to build an application with an email client that receives emails for processing.

When we receive mails with non-ASCII characters in the subject or message, encoded as quoted-printable, the decoding in NetTalk does not always work.

Here are a couple of examples of the problem. In both cases the subject should be "FW: Projektin ja ajanhallintasofta omaan  toimistokäyttöön_150124_TK.docx"

In the first example the decoded subject is shown in the attached png, here is the encoded subject:

Code: [Select]
Subject: =?utf-8?Q?FW:_Projektin_ja_ajanhallintasof?=
=?utf-8?Q?ta_omaan__toimistok=C3=A4ytt=C3=B6=C3=B6n=5F15012?=
=?utf-8?Q?4=5FTK.docx?=

The next example has a different problem. The decoding doesn't seem to happen at all, so the subject after decoding is still FW:_Projektin_ja_ajanhallintasofta_omaan_toimistok?=E4ytt=F6=F6n=5F150124=5FTK.docx:

Code: [Select]
Subject: =?iso-8859-1?Q?FW:_Projektin_ja_ajanhallintasofta_omaan_toimistok?=
=?iso-8859-1?Q?=E4ytt=F6=F6n=5F150124=5FTK.docx?=
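
For reference, both headers above use the RFC 2047 "encoded-word" syntax. The following is a minimal sketch in Python (not Clarion, and not what NetTalk does internally) of what a decoder would be expected to return for them. It uses the standard library's `email.header` module; `decode_subject` is a hypothetical helper name, and the sketch assumes the continuation lines started with a space in the raw message, as folded headers normally do.

```python
from email.header import decode_header, make_header


def decode_subject(raw_value):
    """Decode RFC 2047 encoded-words in a header value to a Unicode string."""
    return str(make_header(decode_header(raw_value)))


# Header values from the two examples above. Each folded continuation line
# is assumed to begin with a single space.
utf8_subject = (
    "=?utf-8?Q?FW:_Projektin_ja_ajanhallintasof?=\n"
    " =?utf-8?Q?ta_omaan__toimistok=C3=A4ytt=C3=B6=C3=B6n=5F15012?=\n"
    " =?utf-8?Q?4=5FTK.docx?="
)
iso_subject = (
    "=?iso-8859-1?Q?FW:_Projektin_ja_ajanhallintasofta_omaan_toimistok?=\n"
    " =?iso-8859-1?Q?=E4ytt=F6=F6n=5F150124=5FTK.docx?="
)

print(decode_subject(utf8_subject))
print(decode_subject(iso_subject))
```

Both calls yield the readable subject with the Finnish characters intact. The only difference is that the first header encodes two spaces before "toimistok" and the second encodes one.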

Does anyone have any ideas how to make the decoding work properly?

-Vesa-

[attachment deleted by admin]

Bruce

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 11158
Re: Problems with non-ascii characters
« Reply #1 on: May 14, 2015, 12:21:20 AM »
Hi Vesa,

I feel like maybe you have got the examples backwards? In my test here the ?iso-8859-1? worked but the utf-8 one displayed with extra characters.

Can you confirm that please?

Also - are you calling the StringTheory method to decode the header (encodedWordDecode method?) I'm not seeing that in the class, but possibly it's done in your procedure code?

cheers
Bruce

« Last Edit: May 14, 2015, 12:27:04 AM by Bruce »

vklemet

Re: Problems with non-ascii characters
« Reply #2 on: May 14, 2015, 10:49:03 PM »
Hi Bruce,

and thanks for the reply!

I checked the examples again, and they are correct. The UTF-8 version looks like the one in the png, and the ?iso-8859-1? version is not decoded at all. Could the decoding outcome differ with a different regional setting in Windows? I can forward you the actual emails that we have problems with, if you would like to do some testing with them?

The StringTheory method you mentioned is not called in our code. Would the StringTheory method work better?

There is actually not a lot of hand code in the email receiving. In the Done embed of the NetEmail object we just take self.Subject, copy it into a CSTRING and save it:

Code: [Select]
   if finished = 0
        ! Display the message
        CLEAR(VIESTI)

        VST:IDSTRING = clip (self.messageID)

        Access:VIESTI.PrimeAutoInc()
        VST:FROM = clip (self.From)
        Message(self.subject)
        VST:SUBJECT = clip (self.subject)
        VST:TOLIST = clip (self.ToList)
        VST:CCLIST = clip (self.ccList)

        VST:LAHETYSPVM = ParseEmailDate(clip (self.sentDate))
        VST:LAHETYSKLO = ParseEmailTime(clip (self.sentDate))

        if self.MessageTextLen > 0
          VST:TEXTMESSAGE = clip (self.MessageText[1:self.MessageTextLen])
        !MESSAGE(self.wholemessage)
        else
          VST:TEXTMESSAGE = ''
        end

        if self.MessageHtmlLen > 0
          !MyHTML = clip (self.MessageHtml[1:self.MessageHtmlLen])   No suitable place in the message table!
        else
          !MyHTML = ''
        end

        VST:PROJEKTI = 0
        VST:APUPROJ = 0
        Access:VIESTI.TryInsert()

   else

Where is the decoding done in NetEmail-class? I quickly browsed through .clw, but it did not catch my eye.

 BR,
 -Vesa-


Bruce

Re: Problems with non-ascii characters
« Reply #3 on: May 17, 2015, 10:01:15 PM »
Hi Vesa,

>>  I can forward you the actual emails that we have problems with, if you would like to do some testing with them?

yes please - send them to nettest at you know where .com

>> Where is the decoding done in NetEmail-class? I quickly browsed through .clw, but it did not catch my eye.

I've looked as well, but not found it - I'm wondering if I do it at all - or maybe it's done in your embed code? I'll know more once I've run your emails through the test here...

cheers
Bruce

vklemet

Re: Problems with non-ascii characters
« Reply #4 on: May 17, 2015, 10:44:12 PM »
Hi Bruce,

I have sent you two emails which we have problems with. The one with subject starting "VL: Projektin ja ajanhallinta" has the problem described in the first example of the first post and the one starting "FW: Projektin ja ajanhallinta" has the problem described in the second example.

I do think that the NetEmail class does the decoding somewhere along the way, because the subject has already been decoded (i.e. the "=?utf-8?" part has been removed) by the time we read the self.subject property in the Done method. We have no embed code that decodes the message or subject.

BR,

-Vesa-

Bruce

Re: Problems with non-ascii characters
« Reply #5 on: May 17, 2015, 11:31:39 PM »
Hi Vesa,

I tracked down where the decoding was being done (in the DLL, called from the class) and I've refactored the class to use the StringTheory methods instead. Could you please forward me one of the ISO-8859-1 emails as well so I can confirm that is still ok?

The two you sent me were both utf-8 encoded. At some point to display or store these with Clarion they need to be converted back to ANSI. I am considering the best (smoothest) way to do this at the moment.

Cheers
Bruce

vklemet

Re: Problems with non-ascii characters
« Reply #6 on: May 17, 2015, 11:48:26 PM »
Hi Bruce,

I just sent one ISO-encoded message. I had multiple messages with the same subject and picked the wrong one to send earlier, sorry about that.

So you saw the problem with the UTF-8 message, and it can be fixed by converting the string to ANSI?

BR
-Vesa-

Bruce

Re: Problems with non-ascii characters
« Reply #7 on: May 18, 2015, 02:16:53 AM »
Hi Vesa,

Ok, this applies to build 8.49 (which should be out soon.) Make sure you have the latest StringTheory at that time as well.

a) The email class will handle the various encodings of the Subject better. Internally there's a switch to StringTheory there which gives you a bit more control. If the string is in Unicode then it will be converted back to ANSI automatically.

b) The _default_ code page is st:CP_ISO_8859_1, but you should set the code page before the call to get the emails. What you should set it to does depend on the kinds of characters you would expect to receive - so in your code ISO-8859 is ok, but others may need to change that. The default _may_ change (I don't know what the best default is yet) so it's best to specify the code page explicitly.

cheers
Bruce



vklemet

Re: Problems with non-ascii characters
« Reply #8 on: May 18, 2015, 03:27:29 AM »
Hi Bruce,

>> Hi Vesa,
>>
>> Ok, this applies to build 8.49 (which should be out soon.) Make sure you have the latest StringTheory at that time as well.
>>
>> a) The email class will handle the various encodings of the Subject better. Internally there's a switch to StringTheory there which gives you a bit more control. If the string is in Unicode then it will be converted back to ANSI automatically.

Thanks! We'll update to the latest version and try it out.

>> b) The _default_ code page is st:CP_ISO_8859_1, but you should set the code page before the call to get the emails. What you should set it to does depend on the kinds of characters you would expect to receive - so in your code ISO-8859 is ok, but others may need to change that. The default _may_ change (I don't know what the best default is yet) so it's best to specify the code page explicitly.

I'm not sure I understood this correctly. Are you saying that we need to know what encoding will be used in an email we will receive with the app before compiling?

BR,
-Vesa-

Bruce

Re: Problems with non-ascii characters
« Reply #9 on: May 18, 2015, 05:32:51 AM »
>> Are you saying that we need to know what encoding will be used in an email we will receive with the app before compiling?

No, the problem is that the incoming Subject line (and I'm just talking about this Subject line at this point - the contents of the email itself are dealt with elsewhere) may be in Unicode format.

Clarion doesn't (yet) have a Unicode string type, so it has to be converted to ANSI before you can do anything useful with it. ANSI in turn allows for the use of "code pages" which basically map different characters into the chr(128) to chr(255) space. You might get a Scandinavian character in one email, a Polish one in another, and a Greek one in a third, and no single code page covers all of them, which leads to problems.

So you "best guess" what it might be, and if you get the others, well, there will be some junk in the text. In your case I'm thinking mostly Scandinavian characters, so the best code page is ISO-8859-1, but if you were writing the same server for a primarily Polish audience then you might (should) choose something different.

The problem is not the incoming email (that we understand perfectly). The problem is Clarion and its inability to really deal with Unicode.
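
To make the code page point concrete, here is a small sketch in Python (used purely for illustration, since Clarion has no Unicode string type to demonstrate with). It shows that one byte in the chr(128) to chr(255) range means different characters under different code pages, and that converting Unicode text to a single code page loses whatever that page cannot represent. `to_code_page` is a hypothetical helper, and the sample strings are made up.

```python
def to_code_page(text, code_page):
    """Convert Unicode text to one single-byte code page.

    Characters the code page cannot represent are replaced with '?'.
    """
    return text.encode(code_page, errors="replace")


# The same byte is a different character depending on the code page.
raw = b"\xe4"
for code_page in ("iso-8859-1", "iso-8859-5", "iso-8859-7"):
    print(code_page, raw.decode(code_page))

finnish = "toimistok\u00e4ytt\u00f6\u00f6n"
polish = "\u0141\u00f3d\u017a"  # the city name Lodz with its Polish letters

print(to_code_page(finnish, "iso-8859-1"))  # every character fits
print(to_code_page(polish, "iso-8859-1"))   # two letters are lost as '?'
print(to_code_page(polish, "iso-8859-2"))   # a Central European page keeps them
```

This is the "best guess" trade-off in miniature: the Finnish text survives ISO-8859-1 unchanged, while the Polish text only survives a code page chosen for it.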

cheers
Bruce

vklemet

Re: Problems with non-ascii characters
« Reply #10 on: May 18, 2015, 05:42:12 AM »
Ok, thanks for the clarification, I think I got it now.

And thanks a lot for your help with this problem!

BR,
-Vesa-

vklemet

Re: Problems with non-ascii characters
« Reply #11 on: May 25, 2015, 06:29:01 AM »
Hi again Bruce,

And thanks for the update. It fixed most of our problems. However, a new problem appeared after the update: if the "From" field is encoded, NetTalk no longer decodes it. For example, if the email source contains the following From field:
Code: [Select]
From: =?iso-8859-1?Q?Jonna_Kiiskil=E4?= <jonna.kiiskila@whatever.com>
then self.from in the Done method contains the following text: =?iso-8859-1?Q?Jonna_Kiiskil=E4?= <janne.kiiskila@whatever.com>

There are still a few minor issues with the subject line as well:
1. The decoding doesn't seem to happen:
Code: [Select]
Subject: [Fwd: FW: Projektin ja ajanhallintasofta omaan
 =?iso-8859-1?Q?toimistok=E4ytt=F6=F6n=5F150124=5FTK.docx]?=
shows in self.subject as: [Fwd: FW: Projektin ja ajanhallintasofta omaan =?iso-8859-1?Q?toimistok=E4ytt=F6=F6n=5F150124=5FTK.docx]?=

2. The end gets clipped if there are multiple encoded strings on the same line:
Code: [Select]
Subject: =?UTF-8?Q?Fwd=3A_FW=3A_Projektin_ja_ajanhallintasofta_omaan_toim?=  =?UTF-8?Q?istok=C3=A4ytt=C3=B6=C3=B6n=5F150124=5FTK=2Edocx?=
shows in self.subject as:  Fwd: FW: Projektin ja ajanhallintasofta_omaan_toim
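
For comparison, the cases above (an encoded From name, plain text followed by an encoded-word, and two encoded-words on one line) can be handled by one small routine. The following is a rough sketch in Python, not the NetTalk or StringTheory implementation; `decode_header_value` and `_decode_word` are hypothetical names. It relies on the RFC 2047 rule that whitespace between two adjacent encoded-words is dropped, while whitespace next to ordinary text is kept.

```python
import base64
import re

_WORD = r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?="
# An encoded-word, plus any whitespace separating it from a following
# encoded-word (that whitespace is not part of the decoded text).
_ENCODED_WORD = re.compile(_WORD + r"(?:[ \t]+(?=" + _WORD + r"))?")


def _decode_word(match):
    """Decode a single =?charset?Q-or-B?text?= token to a Unicode string."""
    charset, encoding, text = match.group(1), match.group(2), match.group(3)
    if encoding in "Qq":
        # Q encoding: "_" means space, "=XX" is a hex-encoded byte.
        text = text.replace("_", " ")
        text = re.sub(r"=([0-9A-Fa-f]{2})",
                      lambda m: chr(int(m.group(1), 16)), text)
        raw = text.encode("latin-1")
    else:
        raw = base64.b64decode(text)
    return raw.decode(charset, errors="replace")


def decode_header_value(value):
    """Unfold a header value and decode every encoded-word in it."""
    value = re.sub(r"\r?\n(?=[ \t])", "", value)  # undo header folding
    return _ENCODED_WORD.sub(_decode_word, value)


print(decode_header_value(
    "=?iso-8859-1?Q?Jonna_Kiiskil=E4?= <jonna.kiiskila@whatever.com>"))
print(decode_header_value(
    "[Fwd: FW: Projektin ja ajanhallintasofta omaan\n"
    " =?iso-8859-1?Q?toimistok=E4ytt=F6=F6n=5F150124=5FTK.docx]?="))
print(decode_header_value(
    "=?UTF-8?Q?Fwd=3A_FW=3A_Projektin_ja_ajanhallintasofta_omaan_toim?=  "
    "=?UTF-8?Q?istok=C3=A4ytt=C3=B6=C3=B6n=5F150124=5FTK=2Edocx?="))
```

The sketch skips details a production decoder would need, such as unknown charset names and the RFC 2231 language suffix.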

We managed to handle the issue with the From field and the minor issues with the subject line by adding some code in the _ProcessCleanString embed, before the parent call, so the issue is not very critical at the moment, but I thought it would be good to let you know about these problems.

The "utf-8" encoding combined with the "quoted-printable" content transfer encoding was not working very well in the message's text part, so you may want to check that as well. We added some code in the _ProcessStoreText embed, after the parent call, to do the decoding and the ANSI conversion, and we got it working pretty well, so the problem is not critical for us right now.
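
The body handling described above comes down to three steps: undo the quoted-printable transfer encoding, decode the bytes using the declared charset, and then map the text to one ANSI code page. A minimal sketch of those steps in Python follows (not our Clarion embed code). `decode_text_part` is a hypothetical name, the default code page is an assumption that matches the ISO-8859-1 discussion earlier in the thread, and the sample body is made up.

```python
import quopri


def decode_text_part(raw_body, charset="utf-8", code_page="iso-8859-1"):
    """Undo quoted-printable, decode the charset, then map to one code page."""
    decoded_bytes = quopri.decodestring(raw_body)            # transfer encoding
    text = decoded_bytes.decode(charset, errors="replace")   # bytes -> Unicode
    return text.encode(code_page, errors="replace")          # Unicode -> ANSI


# Made-up sample body; "=" at the end of a line is a soft line break.
sample = (b"Projektin ja ajanhallintasofta omaan toimistok=C3=A4ytt=\n"
          b"=C3=B6=C3=B6n")
print(decode_text_part(sample))
```

The order matters: decoding the charset before removing the quoted-printable layer would leave the "=C3=A4" sequences in the text as literal characters.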

BR
-Vesa-

Bruce

Re: Problems with non-ascii characters
« Reply #12 on: May 27, 2015, 06:07:37 AM »
Hi Vesa,

definitely some interesting combinations there - can you forward them to the same email address as before please?

cheers
Bruce

vklemet

Re: Problems with non-ascii characters
« Reply #13 on: May 27, 2015, 10:18:41 PM »
Hello Bruce,

I have just sent you four emails, in this order:

1. one has the UTF-encoding in the subject problem. This message will be coming from address starting "testi.vesa"
2. one has the encoding problem with the sender's name. This message will be coming from address starting "janne.kiiskila"
3. one has the ISO-encoding in the subject. This message will be coming from address starting "vesa.klemetti"
4. one has the UTF-8 and quoted-printable combo in the message text. This message will be coming from address starting "vesa.klemetti"

I have fixed all of these problems in our application by using StringTheory to slice and decode the text in the embeds mentioned in my previous message. The code is not very sophisticated, but if you think it could help, I can send it to you.

-Vesa-