Teaching an old PHP application to talk Spanish, Chinese & Hebrew

Introduction

Over the last few months I’ve had the dubious pleasure of converting an older PHP application to Spanish, Chinese and Hebrew. Although painful at times, it was, in the end, a very rewarding experience that taught me a great deal about web application design in general and internationalization (I18n) in particular. Although there are thousands of resources on the web about I18n, as usual there are often multiple solutions to each problem and it took a lot of trial and error to figure out which solutions provided the best cross-browser support and overall usability/appeal for the end user.

The aim of this article is not to provide a comprehensive “I18n How-To” but more to record all of the hints and tips that I discovered during my project in one place. I’m sure that most of what is included will be relevant for pretty much any other PHP developer who has to take an older PHP application and add I18n support.

If you are internationalizing a PHP application, I would also highly recommend this article by Paul Reinheimer. This site was also an extremely useful resource that I referred to very frequently throughout the course of the project.

The Database

The first thing I checked was the database’s ability to store UTF-8 characters. It turned out that I was already using the UTF8 encoding of the Unicode character set with the utf8_unicode_ci collation. So far this has worked fine for English, Spanish, Simplified Chinese and Hebrew characters, and Traditional Chinese should be fine too, since those characters also live in Unicode’s Basic Multilingual Plane (you don’t need UTF-16 for them). The real caveat is that MySQL’s utf8 character set only stores up to three bytes per character, so it cannot hold characters outside the BMP, such as some rarer CJK ideographs. See this page for more information about MySQL and Unicode.
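Worth noting: the table collation is only half the story, because the client connection must also talk UTF-8 or data can be silently transcoded in transit. A minimal sketch using mysqli (the host name and credentials are placeholders, not our actual configuration):

```php
<?php
// Connection details are placeholders.
$db = new mysqli( 'localhost', 'user', 'password', 'mydb' );

// Without this, PHP and MySQL may converse in the server's default
// charset (often latin1) even though the columns themselves are utf8,
// mangling any non-ASCII data on the way in or out.
$db->set_charset( 'utf8' );
```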

DOCTYPE & HTML Header

For some strange reason the previous developer of the application had failed to include a DOCTYPE declaration in the pages, so I added one that suited the HTML that the application was already generating:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

Another fairly strange omission was a content type declaration in the HTML header, so I added this one:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

More information on these things can be found here.

Externalizing the Strings

There is a whole GNU project called gettext that automates the process of extracting the strings out of an existing PHP application and also provides tools for developers and translators to help them work with these files. It’s a real pity, then, that I didn’t discover gettext until I was too far down the road with my own solution! Thankfully, the web app I was working on didn’t have too many strings, so I just went through manually and wrapped each string in a function called t() (this will look familiar to Drupal coders :-). This approach also afforded me the opportunity to identify places in the code where sentences were being constructed based on English syntax and grammar, as these required special treatment when converting to Hebrew and Chinese.
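My t() function was nothing fancy: conceptually it was little more than a lookup into a per-language table of translations, falling back to the original English string. A minimal sketch (the $translations array and language codes here are illustrative; in the real application the tables were loaded from files):

```php
<?php
// Illustrative translation tables, keyed by language code.
$translations = array(
    'es' => array( 'Welcome' => 'Bienvenido' ),
    'he' => array( 'Welcome' => 'ברוך הבא' ),
);

function t( $string )
{
    global $translations, $lang;

    // Return the translation for the current language if one exists...
    if ( isset( $translations[$lang][$string] ) )
    {
        return $translations[$lang][$string];
    }

    // ...otherwise fall back to the English original.
    return $string;
}

$lang = 'es';
echo t( 'Welcome' ) . "\n"; // Bienvenido
```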

The one part of the application that did contain loads of text was the downloadable PDF reports. Thankfully these slabs of text had been externalized in the form of XML files which, it turns out, can be edited reasonably well in Microsoft Word 2003 for Windows. This version of Word will actually retain the XML format and, as an added benefit, will allow the use of the spell checker, which was a big advantage for our translators. Yes, I know, I’m normally bashing Microsoft for one thing or another, but hey, credit where credit’s due: Word 2003 for Windows rocks!

Language Selection

The user’s preferred language (as per their browser settings) is available in the $_SERVER['HTTP_ACCEPT_LANGUAGE'] variable. This was used to figure out which language should be used to render the login screen, which also contained a language selector that allowed the user to select a different language.
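A minimal sketch of that detection logic (the list of supported languages is illustrative, and note that this naively honours the header’s order rather than sorting by q-values):

```php
<?php
// Pick the first supported language from an Accept-Language header,
// e.g. "he-IL,he;q=0.9,en;q=0.8".
function detect_language( $header, $supported, $default = 'en' )
{
    foreach ( explode( ',', $header ) as $part )
    {
        // Strip any ";q=..." weight, then any region subtag (he-IL -> he).
        $code = strtolower( trim( strtok( $part, ';' ) ) );
        $code = substr( $code, 0, 2 );

        if ( in_array( $code, $supported ) )
        {
            return $code;
        }
    }

    return $default;
}

echo detect_language( 'he-IL,he;q=0.9,en;q=0.8',
                      array( 'en', 'es', 'zh', 'he' ) ) . "\n"; // he
```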

Typing in Accented and Unicode Characters

Something that comes up fairly soon in the I18n process is entering characters from different languages for testing purposes. This page proved to be invaluable in this regard. And then you get to Chinese and Hebrew and realise you need to type in Unicode characters occasionally. Thankfully, OS X has awesome support for this.

Right-To-Left Support

It was great to implement Spanish first, as that allowed me to get the basic infrastructure in place before attempting Hebrew, which presents three major difficulties:

  1. Working with Right-To-Left (RTL) strings
  2. Figuring out the best option for telling the browser about RTL
  3. Having an HTML/CSS layout that renders well in both directions

That first one - working with RTL text - is a major PITA. Although Eclipse does a great job of working with RTL text, it just constantly messes with your head when the left and right cursor keys and backspace key start operating backwards. If you don’t believe me, try copying the text below into your favourite text editor and see what happens:

שלום העולם

Once you’ve got your head around that little challenge, then there’s the problem of figuring out the best approach for designing your bi-directional HTML. The best information I found was here. Although there are several options for telling the browser that the content should be rendered RTL, the option I went with was as follows:

  • Include dir="rtl" on the html element of the page and, if you are specifically targeting Hebrew, include lang="he" as well
  • Do NOT include dir="rtl" on the body element of the page, as this will result in the scroll bar being rendered on the left side of the browser window, which apparently is not the standard approach (JavaScript pop-ups can also be affected by setting the rtl attribute on the body element)
  • Immediately after the body element, include a new div element that has the dir="rtl" attribute

If you follow the above approach and your layout is fairly basic, then this may be all you need to do!
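Putting those points together, the overall page structure I ended up with looks roughly like this (trimmed to a bare skeleton):

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html dir="rtl" lang="he">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>...</title>
</head>
<!-- no dir attribute on body, so the scroll bar stays put -->
<body>
<div dir="rtl">
    <!-- page content goes here -->
</div>
</body>
</html>
```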

However, if the existing layout does something unfortunate, like, say, rendering a rounded-corners look that utilizes a whole bunch of little jpg files for the corners, then you may be in trouble (like I was). The solution in my case was to spend three days re-writing all the code that produced the HTML (which was sprinkled throughout the entire code base) and developing a new look and feel based on these awesome stylesheets from Matthew James Taylor. In the end, there were only eight CSS hacks needed for RTL rendering and four for RTL rendering on Internet Explorer 6 and they were mostly due to application-specific issues.

Generating PDF Files

Although it was great that the text for the PDF reports was stored in XML files, there was still the issue of generating PDF files that rendered the text correctly. Unfortunately, FPDF, which I had been using for over two years, does not support Unicode characters, so it was time to find a new PDF class. After investigating a few, I settled on TCPDF. Because it is based on FPDF, my existing code worked with it pretty much straight out of the box, and TCPDF supports Unicode, RTL and lots of other good stuff.
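I haven’t reproduced our real report code here, but a minimal TCPDF sketch looks something like this (the font, text and file name are purely illustrative; dejavusans is one of the Unicode fonts that ships with TCPDF):

```php
<?php
require_once 'tcpdf/tcpdf.php';

$pdf = new TCPDF();

// DejaVu Sans covers a wide Unicode range, including Hebrew.
$pdf->SetFont( 'dejavusans', '', 12 );
$pdf->AddPage();

// Switch to right-to-left rendering for Hebrew content.
$pdf->setRTL( true );
$pdf->Write( 0, 'שלום העולם' );

$pdf->Output( 'report.pdf', 'D' ); // 'D' forces a download
```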

Preserving Unicode file names in Internet Explorer

Once I was producing PDF files in different languages, along came the RFC 2183 problem, which is basically that HTTP headers are only supposed to be encoded in US-ASCII. This includes the filename parameter of the Content-Disposition header. As it happens, you can set the file name to a string that contains Unicode characters and Firefox, for example, will actually decode it correctly. However, downloading the same file with Internet Explorer results in the file name in the dialog box becoming garbled junk.

After searching high and low, I eventually found the solution here. The executive summary is:

  1. Always include double quotes around the file name, as Safari prefers it this way
  2. Firefox and Safari both seem to support Unicode file names, so no special treatment required
  3. For Internet Explorer, as long as the file extension is US-ASCII, you can urlencode the entire filename (including the extension) and it will save the file name correctly

Here is some sample code:

$file = '世界您好.pdf'; // path to the generated PDF on disk
$filename = '世界您好';

if ( mb_detect_encoding( $filename ) != 'ASCII' && strpos( $_SERVER['HTTP_USER_AGENT'], 'MSIE' ) !== false )
{
    // IE needs the download name urlencoded (safe because the .pdf extension is US-ASCII)
    $filename = urlencode( $filename . '.pdf' );
}
else
{
    $filename .= '.pdf';
}

header( 'Pragma: public' );
header( 'Expires: 0' );
header( 'Cache-Control: must-revalidate, post-check=0, pre-check=0' );
header( 'Cache-Control: private', false ); // required for certain browsers
header( 'Content-Type: application/pdf; charset=utf-8;' );
header( 'Content-Disposition: attachment; filename="' . $filename . '"' );
header( 'Content-Transfer-Encoding: binary' );
header( 'Content-Length: ' . filesize( $file ));
readfile( $file );
exit;

The aforementioned approach works on Internet Explorer 6, 7 & 8.

Zip Files

Our web application also has a function where multiple PDF files can be delivered in a single zip archive. Although this has worked splendidly for some time, Windows, once again, had difficulties with Unicode file names.

Actually, the problem is related to the zip file format itself, which, until relatively recently, had no way of indicating which encoding was being used for storing the names of the files inside the archive (a UTF-8 “language encoding flag” was only added in later versions of the spec). As a result, Windows native zip support will simply interpret the file names according to the current code page. But if the file names are in Unicode then it won’t matter what code page Windows is currently set to: the file names will come out garbled.

However, there is a partial workaround in Windows. If you go to the "Advanced" tab of the "Regional and Language Options" part of Control Panel you will find an option called "Language for non-Unicode programs". If you are trying to unzip an archive that has Unicode file names that contain, say, Chinese characters, then selecting one of the Chinese options will probably enable the zip files to be extracted with the file names correctly converted. I say "partial" workaround only because it is unlikely that all users will be willing and/or able to alter this setting, which leaves us back at square one.

I should point out that there are no such workarounds required on OS X. It just works. Of course. :-)

Detecting Chinese Characters

One aspect of our web application is that it stores names as two separate fields in the database: first_name and last_name. This allows us to do some really simple things, like say "Dear first_name" at the top of e-mail messages. Chinese names, however, are typically rendered with the last name first and one of the business requirements for our translation was to say “Dear LastnameFirstname” at the start of e-mails.

In order to do this, the code that generated the e-mails needed to figure out if the name fields were in Chinese or not. This resulted in yet another extended session of intense Googling, only to find that the solution can be implemented in one line:

function is_Chinese( $str )
{
    // Returns 1 if $str contains at least one CJK ideograph, 0 otherwise
    return preg_match( "/[\x{4e00}-\x{9fa5}]/u", $str );
}

The basic approach is to build a regular expression that looks for one or more characters within a certain range of Unicode code points.
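In the e-mail code this then boils down to a simple branch when building the greeting. A sketch of the idea (the function names are illustrative, and is_Chinese() is repeated here so the snippet stands alone):

```php
<?php
function is_Chinese( $str )
{
    // Returns 1 if $str contains at least one CJK ideograph, 0 otherwise
    return preg_match( "/[\x{4e00}-\x{9fa5}]/u", $str );
}

function greeting( $first_name, $last_name )
{
    if ( is_Chinese( $first_name ) || is_Chinese( $last_name ) )
    {
        // Chinese convention: family name first, no space between names.
        return 'Dear ' . $last_name . $first_name;
    }

    return 'Dear ' . $first_name;
}

echo greeting( '伟', '王' ) . "\n";        // Dear 王伟
echo greeting( 'James', 'Gordon' ) . "\n"; // Dear James
```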

Multilingual Emails

The alternative title for this section could well be “Don’t Believe Everything You Read On The Web”! It is kind of ironic that, after so much success Googling for I18n solutions, I was let down so horrendously on the HTML e-mail construction front. Pretty much every single article I read on HTML e-mails said to remove "unnecessary" e-mail tags such as DOCTYPE, HTML tag, HEAD tag, BODY tag, etc. Although the rationale for this approach makes sense, I think the information might simply be out-of-date. My reasons for this assertion are as follows:

  1. Almost every marketing e-mail that I’ve received this year (yes, I do file them for later analysis - what a geek!) DID contain those elements
  2. Adding them did not break rendering in GMail, Hotmail or Yahoo mail
  3. Adding them did not break rendering in Outlook or Apple Mail
  4. Adding them DID make Chinese e-mails render correctly in Lotus Notes

So the code for generating the HTML part of the e-mails within the application looks like this:

$body = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">' .
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">' .
        '<title>' . htmlspecialchars( $subject, ENT_QUOTES, 'UTF-8' ) . '</title>' .
        '<style type="text/css">body, p, td { font-family:Verdana,Arial,Helvetica;font-size:12px; }</style></head>' .
        '<body>' . $content . '</body></html>';

In the case of Hebrew e-mails, the content was also wrapped in an additional div tag:

<div dir="rtl" lang="he" style="font-family:arial; font-size:larger;">

Conclusion

I’ll add to this page as I find better solutions for some of the challenges I encountered. If you have a specific challenge that you’ve been unable to solve (or if you find an error in any of the solutions above), drop a comment below.

Yet Another Programming Blog

Where James Gordon rambles about PHP, Laravel and web development in general.
