A reader had a short and sweet question for me:
What is the best way to create a bilingual site...english and spanish?
Like many short questions, this one is actually quite a huge topic. I'm going to answer it, but folks should be aware that I'm just scratching the surface of a more in-depth answer.
At a high level, there are two things you should be concerned about. Internationalization is the process of preparing your site for translation. For example, BlogCFC was internationalized so that no English was hard coded in the public UI. Simple buttons that would say "Add Entry" were instead set to point to a generic resource. The second step is localization, where you actually write the language resources for your application. Along with localizing the content, you also want to localize the date strings. Now, I don't know Spanish, but if Spanish speakers write dates as many other non-American people do (day/month/year instead of month/day/year), then you will also need to be sure your dates are formatted correctly. Luckily ColdFusion provides localized versions of the date formatting functions (LSDateFormat, for example), so that is rather easy.
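To make the date point concrete, here is a minimal sketch, written in Python rather than ColdFusion so it can run on its own. The DATE_PATTERNS table and the format_date helper are hypothetical names made up for illustration; a real site would lean on its platform's locale-aware formatter rather than a hand-built table.

```python
from datetime import date

# Hypothetical per-locale date patterns (an assumption for illustration).
DATE_PATTERNS = {
    "en_US": "%m/%d/%Y",  # month/day/year
    "es_ES": "%d/%m/%Y",  # day/month/year
}


def format_date(value: date, locale_code: str) -> str:
    """Format a date with the pattern registered for locale_code."""
    # Unknown locales fall back to the unambiguous ISO year-month-day form.
    pattern = DATE_PATTERNS.get(locale_code, "%Y-%m-%d")
    return value.strftime(pattern)


release = date(2007, 3, 4)
print(format_date(release, "en_US"))  # 03/04/2007
print(format_date(release, "es_ES"))  # 04/03/2007
```

The same date reads as March 4th to one audience and April 3rd to the other, which is exactly why the formatting has to follow the visitor's locale.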
So that handles the front end of your site. (Again, remember I'm just giving a broad, high level overview.) What about your data? If you plan on ensuring that all of your content will be in both languages, you will want to set up your database so you can flag the content. So imagine that you have press releases. In a typical site the columns for such content may include:
id (primary key)
title
published date
body
In a site with both English and Spanish content, you will need a way to flag the language, and you will need a way to signify that two articles (one English and one Spanish) both represent one core article.
For a simple solution, you could just add extra columns:
id (primary key)
title
titleSpanish
published date
body
bodySpanish
This works fine, but if your site supported multiple languages, it could get a bit unwieldy. You could instead try something like this:
id (primary key)
pressReleaseID (UUID)
title
published date
body
language
In the above table schema I added a pressReleaseID column. This column will signify one press release, but will not be unique to the table. Any other row that matches this value will represent the same article, but another language. The other column I added was a language column which, obviously, represents the language.
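As a rough sketch of that second schema, the Python below builds the table in an in-memory SQLite database and fetches an article in the requested language. The table name, the sample rows, the UNIQUE constraint, and the fall-back-to-English behavior in get_release are all my assumptions, not something the schema above requires.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pressReleases (
        id             INTEGER PRIMARY KEY,
        pressReleaseID TEXT NOT NULL,  -- shared by every translation
        title          TEXT NOT NULL,
        publishedDate  TEXT NOT NULL,
        body           TEXT NOT NULL,
        language       TEXT NOT NULL,
        UNIQUE (pressReleaseID, language)  -- assumption: one row per language
    )
""")

article = str(uuid.uuid4())
conn.executemany(
    "INSERT INTO pressReleases "
    "(pressReleaseID, title, publishedDate, body, language) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (article, "New office opens", "2007-03-04",
         "We have opened a new office.", "en"),
        (article, "Nueva oficina abierta", "2007-03-04",
         "Hemos abierto una nueva oficina.", "es"),
    ],
)


def get_release(press_release_id, language, fallback="en"):
    """Return (title, body) in the requested language, else the fallback."""
    return conn.execute(
        """SELECT title, body FROM pressReleases
           WHERE pressReleaseID = ? AND language IN (?, ?)
           ORDER BY CASE WHEN language = ? THEN 0 ELSE 1 END
           LIMIT 1""",
        (press_release_id, language, fallback, language),
    ).fetchone()


print(get_release(article, "es")[0])  # Nueva oficina abierta
print(get_release(article, "fr")[0])  # New office opens (fallback)
```

Adding a third language is then just another row with the same pressReleaseID, with no change to the table itself.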
As I said, this is a way simplistic overview of how such a site could be done. The Jedi Master of issues like this is Paul Hastings. He helped me internationalize BlogCFC. Hopefully this gives you the basic gist of how a multilingual site could be handled.
p.s. In case folks are curious why I'm not blogging like my normal mad self, I'm on vacation today and tomorrow.
Archived Comments
just to be clear on localisation:
[from livedocs]
LSDateFormat
Formats the date part of a date/time value in a locale-specific format.
A formatted date/time value. If no mask is specified, the value is formatted according to the locale setting of the _client computer_.
isn't the "client computer" the CF server (and its locale settings)? so you'd have to analyse what locale that visitor comes from and format accordingly? (per request or session?)
IIRC, Locale isn't sent as a CGI var, only "HTTP_ACCEPT_LANGUAGE" is sent by the browser...
of course. if your users are registered, that would be one of the things you'd have in their profiles. if not you'd use something like our geoLocator CFC to figure out their locale from their browser settings & IP.
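For the browser-settings half of that, a minimal sketch of choosing a language from the Accept-Language header could look like the Python below. The function names and the supported-language list are hypothetical, the IP lookup is not covered, and this is not how the geoLocator CFC works internally, just the general idea.

```python
def parse_accept_language(header: str) -> list[str]:
    """Return the language tags from an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without a q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality <= 0:
            continue  # q=0 means "not acceptable"
        # Sort by quality (highest first), then by original position.
        weighted.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(weighted)]


def pick_language(header: str, supported=("en", "es"), default="en") -> str:
    """Pick the first supported primary language, or the default."""
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0]  # "es-mx" -> "es"
        if primary in supported:
            return primary
    return default


print(pick_language("es-MX,es;q=0.9,en;q=0.8"))  # es
print(pick_language("fr-FR,fr;q=0.9"))           # en (default)
```

The header only states a language preference, so a registered user's stored profile setting should still win when one exists.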
btw what ray means by generic resource i guess is "resource bundle". and of course besides localized date strings you'll also have to work on number/currency formatting, calendars (not everybody uses the gregorian calendar & if they do not everybody's week starts on sunday), writing system direction, collation, etc.--i could go on all day ;-) as ray says, it's a huge topic.
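For anyone who hasn't seen one, a resource bundle boils down to key=value text, one file per language, looked up by key. Here is a minimal sketch in Python, assuming only plain key=value lines; real Java-style .properties files also support escapes and line continuations, which this ignores. The keys, the Spanish text, and the helper names are made up for illustration.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


# Hypothetical bundles; normally each would be read from its own file.
BUNDLES = {
    "en": parse_properties("# English\naddEntry=Add Entry\nsearch=Search"),
    "es": parse_properties("# Spanish\naddEntry=Agregar entrada"),
}


def get_resource(key: str, language: str, default_language: str = "en") -> str:
    """Look a key up in the language bundle, then in the default bundle."""
    bundle = BUNDLES.get(language, {})
    if key in bundle:
        return bundle[key]
    # A visible marker makes untranslated or missing keys easy to spot.
    return BUNDLES[default_language].get(key, f"**{key}**")


print(get_resource("addEntry", "es"))  # Agregar entrada
print(get_resource("search", "es"))    # Search (falls back to English)
```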
Being a linguist, one of the things you might want to take into consideration is breaking out or creating a file that can help you with this. One of the better implementations of having a bilingual Blog is one that is called AVBlog:
http://www.avblog.org/index...
Andrea Veggiani does a really good job of creating a CF Blog that handles multiple languages including the interface.
Generically speaking, these issues take quite a bit of planning as was mentioned above. Especially if you will be using the non-Romanized DBCS (Double Bit Character System) system to handle languages like Chinese, Japanese, Korean where the character systems run upwards of a couple thousand or more. Planning will go a long way.
that's called a resource bundle (rb) & that's what ray's blog does. andrea's blog is pretty cool (i helped work on earlier versions) but its use of xml in place of rb is kind of non-standard. i18n requires a technical ecosystem to support it & unless you're simply translating stuff (and standardized on XLIFF) xml isn't usually the first choice especially for really complex apps.
nobody talks about non-Romanized DBCS (Double Bit Character System) these days (and btw that's Byte not Bit). it's unicode or nothing.
Using xml for resource bundle is a very good choice because you can save more info than simple properties (binary data too as icons) and using an xml schema you have the validation too. (no duplicate keys etc).
Say you have 2000 labels to be translated: using xml + xsd and a good xml editor facilitates the work of the people who have to translate all this stuff.
One of the best ways to localize the web UI should be to use the xml entities:
http://xulplanet.com/tutori...
but unfortunately IE doesn't render them correctly.
Paul, yes, I was talking about resource bundles. I was just trying to keep things simple by not naming them.
I agree with Faser --- I have used XML to handle application level text and settings, labels, etc. Handing off an XML file to a translator is very easy to do. We used one XML file per page essentially. The XML file had every language used on that page, validation for any fields, etc.
I just want to point out that Ray's reader was asking about "creating a bilingual site" in English and Spanish.
While Ray's response, and the comments are fairly accurate regarding Internationalization/Globalization and localization, this is actually quite different, and much more complex, than creating a bilingual site.
A bilingual site can be as simple as translating content in 2 languages. For example, if one were creating a resource for a local school district in New York City, and wanted the site presented in multiple languages. Globalization/localization issues would likely not come into play. Just translation. This is very common.
In fact Globalization/localization may have nothing to do with translation at all! Imagine a site that targets users in the US and Canada. The site can be presented in one language, English, yet be localized regarding everything else (currencies, date/time formats, number formats etc. )
I have found that when doing Globalization, the disconnect between translation and globalization has been one of the most difficult concepts for people to wrap their heads around.
Gus
One other comment about Globalization...
If you are globalizing a site, make sure your datatypes can handle double byte characters!
Gus
Gus, I'm not sure I agree with you. I mean, yes, you can provide content in 2 languages, but that isn't enough. I mean, if your menus and forms are in one language, then you are preventing half your audience from using the site. That's why I talked about localizing the UI. I wouldn't consider a site with a Spanish version of the dynamic content to be truly bilingual if that was the only step taken.
@Gus
Just to make this entry more complete, when creating a table in SQL Server (I have no idea what you would use for MySQL), you need to use the 'N' datatypes. All character datatypes in SQL Server have a corresponding 'N' datatype, such as
varchar > nvarchar
char > nchar
You get the idea. These columns store Unicode data, which is what you need for double-byte characters in languages such as Japanese and Chinese.
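To see why the column type matters, the Python sketch below compares how many bytes the same text takes in different encodings and checks whether it fits a single-byte code page at all. Using cp1252 to stand in for a typical non-Unicode varchar code page is an assumption, and fits_single_byte is a made-up helper name.

```python
text = "日本語"  # three characters

utf16 = text.encode("utf-16-le")  # two bytes per character for this text
utf8 = text.encode("utf-8")       # three bytes per character for this text
print(len(text), len(utf16), len(utf8))  # 3 6 9


def fits_single_byte(value: str, codepage: str = "cp1252") -> bool:
    """Return True if every character exists in the single-byte code page."""
    try:
        value.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


print(fits_single_byte("español"))  # True
print(fits_single_byte("日本語"))    # False
```

Spanish happens to fit in a Western European code page, so an English and Spanish site can get away with less, but sizing and typing the columns for Unicode up front avoids a migration if another language shows up later.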
sorry no xml isn't the best choice at least not by today's industry standards. dumping all the localizations into 1 xml is going to be a nightmare to manage as is trying to use a standard xml editor for this sort of task--it's simply not meant for that kind of work. taking an extreme, i suppose you could use notepad if the job's simple enough but you *will* die sooner or later as the apps grow more complex. XLIFF is swell for translation agencies, but those folks aren't developers and you will hardly find any serious i18n developer suggesting you use it instead of rb (though good rb tools will import/export XLIFF to speed things along).
and just fyi, while i've never done it myself, i've read of folks using binary data (images) in rb files.
you really need that ecosystem to support serious, complex i18n stuff. there are several rb management tools around, even a nice cf based one (jason sheedy's rbMan) that swim in that sea. unfortunately xml-spy, etc. isn't part of this ecosystem though i suppose it's fine for rinky-dink stuff (but for stuff that simple, notepad does plenty fine).
ray,
yeah just keeping you honest ;-) this blog will become some kind of "site of record" so i'd like things as "correct" as possible.
I don't think Faser or I said it was the best choice. He said it was a good choice and I agreed. I also said I have 1 file per page that needs translation. There would never be a huge XML file to manage.
I had built this in ASP, so for a registration form you would see registration_1.asp, registration_1.xml. We created an object that would read in the xml and the language branch it needed, and that xml would handle labels, buttons, error messages, validation and so on.
For pure content like Press releases this would be database driven not XML driven.
There may be better solutions but this one was easy to put together and worked great for the situation.
"The XML file had every language used on that page" that to me is a nightmare to manage.
"good/very good" choice makes no difference to me, xml isn't a good/better/best choice for serious i18n work. period.
Ray,
Sorry if I was unclear, but I did not mean to imply that UI elements or forms should not be translated. In fact meta-data should be translated as well.
The point of my comment was that there is a significant difference between translation and globalization/localization.
Globalization/localization may include translation, but not necessarily. ( An English-language site that targets users in both the US and Canada for example )
A site can be multi-lingual without Globalization/localization issues ever coming into play. ( A site that targets Canadian users that is presented in French and English )
As I said.. the difference between translation and globalization/localization is one of the more difficult concepts for people to wrap their heads around.
Gus
Makes sense Gus. Thanks to all for adding to this thread. (I hate to say I love my blog because then it sounds egotistical, so I'll say I love my readers. :)
We've had this discussion on the CFCDev list before and come to the same conclusion; what you choose will be determined by your specific needs. You could look at it like installing a version of CF:
Developer edition = basic DB-driven translations;
Standard edition = XML-driven translations;
Enterprise edition = rb files.
We use an XML-driven system that drives a 10 language site, including Russian and Japanese (cyrillic and kanji). Translations are loaded into server scope on startup using Application.cfc to create a service factory that reads in a single XML file (containing one XML node per language).
It's a high traffic site that's in use 24/7 by an international clientele, and for our needs an XML-based solution works fine. That's not to say that we couldn't make performance improvements by moving to Java RBs, but XML is a simple, obvious and intuitive solution that can be implemented quickly and easily using the basic CFMX toolset.
If anyone wants code examples just post a request in these comments.
Ed, I would be interested in any code examples you might provide.
From my own work in localization I have found that the one xml page per display page became very hard to manage, with lots of duplication of coding.
I have moved the language sensitive terms to a table in the database, and then on application start I would load that table into application space as a pair of structs (English and French).
I would then call either struct to display the data depending on the user’s language choice.
As for images I would store the language sensitive images in a separate directory and store the directory as an entry in the aforementioned structs.
I also made an interface for the translators to take my English terms and translate them into French, or to add new terms if I needed them.
Adding a third row, and struct for another language say Spanish would be really easy, except for the translations. But most important I would not have to change any code, including static XML.
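A minimal sketch of that table-to-structs idea, in Python with an in-memory SQLite table standing in for the real database. The column layout (one row per term per language), the sample terms, and the load_terms name are assumptions; the table described above may be shaped differently.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
# Assumed layout: one row per term per language.
conn.execute("CREATE TABLE terms (termKey TEXT, language TEXT, termText TEXT)")
conn.executemany("INSERT INTO terms VALUES (?, ?, ?)", [
    ("submit", "en", "Submit"),
    ("submit", "fr", "Soumettre"),
    ("imageDir", "en", "/images/en/"),  # language-specific image directory
    ("imageDir", "fr", "/images/fr/"),
])


def load_terms(connection) -> dict:
    """Read the whole table once and build one dict per language."""
    terms = defaultdict(dict)
    query = "SELECT termKey, language, termText FROM terms"
    for key, language, text in connection.execute(query):
        terms[language][key] = text
    return dict(terms)


# Done once at application start; pages then only read from memory.
APPLICATION_TERMS = load_terms(conn)
print(APPLICATION_TERMS["fr"]["submit"])    # Soumettre
print(APPLICATION_TERMS["en"]["imageDir"])  # /images/en/
```

With this layout a new language needs only new rows and a reload, which matches the point about not touching any code.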
ed, i'm curious as to when this was discussed on the CFC list. i can't find it in my archives.
using rb vs xml really hasn't anything to do w/performance. it's more about suitable toolsets & manageability.
Here's an example of how you could structure a translations XML file for a simple site. (Paul, this CFCDev discussion occurred around 19th Dec 2005.)
<translation default="en">
  <language id="en" name="english" label="English">
    <section name="contactUs">
      <resource name="pageTitle">Contact Us</resource>
      <resource name="bodyTitle">Feel free to contact us</resource>
    </section>
    <section name="aboutUs">
      <resource name="pageTitle">About Us</resource>
      <resource name="bodyTitle">Here's some info about our company</resource>
    </section>
  </language>
  <language id="fr" name="french" label="Francais">
    <section name="contactUs">
      <resource name="pageTitle">Contactez-nous</resource>
      <resource name="bodyTitle">Veuillez nous contacter</resource>
    </section>
    <section name="aboutUs">
      <resource name="pageTitle">Au sujet de nous</resource>
      <resource name="bodyTitle">Voici une certaine information au sujet de notre compagnie</resource>
    </section>
  </language>
</translation>
Load this as an XML structure into server scope on app startup, then just call the correct placeholder for the user's session language, something like this:
#server.translate(session.language, "aboutUs", "bodyTitle")#
French user in the About Us section sees 'Voici une certaine information au sujet de notre compagnie', English user sees 'Here's some info about our company'.
I'm not advocating this as a way of storing large translated documents (nor indeed for enormous sites with many thousands of translations), but it works well for page furniture.
Further to my last comment, the actual XML file read/parsing operation is done only once by the service factory on app startup. The data from the XML file is written into a CF Struct in server scope; the data from this struct is used dynamically to populate placeholders in the site layout. (Sounds similar to Chris' approach above.)
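The parse-once, look-up-many-times pattern described here can be sketched as follows. This is Python rather than CF, the XML is a trimmed copy of the sample above, and the load_translations and translate names plus the fallback to the file's default language are my assumptions, not the actual service factory.

```python
import xml.etree.ElementTree as ET

TRANSLATIONS_XML = """<translation default="en">
  <language id="en" name="english" label="English">
    <section name="aboutUs">
      <resource name="pageTitle">About Us</resource>
      <resource name="bodyTitle">Here's some info about our company</resource>
    </section>
  </language>
  <language id="fr" name="french" label="Francais">
    <section name="aboutUs">
      <resource name="pageTitle">Au sujet de nous</resource>
      <resource name="bodyTitle">Voici une certaine information au sujet de notre compagnie</resource>
    </section>
  </language>
</translation>"""


def load_translations(xml_text: str):
    """Parse the XML once into {language: {section: {name: text}}}."""
    root = ET.fromstring(xml_text)
    table = {}
    for language in root.findall("language"):
        sections = table.setdefault(language.get("id"), {})
        for section in language.findall("section"):
            resources = sections.setdefault(section.get("name"), {})
            for resource in section.findall("resource"):
                resources[resource.get("name")] = resource.text or ""
    return root.get("default"), table


# Equivalent of loading into server scope on app startup.
DEFAULT_LANGUAGE, TRANSLATIONS = load_translations(TRANSLATIONS_XML)


def translate(language: str, section: str, name: str) -> str:
    """Look up a resource, falling back to the default language."""
    for candidate in (language, DEFAULT_LANGUAGE):
        try:
            return TRANSLATIONS[candidate][section][name]
        except KeyError:
            continue
    return name  # last resort: show the key itself


print(translate("fr", "aboutUs", "bodyTitle"))
print(translate("de", "aboutUs", "pageTitle"))  # About Us (fallback)
```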