Translating HTML files
How to translate HTML files correctly: how HTML works, basic tags, style sheets,
the issues a translator should be aware of, how to prepare (tag) an HTML
file for translation, and what to watch for when translating a website.
Translating Web Sites
Today, being able to translate HTML is crucial, for obvious reasons, and nearly every
translator will accept HTML files. Yet, although it's not politically correct to
mention this here, the truth is that many translators don't know enough about HTML and
websites to do a professional job.
There are LOTS of good HTML tutorials around, but they are all intended for would-be
webmasters or even professional webmasters, and skip important issues a translator
should be aware of. I hope this fills the gap and helps you do a better job.
If you are already familiar with HTML, keyword handling and style sheets,
go straight to “How to translate HTML” for more on preparing an HTML
file for translation and doing the translation itself.
HTML issues
(Basic and not so basic)
What is HTML and how does it work? HTML stands for HyperText Markup Language. Hypertext
is text characterized by the presence of links. Take a book. You read from the beginning
and move toward the end. With hypertext, you can immediately reach the
information you are looking for by clicking on links.
An HTML file is a simple text file with an “htm” or “html”
extension. Try the following experiment: take a simple text file, “whatever.txt”,
and rename it to “whatever.htm”. Double-click on it and it will display
in your default web browser. Now, you will note that there are no links. There is
no bold, no underlining, no tables, no pictures and not even paragraph marks.
HTML is the “language” that you use to tell the browser (Internet Explorer,
Netscape, Mozilla, Opera...) how the page should be displayed and what it should
do in different situations (for instance, the user clicks on a link, and the browser
finds the page and displays it). To do that, it uses “markup”. A markup
- or tag - is a small piece of code that provides this information. In HTML, tags
are made of a “<” sign, some code and a “>” sign.
Case is not important.
For instance “<b>” tells the browser that whatever information
follows that tag should be displayed in bold. Now, unless you want everything to
be displayed in bold, there must be another tag to tell the browser where it should
stop displaying the text in bold. That tag is “</b>”. Note the
“/” sign. The tag triggering the bold display (<b>) is called
an opening tag. The tag canceling the action of the opening tag (</b>) is
called a closing tag. There are tags for nearly every formatting option: italics,
underline, color, size… You will find them very easily on the net.
There are other types of tags in an HTML document. For instance, there are tags
detailing the structure of the page and its general behavior. An HTML page is usually
structured as follows:
<html> (To tell the browser that this page is in HTML)
<head> (Header. Contains information about the page that will not be displayed,
but can nevertheless influence the display.)
</head> (Closes the “<head>” tag. Most tags should be opened
and closed.)
<body> (The actual page. This is what you see when you open the page in the
browser)
</body> (Closing tag for <body>)
</html> (Closing tag for <html>)
You need not change the structure tags when you translate.
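If you are comfortable running a small script, the rule that most tags should be opened and closed is easy to check automatically. What follows is only a minimal sketch in Python, using the standard html.parser module. The names TagBalanceChecker and check_balance, and the short list of tags that never take a closing tag, are assumptions made for this illustration and do not come from any translation tool. The check is also stricter than browsers are, since HTML allows some closing tags to be left out, so treat its output as hints.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (a partial list, assumed for this sketch).
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class TagBalanceChecker(HTMLParser):
    """Keeps a stack of open tags and records tags that are not properly closed."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, innermost last
        self.problems = []   # human-readable descriptions of what went wrong

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <B> and <b> are the same tag.
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in self.stack:
            self.problems.append(f"</{tag}> has no opening tag")
            return
        # Pop until the matching opening tag; anything popped on the way was left open.
        while self.stack:
            open_tag = self.stack.pop()
            if open_tag == tag:
                break
            self.problems.append(f"<{open_tag}> is never closed")


def check_balance(html_text):
    """Return a list of problems; an empty list means every tag was closed in order."""
    checker = TagBalanceChecker()
    checker.feed(html_text)
    checker.close()
    problems = list(checker.problems)
    problems.extend(f"<{tag}> is never closed" for tag in checker.stack)
    return problems
```

Running check_balance on the source file and again on the translated file gives a quick way to see whether the translation introduced a problem that was not in the original.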
Another type of tag is the Meta tag. Meta tags are located in the header and give
information about the page, such as keywords, a description of the page, the author
and copyright. They are used mostly by search engines. You will need to translate
the contents of some of these tags. Bearing in mind that these tags are mostly
intended for search engines, you have to translate the keywords and description
using words that people will actually use to find the web site. It’s not a matter
of just translating them literally.
You have to think a little bit about which terms are applicable to the page and
will be the most popular. You are likely to find misspellings in the Meta tags.
They are there on purpose, so that people who misspell their search terms in the
search engine find the page anyway. If so, misspell too. Google listed the misspellings
it found for “Britney Spears”. There are hundreds, and they have been
searched for by thousands of people, so misspellings of popular search terms can
bring significant traffic.
If you find well-thought-out descriptions and several typos in the Meta tags, be
extra careful, for this is evidence that your customer has attempted some search
engine optimization, and perhaps paid a lot of money to do so. Don’t ruin
it.
There is one other important item in the Meta tags: the charset. It tells the browser
which character set is used in the page. If you translate from a language with a
character encoding different from yours, you may have to change the encoding for the
page to display properly. Here is what that Meta tag typically looks like:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
The TITLE tag (<title>, in the header; its content shows in the title bar of the web
browser when you display the page). THIS is the single most important piece of text
in your web page. Why? Because search engines value it above everything else when
they analyze the page. “Welcome to Whatever.inc” is probably the most
stupid title you can come up with. A title should contain the keywords that will
be used to find the page. If the page talks about blue widgets, the title should
have “Blue widget” in it! Now, of course, you are translating. That
means you have to follow the original web page, and if the original title is “Welcome
to Whatever.inc”, then keep it, but if you can see the author has put some
thought into the title to include keywords in a specific sequence, give it some thought
yourself.
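To make sure none of the header texts discussed above (title, keywords, description) is overlooked, a short script can list them before you start. This is a minimal Python sketch using the standard html.parser module. HeaderTextExtractor and header_texts are names made up for this illustration, and the sketch assumes the Meta tags use the usual name/content pair of attributes.

```python
from html.parser import HTMLParser


class HeaderTextExtractor(HTMLParser):
    """Collects the title and the keywords/description Meta tags of a page."""

    def __init__(self):
        super().__init__()
        self.items = {}          # e.g. {"title": "...", "keywords": "..."}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.items["title"] = ""
        elif tag == "meta":
            # Attribute values keep their case, so normalize the Meta tag's name.
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.items[name] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.items["title"] += data


def header_texts(html_text):
    """Return a dict with the title, keywords and description found in the page."""
    extractor = HeaderTextExtractor()
    extractor.feed(html_text)
    extractor.close()
    return extractor.items
```

The result is only a checklist of what to look at. Deciding which keywords people will actually search for in the target language remains the translator's job.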
Links. In HTML, a link looks like this:
<a href="http://www.website.com" title="Good web site">Web Site</a>
“a” stands for “Anchor”, and “href” tells the
browser where that “anchor” is located (here, “http://www.website.com”).
“title” gives a title for the link, so that when you pass the mouse
over the link, a small note will display, “Good web site”, in this example.
You have to translate it. “Web Site” is the text of the link. You may
or may not have to translate it. “</a>” is the closing tag.
Images. Although you see images in web pages, they are not really inside the HTML
document. It’s a simple text file, right? In fact, you have a tag that tells
the web browser where the picture is stored and how to display it (what size, with
or without a border, where on the screen…). The image tag is “<img>”, for
example <img src="picture.gif" alt="Description of the picture">. It has no closing
tag. You should not change the image tag except for the content of the “alt”
attribute. “Alt” stands for “Alternate text”.
In the early days of the Internet, many browsers were not able to display pictures,
or it was too slow, so many users disabled the pictures to surf faster. To enable
those users to understand what picture should be there, the alt text is displayed
instead. Even if the image is displayed, some browsers show the alt text when you
move the mouse over the image. You have to translate it.
The “alt” and the “title” are usually loaded with keywords
for the search engines. If this is the case, make sure that the translation is loaded
with keywords the same way.
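Because “alt” and “title” texts sit inside the tags, they are easy to miss when you scan a page for text. The sketch below lists them so they can be checked one by one. It is written in Python with the standard html.parser module. The names AttributeTextExtractor and translatable_attributes are invented for this example, and the choice to look only at “alt” and “title” simply follows the two attributes discussed above.

```python
from html.parser import HTMLParser

# Attributes whose values are shown to the reader, as discussed above.
TRANSLATABLE_ATTRS = ("alt", "title")


class AttributeTextExtractor(HTMLParser):
    """Records every non-empty alt or title attribute, in document order."""

    def __init__(self):
        super().__init__()
        self.found = []   # list of (tag, attribute, value) tuples

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, but not the values.
        for name, value in attrs:
            if name in TRANSLATABLE_ATTRS and value:
                self.found.append((tag, name, value))


def translatable_attributes(html_text):
    """Return the (tag, attribute, value) tuples that need a translator's attention."""
    extractor = AttributeTextExtractor()
    extractor.feed(html_text)
    extractor.close()
    return extractor.found
```

Comparing the list for the source page with the list for the translated page shows at a glance whether an “alt” or “title” text was left in the source language.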
HTML has evolved a lot from the first version. Nowadays, a web designer can decide
exactly the size of the text, create styles (a concept similar to styles in a word
processor – more on that later), set the position and so on. But in the early
days, HTML was much more frugal.
The web was used for text. You had a series of tags to identify the document’s
hierarchy, called the “heading tags”: <h1>, <h2>, <h3>…
and their closing tags, </h1>, </h2>, </h3>. H1 is the main heading.
It's big, bold, often too big, in fact. H2 is a secondary heading, slightly smaller.
H3 is smaller again... You get the idea.
Although there are much better ways in current HTML to arrange the display, the
H tags have remained and are used by search engines when they analyze a page, the
rationale being that if a word is in a heading, it is more relevant to the page
content. This is the main reason why many web sites still use those tags even if
that means a little bit more work. As a translator, these tags tell you that you
are translating a heading, and its position in the document's hierarchy.
They are also a warning that the words inside these tags are...
Exactly. Keywords. Usually, you will see the same keywords used in the H tags and
in the “keywords” Meta tag. Make sure that you use the same keywords in both places.
Search engines analyze, amongst other things, the number of times a specific keyword
appears compared to the total number of words in the page, and where. Try to keep
the same proportion as the original document, and if a keyword is in a heading, make
sure your translation leaves a keyword in that same heading.
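The proportion mentioned above can be measured rather than guessed. Here is a rough Python sketch that counts how often a keyword appears, how many words the page contains, and how many of the hits fall inside heading tags. Everything in it (the names VisibleTextCollector and keyword_report, the decision to ignore text inside script and style tags, the simple definition of a “word”) is an assumption made for illustration. Search engines do not publish how they actually count.

```python
import re
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}   # their content is code, not page text


class VisibleTextCollector(HTMLParser):
    """Collects page text, keeping a separate copy of the text found in headings."""

    def __init__(self):
        super().__init__()
        self.all_text = []
        self.heading_text = []
        self._in_heading = 0
        self._skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skipping += 1
        elif tag in HEADING_TAGS:
            self._in_heading += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skipping:
            self._skipping -= 1
        elif tag in HEADING_TAGS and self._in_heading:
            self._in_heading -= 1

    def handle_data(self, data):
        if self._skipping:
            return
        self.all_text.append(data)
        if self._in_heading:
            self.heading_text.append(data)


def _words(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def _count_phrase(tokens, phrase):
    """Count how many times the word sequence `phrase` occurs in `tokens`."""
    size = len(phrase)
    if size == 0:
        return 0
    return sum(1 for i in range(len(tokens) - size + 1) if tokens[i:i + size] == phrase)


def keyword_report(html_text, keyword):
    """Return occurrences, total words, density and heading hits for one keyword."""
    collector = VisibleTextCollector()
    collector.feed(html_text)
    collector.close()
    tokens = _words(" ".join(collector.all_text))
    phrase = _words(keyword)
    hits = _count_phrase(tokens, phrase)
    return {
        "occurrences": hits,
        "total_words": len(tokens),
        "density": hits / len(tokens) if tokens else 0.0,
        "in_headings": _count_phrase(_words(" ".join(collector.heading_text)), phrase),
    }
```

Run it once on the original page with the source keyword and once on the translation with the target keyword, then compare the two reports. The figures will rarely match exactly, but a keyword that appears in a heading in the source and nowhere in a heading in the target is worth a second look.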
For the same reason, HTML contains a number of redundant tags, like <strong> and
<em>, or old ones that you almost never see anymore, whose names are usually
self-explanatory. Look for these. It is too easy to concentrate on the “standard”
<b>, <i>... and forget to handle those old things. You may need to move
them, too.
Next, styles and style sheets. A “style” is a series of attributes defined
in advance, either in the header of the document, or in a separate file called a
style sheet.
To understand styles, you need to understand what problems they solve.
Suppose you want the big titles in your web site to be bold, italic, blue, and centered.
In good old HTML, you would write something like:
<center><font color="blue"><b><i>Title 1</i></b></font></center>
<center><font color="blue"><b><i>Title 2</i></b></font></center>
<center><font color="blue"><b><i>Title 3</i></b></font></center>
<center><font color="blue"><b><i>Title 4</i></b></font></center>
…
<center><font color="blue"><b><i>Title 356</i></b></font></center>
Pretty clumsy, isn’t it? And that's just 4 simple attributes. The solution
is to define a style with all these specifications: It’s bold, it’s
blue, it's centered, and you give it a name, e.g. bbc (for Bold Blue Centered.
Just an example. It’s normally named so that one easily remembers what it
is). Then, you don't need to write it every time. In the header of the page, you
write something like:
<style type="text/css">
.bbc { font-weight: bold; color: blue; text-align: center; }
</style>
Then, anytime you have a title, you write
<p class="bbc">Title 1</p>
<p class="bbc">Title 2</p>
<p class="bbc">Title 3</p>
…
But the best part is that if, after all is done, you decide that it would be nicer in
red, or that italics would be cool, you don’t have to look all over the document
and change all the tags, one by one. You simply change 1 word in the style definition
and every instance changes at once. This not only saves a lot of time when you design
the page, but also makes the page smaller, and thus faster to load.
Now, if you want to use a style in several pages, or even the whole site, you have
to copy the same styles into the header of each page. Not too smart. The solution
was to write all the styles in a separate file, called a style sheet, then to link
each page to the style sheet. That way, you write the styles only once, and
in each page, you have a link in the header that looks something like this:
<link rel="stylesheet" type="text/css" href="styles.css">
A style sheet file’s extension is “*.css”. Now, as a translator,
this is relatively important to know because the style sheet determines how the text
will be displayed and where. The same page can look completely different with and without
the style sheet. With experience, you can look at the source code and “see”
the page (No, this ain’t the Matrix yet ;-). That helps a lot, because you
don’t need to check out the page in the browser every few minutes.
Anyway, this should cover the basic HTML you need to translate. When you get a bit
more time, pick one of the many HTML tutorials on the Web and learn about tables
and frames.
How to translate HTML
There are two reliable, proven methods and many wrong methods. Amongst the wrong
methods, the most popular are:
• Opening the HTML file in Word, working there and using “Save as a web page”.
This changes the code and turns it into a complete mess that is twice the size of
the original page, causes no end of display issues and is about as popular with search
engines as a dead cat at a wedding. If you want to hear a knowledgeable customer
scream, go ahead.
• Translating in other WYSIWYG (What You See Is What You Get) editors. They
usually mess up the code as well, although I don’t know of any as bad as Word in
that respect, save perhaps FrontPage. Dreamweaver is an exception to that rule, but
a costly one if you are simply translating.
• Using translation software that hides the tags. That can be very attractive
for beginners, but if you understood the section above properly, you will see why
this is not a good solution at all. An example of such software is Catscraddle.
That software is very smooth but will cause problems because you don't know what
is what, and the sentences are cut midway if the page uses formatting. If it did
a correct job, I would be the first to use it because I love the interface and it's
very fast. Unfortunately, the basic concept is VERY flawed and if you want to do
a professional job, just don’t.
The correct methods include:
• Opening the page in an HTML editor, preferably one that supports color coding
of the tags. There are many freeware editors. I like AceHTML very much, but that's far
from the only one available. Either way, translate the text and move the tags as
needed. For example:
English: <b>John’s</b> girlfriend is quite cute.
French: La petite amie de <b>John</b> est plutôt mignonne.
As you can see, you have to decide where the tags should be in the target language.
Working that way can be a pain, but if you know your code and are careful, the output
will be irreproachable. However, you must stay very alert not to forget or erase
tags by mistake.
• Preparing the file, then using a CAT like Wordfast or Trados to translate
it, then restoring the HTML format. Not all CATs work the same way, but remember
that professional handling of web site translation *requires* quick access to the
tags. The ability to move, edit or delete tags is not optional, it’s a must.
With Trados, you can also use TagEditor, although you may miss the flexibility that
comes with working in Word. Moving/deleting tags can be quite clumsy in TagEditor.
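Whichever of the two methods you use, a forgotten or erased tag is the most common accident. Since tags may legitimately move inside a sentence, comparing their positions is not helpful, but their counts should normally stay the same. The following Python sketch compares the tag inventory of the source and target files. It relies only on the standard library, and the names TagCounter, count_tags and tag_differences are made up for this example. A difference is not always an error, because you may occasionally delete a tag on purpose, so read the result as a list of places to check.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening and closing tag in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[f"<{tag}>"] += 1

    def handle_endtag(self, tag):
        self.counts[f"</{tag}>"] += 1


def count_tags(html_text):
    """Return a Counter mapping each tag to the number of times it appears."""
    counter = TagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.counts


def tag_differences(source_html, target_html):
    """Return {tag: (count in source, count in target)} for tags whose counts differ."""
    source = count_tags(source_html)
    target = count_tags(target_html)
    return {
        tag: (source[tag], target[tag])
        for tag in sorted(set(source) | set(target))
        if source[tag] != target[tag]
    }
```

For instance, if a source sentence contains one pair of bold tags and the closing tag was lost in the translation, tag_differences returns {"</b>": (1, 0)}. An empty result means both files contain the same tags the same number of times.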
Preparing the text for translation:
1. What are tagged files?
What do I mean by “Preparing the text for translation”? For translation
purposes, there are 2 types of tags:
• Tags that you may need to move or edit and that are/could be located in
the middle of a segment
• Tags that you will almost never change and that are not (or should not be) in the
middle of a segment
Overall, there are very few tags that you may need to delete during the translation
process.
“Preparing files” means modifying the files so that they can be translated
easily using a CAT. What follows is a description of a file prepared for Wordfast/Trados,
a “tagged file”, in translator lingo. Since Trados is/was widely
used, most professional CATs can handle this type of file, with more or less success.
However, if you own and use another CAT (SDLX, DV,…), please check your CAT's
documentation. As you will use a CAT to work on the tagged file, I assume that you
are familiar with the basic concepts. (If not, please read the following pages of
this web site before going further: “What are CATs?” and “First
translation”)
A tagged file is an RTF file containing the source code (meaning, tags + text) of
the original HTML file. The tags are identified using 2 styles: tw4winInternal and
tw4winExternal. Without getting into details, the tw4winInternal style is red, and
the tw4winExternal style is light grey. Whenever you receive a file with tags in red and
grey, it’s almost a given that the file has been tagged. Although the handling
is very similar, beware that HTML files are not the only tagged files, and many
more exotic formats are tagged for use with CATs, like SGML, XML, QuarkXpress, FrameMaker,
etc.
All tags are protected against deletion by default, to keep you from deleting one by
mistake. Tags that you may need to move, like <b> (bold), are in tw4winInternal.
“Internal” because they will be included in the segment you have to
translate. They are in red. Tags that you don't need to change or to be concerned
about during the translation process (like <p> (paragraph mark), …) are in
tw4winExternal and are in grey. A tag in tw4winExternal style will
end a segment automatically.
Here is an example:
Correct: You are learning to translate <b>Web Sites</b></p>
<p>Bla
bla bla
(<b> and </b> are in tw4winInternal; </p> and <p> are in tw4winExternal.)
By now, you should know that “Web Sites” is in bold, and that the </p>
shows the end of a paragraph. When you open that sentence with Wordfast (or Trados),
the segment will end just after the </b>, although there is no period, because
</p> is in tw4winExternal style.
Incorrect: You are learning to translate <b>Web Sites</b></p>
<p>Bla
bla bla
(Here <b> is in tw4winExternal. The segment would stop right after “translate”.)
Incorrect: You are learning to translate <b>Web Sites</b></p>
<p>Bla
bla bla
(Here </p> and <p> are in tw4winInternal. The segment would include everything.)
Incorrect: You are learning to translate <b>Web Sites</b></p>
<p>Bla
bla bla
(Here the tags are left in the normal text style. The segment would include
everything and the tags are not protected.)
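The rule that an external tag ends a segment while an internal tag stays inside it can be illustrated with a few lines of Python. This is only a sketch of the idea and not how Wordfast or Trados actually work: it ignores sentence punctuation, it knows nothing about RTF styles, and the set of tags treated as external (EXTERNAL_TAGS) is a hypothetical choice made for the example. A real tagging tool decides that for you.

```python
import re

# Hypothetical list of tags treated as "external" (they end a segment).
# Every other tag is treated as "internal" and stays inside the segment.
EXTERNAL_TAGS = {"html", "head", "body", "p", "br", "table", "tr", "td",
                 "ul", "ol", "li", "h1", "h2", "h3"}

TAG_PATTERN = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def segments(source):
    """Split HTML source into segments, cutting at every external tag."""
    result = []
    current = ""
    position = 0
    for match in TAG_PATTERN.finditer(source):
        # Text between the previous tag and this one belongs to the current segment.
        current += source[position:match.start()]
        if match.group(1).lower() in EXTERNAL_TAGS:
            # An external tag closes the segment and is not part of it.
            if current.strip():
                result.append(current.strip())
            current = ""
        else:
            # An internal tag travels with the text around it.
            current += match.group(0)
        position = match.end()
    current += source[position:]
    if current.strip():
        result.append(current.strip())
    return result
```

With this sketch, a paragraph containing a bold phrase comes out as one segment with its bold tags still inside, and the paragraph tags around it do not appear in any segment. Moving “b” into EXTERNAL_TAGS reproduces the first incorrect case above, where the sentence is cut right before the bold text.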
2. Tagging an HTML file?
If you open the source code of virtually any HTML file, you will see there are a
LOT of tags. So changing the styles manually is just not workable. You need to use
another program to tag (prepare) the file. It’s rather easy to do for HTML,
and other relatively common formats like XML and SGML. My personal preference goes
to a program called Rainbow (freeware). There are other possibilities like +Tools
(also freeware).
The process is rather simple and well explained in the documentation of both programs,
so I won’t belabor it. In Rainbow (once installed), you click on “Add”,
select the HTML files you need to prepare, go to the Tools menu, select “Prepare
for translation”, fill out the needed options, and under the “Package” tab,
you select where the tagged files should be created.
Some of it may look complex, but frankly it’s a no-brainer when all you have
to do is prepare an HTML file.
Find your files, open the RTF file in Word, and you are ready to translate.
3. Translating a tagged file.
This depends on your CAT. In Wordfast, start the translation as usual, with your
TM and glossaries, the lock bolt on the door, gaffer tape across the neighbor’s
kid’s mouth, Mozart playing (or AC/DC – your call), … whatever your set-up
usually is when you translate. ;-)
Tags in tw4winInternal are treated as placeables. You can select them in the
source segment using “Ctrl + Alt + Left/Right”, and “Ctrl + Alt
+ Down” will copy the selected tag into the target segment, at the insertion point. Type
your translation in the target and bring down the tags at the appropriate points
in the target sentence.
Use the tags to know what the text will look like and do not hesitate to refer to
the original HTML file when in doubt. As explained before, keep keywords in mind
and balance the text to match the original’s proportions as closely as possible.
(Of course, if the page is not meant for the general public but for an intranet, that
becomes much less important).
Please refer to the “tagged files” section of the Wordfast
manual. In summary, you have to make sure that you do not forget tags (Wordfast
has settings to remind you), that you keep the internal tags in the tw4winInternal
style and the translatable text in whatever style was originally used.
Example:
You are translating an HTML file!
Vous êtes en train de traduire un fichier HTML !
4. Done, now, what?
When your translation is done and the file cleaned (meaning all source segments
and segment delimiters have been deleted), you have a nice … RTF file. If neither
the source nor the target language requires Unicode and you do not have
special characters in the file, save it as txt (or copy all the code into Notepad)
and change the extension to “*.htm” or “*.html”. If you
use a language that requires Unicode (Chinese, Japanese, Russian, Thai,...), save
the file with the appropriate encoding and modify the charset information in the
file header to reflect the new encoding (e.g. UTF-8). See the HTML links to find
out more about encodings and file formats.
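Saving with the right encoding and updating the charset declaration are two halves of the same step, and it is easy to do one and forget the other. The Python sketch below does both at once. It is an illustration under simple assumptions: the translated source is already available as text, the page declares its charset in the usual “charset=...” form, and only the first such declaration matters. The function names declared_charset and save_as_html are invented for this example.

```python
import re
from pathlib import Path

# Matches "charset=" followed by an optional quote and the encoding name.
CHARSET_PATTERN = re.compile(r"(charset\s*=\s*[\"']?)([\w-]+)", re.IGNORECASE)


def declared_charset(html_text):
    """Return the charset named in the page, or None if none is declared."""
    match = CHARSET_PATTERN.search(html_text)
    return match.group(2) if match else None


def save_as_html(text, target_path, encoding="utf-8"):
    """Write the page in the given encoding and make its charset declaration agree."""
    # Replace only the first declaration, which is normally the one in the header.
    updated = CHARSET_PATTERN.sub(lambda m: m.group(1) + encoding, text, count=1)
    Path(target_path).write_text(updated, encoding=encoding)
    return updated
```

If the page has no charset declaration at all, the text is written unchanged, so it is worth calling declared_charset first and adding the Meta tag by hand when it returns None.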
If you have respected the tags, the file should look about right in the browser.
However, the translation is seldom the same length as the original text, so
you may have to make a few adjustments to make it fit nicely. If you are lucky, everything
can stay the same.
You are through. I hope this information will help you tackle HTML files in a
professional manner and feel confident with them. As you can see, there is nothing
really hard in HTML files, but they do require some extra attention. If it's
HTML, it's not just text.
At times the client wants you to translate the text with no consideration for the
HTML or a potential use on the net. That’s all right. If so, skip everything
and ask the client to provide a regular *.doc file, or open the HTML in Word and save it
as *.doc.
By Sylvain Galibert
© Sylvain Galibert. Reproduced with permission. This article is a courtesy of
www.your-translations.com.