Mike Burr - log

[mind] Extracting information about "similarity" between natural languages with machine translation

It's fun to play with Google Translate. One fun trick is a game of "translation telephone", where you translate from one language to a second, then to a third, and then back to your original language. You get laughable results about 80% of the time. There are even bots and stuff to automate this.

I wonder what kinds of things a machine could say about the results of these "loops".

When designing such a system, you know that the goal is for the result to be identical to the source. There are many different strategies you could use to "rank" the quality of the result even if you don't know the language. Simple ones would be things like num_words_diff or before_after_length_diff. A machine could also have access to a thesaurus and get more sophisticated: if one word is different, look it up and at least see whether the thesaurus considers the new word a synonym.
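Here is a minimal sketch of the two crude metrics named above. The names num_words_diff and before_after_length_diff come from the paragraph; the definitions (absolute differences, whitespace word splitting) are my own assumptions, and the example sentences are made up.

```python
def num_words_diff(source: str, result: str) -> int:
    """Absolute difference in whitespace-separated word counts (assumed definition)."""
    return abs(len(source.split()) - len(result.split()))


def before_after_length_diff(source: str, result: str) -> int:
    """Absolute difference in character length (assumed definition)."""
    return abs(len(source) - len(result))


# Made-up example: the original sentence and what came back from the loop.
source = "The quick brown fox jumps over the lazy dog"
result = "The fast brown fox jumped over a lazy dog"

print(num_words_diff(source, result))
print(before_after_length_diff(source, result))
```

Both numbers are zero for a perfect round trip, but a low number does not prove the meaning survived, which is why something like the thesaurus check is worth thinking about.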

Conjugation (and word form in general) throws a huge wrench into this. At the very least, the form of the word should stay the same. So you would expect quickly to be in the synonym list of briskly. You would need to rely on it not being quick, though few thesauruses would have an entry for both quick and quickly...so there's that.
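A rough sketch of the thesaurus check, to make the quickly/briskly/quick point concrete. The THESAURUS dictionary is a toy invented for illustration, and the word-by-word alignment assumes both sentences have the same number of words, which real translation loops will often violate.

```python
# Toy thesaurus, invented for illustration; a real one would be far larger.
THESAURUS = {
    "quickly": {"briskly", "rapidly", "swiftly"},
    "briskly": {"quickly", "rapidly"},
    "quick": {"fast", "rapid"},
}


def is_synonym(a: str, b: str, thesaurus=THESAURUS) -> bool:
    """True if the words match or either lists the other as a synonym."""
    a, b = a.lower(), b.lower()
    return a == b or b in thesaurus.get(a, set()) or a in thesaurus.get(b, set())


def unexplained_changes(source: str, result: str, thesaurus=THESAURUS) -> int:
    """Count aligned word pairs that differ and are not listed as synonyms.

    Assumes equal word counts; zip() silently stops at the shorter sentence.
    """
    pairs = zip(source.split(), result.split())
    return sum(1 for s, r in pairs if not is_synonym(s, r, thesaurus))


print(unexplained_changes("she walked quickly", "she walked briskly"))
print(unexplained_changes("she walked quickly", "she walked quick"))
```

The second call counts a change because this toy thesaurus keeps quick and quickly in separate entries, which is exactly the property the check relies on.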

Also, the concept of "a word" is mushy and can mean different things in different languages. Compare English and German.

This might be a non-issue, as you are only considering your source/destination language (we'll pretend it's English). Also, even in languages that run their words together in writing, there are always some kinds of boundaries. If one needed to, one could try to create "rules" for what words are.

Hard!

But what I'm getting at is that if one could come up with a meaningful metric for "how bad the result is", a machine could learn things about natural languages. Maybe. Suppose we had a "perfect" system to rank the results, and assume we're just using plain old Google Translate for the experiment, nothing fancier.

We might find out that the quality of

en -> es -> pt -> en

was better than the quality of

en -> es -> en

That kinda thing.

Human languages form a kind of tree. We might discover some cool patterns. Is all.

And by "quality" I mean "as computed by the system over trillions of example sentences" (as resources allow).
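To sketch what "computed over many sentences" might look like: run each sentence through a chain of languages, score the result against the source, and average. Everything here is an assumption on my part. The translate argument is a hypothetical callable (text, src, dst) -> text that you would have to wire up to a real translation service yourself, lossy_stub is a fake stand-in so the code runs, and difflib's word-level ratio is only a placeholder for the "perfect" ranking system.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable, Iterable, Sequence

# Hypothetical translator signature: (text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def round_trip(text: str, chain: Sequence[str], translate: Translator) -> str:
    """Translate text hop by hop along chain, e.g. ["en", "es", "pt", "en"]."""
    for src, dst in zip(chain, chain[1:]):
        text = translate(text, src, dst)
    return text


def similarity(source: str, result: str) -> float:
    """Word-level similarity in [0, 1]; a placeholder for a real quality metric."""
    return SequenceMatcher(None, source.split(), result.split()).ratio()


def chain_quality(sentences: Iterable[str], chain: Sequence[str],
                  translate: Translator) -> float:
    """Average round-trip similarity over all sentences for one chain."""
    return mean(similarity(s, round_trip(s, chain, translate)) for s in sentences)


def lossy_stub(text: str, src: str, dst: str) -> str:
    """Fake translator for demonstration: drops the last word on any hop into "pt"."""
    words = text.split()
    return " ".join(words[:-1]) if dst == "pt" and len(words) > 1 else text


sentences = ["the cat sat on the mat", "it is raining again today"]
for chain in (["en", "es", "en"], ["en", "es", "pt", "en"]):
    print(" -> ".join(chain), chain_quality(sentences, chain, lossy_stub))
```

With a real translator plugged in, comparing the averages across many chains is where any language-family patterns would show up, if they show up at all.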

I make no guarantees about the quality of this idea.