Logogram Language Generator

Creating an alien language system using Deep Learning

Selim Şeker
8 min read · Dec 23, 2020

Selim Şeker — Yavuz Durmazkeser — Nurullah Cebeci — Güldeste Selen Dal

Sample words from generated Logogram Language

Logograms, written characters that represent words, have played a role in human life since the very beginning of writing systems: Sumerian archaic script, Egyptian hieroglyphs, Chinese characters, and so on. In popular culture, on the other hand, constructed scripts and alphabets are widely used to tell more artistic stories, like the logogram language in the movie Arrival or the Tengwar alphabet used for Elvish in The Lord of the Rings.
With these motivations, in this project we aim to build a logogram language generator. We use Unsupervised Multilingual Word Embeddings to acquire universal embeddings for each word and a beta-Variational Autoencoder as a sketcher.
With such a system, our desired contribution is to help storytellers create artistic languages from scratch for better stories.

Presentation in Turkish

Project GitHub repo.

This project was done within the scope of inzva AI Projects #5, August-November 2020. Check out the other projects here and on the inzva GitHub.

First things first, what does this “Logogram” mean?

According to Wikipedia,

In a written language, a logogram or logograph is a written character that represents a word or morpheme.

They appear in many ancient languages and writing systems, such as Sumerian, Akkadian, and Chinese characters. Moreover, in 2016 the great Stephen Wolfram and his son Christopher Wolfram built a computational logogram language for the movie Arrival. It was designed for the aliens in the movie, called Heptapods. And that was our inspiration point for this project.

Sumerian Cuneiform
the movie Arrival

Alrighty, so what? Are you creating a random language?

Well, kind of. But we have set some conditions for that language.

  • First, it should be consistent: each word should have a unique symbol, and semantically similar words should have visually similar symbols.
  • Second, it should be artistic: when a person looks at a symbol, it should look like a hand-written character, not like a random QR code.
  • And lastly, it should be multilingual: translations of the same word in different languages should be mapped to the same logogram symbol. For example, “world” in English, “dünya” in Turkish, “monde” in French, and “Welt” in German all refer to the same thing, our home planet, so their output logogram symbols should be the same.

Here is our mysterious black-box model.

Literature Review

To achieve such a system we need two main components: a Multilingual Word Embedder, so that translations of the same word share similar embedding vectors, and an image sketcher model. After our literature review, we ended up with the following two methods.

Unsupervised Multilingual Word Embeddings

(Chen and Cardie, EMNLP 2018)

With this work, the authors introduce a fully unsupervised framework to learn MWEs. The model maps n different language spaces to one shared space. Each language has an orthogonal linear encoder that linearly maps that language into the shared space. Since the encoder is an orthogonal matrix, we can simply take its transpose and end up with a decoder. Besides the encoders and decoders, each language also has a discriminator component, whose goal is to discriminate between mapped embeddings and real embeddings of that language. For the sake of optimization, we can also pick the shared embedding space to be one of the n languages, so that instead of learning n encoders we only need n-1. Our source and target language setup is described in the following sections.

Unsupervised Multilingual Word Embeddings (Chen and Cardie, EMNLP 2018)

For more detailed and technical information we strongly recommend having a look at the paper; it is pretty cool work.
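To make the orthogonality trick concrete, here is a minimal sketch (our own illustration, not the authors' code) of how an orthogonal mapper lets its transpose act as the decoder; in practice the matrix is learned and kept only approximately orthogonal.

```python
import torch

# Hypothetical illustration: a 300-d French embedding is mapped into the
# shared (English) space by an orthogonal matrix W; because W is orthogonal,
# its transpose maps the vector straight back.
d = 300
W, _ = torch.linalg.qr(torch.randn(d, d))   # a random orthogonal matrix

x_fr = torch.randn(d)        # a French word embedding
x_shared = W @ x_fr          # encode into the shared space
x_back = W.T @ x_shared      # decode: W.T equals W^-1 for orthogonal W

print(torch.allclose(x_fr, x_back, atol=1e-5))   # True, up to float error
```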

𝛃-Variational Autoencoders

(Higgins et al., ICLR 2017)

Without a doubt, autoencoders are among the most commonly used models for image generation. Briefly, the model tries to learn to encode a given input image into a latent vector and, at the same time, to decode the encoded vector back into the input image itself. Training such a system with a reconstruction loss allows the learner to generate images like the ones it has seen. In variational autoencoders, adding another loss term, the KL-divergence, pushes the latent distribution towards a Gaussian, so we get smooth transitions between image classes and styles. Lastly, adding a 𝛃 coefficient to the KL-divergence creates a trade-off between reconstruction quality and the extent of disentanglement of the latent variables.
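For concreteness, here is a minimal sketch of the 𝛃-VAE objective in PyTorch (our own illustration; the reconstruction term assumes images normalized to [0, 1]):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Reconstruction term + beta-weighted KL(q(z|x) || N(0, I))."""
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```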

Again, we strongly recommend having a look at the paper and at other detailed blog posts about VAEs.

Datasets

We used fastText's pre-trained monolingual word embeddings as input to the UMWE model. They are 300-dimensional word embeddings for 157 languages trained on Common Crawl and Wikipedia. Specifically, we use English (as the target language), French, Spanish, and Italian.
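For reference, the pre-trained vectors can be loaded with the fasttext Python package roughly like this (a sketch; file names follow fastText's cc.<lang>.300 naming):

```python
import fasttext
import fasttext.util

# Download and load the 300-d English vectors; the same pattern works
# for 'fr', 'es', and 'it'.
fasttext.util.download_model('en', if_exists='ignore')   # fetches cc.en.300.bin
ft_en = fasttext.load_model('cc.en.300.bin')

vec = ft_en.get_word_vector('world')   # a 300-d numpy array
print(vec.shape)                       # (300,)
```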

For the image generation part, we used the Omniglot dataset, 1623 different handwritten characters from 50 different alphabets.
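Omniglot can be pulled directly through torchvision (a sketch; the resize resolution is a placeholder we chose for illustration):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),   # placeholder resolution
    transforms.ToTensor(),
])

# background=True gives the 30-alphabet "background" split;
# background=False gives the remaining evaluation alphabets.
omniglot = datasets.Omniglot(root='./data', background=True,
                             download=True, transform=transform)
loader = DataLoader(omniglot, batch_size=128, shuffle=True)
```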

The following baseline method section explains how we use those models and datasets.

Omniglot Dataset

Baseline Method

Our method consists of two pre-training processes: one is pre-training the multilingual mappers, and the other is pre-training the autoencoder whose decoder we will reuse. We pre-trained the autoencoder on Omniglot so that the encoder learned how to properly encode different symbols from different alphabets and the decoder learned to sketch images from the latent space. After the pre-training phase, we extract the decoder from the autoencoder and feed any mapped word embedding vector directly to the decoder. Here is our baseline model.

Baseline Model
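In pseudocode, the baseline inference path boils down to the following sketch (hypothetical names; `vae` is the Omniglot-pre-trained autoencoder and `mapper` is a UMWE encoder into the shared space):

```python
import torch

@torch.no_grad()
def word_to_logogram(word_vec, mapper, vae):
    """word_vec: 300-d monolingual embedding of a word."""
    z = mapper(torch.as_tensor(word_vec))   # embedding -> shared space
    return vae.decoder(z)                   # the decoder sketches a symbol

# Example (hypothetical):
# logogram = word_to_logogram(ft_fr.get_word_vector('monde'), mapper_fr, vae)
```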

First Results…

As you can see, it has failed. And the reason is simple: our shared embedding space and the decoder's latent space are different from each other. As we mentioned above, in variational autoencoders the latent space follows a normal distribution, but there is no such constraint on the embedding space. So we need some kind of mapping between these two spaces.

Two Main Improvement Ideas

  1. Training a mapping layer between embedding and latent spaces
  2. Normalization

1.1 Layer with Pixel-wise Loss

Results with pixel-wise loss

We trained a mapping layer to minimize:

  • the absolute value of the difference between adjacent pixel values (a smoothness term)

and to maximize:

  • the absolute value of the difference between each pixel value and 0.5 (a contrast term, sketched in code below)
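In code, this pixel-wise objective looks roughly like the following (our own reconstruction of the idea; the term weights are placeholders):

```python
import torch

def pixelwise_loss(img, smooth_w=1.0, contrast_w=1.0):
    """img: (B, 1, H, W) tensor with values in [0, 1].
    Minimizing differences between adjacent pixels encourages smooth strokes;
    maximizing |pixel - 0.5| pushes pixels towards pure black or white."""
    dx = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()   # horizontal neighbours
    dy = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()   # vertical neighbours
    contrast = (img - 0.5).abs().mean()
    return smooth_w * (dx + dy) - contrast_w * contrast
```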
Mapper with Discriminator

1.2 Layer with Discriminator

Results with discriminator

The goal of the discriminator is to penalize the mapping layer with a binary cross-entropy loss by discriminating between the generated logogram images (fake) and real images from the Omniglot dataset (real).
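Roughly, the adversarial objective looks like this (a sketch with hypothetical module names; `disc` is the discriminator network and `fake_imgs` are logograms produced through the mapping layer):

```python
import torch
import torch.nn.functional as F

def discriminator_step(disc, real_imgs, fake_imgs):
    """Discriminator learns: Omniglot images -> 1, generated logograms -> 0."""
    real_pred = disc(real_imgs)
    fake_pred = disc(fake_imgs.detach())
    return (F.binary_cross_entropy(real_pred, torch.ones_like(real_pred)) +
            F.binary_cross_entropy(fake_pred, torch.zeros_like(fake_pred)))

def mapper_step(disc, fake_imgs):
    """The mapper is penalized whenever the discriminator spots its output as fake."""
    fake_pred = disc(fake_imgs)
    return F.binary_cross_entropy(fake_pred, torch.ones_like(fake_pred))
```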

Better, but still not good. Now let's move on to normalization.

(Fun fact: we don't know why, but the super simple idea of normalization only came to our minds a week after the two ideas above. For a whole week we struggled to train a mapping layer with complicated loss functions.)

Normalization:

Almost too simple: we just calculate the embedding space's mean and standard deviation, then normalize each embedding vector with them.
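In code this is only a few lines (a sketch; whether the statistics are computed per dimension or globally is our own choice here):

```python
import torch

def normalize_embeddings(emb_matrix):
    """emb_matrix: (vocab_size, 300) tensor of word embeddings.
    Standardizing to zero mean and unit std makes the vectors look more
    like samples from the decoder's N(0, I) latent prior."""
    mean = emb_matrix.mean(dim=0)   # per-dimension statistics (our choice)
    std = emb_matrix.std(dim=0)
    return (emb_matrix - mean) / std
```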

Results?

better…

Now all we need is a bit of post-processing and “everything nice” (except the Chemical X).

Just like the Powerpuff Girls recipe (a code sketch follows the list):
  • Added some saturation
  • A little contrast
  • And gamma too
  • Lastly, round the intermediate values to 0 and 1
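Here is a rough sketch of that recipe for a grayscale symbol (our own illustration; the contrast and gamma factors are placeholders, and saturation would only apply to a colored rendering, e.g. via torchvision's adjust_saturation):

```python
import torch

def postprocess(img, contrast=1.5, gamma=0.8):
    """img: tensor with values in [0, 1].
    Boost contrast, apply gamma correction, then round so that every
    intermediate value becomes a clean 0 or 1."""
    img = ((img - 0.5) * contrast + 0.5).clamp(0, 1)   # contrast
    img = img.pow(gamma)                               # gamma correction
    return img.round()                                 # hard black/white
```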

now we are talking…

As you can see, we have gained consistency. Each symbol is unique, and semantically similar words are mapped to similar symbols (king-queen, man-woman).

Linear Interpolation for more Artistic View

What about multilinguality?

Since our UMWE model does not work with 100% accuracy, even small differences between word embeddings affect the output symbol a lot. Here are some sample logograms of the same words in different languages.

Siamese Neural Networks

Lastly, we came up with a new idea, a Multilingual Mapper, to pull the same words from different languages closer together and push different words from the same language further apart. To do so, we trained a Siamese Neural Network, labeling same-word-different-language pairs as one and different-word-same-language pairs as zero. This way the siamese network learns to penalize the mapper, and the mapper forces itself to trick the siamese network by moving same-word-different-language vectors closer together.
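Here is a minimal sketch of how such a siamese critic could look (our own illustration with a placeholder architecture; the mapper is then trained to fool it on same-word-different-language pairs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Siamese(nn.Module):
    """Predicts whether two mapped embeddings denote the same word."""
    def __init__(self, dim=300):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                    nn.Linear(128, 64))
        self.head = nn.Linear(64, 1)

    def forward(self, a, b):
        ha, hb = self.branch(a), self.branch(b)
        return self.head((ha - hb).abs()).squeeze(-1)   # a logit

def siamese_loss(model, emb_a, emb_b, label):
    """label = 1 for same-word-different-language pairs,
    label = 0 for different-word-same-language pairs."""
    logit = model(emb_a, emb_b)
    return F.binary_cross_entropy_with_logits(logit, label.float())
```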

And results…

It is still an open problem for us, and the number one task in our future work.

Conclusion

After the 3-month marathon, we ended up in a pretty good position. We are getting fairly consistent and artistic results. However, we have not reached multilinguality yet, so for now our Alien-ish language works only for English.

What Could Be Next?

Well, obviously we need to improve multilinguality, and we are currently working on it. We are also planning image2word translation, so that we can create logograms from words and vice versa.

Extending the dataset with the Google-Doodle dataset?

Any ideas from you?

Please feel free to contact us with any questions and ideas.

selim.seker00@gmail.com
