~$ Revamping my Bachelor's project.
Tags: Academia, Programming, DevOps, Development
Way back when (in the spring of 2021), I was in the final year of my Bachelor's degree, which meant I had to do a Bachelor's project.
This project, which I took up mainly because I had to, was on a topic proposed by Laurent Moccozet, one of my professors, since I was not motivated enough to come up with a fun project idea of my own.
In essence, it revolved around the visualisation of Dominant Language Constellations, which are the focus of a number of academic works authored by Larissa Aronin:
- L. Aronin, "Multi-competence and Dominant Language Constellation" in The Cambridge Handbook of Linguistic Multi-Competence, 2016, doi: 10.1017/CBO9781107425965.007
- L. Aronin, "Dominant Language Constellations In Education, Language Teaching And Multilingualism", 2018, [online]: Personal website
- J. Bianco and L. Aronin, "Introduction: The Dominant Language Constellations: A New Perspective on Multilingualism", 2020, doi: 10.1007/978-3-030-52336-7_1
The tool was created as part of an academic collaboration between Larissa Aronin, Laurent Moccozet (a lecturer and researcher at the University of Geneva), and myself.
This blog post explains how it was developed and goes into some of the meatier technical details.
Summary
- The "Research Problem"
- Representation questions
- Initial implementation
- Problems
- Revamped implementation
- Final project
The "Research Problem"
The first questions you may be asking are "well, what exactly is a Dominant Language Constellation?", and "why is it important?".
To the former, I can answer by saying that a Dominant Language Constellation (or DLC) is a "snapshot" representation of the languages a person knows, augmented with a quantification of their perception of the distance between these languages.
Explained differently, it is a snapshot of a person's vehicle languages (their active linguistic abilities) at a given time.
The latter question is then answered by noting that this perceived distance is tied to factors influenced by a person's context (and may therefore change over time): shifts in globalization and superdiversity, but also geography, society, work, and community. This points to an interconnectedness of global and local contexts, a phenomenon named after the contraction of the words globalization and localization: glocalization.
The main issue arises when you attempt to represent a DLC, which researchers had taken to doing with sticks and papier-mâché balls:
Although this method enables a tangible representation of the DLC, it does not allow one to systematically compare DLCs, nor does it allow for large studies without a significant amount of organization.
Due to the potential n-dimensionality of a DLC, representing it as a 2D render on a computer is also not very practical (see below), because the notions of depth and interactivity are lost.
To summarize, this project needed an accessible way to visualize and compare DLCs, one that can be interacted with and that conveys the notion of depth (i.e. a 3D representation).
On the technical side, this means that we need an interface (preferably in a browser) that has the ability to display data collected from any respondent to the experiment, implying that it is stored in a database of some kind.
Representation questions
Before getting into specific technologies, I needed to consider the various use cases (diagram below), how the information would be collected, what would be an efficient way to represent it as information, and how the various services would communicate.
For simplicity, we opted to collect information via Google Forms, which is a bit iffy from a privacy perspective, but we were told that, given the nature of the information collected, this did not constitute a problem. The collected fields were the following (a hypothetical response row is sketched after the list):
- Respondent Details:
- Age
- Gender
- Nationality
- Country of Residence
- Level of Education
- Languages spoken:
- Language name and spoken dialect
- Proficiency
- Familiarity
- Self-assessed linguistic "distance"
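To make this a bit more concrete, a single (entirely hypothetical) response row of the exported data could look roughly like the following; the actual column layout produced by the form is more verbose:

```
age,gender,nationality,residence,education,language_1,proficiency_1,familiarity_1,language_2,proficiency_2,familiarity_2,distance_1_2
24,F,Swiss,Switzerland,Bachelor,English,5,5,French,3,4,2
```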
We could then move on to conceiving and implementing this service.
Initial implementation
Since we now had data from 51 respondents, I had to figure out how to effectively exploit the relationships between the two node types available to me (Person and Language).
Thankfully, I had recently seen graph databases in a class, and decided that using Neo4j would be an interesting experiment.
Database
So I came up with a database schema that would make use of the graph-type relationships that were available:
Here, a Person has a relationship called :KNOWS_LANGUAGE with a Language. Additionally, two Language nodes can have a relationship specific to a given Person, which is called :HAS_DISTANCE.
The following representation is not really "standard", but you can see two beings (Blue and Green) who know a number of different languages; the relationships between these languages that carry the "distance" are also color-coded Blue or Green depending on whom they reference.
Since our form produced a CSV file, we first had to convert the CSV into a series of Cypher queries (Cypher being the language used to query Neo4j), which was done using a Python script.
If you really want to take a look at the (ugly) Python script, you can check it out here: extractor.py.
The Cypher output then looks like the following, and can be injected into a Neo4j database using the web interface or a CLI tool such as cycli.
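As a rough idea (the property names and values here are illustrative, not the script's exact output), a single respondent ends up as something like:

```cypher
// One respondent, two languages, and the person-specific distance between them
CREATE (p:Person {id: 1, age: 24, gender: "F", nationality: "CH"})
MERGE (en:Language {name: "English"})
MERGE (fr:Language {name: "French"})
CREATE (p)-[:KNOWS_LANGUAGE {proficiency: 5, familiarity: 5}]->(en),
       (p)-[:KNOWS_LANGUAGE {proficiency: 3, familiarity: 4}]->(fr),
       (en)-[:HAS_DISTANCE {personId: 1, distance: 2}]->(fr);
```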
This in turn produces a database, which is represented as follows:
Backend
Now that we had a database, I could write the backend in Node.js and Express, simply because a database connector existed for Neo4j in that environment, and it is quite easy to write a pass-through API in it.
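A minimal sketch of what such a pass-through endpoint could look like (the route, property names, and credentials are assumptions, not the project's actual code):

```js
const express = require("express");
const neo4j = require("neo4j-driver");

// Hypothetical connection details
const driver = neo4j.driver("bolt://localhost:7687", neo4j.auth.basic("neo4j", "password"));
const app = express();

// Return the languages known by one person, straight out of the graph
app.get("/person/:id/languages", async (req, res) => {
  const session = driver.session();
  try {
    const result = await session.run(
      "MATCH (p:Person {id: $id})-[k:KNOWS_LANGUAGE]->(l:Language) RETURN l, k",
      { id: neo4j.int(req.params.id) }
    );
    res.json(result.records.map((r) => ({
      language: r.get("l").properties,
      knows: r.get("k").properties,
    })));
  } finally {
    await session.close();
  }
});

app.listen(3000);
```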
Frontend
With a database, a backend, and time pressing, I then started building the frontend in Angular.
This brought me to the main question: How does one represent a 3D model in a browser and have the ability to interact with it?
The answer, as it turns out, was force graphs! The main reason is that the force dynamics eventually settle, allowing for the "closest" possible match to the expected model.
And thankfully, a fork of d3 exists to do exactly this: d3-force-3d.
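A minimal sketch of the idea, assuming hypothetical DLC data (the distance scaling and force configuration are illustrative):

```js
import { forceSimulation, forceLink, forceManyBody, forceCenter } from "d3-force-3d";

// Nodes are languages; link distances carry the perceived distances
const nodes = [{ id: "English" }, { id: "French" }, { id: "Hebrew" }];
const links = [
  { source: "English", target: "French", distance: 2 },
  { source: "English", target: "Hebrew", distance: 4 },
];

const simulation = forceSimulation(nodes, 3) // 3 -> simulate in three dimensions
  .force("link", forceLink(links).id((d) => d.id).distance((d) => d.distance * 10))
  .force("charge", forceManyBody())
  .force("center", forceCenter());

// Once the forces settle, each node carries x/y/z coordinates to hand to the renderer
simulation.on("end", () => console.log(nodes));
```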
Now all that remained was to make these graphs react to the provided input and display the result to the user:
Here, the aside gives access to the various functionalities, the data pane lets one inspect the raw data, and the graphs pane lets one interact with the rendered model.
A few features were devised, namely visualizing a single DLC (above), a comparative DLC view (below, first), a self-design interface, and the overall DLC cluster (below, second) or a sub-cluster thereof (below, third).
What is a DLC cluster? Well, it is the averaged DLC of all respondents, while a sub-cluster focuses only on the languages that connect to a set of specified languages.
Problems
At this point, I presented my work and received my grade, after which the project may or may not have lain dormant for 1.5 years while I did my M.Sc.
After defending my M.Sc. thesis (see the related blog post), and considering that the overall study was still active, I decided to hand over the project to Laurent Moccozet.
However, this meant a few things needed to be fixed, namely:
- The ability for Laurent to be able to add datasets;
- The ability for users to switch datasets and see more information on the service page;
- Using a faster database because of lag on certain requests (Neo4j is Java, after all);
- Switching the API to something less finicky than raw Node.js and Express;
- Having a form of deployment architecture for it to be deployed at the University of Geneva.
Revamped implementation
Database
The first thing I dealt with was finding a competitor to Neo4j that was also graph-based, offered a similar query language, and was ultimately faster.
This led me to ArangoDB, which uses aql for queries, is graph-based, but most importantly is written in C++, with the interaction layer written in JavaScript.
This however meant I needed to rewrite the extractor to convert to aql instead of Cypher, and I also took the liberty of making the output dataset-dependent, so that more than one dataset can exist concurrently on the service with no collisions and minimal impact on usability.
If you (again) really want to take a look at the (ugly) Python script, you can check it out here: extractorarango.py.
This file, instead of generating one massive Cypher query, builds up the database structure and segments it into CSV files representing the various collections and relationships, hence there being a people-DATASETNAME.csv, a knowsLanguage-DATASETNAME.csv, etc.
The CSV files contain specific headers that specify nodes (_key) or relationships (_from and _to), with any other columns specifying other attributes of said nodes or relationships.
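For instance, a node file and an edge file could look roughly like this (the attribute names are illustrative):

```
people-DATASETNAME.csv        (nodes: "_key" identifies each document)
_key,age,gender,nationality
p1,24,F,Swiss

knowsLanguage-DATASETNAME.csv (edges: "_from"/"_to" reference "collection/key")
_from,_to,proficiency,familiarity
people-DATASETNAME/p1,languages-DATASETNAME/english,5,4
```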
Backend
I also ended up rewriting the backend in LoopBack, a TypeScript "wrapper" around Express and Node.js, which functions a lot like FastAPI does for Python.
I also added the ability to filter by a specified dataset, which enables the API to respond only within the scope of the user's currently selected dataset.
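A minimal sketch of what such a dataset-scoped endpoint could look like in LoopBack (the controller, route, connection details, and collection naming are assumptions, not the actual project code):

```ts
import {get, param} from '@loopback/rest';
import {Database} from 'arangojs';

// Hypothetical connection; in practice this would come from configuration
const db = new Database({url: 'http://arangodb:8529', databaseName: 'dlc'});

export class LanguageController {
  // Return every language node belonging to the currently selected dataset
  @get('/languages')
  async findLanguages(
    @param.query.string('dataset') dataset: string,
  ): Promise<unknown[]> {
    // Collections are suffixed with the dataset name, e.g. "languages-pilot2021"
    const cursor = await db.query('FOR l IN @@collection RETURN l', {
      '@collection': `languages-${dataset}`,
    });
    return cursor.all();
  }
}
```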
Frontend
The frontend is the aspect which required the most tweaking and time, but not really from a technical perspective.
It simply required some additional content, as well as fixing a few CSS issues that were messing up the styling in certain conditions.
In the aside, I added the dataset switcher, which is simply a dropdown to select a different dataset, upon which the displayed data is updated.
Since all of the content requires an initial round trip to the database, there is now a small loading screen, which was not strictly necessary, but was nice to add.
I then also added an information page which presents the project for readers, and an overview page which shows a number of statistics related to the currently selected dataset.
Architecture and Deployment
After all of these different bits were solved, I could move on to the deployment of the application.
Because I enjoy having the ability to replicate things and I absolutely hate developing on barebones infrastructure, I set out to have a Docker deployment of the entire application.
This led me to initially set up a few containers for the database, the database's persistence, the backend and the frontend.
Database
The database container was the most confusing to set up, specifically because it deals with pre-populated data generated by our Python script.
The first step was to create our Dockerfile:
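In essence, it boils down to something like the sketch below (assuming the official arangodb image; the version tag and file locations are guesses):

```dockerfile
FROM arangodb:3.9

# Ship the generated CSV files and the init scripts inside the image
COPY data/*.csv /tmp/data/
COPY arango.docker-init.sh /tmp/arango.docker-init.sh
COPY arangodb.docker-init.js /tmp/arangodb.docker-init.js
COPY commands.sh /commands.sh
RUN chmod +x /commands.sh /tmp/arango.docker-init.sh

# Start through our own script instead of launching arangod directly
ENTRYPOINT ["/commands.sh"]
```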
The commands.sh file contains the following, which allows the arangod (ArangoDB daemon) to start before trying to insert the database contents.
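In broad strokes (the endpoints, variable names, and polling logic below are assumptions), it does something like:

```sh
#!/bin/sh
# Start the ArangoDB daemon in the background
arangod --server.endpoint tcp://0.0.0.0:8529 &

# Poll the server until it answers before running the one-shot import
until arangosh --server.endpoint tcp://127.0.0.1:8529 \
               --server.password "$ARANGO_ROOT_PASSWORD" \
               --javascript.execute-string "db._version()" > /dev/null 2>&1; do
  echo "Waiting for arangod to come up..."
  sleep 1
done

/tmp/arango.docker-init.sh

# Keep the daemon in the foreground so the container stays alive
wait
```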
The arango.docker-init.sh file is where things get a little... funky.
As you can see below, it first checks whether or not it has been run (based on the persistence volume). If the script has not yet been run, it uses arangosh (the arango shell) to execute a JavaScript file. It then loops over all of the CSV files generated by extractorarango.py (described above) and raw-imports each one into a collection named after the file.
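Put together, a sketch of that logic (the marker path, variable names, and flags are assumptions) looks like this:

```sh
#!/bin/sh
# A marker file on the persistence volume records whether the import already ran
MARKER=/var/lib/arangodb3/.init-done

if [ ! -f "$MARKER" ]; then
  # Create the database and credentials
  arangosh --server.password "$ARANGO_ROOT_PASSWORD" \
           --javascript.execute /tmp/arangodb.docker-init.js

  # Import every generated CSV file into a collection named after the file
  # (edge collections additionally need --create-collection-type edge)
  for f in /tmp/data/*.csv; do
    collection=$(basename "$f" .csv)
    arangoimport --server.database "$ARANGO_DB_NAME" \
                 --server.password "$ARANGO_ROOT_PASSWORD" \
                 --collection "$collection" \
                 --create-collection true \
                 --type csv --file "$f"
  done

  touch "$MARKER"
fi
```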
The /tmp/arangodb.docker-init.js (referenced above) sets up the credentials to be used and the initial database, based on the environment variables defined in the docker-compose.yaml file shown below.
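Its contents amount to something like the following sketch (the names are placeholders; in the real script the values come from those environment variables):

```js
// Executed through arangosh: create the application database and its user
const users = require("@arangodb/users");

db._createDatabase("dlc");
users.save("dlc_user", "dlc_password");
users.grantDatabase("dlc_user", "dlc", "rw");
```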
After *waves* all of *that*, we simply specify the database and persistence containers in the docker-compose.yaml file; only the database-related part is sketched below, as the full file is rather long.
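Service and volume names here are assumptions, and persistence is sketched as a named volume:

```yaml
version: "3.8"

services:
  arangodb:
    build: ./database
    environment:
      - ARANGO_ROOT_PASSWORD=${ARANGO_ROOT_PASSWORD}
      - ARANGO_DB_NAME=${ARANGO_DB_NAME}
    ports:
      - "8529:8529"
    volumes:
      - arangodb-data:/var/lib/arangodb3

volumes:
  arangodb-data:
```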
Backend
The backend itself is much simpler to build and set up, and simply relies on the following Dockerfile.
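Something along these lines (the Node version and build output are assumptions):

```dockerfile
FROM node:16-alpine

WORKDIR /app
COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build   # compiles the LoopBack TypeScript sources

EXPOSE 3000
CMD ["node", "."]
```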
We then link the container in the docker-compose.yaml:
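Roughly (the service and variable names are assumptions):

```yaml
  backend:
    build: ./backend
    depends_on:
      - arangodb
    environment:
      - ARANGO_URL=http://arangodb:8529
```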
Frontend
The frontend is *even simpler* to deal with: since it is built using Angular, one simply needs to generate a production build and throw it into a directory served by the web server.
The Dockerfile then looks like this:
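Or at least, something close to this two-stage sketch (the image tags and the dist/ folder name are assumptions):

```dockerfile
# Stage 1: build the Angular production bundle
FROM node:16-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build -- --configuration production

# Stage 2: serve the static build with NGINX
FROM nginx:alpine
COPY nginx.conf /etc/nginx/conf.d/default.conf
COPY --from=build /app/dist/dlc-frontend /usr/share/nginx/html
```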
The nginx.conf file is simply in charge of serving content on port 80:
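Along the lines of the following sketch (the fallback to index.html keeps Angular's client-side routing working):

```nginx
server {
    listen 80;

    root /usr/share/nginx/html;
    index index.html;

    # Fall back to index.html so client-side routes resolve on refresh
    location / {
        try_files $uri $uri/ /index.html;
    }
}
```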
We then link the container in the docker-compose.yaml:
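Again roughly (service names are assumptions):

```yaml
  frontend:
    build: ./frontend
    depends_on:
      - backend
```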
Linking the bits
So we now have a database, an API that can talk to it and a frontend.
But if you have been paying attention, you may have noticed that the frontend and the backend cannot talk to one another, which may seem counterproductive.
The reason for this is that I have time and time again run into issues with CORS policies (Cross-Origin Resource Sharing), even on services hosted on the same server.
(And most people that have ever done projects like this have strong opinions on CORS, source: trust me)
And since I hate CORS issues, the easiest way I have found to deal with all of this is simply to slap an NGINX proxy in front of it all, with /URI pointing to the frontend, and /api/URI pointing to the API.
Same origin, same resource, no CORS issues, bliss.
So we have yet another Dockerfile:
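This one is tiny, along the lines of:

```dockerfile
FROM nginx:alpine
COPY nginx.conf /etc/nginx/conf.d/default.conf
```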
And yet another nginx.conf file below.
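The sketch below captures the idea (hostnames, ports, and certificate paths are assumptions; whether the /api prefix gets stripped depends on the backend's routes):

```nginx
upstream frontend {
    server frontend:80;
}
upstream backend {
    server backend:3000;
}

server {
    listen 80;
    server_name glossastra.unige.ch;

    # Serve the ACME challenge over plain HTTP for certificate issuance
    location /.well-known/acme-challenge/ {
        root /var/www/certbot;
    }

    # Everything else gets bounced to HTTPS
    location / {
        return 301 https://$host$request_uri;
    }
}

server {
    listen 443 ssl;
    server_name glossastra.unige.ch;

    ssl_certificate     /etc/letsencrypt/live/glossastra.unige.ch/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/glossastra.unige.ch/privkey.pem;

    # Routing happens after TLS termination, so frontend and API share one origin
    location /api/ {
        proxy_pass http://backend/;
    }
    location / {
        proxy_pass http://frontend/;
    }
}
```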
The core aspects are that it configures the upstreams to point to the frontend and backend containers and their ports using Docker's DNS resolution (CONTAINER_HOSTNAME:PORT), serves the .acme-challenge path for all of the SSL purposes, and redirects to the HTTPS website; since this is done at the proxy level, it won't hinder our accessing the API (because the routing happens after SSL termination!).
"Wait a minute!", you may be asking, "what's all this SSL stuff doing there? We haven't talked about that yet!"
You are correct!
As it turns out, the container I set up with the Dockerfile has volumes tacked on by docker-compose that originate from another container, the certbot container, which allows us to request SSL certificates issued by Let's Encrypt.
We then link the containers in the docker-compose.yaml:
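Schematically (service, volume, and path names are assumptions):

```yaml
  proxy:
    build: ./proxy
    ports:
      - "80:80"
      - "443:443"
    depends_on:
      - frontend
      - backend
    volumes:
      - certbot-www:/var/www/certbot
      - certbot-conf:/etc/letsencrypt

  certbot:
    image: certbot/certbot
    volumes:
      - certbot-www:/var/www/certbot
      - certbot-conf:/etc/letsencrypt

volumes:
  certbot-www:
  certbot-conf:
```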
At the end, our architecture looks like this:
So we now have an entire bunch of services that all work magically. My only remaining issue is that the ArangoDB instance is still accessible from the outside on port 8529, which I typically consider to be a bad thing.
Thankfully, iptables, despite being a pain to avoid breaking in eldritch manners, is quite useful for this:
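Something like the following does the trick (Docker-published ports bypass the INPUT chain, so the rule goes into DOCKER-USER; the external interface name is an assumption):

```sh
# Drop external traffic to the published ArangoDB port
iptables -I DOCKER-USER -i eth0 -p tcp --dport 8529 -j DROP
```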
Final project
Currently, due to a number of small failures in trying to get an SSL certificate again, the service is only available over HTTP.
Despite this, feel free to play around with it: http://glossastra.unige.ch (on a computer or tablet... not a phone).