This page describes a work in progress. The model and results presented here should not be considered statistically relevant until further improvements are worked out. If you're interested in contributing, please contact me! Open discussion on Facebook.
Imagine yourself living in a small French town sometime in the Middle Ages, with five people named Antoine around. For some reason, you need to be able to tell one from another in a conversation, like:
That's it: this is a possible origin for my surname Mazières, one of many Vulgar Latin variants of the word Masure, meaning a shitty house falling apart. The same goes for many of the most common French surnames, which refer to places, physical traits, occupations, nicknames, etc. For instance: Dupont ("The one near the bridge"), Petit ("The tiny one"), Morel ("The one with dark skin"), Fournier ("The baker"), Ferrand ("The farrier"), Martin ("The fertile warrior"), Bernard ("The strong bear"), etc.
From there it took a few steps (mainly in 1474 and 1539 in France) for various administrations to freeze these descriptive names, expressed in a local dialect, into hereditary identifiers freed from their original meaning but still carrying an origin. To some extent, the same naming practices are observed in many European countries: the need for surnames appears sometime between the 10th and 15th centuries with demographic growth (i.e. too many Antoines in town), and sometime around the end of the Middle Ages things got frozen for the administrative needs of some rising, centralized royal power. Things differ quite a bit in other world regions, such as China, where some legends date the adoption of surnames to Emperor Fu Xi in 2852 BC. The precision of the origin expressed by a Chinese surname is also much lower: 85% of the current Chinese population (1.3 billion people) uses a mere hundred surnames, with Wang (王), Li (李) and Zhang (張) alone covering 20% of the population.
Listing the main origins of surnames around the world is beyond the scope of this report. The point is that in most cases surnames were frozen at some point, at least a few centuries back, as a trade-off between local languages, administrative purposes and political/religious power plays. Looking at more recent history, some origins have been almost wiped out from surnames, as for Native Americans and African Americans, whose surnames most of the time carry Iberian or Anglo-Saxon origins. To a lesser extent, African origins are still present in surnames, but as a transliteration, if not a translation, of a local name into the language of a colonizing population, such as English, French or Portuguese. Another important source of bias when extracting information from surnames is that kids usually receive their surname from their father only, which erases at every generation the information coming from the mother.
Consider this: if our names were frozen 500 years ago, that's roughly 20 generations (at about 25 years each), which means that at the top of your ancestry tree you could have received your surname from 1 among 2^20 persons (> 1 million). Your name represents only one path in this gigantic binary tree:
How is it possible that among such a great number of possibilities, your name may still represent some kind of origin and not just a random piece of information?
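To put a number on "such a great number of possibilities", here is a minimal sketch of the arithmetic. The 25 years per generation is an assumption made for illustration, and the count ignores the fact that the same person can appear several times in an ancestry tree, so it is an upper bound rather than a real head count.

```python
# Assumption (for illustration): one generation lasts about 25 years.
YEARS_SINCE_FREEZE = 500
YEARS_PER_GENERATION = 25

generations = YEARS_SINCE_FREEZE // YEARS_PER_GENERATION

# Each person has 2 parents, so the top row of the tree holds 2**generations slots.
# This treats every slot as a distinct person, which overstates the real number.
ancestors_at_top = 2 ** generations

# A surname handed down from father to child follows exactly one of these slots.
surname_path_share = 1 / ancestors_at_top

print(f"{generations} generations")
print(f"{ancestors_at_top:,} ancestor slots at the top of the tree")
print(f"share of the tree carried by the surname: {surname_path_share:.2e}")
```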
One of the main possible reasons for this is what is called, among humans, endogamy: people tend to mate (i.e. sex + kids) with people "close" to them in many different ways. This closeness can be geographical, religious, ethnic, economic, social, etc. The hierarchy between these criteria evolves over space and time. For example, geographical proximity may have lost some of its grip with accessible transportation, and social endogamy may have decreased with the end of official state-backed nobility. Yet whether, in a given place and time, endogamy is increasing or decreasing, accelerating or slowing down, it's not too much of a stretch to say that endogamy is strong among humans and that random mating is merely a mathematical fantasy.
In our world of big databases and machine learning algorithms, whatever is not random is full of statistical information and predictive power. In that sense, fields such as biomedical sciences, genealogy, demography and marketing have used the information contained in surnames for various purposes. My purpose here is to use information about origins to observe their distribution in a population, measure how it differs in various sub-groups of the same population, and thereby assess the representativity of the latter.
For more details and code snippets on the methods briefly mentioned here, have a look at this Python notebook.
How do we humans infer the origin of a name? Think about a Japanese name like Toriyama. Even if you have never seen this surname, if you have already seen many Japanese names you might guess that it is Japanese. How so? The way it sounds when pronounced? Patterns in the ordering of letters? Building a statistical model to reproduce this intuition means learning, from many examples, which patterns and combinations of letters are most likely to represent an origin. To build such a model we need a lot of examples and, ideally, an open source of data so that the experiment can be reproduced. To that end, I chose to use PubMed data, which contains more than 25 million affiliations of scientific paper authors with their institutions, from which we can extract a country. This dataset is highly biased: we can assume that scientists tend to be more nomadic than average, that academic publications over-represent rich countries (e.g. in North America and Europe) and under-represent poor ones (e.g. in Africa), and that PubMed may index English-language publications better. Using various statistical techniques, we managed to partially counterbalance these biases and build a training dataset of surnames and their origins. To decompose the surnames into features that characterize each of them, we used n-grams, such as:
>>> ngrams('ROTH', depth=3)
['R', 'O', 'T', 'H', 'RO', 'OT', 'TH', 'ROT', 'OTH']
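The report does not show how `ngrams` is written, so the following is only a minimal sketch of one function that reproduces the output above. It assumes that `depth` is the maximum n-gram length and that n-grams are listed shortest first, in reading order.

```python
def ngrams(word, depth=3):
    """Return every contiguous letter sequence of length 1..depth, shortest first.

    Assumed behaviour, inferred from the 'ROTH' example only.
    """
    return [
        word[start:start + size]
        for size in range(1, depth + 1)              # 1-grams, then 2-grams, ...
        for start in range(len(word) - size + 1)     # slide a window over the word
    ]


print(ngrams('ROTH', depth=3))
```

Surnames shorter than a given n-gram length simply contribute no n-grams of that length, since the inner range is then empty.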
From here, the idea is to use a classifier, so that a computer program learns which features are relevant for inferring an origin. At first this classifier obtained pretty weak results, since many countries share too many features with other countries to be picked out by the classifier as a specific origin. For instance, a Spanish origin is very difficult to differentiate from many countries in Latin America, since most surnames in both places have Hispanic origins. Therefore we used a clustering technique to group countries by feature affinity. As can be seen in the following figure, this yields pretty impressive results: albeit with a few errors, the clustering procedure successfully recovered the main cultural/linguistic regions of the world:
With these clusters, the learning procedure yields at its best the results shown in the figure below. While these results are not bad, they are far from robust enough to base any serious use on such a model.
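The report does not say which affinity measure or clustering technique was used to group countries. As one possible illustration, the sketch below measures "feature affinity" as the cosine similarity between the n-gram counts of two countries. The function names and the tiny hand-picked surname lists are assumptions made for the example, not the PubMed training data.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def ngrams(word, depth=3):
    """Contiguous letter sequences of length 1..depth (assumed implementation)."""
    return [
        word[start:start + size]
        for size in range(1, depth + 1)
        for start in range(len(word) - size + 1)
    ]


def country_profile(surnames, depth=3):
    """Count every n-gram over all the surnames observed for one country."""
    profile = Counter()
    for name in surnames:
        profile.update(ngrams(name.upper(), depth))
    return profile


def cosine_similarity(a, b):
    """Cosine between two n-gram count vectors (0.0 if either one is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy surname lists, hand-picked for illustration only.
toy_surnames = {
    "Spain": ["GARCIA", "MARTINEZ", "LOPEZ", "SANCHEZ"],
    "Mexico": ["HERNANDEZ", "GARCIA", "MARTINEZ", "RAMIREZ"],
    "Japan": ["TORIYAMA", "TANAKA", "SUZUKI", "YAMAMOTO"],
}
profiles = {country: country_profile(names) for country, names in toy_surnames.items()}

# Pairwise affinities: countries sharing many n-grams score closer to 1.
for left, right in combinations(profiles, 2):
    print(f"{left} / {right}: {cosine_similarity(profiles[left], profiles[right]):.2f}")
```

On this toy data, Spain and Mexico come out closer to each other than either is to Japan, which is the kind of affinity that makes them hard to separate as individual origins and natural to merge into one cluster. A pairwise similarity table like this one could then be handed to any standard clustering procedure.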
Here are a few examples of names correctly classified by the model: Mazieres (North European), Roth (North European), Khemakhem (Arabic), Nguyen (Asian), Traore (African). On the other hand, here is a list of misclassified examples:
Although the model is not robust enough to allow a legitimate use, let's see a few applications I had in mind when trying to build it. First, while in many countries statistics on the distribution of origins are partially available through ethnicity censuses, such data is not available in France. By taking the surnames of all candidates for the Brevet, the most widely taken exam in France, we get a pretty representative sample of the diversity of surnames among French youth. Using our algorithm to infer the origins of these surnames, we get this picture of the distribution of surname origins among French youth (A). This base dataset can be compared to other groups in French society to assess the representativity of the latter. In that sense, figure B shows the distribution of surname origins for the national legislative assembly in 2011. Finally, a simple operation (log(B/A)) allows us to compare these two distributions and see how fairly each origin is represented: a value of 0 means an origin has the same share in B as in A, a positive value means it is over-represented in B, and a negative value means it is under-represented.
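The log(B/A) comparison can be sketched in a few lines. The function name, the origin labels and the shares below are all invented for illustration and are not results from the model. The natural logarithm is assumed; another base would only rescale the values, not change their sign.

```python
from math import log


def representation_log_ratio(base_shares, group_shares):
    """Return log(B/A) per origin, where A is the base share and B the group share.

    Origins with a zero or missing share on either side are skipped,
    since the logarithm is undefined there.
    """
    return {
        origin: log(group_shares[origin] / base_shares[origin])
        for origin in base_shares
        if base_shares[origin] > 0 and group_shares.get(origin, 0) > 0
    }


# Hypothetical shares, made up for this example only.
base_a = {"origin_x": 0.50, "origin_y": 0.30, "origin_z": 0.20}   # e.g. the base population
group_b = {"origin_x": 0.70, "origin_y": 0.25, "origin_z": 0.05}  # e.g. a sub-group

ratios = representation_log_ratio(base_a, group_b)
for origin, value in ratios.items():
    # Positive: over-represented in B. Negative: under-represented in B.
    print(f"{origin}: {value:+.2f}")
```

Using a logarithm makes the scale symmetric: an origin twice as frequent in B as in A and an origin half as frequent get values of the same size with opposite signs.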
Once again, the model used is far from reasonable statistical relevance, and the results should rather be interpreted as a proof of concept. For instance, the inference of origins for French deputies shows many errors, such as: Guigou is classified as African, Jung as Asian, Lefait as Arabic, etc. However, the overall picture provided by figure A does not seem so far from a fair-ish representation of the diversity of French people's origins.
If such a tool could reach better accuracy, we could imagine inferring the representativity of many sub-groups of any base dataset, my main focus here being France: for instance, assessing the representativity of professions, political institutions, schools, companies, etc. Moreover, instead of simply describing potentially discriminatory situations, such an algorithm could help spot who is doing better than others in terms of diversity, and therefore trigger precise and localised qualitative research to uncover the determinants of successful efforts.
While I'll be updating this report as my work improves, here are a few links if you want to start on your own:
Don't hesitate to contact me if you want more information or to participate in the project!