When analyzing social media data (such as tweets), you often want to understand the demographics of the people you are studying, e.g. age, sex, nationality or where they live. Unfortunately this data is usually not provided, so the analyst needs to infer it from what is normally available, e.g. username, profile picture, tweet content and geotags.
This blog post looks at the challenge of predicting someone’s age based on their username. There are of course alternative approaches, such as estimating age from the profile picture using facial recognition, but that type of machine learning is beyond the scope of this post.
The approach I take in this post is inspired by Nate Silver’s post on fivethirtyeight.com and works as follows:
- Pattern match the username against a database of known first names, outputting the first name of best fit.
- Given a first name, estimate the probability that the person was born in each year between 1920 and the present day.
Unlike Nate Silver’s post, I will be sharing R code so that you can build your own solution, and I’ll also be dealing with the case where you’ve only got rank data, e.g. a list of the top 100 names rather than a full census that includes the frequency of names.
Pattern Matching against username
To perform the pattern matching exercise you’ll need a database of first names. One reasonable source is the Social Security Administration popular names list. Amazingly, this gives you every first name registered in the USA between 1880 and 2014 that was used at least 5 times in a year, along with gender and the number of times it was used in that year.
To find the name of best fit one can use grep for pattern matching – here’s a code snippet below to illustrate the process:
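A minimal sketch of this step in R, assuming a small hypothetical vector `first_names` in place of the full names list. It uses `grepl` (the logical variant of `grep`) with literal, case-insensitive matching, and when several names match it keeps the longest one. The function name `match_first_name` is illustrative only.

```r
# Hypothetical lookup table of known first names (in practice, the SSA list).
first_names <- c("Jack", "Jacky", "David", "Zoe", "Emma")

# Return the longest known first name found inside a username, or NA if none match.
match_first_name <- function(username, known_names) {
  user_lower <- tolower(username)
  # fixed = TRUE treats each name as a literal string rather than a regex.
  is_hit <- vapply(
    tolower(known_names),
    function(n) grepl(n, user_lower, fixed = TRUE),
    logical(1),
    USE.NAMES = FALSE
  )
  hits <- known_names[is_hit]
  if (length(hits) == 0) return(NA_character_)
  # Longest-match heuristic: prefer "Jacky" over "Jack".
  hits[which.max(nchar(hits))]
}

match_first_name("jacky_smith92", first_names)  # "Jacky"
match_first_name("Zoe Davidson", first_names)   # "David" (a false match on the surname)
```

Scanning every name with a loop like this is fine for a sketch. With the full names list and many usernames, you would probably want something faster.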
In situations where you have multiple matches on a first name, a useful heuristic is to select the longest name that matches. This takes care of situations where one name is a derivative of another, e.g. Jack and Jacky. Clearly this process is not foolproof: you might get a username like “Zoe Davidson” and accidentally match on David as a first name. To avoid these cases you may wish to take into account where in the username the match occurs, but for my purposes I ignore them and treat them as random noise.
Given a first name, estimate the probability that they were born in each decade between 1920 and the present day
To estimate the probability that someone with a given first name was born in a given decade (or year), you’ll need to know how many people with that name were born in each year and are still alive. This means you’ll want to know:
- [A] what proportion of people with name x were born in a particular year?
- [B] of those people born with name x, how many are still alive today?
To answer question [A] one can simply look up the value in the Social Security Administration popular names list or a similar alternative for the country you’re interested in. When performing a lookup in these tables you’ll get a name, a year and a number of births, e.g. Emma, 1985, 940.
To answer question [B] one can look up the probability of a person born in 1985 still being alive today using a life table. You can get life tables for many countries from http://www.mortality.org. For example, a female born in the USA in 1985 has a probability of 0.9789 of still being alive today.
One can then put the answers to questions [A] and [B] together, giving the number of Emmas born in 1985 who are still alive today, which is ~920 (940 * 0.9789). If we calculate this for each year from 1920 to the present day, we can obtain the probability that an Emma was born in any given year by dividing each year’s value by the total number of Emmas still alive.
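The calculation can be sketched in R as follows. Only the 1985 row (940 births, survival probability 0.9789) comes from the example above. The other two rows are made-up placeholders so that the normalization has something to work on, and in practice the table would have one row per year from 1920 onwards.

```r
# Illustrative inputs: only the 1985 row is taken from the text;
# the 1975 and 1995 rows are hypothetical placeholder values.
emma <- data.frame(
  year     = c(1975, 1985, 1995),
  births   = c(500, 940, 1200),      # [A] births with this name per year
  survival = c(0.97, 0.9789, 0.985)  # [B] probability of still being alive
)

# Expected number of Emmas from each birth year who are still alive.
emma$alive <- emma$births * emma$survival

# Normalize so the values sum to 1: P(birth year | name is Emma, still alive).
emma$prob <- emma$alive / sum(emma$alive)

emma
```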
What can you do if you only have rank data?
The above approach won’t work when you are provided with rank data (a league table of popular names). This is the case for the United Kingdom which only provides the top 100 baby names by decade for historical purposes. In these situations you’re going to have to map ranks to number of children born per year with a specific name. My approach for performing this mapping is as follows:
- [A] Get number of people born in year of interest [Table of Birth statistics]
- [B] Get number of people still alive that are born in year of interest [Life table]
- [C] Estimate what proportion of people are assigned names of rank 1, 2, 3, 4, 5, … and so on.
For [A], a table of birth statistics can be downloaded from mortality.org. For [B], you can use the life table from mortality.org that we used previously. For [C], one can either assume a Zipfian distribution or fit a distribution to some empirical data that includes frequency values, e.g. 2013 UK Boys Names. I opted for the second approach, as research suggests that names do not follow any simple intuitive distribution but may be approximated using a combination of beta and exponential distributions.
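To make the rank-to-count mapping concrete, here is a rough sketch of the simpler Zipfian option rather than the empirical fit. All of the numbers are assumptions for illustration: the total births, the survival probability, the share of babies given a top-100 name (`top100_share`) and the Zipf exponent would need to come from birth statistics, a life table and real frequency data respectively.

```r
# Assumed inputs (all hypothetical values for illustration).
total_births  <- 700000  # [A] births in the year of interest
survival      <- 0.95    # [B] probability of still being alive (life table)
top100_share  <- 0.5     # assumed fraction of babies given a top-100 name
zipf_exponent <- 1       # assumed; fit to real frequency data if available

ranks <- 1:100

# [C] Zipfian weights: a name's share falls off as 1 / rank^s.
weights    <- 1 / ranks^zipf_exponent
rank_share <- top100_share * weights / sum(weights)

# Estimated births, and survivors today, for the name at each rank.
est_births <- total_births * rank_share
est_alive  <- est_births * survival

head(data.frame(rank = ranks,
                est_births = round(est_births),
                est_alive  = round(est_alive)))
```

Once each ranked name has an estimated count of survivors per year or decade, the normalization step is the same as in the frequency-based case.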
Using the above described approach I obtained the following age distribution estimates for UK ranked data: