Name temporal similarity

Similarity of names in Argentina based on the evolution of their use over time.

How do we measure the similarity of names in terms of the evolution of their use over time?

This is the question that kickstarted this small toy project. It gave me a nice excuse to take my first steps with the pandas library. We took data from Argentina’s public data portal, so the results only apply there, but the ideas could be adapted to any population for which data is available.

Data loading and preprocessing

The names-by-year dataset for Argentina was taken from this website. Here we assume the file historico-nombres.csv is placed in the same folder as the Python script.

import os
import numpy as np
import pandas
import matplotlib.pyplot as plt    # Needed for the plots below
pandas.set_option('display.max_rows', 10)

folder = os.getcwd()
file_name = os.path.join(folder, 'historico-nombres.csv')

df = pandas.read_csv(file_name)
df.columns = ['name', 'amount', 'year']
df.head()
name amount year
0 Maria 314 1922
1 Rosa 203 1922
2 Jose 163 1922
3 Maria Luisa 127 1922
4 Carmen 117 1922

We see the dataset is a table with three columns. Each row indicates the number of people named in a certain way in a given year, from 1922 to 2015.

The dataset has some inconsistencies that need fixing. For example, we want to treat “Raúl”, “Raul” and “ Raul” as the same name.

# Strip stray punctuation/whitespace and remove tildes (diacritics)
df.name = df.name.str.strip(' .,|')
df.name = df.name.str.normalize('NFKD')
df.name = df.name.str.encode('ascii', errors='ignore').str.decode('utf-8')
# Drop any parenthesized annotations and lowercase everything
df.name = df.name.str.replace(r"\(.*", "", regex=True)
df.name = df.name.str.lower()

# After processing, we will have repeated entries (the same name appearing in a single year) that we need to sum
df = df.groupby(['name', 'year']).sum().reset_index()

A fair year-to-year comparison should use naming probabilities instead of naming amounts, so we divide each value by the corresponding year’s total (and express it as a percentage).

total = df.groupby('year').amount.transform('sum')
df['probability'] = 100*df.amount/total    # Probability expressed as a percentage
df = df[['name', 'year', 'probability']]    # Drop the 'amount' column and keep 'probability'
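
As a quick sanity check (an addition of mine, not part of the original notebook), the probabilities within each year should now add up to 100%:

# Each year's probabilities should sum to (approximately) 100%
year_totals = df.groupby('year').probability.sum()
assert np.allclose(year_totals.values, 100)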

Dropping data we don’t need

How many different names are there in the list? How are they distributed?

n = df.name.nunique()
print('Number of different names: ', n)

# Histogram
fig, ax = plt.subplots()
max_probabilities = df.groupby('name')['probability'].max()
bins = np.arange(min(max_probabilities), max(max_probabilities) + 0.05, 0.05)    # We bin data in 0.05% increments
max_probabilities.hist(log=True, bins=bins)
ax.set_xlabel('Maximum naming probability', fontsize=20)

# For the first few bins: how many names have a maximum probability above the bin's upper edge
print(n - max_probabilities.value_counts(bins=bins).cumsum().iloc[:5])
Number of different names:  3061802
(-0.0009260000000000001, 0.0501]    943
(0.0501, 0.1]                       354
(0.1, 0.15]                         216
(0.15, 0.2]                         140
(0.2, 0.25]                         100
Name: probability, dtype: int64

[Figure: histogram of each name's maximum naming probability (log scale)]

We have over 3 million different names, but fewer than 1,000 of them ever reach a probability of 0.05%. The histogram shows there are more than three orders of magnitude between the first and second bins. Since we are only interested in fairly common names, we keep just those whose maximum probability is above 0.05%.

df = df.groupby('name').filter(lambda x: np.any(x.probability > 0.05))

print('Number of different names: ', df.name.nunique())

The features of each name will be each year’s naming probability. Therefore, we have to move the years to the column labels, so that every name is represented as a row. Since the dataset has no rows for years in which a name was not used, the missing entries in this new representation have to be filled with zeroes.

df = df.set_index(['name', 'year']).probability.unstack()
df.fillna(0, inplace=True)
df.head()

Next, we smooth the data with a 5-year window because we are not interested in fast changes when comparing name trends. Once smoothed, we keep one data point every 5 years.

def smooth(y, n):
    # Moving-average (boxcar) smoothing with a window of at most n points
    n = np.min([len(y), n])
    box = np.ones(n)/n
    ySmooth = np.convolve(y, box, mode='same')
    return ySmooth

df = df.apply(lambda x: smooth(x, 5), axis=1)

# Once smoothed, we exclude data at the time edges, where the smoothing window is truncated.
pad = 5//2
# Columns are consecutive years, so this year difference also counts the columns between the padded edges
yrs = (df.columns[-1 - pad] - df.columns[pad])
# Keep one column every 5 years, skipping the padded edges
df = df.iloc[:, list(np.arange(pad, yrs, 5))]
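
To make the effect of the box filter concrete, here is a tiny made-up example (illustration only, not part of the analysis):

# Illustration only: 5-point moving average of a made-up series
toy = np.array([0., 0., 1., 1., 1., 0., 0.])
print(smooth(toy, 5))    # the rectangular pulse is spread into a gradual ramp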

This is how the data looks now:

df.head()
year 1924 1929 1934 1939 1944 1949 1954 1959 1964 1969 1974 1979 1984 1989 1994 1999 2004 2009
name
abril 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000061 0.000000 0.000102 0.000130 0.001151 0.011472 0.085924 0.157746 0.093620
adela 0.138682 0.098194 0.083173 0.051787 0.037440 0.028439 0.020482 0.017497 0.016626 0.013102 0.008506 0.007353 0.006418 0.004856 0.002633 0.001098 0.000781 0.001559
adelaida 0.039425 0.036597 0.028950 0.023223 0.018194 0.016922 0.015419 0.011146 0.008885 0.006189 0.004281 0.003339 0.003009 0.002819 0.001571 0.000865 0.000516 0.000187
adelina 0.098266 0.073453 0.049704 0.039264 0.027734 0.022658 0.014906 0.013861 0.008023 0.005570 0.003515 0.003794 0.002722 0.002967 0.001522 0.000835 0.000773 0.000867
adolfo 0.079500 0.076196 0.070493 0.063345 0.053743 0.039679 0.031995 0.029128 0.023658 0.015156 0.010175 0.010474 0.009725 0.006738 0.004276 0.002130 0.002094 0.001971

Adding gender information

We need to generate a new column with gender information for each name. For this we will look up the gender of the first word of each name in this database.

# Keep the first word of each name (tildes were already stripped above)
df['gender'] = df.index.str.split().str.get(0)

# Generate gender column
# Thanks to https://stackoverflow.com/questions/48993409/assign-values-groupwise-using-the-group-name-as-input
gn = pandas.read_csv(os.path.join(folder, 'us-names-by-gender-state-year.csv'))
gn.name = gn.name.str.lower()
# Keep a single sex per name (the first occurrence)
gn = gn.groupby('name').first().reset_index()[['name', 'sex']]
# Map each first word to its sex; words not found in the database are left unchanged
df.gender = df.gender.replace(dict(zip(gn.name, gn.sex)))
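
Not every first word will appear in the US database, so some rows keep that word instead of a gender label. A quick check (an addition of mine, not in the original notebook) counts them:

# Rows whose first word was not found keep that word instead of 'F'/'M'
unmatched = ~df.gender.isin(['F', 'M'])
print('Names without gender information:', unmatched.sum())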

Measuring similarity and visualizing results

We are going to measure the similarity between any two names as the sum of squared differences between their features, that is, between their probability arrays over time. Smaller values mean more similar naming trends.

def timeSimilarity(df, name):

    # Get the gender of the query name
    gen = df.loc[name, 'gender']

    # Only keep names with the same gender and exclude the gender column
    df = df[df.gender == gen].iloc[:, :-1]

    use_trend = df.loc[name].values
    n = len(df.index)
    result = pandas.DataFrame(
        {'name': df.index, 'similarity': [None] * n})

    # Sum of squared differences between each name's trend and the query's trend
    for i in np.arange(n):
        diff = df.iloc[i, :].values - use_trend
        result.loc[i, 'similarity'] = np.sum(diff**2)
    return result
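
The explicit loop is easy to read, but with a few hundred names the same computation can be done in one shot with pandas broadcasting. A possible vectorized version (a sketch under the same assumptions, not the original implementation) could look like this:

def timeSimilarityVectorized(df, name):
    # Same measure as above: sum of squared differences against the query name's trend
    gen = df.loc[name, 'gender']
    trends = df[df.gender == gen].iloc[:, :-1]
    diff = trends - trends.loc[name]          # broadcast the query trend over every row
    similarity = (diff**2).sum(axis=1)        # one value per name
    return pandas.DataFrame({'name': trends.index, 'similarity': similarity.values})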

Let’s take a look at the results for the name “virginia” as an example. We can list the most and the least similar names.

results = timeSimilarity(df, 'virginia').sort_values(by='similarity')
print(results)
                 name  similarity
505          virginia           0
292       maria julia  0.00116193
74            cecilia  0.00150801
202             laura  0.00161193
325            marina  0.00205274
..                ...         ...
452              rosa     1.09346
264  maria del carmen     1.64696
238             maria     2.14243
260    maria cristina     2.91974
31          ana maria     4.44416

[515 rows x 2 columns]

Let’s plot the naming trend over time for “virginia” and a few others. We see how closely “maria julia” fits “virginia” compared to “gabriela soledad” and “esther”.

print(results.iloc[[0, 1, int(0.4*len(results)), int(0.8*len(results))]])

import matplotlib.pyplot as plt
plt.style.use('seaborn-dark')
fig = plt.figure(dpi=70, facecolor='w', edgecolor='k')
ax1 = df.loc['virginia'][:-1].plot()
df.loc['maria julia'][:-1].plot()
df.loc['gabriela soledad'][:-1].plot()
df.loc['esther'][:-1].plot()
ax1.legend()
ax1.set_ylabel('Probability', fontsize=20)
ax1.set_xlabel('Year', fontsize=20)
ax1.grid()
                 name  similarity
505          virginia           0
292       maria julia  0.00116193
155  gabriela soledad   0.0157149
133            esther   0.0341416

[Figure: naming probability over time for virginia, maria julia, gabriela soledad and esther]

We would like to visualize the whole dataset, so we apply principal component analysis (PCA) to reduce the number of features to 2 and plot the names in a 2D plane, following this guide.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Keep only female names and separate out the features (we drop the gender column)
x = df[df.gender == 'F'].iloc[:, :-1].values
# Standardizing the features
x = StandardScaler().fit_transform(x)

pca = PCA(n_components=2)
pca_x = pca.fit_transform(x)
pca_df = pandas.DataFrame(data=pca_x, columns = ['principal component 1', 'principal component 2'])
pca_df['name'] = df[df.gender == 'F'].index

pca_df.head()
principal component 1 principal component 2 name
0 -1.820200 1.104530 abril
1 0.211481 -1.554167 adela
2 -0.920068 -1.154190 adelaida
3 -0.421686 -1.447306 adelina
4 -0.134817 -0.263345 adriana
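
Before plotting, it is worth checking how much of the variance the two components actually capture (this check is an addition of mine, not in the original notebook):

# Fraction of the total variance retained by each principal component
print('Explained variance ratio:', pca.explained_variance_ratio_)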

sample = pca_df.sample(frac=0.5)    # We keep half of the data (sampled without replacement) so that the graph is less messy

fig = plt.figure(figsize = (15, 15))
ax = fig.add_subplot(1, 1, 1)
ax.scatter(sample['principal component 1'], sample['principal component 2'], s=50)
for index, row in sample.iterrows():
    ax.annotate(row['name'], (row['principal component 1'], row['principal component 2']))

ax.set_xlabel('Principal Component 1', fontsize=20)
ax.set_ylabel('Principal Component 2', fontsize=20)
ax.set_xlim(-2, 1)
ax.set_ylim(-2, 1)
ax.grid()

[Figure: names scattered over the first two principal components]

Cool!

Concluding remarks

This was a fun idea! It could be extended with more features to further expand the idea of similarity between names. Some of these features could be the following (a rough sketch of how they might be computed appears after the list):

  • vowel/consonant ratio
  • length
  • whether it is a single word name or not
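
As an illustration only (my own sketch, not part of the analysis above), such string-based features could be derived directly from the names:

# Illustration only: possible extra string-based features for each name
vowels = set('aeiou')

def extra_features(name):
    letters = [c for c in name if c.isalpha()]
    n_vowels = sum(c in vowels for c in letters)
    n_consonants = len(letters) - n_vowels
    return {
        'vowel_consonant_ratio': n_vowels / max(n_consonants, 1),
        'length': len(name),
        'single_word': len(name.split()) == 1,
    }

print(extra_features('maria julia'))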

You can find the Jupyter notebook for this post here.

For the moment, the analysis presented here fulfills my pandas practice needs. Would you have done it differently? Is it possible to optimize some part of it? Please let me know in the comments!