Text similarity with sentence embeddings

Measuring the distance between a concept and a text.
Published: February 23, 2024

A recent talk made me think about measuring the distance between a concept and a text. I know that, for example, Dustin Stoltz and Marshall A. Taylor developed “Concept Mover’s Distance” and that folks like Laura K. Nelson have done nifty things, but I haven’t really thought about the idea much in years.

Most of the sociological work has used word embeddings, which are constructed using algorithms like Word2Vec, GloVe, or FastText. These algorithms analyze a set of texts and translate words into vectors based on the contexts in which words appear and their co-occurrence with other words. The resulting vectors capture semantic and syntactic similarities among words. Sentences, paragraphs, or longer texts are represented as the average of their word embeddings, and the method works quite well for measuring how similar two texts are based on the distance between those averaged vectors.

In contrast, sentence embeddings generated by models like BERT or other transformer-based architectures do not merely combine word vectors. Instead, these models are trained to understand the context and relationships between words in a sentence, producing embeddings that capture the nuanced meaning of the entire sentence. Fine-tuning these models on specific tasks, such as text similarity, further enhances their ability to represent sentences in a way that aligns with the task’s requirements. The resulting sentence transformer models seem to be what most people are using today, so I thought I would play around with them. That said, most of the interesting sociological work has involved folks computing their own word vectors, while I’m just going to use a pretrained model.
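To make that word-averaging baseline concrete, here is a minimal sketch (the word_vectors lookup is a hypothetical dictionary of trained Word2Vec, GloVe, or FastText vectors, not something built in this post):

import numpy as np

def doc_vector(tokens, word_vectors):
    # Average the vectors of the tokens that have one; the mean
    # vector then stands in for the whole document.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

def doc_similarity(tokens1, tokens2, word_vectors):
    # Document similarity is the cosine similarity between the averages.
    a = doc_vector(tokens1, word_vectors)
    b = doc_vector(tokens2, word_vectors)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))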

pip install -U -q sentence-transformers
from sentence_transformers import SentenceTransformer, util
import torch
import pandas as pd
import numpy as np
import warnings

# Show wider columns
pd.set_option('display.max_colwidth', 200)

# Filter out all warnings
warnings.filterwarnings("ignore")

There are lots of different pretrained SentenceTransformer models to try. Since I don’t have a GPU, I’m using the smallest decent one.

model = SentenceTransformer('all-MiniLM-L6-v2')

Encoding a sentence returns an array, with each position measuring some latent aspect of the sentence. The all-MiniLM-L6-v2 model produces 384-dimensional embeddings, so each sentence is represented as a point in a 384-dimensional space that encodes its semantic and syntactic properties.

model.encode('We study apples.')
array([ 7.25412294e-02,  2.94132698e-02, -1.04851276e-02,  7.10855499e-02,
       -1.60725769e-02, -1.44585697e-02,  2.01909002e-02, -1.64043065e-02,
        5.85170425e-02,  4.42925990e-02,  3.38043720e-02, -3.70339900e-02,
       -1.01375952e-02,  1.51030989e-02, -2.33681127e-02, -7.01107532e-02,
       -6.28457516e-02, -1.98269282e-02, -2.56151836e-02, -3.29240002e-02,
       -3.28812934e-02,  1.20462896e-02,  2.59449407e-02, -9.30627715e-03,
        3.87127437e-02,  5.07903211e-02,  1.87327876e-03, -3.24000195e-02,
       -3.31559330e-02, -2.26897225e-02, -3.54236215e-02,  5.46763092e-02,
        9.66718495e-02,  3.60654816e-02, -5.42746857e-02, -6.75004860e-03,
        1.40686721e-01, -5.24841994e-02,  2.66690422e-02, -1.14018116e-02,
       -5.27682193e-02,  7.28323534e-02,  2.73659322e-02,  1.07277647e-01,
        1.61659811e-02,  3.81597877e-02, -1.69903729e-02, -3.23689356e-02,
        2.42316276e-02,  6.92270622e-02, -7.24644810e-02,  5.75410959e-04,
       -2.30131280e-02, -6.04602210e-02,  2.60792784e-02,  6.71790168e-02,
        9.09342691e-02, -3.36206444e-02,  7.23590478e-02, -1.65215111e-03,
        3.39566097e-02, -8.45273733e-02, -4.85875309e-02,  3.91481146e-02,
        9.04154778e-03, -8.98436904e-02, -5.66358864e-02,  4.21869494e-02,
       -2.75326185e-02,  8.08383152e-03,  3.66940014e-02,  2.62768455e-02,
        4.43452112e-02,  8.13365355e-02,  4.25574370e-02,  4.02889028e-02,
        1.33797051e-02, -5.26551530e-02, -1.71313528e-03, -3.53252031e-02,
       -1.33084031e-02, -3.56830396e-02, -8.78533255e-03, -2.92915851e-04,
       -5.95996482e-03,  1.08087016e-02, -3.73279974e-02,  1.11694261e-02,
       -8.32256749e-02,  3.41632962e-02,  1.11821033e-02, -4.44804654e-02,
       -5.93070313e-02,  2.85391472e-02, -5.28932542e-05,  1.43566700e-02,
        7.79099949e-03, -4.66752164e-02,  3.37223820e-02,  1.38128072e-01,
       -2.48398129e-02,  6.67299479e-02,  4.20545265e-02,  6.46362547e-03,
       -2.01197024e-02, -4.62081917e-02, -7.40140006e-02, -6.16762936e-02,
        6.25649691e-02,  5.63399047e-02,  5.71039356e-02, -1.97048280e-02,
       -1.04044281e-01,  1.12997763e-01,  7.35701993e-02, -5.85305728e-02,
        5.75302951e-02,  2.13396046e-02,  2.31305137e-02, -1.74588244e-02,
       -1.42283470e-03,  9.98423249e-02, -5.42779267e-02, -4.08335775e-02,
        3.16505209e-02, -6.30080476e-02,  5.79967070e-03, -6.57705620e-33,
        9.47775226e-03, -2.89162844e-02,  5.21772802e-02,  4.72146124e-02,
        1.75573980e-03, -2.80149970e-02,  6.52783411e-03,  6.76577762e-02,
        1.11630604e-01, -5.42772980e-03, -1.63363609e-02,  1.93208605e-02,
       -6.67217327e-03,  3.95874586e-03,  6.84757829e-02, -4.26633768e-02,
       -5.80286346e-02,  4.60271128e-02, -6.45208359e-02, -1.97640937e-02,
       -4.20666821e-02, -1.40169218e-01,  1.41780432e-02, -4.24528383e-02,
       -7.28586018e-02, -4.38546352e-02,  2.99364813e-02, -1.22642733e-01,
        1.05384283e-01, -9.39285476e-03,  4.56614830e-02,  4.46087569e-02,
       -7.81959221e-02, -2.15675607e-02, -1.61162820e-02,  1.18166637e-02,
        1.05830543e-01,  4.55662869e-02,  2.54090354e-02,  7.13379355e-04,
       -4.73194495e-02,  6.20454252e-02,  1.28243357e-01, -1.05106169e-02,
        6.04844540e-02,  4.86876108e-02,  6.08739592e-02,  7.04808533e-02,
       -3.78209017e-02,  1.76480156e-03, -7.23417252e-02, -4.73359115e-02,
        3.92539166e-02, -5.92344292e-02, -2.03246195e-02,  1.21152870e-01,
        2.25762613e-02, -2.12923903e-02, -7.87275955e-02,  1.72557738e-02,
       -1.17437020e-02,  4.75608334e-02,  2.89987470e-03,  2.87241656e-02,
       -7.83866942e-02,  1.22669660e-01, -7.32923672e-02, -2.76942160e-02,
        3.90719483e-03,  5.07837161e-02, -1.27230436e-01, -1.96105265e-03,
        1.29499817e-02, -2.39144196e-03, -1.02660574e-01, -3.95029709e-02,
        2.90869810e-02,  1.89506123e-03, -1.57893542e-02, -1.35825286e-02,
       -1.47497458e-02, -7.59655535e-02, -2.41609924e-02, -4.64349203e-02,
       -5.03189899e-02,  1.03728287e-01,  5.30537264e-03, -4.08360064e-02,
        4.37239483e-02,  1.50352474e-02,  1.78553362e-03,  2.78608631e-02,
       -1.33840907e-02, -1.94217525e-02, -6.97820038e-02,  3.64022021e-33,
       -3.52125727e-02, -3.46545912e-02, -1.97930746e-02,  3.53415050e-02,
        4.72753681e-02, -3.01812422e-02, -2.71848068e-02, -4.41191671e-03,
       -5.37364073e-02, -3.98576930e-02, -5.23934960e-02,  2.20293589e-02,
       -1.31326532e-02,  3.97575926e-03,  5.15052769e-03, -5.09656556e-02,
        2.74809133e-02,  6.49610087e-02, -3.33442353e-02,  3.82365957e-02,
       -8.04736614e-02,  4.02408168e-02,  1.55164497e-02, -3.17362957e-02,
       -8.52044765e-03, -1.36113567e-02, -1.03016114e-02, -1.63739058e-03,
       -3.56106050e-02,  4.95722368e-02,  3.19658928e-02, -1.06780723e-01,
       -6.47186339e-02, -5.92985041e-02,  7.48166954e-03,  3.27526592e-02,
       -2.17823554e-02, -3.67993712e-02, -1.45561453e-02,  5.68174422e-02,
        5.13046794e-02,  2.38617416e-02, -5.04206121e-02,  7.14061782e-02,
        5.55717610e-02,  9.42208767e-02, -2.91015077e-02,  1.10552333e-01,
       -1.50970956e-02,  1.36306453e-02,  2.84949895e-02,  1.54124191e-02,
       -9.30073187e-02, -7.49932304e-02,  3.89047265e-02,  6.32112427e-03,
        1.68052688e-02, -5.48552349e-02, -4.07663062e-02,  3.40671390e-02,
       -2.44489089e-02, -6.63542328e-03,  1.33677274e-02,  4.33265902e-02,
       -6.07616976e-02, -2.95820832e-03, -8.81790649e-03,  3.45228761e-02,
       -4.97724488e-02,  4.87365052e-02,  5.39725199e-02,  2.30813660e-02,
       -6.28042817e-02, -1.07893415e-01, -4.70633470e-02,  2.30335053e-02,
       -1.66683551e-02, -6.54008314e-02, -5.05928211e-02,  3.11856903e-02,
        8.44117161e-03,  6.16906621e-02, -1.53189339e-02,  6.40073717e-02,
        2.83199176e-02,  2.28565577e-02,  3.77516486e-02,  2.01664753e-02,
       -4.26778495e-02, -1.64321400e-02, -7.81823099e-02, -3.34092714e-02,
        3.54452953e-02, -1.01244465e-01, -7.63411000e-02, -1.36826950e-08,
       -6.46132380e-02, -4.54020202e-02,  7.10007921e-02, -1.07585511e-03,
        9.78543423e-03,  4.50233184e-02, -7.06218481e-02,  7.65154064e-02,
        1.31740945e-03, -2.06936337e-02, -5.31158149e-02,  4.04155143e-02,
       -7.19135329e-02,  7.89430514e-02,  5.02120666e-02,  7.71357790e-02,
        9.33832824e-02,  7.45651722e-02, -4.75667641e-02, -5.17647667e-03,
       -4.44874056e-02,  3.04975566e-02, -6.90343650e-03,  8.24641958e-02,
       -1.22354254e-02,  6.76435754e-02,  6.36463240e-03,  2.63797827e-02,
        4.96475771e-02,  8.34565535e-02,  3.48740071e-02,  2.49225888e-02,
       -9.88677517e-02, -5.98776303e-02,  3.49258445e-03, -6.12113327e-02,
       -2.36107334e-02,  2.68156715e-02,  2.23525390e-02,  9.43851192e-03,
       -1.19599409e-01, -6.57922998e-02, -2.84594391e-02,  3.78113380e-03,
       -3.77423279e-02,  6.28536614e-03, -4.52296101e-02,  7.04570040e-02,
       -7.60547956e-03,  7.97596648e-02, -2.58326419e-02, -3.47822644e-02,
        8.89370292e-02,  8.53143539e-03,  4.75420151e-03,  1.21760555e-02,
        3.13673206e-02, -1.13887928e-01, -6.46701753e-02,  7.35506192e-02,
        9.01015177e-02, -3.20984498e-02,  4.94505614e-02, -1.10336766e-03],
      dtype=float32)
len(model.encode('We study apples.'))
384
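Throughout, similarity between embeddings is measured with cosine similarity: the dot product of two vectors divided by the product of their norms. A quick sanity check by hand, assuming the model loaded above (the second sentence is just for illustration):

import numpy as np

a = model.encode('We study apples.')
b = model.encode('We study oranges.')

# Cosine similarity computed manually...
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# ...should match the library helper used throughout this post.
library = util.pytorch_cos_sim(a, b)[0][0].item()
print(f'{manual:.4f} vs {library:.4f}')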

A little function to compare the embedding of a single concept with the embedding of each word in a sentence, and then with the embedding of the entire sentence. I built this to confirm that sentence embeddings actually do what I think they are doing.

def concept_string_sim(concept, text):
    """Compare a concept's embedding with each word and with the full text."""
    concept_embedding = model.encode(concept)
    for word in text.split():
        word_embedding = model.encode(word)
        similarity = util.pytorch_cos_sim(concept_embedding, word_embedding)[0][0]
        print(f'Similarity between {word} and {concept} is {similarity:.2f}.')

    text_embedding = model.encode(text)
    similarity = util.pytorch_cos_sim(concept_embedding, text_embedding)[0][0]
    print(f'Similarity between {text} and {concept} is {similarity:.2f}.')
concept_string_sim('rock, music', 'Grateful Dead show')
Similarity between Grateful and rock, music is 0.18.
Similarity between Dead and rock, music is 0.37.
Similarity between show and rock, music is 0.28.
Similarity between Grateful Dead show and rock, music is 0.36.
concept_string_sim('thankful attitude', 'Grateful Dead show')
Similarity between Grateful and thankful attitude is 0.50.
Similarity between Dead and thankful attitude is 0.23.
Similarity between show and thankful attitude is 0.20.
Similarity between Grateful Dead show and thankful attitude is 0.20.

In the first example, the concept “rock, music” shows varying degrees of similarity with each word in the sentence “Grateful Dead show,” with the highest similarity observed with the word “Dead” (0.37), suggesting that the model captures the association between the band “Grateful Dead” and rock music. The overall sentence similarity score (0.36) closely aligns with the highest word similarity score, indicating that sentence embeddings can indeed capture the essence of the concept in relation to the full sentence context.

In the second case, the concept “thankful attitude” has the highest similarity with the word “Grateful” (0.50), which is intuitive given the semantic closeness of “thankful” and “grateful.” However, the similarity between the entire sentence “Grateful Dead show” and the concept is much lower (0.20). While the individual word “Grateful” strongly resonates with the concept, the context provided by the rest of the sentence shifts the meaning away from the concept’s core. This demonstrates how sentence embeddings can differentiate between the significance of individual words and the collective meaning of a sentence.

Now, measuring some sentence similarities.

def string_string_sim(text1, text2):
    text1_embedding = model.encode(text1)
    text2_embedding = model.encode(text2)

    similarity = util.pytorch_cos_sim(text1_embedding, text2_embedding)[0][0]
    print(f'Similarity is {similarity:.2f}.')
string_string_sim('We study revolutions.', 'This paper examines social movements.')
Similarity is 0.47.
string_string_sim('We study revolutions.', 'This paper examines health behaviors.')
Similarity is 0.18.
string_string_sim('We study revolutions.', 'This paper examines social movements in Algeria during the 1970s.')
Similarity is 0.36.

This also worked. Note, however, that the last example had a lower similarity with “We study revolutions.” This is because the additional information in the sentence, like the country and time period, was unrelated to the shorter first sentence. I suspect this means that, on average, longer texts will have a lower similarity with a short, conceptual phrase than shorter texts do, simply because they are more likely to include different types of information.
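A quick, informal way to probe that hunch (illustrative only; output not shown):

# Similarity should drift downward as unrelated details accumulate.
for text in ['We study revolutions.',
             'We study revolutions in Algeria.',
             'We study revolutions in Algeria during the 1970s using archival data.']:
    string_string_sim('We study revolutions.', text)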

On to measuring a concept. In this case, I’m interested in the degree to which a sociological research article’s abstract is about social movements and protest.

movement_word_list = ['social movement', 'contentious politics', 'mobilization']

movement_words = ', '.join(movement_word_list)
string_string_sim(movement_words, 'We study revolutions.')
Similarity is 0.44.
string_string_sim(movement_words, 'This paper examines health behaviors.')
Similarity is 0.18.
string_string_sim(movement_words, 'We study revolutions in Algeria.')
Similarity is 0.42.

Great. Now an example from something published in Mobilization.

abstract_moby = (
    "..."  # abstract of a movements article published in Mobilization (elided)
)

string_string_sim(abstract_moby, movement_words)
Similarity is 0.36.

And a non-movements article from Social Forces.

abstract_sf = (
    "All around the world, school-entry cohorts are organized on an annual "
    "calendar so that the age of students in the same cohort differs by up to "
    "one year. It is a well-established finding that this age gap entails a "
    "consequential (dis)advantage for academic performance referred to as the "
    "relative age effect (RAE). This study contributes to a recent strand of "
    "research that has turned to investigate the RAE on non-academic outcomes "
    "such as personality traits. An experimental setup is used to estimate the "
    "causal effect of monthly age on cognitive effort in a sample of 798 "
    "fifth-grade students enrolled in the Spanish educational system, "
    "characterized by strict enrolment rules. Participants performed three "
    "different real-effort tasks under three different incentive conditions: no "
    "rewards; material rewards; and material and status rewards. We observe "
    "that older students outwork their youngest peers by two-fifths of a "
    "standard deviation, but only when material rewards for performance are in "
    "place. Despite the previously reported higher taste for competition among "
    "the older students within a school-entry cohort, we do not find that the "
    "RAE on cognitive effort increases after inducing competition for peer "
    "recognition. Finally, the study also provides suggestive evidence of a "
    "larger RAE among boys and students from lower social strata. Implications "
    "for sociological research on educational inequality are discussed. To "
    "conclude, we outline policy recommendations such as implementing "
    "evaluation tools that nudge teachers toward being mindful of relative age "
    "differences."
)

string_string_sim(abstract_sf, movement_words)
Similarity is -0.04.

Great. The results are plausible. Now I’m going to try it out on an entire dataset of nearly 10,000 recent sociology articles.

df = pd.read_json('https://raw.githubusercontent.com/nealcaren/notes/main/posts/abstracts/sociology-abstracts.json')
len(df)
9797

I revised the function so that it takes a precomputed embedding, rather than a string, for one of the inputs. This way, it only calculates the movement-words embedding once, rather than once for each of the roughly 10,000 comparisons. It also returns just the numeric value of the similarity, rather than printing a phrase.


movement_word_embedding = model.encode(movement_words)

def string_embedding_sim(text1, text2_embedding=movement_word_embedding):
    text1_embedding = model.encode(text1)

    similarity = util.pytorch_cos_sim(text1_embedding, text2_embedding)[0][0].item()
    return similarity

# check that it works
string_embedding_sim(abstract_moby)
0.36133891344070435

Applying the function, which uses the smallest model, to my dataframe of roughly 10,000 abstracts takes about four minutes on my MacBook Air with an M2 processor. In contrast, it takes only seconds in an environment with a GPU, such as Google Colab. Ideally, you would also store the article embeddings somewhere rather than discarding them, since the encoding step is the computationally intensive part.
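A minimal sketch of that caching strategy, assuming the df, model, and movement_word_embedding defined above: encode every abstract once in a batch, keep the matrix, and score any concept against it without re-encoding.

abstract_embeddings = model.encode(df['Abstract'].tolist(), convert_to_tensor=True)

# One cosine-similarity call scores every abstract against the concept.
sims = util.pytorch_cos_sim(abstract_embeddings, movement_word_embedding)
df['abstract_movement_similarity'] = sims.squeeze().cpu().numpy()

The apply() version below does the same thing one abstract at a time.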

df['abstract_movement_similarity'] = df['Abstract'].apply(string_embedding_sim)

Plot the results. The distribution looks pretty normal but with a little right skew; the skew presumably comes from the articles that focus on movements.

df['abstract_movement_similarity'].hist(bins=20)

Next, look at a sample of articles with different similarity scores, sorted from highest to lowest. The measure has some face validity, as the movementness of the articles declines across the bins.

# Step 1: Create 20 quantile bins (labeled 0-19)
df['Quintile'] = pd.qcut(df['abstract_movement_similarity'], 20, labels=False)

# Step 2: Filter for the top five bins (15-19), i.e., the top quarter of the distribution
top_bins = df[df['Quintile'] >= 15]

# Step 3: Display 'Source title' and 'Title' for a random sample of 5 rows in each top bin
for quintile in range(19, 14, -1):
    sample = top_bins[top_bins['Quintile'] == quintile].sample(n=5)
    print(f"Quintile {quintile + 1}:")
    display(sample[['Source title', 'Title', 'abstract_movement_similarity']])
Quintile 20:
index | Source title | Title | abstract_movement_similarity
4666 | Sociological Forum | Going Green: Environmental Protest, Policy, and CO2 Emissions in U.S. States, 1990–2007 | 0.439596
3386 | Social Currents | Tactics and targets: Explaining shifts in grassroots environmental resistance | 0.476411
4199 | Mobilization | Movement-countermovement dynamics and mobilizing the electorate | 0.524236
1474 | Sociological Forum | Be Careful What You Wish For: The Ironic Connection Between the Civil Rights Struggle and Today's Divided America | 0.596116
8583 | Sociological Inquiry | The Cultural and the Racial: Stitching Together the Sociology of Race and Ethnicity and the Sociology of Culture | 0.429177
Quintile 19:
index | Source title | Title | abstract_movement_similarity
7252 | Qualitative Sociology | The Social Life of the State: Relational Ethnography and Political Sociology | 0.366393
5255 | Du Bois Review | Rethinking models of minority political participation: Inter- and intra-group variation in political "styles" | 0.394801
5346 | Social Forces | The Persuasive Power of Protest. How Protest wins Public Support | 0.363187
1640 | Social Psychology Quarterly | Samuel Stouffer and Relative Deprivation | 0.362094
8030 | Sociological Forum | Broker Wisdom: How Migrants Navigate a Broker-Centric Migration System in Vietnam | 0.403658
Quintile 18:
index | Source title | Title | abstract_movement_similarity
3620 | Mobilization | Loud and clear: The effect of protest signals on congressional attention | 0.335533
4417 | Gender and Society | “Manning Up” to be a Good Father: Hybrid Fatherhood, Masculinity, and U.S. Responsible Fatherhood Policy | 0.315162
9603 | City and Community | Community Social Capital, Racial Diversity, and Philanthropic Resource Mobilization in the Time of a Pandemic | 0.330078
2222 | Sociological Perspectives | Cultural Capital, Motherhood Capital, and Low-income Immigrant Mothers' Institutional Negotiations | 0.339970
6999 | Social Forces | Emigration and Electoral Outcomes in Mexico: Democratic Diffusion, Clientelism, and Disengagement | 0.326051
Quintile 17:
index | Source title | Title | abstract_movement_similarity
881 | Sociological Forum | Changing Childrearing Beliefs Among Indigenous Rural-to-Urban Migrants in El Alto, Bolivia | 0.291009
3873 | American Journal of Sociology | Interlock globally, act domestically: Corporate political unity in the 21st century | 0.289036
2762 | Social Problems | Moral panic, moral breach: Bernhard goetz, george zimmerman, and racialized news reporting in contested cases of self-defense | 0.302752
501 | Du Bois Review | Race, justice, and desegregation | 0.290871
105 | Symbolic Interaction | Chicago, jazz and marijuana: Howard Becker on Outsiders | 0.308609
Quintile 16:
index | Source title | Title | abstract_movement_similarity
1024 | Social Problems | Chilling Effects: Diminished Political Participation among Partners of Formerly Incarcerated Men | 0.267233
7057 | Gender and Society | The Gender Mobility Paradox: Gender Segregation and Women’s Mobility Across Gender-Type Boundaries, 1970–2018 | 0.263790
7265 | Social Forces | The Limits of Gaining Rights while Remaining Marginalized: The Deferred Action for Childhood Arrivals (DACA) Program and the Psychological Wellbeing of Latina/o Undocumented Youth | 0.267221
873 | Social Currents | Rethinking organizational decoupling: Fields, power struggles, and work routines | 0.267197
6205 | Symbolic Interaction | Digitalization as “an Agent of Social Change” in a Supermarket Chain: Applying Blumer's Theory of Industrialization in Contemporary Society | 0.271341

Another check. Which journals publish the most movement research?

df.groupby('Source title')['abstract_movement_similarity'].mean().sort_values(ascending=True).plot(kind='barh')

And what’s the most movementy article in each journal?

# Group by 'Source title' and find the index of the max 'abstract_movement_similarity' in each group
idx = df.groupby('Source title')['abstract_movement_similarity'].idxmax()

# Keep only the row with the highest 'abstract_movement_similarity' in each group
highest_values_df = df.loc[idx]

# Use 'cols' rather than 'display' so we don't shadow IPython's display()
cols = ['Source title', 'Title', 'abstract_movement_similarity']
highest_values_df[cols].sort_values(by='abstract_movement_similarity', ascending=False)
index | Source title | Title | abstract_movement_similarity
1923 | Mobilization | Social movements in an age of participation | 0.702846
433 | American Journal of Sociology | Issue bricolage: Explaining the configuration of the social movement sector, 1960–1995 | 0.680012
3768 | Social Problems | Economic breakdown and collective action | 0.660584
4400 | Sociology of Race and Ethnicity | The Anti-oppressive Value of Critical Race Theory and Intersectionality in Social Movement Study | 0.652746
5221 | Social Currents | Assessing the Explanatory Power of Social Movement Theories across the Life Course of the Civil Rights Movement | 0.641801
8064 | Sociological Perspectives | Policy Relay: How Affirmative Consent Went from Controversy to Convention | 0.632110
4489 | Theory and Society | Combining transition studies and social movement theory: towards a new research agenda | 0.631707
6325 | Social Forces | Pathways to modes of movement participation: Micromobilization in the nashville civil rights movement | 0.629331
2482 | American Sociological Review | Tactical Innovation in Social Movements: The Effects of Peripheral and Multi-Issue Protest | 0.616886
6201 | City and Community | Confronting Scale: A Strategy of Solidarity in Urban Social Movements, New York City and Beyond | 0.606433
1474 | Sociological Forum | Be Careful What You Wish For: The Ironic Connection Between the Civil Rights Struggle and Today's Divided America | 0.596116
4818 | Qualitative Sociology | Life Histories and Political Commitment in a Poor People’s Movement | 0.581159
5574 | Sociological Theory | Overflowing Channels: How Democracy Didn’t Work as Planned (and Perhaps a Good Thing It Didn’t) | 0.561438
1278 | Sociological Science | Dissecting The Spirit Of Gezi: Influence vs. selection in the occupy Gezi movement | 0.559833
3367 | Social Science Research | How social media matter: Repression and the diffusion of the Occupy Wall Street movement | 0.559743
4354 | Sociological Inquiry | Practicing Gender and Race in Online Activism to Improve Safe Public Space in Sweden | 0.557239
7128 | Symbolic Interaction | “Meet Them Where They Are”: Attentional Processes in Social Movement Listening | 0.546802
9150 | Social Networks | How networks of social movement issues motivate climate resistance | 0.531231
4094 | Work and Occupations | Renewed Activism for the Labor Movement: The Urgency of Young Worker Engagement | 0.527430
3722 | Social Psychology Quarterly | Measuring Resonance and Dissonance in Social Movement Frames With Affect Control Theory | 0.505426
8292 | Du Bois Review | REACTION to the BLACK CITY AS A CAUSE of MODERN CONSERVATISM | 0.505141
4360 | Sociological Methods and Research | Size Matters: Quantifying Protest by Counting Participants | 0.504456
7333 | Gender and Society | Immigrant and Refugee Youth Organizing in Solidarity With the Movement for Black Lives | 0.497979
2597 | Poetics | Political space and the space of polities: Doing politics across nations | 0.487815
833 | Sociology of Education | The Origins of Race-conscious Affirmative Action in Undergraduate Admissions: A Comparative Analysis of Institutional Change in Higher Education | 0.465736
8179 | Journal of Marriage and Family | Central American immigrant mothers' narratives of intersecting oppressions: A resistant knowledge project | 0.403927
2628 | Demography | Large-Scale Urban Riots and Residential Segregation: A Case Study of the 1960s U.S. Riots | 0.347279
7289 | Journal of Health and Social Behavior | From Medicine to Health: The Proliferation and Diversification of Cultural Authority | 0.344940

Can I add to my CV that I have the most movementy article in Social Problems?

Okay, now a different concept. How quantitative is the research?


quantitative_words = ', '.join([
    "surveys",
    "experiments",
    "quasi-experiments",
    "regression analysis",
    "statistical analysis",
    "correlation",
])

quantitative_embedding = model.encode(quantitative_words)
df['abstract_quant_similarity'] = df['Abstract'].apply(string_embedding_sim, 
                                                       text2_embedding=quantitative_embedding)
df.groupby('Source title')['abstract_quant_similarity'].mean().sort_values(ascending=True).plot(kind='barh')

Very plausible, although I think SM&R scores highest both because it publishes lots of quantitative work and because its abstracts rarely discuss anything but methods, so it’s not quite an apples-to-apples comparison.

Finally, how are the two concepts related at the abstract level?

import matplotlib.pyplot as plt

# Calculate the Pearson correlation coefficient
correlation = df['abstract_quant_similarity'].corr(df['abstract_movement_similarity'])

# Determine the common range for both axes based on the min and max of both series
common_range = [
    min(df['abstract_quant_similarity'].min(), df['abstract_movement_similarity'].min()), 
    max(df['abstract_quant_similarity'].max(), df['abstract_movement_similarity'].max())
]

# Create the scatter plot with square dimensions
plt.figure(figsize=(4, 4))  # Makes the figure square in size
plt.scatter(df['abstract_quant_similarity'], df['abstract_movement_similarity'], alpha=0.5)

# Set the same range for both X and Y axes
plt.xlim(common_range)
plt.ylim(common_range)


# Add a title with the correlation coefficient, formatted to two decimal places
plt.title(f'Scatter Plot of Quantitative vs. Movement Similarity in Abstracts\nCorrelation: {correlation:.2f}')

# Add x and y labels
plt.xlabel('Quantitative Similarity')
plt.ylabel('Movement Similarity')

# Ensure the aspect ratio is equal to make the plot truly square
plt.gca().set_aspect('equal', adjustable='box')

plt.show()

I interpret this as showing both that (1) movements research is not heavily quantitative and that (2) abstracts that discuss movement theories and cases spend less time talking about methods, though that’s probably also true of most articles that are about something rather than about methods.

To do: Can you do math with sentence embeddings? For example, can you find a similar paper that, instead of analyzing surveys, uses in-depth interviews?
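A minimal sketch of what that arithmetic might look like, assuming the model, util, and df from above and reusing the batch-encoded abstract_embeddings from the caching sketch; the query sentence and offset terms are made-up placeholders:

import numpy as np

# Shift a paper's embedding away from 'surveys' and toward
# 'in-depth interviews', then rank abstracts by closeness to the result.
query = model.encode('We analyze surveys to study political participation.')
offset = model.encode('in-depth interviews') - model.encode('surveys')
target = query + offset

sims = util.pytorch_cos_sim(target, abstract_embeddings)[0]

# The five abstracts closest to the shifted query.
closest = np.argsort(-sims.cpu().numpy())[:5]
df.iloc[closest][['Source title', 'Title']]

Whether an offset like this isolates method rather than topic is an empirical question; that’s the experiment to run.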