More with sentence embeddings

Adding and subtracting concepts from a text.
sentence embeddings
Published

February 25, 2024

This is a follow-up to an earlier post on sentence embeddings. I wanted to investigate what it looks like when you add or subtract concept embeddings before finding the nearest matching text.

from sentence_transformers import SentenceTransformer, util
import torch
import pandas as pd
import numpy as np
import warnings

# Show wider columns
pd.set_option("display.max_colwidth", 600)


# Filter out all warnings
warnings.filterwarnings("ignore")

Using the same sentence transformer model and corpus of sociological abstracts as before.

model = SentenceTransformer("all-MiniLM-L6-v2")
df = pd.read_json(
    "https://raw.githubusercontent.com/nealcaren/notes/main/posts/abstracts/sociology-abstracts.json"
)
len(df)
9797

I knew that I would want to compute the embeddings for the abstract corpus just once and then save the results in a column of the dataframe, since this step takes a couple of minutes.

def make_tensors(text):
    return model.encode(text, convert_to_tensor=True)


df["abstract_encodings"] = df["Abstract"].apply(make_tensors)
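Because the encoding step takes a couple of minutes, it can also be worth caching the result on disk, not just in the dataframe. The sketch below is not part of the original workflow: it assumes the embeddings have been stacked into one NumPy array, and the random matrix and the file name `abstract_embeddings.npy` are placeholders for illustration.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real embeddings: 4 "abstracts" with 384 dimensions each.
# (384 is the output size of all-MiniLM-L6-v2; the values here are random.)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 384)).astype(np.float32)

# Hypothetical cache location; in practice this would sit next to the notebook.
cache_path = os.path.join(tempfile.gettempdir(), "abstract_embeddings.npy")

# Save once after encoding...
np.save(cache_path, embeddings)

# ...and reload in later sessions instead of re-encoding the corpus.
cached = np.load(cache_path)
```

With the real data, the array could come from something like `torch.stack(list(df["abstract_encodings"])).cpu().numpy()`, assuming every cell in that column holds a tensor.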

I asked ChatGPT to write me a function that would take a text and find the most similar text from a corpus, with the option to add or remove specific concepts, using sentence transformers. After some back and forth, here’s what it gave me.

def transform_and_match_df(
    text, *, remove_concept=None, add_concept=None, df, feature_name, n=5
):
    """
    Transforms the text by optionally removing one concept and adding another, then finds the n nearest matches from a DataFrame.

    Parameters:
    - text (str): The base text for transformation.
    - remove_concept (str, optional): The concept to remove from the base text.
    - add_concept (str, optional): The concept to add to the transformed text.
    - df (pandas.DataFrame): A DataFrame containing pre-computed embeddings.
    - feature_name (str): The name of the DataFrame column containing the embeddings.
    - n (int): The number of nearest neighbors to return.

    Returns:
    - pandas.DataFrame: A DataFrame of the n nearest matches.
    """
    # Initialize the model
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Generate embedding for the text
    text_embedding = model.encode(text, convert_to_tensor=True)

    # Initialize transformed_embedding with text_embedding
    transformed_embedding = text_embedding.clone()

    # If remove_concept is specified, subtract its embedding
    if remove_concept is not None:
        remove_embedding = model.encode(remove_concept, convert_to_tensor=True)
        transformed_embedding -= remove_embedding

    # If add_concept is specified, add its embedding
    if add_concept is not None:
        add_embedding = model.encode(add_concept, convert_to_tensor=True)
        transformed_embedding += add_embedding

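    # Stack the per-row embeddings into a single (num_rows, dim) tensor.
    # as_tensor avoids the copy (and the warning) that torch.tensor() triggers
    # when a cell already holds a tensor.
    df_embeddings = torch.stack([torch.as_tensor(e) for e in df[feature_name].values])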

    # Compute cosine similarities
    similarities = util.cos_sim(transformed_embedding, df_embeddings).cpu()

    # Find the indices of the top n similarity scores
    nearest_match_indices = similarities.squeeze(0).topk(n).indices

    # Return the nearest matching rows from the DataFrame
    return df.iloc[nearest_match_indices.tolist()]


# The bare * in the signature makes every argument except 'text' keyword-only.
# Example usage:
# nearest_matches = transform_and_match_df(text="Your text here", df=your_dataframe, feature_name="your_embedding_column_name", n=3)
# print("Nearest matches:", nearest_matches)
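To make the arithmetic concrete, here is a toy version of what the function does, written with NumPy and hand-made three-dimensional vectors instead of real sentence embeddings. The vectors, the labels, and the helper `cosine_similarity` are invented for illustration; only the steps (subtract one concept, add another, rank by cosine similarity) mirror the function above.

```python
import numpy as np


def cosine_similarity(query, matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query


# Made-up axes: [race, qualitative method, experimental method]
text = np.array([1.0, 1.0, 0.0])         # "interviews about racial inequality"
qualitative = np.array([0.0, 1.0, 0.0])  # concept to remove
experiments = np.array([0.0, 0.0, 1.0])  # concept to add

corpus_labels = ["race + interviews", "race + experiments", "unrelated topic"]
corpus = np.array(
    [
        [1.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 0.2, 0.1],
    ]
)

# Same steps as the function: subtract one concept, add another.
transformed = text - qualitative + experiments

scores = cosine_similarity(transformed, corpus)
best = int(np.argmax(scores))
print(corpus_labels[best])
```

Before the shift, the toy query is closest to the "race + interviews" row; after it, the "race + experiments" row ranks first. Real embeddings are far messier, since concepts are not neatly aligned with single dimensions.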

First test. Swap out a qualitative method for experiments.

r = transform_and_match_df(
    text="We use interviews to analyze racial inequalities.",
    remove_concept="qualitative",
    add_concept="experiments",
    df=df,
    feature_name="abstract_encodings",
    n=5,
)
r[["Abstract"]]
Abstract
8904 Field experiments have proliferated throughout the social sciences and have become a mainstay for identifying racial discrimination during the hiring process. To date, field experiments of labor market discrimination have generally drawn their sample of job postings from limited sources, often from a single major online job posting website. While providing a large pool of job postings across labor markets, this narrow sampling procedure leaves open questions about the generalizability of the findings from field experiments of racial discrimination in the extant literature. In this paper, w...
6162 Field experiments using fictitious applications have become an increasingly important method for assessing hiring discrimination. Most field experiments of hiring, however, only observe whether the applicant receives an invitation to interview, called the “callback.” How adequate is our understanding of discrimination in the hiring process based on an assessment of discrimination in callbacks, when the ultimate subject of interest is discrimination in job offers? To address this question, we examine evidence from all available field experimental studies of racial or ethnic discrimination i...
9168 Previous research has established that people shift their identities situationally and may come to subconsciously mirror one another. We explore this phenomenon among survey interviewers in the 2004-2018 General Social Survey by drawing on repeated measures of racial identification collected after each interview. We find not only that interviewers self-identify differently over time but also that their response changes cannot be fully explained by several measurement-error related expectations, either random or systematic. Rather, interviewers are significantly more likely to identify thei...
9049 How do the demographic contexts of urban labor markets correlate with the extent to which racial and ethnic minorities are disadvantaged at the hiring stage? This paper builds on two branches of labor market stratification literature to link demographic contexts of labor markets to race- and ethnicity-based hiring discrimination that manifest within them. Relying on a unique large-scale field experiment that involved submitting nearly 12,000 fictitious resumes to real job postings across 50 major urban areas, I found that Black population size is associated with greater discrimination agai...
4843 How do white observers react when people with black heritage assert a biracial, multiracial, or white identity rather than a black identity? To address this question, I conducted two experiments in which participants evaluated a darker-skinned or a lighter-skinned job applicant who presented his identity as either black, biracial, multiracial, or white. The results show that identity assertions influenced how white observers categorized applicants but not how they evaluated applicants. Most white observers accepted the identity of both the lighter-skinned and darker-skinned applicant when ...

Took about eight seconds on my MacBook Air (M2). Results seem encouraging.
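Some of those eight seconds likely go to work that is repeated on every call: the function reloads the model and rebuilds the embedding matrix each time. A possible refinement, sketched here with NumPy and random stand-in data instead of the real model, is to normalize the corpus matrix once and reuse it, so that each search is a single matrix product. The class name `EmbeddingIndex` and its methods are hypothetical.

```python
import numpy as np


class EmbeddingIndex:
    """Holds a row-normalized copy of the corpus embeddings for repeated searches."""

    def __init__(self, embeddings):
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.unit_rows = embeddings / norms  # computed once, reused by every search

    def search(self, query, n=5):
        """Return (row positions, cosine scores) of the n most similar rows."""
        unit_query = query / np.linalg.norm(query)
        scores = self.unit_rows @ unit_query
        top = np.argsort(-scores)[:n]  # positions sorted from most to least similar
        return top, scores[top]


# Random stand-in for the corpus: 1,000 rows of 384-dimensional embeddings.
rng = np.random.default_rng(42)
corpus = rng.normal(size=(1000, 384))
index = EmbeddingIndex(corpus)

# Query with a slightly noisy copy of row 10; row 10 should rank first.
query = corpus[10] + 0.01 * rng.normal(size=384)
positions, scores = index.search(query, n=3)
```

The positions could then be passed to `df.iloc[...]`, as the original function does. Returning the scores as well makes it easier to judge how close each match really is.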

How about an actual abstract?

abstract = """Black men and women have different levels of average educational attainment, yet
few studies have focused on explaining how and why these patterns develop. One
explanation may be inequality in experiences with institutional punishment
through exclusionary school discipline and criminal justice exposure. Drawing on
intersectional frameworks and theories of social control, I examine the long-
term association between punishment and the Black gender gap using data from the
Children of the National Longitudinal Survey of Youth 1979 cohort (NLSY-C).
Decomposition analyses reveal that about one third of the gender gap can be
explained by gender differences in experiences with institutional punishments,
net of differences in observed behaviors. These measures are predictive at key
educational transition points, including finishing high school and earning a
4-year college degree. Though Black boys and girls have similar family
backgrounds and grow up in similar neighborhoods, results suggest that Black
girls have a persistent advantage in educational attainment due in part to their
lower levels of exposure to exclusionary school discipline and the criminal
justice system. In addition, I find that gender differences in early
achievement, early externalizing behavioral problems, school experiences, and
substance use in adolescence and early adulthood are associated with gender
differences in educational attainment. Taken together, these results illustrate
the importance of punishment disparities in understanding disparate educational
outcomes over the life course of Black men and women."""

First, try to get the most similar article that isn’t explicitly about racial discrimination.

r = transform_and_match_df(
    text=abstract,
    remove_concept="racial discrimination, African American, Black",
    df=df,
    feature_name="abstract_encodings",
    n=5,
)
r[["Abstract"]]
Abstract
4711 Although prior research links parental incarceration to deleterious outcomes for children during the life course, few studies have examined whether such incarceration affects the social exclusion of children during adolescence. Drawing on several lines of scholarship, the authors examined whether adolescents with incarcerated parents have fewer or lower quality relationships, participate in more antisocial peer networks, and feel less integrated or engaged in school. The study applies propensity score matching to survey and network data from a national sample of youth. Analyses indicated t...
2378 Why do men in the United States today complete less schooling than women? One reason may be gender differences in early self-regulation and prosocial behaviors. Scholars have found that boys’ early behavioral disadvantage predicts their lower average academic achievement during elementary school. In this study, I examine longer-term effects: Do these early behavioral differences predict boys’ lower rates of high school graduation, college enrollment and graduation, and fewer years of schooling completed in adulthood? If so, through what pathways are they linked? I leverage a nationally rep...
9104 Objective: The goal of this study is to examine the association between parental incarceration and parent–youth closeness. Background: Despite the established complex repercussions of incarceration for relationships between adults, and the well-known intergenerational consequences of parental incarceration, little is known about how incarceration structures intergenerational relationships between parents and children. Methods: In this article, I use data from the Future of Families and Child Wellbeing Study (N = 3408), a cohort of children followed over a 15-year period, to examine how par...
4611 In the present study, we examine the relationship between involvement in the criminal justice system and achieved socioeconomic status (SES), as well as the moderating effect of ascribed SES. Using data from the National Longitudinal Study of Adolescent to Adult Health, we find a nonlinear relationship between criminal justice involvement and achieved SES, such that deeper involvement leads to increasingly negative consequences on achieved SES. Furthermore, those coming from the highest socioeconomic backgrounds are not "protected"from the deleterious consequences of system involvement, bu...
5538 Research has documented a negative association between women’s educational attainment and early sexual intercourse, union formation, and pregnancy. However, the implications that school progression relative to age may have for the timing and order of such transitions are poorly understood. In this article, I argue that educational attainment has different implications depending on a student’s progression through school grades relative to her age. Using month of birth and age-at-school-entry policies to estimate the effect of advanced school progression by age, I show that it accelerates th...

Pretty good, although maybe skewing a bit too much towards parental incarceration.

What about articles on the same topic focusing on Latinos and Latinas?

r = transform_and_match_df(
    text=abstract,
    remove_concept="African American, Black",
    add_concept="Latino, Latina, Hispanic",
    df=df,
    feature_name="abstract_encodings",
    n=5,
)
r[["Abstract"]]
Abstract
2513 The socializing power of the prison is routinely discussed as a prisonization process in which inmates learn to conform to life in the correctional facility. However, the impact that identities socialized in the prison may have outside of the institution itself remains an under-researched aspect of mass incarceration's collateral consequences. In this article I use ethnographic data collected over fifteen months in two juvenile justice facilities and interviews with twenty-four probation youth to examine how the identities socialized among Latino prison inmates spill over into high-incarce...
4732 Using a nationally representative sample of approximately 3,500 public schools, this study builds on and extends our knowledge of how ‘‘minority threat’’ manifests within schools. We test whether various disciplinary policies and practices are mobilized in accordance with Latino/a student composition, presumably the result of a group response to perceptions that white racial dominance is jeopardized. We gauge how schools’ Latino/a student populations are associated with the availability and use of several specific types of discipline. We further explore possible moderating influences of sc...
5818 We advance current knowledge of school punishment by examining (1) the prevalence of exclusionary discipline in elementary school, (2) racial disparities in exclusionary discipline in elementary school, and (3) the association between exclusionary discipline and aggressive behavior in elementary school. Using child and parent reports from the Fragile Families Study, we estimate that more than one in ten children born between 1998 and 2000 in large US cities were suspended or expelled by age nine, when most were in third grade. We also find extreme racial disparity; about 40 percent of non-...
9509 Recent work has begun to investigate how criminalization is mediated through interpersonal relationships. While this research emphasizes the importance of gender dynamics and cross-gender intimate relations for boys and men of color, little is known about how gendered and sexualized relationships matter for criminalized women and girls of color. This study seeks to fill this knowledge gap and asks: How do system-involved Chicanas’ relationships with men and boys shape their experiences of criminalization over the life course? How do they navigate criminalization through men and boys? While...
556 Opportunities for upward mobility have been declining in the United States in recent decades. Within this context, I examine the mobility trajectories of a contemporary cohort of 1.5-, second-, and third-plus-generation Latino youth. Drawing on survey data from California that accounts for the precarious legal status of many 1.5 generation immigrants, I find that Latino youths' patterns of postsecondary enrollment and employment do not differ by generation since migration. Additionally, I do not find evidence of racial/ethnic barriers to Latino youths' enrollment in less selective colleges...

Seems solid.

What about dialing up gender and women’s experiences?

r = transform_and_match_df(
    text=abstract,
    add_concept="gender, female, women",
    df=df,
    feature_name="abstract_encodings",
    n=5,
)
r[["Abstract"]]
Abstract
3590 School disciplinary processes are an important mechanism of inequality in education. Most prior research in this area focuses on the significantly higher rates of punishment among African American boys, but in this article, we turn our attention to the discipline of African American girls. Using advanced multilevel models and a longitudinal data set of detailed school discipline records, we analyze interactions between race and gender on office referrals. The results show troubling and significant disparities in the punishment of African American girls. Controlling for background variables...
6741 In this research, I use theories of framing and social construction to investigate how race and gender are featured in national news coverage of the school-to-prison pipeline, and how policies and practices funnel students from school to the criminal justice system. Results indicate that there are three primary narratives surrounding the school-to-prison pipeline. The first is a narrative that harsh disciplinary practices in schools are irrational and negatively impact all students. The second narrative crafts the school-to-prison pipeline as a social problem for all Black students irrespe...
6720 We examine change across U.S. cohorts born between 1920 and 2000 in their probability of having had sex with same-sex partners in the last year and since age 18. Using data from the 1988–2018 General Social Surveys, we explore how trends differ by gender, race, and class background. We find steep increases across birth cohorts in the proportion of women who have had sex with both men and women since age 18, whereas increases for men are less steep. We suggest that the trends reflect an increasingly accepting social climate, and that women’s steeper trend is rooted in a long-term asymmetry ...
7229 Intersectionality scholars have long identified dynamic configurations of race and gender ideologies. Yet, survey research on racial and gender attitudes tends to treat these components as independent. We apply latent class analysis to a set of racial and gender attitude items from the General Social Survey (1977 to 2018) to identify four configurations of individuals’ simultaneous views on race and gender. Two of these configurations hold unified progressive or regressive racial and gender attitudes. The other two formations have discordant racial and gender attitudes, where progressive v...
5067 This study is the first to use nationally representative data to examine whether differences in gender-typical behaviors among adolescents are associated with high school academic performance and whether such associations vary by race or socioeconomic status. Using wave I data from the National Longitudinal Study of Adolescent Health and linked academic transcript data from the Adolescent Health and Academic Achievement study, we find that boys who report moderate levels of gender atypicality earn the highest grade point averages (GPAs), but few boys score in this range. As gender typicali...

Seems good, but we seem to have drifted more towards punishment than education, which is, I think, a function of how the abstract was written. What if we dial up the education?

r = transform_and_match_df(
    text=abstract,
    remove_concept="survey, regression, statistical analysis",
    add_concept="education, in-depth interviews, participant observation",
    df=df,
    feature_name="abstract_encodings",
    n=5,
)
r[["Abstract"]]
Abstract
6741 In this research, I use theories of framing and social construction to investigate how race and gender are featured in national news coverage of the school-to-prison pipeline, and how policies and practices funnel students from school to the criminal justice system. Results indicate that there are three primary narratives surrounding the school-to-prison pipeline. The first is a narrative that harsh disciplinary practices in schools are irrational and negatively impact all students. The second narrative crafts the school-to-prison pipeline as a social problem for all Black students irrespe...
279 Though sociologists have examined how mass incarceration affects stratification, remarkably little is known about how it shapes educational disparities. Analyzing the Fragile Families Study and its rich paternal incarceration data, I ask whether black and white children with fathers who have been incarcerated are less prepared for school both cognitively and non-cognitively as a result, and whether racial and gendered disparities in incarceration help explain the persistence of similar gaps in educational outcomes and trajectories. Using a variety of estimation strategies, I show that expe...
2219 Research consistently demonstrates that black students are disproportionately subject to behavioral sanctions, yet little is known about contextual variation. This paper explores the relationship between school racial composition and racial inequality in discipline. Prior work suggests that demographic composition predicts harsh punishment of minorities. Accordingly, a threat framework suggests that increases in black student enrollment correspond to increases in punitive school policies. Results from this paper find some support for this hypothesis, finding that the percent of black stude...
3590 School disciplinary processes are an important mechanism of inequality in education. Most prior research in this area focuses on the significantly higher rates of punishment among African American boys, but in this article, we turn our attention to the discipline of African American girls. Using advanced multilevel models and a longitudinal data set of detailed school discipline records, we analyze interactions between race and gender on office referrals. The results show troubling and significant disparities in the punishment of African American girls. Controlling for background variables...
5735 Punitive and disciplinary forms of governance disproportionately target low-income Black Americans for surveillance and punishment, and research finds far-reaching consequences of such criminalization. Drawing on in-depth interviews with 46 low-income Black mothers of adolescents in urban neighborhoods, this article advances understanding of the long reach of criminalization by examining the intersection of two related areas of inquiry: the criminalization of Black youth and the institutional scrutiny and punitive treatment of Black mothers. Findings demonstrate that poor Black mothers cal...

Again, seems like a plausible set of results. Finally, let’s try to swap out the method.

r = transform_and_match_df(
    text=abstract,
    remove_concept="survey, regression, statistical analysis",
    add_concept="in-depth interviews, participant observation, enthography, qualitative research",
    df=df,
    feature_name="abstract_encodings",
    n=5,
)
r[["Abstract"]]
Abstract
6741 In this research, I use theories of framing and social construction to investigate how race and gender are featured in national news coverage of the school-to-prison pipeline, and how policies and practices funnel students from school to the criminal justice system. Results indicate that there are three primary narratives surrounding the school-to-prison pipeline. The first is a narrative that harsh disciplinary practices in schools are irrational and negatively impact all students. The second narrative crafts the school-to-prison pipeline as a social problem for all Black students irrespe...
5859 Although Black mothers are disproportionately represented among formerly incarcerated mothers in the United States, existing research has largely neglected to document the challenges they face in resuming their parenting roles after prison or jail. This study addresses this gap using 18 months of participant observations with formerly incarcerated Black women to examine how state surveillance under post-release supervision and child welfare services shapes and constrains formerly incarcerated Black women's mothering practices. The study develops a typology of three context-specific strateg...
5735 Punitive and disciplinary forms of governance disproportionately target low-income Black Americans for surveillance and punishment, and research finds far-reaching consequences of such criminalization. Drawing on in-depth interviews with 46 low-income Black mothers of adolescents in urban neighborhoods, this article advances understanding of the long reach of criminalization by examining the intersection of two related areas of inquiry: the criminalization of Black youth and the institutional scrutiny and punitive treatment of Black mothers. Findings demonstrate that poor Black mothers cal...
9348 Numerous articles and textbooks advise qualitative researchers on accessing “hard-to-reach” or “hidden” populations. In this article, I compare two studies that I conducted with justice-involved women in the United States: a yearlong ethnography inside a state women’s prison and an interview study with formerly incarcerated women. Although these two populations are interconnected—and both are widely deemed hard-to-reach—the barriers to access differed. In the prison study, hard-to-reach reflected an issue of institutional legitimacy, in which researchers must demonstrate themselves and the...
279 Though sociologists have examined how mass incarceration affects stratification, remarkably little is known about how it shapes educational disparities. Analyzing the Fragile Families Study and its rich paternal incarceration data, I ask whether black and white children with fathers who have been incarcerated are less prepared for school both cognitively and non-cognitively as a result, and whether racial and gendered disparities in incarceration help explain the persistence of similar gaps in educational outcomes and trajectories. Using a variety of estimation strategies, I show that expe...

The results drift a bit toward incarceration, and the method swap wasn’t perfectly implemented: it seemed to grab some abstracts that were vague about their method, or that perhaps described quantitative work in qualitative terms.
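One knob that the function above does not expose is how strongly each concept is applied: it adds and subtracts the concept embeddings at full strength. Here is a hedged sketch of a weighted variant. The function name `shift_embedding` and the `add_weight` and `remove_weight` parameters are my own additions, and the two-dimensional vectors are toy values, not model output.

```python
import numpy as np


def shift_embedding(text_vec, add_vec=None, remove_vec=None,
                    add_weight=1.0, remove_weight=1.0):
    """Return text_vec + add_weight * add_vec - remove_weight * remove_vec."""
    shifted = np.array(text_vec, dtype=float)  # copy so the input is not modified
    if remove_vec is not None:
        shifted = shifted - remove_weight * np.asarray(remove_vec, dtype=float)
    if add_vec is not None:
        shifted = shifted + add_weight * np.asarray(add_vec, dtype=float)
    return shifted


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy axes: [punishment, education]
abstract_vec = np.array([1.0, 0.5])
education = np.array([0.0, 1.0])
education_paper = np.array([0.2, 1.0])

# Larger weights pull the query further toward the added concept.
for weight in (0.0, 0.5, 1.0, 2.0):
    shifted = shift_embedding(abstract_vec, add_vec=education, add_weight=weight)
    print(weight, round(cosine(shifted, education_paper), 3))
```

In this toy setup the similarity to the education-heavy paper rises as the weight grows. Whether the same holds for the real corpus, and which weights give useful results, would need to be checked against the actual embeddings.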

Overall, I would rate the results as comparable to the first page of a Google search. The first choice isn’t necessarily what I was looking for, but the top five give me something to work with.