This special talk is about the use of Data Science & Machine Learning. Gerard Torrats-Espinosa, Professor at Columbia University, dives into how machine learning was used to estimate the effect of racial segregation on COVID-19 mortality in the United States. There is a short Q&A session at the end.
About the Speaker:
Gerard Torrats-Espinosa
Professor at Columbia University
Gerard Torrats-Espinosa received his Ph.D. in Sociology from New York University in 2019. He is an Assistant Professor of Sociology and a member of the Data Science Institute. His research draws from the literatures on urban sociology, stratification, and criminology, and it focuses on understanding how the spatial organization of the American stratification system creates and reproduces inequality.
Gerard’s current research agenda investigates (1) how the neighborhood context, particularly the experience of community violence, determines the life chances of children; (2) how social capital and social organization emerge and evolve in spatial contexts; and (3) how place and geography structure educational and economic opportunity in America and elsewhere. His work has been published or is forthcoming in the American Sociological Review, Child Development, Demography, Housing Policy Debate, the Journal of Housing Economics, the Journal of Urban Economics, and the Proceedings of the National Academy of Sciences.
—
This video is a recording of a live event. To follow along live and participate in upcoming workshops and events, RSVP at
#coding #datascience #machinelearning #softwareengineer
[Music] Oh, hello everybody. An absolute pleasure to see you all here for our first of, I hope, many, many talks from expert, distinguished speakers over the coming months in the data science and machine learning field. We are absolutely delighted to have our current data science and machine learning (DSML, we kind of call it for shorthand) residents: I can see Margaret, I can see Paul, Kevin, Ryan, Natalie, Sylvia. Wonderful to see some of our team as well, brilliant data science and machine learning residents, and our faculty for that program: I can see Jonathan, Alex, and Laura. I do feel honored, absolutely honored, to welcome our speaker today, Professor Gerard Espinosa, who is going to share with us a talk on the work they've done using machine learning to evaluate COVID-19 outcomes in the US.
It's a pleasure to be here. As Will has briefly mentioned, I've had the chance to see the development of the DSML program from outside and from inside, and it's incredible what's been built, so I look forward to seeing how things keep progressing
and getting bigger. I want to thank Will and the team for welcoming me here and giving me a chance to present this work. I'm very passionate about this work, and I hope that, in addition to showcasing the power of the methods, I'll get some people excited to pursue some interesting social science questions that can be tackled with these methodologies. So I'll share my screen and I'll be talking over a few slides that I've put together, if that's okay for everyone. So this is a study that was published relatively recently in the Proceedings of the National Academy of Sciences. To give you context for where the study began, I'm going to take you back to the spring of 2020. A few of us had the privilege of only having to worry about whether we had to wipe down the groceries after we bought them during COVID time, but there was a more important emerging trend happening, which is that Blacks and Hispanics were hit harder by COVID wherever and however you cut the data: at the city level, at
the neighborhood level, in Democratic states, in Republican states, you'd see massive disparities in COVID mortality between Blacks, Hispanics, and whites. There were several journalistic accounts that pointed to higher infection rates and mortality in minority communities, and evidence coming from the few cities that at the time were publishing race-specific data on mortality was showing massive overrepresentation of minorities among the deaths. So, for example, in Chicago and Milwaukee, 70 and 73 percent, respectively, of COVID deaths had been among African Americans in the first few months of the pandemic. At the state level, in Louisiana and Michigan, for example, 70 and 40 percent of the deaths were among Blacks. If you look at the proportion of the population that Blacks and African Americans make up, they are severely overrepresented in these figures. So there was a clear pointer towards massive racial disparities in COVID outcomes, both in terms of deaths and infections. There was some reason for
that to be happening. There is existing evidence from other studies that Blacks and Hispanics are more likely to experience the pre-existing health conditions that we knew at the time were making people more vulnerable to COVID-19. We also knew that Blacks and Hispanics were overrepresented in the jobs that had been classified as essential, so they had to be working even though the virus was still circulating and not under control at the time. We also know that racial minorities are more likely to live in multigenerational households, which means that the elderly are more likely to be exposed to the virus if younger cohorts are interacting with other people at higher rates, for example. We also know that racial minorities are less likely to have health insurance, which exacerbates any risks that they might be facing from COVID. We also know that they live in neighborhoods where essential establishments like pharmacies, grocery stores, and supermarkets are more scarce, so they probably had to travel longer and be exposed for longer periods of time to the virus, maybe taking the subway to get
the groceries, things like that. And we also know from sociology that, given the institutional barriers and discrimination that they face across several contexts, Blacks and Hispanics may be reluctant to seek medical care if they fear that interacting with the public health system might reveal something that they are afraid of. For example, if someone has an arrest warrant, that person is unlikely to show up at a hospital because he or she might fear that that might trigger an arrest. So there is a bunch of both individual-level risk factors and structural conditions that make those racial groups more likely to be experiencing COVID at higher rates. If you overlay on all of this what we know about the spatial distribution of Blacks and Hispanics, and racial minorities more generally, in the US, the case for studying segregation as a potential explanation for why we saw these massive disparities in COVID outcomes, I think, starts to make sense. So racial segregation means a higher likelihood of experiencing
social isolation. We also know from existing research in sociology that friendships and social interactions are defined along racial lines, so Blacks are more likely to interact with Blacks and whites are more likely to interact with whites. We all have very diverse groups of friends, but people on this call are not representative of what's happening in the United States overall. So there is what we in sociology call homophily, the idea that social interactions are more likely to happen within the same racial groups. So here I show three hypothetical scenarios, A, B, and C. In each of them there are 100 squares, each of them representing one household. 80 percent of the households are white and 20 percent are Black; the black squares are the Black households and the white squares are the white households. So depending on how Blacks and Hispanics are distributed in space, the virus and the potential for interaction, the potential for infection due to the virus, might look very different. So in scenario A, given what we know about how members of different racial groups interact with each other, if we
know that Blacks and whites don't interact with each other much, and if Blacks have a higher likelihood of experiencing the virus, the virus might not spread very far, because this Black household here is not interacting with people in the vicinity, so the virus will not travel very far. Whereas if you're in scenario B, where we have these five clusters of Black households here, overall you should expect to see a higher infection rate, because this virus will spread rapidly in this cluster, also in this cluster, and the same for the other three. And if you have a scenario where there's perfect segregation, where all Black households live in a corner of the city, the virus will spread even faster. So this is the general conceptual idea that I have here for how segregation might play a role in producing these massive disparities like we saw during the early stages of the pandemic. In fact, if you look at the literature before COVID, the link between racial segregation and health had been established in several contexts. So it's been shown that
in places where there is more social isolation and population concentration due to racial segregation, we have a higher risk of tuberculosis, and HIV tends to spread faster. These are studies from the 80s and 90s, but they were clear in showing that in places where there was more segregation, those infectious diseases tended to spread more rapidly. So I'm not saying anything very new here; I'm just applying the existing literature to the case of COVID-19. So the simple hypothesis I'm going to be testing here is this one: places where racial segregation is higher, I claim, will have higher fatality rates and also larger racial gaps in fatality rates. So that's what I'm going to be testing with my data and methodology. So how am I going to do that? What I did is I collected data at the county level, first on the outcome that I care about here, which is COVID-19 deaths, looking at overall deaths in the population at the county level and also at racial gaps in deaths. I'm going to be measuring racial segregation at the county level as well,
using this metric called the relative diversity index. It turns out that there are at least a dozen ways to measure segregation; different authors have their own particular ways of doing it. It turns out that however you measure segregation in this context, you keep finding the same results, but I went with this one because it's what satisfied the reviewers of the paper. And I'm going to be adding relevant control variables, and I'm bolding "relevant" here because it's where the methodology that I'm going to be using comes into play, the lasso method. So you'll see more clearly why I'm highlighting relevant controls here. So the general idea, once this dataset has been assembled, is to use regression to estimate the relationship between COVID-19 outcomes and segregation, controlling for as many confounding variables as possible. So, a closer look at the data. I ended up having 2,100, almost 2,200, counties in the United States. There are about three thousand plus counties in the US, but it turns out that most people live in those two thousand counties, so even though I don't have all counties in the US,
I have 96 percent coverage of the US population. It means that the counties that I'm not including are rural counties where there is almost no population, so it's a fairly representative sample of the United States. So, county-level data for a representative number of counties, and for each county I was able to collect data on 50 different factors that could be potentially related to both segregation and COVID-19 infection. I grouped those 50 factors, or controls, into eight categories. Those categories are defined or informed by existing sociological theories and the public health literature. So I look at demographics as one bucket of variables. The second one is variables getting at density and potential for public interaction, things like population density and the average time that people are commuting, so something that gets at how much interaction in public exists. Social capital, the level of community institutions in those neighborhoods and counties that foster social capital in space. Health risk factors, like people who had been vaccinated against the flu in the
years prior. Capacity of the healthcare system, things like hospitals per capita and doctors per capita. Air pollution, which has been shown to be highly correlated with segregation and is also potentially impacting COVID outcomes. Share of the population that's employed in essential businesses; the idea here is that in counties where there is a higher share of people who have to show up to work, that might also influence COVID outcomes. And political views: we all followed the craziness of those early months of the pandemic, there were all sorts of conspiracy theories going on, reluctance to get vaccinated or not, so I felt that controlling for political views in the county was something that was relevant here. So those are the eight buckets. If you count what's in each of those buckets, you end up with 50 variables, which I list here. So looking at the bucket of demographics, I have things like percent younger than 25, percent older than 65, a measure of how people are segregated along age lines, racial demographics, and so on and so forth. Same for the bucket of density and public interaction. So I'm showing
here, between these two slides, the 50 potential controls that I could include in the model, and showing the correlation that they have with the COVID mortality outcome. The correlations here are not relevant; I'm just showing this slide to show you the list of the 50 variables that are candidates for controls. So this is the data: the outcome is COVID, what I care about is segregation, and I have 50 other variables that I could include in the model. I could include all of them, but that would be statistically reckless. If any of you have had any prior exposure to regression, you know that this is not something that you should be doing, just throwing everything into the model, because there is potential for overfitting and it's not theoretically sound. So I'm going to walk you through how I go about selecting the relevant variables among the 50 that should feature in the model, but that's coming up in a few slides, so don't get too anxious. So if you forget about controlling for anything else and you just look at the correlation between COVID
outcomes and racial segregation, you find what you would expect given the theoretical framework that I outlined. So, these scatter plots: let's focus on the one on the right. The x-axis is a normalized measure of racial segregation and the y-axis is a normalized measure of COVID mortality. Each dot represents a county, and the size of the dots represents the population of the county, so bigger dots are bigger counties. This big one here is probably New York City, or Manhattan. So what you find here is a positive correlation between racial segregation and COVID mortality: in places where racial segregation was higher, the fatality rate was higher. So this is the scatter plot on the right. If, instead of just looking at deaths, you take a broader view and look at cases, you find the same positive correlation. That's cool; I don't think that at the time anyone had shown that. But I think we can do better than simply doing a simple correlation and not controlling for anything else. So the analytical strategy that I have in
mind to refine this analysis is, as I said at the beginning, to use regression analysis to estimate the relationship between county-level segregation and county-level mortality. So I'm interested in the impact of segregation on COVID mortality, but there are challenges if we're just going to run a simple regression of these two variables. Challenge number one is that there are confounding factors that are potentially driving the relationship between segregation and COVID mortality. So here the idea that correlation is not causality comes into play. There are things that are affecting both segregation and COVID mortality that, if you don't control for them, are going to be biasing whatever number you get for the association, the correlation, between segregation and COVID mortality. So, things like population density: we know that places that are more densely populated tend to also be higher in segregation, and in places where population density is higher, COVID mortality will likely be higher. Same if there is a higher proportion of frontline workers in highly segregated places
and frontline workers are more exposed to the virus; then that's a confounding factor. Access to medical care: if in places where there is more segregation there is harder access to medical care, and that also impacts COVID mortality, failing to account for that, to control for that in the models, will lead to a flawed estimate or a biased result. But the challenge gets even harder, because yes, you have confounding factors. Some of them you can observe and control for, like the three that I just mentioned: population density, frontline workers, access to medical care. You could measure that, and I'm able to measure that and throw it into the model and account for it. But there are other factors that are unobservable; there is no way to measure them. Like personal networks: we don't have data on the extent to which people are connected to each other along race or age lines. The quality of medical care: I can measure the number of doctors, but you probably cannot measure the quality of those doctors, for example. Those are not the best examples, I'm realizing, but the idea here is that there are things that you will
never be able to measure and account for. So how do we deal with these two challenges? Remember, challenge number one: there are things that are driving the correlation between segregation and COVID mortality that we should account for, and among these things there are some that are not observable. So what's the solution here? To identify the controls, among the ones that I can observe, that should be included in the model, I'm going to be using what's called the double lasso methodology; I'm going to walk you through that. To account for the fact that there are things that I will never be able to measure and observe and include in my models, I'll be running some sensitivity analysis. Just to give you a preview of what I mean by sensitivity analysis here: I'm going to be doing some back-of-the-envelope calculation that will get at the following question: given that I forgot to measure something that could potentially drive the correlation between segregation and mortality, how big should that unobserved thing be to make my results go away? And if it's
way bigger than things that we know are highly correlated with COVID mortality and segregation, we think that this thing is unlikely to exist in the social world. For example, if population density is the thing that correlates the highest with segregation and COVID mortality, and my sensitivity analysis shows that the one thing that would mess up all my models should be three times as strong as population density, it's unlikely that that thing exists in the real world. Whereas if my sensitivity analysis says you only need something that's a tenth of population density to make your results go away, then I'm in trouble, because that thing that's just one tenth of population density could be anything in the social world. Is that idea clear, what sensitivity analyses are here? Okay, because I'm not going to be talking a lot about sensitivity analysis; I'm going to be spending my time going through the double lasso. All right, so now we're getting into the lasso approach that I'm following here. So, given this model that I have in mind, Y is my outcome of interest,
which is COVID mortality or racial gaps in COVID mortality, whichever you want. So I'm interested in estimating a model that has an intercept, that's traditional regression analysis, and has a bunch of coefficients here. One of those coefficients is the one that I really care about, which is the first one, let's call it beta 1; it's the coefficient on segregation. And then there are a bunch of other controls that I can add, so that's why I'm summing over p potential predictors that I could have here. But there is one that I care a lot about, which is beta 1, the coefficient on racial segregation. So to estimate this model that I have displayed here, I could use OLS. OLS finds the set of coefficients that minimize the sum of the squared errors. The current students in DSML will find this very familiar; they covered that in week one or week two, so it's very familiar to people. I have another alternative here, which is that instead of using OLS I could use lasso regression. Lasso regression
looks very similar to OLS, but it adds a penalty term. The penalty term here ensures that the model that I end up with is a more parsimonious model. It basically penalizes me for having too many controls, and it shrinks some of the coefficients in size and will drop some of them. So ultimately lasso tends to yield more parsimonious models, and by parsimonious here I mean that I will have fewer controls in addition to the segregation variable that I care about. Parsimony tends to be better in general because the model is less likely to overfit the data, it's simpler to interpret, and it has other appealing properties. The current students in the talk will identify this lambda parameter here, which is the thing that we are tuning via cross-validation, finding the optimal lambda parameter that yields the better model, so to speak. All right. So the lasso, the way I've explained it, is a very powerful regression tool that, as I said, yields a parsimonious, cross-validated model that will predict the
outcome very well. So my lasso coefficients, coming from this minimization problem here, will be highly predictive of the outcome and will lead to a very parsimonious model, things that we like when we are simply doing prediction. But we're not simply doing prediction. I'm not trying to predict COVID mortality; I'm trying to estimate the role that segregation plays in COVID mortality, so I care a lot about getting right the coefficient on, sorry, on segregation. So the lasso, it turns out, might harm me instead of helping me. Why? Because, as I said at the beginning, what the lasso does is it will drop some coefficients and shrink some others so that we end up with a more parsimonious model. It could be the case that the lasso is shrinking my segregation coefficient towards zero, which would be biased, but it could even drop it altogether because it finds that segregation should not be here: there is something else that's more powerful than segregation, I'm just going to drop segregation. The lasso doesn't care about the theory that I have in mind. If I throw
all these variables into a lasso regression, the lasso doesn't know that, hey, I care a lot about segregation, just keep segregation there. The lasso, if you don't do anything else, will not do that, so you have to do something to ensure that segregation is kept there and that the shrinkage happens in the other controls that you don't care much about. So how do we take advantage of these very appealing properties of the lasso, of yielding a very parsimonious model, while still keeping intact the coefficient on segregation? We do double lasso, not just lasso. How does double lasso work? Going back to the model that I had in mind at the beginning, the one that I used to lay out and set up the whole enterprise here: I care about the relationship between segregation and COVID mortality, and I know that there are confounding factors. I've put together a dataset that has 50 of those potential confounding factors. So what the double lasso does is four steps. In the first step I'll run a simple lasso where I'm going to be predicting segregation from the 50 controls, and because I told you that the
lasso is very good at producing very parsimonious models that are cross-validated, of those 50 controls I'm going to keep some; the lasso regression of segregation on the 50 controls will keep some of them. So step one: run a lasso of segregation on the 50 controls and keep the ones that the lasso tells me to keep. Step number two: do the same for COVID. So I'm going to grab my COVID variable, run a lasso of COVID on the 50 controls, and the lasso, because of the way it works, is going to keep some of those 50 controls. So now I have two sets of controls: some of the 50 that I kept from step number one, and some of the 50 that I kept from step number two. Well, I could look at the intersection between the two, right? I have potentially a pool of controls that intersect, and those are the ones here. So the next step is just looking for the ones that were selected by the first lasso and the second lasso, looking for overlap, and then running a regression of COVID-19 on segregation and the double lasso controls. If you're curious about which controls have been selected by the double lasso here, it's a total
of 18 controls out of the 50 that I threw into the two lassos. So the double lasso tells me to keep percent Hispanic; percent white; surprisingly, I have to keep sports and bowling centers per capita, apparently that thing matters for segregation and COVID, I can see why; hospitals per capita; nursing homes per capita; percent of people working from home; percent of households with six or more occupants; percent of individuals with less than high school, with no high school; percent HIV positive in the population; income segregation, which is a different way of measuring segregation, along socioeconomic lines; percent younger than 25; percent older than 65; percent that use public transit to go to work; a measure of air pollution; percent of people who got the flu vaccine the year before; percent who had health insurance; and median income in the population. So with these 18 controls, now I can do a very simple thing: just run a simple regression, no more lasso, of COVID mortality on segregation and those 18 controls. This is the optimal set of controls that I should
keep to make my model as parsimonious as possible without sacrificing predictive power and without overfitting the data. That's the beauty of the double lasso here. So, given that, I ran those models. For those of you who haven't seen a figure like that, this is a plot of coefficients. Let's focus on the ones at the bottom. Coefficients to the right of the vertical line at zero mean a positive relationship between segregation and the outcome stated at the top, so in this case the death rate. So I'm going to walk you through what these three rows mean. OLS here means that if you just run a regression of COVID deaths on segregation without any controls, this is the coefficient that you get. So this means that for a one standard deviation change in segregation, the mortality rate increases by 0.36, sorry, by 36 percent. That's a huge amount of change; it's probably confounded by the many things that I'm failing to control for. So this is the very naive model, the one for which I showed you the scatter plot, the one that has no controls.
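The contrast between the naive regression and the double lasso can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the study's code: the data are synthetic, the helper names `lasso_select` and `coef_on_treatment` are hypothetical, the lasso is a small hand-written coordinate-descent routine, and the penalty `lam` is fixed by hand rather than tuned by cross-validation as described in the talk. The talk describes keeping the overlap of the two selected sets; the commonly cited post-double-selection formulation keeps their union, which is what this sketch assumes.

```python
import numpy as np

def standardize(a):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

def lasso_select(X, y, lam, n_iter=100):
    """Return the set of column indices the lasso keeps (nonzero coefficients).

    Coordinate descent on (1/2n)*||y - Xb||^2 + lam*||b||_1, assuming the
    columns of X are standardized and y is centered.
    """
    n, p = X.shape
    b = np.zeros(p)
    resid = y.copy()                      # residual y - X @ b (b starts at 0)
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]       # add back column j's contribution
            rho = X[:, j] @ resid / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft threshold
            resid -= X[:, j] * b[j]
    return set(np.flatnonzero(b).tolist())

def coef_on_treatment(y, d, controls=None):
    """OLS of y on an intercept, d, and optional controls; return d's coefficient."""
    cols = [np.ones(len(y)), d]
    if controls is not None:
        cols.append(controls)
    beta, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return beta[1]

# Simulated data (assumption): 20 candidate controls, of which only the
# first two drive both "segregation" (d) and "mortality" (y).
rng = np.random.default_rng(0)
n, p = 500, 20
W = rng.normal(size=(n, p))
d = 0.8 * W[:, 0] + 0.8 * W[:, 1] + rng.normal(size=n)
y = 0.1 * d + W[:, 0] + W[:, 1] + rng.normal(size=n)   # true effect of d is 0.1

Ws = standardize(W)
sel_d = lasso_select(Ws, standardize(d), lam=0.1)   # step 1: segregation on controls
sel_y = lasso_select(Ws, standardize(y), lam=0.1)   # step 2: outcome on controls
keep = sorted(sel_d | sel_y)                        # union; use & for the overlap

naive = coef_on_treatment(y, d)                     # no controls
double = coef_on_treatment(y, d, W[:, keep])        # plain OLS with the kept controls
print(f"naive: {naive:.2f}  double lasso: {double:.2f}  kept controls: {keep}")
```

With the true effect set to 0.1 in the simulation, the naive slope absorbs the confounding from the first two controls and comes out far too large, while the regression that includes the selected controls should land close to 0.1.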
It makes sense that it's so big, because there are so many other things going on here that are not being accounted for. So here, the second one, what I do is what's called state fixed effects, so basically I'm running many regressions within states and averaging across all states. It's better than just doing the simple correlation that I did in the scatter plot. But the good one, the one that I want you to focus on, is the double lasso. So these are the results of the regressions where I'm regressing the COVID outcome on segregation and the optimal set of controls that the lasso has identified. As you can tell, once you control for things that are relevant, the size of the association shrinks significantly, but it is still statistically significant. The fact that these horizontal lines are not crossing the vertical line means that they're reaching statistical significance. So in practical terms it means that counties that were one standard deviation above the mean of racial segregation experienced mortality and infection rates that were eight percent higher for mortality and five percent higher for infection. If you translate that to what it means
in terms of the number of people getting infected or dying, it means that for a one standard deviation change in segregation you get four additional deaths or 105 infections in a county with 100,000 residents. So if the county is bigger, you should scale that by the corresponding scaling factor; for a county with a million it would be 10 times that, so 40 deaths and a thousand additional infections. This is for just looking at deaths without distinguishing between members of different ethnic or racial groups. If you look at the impact of segregation on gaps in mortality between Blacks and whites and between Hispanics and whites, just focus on the bottom row, which is the best model, the one that has all the controls that need to be in the model. You find that in counties that are one standard deviation above the Black-white segregation mean, the gap between Blacks and whites was eight percent higher. Surprisingly, there is no impact on the gap between Hispanics and whites. I would expect to find an association here. If you look at the public health literature, this is a very common
thing, a very common pattern. People have named it: it's called the Hispanic paradox. The paradox comes from the fact that Hispanics and Blacks experience very similar structural conditions of disadvantage (socioeconomic disadvantage, racial segregation), but Hispanics surprisingly do very well in terms of public health outcomes, almost surpassing whites. There are many theories for why this could be the case. People think that Hispanics tend to keep social bonds that are more conducive to better health outcomes. I mean, this is all speculation that scholars of public health have put out there. But if you just put aside what could be driving this and you just look at the data, Hispanics are doing much better than Blacks in terms of health outcomes, despite facing very similar structural conditions of disadvantage. It's a puzzle that keeps showing up in the public health literature, and it's interesting that it shows up in this analysis here. So segregation seems not to have an impact on the differences in Hispanic-white mortality rates. That's all I have. I mean, here's a figure
of the sensitivity analysis; I can just walk you through it here. What the sensitivity analysis does is plot a contour: any variable that falls on this contour or outside it (same for this one) is a variable that could be sending all my results to the trash, meaning no significant association between racial segregation and COVID mortality. The dots here are the variables that are included in the model. What the scatter plot shows is the correlation of each variable that's in the model with the measures of segregation, and its correlation with the mortality gap. Any variable that is outside of here is more strongly correlated with segregation and mortality. I find it highly implausible that something that strong exists, given that what's inside this contour is already highly predictive. That's why I'm fairly confident that there is nothing out there that could be driving the correlation that I'm reporting in the models. That's why I think that reviewers like this
paper, because it combines the use of the double lasso (which is, you know, of all the things that I could measure, just let the algorithm select the ones that are most highly predictive of both segregation and COVID mortality) and then lets me also be vulnerable to having failed to measure something out there, and put a number on how big that something out there would have to be, and compare that to what's already modeled. What the sensitivity analysis shows is that, even if there is something out there... I mean, statistically something could exist, but in practical terms... Statistically, there is a variable that could make all the results go away. Just take this one here: something that has a correlation of 0.25 with mortality and 0.3 with segregation at the same time. That variable could make everything go away. But it turns out that nothing has these two correlations at the same time, at least among the things that we can measure. Happy to answer questions.

You want to see people's faces better. So, enormous thanks. I know it's not the easiest thing to do remotely, but maybe you could all unmute and do a round of applause at the
same time, because that was an absolutely fascinating talk on a topic that is both very significant and important on the societal level (and thank you for that) and also in terms of the data science techniques used. So I'm going to pass to questions, and I really hope it's okay to say, Gerard, for people who are newer to our community: this is not an environment in which the question of 'what the heck does that thing mean' is off the table. That is a question very much on the table.

I love these questions, yeah. So if you've never seen a simple regression or anything, just ask, and we'll try to explain it in plain English.

Thank you so much, Gerard. I thought it was super interesting, like your point about how you can't necessarily use lasso regression because it will shrink coefficients. I thought that's really interesting; I never thought about that before, because in machine learning we're just trying to predict the outcome
as accurately as possible, but in research we really care about the value of the coefficient; we care about what the effect on the outcome is. So lasso and ridge really mess with the coefficients. Does that mean that typically, when you're doing research, you don't use those regression models unless you're doing something more sophisticated, like what you did with the double lasso?

I wouldn't call it more sophisticated, just a fundamentally different question that you have in mind. Going back to the example I gave, let's say that you use lasso to predict whether what your self-driving car is seeing in front of it is a stop sign or not. You don't care about which coefficients are in that model and what the size of each coefficient is; you just care that the prediction is accurate, right? So it doesn't matter what's in the model or what the sizes of the estimated coefficients are. But in social science we happen to care about these things, because they have policy implications, because we want to build theories around
that, because we want to give a precise estimate to one coefficient. That's when you have to get more creative. Either you stick to traditional regression or logistic regression, where there is no shrinkage happening, or you do things like the double lasso, where you use the predictive power of the lasso to predict the variable that you care about and the outcome, and look for an intersection of the two. And that forces the coefficient on segregation to be preserved.

Yeah, no, that makes sense. This is super interesting, because in sociology, but also in all sorts of fields like economics, pretty much any social science, you care about the size of the effect of something on the outcome.

There are questions in social science that are purely predictive, and then, yeah, you might not care about the coefficients. But in the social world and in policy work, we tend to care about bivariate relationships. And some of the machine learning methods, you know, they are very powerful, but you'll have to tweak them somehow to make them work for you in that
case.

Awesome, thank you.

Yes, thank you for the question.

Next up, Kaju.

Hey Gerard, thank you for the talk. I've got two questions. One is: when you showed that you have accounted for 96 percent of the population, what would happen if, say, in the four percent that you left out, some of the counties are heavily Black but don't have as high a COVID mortality rate? Would the slope of the graph that you had with all the circles tend down, and would that mess with your equation?

Yeah, it's a great question. Let me put my screen back up again. So one observation that Kaju just made is that I'm missing a thousand counties here. There are three thousand plus counties in the United States, but those thousand counties that haven't been included in the data are only four percent of the population. The reason why I have not included them is that they did not have information on the 50 variables that I care about. So I could have included all counties and,
instead of measuring 50 things, measured perhaps 25 things. So I chose to measure more things and shrink the sample of counties. It still preserves a lot of the population for which I'm making the inferences, right? So that's the rationale for why I made that decision. Notice one thing here: it tells you that those scatter plots are weighted by population. So even if the thousand counties that I'm missing were added here, because they're so tiny you wouldn't even see them in the scatter plot. They would not mess up my estimation a lot, and probably some of them would fall up here and some others would fall down here. It's just rural counties that are very [inaudible]. So I did some sensitivity analysis, just running the regressions with the full sample of counties and with fewer variables, the ones that I could get for the 3,000 plus counties, and nothing significantly different was happening. But yeah, it's a good
observation, and that's a problem in many social science questions: what happens with the data that you are not including, what happens with the missing data, what happens with the observations that you dropped? At the least, you have to show some comparison between those that you dropped and those that you kept and see how different they are from each other, so that readers have an informed way of working through the question that you're asking in this case.

Yeah, gotcha. So, I'm in Louisiana, so what you just said...

Oh yeah, at the beginning, yeah.

I saw it firsthand, and, you know, I don't know, it was what it was, I guess. And I have one quick little question: how do you define a confounding variable?

Yeah, great question. All right, let's go back to this graph here. A confounding variable is by definition a variable that is related to both the variable that you care about and the outcome. In this case I care about segregation, and the outcome is
COVID mortality. Anything that's related to both of them at the same time, that's a confounding factor. A variable that's only related to the outcome, or only related to segregation, is not a confounding factor; if you forget to include that variable, it will not significantly impact the results. It turns out that in the social world there are not that many things that are not related to everything. But at a conceptual level, that's the idea behind it. In practice, it's something that's related to both segregation and COVID mortality in this case.

I see. So would that impact the intersection of the double lasso as well? For example, I know you said something about bowling alleys and sports centers, which I can see are very segregated, so that's where people gather. Would that have some effect on the confounding factors?

Yes, yes, for sure. It will likely change the pool of confounders. And you can never be certain that you've measured all confounding factors, because there are things in the social world that matter but you just cannot measure
them. Just, I don't know, people's attitudes towards vaccines. I mean, we can ask people in surveys, but we know that people lie in surveys, so even there, there are things that we don't know. And that's when sensitivity analysis comes into play: all right, there is something out there that I cannot measure, that I failed to measure, that I forgot to measure, whatever. How big does that thing have to be to mess up the whole machinery here? That's the idea.

Wow, thank you so much.

Thank you for the question.

Thanks so much for your presentation. I'm just curious if there are other methodologies or ways of objectively identifying controls besides the double lasso. In other words, if you could do this again, would there be other mathematical approaches that might yield similar results?

There's a close cousin of the lasso, which we call ridge regression, but I tend to prefer the lasso. It's more intuitive, and I think that for the shrinkage it works better. I'm not aware of any new development recently that could play a
similar role here. The most important one, Sarah, is just theory: just thinking about the variables and the things that should matter for both segregation and COVID mortality, and making the effort to find information on those things. Of this whole project, running the lasso was 10 percent of the time; ninety percent of the time was spent on thinking about which things should matter here and going out there and measuring them. That's where the challenge is; that's where the time in any data science project is spent. So I would say there is no substitute for good theory and good data measurement. And then the methods, you know, of course they are important, but it's not what you spend most of the time on.

Thank you, Gerard, for that great presentation. I definitely feel enlightened, and I think I speak for the whole room when I say that. How did you measure segregation?

Yeah, so I measured it in various ways. The one that I ended up including in the models that made it to the paper is called the diversity... I'm blanking on the exact name now.
The diversity index, which basically captures the probability that two individuals of a different race would interact with each other at the neighborhood level. The way this is measured is from census tract or census block data. For you to measure segregation at the county level or at the city level, you have to go down to see what the configuration is at the neighborhood level: what proportion of Black people live in this neighborhood, which others live in the neighborhood, and then just use that data to create the index. You could use the Theil index, which uses information theory and entropy as well. It turns out that if you measure segregation that way, things don't change. That was my initial proposal, but reviewers asked for this one. You could use what's called a measure of exposure: what's the percentage of individuals of a given race to which Blacks are exposed in a given city. So these are just different ways of capturing the proximity of individuals of different races.

Okay, okay, great, thank you.

It has to be computed from data that's more disaggregated, at the neighborhood level.
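
As a rough illustration of the exposure measure mentioned here, the following Python sketch computes a standard textbook exposure (interaction) index from neighborhood-level counts. It is only a sketch under assumptions: the tract counts are made up, the function name `exposure_index` is hypothetical, and the formula is the common sum over tracts of (b_i / B) * (w_i / t_i), which is not necessarily the exact specification used in the paper.

```python
from typing import List, Tuple

# Each tract is (black_residents, white_residents, total_residents).
# All counts below are invented for illustration only.
Tract = Tuple[int, int, int]


def exposure_index(tracts: List[Tract]) -> float:
    """Exposure of Black residents to white residents in one county.

    Computes sum_i (b_i / B) * (w_i / t_i): the white share of the
    neighborhood in which the average Black resident lives.
    Lower values mean less contact, i.e. more segregation.
    """
    total_black = sum(b for b, _, _ in tracts)
    if total_black == 0:
        raise ValueError("no Black residents in the county; index is undefined")
    # Skip empty tracts to avoid dividing by zero.
    return sum((b / total_black) * (w / t) for b, w, t in tracts if t > 0)


# Three hypothetical two-tract counties with the same overall composition.
integrated = [(50, 50, 100), (50, 50, 100)]
segregated = [(100, 0, 100), (0, 100, 100)]
mixed = [(80, 20, 100), (20, 80, 100)]

print(exposure_index(integrated))  # 0.5
print(exposure_index(segregated))  # 0.0
print(exposure_index(mixed))
```

All three counties are half Black and half white overall, so a county-level share alone could not tell them apart. Only the neighborhood-level distribution does, which is why the index has to be built from tract or block data.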
So you have to see the distribution of people inside the county across neighborhoods.

Right, right, okay, great, thank you. I think that kind of gives me a second question too. For your sensitivity analysis, you basically had the measure of sensitivity on the x-axis, if I'm not mistaken.

I'm sharing it again.

Yeah. I wasn't quite sure why you chose the two measures that you did for the sensitivity part.

Okay, so it goes back to the question that Kaju asked before. Were you, Kaju, the one who asked about sensitivity? Yes. So it goes back to the idea of what a confounder is. A confounder is something that's correlated with both segregation and COVID. Those are the things that are problematic here: something that has a correlation with COVID and a correlation with segregation. So what I'm plotting here is the correlation that all variables in my model have with the mortality gap metric and the segregation metrics. So just take that
dot here. I don't know which variable in the model this one is, but this is a variable that's correlated 0.2 with gaps in COVID mortality, positively, and minus 0.2 with segregation.

Okay, okay, understood.

And I do this for all variables in the model, and then I'm plotting the frontier here; any variable that's outside of that border would be trouble, so to speak.

Okay, yes, understood, okay, thank you.

It's characterizing how strong these two correlations have to be simultaneously for a confounder to be something problematic. And you can tweak it in different ways, Jonathan. This sensitivity analysis corresponds to one in which this contour, this frontier here, and anything outside of it, defines the set of variables that, if they existed and were to be included in the model, would make the correlation between segregation and COVID go to zero. You could take a more conservative approach and say: all right, I'm going to run a sensitivity
analysis to find variables that would cut the magnitude of my association in half, or that would cut it to, like, 25 percent. So you could just tweak it in different ways. But this one is the most aggressive version, which is: all right, let's assume that the correlation between segregation and COVID doesn't exist, it's zero; I'm just making it up here, or I'm just lucky with the controls that I happened to find. What is this something out there that I haven't measured? How big is it? That's the idea here, with 'big' defined in terms of its correlation with the outcome and with segregation.

One more question. I know you said that you go down to the neighborhood level to get that data about segregation. Now, we know that a Black neighborhood is much less likely to report that kind of stuff, that self-reported race data, than a white neighborhood. So how would that skew your segregation measurement?

Yeah, well, it's a great question. Here we're just relying on what the census tells us. The neighborhood-level data that I'm using to compute segregation is something that comes
from the census. In theory, the census captures everyone in the country, but it's up to the people who respond to the census to, you know, first respond and send back the documentation, and to just be accurate. So this is a potential bug that everyone using census data for demographics has to deal with. But yeah, it's a good observation; there's not a lot we could do here. My sense is that on average the census is fairly accurate. There might be pockets where there is under-reporting, or people failing to comply with whatever the census tells them to submit, but in general the census is regarded as data that's reliable.

Yeah, it was just interesting when you were saying that, because I saw pockets of people that were non-diverse, I guess, you know. I mean, this is of course from experience, but the conspiracy theories and stuff like that would be kind of spiking up in those pockets, because they're not seeing what's outside. But then, you know, like you said, it's accurate on average.

Exactly, yeah. But yeah, definitely, if you drill
down to particular blocks or particular census tracts, you might find inconsistencies, for sure.

Yeah. Thank you, Gerard, so much for sharing your research. That was really, really interesting. And thanks, everyone, for coming. Keep an eye on our events page for more of these DSML talks, and if you're not part of our Codesmith community yet, check out the DSML page I shared in the chat. We have more workshops, reading groups, pair programming sessions, and more coming up, so keep an eye on that. Thanks again, Gerard, and thanks, everybody. Have a good night. [Music]