The Exploration of Mexican Migration
This research project explores migration data to understand the differences among migrants and the disparities they face.
Immigration has been a polarizing issue in the United States for many years, and the debate has intensified in recent years as a result of positions taken in government. Migrating from Mexico is not an easy endeavor, and many migrants make the journey to better their lives. The vast majority come for economic opportunity, but in recent years the purpose of migration has been portrayed in a convoluted way that casts migrants in a bad light. It is vital to understand migration patterns, since they could inform policies that support migrants as well as policies that aid the countries these migrants are leaving.
The data set is split into multiple data books, each covering different types of questions asked of migrants. We initially planned to use four data books but narrowed our research to two as our research questions changed.
Data Books
- HOUSE
The HOUSE part of the data set covers household composition and the economic and migratory activity of the members of the household. This includes migrants' land ownership, home/real estate, vehicle and livestock ownership, and business ownership and operation. This is important for understanding which migrants own land while coming to the United States.
- MIG
The MIG portion of the data set refers to a person-level file containing details of all border crossings (up to 30) by each head of household, as well as measures of economic and social activity during the last U.S. visit.
Research Questions
We intend to explore the following research questions:
1. Which demographics, such as sex and age, are mainly migrating to the United States?
2. What does life look like for these immigrants after migrating, and are they able to find better living and working conditions?
We will also conduct text analysis on Donald Trump's tweets to understand the narrative that has been built around Mexican migration since 2016. This will set up our analysis by showing that migrants are simply trying to better their lives. Our group intends to make bar and line graphs showing migration patterns, along with colored line graphs that may show income brackets before and after migration. We plan to explore how sex and age have played out in Mexican migration patterns. We will also explore the cost of migration for the years available in the data set. After further exploration of the data, our group aims to examine the types of amenities migrants are able to use and the income brackets at which they use them.
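The income brackets mentioned above can be built in one step with `pd.cut`. The sketch below is only illustrative: the income values are made up rather than taken from the survey, and the 10,000-wide brackets are an assumption. It uses half-open intervals so that a boundary value such as 10000 lands in exactly one bracket.

```python
import pandas as pd

# Hypothetical incomes for illustration only; NOT values from the survey.
incomes = pd.Series([500, 10000, 19999, 20000, 95000])

# Bracket edges every 10,000 (an assumed width) from 0 up to 100,000.
edges = list(range(0, 100001, 10000))
labels = [f"{lo}-{lo + 9999}" for lo in edges[:-1]]

# right=False makes each bin [lo, lo + 10000), so 10000 belongs only to
# the "10000-19999" bracket and never to "0-9999".
brackets = pd.cut(incomes, bins=edges, labels=labels, right=False)

# Count rows per bracket, keeping the brackets in numeric order.
counts = brackets.value_counts(sort=False)
print(counts[counts > 0])
```

Because the result is an ordered categorical, a bar chart of `counts` keeps the brackets in numeric order instead of sorting the labels alphabetically.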
Our Analytical Process:
We began by reading the code books, which explained all the variables in the dataset and in each data book. We then worked with the data directly to see exactly what the variables contained and how we could use them. As previously mentioned, we narrowed our research to the HOUSE and MIG data books.
We wanted to explore the main demographic aspects: sex, age, and income. Once those were in place, we explored income alongside these demographics to understand differences by sex. Both data books include income, but each measures it a bit differently. In the MIG data, we used the income of the head of household and compared it to the last income received in Mexico, to see whether migrants earned more in the U.S. or in Mexico. The HOUSE data also records head of household income, but we used it for a different purpose because the HOUSE data book includes the amenities. We did the same thing with wages, comparing what migrants made on their first trip to the U.S. with what they made on their last trip.
We came upon the idea of looking at tweets from Donald Trump to determine what his narrative on Mexican migration was. We used VADER sentiment analysis to gauge whether what he was saying was negative or positive. We also looked for the key words used most often in his tweets and which of these were negative. We decided to start our analysis with this part because we think it could help guide our narrative.
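The comparison between U.S. income and the last wage in Mexico can be sketched as follows. This is a minimal illustration on made-up rows, not the survey data. The helper `clean_income` is hypothetical, and it assumes that blanks and the codes 8888 and 9999 (the values this notebook filters out) should count as missing rather than as zero.

```python
import pandas as pd

# Hypothetical toy rows; these values are NOT from the survey.
toy = pd.DataFrame({
    "ldowage": ["3000", " ", "9999", "5000"],
    "hhincome": ["12000", "8000", "15000", " "],
})

SENTINELS = [8888, 9999]  # codes assumed to mean "no usable value"

def clean_income(col):
    """Convert a text column to floats; blanks and sentinel codes become NaN."""
    values = pd.to_numeric(col.str.strip(), errors="coerce")
    return values.where(~values.isin(SENTINELS))

# The difference is NaN whenever either side is missing, so incomplete
# rows drop out of the comparison instead of being counted as zero.
toy["income_diff"] = clean_income(toy["hhincome"]) - clean_income(toy["ldowage"])
print(toy["income_diff"].dropna().tolist())
```

Treating a blank as missing rather than as 0 matters here: a blank Mexican wage would otherwise make the whole U.S. income look like a gain.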
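VADER reports a compound score between -1 and 1 for each text. A common convention, suggested by VADER's authors, treats scores of 0.05 and above as positive, scores of -0.05 and below as negative, and everything in between as neutral. The sketch below applies that convention; the function name and the sample scores are hypothetical and do not come from the tweet data.

```python
def label_compound(score, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores for illustration only.
sample_scores = [0.62, -0.48, 0.01, -0.05]
sample_labels = [label_compound(s) for s in sample_scores]
print(sample_labels)
```

Counting the labels gives a quick summary of how many tweets lean negative, which complements the mean and histogram of the raw scores.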
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import requests
from bs4 import BeautifulSoup
import numpy as np
import seaborn as sns
from nltk.sentiment import vader
nltk.download('vader_lexicon')
tweets = pd.read_csv('Tweets.csv')
tweets.describe()
df = pd.DataFrame(tweets)
# Keep only tweets that mention immigrants or Mexicans (case-insensitive), so we can
# later look at the most common words within them and the dates they were tweeted.
# "immigrant" and "mexican" also match their plurals and the phrase "mexican immigrants",
# so the longer alternatives are unnecessary. na=False treats missing text as a non-match.
key_words = df.loc[df['Tweet'].str.contains("immigrant|mexican", case=False, na=False)]
key_words
scores = vader.SentimentIntensityAnalyzer()
compound_scores = []
for tweet in key_words['Tweet']:
    # polarity_scores returns neg/neu/pos/compound; score each tweet once
    result = scores.polarity_scores(tweet)
    print(result)
    compound_scores.append(result['compound'])
df_scores = pd.DataFrame(compound_scores)
print(df_scores)
print("Mean: ", df_scores.mean())
df_scores.plot(kind='hist', bins=25)
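# Join the filtered tweets into one string, separated by spaces so the last
# word of one tweet does not fuse with the first word of the next.
string_tweets = ' '.join(key_words['Tweet'].astype(str))
string_tweets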
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation
nltk.download('stopwords')
# Tokenizer models used by sent_tokenize/word_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')
sent = sent_tokenize(string_tweets)
#print(sent)
words = []
for s in sent:
    for w in word_tokenize(s):
        words.append(w)
no_words = ["``", "''","’","//t.co/", "“", "”"]
myStopWords = list(punctuation) + stopwords.words('english') + no_words
wordsNoStop = []
for i in words:
    # The stop word list is lowercase, so compare in lowercase to catch "The", "And", etc.
    if i.lower() not in myStopWords:
        wordsNoStop.append(i)
#print(words)
print(wordsNoStop)
from nltk.probability import FreqDist
freq = FreqDist(wordsNoStop)
# Print each word with its count, from most to least frequent
for i in sorted(freq, key=freq.get, reverse=True):
    print(i, freq[i])
freq
from nltk.stem.lancaster import LancasterStemmer
# Build the stemmer once and reuse it for every token
stemmer = LancasterStemmer()
stemmed_words = [stemmer.stem(w) for w in words]
stemmed_words
# Repeat the stop word filter and the frequency count on the stemmed tokens
myStopWords = list(punctuation) + stopwords.words('english') + no_words
wordsNoStop = []
for i in stemmed_words:
    if i not in myStopWords:
        wordsNoStop.append(i)
#print(words)
print(wordsNoStop)
freq = FreqDist(wordsNoStop)
for i in sorted(freq, key=freq.get, reverse=True):
    print(i, freq[i])
freq
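sortspeechcount = {k: v for k, v in sorted(freq.items(), key=lambda item: -item[1])}
sortspeechcount
# frequentwords and frequentvals are not defined anywhere above. This assumes they are
# the 20 most frequent stems and their counts; the cutoff of 20 is a guess.
frequentwords = list(sortspeechcount.keys())[:20]
frequentvals = list(sortspeechcount.values())[:20]
# Color the stems of interest red and every other stem grey
barcolors = []
for i in frequentwords:
    if i in ['kil','drug','illeg']:
        barcolors.append('red')
    else:
        barcolors.append('grey')
barcolors
plt.figure(figsize=(12,5))
plt.barh(frequentwords, frequentvals, color=barcolors)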
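This bar chart shows the most frequent stemmed words in the filtered tweets, with the stems 'kil', 'drug', and 'illeg' highlighted in red.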
mig174_df = pd.read_csv('mig174.csv')
house174_df = pd.read_csv('house174.csv')
HOUSE174_df = house174_df[house174_df != 9999]
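# .copy() gives an independent frame, so the column assignments below
# do not raise SettingWithCopyWarning or silently act on a view
MIG174_df = mig174_df[mig174_df.age != 9999].copy()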
ax = MIG174_df["age"].plot(kind="box",figsize=(12,6))
ax.set_title('Box Plot of Age', fontsize = 14)
#ax.set_ylabel('age')
MIG174_df["age"].describe()
a = MIG174_df.groupby('sex')['country'].count().plot(kind='bar',figsize=(12,6))
a.set_title('Sex Difference of Migrants',fontsize=14)
a.set_xlabel('Sex',fontsize=14)
a.set_ylabel('Migrant Count',fontsize=14)
plt.legend(['1: Male 2: Female'],fontsize=12)
This bar graph shows the breakdown of migrants by sex. Most of the migrants in the data are male, a pattern that prior research has sought to explain.
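Raw counts can be turned into shares to make the gap between the sexes easier to state. The sketch below uses made-up rows rather than the survey data, and it assumes the coding shown in the chart legend (1 for male, 2 for female).

```python
import pandas as pd

# Hypothetical rows for illustration only; NOT values from the survey.
toy = pd.DataFrame({"sex": [1, 1, 1, 2]})

# Coding assumed from the chart legend: 1 = Male, 2 = Female.
sex_labels = {1: "Male", 2: "Female"}

# normalize=True returns each group's share of the rows instead of a count.
share = toy["sex"].map(sex_labels).value_counts(normalize=True)
print(share)
```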
MIG174_df.loc[MIG174_df['ldowage'] == ' ','ldowage'] = '0'
MIG174_df.loc[MIG174_df['ldowage'] == ' ','ldowage']
MIG174_df[['ldowage']] = MIG174_df[['ldowage']].astype(float)
df3 = MIG174_df.loc[(MIG174_df['ldowage'] <100000) & (MIG174_df['ldowage']>=1000) & (MIG174_df['ldowage'] != 9999), 'ldowage']
a = df3.plot(kind='hist',bins=150,figsize=(12,6))
a.set_title('Income in Mexico BEFORE Migration',fontsize=14)
a.set_xlabel('Wages',fontsize=14)
a.set_ylabel('Immigrant Count',fontsize=14)
MIG174_df.loc[MIG174_df['hhincome'] == ' ','hhincome'] = '0'
MIG174_df.loc[MIG174_df['hhincome'] == ' ','hhincome']
MIG174_df[['hhincome']] = MIG174_df[['hhincome']].astype(float)
df4 = MIG174_df.loc[(MIG174_df['hhincome'] <100000) & (MIG174_df['hhincome']>=1000) & (MIG174_df['hhincome'] != 9999), 'hhincome']
a = df4.plot(kind='hist',bins=150,figsize=(12,6))
a.set_title('Income in U.S. AFTER Migration',fontsize=14)
a.set_xlabel('Wages',fontsize=14)
a.set_ylabel('Immigrant Count',fontsize=14)
MIG174_df.loc[MIG174_df['age']<=19, 'age_group'] = '0-19'
MIG174_df.loc[MIG174_df['age'].between(20,29), 'age_group'] = '20-29'
MIG174_df.loc[MIG174_df['age'].between(30,39), 'age_group'] = '30-39'
MIG174_df.loc[MIG174_df['age'].between(40,49), 'age_group'] = '40-49'
MIG174_df.loc[MIG174_df['age'].between(50,59), 'age_group'] = '50-59'
MIG174_df.loc[MIG174_df['age'].between(60,69), 'age_group'] = '60-69'
MIG174_df.loc[MIG174_df['age'].between(70,79), 'age_group'] = '70-79'
MIG174_df.loc[MIG174_df['age'].between(80,89), 'age_group'] = '80-89'
MIG174_df.loc[MIG174_df['age'].between(90,99), 'age_group'] = '90-99'
ax = MIG174_df.loc[MIG174_df['sex'] == 2].groupby('age_group')['sex'].count().plot(kind='bar', edgecolor = 'black', figsize=(12,6))
ax.set_title('Ages of Female Immigrants',fontsize=14)
ax.set_xlabel('Ages',fontsize=14)
ax.set_ylabel('Immigrant Count',fontsize=14)
ax = MIG174_df.loc[MIG174_df['sex'] == 1].groupby('age_group')['sex'].count().plot(kind='bar', edgecolor = 'black', figsize=(12,6))
ax.set_title('Ages of Male Immigrants',fontsize=14)
ax.set_xlabel('Ages',fontsize=14)
ax.set_ylabel('Immigrant Count',fontsize=14)
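# The first bracket stops below 10000 because the next bracket starts at 10000.
# 9999 is filtered as a sentinel code elsewhere in this notebook, so it is excluded here too.
MIG174_df.loc[(MIG174_df['hhincome'] < 10000) & (MIG174_df['hhincome'] != 9999), 'income_group'] = '0-9999'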
MIG174_df.loc[MIG174_df['hhincome'].between(10000,19999), 'income_group'] = '10000-19999'
MIG174_df.loc[MIG174_df['hhincome'].between(20000,29999),'income_group'] = '20000-29999'
MIG174_df.loc[MIG174_df['hhincome'].between(30000,39999), 'income_group'] = '30000-39999'
MIG174_df.loc[MIG174_df['hhincome'].between(40000,49999), 'income_group'] = '40000-49999'
MIG174_df.loc[MIG174_df['hhincome'].between(50000,59999), 'income_group'] = '50000-59999'
MIG174_df.loc[MIG174_df['hhincome'].between(60000,69999), 'income_group'] = '60000-69999'
MIG174_df.loc[MIG174_df['hhincome'].between(70000,79999), 'income_group'] = '70000-79999'
MIG174_df.loc[MIG174_df['hhincome'].between(80000,89999), 'income_group'] = '80000-89999'
MIG174_df.loc[MIG174_df['hhincome'].between(90000,100000), 'income_group'] = '90000-100000'
cx = MIG174_df.loc[MIG174_df['sex'] == 2].groupby('income_group')['sex'].count().plot(kind='barh', edgecolor = 'black', figsize=(12,6))
cx.set_title('Income of Female Immigrants AFTER Migration',fontsize=14)
cx.set_xlabel('Immigrant Count',fontsize=10)
#cx.set_ylabel('Immigrant Count',fontsize=14)
ax = MIG174_df.loc[MIG174_df['sex'] == 1].groupby('income_group')['sex'].count().plot(kind='barh', edgecolor = 'black', figsize=(12,6))
ax.set_title('Income of Male Immigrants AFTER Migration',fontsize=14)
ax.set_xlabel('Immigrant Count',fontsize=10)
#ax.set_ylabel('Immigrant Count',fontsize=14)
MIG174_df['income_diff'] = MIG174_df['hhincome'] - MIG174_df['ldowage']
df5 = MIG174_df.loc[(MIG174_df['income_diff'] <50000) & (MIG174_df['income_diff']>=-25000) & (MIG174_df['income_diff'] != 9999) & (MIG174_df['income_diff'] != -9999), 'income_diff']
dx = df5.plot(kind='hist',figsize=(12,6),bins=150)
dx.set_title('Income Change After Immigration',fontsize=14)
# df5 already holds the filtered income differences, so reuse it
print('Mean: ', df5.mean())
print('Median: ', df5.median())
#print('Mode: ', df5.mode())
MIG174_df.loc[MIG174_df['uswage1']<=100, 'wage_group'] = '0-100'
MIG174_df.loc[MIG174_df['uswage1'].between(101,499), 'wage_group'] = '101-499'
MIG174_df.loc[MIG174_df['uswage1'].between(500,999), 'wage_group'] = '500-999'
MIG174_df.loc[MIG174_df['uswage1'].between(1000,1999), 'wage_group'] = '1000-1999'
MIG174_df.loc[MIG174_df['uswage1'].between(2000,2999),'wage_group'] = '2000-2999'
MIG174_df.loc[MIG174_df['uswage1'].between(3000,3999), 'wage_group'] = '3000-3999'
MIG174_df.loc[MIG174_df['uswage1'].between(4000,4999), 'wage_group'] = '4000-4999'
MIG174_df.loc[MIG174_df['uswage1'].between(5000,5999), 'wage_group'] = '5000-5999'
MIG174_df.loc[MIG174_df['uswage1'].between(6000,6999), 'wage_group'] = '6000-6999'
MIG174_df.loc[MIG174_df['uswage1'].between(7000,7999), 'wage_group'] = '7000-7999'
MIG174_df.loc[MIG174_df['uswage1'].between(8000,8999), 'wage_group'] = '8000-8999'
MIG174_df.loc[MIG174_df['uswage1'].between(9000,9999), 'wage_group'] = '9000-9999'
MIG174_df.loc[MIG174_df['uswage1'].between(10000,19999), 'wage_group'] = '10000-19999'
MIG174_df.loc[MIG174_df['uswage1'].between(20000,29999), 'wage_group'] = '20000-29999'
MIG174_df.loc[MIG174_df['uswage1'].between(30000,39999), 'wage_group'] = '30000-39999'
MIG174_df.loc[MIG174_df['uswage1'].between(40000,49999), 'wage_group'] = '40000-49999'
MIG174_df.loc[MIG174_df['uswage1'].between(50000,59999), 'wage_group'] = '50000-59999'
MIG174_df.loc[MIG174_df['uswage1'].between(60000,69999), 'wage_group'] = '60000-69999'
MIG174_df.loc[MIG174_df['uswage1'].between(70000,79999), 'wage_group'] = '70000-79999'
MIG174_df.loc[MIG174_df['uswage1'].between(80000,89999), 'wage_group'] = '80000-89999'
MIG174_df.loc[MIG174_df['uswage1'].between(90000,100000), 'wage_group'] = '90000-100000'
# Keep only the first-trip wage, last-trip wage, and wage bracket columns
Wage_df = MIG174_df[['uswage1', 'uswagel', 'wage_group']].copy()
print(Wage_df)
# Drop rows carrying the sentinel codes 9999 and 8888, and keep the result.
# The original expression was evaluated but never assigned, so the plot used unfiltered data.
Wage_df = Wage_df.loc[(Wage_df['uswagel'] != 9999) & (Wage_df['uswagel'] != 8888) & (Wage_df['uswage1'] != 9999) & (Wage_df['uswage1'] != 8888)]
sns.lmplot(data=Wage_df, x="uswage1", y="uswagel", hue="wage_group")
plt.ylim(0, 10000)