cnamrata - Tumblr blog

cnamrata · 4 years

Assignment 4: Data Management

To submit:

1. Create graphs of your variables one at a time (univariate graphs). Examine both their center and spread.

2. Create a graph showing the association between your explanatory and response variables (bivariate graph). Your output should be interpretable (i.e. organized and labeled).

Data set used: Gapminder dataset

Python Code:

--Start of Code--

import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)

pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)

pandas.set_option('display.float_format', lambda x: '%f' % x)

#convert_objects() has been removed from pandas; to_numeric(errors='coerce') converts to numbers and turns non-numeric entries into NaN
data["employrate"] = pandas.to_numeric(data["employrate"], errors='coerce')
data["urbanrate"] = pandas.to_numeric(data["urbanrate"], errors='coerce')
data["lifeexpectancy"] = pandas.to_numeric(data["lifeexpectancy"], errors='coerce')
data["incomeperperson"] = pandas.to_numeric(data["incomeperperson"], errors='coerce')
data["femaleemployrate"] = pandas.to_numeric(data["femaleemployrate"], errors='coerce')
data["internetuserate"] = pandas.to_numeric(data["internetuserate"], errors='coerce')

#Univariate analysis of Employment Rate

sub1 = data[ (data['employrate']>=50) & (data['employrate']<=100) & (data['urbanrate']>= 40)]

#copies sub1 data into a new subset called sub2
sub2 = sub1.copy()
dist1 = seaborn.distplot(sub2['employrate'].dropna(), kde=False)
plt.xlabel('Employment Rate in various countries')
plt.title('Employment Rate Analysis across Countries')

desc1=data['employrate'].describe()

print(desc1)

desc2=data['urbanrate'].describe()

print(desc2)

desc3=data['internetuserate'].describe()

print(desc3)

#Bivariate analysis of internet user rate, life expectancy etc.
scat1 = seaborn.regplot(x="urbanrate", y="internetuserate", data=data)
plt.xlabel('Urban Rate')
plt.ylabel('Internet Use Rate')
plt.title('Scatter plot for Association Between Urban Rate and Internet Use Rate')

print('Urban Rate in 4 Quartiles')
data['UrbanQuartile'] = pandas.qcut(data.urbanrate, 4, labels=["1=25th%tile", "2=50th%tile", "3=75th%tile", "4=100th%tile"])
c1 = data['UrbanQuartile'].value_counts(sort=False, dropna=True)
print(c1)
fact1 = seaborn.factorplot(x='UrbanQuartile', y='lifeexpectancy', data=data, kind="bar", ci=None)

plt.xlabel('Urban Rate')
plt.ylabel('Life Expectancy')
plt.title('Relation Between Urban Rate and Life Expectancy')

c2 = data.groupby('UrbanQuartile').size()
print(c2)

--End of code--

Output:

count   178.000000
mean     58.635955
std      10.519455
min      32.000000
25%      51.225000
50%      58.700000
75%      64.975000
max      83.200000
Name: employrate, dtype: float64

count   203.000000
mean     56.769360
std      23.844933
min      10.400000
25%      36.830000
50%      57.940000
75%      74.210000
max     100.000000
Name: urbanrate, dtype: float64

count   192.000000
mean     35.632760
std      27.780433
min       0.210000
25%      10.000000
50%      31.810000
75%      56.415000
max      95.640000
Name: internetuserate, dtype: float64

Urban Rate in 4 Quartiles
1=25th%tile     51
2=50th%tile     51
3=75th%tile     50
4=100th%tile    51
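The summary figures printed above already contain what is needed to describe spread. As a rough sketch (the dictionary simply copies the employrate values from the output, and the helper name spread_measures is made up for illustration), the range and the interquartile range follow by subtraction:

```python
# Values copied from the describe() output for employrate shown above.
employ = {
    "count": 178, "mean": 58.635955, "std": 10.519455,
    "min": 32.0, "q1": 51.225, "median": 58.7, "q3": 64.975, "max": 83.2,
}


def spread_measures(stats):
    """Return (range, interquartile range) from describe()-style values."""
    value_range = stats["max"] - stats["min"]
    iqr = stats["q3"] - stats["q1"]
    # Round to hide floating-point noise from the subtraction.
    return round(value_range, 3), round(iqr, 3)


value_range, iqr = spread_measures(employ)
print("range:", value_range)
print("interquartile range:", iqr)
```

The range uses only the two most extreme countries, while the interquartile range describes the middle half of the countries, so the two can tell quite different stories.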

Description of Output:

- The describe() function returns the count, mean, standard deviation, minimum, quartiles and maximum of employment rate, urban rate and internet use rate across countries.

From the employment rate output, the average employment rate across the 178 countries with data is 58.6%, with a standard deviation of 10.5 percentage points. The median (58.7%) is almost identical to the mean, and half of the countries lie between 51.2% and 65.0%, so most of the values are concentrated around the mean.

The spread of employment rate, measured as the range (maximum minus minimum), is 83.2 - 32.0 = 51.2 percentage points.

- I did a univariate analysis of employment rate by country using a histogram. Note that the histogram is drawn for the subset of countries with an employment rate of at least 50% and an urban rate of at least 40%. The tallest bar (~24 countries) is at an employment rate of around 57%, and about 5 countries fall in the highest bar (~75%).

- I used a scatter plot with a fitted regression line to analyse the association between urban rate (explanatory variable) and internet use rate (response variable). The plot shows a positive, roughly linear association: internet use rate tends to increase as urban rate increases. However, the points are not concentrated along the line of best fit but are widely scattered around it.

- I did a bivariate analysis of the relation between urban rate and life expectancy, with urban rate divided into 4 quartiles and the mean life expectancy plotted as a bar for each quartile. It can be concluded that life expectancy increases as urban rate increases.
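The two bivariate plots can be backed up with numbers. The sketch below is only an illustration: the eight rows are hypothetical values, not taken from gapminder.csv, and the quartile labels Q1 to Q4 are my own shorthand. It computes the Pearson correlation that a scatter plot suggests visually, and the per-quartile means that a bar chart of this kind displays.

```python
import pandas

# Hypothetical values for eight countries, NOT taken from gapminder.csv.
toy = pandas.DataFrame({
    "urbanrate":       [12.0, 25.0, 38.0, 47.0, 59.0, 66.0, 78.0, 93.0],
    "internetuserate": [5.0, 20.0, 10.0, 45.0, 35.0, 70.0, 50.0, 85.0],
    "lifeexpectancy":  [55.0, 58.0, 62.0, 66.0, 70.0, 73.0, 77.0, 80.0],
})

# Pearson correlation: the sign gives the direction of the association,
# and values closer to 1 mean less scatter around the fitted line.
r = toy["urbanrate"].corr(toy["internetuserate"])
print("Pearson r:", round(r, 2))

# Split urban rate at its quartiles, then average life expectancy per group.
toy["UrbanQuartile"] = pandas.qcut(toy["urbanrate"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
quartile_means = toy.groupby("UrbanQuartile", observed=False)["lifeexpectancy"].mean()
print(quartile_means)
```

On the real data, the same two calls would put a number on how scattered the points are and on how much life expectancy differs between quartiles.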

cnamrata · 5 years

Assignment 3- Making Data Management Decisions

To submit: Write a successful program that manages your data, create a blog entry where you post your program and the results/output that displays at least 3 of your data managed variables as frequency distributions. Write a few sentences describing these frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.

I used the Gapminder data set to display the frequency distributions of country level of employment, of employment rate given a minimum urban rate, and of life expectancy given a minimum alcohol consumption.

Python Program for frequency distribution:

--Start of program--

import pandas
import numpy

data = pandas.read_csv('gapminder.csv', low_memory=False)
pandas.set_option('display.float_format', lambda x: '%f' % x)

print("Returns the count of number of countries in each employment level")
c1 = data.groupby("Countrylevelofemployment").size()
print(c1)

print("Returns % distribution of countries in each employment level")
p1 = data.groupby("Countrylevelofemployment").size() * 100 / len(data)
print(p1)

sub1 = data[(data['employrate'] >= 50) & (data['employrate'] <= 100) & (data['urbanrate'] >= 40)]
sub3 = data[(data['lifeexpectancy'] >= 60) & (data['alcconsumption'] > 0.5)]

sub2 = sub1.copy()
sub4 = sub3.copy()

print('Quartile distribution of countries based on employment rate given a level of urban rate')
sub2['Employrate 4'] = pandas.qcut(sub2.employrate, 4, labels=["1-25%tile", "2-50%tile", "3-75%tile", "4-100%tile"])
c2 = sub2['Employrate 4'].value_counts(sort=False, dropna=True)
print(c2)

print('Quartile distribution of countries based on life expectancy given a level of alcohol consumption')
sub4['Life expectancy 4'] = pandas.qcut(sub4.lifeexpectancy, 4, labels=["1-25%tile", "2-50%tile", "3-75%tile", "4-100%tile"])
c3 = sub4['Life expectancy 4'].value_counts(sort=False, dropna=True)
print(c3)

-- End of Program--

Output of the program:

Returns the count of number of countries in each employment level
Countrylevelofemployment
Average Employment Rate (50% to 70%)    114
Data Not Available                       35
High Employment Rate (> 70%)             27
Low Employment Rate (0% to 50%)          37
dtype: int64

Returns % distribution of countries in each employment level
Countrylevelofemployment
Average Employment Rate (50% to 70%)   53.521127
Data Not Available                     16.431925
High Employment Rate (> 70%)           12.676056
Low Employment Rate (0% to 50%)        17.370892
dtype: float64

Quartile distribution of countries based on employment rate given a level of urban rate
1-25%tile     25
2-50%tile     23
3-75%tile     22
4-100%tile    23
Name: Employrate 4, dtype: int64

Quartile distribution of countries based on life expectancy given a level of alcohol consumption
1-25%tile     33
2-50%tile     32
3-75%tile     32
4-100%tile    32
Name: Life expectancy 4, dtype: int64

Description of the program:

The program gives:

a) The number of countries in each employment level and the percentage of countries in each level. Countries with no employment data are reported explicitly under "Data Not Available" (35 countries, 16.4%).

b) Quartile distribution of countries whose employment rate is between 50% and 100% and whose urban rate is at least 40%. As per the output, the countries are spread almost equally across the quartiles (22 to 25 each), which is expected because qcut splits the data at its quartiles. The output excludes countries/rows where data is not available.

c) Quartile distribution of countries whose life expectancy is at least 60 years and whose alcohol consumption is greater than 0.5. As per the output, the countries are again spread almost equally across the quartiles (32 or 33 each). The output excludes countries/rows where data is not available.
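Two things in points b) and c) are easier to see on a small example: rows with missing values drop out of the subset on their own, and the quartile counts come out nearly equal by construction. This is a minimal sketch with hypothetical numbers (not taken from gapminder.csv), using the same filter and qcut calls as the program:

```python
import numpy
import pandas

# Hypothetical values, NOT taken from gapminder.csv; NaN marks missing data.
toy = pandas.DataFrame({
    "employrate": [45.0, 52.0, 55.0, 58.0, 61.0, 63.0, 66.0, 70.0, 74.0, numpy.nan, 80.0],
    "urbanrate":  [60.0, 41.0, 55.0, 72.0, 48.0, 90.0, 65.0, 44.0, 81.0, 70.0, numpy.nan],
})

# Any comparison against NaN is False, so rows with a missing
# employrate or urbanrate never pass the filter.
subset = toy[(toy["employrate"] >= 50)
             & (toy["employrate"] <= 100)
             & (toy["urbanrate"] >= 40)].copy()

# qcut places its cut points at the quartiles of the subset itself,
# so each bin receives roughly a quarter of the rows.
labels = ["1-25%tile", "2-50%tile", "3-75%tile", "4-100%tile"]
subset["Employrate 4"] = pandas.qcut(subset["employrate"], 4, labels=labels)

counts = subset["Employrate 4"].value_counts(sort=False)
print(len(toy), "rows in total,", len(subset), "rows after filtering")
print(counts)
```

Because the bins are defined by the data, equal counts say nothing about the shape of the distribution; the cut points themselves (or a histogram) carry that information.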

cnamrata · 5 years

Assignment 2: Data Management and Visualization

Requirement of Assignment 2: Following completion of your first program, create a blog entry where you post 1) your program 2) the output that displays three of your variables as frequency tables and 3) a few sentences describing your frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.

1. Python program

I ran the program in Python and below is the code:

import pandas
import numpy

data = pandas.read_csv('gapminder.csv', low_memory=False)

print("Returns the count of number of countries in each employment level")
c1 = data.groupby("Countrylevelofemployment").size()
print(c1)

print("Returns % distribution of countries in each employment level")
p1 = data.groupby("Countrylevelofemployment").size() * 100 / len(data)
print(p1)

print("Returns the count of countries falling under various income levels")
c2 = data.groupby("Incomelevel").size()
print(c2)

print("Returns % of countries falling in each income level")
p2 = data.groupby("Incomelevel").size() * 100 / len(data)
print(p2)

print("Returns internet usage level across countries")
c3 = data.groupby("Internetusagelevel").size()
print(c3)

print("Returns % countries in each internet usage bracket")
p3 = data.groupby("Internetusagelevel").size() * 100 / len(data)
print(p3)

print("Returns the number of countries at each level of urbanization")
c4 = data.groupby("Urbanizationlevel").size()
print(c4)

print("Returns % of countries at each level of urbanization")
p4 = data.groupby("Urbanizationlevel").size() * 100 / len(data)
print(p4)

2. Output of the program

Returns % distribution of countries in each employment level
Countrylevelofemployment
Average Employment Rate (50% to 70%)    53.521127
Data Not Available                      16.431925
High Employment Rate (> 70%)            12.676056
Low Employment Rate (0% to 50%)         17.370892
dtype: float64

Returns the count of countries falling under various income levels
Incomelevel
Data Not Available                                  23
High income countries (> $30,000)                   16
Low income countries ($0 to $10000)                143
Mid level income countries ($10,000 to $30,000)     31
dtype: int64

Returns % of countries falling in each income level
Incomelevel
Data Not Available                                 10.798122
High income countries (> $30,000)                   7.511737
Low income countries ($0 to $10000)                67.136150
Mid level income countries ($10,000 to $30,000)    14.553991
dtype: float64

Returns internet usage level across countries
Internetusagelevel
Data Not Available                    21
High internet usage (> 60%)           47
Low internet usage (0% to 30%)        93
Medium internet usage (30% to 60%)    52
dtype: int64

Returns % countries in each internet usage bracket
Internetusagelevel
Data Not Available                     9.859155
High internet usage (> 60%)           22.065728
Low internet usage (0% to 30%)        43.661972
Medium internet usage (30% to 60%)    24.413146
dtype: float64

Returns the number of countries at each level of urbanization
Urbanizationlevel
Data Not Available            10
Developed (> 70%)             64
Developing (40% to 70%)       80
Underdeveloped (0% to 40%)    59
dtype: int64

Returns % of countries at each level of urbanization
Urbanizationlevel
Data Not Available             4.694836
Developed (> 70%)             30.046948
Developing (40% to 70%)       37.558685
Underdeveloped (0% to 40%)    27.699531
dtype: float64

3. Description of the program

I chose the Gapminder dataset, in which none of the variables are absolute counts. The data are either percentages, such as the variable ‘employrate’ (the percentage of a country's population that is employed), or per capita measures, such as ‘incomeperperson’ (the income earned per person). Because these are continuous measures, almost every country has its own value, so a frequency distribution of the raw data would not have given relevant results. Hence, I introduced a few dummy (categorical) variables, which I used for the frequency distributions. They are as follows:

A) Countrylevelofemployment: This variable takes 4 values, namely

Low Employment Rate for countries whose employment rate is between 0 and 50%

Average Employment Rate for countries whose employment rate is between 50 and 70%

High Employment Rate for countries whose employment rate is greater than 70%

Data Not Available for countries which have no information available

B) Incomelevel: This variable takes 4 values, namely

Low income countries for countries whose per capita income is less than $10K

Mid level income countries for countries whose per capita income is between $10K and $30K

High income countries for countries whose per capita income is greater than $30K

Data Not Available for countries which have no information available

C) Internetusagelevel: This variable takes 4 values, namely

Low internet usage for countries whose usage rate is less than 30%

Medium internet usage for countries whose usage rate is between 30 and 60%

High internet usage for countries whose usage rate is greater than 60%

Data Not Available for countries which have no information available

D) Urbanizationlevel: This variable takes 4 values, namely

Underdeveloped for countries whose urbanization rate is less than 40%

Developing for countries whose urbanization rate is between 40 and 70%

Developed for countries whose urbanization rate is greater than 70%

Data Not Available for countries which have no information available
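The post does not show how these category columns were built, so the following is only one possible way to derive such a column in pandas, not necessarily how it was done here. The rates are hypothetical (not taken from gapminder.csv), and the choice to put a value sitting exactly on a boundary (50% or 70%) into the lower band is an assumption.

```python
import numpy
import pandas

# Hypothetical employment rates, NOT taken from gapminder.csv; NaN marks missing data.
toy = pandas.DataFrame({"employrate": [32.0, 50.0, 58.7, 70.0, 83.2, numpy.nan]})

labels = [
    "Low Employment Rate (0% to 50%)",
    "Average Employment Rate (50% to 70%)",
    "High Employment Rate (> 70%)",
]

# Bins are right-closed: (0, 50], (50, 70], (70, 100].
# Assumption: boundary values such as exactly 50% go to the lower band.
level = pandas.cut(toy["employrate"], bins=[0, 50, 70, 100], labels=labels)

# Turn missing rates into an explicit category instead of leaving them as NaN.
level = level.cat.add_categories("Data Not Available").fillna("Data Not Available")
toy["Countrylevelofemployment"] = level

# Same frequency and percentage calculation as in the program above.
counts = toy.groupby("Countrylevelofemployment", observed=False).size()
percent = counts * 100 / len(toy)
print(counts)
print(percent)
```

Keeping "Data Not Available" as its own category means the percentages still add up to 100% of the countries, which matches how the frequency tables in the output are laid out.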

Observation from program output:

Based on the output of the program, the largest category in each distribution is given below:

i) 53.5% of the countries fall under Average Employment Rate (50% to 70%)

ii) 67.1% of countries are low income countries (per capita income of $0 to $10,000)

iii) 43.7% of countries had low internet usage (0% to 30%)

iv) 37.6% of countries were developing (40% to 70% urbanization rate). This is closely followed by the 30.0% of countries which were developed (> 70% urbanization rate) and the 27.7% of countries which were underdeveloped (< 40% urbanization rate).
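The percentages in these observations can be reproduced from the counts alone. As a quick check (the counts are copied from the Urbanizationlevel table in the output; the helper name percent_distribution is made up for illustration):

```python
# Counts copied from the Urbanizationlevel frequency table in the output.
urbanization_counts = {
    "Data Not Available": 10,
    "Developed (> 70%)": 64,
    "Developing (40% to 70%)": 80,
    "Underdeveloped (0% to 40%)": 59,
}


def percent_distribution(counts):
    """Convert a mapping of category counts into percentages of the total."""
    total = sum(counts.values())
    return {level: count * 100 / total for level, count in counts.items()}


percents = percent_distribution(urbanization_counts)

# The category with the largest share is the one reported in point iv).
largest = max(percents, key=percents.get)
print(largest, round(percents[largest], 1))  # Developing (40% to 70%) 37.6
```

The total of 213 includes the 10 countries without data, so each percentage is a share of all rows in the file rather than of the countries with a known urban rate.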

cnamrata · 5 years

Assignment 1 for Data Management and Visualization

Selection of dataset: Gapminder dataset

I would like to select female employment as my research area with the following being my primary research questions:

a) Is female employment rate dependent on total employment rate?

b) Does income per person of a country influence employment of females in that country?

c) Does female employment rate depend on internet use rate in a country?

d) Is urbanization of a country a good indicator of female employment?

Hypothesis- Female employment rate depends on total employment rate of the country, income per person, internet usage and urban rate

Variables to be used:

femaleemployrate

internetuserate

employrate

incomeperperson

urbanrate

Secondary research question:

Is life expectancy of a country dependent on female employment rate?

Hypothesis- Life expectancy in a country is dependent on female employment rate

Variables to be used:

femaleemployrate

lifeexpectancy

References:

1. Ossi Rahkonen, Sara Arber, Eero Lahelma, Pekka Martikainen, Karri Silventoinen. Understanding income inequalities in health among men and women in Britain and Finland. International Journal of Health Services 30 (1), 27-47, 2000.

2. R.N. Ghosh, K.C. Roy. The changing status of women in India. International Journal of Social Economics. ISSN: 0306-8293.
