Course: Data Analysis Tools | Assignment Week 2
The following scripts is an extension of the script of Week 1 Assignment. It investigates into 4 different income groups of gapminder dataset (<15k, <30k, <45k, <60k income) whether the average life expectancy is larger or smaller than 78 years.
The results show that only income group 1 has a statistically robust deviation from the other groups based on the Chi Squared Test.
Console Output (Chi Squared related, for rest see previous Tumblr entry):
INCGROUPS15K 15k 30k 45k 60k
LIFEEXPGROUPS70Y
(45, 78] 136 2 0 0
(78, 90] 8 16 9 1
INCGROUPS15K 15k 30k 45k 60k
LIFEEXPGROUPS70Y
(45, 78] 0.944444 0.111111 0.000000 0.000000
(78, 90] 0.055556 0.888889 1.000000 1.000000
chi-square value, p value, expected counts
(113.1514634839443, 2.30160388506565e-24, 3, array([[115.53488372, 14.44186047, 7.22093023, 0.80232558],
[ 28.46511628, 3.55813953, 1.77906977, 0.19767442]]))
Adjusted p value (Bonferroni Adjustment: 4 groups --> 6 pairwise comparisons
0.008333333333333333
COMP1v2 15k 30k
LIFEEXPGROUPS70Y
(45, 78] 136 2
(78, 90] 8 16
COMP1v2 15k 30k
LIFEEXPGROUPS70Y
(45, 78] 0.944444 0.111111
(78, 90] 0.055556 0.888889
chi-square value, p value, expected counts
(81.5640285326087, 1.696744687700559e-19, 1, array([[122.66666667, 15.33333333],
[ 21.33333333, 2.66666667]]))
COMP1v3 15k 45k
LIFEEXPGROUPS70Y
(45, 78] 136 0
(78, 90] 8 9
COMP1v3 15k 45k
LIFEEXPGROUPS70Y
(45, 78] 0.944444 0.000000
(78, 90] 0.055556 1.000000
chi-square value, p value, expected counts
(67.236328125, 2.4083516525272384e-16, 1, array([[128., 8.],
[ 16., 1.]]))
COMP1v4 15k 60k
LIFEEXPGROUPS70Y
(45, 78] 136 0
(78, 90] 8 1
COMP1v4 15k 60k
LIFEEXPGROUPS70Y
(45, 78] 0.944444 0.000000
(78, 90] 0.055556 1.000000
chi-square value, p value, expected counts
(3.317199130809731, 0.0685585370880382, 1, array([[1.35062069e+02, 9.37931034e-01],
[8.93793103e+00, 6.20689655e-02]]))
COMP2v3 30k 45k
LIFEEXPGROUPS70Y
(45, 78] 2 0
(78, 90] 16 9
COMP2v3 30k 45k
LIFEEXPGROUPS70Y
(45, 78] 0.111111 0.000000
(78, 90] 0.888889 1.000000
chi-square value, p value, expected counts
(0.0675, 0.7950121719642381, 1, array([[ 1.33333333, 0.66666667],
[16.66666667, 8.33333333]]))
COMP2v4 30k 60k
LIFEEXPGROUPS70Y
(45, 78] 2 0
(78, 90] 16 1
COMP2v4 30k 60k
LIFEEXPGROUPS70Y
(45, 78] 0.111111 0.000000
(78, 90] 0.888889 1.000000
chi-square value, p value, expected counts
(0.0, 1.0, 1, array([[ 1.89473684, 0.10526316],
[16.10526316, 0.89473684]]))
COMP3v4 45k 60k
LIFEEXPGROUPS70Y
(78, 90] 9 1
COMP3v4 45k 60k
LIFEEXPGROUPS70Y
(78, 90] 1.000000 1.000000
chi-square value, p value, expected counts
(0.0, 1.0, 0, array([[9., 1.]]))
Code:
Import Libraries
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
smf provides ANOVA F-test
import statsmodels.formula.api as smf
multi includes the package to do Post Hoc multi comparison test
import statsmodels.stats.multicomp as multi
scipy includes the Chi Squared Test of Independence
import scipy.stats
bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
Set Pandas to show all colums and rows in Dataframes
pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)
Import gapminder.csv
data = pandas.read_csv('gapminder.csv', low_memory=False)
Replace all empty entries with 0
data = data.replace(r'^\s*$', numpy.NaN, regex=True)
Extract relevant variables from original dataset and save it in subdata set
print('List of extracted variables in subset')
subdata = data[['incomeperperson', 'lifeexpectancy', 'suicideper100th']]
Safe backup file of reduced dataset
subdata2 = subdata.copy()
Convert all entries to numeric format
subdata2['incomeperperson'] = pandas.to_numeric(subdata2['incomeperperson'])
subdata2['lifeexpectancy'] = pandas.to_numeric(subdata2['lifeexpectancy'])
subdata2['suicideper100th'] = pandas.to_numeric(subdata2['suicideper100th'])
All rows containing value 0 / previously had no entry are deleted from the subdata set
subdata2 = subdata2.dropna()
print(subdata2)
Describe statistical distribution of variable values
print('Statistics on "Income per Person"')
desc_income = subdata2['incomeperperson'].describe()
print(desc_income)
print('Statistics on "Life Expectancy"')
desc_lifeexp = subdata2['lifeexpectancy'].describe()
print(desc_lifeexp)
print('Statistics on "Suicide Rate per 100th"')
desc_suicide = subdata2['suicideper100th'].describe()
print(desc_suicide)
Identify min & max values within each column
print('Minimum & Maximum Income')
min_income = min(subdata2['incomeperperson'])
print(min_income)
max_income = max(subdata2['incomeperperson'])
print(max_income)
print('')
print('Minimum & Maximum Life Expectancy')
min_lifeexp = min(subdata2['lifeexpectancy'])
print(min_lifeexp)
max_lifeexp = max(subdata2['lifeexpectancy'])
print(max_lifeexp)
print('')
print('Minimum & Maximum Suicide Rate')
min_srate = min(subdata2['suicideper100th'])
print(min_srate)
max_srate = max(subdata2['suicideper100th'])
print(max_srate)
print('')
Split up income into percentiles
subdata2['INCGROUPS15K']=pandas.qcut(subdata2.incomeperperson, 5, labels=["1=0%tile","2=15%tile","3=50%tile","4=75%tile","5=100%tile"])
inc_dist_percent = subdata2['INCGROUPS15K'].value_counts(sort=False, normalize=True, dropna=True)
subdata2['INCGROUPS15K'] = subdata2['INCGROUPS15K'].astype('category')
print(inc_dist_percent)
print(subdata2)
subdata2['INCGROUPS15K'] = pandas.cut(subdata2.incomeperperson, [0, 15000, 30000, 45000, 60000])
subdata2['INCGROUPS15K'] = subdata2['INCGROUPS15K'].astype('category')
subdata2['INCGROUPS15K'] = subdata2['INCGROUPS15K'].cat.rename_categories(["15k", "30k", "45k", "60k"])
inc_dist_dollar = subdata2['INCGROUPS15K'].value_counts(sort=False, normalize=True, dropna=True)
subdata2['INCGROUPS15K'] = subdata2['INCGROUPS15K'].astype('category')
print(inc_dist_dollar)
subdata2['LIFEEXPGROUPS70Y'] = pandas.cut(subdata2.lifeexpectancy, [45, 78, 90])
lifeexp_dist = subdata2['LIFEEXPGROUPS70Y'].value_counts(sort=False, normalize=True, dropna=True)
subdata2['LIFEEXPGROUPS70Y'] = subdata2['LIFEEXPGROUPS70Y'].astype('category')
print(lifeexp_dist)
The following cross table compares income and life expectancy of different groups
print('First, simplified comparison of income and life expectancy')
comparison = pandas.crosstab(subdata2['INCGROUPS15K'], subdata2['LIFEEXPGROUPS70Y'])
print(comparison)
Univeriate plots
seaborn.countplot(x='INCGROUPS15K', data=subdata2)
plt.xlabel('Average Income of Income Group')
plt.title('Count distribution of Income per Person')
Biveriate plots
seaborn.catplot(x='INCGROUPS15K', y='lifeexpectancy', data=subdata2, kind="bar", ci=None)
plt.xlabel('Income Group')
plt.ylabel('Life Expectancy')
seaborn.catplot(x='INCGROUPS15K', y='suicideper100th', data=subdata2, kind="bar", ci=None)
plt.xlabel('Income Group')
plt.ylabel('Suicide Rate per 100 persons')
New script section for ANOVA F-Testing
subdata3 = subdata2.dropna()
print ('means for lifeexpectancy by income groups (15K steps)')
m1= subdata3.groupby('INCGROUPS15K').mean()
print (m1)
print ('standard deviations for lifeexpectancy by income groups (15K steps)')
sd1 = subdata3.groupby('INCGROUPS15K').std()
print (sd1)
using ols (ordinary least squares) function for calculating the F-statistic and associated p value / C indicates Categorical Value
model1 = smf.ols(formula='lifeexpectancy ~ C(INCGROUPS15K)', data=subdata2)
results1 = model1.fit()
print (results1.summary())
mc1 = multi.MultiComparison(subdata3['lifeexpectancy'], subdata3['INCGROUPS15K'])
res1 = mc1.tukeyhsd()
print(res1.summary())
New script section for Chi Squared Test of Independence
First part: Chi Squared test of independence of all sample groups (income groups up to 15k, 30k, 45k, 60k).
Life expectancy is split up into <70 and >70 years
subdata4 = subdata2.dropna()
contingency table of observed counts
ct1=pandas.crosstab(subdata4['LIFEEXPGROUPS70Y'], subdata4['INCGROUPS15K'])
print (ct1)
column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
Post Hoc Testing (pairwise)
subdata5 = subdata2.dropna()
recode1 = {'15k': '15k', '30k': '30k'}
subdata5['COMP1v2']= subdata5['INCGROUPS15K'].map(recode1)
Target p-value after Bonferroni Adjustment. Reminder how to adjust: p (Post Hoc Test) = p / number of sample groups that are checked
p = 0.05
p_bonferroni = p / 6
print('')
print('Adjusted p value (Bonferroni Adjustment: 4 groups --> 6 pairwise comparisons')
print(p_bonferroni)
contingency table of observed counts
print('')
ct2=pandas.crosstab(subdata5['LIFEEXPGROUPS70Y'], subdata5['COMP1v2'])
print (ct2)
column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print('')
print(colpct)
print('')
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
Pairwise comparison #2
recode2 = {'15k': '15k', '45k': '45k'}
subdata5['COMP1v3']= subdata5['INCGROUPS15K'].map(recode2)
contingency table of observed counts
print('')
ct3=pandas.crosstab(subdata5['LIFEEXPGROUPS70Y'], subdata5['COMP1v3'])
print (ct3)
column percentages
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print('')
print(colpct)
print('')
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)
Pairwise comparison #3
recode3 = {'15k': '15k', '60k': '60k'}
subdata5['COMP1v4']= subdata5['INCGROUPS15K'].map(recode3)
contingency table of observed counts
print('')
ct4=pandas.crosstab(subdata5['LIFEEXPGROUPS70Y'], subdata5['COMP1v4'])
print (ct4)
column percentages
colsum=ct4.sum(axis=0)
colpct=ct4/colsum
print('')
print(colpct)
print('')
print ('chi-square value, p value, expected counts')
cs4= scipy.stats.chi2_contingency(ct4)
print (cs4)
Pairwise comparison #4
recode4 = {'30k': '30k', '45k': '45k'}
subdata5['COMP2v3']= subdata5['INCGROUPS15K'].map(recode4)
contingency table of observed counts
print('')
ct5=pandas.crosstab(subdata5['LIFEEXPGROUPS70Y'], subdata5['COMP2v3'])
print (ct5)
column percentages
colsum=ct5.sum(axis=0)
colpct=ct5/colsum
print('')
print(colpct)
print('')
print ('chi-square value, p value, expected counts')
cs5= scipy.stats.chi2_contingency(ct5)
print (cs5)
Pairwise comparison #5
recode5 = {'30k': '30k', '60k': '60k'}
subdata5['COMP2v4']= subdata5['INCGROUPS15K'].map(recode5)
contingency table of observed counts
print('')
ct6=pandas.crosstab(subdata5['LIFEEXPGROUPS70Y'], subdata5['COMP2v4'])
print (ct6)
column percentages
colsum=ct6.sum(axis=0)
colpct=ct6/colsum
print('')
print(colpct)
print('')
print ('chi-square value, p value, expected counts')
cs6= scipy.stats.chi2_contingency(ct6)
print (cs6)
Pairwise comparison #6
recode6 = {'45k': '45k', '60k': '60k'}
subdata5['COMP3v4']= subdata5['INCGROUPS15K'].map(recode6)
contingency table of observed counts
print('')
ct7=pandas.crosstab(subdata5['LIFEEXPGROUPS70Y'], subdata5['COMP3v4'])
print (ct7)
column percentages
colsum=ct7.sum(axis=0)
colpct=ct7/colsum
print('')
print(colpct)
print('')
print ('chi-square value, p value, expected counts')
cs7= scipy.stats.chi2_contingency(ct7)
print (cs7)
0 notes
Compared with the original openGauss, Dolphin modifies the time/date function as follows:
The dayofmonth, dayofweek, dayofyear, hour, microsecond, minute, quarter, second, weekday, weekofyear, year, and current_date functions are added.
The curdate, current_time, curtime, current_timestamp, localtime, localtimestamp, now, and sysdate functions are added.
The makedate, maketime, period_add, period_diff, sec_to_time, and subdate functions are added.
The subtime, timediff, time, time_format, timestamp, and timestamppadd functions are added.
The to_days, to_seconds, unix_timestamp, utc_date, utc_time, and utc_timestamp functions are added.
The date_bool and time_bool functions are added.
The dayname, monthname, time_to_sec, month, day, date, week, yearweek functions are added and the last_day function is modified.
The datediff, from_days, convert_tz, date_add, date_sub, adddate, addtime functions are added and the timestampdiff function is modified.
The get_format, date_format, from_unixtime, str_to_date functions are added, and the extract function is modified.
0 notes