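Data Visualization and Z-Test Hypothesis Testing

This article uses matplotlib, the Python plotting library, to explore the "2011 Census results for the regions of England" dataset visually. Based on the resulting charts, it formulates a reasonable hypothesis and then tests that hypothesis with a Z-test.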

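About the Dataset

The dataset used in this experiment is a sample of ages from the 2011 Census for each region of England. It has been prepared as a CSV file with 1000 rows: the first column is the row index, and each remaining column holds the age data for one region. Each data point is a person's age at their last birthday as of the census date (0 means an infant who has not yet turned 1; 35 means a person who has turned 35). One special case: 100 stands for any age from 100 to 114, and the dataset contains no ages above 114.

Data source: Office for National Statistics, 2011 Census aggregate data. DOI: http://dx.doi.org/10.5257/census/aggregate-2011-2

Retrieved from: http://infuse2011.ukdataservice.ac.uk/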

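Loading the Data

Import the libraries used in the rest of the article, read the dataset file with Pandas, and display the first 10 rows, shown in the table below.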

import pandas as pd
import matplotlib.pyplot as plt
import random
import numpy as np
from scipy import stats
df = pd.read_csv("census11sample.csv", index_col=0)
df.head(10)

	North East	North West	Yorkshire and The Humber	East Midlands	West Midlands	East of England	London	South East	South West
0	27	32	65	60	44	79	43	8	9
1	10	36	40	19	52	46	58	87	40
2	82	53	60	34	2	29	18	46	57
3	33	63	25	60	9	26	30	52	74
4	55	98	4	47	58	37	18	67	19
5	27	26	66	70	33	41	38	78	63
6	42	44	49	59	40	65	13	8	52
7	62	44	41	76	8	74	54	40	72
8	34	17	33	71	45	73	1	77	75
9	15	46	37	3	13	41	19	47	69
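
Before going further, it can be worth confirming that a loaded table matches the age coding described above (whole numbers from 0 to 100). The sketch below is only an illustration: `check_age_coding` is a hypothetical helper, not part of the original analysis, and the small `demo` frame (the first three London and South East values from the table above) stands in for `df` so that the block runs without the CSV file.

```python
import pandas as pd

def check_age_coding(frame: pd.DataFrame) -> bool:
    """Return True if every value is a whole number between 0 and 100."""
    values = frame.stack()                   # flatten all regions into one Series
    in_range = values.between(0, 100).all()  # bounds are inclusive by default
    whole = (values % 1 == 0).all()          # no fractional ages
    return bool(in_range and whole)

# Stand-in for df: first three rows of two columns from the table above
demo = pd.DataFrame({"London": [43, 58, 18], "South East": [8, 87, 46]})
print(check_age_coding(demo))
```

On the real data the same call would be `check_age_coding(df)`.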

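Splitting the Dataset

Because we will later formulate a hypothesis from the visualizations and then test it, we need to split the original dataset into two parts to keep the test unbiased: one part for free exploration and plotting, the other for testing the hypothesis. We call these two parts sample_explore and sample_test.

The split works on the DataFrame loaded earlier. We divide the original dataset 1:1: first draw a random half with df.sample, then select the remaining rows with a boolean mask, and finally rebuild the index of both parts.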

# divide the original dataset into 1:1 with no overlapping
sample_explore = df.sample(frac=0.5)
sample_test = df[~df.index.isin(sample_explore.index)]
sample_explore = sample_explore.reset_index(drop=True)
sample_test = sample_test.reset_index(drop=True)
sample_explore

	North East	North West	Yorkshire and The Humber	East Midlands	West Midlands	East of England	London	South East	South West
0	20	78	24	26	3	2	24	74	21
1	43	53	30	60	66	57	12	48	37
2	70	49	80	23	42	71	51	2	22
3	90	69	62	59	24	77	46	47	67
4	52	60	66	5	18	46	65	44	55
…	…	…	…	…	…	…	…	…	…
495	12	4	74	45	33	43	44	66	42
496	43	19	12	47	29	55	51	66	55
497	73	87	41	50	57	2	34	20	78
498	26	19	4	40	76	37	10	58	27
499	7	34	16	49	44	31	51	11	26

500 rows × 9 columns
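
The split above is not seeded, so each run produces different halves and therefore different tables and test results. A minimal sketch of a reproducible variant follows. It rests on assumptions: the seed value 42 is arbitrary, `demo` is random stand-in data rather than the census file, and `DataFrame.drop` is used as an alternative to the boolean mask.

```python
import numpy as np
import pandas as pd

# Stand-in for df: 1000 rows, two regions, random ages (illustrative only)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 101, size=(1000, 2)),
                    columns=["London", "South East"])

# random_state fixes the draw so the same split is produced on every run
explore = demo.sample(frac=0.5, random_state=42)
test = demo.drop(explore.index)  # rows not chosen for exploration

# The two halves are disjoint and together cover the original rows
assert explore.index.intersection(test.index).empty
assert len(explore) + len(test) == len(demo)

explore = explore.reset_index(drop=True)
test = test.reset_index(drop=True)
print(len(explore), len(test))  # 500 500
```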

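Exploring the Data

Next we explore the sample_explore set as thoroughly as we can, hoping to find patterns that lead to an objective, falsifiable hypothesis.

First, compute descriptive statistics for England as a whole and for each region, including the minimum and maximum, median, mean, standard error, variance, and so on.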

# calculate descriptive statistics
sample1d = sample_explore.stack().reset_index(drop=True)
statistics = []
for column in ['Total'] + list(sample_explore.columns):
    c_data = sample1d if column == 'Total' else sample_explore[column]
    statistics.append([c_data.min(), c_data.max(), c_data.median(), ', '.join(str(x) for x in c_data.mode()), c_data.quantile(0.25), c_data.quantile(0.75), c_data.quantile(0.75) - c_data.quantile(0.25), c_data.mean(), c_data.sem(), c_data.var(), c_data.std()])
desc_stats = pd.DataFrame(statistics, index=['Total (England)']+list(sample_explore.columns), columns=['Min','Max','Median', 'Mode', 'Q1','Q3','IQR','Mean','Standard Error','Variance','Standard Deviation'])
print("  ↓ Descriptive statistics of the whole England & different regions of the 'sample_explore' dataset")
desc_stats

↓ Descriptive statistics of the whole England & different regions of the ‘sample_explore’ dataset

	Max	Median	Mode	Q1	Q3	IQR	Mean	Standard Error	Variance	Standard Deviation
Total (England)	100	39.0	47	20.0	58.00	38.00	39.535556	0.350215	551.926497	23.493116
North East	90	39.5	49	20.0	57.25	37.25	39.442000	1.051182	552.491619	23.505140
North West	98	39.0	18	18.0	58.00	40.00	39.058000	1.084249	587.798232	24.244551
Yorkshire and The Humber	100	39.0	47	20.0	57.00	37.00	39.568000	1.052629	554.013403	23.537489
East Midlands	90	38.0	37, 43	20.0	57.00	37.00	39.392000	1.014396	514.499335	22.682578
West Midlands	96	39.0	27, 58	20.0	58.00	38.00	39.942000	1.050327	551.593824	23.486035
East of England	97	40.0	48	21.0	59.25	38.25	40.322000	1.050430	551.701719	23.488332
London	94	34.0	38	17.0	49.00	32.00	34.908000	0.982324	482.480497	21.965439
South East	95	39.0	20	20.0	57.00	37.00	38.828000	1.030495	530.960337	23.042577
South West	98	45.0	47	23.0	64.00	41.00	44.360000	1.099384	604.323046	24.582983
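
Several columns of this table are tied together: the variance is the square of the standard deviation, and the standard error of the mean is the standard deviation divided by √n. The short check below uses the London row and assumes n = 500, the number of rows in sample_explore; the tolerances only allow for the rounding of the printed figures.

```python
import math

n = 500                  # rows per region in sample_explore
sd_london = 21.965439    # "Standard Deviation", London row
var_london = 482.480497  # "Variance", London row
sem_london = 0.982324    # "Standard Error", London row

# Variance is the square of the standard deviation
print(math.isclose(sd_london ** 2, var_london, rel_tol=1e-6))

# Standard error of the mean is SD / sqrt(n)
print(math.isclose(sd_london / math.sqrt(n), sem_london, abs_tol=1e-5))
```

The same relation explains why the "Total (England)" row has a much smaller standard error than any single region: it pools all nine columns, so its n is nine times larger.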

Now for the visualizations. First, plot a histogram of the age distribution across the whole exploration sample.

# style of plots
# matplotlib 3.6 renamed the seaborn styles; on older versions use "seaborn-darkgrid"
plt.style.use("seaborn-v0_8-darkgrid")

# histogram
plt.figure(figsize=(18,10))
plt.title("Histogram of ages in the whole England")
plt.hist(sample1d, bins=50, edgecolor='white', color='darkorange')
plt.xlabel("Age")
plt.ylabel("Number")
plt.show()

Plot a scatter plot of the ages in each region. A small random offset in the x direction spreads the points out so that they overlap less.

# scatter plot
plt.figure(figsize=(18,10))
plt.title("Scatter plot of ages in different regions of England")
for i,column in enumerate(sample_explore.columns):
    plt.scatter([i+random.random()*0.6-0.3 for x in range(len(sample_explore))], sample_explore[column], s=2)
plt.xticks(list(range(len(sample_explore.columns))), sample_explore.columns)
plt.xlabel("Regions")
plt.ylabel("Age")
plt.show()

Plot the mean age of each region with error bars (standard error of the mean).

# means with errorbars
plt.figure(figsize=(18,10))
plt.title("Means and standard errors of ages in different regions of England")
plt.errorbar(list(range(1, 10)), sample_explore.mean(), yerr=sample_explore.sem(), fmt='o', capsize=10, color="darkviolet")
plt.xticks(list(range(1, 10)), sample_explore.columns)
plt.xlabel("Regions")
plt.ylabel("Age")
plt.show()

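Draw a box plot of the ages in each region. A box plot shows the median, the lower and upper quartiles, and the whiskers, which reach to the most extreme data points within 1.5 × IQR below the lower quartile and above the upper quartile.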

# boxplot
plt.figure(figsize=(18,10))
plt.title("Boxplot of ages in different regions of England")
plt.boxplot(sample_explore, labels=list(sample_explore.columns), notch=True, medianprops={'color':'darkorange', 'linewidth':2})
plt.xlabel("Regions")
plt.ylabel("Age")
plt.show()
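
To make the 1.5 × IQR rule concrete, the sketch below computes the whisker limits (often called Tukey fences) by hand. `whisker_limits` is a hypothetical helper written for this illustration; the London quartiles are taken from the descriptive statistics table.

```python
import numpy as np

def whisker_limits(values, k=1.5):
    """Return the fences (Q1 - k*IQR, Q3 + k*IQR) for a 1-D sample."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# London quartiles from the descriptive statistics table: Q1 = 17, Q3 = 49
q1, q3 = 17.0, 49.0
iqr = q3 - q1
print(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # -31.0 97.0
```

The table lists a London maximum of 94, which is below the upper fence of 97, and ages cannot be negative, so the London box should show no outlier points and its whiskers should end at the smallest and largest observed ages.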

Finally, add a bar chart showing the share of each age band in each region.

# grouped bar plot to show age structures
age_structs = [[len(sample_explore[(sample_explore[column] >= 0) & (sample_explore[column] <= 20)]),
                len(sample_explore[(sample_explore[column] >= 21) & (sample_explore[column] <= 45)]),
                len(sample_explore[(sample_explore[column] >= 46) & (sample_explore[column] <= 70)]),
                len(sample_explore[(sample_explore[column] >= 71) & (sample_explore[column] <= 100)])]
                for column in sample_explore.columns]
age_nums = []
for s in age_structs:
    age_nums += s
plt.figure(figsize=(18,10))
plt.title("Age structures of different regions in England")
plt.bar(list(range(0, 45, 5)), [x[0] for x in age_structs], width=0.9, color='dodgerblue', label="0 - 20")
plt.bar(list(range(1, 45, 5)), [x[1] for x in age_structs], width=0.9, color='darkorange', label="21 - 45")
plt.bar(list(range(2, 45, 5)), [x[2] for x in age_structs], width=0.9, color='limegreen', label="46 - 70")
plt.bar(list(range(3, 45, 5)), [x[3] for x in age_structs], width=0.9, color='darkorchid', label="71 - 100")
for a,b in zip([x for x in range(45) if x % 5 != 4], age_nums):
    plt.text(a, b+1, '{:.1f}'.format((b/len(sample_explore))*100)+'%', ha='center', va= 'bottom', fontsize=9)
plt.xlabel("Regions")
plt.ylabel("Number")
plt.xticks([x+1.5 for x in range(0, 41, 5)], sample_explore.columns)
plt.legend()
plt.show()
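
The band counts that feed this chart can also be obtained with `pd.cut`, which assigns each age to a labelled interval. This is an alternative sketch, not the method used above; `demo` holds the first five London and South East ages from the sample_explore table so that the block runs on its own.

```python
import pandas as pd

# Same four age bands as the bar chart; each bin is (left, right]
bins = [-1, 20, 45, 70, 100]
labels = ["0 - 20", "21 - 45", "46 - 70", "71 - 100"]

# Stand-in data: first five rows of two columns of sample_explore
demo = pd.DataFrame({"London": [24, 12, 51, 46, 65],
                     "South East": [74, 48, 2, 47, 44]})

# Count ages per band for each region, keeping the band order fixed
counts = pd.DataFrame({
    col: pd.cut(demo[col], bins=bins, labels=labels)
           .value_counts(sort=False)
           .reindex(labels)
    for col in demo.columns
})

# Convert counts to percentages of each region's sample
shares = counts / len(demo) * 100
print(shares)
```

On the real data, replacing `demo` with `sample_explore` would give the full table of shares in one step.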

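Formulating the Hypothesis

Hypothesis: the mean age of the population of London is lower than the mean age of the population of the South East of England.
Evidence: in the plot of means, London's mean and its error bar lie clearly below those of the South East, with a gap of roughly two error-bar lengths between them.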
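
The visual gap can be put into numbers using the means and standard errors from the descriptive statistics table. This is only a rough check on the exploration sample, meant to motivate the hypothesis; it is not the test itself.

```python
import math

# Means and standard errors read from the descriptive statistics table
mean_london, sem_london = 34.908, 0.982324
mean_south_east, sem_south_east = 38.828, 1.030495

# Difference between the two means
gap = mean_south_east - mean_london

# Standard error of that difference (the two samples are independent)
combined_sem = math.sqrt(sem_london ** 2 + sem_south_east ** 2)

ratio = gap / combined_sem
print(round(gap, 2), round(ratio, 2))
```

The gap is about 3.92 years, or about 2.75 combined standard errors, which is large enough to be worth testing formally on the held-out data.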

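Test Plan

The histogram of the overall age distribution above suggests that the age distribution in England roughly follows a normal distribution. In addition, each region contributes several hundred observations, so by the central limit theorem the sample means are approximately normal even where the raw ages are not. I therefore plan to use a Z-test to compare the mean ages of the two regions in the hypothesis and so check my hypothesis.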
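
Why a Z-test can still be reasonable when the raw ages are not exactly normal can be illustrated with a small simulation. The population below is a deliberately non-normal stand-in (ages spread evenly over 0 to 100), not the census data, and the seed and repeat count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deliberately non-normal stand-in population: ages spread evenly over 0..100
population = rng.integers(0, 101, size=100_000)

# Draw many samples of 500 (with replacement) and record each sample mean
n, repeats = 500, 2000
means = rng.choice(population, size=(repeats, n)).mean(axis=1)

# Central limit theorem: the means cluster around the population mean
# with a spread close to sigma / sqrt(n)
expected_sd = population.std() / np.sqrt(n)
print(round(float(means.std()), 2), round(float(expected_sd), 2))
```

The two printed values should be close to each other, which is the behaviour the Z statistic relies on.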

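Null hypothesis: the mean age of the population of London is equal to the mean age of the population of the South East of England.
Alternative hypothesis: the mean age of the population of London is lower than the mean age of the population of the South East of England.
Confidence level: 95% (α = 0.05)

The hypothesis test is carried out on the sample_test dataset, where we try to support the alternative hypothesis by rejecting the null hypothesis. Given the direction of the alternative hypothesis, this calls for a left-tailed test.

Specifically, we compare the test statistic with the critical z value. If the statistic is below the critical value (equivalently, p-value < α), we can reject the null hypothesis in favor of the alternative hypothesis, which means the hypothesis I originally proposed passes the test.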

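Z-Test

The Z-test, also known as the U-test, is a hypothesis test commonly used to compare two sample means, or one sample mean against a constant. It applies to random variables that follow, or approximately follow, a normal distribution, and judges from the given samples whether a difference in expected values is significant.

When comparing the means of two samples, the Z-score is computed as follows: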

$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

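Here $\bar{x}_1$ and $\bar{x}_2$ are the means of the two samples, $s_1^2$ and $s_2^2$ are the sample variances, and $n_1$ and $n_2$ are the sample sizes. In other words, Z is the ratio of the difference between the two sample means to the combined standard error, where $SEM_i = s_i / \sqrt{n_i}$ is the standard error of the mean of sample $i$:

$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{SEM_1^2 + SEM_2^2}}$$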

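Under the null hypothesis, the Z-score follows the standard normal distribution. For a left-tailed test, the critical value is therefore the point $x$ at which the cumulative distribution function of the standard normal distribution equals $\alpha$, that is, $ppf(\alpha)$.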
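
The relationship between the critical value and the cumulative distribution function can be checked directly with `scipy.stats.norm`, whose `ppf` is the inverse of `cdf`. The right-tailed value is included only for contrast; the test in this article uses the left tail.

```python
from scipy import stats

alpha = 0.05

# Left-tailed critical value: the point with area alpha to its left
critical_value = stats.norm.ppf(alpha)
print(round(critical_value, 4))  # -1.6449

# ppf undoes cdf: the area to the left of the critical value is alpha
print(round(stats.norm.cdf(critical_value), 4))  # 0.05

# A right-tailed test would use the mirror-image point
right_critical = stats.norm.ppf(1 - alpha)
print(round(right_critical, 4))  # 1.6449
```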

Note that the Z-test is only suitable for large samples, generally a sample size $\ge 30$; otherwise Student's t-test should be used.
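
The formula can be wrapped in a small function and compared with the t-test mentioned above. This is a sketch under stated assumptions: `two_sample_z` is a hypothetical helper, the two samples are made-up normal draws rather than census data, and the comparison uses Welch's unequal-variance t-test from SciPy (`alternative="less"` requires SciPy 1.6 or later). Welch's statistic has the same form as Z; only the reference distribution differs, and with samples this large the two p-values nearly coincide.

```python
import numpy as np
from scipy import stats

def two_sample_z(a, b):
    """Left-tailed two-sample Z-test; returns (z statistic, p-value)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # ddof=1 gives the sample variance, matching pandas' default .var()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return z, stats.norm.cdf(z)

# Made-up samples for illustration only
rng = np.random.default_rng(1)
a = rng.normal(35, 22, size=500)
b = rng.normal(39, 23, size=500)

z, p_z = two_sample_z(a, b)

# Welch's t-test: same statistic, t reference distribution instead of normal
t, p_t = stats.ttest_ind(a, b, equal_var=False, alternative="less")
print(z, t)
print(p_z, p_t)
```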

Running the Hypothesis Test

Following the test plan, carry out the test and report the decision.

# get test data from sample_test dataset
londonSample = sample_test['London']
southEastSample = sample_test['South East']

# set alpha value and calculate critical value
alpha = 0.05
critical_value = stats.norm.ppf(alpha)
critical_value

# calculate the z test statistic by formula
meanLondon = londonSample.mean()
meanSouthEast = southEastSample.mean()
meanDiff = meanLondon - meanSouthEast

varLondon = londonSample.var()
varSouthEast = southEastSample.var()

combined_sem = np.sqrt(varLondon/len(londonSample) + varSouthEast/len(southEastSample))

z_stat = meanDiff / combined_sem

# compare z_stat and critical value to decide the result
print("alpha value is", alpha)
print("critical value is", critical_value)
print("z statistic is", z_stat)
print("result of left tail test:")
if (z_stat < critical_value):
    print("reject Null hypothesis")
else:
    print("failed to reject Null hypothesis")

alpha value is 0.05
critical value is -1.6448536269514729
z statistic is -2.967612239842122
result of left tail test:
reject Null hypothesis
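
The same decision can be expressed as a p-value. The sketch below converts the z statistic printed above into a left-tailed p-value with `stats.norm.cdf`; it reuses only numbers already shown in the output.

```python
from scipy import stats

alpha = 0.05
z_stat = -2.967612239842122  # z statistic printed above

# Left-tailed p-value: probability of a z at least this low under H0
p_value = stats.norm.cdf(z_stat)
print(p_value)
print(p_value < alpha)  # True
```

The p-value is roughly 0.0015, well below α = 0.05, which agrees with the comparison against the critical value.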

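Conclusion

Using a classical hypothesis test, the Z-test, at the 95% confidence level (α = 0.05), the null hypothesis is rejected on the test sample. This supports the alternative hypothesis that the mean age in London is lower than the mean age in the South East of England.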


RyanXin
