Implementing kNN and Naive Bayes
Task requirements
Implement classification algorithms in Python:
- Do not use ready-made library implementations such as scikit-learn.
- Implement at least one of k-nearest neighbors, naive Bayes, logistic regression, decision trees, or support vector machines. kNN and naive Bayes are relatively simple; logistic regression, decision trees, and SVMs are harder.
- Demonstrate classification on the breast cancer dataset by calling the functions you wrote.
- Stronger students may implement several algorithms.
Algorithm implementation: kNN
Implement the kNN algorithm on the breast_cancer data.
- Load the dataset and split it into a training set and a test set.
- Implement the kNN algorithm:
  - For each instance in the test set, compute its distance to every point in the training set.
  - For the chosen value of k, assign the label that is most common among the k nearest points.
- Test the algorithm's accuracy and compare it against the kNN classifier provided by scikit-learn.
Source code
Loading the data
```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import neighbors

# Load the breast cancer dataset and hold out 30% of it for testing.
dataset = datasets.load_breast_cancer()
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
k = 5
y_predict = []
```
kNN implementation
```python
def knn(X_train, y_train, X_test, y_predict):
    '''Predict a label for each test instance using Euclidean distance;
    the predictions are compared against y_test later.'''
    for test_data in X_test:
        # first_k_instance holds the k nearest (index, distance) pairs seen
        # so far, kept sorted by distance in ascending order.
        first_k_instance = []
        for i in range(len(X_train)):
            distance = 0
            for attributes_no in range(len(X_train[0])):
                distance += (test_data[attributes_no] - X_train[i][attributes_no]) ** 2
            Euclid_distance = distance ** 0.5
            if i < k:
                first_k_instance.append((i, Euclid_distance))
            elif Euclid_distance < first_k_instance[k - 1][1]:
                # Closer than the current k-th nearest: replace it.
                first_k_instance[k - 1] = (i, Euclid_distance)
            first_k_instance = sorted(first_k_instance, key=lambda x: x[1])
        # Majority vote among the k nearest neighbours (0 = malignant, 1 = benign).
        benign = 0
        malignant = 0
        for instance in first_k_instance:
            if y_train[instance[0]] == 0:
                malignant += 1
            else:
                benign += 1
        if malignant >= benign:
            y_predict.append(0)
        else:
            y_predict.append(1)
```
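One design note: the loop above re-sorts a k-element list on every training point. An equivalent, shorter formulation (only a sketch; `knn_heap` is a hypothetical name, not part of the code above) computes all distances and lets `heapq` pick the k smallest:

```python
import heapq
from collections import Counter

def knn_heap(X_train, y_train, test_data, k=5):
    # Distance from test_data to every training point, paired with its label.
    dists = [(sum((a - b) ** 2 for a, b in zip(test_data, row)) ** 0.5, y_train[i])
             for i, row in enumerate(X_train)]
    # Take the k smallest distances, then the majority label among them.
    nearest = heapq.nsmallest(k, dists, key=lambda d: d[0])
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```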
Accuracy calculation
```python
def accuracy(y_predict, y_test):
    # Count how many predictions match the ground-truth labels.
    correct = 0
    for i in range(len(y_predict)):
        if y_predict[i] == y_test[i]:
            correct += 1
    accuracy_rate = correct / len(y_predict)
    return correct, accuracy_rate
```
Main function
```python
def main():
    knn(X_train, y_train, X_test, y_predict)
    correct, accuracy_rate = accuracy(y_predict, y_test)
    print(y_predict)
    print("Accuracy of our kNN on the test set: %.3f" % accuracy_rate)
    # Compare against scikit-learn's reference implementation.
    KNN = neighbors.KNeighborsClassifier(n_neighbors=5)
    KNN.fit(X_train, y_train)
    print("Accuracy of sklearn's kNN on the test set: %.3f" % KNN.score(X_test, y_test))

if __name__ == '__main__':
    main()
```
[0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1]
Accuracy of our kNN on the test set: 0.947
Accuracy of sklearn's kNN on the test set: 0.947
The results show that our kNN matches the accuracy of scikit-learn's kNN on this split.
Accuracy could be tuned further by varying k or by changing the strategy for finding similar samples (for example, replacing Euclidean distance with a matching coefficient or the Jaccard coefficient), as sketched below.
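A minimal sketch of that metric swap, assuming the features are first binarized (for example at each attribute's mean, as in the naive Bayes section below); `jaccard_distance` is an illustrative helper, not part of the code above:

```python
def jaccard_distance(a_bin, b_bin):
    # a_bin, b_bin: 0/1 feature vectors, e.g. binarized at each attribute's mean.
    both = sum(1 for x, y in zip(a_bin, b_bin) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a_bin, b_bin) if x == 1 or y == 1)
    # Jaccard distance = 1 - |intersection| / |union|;
    # two all-zero vectors count as identical.
    return (1 - both / either) if either else 0.0
```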
Algorithm implementation: Naive Bayes
Implement the Naive Bayes algorithm on the breast_cancer data.
- Load the dataset and split it into a training set and a test set.
- Implement the Naive Bayes algorithm:
  - Split each continuous attribute into intervals, and count how many positive and negative examples fall into each interval of each attribute.
  - Compute the probability values and predict labels for the test set.
- Test the algorithm's accuracy and compare it against the Naive Bayes classifier provided by scikit-learn.
Source code
Loading the data
```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes

# Same split as in the kNN experiment.
dataset = datasets.load_breast_cancer()
X = dataset.data
y = dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
y_predict = []
```
All 30 attributes are continuous, so before applying naive Bayes we split each attribute's range into intervals and estimate the probability of an instance falling into each interval. Here every attribute is split into two intervals at its mean.
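As an aside, this mean-split binning has a one-line vectorized equivalent (a sketch, assuming X_train is a NumPy array, which it is when loaded via sklearn; the counting code below does not rely on it):

```python
# 1 where a value lies above its attribute's training mean, 0 otherwise.
X_train_bin = (X_train > X_train.mean(axis=0)).astype(int)
```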
Discretizing each continuous attribute and counting per-class occurrences
```python
def distribution(X_train, y_train):
    '''Split each attribute's range into intervals first, then count
    how many instances of each class fall into each interval.'''
    attributes_max_min_mean = []
    for i in range(len(X_train[0])):
        # section accumulates [max, min, sum -> mean] for attribute i.
        section = [X_train[0][i], X_train[0][i], 0]
        for instance in X_train:
            if instance[i] > section[0]:
                section[0] = instance[i]
            if instance[i] < section[1]:
                section[1] = instance[i]
            section[2] += instance[i]
        section[2] /= len(X_train)
        attributes_max_min_mean.append(section)
    instance_distribution = []
    for i in range(len(X_train[0])):
        smaller_benign = 0
        larger_benign = 0
        smaller_malignant = 0
        larger_malignant = 0
        for j in range(len(X_train)):
            # Bin at the attribute's mean: above vs. not above.
            if X_train[j][i] > attributes_max_min_mean[i][2]:
                if y_train[j] == 1:
                    larger_benign += 1
                else:
                    larger_malignant += 1
            elif y_train[j] == 1:
                smaller_benign += 1
            else:
                smaller_malignant += 1
        instance_distribution.append([smaller_benign, larger_benign,
                                      smaller_malignant, larger_malignant])
    return instance_distribution, attributes_max_min_mean
```
Naive Bayes implementation
```python
def Naive_Bayes(X_test, y_predict, instance_distribution, attributes_max_min_mean):
    for test_data in X_test:
        # Class sizes, recovered from the first attribute's interval counts.
        malignant = instance_distribution[0][2] + instance_distribution[0][3]
        benign = instance_distribution[0][0] + instance_distribution[0][1]
        p_xc0 = 1
        p_xc1 = 1
        for i in range(len(test_data)):
            if test_data[i] > attributes_max_min_mean[i][2]:
                # Attribute falls in the "above the mean" interval.
                p_xc0 *= instance_distribution[i][3] / malignant
                p_xc1 *= instance_distribution[i][1] / benign
            else:
                p_xc0 *= instance_distribution[i][2] / malignant
                p_xc1 *= instance_distribution[i][0] / benign
        # Posterior is proportional to class-conditional likelihood times prior.
        p0 = p_xc0 * malignant / (malignant + benign)
        p1 = p_xc1 * benign / (malignant + benign)
        if p0 > p1:
            y_predict.append(0)
        else:
            y_predict.append(1)
```
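One caveat about the products above: multiplying 30 probabilities can underflow toward zero. A standard guard, sketched here with the same data layout (`log_posteriors` is an illustrative name; it assumes every interval count is non-zero so the logarithms are defined):

```python
import math

def log_posteriors(test_data, instance_distribution, attributes_max_min_mean,
                   benign, malignant):
    # Work in log space: the product of probabilities becomes a sum of logs.
    log_p0 = math.log(malignant / (malignant + benign))
    log_p1 = math.log(benign / (malignant + benign))
    for i in range(len(test_data)):
        above = test_data[i] > attributes_max_min_mean[i][2]
        # Same interval counts as in Naive_Bayes: indices 2/3 are malignant,
        # 0/1 are benign; odd indices are the "above the mean" interval.
        log_p0 += math.log(instance_distribution[i][3 if above else 2] / malignant)
        log_p1 += math.log(instance_distribution[i][1 if above else 0] / benign)
    return log_p0, log_p1
```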
Accuracy calculation
```python
def accuracy(y_predict, y_test):
    # Same helper as in the kNN section.
    correct = 0
    for i in range(len(y_predict)):
        if y_predict[i] == y_test[i]:
            correct += 1
    accuracy_rate = correct / len(y_predict)
    return correct, accuracy_rate
```
Main function
```python
def main():
    instance_distribution, attributes_max_min_mean = distribution(X_train, y_train)
    Naive_Bayes(X_test, y_predict, instance_distribution, attributes_max_min_mean)
    correct, accuracy_rate = accuracy(y_predict, y_test)
    print(y_predict)
    print("Accuracy of our Naive Bayes on the test set: %.3f" % accuracy_rate)
    # Compare against scikit-learn's Gaussian naive Bayes.
    bayes = naive_bayes.GaussianNB()
    bayes.fit(X_train, y_train)
    print("Accuracy of sklearn's Naive Bayes on the test set: %.3f" % bayes.score(X_test, y_test))

if __name__ == '__main__':
    main()
```
[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1]
Accuracy of our Naive Bayes on the test set: 0.930
Accuracy of sklearn's Naive Bayes on the test set: 0.924
On this split, our Naive Bayes scores slightly higher than the Naive Bayes provided by scikit-learn.
Accuracy could be tuned further by trying different interval splits for each attribute. Note that the scikit-learn model compared against here is GaussianNB, which does not discretize at all: it models each continuous attribute with a per-class Gaussian, and on this split that assumption happens to fit slightly worse than our mean-split binning.
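One concrete refinement, again only a sketch: if some interval contains no training examples of one class, the corresponding factor is zero and wipes out the entire product. Laplace (add-one) smoothing avoids this; `smoothed_prob` is a hypothetical helper, not part of the code above:

```python
def smoothed_prob(count, class_total, n_intervals=2):
    # Add-one (Laplace) smoothing: no interval ever gets probability exactly zero.
    return (count + 1) / (class_total + n_intervals)
```

Each `instance_distribution[i][...] / malignant` or `... / benign` factor in Naive_Bayes would then become `smoothed_prob(instance_distribution[i][...], malignant)` and so on.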