수정
상단노출
1. (9쪽) 자료 문자열
a1 = 'I like math. I like science as well, especially physics.' a2 = 'I prefer math and geometry to social science. And I don’t like sports.' a3 = 'I’m afraid but I don’t like math nor science.' a4 = 'I like cooking and my favorite food is Korean food.'
2. (9쪽) 전처리 작업 코드
data_list = [a1, a2, a3, a4] mining_list = [] for data in data_list : data = data.replace("’", "") data = data.replace(",","") data = data.replace(".", "") data = data.lower() mining_list.append(data) for new_data in mining_list : print(new_data)
3. (10쪽) 벡터화 코드
from sklearn.feature_extraction.text import CountVectorizer vect = CountVectorizer() vect.fit(corpus) vect_dict = vect.vocabulary_ word_list = [] index_list = [] for tup in sorted(vect_dict.items()) : word_list.append(tup[0]) index_list.append(tup[1]) print(sorted(vect_dict.items()), end=' ')
4. (11쪽) 각 문장 벡터로 변환하는 과정 코드
vectors = [] u = a1 + ' ' + a2 + ' ' + a3 + ' ' + a4 for i in corpus : a = vect.transform([i]).toarray() vectors.append(a) u_ = vect.transform([u]).toarray() vectors.append(u_) namelist = ['A1', 'A2', 'A3', 'A4', 'U' ] print('---------- 단어 사전 ----------') print(sorted(vect_dict.items()), end = ' ') print(""" """) print('---------- 벡터 ----------') for name, vector in zip(namelist, vectors) : print(name, ':', vector)
5. (12쪽) 표로 표현하는 과정 코드
import pandas as pd con_dict = { 'word' : word_list, 'A1' : vectors[0][0], 'A2' : vectors[1][0], 'A3' : vectors[2][0], 'A4' : vectors[3][0], 'U' : vectors[4][0] } a = pd.DataFrame(con_dict) a
6. (13쪽) 문서 빈도수 출력 코드
df_list = [] for k in range(0, len(word_list)) : df_list.append(len(vectors)-1) for i in range(0, len(vectors[0][0])) : for vector in vectors : if vector[0][i] == 0 : df_list[i] -= 1 print(df_list)
7. (13쪽) 상대도수 출력 코드
n = len(vectors) - 1 relative_freq_list = [ df/n for df in df_list ] relative_freq_dict = { 'word' : word_list, 'Relative_Freq' : relative_freq_list } temp = pd.DataFrame(relative_freq_dict) temp
8. (14쪽) 역문서 빈도수 출력 코드
idf_list = [ 1/relative_freq for relative_freq in relative_freq_list] idf_dict = { 'word' : word_list, 'Relative_Freq' : relative_freq_list, 'Inverse DF' : idf_list } temp = pd.DataFrame(idf_dict) temp
9. (15쪽) TFxIDF 출력 코드
tfidf_list = [tf*idf for tf, idf in zip(vectors[0][0], idf_list)] tfidf_dict = { 'word' : word_list, 'TF' : vectors[0][0], 'IDF' : idf_list, 'TFxIDF' : tfidf_list } temp = pd.DataFrame(tfidf_dict) temp
10. (23쪽) 유사도 분석 코드
import pandas as pd a1 = 'I like math. I like science as well, especially physics.' a2 = "I prefer math and geometry to social science. And I don't like sports." a3 = "I’m afraid but I don’t like math nor science." a4 = 'I like cooking and my favorite food is Korean food.' data_list = [a1, a2, a3, a4] mining_list = [] for data in data_list : data = data.replace("'", "") data = data.replace(",", "") data = data.replace(".", "") data = data.lower() mining_list.append(data) corpus = [a1, a2, a3, a4] = [new_data for new_data in mining_list] from sklearn.feature_extraction.text import CountVectorizer vect = CountVectorizer() vect.fit(corpus) vect_dict = vect.vocabulary_ word_list = [] index_list = [] vectors = [] u = a1 + ' '+ a2 + ' '+ a3 + ' '+ a4 for i in corpus : a = vect.transform([i]).toarray() vectors.append(a) vectors
11. (24쪽) 리스트 간소화
vectors_list = [] for vector_array in vectors : vectors_list.append(list(vector_array[0])) [v1,v2,v3,v4] = vectors_list print(v1,v2,v3,v4)
12. (24쪽) 유클리드 유사도 함수로 만드는 코드
import math # Euclidean Similarity e(p,q) def e(p,q) : val = 0 for k in range(len(p)): val += (p[k]-q[k])*(p[k]-q[k]) val = math.sqrt(val) return round(val,2)
13. (25쪽) 상호유사도행렬 출력, 시각화 코드
import seaborn as sns import matplotlib.pyplot as plt euclidean_m = [] temp = [] for a in vectors_list : for b in vectors_list : temp.append(e(a,b)) euclidean_m.append(temp) temp = [] print('Euclidean Similarity Table :') for k in euclidean_m: print(k) labels = ['v1', 'v2', 'v3', 'v4'] df = sns.heatmap(euclidean_m, cmap='Reds', annot = True, xticklabels = labels, yticklabels = labels) plt.show()
14. (27쪽) 코사인 유사도 분석 코드
# Cosine Similarity c(p,q) def c(p,q) : val = 0 upp = 0 down_1 = 0 down_2 = 0 for k in range(len(p)) : upp += p[k]*q[k] down_1 += p[k]*p[k] down_2 += q[k]*q[k] down_1 = math.sqrt(down_1) down_2 = math.sqrt(down_2) val += upp val /= down_1*down_2 return round(val,2)
15. (27쪽) 코사인 유사도 분석 상호 유사도 행렬 및 시각화 코드
cosine_m = [] temp = [] for a in vectors_list : for b in vectors_list : temp.append(c(a,b)) cosine_m.append(temp) temp = [] print('Cosine Similarity Table :') for k in cosine_m: print(k) labels = ['v1', 'v2', 'v3', 'v4'] df = sns.heatmap(cosine_m, cmap='Reds', annot = True, xticklabels = labels, yticklabels = labels) plt.show()
전송