How to look up the corresponding rows in a DataFrame using the indices obtained from cosine similarity

tfidf python dataframe indice

Wilson_mak 2020-08-22 17:53:37 ‧ 1505 views


I have a DataFrame named sample_df with 3910 rows x 9 columns.

body_text is one column (body_text) extracted from sample_df.
question is the question entered by the user.


question_cleaned = processed_query(question)

tfidf_article = tfidf_vector.fit_transform(body_text)
tfidf_question = tfidf_vector.transform(question)

print(f'Shape of tfidf_article:{tfidf_article.shape}, Shape of tifdf_question:{tfidf_question.shape}')

cosine_similarities = linear_kernel(tfidf_article,tfidf_question).flatten()
print(f'length of cosine similarities: {len(cosine_similarities)}')

  
docs_indices = cosine_similarities.argsort()[::-1][:5]
print(f'doc_indices:{docs_indices}') 

score = cosine_similarities[docs_indices]
print(f'score: {score}')

The output is as follows:

Shape of tfidf_article:(3910, 414), Shape of tifdf_question:(7, 414)
length of cosine similarities: 27370
doc_indices:[22357 13724  7781 16496 19156]
score: [0.80544066 0.74396796 0.68608747 0.67502091 0.66033177]

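My question: after computing the cosine similarities, how can I get indices that let me look up the corresponding rows in sample_df? As the output shows, some of the values in doc_indices are larger than 3910 (the length of sample_df).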
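
A note on where the large indices likely come from: the printed shapes are (3910, 414) and (7, 414), so linear_kernel returns a 3910 x 7 matrix, and .flatten() turns it into 3910 * 7 = 27370 scores. The argsort positions therefore index the flattened matrix, not the rows of sample_df. Below is a minimal sketch of mapping such positions back to row numbers. It uses a tiny made-up DataFrame and a hand-written similarity matrix (both hypothetical stand-ins, not the real data), and assumes body_text was taken from sample_df in its original row order.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sample_df (the real one has 3910 rows).
sample_df = pd.DataFrame({"body_text": ["doc a", "doc b", "doc c", "doc d"]})

# Hypothetical similarity matrix shaped (n_docs, n_question_rows), playing the
# role of linear_kernel(tfidf_article, tfidf_question) before .flatten().
sim = np.array([
    [0.1, 0.0, 0.2],
    [0.0, 0.9, 0.1],
    [0.3, 0.0, 0.0],
    [0.0, 0.4, 0.8],
])

# Flattening yields n_docs * n_question_rows scores, so argsort can return
# positions that are larger than the number of documents.
flat = sim.flatten()
top_flat = flat.argsort()[::-1][:3]

# Map each flat position back to (document row, question row).
doc_rows, query_rows = np.unravel_index(top_flat, sim.shape)

# iloc is positional, which matches the row order of the TF-IDF matrix.
matches = sample_df.iloc[doc_rows]
print(top_flat, doc_rows, query_rows)
print(matches)
```

With a 3910 x 7 matrix, the same mapping is equivalent to flat_index // 7 for the document row and flat_index % 7 for the question row. The shape (7, 414) also suggests that question is being treated as 7 separate documents (for example a list of 7 tokens). If it is meant to be a single query, passing it as a one-element list, such as tfidf_vector.transform([question_cleaned]) with question_cleaned being one string, should give shape (1, 414), and the argsort indices would then line up with the rows of sample_df directly. That is an assumption about what processed_query returns, which the post does not show.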