I have a DataFrame named sample_df with 3910 rows and 9 columns.
body_text is one column extracted from sample_df (the body_text column), and question is the question entered by the user.
question_cleaned = processed_query(question)
tfidf_article = tfidf_vector.fit_transform(body_text)
tfidf_question = tfidf_vector.transform(question)
print(f'Shape of tfidf_article:{tfidf_article.shape}, Shape of tifdf_question:{tfidf_question.shape}')
cosine_similarities = linear_kernel(tfidf_article,tfidf_question).flatten()
print(f'length of cosine similarities: {len(cosine_similarities)}')
docs_indices = cosine_similarities.argsort()[::-1][:5]
print(f'doc_indices:{docs_indices}')
score = cosine_similarities[docs_indices]
print(f'score: {score}')
The output is as follows:
Shape of tfidf_article:(3910, 414), Shape of tifdf_question:(7, 414)
length of cosine similarities: 27370
doc_indices:[22357 13724 7781 16496 19156]
score: [0.80544066 0.74396796 0.68608747 0.67502091 0.66033177]
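My question: after computing the cosine similarities, how can I get indices that let me look up the corresponding rows in sample_df? As the output shows, some of the doc_indices values are larger than 3910, the length of sample_df.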
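A possible reading of the numbers above: 27370 equals 3910 x 7, so the flattened array holds one score per (document, question row) pair rather than one per document. The (7, 414) shape suggests the vectorizer received seven items. The sketch below assumes that the question was a list of tokens; that is a guess, not something the post confirms. It uses a tiny made-up sample_df and a made-up question_tokens list, first to show how a flat index maps back to a document row, and then to show one way to get exactly one score per row by passing the whole question as a single document.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-in for sample_df (the real one has 3910 rows).
sample_df = pd.DataFrame({
    "body_text": [
        "vaccine trial results for the new virus",
        "stock market falls as rates rise",
        "virus transmission in crowded indoor spaces",
        "recipe for slow cooked beef stew",
    ]
})
body_text = sample_df["body_text"]

tfidf_vector = TfidfVectorizer()
tfidf_article = tfidf_vector.fit_transform(body_text)

# Assumption: the question reached transform() as a list of tokens,
# so every token became its own row.
question_tokens = ["how", "does", "the", "virus", "spread", "indoors"]

# Reproduce the symptom: one column of scores per token.
per_token = linear_kernel(tfidf_article, tfidf_vector.transform(question_tokens))
flat_top = per_token.flatten().argsort()[::-1][:3]

# A flat index i into an (n_docs, n_tokens) array corresponds to
# document row i // n_tokens and token row i % n_tokens.
doc_rows, token_rows = np.unravel_index(flat_top, per_token.shape)

# Possible fix: treat the whole question as ONE document (a list of length 1),
# so the similarity array has exactly one entry per row of sample_df.
question_text = " ".join(question_tokens)
tfidf_question = tfidf_vector.transform([question_text])
cosine_similarities = linear_kernel(tfidf_article, tfidf_question).flatten()

docs_indices = cosine_similarities.argsort()[::-1][:2]
score = cosine_similarities[docs_indices]

# docs_indices are positional, so iloc is the matching lookup.
top_rows = sample_df.iloc[docs_indices]
print(top_rows.assign(score=score))
```

With a single-row question matrix, every value in docs_indices is a valid position in sample_df. If sample_df has a non-default index, for example after filtering, `iloc` still works because the positions follow the order in which body_text was passed to `fit_transform`.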