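This post continues from the previous one. The crawler and the data cleaning and analysis methods were all covered in earlier posts, so this one goes straight to the analysis itself.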
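import os
import re
from collections import Counter
import jieba
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

def tokenize(sentence):
    # Segment the text with jieba, lowercase each term, and drop stop words.
    terms = []
    if pd.notnull(sentence):
        for term in jieba.cut(sentence):
            term = term.lower()
            if term not in stops:  # stops: stop-word collection, assumed defined in the previous post
                terms.append(term)
    return terms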
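os.makedirs('day19_pic', exist_ok=True)  # savefig fails if the output folder is missing
# One word cloud per group, built from every term in that group's articles.
for g in set(df['group']):
    print(g)
    # .copy() avoids pandas' SettingWithCopyWarning when the new column is added.
    df_content = df[df['group'] == g][['text_content']].copy()
    df_content['processed'] = df_content['text_content'].apply(tokenize)
    total_terms = []
    for terms in df_content['processed']:
        total_terms.extend(terms)
    # A font with Chinese glyphs (here simsun.ttf) is needed to render Chinese terms.
    wordcloud = WordCloud(font_path="simsun.ttf", background_color='white')
    wordcloud.generate_from_frequencies(frequencies=Counter(total_terms))
    plt.figure(figsize=(10, 10))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.savefig(os.path.join('day19_pic', 'wordcloud_' + g + '_chinese'), bbox_inches='tight', pad_inches=0)
    plt.show()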
# Same word cloud again, restricted to terms made up only of the letters a-z.
for g in set(df['group']):
    print(g)
    df_content = df[df['group'] == g][['text_content']].copy()
    df_content['processed'] = df_content['text_content'].apply(tokenize)
    english_terms = []
    for terms in df_content['processed']:
        for term in terms:
            # fullmatch keeps a term only when the whole string is lowercase letters.
            if re.fullmatch(r'[a-z]+', term):
                english_terms.append(term)
    wordcloud = WordCloud(font_path="simsun.ttf", background_color='white')
    wordcloud.generate_from_frequencies(frequencies=Counter(english_terms))
    plt.figure(figsize=(10, 10))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.savefig(os.path.join('day19_pic', 'wordcloud_' + g + '_English'), bbox_inches='tight', pad_inches=0)
    plt.show()
# Count how often each language or framework keyword appears in each group.
langs = sorted(["nodejs", "node", "reactjs", "react", "js",
                "python", "javascript", "ruby", 'java', 'c',
                'c#', 'angularjs', 'angular', 'typescript',
                'd3', 'd3js', 'sql', 'html', 'css', 'jquery',
                'go', 'vue', 'vuejs', 'r'])
group_lang_count = {}
for g in set(df['group']):
    print(g)
    df_content = df[df['group'] == g][['text_content']].copy()
    df_content['processed'] = df_content['text_content'].apply(tokenize)
    total_terms = []
    for terms in df_content['processed']:
        for term in terms:
            # Letters-only filter: 'c#', 'd3' and 'd3js' can never pass it,
            # so those keywords always end up with a count of 0.
            if re.fullmatch(r'[a-z]+', term):
                total_terms.append(term)
    # Build the counter once per group; a Counter returns 0 for missing keys.
    term_counts = Counter(total_terms)
    lang_count = {lang: term_counts[lang] for lang in langs}
    print(lang_count)
    group_lang_count[g] = lang_count
# Rows are groups, columns are keywords.
df_lang = pd.DataFrame(list(group_lang_count.values()), index=group_lang_count.keys())
df_lang
group | angular | angularjs | c | c# | css | d3 | d3js | go | html | java | javascript | jquery | js | node | nodejs | python | r | react | reactjs | ruby | sql | typescript | vue | vuejs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DevOps | 0 | 0 | 1 | 0 | 41 | 0 | 0 | 4 | 21 | 13 | 1 | 0 | 22 | 67 | 6 | 7 | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 0 |
Security | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
AI&MachineLearning | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 28 | 0 | 8 | 0 | 14 | 13 | 9 | 59 | 5 | 0 | 0 | 0 | 5 | 6 | 0 | 0 |
自我挑戰組 | 13 | 2 | 57 | 0 | 144 | 0 | 0 | 10 | 119 | 47 | 96 | 31 | 79 | 49 | 3 | 76 | 21 | 31 | 0 | 16 | 16 | 14 | 26 | 8 |
SoftwareDevelopment | 0 | 0 | 55 | 0 | 55 | 0 | 0 | 43 | 78 | 120 | 22 | 4 | 92 | 16 | 26 | 63 | 8 | 4 | 0 | 14 | 32 | 0 | 0 | 0 |
ModernWeb | 181 | 8 | 87 | 0 | 474 | 0 | 0 | 116 | 407 | 51 | 699 | 57 | 730 | 163 | 18 | 35 | 4 | 313 | 22 | 34 | 32 | 115 | 198 | 4 |
DataTechnology | 1 | 0 | 6 | 0 | 2 | 0 | 0 | 9 | 19 | 12 | 7 | 0 | 22 | 10 | 17 | 118 | 76 | 15 | 0 | 4 | 35 | 6 | 7 | 0 |
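
The `c#`, `d3` and `d3js` columns are zero for every group. A likely reason is the letters-only check applied before counting: a term is kept only when the whole string matches `[a-z]+`, so any keyword containing a digit or `#` is discarded no matter how often it appears. The sketch below reproduces that check in isolation using only the standard library; `keeps_term` is a hypothetical helper name that does not appear in the original code.

```python
import re


def keeps_term(term):
    """Return True when the term passes the letters-only filter used before counting."""
    match_eng = re.match(r'[a-z]+', term)
    # The match must cover the entire term, not just a leading run of letters.
    return match_eng is not None and match_eng.group(0) == term


for keyword in ['python', 'c', 'c#', 'd3', 'd3js']:
    print(keyword, keeps_term(keyword))
```

Widening the pattern to admit digits or `#` would change the counts, and whether `c#` even reaches the filter as a single term depends on how jieba segments it, so the zeros are probably best read as "not measured" rather than "never mentioned".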
# One pie chart per group: each keyword's share of that group's keyword mentions.
for g in set(df['group']):
    df_lang.loc[g].plot(kind='pie', autopct='%.2f', title=g, fontsize=10)
    plt.title(g, fontsize=20)
    plt.savefig(os.path.join('day19_pic', 'lang_pie_' + g), bbox_inches='tight', pad_inches=0)
    plt.show()
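
To make the pie labels concrete, here is a small sketch of the arithmetic behind `autopct='%.2f'`: each slice is its count divided by the row total, times 100. It uses the non-zero counts from the Security row of the keyword table above; `pie_shares` is a hypothetical helper written for illustration, not something matplotlib or the original code provides.

```python
# Non-zero keyword counts taken from the Security row of the table above.
security_counts = {'c': 11, 'html': 10, 'node': 1, 'python': 1, 'r': 4, 'sql': 1}


def pie_shares(counts):
    """Convert raw counts into percentages of the total, as a pie chart does."""
    total = sum(counts.values())
    return {name: 100 * value / total for name, value in counts.items()}


shares = pie_shares(security_counts)
for name, share in shares.items():
    print(f'{name}: {share:.2f}')
```

With a row total of only 28 mentions, `c` lands at 39.29 and `html` at 35.71, so the Security pie rests on far fewer mentions than, say, the ModernWeb one and should be read with that in mind.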
writer_name | article_length | group | corpus_title | writer_url |
---|---|---|---|---|
IcodesoIam. | 1917.3333333333333 | AI&MachineLearning | 以100張圖理解NeuralNetwork--觀念與實踐系列 | https://ithelp.ithome.com.tw/users/20001976/ironman |
shortid | 1426.25 | AI&MachineLearning | 機器學習你也可以-文組帶你手把手實做機器學習聖經系列 | https://ithelp.ithome.com.tw/users/20107850/ironman |
DuranHsieh | 970.421052631579 | AI&MachineLearning | 利用MSBotframework與CognitiveService建構自用智慧小秘書系列 | https://ithelp.ithome.com.tw/users/20091494/ironman |
AlanTsai | 2384.5 | DataTechnology | DataScience到底是什麼-從一個完全外行角度來看系列 | https://ithelp.ithome.com.tw/users/20083151/ironman |
arksu | 2224.3333333333335 | DataTechnology | MicrosoftRSolution系列 | https://ithelp.ithome.com.tw/users/20103333/ironman |
bymiachang | 1699.25 | DataTechnology | DataSciencewithAzure系列 | https://ithelp.ithome.com.tw/users/20103632/ironman |
帥哥 | 1715.857142857143 | DevOps | Openstack學習與介紹系列 | https://ithelp.ithome.com.tw/users/20103615/ironman |
zxcvbnius | 1684.6666666666667 | DevOps | Kubernetes30天學習筆記系列 | https://ithelp.ithome.com.tw/users/20103753/ironman |
James | 1494.75 | DevOps | k8s不自賞系列 | https://ithelp.ithome.com.tw/users/20107062/ironman |
Wellwind | 3990.25 | ModernWeb | AngularMaterial完全攻略系列 | https://ithelp.ithome.com.tw/users/20020617/ironman |
KuroHsu | 2839.6315789473683 | ModernWeb | 重新認識JavaScript系列 | https://ithelp.ithome.com.tw/users/20065504/ironman |
Arel | 2715.25 | ModernWeb | JS30錄系列 | https://ithelp.ithome.com.tw/users/20107212/ironman |
WLLO | 1913.6666666666667 | Security | 鯊魚咬電纜:30天玩Wireshark系列 | https://ithelp.ithome.com.tw/users/20107304/ironman |
frankyzyao | 1640.3333333333333 | Security | 資安分析師的轉職升等之路系列 | https://ithelp.ithome.com.tw/users/20084806/ironman |
sunallen | 1230.25 | Security | 為了明日的重開機系列 | https://ithelp.ithome.com.tw/users/20006132/ironman |
nonerkao | 2728.25 | SoftwareDevelopment | 與妖精共舞:在RISC-V架構上使用GO語言實作binutils工具包系列 | https://ithelp.ithome.com.tw/users/20103524/ironman |
HelloWorld | 2120.25 | SoftwareDevelopment | 系統架構秘辛:了解RISC-V架構底層除錯器的秘密!系列 | https://ithelp.ithome.com.tw/users/20107327/ironman |
fntsrlike | 2040.75 | SoftwareDevelopment | 淺談軟體開發與工程的基本追求系列 | https://ithelp.ithome.com.tw/users/20103676/ironman |
M157q | 2076.6666666666665 | 自我挑戰組 | M157q的待業程式生活日誌系列 | https://ithelp.ithome.com.tw/users/20107813/ironman |
willyc20 | 1439.0 | 自我挑戰組 | 區塊鏈報明牌系列 | https://ithelp.ithome.com.tw/users/20107460/ironman |
ntausr4 | 1317.875 | 自我挑戰組 | 資訊技術解戈迪安繩結系列 | https://ithelp.ithome.com.tw/users/20107621/ironman |
writer_name | like_count | group | corpus_title | writer_url |
---|---|---|---|---|
IcodesoIam. | 1.6666666666666667 | AI&MachineLearning | 以100張圖理解NeuralNetwork--觀念與實踐系列 | https://ithelp.ithome.com.tw/users/20001976/ironman |
GoatWang | 0.5882352941176471 | AI&MachineLearning | 玩轉資料與機器學習-以自然語言處理為例系列 | https://ithelp.ithome.com.tw/users/20107576/ironman |
shortid | 0.5 | AI&MachineLearning | 機器學習你也可以-文組帶你手把手實做機器學習聖經系列 | https://ithelp.ithome.com.tw/users/20107850/ironman |
plusone | 1.0 | DataTechnology | 使用Python進行資料分析系列 | https://ithelp.ithome.com.tw/users/20107514/ironman |
JasonKuan(CapillaryJ) | 0.8 | DataTechnology | 你都在公司都在幹啥R?R語言資料分析經驗分享系列 | https://ithelp.ithome.com.tw/users/20107299/ironman |
polo | 0.5 | DataTechnology | GraphQL+ApolloData入門系列 | https://ithelp.ithome.com.tw/users/20103438/ironman |
blackie1019 | 2.3333333333333335 | DevOps | AmazonCloudService30dayschallenge系列 | https://ithelp.ithome.com.tw/users/20083507/ironman |
zxcvbnius | 1.3333333333333333 | DevOps | Kubernetes30天學習筆記系列 | https://ithelp.ithome.com.tw/users/20103753/ironman |
cythilya | 0.8333333333333334 | DevOps | Nightwatch101:使用Nightwatch實現End-to-EndTesting系列 | https://ithelp.ithome.com.tw/users/20092232/ironman |
sfisonly | 4.666666666666667 | ModernWeb | 前端工程師養成手冊系列 | https://ithelp.ithome.com.tw/users/20040221/ironman |
vibertthio | 2.6666666666666665 | ModernWeb | aesthEtic,CYBERのaudio/VISUAL,網頁中的聲音與影像研究系列 | https://ithelp.ithome.com.tw/users/20107828/ironman |
Wellwind | 2.25 | ModernWeb | AngularMaterial完全攻略系列 | https://ithelp.ithome.com.tw/users/20020617/ironman |
frankyzyao | 1.0 | Security | 資安分析師的轉職升等之路系列 | https://ithelp.ithome.com.tw/users/20084806/ironman |
evanslify | 0.3333333333333333 | Security | 網路安全概述系列 | https://ithelp.ithome.com.tw/users/20107704/ironman |
虎虎 | 0.2777777777777778 | Security | CEH之越挫越勇系列 | https://ithelp.ithome.com.tw/users/20103647/ironman |
Howard | 1.4782608695652173 | SoftwareDevelopment | 爬蟲始終來自於墮性系列 | https://ithelp.ithome.com.tw/users/20107159/ironman |
微中子 | 1.4166666666666667 | SoftwareDevelopment | 來做個網路瀏覽器吧!Let'sbuildawebbrowser!系列 | https://ithelp.ithome.com.tw/users/20103745/ironman |
Wolke | 1.3333333333333333 | SoftwareDevelopment | MsBotframework30天上手系列 | https://ithelp.ithome.com.tw/users/20046160/ironman |
rabbitlai | 2.75 | 自我挑戰組 | 工程師職災的認識與預防系列 | https://ithelp.ithome.com.tw/users/20107803/ironman |
royal801991 | 1.5833333333333333 | 自我挑戰組 | 使用PHP串接金流相關API系列 | https://ithelp.ithome.com.tw/users/20107301/ironman |
我是一支小小小小鳥 | 1.4166666666666667 | 自我挑戰組 | GAME30天系列 | https://ithelp.ithome.com.tw/users/20107379/ironman |
writer_name | browse_count | group | corpus_title | writer_url |
---|---|---|---|---|
Bonny | 601.0 | DataTechnology | Python學習筆記系列 | https://ithelp.ithome.com.tw/users/20107290/ironman |
Wolke | 570.8181818181819 | DataTechnology | MicrosoftBotFramework30天上手系列 | https://ithelp.ithome.com.tw/users/20046160/ironman |
stana | 509.1578947368421 | DataTechnology | "Hadoopecosystem工具簡介,安裝教學與各種情境使用系列" | https://ithelp.ithome.com.tw/users/20107349/ironman |
cythilya | 2663.25 | DevOps | Nightwatch101:使用Nightwatch實現End-to-EndTesting系列 | https://ithelp.ithome.com.tw/users/20092232/ironman |
AkitoSun | 655.4 | DevOps | 大型敏捷專案的DevOps系列 | https://ithelp.ithome.com.tw/users/20094400/ironman |
yangj26952 | 604.6842105263158 | DevOps | 用30天來介紹和使用Docker系列 | https://ithelp.ithome.com.tw/users/20103456/ironman |
etrexkuo | 8985.666666666666 | ModernWeb | 只要有心,人人都可以做卡米狗系列 | https://ithelp.ithome.com.tw/users/20107309/ironman |
riven | 1482.0 | ModernWeb | 「放下屠龍刀!論開發者如何與設計師打交道」系列 | https://ithelp.ithome.com.tw/users/20107565/ironman |
sfisonly | 1073.2222222222222 | ModernWeb | 前端工程師養成手冊系列 | https://ithelp.ithome.com.tw/users/20040221/ironman |
虎虎 | 678.5 | Security | CEH之越挫越勇系列 | https://ithelp.ithome.com.tw/users/20103647/ironman |
Fu-sheng | 556.8421052631579 | Security | 資安的學習心得及分享系列 | https://ithelp.ithome.com.tw/users/20107445/ironman |
wkpeng | 536.2105263157895 | Security | IT安全稽核系列 | https://ithelp.ithome.com.tw/users/20107482/ironman |
Wolke | 1019.0 | SoftwareDevelopment | MsBotframework30天上手系列 | https://ithelp.ithome.com.tw/users/20046160/ironman |
Howard | 803.0 | SoftwareDevelopment | 爬蟲始終來自於墮性系列 | https://ithelp.ithome.com.tw/users/20107159/ironman |
阿志 | 747.8888888888889 | SoftwareDevelopment | 30天快樂學習FunctionalProgramming系列 | https://ithelp.ithome.com.tw/users/20103386/ironman |
ocom | 1448.8333333333333 | 自我挑戰組 | 如何用電商一個月從0賺到100萬系列 | https://ithelp.ithome.com.tw/users/20107558/ironman |
Kyle | 787.0 | 自我挑戰組 | 自己動手實作新創空間系列 | https://ithelp.ithome.com.tw/users/20006680/ironman |
我是一支小小小小鳥 | 726.9166666666666 | 自我挑戰組 | GAME30天系列 | https://ithelp.ithome.com.tw/users/20107379/ironman |
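
Each of the three tables above lists up to three writers per group, ordered from the highest value down, and the fractional values suggest per-writer averages over their articles. The code that produced them is not shown, so what follows is only a sketch of one way to obtain such a ranking. It uses the standard library and made-up records; `top_writers`, the group names and the writer names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-article records: (group, writer_name, browse_count).
articles = [
    ('GroupA', 'alice', 100), ('GroupA', 'alice', 300),
    ('GroupA', 'bob', 150),
    ('GroupA', 'carol', 50),
    ('GroupA', 'dave', 400),
    ('GroupB', 'erin', 20),
    ('GroupB', 'frank', 80),
]


def top_writers(records, n=3):
    """Average the metric per (group, writer), then keep the n highest per group."""
    per_writer = defaultdict(list)
    for group, writer, value in records:
        per_writer[(group, writer)].append(value)

    per_group = defaultdict(list)
    for (group, writer), values in per_writer.items():
        per_group[group].append((writer, mean(values)))

    # Sort each group's writers by their average, highest first, and truncate.
    return {
        group: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for group, rows in per_group.items()
    }


ranking = top_writers(articles)
for group, rows in ranking.items():
    print(group, rows)
```

With the `df` from the earlier posts, the same idea would presumably be expressed as a pandas group-by on group and writer followed by a sort, but that is an assumption about code the article does not show.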