[Python]Natural Language Toolkit

第 12 屆 iThome 鐵人賽

DAY 29

自我挑戰組

軟體開發隨筆雜記--試著解決問題系列第 28 篇

12th鐵人賽 python nltk nlp text

KaliChen

2020-10-14 10:30:45

2298 瀏覽

分享至

http://www.nltk.org/
NLTK 是一個主流用於自然語言處理的 Python 庫

import nltk
nltk.download()

pip3 install html5lib

Collecting html5lib
  Downloading https://files.pythonhosted.org/packages/6c/dd/a834df6482147d48e225a49515aabc28974ad5a4ca3215c18a882565b028/html5lib-1.1-py2.py3-none-any.whl (112kB)
     |████████████████████████████████| 112kB 328kB/s
Collecting webencodings
  Downloading https://files.pythonhosted.org/packages/f4/24/2a3e3df732393fed8b3ebf2ec078f05546de641fe1b667ee316ec1dcf3b7/webencodings-0.5.1-py2.py3-none-any.whl
Requirement already satisfied: six>=1.9 in c:\python37\lib\site-packages (from html5lib) (1.13.0)
Installing collected packages: webencodings, html5lib
Successfully installed html5lib-1.1 webencodings-0.5.1

import nltk

#使用 NLTK 刪除停止詞
from nltk.corpus import stopwords

#使用 urllib模組來抓取網頁
import urllib.request
response = urllib.request.urlopen('http://php.net/')
html = response.read()
print (html)

b'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en">\n<head>\n\n
.
.
.
<span id="toTopHover"></span><img width="40" height="40" alt="To Top" src="/images/to-top@2x.png"></a>\n\n</body>\n</html>\n'

#去掉HTML標記，將抓取的網頁轉換為乾淨的文字
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"html5lib")
text = soup.get_text(strip=True)
print (text)

PHP: Hypertext PreprocessorDownloadsDocumentationGet InvolvedHelpGetting StartedIntroductionA simple tutorialLanguage ReferenceBasic syntaxTypesVariablesConstantsExpressionsOperatorsControl
.
.
.
The list of changes is recorded in theChangeLog.Older News EntriesUpcoming conferencesPHP Conference China 2020International PHP Conference Munich 2020International PHP Conference Berlin 2020PHP.RUHR 2020 - Web Development & Digital CommerceUser Group EventsSpecial ThanksSocial media@official_phpCopyright © 2001-2020 The PHP GroupMy PHP.netContactOther PHP.net sitesPrivacy policy

#將文字分詞
tokens = [t for t in text.split()]
print (tokens)

['PHP:', 'Hypertext', 'PreprocessorDownloadsDocumentationGet', 'InvolvedHelpGetting', 'StartedIntroductionA', 'simple', 'tutorialLanguage', 'ReferenceBasic',...
..., 'media@official_phpCopyright', '©', '2001-2020', 'The', 'PHP', 'GroupMy', 'PHP.netContactOther', 'PHP.net', 'sitesPrivacy', 'policy']

#通過對列表中的標記進行遍歷並刪除其中的停止詞
clean_tokens = tokens[:]
sr = stopwords.words('english')
for token in tokens:
  if token in stopwords.words('english'):
    clean_tokens.remove(token)

#使用 Python NLTK 來計算每個詞的出現頻率。NLTK 中的FreqDist( ) 函式可以實現詞頻統計的功能
freq = nltk.FreqDist(tokens)
for key,val in freq.items():
  print (str(key) + ':' + str(val))

PHP::1
Hypertext:1
PreprocessorDownloadsDocumentationGet:1
InvolvedHelpGetting:1
StartedIntroductionA:1
simple:1
tutorialLanguage:1
ReferenceBasic:1
syntaxTypesVariablesConstantsExpressionsOperatorsControl:1
StructuresFunctionsClasses:1
and:45
.
.
.
2001-2020:1
GroupMy:1
PHP.netContactOther:1
PHP.net:1
sitesPrivacy:1
policy:1