Open the text of the US Declaration of Independence from the folder:
>>> import re
>>> usa = open('../txt/usa_en.txt', encoding='utf8').read()
Use re.split() to break the text into a list of words:
>>> print(re.split(r'\W+', usa))
['Declaration', 'of', 'Independence', 'The', 'Unanimous', 'Declaration', 'of', 'the', 'Thirteen', 'United', 'States', 'of', 'America', 'When', 'in', 'the', 'course', 'of', 'human', 'events', 'it', 'becomes', 'necessary', 'for', 'one', 'people',...
With r'\W+' the apostrophe counts as a delimiter, so nature's God is split into ['nature', 's', 'God'].
Switching to r'\s+' splits only on whitespace, keeping each non-space token intact:
>>> re.split(r'\s+', usa)
['Declaration',
'of',
'Independence',
...
'of',
"nature's",
'God',
'entitle',
...
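To see the difference side by side, here is a minimal comparison (the sample string is invented for illustration):

```python
import re

text = "the laws of nature's God"

# \W+ splits on every non-word character, so the apostrophe breaks the word
print(re.split(r'\W+', text))   # ['the', 'laws', 'of', 'nature', 's', 'God']

# \s+ splits only on whitespace, keeping "nature's" intact
print(re.split(r'\s+', text))   # ['the', 'laws', 'of', "nature's", 'God']
```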
>>> s = 'colorless'
>>> print(s[:4]+'u'+s[4:])
colourless
>>> monty = '12345678901234567890'
>>> monty[10:2:-3]
'185'
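The extended slice above walks the string backwards: starting at index 10, stepping by -3, and stopping before index 2. A small sketch spelling out which indices get visited:

```python
monty = '12345678901234567890'

# A slice s[start:stop:step] with a negative step walks backwards;
# here it visits indices 10, 7, 4 and stops before reaching index 2.
indices = list(range(10, 2, -3))
print(indices)                              # [10, 7, 4]
print(''.join(monty[i] for i in indices))   # '185'
print(monty[10:2:-3])                       # '185' -- same result
```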
[a, an, the]
(I don't understand this one.)
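Assuming the bracketed [a, an, the] refers to the textbook exercise on matching the English articles, one possible sketch uses a word-boundary alternation so that 'the' does not match inside words like 'theatre' (the sample sentence is invented):

```python
import re

sentence = "The owl and an otter saw a theatre near the lake"

# \b keeps 'the' from matching inside 'theatre' and 'a' inside 'and';
# re.IGNORECASE also catches the sentence-initial 'The'.
articles = re.findall(r'\b(?:a|an|the)\b', sentence, flags=re.IGNORECASE)
print(articles)   # ['The', 'an', 'a', 'the']
```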
- Write a utility function that takes a URL as its argument and returns the contents of that URL with all HTML markup removed. Use from urllib import request and request.urlopen('http://nltk.org/').read().decode('utf8') to access the contents of the URL.
```python
>>> def cleanhtml(url=""):
...     from urllib import request
...     from bs4 import BeautifulSoup
...     html = request.urlopen(url).read().decode('utf8')
...     # name a parser explicitly to avoid BeautifulSoup's "no parser" warning
...     raw = BeautifulSoup(html, 'html.parser').get_text()
...     return raw
>>> cleanhtml('http://nltk.org/')
```
>>> def load(f=""):
...     # read() already returns a str, so no join is needed;
...     # a with-block also makes sure the file gets closed
...     with open('../txt/' + f, encoding='utf8') as fp:
...         return fp.read()
>>> load('usa_en.txt')
Note that r"n't|\w+{2}" raises re.error ("multiple repeat"); a lookahead is needed to split don't into 'do' + "n't":
>>> re.findall(r"\w+(?=n't)|n't|\w+", "don't")
['do', "n't"]
Write a regular expression to recognize words hyphenated across a line break; the expression will need to include the \n character.
Use re.sub() to remove the \n character from these words.
How might you decide which words should not keep the hyphen once the newline has been removed, e.g. 'encyclo-\npedia'?
>>> import nltk
>>> txt = "encyclo-\npedia long-\nterm"
>>> tlist = re.findall(r'\w+-\n\w+', txt)
>>> sublist = [re.sub(r'\n', "", w) for w in tlist]
>>> english_vocab = set(w.lower() for w in nltk.corpus.words.words())
>>> for s in sublist:
...     # drop the hyphen only when the joined form is a real English word
...     if re.sub(r'-', "", s) in english_vocab:
...         s = re.sub(r'-', "", s)
...     print(s)
encyclopedia
long-term
Reference: Natural Language Processing with Python, 2nd edition (Chinese translation) https://usyiyi.github.io/nlp-py-2e-zh/