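As the title says, I'm currently learning web scraping.
I want to run sentiment analysis on the scraped results, which are saved as a text file (CSV).
However, the file contains a large number of emoji, which prevents the sentiment analysis from running.
So I'm looking for a way to strip the emoji out beforehand.
The code I've tried so far is below, but it still doesn't run. I'd appreciate any pointers from more experienced people.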
import re
import pandas as pd

def demoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U00010000-\U0010ffff"
        "]+", flags=re.UNICODE)
    return(emoji_pattern.sub(r'', text))

data = pd.read_csv('test.csv',encoding='utf-8', sep='\t') # read tsv file
data[u'header'] = data[u'header'].astype(str)
data[u'header'] = data[u'header'].apply(lambda x:demoji(x))
data.to_csv('output.csv',index=False, encoding='utf-8') # save to csv file
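The error shown is KeyError: 'header'
I don't have enough experience to debug this on my own, so I'd be grateful for help from anyone who has dealt with it. Thanks a lot!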
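The code is basically fine. You just forgot to create the column name, so the DataFrame has no 'header' column to look up. Here is my own attempt: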
import re
import pandas as pd

def demoji(text):
    # Character class listing the code point ranges to delete
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # U+24C2 through U+1F251 (very wide range)
        u"\U00010000-\U0010ffff"  # all supplementary-plane characters
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub('', text)

# Read the whole file as a single string
with open('data.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Build the DataFrame with an explicit 'header' column
data = pd.DataFrame({'header': [text]})
#data = pd.read_csv('data.txt',encoding='utf-8')
data['header'] = data['header'].apply(demoji)
data.to_csv('output.csv', index=False, encoding='utf-8')
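One caveat about the pattern above: the range \U000024C2-\U0001F251 runs through the CJK blocks, so it also deletes Chinese, Japanese and Korean text. That matters if the scraped text is not English. Below is a minimal sketch of a narrower pattern. The names `EMOJI_PATTERN`, `WIDE_PATTERN` and `strip_emoji` are my own, and the chosen ranges are an illustrative, non-exhaustive selection rather than a complete emoji definition.

```python
import re

# Assumption: an illustrative, non-exhaustive set of emoji-related ranges.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\U00002600-\U000026FF"  # miscellaneous symbols
    "\U00002700-\U000027BF"  # dingbats
    "\U0000FE0F"             # variation selector-16
    "\U0000200D"             # zero-width joiner
    "]+"
)

# The wide range from the original pattern, kept only for comparison.
WIDE_PATTERN = re.compile("[\U000024C2-\U0001F251]+")


def strip_emoji(text):
    """Remove characters matched by EMOJI_PATTERN and keep everything else."""
    return EMOJI_PATTERN.sub("", text)


sample = "今天天氣很好\U0001F600,出去玩\U0001F680!"
print(strip_emoji(sample))           # Chinese text and punctuation are kept
print(repr(WIDE_PATTERN.sub("", "今天")))  # the wide range removes the Chinese too
```

If the text is English only, the original pattern works as it is. For mixed-language data, extend the narrow list with any further ranges you actually see in your data.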
Contents of data.txt:
In case you’ve been sleeping for the past twenty years, emoji usage has been going ???. By mid-2015, half of all comments on Instagram included an emoji. Hollywood released a full feature-length film titled The Emoji Movie. Even Google’s CEO Sundar Pichai is posting about urgent fixes to the hamburger emoji.
For some, emoji have caused frustration for users (how the heck are you supposed to use the ? emoji?). Yet for many others, emoji has opened up a fascinating new medium of communication. There are even emoji charade-esque “games” where users can guess a movie title based on a series of emoji. (try these: ?? or ???⚡). But what happens when you push emoji a step further?
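To apply the same idea to the original CSV instead of a plain text file: `data['header']` raises KeyError whenever no column has exactly that name, for example when the file has no header row or the separator does not match. Below is a minimal sketch, assuming the file holds one text per line with no header row. The in-memory string `raw` is a hypothetical stand-in for test.csv, and `remove_emoji` is my own helper name.

```python
import io
import re

import pandas as pd

# Hypothetical stand-in for the scraped file: one text per line, no header row.
raw = "great movie \U0001F600\nterrible service \U0001F621\n"

# Removes every supplementary-plane character (one of the ranges used in demoji).
ASTRAL_PATTERN = re.compile("[\U00010000-\U0010FFFF]+")


def remove_emoji(text):
    return ASTRAL_PATTERN.sub("", text)


# header=None tells pandas the first line is data, not column names;
# names=["header"] supplies the column label the rest of the code expects.
data = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=["header"])

# Inspect the real column names before indexing; this is the quickest KeyError check.
print(data.columns.tolist())

# Clean each row, then trim the whitespace left behind by removed emoji.
data["header"] = data["header"].astype(str).apply(remove_emoji).str.strip()
print(data)
```

With a real file, replace `io.StringIO(raw)` with the file path and set `sep` to whatever delimiter the file actually uses. If the file does have a header row, drop `header=None` and `names`, and use the name that `data.columns` reports.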