iT邦幫忙

0

巢狀結構json檔如何轉csv

這是我們要做專題所需要用到的資料
資料部分是用kaggle
2022_boston_gov_opendata : https://data.boston.gov/dataset/food-establishment-inspections/resource/4582bec6-2b4f-4f9e-bc55-cbaa73117f4c

因為檔案有6G多所以這邊先將前100筆放進list裡

import json
path = "yelp_academic_dataset_review.json"
reviewList = []
i = 0
with open(path, 'r', encoding='utf-8') as f:
    try:
        while i<100:
            line_data = f.readline()
            if line_data:
                data = json.loads(line_data)
                reviewList.append(data)
                i+=1
            else:
                break
    except Exception as e:
        print(e)
        f.close()
reviewList[:3]

可以得到資料結構大概是

[{'review_id': 'lWC-xP3rd6obsecCYsGZRg',
  'user_id': 'ak0TdVmGKo4pwqdJSTLwWw',
  'business_id': 'buF9druCkbuXLX526sGELQ',
  'stars': 4.0,
  'useful': 3,
  'funny': 1,
  'cool': 1,
  'text': "Apparently Prides Osteria had a rough summer as evidenced by the almost empty dining room at 6:30 on a Friday night. However new blood in the kitchen seems to have revitalized the food from other customers recent visits. Waitstaff was warm but unobtrusive. By 8 pm or so when we left the bar was full and the dining room was much more lively than it had been. Perhaps Beverly residents prefer a later seating.",
  'date': '2014-10-11 03:34:02'},
 {'review_id': '8bFej1QE5LXp4O05qjGqXA',
  'user_id': 'YoVfDbnISlW0f7abNQACIg',
  'business_id': 'RA4V8pr014UyUbDvI-LW2A',
  'stars': 4.0,
  'useful': 1,
  'funny': 0,
  'cool': 0,
  'text': 'This store is pretty good. Not as great as Walmart (or my preferred, Milford Target), but closer and in a easier area to get to.  \nThe store itself is pretty clean and organized, the staff are friendly (most of the time), and BEST of all is the Self Checkout this store has! \nGreat clearance sections throughout, and great prices on everything in the store, in general (they pricematch too!). ',
  'date': '2015-07-03 20:38:25'}]

目前想用類似這樣的方式,做for迴圈

reviewList[55]['review_id']

單筆打印是成功的
'z-6ElnK3yXpn2oNKR7NYtw'

但是想把每列排進去就開始抱錯!

s="review_id,user_id,business_id,stars,useful,funny,cool,text,date\n"
for i in reviewList:
    
    s=s+"{},{},{},{},{},{},{},{},{}\n".format(reviewList[i]['review_id'],reviewList[i]['user_id'],reviewList[i]['business_id'],reviewList[i]['stars'],reviewList[i]['useful'],reviewList[i]['funny'],reviewList[i]['cool'],reviewList[i]['text'],reviewList[i]['date'])

(這部分)出現下方錯誤

TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_15668/4124199551.py in <module>
      3     #list2=list(dict())
      4     #print(list2)
----> 5     s=s+"{},{},{},{},{},{},{},{},{}\n".format(reviewList[i]['review_id'],reviewList[i]['user_id'],reviewList[i]['business_id'],reviewList[i]['stars'],reviewList[i]['useful'],reviewList[i]['funny'],reviewList[i]['cool'],reviewList[i]['text'],reviewList[i]['date'])

TypeError: list indices must be integers or slices, not dict

改成for i in range(len(reviewList)):以解決

最後存成csv

fw=open("review.csv","w", encoding='utf-8')
fw.write(s)
fw.close()

但發現檔案輸出結果不是我要的
https://ithelp.ithome.com.tw/upload/images/20211203/20144480RvovQDAoMl.jpg

原因是text的欄位出現許多\n換行符號,

'text': 'This store is pretty good. Not as great as Walmart (or my preferred, Milford Target), but closer and in a easier area to get to.  \nThe store itself is pretty clean and organized, the staff are friendly (most of the time), and BEST of all is the Self Checkout this store has! \nGreat clearance sections throughout, and great prices on everything in the store, in general (they pricematch too!). '

導致檔案欄位沒有對齊,想請問如何解決這個問題,先說謝謝,感謝

DanSnow iT邦研究生 4 級 ‧ 2021-12-03 12:27:16 檢舉
你的 i 不是 list 的 index ,而是裡面的 dict ,直接從 i 取資料就好 i['cool'],不要再拿去給 reviewList 當索引 reviewList[i]

2 個回答

2
海綿寶寶
iT邦大神 1 級 ‧ 2021-12-03 22:54:03
最佳解答

關鍵的問題是:CSV 是要給「誰」用?

json 已經是用來交換資料的格式
能直接用是最好

如果硬要轉成 CSV
你的資料有兩個特殊字元要處理
1.換列(\n)
2.逗號

以下是我的處理方式
1.換列(\n):一律用空白字元取代
2.逗號:text 欄位前後加 double quote

如果這處理方式你能接受
以下是所有資料及程式

yelp_academic_dataset_review.json

{"review_id": "lWC-xP3rd6obsecCYsGZRg","user_id": "ak0TdVmGKo4pwqdJSTLwWw","business_id": "buF9druCkbuXLX526sGELQ","stars": 4.0,"useful": 3,"funny": 1,"cool": 1,"text": "Apparently Prides Osteria had a rough summer as evidenced by the almost empty dining room at 6:30 on a Friday night. However new blood in the kitchen seems to have revitalized the food from other customers recent visits. Waitstaff was warm but unobtrusive. By 8 pm or so when we left the bar was full and the dining room was much more lively than it had been. Perhaps Beverly residents prefer a later seating.","date": "2014-10-11 03:34:02"}
{"review_id": "8bFej1QE5LXp4O05qjGqXA","user_id": "YoVfDbnISlW0f7abNQACIg","business_id": "RA4V8pr014UyUbDvI-LW2A","stars": 4.0,"useful": 1,"funny": 0,"cool": 0,"text": "This store is pretty good. Not as great as Walmart (or my preferred, Milford Target), but closer and in a easier area to get to.  \nThe store itself is pretty clean and organized, the staff are friendly (most of the time), and BEST of all is the Self Checkout this store has! \nGreat clearance sections throughout, and great prices on everything in the store, in general (they pricematch too!). ","date": "2015-07-03 20:38:25"}

python code

import csv
import json

path = "yelp_academic_dataset_review.json"
reviewList = []
i = 0
with open(path, "r", encoding="utf-8") as f:
    try:
        while i<100:
            line_data = f.readline()
            if line_data:
                data = json.loads(line_data)
                reviewList.append(data)
                i+=1
            else:
                break
    except Exception as e:
        print(e)
        f.close()

#print(reviewList)

s="review_id,user_id,business_id,stars,useful,funny,cool,text,date\n"
for i in range(len(reviewList)):
    s=s+"{},{},{},{},{},{},{},\"{}\",{}\n".format(reviewList[i]['review_id'],reviewList[i]['user_id'],reviewList[i]['business_id'],reviewList[i]['stars'],reviewList[i]['useful'],reviewList[i]['funny'],reviewList[i]['cool'],reviewList[i]['text'].replace('\n',' '),reviewList[i]['date'])
    
fw=open("review.csv","w", encoding="utf-8")
fw.write(s)
fw.close()

with open('review.csv', newline='') as csvfile:
  rows = csv.reader(csvfile)

  for row in rows:
    print(row)

review.csv

review_id,user_id,business_id,stars,useful,funny,cool,text,date
lWC-xP3rd6obsecCYsGZRg,ak0TdVmGKo4pwqdJSTLwWw,buF9druCkbuXLX526sGELQ,4.0,3,1,1,"Apparently Prides Osteria had a rough summer as evidenced by the almost empty dining room at 6:30 on a Friday night. However new blood in the kitchen seems to have revitalized the food from other customers recent visits. Waitstaff was warm but unobtrusive. By 8 pm or so when we left the bar was full and the dining room was much more lively than it had been. Perhaps Beverly residents prefer a later seating.",2014-10-11 03:34:02
8bFej1QE5LXp4O05qjGqXA,YoVfDbnISlW0f7abNQACIg,RA4V8pr014UyUbDvI-LW2A,4.0,1,0,0,"This store is pretty good. Not as great as Walmart (or my preferred, Milford Target), but closer and in a easier area to get to.   The store itself is pretty clean and organized, the staff are friendly (most of the time), and BEST of all is the Self Checkout this store has!  Great clearance sections throughout, and great prices on everything in the store, in general (they pricematch too!). ",2015-07-03 20:38:25

最後用 python 讀回 review.csv

['lWC-xP3rd6obsecCYsGZRg', 'ak0TdVmGKo4pwqdJSTLwWw', 'buF9druCkbuXLX526sGELQ', '4.0', '3', '1', '1', 'Apparently Prides Osteria had a rough summer as evidenced by the almost empty dining room at 6:30 on a Friday night. However new blood in the kitchen seems to have revitalized the food from other customers recent visits. Waitstaff was warm but unobtrusive. By 8 pm or so when we left the bar was full and the dining room was much more lively than it had been. Perhaps Beverly residents prefer a later seating.', '2014-10-11 03:34:02']
['8bFej1QE5LXp4O05qjGqXA', 'YoVfDbnISlW0f7abNQACIg', 'RA4V8pr014UyUbDvI-LW2A', '4.0', '1', '0', '0', 'This store is pretty good. Not as great as Walmart (or my preferred, Milford Target), but closer and in a easier area to get to.   The store itself is pretty clean and organized, the staff are friendly (most of the time), and BEST of all is the Self Checkout this store has!  Great clearance sections throughout, and great prices on everything in the store, in general (they pricematch too!). ', '2015-07-03 20:38:25']
2
一級屠豬士
iT邦大師 1 級 ‧ 2021-12-05 14:47:12

借用海綿寶寶的json, 建立一個檔案,放到 /tmp 下.
這時候可以使用 sed, 或是 tr 將 \n \r 取代為 ' '.
或是先不做.再來是MySQL的例子.

create table it1205a (
  id int unsigned auto_increment primary key
, info text
);

LOAD DATA INFILE '/tmp/dataset_review.json' INTO TABLE it1205a(info);

update it1205a
   set info = REPLACE(REPLACE(info, '\r', ''), '\n', '');

-- 若在之前已經取代了\n \r , 其實是可以直接load 進 json

alter table it1205a
add column js json;

update it1205a
   set js = cast(info as JSON);

select id
     , json_pretty(js)
   from it1205a;

https://ithelp.ithome.com.tw/upload/images/20211205/20050647NUPCQgZbTe.png

我要發表回答

立即登入回答