12th iThome Ironman Contest

DAY 5

In the previous lesson we extracted, analyzed, and translated text from images. This time we are going to classify text into categories using the Cloud Natural Language API.

The Cloud Natural Language API lets you extract entities from text, perform sentiment and syntactic analysis, and classify text into categories. In this lesson, we'll focus on text classification. Using a database of 700+ categories, this API feature makes it easy to classify a large dataset of text.

Wow, looks like we're gonna handle a huge amount of data now!

Don't panic! We can do it one step at a time~


  1. Open Google Cloud Platform (follow the steps in A Tour of Qwiklabs and Google Cloud)

  2. Activate Cloud Shell
    Like what we did in the previous lesson.

  3. Make sure Cloud Natural Language API is enabled
    Go to APIs & services in Google Cloud Platform.
    Search for Cloud Natural Language API and select it.
    Click Enable to enable it.

  4. Create an API Key
    Like what we did in the previous lesson.

  5. Classify a news article
    Create a request.json file with the following sample text:

{
  "document":{
    "type":"PLAIN_TEXT",
    "content":"A Smoky Lobster Salad With a Tapa Twist. This spin on the Spanish pulpo a la gallega skips the octopus, but keeps the sea salt, olive oil, pimentón and boiled potatoes."
  }
}

Send this text to the Natural Language API's classifyText method with the following curl command:

curl "https://language.googleapis.com/v1/documents:classifyText?key=${API_KEY}" \
  -s -X POST -H "Content-Type: application/json" --data-binary @request.json

Here is the response:
https://ithelp.ithome.com.tw/upload/images/20200918/20130054hsQ4hQVIwS.png
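
The same request can also be put together in Python. This is only a minimal sketch using the standard library: build_classify_request is a hypothetical helper name, and the function builds the request without sending it, so it can be tried without an API key.

```python
import json
import urllib.request

# Same endpoint as in the curl command above
ENDPOINT = "https://language.googleapis.com/v1/documents:classifyText"


def build_classify_request(text, api_key):
    """Build (but do not send) a classifyText POST request."""
    # Same JSON shape as request.json
    body = {"document": {"type": "PLAIN_TEXT", "content": text}}
    return urllib.request.Request(
        url=f"{ENDPOINT}?key={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder, not a real key
    req = build_classify_request(
        "A Smoky Lobster Salad With a Tapa Twist.", "YOUR_API_KEY"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To actually send it, the request object could be passed to urllib.request.urlopen, which returns the JSON response as bytes.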

Run the following command to save the response in the result.json file:

curl "https://language.googleapis.com/v1/documents:classifyText?key=${API_KEY}" \
  -s -X POST -H "Content-Type: application/json" --data-binary @request.json > result.json
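
Once the response is saved, it can be read back with a few lines of Python. A minimal sketch, assuming the response holds a categories list whose items have name and confidence fields; top_category and load_top_category are hypothetical helper names, and the sample values below are made up for illustration.

```python
import json


def top_category(response):
    """Return (name, confidence) of the highest-confidence category, or None."""
    categories = response.get("categories", [])
    if not categories:
        return None
    best = max(categories, key=lambda c: c["confidence"])
    return best["name"], best["confidence"]


def load_top_category(path="result.json"):
    """Read a saved classifyText response and pick its top category."""
    with open(path, encoding="utf-8") as f:
        return top_category(json.load(f))


if __name__ == "__main__":
    # Illustrative response only; real category names and scores will differ
    sample = {
        "categories": [
            {"name": "/Food & Drink/Cooking & Recipes", "confidence": 0.85},
            {"name": "/Food & Drink/Food/Meat & Seafood", "confidence": 0.63},
        ]
    }
    print(top_category(sample))
```
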
  6. Classify a large text dataset
    Classifying a single article is cool, but to really see the power of this feature, let's classify lots of text data.

6-1. Create a BigQuery table for our categorized text data
Go to BigQuery in Google Cloud Platform.
Click on the name of your project, then click Create dataset.
Name the dataset news_classification_dataset, then click Create dataset.
Click on the name of the dataset, then select Create Table and name the table article_data.
Click Add Field and add the following 3 fields: article_text, category, and confidence.
Click Create Table.
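
The three fields can also be written down as a schema definition instead of being typed into the console. This is a sketch only: the lesson does not state the column types, so STRING, STRING, and FLOAT are assumptions, and schema_json is a hypothetical helper name. The name/type/mode layout follows the JSON schema format that BigQuery tooling accepts.

```python
import json

# Assumed column types; the lesson only names the three fields
SCHEMA = [
    {"name": "article_text", "type": "STRING", "mode": "NULLABLE"},
    {"name": "category", "type": "STRING", "mode": "NULLABLE"},
    {"name": "confidence", "type": "FLOAT", "mode": "NULLABLE"},
]


def schema_json(schema=SCHEMA):
    """Serialize the schema as JSON text, ready to save to a file."""
    return json.dumps(schema, indent=2)


if __name__ == "__main__":
    print(schema_json())
```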

6-2. Classify news data and store the results in BigQuery
Run the following commands to create a service account:

export PROJECT=YOUR_PROJECT  # replace YOUR_PROJECT with your project ID; $PROJECT is used below
gcloud iam service-accounts create my-account --display-name my-account
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:my-account@$PROJECT.iam.gserviceaccount.com --role=roles/bigquery.admin
gcloud iam service-accounts keys create key.json --iam-account=my-account@$PROJECT.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=key.json

Create a file classify-text.py and add the following code:

from google.cloud import storage, language, bigquery

# Set up our GCS, NL, and BigQuery clients
storage_client = storage.Client()
nl_client = language.LanguageServiceClient()
# TODO: replace YOUR_PROJECT with your project name below
bq_client = bigquery.Client(project='YOUR_PROJECT')

dataset_ref = bq_client.dataset('news_classification_dataset')
dataset = bigquery.Dataset(dataset_ref)
table_ref = dataset.table('article_data')
table = bq_client.get_table(table_ref)

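# Send article text to the NL API's classifyText method
# (language.types and language.enums belong to google-cloud-language 1.x;
# 2.x and later use language.Document and the keyword type_ instead)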
def classify_text(article):
        response = nl_client.classify_text(
                document=language.types.Document(
                        content=article,
                        type=language.enums.Document.Type.PLAIN_TEXT
                )
        )
        return response


rows_for_bq = []
files = storage_client.bucket('qwiklabs-test-bucket-gsp063').list_blobs()
print("Got article files from GCS, sending them to the NL API (this will take ~2 minutes)...")

# Send files to the NL API and save the result to send to BigQuery
for file in files:
        if file.name.endswith('txt'):
                article_text = file.download_as_string()
                nl_response = classify_text(article_text)
                if len(nl_response.categories) > 0:
                        rows_for_bq.append((str(article_text), nl_response.categories[0].name, nl_response.categories[0].confidence))

print("Writing NL API article data to BigQuery...")
# Write article text + category data to BQ
errors = bq_client.insert_rows(table, rows_for_bq)
assert errors == []

Start classifying articles and importing them to BigQuery:

python3 classify-text.py
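
The script keeps only the first category returned for each article and skips articles with no category at all. That row-building step can be tried on its own, without any cloud calls. A minimal sketch: Category is a hypothetical stand-in for the API's category objects, and to_bq_row is a hypothetical helper name.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Category(NamedTuple):
    """Stand-in for one category in a classifyText response."""
    name: str
    confidence: float


def to_bq_row(
    article_text: str, categories: Sequence[Category]
) -> Optional[Tuple[str, str, float]]:
    """Return (article_text, category, confidence), or None if uncategorized."""
    if not categories:
        return None
    first = categories[0]
    return (article_text, first.name, first.confidence)


if __name__ == "__main__":
    # Made-up article and scores for illustration
    cats = [Category("/News/Politics", 0.97), Category("/News", 0.61)]
    print(to_bq_row("Some article text", cats))
    print(to_bq_row("Too short to classify", []))
```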

In BigQuery, you can click Query Table and enter SQL to query the data.

Let's try to query all the data from the articles:

SELECT * FROM `YOUR_PROJECT.news_classification_dataset.article_data`

The result will be:
https://ithelp.ithome.com.tw/upload/images/20200918/20130054EuP598F4sL.png

The category column has the name of the first category the Natural Language API returned for the article, and confidence is a value between 0 and 1 indicating how confident the API is that it categorized the article correctly.

Next, let's see which categories were most common in the dataset.

SELECT
  category,
  COUNT(*) c
FROM
  `YOUR_PROJECT.news_classification_dataset.article_data`
GROUP BY
  category
ORDER BY
  c DESC

You will see that /News/Politics is the most common category:
https://ithelp.ithome.com.tw/upload/images/20200918/20130054W6Y5qtp1Rl.png
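
For intuition, the GROUP BY query does the same job as counting categories in plain Python. A minimal sketch on made-up rows shaped like (article_text, category, confidence); category_counts is a hypothetical helper name.

```python
from collections import Counter


def category_counts(rows):
    """Count rows per category, most common first (like ORDER BY c DESC)."""
    counts = Counter(category for _text, category, _confidence in rows)
    return counts.most_common()


if __name__ == "__main__":
    # Illustrative rows only, not real query results
    rows = [
        ("article one", "/News/Politics", 0.95),
        ("article two", "/Sports", 0.80),
        ("article three", "/News/Politics", 0.70),
    ]
    for category, count in category_counts(rows):
        print(category, count)
```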

Next, let's find an article with a more obscure category, like /Arts & Entertainment/Music & Audio/Classical Music:

SELECT * FROM `YOUR_PROJECT.news_classification_dataset.article_data`
WHERE category = "/Arts & Entertainment/Music & Audio/Classical Music"

There's one result for this obscure category:
https://ithelp.ithome.com.tw/upload/images/20200918/20130054ck01x3f5Kn.png

Lastly, let's find the articles with a confidence score greater than 90%:

SELECT
  article_text,
  category
FROM `YOUR_PROJECT.news_classification_dataset.article_data`
WHERE cast(confidence as float64) > 0.9

These articles have a confidence score greater than 90%:
https://ithelp.ithome.com.tw/upload/images/20200918/20130054TI0kN8Duq8.png
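
The confidence filter works the same way in plain Python: cast the value to a number, then compare it with the threshold. A minimal sketch on made-up rows; high_confidence is a hypothetical helper name, and the float() call mirrors the cast in the query in case confidence is stored as text.

```python
def high_confidence(rows, threshold=0.9):
    """Keep (article_text, category) for rows strictly above the threshold."""
    return [
        (text, category)
        for text, category, confidence in rows
        if float(confidence) > threshold
    ]


if __name__ == "__main__":
    # Illustrative rows only; confidence appears as both text and number
    rows = [
        ("article one", "/News/Politics", "0.95"),
        ("article two", "/Sports", 0.90),
        ("article three", "/Business & Industrial", 0.50),
    ]
    print(high_confidence(rows))
```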

So this is how to classify text into categories and query the data as desired.


Yay~ today's lesson is not so long~

Really appreciate how the Natural Language API has done all this tedious analyzing and categorizing work for us.

Hope you enjoy today's lesson~

