
Serverless Data Analysis (Python)


SERVERLESS DATA ANALYSIS

GETTING STARTED WITH GCP CONSOLE

When the lab is ready, a green Start Lab button will appear.

When you are ready to begin, click Start Lab.

Logging in to Google Cloud Platform

Step 1: Locate the Username, Password and Project Id

Press the green Start Lab button to start the lab. After setup is complete, the lab credentials (Username, Password, and Project ID) appear on the right side of the Qwiklabs window.

Step 2: Browse to Console

Open an Incognito window in your browser.
And go to http://console.cloud.google.com

Step 3: Sign in to Console

Log in with the Username and Password provided. The steps below are illustrative; the actual dialogs and procedures may vary from this example.

Step 4: Accept the conditions

Accept the new account terms and conditions.


This is a temporary account. You will only have access to the account for this one lab.

  • Do not add recovery options
  • Do not sign up for free trials

Step 5: Don't change the password

If prompted, don't change the password. Just click [Continue].


Step 6: Agree to the Terms of Service

Select Yes for both questions and click [AGREE AND CONTINUE].

Step 7: Console opens

The Google Cloud Platform Console opens.

You may see a bar across the top of the Console inviting you to sign up for a free trial. Click the [DISMISS] button so that the entire Console screen is available.

Step 8: Switch project (if necessary)

On the top blue horizontal bar, click the drop-down icon to select the correct project (if it is not already selected). You can confirm the Project ID from your Qwiklabs window (shown in Step 1 above).

Click on "view more projects" if necessary and select the correct project id.

PART 1: BUILD A BIGQUERY QUERY

Overview

Duration is 1 min

In this lab, you learn how to build up a complex BigQuery query using clauses, subqueries, built-in functions, and joins.

What you learn

In this lab, you:

  • Create and run a query
  • Modify the query to add clauses, subqueries, built-in functions and joins.

Introduction

Duration is 1 min

The goal of this lab is to build up a complex BigQuery query using clauses, subqueries, built-in functions and joins, and to run the query.

Before you begin

Duration is 1 min

If you have not started the lab, click the green "Start Lab" button. Once setup is done, the Qwiklabs window displays the credentials for this lab. Repeat the steps in Lab 0 to log in to the Cloud Console with these credentials.

Here is a quick reference:

Open new incognito window → go to cloud console → login with provided credentials → follow the prompts → switch project if necessary

Create and run a query

Duration is 3 min

Step 1

Open the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI. Click the Compose Query button at the top left, then click Show Options and ensure you are using Standard SQL. You are using Standard SQL if the "Use Legacy SQL" checkbox is unchecked.

Step 2

Click Compose Query.

Step 3

In the New Query window, type (or copy-and-paste) the following query:

SELECT
  airline,
  date,
  departure_delay
FROM
  `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_delay > 0
  AND departure_airport = 'LGA'
LIMIT
  100

What does this query do? ______________________

Step 4

Click Run Query.
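
The web UI is all this lab requires, but if you would like to run the same query programmatically, here is a minimal sketch using the google-cloud-bigquery Python client (the library installation and credentials are assumed to be available, as they are by default in Cloud Shell):

from google.cloud import bigquery

# Minimal sketch: run the same query with the BigQuery Python client.
client = bigquery.Client()  # uses the lab project's default credentials

sql = """
SELECT airline, date, departure_delay
FROM `bigquery-samples.airline_ontime_data.flights`
WHERE departure_delay > 0
  AND departure_airport = 'LGA'
LIMIT 100
"""

for row in client.query(sql).result():
    print(row.airline, row.date, row.departure_delay)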

Aggregate and Boolean functions

Duration is 5 min

Step 1

To the previous query, add an additional clause to filter by date and group the results by airline. Because you are grouping the results, the SELECT statement will have to use an aggregate function. In the New Query window, type the following query:

SELECT
  airline,
  COUNT(departure_delay)
FROM
   `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_airport = 'LGA'
  AND date = '2008-05-13'
GROUP BY
  airline
ORDER BY airline

Step 2

Click Run Query. What does this query do? ______________________________________________________

What is the number you get for American Airlines (AA)?


Step 3

Now change the query slightly:

SELECT
  airline,
  COUNT(departure_delay)
FROM
   `bigquery-samples.airline_ontime_data.flights`
WHERE
  departure_delay > 0 AND
  departure_airport = 'LGA'
  AND date = '2008-05-13'
GROUP BY
  airline
ORDER BY airline

Step 4

Click Run Query. What does this query do? ______________________________________________________

What is the number you get for American Airlines (AA)?


Step 5

The first query returns the total number of flights by each airline from La Guardia, and the second query returns the total number of flights that departed late. (Do you see why?)

How would you get both the number delayed as well as the total number of flights?



Step 6

Run this query:

SELECT
  f.airline,
  COUNT(f.departure_delay) AS total_flights,
  SUM(IF(f.departure_delay > 0, 1, 0)) AS num_delayed
FROM
   `bigquery-samples.airline_ontime_data.flights` AS f
WHERE
  f.departure_airport = 'LGA' AND f.date = '2008-05-13'
GROUP BY
  f.airline

String operations

Duration is 3 min

Step 1

In the New Query window, type the following query:

SELECT
  CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
FROM
  `bigquery-samples.weather_geo.gsod`
WHERE
  station_number = 725030
  AND total_precipitation > 0

Step 2

Click Run Query.

Step 3

How would you modify the airline query to aggregate over all these dates instead of just '2008-05-13'?


You could use a JOIN, as shown next.

Join on Date

Duration is 3 min

Step 1

In the New Query window, type the following query:

SELECT
  f.airline,
  SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
  COUNT(f.arrival_delay) AS total_flights
FROM
  `bigquery-samples.airline_ontime_data.flights` AS f
JOIN (
  SELECT
    CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
  FROM
    `bigquery-samples.weather_geo.gsod`
  WHERE
    station_number = 725030
    AND total_precipitation > 0) AS w
ON
  w.rainyday = f.date
WHERE f.arrival_airport = 'LGA'
GROUP BY f.airline

Step 2

Click Run Query. How would you get the fraction of flights delayed for each airline?

You could put the entire query above into a subquery and then select from the columns of this result.

Subquery

Duration is 3 min

Step 1

In the New Query window, type the following query:

SELECT
  airline,
  num_delayed,
  total_flights,
  num_delayed / total_flights AS frac_delayed
FROM (
SELECT
  f.airline AS airline,
  SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
  COUNT(f.arrival_delay) AS total_flights
FROM
  `bigquery-samples.airline_ontime_data.flights` AS f
JOIN (
  SELECT
    CONCAT(CAST(year AS STRING), '-', LPAD(CAST(month AS STRING),2,'0'), '-', LPAD(CAST(day AS STRING),2,'0')) AS rainyday
  FROM
    `bigquery-samples.weather_geo.gsod`
  WHERE
    station_number = 725030
    AND total_precipitation > 0) AS w
ON
  w.rainyday = f.date
WHERE f.arrival_airport = 'LGA'
GROUP BY f.airline
  )
ORDER BY
  frac_delayed ASC

Step 2

Click Run Query.

Stop here if you are done. Wait for instructions from the Instructor before going into the next section

PART 2: LOADING AND EXPORTING DATA

Overview

Duration is 1 min

In this lab, you load data in different formats into BigQuery tables.

What you learn

In this lab, you:

  • Load a CSV file into a BigQuery table using the web UI
  • Load a JSON file into a BigQuery table using the CLI

Introduction

Duration is 1 min

In this lab, you load data into BigQuery in multiple ways. You also transform the data you load, and you query the data.

Upload data using the web UI

Duration is 14 min

Task: In this section of the lab, you upload a CSV file to BigQuery using the BigQuery web UI.

BigQuery supports the following data formats when loading data into tables: CSV, JSON, AVRO, or Cloud Datastore backups. This example focuses on loading a CSV file into BigQuery.
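
For reference, the same CSV load can also be done programmatically; a hedged sketch with the google-cloud-bigquery Python client is shown below (the dataset and table names match the ones used in this lab, but schema autodetection is used here instead of the hand-entered schema, to keep the sketch short):

from google.cloud import bigquery

# Sketch of a programmatic equivalent of the web-UI CSV upload.
client = bigquery.Client()
table_id = "{}.cpb101_flight_data.AIRPORTS".format(client.project)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # airports.csv has a single header row
    autodetect=True,       # the lab defines the schema by hand in the UI
)

with open("airports.csv", "rb") as f:
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish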

Step 1

Open the Google Cloud Console (in the incognito window) and using the menu, navigate into BigQuery web UI.

Step 2

Click the blue arrow to the right of your project name and choose Create new dataset.

Step 3

In the 'Create Dataset' dialog, for Dataset ID, type cpb101_flight_data and then click OK.

Step 4

Download the following file to your local machine. This file contains the data that will populate the first table.

Download airports.csv

Step 5

Create a new table in the cpb101_flight_data dataset to store the data from the CSV file. Click the create table icon (the plus sign) to the right of the dataset.

Step 6

On the Create Table page, in the Source Data section:

  • For Location, leave File upload selected.
  • To the right of File upload, click Choose file, then browse to and select airports.csv.
  • Verify File format is set to CSV.

Note: When you have created a table previously, the Create from Previous Job option allows you to quickly use your settings to create similar tables.

Step 7

In the Destination Table section:

  • For Table name, leave cpb101_flight_data selected.
  • For Destination table name, type AIRPORTS.
  • For Table type, Native table should be selected and unchangeable.

Step 8

In the Schema section:

  • Add fields one at a time. The airports.csv has the following fields: , , , , which are of type and , which are of type . Make all these fields .

Step 9

In the Options section:

  • For Field delimiter, verify Comma is selected.
  • Since airports.csv contains a single header row, for Header rows to skip, type 1.
  • Accept the remaining default values and click Create Table. BigQuery creates a load job to create the table and upload data into the table (this may take a few seconds). You can track job progress by clicking Job History.

Step 10

Once the load job is complete, click cpb101_flight_data > AIRPORTS.

Step 11

On the Table Details page, click Details to view the table properties and then click Preview to view the table data.

Upload data using the CLI

Duration is 7 min

Task: In this section of the lab, you upload multiple JSON files and an associated schema file to BigQuery using the CLI.

Step 1

Navigate to the Google Cloud Platform Console and to the right of your project name, click Activate Google Cloud Shell.

Step 2

Type the following command to download schema_flight_performance.json (the schema file for the table in this example) to your working directory.

curl https://storage.googleapis.com/cloud-training/CPB200/BQ/lab4/schema_flight_performance.json -o schema_flight_performance.json

Step 3

The JSON files containing the data for your table are stored in a Google Cloud Storage bucket (gs://cloud-training/CPB200/BQ/lab4/) and are named according to the pattern domestic_2014_flights_*.json.

Type the following command to create a table named flights_2014 in the cpb101_flight_data dataset, using data from files in Google Cloud Storage and the schema file stored on your virtual machine.

Note that your Project ID is stored as a variable in Cloud Shell ($DEVSHELL_PROJECT_ID), so there is no need for you to remember it. If you require it, you can view your Project ID on the command line to the right of your username (after the @ symbol).

bq load --source_format=NEWLINE_DELIMITED_JSON $DEVSHELL_PROJECT_ID:cpb101_flight_data.flights_2014 gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json ./schema_flight_performance.json

If you are prompted to select a project to be set as default, choose the Project ID that was set up when you started this Qwiklab (look in the "Connect" tab of your Qwiklabs window; the Project ID typically looks something like "qwiklabs-gcp-123xyz").

Note

There are multiple JSON files in the bucket named according to the convention: domestic_2014_flights_*.json. The wildcard (*) character is used to include all of the .json files in the bucket.
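
A rough Python-client equivalent of the bq load command above looks like this (the real lab uses the downloaded schema file; schema autodetection is used here only to keep the sketch self-contained):

from google.cloud import bigquery

# Sketch: load the newline-delimited JSON files from Cloud Storage.
client = bigquery.Client()
table_id = "{}.cpb101_flight_data.flights_2014".format(client.project)

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # the lab supplies schema_flight_performance.json instead
)

uri = "gs://cloud-training/CPB200/BQ/lab4/domestic_2014_flights_*.json"
client.load_table_from_uri(uri, table_id, job_config=job_config).result()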

Step 4

Once the table is created, type the following command to verify table flights_2014 exists in dataset cpb101_flight_data.

bq ls $DEVSHELL_PROJECT_ID:cpb101_flight_data

The output should list the tables in the dataset, including the new flights_2014 table.

Export table

Duration is 6 min

Task: In this section of the lab, you export a BigQuery table using the web UI.

Step 1

If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.

Step 2

Go back to the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI.

Step 3

Select the AIRPORTS table that you created recently, and using the "down" button to its right, select the option for Export Table.

Step 4

In the dialog, specify the Cloud Storage destination in your bucket and click OK.

Step 5

Use the CLI to export the table:

bq extract cpb101_flight_data.AIRPORTS gs://<your-bucket-name>/bq/airports2.csv

Remember to replace <your-bucket-name> with the bucket you created earlier.
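
The same export can also be done from Python; a minimal sketch with the client library follows (the destination file name airports3.csv is just an illustrative placeholder):

from google.cloud import bigquery

# Sketch: export the AIRPORTS table to Cloud Storage as CSV.
client = bigquery.Client()
source = "{}.cpb101_flight_data.AIRPORTS".format(client.project)
destination = "gs://<your-bucket-name>/bq/airports3.csv"  # placeholder path

client.extract_table(source, destination).result()  # wait for the extract job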

Step 6

Browse to your bucket and ensure that both .csv files have been created.

Stop here if you are done. Wait for instructions from the Instructor before going into the next section

PART 3: ADVANCED SQL QUERIES

Overview

Duration is 1 min

In this lab, you use some advanced SQL concepts to answer the question: what programming languages do open-source programmers program in on weekends?

What you learn

In this lab, you write a query that uses advanced SQL concepts:

  • Nested fields
  • Regular expressions
  • The WITH statement
  • GROUP BY and HAVING

Introduction

Duration is 1 min

In this lab, you use some advanced SQL concepts to answer the question: what programming languages do open-source programmers program in on weekends?

To answer this question, we will use a BigQuery public dataset that has information on all GitHub commits.

Get information about code commits

Duration is 5 min

In this section, you will learn how to work with nested fields.

Step 1

Open the Google Cloud Console (in the incognito window) and using the menu, navigate into BigQuery web UI.

Step 2

Compose a new query, making sure that the "Legacy SQL" option is not checked (you are using Standard SQL).

SELECT
  author.email,
  diff.new_path AS path,
  author.date
FROM
  `bigquery-public-data.github_repos.commits`,
  UNNEST(difference) diff
WHERE
  EXTRACT(YEAR
  FROM
    author.date)=2016
LIMIT 10

Step 3

Play a little with the query above to understand what it is doing. For example, instead of author.email, try just author. What type of field is author?

Step 4

Change to . Why does it not work? Replace by . Does this work? Why? What is the doing?

Extract programming language

Duration is 5 min

In this section, you will learn how to use regular expressions. Let's assume that the filename extension is the programming language, i.e., a file that ends in .py has the language "py". How will you pull out the extension from the path?

Step 1

Type the following query:

SELECT
  author.email,
  LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
  diff.new_path AS path,
  author.date
FROM
  `bigquery-public-data.github_repos.commits`,
  UNNEST(difference) diff
WHERE
  EXTRACT(YEAR
  FROM
    author.date)=2016
LIMIT
  10

Step 2

Modify the query above to only use lang if the language consists purely of letters and is fewer than 8 characters long.

Step 3

Modify the query above to group by language and list in descending order of the number of commits. Here's a potential solution:

WITH
  commits AS (
  SELECT
    author.email,
    LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
    diff.new_path AS path,
    author.date
  FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(difference) diff
  WHERE
    EXTRACT(YEAR
    FROM
      author.date)=2016 )
SELECT
  lang,
  COUNT(path) AS numcommits
FROM
  commits
WHERE
  LENGTH(lang) < 8
  AND lang IS NOT NULL
  AND REGEXP_CONTAINS(lang, '[a-zA-Z]')
GROUP BY
  lang
HAVING
  numcommits > 100
ORDER BY
  numcommits DESC

Weekend or weekday?

Duration is 5 min

Now, group the commits based on whether or not they happened on a weekend. How would you do it?

Step 1

Modify the query above to extract the day of the week from author.date. Days 2 to 6 are weekdays.

Step 2

Here's a potential solution:

WITH
  commits AS (
  SELECT
    author.email,
    EXTRACT(DAYOFWEEK
    FROM
      author.date) BETWEEN 2
    AND 6 is_weekday,
    LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^\./\(~_ \- #]*)$')) lang,
    diff.new_path AS path,
    author.date
  FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(difference) diff
  WHERE
    EXTRACT(YEAR
    FROM
      author.date)=2016)
SELECT
  lang,
  is_weekday,
  COUNT(path) AS numcommits
FROM
  commits
WHERE
  LENGTH(lang) < 8
  AND lang IS NOT NULL
  AND REGEXP_CONTAINS(lang, '[a-zA-Z]')
GROUP BY
  lang,
  is_weekday
HAVING
  numcommits > 100
ORDER BY
  numcommits DESC

Ignoring file extensions that do not correspond to programming languages, it appears that the most popular weekend programming languages are JavaScript, PHP and C.

Acknowledgment: This section of the lab (and the query) is based on an article by Felipe Hoffa: https://medium.com/@hoffa/the-top-weekend-languages-according-to-githubs-code-6022ea2e33e8#.8oj2rp804

Stop here if you are done. Wait for instructions from the Instructor before going into the next section

PART 4: A SIMPLE DATAFLOW PIPELINE

Overview

Duration is 1 min

In this lab, you learn how to write a simple Dataflow pipeline and run it both locally and on the cloud.

What you learn

In this lab, you learn how to:

  • Set up a Python Dataflow project
  • Write a simple pipeline in Python
  • Execute the pipeline on the local machine
  • Execute the pipeline on the cloud

Introduction

Duration is 1 min

The goal of this lab is to become familiar with the structure of a Dataflow project and learn how to execute a Dataflow pipeline.

Open Dataflow project

Duration is 3 min

Step 1

Start CloudShell and navigate to the directory for this lab:

cd ~/training-data-analyst/courses/data_analysis/lab2/python

If this directory doesn't exist, you may need to git clone the repository first:

cd ~
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd ~/training-data-analyst/courses/data_analysis/lab2/python

Step 2

Install the necessary dependencies for Python dataflow:

sudo ./install_packages.sh

Verify that you have the right version of pip (should be > 8.0):

pip -V

If not, open a new CloudShell tab and it should pick up the updated pip.

Pipeline filtering

Duration is 5 min

Step 1

View the source code for the pipeline using nano:

cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano grep.py

Step 2

What files are being read? _____________________________________________________

What is the search term? ______________________________________________________

Where does the output go? ___________________________________________________

There are three transforms in the pipeline:

  1. What does the first transform do? _________________________________
  2. What does the second transform do? ______________________________
     • Where does its input come from? ________________________
     • What does it do with this input? __________________________
     • What does it write to its output? __________________________
     • Where does the output go to? ____________________________
  3. What does the third transform do? _____________________
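
For reference while you answer these questions, here is a minimal Apache Beam (Python) pipeline with the same general shape: read text files, keep only the lines that contain a search term, and write the survivors out. This is an illustrative sketch, not the actual contents of grep.py; the input path and search term are assumptions.

import apache_beam as beam

SEARCH_TERM = 'import'  # assumed search term, for illustration only

with beam.Pipeline() as p:
    (p
     | 'GetInput' >> beam.io.ReadFromText('path/to/input/*.java')  # assumed path
     | 'Grep'     >> beam.FlatMap(
            lambda line: [line] if SEARCH_TERM in line else [])
     | 'Write'    >> beam.io.WriteToText('/tmp/output'))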

Execute the pipeline locally

Duration is 2 min

Step 1

Execute locally:

python grep.py

Note: if you see an error message here, you may ignore it. The error simply says that logging from the oauth2 library will go to stderr.

Step 2

Examine the output file:

cat /tmp/output-*

Does the output seem logical? ______________________

Execute the pipeline on the cloud

Duration is 10 min

Step 1

If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.

Step 2

Copy some Java files to the cloud (make sure to replace <YOUR-BUCKET-NAME> with the bucket name you created in the previous step):

gsutil cp ../javahelp/src/main/java/com/google/cloud/training/dataanalyst/javahelp/*.java gs://<YOUR-BUCKET-NAME>/javahelp

Step 3

Edit the Dataflow pipeline by opening grepc.py in nano:

cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano grepc.py

and changing the PROJECT and BUCKET variables appropriately.
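
Conceptually, the cloud version of the pipeline differs from the local one mainly in the pipeline options it passes to Beam. A hedged sketch of what such options can look like follows; the exact flags and variable handling in grepc.py may differ, and the job name shown is hypothetical.

import apache_beam as beam

PROJECT = 'your-project-id'   # set to your Qwiklabs Project ID
BUCKET = 'your-bucket-name'   # set to the bucket you created

argv = [
    '--project={0}'.format(PROJECT),
    '--job_name=examplejob',                                # hypothetical name
    '--staging_location=gs://{0}/staging/'.format(BUCKET),
    '--temp_location=gs://{0}/staging/'.format(BUCKET),
    '--runner=DataflowRunner',                              # run on the cloud
]

p = beam.Pipeline(argv=argv)
# ... the same transforms as in the local pipeline go here ...
p.run()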

Step 4

Submit the Dataflow job to the cloud:

python grepc.py

Because this is such a small job, running on the cloud will take significantly longer than running it locally (on the order of 2-3 minutes).

Step 5

On your Cloud Console, navigate to the Dataflow section (from the three-line menu on the top left) and look at the Jobs. Select your job and monitor its progress.

Step 6

Wait for the job status to turn to Succeeded. At this point, your CloudShell will display a command-line prompt. In CloudShell, examine the output:

gsutil cat gs://<YOUR-BUCKET-NAME>/javahelp/output-*

Stop here if you are done. Wait for instructions from the Instructor before going into the next section

PART 5: MAPREDUCE IN DATAFLOW

Overview

Duration is 1 min

In this lab, you learn how to use pipeline options and carry out Map and Reduce operations in Dataflow.

What you learn

In this lab, you learn how to:

  • Use pipeline options in Dataflow
  • Carry out mapping transformations
  • Carry out reduce aggregations

Introduction

Duration is 1 min

The goal of this lab is to learn how to write MapReduce operations using Dataflow.

Identify Map and Reduce operations

Duration is 5 min

Step 1

Start CloudShell and navigate to the directory for this lab:

cd ~/training-data-analyst/courses/data_analysis/lab2

If this directory doesn't exist, you may need to git clone the repository:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Step 2

View the source code for the pipeline using nano:

cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano is_popular.py

Step 3

What custom arguments are defined? ____________________

What is the default output prefix? _________________________________________

How is the variable output_prefix in main() set? _____________________________

How are the pipeline arguments such as --runner set? ______________________

Step 4

What are the key steps in the pipeline? _____________________________________________________________________________

Which of these steps happen in parallel? ____________________________________

Which of these steps are aggregations? _____________________________________
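
As a generic reference for the questions above (not the actual is_popular.py), the sketch below shows the usual pattern: a custom --output_prefix argument parsed with argparse, a Map step that emits key/value pairs, and a CombinePerKey aggregation; unrecognized flags such as --runner are passed through to Beam.

import argparse
import apache_beam as beam

# Custom arguments are parsed here; everything else goes to Beam.
parser = argparse.ArgumentParser()
parser.add_argument('--output_prefix', default='/tmp/output')
my_args, beam_args = parser.parse_known_args()

def to_pair(line):
    # Illustrative "map" step: key each line by its first token.
    tokens = line.split()
    return (tokens[0] if tokens else '', 1)

with beam.Pipeline(argv=beam_args) as p:
    (p
     | 'Read'  >> beam.io.ReadFromText('path/to/input/*.java')  # assumed input
     | 'Map'   >> beam.Map(to_pair)          # runs in parallel over lines
     | 'Sum'   >> beam.CombinePerKey(sum)    # the aggregation ("reduce") step
     | 'Write' >> beam.io.WriteToText(my_args.output_prefix))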

Execute the pipeline

Duration is 2 min

Step 1

Install the necessary dependencies for Python dataflow:

sudo ./install_packages.sh

Verify that you have the right version of pip (should be > 8.0):

pip -V

If not, open a new CloudShell tab and it should pick up the updated pip.

Step 2

Run the pipeline locally:

./is_popular.py

Note: if you see an error message here, you may ignore it. The error simply says that logging from the oauth2 library will go to stderr.

Step 3

Examine the output file:

cat /tmp/output-*

Use command line parameters

Duration is 2 min

Step 1

Change the output prefix from the default value:

./is_popular.py --output_prefix=/tmp/myoutput

What will be the name of the new file that is written out?

Step 2

Note that we now have a new file in the /tmp directory:

ls -lrt /tmp/myoutput*

Stop here if you are done. Wait for instructions from the Instructor before going into the next section

PART 6: SIDE INPUTS

Overview

Duration is 1 min

In this lab, you learn how to use BigQuery as a data source for Dataflow, and how to use the results of a pipeline as a side input to another pipeline.

What you learn

In this lab, you learn how to:

  • Read data from BigQuery into Dataflow
  • Use the output of a pipeline as a side-input to another pipeline

Introduction

Duration is 1 min

The goal of this lab is to learn how to use BigQuery as a data source for Dataflow, and how to use the result of a pipeline as a side input to another pipeline.

Try out BigQuery query

Duration is 4 min

Step 1

Open the Google Cloud Console (in the incognito window) and using the menu, navigate into BigQuery web UI, and click on Compose Query.

Step 2

Copy-and-paste this query:

SELECT
  content
FROM
  [fh-bigquery:github_extracts.contents_java_2016]
LIMIT
  10

Step 3

Click on Run Query.

What is being returned? _______________________________ ____________________

The BigQuery table contains the content (and some metadata) of all the Java files present in GitHub in 2016.

Step 4

To find out how many Java files this table has, type the following query and click Run Query:

SELECT
  COUNT(*)
FROM
  [fh-bigquery:github_extracts.contents_java_2016]

The reason zero bytes are processed is that the row count is answered from table metadata.

How many files are there in this dataset? __________________________________

Is this a dataset you want to process locally or on the cloud? ______________

Explore the pipeline code

Duration is 10 min

Step 1

On your Cloud Console, start CloudShell and navigate to the directory for this lab:

cd ~/training-data-analyst/courses/data_analysis/lab2 

If this directory doesn't exist, you may need to git clone the repository:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Step 2

View the pipeline code using nano and answer the following questions:

cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
nano src/main/java/com/google/cloud/training/dataanalyst/javahelp/JavaProjectsThatNeedHelp.java

The pipeline looks like this (refer to this diagram as you read the code):

[Pipeline diagram]

Step 3

Looking at the class documentation at the very top, what is the purpose of this pipeline? __________________________________________________________

Where does GetJava get Java content from? _______________________________

What does ToLines do? (Hint: look at the content field of the BigQuery result) ____________________________________________________

Step 4

Why is the result of ToLines stored in a named PCollection instead of being directly passed to another apply()? ________________________________________________

What are the two actions carried out on javaContent? ____________________________

Step 5

If a file has 3 FIXMEs and 2 TODOs in its content (on different lines), how many calls for help are associated with it? __________________________________________________

If a file is in the package com.google.devtools.build, what are the packages that it is associated with? ____________________________________________________

Why is the numHelpNeeded variable not enough? Why do we need to do Sum.integersPerKey()? ___________________________________ (Hint: there are multiple files in a package)

Why is this converted to a View? ___________________________________________

Step 6

Which operation uses the View as a side input? _____________________________

Instead of simply ParDo.of(), this operation uses ____________________________

Besides c.element() and c.output(), this operation also makes use of what method in ProcessContext? __________________________________________________________
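
For comparison, the Python/Beam way to feed the result of one branch of a pipeline into another step as a side input is to wrap it in a view such as beam.pvalue.AsDict (or AsSingleton/AsList). The toy sketch below illustrates only the pattern; it is not a translation of the Java pipeline above, and the data are made up.

import apache_beam as beam

with beam.Pipeline() as p:
    # One branch produces per-key totals ...
    totals = p | 'totals' >> beam.Create([('java', 100), ('python', 50)])
    # ... another branch produces per-key counts of "help needed".
    helps = p | 'helps' >> beam.Create([('java', 5), ('python', 4)])

    # The totals PCollection becomes a side input (a dict view) that is
    # available inside the Map on the main branch.
    ratios = helps | 'ratio' >> beam.Map(
        lambda kv, tot: (kv[0], kv[1] / tot[kv[0]]),
        tot=beam.pvalue.AsDict(totals))

    ratios | 'print' >> beam.Map(print)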

Execute the pipeline

Duration is 5 min

Step 1

If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.

Step 2

Execute the pipeline by typing the following (make sure to replace <PROJECT> with your Project ID and <YOUR-BUCKET-NAME> with the bucket you created in the previous step):

cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
./run_oncloud3.sh <PROJECT> <YOUR-BUCKET-NAME> JavaProjectsThatNeedHelp

Monitor the job from the GCP console from the Dataflow section.

Step 3

Once the pipeline has finished executing, download and view the output:

gsutil cp gs://<YOUR-BUCKET-NAME>/javahelp/output.csv .
head output.csv

Stop here if you are done. Wait for instructions from the Instructor before going into the next section

PART 7: STREAMING INTO BIGQUERY

Overview

Duration is 1 min

In this lab, you learn how to use Dataflow to aggregate records received in real time from Cloud Pub/Sub. The aggregate statistics are then streamed into BigQuery and can be analyzed even as the data are streaming in.

What you learn

In this lab, you learn how to:

  • Create Cloud Pub/Sub topic
  • Read from Pub/Sub in Dataflow
  • Compute windowed aggregates
  • Stream into BigQuery

Introduction

Duration is 1 min

The goal of this lab is to learn how to use Pub/Sub as a real-time streaming source for Dataflow, and BigQuery as a streaming sink.

Set up BigQuery and Pub/Sub

Duration is 3 min

Step 1

Open the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI. Next, click the blue arrow next to your project name (in the left-hand panel) and click Create new dataset; if you do not have a dataset named demos, create one.

Step 2

Back on your Cloud Console, visit the Pub/Sub section of the GCP Console and click Create Topic. Give your new topic the name streamdemo and select Create.

Explore the pipeline code

Duration is 10 min

Step 1

Start CloudShell and navigate to the directory for this lab:

cd ~/training-data-analyst/courses/data_analysis/lab2 

If this directory doesn't exist, you may need to git clone the repository:

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Step 2

View the pipeline code using nano and answer the following questions:

cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
nano src/main/java/com/google/cloud/training/dataanalyst/javahelp/StreamDemoConsumer.java

Step 3

What are the fields in the BigQuery table? _______________________________

Step 4

What is the pipeline source? ________________________________________________

Step 5

How often will aggregates be computed? ___________________________________________

Aggregates will be computed over what time period? _________________________________

Step 6

What aggregate is being computed in this pipeline? ____________________________

How would you change it to compute the average number of words in each message over the time period? ____________________________

Step 7

What is the output sink for the pipeline? ____________________________
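
A rough Python/Beam sketch with the same overall shape (Pub/Sub source, windowed aggregate, BigQuery sink) is shown below. It is not a translation of StreamDemoConsumer.java: the window size and period, the output schema, and the placeholder project name are all assumptions.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# For a real Dataflow run you would also pass --project, --runner,
# --temp_location, etc.; streaming=True is required for Pub/Sub input.
opts = PipelineOptions(streaming=True)

with beam.Pipeline(options=opts) as p:
    (p
     | 'Read'   >> beam.io.ReadFromPubSub(
            topic='projects/YOUR-PROJECT/topics/streamdemo')
     | 'Words'  >> beam.FlatMap(lambda msg: msg.decode('utf-8').split())
     | 'Ones'   >> beam.Map(lambda word: 1)
     | 'Window' >> beam.WindowInto(window.SlidingWindows(size=120, period=30))
     | 'Count'  >> beam.CombineGlobally(sum).without_defaults()
     | 'ToRow'  >> beam.Map(lambda n: {'num_words': n})
     | 'Write'  >> beam.io.WriteToBigQuery(
            'YOUR-PROJECT:demos.streamdemo',
            schema='num_words:INTEGER'))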

Execute the pipeline

Duration is 3 min

Step 1

If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.

Step 2

Execute the pipeline by typing the following (make sure to replace <PROJECT> with your Project ID and <YOUR-BUCKET-NAME> with the bucket you created in the previous step):

cd ~/training-data-analyst/courses/data_analysis/lab2/javahelp
./run_oncloud4.sh <PROJECT> <YOUR-BUCKET-NAME>

Monitor the job from the GCP console from the Dataflow section. Note that this pipeline will not exit.

Step 3

Visit the Pub/Sub section of GCP Console and click on your streamdemo topic. Notice that it has a Dataflow subscription. Click the Publish button, type in a message (any message), and click Publish.

Step 4

Publish a few more messages.
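
If you prefer to publish test messages from Python instead of the console UI, a minimal sketch with the google-cloud-pubsub client library follows (library installation and credentials are assumed; replace the placeholder project ID):

from google.cloud import pubsub_v1

# Sketch: publish a few test messages to the streamdemo topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('YOUR-PROJECT-ID', 'streamdemo')

for i in range(5):
    data = 'test message {}'.format(i).encode('utf-8')
    publisher.publish(topic_path, data).result()  # result() waits for the send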

Carry out streaming analytics

Duration is 3 min

Step 1

Open the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI. Compose a new query and type the following (change PROJECTID appropriately):

SELECT timestamp, num_words from [PROJECTID:demos.streamdemo] LIMIT 10

Clean up

Duration is 3 min

Step 1

Cancel the job from the GCP console from the Dataflow section.

Step 2

Delete the topic from the Pub/Sub section of GCP Console

Step 3

Delete the table from the left-panel of BigQuery console

©Google, Inc. or its affiliates. All rights reserved. Do not distribute.
