STREAMING DATA PROCESSING
GETTING STARTED WITH GCP CONSOLE
When the lab is ready, a green [Start Lab] button will appear.
When you are ready to begin, click Start Lab.
Logging in to Google Cloud Platform
Step 1: Locate the Username, Password and Project Id
Press the green [Start Lab] button to start the lab. After setup is complete, your Username, Password, and Project ID will appear on the right side of the Qwiklabs window.
Step 2: Browse to Console
Open an Incognito window in your browser.
Go to http://console.cloud.google.com
Step 3: Sign in to Console
Log in with the Username and Password provided. The steps below are representative; the actual dialogs and procedures may vary from this example.
Step 4: Accept the conditions
Accept the new account terms and conditions.
This is a temporary account. You will only have access to the account for this one lab.
- Do not add recovery options
- Do not sign up for free trials
Step 5: Don't change the password
If prompted, don't change the password. Just click [Continue].
Step 6: Agree to the Terms of Service
Select Yes for both options and click [AGREE AND CONTINUE].
Step 7: Console opens
The Google Cloud Platform Console opens.
You may see a bar across the top of the Console inviting you to sign up for a free trial. Click the [DISMISS] button so that the entire Console screen is available.
Step 8: Switch project (if necessary)
On the top blue horizontal bar, click the drop-down icon and select the correct project (if it is not already selected). You can confirm the project ID from your Qwiklabs window (shown in Step 1 above).
Click "view more projects" if necessary and select the correct project ID.
PART 1: PUBLISH STREAMING DATA INTO PUB/SUB
Overview
Duration is 1 min
Google Cloud Pub/Sub is a fully-managed real-time messaging service that allows you to send and receive messages between independent applications. Use Cloud Pub/Sub to publish and subscribe to data from multiple sources, then use Google Cloud Dataflow to understand your data, all in real time.
In this lab, you will simulate traffic sensor data and publish it into a Pub/Sub topic; later, a Dataflow pipeline will process those messages before they finally end up in a BigQuery table for further analysis.
What you learn
In this lab, you will learn how to:
- Create a Pubsub topic and subscription
- Simulate your traffic sensor data into Pubsub
Create PubSub topic and Subscription
Step 1
Open a new CloudShell window and navigate to the directory for this lab:
cd ~/training-data-analyst/courses/streaming/publish
If this directory doesn't exist, you may need to git clone the repository first:
cd ~
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd ~/training-data-analyst/courses/streaming/publish
Step 2
Run the following command to configure gcloud:
gcloud init
Note: When prompted, select option 1 to re-initialize the configuration. When prompted further, choose the correct account and project ID (check your Qwiklabs "Connect" tab to confirm).
Install the Cloud SDK beta command component:
gcloud components install beta
Step 3
Create your topic and publish a simple message:
gcloud beta pubsub topics create sandiego
gcloud beta pubsub topics publish sandiego "hello"
Step 4
Create a subscription for the topic:
gcloud beta pubsub subscriptions create --topic sandiego mySub1
Step 5
Pull the first message that was published to your topic:
gcloud beta pubsub subscriptions pull --auto-ack mySub1
Do you see any result? If not, why?
Step 6
Try to publish another message and then pull it using the subscription:
gcloud beta pubsub topics publish sandiego "hello again"
gcloud beta pubsub subscriptions pull --auto-ack mySub1
Did you get any response this time?
Step 7
Delete your subscription:
gcloud beta pubsub subscriptions delete mySub1
Simulate your traffic sensor data into PubSub
Step 1
Explore the Python script that simulates San Diego traffic sensor data:
nano send_sensor_data.py
Look at the simulate function. It lets the script behave as if traffic sensors were sending data to Pub/Sub in real time. The speedFactor parameter determines how fast the simulation runs; for example, speedFactor=60 replays the recorded data 60 times faster than real time.
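The simulator itself is Python, but the pacing idea is simple enough to sketch. The following Java fragment is purely conceptual and its names are hypothetical, not taken from send_sensor_data.py:
// With speedFactor = 60, one hour of recorded timestamps is replayed in one minute.
static void waitUntilDue(long eventMs, long firstEventMs, long startWallMs,
                         double speedFactor) throws InterruptedException {
  long simulatedElapsedMs = eventMs - firstEventMs;              // gap in the recorded data
  long targetWallMs = (long) (simulatedElapsedMs / speedFactor); // compressed gap
  long sleepMs = targetWallMs - (System.currentTimeMillis() - startWallMs);
  if (sleepMs > 0) {
    Thread.sleep(sleepMs); // publish the record once it is "due"
  }
}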
Step 2
Download the traffic dataset:
./download_data.sh
Step 3
To ensure the shell has the right permissions, run the following command:
gcloud auth application-default login
When you run the gcloud command, you will get a confirmation prompt. Enter Y to continue. Next, you will be given a URL, which you need to open in a browser tab.
You will then be prompted to select your account; click Next. The next page asks you to approve authorization; click Allow. Finally, you get a verification code: copy it and paste it back into the shell where you ran the gcloud command, at the prompt asking for the code.
Step 4
Once re-authenticated, run send_sensor_data.py:
./send_sensor_data.py --speedFactor=60
This command will send 1 hour of data in 1 minute.
Note:
- If you get google.gax.errors.RetryError: GaxError or "StatusCode.PERMISSION_DENIED, User not authorized to perform this action.", re-authenticate the shell and run the script again:
gcloud auth application-default login
./send_sensor_data.py --speedFactor=60
- If this fails because google.cloud.pubsub cannot be found, run the pip install below and then run send_sensor_data.py again:
sudo pip install google-cloud-pubsub
./send_sensor_data.py --speedFactor=60
- If you get a failure that the module pubsub has no attribute called Client, you are running into path problems because an older version of Pub/Sub is installed on your machine. The solution is to use virtualenv:
virtualenv cpb104
source cpb104/bin/activate
pip install google-cloud-pubsub
gcloud auth application-default login
Then try send_sensor_data.py again:
./send_sensor_data.py --speedFactor=60
Step 5
Create a new tab in Cloud Shell and change into the directory you were working in:
cd ~/training-data-analyst/courses/streaming/publish
Step 6
Create a subscription for the topic and do a pull to confirm that messages are coming in:
gcloud beta pubsub subscriptions create --topic sandiego mySub2
gcloud beta pubsub subscriptions pull --auto-ack mySub2
Confirm that you see a message with traffic sensor information.
Step 7
Delete this subscription:
gcloud beta pubsub subscriptions delete mySub2
In the next lab, you will run a Dataflow pipeline to read in all these messages and process them.
Step 8
Go to the Cloud Shell tab with the publisher and press Ctrl+C to stop it.
Stop here if you are done. Wait for instructions from the Instructor before going into the next section.
PART 2: STREAMING DATA PIPELINES
Overview
Duration is 1 min
In this lab you will use Dataflow to collect traffic events from simulated traffic sensor data made available through Google Cloud PubSub, process them into an actionable average, and store the raw data in BigQuery for later analysis. You will learn how to start a Dataflow pipeline, monitor it, and, lastly, optimize it.
What you learn
In this lab, you will learn how to:
- Launch Dataflow and run a Dataflow job
- Understand how data elements flow through the transformations of a Dataflow pipeline
- Connect Dataflow to Pub/Sub and BigQuery
- Observe and understand how Dataflow autoscaling adjusts compute resources to process input data optimally
- Learn where to find logging information created by Dataflow
- Explore metrics and create alerts and dashboards with Stackdriver Monitoring
Create BigQuery Dataset and Storage bucket
The Dataflow pipeline we will create later will write into a table in this dataset.
Step 1
Open the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI.
Step 2
Click the blue arrow to the right of your project name and choose Create new dataset.
Step 3
In the Create Dataset dialog, for Dataset ID, type demos and then click OK.
Step 4
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
Simulate your traffic sensor data into PubSub
Step 1
In Cloud Shell, start the script that reads the CSV data and publishes it to Pub/Sub:
cd ~/training-data-analyst/courses/streaming/publish
./send_sensor_data.py --speedFactor=60
This command will send 1 hour of data in 1 minute.
Note:
- If you get google.gax.errors.RetryError: GaxError or "StatusCode.PERMISSION_DENIED, User not authorized to perform this action.", re-authenticate the shell and run the script again:
gcloud auth application-default login
./send_sensor_data.py --speedFactor=60
- If this fails because google.cloud.pubsub cannot be found, run the pip install below and then run send_sensor_data.py again:
sudo pip install google-cloud-pubsub
./send_sensor_data.py --speedFactor=60
- If you get a failure that the module pubsub has no attribute called Client, you are running into path problems because an older version of Pub/Sub is installed on your machine. The solution is to use virtualenv:
virtualenv cpb104
source cpb104/bin/activate
pip install google-cloud-pubsub
gcloud auth application-default login
Then try send_sensor_data.py again:
./send_sensor_data.py --speedFactor=60
Launch Dataflow Pipeline
Duration is 9 min
Step 1
Open a new CloudShell window and navigate to the directory for this lab:
cd ~/training-data-analyst/courses/streaming/process/sandiego
If this directory doesn't exist, you may need to git clone the repository first:
cd ~
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd ~/training-data-analyst/courses/streaming/process/sandiego
Step 2
Explore the scripts that create and run a Dataflow pipeline in the cloud:
nano run_oncloud.sh
The script takes three required arguments (project id, bucket name, classname) and an optional fourth argument (options). We will cover the options argument in a later part of the lab.
project id: this is your GCP project
bucket name: this is the Cloud Storage bucket you created earlier
classname: there are 4 Java files that you can choose from; each reads the traffic data from Pub/Sub and runs different aggregations/computations. Go into the java directory and explore one of the files we will be using:
cd src/main/java/com/google/cloud/training/dataanalyst/sandiego
nano AverageSpeeds.java
What does the script do?
Step 3
Run the Dataflow pipeline to read from Pub/Sub and write into BigQuery:
cd ~/training-data-analyst/courses/streaming/process/sandiego
./run_oncloud.sh yourproject yourbucket AverageSpeeds
Note: make sure to plug in your project id and bucket name for the first and second arguments respectively.
Note: If you are on a free trial account, you might get an error about insufficient quota(s) to execute this workflow with 3 instances. If so, add an option to the command line that limits the number of workers, so that the Dataflow pipeline stays under quota. If you do this, though, you will not be able to observe autoscaling.
Explore the pipeline
Duration is 4 min
In this activity, you will learn more about the pipeline that you launched in the previous steps.
This Dataflow pipeline:
- reads messages from a Pub/Sub topic,
- parses the JSON of the input message and produces one main output,
- and writes into BigQuery.
Step 1
Go to the Dataflow Jobs page in the Cloud Console.
Step 2
Click on the pipeline you created in the lab; it will have your username in the pipeline name.
Step 3
Compare the code you saw earlier of the pipeline (AverageSpeeds.java) and the pipeline graph in the Cloud Console.
Step 4
Find the "GetMessages" pipeline step in the graph, and then find the corresponding code snippet in the AverageSpeeds.java file. This is the pipeline step that reads from the Pub/Sub topic. It creates a collection of Strings - the read Pub/Sub messages.
Do you see a subscription created?
How does the code pull messages from Pub/Sub?
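For orientation, reading from Pub/Sub with the Apache Beam Java SDK generally has the shape sketched below; the topic path and the pipeline variable p are assumptions, and AverageSpeeds.java is the authoritative version:
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

// p is the Pipeline object built earlier in the file (name assumed).
// When a pipeline reads from a topic, the Dataflow service creates its own
// subscription for the job, which is why you did not have to create one yourself.
PCollection<String> messages = p
    .apply("GetMessages",
        PubsubIO.readStrings().fromTopic("projects/YOUR_PROJECT/topics/sandiego"));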
Step 5
Find the "Time Window" pipeline step in the graph and in code. In this pipeline step we create a window of a duration specified in the pipeline parameters (sliding window in this case). This window will accumulate the traffic data from the previous step until end of window, and pass it to the next steps for further transforms.
What is the window interval? How often is a new window created?
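Answer these questions from the lab code itself; as a point of reference, a sliding-window step in the Beam Java SDK typically looks like the sketch below, where the durations and the LaneInfo element type are placeholders rather than the values used by AverageSpeeds.java:
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// laneInfo stands for the PCollection produced by the JSON-parsing step (name assumed).
// Placeholder values: a 60-minute window that slides forward every 30 minutes,
// so each element can belong to more than one window.
PCollection<LaneInfo> windowed = laneInfo
    .apply("TimeWindow",
        Window.<LaneInfo>into(SlidingWindows
            .of(Duration.standardMinutes(60))
            .every(Duration.standardMinutes(30))));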
Step 6
Find the "BySensor" and "AvgBySensor" pipeline steps in the graph, and then find the corresponding code snippet in the AverageSpeeds.java file. This "BySensor" does a grouping of all events in the window by sensor id, while "AvgBySensor" will then compute the mean speed for each grouping.
Step 7
Find the "ToBQRow" pipeline step in the graph and in code. In this step we simply create a "row" with the average computed from previous step together with the lane information. In this step, you can do other interesting things like maybe compare the calculated mean against a predefined threshold and log the results of the comparison, which you can later search for in Stackdriver Logging. In the later steps, we use the predefined metrics and look at the logging info.
Step 8
Lastly, find the "BigQueryIO.Write" in both the pipeline graph and in source code. In this step we are writing the row out of the pipeline into a BigQuery table. Because we chose the WriteDisposition.WRITE_APPEND write disposition, new records will be appended to the table.
Determine throughput rates
Duration is 3 min
One common activity when monitoring and improving Dataflow pipelines is figuring out how many elements the pipeline processes per second, what the system lag is, and how many data elements have been processed so far. In this activity you will learn where in the Cloud Console one can find information about processed elements and time.
Step 1
Go to the Dataflow Jobs page in the Cloud Console.
Step 2
Click on the pipeline you created in the lab.
Step 3
Select the "GetMessages" pipeline node in the graph and look at the step metrics on the right.
- System Lag is an important metric for streaming pipelines. It represents the amount of time data elements are waiting to be processed since they "arrived" in the input of the transformation step.
- The Elements Added metric under Output Collections tells you how many data elements exited this step (for the "Read PubSub Msg" step of our pipeline, it also represents the number of Pub/Sub messages read from the topic by the Pub/Sub IO connector).
Another important metric is the Step Throughput, measured in data elements per second. You will find it inside step nodes in the pipeline graph.
Step 4
Select the next pipeline node in the graph - "Time Window". Observe how the Elements Added metric under the Input Collections of the "Time Window" step matches the Elements Added metric under the Output Collections of the previous step "GetMessages". Generally speaking, output of Step N will be equal to the input of the next Step N+1.
Review BigQuery output
Duration is 4 min
Find the output tables in BigQuery and run commands to view written records.
Step 1
Open the Google Cloud Console (in the incognito window), use the menu to navigate to the BigQuery web UI, and explore the demos dataset. Note that streaming tables may not show up immediately; you can still query the tables, though.
Step 2
Use the following query to observe the output from your Dataflow job.
SELECT *
FROM [<PROJECTID>:demos.average_speeds]
ORDER BY timestamp DESC
LIMIT 100
Step 3
Find the last update to the table by running the following SQL:
SELECT
MAX(timestamp)
FROM
[<PROJECTID>:demos.average_speeds]
Step 4
Use the BigQuery Table Decorator to look at results in the last 10 minutes:
SELECT
*
FROM
[<PROJECTID>:demos.average_speeds@-600000]
ORDER BY
timestamp DESC
Note: If you get a BigQuery Invalid Snapshot Time error (the table is newer than the decorator interval), try reducing the 600000 to a smaller value, such as 60000.
Observe and understand autoscaling
Duration is 4 min
In this activity, we will observe how Dataflow scales the number of workers to process the backlog of incoming Pub/Sub messages.
Step 1
Go to the Dataflow Jobs page in the Cloud Console.
Step 2
Click on the pipeline you created in the lab. Find the Summary panel on the right, and review the Autoscaling panel. Check how many workers are currently being used to process messages in the Pub/Sub topic.
Step 3
Click on "See More History" link and review how many workers were used at different points in time during the pipeline execution.
Step 4
The traffic sensor simulator you started at the beginning of the lab publishes hundreds of messages per second to the Pub/Sub topic. This will cause Dataflow to increase the number of workers in order to keep the system lag of the pipeline at optimal levels.
In the "Worker History" screen, observe how Dataflow changed the number of workers, and the rationale for these decisions in the history table.
Monitor pipelines
Duration is 2 min
Note: Dataflow / Stackdriver Monitoring Integration is currently available as part of an Early Access Program (EAP). Features and behavior are not final and will change as we move towards General Availability.
Dataflow integration with Stackdriver Monitoring allows users to access Dataflow job metrics such as System Lag (for streaming jobs), Job Status (Failed, Successful), Element Counts, and User Counters from within Stackdriver.
You can also employ Stackdriver alerting capabilities to get notified of a variety of conditions such as long streaming system lag or failed jobs.
Dataflow / Stackdriver Monitoring Integration allows you to:
- Explore Dataflow Metrics: Browse through available Dataflow pipeline metrics (see next section for a list of metrics) and visualize them in charts.
- Chart Dataflow metrics in Stackdriver Dashboards: Create Dashboards and chart time series of Dataflow metrics.
- Configure Alerts: Define thresholds on job or resource group-level metrics and alert when these metrics reach specified values.
- Monitor User-Defined Metrics: In addition to Dataflow metrics, Dataflow exposes user-defined metrics (SDK Aggregators) as Stackdriver custom counters in the Monitoring UI, available for charting and alerting.
Monitor pipelines (cont'd)
Duration is 2 min
What Dataflow pipeline metrics are available in Stackdriver?
Some of the more important metrics Dataflow provides are:
- Job status: Job status (Failed, Successful), reported as an enum every 30 secs and on update.
- Elapsed time: Job elapsed time (measured in seconds), reported every 30 secs.
- System lag: Max lag across the entire pipeline, reported in seconds.
- Current vCPU count: Current # of virtual CPUs used by job and updated on value change.
- Estimated byte count: Number of bytes processed per PCollection. Note: This is a per-PCollection metric, not a job-level metric, so it is not yet available for alerting.
What are user-defined metrics?
Any Aggregator defined in a Dataflow pipeline will be reported to Stackdriver as a custom metric. Dataflow will define a new custom metric on behalf of the user and report incremental updates to Stackdriver approximately every 30 secs.
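In current Apache Beam Java SDKs, Aggregators have been replaced by the Metrics API, but the idea is the same: a counter you bump inside a DoFn shows up as a user-defined metric for the job. A minimal sketch, with a hypothetical namespace, counter name, and threshold:
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

public class CountSlowReadings extends DoFn<Double, Double> {
  // Surfaced as a user counter for the job; on Dataflow it appears as a custom metric.
  private final Counter slowReadings = Metrics.counter("sandiego", "slow_readings");

  @ProcessElement
  public void processElement(ProcessContext c) {
    if (c.element() < 20.0) {
      slowReadings.inc(); // incremental updates are reported roughly every 30 secs
    }
    c.output(c.element());
  }
}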
Explore metrics
Duration is 3 min
Step 1
Navigate to Stackdriver Monitoring and go to Resources > Metrics Explorer
Note: If this is your first time trying out Stackdriver for this project, you may need to set it up for your account. Just follow the prompts to activate the 30-day trial.
Step 2
In the Metrics Explorer, find and select the dataflow_job resource type. You should now see a list of Dataflow-related metrics you can choose from.
Step 3
Select a metric you want to observe for one of your jobs. The pipeline you launched at the beginning of the lab is a streaming pipeline, and one of the more important metrics of streaming pipelines is System Lag. Select System Lag as the metric to observe the system lag of the streaming pipeline you launched.
Step 4
Stackdriver will populate a list of jobs running in our lab project on the right side of the page. Select your pipeline and observe the progress of the metric over time.
Create alerts
Duration is 4 min
If you want to be notified when a certain metric crosses a specified threshold (for example, when System Lag of our lab streaming pipeline increases above a predefined value), you could use the Alerting mechanisms of Stackdriver to accomplish that.
Step 1
On the Stackdriver Monitoring page, navigate to the Alerting menu and select Policies Overview.
Step 2
Click on Add Policy.
Step 3
The "Create new Alerting Policy" page allows you to define the alerting conditions and the channels of communication for alerts. For example, to set an alert on the System Lag for our lab pipeline group, do the following:
- click on "Add Condition",
- click on "Select" under Metric Threshold,
- select "Dataflow Job" in the Resource Type dropdown,
- select "Single" in the "Applies To" dropdown,
- select the group you created in the previous step,
- select "Any Member Violates" in the "Condition Triggers If" dropdown,
- select "System Lag" in the "If Metric" dropdown, and
- select Condition "above" a Threshold of "5" seconds.
Click on Save Condition to save the alert.
Step 4
Add a Notifications channel, give the policy a name, and click on "Save Policy".
Step 5
After you create an alert, you can review the events related to Dataflow on the Alerting > Events page. Every time an alert is triggered by a Metric Threshold condition, an Incident and a corresponding Event are created in Stackdriver. If you specified a notification mechanism in the alert (email, SMS, pager, etc.), you will also receive a notification.
Set up dashboards
Duration is 5 min
You can easily build dashboards with the most relevant Dataflow-related charts with Stackdriver Monitoring Dashboards.
Step 1
On the Stackdriver Monitoring page, go to the Dashboards menu and select "Create Dashboard".
Step 2
Click on Add Chart.
Step 3
On the Add Chart page:
- select "Dataflow Job" as the Resource Type,
- select a metric you want to chart in the Metric Type field (e.g. System Lag),
- in the Filter panel, select a group that you created in one of the previous steps and that contains your Dataflow pipeline,
- click "Save".
If you would like, you can add more charts to the dashboard, for example, Pub/Sub publish rates on the topic, or subscription backlog (which is a signal to the Dataflow autoscaler).
Launch another streaming pipeline
Duration is 9 min
Step 1
Go back to the CloudShell where you ran the first Dataflow pipeline.
Run the CurrentConditions Java code in a new Dataflow pipeline; this code is simpler in the sense that it does not do as many transforms as AverageSpeeds. We will use the results in the next lab to build dashboards and run some transforms (functions) while retrieving the data from BigQuery.
cd ~/training-data-analyst/courses/streaming/process/sandiego
./run_oncloud.sh yourproject yourbucket CurrentConditions
Step 2
Go to the Dataflow Jobs page in the Cloud Console and confirm you see the pipeline job listed. Further ensure that it is running (no errors).
Stop here if you are done. Wait for instructions from the Instructor before going into the next section.
PART 3: STREAMING ANALYTICS AND DASHBOARDS
Overview
Duration is 1 min
Data visualization tools can help you make sense of your BigQuery data and help you analyze the data interactively. You can use visualization tools to help you identify trends, respond to them, and make predictions using your data. In this lab, you use Google Data Studio to visualize data in the BigQuery table populated by your Dataflow pipeline in the previous exercise.
What you learn
In this lab, you:
- Connect to a BigQuery data source
- Create reports and charts to visualize BigQuery data
Creating a data source
Duration is 10 min
In this section of the lab, you use Google Data Studio to visualize data in BigQuery using the BigQuery connector. You create a data source, a report, and charts that visualize data in the sample table.
The first step in creating a report in Data Studio is to create a data source for the report. A report may contain one or more data sources. When you create a BigQuery data source, Data Studio uses the BigQuery connector.
You must have the appropriate permissions in order to add a BigQuery data source to a Data Studio report. In addition, the permissions applied to BigQuery datasets will apply to the reports, charts, and dashboards you create in Data Studio. When a Data Studio report is shared, the report components are visible only to users who have appropriate permissions.
To create a data source:
Step 1
Open Google Data Studio.
Step 2
On the Reports page, in the Start a new report section, click the Blank template. This creates a new untitled report.
Step 3
If prompted, click I accept the terms and conditions and then click Accept. You may need to click the Blank template again after agreeing to the terms and conditions.
Step 4
In the Add a data source window, click Create new data source.
Step 5
For Connectors, click BigQuery.
Step 6
For Authorization, click Authorize. This allows Data Studio access to your GCP project.
Step 7
In the Request for permission dialog, click Allow to give Data Studio the ability to view data in BigQuery. You may not receive this prompt if you previously used Data Studio.
Step 8
Select My Projects, then click on your project name
Step 9
For Dataset, click demos.
Step 10
For Table, click current_conditions.
Step 11
If you need to specify a Billing Project, then select your GCP project.
Step 12
In the upper right corner of the window, click Connect.
Step 13
Once Data Studio has connected to the BigQuery data source, the table's fields are displayed. You can use this page to adjust the field properties or to create new calculated fields. Click Create report.
Step 14
When prompted, click Add to report.
Step 15
In the Request for permission dialog, click Allow to give Data Studio the ability to view and manage files in Google Drive. You may not receive this prompt if you previously used Data Studio.
Creating a bar chart using a calculated field
Duration is 15 min
Introduction
Once you have added the current_conditions data source to the report, the next step is to create a visualization. Begin by creating a bar chart. The bar chart displays the total number of vehicles captured for each highway. To display this, you create a calculated field as follows:
Step 1
(Optional) At the top of the page, click Untitled Report and type a new name for the report.
Step 2
When the report editor loads, click Insert > Bar chart.
Step 3
Using the handle, draw a rectangle on the report to display the chart.
Step 4
In the Bar chart properties window, on the Data tab, notice the value for Data Source (current_conditions) and the default values for Dimension and Metric.
Step 5
If Dimension is not already set to highway, change it: in the Dimension section, click the existing dimension.
Step 6
In the Dimension picker, select highway.
Step 7
Click the back arrow to close the Dimension picker.
Step 8
In the Metric section, click the existing metric.
Step 9
In the Metric picker, click Create new metric.
Step 10
Click Create a calculated field. To display a count of the number of vehicles using each highway, you create a calculated field. For this lab, you count the entries in the sensorId field; the actual value is irrelevant, we just need the number of occurrences.
Step 11
For Name, type vehicles.
Step 12
Leave the ID unchanged.
Step 13
For Formula, type the following (or use the formula assistant): COUNT(sensorId).
Step 14
Click Create field.
Step 15
Click Done.
Step 16
In the Metric picker, select vehicles.
Step 17
Click the back arrow to close the Metric picker. The Dimension should be set to highway and the Metric should be set to vehicles. Notice the chart is sorted in Descending order by default: the highway with the most vehicles is displayed first.
Step 18
To enhance the chart, change the bar labels. In the Bar chart properties window, click the Style tab.
Step 19
In the Bar chart section, check Show data labels.
The total number of vehicles is displayed above each bar in the chart.
Creating a chart using a custom query
Duration is 15 min
Introduction
Because Data Studio does not allow aggregations on metrics, some report components are easier to generate using a custom SQL query. The Custom Query option also lets you leverage BigQuery's full query capabilities such as joins, unions, and analytical functions.
Alternatively, you can leverage BigQuery's full query capabilities by creating a view. A view is a virtual table defined by a SQL query. You can query data in a view by adding the dataset containing the view as a data source.
When you specify a SQL query as your BigQuery data source, the results of the query are in table format, which becomes the field definition (schema) for your data source. When you use a custom query as a data source, Data Studio uses your SQL as an inner select statement for each generated query to BigQuery. For more information on custom queries in Data Studio, consult the online help.
To add a bar chart to your report that uses a custom query data source:
Step 1
Click Insert > Bar chart.
Step 2
Using the handle, draw a rectangle on the report to display the chart.
Step 3
In the Bar chart properties window, on the Data tab, notice that the value for Data Source (current_conditions) and the default values for Dimension and Metric are the same as for the previous chart. In the Data Source section, click Select data source.
Step 4
Click Create new data source.
Step 5
For Connectors, click BigQuery.
Step 6
For My Projects, click Custom query.
Step 7
For Project, select your GCP project.
Step 8
Type the following in the Enter custom query window:
SELECT max(speed) as maxspeed, min(speed) as minspeed, avg(speed) as avgspeed, highway FROM [<PROJECTID>:demos.current_conditions] group by highway
This query uses the max/min/avg functions to compute the maximum, minimum, and average speed for each highway.
Step 9
At the top of the window, click Untitled data source, and change the data source name to San Diego highway traffic summary.
Step 10
In the upper right corner of the window, click Connect. Once Data Studio has connected to the BigQuery data source, the results of the query are used to determine the table schema.
Step 11
When the schema is displayed, notice the type and aggregation for each field.
Step 12
Click Add to report.
Step 13
When prompted, click Add to report.
Step 14
Data Studio may be unable to determine the appropriate Dimension and Metrics for the chart, which results in an error. In the Bar chart properties, on the Data tab, in the Dimension section, click Invalid metric.
Step 15
In the Metric picker, select maxspeed.
Step 16
Click the back arrow to close the Metric picker.
Step 17
In the Metric section, click Add a metric.
Step 18
In the Metric picker, select minspeed.
Step 19
Click the back arrow to close the Metric picker.
Step 20
In the Metric section, click Add a metric.
Step 21
In the Metric picker, select avgspeed.
Step 22
Click the back arrow to close the Metric picker. Your chart now displays the max speed, minimum speed and average speed for each highway.
Step 23
For readability, change the chart styles. In the Bar chart properties, click the Style tab.
Step 24
In the Bar chart section, deselect Single color.
Step 25
Notice each bar has a default color based on the order the metrics were added to the chart.
Viewing your query history
Duration is 3 min
Introduction
You can view queries submitted via the BigQuery Connector by examining your query history in the BigQuery web interface. Using the query history, you can estimate query costs, and you can save queries for use in other scenarios.
To examine your query history:
Step 1
In the Google Cloud Console, use the menu to navigate to the BigQuery web UI and click Query History. (Note: you may need to refresh the BigQuery web UI.)
Step 2
The list of queries is displayed with the most recent queries first. Click Open Query to view details on the query such as Job ID and Bytes Processed.
Stop here if you are done. Wait for instructions from the Instructor before going into the next section.
PART 4: STREAMING DATA PIPELINES INTO BIGTABLE
Overview
Duration is 1 min
In this lab you will use Dataflow to collect traffic events from simulated traffic sensor data made available through Google Cloud PubSub, and write them into a Bigtable table.
What you learn
In this lab, you will learn how to:
- Launch Dataflow pipeline to read from PubSub and write into Bigtable
- Open an HBase shell to query the Bigtable data
Simulate your traffic sensor data into PubSub
Step 1
In Cloud Shell, run the script to download and unzip the quickstart files (you will later use these to run the HBase shell)
cd ~/training-data-analyst/courses/streaming/process/sandiego
./install_quickstart.sh
Step 2
In Cloud Shell, start the script that reads the CSV data and publishes it to Pub/Sub:
cd ~/training-data-analyst/courses/streaming/publish
./send_sensor_data.py --speedFactor=30
This command will send 1 hour of data in 2 minutes
Note:
- If you get google.gax.errors.RetryError: GaxError or "StatusCode.PERMISSION_DENIED, User not authorized to perform this action.", re-authenticate the shell and run the script again:
gcloud auth application-default login
./send_sensor_data.py --speedFactor=30
- If this fails because google.cloud.pubsub cannot be found, run the pip install below and then run send_sensor_data.py again:
sudo pip install google-cloud-pubsub
./send_sensor_data.py --speedFactor=30
- If you get a failure that the module pubsub has no attribute called Client, you are running into path problems because an older version of Pub/Sub is installed on your machine. The solution is to use virtualenv:
virtualenv cpb104
source cpb104/bin/activate
pip install google-cloud-pubsub
gcloud auth application-default login
Then try send_sensor_data.py again:
./send_sensor_data.py --speedFactor=30
Launch Dataflow Pipeline
Duration is 9 min
Step 1
Open a new CloudShell window and navigate to the directory for this lab:
cd ~/training-data-analyst/courses/streaming/process/sandiego
If this directory doesn't exist, you may need to git clone the repository first:
cd ~
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd ~/training-data-analyst/courses/streaming/process/sandiego
Step 2
Authenticate the shell so that it has the right permissions for the pipeline you will run later:
gcloud auth application-default login
Step 3
Explore the scripts that create and run a Dataflow pipeline in the cloud:
nano run_oncloud.sh
The script takes three required arguments (project id, bucket name, classname) and an optional fourth argument (options). In this part of the lab, we will use the --bigtable option, which directs the pipeline to write into Cloud Bigtable.
Example: ./run_oncloud.sh qwiklabs-gcp-123456 my-bucket1 CurrentConditions --bigtable
cd src/main/java/com/google/cloud/training/dataanalyst/sandiego
nano CurrentConditions.java
What does the script do?
Step 4
Run the script below to create the Bigtable instance
cd ~/training-data-analyst/courses/streaming/process/sandiego
./create_cbt.sh
Step 5
Run the Dataflow pipeline to read from PubSub and write into Cloud Bigtable
cd ~/training-data-analyst/courses/streaming/process/sandiego
./run_oncloud.sh yourproject yourbucket CurrentConditions --bigtable
Note: make sure to plug in your project id and bucket name for the first and second arguments respectively.
Explore the pipeline
Duration is 4 min
In this activity, you will learn more about the pipeline you just launched that writes into Bigtable
Step 1
Go to the Dataflow Jobs page in the Cloud Console.
Step 2
Click on the pipeline you created in the lab; it will have "currentconditions" followed by your username in the pipeline name.
Step 3
Find the "write:cbt" step in the pipeline graph, and click on the down arrow on the right to see the writer in action. Review the Bigtable options in the step summary.
Query Bigtable data
Step 1
Back at the cloud shell, run the quickstart.sh script to launch the HBase shell:
cd ~/training-data-analyst/courses/streaming/process/sandiego/quickstart
./quickstart.sh
If the script runs successfully, you will be at an HBase shell prompt.
Step 2
At the HBase shell prompt, type the following query to retrieve 2 rows from your Bigtable table that was populated by the pipeline.
scan 'current_conditions', {'LIMIT' => 2}
Review the output. Notice each row is broken into column,timestamp,value combinations.
Step 3
Let's run another query. This time, we look only at the lane:speed column, limit the output to 10 rows, and also specify row key patterns for the start and end rows to scan over.
scan 'current_conditions', {'LIMIT' => 10, STARTROW => '15#S#1', ENDROW => '15#S#999', COLUMN => 'lane:speed'}
Review the output. Notice that you see 10 of the column,timestamp,value combinations, all of which correspond to Highway 15. Also notice that column is restricted to lane:speed.
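Those prefix scans work because the pipeline packs several attributes into one compound row key. Below is a hypothetical sketch of how such a key might be assembled; the exact fields and their order are assumptions, not taken from the lab code:
String highway = "15";
String direction = "S";
String milepost = "32.7";            // position along the highway (assumed field)
long timestampSeconds = 1507356000L; // event time (assumed field)

// Rows for the same highway and direction sort next to each other, which is why
// STARTROW '15#S#1' and ENDROW '15#S#999' return only Highway 15 readings.
String rowKey = highway + "#" + direction + "#" + milepost + "#" + timestampSeconds;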
Step 4
Feel free to run other queries if you are familiar with the syntax. Once you're satisfied, type quit to exit the shell.
quit
Cleanup
Step 1
Run the script to delete your Bigtable instance
cd ~/training-data-analyst/courses/streaming/process/sandiego
./delete_cbt.sh
Step 2
On your Dataflow page in the Cloud Console, click on the pipeline job name and click Stop job in the right panel.
Step 3
Go back to the first Cloud Shell tab with the publisher and press Ctrl+C to stop it.
Step 4
Go to the BigQuery console and delete the demos dataset.