STREAMING DATA PROCESSING
GETTING STARTED WITH GCP CONSOLE
When the lab is ready, a green [Start Lab] button will appear.
When you are ready to begin, click Start Lab.
Logging in to Google Cloud Platform
Step 1: Locate the Username, Password and Project Id
Press the green [Start Lab] button to start the lab. After setup is complete, your Username, Password, and Project ID will appear on the right side of the Qwiklabs window.
Step 2: Browse to Console
Open an Incognito window in your browser.
Go to http://console.cloud.google.com
Step 3: Sign in to Console
Log in with the Username and Password provided. The steps below are representative; the actual dialogs and procedures may vary from this example.
Step 4: Accept the conditions
Accept the new account terms and conditions.
This is a temporary account. You will only have access to the account for this one lab.
- Do not add recovery options
- Do not sign up for free trials
Step 5: Don't change the password
If prompted, don't change the password. Just click [Continue].
Step 6: Agree to the Terms of Service
Select Yes for both options and click [AGREE AND CONTINUE].
Step 7: Console opens
The Google Cloud Platform Console opens.
You may see a bar across the top of the Console inviting you to sign up for a free trial. Click the [DISMISS] button so that the entire Console screen is available.
Step 8: Switch project (if necessary)
On the top blue horizontal bar, click the drop-down icon and select the correct project (if it is not already selected). You can confirm the project ID from your Qwiklabs window (shown in Step 1 above).
Click "view more projects" if necessary and select the correct project ID.
PART 1: PUBLISH STREAMING DATA INTO PUB/SUB
Overview
Duration is 1 min
Google Cloud Pub/Sub is a fully-managed real-time messaging service that allows you to send and receive messages between independent applications. Use Cloud Pub/Sub to publish and subscribe to data from multiple sources, then use Google Cloud Dataflow to understand your data, all in real time.
In this lab, you will simulate traffic sensor data and publish it into a Pub/Sub topic; later, a Dataflow pipeline will process those messages before they finally end up in a BigQuery table for further analysis.
What you learn
In this lab, you will learn how to:
- Create a Pubsub topic and subscription
- Simulate your traffic sensor data into Pubsub
Create PubSub topic and Subscription
Step 1
Open a new CloudShell window and navigate to the directory for this lab:
cd ~/training-data-analyst/courses/streaming/publish
If this directory doesn't exist, you may need to git clone the repository first:
cd ~
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd ~/training-data-analyst/courses/streaming/publish
Step 2
Run the following command to configure gcloud:
gcloud init
Note: When prompted, select option 1 to re-initialize the configuration. When prompted further, choose the correct account and project ID (check your Qwiklabs "Connect" tab to confirm).
Install the Cloud SDK beta command component:
gcloud components install beta
Step 3
Create your topic and publish a simple message:
gcloud beta pubsub topics create sandiego
gcloud beta pubsub topics publish sandiego "hello"
Step 4
Create a subscription for the topic:
gcloud beta pubsub subscriptions create --topic sandiego mySub1
Step 5
Pull the first message that was published to your topic:
gcloud beta pubsub subscriptions pull --auto-ack mySub1
Do you see any result? If not, why?
Step 6
Try to publish another message and then pull it using the subscription:
gcloud beta pubsub topics publish sandiego "hello again"
gcloud beta pubsub subscriptions pull --auto-ack mySub1
Did you get any response this time?
Step 7
Delete your subscription:
gcloud beta pubsub subscriptions delete mySub1
Simulate your traffic sensor data into PubSub
Step 1
Explore the Python script that simulates San Diego traffic sensor data:
nano send_sensor_data.py
Look at the simulate function. It lets the script behave as if traffic sensors were sending data to Pub/Sub in real time. The speedFactor parameter determines how fast the simulation runs; for example, speedFactor=60 replays the recorded data 60 times faster than real time.
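The simulator itself is Python, but the pacing idea is simple enough to sketch. The following Java fragment is purely conceptual and its names are hypothetical, not taken from send_sensor_data.py:
// With speedFactor = 60, one hour of recorded timestamps is replayed in one minute.
static void waitUntilDue(long eventMs, long firstEventMs, long startWallMs,
                         double speedFactor) throws InterruptedException {
  long simulatedElapsedMs = eventMs - firstEventMs;              // gap in the recorded data
  long targetWallMs = (long) (simulatedElapsedMs / speedFactor); // compressed gap
  long sleepMs = targetWallMs - (System.currentTimeMillis() - startWallMs);
  if (sleepMs > 0) {
    Thread.sleep(sleepMs); // publish the record once it is "due"
  }
}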
Step 2
Download the traffic dataset:
./download_data.sh
Step 3
To ensure the shell has the right permissions, run the following command:
gcloud auth application-default login
When you run the gcloud command, you will get a confirmation prompt. Enter Y to continue. Next, you will be given a URL, which you need to open in a browser tab.
You will then be prompted to select your account; click Next. The next page asks you to approve authorization; click Allow. Finally, you get a verification code: copy it and paste it back into the shell where you ran the gcloud command, at the prompt asking for the code.
Step 4
Once re-authenticated, run send_sensor_data.py:
./send_sensor_data.py --speedFactor=60
This command will send 1 hour of data in 1 minute.
Note:
- If you get google.gax.errors.RetryError: GaxError or "StatusCode.PERMISSION_DENIED, User not authorized to perform this action.", re-authenticate the shell and run the script again:
gcloud auth application-default login
./send_sensor_data.py --speedFactor=60
- If this fails because google.cloud.pubsub cannot be found, run the pip install below and then run send_sensor_data.py again:
sudo pip install google-cloud-pubsub
./send_sensor_data.py --speedFactor=60
- If you get a failure that the module pubsub has no attribute called Client, you are running into path problems because an older version of Pub/Sub is installed on your machine. The solution is to use virtualenv:
virtualenv cpb104
source cpb104/bin/activate
pip install google-cloud-pubsub
gcloud auth application-default login
Then try send_sensor_data.py again:
./send_sensor_data.py --speedFactor=60
Step 5
Create a new tab in Cloud Shell and change into the directory you were working in:
cd ~/training-data-analyst/courses/streaming/publish
Step 6
Create a subscription for the topic and do a pull to confirm that messages are coming in:
gcloud beta pubsub subscriptions create --topic sandiego mySub2
gcloud beta pubsub subscriptions pull --auto-ack mySub2
Confirm that you see a message with traffic sensor information.
Step 7
Delete this subscription:
gcloud beta pubsub subscriptions delete mySub2
In the next lab, you will run a Dataflow pipeline to read in all these messages and process them.
Step 8
Go to the Cloud Shell tab with the publisher and press Ctrl+C to stop it.
Stop here if you are done. Wait for instructions from the Instructor before going into the next section.
PART 2: STREAMING DATA PIPELINES
Overview
Duration is 1 min
In this lab you will use Dataflow to collect traffic events from simulated traffic sensor data made available through Google Cloud PubSub, process them into an actionable average, and store the raw data in BigQuery for later analysis. You will learn how to start a Dataflow pipeline, monitor it, and, lastly, optimize it.
What you learn
In this lab, you will learn how to:
- Launch Dataflow and run a Dataflow job
- Understand how data elements flow through the transformations of a Dataflow pipeline
- Connect Dataflow to Pub/Sub and BigQuery
- Observe and understand how Dataflow autoscaling adjusts compute resources to process input data optimally
- Learn where to find logging information created by Dataflow
- Explore metrics and create alerts and dashboards with Stackdriver Monitoring
Create BigQuery Dataset and Storage bucket
The Dataflow pipeline we will create later will write into a table in this dataset.
Step 1
Open the Google Cloud Console (in the incognito window) and, using the menu, navigate to the BigQuery web UI.
Step 2
Click the blue arrow to the right of your project name and choose Create new dataset.
Step 3
In the Create Dataset dialog, for Dataset ID, type demos and then click OK.
Step 4
If you don't already have a bucket on Cloud Storage, create one from the Storage section of the GCP console. Bucket names have to be globally unique.
Simulate your traffic sensor data into PubSub
Step 1
In Cloud Shell, start the script that reads the CSV data and publishes it to Pub/Sub:
cd ~/training-data-analyst/courses/streaming/publish
./send_sensor_data.py --speedFactor=60
This command will send 1 hour of data in 1 minute.
Note:
- If you get google.gax.errors.RetryError: GaxError or "StatusCode.PERMISSION_DENIED, User not authorized to perform this action.", re-authenticate the shell and run the script again:
gcloud auth application-default login
./send_sensor_data.py --speedFactor=60
- If this fails because google.cloud.pubsub cannot be found, run the pip install below and then run send_sensor_data.py again:
sudo pip install google-cloud-pubsub
./send_sensor_data.py --speedFactor=60
- If you get a failure that the module pubsub has no attribute called Client, you are running into path problems because an older version of Pub/Sub is installed on your machine. The solution is to use virtualenv:
virtualenv cpb104
source cpb104/bin/activate
pip install google-cloud-pubsub
gcloud auth application-default login
Then try send_sensor_data.py again:
./send_sensor_data.py --speedFactor=60
Launch Dataflow Pipeline
Duration is 9 min
Step 1
Open a new CloudShell window and navigate to the directory for this lab:
cd ~/training-data-analyst/courses/streaming/process/sandiego
If this directory doesn't exist, you may need to git clone the repository first:
cd ~
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd ~/training-data-analyst/courses/streaming/process/sandiego
Step 2
Explore the scripts that create and run a Dataflow pipeline in the cloud:
nano run_oncloud.sh
The script takes three required arguments (project id, bucket name, classname) and an optional fourth argument (options). We will cover the options argument in a later part of the lab.
project id: this is your GCP project
bucket name: this is the Cloud Storage bucket you created earlier
classname: there are 4 Java files that you can choose from; each reads the traffic data from Pub/Sub and runs different aggregations/computations. Go into the java directory and explore one of the files we will be using:
cd src/main/java/com/google/cloud/training/dataanalyst/sandiego
nano AverageSpeeds.java
What does the script do?
Step 3
Run the Dataflow pipeline to read from Pub/Sub and write into BigQuery:
cd ~/training-data-analyst/courses/streaming/process/sandiego
./run_oncloud.sh yourproject yourbucket AverageSpeeds
Note: make sure to plug in your project id and bucket name for the first and second arguments respectively.
Note: If you are on a free trial account, you might get an error about insufficient quota(s) to execute this workflow with 3 instances. If so, add an option to the command line that limits the number of workers, so that the Dataflow pipeline stays under quota. If you do this, though, you will not be able to observe autoscaling.
Explore the pipeline
Duration is 4 min
In this activity, you will learn more about the pipeline that you launched in the previous steps.
This Dataflow pipeline:
- reads messages from a Pub/Sub topic,
- parses the JSON of the input message and produces one main output,
- and writes into BigQuery.
Step 1
Go to the Dataflow Jobs page in the Cloud Console.
Step 2
Click on the pipeline you created in the lab; it will have your username in the pipeline name.
Step 3
Compare the code you saw earlier of the pipeline (AverageSpeeds.java) and the pipeline graph in the Cloud Console.
Step 4
Find the "GetMessages" pipeline step in the graph, and then find the corresponding code snippet in the AverageSpeeds.java file. This is the pipeline step that reads from the Pub/Sub topic. It creates a collection of Strings - the read Pub/Sub messages.
Do you see a subscription created?
How does the code pull messages from Pub/Sub?
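For orientation, reading from Pub/Sub with the Apache Beam Java SDK generally has the shape sketched below; the topic path and the pipeline variable p are assumptions, and AverageSpeeds.java is the authoritative version:
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

// p is the Pipeline object built earlier in the file (name assumed).
// When a pipeline reads from a topic, the Dataflow service creates its own
// subscription for the job, which is why you did not have to create one yourself.
PCollection<String> messages = p
    .apply("GetMessages",
        PubsubIO.readStrings().fromTopic("projects/YOUR_PROJECT/topics/sandiego"));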
Step 5
Find the "Time Window" pipeline step in the graph and in code. In this pipeline step we create a window of a duration specified in the pipeline parameters (sliding window in this case). This window will accumulate the traffic data from the previous step until end of window, and pass it to the next steps for further transforms.
What is the window interval? How often is a new window created?
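Answer these questions from the lab code itself; as a point of reference, a sliding-window step in the Beam Java SDK typically looks like the sketch below, where the durations and the LaneInfo element type are placeholders rather than the values used by AverageSpeeds.java:
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// laneInfo stands for the PCollection produced by the JSON-parsing step (name assumed).
// Placeholder values: a 60-minute window that slides forward every 30 minutes,
// so each element can belong to more than one window.
PCollection<LaneInfo> windowed = laneInfo
    .apply("TimeWindow",
        Window.<LaneInfo>into(SlidingWindows
            .of(Duration.standardMinutes(60))
            .every(Duration.standardMinutes(30))));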
Step 6
Find the "BySensor" and "AvgBySensor" pipeline steps in the graph, and then find the corresponding code snippet in the AverageSpeeds.java file. This "BySensor" does a grouping of all events in the window by sensor id, while "AvgBySensor" will then compute the mean speed for each grouping.
Step 7
Find the "ToBQRow" pipeline step in the graph and in code. In this step we simply create a "row" with the average computed from previous step together with the lane information. In this step, you can do other interesting things like maybe compare the calculated mean against a predefined threshold and log the results of the comparison, which you can later search for in Stackdriver Logging. In the later steps, we use the predefined metrics and look at the logging info.
Step 8
Lastly, find the "BigQueryIO.Write" in both the pipeline graph and in source code. In this step we are writing the row out of the pipeline into a BigQuery table. Because we chose the WriteDisposition.WRITE_APPEND write disposition, new records will be appended to the table.
Determine throughput rates
Duration is 3 min
One common activity when monitoring and improving Dataflow pipelines is figuring out how many elements the pipeline processes per second, what the system lag is, and how many data elements have been processed so far. In this activity you will learn where in the Cloud Console one can find information about processed elements and time.
Step 1
Go to the Dataflow Jobs page in the Cloud Console.
Step 2
Click on the pipeline you created in the lab.
Step 3
Select the "GetMessages" pipeline node in the graph and look at the step metrics on the right.
- System Lag is an important metric for streaming pipelines. It represents the amount of time data elements are waiting to be processed since they "arrived" in the input of the transformation step.
- The Elements Added metric under Output Collections tells you how many data elements exited this step (for the "Read PubSub Msg" step of our pipeline, it also represents the number of Pub/Sub messages read from the topic by the Pub/Sub IO connector).
Another important metric is the Step Throughput, measured in data elements per second. You will find it inside step nodes in the pipeline graph.
Step 4
Select the next pipeline node in the graph - "Time Window". Observe how the Elements Added metric under the Input Collections of the "Time Window" step matches the Elements Added metric under the Output Collections of the previous step "GetMessages". Generally speaking, output of Step N will be equal to the input of the next Step N+1.
Review BigQuery output
Duration is 4 min
Find the output tables in BigQuery and run commands to view written records.
Step 1
Open the Google Cloud Console (in the incognito window), use the menu to navigate to the BigQuery web UI, and explore the demos dataset. Note that streaming tables may not show up immediately; you can still query the tables, though.
Step 2
Use the following query to observe the output from your Dataflow job.
SELECT *
FROM [<PROJECTID>:demos.average_speeds]
ORDER BY timestamp DESC
LIMIT 100
Step 3
Find the last update to the table by running the following SQL:
SELECT
MAX(timestamp)
FROM
[<PROJECTID>:demos.average_speeds]
Step 4
Use the BigQuery Table Decorator to look at results in the last 10 minutes:
SELECT
*
FROM
[<PROJECTID>:demos.average_speeds@-600000]
ORDER BY
timestamp DESC
Note: If you get a BigQuery Invalid Snapshot Time error (the table is newer than the decorator interval), try reducing the 600000 to a smaller value, such as 60000.
Observe and understand autoscaling
Duration is 4 min
In this activity, we will observe how Dataflow scales the number of workers to process the backlog of incoming Pub/Sub messages.
Step 1
Go to the Dataflow Jobs page in the Cloud Console.
Step 2
Click on the pipeline you created in the lab. Find the Summary panel on the right, and review the Autoscaling panel. Check how many workers are currently being used to process messages in the Pub/Sub topic.
Step 3
Click on "See More History" link and review how many workers were used at different points in time during the pipeline execution.
Step 4
The traffic sensor simulator you started at the beginning of the lab publishes hundreds of messages per second to the Pub/Sub topic. This will cause Dataflow to increase the number of workers in order to keep the system lag of the pipeline at optimal levels.
In the "Worker History" screen, observe how Dataflow changed the number of workers, and the rationale for these decisions in the history table.
Monitor pipelines
Duration is 2 min
Note: Dataflow / Stackdriver Monitoring Integration is currently available as part of an Early Access Program (EAP). Features and behavior are not final and will change as we move towards General Availability.
Dataflow integration with Stackdriver Monitoring allows users to access Dataflow job metrics such as System Lag (for streaming jobs), Job Status (Failed, Successful), Element Counts, and User Counters from within Stackdriver.
You can also employ Stackdriver alerting capabilities to get notified of a variety of conditions such as long streaming system lag or failed jobs.
Dataflow / Stackdriver Monitoring Integration allows you to:
- Explore Dataflow Metrics: Browse through available Dataflow pipeline metrics (see next section for a list of metrics) and visualize them in charts.
- Chart Dataflow metrics in Stackdriver Dashboards: Create Dashboards and chart time series of Dataflow metrics.
- Configure Alerts: Define thresholds on job or resource group-level metrics and alert when these metrics reach specified values.
- Monitor User-Defined Metrics: In addition to Dataflow metrics, Dataflow exposes user-defined metrics (SDK Aggregators) as Stackdriver custom counters in the Monitoring UI, available for charting and alerting.
Monitor pipelines (cont'd)
Duration is 2 min
What Dataflow pipeline metrics are available in Stackdriver?
Some of the more important metrics Dataflow provides are:
- Job status: Job status (Failed, Successful), reported as an enum every 30 secs and on update.
- Elapsed time: Job elapsed time (measured in seconds), reported every 30 secs.
- System lag: Max lag across the entire pipeline, reported in seconds.
- Current vCPU count: Current # of virtual CPUs used by job and updated on value change.
- Estimated byte count: Number of bytes processed per PCollection. Note: This is a per-PCollection metric, not a job-level metric, so it is not yet available for alerting.
What are user-defined metrics?
Any Aggregator defined in a Dataflow pipeline will be reported to Stackdriver as a custom metric. Dataflow will define a new custom metric on behalf of the user and report incremental updates to Stackdriver approximately every 30 secs.
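In current Apache Beam Java SDKs, Aggregators have been replaced by the Metrics API, but the idea is the same: a counter you bump inside a DoFn shows up as a user-defined metric for the job. A minimal sketch, with a hypothetical namespace, counter name, and threshold:
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

public class CountSlowReadings extends DoFn<Double, Double> {
  // Surfaced as a user counter for the job; on Dataflow it appears as a custom metric.
  private final Counter slowReadings = Metrics.counter("sandiego", "slow_readings");

  @ProcessElement
  public void processElement(ProcessContext c) {
    if (c.element() < 20.0) {
      slowReadings.inc(); // incremental updates are reported roughly every 30 secs
    }
    c.output(c.element());
  }
}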
Explore metrics
Duration is 3 min
Step 1
Navigate to Stackdriver Monitoring and go to Resources > Metrics Explorer
Note: If this is your first time trying out Stackdriver for this project, you may need to set it up for your account. Just follow the prompts to activate the 30-day trial.
Step 2
In the Metrics Explorer, find and select the dataflow_job resource type. You should now see a list of Dataflow-related metrics you can choose from.
Step 3
Select a metric you want to observe for one of your jobs. The pipeline you launched at the beginning of the lab is a streaming pipeline, and one of the more important metrics of streaming pipelines is System Lag. Select System Lag as the metric to observe the system lag of the streaming pipeline you launched.
Step 4
Stackdriver will populate a list of jobs running in our lab project on the right side of the page. Select your pipeline and observe the progress of the metric over time.
Create alerts
Duration is 4 min
If you want to be notified when a certain metric crosses a specified threshold (for example, when System Lag of our lab streaming pipeline increases above a predefined value), you could use the Alerting mechanisms of Stackdriver to accomplish that.
Step 1
On the Stackdriver Monitoring page, navigate to the Alerting menu and select Policies Overview.
Step 2
Click on Add Policy.
Step 3
The "Create new Alerting Policy" page allows you to define the alerting conditions and the channels of communication for alerts. For example, to set an alert on the System Lag for our lab pipeline group, do the following:
- click on "Add Condition",
- click on "Select" under Metric Threshold,
- select "Dataflow Job" in the Resource Type dropdown,
- select "Single" in the "Applies To" dropdown,
- select the group you created in the previous step,
- select "Any Member Violates" in the "Condition Triggers If" dropdown,
- select "System Lag" in the "If Metric" dropdown, and
- select Condition "above" a Threshold of "5" seconds.
Click on Save Condition to save the alert.
Step 4
Add a Notifications channel, give the policy a name, and click on "Save Policy".
Step 5
After you create an alert, you can review the events related to Dataflow on the Alerting > Events page. Every time an alert is triggered by a Metric Threshold condition, an Incident and a corresponding Event are created in Stackdriver. If you specified a notification mechanism in the alert (email, SMS, pager, etc.), you will also receive a notification.
Set up dashboards
Duration is 5 min
You can easily build dashboards with the most relevant Dataflow-related charts with Stackdriver Monitoring Dashboards.
Step 1
On the Stackdriver Monitoring page, go to the Dashboards menu and select "Create Dashboard".
Step 2
Click on Add Chart.
Step 3
On the Add Chart page:
- select "Dataflow Job" as the Resource Type,
- select a metric you want to chart in the Metric Type field (e.g. System Lag),
- in the Filter panel, select a group that you created in one of the previous steps and that contains your Dataflow pipeline,
- click "Save".
If you would like, you can add more charts to the dashboard, for example, Pub/Sub publish rates on the topic, or subscription backlog (which is a signal to the Dataflow autoscaler).
Launch another streaming pipeline
Duration is 9 min
Step 1
Go back to the CloudShell where you ran the first Dataflow pipeline.
Run the CurrentConditions Java code in a new Dataflow pipeline; this code is simpler in the sense that it does not do as many transforms as AverageSpeeds. We will use the results in the next lab to build dashboards and run some transforms (functions) while retrieving the data from BigQuery.
cd ~/training-data-analyst/courses/streaming/process/sandiego
./run_oncloud.sh yourproject yourbucket CurrentConditions
Step 2
Go to the Dataflow Jobs page in the Cloud Console and confirm you see the pipeline job listed. Further ensure that it is running (no errors).
Stop here if you are done. Wait for instructions from the Instructor before going into the next section.
PART 3: STREAMING ANALYTICS AND DASHBOARDS
Overview
Duration is 1 min
Data visualization tools can help you make sense of your BigQuery data and help you analyze the data interactively. You can use visualization tools to help you identify trends, respond to them, and make predictions using your data. In this lab, you use Google Data Studio to visualize data in the BigQuery table populated by your Dataflow pipeline in the previous exercise.
What you learn
In this lab, you:
- Connect to a BigQuery data source
- Create reports and charts to visualize BigQuery data
Creating a data source
Duration is 10 min
In this section of the lab, you use Google Data Studio to visualize data in BigQuery using the BigQuery connector. You create a data source, a report, and charts that visualize data in the sample table.
The first step in creating a report in Data Studio is to create a data source for the report. A report may contain one or more data sources. When you create a BigQuery data source, Data Studio uses the BigQuery connector.
You must have the appropriate permissions in order to add a BigQuery data source to a Data Studio report. In addition, the permissions applied to BigQuery datasets will apply to the reports, charts, and dashboards you create in Data Studio. When a Data Studio report is shared, the report components are visible only to users who have appropriate permissions.
To create a data source:
Step 1
Open Google Data Studio.
Step 2
On the Reports page, in the Start a new report section, click the Blank template. This creates a new untitled report.
Step 3
If prompted, click I accept the terms and conditions and then click Accept. You may need to click the Blank template again after agreeing to the terms and conditions.
Step 4
In the Add a data source window, click Create new data source.
Step 5
For Connectors, click BigQuery.
Step 6
For Authorization, click Authorize. This allows Data Studio access to your GCP project.
Step 7
In the Request for permission dialog, click Allow to give Data Studio the ability to view data in BigQuery. You may not receive this prompt if you previously used Data Studio.
Step 8
Select My Projects, then click on your project name
Step 9
For Dataset, click demos.
Step 10
For Table, click current_conditions.
Step 11
If you need to specify a Billing Project, then select your GCP project.
Step 12
In the upper right corner of the window, click Connect.
Step 13
Once Data Studio has connected to the BigQuery data source, the table's fields are displayed. You can use this page to adjust the field properties or to create new calculated fields. Click Create report.
Step 14
When prompted, click Add to report.
Step 15
In the Request for permission dialog, click Allow to give Data Studio the ability to view and manage files in Google Drive. You may not receive this prompt if you previously used Data Studio.
Creating a bar chart using a calculated field
Duration is 15 min
Introduction
Once you have added the current_conditions data source to the report, the next step is to create a visualization. Begin by creating a bar chart. The bar chart displays the total number of vehicles captured for each highway. To display this, you create a calculated field as follows:
Step 1
(Optional) At the top of the page, click Untitled Report and type a new name for the report.
Step 2
When the report editor loads, click Insert > Bar chart.
Step 3
Using the handle, draw a rectangle on the report to display the chart.
Step 4
In the Bar chart properties window, on the Data tab, notice the value for Data Source (current_conditions) and the default values for Dimension and Metric.
Step 5
If Dimension is not already set to highway, change it: in the Dimension section, click the existing dimension.
Step 6
In the Dimension picker, select highway.
Step 7
Click the back arrow to close the Dimension picker.
Step 8
In the Metric section, click the existing metric.
Step 9
In the Metric picker, click Create new metric.
Step 10
Click Create a calculated field. To display a count of the number of vehicles using each highway, you create a calculated field. For this lab, you count the entries in the sensorId field; the actual value is irrelevant, we just need the number of occurrences.
Step 11
For Name, type vehicles.
Step 12
Leave the ID unchanged.
Step 13
For Formula, type the following (or use the formula assistant): COUNT(sensorId).
Step 14
Click Create field.
Step 15
Click Done.
Step 16
In the Metric picker, select vehicles.
Step 17
Click the back arrow to close the Metric picker. The Dimension should be set to highway and the Metric should be set to vehicles. Notice the chart is sorted in Descending order by default: the highway with the most vehicles is displayed first.
Step 18
To enhance the chart, change the bar labels. In the Bar chart properties window, click the Style tab.
Step 19
In the Bar chart section, check Show data labels.
The total number of vehicles is displayed above each bar in the chart.
Creating a chart using a custom query
Duration is 15 min
Introduction
Because Data Studio does not allow aggregations on metrics, some report components are easier to generate using a custom SQL query. The Custom Query option also lets you leverage BigQuery's full query capabilities such as joins, unions, and analytical functions.
Alternatively, you can leverage BigQuery's full query capabilities by creating a view. A view is a virtual table defined by a SQL query. You can query data in a view by adding the dataset containing the view as a data source.
When you specify a SQL query as your BigQuery data source, the results of the query are in table format, which becomes the field definition (schema) for your data source. When you use a custom query as a data source, Data Studio uses your SQL as an inner select statement for each generated query to BigQuery. For more information on custom queries in Data Studio, consult the online help.
To add a bar chart to your report that uses a custom query data source:
Step 1
Click Insert > Bar chart.
Step 2
Using the handle, draw a rectangle on the report to display the chart.
Step 3
In the Bar chart properties window, on the Data tab, notice that the value for Data Source (current_conditions) and the default values for Dimension and Metric are the same as for the previous chart. In the Data Source section, click Select data source.
Step 4
Click Create new data source.
Step 5
For Connectors, click BigQuery.
Step 6
For My Projects, click Custom query.
Step 7
For Project, select your GCP project.
Step 8
Type the following in the Enter custom query window:
SELECT max(speed) as maxspeed, min(speed) as minspeed, avg(speed) as avgspeed, highway FROM [<PROJECTID>:demos.current_conditions] group by highway
This query uses the max/min/avg functions to compute the maximum, minimum, and average speed for each highway.
Step 9
At the top of the window, click Untitled data source, and change the data source name to San Diego highway traffic summary.
Step 10
In the upper right corner of the window, click Connect. Once Data Studio has connected to the BigQuery data source, the results of the query are used to determine the table schema.
Step 11
When the schema is displayed, notice the type and aggregation for each field.
Step 12
Click Add to report.
Step 13
When prompted, click Add to report.
Step 14
Data Studio may be unable to determine the appropriate Dimension and Metrics for the chart, which results in an error. In the Bar chart properties, on the Data tab, in the Dimension section, click Invalid metric.
Step 15
In the Metric picker, select maxspeed.
Step 16
Click the back arrow to close the Metric picker.
Step 17
In the Metric section, click Add a metric.
Step 18
In the Metric picker, select minspeed.
Step 19
Click the back arrow to close the Metric picker.
Step 20
In the Metric section, click Add a metric.
Step 21
In the Metric picker, select avgspeed.
Step 22
Click the back arrow to close the Metric picker. Your chart now displays the max speed, minimum speed and average speed for each highway.
Step 23
For readability, change the chart styles. In the Bar chart properties, click the Style tab.
Step 24
In the Bar chart section, deselect Single color.
Step 25
Notice each bar has a default color based on the order the metrics were added to the chart.
Viewing your query history
Duration is 3 min
Introduction
You can view queries submitted via the BigQuery Connector by examining your query history in the BigQuery web interface. Using the query history, you can estimate query costs, and you can save queries for use in other scenarios.
To examine your query history:
Step 1
In the Google Cloud Console, use the menu to navigate to the BigQuery web UI and click Query History. (Note: you may need to refresh the BigQuery web UI.)
Step 2
The list of queries is displayed with the most recent queries first. Click Open Query to view details on the query such as Job ID and Bytes Processed.
Stop here if you are done. Wait for instructions from the Instructor before going into the next section.
PART 4: STREAMING DATA PIPELINES INTO BIGTABLE
Overview
Duration is 1 min
In this lab you will use Dataflow to collect traffic events from simulated traffic sensor data made available through Google Cloud PubSub, and write them into a Bigtable table.
What you learn
In this lab, you will learn how to:
- Launch Dataflow pipeline to read from PubSub and write into Bigtable
- Open an HBase shell to query the Bigtable data
Simulate your traffic sensor data into PubSub
Step 1
In Cloud Shell, run the script to download and unzip the quickstart files (you will later use these to run the HBase shell)
cd ~/training-data-analyst/courses/streaming/process/sandiego
./install_quickstart.sh
Step 2
In Cloud Shell, start the script that reads the CSV data and publishes it to Pub/Sub:
cd ~/training-data-analyst/courses/streaming/publish
./send_sensor_data.py --speedFactor=30
This command will send 1 hour of data in 2 minutes
Note:
- If you get google.gax.errors.RetryError: GaxError or "StatusCode.PERMISSION_DENIED, User not authorized to perform this action.", re-authenticate the shell and run the script again:
gcloud auth application-default login
./send_sensor_data.py --speedFactor=30
- If this fails because google.cloud.pubsub cannot be found, run the pip install below and then run send_sensor_data.py again:
sudo pip install google-cloud-pubsub
./send_sensor_data.py --speedFactor=30
- If you get a failure that the module pubsub has no attribute called Client, you are running into path problems because an older version of Pub/Sub is installed on your machine. The solution is to use virtualenv:
virtualenv cpb104
source cpb104/bin/activate
pip install google-cloud-pubsub
gcloud auth application-default login
Then try send_sensor_data.py again:
./send_sensor_data.py --speedFactor=30
Launch Dataflow Pipeline
Duration is 9 min
Step 1
Open a new CloudShell window and navigate to the directory for this lab:
cd ~/training-data-analyst/courses/streaming/process/sandiego
If this directory doesn't exist, you may need to git clone the repository first:
cd ~
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd ~/training-data-analyst/courses/streaming/process/sandiego
Step 2
Authenticate the shell so that it has the right permissions for the pipeline you will run later:
gcloud auth application-default login
Step 3
Explore the scripts that create and run a Dataflow pipeline in the cloud:
nano run_oncloud.sh
The script takes three required arguments (project id, bucket name, classname) and an optional fourth argument (options). In this part of the lab, we will use the --bigtable option, which directs the pipeline to write into Cloud Bigtable.
Example: ./run_oncloud.sh qwiklabs-gcp-123456 my-bucket1 CurrentConditions --bigtable
cd src/main/java/com/google/cloud/training/dataanalyst/sandiego
nano CurrentConditions.java
What does the script do?
Step 4
Run the script below to create the Bigtable instance
cd ~/training-data-analyst/courses/streaming/process/sandiego
./create_cbt.sh
Step 5
Run the Dataflow pipeline to read from PubSub and write into Cloud Bigtable
cd ~/training-data-analyst/courses/streaming/process/sandiego
./run_oncloud.sh yourproject yourbucket CurrentConditions --bigtable
Note: make sure to plug in your project id and bucket name for the first and second arguments respectively.
Explore the pipeline
Duration is 4 min
In this activity, you will learn more about the pipeline you just launched that writes into Bigtable
Step 1
Go to the Dataflow Jobs page in the Cloud Console.
Step 2
Click on the pipeline you created in the lab; it will have "currentconditions" followed by your username in the pipeline name.
Step 3
Find the "write:cbt" step in the pipeline graph, and click on the down arrow on the right to see the writer in action. Review the Bigtable options in the step summary.
Query Bigtable data
Step 1
Back at the cloud shell, run the quickstart.sh script to launch the HBase shell:
cd ~/training-data-analyst/courses/streaming/process/sandiego/quickstart
./quickstart.sh
If the script runs successfully, you will be at an HBase shell prompt.
Step 2
At the HBase shell prompt, type the following query to retrieve 2 rows from your Bigtable table that was populated by the pipeline.
scan 'current_conditions', {'LIMIT' => 2}
Review the output. Notice each row is broken into column,timestamp,value combinations.
Step 3
Let's run another query. This time, we look only at the lane:speed column, limit the output to 10 rows, and also specify row key patterns for the start and end rows to scan over.
scan 'current_conditions', {'LIMIT' => 10, STARTROW => '15#S#1', ENDROW => '15#S#999', COLUMN => 'lane:speed'}
Review the output. Notice that you see 10 of the column,timestamp,value combinations, all of which correspond to Highway 15. Also notice that column is restricted to lane:speed.
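Those prefix scans work because the pipeline packs several attributes into one compound row key. Below is a hypothetical sketch of how such a key might be assembled; the exact fields and their order are assumptions, not taken from the lab code:
String highway = "15";
String direction = "S";
String milepost = "32.7";            // position along the highway (assumed field)
long timestampSeconds = 1507356000L; // event time (assumed field)

// Rows for the same highway and direction sort next to each other, which is why
// STARTROW '15#S#1' and ENDROW '15#S#999' return only Highway 15 readings.
String rowKey = highway + "#" + direction + "#" + milepost + "#" + timestampSeconds;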
Step 4
Feel free to run other queries if you are familiar with the syntax. Once you're satisfied, type quit to exit the shell.
quit
Cleanup
Step 1
Run the script to delete your Bigtable instance
cd ~/training-data-analyst/courses/streaming/process/sandiego
./delete_cbt.sh
Step 2
On your Dataflow page in the Cloud Console, click on the pipeline job name and click Stop job in the right panel.
Step 3
Go back to the first Cloud Shell tab with the publisher and press Ctrl+C to stop it.
Step 4
Go to the BigQuery console and delete the demos dataset.