mirror of
https://github.com/autistic-symposium/tensorflow-for-deep-learning-py.git
synced 2025-05-11 11:14:57 -04:00
648 lines
No EOL
18 KiB
Text
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "intro_to_pandas.ipynb",
"provenance": [],
"collapsed_sections": [
"JndnmDMp66FL",
"YHIWvc9Ms-Ll",
"TJffr5_Jwqvd"
]
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "JndnmDMp66FL"
},
"source": [
"#### Copyright 2017 Google LLC."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "hMqWDc_m6rUC",
"cellView": "both",
"colab": {}
},
"source": [
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "rHLcriKWLRe4"
},
"source": [
"# Intro to pandas"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "QvJBqX8_Bctk"
},
"source": [
"**Learning Objectives:**\n",
" * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library\n",
" * Access and manipulate data within a `DataFrame` and `Series`\n",
" * Import CSV data into a *pandas* `DataFrame`\n",
" * Reindex a `DataFrame` to shuffle data"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "TIFJ83ZTBctl"
},
"source": [
"[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.\n",
"Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "s_JOISVgmn9v"
},
"source": [
"## Basic Concepts\n",
"\n",
"The following line imports the *pandas* API and prints the API version:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "aSRYu62xUi3g",
"colab": {}
},
"source": [
"from __future__ import print_function\n",
"\n",
"import pandas as pd\n",
"pd.__version__"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "daQreKXIUslr"
},
"source": [
"The primary data structures in *pandas* are implemented as two classes:\n",
"\n",
" * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.\n",
" * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.\n",
"\n",
"The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html)."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "fjnAk1xcU0yc"
},
"source": [
"One way to create a `Series` is to construct a `Series` object. For example:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "DFZ42Uq7UFDj",
"colab": {}
},
"source": [
"pd.Series(['San Francisco', 'San Jose', 'Sacramento'])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "U5ouUp1cU6pC"
},
"source": [
"`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "avgr6GfiUh8t",
"colab": {}
},
"source": [
"city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n",
"population = pd.Series([852469, 1015785, 485199])\n",
"\n",
"pd.DataFrame({ 'City name': city_names, 'Population': population })"
],
"execution_count": 0,
"outputs": []
},
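{
"cell_type": "markdown",
"metadata": {},
"source": [
"The example above uses two `Series` of equal length, so no missing values appear. The cell below is a minimal sketch of the mismatched-length case; the names `names` and `short_population` are hypothetical and not part of the original notebook. Because rows are aligned on the index, the row with no population value should be filled with `NaN`:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical data: the population Series is one entry shorter than the names.\n",
"names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n",
"short_population = pd.Series([852469, 1015785])\n",
"\n",
"# Rows are aligned on the index, so index 2 has no population value (NaN).\n",
"pd.DataFrame({ 'City name': names, 'Population': short_population })"
],
"execution_count": 0,
"outputs": []
},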
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "oa5wfZT7VHJl"
},
"source": [
"Most of the time, however, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and display summary statistics:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "av6RYOraVG1V",
"colab": {}
},
"source": [
"california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
"california_housing_dataframe.describe()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "WrkBjfz5kEQu"
},
"source": [
"The example above used `DataFrame.describe` to show interesting statistics about a `DataFrame`. Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "s3ND3bgOkB5k",
"colab": {}
},
"source": [
"california_housing_dataframe.head()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "w9-Es5Y6laGd"
},
"source": [
"Another powerful feature of *pandas* is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "nqndFVXVlbPN",
"colab": {}
},
"source": [
"california_housing_dataframe.hist('housing_median_age')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "XtYZ7114n3b-"
},
"source": [
"## Accessing Data\n",
"\n",
"You can access `DataFrame` data using familiar Python dict/list operations:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "_TFm7-looBFF",
"colab": {}
},
"source": [
"cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n",
"print(type(cities['City name']))\n",
"cities['City name']"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "V5L6xacLoxyv",
"colab": {}
},
"source": [
"print(type(cities['City name'][1]))\n",
"cities['City name'][1]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "gcYX1tBPugZl",
"colab": {}
},
"source": [
"print(type(cities[0:2]))\n",
"cities[0:2]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "65g1ZdGVjXsQ"
},
"source": [
"In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "RM1iaD-ka3Y1"
},
"source": [
"## Manipulating Data\n",
"\n",
"You may apply Python's basic arithmetic operations to `Series`. For example:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "XWmyCFJ5bOv-",
"colab": {}
},
"source": [
"population / 1000."
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "TQzIVnbnmWGM"
},
"source": [
"[NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. *pandas* `Series` can be used as arguments to most NumPy functions:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "ko6pLK6JmkYP",
"colab": {}
},
"source": [
"import numpy as np\n",
"\n",
"np.log(population)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "xmxFuQmurr6d"
},
"source": [
"For more complex single-column transformations, you can use `Series.apply`. Like the Python [map function](https://docs.python.org/2/library/functions.html#map), \n",
"`Series.apply` accepts as an argument a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), which is applied to each value.\n",
"\n",
"The example below creates a new `Series` that indicates whether `population` is over one million:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "Fc1DvPAbstjI",
"colab": {}
},
"source": [
"population.apply(lambda val: val > 1000000)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ZeYYLoV9b9fB"
},
"source": [
"\n",
"Modifying a `DataFrame` is also straightforward. For example, the following code adds two `Series` to an existing `DataFrame`:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "0gCEX99Hb8LR",
"colab": {}
},
"source": [
"cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n",
"cities['Population density'] = cities['Population'] / cities['Area square miles']\n",
"cities"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "6qh63m-ayb-c"
},
"source": [
"## Exercise #1\n",
"\n",
"Modify the `cities` table by adding a new boolean column that is True if and only if *both* of the following are True:\n",
"\n",
" * The city is named after a saint.\n",
" * The city has an area greater than 50 square miles.\n",
"\n",
"**Note:** Boolean `Series` are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing *logical and*, use `&` instead of `and`.\n",
"\n",
"**Hint:** \"San\" in Spanish means \"saint.\""
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "zCOn8ftSyddH",
"colab": {}
},
"source": [
"# Your code here"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "YHIWvc9Ms-Ll"
},
"source": [
"### Solution\n",
"\n",
"Click below for a solution."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "T5OlrqtdtCIb",
"colab": {}
},
"source": [
"cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n",
"cities"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "f-xAOJeMiXFB"
},
"source": [
"## Indexes\n",
"Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row. \n",
"\n",
"By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "2684gsWNinq9",
"colab": {}
},
"source": [
"city_names.index"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "F_qPe2TBjfWd",
"colab": {}
},
"source": [
"cities.index"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "hp2oWY9Slo_h"
},
"source": [
"Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "sN0zUzSAj-U1",
"colab": {}
},
"source": [
"cities.reindex([2, 0, 1])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-GQFz8NZuS06"
},
"source": [
"Reindexing is a great way to shuffle (randomize) a `DataFrame`. In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a shuffled copy of its values and leaves the original index unchanged. Calling `reindex` with this shuffled array causes the `DataFrame` rows to be shuffled in the same way.\n",
"Try running the following cell multiple times!"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "mF8GC0k8uYhz",
"colab": {}
},
"source": [
"cities.reindex(np.random.permutation(cities.index))"
],
"execution_count": 0,
"outputs": []
},
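{
"cell_type": "markdown",
"metadata": {},
"source": [
"One detail worth checking when shuffling this way: `np.random.permutation` returns a new array and does not modify its input (the in-place variant is `np.random.shuffle`). The cell below is a small sketch using a hypothetical `frame` and `shuffled_index`, neither of which is part of the original notebook, to show that the original index keeps its order while the reindexed result follows the shuffled order:"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical three-row frame, used only for this illustration.\n",
"frame = pd.DataFrame({ 'value': [10, 20, 30] })\n",
"\n",
"# permutation() returns a new, shuffled array of the index values.\n",
"shuffled_index = np.random.permutation(frame.index)\n",
"\n",
"# The original index is untouched; only the returned array is shuffled.\n",
"print(list(frame.index))\n",
"print(list(shuffled_index))\n",
"\n",
"# reindex() returns a new DataFrame whose rows follow the shuffled order.\n",
"frame.reindex(shuffled_index)"
],
"execution_count": 0,
"outputs": []
},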
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "fSso35fQmGKb"
},
"source": [
"For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects)."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "8UngIdVhz8C0"
},
"source": [
"## Exercise #2\n",
"\n",
"The `reindex` method allows index values that are not in the original `DataFrame`'s index values. Try it and see what happens if you use such values! Why do you think this is allowed?"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "PN55GrDX0jzO",
"colab": {}
},
"source": [
"# Your code here"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "TJffr5_Jwqvd"
},
"source": [
"### Solution\n",
"\n",
"Click below for the solution."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "8oSvi2QWwuDH"
},
"source": [
"If your `reindex` input array includes values not in the original `DataFrame` index values, `reindex` will add new rows for these \"missing\" indices and populate all corresponding columns with `NaN` values:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "yBdkucKCwy4x",
"colab": {}
},
"source": [
"cities.reindex([0, 4, 5, 2])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "2l82PhPbwz7g"
},
"source": [
"This behavior is desirable because indexes are often strings pulled from the actual data (see the [*pandas* reindex\n",
"documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) for an example\n",
"in which the index values are browser names).\n",
"\n",
"In this case, allowing \"missing\" indices makes it easy to reindex using an external list, as you don't have to worry about\n",
"sanitizing the input."
]
}
]
}