|
9 | 9 | "source": [
|
10 | 10 | "# Exploring data using Pandas\n",
|
11 | 11 | "\n",
|
| 12 | + "<div class=\"alert alert-info\">\n", |
| 13 | + "\n", |
| 14 | + "Finnish university students are encouraged to use the CSC Notebooks platform.<br/>\n", |
| 15 | + "<a href=\"https://notebooks.csc.fi/#/blueprint/7e62ac3bddf74483b7ac7333721630e2\"><img alt=\"CSC badge\" src=\"https://img.shields.io/badge/launch-CSC%20notebook-blue.svg\" style=\"vertical-align:text-bottom\"></a>\n", |
| 16 | + "\n", |
| 17 | + "Others can follow the lesson and fill in their student notebooks using Binder.<br/>\n", |
| 18 | + "<a href=\"https://mybinder.org/v2/gh/geo-python/notebooks/master?urlpath=lab/tree/L4/functions.ipynb\"><img alt=\"Binder badge\" src=\"https://img.shields.io/badge/launch-binder-red.svg\" style=\"vertical-align:text-bottom\"></a>\n", |
| 19 | + "\n", |
| 20 | + "</div>\n", |
| 21 | + "\n", |
12 | 22 | "Our first task in this week's lesson is to learn how to **read and explore data files in Python**. We will focus on using [pandas](https://pandas.pydata.org/pandas-docs/stable/) which is an open-source package for data analysis in Python. Pandas is an excellent toolkit for working with **real world data** that often have a tabular structure (rows and columns).\n",
|
13 | 23 | "\n",
|
14 |
| - "We will first get familiar with **pandas data structures**: *DataFrame* and *Series*:\n", |
| 24 | + "We will first get familiar with the **pandas data structures**: *DataFrame* and *Series*:\n", |
15 | 25 | "\n",
|
16 | 26 | "\n",
|
17 | 27 | "\n",
|
|
24 | 34 | "\n",
|
25 | 35 | "As you can see, both DataFrames and Series in pandas have an index that can be used to select values, but they also have column labels to identify columns in DataFrames. In the lesson this week we'll use many of these features to explore real-world data and learn some useful data analysis procedures.\n",
|
26 | 36 | "\n",
|
27 |
| - "For a comprehensive overview of pandas data structures you can have a look at Chapter 5 in Wes MacKinney's book [Python for Data Analysis (2nd Edition, 2017)](https://geo-python.github.io/site/course-info/resources.html#books) and [Pandas online documentation about data structures](https://pandas.pydata.org/pandas-docs/stable/dsintro.html).\n", |
| 37 | + "For a comprehensive overview of pandas data structures you can have a look at Chapter 5 in Wes McKinney's book [Python for Data Analysis (2nd Edition, 2017)](https://geo-python-site.readthedocs.io/en/latest/course-info/resources.html#books) and the [pandas online documentation about data structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html).\n", |
28 | 38 | "\n",
|
29 | 39 | "<div class=\"alert alert-info\">\n",
|
30 | 40 | "\n",
|
31 | 41 | "**Note**\n",
|
32 |
| - " \n", |
33 |
| - "Pandas is a \"high-level\" package, which means that it makes use of several other packages, such as [NumPy](https://numpy.org/), in the background. There are several ways in which data can be read from a file in Python, and this year we have decided to focus primarily on pandas because it is easy-to-use, efficient and intuitive. If you are curoius about other approaches for interacting with data files, you can find lesson materials from previous years about reading data using [NumPy](https://geo-python.github.io/site/2018/notebooks/L5/numpy/1-Exploring-data-using-numpy.html#Reading-a-data-file-with-NumPy) or [built-in Python functions](https://geo-python.github.io/site/2017/lessons/L5/reading-data-from-file.html). \n", |
34 |
| - " \n", |
35 |
| - "</div>\n", |
36 | 42 | "\n",
|
| 43 | + "Pandas is a \"high-level\" package, which means that it makes use of several other packages, such as [NumPy](https://numpy.org/), in the background. There are several ways in which data can be read from a file in Python, and this year we have decided to focus primarily on pandas because it is easy to use, efficient, and intuitive. If you are curious about other approaches for interacting with data files, you can find lesson materials from previous years about reading data using [NumPy](https://geo-python-site.readthedocs.io/en/2018.1/notebooks/L5/numpy/1-Exploring-data-using-numpy.html#Reading-a-data-file-with-NumPy) or [built-in Python functions](https://geo-python-site.readthedocs.io/en/2017.1/lessons/L5/reading-data-from-file.html).\n", |
| 44 | + "</div>\n", |
37 | 45 | "\n",
|
38 | 46 | "## Input data: weather statistics\n",
|
39 | 47 | "\n",
|
40 | 48 | "Our input data is a text file containing weather observations from Kumpula, Helsinki, Finland retrieved from [NOAA](https://www.ncdc.noaa.gov/)*:\n",
|
41 | 49 | "\n",
|
42 | 50 | "- File name: [Kumpula-June-2016-w-metadata.txt](Kumpula-June-2016-w-metadata.txt) (have a look at the file before reading it in using pandas!)\n",
|
43 |
| - "- The file is available in the binder and CSC notebook instances, under L5 folder \n", |
| 51 | + "- The file is available in the binder and CSC notebook instances, under the L5 folder \n", |
44 | 52 | "- The data file contains observed daily mean, minimum, and maximum temperatures from June 2016 recorded from the Kumpula weather observation station in Helsinki.\n",
|
45 | 53 | "- There are 30 rows of data in this sample data set.\n",
|
46 | 54 | "- The data has been derived from a data file of daily temperature measurements downloaded from [NOAA](https://www.ncdc.noaa.gov/cdo-web/).\n",
|
47 | 55 | "\n",
|
48 |
| - "\n", |
49 | 56 | "\\*US National Oceanographic and Atmospheric Administration's National Centers for Environmental Information climate database\n",
|
50 | 57 | "\n",
|
51 | 58 | "## Reading a data file with Pandas\n",
|
52 | 59 | "\n",
|
53 |
| - "\n", |
54 |
| - "Now we're ready to read in our temperature data file. **First, we need to import the Pandas module.** It is customary to import pandas as `pd`:\n" |
| 60 | + "Now we're ready to read in our temperature data file. **First, we need to import the Pandas module.** It is customary to import pandas as `pd`:" |
55 | 61 | ]
|
56 | 62 | },
|
57 | 63 | {
|
|
87 | 93 | },
|
88 | 94 | "outputs": [],
|
89 | 95 | "source": [
|
90 |
| - "# Read the file using pandas\n", |
91 |
| - "data = pd.read_csv('Kumpula-June-2016-w-metadata.txt')" |
| 96 | + "# Read the file using pandas\n" |
92 | 97 | ]
|
93 | 98 | },
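As a self-contained sketch of what reading a comma-delimited file with pandas looks like (the few rows of data here are made up, and an in-memory `StringIO` stands in for the actual data file):

```python
import io
import pandas as pd

# Hypothetical stand-in for the Kumpula data file (the real file has 30 rows)
csv_text = (
    "YEARMODA,TEMP,MAX,MIN\n"
    "20160601,65.5,73.6,54.7\n"
    "20160602,65.8,80.8,55.0\n"
    "20160603,68.4,77.9,55.6\n"
)

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))
print(data)
```

With the real file you would simply pass the file name (or path) as the first argument.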
|
94 | 99 | {
|
95 | 100 | "cell_type": "markdown",
|
96 | 101 | "metadata": {},
|
97 | 102 | "source": [
|
98 | 103 | "<div class=\"alert alert-info\">\n",
|
99 |
| - "\n", |
100 |
| - "**Delimiter and other optional parameters**\n", |
101 | 104 | " \n",
|
102 |
| - "Our input file is a comma-delimited file; columns in the data are separted by commas (`,`) on each row. the Pandas `.read_csv()` -function has comma as the default delimiter so we don't need to specify it separately. In order to make the delimiter visible also in the code, could add the `sep` parameter:\n", |
| 105 | + "**Delimiter and other optional parameters**\n", |
| 106 | + "\n", |
| 107 | + "Our input file is a comma-delimited file; columns in the data are separated by commas (`,`) on each row. The pandas `.read_csv()` function uses the comma as the default delimiter, so we don't need to specify it separately. In order to make the delimiter visible also in the code for reading the file, we could add the `sep` parameter:\n", |
103 | 108 | " \n",
|
104 |
| - "``` \n", |
| 109 | + "```python\n", |
105 | 110 | "data = pd.read_csv('Kumpula-June-2016-w-metadata.txt', sep=',')\n",
|
106 | 111 | "```\n",
|
107 |
| - " \n", |
108 |
| - "The `sep` parameter cam be used to spesify if the input data uses some other character, such as `;` as a delimiter. For a full list of available parameters, please refer to [pandas documentation for pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), or run `help(pd.read_csv)`.\n", |
109 |
| - " \n", |
110 |
| - "</div>\n", |
111 |
| - "\n", |
112 |
| - "\n" |
| 112 | + " \n", |
| 113 | + "The `sep` parameter can be used to specify whether the input data uses some other character, such as `;`, as a delimiter. For a full list of available parameters, please refer to the [pandas documentation for pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), or run `help(pd.read_csv)`.\n", |
| 114 | + "</div>" |
113 | 115 | ]
|
114 | 116 | },
|
115 | 117 | {
|
|
119 | 121 | "<div class=\"alert alert-info\">\n",
|
120 | 122 | "\n",
|
121 | 123 | "**Reading different file formats**\n",
|
122 |
| - " \n", |
| 124 | + "\n", |
123 | 125 | "`pandas.read_csv()` is a general function for reading data files separated by commas, spaces, or other common separators. \n",
|
124 | 126 | "\n",
|
125 |
| - " \n", |
126 |
| - "Pandas has several different functions for parsing input data from different formats. There is, for example, a separate function for reading Excel files `read_excel`. Another useful function is `read_pickle` for reading data stored in the [Python pickle format](https://docs.python.org/3/library/pickle.html). Check out [pandas documentation about input and output functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-tools-text-csv-hdf5) and Chapter 6 in [MacKinney (2017): Python for Data Analysis](https://geo-python.github.io/site/course-info/resources.html#books) for more details about reading data.\n", |
127 |
| - " \n", |
| 127 | + "Pandas has several different functions for parsing input data from different formats. There is, for example, a separate function for reading Excel files, `read_excel`. Another useful function is `read_pickle` for reading data stored in the [Python pickle format](https://docs.python.org/3/library/pickle.html). Check out the [pandas documentation about input and output functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-tools-text-csv-hdf5) and Chapter 6 in [McKinney (2017): Python for Data Analysis](https://geo-python-site.readthedocs.io/en/latest/course-info/resources.html#books) for more details about reading data.\n", |
128 | 128 | "</div>"
|
129 | 129 | ]
|
130 | 130 | },
|
|
201 | 201 | "editable": true
|
202 | 202 | },
|
203 | 203 | "source": [
|
204 |
| - "Fortunately, that's easy to do when reading in data using pandas. We just need to add the `skiprows` parameter when we read the file, listing the number of rows to skip (8 in this case).\n", |
| 204 | + "Fortunately, skipping over rows is easy to do when reading in data using pandas. We just need to add the `skiprows` parameter when we read the file, listing the number of rows to skip (8 in this case).\n", |
205 | 205 | "\n",
|
206 | 206 | "Let's try reading the datafile again, and this time defining the `skiprows` parameter."
|
207 | 207 | ]
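A minimal sketch of `skiprows` in action, using two made-up metadata lines in place of the real file's 8-line header:

```python
import io
import pandas as pd

# Two hypothetical metadata lines, then the actual header row and data
csv_text = (
    "# Data source: NOAA\n"
    "# Station: Kumpula, Helsinki\n"
    "YEARMODA,TEMP,MAX,MIN\n"
    "20160601,65.5,73.6,54.7\n"
)

# Skip the first 2 lines so parsing starts at the header row
# (the real Kumpula file needs skiprows=8)
data = pd.read_csv(io.StringIO(csv_text), skiprows=2)
print(data)
```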
|
|
251 | 251 | "editable": true
|
252 | 252 | },
|
253 | 253 | "source": [
|
254 |
| - "After reading in the data, it is always good to check that everything went well by printing out the data as we did here. However, often it is enough to have a look at the top rows of the data. \n", |
| 254 | + "After reading in the data, it is always good to check that everything went well by printing out the data as we did here. However, often it is enough to have a look at the top few rows of the data. \n", |
255 | 255 | "\n",
|
256 |
| - "We can use the `.head()` function of the pandas DataFrame object to quickly check the top rows. By default, the `.head()` -function returns 5 first rows of the DataFrame:" |
| 256 | + "We can use the `.head()` function of the pandas DataFrame object to quickly check the top rows. By default, the `.head()` function returns the first 5 rows of the DataFrame:" |
257 | 257 | ]
|
258 | 258 | },
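A quick illustration of `.head()` on a small, hypothetical DataFrame:

```python
import pandas as pd

# Made-up frame with 10 rows
data = pd.DataFrame({"TEMP": range(10)})

top = data.head()         # first 5 rows by default
first_two = data.head(2)  # or pass the number of rows you want
print(top)
```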
|
259 | 259 | {
|
|
334 | 334 | "source": [
|
335 | 335 | "#### Check your understanding\n",
|
336 | 336 | "\n",
|
337 |
| - "Read in the file `Kumpula-June-2016-w-metadata.txt` again into a new variable called `temp_data` so that you only read in columns `YEARMODA` and `TEMP`. The new variable `temp_data` should have 30 rows and 2 columns. You can achieve this by using the `usecols` parameter when reading in the file. Check for more help in the [pandas.read_csv documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)." |
| 337 | + "Read the file `Kumpula-June-2016-w-metadata.txt` in again and store its contents in a new variable called `temp_data`. In this case you should only read in the columns `YEARMODA` and `TEMP`, so the new variable `temp_data` should have 30 rows and 2 columns. You can achieve this using the `usecols` parameter when reading in the file. Feel free to check for more help in the [pandas.read_csv documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)." |
338 | 338 | ]
|
339 | 339 | },
|
340 | 340 | {
|
341 | 341 | "cell_type": "code",
|
342 | 342 | "execution_count": null,
|
343 | 343 | "metadata": {},
|
344 | 344 | "outputs": [],
|
345 |
| - "source": [] |
| 345 | + "source": [ |
| 346 | + "# Read in the data file again here\n" |
| 347 | + ] |
346 | 348 | },
|
347 | 349 | {
|
348 | 350 | "cell_type": "code",
|
349 | 351 | "execution_count": null,
|
350 | 352 | "metadata": {},
|
351 | 353 | "outputs": [],
|
352 |
| - "source": [] |
| 354 | + "source": [ |
| 355 | + "# Check the contents of the first rows of the data file\n" |
| 356 | + ] |
353 | 357 | },
|
354 | 358 | {
|
355 | 359 | "cell_type": "markdown",
|
|
368 | 372 | "cell_type": "markdown",
|
369 | 373 | "metadata": {},
|
370 | 374 | "source": [
|
371 |
| - "Let's start by checking the size of our data frame. We can use the `len()` function similarly as with lists to check how many rows we have:" |
| 375 | + "Let's start by checking the size of our data frame. We can use the `len()` function, just as we would with lists, to check how many rows we have:" |
372 | 376 | ]
|
373 | 377 | },
|
374 | 378 | {
|
|
384 | 388 | },
|
385 | 389 | "outputs": [],
|
386 | 390 | "source": [
|
387 |
| - "# Check the number of rows \n", |
388 |
| - "len(data)" |
| 391 | + "# Check the number of rows \n" |
389 | 392 | ]
|
390 | 393 | },
|
391 | 394 | {
|
|
411 | 414 | },
|
412 | 415 | "outputs": [],
|
413 | 416 | "source": [
|
414 |
| - "# Check dataframe shape (number of rows, number of columns)\n", |
415 |
| - "data.shape" |
| 417 | + "# Check dataframe shape (number of rows, number of columns)\n" |
416 | 418 | ]
|
417 | 419 | },
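For example, on a small made-up DataFrame, `len()` and `.shape` report the row and column counts like this:

```python
import pandas as pd

# Hypothetical data, three rows and two columns
data = pd.DataFrame({"TEMP": [65.5, 65.8, 68.4], "MAX": [73.6, 80.8, 77.9]})

print(len(data))    # number of rows, as with lists: 3
print(data.shape)   # (rows, columns): (3, 2)
```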
|
418 | 420 | {
|
|
422 | 424 | "editable": true
|
423 | 425 | },
|
424 | 426 | "source": [
|
425 |
| - "Here we see that our dataset has 30 rows, 4 columns, just as we saw above when printing out the whole DataFrame." |
| 427 | + "Here we see that our dataset has 30 rows and 4 columns, just as we saw above when printing out the entire DataFrame." |
426 | 428 | ]
|
427 | 429 | },
|
428 | 430 | {
|
429 | 431 | "cell_type": "markdown",
|
430 | 432 | "metadata": {},
|
431 | 433 | "source": [
|
432 |
| - "**Note:** `shape` is one of the several [attributes related to a pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#attributes-and-underlying-data)." |
| 434 | + "<div class=\"alert alert-info\">\n", |
| 435 | + "\n", |
| 436 | + "**Note**\n", |
| 437 | + " \n", |
| 438 | + "`shape` is one of the several [attributes related to a pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#attributes-and-underlying-data).\n", |
| 439 | + "\n", |
| 440 | + "</div>" |
433 | 441 | ]
|
434 | 442 | },
|
435 | 443 | {
|
|
452 | 460 | },
|
453 | 461 | "outputs": [],
|
454 | 462 | "source": [
|
455 |
| - "#Print column names\n", |
456 |
| - "data.columns.values" |
| 463 | + "#Print column names\n" |
457 | 464 | ]
|
458 | 465 | },
|
459 | 466 | {
|
|
479 | 486 | },
|
480 | 487 | "outputs": [],
|
481 | 488 | "source": [
|
482 |
| - "#Print index\n", |
483 |
| - "data.index" |
| 489 | + "#Print index\n" |
484 | 490 | ]
|
485 | 491 | },
|
486 | 492 | {
|
|
490 | 496 | "editable": true
|
491 | 497 | },
|
492 | 498 | "source": [
|
493 |
| - "Here we see how the data is indexed, starting at 0, ending at 30, and with an increment of 1 between each value. This is basically the same way in which Python lists are indexed, however, pandas allows also other ways of identifying the rows. DataFrame indices could, for example, be character strings, or date objects. We will learn more about re-setting the index later." |
| 499 | + "Here we see how the data is indexed, starting at 0 and stopping at 30 (so the last row has index 29), with an increment of 1 between each value. This is basically the same way in which Python lists are indexed. However, pandas also allows other ways of identifying the rows: DataFrame indices could, for example, be character strings or date objects. We will learn more about resetting the index later." |
494 | 500 | ]
|
495 | 501 | },
|
496 | 502 | {
|
|
516 | 522 | },
|
517 | 523 | "outputs": [],
|
518 | 524 | "source": [
|
519 |
| - "# Print data types\n", |
520 |
| - "data.dtypes" |
| 525 | + "# Print data types\n" |
521 | 526 | ]
|
522 | 527 | },
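The attributes discussed above can be sketched on a tiny, hypothetical DataFrame:

```python
import pandas as pd

# Made-up data with two rows
data = pd.DataFrame({"YEARMODA": [20160601, 20160602], "TEMP": [65.5, 65.8]})

print(data.columns.values)  # column labels as an array
print(data.index)           # RangeIndex(start=0, stop=2, step=1)
print(data.dtypes)          # one data type per column
```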
|
523 | 528 | {
|
|
534 | 539 | "cell_type": "markdown",
|
535 | 540 | "metadata": {},
|
536 | 541 | "source": [
|
537 |
| - "#### Check your unerstanding\n", |
| 542 | + "#### Check your understanding\n", |
538 | 543 | "\n",
|
539 |
| - "Print out the number of columns in our DataFrame:" |
| 544 | + "See if you can find a way to print out the number of columns in our DataFrame." |
540 | 545 | ]
|
541 | 546 | },
|
542 | 547 | {
|
543 | 548 | "cell_type": "code",
|
544 | 549 | "execution_count": null,
|
545 | 550 | "metadata": {},
|
546 | 551 | "outputs": [],
|
547 |
| - "source": [] |
| 552 | + "source": [ |
| 553 | + "# Calculate the number of columns here\n" |
| 554 | + ] |
548 | 555 | },
|
549 | 556 | {
|
550 | 557 | "cell_type": "markdown",
|
|
557 | 564 | "cell_type": "markdown",
|
558 | 565 | "metadata": {},
|
559 | 566 | "source": [
|
560 |
| - "We can select only spesific columns based on the column values. The basic syntax is `dataframe[value]`, where value can be a single column name, or a list of column names. Let's start by selecting two columns, `'YEARMODA'` and `'TEMP'`:" |
| 567 | + "We can select specific columns using the column names. The basic syntax is `dataframe[value]`, where `value` can be a single column name or a list of column names. Let's start by selecting two columns, `'YEARMODA'` and `'TEMP'`:" |
561 | 568 | ]
|
562 | 569 | },
|
563 | 570 | {
|
|
567 | 574 | "outputs": [],
|
568 | 575 | "source": []
|
569 | 576 | },
|
| 577 | + { |
| 578 | + "cell_type": "code", |
| 579 | + "execution_count": null, |
| 580 | + "metadata": {}, |
| 581 | + "outputs": [], |
| 582 | + "source": [] |
| 583 | + }, |
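As a sketch of the two selection forms on made-up data (a list of names returns a DataFrame, a single name returns a Series):

```python
import pandas as pd

# Hypothetical two-row data set
data = pd.DataFrame({
    "YEARMODA": [20160601, 20160602],
    "TEMP": [65.5, 65.8],
    "MAX": [73.6, 80.8],
})

selection = data[["YEARMODA", "TEMP"]]  # list of names -> DataFrame
temps = data["TEMP"]                    # single name -> Series
print(selection)
```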
570 | 584 | {
|
571 | 585 | "cell_type": "markdown",
|
572 | 586 | "metadata": {},
|
|
638 | 652 | }
|
639 | 653 | },
|
640 | 654 | "outputs": [],
|
641 |
| - "source": [] |
| 655 | + "source": [ |
| 656 | + "# Check datatype of the column\n" |
| 657 | + ] |
642 | 658 | },
|
643 | 659 | {
|
644 | 660 | "cell_type": "markdown",
|
|
663 | 679 | "``` \n",
|
664 | 680 | "data.TEMP\n",
|
665 | 681 | "```\n",
|
| 682 | + "\n", |
666 | 683 | "This syntax works only if the column name is a valid name for a Python variable (e.g. the column name should not contain whitespace).\n",
|
667 | 684 | "The syntax `data[\"column\"]` works for all kinds of column names, so we recommend using this approach.\n",
|
| 685 | + "\n", |
668 | 686 | "</div>"
|
669 | 687 | ]
|
670 | 688 | },
|
|
754 | 772 | "cell_type": "markdown",
|
755 | 773 | "metadata": {},
|
756 | 774 | "source": [
|
757 |
| - "#### Check your unerstanding\n", |
| 775 | + "#### Check your understanding\n", |
758 | 776 | "\n",
|
759 |
| - "It doesn't make much sense to print out descriptive statistics for the `YEARMODA` column now that the values are stored as integer values (and not datetime objects that we will learn to handle later on). \n", |
| 777 | + "It doesn't make much sense to print out descriptive statistics for the `YEARMODA` column now that the values are stored as integer values (and not datetime objects, which we will learn about in later lessons). \n", |
760 | 778 | "\n",
|
761 |
| - "Print out descriptive statistics again, this time only for columns `TEMP`, `MAX`, `MIN`:" |
| 779 | + "See if you can print out the descriptive statistics again, this time only for columns `TEMP`, `MAX`, `MIN`:" |
762 | 780 | ]
|
763 | 781 | },
|
764 | 782 | {
|
765 | 783 | "cell_type": "code",
|
766 | 784 | "execution_count": null,
|
767 | 785 | "metadata": {},
|
768 | 786 | "outputs": [],
|
769 |
| - "source": [] |
| 787 | + "source": [ |
| 788 | + "# Get descriptive statistics for selected columns\n" |
| 789 | + ] |
770 | 790 | },
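A minimal example of `.describe()` restricted to selected columns, on a few made-up temperature values:

```python
import pandas as pd

# Hypothetical values chosen so the statistics are easy to check by hand
data = pd.DataFrame({"TEMP": [60.0, 65.0, 70.0], "MAX": [70.0, 75.0, 80.0]})

# describe() returns count, mean, std, min, quartiles, and max per column
stats = data[["TEMP", "MAX"]].describe()
print(stats)
```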
|
771 | 791 | {
|
772 | 792 | "cell_type": "markdown",
|
|
779 | 799 | "cell_type": "markdown",
|
780 | 800 | "metadata": {},
|
781 | 801 | "source": [
|
782 |
| - "Visualizing the data is a key part of data exploration. Pandas comes with a handful of plotting methods, which all rely on the plotting library [matplotlib](https://matplotlib.org/). " |
| 802 | + "Visualizing the data is a key part of data exploration, and Pandas comes with a handful of plotting methods, which all rely on the [matplotlib](https://matplotlib.org/) plotting library. " |
783 | 803 | ]
|
784 | 804 | },
|
785 | 805 | {
|
786 | 806 | "cell_type": "markdown",
|
787 | 807 | "metadata": {},
|
788 | 808 | "source": [
|
789 |
| - "For very basic plots, we don’t need to import matplotlib separately. \n", |
790 |
| - "We can already achieve very simple plots using the `DataFrame.plot` -method. \n", |
| 809 | + "For very basic plots, we don’t need to import matplotlib separately. We can already create very simple plots using the `DataFrame.plot` method. \n", |
791 | 810 | "\n",
|
792 | 811 | "Let's plot all the columns that contain values related to temperatures:"
|
793 | 812 | ]
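A rough sketch of `DataFrame.plot` on made-up data (the non-interactive `Agg` backend is assumed here so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumed backend for headless use
import pandas as pd

# Hypothetical temperature columns
data = pd.DataFrame({
    "TEMP": [65.5, 65.8, 68.4],
    "MAX": [73.6, 80.8, 77.9],
    "MIN": [54.7, 55.0, 55.6],
})

# DataFrame.plot draws one line per numeric column against the index
ax = data.plot()
print(len(ax.get_lines()))  # 3 lines, one per column
```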
|
|
954 | 973 | "source": [
|
955 | 974 | "Check more details about available parameters and methods in [the pandas.DataFrame documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas-dataframe)."
|
956 | 975 | ]
|
957 |
| - }, |
958 |
| - { |
959 |
| - "cell_type": "markdown", |
960 |
| - "metadata": { |
961 |
| - "deletable": true, |
962 |
| - "editable": true |
963 |
| - }, |
964 |
| - "source": [ |
965 |
| - "That's it! Next, we will have a look at [basic operations for data analysis in Pandas](processing-data-with-pandas.ipynb)." |
966 |
| - ] |
967 | 976 | }
|
968 | 977 | ],
|
969 | 978 | "metadata": {
|
|
983 | 992 | "name": "python",
|
984 | 993 | "nbconvert_exporter": "python",
|
985 | 994 | "pygments_lexer": "ipython3",
|
986 |
| - "version": "3.7.6" |
| 995 | + "version": "3.7.4" |
987 | 996 | }
|
988 | 997 | },
|
989 | 998 | "nbformat": 4,
|
|