Blame: sagemaker_processing/basic_sagemaker_data_processing/basic_sagemaker_processing.ipynb - aws/amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.

-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								{
 								 "cells": [
 								  {
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "# Get started with SageMaker Processing\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "\n",
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "This notebook corresponds to the section \"Preprocessing Data With The Built-In Scikit-Learn Container\" in the blog post [Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation](https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/).\n",
 								    "It shows a lightweight example of using SageMaker Processing to split a dataset into train, test, and validation sets, which are then written back to S3.\n",
 								    "\n",
 								    "## Runtime\n",
 								    "\n",
 								    "This notebook takes approximately 5 minutes to run.\n",
 								    "\n",
 								    "## Contents\n",
 								    "\n",
 								    "1. [Prepare resources](#Prepare-resources)\n",
 								    "1. [Download data](#Download-data)\n",
 								    "1. [Prepare Processing script](#Prepare-Processing-script)\n",
 								    "1. [Run Processing job](#Run-Processing-job)\n",
 								    "1. [Conclusion](#Conclusion)"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "## Prepare resources\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "\n",
 								    "First, create an `SKLearnProcessor` object, passing the scikit-learn framework version to use, the execution role, and the managed infrastructure requirements (instance type and instance count)."
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "import boto3\n",
 								    "import sagemaker\n",
 								    "from sagemaker import get_execution_role\n",
 								    "from sagemaker.sklearn.processing import SKLearnProcessor\n",
 								    "\n",
-												Use Sagemaker SDK instead of boto3 to get region_name (#3231)


											
										
										
											2022-03-21 11:33:51 -07:00
+								    "region = sagemaker.Session().boto_region_name\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "role = get_execution_role()\n",
-												infra: format all .ipynb files with black-nb (#2224)


											
										
										
											2021-05-11 22:43:29 +00:00
+								    "sklearn_processor = SKLearnProcessor(\n",
-												upgrading SKlearn version to 1.0-1 (#3414)

* upgrading SKlearn version to 1.0-1

* resolve errors related to 1.0-1 upgrade

* resolve errors related to 1.0-1 upgrade

* fix: Update Pipeline notebooks with latest sagemaker changes (#3424)

Co-authored-by: Dewen Qi <qidewen@amazon.com>

* fix: Fix format of Pipeline related notebooks (#3427)

Co-authored-by: Dewen Qi <qidewen@amazon.com>

* resolve errors related to 1.0-1 upgrade

* resolve errors related to 1.0-1 upgrade

* resolve errors related to 1.0-1 upgrade

* resolve errors related to 1.0-1 upgrade

* resolve errors related to 1.0-1 upgrade

* resolve errors related to 1.0-1 upgrade

* Update pipelines_product_ratings.ipynb

reverting changes

* Update sagemaker-pipelines-tuning-step.ipynb

reverting changes

* Update train register and deploy a pipeline model.ipynb

reverting changes

* Update sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb

reverting changes

Co-authored-by: Nikhil Raverkar <nraverka@amazon.com>
Co-authored-by: Julia Kroll <75504951+jkroll-aws@users.noreply.github.com>
Co-authored-by: qidewenwhen <32910701+qidewenwhen@users.noreply.github.com>
Co-authored-by: Dewen Qi <qidewen@amazon.com>
											
										
										
											2022-06-02 14:04:57 -04:00
+								    "    framework_version=\"1.0-1\", role=role, instance_type=\"ml.m5.xlarge\", instance_count=1\n",
-												infra: format all .ipynb files with black-nb (#2224)


											
										
										
											2021-05-11 22:43:29 +00:00
+								    ")"
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "## Download data\n",
 								    "\n",
 								    "Download the raw data from a public S3 bucket and save a local copy as `dataset.csv`, which serves as the input to the Processing job. This example uses the [Census-Income (KDD) Dataset](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29) from the UCI Machine Learning Repository.\n",
 								    "\n",
 								    "> Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science."
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "import pandas as pd\n",
 								    "\n",
-												Make notebooks use boto3 to import datasets (#3206)


											
										
										
											2022-03-08 10:08:12 -08:00
+								    "s3 = boto3.client(\"s3\")\n",
 								    "s3.download_file(\n",
 								    "    \"sagemaker-sample-data-{}\".format(region),\n",
 								    "    \"processing/census/census-income.csv\",\n",
 								    "    \"census-income.csv\",\n",
 								    ")\n",
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "df = pd.read_csv(\"census-income.csv\")\n",
-												infra: format all .ipynb files with black-nb (#2224)


											
										
										
											2021-05-11 22:43:29 +00:00
+								    "df.to_csv(\"dataset.csv\", index=False)  # omit the index so it is not read back as an extra column\n",
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "df.head()"
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
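+								    "## Prepare Processing script\n",
 								    "\n",
 								    "Write the Python script that SageMaker Processing will run. Inside the container, the script reads the input file from `/opt/ml/processing/input` (where SageMaker copies it from S3); splits the rows into train, test, and validation sets; and writes three output files under `/opt/ml/processing/output`, which SageMaker uploads to S3 when the job finishes."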
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "%%writefile preprocessing.py\n",
 								    "import pandas as pd\n",
 								    "import os\n",
 								    "from sklearn.model_selection import train_test_split\n",
-												infra: format all .ipynb files with black-nb (#2224)


											
										
										
											2021-05-11 22:43:29 +00:00
+								    "\n",
 								    "input_data_path = os.path.join(\"/opt/ml/processing/input\", \"dataset.csv\")\n",
 								    "df = pd.read_csv(input_data_path)\n",
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "print(\"Shape of data is:\", df.shape)\n",
 								    "train, test = train_test_split(df, test_size=0.2)\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "train, validation = train_test_split(train, test_size=0.2)\n",
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "try:\n",
-												infra: format all .ipynb files with black-nb (#2224)


											
										
										
											2021-05-11 22:43:29 +00:00
+								    "    os.makedirs(\"/opt/ml/processing/output/train\", exist_ok=True)\n",
 								    "    os.makedirs(\"/opt/ml/processing/output/validation\", exist_ok=True)\n",
 								    "    os.makedirs(\"/opt/ml/processing/output/test\", exist_ok=True)\n",
 								    "    print(\"Successfully created directories\")\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "except Exception as e:\n",
-												infra: format all .ipynb files with black-nb (#2224)


											
										
										
											2021-05-11 22:43:29 +00:00
+								    "    # if the Processing call already creates these directories (or directory otherwise cannot be created)\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "    print(e)\n",
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "    print(\"Could not make directories\")\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "    pass\n",
 								    "\n",
 								    "try:\n",
 								    "    train.to_csv(\"/opt/ml/processing/output/train/train.csv\")\n",
 								    "    validation.to_csv(\"/opt/ml/processing/output/validation/validation.csv\")\n",
 								    "    test.to_csv(\"/opt/ml/processing/output/test/test.csv\")\n",
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "    print(\"Wrote files successfully\")\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "except Exception as e:\n",
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "    print(\"Failed to write the files\")\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "    print(e)\n",
 								    "    raise  # fail the job rather than report success without output files\n",
 								    "\n",
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "print(\"Completed running the processing job\")"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "## Run Processing job"
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "Run the Processing job, specifying the script name, input file, and output files."
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "%%capture output\n",
 								    "\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
-												infra: format all .ipynb files with black-nb (#2224)


											
										
										
											2021-05-11 22:43:29 +00:00
+								    "\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    "sklearn_processor.run(\n",
-												infra: format all .ipynb files with black-nb (#2224)


											
										
										
											2021-05-11 22:43:29 +00:00
+								    "    code=\"preprocessing.py\",\n",
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								    "    # arguments = [\"arg1\", \"arg2\"], # Arguments can optionally be specified here\n",
-												infra: format all .ipynb files with black-nb (#2224)


											
										
										
											2021-05-11 22:43:29 +00:00
+								    "    inputs=[ProcessingInput(source=\"dataset.csv\", destination=\"/opt/ml/processing/input\")],\n",
 								    "    outputs=[\n",
 								    "        ProcessingOutput(source=\"/opt/ml/processing/output/train\"),\n",
 								    "        ProcessingOutput(source=\"/opt/ml/processing/output/validation\"),\n",
 								    "        ProcessingOutput(source=\"/opt/ml/processing/output/test\"),\n",
 								    "    ],\n",
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								    ")"
 								   ]
 								  },
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								  {
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "Get the Processing job logs and retrieve the job name."
 								   ]
 								  },
 								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "metadata": {},
 								   "outputs": [],
 								   "source": [
 								    "print(output)\n",
 								    "# Read the job name from the processor instead of parsing the captured log text\n",
 								    "job_name = sklearn_processor.latest_job.job_name"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "Confirm that the output dataset files were written to S3."
 								   ]
 								  },
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								  {
 								   "cell_type": "code",
 								   "execution_count": null,
 								   "metadata": {},
 								   "outputs": [],
-												Revise SM Processing notebook markdown (#3277)

* Revise SM Processing notebook to add full markdown explanations

* Fix black-nb formatting

* Call df.head() later so output is displayed
											
										
										
											2022-04-01 16:12:11 -05:00
+								   "source": [
 								    "import boto3\n",
 								    "\n",
 								    "s3_client = boto3.client(\"s3\")\n",
 								    "default_bucket = sagemaker.Session().default_bucket()\n",
 								    "# Outputs without an explicit name are stored as output-1, output-2, output-3\n",
 								    "for i in range(1, 4):\n",
 								    "    key = s3_client.list_objects_v2(\n",
 								    "        Bucket=default_bucket, Prefix=job_name + \"/output/output-\" + str(i) + \"/\"\n",
 								    "    )[\"Contents\"][0][\"Key\"]\n",
 								    "    print(\"s3://\" + default_bucket + \"/\" + key)"
 								   ]
 								  },
 								  {
 								   "cell_type": "markdown",
 								   "metadata": {},
 								   "source": [
 								    "## Conclusion\n",
 								    "\n",
 								    "In this notebook, we read a dataset from S3 and processed it into train, test, and validation sets using a SageMaker Processing job. You can extend this example for preprocessing your own datasets in preparation for machine learning or other applications."
 								   ]
-												add a simple,basic sagemaker processing example that corresponds to a blog (#2016)

Co-authored-by: Broyde <broydj@MacBook-Pro.local>
											
										
										
											2021-02-18 21:46:54 -05:00
+								  }
 								 ],
 								 "metadata": {
 								  "kernelspec": {
 								   "display_name": "Python 3",
 								   "language": "python",
 								   "name": "python3"
 								  },
 								  "language_info": {
 								   "codemirror_mode": {
 								    "name": "ipython",
 								    "version": 3
 								   },
 								   "file_extension": ".py",
 								   "mimetype": "text/x-python",
 								   "name": "python",
 								   "nbconvert_exporter": "python",
 								   "pygments_lexer": "ipython3",
 								   "version": "3.7.9"
 								  }
 								 },
 								 "nbformat": 4,
 								 "nbformat_minor": 4
 								}
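
The preprocessing script in this notebook calls `train_test_split` twice with `test_size=0.2`: first on the full dataset, then on the remaining training rows. The sketch below works out the resulting row counts using only the standard library. It assumes scikit-learn's convention of rounding a fractional test size up and giving the remainder to the training side. The helper name `split_sizes` is hypothetical and is not part of the notebook.

```python
from math import ceil


def split_sizes(n_rows, test_size=0.2, validation_size=0.2):
    """Return (train, validation, test) row counts for two chained splits.

    Assumption: each fractional split rounds the held-out count up and
    leaves the remaining rows on the training side.
    """
    # First split: hold out the test set from the full dataset.
    n_test = ceil(test_size * n_rows)
    n_remaining = n_rows - n_test

    # Second split: hold out the validation set from what is left.
    n_validation = ceil(validation_size * n_remaining)
    n_train = n_remaining - n_validation

    return n_train, n_validation, n_test


if __name__ == "__main__":
    train, validation, test = split_sizes(1000)
    print("train:", train, "validation:", validation, "test:", test)
```

Because the second split applies to the 80% that remains after the first, the validation set is about 16% of the original rows and the training set about 64%, not 20% and 60%.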