Standalone batch task on Cloud Run Jobs

Build a maintainable, scalable, and sustainable batch task and run it locally and on Cloud Run Jobs.

Note: These instructions are written for Mac users. If you are using Windows or Linux, some commands (such as environment variable export, file paths, or activation of virtual environments) may differ. Adjust accordingly for your operating system.

This tutorial walks you through creating a standalone, batch-style task and running it locally and on Cloud Run Jobs. The example uses data ingestion (download CSV, validate, write Parquet, upload to Cloud Storage), but you can swap the task logic for any job.

Estimated time: 60 to 90 minutes.

Outcome: a containerized batch task you can run locally and on Cloud Run Jobs, with secure access to Cloud Storage.

The steps are written with MSS principles in mind: maintainable, scalable, and sustainable. The checklist at the end shows how each principle is covered.

What you will build

A Python task that:

  • Downloads a CSV from a public URL.
  • Validates a subset of columns with Pandera.
  • Writes a Parquet output.
  • Uploads the Parquet file to a non-public GCS bucket.

You can replace the ingestion logic with any other batch task (e.g., API calls, ETL, reporting).
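Whatever the task does, the shape stays the same: read configuration from environment variables, do the work, and exit non-zero on failure so the batch runtime can detect it. A minimal stdlib-only skeleton of that shape (the names are illustrative, not part of the tutorial code):

```python
import logging
import os
import sys

def main() -> int:
    logging.basicConfig(level=logging.INFO)
    # Configuration comes from the environment, as in the tutorial's main.py.
    source_url = os.environ.get("SOURCE_URL", "")
    if not source_url:
        logging.error("SOURCE_URL is required")
        return 1
    # ... fetch, transform, and write outputs here ...
    logging.info("Task finished: %s", source_url)
    return 0

if __name__ == "__main__":
    # A non-zero exit code marks the job execution as failed.
    sys.exit(main())
```

Cloud Run Jobs treats a non-zero exit code as a failed execution, which is what drives retries and failure reporting.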

Before you start

You will need:

  • A sandbox Google Cloud project you can deploy to.
  • Docker and the gcloud CLI installed locally.

You should also be able to:

  • Run the task locally with Python.
  • Run the task in a container.
  • Deploy the container to a batch runtime like Cloud Run Jobs.

Step 1: Log in, set project config, and enable services

First, log in and set your project and region. This config is reused throughout the tutorial. Replace your-project-id with your sandbox project ID.

gcloud auth login

export PROJECT_ID="your-project-id"
export REGION="europe-west2"

gcloud config set project ${PROJECT_ID}
gcloud config set run/region ${REGION}

Now enable the APIs used by Artifact Registry, Cloud Run Jobs, and Cloud Storage.

gcloud services enable \
  artifactregistry.googleapis.com \
  run.googleapis.com \
  storage.googleapis.com

Step 2: Set names and configuration

These variables keep the commands consistent. You can change them later, but keep the names stable while you follow the tutorial.

export PROJECT_NUMBER="$(gcloud projects describe ${PROJECT_ID} --format='value(projectNumber)')"
export DST_BUCKET="sv-data-bucket__${PROJECT_NUMBER}"
export OUTPUT_FILE="data.parquet"
export SOURCE_URL="https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv"
export SA_NAME="data-handler"
export SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
export SA_KEY_DIR="$HOME/SA-KEYS"
export ARTIFACT_REPO="cloud-run-repo"
export IMAGE_NAME="import-data-img"
export IMAGE_TAG="latest"
export IMAGE_URI="${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_REPO}/${IMAGE_NAME}:${IMAGE_TAG}"
export JOB_NAME="import-data-job"

Variable reference:

  • PROJECT_ID: Your Google Cloud project ID.
  • REGION: Region for Artifact Registry and Cloud Run Jobs.
  • PROJECT_NUMBER: Numeric project identifier used by some Google Cloud services.
  • DST_BUCKET: Destination bucket name for outputs.
  • OUTPUT_FILE: Output filename to write.
  • SOURCE_URL: URL to download the source CSV from.
  • SA_NAME: Service account name for the task.
  • SA_EMAIL: Service account email for the task.
  • SA_KEY_DIR: Local folder to store service account keys.
  • ARTIFACT_REPO: Artifact Registry repository name.
  • IMAGE_NAME: Container image name.
  • IMAGE_TAG: Image tag.
  • IMAGE_URI: Full Artifact Registry image URI.
  • JOB_NAME: Cloud Run Job name.
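IMAGE_URI is simply the other variables composed in Artifact Registry's Docker path format. A small sketch of the convention (build_image_uri is an illustrative helper, not part of the tutorial code):

```python
def build_image_uri(region: str, project_id: str, repo: str, image: str, tag: str) -> str:
    # Artifact Registry Docker images live at:
    #   <region>-docker.pkg.dev/<project>/<repo>/<image>:<tag>
    return f"{region}-docker.pkg.dev/{project_id}/{repo}/{image}:{tag}"
```

Keeping the pieces in separate variables lets you retag or rename the image later without touching the rest of the commands.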

Step 3: Create the destination bucket

The task writes Parquet to a non-public GCS bucket.

Create the destination bucket (non-public access):

What to expect in GCP: After running the command below, you should see a new bucket appear in the Google Cloud Console under Cloud Storage > Buckets with the name you specified in ${DST_BUCKET}. This confirms the bucket was created successfully.

gcloud storage buckets create "gs://${DST_BUCKET}" \
  --location=${REGION} \
  --public-access-prevention \
  --uniform-bucket-level-access
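Bucket names are globally unique and must follow Cloud Storage naming rules (lowercase letters, digits, dashes, underscores; 3 to 63 characters; start and end with a letter or digit). A quick sanity check you could run on a candidate name before creating the bucket; the regex is a simplified approximation, not the full specification:

```python
import re

def looks_like_valid_bucket_name(name: str) -> bool:
    # Simplified check: 3-63 chars, lowercase letters, digits, dashes and
    # underscores, starting and ending with a letter or digit. The real
    # rules are stricter (e.g. dotted names have extra constraints).
    return re.fullmatch(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]", name) is not None
```

If creation fails with a naming or uniqueness error, adjust DST_BUCKET; the `__${PROJECT_NUMBER}` suffix in this tutorial exists to make the name unique across projects.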

Step 4: Create and authorize a service account

Create a service account for the task, grant it object admin access to the bucket, and grant it the Service Usage Consumer role on the project. This is the identity Cloud Run Jobs will use.

What to expect in GCP: After running the commands below, you should see a new service account appear in the Google Cloud Console under IAM & Admin > Service Accounts with the name ${SA_NAME}. The bucket’s permissions will also show this service account as having the Storage Object Admin role.

gcloud iam service-accounts create "${SA_NAME}" \
  --display-name="Standalone task service account"

gcloud storage buckets add-iam-policy-binding "gs://${DST_BUCKET}" \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/storage.objectAdmin"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/serviceusage.serviceUsageConsumer"

For local runs, create a key and point GOOGLE_APPLICATION_CREDENTIALS at it.

mkdir -p "${SA_KEY_DIR}"

gcloud iam service-accounts keys create "${SA_KEY_DIR}/service-account.json" \
  --iam-account=${SA_EMAIL}

export GOOGLE_APPLICATION_CREDENTIALS="${SA_KEY_DIR}/service-account.json"
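Google client libraries read the key file that GOOGLE_APPLICATION_CREDENTIALS points at, so a common local failure mode is a wrong path or a malformed file. A small stdlib sanity check you could run before the task (check_adc_key is an illustrative helper; "type" and "client_email" are standard fields in a service account key JSON):

```python
import json
import os

def check_adc_key(path: str) -> bool:
    # A service account key is a JSON file that includes, among other
    # fields, a "service_account" type marker and the account's email.
    if not os.path.isfile(path):
        return False
    with open(path) as f:
        info = json.load(f)
    return info.get("type") == "service_account" and "client_email" in info
```

If this returns False for the path in GOOGLE_APPLICATION_CREDENTIALS, re-run the key creation command above.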

Step 5: Create the task code

Create a new folder for the task, for example import-data, and add a main.py file. This example script downloads a CSV, validates it, writes Parquet, and uploads to GCS. Replace the logic inside main() for other batch tasks.

import os
import logging
import pandas as pd
from google.cloud import storage
from pandera.pandas import Column, DataFrameSchema, Check
from pandera.errors import SchemaError

# Config (or environment variables). Replace YOUR_PROJECT_NUMBER with your
# project number, or set DST_BUCKET in the environment.
DST_BUCKET = os.environ.get("DST_BUCKET", "sv-data-bucket__YOUR_PROJECT_NUMBER")
OUTPUT_FILE = os.environ.get("OUTPUT_FILE", "data.parquet")
SOURCE_URL = os.environ.get(
    "SOURCE_URL",
    "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv"
)

schema = DataFrameSchema(
    {
        "country": Column(str, nullable=False),
        "year": Column(int, nullable=False, checks=Check.ge(1750)),
        "iso_code": Column(str, nullable=True, checks=Check.str_length(3, 3)),
        "population": Column(float, nullable=True, checks=Check.ge(0)),
        "gdp": Column(float, nullable=True, checks=Check.ge(0)),
        "co2": Column(float, nullable=True),
        "co2_per_capita": Column(float, nullable=True),
        "co2_per_gdp": Column(float, nullable=True),
        "co2_including_luc": Column(float, nullable=True),
        "co2_including_luc_per_capita": Column(float, nullable=True),
        "energy_per_capita": Column(float, nullable=True),
        "primary_energy_consumption": Column(float, nullable=True),
        "trade_co2": Column(float, nullable=True),
        "trade_co2_share": Column(float, nullable=True),
    },
    strict=False,
)

def main():
    logging.basicConfig(level=logging.INFO)

    # Create a client
    client = storage.Client()

    try:
        if not SOURCE_URL:
            raise ValueError("SOURCE_URL is required")

        df = pd.read_csv(SOURCE_URL)

        # Validate the DataFrame against the schema
        validated_df = schema.validate(df)

        # Write the DataFrame to a Parquet file
        validated_df.to_parquet(OUTPUT_FILE, index=False)

        # Upload the Parquet file to the destination bucket
        dst_bucket = client.bucket(DST_BUCKET)
        dst_blob = dst_bucket.blob(OUTPUT_FILE)
        dst_blob.upload_from_filename(OUTPUT_FILE)
        logging.info("Import succeeded: %s -> %s/%s", SOURCE_URL, DST_BUCKET, OUTPUT_FILE)

    except SchemaError:
        logging.exception("Schema validation failed.")
        raise
    except Exception:
        logging.exception("Import failed.")
        raise
    finally:
        if os.path.exists(OUTPUT_FILE):
            os.remove(OUTPUT_FILE)

if __name__ == "__main__":
    main()
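Note that main() deletes the local Parquet file in a finally block, so a failed run never leaves a stale partial output behind for the next run to upload by mistake. The pattern in isolation (file name and failure here are illustrative):

```python
import os

OUTPUT_PATH = "demo-output.tmp"  # illustrative local file name

def run_task(fail: bool) -> None:
    # Mimics main(): produce a local output file, maybe fail mid-run,
    # and always remove the file in the finally block.
    with open(OUTPUT_PATH, "w") as f:
        f.write("parquet bytes would go here")
    try:
        if fail:
            raise RuntimeError("upload failed")
    finally:
        if os.path.exists(OUTPUT_PATH):
            os.remove(OUTPUT_PATH)
```

The exception still propagates after cleanup, so the process exits non-zero and the job execution is correctly reported as failed.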

Step 6: Add dependencies

Create requirements.txt in your task folder:

google-cloud-storage==3.9.0
pandas==3.0.1
pandera==0.29.0
pyarrow==23.0.1

Step 7: Add a Dockerfile

Create a Dockerfile in the task folder:

Important: In the Dockerfile below, update YOUR_PROJECT_NUMBER in the ENV DST_BUCKET=sv-data-bucket__YOUR_PROJECT_NUMBER line to match your actual Google Cloud project number. This is required for the batch task to upload to the correct bucket.

FROM python:3.13-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Set your environment variables here or via Cloud Run UI
ENV DST_BUCKET=sv-data-bucket__YOUR_PROJECT_NUMBER
ENV OUTPUT_FILE=data.parquet
ENV SOURCE_URL=https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv
ENV DISABLE_PANDERA_IMPORT_WARNING=True

CMD ["python", "main.py"]

Step 8: Run locally

From your task folder:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python main.py

Step 9: Run as a local container

From your task folder:

docker build -t my-task:local .

docker run --rm \
  -e GOOGLE_APPLICATION_CREDENTIALS=/tmp/sa-key.json \
  -e DST_BUCKET=${DST_BUCKET} \
  -e OUTPUT_FILE=${OUTPUT_FILE} \
  -e SOURCE_URL=${SOURCE_URL} \
  -v "${SA_KEY_DIR}/service-account.json:/tmp/sa-key.json:ro" \
  my-task:local

Step 10: Deploy as a Cloud Run Job

This example uses Cloud Run Jobs for batch execution.

Note: gcloud run jobs create fails if a job named ${JOB_NAME} already exists in your project and region, because Cloud Run Job IDs must be unique. To re-create a job, delete the existing one first or pick a different name.

What to expect in GCP:

  • After running the Artifact Registry creation command, you should see a new repository under Artifact Registry > Repositories named ${ARTIFACT_REPO}.
  • After pushing the image, you should see your container image listed in this repository.
  • After creating the Cloud Run Job, you should see a new job under Cloud Run > Jobs with the name ${JOB_NAME}. When you execute the job, you can monitor its status and logs in the Cloud Console.

gcloud artifacts repositories create ${ARTIFACT_REPO} \
  --location=${REGION} \
  --repository-format=docker

gcloud auth configure-docker ${REGION}-docker.pkg.dev

docker buildx build --platform linux/amd64 \
  -t ${IMAGE_URI} \
  --push .

gcloud run jobs create ${JOB_NAME} \
  --image ${IMAGE_URI} \
  --region ${REGION} \
  --service-account ${SA_EMAIL}

gcloud run jobs execute ${JOB_NAME} --region ${REGION}

Step 11: Configure runtime settings

You can override runtime settings such as environment variables. Because the job already exists from Step 10, use gcloud run jobs update rather than create:

gcloud run jobs update ${JOB_NAME} \
  --image ${IMAGE_URI} \
  --region ${REGION} \
  --set-env-vars SOURCE_URL=https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv,OUTPUT_FILE=data.parquet,DST_BUCKET=sv-data-bucket__${PROJECT_NUMBER}
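Whether a value comes from the Dockerfile ENV, docker run -e, or the job configuration, main.py sees ordinary environment-variable precedence: a value set on the running container replaces the image's ENV default, and os.environ.get falls back to its hard-coded default only when the variable is absent. A minimal illustration (DEMO_SETTING is a made-up name):

```python
import os

def resolve_setting(name: str, default: str) -> str:
    # Same lookup main.py uses: environment value if present,
    # otherwise the in-code fallback.
    return os.environ.get(name, default)
```

This is why you can ship sensible defaults in the image and still reconfigure the job without rebuilding it.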

Avoid baking service account keys into the image; the service account attached to the job in Step 10 provides credentials to the container automatically.

Step 12: Clean up local keys

Delete the service account key after local runs to reduce risk.

gcloud iam service-accounts keys list --iam-account ${SA_EMAIL}

Run the following command to delete a key, replacing KEY_ID with the value from the list command:

gcloud iam service-accounts keys delete KEY_ID --iam-account ${SA_EMAIL}

Step 13: Remove billable cloud artifacts

Clean up Cloud Run, Artifact Registry, and Cloud Storage resources that can incur costs.

gcloud run jobs delete ${JOB_NAME} --region ${REGION}

gcloud artifacts docker images delete ${IMAGE_URI} --delete-tags --quiet
gcloud artifacts repositories delete ${ARTIFACT_REPO} --location ${REGION}

gcloud storage rm -r "gs://${DST_BUCKET}"

gcloud iam service-accounts delete ${SA_EMAIL}

docker rmi my-task:local

Troubleshooting

  • If Cloud Run says the image is not found, confirm the image was pushed to Artifact Registry.
  • If deployment fails on Apple Silicon, rebuild with --platform linux/amd64.
  • If the job starts but fails, check logs:
gcloud run jobs executions list --job ${JOB_NAME} --region ${REGION}

Use the execution name to view logs in the Cloud Console.

Summary

You created a standalone batch task, wired it to Cloud Storage with a service account, validated a CSV with Pandera, and ran the container locally and on Cloud Run Jobs. You can now swap the main() logic for any batch job you need.

MSS checklist

  • Maintainable: configuration is in env vars and dependencies are pinned.
  • Scalable: the task runs in a container and deploys to Cloud Run Jobs.
  • Sustainable: access is limited to a service account and you clean up billable resources.