How to Build an End-to-End Data Engineering and Machine Learning Pipeline with Apache Spark and PySpark


!pip install -q pyspark==3.5.1
from pyspark.sql import SparkSession, functions as F, Window
from pyspark.sql.types import IntegerType, StringType, StructType, StructField, FloatType
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


spark = (SparkSession.builder.appName("ColabSparkAdvancedTutorial")
        .master("local[*]")
        .config("spark.sql.shuffle.partitions", "4")
        .getOrCreate())
print("Spark version:", spark.version)


data = [
   (1, "Alice", "IN", "2025-10-01", 56000.0, "premium"),
   (2, "Bob", "US", "2025-10-03", 43000.0, "standard"),
   (3, "Carlos", "IN", "2025-09-27", 72000.0, "premium"),
   (4, "Diana", "UK", "2025-09-30", 39000.0, "standard"),
   (5, "Esha", "IN", "2025-10-02", 85000.0, "premium"),
   (6, "Farid", "AE", "2025-10-02", 31000.0, "basic"),
   (7, "Gita", "IN", "2025-09-29", 46000.0, "standard"),
   (8, "Hassan", "PK", "2025-10-01", 52000.0, "premium"),
]
schema = StructType([
   StructField("id", IntegerType(), False),
   StructField("name", StringType(), True),
   StructField("country", StringType(), True),
   StructField("signup_date", StringType(), True),
   StructField("income", FloatType(), True),
   StructField("plan", StringType(), True),
])
df = spark.createDataFrame(data, schema)
df.show()



Source link

  • Related Posts

    Meta AI Releases EUPE: A Compact Vision Encoder Family Under 100M Parameters That Rivals Specialist Models Across Image Understanding, Dense Prediction, and VLM Tasks

    Running powerful AI on your smartphone isn’t just a hardware problem — it’s a model architecture problem. Most state-of-the-art vision encoders are enormous, and when you trim them down to…

    An Implementation Guide to Running NVIDIA Transformer Engine with Mixed Precision, FP8 Checks, Benchmarking, and Fallback Execution

    In this tutorial, we implement an advanced, practical implementation of the NVIDIA Transformer Engine in Python, focusing on how mixed-precision acceleration can be explored in a realistic deep learning workflow.…

    Leave a Reply

    Your email address will not be published. Required fields are marked *