In this assignment, you will implement Batch Gradient Descent to fit a line to a two-dimensional data set. You will implement a set of Spark jobs that learn the parameters of such a line from the New York City taxi trip reports from 2013. The data set was released under FOIL (the Freedom of Information Law) and made public by Chris Whong (https://chriswhong.com/open-data/foil_nyc_taxi/). See Assignment 1 for details about this data set. We would like to train a linear model between travel distance in miles and fare amount (the money paid to the taxi).
Taxi Data Set
Please have a look at the table description there. The data set is in Comma-Separated Values (CSV) format. When you read a line and split it on the comma character ",", you will get a string array of length 17. With index numbers starting from zero, for this assignment we need index 5, trip distance (trip distance in miles), and index 11, fare amount (fare amount in dollars), as stated in the following table.
You can use the following PySpark Code to clean up the data.
def isfloat(value):
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        return False

def correctRows(p):
    if len(p) == 17:
        if isfloat(p[5]) and isfloat(p[11]):
            if float(p[5]) != 0 and float(p[11]) != 0:
                return p

testDataFrame = spark.read.format('csv').\
    options(header='false', inferSchema='true', sep=",").\
    load(testFile)

testRDD = testDataFrame.rdd.map(tuple)
taxilinesCorrected = testRDD.filter(correctRows)
In addition to the above filtering, you should remove all rides whose total amount is larger than 600 USD or less than 1 USD. You can preprocess the data, clean it, and store it in your own cluster storage, to avoid repeating this computation in each run.
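The full filter, including the total-amount bounds, can be sketched as a plain Python predicate that you would pass to `filter`. The names and the assumption that the total amount sits at index 16 (the last of the 17 fields) are illustrative; verify the column index against the table description before using it.

```python
def is_float(value):
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        return False

def keep_row(p):
    # p is a tuple/list of 17 fields from one CSV line
    if len(p) != 17:
        return False
    # index 5: trip distance, index 11: fare amount,
    # index 16: total amount (assumed position of the last field)
    if not (is_float(p[5]) and is_float(p[11]) and is_float(p[16])):
        return False
    if float(p[5]) == 0 or float(p[11]) == 0:
        return False
    return 1.0 <= float(p[16]) <= 600.0
```

Usage would then be `taxilinesCorrected = testRDD.filter(keep_row)`.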
Obtaining the Dataset
Small data set. (93 MB compressed, uncompressed 384 MB) for implementation and testing purposes (roughly 2 million taxi trips). This is available at Google Storage:
https://storage.googleapis.com/met-cs-777-data/taxi-data-sorted-small.csv.bz2 and the whole dataset (8GB) https://storage.googleapis.com/met-cs-777-data/taxi-data-sorted-large.csv.bz2
When running your code on the cluster, you can access the data sets using the following internal URLs:
Task 1: Simple Linear Regression
We want to fit a simple line to our data (distance, money). Consider the Simple Linear Regression model given in equation (1):

    y = m x + b                                                   (1)

The solutions for the slope m of the line and the y-intercept b are calculated based on equations (2) and (3):

    m = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²)       (2)

    b = (Σ y_i − m Σ x_i) / n                                     (3)
Implement a PySpark Job that calculates the exact answers for the parameters m and b. The line slope is the parameter m, and b is the y-intercept of the line.
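The Spark job only needs to aggregate five sums over the (distance, fare) pairs; the closed-form computation on top of those sums can be sketched in plain NumPy on a toy data set (the values below lie exactly on y = 2x + 1 and are for illustration only — in PySpark the same five sums would come from one map/reduce over the cleaned RDD):

```python
import numpy as np

# Toy (distance, fare) pairs lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# The five aggregates a Spark job would compute with map/reduce
n = len(x)
sum_x, sum_y = x.sum(), y.sum()
sum_xy, sum_xx = (x * y).sum(), (x * x).sum()

# Closed-form slope and intercept from equations (2) and (3)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b = (sum_y - m * sum_x) / n
```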
Run your implementation on the large data set and report the computation time of your Spark job for this task. You can find the completion time of your job in the cloud console.
Note on Task 1: Depending on your implementation, executing this task on the large data set can take a long time; for example, on a cluster with 12 cores in total, it can take more than 30 minutes of computation time.
Task 2 - Find the Parameters using Gradient Descent
In this task, you should implement batch gradient descent to find the optimal parameters for our Simple Linear Regression model.
You should load the data into Spark cluster memory as an RDD or a DataFrame. The cost function will then be:

    J(m, b) = (1/n) Σ_{i=1..n} (y_i − (m x_i + b))²
Here is a list of recommended values of the important setup parameters:
Initialize all parameters with 0
Start with a very small learning rate and then increase it to speed up the learning. For example, you can start with learningRate=0.000001 and try larger rates as long as the process still converges. Note: the optimal learning rate for the big data set can differ from the optimal learning rate for the small data set.
The maximum number of iterations should be num_iteration=50.
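The recommended setup above can be sketched as a batch gradient descent loop. The sketch below runs locally on toy data with a toy learning rate; on the cluster, the two gradient sums would instead come from a map/reduce over the RDD, with only the parameter update happening on the driver.

```python
import numpy as np

# Toy data lying on y = 2x + 1 (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
n = len(x)

m, b = 0.0, 0.0          # initialize all parameters with 0
learning_rate = 0.01     # toy-data rate; start much smaller on the real data
num_iteration = 50

for i in range(num_iteration):
    pred = m * x + b
    cost = ((y - pred) ** 2).mean()
    # gradients of the mean-squared-error cost
    grad_m = (-2.0 / n) * (x * (y - pred)).sum()
    grad_b = (-2.0 / n) * (y - pred).sum()
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b
    print(i, cost, m, b)   # costs and parameters in each iteration
```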
Run your implementation on the large data set and report the computation time for your Spark Job for this task. Compare the computation time with the previous tasks.
Print out the costs in each iteration.
Print out the model parameters in each iteration.
Comment on how you can interpret the parameters of the model. What is the meaning of m and b in this case?
Note: You might write some code for the gradient descent iteration in PySpark that works perfectly on your laptop but does not run on the clusters (AWS/Google Cloud). The main reason is that on your laptop it runs in a single process, while on a cluster it runs on multiple shared-nothing processes. You need to reduce across all of the jobs/processes to update the variables; otherwise, each process will have its own copy of the variables.
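The safe pattern is to compute per-record gradient contributions with a map, sum them with a reduce, and update the parameters only on the driver. The shape of that pattern can be mimicked in plain Python with `functools.reduce` (in PySpark the same line would read roughly `data_rdd.map(contribution).reduce(add_triples)`; all names here are illustrative):

```python
from functools import reduce

data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # toy (x, y) pairs
m, b = 0.0, 0.0

def contribution(point):
    # each worker computes this independently for its records
    x, y = point
    residual = y - (m * x + b)
    # per-record pieces of the gradient sums, plus a count of 1
    return (-2.0 * x * residual, -2.0 * residual, 1)

def add_triples(a, c):
    return (a[0] + c[0], a[1] + c[1], a[2] + c[2])

# only these reduced sums reach the driver
grad_m_sum, grad_b_sum, n = reduce(add_triples, map(contribution, data))
```

The driver then applies the update, e.g. `m -= learning_rate * grad_m_sum / n`, before broadcasting the new parameters into the next iteration's map.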
Task 3 - Fit Multiple Linear Regression using Gradient Descent
We would like to learn a linear model with four variables to predict the total paid amounts of Taxi rides. The following table describes the variables that we want to use.
Initialize all parameters with 0
Start with a very small learning rate and then increase it to speed up the learning. For example, you can start with learningRate=0.000000001 and try larger rates as long as the process still converges. Note: the optimal learning rate for the big data set can differ from the optimal learning rate for the small data set.
The maximum number of iterations should be 50 (num_iteration=50).
Use NumPy Arrays to calculate the gradients - Vectorization
Implement the ”Bold Driver” technique to change the learning rate dynamically. (3 points of 8 points)
You need to find parameters that will lead to convergence of the optimization algorithm.
Print out the costs in each iteration
Print out the model parameters in each iteration
Comment on how you can interpret the parameters of the model. What is the meaning of m_i and b in this case?
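The vectorized gradient and the Bold Driver rule can be sketched together in NumPy on synthetic data. The Bold Driver idea: if the cost went down, grow the learning rate slightly; if it went up, shrink it sharply. The growth/shrink factors (1.05 and 0.5) and the data are illustrative, not prescribed by the assignment, and a stricter Bold Driver would also revert the parameters after a cost increase.

```python
import numpy as np

# Synthetic data: 4 predictor variables, exact linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(200, 4))
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w + 4.0                         # intercept 4.0, no noise

n, d = X.shape
w = np.zeros(d)                              # initialize all parameters with 0
b = 0.0
lr = 1e-3
prev_cost = np.inf

for _ in range(50):                          # num_iteration = 50
    pred = X @ w + b
    cost = ((y - pred) ** 2).mean()
    if cost < prev_cost:
        lr *= 1.05                           # bold driver: reward progress
    else:
        lr *= 0.5                            # bold driver: punish overshoot
    grad_w = (-2.0 / n) * (X.T @ (y - pred)) # vectorized gradient, no loops
    grad_b = (-2.0 / n) * (y - pred).sum()
    w -= lr * grad_w
    b -= lr * grad_b
    prev_cost = cost
```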
Machines to Use
One thing to be aware of is that you can choose virtually any configuration for your cloud cluster - you can choose different numbers of machines and different configurations of those machines, and each will cost you differently! Since this is real money, it makes sense to develop your code and run your jobs locally, on your laptop, using the small data set. Once things are working, you'll then move to the cloud.
As a proposal for this assignment, you can use the e2-standard-4 machines on the Google Cloud, one for the Master node and two for worker nodes. You will have three machines with a total of 12 vCPU and 48GB RAM. 100 GB of disk space will be enough. Remember to delete your cluster after the calculation is finished!!!
More information regarding Google Cloud pricing can be found at https://cloud.google.com/products/calculator. As you can see, an average server costs around 50 cents per hour. That is not much, but IT WILL ADD UP QUICKLY IF YOU FORGET TO SHUT OFF YOUR MACHINES. Be very careful, and stop your machines as soon as you are done working. You can always come back and start your machine, or easily create a new one, when you begin your work again. Another thing to be aware of is that Google and Amazon charge you when you move data around. To avoid such charges, do everything in the "Iowa (us-central1)" region. That's where the data is, and that's where you should put your data and machines.
You should document your code thoroughly.
Your code should run on a Unix-based operating system such as Linux or macOS.