
Coding Sample Paper - Machine Learning | Predict Nightly Airbnb Rental Prices in San Francisco.

Updated: May 18, 2022

Task 1


Create a Spark DataFrame from /databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet/. Visualize and explore the data. Note anything you find interesting. This dataset is a slightly cleansed form of the Inside Airbnb dataset for San Francisco.
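A minimal sketch of the loading and first-look step, assuming a Databricks notebook in which a SparkSession named `spark` already exists. The helper names `load_airbnb` and `quick_look` are illustrative and not part of the task; `price` is taken as the label because Q2 asks to predict it.

```python
AIRBNB_PATH = "/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet/"


def load_airbnb(spark, path=AIRBNB_PATH):
    """Read the cleansed SF Airbnb listings into a Spark DataFrame."""
    return spark.read.parquet(path)


def quick_look(df, label="price"):
    """Print the schema, the row count, and summary statistics for the label."""
    df.printSchema()
    print("rows:", df.count())
    df.select(label).summary().show()
```

In a notebook this would be called as `df = load_airbnb(spark)` followed by `quick_look(df)`. On Databricks, `display(df)` additionally offers interactive plots for the visual exploration.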

Q2: Model Development and Tracking

  • Split into 80/20 train-test split using SparkML APIs.

  • Build a model using SparkML to predict price given the other input features (or subset of them).

  • Mention why you chose this model, how it works, and other models that you considered.

  • Compute the loss metric on the test dataset and explain your choice of loss metric.

  • Log your model/hyperparameters/metrics to MLflow.
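One possible shape for these steps is sketched below. It assumes linear regression as a baseline model, RMSE as the loss metric (it is expressed in the same units as the price), and that every numeric column other than `price` is usable as a feature without nulls. None of these choices is prescribed by the task, and the function name `train_and_log` is hypothetical. The small `rmse` helper only illustrates what the metric computes.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error, in the same units as the label."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def train_and_log(df, label="price", seed=42):
    """Split 80/20, fit a linear regression on numeric columns, log to MLflow."""
    # Imported lazily so rmse() above works without Spark or MLflow installed.
    import mlflow
    import mlflow.spark
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql.types import NumericType

    train_df, test_df = df.randomSplit([0.8, 0.2], seed=seed)

    # Assumption: numeric, non-null columns only; string columns are ignored.
    features = [f.name for f in df.schema.fields
                if isinstance(f.dataType, NumericType) and f.name != label]
    assembler = VectorAssembler(inputCols=features, outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol=label)

    with mlflow.start_run():
        model = Pipeline(stages=[assembler, lr]).fit(train_df)
        evaluator = RegressionEvaluator(labelCol=label,
                                        predictionCol="prediction",
                                        metricName="rmse")
        test_rmse = evaluator.evaluate(model.transform(test_df))
        mlflow.log_param("model_type", "LinearRegression")
        mlflow.log_param("split_seed", seed)
        mlflow.log_metric("test_rmse", test_rmse)
        mlflow.spark.log_model(model, "model")
    return model, test_rmse
```

A tree-based regressor could be swapped in for `LinearRegression` without changing the rest of the pipeline, which makes it easy to compare the models the task asks you to consider.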

Task 2

Question 1

Part 1: Code analysis and documentation

In the following cells is code to generate a synthetic data set. At each point that is marked by commenting blocks ( '#', '"""', '''''), fill in appropriate comments that explain the functionality of each part of the subsequent code in standard Python code style.

import collections
from typing import List
DataStructure = collections.namedtuple('DataStructure', 'value1 value2 value3 value4 value5 value6')
ModuloResult = collections.namedtuple('ModuloResult', 'factor remain')
from pyspark.sql.types import DoubleType, StructType
from pyspark.sql.functions import lit, col
from pyspark.sql import DataFrame
import random
import numpy
from functools import reduce
import math
STDDEV_NAME = "_std"

class DataGenerator:
    def __init__(self, DISTINCT_NOMINAL, STDDEV_NAME):
        # (constructor body missing in the source)

    def modeFlag(self, mode: str):
        """ comments """
        modeVal = {
            "ascending" : False,
            "descending" : True
        }
        return modeVal.get(mode)

    def lfold(self, func, nums, exp):
        """ comments """
        acc = []
        for i in range(len(nums)):
            result = reduce(func, nums[:i+1], exp)
            acc.append(result)
        return acc

    def generateDoublesData(self, targetCount: int, start: float, step: float, mode: str):
        """ comments """
        stoppingPoint = (targetCount * step) + start
        doubleArray = list(numpy.arange(start, stoppingPoint, step))
        try:
            doubleArray = sorted(doubleArray, reverse=self.modeFlag(mode))
        # (line(s) missing in the source)
            if (mode == 'random'):
            # (line(s) missing in the source)
                raise Exception(mode, " is not supported.")
        return doubleArray

    def generateDoublesMod(self, targetCount: int, start: float, step: float, mode: str, exp: float):
        """ comments """
        doubles = self.generateDoublesData(targetCount, start, step, mode)
        res = (lambda x, y: x + ((x + y) / x))
        return self.lfold(res, doubles, exp)

    def generateDoublesMod2(self, targetCount: int, start: float, step: float, mode: str):
        """ comments """
        doubles = self.generateDoublesData(targetCount, start, step, mode)
        func = (lambda x, y: (math.pow((x-y)/math.sqrt(y), 2)))
        sequenceEval = reduce(func, doubles, 0)
        res = (lambda x, y: (x + (x / y)) / x)
        return self.lfold(res, doubles, sequenceEval)

    def generateIntData(self, targetCount: int, start: int, step: int, mode: str):
        """ comments """
        stoppingPoint = (targetCount * step) + start
        intArray = list(range(start, stoppingPoint, step))
        try:
            intArray = sorted(intArray, reverse=self.modeFlag(mode))
        # (line(s) missing in the source)
            if (mode == 'random'):
            # (line(s) missing in the source)
                raise Exception(mode, " is not supported.")
        return intArray

    def generateRepeatingIntData(self, targetCount: int, start: int, step: int, mode: str, distinctValues: int):
        """ comments """
        subStopPoint = (distinctValues * step) + start - 1
        distinctArray = list(range(start, subStopPoint, step))
        try:
            sortedArray = sorted(distinctArray, reverse=self.modeFlag(mode))
        # (line(s) missing in the source)
            if (mode != 'random'):
                raise Exception(mode, " is not supported.")
        outputArray = numpy.full((int(targetCount / (len(sortedArray) - 1)), len(sortedArray)),
        if (mode == 'random'):
            # (line(s) missing in the source)
        return outputArray

    def getDoubleCols(self, schema: StructType):
        """ comments """
        return [s.name for s in schema if s.dataType == DoubleType()]

    def normalizeDoubleTypes(self, df: DataFrame):
        """ comments """
        doubleTypes = self.getDoubleCols(df.schema)
        stddevValues ="stddev").first()
        for indx in range(0, len(doubleTypes)):
            df = df.withColumn(doubleTypes[indx]+STDDEV_NAME,
        return df

    def generateData(self, targetCount: int):
        """ comments """
        seq1 = self.generateIntData(targetCount, 1, 1, "ascending")
        seq2 = self.generateDoublesData(targetCount, 1.0, 1.0, "descending")
        seq3 = self.generateDoublesMod(targetCount, 1.0, 1.0, "ascending", 2.0)
        seq4 = list(map(lambda x: x * -10, self.generateDoublesMod2(targetCount, 1.0, 1.0,
        seq5 = self.generateRepeatingIntData(targetCount, 0, 5, "ascending",
        seq6 = self.generateDoublesMod2(targetCount, 1.0, 1.0, "descending")
        seqData: List[DataStructure] = []
        for i in range(0, targetCount):
            seqData.append(DataStructure(value1=seq1[i], value2=seq2[i].item(),
                                         value3=seq3[i].item(), value4=seq4[i].item(),
                                         value5=seq5[i], value6=seq6[i].item()))
        return self.normalizeDoubleTypes(spark.createDataFrame(seqData))

    def generateCoordData(self, targetCount: int):
        """ comments """
        coordData = (self.generateData(targetCount)
                     .withColumnRenamed("value2_std", "x1")
                     .withColumnRenamed("value3_std", "x2")
                     .withColumnRenamed("value4_std", "y1")
                     .withColumnRenamed("value6_std", "y2")
                     .select("x1", "x2", "y1", "y2"))
        return coordData
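As an illustration of the comment style Part 1 asks for, here is a standalone, documented version of `lfold`. It is a sketch that assumes the method is meant to collect every intermediate fold result in `acc`. The wording of the docstring is only one way to describe it.

```python
from functools import reduce


def lfold(func, nums, exp):
    """Return the running left folds of ``nums``.

    Element ``i`` of the result is ``reduce(func, nums[:i + 1], exp)``: ``func``
    applied left to right over the first ``i + 1`` items, starting from the
    initial value ``exp``.
    """
    acc = []
    for i in range(len(nums)):
        # Fold the prefix nums[0..i]. Restarting from `exp` for every prefix
        # makes the whole loop quadratic in len(nums).
        acc.append(reduce(func, nums[:i + 1], exp))
    return acc


print(lfold(lambda x, y: x + y, [1, 2, 3], 0))  # [1, 3, 6]
```

With addition as `func` the result is a running sum, which makes the behavior easy to check by hand before documenting the more involved lambdas in `generateDoublesMod` and `generateDoublesMod2`.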

Part 2: Data Normalcy and Filtering

Many data manipulation tasks require the identification and handling of outlier data. In this section, examine the data set that is generated and write a function that will determine the distribution type of a collection of column names passed in. The only distribution types that are required to be detected are:

  • Normal Distribution

  • Left Tailed

  • Right Tailed

The return type of this function should be a dictionary of (ColumnName -> Distribution Type).

dataGenerator = DataGenerator(DISTINCT_NOMINAL, STDDEV_NAME) 
data = dataGenerator.generateData(1000) 
columnsToCheck = ["value2_std", "value3_std", "value4_std", "value6_std"] 
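One way to approach this is to classify each column by its skewness. The sketch below assumes a cutoff of 0.5 (a rule of thumb, not something the task specifies) and treats any column whose skewness falls inside the cutoff as normal, which is a simplification because a symmetric distribution is not necessarily normal. The names `skewness`, `label_from_skew` and `classify_columns` are hypothetical.

```python
import math


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def label_from_skew(skew, threshold=0.5):
    """Map a skewness value to one of the three required labels."""
    if skew > threshold:
        return "Right Tailed"
    if skew < -threshold:
        return "Left Tailed"
    return "Normal Distribution"


def classify_columns(df, columns, threshold=0.5):
    """Return {column name: distribution type} for a Spark DataFrame."""
    from pyspark.sql import functions as F  # lazy import: needs Spark

    # A single aggregation pass computes the skewness of every column.
    row = df.agg(*[F.skewness(c).alias(c) for c in columns]).first()
    return {c: label_from_skew(row[c], threshold) for c in columns}
```

With the variables defined above, the call would be `classify_columns(data, columnsToCheck)`. The pure-Python `skewness` is included only to show what the aggregate measures; on large data the Spark aggregate does the work.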

Part 3: Testing

In order to validate that the function that you have written performs as intended, write a simple test that could be placed in a unit testing framework.

  • Demonstrate that the test passes while validating proper classification of at most one type of distribution.

  • Demonstrate the test failing at classifying correctly, but ensure that the application continues to run (handle the exception and report the failure to stdout).

(Hint: Distribution characteristics may change with the number of rows generated based on the data generator's equations)
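A minimal sketch of both demonstrations follows. It uses plain `assert`-based checks, which could be dropped into a framework such as pytest, and a small runner that catches the failure and prints it. The `classify` function here is a hypothetical stand-in for whatever classifier you wrote in Part 2, and the sample data is made up for the illustration.

```python
def classify(values, threshold=0.5):
    """Stand-in classifier (hypothetical): labels data by its skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skew = m3 / m2 ** 1.5
    if skew > threshold:
        return "Right Tailed"
    if skew < -threshold:
        return "Left Tailed"
    return "Normal Distribution"


def check_right_tailed():
    # Passes: one large value stretches the tail to the right.
    assert classify([1, 1, 1, 1, 10]) == "Right Tailed"


def check_wrong_expectation():
    # Fails on purpose: the same data is not left tailed.
    assert classify([1, 1, 1, 1, 10]) == "Left Tailed", "expected Left Tailed"


def run_checks(checks):
    """Run each check, report to stdout, and keep going after a failure."""
    results = {}
    for check in checks:
        try:
            check()
            results[check.__name__] = True
            print(f"PASS {check.__name__}")
        except AssertionError as err:
            results[check.__name__] = False
            print(f"FAIL {check.__name__}: {err}")
    return results
```

Calling `run_checks([check_right_tailed, check_wrong_expectation])` prints one PASS line and one FAIL line and then returns normally, so the surrounding application keeps running.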

Part 4: Efficient Calculations

In this section, create a function that allows for the calculation of Euclidean distance between the pairs (x1, y1) and (x2, y2). Choose the approach that scales best to extremely large data sizes.

  • Once complete, determine the distribution type of your derived distance column using the function you developed above in Part 2.

  • Show a plot of the distribution to ensure that the distribution type is correct.

coordData = dataGenerator.generateCoordData(1000) 
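A sketch of one scalable option is shown below: the distance is written as a native Spark column expression, so it is evaluated inside the Spark engine rather than row by row in a Python UDF. The function names and the output column name `distance` are assumptions made for the example. The plain-Python helper is only a reference for checking individual rows.

```python
import math


def euclidean_distance(x1, y1, x2, y2):
    """Plain-Python reference: distance between (x1, y1) and (x2, y2)."""
    return math.hypot(x2 - x1, y2 - y1)


def with_distance(df, out_col="distance"):
    """Add the distance as a built-in Spark column expression (no Python UDF)."""
    from pyspark.sql import functions as F  # lazy import: needs Spark

    dx = F.col("x2") - F.col("x1")
    dy = F.col("y2") - F.col("y1")
    return df.withColumn(out_col, F.sqrt(dx * dx + dy * dy))
```

Usage would be `distData = with_distance(coordData)`, after which the `distance` column can be passed to the Part 2 function and plotted.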

Part 5: Complex DataTypes

In this section, create a new column that shows the mid-point coordinates between the (x1, y1) and (x2, y2) values in each row.

  • After the new column has been created, write a function that will calculate the distance from each pair (x1, y1) and (x2, y2) to the mid-point value.

  • Once the distances have been calculated, run a validation check to ensure that the expected result is achieved.
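The steps above can be sketched as follows, assuming the mid-point is stored as a struct column with fields `x` and `y` (the column and field names are choices made for this example). The expected result used for validation is geometric: each end point lies at half the full distance from the mid-point.

```python
import math


def midpoint(x1, y1, x2, y2):
    """Mid-point between (x1, y1) and (x2, y2)."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def distances_to_midpoint(x1, y1, x2, y2):
    """Distances from each end point to the mid-point."""
    mx, my = midpoint(x1, y1, x2, y2)
    return (math.hypot(mx - x1, my - y1), math.hypot(x2 - mx, y2 - my))


def with_midpoint(df):
    """Add a struct column `midpoint` with fields `x` and `y` (Spark)."""
    from pyspark.sql import functions as F  # lazy import: needs Spark

    return df.withColumn(
        "midpoint",
        F.struct(((F.col("x1") + F.col("x2")) / 2).alias("x"),
                 ((F.col("y1") + F.col("y2")) / 2).alias("y")))
```

In Spark the struct fields can then be addressed as `midpoint.x` and `midpoint.y` when computing the two distance columns for the validation check.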

Part 6: Precision

  • How many rows of data do not match?

  • Why would they / wouldn't they match?
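When reasoning about these questions it helps to remember that floating-point arithmetic is not exact, so two mathematically equal results can differ in their last bits. The short sketch below, with made-up sample values and a hypothetical helper `count_mismatches`, contrasts an exact comparison with a tolerance-based one.

```python
import math


def count_mismatches(pairs, rel_tol=None):
    """Count (a, b) pairs that differ, exactly or beyond a relative tolerance."""
    if rel_tol is None:
        return sum(1 for a, b in pairs if a != b)
    return sum(1 for a, b in pairs if not math.isclose(a, b, rel_tol=rel_tol))


pairs = [(0.1 + 0.2, 0.3), (0.5 + 0.25, 0.75)]
print(count_mismatches(pairs))                # 1: 0.1 + 0.2 is not exactly 0.3
print(count_mismatches(pairs, rel_tol=1e-9))  # 0: equal within the tolerance
```

The same idea carries over to a DataFrame: comparing the absolute difference of two columns against a small tolerance is usually more meaningful than testing them for strict equality.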
