
Coding Sample Paper - Machine Learning | Predict Nightly Airbnb Rental Prices in San Francisco.

Updated: May 18, 2022



Task 1

Q1: EDA

Create a Spark DataFrame from /databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet/. Visualize and explore the data. Note anything you find interesting. This dataset is a slightly cleansed form of the Inside Airbnb dataset for San Francisco.


http://insideairbnb.com/get-the-data.html


Q2: Model Development and Tracking

  • Split the data into an 80/20 train-test split using SparkML APIs.

  • Build a model using SparkML to predict price given the other input features (or subset of them).

  • Mention why you chose this model, how it works, and other models that you considered.

  • Compute the loss metric on the test dataset and explain your choice of loss metric.

  • Log your model/hyperparameters/metrics to MLflow.
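As a way to reason about the loss-metric bullet above, here is a minimal sketch that compares RMSE and MAE on a handful of made-up nightly prices. The numbers are illustrative assumptions, not values from the dataset, and the code uses plain Python rather than SparkML so it runs anywhere. In SparkML, the same metrics are available through RegressionEvaluator with metricName set to "rmse" or "mae".

```python
import math

def mae(actual, predicted):
    """Mean absolute error: every dollar of error counts the same."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: large misses are penalized more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up prices (assumption): three small misses and one large miss.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 450.0]

print(f"MAE  = {mae(actual, predicted):.2f}")
print(f"RMSE = {rmse(actual, predicted):.2f}")
```

The single large miss pulls RMSE well above MAE. Whether that sensitivity is desirable depends on how costly badly mispriced listings are for the use case, which is the kind of argument the question asks for.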


Task 2

Question 1

Part 1: Code analysis and documentation


In the following cells is code to generate a synthetic data set. At each point that is marked by commenting blocks ('#', '"""', "'''"), fill in appropriate comments that explain the functionality of each part of the subsequent code in standard Python code style.


import collections
"""
"""
DataStructure = collections.namedtuple('DataStructure', 'value1 value2 value3 value4 value5 value6')
ModuloResult = collections.namedtuple('ModuloResult', 'factor remain')
from pyspark.sql.types import DoubleType, StructType
from pyspark.sql.functions import lit, col
from pyspark.sql import DataFrame
from typing import List
import random
import numpy
from functools import reduce
import math
DISTINCT_NOMINAL = 5
STDDEV_NAME = "_std"
 
class DataGenerator:
    def __init__(self, DISTINCT_NOMINAL, STDDEV_NAME):
        self.DISTINCT_NOMINAL = DISTINCT_NOMINAL
        self.STDDEV_NAME = STDDEV_NAME

    def modeFlag(self, mode: str):
        """ comments
        """

        modeVal = {
            "ascending" : False,
            "descending" : True
        }
        return modeVal.get(mode)

    def lfold(self, func, nums, exp):
        """ comments
        """

        acc = []
        for i in range(len(nums)):
            result = reduce(func, nums[:i+1], exp)
            acc.append(result)
        return acc

    def generateDoublesData(self, targetCount: int, start: float, step: float, mode: str):
        """ comments
        """

        stoppingPoint = (targetCount * step) + start
        doubleArray = list(numpy.arange(start, stoppingPoint, step))
        try:
            doubleArray = sorted(doubleArray, reverse=self.modeFlag(mode))
        except:
            if (mode == 'random'):
                random.shuffle(doubleArray)
            else:
                raise Exception(mode, " is not supported.")

        return doubleArray

    def generateDoublesMod(self, targetCount: int, start: float, step: float, mode: str, exp: float):
        """ comments
        """
        doubles = self.generateDoublesData(targetCount, start, step, mode)
        res = (lambda x, y: x + ((x + y) / x))

        return self.lfold(res, doubles, exp)

    def generateDoublesMod2(self, targetCount: int, start: float, step: float, mode: str):
        """ comments
        """

        doubles = self.generateDoublesData(targetCount, start, step, mode)

        func = (lambda x, y: (math.pow((x-y)/math.sqrt(y), 2)))
        sequenceEval = reduce(func, doubles, 0)

        res = (lambda x, y: (x + (x / y)) / x)
        return self.lfold(res, doubles, sequenceEval)

    def generateIntData(self, targetCount: int, start: int, step: int, mode: str):
        """ comments
        """

        stoppingPoint = (targetCount * step) + start
        intArray = list(range(start, stoppingPoint, step))
        try:
            intArray = sorted(intArray, reverse=self.modeFlag(mode))
        except:
            if (mode == 'random'):
                random.shuffle(intArray)
            else:
                raise Exception(mode, " is not supported.")

        return intArray

    def generateRepeatingIntData(self, targetCount: int, start: int, step: int, mode: str,
                                 distinctValues: int):
        """ comments
        """

        subStopPoint = (distinctValues * step) + start - 1
        distinctArray = list(range(start, subStopPoint, step))
        try:
            sortedArray = sorted(distinctArray, reverse=self.modeFlag(mode))
        except:
            if (mode != 'random'):
                raise Exception(mode, " is not supported.")
        outputArray = numpy.full((int(targetCount / (len(sortedArray) - 1)), len(sortedArray)),
                                 sortedArray).flatten().tolist()[:targetCount]
        if (mode == 'random'):
            random.shuffle(outputArray)

        return outputArray

    def getDoubleCols(self, schema: StructType):
        """ comments
        """

        return [s.name for s in schema if s.dataType == DoubleType()]

    def normalizeDoubleTypes(self, df: DataFrame):
        """ comments
        """
        doubleTypes = self.getDoubleCols(df.schema)
        stddevValues = df.select(doubleTypes).summary("stddev").first()

        for indx in range(0, len(doubleTypes)):
            df = df.withColumn(doubleTypes[indx]+STDDEV_NAME,
                               col(doubleTypes[indx])/stddevValues[indx+1])
        return df

    def generateData(self, targetCount: int):
        """ comments
        """

        seq1 = self.generateIntData(targetCount, 1, 1, "ascending")
        seq2 = self.generateDoublesData(targetCount, 1.0, 1.0, "descending")
        seq3 = self.generateDoublesMod(targetCount, 1.0, 1.0, "ascending", 2.0)
        seq4 = list(map(lambda x: x * -10,
                        self.generateDoublesMod2(targetCount, 1.0, 1.0, "ascending")))
        seq5 = self.generateRepeatingIntData(targetCount, 0, 5, "ascending",
                                             DISTINCT_NOMINAL)
        seq6 = self.generateDoublesMod2(targetCount, 1.0, 1.0, "descending")

        seqData: List[DataStructure] = []

        for i in range(0, targetCount):
            seqData.append(DataStructure(value1=seq1[i], value2=seq2[i].item(),
                                         value3=seq3[i].item(), value4=seq4[i].item(),
                                         value5=seq5[i], value6=seq6[i].item()))

        return self.normalizeDoubleTypes(spark.createDataFrame(seqData))

    def generateCoordData(self, targetCount: int):
        """ comments
        """

        coordData = (self.generateData(targetCount)
                     .withColumnRenamed("value2_std", "x1")
                     .withColumnRenamed("value3_std", "x2")
                     .withColumnRenamed("value4_std", "y1")
                     .withColumnRenamed("value6_std", "y2")
                     .select("x1", "x2", "y1", "y2"))
        return coordData
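When documenting lfold, it may help to see that it behaves like a running (cumulative) left fold. The sketch below copies the method's logic into a free function and compares it with itertools.accumulate from the standard library. The name lfold_linear is hypothetical and is not part of the original class; the initial argument of accumulate requires Python 3.8 or later.

```python
from functools import reduce
from itertools import accumulate

def lfold(func, nums, exp):
    """Same logic as DataGenerator.lfold, written as a free function."""
    acc = []
    for i in range(len(nums)):
        # Re-folds the first i+1 elements from scratch on every iteration.
        acc.append(reduce(func, nums[:i + 1], exp))
    return acc

def lfold_linear(func, nums, exp):
    """Hypothetical single-pass equivalent built on itertools.accumulate."""
    # accumulate(..., initial=exp) yields exp itself first, so drop that element.
    return list(accumulate(nums, func, initial=exp))[1:]

def add(x, y):
    return x + y

print(lfold(add, [1, 2, 3, 4], 10))
print(lfold_linear(add, [1, 2, 3, 4], 10))
```

Both calls produce the same running totals. The original version re-reduces a growing prefix each time, so its cost grows quadratically with the input length, while the accumulate version makes one pass.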

Part 2: Data Normalcy and Filtering

Many data manipulation tasks require the identification and handling of outlier data. In this section, examine the data set that is generated and write a function that will determine the distribution type of a collection of column names passed in. The only distribution types that are required to be detected are:

  • Normal Distribution

  • Left Tailed

  • Right Tailed

The return type of this function should be a Dictionary of (ColumnName -> Distribution Type).


dataGenerator = DataGenerator(DISTINCT_NOMINAL, STDDEV_NAME) 
data = dataGenerator.generateData(1000) 
columnsToCheck = ["value2_std", "value3_std", "value4_std", "value6_std"] 
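One possible approach, sketched here under stated assumptions, is to classify each column by the sign and size of its skewness. The code uses plain Python lists as a stand-in for DataFrame columns so that it runs without Spark; on a real DataFrame the per-column value could instead come from the skewness aggregate in pyspark.sql.functions. The function names and the 0.5 threshold are assumptions chosen for illustration, and near-zero skewness is only a rough proxy for normality.

```python
def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column: treated as symmetric (assumption)
    return m3 / m2 ** 1.5

def classify_distributions(columns, threshold=0.5):
    """Map each column name to one of the three required labels.

    `columns` is a dict of column name -> list of numbers (a stand-in
    for selecting those columns from a Spark DataFrame).
    """
    result = {}
    for name, values in columns.items():
        skew = skewness(values)
        if abs(skew) <= threshold:
            result[name] = "Normal Distribution"
        elif skew < 0:
            result[name] = "Left Tailed"   # long tail toward small values
        else:
            result[name] = "Right Tailed"  # long tail toward large values
    return result

sample = {
    "symmetric": [1, 2, 3, 4, 5],
    "right": [1, 1, 1, 2, 10],
    "left": [-10, -2, -1, -1, -1],
}
print(classify_distributions(sample))
```

The return value has the required shape, a dictionary of column name to distribution type. A stricter solution might combine skewness with kurtosis or a formal normality test.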

Part 3: Testing

In order to validate that the function that you have written performs as intended, write a simple test that could be placed in a unit testing framework.

  • Demonstrate that the test passes while validating proper classification of at most one type of distribution

  • Demonstrate the test failing to classify correctly, but ensure that the application continues to run (handle the exception and report the failure to stdout)

(Hint: Distribution characteristics may change with the number of rows generated based on the data generator's equations)
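The pass/fail behaviour described above can be sketched as follows. This is a minimal illustration, not the expected solution: classify_by_mean_median is a hypothetical stand-in for the Part 2 function, and run_check is a hypothetical helper that catches the assertion failure, reports it to stdout, and lets the program continue.

```python
def classify_by_mean_median(values):
    """Hypothetical stand-in classifier: compares the mean with the median."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    mean = sum(ordered) / len(ordered)
    if mean > median:
        return "Right Tailed"
    if mean < median:
        return "Left Tailed"
    return "Normal Distribution"

def run_check(name, check):
    """Run one check; report the outcome to stdout instead of crashing."""
    try:
        check()
    except AssertionError as err:
        print(f"FAIL {name}: {err}")
        return False
    print(f"PASS {name}")
    return True

def check_right_tailed():
    # Expected to pass: the large value 10 pulls the mean above the median.
    assert classify_by_mean_median([1, 1, 1, 2, 10]) == "Right Tailed"

def check_wrong_expectation():
    # Deliberately wrong expectation, to demonstrate a handled failure.
    actual = classify_by_mean_median([1, 1, 1, 2, 10])
    assert actual == "Left Tailed", f"expected Left Tailed, got {actual}"

results = [
    run_check("right_tailed", check_right_tailed),
    run_check("wrong_expectation", check_wrong_expectation),
]
print("application still running; results:", results)
```

In a unit testing framework such as unittest or pytest, the two check functions would become test cases and the framework would take over the reporting that run_check does here.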


Part 4: Efficient Calculations

In this section, create a function that allows for the calculation of Euclidean distance between the pairs (x1, y1) and (x2, y2). Choose the approach that scales best to extremely large data sizes.

  • Once complete, determine the distribution type of your derived distance column using the function you developed above in Part 2.

  • Show a plot of the distribution to ensure that the distribution type is correct.


coordData = dataGenerator.generateCoordData(1000) 
display(coordData)
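One approach that tends to scale well is to express the distance with Spark's built-in SQL functions rather than a Python UDF, so the computation stays inside Spark's optimized execution engine. The sketch below assumes a DataFrame with the x1, y1, x2, y2 columns produced above; the function name with_distance and the output column name distance are assumptions. It relies only on DataFrame.selectExpr and the Spark SQL functions sqrt and pow.

```python
# Spark SQL expression for the distance between (x1, y1) and (x2, y2).
DISTANCE_EXPR = "sqrt(pow(x2 - x1, 2) + pow(y2 - y1, 2))"

def with_distance(df, name="distance"):
    """Return df with an extra column holding the Euclidean distance.

    `df` is expected to be a pyspark.sql.DataFrame that has the columns
    x1, y1, x2 and y2. No Python UDF is involved, so rows are never
    serialized out to a Python worker.
    """
    return df.selectExpr("*", f"{DISTANCE_EXPR} AS {name}")

# Intended usage on Databricks (not executed here):
# coordData = with_distance(dataGenerator.generateCoordData(1000))
# display(coordData)
```

The derived distance column can then be passed to the Part 2 function and plotted like any other numeric column.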

Part 5: Complex DataTypes

In this section, create a new column that shows the mid-point coordinates between the (x1, y1) and (x2, y2) values in each row.

  • After the new column has been created, write a function that will calculate the distance from each pair (x1, y1) and (x2, y2) to the mid-point value.

  • Once the distances have been calculated, run a validation check to ensure that the expected result is achieved.
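The steps above could be sketched as follows, again using built-in Spark SQL expressions. The column names midpoint, dist1 and dist2 and both function names are assumptions made for illustration. The mid-point is stored as a struct with fields x and y via named_struct, which is one way to hold a coordinate pair in a single column. A plain Python helper is included for spot-checking individual rows.

```python
import math

# Struct column holding the mid-point between (x1, y1) and (x2, y2).
MIDPOINT_EXPR = "named_struct('x', (x1 + x2) / 2, 'y', (y1 + y2) / 2)"

def with_midpoint_distances(df):
    """Add `midpoint`, `dist1` and `dist2` columns to a Spark DataFrame.

    `df` is expected to be a pyspark.sql.DataFrame with x1, y1, x2, y2.
    dist1 is the distance from (x1, y1) to the mid-point and dist2 the
    distance from (x2, y2) to the mid-point.
    """
    with_mid = df.selectExpr("*", f"{MIDPOINT_EXPR} AS midpoint")
    return with_mid.selectExpr(
        "*",
        "sqrt(pow(x1 - midpoint.x, 2) + pow(y1 - midpoint.y, 2)) AS dist1",
        "sqrt(pow(x2 - midpoint.x, 2) + pow(y2 - midpoint.y, 2)) AS dist2",
    )

def midpoint(x1, y1, x2, y2):
    """Plain Python reference for checking a single row by hand."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)

mx, my = midpoint(0.0, 0.0, 4.0, 6.0)
print(math.hypot(mx - 0.0, my - 0.0), math.hypot(4.0 - mx, 6.0 - my))
```

Geometrically, both distances should equal half of the full distance between the two points, which gives a natural validation check: compare dist1 with dist2 row by row.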


Part 6: Precision

  • How many rows of data do not match?

  • Why would they / wouldn't they match?
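One plausible line of investigation for these questions is floating-point rounding: two values that are mathematically equal can differ in their last bits when computed along different paths. The sketch below uses made-up value pairs (an assumption, not output from the generated data) to contrast exact equality with a tolerance-based comparison using math.isclose. The function name count_mismatches is hypothetical.

```python
import math

def count_mismatches(pairs, exact=True, rel_tol=1e-9):
    """Count pairs whose two values are considered different."""
    if exact:
        return sum(1 for a, b in pairs if a != b)
    return sum(1 for a, b in pairs if not math.isclose(a, b, rel_tol=rel_tol))

# Made-up pairs that are mathematically equal or nearly equal.
pairs = [
    (0.1 + 0.2, 0.3),      # classic binary rounding difference
    (0.5, 0.5),            # exactly representable, identical
    (1.0, 1.0 + 1e-12),    # tiny difference, far above machine epsilon
]

print("exact mismatches:    ", count_mismatches(pairs, exact=True))
print("tolerant mismatches: ", count_mismatches(pairs, exact=False))
```

If the row counts differ between the two comparison modes on the real data, rounding is a likely explanation; if they agree, the mismatch has another cause worth examining.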



Get help with machine learning coding, machine learning projects, machine learning homework, data science, and data visualization. Send your assignment request to the email address below, or chat with us through the website chatbot. We are available 24/7 for support and help.


Contact Us!!!


realcode4you@gmail.com