If you are looking to hire an expert who can complete your big data Hive project at an affordable price, then Realcode4you.com is the right choice.
Our experts have delivered 500+ successful Hive projects without any issues. You can also get help here with other big data assignments and projects. Below are some of the big data topics in which you can get help.
Big Data Topics in Which You Can Get Help
Big Data Revolution
Hadoop Architecture and Ecosystem
Setting up Hadoop
Hadoop Distributed File System (HDFS) Architecture
Hadoop Distributed File System (HDFS) Programming Basics
Hadoop Distributed File System (HDFS) Programming Advanced
YARN and MapReduce Architecture
MapReduce Programming Basics
MapReduce Programming Intermediate
MapReduce Programming Advanced
Data Analysis using Hive
Data Analysis using Pig
Hadoop NoSQL Database HBase
Spark
Miscellaneous Hadoop Topics
What You Get
Understand the trends that are fueling the modern Big Data Revolution.
Gain a solid understanding of the Apache Hadoop architecture, including HDFS and MapReduce.
Apply the HDFS programming model and author HDFS programs using the Apache Hadoop HDFS API to import data into and export data out of Hadoop.
Apply the Distributed Storage and Distributed Programming model for distributed processing.
Best practices for Hadoop development, debugging, and implementation.
How to leverage Hive and Pig for big data processing, and a look at related Hadoop projects.
Required Software
Oracle VM VirtualBox
You will need to install Oracle VM VirtualBox. This is open source software. You can get information from us on how to install and configure it.
CentOS Linux
You will need to install CentOS Linux as your own VM on VirtualBox. This is open source software. You can get information from us on how to install and configure it.
Apache Hadoop
You will need to install Apache Hadoop inside your CentOS Linux VM. Hadoop is open source software. You can also get help with installing and configuring it when you start.
Apache Hive
You will need to install Apache Hive inside your CentOS Linux VM. Hive is open source software. You can also get help with installing and configuring it when you start.
Apache Pig
You will need to install Apache Pig inside your CentOS Linux VM. Pig is open source software. You can also get help with installing and configuring it when you start.
Apache HBase
You will need to install Apache HBase inside your CentOS Linux VM. HBase is open source software. You can also get help with installing and configuring it when you start.
Apache Spark
You will need to install Apache Spark inside your CentOS Linux VM. Spark is open source software. You can also get help with installing and configuring it when you start.
IDE (NetBeans or Eclipse)
You will need to install NetBeans or Eclipse inside your CentOS VM. Both are standard open source IDEs.
Sample Paper 1
Problem Statement
With online sales gaining popularity, tech companies are exploring ways to improve their sales by analysing customer behaviour and gaining insights about product trends. Furthermore, the websites make it easier for customers to find the products they require without much scavenging. Needless to say, the role of big data analysts is among the most sought-after job profiles of this decade. Therefore, as part of this assignment, we will be challenging you, as a big data analyst, to extract data and gather insights from a real-life data set of an e-commerce company.
For this assignment, you will be working with a public clickstream dataset from a cosmetics store. Using this dataset, your job is to extract the kind of valuable insights that data engineers typically produce at an e-retail company.
You will find the data at the links given below.
https://e-commerce-events-ml.s3.amazonaws.com/2019-Oct.csv
https://e-commerce-events-ml.s3.amazonaws.com/2019-Nov.csv
You can find the description of the attributes in the dataset given below. The implementation phase can be divided into the following parts:
Copying the dataset into HDFS:
Launch an EMR cluster that utilizes the Hive services, and
Move the data from the S3 bucket into the HDFS (see the sketch after this list)
Creating the database and launching Hive queries on your EMR cluster:
Create the structure of your database,
Use optimized techniques to run your queries as efficiently as possible,
Show the improvement in performance after using optimization on any single query, and
Run Hive queries to answer the questions given below.
Cleaning up
Drop your database, and
Terminate your cluster
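As a hedged illustration of the copy step (a sketch, not the only approach; you could equally use hadoop distcp or s3-dist-cp from the shell), the queries below point a Hive external table at the S3 data and copy it into an HDFS-backed ORC table. The column list assumes this dataset's usual clickstream schema, and the bucket path and table names are assumptions inferred from the download links above; verify them against the actual attribute description.

CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  event_type STRING,
  product_id BIGINT,
  category_id BIGINT,
  category_code STRING,
  brand STRING,
  price DOUBLE,
  user_id BIGINT,
  user_session STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://e-commerce-events-ml/'          -- assumed bucket root holding both CSVs
TBLPROPERTIES ('skip.header.line.count' = '1'); -- skip the header row in each file

-- Copy into an HDFS-backed, ORC-format table so later queries read from HDFS
CREATE TABLE events STORED AS ORC AS SELECT * FROM raw_events;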
You are required to provide answers to the questions given below.
1. Find the total revenue generated due to purchases made in October.
2. Write a query to yield the total sum of purchases per month in a single output.
3. Write a query to find the change in revenue generated due to purchases from October to November.
4. Find the distinct categories of products. Categories with a null category code can be ignored.
5. Find the total number of products available under each category.
6. Which brand had the maximum sales in October and November combined?
7. Which brands increased their sales from October to November?
8. Your company wants to reward the top 10 users of its website with a Golden Customer plan. Write a query to generate a list of the top 10 users who spend the most.
Note:
To write your queries, please make the necessary optimizations, such as selecting an appropriate table format and using partitioned/bucketed tables. You will be awarded marks for enhancing the performance of your queries. (A hedged example of one such optimization is sketched after the note below.)
Each question should have one query only. Use a 2-node EMR cluster with both the master and core nodes as m4.large. Make sure you terminate the cluster when you are done working with it. Since an EMR cluster can only be terminated and cannot be stopped, always keep a copy of your queries in a text editor so that you can paste them in every time you launch a new cluster.
Do not leave PuTTY idle for too long; keep the session alive with some activity, such as pressing the space bar at regular intervals. If the terminal does become inactive, you do not have to start a new cluster. You can reconnect to the master node by opening the PuTTY terminal again, giving the host address, and loading the .ppk key file.
For your information, if you are using an emr-6.x release, certain queries might take a longer time; we suggest you use the emr-5.29.0 release for this case study.
Important Note: For this project, use only the m4 EMR instance types. In AWS Academy, using any instances other than m4 (i.e., m4.large, m4.xlarge, etc.) might lead to the deactivation of your account.
There are different options for storing the data in an EMR cluster. You can briefly explore them in this link. In your previous module on Hive querying, you copied the data to the local file system, i.e., to the master node's file system, and performed the queries there. Since the size of the dataset in this case study is large, it is good practice to load the data into HDFS and not into the local file system.
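To make the optimization requirement concrete, here is a hedged sketch of partitioning the data by month so that question 1 scans only the October partition. The table and column names are assumptions carried over from the copy sketch above, not the required solution; to show the improvement, compare this query's runtime against the same query on the unpartitioned table.

-- A partitioned, ORC-backed table: one partition per month
CREATE TABLE events_by_month (
  event_time STRING,
  event_type STRING,
  product_id BIGINT,
  category_code STRING,
  brand STRING,
  price DOUBLE,
  user_id BIGINT
)
PARTITIONED BY (event_month STRING)
STORED AS ORC;

SET hive.exec.dynamic.partition.mode = nonstrict;

-- Load from the events table sketched earlier, deriving the partition key
-- from the leading 'YYYY-MM' of the timestamp
INSERT OVERWRITE TABLE events_by_month PARTITION (event_month)
SELECT event_time, event_type, product_id, category_code, brand, price, user_id,
       substr(event_time, 1, 7) AS event_month
FROM events;

-- Question 1: total revenue from October purchases; the partition filter
-- means only the October data is read
SELECT SUM(price) AS october_revenue
FROM events_by_month
WHERE event_month = '2019-10'
  AND event_type = 'purchase';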
Sample Paper 2
Introduction:
In this assignment, you will be working with the cars.csv dataset, which you can download from https://www.kaggle.com/mirosval/personal-cars-classifieds, or use the wget command provided, which can load the dataset directly into HDFS.
This dataset has the classified records for several Eastern European countries over several years. Beware that the data is not “clean”, and investigating and cleaning the data is an important part of the assignment.
Problem Background
You are the data analyst at a large investment firm that is contemplating investing in a used car business. Your task is to provide data-driven advice to the stakeholders that will enable them to make a sound investment decision. Failure to make the best decision may result in large financial consequences and irreversible damage to the company's reputation and brand.
Your manager has instructed you to use the cars.csv dataset, because the veracity of this data has been established.
Cleaning Tasks
The cars dataset is very unclean. Before you can do any analysis, you must clean the dataset. Use the following steps to clean the dataset for analysis (a hedged sketch of some of these steps appears after this list).
1. Write a Hive query to create a table called used_cars_yourname from the data. Use a schema that is appropriate for the column headings. “yourname” is your first name.
2. Write Hive queries to see how many missing values you have in each attribute.
Remove any records that do not have a price
Remove any records that do not have a model listed
3. Group by the price column and count the number of unique prices. Do you notice a single price that repeats across the ads? Remove records that have these prices.
4. Find all the records where the model does not have a maker value. Based on the model, fill in the maker value to complete the record. For example, if the model is listed as Civic but does not have a maker value, put “Honda” as the maker.
5. Find the average price for cars of different models and makers from an external source (cite this source) or make a best estimate of these values. Then write queries that will remove any records where prices are multiple factors above this price. The factor can be chosen by you based on your best judgement. For example, let’s say that the Honda Civic price for 2015 with 100,000 km is $5,000. Then remove any records for Honda Civic where the price is more than 3 times this, i.e., price > $15,000. Note that prices in the dataset are in Euros.
6. Of the remaining records, remove any other records which you feel are abnormal or cannot be trusted. This is an open-ended question, so use your own judgement and creativity.
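A minimal sketch of steps 1-3, assuming illustrative column names (maker, model, mileage, manufacture_year, price_eur); adapt the schema to the real column headings in cars.csv before using it.

-- Step 1: a table over the raw CSV (schema abbreviated; extend it to match
-- the actual column headings)
CREATE TABLE used_cars_yourname (
  maker STRING,
  model STRING,
  mileage BIGINT,
  manufacture_year INT,
  price_eur DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Step 2: count missing values per attribute...
SELECT SUM(CASE WHEN price_eur IS NULL THEN 1 ELSE 0 END) AS missing_price,
       SUM(CASE WHEN model IS NULL OR model = '' THEN 1 ELSE 0 END) AS missing_model
FROM used_cars_yourname;

-- ...then keep only records that have both a price and a model
INSERT OVERWRITE TABLE used_cars_yourname
SELECT * FROM used_cars_yourname
WHERE price_eur IS NOT NULL AND model IS NOT NULL AND model != '';

-- Step 3: look for a placeholder price repeated across many ads
SELECT price_eur, COUNT(*) AS num_ads
FROM used_cars_yourname
GROUP BY price_eur
ORDER BY num_ads DESC
LIMIT 10;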
Analysis Tasks
Assume that your car dealership needs to sell 10 cars per day on average and is open 6 days a week. Based on this requirement, use the clean dataset to recommend 10-15 cars. Note that the recommended cars should at least be able to meet the sales quota, but apart from this, they must also meet any other criteria that you think are relevant. This, again, is an open-ended question – ask yourself: which cars will sell best? Try to justify your answer with the dataset at hand and/or use external resources (consumer reports, etc.). A hedged starting-point query is sketched below.
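At 10 cars per day over a 6-day week, the dealership needs roughly 60 sales per week, so listing volume is a reasonable first proxy for how quickly a model turns over. One possible starting point, reusing the illustrative schema from the cleaning sketch (column names are assumptions):

-- Shortlist high-volume models along with their typical price
SELECT maker, model,
       COUNT(*) AS num_ads,
       ROUND(AVG(price_eur), 0) AS avg_price_eur
FROM used_cars_yourname
GROUP BY maker, model
HAVING COUNT(*) >= 100   -- arbitrary supply threshold; tune to the data
ORDER BY num_ads DESC
LIMIT 15;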
When writing your report, remember that your audience is not a technical audience, and as such, the report should focus on analysis rather than technical details.
Any technical details should be provided in an appendix (for example, loading of data into Hive, cleaning, analysis, etc.). You can use external tools for visualization of any queries (for example, Excel, Power BI, Tableau), but the actual analysis must be done in Hive and clearly demonstrated in your appendices (you should have screenshots of your queries).
For more details, you can contact us at the email id below:
realcode4you@gmail.com