
ASSIGNMENT 2 – APACHE SPARK
Introduction
In this assignment, you will apply MLlib/ML, the machine learning libraries
built on Apache Spark, to real-world datasets.
Before you start working on the assignment, you must have completed the in-class
exercise (based on http://spark.apache.org/docs/latest/quick-start.html) and read the
Machine Learning Library (MLlib) guide at http://spark.apache.org/docs/latest/mllib-guide.html.
Datasets
1. US fatal road accident data for automobiles, 1998 to 2010.
2. Consumer Complaints
Download the datasets from: \FACULTY COURSE RESOURCES\Big Data and Largescale
Computing\DataSetsforAssignment2M19. The datasets are largely self-explanatory;
the header row of each file lists the attributes.
Task 1 (50 points) – Write a Spark program for classification
Select any two classification learning algorithms available in Spark’s Machine
Learning Library.
Select a target attribute from each of the datasets provided and learn a
classification model to predict the target attribute.
Split each dataset into 70% training data and 30% test data.
Print the test error rate of each model on both datasets.
A useful Java example for decision tree learning can be found here:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/mllib/JavaDecisionTreeClassificationExample.java
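The split and the error rate asked for in Task 1 can be illustrated without Spark. The sketch below is plain Java, and every name in it (`SplitAndErrorSketch`, `split`, `testError`) is hypothetical and chosen only for illustration. It assumes class labels are encoded as doubles, and it cuts the data at exactly 70%. In a Spark program the split would more likely come from `randomSplit`, which samples rows at random and therefore yields only approximately 70/30 proportions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative names only; nothing here is part of the Spark API.
class SplitAndErrorSketch {

    // Shuffle row indices 0..n-1 with a fixed seed, then cut them
    // into two parts: index 0 = training rows, index 1 = test rows.
    static List<List<Integer>> split(int n, double trainFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        int cut = (int) Math.round(n * trainFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, cut)));
        parts.add(new ArrayList<>(indices.subList(cut, n)));
        return parts;
    }

    // Test error rate = misclassified test rows / total test rows.
    static double testError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException(
                "need two non-empty arrays of equal length");
        }
        int wrong = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i] != actual[i]) {
                wrong++;
            }
        }
        return (double) wrong / predicted.length;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(10, 0.7, 42L);
        System.out.println("training rows: " + parts.get(0).size());
        System.out.println("test rows: " + parts.get(1).size());

        // Made-up predictions and labels, used only to exercise testError.
        double[] predicted = {1.0, 0.0, 1.0, 1.0};
        double[] actual = {1.0, 1.0, 1.0, 0.0};
        System.out.println("test error: " + testError(predicted, actual));
    }
}
```

Fixing the seed makes the split reproducible, which helps when demonstrating the same error rates to the TA.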
Task 2 (50 points) – Write a Spark program to cluster data
Select K-means and Gaussian mixture clustering algorithms from Spark’s
Machine Learning Library.
Select appropriate attributes to cluster the data in each of the two datasets.
Apply the clustering algorithms to the transformed datasets.
For Gaussian mixture clustering, your program should output the parameters
of the mixture model; for K-means, it should output the “Within Set Sum of Squared Errors”.
A useful Java example for k-means can be found here:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/mllib/JavaKMeansExample.java
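To make the K-means output in Task 2 concrete, the sketch below computes the "Within Set Sum of Squared Errors" by hand: for each point, take the squared Euclidean distance to its nearest cluster center, then sum over all points. It is plain Java with hypothetical names (`WssseSketch`, `wssse`) and made-up points and centers. A Spark program would normally get this value from the trained model rather than recompute it, as the linked k-means example appears to do with `computeCost`.

```java
// Illustrative names only; nothing here is part of the Spark API.
class WssseSketch {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // WSSSE: sum over all points of the squared distance
    // from the point to its nearest cluster center.
    static double wssse(double[][] points, double[][] centers) {
        if (centers.length == 0) {
            throw new IllegalArgumentException("need at least one center");
        }
        double total = 0.0;
        for (double[] point : points) {
            double nearest = Double.POSITIVE_INFINITY;
            for (double[] center : centers) {
                nearest = Math.min(nearest, squaredDistance(point, center));
            }
            total += nearest;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up data: two tight groups and one center per group.
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] centers = {{0, 1}, {10, 11}};
        System.out.println("WSSSE = " + wssse(points, centers));
    }
}
```

A lower WSSSE means points lie closer to their assigned centers. The value also tends to fall as the number of clusters grows, so it is most useful for comparing runs with the same K.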
Submission requirements and grading
Upload the source code for your program in a zipped file to Canvas. Demonstrate
both tasks to the TA during the Lab or consultation hours.
Remember that all work must be your own.
