Big Data MapReduce
I completed a big data project in which I applied efficient data extraction and processing techniques using MongoDB, pymongo, and MapReduce. By extracting only the necessary fields (artist name, release year, and sales) from a large dataset of song documents, I kept computation time and cost down. Using MapReduce with mrjob, I processed this data in a distributed manner to compute the total sales for each artist per year and to identify the top-selling artists per year and per decade. This approach kept resource usage low while still producing accurate results.
Skills:
Big data handling
Data extraction
Python
MongoDB (NoSQL document database)
Data aggregation
MapReduce
Data Extraction for Song Triplets
Pseudocode:
Import the pymongo library
Establish a connection to the MongoDB database
Select the database and collection from which to retrieve data
Create a new text file for storing the output
Query the collection to retrieve the desired data fields
Iterate over the query results
    Extract the values of "Artist", "Year", and "Sales" from each document
    Replace any commas in the artist name with dashes
    Create a triplet in the format <artist, year, sales>
    Write the triplet to the text file
Close the text file and the MongoDB connection
The goal of this pseudocode is to extract three fields (Artist, Year, and Sales) from a MongoDB database containing song documents. The extracted values are organized into triplets in the format <artist, year, sales> and written to a text file, one triplet per line. This file is the input for the subsequent MapReduce programs. Commas in artist names are replaced with dashes so that each line can later be split on commas without an artist name being broken into extra fields.
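The extraction steps above could be sketched in Python roughly as follows. This is a minimal sketch, not the project's actual script: it assumes each triplet is written as a plain comma-separated line without angle brackets, and the function names, the connection URI, and the database and collection names are hypothetical. Only the field names "Artist", "Year", and "Sales" come from the description above.

```python
def format_triplet(doc):
    """Turn one song document into an 'artist,year,sales' line."""
    # Commas inside the artist name would break comma-splitting later,
    # so they are replaced with dashes, as in the pseudocode.
    artist = str(doc.get("Artist", "")).replace(",", "-")
    return f"{artist},{doc.get('Year', '')},{doc.get('Sales', '')}"


def write_triplets(docs, out_path):
    """Write one triplet per line and return the number of lines written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for doc in docs:
            out.write(format_triplet(doc) + "\n")
            count += 1
    return count


def export_from_mongo(uri, db_name, coll_name, out_path):
    """Stream only the three needed fields from MongoDB into out_path."""
    # Imported here so the two helpers above also work without pymongo.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        # The projection asks the server for just these fields, which is
        # what keeps the amount of transferred data small.
        projection = {"_id": 0, "Artist": 1, "Year": 1, "Sales": 1}
        cursor = client[db_name][coll_name].find({}, projection)
        return write_triplets(cursor, out_path)
    finally:
        client.close()
```

A call such as export_from_mongo("mongodb://localhost:27017", "music", "songs", "task1_1_output.txt") would then produce the input file for the MapReduce jobs; the URI, "music", and "songs" are placeholders for whatever the real deployment uses.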
MapReduce for Total Sales Per Artist Per Year
Pseudocode:
Import the MRJob class from the mrjob library for MapReduce jobs
Import the MRStep class from mrjob.step for defining MapReduce steps
Define a class MRTotalSales that inherits from MRJob
    Define a mapper function that takes a key and a line as input
        Split the line into fields
        Extract the artist, year, and sales from the fields
        Try to convert the sales to a float, setting it to 0.0 if the conversion fails
        Emit a key-value pair with the artist and year as the key and sales as the value
    Define a reducer function that takes a key and a list of values as input
        Sum up the sales for each artist in each year
        Emit the total sales for each artist in each year as a string
If the script is run as the main program
    Run the MRTotalSales job
In this part I used a MapReduce program to calculate the total sales for each artist in each year from the task1_1_output.txt file. The mapper extracts the artist, year, and sales from each input line and emits a key-value pair with (artist, year) as the key and the sales as the value. The reducer then sums the sales values for each (artist, year) key.
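To make the mapper and reducer logic concrete, here is a small pure-Python sketch that simulates the map, shuffle, and reduce phases locally. It is an illustration under assumptions, not the mrjob job itself: the function names are hypothetical, input lines are assumed to be comma-separated "artist,year,sales", and skipping malformed lines is an addition that the pseudocode does not mention. In mrjob, map_sales and reduce_sales would correspond to the mapper and reducer methods of the MRJob subclass, and the framework would do the grouping.

```python
from collections import defaultdict


def map_sales(line):
    """Mapper: 'artist,year,sales' -> ((artist, year), sales)."""
    fields = line.strip().split(",")
    if len(fields) != 3:
        return  # assumption: silently skip malformed lines
    artist, year, sales = (f.strip() for f in fields)
    try:
        value = float(sales)
    except ValueError:
        value = 0.0  # unparsable sales count as zero, as in the pseudocode
    yield (artist, year), value


def reduce_sales(key, values):
    """Reducer: sum all sales for one (artist, year) key."""
    yield key, sum(values)


def run_total_sales(lines):
    """Local stand-in for the shuffle: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_sales(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, total in reduce_sales(key, groups[key]):
            results[out_key] = total
    return results
```

Keying on the (artist, year) pair rather than on the artist alone is what makes the reducer see one group per artist per year.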
MapReduce for Top-Selling Artist for Each Year
In this part I used MapReduce to identify the top-selling artist for each year from the task1_2_output.txt file. The mapper extracts the artist, year, and sales from each line and emits key-value pairs with the year as the key and a tuple of (artist, sales) as the value. The first reducer finds the top-selling artist for each year and emits key-value pairs with None as the key and a tuple of (year, top-selling artist, sales) as the value. Because every pair shares the same None key, all yearly winners arrive at a single reducer call. That second reducer sorts the results by year in descending order and emits key-value pairs with the year as the key and a list containing the top-selling artist and sales as the value.
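The two-step flow can be sketched in plain Python as below. This is a hedged illustration: the function names are hypothetical, the input is assumed to be clean comma-separated "artist,year,total_sales" lines (the real layout of task1_2_output.txt is not shown here), and ties are resolved in favour of whichever artist is seen first. In mrjob the three functions would be wired together as two MRStep entries returned from steps().

```python
from collections import defaultdict


def map_by_year(line):
    """Mapper: 'artist,year,total_sales' -> (year, (artist, sales))."""
    artist, year, sales = (f.strip() for f in line.split(","))
    yield int(year), (artist, float(sales))


def reduce_top_artist(year, artist_sales):
    """Reducer 1: keep the best seller of the year, under a single None key."""
    artist, sales = max(artist_sales, key=lambda pair: pair[1])
    yield None, (year, artist, sales)


def reduce_sort_years(_, winners):
    """Reducer 2: receives every yearly winner at once, so it can sort them."""
    for year, artist, sales in sorted(winners, key=lambda w: w[0], reverse=True):
        yield year, [artist, sales]


def top_artist_per_year(lines):
    """Local stand-in for the two shuffles between the three functions."""
    by_year = defaultdict(list)
    for line in lines:
        for year, pair in map_by_year(line):
            by_year[year].append(pair)
    winners = []
    for year, pairs in by_year.items():
        winners.extend(value for _, value in reduce_top_artist(year, pairs))
    return list(reduce_sort_years(None, winners))
```

Funnelling everything through one key is simple but means the final sort runs on a single reducer, which is acceptable here because there is only one record per year.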
MapReduce for Top-Selling Artists for Each Decade
In this task I used MapReduce to find the top-selling artists for each decade from the task1_2_output.txt file. The mapper extracts the artist, year, and sales from each input line, calculates the decade the year falls in, and emits key-value pairs with the decade range and artist as the key and the sales as the value. The first reducer sums the sales for each artist within each decade. The second reducer sorts the artists within each decade by total sales in descending order and emits the top 3. The final reducer sorts the decades in descending order and emits the top artists for each decade.
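The decade calculation and the top-3 selection might look like the following sketch. The names decade_label, map_by_decade, and top_artists_per_decade are hypothetical, the "1980-1989" label format is an assumption about what "decade range" means, and the input is again assumed to be comma-separated "artist,year,sales" lines with four-digit years. The three reducer stages are collapsed into one local function for brevity.

```python
from collections import defaultdict


def decade_label(year):
    """Map a year to its decade range, e.g. 1987 -> '1980-1989'."""
    start = (int(year) // 10) * 10
    return f"{start}-{start + 9}"


def map_by_decade(line):
    """Mapper: 'artist,year,sales' -> ((decade, artist), sales)."""
    artist, year, sales = (f.strip() for f in line.split(","))
    yield (decade_label(year), artist), float(sales)


def top_artists_per_decade(lines, n=3):
    # Stage 1 (mapper + first reducer): total sales per (decade, artist).
    totals = defaultdict(float)
    for line in lines:
        for key, sales in map_by_decade(line):
            totals[key] += sales
    # Stage 2 (second reducer): regroup by decade and keep the n best artists.
    per_decade = defaultdict(list)
    for (decade, artist), total in totals.items():
        per_decade[decade].append((artist, total))
    ranked = {
        decade: sorted(artists, key=lambda a: a[1], reverse=True)[:n]
        for decade, artists in per_decade.items()
    }
    # Stage 3 (final reducer): newest decade first. Sorting the labels as
    # strings is only correct while all years have four digits.
    return [(decade, ranked[decade]) for decade in sorted(ranked, reverse=True)]
```

Integer division by 10 followed by multiplication by 10 drops the last digit of the year, so every year from 1980 through 1989 lands in the same key.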