Big Data MapReduce

I completed a big data project in which I applied efficient data extraction and processing techniques using MongoDB, pymongo, and MapReduce. By extracting only the necessary fields (artist names, release years, and sales) from a large dataset of song documents, I was able to reduce computation time and cost. Using MapReduce with mrjob, I processed this data in a distributed manner to aggregate it and answer specific questions, such as the total sales for each artist per year and which artists were the top sellers. Restricting the extraction to three fields kept the input to the MapReduce jobs small, which limited resource usage in every later stage.

Skills:

Big data handling
Data extraction
Python
MongoDB (NoSQL document database)
Data aggregation
MapReduce

Data Extraction for Song Triplets

Pseudocode:

Import the pymongo library

Establish a connection to the MongoDB database

Select the database and collection from which to retrieve data

Create a new text file for storing the output

Query the collection to retrieve the desired data fields

Iterate over the query results

Extract the values of "Artist", "Year", and "Sales" from each document

Replace any commas in the artist name with dashes

Create a triplet in the format <artist, year, sales>

Write the triplet to the text file

Close the text file and the MongoDB connection

The goal of this pseudocode is to extract three fields (Artist, Year, and Sales) from a MongoDB database containing song documents. The extracted data is then organized into triplets in the format <artist, year, sales>, which are written to a text file. This triplet format is used as input for the subsequent MapReduce programs. Replacing commas in artist names keeps each triplet unambiguous when it is later split into fields, so the data is ready for processing in the later stages of the project.
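The steps above could be sketched in Python roughly as follows. This is a minimal sketch rather than the project's actual script: the connection URI, database name, collection name, and output path are hypothetical parameters, and the triplet is assumed to be written as a plain comma-separated line (artist,year,sales), which is why commas inside artist names are replaced. The pymongo import sits inside export_triplets so the formatting helpers can be tried without a database.

```python
def format_triplet(doc):
    """Return 'artist,year,sales' for one song document, or None if a field is missing."""
    try:
        artist, year, sales = doc["Artist"], doc["Year"], doc["Sales"]
    except KeyError:
        return None  # assumption: documents lacking a field are skipped
    # Commas would break the comma-separated triplet, so swap them for dashes.
    artist = str(artist).replace(",", "-")
    return f"{artist},{year},{sales}"


def write_triplets(docs, out):
    """Write one triplet per line to the file-like object `out`; return the count written."""
    count = 0
    for doc in docs:
        line = format_triplet(doc)
        if line is not None:
            out.write(line + "\n")
            count += 1
    return count


def export_triplets(uri, db_name, coll_name, path):
    """Stream Artist/Year/Sales from MongoDB into a text file (all arguments are placeholders)."""
    from pymongo import MongoClient  # imported lazily; requires the pymongo package

    client = MongoClient(uri)
    try:
        # Projection: ask the server for the three needed fields only.
        projection = {"_id": 0, "Artist": 1, "Year": 1, "Sales": 1}
        cursor = client[db_name][coll_name].find({}, projection)
        with open(path, "w", encoding="utf-8") as out:
            return write_triplets(cursor, out)
    finally:
        client.close()
```

For example, export_triplets("mongodb://localhost:27017", "music", "songs", "task1_1_output.txt") would write the input file for the next step, assuming a local server and those placeholder database and collection names.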

MapReduce for Total Sales Per Artist Per Year

Pseudocode:

Import the MRJob class from the mrjob library for MapReduce jobs

Import the MRStep class for defining MapReduce steps

Define a class MRTotalSales that inherits from MRJob

Define a mapper function that takes a key and a line as input

Split the line into fields

Extract the artist, year, and sales from the fields

Try to convert the sales to a float, setting it to 0.0 if the conversion fails

Emit a key-value pair with the artist and year as the key and sales as the value

Define a reducer function that takes a key and a list of values as input

Sum up the sales for each artist in each year

Emit the total sales for each artist in each year as a string

If the script is run as the main program

Run the MRTotalSales job

In this part I used a MapReduce program to calculate the total sales for each artist in each year from the file task1_1_output.txt. The mapper function extracts the artist, year, and sales from each input line and emits a key-value pair with (artist, year) as the key and the sales as the value. The reducer function sums the sales values for each artist in each year.
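The mapper and reducer logic described above can be illustrated without a cluster. The sketch below is not the mrjob job itself: mapper and reducer are plain generator functions with the same shape as the methods of an MRJob subclass, and run_local is a hypothetical stand-in for the grouping that the framework performs between the two phases. Input lines are assumed to be comma-separated (artist,year,sales), and skipping malformed lines is an added assumption.

```python
def mapper(_, line):
    """Emit ((artist, year), sales) for one 'artist,year,sales' line."""
    fields = line.strip().split(",")
    if len(fields) != 3:
        return  # assumption: malformed lines are skipped
    artist, year, sales = (f.strip() for f in fields)
    try:
        sales = float(sales)
    except ValueError:
        sales = 0.0  # as in the pseudocode: unparseable sales count as zero
    yield (artist, year), sales


def reducer(key, values):
    """Sum the sales for one (artist, year) key and emit the total as a string."""
    yield key, str(sum(values))


def run_local(lines):
    """Hypothetical local driver: map every line, group by key, then reduce each group."""
    groups = {}
    for line in lines:
        for key, value in mapper(None, line):
            groups.setdefault(key, []).append(value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results
```

In the mrjob version, the same two functions would become the mapper and reducer methods of MRTotalSales (taking self as an extra first argument), and the framework would handle the grouping.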

MapReduce for Top-Selling Artist for Each Year

In this part I used MapReduce to identify the top-selling artist for each year from the file task1_2_output.txt. The mapper function extracts the artist, year, and sales from each line and emits key-value pairs with the year as the key and a tuple containing the artist and sales as the value. The first reducer function then finds the top-selling artist for each year, emitting key-value pairs with None as the key and a tuple containing the year, top-selling artist, and sales. Because every pair now shares the key None, all results reach a single call of the second reducer function, which sorts them by year in descending order and emits key-value pairs with the year as the key and a list containing the top-selling artist and sales.
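The two-step flow described above can be sketched as plain Python. This is an illustration under stated assumptions, not the project's mrjob code: the input is assumed to be comma-separated artist,year,sales lines, and shuffle and run_local are hypothetical helpers that imitate the grouping the framework does between steps. In mrjob, the steps would be declared by overriding steps() to return two MRStep objects, the first with a mapper and a reducer and the second with a reducer only.

```python
def mapper(_, line):
    """Emit year -> (artist, sales)."""
    artist, year, sales = (f.strip() for f in line.split(","))
    yield year, (artist, float(sales))


def reducer_top_artist(year, artist_sales):
    """Keep the best-selling artist of one year; key None funnels all years to one reducer."""
    artist, sales = max(artist_sales, key=lambda pair: pair[1])
    yield None, (year, artist, sales)


def reducer_sort_years(_, rows):
    """Receive every (year, artist, sales) row and emit them newest year first."""
    for year, artist, sales in sorted(rows, key=lambda r: int(r[0]), reverse=True):
        yield year, [artist, sales]


def shuffle(pairs):
    """Hypothetical stand-in for the framework's shuffle: group values by key."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups.items()


def run_local(lines):
    """Chain the mapper and both reducers the way a two-step job would."""
    mapped = [kv for line in lines for kv in mapper(None, line)]
    step1 = [kv for year, vals in shuffle(mapped) for kv in reducer_top_artist(year, vals)]
    return [kv for key, rows in shuffle(step1) for kv in reducer_sort_years(key, rows)]
```

Sending everything to a single None key is simple but means the final sort runs in one reducer; that is acceptable here because the first reducer has already shrunk the data to one row per year.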

MapReduce for Top-Selling Artists for Each Decade

In this task I used MapReduce to find the top-selling artists for each decade from the file task1_2_output.txt. The mapper extracts the artist, year, and sales from the input data, calculates the decade for each year, and emits key-value pairs with the decade range and artist as the key and the sales as the value. The first reducer sums the sales for each artist within each decade. The second reducer sorts the artists within each decade by total sales in descending order and emits the top three. The final reducer sorts the decades in descending order and emits the top artists for each decade.
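A rough sketch of the three reducers described above follows. It makes several assumptions that the write-up does not confirm: input lines are comma-separated artist,year,sales, a decade range is formatted like 1990-1999, and the first reducer re-keys its output by decade so that the second reducer sees all artists of one decade together. shuffle and run_local are hypothetical helpers imitating the framework's grouping between steps.

```python
def decade_range(year):
    """Map a year such as 1994 to the label '1990-1999' (label format is an assumption)."""
    start = (int(year) // 10) * 10
    return f"{start}-{start + 9}"


def mapper(_, line):
    """Emit (decade, artist) -> sales."""
    artist, year, sales = (f.strip() for f in line.split(","))
    yield (decade_range(year), artist), float(sales)


def reducer_sum(key, sales):
    """Total one artist's sales within one decade, re-keyed by decade."""
    decade, artist = key
    yield decade, (artist, sum(sales))


def reducer_top3(decade, artist_totals):
    """Keep the three artists with the highest totals in this decade."""
    top = sorted(artist_totals, key=lambda at: at[1], reverse=True)[:3]
    yield None, (decade, top)


def reducer_sort_decades(_, rows):
    """Emit decades newest first; string order works for four-digit years."""
    for decade, top in sorted(rows, key=lambda r: r[0], reverse=True):
        yield decade, top


def shuffle(pairs):
    """Hypothetical stand-in for the framework's shuffle: group values by key."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups.items()


def run_local(lines):
    """Chain the mapper and the three reducers the way a three-step job would."""
    data = [kv for line in lines for kv in mapper(None, line)]
    for reducer in (reducer_sum, reducer_top3, reducer_sort_decades):
        data = [kv for key, vals in shuffle(data) for kv in reducer(key, vals)]
    return data
```

Computing the decade in the mapper means the first reducer can total sales per (decade, artist) directly, so no year-level detail has to travel past the first step.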