Talk:MapReduce

This article is rated C-class on Wikipedia's content assessment scale.
It is of interest to the following WikiProjects:

Google Low‑importance

This article is within the scope of WikiProject Google, a collaborative effort to improve the coverage of Google and related topics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

This article has been rated as Low-importance on the project's importance scale.

It is requested that a computing diagram or diagrams be included in this article to improve its quality. Specific illustrations, plots or diagrams can be requested at the Graphic Lab.
For more information, refer to discussion on this page and/or the listing at Wikipedia:Requested images.

Why no 'needs work' tag?

Hello; I'm a new contributor; sorry for this comment's simplicity... This article, on what I would have thought is an important topic, is pretty bad; the C grade it has received reflects this. It's also considered "low-importance"; I have not yet located the project's importance scale, so I don't understand this. MY QUESTION: why is there no "needs work" indicator at the top of the article itself, like I see in so many others? Readers should be warned. DrTLesterThomas (talk) 15:01, 26 January 2014 (UTC)

Comparison with fork-join

Hi there,

I think a comparison with fork-join would be helpful. These two concepts seem similar, and pointing out both the similarities and the differences would help the reader. There is an academic paper on the subject, actually. Thank you. 205.175.116.125 (talk) 21:03, 18 March 2014 (UTC)

More prior art

I first heard about map-reduce frameworks in a lecture at Imperial College in 1994 given by Qian Wu. [This paper] covers some of that work. I wish I could find the lecture notes because they actually contained a diagram which was map-reduce. I'm just putting this here for people looking for prior art against Google's patent. Richard W.M. Jones (talk) 12:14, 3 May 2014 (UTC)

Interesting pointer. However, note that map-reduce is not about map or reduce, IMHO. It's about optimizing the shuffle once, to get fail-safe recovery from machine loss for a whole class of programs. I don't know whether fail-safety and recovery have been handled in this prior art. (As for map and reduce, these have been common in functional programming anyway.) Nevertheless, your source would make a good addition to the article to emphasize this point: map-reduce is not about the map and reduce functions themselves, but about how to make this scale. --Chire (talk) 08:36, 5 May 2014 (UTC)

Request for cleanup of Talk Page

I am starting fresh on this topic on Wikipedia (I am a bit new to this page), though I have a good understanding of the MapReduce concept. As suggested at the top of the article, I wanted to improve the article to make it easier to understand. For this, I tried to look at the previous discussions and feedback, which seem pretty old (more than 3-4 years old now, in 2015). It is also difficult to tell which comments have already been addressed and which ones still need attention. For instance, the discussions regarding the examples (K1, K2) and the citation cleanup seem to have been addressed already. If some existing followers of this topic could shed some light on what is done and what is still pending, that would be helpful. — Preceding unsigned comment added by Vishal0soni (talk • contribs) 13:30, 2 January 2015 (UTC)

If you notice that some discussion point no longer applies, consider adding a {{done}} template or {{Resolved mark}} or {{Fixed}} to it, so that others can easily see that it does not require attention. (See Template:done for a list of such markers.) Later on, we can also add a Wikipedia:Archive, once the outdated and the still-current discussions have been flagged.

As for your YARN change, I reverted it, sorry. YARN is not a "programming model", and MapReducev2 is not a new model either. YARN is a Hadoop API change, but I don't think it is notable on its own for Wikipedia. It coincides with the Hadoop 2 milestone, and it cannot be used without Hadoop. It is a refactoring of the Hadoop codebase to allow sharing certain code between MapReduce and other jobs. MapReducev2 isn't fundamentally different - it's simply Hadoop's MR, now using YARN for resource management instead of having its own resource management. I'm not aware of any major breakthrough on MR enabled by YARN; from an MR point of view this is only maintenance. The appropriate article for YARN is Apache Hadoop, and it is already covered there. —138.246.2.241 (talk) 18:00, 2 January 2015 (UTC)

Thanks for the suggestion about using resolved templates. I'll have another look at the suggestions and try to update the article and talk page accordingly.

Regarding YARN, can you please provide references for your explanations? For the changes I made, I had already cited an appropriate source. As far as I understand, with MRv2 the entire architecture and functioning of MapReduce has changed. The Job Tracker and Task Tracker no longer exist; they are replaced by the resource manager, application manager, and a few other additional components. So the entire processing workflow has been redefined. Vishal0soni (talk) 02:15, 3 January 2015 (UTC)

Inclusion of YARN

Few more references:

"MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have, what we call, MapReduce 2.0 (MRv2) or YARN." as stated by Apache Software Foundation ^[1]
"Sometimes called MapReduce 2.0, YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component,..." ^[2]

This only says they changed their implementation of MapReduce. But the article is not about Hadoop MapReduce, but about the MapReduce concept. Can you find any reference about anything that has changed on the theoretical side (not Hadoop implementation details)? Class name changes etc. are not of interest; nor is the fact that they like calling their new implementation (of the same concept!) MRv2 now. The references you have are for nonencyclopedic implementation details of Hadoop, not for MapReduce as a processing model. As far as I can tell, YARN/MRv2 is still the same MapReduce, only implemented slightly differently internally. --94.216.222.254 (talk) 17:04, 5 January 2015 (UTC)

I totally agree that this is about the generic MapReduce concept and not about its Hadoop implementation. My main concern was that in the technology world we very commonly hear the term MapReduce together with its so-called later version, YARN (it is only an implementation-level change, but it has become a common talking point). So I would like to have a mention of YARN for someone reading about the MapReduce concept. That is why I suggested just mentioning the latest update to Apache's MapReduce implementation, right next to where Apache Hadoop is mentioned in the article. Vishal0soni (talk) 05:16, 15 January 2015 (UTC)

If it's about the implementation only, then it should be mentioned under MapReduce#Implementations_of_MapReduce. But it already says "Apache Hadoop", which includes both the old MapReduce and YARN, doesn't it? It's not as if YARN wasn't Hadoop! And I could not spot any reference specific to Apache Hadoop MapReduce v1, where it would make sense to mention the "v2 based on YARN". --Chire (talk) 10:41, 15 January 2015 (UTC)

References

^ "Apache Hadoop NextGen MapReduce (YARN)". hadoop.apache.org. Apache Software Foundation. Retrieved 2015-01-05.
^ "Apache Hadoop YARN (Yet Another Resource Negotiator)". searchdatamanagement.techtarget.com. TechTarget. Retrieved 2015-01-05.

"MapReduce is dead" reference supportive of statement?

^[1] Does this reference support "MapReduce as a big data processing model is considered dead by many domain experts, as development has moved on to more capable and less disk-oriented mechanisms that incorporate full map and reduce capabilities."? 199.64.7.56 (talk) 06:11, 24 September 2015 (UTC)

It is easy to find further references. E.g. Apache Mahout no longer accepts MapReduce contributions. "Exodus away from MapReduce" --Chire (talk) 09:29, 24 September 2015 (UTC)

Good find. Thank you. 199.64.7.58 (talk) 00:17, 25 September 2015 (UTC)

References

^ Sean Owen (Cloudera Director of Data Science). "Is Hadoop dead and is it time to move to Spark?". Quora. Retrieved 2015-06-18.

Removed "Theoretical Background" section

This section was unsourced and does not reflect how MapReduce works in the real world. In particular, the statement that a Reducer must be a monoid is not correct. The only reason for a Reducer to be a monoid would be if the process of reducing a set of pairs with the same key were split across processes; however, neither Google's MapReduce nor Hadoop does this, and neither requires that a Reducer be monoidal. When key-value pairs (output by a Mapper) are shuffled, all the pairs with the same key are sent to a single process, which then processes these pairs in a single call to the Reducer. There are therefore no requirements for the Reducer to be associative, commutative, or to have an identity element. The example of taking an average, mentioned as inappropriate for MapReduce, is in reality a perfectly ordinary operation for a MapReduce. — Preceding unsigned comment added by 98.239.129.142 (talk) 07:51, 25 May 2017 (UTC)
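For readers following this argument, here is a minimal single-process sketch in Python of the behaviour described above. It is only an illustration under stated assumptions: the names `map_reduce`, `mapper` and `mean_reducer` and the sample readings are hypothetical, and nothing here is taken from Google's MapReduce or Hadoop. It assumes, as the comment above states, that the shuffle delivers all values for a key to one reducer call.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy in-memory imitation of map -> shuffle -> reduce (not a real framework)."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: gather every value for a key in one place.
            groups[key].append(value)
    # Reduce phase: the reducer is called exactly once per key, with all its values.
    return {key: reducer(key, values) for key, values in groups.items()}

def mapper(record):
    # Hypothetical input record: (city, temperature).
    city, temperature = record
    yield city, temperature

def mean_reducer(key, values):
    # A plain average: not a monoid operation, but fine when all values arrive together.
    return sum(values) / len(values)

readings = [("Oslo", 2.0), ("Rome", 14.0), ("Oslo", 4.0), ("Rome", 16.0), ("Oslo", 6.0)]
averages = map_reduce(readings, mapper, mean_reducer)
print(averages)  # {'Oslo': 4.0, 'Rome': 15.0}
```

If the values for one key were instead split across several reducer calls, `mean_reducer` would no longer be safe to combine, which is where the associativity question would come in.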

So, someone with no credentials removed my "Theoretical Background" section and posted a pretty incorrect explanation.

I've just noticed this. Ignorance is not a proof, you know. I'll restore the section, and will add a link, for those who are interested.

Vlad Patryshev (talk) 20:13, 10 January 2023 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified 2 external links on MapReduce. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Added archive https://archive.is/20121214201610/http://ysmart.cse.ohio-state.edu/ to http://ysmart.cse.ohio-state.edu/
Added archive https://web.archive.org/web/20100114053209/http://graal.ens-lyon.fr/mapreduce/ to http://graal.ens-lyon.fr/mapreduce/

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 21:46, 30 December 2017 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified one external link on MapReduce. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Added archive https://web.archive.org/web/20140328121334/http://www.clusterpoint.com/ to http://www.clusterpoint.com/

Cheers.—InternetArchiveBot (Report bug) 03:01, 16 January 2018 (UTC)

New draft article Reduce (algorithmics)

FYI, a new user has submitted an AfC for Draft:Reduce (algorithmics); since there's some overlap, I thought I would mention it here. Rolf H Nelson (talk) 05:18, 4 May 2018 (UTC)

Why are there no spaces in the MapReduce programming model?

I would expect it to be "map reduce", "map-reduce", or "map/reduce", because without spaces it is difficult to distinguish the programming model from the Hadoop MapReduce program. — Preceding unsigned comment added by Sunapi386 (talk • contribs) 21:59, 8 January 2019 (UTC)

Why are there no spaces in the Hadoop MapReduce program? The original Google paper that introduced/popularized MapReduce did not use spaces, but used the title "MapReduce". Therefore, this is the most appropriate name. The Hadoop name is derived from this, not the other way round. HelpUsStopSpam (talk) 21:42, 10 January 2019 (UTC)

Reduce Called Multiple Times?

I was trying to understand the examples on this page, in light of the type signatures given for Map and Reduce. I was confused by the second example of averaging social network contacts, which said "The count info in the record is important if the processing is reduced more than one time". Is this supposed to refer to calling Reduce more than once for the same key, and to mean that it's essential to have C_new in the output (Y,(A,C_new)) as opposed to just outputting (Y,A)? If Reduce is called more than one time for the same key, then either Reduce needs to be able to fold its output with previous calls (which imposes more rigid type constraints and invariants than the ones given in this article for Reduce), or MapReduce doesn't result in a finalized average. Stackoverflow summary of the confusion here. Metaeducation (talk) 07:00, 9 October 2023 (UTC)
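To illustrate the question, here is a small Python sketch of a reducer whose output has the same (average, count) shape as its input, so that its results can be reduced again. This is only one possible reading of the article's example, not a statement of how any framework behaves; the function name `reduce_partial`, the key 30 and the sample numbers are hypothetical.

```python
def reduce_partial(key, values):
    """Fold (average, count) pairs for one key into a single (average, count)."""
    total = 0.0
    count = 0
    for average, n in values:
        total += average * n  # weight each partial average by its record count
        count += n
    return key, (total / count, count)

# Hypothetical mapper output for key Y = 30: each pair is (contacts, 1).
batch_1 = [(10.0, 1), (20.0, 1)]
batch_2 = [(60.0, 1)]

# First round: Reduce is called separately on two batches of the same key.
_, partial_1 = reduce_partial(30, batch_1)  # (15.0, 2)
_, partial_2 = reduce_partial(30, batch_2)  # (60.0, 1)

# Second round: reducing the partial outputs again needs the counts.
_, merged = reduce_partial(30, [partial_1, partial_2])
_, direct = reduce_partial(30, batch_1 + batch_2)

# Dropping the counts and averaging the averages gives a different answer.
naive = (partial_1[0] + partial_2[0]) / 2

print(merged, direct, naive)  # (30.0, 3) (30.0, 3) 37.5
```

With the count carried along, the two-round result matches the single-call result; with only (Y, A) in the output, a second round has no way to weight the partial averages.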

Article is misrepresenting map-reduce.

MapReduce is introduced as a clustered method, and the article largely summarizes Google's technology for map-reduce in a clustered environment from circa 2004. However, map-reduce had been an integral part of functional programming and an enabler of parallelism for decades before that. Concatenating the words Map and Reduce does not sufficiently identify this as a Google-specific technology when users are sent to this page while searching for general information on map-reduce and parallelism.

Pure functional programs exhibit deterministic parallelism, and their execution environment can enable parallelization internally without altering the external behavior of the program. Because FP languages cannot express the explicit scheduling of data tasks in parallel, the map-reduce pattern is the primary enabler of parallelism.
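As a rough illustration of this point, the following Python sketch applies a pure function with a sequential map and with a thread pool, then folds both with the same reduce. It is a sketch under assumptions: `square` and the sample data are hypothetical, and threads are used only to show that the schedule does not change the result (CPython threads do not by themselves give CPU parallelism).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add

def square(x):
    # Pure: the result depends only on the argument; no shared state is touched.
    return x * x

data = list(range(1, 11))

# Sequential map followed by a reduce.
sequential = reduce(add, map(square, data), 0)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map yields results in input order, whatever order the tasks finish in.
    parallel = reduce(add, pool.map(square, data), 0)

print(sequential, parallel)  # 385 385
```

Because `square` has no side effects, the runtime is free to choose how the map step is scheduled without changing what the program computes.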

The article doesn't really do anything to discuss the history and roots of map-reduce and does not discuss the general theory outside the scope of a specific application by Google and the adaptations made for a clustered environment.

To resolve the issue, the article should either:

- Present the more general topic of map-reduce and parallelism first, before diving into specific adaptations.
- Change the article to reflect in its title and introduction that it is specifically about Google's clustered technology only.

Søren Poulsen (talk) 07:38, 11 October 2024 (UTC)

Did you find the fourth paragraph of Lack of novelty to be insufficient? Do you think that paragraph on parallelism and functional programming should be broken out into its own subsection with its own title? Or do you think that paragraph is lacking in some way? Michaelmalak (talk) 18:30, 19 October 2024 (UTC)

The article *is* about the *MapReduce* distributed processing architecture as introduced by Google, not about functional programming, where "map" and "reduce" are just two of *many* functions. Others include zip, collect, etc.; there is no "mapreduce" in functional programming. Just because functional programming also has the common functions "map" and "reduce" does not mean that this is "MapReduce". The *key contribution* of MapReduce is in fact the distributed shuffle mechanism. — Preceding unsigned comment added by 94.31.98.105 (talk) 14:46, 20 October 2024 (UTC)
