<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Data-Intensive Text Processing with MapReduce</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="">
<link href="assets/css/bootstrap.css" rel="stylesheet">
<link href="assets/css/bootstrap-responsive.css" rel="stylesheet">
<link href="assets/css/docs.css" rel="stylesheet">
<link href="assets/js/google-code-prettify/prettify.css" rel="stylesheet">
<!-- Le HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body style="background-image: url(assets/img/grid-18px-masked.png)">
<div class="navbar navbar-fixed-top">
<div class="navbar-inner">
<div class="container">
<a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</a>
<div class="nav-collapse">
<ul class="nav">
<li class="">
<a href="index.html">Home</a>
</li>
<li class="active">
<a href="ed1.html">1st Edition</a>
</li>
<li class="">
<a href="ed1n.html">1.N Edition</a>
</li>
<li class="">
<a href="ed2.html">2nd Edition</a>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="container">
<!-- Masthead
================================================== -->
<header class="jumbotron subhead" id="overview" style="padding-top: 30px;">
<h1>Data-Intensive Text Processing<br/>with MapReduce</h1>
<p class="lead">by Jimmy Lin and Chris Dyer.<br/>
Morgan & Claypool Publishers, 2010.</p>
<p><a href="MapReduce-book-final.pdf" class="btn btn-primary">Download book now!</a></p>
<div class="subnav">
<ul class="nav nav-pills">
<li><a href="#info">Book Information</a></li>
<li><a href="#ref">Reference Implementations</a></li>
<li><a href="#adoption">Reviews and Adoption</a></li>
</ul>
</div>
</header>
<section id="info">
<div class="page-header">
<h1>Book Information</h1>
<p class="pull-right"><a href="ed1.html">Back to top</a></p>
</div>
<div class="row">
<div class="span4">
<h2>Table of Contents</h2>
<ol>
<li>Introduction</li>
<li>MapReduce Basics</li>
<li>MapReduce Algorithm Design</li>
<li>Inverted Indexing for Text Retrieval</li>
<li>Graph Algorithms</li>
<li>EM Algorithms for Text Processing</li>
<li>Closing Remarks</li>
</ol>
</div>
<div class="span8">
<h2>Abstract</h2>
<p>Our world is being revolutionized by data-driven methods: access to
large amounts of data has generated new insights and opened exciting
new opportunities in commerce, science, and computing
applications. Processing the enormous quantities of data necessary for
these advances requires large clusters, making distributed computing
paradigms more crucial than ever. MapReduce is a programming model for
expressing distributed computations on massive datasets and an
execution framework for large-scale data processing on clusters of
commodity servers. The programming model provides an
easy-to-understand abstraction for designing scalable algorithms,
while the execution framework transparently handles many system-level
details, ranging from scheduling to synchronization to fault
tolerance. This book focuses on MapReduce algorithm design, with an
emphasis on text processing algorithms common in natural language
processing, information retrieval, and machine learning. We introduce
the notion of MapReduce design patterns, which represent general
reusable solutions to commonly occurring problems across a variety of
problem domains. This book not only intends to help the reader "think
in MapReduce", but also discusses limitations of the programming
model.</p>
<p>Quite explicitly, this book focuses on MapReduce algorithm design, not <a href="http://hadoop.apache.org/">Hadoop</a> programming. Tom White's <a href="http://www.amazon.com/gp/product/0596521979?ie=UTF8&tag=dataintetextp-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596521979">Hadoop: The Definitive Guide</a><img src="http://www.assoc-amazon.com/e/ir?t=dataintetextp-20&l=as2&o=1&a=0596521979" width="1" height="1" alt="" style="border:none !important; margin:0px !important;" /> is a great resource for learning Hadoop.</p>
<h2>Publisher</h2>
<p>This book is part of the Morgan & Claypool <a
href="http://www.morganclaypool.com/toc/hlt/1/1">Synthesis Lectures on
Human Language Technologies</a>. If you're at a university, your
institution may already subscribe to the series, in which case you can
access the <a
href="http://dx.doi.org/10.2200/S00274ED1V01Y201006HLT007">electronic
version</a> directly without cost (see <a
href="http://www.morganclaypool.com/page/licensed">this page</a> for a
list of institutional subscribers). Otherwise, to purchase:</p>
<ul>
<li>Electronic and print copies from <a href="http://dx.doi.org/10.2200/S00274ED1V01Y201006HLT007">Morgan & Claypool</a> (publisher's site)</li>
<li>Print copies from <a href="http://www.amazon.com/gp/product/1608453421?ie=UTF8&tag=dataintetextp-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1608453421">Amazon.com</a><img src="http://www.assoc-amazon.com/e/ir?t=dataintetextp-20&l=as2&o=1&a=1608453421" width="1" height="1" alt="" style="border:none !important; margin:0px !important;" /></li>
</ul>
<p>We are pleased to provide
the <a href="MapReduce-book-final.pdf">final pre-production
manuscript</a> (April 11, 2010) as a preview. If you find this
resource helpful, please consider purchasing an actual copy to support
our work!</p>
</div>
</div>
</section>
<section id="ref">
<div class="page-header">
<h1>Reference Implementations</h1>
<p class="pull-right"><a href="ed1.html">Back to top</a></p>
</div>
<p><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/index.html">Cloud<sup><small>9</small></sup></a>
is a MapReduce library for Hadoop designed both to serve as a teaching
tool and to support research in data-intensive text processing. It
also serves as a repository of
<a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/patterns.html">many
examples</a> discussed in the book. Reference implementations of
design patterns and other algorithms discussed in the book are being
added gradually, so please come back periodically. Thus far, the
repository contains:</p>
<ul>
<li><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/order-inversion.html">Order inversion</a> from Chapter 3 for computing bigram
relative frequencies.</li>
<li><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/pairs-stripes.html">"Pairs"
and "stripes"</a> from Chapter 3 for computing the word
co-occurrence matrix of a large text collection.</li>
<li><a href="http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/content/pagerank.html">PageRank</a> from Chapter 5,
as well as additional design patterns for graph algorithms not discussed in the book.</li>
</ul>
</section>
<section id="adoption">
<div class="page-header">
<h1>Reviews and Adoption</h1>
<p class="pull-right"><a href="ed1.html">Back to top</a></p>
</div>
<h3>What others are saying...</h3>
<ul>
<li>Book cited in a special report on managing information in <a href="http://www.economist.com/specialreports/displaystory.cfm?story_id=15557413">The Economist</a> (February 25, 2010)</li>
<li>Design patterns mentioned by <a href="http://mir-in-action.blogspot.com/2010/04/mapreduce-algorithm-design.html">Mark Levy</a> at Last.fm (April 6, 2010)</li>
<li>Google Research <a href="http://googleresearch.blogspot.com/2010/05/recent-accomplishments-by-research.html">plugs the book</a> (May 19, 2010)</li>
<li>Mentioned in a blog post by <a href="http://www.ctctlabs.com/index.php/blog/detail/applying_data_mining_techniques_to_mapreduce/">Constant Contact Labs</a> (May 27, 2010)</li>
<li>Deepak Singh from Amazon <a href="http://mndoci.com/2010/07/02/recommendation-data-intensive-text-processing-with-mapreduce/">recommends the book</a> (July 2, 2010)</li>
</ul>
<h3>Courses using this book...</h3>
<ul>
<li><a href="http://www.csee.ogi.edu/~zak/cs506-pslc/">CS 506/606: Special Topics: Problem Solving with Large Clusters</a> by Izhak Shafran and Richard Sproat at Oregon Health & Science University (Spring 2010)</li>
<li><a href="http://net.pku.edu.cn/~course/cs402/2010/index.html">Peking University course</a> on cloud computing by Hongfei Yan and Bo Peng (Summer, 2010)</li>
<li>EEL 6935: Special Topics in Cloud Computing and Storage by Andy Li at the University of Florida (Fall, 2010 and Fall, 2011)</li>
<li>CSCE 689: Internet-Scale Data Management by James Caverlee at Texas A&M (Fall, 2010)</li>
<li>CSCE 670: Information Storage and Retrieval by James Caverlee at Texas A&M (Spring, 2011 and Spring, 2012)</li>
<li>CS 691-001: Cloud Computing by Susan Vrbsky at University of Alabama (Spring, 2011)</li>
<li><a href="http://courses.washington.edu/css534/syllabi/s11.html">CSS 534: Parallel Programming in Grid and Cloud</a> by Munehiro Fukuda at University of Washington (Spring, 2011)</li>
<li><a href="http://snap.stanford.edu/class/cs341-2011/">CS341: Advanced Topics in Data Mining</a> by Jure Leskovec, Anand Rajaraman, and Jeff Ullman at Stanford (Spring, 2011)</li>
<li>Summer School on Cloud Computing: Challenges and Opportunities by Pietro Michiardi (Summer, 2011)</li>
<li><a href="http://net.pku.edu.cn/~course/cs402/2011/index.html">Peking University course</a> on mass data processing/cloud computing by Hongfei Yan and Bo Peng (Summer, 2011)</li>
<li><a href="http://dicta-f11.utcompling.com/">CS395T / INF385T / LIN386M: Data-Intensive Computing for Text Analysis</a> by Jason Baldridge and Matt Lease at the University of Texas, Austin (Fall, 2011)</li>
<li>CS 6240: Parallel Data Processing in MapReduce by Mirek Riedewald at Northeastern University (<a href="http://www.ccs.neu.edu/home/mirek/classes/2011-F-CS6240/index.htm">Fall 2011</a>, <a href="http://www.ccs.neu.edu/home/mirek/classes/2013-S-CS6240/index.htm">Spring 2013</a>)</li>
<li><a href="http://www.cs.gmu.edu/syllabus/syllabi-fall11/CS795BarbaraD.html">CS 795 Mining Massive Datasets</a> by Daniel Barbara at George Mason University (Fall, 2011)</li>
<li><a href="http://www.cs.kent.edu/~jin/Cloud12Spring/Cloud.html">CS 4/5/6/79995: Advanced Computing Platforms for Data Processing</a> by Ruoming Jin at Kent State University (Spring 2012)</li>
<li><a href="http://beowulf.lcs.mit.edu/18.337/">18.337/6.338: Parallel Computing</a> by Alan Edelman at MIT (Fall, 2011)</li>
<li><a href="http://www.cse.buffalo.edu/~bina/cse487/fall2011/">CSE487/587 Data-Intensive Computing</a> by Bina Ramamurthy at SUNY Buffalo (Fall, 2011)</li>
<li><a href="http://www.csc.lsu.edu/~wuyj/Teaching/7481/fa12/">CSC7481/LIS 7610 - Information Retrieval Systems</a> by Yejun Wu at LSU (Fall, 2012)</li>
<li><a href="http://www.cs.brown.edu/courses/csci2950-u/f11/index.html">CSCI-2950u: Data-Intensive Scalable Computing</a> by Rodrigo Fonseca at Brown (Fall, 2011)</li>
<li><a href="http://www.cs.sunysb.edu/~rezaul/CSE590-S12.html">CSE 590 (#50569): Topics in Computer Science (Supercomputing)</a> by Rezaul A. Chowdhury at Stony Brook University (Spring 2012)</li>
<li><a href="http://www-scf.usc.edu/~csci572/">Course 572: Information Retrieval and Web Search Engines</a> by Ellis Horowitz at USC (Spring 2012)</li>
<li><a href="http://mlt.sv.cmu.edu/teaching/advancedML/syllabus.pdf">18-799 M / 96-842 A: Special Topics in Signal Processing: Advanced Machine Learning</a> by Joy Zhang and Ole Mengshoel (Spring 2012)</li>
<li><a href="http://courses.cse.tamu.edu/caverlee/csce470/">CSCE 470: Information Storage and Retrieval</a> by James Caverlee at Texas A&M (Fall, 2012)</li>
<li><a href="http://www.ccs.neu.edu/home/mirek/classes/2012-F-CS6240/index.htm">CS 6240: Parallel Data Processing in MapReduce</a> by Bryan Lackaye at Northeastern University (Fall, 2012)</li>
<li><a href="http://www.eecs.ucf.edu/~jwang/Teaching/EEL6938-s12/">EEL 6938 (FEEDS): Data Intensive Computing and Clouds</a> by Jun Wang at University of Central Florida (Fall 2012)</li>
<li><a href="http://vgc.poly.edu/~juliana/courses/cs9223/">CS9223: Massive Data Analysis</a> by Juliana Freire and Jerome Simeon at NYU Poly (Fall 2012)</li>
<li><a href="http://people.stern.nyu.edu/ja1517/pdsfall2012/">INFO-GB.3359.10: Practical Data Science</a> by Josh Attenberg and Foster Provost at NYU (Fall 2012)</li>
<li><a href="http://cs691vrbsky.cs.ua.edu/2012/Syll.htm">CS491/591-001: Cloud Computing</a> by Susan Vrbsky at the University of Alabama (Fall 2012)</li>
<li><a href="http://courses.cse.tamu.edu/caverlee/csce489/">CSCE 489: Introduction to Data Science</a> by James Caverlee at Texas A&M (Spring, 2013)</li>
</ul>
</section>
<footer class="footer">
<p class="pull-right"><a href="ed1.html">Back to top</a></p>
</footer>
</div> <!-- /container -->
<script type="text/javascript" src="assets/js/widgets.js"></script>
<script src="assets/js/jquery.js"></script>
<script src="assets/js/google-code-prettify/prettify.js"></script>
<script src="assets/js/bootstrap-transition.js"></script>
<script src="assets/js/bootstrap-alert.js"></script>
<script src="assets/js/bootstrap-modal.js"></script>
<script src="assets/js/bootstrap-dropdown.js"></script>
<script src="assets/js/bootstrap-scrollspy.js"></script>
<script src="assets/js/bootstrap-tab.js"></script>
<script src="assets/js/bootstrap-tooltip.js"></script>
<script src="assets/js/bootstrap-popover.js"></script>
<script src="assets/js/bootstrap-button.js"></script>
<script src="assets/js/bootstrap-collapse.js"></script>
<script src="assets/js/bootstrap-carousel.js"></script>
<script src="assets/js/bootstrap-typeahead.js"></script>
<script src="assets/js/application.js"></script>
</body>
</html>