Search posterous

Search all posts and users. Type a name, type a favorite song title, whatever! See what comes up.
  

More posterous blogs











More recommended blogs »

Here are posterous posts filed under hadoop...

hdknr says...

■ Elastic Map Reduceでお手軽 Wikipediaマイニング

数百行程度のpythonスクリプトで大規模データ処理を実現する例を紹介。90万記事程度のwikipediaの日本語を題材に
AmazonEC2/S3とElastic Map Reduceを活用。

結構実践的な話が多く、実際に試した人ならではの話(S3は頻繁な入出力が想定されていないインフラのようなので頻繁に読み書きがされる用途ではパフォーマンス面ではオススメできないとのこと)とか、実際の分析結果(年、国名とかが上位にくるがこれを除外するようにフィルタリングすると、明治、神部天皇とかがPagerankとして高い...)という話は面白かったなぁ。

AmazonEC2/S3、Elastic Map Reduce良い/悪い部分もまとめてくださいました。
-良い点:かんたん。小規模なジョブならMasterの値段分安い。お手軽。1時間1台0.1ドルなので、1時間100台1000円
-悪い所:多数のジョブを走らせる事を考えるともったいない。ログが見にくい(開発時には生産性低下につながる?)。独自のディスクイメージが利用出来ない。

Q.amazonのサービスは時間によってパフォーマンスの差があると聞いた事があるが、実際にそのような事があったのか?
A.ある程度台数をつかっているからかもしれないが、時間によって差があるというのはかんじられなかった。(せいぜい2,3分)
※アメリカの昼時間帯に実際にジョブを実行していたとのこと。

Filed under: Hadoop

hdknr says...

Hadoop provides two filesystems that use S3.

S3 Native FileSystem (URI scheme: s3n)
A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).
S3 Block FileSystem (URI scheme: s3)
A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

AWS Hadoop Filesystem(HDFS)
- ネイティブ・ファイルシステム( s3n:// )
- ブロック・ファイルシステム ( s3:// )

Filed under: Hadoop

sigizmund says...

Recently I wrote a Hadoop article in Russian for one of very popular Russian IT blogs. After giving this idea a second thought, I translated this article (or, rather, first part of this article as the second is still in progress) to English and uploaded it to my website (Posterous format isn't very good for such long articles).

Check it here: http://romankirillov.info/hadoop.html

(and don't be mad for my clumsy English!)

 

P. S. in case you can read Russian: http://sigizmund.habrahabr.ru/blog/74792/

Filed under: hadoop

Jared says...

We are flying back from Boston after an excellent week at the Architecture Technology Review. This was my first interaction with the Nokia architecture community at large, and I was really pleased (and I have to admit, somewhat surprised), to see how awesome many of the developments coming down the pipe are. Ville gave a talk on what we have been doing with Disco, and we also gave a demo during one of the 'speed geeking' sessions. One of the most common questions we were asked was, "why not Hadoop?", so I thought I'd give my opinion on the subject.

Prior to coming to the NRC, I was using Hadoop for about a year and a half (doing bioinformatics), and I must say that it served me quite well. To be sure, there were problems along the way, but Hadoop enabled me to do analyses that I would not otherwise have done, not because they would be impossible without Hadoop, but because mapreduce makes it so easy to parallelize a huge class of problems, that the overhead of doing things with big data becomes amazingly small.

Even when using Hadoop, I always used Python (with Hadoop Streaming) to write map/reduce functions, because Python is such a pleasure to write, and because I am much more productive writing Python than Java (or pretty much any other language). Because of my love for Python, I often wondered why noone had yet written a Python implementation of mapreduce, and even considered writing my own. I think it is natural for anyone who thinks about the design of systems, to question the validity of architecture decisions and to wonder how those designs might be improved. Of course, actually implementing a new design is a whole other story, and finding the impetus to do so, especially when a reasonably good implementation (with lots of high-profile developers) already exists, is not always easy.

When I discovered the Disco project, which is part Erlang, part Python, I was deeply intrigued. I questioned the choice of Erlang (not knowing much about it), but Ville's argument was extremely pragmatic: Erlang is really good at distributed stuff (that's what it was built to do), and Python is awesome for high-level programming (i.e. its fun, easy to read/write, expressive, etc.). But I guess the question remains, why not Hadoop? The reason answering this question is hard, is because largely it is a matter of taste. The bottom line is that neither Hadoop nor Disco is really a mature project (Hadoop IS more highly developed than Disco though), while it seems to me the choice of framework is a long-term question. For me, wanting to use Python to improve the framework itself is a no-brainer (additionally, Jython is currently too far behind CPython for me to consider it a replacement).

Why Disco? Because of it's philosophy: massive data - "minimal code". Lightweight is a design goal in Disco, and we really, truly, care about programmer overhead. Framework development should be as agile as possible, if we are trying to optimize programmer productivity. My vision of Disco is a framework that can be shaped to the needs of its users (including myself), by its users. For me, the reality of Hadoop was quite different.

Filed under: hadoop

sigizmund says...

Hey out there! Still not tired of my Hadoop experiments? Not yet? That’s another one for you!

What’d you think the difference is between two snippets of code? Say, this:

SomeCodeWhichChangesConfig.initialise(getConf()); 
Job job = new Job(conf, "MyHadoopJob");

// ... setting the job details

if (!job.waitForCompletion(true))
{
System.err.println("FAILED, cannot continue");
}

… and this:

Job job = new Job(conf, "MyHadoopJob"); 

// ... setting the job details

SomeCodeWhichChangesConfig.initialise(getConf());
if (!job.waitForCompletion(true))
{
System.err.println("FAILED, cannot continue");
}

No difference, you say? Not quite right, sir: the difference is that whatever you do to conf after creating a job will have no further effect. That is, Job constructor apparently copies all the data and doesn’t link your copy of Configuration object with it’s copy. Brilliant, no?

(and I spent a couple of hours trying to understand why distributed cache works properly in one app and doesn’t work at all in another). So you know now. Be warned.

Filed under: hadoop

sigizmund says...

Today I finally hit the task I was scared for so long — processing large XML files on Hadoop. I won’t tell you for how long I crawled the Internet trying to find some working solution… not that anyone wants to know? Eventually, I came out with the solution of my own — even though I hate re-inventing the wheel, in this particular case all the wheels I found were either square or were utterly incompatible with my model of car.

To make things more simple, I won’t include the full source code. I won’t even include the whole InputFormat class. So, to make yourself comfortable, please do following:

  1. Open LineRecordReader from org.apache.hadoop.mapreduce.lib.input so you can see it
  2. Open TextInputFormat from the same package.
  3. Create the input format and record reader of your own, just by copying and pasting the code from aforementioned classes.
  4. Change the constructor of your input format class so it’ll return your newly-defined record reader.

Now, we’re almost there. Now I’ll include the piece of code for nextKeyValue() which turned out to be the most critical method here. Hold on tight:

public boolean nextKeyValue() throws IOException
{
StringBuilder sb = new StringBuilder();
if (key == null)
{
key = new LongWritable();
}
key.set(pos);
if (value == null)
{
value = new Text();
}
int newSize = 0;

boolean xmlRecordStarted = false;
Text tmpLine = new Text();

while (pos < end)
{
newSize = in.readLine(tmpLine,
maxLineLength,
Math.max((int)
Math.min(Integer.MAX_VALUE,
end - pos),
maxLineLength));

if (newSize == 0)
{
break;
}

if (tmpLine.toString().contains("<document "))
{
xmlRecordStarted = true;
}

if (xmlRecordStarted)
{
sb.append(tmpLine.toString().replaceAll("\n", " "));
}

if (tmpLine.toString().contains("</document>"))
{
xmlRecordStarted = false;
this.value.set(sb.toString());
break;
}

pos += newSize;

}

if (newSize == 0)
{
key = null;
value = null;
return false;
}
else
{
return true;
}
}

WTF — you will say? It’s the same code? Well — yes, and no. It’s almost the same. Take a look at this line:

if (tmpLine.toString().contains("<document")) 

and this line:

if (tmpLine.toString().contains("</document>")) 

This is where we actually split the document into chunks. Code is pretty-much self-explaining so I won’t add anything else.

Now, it’s not the most clean and streamlined solution and I probably will spend a while tomorrow making it more production-ready and good-looking, but compared to other solutions, it has few major benefits:

  1. It uses very little custom code (you remember, we copied and pasted all the classes?). Unfortunately you cannot just inherit the class — some fields are private, and we clearly want to modify them.
  2. It’s configurable — you can easily change the <document and </document> strings to anything else (and again, I will do it tomorrow, but now I feel too lazy).
  3. It works.

There’re few limitations of this approach. One of them is that if the document contains something like </document><document> it obviously won’t work. Another is — you still need to parse elements in your mapper (although you can easily change it by parsing records in your record reader into Writable-compatible class).

Have fun!

Update: As you can see, I have added a space in "<document " string constant – today I realised that "<documenttype" elements has been successfully used for splits, hence producing inconsistent results.

Filed under: hadoop

BioInfo says...

Genome Analysis in the Clouds | GenomeWeb http://ow.ly/xCck hadoop

Filed under: hadoop

siculars says...

A Brief Overview

For the last year or so I've been hearing about "Hadoop" and all the amazing things it can do for you. With the passing of the seasons, the drumbeat has steadily been getting stronger and louder. This past Friday, October 2nd, 2009, was Hadoopworld NYC, which I attended with the wonderment fall semester provides a bright-eyed college freshman. The first thing that caught my attention upon entering the conference space was the formidable array of companies throwing their heft behind the "Hadoop" platform. I put Hadoop in quotes because Hadoop is less a single product but rather an ecosystem of products all working from a central thesis. That thesis is that the way data storage and analysis have been done in the past should officially be regarded as legacy. Traditional RDBMS systems will yield to the distributed, map/reduce paradigm of Hadoop at a certain threshold. That threshold will continue to shift as Hadoop becomes more user friendly, abstract and accessible. Hadoop is the New Way ™.

When I say "things" I mean Data Analysis and all the lovelies that flow from it. I capitalize Data Analysis because as @igrigorik so succinctly puts it Hadoop is powerful. Formidable enough to power some of the biggest data sets anywhere on the web. From IBM to Facebook, Yahoo to Amazon, Chase to China Mobile - these are just some of the marquee names heavily invested in the Hadoop ecosystem. What do they know that the rest of us should? Let's rewind the clock. Hadoop has its roots in a search engine called Nutch a web search system co-founded by Doug Cutting. At the time Cutting, et al., were working on a problem that required massive resources in terms of storage and computational power to maintain a searchable index of the web, algorithm not withstanding. It just so happens, like with most things big and interweby, Google cut its teeth into this research sandbox. Two seminal academic papers released by Google researchers, The Google File System in 2003 and MapReduce: Simplified Data Processing on Large Clusters in 2004, led the way in answering two very important questions Nutch had been trying to answer - how to store massive amounts of data and how to process it. To their credit Cutting, et al., adopted the lessons learned at Google rather than reinvent the wheel. I highly recommend anyone interested in this space familiarize themselves with the concepts in those two papers as a start.

Briefly, "The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing"  and goes on to enumerate the list of projects comprising the Hadoop ecosystem as noted on the main Hadoop project page.

  • Hadoop Common: The common utilities that support the other Hadoop subprojects.
  • Avro: A data serialization system that provides dynamic integration with scripting languages.
  • Chukwa: A data collection system for managing large distributed systems.
  • HBase: A scalable, distributed database that supports structured data storage for large tables.
  • HDFS: A distributed file system that provides high throughput access to application data.
  • Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • MapReduce: A software framework for distributed processing of large data sets on compute clusters.
  • Pig: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper: A high-performance coordination service for distributed applications.

If all that leaves you at a loss let's look at what Hadoop does for users in the real world. At the conference were a number of presenters who would fall into the end-user category. These are corporations employing Hadoop today to solve real world problems in a number of different industries.

- eHarmony presented (not exact slides but close enough) during Amazon's time in the morning session. They use Hadoop via Amazon's web services to crunch hundreds of millions of replies to their extensive questionnaire by tens of millions of users. You do the math - lots of data points, lots of permutations. The only way they could go forward with increasingly complex and interesting matching models was to make the move to Hadoop.

- IBM previewed their M2 technology, which is a front end for Hadoop. They demoed the ingestion of raw patent data from the USPTO, rejiggering that data to better suit their needs and extracting meaningful data quickly, enabling their patent lawyers to work faster. Uh ya... that's what we need... more productive patent lawyers.

- StumbleUpon and Streamy spoke about serving their users without databases via HBase which is part of the Hadoop ecosystem.

- Visa uses Hadoop to speed up their risk score modeling, decreasing compute time from 1 month to 13 minutes.

 

Observations from Hadoopworld

These are my thoughts after a day of immersion in the world of Hadoop. Taking a look at the Hadoopworld agenda will give you an understanding of the caliber of presenters and breadth of topics covered. While I was both impressed and excited, there was also reason for concern. To be certain, there is a tremendous amount of energy, talented people and deep-pocketed corporations behind the success of Hadoop. However, users should be aware of the pros and cons before taking the plunge. As indicated in many spheres, Hadoop is powerful in that it allows you to provision resources which when employed correctly will accommodate the growth of your particular data problem with near linear scale. Moreover, Hadoop does provide an open source solution that solves problems in both the scalability and computational space. That said, let's take off the rose-colored glasses for a moment and enumerate some of the problems and challenges Hadoop faces and discuss solutions the community is providing.

  • Learning Curve
  • Interoperability
  • Security
  • Governance

Learning Curve

Do not kid yourself into thinking you can press a button and have your data problems melt away with Hadoop. Hadoop has quite a steep learning curve. Not only do you need to become familiarized with the dozen odd projects within the Hadoop ecosystem but perhaps more fundamentally one must reorient their way of thinking from a traditional RDBMS mindset. By and large web workers and any other data dredger have been weaned on a steady diet of RDBMS either in the form of MySql, Postgres or any number of other database systems. Databases and SQL are the way generations of knowledge workers have plied their trade. Of course, this is not an apples to apples comparison. Where Hadoop provides tremendous scalability and computational agility, RDBMS's provide transactional accuracy and programmatic familiarity. Also, Hadoop in its most basic incarnation is not a stand-in replacement for a database system. For that you have the HBase (based on another Google paper, BigTable: A Distributed Storage System for Structured Data, 2006) project which gets you a lot closer.

Highlighted at Hadoopworld were a number of players moving the ball forward to lessen the learning curve slope. Among them we have Cloudera, the organizers of the event, and from them we get Cloudera Desktop. Cloudera Desktop brings a number of creature comforts to the Hadoop operator and user, two of which are a file browser for the underlying HDFS that acts very much like Windows Explorer and a cluster health monitor so you can get a bird's-eye view of your systems. From Karmasphere, comes Karmasphere Studio, a plug-in for Netbeans that simplifies the development process for developers by masking many command line tools and instructions behind a much simpler GUI. The growing Hadoop user-base is attracting new focus to abstract away the difficulty of working in a new technical environment. Closer to the core are internal projects that add a layer of abstraction to raw map/reduce programming, specifically Hive and Pig. Interestingly, their roots can be traced to two early Hadoop adopters, Pig at Yahoo! and Hive at Facebook.

 

Interoperability 

I am not talking about interoperability in the sense of how Windows and Mac used to not get along but rather, Windows vs. Windows or Mac vs. Mac. It turns out that different versions of Hadoop are not compatible with each other. There are a number of different data formats for keeping files in HDFS (the underpinning of the Hadoop ecosystem). This basically means that all of the nodes within the Hadoop cluster need to run the same version. I'm sure I'll be corrected, but I'm pretty sure that as it stands data migration from one Hadoop version to the next is a major concern. The best advice I have read to perform a migration miracle is simply to install a separate cluster and write a map function that will pull data out of the old system and input it into the new system. Do not play games with mismatched versioning of cluster components. Strange and potentially scary things may happen to you, your pets, your children and/or your data.

Doug Cutting, now at Cloudera, is taking a major lead in fixing this situation by spearheading development of the Avro binary data format. Avro is being promoted as the new de facto data format for all things deep down in the Hadoop internals and the heir apparent to thrift. Prostrations to the new binary Prince aside, Avro is currently represented only in a small fraction of production Hadoop logs.

Beyond the data format issue is the actual programmable API issue. As it stands programs written against a certain version of Hadoop need to be recompiled and perhaps rewritten when moving to a new version of Hadoop. "Recompiled" and "rewritten" are two words with dollar signs attached to them. Watch any talk by Owen O'Malley (at Yahoo!, Hadoopworld slides - pdf) on Hadoop and you will see how this is evolving. The API is in flux but efforts are underway to annotate methods and functions with status notifications, and in so doing let developers know what is stable and what is still a work in progress.

Good luck and God's speed.

 

Security

Hadoop grew up in an environment of a handful of trusted users running map/reduce jobs against a large set of data. Over time the Hadoop cluster has become an asset to the organization running it and in short order there are more and more users competing for time on the cluster. One benefit of using a Hadoop cluster is that you can broadly share data sets in your organization. One problem with using a Hadoop cluster is that you can broadly share data sets in your organization. Once the number of users start to balloon you run into security considerations and that is where Hadoop as a platform is now. As it stands there are many ways users can interfere with each others jobs and data. Everything ranging from killing running jobs to deleting data. You know, Very Bad Things™. As with migrating data, the way to deal with this problem is to isolate users and data within separate clusters.

Moving forward there is progress being made in authenticating users via kerberose at various levels within the cluster. With validated users you will be able to secure job creation and scheduling and have a higher level of trust in your HDFS access logs. Also on the drawing board is how to effectively hide data from people you don't want to share with, which will be critical for getting Hadoop into sensitive industries and areas such as finance and medicine. Sarbanes-Oxley and HIPAA are two regulations you do not want to run afoul of in regards to access control.

 

Governance

Throughout this post I have mentioned major contributors and components to the Hadoop ecosystem and their provenance. As it stands from my vantage there are at least three major players here. Yahoo!, Facebook and Cloudera. Yahoo! takes the role of grandfather, having employed Doug Cutting during the project's metamorphoses towards what we now know as Hadoop and being the single largest contributor and user of Hadoop. Recently, Doug has moved on to Cloudera to join his co-founder, Mike Cafarella. In my world Facebook is the father in this picture, as it is likely the second largest user and major contributor, employing a number of Hadoop luminaries like Ashish Thusoo, father of Hive, and Dhruba Borthakur, HDFS expert, all-around nice guy and formerly of Yahoo!. Now to Cloudera, the son. They do not extensively use Hadoop internally to solve their own business problems, rather they supply Hadoop to the rest of us, and in so doing perhaps become the biggest user of them all.

When asked at the conference about the direction of Hadoop, one of the early members of the Yahoo! Hadoop team told me that up until recently most of the main contributors were a close-knit bunch, who all knew each other from their days at Yahoo! and can more or less still get on the phone with one another although they may be at different organizations now. As time goes on this will no longer be the case. Between competing interests all tugging at Hadoop to solve their business needs, the three main players and those in the rest of the field (Amazon, IBM, etc.) it's not that hard to imagine a fragmented and forked Hadoop. The one solace I take is that direction for the open source project is under the stewardship of the Apache Foundation. The Apache Foundation has guided the maturation of a number of open source projects, most notably the Apache web server and the Lucene search engine to name but two. The question is: Can Hadoop grow its contributor base while maintaining focus in its direction? Who knows. So far so good in my opinion. One example to the positive is the fact that the scheduling system received a pluggable architecture refactoring thereby allowing both the Facebook solution and the Yahoo! solution to coexist. 

 

Conclusion

If you are still getting started in your career or have exhausted what a traditional RDBMS can do for you I would highly recommend getting to know Hadoop. As I noted during the conference this space is very hot now and for the foreseeable future. All the players I've mentioned are very interested (practically falling over themselves) in hiring people with expertise and working knowledge of Hadoop.With the help of Amazon's Elastic Map Reduce mere mortals may hone their skills by harnessing the power of Hadoop at utility prices while using publicly available data sets. With tools like Karmasphere you can even deploy your code against a local system installation provided for free by Cloudera.

At the end of the day I have to say the thesis holds true. Hadoop is the New Way. In my humble opinion, I submit that Hadoop and the growing ecosystem of products around Hadoop will be at the center of the data analysis and heavy data storage and retrieval universe for some time to come (at least until Google unveils their successor papers to GFS/Map Reduce/BigTable). The benefits of sheer scalability and near limitless computational horsepower on the cheep is entirely too tantalizing a prospect for most to ignore. The open source nature of Hadoop will, with the right guidance, ensure that the best technical solutions find their way into the core of the platform; sub-projects will merge or wither (I'm looking at you Pig/Hive) and all will be well in the number crunching world.

 

Updates:

-20091006 slide link for Owen O'Malley's hadoopworld talk

Filed under: hadoop

sigizmund says...

Well, it can be annoying - it can be awfully annoying, in fact, to debug Hadoop applications. But sometimes you need it, because logging doesn't show anything, and you've tried anything but still cannot get under the Hadoop's cover. In this case, do few simple steps.

1. Download and unpack Hadoop to your local machine. 
2. Prepare small set of data you're planning to run the test on
3. Check that you actually can run Hadoop locally, something like this (don't forget to set $HADOOP_CLASSPATH first!): 

bin/hdebug jar yourprogram.jar com.company.project.HadoopApp \
          tiny.txt ./out

4. Go to Hadoop's directory, and copy file bin/hadoop to bin/hdebug
5. Now, we need to make Hadoop start in debug mode. What you should do is to add one line of text into the starting script:

Yes, here's it. Copy it from here:

-Xdebug -Xrunjdwp:transport=dt_socket,address=8001,server=y,suspend=y

What does it say basically is an instruction to Java to start in debug mode, and wait for socket connection of the remote debugger on port 8001; execution should be suspended after the start until debugger is connected.

Now, go and start your grid application like you did in step 3, but now use bin/hdebug script we've created. If you've done everything correctly, program should output something like this:

Listening for transport dt_socket at address: 8001

and wait for debugger. So, let's get it some debugger then! Fire up your Eclipse with your project (likely you have it opened already since you're trying to debug something) and add new Debug configuration:

After you've set everything up, click "Apply" and close the window for now – probably, you'd want to set some breakpoints before starting the actual debugging. Go and do it, and then simply choose created debug configuration - and off you go! If everything worked properly, you should soon get a standard debugger window, with all the nice things Java can offer you. Hope it'll help some of us in our difficult business of writing distributed grid-enabled applications! :)


Filed under: hadoop

hdknr says...

hBaseはBigTableのオープンソースクローンです。Web検索エンジンを開発しているPowerset社のエンジニアが主導して開発しており、Hadoopと同じくJavaで記述されています。hBaseはHadoopに依存しており、実際のデータはHDFS上に安全に保存されます。昔のバージョンのソースコードはHadoopに同梱されていましたが、現在は分かれているので気をつけてください。

 以下にhBaseの情報が掲載されているURLを集めてみました。

Filed under: Hadoop