Today I finally hit the task I had been scared of for so long: processing large XML files on Hadoop. I won’t tell you how long I crawled the Internet trying to find a working solution (not that anyone wants to know). Eventually, I came up with a solution of my own. Even though I hate re-inventing the wheel, in this particular case all the wheels I found were either square or utterly incompatible with my model of car.
To keep things simple, I won’t include the full source code. I won’t even include the whole InputFormat class. So, to make yourself comfortable, please do the following:
- Open LineRecordReader from org.apache.hadoop.mapreduce.lib.input so you can see it.
- Open TextInputFormat from the same package.
- Create an input format and a record reader of your own, just by copying and pasting the code from the aforementioned classes.
- Change the createRecordReader() method of your input format class so it returns your newly defined record reader.
Now we’re almost there. Here is the code for nextKeyValue(), which turned out to be the most critical method. Hold on tight:
public boolean nextKeyValue() throws IOException
{
    if (key == null)
    {
        key = new LongWritable();
    }
    key.set(pos);
    if (value == null)
    {
        value = new Text();
    }
    StringBuilder sb = new StringBuilder();
    Text tmpLine = new Text();
    boolean xmlRecordStarted = false;
    // Keep reading past the end of the split while a record is still open,
    // so a record that straddles the split boundary is not cut in half.
    while (pos < end || xmlRecordStarted)
    {
        int newSize = in.readLine(tmpLine, maxLineLength,
            Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
        if (newSize == 0)
        {
            break; // end of file
        }
        pos += newSize; // count every consumed line, including the closing one
        if (tmpLine.toString().contains("<document "))
        {
            xmlRecordStarted = true;
        }
        if (xmlRecordStarted)
        {
            // readLine() strips the line terminator, so add a separator here
            sb.append(tmpLine.toString()).append(' ');
            if (tmpLine.toString().contains("</document>"))
            {
                value.set(sb.toString());
                return true;
            }
        }
    }
    // No complete record is left in this split
    key = null;
    value = null;
    return false;
}
WTF, you will say? It’s the same code? Well, yes and no. It’s almost the same. Take a look at this line:
if (tmpLine.toString().contains("<document "))
and this line:
if (tmpLine.toString().contains("</document>"))
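This is where we actually split the file into records: everything from a line containing the opening marker through the line containing the closing marker becomes one value. The code is pretty much self-explanatory, so I won’t add anything else.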
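If you want to poke at the splitting rule without a cluster, here is a minimal sketch of the same idea on a plain BufferedReader. XmlRecordSplitter is a hypothetical name of my own, not a Hadoop class, and the two markers are passed in as parameters purely for illustration. It ignores split boundaries entirely, so treat it as a toy, not as the record reader itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): mimics the loop in nextKeyValue()
// on an ordinary reader so the line-based splitting rule can be tried locally.
final class XmlRecordSplitter {

    // Collects every run of lines from one containing startMarker through the
    // next one containing endMarker, joined with spaces into a single string.
    static List<String> split(BufferedReader in, String startMarker, String endMarker)
            throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        boolean recordStarted = false;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains(startMarker)) {
                recordStarted = true;
            }
            if (recordStarted) {
                sb.append(line).append(' ');
                if (line.contains(endMarker)) {
                    records.add(sb.toString().trim());
                    sb.setLength(0);
                    recordStarted = false;
                }
            }
        }
        return records;
    }

    // Convenience overload for in-memory strings; wraps the checked exception.
    static List<String> split(String xml, String startMarker, String endMarker) {
        try {
            return split(new BufferedReader(new StringReader(xml)), startMarker, endMarker);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling XmlRecordSplitter.split(text, "<document ", "</document>") on a small sample file is a quick way to check that your markers pick out the records you expect before wiring anything into Hadoop.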
Now, it’s not the cleanest or most streamlined solution, and I will probably spend a while tomorrow making it more production-ready and good-looking, but compared to other solutions it has a few major benefits:
- It uses very little custom code (you remember, we copied and pasted all the classes?). Unfortunately you cannot just inherit the class — some fields are private, and we clearly want to modify them.
- It’s configurable: you can easily change the "<document " and "</document>" strings to anything else (and again, I will do it tomorrow, but right now I feel too lazy).
- It works.
There are a few limitations to this approach. One of them is that if a single line contains something like </document><document>, it obviously won’t work, because records are detected line by line. Another is that you still need to parse the elements in your mapper (although you can easily change that by parsing records in your record reader into a Writable-compatible class).
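For the parsing part, here is a minimal sketch of what a mapper could do with one record value, using only the JDK’s built-in DOM parser. DocumentRecordParser and childText are hypothetical names of mine, and the sketch assumes each value is a well-formed fragment that starts at the opening tag. If other markup shares a line with the opening or closing tag, the parse will fail and you would need to trim the value first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: parses one record emitted by the record reader
// and pulls out the text of a named child element.
final class DocumentRecordParser {

    // Returns the text of the first descendant element called childName,
    // or null if the record has no such element.
    static String childText(String record, String childName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(record)));
            NodeList nodes = doc.getDocumentElement().getElementsByTagName(childName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            // Assumption: a malformed record is a caller error in this sketch
            throw new IllegalArgumentException("Not a well-formed XML record", e);
        }
    }
}
```

Inside a mapper you would call something like childText(value.toString(), "title"), where "title" is just an example element name, and emit whatever you need from the result.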
Have fun!
Update: As you can see, I have added a space to the "<document " string constant. Today I realised that "<documenttype" elements had also been matched as split points, producing inconsistent results.
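The trailing space does the job for my files, but it would miss an opening tag with no attributes, or one whose attributes start on the next line. A slightly stricter check could look like the sketch below. StartTagMatcher is a hypothetical name and this is not what the code above uses, just one possible way to tighten the match with a regular expression.

```java
import java.util.regex.Pattern;

// Hypothetical helper: recognises "<name" only when it is a whole tag name,
// so "<documenttype" is not mistaken for "<document".
final class StartTagMatcher {

    // Matches "<name" followed by whitespace, ">" or "/", or by the end of
    // the line (for tags whose attributes continue on the next line).
    static Pattern startTag(String name) {
        return Pattern.compile("<" + Pattern.quote(name) + "(?=[\\s>/]|$)");
    }

    static boolean containsStartTag(String line, String name) {
        return startTag(name).matcher(line).find();
    }
}
```

In the record reader you could then swap the contains("<document ") test for containsStartTag(line, "document"), ideally compiling the pattern once and reusing it for every line.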