Snakebite documentation

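Snakebite is a Python package that provides a pure Python HDFS client library (snakebite.client), a command-line interface (CLI) built on that library, and a wrapper around a Hadoop Java mini cluster.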

Background

Because the ‘normal’ Hadoop HDFS client (hadoop fs) is written in Java and depends on many Hadoop jars, its startup time is high (more than 3 seconds). This makes it a poor fit for integrating Hadoop commands into Python projects.

At Spotify we use the Luigi job scheduler, which relies on many existence checks and on moving data around in HDFS. Since calling hadoop from Python is expensive, we decided to write a pure Python HDFS client that depends only on protobuf. The current snakebite.client library uses protobuf messages and implements the Hadoop RPC protocol for talking to the NameNode.
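As a rough illustration of the kind of existence check mentioned above, the following is a minimal sketch rather than a definitive recipe. It assumes snakebite is installed and that a NameNode is reachable; the host, port, and the helper names make_client and all_exist are hypothetical placeholders, and the sketch assumes Client.test(path, exists=True) returns False for a missing path.

```python
def make_client(host="localhost", port=8020):
    """Build a snakebite client; host and port are placeholder values."""
    # Imported lazily so this module still loads where snakebite is absent.
    from snakebite.client import Client
    return Client(host, port)


def all_exist(client, paths):
    """Return True only if every path in `paths` exists in HDFS."""
    # Assumption: test(..., exists=True) returns False for a missing path.
    return all(client.test(path, exists=True) for path in paths)
```

With a running cluster, a call might look like all_exist(make_client(), ["/data/input"]). Each check is an RPC to the NameNode from the same Python process, so no JVM has to be started per call.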

During development, we needed to verify the behavior of snakebite.client against the real client, so we implemented a minicluster that wraps a Hadoop Java mini cluster. Because this minicluster is useful in other projects too, we made it part of snakebite.

Since a command-line tool built on snakebite.client is also convenient, we implemented a CLI client as well.

Warning

All methods that read data from a data node can check the CRC during transfer, but this check is disabled by default for performance reasons. This is the opposite of the behaviour of the stock Hadoop client.
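If integrity matters more than speed, the check can be requested per call. The sketch below is illustrative only: it assumes the client's cat method accepts a check_crc keyword and yields, for each matched file, an iterable of data chunks. The helper name read_file_checked is hypothetical.

```python
def read_file_checked(client, path):
    """Read one HDFS file with CRC verification turned on."""
    chunks = []
    # Assumption: cat() yields one iterable of data chunks per matched file.
    for file_chunks in client.cat([path], check_crc=True):
        chunks.extend(file_chunks)
    return b"".join(chunks)
```

Here client is expected to be a snakebite.client.Client instance. Joining all chunks in memory is only reasonable for small files; larger files would be better processed chunk by chunk.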

LICENSE

Copyright (c) 2013 - 2014 Spotify AB

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Code in channel, logger and service was borrowed from https://code.google.com/p/protobuf-socket-rpc/ and carries its respective license.
