Client library

class snakebite.client.Client(host, port=8020, hadoop_version=9, use_trash=False, effective_user=None, use_sasl=False, hdfs_namenode_principal=None, sock_connect_timeout=10000, sock_request_timeout=10000, use_datanode_hostname=False)

A pure python HDFS client.

Example:

>>> from snakebite.client import Client
>>> client = Client("localhost", 8020, use_trash=False)
>>> for x in client.ls(['/']):
...     print x

Warning

Many methods return generators, which means they need to be consumed to execute! The documentation explicitly states which methods return generators.

Note

paths parameters in methods are often passed as lists, since operations can work on multiple paths.

Note

Parameters like include_children and recurse are not used when paths contain globs.

Note

Different Hadoop distributions use different protocol versions. Snakebite defaults to 9, but this can be set by passing in the hadoop_version parameter to the constructor.

Parameters:
  • host (string) – Hostname or IP address of the NameNode
  • port (int) – RPC Port of the NameNode
  • hadoop_version (int) – What hadoop protocol version should be used (default: 9)
  • use_trash (boolean) – Use a trash when removing files.
  • effective_user (string) – Effective user for the HDFS operations (default: None - current user)
  • use_sasl (boolean) – Use SASL authentication or not
  • hdfs_namenode_principal (string) – Kerberos principal to use for HDFS
  • sock_connect_timeout (int) – Socket connection timeout in seconds
  • sock_request_timeout (int) – Request timeout in seconds
  • use_datanode_hostname (boolean) – Use hostname instead of IP address to communicate with datanodes

cat(paths, check_crc=False)

Fetch all files that match the source file pattern and display their content on stdout.

Parameters:
  • paths (list of strings) – Paths to display
  • check_crc (boolean) – Check for checksum errors
Returns:

a generator that yields strings
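
A minimal sketch, reusing a path from the example listings further down; the content is only fetched once the generator is consumed:

>>> for content in client.cat(['/index.asciidoc']):  # path from the ls examples below
...     print content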

chgrp(paths, group, recurse=False)

Change the group of paths.

Parameters:
  • paths (list) – List of paths to chgrp
  • group – New group
  • recurse (boolean) – Recursive chgrp
Returns:

a generator that yields dictionaries
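
A minimal sketch; like the other mutating calls, nothing happens until the generator is consumed (the path and group name are taken from the example listings below):

>>> list(client.chgrp(['/build'], 'supergroup'))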

chmod(paths, mode, recurse=False)

Change the mode for paths. This returns a list of maps containing the result of the operation.

Parameters:
  • paths (list) – List of paths to chmod
  • mode (int) – Octal mode (e.g. 0o755)
  • recurse (boolean) – Recursive chmod
Returns:

a generator that yields dictionaries

Note

The top level directory is always included when recurse=True
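
A minimal sketch using the octal mode form mentioned above; consuming the generator with list() executes the operation:

>>> list(client.chmod(['/build'], 0o755, recurse=True))  # path taken from the ls examples below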

chown(paths, owner, recurse=False)

Change the owner for paths. The owner can be specified as user or user:group

Parameters:
  • paths (list) – List of paths to chown
  • owner (string) – New owner
  • recurse (boolean) – Recursive chown
Returns:

a generator that yields dictionaries

The top level is always included when recursing.
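
A minimal sketch using the user:group form (owner and group taken from the example listings below); the operation only runs once the generator is consumed:

>>> list(client.chown(['/build'], 'wouter:supergroup', recurse=True))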

copyToLocal(paths, dst, check_crc=False)

Copy files that match the source file pattern to the local filesystem. The source is kept. When copying multiple files, the destination must be a directory.

Parameters:
  • paths (list of strings) – Paths to copy
  • dst (string) – Destination path
  • check_crc (boolean) – Check for checksum errors
Returns:

a generator that yields strings
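
A minimal sketch (the local destination /tmp is hypothetical); the copy only happens once the generator is consumed:

>>> list(client.copyToLocal(['/index.asciidoc'], '/tmp'))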

count(paths)

Count files in a path

Parameters: paths (list) – List of paths to count
Returns: a generator that yields dictionaries

Examples:

>>> list(client.count(['/']))
[{'spaceConsumed': 260185L, 'quota': 2147483647L, 'spaceQuota': 18446744073709551615L, 'length': 260185L, 'directoryCount': 9L, 'path': '/', 'fileCount': 34L}]

delete(paths, recurse=False)

Delete paths

Parameters:
  • paths (list) – Paths to delete
  • recurse (boolean) – Recursive delete (use with care!)
Returns:

a generator that yields dictionaries

Note

Recursive deletion uses the NameNode's recursive deletion functionality instead of letting the client recurse. Hadoop's client recurses by itself and thus shows all files and directories that are deleted; Snakebite doesn't.
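
A minimal sketch with a hypothetical path; as with the other mutating calls, the delete only runs once the generator is consumed:

>>> list(client.delete(['/build'], recurse=True))  # use with care!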

df()

Get FS information

Returns: a dictionary

Examples:

>>> client.df()
{'used': 491520L, 'capacity': 120137519104L, 'under_replicated': 0L, 'missing_blocks': 0L, 'filesystem': 'hdfs://localhost:8020', 'remaining': 19669295104L, 'corrupt_blocks': 0L}

du(paths, include_toplevel=False, include_children=True)

Returns size information for paths

Parameters:
  • paths (list) – Paths to du
  • include_toplevel (boolean) – Include the given path in the result. If the path is a file, include_toplevel is always True.
  • include_children (boolean) – Include child nodes in the result.
Returns:

a generator that yields dictionaries

Examples:

Children:

>>> list(client.du(['/']))
[{'path': '/Makefile', 'length': 6783L}, {'path': '/build', 'length': 244778L}, {'path': '/index.asciidoc', 'length': 100L}, {'path': '/source', 'length': 8524L}]

Directory only:

>>> list(client.du(['/'], include_toplevel=True, include_children=False))
[{'path': '/', 'length': 260185L}]

getmerge(path, dst, newline=False, check_crc=False)

Get all the files in the directories that match the source file pattern, and merge and sort them into a single file on the local filesystem.

Parameters:
  • path (string) – Directory containing files that will be merged
  • dst (string) – Path of the file that will be written
  • newline (boolean) – Add a newline character at the end of each file.
  • check_crc (boolean) – Check for checksum errors
Returns:

string content of the merged file at dst
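
A minimal sketch (the local destination /tmp/merged.txt is hypothetical); wrapping the call in list() forces execution in case the client returns a generator here as well:

>>> list(client.getmerge('/source', '/tmp/merged.txt', newline=True))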

ls(paths, recurse=False, include_toplevel=False, include_children=True)

Issues the ‘ls’ command and returns a list of maps that contain file info

Parameters:
  • paths (list) – Paths to list
  • recurse (boolean) – Recursive listing
  • include_toplevel (boolean) – Include the given path in the listing. If the path is a file, include_toplevel is always True.
  • include_children (boolean) – Include child nodes in the listing.
Returns:

a generator that yields dictionaries

Examples:

Directory listing

>>> list(client.ls(["/"]))
[{'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'access_time': 1367317324982L, 'block_replication': 1, 'modification_time': 1367317325346L, 'length': 6783L, 'blocksize': 134217728L, 'owner': u'wouter', 'path': '/Makefile'}, {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1367317325431L, 'length': 0L, 'blocksize': 0L, 'owner': u'wouter', 'path': '/build'}, {'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'access_time': 1367317326510L, 'block_replication': 1, 'modification_time': 1367317326522L, 'length': 100L, 'blocksize': 134217728L, 'owner': u'wouter', 'path': '/index.asciidoc'}, {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1367317326628L, 'length': 0L, 'blocksize': 0L, 'owner': u'wouter', 'path': '/source'}]

File listing

>>> list(client.ls(["/Makefile"]))
[{'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'access_time': 1367317324982L, 'block_replication': 1, 'modification_time': 1367317325346L, 'length': 6783L, 'blocksize': 134217728L, 'owner': u'wouter', 'path': '/Makefile'}]

Get directory information

>>> list(client.ls(["/source"], include_toplevel=True, include_children=False))
[{'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1367317326628L, 'length': 0L, 'blocksize': 0L, 'owner': u'wouter', 'path': '/source'}]

mkdir(paths, create_parent=False, mode=493)

Create a directory

Parameters:
  • paths (list of strings) – Paths to create
  • create_parent (boolean) – Also create the parent directories
  • mode (int) – Mode the directory should be created with
Returns:

a generator that yields dictionaries
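
A minimal sketch with a hypothetical path; mode 493 is 0o755 in octal, and the directory is only created once the generator is consumed:

>>> list(client.mkdir(['/foo/bar'], create_parent=True, mode=0o755))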

rename(paths, dst)

Rename (move) path(s) to a destination

Parameters:
  • paths (list) – Source paths
  • dst (string) – destination
Returns:

a generator that yields dictionaries
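
A minimal sketch with hypothetical paths; the move only happens once the generator is consumed:

>>> list(client.rename(['/Makefile'], '/build/Makefile'))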

rename2(path, dst, overwriteDest=False)

Rename (but don’t move) path to a destination

By only renaming, we mean that you can’t move a file or folder out of or into another folder. The renaming can only happen within the folder the file or folder lies in.

Note that this operation “always succeeds” unless an exception is raised; hence, the dict returned from this function doesn’t have the ‘result’ key.

Since this operation can only rename and not move, it would not make sense to pass multiple paths to rename to a single destination. This method uses the underlying rename2 method.

https://github.com/apache/hadoop/blob/ae91b13/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java#L483-L523

Out of all the different exceptions mentioned in the link above, this method only wraps the FileAlreadyExistsException exception. You will also get a FileAlreadyExistsException if you have overwriteDest=True and the destination folder is not empty. The other exceptions will just be passed along.

Parameters:
  • path (string) – Source path
  • dst (string) – destination
Returns:

A dictionary or None
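
A minimal sketch with hypothetical names; since rename2 does not return a generator, the call executes immediately:

>>> client.rename2('/index.asciidoc', '/index.adoc', overwriteDest=False)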

rmdir(paths)

Delete a directory

Parameters: paths (list) – Paths to delete
Returns: a generator that yields dictionaries
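
A minimal sketch with a hypothetical path; the directory is only removed once the generator is consumed:

>>> list(client.rmdir(['/build']))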

serverdefaults(force_reload=False)

Get server defaults, caching the results. If there are no results saved, or the force_reload flag is True, it will query the HDFS server for its default parameter values. Otherwise, it will simply return the results it has already queried.

Note: This function returns a copy of the results loaded from the server, so you can manipulate or change them as you’d like. If for any reason you need to change the results the client saves, you must access the property client._server_defaults directly.

Parameters: force_reload (bool) – Should the server defaults be reloaded even if they already exist?
Returns: dictionary with the following keys: blockSize, bytesPerChecksum, writePacketSize, replication, fileBufferSize, encryptDataTransfer, trashInterval, checksumType

Example:

>>> client.serverdefaults()
[{'writePacketSize': 65536, 'fileBufferSize': 4096, 'replication': 1, 'bytesPerChecksum': 512, 'trashInterval': 0L, 'blockSize': 134217728L, 'encryptDataTransfer': False, 'checksumType': 2}]

setrep(paths, replication, recurse=False)

Set the replication factor for paths

Parameters:
  • paths (list) – Paths
  • replication – Replication factor
  • recurse (boolean) – Apply replication factor recursively
Returns:

a generator that yields dictionaries
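
A minimal sketch using a path from the example listings above; the replication change only happens once the generator is consumed:

>>> list(client.setrep(['/Makefile'], 3))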

stat(paths)

Stat a file

Parameters: paths (string) – Path
Returns: a dictionary

Example:

>>> client.stat(['/index.asciidoc'])
{'blocksize': 134217728L, 'owner': u'wouter', 'length': 100L, 'access_time': 1367317326510L, 'group': u'supergroup', 'permission': 420, 'file_type': 'f', 'path': '/index.asciidoc', 'modification_time': 1367317326522L, 'block_replication': 1}

tail(path, tail_length=1024, append=False)

Show the end of the file. By default the last 1KB is shown, up to the Hadoop block size.

Parameters:
  • path (string) – Path to read
  • tail_length (int) – The length to read from the end of the file - default 1KB, up to block size.
  • append (bool) – Currently not implemented
Returns:

a generator that yields strings
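
A minimal sketch using a path from the example listings above; note that tail takes a single path string rather than a list:

>>> for chunk in client.tail('/index.asciidoc', tail_length=100):
...     print chunk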

test(path, exists=False, directory=False, zero_length=False)

Test if a path exists, is a directory, or has zero length

Parameters:
  • path (string) – Path to test
  • exists (boolean) – Check if the path exists
  • directory (boolean) – Check if the path is a directory
  • zero_length (boolean) – Check if the path is zero-length
Returns:

a boolean

Note

directory and zero length are AND’d.
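
A minimal sketch against the example tree shown in the ls output above, where /source is a directory:

>>> client.test('/source', exists=True, directory=True)
True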

text(paths, check_crc=False)

Takes a source file and outputs the file in text format. The allowed compression formats are gzip and bzip2.

Parameters:
  • paths (list of strings) – Paths to display
  • check_crc (boolean) – Check for checksum errors
Returns:

a generator that yields strings
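
A minimal sketch with a hypothetical gzip-compressed file; as with cat, the content is only fetched once the generator is consumed:

>>> for content in client.text(['/logs/app.log.gz']):  # hypothetical path
...     print content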

touchz(paths, replication=None, blocksize=None)

Create a zero-length file or update the timestamp on a zero-length file

Parameters:
  • paths (list) – Paths
  • replication – Replication factor
  • blocksize (int) – Block size (in bytes) of the newly created file
Returns:

a generator that yields dictionaries
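
A minimal sketch with a hypothetical path; the file is only created once the generator is consumed:

>>> list(client.touchz(['/empty_marker']))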

class snakebite.client.AutoConfigClient(hadoop_version=9, effective_user=None, use_sasl=False)

A pure python HDFS client that supports HA and is auto-configured through the HADOOP_HOME environment variable.

Like HAClient, it is fully backwards compatible with the vanilla Client and can be used for a non-HA cluster as well. This client tries to read ${HADOOP_HOME}/conf/hdfs-site.xml and ${HADOOP_HOME}/conf/core-site.xml to get the address of the namenode.

The behaviour is the same as Client.

Example:

>>> from snakebite.client import AutoConfigClient
>>> client = AutoConfigClient()
>>> for x in client.ls(['/']):
...     print x

Note

Different Hadoop distributions use different protocol versions. Snakebite defaults to 9, but this can be set by passing in the hadoop_version parameter to the constructor.

Parameters:
  • hadoop_version (int) – What hadoop protocol version should be used (default: 9)
  • effective_user (string) – Effective user for the HDFS operations (default: None - current user)
  • use_sasl (boolean) – Use SASL authentication or not

class snakebite.client.HAClient(namenodes, use_trash=False, effective_user=None, use_sasl=False, hdfs_namenode_principal=None, max_failovers=15, max_retries=10, base_sleep=500, max_sleep=15000, sock_connect_timeout=10000, sock_request_timeout=10000, use_datanode_hostname=False)

Snakebite client with support for High Availability

HAClient is fully backwards compatible with the vanilla Client and can be used for a non HA cluster as well.

Example:

>>> from snakebite.client import HAClient
>>> from snakebite.namenode import Namenode
>>> n1 = Namenode("namenode1.mydomain", 8020)
>>> n2 = Namenode("namenode2.mydomain", 8020)
>>> client = HAClient([n1, n2], use_trash=True)
>>> for x in client.ls(['/']):
...     print x

Note

Different Hadoop distributions use different protocol versions. Snakebite defaults to 9, but this can be set by passing in the version parameter to the Namenode class constructor.

Parameters:
  • namenodes (list) – Set of namenodes for HA setup
  • use_trash (boolean) – Use a trash when removing files.
  • effective_user (string) – Effective user for the HDFS operations (default: None - current user)
  • use_sasl (boolean) – Use SASL authentication or not
  • hdfs_namenode_principal (string) – Kerberos principal to use for HDFS
  • max_failovers (int) – Number of failovers in case of connection issues
  • max_retries (int) – Max number of retries for failures
  • base_sleep (int) – Base sleep time for retries in milliseconds
  • max_sleep (int) – Max sleep time for retries in milliseconds
  • sock_connect_timeout (int) – Socket connection timeout in seconds
  • sock_request_timeout (int) – Request timeout in seconds
  • use_datanode_hostname (boolean) – Use hostname instead of IP address to communicate with datanodes