Putting a remote file into Hadoop without copying it to local disk

Question

asked Jul 7, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

I am writing a shell script to put data files into Hadoop as soon as they are generated. I can ssh to my master node, copy the files to a folder there, and then put them into Hadoop. I am looking for a shell command that avoids copying the file to the local disk on the master node. To better explain what I need, here is what I have so far:

1) Copy the file to the master node's local disk:

scp test.txt username@masternode:/folderName/

I have already set up the SSH connection using keys, so no password is needed to do this.

2) I can use ssh to remotely execute the hadoop put command:

ssh username@masternode "hadoop dfs -put /folderName/test.txt hadoopFolderName/"

What I am looking for is a way to pipe/combine these two steps into one and skip the copy of the file on the master node's local disk.

1 Answer

Amit Rawat · Answer 1 · 2019-07-08T05:39:59+0000

Try this command:

cat test.txt | ssh username@masternode "hadoop dfs -put - hadoopFoldername/"

I've used similar tricks to copy directories around.

tar cf - . | ssh remote "(cd /destination && tar xvf -)"

This sends the output of local-tar into the input of remote-tar.
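To see the tar-to-tar mechanism without a remote machine, here is a minimal local sketch in which both ends of the pipe run on the same host. The scratch directories `src` and `dst` are made up for the demo; in the real command the right-hand side would run under ssh.

```shell
#!/bin/sh
# Local stand-in for the ssh pipeline: both tar processes run on this machine.
# src and dst are hypothetical scratch directories created just for the demo.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "hello" > "$src/test.txt"

# The first tar writes an archive to stdout (-f -);
# the second tar reads that archive from stdin and unpacks it in dst.
(cd "$src" && tar cf - .) | (cd "$dst" && tar xf -)

# The file now exists in dst without any intermediate archive file on disk.
cat "$dst/test.txt"
```

Replacing the second subshell with `ssh remote "(cd /destination && tar xvf -)"` gives the command above.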

This pipes the file as you asked. Note, however, that copying a single file to the master node's local drive and then putting it into Hadoop over ssh is faster than the cat | ssh pipeline.
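For use inside a script, the piped put could be wrapped in a small function along these lines. This is only a sketch: the function name `put_via_ssh` and its argument order are invented here, it uses `hadoop fs` where the question uses the older `hadoop dfs` spelling, and it assumes you pass a full destination file path, since a file read from stdin has no source name to reuse.

```shell
#!/bin/sh
# put_via_ssh LOCAL_FILE REMOTE_HOST HDFS_DEST
# Streams LOCAL_FILE over ssh into "hadoop fs -put -", which reads from stdin,
# so nothing is written to the remote node's local disk.
# The function name and argument order are illustrative, not part of Hadoop.
put_via_ssh() {
    local_file=$1
    remote_host=$2
    hdfs_dest=$3

    # Fail early if the local file is missing or unreadable.
    if [ ! -r "$local_file" ]; then
        echo "cannot read $local_file" >&2
        return 1
    fi

    # Redirecting stdin does the same job as "cat file |" without the extra process.
    # The function's exit status is that of ssh, which reports the remote command's status.
    ssh "$remote_host" "hadoop fs -put - '$hdfs_dest'" < "$local_file"
}

# Example call with the names from the question (assumed destination file name):
# put_via_ssh test.txt username@masternode hadoopFoldername/test.txt
```

Because the function returns the exit status of ssh, a calling script can check it and retry or log a failure as needed.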

