stevehoward2020
Enthusiast

Odd disk usage with shared storage

We have deployed a 15-node cluster using shared storage. When we run df -h, we see the root filesystem and the /mnt filesystems mounted.

We also see the filesystems from each of the other datanodes on each other's local storage. For example...

/mnt/scsi-123abc is mounted on server01

/mnt/scsi-456def is mounted on server02

Under /mnt on server01, we also see /mnt/scsi-456def as part of the root filesystem, not mounted as its own filesystem.

Under /mnt on server02, we also see /mnt/scsi-123abc as part of the root filesystem, not mounted as its own filesystem.

Is this a storage configuration issue, or is it something that BDE decided when it deployed?

Thanks,

Steve

12 Replies
fakber
VMware Employee

Hi Steve,

Could you please run the following commands and paste the output here?

df -h

df -hl

mount

cat /proc/mounts

cat /proc/partitions

Also, if you wish, you could file a Support Request with VMware GSS as well.

Thanks,

Faisal Akber

stevehoward2020
Enthusiast

Hi Faisal,

Server 1

<pre>

$ df -h
Filesystem            Size  Used Avail Use% Mounted on

/dev/sda3              20G   16G  2.9G  85% /

tmpfs                 1.9G  8.0K  1.9G   1% /dev/shm

/dev/sda1             124M   26M   93M  22% /boot

/dev/sdc1              50G   23G   25G  49% /mnt/scsi-36000c2912867e896bd3f6f71f18124fe-part1

$ df -hl
Filesystem            Size  Used Avail Use% Mounted on

/dev/sda3              20G   16G  2.9G  85% /

tmpfs                 1.9G  8.0K  1.9G   1% /dev/shm

/dev/sda1             124M   26M   93M  22% /boot

/dev/sdc1              50G   23G   25G  49% /mnt/scsi-36000c2912867e896bd3f6f71f18124fe-part1

$ mount
/dev/sda3 on / type ext3 (rw)

proc on /proc type proc (rw)

sysfs on /sys type sysfs (rw)

devpts on /dev/pts type devpts (rw,gid=5,mode=620)

tmpfs on /dev/shm type tmpfs (rw)

/dev/sda1 on /boot type ext3 (rw)

none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)

/dev/sdc1 on /mnt/scsi-36000c2912867e896bd3f6f71f18124fe-part1 type ext4 (rw,noatime)

$ cat /proc/mounts
rootfs / rootfs rw 0 0

proc /proc proc rw,relatime 0 0

sysfs /sys sysfs rw,relatime 0 0

devtmpfs /dev devtmpfs rw,relatime,size=1953972k,nr_inodes=488493,mode=755 0 0

devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0

tmpfs /dev/shm tmpfs rw,relatime 0 0

/dev/sda3 / ext3 rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0

/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0

/dev/sda1 /boot ext3 rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0

none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0

/dev/sdc1 /mnt/scsi-36000c2912867e896bd3f6f71f18124fe-part1 ext4 rw,noatime,barrier=1,data=ordered 0 0

$ cat /proc/partitions
major minor  #blocks  name

   8        0   20971520 sda

   8        1     131072 sda1

   8        2     131072 sda2

   8        3   20708352 sda3

   8       16    4194304 sdb

   8       32   52428800 sdc

   8       33   52428127 sdc1

</pre>

Server 2

<pre>

$ df -h
Filesystem            Size  Used Avail Use% Mounted on

/dev/sda3              20G   13G  5.9G  69% /

tmpfs                 1.9G     0  1.9G   0% /dev/shm

/dev/sda1             124M   26M   93M  22% /boot

/dev/sdc1              50G  598M   47G   2% /mnt/scsi-36000c29ea6ca69dceb427dad32b4289e-part1

$ df -hl
Filesystem            Size  Used Avail Use% Mounted on

/dev/sda3              20G   13G  5.9G  69% /

tmpfs                 1.9G     0  1.9G   0% /dev/shm

/dev/sda1             124M   26M   93M  22% /boot

/dev/sdc1              50G  598M   47G   2% /mnt/scsi-36000c29ea6ca69dceb427dad32b4289e-part1

$ mount
/dev/sda3 on / type ext3 (rw)

proc on /proc type proc (rw)

sysfs on /sys type sysfs (rw)

devpts on /dev/pts type devpts (rw,gid=5,mode=620)

tmpfs on /dev/shm type tmpfs (rw)

/dev/sda1 on /boot type ext3 (rw)

none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)

/dev/sdc1 on /mnt/scsi-36000c29ea6ca69dceb427dad32b4289e-part1 type ext4 (rw,noatime)

$ cat /proc/mounts
rootfs / rootfs rw 0 0

proc /proc proc rw,relatime 0 0

sysfs /sys sysfs rw,relatime 0 0

devtmpfs /dev devtmpfs rw,relatime,size=1953972k,nr_inodes=488493,mode=755 0 0

devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0

tmpfs /dev/shm tmpfs rw,relatime 0 0

/dev/sda3 / ext3 rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0

/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0

/dev/sda1 /boot ext3 rw,relatime,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0

none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0

/dev/sdc1 /mnt/scsi-36000c29ea6ca69dceb427dad32b4289e-part1 ext4 rw,noatime,barrier=1,data=ordered 0 0

$ cat /proc/partitions
major minor  #blocks  name

   8        0   20971520 sda

   8        1     131072 sda1

   8        2     131072 sda2

   8        3   20708352 sda3

   8       16    4194304 sdb

   8       32   52428800 sdc

   8       33   52428127 sdc1

</pre>

Below I also show the respective mounted filesystems as seen from the opposite node...

<pre>
[root@cmhlpdlkedat09 ~]# ls -lrt /mnt/scsi-36000c2912867e896bd3f6f71f18124fe-part1

total 4

drwxr-xr-x 4 root root 4096 Dec 17 21:06 hadoop

[root@cmhlpdlkedat09 ~]# exit

logout

Connection to cmhlpdlkedat09 closed.

[root@cmhlpdlkedat10 ~]# ls -lrt /mnt/scsi-36000c29ea6ca69dceb427dad32b4289e-part1

total 4

drwxr-xr-x 4 root root 4096 Dec 17 21:06 hadoop

[root@cmhlpdlkedat10 ~]# mount | grep mnt
/dev/sdc1 on /mnt/scsi-36000c2912867e896bd3f6f71f18124fe-part1 type ext4 (rw,noatime)
[root@cmhlpdlkedat10 ~]#

</pre>

fakber
VMware Employee

Hi Steve

From the output you have provided, I see that the volume in question is local to the VM itself.

The volumes may have the same name, but they are not shared across the two VMs.

Faisal Akber

stevehoward2020
Enthusiast

That is what is so odd, and almost impossible to believe. We have 12 worker nodes, and each one has a directory with the exact same name as the filesystem mounted on each of the other 11 hosts, and even with current content. As you noted, it appears to be local to each of those 11 VMs. I may create another cluster to see if the behavior is duplicated, add some content in HDFS, and if the condition exists, delete the content on the "local" filesystem. I am willing to bet that breaks something. I am not the storage admin, so I will talk with them in the AM and update this thread.

The reason I even noticed it is that the root filesystem was filling up, and most of it came from these local filesystems with the same name as the mounted filesystem on other nodes.

Thanks for taking a look!

jessehuvmw
Enthusiast

Hi Steve,

Could you please provide more info on this issue?

1) Run the following commands on both nodes and paste the output here:

ls -lrt /mnt/scsi-*

mount

2) the cluster spec file you used for this cluster

3) How many shared storage datastores do you have? Also paste the output of executing 'datastore list --detail' in the BDE CLI.

Thanks.

Cheers, Jesse Hu
stevehoward2020
Enthusiast

I think we have figured this out. We used Ambari in HDP to install the actual Hadoop cluster software, and by default it assumes that each host uses the same mount point name for the HDFS datanode storage. Since BDE creates the partitions and names them in /etc/fstab, Ambari adds the directory names for all 12 (in our case) datanode directories to the common hdfs-site.xml shared across all 12 nodes. Only one of these is a mounted filesystem on any given node, and the other 11 end up as "local" directories used by HDFS. That is why they have the exact same name but in fact have different content from their "real" counterparts on the node where they are actually mounted. As noted, we noticed this when our root filesystem started throwing space warnings.
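For illustration, the shared hdfs-site.xml ends up listing every node's mount point under dfs.datanode.data.dir, something like the fragment below (abridged to two of the 12 paths, using the device names from the df output above; the trailing subdirectory layout is Ambari's choice, so treat it as illustrative):

```xml
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- every node receives this same list; only one path is a real mount
       on any given node, the rest become directories on the root fs -->
  <value>/mnt/scsi-36000c2912867e896bd3f6f71f18124fe-part1/hadoop,/mnt/scsi-36000c29ea6ca69dceb427dad32b4289e-part1/hadoop</value>
</property>
```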

We have HDP in today for a general discussion, so I will ask them about this. It could be that we missed an option in the original setup, but we can rebuild this, so no harm, no foul. I will reply back to this thread with the results, but unless I am missing something, the fix may be an ugly hack on /etc/fstab after the BDE cluster is built (but before Ambari is used) to get a single name for the mounted HDFS filesystem across all nodes.

stevehoward2020
Enthusiast

Hortonworks was in to see us this AM and indicated this will always be an issue with Ambari, as it maintains hdfs-site.xml by listing all data directories for every node in the common copy Ambari stores in Postgres. The design assumes the filesystems are named the same on every node. If they aren't, the result is what we saw, i.e., "local" directories created on each data node with the same name as the filesystem actually mounted on another node.

I ended up writing the ugly hack below, run on each data node after building the cluster but before running the Ambari install...

<pre>
cd /mnt
mkdir hdfs
cp -p /etc/fstab /etc/fstab.bak
# this assumes there is only one /mnt/scsi filesystem in play
grep -v "/mnt/scsi" /etc/fstab > /tmp/tmp.txt
mv /tmp/tmp.txt /etc/fstab
# rewrite the old /mnt/scsi-* mount point to /mnt/hdfs and re-append the entry
awk '$0 ~ "mnt/scsi" {gsub(".*","/mnt/hdfs",$2);print}' /etc/fstab.bak >> /etc/fstab
umount /mnt/scsi*
mount /mnt/hdfs
</pre>

While this works, it is hardly elegant, and I'm sure it is unnecessary, as I am probably missing something. Is there any way to get the partitions (or, more specifically, the directories on which they are mounted) created by BDE to have the same name across all nodes?
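The awk rewrite in the hack can be sanity-checked against a scratch copy before touching the real /etc/fstab; a minimal sketch (the fstab contents here are made up, with the device name borrowed from the df output earlier in the thread):

```shell
# build a scratch fstab with one /mnt/scsi-* entry
tmp=$(mktemp -d)
cat > "$tmp/fstab.bak" <<'EOF'
/dev/sda3 / ext3 defaults 1 1
/dev/sdc1 /mnt/scsi-36000c2912867e896bd3f6f71f18124fe-part1 ext4 defaults,noatime 0 0
EOF

# same rewrite as the hack: replace the mount point field ($2) with /mnt/hdfs
awk '$0 ~ "mnt/scsi" {gsub(".*","/mnt/hdfs",$2);print}' "$tmp/fstab.bak" > "$tmp/new_entry"

cat "$tmp/new_entry"   # -> /dev/sdc1 /mnt/hdfs ext4 defaults,noatime 0 0
```

Only the matching line is emitted, so appending it to a fstab that has had the /mnt/scsi entries grep'd out leaves every other mount untouched.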

jessehuvmw
Enthusiast

Hi Steve,

I think you are using BDE to create an empty cluster and then using Ambari to deploy a Hadoop cluster on those VMs, which is why you encountered this disk mount point issue.

I have two quick solutions for you; choose whichever suits you best.

1) Use BDE 2.1 to provision an Ambari Hadoop cluster. In this way, BDE will create an empty cluster and call the Ambari REST API to deploy the cluster on these VMs, sending the disk mount points for each VM to the Ambari REST API. This feature is new in BDE 2.1.

2) After you create an empty cluster in BDE 2.1 or 2.0, you can run this script on the BDE Server to create aliases for the mount points:

serengeti-ssh.sh <cluster_name> 'x=1; for i in `ls -d /mnt/scsi-*` ; do sudo ln -s $i /mnt/bigdata$x ; x=$((x+1)); done'

Then run this script on the BDE Server to list all mount points for all nodes in your cluster:

   serengeti-ssh.sh <cluster_name> 'for i in `ls -d /mnt/bigdata*` ; do echo -n $i, ; done'

The mount point names are like /mnt/bigdata1, /mnt/bigdata2, ... /mnt/bigdataN on every node. N can differ between nodes because BDE places the nodes on different ESX hosts, and each host may have a different number of datastores; the number of datastores on a host determines how many mount points its nodes get (mount point count = node storage size / available datastore count). So if every ESX host has the same number of datastores, you should get the same N on every node, and you can then use /mnt/bigdata1, /mnt/bigdata2, ... /mnt/bigdataN for all nodes in the same node group.
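The aliasing loop that serengeti-ssh.sh runs on each node can be tried out locally against a scratch directory first (the scsi-* names here are made-up stand-ins for the real /mnt/scsi-* mounts, and sudo is dropped since nothing privileged is touched):

```shell
# simulate two /mnt/scsi-* mount points in a scratch directory
root=$(mktemp -d)
mkdir "$root/scsi-aaa" "$root/scsi-bbb"

# same loop as the one-liner above: bigdata1, bigdata2, ... symlinks
x=1
for i in $(ls -d "$root"/scsi-*); do
  ln -s "$i" "$root/bigdata$x"
  x=$((x+1))
done

ls -l "$root" | grep bigdata   # bigdata1 -> scsi-aaa, bigdata2 -> scsi-bbb
```

Because `ls -d` returns the entries in sorted order, the numbering is stable across reruns on the same node.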

Cheers, Jesse Hu
jessehuvmw
Enthusiast

BTW, you can add your hack script or the scripts I provided to the file /opt/serengeti/sbin/format-disk.py in the node template; then, after the empty cluster is created, the mount points /mnt/bigdata1 ... /mnt/bigdataN will already be there.

Cheers, Jesse Hu
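Since format-disk.py is Python, the symlink step could also live there natively rather than shelling out; a hedged sketch (the real script's structure is not shown in this thread, so `alias_mount_points` is a hypothetical helper, not an existing function):

```python
import os

def alias_mount_points(mnt_root="/mnt"):
    """Create bigdataN symlinks for each scsi-* mount point under mnt_root.

    Hypothetical addition to format-disk.py: mirrors the serengeti-ssh.sh
    one-liner from this thread, numbering the sorted scsi-* entries.
    """
    scsi_mounts = sorted(
        d for d in os.listdir(mnt_root) if d.startswith("scsi-")
    )
    for n, name in enumerate(scsi_mounts, start=1):
        link = os.path.join(mnt_root, "bigdata%d" % n)
        if not os.path.lexists(link):  # idempotent across reruns
            os.symlink(os.path.join(mnt_root, name), link)
```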
stevehoward2020
Enthusiast

Thanks for the tips guys, very helpful!

charliejllewell
Enthusiast

Hi both, we have also run into this issue, and there is another solution: the one BDE actually uses when deploying via an Ambari app manager.

In Ambari you can create node groups, which can carry config that is bespoke to them and be assigned to specific hosts, so each host can reference its correct mount points.

This may be preferable depending on how you are doing capacity management: if you add more disks to your existing ESXi hosts, new datanodes will span a larger number of disks, meaning more mount points, or, more importantly, mount points that don't exist on the other nodes, leaving you in the same situation.

Also Jesse, thanks for the command... extremely useful! I think it missed out the counter increment though :)

serengeti-ssh.sh 1_395_1_22fb21 'x=1; for i in `ls -d /mnt/scsi-*` ; do sudo ln -s $i /mnt/hdpdata$x ; x=$(($x+1)); done'

jessehuvmw
Enthusiast

Thanks a lot, Charlie. I added the 'x' counter increment to my comment.

Glad to know this solution works for you both.

Cheers, Jesse Hu