EMCworld Pebble Watchface

Some of you may have heard of the Pebble watch: a watch that connects to your Android phone or iPhone and displays texts and phone calls, and that recently gained a C-based SDK for writing your own watch faces.  I’m sure at least some of you have one.

I decided to get a little creative and built a watch face for EMCworld.  See below, but basically it shows the EMC/EMCworld logos, the current time, and, at the bottom, two running totals: the amount of storage EMC has shipped so far today (since midnight local time, extrapolated from our 2012 annual report) and the amount of new data created today (using IDC report numbers).  The values update every minute, which is pretty fun.  I plan to wear it to EMCworld this year.

IMG_0223

If anyone wants the watch face, just click this link from your phone with the Pebble app installed: EMCworld

If you want the source, it’s available here on GitHub:

Unisphere for VMAX API & Carbon/Graphite

Recently, a customer asked me how they could get performance data from their Symmetrix array into Graphite, their metrics graphing tool of choice.

I originally went straight to my old standby, ‘symstat’, which is good but doesn’t produce easy-to-parse output.  Unlike many other SYMCLI commands, it has no XML output option, so parsing the data becomes quite an exercise.  Nonetheless, I did it, and you can find the file here:

https://github.com/mcowger/randompython/blob/master/symmcarbon.py

However, I decided I’d rather use the newly-available Unisphere for VMAX REST API.  I found some interesting things.  You’ll need Unisphere for VMAX 1.5.1 or later to use this.

Without further ado, here is a script that pulls the total system IOPs from the Symmetrix and pushes it into Carbon/Graphite using the MetricLineReceiver method.

https://github.com/mcowger/randompython/blob/master/symmREST.py
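
For readers who have not fed Carbon before, here is a minimal sketch of the plaintext line protocol that the line receiver accepts: one "path value timestamp" line per metric, sent over TCP. The hostname, the port (2003 is Carbon's usual plaintext port) and the metric path are placeholder assumptions, and this is not the code from the script linked above.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one Carbon plaintext line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, int(timestamp))


def send_metric(line, host="graphite.example.com", port=2003):
    """Send one plaintext line to Carbon over TCP (host and port are placeholders)."""
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()


if __name__ == "__main__":
    # Hypothetical metric path and value, for illustration only.
    print(format_metric("symmetrix.array01.total_iops", 12345, 1367000000), end="")
```

In a real collector you would call send_metric() once per polling interval with the value pulled from the array.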

 

Some Quick pySphere Code Demos

A couple of requests came in over the internal EMC vSpecialist lists last week to demonstrate two things with the vSphere API.

The first was a request for a demonstration of how to extract disk performance data for individual VMs.  Here’s a Python script using the pysphere library to accomplish that:

# Import the pysphere library (lets Python talk to vSphere)
# Note: this script uses Python 2 print statements.
from pysphere import VIServer
from pysphere.resources import VimService_services as VI

# Create an object to work with
server = VIServer()

# Connect to the server
server.connect("IP.ad.dre.ss", "Administrator", "AReallyStrongPassword")

# Get the list of all VMs (their VMX paths) in the vCenter that are powered on.
vmlist = server.get_registered_vms(status='poweredOn')

# For each path...
for vmpath in vmlist:
    # Get the current performance manager object (it changes, so we can't just instantiate it once)
    pm = server.get_performance_manager()
    # Get an actual VM object from the path
    vm = server.get_vm_by_path(vmpath)
    # Get the managed object reference for the VM, because the performance manager only accepts MoRefs
    mor = vm._mor
    # Get all the counters and their current values for that VM.
    counterValues = pm.get_entity_counters(mor)
    # Do some quick math on the values.
    # They come to us in a convenient dictionary form.
    # Values are described here: http://www.vmware.com/support/developer/vc-sdk/visdk41pubs/ApiReference/virtual_disk_counters.html
    IOPs = counterValues['virtualDisk.numberReadAveraged'] + counterValues['virtualDisk.numberWriteAveraged']
    BandwidthKBs = counterValues['virtualDisk.read'] + counterValues['virtualDisk.write']
    ReadLatency = counterValues['virtualDisk.totalReadLatency']
    WriteLatency = counterValues['virtualDisk.totalWriteLatency']
    # Print them out.
    print "VM Name", vm.get_property('name')
    print " IOPs", IOPs
    print " Bandwidth(KB/s):", BandwidthKBs
    print " Read Latency (ms):", ReadLatency
    print " Write Latency (ms):", WriteLatency
    print "-------"

The next was a request to show the datastore overprovisioning rate:

# Use true division (the Python 3 behavior) rather than Python 2's integer division
from __future__ import division
# Import the pysphere library (lets Python talk to vSphere)
from pysphere import *

# Create an object to work with
server = VIServer()
# Connect to the server
server.connect("IP.addy.goes.here", "Administrator", "P@ssword1!")

# Get all the datastores and their names.
for ds, name in server.get_datastores().items():
    # Create an object that contains all the properties
    props = VIProperty(server, ds)
    # Not all datastores have the uncommitted property (NFS, for example),
    # so we wrap this in a try/except to prevent errors.
    try:
        capacity = props.summary.capacity
        free = props.summary.freeSpace
        # Per the API docs: "Total additional storage space, in bytes, potentially used by all
        # virtual machines on this datastore. The server periodically updates this value."
        uncommitted = props.summary.uncommitted
        used = capacity - free
        # Combine what's currently used with what could be used.
        overprov = used + uncommitted
        overprovPercent = (overprov / capacity) * 100
        print "Datastore", name
        print " ", "Capacity (MB) : ", capacity / 1024 / 1024
        print " ", "Free (MB) : ", free / 1024 / 1024
        print " ", "Used (MB) : ", used / 1024 / 1024
        print " ", "OverProvisioning (MB) : ", overprov / 1024 / 1024
        print " ", "OverProvisioning % : ", overprovPercent
    except AttributeError:
        pass
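
To make the overprovisioning arithmetic concrete without a vCenter to hand, here is a small standalone sketch of the same calculation. The datastore sizes are made-up sample values, not measurements from any real environment.

```python
GIB = 1024 ** 3


def overprovisioning(capacity, free, uncommitted):
    """Return (provisioned bytes, provisioned as a percentage of capacity)."""
    used = capacity - free
    provisioned = used + uncommitted
    return provisioned, provisioned / capacity * 100


# Hypothetical datastore: 1000 GiB capacity, 400 GiB free,
# 900 GiB promised to thin-provisioned VMs but not yet written.
provisioned, percent = overprovisioning(1000 * GIB, 400 * GIB, 900 * GIB)
print("Provisioned (GiB):", provisioned / GIB)
print("Overprovisioning %:", percent)
```

Anything over 100% means the VMs on that datastore could, in principle, ask for more space than it has.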

 

Things I’ve Learned About Pebble

This is an ongoing blog post about things I’ve learned about the Pebble Watch by pulling apart the firmware.

EDIT:  Much of this information has been merged into the wiki here:

http://wiki.skumler.net/wiki/Pebble

  1. The firmware is a simple zip file, with a JSON manifest, a resources file and the firmware file
  2. The current firmware seems to have been codenamed ‘tintin’
  3. Looking at the strings in the firmware file, it appears that the third-party Bluetooth stack they use is the Stonestreet One ‘Bluetopia’ stack
  4. The processor is an STM STM32 F2 Cortex M3 (we already knew it was Cortex M3, but not the specific model, I don’t think).
  5. The running environment appears to be FreeRTOS.
  6. It looks like there are specific handlers for various message types.  At a minimum are: email, facebook, SMS, and ‘pebble beach’.
  7. It appears there is a maximum of 4 alarms you can set.
  8. There are some funny debug log comments: “Seconds until alarm is crazy, got %d seconds”, “No one cleaned up after themselves before?”, “Message is a command. Pebble don’t take commands from nobody.” and “WTF: This should never happen”.
  9. Alarm snooze options appear to be:
    10 minutes
    20 minutes
    45 minutes
    1 hour
    1 hour, 15 minutes
    1 hour, 30 minutes
    2 hours
  10. It has an airplane mode that’s not visible in settings (at least that I can find): “Airplane Mode”  EDIT: It’s been found if you toggle bluetooth mode.
  11. There appears to be Mfg test code left in:
    Mfg Func Test Battery
    Mfg Func Test Black
    Mfg Func Test Buttons
    Mfg Func Test Version
  12. There are LOTS of assertions, and that’s a good thing.
  13. Some cool ASCII art:^
    / //o__o
    / / __/
    ______ /
    /
    —-
  14. Looks like there may be some sort of App->Watch authentication: “Authentication Version: %d.%”
  15. The code appears to be built with Eclipse
  16. It’s built with UTF-8, which suggests future support for RTL and other languages.
  17. There are hints that it might support OBEX – which makes sense.
  18. There are references to text rendering commands, which should make life easier for watchface makers.
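
If you want to poke at a firmware bundle yourself, the approach behind most of the findings above can be sketched in a few lines: open the zip, read the JSON manifest, and pull printable strings out of the binary. The member names used here ("manifest.json", "firmware.bin") are assumptions for illustration; check the actual names in your bundle.

```python
import io
import json
import re
import zipfile


def inspect_bundle(bundle, manifest_name="manifest.json"):
    """Return (member names, parsed manifest) for a zip bundle (path or file object)."""
    with zipfile.ZipFile(bundle) as zf:
        names = zf.namelist()
        manifest = json.loads(zf.read(manifest_name).decode("utf-8"))
    return names, manifest


def printable_strings(data, min_len=6):
    """Yield runs of printable ASCII of at least min_len bytes, like the `strings` tool."""
    pattern = re.compile(b"[\x20-\x7e]{" + str(min_len).encode("ascii") + b",}")
    for match in pattern.finditer(data):
        yield match.group().decode("ascii")


if __name__ == "__main__":
    # Build a tiny stand-in bundle in memory so the sketch runs without a real firmware file.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("manifest.json", json.dumps({"example": True}))
        zf.writestr("firmware.bin", b"\x00\x01WTF: This should never happen\x00\xff")
    buf.seek(0)
    names, manifest = inspect_bundle(buf)
    print(names, manifest)
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        for text in printable_strings(zf.read("firmware.bin")):
            print(text)
```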

I haven’t been able to figure out the format of the actual firmware BIN file yet, but it’s probably ARM-specific, given the Cortex.  I will have to do some FreeRTOS and ARM research.  It’s 432KB.

Same for the system resources file: I haven’t been able to pin down the format, but it’s 144KB.

A very unhappy, unhealthy VMAX array

This is what happens, children, when you think you are smarter than the array auto-tiering software and force all your LUNs onto specific disks. You end up with terrible balance and bad performance.

Quick Legend: This is a timelapse of 24 hours in the life of the array.  The bottom rows (the small squares) are the individual disks; the hot upper-left section is all SATA drives, and the rest are FC drives.  Clearly a pretty strong imbalance.  The larger squares above that are the actual CPUs that control the data flow.

 

Part 5: Building A Large VMAX & NFS Environment

This is the last post in a series describing the design considerations of a VMAX + VG8 NFS environment for running a significant vSphere farm.  I’ve discussed the requirements and the high-level design for the VMAX and VG8 components, and described the implementation details for the VMAX.  This post covers the same implementation details for the VG8 gateway.

Installation

Installation of the VG8 is very straightforward if you follow the instruction manual available on the support site.  Every step is prompted and reasonably well described.

If you’ve properly completed your zoning and masking steps described in the previous articles, then the VG8 cluster will automatically find the control volumes and install to them.  One thing worth noting – the install process gives time estimates for the various parts of the process; I’ve found that these are almost universally wildly incorrect.  Some that estimate 2 minutes complete in less than a 10th of a second, while some that estimate 1 minute took 15.  Be prepared for the entire install process to take at least 4 hours (although 95% of that is hands off time you can head to lunch for).

I suspect that a huge part of this is the simple fact that each HBA on each datamover will be seeing / scanning nearly 2000 devices, and there are 16 HBAs.  Scanning 32,000 devices is quite time consuming for any operating system.

Disk Marking

Once the install procedure has completed, the VG8 will begin the disk marking procedure.  This mostly went without a hitch, but there was a small hiccup.  One of our TDEVs we had masked to the system was in a Not Ready state (I believe I had failed to bind it to a pool), and so the marking process choked on that disk when it was able to see it but not perform any IO.  For the more SCSI-minded of you, the device was returned in an INQ response, but responded with Check Conditions to any IO.

Once that small issue was sorted we continued to volume creation.

Volume Creation

The first issue we found was that the system had automatically created a disk pool and added all the devices into it!  We aren’t going to use automatic volume management (AVM) on this array (why? see below), so we don’t want this pool.  When we attempted to delete it, however, we were met with an ‘Invalid Operation’ error.  A quick call to support instructed us to simply create our desired stripe/meta structures, and the devices would be removed from the pool automatically.  When the pool was empty, it was automatically deleted.

Why don’t we want to use AVM?  Because it has a small but real performance overhead, and it also has a limit of 8 devices in a stripe set; we can do better manually to squeeze the last inch of performance out of this system.  Normally, I would recommend against fully manual volume management: it’s too much of a hassle for a small (<10%) performance gain.  However, in this case, once we set up this gateway*, it’s never going to change.  As a result, this one-time effort isn’t too much pain.

*famous last words

To create the devices, we first use the nas_volume command to build the raw devices (dVols) into a stripe set (of 17 members):

nas_volume -n stv_01 -S 262144 d500,d501,d502,d503,d504,d505,d506,d507,d508,d509,d510,d511,d512,d513,d514,d515,d516

This produces a stripe called stv_01 with a stripe size of 256KB from the 17 volumes listed.  You can determine the mapping between the dVol number and the Symmetrix device number using the nas_disk -l command.  In this case, because all my devices are TDEVs virtualized in a FAST VP pool, they are all the same and I don’t have to worry about spreading across buses, etc.  The Symmetrix handles that for us.
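
As a rough illustration of what a 17-member stripe with a 256KB stripe size means for data placement, here is a sketch of generic round-robin striping. It assumes the members are d500 through d516 in order and that consecutive stripe units rotate through them; it is a textbook model, not a statement about how DART lays data out internally.

```python
STRIPE_SIZE = 262144                              # bytes (256 KB), the -S value used above
MEMBERS = ["d%d" % n for n in range(500, 517)]    # assumed member order: d500..d516


def member_for_offset(offset, stripe_size=STRIPE_SIZE, members=MEMBERS):
    """Return the stripe member that holds the given byte offset of the volume."""
    stripe_unit = offset // stripe_size
    return members[stripe_unit % len(members)]


if __name__ == "__main__":
    for offset in (0, STRIPE_SIZE, 16 * STRIPE_SIZE, 17 * STRIPE_SIZE):
        print(offset, member_for_offset(offset))
```

The point of the model is simply that any large sequential or random workload touches all 17 devices, rather than queueing behind one.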

Next, we create a single metavolume out of that stripe.  Why bother?  Well, just for the future, in case we need to expand things, we will have the option to expand the meta easily.

nas_volume -n mtv_01 -c stv_01

We create a meta called mtv_01 from the stripe stv_01.  Pretty straightforward stuff.

Last, we create a filesystem on the metavolume:

nas_fs -name fs_01 -create mtv_01 log_type=split -option nbpi=32768

To break this down:

  • -name fs_01 – the name for the filesystem.  This isn’t the same as the mount point.
  • -create mtv_01 – create a new filesystem backed by volume mtv_01 (our meta)
  • log_type=split – this is a new option for 7.1 DART code that puts the intent log in the volumes themselves (rather than the control volumes) to prevent hot spots.
  • -option nbpi=32768 – this option changes the number of inodes that are created.  The default is to create 1 inode every 8K (8192), which is reasonable for a filesystem that will contain Word documents, etc.  However, ours will contain giant 100GB VMDKs, so we don’t need to waste so much metadata memory on inodes that we’ll never use, and we crank it down to one every 32K.
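
To put numbers on the nbpi change, here is a quick back-of-the-envelope sketch. It assumes an 8 TiB filesystem (roughly the size planned for each filesystem in this series) and treats nbpi as plain bytes per inode; real inode counts will differ somewhat once filesystem overhead is accounted for.

```python
TIB = 1024 ** 4


def inode_count(fs_bytes, bytes_per_inode):
    """Approximate number of inodes created for a filesystem of the given size."""
    return fs_bytes // bytes_per_inode


fs_size = 8 * TIB                                 # assumed filesystem size
default_inodes = inode_count(fs_size, 8192)       # default: one inode per 8K
tuned_inodes = inode_count(fs_size, 32768)        # nbpi=32768: one inode per 32K
print("default:", default_inodes)
print("tuned:  ", tuned_inodes)
print("reduction factor:", default_inodes // tuned_inodes)
```

A fourfold reduction still leaves hundreds of millions of inodes, which is far more than a datastore full of large VMDKs will ever use.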

Last, we mount the filesystem:

server_mount server_2 -option uncached fs_01 /fs_01

The uncached option prevents the gateway from attempting write coalescing; disabling it significantly improves response time on random IO workloads.

And with that, we are done: our VMAX is carved, our filesystems are mounted, and we are ready to access these filesystems using our regular connections.

I hope you’ve enjoyed this series, and if I’ve missed an important part, please let me know in the comments so I can add it in!

Part 4: Building A Large VMAX & NFS Environment

In previous parts of this series, I’ve described the requirements for this project, as well as the overall design features and restrictions of the VG8 gateway and VMAX parts of this design. In this part, we’ll go in depth into the configuration required to accomplish the design on the VMAX. This will be very far in the weeds, so prepare yourselves!

(note: While I prepared this script for use with Solutions Enabler, the customer decided to do this work in Unisphere for VMAX (the web GUI). As a result, this set of scripts and commands is untested, and probably contains a couple of minor syntax errors. Consider them a guide, not a recipe.)

BIN File

The first setup step for any Symmetrix system is the creation of the BIN file, which defines to the system all of its features, available parts (like disks) and basic configuration. I’ll agree up front this is a bit of an anachronism compared to other systems (even our own VNX). Nonetheless, it exists mostly because it enables EMC to squeeze the last tiny bit of configurability and performance from the system. It also allows EMC to predefine for customers, if they like, the configuration of volumes for certain use cases. We’ve made use of that here.

Default Devices

The first thing we’ve defined in this BIN file are the TDAT devices which make up the pools of storage that will be used later. While this can be done by the customer at any time later, we chose to do it here, in this step, to speed up the customer’s deployment process. Specifically, nearly 100% of every drive in the system (EFD, FC, and SATA included) has been allocated via TDAT devices. We chose the number of TDAT devices per disk based on some internal best practices and restrictions, but generally we follow the rule that we like to see ~8-12 TDAT virtual devices for EFD and FC drives, and around 25 for SATA drives. Note: These are special ‘TDAT’ devices, and exist solely to back the storage pools. We will create customer-sized devices later as thin devices (called TDEVs).

Also, because we knew that this entire frame would be used exclusively by the VG8 gateway, we also precreated the ~6 devices required for the gateway to store its configuration data (about 60GB worth of devices).

Pool Creation

First, we create and name the pools of disk we will need using a symconfigure script:

--- createpool.txt ---
 create pool FC15_600GB_R10,type=thin;
 create pool EFD_400GB_3R5,type=thin;
 create pool SATA_2TB_R10,type=thin;
 ---
 symconfigure -sid XXXX -f createpool.txt commit

Note that at this time the pools are empty – they have no actual storage backing them. We’ll correct that next:

--- fillpools.txt ---
 add dev 144:783 to pool FC15_600GB_R10, type=thin, member_state=ENABLE;
 add dev 784:A33 to pool EFD_400GB_3R5, type=thin, member_state=ENABLE;
 add dev A34:C13 to pool SATA_2TB_R10, type=thin, member_state=ENABLE;
 ---
 symconfigure -sid XXXX -f fillpools.txt commit

But wait, you say: where did those hex numbers come from? For those of you without Symmetrix experience, those are the hex identifiers we use for identifying volumes. Rather than just calling a device LUN 340, we call it 0x154 (or 154 for short). In fact, these are the identifiers of the pre-created TDATs in the BIN file I described above. If we pre-create them, we know what the IDs will be, and can provide them to the customer for preparation. Most of the configuration lines above are self-explanatory; member_state=ENABLE causes the devices to be enabled for use by the pool as soon as they are added. Some customers might not want them to be auto-enabled (in case they have more to add in a short time, for example), so we don’t presume.
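
If the hex ranges feel opaque, a few lines of Python make them easy to sanity-check. This sketch only does the arithmetic on the ranges quoted above (treating them as inclusive hex ranges); it does not talk to the array.

```python
def range_count(dev_range):
    """Number of devices in an inclusive hex range such as '144:783'."""
    start, end = (int(part, 16) for part in dev_range.split(":"))
    return end - start + 1


if __name__ == "__main__":
    print("0x154 in decimal:", int("154", 16))
    for pool, dev_range in [
        ("FC15_600GB_R10", "144:783"),
        ("EFD_400GB_3R5", "784:A33"),
        ("SATA_2TB_R10", "A34:C13"),
    ]:
        print(pool, dev_range, range_count(dev_range), "devices")
```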

AutoMeta Configuration

Symmetrix limits a single device to 240GB (yes, this is also a bit anachronistic). This is a bit small for most uses, so it also supports the creation of larger virtual devices by combining these smaller devices together. The resulting device is called a Meta or a MetaLUN.

Now, the user can manually create those if they wish, but manually creating devices and combining them is a pain in the butt, and easily repeatable, so let’s let the computer do it. By enabling ‘autometa’, we simply specify a device size above which the system should create these meta devices on its own, and a couple of parameters under which it should do so:

--- autometa.txt ---
set auto_meta=ENABLE;
set auto_meta_config=STRIPED;
set auto_meta_member_size=262668 CYL;
---
symconfigure -sid XXXX -f autometa.txt commit

To break this down a bit:

  • “set auto_meta=ENABLE” – This enables the autometa system as a whole
  • “set auto_meta_config=STRIPED” – All new metas should be written to in a striped fashion, rather than concatenated together. This is a performance option that should almost always be set.
  • “set auto_meta_member_size=262668 CYL” – metas should be created using 262668 cylinder parts. I could have just as easily specified this in GB, but I wanted to use the absolute maximum value (this happens to equate to approx 240 GB)
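
For anyone wondering how cylinder counts map to capacity, here is a small conversion sketch. It assumes a VMAX cylinder is 15 tracks of 128 blocks of 512 bytes (960 KB per cylinder); that geometry is my assumption for illustration, so verify it against your Enginuity documentation before relying on it.

```python
BYTES_PER_BLOCK = 512
BLOCKS_PER_TRACK = 128        # assumption: 64 KB tracks
TRACKS_PER_CYLINDER = 15      # assumption
CYLINDER_BYTES = BYTES_PER_BLOCK * BLOCKS_PER_TRACK * TRACKS_PER_CYLINDER


def cylinders_to_gib(cylinders):
    """Convert a cylinder count to GiB under the geometry assumed above."""
    return cylinders * CYLINDER_BYTES / 1024 ** 3


if __name__ == "__main__":
    for cyl in (262668, 525336):
        print("%d CYL is about %.2f GiB" % (cyl, cylinders_to_gib(cyl)))
```

Under that assumption the auto_meta member size lands at roughly 240 GB, and twice that many cylinders gives a device of roughly 480 GB.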

TDEV Creation

Now we need to create the actual devices that our VG8 gateway will store data on. We ended up deciding on 480 GB devices (more on how we got to that number elsewhere in the series, but suffice to say that it struck a good balance between performance and manageability). In total, we will need 476 of these devices (again, how we got there was explained in part 2).

--- createtdevs.txt ---
create dev count=476,size=525336 CYL,emulation=CELERRA_FBA,config=TDEV,binding to pool=FC15_600GB_R10,preallocate size=200 GB;
bind tdev C15:C1A to pool FC15_600GB_R10,preallocate size = ALL;
---
symconfigure -sid XXXX -f createtdevs.txt commit

Allow me to break this one down as well:

  • create dev count=476 – create 476 of the following device
  • size=525336 CYL – They should be this size. This happens to be 480GB. I could have specified it in GB, but again I am a control freak.
  • emulation=CELERRA_FBA – Devices that will be used by a VG8 gateway need some special care and options set, and this is basically a macro that sets them easily.
  • config=TDEV – create these as thin devices
  • binding to pool=FC15_600GB_R10 – bind these to the FC drive pool. By default new writes will go to this pool (unless FAST moves the relevant blocks to a different pool)
  • preallocate size=200 GB – preallocate 200GB of each device. We know that these will get used, so we preallocate a little less than half the space as a small initial performance boost.
  • The second line binds the pre-created control volumes to the same pool, so that they have backing.

With that, we’ve brought up our array with its BIN file, created the pools of storage and created the 476 thin devices needed. When the customer performed this task using the GUI (as a complete Symmetrix newbie), this part of the process took him about 20 minutes. Using the command line, this would probably take about 10 minutes.

Masking & Device Presentation

Next, we want to get the VG8 to see these devices we’ve created.  Note that because we presented the control devices directly to the ports using the BIN file, we don’t have to deal with them at this time.

VMAX uses a mechanism similar to VNX’s host groups/storage groups concept to present devices, but with significantly more flexibility.   To present devices, we create ‘masking views’, which are composed of initiator groups (host WWPNs), port groups (through which the hosts should access the storage) and storage groups (the devices that should be accessed).  Once we create these three constructs and link them together in a masking view, the VG8 will be able to access the storage.

Initiator Groups

Initiator groups are pretty easy to create with the command line (and via GUI):

We simply create the initiator record itself with a reasonable name and with the consistent_lun option (which guarantees that all initiators from the same host will see a given storage device on the same LUN value).

symaccess -sid XXXX -name Server2_IG -type initiator -consistent_lun create

Then we add each of the actual WWPNs from the server to the record:

symaccess -sid XXXX -name Server2_IG -type initiator -wwn 50:06:01:60:47:20:XX:YY add
symaccess -sid XXXX -name Server2_IG -type initiator -wwn 50:06:01:61:47:20:XX:YY add

We repeat this process for all 8 datamovers in the VG8.

Port Groups

I mentioned in part 2 how we planned to zone the ports of the VG8 gateway to the FA ports on the VMAX.  We simply use the same information here to inform the VMAX of the decisions made.

symaccess -sid XXXX  -name Server2_PG -type port -dirport  1e:0, 3e:0, 14e:0, 16e:0, 2e:0, 4e:0, 13e:0, 15e:0 create

Here we’ve specified a portgroup for server2 with 8 total ports (4 for each of the 2 HBAs in the datamover).  Again, repeat this for all 8 datamovers.

Storage Group

Now, we create the storage group.  Because all datamovers in a cluster should see the same devices, we need only a single storage group.

symaccess -sid XXXX -name ServerALL_SG -type storage devs 0C2C:E08

So, we create a storage group using the range nomenclature; C2C:E08 means ‘include all devices between C2C and E08, inclusive, in the group’.  How did I know the device numbers?  When I created the devices above using ‘create dev’, symconfigure output the new device range as part of its completion notice.

We’ll also create a storage group that contains just the control volumes and the gatekeeper:

symaccess -sid XXXX -name ServerALL_CV -type storage devs C15:C1A

Masking Views

Lastly, let’s create masking views to link all these together.

symaccess -sid XXXX create view -name Server2_View -sg ServerALL_SG -pg Server2_PG -ig Server2_IG -lun 010 -celerra

Here we’ve specified that we want a new view called Server2_View using the storage, port and initiator groups we’ve created.  We also specify which LUN # to start at (VG8 requires you start at 010) and that the system is a ‘Celerra’ (the older name for VG8).  Once this command finishes executing, server2 is able to see all its data storage.  Simply repeat this process for the remaining 7 datamovers.

Next we’ll add the control LUNs:

symaccess -sid XXXX create view -name Server2_View_CV -sg ServerALL_CV -pg Server2_PG -ig Server2_IG -lun 00 -celerra

And repeat for all the datamovers.
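
Since the same two view commands have to be repeated for every datamover, a tiny generator script saves typing. This sketch just templates the two commands shown above; the ServerN naming for the other datamovers (Server3 through Server9) is my extrapolation, and like the rest of these commands the output is untested against a real array.

```python
VIEW_TEMPLATE = (
    "symaccess -sid {sid} create view -name Server{n}_View{suffix} "
    "-sg {sg} -pg Server{n}_PG -ig Server{n}_IG -lun {lun} -celerra"
)


def view_commands(sid="XXXX", datamovers=range(2, 10)):
    """Return the data view and control-volume view commands for each datamover."""
    commands = []
    for n in datamovers:
        # Data devices start at LUN 010.
        commands.append(VIEW_TEMPLATE.format(
            sid=sid, n=n, suffix="", sg="ServerALL_SG", lun="010"))
        # Control volumes start at LUN 00.
        commands.append(VIEW_TEMPLATE.format(
            sid=sid, n=n, suffix="_CV", sg="ServerALL_CV", lun="00"))
    return commands


if __name__ == "__main__":
    for command in view_commands():
        print(command)
```

Review the printed commands before pasting them anywhere; the script only builds strings and never runs them.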

Tiering

The last step is to set up the tiering mechanisms, because we want the system to put its EFD, FC and SATA drives to their best possible use.  As it stands at this moment, nothing will ever move from FC, because tiering is not enabled.  Let’s fix that.

Tier Identification

First, we need to tell the system which pools to use for tiering, and what they are.  Because we only have 3 pools and they should all be used for tiering, this is pretty easy:

symtier -sid XXXX create -name FC15_600GB_R10_TIER -tgt_raid1 -technology FC -vp
symtier -sid XXXX create -name EFD_400GB_3R5_TIER -tgt_raid5 -technology EFD -vp
symtier -sid XXXX create -name SATA_2TB_R10_TIER -tgt_raid1 -technology SATA -vp

To break down the first line:  We want to create a tier called FC15_600GB_R10_TIER which should use only pools that are RAID1 and FC disks that are VP pools (there are other kinds of non-VP pools, but they are rarely used).

Next, we enable the FAST system altogether, and allow it to move data automatically as it sees fit.  It’s possible to require a human to approve all moves, but to me, that kind of defeats the purpose:

symfast -sid XXXX enable -vp
symfast -sid XXXX set -control_parms -approval_mode AUTO_APPROVE
symfast -sid XXXX set -control_parms -vp_data_move_mode AUTO 
symfast -sid XXXX set -control_parms -vp_allocation_by_fp ENABLE

Third, we create a FAST policy using the tiers from above.  By default, a policy created using three tiers will allow any and all movement of data between tiers without limits.  You can set limits (for example, to prevent dev/test from chewing up too much EFD), but in this case, it’s not needed.

symfast -sid XXXX -fp create -name PA_VG8_01
symfast -sid XXXX -fp add -tier_name FC15_600GB_R10_TIER
symfast -sid XXXX -fp add -tier_name EFD_400GB_3R5_TIER
symfast -sid XXXX -fp add -tier_name SATA_2TB_R10_TIER

Last, we associate our storage group with this policy.  Upon association, we’ve told the array to apply the data movement rules specified in the policy to the storage group, and to begin work.

symfast -sid XXXX associate -sg ServerALL_SG -fp_name PA_VG8_01 -priority 2

Note that we did not assign a FAST policy to the Control Volume devices, because those are required (as per the manual) to stay on FC mirrored drives.

And with that, we are 100% complete carving and allocating storage to the VG8 system.  All told, this process took a newbie administrator with no previous Symmetrix experience less than an hour to complete, including significant discussion about what the options mean.  If I were to do it using just the command line, it would likely be closer to 30 minutes.

Part 3: Building A Large VMAX & NFS Environment

In Part 1 of this series, I discussed the requirements for this design and an overview.  In Part 2, I went into depth on the design questions and decisions for the gateway side of the solution.  In this, Part 3, I will do the same for the VMAX side.

Make / Model / Color

Model: The VMAX is available in a number of different models.  They all run the same operating system (Enginuity), but they have different performance, scaling and feature levels.  For this particular environment, necessity dictated the model.  With a requirement of 400K frontend IO/s, we needed a system that can keep up.  In a RAID1 system every frontend write becomes two backend writes, so for a write-heavy workload 400K IO/s at the front end of the array corresponds to nearly 800K on the backend, or a total of about 1.2M IO/s.  Only the largest VMAX 40K would be able to reach these numbers while also meeting the other requirements: sub-3ms response times and 0% performance degradation during a component failure.
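
The frontend-to-backend math can be sketched as below. This is the standard write-penalty rule of thumb (2 backend IOs per write for RAID1) and it deliberately ignores cache hits and write folding, so treat it as a ceiling estimate rather than a sizing tool; the write fractions are illustrative, not measured.

```python
def backend_iops(frontend_iops, write_fraction, write_penalty=2):
    """Estimate backend IO/s from frontend IO/s, ignoring any cache effects."""
    reads = frontend_iops * (1 - write_fraction)
    writes = frontend_iops * write_fraction
    return reads + writes * write_penalty


if __name__ == "__main__":
    frontend = 400000
    for write_fraction in (0.5, 1.0):
        backend = backend_iops(frontend, write_fraction)
        print("writes %3d%%: backend %d, frontend + backend %d"
              % (write_fraction * 100, backend, frontend + backend))
```

At an all-write extreme the backend sees 800K IO/s, which is where the "nearly 800K" and "about 1.2M total" figures come from.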

Options: We also elected to use the largest cache configuration available; 2TB.  While not cheap by any means, a bursty, write-heavy workload like this one can really benefit from the extra breathing room.  We also chose to go with a large, 8 engine (aka 16 director) system for maximum CPU horsepower.

Color: The VMAX comes in any color you like, as long as it’s black with blue light.

Disks & Protection

Obviously, we have to put some serious disk muscle behind this thing to need all that performance, and so we did.  The final disk configuration looks like this:

  • EFD: 400 GB eMLC.  Count: 352.   RAID: R5 3+1

    Obviously the fastest disk option is the EFDs.  In order to maximize available space, we chose the 400GB model, which unlike the 200/100GB models uses MLC, not SLC, flash technology.  We rate the performance of these drives as equal to the regular SLC drives, however.  We also rate their longevity at a similar level to SLC, due to the overprovisioning we do on the drives (i.e. they contain substantially more flash than the rated 400GB).

    We chose RAID5 3+1 protection for these drives.  Normally, for a write-intensive workload this would be a poor choice, but we anticipate that the raw speed and number of these drives, along with the substantial impact of the DRAM cache, will make this a non-issue.  Remember, again, that the goal for these was not raw performance, but consistent performance.  So if we spend an extra 2ms (still sub-3ms) due to RAID5 overheads but gain consistency across a much larger set of data (due to RAID5’s capacity benefit compared to RAID10), we’ve won.

  • FC: 600 GB 15K.  Count: 408.   RAID: R10

    Next, we needed a significant amount of space to land recently-used-but-currently-idle data on.  We needed north of 100TB to do this, so we ended up with just over 400 FC drives.  It also helps that a large 8-engine system requires a minimum of 320 drives, and those drives should be regular magnetic media (just to avoid using pricey EFDs for Vault space that is rarely used).

    We decided on RAID10 to maximize performance.

  • SATA: 2TB 7.2K.  Count: 124.  RAID: R10

    Last, we needed approx. 100TB to store very old datasets that were unlikely to ever be accessed again.  For these kinds of workloads, SATA is pretty clearly the best choice, so we went with it.

    Again, R10 to protect the data.  We decided against RAID6 due to the write performance penalty.
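
As a rough cross-check of the capacity figures above, here is a sketch that multiplies drive count, drive size and RAID efficiency for each tier. It uses decimal units and ignores hot spares, vault space and formatting overhead, so real usable capacity will be somewhat lower.

```python
# (tier, drive count, drive size in GB, usable fraction after RAID)
TIERS = [
    ("EFD", 352, 400, 3 / 4),    # RAID5 3+1: three data drives per parity drive
    ("FC", 408, 600, 1 / 2),     # RAID10: mirrored
    ("SATA", 124, 2000, 1 / 2),  # RAID10: mirrored
]


def usable_tb(count, size_gb, efficiency):
    """Approximate usable capacity in TB (decimal), before spares and overhead."""
    return count * size_gb * efficiency / 1000


if __name__ == "__main__":
    for name, count, size_gb, efficiency in TIERS:
        print("%-5s %6.1f TB" % (name, usable_tb(count, size_gb, efficiency)))
```

All three tiers come out a little over 100TB, which lines up with the sizing targets described for the FC and SATA tiers.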

Tiering

So with the disks out of the way, we need to talk about tiering.  No particular tier of disks in this system can accommodate the full workload, so we have to tier it.  Also, we want to.  We don’t want to have the admins for this environment manually moving hundreds of VMs around per day.

So, we’ll run this entire thing with FAST VP (EMC’s goofy marketing name for tiering technology on VMAX).

Of course, it goes without saying at this point that the entire environment will be wide striped across all relevant disks (that’s the VP, or Virtual Provisioning, part).  At this point, with the latest Enginuity code (5876), VP has very little overhead, and lots of advantages (like wide striping, more granular tiering, etc).

Next Up, in Part 4, we’ll discuss the actual commands and process required to make the above possible.

Part 2: Building A Large VMAX & NFS Environment

In Part 1 of this series, I discussed the requirements that drove the decision to build a pretty massive VMAX+VG8 system.

In this, Part 2, I’d like to discuss how the system will be designed internally and externally.  Most of these decisions are EMC best practices, but I will try to call out when those have been violated, and why.

Load Balancing

The customer requirements dictated NFS storage access, which is the reason the VG8 gateway exists at all in this system.  The customer intends to have 7 clusters of compute, with each cluster accessing its own subset of the storage.  It’s important to note that the workload here is highly uniform; all 256+ hosts will be running hundreds of copies of nearly identical virtual machines, and so many technologies designed to balance load are not really useful here, because the load is balanced on its own.

Datamover Usage

So, to start with, the customer and I came up with a design where each of the 7 compute clusters is dedicated a single VG8 datamover blade.  This makes fault isolation and capacity planning much easier, so it’s basically a no-brainer.  We wouldn’t always recommend this, however: if the workloads weren’t naturally nearly perfectly balanced, we’d want to spread a cluster over multiple datamovers to prevent hotspots.  The last datamover (server_9) is reserved as a hot spare (or standby).

Filesystem Count

For ease of management reasons, the customer (and, in general, most customers) wanted to minimize the number of datastores assigned to a cluster; fewer points of management or potential screwup.  However, that has necessary limits when we consider the real world and how threading, etc works within the VNX datamovers.  As a result, we agreed on each datamover hosting 4 filesystems of ~8TB each.  This allowed for their automation processes to be efficient and for good performance on the datamovers themselves.

But those 4 x 8TB filesystems need to be backed by some kind of storage, and the VG8 is merely a gateway with no disk of its own (not even for its operating systems or control volumes!).  That backend disk will be accessed via Fibre Channel from the VMAX 40K.  We’ll get to how that beast is set up in the next post.

Backend Device Size & Count

But, for now, how do we want to create these filesystems?  Do we want to just create an 8TB filesystem out of 1 giant LUN from the array (ignoring the fact that a VMAX can’t create a single LUN that large)?  Probably not – we’d end up with all sorts of queueing problems on the backend, limiting performance.  So, instead, we spread the filesystem across multiple VMAX devices to ensure good balancing.  How many do we choose?

An initial design had the VMAX exporting 240GB devices (the maximum device size without resorting to META LUNs).  That, however, made the math a little rough, as that would have required 35 devices per filesystem.  Do all the multiplication:

35 devices / FS * 4 FS / datamover * 7 datamovers = 980 LUNs.

This isn’t so bad, but once you consider that each HBA on each datamover will have 4 paths through the VMAX to its LUNs (more on that later), you end up with each HBA needing to see 3920 devices, which is very close to its limit of 4096 per HBA, and would severely limit future filesystem growth if needed.

So, an updated design changed to using 17 devices per filesystem, which, going back through the math above (17 * 4 * 7 * 4), results in a total of 1904 devices per HBA, which is far more manageable.  This does require the use of META devices on the Symmetrix side, leading to a small increase in complexity.
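The device math above is easy to get wrong by hand, so here is a minimal Python sketch that reruns it for both layouts.  The constants come straight from the text; treating a filesystem as exactly 8TB in binary units is my assumption for illustration (the real filesystems are only "~8TB"), and the function and variable names are made up for this example.

```python
import math

# Figures taken from the text above.
FS_PER_DATAMOVER = 4       # filesystems hosted on each datamover
ACTIVE_DATAMOVERS = 7      # the eighth datamover is the standby
PATHS_PER_HBA = 4          # VMAX ports each datamover HBA can reach
HBA_DEVICE_LIMIT = 4096    # device limit per HBA quoted above

# Assumptions for illustration: binary units and exactly 8 TB per filesystem.
FS_SIZE_GB = 8 * 1024
MAX_NON_META_GB = 240      # largest device without resorting to METAs


def device_counts(devices_per_fs):
    """Return (total LUNs, devices visible per HBA) for a given layout."""
    total_luns = devices_per_fs * FS_PER_DATAMOVER * ACTIVE_DATAMOVERS
    return total_luns, total_luns * PATHS_PER_HBA


# Smallest number of 240 GB devices that covers one filesystem.
min_devices = math.ceil(FS_SIZE_GB / MAX_NON_META_GB)

for devices_per_fs in (min_devices, 17):
    total_luns, per_hba = device_counts(devices_per_fs)
    device_gb = FS_SIZE_GB / devices_per_fs
    headroom = HBA_DEVICE_LIMIT - per_hba
    print(f"{devices_per_fs} devices/FS: {total_luns} LUNs, "
          f"{per_hba} per HBA, {headroom} below the limit, "
          f"~{device_gb:.0f} GB per device")
```

Under those assumptions, the 17-device layout needs devices of roughly 480GB each, which is above the 240GB non-META maximum and is why META devices come into play.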

Device Count

Why 17?  Why not 18, or 16?  17 certainly seems like a strange number.  Well, due to how some of the volume striping within the DART filesystem works, an even number of devices can, under certain (uncommon, but possible) conditions, cause hotspots during metadata updates on the filesystem.  Given that this environment will likely see significant metadata traffic, we wanted to avoid that.  So, by choosing an odd number of devices (and, even better, a prime number), we can avoid this hotspot effect.
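To get a feel for why a prime count helps, here is a toy model in Python.  It is not how DART actually lays out metadata; it simply assumes stripe unit k lands on device k % n and that some workload touches stripe units at a fixed stride.  When the stride shares a factor with the device count, only a fraction of the devices ever get hit; with a prime count, no stride smaller than the device count can do that.

```python
def devices_touched(n_devices, stride):
    """Distinct devices hit by accesses spaced `stride` stripe units apart,
    assuming stripe unit k lands on device k % n_devices (a toy model)."""
    return len({(i * stride) % n_devices for i in range(n_devices)})


for n_devices in (16, 17, 18):
    # Worst case over every stride smaller than the device count.
    worst = min(devices_touched(n_devices, s) for s in range(1, n_devices))
    print(f"{n_devices} devices: worst-case stride touches {worst} device(s)")
```

In this model the number of devices touched is the device count divided by the greatest common divisor of the count and the stride, so 16 and 18 both have strides that collapse onto 2 devices, while 17 always spreads across all of them.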

Pathing Considerations

The last major part of the VNX VG8 gateway is to consider how it will be zoned/connected to the VMAX.  There are a huge number of options here, because the VG8 has 2 ports per datamover, and the VMAX has 128 total ports.

In general (and I will go into this further in the next post), we want to have a host (or gateway or whatever) use as many of the VMAX resources as it can, as unlike in VNX (and most ALUA arrays) there is no penalty for using all paths simultaneously.

So there are a total of 16 ports for the datamovers (8 DMs * 2 ports / DM).  Additionally, there are 64 ports on the VMAX we want to use (not all 128 – again a topic for the next post).   64 backend / 16 frontend  = 4:1 ratio.  In other words, we can make optimal use of our resources if we make sure that each VG8 datamover port can access 4 VMAX ports.

And so, that’s how we ended up zoning it out:

ServerName:   PortsForHBA0    |     PortsForHBA1
Server2: 1e0, 3e0, 14e0, 16e0 | 2e0, 4e0, 13e0, 15e0
Server3: 1f0, 3f0, 14f0, 16f0 | 2f0, 4f0, 13f0, 15f0
Server4: 1g0, 3g0, 14g0, 16g0 | 2g0, 4g0, 13g0, 15g0
Server5: 1h0, 3h0, 14h0, 16h0 | 2h0, 4h0, 13h0, 15h0
Server6: 5e0, 7e0, 10e0, 12e0 | 6e0, 8e0, 9e0, 11e0
Server7: 5f0, 7f0, 10f0, 12f0 | 6f0, 8f0, 9f0, 11f0
Server8: 5g0, 7g0, 10g0, 12g0 | 6g0, 8g0, 9g0, 11g0
Server9: 5h0, 7h0, 10h0, 12h0 | 6h0, 8h0, 9h0, 11h0

As you can see, each HBA has dedicated access to 4 VMAX ports.  We also use every one of the 64 VMAX ports we chose to use, meaning we get an optimal spread of resources across the array.
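A zoning table like this is worth sanity-checking before anyone touches the fabric.  The sketch below, in Python, parses the table exactly as printed above and checks three things: every HBA gets 4 ports, no VMAX port is shared between HBAs, and the two HBAs of a datamover never land on the same director.  The parsing helpers and the idea of reading the leading number of a port name (e.g. the 14 in 14e0) as the director are my assumptions for illustration.

```python
ZONING = """\
Server2: 1e0, 3e0, 14e0, 16e0 | 2e0, 4e0, 13e0, 15e0
Server3: 1f0, 3f0, 14f0, 16f0 | 2f0, 4f0, 13f0, 15f0
Server4: 1g0, 3g0, 14g0, 16g0 | 2g0, 4g0, 13g0, 15g0
Server5: 1h0, 3h0, 14h0, 16h0 | 2h0, 4h0, 13h0, 15h0
Server6: 5e0, 7e0, 10e0, 12e0 | 6e0, 8e0, 9e0, 11e0
Server7: 5f0, 7f0, 10f0, 12f0 | 6f0, 8f0, 9f0, 11f0
Server8: 5g0, 7g0, 10g0, 12g0 | 6g0, 8g0, 9g0, 11g0
Server9: 5h0, 7h0, 10h0, 12h0 | 6h0, 8h0, 9h0, 11h0
"""


def parse_zoning(text):
    """Map each server name to a pair of port lists (HBA0, HBA1)."""
    zoning = {}
    for line in text.strip().splitlines():
        server, ports = line.split(":", 1)
        hba0, hba1 = ports.split("|")
        zoning[server.strip()] = (
            [p.strip() for p in hba0.split(",")],
            [p.strip() for p in hba1.split(",")],
        )
    return zoning


def director(port):
    """Assumed naming: '14e0' means director 14, port e0."""
    return int(port[:-2])


zoning = parse_zoning(ZONING)
all_ports = [p for hbas in zoning.values() for hba in hbas for p in hba]

# Every HBA should see exactly 4 VMAX ports.
assert all(len(hba) == 4 for hbas in zoning.values() for hba in hbas)
# No VMAX port should be zoned to more than one HBA.
assert len(all_ports) == len(set(all_ports))
# The two HBAs of a datamover should not share a director.
for hba0, hba1 in zoning.values():
    assert not {director(p) for p in hba0} & {director(p) for p in hba1}

print(f"{len(zoning)} datamovers, {len(all_ports)} distinct VMAX ports zoned")
```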

In Part 3, we’ll discuss the setup of the VMAX itself.  In Parts 4 & 5, I’ll go over how to configure the VMAX, and then Part 6 will be the VG8 gateway.

Part 1: Building A Large VMAX & NFS Environment

I’ve recently been working with a customer to design and implement a pretty sizeable VMware environment that’s well in excess of 256 hosts.  I did much of the storage design for this, with some notable help from fellow EMCer Joel Sprouse for the NFS related stuff.  As I went through it, I thought it would be interesting to present the requirements and methods for building such an environment.

So, welcome to Part 1: The Requirements

  • Support at least 400K IO/s (small block, random, 30/70 R/W ratio) in a single system.
  • Support the above IO at less than 3ms latency.
  • Support NFS as the sole access method to the storage from the hosts.
  • Support N+1 failure scenarios with 0% performance loss.
  • Support above performance assuming near-worst case skew (in other words, assume tiering isn’t the optimal 5%/70%/25% mix where we see 5% of the space doing 90% of the IO).
  • Proven 5-nines or better availability.
  • No “science project” storage (i.e. GA gear only).

The EMC team looked at a number of our options when designing this.  Our initial idea, because the data on this system would be very dedupe-friendly, was to use XtremIO, our badass new deduping, all flash box.  However, that broke the last 2 rules – it’s not yet GA, and it hasn’t proven 5 nines in the field (yet!).

Next, we looked at our old standby of the VNX7500 platform.  It could certainly do the vast majority of the requirements, and do them well.  But, it missed on #1 – we can’t do 400K IO/s in a single VNX7500 – it just lacks the horsepower.

So, we set our sights on a third option – a VMAX 40K fronted with a VNX VG8 gateway.  This gives us all the requirements – plenty of IOPS, NFS access and all the availability that VMAX and VG8 are known for.  Eventually, we ended up with a configuration that looked like this:

VMAX

  • 40K – 8 Engines
  • 2TB total cache
  • 128 8GBit FC Ports
  • 352 400 GB EFD (SSD) drives.  8 of those are spares.
  • 408 600 GB 15K FC drives.  8 of those are spares.
  • 124 2TB 7.2K SATA drives.  4 of those are spares.
  • 100% using FAST and VP (Virtual Provisioning)
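For a rough sense of scale, the drive counts above can be turned into raw capacity per tier.  This Python sketch just multiplies non-spare drives by nominal drive size in decimal units; it deliberately ignores RAID protection and formatting overhead, since those choices are not covered in this post, so real usable capacity will be noticeably lower.  The dictionary layout and names are mine, for illustration only.

```python
# tier name: (total drives, spares, nominal drive size in GB), from the list above
TIERS = {
    "EFD": (352, 8, 400),
    "15K FC": (408, 8, 600),
    "7.2K SATA": (124, 4, 2000),
}


def raw_capacity_gb(total, spares, size_gb):
    """Raw capacity of the non-spare drives, before RAID or formatting."""
    return (total - spares) * size_gb


raw_by_tier = {name: raw_capacity_gb(*spec) for name, spec in TIERS.items()}
total_raw_gb = sum(raw_by_tier.values())

for name, gb in raw_by_tier.items():
    share = 100 * gb / total_raw_gb
    print(f"{name:>10}: {gb / 1000:7.1f} TB raw ({share:4.1f}% of total)")
print(f"{'Total':>10}: {total_raw_gb / 1000:7.1f} TB raw")
```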

VNX VG8

  • VG8 Model
  • 8 Datamovers
    • 2 x 10GbE Frontend
    • 2 x 8GBit FC Backend
  • 1 Datamover reserved as warm spare (aka standby)

In the next post in the series, I’ll outline the initial configuration of the VMAX, including the ‘BIN’ file.