21st Large Installation System Administration Conference (LISA 2007), November 11-16, 2007, in Dallas, TX
by John Strassner
John mentions two autonomic systems: HAL 9000 on one side, and the computers aboard the USS Enterprise on the other. While HAL decides that it does not need the humans anymore, the Star Trek computer takes commands from Jean-Luc, figures out how to solve the problem, and then asks back whether it should go ahead.
Autonomics is not about replacing people; it is about giving people better information so that they can do their jobs better. Through the use of autonomic systems, the decisions left to humans can become increasingly high-level.
The way to implement autonomic systems is to add a management layer on top of existing systems, controlling their operation, instead of expecting the existing systems to change. To limit complexity, John proposes introducing a meta layer directly above the systems that provides a vendor-neutral interface (a Model-Based Translation Layer) for the management system to access.
by Carson Gaspar
Carson runs a large-scale monitoring system based on Nagios at Goldman Sachs. The reasons for initially going with Nagios, back in 2003 when the project was started, were that it was free and had a well-developed ecosystem around it.
As they built their system, they ran into a number of problems:
The system did not scale to the hundreds of hosts they were looking at monitoring.
Nagios uses fork/exec to run external test scripts, which caps active testing at a maximum rate of about 1 test per second.
No integration with their existing monitoring systems
No automatic provisioning, which they needed since they were looking at 10'000-plus hosts.
No self monitoring.
They addressed the problems by:
Using passive checking, where Nagios reads test results from a named pipe, they raised the capacity to 1'800 updates per second.
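Passive results go through Nagios' external command file. A minimal sketch of the line format (the host, service, and pipe path below are illustrative, not details from the talk):

```python
import time

def passive_check_line(host, service, status, output):
    """Format a Nagios PROCESS_SERVICE_CHECK_RESULT external command.

    status: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(time.time()), host, service, status, output)

# A feeder process appends such lines to the command pipe Nagios reads;
# the pipe path depends on the installation:
#   with open("/usr/local/nagios/var/rw/nagios.cmd", "w") as pipe:
#       pipe.write(passive_check_line("web01", "http", 0, "OK - 120ms"))
```

Since Nagios only parses lines from the pipe instead of forking a process per test, many results per second can be submitted this way.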
Setting up multiple Nagios servers in an HA setup, they were able to guarantee availability of the monitoring system.
Data-driven configuration file generation: all config files get generated from the existing system configuration management infrastructure. This makes sure that clients and servers stay in sync and that new machines get provisioned automatically.
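Such generation can be sketched as follows; the inventory layout and template fields here are hypothetical stand-ins for whatever the site's configuration management database provides:

```python
# Hypothetical inventory; in practice this comes from the existing
# configuration management infrastructure.
INVENTORY = [
    {"name": "web01", "address": "10.0.0.11", "group": "webservers"},
    {"name": "db01",  "address": "10.0.0.21", "group": "databases"},
]

HOST_TEMPLATE = """define host {{
    use        generic-host
    host_name  {name}
    address    {address}
    hostgroups {group}
}}
"""

def render_hosts(inventory):
    # Regenerating the whole file from inventory keeps Nagios in sync
    # with the machines that actually exist.
    return "".join(HOST_TEMPLATE.format(**h) for h in inventory)
```

Because the config is derived rather than hand-edited, a host added to the inventory shows up in monitoring on the next generation run.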
Critical success factors were:
Good documentation, to further acceptance by other groups and departments of the company.
Extreme over-engineering relative to the initial deployment size, since this allowed the system to grow to the size it is running at now.
Watch the Nagios mailing lists for a planned source code release, expected at the beginning of 2008.
by Michael Haungs
CAMP provides a common API for gathering performance statistics on machines running Linux, Windows, and Solaris. The system is designed so that it can be extended to other OSes.
The different systems have different concepts regarding process identification and device naming.
To verify the correctness of their system, they created an accompanying test suite. The test suite generates actual load on the system and checks whether CAMP reports the expected results wherever possible. Following this path, they make sure CAMP is not just reproducing an error present in the OS's measurement interface.
Regarding future work, they are looking at integrating other platforms into the system. First in line is OSX.
A wiki for CAMP is available at http://wiki.csc.calpoly.edu/camp.
by Dave Plonka
Dave's system polls 176k targets at a 5-minute interval with mrtg.
When he started out, his system often took over 600 seconds to complete a polling cycle, well above the intended 5-minute cycle time.
Analyzing the system performance, he found that the system was (a) spending a lot of time in io-wait, and (b) doing about 6 reads for every write.
The first approach to improving performance was to collect several rrd updates in a journal and then apply them to each rrd file in one go every hour. With this approach they were able to bring the cycle time down to well under a minute.
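The journaling idea can be sketched like this; `rrdtool update` really does accept several timestamp:value pairs in one invocation, while the class around it is a simplified illustration:

```python
from collections import defaultdict

class RrdJournal:
    """Collect rrd updates in memory and flush them in one batch per file.

    Applying many updates in a single 'rrdtool update' invocation
    amortizes the cost of opening, reading, and rewriting each .rrd file.
    """
    def __init__(self):
        self.pending = defaultdict(list)

    def add(self, rrd_file, timestamp, value):
        self.pending[rrd_file].append("%d:%s" % (timestamp, value))

    def flush_commands(self):
        # One 'rrdtool update <file> <t:v> <t:v> ...' command per file.
        cmds = ["rrdtool update %s %s" % (f, " ".join(updates))
                for f, updates in sorted(self.pending.items())]
        self.pending.clear()
        return cmds
```

A poller would call `add()` every cycle and run the flushed commands once an hour.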
In a second round they investigated the massive read overhead. They found that Linux 2.6 does rather aggressive read-ahead on all file access, effectively causing the whole rrd file to be read into cache while only the currently active blocks of the file should be there. Interestingly enough, there is a pair of system calls, fadvise and madvise, which let you tell the OS that it should NOT read ahead. With this change alone they were able to achieve their performance goal as well.
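On Linux the same hint is reachable from Python via `os.posix_fadvise`; a small sketch of disabling read-ahead on a file descriptor:

```python
import os

def open_for_random_io(path):
    """Open a file and tell the kernel not to read ahead.

    POSIX_FADV_RANDOM disables the kernel's readahead for this file
    descriptor, so only the blocks actually touched end up in the page
    cache -- the behaviour the rrd workload wants.
    """
    fd = os.open(path, os.O_RDWR)
    # offset 0, length 0 means "the whole file"
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
    return fd
```

The original work was done in C against the raw system call; the effect is the same.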
by Hajime Inoue
NetADHICT is a tool sitting in between wireshark (low-level) and mrtg (high-level), helping people investigate network problems at an intermediate level. NetADHICT is able to identify clusters of related network packets, adapting as it goes along.
When run for several hours on network traffic, the tool produces a visual presentation of the traffic, showing its different types.
NetADHICT is available from http://www.ccsl.carleton.ch/software.
by Tony Cass, CERN
Tony gave an overview of the altogether different IT world at CERN, with its outlandish needs for computing and storage capacity. Interesting tidbits:
CERN is gearing up for the opening of their new accelerator in spring 2008, when they expect to see several petabytes of data generated every year.
In 1996 they had 4GB disks with a 10 MB/s IO rate. In 2007 we have 500GB disks with 60 MB/s. This looks like an improvement, and in some respects it is, but the transfer rate per unit of capacity is abysmal: 1 TB in 1996 was attached with 2.5 GB/s of bandwidth; today there is only 120 MB/s per TB. This has led CERN to provide two types of disk services ... fast and slow.
To cater for the data transfer needs between CERN and other European research facilities, they use dark fiber (private fiber lines) from their HQ in Geneva to most of the Western European countries, including some Eastern European countries such as Hungary.
While they have detailed monitoring all over their systems, they are struggling to get an overview of what is going on in their infrastructure. Their favorite method at the moment is to depict their resources in color-coded tree maps to identify the most pressing problems in relation to their impact on the available computing resources.
by Erik Nygren
Akamai transports up to 5 PB on peak days and carries 10-15% of the world's HTTP traffic. Most of the large web properties rely on Akamai to get their content out to the customers. Akamai runs servers at 1'400 locations around the world. They need so many locations because the internet is so widely distributed: no single provider controls more than 8% of the network, so to reach over 90% of the users they have to be present in over 1000 locations.
Most websites do not need all that much bandwidth on a sustained basis, so when a big event hits (advertising, a software release), the Akamai network is ideal for providing good response times during these brief periods.
Due to the distributed nature and size of their computing base they:
by Jeff Bonwick and Bill Moore
ZFS points of interest:
Use of a memory model for storage: you don't have to format or partition your RAM, so you should not have to do that with storage either. Aka pooled storage.
Transactional objects. Things are consistent on disk. No fsck.
End-to-end data integrity through checksumming at the very top of the software stack. This efficiently fights silent data corruption, which is becoming more and more common as data sizes increase without error rates dropping. The problem ZFS is addressing here is that data corruption occurs not only at the physical layer but also at any of the intermediate layers of the software stack. The checksum for a data block is stored in its parent block pointer, so it is physically isolated from the actual data.
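The idea of keeping the checksum next to the pointer rather than next to the data can be illustrated with a toy model (this is not ZFS's on-disk format, and ZFS defaults to its own fletcher checksums rather than SHA-256):

```python
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

class BlockPointer:
    """Toy parent block pointer: stores the checksum of the child data
    it points to, separate from the data itself."""
    def __init__(self, data):
        self.data = data            # stands in for the child block on disk
        self.csum = checksum(data)  # lives with the parent, not the child

    def read(self):
        # Verify on every read; a mismatch means something below us --
        # disk, controller, driver, cable -- silently corrupted the data.
        if checksum(self.data) != self.csum:
            raise IOError("checksum mismatch: silent corruption detected")
        return self.data
```

A corrupted block cannot forge its own checksum, because the checksum travels with the parent.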
When mirroring is in operation, ZFS uses the checksum data to identify the 'good' blocks and uses them to fix/rewrite broken blocks.
ZFS RAID-Z always creates full stripes by using a variable block size of 512 bytes to 128k. It supports single and double parity, mimicking RAID5 and RAID6. Using the birth-time block marking, it is even possible to resync disks that have been disconnected for some time.
Platform independence. Disks can be moved from one system to another.
Move away from the physical model, where you deal with virtual disks based on volumes, to a model of pooled storage where you do malloc/free-type operations.
Operations where you don't care about where things happen; leave it to ZFS to make them efficient. Since ZFS manages the filesystem as well as the physical disks, it is in a unique position to optimize all operations.
ZFS operates on objects. It can handle 2^48 objects, each up to 2^64 bytes in size. All functionality like snapshots, compression, and encryption happens at the Data Management Unit level.
Copy-on-write transactions. Instead of overwriting existing data, ZFS always allocates new space for updates and rewrites the pointer to the data once all the new data has been written. As a side effect, we get snapshot abilities for free.
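The mechanism can be sketched with a toy persistent tree: an update copies only the nodes along the changed path, so the old root remains a consistent snapshot (illustration only, not ZFS's actual data structures):

```python
class Node:
    """Immutable tree node; paths to it are rewritten on update."""
    def __init__(self, name, children=(), data=None):
        self.name, self.children, self.data = name, dict(children), data

def cow_update(root, path, data):
    """Return a new root with `data` written at `path`.

    Only the nodes along the path are copied; everything else is
    shared with the old tree, so the old root keeps working as a
    point-in-time snapshot for free.
    """
    if not path:
        return Node(root.name, root.children, data)
    head, rest = path[0], path[1:]
    child = root.children.get(head, Node(head))
    new_children = dict(root.children)
    new_children[head] = cow_update(child, rest, data)
    return Node(root.name, new_children)
```

Keeping a reference to the previous root is all a snapshot costs.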
Each block parent contains the birth time of its child. This makes it possible to find all objects that changed since any point in time in a highly efficient manner. With this, it is possible to synchronize a ZFS filesystem to another filesystem with virtually no overhead; running the synchronization process every 10 seconds will not hurt the system at all.
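A toy sketch of why the birth-time walk is cheap (the transaction-group numbering here is illustrative, not ZFS's on-disk layout):

```python
class Block:
    """Toy block: carries the transaction group (txg) in which it was
    born, plus child blocks. A parent's birth txg is >= that of any
    changed child, because rewriting a child rewrites the parent too."""
    def __init__(self, name, birth_txg, children=()):
        self.name, self.birth_txg, self.children = name, birth_txg, list(children)

def changed_since(block, txg):
    """Yield all blocks born after `txg`, pruning untouched subtrees.

    An unchanged subtree has an unchanged root, so the walk never
    descends into it -- cost is proportional to the amount of change,
    not the size of the filesystem.
    """
    if block.birth_txg <= txg:
        return
    yield block
    for child in block.children:
        yield from changed_since(child, txg)
```

This pruning is what makes very frequent incremental synchronization affordable.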
by Kenneth G. Brill, Director Uptime Institute
Energy consumption of servers was 0.6% in 2000; in 2005 it was 1.2%; by 2010 the Uptime Institute projects it at 2.1%.
The fundamental problem with Moore's Law and power consumption is that the number of transistors grows exponentially while energy efficiency does not go up accordingly. Today the cost of data centers is primarily driven by the power consumption of the installed equipment.
If you bought computers for 1M$ in 2000, they used 32kW. In 2012, 1M$ of equipment will use 398kW. Together with rising energy prices, this is shifting the economics of data centers drastically.
Less than half of the power consumed at the plug goes into actual computational work.
The only way out of this cycle is to reduce energy consumption:
IT strategy optimization regarding business requirements, systems architecture and platforms, data topology, and network design.
IT hardware asset utilization. Up to 30% of the IT infrastructure in large data centers is sitting around eating energy without doing any work.
Energy-efficient IT hardware deployment. Be prepared to pay a premium for more efficient hardware. Interesting: doubling disk spindle speed multiplies energy consumption by a factor of 8.
Site infrastructure overhead minimization.
More can be found on http://www.uptimeinstitute.org.
by Chris McEniry, Sony Computer Entertainment America
They run a setup with 2000 homogeneous game servers. The servers boot with PXE/TFTP and then install an image on the client. Since their client population is huge, the network capacity is not sufficient to transfer the 2GB image to all clients within their 2-hour maintenance window.
By switching to BitTorrent for OS distribution they dramatically reduced the distribution time (to 15 minutes). BitTorrent has some added benefits: it checksums the image at transfer time, and it even lets you distribute incremental updates to the image if the new image shares binary-identical sections with the previous version.
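The piece-level checksumming behind those incremental updates can be sketched like this (4-byte pieces for illustration; real torrents hash pieces of 256 KiB and up with SHA-1):

```python
import hashlib

PIECE = 4  # bytes per piece; chosen tiny for illustration

def piece_hashes(data):
    """One SHA-1 hash per fixed-size piece, as stored in a .torrent file."""
    return [hashlib.sha1(data[i:i + PIECE]).hexdigest()
            for i in range(0, len(data), PIECE)]

old_image = b"AAAABBBBCCCC"
new_image = b"AAAAXXXXCCCC"

# Pieces whose hashes match need not be fetched again by a client
# that already holds the old image -- only the changed middle piece
# has to travel over the network.
unchanged = [a == b for a, b in
             zip(piece_hashes(old_image), piece_hashes(new_image))]
```

This is also where the free integrity checking comes from: every received piece is verified against its hash before it is accepted.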