In regular (or uncompressed) files, each CKD track or FBA block occupies a fixed position in the emulation file, so its offset can be calculated directly from the track or block number and the maximum size of a track or block. In compressed files, each track image or group of blocks may be compressed by zlib or bzip2 and occupies only the space necessary for the compressed image. The offset of a compressed track or block group is obtained by a two-table lookup. The lookup tables themselves reside in the emulation file.
Because FBA blocks are only 512 bytes long, they are grouped into block groups. Each block group contains 120 FBA blocks (60K).
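The arithmetic described above can be sketched in Python. This is only an illustration: the function names and the `header_size` parameter are hypothetical, and the sketch does not claim to mirror the Hercules source.

```python
FBA_BLOCK_SIZE = 512      # bytes per FBA block (from the text)
BLOCKS_PER_GROUP = 120    # FBA blocks per block group (from the text)


def uncompressed_offset(number, max_size, header_size=0):
    """Offset of a track or block in an uncompressed file.

    header_size is a hypothetical parameter for any bytes that
    precede the first track or block in the file.
    """
    return header_size + number * max_size


def block_group_of(block):
    """Return (block group number, byte offset of the block inside
    the uncompressed block group)."""
    group, index = divmod(block, BLOCKS_PER_GROUP)
    return group, index * FBA_BLOCK_SIZE
```

For example, `block_group_of(121)` places block 121 in block group 1, 512 bytes into the uncompressed group.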
Whenever a track or block group is written to a compressed file, it is written either to an existing free space within the file, or at the end of the file, then the lookup tables are updated, and then the space the track or block group previously occupied is freed. The location of a track or block group in the file can change many times.
In the event of a catastrophic failure (for example, Hercules crash, operating system crash, power failure), the compressed emulation file on the host's physical disk may be out of sync if the host operating system defers physical writes to the file system containing the emulation file. A number of techniques have been provided to minimize emulation file corruption in such an event.
A compressed file may occupy only 20% of the disk space required by an uncompressed file. In other words, you may be able to have 5 times more emulated volumes using compressed DASD files. However, compressed files are more sensitive to failures and corruption may occur.
Shadow files are specified by the sf=shadow-file-name parameter
on the device statement for the compressed DASD device. The shadow file name
should contain a position where the shadow file number will be set. This is
either the character preceding the last period after the last slash, or the
last character if there is no period. For example:
0100 3390 disks/linux1.dsk sf=shadows/linux1_*.dsk
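As a rough sketch of the substitution rule just described, the following Python function replaces the character at that position with the shadow file number. The function name is hypothetical, and how Hercules treats unusual names (for example, a name that begins with a period) is not covered by the text, so the fallback used here is an assumption.

```python
def shadow_file_name(template, number):
    """Return the name of shadow file `number` (1 to 8) for an sf= template."""
    if not 1 <= number <= 8:
        raise ValueError("shadow file number must be between 1 and 8")
    base_start = template.rfind("/") + 1        # 0 when there is no slash
    period = template.rfind(".", base_start)     # last period after the last slash
    if period > base_start:
        spot = period - 1                        # character preceding that period
    else:
        spot = len(template) - 1                 # assumption: fall back to last character
    return template[:spot] + str(number) + template[spot + 1:]
```

With the device statement above, `shadow_file_name("shadows/linux1_*.dsk", 1)` yields `shadows/linux1_1.dsk`.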
There can be up to 8 shadow files in use at any time for an emulated dasd device. The base file is designated file[0] and the shadow files are file[1] to file[8]. The highest numbered file in use at a given time is the current file, where all writes will occur. Track reads start with the current file and proceed down until a file is found that actually contains the track image.
A shadow file contains all the changes made to the emulated dasd since it was created, until the next shadow file is created. The moment of the shadow file's creation can be thought of as a snapshot of the current emulated dasd at that time, because if the shadow file is later removed, then the emulated dasd reverts back to the state it was at when the snapshot was taken.
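The read and write rules for shadow files can be modeled with a short Python sketch. Here each file is represented by a dictionary mapping track numbers to images, which is purely an illustrative stand-in for the real file format; the function names are hypothetical.

```python
def read_track(files, track):
    """files[0] is the base file and files[-1] is the current file.

    Search from the current file downwards and return (file index, image)
    for the first file that actually contains the track.
    """
    for index in range(len(files) - 1, -1, -1):
        if track in files[index]:
            return index, files[index][track]
    raise KeyError(track)


def write_track(files, track, image):
    """All writes go to the current (highest numbered) file."""
    files[-1][track] = image
```

Dropping the last element of `files` models removing the current shadow file without a backward merge: reads then see the state of the emulated dasd at the time the snapshot was taken.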
Using shadow files, you can keep the base file on a read-only device such as cdrom, or change the base file attributes to read-only, ensuring that this file can never be corrupted.
Hercules console commands are provided to add a new shadow file, remove
the current shadow file (with or without backward merge), compress the
current shadow file, and display the shadow file status and statistics:
sf+ | unit | Create a new shadow file
sf- | unit | Remove a shadow file with backwards merge
sf- | unit | nomerge | Remove a shadow file without backwards merge
sfc | unit | Compress the current file
sfd | unit | Display shadow file status and statistics
The first 512 bytes of a compressed DASD file contain a device header. The device header contains an eye-catcher that identifies the file type (CKD or FBA, and base or shadow). The device type and file size are also specified in this header. The header is identical to the header used for uncompressed CKD files, except for the eye-catcher:
devid | heads | trksize
devt | seq | hicyl |   |
reserved
The next 512 bytes contain the compressed device header. This contains file usage information such as the amount of free space in the file:
vrm | opts | numl1 | numl2 | size
used | ->free | free | largest
number |   | cyls |   | comp | parm
reserved
After the compressed device header is the primary lookup table, also called the level 1 table or l1tab. Each 4 byte unsigned entry in the l1tab contains the file offset of a secondary lookup table or level 2 table or l2tab. The track or block group number being accessed divided by 256 gives the index into the l1tab. That is, each l1tab entry represents 256 tracks or block groups. The number of entries in the l1tab is dependent on the size of the emulated device:
l2[0] | l2[1] | l2[2] | l2[3]
l2[4] | l2[5] | l2[6] | l2[7]
. . .
l2[n-4] | l2[n-3] | l2[n-2] | l2[n-1]
Following the l1tab, in no particular order, are l2tabs, track or block group images, and free spaces.
Each secondary lookup table (or l2tab) contains 256 8-byte entries. The entry is indexed by the remainder of the track or block group number divided by 256. Each entry contains an unsigned 4 byte offset and an unsigned 2 byte length of the track or block group image:
0 | ->image | length | unused
1 | ->image | length | unused
. . .
255 | ->image | length | unused
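The two-table lookup can be sketched in Python with the `struct` module. This is a minimal illustration, not the Hercules implementation: the function name is hypothetical, and the byte order of the table entries is not stated in this document, so little-endian is assumed and exposed as a parameter. The l1tab is taken to start at offset 1024, immediately after the two 512-byte headers.

```python
import struct

L1TAB_OFFSET = 1024   # device header (512) + compressed device header (512)
L2_ENTRIES = 256      # entries per l2tab
L2_ENTRY_SIZE = 8     # 4-byte offset, 2-byte length, 2 unused bytes


def locate(image, number, byteorder="<"):
    """Return (offset, length) of a track or block group image.

    `image` is the file contents as bytes; `byteorder` is an assumption.
    Returns None when a zero offset is found in either table.
    """
    l1_index, l2_index = divmod(number, L2_ENTRIES)
    (l2_offset,) = struct.unpack_from(
        byteorder + "I", image, L1TAB_OFFSET + 4 * l1_index)
    if l2_offset == 0:
        return None
    offset, length, _unused = struct.unpack_from(
        byteorder + "IHH", image, l2_offset + L2_ENTRY_SIZE * l2_index)
    if offset == 0:
        return None
    return offset, length
```

Track 300, for example, uses l1tab entry 1 (300 divided by 256) and l2tab entry 44 (the remainder).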
A track or block group image contains two fields, a 5-byte header and a variable amount of data that may or may not be compressed. The length in the l2tab entry includes the length of the header and the data.
hdr | track or block group data
The 5 byte header contains a 1 byte flag field and 4 bytes that identify the track or block group. The format of the identifier depends on whether the emulated device is CKD or FBA:
CKD hdr
flags | CC | HH
The 2 byte CC is the cylinder number for the track image and the HH is the head number. These numbers are stored in big-endian byte order. When the flag byte is zeroed, the 5 byte header is identical to the Home Address (or HA) for the track image. The data, which may or may not be compressed, begins with the R0 count and ends with the end-of-track (or eot) marker, which is a count field containing 8 0xff's. The HA plus the uncompressed track data comprise the track image.
FBA hdr
flags | nnnn
The 4 byte nnnn field is the FBA block group number in big-endian byte order. The data contains 120 FBA blocks, which may or may not be compressed. Uncompressed, the FBA block group is 60K. The header for FBA, unlike CKD, is not used as part of the uncompressed image.
The flags byte contains 8 bits in the format
0 0 0 0 0 0 c c
where the two c bits have the following meanings:
0 0 | Data is uncompressed
0 1 | Data is compressed using zlib
1 0 | Data is compressed using bzip2
1 1 | Not valid
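The header and flag layout above can be exercised with a short Python sketch that decodes one track or block group image. The function name is hypothetical, and the sketch assumes the compressed data is a standard zlib or bzip2 stream that Python's `zlib` and `bz2` modules can read; the document itself does not specify the stream parameters.

```python
import bz2
import struct
import zlib

HEADER_LEN = 5  # 1 flag byte + 4 identifier bytes


def decode_image(raw, ckd=True):
    """Return (identifier, uncompressed image) for one stored image."""
    comp = raw[0] & 0x03                 # the two low-order "c c" bits
    body = raw[HEADER_LEN:]
    if comp == 0:
        data = body                      # stored uncompressed
    elif comp == 1:
        data = zlib.decompress(body)
    elif comp == 2:
        data = bz2.decompress(body)
    else:
        raise ValueError("invalid compression bits")
    if ckd:
        cc, hh = struct.unpack(">HH", raw[1:HEADER_LEN])   # big-endian CC, HH
        # With the flag byte zeroed, the header is the Home Address,
        # which is part of the uncompressed track image.
        return (cc, hh), b"\x00" + raw[1:HEADER_LEN] + data
    (group,) = struct.unpack(">I", raw[1:HEADER_LEN])       # big-endian nnnn
    return group, data                   # the FBA header is not part of the image
```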
Free space contains a 4-byte offset to the next free space, a 4-byte length of the free space, and zero or more bytes of residual data:
->next | length | residual
The minimum length of a free space is 8 bytes. The free space chain is ordered by file offset and no two free spaces are adjacent. The compressed device header contains the offset to the first free space. The chain is terminated when a free space has zero offset to the next free space. The free space chain is read when the file is opened for read-write and written when the file is closed; while the file is open, the free space chain is maintained in storage.
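Walking the on-disk free space chain could look like the Python sketch below. The function name is hypothetical and, as the byte order of these fields is not given in this document, little-endian is assumed. The checks reflect the rules stated above: a minimum length of 8 bytes, ordering by file offset, and no two adjacent free spaces.

```python
import struct


def read_free_chain(image, first, byteorder="<"):
    """Return [(offset, length), ...] for each free space in the chain.

    `first` is the offset of the first free space, taken from the
    compressed device header; a zero offset ends the chain.
    """
    chain = []
    offset = first
    while offset != 0:
        next_offset, length = struct.unpack_from(byteorder + "II", image, offset)
        if length < 8:
            raise ValueError("free space shorter than 8 bytes")
        if next_offset != 0 and next_offset <= offset + length:
            raise ValueError("chain not ordered, or free spaces adjacent")
        chain.append((offset, length))
        offset = next_offset
    return chain
```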
All compressed devices share a common cache; the devices can be a mixture of FBA and/or CKD device types. Each cache entry contains a pointer to a 64K buffer containing an uncompressed track or block group image. If the track or block group image being read is not found in the cache, then the oldest (or least recently used or LRU) entry that is not busy is stolen. A cache entry is busy if it is being read, or last accessed by an active channel program, or updated but not yet written, or being written. If no cache entries are available then the read must enter a cache wait. When images are detected to be accessed sequentially then the readahead thread(s) may be signalled to read following sequential images.
Writing
When a cache entry is updated or written to, a bit is turned on indicating
the cache entry has been updated. When a cache wait occurs, or
(more likely) during garbage collection, a cache flush is performed.
When the cache is flushed, if any entries have the updated bit on, then
the writer thread(s) are signalled. The writer thread selects the oldest
cache entry with the updated bit on, compresses the image, and writes it
to the file. The new image is written to a new space in the file and then
the space previously occupied by the image is freed. In certain circumstances,
the image may be written under stress. A stress write occurs when
a reading thread is in a cache wait or when a high percentage of
cache entries are pending write. In this circumstance, the compression
parameters are relaxed to reduce the CPU requirements. An image written
under stress is likely to take up more space than the same image written
not under stress. The writer thread(s) run 1 nicer than the CPU thread(s);
compression is a CPU intensive activity.
Garbage Collection
The primary function of the garbage collector is to keep the emulated
compressed DASD files as small as possible. After all, that is the reason
for using compressed DASD files in the first place. Another function
is to perform emulation file synchronization.
A single garbage collector thread runs for all compressed devices. By default it wakes up at 5 second intervals. The garbage collector performs space recovery for each compressed device in the order that the device was defined or attached. After space recovery the garbage collector flushes the cache to force all outstanding writes. Once all the writes have been completed, a file synchronization (fsync()) may optionally be performed, which commits any outstanding host I/O to the physical disk. Finally free space is flushed (to be explained later).
With the fsync option enabled, the physical disk holds a coherent emulation file at the end of each garbage collection cycle. Space freed since the last garbage collection cycle completed is not available for allocation until the current garbage collection cycle completes. This free space is called pending free space. That is, previous track or block group images are not overwritten until the current garbage collection completes. If a catastrophic error occurs, then the emulation file should be recoverable at least up to the point of the last garbage collection cycle.
However, performing an fsync() may decrease performance. You can increase the garbage collection interval, to reduce the number of fsync()s, but this may also increase the probability of a cache wait occurring. You can increase the size of the cache to decrease this probability, but you may increase paging or have to decrease the size of emulated memory.
Another possibility is to not enable the fsync option. This is the default. In this circumstance, by default, freed space is not available until 2 garbage collection cycles complete. That is, pending free space is not an attribute but a count. You have the option to explicitly set the pending free space count. However, increasing the pending free space count or the garbage collection interval may increase the size of the emulation file.
At the very end of the garbage collection cycle, the free space is flushed. This means that the pending free space count is decremented for all free spaces with a non-zero count. If the count goes to zero and the preceding space is a free space with a zero count then the spaces are combined.
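The flush step can be illustrated with a small Python model. It follows the description above literally and is not taken from the Hercules source: free spaces are held as `[offset, length, pending]` lists sorted by offset (a hypothetical in-storage representation), and "preceding space" is interpreted as the free space that ends exactly where this one begins.

```python
def flush_free_space(spaces):
    """Decrement pending counts and combine spaces as described above.

    `spaces` is a list of [offset, length, pending] sorted by offset.
    Returns a new list; the input is not modified.
    """
    flushed = []
    for offset, length, pending in spaces:
        became_zero = pending == 1        # count goes to zero on this flush
        if pending > 0:
            pending -= 1
        prev = flushed[-1] if flushed else None
        if (became_zero and prev is not None and prev[2] == 0
                and prev[0] + prev[1] == offset):
            prev[1] += length             # combine with the preceding free space
        else:
            flushed.append([offset, length, pending])
    return flushed
```

Calling the function once per garbage collection cycle shows a space with a count of 2 becoming available, and merging with its neighbor, only after the second cycle.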
The space recovery process of the garbage collector simply attempts to move some amount of used space towards the beginning of the file, causing free space to move towards the end of the file. When a free space reaches the end of the file, the file is truncated, reducing its size. The amount of used space moved depends on the ratio of free space to used space and on the number of free spaces. The larger the numbers, the more space the garbage collector attempts to move. That is, the garbage collector attempts to decrease the ratio of free space to used space and to decrease the number of free spaces. Within a cycle, the garbage collector might not move the selected amount of used space if the moves are detected to be counter-productive (i.e., the offset of the new space is greater than the current offset).
Syntax:
cckd | help | Display cckd help
cckd | stats | Display current cckd statistics
cckd | opts | Display current cckd options
cckd | opt=value | Set a cckd option
Multiple options may be specified, separated by a comma with no intervening blanks.
cache=n | Cache size in M
l2cache=n | L2 cache size in K
ra=n | Number of readahead threads
raq=n | Readahead queue size
rat=n | Number of tracks to readahead
wr=n | Number of writer threads
gcint=n | Garbage collection interval
gcparm=n | Garbage collection parameter
nostress=n | Turn stress writes on or off
freepend=n | Set the free pending value
fsync=n | Turn fsync on or off
ftruncwa=n | Turn ftruncate bug workaround on or off
trace=n | Number of trace table entries
Options:
cache=n | Size of the cache in megabytes. Each cache entry points
to a 64K buffer. Therefore each megabyte represents 16 cache entries.
The default is 8, or 256 cache entries. You can specify a number between 1 and 64 (16 to 1024 cache entries).
|
l2cache=n  | Size of the level 2 table cache in kilobytes.
Each cache entry points to a 2K l2tab. Therefore each 2K
represents a single cache entry.
The default is 512, or 256 cache entries. You can specify a number between 256 and 2048 (128 to 1024 cache entries).
|
ra=n | Number of readahead threads. When sequential track or block group
access is detected, some number (rat= ) of tracks or
block groups are queued (raq= ) to be read by one of the
readahead threads.
The default is 2. You can specify a number between 1 and 9.
|
raq=n | Size of the readahead queue. When sequential track or block group
access is detected, some number (rat= ) of tracks or
block groups are queued in the readahead queue.
The default is 4. You can specify a number between 0 and 16 (a value of zero disables readahead).
|
rat=n | Number of tracks or block groups to read ahead when sequential access
has been detected.
The default is 2. You can specify a number between 0 and 16 (a value of zero disables readahead).
|
wr=n | Number of writer threads. When the cache is flushed updated
cache entries are marked write pending and a writer thread is signalled.
The writer thread compresses the track or block group and writes the
compressed image to the emulation file. A writer thread is cpu-intensive
while compressing the track or block group and i/o-intensive while writing
the compressed image. The writer thread runs one nicer than
the CPU thread(s).
The default is 2. You can specify a number between 1 and 9.
|
gcint=n | Number of seconds the garbage collector thread waits during an interval.
At the end of an interval, the garbage collector performs space recovery,
flushes the cache, and optionally fsyncs the emulation file.
(However, the file will not be fsynced unless at least 5
seconds have elapsed since the last fsync).
The default is 5 seconds. You can specify a number between 1 and 60.
|
gcparm=n | A value affecting the amount of data moved during the garbage collector's
space recovery routine. The garbage collector determines an amount of
space to move based on the ratio of free space to used space in an
emulation file, and on the number of free spaces in the file. (The
garbage collector wants to reduce the free space to used space ratio
and the number of free spaces). The value is logarithmic; a value
of 8 means moving 2^8 times the selected value while a negative
value similarly decreases the amount to be moved. Normally, 256K
will be moved for a file in an interval. Specifying a value of 8 can
increase the amount to 64M. At least 64K will be moved. Interestingly,
specifying a large value (such as 8) may not increase the garbage
collection efficiency correspondingly.
The default is 0. You can specify a number between -8 and 8.
|
nostress=n  | Indicates whether stress writes will occur or not. A track
or block group may be written under stress when a high percentage of
the cache is pending write or when a device i/o thread is waiting for
a cache entry. When a stressed write occurs, the compression algorithm
and/or compression parm may be relaxed, resulting in faster compression
but usually a larger compressed image. If nostress is set
to one, then a stressed situation is ignored. You would typically
set this value to one when you want to create the smallest emulation file
possible in exchange for a possible performance degradation.
The default is 0. You can specify 0 (enable stressed writes) or 1 (disable stressed writes).
|
freepend=n  | Specifies the free pending value for freed space. When a
track or block group image is written the space it previously occupied
is freed. This space will not be available for future
allocations until n garbage collection intervals have completed.
In the event of a catastrophic failure, previously written track or
block group images should be recoverable if the current image has
not yet been written to the physical disk. By default the value
is set to -1. This means that if fsync is specified
then the value is 1 otherwise it is 2. If 0 is specified then freed
space is immediately available for new allocations.
The default is -1. You can specify a number between -1 and 4.
|
fsync=n  | Enables or disables fsync. When fsync is enabled, then
the disk emulation file is synchronized with the physical hard
disk at the end of a garbage collection interval (however, no more
often than 5 seconds). This means that if freepend is
non-zero then if a catastrophic error occurs then the emulated disks
should be recovered coherently. However, fsync may cause
performance degradation depending on the host operating system and/or
the host operating system level.
The default is 0 (fsync disabled). You can specify 0 (disable fsync) or 1 (enable fsync).
|
ftruncwa=n  | Work-around for a Linux kernel bug in 2.4.18 (shipped in at least RH7.3
and RH8.0). The symptom is an excessive amount of kernel cpu time and
non-responsiveness of the associated Hercules emulated dasd file.
The problem may still occur with this option turned on, although
less frequently. The problem appears to be fixed in 2.4.19. The default is 0. You can specify 0 or 1 (enable workaround).
|
trace=n  | Number of cckd trace entries. You would normally specify a non-zero
value when debugging or capturing a problem in cckd code. When the
problem occurs, you should enter the k Hercules console command
which will print the trace table entries.
The default is 0. You can specify a number between 0 and 200000. Each entry represents 128 bytes. Normally, for debugging, I use 100000.
|
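The option syntax and the value ranges listed above can be summarized in a Python sketch. This is only an illustration of the documented rules; the function names are hypothetical and Hercules itself may validate or report errors differently.

```python
# Allowed (low, high) values, copied from the option descriptions above.
CCKD_RANGES = {
    "cache": (1, 64), "l2cache": (256, 2048), "ra": (1, 9),
    "raq": (0, 16), "rat": (0, 16), "wr": (1, 9),
    "gcint": (1, 60), "gcparm": (-8, 8), "nostress": (0, 1),
    "freepend": (-1, 4), "fsync": (0, 1), "ftruncwa": (0, 1),
    "trace": (0, 200000),
}


def parse_cckd_options(text):
    """Parse 'opt=value,opt=value' (no intervening blanks) into a dict."""
    options = {}
    for item in text.split(","):
        name, sep, value = item.partition("=")
        if not sep or name not in CCKD_RANGES or any(c.isspace() for c in item):
            raise ValueError(f"unknown or malformed option: {item!r}")
        number = int(value)
        low, high = CCKD_RANGES[name]
        if not low <= number <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        options[name] = number
    return options


def effective_freepend(freepend, fsync):
    """Resolve freepend=-1: 1 if fsync is enabled, otherwise 2."""
    if freepend == -1:
        return 1 if fsync else 2
    return freepend
```

For instance, `parse_cckd_options("cache=16,gcint=10,fsync=1")` accepts all three values, while `cache=65` is rejected as out of range.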
Q. What devices are supported?
A. 2311, 2314, 3330, 3340, 3350, 3375, 3380, 3390 and 9345.

Q. Is a 3390 model 9 supported?
A. Yes, maybe. A 3390-9 is a little over 8G in size.
A cckd file cannot exceed 2G on a system that does
not support large files; otherwise it cannot exceed
4G. If the data on the 3390-9 compresses to below
these limits then the answer is Yes.

Q. How can I get rid of the free space in my files?
A. Once the total amount of free space falls below 6% of
the total file size, the garbage collector is not very
aggressive about eliminating free space. To remove
all free space from the file while Hercules is running,
use the sfc console command. See
Using Shadow Files above.
Otherwise, you can use the cckdcomp utility.
See Utilities above.

Q. How can I display the space statistics for a compressed file?
A. The statistics are displayed when the compressed file
is opened. Currently, there is no supplied method to
display these statistics at any other time. However,
it shouldn't be too hard to write a shell script
(similar to dasdlist) to display these
statistics. The statistics are contained in the
CCKDDASD_DEVHDR, which is at offset 512
in the compressed file; the header is mapped in
hercules.h.

Q. What is a "null track" anyway?
A. The term "null track" is just something I made up. It is
what is returned when a zero offset is found in either the
primary or secondary lookup table for the track. It contains
the following fields:

Q. I want to try bzip2 but I'm getting compiler errors. What am I doing wrong?
A. Probably bzip2 is not installed or is not installed
properly. You can obtain bzip2 from
here.
If bzip2 is installed, then you need to find the directory
where bzlib.h is installed and the
directory where libbz2.a is installed.
You can then add "-I bzlib.h-directory" to the
CFLAGS in the make file and add "-L libbz2.a-directory"
to the LFLAGS.

Q. Which is better, zlib or bzip2?
A. This is a religious question. I have no actual preference;
I just wanted to make a choice available.

Q. Can other compression programs be used?
A. Yes. The program is architecturally structured so that other
compression algorithms can be added rather painlessly. This
will require, of course, an update to the source.

Q. Can this compression scheme be used for FBA devices too?
A. I have not worked with FBA devices for over 20 years.
However, it seems to me that a similar program for FBA
devices should be simpler than this program for CKD devices
(none of those count/key/data fields mucking everything
up). Since an FBA block is 512 bytes, it might not
be efficient to have each block compressed individually;
it might be better to compress blocks in 32K or 64K chunks.
If someone asks very nicely, I may consider looking into it ;-)

Greg Smith gsmith@nc.rr.com
Last updated 17 Nov 2002