Monday 28 January 2013

tar, bzip2, gzip


Compression is a means to shrink the physical size of a file in bytes. The technical aspects of how compression works are a bit beyond the scope of this guide, so suffice it to say that the computer uses an algorithm to encode redundant data more compactly. Archiving, on the other hand, is the act of combining several files into one, for ease of backup and distribution, while keeping the individual file attributes and permissions intact.

File Extensions

filename.tar

This is a standard tar archive. Tar is short for tape archive which is a throwback to the old days of backing up hard drives to regular tape.

filename.gz

This is a compressed file using the GNU Zip compression algorithm. At this time, it is the most common compression method used.

filename.bz2

This is a compressed file using the newer bzip2 compression algorithm (if anyone knows what the ‘b’ stands for please let me know). Bzip2 is a more efficient algorithm, which results in smaller file sizes. Many of the major FTP archive sites are switching to bzip2 to save disk space and bandwidth.

Of course, the most common case is that tar is used together with a compression tool, which results in file extensions such as .tar.gz, .tar.bz2, and .tgz. It is important to keep in mind that these file extensions are for the benefit of humans, not computers. There is nothing but convention to stop you from naming your tar archives with, say, the .joe filename extension. It will still untar just fine.

Using tar

The format of using tar on the command line is:

tar functionoptions files...

You do not need to use the usual ‘-’ after tar, because the first argument to the tar command is what is called the function, rather than an option. You can use the dash if you like though. The most common functions used with tar are:

c -to create a new archive

x -to extract files from an archive (untar them)

t -to list the contents of a tar archive

r -to add more files to an existing archive

There are, of course, more functions available for more esoteric tasks, which you can discover with man tar if you are really curious. The most important option is the ‘f’ flag, which must be specified right before the filename you intend to act on. There are other options as well, but I do not want to give them away now lest I ruin the surprise later. Now I want you to find a directory somewhere in your home directory, preferably one with 4-5 files, so we can have some working examples. As an example I will use a directory where I keep some of the python scripts I am working on:


[bulliver@badcomputer scripts]$ ls -l python

-rwxr-xr-x 1 bulliver bulliver 62 Dec 26 16:08 exp1.py

-rwxr-xr-x 1 bulliver bulliver 110 Feb 12 02:42 exp2.py

-rwxr-xr-x 1 bulliver bulliver 5549 Jan 9 02:10 find_mu.py

-rwxr-xr-x 1 bulliver bulliver 3433 Jan 9 18:20 find_mu2.py

-rw-r--r-- 1 bulliver bulliver 3415 Jan 29 03:58 find_mu2.txt

-rwxr-xr-x 1 bulliver bulliver 9285 Jan 29 03:58 hockey_pool.py

Now we want to combine these six files into a single archive, so we’ll use the ‘c’ function. It is also wise to keep all the files you intend to archive contained in a directory. If you have ever unpacked a tar file you downloaded from the internet and had the files explode all over the current directory you will understand why it is good form to package all of your archives in a top-level directory.

[bulliver@badcomputer scripts]$ tar cvf python.tar python

python/

python/hockey_pool.py

python/find_mu.py

python/exp1.py

python/exp2.py

python/find_mu2.py

python/find_mu2.txt

[bulliver@badcomputer scripts]$ ls -l

drwxr-xr-x 2 bulliver bulliver 288 Feb 4 18:59 bash

drwxr-xr-x 2 bulliver bulliver 392 Jan 29 03:57 c

drwxr-xr-x 2 bulliver bulliver 224 Feb 12 00:32 python

-rw-r--r-- 1 bulliver bulliver 20480 Feb 16 00:38 python.tar

drwxr-xr-x 2 bulliver bulliver 48 Feb 4 18:14 test

There are a couple of things to notice here. First of all, the ‘v’ (verbose) option gives us a list of the files that tar is archiving. Keep in mind that tar works recursively: if there were any directories in ‘python’ they would have been added to the archive too, along with any deeper levels of files and directories. Another thing to notice is that tar left our original python directory intact. Now let’s have a look inside our archive:

[bulliver@badcomputer scripts]$ tar tvf python.tar

drwxr-xr-x bulliver/bulliver 0 2003-02-12 00:32 python/

-rwxr-xr-x bulliver/bulliver 9285 2003-01-29 03:58 python/hockey_pool.py

-rwxr-xr-x bulliver/bulliver 5549 2003-01-09 02:10 python/find_mu.py

-rwxr-xr-x bulliver/bulliver 62 2002-12-26 16:08 python/exp1.py

-rwxr-xr-x bulliver/bulliver 110 2003-02-12 02:42 python/exp2.py

-rwxr-xr-x bulliver/bulliver 3433 2003-01-09 18:20 python/find_mu2.py

-rw-r--r-- bulliver/bulliver 3415 2003-01-29 03:58 python/find_mu2.txt

As you can see, all of the files we archived have retained their permissions and timestamps. To extract the archive we would use:

[bulliver@badcomputer scripts]$ tar xvf python.tar

python/

python/hockey_pool.py

python/find_mu.py

python/exp1.py

python/exp2.py

python/find_mu2.py

python/find_mu2.txt
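
If you would rather try the three functions on throwaway data than on your own files, here is a minimal sketch of the create, list, and extract cycle. The names demo, a.txt, b.txt, and restore are made up for illustration, and the script works in a temporary directory created by mktemp.

```shell
#!/bin/sh
# Sketch: create, list, and extract a tar archive in a scratch directory.
# All file and directory names here are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"

mkdir demo
printf 'first file\n'  > demo/a.txt
printf 'second file\n' > demo/b.txt

tar cf demo.tar demo        # c = create, f = the archive file name follows
tar tf demo.tar             # t = list the members of the archive

mkdir restore
tar xf demo.tar -C restore  # x = extract; -C switches directory before unpacking

# The restored tree should be identical to the original one.
diff -r demo restore/demo && echo "round trip OK"
```

The -C option is only there so that the extracted copy does not land on top of the original directory.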


Using gzip and bzip2

As mentioned before, gzip and bzip2 do basically the same thing using two different methods, so which should we use? Far be it from me to tell you, so I’ll just say that a good rule of thumb is that if the archive is small, say less than 3-5 MB, you should use gzip, and if it is larger, use bzip2. Why? Well, bzip2 compresses better, but takes longer to compress and decompress, so only use it if it will actually make a noticeable difference; i.e. the size difference between a 1 MB tar archive compressed with gzip and with bzip2 is negligible, whilst the difference for a 50 MB archive can be quite pronounced. That being said, just use whichever you prefer :).

Compressing files is pretty easy. Here are a few examples:

[bulliver@badcomputer cruft]$ ls

bg_1.txt bg_2.txt bg_3.txt pangram.txt screenie.txt

[bulliver@badcomputer cruft]$ gzip pangram.txt

[bulliver@badcomputer cruft]$ bzip2 screenie.txt

[bulliver@badcomputer cruft]$ ls

bg_1.txt bg_2.txt bg_3.txt pangram.txt.gz screenie.txt.bz2

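Notice that the original files are not preserved; each one is replaced by its compressed version. Uncompressing is just as simple, using gunzip and bunzip2 (or gzip -d and bzip2 -d). Compressing the tar archive we created earlier works the same way. Using gzip: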

[bulliver@badcomputer scripts]$ gzip python.tar

and using bzip2:

[bulliver@badcomputer scripts]$ bzip2 python.tar

That’s it! Of course there are numerous options available to both commands, but I will let you discover them by reading the man pages.
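
One of those options is worth a quick sketch: if you want to keep the original file, gzip -c writes the compressed data to standard output instead of replacing the file. The example below generates its own sample text (the file names pangram.txt and roundtrip.txt are just for illustration) and checks that decompressing gives back the same bytes.

```shell
#!/bin/sh
# Sketch: compress with gzip while keeping the original, then verify.
set -eu
work=$(mktemp -d)
cd "$work"

# Highly repetitive sample text, which compresses well.
i=0
while [ "$i" -lt 1000 ]; do
    echo "the quick brown fox jumps over the lazy dog"
    i=$((i + 1))
done > pangram.txt

gzip -c pangram.txt > pangram.txt.gz     # -c = write to stdout, original stays
gzip -dc pangram.txt.gz > roundtrip.txt  # -d = decompress

cmp pangram.txt roundtrip.txt && echo "contents identical"
wc -c pangram.txt pangram.txt.gz         # compare the sizes in bytes
```

bzip2 accepts the same -c and -d options, so the sketch should carry over if you swap the command name and use a .bz2 extension.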

Putting it all together

So how do you deal with the foo.tar.gz or foo.tar.bz2 file you just downloaded from the internet? You could extract it manually:

[bulliver@badcomputer bulliver]$ gunzip foo.tar.gz

[bulliver@badcomputer bulliver]$ tar xvf foo.tar

That seems a bit tacky though. If you’re a guru you might string it together using zcat and a pipe, and take advantage of unix filestreams:

[bulliver@badcomputer bulliver]$ zcat foo.tar.gz | tar xvf -

That seems a bit esoteric though. If you’re like the rest of us you just take advantage of GNU tar’s decompression support, which is built right in. This means you can extract and uncompress the archive with one command, instead of dealing with two separate steps. When you pass the ‘z’ option to tar it will uncompress using gzip, and with the ‘j’ option it will uncompress using bzip2. So:

[bulliver@badcomputer scripts]$ tar xzf foo.tar.gz

[bulliver@badcomputer scripts]$ tar xjf foo.tar.bz2

Conversely, you can of course create your own compressed archives with one command as well. Just replace the extract function with ‘c’ for create:

[bulliver@badcomputer scripts]$ tar czf foo.tar.gz foo/

[bulliver@badcomputer scripts]$ tar cjf foo.tar.bz2 foo/

Remember the main letters: c for create, x for extract, and t for list; z selects gzip and j selects bzip2; v is verbose, and f names the archive file. The two compressed formats most people are interested in are tar.bz2 and tar.gz (tgz). A tar.bz2 archive is usually smaller than the equivalent tar.gz, but tar.gz is faster to create and extract.
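
To convince yourself that the one-step and two-step methods really are equivalent, here is a small sketch that extracts the same archive both ways and compares the results. The directory names foo, one_step, and two_step are assumptions for the example.

```shell
#!/bin/sh
# Sketch: tar's built-in z option versus an explicit gzip pipe.
set -eu
work=$(mktemp -d)
cd "$work"

mkdir foo
echo "hello" > foo/hello.txt

tar czf foo.tar.gz foo/          # create and gzip-compress in one step

mkdir one_step two_step
tar xzf foo.tar.gz -C one_step                   # tar runs gzip for us
gzip -dc foo.tar.gz | (cd two_step && tar xf -)  # "-" tells tar to read stdin

diff -r one_step two_step && echo "both methods agree"
```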

Howto: Use tar Command Through Network Over SSH Session

The GNU version of the tar archiving utility (and other, older versions of tar) can be used over the network through an ssh session. Do not use the telnet command; it is insecure. You can use Unix/Linux pipes to create archives. The following command backs up the /wwwdata directory to the host dumpserver.nixcraft.in (IP 192.168.1.201) over an ssh session.

# tar zcvf - /wwwdata | ssh root@dumpserver.nixcraft.in "cat > /backup/wwwdata.tar.gz"

OR

# tar zcvf - /wwwdata | ssh root@192.168.1.201 "cat > /backup/wwwdata.tar.gz"

Output:

tar: Removing leading `/' from member names

/wwwdata/

/wwwdata/n/nixcraft.in/

/wwwdata/c/cyberciti.biz/

Password:

You can also use the dd command to write the file, which some people find clearer:
# tar cvzf - /wwwdata | ssh root@192.168.1.201 "dd of=/backup/wwwdata.tar.gz"
It is also possible to dump the backup to a remote tape device:
# tar cvzf - /wwwdata | ssh root@192.168.1.201 "cat > /dev/nst0"
OR you can use mt to rewind the tape first and then write to it with the cat command:
# tar cvzf - /wwwdata | ssh root@192.168.1.201 "mt -f /dev/nst0 rewind; cat > /dev/nst0"
You can also restore a tar backup over an ssh session:
# cd /
# ssh root@192.168.1.201 "cat /backup/wwwdata.tar.gz" | tar zxvf -
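
The pattern behind all of these commands is that tar writes the archive to standard output (the "-" after f) and whatever sits on the other side of the pipe stores it. As a rough sketch you can run without a remote host, the example below replaces the ssh command with a plain local cat; the wwwdata, backup, and restore directories are created in a temporary location and are purely illustrative.

```shell
#!/bin/sh
# Sketch: the "tar to stdout, reader on stdin" pattern, with a local cat
# standing in for the ssh command so that it runs on a single machine.
set -eu
work=$(mktemp -d)
mkdir -p "$work/wwwdata/site" "$work/backup" "$work/restore"
echo "<html></html>" > "$work/wwwdata/site/index.html"

# Backup: over the network this would be  ... | ssh user@host "cat > file"
tar czf - -C "$work" wwwdata | cat > "$work/backup/wwwdata.tar.gz"

# Restore: over the network this would be  ssh user@host "cat file" | tar zxvf -
cat "$work/backup/wwwdata.tar.gz" | (cd "$work/restore" && tar xzf -)

diff -r "$work/wwwdata" "$work/restore/wwwdata" && echo "restore matches source"
```

Using -C with a relative member name also avoids the "Removing leading `/'" message, because no absolute path is ever handed to tar.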

Tar: Extract One or More Files From a Large Tarball

GNU tar can be used to extract one or more files from a tarball. To extract specific archive members, give their exact member names as arguments, as printed by the -t option.

Extracting Specific Files

Extract a file called etc/default/sysstat from config.tar.gz tarball:
$ tar -ztvf config.tar.gz
$ tar -zxvf config.tar.gz etc/default/sysstat
$ tar -xvf {tarball.tar} {path/to/file}
Some people prefer the following syntax:
tar --extract --file={tarball.tar} {file}
Extract a directory called css from cbz.tar:
$ tar --extract --file=cbz.tar css

Wildcard based extracting

You can also extract those files that match a specific globbing pattern (wildcards). For example, to extract from cbz.tar all files that begin with pic, no matter their directory prefix, you could type:
$ tar -xf cbz.tar --wildcards --no-anchored 'pic*'
To extract all php files, enter:
$ tar -xf cbz.tar --wildcards --no-anchored '*.php'

Where,
-x: instructs tar to extract files.
-f: specifies filename / tarball name.
-v: Verbose (show progress while extracting files).
-j : filter archive through bzip2, use to decompress .bz2 files.
-z: filter archive through gzip, use to decompress .gz files.
--wildcards: instructs tar to treat command line arguments as globbing patterns.
--no-anchored: informs it that the patterns apply to member names after any / delimiter.
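
Here is a small sketch of wildcard extraction on a made-up archive. It assumes GNU tar, since --wildcards and --no-anchored are GNU options, and the cbz directory layout below is invented for the example.

```shell
#!/bin/sh
# Sketch (GNU tar assumed): extract only members whose name starts with "pic".
set -eu
work=$(mktemp -d)
cd "$work"

mkdir -p cbz/css cbz/img out
echo "body {}"  > cbz/css/site.css
echo "fake png" > cbz/img/pic1.png
echo "fake png" > cbz/img/pic2.png
echo "<?php ?>" > cbz/index.php

tar cf cbz.tar cbz

# Quote the pattern so the shell hands it to tar unexpanded.
tar xf cbz.tar -C out --wildcards --no-anchored 'pic*'

find out -type f | sort   # only the two pic files should have been extracted
```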

List the contents of a tar or tar.gz file

GNU tar is an archiving program designed to store files in, and extract files from, an archive file known as a tarfile. You can create a plain tar file or a compressed one. Sometimes, however, you need to list the contents of a tar or tar.gz file on screen before extracting all the files.

Task: List the contents of a tar file

Use the following command:
$ tar -tvf file.tar

Task: List the contents of a tar.gz file

Use the following command:
$ tar -ztvf file.tar.gz

Task: List the contents of a tar.bz2 file

Use the following command:
$ tar -jtvf file.tar.bz2

Where,
t: List the contents of an archive
v: Verbosely list files processed (display detailed information)
z: Filter the archive through gzip so that we can open compressed (decompress) .gz tar file
j: Filter archive through bzip2, use to decompress .bz2 files.
f filename: Use archive file called filename
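
Listing is also handy in scripts, for example to check that a member exists before trying to extract it. The following is only a sketch: the archive config.tar.gz and the member etc/default/sysstat are created on the spot, and the variable names are made up.

```shell
#!/bin/sh
# Sketch: test whether a member is present by filtering the listing.
set -eu
work=$(mktemp -d)
cd "$work"

mkdir -p etc/default
echo "ENABLED=true" > etc/default/sysstat
tar czf config.tar.gz etc

member=etc/default/sysstat
# Without v, t prints one member name per line.
# grep -x matches the whole line, -F treats the name literally, -q is quiet.
if tar tzf config.tar.gz | grep -qxF "$member"; then
    found=yes
else
    found=no
fi
echo "member $member present: $found"
```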

Exclude Certain Files When Creating A Tarball Using Tar Command

example:

/home/me/file1

/home/me/dir1

/home/me/dir2

/home/me/abc

/home/me/xyz

How do I exclude the xyz and abc files while using the tar command?

The GNU version of the tar archiving utility has --exclude and -X options. So to exclude abc and xyz file you need to type the command as follows:
$ tar -zcvf /tmp/mybackup.tar.gz --exclude='abc' --exclude='xyz' /home/me

If you have more than two files to exclude, use the -X option instead. It reads the list of file names (or patterns) to exclude from a text file. For example, create a file called exclude.txt:
$ vi exclude.txt
Append file names:
abc
xyz
*.bak

Save and close the file. This lists the file patterns that need to be excluded. Now type the command:
$ tar -zcvf /tmp/mybackup.tar.gz -X exclude.txt /home/me

Where,
-X file.txt :exclude files matching patterns listed in FILE file.txt
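
To see the exclude file in action without touching your real home directory, here is a sketch that builds a small tree, excludes the same three patterns, and lists what ended up in the archive. The directory me and its contents are invented for the example, and GNU tar is assumed.

```shell
#!/bin/sh
# Sketch (GNU tar assumed): exclude files listed in exclude.txt.
set -eu
work=$(mktemp -d)
cd "$work"

mkdir -p me/dir1
touch me/file1 me/abc me/xyz me/dir1/notes.txt me/dir1/notes.bak

# One pattern per line, exactly as you would type them in an editor.
printf '%s\n' 'abc' 'xyz' '*.bak' > exclude.txt

tar czf mybackup.tar.gz -X exclude.txt me

tar tzf mybackup.tar.gz | sort   # abc, xyz and notes.bak should be absent
```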

Compressing files under Linux or UNIX cheat sheet

Both Linux and UNIX include various commands for compressing and decompressing (expanding) files. To compress files you can use the gzip, bzip2 and zip commands. To expand compressed files you can use the gzip -d (gunzip), bzip2 -d (bunzip2) and unzip commands.

Compressing files

Syntax

Description

Example(s)

gzip {filename}

gzip reduces the size of the given files using Lempel-Ziv coding (LZ77). Whenever possible, each file is replaced by one with the extension .gz.

gzip mydata.doc
gzip *.jpg
ls -l

bzip2 {filename}

bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by the gzip command (and other LZ77/LZ78-based compressors). Whenever possible, each file is replaced by one with the extension .bz2.

bzip2 mydata.doc
bzip2 *.jpg
ls -l

zip {.zip-filename} {filename-to-compress}

zip is a compression and file packaging utility for Unix/Linux. The given files are stored together in a single archive, {.zip-filename}, with the extension .zip.

zip mydata.zip mydata.doc
zip data.zip *.doc
ls -l

tar -zcvf {.tgz-file} {files}
tar -jcvf {.tbz2-file} {files}

GNU tar is an archiving utility, but it can also compress the archives it creates, through either gzip or bzip2. If you have several files to bundle together, use tar rather than gzip or bzip2 on their own, since those compress each file separately.
-z: use gzip compress
-j: use bzip2 compress

tar -zcvf data.tgz *.doc
tar -zcvf pics.tar.gz *.jpg *.png
tar -jcvf data.tbz2 *.doc
ls -l


Decompressing files

Syntax

Description

Example(s)

gzip -d {.gz file}
gunzip {.gz file}

Decompresses a file that was created using the gzip command. The file is restored to its original form.

gzip -d mydata.doc.gz
gunzip mydata.doc.gz

bzip2 -d {.bz2-file}
bunzip2 {.bz2-file}

Decompresses a file that was created using the bzip2 command. The file is restored to its original form.

bzip2 -d mydata.doc.bz2
bunzip2 mydata.doc.bz2

unzip {.zip file}

Extract compressed files in a ZIP archive.

unzip file.zip
unzip data.zip resume.doc


tar -zxvf {.tgz-file}
tar -jxvf {.tbz2-file}

Extracts (untars) files from an archive that was created with tar and compressed through the gzip or bzip2 filter.

tar -zxvf data.tgz
tar -zxvf pics.tar.gz --wildcards '*.jpg'
tar -jxvf data.tbz2


List the contents of an archive/compressed file

Sometimes you just want to look at the files inside an archive or compressed file. The gzip, unzip and tar commands all support a list option.


Syntax

Description

Example(s)


gzip -l {.gz file}

Show the compressed size, uncompressed size, ratio and name for a GZIP file

gzip -l mydata.doc.gz


unzip -l {.zip file}

List files from a ZIP archive

unzip -l mydata.zip


tar -ztvf {.tar.gz}
tar -jtvf {.tbz2}

List files from a TAR archive

tar -ztvf pics.tar.gz
tar -jtvf data.tbz2

1. Creating an archive using tar command

Creating an uncompressed tar archive using option cvf

This is the basic command to create a tar archive.

$ tar cvf archive_name.tar dirname/

In the above command:
c – create a new archive
v – verbosely list files which are processed.
f – following is the archive file name

Creating a tar gzipped archive using option cvzf

The tar cvf command above does not provide any compression. To gzip-compress the tar archive, use the z option as shown below.

$ tar cvzf archive_name.tar.gz dirname/
z – filter the archive through gzip

Note: .tgz is the same as .tar.gz

Note: I like to keep the ‘cvf’ (or tvf, or xvf) option unchanged for all archive creation (or view, or extract) operations and add any additional option at the end, which is easier to remember: cvf for archive creation, cvfz for gzip-compressed archive creation, cvfj for bzip2-compressed archive creation, and so on. For this method to work properly, don’t put a - in front of the options; with a leading -, the f option takes whatever follows it in the bundle as the archive name.

Creating a bzipped tar archive using option cvjf

Create a bzip2 tar archive as shown below:

$ tar cvfj archive_name.tar.bz2 dirname/
j – filter the archive through bzip2

gzip vs bzip2: bzip2 takes more time to compress and decompress than gzip, but the resulting archive is smaller.

Note: .tbz and .tb2 are the same as .tar.bz2

2. Extracting (untar) an archive using tar command

Extract a *.tar file using option xvf

Extract a tar file using option x as shown below:

$ tar xvf archive_name.tar
x – extract files from archive

Extract a gzipped tar archive ( *.tar.gz ) using option xvzf

Use the option z for uncompressing a gzip tar archive.

$ tar xvfz archive_name.tar.gz

Extracting a bzipped tar archive ( *.tar.bz2 ) using option xvjf

Use the option j for uncompressing a bzip2 tar archive.

$ tar xvfj archive_name.tar.bz2

Note: In all the above commands v is optional, which lists the file being processed.

3. Listing an archive using tar command

View the tar archive file content without extracting using option tvf

You can view the *.tar file content before extracting as shown below.

$ tar tvf archive_name.tar

View the *.tar.gz file content without extracting using option tvzf

You can view the *.tar.gz file content before extracting as shown below.

$ tar tvfz archive_name.tar.gz

View the *.tar.bz2 file content without extracting using option tvjf

You can view the *.tar.bz2 file content before extracting as shown below.

$ tar tvfj archive_name.tar.bz2

4. Listing out the tar file content with less command

When an archive contains many files, you can pipe the output of tar to less. You can also use the less command directly to view the contents of a tar archive, as explained in one of our previous articles, Open & View 10 Different File Types with Linux Less Command - The Ultimate Power of Less.

5. Extract a single file from tar, tar.gz, tar.bz2 file

To extract a specific file from a tar archive, specify the file name at the end of the tar xvf command as shown below. Give the name exactly as tar tvf prints it; tar normally strips the leading / when it creates an archive, so member names are usually relative paths. The following command extracts only a specific file from a large tar file.

$ tar xvf archive_file.tar path/to/file

Use the relevant option z or j according to the compression method gzip or bzip2 respectively as shown below.

$ tar xvfz archive_file.tar.gz path/to/file

$ tar xvfj archive_file.tar.bz2 path/to/file

6. Extract a single directory from tar, tar.gz, tar.bz2 file

To extract a single directory (along with its subdirectories and files) from a tar archive, specify the directory name at the end of the tar xvf command as shown below, written the way tar tvf lists it (normally without a leading /). The following extracts only a specific directory from a large tar file.

$ tar xvf archive_file.tar path/to/dir/

To extract multiple directories from a tar archive, specify those individual directory names at the end of the tar xvf command as shown below.

$ tar xvf archive_file.tar path/to/dir1/ path/to/dir2/

Use the relevant option z or j according to the compression method gzip or bzip2 respectively as shown below.

$ tar xvfz archive_file.tar.gz path/to/dir/

$ tar xvfj archive_file.tar.bz2 path/to/dir/

7. Extract group of files from tar, tar.gz, tar.bz2 archives using wildcards

You can specify a globbing pattern (a shell-style wildcard, not a regular expression) to extract the files matching it. For example, the following tar command extracts all the files with the pl extension.

$ tar xvf archive_file.tar --wildcards '*.pl'

Options explanation:
--wildcards '*.pl' – files with pl extension

8. Adding a file or directory to an existing archive using option -r

You can add additional files to an existing tar archive as shown below. For example, to append a file to *.tar file do the following:

$ tar rvf archive_name.tar newfile

This newfile will be added to the existing archive_name.tar. Adding a directory to the tar is also similar,

$ tar rvf archive_name.tar newdir/

Note: You cannot add file or directory to a compressed archive. If you try to do so, you will get “tar: Cannot update compressed archives” error as shown below.

$ tar rvfz archive_name.tgz newfile

tar: Cannot update compressed archives

Try `tar --help' or `tar --usage' for more information.
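
One way around this limitation is to decompress the archive, append to the plain tar file, and compress it again. This is only a sketch of that workaround, using made-up file names (first.txt, newfile) in a temporary directory; for very large archives the extra decompress and recompress pass may be slow.

```shell
#!/bin/sh
# Sketch: append to a gzip-compressed archive by way of a plain tar file.
set -eu
work=$(mktemp -d)
cd "$work"

echo "one" > first.txt
echo "two" > newfile
tar czf archive_name.tgz first.txt

gzip -dc archive_name.tgz > archive_name.tar   # 1. decompress to a plain tar
tar rf archive_name.tar newfile                # 2. append (r) the new file
gzip -c archive_name.tar > archive_name.tgz    # 3. compress again
rm archive_name.tar

tar tzf archive_name.tgz                       # list the result
```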

9. Verify files available in tar using option -W

As part of creating a tar file, you can verify the archive file that got created using the option W as shown below.

$ tar cvfW file_name.tar dir/

If you are planning to remove a directory/file from an archive file or from the file system, you might want to verify the archive file before doing it as shown below.

$ tar tvfW file_name.tar

Verify 1/file1

1/file1: Mod time differs

1/file1: Size differs

Verify 1/file2

Verify 1/file3

If an output line starts with Verify, and there is no differs line then the file/directory is Ok. If not, you should investigate the issue.

Note: for a compressed archive file ( *.tar.gz, *.tar.bz2 ) you cannot do the verification.

Finding the difference between an archive and file system can be done even for a compressed archive. It also shows the same output as above excluding the lines with Verify.

Finding the difference between gzip archive file and file system

$ tar dfz file_name.tgz

Finding the difference between bzip2 archive file and file system

$ tar dfj file_name.tar.bz2
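
The d function also sets tar's exit status, which makes it usable in scripts. Below is a small sketch, assuming GNU tar (which exits with a nonzero status when the archive and the file system differ); the dir directory and its files are invented for the example.

```shell
#!/bin/sh
# Sketch (GNU tar assumed): detect that a file changed after it was archived.
set -eu
work=$(mktemp -d)
cd "$work"

mkdir dir
echo "original" > dir/file1
echo "same"     > dir/file2
tar czf file_name.tgz dir

# Change one file after the archive was made; its size is now different.
echo "changed, and longer than before" > dir/file1

# tar prints a line for each difference and exits nonzero if there are any.
if tar dzf file_name.tgz; then
    status="in sync"
else
    status="differs"
fi
echo "archive vs file system: $status"
```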

10. Estimate the tar archive size

The following command estimates the tar file size (in bytes, as counted by wc -c) before you create the tar file.

$ tar -cf - /directory/to/archive/ | wc -c

20480

The following commands estimate the compressed tar file size (again in bytes) before you create the tar.gz or tar.bz2 file.

$ tar -czf - /directory/to/archive/ | wc -c

508

$ tar -cjf - /directory/to/archive/ | wc -c

428
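
Since wc -c reports bytes, a little shell arithmetic turns the number into kilobytes. Here is a sketch that does this for a generated sample directory; the variable names and the data directory are assumptions for the example, and the sizes are rounded up to whole units of 1024 bytes.

```shell
#!/bin/sh
# Sketch: estimate archive sizes in bytes and convert them to KiB.
set -eu
work=$(mktemp -d)
mkdir "$work/data"
i=0
while [ "$i" -lt 200 ]; do
    echo "line $i of some sample data"
    i=$((i + 1))
done > "$work/data/sample.txt"

# The archive goes to stdout and is only counted, never written to disk.
bytes=$(tar -cf - -C "$work" data | wc -c)
gz_bytes=$(tar -czf - -C "$work" data | wc -c)

# Round up to whole KiB (1 KiB = 1024 bytes).
kib=$(( (bytes + 1023) / 1024 ))
gz_kib=$(( (gz_bytes + 1023) / 1024 ))

echo "uncompressed: $bytes bytes (~$kib KiB)"
echo "gzip:         $gz_bytes bytes (~$gz_kib KiB)"
```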

gzip is very fast and has a small memory footprint. According to this benchmark, bzip2 cannot compete with gzip in terms of speed or memory usage. bzip2 has a notably better compression ratio than gzip, which is probably the reason for its popularity; it is slower than gzip, especially in decompression, and uses more memory. However, the memory requirements of bzip2 should be no problem nowadays, even on older hardware.

