Every single file that touches the filesystem carries a tremendous amount of artifacts with it.
One of these is metadata, which can belong to a specific file or come from the filesystem.
A file can be analysed through content identification and metadata extraction.
Content identification is the process of determining or verifying what a specific file actually is, based on its contents.
Metadata extraction is the retrieval of any embedded metadata that may be present in a given file, and this is going to be our focus in this tutorial.
METADATA EXTRACTION:
Metadata is information stored within the file itself that provides some possibly interesting but otherwise nonessential information about the file. Many tools, such as exiftool, exiv2 and hachoir-metadata, can be used to view it.
All files can also have filesystem MAC (modified, accessed, changed/created) times, which live in the filesystem rather than inside the file itself.
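To make the MAC times a bit more concrete, here is a minimal Python sketch that reads them with os.stat. The function name mac_times is made up for this example, and note the assumption in the comments: st_ctime means "metadata changed" on Unix-like systems but has traditionally meant "created" on Windows.

```python
import datetime
import os


def mac_times(path):
    """Return the filesystem MAC times of `path` as ISO 8601 strings (UTC)."""
    st = os.stat(path)

    def to_iso(timestamp):
        return datetime.datetime.fromtimestamp(
            timestamp, datetime.timezone.utc
        ).isoformat()

    return {
        "modified": to_iso(st.st_mtime),
        "accessed": to_iso(st.st_atime),
        # Assumption to keep in mind: on Unix-like systems this is the inode
        # change time, on Windows it has traditionally been the creation time.
        "changed_or_created": to_iso(st.st_ctime),
    }
```

Calling mac_times("file.zip") gives you a small dictionary you can print or log. Keep in mind that merely opening a file may update its access time on some systems.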
IMAGES:
These are files that contain data to be rendered as graphics.
There are 3 types of metadata in image files:
-EXIF (Exchangeable Image File Format) was developed to embed information about the device capturing the image (typically a camera) into the image itself.
-IPTC was designed originally to embed information about images used by newspapers and news agencies.
-XMP is the XML-based "eXtensible Metadata Platform" developed by Adobe in 2001.
JPEG: JPEG files are rich in all the forms of metadata above, plus the JFIF metadata. Other image formats such as GIF and PNG usually carry much less metadata since they are mostly computer generated. TIFF is a partial exception, because EXIF is built on the TIFF tag structure.
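If you are curious where these metadata types sit inside a JPEG, the sketch below walks the segments at the start of the file and labels the ones that usually hold JFIF, EXIF, XMP and IPTC data. It is only an illustration, not a replacement for exiftool: the function names are made up, it stops at the start of the image data, and it assumes the usual identifiers (JFIF in APP0, Exif and the Adobe XMP namespace in APP1, Photoshop 3.0 in APP13).

```python
import struct


def _label(marker, payload):
    """Guess which metadata type a JPEG segment holds from its identifier."""
    if marker == 0xE0 and payload.startswith(b"JFIF\x00"):
        return "JFIF"
    if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
        return "EXIF"
    if marker == 0xE1 and payload.startswith(b"http://ns.adobe.com/xap/1.0/\x00"):
        return "XMP"
    if marker == 0xED and payload.startswith(b"Photoshop 3.0\x00"):
        return "IPTC"
    return "other"


def list_jpeg_segments(data):
    """Return (marker, label, length) for each segment before the image data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # not at a marker, give up rather than guess
        marker = data[pos + 1]
        if marker == 0xFF:
            pos += 1  # fill byte before the real marker
            continue
        if marker in (0xDA, 0xD9):
            break  # start of scan or end of image: no more metadata segments
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        segments.append((marker, _label(marker, payload), length))
        pos += 2 + length
    return segments
```

You would feed it the raw bytes of a file, for example list_jpeg_segments(open("photo.jpg", "rb").read()), where photo.jpg is just a placeholder name.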
AUDIO:
These are files containing data that produces sound when decoded properly.
1.WAV (Waveform Audio File Format): WAV audio is stored inside a RIFF container, which supports INFO chunks with various metadata and can also contain XMP tags. Take note of the application used to create the file, the date of creation and the author.
2.MP3 (MPEG-1/2 Audio Layer III, from the Moving Picture Experts Group): MP3 files carry two types of metadata: ID3v1 and ID3v2. ID3v1 is a 128-byte block appended to the end of the file, and its extended variant adds a further 227 bytes. ID3v2 tags have no such fixed size. For both ID3 versions, the id3v2 tool, exiftool and hachoir-metadata can extract them.
3.MPEG-4 Audio (AAC/M4A) and ASF/WMA: These can also contain ID3 tags and a lighter set of metadata respectively.
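Since ID3v1 is just a fixed 128-byte block at the end of the file, it is easy to peek at without any special tool. The sketch below is a minimal illustration, assuming the classic ID3v1 layout (the "TAG" marker, then 30-byte title, artist and album fields, a 4-byte year, a 30-byte comment and a 1-byte genre). The function name read_id3v1 is made up, and the extended tag and ID3v2 are not handled.

```python
def read_id3v1(data):
    """Parse the ID3v1 tag at the end of `data`, or return None if absent."""
    if len(data) < 128:
        return None
    tag = data[-128:]
    if tag[:3] != b"TAG":
        return None

    def text(raw):
        # Fields are padded with NUL bytes (or spaces) up to their fixed width.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
        "comment": text(tag[97:127]),
        "genre": tag[127],  # numeric genre code
    }
```

Pass it the raw bytes of an MP3 file. A None result only tells you there is no ID3v1 block, so the file may still carry an ID3v2 tag at its start.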
VIDEO:
These are files containing data that decodes into a sequence of moving images.
Video formats can contain XMP tags, INFO chunks and other video-specific metadata, plus of course the MAC times.
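The INFO chunks mentioned for WAV and for video live in RIFF containers (AVI is one of them on the video side). Below is a rough sketch of how they can be pulled out by hand, assuming the standard RIFF layout of a 4-byte chunk id, a 4-byte little-endian size and data padded to an even length. The function names are made up for this example. Keys such as ISFT (software), IART (artist) and ICRD (creation date) are common INFO identifiers, but a given file may use others.

```python
import struct


def _chunks(buf, start, end):
    """Yield (chunk_id, body) pairs for the RIFF chunks in buf[start:end]."""
    pos = start
    while pos + 8 <= end:
        chunk_id = buf[pos:pos + 4]
        (size,) = struct.unpack("<I", buf[pos + 4:pos + 8])
        yield chunk_id, buf[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)  # chunks are padded to an even length


def read_riff_info(data):
    """Return the INFO metadata of a RIFF file (WAV, AVI, ...) as a dict."""
    if len(data) < 12 or data[:4] != b"RIFF":
        raise ValueError("not a RIFF file")
    info = {}
    # Skip "RIFF", the overall size and the form type (e.g. "WAVE").
    for chunk_id, body in _chunks(data, 12, len(data)):
        if chunk_id == b"LIST" and body[:4] == b"INFO":
            for sub_id, sub_body in _chunks(body, 4, len(body)):
                key = sub_id.decode("ascii", "replace")
                info[key] = sub_body.rstrip(b"\x00").decode("latin-1")
    return info
```

An empty dictionary simply means the file has no INFO list. It says nothing about XMP or other metadata that might still be present.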
ARCHIVES:
These are container files designed to hold other files, and they may apply compression and sometimes encryption to the contained files. Some archive types may retain information from their system of origin, including UID and GID information from Unix-like systems.
1.ZIP: We can use hachoir-urwid or the unzip command to retrieve information about the contents of a ZIP archive without actually extracting it: $ unzip -v file.zip
This is a good way to examine the file modification dates embedded in the archive.
2.RAR: We can examine the contents of RAR archives using the RAR plugin for 7-Zip: $ 7z l file.rar
Some RAR archives may also contain comments or "NFO" files.
3.TAR, GZIP, BZIP2: Tarballs retain the owner and group information of the system they were created on, plus the time stamps like other formats. So as not to lose fidelity, they are better analysed in layers: first the compression layer with $ gunzip --list --verbose file.tar.gz
Then the archive itself with $ tar --list --verbose --gunzip --file file.tar.gz
And you may be surprised by how much it gives away.
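If you would rather script this than read command output, Python's standard zipfile and tarfile modules expose the same details. Here is a minimal sketch with made-up function names (list_zip and list_tar). Both take an open binary file object and list the members without extracting anything, and RAR is left out because the standard library has no reader for it.

```python
import datetime
import tarfile
import zipfile


def list_zip(fileobj):
    """List ZIP members with their embedded modification dates."""
    rows = []
    with zipfile.ZipFile(fileobj) as archive:
        for info in archive.infolist():
            rows.append({
                "name": info.filename,
                "size": info.file_size,
                # date_time is a (year, month, day, hour, minute, second) tuple
                "modified": "%04d-%02d-%02d %02d:%02d:%02d" % info.date_time,
            })
    return rows


def list_tar(fileobj):
    """List tarball members with the owner and group details they retain."""
    rows = []
    # "r:*" lets tarfile detect gzip or bzip2 compression on its own.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive.getmembers():
            rows.append({
                "name": member.name,
                "uid": member.uid,
                "gid": member.gid,
                "owner": member.uname,
                "group": member.gname,
                "modified": datetime.datetime.fromtimestamp(
                    member.mtime, datetime.timezone.utc
                ).isoformat(),
            })
    return rows
```

For example, list_tar(open("file.tar.gz", "rb")) returns one dictionary per member, which makes it easy to spot a username or UID that you did not mean to publish.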
DOCUMENTS:
These are file types containing text, images and rendering information. They come with a lot of metadata, ranging from authorship information and document revision histories to internal timestamps and information about the system(s) used to edit the file.
1.OLE Compound Files: Documents created using the Microsoft Office 1997–2003 binary formats, e.g. PowerPoint presentations (PPT), Word documents (DOC) and Excel spreadsheets (XLS). These use a portable filesystem-like format, with storage objects acting as (sub-)directories and stream objects, which are sequences of sectors allocated for a discrete piece of data. So imagine how much metadata can live in there.
2.OpenDocument Format and Office Open XML (e.g. DOCX): The good news is that these are ZIP containers, so all the techniques for ZIP files work here. Care should be taken to analyse every individual object inside them so that none of them is advertising you to the world: $ unzip -l kendocs.docx (or kendocs.odt)
3.Rich Text Format: Metadata is contained in the body of the document itself and can be viewed directly with no additional tools. So good luck analysing a really big RTF doc by eye.
4.PDF: Basically, it is a container file that holds a sequence of PostScript layout instructions and embedded fonts and graphics. PDFs can contain two different types of metadata: the Document Information Dictionary, which holds key/value pairs with authorship information, document title and creation/modification time stamps, and XMP.
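Because a DOCX file is a ZIP container, its authorship metadata can be read with nothing but the standard library. The sketch below is a minimal illustration, assuming the usual Office Open XML layout where core properties such as creator, lastModifiedBy, created and modified sit in docProps/core.xml. The function name is made up, and ODT files keep their equivalent data in a different member (meta.xml), which is not handled here.

```python
import xml.etree.ElementTree as ET
import zipfile


def read_docx_core_properties(fileobj):
    """Return the core properties stored in a DOCX file as a dict."""
    with zipfile.ZipFile(fileobj) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    props = {}
    for child in root:
        # Tags look like "{namespace}creator"; keep only the local name.
        name = child.tag.split("}", 1)[-1]
        props[name] = (child.text or "").strip()
    return props
```

Running read_docx_core_properties(open("kendocs.docx", "rb")) before uploading a document is a quick way to see whose name is still attached to it. Remember that images and other objects embedded in the archive can carry their own metadata too.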
I wish I could have made this shorter, but be careful when uploading anything to the interwebs: a small thing forgotten might be your undoing.
Best of luck!