Sage LaTorra
Digital Forensics

Steganography Detection for Digital Forensics

Steganography is the pursuit of cryptography that doesn't appear to be cryptography. In essence, steganography is hiding a message inside another message (which may be the same format or a different format). One of the simplest methods of steganography is to create a sentence where every word starts with a letter from the phrase being encoded, in order. For example, the phrase "Sage A" could be embedded as "Survey a great elephant again."

Steganography does not mean that the message is completely hidden or that the medium is not noticeably different. Watermarking is technically a form of steganography, since a message (the watermark) is embedded in a medium (the image) to convey information (whatever the watermark conveys, perhaps ownership or legal information). For a digital forensic investigator, another form of steganography becomes important: digital information embedded into another digital collection of data. Human-readable steganography tends to be hard for computers to recognize, much less crack. Computer steganography, on the other hand, is in most cases hidden well enough to be invisible to human viewers.

Digital steganography is usually based on the least significant bit. Since the least significant bit has little effect on the actual color, human viewers are unlikely to notice the change (unless the color palette is particularly small). Bitmaps, JPEGs, and other formats can all easily carry hidden messages in the least significant bits of their color values (or of the transform coefficients, in the case of a JPEG).

The key factor of steganography is that it doesn't look like encryption (though steganography could be applied to an encrypted message). Whereas an encrypted message has no discernible form, a steganographic message retains at least a semblance of meaning even after the steganography has been applied. From the point of view of a forensic investigator, this is a huge challenge. An encrypted
message is a clear target for investigation (anything worth encrypting is worth investigating), but a well-done steganographic message should appear typical for a file of its type.

The problem with detecting steganography is that, to butcher Potter Stewart's concurrence in Jacobellis v. Ohio, we don't know it when we see it. For an investigator, then, one of the key factors in breaking steganography is knowing where to look for it. In some cases, such as images with a small color palette, it may be easy for a human viewer to discern that the file has somehow been altered. However, images corrupted by some other cause often resemble steganographic images; in fact, the effects of corruption and the effects of an unknown method of steganography can be hard to distinguish. Checking for steganography is a judgment call that should take into account the users of the system under investigation, the time available, and the information already obtained (the presence of steganography programs, data expected but not found). In any case, if there is time for it, a check for steganography is always an option, as it should not corrupt or change any other forensic information.

Tools for detecting steganography (usually digital steganography) often rely on statistical approaches: building a library of known steganographic and clean files, then comparing new files to the information gathered on the known ones. These methods rely on probability, looking for features similar to those of known files. They are accurate in the normal case but have a hard time with outliers, and they can generally catch only steganographic methods they have been trained on. One advantage of statistical tools is that, in some cases, they become more accurate over large data sets. However, this approach still has the fundamental flaw of being based on statistical analysis, which will always fail for outliers in the data.
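To make the two ideas above concrete, here is a hedged sketch, assuming an 8-bit grayscale image stored as a flat list of pixel values. The embedding function hides message bits in the least significant bit of each pixel; the detector learns one toy feature (the fraction of 1s in the LSB plane) from labeled examples and labels new files by whichever class mean is nearer. All function names are illustrative, not from any real tool.

```python
import random

def embed_lsb(pixels, message):
    """Hide message bytes, MSB-first, in the lowest bit of each pixel."""
    bits = [(b >> i) & 1 for b in message for i in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for message")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # changes each pixel by at most 1
    return stego

def lsb_ones_fraction(pixels):
    """Toy statistical feature: share of pixels with LSB set."""
    return sum(p & 1 for p in pixels) / len(pixels)

def train(clean_files, stego_files):
    """Learn per-class means of the feature from labeled examples."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean([lsb_ones_fraction(f) for f in clean_files]),
            mean([lsb_ones_fraction(f) for f in stego_files]))

def classify(pixels, model):
    """Label a new file by the nearer class mean."""
    clean_mean, stego_mean = model
    f = lsb_ones_fraction(pixels)
    return "stego" if abs(f - stego_mean) < abs(f - clean_mean) else "clean"
```

On covers whose LSB plane is biased, embedding random-looking data pushes the fraction toward 0.5, which is exactly the kind of statistical shift these tools look for; a clean file that already happens to have balanced LSBs is the outlier that defeats it, as noted above. Real steganalysis tools replace this single toy feature with large statistical feature sets.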
Such statistical analysis is the most widely used approach to finding steganography.

Another method for finding steganography relies on the fact that the changes steganography makes to a given file are often similar regardless of the data embedded into it. The RQP method takes an input file, analyzes it, applies steganography to it, then analyzes it again. If the features of the file change drastically after the data is embedded, then the original file likely did not contain a steganographic message. However, if the features of the file do not change much, then it is likely that the original file already contained embedded data. The RQP method in particular relies on the ratio of close color pairs to all color pairs in the image. If this ratio changes significantly after data is embedded, then it is likely that nothing was embedded originally. The exact reasons for this change are not important to a digital forensic investigator; what matters is that RQP can be somewhat accurate without much baseline training, and that it relies only on the current file.

Another, mostly theoretical, system for steganography detection is based on genetic algorithms, relying on the emergent properties of complex systems to find ways to detect constantly changing steganography. This method, called Computer Immune System (CIS) detection, should be very powerful, with the ability to adapt to new steganographic methods. The problem, however, is providing positive feedback to develop new detection: for a CIS system to learn new types of steganography it has to select for them, which implies that the system receives feedback on steganography that has not yet been detected. CIS systems could be valuable for more efficient detection, but they do not avoid the problem of training.

Steganography detection is much like virus detection. Both fields face large amounts of data which may or may not hold important information; because the important information is not known, a simple search will not suffice. Just as much of virus detection is based on virus signatures, the simplest forms of steganography detection rely on the signatures left by certain embedding methods. It is possible that any universal method for one field could easily be applied to the other.

One area where computers have not had a great effect is language-based steganography.
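Before turning to language, the close-color-pair ratio at the heart of RQP can be sketched roughly, for an image given as a list of (r, g, b) tuples. "Close" is taken here to mean every channel differs by at most one level; that threshold, and the exhaustive scan over all unique-color pairs, are simplifications of the published method.

```python
from itertools import combinations

def close_pair_ratio(pixels):
    """Fraction of unique-color pairs that are 'close' (all channels
    within one level). A crude stand-in for the RQP statistic."""
    colors = list(set(pixels))            # the image's unique colors
    pairs = list(combinations(colors, 2))
    if not pairs:
        return 0.0
    close = sum(1 for c1, c2 in pairs
                if all(abs(a - b) <= 1 for a, b in zip(c1, c2)))
    return close / len(pairs)
```

In an RQP-style test this ratio is computed before and after embedding test data. LSB embedding tends to create many new near-duplicate colors, so a ratio that barely moves after the test embedding suggests the file already carried data.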
While there are methods for finding some types of digital steganography, there is no similar system for steganography that is human-readable. Techniques such as RQP are focused on images and rely heavily on the idea that the embedding is done in the least significant bit. Since human languages are so flexible to begin with, finding odd patterns (which could indicate steganography) in a given passage is very hard to do systematically. This is one place where humans can outshine computers, as a human reader is much more likely to recognize strange syntactic patterns.

One excellent example of this is Spam Mimic, a web-based tool that embeds a short message in what looks like a typical spam message. The resulting message is just as intelligible as normal spam, with the same loose grasp of the English language, and any human reader can read it about as easily as normal spam. In my opinion this is one of the more devious steganographic attacks, since spam is so common as to be ignored and the message is embedded in a way that makes computer detection very challenging. As a forensic investigator, a spam mail folder would likely not get much of my attention past a quick search for out-of-place files, but it would be trivially easy to collaborate in an illegal business via messages embedded in spam.

In the most abstract sense, steganography detection is pattern recognition: if a given data set has a statistically unlikely trend that could be used to encode information, it is possibly a carrier for a steganographic message. This problem is not easy to solve, as it is impossible to decide whether an odd trend is a natural part of the data or an embedded message. It is possible (though astonishingly unlikely) for any image file of sufficient size to have a trend that could be interpreted as embedded data. Detection becomes much easier when the manipulation that creates the embedded message has a clear fingerprint, but in the case of steganography that does nothing but embed (such as some language-based steganography), the presence or absence of a message can depend entirely on the technique used to attempt decoding. Even when steganography can be detected, it can be hard to know which embedding technique was used.
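The first-letter scheme from the opening example shows how completely the message can depend on the decoding technique. "Decoding" is trivial once the convention is known, and there is nothing to find otherwise; a sentence is just a sentence.

```python
def decode_first_letters(sentence):
    """Read the acrostic scheme from the opening example: each word's
    first letter is one symbol of the hidden phrase. Without knowing
    this convention, there is no statistical anomaly to flag."""
    return "".join(word[0] for word in sentence.split())
```

Applied to the opening example, decode_first_letters("Survey a great elephant again") yields "Sagea": the letters of "Sage A" with the word boundary lost, a reminder that even the framing of the hidden message must be agreed on in advance.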
Even a universal steganography detector (a fanciful invention) would be of limited use unless it could also identify the type of steganography used, which is nearly impossible.

In conclusion, the biggest threat steganography poses to a forensic investigator is that you may not know you've encountered it. Even the best steganalysis tools must be run against individual files to check for the existence of steganography, leaving the investigator little choice but to try steganalysis on every file that might have an embedded message. This is clearly not an efficient procedure, requiring a huge amount of time for large data sets, but unless great advances are made in pattern detection, it is the only choice. An investigator's best tool for finding steganography is knowing where it will be. Once steganography is found, it can be decoded much like a simple encryption, but finding it is the challenge.
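Once the embedding scheme is identified, extraction is mechanical. As a hedged illustration, assuming the common layout of message bytes written MSB-first into the lowest bit of successive 8-bit pixels, recovery is a few lines:

```python
def extract_lsb(pixels, n_bytes):
    """Recover n_bytes written MSB-first into the LSBs of pixels."""
    out = bytearray()
    for i in range(n_bytes):
        byte = 0
        for j in range(8):
            byte = (byte << 1) | (pixels[i * 8 + j] & 1)
        out.append(byte)
    return bytes(out)
```

The layout (bit order, starting offset, message length) is an assumption here; the hard part remains knowing that a message exists and which scheme produced it.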
Bibliography

Fridrich, J., Rui Du, and Long Meng. "Steganalysis of LSB Encoding in Color Images." ICME 2000, 2 August 2000.

Jackson, Jacob, Gregg Gunsch, Roger Claypoole, and Gary Lamont. "Blind Steganography Detection Using a Computational Immune System Approach: A Proposal."

Kessler, Gary. "An Overview of Steganography for the Computer Forensics Examiner." Forensic Science Communications. 19 September 2006. <www.fbi.gov/hq/lab/fsc/backissu/july2004/research/2004_03_research01.htm>

Luo, Xiangyang, Bin Liu, and Fenlin Liu. "Detecting LSB Steganography Based on Dynamic Masks." ISDA 2005, pp. 251-255.