Hadoop Forensics. Presented at SecTor, October 2012. Kevvie Fowler, GCFA Gold, CISSP, MCTS, MCDBA, MCSD, MCSE

1 Hadoop Forensics
Presented at SecTor, October 2012
Kevvie Fowler, GCFA Gold, CISSP, MCTS, MCDBA, MCSD, MCSE

2 About me
Kevvie Fowler
Day job: Lead the TELUS Intelligent Analysis practice
Night job: Founder and principal consultant, Ringzero
Frequent speaker at major security conferences (Black Hat USA, SecTor, AppSec Asia, Microsoft, etc.)
Industry contributions:

3 Data, data and more data
The world uses and stores a lot of data
  Increased regulatory requirements
  Modernizing of industries
  Disk space is cheap and people store and forget
Industry has experienced 40 to 50% growth in storage over the last 3 years
Several new data management technologies have been released over the past few years to help us manage the problem

4 Technology to the rescue
Data management landscape (SQL, NoSQL, NewSQL, etc.)
Source: blogs.the451group.com

5 The crime scene of today
The digital crime scene has grown
What we historically examined over a year will be dwarfed in a single big data investigation

6 Hadoop: the popular big data solution
Business is looking at Hadoop to solve the big data problem
But wait, aren't there security issues with Hadoop?
Secure or not, investigations will lead there and will need to occur

7 Hadoop overview
What is this Hadoop?
  An open source framework for large-scale data processing
  Moves processing to data
  Leverages MapReduce to take a query, divide it and run it in parallel over multiple commodity nodes
  Hadoop is basically an open source MapReduce implementation
  Supports structured and unstructured data

8 Hadoop architecture
[Diagram: a Name Node (web server, interfaces, Name Node logs, and storage holding the file system image metadata: FSImage files and edit logs), a JobTracker, and three Data Nodes. Each Data Node runs a Task Tracker, logging, a web server and storage daemons, and stores replicated file blocks (File 1 Block 1, File 2 Block 1, File 2 Block 2).]

9 Forensics 101
The standard forensic approach
  Acquire it all, ask questions later
  Future science can be applied against it
  Repeatable analysis
Many Hadoop clusters consist of huge data sets that can't practically be acquired in their entirety
This session will cover Hadoop-specific data reduction techniques
  Cloudera Hadoop
  Hadoop version 2.0
  Hadoop HDFS

10 Hadoop artifacts
Standard OS fare: connections, users, logs (SSH, etc.)
Cluster properties: configuration, size, servers, key settings
Recent activity: jobs (active, historical)
Current state
  Metadata: FSImage, edit logs
  Files: only the necessary ones!
Tools: native Hadoop tools (command line, WebUI)

11 Hadoop artifacts: cluster properties
Cluster properties via command line
  hdfs dfsadmin -report
Results
  Size of Hadoop cluster
  Number/addresses of data nodes

12 Hadoop artifacts: cluster properties
Cluster properties via WebUI
  <address>:50070/dfshealth.jsp

13 Hadoop artifacts: cluster properties
Cluster properties via WebUI (continued)
  <address>:50070/dfshealth.jsp

14 Hadoop artifacts: cluster properties
Cluster properties: modes
  Standalone: single node, everything in one Java process
  Pseudo-Distributed: single node, daemons in separate Java processes
  Fully-Distributed: daemons run on a cluster of machines
Managed via two types of configuration files
  Default (*-default.xml)
  Site-specific (*-site.xml)

15 Hadoop artifacts: cluster properties
Configuration files (/hadoop/conf), with description and key properties
  core-site.xml (general Hadoop properties): fs.trash.interval, fs.trash.checkpoint.interval
  hdfs-site.xml (HDFS properties): dfs.replication, dfs.blocksize, dfs.permissions.enabled, hadoop.tmp.dir, dfs.namenode.name.dir, dfs.namenode.checkpoint.dir, dfs.datanode.data.dir
  mapred-site.xml (MapReduce properties): mapreduce.task.tmp.dir, mapreduce.jobhistory.done-dir
  log4j.properties (logging properties): hadoop.mapred.jobtracker, hadoop.mapred.tasktracker, namenode.fsnamesystem.audit
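
The key properties listed on this slide can be pulled out of a collected *-site.xml file with a few lines of script. The following is a minimal Python sketch, not a tool from the presentation. It relies only on the standard Hadoop configuration layout (property elements with name and value children inside a configuration element). The function names and the sample XML are made up for illustration, and a property may live in a different file than the one the slide groups it under.

```python
import xml.etree.ElementTree as ET

# Property names taken from the slide; everything else here is illustrative.
KEY_PROPERTIES = [
    "fs.trash.interval",
    "fs.trash.checkpoint.interval",
    "dfs.replication",
    "dfs.blocksize",
    "dfs.permissions.enabled",
    "hadoop.tmp.dir",
    "dfs.namenode.name.dir",
    "dfs.namenode.checkpoint.dir",
    "dfs.datanode.data.dir",
]


def read_site_properties(xml_text):
    """Return {name: value} for every <property> in a *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def key_settings(xml_text, wanted=KEY_PROPERTIES):
    """Keep only the properties of forensic interest that are actually set."""
    props = read_site_properties(xml_text)
    return {name: props[name] for name in wanted if name in props}


# Made-up sample configuration, not taken from a real cluster.
SAMPLE_SITE_XML = """<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.namenode.name.dir</name><value>/data/nn</value></property>
  <property><name>some.other.setting</name><value>x</value></property>
</configuration>"""

print(key_settings(SAMPLE_SITE_XML))
```

Properties missing from the site file fall back to the values in the matching *-default.xml, so an empty result does not mean the setting is unset.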

16 Hadoop artifacts: recent activity
Recent activity: jobs
  Jobs are executed from a client JAR file
  Each job will have one or more subtasks managed by the task trackers
  Two files are created in conjunction with each executed job
    Job configuration XML file: contains the job configuration as specified when the job is launched
    Job status file: contains the counters, status, start/stop time, task attempt details, etc.

17 Hadoop artifacts: recent activity
Recent activity: job history
  mapred job -list all
Lists all jobs, including those still active (yet to complete)
  JobID, StartTime, executing user
JobID can be used to view the job status file, which contains job execution details

18 Hadoop artifacts: recent activity
Recent activity: job JAR file

19 Hadoop artifacts: recent activity
Recent activity: job history
  Job configuration file (/history)

20 Hadoop artifacts: recent activity
Recent activity: job history
  Job status file (/history)

21 Artifact preservation: metadata
Metadata: FSImage file
Contains a listing of all directories/files and associated metadata
  Name
  Replication level
  Modification time (files & directories)
  Access time
  Block size
  Permissions (directories only)
  Etc.
2 saved FSImage files by default

22 Artifact preservation: metadata
Metadata: FSImage file
  _nn: the last transaction (and all prior) that modified the image

23 Current state of Hadoop: FSImage file
Metadata: FSImage file
Recommended: force a checkpoint prior to gathering the FSImage to refresh the image file
  hdfs secondarynamenode -checkpoint force
  Beware: requires the cluster to be taken offline!
Offline Image Viewer (OIV) can be used to gather the current FSImage file
  hdfs oiv -i "<file and path>" -o "<file and path>" -p Delimited -delimiter " "
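
Once the FSImage has been dumped in delimited form, the listing can be narrowed to the files relevant to an investigation window. The following is a minimal Python sketch of that data reduction step, not part of the presentation. It assumes tab-delimited rows with the path in column 0 and the modification time in column 2, formatted as YYYY-MM-DD HH:MM. Those column positions and the time format are assumptions to verify against the real OIV output. The function name and the sample rows are made up.

```python
import csv
import io
from datetime import datetime


def modified_between(delimited_text, start, end, delimiter="\t",
                     path_col=0, mtime_col=2, time_format="%Y-%m-%d %H:%M"):
    """Return paths whose modification time falls within [start, end].

    Column positions and the time format are assumptions, not confirmed
    properties of the OIV Delimited processor.
    """
    hits = []
    reader = csv.reader(io.StringIO(delimited_text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= max(path_col, mtime_col):
            continue  # short or empty row
        try:
            mtime = datetime.strptime(row[mtime_col], time_format)
        except ValueError:
            continue  # header row or a value that is not a timestamp
        if start <= mtime <= end:
            hits.append(row[path_col])
    return hits


# Made-up sample rows: path, replication, mtime, atime, block size.
SAMPLE_LISTING = (
    "/user/alice/report.csv\t3\t2012-09-30 14:05\t2012-09-30 14:05\t134217728\n"
    "/user/bob/old.log\t3\t2011-01-02 09:00\t2011-01-02 09:00\t134217728\n"
)

print(modified_between(SAMPLE_LISTING, datetime(2012, 9, 1), datetime(2012, 10, 31)))
```

Filtering on the metadata listing first keeps the later block-level acquisition limited to the files that matter.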

24 Current state of Hadoop: FSImage file
Metadata: FSImage file
Results

25 Current state of Hadoop: edit log
Metadata: edit log
All writes to files on disk are first completed within the edit log
1M edit operations are retained by default

26 Current state of Hadoop: edit log
Metadata: edit log (continued)
Operations within the Hadoop edit log
  OP_INVALID, OP_ADD, OP_RENAME_OLD, OP_DELETE, OP_MKDIR, OP_SET_REPLICATION,
  OP_DATANODE_ADD, OP_DATANODE_REMOVE, OP_SET_PERMISSIONS, OP_SET_OWNER, OP_CLOSE,
  OP_SET_GENSTAMP, OP_SET_NS_QUOTA, OP_CLEAR_NS_QUOTA, OP_TIMES, OP_SET_QUOTA,
  OP_RENAME, OP_CONCAT_DELETE, OP_SYMLINK, OP_GET_DELEGATION_TOKEN,
  OP_RENEW_DELEGATION_TOKEN, OP_CANCEL_DELEGATION_TOKEN, OP_UPDATE_MASTER_KEY,
  OP_REASSIGN_LEASE, OP_END_LOG_SEGMENT, OP_START_LOG_SEGMENT, OP_UPDATE_BLOCKS

27 Current state of Hadoop: edit log
Metadata: edit log (continued)
  hdfs oev -i <file and path> -o <file and path> -p stats
Offline Edits Viewer (OEV) with the stats processor dumps a summary of operations within the edit log
OEV can be run online

28 Current state of Hadoop: edit log
Metadata: edit log (continued)
Offline Edits Viewer (OEV) without the stats processor dumps operations performed after the last FSImage file
  hdfs oev -i "<edit filepath>" -o "<out file>" -v

29 Current state of Hadoop: edit log
Edit log (continued)
Example of a file copied from the local system to Hadoop
  Operation
  Transaction ID
  File created
  Modification time
  Access time
  Allocated block
  File permissions
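
A dumped edit log can be summarized and searched offline once OEV has written it out. The following is a minimal Python sketch, not part of the presentation. It assumes the dump is XML with one RECORD element per operation, each holding an OPCODE and a DATA element with TXID and PATH children. That layout is an assumption to check against the actual OEV output. The function names and the sample records are made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_opcodes(oev_xml_text):
    """Tally how many times each operation appears in the dumped edit log."""
    root = ET.fromstring(oev_xml_text)
    return Counter(rec.findtext("OPCODE") for rec in root.iter("RECORD"))


def paths_for(oev_xml_text, opcode):
    """Return (transaction id, path) pairs for every record with this opcode."""
    root = ET.fromstring(oev_xml_text)
    pairs = []
    for rec in root.iter("RECORD"):
        if rec.findtext("OPCODE") == opcode:
            # Element names below are assumed, not confirmed from the slides.
            pairs.append((rec.findtext("DATA/TXID"), rec.findtext("DATA/PATH")))
    return pairs


# Made-up sample dump for illustration only.
SAMPLE_EDITS = """<EDITS>
  <RECORD><OPCODE>OP_START_LOG_SEGMENT</OPCODE><DATA><TXID>1</TXID></DATA></RECORD>
  <RECORD><OPCODE>OP_ADD</OPCODE><DATA><TXID>2</TXID><PATH>/user/alice/a.txt</PATH></DATA></RECORD>
  <RECORD><OPCODE>OP_DELETE</OPCODE><DATA><TXID>3</TXID><PATH>/user/alice/a.txt</PATH></DATA></RECORD>
</EDITS>"""

print(count_opcodes(SAMPLE_EDITS))
print(paths_for(SAMPLE_EDITS, "OP_DELETE"))
```

Listing OP_DELETE and OP_RENAME records by transaction ID gives a quick timeline of what was removed or moved since the last FSImage.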

30 Impersonation overview
Current state of Hadoop: file retrieval
hadoop distcp
  Uses MapReduce to move large groups of files
  Parallel file transfer
  -p[rbugp] Preserve
    r: replication number
    b: block size
    u: user
    g: group
    p: permission
    pt: last modification and last access times (planned for a future release)
Fast but not the best option
  Pollutes job and task history
  Does not preserve timestamps

31 Hadoop overview: what is it
The fsck command can be used to list the blocks associated with a given file
  hdfs fsck /<file> -files -blocks -racks

32 Hadoop overview: what is it
Current state of Hadoop: file retrieval (continued)
Navigate the local FS of the specified data node to the block location
Blocks (files) can be imaged directly from the local FS
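
The two steps above (list a file's blocks with fsck, then find those blocks on a data node's local file system) can be scripted. The following is a minimal Python sketch, not part of the presentation. It assumes block names appear in the fsck output as blk_ followed by digits and that the block file on disk carries the same name. The sample fsck text, the directory layout and the function names are made up for illustration.

```python
import os
import re
import tempfile

# Matches block names such as blk_1073741825; the trailing generation
# stamp (for example _1001) is deliberately left out of the match.
BLOCK_RE = re.compile(r"blk_-?\d+")


def block_names(fsck_output):
    """Return the distinct block names mentioned in fsck output, in order."""
    seen = []
    for name in BLOCK_RE.findall(fsck_output):
        if name not in seen:
            seen.append(name)
    return seen


def locate_blocks(data_dir, names):
    """Walk a data node's storage directory and map block name to file path."""
    wanted = set(names)
    found = {}
    for dirpath, _dirs, files in os.walk(data_dir):
        for fname in files:
            if fname in wanted:
                found[fname] = os.path.join(dirpath, fname)
    return found


# Made-up fsck output for illustration only.
SAMPLE_FSCK = (
    "/user/alice/a.txt 1024 bytes, 1 block(s):  OK\n"
    "0. BP-1-10.0.0.1-1:blk_1073741825_1001 len=1024 repl=3\n"
)

# Demonstration against a throwaway directory standing in for dfs.datanode.data.dir.
with tempfile.TemporaryDirectory() as data_dir:
    block_dir = os.path.join(data_dir, "current", "finalized")  # hypothetical layout
    os.makedirs(block_dir)
    for fname in ("blk_1073741825", "blk_1073741825_1001.meta"):
        open(os.path.join(block_dir, fname), "wb").close()
    print(locate_blocks(data_dir, block_names(SAMPLE_FSCK)))
```

On a real data node the search would start from the directories configured in dfs.datanode.data.dir, and the located files would be imaged read-only rather than opened in place.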

33 Hadoop overview: what is it
Hadoop Trash
  Often enabled by organizations
  When enabled, deleted files are not really deleted
  Files are moved to the Trash location (../user/<username>/.Trash), where they are scheduled for deletion at a later date/time

34 References
hadoop.apache.org

35 Thank-you!
Additional information and questions:
