The K-Medoids clustering algorithm solves the K-Means algorithm's problem with outlier samples, but its time complexity makes it unable to process big data. MapReduce is a parallel programming model for processing big data and has been implemented in Hadoop. To break the big-data limits, the parallel K-Medoids algorithm HK-Medoids, based on Hadoop, was proposed. Every submitted job consists of many iterative MapReduce procedures: in the map phase, each sample is assigned to the cluster whose center is most similar to it; in the combine phase, an intermediate center for each cluster is calculated; and in the reduce phase, the new center is calculated. The iteration stops when the new centers are similar to the old ones. Experimental results showed that the HK-Medoids algorithm achieves good clustering results and linear speedup on big data.
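The three phases described above can be sketched in a minimal single-process form. This is an illustrative sketch, not the paper's HK-Medoids implementation: the function names, the Euclidean distance metric, and the mean-based intermediate center are assumptions made for clarity.

```python
# Single-process sketch of one HK-Medoids MapReduce iteration.
# Euclidean distance and the mean-based intermediate center are
# illustrative assumptions; the paper's exact update rule may differ.

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def map_phase(samples, medoids):
    # Map: assign each sample to the nearest medoid (cluster id -> samples).
    clusters = {i: [] for i in range(len(medoids))}
    for s in samples:
        i = min(range(len(medoids)), key=lambda i: distance(s, medoids[i]))
        clusters[i].append(s)
    return clusters

def combine_phase(cluster):
    # Combine: intermediate center as the component-wise mean of the cluster.
    n = len(cluster)
    return tuple(sum(c) / n for c in zip(*cluster))

def reduce_phase(cluster, center):
    # Reduce: new medoid is the cluster sample closest to the intermediate center.
    return min(cluster, key=lambda s: distance(s, center))

def iterate(samples, medoids):
    # One full map -> combine -> reduce pass; repeat until medoids stabilize.
    clusters = map_phase(samples, medoids)
    return [reduce_phase(c, combine_phase(c)) for c in clusters.values() if c]
```

In the real algorithm, `map_phase` runs on each data split in parallel, `combine_phase` runs locally on each mapper's output, and `reduce_phase` merges the intermediate centers, so only small per-cluster summaries cross the network.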
Flow characteristics describe the pattern and trend of network traffic; they help network operators understand network usage and user behavior, and are especially useful for those concerned with network capacity planning, traffic engineering, and fault handling. Due to the large scale of datacenter networks and the explosive growth of traffic volume, it is hard to collect, store, and analyze Internet traffic on a single machine. Hadoop has become a popular infrastructure for massive data analytics because it provides scalable data processing and storage services on a distributed computing system built from commodity hardware. In this paper, we present a Hadoop-based traffic analysis system that accepts input from multiple data traces and performs flow identification, characteristics mining, and flow clustering; the system's output provides guidance for resource allocation, flow scheduling, and other tasks. An experiment on a dataset of about 8 GB from a university datacenter network shows that the system can finish flow characteristics mining on a four-node cluster within 23 minutes.
This work introduces a new task preemption primitive for Hadoop that allows tasks to be suspended and resumed by exploiting existing memory management mechanisms readily available in modern operating systems. Our technique fills the gap between the two extreme cases of killing tasks (which wastes work) and waiting for their completion (which introduces latency): experimental results indicate superior performance and very small overheads compared to existing alternatives.
As the volume of available data continues to grow rapidly from a variety of sources, scalable and performant analytics solutions have become an essential tool to enhance business productivity and revenue. Existing data analysis environments, such as R, are constrained by the size of main memory and cannot scale in many applications. This paper introduces Big R, a new platform which enables accessing, manipulating, analyzing, and visualizing data residing on a Hadoop cluster from the R user interface. Big R is inspired by R semantics and overloads a number of R primitives to support big data. Hence, users are able to quickly prototype big data analytics routines without needing to learn a new programming paradigm. The current Big R implementation works on two main fronts: (1) data exploration, which enables R as a query language for Hadoop, and (2) partitioned execution, allowing the execution of any R function on smaller pieces of a large dataset across the nodes in the cluster.
In this paper, we propose a novel spatial data index based on Hadoop: HQ-Tree. In HQ-Tree, we use a PR QuadTree to solve the problem of poor efficiency in parallel processing, which is caused by data insertion order and space overlapping. Because HDFS does not support random writes, we propose an updating mechanism, called "Copy Write", to support index updates. Additionally, HQ-Tree employs a two-level index caching mechanism to reduce the cost of network transfers and I/O operations. Finally, we develop MapReduce-based algorithms that significantly enhance the efficiency of index creation and query. Experimental results demonstrate the effectiveness of our methods.
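The general idea behind a copy-based update on an append-only store can be sketched as follows. This is a hedged illustration of the pattern, not the paper's "Copy Write" mechanism: the JSON index format and the atomic-rename step are assumptions, and a real HDFS implementation would use HDFS file operations rather than the local filesystem.

```python
# Sketch of a copy-on-write style index update for a store that, like
# HDFS, forbids random in-place writes: read the current index, write an
# updated copy to a new file, then atomically switch to the new copy.
# The JSON format and local-filesystem rename are illustrative assumptions.
import json
import os
import tempfile

def copy_write_update(index_path, update_fn):
    # Load the current index; the original file is never modified in place.
    with open(index_path) as f:
        index = json.load(f)
    new_index = update_fn(dict(index))
    # Write the updated copy next to the original, then atomically replace it.
    dir_ = os.path.dirname(os.path.abspath(index_path))
    fd, tmp = tempfile.mkstemp(dir=dir_)
    with os.fdopen(fd, "w") as f:
        json.dump(new_index, f)
    os.replace(tmp, index_path)
```

Readers either see the old index or the new one, never a half-written file, which is what makes the pattern workable on write-once storage.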
The current implementation of Hadoop is based on the assumption that all the nodes in a Hadoop cluster are homogeneous. Data in a Hadoop cluster is split into blocks, which are replicated according to the replication factor. Service time for jobs that access data stored in Hadoop increases considerably when the number of jobs is greater than the number of copies of the data and when the nodes in the cluster differ greatly in their processing capabilities. This paper addresses dynamic data rebalancing in a heterogeneous Hadoop cluster. Rebalancing is done by replicating data dynamically, with minimum data movement cost, based on the number of incoming parallel MapReduce jobs. Our experiments indicate that, as a result of dynamic data rebalancing, the service time of MapReduce jobs was reduced by over 30% and resource utilization increased by over 50% compared with stock Hadoop.
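The core replication decision can be sketched simply. This is an illustrative approximation under stated assumptions: replicas scale with the number of concurrent readers, capped by cluster size, and new replicas go to the least-loaded nodes; the paper's actual cost model is not reproduced here.

```python
# Illustrative sketch of dynamic replication: scale a block's replica
# count with the number of parallel jobs reading it, capped by cluster
# size, and place new replicas on the least-loaded nodes that lack the
# block. The paper's exact data-movement cost model is not reproduced.

def desired_replicas(incoming_jobs, current_replicas, num_nodes):
    # More parallel readers than copies -> add replicas, at most one per node.
    return max(current_replicas, min(incoming_jobs, num_nodes))

def pick_targets(num_new, node_load, holders):
    # Choose the least-loaded nodes that do not already hold the block,
    # approximating "minimum data movement cost".
    candidates = [n for n in node_load if n not in holders]
    candidates.sort(key=lambda n: node_load[n])
    return candidates[:num_new]
```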
Hadoop is a convenient framework in e-Science enabling scalable distributed data analysis. In molecular biology, next-generation sequencing produces vast amounts of data and requires flexible frameworks for constructing analysis pipelines. We extend the popular HTSeq package into the Hadoop realm by introducing massively parallel versions of short-read quality assessment as well as functionality to count genes mapped by the short reads. We use the Hadoop Streaming library, which allows the components to run in both Hadoop and regular Linux systems, and evaluate their performance in two different execution environments: a single node on a computational cluster and a Hadoop cluster in a private cloud. We compare the implementations with Apache Pig, showing improved runtime performance of our developed methods. We also integrate the components into the graphical platform Cloudgene to simplify user interaction.
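The Hadoop Streaming style used above can be illustrated with a minimal per-gene read-counting pair of functions, in the spirit of htseq-count. This is a sketch, not the paper's pipeline: the tab-separated `(read_id, gene)` input format and the `*` marker for unmapped reads are assumptions made for the example.

```python
# Minimal Hadoop Streaming-style sketch of per-gene read counting.
# Input format (read_id TAB gene, with "*" meaning unmapped) is an
# illustrative assumption, not the paper's exact data layout.

def mapper(lines):
    # Map: emit one (gene, 1) pair per mapped read.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] != "*":
            yield fields[1], 1

def reducer(pairs):
    # Reduce: Streaming delivers pairs sorted by key; sum counts per gene.
    current, total = None, 0
    for gene, n in pairs:
        if gene != current:
            if current is not None:
                yield current, total
            current, total = gene, 0
        total += n
    if current is not None:
        yield current, total
```

Packaged as two stdin/stdout scripts, the same code runs under Hadoop Streaming or as a plain Linux pipeline (`cat reads.tsv | mapper.py | sort | reducer.py`), which is exactly the portability the abstract exploits.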
The Hadoop Distributed File System (HDFS) is the core component of the Apache Hadoop project. In HDFS, computation is carried out on the nodes where the relevant data is stored. Hadoop also implements a parallel computation paradigm named MapReduce. In this paper, we have measured the performance of read and write operations in HDFS for both small and large files. For the performance evaluation, we used a Hadoop cluster with five nodes. The results indicate that HDFS performs well for files larger than the default block size and poorly for files smaller than it.
Hadoop is based on the map-reduce approach, in which an application is divided into small fragments of work, each of which may be executed on any node in the cluster. Hadoop is a very efficient tool for storing and processing unstructured, semi-structured, and structured data. Unstructured data usually refers to data stored in files rather than in the traditional row-and-column way; examples include e-mail messages, videos, audio files, photos, web pages, and many other kinds of business documents. Our work primarily focuses on detecting malware in unstructured data stored in a Hadoop Distributed File System environment. Here we use ClamAV's updated free virus signature database. We also propose a fast string search algorithm based on the map-reduce approach.
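The map-reduce structure of such a signature scan can be sketched as follows. This is a simplified illustration under stated assumptions: signatures are treated as plain byte strings, whereas the paper's algorithm and ClamAV's actual signature formats (hashes, hex patterns with wildcards) are considerably more elaborate.

```python
# Simplified map-reduce sketch of signature-based malware scanning.
# Signatures as plain byte substrings are an illustrative assumption;
# real ClamAV signatures include hashes and wildcard hex patterns.

def mapper(filename, data, signatures):
    # Map: emit (filename, signature) for each signature found in this
    # file split; splits can be scanned on any node in parallel.
    for sig in signatures:
        if sig in data:
            yield filename, sig

def reducer(matches):
    # Reduce: group the matched signatures per file into a report.
    report = {}
    for filename, sig in matches:
        report.setdefault(filename, set()).add(sig)
    return report
```

A production version would also handle signatures that straddle split boundaries, e.g. by overlapping splits by one signature length minus one byte.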
Today, network traffic has increased because of the appearance of various applications and services. However, methods for network traffic analysis have not kept pace with this growing network usage. Most traffic analysis methods operate in a single-server environment, which imposes limits on memory, processing speed, and storage capacity. Given the growth of network traffic, we need an analysis method that can handle big-data-scale traffic, and a Hadoop system can be used effectively for this purpose. In this paper, we propose a method for application traffic classification on a Hadoop distributed computing system and compare the processing time of the proposed system with that of a single-server system to show the advantages of Hadoop.
Even as Web 2.0 grows, e-mail continues to be one of the most used forms of communication on the Internet and is responsible for the generation of huge amounts of data; spam traffic, for example, accounts for terabytes of data daily. It becomes necessary to create tools that can process these data efficiently, in large volumes, in order to understand their characteristics. Although mail servers are able to receive and store messages as they arrive, applying complex algorithms to a large set of mailboxes, whether for characterization, security reasons, or data mining goals, is challenging. Big data processing environments such as Hadoop are useful for the analysis of large data sets, although they were originally designed to handle text files in general. In this paper we present a Hadoop extension used to process and analyze large sets of e-mail organized in mailboxes. To evaluate it, we used gigabytes of real spam traffic data collected around the world and showed that our approach efficiently processes large amounts of mail data.
B. E (Computer Science)
B. E (Electronics and Communication)
B. E (Electrical and Electronics Eng.)
B. E (Information Technology)
B. E (Instrumentation Control and Eng.)
M. E (Computer Science)
M. E (Power Electronics)
M. E (Control System)
M. E (Software Engg)
M. E (Applied Electronics)
M. Sc (IT, IT&M, CS&M, CS)
B. Sc (IT, CS)