This write-up summarizes the benefits, design and framework, programming and testing of clouds, both as a service and as an infrastructure.
What is Cloud Computing?
Cloud computing has made a great impact on the IT industry. Data has moved away from personal computers and enterprise application servers to be clustered on the cloud.
Cloud computing is a model that provides a convenient way to access and consume a shared pool of resources containing a wide variety of services (storage, networks, servers, applications, etc.), and that too on demand. Additionally, provisioning and releasing services is easy to manage and does not always require the service provider's intervention.
For this, clouds use large clusters of servers that deliver low-cost technology benefits to consumers by using specialized data connections for data processing. Virtualization is often used to multiply the potential of cloud computing.
It has three delivery models:

Infrastructure as a Service (IaaS)
1. It is the basic layer of the cloud.
2. Servers, networks and storage are provided by the service provider.
3. Software and everything above the infrastructure are the cloud consumer's responsibility.

Platform as a Service (PaaS)
1. The consumer has no control over the underlying infrastructure.
2. The service provider supplies a platform, e.g. a web server, a database, or a content management tool like WordPress, which helps in application development.
3. Here you get a virtual machine with all the necessary software installed.

Software as a Service (SaaS)
1. The whole application is outsourced to the cloud provider.
2. It is the provider's responsibility to manage licensing and access-related issues.
3. Examples are Google Docs or hosted email services.
Types of Clouds:

Public
1. Services are available to everyone.
2. The service provider uses the internet and offers its applications to the widest group of users.

Private
1. Services (equipment and data centres) are private to the organization.
2. Secure access is given to the organization's users.

Hybrid
1. A mixture of both models.
2. Some of the organization's services can be used by everyone, while others remain private to internal users.
Cloud computing has clear benefits, but there are concerns too: will data integrity be maintained, will data be secure, will it stay private, and will services be available to everyone at all times? This is where testing comes in.
Types of Testing in Cloud Computing:

Testing a Cloud

Functional Testing
1. System Verification Testing: functional requirements are tested.
2. Acceptance Testing: users test the system to confirm it meets their requirements.
3. Interoperability Testing: the application should keep functioning well even when moved away from the cloud or to another environment.

Non-Functional Testing
1. Availability Testing: it is the cloud vendor's responsibility to ensure the cloud runs without sudden downtime and without affecting the client's business.
2. Security Testing: making sure there is no unauthorized access and that data integrity is maintained.
3. Performance Testing: stress and load testing to make sure performance remains intact under both peak load and reduced load.
4. Multi-Tenancy Testing: verifying that services are available to multiple clients at the same time and that data is kept secure to avoid access-level conflicts.
5. Disaster Recovery Testing: verifying that services are restored after a failure with a short recovery time and with no harm to the client's business.
6. Scalability Testing: verifying that services can be scaled up or down as needed.
7. Interoperability Testing: it should be easy to move a cloud application from one environment/platform to another.
How does a Cloud store and process data?
Hadoop and MapReduce:
Earlier, when data was manageable, it was stored in databases with defined schemas and relations. As data grew into big data, terabytes and petabytes with a characteristic unlike regular data ("write once, read many", WORM), Google introduced GFS (the Google File System), which was not open source. Google also developed a new programming model called MapReduce, a software framework that allows programs to process stupendous amounts of unstructured data in parallel across a distributed cluster of processors. In addition, Google introduced BigTable: a distributed storage system for managing structured data that scales to very large sizes, petabytes of data across thousands of commodity servers.
Later, the Hadoop Distributed File System (HDFS) was developed; it is open source and distributed by Apache. The software framework used is MapReduce, and the whole project is called Hadoop.
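To make the programming model concrete, here is a minimal word-count sketch written against the Hadoop MapReduce Java API. The class names (WordCountMapper, WordCountReducer) are my own, used only for illustration:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

The map step runs in parallel on every split of the input, and the reduce step aggregates the intermediate (word, 1) pairs; that is the essence of the model described above.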
MapReduce uses four entities:

Client: submits the MapReduce (MR) job (a driver sketch follows below).
Jobtracker: coordinates the job run. It is a Java application whose main class is JobTracker.
Tasktracker: runs the tasks that the job is divided into.
Distributed file system: (commonly HDFS) used to share job files among the entities.
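To show the client's side of this, here is a minimal, hedged driver sketch (the class name and command-line arguments are illustrative) that configures and submits a job; Hadoop then schedules its map and reduce tasks on the tasktrackers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");           // the client builds the job...
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);                // mapper/reducer from the sketch above
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input lives on the distributed file system
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ...and submits it; the framework then schedules the tasks across the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}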
Properties of HDFS:

Large: it consists of thousands of server machines, each storing a fragment of the system's data.
Replication: each data block is replicated a number of times, three by default (see the sketch after this list).
Failure: failure is not treated as an exception; it is the norm.
Fault tolerance: faults are detected and recovered from quickly and automatically.
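As an illustration of the replication property, a client can inspect or change the replication factor of a file through the HDFS FileSystem API. This is a minimal sketch, and the path used is just an example, not something from the original post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");        // example path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());  // typically 3

        fs.setReplication(file, (short) 2);              // ask HDFS to keep only 2 replicas of this file
    }
}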
Hadoop does not waste time diagnosing slow-running tasks; instead, it detects when a task is running slower than expected and launches a replica of it as a backup (speculative execution).
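This behaviour can be toggled per job. The sketch below assumes the Hadoop 2.x property names mapreduce.map.speculative and mapreduce.reduce.speculative (older releases used the mapred.* equivalents):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Launch backup copies of straggling map tasks, but not of reduce tasks.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "speculative execution demo");
        // ...set mapper, reducer, input and output paths as usual...
    }
}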
Apache HBase:
HBase is the Hadoop database, an open-source implementation of BigTable. It is used when big data needs real-time, random (read/write) access. It hosts very large tables, billions of rows by millions of columns, and is an open-source, distributed store for structured data. It is a NoSQL database that stores data as key/value pairs in columns, whereas HDFS uses flat files. So it combines the scalability of Hadoop (by running on HDFS), real-time and random data access (through its key/value store), and the problem-solving properties of MapReduce.
HBase uses a four-dimensional data model, and these four coordinates define each cell:

Row Key: every row has a unique key; the row key has no data type and is treated internally as a byte array.
Column Family: data inside a row is organized into column families; every row has the same set of column families, but across rows the same column family does not need to hold the same column qualifiers. HBase stores each column family in its own data file, column families have to be defined upfront, and it is hard to change them later.
Column Qualifier: column families contain columns, which are identified by column qualifiers; column qualifiers can be thought of as the columns themselves.
Version: every column can have a configurable number of versions, and data can be accessed for a specific version of a column qualifier.
HBase allows two types of access: random access to rows through their row key, column family, column qualifier and version, and offline or batch access through MapReduce queries. This dual approach makes it very powerful.
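As a sketch of the four coordinates in code, the example below uses the HBase client API; the table name "users" and the column family/qualifier "info:email" are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key + column family + column qualifier
            // (the version/timestamp is assigned automatically unless supplied).
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("someone@example.com"));
            table.put(put);

            // Random access by row key, narrowed to one family/qualifier.
            Get get = new Get(Bytes.toBytes("user-42"));
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
            Result result = table.get(get);
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}

The batch side of the dual approach would instead run a MapReduce job over a table scan, which is what the next section's testing advice applies to.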
QA testing your MR jobs: which is, in effect, testing the whole cloud
Traditional unit testing frameworks such as JUnit or PyUnit can be used to get started with testing MR jobs. Unit tests are a great way to test MR jobs at the micro level, although they do not test MR jobs as a whole inside Hadoop.
MRUnit is a tool that can be used to unit-test map and reduce functions. MRUnit tests work the same way as traditional unit tests, so they are simple to write and do not require Hadoop to be running. There are some drawbacks to using MRUnit, but the benefits outweigh them.
MRUnit tests are simple, need no external I/O files, and run faster. An illustration of a test class:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class DummyTest {

    private Dummy.MyMapper mapper;
    private Dummy.MyReducer reducer;
    private MapReduceDriver<Text, Text, Text, Text, Text, Text> driver;

    @Before
    public void setUp() {
        mapper = new Dummy.MyMapper();
        reducer = new Dummy.MyReducer();
        driver = new MapReduceDriver<Text, Text, Text, Text, Text, Text>(mapper, reducer);
    }

    @Test
    public void testMapReduce() throws Exception {
        // One map input record in; one reduce output record expected out.
        driver.withInput(new Text("key"), new Text("val"))
              .withOutput(new Text("foo"), new Text("bar"))
              .runTest();
    }
}
Map and Reduce can also be tested separately, and counters can be tested too.
During a job execution, counters tell whether a particular event occurred and how often. Hadoop has four types of counters: file system, job, framework and custom.
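For example, a mapper can be tested on its own with MRUnit's MapDriver, and a custom counter can be checked after the run. The counter enum below, and the assumption that Dummy.MyMapper increments it, are mine for illustration:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Assert;
import org.junit.Test;

public class DummyMapperTest {

    // Hypothetical custom counter the mapper is assumed to increment once per record.
    enum MyCounters { RECORDS_SEEN }

    @Test
    public void testMapperAndCounter() throws Exception {
        MapDriver<Text, Text, Text, Text> mapDriver =
                MapDriver.newMapDriver(new Dummy.MyMapper());

        mapDriver.withInput(new Text("key"), new Text("val"))
                 .withOutput(new Text("foo"), new Text("bar"))
                 .runTest();

        // Verify the custom counter was incremented exactly once.
        Assert.assertEquals(1,
                mapDriver.getCounters().findCounter(MyCounters.RECORDS_SEEN).getValue());
    }
}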
Traditional unit tests and MRUnit help detect bugs early, but neither can test MR jobs within Hadoop. The local job runner lets Hadoop run on a local machine, in a single JVM, making MR jobs a little easier to debug when a job fails.
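A minimal way to switch a driver to the local job runner is through configuration. The property names below are the Hadoop 2.x ones (mapreduce.framework.name and fs.defaultFS), used here as an assumption about the setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalRunnerExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");   // run map and reduce tasks in this JVM
        conf.set("fs.defaultFS", "file:///");            // read input from the local file system

        Job job = Job.getInstance(conf, "local debug run");
        // ...set mapper, reducer, input and output paths as usual, then job.waitForCompletion(true)...
    }
}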
A pseudo-distributed cluster consists of a single machine running all the Hadoop daemons. It tests integration with Hadoop better than the local job runner does.
Running MR jobs on a QA cluster: this is the most exhaustive, but also the most complex and challenging, way of testing MR jobs; the QA cluster should consist of at least a few machines.
QA practices should be chosen based on organizational needs and budget. Unit tests, MRUnit and the local job runner can test MR jobs extensively in a simple way, but running jobs on a QA or development cluster is obviously the best way to fully test them.
I hope this blog has shown you that the study of the cloud is as vast as a cloud itself.