This write-up summarizes the benefits, design and framework, programming and testing of clouds, both as a service and as an infrastructure.
What is Cloud Computing?
Cloud computing has made a great impact on the IT industry. Data has moved away from personal computers and enterprise application servers to be clustered on the cloud.
Cloud computing is a model that provides a convenient way to access and consume a shared pool of resources containing a wide variety of services (storage, networks, servers, applications, etc.), and that too on demand. Additionally, provisioning and releasing services is easy to manage and does not always require the service provider's intervention.
For this, clouds use large clusters of servers that deliver low-cost technology benefits to consumers by using specialized data connections for data processing. Virtualization is often used to multiply the potential of cloud computing.
It has three delivery models:

Infrastructure as a Service (IaaS)
1. It is the basic layer of the cloud.
2. Servers, networks and storage are provided by the service provider.
3. Software and everything above the infrastructure are the cloud consumer's responsibility.

Platform as a Service (PaaS)
1. The consumer has no control over the underlying infrastructure.
2. The service provider supplies a platform, e.g. a web server, a database, or a content management tool like WordPress, which helps in application development.
3. Here you get a virtual machine with all the necessary software installed.

Software as a Service (SaaS)
1. The whole application is outsourced to the cloud provider.
2. It is the provider's responsibility to manage licensing and access-related issues.
3. Examples are Google Docs or hosted email services.
Types of Clouds:

Public
1. Services are available to everyone.
2. The service provider uses the internet and offers its applications to the widest group of users.

Private
1. Services (equipment and data centres) are private to the organization.
2. Secure access is given to the organization's users.

Hybrid
1. A mixture of both models.
2. Some of the organization's services can be used by everyone, while others remain private to internal users.
Cloud computing has clear benefits, but there are concerns too: will data integrity be maintained, will data be secure, will it stay private, and will services be available to everyone at all times? This is where testing comes in.
Types of Testing in Cloud Computing:

Testing a Cloud

Functional Testing
1. System Verification Testing: functional requirements are tested.
2. Acceptance Testing: users test the system to confirm it meets their requirements.
3. Interoperability Testing: the application should keep functioning well even when moved away from the cloud or to another environment.

Non-Functional Testing
1. Availability Testing: it is the cloud vendor's responsibility to ensure the cloud runs without sudden downtime and without affecting the client's business.
2. Security Testing: making sure there is no unauthorized access and that data integrity is maintained.
3. Performance Testing: stress and load testing to make sure performance remains intact under both peak load and reduced load.
4. Multi-Tenancy Testing: verifying that services are available to multiple clients at the same time and that data is kept secure to avoid access-level conflicts.
5. Disaster Recovery Testing: verifying that services are restored after a failure with a short recovery time and with no harm to the client's business.
6. Scalability Testing: verifying that services can be scaled up or down as needed.
7. Interoperability Testing: it should be easy to move a cloud application from one environment/platform to another.
How does a Cloud store and process data?
Hadoop and MapReduce:
Earlier, when data was manageable, it was stored in databases with defined schemas and relations. As data grew into big data, terabytes and petabytes with a characteristic unlike regular data ("write once, read many", WORM), Google introduced GFS (the Google File System), which was not open source. Google also developed a new programming model called MapReduce, a software framework that allows programs to process stupendous amounts of unstructured data in parallel across a distributed cluster of processors. In addition, Google introduced BigTable: a distributed storage system for managing structured data that scales to very large sizes, petabytes of data across thousands of commodity servers.
Later, the Hadoop Distributed File System (HDFS) was developed; it is open source and distributed by Apache. The software framework used is MapReduce, and the whole project is called Hadoop.
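To make the programming model concrete, here is a minimal word-count sketch written against the Hadoop MapReduce Java API. The class names (WordCountMapper, WordCountReducer) are my own, used only for illustration:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

The map step runs in parallel on every split of the input, and the reduce step aggregates the intermediate (word, 1) pairs; that is the essence of the model described above.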
MapReduce uses four entities:

Client: submits the MapReduce (MR) job (a driver sketch follows below).
Jobtracker: coordinates the job run. It is a Java application whose main class is JobTracker.
Tasktracker: runs the tasks that the job is divided into.
Distributed file system: (commonly HDFS) used to share job files among the entities.
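To show the client's side of this, here is a minimal, hedged driver sketch (the class name and command-line arguments are illustrative) that configures and submits a job; Hadoop then schedules its map and reduce tasks on the tasktrackers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");           // the client builds the job...
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);                // mapper/reducer from the sketch above
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input lives on the distributed file system
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ...and submits it; the framework then schedules the tasks across the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}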
Properties of HDFS:

Large: it consists of thousands of server machines, each storing a fragment of the system's data.
Replication: each data block is replicated a number of times, three by default (see the sketch after this list).
Failure: failure is not treated as an exception; it is the norm.
Fault tolerance: faults are detected and recovered from quickly and automatically.
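As an illustration of the replication property, a client can inspect or change the replication factor of a file through the HDFS FileSystem API. This is a minimal sketch, and the path used is just an example, not something from the original post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");        // example path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());  // typically 3

        fs.setReplication(file, (short) 2);              // ask HDFS to keep only 2 replicas of this file
    }
}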
Hadoop does not waste time diagnosing slow-running tasks; instead, it detects when a task is running slower than expected and launches a replica of it as a backup (speculative execution).
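This behaviour can be toggled per job. The sketch below assumes the Hadoop 2.x property names mapreduce.map.speculative and mapreduce.reduce.speculative (older releases used the mapred.* equivalents):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Launch backup copies of straggling map tasks, but not of reduce tasks.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "speculative execution demo");
        // ...set mapper, reducer, input and output paths as usual...
    }
}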
Apache HBase:
HBase is the Hadoop database, an open-source implementation of BigTable. It is used when big data needs real-time, random (read/write) access. It hosts very large tables, billions of rows by millions of columns, and is an open-source, distributed store for structured data. It is a NoSQL database that stores data as key/value pairs in columns, whereas HDFS uses flat files. So it combines the scalability of Hadoop (by running on HDFS), real-time and random data access (through its key/value store), and the problem-solving properties of MapReduce.
HBase uses a four-dimensional data model, and these four coordinates define each cell:

Row Key: every row has a unique key; the row key has no data type and is treated internally as a byte array.
Column Family: data inside a row is organized into column families; every row has the same set of column families, but across rows the same column family does not need to hold the same column qualifiers. HBase stores each column family in its own data file, column families have to be defined upfront, and it is hard to change them later.
Column Qualifier: column families contain columns, which are identified by column qualifiers; column qualifiers can be thought of as the columns themselves.
Version: every column can have a configurable number of versions, and data can be accessed for a specific version of a column qualifier.
HBase allows two types of access: random access to rows through their row key, column family, column qualifier and version, and offline or batch access through MapReduce queries. This dual approach makes it very powerful.
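As a sketch of the four coordinates in code, the example below uses the HBase client API; the table name "users" and the column family/qualifier "info:email" are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key + column family + column qualifier
            // (the version/timestamp is assigned automatically unless supplied).
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("someone@example.com"));
            table.put(put);

            // Random access by row key, narrowed to one family/qualifier.
            Get get = new Get(Bytes.toBytes("user-42"));
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));
            Result result = table.get(get);
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}

The batch side of the dual approach would instead run a MapReduce job over a table scan, which is what the next section's testing advice applies to.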
QA testing your MR jobs: which is, in effect, testing the whole cloud
Traditional unit testing frameworks such as JUnit or PyUnit can be used to get started with testing MR jobs. Unit tests are a great way to test MR jobs at the micro level, although they do not test MR jobs as a whole inside Hadoop.
MRUnit is a tool that can be used to unit-test map and reduce functions. MRUnit tests work the same way as traditional unit tests, so they are simple to write and do not require Hadoop to be running. There are some drawbacks to using MRUnit, but the benefits outweigh them.
MRUnit tests are simple, need no external I/O files, and run faster. An illustration of a test class:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class DummyTest {

    private Dummy.MyMapper mapper;
    private Dummy.MyReducer reducer;
    private MapReduceDriver<Text, Text, Text, Text, Text, Text> driver;

    @Before
    public void setUp() {
        mapper = new Dummy.MyMapper();
        reducer = new Dummy.MyReducer();
        driver = new MapReduceDriver<Text, Text, Text, Text, Text, Text>(mapper, reducer);
    }

    @Test
    public void testMapReduce() throws Exception {
        // One map input record in; one reduce output record expected out.
        driver.withInput(new Text("key"), new Text("val"))
              .withOutput(new Text("foo"), new Text("bar"))
              .runTest();
    }
}
Map and Reduce can also be tested separately, and counters can be tested too.
During a job execution, counters tell whether a particular event occurred and how often. Hadoop has four types of counters: file system, job, framework and custom.
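For example, a mapper can be tested on its own with MRUnit's MapDriver, and a custom counter can be checked after the run. The counter enum below, and the assumption that Dummy.MyMapper increments it, are mine for illustration:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Assert;
import org.junit.Test;

public class DummyMapperTest {

    // Hypothetical custom counter the mapper is assumed to increment once per record.
    enum MyCounters { RECORDS_SEEN }

    @Test
    public void testMapperAndCounter() throws Exception {
        MapDriver<Text, Text, Text, Text> mapDriver =
                MapDriver.newMapDriver(new Dummy.MyMapper());

        mapDriver.withInput(new Text("key"), new Text("val"))
                 .withOutput(new Text("foo"), new Text("bar"))
                 .runTest();

        // Verify the custom counter was incremented exactly once.
        Assert.assertEquals(1,
                mapDriver.getCounters().findCounter(MyCounters.RECORDS_SEEN).getValue());
    }
}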
Traditional unit tests and MRUnit help detect bugs early, but neither can test MR jobs within Hadoop. The local job runner lets Hadoop run on a local machine, in a single JVM, making MR jobs a little easier to debug when a job fails.
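A minimal way to switch a driver to the local job runner is through configuration. The property names below are the Hadoop 2.x ones (mapreduce.framework.name and fs.defaultFS), used here as an assumption about the setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalRunnerExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");   // run map and reduce tasks in this JVM
        conf.set("fs.defaultFS", "file:///");            // read input from the local file system

        Job job = Job.getInstance(conf, "local debug run");
        // ...set mapper, reducer, input and output paths as usual, then job.waitForCompletion(true)...
    }
}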
A pseudo-distributed cluster consists of a single machine running all the Hadoop daemons. It tests integration with Hadoop better than the local job runner does.
Running MR jobs on a QA cluster: this is the most exhaustive, but also the most complex and challenging, way of testing MR jobs; the QA cluster should consist of at least a few machines.
QA practices should be chosen based on organizational needs and budget. Unit tests, MRUnit and the local job runner can test MR jobs extensively in a simple way, but running jobs on a QA or development cluster is obviously the best way to fully test them.
I hope this blog has shown you that the study of the cloud is as vast as a cloud itself.