Coord: an open source project for large-scale data analysis
Introduction to Coord
Coord is an open source implementation of a space-based architecture (SBA) built on a distributed hash table (DHT). From a technical point of view, Coord is similar to a tuple space, which is an implementation of the associative memory paradigm for parallel/distributed computing. Coord transparently manages spaces that can be mapped to memory, files, or even databases, and converges them into one large-scale virtual space. In this virtual space, data is located by one or more hash functions, as if a point were placed on "coordinates", and a process communicates with other processes only through the space, which acts as a "coordinator". As a result, Coord provides large-scale shareable object storage for parallel/distributed computing.
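The idea of locating data by a hash function and letting processes communicate only through the space can be sketched in a few lines of Python. This is a single-process toy, not Coord's API: the class `ToySpace`, its methods `locate`, `write`, `read`, and `take`, and the node names are all hypothetical and chosen only for illustration.

```python
import hashlib


class ToySpace:
    """Toy key-addressed space partitioned across named nodes (NOT Coord's API)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # One in-memory dict per node stands in for that node's local store.
        self.stores = {node: {} for node in self.nodes}

    def locate(self, key):
        # Hash the key and map it onto a node, like placing a point on coordinates.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def write(self, key, value):
        self.stores[self.locate(key)][key] = value

    def read(self, key):
        # Non-destructive lookup; returns None if the key is absent.
        return self.stores[self.locate(key)].get(key)

    def take(self, key):
        # Destructive lookup: the value is removed from the space.
        return self.stores[self.locate(key)].pop(key, None)


space = ToySpace(["node-a", "node-b", "node-c"])
space.write("job:1", {"input": "part-0"})  # a producer publishes work
task = space.take("job:1")                 # a consumer picks it up
```

The producer and the consumer never address each other; both only compute the same hash of the key, which is the sense in which the space plays the role of a coordinator.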
Nevertheless, Coord is NOT just another of the emerging distributed key-value storage systems such as Bigtable, HBase, Dynamo, MemcacheDB, CouchDB, and Cassandra. Coord also provides a distributed computing framework, Coord MapReduce, for large-scale data analysis, following the MapReduce model introduced by Google. The distinctive feature of Coord MapReduce is on-the-fly MapReduce processing in real time, which sets it apart from batch-processing based MapReduce. Coord MapReduce runs on a simple distributed file system, dust. dust splits a file into chunks and scatters them, but it has no centralized metadata server to locate the chunks. In this respect, dust differs from GFS/HDFS. To take advantage of Coord MapReduce, there is no need to install a separate distributed file system. By using dust, Coord MapReduce helps users parallelize map/reduce tasks over massive data.
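As a rough illustration of how chunks can be found without a metadata server, the sketch below derives each chunk's node from a hash of its identifier and then runs a word count as the map/reduce job. It is an assumption-laden toy, not dust or Coord MapReduce: the chunk id format `name#index`, line-based chunking, the `NODES` list, and all function names are hypothetical, and everything runs in one process.

```python
import hashlib
from collections import Counter
from functools import reduce

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def node_for(chunk_id):
    # Any process can recompute a chunk's node from its id alone,
    # so no central server has to remember where chunks live.
    digest = hashlib.sha1(chunk_id.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def split_into_chunks(name, lines, lines_per_chunk):
    # Assumed chunking scheme: fixed number of lines per chunk.
    chunks = {}
    for start in range(0, len(lines), lines_per_chunk):
        chunk_id = f"{name}#{start // lines_per_chunk}"
        chunks[chunk_id] = lines[start:start + lines_per_chunk]
    return chunks


def map_chunk(lines):
    # Map step: count the words in one chunk.
    return Counter(word for line in lines for word in line.split())


def reduce_counts(left, right):
    # Reduce step: merge two partial counts.
    return left + right


lines = ["a b a", "b c", "a"]
chunks = split_into_chunks("demo.txt", lines, lines_per_chunk=2)
placement = {chunk_id: node_for(chunk_id) for chunk_id in chunks}
totals = reduce(reduce_counts, (map_chunk(c) for c in chunks.values()), Counter())
```

Here `placement` shows which node each chunk would land on, and `totals` holds the merged word counts. In a real cluster the map step would run on the node that holds the chunk.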
Moreover, Coord provides warp, a facility for remote execution and parallel processing. It is NOT just a remote or parallel execution tool such as ssh/gexec, since it supports load balancing. In real time, warp assigns the best node to run legacy code such as C/C++, Java, Python, or scripts. With warp, users no longer need to worry about where their tasks run. It enables users to easily parallelize code written for a single machine across the cluster.
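The text does not say how warp decides which node is "best". One plausible policy is to send each task to the node with the lowest current load, sketched below. The functions `pick_node` and `assign`, the load numbers, and the fixed per-task `cost` are assumptions made for this example and do not describe warp's actual behavior.

```python
def pick_node(loads):
    """Return the node name with the lowest reported load."""
    if not loads:
        raise ValueError("no nodes available")
    return min(loads, key=loads.get)


def assign(tasks, loads, cost=1.0):
    """Greedily assign each task to the currently least-loaded node."""
    loads = dict(loads)  # work on a copy so the caller's view is untouched
    plan = []
    for task in tasks:
        node = pick_node(loads)
        plan.append((task, node))
        loads[node] += cost  # assume every task adds the same load
    return plan


plan = assign(["t1", "t2", "t3"], {"node-a": 0.5, "node-b": 2.0, "node-c": 0.0})
# plan == [("t1", "node-c"), ("t2", "node-a"), ("t3", "node-c")]
```

The caller only names the tasks; the choice of node is made by the scheduler, which is the property the paragraph above describes.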
Coord is now evolving into a cloud computing platform for large-scale data analysis such as information retrieval, data/text mining, and machine learning.
How to Get Started
How to Analyze Large-scale Datasets
- Read the Coord Tutorial
How to Get Involved
Join the mailing lists and participate in the discussions around the development of Coord. If you encounter a problem and have an idea how to fix it, please start by making a patch and filing it with our issue tracking system.
Contact Me
Coord 0.4.0 is released
Coord 0.4.0 provides a Coord tutorial and datasets for demos.
Coord 0.3.5 is released
Coord 0.3.5 adds graph search to enable semantic search.
2009 Coord Summer of Code
Join the event, and get the prize!!!
2008 WoC Award
The Coord+Lucene project won the 2008 WoC (Winter of Code) Award at an open source contest for university students held by NCSoft and KIPA.
2008 NHN Conference Presentation
Coord was introduced at the 2008 NHN Conference and demonstrated at 2008 NHN Deview.
Greenplum - Greenplum uses MapReduce to support petabyte-scale data analytics
Aster - Aster is a high-performance database system for data warehousing and analytics, which tightly integrates SQL with MapReduce
Voldemort - Voldemort is a distributed key-value storage system
Ringo - Ringo is a distributed key-value storage system
Scalaris - Scalaris is a scalable, transactional, distributed key-value store. It can be used for building scalable Web 2.0 services
Kai - Kai is an open source implementation of Amazon Dynamo
Dynomite - Dynomite is an eventually consistent distributed key-value store based on Amazon Dynamo
MemcacheDB - MemcacheDB is a distributed key-value storage system designed for persistence. It is NOT a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval
ThruDB - ThruDB is a set of simple services built on top of the Apache Thrift framework that provides indexing and document storage services for building and scaling websites
CouchDB - Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API
Cassandra - Cassandra is a distributed storage system for managing structured data while providing reliability at a massive scale
Neptune - Neptune is a distributed, large-scale structured data store and an open source project implementing Google Bigtable
HBase - HBase is the Hadoop database. It is an open-source, distributed, column-oriented store modeled after Google Bigtable
Hypertable - Hypertable is an open source project based on published best practices and our own experience in solving large-scale data-intensive tasks