Microsoft's Internet Explorer browser has no built-in vector graphics machinery required for "loss-free" gradient background themes.

Please upgrade to a better browser such as Firefox, Opera, Safari or others with built-in vector graphics machinery and much more. (Learn more or post questions or comments at the Slide Show (S9) project site. Thanks!)

Plasma

graph based distributed computing

Big Picture

Europe has about 490,000,000 people, 7.3% of global population
  • 60% penetration => 20.8 % of total internet usage

(Internet World Stats)

Between 50% and 65% of internet traffic is from P2P downloads
  • 61% video
  • 11.3% audio
  • 27.2% applications
  • avg. file size of 1 gigabyte

(CacheLogic)

Google has more than 450,000 servers, with another couple hundred thousand on the way.
  • 8,180 racks per data center x 40 dual core CPUs per rack = 654,400 CPUs!
  • $1.5 billion per year to fund server growth

(New York Times, ZDnet)

Centralized servers are being decentralized

Modern servers are distributed systems using P2P technology

  • Amazon's Dynamo, EC2, Simple-DB
  • Google's MapReduce, Big Table
  • IBM and Sun compute clusters

Who are the mothers

Data driven information assistant

Mother please

Distributed recommendation systems

  • community structuring of shared content
    • research areas
    • document ratings
  • interest feeds
  • Community based paper review

And this is just the beginning...

The state of networking

Transparency:

  • location
  • concurrency
  • replication
  • failures
  • mobility

Protocol level

  • Structured streams
  • Asynchronous message passing and mailboxes
  • SOAP, RMI, CORBA
  • REST and resource based interfaces

Middleware

  • tuple spaces
  • shared queues
  • distributed object systems

P2P frameworks

  • Distributed hash table (DHT)
  • probabilistic search (gossip, random walk)

P2P Topologies: Structured Geometries

  • network[object-id] => object-address

Unstructured "Random" Graphs

  • network.search(query) => result-objects

A shared substrate

  • Published at IEEE Hot P2P 2007
  • NAT, sockets, memory, bandwidth, and complexity

A multi-service abstraction

Service Topology Purpose
Distributed hash table ring (hypercube) object lookup by ID
Random walk search random graph supports arbitrary queries
Local flood star field neighborhood communications
Probabilistic Gossip random graph maintenance, monitoring, measurement
Torrent and data stream trees and forrests media streaming
Data driven flow trees and forrests content based routing
Dynamic flow graphs directed graph pipeline media streams and streaming databases
Distributed clustering small world graph recommendation systems
  • P2P and distributed computing systems are too complex
    • NAT traversal, resource usage, timeouts
    • failure, churn, maintenance
    • consistency and synchronization with massive parallelism

Research Problem

How can we support:

  • data driven algorithms
  • multiple topologies
  • concise, re-usable data operations
  • efficient implementations

On a wide range of targets:

  • desktop applications (torrents, tunes and bibliographies)
  • data center services (Amazon, Google, Slashdot)
  • distributed data mining (research documents, gene profiles)

Back to the roots

Complexity, meet abstraction.

Data driven = information structures

Data driven algorithms require

  • mechanisms to represent application information
  • a way to query and manipulate the information
  • networking capabilities and topology representation

Bachman's Network Model

Navigational data access

Codd's Relational Model

  • Logical layer vs storage layer
  • Simple to understand schema
  • Sound theoretical framework for data access

What's in a database?

Efficient Storage Layer

B+ tree

Semi-structured data and RDF

  • XML databases
  • The Semantic web (RDF, OWL)
    • Logic statements vs navigational structure

Research Problem

How can we support:

  • data driven algorithms
  • multiple topologies
  • concise, re-usable data operations
  • efficient implementations

On a wide range of targets:

  • desktop applications (torrents, tunes and bibliographies)
  • data center services (Amazon, Google, Slashdot)
  • distributed data mining (research documents, gene profiles)

Plasma

  • Organize data in a semi-structured manner
  • Also need to support networking, in at least a semi-transparent way
    • Network of resources as a database
  • Both data and topology in the same structure
  • A new logical abstraction
    • Work at the logical level, letting optimizations happen underneath

Plasma: modeling with graphs

Plasma: distributed computing

Plasma: data access and network programming

A navigational graph query language

Query.new().
  path_select(/net/peer/app/doc/file).
    sort_on(:score).
      max(10)

Plasma architecture: a high level view

Data Model: basics

Item

  • globally unique UUID
  • key/value property map

Node < Item

  • in and out edges

Edge < Item

  • source and target nodes

Graph < Item

  • nodes and edges
  • root node

Data Model: bonus features

  • remote nodes (network links)
  • events (triggers, sort-of)
  • sub-graphs (sub-sets)
  • views (overlays)
  • perspectives (ORM for graphs)

Query Model: graph and arithmetic operators

Input = graph(s) Output = graph(s)

  • Edge traversal
  • Property access
  • Iterators (depth and breadth first)
  • min, max, count, avg

Enables:

  • Path expressions
  • neighborhood statistics (clustering coefficient, degree)
  • Dikstra's shortest path

Query Model: Set Operators

  • set membership function uses UUIDs

Query Model: Path expressions

/net/peer
/app/social/friend/app/doc/file
/net/peer/app/social/friend

Path expressions - advanced

  • Property predicates
  • Pattern matching (branches)
  • Perspectives

Query Processing

  • query graphs
  • graph optimization
  • query distribution
Query.new().
  path_select(/net/peer/app/doc/file).
    sort_on(:score).
      max(10)

Event system

  • streaming networks
  • reactive algorithms

Open questions

  • in-code or in-query
  • distributed graph semantics
    • query results
    • copies
    • proxies
    • versions
  • data model features
    • sorted sets and efficient data structures
    • remote nodes vs networked edges
    • subscriptions
  • query model features
    • views
    • projections

Networking
  • Optimizations available with more restricted operators
  • Limiting network cycles

Evaluation Plan

Case studies

  • distributed clustering
  • structured geometries (e.g. Chord, CAN)
  • unstructured search (e.g. Push-sum, Cyclone, BubbleStorm)
  • map/reduce style data mining example
  • grid oriented service mapping, discovery, monitoring

Goals and quantitative evaluation

  1. We aren't worse in performance in large deployments
  2. With automatic optimizations we can do better in many cases
  3. Algorithms implementations are concise
  4. Components implemented on graphs are re-usable

Plasma Project Plan

Platform Development Time Estimate
Basic set of graph operators 1 month
Distributed graph queries 2 months
Paper on data model and operators 1 month
Improved query engine (optimizer) 2 months
Implement evaluation algorithms 3 months
System deployment and testing 2 month
Analysis and final paper 2 month

(In Amsterdam!)

Questions?

Teaching and Mentoring

  • Atelier 1 => bash scripts, html, css
  • Atelier 4 => full course design, web application development
  • Atelier 4 => improved course, web application development
  • Atelier 2 => user interface design, java projects
  • UROP => mentored two undergraduate students

Atelier 4

Developed a new course from scratch

  • Creating database backed web applications
  • Using Ruby on Rails
  • Self assessment
  • Student directed content
  • Open source development model

Research Foci

  • Focus is P2P networking and graph based distributed computing
  • Sub-focus is graph databases
  • Summer school on game theory
  • Amer's course on profiling
  • Lectures on information retrieval
  • Reading group on pattern classification
  • Reading group on neuro-science

Presentations and Writing

Presentations

  • Ruby on Rails tutorial at swiss object oriented society
  • Presented Spinneret paper at Hot P2P
  • Gave final survey talk at the MICS workshop

Peer Reviewed Articles:

  • J. Rose, A. Carzaniga, "Plasma: a graph based distributed computing model," Tech Report, 2008.
  • J. Rose, C. Hall, A. Carzaniga, "Spinneret: A Log Random Substrate for P2P Networks," IEEE HotP2P, 2007
  • S. Bhatti, J. Carlson, H. Dai, J. Deng, J. Rose, A. Sheth, B. Shucker, C. Gruenwald, A. Torgerson, R. Han, "MANTIS OS: An Embedded Multithreaded Operating System for Wireless Micro Sensor Platforms, " ACM/Kluwer Mobile Networks & Applications (MONET), Special Issue on Wireless Sensor Networks, vol. 10, no. 4, August 2005, guest co-editors P. Ramanathan, R. Govindan and K. Sivalingam, pp. 563-579.
  • H. Abrach, S. Bhatti, J. Carlson, H. Dai, J. Rose, A. Sheth, B. Shucker, J. Deng, R. Han, " MANTIS: System Support For MultimodAl NeTworks of In-situ Sensors," 2nd ACM International Workshop on Wireless Sensor Networks and Applications (WSNA) 2003, pp. 50-59.

Book Chapters:

  • M. Jazayeri, C. Mesnage, J. Rose, "Modern Web Application Development," Wiley, 2007.
  • B. Shucker, J. Rose, A. Sheth, J. Carlson, S. Bhatti, H. Dai, J. Deng, and R. Han, Book chapter 6 on "Embedded Operating Systems for Wireless Microsensor Nodes", in Handbook of Sensor Networks: Algorithms and Architectures, Wiley, 2005, editor Ivan Stojmenovic, pp. 173-197.