Distributed software development issues

We define a distributed software system (see Figure 1) as a system with two or more independent processing sites that communicate with each other over a medium whose transmission delays may exceed the time between successive state changes. Note that this definition is broad enough to encompass logically distributed systems. These are software systems that are concentrated on a single processing node but, for one reason or another, exhibit the above properties.

An example of such a system is one that spans multiple heavyweight processes running on the same processor and which interact by exchanging asynchronous messages. Since each process is an independent unit of failure, it fits our definition.
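
As a rough sketch of such a configuration, the following Python fragment (the language and the queue-based messaging are illustrative choices, not taken from any particular system) runs two heavyweight operating-system processes on one machine that interact only by exchanging asynchronous messages:

    # A minimal sketch of a "logically distributed" system: two heavyweight
    # processes on a single processor exchanging asynchronous messages.
    from multiprocessing import Process, Queue

    def worker(inbox: Queue, outbox: Queue) -> None:
        # The worker is an independent unit of failure: it can crash without
        # taking the sender down with it.
        while True:
            msg = inbox.get()
            if msg == "shutdown":
                break
            outbox.put("processed: " + msg)

    if __name__ == "__main__":
        to_worker, from_worker = Queue(), Queue()
        p = Process(target=worker, args=(to_worker, from_worker))
        p.start()
        to_worker.put("state-change-1")   # asynchronous send; no shared state
        print(from_worker.get())          # reply arrives after some delay
        to_worker.put("shutdown")
        p.join()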

The challenges of distributed software

The majority of problems associated with distributed systems pertain to failures of some kind. These are generally manifestations of the unpredictable, asynchronous, and highly diverse nature of the physical world. In other words, because fault-tolerant distributed systems must contend with the complexity of the physical world, they are inherently complex.

Failures, faults, and errors

Let's introduce some basic terminology. A failure occurs when a system or component does not behave as its specification requires. This is usually because the system experiencing the failure has reached some invalid state.

We refer to such an undesirable state as an error. The underlying cause of an error is called a fault. For example, a memory bit that is stuck at a high value is a fault, and the corrupted contents of that memory word constitute an error. When the value of that bit is read and used in a calculation, the outcome will be a failure.

Of course, this classification is a relative one. A fault is typically a failure at some lower level of abstraction; that is, the stuck-high bit may itself be the result of a lower-level fault, such as an impurity introduced in the manufacturing process. When a failure occurs, it is first necessary to detect it and then to perform some basic failure handling.

The latter involves diagnosis (determining the underlying cause of the fault), fault removal, and failure recovery. Each of these activities can be quite complex. Consider, for example, failure diagnosis. A single fault can often lead to many errors and many different cascading failures, each of which may be reported independently. A key difficulty lies in sorting through the possible flurry of consequent error reports, correlating them, and determining the basic underlying cause, the fault.

Processing site failures

Because the processing sites of a distributed system are independent of each other, they are independent points of failure. While this is an advantage from the viewpoint of the user of the system, it presents a complex problem for developers.

In a centralized system, the failure of a processing site implies the failure of all the software. In contrast, in a fault-tolerant distributed system, a processing site failure means that the software on the remaining sites needs to detect and handle that failure in some way. This may involve redistributing the functionality of the failed site to the remaining operational sites, or it may mean switching to some emergency mode of operation.
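
One widely used way for the remaining sites to detect such a failure is a heartbeat scheme: every site periodically announces that it is alive, and a peer that has not been heard from within some timeout is presumed to have failed. The sketch below is a simplified, single-process version of this idea; the class name and the timeout value are invented for illustration.

    # Simplified heartbeat-based failure detection: a peer whose last heartbeat
    # is older than SUSPECT_AFTER seconds is suspected to have failed.
    import time

    SUSPECT_AFTER = 3.0   # illustrative timeout, in seconds

    class HeartbeatMonitor:
        def __init__(self):
            self.last_seen = {}   # site name -> time of last heartbeat

        def heartbeat(self, site):
            # Called whenever a heartbeat message arrives from `site`.
            self.last_seen[site] = time.monotonic()

        def suspected(self):
            # Sites silent for longer than the timeout are suspected to have
            # failed (they may merely be slow; see the time-out discussion below).
            now = time.monotonic()
            return [s for s, t in self.last_seen.items() if now - t > SUSPECT_AFTER]

    monitor = HeartbeatMonitor()
    monitor.heartbeat("site-B")
    print(monitor.suspected())   # [] while site-B's heartbeat is still recent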

Communication media failures

Another kind of failure that is inherent in most distributed systems comes from the communication medium. The most obvious, of course, is a complete hard failure of the entire medium, whereby communication between processing sites is not possible. In the most severe cases, this type of failure can lead to partitioning of the system into multiple parts that are completely isolated from each other.

The danger here is that the different parts will undertake conflicting activities. A different type of media failure is an intermittent failure, whereby messages travelling through the communication medium are lost, reordered, or duplicated. Note that these failures are not always due to hardware. For example, a message may be lost because the system has temporarily run out of memory for buffering it. Message reordering may occur when successive messages take different paths through the communication medium.

If the delays incurred on these paths are different, they may overtake each other. Duplication can occur in a number of ways. For instance, it may result from a retransmission due to an erroneous conclusion that the original message was lost in transit.
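
A common defence against duplication (and, to a degree, reordering) is to tag every message with a per-sender sequence number and have the receiver discard anything it has already delivered. A minimal receiver-side sketch, in which `deliver` stands for a hypothetical application callback, might look like this:

    # Receiver-side filtering of duplicated and out-of-order messages using
    # per-sender sequence numbers. `deliver` is a hypothetical application callback.
    def make_receiver(deliver):
        expected = {}   # next sequence number expected from each sender

        def on_message(sender, seq, payload):
            nxt = expected.get(sender, 0)
            if seq < nxt:
                return            # duplicate of something already delivered; drop it
            if seq > nxt:
                return            # arrived out of order; a real protocol would buffer it
            deliver(sender, payload)
            expected[sender] = nxt + 1

        return on_message

    on_message = make_receiver(lambda s, p: print(s, "delivered:", p))
    on_message("A", 0, "hello")   # delivered
    on_message("A", 0, "hello")   # duplicate (e.g. a retransmission); dropped
    on_message("A", 1, "world")   # delivered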

One of the central problems with unreliable communication media is that it is not always possible to positively ascertain that a message that was sent has actually been received by the intended remote destination.

A common technique for dealing with this is to use some type of positive acknowledgement protocol. In such protocols, the receiver notifies the sender when it receives a message. Of course, there is the possibility that the acknowledgement message itself will be lost, so that such protocols are merely an optimization and not a solution. The most common technique for detecting lost messages is based on time-outs.

Namely, if we do not get a positive acknowledgement that our message was received within some reasonable time interval, we conclude that it was dropped somewhere along the way. The difficulty of this approach is distinguishing whether a message or its acknowledgement is simply slow or actually lost. If we make the time-out interval too short, we risk duplicating messages and, in some cases, reordering. If we make the interval too long, the system may become unresponsive.
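
A minimal sender-side sketch of such a scheme, assuming UDP datagrams and a cooperating receiver that replies with the 4-byte sequence number of every message it accepts (the timeout and retry limit are arbitrary illustrative values):

    # Stop-and-wait sender: send, wait for a positive acknowledgement, and
    # retransmit on time-out. Note that retransmission after a lost
    # acknowledgement is precisely how duplicate messages get introduced.
    import socket

    def send_reliably(dest, seq, payload, timeout=0.5, max_retries=5):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        message = seq.to_bytes(4, "big") + payload
        try:
            for _ in range(max_retries):
                sock.sendto(message, dest)
                try:
                    ack, _ = sock.recvfrom(4)
                    if int.from_bytes(ack, "big") == seq:
                        return True      # positively acknowledged
                except socket.timeout:
                    pass                 # slow or lost? we cannot tell; retransmit
            return False                 # give up; delivery remains unknown
        finally:
            sock.close()

    # Example call (assumes a receiver listening on localhost port 9999):
    # send_reliably(("127.0.0.1", 9999), seq=1, payload=b"state-change")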

Transmission delays

While transmission delays are not necessarily failures, they can certainly lead to failures. We've already discussed the situation where a delay can be misconstrued as a lost message. Two different types of problems are caused by message delays. One type of problem results from variable delays (jitter). That is, the time taken for messages to reach the destination may vary significantly.

The delays depend on a number of factors, such as the route taken through the communication medium, congestion in the medium, congestion at the processing sites (for example, a busy receiver), intermittent hardware failures, and so on. If the transmission delay were constant, we could better assess when a message has been lost. For this reason, some communication networks are designed as synchronous networks, so that delay values are fixed and known in advance.

However, even if the transmission delay is constant, the problem of out-of-date information still exists. Since messages convey information about state changes between components of the distributed system, the information in these messages may be out of date if the delays experienced are greater than the time required to change from one state to the next.

This can lead to unstable systems. Imagine trying to drive a car in a situation where the visual input to your eyes is delayed by several seconds.

Transmission delays also lead to a complex situation that we will refer to as the relativistic effect. It arises because the transmission delays between different processing sites in a distributed system may differ.

As a result, different sites could see the same set of messages but in a different order. Due to the different routes taken by the individual messages and the different delays along those routes, ClientB sees one sequence (event1 followed by event2), whereas ClientA sees another. As a consequence, the two clients may reach different conclusions about the state of the system. Note that the mismatch here is not the result of message overtaking (although the effect is compounded if overtaking occurs), but is merely a consequence of the different locations of the distributed agents relative to each other.
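
The effect is easy to reproduce in a toy simulation (all the delay values below are invented): the same two events are delivered to ClientA and ClientB, but because the per-link delays differ, the two clients observe them in opposite orders even though nothing is lost or corrupted.

    # Toy illustration of the relativistic effect: identical events arrive at
    # two clients in different orders purely because of different link delays.
    send_times = {"event1": 0, "event2": 2}      # event2 is sent shortly after event1
    link_delay = {                               # invented per-link delays (ms)
        ("event1", "ClientA"): 40, ("event2", "ClientA"): 10,
        ("event1", "ClientB"): 5,  ("event2", "ClientB"): 30,
    }

    for client in ("ClientA", "ClientB"):
        arrivals = sorted((send_times[e] + d, e)
                          for (e, c), d in link_delay.items() if c == client)
        print(client, "observes:", [e for _, e in arrivals])

    # ClientA observes: ['event2', 'event1']
    # ClientB observes: ['event1', 'event2']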

Distributed agreement problems

The various failure scenarios in distributed systems, and transmission delays in particular, have instigated important work on the foundations of distributed software. Much of this work has focussed on the central issue of distributed agreement.

There are many variations of this problem, including time synchronization, consistent distributed state, distributed mutual exclusion, distributed transaction commit, distributed termination, distributed election, and so on. However, all of these reduce to the common problem of reaching agreement in a distributed environment in the presence of failures.

We introduce this problem with the following apocryphal story. Consider the case of two army generals of ancient times, when communication between distant sites could only be achieved by physically carrying messages between the sites. The two generals, Amphiloheus and Basileus, are in a predicament whereby the enemy host lies between them.

While neither has the strength to single-handedly defeat the enemy, their combined force is sufficient. Thus, they must commence their attack at precisely the same time or they risk being defeated in turn.

Their problem then is to agree on a time of attack. Unfortunately for them, a messenger going from one general to the other must pass through enemy lines with a high likelihood of being caught. Assume then, that Amphiloheus sends his messenger to Basileus with a proposed time of attack. To ensure that the message was received, he demands that the messenger return with a confirmation from Basileus. While this is going on, Basileus could be in the process of sending his own messenger with his own, possibly different, proposal for the time of the attack.

The problem is obvious: if the messenger fails to get back to Amphiloheus, what conclusions can be reached? If the messenger succeeded in reaching the other side but was intercepted on the way back, there is a possibility that Basileus will attack at the proposed time but not Amphiloheus since he did not get a confirmation. However, if the messenger was caught before he reached Basileus, then Amphiloheus is in danger of acting alone and suffering defeat.

Furthermore, even if the messenger succeeds in getting back to Amphiloheus, there is still a possibility that Basileus will not attack, because he is unsure that his confirmation actually got through. To remedy this, Basileus may decide to send his own messenger to Amphiloheus to ensure that his confirmation got through. But, the only way he can be certain of that is if he gets a confirmation of his confirmation.

Since there is a possibility that neither messenger got through to Amphiloheus, Basileus is no better off than before if his second messenger does not return. Clearly, while sending additional messengers can increase the likelihood that a confirmation will get through, it does not fundamentally solve the problem since there will always be a finite probability that messengers will get intercepted.

Impossibility result

Our parable of the generals is simply an illustration of a fundamental impossibility result. Namely, it has been formally proven that it is not possible to guarantee that two or more distributed sites will reach agreement in finite time over an asynchronous communication medium, if the medium between them is lossy [7] or if one of the distributed sites can fail [2]. This important result is, unfortunately, little known, and many distributed system developers are still trying to solve what is known to be an unsolvable problem: the modern-day equivalent of trying to square the circle or devise a perpetual motion machine.

The best that we can do in these circumstances is to reduce the possibility of non-termination to something that is highly unlikely.

The Byzantine generals problem

A common paradigm for a particular form of the distributed agreement problem is the so-called Byzantine generals problem. In this formulation, a group of processing sites must agree on a common value even though some of them may be faulty. Furthermore, any faulty processing sites are assumed to be malicious in the sense that they are trying to subvert agreement by intentionally sending incorrect information.
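
The following toy sketch (the site names and orders are invented) shows the essential difficulty with only three sites and one traitor: after exchanging what they heard, the two loyal lieutenants each see one report saying "attack" and one saying "retreat", and neither can tell whether the commander or the other lieutenant is lying.

    # A traitorous commander sends conflicting orders to two loyal lieutenants,
    # who then forward to each other what they received.
    orders = {"L1": "attack", "L2": "retreat"}   # conflicting orders from the traitor

    view_L1 = {"from commander": orders["L1"], "relayed by L2": orders["L2"]}
    view_L2 = {"from commander": orders["L2"], "relayed by L1": orders["L1"]}

    for name, view in (("L1", view_L1), ("L2", view_L2)):
        print(name, "sees conflicting reports:", view)
    # Each loyal lieutenant sees both "attack" and "retreat" and cannot identify
    # the liar, which is why agreement cannot be guaranteed with only three sites
    # when one of them may be Byzantine.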

Openness: The openness of a distributed system is determined primarily by the degree to which new resource-sharing services can be made available to users. Open systems are characterized by the fact that their key interfaces are published; they are based on a uniform communication mechanism and published interfaces for access to shared resources, and they can be constructed from heterogeneous hardware and software.

Scalability: A distributed system should remain efficient even with a significant increase in the number of users and resources connected.

Security: The security of an information system has three components: confidentiality, integrity, and availability. Encryption protects shared resources and keeps sensitive information secret when it is transmitted.

Failure handling: When faults occur in hardware or in software, programs may produce incorrect results or may stop before they have completed the intended computation, so corrective measures should be implemented to handle such cases.

Failure handling is difficult in distributed systems because failures are partial, i.e., some components fail while others continue to function.

Concurrency: There is a possibility that several clients will attempt to access a shared resource at the same time.

Multiple users may make requests on the same resources concurrently, so each resource must be safe to access in parallel. Any object that represents a shared resource in a distributed system must ensure that it operates correctly in a concurrent environment.
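
As a minimal sketch of that last point (the `Account` class is purely illustrative), a shared resource can guard every update with a lock so that simultaneous client requests cannot interleave destructively:

    # A shared resource guarded by a lock so that concurrent client requests
    # leave it in a consistent state.
    import threading

    class Account:
        def __init__(self, balance):
            self.balance = balance
            self._lock = threading.Lock()

        def withdraw(self, amount):
            with self._lock:             # the check-then-act sequence must be atomic
                if self.balance >= amount:
                    self.balance -= amount
                    return True
                return False

    account = Account(100)
    threads = [threading.Thread(target=account.withdraw, args=(60,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(account.balance)               # 40: only one of the two withdrawals succeeds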

