Solving the Greatest Common Divisor Problem in Parallel
Derrick Coetzee
University of California, Berkeley
CS 273, Fall 2010, Prof. Satish Rao

The problem and serial solutions
• Given two positive integers (represented in binary), find the largest positive integer that divides both
• Let n be the total number of bits in the inputs
• Solvable in O(n) divisions and O(n²) bit operations with the Euclidean algorithm:
  – while b ≠ 0: (a, b) ← (b, a mod b)
• Matrix formulation: each step is a linear map, $\begin{pmatrix} b \\ a \bmod b \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & -q \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix}$ with q = ⌊a/b⌋
• Schönhage uses this to solve the problem in O(n log² n log log n) bit operations

Adding parallelism: Algorithm PM
• Brent & Kung (1982): reduce the time of each step to O(1) (O(n) bit complexity)
  – Pseudocode (where a, b ≤ 2ⁿ and a is odd):
      α ← n; β ← n
      while b ≠ 0:
        while b ≡ 0 (mod 2) { b ← b/2; β ← β − 1 }
        if α ≥ β { swap(a, b); swap(α, β) }
        if a + b ≡ 0 (mod 4): b ← ⌊(a+b)/2⌋
        else: b ← ⌊(a−b)/2⌋
  – The test “a + b ≡ 0 (mod 4)” uses only the 2 lowest bits of the sum, which enables effective pipelining of the steps
  – The number of steps is still linear, and n processors are required

Algorithm PM: Example
• GCD(105, 30) (a = 1101001₂, b = 11110₂)
• α ← 7; β ← 7
• b ← 1111₂; β ← 6
• swap (a ← 1111₂, b ← 1101001₂, α ← 6, β ← 7)
• a + b ≡ 11₂ + 01₂ ≡ 00₂ (mod 4)
• Compute trailing zeroes of new b and 2 more bits:
  – b ← ⌊(a+b)/2⌋ = (01111₂ + …01001₂)/2 = …1100₂
• b ← …11₂; β ← 5
• swap (a ← …11₂, b ← 1111₂, α ← 5, β ← 6)

Algorithm PM: Example (continued)
• a = …11₂, b = 1111₂, α = 5, β = 6
• a + b ≡ 11₂ + 11₂ ≡ 10₂ (mod 4)
• Compute the LSB of the new b while the previous iteration computes the next bit of a in parallel:
  – b' = ⌊(a−b)/2⌋ = …0
  – a = …111₂
• With another bit of a, can compute the next bit of b'
• Final result: a = 1111₂, b' = 0

Sublinear time GCD
• Kannan et al. (1987): break the problem into n/log n steps, each of which takes log log n time and eliminates log n bits of each input.
  – CRCW, O(n log log n / log n) time, O(n² log² n) processors
  – Idea: instead of a mod b, compute pa mod b for all 0 ≤ p ≤ n
    • If gcd(a, b) = d and 0 ≤ p ≤ n, then gcd(pa, b)/d ≤ n
    • The pigeonhole principle gives the same first log n bits for at least two values of p
  – gcd(a, b) = gcd(b, (p₁a mod b) − (p₂a mod b)) (ignoring factors ≤ n)
  – Only need to know the first log n bits to identify p₁, p₂, which can be computed in log log n time

Sublinear time GCD: Example
• GCD(143, 221), n = 8
  – (0×143) mod 221 = 000…₂
  – (1×143) mod 221 = 100…₂
  – (2×143) mod 221 = 010…₂
  – (3×143) mod 221 = 110…₂
  – (4×143) mod 221 = 100…₂
  – (5×143) mod 221 = 001…₂
  – (6×143) mod 221 = 110…₂
  – (7×143) mod 221 = 011…₂
  – (8×143) mod 221 = 001…₂
• |(4×143) mod 221 − (1×143) mod 221| = |130 − 143| = 13

Sublinear time GCD: Using matrix form
• Problem: (p₁a mod b) − (p₂a mod b) takes log n time
  – A carry-lookahead adder needs log n time to add n-bit numbers
  – Idea: delay updates to (a, b) for log n steps, collecting the updates in a 2 × 2 matrix T, then compute T(a, b) in log n time
  – Results in n/log² n phases, each taking log n time
  – Each step performs a constant number of log² n-bit operations
    • Ensures that each step still takes log log n time
  – Time is dominated by the cost of the O(n/log n) steps

Getting rid of the log log n factor
• Chor and Goldreich (1990): parallelize Brent & Kung’s PM algorithm instead of the Euclidean algorithm
  – Pseudocode:
      α ← n; β ← n
      while b ≠ 0:
        while b ≡ 0 (mod 2) { b ← b/2; β ← β − 1 }
        if α ≥ β { swap(a, b); swap(α, β) }
        if a + b ≡ 0 (mod 4): b ← ⌊(a+b)/2⌋
        else: b ← ⌊(a−b)/2⌋
  – Idea: pack log n iterations of the PM algorithm into each phase
    • Each phase can be done in constant time
  – Requires O(n/log n) time on O(n^(1+ε)) processors
  – Best-known algorithm

log n iterations in constant time
  – Pseudocode, with δ tracking α − β so that the test δ ≥ 0 replaces α ≥ β:
      α ← n; β ← n; δ ← 0
      while b ≠ 0:
        while b ≡ 0 (mod 2) { b ← b/2; β ← β − 1; δ ← δ + 1 }
        if δ ≥ 0 { swap(a, b); swap(α, β); δ ← −δ }
        if a + b ≡ 0 (mod 4): b ← ⌊(a+b)/2⌋
        else: b ← ⌊(a−b)/2⌋
  – δ together with the last log n + 1 bits of a and b determine the transformation matrix of
    the next log n iterations
  – Precompute a table; look up the matrix, then do the matrix multiplication
• Technicality: need to treat |δ| ≤ log n and |δ| > log n differently

Multiplication in constant time
• How to multiply an n-bit number by a log n-bit number in constant time?
  – Represent both numbers in base 2^(log n)
  – Precompute a multiplication table for this base
  – Separate the larger number into a sum of two numbers, each having zero for every other digit, e.g. 1234 = 1030 + 204
  – Multiply both by the smaller number; no carries are required
    • Example: 1234 × 7 = (1030 + 204) × 7 = 7210 + 1428 = 8638
  – Finally, Chandra et al. (1983) show that we can add two n-bit numbers in O(1) time with O(n²) processors
    • But this requires concurrent writes

Complexity perspective: an analogy
• P = NP?
  – Open problem, but most practical problems have been shown to be either in P or NP-hard.
  – Exceptions: integer factorization, graph isomorphism
    • If integer factorization is NP-complete, NP = coNP
    • If graph isomorphism is NP-complete, PH collapses to the 2nd level
• NC = P?
  – Open problem, but most practical problems have been shown to be either in NC or P-hard.
  – Exceptions: GCD
    • What would GCD being P-complete imply?

Questions?
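
Backup: Algorithm PM as a serial sketch
• The Algorithm PM loop can be sketched in Python as follows. This is a minimal serial sketch that mirrors the control flow only; it does not model the bit-level pipelining or the n processors.
• Assumptions not taken from the talk: the name pm_gcd, the default choice of n (the larger bit length of the two inputs), and returning |a|, since a and b can become negative through the (a − b)/2 step.

```python
def pm_gcd(a, b, n=None):
    """Serial sketch of Algorithm PM for odd a > 0 and b >= 0 (hypothetical helper)."""
    if a <= 0 or b < 0 or a % 2 == 0:
        raise ValueError("this sketch assumes a is odd and positive, and b >= 0")
    if n is None:
        # Assumption: any n with a, b <= 2**n works; use the larger bit length.
        n = max(a.bit_length(), b.bit_length())
    alpha = beta = n  # upper bounds: |a| <= 2**alpha, |b| <= 2**beta
    while b != 0:
        # Strip factors of two from b; a is odd, so the gcd is unchanged.
        while b % 2 == 0:
            b //= 2
            beta -= 1
        # Keep the value with the smaller bound in a.
        if alpha >= beta:
            a, b = b, a
            alpha, beta = beta, alpha
        # Both a and b are odd here, so exactly one of a+b, a-b is 0 mod 4.
        # Only the two lowest bits of the sum decide which branch is taken.
        if (a + b) % 4 == 0:
            b = (a + b) // 2
        else:
            b = (a - b) // 2
    return abs(a)
```

• pm_gcd(105, 30) follows the worked example (a = 15, b = 0 at the end) and returns 15.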