3. LU factorization
We would like to compute the solution to a system of linear equations
AX = B, where A is a real or complex matrix, and X and B are rectangular
matrices or vectors. Gaussian elimination with row interchanges is used to
factor A as LU = PA, where P is a permutation matrix, L is a unit lower tri-
angular matrix, and U is an upper triangular matrix. The factored form of A
is then used to solve the system of equations AX = B.
The recursive Gauss LU factorization algorithm is described in detail in [3, 8]. We give two recursive algorithms here, listed as Algorithms 3.1 and 3.2 below.
Recursive Algorithm 3.1. Recursive LU factorization without pivoting (rgausslu):
Do recursion
• if min(m,n) > 1 then
• (L1, U1) = rgausslu of A1
• L11 U12 = A12 → RTRSM
• Â22 := A22 − L21 U12 → _GEMM
• (L22, U22) = rgausslu of Â22
• otherwise
• L1 := A1/a11 and U1 := a11
End recursion
• if n > m then
• L U3 = A3 → RTRSM
The matrices A1, A2, A3, A12, A22, L1, L11, L21, L22, U1, U3, U12 and U22 are submatrices of A, L and U respectively, and a11 ∈ A1.
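As an illustration of Algorithm 3.1, here is a minimal pure-Python sketch for a square matrix stored as a list of lists. This is a hypothetical toy, not the paper's Fortran implementation: the two loop nests stand in for the RTRSM and _GEMM calls, and the factors overwrite A in place (L below the diagonal with an implicit unit diagonal, U on and above it).

```python
def rgausslu(A, j0=0, w=None):
    """Recursive LU without pivoting of the column panel A[:, j0:j0+w], in place.
    On return the strictly-lower part of A holds L (unit diagonal implied)
    and the upper part holds U."""
    m = len(A)
    if w is None:
        w = len(A[0])
    if w == 1:
        # base case: L1 := A1 / a11 and U1 := a11
        for i in range(j0 + 1, m):
            A[i][j0] /= A[j0][j0]
        return
    h = w // 2
    rgausslu(A, j0, h)                       # (L1, U1) = rgausslu of A1
    # L11 U12 = A12  -> RTRSM (forward substitution on each column of A12)
    for j in range(j0 + h, j0 + w):
        for i in range(j0, j0 + h):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(j0, i))
    # A22 := A22 - L21 U12  -> _GEMM
    for i in range(j0 + h, m):
        for j in range(j0 + h, j0 + w):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(j0, j0 + h))
    rgausslu(A, j0 + h, w - h)               # (L22, U22) = rgausslu of Â22
```

For example, factoring A = [[2,1,1],[4,3,3],[8,7,9]] leaves L = [[1,0,0],[2,1,0],[4,3,1]] and U = [[2,1,1],[0,1,1],[0,0,2]] packed in A.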
Recursive Algorithm 3.2. Recursive LU = PA factorization with partial pivoting (rgausslu):
Do recursion
• if min(m,n) > 1 then
• (P1, L1, U1) = rgausslu of A1
• Forward pivot A2 by P1 → _LASWP
• L11 U12 = A12 → RTRSM
• Â22 := A22 − L21 U12 → _GEMM
• (P2, L22, U22) = rgausslu of Â22
• Back pivot A1 by P2 → _LASWP
• P = P2 P1
• otherwise
• pivot A1
• L1 := A1/a11 and U1 := a11
End recursion
• if n > m then
• Forward pivot A3 by P → _LASWP
• L U3 = A3 → RTRSM
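A sketch of Algorithm 3.2 in the same pure-Python style follows (a hypothetical toy, not the paper's code). Swapping whole rows of A merges the "forward pivot A2" and "back pivot A1" _LASWP steps into the row exchanges themselves; the returned pivot list records which original row ended up in each position, so that the product LU reconstructs PA.

```python
def rgausslu_piv(A, j0=0, w=None, piv=None):
    """Recursive LU with partial pivoting of the panel A[:, j0:j0+w], in place.
    Rows are swapped across the full width of A, so the forward pivoting of
    the right-hand blocks and the back pivoting of L happen automatically.
    Returns piv, where current row i was original row piv[i]."""
    m = len(A)
    if w is None:
        w = len(A[0])
    if piv is None:
        piv = list(range(m))
    if w == 1:
        # pivot A1: bring the largest |entry| on/below the diagonal to the top
        p = max(range(j0, m), key=lambda i: abs(A[i][j0]))
        A[j0], A[p] = A[p], A[j0]
        piv[j0], piv[p] = piv[p], piv[j0]
        for i in range(j0 + 1, m):           # L1 := A1 / a11
            A[i][j0] /= A[j0][j0]
        return piv
    h = w // 2
    rgausslu_piv(A, j0, h, piv)              # (P1, L1, U1) = rgausslu of A1
    # L11 U12 = A12  -> RTRSM
    for j in range(j0 + h, j0 + w):
        for i in range(j0, j0 + h):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(j0, i))
    # A22 := A22 - L21 U12  -> _GEMM
    for i in range(j0 + h, m):
        for j in range(j0 + h, j0 + w):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(j0, j0 + h))
    rgausslu_piv(A, j0 + h, w - h, piv)      # (P2, L22, U22) = rgausslu of Â22
    return piv
```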
4. Recursive BLAS and Parallelism
Two recursive BLAS (Basic Linear Algebra Subprograms, see [20])
subroutines are used in our recursive Cholesky and recursive LU algo-
rithms: RTRSM and RSYRK. These two routines will be explained below.
4.1 RTRSM
RTRSM is a recursive formulation of _TRSM, where _ is a precision and arithmetic indicator: S, D, C or Z. The _TRSM subroutine solves one of the matrix equations
AX = αB, A^T X = αB, XA = αB, or XA^T = αB,
where α is a scalar, X and B are (m × n) rectangular matrices, and A is a unit or non-unit, upper or lower (m × m) triangular matrix. The matrix X is overwritten on B. We have 16 different triangular equations because A and A^T can each be upper or lower triangular matrices, and the diagonal is normal or unit.
We will introduce the recursive formulation for only one case, AX = αB, where A is lower triangular. The other cases are similar.
The matrices A, B, and X can be partitioned into smaller submatrices, thus
$$\begin{pmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{pmatrix}\begin{pmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{pmatrix} = \alpha \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}.$$
The matrices A11 = A(1:h, 1:h), A21 = A(h+1:m, 1:h), A22 = A(h+1:m, h+1:m), B11 = B(1:h, 1:p), B12 = B(1:h, p+1:n), B21 = B(h+1:m, 1:p), B22 = B(h+1:m, p+1:n), X11 = X(1:h, 1:p), X12 = X(1:h, p+1:n), X21 = X(h+1:m, 1:p) and X22 = X(h+1:m, p+1:n) are submatrices of A, B and X respectively, and h = m/2 and p = n/2. Multiplying the matrix A by X gives:
$$\begin{pmatrix} A_{11}X_{11} & A_{11}X_{12} \\ A_{21}X_{11} + A_{22}X_{21} & A_{21}X_{12} + A_{22}X_{22} \end{pmatrix} = \alpha \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}.$$
We obtain two independent groups of triangular systems:
$$A_{11}X_{11} = \alpha B_{11}, \qquad A_{22}X_{21} = \alpha B_{21} - A_{21}X_{11},$$
$$A_{11}X_{12} = \alpha B_{12}, \qquad A_{22}X_{22} = \alpha B_{22} - A_{21}X_{12}.$$
We could do a double recursion on m and n, i.e., on h and p. However, we do not recurse on p. This results in the following algorithm:
Recursive Algorithm 4.1. Recursive algorithm for the AX = αB operation (one group only), where A is a lower triangular matrix (rtrsm):
Do recursion
• if m > 1 then
• A11 X1 = αB1 → RTRSM
• B2 := αB2 − A21 X1 → _GEMM
• A22 X2 = B2 → RTRSM
• otherwise
• a11 X1 = αB1
End recursion
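Algorithm 4.1 can be sketched in pure Python as follows (a hypothetical toy on list-of-lists matrices, not the library routine). Note that the GEMM-style update already folds α into B2, so the second recursive solve is called with α = 1.

```python
def rtrsm(A, B, alpha=1.0, i0=0, m=None):
    """Recursive in-place solve of A X = alpha*B for X, where A is lower
    triangular with a non-unit diagonal; X overwrites B, as in _TRSM.
    Recursion is on m only, not on the number of right-hand sides n."""
    if m is None:
        m = len(A)
    n = len(B[0])
    if m == 1:
        # base case: a11 X1 = alpha * B1
        for j in range(n):
            B[i0][j] = alpha * B[i0][j] / A[i0][i0]
        return
    h = m // 2
    rtrsm(A, B, alpha, i0, h)                # A11 X1 = alpha*B1 -> RTRSM
    # B2 := alpha*B2 - A21 X1  -> _GEMM
    for i in range(i0 + h, i0 + m):
        for j in range(n):
            B[i][j] = alpha * B[i][j] - sum(A[i][k] * B[k][j]
                                            for k in range(i0, i0 + h))
    rtrsm(A, B, 1.0, i0 + h, m - h)          # A22 X2 = B2 -> RTRSM (alpha done)
```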
4.2 RSYRK
RSYRK is a recursive formulation of _SYRK, where _ is a precision and arithmetic indicator: S, D, C or Z. _SYRK performs one of the symmetric rank-k operations:
C := αAA^T + βC or C := αA^T A + βC,
where α and β are scalars, A is a rectangular (m × n) matrix, and C is a square symmetric matrix.
We will introduce the recursive formulation for only one of the four cases of _SYRK:
C := αAA^T + βC.
The matrices A and C can be partitioned into smaller submatrices:
$$\begin{pmatrix} C_{11} & \\ C_{21} & C_{22} \end{pmatrix} = \beta \begin{pmatrix} C_{11} & \\ C_{21} & C_{22} \end{pmatrix} + \alpha \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{T}.$$
The matrices A11 = A(1:h, 1:p), A12 = A(1:h, p+1:n), A21 = A(h+1:m, 1:p), A22 = A(h+1:m, p+1:n), C11 = C(1:h, 1:h), C21 = C(h+1:m, 1:h), C22 = C(h+1:m, h+1:m) are submatrices of A and C respectively, and h = m/2 and p = n/2. The recursion could be done again on two variables, h and p, but we do recursion on h only.
In terms of the partitioning we have three independent formulas:
$$C_{11} = \beta C_{11} + \alpha A_{11}A_{11}^{T} + \alpha A_{12}A_{12}^{T},$$
$$C_{21} = \beta C_{21} + \alpha A_{21}A_{11}^{T} + \alpha A_{22}A_{12}^{T},$$
$$C_{22} = \beta C_{22} + \alpha A_{21}A_{21}^{T} + \alpha A_{22}A_{22}^{T}.$$
These three computations can be done in parallel. We now formulate a recursive algorithm as follows:
Recursive Algorithm 4.2. Recursive algorithm for the C := αAA^T + βC symmetric rank-k operation (rsyrk):
Do recursion
• if m > 1 then
Perform computation C11:
• Ĉ11 := βC11 + αA11 A11^T → RSYRK
• C11 := Ĉ11 + αA12 A12^T → RSYRK
Perform computation C21:
• Ĉ21 := βC21 + αA21 A11^T → _GEMM
• C21 := Ĉ21 + αA22 A12^T → _GEMM
Perform computation C22:
• Ĉ22 := βC22 + αA21 A21^T → RSYRK
• C22 := Ĉ22 + αA22 A22^T → RSYRK
End recursion
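A pure-Python sketch of Algorithm 4.2 follows (hypothetical names, list-of-lists storage). It updates only the lower triangle of C and recurses on m only; for brevity the two column-block terms of each update (the Ĉ step followed by the A·2 step) are folded into a single full-length inner product, which computes the same sum.

```python
def rsyrk(C, A, alpha=1.0, beta=1.0, i0=0, m=None):
    """Recursive in-place lower triangle of C := beta*C + alpha*A*A^T.
    C11 and C22 recurse (RSYRK); the off-diagonal C21 block is a plain
    matrix-matrix product (_GEMM)."""
    if m is None:
        m = len(A)
    n = len(A[0])
    if m == 1:
        # scalar base case: c := beta*c + alpha * (row of A) . (row of A)
        C[i0][i0] = beta * C[i0][i0] + alpha * sum(a * a for a in A[i0])
        return
    h = m // 2
    rsyrk(C, A, alpha, beta, i0, h)          # C11 block -> RSYRK
    # C21 := beta*C21 + alpha * A2 A1^T  -> _GEMM
    for i in range(i0 + h, i0 + m):
        for j in range(i0, i0 + h):
            C[i][j] = beta * C[i][j] + alpha * sum(A[i][k] * A[j][k]
                                                   for k in range(n))
    rsyrk(C, A, alpha, beta, i0 + h, m - h)  # C22 block -> RSYRK
```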
4.3. A fast _GEMM algorithm
The _ of _GEMM is a precision and arithmetic indicator: S, D, C or Z. The _GEMM subroutine performs the following operations:
C := αAB + βC, C := αAB^T + βC, C := αA^T B + βC, C := αA^T B^T + βC,
or C := αAB^C + βC, C := αA^C B + βC, and C := αA^C B^C + βC,
where α and β are scalars, A, B and C are rectangular matrices, and A^T, B^T, A^C and B^C are transposed and conjugated matrices respectively.
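To fix the semantics, here is a naive sketch of the plain C := αAB + βC case on list-of-lists matrices (no transpose or conjugate variants). This only defines the operation; real _GEMM implementations are blocked and tuned for the memory hierarchy, which is the whole point of the discussion below.

```python
def gemm(alpha, A, B, beta, C):
    """Naive triple loop for C := alpha*A*B + beta*C (no transposes).
    A is (m x k), B is (k x n), C is (m x n); C is updated in place."""
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            C[i][j] = beta * C[i][j] + alpha * sum(A[i][p] * B[p][j]
                                                   for p in range(k))
    return C
```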
The GEMM operation is very well documented and explained in [9, 10, 20]. As Sections 4.1 and 4.2 show, most of the work in our recursive BLAS routines RTRSM and RSYRK is done by _GEMM, so the speed of our computation depends very much on the speed of a good _GEMM. Good _GEMM implementations are usually developed by computer manufacturers. The model implementation of _GEMM can be obtained from netlib [20]; it works correctly but slowly. However, an excellent set of high-performance BLAS, called _GEMM-based BLAS, was developed by Bo Kågström at the University of Umeå in Sweden; see for example [14, 10]. A key idea behind the _GEMM-based BLAS was to cast all BLAS algorithms in terms of the simple BLAS _GEMM. Recently, the Innovative Computing Laboratory at the University of Tennessee in Knoxville developed a system called ATLAS which can produce a fast _GEMM program.
4.3.1 ATLAS
Automatically Tuned Linear Algebra Software (ATLAS) [19] is an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. The production of such software for machines ranging from desktop workstations to embedded processors can be a tedious and time-consuming task, and ATLAS has been designed to automate much of this process. Thus, having a fast _GEMM means our RTRSM and RSYRK routines will be fast. The ATLAS GEMM is often better than the _GEMM developed by the computer manufacturer. Importantly, the ATLAS software is available to everybody, free of charge; every personal computer can have a good GEMM.